Tuesday, October 11, 2016

Group Homework #2 - Webscraping with rvest

Class,

Please find uploaded on linkstreet a markdown document that uses the rvest package to scrape IMDB results.

Run the code line-by-line and check the results. The code is documented to some extent. However, if you're unclear about what a particular line is doing, web-search first, ask your peers, etc.

You should get a file containing URLs for the top 250 movies on IMDB. Once this is done, your real homework task begins.

Task:

1. Find all movies that were released between 1996 and 1998 (both years inclusive).

2. For each of these movies, read in its IMDB page from the URL and scrape the following information:

Director, stars, taglines, genres, (partial) storyline, box office budget and box office gross.

3. Make a dataframe with these variables as columns and the movie name as the first column.

4. Make a table of movie count versus genre.

4a. Bonus points: See if you can come up with some interesting hypotheses. For example, you could hypothesize that "Action movies occur significantly more often than Drama movies in the top-250 list." Or that "Action movies gross higher than Romance movies in the top-250 list." Etc.

5. Write a markdown doc with your code and explanation. See if you can storify your hypotheses.
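
To make steps 1 and 4 concrete, here is a minimal sketch in Python 3 of the year filter and the genre table. It is one possible approach, not the expected solution. The records, the field names (title, year, genres) and the helper functions filter_by_year and genre_counts are all hypothetical placeholders; your real rows would come from your own scrape.

```python
from collections import Counter

# Hypothetical placeholder records, NOT real scraped data.
# Field names (title, year, genres) are assumptions for illustration.
movies = [
    {"title": "Movie A", "year": 1995, "genres": ["Drama"]},
    {"title": "Movie B", "year": 1996, "genres": ["Action", "Drama"]},
    {"title": "Movie C", "year": 1998, "genres": ["Drama", "Romance"]},
    {"title": "Movie D", "year": 1999, "genres": ["Action"]},
]


def filter_by_year(records, first=1996, last=1998):
    """Keep records whose year lies in [first, last], both ends inclusive."""
    return [r for r in records if first <= r["year"] <= last]


def genre_counts(records):
    """Count how many movies list each genre (a movie may have several)."""
    counts = Counter()
    for r in records:
        # set() guards against a genre being listed twice for one movie.
        counts.update(set(r["genres"]))
    return counts


selected = filter_by_year(movies)
table = genre_counts(selected)

# Print the movie-count versus genre table, most frequent genre first.
for genre, n in table.most_common():
    print(f"{genre}\t{n}")
```

Note that a movie with several genres is counted once under each of them, so the counts can add up to more than the number of movies. Say so in your write-up if you tabulate genres this way.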

Update: You're not restricted to using only rvest in R. If you're more comfortable with another platform (say, Python 3), please feel free to use it to get the job done. But include a markdown document with the Python code you used as part of your submission.
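
If you do go the Python route, the sketch below shows one way to lay out the scraped variables as a table with the movie name as the first column (step 3) and serialise it as CSV text. It uses only the standard library. The column names, the sample record and the helper to_csv_text are assumptions made for illustration, not a required format.

```python
import csv
import io

# Assumed column order: movie name first, then the scraped variables.
COLUMNS = ["title", "director", "stars", "taglines", "genres",
           "storyline", "budget", "gross"]


def to_csv_text(records, columns=COLUMNS):
    """Return the records as CSV text, one row per movie."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {}
        for col in columns:
            value = rec.get(col, "")  # missing fields become empty cells
            if isinstance(value, (list, tuple)):
                # Multi-valued fields (stars, genres) share one cell.
                value = "; ".join(value)
            row[col] = value
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical placeholder record, NOT real scraped data.
sample = [{"title": "Movie B", "director": "Director X",
           "genres": ["Action", "Drama"]}]
csv_text = to_csv_text(sample)
print(csv_text)
```

To save the result, write csv_text to a .csv file with open(..., "w", encoding="utf-8"). Fields that were not found on a page are left blank here rather than guessed.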

Deliverables and Deadline:

1. Zip a folder containing

  • (a) your .Rmd file,
  • (b) your markdown as a web page,
  • (c) .R script files you used,
  • (d) the data you collected in .txt or .csv form,
  • (e) a notepad (.txt) file with your group name and member roll numbers written in it,
  • (f) Bonus: Include a link to RPubs if you have published your markdown to the web.

2. Deliver the zipped folder to the appropriate dropbox.

3. The deadline is midnight on 23-October-2016.

If you have any questions, contact me or use the comments section below.

Sudhir

4 comments:

  1. Dear Professor! Hope you are doing great!

    Thanks for sharing the assignment details above. How can we access the linkstreet? Apologies if I have missed any communications around this.

    1. Pls contact vivekananda_pochiraju@isb.edu asap. Are you the only one facing this problem? Ask your peers in the class for guidance, if not.

      Sudhir

  2. Thanks Prof! Got the reply from Vivekananda. It seems LMS is being used for linkstreet.
