Class,
Your group homework is coming up. I describe the task below; please pay careful attention to detail.
It's actually two tasks that are only loosely related. Past feedback says it's better to combine all homework into one assignment so that groups can plan and schedule their work better.
Task 1: Web scraping
Step 1: Go here and examine the page.
Step 2: Scrape the page and tabulate the output into a data frame with columns 'name, url, count, style'.
Step 3: Filter the data frame. Retain only those beers that got over 500 reviews. Let's call this Table 1.
Step 4: Now for each of the remaining beers, go to the beer's own webpage on the ratebeer site, and scrape the following information:
'Brewed by, Weighted Avg, Seasonal, Est.Calories, ABV, commercial description' from the top of the page.
Add these fields to Table 1 in that beer's row.
Step 5: Now build a separate Table for each beer in Table 1 from that beer's ratebeer webpage. Scrape the first 3 pages of reviews of that beer and in each review, scrape the following info:
'rating, aroma, appearance, taste, palate, overall, review (text), location (of the reviewer), date of the review.'
Store the output in a dataframe, let's call it Table 2.
Step 6: Create a list (let's call it List 1) with as many elements as there are rows in Table 1. For the i_th beer in Table 1, store that beer's Table 2 as the i_th element of List 1.
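To make the shape of Table 1 and List 1 concrete, here is a minimal sketch of Steps 2, 3 and 6 in R using the rvest package. It runs on a made-up HTML snippet, not the real page: the table id, the beer names, the URLs and the CSS selectors are all assumptions for illustration, and the real page will need selectors of its own. The empty review table only stands in for what you scrape in Step 5.

```r
# Sketch of Task 1, Steps 2, 3 and 6 on a hypothetical HTML snippet.
library(rvest)

# Made-up markup: table id, names, URLs and counts are invented.
html <- paste0(
  '<table id="beers">',
  '<tr><th>name</th><th>count</th><th>style</th></tr>',
  '<tr><td><a href="/beer/alpha/1/">Alpha Stout</a></td><td>1200</td><td>Stout</td></tr>',
  '<tr><td><a href="/beer/beta/2/">Beta Lager</a></td><td>340</td><td>Pale Lager</td></tr>',
  '<tr><td><a href="/beer/gamma/3/">Gamma IPA</a></td><td>815</td><td>IPA</td></tr>',
  '</table>'
)
page <- read_html(html)

# Step 2: pull each column out of the page and tabulate it.
links <- html_elements(page, "#beers td a")
name  <- html_text2(links)
url   <- html_attr(links, "href")
count <- as.integer(html_text2(html_elements(page, "#beers td:nth-child(2)")))
style <- html_text2(html_elements(page, "#beers td:nth-child(3)"))
table1 <- data.frame(name, url, count, style, stringsAsFactors = FALSE)

# Step 3: keep only beers with more than 500 reviews.
table1 <- table1[table1$count > 500, ]
rownames(table1) <- NULL

# Step 6: one list element per row of Table 1.
# empty_reviews is a placeholder for the Table 2 scraped in Step 5.
empty_reviews <- data.frame(
  rating = numeric(0), aroma = numeric(0), appearance = numeric(0),
  taste = numeric(0), palate = numeric(0), overall = numeric(0),
  review = character(0), location = character(0), date = character(0),
  stringsAsFactors = FALSE
)
list1 <- lapply(seq_len(nrow(table1)), function(i) empty_reviews)
names(list1) <- table1$name
```

Naming the list elements by beer is optional, but it makes it easier to check that the i_th element really belongs to the i_th row of Table 1.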
Task 2: DC from Twitter API
Step 1: Read up on how to use the Twitter API here. If required, create a Twitter account (if you don't already have one).
R has a dedicated package, twitteR (note the capital R at the end). For additional functions, refer to twitteR.pdf (the twitteR package manual).
Step 2: Recall three evaluation dimensions for beer at ratebeer.com, viz. aroma, taste and palate. More than the dictionary meanings of these words, it's how they're used in context that's interesting.
So, pull 50 tweets for each of these three terms.
Step 3: Read through these tweets and note what product categories they typically refer to. Insights here could, for instance, be useful in designing promotional campaigns for the beers. We'll do text analysis etc. next visit.
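For Step 2, one possible shape for the code is sketched below. It assumes the twitteR package is installed and that your own API credentials still work against the Twitter API; pull_tweets is a hypothetical helper name (not part of twitteR), and the four key names in the usage comments are placeholders for your own values. The calls used (setup_twitter_oauth, searchTwitter, twListToDF) are described in the twitteR manual.

```r
# Sketch only: needs the twitteR package and valid API credentials.
# pull_tweets() is a hypothetical helper, not a twitteR function.
pull_tweets <- function(terms = c("aroma", "taste", "palate"), n = 50) {
  per_term <- lapply(terms, function(term) {
    tweets <- twitteR::searchTwitter(term, n = n)  # up to n tweets for this term
    df <- twitteR::twListToDF(tweets)              # list of tweets -> data frame
    df$term <- rep(term, nrow(df))                 # remember which term matched
    df
  })
  do.call(rbind, per_term)                         # stack the three result sets
}

# Usage, after authenticating with your own keys (placeholders shown):
# twitteR::setup_twitter_oauth(consumer_key, consumer_secret,
#                              access_token, access_secret)
# tweets_df <- pull_tweets()
# write.csv(tweets_df, "tweets.csv", row.names = FALSE)
```

Keeping the search term as a column means the .csv still tells you which of the three terms each tweet was pulled for when you read through them in Step 3.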
Deliverables: There are 4 deliverables per group - [1] R markdown of your (well-documented) code and a storified narrative of your thinking behind the code.
[2] Table 1 as a .csv file. List 1 as an .RData file.
[3] List of tweets you scraped in a .csv file.
[4] List of brands and product categories in whose context those terms were used.
Zip these deliverables and submit them to the drop box before the deadline.
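If the file formats in [2] are unfamiliar, the following base R sketch shows one way to write them. The objects are toy stand-ins with invented values, and the file names and the use of a temporary folder are assumptions; swap in your real Table 1, List 1 and project folder.

```r
# Toy stand-ins for the real objects (hypothetical values).
table1 <- data.frame(
  name  = c("Alpha Stout", "Gamma IPA"),
  count = c(1200L, 815L),
  stringsAsFactors = FALSE
)
list1 <- list(data.frame(rating = 4.1), data.frame(rating = 3.8))

out_dir    <- tempdir()  # replace with your project folder
csv_path   <- file.path(out_dir, "Table1.csv")
rdata_path <- file.path(out_dir, "List1.RData")

write.csv(table1, csv_path, row.names = FALSE)  # Table 1 as .csv
save(list1, file = rdata_path)                  # List 1 as .RData

# Quick round-trip check that the files read back as expected.
check_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
check_env <- new.env()
load(rdata_path, envir = check_env)
```

Loading the .RData file into a separate environment, as in the last two lines, avoids overwriting an object of the same name in your workspace while you check it.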
Deadline: Midnight before your Visit 2 starts.
Any Qs etc, ask Sudha or Aashish in the Tutorial.
Sudhir