Saturday, March 12, 2016

Final Group Homework - Scraping Search Results

Hi Class,

Your final group homework is here. It involves scraping and sentiment-analyzing Google search results, the way I once demonstrated in class.

The idea is to scrape the top few pages of Google search results for some of the world's most valuable brands (both B2B and B2C)...

... and thereafter assess the relative sentiment gaps between what the firm or brand says about itself and what the rest of the world says about the brand.

Task:

1. We have assigned 8 brands to each group. Check the Excel sheet on linkstreet to see which brands have been assigned to your group.

2. Carefully go through the explanation in the Word and Markdown files for how to run the Python script for scraping the data, and the R code for sentiment analysis.

3. For each brand in your list, run the data collection Python script and collect data on the first 5 pages' worth of search results.

4. Use the Excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or "non-firm-sponsored" (if the contents of a URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".

5. Build a corpus of the content from firm-sponsored links and sentiment-analyze the corpus. Record sentiment polarities, top positive and negative terms, etc. corresponding to each link in the Excel sheet template given.

6. Do the same for non-firm-sponsored links as well.

7. Speculate on how wide the gap is and why such a gap might have arisen in the first place. For instance, are links on page 1 more positive than those on page 2? Are news articles more positive (negative) than blogs and social media posts? Etc.

8. The deliverables include:

  • (A) The Excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand).
  • (B) The firm-sponsored corpus and the non-firm-sponsored corpus, uploaded to the dropbox set up for that purpose (zip the 2 corpora as text files and name the zip file after your group when submitting).
  • (C) A markdown file, either on RPubs or as a webpage, that very briefly (in a few paragraphs per brand) shows the story you have been able to uncover for the brands assigned to your group.

9. Deadline: Submit by the midnight before the DC exam, whenever that is.
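
To make steps 4 to 6 a little more concrete, here is a minimal Python sketch of the labeling and the gap calculation. It is not the class scraping script or the R sentiment code. The function names, the hand-supplied list of brand domains, and the tiny word lists are all assumptions made up for illustration, and the polarity formula (positive minus negative hits, divided by total hits) is just one simple choice. Treat the labels as a first pass only: every URL still needs a manual check, and "Irrelevant" has to be marked by hand.

```python
import re
from urllib.parse import urlparse

# Toy word lists, assumed for illustration only; not the lexicon used in class.
POSITIVE = {"good", "great", "innovative", "trusted"}
NEGATIVE = {"bad", "scandal", "recall", "lawsuit"}


def label_url(url, brand_domains):
    """First-pass label: is the host one of the brand's own domains?"""
    host = (urlparse(url).hostname or "").lower()
    for domain in brand_domains:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return "firm-sponsored"
    return "non-firm-sponsored"


def polarity(text):
    """(positive hits - negative hits) / total hits, or 0.0 if no hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def sentiment_gap(firm_texts, other_texts):
    """Mean polarity of the firm corpus minus that of the non-firm corpus."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    firm = mean([polarity(t) for t in firm_texts])
    other = mean([polarity(t) for t in other_texts])
    return firm - other
```

A positive gap here would mean the firm's own pages read more favorably than what others write about the brand, which is the kind of pattern step 7 asks you to explain.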

For any queries, contact me via the blog comments section or write to Aashish directly.

P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.
