Analytics Yogi: March 2016

Saturday, March 12, 2016

Final Group Homework - Scraping Search Results

Hi Class,

Your final group homework is here. It involves scraping and sentiment-analyzing Google search results, the way I once demonstrated in class.

The idea is to scrape the top few pages of Google search results for some of the world's most valuable brands (both B2B and B2C)...

... and thereafter assess the relative sentiment gaps between what the firm or brand says about itself and what the rest of the world says about the brand.

Task:

1. We have assigned 8 brands to each group. Check the Excel sheet on linkstreet to see which brands have been assigned to your group.

2. Carefully go through the explanation in the Word and markdown files for how to run the Python script for scraping the data, and the R code for sentiment analysis.

3. For each brand in your list, run the data collection Python script and collect data on the first 5 pages' worth of search results.

4. Use the Excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or as "non-firm-sponsored" (if the contents of a URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".

5. Build a corpus of the content from firm-sponsored links and sentiment-analyze the corpus. Record sentiment polarities, top positive and negative terms, etc. corresponding to each link in the Excel sheet template given.

6. Do the same for non-firm-sponsored links as well.

7. Speculate on how wide the gap is and why such a gap might have arisen in the first place. For instance, are links in page 1 more positive than those in page 2? Are news articles more positive (negative) than blogs and social media posts? Etc.

8. The deliverables include:

(A) The Excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand).
(B) The firm-sponsored corpus and the non-firm-sponsored corpus, uploaded to the dropbox set up for that purpose (zip the 2 corpora as text files and name the zip file after your group when submitting).
(C) A markdown file either on RPubs or as a webpage that very briefly (in a few paragraphs per brand) shows the story you have been able to uncover for the brands assigned to your group.

9. Deadline: Submit by the midnight before the DC exam, whenever that is.
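
For intuition, here is a minimal Python sketch of the "sentiment gap" idea behind steps 5 to 7. It is not the course's R code. The tiny word lists, the function names and the sample sentences are all made up for illustration, and the polarity formula (positive minus negative hits, divided by total hits) is only one simple assumption about how polarity could be scored.

```python
import re

# Hypothetical mini-lexicons, for illustration only. A real analysis would
# use a much larger sentiment dictionary.
POSITIVE = {"great", "innovative", "love", "trusted", "best"}
NEGATIVE = {"lawsuit", "recall", "poor", "scandal", "worst"}


def polarity(text):
    """Return (pos - neg) / (pos + neg) for one document, or 0.0 if no hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def mean_polarity(docs):
    """Average the document-level polarities across a corpus."""
    return sum(polarity(d) for d in docs) / len(docs) if docs else 0.0


def sentiment_gap(firm_docs, other_docs):
    """Firm-sponsored mean polarity minus non-firm-sponsored mean polarity."""
    return mean_polarity(firm_docs) - mean_polarity(other_docs)


# Made-up example texts standing in for the two corpora.
firm_docs = ["The best and most trusted brand",
             "Innovative products customers love"]
other_docs = ["Product recall follows lawsuit",
              "Great phone but poor battery"]

gap = sentiment_gap(firm_docs, other_docs)
print(f"Sentiment gap: {gap:.2f}")
```

A positive gap here would mean the firm-sponsored pages read more favourably than the rest. The same per-document polarities could also be averaged by search page or by source type to explore the questions in step 7.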

Any queries etc, contact me via the blog comments section or write to Aashish directly.

P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.

Friday, March 4, 2016

Group Homework Session 4 - Topic analysis

Hi Class,

Your second-last group homework is coming up. This one concerns topic-mining, analysis and interpretation.

I hope you have completed the individual homework for session 4. If not, I urge you to do so before attempting this one.

After this, there is one last group homework remaining - that for session 5 - based on a Python script that scrapes Google search results.

To ease things a bit, I've already scraped and provided the data for download in the linkstreet assignment folder.

Further, we've also made available the Term Document Matrices (TDMs) that you'll need for the analysis, since constructing them might have taken some time.

To make things even easier, Pandeyji has put up on linkstreet a markdown workflow in PDF form, with code + sample output, that walks you through what needs to be done to get the results. Later, I'll ask you to interpret the results.

Please go through the workflow, line by line. Try the code on the datasets given.

Task:
Find in linkstreet the text corpus pertaining to the 2013 10-K SEC filings by Tech Sector firms from the Fortune 1000. Read it into R.
There are two sections of interest in the 10-K form:
One is Item 1 - Business description (wherein firms describe their business and key products),
and the second is Item 1A - Risk Factors (wherein firms reveal exposure to anticipated risks in the coming year).
Section A people (or groups with majority section A people) will analyze Item 1 - Business description.
Section B people (or groups with majority section B people) will analyze Item 1A - Risk Factors.
In each case, run the topic mining algorithm on the corpus. The output will be similar to what you saw in the Samsung data survey - word-clouds and co-occurrence graphs.
Interpret the topic. Label it in < 6 words and describe your interpretation in a few lines.
Bonus question: Any potential applications of such analysis you can think of? For instance, ask yourself which firms might find the results of your analysis interesting, and why.
Remember, I'm not looking for just some results put up in a file. I'm looking for interconnections, deeper meaning, story ...

Challenge:

The assignment may not be as straightforward as first seen. There are multiple combinations of analysis possibilities.

For instance: (a) Should you use TF or TFIDF? I suggest you try both and go with whatever yields cleaner results.
(b) Should you use K = 2 topics? 3? 4? I suggest you try between 2 and 5 topics and see which yields the 'best' or most interpretable and clear story.
(c) Debate, discuss, tweak the code, use trial and error, etc. Massive learning opportunity if you are willing to go the distance.
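
To see what the TF versus TFIDF choice in (a) changes, here is a small Python sketch. It is an illustration under stated assumptions, not the course's R workflow: the three toy documents and the function names are made up, and the weighting used (raw count times log(N / document frequency)) is just one common TF-IDF variant, so the R code you run may normalize differently.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Raw term counts (TF) for each document, using whitespace tokens."""
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf(docs):
    """TF-IDF weights using idf = log(N / df), one common variant."""
    tfs = term_frequencies(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in tfs:
        doc_freq.update(tf.keys())  # count each term once per document
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in tf.items()}
        for tf in tfs
    ]


# Made-up documents standing in for 10-K text.
docs = [
    "risk of competition and risk of regulation",
    "cloud software and services",
    "hardware and cloud services",
]

tf = term_frequencies(docs)
weights = tfidf(docs)

print(tf[0]["and"], weights[0]["and"])    # a word found in every document
print(tf[0]["risk"], weights[0]["risk"])  # a word specific to one document
```

Under TF, a word that appears everywhere keeps its full count. Under this TF-IDF variant its weight drops to zero, while words specific to a few documents stand out. That is one reason the two weightings can produce quite different topics.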

Submission Details:

A workflow with code, explaining the story behind your final results, can be submitted to linkstreet as an HTML file. Bonus points if you are able to publish it to RPubs and send us the link.

Deadline is 20-March (Sunday) Midnight. Any queries etc, let me know.

Sudhir

Thursday, March 3, 2016

Individual Homework - Session 4 (Topic Mining)

Class,

Hi again. Your next individual homework (yup, another survey) is coming up.

Remember, this is *individual* homework; consulting your peers is not allowed for this one.

I hope you have replicated the classwork Nokia example in topic-mining at home. It would give you some idea of what is coming. At the least, carefully go over the relevant slides from session 4.

This is what I wrote in the opening statement of your survey:

Recall the Nokia Lumia example from class? We topic-mined consumer reviews of Nokia Lumia 925 and identified two latent topics underlying the text. We labeled the topics and interpreted what they likely meant.
We used the word-clouds and co-occurrence graphs as visual aids for topic identification. Recall that font-size in a word-cloud is proportional to word frequency in the text. In a co-occurrence graph, words that likely occur together within a document are connected in a network of the words. Aside from that, we also used the top few documents loading onto each topic to help in topic interpretation.
The following survey task asks you to repeat the Nokia type exercise on a new mobile brand - Samsung Galaxy.
Below, you will be presented with the topic-mining results for Samsung Galaxy reviews (collected around the same time as Nokia's). Two topics were found to be optimal. You'll be shown word-clouds and co-occurrence graphs for each of the topics. Further, you'll also be shown the top 5 of the documents that loaded on each topic (as a separate file on linkstreet).
Your task is (i) to interpret these topics using the output, (ii) to meaningfully label these two topics with a short, informative name and (iii) provide a brief description of your interpretation.
Remember, there is no right or wrong answer, just your carefully considered responses.
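
If it helps to recall what sits behind the two visual aids mentioned in the survey text, here is a tiny Python sketch. The three reviews and the function name are made up for illustration, and it is not the code used to produce the survey output. It counts word frequencies (what drives font size in a word-cloud) and counts how many documents each pair of words shares (what drives the links in a co-occurrence graph).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each word pair, the number of documents containing both."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))  # unique words, fixed order
        counts.update(combinations(terms, 2))
    return counts


# Made-up reviews, for illustration only.
reviews = [
    "great camera great screen",
    "camera battery poor",
    "screen camera sharp",
]

# Word frequencies across the whole text: the word-cloud input.
freq = Counter(word for doc in reviews for word in doc.lower().split())

# Pairwise document co-occurrence: the co-occurrence graph input.
pairs = cooccurrence_counts(reviews)

print(freq.most_common(3))
print(pairs.most_common(3))
```

Pairs with higher counts would be drawn as connected words in the graph. How the actual survey graphs choose which links to keep is not shown here.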

This is the link to your survey.

I have put up on linkstreet a text document that carries the top 5 reviews loading onto Topics 1 and 2 respectively.

Deadline: 12-March (Saturday) midnight. Please remember: Timely and complete submission carries course credit.

Any queries etc, let me or Aashish know.

Sudhir