Monday, March 9, 2015

Session 4 based Group Homework for Batch 4

Class,

This group homework is based on session 4 - text analytics.

I did toy with the idea of adding a latent topic modeling and interpretation component to this homework, but decided against it as it isn't strictly in D.C.'s domain.

The code required to do this HW will be up soon on LMS.

Group HW:

1. Pick any well-known brand (product or service), e.g. Xbox 360, Jabong, iPhone 6 or Nike.

2. Collect 3 sets of data for it:

  • (a) 100+ consumer reviews from either Flipkart or Amazon India
  • (b) 500+ tweets
  • (c) 50+ articles from Google News or any other news aggregator site.

3. Feel free to use either R or any other means you know of to collect the data (e.g. Python, a Chrome scraper etc.), but clearly mention the data collection tool used.

4. For each set of data, perform the following analyses:

  • (a) General wordcloud using both TF and TFIDF weighting schemes. Update the stopword list to filter out noisy or irrelevant terms.
  • (b) Sentiment analysis. Display wordclouds separately for the top 50 most positive and most negative words.
  • (c) Identify the top few most positive and most negative documents. Read them and speculate on why they are so positive or negative about the brand.
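
For those working in Python rather than R, the three analyses above can be sketched roughly as follows. This is only an illustrative sketch, not the course code that will go up on LMS: the stopword list, the tiny sentiment lexicon, and all function names are assumptions made up for this example, and the TF-IDF variant used here (term count times log(N/df), summed over documents) is just one of several common definitions.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; extend it with noisy terms from your own data.
STOPWORDS = {"the", "a", "is", "it", "and", "this", "i", "to", "of"}

# Tiny illustrative sentiment lexicon (an assumption, not a course-supplied list).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def tf_weights(docs):
    """TF scheme: total count of each term across the whole corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def tfidf_weights(docs):
    """TF-IDF scheme: sum over documents of tf * log(N / df) per term."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = Counter()
    for tokens in tokenized:
        for term, tf in Counter(tokens).items():
            weights[term] += tf * math.log(n_docs / df[term])
    return weights


def sentiment_score(text):
    """Count of positive lexicon words minus count of negative ones."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def rank_documents(docs):
    """Documents sorted from most positive to most negative."""
    return sorted(docs, key=sentiment_score, reverse=True)
```

The weight dictionaries returned by tf_weights and tfidf_weights can be fed to whichever wordcloud tool you use. Note that a term appearing in every document gets a TF-IDF weight of zero under this definition, which is why the brand name itself tends to dominate the TF wordcloud but fade from the TFIDF one.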

5. Session 4 HW submission format:

  • Use a plain white blank PPT.
  • On the title slide, write your group name and the names and ISB student IDs of all group members.
  • Give your homework an informative title (include name of the product/brand you chose).
  • Have 3 sections in your PPT, one per data source, separated by separator slides.
  • On each separator slide, mention the source of the data. E.g., "Data source: Amazon consumer reviews" or "Data source: Twitter" and so on.
  • For slide headers, use format "TF Wordcloud" or "Positive wordcloud" and so on.
  • Save the slide deck as session4HW_yourgroup.ppt.
  • Put all the raw data you collected, the code you used and your PPT in a zip folder (so that I can replicate your analysis if the need arises). Save the folder as session4HW_yourgroup.zip and upload it to the dropbox on LMS before the deadline.

Any Qs etc., let Atreyee or me know. Feel free to use the comments section of this post for any Q&A or discussions.

There are two more homeworks coming your way, both individual, and only one of them is survey-based.

Sudhir

6 comments:

  1. Do we need to take information only from Flipkart or Amazon, or can we take it from complaint board, travel advisor etc.? For services like telecom, banks etc. we will not get info from Flipkart or Amazon. Please suggest.

    1. Oh, you aren't restricted to Flipkart or Amazon. Please feel free to expand your search to other domains like TripAdvisor etc.

      Sudhir

  2. Your instructions for extracting Twitter data using R are outdated. It throws up an error saying "ROAuth is no longer used in favor of httr." I found instructions for the httr authentication method here ( https://github.com/hadley/httr/blob/master/demo/oauth1-twitter.r ), but somehow both R and my browser crash every time I try to run this code.

    I will probably end up using Python for extracting Twitter data, but I'm still curious to know how to do it with R. Thanks in advance for any suggestions you can offer.

    -NM

    1. OK. Thanks for the pointer.

      Those who don't know Python can type in a hashtag and manually copy the tweets for this homework.

      Will look into this.

      Sudhir

  3. What is the due date for submission?

  4. Prof, let's say we have picked a product and fetched all its reviews from Flipkart, but not enough tweets or Google News articles are available for the same product. In that case, can we use some other topic or event for the remaining two?
