Friday, October 2, 2015

Session 4 homework - Text analytics and topic modeling

Class,

This group homework is based on session 4 - text analytics and topic-mining.

I hope you are comfortable replicating the classwork R code on LMS. There's also an instructional video Pandeyji put up on twitteR. I urge you to use the tutorial session with Aashish this Saturday to clarify any issues in this space.

Homework Instructions:

1. Pick any well-known brand, whether a product or a service. E.g. Xbox 360, Jabong, iPhone 6 or Nike.

2. Collect 3 sets of data for it:

(a) 100+ consumer reviews from either Flipkart or Amazon India
(b) 500+ tweets
(c) 50+ articles from Google News or any other news aggregator site

3. Feel free to use R or any other means you know of to collect the data (e.g. Python, a Chrome scraper, etc.), but clearly mention the data collection tool used.

4. For each set of data, perform the following analyses:

(a) General wordcloud using both TF and TFIDF weighting schemes. Update the stopwords list to filter out noisy or irrelevant terms.
(b) Sentiment analysis. Display wordclouds separately for the top 50 most positive and most negative words.
(c) Identify the top few most positive and most negative documents. Read them and speculate on why they are so positive or negative about the brand.
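
For those working outside R, the analyses in step 4 can be sketched in plain Python as below. This is only a minimal illustration, not the classwork code: the three-document corpus, the stopword list and the positive/negative word lists are all made up for the example, and the TFIDF formula used (corpus term count times log(N/df)) is just one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: in the homework, each document would be
# one review, one tweet or one news article.
docs = [
    "The phone is great and the camera is great",
    "The battery is bad and the phone heats up",
    "Great camera but the price is bad",
]

# Illustrative stopword list: extend it with the noisy terms you spot
# in your own wordclouds.
stopwords = {"the", "is", "and", "but", "up"}

# Tiny made-up sentiment lexicons (assumptions, not a standard word list).
positive = {"great"}
negative = {"bad", "heats"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]


tokens = [tokenize(d) for d in docs]

# TF weighting: raw term counts summed over the whole corpus.
tf = Counter(w for doc in tokens for w in doc)

# Document frequency: number of documents that contain each term.
df = Counter(w for doc in tokens for w in set(doc))

# TFIDF weighting (one common variant): term count scaled by log(N / df),
# so terms that appear in every document are pushed towards zero.
n_docs = len(docs)
tfidf = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

# Polarity score per document: positive hits minus negative hits.
scores = [
    sum(w in positive for w in doc) - sum(w in negative for w in doc)
    for doc in tokens
]

# Rank documents from most positive to most negative for step 4(c).
ranked = sorted(zip(scores, docs), reverse=True)
```

The `tf` and `tfidf` dictionaries map each term to a weight, which is the input a wordcloud routine typically needs, and `ranked` lists the documents to read first for part (c).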

5. Latent topic mining: Topic mine a corpus from any one dataset. Use no more than 2 or 3 topics. Make wordclouds of each topic's tokens. Interpret in a few lines what the topics are saying.
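
To build intuition for what topic mining does, here is a small pure-Python sketch. It assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, which this post does not prescribe, so the classwork R code may well use a different model or package. The function name `lda_gibbs`, the hyperparameter defaults and the toy corpus are all hypothetical.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on tokenized documents.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # total tokens in topic k
    assign = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                # Resample the topic and add the token back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


# Hypothetical tokenized corpus with two rough themes.
docs = [
    ["camera", "photo", "lens", "camera", "photo"],
    ["battery", "charge", "power", "battery", "charge"],
    ["lens", "photo", "camera", "lens"],
    ["power", "charge", "battery", "power"],
]

topic_word, doc_topic = lda_gibbs(docs, n_topics=2)

# Top tokens per topic: these counts are what each topic's wordcloud would show.
top_terms = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
```

The sampler is random, so the topics it finds depend on the seed and on the corpus. Always read the top tokens of each topic before interpreting them.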

6. Session 4 HW submission format:

i. Use a plain white blank PPT.

ii. On the title slide, write your group name and the names + ISB student IDs of all group members.

iii. Give your homework an informative title (include name of the product/brand you chose).

iv. Have 3 sections in your PPT, one per data source, separated by separator slides.

v. On each separator slide, mention the source of the data. E.g., "Data source: Amazon consumer reviews" or "Data source: Twitter" and so on.

vi. For slide headers, use format "TF Wordcloud" or "Positive wordcloud" and so on.

vii. Save the slide deck as session4HW_yourgroup.ppt.

viii. Put all the raw data you collected, the code you used and your PPT in a zip folder (so that I can replicate your analysis if the need arises). Save the folder as session4HW_yourgroup.zip and upload it in the dropbox on LMS before the deadline.

ix. If you have any questions, let Aashish or me know. Feel free to use the comments section of this post for any Q&A or discussions.

There's just one more [group] homework left: spatial DC using Google Maps, based on a business problem.

Deadline: Sunday 25-Oct Midnight.

Sudhir

3 comments:

  1. Dear Prof, I extracted tweets from Twitter as part of the assignment, but I wasn't able to find many reviews in the 500+ tweets. The wordclouds that get generated don't make much sense. Please let us know how to proceed.

    1. Hi Mohan,

      The tweets cannot ordinarily be analyzed. Sentiment analysis is perhaps the best we can do there. Figuring that out is part of the learning exercise in this homework.

      Sudhir
