Friday, November 18, 2016

Group Homeworks for TABA

Class,

Please find below your two group homework assignments. There's ample time to do them, and Aashish will discuss them in his tutorial this Sunday.

=======

Task 1 – Empirical Topic Modeling from First Principles:

Recall how we did factor analysis from first principles: simulating a small data matrix, factorizing it, and recovering the original (with errors)?

We'll repeat that simulation-analysis-recovery exercise, only this time with topic mining instead.

Follow the steps below and look up the sample/example code in the homeworks folder on LMS.

Step 1 – Choose 3 completely different subjects. E.g., I'd choose "Cricket", "Macroeconomics" and "Astronomy". Please choose any 3 other, very different subjects/fields.

Step 2 – For each selected subject, scrape the first 50 websites returned by a Google search and extract their text. Use the Python Google-search code. We'll thus have 3 different sets of 50 documents each, one per subject.
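Fetching the result URLs needs a search library (the `googlesearch` package is one option — treat that package name as an assumption, not a prescription). The text-extraction half of the step, though, can be sketched with only the Python standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page as one whitespace-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Sketch of the full scraping loop (network calls left commented out):
# from googlesearch import search          # assumed package name
# import urllib.request
# docs = []
# for url in search("Cricket", num_results=50):
#     with urllib.request.urlopen(url, timeout=10) as resp:
#         docs.append(html_to_text(resp.read().decode("utf-8", "ignore")))
```

This is only a sketch; real pages will need error handling (timeouts, non-HTML responses, encodings).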

Step 3 – Now create a simulated corpus of 50 documents, as follows: the first simulated document is a simple concatenation of the first document from subject 1, the first from subject 2 and the first from subject 3. Likewise for the other 49 documents.

Thus, our simulated corpus now has 'composite' documents, i.e. documents composed of 3 distinct subjects each.
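The concatenation step is a one-liner once the three document lists are in hand. A minimal sketch (the placeholder strings stand in for your scraped text):

```python
def build_composite_corpus(*subject_docs):
    """Concatenate the i-th document of each subject into one composite document."""
    n = min(len(docs) for docs in subject_docs)
    return [" ".join(docs[i] for docs in subject_docs) for i in range(n)]

# Placeholders for the three sets of 50 scraped documents:
cricket = [f"cricket doc {i}" for i in range(50)]
macro   = [f"macro doc {i}" for i in range(50)]
astro   = [f"astro doc {i}" for i in range(50)]

corpus = build_composite_corpus(cricket, macro, astro)
# corpus[0] == "cricket doc 0 macro doc 0 astro doc 0"
```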

Step 4 – Run the latent topic model code for k = 3 topics on this simulated corpus of 50 composite documents.

Step 5 – Analyse the topic model results - wordclouds, COGs, topic proportions in documents. Use the classwork code for Lift and ETA directly.
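The classwork code is the reference for Lift and ETA. As an illustration only, here lift is taken as P(token | topic) divided by P(token) in the whole corpus, and ETA as the row-normalised document-topic proportions; if the classwork defines these differently, follow the classwork:

```python
def lift_scores(nkw):
    """nkw[t][w] is the count of word w assigned to topic t.
    Returns lift[t][w] = P(w | t) / P(w)."""
    K, V = len(nkw), len(nkw[0])
    nk = [sum(row) for row in nkw]                             # tokens per topic
    nw = [sum(nkw[t][w] for t in range(K)) for w in range(V)]  # tokens per word
    N = sum(nk)
    return [[(nkw[t][w] / nk[t]) / (nw[w] / N) if nw[w] else 0.0
             for w in range(V)] for t in range(K)]

def doc_topic_proportions(ndk):
    """ndk[d][t] is the topic-t token count in doc d.
    Returns each row normalised to sum to 1 (the ETA scores)."""
    return [[c / sum(row) for c in row] for row in ndk]
```

A lift well above 1 means the token is much more common in that topic than in the corpus overall; tokens with high lift in more than one topic are the "mixed tokens" Step 6 asks about.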

Step 6 – Comment on:
(i) Whether the topic model is able to separate each subject from the others. To what extent is it able to do so?
(ii) Are there mixed tokens (with high lift in more than one topic)? Are the highest-lift tokens and the document-topic proportions (ETA scores) clear and able to identify each topic?
(iii) What are your learnings from this exercise?

Your deliverable should be an R Markdown file that documents your efforts and/or results from each of the above steps. Do mention which subjects you chose, and present the exercise as a narrative or a story, as far as possible.

=======

Task 2 - Training a machine to classify tweets according to sentiment.

Step 1 – Choose any six different recent Twitter hashtags, with or without sentiment content (e.g. #ClimateChange, #Trump, #Demonetization, #Kejriwal, #Technology, etc.)

Step 2 – Extract ~500 tweets for each hashtag. You may use the Twitter API connector and associated R code (or the equivalent in Python, if you wish).

Step 3 – Stack all the ~ 3000 tweets into one corpus.

Step 4 – Remove #keywords, web URLs and @user_names from the tweets. Basically, clean the raw corpus.

Step 5 – Make a unique-tweets corpus (~2500) out of the ~3000 tweets, dropping duplicates due to retweets etc.
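Steps 4 and 5 together can be sketched with the standard library's `re` module. The exact patterns below (and the `RT ` prefix handling) are one reasonable choice, not the only one:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, @user_names and #hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # web URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # @mentions and #hashtags
    return re.sub(r"\s+", " ", text).strip()

def dedupe(tweets):
    """Keep the first occurrence of each cleaned tweet (case-insensitive),
    so retweets collapse onto the original."""
    seen, unique = set(), []
    for t in tweets:
        t = re.sub(r"^RT\s+", "", t)     # drop the retweet prefix before comparing
        cleaned = clean_tweet(t)
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Run `dedupe` over the stacked ~3000-tweet corpus to get the ~2500 unique, cleaned tweets.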

Step 6 – Randomly select 70% of the tweets (training data) and classify them manually as positive (1), neutral (0) or negative (-1).

Step 7 – From this training data, build a simple classifier model (as we did in the simple classwork exercise). Split the sample into two-thirds (calibration) and one-third (holdout) and check the prediction accuracy of the model. Build its confusion matrix.
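Use the classwork classifier. Purely to illustrate the shape of this step, here is a minimal multinomial Naive Bayes with add-one smoothing and a confusion matrix, assuming the tweets are already cleaned and labelled (this is a sketch, not the classwork model):

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Fit multinomial Naive Bayes over whitespace tokens."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    class_counts = Counter(labels)       # class priors (as counts)
    for text, y in zip(texts, labels):
        word_counts[y].update(text.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior for `text`."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cy in class_counts.items():
        lp = math.log(cy / n)
        total = sum(word_counts[y].values())
        for w in text.lower().split():
            if w in vocab:               # ignore unseen words
                lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

def confusion_matrix(y_true, y_pred, classes=(-1, 0, 1)):
    """m[(true, predicted)] counts for the three sentiment classes."""
    m = {(a, b): 0 for a in classes for b in classes}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m
```

For Step 8, filtering a stopword list out of the tokens before `train_nb` is a one-line change to the tokeniser, which makes the accuracy comparison easy to run.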

Step 8 – Try changing the pre-processing a few times - dropping the most common and uninformative words using the stopwords filter, for instance. Does it affect prediction accuracy?

Step 9 – Using the best classifier model, classify the remaining 30% of tweets (virgin data).

Step 10 – Write a narrative commentary on this process: (i) What hashtags you picked and why. (ii) What was the distribution of sentiment across your corpus? (iii) What was the predictive accuracy of the model, and did re-processing the raw data help improve it? (iv) Learnings from the exercise.

Your deliverable is again an R Markdown file that documents your efforts and/or results from each of the above steps. Do present the exercise as a narrative or a story, as far as possible.

=======

Deadline is the last day of Term 2 for both assignments.

Any Qs etc, contact us. Good luck.

Sudhir

2 comments:

  1. Interesting link
    http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/

    Professor, is it possible to apply two models at a time on the same dataset? As per the above link, it appears to me that this can be done. In that case, where one model fails to classify certain topics/factors properly but the other is able to do so (and vice versa), the two could be applied together to predict all the topics/factors properly, if not accurately... seeking your comments... Mahesh Jadhav

    1. Hi Mahesh,

      Interesting article indeed. To quote from the article:

      "In general, stacking produces small gains with a lot of added complexity – not worth it for most businesses."

      I guess that says it for its scope in TABA, where we're more about learning to walk than sprint a 100m dash at the moment.

      Do keep putting up interesting stuff you find, however. The stacking idea is intriguing (sorta intuitive, but hard to pull off).

      Sudhir
