Wednesday, November 23, 2016

Some (optional) readings.

Class,

I got some queries on additional readings. I'd also resolved to upload some readings to the blog.

Pls find the same below. I'm quoting from an email response I'd sent recently.

First, look at the working paper I'd uploaded to LMS. Its bibliography section has further academic references, for both topic modeling and matrix factorization.

In addition to the above, the following could be of use:

========

Topic modeling:

Wikipedia entry on topic models

Topic Modeling 101 simple example

Technical paper on Topic model estimation

========

On Matrix Factorization

A simple introduction to Matrix Factorization

Non-negative matrix factorization in linear algebra (Wiki entry)

Matrix decomposition for data mining (technical paper)

========

These readings are optional from an exam point of view.

Sudhir

Friday, November 18, 2016

Group Homeworks for TABA

Class,

Pls find below your two group homeworks. There's ample time to do them and Aashish will discuss them in his tutorial this Sunday.

=======

Task 1 – Empirical Topic Modeling from First Principles:

Recall how we did factor-An from first principles: we simulated a small data matrix, factorized it, and recovered the original (with errors).

We'll repeat that simulation-analysis-recovery exercise, only this time with topic mining instead.

Follow the steps below and look up the sample / example code etc. in the homeworks folder on LMS.

Step 1 - Choose 3 completely different subjects. E.g., I'd choose "Cricket", "Macroeconomics" and "Astronomy". Pls choose any 3 other, very different subjects / fields.

Step 2 – For each selected subject, scrape the first 50 websites returned by a Google search and extract their text; use the Python Google-search code provided. Thus, we'll have 3 different sets of 50 documents each, one for each subject.
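For the text-extraction part, here's a minimal R sketch, assuming you already have the 50 result URLs for a subject in a character vector `urls` (e.g., from the Python search script); rvest is one option for pulling out the paragraph text.

```r
# Hedged sketch: `urls` is assumed to hold the 50 result URLs for one subject.
library(rvest)

get_page_text <- function(url) {
  tryCatch({
    page  <- read_html(url)                      # fetch and parse the page
    paras <- html_text(html_nodes(page, "p"))    # keep paragraph text only
    paste(paras, collapse = " ")
  }, error = function(e) NA_character_)          # skip pages that fail to load
}

subject1_docs <- sapply(urls, get_page_text)
subject1_docs <- subject1_docs[!is.na(subject1_docs)]  # drop failed fetches
```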

Step 3 – Now create a simulated corpus of 50 documents thus: the first of the 50 documents is a simple concatenation of the first document from subject 1, from subject 2 and from subject 3. Likewise for the other 49 documents.

Thus, our simulated corpus now has 'composite' documents, i.e. documents composed of 3 distinct subjects each.
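A quick sketch of the concatenation step, assuming `subject1_docs`, `subject2_docs` and `subject3_docs` are the three character vectors from Step 2:

```r
# Composite document i = subject-1 doc i + subject-2 doc i + subject-3 doc i.
n <- min(length(subject1_docs), length(subject2_docs), length(subject3_docs))
composite_corpus <- paste(subject1_docs[1:n],
                          subject2_docs[1:n],
                          subject3_docs[1:n])
length(composite_corpus)  # should be (up to) 50 composite documents
```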

Step 4 – Run the latent topic model code for k = 3 topics on this simulated corpus of 50 composite documents.
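If you're not using the classwork topic-model code verbatim, one way to fit the model is via the topicmodels package; this is a sketch, not the reference implementation.

```r
library(tm)
library(topicmodels)

corp <- VCorpus(VectorSource(composite_corpus))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corp)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]   # LDA needs non-empty documents

lda_fit <- LDA(dtm, k = 3, control = list(seed = 1234))   # k = 3 topics
```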

Step 5 – Analyse the topic model results - wordclouds, COGs, topic proportions in documents. Use the classwork code for Lift and ETA directly.
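The lift and ETA calculations should come from the classwork code; the raw posterior is easy to eyeball alongside them, e.g.:

```r
library(wordcloud)

terms(lda_fit, 10)            # top 10 terms per topic - a quick sanity check

post  <- posterior(lda_fit)
theta <- post$topics          # document x topic proportions (the ETA analog)
beta  <- post$terms           # topic x term probabilities
round(head(theta), 3)

# Wordcloud for topic 1; scale the probabilities up so wordcloud's default
# min.freq of 3 doesn't drop everything. Repeat for topics 2 and 3.
wordcloud(words = colnames(beta), freq = round(beta[1, ] * 1000),
          max.words = 50)
```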

Step 6 – Comment on (i) whether the topic model is able to separate each subject from the other subjects. To what extent is it able to do so?
(ii) Are there mixed tokens (with high lift in more than one topic)? Are the highest-lift tokens and the document-topic proportions (ETA scores) clear and able to identify each topic?
(iii) What are your learnings from this exercise?

Your deliverable should be an R markdown that documents your efforts and/or results from each of the above steps. Do mention which subjects you chose and present the exercise as a narrative or a story, as far as possible.

=======

Task 2 - Training a machine to classify tweets according to sentiment.

Step 1 – Choose any six different recent Twitter hashtags, with or without sentiment content (e.g., #ClimateChange, #Trump, #Demonetization, #Kejriwal, #Technology, etc.)

Step 2 – Extract ~500 tweets for each hashtag. You may use the Twitter API connector and associated R code (or the equivalent in Python, if you wish).
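A hedged sketch with the twitteR package (the LMS connector code is the reference version; the credential strings below are placeholders for your own Twitter app keys):

```r
library(twitteR)

# Placeholders - substitute your own Twitter app credentials.
setup_twitter_oauth("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

hashtags <- c("#ClimateChange", "#Trump", "#Demonetization",
              "#Kejriwal", "#Technology", "#OddEven")   # any six of your choice

tweet_list <- lapply(hashtags, function(h)
  searchTwitter(h, n = 500, lang = "en"))               # ~500 tweets per hashtag

tweets_df  <- do.call(rbind, lapply(tweet_list, twListToDF))  # flatten to one data frame
tweet_text <- tweets_df$text                                  # the ~3000-tweet corpus
```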

Step 3 – Stack all the ~ 3000 tweets into one corpus.

Step 4 – Remove #keywords, web URLs and @user_names from the tweets. Clean the raw corpus, basically (a cleaning sketch follows Step 5).

Step 5 – Make a unique-tweets corpus (~2500) out of the ~3000 tweets. Drop duplicates due to retweets etc.
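Steps 4 and 5 in one short sketch, assuming `tweet_text` from Step 3 (regex patterns are one reasonable choice, not the only one):

```r
clean_tweets <- tweet_text
clean_tweets <- gsub("http[s]?://\\S+", " ", clean_tweets)          # web URLs
clean_tweets <- gsub("@\\w+", " ", clean_tweets)                    # @user_names
clean_tweets <- gsub("#\\w+", " ", clean_tweets)                    # #keywords
clean_tweets <- gsub("[^[:alnum:][:space:]']", " ", clean_tweets)   # stray symbols
clean_tweets <- tolower(gsub("\\s+", " ", clean_tweets))            # squeeze whitespace

unique_tweets <- unique(trimws(clean_tweets))   # drop retweet duplicates
length(unique_tweets)                           # expect roughly ~2500
```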

Step 6 – Randomly select 70% of the tweets (training data) and classify them manually as positive (1), neutral (0) or negative (-1).

Step 7 – From this training data, build a simple classifier model (as we did in the simple classwork exercise). Split the sample into two-thirds (calibration) and one-third (holdout) and check the prediction accuracy of the model. Build its confusion matrix.
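A minimal sketch using the maxent package (the classwork Maxent code is the reference; `labels` is the hand-coded sentiment vector from Step 6, which you supply yourself):

```r
library(tm)
library(maxent)

set.seed(123)
train_idx  <- sample(seq_along(unique_tweets),
                     size = floor(0.7 * length(unique_tweets)))
train_text <- unique_tweets[train_idx]
# labels <- c(1, 0, -1, ...)   # your manual codes for train_text, same order

dtm    <- DocumentTermMatrix(VCorpus(VectorSource(train_text)))
sparse <- as.compressed.matrix(dtm)        # maxent's sparse format

# Two-thirds calibration, one-third holdout within the coded training data.
n_cal <- floor(2 * nrow(dtm) / 3)
model <- maxent(sparse[1:n_cal, ], labels[1:n_cal])
preds <- predict(model, sparse[(n_cal + 1):nrow(dtm), ])

# Confusion matrix and simple holdout accuracy.
conf_mat <- table(predicted = preds[, 1],
                  actual = labels[(n_cal + 1):nrow(dtm)])
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)
```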

Step 8 – Try changing the pre-processing a few times - dropping the most common and uninformative words using the stopwords filter, for instance. Does it affect prediction accuracy?
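One such pre-processing variant, sketched (rebuild the DTM and re-fit before comparing holdout accuracies):

```r
library(tm)

corp2 <- VCorpus(VectorSource(train_text))
corp2 <- tm_map(corp2, content_transformer(tolower))
corp2 <- tm_map(corp2, removeWords, stopwords("english"))  # drop common, uninformative words
dtm2  <- DocumentTermMatrix(corp2)
# Re-fit the classifier on dtm2 and compare holdout accuracy with the earlier run.
```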

Step 9 – Using the best classifier model, classify the remaining 30% of tweets (virgin data).

Step 10 – Write a narrative commentary on this process - (i) what hashtags you picked and why; (ii) what the distribution of sentiment across your corpus was; (iii) what the predictive accuracy of the model was, and whether re-processing the raw data helped improve it; (iv) your learnings from the exercise.

Your deliverable is again an R markdown that documents your efforts and/or results from each of the above steps. Do present the exercise as a narrative or a story, as far as possible.

=======

The deadline for both assignments is the last day of Term 2.

Any Qs etc, contact us. Good luck.

Sudhir

Wednesday, November 16, 2016

Individual Assignments

Class,

Pls find below all your individual assignments for TABA. Vivek will put up the requisite dropboxes. The assignments are due on 27th November (Sunday) midnight.

=====

Task 1 – Text-Analyzing a simple set of documents.

Imagine you're a Data Scientist / consultant for a movie studio. Your brief is to recommend the top 2-3 movie aspects or attributes the studio should focus on in making a sequel.

The aim is to get you to explore, with trial and error, different configurations of possibilities (e.g., what stop-words to use for maximum meaning? TF or TF-IDF? etc.) in the text-An of a simple corpus. You are free to use topic modeling if you wish but it is not necessary that you do so.

Step 1 – Go to IMDB and extract 100 reviews (50 positive and 50 negative) for your favourite movie. You can refer to the code provided for reviews extraction from IMDB (IMDB reviews extraction.R) on LMS.

Step 2 – Pre-process the data and create a Document Term Matrix. Check word-clouds and COGs under both TF and TF-IDF weighting schemes to see which configs appear most meaningful / informative. Iterate by updating the stop-words list etc.
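A sketch of the TF vs TF-IDF comparison with the tm package, assuming `review_text` holds your 100 scraped reviews (the object names here are illustrative):

```r
library(tm)
library(wordcloud)

reviews <- VCorpus(VectorSource(review_text))
reviews <- tm_map(reviews, content_transformer(tolower))
reviews <- tm_map(reviews, removePunctuation)
custom_stopwords <- c(stopwords("english"), "movie", "film")   # iterate on this list
reviews <- tm_map(reviews, removeWords, custom_stopwords)

dtm_tf    <- DocumentTermMatrix(reviews)   # plain term-frequency weighting
dtm_tfidf <- DocumentTermMatrix(reviews,
                                control = list(weighting = weightTfIdf))

# Compare wordclouds under the two weighting schemes.
freq_tf <- colSums(as.matrix(dtm_tf))
wordcloud(names(freq_tf), freq_tf, max.words = 60)
freq_tfidf <- colSums(as.matrix(dtm_tfidf))
wordcloud(names(freq_tfidf), freq_tfidf, max.words = 60, min.freq = 0)
```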

Step 3 – Compare each review's polarity score with its star rating. A simple cor() call on the two data columns is enough to check the correlation.
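A sketch using the syuzhet package for the polarity score (any lexicon-based scorer will do; `star_rating` is assumed to have been scraped alongside the reviews):

```r
library(syuzhet)

polarity <- get_sentiment(review_text, method = "afinn")   # one score per review
cor(polarity, star_rating)    # correlation between polarity and stars
plot(star_rating, polarity)   # quick visual check of the relationship
```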

Step 4 - Now, make a recommendation. What movie attributes or aspects (e.g., plot? star cast? length? etc.) worked well and should be retained by the studio? Which ones didn't work well and should be changed?

Step 5 – Deliverable: Create a markdown document as your deliverable. Keep only the final configuration that you arrived at; in the text you can describe the trial-and-error you went through. Ensure the graphs/tables on which you are basing your recommendations are part of the markdown. Also ensure that the stop-words you used are visible (either as an image or as a vector).

Overall, I'd say no more than 2 hours of work, provided you diligently replicated the class work examples prior to this.

========

Task 2 – Basic NLP and Named Entity Extraction from one document.

Step 1 – Select one well-known firm from the list of the Fortune 500 firms.

Step 2 – For the selected firm, scrape its Wikipedia page.
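A minimal rvest sketch; the firm and URL below are examples only:

```r
library(rvest)

wiki_url  <- "https://en.wikipedia.org/wiki/General_Electric"   # example firm
page      <- read_html(wiki_url)
page_text <- paste(html_text(html_nodes(page, "p")), collapse = " ")
nchar(page_text)   # sanity check that we actually got text
```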

Step 3 – Using openNLP, find all the locations and persons mentioned in the Wikipedia page. It's good practice to set timers and report runtimes for heavy functions.
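A sketch of the openNLP pipeline with timing (it needs the openNLPmodels.en models installed; `page_text` is assumed from Step 2):

```r
library(NLP)
library(openNLP)   # install openNLPmodels.en as well, for the entity models

text <- as.String(page_text)

sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()
pers_ann <- Maxent_Entity_Annotator(kind = "person")
loc_ann  <- Maxent_Entity_Annotator(kind = "location")

# NER over a full Wikipedia page is slow, so time it as suggested.
print(system.time(
  anns <- annotate(text, list(sent_ann, word_ann, pers_ann, loc_ann))
))

ents  <- anns[anns$type == "entity"]                  # keep only entity annotations
kinds <- sapply(ents$features, `[[`, "kind")

persons   <- unique(text[ents[kinds == "person"]])    # extract the person strings
locations <- unique(text[ents[kinds == "location"]])  # and the location strings
```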

Step 4 – Plot all the extracted locations from the Wikipedia page on a map. You may want to see 'NLP location extract and plot.R' file on LMS for this.
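If you're not using the LMS file, one hedged alternative is geocoding with ggmap and plotting over a world map; geocode() calls an external service, so expect rate limits and the odd miss.

```r
library(ggmap)
library(ggplot2)   # borders() also needs the maps package installed

geo <- geocode(locations)   # look up lon/lat for each extracted location

ggplot() +
  borders("world", colour = "grey70", fill = "grey90") +
  geom_point(data = geo, aes(x = lon, y = lat), colour = "red", size = 2) +
  theme_minimal()
```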

Step 5 – Briefly describe your observations based on the map and persons extracted from Wikipedia page.

Step 6 - Deliverable: R markdown submitted as an HTML page. Ensure that the lists of locations and persons mentioned are clearly visible, and that the map of places is visible too. Use the text part of the markdown to record your observations.

Overall, I'd say no more than 1.5 hours of work.

Pls feel free to discuss these assignments with peers, help out and take help where necessary but at the end of the day, your submission must be individual and driven primarily by your own effort.

Any Qs etc, contact me, Vivek or Aashish.

Sudhir

Monday, November 14, 2016

TABA Welcome Message

Class,

This is a proforma welcome message for Batch 7 folks to the Text-Analytics for Business Applications (TABA) course.

Re what we did in the course:

We've gone through 5 TABA sessions which covered, respectively:

Session 1: Elementary Text-An + sentiment-An
Session 2: Data factorization + Topic Modeling 1
Session 3: Topic Modeling 2
Session 4: Basic NLP + PoS tagging + Named entity recognition (NER) and finally,
Session 5: Supervised text classification (using mainly the Maxent algo in R).

Re Readings and other material:

I shall put up links to relevant readings etc. In a lot of cases, just googling for particular topics can throw up a wealth of information.

The recommended textbook is theory-heavy and fairly comprehensive. I've linked a review of the same.

If you don't have a free GitHub account, now is the time to get one and familiarize yourself with what Git is, how it works etc. For instance, here's a nice 10-minute read that can act as a starting guide.

I want you to think about sourcing some of your code, data etc. directly from Git into RStudio. I'll ask Pandey ji to address any of your queries on this score.
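For instance, sourcing a script or reading data straight from a raw GitHub URL works out of the box in recent R versions (the repo paths below are placeholders):

```r
# Placeholders - point these at raw (not HTML) files in your own repo.
source("https://raw.githubusercontent.com/your-user/your-repo/master/analysis.R")

dat <- read.csv("https://raw.githubusercontent.com/your-user/your-repo/master/data.csv")
```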

Re Home-works:

I'm committed to having your home-works (HWs) out early and on time. There'll be 2 individual and 2 group HWs.

The individual HWs will be simple, practice-based and clear-cut. The deliverable will be in R markdown form. I don't expect any more than 1.5-2 hrs of effort on each indiv HW.

The group HWs will be a little more comprehensive. Again, these too aren't meant to be overly intricate or complex. Kindly ensure you're not over-thinking, over-analyzing or over-complicating any group HW.

Update:

I'm combining the 2 individual HWs into one. Likewise, combining the 2 group HWs into 1. So, there'll only be two HWs in TABA - one individual and one group.

Any Qs, comments, feedback etc. pls let me know.

Sudhir