Wednesday, November 23, 2016

Some (optional) readings.

Class,

I got some queries on additional readings. I'd also resolved to upload some readings to the blog.

Pls find the same below. I'm quoting from an email response I'd sent recently.

First look at the working paper I'd uploaded onto LMS. It has further academic references in its bibliography section. This is for both topic modeling and matrix factorization.

In addition to the above, the following could be of use:

========

Topic modeling:

Wikipedia entry on topic models

Topic Modeling 101 simple example

Technical paper on Topic model estimation

========

On Matrix Factorization

A simple introduction to Matrix Factorization

Non-negative matrix factorization in linear algebra (Wiki entry)

Matrix decomposition for data mining (technical paper)

========

These readings are optional from an exam point of view.

Sudhir

Friday, November 18, 2016

Group Homeworks for TABA

Class,

Pls find below your two group homeworks. There's ample time to do them and Aashish will discuss them in his tutorial this Sunday.

=======

Task 1 – Empirical Topic Modeling from First Principles:

Recall that we did factor analysis (factor-An) from first principles by simulating a small data matrix, factorizing it, and recovering the original (with some error)?

We'll repeat that simulation-analysis-recovery exercise, only this time with topic mining instead.

Follow the steps below and look up the sample / example code etc. in the homeworks folder on LMS.

Step 1 - Choose 3 completely different subjects. E.g., I'd choose "Cricket", "Macroeconomics" and "Astronomy". Pls choose any 3 other, very different subjects / fields.

Step 2 – Scrape the first 50 websites returned by google search for each of these selected subjects and extract their text. Use the python google-search code. Thus, we'll have 3 different sets of 50 documents each, one for each subject.

Step 3 – Now create a simulated corpus of 50 documents thus: The first of the 50 documents is a simple concatenation of the first document from subject 1, from subject 2 and from subject 3. Likewise, for the other 49 documents.

Thus, our simulated corpus now has 'composite' documents, i.e. documents composed of 3 distinct subjects each.
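A minimal sketch of Step 3, assuming the scraped texts for the three subjects already sit in character vectors docs1, docs2 and docs3 (hypothetical names), each of length 50 and in the same order:

# assumption: docs1, docs2, docs3 are character vectors of length 50,
# one element per scraped page, element i corresponding to search result i
composite_docs <- paste(docs1, docs2, docs3, sep = " ")   # element-wise concatenation
length(composite_docs)                                    # should be 50 composite documents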

Step 4 – Run the latent topic model code for k = 3 topics on this simulated corpus of 50 composite documents.
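If you want to cross-check the classwork code against a standard library route, here's a bare-bones sketch of Step 4 assuming the tm and topicmodels packages (an assumption - use whichever implementation the classwork code uses); composite_docs is the vector from the sketch above:

library(tm)
library(topicmodels)

corpus <- VCorpus(VectorSource(composite_docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

lda_fit <- LDA(dtm, k = 3, control = list(seed = 1234))   # k = 3 latent topics
terms(lda_fit, 10)                   # top 10 terms per topic
head(posterior(lda_fit)$topics)      # document-topic proportions (analogous to the ETA scores)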

Step 5 – Analyse the topic model results - wordclouds, COGs, topic proportions in documents. Use the classwork code for Lift and ETA directly.

Step 6 – Comment on (i) whether the topic model is able to separate each subject from other subjects. To what extent is it able to do so?
(ii) Are there mixed tokens (with high lift in more than one topic)? Are the highest LIFT tokens and the document topic proportions (ETA scores) clear and able to identify each topic?
(iii) What are your learnings from this exercise?

Your deliverable should be an R markdown that documents your efforts and/or results from each of the above steps. Do mention which subjects you chose and present the exercise as a narrative or a story, as far as possible.

=======

Task 2 - Training a machine to classify tweets according to sentiment.

Step 1 – Choose any six different recent twitter hashtags, with or without sentiment content (e.g., #ClimateChange, #Trump, #Demonetization, #Kejriwal, #Technology, etc.)

Step 2 – Extract ~500 tweets for each hashtag. You may use the twitter API connector and associated R code (or the equivalent in python, if you wish).

Step 3 – Stack all the ~ 3000 tweets into one corpus.

Step 4 – Remove #keywords, web URLs and @user_names from the tweets. Clean the raw corpus, basically.

Step 5 – Make a unique-tweets corpus (~2,500 tweets) out of the ~3,000 by dropping duplicates due to retweets etc.
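One possible way to do Steps 4 and 5 with base R regexes, assuming the raw tweets sit in a character vector tweets_raw (a hypothetical name):

tweets <- tweets_raw
tweets <- gsub("http[[:alnum:][:punct:]]*", " ", tweets)    # drop web URLs
tweets <- gsub("@\\w+", " ", tweets)                        # drop @user_names
tweets <- gsub("#\\w+", " ", tweets)                        # drop #keywords
tweets <- gsub("[^[:alpha:][:space:]]", " ", tweets)        # drop leftover punctuation / numbers
tweets <- tolower(trimws(gsub("\\s+", " ", tweets)))        # squeeze whitespace, lowercase
tweets_unique <- unique(tweets)                             # drop duplicates (retweets etc.)
length(tweets_unique)                                       # should be roughly 2,500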

Step 6 – Randomly select 70% of the tweets (training data) and classify them manually as positive (1), neutral (0) or negative (-1).

Step 7 – From this training data, build a simple classifier model (as we did in the simple classwork exercise). Split the sample into two-thirds (calibration) and one-third (holdout) and check the prediction accuracy of the model. Build its confusion matrix.
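For Step 7, one possible route (not the only one) is the RTextTools package, which wraps the maxent algorithm we used in class. A hedged sketch, assuming your manually labelled training tweets sit in a dataframe train_df with columns text and label (hypothetical names), already shuffled into random order:

library(RTextTools)

n <- nrow(train_df)
cutoff <- floor(2 * n / 3)                  # two-thirds calibration, one-third holdout

doc_matrix <- create_matrix(train_df$text, language = "english",
                            removeStopwords = TRUE, removeNumbers = TRUE)
container  <- create_container(doc_matrix, train_df$label,
                               trainSize = 1:cutoff,
                               testSize  = (cutoff + 1):n,
                               virgin = FALSE)

maxent_model <- train_model(container, "MAXENT")
maxent_preds <- classify_model(container, maxent_model)

# confusion matrix and prediction accuracy on the holdout third
actual   <- train_df$label[(cutoff + 1):n]
conf_mat <- table(actual, predicted = maxent_preds$MAXENTROPY_LABEL)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)

(The MAXENTROPY_LABEL column name follows RTextTools' output convention; do check names(maxent_preds) on your machine.)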

Step 8 – Try changing the pre-processing a few times - dropping most common and uninformative words using the stopwords filter, for instance. Does it affect prediction accuracy?

Step 9 – Using the best classifier model, classify the remaining 30% of the tweets (virgin data).

Step 10 – Write a narrative commentary on this process - (i) What hashtags you picked and why. (ii) What was the distribution of sentiment across your corpus? (iii) What was the predictive accuracy of the model, and did changing the pre-processing of the raw data help improve it? (iv) Learnings from the exercise.

Your deliverable is again an R markdown that documents your efforts and/or results from each of the above steps. Do present the exercise as a narrative or a story, as far as possible.

=======

Deadline is the last day of Term 2 for both assignments.

Any Qs etc, contact us. Good luck.

Sudhir

Wednesday, November 16, 2016

Individual Assignments

Class,

Pls find below all your individual assignments for TABA. Vivek will put up the requisite dropboxes. The assignments are due on 27th November (Sunday) midnight.

=====

Task 1 – Text-Analyzing a simple set of documents.

Imagine you're a Data Scientist / consultant for a movie studio. Your brief is to recommend the top 2-3 movie aspects or attributes the studio should focus on in making a sequel.

The aim is to get you to explore, by trial and error, different configurations of possibilities (e.g., what stop-words to use for maximum meaning? TF or TFIDF weighting? etc.) in the text-An of a simple corpus. You are free to use topic modeling if you wish, but it is not necessary that you do so.

Step 1 – Go to IMDB and extract 100 reviews (50 positive and 50 negative) for your favourite movie. You can refer to the code provided for reviews extraction from IMDB (IMDB reviews extraction.R) on LMS.

Step 2 – Pre-process the data and create a document-term matrix (DTM). Check word-clouds and COGs under both TF and TFIDF weighting schemes to see which configs appear most meaningful / informative. Iterate by updating the stop-words list etc.
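A minimal sketch of the Step 2 mechanics, assuming the 100 reviews sit in a character vector called reviews (hypothetical name) and you use the tm and wordcloud packages:

library(tm)
library(wordcloud)

my_stopwords <- c(stopwords("english"), "movie", "film")     # iterate on this list
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, my_stopwords)

dtm_tf    <- DocumentTermMatrix(corpus)                                            # TF weighting
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))   # TFIDF weighting

# compare wordclouds under the two weighting schemes
freq_tf    <- sort(colSums(as.matrix(dtm_tf)), decreasing = TRUE)
freq_tfidf <- sort(colSums(as.matrix(dtm_tfidf)), decreasing = TRUE)
wordcloud(names(freq_tf),    freq_tf,    max.words = 60)
wordcloud(names(freq_tfidf), freq_tfidf, max.words = 60)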

Step 3 – Compare each review's polarity score with its star rating. You can choose to use a simple cor() function to check correlation between the two data columns.
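For Step 3, one possibility among several is a lexicon-based polarity score from the syuzhet package; a sketch, assuming reviews and star_ratings are parallel vectors (hypothetical names):

library(syuzhet)

polarity <- get_sentiment(reviews, method = "bing")   # positive-minus-negative word counts
cor(polarity, star_ratings)                           # does text sentiment track the star rating?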

Step 4 - Now, make a recommendation. What movie attributes or aspects (e.g., plot? star cast? length? etc.) worked well, which the studio should retain? Which ones didn't work well and which the studio should change?

Step 5 – Deliverable: Create a markdown document as your deliverable. Keep only the final configuration that you arrived at. In the text you can describe the trials and errors you did. Ensure the graphs/tables on which you are basing your recommendations are part of the markdown. Also ensure that the stop-words you used are visible (either as image or as a vector).

Overall, I'd say no more than 2 hours of work, provided you diligently replicated the class work examples prior to this.

========

Task 2 – Basic NLP and Named Entity Extraction from one document.

Step 1 – Select one well-known firm from the list of the Fortune 500 firms.

Step 2 – For the selected firm, scrape its Wikipedia page.

Step 3 – Using openNLP, find all the locations and persons mentioned in the Wikipedia page. It's good practice to set timers and report runtimes for heavy functions.
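Here is a hedged sketch of the openNLP piece in Step 3, assuming the scraped Wikipedia text is in a single character string wiki_text (hypothetical name). Note that the entity annotators additionally need the openNLPmodels.en package installed:

library(NLP)
library(openNLP)

txt <- as.String(wiki_text)

sent_ann   <- Maxent_Sent_Token_Annotator()
word_ann   <- Maxent_Word_Token_Annotator()
person_ann <- Maxent_Entity_Annotator(kind = "person")
loc_ann    <- Maxent_Entity_Annotator(kind = "location")

t0 <- Sys.time()                                     # timer around the heavy annotation step
anns <- annotate(txt, list(sent_ann, word_ann, person_ann, loc_ann))
Sys.time() - t0                                      # report the runtime

ents      <- anns[anns$type == "entity"]
kinds     <- sapply(ents$features, `[[`, "kind")
persons   <- unique(txt[ents[kinds == "person"]])
locations <- unique(txt[ents[kinds == "location"]])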

Step 4 – Plot all the extracted locations from the Wikipedia page on a map. You may want to see 'NLP location extract and plot.R' file on LMS for this.

Step 5 – Briefly describe your observations based on the map and persons extracted from Wikipedia page.

Step 6 - Deliverable: R markdown submitted as an HTML page. Ensure that the lists of places and persons mentioned are clearly visible, and that the map of places is visible too. Use the text part of the markdown to record your observations.

Overall, I'd say no more than 1.5 hours of work.

Pls feel free to discuss these assignments with peers, help out and take help where necessary but at the end of the day, your submission must be individual and driven primarily by your own effort.

Any Qs etc, contact me, Vivek or Aashish.

Sudhir

Monday, November 14, 2016

TABA Welcome Message

Class,

This is a proforma welcome message for Batch 7 folks to the Text-Analytics for Business Applications (TABA) course.

Re what we did in the course:

We've gone through 5 TABA sessions which covered, respectively:

Session 1: Elementary Text-An + sentiment-An
Session 2: Data factorization + Topic Modeling 1
Session 3: Topic Modeling 2
Session 4: Basic NLP + PoS tagging + Named entity recognition (NER) and finally,
Session 5: Supervised text classification (using mainly the Maxent algo in R).

Re Readings and other material:

I shall put up links to relevant readings etc. In a lot of cases, just googling for particular topics can throw up a wealth of information.

The recommended textbook is theory-heavy and fairly comprehensive. I've linked a review of the same.

If you don't have a free Github account, now is the time to get one and familiarize yourself with what Git is, how it works etc. For instance, here's a nice 10-minute read that can act as a starting guide.

I want you to think about sourcing some of your code, data etc directly from Git into Rstudio. I'll ask Pandey ji to address any of your queries on this score.

Re Home-works:

I'm committed to having your home-works (HWs) out early and on time. There'll be 2 individual and 2 group HWs.

The individual HWs will be simple, practice-based and clear-cut. The deliverable will be in R markdown form. I don't expect more than 1.5-2 hrs of effort for each indiv HW.

The group HWs will be a little more comprehensive. Again, these too aren't meant to be overly intricate or complex. Kindly ensure you're not over-thinking, over-analyzing or over-complicating any group HW.

Update:

I'm combining the 2 individual HWs into one. Likewise, combining the 2 group HWs into 1. So, there'll only be two HWs in TABA - one individual and one group.

Any Qs, comments, feedback etc. pls let me know.

Sudhir

Wednesday, October 12, 2016

Scraping Google Search Results - Group HW #3

Class,

Your final group homework is coming up soon. It involves scraping google search results and post-processing them a little bit. We will learn about text-processing techniques in some detail during your visit 2.

The idea is to scrape the top few pages of google search results for some of the world's most valuable brands and/or firms (both B2B and B2C)

And thereafter classify the URLs as originating from the firm or brand versus those originating from the rest of the world.

After that, a lot of interesting questions can be asked and answered downstream.

Materials:

I'll be putting up the following materials on linkstreet by this weekend.

  • Googlesearch.py code for running on Python 3 - either Jupyter or Spyder
  • CBA Group Assignment Allocation v2.xls - an excel sheet that allocates 8 brands/firms to each group
  • CBA Group Assignment v2.docx - a word doc explaining the process of demarcating firm-sponsored from non-firm content.
  • A zipped folder called 'Examples.zip' that gives a sample submission for Apple Inc. Pls ignore anything in the excel sheet that we haven't covered in class so far, like 'Sentiment' etc.

Task:

1. I have assigned 8-10 brands to each group. Check the excel sheet on linkstreet for which brands have been assigned to your group.

2. Carefully go through the explanation in the word file for how to run the python script for scraping the data.

3. For each brand or firm in your list, run the data collection python script and collect data on the first 5 pages worth of search results. Use the keyword given as is in the search term.

4. Use the excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or "non-firm-sponsored" (if the contents of the URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".

Deliverables:

Pls submit the excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand). Name the excel sheet as "group HW 3 for group .xls"

Deadline:

Submit anytime before the midnight before (i) either the DC exam, or (ii) visit 2, whichever is earlier.

Any queries etc, contact me via the blog comments section or write to Vivek directly.

P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.

Sudhir

Tuesday, October 11, 2016

Group Homework #2 - Webscraping with rvest

Class,

Pls find uploaded in linkstreet a markdown document for using the rvest package to scrape IMDB results.

Run the code line-by-line and check the results. The code is documented to some extent. However, if you're unclear about what a particular line is doing, web-search first, ask your peers, etc.

You should get a file containing URLs for the top 250 movies on IMDB. Once this is done, your real homework task begins.

Task:

1. Find all movies that were released between 1996 and 1998 (both years inclusive).

2. Read in each movie URL's IMDB page and scrape the following information:

Director, stars, Taglines, Genres, (partial) storyline, Box office budget and box office gross.

3. Make a dataframe with these variables as columns and the movie name as the first column.
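A minimal rvest sketch for Steps 2-3 on a single movie page. The CSS selectors below are illustrative assumptions only - inspect the live IMDB page (e.g., with SelectorGadget) and adjust them before use:

library(rvest)

movie_url  <- "http://www.imdb.com/title/tt0110912/"    # example: one URL from your top-250 list
movie_page <- read_html(movie_url)

director <- movie_page %>%
  html_nodes("span[itemprop='director'] span[itemprop='name']") %>%   # assumed selector
  html_text()

storyline <- movie_page %>%
  html_nodes("#titleStoryLine div[itemprop='description'] p") %>%     # assumed selector
  html_text() %>%
  trimws()

# repeat for stars, taglines, genres, budget and gross, then bind one row per movie
movie_df <- data.frame(movie     = "Pulp Fiction",
                       director  = paste(director, collapse = ", "),
                       storyline = storyline[1],
                       stringsAsFactors = FALSE)

Loop this over all the 1996-1998 URLs (an lapply plus rbind works fine) to build the full dataframe.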

4. Make a table of movie counts versus genres.

4a. Bonus points: See if you can come up with some interesting hypotheses. For example, you could hypothesize that "Action Genres occur significantly more often than Drama in the top-250 list." Or that "Action movies gross higher than Romance movies in the top 250 list." Etc.

5. Write a markdown doc with your code and explanation. See if you can storify your hypotheses.

Update: You're not restricted to using only rvest in R. If you're more comfortable with other platforms (say, Python 3), pls feel free to use the same to get the job done. But insert a markdown with the python code you used as part of your submission.

Deliverables and Deadline:

1. Zip a folder containing

  • (a) your .Rmd file,
  • (b) your markdown as a web page,
  • (c) .R script files you used,
  • (d) the data you collected in .txt or .csv form,
  • (e) a notepad with your group name and member roll numbers written on it,
  • (f) Bonus: Include a link to RPubs if you have published your markdown to the web.

2. Deliver the zipped folder to the appropriate dropbox.

3. Deadline will be the midnight of 23-October-2016.

Any Qs etc contact me. Or use the comments section below.

Sudhir

Additional readings - mandatory and optional

Class,

We'd seen a wide range of topics in the DC course. Some of you asked for more sources and reading material. Pls find the same below (in no particular order).

I do realize I got delayed in putting them out and some of you had followed up with me regarding this. Some readings are mandatory and others are entirely optional.

+++

The readings below are NOT optional in the sense that questions based on these readings may feature in your exam.

Readings relating technology to data collection and data use (from the Economist):

1. The first article titled 'Little Brother' (in an obvious play on George Orwell's famous 'Big Brother' theme) details the impact of digital on advertising spends of firms worldwide.

2. The second article, 'Getting to know you', is about the various ways in which data is collected about consumers online.

3. The third article in this series, 'The world wild web', extrapolates some of what we are seeing into the future and asks 'Where are we going?'.

Ideally, I'd like you to read and discuss these articles within your groups. Again, remember, questions based on ideas and facts in these articles are fair game in your final exam for DC. Happy reading.

+++

Now, these readings that follow below are optional, more for leisure reading and folks with interest in particular topics/ verticals etc.

a. More from the Atlantic on how it's now technologically feasible to arrive at one's identity: Big Data Can Guess Who You Are Based on Your Zip Code.

b. Recall the habit patterns class we'd covered? Here's an article from HBR blogs on How Customers Get Hooked on Products.

c. There's an undercurrent somewhere in the program that spells the words "data science". This link here offers a rounded perspective on what precisely data science is. This follow-on link here describes 8 concrete steps you must take to become a data scientist. Yes, R features there. An apt read for all CBA students, IMO.

d. For sessions 1-3, which focussed more on constructs, designing questionnaires around constructs etc., here is some interesting material you may consider browsing at leisure. It's basically to aid understanding for those folks who may have felt the coverage in class was not detailed enough on certain topics:

i. This is a Wikipedia link to Quantitative psychology as a subject area. It provides a nice, concise and precise introduction to the area in general and has a good number of downstream links that you can pick up on as and when necessary.

ii. This is the Wiki entry on Scaling techniques in general in the social sciences. As you can see, the comparative versus noncomparative dichotomy comes in early on here. More links to detailed topics are also available.

iii. This is the wiki entry to psychometrics as a discipline. I thought it a tad too inclined towards educational testing but still, worth a read perhaps, for those interested.

+++

More reading links (optional) that I could gather, below. Recall the Xerox-evolv caselet we did in class - The one where psychographic Likerts were combined with straightforward machine learning? We then discussed implications, pros and cons etc, and speculated on when such practices may spread worldwide and into India. Well, speaking of India...

Startups, and India Inc use psychometric tests to peek into potential recruits’ minds

This is the session 5 reading from NYT on 'Data janitor' work: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

This is the source for the session 5 table on Data munging functions in R from computer world: Great R packages for data import, wrangling & visualization

This is the NYT article on 'the data driven life' about the quantified self movement.

This is the API directory from Programmable web. Recommended that as CBA folks, you register with the site and get periodic updates (if you haven't already).

+++

You have 2 more group homeworks coming - one on webscraping using the rvest package and the other on python for scraping google search results. I'm teaching PGP currently, so posting has been a tad slow, but I will put these HWs out there soon.

Any Qs etc, let me know.

Sudhir

Saturday, October 8, 2016

Individual Homework 2 - Fill up a survey

Class,

I trust the homework assignments etc. from visit 1 have been keeping your weekends at home busy.

Your second individual homework is coming up. And not to worry, it won't be very burdensome, let me assure you. Please take this 20-30 minute web-survey. I plan to use this data to demonstrate concepts and techniques in the Text Analysis course during your next visit.

The survey should be completed (and is automatically submitted online) no later than the midnight of Sunday 16-Oct-2016 (i.e., a week from now).

I estimate the survey should take no more than 20-30 minutes.

+++

On a separate note, I recommend you register (free) with and subscribe to newsletters from the following sites:

1. www.kaggle.com, which hosts data science competitions and releases datasets and tutorials. You can use your FB or Google login. We'll be using some kaggle datasets in NLP.

2. r-bloggers.com - which collates and publishes lots of interesting blogs on R, R news, new packages, cool coding hacks and tricks, etc.

3. 'programmable web' - which is sort of API central. Get news on updates, releases, changes etc in different popular APIs from all around.

Apart from the readings in past posts in this blog, this recent bloomberg article is of some interest from the DC and data science perspective.

Any Qs etc, pls feel free to contact me.

Sudhir

Wednesday, October 5, 2016

On Markdowns and files for DC

Hi Class,

Vivek will upload all classwork related files - slides + code + data - in the next few days. Below are some tips on what to do re code and data.

About Code:

  • Code files will primarily be script files (.R for R and .py for Python) but in some cases, markdown files will also be available.
  • Open .R files directly via Rstudio and .py files via Spyder or Jupyter.
  • Markdown files will appear as .Rmd files in R (for Rmarkdown) which open in Rstudio and as web pages for python.
  • If unfamiliar with R (or python), pls read & execute each line of code + documented comments individually.
  • Issues etc, first ask Google or your peers before reaching out to Aashish.

About Data:

  • Data files are usually available as .txt in LMS.
  • In a few cases, you'll have to signup for an API key and then scrape the dataset yourself. Follow instructions diligently in such cases.
  • Issues etc, first ask Google or your peers before reaching out to Vivek.

Pls ensure you are very comfortable with replicating the classwork examples over the next week to 10 days or so. The homeworks + deadlines will start coming in after that.

Recall that your assignment submissions will be in markdown format and the core insights therein should ideally be wrapped in a narrative / storyline.

I've put up on linkstreet a file, RMarkdown.Rmd, that walks you through a simple procedure for how to write in markdown. I've published the same on RPubs (and you can publish your markdowns too); it is available here. Here's a link to how markdown works. It's a very simple 1-page get-started guide in case you're 100% new to Markdowns.
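In case it helps, below is a bare-bones sketch of what a minimal .Rmd file looks like (an illustrative skeleton only, not the contents of the RMarkdown.Rmd file on linkstreet; the input file name is hypothetical):

---
title: "DC Homework"
author: "Your group name and roll numbers"
output: html_document
---

## Data

A sentence or two on where the data came from.

```{r read-data}
reviews <- readLines("imdb_reviews.txt")   # hypothetical input file
length(reviews)
```

## Results

Narrative text goes here, woven around the code chunks and their outputs.

Hit 'Knit HTML' in Rstudio and you get the webpage deliverable.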

And this page here, from the same author as the get-started guide above, is about a few useful writing tools for crafting a narrative into your markdowns. Again, use your judgment and don't follow the articles to the letter.

And this is an example of a markdown doc done for a project by somebody in 2014, published on Rpubs. Recall the blog page of astronomer Julia Silge I'd shown in class? This is the link to her page. It's a nice intro to how to craft simple narratives around regular R code and workflows. It would be great if you can reproduce her workflow on your twitter feed, for instance...

OK. That's it for now. Feel free to use the comments section in case of anything.

Ciao and Cheers.

Sudhir

Tuesday, October 4, 2016

Group HW for sessions 1-2, Batch 7 @ CBA

Class,

This group homework is based on (a) Session 1 - problem formulation, (b) Session 2 - survey questionnaire design and part of (c) Session 3 - Qualitative methods.

It aims to familiarize the intricacies, advantages and limitations of the Questionnaire tool by actively getting you to design that tool.

This is a group based homework. Only one submission per group. If you don't know who your group is, pls ask Vivek about this.

Problem Context

You are a startup bridging doctors and patients in metropolitan India. Specifically, your startup offers a house-call solution to elderly or otherwise infirm patients in the reasonably wealthy upper-middle class.

You connect doctors who are willing to make house calls (at some premium, presumably) with patients who are willing to pay the premium for the convenience of a house-call visit by a trustworthy doctor.

Imagine you are still in the process of figuring out the contours of your revenue/business model, which crucially depends on what the demand levels are like for such services at what price, and what the supply levels are like at that price. Based on that, one can figure out whom to target, how, and how large the market is.

Task

Your task is to (1) Formulate your D.P. and corresponding R.O.s. (My suggestion: Choose a sharply defined D.P. that can be well-covered by at most 1-2 R.O.s; choose a suitably narrow demographic as your target population that isn't too heterogeneous in its likely needs and preferences).

(2) Next, identify the main construct(s) of interest - the ones that will likely drive demand for your client - ideally based on your R.O.s. (Hint: Think in terms of the average target customer's motives/needs and his/her self-perception of the requirements.)

(3) Design a questionnaire that: [a] surveys target segment respondents on their propensity to use app-based, on-demand doctor house-call services;
[b] can be taken in under 15 minutes on a good net connection;
[c] collects info on the distribution of quantities of interest (such as awareness levels about these services, interest levels in using them, what price levels might be viable, etc.)

My suggestions before starting:

1. To understand this target segment's needs and preferences, do some preliminary, quick qualitative research: E.g., conduct a few interviews (these could be casual conversations or telephonic ones) with a few people in that target segment about the subject. Find out what they think, what they need, what they see others around them doing etc.

2. Write (or 'Program') your survey into either Qualtrics or the free version of a websurvey software such as zoomerang or surveymonkey. See if you can obtain the "launch" survey link.

3. Bonus points for pretesting the survey with a few folks first, accounting for order effects etc in questionnaire design, etc.

Submission Format

You can submit a traditional PPT for this assignment. All future assignments will involve code and hence only markdown will be allowed.

Below is the submission format for a PPT. (That for a markdown will be similar. Just divide the markdown webpage into sections [Title, Data, model, results etc])

Start with a plain white PPT. Save it as groupName_session1.pptx

Title slide: Homework name and names+ roll numbers of group members.

First slide: Brief description of your client and their business (Also, a line or two to justify why you picked them)

Second slide: Statement of D.P and corresponding R.O.s

Third slide: Brief Description of qualitative research carried out to first narrow-down what topics to cover in the survey.

Fourth slide: Make a table that maps the 1-2 major constructs to the corresponding survey question numbers.

Fifth slide: Deliverable - websurvey link. Should be a working link. Also, attach the word or PDF version of your questionnaire onto this slide. The Q numbers in the 4th slide should match the ones here.

Sixth slide: Any learnings you as a group made - E.g., which constructs were the easiest to measure? Hardest? On average, how many Qs per construct did you have to use? Etc.

Update:

In the past, I got quite a few Qs asking if a scale other than Likert can be used etc. Sure, it can. Likert is important in the context of behavioral constructs. For regular, descriptive Qs, use other scales by all means. *Not* every Q has to be a Likert.

Whether PPT or markdown, wrap your submission inside a story, as far as possible.

Deadline for this is midnight of 16-Oct (Sunday). Drop box for the PPT will be set accordingly on linkstreet.

Any queries etc, let Vivek or me know.

Sudhir

Monday, October 3, 2016

Individual Assignment 1 for Batch 7

Hi Class,

A series of homeworks are coming your way. The first is described below. I will put up details for the others shortly.

Pls watch this ~20 minute video carefully. It features Scott McDonald of Condé Nast holding forth on where Marketing Research is headed in the next decade.

“Social Technological and Economic forces affecting Marketing Research over the next decade”

Now, for your HW, pls answer a few simple Qs (True-False, fill in the blanks variety) about the above talk in the following survey:

Questions based on the video

HW Notes:

(i) This is an individual-only HW. Since it involves no R, consulting peers is not permitted.

(ii) I found that using earphones works great in making out what the speaker is saying much more clearly than ordinary speakers. FYI.

(iii) Deadline: The HW should be completed and submitted latest by midnight 10-October Monday.

Any Qs etc, pls feel free to email me or use the comments section below.

Sudhir Voleti

Thursday, September 29, 2016

Quick stock-taking and FAQs

Hi Class,

We're well past the halfway mark in the DC course and as such, it's as good a time as any to take stock of where we are.

To summarize in brief, we covered problem formulation (session 1) and thereafter 3 basic types of research for analytics, each of which depends on the clarity and extent of problem definition.

These are Descriptive Analytics (surveys, primarily in Session 2), Exploratory Analytics (Session 3) and Causal (session 4).

Along the way we also picked up concepts and frameworks that help us define, frame and measure objects of interest such as multidimensional mental objects (session 1), constructs (session 2), DC from APIs (session 3), conjoint experimentation and marketplace simulators (session 4).

That's quite a bit covered in just under 8 hours, if you look back and think about it.

However, mere classroom coverage isn't enough. These concepts can get internalized and concretized only with hands-on practice. That is precisely what the homework assignments are supposed to do.

Your group allocation for the group home-works will shortly happen. I've asked the teaching assistant (TA) Mr Vivek to assign people into groups of 4 keeping in mind diversity in background and demographics.

FAQs: Folks have asked me re textbooks, reference material etc. I'd encourage you to visit old blog posts in this blog for past batches where I had shared many links for additional readings. Let me know if they do not meet what you were specifically looking for.

Lastly, I'm happy to see a good amount of interaction, curiosity, Q&A, quite a bit of startup- and entrepreneurial energy in your batch, in line with that in previous batches.

At the end of the class tomorrow, keeping in mind past tradition, I'll request a "photo-finish" - a class photograph with each section before we end the course.

See you in class tomorrow. Ciao.

Sudhir

Saturday, September 24, 2016

Hi all

A big "Hi" to everybody.

This is a pro-forma welcome message from me to CBA batch 7 taking Data Collection (DC) in Sept 2016.

I'll cover DC in your visit 1 and then, later, Text Analytics for Business Applications (TABA) in visit 2. Both DC and TABA will use mainly R and some Python.

This blog can be a repository for related R and Python code + assistance. Feedback, Q&A etc are always welcome via the comments sections.

Pls download and install both R + Rstudio and Anaconda (with Spyder & Jupyter) from either Linkstreet or directly from the web, by 28 Sept 2016.

Looking forward to smooth sailing.

Sudhir Voleti
Assistant Professor of Marketing
ISB Hyderabad

Saturday, March 12, 2016

Final Group Homework - Scraping Search Results

Hi Class,

Your final group homework is coming in. It involves scraping and sentiment-analyzing google search results, the way I'd once demonstrated in class.

The idea is to scrape the top few pages of google search results for some of the world's most valuable brands (both B2B and B2C)...

... and thereafter assess the relative sentiment gaps between what the firm or brand says about itself and what the rest of the world says about the brand.

Task:

1. We have assigned 8 brands to each group. Check the excel sheet on linkstreet for which brands have been assigned to your group.

2. Carefully go through the explanation in the word and markdown files for how to run the python script for scraping the data, and the R code for sentiment analysis.

3. For each brand in your list, run the data collection python script and collect data on the first 5 pages worth of search results.

4. Use the excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or "non-firm-sponsored" (if the contents of the URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".

5. Build a corpus of the content from firm-sponsored links and sentiment-analyze the corpus. Record sentiment polarities, top positive and negative terms etc corresponding to each link in the excel sheet template given.

6. Do the same for non-firm-sponsored links as well.

7. Speculate on how wide the gap is and why such a gap might have arisen in the first place. For instance, are links in page 1 more positive than those in page 2? Are news articles more positive (negative) than blogs and social media posts? Etc.

8. The deliverables include:

  • (A) the excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand).
  • (B) The firm-sponsored corpus and the non-firm-sponsored corpus into a dropbox for that purpose (zip the 2 corpora as text files and name it after your group while submitting).
  • (C) A markdown file either on RPubs or as a webpage that very briefly (in a few paragraphs per brand) shows the story you have been able to uncover for the brands assigned to your group.

9. Deadline: Submit by the midnight before the DC exam, whenever that is.

Any queries etc, contact me via the blog comments section or write to Aashish directly.

P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.

Friday, March 4, 2016

Group Homework Session 4 - Topic analysis

Hi Class,

Your second-last group homework is coming up. This one concerns topic-mining, analysis and interpretation.

I hope you have completed the individual homework for session 4. If not, I urge you to do so before attempting this one.

After this, there is one last group homework remaining - that for session 5 - based on python script that scrapes google search results.

To ease things a bit, I've already scraped and provided the data for download on the linkstreet assignment folder.

Further, we've also made available the Term Document Matrices (TDMs) that you'll need for the analysis - constructing these yourself might have taken some time.

To make things even easier, Pandeyji has put up a markdown workflow in PDF form on linkstreet of code + sample output that walks you through what needs to be done to get the results. Later, I'll ask you to interpret the results.

Pls go through the workflow, line by line. Try the code on the datasets given.

Task:

Find in linkstreet the text corpus pertaining to the 2013 10-K SEC filings by Tech Sector firms from the Fortune 1000. Read it into R.

There are two sections of interest in the 10-K form:

One is Item 1- Business description (wherein firms describe their business and key products),

and the second is Item 1A - Risk Factors (wherein firms reveal exposure to anticipated risks in the coming year).

Section A people (or groups with majority section A people) will analyze Item 1- Business description.

Section B people (or groups with majority section B people) will analyze Item 1A- Risk Factors.

In each case, run the topic mining algorithm on the corpus. The output will be similar to what you saw in the Samsung data survey - word-clouds and co-occurrence graphs.

Interpret each topic. Label it in < 6 words and describe your interpretation in a few lines.

Bonus question: Any potential applications of such analysis you can think of? For instance, ask yourself why the results of your analysis might be interesting, and to which firms.

Remember, I'm not looking for just some results put up in a file. I'm looking for interconnections, deeper meaning, story...

Challenge:

The assignment may not be as straightforward as first seen. There are multiple combinations of analysis possibilities.

For instance: (a) Should you use TF or TFIDF? I suggest you try both and go with whatever yields cleaner results.
(b) Should you use K = 2 topics? 3? 4? I suggest you try between 2 and 5 topics and see which yields the 'best' or most interpretable and clear story (one quick way to compare K values is sketched just after this list).
(c) Debate, discuss, tweak the code, try-and-err etc. Massive learning opportunity if you are willing to go the distance.
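On point (b), one quick way to compare different K values is the perplexity measure in the topicmodels package (lower is better on held-out documents). A sketch, assuming dtm is the document-term matrix built from the 10-K corpus:

library(topicmodels)

set.seed(1234)
holdout_rows <- sample(seq_len(nrow(dtm)), size = floor(0.2 * nrow(dtm)))
dtm_train <- dtm[-holdout_rows, ]
dtm_test  <- dtm[holdout_rows, ]

k_values <- 2:5
perp <- sapply(k_values, function(k) {
  fit <- LDA(dtm_train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = dtm_test)     # held-out perplexity for this k
})
data.frame(k = k_values, perplexity = perp)

That said, perplexity is only a rough guide - the K that tells the cleanest, most interpretable story should win.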

Submission Details:

A workflow with code explaining the story behind your final results can be submitted to linkstreet as an HTML file. Bonus points if you are able to publish it to Rpubs and send us the link.

Deadline is 20-March (Sunday) Midnight. Any queries etc, let me know.

Sudhir

Thursday, March 3, 2016

Individual Homework - Session 4 (Topic Mining)

Class,

Hi again. Your next individual homework (yup, another survey) is coming up.

Remember this is *individual* homework, no consulting your peers is allowed for this one.

I hope you have replicated the classwork Nokia example in topic-mining at home. It would give you some idea of what is coming. At the least, carefully go over the relevant slides from session 4.

This is what I wrote in the opening statement of your survey:

Recall the Nokia Lumia example from class? We topic-mined consumer reviews of Nokia Lumia 925 and identified two latent topics underlying the text. We labeled the topics and interpreted what they likely meant.

We used the word-clouds and co-occurrence graphs as visual aids for topic identification. Recall that font-size in a word-cloud is proportional to word frequency in the text. In a co-occurrence graph, words that likely occur together within a document are connected in a network of the words. Aside from that, we also used the top few documents loading onto each topic to help in topic interpretation.

The following survey task asks you to repeat the Nokia type exercise on a new mobile brand - Samsung Galaxy.

Below, you will be presented with the topic-mining results for Samsung Galaxy reviews (collected around the same time as Nokia's). Two topics were found to be optimal. You'll be shown word-clouds and co-occurrence graphs for each of the topics. Further, you'll also be shown the top 5 of the documents that loaded on each topic (as a separate file on linkstreet).

Your task is (i) to interpret these topics using the output, (ii) to meaningfully label these two topics with a short, informative name and (iii) provide a brief description of your interpretation.

Remember, there is no right or wrong answer, just your carefully considered responses.

This is the link to your survey.

I have put up on linkstreet a text document that carries the top 5 reviews loading onto Topics 1 and 2 respectively.

Deadline: 12-March (Saturday) midnight. Pls remember: Timely and complete submission carries course credit.

Any queries etc, let me or Aashish know.

Sudhir

Friday, February 26, 2016

Group Homework for Session 3 - Using APIs

Class,

First, some context. Recall the Google maps example we did for spatially plotting commercial entities of interest in Hyderabad.

There, we were trying to know the distribution of purchasing power over Hyd [or, more generally, *any* other] city. How to know – quickly, cheaply, scalably and reliably?

i. Replicate that classwork example at home. Download the code from Linkstreet. Pls view the Linkstreet video on how to create your own account and get data.

ii. Now pick a city as your focal city. Any city except Hyderabad (since it's already done in class).

iii. You are consulting for a client who is a major hospital specializing in cardiac care. The client wants to advertise their emergency numbers, ambulance services etc.

iv. The client wants to build/hire 4 billboards (hoardings) on which to display its messages. Client asks you to look for 4 optimal billboard locations.

v. Profile your client's target segment, who could be either individuals or organizations [e.g., lower, middle or upper SEC; high net worth individuals; startups, SMEs, service businesses such as in education or healthcare, large MNCs etc]

vi. Pick a list of entities from the entity list in the Google Places API that could serve as proxies for the presence / purchasing power / needs of your target segment. Pick around 2-3 proxies in all. E.g., in the classwork example, we picked banks, malls and hospitals as proxy entities to indicate the nearby presence of middle and upper middle class SEC population.

vii. Collect data on these proxy entities in the focal city from the Google Maps API and plot them on Google maps. Interpret what the map is saying.
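Here's a hedged sketch of the data pull and plot in steps vi-vii, using httr/jsonlite to call the Google Places API text-search endpoint and ggmap for the base map. The query strings are illustrative assumptions and places_key stands in for your own API key:

library(httr)
library(jsonlite)
library(ggmap)

places_key <- "YOUR_GOOGLE_PLACES_API_KEY"
query_url  <- "https://maps.googleapis.com/maps/api/place/textsearch/json"

resp <- GET(query_url, query = list(query = "hospitals in Pune", key = places_key))
hits <- fromJSON(content(resp, as = "text"), flatten = TRUE)$results

entity_df <- data.frame(name = hits$name,
                        lat  = hits$geometry.location.lat,
                        lon  = hits$geometry.location.lng)

base_map <- get_map(location = "Pune", zoom = 12)     # ggmap base layer
ggmap(base_map) +
  geom_point(data = entity_df, aes(x = lon, y = lat), colour = "red", size = 3)

Loop over your 2-3 proxy entity types (and paginate if you want more than the first ~20 results per query) before reading the map for areas of interest.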

Your deliverable will be a markdown document in HTML form (a webpage, basically) in which these maps' screen caps are visible.

Highlight at least 4 areas of particular interest for your client. In the markdown documentation, explain why you chose those areas as interesting for your client.

----------------------------------------

Deliverable:

Your markdown submission should ideally cover the following:

a. Title and detail: City and your Client's Business. Also, names and roll numbers of group members. Name the webpage as group_name.html [Pick a group name if you haven't already]

b. Problem Formulation: State in brief your client's business problem, using, say, 1 D.P. and 1-2 R.O.s.

c. Description: State why you picked your focal city.

d. Entity List: List the proxy entities you are going to search for. Justify your list in 1-2 lines for each entity type you have chosen.

e. Results: Ensure google map with proxy entities on it is clear. Highlight which areas of interest you have chosen. Choose 4 promising areas.

f. Interpretation: Justify your choice for the areas picked in a few lines.

g. Bonus points if you could build a distance matrix, cluster the proxy entities and display the clustered entities in a separate map. [Check the classwork for this, I did something similar there].

h. Code: Ensure you submit your code inline in the markdown so that we can test and run it using our own AppIDs.

i. Submit the document in the dropbox before the deadline. Or upload a text doc containing the Rpubs URL and group details into the dropbox.

----------------------------------------

Deadline is midnight of Saturday, 12-March-2016.

Any queries etc, contact aashish_pandey@isb.edu or me. Use the comments sections for general FAQs.

Sudhir

Sunday, February 21, 2016

Individual assignment for Sessions 1-2

Hi Class,

A series of Assignments are coming your way. This is your first individual assignment (i.e., one submission per student).

Pls watch this ~ 20 minute video carefully. It features Scott McDonald of Condé Nast holding fort on where Marketing Research is headed in the next decade.

“Social Technological and Economic forces affecting Marketing Research over the next decade”

Now, for your HW, pls answer a few simple Qs (True-False, fill in the blanks variety) about the above talk in the following survey:

Questions based on the video

Some Notes:

(i) This is an individual-only assignment. Since it involves no R/Python, consulting peers is not permitted.

(ii) I found that using earphones works great in making out what the speaker is saying much more clearly than ordinary speakers. FYI.

(iii) Deadline: The assignment should be completed and submitted latest by midnight 06-March Sunday.

Any Qs etc, pls feel free to email me or use the comments section below.

Sudhir Voleti

Saturday, February 20, 2016

Group Assignment for Sessions 1-2

Class,

This group homework is based on (a) Session 1 - problem formulation, (b) Session 2 - survey questionnaire design and part of (c) Session 3 - Qualitative methods.

It aims to familiarize you with the intricacies, advantages and limitations of the Questionnaire tool by actively getting you to design that tool. This is a group based homework. Only one submission per group. If you don't know who your group is, pls ask Aashish about this.

I hear some groups have enrolled 5 members whereas I'd specified 4 per group. You have a choice to either let one member go (in which case we'll reassign him/her) or be prepared to accept a higher burden of expectations when it comes to grading. Your call.

Problem Context

First read the following (small) newspaper articles from the past year's Economic Times.

How new startups like LiftO, Shuttl, rBus are trying to solve urban commuting problem

Ola launches social ride-sharing feature

If security concerns not addressed, on-demand service startups could spell disaster for firms

Your project consulting client, a startup in the urban social-commuting solutions space in India's top metros, is still in the process of figuring out the contours of its revenue/business model, which crucially depends on what the demand levels are like for such services at what price, whom to target and how.

Pick any one of the firms mentioned in the first two articles as your consulting client.

Task

Your task is to (1) Formulate the client's D.P. and corresponding R.O.s. (My suggestion: Choose a sharply defined D.P. that can be well-covered by at most 1-2 R.O.s; choose a suitably narrow demographic as your target population that isn't too heterogeneous in its likely needs and preferences).

(2) Next, identify the main construct(s) of interest - the ones that will likely drive demand for your client - ideally based on your R.O.s. (Hint: Think in terms of the average target customer's motives/needs and his/her self-perception of the requirements.)

(3) Design a questionnaire that: [a] surveys target segment respondents on their propensity to use app-based on-demand commuting solution services;
[b] can be taken in <15 minutes on a good net connection;
[c] collects info on the distribution of quantities of interest (such as awareness levels about these services, interest levels in using them, what price levels might be viable, etc.)

My suggestions before starting:

1. To understand this target segment's needs and preferences, do some preliminary, quick qualitative research: E.g., conduct a few interviews (these could be casual conversations or telephonic ones) with a few people in that target segment about the subject. Find out what they think, what they need, what they see others around them doing etc.

2. Write (or 'Program') your survey into Qualtrics. Obtain the "launch" survey link.

3. Bonus points for using SKIP logic in Qualtrics (or any other free websurvey software such as surveyMonkey or zoomerang), pretesting the survey with a few folks first, accounting for order effects etc in questionnaire design, etc.

Submission Format

You can submit either a markdown (in Jupyter, in Rstudio, or on blogs) or a traditional PPT for this assignment. All future assignments will involve code and hence only markdown will be allowed.

Below is the submission format for a PPT. That for a markdown will be similar. Just divide the markdown webpage into sections [Title, Data, model, results etc]

Start with a plain white PPT. Save it as groupName_session1.pptx

Title slide: Homework name and names+ roll numbers of group members.

First slide: Brief description of your client and their business (Also, a line or two to justify why you picked them)

Second slide: Statement of D.P and corresponding R.O.s

Third slide: Brief Description of qualitative research carried out to first narrow-down what topics to cover in the survey.

Fourth slide: Make a table that maps the 1-2 major constructs to the corresponding survey question numbers.

Fifth slide: Deliverable - websurvey link. Should be a working link. Also, attach the word or PDF version of your questionnaire onto this slide. The Q numbers in the 4th slide should match the ones here.

Sixth slide: Any learnings you as a group made - E.g., which constructs were the easiest to measure? Hardest? On average, how many Qs per construct did you have to use? Etc.

Update:

In the past, I got quite a few Qs asking if a scale other than Likert can be used etc. Sure, it can. Likert is important in the context of behavioral constructs. For regular, descriptive Qs, use other scales by all means. *Not* every Q has to be a Likert.

Whether PPT or markdown, wrap your submission inside a story, as far as possible.

The instructions for how to get a qualtrics account will be put up on Linkstreet, if they haven't been done so already.

Deadline for this is midnight of 05-March (Saturday). Drop box for the PPT/ markdown webpage will be set accordingly.

Any queries etc, let me know.

Sudhir

Friday, February 19, 2016

Classwork Uploads

Hi Class,

I expect that all of you have by now (i) formed your groups and (ii) installed all necessary analysis software (Rstudio and Python).

Aashish will upload all classwork related files - slides + code + data - in the next 2 days. Below are some tips on what to do re code and data.

About Code:

  • Code files will primarily be script files (.R for R and .py for Python) but in some cases, markdown files will also be available.
  • Open .R files directly via Rstudio and .py files via Spyder or Jupyter.
  • Markdown files will appear as .Rmd files in R (for Rmarkdown) which open in Rstudio and as web pages for python.
  • If unfamiliar with R (or python), pls read & execute each line of code + documented comments individually.
  • Issues etc, first ask Google or your peers before reaching out to Aashish.

About Data:

  • Data files are usually available as .txt in LMS.
  • In a few cases, you'll have to signup for an API key and then scrape the dataset yourself. Follow instructions diligently in such cases.
  • Issues etc, first ask Google or your peers before reaching out to Aashish.

Pls ensure you are very comfortable with replicating the classwork examples over the next week to 10 days or so. The homeworks + deadlines will start coming in after that.

Recall that your assignment submissions will be in markdown format and the core insights therein should ideally be wrapped in a narrative / storyline.

Here's a link to how markdown works. It's a very simple 1-page get-started guide in case you're 100% new to Markdowns.

And this page here, from the same author, is about a few useful writing tools for crafting a narrative into your markdowns. Again, use your judgment and don't follow the articles to the letter.

Recall the blog page of astronomer Julia Silge I'd shown in class? This is the link to her page. It's a nice intro to how to craft simple narratives around regular R code and workflows. It would be great if you can reproduce her workflow on your twitter feed, for instance...

OK. That's it for now. Feel free to use the comments section in case of anything.

Ciao and Cheers.

Sudhir

Thursday, February 18, 2016

Hi again

Well, folks.

Our 5 sessions together are done. Am happy to note they went by smoothly enough. Thanks for being a nice class.

Watch this space for homework assignments, additional reading material etc that I'll put up. Feel free to explore blog entries from previous years to gain some idea regarding assignments and readings.

It won't be possible for me to put-up links to every topic and sub-topic mentioned or covered in class. However, searching the web will often reveal the info required.

For example, from session 5, if you want to know more about ROS or Baxter or a Turtlebot, just google search and explore the results.

Here are the class pictures we took. It's a tradition of sorts with me the past year odd, clicking pics with every class I've taught on the last day of class.

This is for Section A (I think)

And this with Section B.

One important perspective to keep in mind is that you're all in this learning journey together. You'll have a choice between adopting a collaborative mindset versus a competitive mindset.

I urge you to collaborate as a class to accomplish more learning for the class as a whole. It's better for everybody if everybody is willing to help/share with peers than if everybody seals themselves off into silos. But crucially, this requires that *everybody* buys into the idea.

I encourage you to use the comments section of this blog for queries, ideas, feedback etc regarding the DC course.

Cheers and Ciao.

Sudhir

Friday, February 12, 2016

Hi Class

A big "Hi" to everybody.

This is a pro-forma welcome message from me to CBA batch 6 taking Data Collection (DC) in Feb 2016.

DC will use some R like it has in the past but this time, for the first time, I will try Python in the classroom.

This blog can be a repository for related R and Python code + assistance. Feedback, Q&A etc are always welcome via the comments sections.

Pls download and install both R + Rstudio and Anaconda (with Spyder & Jupyter) from LMS, if you haven't already.

Looking forward to smooth sailing.

Sudhir Voleti
Assistant Professor of Marketing
ISB Hyderabad