Analytics Yogi: October 2015

Wednesday, October 7, 2015

Session 5 Homeworks

Class,

Your last set of two homeworks, corresponding to session 5, is coming up. One is short and simple; the other is a tad challenging and more of a capstone. I realize I couldn't offer any homework practice on the crowdsourcing and location-based DC part, but I guess this is quite solid for a 5-session course.

Individual Homework:

Pls fill up these surveys according to your section.

Survey for section A

Survey for section B

These surveys are again based on latent topic interpretation, similar to what you've done previously. Shouldn't take too much time.

Deadline is a week from now: Midnight of Thursday, 14-Oct-2015. Do remember that the individual homeworks are graded on timeliness and completeness.

----------------------------------------

Group Homework:

First, some context. Recall the Google maps example we did for spatially plotting commercial entities of interest in Hyderabad.

There, we were trying to gauge the distribution of purchasing power across Hyderabad [or, more generally, *any* other city]. How can we find this out quickly, cheaply, scalably and reliably?

i. Replicate that classwork example at home. Download code from LMS. Pls view the LMS video on how to create your own account and get data.

ii. Now pick a city as your focal city. Any city except Hyderabad (since it's already done in class).

iii. Pick a sector and an industry within it. For example, Food --> pizzerias or high-end restaurants; Finance --> ATMs, bank branches etc. Consider yourself to be a consultant for a client who wants to enter the focal city in that business.

iv. Profile your client's target segment, who could be either individuals or organizations [e.g., lower, middle or upper SEC; high net worth Individuals; startups, SMEs, service businesses such as in education or healthcare, large MNCs etc]

v. Pick a list of entities from the entity list that Google provides that could serve as proxies for the presence / purchasing power / needs of your target segment. Pick around 2-3 proxies in all. E.g., in the classwork example, we picked banks, malls and hospitals as proxy entities to indicate nearby presence of middle and upper middle class SEC population.

vi. Collect data on these proxy entities in the focal city from the Google Maps API and plot them on Google maps. Interpret what the map is saying.

Your deliverable will be a PPT with screen caps of these maps pasted on the slides. Highlight at least 2-3 areas of particular interest for your client using ovals and textboxes. In a separate slide, explain why you chose those areas as interesting for your client.

vii. Bonus points: Run a simple clustering based on a distance matrix for the entities chosen. Display the clusters in a separate map.
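For the bonus step above, here is a minimal sketch of one way to go from coordinates to clusters. It is written in Python rather than the classwork R, and it is not the classwork method: it builds a great-circle (haversine) distance matrix and then groups entities that lie within a chosen distance of each other (a simple single-linkage rule). The entity names, coordinates and the 1 km threshold are all made up for illustration.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km: mean Earth radius

def distance_matrix(points):
    """Square matrix of pairwise distances between all points."""
    return [[haversine_km(p, q) for q in points] for p in points]

def cluster_by_threshold(dist, max_km):
    """Single-linkage grouping: chains of entities within max_km share a label."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to a cluster
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] <= max_km:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical proxy entities (name, lat, lng); coordinates are invented.
entities = [
    ("Bank A", 18.520, 73.856),
    ("Mall A", 18.522, 73.858),
    ("Hospital A", 18.519, 73.855),
    ("Bank B", 18.600, 73.760),
    ("Mall B", 18.602, 73.762),
]
points = [(lat, lng) for _, lat, lng in entities]
dist = distance_matrix(points)
labels = cluster_by_threshold(dist, max_km=1.0)

for (name, _, _), label in zip(entities, labels):
    print(name, "-> cluster", label)
```

The cluster labels can then be used to colour the markers on a separate map. A hierarchical or k-means clustering on the same distance matrix would be an equally reasonable choice.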

----------------------------------------

Deliverable:

PPT form as per the instructions below.

a. Title slide: City and your Client's Business. Also, names and roll numbers of group members. Name the ppt as group_name.pptx [Pick a group name if you haven't already]

b. Problem Formulation slide: State in brief your client's business problem, using, say 1 D.P and 1-2 R.Os

c. Description slide(s): State why you picked your focal city in 1-2 lines. Describe why you picked your client's business in 1-2 lines. Bonus points for slightly out-of-the-way (or non-mainstream) instances.

d. Proxy List: List the proxy entities you are going to search for. Justify your list in 1-2 lines for each entity type you have chosen.

e. Result slides: Paste google map screen cap with proxy entities on it. Highlight using ovals and arrows which areas of interest you have chosen. Choose 2-3 promising areas.

f. Interpretation slide: Justify your choice for the areas picked in a few lines.

g. Bonus points if you could build a distance matrix, cluster the proxy entities and display the clustered entities on a slide. [Check the classwork for this, I did something similar there].

h. Bonus points if you submit your code which we can test and run using our AppIDs here. Submit R code as

i. Submit the ppt in the dropbox before deadline.
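For item (h), code is easiest to re-run with someone else's key when the key is a parameter rather than hard-coded. Below is a rough Python sketch (not the LMS classwork code, which is in R) that builds a request URL and pulls names and coordinates out of a parsed response. It assumes the Google Places Nearby Search web service and its `location`, `radius`, `type` and `key` parameters; check the current Google documentation before relying on it. The key, coordinates and sample response are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint: Google Places Nearby Search (verify against current docs).
NEARBY_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def build_nearby_url(lat, lng, radius_m, place_type, api_key):
    """Build the request URL; the API key is passed in, never hard-coded."""
    params = {
        "location": f"{lat},{lng}",
        "radius": radius_m,
        "type": place_type,
        "key": api_key,
    }
    return NEARBY_SEARCH_URL + "?" + urlencode(params, safe=",")

def extract_points(response_json):
    """Pull (name, lat, lng) tuples out of a parsed Nearby Search style response."""
    rows = []
    for item in response_json.get("results", []):
        loc = item["geometry"]["location"]
        rows.append((item["name"], loc["lat"], loc["lng"]))
    return rows

# Placeholder key and coordinates, for illustration only.
url = build_nearby_url(18.52, 73.85, 5000, "bank", "YOUR_KEY")

# Hand-written stand-in for a response, so the sketch runs without a network call.
sample_response = {
    "results": [
        {"name": "Bank A", "geometry": {"location": {"lat": 18.52, "lng": 73.856}}},
    ]
}
proxy_points = extract_points(sample_response)
print(url)
print(proxy_points)
```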

Deadline is midnight of Sunday, 1-Nov-2015.

Any queries etc, contact aashish_pandey@isb.edu or me. Use the comments sections for general FAQs.

Sudhir

Additional (Optional) Readings

Class,

Here is a set of readings from sessions 4 and 5 that may be of interest. I do realize I got delayed in putting them out and some of you had followed up with me regarding this.

Recall the Xerox-evolv caselet we did in class - The one where psychographic Likerts were combined with straightforward machine learning? We then discussed implications, pros and cons etc, and speculated on when such practices may spread worldwide and into India. Well, speaking of India...

Startups, and India Inc use psychometric tests to peek into potential recruits’ minds

This is the session 5 reading from NYT on 'Data janitor' work: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

This is the source for the session 5 table on Data munging functions in R from computer world: Great R packages for data import, wrangling & visualization

This is the Dell Ideastorm website we talked about in 'DC from crowds' ...

...and this is the landing page for the P&G connect + develop FMCG idea-sourcing endeavor.

This is the NYT article on 'the data driven life' about the quantified self movement.

From session 3, this is the API directory from Programmable web. Recommended that as CBA folks, you register with the site and get periodic updates (if you haven't already). These are all optional reads, so read at leisure.

One last set of homeworks for session 5 is on its way. Another survey to fill and some Google Maps data to pull and play with. Trust that's doable. Well, shall close here.

Ciao

Sudhir

Friday, October 2, 2015

Session 4 homework - Text analytics and topic modeling

Class,

This group homework is based on session 4 - text analytics and topic-mining.

I hope you are comfortable replicating the classwork R code on LMS. There's also an instructional video Pandeyji put up on twitteR. I urge you to leverage the tutorial session with Aashish this Saturday to clarify all issues in this space.

Homework Instructions:

1. Pick any well-known brand (product or service). E.g. Xbox360 or Jabong or iphone6 or Nike.

2. Collect 3 sets of data for it:

(a) 100+ consumer reviews from either flipkart or Amazon India
(b) 500+ tweets
(c) 50+ articles from Google News or any other news aggregator site.

3. Feel free to either use R or any other means you know of to collect the data (e.g. Python, chrome scraper etc.). But clearly mention the data collection tool used.

4. For each set of data, perform the following analyses:

(a) General wordcloud using both TF and TFIDF weighting schemes. Update stopwords list to filter out noisy or irrelevant terms.
(b) Sentiment analysis. Display wordclouds separately for the top 50 most positive and most negative words.
(c) Identify the top few most positive and most negative documents. Read them and speculate on why they are so positive or negative about it.

5. Latent topic mining: Topic mine a corpus from any one dataset. Use no more than 2 or 3 topics. Make wordclouds of each topic's tokens. Interpret in a few lines what the topics are saying.

6. Session 4 HW submission format:

i. Use a plain white blank PPT.

ii. On the title slide, write your group name and the names + ISB student IDs of all group members.

iii. Give your homework an informative title (include name of the product/brand you chose).

iv. Have 3 sections in your PPT - one for each data source, separated by separator slides.

v. As slide separators, mention the source of the data. E.g., "Data source: Amazon Consumer reviews" or "Data Source:Twitter" and so on.

vi. For slide headers, use format "TF Wordcloud" or "Positive wordcloud" and so on.

vii. Save the slide deck as session4HW_yourgroup.ppt.

viii. Put all the raw data you collected, the code you used and your PPT in a zip folder (so that I can replicate your analysis if need arises). Save the folder as session4HW_yourgroup.zip and upload it in the dropbox on LMS before the deadline.

ix. Any Qs etc., let Aashish or me know. Feel free to use the comments section to this post for any Q&A or discussions.
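On step 4(a) above, the difference between the TF and TFIDF wordclouds is easiest to see on a toy corpus. The Python sketch below is an illustration, not the classwork R code. It uses one common TFIDF variant (raw count times log(N/df)); the R package you use may apply a different log base or normalization. The three "reviews" and the stopword list are invented.

```python
import math
from collections import Counter

# Illustrative stopword list; extend it to drop noisy or irrelevant terms.
STOPWORDS = {"the", "is", "a", "and", "it", "this", "but"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]

def term_frequencies(docs):
    """One Counter of raw term counts per document."""
    return [Counter(tokenize(d)) for d in docs]

def tfidf_weights(docs):
    """Corpus-level weight per term: sum over documents of tf * log(N / df)."""
    tfs = term_frequencies(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in tfs:
        doc_freq.update(tf.keys())  # number of documents containing each term
    weights = Counter()
    for tf in tfs:
        for term, count in tf.items():
            weights[term] += count * math.log(n_docs / doc_freq[term])
    return weights

# Made-up mini corpus standing in for consumer reviews.
docs = [
    "the phone is great and the camera is great",
    "the phone battery is poor",
    "great phone but poor battery",
]
tf_total = sum(term_frequencies(docs), Counter())
tfidf_total = tfidf_weights(docs)

print("TF:", tf_total.most_common())
print("TFIDF:", sorted(tfidf_total.items(), key=lambda kv: -kv[1]))
```

Here "phone" appears in every document, so it is among the largest words in the TF cloud but gets zero weight under TFIDF. That is why the two clouds can look quite different for the same corpus.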

There's just one more [group] homework left - on spatial DC using Google Maps, based on a business problem.

Deadline: Sunday 25-Oct Midnight.

Sudhir

Thursday, October 1, 2015

Session 3 Homeworks - Individual and Group

Hi Class,

Recall that in Session 3 we covered two distinct topics - 'DC from Qualitative research' and 'DC from APIs'. Corresponding to these two topics are the following two homeworks - one individual (to be done and submitted independently) and one group (one submission per group).

Individual homework: Qualitative DC

This individual homework is about primary DC on Word-of-mouth (henceforth, WOM) communications.

Read carefully the following 5 steps for your homework.

Instructions:

1. For one full weekday, make a quick note of every instance of offline WOM communication about *any* brand (product or service) that you come across.

Thus for instance, if you happened to mention to your friend that you liked "Bahubali" (movie), then make a note of it.

If your colleague happens to mention to you that s/he was at Continental Hospital for a checkup, make a note of that too.

Or it could be that you were the third party at a conversation between two people arguing over whether 'The Times of India' is better than 'The Hindu'. Make a note of that too.

2. Mind you, this is just for 24 hours. In the course of your regular day, make a mental note of all the products, brands, services etc. that you come across via interpersonal WOM (offline only), and then record them in a notepad, an excel sheet or some such place.

Important: Do NOT deliberately indulge in WOM for the homework. Only record that WOM which happens naturally, in the course of your everyday routines.

3. I want you to record 3 things:

(a) Name of the product/brand etc and which category/ industry it belongs to.
(b) Who was the source of the WOM (was it you? a colleague? family member? etc.) and who was the recipient?
(c) what was the time of the day (roughly) when the WOM exchange took place.

Model the worksheet columns as shown in the example below:

4. Repeat steps 1-3 for any 24 hour period during a weekend or holiday.

5. Finally, write your primary data collected into an excel sheet with 5 columns: brand/product, industry or category, WOM source, WOM recipient and Date-time.

Name the excel sheet as "YourName_ISB student number.xls" and upload it to the requisite dropbox in LMS.
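As a rough illustration of the 5-column layout in step 5, the Python sketch below writes a few rows in CSV form, which Excel can open and re-save as .xls. The brands come from the examples in step 1. The categories, sources, recipients and times are invented placeholders, not real observations.

```python
import csv
import io

# The five columns named in step 5.
COLUMNS = ["brand/product", "industry or category", "WOM source",
           "WOM recipient", "Date-time"]

# Placeholder rows; only the brand names are taken from the examples above.
rows = [
    ["Bahubali", "Movies", "Self", "Friend", "2015-10-05 09:30"],
    ["Continental Hospital", "Healthcare", "Colleague", "Self", "2015-10-05 13:15"],
    ["The Times of India", "Newspapers", "Acquaintance", "Acquaintance", "2015-10-05 18:45"],
]

# Write to an in-memory buffer; swap in open("yourfile.csv", "w", newline="")
# to produce an actual file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)
sheet_text = buffer.getvalue()
print(sheet_text)
```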

Deadline for this individual homework is 15 days from now - i.e. 16 October 2015, Friday, midnight.

Any queries etc pls let me or Suresh know.

-----------------------------------------------------

Group Homework: DC from APIs

One submission per group. Use any tools/platform you prefer, not necessarily R.

Instructions:

1. Replicate at home the classwork exercises on API based DC (including viewing the related video instructions on LMS).

2. Google for free traffic APIs available. A few I found, for instance, were:

Microsoft Bing

Yahoo's traffic API

HERE traffic API

3. Read the relevant documentation. Connect to the API. This is a HCC Level 1 assignment - meaning, one can consult peers and other groups for help as required.

4. There are typically two types of information given out in Traffic APIs - incident data (accidents, crashes etc) and flow data. Your task is to obtain either of the two for any major US city.

5. Display the output as a table (or dataframe or matrix object) with well-defined columns and a few rows for illustration.

6. Submit a PPT with your group members' names in the title slide, your chosen API's details in the next slide, code used to pull data from the API in the third slide (at least the URL constructed) and a snapshot of the output on the fourth slide.
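Steps 4 to 6 above boil down to two things: construct a request URL, then flatten the response into a table. The Python sketch below shows that shape only. The endpoint, the parameter names (`bbox`, `key`) and the response fields are hypothetical and do not belong to Bing, Yahoo or HERE; take the real ones from your chosen API's documentation. The bounding box is roughly around Chicago, and the sample response is hand-written so the sketch runs without a network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://traffic.example.com/v1/incidents"

def build_incident_url(south, west, north, east, api_key):
    """Construct the request URL for incidents inside a bounding box."""
    query = urlencode(
        {"bbox": f"{south},{west},{north},{east}", "key": api_key}, safe=","
    )
    return f"{BASE_URL}?{query}"

# Hand-written stand-in for what an incident feed might return.
SAMPLE_RESPONSE = """
{"incidents": [
  {"id": 1, "type": "ACCIDENT", "severity": 3, "lat": 41.88, "lng": -87.63},
  {"id": 2, "type": "ROAD_CLOSURE", "severity": 2, "lat": 41.90, "lng": -87.65}
]}
"""

COLUMNS = ["id", "type", "severity", "lat", "lng"]

def to_table(payload):
    """Flatten the parsed JSON into rows with well-defined columns."""
    return [[item.get(col) for col in COLUMNS] for item in payload["incidents"]]

url = build_incident_url(41.6, -87.9, 42.0, -87.5, "YOUR_KEY")
table = to_table(json.loads(SAMPLE_RESPONSE))

print(url)
print("\t".join(COLUMNS))
for row in table:
    print("\t".join(str(value) for value in row))
```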

Submission Deadline is 15 days from now: midnight of 16-Oct Friday.

Any queries etc pls let me or Aashish know.

Sudhir