Analytics Yogi: October 2016

Wednesday, October 12, 2016

Scraping Google Search Results - Group HW #3

Class,

Your final group homework is coming up soon. It involves scraping Google search results and post-processing them a little. We will learn about text-processing techniques in some detail during visit 2.

The idea is to scrape the top few pages of Google search results for some of the world's most valuable brands and/or firms (both B2B and B2C).

Thereafter, classify the URLs as originating from the firm or brand versus those originating from the rest of the world.

After that, a lot of interesting questions can be asked and answered downstream.

Materials:

I'll be putting the following materials up on linkstreet by this weekend.

Googlesearch.py code for running on Python 3 - either Jupyter or Spyder

CBA Group Assignment Allocation v2.xls - an excel sheet that allocates 8 brands/firms to each group

CBA Group Assignment v2.docx - a word doc explaining the process of demarcating firm-sponsored from non-firm content.

A zipped folder called 'Examples.zip' that gives a sample submission for Apple Inc. Pls ignore anything in the excel sheet that we haven't covered in class so far, like 'Sentiment' etc.

Task:

1. I have assigned 8-10 brands to each group. Check the excel sheet on linkstreet for which brands have been assigned to your group.

2. Carefully go through the explanation in the word file for how to run the python script for scraping the data.

3. For each brand or firm in your list, run the data collection python script and collect data on the first 5 pages worth of search results. Use the keyword given as is in the search term.

4. Use the excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or as "non-firm-sponsored" (if the contents of a URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".
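
The manual marking in step 4 can be given a rough first pass in code. The sketch below, in Python 3, labels a URL as "firm-sponsored" when its host matches a list of domains the firm is assumed to control, and as "non-firm-sponsored" otherwise. The function name `label_url`, the `firm_domains` list and the example URLs are all hypothetical and are not part of the course's Googlesearch.py. "Irrelevant" links, and firm pages hosted on third-party platforms, would still need a human check.

```python
from urllib.parse import urlparse


def label_url(url, firm_domains):
    """Return a first-pass label for one search-result URL.

    firm_domains is an assumed, hand-made list of domains the firm controls.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in firm_domains:
        domain = domain.lower()
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return "firm-sponsored"
    return "non-firm-sponsored"


# Hypothetical example: example.com stands in for a firm's own domain.
firm_domains = ["example.com"]
urls = [
    "https://www.example.com/products",
    "https://support.example.com/help",
    "https://news.example.org/story-about-the-firm",
    "https://notexample.com/",
]
labels = [label_url(u, firm_domains) for u in urls]
for u, lab in zip(urls, labels):
    print(lab, u)
```

Matching on the host rather than on the whole URL string avoids false hits when the firm's name merely appears in the path of somebody else's page.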

Deliverables:

Pls submit the excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand). Name the excel sheet as "group HW 3 for group .xls"

Deadline:

Submit any time before midnight on the day before either (i) the DC exam or (ii) visit 2, whichever is earlier.

Any queries etc, contact me via the blog comments section or write to Vivek directly.

P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.

Sudhir

Tuesday, October 11, 2016

Group Homework #2 - Webscraping with rvest

Class,

Pls find uploaded in linkstreet a markdown document for using the rvest package to scrape IMDB results.

Run the code line-by-line and check results. The code is documented to some extent. However, if you're unclear about what a particular line is doing, web-search first, ask your peers, etc.

You should get a file containing URLs for the top 250 movies on IMDB. Once this is done, your real homework task begins.

Task:

1. Find all movies that were released between 1996 and 1998 (both years inclusive).

2. Read in each movie's IMDB page from its URL and scrape the following information:

Director, stars, Taglines, Genres, (partial) storyline, Box office budget and box office gross.

3. Make a dataframe out of these variables as columns with movie name being the first variable.

4. Make a table of movie count versus Genre.

4a. Bonus points: See if you can come up with some interesting hypotheses. For example, you could hypothesize that "Action Genres occur significantly more often than Drama in the top-250 list." Or that "Action movies gross higher than Romance movies in the top 250 list." Etc.

5. Write a markdown doc with your code and explanation. See if you can storify your hypotheses.
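
Steps 1 and 4 above can be sketched in Python 3 using only the standard library. The records below are made-up placeholders, not real IMDB data, and the field names (`title`, `year`, `genres`) are assumptions about how the scraped results might be stored. The same logic carries over to an R data frame.

```python
from collections import Counter

# Placeholder records standing in for scraped movie data (not real IMDB values).
movies = [
    {"title": "Movie A", "year": 1995, "genres": ["Drama"]},
    {"title": "Movie B", "year": 1996, "genres": ["Action", "Drama"]},
    {"title": "Movie C", "year": 1997, "genres": ["Drama", "Romance"]},
    {"title": "Movie D", "year": 1998, "genres": ["Action"]},
    {"title": "Movie E", "year": 1999, "genres": ["Comedy"]},
]

# Step 1: keep movies released from 1996 to 1998, both years included.
selected = [m for m in movies if 1996 <= m["year"] <= 1998]

# Step 4: count movies per genre; a movie with several genres counts once in each.
genre_counts = Counter(g for m in selected for g in m["genres"])

for genre, count in genre_counts.most_common():
    print(f"{genre:10s} {count}")
```

Because one movie can carry several genres, the genre counts can add up to more than the number of movies; that is worth stating when framing a hypothesis on the table.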

Update: You're not restricted to only use rvest in R. If you're more comfortable with other platforms (say, Python 3), pls feel free to use the same to get the job done. But insert a markdown with the python code you used as part of your submission.
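
For those taking the Python 3 route, here is a minimal sketch of the link-extraction step using only the standard library's `html.parser`. The class name `LinkCollector` and the HTML fragment are made up for illustration; real pages have different and more complex markup, and the page would first have to be downloaded. Third-party packages such as requests and BeautifulSoup are a common alternative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up HTML fragment, not the markup of any real site.
sample_html = """
<ul>
  <li><a href="/movies/a.html">Movie A</a></li>
  <li><a href="/movies/b.html">Movie B</a></li>
  <li><span>No link here</span></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```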

Deliverables and Deadline:

1. Zip a folder containing

(a) your .Rmd file,
(b) your markdown as a web page,
(c) .R script files you used,
(d) the data you collected in .txt or .csv form,
(e) a notepad with your group name and member roll numbers written on it,
(f) Bonus: Include a link to RPubs if you have published your markdown to the web.

2. Deliver the zipped folder to the appropriate dropbox.

3. Deadline will be the midnight of 23-October-2016.

Any Qs etc contact me. Or use the comments section below.

Sudhir

Additional readings - mandatory and optional

Class,

We covered a wide range of topics in the DC course. Some of you asked for more sources and reading material. Pls find the same below (in no particular order).

I do realize I got delayed in putting them out and some of you had followed up with me regarding this. Some readings are mandatory and others are totally optional.

+++

The readings below are NOT optional in the sense that questions based on these readings may feature in your exam.

Readings relating technology to data collection and data use (from the Economist):

1. The first article titled 'Little Brother' (in an obvious play on George Orwell's famous 'Big Brother' theme) details the impact of digital on advertising spends of firms worldwide.

2. The second article, 'Getting to know you', is about the various ways in which data is collected about consumers online.

3. The third article in this series, 'The world wild web', extrapolates some of what we are seeing into the future and asks 'Where are we going?'.

Ideally, I'd like you to read and discuss these articles within your groups. Again, remember, questions based on ideas and facts in these articles are fair game in your final exam for DC. Happy reading.

+++

Now, these readings that follow below are optional, more for leisure reading and folks with interest in particular topics/ verticals etc.

a. More from the Atlantic on how it's now technologically feasible to arrive at one's Identity. Big Data Can Guess Who You Are Based on Your Zip Code.

b. Recall the habit patterns class we'd covered? Here's an article from HBR blogs on How Customers Get Hooked on Products.

c. There's an undercurrent somewhere in the program that spells the words "data science". This link here offers a rounded perspective on what precisely is data science. This follow-on link here describes 8 concrete steps you must take to become a data scientist. Yes, R features there. Apt read for all CBA students, IMO.

d. For sessions 1-3 which focussed more on constructs, designing questionnaires around constructs etc., here below is some interesting material which you may consider browsing at leisure. It's basically meant to aid understanding for those folks who may have felt the coverage in class was not detailed enough on certain topics:

i. This is a Wikipedia link to Quantitative psychology as a subject area. It provides a nice, concise and precise introduction to the area in general and has a good number of downstream links that you can pick up on as and when necessary.

ii. This is the Wiki entry to Scaling techniques in general in the social sciences. As you can see the comparative versus noncomparative dichotomy comes in early on here. More links to detailed topics are also available.

iii. This is the wiki entry to psychometrics as a discipline. I thought it a tad too inclined towards educational testing but still, worth a read perhaps, for those interested.

+++

More reading links (optional) that I could gather, below. Recall the Xerox-evolv caselet we did in class - The one where psychographic Likerts were combined with straightforward machine learning? We then discussed implications, pros and cons etc, and speculated on when such practices may spread worldwide and into India. Well, speaking of India...

Startups, and India Inc use psychometric tests to peek into potential recruits’ minds

This is the session 5 reading from NYT on 'Data janitor' work: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

This is the source for the session 5 table on Data munging functions in R from computer world: Great R packages for data import, wrangling & visualization

This is the NYT article on 'the data driven life' about the quantified self movement.

This is the API directory from Programmable web. Recommended that as CBA folks, you register with the site and get periodic updates (if you haven't already).

+++

You have 2 more group homeworks coming - one on webscraping using the rvest package and the other on python for scraping google search results. I'm teaching PGP currently, so posting a tad slow, but will put these HWs out there soon.

Any Qs etc, let me know.

Sudhir

Saturday, October 8, 2016

Individual Homework 2 - Fill up a survey

Class,

I trust you've been busy with all the homework assignments etc from visit 1 visiting your weekends at home.

Your second individual homework is coming up. And not to worry, it won't be very burdensome, let me assure you. Please take this 20-30 minute web-survey. I plan to use this data to demonstrate concepts and techniques in the Text Analysis course during your next visit.

The survey should be completed (and is automatically submitted online) no later than the midnight of Sunday 16-Oct-2016 (i.e., a week from now).

I estimate the survey should take no more than 20-30 minutes.

+++

On a separate note, I recommend you register (free) with and subscribe to newsletters from the following sites:

1. www.kaggle.com which hosts data science competitions, releases datasets and tutorials. Can use your FB or goog login. We'll be using some kaggle datasets in NLP.

2. r-bloggers.com - which collates and publishes lots of interesting blogs on R, R news, new packages, cool coding hacks and tricks, etc.

3. 'programmable web' - which is sort of API central. Get news on updates, releases, changes etc in different popular APIs from all around.

Apart from the readings in past posts in this blog, this recent bloomberg article is of some interest from the DC and data science perspective.

Any Qs etc, pls feel free to contact me.

Sudhir

Wednesday, October 5, 2016

On Markdowns and files for DC

Hi Class,

Vivek will upload all classwork related files - slides + code + data - in the next few days. Below are some tips for what to do re code and data.

About Code:

Code files will primarily be script files (.R for R and .py for Python) but in some cases, markdown files will also be available.

Open .R files directly via Rstudio and .py files via Spyder or Jupyter.

Markdown files will appear as .Rmd files in R (for Rmarkdown) which open in Rstudio and as web pages for python.

If unfamiliar with R (or python), pls read & execute each line of code + documented comments individually.

Issues etc, first ask Google or your peers before reaching out to Aashish.

About Data:

Data files are usually available as .txt in LMS.

In a few cases, you'll have to signup for an API key and then scrape the dataset yourself. Follow instructions diligently in such cases.

Issues etc, first ask Google or your peers before reaching out to Vivek.

Pls ensure you are very comfortable with replicating classwork examples over the next week to 10 days or so. The homeworks + deadlines will start coming in after that.

Recall that your assignment submissions will be in markdown format and the core insights therein should ideally be wrapped in a narrative / storyline.

I've put up on linkstreet a file, RMarkdown.Rmd, that walks you through a simple procedure of how to write in markdown. I published the same on RPubs (and you can publish your markdowns too). It is available here. Here's a link to how markdown works. It's a very simple 1-page get-started guide in case you're 100% new to Markdowns.

And this page here, from the same author is about a few useful writing tools for how to craft a narrative into your markdowns. Again, use your judgment and don't follow the articles to the letter.

And this is an example of a markdown doc done for a project by somebody in 2014, published on RPubs. Recall the blog page of Astronomer Julia Silge I'd shown in class? This is the link to her page. It's a nice intro to how to craft simple narratives around regular R code and workflows. Would be great if you can reproduce her workflow on your twitter feed, for instance...

OK. That's it for now. Feel free to use the comments section in case of anything.

Ciao and Cheers.

Sudhir

Tuesday, October 4, 2016

Group HW for sessions 1-2, Batch 7 @ CBA

Class,

This group homework is based on (a) Session 1 - problem formulation, (b) Session 2 - survey questionnaire design and part of (c) Session 3 - Qualitative methods.

It aims to familiarize you with the intricacies, advantages and limitations of the Questionnaire tool by actively getting you to design that tool.

This is a group based homework. Only one submission per group. If you don't know who your group is, pls ask Vivek about this.

Problem Context

You are a startup bridging doctors and patients in metropolitan India. But your startup offers a house-call solution to elderly or otherwise infirm patients in the reasonably wealthy upper-middle class.

You connect doctors who are willing to make house calls (at some premium, presumably) with patients who are willing to pay the premium for the convenience of a house-call visit by a trustworthy doctor.

Imagine you are still in the process of figuring out the contours of your revenue/business model, which crucially depends on what the demand levels are like for such services at what price, and what the supply levels are like at that price. Based on that, one can figure out who to target, how, and how large the market is.

Task

Your task is to (1) Formulate your D.P. and corresponding R.O.s. (My suggestion: Choose a sharply defined D.P. that can be well-covered by at most 1-2 R.O.s; choose a suitably narrow demographic as your target population that isn't too heterogeneous in its likely needs and preferences).

(2) Next, identify the main construct(s) of interest, the ones that will likely drive demand for your client, ideally based on your R.O.s. (Hint: Think in terms of the average target customer's motives/needs and his/her self-perception of the requirements.)

(3) Design a questionnaire that: [a] surveys target segment respondents on their propensity to use app-based on-demand commuting solution services;
[b] can be taken in under 15 minutes on a good net connection;
[c] collects info on the distribution of quantities of interest (such as awareness levels about these services, interest levels in using them, what price levels might be viable, etc.)

My suggestions before starting:

1. To understand this target segment's needs and preferences, do some preliminary, quick qualitative research: E.g., conduct a few interviews (these could be casual conversations or telephonic ones) with a few people in that target segment about the subject. Find out what they think, what they need, what they see others around them doing etc.

2. Write (or 'Program') your survey into either Qualtrics or the free version of a web-survey tool such as zoomerang or surveymonkey. See if you can obtain the "launch" survey link.

3. Bonus points for pretesting the survey with a few folks first, accounting for order effects etc in questionnaire design, etc.

Submission Format

You can submit a traditional PPT for this assignment. All future assignments will involve code and hence only markdown will be allowed.

Below is the submission format for a PPT. (That for a markdown will be similar. Just divide the markdown webpage into sections [Title, Data, model, results etc])

Start with a plain white PPT. Save it as groupName_session1.pptx

Title slide: Homework name and names+ roll numbers of group members.

First slide: Brief description of your client and their business (Also, a line or two to justify why you picked them)

Second slide: Statement of D.P and corresponding R.O.s

Third slide: Brief Description of qualitative research carried out to first narrow-down what topics to cover in the survey.

Fourth slide: Make a table that maps the 1-2 major constructs to the corresponding survey question numbers.

Fifth slide: Deliverable - websurvey link. Should be a working link. Also, attach the word or PDF version of your questionnaire onto this slide. The Q numbers in the 4th slide should match the ones here.

Sixth slide: Any learnings you as a group made - E.g., what constructs were the easiest to measure? hardest? On average, how many Qs per construct did you have to use? Etc.

Update:

In the past, I got quite a few Qs asking if a scale other than Likert can be used etc. Sure, it can. Likert is important in the context of behavioral constructs. For regular, descriptive Qs, use other scales by all means. *Not* every Q has to be a Likert.

Whether PPT or markdown, wrap your submission inside a story, as far as possible.

Deadline for this is midnight of 16-Oct (Sunday). Drop box for the PPT will be set accordingly on linkstreet.

Any queries etc, let Vivek or me know.

Sudhir

Monday, October 3, 2016

Individual Assignment 1 for Batch 7

Hi Class,

A series of homeworks are coming your way. The first is described below. I will put up details for the others shortly.

Pls watch this ~ 20 minute video carefully. It features Scott McDonald of Condé Nast holding fort on where Marketing Research is headed in the next decade.

“Social Technological and Economic forces affecting Marketing Research over the next decade”

Now, for your HW, pls answer a few simple Qs (True-False, fill in the blanks variety) about the above talk in the following survey:

Questions based on the video

HW Notes:

(i) This is an individual-only HW. Since it involves no R, consulting peers is not permitted.

(ii) I found that using earphones works great in making out what the speaker is saying much more clearly than ordinary speakers. FYI.

(iii) Deadline: The HW should be completed and submitted latest by midnight 10-October Monday.

Any Qs etc, pls feel free to email me or use the comments section below.

Sudhir Voleti