Analytics Yogi

Sunday, October 20, 2019

Reg Text-An - quanteda, DFMs, LDA and Naive Bayes

This is a nice article on using quanteda in R.

Yup, yet another R text-an package. My history with these goes back a long way, starting with tm some 6 yrs ago. Have settled on tidytext for all BOW work nowadays.

Anyway, interesting pieces of this article are [1] its dataset - UN speeches etc, [2] ref to papers that show how to do unsupervised (LDA) as well as supervised learning (classification) on text, etc.

Will write more later, this is it for now.

Sunday, October 13, 2019

Association Rules in R

Here's a nice post on the use of associative rules for market basket analysis (MBA) in R.

Yup, as suspected, it does use R's well-ish-known arules package and the datasets in that package.

I have to wonder if there are any other datasets out there and available for in-class demos.

The problem with MBA typically is that at the SKU level there are way too many items to handle, and at the category level it just isn't as interesting.

Some sorta hierarchical setup might help perhaps, eh?
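To make the hierarchy idea concrete, here is a minimal sketch in plain Python (not the arules package the post uses). The SKU names, the SKU-to-category mapping and the baskets are all made up for illustration. It computes pair support at the SKU level and again after rolling each basket up to categories.

```python
from collections import Counter
from itertools import combinations

# Hypothetical SKU -> category hierarchy (illustrative names only).
SKU_TO_CATEGORY = {
    "milk_brand_a_500ml": "dairy",
    "milk_brand_b_1l": "dairy",
    "bread_brand_a": "bakery",
    "bread_brand_b": "bakery",
    "chips_brand_a": "snacks",
}

# Hypothetical baskets recorded at the SKU level.
baskets = [
    {"milk_brand_a_500ml", "bread_brand_a"},
    {"milk_brand_b_1l", "bread_brand_b"},
    {"milk_brand_a_500ml", "chips_brand_a"},
    {"milk_brand_b_1l", "bread_brand_a", "chips_brand_a"},
]


def pair_support(baskets):
    """Return {frozenset(pair): share of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[frozenset(pair)] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def roll_up(baskets, mapping):
    """Replace every SKU in every basket with its category."""
    return [{mapping[sku] for sku in basket} for basket in baskets]


sku_support = pair_support(baskets)
category_support = pair_support(roll_up(baskets, SKU_TO_CATEGORY))

print(max(sku_support.values()))
print(category_support[frozenset({"dairy", "bakery"})])
```

In this toy data no SKU pair shows up in more than one basket, while the dairy-bakery pair appears in three of four once rolled up. A middle level (say, brand within category) could sit between the two.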

Wednesday, October 9, 2019

Restarting An-Yogi.

Or testing the waters, more like it.

This blog closed after an unfortunate run-in with Batch 8, CBA. I'd then decided to close this blog for CBA course-related work.

But nothing stops me from restarting it for keeping track of my own work. Which is what follows.

I read many posts online on tech and code. This blog seems like a good place to keep track of some of the more interesting or useful ones.

+++++++++++

First, here's a nice post on how to use LASSO (Least Absolute Shrinkage and Selection Operator) to do [political] micro-targeting and online-ad-retargeting using FB 'Likes' and preference data on the one side and the Big-5 personality framework on the other.

Let me quote from the blog itself:

Basically, Microtargeting is the prediction of psychological profiles on the basis of social media activity and using that knowledge to address different personality types with customized ads. Microtargeting is not only used in the political arena but of course also in Marketing and Customer Relationship Management (CRM).

So, what's *LASSO*? An extension of classical linear regression wherein, given many IVs on the RHS, we fish for the 'best'/'most important' ones by *shrinking* all coefficients toward zero (and setting the weakest exactly to zero) via a constrained optimization routine neatly packaged and delivered via the **glmnet** R library.
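A minimal sketch of the shrinkage idea, in plain Python rather than R and not the routine the post itself runs. It assumes the special case of orthonormal predictors, where minimising (1/2)*||y - Xb||^2 + lam*||b||_1 reduces to soft-thresholding each OLS coefficient. The variable names and numbers are hypothetical.

```python
def soft_threshold(beta_ols, lam):
    """Lasso solution for one coefficient, assuming an orthonormal design."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0  # weak coefficients are set exactly to zero


# Hypothetical OLS coefficients for four made-up 'Like' indicators.
ols_coefs = {
    "likes_sports": 2.5,
    "likes_poetry": -0.25,
    "likes_cooking": 0.75,
    "likes_news": -1.5,
}
lam = 1.0  # penalty strength; larger values zero out more coefficients

lasso_coefs = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
selected = [name for name, b in lasso_coefs.items() if b != 0.0]

print(lasso_coefs)
print(selected)
```

Every coefficient moves toward zero by lam, and those smaller than lam in absolute value drop out altogether, which is what makes the surviving IVs the 'selected' ones. With correlated predictors there is no such closed form, and an iterative solver is needed.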

Another advantage of this post - it provides dummy data to test the code on. As well as decent interpretations for the shrunk yet non-zero beta coefficients in the output.

Even better, since it's all pure R, *shiny-fying* it for later use in a PGP classroom should be easier.

Note: Machine Learning Basics for Marketers (MLBM) is on offer for the first time in Term 7 this acad year.

+++++++++++

Well, will close here for now. More posts will hopefully follow and this attempt won't fizzle out yet again.

Sudhir Voleti, Oct 2019.

Saturday, March 11, 2017

Some Clarifications

Deleted. Not worth it.

Tuesday, March 7, 2017

Group Homework

Class,

Your Group Homework is coming up. I describe your task below. Pay careful attention to detail, pls.

It's actually two tasks which are only vaguely related. Past feedback says it's better to combine all homeworks into one to enable groups to better plan and schedule work.

Task 1: Web scraping

Step 1: Go here and examine the page.

Step 2: Scrape the page and tabulate the output into a data frame with columns 'name, url, count, style'.

Step 3: Filter the data frame. Retain only those beers that got over 500 reviews. Let's call this Table 1.

Step 4: Now for each of the remaining beers, go to the beer's own webpage on the ratebeer site, and scrape the following information:

'Brewed by, Weighted Avg, Seasonal, Est.Calories, ABV, commercial description' from the top of the page.

Add these fields to Table 1 in that beer's row.

Step 5: Now build a separate Table for each beer in Table 1 from that beer's ratebeer webpage. Scrape the first 3 pages of reviews of that beer and in each review, scrape the following info:

'rating, aroma, appearance, taste, palate, overall, review (text), location (of the reviewer), date of the review.'

Store the output in a dataframe, let's call it Table 2.

Step 6: Create a list (let's call it List 1) with as many elements as there are rows in Table 1. For the i_th beer in Table 1, store that beer's Table 2 as the i_th element of List 1.
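To illustrate how the data structures in Steps 2, 3 and 6 fit together, here is a minimal sketch in plain Python (the homework itself is meant to be done in R). The rows are made-up stand-ins for scraped output, and scrape_reviews() is a hypothetical placeholder, not a real scraper.

```python
# Hypothetical rows standing in for the scraped output of Step 2.
scraped = [
    {"name": "Beer A", "url": "/beer/a/", "count": 1200, "style": "Stout"},
    {"name": "Beer B", "url": "/beer/b/", "count": 500, "style": "Pale Ale"},
    {"name": "Beer C", "url": "/beer/c/", "count": 731, "style": "Porter"},
]

# Step 3: retain only beers with over 500 reviews (exactly 500 is dropped).
table1 = [row for row in scraped if row["count"] > 500]


def scrape_reviews(beer_url):
    """Hypothetical placeholder for Step 5.

    A real version would fetch the first 3 review pages for beer_url and
    return one dict per review (rating, aroma, appearance, taste, palate,
    overall, review, location, date). Here it returns an empty Table 2.
    """
    return []


# Step 6: List 1 has one element per row of Table 1, in the same order,
# so list1[i] is the Table 2 belonging to table1[i].
list1 = [scrape_reviews(row["url"]) for row in table1]

print([row["name"] for row in table1])
print(len(list1) == len(table1))
```

Keeping List 1 in the same order as Table 1 is the point of Step 6: the row index is the only link between a beer and its reviews.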

Task 2: DC from Twitter API

Step 1: Read up on how to use the twitter API here. If required, make a twitter ID (if you don't already have one).

R has a dedicated package twitteR (note the capital R at the end). For additional functions, refer to twitteR.pdf (the twitteR package manual).

Step 2: Recall three evaluation dimensions for beer at ratebeer.com, viz. aroma, taste and palate. More than the dictionary meanings of these words, it's how they're used in context that's interesting.

So, pull 50 tweets each that contain these terms.

Step 3: Read through these tweets and note what product categories they typically refer to. Insights here could, for instance, be useful in designing promotional campaigns for the beers. We'll do text analysis etc next visit.

Deliverables: There are 4 deliverables per group - [1] R markdown of your (well-documented) code and a storified narrative of your thinking behind the code.

[2] Table 1 as a .csv file. List 1 as an .RData file.

[3] List of tweets you scraped in a .csv file.

[4] List of brands and product categories that those terms were used in context of.

Zip these deliverables and submit into the drop box before deadline.

Deadline: Midnight before your Visit 2 starts.

Any Qs etc, ask Sudha or Aashish in the Tutorial.

Sudhir

Saturday, March 4, 2017

Individual Homework No. 2

Class,

There will be three homeworks in DC - two individual and one group.

This is your second individual homework and involves relatively low-tech DC on individuals' locations. Follow the instructions below step by step.

Assignment Instructions:

1. Form a WhatsApp group with your group mates.

2. Whenever you travel or visit different places as part of your everyday work, share your location to the WhatsApp group.

For example - if you are visiting an ATM, your office, a grocery store, the local mall etc., then send the WhatsApp group a message saying: "ATM, [share of location here]."

Ideally, you should share a handful of locations every day. Do this DC exercise for a week. It's possible you may repeat-share certain locations.

P.S. I'll assume you have a smartphone with google maps enabled on it to share locations with.

3. Once this exercise is completed, export the WhatsApp chat history of the DC group to a text file. To do this, see below:

Go to WhatsApp > Settings > Chat history > Email Chat > Select the chat you want to export.

4. Your data file should look like this:

28/02/17, 7:17 pm - aashish pandey: location: https://maps.google.com/?q=17.463869,78.367403
28/02/17, 7:17 pm - aashish pandey: ATM

P.S. As you can see, if you have any queries or issues running this thing, reach out to Pandey ji first.

5. Now compile this data in a tabular format. Your data should have these columns -

a - Sender Name

b - Time

c - Latitude

d - Longitude

e - Type of place

6. Now comes the deliverable part. Remember this is an individual assignment even though you need your group's help to form and export the WhatsApp chat history.

Extract your locations from the chat history table and plot them on google maps. You can use the Spatial DC code we used on this list of latitude and longitude co-ordinates, or use the leaflet package in R to do the same.

Remember: you should extract and map only your own locations, not those of your teammates. They will do the same for theirs.

7. Deliverable: A Markdown file that shows the code you used to plot your locations over a week.

Analyze your own movements over a week *AND* record your observations about your travels as a story that connects these locations together.

8. Deadline: 19-March-2017 (Sunday) midnight. Sudha will make a dropbox where you can share the markdown documents as html files.

9. Submission Format: Individual submissions only. Name your html file as 'your-ISB-ID-number DC HW 2.htm'. Inside the html file, introduce yourself with name, city, ISB ID number etc in the first paragraph.
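For steps 4 and 5 above, here is a minimal sketch of turning the exported chat text into the five-column table. It is in plain Python, though the same regular expressions carry over to R. It assumes every line follows the two-line sample in step 4, and that the first non-location message a sender posts after sharing a location is the type of place. Both assumptions may not hold for every export.

```python
import re

# One chat line: "dd/mm/yy, h:mm am|pm - sender: text"
MESSAGE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}, \d{1,2}:\d{2} [ap]m) - ([^:]+): (.*)$"
)
# A shared location as it appears in the sample export.
LOCATION = re.compile(
    r"^location: https://maps\.google\.com/\?q=(-?\d+\.\d+),(-?\d+\.\d+)$"
)


def parse_chat(lines):
    """Return one dict per shared location with the five required columns."""
    rows = []
    pending = {}  # sender -> row still waiting for its place label
    for line in lines:
        match = MESSAGE.match(line.strip())
        if match is None:
            continue  # skip system messages and continuation lines
        time, sender, text = match.groups()
        loc = LOCATION.match(text)
        if loc:
            row = {
                "sender": sender,
                "time": time,
                "latitude": float(loc.group(1)),
                "longitude": float(loc.group(2)),
                "place_type": None,
            }
            rows.append(row)
            pending[sender] = row
        elif sender in pending:
            # Assumption: this message labels the sender's last location.
            pending.pop(sender)["place_type"] = text
    return rows


sample = [
    "28/02/17, 7:17 pm - aashish pandey: location: "
    "https://maps.google.com/?q=17.463869,78.367403",
    "28/02/17, 7:17 pm - aashish pandey: ATM",
]
table = parse_chat(sample)
print(table)
```

Filtering this table on the sender column then gives only your own locations for the plotting step.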

That's it for this simple exercise, folks. Any Queries etc, feel free to contact me.

Sudhir

Thursday, March 2, 2017

Some readings for DC, Batch 8.

Hi Class,

This post is to put up some good practices and (optional) readings for DC. More such posts with readings may follow in the days to come.

Good practices:

1. If you haven't already, you should register and sign up for daily email newsletters from r-bloggers (for creative uses of R code and info on new packages). Even though 80% of the posts in the daily email may not be of direct interest to me, the other 20% make it very worthwhile indeed.

For example, here's a post detailing job market trends globally in Data science jobs (with data and visualizations done in R).

2. Similarly, sign up for programmable web (for API related directory and news) as well.

3. www.kaggle.com hosts data science competitions and releases datasets and tutorials. You can use your FB or goog login. We'll be using some kaggle datasets in NLP. It's good practice to keep checking what's new and hot on kaggle from time to time.

4. Replicate *all* the classwork code line by line. Especially so if you're new to R. Look out for new functions that may come in (do ?function_name in console to see its description), read inline comment documentation carefully, etc.

Should you have trouble running any particular piece of code, search the web, ask peers etc. The coming DC tutorial on 4-March, which will be conducted jointly by Sudha and Aashish Pandey, is another good opportunity to get clarifications.

Some (optional) readings:

1. This is the NYT article on 'Data Janitor' work we saw in Session 5.

2. This is the NYT article on 'the data driven life' about the quantified self movement.

3. The next couple of readings relate technology to data collection and data use (from the Economist): 'Getting to know you', is about the various ways in which data is collected about consumers online.

4. 'The world wild web', extrapolates some of what we are seeing into the future and asks 'Where are we going?'.

5. From India, here's an article from ET on how data brokers are syndicating and selling user data sans oversight. 'How data brokers are selling all your personal info for less than a rupee to whoever wants it'.

6. Recall the Xerox-evolv caselet we did in class - The one where psychographic Likerts were combined with straightforward machine learning? We then discussed implications, pros and cons etc, and speculated on when such practices may spread worldwide and into India. Well, speaking of India...

'Startups, and India Inc use psychometric tests to peek into potential recruits’ minds'.

Update: Some more readings of interest

7. Recall the 'Exponential learning curve' example in Session 5? Well, here's an article on that issue that I wrote a few months ago for the NASSCOM Sales and Marketing community.

8. Recall the GE example in 'Information Imperative' in Session 5. Here's the interview by McKinsey of GE's CEO Jeff Immelt on that topic from Oct 2015. (Might require you to register for free with McKinsey quarterly).

9. Recall what the GE example led to? A discussion on predictive analytics and maintenance on the human machine... Here's a timely piece from the Economist from 2 days ago on the data revolution in personal healthcare: 'A digital revolution in health care is speeding up'.

Well, that's it for now. Watch this space for more to come.

Sudhir