Friday, March 4, 2016

Group Homework Session 4 - Topic analysis

Hi Class,

Your second-last group homework is coming up. This one concerns topic-mining, analysis and interpretation.

I hope you have completed the individual homework for session 4. If not, I urge you to do so before attempting this one.

After this, there is one last group homework remaining - that for session 5 - based on a Python script that scrapes Google search results.

To ease things a bit, I've already scraped the data and provided it for download in the linkstreet assignment folder.

Further, we've also made available the Term Document Matrices (TDMs) that you'll need for the analysis, since constructing them might have taken some time.

To make things even easier, Pandeyji has put up a markdown workflow in PDF form on linkstreet, with code and sample output, that walks you through what needs to be done to get the results. Later, I'll ask you to interpret the results.

Please go through the workflow, line by line. Try the code on the datasets given.

Task:

Find in linkstreet the text corpus pertaining to the 2013 10-K SEC filings by Tech Sector firms from the Fortune 1000. Read it into R.
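
As a minimal sketch of the read-in step: the exact file name and the kind of object stored in it are not spelled out here, so the snippet below is only an illustration. It writes a small made-up stand-in object (toy_corpus, a hypothetical name) to a temporary .Rds file and reads it back with readRDS(). For the actual assignment, point readRDS() at wherever you saved the linkstreet download.

```r
# Stand-in for the downloaded corpus: one text string per firm.
# The firm names and texts are invented for illustration only.
toy_corpus <- c(
  firm_a = "We design and sell semiconductors to device makers.",
  firm_b = "We provide cloud software and related services."
)

# Write the stand-in to a temporary .Rds file, mimicking the downloaded file.
rds_path <- file.path(tempdir(), "toy_corpus.Rds")
saveRDS(toy_corpus, rds_path)

# Read it into R. In practice, replace rds_path with the path to your download.
corpus <- readRDS(rds_path)

# Check what you actually got before analysing it.
class(corpus)
length(corpus)
str(corpus)
```

Running class() and str() on the real object first is worthwhile, because the downloaded file may hold something other than a plain character vector.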

There are two sections of interest in the 10-K form:

One is Item 1 - Business description (wherein firms describe their business and key products),

and the second is Item 1A - Risk Factors (wherein firms reveal exposure to anticipated risks in the coming year).

Section A people (or groups with majority section A people) will analyze Item 1 - Business description.

Section B people (or groups with majority section B people) will analyze Item 1A - Risk Factors.

In each case, run the topic mining algorithm on the corpus. The output will be similar to what you saw in the Samsung data survey - word-clouds and co-occurrence graphs.

Interpret each topic. Label it in fewer than 6 words and describe your interpretation in a few lines.

Bonus question: Any potential applications of such analysis you can think of? For instance, ask yourself which firms might find the results of your analysis interesting, and why.

Remember, I'm not looking for just some results put up in a file. I'm looking for interconnections, deeper meaning, story ...

Challenge:

The assignment may not be as straightforward as it first seems. There are multiple possible combinations of analysis choices.

For instance: (a) Should you use TF or TFIDF? I suggest you try both and go with whatever yields cleaner results.
(b) Should you use K = 2 topics? 3? 4? I suggest you try between 2 and 5 topics and see which yields the 'best', i.e. the clearest and most interpretable, story.
(c) Debate, discuss, tweak the code, use trial and error etc. Massive learning opportunity if you are willing to go the distance.
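
To make choice (a) concrete, here is a minimal base-R sketch of how TF and TFIDF weighting can rank terms differently. It is not the workflow's code: the toy matrix, its terms and counts, and the variable names are all made up, and the IDF used is the plain textbook form log(N / document frequency), which packaged routines may vary.

```r
# Toy term-document matrix of raw counts (TF): rows are terms, columns are documents.
tdm_tf <- matrix(
  c(4, 3, 5,   # "company": appears in every document
    2, 0, 0,   # "cloud":   appears only in doc1
    0, 3, 0,   # "chip":    appears only in doc2
    0, 1, 2),  # "patent":  appears in doc2 and doc3
  nrow = 4, byrow = TRUE,
  dimnames = list(c("company", "cloud", "chip", "patent"),
                  c("doc1", "doc2", "doc3"))
)

n_docs   <- ncol(tdm_tf)           # number of documents
doc_freq <- rowSums(tdm_tf > 0)    # number of documents containing each term
idf      <- log(n_docs / doc_freq) # rarer terms get a larger weight

# Multiply each term's row by its IDF. The idf vector has one entry per row,
# so R's column-wise recycling lines it up with the terms.
tdm_tfidf <- tdm_tf * idf

# Compare which terms dominate under each weighting.
sort(rowSums(tdm_tf), decreasing = TRUE)
sort(rowSums(tdm_tfidf), decreasing = TRUE)
```

Under raw TF, a term used by every firm dominates. Under TFIDF, its weight drops to zero and the terms that distinguish documents rise. If your TDM is a tm object, tm's weightTfIdf() is one packaged route to a comparable weighting, though its exact formula differs from this sketch.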

Submission Details:

Submit your workflow, with code and an explanation of the story behind your final results, to linkstreet as an HTML file. Bonus points if you are able to publish it to RPubs and send us the link.

The deadline is 20-March (Sunday), midnight. Any queries etc., let me know.

Sudhir

3 comments:

  1. A TDM file has already been shared. Is it with TF or TFIDF?

    Is there any code to generate a TDM file with TF or TFIDF scores from the given "RF.Technology.Rds"?

    1. Hi Suman.

      It's TF. If you use inspect() on a subset of the TDM, you can see it's mostly term counts only. There are a couple of ways in which you can use R's built-in routines to get IDF. Google for the same and see.

      Hope that clarifies.

      Sudhir
