Your second-last group homework is coming up. This one concerns topic-mining, analysis and interpretation.
I hope you have completed the individual homework for session 4. If not, I urge you to do so before attempting this one.
After this, there is one last group homework remaining, the one for session 5, based on a Python script that scrapes Google search results.
To ease things a bit, I've already scraped the data and provided it for download in the linkstreet assignment folder.
Further, we've also made available the Term Document Matrices (TDMs) that you'll need for the analysis, since constructing them might have taken some time.
To make things even easier, Pandeyji has put up a markdown workflow in PDF form on linkstreet, with code + sample output, that walks you through what needs to be done to get the results. Later, I'll ask you to interpret the results.
Please go through the workflow, line by line. Try the code on the datasets given.
Task:
Challenge:
The assignment may not be as straightforward as it first seems. There are multiple combinations of analysis possibilities.
For instance: (a) Should you use TF or TFIDF? I suggest you try both and go with whatever yields cleaner results.
(b) Should you use K = 2 topics? 3? 4? I suggest you try between 2 and 5 topics and see which yields the 'best', i.e. the clearest and most interpretable, story.
(c) Debate, discuss, tweak the code, use trial and error, etc. Massive learning opportunity if you are willing to go the distance.
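To make the TF vs TFIDF choice in (a) concrete, here is a minimal sketch of the arithmetic. It is in Python purely as an illustration and is not part of the R workflow on linkstreet. The toy matrix, the function name `tfidf` and the specific formula (raw count times log2 of number of documents over document frequency) are all assumptions; other TFIDF variants exist and the routine you use in R may differ in its details.

```python
import math

# Toy term-document matrix (hypothetical data, not the assignment's TDM):
# each row is a term, each column is a document, each cell is a raw count (TF).
tdm = {
    "technology": [3, 2, 4],  # appears in every document
    "cloud": [2, 0, 0],       # appears in one document only
    "mobile": [0, 1, 3],      # appears in two documents
}


def tfidf(tdm):
    """Re-weight raw counts by an assumed IDF of log2(n_docs / doc_freq)."""
    n_docs = len(next(iter(tdm.values())))
    weighted = {}
    for term, counts in tdm.items():
        # Document frequency: number of documents containing the term.
        doc_freq = sum(1 for c in counts if c > 0)
        idf = math.log2(n_docs / doc_freq) if doc_freq else 0.0
        weighted[term] = [c * idf for c in counts]
    return weighted


weights = tfidf(tdm)
for term, row in weights.items():
    print(term, [round(w, 3) for w in row])
```

Note how a term that occurs in every document gets zero weight under this formula, while a term concentrated in one document is boosted. That is the main reason TF and TFIDF can give you different-looking topics.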
Submission Details:
Submit your workflow, with code and an explanation of the story behind your final results, to linkstreet as an HTML file. Bonus points if you are able to publish it to RPubs and send us the link.
The deadline is 20-March (Sunday), midnight. Any queries etc., let me know.
Sudhir
There is already a TDM file shared; is it with TF or TFIDF?
Is there any code to generate a TDM file with TF or TFIDF scores from the given "RF.Technology.Rds"?
Hi Suman.
It's TF. If you use inspect() on a subset of the TDM, you can see it's mostly term counts only. There are a couple of ways in which you can use R's built-in routines to get IDF. Google for the same and see.
Hope that clarifies.
Sudhir
Oh. thanks a lot.