Please find below all your individual assignments for TABA. Vivek will put up the requisite dropboxes. The assignments are due on 27th November (Sunday) midnight.
=====
Task 1 – Text-Analyzing a simple set of documents.
Imagine you're a Data Scientist / consultant for a movie studio. Your brief is to recommend the top 2-3 movie aspects or attributes the studio should focus on in making a sequel.
The aim is to get you to explore, by trial and error, different configurations (e.g., which stop-words to use for maximum meaning? TF or TF-IDF weighting? etc.) in the text analysis of a simple corpus. You are free to use topic modeling if you wish, but it is not necessary that you do so.
Step 1 – Go to IMDB and extract 100 reviews (50 positive and 50 negative) for your favourite movie. You can refer to the code provided for reviews extraction from IMDB (IMDB reviews extraction.R) on LMS.
Step 2 – Pre-process the data and create a Document Term Matrix. Check word-clouds and COGs under both TF and TF-IDF weighting schemes to see which configs appear most meaningful / informative. Iterate by updating the stop-words list etc.
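To make the TF vs TF-IDF comparison in Step 2 concrete, here is a minimal base-R sketch. It is not the class code: the three review snippets, the stop-word list and the helper `tokenise()` are all made up for illustration, and the IDF used is the plain log(N / document frequency) variant, which is only one of several in common use.

```r
# Toy corpus: hypothetical review snippets, NOT real IMDB data
reviews <- c(
  "the plot was great and the cast was great",
  "the plot was slow and the movie was long",
  "great cast but the movie was too long"
)

# Hypothetical stop-word list; this is the vector you would keep updating
stop_words <- c("the", "was", "and", "but", "too")

# Hypothetical helper: lower-case, split on non-letters, drop stop-words
tokenise <- function(doc, stops) {
  tokens <- unlist(strsplit(tolower(doc), "[^a-z]+"))
  tokens[nchar(tokens) > 0 & !(tokens %in% stops)]
}
token_list <- lapply(reviews, tokenise, stops = stop_words)

# TF document-term matrix: rows = documents, columns = terms
vocab <- sort(unique(unlist(token_list)))
dtm_tf <- t(vapply(
  token_list,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm_tf) <- vocab

# TF-IDF: scale each term column by log(N / number of docs containing the term)
doc_freq  <- colSums(dtm_tf > 0)
idf       <- log(nrow(dtm_tf) / doc_freq)
dtm_tfidf <- sweep(dtm_tf, 2, idf, `*`)

# Rank terms under each scheme to see how the "top words" shift
top_tf    <- sort(colSums(dtm_tf), decreasing = TRUE)
top_tfidf <- sort(colSums(dtm_tfidf), decreasing = TRUE)
print(top_tf)
print(top_tfidf)
```

Rerunning this after editing `stop_words` is the iteration loop the step describes: a term that appears in only a few documents gets a higher weight under TF-IDF than its raw count alone would suggest.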
Step 3 – Compare each review's polarity score with its star rating. You can choose to use a simple cor() function to check correlation between the two data columns.
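A minimal sketch of the Step 3 check, assuming you already have one polarity score and one star rating per review. The data frame `reviews_df` and its numbers are made up for illustration; your own columns will come from whatever sentiment scoring you used.

```r
# Hypothetical data: one row per review (values invented for illustration)
reviews_df <- data.frame(
  polarity = c(0.8, 0.5, -0.2, -0.6, 0.1, -0.9),
  stars    = c(9, 8, 4, 2, 6, 1)
)

# Pearson correlation (the default for cor())
r_pearson <- cor(reviews_df$polarity, reviews_df$stars)

# Star ratings are ordinal, so a rank-based (Spearman) correlation
# is a reasonable cross-check
r_spearman <- cor(reviews_df$polarity, reviews_df$stars, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

If your real data has missing ratings, `cor()` accepts `use = "complete.obs"` to drop those rows before computing the correlation.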
Step 4 – Now, make a recommendation. What movie attributes or aspects (e.g., plot? star cast? length? etc.) worked well, which the studio should retain? Which ones didn't work well and which the studio should change?
Step 5 – Deliverable: Create a markdown document as your deliverable. Keep only the final configuration that you arrived at. In the text you can describe the trials and errors you did. Ensure the graphs/tables on which you are basing your recommendations are part of the markdown. Also ensure that the stop-words you used are visible (either as image or as a vector).
Overall, I'd say no more than 2 hours of work, provided you diligently replicated the class work examples prior to this.
========
Task 2 – Basic NLP and Named Entity Extraction from one document.
Step 1 – Select one well-known firm from the list of the Fortune 500 firms.
Step 2 – For the selected firm, scrape its Wikipedia page.
Step 3 – Using openNLP, find all the locations and persons mentioned in the Wikipedia page. It's good practice to set timers and report runtimes for heavy functions.
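One way to set timers and report runtimes in base R is sketched below. The function `slow_step()` is a hypothetical stand-in for whatever heavy call you are timing (e.g., your annotation step); it is not part of openNLP.

```r
# Hypothetical stand-in for a heavy function such as an annotation call
slow_step <- function(n) {
  Sys.sleep(0.1)               # simulate work
  sum(sqrt(seq_len(n)))
}

# Option 1: wrap the call in system.time() and report elapsed seconds
timing <- system.time(result <- slow_step(1e5))
elapsed_sec <- timing[["elapsed"]]
cat(sprintf("slow_step took %.2f seconds\n", elapsed_sec))

# Option 2: manual timer, handy when the timed code spans several lines
t0 <- Sys.time()
result2 <- slow_step(1e5)
runtime <- difftime(Sys.time(), t0, units = "secs")
cat("Manual timer:", round(as.numeric(runtime), 2), "seconds\n")
```

Printing the runtime inside the markdown chunk means it shows up in the knitted HTML alongside the output it refers to.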
Step 4 – Plot all the extracted locations from the Wikipedia page on a map. You may want to see 'NLP location extract and plot.R' file on LMS for this.
Step 5 – Briefly describe your observations based on the map and persons extracted from Wikipedia page.
Step 6 – Deliverable: R markdown submitted as an HTML page. Ensure that the lists of locations and persons mentioned are clearly visible, and that the map of places is visible too. Use the text part of the markdown to record your observations.
Overall, I'd say no more than 1.5 hours of work.
Please feel free to discuss these assignments with peers, help out and take help where necessary, but at the end of the day your submission must be individual and driven primarily by your own effort.
Any Qs etc, contact me, Vivek or Aashish.
Sudhir