Sunday, February 23, 2014

Group Homework for Sessions 4 and 5 (Text and Social network analysis)

Hi all,

This is the HW for sessions 4 and 5, and the final HW in this course. It involves data collection, analysis and inference.

It will require you to go through and understand the R code I used in class. Pls *ensure* you can replicate the classwork examples and exercises before attempting the homework.

This is a group homework, so ensure you divide it amongst your group and co-ordinate. That way, too much burden won't fall on any individual.

If your group formation is not yet done, let Krishna Pusuluri know and he'll assign you to a group.

Any doubts, issues etc, let me know through the blog or via email.

### The following Qs are for text analysis on your survey data, using the R code from class ###

Q1a. One Q in your survey asks you to "List some brands (five or more) you are personally loyal to." Text analyze this component for the entire class by building a TDMN and a wordcloud. Comment on which brands seem most popular and which categories they come from. Comment on why this may be the case.
Q1b. Now build a simple semantic network for the terms found above: basically, which brands co-occur in documents. (See my R code for a function on how to build simple semantic networks from term-document matrices.) Speculate on which brands seem to be preferred together by people.
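As a pointer, here's a minimal sketch of that pipeline, assuming the tm, wordcloud and igraph packages are installed. The documents below are made-up stand-ins for your survey answers; use your actual class data.

```r
library(tm)        # corpus handling and term-document matrices
library(wordcloud) # wordcloud() plotting
library(igraph)    # graph plotting for the semantic network

# Made-up stand-ins for the survey answers (one document per respondent)
docs <- c("google apple nike google", "apple samsung levis", "google nike adidas")
corp <- Corpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Term-document matrix and overall term frequencies
tdm  <- TermDocumentMatrix(corp)
m    <- as.matrix(tdm)
freq <- sort(rowSums(m), decreasing = TRUE)

# Wordcloud of the most frequent brands
wordcloud(names(freq), freq, min.freq = 1)

# Simple semantic network: two terms are linked if they co-occur in a document
adj <- m %*% t(m)   # term-by-term co-occurrence counts
diag(adj) <- 0      # drop self-loops
g <- graph.adjacency(adj, weighted = TRUE, mode = "undirected")
plot(g)
```

The co-occurrence trick is that `m %*% t(m)` counts, for every pair of terms, how often they appear in the same documents, which is exactly the edge weight you want in the semantic network.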

Q2. Text analyze the answers to the Q: "List two places OUTSIDE India that you would like to visit. Explain why in a few lines for each place." Build wordclouds under both TF and TFIDF. Comment on what can be inferred from the wordclouds.
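The only thing that changes between the two clouds is the weighting option in the TDM call; a minimal sketch with tm, on toy documents:

```r
library(tm)
library(wordcloud)

corp <- Corpus(VectorSource(c("paris art food", "tokyo tech food", "paris art museums")))

# TF: raw term counts (the default weighting)
tdm_tf <- TermDocumentMatrix(corp)

# TF-IDF: down-weights terms that show up in most documents
tdm_tfidf <- TermDocumentMatrix(corp,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

# The same wordcloud code works for either matrix
for (tdm in list(tdm_tf, tdm_tfidf)) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq <- freq[freq > 0]   # TF-IDF can zero out terms present in every document
  wordcloud(names(freq), freq, min.freq = 0)
}
```

Comparing the two clouds side by side is the point: terms every respondent uses dominate the TF cloud but shrink or vanish under TF-IDF.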

Q3a. Text analyze the responses to the Q: "What are your career goals in the short, medium and long terms? Explain in a few lines." Build a wordcloud under both TF and TFIDF. Comment on what can be inferred from the wordcloud.
Q3b. Build a semantic network connecting the terms for this Q. Which terms occur together the most in documents? What can be inferred?

### The following Qs are for web extraction of data from Amazon ###

Q4. Collect 100-odd reviews from Amazon for the Xbox 360. Build and analyze a wordcloud. What themes seem to emerge?
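If you're doing the extraction with rvest (one option; the class code may use a different package), here's a hedged sketch. The URL pattern and the ".review-text" CSS selector below are placeholders, since Amazon's markup changes often; inspect the live page to find the right selector.

```r
library(rvest)  # read_html(), html_nodes(), html_text()

# Pull review text from several pages of a review listing.
# 'base_url' and 'selector' are placeholders to adapt to the live page.
get_reviews <- function(base_url, pages = 10, selector = ".review-text") {
  out <- c()
  for (pg in seq_len(pages)) {
    page <- read_html(paste0(base_url, pg))
    out  <- c(out, html_text(html_nodes(page, selector)))
  }
  out
}

# Example (not run): roughly 10 reviews/page x 10 pages gives the 100-odd reviews
# reviews <- get_reviews("https://www.amazon.com/product-reviews/PRODUCT_ID/?pageNumber=")
# writeLines(reviews, "xbox_reviews.txt")  # save them all to a file for reuse
```

Saving the scraped reviews with writeLines() means you only hit the site once and can rerun the wordcloud analysis on the local file.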

Q5. Analyze the positive wordcloud. What are the Xbox's seeming strengths? What can Microsoft position around?

Q6. Analyze the negative wordcloud. What are the Xbox's seeming weaknesses? What should they prioritize and fix?
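For Q5 and Q6, one way to split the cloud is to score terms against positive and negative word lists. The tiny lexicons below are purely illustrative; for real use, swap in a full opinion lexicon (e.g. Hu and Liu's) as in the class sentiment code.

```r
library(tm)
library(wordcloud)

# Tiny illustrative lexicons; swap in a full opinion lexicon for real use
pos_words <- c("great", "love", "awesome", "smooth")
neg_words <- c("noisy", "expensive", "laggy", "broken")

# Toy reviews standing in for the scraped Amazon data
reviews <- c("great graphics love the controller",
             "fan is noisy and accessories are expensive",
             "awesome games but the console gets noisy")
tdm  <- TermDocumentMatrix(Corpus(VectorSource(reviews)))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Positive cloud: keep only terms in the positive list; likewise for negative
pos_freq <- freq[names(freq) %in% pos_words]
neg_freq <- freq[names(freq) %in% neg_words]
wordcloud(names(pos_freq), pos_freq, min.freq = 1)
wordcloud(names(neg_freq), neg_freq, min.freq = 1)
```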

The deadline is before the exam. Submissions must be in the form of PPTs only. Write your group name, individual members' names and ISB IDs on the title slide, and use your group name as the file name. A dropbox will be created for this.

Any Qs etc, contact me.

Sudhir

15 comments:

  1. Dear sir,
    I tried the code for Q4 (data extraction). It's working fine but shows the error: could not find function "%do%".
    Also, how do we save the 99 reviews to a file? In RStudio we are able to see only the first few reviews...

    ReplyDelete
    Replies
    1. Hi Sray,

      I've asked my RA, Mr Ankit Anand, to look into this and he'll respond to you soon. I'm traveling currently.

      Sudhir

      Delete
  2. Hi Sir,
    I need some help. I was trying to execute the R code that is part of Session 4 in the LMS, in the file "text and sentiment analysis code ver2". While executing the following line:
    tdm1 = TermDocumentMatrix(x1, control = list(weighting = function(x) weightTfidf(x, normalize = FALSE, stopwords = TRUE)));

    I have encountered the error :
    "Error in weighting(x) : could not find function "weightTfidf""

    I did not find much help in Google on this error. Any suggestions how to resolve this issue ? Thanks.

    ReplyDelete
    Replies
    1. Sorry, my mistake.

      It should be 'weightTfIdf' rather than 'weightTfidf'; the 'I' is capitalized in the actual function name. R is case sensitive. Pls try now and lemme know if it runs OK.

      Sudhir

      Delete
  3. Thank you, professor. After changing 'weightTfidf' to 'weightTfIdf', it is working. I have also changed the statement "dtm = t(tdm11)" to "dtm = tfidf(tdm11)" to troubleshoot the other issue. Just thought of sharing.

    I have encountered some other issue downstream. Debugging the code now. Will request if I need any additional help. Thanks.

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. After changing 'weightTfidf' to 'weightTfIdf', it now shows: Error in weightTfIdf(x, normalize = FALSE, stopwords = TRUE) :
    unused argument (stopwords = TRUE)
    So should we remove stopwords = TRUE as well?

    ReplyDelete
    Replies
      Sure, try removing it and see. In general, keep trying new things; you don't have to stick with my code, which is indicative only. You're encouraged to branch out and explore your own applications on R once you're a little comfortable with the platform.
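      For reference, 'stopwords' goes in the control list itself rather than inside weightTfIdf(), so a call like this should run (the toy corpus x1 below is just for illustration):

```r
library(tm)
x1 <- Corpus(VectorSource(c("sample brand text", "another sample text")))

# 'stopwords' is a separate control entry, not an argument to weightTfIdf()
tdm1 <- TermDocumentMatrix(x1,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                 stopwords = TRUE))
```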

      By the way, qdap is a natural language processing package now available on R.

      Sudhir

      Delete
    2. Yes, by changing/removing parameters, I'm able to see the flow of R functions.

      Thank you.

      Delete
  6. I'm trying to do tweet extraction in R using twitteR, with this code:

    twitCred<-OAuthFactory$new(consumerKey=consumerKey,
    consumerSecret=consumerSecret,
    requestURL=reqURL,
    accessURL=accessURL,
    authURL=authURL)

    (I've created the consumerKey and secret), but after running the above code I'm still getting the error "OAuth authentication is required with Twitter's API v1.1".
    I tried to generate the authentication PIN but was not able to. I got a page titled "OAuth Signing Results" with some string and an authorization header. How can I get the authorization PIN?

    Request you to please help me on this.

    ReplyDelete
  7. URLs:
    reqURL<-"https://api.twitter.com/oauth/request_token"
    accessURL<-"https://api.twitter.com/oauth/access_token"
    authURL<-"https://api.twitter.com/oauth/authorize"

    ReplyDelete
  8. Hi Ashutosh,

    I used the code I shared last year. It seems Twitter's API connection protocols have changed since. I haven't had occasion to use it again and don't really have the bandwidth to investigate that aspect now. Should you find code that gets through to Twitter, pls share it here for everyone's benefit. Thanks.

    Sudhir

    ReplyDelete
  9. Hello Sir,

    Need your advice. When I try the TDMN function it shows the term frequencies correctly, but I do not see those terms in the wordcloud even though they have higher frequencies. For e.g., in the case of the Brands example, I could see "Google" having a higher term frequency than the others, but it was missing from the word cloud.
    Could the memory cache be the reason? Please advise.

    Thanks,
    Rashmi

    ReplyDelete
    Replies
      Sorry abt the delayed reply. Pls send me the code you're using and send a copy to Ankit.

      Sudhir

      Delete
  10. Hello Sir, came across facial analytics. Reminded me of the New York Museum example from class. http://www.informationweek.com/big-data/big-data-analytics/facial-analytics-what-are-you-smiling-at/d/d-id/1127726

    ReplyDelete