Your final group homework is coming up soon. It involves scraping Google search results and post-processing them a little bit. We will learn about text-processing techniques in some detail during visit 2.
The idea is to scrape the top few pages of Google search results for some of the world's most valuable brands and/or firms (both B2B and B2C).
The next step is to classify each URL as originating either from the firm or brand itself or from the rest of the world.
After that, a lot of interesting questions can be asked and answered downstream.
Materials:
I'll be putting up the following materials on linkstreet by this weekend.
- Googlesearch.py code for running on Python 3 - either Jupyter or Spyder
- CBA Group Assignment Allocation v2.xls - an excel sheet that allocates 8 brands/firms to each group
- CBA Group Assignment v2.docx - a word doc explaining the process of demarcating firm-sponsored from non-firm content.
- A zipped folder called 'Examples.zip' that gives a sample submission for Apple Inc. Pls ignore anything in the excel sheet that we haven't covered in class so far, like 'Sentiment' etc.
Task:
1. I have assigned 8-10 brands to each group. Check the excel sheet on linkstreet for which brands have been assigned to your group.
2. Carefully go through the explanation in the word file for how to run the python script for scraping the data.
3. For each brand or firm in your list, run the data collection python script and collect data on the first 5 pages' worth of search results. Use the given keyword as is as the search term.
4. Use the excel template to identify and mark each URL in the results as either "firm-sponsored" (if the link leads to a firm/brand sponsored page) or as "non-firm-sponsored" (if the contents of a URL are outside the brand's control). If neither holds, mark the link as "Irrelevant".
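If it helps, step 4 can be given a rough head start in Python before you go through the links by hand. The sketch below is only an illustration and is not part of Googlesearch.py: the function names, the "check manually" placeholder label and the example domain list are all my assumptions. It pre-marks a URL as "firm-sponsored" only when its hostname is one of the firm's own domains (or a subdomain of one), and leaves everything else for you to judge, since firm-sponsored pages can also sit on third-party sites and a domain check cannot tell "non-firm-sponsored" from "Irrelevant".

```python
from urllib.parse import urlparse


def hostname_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def first_pass_label(url, firm_domains):
    """Pre-label a URL using only its hostname.

    firm_domains is a hand-made list of domains the firm controls
    (an assumption you supply; it is not produced by the scraping script).
    """
    host = hostname_of(url)
    for domain in firm_domains:
        domain = domain.lower()
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return "firm-sponsored"
    # Everything else still needs a human decision.
    return "check manually"


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    apple_domains = ["apple.com"]
    urls = [
        "https://www.apple.com/iphone/",
        "https://support.apple.com/en-us",
        "https://en.wikipedia.org/wiki/Apple_Inc.",
    ]
    for url in urls:
        print(first_pass_label(url, apple_domains), url)
```

Treat the output as a starting point only. Every link marked "check manually" still has to be opened and marked in the excel template by you.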
Deliverables:
Pls submit the excel sheets in the correct template for each brand (use separate worksheets in the same file for each brand). Name the excel sheet as "group HW 3 for group
Deadline:
Submit anytime before midnight on the day before either (i) the DC exam or (ii) visit 2, whichever is earlier.
For any queries, contact me via the blog comments section or write to Vivek directly.
P.S. The results you bring in may be used for academic research. I'm hoping the results will be interesting enough to be publishable.
Sudhir