Why scrape news articles?
You may need to be updated about the latest news in your professional field or a specific topic of interest. The conventional way of opening a news website, typing your field of interest, scrolling through all news feeds, opening the same links again and again, and saving some useful links in separate files can be quite repetitive for daily usage. Thus, we attempted to automate this process using simple and open source tools such as Jupyter Notebook to run Python code and Google Sheets, where we can save all necessary data in a visually appealing way.
The contents of this blog post
- In Part One, we will share how to create Google API credentials in order to access Google Sheets without manually opening it.
- In Part Two, we will discuss how to do web scraping in Python in Jupyter Notebook.
- In Part Three, we will connect output data and Google Sheets in order to save the data for further analysis and display.
Part One: Google API credentials creation
- Create a Google Account if you do not have one.
- Open Google Developer Console.
- Create a New Project by clicking on “Create Project” button,
- Type your project name and ID (it cannot be changed after this stage) and click “Create”
- Open your Project (refresh page if you do not see it) and click API -> Dashboard
- In the APIs Dashboard, click “Enable APIs” to enable all required APIs.
- New page is opened, where you can type “Google Drive API” and “Google Sheets API”.
- Enable both APIs.
- After enabling Google Drive API, go to the “Credentials” section and click “Create Credentials” -> “Service Account”.
We need to create these credentials to use them to access Google Sheets from Jupyter Notebook.
- Start creating credentials. Type any name and skip (if you want) optional settings. For the role, you can select “Owner” to have full access (for your usage only).
- After you create the credentials, you can see them in the “Service Accounts” sections. Now in order to access the key, go to the “Manage Keys” section.
- Create a key for this service account. Select “json” as the type of the file. When you create a key, it will automatically download the key.
In case it did not download automatically, go to the KEYS tab and download the key manually. The key is basically a .json format file which has all credential information of a service account. The content of the key is as following:
Well done! You have your Google API credentials ready and now let’s get into coding!
Part Two: Web Scraping using Python
- Install Jupyter Notebook to run Python.
- In order to get information from a webpage, we need to install web scraping libraries, and we used BeautifulSoup due to its easy usage.
Run the following command from cmd/terminal.
- Install Requests to access <html> code of the webpage.
Run the following command from cmd/terminal.
- Now when you have the environment installed, create a blank python file.
- Import all our libraries. We will need numpy to manipulate the data.
- Define you keywords as an array, For example:
- Go to the desired website and type “EXAMPLE” in the search bar. Here you can apply all sorting needed, for example, “Newest” or “Popular” etc. Copy the result page URL. For example, you get the URL as: https://search.naver.com/……s&query=EXAMPLE&sm=tab_….
- In the URL, you can see your EXAMPLE word as equals to “query”. This is the place where you can put any other keyword and save the URL for each keyword. We used for loop and concatenate the URL string
- Now when you have all URLs from which to get data, use the following commands to retrieve content of the page.
- In order to understand which information you need from the content, go to the wanted page (the result page) and right click “View page source”. You will see <html> of the page (which is the content) and find the wanted type (<a>, <p>, <h1> etc) of the information. For example, in news websites, each news has a title in <a> tag with a link attached to it (href component of <a> tag). Now, use the following easy lines of code to get the wanted section.
Please note that these parameters (tag type, class name) depend on the website you are scraping.
- You can manage the data as you want, but in our case, we saved data regarding each keyword as an object. Then, we combined data for all keywords, which resulted in a big array that will be used as an output dataframe.
- Import Pandas Library to work with data frames. Save the resultant big array as a dataframe using the installed library.
Well done on preparing the data output! Now it is time to connect the Jupyter Notebook to Google Sheets.
Part Three: Connecting Jupyter Notebook with Google Sheets
- To login into Google Sheets using Jupyter Notebook, install the following libraries.
- Get your Google API credentials and authorize into your Google Account.
Important note: all your Python files should be in the same folder as your credentials key file (.json). If it shows an error at this stage, go to your Google Developer Console and check if both Google Drive and Google Sheets APIs are enabled in your project.
- Create a google spreadsheet file and share it with the service account email you created before. To see the service account email, open the credentials key .json file and look for “client_email”.
- Now you will upload your dataframe (big table) into the Google Sheets using the following command
Let’s go one by one for each variable,
table=> dataframe you created in Step 12 of Part 2.
spreadsheet_key=> Open your google spreadsheet and look at the URL.
The spreadsheet_key is basically the highlighted part of the URL as shown below. https://docs.google.com/spreadsheets/d/1..longstringofletters..w/edit..
So it is a string of letters and numbers which come after “/d/” and before “/edit..” parts.
s_name=> spreadsheet name, you can define as any string (for example, s_name = “TEST”)
credentials=> your Google API credentials, which you defined in Step 2 of Part 3. You can define optional variables if you want to change the upload method, see more here. If there are no errors, Jupyter displays the following message:
<Worksheet ‘TEST’ id:123..example..123>
Now go to the spreadsheet you created and you can see your “TEST” spreadsheet and the dataframe in it.
Please see the full code on our github repository.