Today I'm here to tell you that we might be living through one of the greatest technological revolutions of mankind: Big Data.
“Data is the oil of the 21st century, and analytics is the combustion engine.”
But where is this data? Sometimes it will sit in a data warehouse, ready to be used, as easy as it gets. But sometimes it will be spread all over the web.
Web Scraping is a classic technique used to collect massive (or not so massive) amounts of data from websites. It is very common to bundle other tasks into a web scraping project, such as cleaning, pre-processing and analyzing the data.
These tasks require you to first understand the HTML code behind the website you're planning to scrape. Some important things to point out before going ahead with this post:
- There's no template for Web Scraping. Since you have to manipulate the HTML tags of the specific page you're scraping, every project is unique.
- If you intend to keep your project running for a long time, know that you'll have to maintain it periodically. It's absolutely normal to run your code one day and face never-before-seen errors; most likely the page your code scrapes has changed its structure.
There are many Python web scraping packages out there. In this post I'll show a practical example using webdriver from the Selenium package.
To download webdriver for Chrome, please access the link below and choose the operating system of your preference.
Downloads - ChromeDriver - WebDriver for Chrome
Selenium was created to be a website testing tool. That's right: after the development of a website is over, you can write your test batches with Selenium. However, Selenium also proved to be pretty useful as a web scraping package and started being used as one. Why?
- Selenium can execute actions on websites, like clicking buttons and parsing text. It's pretty handy to have that option when coding a web scraping project.
Getting into our Project
To introduce you to Web Scraping, I decided to build a project that combines it with another passion of mine, the Stock Market.
The idea is to write code that repeatedly accesses a web page containing real-time stock market prices and collects that data.
I'm from Brazil, so I'll use our local stock market index as well, the Ibovespa.
Please, take a minute to look at the web page we will be scraping below:
Altas e baixas | Mercados | InfoMoney
The biggest gains and losses among the stocks of companies listed on B3. To refine your search by indices, sectors and…
Remember, one of the first things I said about Web Scraping is that we need to understand the code behind the page we're planning to scrape. In this link, you can see that we have a huge table containing information about many stocks from the Brazilian market.
To see the code behind the curtains, right-click anywhere on the web page and then go to Inspect.
This command will open the HTML inspector of the page. If you don't know much about HTML, that's fine! I don't either. But one thing we do have to know is that this huge stock prices table lives in a <tbody> tag somewhere in the huge amount of code that will show up. These body tags generally come with <tr> and <td> tags, which are the rows and values of the table.
There are many ways to reach the same scraping result in Python, but all of them consist of mapping and interacting with the HTML elements on the web page. In our example, I found that we do have the table we're expecting, and it is full of rows and values translated into HTML tags. Take a look at how the first row of the table looks:
This table can be found by its ID value "altas_e_baixas". Each row of the table is translated as a <tr> tag, and each of these rows holds multiple values.
Now that we've inspected the page a little, let's get to the Python code. First I'll import the libraries I'll be working with along the way and establish a connection with the page we'll be scraping data from.
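A minimal sketch of that setup step, assuming Selenium 3 and Chrome. The URL is my reconstruction of the InfoMoney page linked above, and the chromedriver path is a placeholder you'd replace with your own:

```python
import pandas as pd  # we'll use this later to assemble the scraped rows

# The page described in the post (URL is my assumption from the link above).
URL = "https://www.infomoney.com.br/mercados/altas-e-baixas/"


def start_driver(chromedriver_path):
    """Open a Chrome window controlled by Python and load the quotes page."""
    # Imported here so the sketch can be read without a browser set up.
    from selenium import webdriver

    driver = webdriver.Chrome(chromedriver_path)  # path to the chromedriver you downloaded
    driver.get(URL)
    return driver
```

You would call `start_driver("/path/to/chromedriver")` with the path where you saved the executable; in Selenium 3 the driver path is passed directly as the first argument.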
To interact with the HTML elements, let's use the find_elements_by_xpath function from webdriver. To find the XPath of an element, you can simply right-click on it in the inspector and choose "Copy XPath".
This is the XPath I just copied from the image above:
This is also the first row and first value of the first table we see on the page. Look at the bold words in it.
Now look what happens when you pass this XPath string to the function and call text on the result in a Jupyter Notebook:
It returns the text this XPath represents! In this case, the name of the stock in the first row of the table.
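The exact XPath in the post came from a screenshot, so the string below is my reconstruction from the table id "altas_e_baixas" mentioned earlier; the copied string may differ slightly:

```python
# XPath of the first cell (first row, first column) of the table --
# reconstructed from the table's id shown in the inspector.
FIRST_CELL = '//*[@id="altas_e_baixas"]/tbody/tr[1]/td[1]'

# With a live driver session, this is the call the post describes
# (find_elements_by_xpath is the Selenium 3 API used throughout):
# driver.find_elements_by_xpath(FIRST_CELL)[0].text  # stock name in row 1
```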
Looping through the Rows of the Table
You get the idea, right? All we have to do now is create a loop that switches the value of <tr> in order to collect all the values of the table.
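In code, "switching the value of <tr>" just means changing the row index inside the XPath string. A tiny helper sketches the idea (the table id is the one from the inspector; the function name is my own):

```python
def row_xpath(i):
    """XPath of the i-th row of the table (id taken from the inspector)."""
    return f'//*[@id="altas_e_baixas"]/tbody/tr[{i}]'
```

`row_xpath(1)` points at the first row, `row_xpath(2)` at the second, and so on; appending `/td[1]`, `/td[2]`… then selects the individual cells of that row.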
Let's create lists that will hold these values during the loop.
I want to get the name of the stock, which is the first value of the row. I also want the last time that stock price was updated, which is the second value of the row, and so on…
The following lines of code simply loop over a range from 1 to 100 and append every text value to the corresponding list. After that's done, I create a pandas DataFrame with the information collected in the five lists.
Here's the full code:
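The original code embed isn't reproduced here, so below is a self-contained sketch of what that script likely looks like. Assumptions of mine: the URL, the chromedriver path, the three column names beyond "stock" and "last update" (the post only names those two), and that the cells line up as td[1]…td[5]:

```python
import pandas as pd


def scrape_table(driver, n_rows=100):
    """Loop over rows 1..n_rows of the table and collect each cell's text."""
    stock, hour, price, change, volume = [], [], [], [], []
    for i in range(1, n_rows + 1):
        row = f'//*[@id="altas_e_baixas"]/tbody/tr[{i}]'
        # find_elements_by_xpath is the Selenium 3 API used in the post
        stock.append(driver.find_elements_by_xpath(row + '/td[1]')[0].text)
        hour.append(driver.find_elements_by_xpath(row + '/td[2]')[0].text)
        price.append(driver.find_elements_by_xpath(row + '/td[3]')[0].text)
        change.append(driver.find_elements_by_xpath(row + '/td[4]')[0].text)
        volume.append(driver.find_elements_by_xpath(row + '/td[5]')[0].text)
    return pd.DataFrame({'stock': stock, 'last_update': hour,
                         'price': price, 'change': change, 'volume': volume})


# Live usage (requires chromedriver; the path below is a placeholder):
# from selenium import webdriver
# driver = webdriver.Chrome('/path/to/chromedriver')
# driver.get('https://www.infomoney.com.br/mercados/altas-e-baixas/')
# df = scrape_table(driver)
# driver.quit()
```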
Pretty cool, huh? But this is just the beginning! In future posts I'll show you how to make this script run multiple times a day using while loops in Python or the crontab scheduler on macOS. We'll also make our code send us an alert email in case the price of a selected stock reaches a predetermined condition.
Rodrigo Deboni Dutcosky.