Data scraping using Python

In an earlier post, I mentioned a piece of work we had done for a part of Oxford University, in which we extracted details from a number of documents on the NICE website using a technique known colloquially as web scraping. That post talked mostly about the implications of the technique for small and medium-sized businesses (SMEs) but now I would like to talk briefly about the approach we used from a technical perspective.

Our platform of choice for such work is Python – a very flexible and easy-to-use tool for all sorts of ad-hoc data access and analysis. In fact, we usually use the IPython environment for data manipulation tasks. This is an interactive environment ideally suited to ‘train of thought’ type examination of business data.

Python libraries and classes are available for retrieving pages, parsing html, navigating the page structure and, if required, formatting the results into a new html page. I would like to make the point that I am far from considering myself an expert in Python. However, the tools and classes referred to below are so simple that they can easily be used by anyone with a basic grounding in the language. In this post, I will briefly mention a few of the classes and packages that I have found most helpful and easy to use.

The requests package for querying urls. Retrieving a page is as simple as executing the command: results_page = requests.get(url), which returns the server’s response as an object that can be queried. The html for the page can be obtained from results_page.content (the raw bytes) or results_page.text (the decoded string).
The delightfully-named BeautifulSoup class for parsing html. This can take a string containing html and parse it into an object that can easily be queried to examine the contents of the page. The ‘soup’ as it is known can be created from the page returned as above by executing the command:
soup = BeautifulSoup(results_page.content, 'html.parser').
BeautifulSoup is located in a package called bs4.
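The two steps above (retrieve a page, then parse it) might be combined along the following lines. This is only a sketch: the function name fetch_soup, the timeout value and the sample html are assumptions made for illustration, not part of the original project.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it into a BeautifulSoup object.

    fetch_soup and its timeout default are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Parsing works the same way on any html string, so it can be tried offline.
sample_html = "<html><head><title>Example page</title></head><body></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
page_title = sample_soup.title.string
```

Calling fetch_soup with a real url returns a ‘soup’ for that page; the offline sample simply shows that the parser does not care where the html came from.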
The ‘soup’ object can easily be navigated/queried in ways that anyone familiar with css or jquery will find very intuitive. For example:
soup.title returns the title tag
soup.find_all("a") returns all anchor tags
soup.find_all(id="link1") returns the node(s) with an id of ‘link1’
soup.select("table.bigtable") returns all tables with a class of ‘bigtable’ (the select() method treats the parameter as a css selector).
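The queries listed above can be tried against a small, made-up page. The html below is invented purely for illustration; only the BeautifulSoup calls themselves come from the text.

```python
from bs4 import BeautifulSoup

# A tiny invented page to query; not taken from any real site.
html_doc = """
<html><head><title>Sample results</title></head>
<body>
  <a id="link1" href="/first">First</a>
  <a id="link2" href="/second">Second</a>
  <table class="bigtable"><tr><td>42</td></tr></table>
  <table class="other"><tr><td>7</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Text inside the title tag
title_text = soup.title.string

# The href attribute of every anchor tag, in document order
hrefs = [a["href"] for a in soup.find_all("a")]

# Text of the node(s) whose id is 'link1'
link1_text = [node.get_text() for node in soup.find_all(id="link1")]

# Tables matching the css selector 'table.bigtable'
big_tables = soup.select("table.bigtable")
first_cell = big_tables[0].find("td").get_text()
```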
Where results are to be viewed in tabular form, or manipulated in Excel, we find the DataFrame class in the pandas library very useful. This class has a wealth of powerful methods for data analysis and manipulation and we use it extensively in our consultancy work. However, it is also convenient as a way of building up a table of results from a web scrape and saving it as an Excel spreadsheet using the DataFrame.to_excel() method.
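As a rough sketch of building up a results table, the snippet below collects one dictionary per scraped item and turns the list into a DataFrame. The column names, the sample values and the save_results helper are all hypothetical. Note that to_excel() needs an Excel writer engine such as openpyxl to be installed for .xlsx files.

```python
import pandas as pd

# Invented sample values standing in for fields pulled out of scraped pages.
scraped_items = [
    ("First document", "https://example.com/first", 120),
    ("Second document", "https://example.com/second", 80),
]

# Build up one dictionary per result, then create the table in one go.
rows = []
for title, url, word_count in scraped_items:
    rows.append({"title": title, "url": url, "word_count": word_count})

results = pd.DataFrame(rows, columns=["title", "url", "word_count"])


def save_results(frame, path="scrape_results.xlsx"):
    """Write the table to an Excel file (hypothetical helper and filename)."""
    # index=False leaves out the DataFrame's row index column
    frame.to_excel(path, index=False)
```

Collecting rows in a plain list and creating the DataFrame once at the end is generally quicker than appending to a DataFrame inside the loop.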
An alternative approach to presenting results, particularly when the number of results is relatively small and the information scraped consists primarily of text rather than numbers, is to create an html document summarising the output. This has the advantage that it can contain chunks of html retrieved from the source websites (e.g. tables of information), can be formatted nicely and can contain hyperlinks, both within the document and to the source pages. To create an html page, we use the HTML class from the html package (a third-party package, not to be confused with the html module in the Python 3 standard library), which is extremely easy to use and very intuitive. For example:
doc = HTML('html') creates an html document object called doc
doc.body.h1('Document title') adds an h1 tag to the body containing the text ‘Document title’
doc.body.text('xyz') adds some text to the body of the document
doc.head.link(href='/stylesheets/xyz.css', rel='stylesheet', type='text/css') adds a link to a stylesheet.
The finished document can be converted to a string for saving to an html file using str(doc), or unicode(doc) if it contains unicode (unicode() exists only in Python 2; in Python 3, every str is already unicode). In the latter case, you may need to tell it how to encode the unicode – for what it is worth, we found the following worked well: unicode(doc).encode('ascii', 'xmlcharrefreplace').
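To show what the 'xmlcharrefreplace' option does, here is a small sketch assuming Python 3, with a plain string standing in for the document text. The sample text and the save_html helper are invented for illustration.

```python
# A plain string stands in for the finished document; the content is made up.
page_text = "<p>Café prices from £3</p>"

# Characters that ASCII cannot represent become numeric character references,
# which a browser renders as the original characters.
ascii_bytes = page_text.encode("ascii", "xmlcharrefreplace")


def save_html(data, path="summary.html"):
    """Write the encoded bytes to a file (hypothetical helper and filename)."""
    with open(path, "wb") as handle:
        handle.write(data)
```

Here ascii_bytes holds b"<p>Caf&#233; prices from &#163;3</p>", so the saved file is pure ASCII yet still displays the accented and currency characters correctly.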
If the data retrieved is to be stored in a database, this is also easily achieved in Python. For example, SQLite provides a simple and lightweight database which can be accessed using the sqlite3 package. For Oracle, we have found cx_Oracle to be extremely effective, and pypyodbc provides an odbc connection that can be used for SQL Server, for example (although I don’t have personal experience of this library).
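For the SQLite case, a minimal sketch using the sqlite3 package might look like this. The table name, columns and sample rows are assumptions, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Invented sample rows standing in for scraped results.
scraped_rows = [
    ("Second document", "https://example.com/second"),
    ("First document", "https://example.com/first"),
]

# ":memory:" creates a temporary database; pass a filename to keep the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, url TEXT)")

# The ? placeholders let sqlite3 handle quoting of the scraped values safely.
conn.executemany(
    "INSERT INTO documents (title, url) VALUES (?, ?)", scraped_rows
)
conn.commit()

stored_titles = [
    row[0]
    for row in conn.execute("SELECT title FROM documents ORDER BY title")
]
conn.close()
```

Using placeholders rather than building the SQL string by hand matters for scraped text in particular, since it may contain quotes or other awkward characters.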

This post is not intended as a comprehensive tutorial on web scraping in Python. There are many such resources available already on the web and I am unlikely to be able to add anything that is not already available elsewhere. Rather, it is intended to provide a brief introduction to the topic and to provide some basic pointers to help and inspire our more technically-minded readers to have a go.

The classes I have mentioned make it extremely easy to retrieve web pages, extract specific information and store it in a text document, spreadsheet or database. The critical question is how you can use this to improve your business. Information about the pricing and activities of your competitors, the desires and needs of your customers, and the factors that will shape your markets in the long term is out there for the taking. Whether or not you make use of it, you can be sure the smartest of your competitors will.
