
Web Scraping with Python — Understanding Requests and BeautifulSoup

Web scraping is a vital tool for extracting information from the web. Think of automated lead generators that scour Google Maps for businesses, which you can then forward to your sales team. These tools help with collecting and collating information. Today, we'll learn how to use the requests and BeautifulSoup modules in Python.


Web scraping was a fairly new concept to me when I first came across it, and in many ways it still is. On this pass, I used the opportunity to get more acquainted with it and to get to grips with using this technology to access the web.

To iron out some ambiguity, think of your browser sending a request to access any website you instruct it to. Typically, you would enter the URL in the URL bar at the top, press enter, and after a couple of seconds, you have your website on screen.

What’s happening behind all of this is what we refer to as parsing, which I briefly explained in an earlier post. Your browser doesn’t just display the information; it has to go through the process of translating it and then displaying it.

Making a request with Python works the same way as making one with a browser; the only difference is how the result is presented to us. Python gives us the raw HTML (code/text), while your browser renders it with all the styling and animations described in the website’s files.

I’ve already learnt about using RegEx, and the comparison holds up: just as we use RegEx to look for patterns in strings, we can use requests and BeautifulSoup to look for patterns in a live website.
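As a rough sketch of that comparison (the URL and the title tag are just examples), here is the same lookup done once with RegEx on the raw HTML string and once with BeautifulSoup, which we'll cover properly below:

import re

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com").text

# RegEx: search the raw HTML string for a text pattern
titles_regex = re.findall(r"<title>(.*?)</title>", html)

# BeautifulSoup: ask for the element itself rather than a text pattern
soup = BeautifulSoup(html, "html.parser")
titles_soup = [t.get_text() for t in soup.find_all("title")]

print(titles_regex)
print(titles_soup)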

What Is Web Scraping?

Web scraping is the term used to describe the process of extracting information from websites based on patterns and features you are interested in, then processing that information into a format that can be stored, analysed, and used.

Think of the following mental model:

Send a Request to a Web Server ➡️ Receive an HTML Response ➡️ Parse and Extract Structured Data.

HTTP Requests and Responses

Two libraries are renowned for this kind of work in Python: requests (for making HTTP requests) and BeautifulSoup (for parsing the HTML that comes back). Both can be installed with pip:

pip install requests
pip install beautifulsoup4

I’m going to use requests first, as this is the more popular library for this.

import requests

url = "https://www.google.com"
response = requests.get(url)

print(response.text)

response.text gives us the source code of the website we are accessing, decoded as a regular Python string.

We can also use response.content, which returns the same payload as raw bytes instead; this is useful when the response isn’t plain text, for example an image or a file download.
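A quick way to see the difference between the two (reusing the request from above):

import requests

response = requests.get("https://www.google.com")

# .text is the decoded source as a Python string
print(type(response.text))     # <class 'str'>

# .content is the same payload as raw bytes
print(type(response.content))  # <class 'bytes'>
print(response.content[:60])   # the first few raw bytes of the page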

In cybersecurity, the closest equivalent tool is curl, and this is what the same request would look like:

curl https://www.google.com

Note: Of course, there are some nuances as to what you can use a web scraper for, so it would be worthwhile to look into that. As of writing this, this is my only real experience with web scrapers.

It is also possible to inspect metadata about the response:

print(response.status_code)

This will print the HTTP status code, which indicates how the web server handled your request. There are many status codes you could get; some of the more common ones are:

  • 404 Not Found (the resource is missing or could not be accessed)
  • 200 OK (the request succeeded and the page was returned)

Hint: Here is a full list of status codes on W3 Schools 👉 HTTP Status Codes
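As a small sketch of how you might use this, we could check the status code before doing anything with the response (reusing the same request from earlier):

import requests

response = requests.get("https://www.google.com")

# Only carry on if the server says the request succeeded
if response.status_code == 200:
    print("Success, we have the page")
elif response.status_code == 404:
    print("The resource was not found")
else:
    print(f"Something else happened: {response.status_code}")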

HTML and Parsing

HTML is the backbone of a website; it tells the browser what each element on the page is. This helps the browser interpret the information accurately, like headings, paragraphs, tables, links, etc.

<h1>This Is a Heading1</h1>
<p>Paragraph</p>
<table>
  <tr>
    <!-- table row -->
    <th>tableheader</th>
    <td>tabledata</td>
  </tr>
</table>

<a>anchor/link</a>

The process of extracting information from raw HTML documents is referred to as parsing. Think of a system that goes through a website’s HTML and scans it for the elements I listed above.

Unlike requests, BeautifulSoup can extract information from HTML documents at a more granular level. Think of requests as just sending and receiving information, and BeautifulSoup as being able to scan a page for its elements and bring back specific portions.

Using BeautifulSoup

This library can get pretty interesting, and again, since this is my second pass over the topic, I want to focus on explaining how it works rather than writing every aspect of it down.

We use requests to bring back the HTML, and BeautifulSoup then parses it and brings back the information we want.

import requests
from bs4 import BeautifulSoup

url = "https://archive.ics.uci.edu/ml/datasets.php"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

Once parsed, the raw HTML data becomes structured and accessible, allowing us to retrieve specific parts of the extract.

For example:

print(soup.title)
print(soup.title.get_text())
print(soup.body)

We can also go ahead and bring back table data.

# Find all tables with cellpadding="3" and take the first one
tables = soup.find_all("table", {"cellpadding": "3"})
table = tables[0]

# Print the text of every cell in the first row
for td in table.find("tr").find_all("td"):
    print(td.text)
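In the same way, we could bring back every anchor/link on the page; this is just a sketch, and the exact links you get back depend on the site:

# Collect the destination of every <a> tag that actually has an href
links = []
for a in soup.find_all("a"):
    href = a.get("href")
    if href:
        links.append(href)

print(links[:10])  # the first few links on the page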

The mental model for using requests with BeautifulSoup would be:

Inspect the HTML response from requests ➡️ Identify the correct semantic tags or elements ➡️ Extract the data (or the exact tag and its data) ➡️ Store the information
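Putting that mental model together, a minimal end-to-end sketch might look like this (the URL and the output file name are just placeholders):

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"  # placeholder URL

# 1. Inspect the HTML response from requests
response = requests.get(url)

# 2. Identify the tags or elements we care about
soup = BeautifulSoup(response.content, "html.parser")
rows = soup.find_all("tr")

# 3. Extract the data from each tag
data = [[cell.get_text(strip=True) for cell in row.find_all(["th", "td"])] for row in rows]

# 4. Store the information
with open("scraped_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(data)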

Synopsis

Web scraping is all about understanding how the web communicates with your device. By learning how to use requests and BeautifulSoup, we gain the understanding and the mental model we need to start brainstorming ways to put this to use.

In one of my earlier posts, I mentioned using RegEx to extract information and feed it to a machine learning algorithm, which would then provide detailed statistics we could use to inform decisions, like purchase orders for inventory that reaches a certain price point.

Another application would be something day traders could build to tell them whether or not they should purchase a stock. It’s all about knowing how to make a request, interpret the response, and parse the documents. This is how we extract meaningful information from any accessible website.

This post is licensed under CC BY 4.0 by the author.