Web scraping using Beautiful Soup and Python

The Python programming language has an ecosystem of modules and tools that can be used for scraping data from websites. In this article we will focus on the Beautiful Soup module.

Step#1 Install Beautiful Soup and other required modules

To get started with Beautiful Soup, you need a few additional modules such as requests and lxml. Install the required modules as below:

pip install beautifulsoup4
pip install requests
pip install lxml

Step#2 Understand the web page's HTML tag structure

Let us try to scrape this Wikipedia page: https://en.wikipedia.org/wiki/List_of_programming_languages

Some observations from looking at the webpage structure:

  1. There is only one h1 element, and it is the page title
  2. There are multiple h2 elements
  3. Each h2 element is followed by an unordered list
  4. Some tags have attributes such as id, class, etc.

Step#3 Fetch the webpage containing the required data

Before extracting the required values, we need to fetch the whole webpage. This is achieved using the requests module.

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_programming_languages'
r  = requests.get(url)
data = r.text
print(data)

Step#4 Parse the webpage

Now we have the page loaded as HTML in a variable called “data”. If you look at the output, it contains raw HTML tags. To access a specific tag, we need to parse this HTML. This is where Beautiful Soup comes into the picture.

soup = BeautifulSoup(data, features = "lxml")

Step#5 Search for the required HTML tag

The variable named soup now holds the HTML in a parsed form that can be searched.

Let us assume we want to see the header (h1 tag) content. Here is the final code.
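Combining Steps 3 to 5 (fetch, parse, and search), a minimal version of the final code looks like this:

```python
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_programming_languages'
r = requests.get(url)
data = r.text

# Parse the raw HTML so individual tags can be searched
soup = BeautifulSoup(data, features="lxml")

# find() returns the first matching tag; this page has only one h1
header = soup.find('h1')
print('header : ', header)
print('header Text: ', header.text)
```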

Output

header :  <h1 class="firstHeading" id="firstHeading" lang="en">List of programming languages</h1>
header Text:  List of programming languages

Please note that we use the .text attribute to get the text content of the tag.

Here is another example with some additional details.
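The sketch below shows find_all(), lookups by attribute, and sibling navigation. To keep it self-contained it parses a small inline HTML snippet shaped like the Wikipedia page (the snippet and its ids are made up for illustration); on the live page you would search the soup object built in Step 4 instead.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure we observed in Step 2:
# one h1 title, multiple h2 headings, each followed by an unordered list.
html = """
<h1 id="firstHeading">List of programming languages</h1>
<h2><span id="A">A</span></h2>
<ul><li>Ada</li><li>ALGOL</li></ul>
<h2><span id="B">B</span></h2>
<ul><li>BASIC</li></ul>
"""
soup = BeautifulSoup(html, features="lxml")

# find() returns the first match; find_all() returns every match
print(soup.find("h1").text)   # List of programming languages
headings = soup.find_all("h2")
print(len(headings))          # 2

# Tags can also be looked up by attribute, e.g. by id
print(soup.find(id="A").text)  # A

# Each h2 is followed by an unordered list of language names
for h2 in headings:
    languages = [li.text for li in h2.find_next_sibling("ul").find_all("li")]
    print(h2.text, "->", languages)
```

The same loop over h2 headings and their sibling lists would collect every language name from the real page.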
