Web scraping using Beautiful Soup and Python

The Python programming language has an ecosystem of modules and tools that can be used for scraping data from websites. In this article we will focus on the Beautiful Soup module.

Step#1 Install Beautiful Soup and other required modules

To get started, you need a few modules besides Beautiful Soup: requests to download pages and lxml to parse them. Install the required modules as below:

pip install beautifulsoup4
pip install requests
pip install lxml
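
To confirm the installs worked, a quick check along these lines may help. It is only a sketch: the helper name check_modules is made up for illustration. It relies on the fact that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

# pip package name -> import name (beautifulsoup4 is imported as bs4)
REQUIRED = {"beautifulsoup4": "bs4", "requests": "requests", "lxml": "lxml"}


def check_modules(required=REQUIRED):
    """Return {pip_name: True/False} depending on whether each module can be found."""
    return {
        pip_name: importlib.util.find_spec(import_name) is not None
        for pip_name, import_name in required.items()
    }


status = check_modules()
for pip_name, installed in status.items():
    print(f"{pip_name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can be installed with the matching pip command above.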

Step#2 Understand the web page's HTML tag structure

Let us try to scrape this Wikipedia page: https://en.wikipedia.org/wiki/List_of_programming_languages

Some observations from looking at the web page structure:

  1. There is only one h1 element, and it is the page title
  2. There are multiple h2 elements
  3. Each h2 element is followed by an unordered list
  4. Some tags have attributes such as id, class, etc.
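
These observations can also be checked in code. The sketch below runs against a small hand-written page, not the real Wikipedia markup (which is larger and may nest its headings differently). It uses Python's built-in "html.parser" so that it runs even without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page.
sample_html = """
<html><body>
  <h1 id="firstHeading" class="firstHeading">List of programming languages</h1>
  <h2 id="A">A</h2>
  <ul><li>Ada</li><li>ALGOL</li></ul>
  <h2 id="B">B</h2>
  <ul><li>BASIC</li><li>Bash</li></ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

h1_count = len(soup.find_all("h1"))   # observation 1: a single h1
h2_count = len(soup.find_all("h2"))   # observation 2: several h2 elements
# Observation 3: every h2 has a ul among the siblings that follow it.
h2_with_list = all(
    h2.find_next_sibling("ul") is not None for h2 in soup.find_all("h2")
)
# Observation 4: attributes are exposed as a dictionary on each tag.
h1_attrs = soup.h1.attrs

print(h1_count, h2_count, h2_with_list, h1_attrs)
```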

Step#3 Fetch the web page

Before extracting the required values, we need to fetch the whole web page. This is done with the requests module.

from bs4 import BeautifulSoup
import requests

# Download the page; r.text holds the response body decoded as a string.
url = 'https://en.wikipedia.org/wiki/List_of_programming_languages'
r = requests.get(url)
data = r.text
print(data)
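
For anything beyond a quick experiment, it may be worth making the fetch a little more defensive. The sketch below is one way to do that, not the only one: the function name fetch_html, the 10-second timeout and the User-Agent string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP or network errors."""
    # Placeholder User-Agent that identifies the script to the site.
    headers = {"User-Agent": "example-scraper/0.1 (learning project)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into requests.HTTPError
    return response.text


# Usage (performs a real network request):
# data = fetch_html("https://en.wikipedia.org/wiki/List_of_programming_languages")
```

With raise_for_status() in place, a failed download stops the script early, so an error page never reaches the parser.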

Step#4 Parse the web page

Now we have the page loaded as HTML in a variable called “data”. If you look at the output, it is full of HTML tags. To access a required tag, we need to parse this HTML. This is where Beautiful Soup comes into the picture.

soup = BeautifulSoup(data, features="lxml")
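
To get a feel for what the parsed object offers, here is a small sketch on an inline document. The HTML string is made up and stands in for the downloaded page, and the built-in "html.parser" is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Small inline document standing in for the downloaded `data` string.
data = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

soup = BeautifulSoup(data, features="html.parser")

print(type(soup).__name__)  # name of the parsed document class
print(soup.title.string)    # text inside <title>
print(soup.p["class"])      # class is multi-valued, so it comes back as a list
print(soup.prettify())      # indented view of the parsed tree
```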

Step#5 Search for the required HTML tag

The variable soup now holds the HTML tags in a parsed form that can be searched.

Let us assume we want to see the header (h1 tag) content. Here is the final code.
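
A minimal sketch of such code follows. It assumes the h1 markup matches the output shown below, and it uses an inline sample_html string in place of the downloaded page so that it runs offline. With the real page, soup would come from the earlier steps.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (assumption: only the h1 matters here).
sample_html = """
<html><body>
<h1 class="firstHeading" id="firstHeading" lang="en">List of programming languages</h1>
</body></html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

header = soup.find("h1")  # first h1 in the document, or None if there is none
print("header : ", header)
print("header Text: ", header.text)
```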

Output

header :  <h1 class="firstHeading" id="firstHeading" lang="en">List of programming languages</h1>
header Text:  List of programming languages

Please note that we need to use the .text attribute to get the content of the tag.
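
When a tag contains nested tags or stray whitespace, .text returns everything as-is. The get_text() method gives more control. A short sketch on a made-up heading:

```python
from bs4 import BeautifulSoup

# Made-up heading with a nested tag and padding whitespace.
html = "<h1>  List of <i>programming</i> languages  </h1>"
soup = BeautifulSoup(html, features="html.parser")
header = soup.find("h1")

raw_text = header.text                        # all nested text, whitespace kept
clean_text = header.get_text(" ", strip=True) # strip each piece, join with a space

print(repr(raw_text))
print(repr(clean_text))
```

The separator argument matters here: with strip=True alone, the stripped pieces are joined with nothing in between.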

Here is another example with some additional details
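
One possible version of such an example, sketched against made-up markup: it collects the list items under each h2 heading and reads the href attribute of each link. The sample page, the link paths and the variable names are all assumptions. The real page is larger and its markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page; link paths are placeholders.
sample_html = """
<html><body>
  <h1 id="firstHeading">List of programming languages</h1>
  <h2 id="A">A</h2>
  <ul><li><a href="/wiki/Ada">Ada</a></li><li><a href="/wiki/ALGOL">ALGOL</a></li></ul>
  <h2 id="B">B</h2>
  <ul><li><a href="/wiki/BASIC">BASIC</a></li></ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

languages_by_heading = {}
for h2 in soup.find_all("h2"):
    ul = h2.find_next_sibling("ul")  # next ul at the same nesting level
    if ul is None:
        continue
    # Use the heading's id attribute as the key; .get() returns None if it is absent.
    languages_by_heading[h2.get("id")] = [
        li.get_text(strip=True) for li in ul.find_all("li")
    ]

# Attributes are read with dictionary-style access.
links = [a["href"] for a in soup.find_all("a")]

print(languages_by_heading)
print(links)
```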

How to get rid of “No parser was explicitly specified” while using Beautiful Soup in Python

While parsing a page with Beautiful Soup, I got the following warning. Although I ignored it for some time, it became distracting to see it every time I ran my program.


UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 12 of the file filename.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

There is nothing wrong with this warning and you can continue coding; however, I wanted to get it fixed for the following reasons:

  1. It's distracting to see this warning every time I run my program
  2. If I run my program on some other machine, it might not behave as expected, since the system will choose whichever parser is available.

Besides these two primary reasons, I get an itch whenever I see unformatted code or unnecessary warnings. Many times I have burned my fingers while correcting warnings (read: I was able to fix the warning, but it led to errors, and the whole process ate a considerable amount of my time).

Don’t worry, fixing the above warning will not lead to an error.

Before fixing this warning, install lxml:

pip install lxml

To fix this warning, simply replace the following line

soup = BeautifulSoup(data)

with this line

soup = BeautifulSoup(data, features="lxml")

Now run your program and it will run without any warning.
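
If you want to see the difference for yourself, the sketch below records the warnings raised with and without an explicit parser. The helper name parser_warnings is made up. The built-in "html.parser" is used so the snippet runs even where lxml is missing, and passing "lxml" silences the warning in the same way.

```python
import warnings

from bs4 import BeautifulSoup

html = "<h1>List of programming languages</h1>"


def parser_warnings(**kwargs):
    """Parse `html` and return the messages of any warnings raised while doing so."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        BeautifulSoup(html, **kwargs)
    return [str(w.message) for w in caught]


without_parser = parser_warnings()                     # no parser named
with_parser = parser_warnings(features="html.parser")  # parser named explicitly

print(len(without_parser), "warning(s) without an explicit parser")
print(len(with_parser), "warning(s) with an explicit parser")
```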