The Python programming language has an ecosystem of modules and tools that can be used for scraping data from websites. In this article we will focus on the Beautiful Soup module.
Step#1 Install Beautiful Soup and other required modules
To get started, you need a few supporting modules, such as requests and lxml, alongside beautifulsoup4. Install the required modules as below:
pip install beautifulsoup4
pip install requests
pip install lxml
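To confirm the installs worked, you can try importing the three packages (a quick sanity check, not part of the scraper itself):

```python
# These imports should succeed once the pip installs above have completed
import bs4
import requests
import lxml

print(bs4.__version__)
print(requests.__version__)
```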
Step#2 Understand the web page HTML tag structure
Let us try to scrape this Wikipedia page: https://en.wikipedia.org/wiki/List_of_programming_languages
Some observations from looking at the webpage structure:
- There is only one h1 element, and it is the page title
- There are multiple h2 elements
- Each h2 element is followed by an unordered list
- Some tags have attributes such as id, class, etc.
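The observations above can be illustrated with a toy HTML fragment (a simplified sketch of the page structure, not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# A simplified sketch of the structure described above (not the real markup)
html = """
<h1 id="firstHeading">List of programming languages</h1>
<h2>A</h2>
<ul><li>Ada</li><li>ALGOL</li></ul>
<h2>B</h2>
<ul><li>BASIC</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the lxml dependency here

print(len(soup.find_all("h1")))  # one h1: the page title
print(len(soup.find_all("h2")))  # multiple h2 section headings
print(soup.find("h1")["id"])     # tags can carry attributes such as id
```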
Step#3 Fetch the webpage
Before we can extract any values, we need to fetch the whole webpage. This is done with the requests module.
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/List_of_programming_languages'
r = requests.get(url)
data = r.text
print(data)
Step#4 Parse the webpage
Now we have the page loaded as raw HTML in a variable called "data". If you look at the output, you will see it is full of HTML tags. To access a particular tag, we need to parse this HTML, and this is where Beautiful Soup comes into the picture.
soup = BeautifulSoup(data, features = "lxml")
Step#5 Search for the required HTML tag
The variable named soup now holds the parsed HTML, so the required tags can be searched for directly.
Let us assume we want to see the header (h1 tag) content. Here is the final code.
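A full script along these lines, reusing the requests fetch and the soup object from the earlier steps, would be:

```python
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_programming_languages"
r = requests.get(url)
soup = BeautifulSoup(r.text, features="lxml")

header = soup.find("h1")            # the page's only h1 element
print("header :", header)           # the whole tag, markup included
print("header Text:", header.text)  # just the text content
```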
Output
header : <h1 class="firstHeading" id="firstHeading" lang="en">List of programming languages</h1>
header Text: List of programming languages
Please note that we need to use the .text attribute to get the text content of the tag, without the surrounding markup.
Here is another example with some additional details.
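One possible follow-up (a sketch building on the observations above; find_all and lookup by attribute are standard Beautiful Soup features) is to list the h2 section headings and to locate a tag by its id attribute:

```python
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_programming_languages"
soup = BeautifulSoup(requests.get(url).text, features="lxml")

# find_all returns every matching tag, while find returns only the first one
for h2 in soup.find_all("h2")[:5]:
    print(h2.text.strip())

# Tags can also be located by their attributes, e.g. the h1 by its id
print(soup.find(id="firstHeading").text)
```

Here find_all gives us a list of tags we can loop over, which is handy for pulling every section of the page rather than just the first match.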