How to extract plain text from an HTML page in Python
There are many ways to extract plain text from HTML, and which one works best depends on what we want to extract and whether we know where to find it. In this article I will demonstrate a simple way to grab the text of selected block elements from the HTML source, so that we end up with a single concatenated string of the texts on the page.
We will do it with Python and Beautiful Soup 4, a Python library for scraping information from web pages. To start off, let’s install the beautifulsoup4 package and the lxml parser (a fast parser that can be used together with Beautiful Soup):
# install using pip
pip install --user beautifulsoup4
pip install --user lxml
# install using poetry
poetry add beautifulsoup4
poetry add lxml
Now we will import Beautiful Soup’s classes for working with HTML: BeautifulSoup for parsing the source, and Tag, which we are going to use for checking whether a particular element in the parsed BeautifulSoup tree represents an HTML tag.
Besides the necessary imports, we will also define a list of block elements that we want to extract the text from. I have picked p for paragraphs, h1-h5 for headings and blockquote for quotes as an example:
from bs4 import BeautifulSoup
from bs4.element import Tag
blocks = ["p", "h1", "h2", "h3", "h4", "h5", "blockquote"]
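To see why the Tag check matters, here is a minimal sketch of what iterating over a parsed element yields: a mix of tags and bare text nodes. The sample markup is made up for illustration, and the sketch assumes Python's built-in html.parser so that it runs even without lxml installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment: a <div> holding loose text followed by a paragraph.
soup = BeautifulSoup("<div>loose text<p>a paragraph</p></div>", features="html.parser")

# Iterating over an element walks its direct children.
kinds = []
for child in soup.div:
    kind = "tag" if isinstance(child, Tag) else "text node"
    kinds.append(kind)
    print(kind, repr(str(child)))
```

Only the second child is a Tag. The first is a plain text node, which has no children to recurse into.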
Our main function to_plaintext(html_text: str) -> str will take a string with the HTML source and return a concatenated string of all texts from our selected blocks:
def to_plaintext(html_text: str) -> str:
    soup = BeautifulSoup(html_text, features="lxml")
    extracted_blocks = _extract_blocks(soup.body)
    extracted_blocks_texts = [block.get_text().strip() for block in extracted_blocks]
    return "\n".join(extracted_blocks_texts)
- When initializing BeautifulSoup, we can choose which HTML parser will be used to parse the string, so we choose the lxml package we installed.
- We call a helper function, _extract_blocks(), passing it a root HTML element to work with: the HTML body. We will implement the function soon.
- As _extract_blocks() returns a list of our block elements, we take the text of each one with the get_text() method, strip leading and trailing white space, and join the results, separating them with a single newline.
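The text-handling steps can be sketched on their own. In this illustrative snippet, find_all() stands in for the helper function, the markup is made up, and html.parser is assumed instead of lxml so that the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup with inline tags and stray white space.
html = "<h1> Title </h1><p>Some <b>bold</b> and <a href='#'>linked</a> text.\n</p>"

soup = BeautifulSoup(html, features="html.parser")

# get_text() concatenates all text inside an element, including nested inline tags;
# strip() then removes the surrounding white space.
texts = [tag.get_text().strip() for tag in soup.find_all(["h1", "p"])]
result = "\n".join(texts)
print(result)
```

This prints "Title" on the first line and "Some bold and linked text." on the second: the inline b and a tags disappear, and only their text remains.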
The last thing is to define the _extract_blocks() function, which will take a root element and return all block elements that we are interested in:
def _extract_blocks(parent_tag) -> list:
    extracted_blocks = []
    for tag in parent_tag:
        # A selected block: keep it whole and do not descend into it
        if tag.name in blocks:
            extracted_blocks.append(tag)
            continue
        # Any other tag: search its children recursively
        if isinstance(tag, Tag):
            if len(tag.contents) > 0:
                inner_blocks = _extract_blocks(tag)
                if len(inner_blocks) > 0:
                    extracted_blocks.extend(inner_blocks)
    return extracted_blocks
- Inside the function, we recursively traverse the element tree to find our block elements inside other elements (that are inside other elements and so on).
- If the tag name matches one of our block elements, we add it to the list and do not descend into it any further.
- The _extract_blocks() function only needs to be defined before to_plaintext() is first called, not before to_plaintext() is defined, because Python looks the name up when the call actually runs.
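Putting the pieces together, here is a self-contained sketch that can be run as a quick check. It follows the two functions above with a few assumptions of its own: the parser parameter is a hypothetical addition that defaults to the built-in html.parser (pass "lxml" to match the article), a document without a body is treated as empty, and the sample page is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

blocks = ["p", "h1", "h2", "h3", "h4", "h5", "blockquote"]


def _extract_blocks(parent_tag) -> list:
    extracted_blocks = []
    for tag in parent_tag:
        if not isinstance(tag, Tag):
            continue  # skip bare text nodes
        if tag.name in blocks:
            extracted_blocks.append(tag)  # keep the block, do not descend
        else:
            extracted_blocks.extend(_extract_blocks(tag))
    return extracted_blocks


def to_plaintext(html_text: str, parser: str = "html.parser") -> str:
    # `parser` is a hypothetical extra parameter, not part of the article's version.
    soup = BeautifulSoup(html_text, features=parser)
    if soup.body is None:  # assumption: no <body> means nothing to extract
        return ""
    texts = [block.get_text().strip() for block in _extract_blocks(soup.body)]
    return "\n".join(texts)


# Made-up sample page for illustration.
sample = """<html><head><title>Ignored</title></head>
<body>
<nav><a href="/">Home</a></nav>
<div>
  <h1>Heading</h1>
  <div><p>First <em>paragraph</em>.</p></div>
  <blockquote><p>Quoted text.</p></blockquote>
</div>
</body></html>"""

plain = to_plaintext(sample)
print(plain)

# For comparison: find_all() also returns the <p> nested inside the <blockquote>,
# so the quoted text would appear twice.
soup = BeautifulSoup(sample, features="html.parser")
flat = [tag.get_text().strip() for tag in soup.find_all(blocks)]
print(flat)
```

The first print shows three lines (the heading, the paragraph and the quote), while the navigation link and the page title are left out. The find_all() comparison lists the quote twice, which is why the recursive helper stops descending once it has matched a block.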
And that’s all!
Last updated on 16.10.2020.