Extracting Text from HTML File Using Python

html2text is a Python library that converts HTML into plain, Markdown-formatted text, and it does a good job at this.

Extract text from HTML Tags and plain text (not wrapped in tags)

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")  # naming the parser avoids a warning and keeps results consistent

print(soup.text)
# to pay
# charges
# from one's
# bank account

print(soup.text.strip().replace('\n', ' '))  # strip() drops the leading and trailing newline first
# to pay charges from one's bank account
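An alternative sketch of the same idea uses get_text(), which accepts a separator and a strip flag, so no manual replace is needed. The HTML here is a shortened copy of the snippet above, with the attributes replaced by placeholder links for brevity:

```python
from bs4 import BeautifulSoup

html = """<p>
<a href="#">to pay</a>
<a href="#">charges</a>
from one's
<a href="#">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# the separator is then placed between the remaining fragments.
text = soup.get_text(separator=" ", strip=True)
print(text)
# to pay charges from one's bank account
```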

Extract text from HTML using Python

date = soup.find_all(class_="latest value")

The call above uses the wrong CSS class name: 'latest value' (with a space) is not the same as 'latest-value' (with a hyphen).

print(soup.find_all(attrs={'class': 'latest-value'}))
# [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>]

for element in soup.find_all(attrs={'class': 'latest-value'}):
    print(element.text)
# 2017-06-01
# 1430

I prefer the attrs keyword argument, but your approach with class_ works as well once the class name is correct:

for element in soup.find_all(class_='latest-value'):
    print(element.text)
# 2017-06-01
# 1430
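The difference between the two class names can be checked in isolation. This is a small sketch whose markup is reconstructed from the output shown above; the CSS selector variant is an additional, equivalent spelling rather than part of the original answer:

```python
from bs4 import BeautifulSoup

# Sample markup reconstructed from the output shown above.
html = """
<div class="latest-value">2017-06-01</div>
<div class="latest-value">1430</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A value with a space is compared against the whole class string,
# so "latest value" finds nothing here.
wrong = soup.find_all(class_="latest value")

# The hyphenated name matches; a CSS selector is an equivalent spelling.
by_keyword = [el.text for el in soup.find_all(class_="latest-value")]
by_selector = [el.text for el in soup.select("div.latest-value")]

print(wrong)        # []
print(by_keyword)   # ['2017-06-01', '1430']
print(by_selector)  # ['2017-06-01', '1430']
```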

Extracting text inside tags from html document

What you need is the .contents attribute, which holds a tag's children as a list (see the Beautiful Soup documentation).

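Find the span <span id="1"> ... </span> and loop over its children:

for x in soup.find(id="1").contents:  # attribute values are strings, so pass "1" rather than 1
    print(x)

Or take the first child directly:

x = soup.find(id="1").contents[0]  # an id is unique, so find() returns the single matching span
print(x)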

This prints:


10

that is, an empty line, then 10, then another empty line. The text inside the span really does contain those line breaks, because 10 sits on its own line in the HTML.

The string itself is '\n10\n'.

If you want just x = '10' from x = '\n10\n', you can use x = x[1:-1], since '\n' is a single character. x.strip() gives the same result and also copes with any other surrounding whitespace.
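The behaviour can be reproduced with a small, self-contained sketch. The markup below is an assumption based on the description (the number on its own line inside the span), not the original page:

```python
from bs4 import BeautifulSoup

# Assumed markup: the number sits on its own line inside the span.
html = """<div><span id="1">
10
</span></div>"""
soup = BeautifulSoup(html, "html.parser")

span = soup.find(id="1")
raw = span.contents[0]            # first (and only) child: a text node
print(repr(str(raw)))             # '\n10\n'

clean = raw.strip()               # remove the surrounding newlines
print(clean)                      # 10
print(span.get_text(strip=True))  # 10
```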

Extract text from html file with BeautifulSoup/Python

find_all() returns a list-like ResultSet, which has no .text of its own. Iterate over it and read .text from each element:

for name in names:
    print(name.text)

Output:

Baden-Württemberg
Bayern
Berlin

The built-in Python functions dir() and type() are always handy for inspecting an object.

print(dir(names))

[...,
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'append',
'clear',
'copy',
'count',
'extend',
'index',
'insert',
'pop',
'remove',
'reverse',
'sort',
'source']
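A short sketch of that inspection, using hypothetical markup because the original page is not shown (the li tags and the state class are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the original page structure is not shown.
html = """<ul>
<li class="state">Baden-Württemberg</li>
<li class="state">Bayern</li>
<li class="state">Berlin</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

names = soup.find_all("li", class_="state")

print(type(names).__name__)     # ResultSet
print(isinstance(names, list))  # True

# Text lives on the individual tags, not on the result set itself.
texts = [name.text for name in names]
print(texts)
```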

Function to open a file and extract text from HTML in Python

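Try it like this. The output file is opened once, so the text of every URL is kept instead of being overwritten on each pass:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # fl is a text file with one URL per line
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            html = urlopen(url).read().decode('utf8')
            bs = BeautifulSoup(html, 'html.parser')
            # join the text of the title, paragraphs and h1-h6 headings
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')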
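To try the extraction step without fetching any URLs, the parsing can be separated into its own function. A minimal sketch, where extract_text and the sample page are hypothetical names and data introduced for illustration:

```python
from bs4 import BeautifulSoup

# Tags whose text is kept: the title, paragraphs and h1-h6 headings.
TEXT_TAGS = ["title", "p"] + [f"h{n}" for n in range(1, 7)]

def extract_text(html):
    """Return the text of the selected tags, one per line (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(tag.get_text(strip=True) for tag in soup.find_all(TEXT_TAGS))

# Made-up page; the div is not in TEXT_TAGS, so its text is left out.
sample = """<html><head><title>Demo page</title></head>
<body><h1>Heading</h1><p>First paragraph.</p><div>Ignored</div></body></html>"""

result = extract_text(sample)
print(result)
```

Keeping the parsing separate from the download and file handling makes it possible to check the tag selection against a known string before running it over a list of real URLs.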

Extract specific portions in html file using python

Here you go. On this site the claims are in a section element with its own itemprop attribute, which makes things easier. The code below collects that section and keeps it in a variable so you can work with it further.

import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry")
soup = BeautifulSoup(page.content, 'html.parser')
claim_sect = soup.find_all('section', attrs={"itemprop":"claims"})
print('This is the raw content: \n')
print(str(claim_sect))
print('This is the variable type: \n')
print(str(type(claim_sect)))
str_sect = claim_sect[0].get_text()  # plain text of the first matching section
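The same lookup can be sketched offline. The markup below is a hypothetical stand-in for the downloaded page, and the claim class on the inner div elements is an assumption, not something confirmed for the real site:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """<html><body>
<section itemprop="description"><p>Background of the invention.</p></section>
<section itemprop="claims"><h2>Claims</h2>
<div class="claim">1. A compound of formula I.</div>
<div class="claim">2. The compound of claim 1 for use as a medicament.</div>
</section>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match or None, so a missing section is easy to detect.
section = soup.find("section", attrs={"itemprop": "claims"})

claims = []
if section is not None:
    claims = [div.get_text(strip=True) for div in section.find_all("div", class_="claim")]

for claim in claims:
    print(claim)
```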

Extract all text from HTML file while checking for boldness (Python)

You could check whether the text's parent is a b tag, or whether the parent has a style attribute that sets font-weight, to get a step closer. Note that the style check only looks for the presence of font-weight, so a value such as font-weight:normal would also be flagged:

for e in soup.find_all(string=True):  # string= replaces the deprecated text= argument
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in e.parent.get('style', ''),
    })

Example
from bs4 import BeautifulSoup

html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''

soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.find_all(string=True):  # string= replaces the deprecated text= argument
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in e.parent.get('style', ''),
    })

print(data)

Output
[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]

Or as a DataFrame: pd.DataFrame(data), after import pandas as pd.
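The parent check misses text that is nested more deeply inside a bold element, and it cannot tell font-weight:normal from font-weight:bold. One possible refinement, sketched here with a hypothetical is_bold helper, walks up all ancestors and reads the font-weight value. Treating strong as bold and 700 or above as the numeric threshold are assumptions of this sketch, and stylesheets outside the style attribute are not considered:

```python
import re
from bs4 import BeautifulSoup

BOLD_TAGS = {"b", "strong"}
WEIGHT_RE = re.compile(r"font-weight\s*:\s*([a-z0-9]+)", re.IGNORECASE)

def is_bold(text_node):
    """Return True if the nearest styling ancestor makes the text bold (hypothetical helper)."""
    for parent in text_node.parents:
        match = WEIGHT_RE.search(parent.get("style", ""))
        if match:
            # The nearest explicit font-weight decides; "normal" or 400 is not bold.
            value = match.group(1).lower()
            return value in ("bold", "bolder") or (value.isdigit() and int(value) >= 700)
        if parent.name in BOLD_TAGS:
            return True
    return False

# Made-up markup covering the nested, styled and plain cases.
html = (
    '<div><b><span>nested</span></b>'
    '<span style="font-weight:normal">plain</span>'
    '<span style="font-weight:700">heavy</span>'
    '<strong>strong</strong><i>italic</i></div>'
)
soup = BeautifulSoup(html, "html.parser")

flags = {str(node): is_bold(node) for node in soup.find_all(string=True)}
print(flags)
```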


