Extracting Text from HTML File Using Python

Extracting text from HTML file using Python

html2text is a Python library that converts HTML into plain, Markdown-formatted text, and it does a pretty good job at this.

Extract text from HTML Tags and plain text (not wrapped in tags)

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

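# Name the parser explicitly; omitting it triggers a warning and can change results between machines.
soup = BeautifulSoup(html, "html.parser")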

print(soup.text)
# to pay
# charges
# from one's
# bank account

print(soup.text.replace('\n', ' '))
# to pay charges from one's bank account
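Replacing newlines with spaces leaves a leading and trailing space behind, and would leave doubled spaces wherever the markup contains blank lines. As a minimal sketch (the markup is a shortened copy of the snippet above), whitespace can be normalized either with split/join or with the separator and strip arguments of get_text():

```python
from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# str.split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace, so re-joining with one space normalizes the text.
flat = " ".join(soup.get_text().split())
print(flat)

# get_text() can also strip every text fragment and join the non-empty ones.
pieces = soup.get_text(separator=" ", strip=True)
print(pieces)
```

Both variants print the same single line here. The split/join form also collapses whitespace inside a fragment, which get_text(strip=True) does not.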

Extract text from HTML using Python

date = soup.find_all(class_="latest value")

You are using the wrong CSS class name: 'latest value' (with a space) is not the same as 'latest-value' (with a hyphen).

 print(soup.find_all(attrs={'class': 'latest-value'}))
 # [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>]

 for element in soup.find_all(attrs={'class': 'latest-value'}):
     print(element.text)
 # 2017-06-01
 # 1430

I prefer the attrs keyword argument, but your class_ approach works as well, given the correct CSS class name:

 for element in soup.find_all(class_='latest-value'):
     print(element.text)
 # 2017-06-01
 # 1430

Extracting text inside tags from html document

What you need is the .contents attribute, which holds a list of the tag's direct children (see the Beautiful Soup documentation).

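Find the span <span id = "1"> ... </span> using:

# HTML attribute values are strings, so search for "1" rather than the integer 1.
for x in soup.find(id="1").contents:
    print(x)

# ids are unique, so find() returns the one matching span; contents[0] is its first child.
x = soup.find(id="1").contents[0]
print(x)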

This will give you an empty line, followed by 10, followed by another empty line. This is because the string in the HTML really does contain those line breaks: as you can see in the HTML, 10 sits on its own line.

The string will correctly be '\n10\n'.

If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helps.
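Slicing with [1:-1] only works when there is exactly one character of padding on each side. A slightly more robust sketch, assuming markup shaped like the span described above, is to strip the whitespace instead:

```python
from bs4 import BeautifulSoup

# Assumed markup: the span from the question, with 10 on its own line.
html = '<span id="1">\n10\n</span>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find(id="1")

raw = span.contents[0]                   # the text node, line breaks included
clean = raw.strip()                      # removes any amount of surrounding whitespace
also_clean = span.get_text(strip=True)   # same result without touching .contents

print(repr(raw))
print(repr(clean))
print(repr(also_clean))
```

strip() keeps working if the page later adds indentation or extra blank lines around the value.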

Extract text from html file with BeautifulSoup/Python

The find_all method returns a list. Iterate over the list to get the text.

for name in names:
    print(name.text)

Returns:

Baden-Württemberg
Bayern
Berlin

The built-in Python functions dir() and type() are always handy for inspecting an object.

print(dir(names))

[...,
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort',
 'source']
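The list methods in that dir() output (append, pop, sort, and so on) are the hint that the result behaves like a list. A small sketch with type(), using hypothetical markup built from the three names printed above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the three names come from the output above.
html = "<ul><li>Baden-Württemberg</li><li>Bayern</li><li>Berlin</li></ul>"
soup = BeautifulSoup(html, "html.parser")

names = soup.find_all("li")

print(type(names).__name__)      # the container returned by find_all
print(isinstance(names, list))   # it subclasses list, so it can be iterated
print(type(names[0]).__name__)   # each item is a Tag, which has .text

texts = [name.get_text() for name in names]
print(texts)
```

Calling .text on the container itself is what fails; the attribute belongs to the individual items.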

Function to open a file and extract text from html Python

Try it like this:

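from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text is kept: the title, paragraphs, and headings h1 to h6.
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output once and write inside the loop, so the text of every URL
    # is saved rather than only the text of the last one.
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the URL list
            html = urlopen(url).read().decode("utf8")
            bs = BeautifulSoup(html, "html.parser")
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')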
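If the HTML is already on disk rather than behind a URL, the same tag selection can be applied to a local file. This is a minimal sketch; the helper names extract_text and extract_from_file and the sample page are made up for illustration:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Same selection as above: title, paragraphs, and headings h1 to h6.
TAGS = ["title", "p"] + [f"h{n}" for n in range(1, 7)]

def extract_text(html):
    """Return the text of the selected tags, one tag per line."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(tag.get_text(strip=True) for tag in soup.find_all(TAGS))

def extract_from_file(path):
    """Read a local HTML file and extract its text (hypothetical helper)."""
    return extract_text(Path(path).read_text(encoding="utf-8"))

# Illustrative page written to a temporary file so the example is self-contained.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Heading</h1><p>First paragraph.</p><div>ignored</div>"
    "</body></html>"
)
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(sample, encoding="utf-8")
    result = extract_from_file(page)

print(result)
```

Keeping the parsing in its own function means it can be checked against a fixed string without any network access.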

Extract specific portions in html file using python

Here you go, mate. I found that on this site the claims section is its own HTML <section> element with an identifying attribute (itemprop="claims"), which makes things easier. I just collected that section and printed it as a string so you can play with it.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry")
soup = BeautifulSoup(page.content, 'html.parser')
# find_all returns a list-like result with every matching <section>
claim_sect = soup.find_all('section', attrs={"itemprop": "claims"})
print('This is the raw content: \n')
print(str(claim_sect))
print('This is the variable type: \n')
print(str(type(claim_sect)))
# First matching section; this is a Tag, so call str() or .get_text() on it as needed
str_sect = claim_sect[0]
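To go from the selected section to plain text without hitting the network, here is a sketch on hypothetical markup. Only the itemprop="claims" attribute comes from the answer above; the inner structure and the claim wording are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real patent page is more deeply nested.
html = """
<section itemprop="description"><p>Background text.</p></section>
<section itemprop="claims">
  <div>1. A first claim.</div>
  <div>2. A second claim.</div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when the section is missing.
claim_sect = soup.find("section", attrs={"itemprop": "claims"})

claims = []
if claim_sect is not None:
    # stripped_strings yields each text fragment with surrounding whitespace
    # removed and skips fragments that are only whitespace.
    claims = list(claim_sect.stripped_strings)

print(claims)
```

The None check matters on live pages, where a missing section would otherwise raise an AttributeError.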

Extract all text from HTML file while checking for boldness (Python)

You could check whether the text's parent is a b tag, or whether the parent has a style attribute that mentions font-weight, to get a step closer:

for e in soup.find_all(text=True, recursive=True):
    data.append({
        'text':e,
        'isBoldTag': True if e.parent.name == 'b' else False,
        'isBoldStyle':  True if e.parent.get('style') and 'font-weight' in e.parent.get('style') else False
    })

Example

from bs4 import BeautifulSoup

html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''

soup = BeautifulSoup(html, "html.parser")

data = []

for e in soup.find_all(text=True, recursive=True):
    data.append({
        'text':e,
        'isBoldTag': True if e.parent.name == 'b' else False,
        'isBoldStyle':  True if e.parent.get('style') and 'font-weight' in e.parent.get('style') else False
    })

data

Output

[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]

Or as a DataFrame, via pd.DataFrame(data) (requires import pandas as pd):

	text	isBoldTag	isBoldStyle
0	text (1)	False	False
1	text (2)	False	False
2	text (3)	False	False
3	text (4)	False	False
4	text (5)	False	False
5	text (6)	False	False
6	text (7)	False	False
7	bold text (8)	False	True
8	text (9)	False	False
9	bold text (10)	False	True
10	text (11)	False	False
11	bold text (12)	True	False
12	text (13)	False	False
13	bold text (14)	True	False
14	text (15)	False	False
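The parent check has two gaps: it would flag font-weight:normal as bold, and it looks only at the direct parent (in the example the b tag happens to be the direct parent of "bold text (14)"). A possible refinement is sketched below. The helper names, the treatment of strong as bold, and the threshold of 600 for numeric weights are assumptions, and stylesheet rules are not considered at all:

```python
import re

from bs4 import BeautifulSoup

BOLD_TAGS = {"b", "strong"}
WEIGHT_RE = re.compile(r"font-weight\s*:\s*([a-z0-9]+)", re.IGNORECASE)

def declared_weight(style):
    """Return True/False for an inline font-weight, or None if none is declared."""
    match = WEIGHT_RE.search(style or "")
    if match is None:
        return None
    value = match.group(1).lower()
    if value in {"bold", "bolder"}:
        return True
    # Assumption: numeric weights of 600 and above count as bold.
    return value.isdigit() and int(value) >= 600

def is_bold(text_node):
    """Walk up from the text node; the nearest declaration or bold tag decides."""
    for ancestor in text_node.parents:
        weight = declared_weight(ancestor.get("style"))
        if weight is not None:
            return weight
        if ancestor.name in BOLD_TAGS:
            return True
    return False

# Illustrative markup covering the cases the parent-only check gets wrong.
html = (
    '<div><span style="font-weight:normal">plain</span>'
    '<b><a href="#">nested bold</a></b>'
    '<span style="font-weight:700">heavy</span>'
    "<strong>strong text</strong></div>"
)
soup = BeautifulSoup(html, "html.parser")
result = {str(node): is_bold(node) for node in soup.find_all(string=True)}
print(result)
```

Walking the ancestors nearest-first roughly mirrors how an inner inline declaration overrides an outer one, but it is still only an approximation of real CSS.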
