Extracting text from HTML file using Python
html2text is a Python program that does a pretty good job at this.
Extract text from HTML Tags and plain text (not wrapped in tags)
from bs4 import BeautifulSoup
html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
print(soup.text)
# to pay
# charges
# from one's
# bank account
print(soup.text.replace('\n', ' '))
# to pay charges from one's bank account
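If the goal is a single line of text, get_text() also accepts a separator and a strip flag, which avoids the leading and trailing spaces that replace('\n', ' ') leaves behind. A minimal sketch, reusing a shortened version of the markup above (the variable names one_line and fragments are only illustrative):

```python
from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Join every text fragment with a single space, stripping whitespace from
# each fragment and dropping fragments that are empty after stripping.
one_line = soup.get_text(" ", strip=True)
print(one_line)
# to pay charges from one's bank account

# stripped_strings yields the same fragments one at a time.
fragments = list(soup.stripped_strings)
```

This keeps the cleanup inside Beautiful Soup instead of post-processing the string by hand.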
Extract text from HTML using Python
date = soup.find_all(class_="latest value")
You are using the wrong CSS class name ('latest value' != 'latest-value').
print(soup.find_all(attrs={'class': 'latest-value'}))
# [<div class="latest-value">2017-06-01</div>, <div class="latest-value">1430</div>]
for element in soup.find_all(attrs={'class': 'latest-value'}):
    print(element.text)
# 2017-06-01
# 1430
I prefer to use the attrs keyword argument, but your method works as well (given the correct CSS class name):
for element in soup.find_all(class_='latest-value'):
    print(element.text)
# 2017-06-01
# 1430
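To see the difference between the two class names side by side, here is a small self-contained sketch. The markup is hypothetical, reconstructed from the output shown above, since the original page is not included in the answer:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reconstructed from the output shown above.
html = """
<div class="latest-value">2017-06-01</div>
<div class="latest-value">1430</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space is not a hyphen: no element here has the class string "latest value".
wrong = soup.find_all(class_="latest value")
right = soup.find_all(class_="latest-value")

print(len(wrong), len(right))
# 0 2

values = [element.text for element in right]
print(values)
# ['2017-06-01', '1430']
```

An empty result from find_all is usually the first hint that the name or attribute being searched for does not match the markup.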
Extracting text inside tags from html document
What you need is the .contents attribute, which holds a list of a tag's direct children (see the Beautiful Soup documentation).
Find the span <span id = "1"> ... </span> using
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0]  # since there will only be one element with the id 1
print(x)
This will give you:
10
that is, an empty line followed by 10 followed by another empty line. This is because the string in the HTML really contains those newlines: as you can see in the markup, 10 sits on its own line.
The string is, correctly, '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helped.
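Slicing with [1:-1] only works when there is exactly one character on each side. A minimal sketch of a more forgiving variant, using str.strip() or get_text(strip=True); the markup here is a hypothetical stand-in that matches the description above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the description: the number sits on its own line.
html = '<span id="1">\n10\n</span>'
soup = BeautifulSoup(html, "html.parser")

raw = soup.find(id="1").contents[0]
print(raw == "\n10\n")
# True

# strip() removes any amount of surrounding whitespace, not just one character.
clean = raw.strip()

# get_text(strip=True) does the same without going through .contents.
also_clean = soup.find(id="1").get_text(strip=True)
print(clean, also_clean)
# 10 10
```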
Extract text from html file with BeautifulSoup/Python
The find_all method returns a list. Iterate over the list to get the text.
for name in names:
    print(name.text)
Returns:
Baden-Württemberg
Bayern
Berlin
The built-in Python functions dir() and type() are always handy for inspecting an object.
print(dir(names))
[...,
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'append',
'clear',
'copy',
'count',
'extend',
'index',
'insert',
'pop',
'remove',
'reverse',
'sort',
'source']
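The same inspection can be done with type() and isinstance(). A short sketch, assuming hypothetical markup (the original page is not shown in the answer), that shows why calling .text on the result of find_all fails while iterating works:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the original page is not shown in the answer.
html = "<ul><li>Baden-Württemberg</li><li>Bayern</li><li>Berlin</li></ul>"
soup = BeautifulSoup(html, "html.parser")

names = soup.find_all("li")
print(type(names).__name__)
# ResultSet
print(isinstance(names, list))
# True

# The list itself has no .text; only the elements inside it do.
raised = False
try:
    names.text
except AttributeError:
    raised = True

texts = [name.text for name in names]
print(texts)
# ['Baden-Württemberg', 'Bayern', 'Berlin']
```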
Function to open a file and extract text from html Python
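Try it like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text is collected: the title, paragraphs, and headings h1 to h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output file once, so every page is kept and not only the last one
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the URL list
            html = urlopen(url).read().decode("utf8")
            bs = BeautifulSoup(html, "html.parser")
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')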
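The parsing step can be tried without any network access by moving it into its own function. The sketch below is only an illustration: extract_text, TEXT_TAGS, and the sample page are hypothetical names and data, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Tags whose text is collected: title, paragraphs, and headings h1 to h6.
TEXT_TAGS = ["title", "p"] + [f"h{n}" for n in range(1, 7)]

def extract_text(html):
    """Return the text of title, paragraph and heading tags, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(tag.get_text(strip=True) for tag in soup.find_all(TEXT_TAGS))

# Hypothetical sample page standing in for a downloaded document.
sample = """<html><head><title>Demo page</title></head>
<body><h1>Heading</h1><p>First paragraph.</p><div>ignored</div>
<h2>Sub</h2><p>Second paragraph.</p></body></html>"""

content = extract_text(sample)
print(content)
# Demo page
# Heading
# First paragraph.
# Sub
# Second paragraph.
```

Passing a list of tag names to find_all returns the matches in document order, so headings and paragraphs stay interleaved as they appear on the page.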
Extract specific portions in html file using python
Here you go. On this site, the claims section is an HTML section element with its own identifying attribute (itemprop="claims"), which makes things easier. I just collected the section and converted it to a string so you can play with it.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry")
soup = BeautifulSoup(page.content, 'html.parser')
claim_sect = soup.find_all('section', attrs={"itemprop":"claims"})
print('This is the raw content: \n')
print(str(claim_sect))
print('This is the variable type: \n')
print(str(type(claim_sect)))
str_sect = str(claim_sect[0])  # markup of the first matching section, as a string
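If only the readable text of the claims is wanted rather than the markup, get_text() on the matched section does the job. A minimal offline sketch: the HTML below is a hypothetical stand-in for the downloaded page (including the claim class name), and only the itemprop="claims" attribute is taken from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """<html><body>
<section itemprop="description"><p>Not wanted.</p></section>
<section itemprop="claims"><h2>Claims</h2>
<div class="claim">1. A first claim.</div>
<div class="claim">2. A second claim.</div>
</section></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when the section is missing.
claim_sect = soup.find("section", attrs={"itemprop": "claims"})
if claim_sect is None:
    claims_text = ""
else:
    claims_text = claim_sect.get_text("\n", strip=True)

print(claims_text)
# Claims
# 1. A first claim.
# 2. A second claim.
```

Checking for None first avoids an AttributeError if the page layout changes and the section is no longer found.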
Extract all text from HTML file while checking for boldness (Python)
You could check whether the text's parent is named b, or whether it has a style attribute that sets font-weight, to get a step closer:
# string=True matches every text node (string is the current name of the older text argument)
for e in soup.find_all(string=True):
    style = e.parent.get('style') or ''
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in style
    })
Example
from bs4 import BeautifulSoup
html='''<html><head><title>text (1)</title></head><body><div>text (2)</div><div>text (3)</div><div><span>text (4)</span></div><div>text (5)</div><div><span>text (6)</span><span>text (7)</span></div><div><span style="font-weight:bold;">bold text (8)</span></div><div><span>text (9)</span></div><div><span style="font-weight:700;">bold text (10)</span></div><div><span>text (11)</span></div><div><span><b>bold text (12)</b></span></div><div><span>text (13)</span><span><a href="www.google.de"><b>bold text (14)</b></a></span></div><div><span>text (15)</span></div></body></html>'''
soup = BeautifulSoup(html, "html.parser")
data = []
# string=True matches every text node (string is the current name of the older text argument)
for e in soup.find_all(string=True):
    style = e.parent.get('style') or ''
    data.append({
        'text': e,
        'isBoldTag': e.parent.name == 'b',
        'isBoldStyle': 'font-weight' in style
    })
data
Output
[{'text': 'text (1)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (2)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (3)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (4)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (5)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (6)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'text (7)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (8)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (9)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (10)', 'isBoldTag': False, 'isBoldStyle': True}, {'text': 'text (11)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (12)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (13)', 'isBoldTag': False, 'isBoldStyle': False}, {'text': 'bold text (14)', 'isBoldTag': True, 'isBoldStyle': False}, {'text': 'text (15)', 'isBoldTag': False, 'isBoldStyle': False}]
Or as a DataFrame, with pd.DataFrame(data) (after import pandas as pd):
  | text | isBoldTag | isBoldStyle |
---|---|---|---|
0 | text (1) | False | False |
1 | text (2) | False | False |
2 | text (3) | False | False |
3 | text (4) | False | False |
4 | text (5) | False | False |
5 | text (6) | False | False |
6 | text (7) | False | False |
7 | bold text (8) | False | True |
8 | text (9) | False | False |
9 | bold text (10) | False | True |
10 | text (11) | False | False |
11 | bold text (12) | True | False |
12 | text (13) | False | False |
13 | bold text (14) | True | False |
14 | text (15) | False | False |
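The check above only looks at the direct parent and treats any font-weight declaration as bold, so font-weight: normal or a b tag further up the tree would be misjudged. One possible refinement is sketched below. The helper names, the inclusion of strong, and the threshold of 600 for numeric weights are all assumptions made for illustration, and the sketch only handles inline styles, not stylesheets.

```python
import re
from bs4 import BeautifulSoup

BOLD_TAGS = {"b", "strong"}
FONT_WEIGHT = re.compile(r"font-weight\s*:\s*([^;]+)", re.IGNORECASE)

def declared_bold(style):
    """Return True/False if the inline style sets font-weight, else None."""
    match = FONT_WEIGHT.search(style or "")
    if match is None:
        return None
    value = match.group(1).strip().lower()
    if value.isdigit():
        return int(value) >= 600  # assumed threshold for "bold"
    return value in {"bold", "bolder"}

def is_bold(text_node):
    """Walk up from the text node to the nearest ancestor that decides boldness."""
    for ancestor in text_node.parents:
        if ancestor.name in BOLD_TAGS:
            return True
        decided = declared_bold(ancestor.get("style"))
        if decided is not None:
            return decided
    return False

# Hypothetical test markup covering the cases discussed above.
html = ('<div><span style="font-weight:700;">a</span>'
        '<span style="font-weight: normal">b</span>'
        '<a href="#"><b>c</b></a><strong><i>d</i></strong><p>e</p></div>')
soup = BeautifulSoup(html, "html.parser")

result = {str(node): is_bold(node) for node in soup.find_all(string=True)}
print(result)
# {'a': True, 'b': False, 'c': True, 'd': True, 'e': False}
```

Walking up through .parents means text nested inside another tag within a bold element, such as d above, is still detected.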