Python BeautifulSoup Parsing Table

python BeautifulSoup parsing table

Here you go:

data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

This gives you:

[ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', u'125.00', u'$'], 
[u'7086775850', u'PAS', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'125.00', u'$'],
[u'7355010165', u'OMT', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'145.00', u'$'],
[u'4002488755', u'OMT', u'02/12/2014', u'NB 1ST AVE @ E 23RD ST', u'5', u'115.00', u'$'],
[u'7913806837', u'OMT', u'03/03/2014', u'5015 4th Ave', u'K', u'46', u'115.00', u'$'],
[u'5080015366', u'OMT', u'03/10/2014', u'EB 65TH ST @ 16TH AV E', u'7', u'50.00', u'$'],
[u'7208770670', u'OMT', u'04/08/2014', u'333 15th St', u'K', u'70', u'65.00', u'$'],
[u'$0.00\n\n\nPayment Amount:']
]

Couple of things to note:

  • The last row in the output above (the Payment Amount) is not part of
    the table data, but that is how the table is laid out. You can filter it
    out by checking whether the list has fewer than 7 items; see the sketch
    after this list.
  • The last column of every row has to be handled separately, since it is an input text box.
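
A minimal sketch of both points (the input handling assumes the last cell holds an <input> whose value attribute carries the amount; adjust to your actual markup):

# Drop the trailing "Payment Amount" row: real data rows have at least 7 cells.
data = [row for row in data if len(row) >= 7]

# Hypothetical handling of the input-box column: read the value attribute of
# the <input> in the row instead of its (empty) text.
for row in rows:
    box = row.find('input')
    if box is not None:
        row_amount = box.get('value', '')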

How to parse an HTML table in Python

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", as shown below:

import requests
from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text

Python BeautifulSoup: parsing multiple tables with the same id

tb is just a list containing all the matching tables, and you can get your table by index. Your target table appears to be at index 4.

tb = soup.findAll('table',{'id': 'btable'})
table_str = str(tb[4]) #select only one table by its index
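
If you then want the data itself, one possible follow-up (a sketch, assuming the selected table is a regular data table) is to hand that string to pandas:

import pandas as pd

df = pd.read_html(table_str)[0]  # parse the single selected table into a DataFrame
print(df.head())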

Python parse table from HTML using BeautifulSoup

According to your error message, the problem is that the variable tables is a string. Try it without using 'tbody'.

from bs4 import BeautifulSoup
from tqdm import tqdm

for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    rows = table.find_all('tr')
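
To actually collect the cell text from those rows, a minimal sketch along the same lines (parsed is just an illustrative name) could be:

parsed = {}
for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    if table is None:
        continue  # skip documents without a table
    parsed[filename] = [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]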

How to Parse Table with BeautifulSoup4 And Elegantly Print?

Gather your text data into a flat array of individual rows and cells. Transpose this, so everything per column is gathered into a row. Create an array containing the length of the longest item per (originally) column. Use this data to space out each cell, while printing rows. In code:

from bs4 import BeautifulSoup

content = '''
<table class="gridtable">
<tbody>
<tr>
<th>Store #</th><th>City Name</th><th>Orders</th></tr>
<tr><td>1</td><td style="text-align:left">Phoenix</td><td>70</td></tr>
<tr><td>2</td><td style="text-align:left">Columbus</td><td>74</td></tr>
<tr><td>3</td><td style="text-align:left">New York</td><td>112</td></tr>
<tr><td></td><td>TOTAL</td><td>256</td></tr></tbody>
</table>
'''

def print_table_nice(table):
    cells = [[cell.text for cell in row.find_all(['td', 'th'])] for row in table.find_all('tr')]
    transposed = list(map(list, zip(*cells)))
    widths = [str(max([len(str(item)) for item in items])) for items in transposed]
    for row in cells:
        print(' '.join(("{:" + width + "s}").format(item) for width, item in zip(widths, row)))

soup = BeautifulSoup(content, 'html.parser')
tables = soup.find_all('table')
table = tables[0]
print_table_nice(table)

Result:

Store # City Name Orders
1       Phoenix   70
2       Columbus  74
3       New York  112
        TOTAL     256

which seems about as elegant as you can do on a console. (To add vertical lines, just join the rows with a | instead of a space.)

I inlined the table data because I don't have access to your Page.html, but getting access to the table data does not seem to be the problem here.


Oh let's add lines all around. Just because I can:

def print_table_nice(table):
    header = [cell.text for cell in table.select('tr th')]
    cells = [[cell.text for cell in row.select('td')] for row in table.select('tr') if row.select('td')]
    table = [header] + cells
    transposed = list(map(list, zip(*table)))
    widths = [str(max([len(str(item)) for item in items])) for items in transposed]
    print('+' + ('-+-'.join('-' * int(width) for width in widths)) + '+')
    print('|' + (' | '.join(("{:" + width + "s}").format(item) for width, item in zip(widths, header))) + '|')
    print('+' + ('-+-'.join('-' * int(width) for width in widths)) + '+')
    for row in cells:
        print('|' + (' | '.join(("{:" + width + "s}").format(item) for width, item in zip(widths, row))) + '|')
    print('+' + ('-+-'.join('-' * int(width) for width in widths)) + '+')

It turned out to be an interesting complication because this requires the th to be separated from the td rows. Won't work as-is for multi-line rows, though. Result, then, is:

+--------+-----------+-------+
|Store # | City Name | Orders|
+--------+-----------+-------+
|1       | Phoenix   | 70    |
|2       | Columbus  | 74    |
|3       | New York  | 112   |
|        | TOTAL     | 256   |
+--------+-----------+-------+

Parsing HTML tables with BeautifulSoup in Python

You can use find_all() and get_text() to gather the table data. The find_all() method returns a list of all matching descendants of a tag, and get_text() returns a string with a tag's text contents. First select all tables; for each table select all rows; for each row select all columns; and finally extract the text. That collects all table data in the same order and structure in which it appears in the HTML document.

from bs4 import BeautifulSoup

html = 'my html document'
soup = BeautifulSoup(html, 'html.parser')
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]

The tables variable contains all the tables in the document, and it is a nested list that has the following structure,

tables -> rows -> columns
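
For example, indexing into that structure (purely illustrative indices):

first_table = tables[0]      # first <table> in the document
second_row = first_table[1]  # its second <tr>
first_cell = second_row[0]   # text of the first <td> in that row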

If the structure is not important and you only want to collect text from all tables in one big list, use:

table_data = [i.text for i in soup.find_all('td')]

Or if you prefer CSS selectors:

table_data = [i.text for i in soup.select('td')]

If the goal is to gather table data regardless of HTML attributes or other parameters, then it may be best to use pandas. The pandas.read_html() function reads HTML from URLs, files, or strings, parses it, and returns a list of DataFrames containing the table data.

import pandas as pd

html = 'my html document'
tables = pd.read_html(html)

Note that pandas.read_html() is more fragile than BeautifulSoup, and it will raise a ValueError if it fails to parse the HTML or if the document doesn't contain any tables.
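
A small sketch of guarding against that:

import pandas as pd

try:
    tables = pd.read_html(html)
except ValueError:
    tables = []  # no tables found, or the HTML could not be parsed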

Parsing a table with BeautifulSoup

from bs4 import BeautifulSoup
import re

html = """
<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
<thead>
<tr>
<th width="14%" class="text-left nowrap fixed-side">Session Date</th>
<th width="14%" class="text-center">
<b>1</b>
</th>
<th width="14%" class="text-center">
<b>2</b>
</th>
<th width="14%" class="text-center">
<b>3</b>
</th>
<th width="14%" class="text-center">
<b>4</b>
</th>
<th width="14%" class="text-center">
<b>5</b>
</th>
<th width="14%" class="text-center">
<b>6</b>
</th>
</tr>
</thead>
<tbody class="tr-border-bottom">
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('10 Jun 2020');">10 Jun 2020</a>
<br> Wednesday
</th>

<td class="pb-15 text-center">
<a href="#" id="1217428_1_10/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
target = soup.find("table", class_=re.compile("^table table-borderless"))

items = [item.get_text(strip=True) for item in target.findAll(
    "td", class_="pb-15 text-center")]

print(items)

Output:

['8:15 AM ✔']
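
If you also want to know which session date each slot belongs to, a possible extension (a sketch based on the markup above, where the row header is a <th> carrying the fixed-side class) is:

slots_by_date = {}
for tr in target.select("tbody tr"):
    th = tr.find("th", class_="fixed-side")
    if th is None:
        continue
    date = th.a.get_text(strip=True) if th.a else th.get_text(strip=True)
    slots_by_date[date] = [td.get_text(strip=True)
                           for td in tr.find_all("td", class_="pb-15 text-center")]

print(slots_by_date)  # {'10 Jun 2020': ['8:15 AM ✔']}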

Why does parsing a table with BeautifulSoup not work on this website as intended?

Problem: the page uses JavaScript to fetch and display the content, so you cannot just use requests or similar libraries, because the JavaScript code would not be executed.

Solution: use selenium to load the page, then parse the content with BeautifulSoup.

Sample code here:

from selenium import webdriver
from bs4 import BeautifulSoup

d = webdriver.Chrome()
d.get(url)  # url is the page you want to scrape
bs = BeautifulSoup(d.page_source, 'html.parser')

To use webdriver.Chrome you will also have to download ChromeDriver and put the executable in your project folder or on your PATH.
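
A possible way to combine the two steps (a sketch; the explicit wait assumes the data ends up inside an ordinary <table> once the JavaScript has run):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

d = webdriver.Chrome()
d.get(url)  # url is the page you want to scrape

# Wait up to 10 seconds for a table to appear before grabbing the page source.
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))

soup = BeautifulSoup(d.page_source, 'html.parser')
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in soup.find('table').find_all('tr')]
d.quit()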

Beautiful Soup Parsing table within a div

I think your current work makes a lot of sense, good job!

To move ahead, we can leverage the structure of the td elements on the eBay page, and the fact that they come in pairs with an attrLabels class on the label cell, to extract the specific data.

This gives you the data in the same order as it appears on the page:

tds = attribute.findAll("td")
ordered_data = []
for i in range(0, len(tds), 2):
    if tds[i].get('class') == ['attrLabels']:
        key = tds[i].text.strip().strip(":")
        value = tds[i+1].span.text
        ordered_data.append({key: value})

And this gives you the same thing but in a dict with key-value pairs so that you can easily access each attribute:

tds = attribute.findAll("td")
searchable_data = {}
for i in range(0, len(tds), 2):
    if tds[i].get('class') == ['attrLabels']:
        key = tds[i].text.strip().strip(":")
        value = tds[i+1].span.text
        searchable_data[key] = value
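
Usage is then a plain dict lookup; the "Brand" key below is just a hypothetical example of whatever labels the listing actually has:

print(searchable_data.get("Brand"))  # None if the listing has no such label
print(list(searchable_data.keys()))  # every attribute label that was found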

How to parse a table with an internal link using BeautifulSoup?

You don't need to go that long way around; you can simply use the pandas.read_html() function to read it into a table, and then convert it to a dict using DataFrame.to_dict():

import pandas as pd

table = """
<table log-set-param="table_view" class="ddtable qytable"><tbody>

<tr><th>Game_name</th><th>date</th><th>team</th><th>score</th><th>opponent</th><th>starting</th><th>play</th><th>scoring</th><th>warning</th><th>details</th><th>-</th></tr>

<tr>
<td align="center" valign="middle"><a target="_blank" href="/item/%E8%8B%B1%E8%B6%85">Premier_leagure</a></td>
<td align="center" valign="middle">11-19 23:00</td><td align="center" valign="middle"><a target="_blank" href="/item/%E6%96%AF%E6%89%98%E5%85%8B%E5%9F%8E">Stoke_city</a></td>
<td align="center" valign="middle"><b>0 - 1</b></td><td align="center" valign="middle"><a target="_blank" href="/item/%E4%BC%AF%E6%81%A9%E8%8C%85%E6%96%AF">Nournemouth</a></td>
<td align="center" valign="middle">Yes</td><td align="center" valign="middle">68’</td>
<td align="center" valign="middle">0</td><td align="center" valign="middle">-</td>
<td align="center" valign="middle">-</td><td align="center" valign="middle"><a target="_blank" href="/item/%E8%AF%A6%E6%83%85">detail</a></td>
</tr>
</tbody></table>
"""
df = pd.read_html(table)[0]
print(df)

df.to_csv("Data.csv", index=False)

Output: a single-row DataFrame with one column per header cell (Game_name, date, team, score, and so on).

And to convert it to dict:

target = df.to_dict()

print(target)

Output:

{'Game_name': {0: 'Premier_leagure'}, 'date': {0: '11-19 23:00'}, 'team': {0: 'Stoke_city'}, 'score': {0: '0 - 1'}, 'opponent': {0: 'Nournemouth'}, 'starting': {0: 'Yes'}, 'play': {0: '68’'}, 'scoring': {0: 0}, 'warning': {0: '-'}, 'details': {0: '-'}, '-': {0: 'detail'}}

Note: regarding your question about how to make sure the href="#5" is linked to the table:

There are two methods:

  1. Use the pandas.read_html() function along with attrs=, as shown below.

Here we use the table's attributes:

<table log-set-param="table_view" class="ddtable qytable"><tbody>

So:

df = pd.read_html(table, attrs={'log-set-param': 'table_view', 'class': 'ddtable qytable'})[0]

  2. Second method: using your lovely href="#5", we will use bs4 :P

So we will locate it first and then get the next table after it.


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html here is the full page source that contains the anchor

element = soup.find("a", href="#5").find_next("table")
df = pd.read_html(str(element))[0]

print(df)

