BeautifulSoup: Get the Contents of a Specific Table

This is not the exact code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()
bs = BeautifulSoup(html, 'html.parser')
# Locate the table by its id, then collect every <tr> inside it
table = bs.find('table', id='Table1')
rows = table.find_all('tr')
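
For a variant that runs without a network request, here is a minimal sketch. The sample HTML, including the second table with id "Other", is made up for illustration; only the id "Table1" comes from the demo above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; only the id "Table1" comes from the demo above.
html = """
<table id="Other"><tr><td>skip me</td></tr></table>
<table id="Table1">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the table by id, then read the text of each cell, row by row.
table = soup.find("table", id="Table1")
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)
```

The result is a list of rows, each a list of cell strings, which is usually easier to work with than the raw tr elements.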

Get table contents only after specific span content in BeautifulSoup

Just use find_all_next:

table = soup.find(text='Item 3').find_all_previous()[2].find_all_next()

My full code:

from bs4 import BeautifulSoup

html = '''
<div><span>Item 1</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>

<div><span>Item 2</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>

<div><span>Item 3</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>

<div><span>Item 4</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>
'''

soup = BeautifulSoup(html,'html5lib')

table = soup.find(text='Item 3').find_all_previous()[2].find_all_next()

table_html = ''.join([str(elem) for elem in table])

Output:

>>> table
[<div><span>Item 3</span></div>, <span>Item 3</span>, <div>some content</div>, <div>table content<table><tbody></tbody></table></div>, <table><tbody></tbody></table>, <tbody></tbody>, <div><span>Item 4</span></div>, <span>Item 4</span>, <div>some content</div>, <div>table content<table><tbody></tbody></table></div>, <table><tbody></tbody></table>, <tbody></tbody>]

>>> table_html
'<div><span>Item 3</span></div><span>Item 3</span><div>some content</div><div>table content<table><tbody></tbody></table></div><table><tbody></tbody></table><tbody></tbody><div><span>Item 4</span></div><span>Item 4</span><div>some content</div><div>table content<table><tbody></tbody></table></div><table><tbody></tbody></table><tbody></tbody>'
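
Note that find_all_next returns everything up to the end of the document, so the output above also contains the "Item 4" block. If only the table that follows the span is wanted, find_next may be a better fit. The sketch below assumes a simplified version of the sample HTML (with real table rows, so that html.parser is enough) and is not the original answer's method.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical variant of the sample HTML above.
html = """
<div><span>Item 2</span></div>
<div>some content</div>
<div><table><tr><td>table 2</td></tr></table></div>
<div><span>Item 3</span></div>
<div>some content</div>
<div><table><tr><td>table 3</td></tr></table></div>
<div><span>Item 4</span></div>
<div>some content</div>
<div><table><tr><td>table 4</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <span> whose text is exactly "Item 3" ...
span = soup.find("span", string="Item 3")
# ... and take the first <table> that appears after it in document order.
table = span.find_next("table")
print(table.get_text(strip=True))
```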

Python BeautifulSoup parsing table

Here you go:

data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

This gives you:

[ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', u'125.00', u'$'], 
[u'7086775850', u'PAS', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'125.00', u'$'],
[u'7355010165', u'OMT', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'145.00', u'$'],
[u'4002488755', u'OMT', u'02/12/2014', u'NB 1ST AVE @ E 23RD ST', u'5', u'115.00', u'$'],
[u'7913806837', u'OMT', u'03/03/2014', u'5015 4th Ave', u'K', u'46', u'115.00', u'$'],
[u'5080015366', u'OMT', u'03/10/2014', u'EB 65TH ST @ 16TH AV E', u'7', u'50.00', u'$'],
[u'7208770670', u'OMT', u'04/08/2014', u'333 15th St', u'K', u'70', u'65.00', u'$'],
[u'$0.00\n\n\nPayment Amount:']
]

A couple of things to note:

  • The last row in the output above (the Payment Amount) is not really
    part of the table data, but that is how the table is laid out. You can
    filter it out by checking whether the row has fewer than 7 elements.
  • The last column of every row has to be handled separately, since it is an input text box.
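
The two notes above can be sketched as a post-processing step. The rows are copied from the output above; the name MIN_COLUMNS is made up here, and dropping the trailing "$" cell is only one assumed way of handling the input-box column.

```python
# Three rows copied from the output above.
data = [
    ["1359711259", "SRF", "08/05/2013", "5310 4 AVE", "K", "19", "125.00", "$"],
    ["4002488755", "OMT", "02/12/2014", "NB 1ST AVE @ E 23RD ST", "5", "115.00", "$"],
    ["$0.00\n\n\nPayment Amount:"],
]

MIN_COLUMNS = 7  # hypothetical name; the threshold comes from the note above

# Keep only rows long enough to be real line items.
line_items = [row for row in data if len(row) >= MIN_COLUMNS]

# Assumption: the trailing "$" cell is the input-box column, so drop it.
line_items = [row[:-1] if row[-1] == "$" else row for row in line_items]

for row in line_items:
    print(row)
```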

How to extract a table from a website using BeautifulSoup?

There is only one table, so you can iterate over the <tr> elements in that table.

If you want the data frame to include only one particular state, you can either filter the rows before adding them to the data frame, or build a data frame of all the data and filter that into a subset data frame.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
for tr in soup.find('table', class_='data').find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    # To keep only the LA rows, filter here
    if len(row) == 3 and row[2] == 'LA':
        data.append(row)
df = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])
print(df)

Output:

     FIPS          Name State
0   22001        Acadia    LA
1   22003         Allen    LA
2   22005     Ascension    LA
3   22007    Assumption    LA
4   22009     Avoyelles    LA
..    ...           ...   ...
63  22127          Winn    LA
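
The other option mentioned above, building a data frame of all rows and filtering it afterwards, might look like the sketch below. The inline HTML is a hypothetical stand-in for the page, assumed to have the same three columns, so the example runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page (assumed column layout).
html = """
<table class="data">
  <tr><th>FIPS</th><th>Name</th><th>State</th></tr>
  <tr><td>01001</td><td>Autauga</td><td>AL</td></tr>
  <tr><td>22001</td><td>Acadia</td><td>LA</td></tr>
  <tr><td>22003</td><td>Allen</td><td>LA</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", class_="data").find_all("tr"):
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(row) == 3:  # skips the header row, which has only <th> cells
        rows.append(row)

all_df = pd.DataFrame(rows, columns=["FIPS", "Name", "State"])

# Filter afterwards with a boolean mask on the State column.
la_df = all_df[all_df["State"] == "LA"].reset_index(drop=True)
print(la_df)
```

Filtering after the fact keeps the full data frame around, which can be handy if more than one state is needed later.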

Find specific table using BeautifulSoup with specific caption

I would find all the captions and then match on the caption text, like this:

from bs4 import BeautifulSoup
import re
import requests

header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

redirect = requests.get('http://goblueraiders.com/boxscore.aspx?path=baseball&id=6117', headers=header).text
soup = BeautifulSoup(redirect, 'html.parser')

for caption in soup.find_all('caption'):
    if caption.get_text() == 'Tennessee Tech - Pitching Stats':
        table = caption.find_parent('table', {'class': 'sidearm-table collapse-on-medium accordion'})
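
One thing to watch with the loop above: if no caption matches, table is never assigned. A small helper that returns None in that case makes this explicit. The sketch below uses made-up markup and a hypothetical helper name, table_by_caption, so it can run offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two captioned tables.
html = """
<table><caption>Batting Stats</caption><tr><td>1</td></tr></table>
<table><caption> Pitching Stats </caption><tr><td>2</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def table_by_caption(soup, wanted):
    """Return the first table whose caption text equals `wanted`, else None."""
    for caption in soup.find_all("caption"):
        # strip=True tolerates whitespace around the caption text
        if caption.get_text(strip=True) == wanted:
            return caption.find_parent("table")
    return None


table = table_by_caption(soup, "Pitching Stats")
print(table.td.get_text())
```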

How do you get all the rows from a particular table using BeautifulSoup?

This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method (in bs4, an older alias of find_all); you can then get the text inside each cell with the string property.

>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html>
... <body>
... <table>
... <th><td>column 1</td><td>column 2</td></th>
... <tr><td>value 1</td><td>value 2</td></tr>
... </table>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html, 'html.parser')
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
...     cells = row.findChildren('td')
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
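
One caveat about the string property: it only returns text when the cell has a single child. For cells with nested markup it returns None, and get_text() is the safer choice. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table: one plain cell, one cell with nested markup.
html = "<table><tr><td>plain</td><td>value <b>2</b></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

plain, mixed = soup.find_all("td")

print(plain.string)      # the cell holds a single string, so .string returns it
print(mixed.string)      # None: the cell has more than one child
print(mixed.get_text())  # concatenates all text inside the cell
```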

