BeautifulSoup: Get the contents of a specific table
This is not the exact code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.
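from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url is the address of the page to parse
bs = BeautifulSoup(html, 'html.parser')
table = bs.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="Table1")
rows = table.find_all(lambda tag: tag.name=='tr')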
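As a self-contained illustration of the same lookup, here is a minimal sketch that parses an inline HTML string instead of fetching a URL. The sample markup and its cell values are made up for the example; only the id "Table1" comes from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one of which has id="Table1".
html = """
<table id="Other"><tr><td>skip me</td></tr></table>
<table id="Table1">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find() filter on attributes, so id="Table1" does the
# same job as a callable filter that checks the tag name and its id.
table = soup.find("table", id="Table1")
rows = table.find_all("tr")

# Collect the stripped text of every cell, one inner list per row.
cells = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows]
print(cells)
```

If no table has that id, find() returns None, so real code would probably want to check for that before calling find_all.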
Get table contents only after specific span content in Beautifulsoup
Just use find_all_next:
table = soup.find(text='Item 3').find_all_previous()[2].find_all_next()
My full code:
from bs4 import BeautifulSoup
html = '''
<div><span>Item 1</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>
<div><span>Item 2</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>
<div><span>Item 3</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>
<div><span>Item 4</span></div>
<div>some content</div>
<div><table><tbody>table content</tbody></table></div>
'''
soup = BeautifulSoup(html,'html5lib')
table = soup.find(text='Item 3').find_all_previous()[2].find_all_next()
table_html = ''.join([str(elem) for elem in table])
Output:
>>> table
[<div><span>Item 3</span></div>, <span>Item 3</span>, <div>some content</div>, <div>table content<table><tbody></tbody></table></div>, <table><tbody></tbody></table>, <tbody></tbody>, <div><span>Item 4</span></div>, <span>Item 4</span>, <div>some content</div>, <div>table content<table><tbody></tbody></table></div>, <table><tbody></tbody></table>, <tbody></tbody>]
>>> table_html
'<div><span>Item 3</span></div><span>Item 3</span><div>some content</div><div>table content<table><tbody></tbody></table></div><table><tbody></tbody></table><tbody></tbody><div><span>Item 4</span></div><span>Item 4</span><div>some content</div><div>table content<table><tbody></tbody></table></div><table><tbody></tbody></table><tbody></tbody>'
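Note that find_all_next returns every element after the starting point, which is why the output above also contains the Item 4 block. If only the first table after the matching span is wanted, find_next is a narrower alternative. The sketch below assumes that reading of the question and uses made-up cell text (t2, t3, t4) so the result is easy to check.

```python
from bs4 import BeautifulSoup

# Same layout as above, with invented cell text so each table is identifiable.
html = """
<div><span>Item 2</span></div>
<div>some content</div>
<div><table><tr><td>t2</td></tr></table></div>
<div><span>Item 3</span></div>
<div>some content</div>
<div><table><tr><td>t3</td></tr></table></div>
<div><span>Item 4</span></div>
<div>some content</div>
<div><table><tr><td>t4</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the span by its exact text, then take the first <table> that
# follows it in document order.
span = soup.find("span", string="Item 3")
table = span.find_next("table")

print(table.get_text(strip=True))
```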
python BeautifulSoup parsing table
Here you go:
data = []
table = soup.find('table', attrs={'class':'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
This gives you:
[ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', u'125.00', u'$'],
[u'7086775850', u'PAS', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'125.00', u'$'],
[u'7355010165', u'OMT', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'145.00', u'$'],
[u'4002488755', u'OMT', u'02/12/2014', u'NB 1ST AVE @ E 23RD ST', u'5', u'115.00', u'$'],
[u'7913806837', u'OMT', u'03/03/2014', u'5015 4th Ave', u'K', u'46', u'115.00', u'$'],
[u'5080015366', u'OMT', u'03/10/2014', u'EB 65TH ST @ 16TH AV E', u'7', u'50.00', u'$'],
[u'7208770670', u'OMT', u'04/08/2014', u'333 15th St', u'K', u'70', u'65.00', u'$'],
[u'$0.00\n\n\nPayment Amount:']
]
A couple of things to note:
- The last row in the output above, the Payment Amount, is not part of the table's data, but that is how the table is laid out. You can filter it out by checking whether the length of the list is less than 7.
- The last column of every row has to be handled separately, since it is an input text box.
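Both notes can be sketched in a few lines. The markup below is invented for illustration: the table is shortened to one data row, and the attributes of the <input> element are assumptions, since the original page is not shown. The sketch also counts cells rather than non-empty strings when deciding which rows to skip.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened table: the last cell of a data row holds an
# <input>, and a trailing summary row has only one cell.
html = """
<table class="lineItemsTable"><tbody>
<tr><td>1359711259</td><td>SRF</td><td>08/05/2013</td><td>5310 4 AVE</td>
    <td>K</td><td>19</td><td>125.00</td>
    <td>$<input type="text" value="125.00"></td></tr>
<tr><td>$0.00 Payment Amount:</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
data = []
for row in soup.find("table", class_="lineItemsTable").find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 7:
        continue  # skip the summary row, which is not a real line item
    values = [td.get_text(strip=True) for td in cols]
    # A text box has no text node; its content lives in the value attribute.
    box = cols[-1].find("input")
    if box is not None:
        values[-1] = box.get("value", "")
    data.append(values)

print(data)
```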
How to extract a table from a website using BeautifulSoup?
There is only one table, so you can iterate over the <tr> elements in that table.
If you want the data frame to include only one particular state, you can either filter the rows before adding them to the data frame, or build a data frame of all the data and filter it to get a subset.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
for tr in soup.find('table', class_='data').find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    # To keep only the LA rows, filter here
    if len(row) == 3 and row[2] == 'LA':
        data.append(row)
df = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])
print(df)
Output:
     FIPS        Name  State
0   22001      Acadia     LA
1   22003       Allen     LA
2   22005   Ascension     LA
3   22007  Assumption     LA
4   22009   Avoyelles     LA
..    ...         ...    ...
63  22127        Winn     LA
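The other option mentioned above, building a data frame of all rows and filtering it afterwards, might look like the sketch below. The rows are a small hand-typed sample rather than the live page, and the 'XX' row is a placeholder invented for the example.

```python
import pandas as pd

# Sample rows: the LA entries appear in the output above; the 'XX' row is a
# made-up placeholder standing in for any other state.
data = [
    ['22001', 'Acadia', 'LA'],
    ['22003', 'Allen', 'LA'],
    ['00000', 'Example', 'XX'],
]
df = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])

# Boolean indexing keeps only the rows whose State column equals 'LA';
# reset_index renumbers the surviving rows from 0.
la_df = df[df['State'] == 'LA'].reset_index(drop=True)
print(la_df)
```

Filtering afterwards keeps the full data frame around, which is handy if more than one state's subset is needed.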
Find specific table using BeautifulSoup with specific caption
I would try to find all captions and then to match the caption text like this:
from bs4 import BeautifulSoup
import re
import requests
header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
redirect = requests.get('http://goblueraiders.com/boxscore.aspx?path=baseball&id=6117', headers = header).text
soup = BeautifulSoup(redirect, 'html.parser')
for caption in soup.find_all('caption'):
    if caption.get_text() == 'Tennessee Tech - Pitching Stats':
        table = caption.find_parent('table', {'class': 'sidearm-table collapse-on-medium accordion'})
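To see the caption lookup in isolation, here is a minimal sketch on an inline HTML string. The two tables, their cell values and the "Batting Stats" caption are invented; only the "Pitching Stats" caption text is taken from the code above. The sketch also strips surrounding whitespace before comparing and stops at the first match, which are small additions rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two captioned tables.
html = """
<table><caption>Tennessee Tech - Batting Stats</caption>
  <tr><td>batting</td></tr></table>
<table><caption> Tennessee Tech - Pitching Stats </caption>
  <tr><td>pitching</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

table = None
for caption in soup.find_all("caption"):
    # strip=True removes leading and trailing whitespace around the text.
    if caption.get_text(strip=True) == "Tennessee Tech - Pitching Stats":
        table = caption.find_parent("table")
        break

if table is not None:
    print(table.td.get_text(strip=True))
```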
How do you get all the rows from a particular table using BeautifulSoup?
This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method; you can then get the text value inside each cell with the string property.
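>>> # BeautifulSoup 4 import; BeautifulSoup 3 used "from BeautifulSoup import BeautifulSoup"
>>> from bs4 import BeautifulSoup
>>>
>>> html = """
... <html>
...   <body>
...     <table>
...       <th><td>column 1</td><td>column 2</td></th>
...       <tr><td>value 1</td><td>value 2</td></tr>
...     </table>
...   </body>
... </html>
... """
>>>
>>> # Name the parser explicitly so the result does not depend on what is installed
>>> soup = BeautifulSoup(html, 'html.parser')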
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
...     cells = row.findChildren('td')
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
...
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
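With BeautifulSoup 4, the same traversal is usually written with find_all and get_text. The sketch below is one way to do it, assuming well-formed markup in which header cells are <th> elements inside a <tr>; that differs from the sample HTML above.

```python
from bs4 import BeautifulSoup

# Assumed well-formed variant of the sample table.
html = """
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Passing a list matches either tag, so header and data cells are both kept.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Here the first inner list holds the header text, so it can be split off as column names if needed.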