How to Specify Table for BeautifulSoup to Find

BeautifulSoup: Get the contents of a specific table

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

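from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url: the page you want to scrape
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', id='Table1')  # the table whose id is "Table1"
rows = table.find_all('tr')  # all of its tr elements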
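
To try the id lookup without a network call, here is a minimal sketch on an inline HTML snippet. The sample markup and cell values are made up for illustration; only the id "Table1" comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one carries id="Table1".
html = """
<table id="Other"><tr><td>skip me</td></tr></table>
<table id="Table1">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before using the result.
table = soup.find("table", id="Table1")
rows = table.find_all("tr") if table is not None else []

# Collect the text of every cell, one list per row.
cells = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(cells)
```

The None check matters in practice: if the id is misspelled or the table is missing from the response, calling find_all on the result would raise an AttributeError.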

How to specify table for BeautifulSoup to find?

In this case I'd probably just use pandas to retrieve all the tables, then index into the list for the appropriate one:

import pandas as pd

table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)

If you are worried about the ordering changing in future, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the table. Alternatively, use the bs4 CSS pseudo-classes :has and :contains (bs4 4.7.1+) to identify the right table, then pass it to read_html or continue handling it with bs4:

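import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
# Newer soupsieve releases spell :contains as :-soup-contains
# read_html returns a list of DataFrames, so take the first (and only) one
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))[0]
print(table)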
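
The unique-string idea can also be sketched with bs4 alone, looping over the tables and checking their text. This is only an illustration: the inline HTML and the dollar values are invented, and find_table_containing is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second mentions the marker text.
html = """
<table><tr><td>Tuition</td><td>$10,000</td></tr></table>
<table><tr><td>Average net price</td><td>$15,000</td></tr></table>
"""


def find_table_containing(soup, marker):
    """Return the first <table> whose text contains marker, or None."""
    for candidate in soup.find_all("table"):
        if marker in candidate.get_text():
            return candidate
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_table_containing(soup, "Average net price")
row = [td.get_text(strip=True) for td in table.find_all("td")]
print(row)
```

Because the lookup depends on the text rather than the position, it should keep working if the site adds or reorders tables, as long as the marker string stays unique.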

Python Beautiful Soup can't find specific table

The tables are rendered after the initial page load, so one option is to use Selenium and let the page render. But that isn't necessary here, as most of the tables are inside HTML comments. You can use BeautifulSoup to pull out the comments, then search through those for the table tags.

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

# NBA season
year = 2019

# The {} placeholder is filled in by format(year)
url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(year)
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every HTML comment in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except Exception:
            # Skip comments that mention "table" but hold no parsable table
            continue

This gives you a list of DataFrames, so just pull out the table you want by its index position:

Output:

print (tables[3])
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 19780 3555 ... 615 486 1137 1608 9686
1 2.0 Golden State Warriors* 82 19805 3612 ... 625 525 1169 1757 9650
2 3.0 New Orleans Pelicans 82 19755 3581 ... 610 441 1215 1732 9466
3 4.0 Philadelphia 76ers* 82 19805 3407 ... 606 432 1223 1745 9445
4 5.0 Los Angeles Clippers* 82 19830 3384 ... 561 385 1193 1913 9442
5 6.0 Portland Trail Blazers* 82 19855 3470 ... 546 413 1135 1669 9402
6 7.0 Oklahoma City Thunder* 82 19855 3497 ... 766 425 1145 1839 9387
7 8.0 Toronto Raptors* 82 19880 3460 ... 680 437 1150 1724 9384
8 9.0 Sacramento Kings 82 19730 3541 ... 679 363 1095 1751 9363
9 10.0 Washington Wizards 82 19930 3456 ... 683 379 1154 1701 9350
10 11.0 Houston Rockets* 82 19830 3218 ... 700 405 1094 1803 9341
11 12.0 Atlanta Hawks 82 19855 3392 ... 675 419 1397 1932 9294
12 13.0 Minnesota Timberwolves 82 19830 3413 ... 683 411 1074 1664 9223
13 14.0 Boston Celtics* 82 19780 3451 ... 706 435 1052 1670 9216
14 15.0 Brooklyn Nets* 82 19980 3301 ... 539 339 1236 1763 9204
15 16.0 Los Angeles Lakers 82 19780 3491 ... 618 440 1284 1701 9165
16 17.0 Utah Jazz* 82 19755 3314 ... 663 483 1240 1728 9161
17 18.0 San Antonio Spurs* 82 19805 3468 ... 501 386 992 1487 9156
18 19.0 Charlotte Hornets 82 19830 3297 ... 591 405 1001 1550 9081
19 20.0 Denver Nuggets* 82 19730 3439 ... 634 363 1102 1644 9075
20 21.0 Dallas Mavericks 82 19780 3182 ... 533 351 1167 1650 8927
21 22.0 Indiana Pacers* 82 19705 3390 ... 713 404 1122 1594 8857
22 23.0 Phoenix Suns 82 19880 3289 ... 735 418 1279 1932 8815
23 24.0 Orlando Magic* 82 19780 3316 ... 543 445 1082 1526 8800
24 25.0 Detroit Pistons* 82 19855 3185 ... 569 331 1135 1811 8778
25 26.0 Miami Heat 82 19730 3251 ... 627 448 1208 1712 8668
26 27.0 Chicago Bulls 82 19905 3266 ... 603 351 1159 1663 8605
27 28.0 New York Knicks 82 19780 3134 ... 557 422 1151 1713 8575
28 29.0 Cleveland Cavaliers 82 19755 3189 ... 534 195 1106 1642 8567
29 30.0 Memphis Grizzlies 82 19880 3113 ... 684 448 1147 1801 8490
30 NaN League Average 82 19815 3369 ... 626 406 1155 1714 9119

[31 rows x 25 columns]
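
The comment-extraction step can be tried offline with bs4 alone. The sketch below is an assumption-laden illustration: the inline HTML is invented (the table id "team-stats-base" is hypothetical), and it re-parses the comment with BeautifulSoup rather than pandas.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: the table is wrapped in an HTML comment, so a plain
# search for <table> on the parsed page does not see it.
html = """
<div id="all_team-stats-base">
<!--
<table id="team-stats-base">
  <tr><th>Team</th><th>G</th></tr>
  <tr><td>Milwaukee Bucks</td><td>82</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
visible_table = soup.find("table")  # None: the markup is inside a comment

comments = soup.find_all(string=lambda text: isinstance(text, Comment))

hidden_tables = []
for comment in comments:
    if "<table" in comment:
        # Re-parse the comment body as its own document.
        inner = BeautifulSoup(str(comment), "html.parser")
        hidden_tables.extend(inner.find_all("table"))

rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in hidden_tables[0].find_all("tr")
]
print(rows)
```

Checking for "<table" instead of "table" is a slightly stricter filter, which avoids re-parsing comments that merely mention the word.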

Find specific table using BeautifulSoup with specific caption

I would find all the captions and then match on the caption text, like this:

from bs4 import BeautifulSoup
import requests


header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

redirect = requests.get('http://goblueraiders.com/boxscore.aspx?path=baseball&id=6117', headers = header).text
soup = BeautifulSoup(redirect, 'html.parser')

for caption in soup.find_all('caption'):
    if caption.get_text() == 'Tennessee Tech - Pitching Stats':
        # Walk up from the matching caption to its enclosing table
        table = caption.find_parent('table', {'class': 'sidearm-table collapse-on-medium accordion'})
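
As a self-contained sketch of the caption match, the snippet below runs on inline HTML. The markup and the "Batting Stats" caption are invented for illustration, and find_table_by_caption is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second caption text is taken from the answer above.
html = """
<table><caption>Tennessee Tech - Batting Stats</caption>
  <tr><td>batting row</td></tr></table>
<table><caption> Tennessee Tech - Pitching Stats </caption>
  <tr><td>pitching row</td></tr></table>
"""


def find_table_by_caption(soup, caption_text):
    """Return the <table> whose caption equals caption_text, or None."""
    for caption in soup.find_all("caption"):
        # strip=True ignores whitespace around the caption text
        if caption.get_text(strip=True) == caption_text:
            return caption.find_parent("table")
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_table_by_caption(soup, "Tennessee Tech - Pitching Stats")
first_cell = table.find("td").get_text(strip=True)
print(first_cell)
```

Stripping the caption text before comparing makes the match tolerant of stray whitespace in the markup, which an exact get_text() comparison is not.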

BeautifulSoup - find table with specified class on Wikipedia page

Don't include the jquery-tablesorter class when selecting against the response you get from requests: that class is applied dynamically by JavaScript after the page loads, so it is absent from the raw HTML. If you omit it, you should be good to go.

tab = soup.find("table",{"class":"wikitable sortable"})
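
The difference can be seen on two inline snippets. This is a sketch with invented markup: raw_html stands in for what requests would receive, and rendered_html for what the browser might show after the script has run.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the same table before and after the script
# has added its class.
raw_html = '<table class="wikitable sortable"><tr><td>x</td></tr></table>'
rendered_html = (
    '<table class="wikitable sortable jquery-tablesorter">'
    "<tr><td>x</td></tr></table>"
)

raw = BeautifulSoup(raw_html, "html.parser")
rendered = BeautifulSoup(rendered_html, "html.parser")

# A multi-class string is compared against the exact class attribute value.
exact_on_raw = raw.find("table", {"class": "wikitable sortable"})
exact_on_rendered = rendered.find("table", {"class": "wikitable sortable"})

# A CSS selector only requires that both classes are present.
css_on_raw = raw.select_one("table.wikitable.sortable")
css_on_rendered = rendered.select_one("table.wikitable.sortable")

print(exact_on_raw is not None, exact_on_rendered is not None)
print(css_on_raw is not None, css_on_rendered is not None)
```

If you would rather not depend on the exact class string at all, the CSS selector form is the more forgiving choice, since extra classes do not stop it from matching.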

python BeautifulSoup parsing table

Here you go:

data = []
table = soup.find('table', attrs={'class':'lineItemsTable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

This gives you:

[ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', u'125.00', u'$'], 
[u'7086775850', u'PAS', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'125.00', u'$'],
[u'7355010165', u'OMT', u'12/14/2013', u'3908 6th Ave', u'K', u'40', u'145.00', u'$'],
[u'4002488755', u'OMT', u'02/12/2014', u'NB 1ST AVE @ E 23RD ST', u'5', u'115.00', u'$'],
[u'7913806837', u'OMT', u'03/03/2014', u'5015 4th Ave', u'K', u'46', u'115.00', u'$'],
[u'5080015366', u'OMT', u'03/10/2014', u'EB 65TH ST @ 16TH AV E', u'7', u'50.00', u'$'],
[u'7208770670', u'OMT', u'04/08/2014', u'333 15th St', u'K', u'70', u'65.00', u'$'],
[u'$0.00\n\n\nPayment Amount:']
]

A couple of things to note:

  • The last row in the output above, the Payment Amount, is not part
    of the table data, but that is how the table is laid out. You can filter it
    out by checking whether the length of the list is less than 7.
  • The last column of every row will have to be handled separately, since it is an input text box.
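
The length check from the first note could look like the sketch below. The rows are copied from the sample output above; the name MIN_FIELDS is made up for illustration.

```python
# Rows copied from the sample output above (Python 3 strings, no u'' prefix).
data = [
    ['1359711259', 'SRF', '08/05/2013', '5310 4 AVE', 'K', '19', '125.00', '$'],
    ['4002488755', 'OMT', '02/12/2014', 'NB 1ST AVE @ E 23RD ST', '5', '115.00', '$'],
    ['$0.00\n\n\nPayment Amount:'],
]

MIN_FIELDS = 7  # threshold suggested in the note above

# Keep only rows long enough to be real line items.
line_items = [row for row in data if len(row) >= MIN_FIELDS]
print(len(line_items))
```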

BeautifulSoup can't find table

You need Selenium to extract the table data because the data is loaded through JavaScript. As an example, here I extract the first table's data and save it to a CSV file.

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

url = 'https://www.nba.com/standings?GroupBy=conf&Season=2019-20&Section=overall'
# Selenium 3 style: the chromedriver path is passed positionally.
# Newer Selenium 4 releases take a Service object instead.
driver = webdriver.Chrome(r"C:\Users\Subrata\Downloads\chromedriver.exe")
driver.get(url)

# Parse the HTML after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.select('div.StandingsGridRender_standingsContainer__2EwPy')
table1 = []
for tr in tables[0].find_all('tr'):
    # One list of cell texts per row
    first = [t.getText(strip=True, separator=' ') for t in tr]
    table1.append(first)

# The first row holds the column headers
df = pd.DataFrame(table1[1:], columns=table1[0])

df.to_csv('x.csv')

Beautiful Soup can't find tables

Try disabling JavaScript when you visit https://covid.knoxcountytn.gov/case-count.html and you will see no table. As @barny said, the table is generated with JavaScript, so you can't parse it with BeautifulSoup (at least not easily; see "How to call JavaScript function using BeautifulSoup and Python").
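
A quick way to check the same thing in code is to look for a table element in the raw HTML before reaching for a browser. The sketch below uses two invented HTML strings in place of real responses, and has_static_table is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def has_static_table(html):
    """Return True if the raw HTML already contains a <table> element."""
    return BeautifulSoup(html, "html.parser").find("table") is not None


# Hypothetical responses: one page ships its table in the HTML, the other
# only ships a placeholder that a script would fill in later.
static_page = "<div><table><tr><td>42</td></tr></table></div>"
scripted_page = '<div id="case-count"></div><script src="app.js"></script>'

print(has_static_table(static_page))
print(has_static_table(scripted_page))
```

If the check comes back False for the text returned by requests while the table is visible in the browser, the table is probably built client-side, and BeautifulSoup alone will not find it.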

how to select a particular table and print its data using beautifulsoup

Expanding on @furas' comment slightly, as report_tables[4] assumes it will always be the 5th table:

import requests
from bs4 import BeautifulSoup

req = requests.get("https://www.ssllabs.com/ssltest/analyze.html?d=drtest.test.sentinelcloud.com")
data = req.text
soup = BeautifulSoup(data, 'html.parser')

for found_table in soup.find_all('table', class_='reportTable'):
    # Identify the table by its text rather than by position
    if 'Cipher Suites' in found_table.get_text():
        values = found_table.find_all('td', class_='tableLeft')
        entries = []
        for row in values:
            entries.append(row.get_text())
        print(entries)

Checking for 'Cipher Suites' (though you could use a more complete title if need be) should help you get the correct table more consistently.

You could simply use values as the output, but get_text() removes some of the HTML that you likely won't need. entries will contain the values you require, but you might need to look into functions like strip to clear whitespace from the results.

PRODUCED RESULT:

[u'\n                                            TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256\n                                        (0xc02f)\n                                                            \xa0  ECDH secp256r1 (eq. 3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384\n                                        (0xc030)\n                                                            \xa0  ECDH secp256r1 (eq. 3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_128_GCM_SHA256\n                                        (0x9e)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_256_GCM_SHA384\n                                        (0x9f)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256\n                                        (0xc027)\n                                                            \xa0  ECDH secp256r1 (eq. 3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA\n                                        (0xc013)\n                                                            \xa0  ECDH secp256r1 (eq. 3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384\n                                        (0xc028)\n                                                            \xa0  ECDH secp256r1 (eq. 3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA\n                                        (0xc014)\n                                                            \xa0  ECDH secp256r1 (eq. 
3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_128_CBC_SHA256\n                                        (0x67)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_128_CBC_SHA\n                                        (0x33)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_256_CBC_SHA256\n                                        (0x6b)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_DHE_RSA_WITH_AES_256_CBC_SHA\n                                        (0x39)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA\n                                        (0xc012)\n                                                            \xa0  ECDH secp256r1 (eq. 
3072 bits RSA) \xa0 FS\n', u'\n                                            TLS_RSA_WITH_AES_128_GCM_SHA256\n                                        (0x9c)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_AES_256_GCM_SHA384\n                                        (0x9d)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_AES_128_CBC_SHA256\n                                        (0x3c)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_AES_256_CBC_SHA256\n                                        (0x3d)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_AES_128_CBC_SHA\n                                        (0x2f)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_AES_256_CBC_SHA\n                                        (0x35)\n                                                                \n                    \n                ', u'\n                                            TLS_DHE_RSA_WITH_CAMELLIA_256_CBC_SHA\n                                        (0x88)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_RSA_WITH_CAMELLIA_256_CBC_SHA\n                                        (0x84)\n                                                                \n                    \n                ', u'\n                                            TLS_DHE_RSA_WITH_CAMELLIA_128_CBC_SHA\n 
                                       (0x45)\n                                 \xa0\n                                    \nDH 2048 bits \xa0 FS\n', u'\n                                            TLS_RSA_WITH_CAMELLIA_128_CBC_SHA\n                                        (0x41)\n                                                                \n                    \n                ', u'\n                                            TLS_RSA_WITH_3DES_EDE_CBC_SHA\n                                        (0xa)\n                                                                \n                    \n                ']

EDIT: to expand this in line with @PadraicCunningham's comments, we can remove the whitespace and return the first value as follows:

for found_table in soup.find_all('table', class_='reportTable'):
    if 'Cipher Suites' in found_table.get_text():
        # Keep only the first whitespace-separated token of each cell
        vals = [td.text.split()[0] for td in found_table.select("td.tableLeft")]
        print(vals)
        break
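
If you want the whole cell text rather than just the first token, the whitespace can be collapsed instead. This is a minimal sketch on an invented cell, padded the way the cells in the result above are.

```python
from bs4 import BeautifulSoup

# Hypothetical cell; the cipher name and code are taken from the result above.
html = """
<table class="reportTable"><tr><td class="tableLeft">
        TLS_RSA_WITH_AES_128_CBC_SHA
    (0x2f)
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("td.tableLeft")

# split() with no arguments splits on any run of whitespace, so joining
# the pieces collapses newlines and padding to single spaces.
cleaned = " ".join(cell.get_text().split())
first_token = cell.get_text().split()[0]
print(cleaned)
print(first_token)
```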

