Parse HTML Table to Python List

Parse HTML table to Python list?

You should use an HTML parsing library such as lxml:

from lxml import etree
s = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints

{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}
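
One caveat: col.text only returns the text before a cell's first child element, so cells that contain nested tags (links, spans) can come back truncated or as None. A more robust variant, assuming the same table element as above, joins all descendant text with itertext():

# Join all descendant text in each cell instead of relying on .text.
rows = iter(table)
headers = ["".join(col.itertext()).strip() for col in next(rows)]
for row in rows:
    values = ["".join(col.itertext()).strip() for col in row]
    print(dict(zip(headers, values)))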

How to parse html table in python

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", like below:

from bs4 import BeautifulSoup

html_doc = '''
<table cellspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

You can also use find_all method:

trs = soup.find('table').find_all('tr')

for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text
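
If you need every cell rather than just the second column, the same soup generalizes to a row-by-row list; a small sketch using the objects defined above:

# Collect every cell's text, row by row.
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in soup.find('table').find_all('tr')]
print(rows)  # [['1text 2text', '3text'], ['4text 5text', '6text']]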

How to parse HTML and get table ids using Python

You can use BeautifulSoup to get the IDs:

import requests
from bs4 import BeautifulSoup

url = 'http://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html'

resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')

for t in soup.select('table[id]'):
    if 'Domains and IP Addresses to Add to Your Allow List' in t.getText():
        print(t.attrs['id'])

I trust you can figure out how to incorporate this into your code.
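
Once you have an id, you could also hand the same HTML to pandas, which accepts an attrs filter; a hedged sketch reusing the resp object from above (the id value here is hypothetical, substitute one printed by the loop):

import pandas as pd

table_id = 'some-table-id'  # hypothetical; use an id printed above
df = pd.read_html(resp.text, attrs={'id': table_id})[0]
print(df.head())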

Parsing html tables with Beautifulsoup in Python

You can use find_all() and get_text() to gather the table data. The find_all() method returns a list of all matching descendants of a tag, and get_text() returns a string with a tag's text contents. First select all tables; for each table select all rows; for each row select all columns; and finally extract the text. That collects all table data in the same order and structure in which it appears in the HTML document.

from bs4 import BeautifulSoup

html = 'my html document'
soup = BeautifulSoup(html, 'html.parser')
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]

The tables variable contains all the tables in the document as a nested list with the following structure:

tables -> rows -> columns
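
If the first row of each table holds its headers, that nested list converts naturally to dicts; a minimal sketch, assuming header cells are also <td> tags (a <th>-only header row would come out empty with the selector above):

# Turn each table into a list of {header: value} dicts,
# treating the first row as the header row.
dict_tables = []
for rows in tables:
    if len(rows) < 2:
        continue  # nothing to pair up
    header, *body = rows
    dict_tables.append([dict(zip(header, row)) for row in body])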

If the structure is not important and you only want to collect text from all tables in one big list, use:

table_data = [i.text for i in soup.find_all('td')]

Or if you prefer CSS selectors:

table_data = [i.text for i in soup.select('td')]

If the goal is to gather table data regardless of HTML attributes or other parameters, then it may be best to use pandas. The pandas.read_html() method reads HTML from URLs, files or strings, parses it and returns a list of dataframes that contain the table data.

import pandas as pd

html = 'my html document'
tables = pd.read_html(html)

Note that pandas.read_html() is more fragile than BeautifulSoup: it will raise a ValueError if it fails to parse the HTML or if the document doesn't contain any tables.
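
In practice that means guarding the call; a small sketch of the error handling, assuming html may or may not contain a table:

import pandas as pd

try:
    tables = pd.read_html(html)
except ValueError:
    # read_html raises ValueError when it finds no tables to parse
    tables = []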

Python webscraping: How to parse html table, selenium

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen")
soup = BeautifulSoup(browser.page_source, 'html5lib')
table = soup.select('table')[1]
browser.quit()
final_list = []
for row in table.select('tr'):
    final_list.append([x.text for x in row.find_all(['td', 'th'])])
final_df = pd.DataFrame(final_list[1:], columns=final_list[0])
final_df[:-2]

This returns the actual table:

        Y-avg2  Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
0 2022 . 117.8 119.1 119.8 121.2 121.5 122.6 . . . . . .
1 2021 116.1 114.1 114.9 114.6 115.0 114.9 115.3 116.3 116.3 117.5 117.2 118.1 118.9
2 2020 112.2 111.3 111.2 111.2 111.7 111.9 112.1 112.9 112.5 112.9 113.2 112.4 112.9
3 2019 110.8 109.3 110.2 110.4 110.8 110.5 110.6 111.4 110.6 111.1 111.3 111.6 111.3
4 2018 108.4 106.0 107.0 107.3 107.7 107.8 108.5 109.3 108.9 109.5 109.3 109.8 109.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
89 1933 2.7 2.7 2.7 2.7 2.7 2.7 2.7 2.7 2.8 2.7 2.7 2.7 2.7
90 1932 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8
91 1931 2.8 2.9 2.9 2.9 2.9 2.8 2.8 2.8 2.8 2.8 2.8 2.8 2.8
92 1930 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 2.9 2.9 2.9
93 1929 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1
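
As an aside, once Selenium has rendered the page, pandas can parse every table from the source in one call; a hedged alternative (it must run before browser.quit(), and it assumes the second table is still the one you want):

import pandas as pd

# Parse all tables from the rendered page in one call.
dfs = pd.read_html(browser.page_source)  # run before browser.quit()
final_df = dfs[1]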

Regarding your 'html5lib' issue, without looking at your actual install/virtualenv, etc., there is not much help I can offer. Maybe try reinstalling it, or installing it in a fresh virtual environment.

Parsing HTML table structure with no class attributes

You can use itertools.zip_longest to "tie" the rows together.

For example:

import requests
from itertools import zip_longest
from bs4 import BeautifulSoup, NavigableString, Tag

url = 'http://www.abyznewslinks.com/costa.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for table in soup.select('table')[4:-1]:
    tds = []
    for f in table.select('font'):
        tds.append([])
        for c in f.contents:
            if isinstance(c, NavigableString) and c.strip():
                tds[-1].append(c.strip())
            elif isinstance(c, Tag) and c.name == 'a':
                tds[-1].append([c.text, c['href']])

    for column in zip_longest(*tds, fillvalue=''):
        print(column)

    print('-' * 80)

Prints:

('Costa Rica Newspapers and News Media - National',)
--------------------------------------------------------------------------------
('Costa Rica - Broadcast News Media',)
--------------------------------------------------------------------------------
('National', ['Columbia', 'https://columbia.co.cr/'], 'BC', 'GI', 'SPA', 'Radio')
('National', ['Monumental', 'http://www.monumental.co.cr/'], 'BC', 'GI', 'SPA', 'Radio')
('National', ['Multimedios', 'https://www.multimedios.cr/'], 'BC', 'GI', 'SPA', 'TV')
('National', ['Repretel', 'http://www.repretel.com/'], 'BC', 'GI', 'SPA', 'TV')
('National', ['Sinart', 'http://www.costaricanoticias.cr/'], 'BC', 'GI', 'SPA', 'Radio TV')
('National', ['Teletica', 'https://www.teletica.com/'], 'BC', 'GI', 'SPA', 'TV')
--------------------------------------------------------------------------------
('Costa Rica - Internet News Media',)
--------------------------------------------------------------------------------
('National', ['A Diario CR', 'http://adiariocr.com/'], 'IN', 'GI', 'SPA')
('National', ['AM Costa Rica', 'http://www.amcostarica.com/'], 'IN', 'GI', 'ENG')
('National', ['AM Prensa', 'https://amprensa.com/'], 'IN', 'GI', 'SPA')
('National', ['BS Noticias', 'http://www.bsnoticias.cr/'], 'IN', 'GI', 'SPA')
('National', ['Costa Rica News', 'https://thecostaricanews.com/'], 'IN', 'GI', 'ENG')
('National', ['Costa Rica Star', 'https://news.co.cr/'], 'IN', 'GI', 'ENG')
('National', ['Costarican Times', 'https://www.costaricantimes.com/'], 'IN', 'GI', 'ENG')
('National', ['CR Hoy', 'https://www.crhoy.com/'], 'IN', 'GI', 'SPA')
('National', ['Delfino', 'https://delfino.cr/'], 'IN', 'GI', 'SPA')
('National', ['El Guardian', 'https://elguardian.cr/'], 'IN', 'GI', 'SPA')
('National', ['El Mundo', 'https://www.elmundo.cr/'], 'IN', 'GI', 'SPA')
('National', ['El Pais', 'http://www.elpais.cr/'], 'IN', 'GI', 'SPA')
('National', ['El Periodico CR', 'https://elperiodicocr.com/'], 'IN', 'GI', 'SPA')
('National', ['Informa Tico', 'http://informa-tico.com/'], 'IN', 'GI', 'SPA')
('National', ['La Prensa Libre', 'http://www.laprensalibre.cr/'], 'IN', 'GI', 'SPA')
('National', ['NCR Noticias Costa Rica', 'https://ncrnoticias.com/'], 'IN', 'GI', 'SPA')
('National', ['No Ticiero', 'http://no.ticiero.com/'], 'IN', 'GI', 'SPA')
('National', ['Noticias al Instante Costa Rica', 'https://www.noticiasalinstante.cr/'], 'IN', 'GI', 'SPA')
('National', ['Noticias Costa Rica', 'https://noticiascostarica.com/'], 'IN', 'GI', 'SPA')
('National', ['Q Costa Rica', 'http://qcostarica.com/'], 'IN', 'GI', 'ENG')
('National', ['Tico Deporte', 'https://www.ticodeporte.com/'], 'IN', 'SP', 'SPA')
('National', ['Today Costa Rica', 'http://todaycostarica.com/'], 'IN', 'GI', 'ENG')
--------------------------------------------------------------------------------
('Costa Rica - Magazine News Media',)
--------------------------------------------------------------------------------
('National', ['EKA', 'https://www.ekaenlinea.com/'], 'MG', 'BU', 'SPA')
--------------------------------------------------------------------------------
('Costa Rica - Newspaper News Media',)
--------------------------------------------------------------------------------
('National', ['Diario Extra', 'http://www.diarioextra.com/'], 'NP', 'GI', 'SPA')
('National', ['La Nacion', 'https://www.nacion.com/'], 'NP', 'GI', 'SPA')
('National', ['La Republica', 'https://www.larepublica.net/'], 'NP', 'GI', 'SPA')
('National', ['La Teja', 'https://www.lateja.cr/'], 'NP', 'GI', 'SPA')
--------------------------------------------------------------------------------
('Costa Rica Newspapers and News Media - Local',)
--------------------------------------------------------------------------------
('Alajuela',)
--------------------------------------------------------------------------------
('Alajuela', ['El Sol', 'https://elsoldeoccidente.com/'], 'NP', 'GI', 'SPA')
('Alajuela', ['La Segunda', 'http://www.periodicolasegundacr.com/'], 'NP', 'GI', 'SPA')
('Grecia', ['Mi Tierra', 'http://www.periodicomitierra.com/'], 'NP', 'GI', 'SPA')
('San Carlos', ['La Region', 'http://laregion.cr/'], 'NP', 'GI', 'SPA')
('San Carlos', ['San Carlos al Dia', 'https://www.sancarlosaldia.com/'], 'IN', 'GI', 'SPA')
('San Carlos', ['San Carlos Digital', 'https://sancarlosdigital.com/'], 'IN', 'GI', 'SPA')
--------------------------------------------------------------------------------
('Cartago',)
--------------------------------------------------------------------------------
('Cartago', ['Cartago Hoy', 'http://www.cartagohoy.com/'], 'IN', 'IG', 'SPA')
('Paraiso', ['Brujos Paraiso', 'http://www.brujosparaiso.com/'], 'IN', 'IG', 'SPA')
--------------------------------------------------------------------------------
('Guanacaste',)
--------------------------------------------------------------------------------
('Bagaces', ['Guanacaste \na la Altura', 'https://www.guanacastealaaltura.com/'], 'NP', 'GI', 'SPA', 'TV')
('Filadelfia', ['El Independiente', 'https://diariodigitalelindependiente.com/'], 'IN', 'GI', 'SPA', 'Radio')
('Liberia', ['Canal 5 Guanacaste', 'http://www.canal5guanacaste.com/'], 'BC', 'GI', 'SPA', '')
('Liberia', ['Guana Noticias', 'https://guananoticias.com/'], 'IN', 'GI', 'SPA', '')
('Liberia', ['Mensaje', 'https://www.periodicomensaje.com/'], 'NP', 'GI', 'SPA', '')
('Liberia', ['Mundo Guanacaste', 'http://www.mundoguanacaste.com/'], 'IN', 'GI', 'SPA', '')
('Liberia', ['NTG Noticias', 'https://ntgnoticias.com/'], 'IN', 'GI', 'SPA', '')
('Liberia', ['Radio Pampa', 'http://www.radiolapampa.net/'], 'BC', 'GI', 'SPA', '')
('Nicoya', ['La Voz de Guanacaste', 'https://vozdeguanacaste.com/'], 'NP', 'GI', 'SPA', '')
('Nicoya', ['Voice of Guanacaste', 'https://vozdeguanacaste.com/en'], 'NP', 'GI', 'ENG', '')
('Tamarindo', ['Tamarindo News', 'http://tamarindonews.com/'], 'IN', 'GI', 'ENG', '')
--------------------------------------------------------------------------------
('Heredia',)
--------------------------------------------------------------------------------
('Flores', ['El Florense', 'http://elflorense.com/'], 'NP', 'GI', 'SPA')
('Heredia', ['Fortinoticias', 'http://fortinoticias.com/'], 'IN', 'GI', 'SPA')
--------------------------------------------------------------------------------
('Limon',)
--------------------------------------------------------------------------------
('Limon', ['El Independiente', 'https://www.elindependiente.co.cr/'], 'NP', 'GI', 'SPA')
('Limon', ['Limon Hoy', 'https://www.limonhoy.com/'], 'IN', 'GI', 'SPA')
--------------------------------------------------------------------------------
('Puntarenas',)
--------------------------------------------------------------------------------
('Paquera', ['Mi Prensa', 'http://www.miprensacr.com/'], 'IN', 'GI', 'SPA')
('Puntarenas', ['Puntarenas Se Oye', 'https://www.puntarenasseoye.com/'], 'IN', 'GI', 'SPA')
--------------------------------------------------------------------------------
('San Jose',)
--------------------------------------------------------------------------------
('Acosta', ['El Jornal', 'http://eljornalcr.com/'], 'NP', 'GI', 'SPA', 'TV')
('Goicochea', ['La Voz de Goicochea', 'https://www.lavozdegoicoechea.info/'], 'IN', 'GI', 'SPA', 'TV')
('Perez Zeledon', ['Canal 14', 'http://www.tvsur.co.cr/'], 'BC', 'GI', 'SPA', '')
('Perez Zeledon', ['Enlace', 'https://www.enlacecr.com/'], 'NP', 'GI', 'SPA', '')
('Perez Zeledon', ['PZ Actual', 'http://www.pzactual.com/'], 'IN', 'GI', 'SPA', '')
('Perez Zeledon', ['PZ Noticias', 'http://www.pznoticias.org/'], 'IN', 'GI', 'SPA', '')
('San Jose', ['Diario Extra', 'http://www.diarioextra.com/'], 'NP', 'GI', 'SPA', '')
('San Jose', ['El Financiero', 'https://www.elfinancierocr.com/'], 'NP', 'BU', 'SPA', '')
('San Jose', ['Extra TV', 'http://www.extratv42.com/'], 'BC', 'GI', 'SPA', '')
('San Jose', ['La Gaceta', 'http://www.gaceta.go.cr/gaceta/'], 'NP', 'GO', 'SPA', '')
('San Jose', ['La Nacion', 'https://www.nacion.com/'], 'NP', 'GI', 'SPA', '')
('San Jose', ['La Republica', 'https://www.larepublica.net/'], 'NP', 'GI', 'SPA', '')
('San Jose', ['La Teja', 'https://www.lateja.cr/'], 'NP', 'GI', 'SPA', '')
('San Jose', ['Tico Times', 'http://www.ticotimes.net/'], 'NP', 'GI', 'ENG', '')
('Tibas', ['Gente', 'http://periodicogente.co.cr/'], 'NP', 'GI', 'SPA', '')
--------------------------------------------------------------------------------

How to parse an HTML table with rowspans in Python?

You'll have to track the rowspans on previous rows, one per column.

You can do this by copying the integer value of a rowspan into a dictionary and having each subsequent row decrement that value until it drops to 1 (or, for easier coding, store the value minus one and decrement to 0). Then you can adjust the column positions of subsequent cells based on the preceding rowspans.

Your table complicates this a little by using a default span of size 2, incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.

Rather than use massive CSS selectors, select just the table rows and we'll iterate over those:

roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
    # take direct child td cells, but skip the first cell:
    daycells = row.select('> td')[1:]
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1

        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.

        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan

        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })

    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1

This produces correct output:

[{'blok_eind': 2,
'blok_start': 1,
'dag': 5,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021',
'vak': u'WEBD'},
{'blok_eind': 3,
'blok_start': 2,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'WEBD'},
{'blok_eind': 4,
'blok_start': 3,
'dag': 5,
'leraar': u'DOODF000',
'lokaal': u'ALK C212',
'vak': u'PROJ-T'},
{'blok_eind': 5,
'blok_start': 4,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'MENT'},
{'blok_eind': 7,
'blok_start': 6,
'dag': 5,
'leraar': u'JONGJ003',
'lokaal': u'ALK B008',
'vak': u'BURG'},
{'blok_eind': 8,
'blok_start': 7,
'dag': 3,
'leraar': u'FLUIP000',
'lokaal': u'ALK B004',
'vak': u'ICT algemeen Prakti'},
{'blok_eind': 9,
'blok_start': 8,
'dag': 5,
'leraar': u'KOOLE000',
'lokaal': u'ALK B008',
'vak': u'NED'}]

Moreover, this code will continue to work even if courses span more than 2 blocks, or just one block; any rowspan size is supported.
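
As a side note, when you only need a rectangular grid rather than the roster structure above, the same bookkeeping can be packaged as a generic helper. A sketch of that idea with BeautifulSoup, for simple non-nested tables (not the answer's method, just a related utility that copies a cell's text across its rowspan and colspan):

def table_to_grid(table):
    """Expand rowspan/colspan cells into a rectangular list-of-lists grid."""
    grid = []
    pending = {}  # column index -> [rows remaining, text] from earlier rowspans
    for tr in table.find_all('tr'):
        row, col = [], 0
        cells = iter(tr.find_all(['td', 'th']))
        while True:
            # fill columns still covered by a rowspan from an earlier row
            while col in pending:
                pending[col][0] -= 1
                row.append(pending[col][1])
                if pending[col][0] == 0:
                    del pending[col]
                col += 1
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(strip=True)
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            for c in range(col, col + colspan):
                row.append(text)
                if rowspan > 1:
                    pending[c] = [rowspan - 1, text]
            col += colspan
        grid.append(row)
    return grid

For example, grid = table_to_grid(soup.find('table')) yields one list per rendered row, with spanned text repeated into each covered cell.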

Parsing data according to an HTML table

This isn't perfect, but I think it will get you what you are looking for. Your first loop through the data to collect GTIN, lot, and date was overwriting itself. Look for the "added" and "removed" markers in the comments. I also have a method of viewing the results commented out (the code works if you want to use it), plus two that are not commented out. The last version requires the package tabulate; the code also uses numpy and the standard-library re module.

I included all of your original code and the changes. Let me know if there's anything I failed to clarify.

import requests
from bs4 import BeautifulSoup
import numpy as np
import re
import tabulate

###################_____Parameter_____###################
url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
#########################################################

# Collecting links on rappel.gouv
def get_soup(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_product_urls(url):
    links = [url + x['href'] for x in get_soup(url).select('a.product-link')]
    return links

soup = get_soup(url)
url_data = extract_product_urls(url)

# Collecting data on each url collected
def data(url_data):
    data_set = []
    all_data = []  # <-- ADDED
    tbler = []     # <-- ADDED

    # Collecting bar_code
    for row in url_data:
        req_bar = requests.get(row, headers=headers)
        ext_bar = BeautifulSoup(req_bar.text, 'html.parser')
        # for code_bar in ext_bar.find_all('tbody', {'class': 'text-left'}):  <-- REMOVED
        for code_bar in ext_bar.find_all('table'):
            table_content_tr = code_bar.findAll('tr')
            for td in table_content_tr:
                all_data = td.findAll('td')
                # all_dt2 = [x.text.strip(' ') for x in all_data]  <-- REMOVED
                all_dt1 = [x.get_text(strip=True) for x in all_data]  # <-- ADDED

        # all_data was being overwritten before
        all_dt1 = np.ravel(all_dt1)  # one-dimensional array  # <-- ADDED

        # check for GTIN, lot, and date  <-- added here down
        if len(all_dt1) < 3:
            # check for product GTIN
            if not re.match(r'^(\d{13})$', all_dt1[0, ]):
                # add a blank (underscore) for missing GTIN
                all_dt1 = np.hstack(('_', all_dt1))
            # check the lot field for dates (missing lot)
            if re.search(r'[\d]{2}/[\d]{2}/[\d]{2,4}', all_dt1[1, ]):
                all_dt1 = np.insert(all_dt1, 1, "_")
            # missing date? any remaining missing fields are collected here
            s = 3 - len(all_dt1)
            for i in range(0, s):
                all_dt1 = np.hstack((all_dt1, ['_']))  # append for any other missing fields
        # stack new and existing product data
        if len(tbler) > 0:  # is this the first time through the for loop?
            tbler = np.vstack((tbler, all_dt1))  # stack existing rows
        else:
            tbler = all_dt1  # or else, create the first row
        # ----- end added here
        # removed ... don't loop twice

        # collecting detailed products
        # for data in url_data:  <-- REMOVED
        #     req = requests.get(data, headers=headers)  <-- REMOVED
        #     ext = BeautifulSoup(req.text, 'html.parser')  <-- REMOVED
        for products in ext_bar.find_all('div', {'class': 'row site-wrapper'}):  # mod!! ext => ext_bar
            title = products.find('p', {'class': 'h5 product-main-title'}).text
            brand = products.find('p', {'class': 'text-muted product-main-brand'}).text.replace('\xa0:\n', ': ').strip()
            category = products.find('p', {'class': 'product-cat'}).text
            detail_rappel = products.find_all('div', {'class': 'card product-practical'})
            for motif in detail_rappel:
                val_1 = motif.find('span', {'class': 'val'}).text
                results = row, title, brand, category, val_1  # mod!! data => row, removed *all_data
                data_set.append(results)

    return data_set, tbler  # , all_data  <-- MOD: added tbler

# final result
final, eb = data(url_data)

# column stack the GTIN, lot, and date with the product data
newArr = np.hstack((eb, final))

# this commented-out for loop will print by array row
# x, y = np.shape(newArr)
# for i in range(0, x):
#     print(newArr[i, ])

# this is just an alternative print method (it doesn't show the brackets or quotes)
for i in newArr:
    for j in i:
        print(j, end=' ')
    print()

# yet another alternate look at the same information
print(tabulate.tabulate(newArr))

[Screenshot: the first coded output]

[Screenshots: the tabulate output, split in two since it's so wide]


