How to Extract Data from HTML Table in Shell Script

How to extract data from an HTML table in a shell script?

Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.

 awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }' FILE

Here you can see it in action: https://ideone.com/zGfLe

Some explanation:

  1. -F sets the input field separator to a regexp (any of tr's or td's opening or closing tags)

  2. then it works only on lines that match those tags AND contain at least two consecutive uppercase letters

  3. then prints the needed fields.
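
For illustration, here is a minimal sketch with a made-up row (the cell contents are assumptions, not taken from the original question):

 echo '<tr><td>AA</td><td>BB</td><td>CC</td></tr>' | \
 awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }'

This prints AA BB CC: the tags themselves act as field separators, so the cell contents land in fields 3, 5 and 7.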

HTH

Extract text from HTML based on table column via Shell Script

#!/bin/bash

# Print the word that precedes the given argument in sample.html,
# after stripping tags and leading whitespace from each </div> line.
for i in $(grep '</div>' sample.html | sed 's/\s\+//' | sed 's/<.*>//'); do
    if [ "$i" == "$1" ]; then
        echo "$prev"
    fi
    prev=$i
done

Example usage:

$ ./filter.sh primary
core6692.myserverdomain.com

P.S.: The format of sample.html should be exactly as you posted it: the server and the name should each end with a closing tag and start with whitespace or a tab.
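
Based on that description, sample.html presumably contains pairs of lines like this, with each value preceded by whitespace and terminated by a closing tag (this layout is a guess reconstructed from the script's grep/sed steps, not taken from the question):

<div class="server">
    core6692.myserverdomain.com</div>
<div class="type">
    primary</div>

The grep keeps only the lines containing </div>, the first sed strips the leading whitespace, and the second sed strips the tag, leaving the bare words the loop compares.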

Parse HTML tables to variables in Bash

I will break down the answer I tried using xmllint, which supports a --html flag for parsing HTML files:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}'
content1 content2

First, you can check the sanity of your HTML file by parsing it as below, which confirms that the file is as per the standards, or throws out errors if any are found:-

$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
content1
</table>
<table>
content2
</table>
</body></html>

with my original YourHTML.html file just being:-

$ cat YourHTML.html
<table>
content1
</table>
<table>
content2
</table>

Now for the value extraction part. The steps as they execute:-

Starting the file parsing from root-node to the repeating node (//html/body/table) and running xmllint in HTML parser & interactive shell mode (xmllint --html --shell)

Running the command on its own produces this result,

/ >  -------
<table>
content1
</table>

-------
<table>
content2
</table>
/ >

Now removing the shell prompt lines and the tags using sed, i.e. sed '/^\/ >/d' | sed 's/<[^>]*.//g', produces

content1

-------

content2

Now removing the newlines from the above output using tr, so that awk can process the records with ------- as the field separator, produces

content1 -------content2

The awk command on the above output will then produce the output as needed; awk -F"-------" '{print $1,$2}'

content1  content2

Putting it together in a shell script, it looks like

#!/bin/bash

# extract table1 value
table1Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1}')

# extract table2 value
table2Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $2}')

# can be extended up-to any number of nodes

Or quite simply:-

#!/bin/bash

echo "cat //html/body/table" | xmllint --html --shell file | sed '/^\/ >/d' | \
sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}' | \
while read -r value1 value2
do
    # Do whatever with the values extracted, e.g.:
    echo "table1: $value1  table2: $value2"
done
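
One caveat: because the while loop sits on the right-hand side of a pipeline, bash runs it in a subshell, so any variables you assign inside the loop vanish after done. If the values are needed later in the script, feed the loop through process substitution instead, e.g.:

while read -r value1 value2
do
    table1Val=$value1
    table2Val=$value2
done < <(echo "cat //html/body/table" | xmllint --html --shell file | \
  sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}')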

P.S:- The number of commands can be reduced with a simpler awk/sed combination; this is just a solution that works. The xmllint version I have used is xmllint: using libxml version 20706

Extract cell value from html table using bash

Note that your document output is ill-formed (it lacks some opening <a> tags). Is that normal/expected, or a typo? Otherwise, here is a well-formed version.

Command

I like xmlstarlet, simple and straight forward XPath for short tests:

xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n 

Explanation

sel (or select)          - select data or query XML document(s) (XPath, etc.)
-t or --template         - start a template
-m or --match <xpath>    - match an XPath expression
-v or --value-of <xpath> - print the value of an XPath expression
-n or --nl               - print a new line
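
Applied to a file (the filename here is an assumption; xmlstarlet sel takes the document as its last argument):

xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n YourTable.html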

Output

Deployed
Deployed
# plus empty-cell

How to scrape an HTML table and print it on terminal using Bash?

An Example:

wget -q -O - 'http://www.tesouro.fazenda.gov.br/tesouro-direto-precos-e-taxas-dos-titulos' |\
xmlstarlet format --recover --html 2>/dev/null |\
xmlstarlet select --html --template --value-of "/html/body/div/div/div/div/div/div/div/div/div/div/table/tbody/tr/td[@class='listing0' or @class='listing' or @class='listing ']" |\
paste -d ";" - - - - - |\
column -s ";" -t

Output:


Tesouro IPCA+ 2024 (NTNB Princ) 15/08/2024 5,06 R$43,60 R$2.180,09
Tesouro IPCA+ 2035 (NTNB Princ) 15/05/2035 5,48 R$35,96 R$1.198,82
Tesouro IPCA+ 2045 (NTNB Princ) 15/05/2045 5,48 R$35,21 R$704,20
Tesouro IPCA+ com Juros Semestrais 2026 (NTNB) 15/08/2026 5,10 R$32,72 R$3.272,36
Tesouro IPCA+ com Juros Semestrais 2035 (NTNB) 15/05/2035 5,35 R$32,63 R$3.263,54
Tesouro IPCA+ com Juros Semestrais 2050 (NTNB) 15/08/2050 5,41 R$33,60 R$3.360,49
Tesouro Prefixado 2020 (LTN) 01/01/2020 8,38 R$33,85 R$846,45
Tesouro Prefixado 2023 (LTN) 01/01/2023 10,20 R$30,58 R$611,76
Tesouro Prefixado com Juros Semestrais 2027 (NTNF) 01/01/2027 10,37 R$30,67 R$1.022,61
Tesouro Selic 2023 (LFT) 01/03/2023 0,00 R$92,38 R$9.238,83

I inserted xmlstarlet format --recover --html 2>/dev/null because the HTML from this URL is not valid.
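
As a side note, the paste -d ";" - - - - - stage folds every five extracted cell values into one semicolon-separated line (one line per table row), and column -s ";" -t aligns those lines into columns. A minimal sketch of the same mechanism:

printf '%s\n' a b c d e f g h i j | paste -d ";" - - - - - | column -s ";" -t

which prints:

a  b  c  d  e
f  g  h  i  j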


See: xmlstarlet select --help, man paste and man column

How to extract tables from many html files into one csv file?

HTML can be extremely messy. I would therefore suggest using something more high-level than a bash script. Since you had already tagged the question with the python tag (rightly replaced with the bash tag in a later edit), let's go with python with BeautifulSoup.

EDIT: In comments to this answer the author of the OP clarified what the OP really wanted:

  1. Collect the contents of td tags in a html table.

As in:

<td class="bzt">data12</td>

  2. Additionally collect data from a link in the src attribute of one or more script tags in the same html file.

As in:

<script src="hq.sohujs.cn/list=data18" type="text/javascript" charset="gbk"></script>

  3. Perform 1. and 2. for all html files in the current working directory.

  4. Save this as a csv table with fields separated by TAB ("\t").

Working solution for python3 and BeautifulSoup

I extended the script from the earlier version of this answer to do this and added some explanation in comments:

"""module import"""
from bs4 import BeautifulSoup
import glob

"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")

for filename in filenames:
    """parse each file with bs4"""
    soup = BeautifulSoup(open(filename), 'html.parser')

    """obtain data from td tags"""
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    """clean data: remove empty strings"""
    tdTextList = [td for td in tdTextList if not td == '']

    """obtain data from script tag attributes"""
    scriptTags = soup.findAll("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    """write data to output002.csv"""
    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

How to run

From a terminal in the directory where the html files are, do:

python3 <script_name.py>

Alternatively, you can move the working directory to the correct location (where the html files are) at the beginning of the script with:

import os
os.chdir("</path/to/directory>")

Working solution for python2 and BeautifulSoup

Since the author of the OP requested a python2 version, I provide one here. The only difference from the python3 version above is the file handler (this version uses file() instead of open()).

"""module import"""
from bs4 import BeautifulSoup
import glob

"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")

for filename in filenames:
    """parse each file with bs4"""
    soup = BeautifulSoup(file(filename), 'html.parser')

    """obtain data from td tags"""
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    """clean data: remove empty strings"""
    tdTextList = [td for td in tdTextList if not td == '']

    """obtain data from script tag attributes"""
    scriptTags = soup.findAll("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    """write data to output002.csv"""
    with file("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

Running the python2 version is analogous to python3 above.


Old version of this answer

The following script does what you describe:

  1. collect all the contents of all html files in the current directory

  2. write them to a csv with a tab separator.

Here is an example script:

from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")
tdTextList = []
for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList += [td.text for td in soup.find_all("td")]

with open("output001.csv", "w") as outputfile:
    for tdText in tdTextList:
        outputfile.write(tdText)
        outputfile.write("\t")

This is what you describe. It is probably not what you want.

Note that this will produce a file with a single very long row (you do not specify when you want a new row). And it may accidentally produce a malformed file if the contents of any of the td tags contains a newline character.

To give an output file that looks a bit nicer, let's write a new line for each html file that is read and let's remove leading and trailing spaces as well as newline characters from the data before writing it to the output.

from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")

for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

Note: you can run either script from the bash shell with:

python3 <script_name.py>

How to fetch whole row from HTML data through shell script?

You will have to get the required td and then the parent tr of that td. You can do this with the following XPath.

To get required td

//table//td[contains(text(), '19.8.2')]

To get parent tr of required td

//table//td[contains(text(), '19.8.2')]/parent::tr
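
To run such an XPath from a shell script, xmllint's --xpath option can be used (the filename here is an assumption):

xmllint --html --xpath "//table//td[contains(text(), '19.8.2')]/parent::tr" page.html

This prints the whole <tr> element containing the matching cell.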


