How to extract data from html table in shell script?
Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }' FILE
Here you can see it in action: https://ideone.com/zGfLe
Some explanation:
-F sets the input field separator to a regexp (any of tr's or td's opening or closing tags),
then the pattern restricts the action to lines that match those tags AND contain at least two consecutive uppercase letters,
then it prints the needed fields.
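For instance, against a hypothetical one-row-per-line table in the same shape (not from the original question):

```shell
# hypothetical input in the format the answer assumes; the second row
# has no uppercase pair, so the pattern skips it
printf '%s\n' \
  '<tr><td>AB</td><td>CD</td><td>EF</td></tr>' \
  '<tr><td>12</td><td>34</td><td>56</td></tr>' |
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }'
# prints: AB CD EF
```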
HTH
Extract text from HTML based on table column via Shell Script
#!/bin/bash
for i in `cat sample.html | grep '<\/div>' | sed 's/\s\+//' | sed 's/<.*>//'`; do
  if [ "$i" = "$1" ]; then
    echo "$prev"
  fi
  prev=$i
done
Example of using
$ ./filter.sh primary
core6692.myserverdomain.com
P.S.: the format of sample.html should be exactly as you posted here: the server and the name should each end with a tag and start with whitespace or a tab.
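As a sketch of the expected input, here is a hypothetical sample.html in that exact shape, with the same loop run inline (the file contents are an assumption, not from the question):

```shell
# hypothetical sample.html: each value on its own line, indented, closed by a tag
printf '\tcore6692.myserverdomain.com</div>\n\tprimary</div>\n' > sample.html

# the same loop as filter.sh, with $1 replaced by "primary"
for i in `cat sample.html | grep '<\/div>' | sed 's/\s\+//' | sed 's/<.*>//'`; do
  if [ "$i" = "primary" ]; then
    echo "$prev"
  fi
  prev=$i
done
# prints: core6692.myserverdomain.com
```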
Parse HTML tables to variables in Bash
I will break down the answer, which uses xmllint; it supports a --html flag for parsing HTML files.
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}'
content1 content2
First you can check the sanity of your HTML file by parsing it as below, which confirms whether the file follows the standards or reports any errors it finds:
$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
content1
</table>
<table>
content2
</table>
</body></html>
with my original YourHTML.html file just being:
$ cat YourHTML.html
<table>
content1
</table>
<table>
content2
</table>
Now for the value-extraction part. The steps, as executed:
Starting the file parsing from root-node to the repeating node (//html/body/table
) and running xmllint
in HTML parser & interactive shell mode (xmllint --html --shell
)
Running the command plainly produces a result,
/ > -------
<table>
content1
</table>
-------
<table>
content2
</table>
/ >
Now removing the shell prompt lines and the tags using sed, i.e.
sed '/^\/ >/d' | sed 's/<[^>]*.//g'
produces
content1
-------
content2
Now removing the newlines from the above command using tr
so that awk
can process the records using the field separator as -------
content1 -------content2
The awk
command on the above output will produce the output as needed: awk -F"-------" '{print $1,$2}'
content1 content2
Putting it together in a shell script, it looks like
#!/bin/bash
# extract table1 value
table1Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1}')
# extract table2 value
table2Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $2}')
# can be extended up-to any number of nodes
Or quite simply:
#!/bin/bash
echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | \
sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}' | \
while read -r value1 value2
do
    # Do whatever with the values extracted
    echo "$value1 $value2"
done
P.S.: The number of commands can be reduced by combining the awk/sed steps; this is just a solution that works. The xmllint
version I have used reports: xmllint: using libxml version 20706
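As an aside not in the original answer: newer xmllint builds also expose a --xpath option, which can shorten the pipeline considerably. A sketch, assuming the same two-table YourHTML.html from above:

```shell
# recreate the sample file from the answer
cat > YourHTML.html <<'EOF'
<table>
content1
</table>
<table>
content2
</table>
EOF

# let XPath's normalize-space() pull the text of each table directly
val1=$(xmllint --html --xpath "normalize-space(//table[1])" YourHTML.html)
val2=$(xmllint --html --xpath "normalize-space(//table[2])" YourHTML.html)
echo "$val1 $val2"
# prints: content1 content2
```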
Extract cell value from html table using bash
Note that your document output is ill-formed (it lacks some opening <a>
tags); is that normal/expected, or a typo? Otherwise, here is a well-formed version.
Command
I like xmlstarlet, simple and straight forward XPath for short tests:
xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n
Explanation
sel (or select) - Select data (mode) or query XML document(s) (XPATH, etc)
-t or --template - start a template
-m or --match <xpath> - match XPATH expression
-v or --value-of <xpath> - print value of XPATH expression
-n or --nl - print new line
Output
Deployed
Deployed
# plus empty-cell
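To reproduce this, here is a hypothetical table.html (not the asker's actual file) where the 8th td of each row holds the status:

```shell
# hypothetical table.html: two rows, status text in the 8th cell
cat > table.html <<'EOF'
<table>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td>f</td><td>g</td><td>Deployed</td></tr>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td>f</td><td>g</td><td>Deployed</td></tr>
</table>
EOF

xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n table.html
# prints "Deployed" once per row
```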
How to scrape a HTML table and print it on terminal using Bash?
An Example:
wget -q -O - 'http://www.tesouro.fazenda.gov.br/tesouro-direto-precos-e-taxas-dos-titulos' |\
xmlstarlet format --recover --html 2>/dev/null |\
xmlstarlet select --html --template --value-of "/html/body/div/div/div/div/div/div/div/div/div/div/table/tbody/tr/td[@class='listing0' or @class='listing' or @class='listing ']" |\
paste -d ";" - - - - - |\
column -s ";" -t
Output:
Tesouro IPCA+ 2024 (NTNB Princ) 15/08/2024 5,06 R$43,60 R$2.180,09
Tesouro IPCA+ 2035 (NTNB Princ) 15/05/2035 5,48 R$35,96 R$1.198,82
Tesouro IPCA+ 2045 (NTNB Princ) 15/05/2045 5,48 R$35,21 R$704,20
Tesouro IPCA+ com Juros Semestrais 2026 (NTNB) 15/08/2026 5,10 R$32,72 R$3.272,36
Tesouro IPCA+ com Juros Semestrais 2035 (NTNB) 15/05/2035 5,35 R$32,63 R$3.263,54
Tesouro IPCA+ com Juros Semestrais 2050 (NTNB) 15/08/2050 5,41 R$33,60 R$3.360,49
Tesouro Prefixado 2020 (LTN) 01/01/2020 8,38 R$33,85 R$846,45
Tesouro Prefixado 2023 (LTN) 01/01/2023 10,20 R$30,58 R$611,76
Tesouro Prefixado com Juros Semestrais 2027 (NTNF) 01/01/2027 10,37 R$30,67 R$1.022,61
Tesouro Selic 2023 (LFT) 01/03/2023 0,00 R$92,38 R$9.238,83
I inserted xmlstarlet format --recover --html 2>/dev/null
because the HTML from this URL is not valid.
See: xmlstarlet select --help
, man paste
and man column
How to extract tables from many html files into one csv file?
HTML can be extremely messy. I would therefore suggest using something more high-level than a bash script. Since you had already tagged the question with the python tag (rightly replaced with the bash tag in a later edit), let's go with Python with BeautifulSoup.
EDIT: In comments to this answer the author of the OP clarified what the OP really wanted:
1. Collect the contents of td tags in a html table, as in:
<td class="bzt">data12</td>
2. Additionally collect data from a link in the src attribute of one or more script tags in the same html file, as in:
<script src="hq.sohujs.cn/list=data18" type="text/javascript" charset="gbk"></script>
3. Perform 1. and 2. for all html files in the current working directory.
4. Save this as a csv table with fields separated by TAB ("\t").
Working solution for python3 and BeautifulSoup
I extended the script from the earlier version of this answer to do this and added some explanation in comments:
# module import
from bs4 import BeautifulSoup
import glob

# obtain list of all html files in cwd
filenames = glob.glob("*.html")

for filename in filenames:
    # parse each file with bs4
    soup = BeautifulSoup(open(filename), 'html.parser')

    # obtain data from td tags
    tdTextList = [td.text.strip().replace("\n", "") for td in soup.find_all("td")]

    # clean data: remove empty strings
    tdTextList = [td for td in tdTextList if not td == '']

    # obtain data from script tag attributes
    scriptTags = soup.find_all("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    # write data to output002.csv
    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")
How to run
From a terminal in the directory where the html files are, do:
python3 <script_name.py>
Alternatively, you can move the working directory to the correct location (where the html files are) at the beginning of the script with:
import os
os.chdir("</path/to/directory>")
Working solution for python2 and BeautifulSoup
Since the author of the OP requested a python2 version, I provide one here. The only difference from the python3 version above is the file handler (this version uses python2's file()
instead of open()
).
# module import
from bs4 import BeautifulSoup
import glob

# obtain list of all html files in cwd
filenames = glob.glob("*.html")

for filename in filenames:
    # parse each file with bs4
    soup = BeautifulSoup(file(filename), 'html.parser')

    # obtain data from td tags
    tdTextList = [td.text.strip().replace("\n", "") for td in soup.find_all("td")]

    # clean data: remove empty strings
    tdTextList = [td for td in tdTextList if not td == '']

    # obtain data from script tag attributes
    scriptTags = soup.find_all("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    # write data to output002.csv
    with file("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")
Running the python2 version is analogous to python3 above.
Old version of this answer
The following script does what you describe:
collect all the contents of all html files in the current directory
write them to a csv with tab separator.
Here is an example script:
from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")
tdTextList = []

for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList += [td.text for td in soup.find_all("td")]

with open("output001.csv", "w") as outputfile:
    for tdText in tdTextList:
        outputfile.write(tdText)
        outputfile.write("\t")
This is what you describe. It is probably not what you want.
Note that this will produce a file with a single very long row (you do not specify when you want a new row). And it may accidentally produce a malformed file if the contents of any of the td tags contains a newline character.
To give an output file that looks a bit nicer, let's write a new line for each html file that is read and let's remove leading and trailing spaces as well as newline characters from the data before writing it to the output.
from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")

for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList = [td.text.strip().replace("\n", "") for td in soup.find_all("td")]
    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")
Note: you can run either script from the bash shell with:
python3 <script_name.py>
How to fetch whole row from HTML data through shell script?
You will have to get the required td
and then the parent tr
of that td
. You can do this with the following XPath.
To get the required td:
//table//td[contains(text(), '19.8.2')]
To get the parent tr
of the required td:
//table//td[contains(text(), '19.8.2')]/parent::tr
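Any XPath-capable tool can evaluate this; here is a sketch with xmllint, assuming a hypothetical versions.html (the file contents are an assumption, not from the question):

```shell
# hypothetical versions.html with one version per row
cat > versions.html <<'EOF'
<table>
  <tr><td>19.8.1</td><td>old</td></tr>
  <tr><td>19.8.2</td><td>current</td></tr>
</table>
EOF

# select the whole row whose cell contains 19.8.2
xmllint --html --xpath "//table//td[contains(text(), '19.8.2')]/parent::tr" versions.html
```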