How do I use Python and lxml to parse a local html file?
If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.
from lxml import html

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
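Once the tree is built, you can query it with XPath. A minimal sketch, using an inline HTML string in place of the local file (the paragraph and its id are made up for illustration):

```python
from lxml import html

page = "<html><body><p id='para1'>Hello World!</p></body></html>"
tree = html.fromstring(page)

# xpath() returns a list of matches; text() extracts the text nodes
texts = tree.xpath("//p[@id='para1']/text()")
print(texts)  # ['Hello World!']
```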
Powershell parsing local HTML file
You can use the Internet Explorer COM object to do what you'd like the HTMLFile COM object to do. HTMLFile doesn't work reliably in all versions of PowerShell, so this is a viable alternative.
ForEach ($system in (Get-Content C:\temp\computers.txt)) {
    $folder = "\\$system\c`$\ProgramData\Autodesk\AdLM\"
    Get-ChildItem $folder *.html |
        ForEach-Object {
            $c = $_.BaseName
            $ie = New-Object -ComObject InternetExplorer.Application
            $ie.Navigate("$_")
            # Wait for IE to finish loading and parsing the file
            while ($ie.Busy -eq $true) {
                Start-Sleep -Milliseconds 500
            }
            $doc = $ie.Document
            $element = $doc.getElementById('para1')
            $element.innerText |
                ForEach-Object { Add-Content -Path C:\temp\results.csv "$c,$system,$_" }
            # Release the COM object so IE processes don't pile up
            $ie.Quit()
        }
}
Parse local HTML file
You can serve the file with a local web server to get around this limitation of Invoke-WebRequest:
PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm
PS > $foo.Links.Count
1
Note this works even with no internet connection. For example:
PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'
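If you don't have a web server handy, Python's standard library can serve a directory over HTTP. A minimal sketch (port 0 lets the OS pick a free port; use a fixed port like 8080 if you prefer):

```python
import http.server
import threading
import urllib.request

# Serve the current directory over HTTP in a background thread
server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any HTTP client (Invoke-WebRequest, curl, urllib) can now fetch local files
with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
    status = resp.status
print(status)  # 200

server.shutdown()
```

With a server like this running, Invoke-WebRequest can fetch the local file over http://localhost as a normal URL.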
Scrape data from local HTML file
I had to swap over to HTML Agility Pack, but I got it working:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

// saveLocation is the path to the file containing the HTML
htmlDoc.Load(saveLocation);

foreach (HtmlNode table in htmlDoc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        if (row.InnerText.Contains("DESIGN CAPACITY"))
        {
            designCapTxt.Text = row.InnerText;
        }

        if (row.InnerText.Contains("FULL CHARGE CAPACITY"))
        {
            fullCapTxt.Text = row.InnerText;
        }
    }
}
Parse local HTML python (lxml)
I think you need:

tree = etree.parse("text1.html", parser)

i.e. pass the filename straight to etree.parse, without wrapping it in StringIO and calling fromstring.
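A minimal sketch of that approach (the sample file contents and heading are made up for illustration):

```python
from lxml import etree

# Create a small sample file standing in for text1.html
with open("text1.html", "w") as f:
    f.write("<html><body><h1>Title</h1></body></html>")

# Parse the file directly -- no StringIO, no fromstring
parser = etree.HTMLParser()
tree = etree.parse("text1.html", parser)
print(tree.xpath("//h1/text()"))  # ['Title']
```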
How to use Mechanize to parse local file
Mechanize uses URI strings to point to what it's supposed to parse. Normally we'd use an "http" or "https" scheme to point to a web server, and that's where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.
I have a little HTML file on my Desktop called "test.html":
<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>
Running this code:
require 'mechanize'
agent = Mechanize.new
page = agent.get('file:/Users/ttm/Desktop/test.html')
puts page.body
Outputs:
<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>
Which tells me Mechanize loaded the file, parsed it, then accessed the body.
However, unless you actually need to manipulate forms and/or navigate pages, Mechanize is probably NOT what you want to use. Nokogiri, which Mechanize is built on, is a better choice for parsing, extracting data, or manipulating markup, and it's agnostic about what scheme was used or where the file is actually located:
require 'nokogiri'
doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
puts doc.to_html
which outputs the same file after parsing it.
Back to your question, how to find the node only using Nokogiri:
Changing test.html to:
<!DOCTYPE html>
<html>
<head></head>
<body>
<div class="product_name">Hello World!</div>
</body>
</html>
and running:
require 'nokogiri'
doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
doc.search('div.product_name').map(&:text)
# => ["Hello World!"]
shows that Nokogiri found the node and returned the text.
This code in your sample could be better:
text = node.text
puts "product name: " + text.to_s
node.text returns a string:

doc = Nokogiri::HTML('<p>hello world!</p>')
doc.at('p').text # => "hello world!"
doc.at('p').text.class # => String

So text.to_s is redundant. Simply use text.
Issues updating a local HTML file with parsed data from uploaded HTML files
You have two possibilities: fs.readFileSync, which is easy to use but, since it's synchronous, blocks your thread (and makes your server unresponsive while the files are being read); or the more elegant solution, the Promise version, which you can await.
const promises = require('fs').promises;
const htmlparse = require('node-html-parser').parse;

let element1, element2;

async function datatoString() {
    let html = await promises.readFile(__dirname + "/api/upload/" + file1, 'utf8');
    let root = htmlparse(html);
    const head = root.querySelector('head');
    element1 = head.toString();
    console.log("-------------break------------");
    console.log(element1);

    html = await promises.readFile(__dirname + "/api/upload/" + file2, 'utf8');
    root = htmlparse(html);
    const body = root.querySelector('body');
    element2 = body.toString();
    console.log("-------------break------------");
    console.log(element2);
}
HTML Parser for local HTML files
HTML Agility Pack is fine for local files; check out this example from the docs.
Alternatively, load the content from the file into a string with something like File.ReadAllText, then pass it into HtmlDocument.LoadHtml(string html).