Parse Local HTML File

How do I use Python and lxml to parse a local HTML file?

If the file is local, you shouldn't be using requests -- just open the file and read it in. requests expects to be talking to a web server.

from lxml import html

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
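If you'd rather let lxml handle the file I/O itself, `html.parse()` accepts a filename directly (a minimal sketch; it writes a tiny demo page first so it runs as-is -- substitute your own file):

```python
from pathlib import Path
from lxml import html

# Write a small stand-in page so the example is self-contained
Path("site_1.html").write_text("<html><body><p>Hello</p></body></html>")

# parse() opens and reads the file itself and returns an ElementTree;
# getroot() gives the same kind of element that fromstring() returns
tree = html.parse("site_1.html")
root = tree.getroot()
print(root.findtext(".//p"))  # -> Hello
```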

PowerShell parsing local HTML file

You can use the Internet Explorer COM object to do what you'd like the HTMLFile COM object to do. HTMLFile doesn't work reliably in all versions of PowerShell, so this is a viable alternative.

ForEach ($system in (Get-Content C:\temp\computers.txt)) {
    $folder = "\\$system\c`$\ProgramData\Autodesk\AdLM\"
    Get-ChildItem $folder *.html |
        ForEach-Object {
            $c = $_.BaseName
            $ie = New-Object -ComObject InternetExplorer.Application
            $ie.Navigate("$_")
            while ($ie.Busy -eq $true) {
                Start-Sleep -Milliseconds 500
            }
            $doc = $ie.Document
            $elements = $doc.getElementById('para1')
            $elements.innerText | ForEach-Object { Add-Content -Path C:\temp\results.csv "$c,$system,$_" }
            # Release the IE COM object when done with this file
            $ie.Quit()
        }
}

Parse local HTML file

You can serve the file from a local web server to get around the dumb limitation of Invoke-WebRequest only accepting URLs:

PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm

PS > $foo.Links.Count
1

Note this works even with no internet connection. For example:


PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'
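If Python is installed, its standard library gives you a throwaway server for exactly this (a sketch, not part of the original answer; it binds a free port and serves the current directory from a background thread so it won't block the console):

```python
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Port 0 asks the OS for any free port; SimpleHTTPRequestHandler serves
# the current working directory, so files next to this script become
# reachable over HTTP -- no internet connection required.
server = HTTPServer(("localhost", 0), SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"Serving at http://localhost:{port}/")
```

Point Invoke-WebRequest at `http://localhost:<port>/example.htm` and it behaves exactly as it would against a remote site.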

Scrape data from local HTML file

I had to swap over to Html Agility Pack, but I got it working:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;
// saveLocation is a path to a file containing the HTML
htmlDoc.Load(saveLocation);
foreach (HtmlNode table in htmlDoc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        if (row.InnerText.Contains("DESIGN CAPACITY"))
        {
            designCapTxt.Text = row.InnerText;
        }
        if (row.InnerText.Contains("FULL CHARGE CAPACITY"))
        {
            fullCapTxt.Text = row.InnerText;
        }
    }
}

Parse local HTML python (lxml)

I think you need

tree = etree.parse("text1.html", parser)

without the StringIO and fromstring steps.
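Spelled out as a runnable sketch (the name text1.html comes from the question; the snippet writes a tiny stand-in file so it runs as-is):

```python
from pathlib import Path
from lxml import etree

# Stand-in for the question's text1.html
Path("text1.html").write_text("<html><body><h1>Title</h1></body></html>")

# etree.parse() reads the file directly when given an HTML parser --
# no StringIO wrapper, no fromstring()
parser = etree.HTMLParser()
tree = etree.parse("text1.html", parser)
print(tree.findtext(".//h1"))  # -> Title
```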

How to use Mechanize to parse local file

Mechanize uses URI strings to point to what it's supposed to parse. Normally we'd use a "http" or "https" scheme to point to a web-server, and that's where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.

I have a little HTML file on my Desktop called "test.html":

<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>

Running this code:

require 'mechanize'

agent = Mechanize.new
page = agent.get('file:/Users/ttm/Desktop/test.html')
puts page.body

Outputs:

<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>

Which tells me Mechanize loaded the file, parsed it, then accessed the body.

However, unless you need to actually manipulate forms and/or navigate pages, Mechanize is probably NOT what you want to use. Instead, Nokogiri, which Mechanize uses under the hood, is a better choice for parsing, extracting data, or manipulating the markup, and it's agnostic about what scheme was used or where the file is actually located:

require 'nokogiri'

doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
puts doc.to_html

which outputs the same file after parsing it.

Back to your question, how to find the node only using Nokogiri:

Changing test.html to:

<!DOCTYPE html>
<html>
<head></head>
<body>
<div class="product_name">Hello World!</div>
</body>
</html>

and running:

require 'nokogiri'

doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
doc.search('div.product_name').map(&:text)
# => ["Hello World!"]

shows that Nokogiri found the node and returned the text.

This code in your sample could be better:

text = node.text  
puts "product name: " + text.to_s

node.text returns a string:

doc = Nokogiri::HTML('<p>hello world!</p>')
doc.at('p').text # => "hello world!"
doc.at('p').text.class # => String

So text.to_s is redundant. Simply use text.

Issues updating a local HTML file with parsed data from uploaded HTML files

You have two possibilities: fs.readFileSync, which is easy to use, but since it's synchronous it blocks your thread (and makes your server unresponsive while the files are being read). The more elegant solution is to use the promise-based API and await it.

const promises = require('fs').promises;
const htmlparse = require('node-html-parser').parse;

let element1, element2;

async function datatoString() {
    let html = await promises.readFile(__dirname + "/api/upload/" + file1, 'utf8');
    let root = htmlparse(html);
    const head = root.querySelector('head');
    element1 = head.toString();
    console.log("-------------break------------");
    console.log(element1);

    html = await promises.readFile(__dirname + "/api/upload/" + file2, 'utf8');
    root = htmlparse(html);
    const body = root.querySelector('body');
    element2 = body.toString();
    console.log("-------------break------------");
    console.log(element2);
}

HTML Parser for local HTML files

HTML Agility Pack is fine for local files; check out this example from the docs.

Alternatively, load the content from the file into a string using something like File.ReadAllText then pass it into HtmlDocument.LoadHtml(string html).


