How to Pretty-Print HTML with Nokogiri

How do I pretty-print HTML with Nokogiri?

By "pretty printing" an HTML page, I assume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; its pretty_print method exists for the "pp" library, and its output is useful for debugging only.

Several projects understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). A quick search also turned up a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

This requires, of course, downloading the linked XSL file to your filesystem first. I tried it quickly on my machine and it works like a charm.

Nokogiri and XML Formatting When Inserting Tags

Found the answer in the Nokogiri mailing list:

In XML, whitespace can be considered meaningful. If you parse a document that contains whitespace nodes, libxml2 will assume that whitespace nodes are meaningful and will not insert them for you.

You can tell libxml2 that whitespace is not meaningful by passing the "noblanks" flag to the parser. To demonstrate, here is an example that reproduces your error, then does what you want:

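require 'nokogiri'

# Appends <hello><world/></hello> inside the given node.
def build_from(node)
  Nokogiri::XML::Builder.with(node) do |xml|
    xml.hello do
      xml.world
    end
  end
end

# DATA reads everything after __END__ at the bottom of this script.
xml = DATA.read

# Default parse: whitespace nodes are kept, so the new tags are not indented.
doc = Nokogiri::XML(xml)
puts build_from(doc.at('bar')).to_xml

# With noblanks, blank nodes are dropped and the output is indented consistently.
doc = Nokogiri::XML(xml) { |x| x.noblanks }
puts build_from(doc.at('bar')).to_xml

__END__
<root>
  <foo>
    <bar>
      <baz />
    </bar>
  </foo>
</root>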

Print attributes of Nokogiri::XML::Node only, without innerHTML

You could output the node with its content deleted:

doc = Nokogiri::HTML.fragment(
'<div id="customer" class="highlighted">
<h1>Customer Name</h1>
<p>Some customer description</p>
</div>'
)

node = doc.at_css('#customer').clone
node.content = nil
p node.to_html
#=> "<div id=\"customer\" class=\"highlighted\"></div>"

Ruby Nokogiri take all the content

The URL you're using does not generate all the data as HTML; a lot of it is rendered after the page has been loaded.

Looking at the source code for the page, it appears that the data is rendered from a JSON script, embedded in the page.

It took quite some time to work out which part of the JSON data holds the contents that you want to work with:

  • The JSON object within the HTML, as a String object

page.css('script[type="application/json"]').first.inner_html

  • The JSON String converted to a Ruby Hash

my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)

  • The position inside the JSON of the Array of crypto Hashes

cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]

  • Pretty-print the first "crypto"

2.7.2 :142 > pp cryptos.first
{"id"=>1,
"name"=>"Bitcoin",
"symbol"=>"BTC",
"slug"=>"bitcoin",
"tags"=>
["mineable",
"pow",
"sha-256",
"store-of-value",
"state-channel",
"coinbase-ventures-portfolio",
"three-arrows-capital-portfolio",
"polychain-capital-portfolio",
"binance-labs-portfolio",
"blockchain-capital-portfolio",
"boostvc-portfolio",
"cms-holdings-portfolio",
"dcg-portfolio",
"dragonfly-capital-portfolio",
"electric-capital-portfolio",
"fabric-ventures-portfolio",
"framework-ventures-portfolio",
"galaxy-digital-portfolio",
"huobi-capital-portfolio",
"alameda-research-portfolio",
"a16z-portfolio",
"1confirmation-portfolio",
"winklevoss-capital-portfolio",
"usv-portfolio",
"placeholder-ventures-portfolio",
"pantera-capital-portfolio",
"multicoin-capital-portfolio",
"paradigm-portfolio"],
"cmcRank"=>1,
"marketPairCount"=>9158,
"circulatingSupply"=>18960043,
"selfReportedCirculatingSupply"=>0,
"totalSupply"=>18960043,
"maxSupply"=>21000000,
"isActive"=>1,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"dateAdded"=>"2013-04-28T00:00:00.000Z",
"quotes"=>
[{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}],
"isAudited"=>false,
"rank"=>1,
"hasFilters"=>false,
"quote"=>
{"USD"=>
{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}}
}

  • The value of the first "crypto"

cryptos.first["quote"]["USD"]["price"]

  • The key that you use in your Hash for the first "crypto"

cryptos.first["symbol"]

Put it all together and you get the following code (looping through each "crypto" with each_with_object):

require 'json'
require 'nokogiri'
require 'open-uri'

...

# Maps each crypto's name to its USD price, read from the JSON embedded in the page.
def crypto(page)
  my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)
  cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]

  cryptos.each_with_object({}) do |crypto, hsh|
    hsh[crypto["name"]] = crypto["quote"]["USD"]["price"]
  end
end

# scrapper: the parsed page, assumed to come from the elided code above
puts crypto(scrapper)

Writing Nokogiri output to a text file

If you are running the script in a Unix environment, you can redirect its output to a file like this:

$ ruby script_name.rb > crawling.txt

This way, everything the script writes to standard output (p, puts, print, etc.) ends up in the file.
Be aware that this overwrites the file's contents with the output of your script. If you want to append the output to the file instead, use this:

$ ruby script_name.rb >> crawling.txt
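
If you would rather write the file from inside the script, a minimal sketch with Ruby's File.open follows. The helper name write_lines is made up for illustration; mode "w" overwrites the file like > does, and mode "a" appends like >>. A temporary directory is used here only so the example does not touch your working directory.

```ruby
require 'tmpdir'

# Hypothetical helper: writes each line to path.
# mode "w" overwrites (like >), mode "a" appends (like >>).
def write_lines(path, lines, mode: "w")
  File.open(path, mode) do |file|
    lines.each { |line| file.puts(line) }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "crawling.txt")
  write_lines(path, ["first run"])              # creates or overwrites the file
  write_lines(path, ["second run"], mode: "a")  # appends to the existing file
  puts File.read(path)
end
```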

