How do I pretty-print HTML with Nokogiri?
By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method exists for the "pp" library, and its output is useful for debugging only.
Several projects understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). Searching also turned up a post titled "Pretty printing XHTML with Nokogiri and XSLT".
It comes down to this:
xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s
It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.
Nokogiri and XML Formatting When Inserting Tags
Found the answer in the Nokogiri mailing list:
In XML, whitespace can be considered meaningful. If you parse a document that contains whitespace nodes, libxml2 will assume that whitespace nodes are meaningful and will not insert them for you.
You can tell libxml2 that whitespace is not meaningful by passing the "noblanks" flag to the parser. To demonstrate, here is an example that reproduces your error, then does what you want:
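require 'nokogiri'

# Append <hello><world/></hello> to the given node and return the builder.
def build_from(node)
  Nokogiri::XML::Builder.with(node) do |xml|
    xml.hello do
      xml.world
    end
  end
end

xml = DATA.read

# Default parse: whitespace nodes are kept, so the new tags are not indented.
doc = Nokogiri::XML(xml)
puts build_from(doc.at('bar')).to_xml

# noblanks: whitespace nodes are dropped, so the serializer can indent everything.
doc = Nokogiri::XML(xml) { |x| x.noblanks }
puts build_from(doc.at('bar')).to_xml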
__END__
<root>
  <foo>
    <bar>
      <baz />
    </bar>
  </foo>
</root>
Print attributes of Nokogiri::XML::Node only, without innerHTML
You could output the node with its content deleted:
doc = Nokogiri::HTML.fragment(
  '<div id="customer" class="highlighted">
    <h1>Customer Name</h1>
    <p>Some customer description</p>
  </div>'
)
node = doc.at_css('#customer').clone
node.content = nil
p node.to_html
#=> "<div id=\"customer\" class=\"highlighted\"></div>"
Ruby Nokogiri take all the content
The URL you're using does not generate all the data as HTML; a lot of it is rendered after the page has been loaded.
Looking at the source code for the page, it appears that the data is rendered from a JSON script, embedded in the page.
It took quite some time to find the objects and work out which part of the JSON data holds the content that you want to work with:
- The JSON object within the HTML, as a String object:
page.css('script[type="application/json"]').first.inner_html
- The JSON String converted to a real Ruby Hash:
my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)
- The position inside the JSON of the Array of crypto Hashes:
cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
- Pretty print the first "crypto":
2.7.2 :142 > pp cryptos.first
{"id"=>1,
"name"=>"Bitcoin",
"symbol"=>"BTC",
"slug"=>"bitcoin",
"tags"=>
["mineable",
"pow",
"sha-256",
"store-of-value",
"state-channel",
"coinbase-ventures-portfolio",
"three-arrows-capital-portfolio",
"polychain-capital-portfolio",
"binance-labs-portfolio",
"blockchain-capital-portfolio",
"boostvc-portfolio",
"cms-holdings-portfolio",
"dcg-portfolio",
"dragonfly-capital-portfolio",
"electric-capital-portfolio",
"fabric-ventures-portfolio",
"framework-ventures-portfolio",
"galaxy-digital-portfolio",
"huobi-capital-portfolio",
"alameda-research-portfolio",
"a16z-portfolio",
"1confirmation-portfolio",
"winklevoss-capital-portfolio",
"usv-portfolio",
"placeholder-ventures-portfolio",
"pantera-capital-portfolio",
"multicoin-capital-portfolio",
"paradigm-portfolio"],
"cmcRank"=>1,
"marketPairCount"=>9158,
"circulatingSupply"=>18960043,
"selfReportedCirculatingSupply"=>0,
"totalSupply"=>18960043,
"maxSupply"=>21000000,
"isActive"=>1,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"dateAdded"=>"2013-04-28T00:00:00.000Z",
"quotes"=>
[{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}],
"isAudited"=>false,
"rank"=>1,
"hasFilters"=>false,
"quote"=>
{"USD"=>
{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}}
}
- The value of the first "crypto":
cryptos.first["quote"]["USD"]["price"]
- The key that you use in your Hash for the first "crypto":
cryptos.first["symbol"]
Put it all together and you get the following code (looping through each "crypto" with each_with_object):
require 'json'
require 'nokogiri'
require 'open-uri'
...
def crypto(page)
  my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)
  cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
  # Build a Hash that maps each crypto's name to its USD price.
  cryptos.each_with_object({}) do |crypto, hsh|
    hsh[crypto["name"]] = crypto["quote"]["USD"]["price"]
  end
end
puts crypto(scrapper)
Writing Nokogiri output to a text file
If you are running the script in a Unix environment, you can redirect the script's output to a file like this:
$ ruby script_name.rb > crawling.txt
This way, everything the script writes to standard output (p, puts, print, etc.) ends up in the file.
Be aware that this overwrites the file's contents with the output of your script. To append the output to the file instead, use this:
$ ruby script_name.rb >> crawling.txt