Nokogiri Parsing

Nokogiri parsing HTML

As a first step, find all <td> tags with xpath('//td'). Then, for each one, iterate over its children and collect the content of every child that is a Nokogiri::XML::Text node (you don't want to collect the <br> tags):

doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  p = td.at_xpath('p')
  a = []
  td.children.each do |child|
    if Nokogiri::XML::Text === child
      t = child.text.strip
      a << t unless t.empty?
    end
  end
  h[p.text] = a.join(', ')
end

result:

{"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security", 
"Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"}

Or, in a more compact form without the explicit inner loop:

doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  h[td.at_xpath('p').text] = td.children
    .select { |x| Nokogiri::XML::Text === x && !x.text.strip.empty? }
    .map { |x| x.text.strip }.join(', ')
end

I can't parse the page and get links with Nokogiri

I found a more interesting solution. For example:
require 'open-uri'

link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
# Binary write keeps the zip archive intact
File.binwrite('filename.zip', URI.parse(chromedriver_storage + link_driver).read)

contains("mac") can be changed to contains("linux") or contains("win"); pick whichever matches your operating system.

A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download link:

require 'open-uri'

chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # obtained with Capybara, keeping only the version

zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.binwrite('filename.zip', URI.parse(link).read)

It turns out that this parser, running in headless mode, can be put into a crontab task to keep the downloaded version up to date with the current browser.
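
The "is the site's version newer than mine" check can be sketched as follows. The method name chromedriver_link and the installed version string are hypothetical; Gem::Version ships with Ruby and compares dotted version numbers segment by segment, which plain string comparison does not.

```ruby
require 'rubygems'

# Base URL from the answer above.
STORAGE = 'https://chromedriver.storage.googleapis.com/'

# Returns the download link when `latest` is newer than `installed`, else nil.
def chromedriver_link(latest, installed, zip = 'chromedriver_mac64.zip')
  return nil unless Gem::Version.new(latest) > Gem::Version.new(installed)

  "#{STORAGE}#{latest}/#{zip}"
end

# Hypothetical installed version, older than the one on the site.
puts chromedriver_link('79.0.3945.36', '78.0.3904.105')
```

When nothing newer is available the method returns nil, so a cron job can simply skip the download.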

Parsing large HTML files with Nokogiri

The file seems to exceed Nokogiri's parser limits. You can relax the limits by adding the HUGE flag:

require 'open-uri'
require 'nokogiri'

url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
html = Nokogiri::HTML(URI.open(url)) do |config|
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186

Note that |= is the bitwise OR assignment operator; don't confuse it with the logical operator ||=.

According to the Parse Options documentation, you can also set this flag via config.huge.

Parsing text using Nokogiri

require 'nokogiri'
require 'open-uri'

First I open the source and use Nokogiri to parse it into a Nokogiri document.

doc = Nokogiri::HTML(URI.open("http://in.reuters.com/finance/stocks/companyOfficers?symbol=GOOGL.O"))

Now I select only the elements you are interested in, using an XPath expression.

elements = doc.xpath('//*[@id="companyNews"]/div/div[2]/table/tbody/tr/td[1]/h2/a')

The last step is to strip newlines and tabs from the text of each element and return the names as unique values.

elements.map{|officer| officer.text.strip}.uniq
# => ["Eric Schmidt", "Sergey Brin", "Lawrence Page", "Ruth Porat", "Sundar Pichai", "David Drummond", "John Hennessy", "L. John Doerr", "Roger Ferguson", "Diane Greene", "Ann Mather", "Alan Mulally", "Paul Otellini", "Kavitark Shriram", "Shirley Tilghman"]

Parsing HTML for specific td tags with Nokogiri

Once you have selected the table, you can do

table.last_element_child.previous

which takes the table's last element child and then returns that child's previous sibling.

https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet

Parsing an XML file using nokogiri to create \index fields for LaTeX

Does this help?

require 'nokogiri'

xml = '<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>'

fields = Nokogiri::XML(xml).xpath(".//field")

puts fields.first.text #=> "SS Titanic"
puts fields.map(&:text) #=> ["SS Titanic", "passengers", "5"]

Parsing html with rails and nokogiri

First, note that the HTML you posted is syntactically invalid: it is illegal to have more than one element with the same id attribute value. If you have control over your HTML, you should fix this problem.

Using that same (invalid) HTML, however, Nokogiri still has no trouble:

require 'nokogiri'
doc = Nokogiri::HTML(my_html)

doc.css('#mama').each_with_index do |div,i|
  puts "#{div.at_css('.test1').text} from mama ##{i}"
  puts "#{div.at_css('.test2').text} from mama ##{i}"
end

#=> text from mama #0
#=> text2 from mama #0
#=> text from mama #1
#=> text2 from mama #1
#=> text from mama #2
#=> text2 from mama #2

If you wanted to use XPath directly (as Nokogiri does behind the scenes for the CSS) you would do this:

doc.xpath("//div[@id='mama']").each_with_index do |div,i|
  puts "#{div.at_xpath("./*[@class='test1']").text} from mama ##{i}"
  puts "#{div.at_xpath("./*[@class='test2']").text} from mama ##{i}"
end

How do I use Nokogiri to parse an XML file?

Here I will try to clear up the questions and confusions you are having:

require 'nokogiri'

doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML

So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?

Not quite. The result of an XPath query such as doc.xpath("//Items/Item") is a Nokogiri::XML::NodeSet. It contains the 2 Item children of Items, each of which is a Nokogiri::XML::Element object. Since Element is a subclass of Nokogiri::XML::Node, you can also call them nodes.

doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]

We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen.

I don't quite follow this part. Note that map returns an Array, not a Hash. Below I try to show what a Node is and what a NodeSet is in Nokogiri. Remember that a NodeSet is a collection of Nodes.

@chld_class = @block.map do |node|
  node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
  node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]

@chld_name = @block.map do |node|
  node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]

