Nokogiri parsing HTML
First, find all <td> tags with xpath('//td'). Then, for each one, iterate over its children and collect the content of every child that is a Nokogiri::XML::Text node (you don't want to collect the <br> tags):
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  p = td.at_xpath('p')
  a = []
  td.children.each do |child|
    if Nokogiri::XML::Text === child
      t = child.text.strip
      a << t unless t.empty?
    end
  end
  h[p.text] = a.join(', ')
end
result:
{"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
"Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"}
Or, in a more compact form, without explicit loops:
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  h[td.at_xpath('p').text] = td.children
    .select { |x| Nokogiri::XML::Text === x && !x.text.strip.empty? }
    .map { |x| x.text.strip }.join(', ')
end
I can't parse the page and get links Nokogiri
I found a more interesting solution. For example:
require 'open-uri'
link = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
File.binwrite('filename.zip', URI.parse(chromedriver_storage + link).read)
contains("mac") can be changed to contains("linux") or contains("win"); it does not matter, choose whichever version matches your operating system.
A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download link:
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # get this using Capybara and cut out only the version
zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.binwrite('filename.zip', URI.parse(link).read)
It turns out that the parser, in headless mode, can be put into a crontab task to update the version for the current browser.
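The "more recent than mine" check can be sketched with Gem::Version, which compares dotted version strings numerically rather than as plain text. The method name download_link and the version '80.0.3987.16' passed in the usage line are illustrative assumptions, not part of the original answer; no network access is performed.

```ruby
require 'rubygems' # Gem::Version; usually loaded by default

CHROMEDRIVER_STORAGE = 'https://chromedriver.storage.googleapis.com/'

# Hypothetical helper: returns the download URL when `latest` is newer
# than `installed`, or nil when no update is needed.
def download_link(installed, latest, zip = 'chromedriver_mac64.zip')
  return nil unless Gem::Version.new(latest) > Gem::Version.new(installed)

  "#{CHROMEDRIVER_STORAGE}#{latest}/#{zip}"
end

link = download_link('79.0.3945.36', '80.0.3987.16')
puts link || 'already up to date'
```

Comparing the raw strings would be unreliable here, because '9.0' sorts after '10.0' as text.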
Parsing large HTML files with Nokogiri
The file seems to exceed Nokogiri's parser limits. You can relax the limits by adding the HUGE flag:
require 'open-uri'
require 'nokogiri'
url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
html = Nokogiri::HTML(URI.open(url)) do |config|
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186
Note that |= is the bitwise OR assignment operator; don't confuse it with the logical operator ||=.
According to the Parse Options documentation, you can also set this flag via config.huge.
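The difference between the two operators can be seen with plain integers. The constants below are hypothetical stand-ins chosen for illustration, not Nokogiri's own flag constants.

```ruby
# Hypothetical flag values (plain Integers used as bit masks).
RECOVER_FLAG = 1 << 0
HUGE_FLAG    = 1 << 19

options = RECOVER_FLAG
options |= HUGE_FLAG   # bitwise OR: sets the HUGE_FLAG bit and keeps RECOVER_FLAG
bitwise_result = options

options = RECOVER_FLAG
options ||= HUGE_FLAG  # logical OR: options is already truthy, so nothing is assigned
logical_result = options

puts bitwise_result & HUGE_FLAG != 0 # flag is set
puts logical_result & HUGE_FLAG != 0 # flag is NOT set
```

With ||= the flag would silently never be applied, because any existing integer value is truthy in Ruby.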
Parsing text using Nokogiri
require 'nokogiri'
require 'open-uri'
First, I open the source and use Nokogiri to parse it into a Nokogiri document (URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs).
doc = Nokogiri::HTML(URI.open("http://in.reuters.com/finance/stocks/companyOfficers?symbol=GOOGL.O"))
Next, I select only the elements you are interested in via an XPath expression.
elements = doc.xpath('//*[@id="companyNews"]/div/div[2]/table/tbody/tr/td[1]/h2/a')
The last step is to strip newlines and tabs from the text of each element and return the names as unique values.
elements.map{|officer| officer.text.strip}.uniq
# => ["Eric Schmidt", "Sergey Brin", "Lawrence Page", "Ruth Porat", "Sundar Pichai", "David Drummond", "John Hennessy", "L. John Doerr", "Roger Ferguson", "Diane Greene", "Ann Mather", "Alan Mulally", "Paul Otellini", "Kavitark Shriram", "Shirley Tilghman"]
Parsing HTML for specific td tags with Nokogiri
Once you select the table you can do
table.last_element_child.previous
which takes the table's last child element and then returns that element's previous sibling.
https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet
Parsing an XML file using nokogiri to create \index fields for LaTeX
Does this help?
require 'nokogiri'
# A heredoc avoids having to escape the double quotes inside the XML
xml = <<~XML
  <record time="2022-08-27T17:25:12" id="30">
    <field><text style="i"/><hide>SS </hide>Titanic<text/></field>
    <field>passengers</field>
    <field class="locator"><text style="b"/>5<text/></field>
  </record>
XML
fields = Nokogiri::XML(xml).xpath(".//field")
puts fields.first.text  #=> SS Titanic
p fields.map(&:text)    #=> ["SS Titanic", "passengers", "5"]
Parsing html with rails and nokogiri
First, note that the HTML you posted is syntactically invalid: it is illegal to have more than one element with the same id attribute value. If you have control over your HTML, you should fix this problem.
Using that same (invalid) HTML, however, Nokogiri still has no trouble:
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
doc.css('#mama').each_with_index do |div,i|
  puts "#{div.at_css('.test1').text} from mama ##{i}"
  puts "#{div.at_css('.test2').text} from mama ##{i}"
end
#=> text from mama #0
#=> text2 from mama #0
#=> text from mama #1
#=> text2 from mama #1
#=> text from mama #2
#=> text2 from mama #2
If you wanted to use XPath directly (as Nokogiri does behind the scenes for the CSS) you would do this:
doc.xpath("//div[@id='mama']").each_with_index do |div,i|
  puts "#{div.at_xpath("./*[@class='test1']").text} from mama ##{i}"
  puts "#{div.at_xpath("./*[@class='test2']").text} from mama ##{i}"
end
How do I use Nokogiri to parse an XML file?
Here I will try to address each of your questions in turn:
require 'nokogiri'
doc = Nokogiri::XML.parse <<-XML
<Collection version="2.0" id="74j5hc4je3b9">
<Name>A Funfair in Bangkok</Name>
<PermaLink>Funfair in Bangkok</PermaLink>
<PermaLinkIsName>True</PermaLinkIsName>
<Description>A small funfair near On Nut in Bangkok.</Description>
<Date>2009-08-03T00:00:00</Date>
<IsHidden>False</IsHidden>
<Items>
<Item filename="AGC_1998.jpg">
<Title>Funfair in Bangkok</Title>
<Caption>A small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-07T19:22:08</CreatedDate>
<Keywords>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="133" height="200" />
<PreviewSize width="532" height="800" />
<OriginalSize width="2279" height="3425" />
</Item>
<Item filename="AGC_1164.jpg" iscover="True">
<Title>Bumper Cars at a Funfair in Bangkok</Title>
<Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
<Authors>Anthony Bouch</Authors>
<Copyright>Copyright © Anthony Bouch</Copyright>
<CreatedDate>2009-08-03T22:08:24</CreatedDate>
<Keywords>
<Keyword>Bumper Cars</Keyword>
<Keyword>Funfair</Keyword>
<Keyword>Bangkok</Keyword>
<Keyword>Thailand</Keyword>
</Keywords>
<ThumbnailSize width="200" height="133" />
<PreviewSize width="800" height="532" />
<OriginalSize width="3725" height="2479" />
</Item>
</Items>
</Collection>
XML
So from my understanding of Nokogiri, each 'Items' is a node, and under that there are children nodes of 'Item'?
No. What doc.xpath("//Items/Item") returns is a Nokogiri::XML::NodeSet. It contains the 2 Item child nodes of Items, each of which is a Nokogiri::XML::Element object. You can also call them Nokogiri::XML::Node objects, since Element is a subclass of Node.
doc.class # => Nokogiri::XML::Document
@block = doc.xpath("//Items/Item")
@block.class # => Nokogiri::XML::NodeSet
@block.count # => 2
@block.map { |node| node.name }
# => ["Item", "Item"]
@block.map { |node| node.class }
# => [Nokogiri::XML::Element, Nokogiri::XML::Element]
@block.map { |node| node.children.count }
# => [19, 19]
@block.map { |node| node.class.superclass }
# => [Nokogiri::XML::Node, Nokogiri::XML::Node]
We create a map of this, which returns a hash I believe, and the code in {} goes through each node and places the children text into @block. Then I can display all of this child node's text to the screen.
I don't quite follow this part, so below I try to show what a Node is and what a NodeSet is in Nokogiri. Remember that a NodeSet is a collection of Nodes, and that map returns an Array, not a Hash.
@chld_class = @block.map do |node|
  node.children.class
end
@chld_class
# => [Nokogiri::XML::NodeSet, Nokogiri::XML::NodeSet]
@chld_name = @block.map do |node|
  node.children.map { |n| [n.name,n.class] }
end
@chld_name
# => [[["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]],
# [["text", Nokogiri::XML::Text],
# ["Title", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Caption", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Authors", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Copyright", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["CreatedDate", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["Keywords", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["ThumbnailSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["PreviewSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text],
# ["OriginalSize", Nokogiri::XML::Element],
# ["text", Nokogiri::XML::Text]]]
@chld_name = @block.map do |node|
  node.children.map{|n| [n.name,n.text.strip] if n.elem? }.compact
end.compact
@chld_name
# => [[["Title", "Funfair in Bangkok"],
# ["Caption", "A small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-07T19:22:08"],
# ["Keywords", "Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]],
# [["Title", "Bumper Cars at a Funfair in Bangkok"],
# ["Caption", "Bumper cars at a small funfair near On Nut in Bangkok."],
# ["Authors", "Anthony Bouch"],
# ["Copyright", "Copyright © Anthony Bouch"],
# ["CreatedDate", "2009-08-03T22:08:24"],
# ["Keywords",
# "Bumper Cars\n Funfair\n Bangkok\n Thailand"],
# ["ThumbnailSize", ""],
# ["PreviewSize", ""],
# ["OriginalSize", ""]]]