How do I parse and scrape the meta tags of a URL with Nokogiri?
Here's how I'd go about it:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">
EOT
contents = %w[description keywords].map { |name|
doc.at("meta[name='#{name}']")['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]
Or:
contents = doc.search("meta[name='description'], meta[name='keywords']").map { |n|
n['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]
How to get Meta Keywords using Nokogiri?
Here is a simple example:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("<html><head><meta name=\"Keywords\" content=\"one, two, three\"></head><body></body></html>")
doc.xpath("//meta[@name='Keywords']/@content").each do |attr|
puts attr.value
end
How to extract search meta name within HTML page to extract content using Nokogiri
The XPath you've used, //meta[@name ="trackmetrics_verification"], will return any meta element nodes that have a name attribute of trackmetrics_verification (there likely should be only one such node). You want the content attribute of this node. One way to get it is to extend the query to specify the attribute:
//meta[@name ="trackmetrics_verification"]/@content
With Nokogiri, using at_xpath since you only expect one matching node, you can get the value of the attribute node with the text method:
@doc.at_xpath('//meta[@name ="trackmetrics_verification"]/@content').text
An alternative with Nokogiri is to select the meta node and use the [] method to get the value of the attribute:
@doc.at_xpath('//meta[@name ="trackmetrics_verification"]')['content']
How to get content value out of meta tag in ruby on rails?
Given a @meta variable containing some HTML snippet as a string:
@meta = <<-HTML
<meta name="foo" content="content1">
<meta name="bar" content="content2">
<meta content="2019/01/10 09:59:59 +0900" name="r_end">
HTML
You can use Nokogiri to parse it:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(@meta)
doc.at_css('meta[name="r_end"]')['content']
#=> "2019/01/10 09:59:59 +0900"
at_css returns the first element matching the given CSS selector and [] returns the value for the given attribute.
How to use Nokogiri to change the HTML meta data?
Nokogiri is excellent for this:
require 'nokogiri'
doc = Nokogiri::HTML.parse(<<EOT)
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<meta name="description" content="Free Web tutorials">
</head>
<body></body>
</html>
EOT
meta = doc.at('meta[name]')
meta['content'] = 'foo'
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >> <meta name="description" content="foo">
# >> </head>
# >> <body></body>
# >> </html>
If you want to append something to the description's content:
meta['content'] = meta['content'] + ' by foobar'
Which results in:
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >> <meta name="description" content="Free Web tutorials by foobar">
# >> </head>
# >> <body></body>
# >> </html>
HTML that you don't control can change in wild and wonderful ways if the creators change to different HTML generators. That can break your application unless you use something robust, and regular expressions for HTML are not robust enough.
It's easy to write a pattern to match
<meta name="description" content="Free Web tutorials">
It's not so easy to write one that matches that one day, and then
<meta
name="description"
content="Free Web tutorials"
>
the next.
It's easy to imagine seeing various HTML output styles because the site's content people used different tools, along with some automation. A parser can handle it nicely.
Nokogiri get all HTML nodes
You could split each non-self-closing element's OuterXml around its InnerXml to get the opening tag, store the matching closing tag (if any) so it can be retrieved later, and walk the document with the Nokogiri reader so the list follows document order.
This requires your document to be a valid XML fragment, because it uses the XML parser rather than the HTML one.
require 'nokogiri'
[ "<html><body><h1>Header1</h1></body></html>",
"<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class=\"style\">d</span>olor</p></div></body></html>", <<END
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
END
].each { |string_page|
elem_all = Array.new
elem_ends = Hash.new
reader = Nokogiri::XML::Reader(string_page)
reader.each { |node|
if node.node_type.eql?(1)
if node.self_closing?
elem_all << node.outer_xml
else
elem_tags = node.outer_xml.split(node.inner_xml)
elem_all << elem_tags.first
elem_ends[node.local_name] = elem_tags[1] unless elem_tags.one?
end
end
elem_all << elem_ends[node.local_name] if node.node_type.eql?(15) and elem_ends.has_key?(node.local_name)
}
puts string_page
puts elem_all.to_s
puts
}
Outputs:
<html><body><h1>Header1</h1></body></html>
["<html>", "<body>", "<h1>", "</h1>", "</body>", "</html>"]
<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class="style">d</span>olor</p></div></body></html>
["<html>", "<body>", "<div>", "<h1>", "</h1>", "<hr/>", "</div>", "<div>", "<p>", "<br/>", "<span class=\"style\">", "</span>", "</p>", "</div>", "</body>", "</html>"]
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]
XML generation using Nokogiri involving nested tags and namespace
You only need 1 builder:
env_ns = {
"xmlns:env" => "http://abc.ca"
}
mm7_ns = {
"xmlns:mm7" => "http://def.ca"
}
builder = Nokogiri::XML::Builder.new do |xml|
xml['env'].Envelope(env_ns) do
xml.Header do
xml['mm7'].TransactionID(mm7_ns, "Some Text Here")
end
end
end
puts builder.to_xml
# will render the following:
# <?xml version="1.0"?>
# <env:Envelope xmlns:env="http://abc.ca">
# <env:Header>
# <mm7:TransactionID xmlns:mm7="http://def.ca">Some Text Here</mm7:TransactionID>
# </env:Header>
# </env:Envelope>