How to Get Meta Keywords Using Nokogiri

How do I parse and scrape the meta tags of a URL with Nokogiri?

Here's how I'd go about it:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">
EOT

contents = %w[description keywords].map { |name|
  doc.at("meta[name='#{name}']")['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

Or:

contents = doc.search("meta[name='description'], meta[name='keywords']").map { |n| 
  n['content'] 
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

How to get Meta Keywords using Nokogiri?

Here is a simple example:

require 'nokogiri'

doc = Nokogiri::HTML("<html><head><meta name=\"Keywords\" content=\"one, two, three\"></head><body></body></html>")

doc.xpath("//meta[@name='Keywords']/@content").each do |attr|
  puts attr.value
end

How to find a meta tag by name in an HTML page and extract its content using Nokogiri

The XPath you’ve used, //meta[@name ="trackmetrics_verification"], will return any meta element nodes that have a name attribute of trackmetrics_verification (there likely should be only one such node). You want the content attribute of this node. One way to get it is to extend the query to specify the attribute:

//meta[@name ="trackmetrics_verification"]/@content

With Nokogiri, use at_xpath since you expect only one matching node, then call the text method to get the value of the attribute node:

@doc.at_xpath('//meta[@name ="trackmetrics_verification"]/@content').text

An alternative with Nokogiri is to select the meta node and use the [] method to get the value of the attribute:

@doc.at_xpath('//meta[@name ="trackmetrics_verification"]')['content']

How to get content value out of meta tag in Ruby on Rails?

Given a @meta variable containing some HTML snippet as a string:

@meta = <<-HTML
  <meta name="foo" content="content1">
  <meta name="bar" content="content2">
  <meta content="2019/01/10 09:59:59 +0900" name="r_end">
HTML

You can use Nokogiri to parse it:

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(@meta)
doc.at_css('meta[name="r_end"]')['content']
#=> "2019/01/10 09:59:59 +0900"

at_css returns the first element matching the given CSS selector and [] returns the value for the given attribute.

How to use Nokogiri to change the HTML metadata?

Nokogiri is excellent for this:

require 'nokogiri'

doc = Nokogiri::HTML.parse(<<EOT)
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
    <meta name="description" content="Free Web tutorials">
  </head>
  <body></body>
</html>
EOT

meta = doc.at('meta[name="description"]')
meta['content'] = 'foo'

puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <head>
# >>     <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >>     <meta name="description" content="foo">
# >>   </head>
# >>   <body></body>
# >> </html>

If you want to append something to the description's existing content instead of replacing it:

meta['content'] = meta['content'] + ' by foobar'

Which results in:

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <head>
# >>     <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >>     <meta name="description" content="Free Web tutorials by foobar">
# >>   </head>
# >>   <body></body>
# >> </html>

HTML that you don't control can change in wild and wonderful ways if the creators change to different HTML generators. That can break your application unless you use something robust, and regular expressions for HTML are not robust enough.

It's easy to write a pattern to match

<meta name="description" content="Free Web tutorials">

It's not so easy to write one that matches that one day, and then

<meta 
name="description"

content="Free Web tutorials"
>

the next.

It's easy to imagine seeing various HTML output styles because the site's content people used different tools, along with some automation. A parser can handle it nicely.

Nokogiri get all HTML nodes

You could split each non-self-closing element's outer XML on its inner XML, which leaves the opening tag and, if there is one, the closing tag. Store each closing tag under the element's name so you can retrieve it later, then walk the document with the Nokogiri reader and build the list in document order.

This requires your document to be well-formed XML, because it uses the XML parser and not the HTML one.

require 'nokogiri'
[ "<html><body><h1>Header1</h1></body></html>",
"<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class=\"style\">d</span>olor</p></div></body></html>", <<END
<html>
  <body>
      <h1>Test</h1>
      <p>test <strong> Jojo </strong></p>
  </body>
</html>
END
].each { |string_page|
  elem_all = Array.new
  elem_ends = Hash.new
  reader = Nokogiri::XML::Reader(string_page)
  reader.each { |node|
    if node.node_type.eql?(1)
      if node.self_closing?
        elem_all << node.outer_xml
      else
        elem_tags = node.outer_xml.split(node.inner_xml)
        elem_all << elem_tags.first
        elem_ends[node.local_name] = elem_tags[1] unless elem_tags.one?
      end
    end
    elem_all << elem_ends[node.local_name] if node.node_type.eql?(15) and elem_ends.has_key?(node.local_name)
  }

  puts string_page
  puts elem_all.to_s
  puts
}

Outputs:

<html><body><h1>Header1</h1></body></html>
["<html>", "<body>", "<h1>", "</h1>", "</body>", "</html>"]

<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class="style">d</span>olor</p></div></body></html>
["<html>", "<body>", "<div>", "<h1>", "</h1>", "<hr/>", "</div>", "<div>", "<p>", "<br/>", "<span class=\"style\">", "</span>", "</p>", "</div>", "</body>", "</html>"]

<html>
  <body>
      <h1>Test</h1>
      <p>test <strong> Jojo </strong></p>
  </body>
</html>
["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]

XML generation using Nokogiri involving nested tags and namespace

You only need one builder:

env_ns = {
  "xmlns:env" => "http://abc.ca"
}

mm7_ns = {
  "xmlns:mm7" => "http://def.ca"
}

builder = Nokogiri::XML::Builder.new do |xml|
  xml['env'].Envelope(env_ns) do
    xml.Header do
      xml['mm7'].TransactionID(mm7_ns, "Some Text Here")
    end
  end
end

puts builder.to_xml

# will render the following:
# <?xml version="1.0"?>
# <env:Envelope xmlns:env="http://abc.ca">
#   <env:Header>
#     <mm7:TransactionID xmlns:mm7="http://def.ca">Some Text Here</mm7:TransactionID>
#   </env:Header>
# </env:Envelope>
