Using Nokogiri to Split Content on Br Tags

If your data really is that regular and you don't need the attributes from the <a> elements, then you could parse the text form of each table cell without having to worry about the <br> elements at all.

Given some HTML like this in html, parsed with doc = Nokogiri::HTML(html):

<table>
<tbody>
<tr>
<td class="j">
<a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
<a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
</td>
<td class="j">
<a title="title text1" href="http://link4.com">Link 4</a> (info1), Blah 2,<br>
<a title="title text2" href="http://link5.com">Link 5</a> (info1), Blah 2,<br>
<a title="title text2" href="http://link6.com">Link 6</a> (info2), Blah 2 Foo 2,<br>
</td>
</tr>
<tr>
<td class="j">
<a title="title text1" href="http://link7.com">Link 7</a> (info1), Blah 3,<br>
<a title="title text2" href="http://link8.com">Link 8</a> (info1), Blah 3,<br>
<a title="title text2" href="http://link9.com">Link 9</a> (info2), Blah 3 Foo 2,<br>
</td>
<td class="j">
<a title="title text1" href="http://linkA.com">Link A</a> (info1), Blah 4,<br>
<a title="title text2" href="http://linkB.com">Link B</a> (info1), Blah 4,<br>
<a title="title text2" href="http://linkC.com">Link C</a> (info2), Blah 4 Foo 2,<br>
</td>
</tr>
</tbody>
</table>

You could do this:

# The trailing map(&:strip) removes the newline that follows each <br> in the source.
chunks = doc.search('.j').map { |td| td.text.strip.scan(/[^,]+,[^,]+/).map(&:strip) }

and have this:

[
[ "Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2" ],
[ "Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2" ],
[ "Link 7 (info1), Blah 3", "Link 8 (info1), Blah 3", "Link 9 (info2), Blah 3 Foo 2" ],
[ "Link A (info1), Blah 4", "Link B (info1), Blah 4", "Link C (info2), Blah 4 Foo 2" ]
]

in chunks. Then you could convert that to whatever hash form you needed.
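
As one possible follow-up, each string in chunks can be turned into a hash with a regular expression. This is only a sketch: the key names :link, :info and :note are hypothetical labels, not anything from the original data, and it assumes every entry has the "text (info), rest" shape shown above.

```ruby
# Sample of the strings produced above, taken from the first table cell.
chunks = [
  ["Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2"]
]

# Hypothetical key names; assumes each entry looks like "text (info), rest".
entry_pattern = /\A(?<link>.+?)\s*\((?<info>[^)]*)\),\s*(?<note>.+)\z/

rows = chunks.map do |cell|
  cell.map do |entry|
    m = entry_pattern.match(entry)
    # Keep the raw string if an entry does not fit the assumed pattern.
    m ? { link: m[:link], info: m[:info], note: m[:note] } : { raw: entry }
  end
end

p rows.first.first
```

Entries that do not match are kept under :raw rather than dropped, so irregular rows remain visible.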

Split content in Nokogiri when tags are encountered

The text() node selector will select text nodes, which will give you each section of text in its own Node. You could then use map to get an array of strings:

document = Nokogiri::HTML(text)
# Note text() added to end of XPath here:
ad_description_nodes = document.xpath('.//div[contains(@id, "ad-description")]/text()')

strings = ad_description_nodes.map(&:content)

With your sample data, strings will now look like:

["\n\nRoom for rent in Sydney.\n", "For more information please contact us", "John :- 0491 570 156", "Jane :- (02) 5550 1234"]

As you can see you might get some extra leading or trailing whitespace, as well as possibly some nodes consisting solely of whitespace, so you’ll likely need some more processing. Also this would miss any text that isn’t a direct child of the div, e.g. if there is some text in strong or em tags. If that’s a possibility you could use //text() instead of /text().
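
The extra processing mentioned above could be as small as trimming each string and discarding the blank ones. A minimal sketch, using the sample strings from this answer plus one whitespace-only entry that is added here purely as an assumption to show the filtering:

```ruby
# Sample strings as returned above; the "  \n" entry is a made-up
# whitespace-only node added for illustration.
strings = [
  "\n\nRoom for rent in Sydney.\n",
  "For more information please contact us",
  "John :- 0491 570 156",
  "  \n",
  "Jane :- (02) 5550 1234"
]

# Trim each string, then drop the ones that were only whitespace.
cleaned = strings.map(&:strip).reject(&:empty?)

puts cleaned
```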

How to split a string with br in Ruby

It's relatively simple, using the split method on the string. For example:

 'this<br>that<br> the other'.split("<br>")

Will result in

 ["this", "that", " the other"]

Nokogiri & returning all data between two tags

How about:

require "nokogiri"

html = '<li class="textbox" data-tid="42.5" data-cid="42" data-sid="263" style="display: inline-block;"> <a> <div onclick="" class="item reb-itm-new re-itm263"></div> <span> <p class="item-title">Clear Rune</p> <p class="r-itemid">ItemID: 263</p> <p class="pickup">"Rune mimic"</p> <p class="quality">Quality: 2</p> <p>When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)</p> <p>Drops a random rune on the floor when picked up</p> <p>The recharge time of this item depends on the Rune/Soul Stone held:</p> <p>1 room: Soul of Lazarus</p> <p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p> <p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p> <p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p> <p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p> <p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p> <ul> <p>Type: Active</p> <p>Recharge time: Varies</p> <p>Item Pool: Secret Room, Crane Game</p> </ul> <p class="tags">* Secret Room</p> </span> </a> </li>'
puts Nokogiri::HTML(html).css(".quality ~ p:not(.tags)")[1..].map {|e| e.text}

The ~ combinator selects the sibling elements that follow .quality (not .quality itself), and the [1..] slice then drops the first of those matches. I'm assuming .tags is the only other class to omit after .quality; if there are other elements besides this, you'll need to :not them as well, or manually detect and skip them in an .each loop, unless someone knows a cleverer trick.

Extract text between br tags

Assuming that you already have a method to extract the example string you showed in your question, you can use split on the string:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split(/\s*<\s*br\s*>\s*/)
#=> ["domain1.com/index.html",
# "domain2.com/home/~john/index.html",
# "domain3.com/a/b/c/d/index.php"]

This will split the string at every <br> tag. It will also remove whitespace before and after the <br> and allow for whitespace inside the <br> tag, e.g. <br > or < br>. If you need to handle self-closing tags, too (e.g. <br />), use this regex instead:

/\s*<\s*br\s*\/?\s*>\s*/
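
To see the second regex at work, here is a small sketch on a made-up string that mixes several spellings of the tag. The input is hypothetical, and the /i flag is an addition not present in the regex above, included on the assumption that uppercase <BR> may also appear.

```ruby
# Hypothetical input mixing several <br> spellings.
str = "one<br>two<br />three< BR >four<br/>five"

# Same pattern as above, plus /i so that <BR> is matched as well.
br = /\s*<\s*br\s*\/?\s*>\s*/i

parts = str.split(br)

puts parts
```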

Nokogiri Xpath to retrieve text after BR within TD and SPAN

Here's a concise way:

name, nick, email, *addr = doc.search('//td/text()[preceding-sibling::br]')

puts name, nick, email, "--", addr

The XPath does exactly what you stated: it takes all text nodes following a br. The address is slurped into one variable, but you can get the components separately if you want.

Output:

FirstName LastName
NickName
First.Last@SomeCompany.com
--
FirstName LastName
Attn: FirstName
1234 Main St.
TheCity, TheState, 12345
United States

Retrieving text between br in Rails + Nokogiri

How about this:

html = '<b>Multiple Sclerosis National Research Institute</b><br> ...'
doc = Nokogiri::HTML(html)
doc.css('br')[2].next.text.strip
#=> "Conducts research towards understanding, treating and halting the progression of multiple sclerosis and related diseases. Current research progress is promising. Please help us find cures!"

And with the live content:

require 'open-uri'

url = "https://www.neighbortonation.org/ntn/charities/home.aspx"
# Kernel#open no longer opens URLs in Ruby 3.0+, so use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open(url))

doc.css("#site-pagecontent table table td").each do |item|
  description = item.css('br')[2].next.text.strip unless item.css('br').empty?
  ...
end

How to parse HTML using nokogiri if the required content doesn't have a class or id?

require "nokogiri"

Nokogiri::HTML.parse(<<_).css("body").children.first.text
<body>
text <br/>
<ul>
<li>some more text </li>
</body>
_
# => "\ntext "

Nokogiri::HTML.parse(<<_).css("body").children.first.text.strip
<body>
text <br/>
<ul>
<li>some more text </li>
</body>
_
# => "text"

Finding partial string within horrible HTML using Nokogiri

If your HTML is that simple, then this will work:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
EOT

doc.at('p').children.last # => #<Nokogiri::XML::Text:0x3ff1995c5b00 "\nthis I do care about\n">
doc.at('p').children.last.text # => "\nthis I do care about\n"

Parsing HTML and XML is really a matter of looking for landmarks that can be used to find what you want. In this case, <span> is OK as a landmark, but getting to the content from it isn't as easy as looking one level up, to the <p> tag: grab its children and select the last node in that list, which is the text node containing the text you want.

I wouldn't rely on the <span> tag because, if the HTML formatting changes, the number of nodes between <span> and your desired text could change. Intervening text nodes containing "\n" could be introduced for the formatting of the source, which would mess up a simple indexed lookup. To work around that, the code would have to ignore blank nodes and find the one that wasn't blank.

I am no regex hero...

And you shouldn't try to be with HTML or XML. They're too flexible and can confound regular expressions unless you're dealing with extremely trivial searches on very static HTML, which isn't very likely on the real internet unless you're scanning abandoned pages. Instead, learn and rely on decent HTML/XML parsers that can reduce a page into a DOM, making it easy to search and traverse the markup.


