How to avoid joining all text from Nodes when scraping

This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).

The NodeSet documentation says text will:

Get the inner text of all contained Node objects

Which is what we're seeing happen with:

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

because:

doc.search('p').class # => Nokogiri::XML::NodeSet

Instead, we want to get each Node and extract its text:

doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"

which can be done using map:

doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]

Ruby allows us to write that more concisely using:

doc.search('p').map(&:text) # => ["foo", "bar", "baz"]

The same thing applies whether we're working with HTML or XML, since Nokogiri parses HTML as a more relaxed form of XML.
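
For example, here's a minimal sketch using an XML document, which behaves the same way:

xml = Nokogiri::XML(<<EOT)
<root>
  <item>foo</item>
  <item>bar</item>
</root>
EOT

xml.search('item').text        # => "foobar" (NodeSet joins the text)
xml.search('item').map(&:text) # => ["foo", "bar"]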

A Node has several aliased methods for getting at its embedded text. From the documentation:

#content ⇒ Object

Also known as: text, inner_text

Returns the contents for this Node.
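
Using the document parsed above, all three aliases return the same value:

node = doc.at('p')
node.content    # => "foo"
node.text       # => "foo"
node.inner_text # => "foo"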

How to search a single node, not all nodes

You need to anchor the XPath queries on the elements:

  • node.xpath("//example") does a global search
  • node.xpath(".//example") does a local search starting at the current node

Notice the leading dot . which anchors the query at the current node. Otherwise the query is run against the root node, even if you call it from the current node.

If you are searching by tag name, consider using CSS selectors instead; they have fewer pitfalls than XPath, and in Nokogiri a CSS search is always scoped to the current node, as the sketch below shows.
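
Here's a minimal sketch of the difference, using a hypothetical two-div document:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="first"><p>one</p></div>
<div id="second"><p>two</p></div>
</body>
</html>
EOT

div = doc.at('#second')
div.xpath('//p').map(&:text)  # => ["one", "two"] (global: searches from the document root)
div.xpath('.//p').map(&:text) # => ["two"]        (local: anchored at the current div)
div.css('p').map(&:text)      # => ["two"]        (CSS: scoped to the node)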

Scrapy: Scraping all the text from a website but not the text of hyperlinks

You can check whether a node's parent or ancestor is a node you don't want.

For example:

This XPath finds all text nodes that are not direct children of <a> nodes:

//text()[not(parent::a)]

Alternatively you can use ancestor, which checks whether any of the ancestors are <a> nodes (that is, a parent, grandparent, great-grandparent, and so on):

//text()[not(ancestor::a)]
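
The XPath itself is engine-agnostic, so here's a minimal Nokogiri sketch showing the difference between the two predicates:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<p>before <a href="#">link <b>bold</b></a> after</p>
EOT

doc.xpath('//p//text()[not(parent::a)]').map(&:text)
# => ["before ", "bold", " after"] ("bold" slips through: its parent is <b>, not <a>)

doc.xpath('//p//text()[not(ancestor::a)]').map(&:text)
# => ["before ", " after"] (any <a> ancestor excludes it)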

Nokogiri scrape text methods alternative?

You're walking the rows, but not the contained cells. You need to do both to get the cells' values in a usable form:

require 'open-uri'
require 'nokogiri'

URL = 'http://espn.go.com/nba/player/_/id/4299/jeremy-lin'
doc = Nokogiri::HTML(URI.open(URL)) # URI.open, since Kernel#open no longer opens URLs in Ruby 3+

data = doc.css('tr[class^="oddrow team-46"]').map { |tr|
  tr.css('td').map(&:text)
}

data
# => [["Sat 11/16",
# "vsDEN",
# "W 122-111",
# "32",
# "6-11",
# ".545",
# "0-2",
# ".000",
# "4-6",
# ".667",
# "4",
# "7",
# "1",
# "1",
# "3",
# "1",
# "16"],
# ["Wed 11/13",
# "@ PHI",
# "L 117-123",
# "49",
# "10-19",
# ".526",
# "9-15",
# ".600",
# "5-6",
# ".833",
# "5",
# "12",
# "0",
# "0",
# "5",
# "8",
# "34"],
# ["Sat 11/9",
# "vsLAC",
# "L 94-107",
# "26",
# "3-7",
# ".429",
# "0-0",
# ".000",
# "0-0",
# ".000",
# "1",
# "7",
# "0",
# "1",
# "1",
# "5",
# "6"]]

Looking at the data differently, this outputs it as the rows:

data.each do |row|
  puts row.join(', ')
end
# >> Sat 11/16, vsDEN, W 122-111, 32, 6-11, .545, 0-2, .000, 4-6, .667, 4, 7, 1, 1, 3, 1, 16
# >> Wed 11/13, @ PHI, L 117-123, 49, 10-19, .526, 9-15, .600, 5-6, .833, 5, 12, 0, 0, 5, 8, 34
# >> Sat 11/9, vsLAC, L 94-107, 26, 3-7, .429, 0-0, .000, 0-0, .000, 1, 7, 0, 1, 1, 5, 6

A table is a simple structure you can build with two nested loops. To access each cell later you do the same thing: walk the rows in a loop and, inside that loop, walk the cells. That's all the code I wrote does.
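
As a minimal sketch, the same two nested loops, run in reverse, rebuild an HTML table from the array of arrays:

html = "<table>\n"
data.each do |row|        # outer loop: rows
  html << "  <tr>\n"
  row.each do |cell|      # inner loop: cells
    html << "    <td>#{cell}</td>\n"
  end
  html << "  </tr>\n"
end
html << '</table>'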

See "How to avoid joining all text from Nodes when scraping" also.

XPath Scrapy Join Text Nodes Separated by br tags in a class

To get the values of all the text nodes, you have to use //text() instead of /text():

sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()

Proven by scrapy shell:

>>> from scrapy import Selector
>>> html_doc = '''
... <html>
... <body>
... <div class="discussionpost">
... “This is paragraph one.”
... <br/>
... <br/>
... “This is paragraph two."'
... <br/>
... <br/>
... "This is paragraph three.”
... </div>
... </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = sentences
>>> txt
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = sentences.replace('\n','').replace("\'",'').replace('   ',' ').replace("“",'').replace('”','').replace('"','').strip()
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>

Update:

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                'comment': ''.join(p.xpath('.//text()').getall()).strip()
            }
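
The same //text() approach carries straight over to Nokogiri; here's a minimal sketch:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="discussionpost">
This is paragraph one.
<br/><br/>
This is paragraph two.
</div>
EOT

nodes = doc.xpath('//div[@class="discussionpost"]//text()')
nodes.map(&:text).join(' ').split.join(' ') # join, then collapse runs of whitespace
# => "This is paragraph one. This is paragraph two."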

Web page scraped with Nokogiri returns no data

The link you posted contains no data. The page you see is a frameset, with each frame loaded from its own URL. You want to parse the left frame, so edit your code to open that frame's URL:

  doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))

The individual projects are on separate pages, and you need to open each one. For example the first one is:

# `entries` is the NodeSet of project links found in the index page
project_file = URI.open(entries.first.css('a').attribute('href').value)
project_doc = Nokogiri::HTML(project_file)
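
To walk all of them, loop over the entries. This is only a sketch, assuming entries is the NodeSet of index rows gathered from the left frame:

entries.each do |entry|
  url = entry.css('a').attribute('href').value
  project_doc = Nokogiri::HTML(URI.open(url))
  # ...scrape project_doc here...
end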

The "setoutForm" class scrapes lots of text. For example:

> project_doc.css('.setoutForm').text
=> "\n \n Field Type\n Location\n Water Depth (m)\n First Production\n Contact\n \n \n Oil\n 2/15\n 155m\n Q3/2018\n \n John Gill\n Business Development Manager\n jgill@alphapetroleum.com\n 01483 307204\n \n \n \n \n Project Summary\n \n \n \n The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n \n Reserves are approximately 46mmbbls oil.\n \n A Field Development Plan has been submitted and technically approved. The concept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading. \n \n \n \n "

However the title is not in that text. If you want the title, scrape this part of the page:

<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>

Which you could do with this CSS selector:

> project_doc.css('.operator-container .field-header').text
=> "Cheviot"

Build this code up step by step; it's hard to find where your code goes wrong unless you can single-step it. For example, I used Nokogiri's command-line tool to open an interactive Ruby shell with:

nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index

