How to avoid joining all text from Nodes when scraping
This is an easily solved problem that results from not reading the documentation about how text
behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
"Get the inner text of all contained Node objects"
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same things apply whether we're working with HTML or XML, as HTML is a more relaxed version of XML.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.
How to search a single node, not all nodes
You need to anchor the XPath queries on the elements:
node.xpath("//example") does a global search.
node.xpath(".//example") does a local search starting at the current node.
Notice the leading dot (.), which anchors the query at the current node. Otherwise the query is run against the root node, even if you call it from the current node.
If you are searching by tag name, consider using CSS selectors instead; they have fewer pitfalls than XPath, and in Nokogiri CSS always searches from the current node.
Scrapy: Scraping all the text from a website but not the text of hyperlinks
You can check whether a node's parent or an ancestor is a node you don't want.
For example:
This XPath will find all text nodes that are not direct children of <a> nodes:
//text()[not(parent::a)]
Alternatively, you can use ancestor, which checks whether any of the ancestors is an <a> node (a parent, grandparent, great-grandparent, and so on):
//text()[not(ancestor::a)]
Nokogiri scrape text methods alternative?
You're walking the rows, but not the contained cells. You need to do both to get the cells' values in a usable form:
require 'open-uri'
require 'nokogiri'
URL = 'http://espn.go.com/nba/player/_/id/4299/jeremy-lin'
doc = Nokogiri::HTML(URI.open(URL))
data = doc.css('tr[class^="oddrow team-46"]').map { |tr|
  tr.css('td').map(&:text)
}
data
# => [["Sat 11/16",
# "vsDEN",
# "W 122-111",
# "32",
# "6-11",
# ".545",
# "0-2",
# ".000",
# "4-6",
# ".667",
# "4",
# "7",
# "1",
# "1",
# "3",
# "1",
# "16"],
# ["Wed 11/13",
# "@ PHI",
# "L 117-123",
# "49",
# "10-19",
# ".526",
# "9-15",
# ".600",
# "5-6",
# ".833",
# "5",
# "12",
# "0",
# "0",
# "5",
# "8",
# "34"],
# ["Sat 11/9",
# "vsLAC",
# "L 94-107",
# "26",
# "3-7",
# ".429",
# "0-0",
# ".000",
# "0-0",
# ".000",
# "1",
# "7",
# "0",
# "1",
# "1",
# "5",
# "6"]]
Looking at the data differently, this outputs it as the rows:
data.each do |row|
  puts row.join(', ')
end
# >> Sat 11/16, vsDEN, W 122-111, 32, 6-11, .545, 0-2, .000, 4-6, .667, 4, 7, 1, 1, 3, 1, 16
# >> Wed 11/13, @ PHI, L 117-123, 49, 10-19, .526, 9-15, .600, 5-6, .833, 5, 12, 0, 0, 5, 8, 34
# >> Sat 11/9, vsLAC, L 94-107, 26, 3-7, .429, 0-0, .000, 0-0, .000, 1, 7, 0, 1, 1, 5, 6
A table is really simple and is something you can create using two nested loops. To later access each cell you need to do the same, walk the rows in a loop, and, inside that loop, walk the cells. That's all the code I wrote does.
See also "How to avoid joining all text from Nodes when scraping" above.
XPath Scrapy Join Text Nodes Separated by br tags in a class
To get all the text node values, you have to use //text() instead of /text():
sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()
Demonstrated in the scrapy shell:
>>> from scrapy import Selector
>>> html_doc = '''
... <html>
... <body>
... <div class="discussionpost">
... “This is paragraph one.”
... <br/>
... <br/>
... “This is paragraph two."'
... <br/>
... <br/>
... "This is paragraph three.”
... </div>
... </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = ' '.join(sentences.replace('“', '').replace('”', '').replace('"', '').replace("'", '').split())
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
Update:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                'comment': ''.join(p.xpath(".//text()").getall()).strip()
            }
Web page scraped with Nokogiri returns no data
The link you posted contains no data. The page you see is a frameset, with each frame created by its own URL. You want to parse the left frame, so you should edit your code to open the URL of the left frame:
doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))
The individual projects are on separate pages, and you need to open each one. For example the first one is:
project_file = open(entries.first.css('a').attribute('href').value)
project_doc = Nokogiri::HTML(project_file)
Selecting by the "setoutForm" class scrapes a lot of text. For example:
> project_doc.css('.setoutForm').text
=> "\n \n Field Type\n Location\n Water Depth (m)\n First Production\n Contact\n \n \n
Oil\n 2/15\n 155m\n Q3/2018\n \n John Gill\n Business Development Manager\n
jgill@alphapetroleum.com\n 01483 307204\n \n \n \n \n Project Summary\n \n \n \n
The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n \n
Reserves are approximately 46mmbbls oil.\n \n A Field Development Plan has been
submitted and technically approved. The concept is for a leased FPSA with 18+
subsea wells. Oil export will be via tanker offloading.\n \n \n \n "
However the title is not in that text. If you want the title, scrape this part of the page:
<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>
Which you could do with this CSS selector:
> project_doc.css('.operator-container .field-header').text
=> "Cheviot"
Write this code step by step; it is hard to find where your code goes wrong unless you can inspect each step. For example, I used Nokogiri's command-line tool to open an interactive Ruby shell:
nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index