HTML Is Read Before Fully Loaded Using Open-Uri and Nokogiri


What you describe is not possible. The result of open is only passed to Nokogiri::HTML after the open method has returned the full response body.

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case you can use Watir to fetch the page with a real browser:

require 'nokogiri'
require 'watir'

# Let a real browser load the page, JavaScript and all,
# then hand the rendered HTML to Nokogiri.
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

doc = Nokogiri::HTML.parse(browser.html)

This might open a browser window though.
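
If the window bothers you, the browser can usually be run headless. A minimal sketch, assuming Chrome and a recent Watir (the exact option syntax varies between Watir/Selenium versions):

require 'nokogiri'
require 'watir'

# Pass --headless through to Chrome so no window appears.
browser = Watir::Browser.new :chrome, options: { args: ['--headless'] }
browser.goto 'https://www.the-page-i-wanna-crawl.com'

doc = Nokogiri::HTML.parse(browser.html)
browser.close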

Rails won't read a link with nokogiri and open-uri

Try this:

def scrape
  @url = watched_link_params[:url]

  page = Nokogiri::HTML(open(@url))
  puts page
end

You will need to pass in the entire URL, including the protocol scheme; that is to say, you need to use http://www.google.com instead of www.google.com:

>> params = ActionController::Parameters.new(default: {url: 'http://www.google.com'})
>> watched_link_params = params.require(:default).permit(:url)
>> @url = watched_link_params[:url]
"http://www.google.com"
>> page = Nokogiri::HTML(open(@url))
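
If you can't guarantee that users will include the scheme, a small guard helps. A sketch using a hypothetical helper (normalize_url is my own name, not a Rails method):

require 'uri'

# Hypothetical helper: prepend "http://" when the scheme is missing.
def normalize_url(url)
  URI.parse(url).scheme ? url : "http://#{url}"
end

normalize_url('www.google.com')     #=> "http://www.google.com"
normalize_url('http://google.com')  #=> "http://google.com"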

Why doesn't Nokogiri load the full page?

Nokogiri does not retrieve the page; it asks OpenURI to do that, then reads from the StringIO object that OpenURI returns.

require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  body = stream.read
else
  body = Zlib::GzipReader.new(stream).read
end

p body

Here's what you can key off of:

>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []

In this case, if content_encoding is [] the body is plain text and is read directly; if it's ["gzip"] the body is decompressed first.

Doing all the stuff above and tossing it to:

require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')

should get you back on track.

Do this after all the above to confirm visually you're getting something usable:

p language_part.text.gsub("\t", '')

See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
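
If you fetch many pages, the check-and-gunzip dance is worth wrapping up. A minimal sketch of a reusable helper (fetch_body is my own name):

require 'open-uri'
require 'zlib'

# Read a URL, transparently gunzipping when the server sent gzip anyway.
def fetch_body(url)
  stream = open(url)
  if stream.content_encoding.include?('gzip')
    Zlib::GzipReader.new(stream).read
  else
    stream.read
  end
end

body = fetch_body('http://en.wikipedia.org/wiki/France')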

Trying to use open-uri in ruby, some HTML contents are coming in as Loading...

The problem

So, OpenURI just makes HTTP requests and gives you access to the body. In this case, the body is HTML, and that HTML has a placeholder for this data, which is what you're seeing. The HTML also loads some JavaScript that makes another request to the server to get the data, and when the data comes in, it replaces the placeholder with the real data. So, to handle this, you ultimately need whatever comes back from that request the JavaScript is making.

Three solutions

Ordered from my least favourite to my most favourite.

  1. You can try to evaluate the JavaScript to have it operate on the HTML. This is going to be painful, so I don't recommend it, but if you wanted to go down that path, there's a gem called therubyracer (IIRC, it wraps V8).
  2. You can launch a web browser, let the browser handle all the messy stuff, and then ask the browser for the HTML after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavy and you're relegated to information displayed in the HTML. This is called "scraping", and it's pretty fragile (a designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of little parsing things.
  3. You can open your browser's devtools, go to the network tab, filter to the XHR requests, and reload the page. One of these made the request to get the data that was used to fill in the placeholder. Figure out which one it is and then you can make that request yourself. There are ways this can be fragile, too, e.g. sometimes you have to have the right cookies, and you often have to experiment with what the browser sent to figure out how much of it you need (usually it's way less than was sent, which is true for your case). Protip: when you do this, separate requesting the data from parsing and exploring it, i.e. save it to a file and then, while looking through the data, read it from the file rather than making a new request every time; this way it won't change on you and you won't get rate limited.

Solution #3

So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably, check it out:

require 'uri'
require 'net/http'

# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)

# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data

# looks like you can pass it an awful lot of filters to use
req.set_form_data(
  "page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
  "distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
  "entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
  "minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
  "filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
  "displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
  "inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
  "daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
  "minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
  "minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
  "maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)

# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"

# parse the response
require 'json'
json = JSON.parse res.body

# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false

# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"

# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"

# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"

# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0

# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]

Getting all unique URLs using Nokogiri

Replace the Nokogiri::HTML part to select only those href attributes that match /.*informatics/, and then you can use uniq, since the result is already an array:

require 'nokogiri'
require 'open-uri'

ARGV[0] = 'https://www.nku.edu/academics/informatics.html'

ARGV.each do |arg|
  open(arg) do |f|
    puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"

    # Print whichever metadata methods this stream supports.
    %i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
      puts "#{method}: #{f.send(method)}" if f.respond_to? method
    end

    puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"

    anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
    puts anchors.map { |anchor| anchor['href'] }.uniq
  end
end
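
If all you need are the unique matching links themselves, a pared-down sketch of the same idea:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('https://www.nku.edu/academics/informatics.html'))
hrefs = doc.css('a').map { |a| a['href'] }.compact
puts hrefs.grep(/informatics/).uniq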


How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

Although the answer of Layon Ferreira already states the problem, it does not provide the steps needed to load the data.

As already said in the linked answer, the data is loaded asynchronously. This means that the data is not present on the initial page and is loaded later by JavaScript code executing in the browser.

Open up the browser development tools and go to the "Network" tab. Clear out all requests, then refresh the page, and you'll see a list of every request made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".

When browsing through the requests you'll find that the data you're looking for is located at:

https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json


Since this is JSON you don't need "nokogiri" to parse it.

require 'httparty'
require 'json'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)

When executing the above you'll get the exception:

JSON::ParserError ...

This is caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.

response.body[0]
#=> "\uFEFF"
format '%X', response.body[0].ord
#=> "FEFF"

To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.

require 'httparty'
require 'json'
require 'stringio'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...

If you're not yet using Ruby 2.7 you can strip the BOM yourself instead; however, the former is probably the safer option:

data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
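
Either way, once parsed you're working with plain Ruby arrays and hashes. For example, assuming the keys shown in the sample output above:

# Look up a single jurisdiction.
alabama = data.find { |row| row['Jurisdiction'] == 'Alabama' }
alabama['Cases Reported'] #=> 10145

# Count jurisdictions per reported range.
data.group_by { |row| row['Range'] }.transform_values(&:size)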

Web page scraped with Nokogiri returns no data

The link you posted contains no data. The page you see is a frameset, with each frame created by its own URL. You want to parse the left frame, so you should edit your code to open the URL of the left frame:

  doc = Nokogiri::HTML(open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))

The individual projects are on separate pages, and you need to open each one. For example the first one is:

project_file = open(entries.first.css('a').attribute('href').value)       
project_doc = Nokogiri::HTML(project_file)

The "setoutForm" class scrapes lots of text. For example:

> project_doc.css('.setoutForm').text
=> "\n \n Field Type\n Location\n Water Depth (m)\n First Production\n
   Contact\n \n \n Oil\n 2/15\n 155m\n Q3/2018\n \n John Gill\n
   Business Development Manager\n jgill@alphapetroleum.com\n 01483 307204\n
   \n \n \n \n Project Summary\n \n \n \n The Cheviot discovery is located
   in blocks 2/10a, 2/15a and 3/11b. \n \n Reserves are approximately
   46mmbbls oil.\n \n A Field Development Plan has been submitted and
   technically approved. The concept is for a leased FPSA with 18+ subsea
   wells. Oil export will be via tanker offloading.\n \n \n \n "

However the title is not in that text. If you want the title, scrape this part of the page:

<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>

Which you could do with this CSS selector:

> project_doc.css('.operator-container .field-header').text
=> "Cheviot"

Write this code step by step; it is hard to find out where your code goes wrong unless you single-step it. For example, I used Nokogiri's command-line tool to open an interactive Ruby shell with:

nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index
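
That drops you into an IRB session with the page already fetched and parsed (exposed as @doc in current versions of the tool), so you can try selectors interactively:

@doc.css('.setoutForm').text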

