How to scrape a website with the socksify gem (proxy)
require 'socksify/http'
require 'nokogiri'

addr, port = '127.0.0.1', 9050  # e.g. Tor's default local SOCKS port
http = Net::HTTP::SOCKSProxy(addr, port)  # returns a proxy-aware Net::HTTP subclass
html = http.get(URI('http://google.de')) # class-level get returns the response body
html_doc = Nokogiri::HTML(html)
Not able to access page data, using anemone with socksify gem and Tor
There are a number of problems that could be causing this. First, if NTP is not running on your machine and the clock is off by even a little, you will not be able to use the SOCKS server for anything complicated. This happened to me. Install ntp and make sure it has synced before doing anything else.
Second, you may find that many of these tools, socksify included, are obsolete. The most reliable way I have found to route everything through the SOCKS port without DNS leakage is curl, which has bindings for many languages; its --socks5-hostname option resolves hostnames through the proxy itself rather than locally. You can watch the traffic with tcpdump to make sure nothing is leaking, and in my experience it is watertight.
I'd also suggest that you look at torsocks, which has recently been updated by dgoulet on GitHub. It replaces tsocks, which the outdated socksify_ruby is based on.
Finally, hidden services have been under great strain lately, because a bot has decided to start up a few million Tor clients. Make sure you can connect with the Tor Browser Bundle, assuming the project you are working on is trying to crawl hidden services.
You didn't actually say that this project involves Tor or hidden services, but you did tag it with Tor.
How to make an xpath expression read through a part of the document only (Ruby/Nokogiri/xpath)
The answer is simply to add a . before //*[not(*)]:
product_data = product.xpath(".//*[not(*)]")
This tells the XPath expression to start at the current node rather than at the document root. Mr. Novatchev's answer, while technically correct, would not result in idiomatic Ruby parsing code.
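To see the scoping difference concretely, here is a minimal sketch. It uses Ruby's standard-library REXML so it runs without any gems (Nokogiri's XPath scoping behaves the same way), and the document and element names are made up for illustration:

```ruby
require 'rexml/document'

xml = <<~XML
  <catalog>
    <product><name>Widget</name><price>9.99</price></product>
    <product><name>Gadget</name><price>19.99</price></product>
  </catalog>
XML

doc = REXML::Document.new(xml)
product = REXML::XPath.first(doc, '//product') # the first <product> node

# "//*[not(*)]" is absolute: it searches the whole document even when
# evaluated against `product`, so it finds the leaf elements of BOTH products.
absolute = REXML::XPath.match(product, '//*[not(*)]').map(&:text)

# ".//*[not(*)]" is relative: it starts at `product`, so it only finds
# the leaf elements inside the first product.
relative = REXML::XPath.match(product, './/*[not(*)]').map(&:text)
```

With Nokogiri, product.xpath(".//*[not(*)]") scopes the search the same way.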
Ruby Net::HTTP - following 301 redirects
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think; you just don't normally notice them while browsing, because the browser follows them automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = URI.open('http://xyz...').read  # plain open() also worked before Ruby 3.0
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution: Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end
If you want to be even smarter, you could try adding or removing a trailing slash from the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.
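A sketch of a smarter variant that follows a whole chain of redirects with a safety limit. The method name get_response_with_redirects and the injectable fetch lambda are inventions for this illustration (the lambda exists only so the logic can be exercised without a network), not part of any library API:

```ruby
require 'net/http'
require 'uri'

# Follow a chain of redirects (301/302/303/307/308) up to `limit` hops and
# return the final response. `fetch` defaults to a real HTTP request but can
# be swapped out for testing.
def get_response_with_redirects(uri, limit: 5, fetch: ->(u) { Net::HTTP.get_response(u) })
  raise ArgumentError, 'too many redirects' if limit <= 0

  response = fetch.call(uri)
  if response.is_a?(Net::HTTPRedirection)
    # URI.join resolves relative Location headers against the current URL.
    next_uri = URI.join(uri.to_s, response['location'])
    get_response_with_redirects(next_uri, limit: limit - 1, fetch: fetch)
  else
    response
  end
end

# Usage against a real server:
#   get_response_with_redirects(URI('http://google.de/'))
```

Checking for Net::HTTPRedirection rather than the literal code "301" also covers temporary (302/307) and permanent (308) redirects.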