How to Scrape a Website with the Socksify Gem (Proxy)

How to scrape a website with the socksify gem (proxy)

require 'socksify/http'
require 'nokogiri'
# addr/port point at your SOCKS proxy, e.g. a local Tor client on 127.0.0.1:9050
addr, port = '127.0.0.1', 9050
http = Net::HTTP::SOCKSProxy(addr, port)  # returns a proxy-aware Net::HTTP class
html = http.get(URI('http://google.de'))  # class-level get returns the response body
html_doc = Nokogiri::HTML(html)

Not able to access page data using anemone with the socksify gem and Tor

There are a number of problems that could be causing this. First, if NTP is not running on your machine and the clock is off by even a little, you will not be able to use the SOCKS server for anything complicated. This happened to me. You need to install ntp and make sure it has synced before doing anything.
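
As a rough sanity check, here is a minimal SNTP sketch that measures how far your clock has drifted; the helper name and the pool.ntp.org server are just illustrative, not from the original answer:

require 'socket'

# hypothetical helper: ask an NTP server for its time and compare it
# to the local clock (pool.ntp.org is an example server)
def clock_offset_seconds(host = 'pool.ntp.org')
  sock = UDPSocket.new
  sock.connect(host, 123)
  sock.send("\x1b" + "\0" * 47, 0)          # minimal SNTP v3 client request
  data, _addr = sock.recvfrom(48)
  ntp_seconds = data[40, 4].unpack1('N')    # transmit timestamp (seconds since 1900)
  server_time = Time.at(ntp_seconds - 2_208_988_800) # 1900-to-1970 epoch offset
  server_time - Time.now
end

puts format('clock is off by %.2f seconds', clock_offset_seconds)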

Second, you may find that a lot of these commands, like socksify, are obsolete. The best way I have found to make sure everything happens through the SOCKS port without DNS leakage is to use curl, which has bindings for many languages. You can carefully watch the traffic with tcpdump to make sure it isn't leaking; in my experience this setup is watertight.
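
For instance, through the curb gem (Ruby bindings for libcurl), a minimal sketch assuming a local Tor SOCKS proxy on 127.0.0.1:9050:

require 'curb'

c = Curl::Easy.new('https://check.torproject.org/')
# the socks5h:// scheme tells libcurl to resolve hostnames through the
# proxy itself, which is what prevents DNS leakage
c.proxy_url = 'socks5h://127.0.0.1:9050'
c.perform
puts c.body_str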

I'd also suggest that you look at torsocks, which has recently been updated by dgoulet on GitHub. This replaces tsocks, which the outdated socksify_ruby is based on.

Finally, hidden services have been under great strain lately, because a bot has decided to start up a few million Tor clients. Make sure you can connect with the Tor Browser Bundle, assuming the project you are working on is trying to crawl hidden services.

You didn't actually say that this project involves Tor or hidden services, but you did tag it with Tor.

How to make an XPath expression read through a part of the document only (Ruby/Nokogiri/XPath)

The answer is to simply add a . before //*[not(*)]:

product_data = product.xpath(".//*[not(*)]")

This tells the XPath expression to start at the current node rather than the root.
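
To see the difference, here is a minimal sketch with made-up markup (the product variable mirrors the question's naming):

require 'nokogiri'

doc = Nokogiri::HTML('<div class="product"><span>A</span></div>' \
                     '<div class="product"><span>B</span></div>')
product = doc.css('div.product').first

product.xpath('//*[not(*)]').map(&:text)   # searches the whole document => ["A", "B"]
product.xpath('.//*[not(*)]').map(&:text)  # searches below product only => ["A"]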

Mr. Novatchev's answer, while technically correct, would not result in the parsing code being idiomatic Ruby.

Ruby Net::HTTP - following 301 redirects

301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more often than you'd think; you just never notice them while browsing, because the browser follows them automatically for you.
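
For instance, fetching a bare domain often returns a 301 pointing at the canonical host (google.com is just an illustration; the exact Location value depends on the server):

require 'net/http'

r = Net::HTTP.get_response(URI('http://google.com/'))
r.code        # e.g. "301"
r['location'] # e.g. "http://www.google.com/"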

Two alternatives come to mind:

1: Use open-uri

open-uri handles redirects automatically. So all you'd need to do is:

require 'open-uri'
...
response = open('http://xyz...').read  # on Ruby 3+, use URI.open instead of open

If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:

Ruby open-uri redirect forbidden

2: Handle redirects with Net::HTTP

require 'net/http'
require 'uri'

def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    # follow the permanent redirect to its new location
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end

If you want to be even smarter, you could try adding or removing a trailing slash on the URL when you get a 404 response. You could do that by creating a method like get_response_smart, which handles this URL fiddling in addition to the redirects, as sketched below.
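
A hypothetical sketch of that get_response_smart idea; the redirect limit and the slash-toggling 404 retry are illustrative choices, and it follows any redirect class rather than just 301:

require 'net/http'
require 'uri'

def get_response_smart(uri, redirect_limit = 5)
  raise 'too many redirects' if redirect_limit.zero?
  r = Net::HTTP.get_response(uri)
  case r
  when Net::HTTPRedirection
    # keep following Location headers, up to the limit
    get_response_smart(URI.parse(r['location']), redirect_limit - 1)
  when Net::HTTPNotFound
    # on a 404, retry once with the trailing slash toggled
    alt = uri.dup
    alt.path = alt.path.end_with?('/') ? alt.path.chomp('/') : alt.path + '/'
    Net::HTTP.get_response(alt)
  else
    r
  end
end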


