How to Use Watir to Scrape Data from a Website on a Linux Server Without Monitor

Can I use Watir to scrape data from a website on a Linux server without a monitor?

There are several ways to do this:

  1. Use HtmlUnit, through either Celerity or watir-webdriver (the latter via the remote Selenium2/WebDriver server).

  2. Use a real browser + a virtual X server (Xvfb). I'd recommend using watir-webdriver's Firefox driver and the Headless gem for a simple way to control this from Ruby.

This is basically a tradeoff between speed (HtmlUnit) and realism (a real browser). Personally, I'd go with #2 if the site has any complex JavaScript or invalid HTML, but both approaches are worth investigating.
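As a rough illustration of the two options, here is a minimal sketch. It assumes the `headless` and `watir-webdriver` gems, Xvfb, and Firefox are installed, and, for option 1, that a Selenium server is already listening at the assumed URL `http://localhost:4444/wd/hub`. The helper names `with_headless_firefox`, `remote_htmlunit_browser`, and `page_title` are made up for this example.

```ruby
# Sketch only. Assumes the `headless` and `watir-webdriver` gems, Xvfb and
# Firefox are installed. All method names here are hypothetical helpers.

# Option 2: yield a real Firefox running inside a virtual X display, and
# always close the browser and stop the display afterwards.
def with_headless_firefox
  require 'headless'         # loaded lazily so this file loads without the gems
  require 'watir-webdriver'

  headless = Headless.new    # uses the gem's default display number
  headless.start
  begin
    browser = Watir::Browser.new :firefox
    begin
      yield browser
    ensure
      browser.close
    end
  ensure
    headless.destroy
  end
end

# Option 1: HtmlUnit through a remote Selenium server.
# The server URL is an assumption; adjust it to your setup.
def remote_htmlunit_browser(server_url = 'http://localhost:4444/wd/hub')
  require 'watir-webdriver'

  caps = Selenium::WebDriver::Remote::Capabilities.htmlunit(:javascript_enabled => true)
  Watir::Browser.new(:remote, :url => server_url, :desired_capabilities => caps)
end

# Scraping step shared by both options. It only needs `goto` and `title`,
# so it can be exercised with a stand-in object as well.
def page_title(browser, url)
  browser.goto url
  browser.title
end
```

On the server, something like `puts with_headless_firefox { |b| page_title(b, 'http://example.com') }` would print the page title. Keeping the scraping step separate from the browser setup makes it easy to switch between the two options.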

For the future, I'm keeping an eye on this project, which looks like a terrific idea.

How do I run Firefox browser headless with my Ruby script?

I would look at using Watir-Webdriver instead of plain Watir or FireWatir, especially since the only way to work with newer versions of Firefox is going to be via Watir-Webdriver.

There's an earlier SO question where the answer covers just this sort of thing, so I'd suggest trying what is described there first: Can I use Watir to scrape data from a website on a Linux server without a monitor?

Also, since I now know you are using Mac OS, the advice in this thread from the WebDriver Google group might be more applicable to you.

Can any scripting language read AJAX/JavaScript? (Linux)

Check out TestPlan. It can run tests without a monitor by using the HtmlUnit backend. It handles quite a lot of JavaScript, including AJAX. I use it to scrape several pages and have built several AJAX tests with it.

You can also run TestPlan with a browser if you want. This gives you the best of both worlds: develop tests and visually see what is happening, and then switch to the display-less mode.

Watir-Webdriver EOFError and Errno::ECONNREFUSED

I had success giving each app its own Xvfb display:

On the server itself:

$ sudo /usr/bin/Xvfb :98 -screen 0 1280x1024x24 -ac &
$ sudo /usr/bin/Xvfb :99 -screen 0 1280x1024x24 -ac &

App 1, before the browser is created:

# ~/repo1/whatever.rb
require 'headless'
# ...
h = Headless.new(:display => 98)  # use the Xvfb display :98 started above
h.start
# ...

App 2, before the browser is created:

# ~/repo2/something.rb
require 'headless'
# ...
h = Headless.new(:display => 99)  # use the Xvfb display :99 started above
h.start
# ...
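
The per-app setup could also be wrapped in a small helper. This is only a sketch, assuming the `headless` gem is installed and Xvfb is already running on the chosen display. The `XVFB_DISPLAY` environment variable and both method names are invented for the example. `:reuse` and `:destroy_at_exit` are options of the Headless gem; here they ask it to attach to the existing display and to leave it running when the script exits.

```ruby
# Hypothetical: read the app's display number from XVFB_DISPLAY,
# falling back to 99 when the variable is not set.
def display_number(env = ENV, default = 99)
  Integer(env.fetch('XVFB_DISPLAY', default))
end

# Attach to an Xvfb display that was started outside this script.
def attach_to_display(number)
  require 'headless'   # loaded lazily so this file loads without the gem

  headless = Headless.new(:display => number, :reuse => true, :destroy_at_exit => false)
  headless.start
  headless
end
```

App 1 would then be launched with `XVFB_DISPLAY=98 ruby whatever.rb` and App 2 with `XVFB_DISPLAY=99 ruby something.rb`, each calling `attach_to_display(display_number)` before creating its browser.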

@chuck-van-der-linden is probably correct, though, that using VMs or similar is a better solution. If I were starting fresh with this architecture, that would be my approach.


