How to Programmatically Take Snapshot of Crawled Webpages (In Ruby)

How to Programmatically take Snapshot of Crawled Webpages (in Ruby)?

This really depends on your operating system. What you need is a way to hook into a web browser and save its rendered output as an image.

If you are on a Mac, your best bet is probably MacRuby (or RubyCocoa, although I believe that is going to be deprecated in the near future), using the WebKit framework to load the page and render it as an image.

This is definitely possible; for inspiration you may wish to look at the Paparazzi! and webkit2png projects.
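
As a rough, untested sketch of that approach, modelled on what webkit2png does, with standard Cocoa/WebKit class and selector names bridged through MacRuby (the URL, window size, and output file are placeholders; treat this as a starting point, not a drop-in solution):

# Rough, untested sketch modelled on webkit2png; adjust for your MacRuby/OS X version.
framework 'Cocoa'
framework 'WebKit'

NSApplication.sharedApplication

# An off-screen borderless window to host the WebView so it actually renders.
frame   = [0, 0, 1024, 768]
webview = WebView.alloc.initWithFrame(frame, frameName: nil, groupName: nil)
window  = NSWindow.alloc.initWithContentRect(frame,
                                             styleMask: NSBorderlessWindowMask,
                                             backing: NSBackingStoreBuffered,
                                             defer: false)
window.setContentView(webview)
window.orderFront(nil)

webview.mainFrame.loadRequest(
  NSURLRequest.requestWithURL(NSURL.URLWithString('http://example.com/')))

# Pump the run loop until WebKit has finished loading the page.
while webview.isLoading
  NSRunLoop.currentRunLoop.runMode(NSDefaultRunLoopMode,
                                   beforeDate: NSDate.dateWithTimeIntervalSinceNow(0.1))
end

# Render the document view into a bitmap and write it out as a PNG.
docview = webview.mainFrame.frameView.documentView
docview.window.display
rep = docview.bitmapImageRepForCachingDisplayInRect(docview.bounds)
docview.cacheDisplayInRect(docview.bounds, toBitmapImageRep: rep)
png = rep.representationUsingType(NSPNGFileType, properties: nil)
png.writeToFile('snapshot.png', atomically: true)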

Another option, which isn't dependent on the OS, might be to use the BrowserShots API.

Programmatically take ScreenShot of Desktop in Ruby?

This is an awesome question!

A couple of years ago I had to work on a similar project, and I found a library called Watir that you can use to control system browsers from Ruby.
When I last checked, it wasn't very reliable in a Linux environment, but it now seems to be pretty solid.

Here are a couple of links:

  • http://90kts.com/blog/2008/capturing-screenshots-in-watir/
  • http://www.marekj.com/wp/2008/04/desktop-screenshots-with-watir-win32screenshot-and-rmagick/
  • http://clearspace.openqa.org/thread/13949
  • http://wiki.openqa.org/display/WTR/FAQ#FAQ-HowdoItakescreenshotsandappendtoaWordfile%3F

I have never tried this solution myself, so I would be really happy if you could post some feedback here if you decide to go with Watir. All the examples target a Windows server; I didn't find a solid tutorial for a Linux + Firefox environment.
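
For what it's worth, a recent Watir drives real browsers through Selenium WebDriver, so it is no longer Windows-only and has a built-in screenshot call. A minimal sketch, assuming the watir gem and Firefox are installed:

require 'watir'

# Sketch: assumes a recent watir gem and a local Firefox (or swap in :chrome).
browser = Watir::Browser.new :firefox
browser.goto 'http://example.com/'
browser.screenshot.save 'example.png'   # full-window PNG screenshot
browser.close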

Take a snapshot of an HTML page using Ruby on Rails

You can use IMGKit to generate the image and then ImageMagick (or something else) to resize it.
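
A minimal sketch, assuming the imgkit gem (which shells out to wkhtmltoimage) and MiniMagick for the resize step; swap in RMagick or plain ImageMagick if you prefer:

require 'imgkit'
require 'mini_magick'

# Assumes the imgkit and mini_magick gems plus a wkhtmltoimage binary are installed.
# Render the page to a PNG file.
kit = IMGKit.new('http://example.com/', quality: 90)
kit.to_file('snapshot.png')

# Resize the result into a thumbnail.
thumb = MiniMagick::Image.open('snapshot.png')
thumb.resize '320x240'
thumb.write 'snapshot_thumb.png'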

Take snapshot of video

You have the wrong path to ffmpeg.exe. Resolve it relative to the application root with Server.MapPath:

ffmpeg.StartInfo.FileName = HttpContext.Current.Server.MapPath("~/ffmpeg.exe");
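
For completeness, the equivalent frame grab from Ruby is just a shell-out to ffmpeg; a sketch, assuming ffmpeg is on the PATH and the file names are placeholders:

# Grab a single frame five seconds into the video and save it as a PNG
# (assumes ffmpeg is on the PATH; input.mp4 and thumb.png are placeholders).
system('ffmpeg', '-ss', '00:00:05', '-i', 'input.mp4',
       '-frames:v', '1', '-y', 'thumb.png')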

Scraping pages that do not seem to have URLs

The page you refer to appears to be generated by an Oracle product, so one would think they'd be willing to construct a web form properly (and with reference to accessibility concerns). They haven't, so it occurs to me that either their engineer was having a bad day, or they are deliberately making it (slightly) harder to scrape.

The reason your browser shows no href when you hover over those links is that there isn't one. What the page does instead is to use JavaScript to capture the click event, populate a POST form with some hidden values, and call the submit method programmatically. This can cause problems with screen-readers and other accessibility devices, as well as causing problems with the way in which back buttons have to re-submit the page.

The good news is that constructions of this kind can usually be scraped by creating a form yourself, either using a real one on a third party page, or via a crawler library. If you post the right values to the target URI, reverse-engineered from examining the page's script, the resulting document should be the "linked" page you expect.
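
In Ruby, the simplest version of that is a plain form POST with Net::HTTP; the URL and field names below are placeholders you would reverse-engineer from the page's JavaScript and hidden inputs:

require 'net/http'
require 'uri'

# Hypothetical target and parameters; inspect the real form to find them.
uri    = URI('http://example.com/pls/apex/target_page')
params = { 'p_flow_id' => '123', 'p_instance' => '456', 'p_arg_name' => 'value' }

response = Net::HTTP.post_form(uri, params)
puts response.body   # should be the "linked" page you were after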

How can I make and query read only snapshots in Postgres (or MySql)?

The usual way to create a snapshot in PostgreSQL is to use pg_dump/pg_restore.

A much quicker method is to simply use CREATE DATABASE to clone your database.

CREATE DATABASE my_copy_db TEMPLATE my_production_db;

This is much faster than a dump and restore. The only drawback is that the source database must not have any open connections while the copy is created.

The copy will not be read-only by default, but you can simply revoke the write privileges (INSERT, UPDATE, DELETE, and so on) from the relevant roles to ensure that it cannot be modified.
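
A minimal sketch from Ruby with the pg gem (the database and role names are placeholders; the REVOKE statement is ordinary SQL you could just as well run in psql):

require 'pg'

# Clone the production database; the source must have no open connections.
admin = PG.connect(dbname: 'postgres')
admin.exec('CREATE DATABASE my_copy_db TEMPLATE my_production_db')
admin.close

# Strip write privileges from the application role so the copy is read-only.
copy = PG.connect(dbname: 'my_copy_db')
copy.exec('REVOKE INSERT, UPDATE, DELETE, TRUNCATE ON ALL TABLES IN SCHEMA public FROM app_user')
copy.close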
