How to Ignore File Types in a Web Crawler

How do I ignore file types in a web crawler?

Use URI#path so the query string is ignored, then test the captured extension against your exclusion list:

require 'uri'
# $exclude is assumed to hold extension strings, e.g. %w[jpg zip exe]
unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
  puts "downloading #{url}..."
end
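For comparison, the same check is straightforward in Python; this is a minimal sketch, with a hypothetical EXCLUDE set standing in for whatever extensions you want to skip (parsing the URL first means query strings such as ?img=photo.jpg cannot trip up the test):

from urllib.parse import urlparse
import os

EXCLUDE = {".jpg", ".png", ".zip", ".exe"}  # hypothetical extensions to skip

def should_download(url):
    # test only the extension of the URL path, not the query string
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext not in EXCLUDE

print(should_download("http://example.com/index.html?img=photo.jpg"))  # True
print(should_download("http://example.com/photo.jpg"))                 # False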

How to skip Parent directories while scraping a File Type Website?

You can use os.path.normpath to normalize all the paths, so the "Parent Directory" (../) links don't produce duplicates:

import os
import urlparse  # Python 2 module; on Python 3 use urllib.parse

import scrapy
...  # spider class definition, the Item class and videoext are omitted in the original

def parse(self, response):
    for url in response.xpath('//a/@href').extract():
        url1 = response.url + url  # append the relative href to the listing URL

        # =======================
        # normalize the path so "Parent Directory" links ("../") do not
        # yield duplicate URLs for directories already visited
        url_parts = list(urlparse.urlparse(url1))
        url_parts[2] = os.path.normpath(url_parts[2])
        url1 = urlparse.urlunparse(url_parts)
        # =======================

        if url1[-4:] in videoext:  # videoext: 4-char extensions such as '.mp4', defined elsewhere
            item = Item()
            item['name'] = url
            item['url'] = url1
            item['depth'] = response.meta["depth"]
            yield item
        elif url1[-1] == '/':
            yield scrapy.Request(url1, callback=self.parse)
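To see what the normalization buys you: directory listings usually carry a "Parent Directory" link (../), so without it the same folder is reachable through many different URLs and gets requested again and again. A small standalone sketch of just the normpath step, using a made-up example.com URL (no Scrapy required):

import os
from urllib.parse import urlparse, urlunparse  # the module is named urlparse on Python 2

url = "http://example.com/videos/clips/../clips/movie.mp4"
parts = list(urlparse(url))
parts[2] = os.path.normpath(parts[2])  # '/videos/clips/../clips/movie.mp4' -> '/videos/clips/movie.mp4'
print(urlunparse(parts))               # http://example.com/videos/clips/movie.mp4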

Is there a way to let crawlers ignore parts of a document?

One way is to load those parts with JavaScript after the page has loaded, which makes them hard for crawlers to read. The only difference between serving that content statically and dynamically is the extension of the file you include:

var extension = "js"; // change to "php", for example, to load dynamic content

function loadJS(filename) {
    var js = document.createElement('script');
    js.setAttribute("type", "text/javascript");
    js.setAttribute("src", filename);
    document.getElementsByTagName("head")[0].appendChild(js);
}

window.onload = function() {
    loadJS("somecontenttoload." + extension); // injected after load, hard for crawlers to read
};

In somecontenttoload.js:

document.getElementById("content").innerHTML="This is static";

In somecontenttoload.php:

<?php
header("Content-Type: text/javascript");
// load data from the database
$bla = .....;
?>
document.getElementById("content").innerHTML = "<?php echo $bla; ?>";

How do I exclude everything but text/html from a Heritrix crawl?

The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have since been replaced by decide rules, and the configuration framework is very different). Still, the basic concept is the same.

The cxml file is basically a Spring configuration file. You need to configure the shouldProcessRule property on the ARCWriter bean to be a ContentTypeMatchesRegexDecideRule.

A possible ARCWriter configuration:

  <bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*">
</bean>
</property>
<!-- Other properties that need to be set ... -->
</bean>

This causes the processor to handle only the items that the DecideRule accepts, which in turn passes only those whose content type (MIME type) matches the provided regular expression.

Be careful with the 'decision' setting: are you ruling things in or out? (My example rules things in; anything not matching is ruled out.)

As shouldProcessRule is inherited from Processor, this can be applied to any processor.

More information about configuring Heritrix 3 can be found on the Heritrix 3 wiki (the user guide on crawler.archive.org covers Heritrix 1).

How to exclude script and style tags from text extracted by StormCrawler?

XPathFilter serves a different purpose, which is to extract metadata using XPath expressions. There is also ContentFilter, which is closer to what you need as it lets you restrict the scope of the extracted text to a set of XPath expressions; however, it does not give you a way of filtering out specific tags while keeping everything else.

Your best option at this stage is probably the Tika-based ParserBolt: it can be configured with an HTML mapper implementation, which defaults to the identity mapper but can be any other implementation provided by Tika or written by yourself; see the Tika documentation on HtmlMapper.

Feel free to open an issue on GitHub to request a new type of ParseFilter for excluding specific HTML elements, as this would be useful to have. There is a related issue for the googleon / googleoff tags, and that could be a way of implementing it.

EDIT: we have since released the TextExtractor; see the StormCrawler 1.13 release announcement.
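As a rough sketch, and assuming the configuration keys still match the default crawler-conf.yaml shipped with that release (check the announcement for the exact names), excluding script and style content would look something like:

textextractor.exclude.tags:
 - STYLE
 - SCRIPT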


