Scrapy: Couldn't Bind: 24: Too Many Open Files

Errno 24: Too many open files. But I am not opening files?

"Files" include network sockets, which are a type of file on Unix-based systems. The maximum number of open files is configurable with ulimit -n

# Check current limit
$ ulimit -n
256

# Raise limit to 2048
$ ulimit -n 2048

It is not unusual to run out of file handles and need to raise the limit. But if the limit is already high, you may be leaking file handles by not closing them promptly. In garbage-collected languages like Python, the finalizer does not always close files quickly enough, which is why you should use with blocks (or another explicit mechanism) to close files and sockets as soon as you are done with them.
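For example, here is a minimal sketch using only the standard library (the URL is just a placeholder): relying on the garbage collector leaves the socket open until the response object happens to be finalized, while a with block closes it as soon as the block exits.

import urllib.request

# Relying on the garbage collector: the underlying socket stays open
# until the response object is eventually finalized.
resp = urllib.request.urlopen('https://example.com')
data = resp.read()

# Deterministic cleanup: the socket is closed as soon as the block exits.
with urllib.request.urlopen('https://example.com') as resp:
    data = resp.read()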

Crawling through multiple links on Scrapy

Initializing an HtmlResponse(url) doesn't accomplish anything, since the class doesn't make the request itself.

To add a request to Scrapy's scheduler, you need to yield one, e.g. yield scrapy.Request(url, callback=self.parse).

That being said, there are many improvements you can make to your spider.

  • Use Scrapy's built-in LinkExtractor instead of string splitting

  • Use CSS selectors instead of the hardcoded XPaths

  • Use selector.root.text instead of w3lib.remove_tags (to remove the dependency entirely)

Here is a working example:

import scrapy
from scrapy.linkextractors import LinkExtractor

class MainSpider(scrapy.Spider):
    name = 'links'
    allowed_domains = ['www.ylioppilastutkinto.fi']
    start_urls = ['https://www.ylioppilastutkinto.fi/ylioppilastutkinto/pisterajat/']

    def parse(self, response):
        le = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_xpaths='//*[@id="sidebar"]/div[1]/nav/ul/li[5]/div',
        )
        for link in le.extract_links(response):
            yield scrapy.Request(
                url=link.url,
                callback=self.parse_table,
                cb_kwargs={'date': link.text},
            )

    def parse_table(self, response, date):
        rows = response.css('#content table tbody tr')
        if not rows:
            print(f'No table found for url: {response.url}')
            return

        category = [char.root.text for char in rows[0].css('td strong')[1:]]
        if not category:
            category = [char.root.text for char in rows[0].css('td')[1:]]

        for row in rows[1:]:
            cols = row.css('td')
            title = cols[0].root.text
            nums = [col.root.text for col in cols[1:]]
            yield {
                'Date': date,
                'Category': category,
                title: nums,
            }

Note that your category parsing doesn't appear to work. I'm not exactly sure what you are trying to extract, so I'll leave that one for you.

Scrapy error twisted.web._newclient.RequestGenerationFailed

I was able to get it to work.

Steps to reproduce...

Create a new directory, set up a Python virtual environment in it, update pip, and install scrapy and pyinstaller into that virtual environment.

In the new directory, create the two Python scripts; mine are main.py and scrape.py.

main.py

import tkinter as tk
from tkinter import messagebox as tkms
from tkinter import ttk
import shlex
import os
import scrapy
from subprocess import Popen
import json

def get_path(name):
    return os.path.join(os.path.dirname(__file__), name).replace("\\", "/")

harvest = None

def watch():
    global harvest
    if harvest:
        if harvest.poll() is not None:
            # Update your progressbar to finished.
            progress_bar.stop()
            # If harvest finishes OK then show a confirmation message, otherwise show an error.
            if harvest.returncode == 0:
                mes = tkms.showinfo(title='progress', message='Scraping Done')
                if mes == 'ok':
                    root.destroy()
            else:
                tkms.showinfo(title='Error', message=f'harvest returncode == {harvest.returncode}')

            harvest = None

        else:
            # Indicate that the process is running.
            progress_bar.grid()
            progress_bar.start(10)
            root.after(100, watch)

def scrape():
    global harvest
    command_line = shlex.split('scrapy runspider ' + get_path('scrape.py'))
    with open('stdout.txt', 'wb') as out, open('stderr.txt', 'wb') as err:
        harvest = Popen(command_line, stdout=out, stderr=err)
    watch()

root = tk.Tk()
root.title("Title")

url = tk.StringVar(root)

entry1 = tk.Entry(root, width=90, textvariable=url)
entry1.grid(row=0, column=0, columnspan=3)

my_button = tk.Button(root, text="Process", command=scrape)
my_button.grid(row=2, column=2)

progress_bar = ttk.Progressbar(root, orient=tk.HORIZONTAL, length=300, mode='indeterminate')
progress_bar.grid(row=3, column=2)
progress_bar.grid_forget()

root.mainloop()

scrape.py

import scrapy
import os

class ImgSpider(scrapy.Spider):
    name = 'img'

    # allowed_domains = [user_domain]
    start_urls = ['https://www.bbc.com/news/in_pictures']  # I just used this for testing.

    def parse(self, response):
        title = response.css('img::attr(alt)').getall()
        links = response.css('img::attr(src)').getall()

        if not os.path.exists('./images'):
            os.makedirs('./images')
        with open('./images/urls.txt', 'w') as f:
            for i in title:
                f.write(i)
        # No explicit f.close() needed; the with block closes the file.
        yield {"title": title, "links": links}

Then run pyinstaller -F main.py, which will generate a main.spec file. Open that file and make the changes below.

main.spec

# -*- mode: python ; coding: utf-8 -*-

block_cipher = None
import os

scrape = "scrape.py"
imagesdir = "images"

a = Analysis(
    ['main.py'],
    pathex=[],
    binaries=[],
    datas=[(scrape, '.'), (imagesdir, '.')],  # add these lines
    hiddenimports=[],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher,
    noarchive=False,
)
pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)

exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    [],
    name='main',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    upx_exclude=[],
    runtime_tmpdir=None,
    console=True,  # Once you have confirmed it is working you can set this to False
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
)

Once that is all done, go back to your terminal and run pyinstaller main.spec, and Bob's your uncle.



Update

main.py -

I essentially just removed the shlex portion and made the path to scrape.py relative to the main.py file path.

import tkinter as tk
from tkinter import messagebox as tkms
from tkinter import ttk
from subprocess import Popen
import json
import os

def get_url():
    print('Getting URL...')
    data = url.get()
    if not os.path.exists('./data'):
        os.makedirs('./data')
    with open('./data/url.json', 'w') as f:
        json.dump(data, f)

harvest = None

def watch():
    global harvest
    print('watch started')
    if harvest:
        if harvest.poll() is not None:
            print('progress bar ends')
            # Update your progressbar to finished.
            progress_bar.stop()
            # If harvest finishes OK then show a confirmation message, otherwise show an error.
            if harvest.returncode == 0:
                mes = tkms.showinfo(title='progress', message='Scraping Done')
                if mes == 'ok':
                    root.destroy()
            else:
                tkms.showinfo(title='Error', message=f'harvest returncode == {harvest.returncode}')

            # Maybe report harvest.returncode?
            print(f'harvest return code after poll = {harvest.returncode}')
            print(f'harvest poll = {harvest.poll()}')
            harvest = None

        else:
            # Indicate that the process is running and re-schedule `watch` after 0.1 s.
            print('progress bar starts')
            progress_bar.grid()
            progress_bar.start(10)
            print(f'harvest return code = {harvest.returncode}')
            root.after(100, watch)

def scrape():
    global harvest
    scrapefile = os.path.join(os.path.dirname(__file__), 'scrape.py')
    # harvest = Popen(command_line)
    with open('stdout.txt', 'wb') as out, open('stderr.txt', 'wb') as err:
        # harvest = Popen('scrapy runspider ./scrape.py', stdout=out, stderr=err, shell=True)
        harvest = Popen(["python3", scrapefile], stdout=out, stderr=err)
        # No explicit close() calls needed; the with block closes both files.
    print('harvesting started')
    watch()

root = tk.Tk()
root.title("Title")

url = tk.StringVar(root)

entry1 = tk.Entry(root, width=90, textvariable=url)
entry1.grid(row=0, column=0, columnspan=3)

my_button = tk.Button(root, text="Process", command=lambda: [get_url(), scrape()])
my_button.grid(row=2, column=2)

progress_bar = ttk.Progressbar(root, orient=tk.HORIZONTAL, length=300, mode='indeterminate')
progress_bar.grid(row=3, column=2)
progress_bar.grid_forget()

root.mainloop()

main.spec

# -*- mode: python ; coding: utf-8 -*-
block_cipher = None
a = Analysis(['main.py'], pathex=[], binaries=[],
datas=[('scrape.py','.')], # <------- this is the only change that I made
hiddenimports=[], hookspath=[],
hooksconfig={}, runtime_hooks=[], excludes=[],
win_no_prefer_redirects=False, win_private_assemblies=False,
cipher=block_cipher, noarchive=False,)
pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)
exe = EXE(pyz, a.scripts, a.binaries, a.zipfiles, a.datas, [],
name='main', debug=False, bootloader_ignore_signals=False, strip=False,
upx=True, upx_exclude=[], runtime_tmpdir=None, console=False,
disable_windowed_traceback=False, argv_emulation=False, target_arch=None,
codesign_identity=None, entitlements_file=None,)

I made no changes to scrape.py.

Storing responses as files using Scrapy Splash

WRITING OUTPUT TO A JSON FILE:

I have tried to solve your problem. Here is the working version of your code. I hope this is what you are trying to achieve.

import json

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/page/" + str(i + 1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        page = response.url[response.url.index("page/") + 5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))

UPDATE:
The code above has been updated to scrape all pages and save the results in separate JSON files, from page 1 to page 10.

This will write the list of quotes from each page to a separate JSON file, like the following:

{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}

How to limit number of followed pages per site in Python Scrapy

I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] in parse_item (or maybe the key could be a tuple like (website, depth) in your case).

This is how I imagine it; it should work in theory. A rough sketch follows below.

FYI, you can extract the base URL and calculate the depth with the help of urlparse.urlparse (urllib.parse.urlparse in Python 3; see the docs).
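Here is a minimal sketch of that idea, counting pages per domain rather than per (website, depth) tuple; the spider name, start URL, and the max_pages_per_site cap are made up for illustration:

from collections import defaultdict
from urllib.parse import urlparse

import scrapy

class LimitedSpider(scrapy.Spider):
    # Hypothetical spider: the name, start URL and page cap are only examples.
    name = 'limited'
    start_urls = ['https://example.com/']
    max_pages_per_site = 20

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = defaultdict(int)  # pages crawled, keyed by domain

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.stats[domain] += 1
        if self.stats[domain] >= self.max_pages_per_site:
            self.logger.info('Page limit reached for %s; not following further links', domain)
            return
        # ... extract and yield items here ...
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)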

Scrapy - logging to file and stdout simultaneously, with spider names

You want to use the ScrapyFileLogObserver.

import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()
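The question also asks for logging to stdout at the same time. Assuming this older scrapy.log API (it was removed in later Scrapy versions), ScrapyFileLogObserver wraps Twisted's FileLogObserver and accepts any file-like object, so a second observer pointed at sys.stdout should mirror the log to the console as well:

import sys
import logging
from scrapy.log import ScrapyFileLogObserver

# One observer writes everything to a file ...
ScrapyFileLogObserver(open('testlog.log', 'w'), level=logging.DEBUG).start()
# ... and a second one mirrors INFO and above to the console.
ScrapyFileLogObserver(sys.stdout, level=logging.INFO).start()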

I'm glad you asked this question, I've been wanting to do this myself.


