Errno 24: Too many open files. But I am not opening files?
"Files" include network sockets, which are a type of file on Unix-based systems. The maximum number of open files is configurable with ulimit -n:
# Check current limit
$ ulimit -n
256
# Raise limit to 2048
$ ulimit -n 2048
It is not unusual to run out of file handles and have to raise the limit. But if the limit is already high, you may be leaking file handles (not closing them quickly enough). In garbage-collected languages like Python, the finalizer does not always close files promptly, which is why you should be careful to use with blocks or other mechanisms to close files as soon as you are done with them.
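For example, a minimal sketch of the pattern (count_lines is a hypothetical helper): the handle is released the moment the with block exits, rather than whenever the garbage collector gets around to running the finalizer.

```python
def count_lines(path):
    # The file is guaranteed to be closed as soon as the block exits,
    # even if an exception is raised inside it.
    with open(path) as f:
        return sum(1 for _ in f)
```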
Crawling through multiple links on Scrapy
Initializing an HtmlResponse(url) doesn't accomplish anything, since the class doesn't make the request itself. To add a request to scrapy's scheduler, you need to yield one, e.g.: yield scrapy.Request(url, callback=self.parse).
That being said, there are many improvements you can make to your spider.
- use scrapy's builtin LinkExtractor instead of string splitting
- use css selectors instead of the hardcoded xpaths
- use selector.root.text instead of w3lib.remove_tags (to remove the dependency entirely)
Here is a working example:
import scrapy
from scrapy.linkextractors import LinkExtractor


class MainSpider(scrapy.Spider):
    name = 'links'
    allowed_domains = ['www.ylioppilastutkinto.fi']
    start_urls = ['https://www.ylioppilastutkinto.fi/ylioppilastutkinto/pisterajat/']

    def parse(self, response):
        le = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_xpaths='//*[@id="sidebar"]/div[1]/nav/ul/li[5]/div',
        )
        for link in le.extract_links(response):
            yield scrapy.Request(
                url=link.url,
                callback=self.parse_table,
                cb_kwargs={'date': link.text},
            )

    def parse_table(self, response, date):
        rows = response.css('#content table tbody tr')
        if not rows:
            print(f'No table found for url: {response.url}')
            return

        category = [char.root.text for char in rows[0].css('td strong')[1:]]
        if not category:
            category = [char.root.text for char in rows[0].css('td')[1:]]

        for row in rows[1:]:
            cols = row.css('td')
            title = cols[0].root.text
            nums = [col.root.text for col in cols[1:]]
            yield {
                'Date': date,
                'Category': category,
                title: nums,
            }
Note that your category parsing doesn't appear to work. I'm not exactly sure what you are trying to extract, so I'll leave that one for you.
Scrapy error twisted.web._newclient.RequestGenerationFailed
I was able to get it to work.
Steps to reproduce...
Open a new directory, start a new python virtual environment, update pip, and install scrapy and pyinstaller into the virtual environment.
In the new directory, create the two python scripts... mine are main.py and scrape.py.
main.py
import tkinter as tk
from tkinter import messagebox as tkms
from tkinter import ttk
import shlex
import os
from subprocess import Popen


def get_path(name):
    return os.path.join(os.path.dirname(__file__), name).replace("\\", "/")


harvest = None


def watch():
    global harvest
    if harvest:
        if harvest.poll() is not None:
            # Update your progressbar to finished.
            progress_bar.stop()
            # If harvest finishes OK, show a confirmation message; otherwise show an error.
            if harvest.returncode == 0:
                mes = tkms.showinfo(title='progress', message='Scraping Done')
                if mes == 'ok':
                    root.destroy()
            else:
                tkms.showinfo(title='Error', message=f'harvest returncode == {harvest.returncode}')
            harvest = None
        else:
            # Indicate that the process is running.
            progress_bar.grid()
            progress_bar.start(10)
            # Re-schedule watch to poll again in 0.1 s.
            root.after(100, watch)


def scrape():
    global harvest
    command_line = shlex.split('scrapy runspider ' + get_path('scrape.py'))
    with open('stdout.txt', 'wb') as out, open('stderr.txt', 'wb') as err:
        harvest = Popen(command_line, stdout=out, stderr=err)
    watch()


root = tk.Tk()
root.title("Title")

url = tk.StringVar(root)
entry1 = tk.Entry(root, width=90, textvariable=url)
entry1.grid(row=0, column=0, columnspan=3)

my_button = tk.Button(root, text="Process", command=scrape)
my_button.grid(row=2, column=2)

progress_bar = ttk.Progressbar(root, orient=tk.HORIZONTAL, length=300, mode='indeterminate')
progress_bar.grid(row=3, column=2)
progress_bar.grid_forget()

root.mainloop()
scrape.py
import scrapy
import os


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = [user_domain]
    start_urls = ['https://www.bbc.com/news/in_pictures']  # I just used this for testing.

    def parse(self, response):
        title = response.css('img::attr(alt)').getall()
        links = response.css('img::attr(src)').getall()
        if not os.path.exists('./images'):
            os.makedirs('./images')
        with open('./images/urls.txt', 'w') as f:
            for i in title:
                f.write(i)
        # No f.close() needed; the with block closes the file.
        yield {"title": title, "links": links}
Then run pyinstaller -F main.py, which will generate a main.spec file. Open that file and make these changes to it.
main.spec
# -*- mode: python ; coding: utf-8 -*-
block_cipher = None

import os
scrape = "scrape.py"
imagesdir = "images"

a = Analysis(
    ['main.py'],
    pathex=[],
    binaries=[],
    datas=[(scrape, '.'), (imagesdir, '.')],  # add these lines
    hiddenimports=[],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher,
    noarchive=False,
)
pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    [],
    name='main',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    upx_exclude=[],
    runtime_tmpdir=None,
    console=True,  # once you have confirmed it is working, you can set this to False
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
)
Then, once that is all done, go back to your terminal and run pyinstaller main.spec, and Bob's your uncle.
Update
main.py - I essentially just removed the shlex portion and made the path to scrape.py relative to the main.py file path.
import tkinter as tk
from tkinter import messagebox as tkms
from tkinter import ttk
from subprocess import Popen
import json
import os


def get_url():
    print('Getting URL...')
    data = url.get()
    if not os.path.exists('./data'):
        os.makedirs('./data')
    with open('./data/url.json', 'w') as f:
        json.dump(data, f)


harvest = None


def watch():
    global harvest
    print('watch started')
    if harvest:
        if harvest.poll() is not None:
            print('progress bar ends')
            # Update your progressbar to finished.
            progress_bar.stop()
            # If harvest finishes OK, show a confirmation message; otherwise show an error.
            if harvest.returncode == 0:
                mes = tkms.showinfo(title='progress', message='Scraping Done')
                if mes == 'ok':
                    root.destroy()
            else:
                tkms.showinfo(title='Error', message=f'harvest returncode == {harvest.returncode}')
            print(f'harvest returncode = {harvest.returncode}')
            harvest = None
        else:
            # Indicate that the process is running.
            print('progress bar starts')
            progress_bar.grid()
            progress_bar.start(10)
            # Re-schedule watch to be called again after 0.1 s.
            root.after(100, watch)


def scrape():
    global harvest
    scrapefile = os.path.join(os.path.dirname(__file__), 'scrape.py')
    with open('stdout.txt', 'wb') as out, open('stderr.txt', 'wb') as err:
        harvest = Popen(["python3", scrapefile], stdout=out, stderr=err)
    # No explicit close calls needed; the with block closes stdout.txt and stderr.txt.
    print('harvesting started')
    watch()


root = tk.Tk()
root.title("Title")

url = tk.StringVar(root)
entry1 = tk.Entry(root, width=90, textvariable=url)
entry1.grid(row=0, column=0, columnspan=3)

my_button = tk.Button(root, text="Process", command=lambda: [get_url(), scrape()])
my_button.grid(row=2, column=2)

progress_bar = ttk.Progressbar(root, orient=tk.HORIZONTAL, length=300, mode='indeterminate')
progress_bar.grid(row=3, column=2)
progress_bar.grid_forget()

root.mainloop()
main.spec
# -*- mode: python ; coding: utf-8 -*-
block_cipher = None
a = Analysis(['main.py'], pathex=[], binaries=[],
datas=[('scrape.py','.')], # <------- this is the only change that I made
hiddenimports=[], hookspath=[],
hooksconfig={}, runtime_hooks=[], excludes=[],
win_no_prefer_redirects=False, win_private_assemblies=False,
cipher=block_cipher, noarchive=False,)
pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)
exe = EXE(pyz, a.scripts, a.binaries, a.zipfiles, a.datas, [],
name='main', debug=False, bootloader_ignore_signals=False, strip=False,
upx=True, upx_exclude=[], runtime_tmpdir=None, console=False,
disable_windowed_traceback=False, argv_emulation=False, target_arch=None,
codesign_identity=None, entitlements_file=None,)
I made no changes to scrape.py.
Storing responses as files using Scrapy Splash
WRITING OUTPUT TO A JSON FILE:
I have tried to solve your problem. Here is the working version of your code. I hope this is what you are trying to achieve.
import json
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/page/" + str(i + 1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        page = response.url[response.url.index("page/") + 5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))
UPDATE:
The code above has been updated to scrape all pages and save the results in separate JSON files, from page 1 to page 10.
It will write the list of quotes from each page to a separate JSON file, as follows:
{
    "quotes": [
        {
            "author": "Albert Einstein",
            "quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author": "J.K. Rowling",
            "quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author": "Albert Einstein",
            "quote": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author": "Jane Austen",
            "quote": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author": "Marilyn Monroe",
            "quote": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author": "Albert Einstein",
            "quote": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author": "Andr\u00e9 Gide",
            "quote": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author": "Thomas A. Edison",
            "quote": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author": "Eleanor Roosevelt",
            "quote": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author": "Steve Martin",
            "quote": "\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
How to limit number of followed pages per site in Python Scrapy
I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] in parse_item (or maybe the key could be a tuple like (website, depth) in your case).
This is how I imagine it - it should work in theory. Let me know if you need an example.
FYI, you can extract the base url and calculate the depth with the help of urlparse.urlparse (see the docs).
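A minimal sketch of the counting logic (MAX_PAGES_PER_SITE and should_follow are hypothetical names; in a spider you would call something like this from parse_item before yielding further requests):

```python
from collections import defaultdict
from urllib.parse import urlparse  # urlparse.urlparse on Python 2

MAX_PAGES_PER_SITE = 3  # assumed per-site limit

stats = defaultdict(int)  # base url -> number of pages crawled so far

def should_follow(url):
    # Count one page for this site and refuse once the limit is reached.
    site = urlparse(url).netloc
    stats[site] += 1
    return stats[site] <= MAX_PAGES_PER_SITE
```

In a real spider, stats would live on the spider class and the check would gate the yield of new Requests.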
Scrapy - logging to file and stdout simultaneously, with spider names
You want to use the ScrapyFileLogObserver.

import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()
I'm glad you asked this question, I've been wanting to do this myself.
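Note that scrapy.log (and ScrapyFileLogObserver with it) was removed in later Scrapy releases. On a recent version, one way to get a similar effect with the standard logging module is to attach a file handler whose format string includes %(name)s - Scrapy spiders log through loggers named after the spider, so the spider name ends up in each line. A minimal sketch:

```python
import logging

# Attach a file handler to the root logger; including %(name)s in the
# format puts the logger (spider) name into every log line.
handler = logging.FileHandler('testlog.log', mode='w')
handler.setFormatter(logging.Formatter(
    '%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.DEBUG)
```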