Reading file opened with Python Paramiko SFTPClient.open method is slow
Calling SFTPFile.prefetch should increase the read speed:
ncfile = sftp_client.open('mynetCDFfile')
ncfile.prefetch()
b_ncfile = ncfile.read()
Another option is enabling read buffering, using the bufsize parameter of SFTPClient.open:
ncfile = sftp_client.open('mynetCDFfile', bufsize=32768)
b_ncfile = ncfile.read()
(32768 is the value of SFTPFile.MAX_REQUEST_SIZE)
Similarly for writes/uploads:
Writing to a file on SFTP server opened using Paramiko/pysftp "open" method is slow.
Yet another option is to explicitly specify the amount of data to read (it makes BufferedFile.read take a more efficient code path):
ncfile = sftp_client.open('mynetCDFfile')
b_ncfile = ncfile.read(ncfile.stat().st_size)
If none of that works, you can download the whole file to memory instead:
Use pdfplumber and Paramiko to read a PDF file from an SFTP server
Obligatory warning: Do not use AutoAddPolicy this way, as you lose protection against MITM attacks by doing so. For a correct solution, see Paramiko "Unknown Server".
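A minimal sketch of connecting with host-key verification instead of AutoAddPolicy; the host name and credentials are placeholders, not values from the answers above:

```python
import paramiko

ssh_client = paramiko.SSHClient()
# Load keys from ~/.ssh/known_hosts (an explicit path can also be given)
ssh_client.load_system_host_keys()
# Reject hosts whose key is not already known, instead of blindly trusting them
ssh_client.set_missing_host_key_policy(paramiko.RejectPolicy())
ssh_client.connect('example.com', username='user', password='password')
```

With RejectPolicy, a connection to a server whose key has changed (a possible MITM) fails instead of silently succeeding.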
Open a remote file using paramiko in python slow
Your problem is likely to be caused by the file being a remote object. You've opened it on the server and are requesting one line at a time; because it's not local, each request takes much longer than if the file were sitting on your hard drive. The best alternative is probably to copy the file down to a local location first, using Paramiko's SFTP get.
Once you've done that, you can open the local copy using Python's built-in open.
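The two steps above can be sketched as a helper; the function name is my own, and sftp_client is assumed to be an open Paramiko SFTPClient:

```python
import os
import tempfile

def read_remote_lines(sftp_client, remote_path):
    """Copy a remote file to a local temp directory, then read it locally."""
    local_path = os.path.join(tempfile.mkdtemp(),
                              os.path.basename(remote_path))
    sftp_client.get(remote_path, local_path)  # one bulk transfer
    with open(local_path) as f:               # fast local line-by-line access
        return f.readlines()
```

The single bulk get avoids the per-line round trips that make remote iteration slow.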
Reading large Parquet file from SFTP with Pyspark is slow
By adding the buffer_size parameter in the pyarrow.parquet library, the computational time went from 51 to 21 minutes :)
df = pq.read_table(SERVER_LOCATION + "/FILE.parquet", filesystem=fs, buffer_size=32768)
Thanks @Martin Prikryl for your help ;)
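The answer does not show how the fs filesystem object was constructed. One possibility (an assumption on my part, not stated in the answer) is fsspec's SFTP filesystem, which wraps Paramiko; the host and credentials below are placeholders:

```python
import fsspec
import pyarrow.parquet as pq

# fsspec's "sftp" filesystem uses Paramiko under the hood
fs = fsspec.filesystem("sftp", host="example.com",
                       username="user", password="password")

# buffer_size makes pyarrow issue fewer, larger reads over the slow link
table = pq.read_table("/remote/path/FILE.parquet",
                      filesystem=fs, buffer_size=32768)
df = table.to_pandas()
```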
Use pdfplumber and Paramiko to read a PDF file from an SFTP server
Paramiko SFTPClient.open returns a file-like object.
To use a file-like object with pdfplumber, it seems that you can use the load function:
pdf = pdfplumber.load(fl)
You will also want to read this:
Reading file opened with Python Paramiko SFTPClient.open method is slow
As the Paramiko file-like object seems to work suboptimally when combined with the pdfplumber.load function, as a workaround, you can download the file to memory instead:
flo = BytesIO()
sftp.getfo(fullpath, flo)
flo.seek(0)
pdfplumber.load(flo)
See How to use Paramiko getfo to download file from SFTP server to memory to process it
Read a file from server with SSH using Python
Paramiko's SFTPClient class allows you to get a file-like object to read data from a remote file in a Pythonic way.
Assuming you have an open SSHClient:
sftp_client = ssh_client.open_sftp()
remote_file = sftp_client.open('remote_filename')
try:
    for line in remote_file:
        pass  # process line here
finally:
    remote_file.close()
SFTP to S3 AWS Lambda using Python Paramiko is extremely slow
My solution to the problem was to use Paramiko's readv(), which reads a list of chunks and saves time because it doesn't use seek. I also added multithreading on top of that to download several chunks at once, then used a multipart upload. readv alone sped it up to 2-3 MB/s, with higher speeds hitting 10 MB/s, and the multiple threads provided the same speeds but processed different parts of the file simultaneously. This allowed a 1 GB file to be read in less than 6 minutes, whereas the original approach would only have managed about 200 MB in a 15-minute timeframe. I'll also add that prefetch and the other fixes mentioned in the comments were not used, as readv does its own prefetching, and prefetch doesn't help with large files.
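A single-threaded sketch of the readv approach described above; make_chunks and read_with_readv are illustrative names of my own, and the multithreaded multipart upload to S3 is omitted:

```python
def make_chunks(file_size, chunk_size=32768):
    """Split a file of file_size bytes into (offset, length) pairs for readv."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

def read_with_readv(sftp_client, remote_path, chunk_size=32768):
    """Read a whole remote file via SFTPFile.readv, which pipelines requests."""
    size = sftp_client.stat(remote_path).st_size
    with sftp_client.open(remote_path, 'rb') as f:
        # readv yields the data for each requested (offset, length) chunk
        return b''.join(f.readv(make_chunks(size, chunk_size)))
```

Because readv takes the whole chunk list up front, Paramiko can keep many requests in flight at once rather than waiting for each read round trip.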
Reading .csv file to memory from SFTP server using Python Paramiko
Assuming you are using Paramiko SFTP library, use SFTPClient.open
method:
with sftp.open(path) as f:
    f.prefetch()
    df = pd.read_csv(f)
For the purpose of the prefetch, see Reading file opened with Python Paramiko SFTPClient.open method is slow.