Count lines in large files
Try: sed -n '$=' filename
Also, cat is unnecessary:
wc -l filename
is enough on its own.
PowerShell: count lines in an extremely large file
If performance matters, avoid cmdlets and the pipeline; use switch -File:

$count = 0
switch -File C:\test.txt { default { ++$count } }

switch -File enumerates the lines of the specified file; the default condition matches any line.
To give a sense of the performance difference:
# Create a sample file with 100,000 lines.
1..1e5 > tmp.txt
# Warm up the file cache
foreach ($line in [IO.File]::ReadLines("$pwd/tmp.txt")) { }
(Measure-Command { (Get-Content tmp.txt | Measure-Object).Count }).TotalSeconds
(Measure-Command { $count = 0; switch -File tmp.txt { default { ++$count } } }).TotalSeconds
Sample results from my Windows 10 / PSv5.1 machine:

1.3081307  # Get-Content + Measure-Object
0.1097513  # switch -File

That is, on my machine the switch -File command was about 12 times faster.
How to get the line count of a large file, at least 5G
Step 1: head -n 5 filename > newfile    # copy the first n lines into newfile, e.g. n = 5
Step 2: get the huge file's size, A
Step 3: get newfile's size, B
Step 4: (A/B) * n is approximately equal to the exact line count.
Set n to a few different values, repeat the steps, then average the results.
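The steps above can be sketched in Python (the function name and the default n = 5 are illustrative choices, not from the original answer):

```python
import os

def estimate_line_count(filename, n=5):
    """Estimate the total line count from the average size of the first n lines."""
    sample_bytes = 0  # B: size of the n-line sample
    with open(filename, "rb") as f:
        for _ in range(n):
            line = f.readline()
            if not line:
                break
            sample_bytes += len(line)
    total_bytes = os.path.getsize(filename)  # A: size of the whole file
    # (A / B) * n is the estimated line count
    return round(total_bytes / sample_bytes * n)
```

The estimate is exact when every line has the same length, and degrades as line lengths vary; averaging over several values of n smooths this out.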
(Python) Counting lines in a huge (10GB) file as fast as possible
Ignacio's answer is correct, but it might fail if you have a 32-bit process.
It may be useful to read the file block-wise and then count the \n characters in each block.
def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r") as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
will do the job. Note that I don't open the file as binary, so \r\n will be converted to \n, making the counting more reliable.
For Python 3, and to make it more robust, for reading files with all kinds of characters:
def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r", encoding="utf-8", errors="ignore") as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
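If decoding overhead matters more than \r\n normalization, the same block-wise idea also works in binary mode, counting the b"\n" byte directly (a sketch, not from the original answer; note that a final line without a trailing newline is not counted):

```python
def count_lines_binary(path, size=65536):
    """Count newline bytes by reading the file in fixed-size binary blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```

This skips the text-decoding step entirely, which is the same trade-off the C# answer below makes by scanning raw bytes for the value 10.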
Is there a better way to determine the number of lines in a large txt file(1-2 GB)?
I'm just thinking out loud here, but chances are performance is I/O-bound and not CPU-bound. In any case, I'm wondering if interpreting the file as text may be slowing things down, since it has to convert between the file's encoding and string's native encoding. If you know the encoding is ASCII or ASCII-compatible, you might be able to get away with just counting the number of times a byte with the value 10 appears (which is the character code for a linefeed).
What if you had the following:
FileStream fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read, FileShare.None, 1024 * 1024);
long lineCount = 0;
byte[] buffer = new byte[1024 * 1024];
int bytesRead;

do
{
    bytesRead = fs.Read(buffer, 0, buffer.Length);
    for (int i = 0; i < bytesRead; i++)
        if (buffer[i] == '\n')
            lineCount++;
}
while (bytesRead > 0);
My benchmark results for a 1.5 GB text file, timed 10 times, averaged:

StreamReader approach: 4.69 seconds
File.ReadLines().Count() approach: 4.54 seconds
FileStream approach: 1.46 seconds
Count lines in very large file where line size is fixed
Since each row has a fixed number of characters, just get the file's size in bytes with os.path.getsize, subtract the length of the header, then divide by the length of each row. Something like this:
import os

file_name = 'TickStory/EURUSD.csv'
len_head = len('Timestamp,Bid price\n')             # header length
len_row = len('2012-01-01 22:00:36.416,1.29368\n')  # one fixed-width row
size = os.path.getsize(file_name)
print((size - len_head) // len_row + 1)             # +1 counts the header line
This assumes all characters in the file are 1 byte.
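As a quick sanity check of the arithmetic, here is the same formula applied to a made-up two-row file in the answer's format (the temp file and its contents are illustrative):

```python
import os
import tempfile

header = 'Timestamp,Bid price\n'
row = '2012-01-01 22:00:36.416,1.29368\n'

# newline="" prevents \n from being translated to \r\n on Windows,
# so character lengths match byte lengths for this ASCII data.
path = tempfile.mktemp(suffix=".csv")
with open(path, "w", newline="") as f:
    f.write(header)
    f.write(row)
    f.write(row)

size = os.path.getsize(path)
line_count = (size - len(header)) // len(row) + 1  # +1 for the header line
os.remove(path)
print(line_count)  # header + 2 data rows = 3
```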
How to get the exact count of lines in a very large text file in R?
1) wc This should be quite fast. First determine the filenames. We have assumed all files in the current directory whose extension is .txt; change as needed. Then run wc -l for each file and form a data frame from the output.
(If you are on Windows, install Rtools and ensure that \Rtools\bin is on your PATH.)
filenames <- dir(pattern = "[.]txt$")
wc <- function(x) shell(paste("wc -l", x), intern = TRUE)
DF <- read.table(text = sapply(filenames, wc), col.names = c("count", "filename"))
2) count.fields An alternative approach is to use count.fields. This does not make use of any external commands. filenames is from above.
sapply(filenames, function(x) length(count.fields(x, sep = "\1")))
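For comparison, the same multi-file tally can be sketched in Python without external commands (the function name and the *.txt pattern mirror the R example and are illustrative):

```python
import glob

def line_counts(pattern="*.txt"):
    """Map each matching filename to its newline count, reading in binary blocks."""
    counts = {}
    for name in glob.glob(pattern):
        with open(name, "rb") as f:
            counts[name] = sum(block.count(b"\n")
                               for block in iter(lambda: f.read(65536), b""))
    return counts
```

Like wc -l, this counts \n bytes, so a final line without a trailing newline is not counted.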
How to get line count of a large file cheaply in Python?
You can't get any better than that.
After all, any solution will have to read the entire file, figure out how many \n characters it contains, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.