Count Lines in Large Files

Count lines in large files

Try: sed -n '$=' filename  (the $ address selects the last line and = prints its line number, which is the line count)

Also, the cat is unnecessary: wc -l filename on its own does the same job as piping cat into wc.

PowerShell: count lines in an extremely large file

If performance matters, avoid the use of cmdlets and the pipeline; use switch -File:

$count = 0
switch -File C:\test.txt { default { ++$count } }

switch -File enumerates the lines of the specified file; the default condition matches any line, so the counter is incremented once per line.


To give a sense of the performance difference:

# Create a sample file with 100,000 lines.
1..1e5 > tmp.txt
# Warm up the file cache
foreach ($line in [IO.File]::ReadLines("$pwd/tmp.txt")) { }

(Measure-Command { (Get-Content tmp.txt | Measure-Object).Count }).TotalSeconds

(Measure-Command { $count = 0; switch -File tmp.txt { default { ++$count } } }).TotalSeconds

Sample results from my Windows 10 / PSv5.1 machine:

1.3081307  # Get-Content + Measure-Object
0.1097513 # switch -File

That is, on my machine the switch -File command was about 12 times faster.

How to get the line count of a large file (at least 5 GB)

Step 1: head -n $n filename > newfile  # copy the first n lines into newfile, e.g. n = 5

Step 2: Get the huge file size, A

Step 3: Get the newfile size, B

Step 4: (A/B)*n gives an approximation of the line count.

Set n to a few different values, repeat the steps, and average the results to improve the estimate; a Python sketch of this estimation follows below.
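
A minimal Python sketch of the same estimation; estimate_line_count and the file path are placeholder names for illustration, not taken from the original answer:

import os

def estimate_line_count(path, n=5):
    """Estimate the line count of a large file from the size of its first n lines."""
    sample_bytes = 0  # B: size of the first n lines, in bytes
    lines_read = 0
    with open(path, "rb") as f:
        for line in f:
            sample_bytes += len(line)
            lines_read += 1
            if lines_read == n:
                break
    if sample_bytes == 0:
        return 0
    total_bytes = os.path.getsize(path)  # A: size of the whole file, in bytes
    # (A / B) * n, as in the steps above
    return round(total_bytes / sample_bytes * lines_read)

print(estimate_line_count("hugefile.txt", n=5))

Running it for a few different values of n and averaging the results smooths out unusually short or long lines at the start of the file.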

(Python) Counting lines in a huge (10 GB) file as fast as possible

Ignacio's answer is correct, but it might fail if you have a 32-bit process.

As an alternative, it can be useful to read the file block-wise and then count the \n characters in each block.

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do the job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust when reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r", encoding="utf-8", errors='ignore') as f:
    print(sum(bl.count("\n") for bl in blocks(f)))

Is there a better way to determine the number of lines in a large txt file (1-2 GB)?

I'm just thinking out loud here, but chances are performance is I/O bound and not CPU bound. In any case, I'm wondering if interpreting the file as text may be slowing things down as it will have to convert between the file's encoding and string's native encoding. If you know the encoding is ASCII or compatible with ASCII, you might be able to get away with just counting the number of times a byte with the value 10 appears (which is the character code for a linefeed).

What if you had the following:

FileStream fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read, FileShare.None, 1024 * 1024);

long lineCount = 0;
byte[] buffer = new byte[1024 * 1024];
int bytesRead;

do
{
    bytesRead = fs.Read(buffer, 0, buffer.Length);
    for (int i = 0; i < bytesRead; i++)
        if (buffer[i] == '\n')
            lineCount++;
}
while (bytesRead > 0);

My benchmark results for a 1.5 GB text file, each approach timed 10 times and averaged:

  • StreamReader approach, 4.69 seconds
  • File.ReadLines().Count() approach, 4.54 seconds
  • FileStream approach, 1.46 seconds

Count lines in very large file where line size is fixed

Since each row has a fixed number of characters, just get the file's size in bytes with os.path.getsize, subtract the length of the header, then divide by the length of each row. Something like this:

import os

file_name = 'TickStory/EURUSD.csv'

len_head = len('Timestamp,Bid price\n')
len_row = len('2012-01-01 22:00:36.416,1.29368\n')

size = os.path.getsize(file_name)

print((size - len_head) / len_row + 1)

This assumes all characters in the file are 1 byte.

How to get the exact count of lines in a very large text file in R?

1) wc This should be quite fast. First determine the filenames. We have assumed all files in the current directory whose extension is .txt. Change as needed. Then for each file run wc -l and form a data frame from it.

(If you are on Windows then install Rtools and ensure that \Rtools\bin is on your PATH.)

filenames <- dir(pattern = "[.]txt$")
wc <- function(x) shell(paste("wc -l", x), intern = TRUE)
DF <- read.table(text = sapply(filenames, wc), col.names = c("count", "filename"))

2) count.fields An alternative approach is to use count.fields. This does not make use of any external commands. filenames is from above.

sapply(filenames, function(x) length(count.fields(x, sep = "\1")))

How to get line count of a large file cheaply in Python?

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not really. Any solution will be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.
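
For context, the kind of memory-light counter the answer is endorsing looks roughly like the sketch below; the questioner's actual code is not reproduced here, and file_len plus the file name are placeholder names:

def file_len(path):
    # Iterate the file object directly: only one line is buffered at a time,
    # but the whole file is still read once, so the cost stays I/O-bound.
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count

print(file_len("big.txt"))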


