How to Get the Line Count of a File in an Efficient Way

How can I get the line count of a file in an efficient way?

int lines = 0;
// try-with-resources so the reader is closed even if readLine() throws
try (BufferedReader reader = new BufferedReader(new FileReader("file.txt"))) {
    while (reader.readLine() != null) lines++;
}

Update: To answer the performance question raised here, I made a measurement. First thing: 20,000 lines are too few to keep the program running for a noticeable time. I created a text file with 5 million lines. This solution (started with java without parameters like -server or -XX options) needed around 11 seconds on my box. The same with wc -l (the UNIX command-line tool for counting lines): 11 seconds. The solution that reads every single character and looks for '\n' needed 104 seconds, 9-10 times as much.
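For reference, the character-by-character variant would look something like this (an illustrative sketch, not the exact code that was measured):

// One read() call per character; the per-call overhead is what makes this slow
int lines = 0;
try (FileReader reader = new FileReader("file.txt")) {
    int c;
    while ((c = reader.read()) != -1) {
        if (c == '\n') lines++;
    }
}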

How can I get the line count of a file in an efficient way in Dart?

var myFile = new File('file.txt');
// assuming a utf8 encoding
var numberOfLines = myFile.readAsLinesSync().length;

See the readAsLinesSync docs for how to provide another encoding.

What's the fastest way to count the total number of lines in a text file in C#?

Here are a few ways this can be accomplished quickly:

StreamReader:

int lineCount = 0;
using (var sr = new StreamReader(path))
{
    // Compare against null, not IsNullOrEmpty; otherwise counting stops
    // at the first empty line and empty lines are never counted
    while (sr.ReadLine() != null)
        lineCount++;
}

FileStream:

int lineCount = 0;
var lineBuffer = new byte[65536]; // 64 KB
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, lineBuffer.Length))
{
    int readBuffer = 0;
    while ((readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0)
    {
        for (int i = 0; i < readBuffer; i++)
        {
            if (lineBuffer[i] == 0xD) // carriage return; assumes CRLF (Windows) line endings
                lineCount++;
        }
    }
}

Multithreading:

Arguably the number of threads shouldn't affect the read speed, but real-world benchmarking can sometimes prove otherwise. Try different buffer sizes and see whether you get any gains at all with your setup. *This method contains a race condition; use with caution.

int lineCount = 0;
var tasks = new Task[Environment.ProcessorCount]; // 1 per core
var fileLock = new ReaderWriterLockSlim();
int bufferSize = 65536; // 64 KB

using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, bufferSize, FileOptions.RandomAccess))
{
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i] = Task.Factory.StartNew(() =>
        {
            int readBuffer = 0;
            var lineBuffer = new byte[bufferSize];

            // Known race (see caveat above): if TryEnterReadLock times out the
            // task exits early, and if Read returns 0 the lock is never released
            while (fileLock.TryEnterReadLock(10) &&
                   (readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0)
            {
                fileLock.ExitReadLock();
                for (int n = 0; n < readBuffer; n++)
                    if (lineBuffer[n] == 0xD)
                        Interlocked.Increment(ref lineCount);
            }
        });
    }
    Task.WaitAll(tasks);
}

Number of lines in a file in Java

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150 MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

EDIT, 9 1/2 years later: I have practically no Java experience these days, but anyway I have tried to benchmark this code against the LineNumberReader solution below, since it bothered me that nobody had done it. It seems that my solution is faster, especially for large files, although it appears to take a few runs until the optimizer does a decent job. I've played a bit with the code and produced a new version that is consistently fastest:

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}
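For comparison, the LineNumberReader approach referred to above is typically written along these lines (a sketch of the general technique, not necessarily the exact code that was benchmarked):

// uses java.io.LineNumberReader and java.io.FileReader
public static int countLinesLnr(String filename) throws IOException {
    try (LineNumberReader lnr = new LineNumberReader(new FileReader(filename))) {
        // skip() consumes the whole stream while the reader tracks line numbers
        lnr.skip(Long.MAX_VALUE);
        return lnr.getLineNumber() + 1; // +1 because line numbering starts at 0
    }
}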

Benchmark results for a 1.3 GB text file, y-axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none; it's only a bit faster, but the difference is statistically significant. LineNumberReader is clearly slower.

Benchmark Plot
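A minimal timing harness in the spirit of that measurement might look like this (a sketch; it assumes countLinesNew from the listing above is in scope, and the run count is a placeholder):

// Times each run individually with System.nanoTime(), as described above
public static void benchmark(String filename) throws IOException {
    for (int run = 0; run < 100; run++) {
        long start = System.nanoTime();
        int lines = countLinesNew(filename);
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("run %d: %d lines in %.3f s%n", run, lines, elapsedNanos / 1e9);
    }
}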

Is there a better way to determine the number of lines in a large txt file (1-2 GB)?

I'm just thinking out loud here, but chances are performance is I/O-bound and not CPU-bound. In any case, I'm wondering if interpreting the file as text may be slowing things down, since it has to convert between the file's encoding and string's native encoding. If you know the encoding is ASCII or compatible with ASCII, you might be able to get away with just counting the number of times a byte with the value 10 appears (which is the character code for a line feed).

What if you had the following:

long lineCount = 0;
byte[] buffer = new byte[1024 * 1024];
int bytesRead;

// using-block added so the stream is always disposed
using (var fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read,
                               FileShare.None, 1024 * 1024))
{
    do
    {
        bytesRead = fs.Read(buffer, 0, buffer.Length);
        for (int i = 0; i < bytesRead; i++)
            if (buffer[i] == '\n')
                lineCount++;
    }
    while (bytesRead > 0);
}

My benchmark results for a 1.5 GB text file, timed 10 times and averaged:

  • StreamReader approach, 4.69 seconds
  • File.ReadLines().Count() approach, 4.54 seconds
  • FileStream approach, 1.46 seconds

Count lines in large files

Try: sed -n '$=' filename (the $ address matches the last line, the = command prints its line number, and -n suppresses normal output).

Also, cat is unnecessary: wc -l filename is enough for what you're doing now.

How to get the line count of a large file, at least 5G

Step 1: head -n 5 filename > newfile   # copy the first n lines into newfile, e.g. n = 5

Step 2: Get the size of the huge file, A.

Step 3: Get the size of newfile, B.

Step 4: (A/B)*n is approximately equal to the exact line count.

Set n to different values, repeat a few more times, then take the average; a code sketch of this estimate follows below.
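To illustrate, the same estimate can be computed programmatically. Here is a rough Java sketch (the file name and sample size n are placeholders, and the byte count assumes the platform default encoding):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class EstimateLineCount {
    public static void main(String[] args) throws IOException {
        File file = new File("huge.txt"); // placeholder path
        int n = 1000;                     // sample size; vary it and average the results

        // B: byte size of the first n lines
        long sampleBytes = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            int read = 0;
            while (read < n && (line = reader.readLine()) != null) {
                // +1 approximates the newline character stripped by readLine()
                sampleBytes += line.getBytes().length + 1;
                read++;
            }
        }
        if (sampleBytes == 0) {
            System.out.println("empty file");
            return;
        }
        // (A / B) * n, with A = total file size
        long estimate = (long) ((double) file.length() / sampleBytes * n);
        System.out.println("~" + estimate + " lines");
    }
}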

How to get line count of a large file cheaply in Python?

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n characters it contains, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.

Fastest way to count lines in a file

for /f "tokens=1 delims=:" %%# in ('find /c /v "" ^< FILENAME') do set "linescount=%%#"
echo %linescount%

Fastest way to find the number of lines in a text (C++)

The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.

As your current file size appears to be about 60 MB, this is not an attractive option. You can get some of that speed by not reading the whole file at once, but reading it in chunks of, say, 1 MB. You also say that a database is out of the question, but it really does look to be the best long-term solution.

Edit: I just ran a small benchmark on this, and the buffered approach (buffer size 1024 KB) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code; my tests were done with g++ at the -O2 optimisation level:

#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
using namespace std;

unsigned int FileRead( istream & is, vector <char> & buff ) {
is.read( &buff[0], buff.size() );
return is.gcount();
}

unsigned int CountLines( const vector <char> & buff, int sz ) {
int newlines = 0;
const char * p = &buff[0];
for ( int i = 0; i < sz; i++ ) {
if ( p[i] == '\n' ) {
newlines++;
}
}
return newlines;
}

int main( int argc, char * argv[] ) {
time_t now = time(0);
if ( argc == 1 ) {
cout << "lines\n";
ifstream ifs( "lines.dat" );
int n = 0;
string s;
while( getline( ifs, s ) ) {
n++;
}
cout << n << endl;
}
else {
cout << "buffer\n";
const int SZ = 1024 * 1024;
std::vector <char> buff( SZ );
ifstream ifs( "lines.dat" );
int n = 0;
while( int cc = FileRead( ifs, buff ) ) {
n += CountLines( buff, cc );
}
cout << n << endl;
}
cout << time(0) - now << endl;
}
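To reproduce the comparison, compile with something like g++ -O2 -o linecount lines.cpp, then run the binary with no arguments for the getline() test or with any argument for the buffered test (the program only checks whether an argument is present), with your test data in lines.dat.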

