Get Last 10 Lines of Very Large Text File > 10Gb

Get last 10 lines of very large text file 10GB

Seek to the end of the file, step backwards until you have found ten newlines, and then read forward to the end, taking the file's encoding into account. Be sure to handle the case where the file has fewer than ten lines. Below is an implementation (in C#, as you tagged this), generalized to find the last numberOfTokens tokens in the file located at path, encoded in encoding, where the token separator is tokenSeparator. The result is returned as a single string (this could be improved by returning an IEnumerable<string> that enumerates the tokens).

public static string ReadEndTokens(string path, Int64 numberOfTokens, Encoding encoding, string tokenSeparator) {

    int sizeOfChar = encoding.GetByteCount("\n");
    byte[] buffer = encoding.GetBytes(tokenSeparator);

    // Only read access is needed
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        Int64 tokenCount = 0;
        // Measured in bytes, like position, so the whole file is scanned
        // even when a character takes more than one byte
        Int64 endPosition = fs.Length;

        // Step backwards from the end one character at a time
        for (Int64 position = sizeOfChar; position <= endPosition; position += sizeOfChar) {
            fs.Seek(-position, SeekOrigin.End);
            fs.Read(buffer, 0, buffer.Length);

            if (encoding.GetString(buffer) == tokenSeparator) {
                tokenCount++;
                if (tokenCount == numberOfTokens) {
                    // The stream is now just past the separator: return the rest of the file
                    byte[] returnBuffer = new byte[fs.Length - fs.Position];
                    fs.Read(returnBuffer, 0, returnBuffer.Length);
                    return encoding.GetString(returnBuffer);
                }
            }
        }

        // handle case where number of tokens in file is less than numberOfTokens
        fs.Seek(0, SeekOrigin.Begin);
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, buffer.Length);
        return encoding.GetString(buffer);
    }
}

Java : Read last n lines of a HUGE file

If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.

If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth last line begins, you can seek to there and just read-and-print.

An initial best-guess assumption can be made based on your data properties. For example, if it's a text file, it's possible the line lengths won't exceed an average of 132 so, to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320 (you can even use what you learned from the last 660 characters to adjust that - example: if those 660 characters were just three lines, the next try could be 660 / 3 * 5, plus maybe a bit extra just in case).
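The guess-and-retry idea above can be sketched in a few lines. This is only an illustration, written in Python rather than Java, and the names tail_lines and avg_line_len are made up for the example. It assumes an ASCII-compatible encoding such as UTF-8, works on raw bytes, and simply doubles the guess on each retry instead of using the measured line length.

```python
import os

def tail_lines(path, n, avg_line_len=132):
    """Return the last n lines of the file at path as a list of bytes objects."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        file_size = f.tell()
        # First guess: n lines of average length, e.g. 5 * 132 = 660 bytes.
        guess = max(1, n * avg_line_len)
        while True:
            start = max(0, file_size - guess)
            f.seek(start)
            lines = f.read(file_size - start).splitlines()
            # Unless we started at byte 0, the first element may be a partial
            # line, so require more than n lines before trusting the last n.
            if start == 0 or len(lines) > n:
                return lines[-n:]
            guess *= 2  # not enough lines: back up further and try again
```

Calling tail_lines("app.log", 5) (a hypothetical file name) reads only the last few hundred bytes when the guess is good, and a 10 GB file is never read in full unless it really has fewer than n lines.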

read only given last x lines in txt file

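What about this?

List<string> text = File.ReadLines("file.txt").Reverse().Take(2).ToList();

Note that Reverse() has to buffer every line before it can yield the first one, so this still reads the whole file into memory, and the lines come back in reverse order (last line first).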

(Python) Counting lines in a huge ( 10GB) file as fast as possible

Ignacio's answer is correct, but might fail if you have a 32 bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust, for reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r", encoding="utf-8", errors="ignore") as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
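For comparison, here is a sketch of the same block-wise count done on raw bytes. The name count_newlines and the 1 MiB block size are just choices for this example. It assumes an ASCII-compatible encoding such as UTF-8, where every line ending (\n or \r\n) contains exactly one 0x0A byte, so nothing has to be decoded; it would give wrong results for UTF-16.

```python
def count_newlines(path, block_size=1 << 20):
    """Count b"\\n" bytes in the file at path, reading it in fixed-size blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # an empty read means end of file
                break
            count += block.count(b"\n")
    return count
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not included.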

How to read last n lines of log file

Your code will perform very poorly, since you aren't allowing any caching to happen.

In addition, it will not work at all for Unicode.

I wrote the following implementation:

///<summary>Returns the end of a text reader.</summary>
///<param name="reader">The reader to read from.</param>
///<param name="lineCount">The number of lines to return.</param>
///<returns>The last lineCount lines from the reader.</returns>
public static string[] Tail(this TextReader reader, int lineCount) {
    var buffer = new List<string>(lineCount);
    string line;
    for (int i = 0; i < lineCount; i++) {
        line = reader.ReadLine();
        if (line == null) return buffer.ToArray();
        buffer.Add(line);
    }

    // The index of the last line read into the buffer.
    // Everything > this index was read earlier than everything <= this index.
    int lastLine = lineCount - 1;

    while (null != (line = reader.ReadLine())) {
        lastLine++;
        if (lastLine == lineCount) lastLine = 0;
        buffer[lastLine] = line;
    }

    if (lastLine == lineCount - 1) return buffer.ToArray();
    var retVal = new string[lineCount];
    buffer.CopyTo(lastLine + 1, retVal, 0, lineCount - lastLine - 1);
    buffer.CopyTo(0, retVal, lineCount - lastLine - 1, lastLine + 1);
    return retVal;
}
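The circular-buffer idea in Tail is not specific to C#. As a rough illustration, the same bounded-memory approach in Python can lean on collections.deque with maxlen, which discards the oldest entry whenever a new one is appended to a full deque. The function name tail here is only chosen to mirror the method above.

```python
from collections import deque

def tail(lines, line_count):
    """Return the last line_count items of any iterable of lines, in order."""
    # deque(maxlen=k) keeps at most k items, dropping the oldest on overflow,
    # so memory use stays proportional to line_count rather than to the input.
    return list(deque(lines, maxlen=line_count))
```

With a file object, tail(f, 10) returns the last ten lines including their trailing newlines. Like the C# version, it still reads the entire input once; it only avoids holding all of it in memory.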

How do I read last 10 lines in a text file?

This is how I finally solved it. The code is still too slow, though, so if any of you have any advice, please tell me:

public static string ReadEndTokens(string filename, Int64 numberOfTokens, Encoding encoding, string tokenSeparator)
{
    lock (typeof(SDAccess))
    {
        PersistentStorage sdPS = new PersistentStorage("SD");
        sdPS.MountFileSystem();
        try
        {
            string rootDirectory = VolumeInfo.GetVolumes()[0].RootDirectory;

            int sizeOfChar = 1; // The only encoding supported by NETMF4.1 is UTF8
            byte[] buffer = encoding.GetBytes(tokenSeparator);

            using (FileStream fs = new FileStream(rootDirectory + @"\" + filename, FileMode.Open, FileAccess.Read))
            {
                Int64 tokenCount = 0;
                Int64 endPosition = fs.Length / sizeOfChar;

                for (Int64 position = sizeOfChar; position < endPosition; position += sizeOfChar)
                {
                    fs.Seek(-position, SeekOrigin.End);
                    fs.Read(buffer, 0, buffer.Length);

                    // Decode the bytes just read once and compare them with the separator
                    if (new string(encoding.GetChars(buffer)) == tokenSeparator)
                    {
                        tokenCount++;
                        if (tokenCount == numberOfTokens)
                        {
                            byte[] returnBuffer = new byte[fs.Length - fs.Position];
                            fs.Read(returnBuffer, 0, returnBuffer.Length);
                            return GetString(returnBuffer);
                        }
                    }
                }

                // handle case where number of tokens in file is less than numberOfTokens
                fs.Seek(0, SeekOrigin.Begin);
                buffer = new byte[fs.Length];
                fs.Read(buffer, 0, buffer.Length);
                return GetString(buffer);
            }
        }
        finally
        {
            // Runs on every return path, after the file stream has been closed
            sdPS.UnmountFileSystem(); // Unmount file system
            sdPS.Dispose();
        }
    }
}

// As GetString is not implemented in NETMF4.1, I've written this method
public static string GetString(byte[] bytes)
{
    // Decode the whole array in one call rather than once per character
    return new string(Encoding.UTF8.GetChars(bytes));
}

Count lines in large files

Try: sed -n '$=' filename

Also, cat is unnecessary: wc -l filename is enough for what you are doing now.


