Truncating the First 100MB of a File in Linux

Truncate file at front

Truncating files at the front does not seem too hard to implement at the system level.

But there are issues.

  • The first one is at the programming level. When opening a file for random access, the current paradigm is to use offsets from the beginning of the file to point at different places in it. If we truncate at the beginning of the file (or insert into or remove from the middle of the file), those offsets are no longer a stable property. (Appending, or truncating from the end, is not a problem.)

In other words, truncating the beginning would invalidate the only reference point, and that is bad.

  • At the system level such uses exist, as you pointed out, but they are quite rare. I believe most uses of files are of the write-once-read-many kind, so even truncate is not a critical feature and we could probably do without it (well, some things would become more difficult, but nothing would become impossible).

If we want more complex access patterns (and there are indeed such needs), we open files in random-access mode and add some internal data structure on top. That information can also be shared between several files. This leads us to the last issue I see, probably the most important one.

  • In a sense, when we are using random-access files with some internal structure... we are still using files, but we are no longer using the file paradigm. Typical cases are databases, where we want to insert or remove records without caring at all about their physical location. Databases can use files as the low-level implementation, but for optimisation purposes some database vendors choose to bypass the filesystem completely (think of Oracle partitions).

I see no technical reason why we couldn't do everything that is currently done in an operating system with files using a database as the data-storage layer. I have even heard that NTFS has many points in common with databases in its internals. An operating system can (and probably will, in the not-so-distant future) use a paradigm other than files.

In summary, I believe this is not a technical problem at all, just a change of paradigm. Removing the beginning of a file is definitely not part of the current "file paradigm", but it is not a big and useful enough change to compel changing anything at all.

Find all files above a size and truncate?

I'd suggest rotating and compressing logs rather than truncating them. Logs typically compress really well, and you can move the compressed logs to backup media if you like. Plus, if you do have to delete anything, delete the oldest logs, not the newest ones.


That said, for educational purposes let's explore truncate. It has the ability to only shrink files, though it's buried in the documentation:

SIZE may also be prefixed by one of the following modifying characters: '+' extend by, '-' reduce by, '<' at most, '>' at least, '/' round down to multiple of, '%' round up to multiple of.

If the files are at a fixed depth you don't need the loop nor the find call. A simple glob will do:

truncate -s '<100M' /home/*/path/to/error_log

If they're at unpredictable depths you can use recursive globbing (globstar)...

shopt -s globstar
truncate -s '<100M' /home/**/error_log

...or use find -exec <cmd> {} +, which tells find to invoke a command on the files it finds.

find /home -name error_log -exec truncate -s '<100M' {} +

(If there are lots and lots of files find is safest. The glob options could exceed Linux's command-line length limit whereas find guards against that possibility.)

Remove beginning of file without rewriting the whole file

You can achieve this with Linux kernel v3.15 and above, on ext4/XFS file systems.

int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096);

See here: Truncating the first 100MB of a file in Linux
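
To show that call in context, here is a minimal C sketch (assumptions: the file is named big.log, the filesystem block size divides 100 MiB evenly, and on older glibc the FALLOC_FL_COLLAPSE_RANGE constant may need <linux/falloc.h>):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* 100 MiB; FALLOC_FL_COLLAPSE_RANGE requires offset and length to be
       multiples of the filesystem block size (typically 4096). */
    const off_t len = 100L * 1024 * 1024;

    int fd = open("big.log", O_RDWR);    /* example file name */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Collapse the byte range [0, len): everything after it slides down to
       offset 0 and the file shrinks by len bytes, without rewriting it. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len) < 0) {
        perror("fallocate(FALLOC_FL_COLLAPSE_RANGE)");
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}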

Delete first n characters of a very large file in Unix shell

You can use:

sed -i.bak -r '1s/^.{10}//' file

This will create a backup file.bak and remove the first 10 characters from the first line. Note that -i alone can also be used to do an in-place edit without a backup.

Test

Original file:

$ cat a
1234567890some bad data and here we are
blablabla
yeah

Let's:

$ sed -i.bak -r '1s/^.{10}//' a
$ cat a
some bad data and here we are
blablabla
yeah
$ cat a.bak
1234567890some bad data and here we are
blablabla
yeah

Truncate/Delete the log file contents which are generated via output redirection

For long-running processes that open the file in write mode (using '>', or otherwise), the process tracks the offset of its next write. Even if the file is truncated to size 0, the next write will resume at the last offset. Most likely, based on the description, the long-running process continues to log at the old offset, effectively leaving a lot of zero bytes at the start of the file.

  • Verify by inspecting the file: did the initial content disappear?

The solution is simple: instead of logging in write mode, use append mode.

# Start with a clean file
rm -f sysout.log
# Force append mode.
java -jar my_app.jar >> sysout.log 2>&1 &

...
truncate ...
# With append mode, new data lands at the current end of file,
# which after the truncate is the start of the file.

Note that all writing processes and connections should use append mode.
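
To make the offset behaviour concrete, here is a small C sketch (the file name demo.log is invented) showing a plain write-mode descriptor resuming at its old offset after a truncate, while an O_APPEND descriptor writes at the new end:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void report(const char *label)
{
    struct stat st;
    if (stat("demo.log", &st) == 0)
        printf("%s: size = %lld bytes\n", label, (long long)st.st_size);
}

int main(void)
{
    int fd = open("demo.log", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* no O_APPEND */
    write(fd, "0123456789", 10);                /* descriptor offset is now 10 */
    report("after first write");

    truncate("demo.log", 0);                    /* what `truncate -s 0` does */
    report("after truncate");

    write(fd, "X", 1);                          /* still written at offset 10! */
    report("write-mode writer after truncate"); /* size jumps back to 11, with a hole */
    close(fd);

    fd = open("demo.log", O_WRONLY | O_APPEND); /* append mode */
    truncate("demo.log", 0);
    write(fd, "X", 1);                          /* written at the new end: offset 0 */
    report("append-mode writer after truncate");/* size is 1 */
    close(fd);

    unlink("demo.log");
    return 0;
}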

How to process a file from one to another without doubled storage requirements?

Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?

No, that is not possible with plain files. The Linux-specific fallocate(2) can collapse a range at the beginning of a file, but it is not portable and might not work with every file system, so I don't recommend using it.

However, look into SQLite and GDBM indexed files. They provide an abstraction (above files) which enables you to "delete records".

Or just keep all the data in memory temporarily.

Or consider a double-pass (or multiple-pass) approach. Maybe nftw(3) could be useful.

(Today, disk space is very cheap, so your requirement of avoiding the doubled storage is rather unusual; if you are handling a huge amount of data you should have mentioned it.)
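
As an illustration of the indexed-file suggestion above, here is a minimal GDBM sketch (the file name records.db and the keys are invented; link with -lgdbm) where "removing the beginning" becomes deleting the oldest record by key:

#include <gdbm.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    GDBM_FILE db = gdbm_open("records.db", 0, GDBM_WRCREAT, 0644, NULL);
    if (db == NULL) { fprintf(stderr, "gdbm_open failed\n"); return EXIT_FAILURE; }

    /* Store two records keyed by a record id. */
    datum k1 = { "rec-0001", 8 }, v1 = { "old data", 8 };
    datum k2 = { "rec-0002", 8 }, v2 = { "new data", 8 };
    gdbm_store(db, k1, v1, GDBM_REPLACE);
    gdbm_store(db, k2, v2, GDBM_REPLACE);

    /* "Truncating at the beginning" becomes deleting the oldest record. */
    gdbm_delete(db, k1);

    /* The remaining record is still addressable by key; no byte offsets involved. */
    datum got = gdbm_fetch(db, k2);
    if (got.dptr != NULL) {
        printf("kept: %.*s\n", got.dsize, got.dptr);
        free(got.dptr);                 /* gdbm_fetch returns malloc'ed data */
    }

    gdbm_close(db);
    return EXIT_SUCCESS;
}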

Truncate wxFile (set lesser length)

There is indeed no such method; you need to call ftruncate(f.fd()) yourself under Unix or SetEndOfFile() under Windows.
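
For reference, a minimal sketch of the Unix side (the file name data.bin and the 1024-byte target length are made up; with wxFile the descriptor would come from f.fd() instead of open()):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);   /* with wxFile: use f.fd() */
    if (fd < 0) { perror("open"); return 1; }

    if (ftruncate(fd, 1024) < 0)          /* keep only the first 1024 bytes */
        perror("ftruncate");

    close(fd);
    return 0;
}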


