How Does Sort Work Out How Much RAM There Is

How does sort work out how much RAM there is?

If you're using GNU sort, the answer is that it calculates a default based on the rlimits for data (set by ulimit -d) and RSS (set by ulimit -m), as well as the sysconf values for available memory and total memory.

Regardless of your ulimits, the default buffer size starts out as the greater of your currently available memory and 1/8th of your total memory, and it is then capped at 3/4 of your total physical memory.

    /* Let MEM be available memory or 1/8 of total memory, whichever
       is greater.  */
    double avail = physmem_available ();
    double total = physmem_total ();
    double mem = MAX (avail, total / 8);

    /* Leave a 1/4 margin for physical memory.  */
    if (total * 0.75 < size)
      size = total * 0.75;
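
For a rough feel of what that heuristic computes on a given machine, here is a small Python sketch of the same calculation. It assumes a Linux-style libc that exposes these sysconf names, and it ignores the rlimit caps that the real code also applies.

    import os

    def default_sort_size_estimate():
        # Rough Python rendition of the GNU sort heuristic quoted above.
        # Assumes a Linux-style libc exposing these sysconf names; the real
        # code additionally clamps the result by the ulimit/rlimit values.
        page = os.sysconf("SC_PAGE_SIZE")
        total = os.sysconf("SC_PHYS_PAGES") * page      # like physmem_total()
        avail = os.sysconf("SC_AVPHYS_PAGES") * page    # like physmem_available()
        mem = max(avail, total / 8)     # available RAM or 1/8 of total
        return min(mem, total * 0.75)   # leave a 1/4 margin for physical memory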

With GNU sort, you can use the -S option to specify sorting buffer size:

    -S, --buffer-size=SIZE
          use SIZE for main memory buffer

This value can be a number of kilobytes, a number suffixed with another unit (e.g. -S 100M), or a percentage of total memory (e.g. -S 55%).

Sort with limited memory

You are looking for external sorting.
The largest cost in these scenarios is often disk IO. So the trick is to use an algorithm that minimises disk IO.
The usual approach is to read suitably large chunks into memory, sort those chunks, save them back to disk, and then merge the sorted chunks.
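
As a concrete illustration, here is a minimal Python sketch of that chunk-sort-then-merge approach. The line-based records, temporary-file handling, and chunk size are illustrative assumptions rather than tuned values.

    import heapq, itertools, os, tempfile

    def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
        # Phase 1: read a chunk that fits in memory, sort it, and write it
        # out as a temporary "run" file.
        run_paths = []
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, lines_per_chunk))
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.NamedTemporaryFile("w", delete=False)
                tmp.writelines(chunk)
                tmp.close()
                run_paths.append(tmp.name)

        # Phase 2: stream every run through a k-way merge; only the current
        # head line of each run needs to sit in memory at once.
        files = [open(p) for p in run_paths]
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*files))
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

The merge step is what keeps the disk IO mostly sequential: each run is written once and then read back front to back exactly once.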

A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.

Sorting 10 GB of data in 1 GB of memory: how will I do it?

Split the file into parts (buffers) that you can sort in place.

Then, when all buffers are sorted, take two (or more) at a time and merge them (as in merge sort) until only one buffer remains; that buffer is the sorted file.
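
A minimal Python sketch of that pairwise merging, assuming the sorted buffers were written to disk as line-sorted text files (the temporary-file names and cleanup are illustrative):

    import os, tempfile

    def merge_two(path_a, path_b):
        # Merge two files whose lines are already sorted into one sorted file.
        out = tempfile.NamedTemporaryFile("w", delete=False)
        with open(path_a) as fa, open(path_b) as fb, out:
            line_a, line_b = fa.readline(), fb.readline()
            while line_a and line_b:
                if line_a <= line_b:
                    out.write(line_a)
                    line_a = fa.readline()
                else:
                    out.write(line_b)
                    line_b = fb.readline()
            # one input is exhausted; copy whatever remains of the other
            while line_a:
                out.write(line_a)
                line_a = fa.readline()
            while line_b:
                out.write(line_b)
                line_b = fb.readline()
        return out.name

    def merge_all(run_paths):
        # Keep merging pairs of runs until a single sorted file remains.
        runs = list(run_paths)
        while len(runs) > 1:
            merged = []
            for i in range(0, len(runs) - 1, 2):
                merged.append(merge_two(runs[i], runs[i + 1]))
                os.remove(runs[i])
                os.remove(runs[i + 1])
            if len(runs) % 2:        # odd run out advances to the next round
                merged.append(runs[-1])
            runs = merged
        return runs[0]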

Does RAM affect the time taken to sort an array?

Adding RAM for an operation like that would only make a difference if you had filled up the available memory with everything running on your computer and the OS was swapping data out to disk to make room for your sort operation. Chances are, if that were happening, you would know, because your computer would become pretty sluggish.

So, you just need enough memory for the working set of whatever applications you're running and then enough memory to hold the data you are sorting. Adding additional memory beyond that will not make any difference.

If you had an array of a million numbers to be sorted in Javascript, that array would likely take (1,000,000 * 8 bytes per number) + some overhead for a JS data structure = ~8MB. If your array values were larger than 8 bytes, then you'd have to account for that in the calculation, but hopefully you can see that this isn't a ton of memory in a modern computer.
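
That estimate is plain arithmetic; here is a quick back-of-envelope check (written in Python, since the exact per-element overhead of a JavaScript engine varies):

    elements = 1_000_000
    bytes_per_number = 8                          # one 64-bit float
    print(elements * bytes_per_number / 2 ** 20)  # about 7.6 MiB, i.e. the ~8 MB ballpark above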

If you have only an 8 GB system with a lot of services and other things configured on it, and you are perhaps running a few other applications at the same time, then it's possible that by the time you run nodejs you don't have much free memory. You should be able to look at some system diagnostics to see how much free memory you have. As long as you have some free memory and are not causing the system to swap to disk, adding more memory will not increase the performance of the sort.

Now, if the data is stored in a database and you're doing some major database operation (such as creating a new index), then it's possible that the database may adjust how much memory it can use based on how much memory is available and it might be able to go faster by using more RAM. But, for a Javascript array which is already all in memory and is using a fixed algorithm for the sort, this would not be the case.

Sort a file with a huge volume of data given a memory constraint

It looks like what you are looking for is external sorting.

Basically, you sort small chunks of the data first, write them back to disk, and then merge those sorted chunks to sort the whole file.

How could the UNIX sort command sort a very large file?

The article "Algorithmic details of UNIX Sort command" says UNIX sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input up into smaller portions (that fit into memory) and then merges the sorted portions together at the end.
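
This is not coreutils' actual C implementation, but a small Python sketch of what an R-way merge looks like with an explicit min-heap holding the current head record of each run:

    import heapq

    def r_way_merge(run_files, out_file):
        # run_files: open file objects whose lines are each already sorted.
        # The heap holds one (line, run_index) entry per run.
        heap = []
        for i, f in enumerate(run_files):
            line = f.readline()
            if line:
                heapq.heappush(heap, (line, i))
        while heap:
            line, i = heapq.heappop(heap)     # smallest head among all runs
            out_file.write(line)
            nxt = run_files[i].readline()     # refill from the run just consumed
            if nxt:
                heapq.heappush(heap, (nxt, i))

heapq.merge in the Python standard library does essentially this for you.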

Sorting 1 billion integers with a small amount of physical memory

The simplest thing to do is break the input into smaller files that can fit in memory and sort each, and then merge the results.

Guido van Rossum has a good description of doing this in Python; while it's obviously not the same language, the principle is the same.
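
In that spirit (a sketch, not Guido's exact code), here is a Python version specialised to integers: the array module keeps each in-memory chunk compact and heapq.merge does the final merge. The 8-byte native-endian binary record format and the chunk size are assumptions made for the illustration.

    import heapq, os, struct, tempfile
    from array import array

    def sort_binary_ints(in_path, out_path, ints_per_chunk=1_000_000):
        # Phase 1: sort fixed-size chunks and spill them to temporary run files.
        runs = []
        with open(in_path, "rb") as src:
            while True:
                buf = src.read(ints_per_chunk * 8)
                if not buf:
                    break
                chunk = array("q", sorted(array("q", buf)))  # 8 bytes per int
                tmp = tempfile.NamedTemporaryFile(delete=False)
                chunk.tofile(tmp)
                tmp.close()
                runs.append(tmp.name)

        def read_run(path):
            # Stream a run back in small blocks instead of loading it whole.
            with open(path, "rb") as f:
                while True:
                    block = f.read(8 * 4096)
                    if not block:
                        break
                    yield from array("q", block)

        # Phase 2: k-way merge of all the sorted runs.
        with open(out_path, "wb") as dst:
            for value in heapq.merge(*(read_run(p) for p in runs)):
                dst.write(struct.pack("q", value))
        for p in runs:
            os.remove(p)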

Sorting 20GB of data

I think the way to go is merge sort: it's a great algorithm for sorting a large number of fixed-size records with limited memory.

General idea:

  • read N lines from the input file (a value that allows you to keep the lines in memory)

  • sort these lines and write the sorted lines to file 1

  • repeat with the next N lines to obtain file 2

    ...

  • you reach the end of the input file and you now have M files (each of which is sorted)

  • merge these files into a single file (you'll have to do this in steps as well; a sketch of that multi-pass merge follows this list)
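
For the "in steps" part, here is a hedged Python sketch that merges the M run files a few at a time; the fan-in of 16 is an arbitrary stand-in for however many files you can comfortably keep open and buffered at once.

    import heapq, os, tempfile

    def merge_in_rounds(run_paths, fan_in=16):
        # Merge at most fan_in sorted line files per round until one remains.
        runs = list(run_paths)
        while len(runs) > 1:
            next_round = []
            for start in range(0, len(runs), fan_in):
                group = runs[start:start + fan_in]
                files = [open(p) for p in group]
                tmp = tempfile.NamedTemporaryFile("w", delete=False)
                tmp.writelines(heapq.merge(*files))
                tmp.close()
                for f in files:
                    f.close()
                for p in group:
                    os.remove(p)
                next_round.append(tmp.name)
            runs = next_round
        return runs[0]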


You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLLs to your program folder (I'm not sure about Lazarus/OSX).

Efficient Out-Of-Core Sorting

The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of runs you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you want to put it into a heap. Then as you write each record out, you read IN another record.

You then check whether that record would sort before or after the record you just wrote out. If it would sort after it, you insert it into your heap and continue. If it would sort before, you insert it into a second heap.

You stop adding records to the current run when the first heap is completely empty, and your second heap is taking up all your memory. At that point, you repeat the process, writing a new run to a new file.

This will usually produce considerably longer intermediate runs in the initial phase, so merging them is substantially less work. Assuming the input records are in random order, you can expect this to approximately double the length of each run--but if the input is even partially sorted, this can take advantage of that existing ordering to extend the run lengths even more.
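
A hedged Python sketch of that two-heap run-generation scheme (replacement selection): the records iterator, the memory_slots stand-in for "what fits in RAM", and the emit_run callback are all illustrative assumptions, not anything from Knuth's presentation.

    import heapq, itertools

    def replacement_selection(records, memory_slots, emit_run):
        it = iter(records)
        current = list(itertools.islice(it, memory_slots))  # fill memory once
        heapq.heapify(current)
        pending = []      # records that cannot join the run in progress
        run = []
        while current:
            smallest = heapq.heappop(current)
            run.append(smallest)          # "write out" the next record of the run
            nxt = next(it, None)          # read in another record to replace it
            if nxt is not None:
                if nxt >= smallest:
                    heapq.heappush(current, nxt)   # can still extend this run
                else:
                    heapq.heappush(pending, nxt)   # must wait for the next run
            if not current:               # first heap drained: the run is finished
                emit_run(run)
                run = []
                current, pending = pending, []
        if run:
            emit_run(run)

With random input this tends to produce runs about twice the size of memory, matching the doubling mentioned above.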

As an aside, I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits the first publication of the method to H. Seward, in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I don't have a copy of the third edition, so I don't have a page number for that.


