How to Efficiently Get 10% of Random Lines Out of The Large File in Linux

How to efficiently get 10% of random lines out of the large file in Linux?

I think this is the best way:

file=your_file_here
lines_in_file=$(wc -l < "$file")
lines_wanted=$(( lines_in_file / 10 ))

shuf -n "$lines_wanted" "$file"

Another creative solution:

echo $RANDOM generates a random number between 0 and 32767

Then, you can do:

echo $(($RANDOM*100000/32767+1))

... to obtain a random number between 1 and 100000 (as nwellnhof points out in the comments, it's not any number from 1 to 100000, but one of 32768 possible numbers between 1 and 100000, so it's kind of a projection...)

So:

file=your_file_here
lines_in_file=$(wc -l "$file" | awk '{print $1}')
lines_wanted=$(( lines_in_file / 10 ))
for i in $(seq 1 "$lines_wanted"); do
    line_chosen=$(( RANDOM * lines_in_file / 32767 + 1 ))
    sed "${line_chosen}q;d" "$file"
done
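
One caveat with this loop: as the projection note above says, $RANDOM only takes 32768 distinct values, so in a file with more than 32767 lines some line numbers can never be chosen at all. A rough sketch of one workaround is to combine two $RANDOM draws for the line_chosen assignment (the modulo step introduces a small bias of its own, but nothing like skipping lines outright):

line_chosen=$(( (RANDOM * 32768 + RANDOM) % lines_in_file + 1 ))   # two draws cover 0..1073741823

Drop that in place of the line_chosen assignment inside the loop above.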

How to get unique lines from a very large file in Linux?

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
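
A quick illustration with a throwaway file (the file name here is just an example):

$ printf 'banana\napple\nbanana\ncherry\napple\n' > fruits
$ sort -u fruits
apple
banana
cherry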

Select n random lines from a text file, cut them from the original and paste them into a new file

Create a file with 1 million lines:

perl -e 'for (1..1000000) { print "line $_ - and some data_$_\n" }' > large_file

Here is a Perl script to sample the large file:

sample_size.pl

#!/usr/bin/env perl

use warnings;
use strict;

my ($filename, $n) = @ARGV;
$filename
    or die "usage: $0 filename sample_size";

-f $filename
    or die "Invalid filename '$filename'";
chomp(my ($word_count_lines) = `/usr/bin/wc -l $filename`);
my ($lines, undef) = split /\s+/, $word_count_lines;

die "Need to pass in sample size"
    unless $n;
my $sample_size = int $n;

die "Invalid sample size '$n', should be in the range [ 0 - $lines ]"
    unless (0 < $sample_size and $sample_size < $lines);

# Pick some random line numbers
my %sample;
while ( keys %sample < $sample_size ) {
    $sample{ 1 + int rand $lines }++;
}

open my $fh, '<', $filename
    or die "Unable to open '$filename' for reading : $!";

open my $fh_sample, '>', "$filename.sample"
    or die "Unable to open '$filename.sample' for writing : $!";
open my $fh_remainder, '>', "$filename.remainder"
    or die "Unable to open '$filename.remainder' for writing : $!";

my $current_fh;
while (<$fh>) {
    my $line_number = $.;
    # Sampled line numbers go to the .sample file, everything else to .remainder
    $current_fh = $sample{ $line_number } ? $fh_sample : $fh_remainder;
    print $current_fh $_;
}
close $fh
    or die "Unable to finish reading '$filename' : $!";
close $fh_sample
    or die "Unable to finish writing '$filename.sample' : $!";
close $fh_remainder
    or die "Unable to finish writing '$filename.remainder' : $!";

print "Original file '$filename' has $lines rows\n";
print "Created '$filename.sample' with $sample_size rows\n";
print "Created '$filename.remainder' with " . ($lines - $sample_size) . " rows\n";
print "Run 'mv $filename.remainder $filename' if you are happy with this result\n";

Run the script

$ perl ./sample_size.pl large_file 10

Output

Original file 'large_file' has 1000000 rows
Created 'large_file.sample' with 10 rows
Created 'large_file.remainder' with 999990 rows
Run 'mv large_file.remainder large_file' if you are happy with this result
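
As a quick sanity check, the two output files add back up to the original (file names taken from the run above):

$ wc -l large_file.sample large_file.remainder
     10 large_file.sample
 999990 large_file.remainder
1000000 total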

Get a list of lines from a file

Why don't you use shuf directly to get random lines:

shuf -n NUMBER_OF_LINES file

Example

$ seq 100 > a   # the file "a" contains the numbers 1 to 100, one per line

$ shuf -n 4 a
54
46
30
53

$ shuf -n 4 a
50
37
63
21

Update

Can I somehow store the number of lines shuf chose? – Pio

As I did in How to efficiently get 10% of random lines out of the large file in Linux?, you can do something like this:

shuf -i 1-1000 -n 5 > rand_numbers # store the list of line numbers
awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' rand_numbers a # print those lines
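
For example, with the seq-generated file a from above (the three chosen numbers will differ on every run):

$ shuf -i 1-100 -n 3 > rand_numbers
$ awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' rand_numbers a

Note that the matching lines come out in file order, not in the order they happen to appear in rand_numbers.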

Extract lines containing one of a large number of strings from a file

grep -F -f IDS DATA

Don't miss -F: it keeps grep from interpreting the patterns in IDS as regular expressions, and it enables the much more efficient Aho-Corasick matching algorithm.
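
A small made-up example of what IDS and DATA might look like (names and contents are just for illustration):

$ printf 'id42\nid99\n' > IDS
$ printf 'id17 foo\nid42 bar\nid99 baz\n' > DATA
$ grep -F -f IDS DATA
id42 bar
id99 baz

If the strings should only match as whole words, add -w as well.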

Read random lines from huge CSV file

import random

filesize = 1500  # size of the really big file, in bytes
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  # go to a random position
f.readline()                    # discard - bound to be a partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead
As @AndreBoos pointed out, this approach leads to biased selection. If you know the minimum and maximum line lengths, you can remove the bias as follows:

Let's assume (in this case) we have min = 3 and max = 15.

1) Find the length (Lp) of the previous line.

If Lp = 3, the line is the most biased against, so we should keep it 100% of the time.
If Lp = 15, the line is the most biased towards, so we should keep it only 20% of the time, since it is 5x more likely to be selected.

We accomplish this by randomly keeping the line X% of the time, where:

X = min / Lp

If we don't keep the line, we make another random pick until the dice roll comes good. :-)
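
Here is a minimal sketch of that rejection step, building on the snippet above. The MIN_LEN/MAX_LEN values, the file name, and the backward scan used to recover the previous line's length are assumptions made for illustration, not part of the original answer:

import os
import random

MIN_LEN, MAX_LEN = 3, 15      # assumed known line-length bounds, newline included
FILENAME = 'really_big_file'  # same hypothetical file as above

def random_line_unbiased():
    filesize = os.path.getsize(FILENAME)
    with open(FILENAME, 'rb') as f:
        while True:
            offset = random.randrange(filesize)
            f.seek(offset)
            f.readline()               # discard - bound to be a partial line
            line_start = f.tell()      # the candidate line starts here
            candidate = f.readline()
            if not candidate:          # fell off the end of the file: just retry
                continue
            # Lp = length of the previous line (the one the random offset landed in).
            # It is at most MAX_LEN bytes, so scanning back MAX_LEN + 1 bytes from
            # line_start is guaranteed to reach the newline that precedes it.
            back = min(line_start, MAX_LEN + 1)
            f.seek(line_start - back)
            chunk = f.read(back)       # ends with the previous line and its '\n'
            lp = back - chunk.rfind(b'\n', 0, back - 1) - 1
            # keep the candidate with probability MIN_LEN / Lp, otherwise roll again
            if random.random() < MIN_LEN / lp:
                return candidate

Since the acceptance probability is never below min/max, each returned line costs roughly max/min seek attempts on average.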

How can I select random files from a directory in bash?

Here's a script that uses GNU sort's random option:

ls | sort -R | tail -n "$N" | while read -r file; do
    # Something involving "$file", or you can leave
    # off the while to just get the filenames
done
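
If your shuf has the --zero-terminated option (GNU coreutils), here is a sketch of an alternative that avoids parsing ls output and copes with unusual file names; the find arguments and the value of N are just examples:

N=10   # how many files to pick
find . -maxdepth 1 -type f -print0 | shuf -z -n "$N" | while IFS= read -r -d '' file; do
    echo "picked: $file"   # something involving "$file"
done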

