How to efficiently get 10% of random lines out of a large file in Linux?
I think this is the best way:
file=/path/to/your/file   # placeholder path
lines_in_file=$(wc -l < "$file")
lines_wanted=$(( lines_in_file / 10 ))
shuf -n "$lines_wanted" "$file"
Another creative solution:
echo $RANDOM
generates a random number between 0 and 32767
Then, you can do:
echo $(( RANDOM * 100000 / 32768 + 1 ))
.. to obtain a random number between 1 and 100000 (as nwellnhof points out in the comments below, it's not any number from 1 to 100000, but one of at most 32768 possible values spread over that range, so it's more of a projection; dividing by 32768 instead of 32767 also keeps the result from landing one past the top of the range).
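If the 32768-value grid is a problem, one common workaround (a sketch, not part of the original answer) is to combine two $RANDOM draws into a 30-bit number before scaling, at the cost of a tiny modulo bias:
big_random=$(( (RANDOM << 15) | RANDOM ))   # 0 .. 2^30 - 1
echo $(( big_random % 100000 + 1 ))         # now every value 1..100000 is reachable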
So:
file=/path/to/your/file   # placeholder path
lines_in_file=$(wc -l < "$file")
lines_wanted=$(( lines_in_file / 10 ))
for i in $(seq 1 "$lines_wanted")
do
    line_chosen=$(( RANDOM * lines_in_file / 32768 + 1 ))
    sed "${line_chosen}q;d" "$file"
done
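Note that this loop samples with replacement (the same line can be printed twice) and re-reads the file for every pick. If an approximately 10% sample is acceptable, a one-pass awk sketch (my addition, not part of the original answer) avoids both problems:
awk 'BEGIN { srand() } rand() < 0.1' "$file"
Each line is kept independently with probability 0.1, so the sample size is only close to 10% rather than exact.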
How to get unique lines from a very large file in Linux?
Use sort -u instead of sort | uniq.
This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
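For example (a minimal sketch; big.txt is a placeholder name):
sort -u big.txt > unique.txt        # duplicates are dropped during the sort itself
sort big.txt | uniq > unique.txt    # same output, but an extra process and pipe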
Select n random lines from a text file, cut them from the original and paste them into a new file
Create a file with 1 million lines:
perl -e 'for (1..1000000) { print "line $_ - and some data_$_\n" }' > large_file
Here is a Perl script to sample the large file:
sample_size.pl
#!/usr/bin/env perl
use warnings;
use strict;

my ($filename, $n) = @ARGV;

$filename
    or die "usage: $0 filename sample_size";

-f $filename
    or die "Invalid filename '$filename'";

chomp(my ($word_count_lines) = `/usr/bin/wc -l $filename`);
my ($lines, undef) = split /\s+/, $word_count_lines;

die "Need to pass in sample size"
    unless $n;

my $sample_size = int $n;
die "Invalid sample size '$n', should be in the range [ 0 - $lines ]"
    unless (0 < $sample_size and $sample_size < $lines);

# Pick some distinct random line numbers
my %sample;
while ( keys %sample < $sample_size ) {
    $sample{ 1 + int rand $lines }++;
}

open my $fh, '<', $filename
    or die "Unable to open '$filename' for reading : $!";
open my $fh_sample, '>', "$filename.sample"
    or die "Unable to open '$filename.sample' for writing : $!";
open my $fh_remainder, '>', "$filename.remainder"
    or die "Unable to open '$filename.remainder' for writing : $!";

my $current_fh;
while (<$fh>) {
    my $line_number = $.;
    $current_fh = $sample{ $line_number } ? $fh_sample : $fh_remainder;
    # Write each line to the correct output file
    print $current_fh $_;
}

close $fh
    or die "Unable to finish reading '$filename' : $!";
close $fh_sample
    or die "Unable to finish writing '$filename.sample' : $!";
close $fh_remainder
    or die "Unable to finish writing '$filename.remainder' : $!";

print "Original file '$filename' has $lines rows\n";
print "Created '$filename.sample' with $sample_size rows\n";
print "Created '$filename.remainder' with " . ($lines - $sample_size) . " rows\n";
print "Run 'mv $filename.remainder $filename' if you are happy with this result\n";
Run the script
$ perl ./sample_size.pl large_file 10
Output
Original file 'large_file' has 1000000 rows
Created 'large_file.sample' with 10 rows
Created 'large_file.remainder' with 999990 rows
Run 'mv large_file.remainder large_file' if you are happy with this result
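For comparison, the same split can be sketched in shell with shuf and awk (GNU coreutils assumed; file names as above):
n=10
shuf -i 1-"$(wc -l < large_file)" -n "$n" > picked
awk 'NR==FNR { pick[$1]; next }
     FNR in pick { print > "large_file.sample"; next }
     { print > "large_file.remainder" }' picked large_file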
Get a list of lines from a file
Why don't you directly use shuf to get random lines:
shuf -n NUMBER_OF_LINES file
Example
$ seq 100 > a   # the file "a" contains the numbers 1 to 100, one per line
$ shuf -n 4 a
54
46
30
53
$ shuf -n 4 a
50
37
63
21
Update
Can I somehow store the number of lines shuf chose? – Pio
As I did in How to efficiently get 10% of random lines out of a large file in Linux?, you can do something like this:
shuf -i 1-1000 -n 5 > rand_numbers # store the list of chosen line numbers
awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' rand_numbers a # print those lines
(FNR==NR is true only while awk reads the first file, so the chosen numbers are loaded into the array a; lines of the second file are then printed when their line number is in that array.)
Extract lines containing one of large number of strings from file
grep -F -f IDS DATA
Don't miss -F: it prevents grep from interpreting the patterns in IDS as regular expressions, and enables the much more efficient Aho-Corasick matching algorithm.
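A tiny illustration (file contents invented for the example):
$ printf 'alpha\nbeta\n' > IDS
$ printf 'alpha 1\ngamma 2\nbeta 3\n' > DATA
$ grep -F -f IDS DATA
alpha 1
beta 3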
Read random lines from huge CSV file
import random

filesize = 1500                 # size of the really big file, in bytes
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  # go to a random byte position
f.readline()                    # discard - bound to be a partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead
As @AndreBoos pointed out, this approach leads to biased selection: a line is chosen with probability proportional to the length of the line preceding it. If you know the minimum and maximum line lengths, you can remove this bias as follows.
Let's assume (in this case) we have min=3 and max=15.
Find the length (Lp) of the previous line. If Lp = 3, the line is the most biased against, so we should keep it 100% of the time. If Lp = 15, the line is the most biased towards: it is 5x more likely to be selected, so we should keep it only 20% of the time.
We accomplish this by randomly keeping the line X% of the time, where X = min / Lp. If we don't keep the line, we make another random pick until our dice roll comes good. :-)
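Here is a minimal Python sketch of that rejection step (my illustration, not the answer's code: the file is opened in binary mode, min_len is the assumed known minimum line length in bytes including the newline, and the EOF wrap-around is handled the same simplified way as above):

import random

def random_line_unbiased(path, filesize, min_len):
    with open(path, 'rb') as f:
        while True:
            offset = random.randrange(filesize)
            f.seek(offset)
            f.readline()                    # discard - bound to be a partial line
            end_prev = f.tell()             # end of the line the offset landed in
            candidate = f.readline()        # the line this offset "selects"
            if not candidate:               # hit the end: wrap to the first line
                f.seek(0)
                candidate = f.readline()
                end_prev = filesize
            # scan back from the offset to find where the previous line starts
            start_prev = offset
            while start_prev > 0:
                f.seek(start_prev - 1)
                if f.read(1) == b'\n':
                    break
                start_prev -= 1
            lp = end_prev - start_prev      # Lp: full length of the previous line
            # keep the candidate min_len/Lp of the time; otherwise roll again
            if random.random() < min_len / lp:
                return candidate.decode()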
How can I select random files from a directory in bash?
Here's a script that uses GNU sort's random option:
ls | sort -R | tail -n "$N" | while read -r file; do
    # Something involving $file, or you can leave
    # off the while to just get the filenames
done
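If GNU shuf is available, a shorter sketch (mine, with the same caveat as the ls version: it breaks on filenames containing newlines):
shuf -e -n "$N" -- * | while IFS= read -r file; do
    printf '%s\n' "$file"   # something involving $file
done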