Randomly Pick Lines from a File Without Slurping It with Unix

Randomly Pick Lines From a File Without Slurping It With Unix

if you have that many lines, are you sure you want exactly 1% or a statistical estimate would be enough?

In that second case, just randomize at 1% at each line...

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

If you'd like the header line plus a random sample of lines after, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'

How can I randomly sample the contents of a file?

Assuming you basically want to output about 2.5% of all lines, this would do:

print if 0.025 > rand while <$input>;

How can I get exactly n random lines from a file with Perl?

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file.

Assume M <= N.

Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
Read in the next line l. So far, we have read n = M + 1 total lines. The probability that we want to choose l as one of our final lines is therefore M/n.
Accept l with probability M/n; use a RNG to decide whether to accept or reject l.
If l has been accepted, randomly choose one of the lines in S and replace it with l.
Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
Return the set S of chosen lines.

Python: Choose random line from file, then delete that line

I have a file, 'TestingDeleteLines.txt', that's about 300 lines of text. Right now, I'm trying to get it to print me 10 random lines from that file, then delete those lines.

#!/usr/bin/env python
import random

k = 10
filename = 'TestingDeleteLines.txt'
with open(filename) as file:
    lines = file.read().splitlines()

if len(lines) > k:
    random_lines = random.sample(lines, k)
    print("\n".join(random_lines)) # print random lines

    with open(filename, 'w') as output_file:
        output_file.writelines(line + "\n"
                               for line in lines if line not in random_lines)
elif lines: # file is too small
    print("\n".join(lines)) # print all lines
    with open(filename, 'wb', 0): # empty the file
        pass

It is O(n**2) algorithm that can be improved if necessary (you don't need it for a tiny file such as your input)

Randomly pick a region and process it, a number of times

I've got to assume some things here, and leave some restrictions

Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there need be N regions spread out through the file
File can be a Gigabyte in size, so can't read it whole (would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after code

First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.

use warnings;
use strict;
use feature 'say';

use Getopt::Long;
use List::MoreUtils qw(uniq);

my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);

GetOptions(
    'num-regions|n=i' => \$num_regions, 
    'region-len|l=i'  => \$region_len, 
    'char|c=s'        => \$count_freq_for,
) or usage();

my $file = shift || usage();

# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file  or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b } 
    map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";

my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0); 

my $region;

while (my $line = <$fh>) { 
    chomp $line;
    # Total number of characters so far, up to this line and with this line
    $nchars_prev = $nchars;
    $nchars += length $line;
    next if $line =~ /^\s*>sp/;

    # Complete the region if there wasn't enough chars on the previous line 
    if ($chars_left > 0) {
        $region .= substr $line, 0, $chars_left;
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1; 
    };  

    # Random positions that happen to be on this line    
    my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
    # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;

    for (@pos) { 
        my $pos_in_line = $_ - $nchars_prev;
        $region = substr $line, $pos_in_line, $region_len; 

        # Don't print if there aren't enough chars left on this line
        last if ( $chars_left = 
            ($region_len - (length($line) - $pos_in_line)) ) > 0;

        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
    }   
}

sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0  if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len+$data_len);
}

sub usage {
    say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
    exit;
}

Uncomment the srand line so to have the same run always, for testing.
Notes follow.

Some corner cases

If the 10-long window doesn't fit on the line from its random position it is completed in the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125 then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).
When an incomplete region is filled up in the next line (the first if in the while loop), it is possible that that line is shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out.
Random numbers are pruned of dupes. This skews the sequence, but minutely what should not matter here; and we may stay with fewer numbers than asked for, but only by very little

Handling of issues regarding randomness

"Randomness" here is pretty basic, what seems suitable. We also need to consider the following.

Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But lines >sp are skipped and any of our numbers that may fall within those lines won't be used, and so we may end up with fewer regions than the drawn numbers. Those lines are shorter, thus with a lesser chance of having numbers on them and so not many numbers are lost, but in some runs I saw even 3 out of 10 numbers skipped, ending up with a random sample 70% size of desired.

If this is a bother, there are ways to approach it. To not skew the distribution even further they all should involve pre-processing the file.

The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but which should still produce the number of regions close to desired for large enough files.

More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines and then re-sample to account for that. This may still mess with the distribution, what arguably isn't an issue here, but more to the point may simply be unneeded.

In all this you read the big file twice. The extra processing time should only be in the seconds but if this is unacceptable change the function fraction_skipped to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.

Note on a particular test case

With srand(10) (commented-out line near the beginning) we get the random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code to complete the region on the next line.

Here is a simple driver to run the above a given number of times, for statistics.

Doing it using builtin tools (system, qx) is altogether harder and libraries (modules) help. I use IPC::Run here. There are quite a few other options.^†

Adjust and add code to process as needed for statistics; output is in files.

use warnings;
use strict;
use feature 'say';

use Getopt::Long;
use IPC::Run qw(run);

my $outdir = 'rr_output';         # pick a directory name
mkdir $outdir if not -d $outdir;    
my $prog  = 'random_regions.pl';  # your name for the program
my $input = 'data_file.txt';      # your name for input file     
my $ch = 'F';

my ($runs, $regions, $len) = (10, 10, 10);    
GetOptions(
    'runs|n=i'  => \$runs, 
    'regions=i' => \$regions, 
    'length=i'  => \$len, 
    'char=s'    => \$ch, 
    'input=s'   => \$input
) or usage();

my @cmd = ( $prog, $input, 
    '--num-regions', $regions, 
    '--region-len', $len, 
    '--char', $ch
);    
say "Run: @cmd, $runs times.";

for my $n (1..$runs) {
    my $outfile = "$outdir/regions_r$n.txt";
    say "Run #$n, output in: $outdir/$outfile";
    run \@cmd, '>', $outfile  or die "Error with @cmd: $!";
}    

sub usage {
    say STDERR "Usage: $0 [options]", "\n\toptions: ...";
    exit;
}

Please expand on the error checking. See for instance this post and links on details.

Simplest use: driver_random.pl -n 4, but you can give all of main program's parameters.

The called program (random_regions.pl above) must be executable.

^† Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.

Importing and extracting a random sample from a large .CSV in R

I think that there is not a good R tool to read a file in a random way (maybe it can be an extension read.table or fread(data.table package)) .

Using perl you can easily do this task. For example , to read 1% of your file in a random way, you can do this :

xx= system(paste("perl -ne 'print if (rand() < .01)'",big_file),intern=TRUE)

Here I am calling it from R using system. xx contain now only 1% of your file.

You can wrap all this in a function:

read_partial_rand <- 
  function(big_file,percent){
    cmd <- paste0("perl -ne 'print if (rand() < ",percent,")'")
    cmd <- paste(cmd,big_file)
    system(cmd,intern=TRUE)
  }

how to process big file with comparison of each line in that file with remaining all lines in same file?

It is taking a long time because it looks like you are reading the file a huge amount of times.

You first read the file into the lines List, then for every entry you read it again, then inside that you read it again!. Instead of doing this, read the file once into your lines array and then use that to compare the entries against each other.

Something like this might work for you:

List<String> lines = new ArrayList<>();
BufferedReader firstbufferedReader = new BufferedReader(new FileReader(newFile(pathname)));
while ((line = firstbufferedReader.readLine()) != null) {
    lines.add(line);
}
firstbufferedReader.close();
for (int i = 0; i < lines.size(); i++) 
{
    for (int j = i + 1; j < lines.size(); j++) 
    {
        application.linesToCompare(lines.get(i),lines.get(j));
    }
}

Randomly Pick Lines from a File Without Slurping It with Unix