Command Line Utility to Print Statistics of Numbers in Linux

This is a breeze with R. For a file that looks like this:

1
2
3
4
5
6
7
8
9
10

Use this:

R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"

To get this:

       V1
 Min.   : 1.00
 1st Qu.: 3.25
 Median : 5.50
 Mean   : 5.50
 3rd Qu.: 7.75
 Max.   :10.00
[1] 3.02765
  • The -q flag squelches R's startup licensing and help output
  • The -e flag tells R you'll be passing an expression from the terminal
  • x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
  • Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
  • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
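Putting that last point together on the command line, a minimal sketch (same nums.txt as above) that reports the standard deviation of every column would be:

R -q -e "x <- read.csv('nums.txt', header = F); apply(x, MARGIN = 2, FUN = sd)"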

What's the quickest way to get the mean of a set of numbers from the command line?

Awk

awk '{total += $1; count++ } END {print total/count}'
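
For example, piping a short sequence through it (assuming seq from GNU coreutils is available) prints the mean of 1 through 10:

$ seq 1 10 | awk '{total += $1; count++ } END {print total/count}'
5.5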

How to count lines in a document?

Use wc:

wc -l <filename>

This will output the number of lines in <filename>:

$ wc -l /dir/file.txt
3272485 /dir/file.txt

Or, to omit the <filename> from the result, use wc -l < <filename>:

$ wc -l < /dir/file.txt
3272485

You can also pipe data to wc:

$ cat /dir/file.txt | wc -l
3272485
$ curl yahoo.com --silent | wc -l
63
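
wc also accepts several file names at once and prints a per-file count followed by a final total line (the names below are just placeholders):

$ wc -l file1.txt file2.txt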

summarise NA data from datafile on command line in linux / bash

The fundamental problem here seems to be that you haven't noticed the difference between compressed data and plain text. Most Unix tools only work on the latter.

But first: Your attempts will look for "na" as a substring of any value in any column. So "banana" will match because ... well, I imagine you can see why.

Without seeing your actual data, it's hard to know what exactly you need, but for looking for entries which are exactly "NA" case-insensitively in column 3 of a tab-delimited file, try

awk -F '\t' 'tolower($3) == "na"' file

The tolower() call converts to lower case so that you can then compare to a single string; this will cover all case variations of "NA", "Na", "nA", or "na" but only look for an exact match after case conversion, instead of looking for a substring. (This assumes you don't have e.g. whitespace around the values in the column, too; it really is an exact comparison.)

To get the count, pipe to wc -l, or refactor the Awk script to do the counting itself:

awk -F '\t' 'tolower($3) == "na" { sum += 1 } END { print sum }' file

(If this isn't obvious, probably read a 15-minute introduction to Awk.)
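
Spelled out, the wc -l pipeline mentioned above would be:

awk -F '\t' 'tolower($3) == "na"' file | wc -l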

Perhaps more usefully, include the file name and line number:

awk -F '\t' 'tolower($3) == "na" { print FILENAME ":" FNR ":" $0 }' file

(You can pass multiple file names, and then the FILENAME makes more sense.)
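
With several (hypothetical) data files, that might look like:

awk -F '\t' 'tolower($3) == "na" { print FILENAME ":" FNR ":" $0 }' file1 file2 file3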

To examine a different column, replace $3 with e.g. $42 to examine column 42. To use a different delimiter, put the delimiter instead of '\t' after -F. This still doesn't correctly deal with e.g. quoting in CSV files (maybe switch to a language which knows how to parse CSV in all its variations correctly, like Python, or use a dedicated CSV tool -- I hear there's one called csvtool).
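
For instance, a sketch that checks column 42 of a comma-separated file (still without proper handling of quoted CSV fields):

awk -F ',' 'tolower($42) == "na"' file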

If your file is compressed, you can't grep (or Awk or sed or etc) the compressed data directly; you have to uncompress it first:

gzcat datafile.gz |
awk -F '\t' -v gzfile="datafile.gz" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'

Notice that you don't put a file name after the Awk script when the input comes from a pipe (and this is not specific to Awk; it's the norm for any Unix text-processing tool which can read either a file or standard input).

The above also shows how to pass a string into Awk as a variable (-v variable="value"). This might be more useful if you have multiple files you want to loop over:

for datafile in datafile.gz datafile0.gz datafile1.gz; do
    gzcat "$datafile" |
    awk -F '\t' -v gzfile="$datafile" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
done

Slightly bewilderingly, the $datafile variable is a shell variable (Bash or Zsh or what have you) which is entirely distinct from Awk's internal variables. We use the -v trick from above to make it available to Awk in the variable gzfile.

You could use grep (which reads plain text) or gzgrep (which can read gzip-compressed data directly) as well, but then you want to pass in a regular expression which targets the specific column. Just to show how it's done, here is a regex which says "something with no tabs, followed by a tab, followed by something with no tabs, followed by a tab, followed by na, followed by a tab" which (once you wrap your head around it) targets the third column.

gzgrep -Ei -c $'^[^\t]*\t[^\t]*\tna\t' filename.gz

The $'...' notation is Bash-specific and allows us to use the symbol \t instead of a literal tab in the regular expression. The -c option to grep asks it to report how many matching lines it found, so you don't even need wc -l here. The -E option selects a slightly less arcane and more modern regular expression dialect than the default from the early 1970s. (It's still not properly modern by any standard; the "extended" dialect is from the mid-1970s, with some later brushing up by the POSIX standardization in the 1990s. Newer tools support a plethora of extensions which are not standard in grep.)

To look for na as a whole field (bracketed by tabs, or by the start or end of the line) in any column, try

gzgrep -Eic -e $'^na\t|\tna\t|\tna$' filename.gz

where | in the regular expression stands for "or", and ^na\t is the regular expression for na bracketed by beginning of line on one side and a tab on the other, and \tna$ is the same for end of line.

If you don't use Bash, you can remove the $ before ' and type a literal tab instead, usually with Ctrl+V Tab.

Files whose name ends with .gz are compressed with gzip; there are other compression tools which conventionally use file extensions like .Z (plain old compress, from the dark ages), .bz2 (bzip2), .xz (xz) etc; most of these ship with a bzcat or similar tool which performs the same job as gzcat for gzip files, and some will also have something like bzgrep to parallel gzgrep. (To add to the alphabet soup, gzcat could also mean GNU zcat on some systems! Then it handles .Z files, not .gz files).

Unix-based systems generally don't rely on the file extension to identify the type of a file, but the human convention is to add an extension when you compress something, and many of the tools will do that automatically (so for example gzip file creates file.gz).

A quick examination of a compressed file, perhaps with a hex dump tool, should reveal why grep doesn't work directly on it. The compressed data is a tightly packed, opaque binary format which simply does not contain the uncompressed data in any easily discoverable form (unless you learn a lot about compression!).
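
One such quick look, assuming the xxd hex dump tool is installed:

xxd filename.gz | head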

Shell command to sum integers, one per line?

Bit of awk should do it?

awk '{s+=$1} END {print s}' mydatafile

Note: some versions of awk behave oddly once the sum exceeds 2^31 (2147483647). One suggestion is to use printf rather than print:

awk '{s+=$1} END {printf "%.0f", s}' mydatafile
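
As a quick sanity check (assuming seq is available), summing 1 through 100 should print 5050:

$ seq 1 100 | awk '{s+=$1} END {print s}'
5050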

Scripts for computing the average of a list of numbers in a data file

Here is one method for space-separated input; setting RS=" " makes each number its own record, so NR counts the numbers:

$ awk '{s+=$1}END{print "ave:",s/NR}' RS=" " file
ave: 54.646
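
If the numbers sit one per line instead of space-separated, the same script works without overriding the record separator:

$ awk '{s+=$1} END {print "ave:", s/NR}' file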

linux: how can I perform awk-like statistics on text input?

Why don't you use a database?

First, add column names to your file:

sed -i '1i col0 col1 col2 col3 col4' myfile

Then, create a database and output some stats:

sqlite3 myfile.sqlite <<END
.separator " "
.import myfile mytable
select max(col1), avg(col1) from mytable;
END

Outputs

1.00567 0.248412
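
Since the .import above leaves the data in myfile.sqlite, any other SQL aggregate can be run against it afterwards, for example:

sqlite3 myfile.sqlite 'select min(col1), max(col1), avg(col1), count(*) from mytable;'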

How can I quickly sum all numbers in a file?

For a Perl one-liner, it's basically the same thing as the awk sum shown earlier (originally Ayman Hourieh's answer):

 % perl -nle '$sum += $_ } END { print $sum'
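
As with the awk version, it reads numbers from standard input or from file names given as arguments; for example (assuming seq again):

 % seq 1 10 | perl -nle '$sum += $_ } END { print $sum'
 55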

If you're curious what Perl one-liners do, you can deparse them:

 %  perl -MO=Deparse -nle '$sum += $_ } END { print $sum'

The result is a more verbose version of the program, in a form that no one would ever write on their own:

BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    $sum += $_;
}
sub END {
    print $sum;
}
-e syntax OK

Just for giggles, I tried this with a file containing 1,000,000 numbers (in the range 0 - 9,999). On my Mac Pro, it returns virtually instantaneously. That's too bad, because I was hoping the mmap version would be really fast, but it takes just about the same time:

use 5.010;
use File::Map qw(map_file);

# Memory-map the file named on the command line into $map.
map_file my $map, $ARGV[0];

# Add up every run of digits found anywhere in the mapped data.
$sum += $1 while $map =~ m/(\d+)/g;

say $sum;

How to get average, median, mean stats from a file which has numbers in first column?

In case you're not bound to any specific tool, try GNU datamash - a nice tool for "command-line statistical operations" on textual files.

To get mean, median, percentile 95 and percentile 99 values for first column/field (note, fields are TAB-separated by default):

$ datamash --header-out mean 1 median 1 perc:95 1 perc:99 1  < file
mean(field-1) median(field-1) perc:95(field-1) perc:99(field-1)
0.016128538461538 0.012794 0.0346484 0.04258088
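
datamash offers many more operations (min, max, sample standard deviation, and so on); for instance, against the same file:

$ datamash min 1 max 1 sstdev 1 < file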

