command line utility to print statistics of numbers in linux
This is a breeze with R. For a file that looks like this:
1
2
3
4
5
6
7
8
9
10
Use this:
R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"
To get this:
V1
Min. : 1.00
1st Qu.: 3.25
Median : 5.50
Mean : 5.50
3rd Qu.: 7.75
Max. :10.00
[1] 3.02765
- The -q flag squelches R's startup licensing and help output.
- The -e flag tells R you'll be passing an expression from the terminal.
- x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
- Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
- But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
What's the quickest way to get the mean of a set of numbers from the command line?
Awk
awk '{total += $1; count++ } END {print total/count}'
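As a quick check, the one-liner can be fed any stream of numbers; seq here is just a convenient generator:

```shell
# Mean of 1..10: total ends at 55, count at 10
seq 1 10 | awk '{ total += $1; count++ } END { print total/count }'
# prints 5.5
```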
How to count lines in a document?
Use wc
:
wc -l <filename>
This will output the number of lines in <filename>
:
$ wc -l /dir/file.txt
3272485 /dir/file.txt
Or, to omit the <filename>
from the result use wc -l < <filename>
:
$ wc -l < /dir/file.txt
3272485
You can also pipe data to wc
:
$ cat /dir/file.txt | wc -l
3272485
$ curl yahoo.com --silent | wc -l
63
summarise NA data from datafile on command line in linux / bash
The fundamental problem here seems to be that you haven't noticed the difference between compressed data and plain text. Most Unix tools only work on the latter.
But first: Your attempts will look for "na" as a substring of any value in any column. So "banana" will match because ... well, I imagine you can see why.
Without seeing your actual data, it's hard to know what exactly you need, but for looking for entries which are exactly "NA" case-insensitively in column 3 of a tab-delimited file, try
awk -F '\t' 'tolower($3) == "na"' file
The tolower() call converts to lower case so that you can then compare to a single string; this will cover all case variations of "NA", "Na", "nA", or "na", but only look for an exact match after case conversion, instead of looking for a substring. (This assumes you don't have e.g. whitespace around the values in the column, too; it really is an exact comparison.)
To get the count, pipe to wc -l
(or refactor the Awk script:
awk -F '\t' 'tolower($3) == "na" { sum += 1 } END { print sum }' file
Probably read a 15-minute introduction to Awk if this isn't obvious.)
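To see it in action, here is a throwaway three-column tab-separated sample (the file name sample.tsv is made up for illustration); note that the "banana" row is not counted:

```shell
# Build a small tab-separated demo file; only exact "NA"/"na" in column 3 count
printf 'a\tb\tNA\nc\td\tna\ne\tf\tbanana\n' > sample.tsv
awk -F '\t' 'tolower($3) == "na" { sum += 1 } END { print sum }' sample.tsv
# prints 2
```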
Perhaps more usefully, include the file name and line number:
awk -F '\t' 'tolower($3) == "na" { print FILENAME ":" FNR ":" $0 }' file
(You can pass multiple file names, and then the FILENAME
makes more sense.)
To examine a different column, replace $3
with e.g. $42
to examine column 42. To use a different delimiter, put the delimiter instead of '\t'
after -F
. This still doesn't correctly deal with e.g. quoting in CSV files (maybe switch to a language which knows how to parse CSV in all its variations correctly, like Python, or use a dedicated CSV tool -- I hear there's one called csvtool
).
If your file is compressed, you can't grep
(or use Awk, sed
, etc. on) the compressed data directly; you have to uncompress it first:
gzcat datafile.gz |
awk -F '\t' -v gzfile="datafile.gz" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
Notice how you don't put a file name after the Awk script when the input comes from a pipe (and not just Awk, actually; this holds for any tool which can read either a file or standard input, as is the norm among Unix text-processing tools).
The above also shows how to pass a string into Awk as a variable (-v variable="value"
). This might be more useful if you have multiple files you want to loop over:
for datafile in datafile.gz datafile0.gz datafile1.gz; do
gzcat "$datafile" |
awk -F '\t' -v gzfile="$datafile" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
done
Slightly bewilderingly, the $datafile
variable is a shell variable (Bash or Zsh or what have you) which is entirely distinct from Awk's internal variables. We use the -v
trick from above to make it available to Awk in the variable gzfile
.
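Putting the pieces together, a minimal end-to-end sketch (using zcat, which is the usual name for gzcat on Linux; the file name is made up):

```shell
# Create and compress a tiny sample, then stream it into awk
printf 'a\tb\tNA\n' > sample.tsv
gzip -f sample.tsv
zcat sample.tsv.gz |
  awk -F '\t' -v gzfile="sample.tsv.gz" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
```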
You could use grep
(which reads plain text) or gzgrep
(which can read gzip
-compressed data directly) as well, but then you want to pass in a regular expression which targets the specific column. Just to show how it's done, here is a regex which says "something with no tabs, followed by a tab, followed by something with no tabs, followed by a tab, followed by na
, followed by a tab" which (once you wrap your head around it) targets the third column.
gzgrep -Ei -c $'^[^\t]*\t[^\t]*\tna\t' filename.gz
The $'...'
notation is Bash-specific and allows us to use the symbol \t
instead of a literal tab in the regular expression. The -c
option to grep
asks it to report how many matching lines it found, so you don't even need wc -l
here. The -E
option selects a slightly less arcane and more modern regular expression dialect than the default from the early 1970s. (It's still not properly modern by any standard; the "extended" dialect is from the mid-1970s, with some later brushing up by the POSIX standardization in the 1990s. Newer tools support a plethora of extensions which are not standard in grep
.)
To look for na
alone with tabs on both sides in any column, try
gzgrep -Eic -e $'^na\t|\tna\t|\tna$' filename.gz
where |
in the regular expression stands for "or", and ^na\t
is the regular expression for na
bracketed by beginning of line on one side and a tab on the other, and \tna$
is the same for end of line.
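The same regex works with plain grep on uncompressed text. A tiny demo (file name made up) with matches in the first and second columns:

```shell
# Lines 1 and 2 match (NA in column 1, na in column 2); line 3 does not
printf 'NA\tx\ty\nx\tna\ty\nx\ty\tz\n' > demo.tsv
grep -Eic -e $'^na\t|\tna\t|\tna$' demo.tsv
# prints 2
```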
If you don't use Bash, you can remove the $
before '
and type a literal tab instead, usually entered with Ctrl+V Tab.
Files whose name ends with .gz
are compressed with gzip
; there are other compression tools which conventionally use file extensions like .Z
(plain old compress
, from the dark ages), .bz2
(bzip2
), .xz
(xz
) etc; most of these ship with a bzcat
or similar tool which performs the same job as gzcat
for gzip
files, and some will also have something like bzgrep
to parallel gzgrep
. (To add to the alphabet soup, gzcat
could also mean GNU zcat
on some systems! Then it handles .Z
files, not .gz
files).
Unix-based systems generally don't rely on the file extension to identify the type of a file, but the human convention is to add an extension when you compress something, and many of the tools will do that automatically (so for example gzip file
creates file.gz
).
A quick examination of a compressed file, perhaps with a hex dump tool, should reveal why grep
doesn't work directly on the file. The compressed data is a tightly packed, opaque binary which simply does not contain the uncompressed data in any easily discoverable form (unless you learn a lot about compression!).
Shell command to sum integers, one per line?
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf
rather than print
:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
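For example, summing two values just under 2^31 stays accurate with the printf form, since awk holds the sum as a floating-point double (the trailing \n here is just for clean output):

```shell
printf '2147483647\n2147483647\n' | awk '{s+=$1} END {printf "%.0f\n", s}'
# prints 4294967294
```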
Scripts for computing the average of a list of numbers in a data file
Here is one method:
$ awk '{s+=$1}END{print "ave:",s/NR}' RS=" " file
ave: 54.646
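Setting RS=" " makes every space-separated token its own record, so the same sum/NR idiom averages numbers that sit on a single line. A sketch with made-up values:

```shell
# Four space-separated numbers on one line; NR counts records, not lines
printf '10 20 30 40' > nums.txt
awk '{s+=$1} END {print "ave:", s/NR}' RS=" " nums.txt
# prints ave: 25
```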
linux: how can I perform awk-like statistics on text input?
Why don't you use a database:
first, add column names to your file:
sed -i '1i col0 col1 col2 col3 col4' myfile
Then, create a database and output some stats:
sqlite3 myfile.sqlite <<END
.separator " "
.import myfile mytable
select max(col1), avg(col1) from mytable;
END
Outputs
1.00567 0.248412
How can I quickly sum all numbers in a file?
For a Perl one-liner, it's basically the same thing as the awk
solution in Ayman Hourieh's answer:
% perl -nle '$sum += $_ } END { print $sum'
If you're curious what Perl one-liners do, you can deparse them:
% perl -MO=Deparse -nle '$sum += $_ } END { print $sum'
The result is a more verbose version of the program, in a form that no one would ever write on their own:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
$sum += $_;
}
sub END {
print $sum;
}
-e syntax OK
Just for giggles, I tried this with a file containing 1,000,000 numbers (in the range 0 - 9,999). On my Mac Pro, it returns virtually instantaneously. That's too bad, because I was hoping that using mmap
would be really fast, but it takes about the same time:
use 5.010;
use File::Map qw(map_file);
map_file my $map, $ARGV[0];
$sum += $1 while $map =~ m/(\d+)/g;
say $sum;
How to get average, median, mean stats from a file which has numbers in first column?
In case you're not bound to any specific tool, try GNU datamash
- a nice tool for "command-line statistical operations" on textual files.
To get mean, median, percentile 95 and percentile 99 values for first column/field (note, fields are TAB
-separated by default):
$ datamash --header-out mean 1 median 1 perc:95 1 perc:99 1 < file
mean(field-1) median(field-1) perc:95(field-1) perc:99(field-1)
0.016128538461538 0.012794 0.0346484 0.04258088