How to Count the Number of Unique Values of a Field in a Tab-Delimited Text File


You can make use of the cut, sort and uniq commands as follows:

cat input_file | cut -f 1 | sort | uniq

This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.

Avoiding UUOC (the useless use of cat) :)

cut -f 1 input_file | sort | uniq

EDIT:

To count the number of unique occurrences, you can make use of the wc command in the chain:

cut -f 1 input_file | sort | uniq | wc -l

BASH: Printing the total number of unique elements in a tab-delimited file

Could you please try the following and let me know if this helps you.

awk '!a[$6]++{count++};END{print count}'  Input_file

For a TAB-delimited Input_file, set the field separator as well: awk 'BEGIN{FS="\t"} !a[$6]++{count++};END{print count}' Input_file

Second solution: in GNU awk, using the length of the array.

awk '{!a[$6]++} END{print length(a)}' Input_file

The output will be as follows for both solutions:

awk '!a[$6]++{count++};END{print count}' Input_file
4
***********
awk '{!a[$6]++} END{print length(a)}' Input_file
4

How to build a tab-delimited text file with many calculated values?

The following answer works perfectly and comes from the combined suggestions of markp-fuso and KamilCuk. Thank you both!

# add the table headers
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'

for file in sample*.txt
do
# create variables containing the results of all parameter calculations/retrievals
pcgenes=$(awk '$3 == "word3"' "${file}" | grep -c "word4")
pctranscripts=$(...)
pcgenesantisense=$(...)
pcgenessense=$(...)
avgexons=$(awk '$3 == "word3"' "${file}" | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
avggenelength=$(...)

# print all resulting values in a single tab-separated row of the table
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${file}" "${pcgenes}" "${pctranscripts}" "${pcgenesantisense}" "${pcgenessense}" "${avgexons}" "${avggenelength}"
done >> table.txt

Counting unique values in a column with a shell script

No need to use awk.

$ cut -f2 file.txt | sort | uniq | wc -l

should do it.

This uses the fact that tab is cut's default field separator, so we'll get just the content from column two this way. Then a pass through sort works as a pre-stage to uniq, which removes the duplicates. Finally we count the lines, which is the sought number.

Find number of unique values in a column

This should do the job:

grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c

  1. First grep for the "ELEC.PLANT." part.
  2. Then remove the trailing .Q, .A or .M with cut.
  3. Finally remove duplicates and count using sort | uniq -c.

EDIT:
For the new data, it should only be necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c

Counting the number of unique values based on two columns in bash

With a complete awk solution, could you please try the following.

awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file

Explanation: adding a detailed explanation of the above.

awk '                  ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"
}
!found[$0]++{          ##Checking condition: if the whole line (1st and 2nd column) is NOT present in found array, then do following.
  val[$1]++            ##Creating val with 1st column as index and keep increasing its value here.
}
END{                   ##Starting END block of this program from here.
  for(i in val){       ##Traversing through array val here.
    print i,val[i]     ##Printing i and value of val with index i here.
  }
}
' Input_file           ##Mentioning Input_file name here.

Need to find number of distinct values in each column in very large files

Though my solution is not elegant and there is surely a better one out there (BTree?), I found something that worked and thought I'd share it. I can't be the only one out there looking to determine distinct counts for fields in very large files. That said, I don't know how well this will scale to hundreds of millions or billions of records. At some point, with enough data, one will hit the 2GB size limit for a single array.

What didn't work:

  • For very large files: a hash table for each field, populated in real time as I iterate through the file, then use HashTable.Count. The collective size of the hash tables causes a SystemOutOfMemoryException before reaching the end of the file.
  • Importing the data to SQL and then querying each column to determine the distinct count. It takes WAY too long.

What did work:

  • For the large files with tens of millions of rows, I first analyze the first 1000 rows, creating a hash table for each field and populating it with the distinct values.
  • For any field with more than 50 distinct values out of the 1000, I mark the field with a boolean flag HasHighDensityOfDistinctValues = true.
  • For any such field with HasHighDensityOfDistinctValues == true, I create a separate text file and, as I iterate through the main file, write the values for just that field out to the field-specific text file.
  • For fields with a lower density of distinct values I maintain the hash table for each field and write distinct values to that.
  • I noticed that in many of the high-density fields there are repeated values (such as a PersonID) across multiple consecutive rows, so to reduce the number of entries in the field-specific text files, I store the previous value of the field and only write to the text file if the current value does not equal the previous value. That cut down significantly on the total size of the field-specific text files.
  • Once done iterating through the main file being processed, I iterate through my FieldProcessingResults class and for each field, if HasHighDensityOfDistinctValues==true, I read each line in the field-specific text file and populate the field-specific hash table with distinct values, then use HashTable.Count to determine the count of distinct values.
  • Before moving on to the next field, I store the count associated with that field, then clear the hash table with myHashTable.Clear(). I close and delete the field-specific text file before moving on to the next field.

In this manner, I am able to get the count of distinct values for each field without necessarily having to concurrently populate and maintain an in-memory hash table for each field, which had caused the out of memory error.
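The bullet points above describe a .NET implementation; for anyone who would rather see the idea in code, here is a rough Python sketch of the same two-phase approach. The 1000-row sample and the 50-distinct-value threshold come from the description above; the file name bigfile.tsv, the assumption of a header row, and the per-field spill-file names are illustrative only.

import csv
import os

INPUT = "bigfile.tsv"        # hypothetical tab-delimited input file with a header row
SAMPLE_ROWS = 1000           # sample size used to classify fields
DENSITY_THRESHOLD = 50       # more distinct values than this in the sample => "high density"

with open(INPUT, newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    n_fields = len(header)

    # Phase 1: sample the first 1000 rows and count distinct values per field.
    sample_sets = [set() for _ in range(n_fields)]
    sample = []
    for i, row in enumerate(reader):
        sample.append(row)
        for j, value in enumerate(row):
            sample_sets[j].add(value)
        if i + 1 >= SAMPLE_ROWS:
            break
    high_density = [len(s) > DENSITY_THRESHOLD for s in sample_sets]

    # Phase 2: keep in-memory sets only for low-density fields; spill
    # high-density fields to one temporary file each, skipping consecutive
    # repeats to keep the spill files small.
    low_sets = [None if hd else set() for hd in high_density]
    spill = [open("field_%d.spill" % j, "w") if hd else None
             for j, hd in enumerate(high_density)]
    previous = [None] * n_fields

    def process(row):
        for j, value in enumerate(row):
            if j >= n_fields:
                break
            if high_density[j]:
                if value != previous[j]:
                    spill[j].write(value + "\n")
                    previous[j] = value
            else:
                low_sets[j].add(value)

    for row in sample:    # the rows consumed during sampling
        process(row)
    for row in reader:    # the rest of the file
        process(row)

# Phase 3: count distinct values one field at a time, so only one
# field-specific set is ever held in memory.
counts = {}
for j, name in enumerate(header):
    if high_density[j]:
        spill[j].close()
        with open("field_%d.spill" % j) as sf:
            counts[name] = len(set(line.rstrip("\n") for line in sf))
        os.remove("field_%d.spill" % j)
    else:
        counts[name] = len(low_sets[j])

for name, count in counts.items():
    print("%s\t%d" % (name, count))

The spill files trade disk space for memory, exactly as in the bullets: only one field-specific set is ever populated at a time during the final counting pass.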

How to get unique values/elements of a column?

If order doesn't matter, you could do it by creating a set from the items in column 2 of the lines in the file:

with open('SGD_features.tab') as file:
    unique_features = set(line.split('\t')[1] for line in file)

for feature in unique_features:
    print(feature)
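
If you also want the count, len(unique_features) gives it. And if you need the values in the order they first appear, a small variation of the same idea (just a sketch, relying on Python 3.7+ dicts keeping insertion order) is:

with open('SGD_features.tab') as file:
    unique_features = list(dict.fromkeys(line.split('\t')[1] for line in file))

print(len(unique_features))   # number of unique values
for feature in unique_features:
    print(feature)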


