Extract Unique Block of Lines from a File Using Shell Script

Thanks @tripleee and @Jarmund for your suggestions. From your inputs I was finally able to figure out a solution to my problem. I got the hint to use associative arrays to build a unique key for each block, so here is what I did:

  • Take file-2 and convert each block into a single line:

    awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[Rr]eq/{if(b) exit; else b=1}1' file-2 > $TESTSETFILE
    sed ':a;N;$!ba;s/\n//g;s/ //g' $TESTSETFILE >> $SINGLELINEFILE

  • Now each line in this file represents one complete block and can act as its unique key

  • After this I use grep for each line of file-1 to find the respective block (which has been converted into a single line)
  • Then I use awk (or sort -u) to keep only the unique entries in the result file

Maybe this solution is not the best, but it is a lot faster than the previous one.
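For reference, the block-collapsing step can also be done in a single awk pass instead of the tail/mv loop used below. This is only a sketch, assuming every block starts with a line matching one of the two setup patterns; the file names here are illustrative:

awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[Rr]eq/ {
    if (line != "") print line              # emit the previous block as one line
    line = ""
}
{ gsub(/[[:space:]]/, ""); line = line $0 } # strip whitespace and append to the current block
END { if (line != "") print line }' file-2 > singleline.txt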

Here is my new script:

#!/bin/bash
FLBATCHLIST=$1   # file-1: list of failed tests
BATCHFILE=$2     # file-2: batch file containing the test blocks

TEMPDIR="./tempBatchdir"
rm -rf $TEMPDIR/*
WORKFILE="$TEMPDIR/failedTestList.txt"
CPBATCHFILE="$TEMPDIR/orig.test"
TESTSETFILE="$TEMPDIR/testset.txt"
DIFFFILE="$TEMPDIR/diff.txt"
SINGLELINEFILE="$TEMPDIR/singleline.txt"
TEMPFILE="$TEMPDIR/temp.txt"
#Output
FAILEDBATCH="$TEMPDIR/FailedBatch.test"
LOGFILE="$TEMPDIR/log.txt"

convertSingleLine ()
{
    # Trim leading/trailing whitespace and drop blank lines from the working copy
    sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $CPBATCHFILE
    STATUS=1
    while [ $STATUS -ne 0 ]
    do
        if [ ! -s $CPBATCHFILE ]; then
            echo "$CPBATCHFILE is empty" >> $LOGFILE
            STATUS=0
            break
        fi
        # Keep everything from the top of the file up to (but not including) the next setup line
        awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[Rr]eq/{if(b) exit; else b=1}1' $CPBATCHFILE > $TESTSETFILE
        # Collapse that block into a single line and strip all spaces
        sed ':a;N;$!ba;s/\n//g;s/ //g' $TESTSETFILE >> $SINGLELINEFILE
        echo "**" >> $SINGLELINEFILE
        # Drop the processed block from the working copy and loop
        TSTFLLINES=`wc -l < $TESTSETFILE`
        CPBTCHLINES=`wc -l < $CPBATCHFILE`
        DIFF=`expr $CPBTCHLINES - $TSTFLLINES`
        tail -n $DIFF $CPBATCHFILE > $DIFFFILE
        mv $DIFFFILE $CPBATCHFILE
    done
}

####STARTS HERE####
mkdir -p $TEMPDIR

# Strip a leading "exec", trim whitespace, and drop blank lines from the failed-test list
sed 's/^[eE][xX][eE][cC]//g;s/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $FLBATCHLIST > $WORKFILE
# Escape /, . and " so each test name can be used as a grep pattern
sed -i 's/\([\/\.\"]\)/\\\1/g' $WORKFILE

# Work on a copy of the batch file and collapse each block into a single line
cp $BATCHFILE $CPBATCHFILE
convertSingleLine

for fltest in $(cat $WORKFILE)
do
    echo $fltest >> $LOGFILE
    # Append the matching single-line block(s) for this failed test to FAILEDBATCH
    if grep "$fltest" $SINGLELINEFILE >> $FAILEDBATCH; then
        echo "TEST FOUND" >> $LOGFILE
    else
        # Un-escape the test name for the error message
        ABSTEST=$(echo $fltest | sed 's/\\//g')
        echo "FATAL ERROR: Test \"$ABSTEST\" not found in $BATCHFILE" | tee -a $LOGFILE
    fi
done

# Keep only the first occurrence of each block (preserves order, unlike sort -u)
awk '!x[$0]++' $FAILEDBATCH > $TEMPFILE
mv $TEMPFILE $FAILEDBATCH

# Expand the single-line blocks back out: put "exec" and "#" on their own lines,
# drop the leading blank line and convert / back to \
sed -i "s/exec/\\nexec /g;s/#/\\n#/g" $FAILEDBATCH
sed -i '1d;s/\//\\/g' $FAILEDBATCH

Here is the output:

$ crflbatch file-1 file-2
FATAL ERROR: Test "perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp" not found in file-2
FATAL ERROR: Test "this/or/that" not found in file-2

$ cat tempBatchdir/FailedBatch.test
exec 1.20\setup\testinit
exec 1.20\abc\this_is_test_1
exec 1.20\abc\this_is_test_1

exec perl\RRP\SetupReq\testdef_ijk
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2

exec perl\LRP\SetupReq\testird_LRP("LRP")
exec perl\BaseLibs\launch_client("LRP")
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle
exec perl\LRP\BaseLibs\setupLRPMMMTab
exec perl\LRP\BaseLibs\launchMMM
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE")
#PAUSEExpandChurriptreeview&openallnodes
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb
exec perl\BaseLibs\ShutApp("SelfDestructionSystem")
exec perl\LRP\BaseLibs\close-MMM
$

How can I extract a predetermined range of lines from a text file on Unix?

sed -n '16224,16482p;16483q' filename > newfile

From the sed manual:

p -
Print out the pattern space (to the standard output). This command is usually only used in conjunction with the -n command-line option.

n -
If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If
there is no more input then sed exits without processing any more
commands.

q -
Exit sed without processing any more commands or input.
Note that the current pattern space is printed if auto-print is not disabled with the -n option.

and

Addresses in a sed script can be in any of the following forms:

number
Specifying a line number will match only that line in the input.

An address range can be specified by specifying two addresses
separated by a comma (,). An address range matches lines starting from
where the first address matches, and continues until the second
address matches (inclusively).
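
For comparison, the same fixed range can also be pulled out with other standard tools. These one-liners are common equivalents, not part of the quoted answer:

awk 'NR>=16224 && NR<=16482' filename > newfile        # simple, but reads the whole file
head -n 16482 filename | tail -n +16224 > newfile      # head stops reading after line 16482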

How to generate a list of unique lines in a text file using a Linux shell script?

If you don't mind the output being sorted, use

sort -u

This sorts the input and removes duplicate lines.
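
For example, with a small unsorted input:

$ printf 'banana\napple\nbanana\ncherry\n' | sort -u
apple
banana
cherry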

extracting unique values between 2 sets/files

$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given
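
For instance, here are hypothetical inputs that would produce the output shown above:

$ printf '1\n2\n3\n4\n5\n' > file1
$ printf '3\n4\n5\n6\n7\n' > file2
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7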

Select unique or distinct values from a list in UNIX shell script

You might want to look at the uniq and sort applications.


./yourscript.ksh | sort | uniq

(FYI, yes, the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other.)

EDIT:

Contrary to what has been posted by Aaron Digulla in relation to uniq's command-line options:

Given the following input:


class
jar
jar
jar
bin
bin
java

uniq will output each line exactly once (adjacent duplicates are collapsed):


class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:


jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:


class
java
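
To reproduce all three cases from a shell, assuming the sample input above is saved as input.txt:

$ printf 'class\njar\njar\njar\nbin\nbin\njava\n' > input.txt
$ uniq input.txt      # class, jar, bin, java
$ uniq -d input.txt   # jar, bin
$ uniq -u input.txt   # class, java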

How to print only the unique lines in BASH?

Using awk:

awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file
eagle
forest
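
Note that the for (i in seen) loop prints the surviving lines in arbitrary order. If the input order matters, a two-pass variant that reads the same file twice keeps the original ordering; this is a sketch, not part of the original answer:

awk 'NR==FNR {seen[$0]++; next} seen[$0]==1' file file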

