How Might I Remove Duplicate Lines from a File

How to remove duplicate lines in a file?

Simply use the -o and -u options of sort:

sort -o file -u file

You don't even need to pipe into another command such as uniq. Using the same file for the input and the -o output is safe here, because sort reads all of its input before it opens the output file.

How do I delete duplicate lines and create a new file without duplicates?

Sounds simple enough, but what you did looks overcomplicated. I think the following should be enough:

with open('TEST.txt', 'r') as f:
    unique_lines = set(f.readlines())
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)

A couple of things to note:

  • If you are going to use a set, you might as well add all the lines at creation time, and f.readlines(), which returns a list of all the lines in your file, is perfect for that.
  • f.writelines() will write a sequence of lines to your file, but a set does not preserve the order of the lines. So if the order matters to you, replace the last line with f.writelines(sorted(unique_lines, key=whatever you need)), or keep the original order with a "seen" set as in the sketch after this list.
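
For example, if you want to keep each unique line at the position of its first occurrence instead of sorting afterwards, a minimal sketch (using the same TEST.txt/TEST_no_dups.txt names as above) could look like this:

seen = set()
with open('TEST.txt', 'r') as src, open('TEST_no_dups.txt', 'w') as dst:
    for line in src:
        if line not in seen:  # first time this exact line appears
            seen.add(line)
            dst.write(line)

This also reads the input one line at a time instead of calling f.readlines() up front.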

Remove duplicate lines in log file

sort | uniq -d doesn't remove duplicates; it prints one copy of each line that occurs more than once. You should probably be using sort -u instead, which removes duplicates.

But to answer the question you asked:

$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text

The first awk command prepends each line with its length so that the subsequent sort can order all of the lines longest-first. The second awk then outputs a line only when it is the first occurrence of the key field value (which, after the sort, is the longest line with that key value), and finally the cut removes the line length that the first awk added.

In sequence:

$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text

You didn't say which line to print if multiple lines for the same key value have the same length, so the above will just output one of them arbitrarily. If that's an issue, you can use GNU sort and add the -s argument (for a stable sort), or change the command line to:

$ awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3-

In either case, the line output for such a tie would be the first one that was present in the input.

Delete duplicate lines from text file based on column

Knowing that Select-Object -Unique is unnecessarily slow and exhaustive (#11221), I would indeed use a HashSet for this, but instead of the Get-Content/Set-Content cmdlets, I recommend simply using the Import-Csv/Export-Csv cmdlets, as they automatically deal with your properties (columns):

$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" |ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} |Export-Csv .\Output.txt -Delimiter "`t"

How to remove repeated lines from a file?

You probably want itertools.groupby. Without a key function it returns one group per run of identical consecutive lines, so you can write a single line from each group and skip the rest:

import itertools

with open('one.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line, _ in itertools.groupby(infile):
            outfile.write(line)

This only collapses duplicates that occur next to each other. If repeated lines may appear in multiple places in the file (e.g. a a b a would write a b a), then you can instead keep a set of the lines you have already seen:

seen_lines = set()
with open('one.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            if line in seen_lines:
                continue
            outfile.write(line)
            seen_lines.add(line)

How to remove duplicate lines from a file

uniq(1)

SYNOPSIS

uniq [OPTION]... [INPUT [OUTPUT]]

DESCRIPTION

Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).

Or, if you want to remove non-adjacent duplicate lines as well, this Perl fragment will do it:

while (<>) {
    print $_ if (!$seen{$_});
    $seen{$_} = 1;
}

Remove duplicates from a txt file that contains different sentences consisting of the same words in PHP

You can use array_map, explode and sort to bring the keywords into the same order for all your lines before removing duplicates:

$lines = file('input.txt');

// sort keywords in each line
$lines = array_map(function($line) {
    $keywords = explode(" ", trim($line));
    sort($keywords);
    return implode(" ", $keywords);
}, $lines);

$lines = array_unique($lines);
file_put_contents('output.txt', implode("\n", $lines));

This iterates over your array and orders the keywords of each line alphabetically. Afterwards, the duplicated lines can be removed with array_unique.
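
For comparison, here is a rough Python sketch of the same idea, using the input.txt/output.txt names from the PHP snippet and assuming the keywords are whitespace-separated: normalize the keyword order of each line first, then drop duplicates.

seen = set()
with open('input.txt', 'r') as src, open('output.txt', 'w') as dst:
    for line in src:
        # The same words in any order produce the same key.
        key = " ".join(sorted(line.split()))
        if key not in seen:
            seen.add(key)
            dst.write(line)

Unlike the PHP version, this writes the first original sentence for each set of words rather than the re-sorted keywords; write key + "\n" instead of line if you want the normalized form in the output.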


