How to remove duplicate lines in a file?
Simply use the -o and -u options of sort:
sort -o file -u file
You don't even need to pipe the output to another command such as uniq.
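For comparison, the same effect (sorted output with duplicates removed, written back in place) can be sketched in Python; the function name is just an illustration:

```python
def sort_unique_in_place(path):
    """Rough equivalent of `sort -o path -u path`:
    sort the lines and drop duplicates, writing back to the same file."""
    with open(path) as f:
        # a set removes duplicates; sorted() restores a deterministic order
        lines = sorted(set(f))
    with open(path, 'w') as f:
        f.writelines(lines)
```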
How do I delete duplicate lines and create a new file without duplicates?
Sounds simple enough, but what you did looks overcomplicated. I think the following should be enough:
with open('TEST.txt', 'r') as f:
    unique_lines = set(f.readlines())
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)
A couple of things to note:
- If you are going to use a set, you might as well dump all the lines into it at creation, and f.readlines(), which returns the list of all the lines in your file, is perfect for that.
- f.writelines() will write a sequence of lines to your file, but using a set breaks the order of the lines. So if that matters to you, I suggest replacing the last line with f.writelines(sorted(unique_lines, key=whatever you need)).
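If what you actually want is to keep the original file order rather than sort, a minimal sketch (reusing the filenames above) can rely on dict.fromkeys, which keeps only the first occurrence of each key and preserves insertion order on Python 3.7+:

```python
def dedupe_preserving_order(src, dst):
    # dict.fromkeys keeps the first occurrence of each line and
    # preserves insertion order (guaranteed since Python 3.7)
    with open(src) as f:
        unique_lines = dict.fromkeys(f)
    with open(dst, 'w') as f:
        # iterating a dict yields its keys, i.e. the unique lines in order
        f.writelines(unique_lines)
```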
remove duplicate lines in log file
sort | uniq -d doesn't remove duplicates; it prints one line from each batch of duplicated lines. You should probably be using sort -u instead - that will remove duplicates.
But to answer the question you asked:
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
The first awk command just prepends each line with its length so the subsequent sort can sort all of the lines longest-first. Then the second awk only outputs a line when it's the first occurrence of the key field value (which, after the sort, is the longest line with that key value), and finally the cut removes the line length that the first awk added.
In sequence:
$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
You didn't say which line to print if multiple lines for the same key value have the same length, so the above will just output one of them arbitrarily. If that's an issue then you can use GNU sort and add the -s argument (for stable sort), or change the command line to awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3- - in both cases that ensures the line output in such a conflict is the first one that was present in the input.
Delete duplicate lines from text file based on column
Knowing that Select-Object -Unique (see #11221) is unnecessarily slow and exhaustive, I would indeed use a HashSet for this, but instead of the Get-Content/Set-Content cmdlets, I recommend you simply use the Import-Csv/Export-Csv cmdlets, as they automatically deal with your properties (columns):
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} | Export-Csv .\Output.txt -Delimiter "`t"
How to remove repeated lines from a file?
You probably want itertools.groupby. Without a key function it returns one 'group' per run of identical lines, so you can skip the rest of each group and write just one line from each grouping.
import itertools

with open('one.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line, _ in itertools.groupby(infile):
            outfile.write(line)
This only collapses runs of adjacent duplicate lines. If repeated lines may appear in multiple places in the file (e.g. a a b a would be written as a b a), then you can keep a set of the lines you have already seen:
seen_lines = set()
with open('one.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            if line in seen_lines:
                continue
            outfile.write(line)
            seen_lines.add(line)
How to remove duplicate lines from a file
uniq(1)
SYNOPSIS
uniq [OPTION]... [INPUT [OUTPUT]]
DESCRIPTION
Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).
Or, if you want to remove non-adjacent duplicate lines as well, this fragment of Perl will do it:
while (<>) {
    print $_ if (!$seen{$_});
    $seen{$_} = 1;
}
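For comparison, the same seen-hash idea behind the Perl fragment can be sketched in Python as a small generator over any iterable of lines:

```python
def unique_lines(lines):
    # Same logic as the Perl fragment: emit a line only the
    # first time it is seen, keeping the original order.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```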
Remove duplicate lines from a txt file that contains different sentences consisting of the same words in PHP
You can use array_map, explode and sort to bring the keywords into the same order for all your lines before removing duplicates:
$lines = file('input.txt');

// sort keywords in each line
$lines = array_map(function($line) {
    $keywords = explode(" ", trim($line));
    sort($keywords);
    return implode(" ", $keywords);
}, $lines);

$lines = array_unique($lines);
file_put_contents('output.txt', implode("\n", $lines));
This will iterate your array and order the keywords for each line alphabetically. Afterwards, you can remove the duplicated lines using array_unique.
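The same normalize-then-deduplicate idea can be sketched in Python; here dict.fromkeys drops duplicates while keeping the first occurrence, and the function name is just an illustration:

```python
def dedupe_word_sets(lines):
    # Normalize each line by sorting its words, then keep only the
    # first line for each normalized form - same idea as the PHP answer.
    normalized = [' '.join(sorted(line.split())) for line in lines]
    return list(dict.fromkeys(normalized))
```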