Does Distinct() Method Keep Original Ordering of Sequence Intact

Does Distinct() method keep original ordering of sequence intact?

It's not guaranteed, but it's the most obvious implementation. It would be hard to implement in a streaming manner (i.e. such that it returned results as soon as it could, having read as little as it could) without returning them in order.

You might want to read my blog post on the Edulinq implementation of Distinct().

Note that even if this were guaranteed for LINQ to Objects (which personally I think it should be) that wouldn't mean anything for other LINQ providers such as LINQ to SQL.

The level of guarantees provided within LINQ to Objects is a little inconsistent sometimes, IMO. Some optimizations are documented, others not. Heck, some of the documentation is flat out wrong.

Why LINQ Distinct() does not preserve order list?

Distinct() is about returning distinct rows, it doesn't promise any order but still orders on what it operates. For ordering there is OrderBy, OrderByDescending. In your case, since there is no Date in your final list, you cannot order by date.

EDIT:

void Main()
{
List<FarmDiary> farmDiary = new List<FarmDiary> {
new FarmDiary{Id=1, PersonInCharge="Bob",Date=new DateTime(2018,6,1)},
new FarmDiary{Id=2, PersonInCharge="John",Date=new DateTime(2018,6,2)},
new FarmDiary{Id=3, PersonInCharge="Bob",Date=new DateTime(2018,6,15)},
new FarmDiary{Id=4, PersonInCharge="David",Date=new DateTime(2018,7,1)},
new FarmDiary{Id=5, PersonInCharge="Zachary",Date=new DateTime(2018,6,10)},
};

List<string> staffNames = farmDiary
.GroupBy(d => d.PersonInCharge)
.OrderByDescending(x => x.OrderByDescending(d => d.Date).First().Date)
.Select(x => x.Key)
.ToList();

staffNames.Dump();
}

public class FarmDiary
{
public int Id { get; set; }
public string PersonInCharge { get; set; }
public DateTime Date { get; set; }
}

Preserving order with LINQ

I examined the methods of System.Linq.Enumerable, discarding any that returned non-IEnumerable results. I checked the remarks of each to determine how the order of the result would differ from order of the source.

Preserves Order Absolutely. You can map a source element by index to a result element

  • AsEnumerable
  • Cast
  • Concat
  • Select
  • ToArray
  • ToList

Preserves Order. Elements are filtered or added, but not re-ordered.

  • Distinct
  • Except
  • Intersect
  • OfType
  • Prepend (new in .net 4.7.1)
  • Skip
  • SkipWhile
  • Take
  • TakeWhile
  • Where
  • Zip (new in .net 4)

Destroys Order - we don't know what order to expect results in.

  • ToDictionary
  • ToLookup

Redefines Order Explicitly - use these to change the order of the result

  • OrderBy
  • OrderByDescending
  • Reverse
  • ThenBy
  • ThenByDescending

Redefines Order according to some rules.

  • GroupBy - The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
  • GroupJoin - GroupJoin preserves the order of the elements of outer, and for each element of outer, the order of the matching elements from inner.
  • Join - preserves the order of the elements of outer, and for each of these elements, the order of the matching elements of inner.
  • SelectMany - for each element of source, selector is invoked and a sequence of values is returned.
  • Union - When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

Edit: I've moved Distinct to Preserving order based on this implementation.

    private static IEnumerable<TSource> DistinctIterator<TSource>
(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource element in source)
if (set.Add(element)) yield return element;
}

How to Remove same key on C# List?

You can use the Enumerable.Distinct Method like so:

List<int> list = new List<int> { 1, 2, 5, 10, 1, 2, 4 };
list = list.Distinct().ToList(); // 1, 2, 5, 10, 4

Note that the documentation does not give a guarantee that the order of the sequence is preserved. However, the Distinct method is implemented such that the order is indeed preserved, as can be seen in the source.

You may also consult this post: Does Distinct() method keep original ordering of sequence intact?

Distinct() doesn't see uppercase letter changed by ToLower() method

I've also modified your code a bit:

string textToEncode = File.ReadAllText(@"C:\Users\ASUS\Desktop\szyfrowanie2\TextSample.txt").ToLower();
char[] distinctLetters = textToEncode.Distinct().ToArray();
var count = distinctLetters.Count();
Console.WriteLine("Letters used in text: \n\n");
for (int i = 0; i < count; i++)
{
if (Equals(distinctLetters[i], ' ')) { Console.Write("<space>"); }
else if (Equals(distinctLetters[i], '\r')) { Console.Write("<cr>"); }
else if (Equals(distinctLetters[i], '\n')) { Console.Write("<lf>"); }
else { Console.Write(" " + distinctLetters[i] + " "); }
}

Just a few minor things. I merged the two first lines, changed " " into ' ' so it now compares characters, changed the counting of characters to use distinctLetters instead of executing the same Distinct() command again and I added two conditions to handle the carriage return and line feed. (I always mix them up, btw.)

This now shows the right result but should also explain why characters went missing! A simple reason, actually. Your text file has a carriage return character, which will send the cursor back to the left. This will cause the first character to be overwritten by a space...

So your code actually prints " w i ..." but then gets the '\r'. It will then print a space, go back to the beginning of the line and writes another space over the ' '! Then the newline will come next, which prints a second space over the 'w', moves to the next line and prints a space again. Then the rest gets printed...

Simple, isn't it? But by capturing these two special characters with the two extra if statements, it is fixed... :-) The '\r' and '\n' characters are often overlooked in console applications, giving unexpected results when they get printed.

Most efficient way to remove duplicates from a List

There is a big difference between these two approaches:

List<int> Result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> Result2 = myList.Distinct().ToList(); //4700 ticks

The first one can (will probably) change the order of the elements of the returned List<>: Result1 elements won't be in the same order of myList's ones. The second maintains the original ordering.

There is probably no faster way than the first one.

There is probably no "more correct" (for a certain definition of "correct" based on ordering) than the second one.

(the third one is similar to the second one, only slower)

Just out of curiousity, the Distinct() is:

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,712
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source) {
if (source == null) throw Error.ArgumentNull("source");
return DistinctIterator<TSource>(source, null);
}

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,722
static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer) {
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource element in source)
if (set.Add(element)) yield return element;
}

So in the end the Distinct() simply uses an internal implementation of an HashSet<> (called Set<>) to check for the uniqueness of items.

For completeness sake, I'll add a link to the question Does C# Distinct() method keep original ordering of sequence intact?

Filtering duplicates from a type collection using LINQ

First, declare two equality compareres to specify your two conditions like this:

public class MyEqualityComparer1 : IEqualityComparer<SomeType>
{
public bool Equals(SomeType x, SomeType y)
{
return x.Application == y.Application && x.ExternalID == y.ExternalID;
}

public int GetHashCode(SomeType obj)
{
return (obj.Application + obj.ExternalID).GetHashCode();
}
}

public class MyEqualityComparer2 : IEqualityComparer<SomeType>
{
public bool Equals(SomeType x, SomeType y)
{
return x.Application == y.Application && x.ExternalDisplayId == y.ExternalDisplayId;
}

public int GetHashCode(SomeType obj)
{
return (obj.Application + obj.ExternalDisplayId).GetHashCode();
}
}

Then, order your list by CreatedDate and then use Distinct to filter your list like this:

var result = xDic
.OrderByDescending(x => x.CreateDate)
.Distinct(new MyEqualityComparer1())
.Distinct(new MyEqualityComparer2());

The Distinct method should remove the later items, so we should be able to depend on the fact that we used OrderByDescending to make sure that Distinct will remove items with the less recent CreatedTime.

However, since the documentation of Distinct do not guarantee this, you can use a custom distinct method like this:

public static class Extensions
{
public static IEnumerable<T> OrderedDistinct<T>(this IEnumerable<T> enumerable, IEqualityComparer<T> comparer)
{
HashSet<T> hash_set = new HashSet<T>(comparer);

foreach(var item in enumerable)
if (hash_set.Add(item))
yield return item;
}
}

And use it like this:

var result = xDic
.OrderByDescending(x => x.CreateDate)
.OrderedDistinct(new MyEqualityComparer1())
.OrderedDistinct(new MyEqualityComparer2());

How do I remove duplicates from a list, while preserving order?

Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark

Fastest one:

def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]

Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.

If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/

O(1) insertion, deletion and member-check per operation.

(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)



Related Topics



Leave a reply



Submit