Linq Ring: Any() VS Contains() for Huge Collections

LINQ Ring: Any() vs Contains() for Huge Collections

Contains() is an instance method, and its performance depends largely on the collection itself. For instance, Contains() on a List is O(n), while Contains() on a HashSet is O(1).

Any() is an extension method, and will simply go through the collection, applying the delegate on every object. It therefore has a complexity of O(n).

Any() is more flexible however since you can pass a delegate. Contains() can only accept an object.

What is the difference between Contains and Any in LINQ?

Contains takes an object, Any takes a predicate.

You use Contains like this:

listOFInts.Contains(1);

and Any like this:

listOfInts.Any(i => i == 1);
listOfInts.Any(i => i % 2 == 0); // Check if any element is an Even Number

So if you want to check for a specific condition, use Any. If you want to check for the existence of an element, use Contains.

MSDN for Contains, Any

Performance Benchmarking of Contains, Exists and Any

According to documentation:

List.Exists (Object method)

Determines whether the List(T) contains elements that match the
conditions defined by the specified predicate.

IEnumerable.Any (Extension method)

Determines whether any element of a sequence satisfies a condition.

List.Contains (Object Method)

Determines whether an element is in the List.

Benchmarking:

CODE:

    static void Main(string[] args)
{
ContainsExistsAnyShort();

ContainsExistsAny();
}

private static void ContainsExistsAny()
{
Console.WriteLine("***************************************");
Console.WriteLine("********* ContainsExistsAny ***********");
Console.WriteLine("***************************************");

List<int> list = new List<int>(6000000);
Random random = new Random();
for (int i = 0; i < 6000000; i++)
{
list.Add(random.Next(6000000));
}
int[] arr = list.ToArray();

find(list, arr);
}

private static void ContainsExistsAnyShort()
{
Console.WriteLine("***************************************");
Console.WriteLine("***** ContainsExistsAnyShortRange *****");
Console.WriteLine("***************************************");

List<int> list = new List<int>(2000);
Random random = new Random();
for (int i = 0; i < 2000; i++)
{
list.Add(random.Next(6000000));
}
int[] arr = list.ToArray();

find(list, arr);
}

private static void find(List<int> list, int[] arr)
{
Random random = new Random();
int[] find = new int[10000];
for (int i = 0; i < 10000; i++)
{
find[i] = random.Next(6000000);
}

Stopwatch watch = Stopwatch.StartNew();
for (int rpt = 0; rpt < 10000; rpt++)
{
list.Contains(find[rpt]);
}
watch.Stop();
Console.WriteLine("List/Contains: {0:N0}ms", watch.ElapsedMilliseconds);

watch = Stopwatch.StartNew();
for (int rpt = 0; rpt < 10000; rpt++)
{
list.Exists(a => a == find[rpt]);
}
watch.Stop();
Console.WriteLine("List/Exists: {0:N0}ms", watch.ElapsedMilliseconds);

watch = Stopwatch.StartNew();
for (int rpt = 0; rpt < 10000; rpt++)
{
list.Any(a => a == find[rpt]);
}
watch.Stop();
Console.WriteLine("List/Any: {0:N0}ms", watch.ElapsedMilliseconds);

watch = Stopwatch.StartNew();
for (int rpt = 0; rpt < 10000; rpt++)
{
arr.Contains(find[rpt]);
}
watch.Stop();
Console.WriteLine("Array/Contains: {0:N0}ms", watch.ElapsedMilliseconds);

Console.WriteLine("Arrays do not have Exists");

watch = Stopwatch.StartNew();
for (int rpt = 0; rpt < 10000; rpt++)
{
arr.Any(a => a == find[rpt]);
}
watch.Stop();
Console.WriteLine("Array/Any: {0:N0}ms", watch.ElapsedMilliseconds);
}

RESULTS

***************************************
***** ContainsExistsAnyShortRange *****
***************************************
List/Contains: 96ms
List/Exists: 146ms
List/Any: 381ms
Array/Contains: 34ms
Arrays do not have Exists
Array/Any: 410ms
***************************************
********* ContainsExistsAny ***********
***************************************
List/Contains: 257,996ms
List/Exists: 379,951ms
List/Any: 884,853ms
Array/Contains: 72,486ms
Arrays do not have Exists
Array/Any: 1,013,303ms

Linq performance: Any vs. Contains

Both the queries are fundamentally implementing the same algorithm. They will each iterate c1 for each item in c2, compare the Bar properties of the two objects, and return as soon as a match is found. The asymptotic complexity of the two cases is the same, meaning they'll both scale equally well (or equally bad, as the case happens to be) as the size of the two sets increase.

There may be a minor difference between the two in the overhead associated with one method over the other, but the difference will not be significant, and they will be smaller and smaller as the size of the collections increase. There isn't any real performance reason to pick one of the two over the other.

There is an option that you didn't show that is quite a bit faster than either of those. You can use a Join to find all of the items in c1 that also exist in c2 without doing a linear search through the sequence:

var query = from first in c1
join second in c2
on first.Bar equals second.Bar
select first;

Another option would be to use a HashSet instead of a List, as that can be much more easily searched:

var set = new HashSet<string>(c1.Select(item => item.Bar));

var query = c2.Where(item => set.Contains(item.Bar));

(This solution is pretty close to what Join will do internally.)

Both of these solutions will be a lot faster than either of your proposed solutions.

Scalable Contains method for LINQ against a SQL backend

I’ve solved this problem with a little different approach a view month ago. Maybe it’s a good solution for you too.

I didn’t want my solution to change the query itself. So a ids.ChunkContains(p.Id) or a special WhereContains method was unfeasible. Also should the solution be able to combine a Contains with another filter as well as using the same collection multiple times.

db.TestEntities.Where(p => (ids.Contains(p.Id) || ids.Contains(p.ParentId)) && p.Name.StartsWith("Test"))

So I tried to encapsulate the logic in a special ToList method that could rewrite the Expression for a specified collection to be queried in chunks.

var ids = Enumerable.Range(1, 11);
var result = db.TestEntities.Where(p => Ids.Contains(p.Id) && p.Name.StartsWith ("Test"))
.ToChunkedList(ids,4);

To rewrite the expression tree I discovered all Contains Method calls from local collections in the query with a view helping classes.

private class ContainsExpression
{
public ContainsExpression(MethodCallExpression methodCall)
{
this.MethodCall = methodCall;
}

public MethodCallExpression MethodCall { get; private set; }

public object GetValue()
{
var parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
return Expression.Lambda<Func<object>>(parent).Compile()();
}

public bool IsLocalList()
{
Expression parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
while (parent != null) {
if (parent is ConstantExpression)
return true;
var member = parent as MemberExpression;
if (member != null) {
parent = member.Expression;
} else {
parent = null;
}
}
return false;
}
}

private class FindExpressionVisitor<T> : ExpressionVisitor where T : Expression
{
public List<T> FoundItems { get; private set; }

public FindExpressionVisitor()
{
this.FoundItems = new List<T>();
}

public override Expression Visit(Expression node)
{
var found = node as T;
if (found != null) {
this.FoundItems.Add(found);
}
return base.Visit(node);
}
}

public static List<T> ToChunkedList<T, TValue>(this IQueryable<T> query, IEnumerable<TValue> list, int chunkSize)
{
var finder = new FindExpressionVisitor<MethodCallExpression>();
finder.Visit(query.Expression);
var methodCalls = finder.FoundItems.Where(p => p.Method.Name == "Contains").Select(p => new ContainsExpression(p)).Where(p => p.IsLocalList()).ToList();
var localLists = methodCalls.Where(p => p.GetValue() == list).ToList();

If the local collection passed in the ToChunkedList method was found in the query expression, I replace the Contains call to the original list with a new call to a temporary list containing the ids for one batch.

if (localLists.Any()) {
var result = new List<T>();
var valueList = new List<TValue>();

var containsMethod = typeof(Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public)
.Single(p => p.Name == "Contains" && p.GetParameters().Count() == 2)
.MakeGenericMethod(typeof(TValue));

var queryExpression = query.Expression;

foreach (var item in localLists) {
var parameter = new List<Expression>();
parameter.Add(Expression.Constant(valueList));
if (item.MethodCall.Object == null) {
parameter.AddRange(item.MethodCall.Arguments.Skip(1));
} else {
parameter.AddRange(item.MethodCall.Arguments);
}

var call = Expression.Call(containsMethod, parameter.ToArray());

var replacer = new ExpressionReplacer(item.MethodCall,call);

queryExpression = replacer.Visit(queryExpression);
}

var chunkQuery = query.Provider.CreateQuery<T>(queryExpression);

for (int i = 0; i < Math.Ceiling((decimal)list.Count() / chunkSize); i++) {
valueList.Clear();
valueList.AddRange(list.Skip(i * chunkSize).Take(chunkSize));

result.AddRange(chunkQuery.ToList());
}
return result;
}
// if the collection was not found return query.ToList()
return query.ToList();

Expression Replacer:

private class ExpressionReplacer : ExpressionVisitor {

private Expression find, replace;

public ExpressionReplacer(Expression find, Expression replace)
{
this.find = find;
this.replace = replace;
}

public override Expression Visit(Expression node)
{
if (node == this.find)
return this.replace;

return base.Visit(node);
}
}

How do I use LINQ Contains(string[]) instead of Contains(string)

spoulson has it nearly right, but you need to create a List<string> from string[] first. Actually a List<int> would be better if uid is also int. List<T> supports Contains(). Doing uid.ToString().Contains(string[]) would imply that the uid as a string contains all of the values of the array as a substring??? Even if you did write the extension method the sense of it would be wrong.

[EDIT]

Unless you changed it around and wrote it for string[] as Mitch Wheat demonstrates, then you'd just be able to skip the conversion step.

[ENDEDIT]

Here is what you want, if you don't do the extension method (unless you already have the collection of potential uids as ints -- then just use List<int>() instead). This uses the chained method syntax, which I think is cleaner, and
does the conversion to int to ensure that the query can be used with more providers.

var uids = arrayofuids.Select(id => int.Parse(id)).ToList();

var selected = table.Where(t => uids.Contains(t.uid));


Related Topics



Leave a reply



Submit