Why Is .Contains Slow? Most Efficient Way to Get Multiple Entities by Primary Key

Why is .Contains slow? Most efficient way to get multiple entities by primary key?

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The analysis in this answer is great but largely obsolete since 2013.

Using Contains in Entity Framework is actually very slow. It's true that it translates into an IN clause in SQL and that the SQL query itself executes fast. But the problem, and the performance bottleneck, lies in the translation of your LINQ query into SQL. The expression tree that gets created is expanded into a long chain of OR concatenations because there is no native expression that represents an IN. When the SQL is created, this chain of many ORs is recognized and collapsed back into a single SQL IN clause.

This does not mean that using Contains is worse than issuing one query per element in your ids collection (your first option). It's probably still better, at least for collections that are not too large. But for large collections it is really bad. I remember testing a Contains query with about 12,000 elements some time ago: it worked, but took around a minute even though the SQL query itself executed in less than a second.

It might be worth testing the performance of a combination of multiple roundtrips to the database with a smaller number of elements in the Contains expression for each roundtrip.
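For example, a minimal chunking sketch (assuming ids is a List&lt;int&gt;, the Images set from your context, and that the key property is named Id; the chunk size of 1000 is arbitrary and would need tuning):

var result = new List<Image>();
for (int i = 0; i < ids.Count; i += 1000)
{
    var chunk = ids.Skip(i).Take(1000).ToList();
    result.AddRange(dbContext.Images
        .Where(img => chunk.Contains(img.Id)) // one IN (...) query per roundtrip
        .ToList());
}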

This approach, and also the limitations of using Contains with Entity Framework, are shown and explained here:

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

It's possible that a raw SQL command will perform best in this situation, which would mean that you call dbContext.Database.SqlQuery<Image>(sqlString) or dbContext.Images.SqlQuery(sqlString), where sqlString is the SQL shown in @Rune's answer.

Edit

Here are some measurements:

I have done this on a table with 550,000 records and 11 columns (IDs start from 1 without gaps) and picked 20,000 ids at random:

using (var context = new MyDbContext())
{
    Random rand = new Random();
    var ids = new List<int>();
    for (int i = 0; i < 20000; i++)
        ids.Add(rand.Next(550000));

    Stopwatch watch = new Stopwatch();
    watch.Start();

    // here are the code snippets from below

    watch.Stop();
    var msec = watch.ElapsedMilliseconds;
}

Test 1

var result = context.Set<MyEntity>()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> elapsed time = 85.5 sec

Test 2

var result = context.Set<MyEntity>().AsNoTracking()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> elapsed time = 84.5 sec

This tiny effect of AsNoTracking is very unusual. It indicates that the bottleneck is not object materialization (and not SQL as shown below).

For both tests it can be seen in SQL Profiler that the SQL query arrives at the database very late. (I didn't measure exactly but it was later than 70 seconds.) Obviously the translation of this LINQ query into SQL is very expensive.

Test 3

var values = new StringBuilder();
values.AppendFormat("{0}", ids[0]);
for (int i = 1; i < ids.Count; i++)
    values.AppendFormat(", {0}", ids[i]);

var sql = string.Format(
    "SELECT * FROM [MyDb].[dbo].[MyEntities] WHERE [ID] IN ({0})",
    values);

var result = context.Set<MyEntity>().SqlQuery(sql).ToList();

Result -> elapsed time = 5.1 sec

Test 4

// same as Test 3 but this time including AsNoTracking
var result = context.Set<MyEntity>().SqlQuery(sql).AsNoTracking().ToList();

Result -> elapsed time = 3.8 sec

This time the effect of disabling tracking is more noticeable.

Test 5

// same as Test 3 but this time using Database.SqlQuery
var result = context.Database.SqlQuery<MyEntity>(sql).ToList();

Result -> elapsed time = 3.7 sec

My understanding is that context.Database.SqlQuery<MyEntity>(sql) is the same as context.Set<MyEntity>().SqlQuery(sql).AsNoTracking(), so no difference was expected between Test 4 and Test 5.

(The length of the result sets was not always the same due to possible duplicates after the random id selection but it was always between 19600 and 19640 elements.)

Edit 2

Test 6

Even 20,000 roundtrips to the database are faster than using Contains:

var result = new List<MyEntity>();
foreach (var id in ids)
    result.Add(context.Set<MyEntity>().SingleOrDefault(e => e.ID == id));

Result -> elapsed time = 73.6 sec

Note that I have used SingleOrDefault instead of Find. Using the same code with Find is very slow (I cancelled the test after several minutes) because Find calls DetectChanges internally. Disabling auto change detection (context.Configuration.AutoDetectChangesEnabled = false) leads to roughly the same performance as SingleOrDefault. Using AsNoTracking reduces the time by one or two seconds.

Tests were done with the database client (a console app) and the database server on the same machine. The last result might get significantly worse with a "remote" database due to the many roundtrips.

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The approach described in this answer is no longer necessary.

You are right that most of the time is spent processing the translation of the query. EF's provider model doesn't currently include an expression that represents an IN clause, so ADO.NET providers can't support IN natively. Instead, the implementation of Enumerable.Contains translates it into a tree of OR expressions, i.e. for something that in C# looks like this:

new []{1, 2, 3, 4}.Contains(i)

... we will generate a DbExpression tree that could be represented like this:

((1 = @i) OR (2 = @i)) OR ((3 = @i) OR (4 = @i))

(The expression trees have to be balanced: if we had all the ORs over a single long spine, there would be a greater chance that the expression visitor would hit a stack overflow. Yes, we actually did hit that in our testing.)

We later send a tree like this to the ADO.NET provider, which can have the ability to recognize this pattern and reduce it to the IN clause during SQL generation.

When we added support for Enumerable.Contains in EF4, we thought it was desirable to do it without having to introduce support for IN expressions in the provider model, and honestly, 10,000 is much more than the number of elements we anticipated customers would pass to Enumerable.Contains. That said, I understand that this is an annoyance and that the manipulation of expression trees makes things too expensive in your particular scenario.

I discussed this with one of our developers and we believe that in the future we could change the implementation by adding first-class support for IN. I will make sure this is added to our backlog, but I cannot promise when it will make it in, given the many other improvements we would like to make.

To the workarounds already suggested in the thread I would add the following:

Consider creating a method that balances the number of database roundtrips with the number of elements you pass to Contains. For instance, in my own testing I observed that computing and executing the query with 100 elements against a local instance of SQL Server takes 1/60 of a second. If you can write your query in such a way that executing 100 queries with 100 different sets of ids would give you an equivalent result to the query with 10,000 elements, then you can get the results in approximately 1.67 seconds instead of 18 seconds.

Different chunk sizes will work better depending on the query and the latency of the database connection. Note that for certain queries, e.g. if the sequence passed has duplicates or if Enumerable.Contains is used in a nested condition, you may obtain duplicate elements in the results; a simple guard is sketched below.
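If duplicates matter in your case, one simple guard (a sketch, assuming the entity key is named Id and using the GetMainItems helper defined in the snippet below) is to dedupe the input up front and the combined results afterwards:

var distinctIds = ids.Distinct().ToList();
var items = context.GetMainItems(distinctIds)
    .GroupBy(m => m.Id)      // dedupe results client-side by key
    .Select(g => g.First())
    .ToList();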

Here is a code snippet. (Sorry if the code used to slice the input into chunks looks a little too complex; there are simpler ways to achieve the same thing, but I was trying to come up with a pattern that preserves streaming for the sequence, and I couldn't find anything like it in LINQ, so I probably overdid that part.)

Usage:

var list = context.GetMainItems(ids).ToList();

Method for context or repository:

public partial class ContainsTestEntities
{
    public IEnumerable<Main> GetMainItems(IEnumerable<int> ids, int chunkSize = 100)
    {
        foreach (var chunk in ids.Chunk(chunkSize))
        {
            var q = this.MainItems.Where(a => chunk.Contains(a.Id));
            foreach (var item in q)
            {
                yield return item;
            }
        }
    }
}

Extension methods for slicing enumerable sequences:

public static class EnumerableSlicing
{
    private class Status
    {
        public bool EndOfSequence;
    }

    private static IEnumerable<T> TakeOnEnumerator<T>(IEnumerator<T> enumerator, int count,
        Status status)
    {
        // Yield up to 'count' items; set EndOfSequence once the underlying
        // enumerator is exhausted. Note 'count-- > 0' (not '--count > 0'),
        // otherwise every chunk would come out one element short.
        while (count-- > 0 && (enumerator.MoveNext() || !(status.EndOfSequence = true)))
        {
            yield return enumerator.Current;
        }
    }

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> items, int chunkSize)
    {
        if (chunkSize < 1)
        {
            throw new ArgumentException("Chunks should not be smaller than 1 element");
        }
        var status = new Status { EndOfSequence = false };
        using (var enumerator = items.GetEnumerator())
        {
            while (!status.EndOfSequence)
            {
                // All chunks share the same enumerator, so they must be
                // consumed in order and each chunk only once.
                yield return TakeOnEnumerator(enumerator, chunkSize, status);
            }
        }
    }
}

Hope this helps!

Entity Framework + LINQ + Contains == Super Slow?

This is one of the pitfalls of deferred execution in LINQ: in your first approach StudentIds is really an IQueryable, not an in-memory collection. That means using it in the second query will run the underlying query again on the database, each and every time.

Forcing execution of the first query with ToArray() makes StudentIds an in-memory collection, and the Contains part in your second query runs over this fixed sequence of items. This gets mapped to something equivalent to a SQL WHERE StudentId IN (1, 2, 3, 4) clause.

This query will, of course, be much faster, since the sequence is determined once up front and not every time the Where clause is executed. Without ToArray(), your second query would (I would think) be mapped to a SQL query with a WHERE EXISTS (...) sub-query that gets evaluated for each row.
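A minimal sketch of the difference (Students, StudentSubjects and Age are assumed names standing in for the question's model):

// Deferred: studentIds stays an IQueryable; every query that uses it
// re-executes it on the database as a sub-query.
var studentIds = db.Students.Where(s => s.Age > 20).Select(s => s.Id);

// Materialized: ToArray() runs the query once; Contains over the
// in-memory array translates to a single WHERE StudentId IN (...) clause.
var idArray = studentIds.ToArray();
var subjects = db.StudentSubjects
    .Where(x => idArray.Contains(x.StudentId))
    .ToList();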

Entity Framework 6: Slow WHERE NOT IN statement

Sending a very large filter condition to the database like you are doing can be very slow. Depending on the size of the table, it can be much faster to retrieve all the values and do the filtering in memory using a HashSet instead. Try this:

Random rand = new Random();
var set = new HashSet<int>();
for (int i = 0; i < 20000; i++)
    set.Add(rand.Next(550000));

Stopwatch watch = new Stopwatch();
watch.Start();

var sql = "SELECT * FROM [MyDb].[dbo].[MyEntities]";

var result = context.Set<MyEntity>()
    .SqlQuery(sql)
    .AsEnumerable()
    .Where(x => !set.Contains(x.ID))
    .ToList();

watch.Stop();

Why is inserting entities in EF 4.1 so slow compared to ObjectContext?

As already indicated by Ladislav in the comment, you need to disable automatic change detection to improve performance:

context.Configuration.AutoDetectChangesEnabled = false;

This change detection is enabled by default in the DbContext API.

The reason why DbContext behaves so differently from the ObjectContext API is that, when automatic change detection is enabled, many more functions of the DbContext API call DetectChanges internally than functions of the ObjectContext API do.

Here you can find a list of those functions which call DetectChanges by default. They are:

  • The Add, Attach, Find, Local, or Remove members on DbSet
  • The GetValidationErrors, Entry, or SaveChanges members on DbContext
  • The Entries method on DbChangeTracker

In particular, Add calls DetectChanges, and this is responsible for the poor performance you experienced.

In contrast to this, the ObjectContext API calls DetectChanges automatically only in SaveChanges, but not in AddObject and the other corresponding methods mentioned above. That's the reason why the default performance of ObjectContext is faster.

Why did they introduce this default automatic change detection in DbContext in so many functions? I am not sure, but it seems that disabling it and calling DetectChanges manually at the proper points is considered advanced and can easily introduce subtle bugs into your application, so use it with care.
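For bulk inserts, the resulting pattern looks like the following sketch (MyDbContext, MyEntity and entitiesToInsert are placeholder names): disable automatic detection, add all entities, then trigger the scan once yourself.

using (var context = new MyDbContext())
{
    try
    {
        context.Configuration.AutoDetectChangesEnabled = false;
        foreach (var entity in entitiesToInsert)
        {
            context.Set<MyEntity>().Add(entity); // no DetectChanges call per Add now
        }
        // With automatic detection disabled, SaveChanges won't call
        // DetectChanges either, so run the scan once manually.
        context.ChangeTracker.DetectChanges();
        context.SaveChanges();
    }
    finally
    {
        context.Configuration.AutoDetectChangesEnabled = true;
    }
}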

DbSet.Find method ridiculously slow compared to .SingleOrDefault on ID

Find calls DetectChanges internally; SingleOrDefault (or generally any query) doesn't. DetectChanges is an expensive operation, so that's the reason why Find is slower (but it might become faster if the entity is already loaded into the context, because Find would not run a query but just return the loaded entity).
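A quick sketch of that cache behavior (42 is an arbitrary key value):

var first = context.Competitors.Find(42);  // entity not tracked yet: runs a query
var second = context.Competitors.Find(42); // served from the context, no SQL sent
// first and second reference the same tracked instance.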

If you want to use Find for a lot of entities - in a loop for example - you can disable automatic change detection like so (can't write it in VB, so a C# example):

try
{
    context.Configuration.AutoDetectChangesEnabled = false;
    foreach (var id in someIdCollection)
    {
        var competitor = context.Competitors.Find(id);
        // ...
    }
}
finally
{
    context.Configuration.AutoDetectChangesEnabled = true;
}

Now, Find won't call DetectChanges with every call and it should be as fast as SingleOrDefault (and faster if the entity is already attached to the context).

Automatic change detection is a complex and somewhat mysterious subject. A great detailed discussion can be found in this four-part series:

(Link to part 1, the links to parts 2, 3 and 4 are at the beginning of that article)

http://blog.oneunicorn.com/2012/03/10/secrets-of-detectchanges-part-1-what-does-detectchanges-do/

Scalable Contains method for LINQ against a SQL backend

I solved this problem with a slightly different approach a few months ago. Maybe it's a good solution for you too.

I didn't want my solution to change the query itself, so an ids.ChunkContains(p.Id) call or a special WhereContains method was unfeasible. The solution should also be able to combine a Contains with other filters, as well as use the same collection multiple times:

db.TestEntities.Where(p => (ids.Contains(p.Id) || ids.Contains(p.ParentId)) && p.Name.StartsWith("Test"))

So I tried to encapsulate the logic in a special ToList method that can rewrite the expression so that a specified collection is queried in chunks.

var ids = Enumerable.Range(1, 11);
var result = db.TestEntities.Where(p => ids.Contains(p.Id) && p.Name.StartsWith("Test"))
    .ToChunkedList(ids, 4);

To rewrite the expression tree, I discover all Contains method calls on local collections in the query with a few helper classes.

private class ContainsExpression
{
    public ContainsExpression(MethodCallExpression methodCall)
    {
        this.MethodCall = methodCall;
    }

    public MethodCallExpression MethodCall { get; private set; }

    public object GetValue()
    {
        var parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        return Expression.Lambda<Func<object>>(parent).Compile()();
    }

    public bool IsLocalList()
    {
        Expression parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        while (parent != null) {
            if (parent is ConstantExpression)
                return true;
            var member = parent as MemberExpression;
            if (member != null) {
                parent = member.Expression;
            } else {
                parent = null;
            }
        }
        return false;
    }
}

private class FindExpressionVisitor<T> : ExpressionVisitor where T : Expression
{
    public List<T> FoundItems { get; private set; }

    public FindExpressionVisitor()
    {
        this.FoundItems = new List<T>();
    }

    public override Expression Visit(Expression node)
    {
        var found = node as T;
        if (found != null) {
            this.FoundItems.Add(found);
        }
        return base.Visit(node);
    }
}

public static List<T> ToChunkedList<T, TValue>(this IQueryable<T> query, IEnumerable<TValue> list, int chunkSize)
{
    var finder = new FindExpressionVisitor<MethodCallExpression>();
    finder.Visit(query.Expression);
    var methodCalls = finder.FoundItems
        .Where(p => p.Method.Name == "Contains")
        .Select(p => new ContainsExpression(p))
        .Where(p => p.IsLocalList())
        .ToList();
    var localLists = methodCalls.Where(p => p.GetValue() == list).ToList();

If the local collection passed to the ToChunkedList method is found in the query expression, I replace the Contains call on the original list with a new call on a temporary list containing the ids for one batch:

    if (localLists.Any()) {
        var result = new List<T>();
        var valueList = new List<TValue>();

        var containsMethod = typeof(Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public)
            .Single(p => p.Name == "Contains" && p.GetParameters().Count() == 2)
            .MakeGenericMethod(typeof(TValue));

        var queryExpression = query.Expression;

        foreach (var item in localLists) {
            var parameter = new List<Expression>();
            parameter.Add(Expression.Constant(valueList));
            if (item.MethodCall.Object == null) {
                parameter.AddRange(item.MethodCall.Arguments.Skip(1));
            } else {
                parameter.AddRange(item.MethodCall.Arguments);
            }

            var call = Expression.Call(containsMethod, parameter.ToArray());

            var replacer = new ExpressionReplacer(item.MethodCall, call);

            queryExpression = replacer.Visit(queryExpression);
        }

        var chunkQuery = query.Provider.CreateQuery<T>(queryExpression);

        for (int i = 0; i < Math.Ceiling((decimal)list.Count() / chunkSize); i++) {
            valueList.Clear();
            valueList.AddRange(list.Skip(i * chunkSize).Take(chunkSize));

            result.AddRange(chunkQuery.ToList());
        }
        return result;
    }
    // if the collection was not found in the query, fall back to a plain ToList()
    return query.ToList();
}

Expression Replacer:

private class ExpressionReplacer : ExpressionVisitor
{
    private Expression find, replace;

    public ExpressionReplacer(Expression find, Expression replace)
    {
        this.find = find;
        this.replace = replace;
    }

    public override Expression Visit(Expression node)
    {
        if (node == this.find)
            return this.replace;

        return base.Visit(node);
    }
}

How to avoid Query Plan re-compilation when using IEnumerable.Contains in Entity Framework LINQ queries?

This is a great question. First of all, here are a couple of workarounds that come to mind (they all require changes to the query):

First workaround

This one may be a bit obvious and unfortunately not generally applicable: if the selection of items you need to pass to Enumerable.Contains already exists in a table in the database, you can write a query that calls Enumerable.Contains on the corresponding entity set in the predicate instead of bringing the items into memory first. An Enumerable.Contains call over data in the database should result in some kind of JOIN-based query that can be cached. E.g., assuming no navigation properties between Customers and SelectedCustomers, you should be able to write the query like this:

var q = db.Customers.Where(c =>
    db.SelectedCustomers.Select(s => s.Id).Contains(c.Id));

The syntax of the query with Any is a bit simpler in this case:

var q = db.Customers.Where(c =>
    db.SelectedCustomers.Any(s => s.Id == c.Id));

If you don't already have the necessary selection data stored in the database, you probably won't want the overhead of having to store it, so you should consider the next workaround.

Second workaround

If you know beforehand that you will have a relatively manageable maximum number of elements in the list you can replace Enumerable.Contains with a tree of OR-ed equality comparisons, e.g.:

var list = new [] {1,2,3};
var q = db.Customers.Where(c =>
    list[0] == c.Id ||
    list[1] == c.Id ||
    list[2] == c.Id);

This should produce a parameterized query that can be cached. If the list varies in size from query to query, this will produce a different cache entry for each list size. Alternatively, you could use a list of fixed size and pad it with a sentinel value that you know will never match the value argument, e.g. 0 or -1, or simply repeat one of the other values. In order to produce such a predicate expression programmatically at runtime based on a list, you might want to consider using something like PredicateBuilder.
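For instance, here is a sketch of the fixed-size variant in plain C# (FixedSize = 8 is arbitrary, and the sentinel -1 assumes no real Id is ever negative):

const int FixedSize = 8;      // use a few fixed sizes to limit cache entries
var padded = ids.ToList();    // assumes ids.Count <= FixedSize; combine with chunking otherwise
while (padded.Count < FixedSize)
    padded.Add(-1);           // sentinel that never matches a real Id

var q = db.Customers.Where(c =>
    padded[0] == c.Id || padded[1] == c.Id || padded[2] == c.Id ||
    padded[3] == c.Id || padded[4] == c.Id || padded[5] == c.Id ||
    padded[6] == c.Id || padded[7] == c.Id);

Because every value is read through the list indexer, each comparison is parameterized, and the SQL shape (and therefore the cache entry) stays the same for every call.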

Potential fixes and their challenges

On one hand, changes necessary to support caching of this kind of query using CompiledQuery explicitly would be pretty complex in the current version of EF. The key reason is that the elements in the IEnumerable<T> passed to the Enumerable.Contains method would have to translate into a structural part of the query for the particular translation we produce, e.g.:

var list = new [] {1,2,3};
var q = db.Customers.Where(c => list.Contains(c.Id)).ToList();

The enumerable “list” looks like a simple variable in C#/LINQ but it needs to be translated to a query like this (simplified for clarity):

SELECT * FROM Customers WHERE Id IN(1,2,3)

If list changes to new [] {5,4,3,2,1}, we would have to generate the SQL query again:

SELECT * FROM Customers WHERE Id IN(5,4,3,2,1)

As a potential solution, we have talked about leaving generated SQL queries open with some kind of special placeholder, e.g. storing in the query cache an entry that just says

SELECT * FROM Customers WHERE Id IN(<place holder>)

At execution time, we could pick this SQL from the cache and finish the SQL generation with the actual values. Another option would be to leverage a table-valued parameter for the list, if the target database can support it. The first option would probably work only with constant values; the latter requires a database that supports a special feature. Both are very complex to implement in EF.
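That said, on SQL Server you can already take the table-valued parameter route by hand with raw SQL. A hedged sketch, assuming you have created a user-defined table type dbo.IntList with a single int column Value (hypothetical name; requires System.Data and System.Data.SqlClient):

var table = new DataTable();
table.Columns.Add("Value", typeof(int));
foreach (var id in ids)
    table.Rows.Add(id);

var tvp = new SqlParameter("@ids", SqlDbType.Structured)
{
    TypeName = "dbo.IntList", // the user-defined table type in the database
    Value = table
};

var customers = db.Database
    .SqlQuery<Customer>(
        "SELECT c.* FROM dbo.Customers c JOIN @ids i ON c.Id = i.Value", tvp)
    .ToList();

The SQL text never changes here, so SQL Server caches a single plan regardless of how many ids you pass.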

Auto compiled queries

On the other hand, for automatic compiled queries (as opposed to explicit CompiledQuery) the issue becomes somewhat artificial: in this case we compute the query cache key after the initial LINQ translation, hence any IEnumerable<T> argument passed should have already been expanded into DbExpression nodes: a tree of OR-ed equality comparisons in EF5, and usually a single DbInExpression node in EF6. Since the query tree already contains a distinct expression for each distinct combination of elements in the source argument of Enumerable.Contains (and therefore for each distinct output SQL query), it is possible to cache the queries.

However, even in EF6 these queries are not cached in the auto compiled queries case. The key reason for that is that we expect the variability of elements in a list to be high (this has to do with the variable size of the list, but it is also exacerbated by the fact that we normally don't parameterize values that appear as constants to the query, so a list of constants will be translated into constant literals in SQL), so with enough calls to a query with Enumerable.Contains you could produce considerable cache pollution.

We have considered alternative solutions to this as well, but we haven't implemented any yet. So my conclusion is that you would be better off with the second workaround in most cases if, as I said, you know the number of elements in the list will remain small and manageable (otherwise you will face performance issues).

Hope this helps!


