Why Does the Contains() Operator Degrade Entity Framework's Performance So Dramatically

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The approach described in this answer is no longer necessary.

You are right that most of the time is spent processing the translation of the query. EF's provider model doesn't currently include an expression that represents an IN clause, therefore ADO.NET providers can't support IN natively. Instead, the implementation of Enumerable.Contains translates it to a tree of OR expressions, i.e. for something that in C# looks like like this:

new []{1, 2, 3, 4}.Contains(i)

... we will generate a DbExpression tree that could be represented like this:

((1 = @i) OR (2 = @i)) OR ((3 = @i) OR (4 = @i))

(The expression trees have to be balanced because if we had all the ORs over a single long spine there would be more chances that the expression visitor would hit a stack overflow (yes, we actually did hit that in our testing))

We later send a tree like this to the ADO.NET provider, which can have the ability to recognize this pattern and reduce it to the IN clause during SQL generation.

When we added support for Enumerable.Contains in EF4, we thought it was desirable to do it without having to introduce support for IN expressions in the provider model, and honestly, 10,000 is much more than the number of elements we anticipated customers would pass to Enumerable.Contains. That said, I understand that this is an annoyance and that the manipulation of expressions trees makes things too expensive in your particular scenario.

I discussed this with one of our developers and we believe that in the future we could change the implementation by adding first-class support for IN. I will make sure this is added to our backlog, but I cannot promise when it will make it given there are many other improvements we would like to make.

To the workarounds already suggested in the thread I would add the following:

Consider creating a method that balances the number of database roundtrips with the number of elements you pass to Contains. For instance, in my own testing I observed that computing and executing against a local instance of SQL Server the query with 100 elements takes 1/60 of a second. If you can write your query in such a way that executing 100 queries with 100 different sets of ids would give you equivalent result to the query with 10,000 elements, then you can get the results in aproximately 1.67 seconds instead of 18 seconds.

Different chunk sizes should work better depending on the query and the latency of the database connection. For certain queries, i.e. if the sequence passed has duplicates or if Enumerable.Contains is used in a nested condition you may obtain duplicate elements in the results.

Here is a code snippet (sorry if the code used to slice the input into chunks looks a little too complex. There are simpler ways to achieve the same thing, but I was trying to come up with a pattern that preserves streaming for the sequence and I couldn't find anything like it in LINQ, so I probably overdid that part :) ):

Usage:

var list = context.GetMainItems(ids).ToList();

Method for context or repository:

public partial class ContainsTestEntities
{
public IEnumerable<Main> GetMainItems(IEnumerable<int> ids, int chunkSize = 100)
{
foreach (var chunk in ids.Chunk(chunkSize))
{
var q = this.MainItems.Where(a => chunk.Contains(a.Id));
foreach (var item in q)
{
yield return item;
}
}
}
}

Extension methods for slicing enumerable sequences:

public static class EnumerableSlicing
{

private class Status
{
public bool EndOfSequence;
}

private static IEnumerable<T> TakeOnEnumerator<T>(IEnumerator<T> enumerator, int count,
Status status)
{
while (--count > 0 && (enumerator.MoveNext() || !(status.EndOfSequence = true)))
{
yield return enumerator.Current;
}
}

public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> items, int chunkSize)
{
if (chunkSize < 1)
{
throw new ArgumentException("Chunks should not be smaller than 1 element");
}
var status = new Status { EndOfSequence = false };
using (var enumerator = items.GetEnumerator())
{
while (!status.EndOfSequence)
{
yield return TakeOnEnumerator(enumerator, chunkSize, status);
}
}
}
}

Hope this helps!

Why does the Contains() operator degrade Entity Framework' Linq queries?

Yes Contains() will degrade performance heavily.

So You can try below mentioned solution.

We were able to solve the EF Contains problem by adding an intermediate table and joining on that table from LINQ query that needed to use Contains clause. We were able to get amazing results with this approach. We have a large EF model and as "Contains" is not allowed when pre-compiling EF queries we were getting very poor performance for queries that use "Contains" clause.

An overview:

  1. Create a table in SQL Server - for example HelperForContainsOfIntType with HelperID of Guid data-type and ReferenceID of int data-type columns. Create different tables with ReferenceID of differing data-types as needed.

  2. Create an Entity / EntitySet for HelperForContainsOfIntType and other such tables in EF model. Create different Entity / EntitySet for different data-types as needed.

  3. Create a helper method in .NET code which takes the input of an IEnumerable and returns an Guid. This method generates a new Guid and inserts the values from IEnumerable into HelperForContainsOfIntType along with the generated Guid. Next, the method returns this newly generated Guid to the caller. For fast inserting into HelperForContainsOfIntType table, create a stored-procedure which takes input of an list of values and does the insertion. See Table-Valued Parameters in SQL Server 2008 (ADO.NET). Create different helpers for different data-types or create a generic helper method to handle different data-types.

Create a EF compiled query which is similar to something like below:

static Func<MyEntities, Guid, IEnumerable<Customer>> _selectCustomers =
CompiledQuery.Compile(
(MyEntities db, Guid containsHelperID) =>
from cust in db.Customers
join x in db.HelperForContainsOfIntType on cust.CustomerID equals x.ReferenceID where x.HelperID == containsHelperID
select cust
);

Call the helper method with values to be used in the Contains clause and get the Guid to use in the query. For example:

var containsHelperID = dbHelper.InsertIntoHelperForContainsOfIntType(new int[] { 1, 2, 3 });
var result = _selectCustomers(_dbContext, containsHelperID).ToList();

I got that from Here

Why does ToString() degrade Entity Framework's performance so dramatically

So I ran the following two expressions through LINQPad on AdventureWorks to test out a theory:

from p in Persons
join row in BusinessEntityAddresses
on new {p.BusinessEntityID} equals new {row.BusinessEntityID}
select new { V = row.ModifiedDate }

and

from p in Persons
join row in BusinessEntityAddresses
on new {p.BusinessEntityID} equals new {row.BusinessEntityID}
select new { V = row.ModifiedDate.ToString() }

Both had some reflection involved:

IL_017D:  ldtoken     <>f__AnonymousType1<System.String>.get_V
IL_0182: ldtoken <>f__AnonymousType1<System.String>
IL_0187: call System.Reflection.MethodBase.GetMethodFromHandle
IL_018C: castclass System.Reflection.MethodInfo

But the version with select new ... .ToString() generated several additional reflection calls:

IL_014E:  call        System.Reflection.FieldInfo.GetFieldFromHandle
IL_0153: call System.Linq.Expressions.Expression.Field
IL_0158: ldtoken System.Object.ToString
IL_015D: call System.Reflection.MethodBase.GetMethodFromHandle
IL_0162: castclass System.Reflection.MethodInfo
IL_0167: ldc.i4.0
IL_0168: newarr System.Linq.Expressions.Expression
IL_016D: call System.Linq.Expressions.Expression.Call
IL_0172: stelem.ref
IL_0173: ldloc.1 // CS$0$0001
IL_0174: ldc.i4.1
IL_0175: newarr System.Reflection.MethodInfo
IL_017A: stloc.2 // CS$0$0002
IL_017B: ldloc.2 // CS$0$0002
IL_017C: ldc.i4.0

It seems like calling .ToString in the LINQ Query is causing extra (duplicative) reflection overhead for each entry. When you call it afterwards, the compiler knows how to inline the reflection to get to .ToString more efficiently.

It seems like the bottom line is that you should avoid calling methods within LINQ Queries, as additional reflection overhead (GetMethodFromHandle, MethodInfo) gets baked into your query in ways that you wouldn't expect.

Why is .Contains slow? Most efficient way to get multiple entities by primary key?

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The analysis in this answer is great but largely obsolete since 2013.

Using Contains in Entity Framework is actually very slow. It's true that it translates into an IN clause in SQL and that the SQL query itself is executed fast. But the problem and the performance bottleneck is in the translation from your LINQ query into SQL. The expression tree which will be created is expanded into a long chain of OR concatenations because there is no native expression which represents an IN. When the SQL is created this expression of many ORs is recognized and collapsed back into the SQL IN clause.

This does not mean that using Contains is worse than issuing one query per element in your ids collection (your first option). It's probably still better - at least for not too large collections. But for large collections it is really bad. I remember that I had tested some time ago a Contains query with about 12.000 elements which worked but took around a minute even though the query in SQL executed in less than a second.

It might be worth to test the performance of a combination of multiple roundtrips to the database with a smaller number of elements in a Contains expression for each roundtrip.

This approach and also the limitations of using Contains with Entity Framework is shown and explained here:

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

It's possible that a raw SQL command will perform best in this situation which would mean that you call dbContext.Database.SqlQuery<Image>(sqlString) or dbContext.Images.SqlQuery(sqlString) where sqlString is the SQL shown in @Rune's answer.

Edit

Here are some measurements:

I have done this on a table with 550000 records and 11 columns (IDs start from 1 without gaps) and picked randomly 20000 ids:

using (var context = new MyDbContext())
{
Random rand = new Random();
var ids = new List<int>();
for (int i = 0; i < 20000; i++)
ids.Add(rand.Next(550000));

Stopwatch watch = new Stopwatch();
watch.Start();

// here are the code snippets from below

watch.Stop();
var msec = watch.ElapsedMilliseconds;
}

Test 1

var result = context.Set<MyEntity>()
.Where(e => ids.Contains(e.ID))
.ToList();

Result -> msec = 85.5 sec

Test 2

var result = context.Set<MyEntity>().AsNoTracking()
.Where(e => ids.Contains(e.ID))
.ToList();

Result -> msec = 84.5 sec

This tiny effect of AsNoTracking is very unusual. It indicates that the bottleneck is not object materialization (and not SQL as shown below).

For both tests it can be seen in SQL Profiler that the SQL query arrives at the database very late. (I didn't measure exactly but it was later than 70 seconds.) Obviously the translation of this LINQ query into SQL is very expensive.

Test 3

var values = new StringBuilder();
values.AppendFormat("{0}", ids[0]);
for (int i = 1; i < ids.Count; i++)
values.AppendFormat(", {0}", ids[i]);

var sql = string.Format(
"SELECT * FROM [MyDb].[dbo].[MyEntities] WHERE [ID] IN ({0})",
values);

var result = context.Set<MyEntity>().SqlQuery(sql).ToList();

Result -> msec = 5.1 sec

Test 4

// same as Test 3 but this time including AsNoTracking
var result = context.Set<MyEntity>().SqlQuery(sql).AsNoTracking().ToList();

Result -> msec = 3.8 sec

This time the effect of disabling tracking is more noticable.

Test 5

// same as Test 3 but this time using Database.SqlQuery
var result = context.Database.SqlQuery<MyEntity>(sql).ToList();

Result -> msec = 3.7 sec

My understanding is that context.Database.SqlQuery<MyEntity>(sql) is the same as context.Set<MyEntity>().SqlQuery(sql).AsNoTracking(), so there is no difference expected between Test 4 and Test 5.

(The length of the result sets was not always the same due to possible duplicates after the random id selection but it was always between 19600 and 19640 elements.)

Edit 2

Test 6

Even 20000 roundtrips to the database are faster than using Contains:

var result = new List<MyEntity>();
foreach (var id in ids)
result.Add(context.Set<MyEntity>().SingleOrDefault(e => e.ID == id));

Result -> msec = 73.6 sec

Note that I have used SingleOrDefault instead of Find. Using the same code with Find is very slow (I cancelled the test after several minutes) because Find calls DetectChanges internally. Disabling auto change detection (context.Configuration.AutoDetectChangesEnabled = false) leads to roughly the same performance as SingleOrDefault. Using AsNoTracking reduces the time by one or two seconds.

Tests were done with database client (console app) and database server on the same machine. The last result might get significantly worse with a "remote" database due to the many roundtrips.

What are the ways to optimize Entity Framework queries with Contains()?

It's almost certain that splitting the query in pieces performs better, in spite of more db round trips. It is always advised to limit the number of includes, because they not only blow up the size and complexity of the query (as you noticed) but also blow up the result set both in length and in width. Moreover, they often get translated into outer joins.

Apart from that, using Contains the way you do is OK.

Sorry, it is hard to be more specific without knowing your data model and the size of the tables involved.

How to avoid Query Plan re-compilation when using IEnumerable.Contains in Entity Framework LINQ queries?

This is a great question. First of all, here are a couple of workarounds that come to mind (they all require changes to the query):

First workaround

This one maybe a bit obvious and unfortunately not generally applicable: If the selection of items you would need to pass over to Enumerable.Contains already exists in a table in the database, you can write a query that calls Enumerable.Contains on the corresponding entity set in the predicate instead of bringing the items into memory first. An Enumerable.Contains call over data in the database should result in some kind of JOIN-based query that can be cached. E.g. assuming no navigation properties between Customers and SelectedCustomers, you should be able to write the query like this:

var q = db.Customers.Where(c => 
db.SelectedCustomers.Select(s => s.Id).Contains(c.Id));

The syntax of the query with Any is a bit simpler in this case:

var q = db.Customers.Where(c => 
db.SelectedCustomers.Any(s => s.Id == c.Id));

If you don't already have the necessary selection data stored in the database, you will probably don't want the overhead of having to store it, so you should consider the next workaround.

Second workaround

If you know beforehand that you will have a relatively manageable maximum number of elements in the list you can replace Enumerable.Contains with a tree of OR-ed equality comparisons, e.g.:

var list = new [] {1,2,3};
var q = db.Customers.Where(c =>
list[0] == c.Id ||
list[1] == c.Id ||
list[2] == c.Id );

This should produce a parameterized query that can be cached. If the list varies in size from query to query, this should produce a different cache entry for each list size. Alternatively you could use a list with a fixed size and pass some sentinel value that you know will never match the value argument, e.g. 0, -1, or alternatively just repeat one of the other values. In order to produce such predicate expression programmatically at runtime based on a list, you might want to consider using something like PredicateBuilder.

Potential fixes and their challenges

On one hand, changes necessary to support caching of this kind of query using CompiledQuery explicitly would be pretty complex in the current version of EF. The key reason is that the elements in the IEnumerable<T> passed to the Enumerable.Contains method would have to translate into a structural part of the query for the particular translation we produce, e.g.:

var list = new [] {1,2,3};
var q = db.Customers.Where(c => list.Contains(c.Id)).ToList();

The enumerable “list” looks like a simple variable in C#/LINQ but it needs to be translated to a query like this (simplified for clarity):

SELECT * FROM Customers WHERE Id IN(1,2,3)

If list changes to new [] {5,4,3,2,1}, and we would have to generate the SQL query again!

SELECT * FROM Customers WHERE Id IN(5,4,3,2,1)

As a potential solution, we have talked about leaving generated SQL queries open with some kind of special place holder, e.g. store in the query cache that just says

SELECT * FROM Customers WHERE Id IN(<place holder>)

At execution time, we could pick this SQL from the cache and finish the SQL generation with the actual values. Another option would be to leverage a Table-Valued Parameter for the list if the target database can support it. The first option would probably work ok only with constant values, the latter requires a database that supports a special feature. Both are very complex to implement in EF.

Auto compiled queries

On the other hand, for automatic compiled queries (as opposed to explicit CompiledQuery) the issue becomes somewhat artificial: in this case we compute the query cache key after the initial LINQ translation, hence any IEnumerable<T> argument passed should have already been expanded into DbExpression nodes: a tree of OR-ed equality comparisons in EF5, and usually a single DbInExpression node in EF6. Since the query tree already contains a distinct expression for each distinct combination of elements in the source argument of Enumerable.Contains (and therefore for each distinct output SQL query), it is possible to cache the queries.

However even in EF6 these queries are not cached even in the auto compiled queries case. The key reason for that is that we expect the variability of elements in a list to be high (this has to do with the variable size of the list but is also exacerbated by the fact that we normally don't parameterize values that appear as constants to the query, so a list of constants will be translated into constant literals in SQL), so with enough calls to a query with Enumerable.Contains you could produce considerable cache pollution.

We have considered alternative solutions to this as well, but we haven't implemented any yet. So my conclusion is that you would be better off with the second workaround in most cases if as I said, you know the number of elements in the list will remain small and manageable (otherwise you will face performance issues).

Hope this helps!

What is causing EF Core to retrieve data much slower then the actual SQL Query and suggestions to speed up?

You can try using Split Queries, instead of generating one big query, eg:

  _context.Organizations
.Where(predicate)
.Include(a => a.OrganizationType)
.Include(a => a.OrganizationLicenses)
.Include(a => a.Contacts)
.ThenInclude(b => b.Phones.Where(p => p.IsActive))
.ThenInclude(a => a.PhoneType)
.AsSplitQuery()
.Load();

which is intended primarily to reduce the load on the database engine by sending simpler queries, but a side-effect is that EF doesn't have to extract multiple entities from a single tabular result.



Related Topics



Leave a reply



Submit