Entity-framework code is slow when using Include() many times
tl;dr Multiple Includes blow up the SQL result set. Soon it becomes cheaper to load data by multiple database calls instead of running one mega statement. Try to find the best mixture of Include and Load statements.
it does seem that there is a performance penalty when using Include
That's an understatement! Multiple Includes quickly blow up the SQL query result both in width and in length. Why is that?
Growth factor of Includes
(This part applies to Entity Framework classic, v6 and earlier.)
Let's say we have
- a root entity Root
- a parent entity Root.Parent
- child entities Root.Children1 and Root.Children2
- a LINQ statement Root.Include("Parent").Include("Children1").Include("Children2")
This builds a SQL statement that has the following structure:
SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children1
UNION
SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children2
These <PseudoColumns> consist of expressions like CAST(NULL AS int) AS [C2], and they serve to give all UNION-ed queries the same number of columns. The first part adds pseudo columns for Children2, the second part adds pseudo columns for Children1.
This is what it means for the size of the SQL result set:
- The number of columns in the SELECT clause is the sum of all columns in the four tables
- The number of rows is the sum of records in the included child collections
Since the total number of data points is columns * rows, and each additional Include grows both factors, each Include increases the total number of data points in the result set more than linearly (quadratically, in fact). Let me demonstrate that by taking Root again, now with an additional Children3 collection. If all tables have 5 columns and 100 rows, we get:
- One Include (Root + 1 child collection): 10 columns * 100 rows = 1000 data points.
- Two Includes (Root + 2 child collections): 15 columns * 200 rows = 3000 data points.
- Three Includes (Root + 3 child collections): 20 columns * 300 rows = 6000 data points.
With 12 Includes this would amount to 78000 data points!
Conversely, if you get all records of each table by a separate query instead of using 12 Includes, you have 13 * 5 * 100 data points: 6500, less than 10%!
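The arithmetic above can be sketched in a few lines. This is Python rather than C#, purely to model the numbers; the function names and the defaults of 5 columns and 100 rows are the article's illustrative assumptions, not anything EF exposes:

```python
# Model the result-set size of an EF "classic" query with n child-collection
# Includes, under the assumptions above: every table has `cols` columns and
# every child collection contributes `rows` rows.

def include_data_points(n_includes, cols=5, rows=100):
    """Total data points (columns * rows) of the UNION-ed SQL result set."""
    total_columns = cols * (n_includes + 1)  # root table + n child tables
    total_rows = rows * n_includes           # one UNION part per collection
    return total_columns * total_rows

def separate_queries_data_points(n_tables, cols=5, rows=100):
    """Total data points when each table is loaded by its own query."""
    return n_tables * cols * rows

for n in (1, 2, 3, 12):
    print(n, include_data_points(n))
# 1 -> 1000, 2 -> 3000, 3 -> 6000, 12 -> 78000

print(separate_queries_data_points(13))  # 6500
```

Because total_columns and total_rows each grow linearly with the number of Includes, their product grows quadratically, which is why the gap widens so quickly.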
Now these numbers are somewhat exaggerated, in that many of these data points will be null, so they don't contribute much to the actual size of the result set that is sent to the client. But the query size and the work for the query optimizer are certainly affected negatively by increasing numbers of Includes.
Balance
So using Includes is a delicate balance between the cost of database calls and data volume. It's hard to give a rule of thumb, but by now you can imagine that the data volume generally quickly outgrows the cost of extra calls if there are more than ~3 Includes for child collections (though quite a bit more for parent Includes, which only widen the result set).
Alternative
The alternative to Include is to load data in separate queries:
context.Configuration.LazyLoadingEnabled = false;
var rootId = 1;
context.Children1.Where(c => c.RootId == rootId).Load();
context.Children2.Where(c => c.RootId == rootId).Load();
return context.Roots.Find(rootId);
This loads all required data into the context's cache. During this process, EF executes relationship fixup, by which it auto-populates navigation properties (Root.Children1 etc.) with the loaded entities. The end result is identical to the statement with Includes, except for one important difference: the child collections are not marked as loaded in the entity state manager, so EF would try to trigger lazy loading if you accessed them. That's why it's important to turn lazy loading off.
In reality, you will have to figure out which combination of Include and Load statements works best for you.
Other aspects to consider
Each Include also increases query complexity, so the database's query optimizer has to make ever more effort to find the best query plan. At some point this may no longer succeed. Also, when some vital indexes are missing (especially on foreign keys), performance may suffer from adding Includes, even with the best query plan.
Entity Framework core
Cartesian explosion
For some reason, the behavior described above - UNION-ed queries - was abandoned as of EF Core 3. It now builds one query with joins. When the query is "star" shaped1 this leads to Cartesian explosion (of the SQL result set). I can only find a note announcing this breaking change, but it doesn't say why.
Split queries
To counter this Cartesian explosion, Entity Framework core 5 introduced the concept of split queries that enables loading related data in multiple queries. It prevents building one massive, multiplied SQL result set. Also, because of lower query complexity, it may reduce the time it takes to fetch data even with multiple roundtrips. However, it may lead to inconsistent data when concurrent updates occur.
1 Multiple 1:n relationships off of the query root.
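The difference between the two strategies is easy to see numerically. A sketch in Python (illustrative arithmetic only; the function names are not EF API): joining several independent child collections off the same root in one statement multiplies their row counts, while split queries only add them.

```python
# One root entity with several 1:n child collections ("star" shape).
# A single joined query returns the Cartesian product of the collections;
# split queries (EF Core 5+) fetch each collection separately.

def single_query_rows(child_row_counts):
    """Rows returned when all collections are joined in one statement."""
    rows = 1
    for n in child_row_counts:
        rows *= n
    return rows

def split_query_rows(child_row_counts):
    """Rows returned across one query per collection, plus one for the root."""
    return 1 + sum(child_row_counts)

# One root with three collections of 100 children each:
print(single_query_rows([100, 100, 100]))  # 1000000 rows
print(split_query_rows([100, 100, 100]))   # 301 rows
```

Three modest collections of 100 rows already produce a million-row result set in the single-query case, which is exactly the explosion AsSplitQuery is meant to avoid (at the price of multiple roundtrips and possible inconsistency under concurrent updates).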
Entity Framework Include performance
Your second approach relies on the EF navigation property fixup process. The problem, though, is that every
query.Include(q => q.ItemNavN).Load();
statement will also include all the master record data along with the related entity data.
Using the same basic idea, one potential improvement could be to execute one Load per navigation property, replacing the Include with either Select (for references) or SelectMany (for collections) - similar to how EF Core processes Includes internally.
Taking your second approach example, you could try the following and compare the performance:
var query = ctx.Filters.Where(x => x.SessionId == id)
    .Join(ctx.Items, f => f.ItemId, i => i.Id, (f, i) => i);
query.Select(x => x.ItemNav1).Load();
query.Select(x => x.ItemNav2).Load();
query.Select(x => x.ItemNav3).Load();
query.Select(x => x.ItemNav4).Load();
query.Select(x => x.ItemNav5).Load();
query.Select(x => x.ItemNav6).Load();
var result = query.ToList();
// here all the navigation properties should be populated
Query using Entity Framework is Very SLOW
Have the entities been set up with navigation properties? For instance, does the Diagnostico entity have something like the below declared?
public virtual Parceiro ParceiroPf { get; set; }
public virtual Parceiro ParceiroPj { get; set; }
From reading the second part of the query it does look like there are related entities mapped.
If you have navigation properties at your disposal then you can structure those queries to use the navigation properties rather than the embedded SQL. As mentioned, that way of querying is vulnerable to SQL injection and it should be a priority to eliminate it.
The performance cost you are likely seeing is due to the manual lazy-load that is being done to populate the various related details for the query results.
At a minimum, you can speed up the loading of these related details by first extracting your "idDiagnostico" values from the query results, using those to load all of the related child records in one hit, and then associating them with their respective Diagnostico entities.
So, assuming you need to keep the SQL query at least to begin with:
// ... load SQL based initial data ...
List<RetornaRelatorioEnvioSmsModel> models = ctx.Database.SqlQuery<RetornaRelatorioEnvioSmsModel>(query).ToList();
if (models.Count == 0)
    return models;
// Fetch all applicable IDs.
var diagnosticoIds = models.Select(x => x.idDiagnostico).ToList();
// Load the related data for *all* applicable diagnostico IDs above in one hit.
// Keep the diagnostico ID alongside each view model so the results can be
// split among their respective diagnostico afterwards.
var formasContatos = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .Select(x => new
    {
        x.id_diagnostico,
        Model = new RetornaRelatorioEnvioSmsFormaContatoModel
        {
            formaContato = x.tb_parceiro.tb_tipo_forma_contato.nm_tipo_forma_contato
        }
    }).ToList();
var temas = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .SelectMany(x => x.tb_diagnosticoperfiltema, (d, pt) => new
    {
        d.id_diagnostico,
        Model = new RetornaRelatorioEnvioSmsTemaModel
        {
            tema = pt.tb_perfiltema.tb_tema.nm_tema,
            nivel = pt.tb_nivelmaturidade.nm_nivel
        }
    }).ToList();
// This part is a bit tricky... It looks like you want the lowest nu_prioridade
// of the highest nu_pontuacao per diagnostico.
var temaPrioritarios = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .SelectMany(x => x.tb_diagnosticoperfiltema) // from diagnostico
    .GroupBy(x => x.tb_diagnostico.id_diagnostico) // group by diagnostico ID. Requires bi-directional references...
    .Select(g => new
    {
        id_diagnostico = g.Key,
        Model = g.OrderByDescending(y => y.nu_pontuacao)
            .ThenBy(y => y.tb_perfiltema.nu_prioridade)
            .Select(y => new RetornaRelatorioEnvioSmsTemaModel
            {
                tema = y.tb_perfiltema.tb_tema.nm_tema,
                nivel = y.tb_nivelmaturidade.nm_nivel
            }).FirstOrDefault()
    }).ToList();
// Caveat: the above needs to be tested, but it should give you an idea of how to select the desired data.
foreach (var model in models)
{
    model.formasContato = formasContatos.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).ToList();
    model.temas = temas.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).ToList();
    model.temaPrioritario = temaPrioritarios.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).FirstOrDefault();
}
With the navigation properties, though, this can all be done away with and loaded from the initial data model retrieved. It's a pretty complex model, and the (Portuguese?) naming convention makes it a bit hard to follow, but hopefully that gives you some ideas on how to tackle the performance issues.
Why is Entity Framework so slow to add multiple items in one SaveChanges()?
While doing some research on your question I came across this enlightening article: http://www.codinghelmet.com/?path=howto/bulk-insert
Here's a quote:
Each object that was inserted required two SQL statements - one to insert the record, and an additional one to obtain the identity of the new record
This becomes a problem when inserting multiple records, a problem intensified by the fact that each record is inserted one at a time (but this is outside the context of your question, since you're already testing the one-by-one insert). So if you're inserting 200 records, that's 400 SQL statements being executed one by one.
So from my understanding, EF is simply not built for bulk insertion, even if it's as simple as inserting 200 records. Which to me seems like a big letdown.
I started thinking, "Then what is EF good for anyway? It can't even insert a couple of records." Well, I'll give EF props in two areas:
- Select Queries: It's very easy to write your queries and get data into your application quickly.
- Simplifying insertion of complex records. If you've ever had a table with a lot of foreign keys and you've tried inserting all the linked records in one transaction, you know what I'm talking about. Thankfully EF inserts each record in order and links all the related records in one single transaction for you. But as mentioned above, this comes at a cost.
So simply put, it seems that if you have an operation that requires inserting a bunch of records, it might be best to use SqlBulkCopy, which can insert thousands of records in seconds.
I know this might not be the answer you want to hear - believe me, it upsets me as well, since I use EF a lot - but I don't see any way around it.
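The quoted numbers reduce to simple arithmetic. A back-of-the-envelope sketch in Python (the 1 ms round-trip latency is an illustrative assumption, not a measurement, and this models statement counts only, not EF's actual internals):

```python
# Cost model based on the quoted article: EF6 issues two statements per
# inserted row (the INSERT plus an identity lookup), executed one at a time.

def ef_statement_count(rows):
    """One INSERT plus one identity-fetch statement per row."""
    return 2 * rows

def ef_insert_time_ms(rows, round_trip_ms=1.0):
    """Each statement pays one network round trip in this simplified model."""
    return ef_statement_count(rows) * round_trip_ms

print(ef_statement_count(200))  # 400 statements for 200 records
print(ef_insert_time_ms(200))   # 400.0 ms at an assumed 1 ms per round trip
```

A bulk copy, by contrast, streams all rows in a single operation, which is why the gap grows linearly with the number of records inserted.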