Entity-framework code is slow when using Include() many times
tl;dr Multiple Includes blow up the SQL result set. Soon it becomes cheaper to load data by multiple database calls instead of running one mega statement. Try to find the best mixture of Include and Load statements.
it does seem that there is a performance penalty when using Include
That's an understatement! Multiple Includes quickly blow up the SQL query result both in width and in length. Why is that?
Growth factor of Includes
(This part applies to Entity Framework classic, v6 and earlier.)
Let's say we have
- a root entity Root
- a parent entity Root.Parent
- child entities Root.Children1 and Root.Children2
- a LINQ statement Root.Include("Parent").Include("Children1").Include("Children2")
This builds a SQL statement that has the following structure:
SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children1
UNION
SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children2
These <PseudoColumns> consist of expressions like CAST(NULL AS int) AS [C2], and they serve to give all UNION-ed queries the same number of columns. The first part adds pseudo columns for Children2, the second part adds pseudo columns for Children1.
This is what it means for the size of the SQL result set:
- The number of columns in the SELECT clause is the sum of all columns in the four tables
- The number of rows is the sum of records in the included child collections
Since the total number of data points is columns * rows, and each additional Include grows both factors, each Include increases the total number of data points in the result set more than linearly (quadratically, in fact). Let me demonstrate that by taking Root again, now with an additional Children3 collection. If all tables have 5 columns and 100 rows, we get:
- One Include (Root + 1 child collection): 10 columns * 100 rows = 1000 data points.
- Two Includes (Root + 2 child collections): 15 columns * 200 rows = 3000 data points.
- Three Includes (Root + 3 child collections): 20 columns * 300 rows = 6000 data points.
With 12 Includes this would amount to 78000 data points!
Conversely, if you get all records of each table by a separate query instead of using 12 Includes, you have 13 * 5 * 100 data points: 6500, less than 10%!
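The arithmetic above can be sketched in a few lines. This is Python rather than C#, purely to model the numbers; the function names and the defaults of 5 columns and 100 rows are the article's illustrative assumptions, not anything EF exposes:

```python
# Model the result-set size of an EF "classic" query with n child-collection
# Includes, under the assumptions above: every table has `cols` columns and
# every child collection contributes `rows` rows.

def include_data_points(n_includes, cols=5, rows=100):
    """Total data points (columns * rows) of the UNION-ed SQL result set."""
    total_columns = cols * (n_includes + 1)  # root table + n child tables
    total_rows = rows * n_includes           # one UNION part per collection
    return total_columns * total_rows

def separate_queries_data_points(n_tables, cols=5, rows=100):
    """Total data points when each table is loaded by its own query."""
    return n_tables * cols * rows

for n in (1, 2, 3, 12):
    print(n, include_data_points(n))
# 1 -> 1000, 2 -> 3000, 3 -> 6000, 12 -> 78000

print(separate_queries_data_points(13))  # 6500
```

Because total_columns and total_rows each grow linearly with the number of Includes, their product grows quadratically, which is why the gap widens so quickly.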
Now these numbers are somewhat exaggerated, in that many of these data points will be null, so they don't contribute much to the actual size of the result set that is sent to the client. But the query size and the work for the query optimizer are certainly affected negatively by increasing numbers of Includes.
Balance
So using Includes is a delicate balance between the cost of database calls and data volume. It's hard to give a rule of thumb, but by now you can imagine that the data volume generally quickly outgrows the cost of extra calls if there are more than ~3 Includes for child collections (though quite a bit more for parent Includes, which only widen the result set).
Alternative
The alternative to Include is to load data in separate queries:
context.Configuration.LazyLoadingEnabled = false;
var rootId = 1;
context.Children1.Where(c => c.RootId == rootId).Load();
context.Children2.Where(c => c.RootId == rootId).Load();
return context.Roots.Find(rootId);
This loads all required data into the context's cache. During this process, EF executes relationship fixup, by which it auto-populates navigation properties (Root.Children1 etc.) with the loaded entities. The end result is identical to the statement with Includes, except for one important difference: the child collections are not marked as loaded in the entity state manager, so EF would try to trigger lazy loading if you accessed them. That's why it's important to turn lazy loading off.
In reality, you will have to figure out which combination of Include and Load statements works best for you.
Other aspects to consider
Each Include also increases query complexity, so the database's query optimizer has to make ever more effort to find the best query plan. At some point this may no longer succeed. Also, when some vital indexes are missing (especially on foreign keys), performance may suffer from adding Includes, even with the best query plan.
Entity Framework core
Cartesian explosion
For some reason, the behavior described above - UNION-ed queries - was abandoned as of EF Core 3. It now builds one query with joins. When the query is "star" shaped1 this leads to Cartesian explosion (of the SQL result set). I can only find a note announcing this breaking change, but it doesn't say why.
Split queries
To counter this Cartesian explosion, Entity Framework core 5 introduced the concept of split queries that enables loading related data in multiple queries. It prevents building one massive, multiplied SQL result set. Also, because of lower query complexity, it may reduce the time it takes to fetch data even with multiple roundtrips. However, it may lead to inconsistent data when concurrent updates occur.
1 Multiple 1:n relationships off of the query root.
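The difference between the two strategies is easy to see numerically. A sketch in Python (illustrative arithmetic only; the function names are not EF API): joining several independent child collections off the same root in one statement multiplies their row counts, while split queries only add them.

```python
# One root entity with several 1:n child collections ("star" shape).
# A single joined query returns the Cartesian product of the collections;
# split queries (EF Core 5+) fetch each collection separately.

def single_query_rows(child_row_counts):
    """Rows returned when all collections are joined in one statement."""
    rows = 1
    for n in child_row_counts:
        rows *= n
    return rows

def split_query_rows(child_row_counts):
    """Rows returned across one query per collection, plus one for the root."""
    return 1 + sum(child_row_counts)

# One root with three collections of 100 children each:
print(single_query_rows([100, 100, 100]))  # 1000000 rows
print(split_query_rows([100, 100, 100]))   # 301 rows
```

Three modest collections of 100 rows already produce a million-row result set in the single-query case, which is exactly the explosion AsSplitQuery is meant to avoid (at the price of multiple roundtrips and possible inconsistency under concurrent updates).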
Entity Framework Include performance
Your second approach relies on the EF navigation property fixup process. The problem, though, is that every
query.Include(q => q.ItemNavN).Load();
statement will also include all the master record data along with the related entity data.
Using the same basic idea, one potential improvement could be to execute one Load per navigation property, replacing the Include with either Select (for references) or SelectMany (for collections) - similar to how EF Core processes Includes internally.
Taking your second approach example, you could try the following and compare the performance:
var query = ctx.Filters.Where(x => x.SessionId == id)
    .Join(ctx.Items, f => f.ItemId, i => i.Id, (f, i) => i);
query.Select(x => x.ItemNav1).Load();
query.Select(x => x.ItemNav2).Load();
query.Select(x => x.ItemNav3).Load();
query.Select(x => x.ItemNav4).Load();
query.Select(x => x.ItemNav5).Load();
query.Select(x => x.ItemNav6).Load();
var result = query.ToList();
// here all the navigation properties should be populated
Query using Entity Framework is Very SLOW
Have the entities been set up with navigation properties? For instance, does the Diagnostico entity have something like the below declared?
public virtual Parceiro ParceiroPf { get; set; }
public virtual Parceiro ParceiroPj { get; set; }
From reading the second part of the query it does look like there are related entities mapped.
If you have navigation properties at your disposal then you can structure those queries to use the navigation properties rather than the embedded SQL. As mentioned, that way of querying is vulnerable to SQL injection and it should be a priority to eliminate it.
The performance cost you are likely seeing is due to the manual lazy-load that is being done to populate the various related details for the query results.
At a minimum, you can speed up the loading of these related details by first extracting your "idDiagnostico" values from the query results, using those to load all of the related child records in one hit, and then associating them with their respective Diagnostico entities.
So, assuming you need to keep the SQL query at least to begin with:
// ... load SQL based initial data ...
List<RetornaRelatorioEnvioSmsModel> models = ctx.Database.SqlQuery<RetornaRelatorioEnvioSmsModel>(query).ToList();
if (models.Count == 0)
    return models;
// Fetch all applicable IDs.
var diagnosticoIds = models.Select(x => x.idDiagnostico).ToList();
// Load the related data for *all* applicable diagnostico IDs above in one hit.
// Keep the diagnostico ID alongside each view model so the results can be
// split among their respective diagnostico afterwards.
var formasContatos = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .Select(x => new
    {
        x.id_diagnostico,
        Model = new RetornaRelatorioEnvioSmsFormaContatoModel
        {
            formaContato = x.tb_parceiro.tb_tipo_forma_contato.nm_tipo_forma_contato
        }
    }).ToList();
var temas = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .SelectMany(x => x.tb_diagnosticoperfiltema, (d, pt) => new
    {
        d.id_diagnostico,
        Model = new RetornaRelatorioEnvioSmsTemaModel
        {
            tema = pt.tb_perfiltema.tb_tema.nm_tema,
            nivel = pt.tb_nivelmaturidade.nm_nivel
        }
    }).ToList();
// This part is a bit tricky... It looks like you want the lowest nu_prioridade
// of the highest nu_pontuacao per diagnostico.
var temaPrioritarios = ctx.tb_diagnostico
    .Where(x => diagnosticoIds.Contains(x.id_diagnostico))
    .SelectMany(x => x.tb_diagnosticoperfiltema) // from diagnostico
    .GroupBy(x => x.tb_diagnostico.id_diagnostico) // group by diagnostico ID. Requires bi-directional references...
    .Select(g => new
    {
        id_diagnostico = g.Key,
        Model = g.OrderByDescending(y => y.nu_pontuacao)
            .ThenBy(y => y.tb_perfiltema.nu_prioridade)
            .Select(y => new RetornaRelatorioEnvioSmsTemaModel
            {
                tema = y.tb_perfiltema.tb_tema.nm_tema,
                nivel = y.tb_nivelmaturidade.nm_nivel
            }).FirstOrDefault()
    }).ToList();
// Caveat: the above needs to be tested, but it should give you an idea of how to select the desired data.
foreach (var model in models)
{
    model.formasContato = formasContatos.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).ToList();
    model.temas = temas.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).ToList();
    model.temaPrioritario = temaPrioritarios.Where(x => x.id_diagnostico == model.idDiagnostico).Select(x => x.Model).FirstOrDefault();
}
With the navigation properties, though, this can all be done away with and loaded from the initial data model retrieved. It's a pretty complex model, and the (Portuguese?) naming convention makes it a bit hard to follow, but hopefully that gives you some ideas on how to tackle the performance issues.
Why is Entity Framework so slow to add multiple items in one SaveChanges()?
While doing some research on your question I came across this enlightening article: http://www.codinghelmet.com/?path=howto/bulk-insert
Here's a quote:
Each object that was inserted required two SQL statements - one to insert the record, and an additional one to obtain the identity of the new record
This becomes a problem when inserting multiple records, a problem intensified by the fact that each record is inserted one at a time (but this is outside the context of your question, since you're already testing the one-by-one insert). So if you're inserting 200 records, that's 400 SQL statements being executed one by one.
So from my understanding, EF is simply not built for bulk insertion, even if it's as simple as inserting 200 records. Which to me seems like a big letdown.
I started thinking, "Then what is EF good for anyway? It can't even insert a couple of records." Well, I'll give EF props in two areas:
- Select Queries: It's very easy to write your queries and get data into your application quickly.
- Simplifying insertion of complex records. If you've ever had a table with a lot of foreign keys and you've tried inserting all the linked records in one transaction, you know what I'm talking about. Thankfully EF inserts each record in order and links all the related records in one single transaction for you. But as mentioned above, this comes at a cost.
So simply put, it seems that if you have an operation that requires inserting a bunch of records, it might be best to use SqlBulkCopy, which can insert thousands of records in seconds.
I know this might not be the answer you want to hear - believe me, it upsets me as well, since I use EF a lot - but I don't see any way around it.
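The quoted numbers reduce to simple arithmetic. A back-of-the-envelope sketch in Python (the 1 ms round-trip latency is an illustrative assumption, not a measurement, and this models statement counts only, not EF's actual internals):

```python
# Cost model based on the quoted article: EF6 issues two statements per
# inserted row (the INSERT plus an identity lookup), executed one at a time.

def ef_statement_count(rows):
    """One INSERT plus one identity-fetch statement per row."""
    return 2 * rows

def ef_insert_time_ms(rows, round_trip_ms=1.0):
    """Each statement pays one network round trip in this simplified model."""
    return ef_statement_count(rows) * round_trip_ms

print(ef_statement_count(200))  # 400 statements for 200 records
print(ef_insert_time_ms(200))   # 400.0 ms at an assumed 1 ms per round trip
```

A bulk copy, by contrast, streams all rows in a single operation, which is why the gap grows linearly with the number of records inserted.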