Create Batches in LINQ

An Enumerable.Chunk() extension method was added in .NET 6.

Example:

var list = new List<int> { 1, 2, 3, 4, 5, 6, 7 };

var chunks = list.Chunk(3);
// returns { { 1, 2, 3 }, { 4, 5, 6 }, { 7 } }

For those who cannot upgrade, the source of Chunk() is available on GitHub.
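On earlier frameworks, a stand-in with the same shape as Chunk() can be written in a few lines. This is a minimal sketch, not the official implementation; the name ChunkCompat is made up to avoid clashing with the real method:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkBackport
{
    // Minimal stand-in for Enumerable.Chunk() on pre-.NET 6 targets.
    // Yields full-size arrays, plus one trailing partial array if needed.
    public static IEnumerable<TSource[]> ChunkCompat<TSource>(
        this IEnumerable<TSource> source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));

        var bucket = new List<TSource>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket.ToArray();
                bucket.Clear();
            }
        }

        if (bucket.Count > 0)
            yield return bucket.ToArray(); // trailing partial chunk
    }
}
```

With the seven-element list above, `list.ChunkCompat(3)` produces the same `{ 1, 2, 3 }, { 4, 5, 6 }, { 7 }` shape as the built-in method.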

How to build batches/buckets with LINQ

Originally posted by @Nick_Whaley in Create batches in LINQ, though it was not the top response there because the question was formulated differently:

Try this:

public static IEnumerable<IEnumerable<T>> Bucketize<T>(this IEnumerable<T> items, int bucketSize)
{
    using var enumerator = items.GetEnumerator();
    while (enumerator.MoveNext())
        yield return GetNextBucket(enumerator, bucketSize);
}

private static IEnumerable<T> GetNextBucket<T>(IEnumerator<T> enumerator, int maxItems)
{
    int count = 0;
    do
    {
        yield return enumerator.Current;

        count++;
        if (count == maxItems)
            yield break;

    } while (enumerator.MoveNext());
}

The trick is to pass the old-fashioned enumerator between the inner and outer enumeration, which enables continuation from one batch to the next. Note that each bucket must be fully enumerated before requesting the next one, since both levels share the same enumerator.

How to loop through IEnumerable in batches

Sounds like you need to use the Skip and Take methods on your object. Example:

users.Skip(1000).Take(1000)

This would skip the first 1000 and take the next 1000. You'd just need to increase the amount skipped with each call.

You can parameterize the amount skipped with an integer variable and wrap the call in a method:

public IEnumerable<user> GetBatch(int pageNumber)
{
    return users.Skip(pageNumber * 1000).Take(1000);
}
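Putting the pieces together, a driver loop keeps requesting pages until an empty one comes back. This is a sketch where `users` is just an in-memory list and the page size is shrunk to 3 for illustration (the original answer uses 1000):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PagingDemo
{
    const int PageSize = 3; // 1000 in the answer above; small here for illustration

    static readonly List<int> users = Enumerable.Range(1, 8).ToList();

    public static IEnumerable<int> GetBatch(int pageNumber) =>
        users.Skip(pageNumber * PageSize).Take(PageSize);

    public static void Main()
    {
        for (int page = 0; ; page++)
        {
            var batch = GetBatch(page).ToList();
            if (batch.Count == 0)
                break; // past the end of the sequence

            Console.WriteLine(string.Join(", ", batch));
        }
    }
}
// prints:
// 1, 2, 3
// 4, 5, 6
// 7, 8
```

Note that Skip/Take re-enumerates the source from the start on every call, so this pattern is fine for lists and database queries but wasteful for expensive one-pass enumerables.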

Batch Update using LINQ

You need to use Skip/Take in LINQ to get around this. The code below should work. Although I'm not always a fan of a while (true) with a break statement, it's the easiest way to implement this.

int takeCount = 50;
int skipCount = 0;

while (true)
{
    var dbUserList = db.Users
        .Where(x => users.Select(y => y.Id).Contains(x.Id))
        .OrderBy(x => x.Id)
        .Skip(skipCount)
        .Take(takeCount)
        .ToList();

    if (!dbUserList.Any())
    {
        break;
    }

    foreach (var dbUser in dbUserList)
    {
        var user = users.First(x => x.Id == dbUser.Id);
        dbUser.name = user.name;
        dbUser.cat = user.cat;
    }

    db.SaveChanges();
    skipCount += takeCount; // advance to the next page, otherwise this loops forever
}


Looping through a List<T> in batches while ensuring items per batch are unique

If I completely understand the problem, then there are many ways to do this and the best solution would depend on your actual needs.

The assumptions are:

  1. What you have described is an in-memory approach
  2. It doesn't need to hit a database
  3. It doesn't need to be producer/consumer

Then a very simple (yet efficient) batch-and-queue pattern can be used with minimal allocations.

Given

public class Payment
{
    public int AccountId { get; set; }
    public Payment(int accountId) => AccountId = accountId;
}

And

public static IEnumerable<Payment[]> GetBatches(IEnumerable<Payment> source, int count)
{
    var hashset = new HashSet<int>(count);
    var batch = new List<Payment>(count);
    var leftOvers = new Queue<Payment>();

    while (true)
    {
        foreach (var item in source)
        {
            // check if already batched
            if (hashset.Add(item.AccountId))
                batch.Add(item); // add to batch
            else
                leftOvers.Enqueue(item); // add to left overs

            // if we are at the batch size, start a loop
            while (batch.Count == count)
            {
                yield return batch.ToArray(); // return the batch

                batch.Clear();
                hashset.Clear();

                // check the left overs
                while (leftOvers.Any() && batch.Count != count)
                    if (hashset.Add(leftOvers.Peek().AccountId)) // check the batch
                        batch.Add(leftOvers.Dequeue());
                    else
                        break; // we still have a duplicate, bail
            }
        }

        if (batch.Any())
            yield return batch.ToArray();

        if (!leftOvers.Any())
            break;

        source = leftOvers.ToList(); // allocation :(
        hashset.Clear();
        batch.Clear();
        leftOvers.Clear();
    }
}

Note: This is fairly resource-efficient, though it probably makes one extra small allocation when dealing with pure leftovers; I am sure that could be removed, but I'll leave it up to you. There are further efficiencies you could add as well; for example, with a Channel<T> this could easily be turned into a producer/consumer pipeline.

Test

var list = new List<Payment>() { new(1), new(2), new(3), new(4), new(4), new(5), new(6), new(4), new(4), new(6), new(4) };

var batches = GetBatches(list, 3);

foreach (var batch in batches)
    Console.WriteLine(string.Join(", ", batch.Select(x => x.AccountId)));

Output

1, 2, 3
4, 5, 6
4, 6
4
4
4


How to process IEnumerable in batches?

Using the batch solution from this thread, it seems trivial:

const int batchSize = 100;

foreach (var batch in contacts.Batch(batchSize))
{
DoSomething(batch);
}

If you want to also wrap it up:

public static void ProcessInBatches<TSource>(
    this IEnumerable<TSource> source,
    int batchSize,
    Action<IEnumerable<TSource>> action)
{
    foreach (var batch in source.Batch(batchSize))
    {
        action(batch);
    }
}

So, your code can be transformed into:

const int batchSize = 100;

contacts.ProcessInBatches(batchSize, DoSomething);

Create Buckets with LINQ

Well, for an arbitrary list of n buckets you have to compute the range [min..max] and then

  step = (max - min) / n;

Code:

// Given

List<double> list = new List<double>() {
    0, 0.1, 1.1, 2.2, 3.3, 4.1, 5.6, 6.3, 7.1, 8.9, 9.8, 9.9, 10
};

int n = 5;

// We compute the step

double min = list.Min();
double max = list.Max();

double step = (max - min) / n;

// And, finally, group by:

double[][] result = list
    .GroupBy(item => (int)Math.Clamp((item - min) / step, 0, n - 1))
    .OrderBy(group => group.Key)
    .Select(group => group.ToArray())
    .ToArray();

// Let's have a look:

string report = string.Join(Environment.NewLine, result
    .Select((array, i) => $"[{min + i * step} .. {min + i * step + step,2}) : {{{string.Join("; ", array)}}}"));

Console.WriteLine(report);

Outcome:

[0 ..  2) : {0; 0.1; 1.1}
[2 ..  4) : {2.2; 3.3}
[4 ..  6) : {4.1; 5.6}
[6 ..  8) : {6.3; 7.1}
[8 .. 10) : {8.9; 9.8; 9.9; 10}

Please note the Math.Clamp method, which ensures the [0..n-1] range for group keys. If you want a Dictionary<int, double[]> where the Key is the index of the bucket:

Dictionary<int, double[]> buckets = list
    .GroupBy(item => (int)Math.Clamp((item - min) / step, 0, n - 1))
    .ToDictionary(group => group.Key, group => group.ToArray());

LINQ Select 5 items per Iteration

for (int i = 0; i < 20; i++) // 20 iterations covers a list of up to 100 items
{
    var fiveItems = theList.Skip(i * 5).Take(5);
}

Batch with Multiple GroupBy

orderby Location, RepName, AccountID

There needs to be a select clause after the above, as demonstrated in StriplingWarrior's answer. LINQ query expressions must end with a select or group clause.
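As a minimal illustration of that rule (the Account type here is hypothetical, reduced to the three key fields):

```csharp
using System;
using System.Linq;

public record Account(string Location, string RepName, int AccountId);

public static class QuerySyntaxDemo
{
    public static void Main()
    {
        var accounts = new[]
        {
            new Account("West", "Bob", 1),
            new Account("East", "Ann", 2),
        };

        // A query expression must end in a select or group clause:
        var ordered = from a in accounts
                      orderby a.Location, a.RepName, a.AccountId
                      select a; // without this line the query does not compile

        Console.WriteLine(ordered.First().Location); // prints "East"
    }
}
```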


Unfortunately, there is a logical defect: suppose I have 50 accounts in the first group and 100 accounts in the second group, with a batch size of 100. The original code will produce 3 batches of size 50, not two batches of 50 and 100.

Here's one way to fix it.

IEnumerable<IGrouping<int, EbrRecord>> query = ...

orderby Location, RepName, AccountID
select new EbrRecord
{
    AccountID = EbrData[0],
    AccountName = EbrData[1],
    MBSegment = EbrData[2],
    RepName = EbrData[4],
    Location = EbrData[7],
    TsrLocation = EbrData[8]
} into x
group x by new { Location = x.Location, RepName = x.RepName } into g
from g2 in g.Select((data, index) => new { Record = data, Index = index })
            .GroupBy(y => y.Index / 100, y => y.Record)
select g2;

List<List<EbrRecord>> result = query.Select(g => g.ToList()).ToList();

Also note that batching with GroupBy is quite slow due to redundant iterations. You can write a for loop that does it in one pass over the ordered set, and that loop will run much faster than the LINQ-to-Objects version.
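A single-pass version of that idea might look like the sketch below. EbrRecord is reduced to its two key fields plus an ID, and the batch size is a parameter; a new batch starts whenever the (Location, RepName) key changes or the current batch fills up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record EbrRecord(string Location, string RepName, int AccountId);

public static class SinglePassBatcher
{
    // One pass over a pre-ordered sequence: start a new batch whenever the
    // (Location, RepName) key changes or the current batch is full.
    public static List<List<EbrRecord>> Batch(IEnumerable<EbrRecord> ordered, int batchSize)
    {
        var result = new List<List<EbrRecord>>();
        List<EbrRecord> current = null;
        (string, string) currentKey = default;

        foreach (var record in ordered)
        {
            var key = (record.Location, record.RepName);
            if (current == null || !key.Equals(currentKey) || current.Count == batchSize)
            {
                current = new List<EbrRecord>();
                currentKey = key;
                result.Add(current);
            }
            current.Add(record);
        }

        return result;
    }
}
```

Unlike the GroupBy pipeline above, this never re-enumerates the source, so it scales linearly with the number of records.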


