Problems using foreach parallelization
To start with, you could write your foreach code a bit more concisely:
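library(foreach)  # a parallel backend (e.g. via doParallel) must be registered for %dopar%

FECltSim <- function(nSims=1000, size=100, mu=0, sigma=1) {
  # Each iteration returns one sample mean; .combine=c collects them into a vector.
  foreach(i=1:nSims, .combine=c) %dopar% {
    mean(rnorm(n=size, mean=mu, sd=sigma))
  }
}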
This gives you a vector directly, so there is no need to build it explicitly inside the loop. There is also no need to use cbind, because each iteration returns just a single number, so .combine=c will do.
The thing with foreach is that it creates quite a lot of overhead to communicate between the cores and to fit the results from the different cores together. A quick look at the profile shows this pretty clearly:
$by.self
        self.time self.pct total.time total.pct
$            5.46    41.30       5.46     41.30
$<-          0.76     5.75       0.76      5.75
.Call        0.76     5.75       0.76      5.75
...
More than 40% of the time it is busy selecting things (the $ operator in the profile). It also calls a lot of other functions for the whole operation. In practice, foreach is only advisable if you have relatively few iterations of a very time-consuming function.
The other two solutions are built on a different technology and do far less work in R. On a side note, snow was originally developed to work on clusters rather than on single workstations, which is what multicore targets.
Parallel.ForEach loop is not working it skips some and double do others
Thanks to sedat-kapanoglu, I found the problem is really about thread safety. The solution was to change every List<T> to ConcurrentBag<T>.
For anyone else who runs into this: the solution to "parallel not working with collections" is to switch from the collections in System.Collections.Generic to the thread-safe ones in System.Collections.Concurrent.
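As a minimal sketch of that fix (the input array and the doubling step are hypothetical, chosen only for illustration), the loop below adds to a ConcurrentBag<T> from many threads and then checks that no item was skipped or processed twice:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Hypothetical input: the numbers 0..9999.
        int[] items = Enumerable.Range(0, 10000).ToArray();

        // ConcurrentBag<T>.Add is safe to call from several threads at once,
        // unlike List<T>.Add, which can lose or corrupt entries under contention.
        var results = new ConcurrentBag<int>();
        Parallel.ForEach(items, item => results.Add(item * 2));

        // Every item was processed exactly once; the order inside the bag is unspecified.
        Console.WriteLine(results.Count);              // 10000
        Console.WriteLine(results.Distinct().Count()); // 10000
    }
}
```

If the order of the results matters, a ConcurrentBag<T> is not enough on its own, since it does not preserve insertion order.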
Problem with Invoke to parallelize foreach
Your problem is that Invoke and queueing a Task to the UI TaskScheduler both require the UI thread to be processing its message loop. However, it is not: it is still waiting for the Parallel.ForEach loop to complete. This is why you see a deadlock.
If you want the Parallel.ForEach to run without blocking the UI thread, wrap it in a Task, like this:
// Members of the Form class. Requires: using System; using System.Threading;
// using System.Threading.Tasks; using System.Windows.Forms;
private TaskScheduler ui;

private void button1_Click(object sender, EventArgs e)
{
    // Capture the UI scheduler while still on the UI thread.
    ui = TaskScheduler.FromCurrentSynchronizationContext();
    DateTime start = DateTime.Now;
    Task.Factory.StartNew(pforeach)
        .ContinueWith(task =>
        {
            task.Wait(); // Ensure errors are propagated to the UI thread.
            Text = (DateTime.Now - start).ToString();
        }, ui);
}

private void pforeach()
{
    int[] intArray = new int[60];
    int totalcount = intArray.Length;
    object lck = new object();
    System.Threading.Tasks.Parallel.ForEach<int, int>(intArray,
        () => 0,
        (x, loop, count) =>
        {
            int value = 0;
            System.Threading.Thread.Sleep(100);
            count++;
            value = (int)(100f / (float)totalcount * (float)count);
            // Marshal the progress update to the UI thread and wait for it.
            Task.Factory.StartNew(
                () => Set(value),
                CancellationToken.None,
                TaskCreationOptions.None,
                ui).Wait();
            return count;
        },
        (x) =>
        {
        });
}

private void Set(int i)
{
    progressBar1.Value = i;
}
Parallel.ForEach MaxDegreeOfParallelism Strange Behavior with Increasing Chunking
You can't use Parallel methods with async delegates - at least, not yet.
Since you already have a "pipeline" style of architecture, I recommend looking into TPL Dataflow. A single ActionBlock may be all that you need, and once you have that working, other blocks in TPL Dataflow may replace other parts of your pipeline.
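To make the ActionBlock suggestion concrete, here is a minimal sketch. The ProcessAsync method, the integer "reports", and the limit of 8 are hypothetical stand-ins for the handler and buffer in the question. On .NET Framework the types come from the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Program
{
    static int processed;

    // Hypothetical stand-in for the asynchronous handler in the question.
    static async Task ProcessAsync(int report)
    {
        await Task.Delay(10);
        Interlocked.Increment(ref processed);
    }

    static async Task Main()
    {
        // The block runs the async delegate for each posted item,
        // with at most 8 items in flight at any time.
        var block = new ActionBlock<int>(
            async report => await ProcessAsync(report),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        for (int i = 0; i < 100; i++)
        {
            block.Post(i);
        }

        // Signal that no more items will arrive, then wait for the queue to drain.
        block.Complete();
        await block.Completion;

        Console.WriteLine(processed); // 100
    }
}
```

Unlike Parallel.ForEach, the block understands the Task returned by the delegate, so the concurrency limit applies to the whole asynchronous operation and not just to the part that runs before the first await.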
If you prefer to stick with your existing buffer, then you should use asynchronous concurrency instead of Parallel:
private async Task Process() {
    // Allow at most 8 reports to be processed concurrently.
    var throttler = new SemaphoreSlim(8);
    var tasks = _buffer.GetConsumingEnumerable()
        .Select(async report =>
        {
            await throttler.WaitAsync();
            try {
                await _handler.ProcessAsync(report).ConfigureAwait(false);
            } catch (Exception e) {
                if (_config.IsDevelopment) {
                    throw;
                }
                _logger.LogError(e, "GPS Report Service");
            }
            finally {
                throttler.Release();
            }
        })
        .ToList();
    // The method must be async (returning Task) for this await to compile.
    await Task.WhenAll(tasks);
}
Should I always use Parallel.Foreach because more threads MUST speed up everything?
No, it doesn't make sense for every foreach. Some reasons:
- Your code may not actually be parallelizable (for example, if you're using the "results so far" for the next iteration and the order is important).
- If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly.
- If your work will complete very fast anyway, there's no benefit, and it may well slow things down.
Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (It will be harder for things like debugging.) TPL is great, but it's no free lunch.
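For the aggregation case mentioned in the list, one of those "ways of using Parallel.ForEach" is the overload with thread-local state. The sketch below assumes a simple hypothetical input (the integers 1 to 1000) and sums it: each worker keeps a private subtotal, and the shared total is only touched once per worker.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Hypothetical input: the integers 1..1000.
        int[] values = Enumerable.Range(1, 1000).ToArray();
        long total = 0;

        Parallel.ForEach<int, long>(
            values,
            () => 0L,                                          // localInit: each worker starts at 0
            (value, loopState, subtotal) => subtotal + value,  // body: no shared state touched
            subtotal => Interlocked.Add(ref total, subtotal)); // localFinally: merge atomically

        Console.WriteLine(total); // 500500
    }
}
```

For work this cheap, a plain values.Sum() is likely to be faster, which is the point of the last bullet: measure before assuming the parallel version wins.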