Benchmarking Small Code Samples in C#, Can This Implementation Be Improved

Benchmarking small code samples in C#, can this implementation be improved?

Here is the modified function, as recommended by the community. Feel free to amend it; it's a community wiki.

static double Profile(string description, int iterations, Action func) {
    // Run at highest priority to minimize fluctuations caused by other processes/threads
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
    Thread.CurrentThread.Priority = ThreadPriority.Highest;

    // warm up
    func();

    var watch = new Stopwatch();

    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    watch.Start();
    for (int i = 0; i < iterations; i++) {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    return watch.Elapsed.TotalMilliseconds;
}

Make sure you compile in Release with optimizations enabled, and run the tests outside of Visual Studio. This last part is important because the JIT holds back its optimizations when a debugger is attached, even in Release mode.
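For illustration, a hypothetical call site might look like this. The workloads and iteration count are made up, and Profile is repeated inside the sample so it compiles on its own:

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Threading;

class ProfileDemo
{
    // Same Profile method as above, repeated here so the sample is self-contained.
    public static double Profile(string description, int iterations, Action func)
    {
        // Run at highest priority to minimize fluctuations from other processes/threads
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
        Thread.CurrentThread.Priority = ThreadPriority.Highest;

        func(); // warm up

        var watch = new Stopwatch();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
            func();
        watch.Stop();

        Console.WriteLine("{0} Time Elapsed {1} ms", description, watch.Elapsed.TotalMilliseconds);
        return watch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int N = 10_000;

        // Two hypothetical workloads to compare; substitute whatever you are measuring.
        Profile("string concat ", N, () =>
        {
            string s = "";
            for (int i = 0; i < 50; i++) s += "x";
        });

        Profile("StringBuilder ", N, () =>
        {
            var sb = new StringBuilder();
            for (int i = 0; i < 50; i++) sb.Append("x");
            sb.ToString();
        });
    }
}
```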

Can this Crystal benchmark code be improved significantly?

You need to compile with --release flag. By default, the Crystal compiler is focused on compilation speed, so you get a compiled program fast. This is particularly important during development. If you want a program that runs fast, you need to pass the --release flag which tells the compiler to take time for optimizations (that's handled by LLVM btw.).

You might also be able to shave off some time by using wrapping operators like &+ in locations where it's guaranteed that the result can't overflow. That skips some overflow checks.


A few other remarks:

Instead of 0.to_i64, you can just use an Int64 literal: 0_i64.

iMills = uninitialized Int32

Don't use uninitialized here. It's completely wrong. You want that variable to be initialized. There are some use cases for uninitialized in C bindings and some very specific, low-level implementations. It should never be used in most regular code.

I notice from a recent post that Crystal is more efficient with floats than integers

Where did you read that?

Adding type prefixes to identifiers doesn't seem to be very useful in Crystal. The compiler already keeps track of the types; you as the developer shouldn't have to do that too. I've never seen that before.

Weird performance increase in simple benchmark

Update 4 explains the problem: in the first case, the JIT keeps the calculated values (a, b) on the stack; in the second case, the JIT keeps them in registers.

In fact, Test1 works slowly because of the Stopwatch. I wrote the following minimal benchmark based on BenchmarkDotNet:

[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // fadd qword ptr [ebp-14h]
            // fstp qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}

The results on my computer:

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=Jit_RegistersVsStack Mode=Throughput Platform=X86 Jit=HostJit .NET=HostFramework

Method | AvrTime | StdDev | op/s |
------------------- |---------- |---------- |----------- |
WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |

As we can see:

  • WithoutStopwatch works quickly (because a = a + b uses the registers)
  • WithStopwatch works slowly (because a = a + b uses the stack)
  • WithTwoStopwatches works quickly again (because a = a + b uses the registers)

The behavior of JIT-x86 depends on a large number of different conditions. For some reason, the first stopwatch forces JIT-x86 to use the stack, and the second stopwatch allows it to use the registers again.


How to measure code performance in .NET?

The Stopwatch class, available since .NET 2.0, is the best way to go for this. It is a high-resolution counter, accurate to fractions of a millisecond.
Take a look at the MSDN documentation, which is pretty clear.

EDIT: As previously suggested, it is also advisable to run your code a number of times in order to get a reasonable average time.
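A minimal sketch of that pattern, averaging over several runs as suggested (the workload here is just a placeholder):

```csharp
using System;
using System.Diagnostics;

class StopwatchDemo
{
    // Times the action over several runs and returns the average per-run time in ms.
    public static double AverageMilliseconds(Action action, int runs)
    {
        var sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < runs; i++)
            action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / runs;
    }

    public static void Main()
    {
        // Placeholder workload; substitute the code you actually want to measure.
        double avg = AverageMilliseconds(() => Math.Sqrt(12345.678), 1_000_000);
        Console.WriteLine("Average: {0} ms", avg);
    }
}
```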

C# Improved algorithm

EDIT: Better solution: use Except, which, unlike Intersect, isn't symmetrical.

1 & 2: you can use the Intersect extension method to do this. However, if your second array contains elements not found in the first one, these will then be in the resulting list: Intersect works symmetrically.

As for “two-way closure”, I’ve never heard of this term and I rather doubt that it’s an established technical term.
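A small sketch of the difference between the two methods (the sample arrays are made up):

```csharp
using System;
using System.Linq;

class SetOpsDemo
{
    // Intersect is symmetric: swapping the operands yields the same set of elements.
    public static int[] Common(int[] first, int[] second)
        => first.Intersect(second).ToArray();

    // Except is not symmetric: it keeps the elements of the first sequence
    // that do not appear in the second.
    public static int[] OnlyInFirst(int[] first, int[] second)
        => first.Except(second).ToArray();

    public static void Main()
    {
        int[] first = { 1, 2, 3, 4 };
        int[] second = { 3, 4, 5, 6 };

        Console.WriteLine(string.Join(",", Common(first, second)));      // 3,4
        Console.WriteLine(string.Join(",", OnlyInFirst(first, second))); // 1,2
        Console.WriteLine(string.Join(",", OnlyInFirst(second, first))); // 5,6
    }
}
```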

C# code to copy files, can this snippet be improved?

You are doing far too much work with the nested loop.

You should remove the inner "foreach" and replace it with some code that:

(1) Constructs the name of the file that you are looking for and

(2) Uses File.Exists() to see if it exists, then

(3) Continues with the same block of code that you currently have following the "if (Path.GetFileName(remoteSrcFile)..." condition.

Something like this:

foreach (string remoteSrcFile in Directory.EnumerateFiles(remoteRepository, "*.*", SearchOption.AllDirectories))
{
    string localSrcFile = Path.Combine(localRepository, Path.GetFileName(remoteSrcFile));

    if (File.Exists(localSrcFile))
    {
        ...
    }
}
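As a hypothetical completion of the elided body (the copy direction, overwrite policy, and method name are assumptions, not from the original snippet), the loop might copy each remote file over its existing local counterpart:

```csharp
using System;
using System.IO;

class CopyDemo
{
    // Hypothetical sketch: for every remote file whose name already exists in the
    // local repository, overwrite the local copy. Returns the number of files copied.
    public static int SyncExisting(string remoteRepository, string localRepository)
    {
        int copied = 0;
        foreach (string remoteSrcFile in Directory.EnumerateFiles(
                     remoteRepository, "*.*", SearchOption.AllDirectories))
        {
            string localSrcFile = Path.Combine(localRepository, Path.GetFileName(remoteSrcFile));

            if (File.Exists(localSrcFile))
            {
                File.Copy(remoteSrcFile, localSrcFile, overwrite: true);
                copied++;
            }
        }
        return copied;
    }
}
```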

How to explain the difference of performance in these 2 simple loops?

It's been said a few times before that the x86 JIT does a better job than the x64 JIT when it comes to optimization, and it looks like that is what is happening in this case. Although the loops perform essentially the same work, the x64 assembly code that the JITter creates is fundamentally different, and I think that accounts for the speed difference you're seeing.

The assembly code between the two methods differ in the critical inner loop, which is called 1000*N times. This is what I think is accounting for the speed difference.

Loop 1:


000007fe`97d50240 4d8bd1 mov r10,r9
000007fe`97d50243 4983c128 add r9,28h
000007fe`97d50247 4183c004 add r8d,4
; Loop while j < 1000d
000007fe`97d5024b 4181f8e8030000 cmp r8d,3E8h
000007fe`97d50252 7cec jl 000007fe`97d50240

Loop 2:


; rax = ret
; ecx = j

; Add 10 to ret 4 times
000007fe`97d50292 48050a000000 add rax,0Ah
000007fe`97d50298 48050a000000 add rax,0Ah
000007fe`97d5029e 48050a000000 add rax,0Ah
000007fe`97d502a4 48050a000000 add rax,0Ah
000007fe`97d502aa 83c104 add ecx,4 ; increment j by 4

; Loop while j < 1000d
000007fe`97d502ad 81f9e8030000 cmp ecx,3E8h
000007fe`97d502b3 7cdd jl 000007fe`97d50292

You'll notice that the JIT is unrolling the inner loop, but the actual code in the loop differs greatly when it comes to the number of instructions made. Loop 1 is optimized to make a single add statement of 40, where Loop 2 makes 4 add statements of 10.

My (wild) guess is that the JITer can better optimize the variable p because it is defined in the inner scope of the first loop. Since it can detect that p is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside of the scope of both loops, and the optimization rules used in the x64 JIT doesn't recognize it as the same code that could have the same optimizations.
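The two shapes being compared can be sketched roughly like this. The method names and constants are illustrative; the original question's code is not reproduced here, only the scoping difference the guess above is about:

```csharp
using System;

class LoopScopeDemo
{
    // Shape 1: the accumulator p lives entirely inside the outer loop's scope,
    // so the JIT can treat it as a short-lived temporary.
    public static long InnerScoped(int n)
    {
        long ret = 0;
        for (int outer = 0; outer < n; outer++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++)
                p += 10;
            ret += p;
        }
        return ret;
    }

    // Shape 2: the accumulator is declared outside both loops and updated
    // directly in the inner loop.
    public static long OuterScoped(int n)
    {
        long ret = 0;
        for (int outer = 0; outer < n; outer++)
        {
            for (int j = 0; j < 1000; j++)
                ret += 10;
        }
        return ret;
    }
}
```

Both methods compute the same result (n * 10000); the inner loop runs 1000 * N times in each, matching the description above.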


