Benchmarking Small Code Samples in C#, Can This Implementation Be Improved

Benchmarking small code samples in C#, can this implementation be improved?

Here is the modified function, as recommended by the community. Feel free to amend it; it's a community wiki.

static double Profile(string description, int iterations, Action func) {
    // Run at highest priority to minimize fluctuations caused by other processes/threads
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
    Thread.CurrentThread.Priority = ThreadPriority.Highest;

    // warm up
    func();

    var watch = new Stopwatch();

    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    watch.Start();
    for (int i = 0; i < iterations; i++) {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
    return watch.Elapsed.TotalMilliseconds;
}

Make sure you compile in Release with optimizations enabled, and run the tests outside of Visual Studio. This last part is important because the JIT holds back its optimizations when a debugger is attached, even in Release mode.
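For illustration, a hypothetical call site might look like this. The workloads and iteration count are made up, and Profile is repeated inside the sample so it compiles on its own:

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Threading;

class ProfileDemo
{
    // Same Profile method as above, repeated here so the sample is self-contained.
    public static double Profile(string description, int iterations, Action func)
    {
        // Run at highest priority to minimize fluctuations from other processes/threads
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
        Thread.CurrentThread.Priority = ThreadPriority.Highest;

        func(); // warm up

        var watch = new Stopwatch();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
            func();
        watch.Stop();

        Console.WriteLine("{0} Time Elapsed {1} ms", description, watch.Elapsed.TotalMilliseconds);
        return watch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int N = 10_000;

        // Two hypothetical workloads to compare; substitute whatever you are measuring.
        Profile("string concat ", N, () =>
        {
            string s = "";
            for (int i = 0; i < 50; i++) s += "x";
        });

        Profile("StringBuilder ", N, () =>
        {
            var sb = new StringBuilder();
            for (int i = 0; i < 50; i++) sb.Append("x");
            sb.ToString();
        });
    }
}
```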

Can this Crystal benchmark code be improved significantly?

You need to compile with --release flag. By default, the Crystal compiler is focused on compilation speed, so you get a compiled program fast. This is particularly important during development. If you want a program that runs fast, you need to pass the --release flag which tells the compiler to take time for optimizations (that's handled by LLVM btw.).

You might also be able to shave off some time by using wrapping operators like &+ in locations where it's guaranteed that the result can't overflow. That skips some overflow checks.


A few other remarks:

Instead of 0.to_i64, you can just use an Int64 literal: 0_i64.

iMills = uninitialized Int32

Don't use uninitialized here. It's completely wrong. You want that variable to be initialized. There are some use cases for uninitialized in C bindings and some very specific, low-level implementations. It should never be used in most regular code.

I notice from a recent post that Crystal is more efficient with floats than integers

Where did you read that?

Adding type prefixes to identifiers doesn't seem to be very useful in Crystal. The compiler already keeps track of the types; you as the developer shouldn't have to do that too. I've never seen that before.

Weird performance increase in simple benchmark

Update 4 explains the problem: in the first case, the JIT keeps the calculated values (a, b) on the stack; in the second case, the JIT keeps them in registers.

In fact, Test1 works slowly because of the Stopwatch. I wrote the following minimal benchmark based on BenchmarkDotNet:

[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // fadd qword ptr [ebp-14h]
            // fstp qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}

The results on my computer:

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=Jit_RegistersVsStack Mode=Throughput Platform=X86 Jit=HostJit .NET=HostFramework

Method | AvrTime | StdDev | op/s |
------------------- |---------- |---------- |----------- |
WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |

As we can see:

  • WithoutStopwatch works quickly (because a = a + b uses the registers)
  • WithStopwatch works slowly (because a = a + b uses the stack)
  • WithTwoStopwatches works quickly again (because a = a + b uses the registers)

The behavior of JIT-x86 depends on a large number of different conditions. For some reason, the first stopwatch forces JIT-x86 to use the stack, and the second stopwatch allows it to use the registers again.


How to measure code performance in .NET?

The Stopwatch class, available since .NET 2.0, is the best way to go for this. It is a high-resolution counter, accurate to fractions of a millisecond.
Take a look at the MSDN documentation, which is pretty clear.

EDIT: As previously suggested, it is also advisable to run your code a number of times in order to get a reasonable average time.
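A minimal sketch of that pattern, averaging over several runs as suggested (the workload here is just a placeholder):

```csharp
using System;
using System.Diagnostics;

class StopwatchDemo
{
    // Times the action over several runs and returns the average per-run time in ms.
    public static double AverageMilliseconds(Action action, int runs)
    {
        var sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < runs; i++)
            action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / runs;
    }

    public static void Main()
    {
        // Placeholder workload; substitute the code you actually want to measure.
        double avg = AverageMilliseconds(() => Math.Sqrt(12345.678), 1_000_000);
        Console.WriteLine("Average: {0} ms", avg);
    }
}
```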

C# Improved algorithm

EDIT: Better solution: use Except, which, unlike Intersect, isn't symmetrical.

1 & 2: you can use the Intersect extension method to do this. However, if your second array contains elements not found in the first one, these will then be in the resulting list: Intersect works symmetrically.

As for “two-way closure”, I’ve never heard of this term and I rather doubt that it’s an established technical term.
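A small sketch of the difference between the two methods (the sample arrays are made up):

```csharp
using System;
using System.Linq;

class SetOpsDemo
{
    // Intersect is symmetric: swapping the operands yields the same set of elements.
    public static int[] Common(int[] first, int[] second)
        => first.Intersect(second).ToArray();

    // Except is not symmetric: it keeps the elements of the first sequence
    // that do not appear in the second.
    public static int[] OnlyInFirst(int[] first, int[] second)
        => first.Except(second).ToArray();

    public static void Main()
    {
        int[] first = { 1, 2, 3, 4 };
        int[] second = { 3, 4, 5, 6 };

        Console.WriteLine(string.Join(",", Common(first, second)));      // 3,4
        Console.WriteLine(string.Join(",", OnlyInFirst(first, second))); // 1,2
        Console.WriteLine(string.Join(",", OnlyInFirst(second, first))); // 5,6
    }
}
```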

C# code to copy files, can this snippet be improved?

You are doing far too much work with the nested loop.

You should remove the inner "foreach" and replace it with some code that:

(1) Constructs the name of the file that you are looking for and

(2) Uses File.Exists() to see if it exists, then

(3) Continues with the same block of code that you currently have following the "if (Path.GetFileName(remoteSrcFile)..." condition.

Something like this:

foreach (string remoteSrcFile in Directory.EnumerateFiles(remoteRepository, "*.*", SearchOption.AllDirectories))
{
    string localSrcFile = Path.Combine(localRepository, Path.GetFileName(remoteSrcFile));

    if (File.Exists(localSrcFile))
    {
        ...
    }
}
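As a hypothetical completion of the elided body (the copy direction, overwrite policy, and method name are assumptions, not from the original snippet), the loop might copy each remote file over its existing local counterpart:

```csharp
using System;
using System.IO;

class CopyDemo
{
    // Hypothetical sketch: for every remote file whose name already exists in the
    // local repository, overwrite the local copy. Returns the number of files copied.
    public static int SyncExisting(string remoteRepository, string localRepository)
    {
        int copied = 0;
        foreach (string remoteSrcFile in Directory.EnumerateFiles(
                     remoteRepository, "*.*", SearchOption.AllDirectories))
        {
            string localSrcFile = Path.Combine(localRepository, Path.GetFileName(remoteSrcFile));

            if (File.Exists(localSrcFile))
            {
                File.Copy(remoteSrcFile, localSrcFile, overwrite: true);
                copied++;
            }
        }
        return copied;
    }
}
```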

How to explain the difference of performance in these 2 simple loops?

It's been said a few times before that the x86 JIT does a better job than the x64 JIT when it comes to optimization, and it looks like that is what is happening in this case. Although the loops perform essentially the same work, the x64 assembly code that the JITter creates is fundamentally different, and I think that accounts for the speed difference you're seeing.

The assembly code between the two methods differ in the critical inner loop, which is called 1000*N times. This is what I think is accounting for the speed difference.

Loop 1:


000007fe`97d50240 4d8bd1 mov r10,r9
000007fe`97d50243 4983c128 add r9,28h
000007fe`97d50247 4183c004 add r8d,4
; Loop while j < 1000d
000007fe`97d5024b 4181f8e8030000 cmp r8d,3E8h
000007fe`97d50252 7cec jl 000007fe`97d50240

Loop 2:


; rax = ret
; ecx = j

; Add 10 to ret 4 times
000007fe`97d50292 48050a000000 add rax,0Ah
000007fe`97d50298 48050a000000 add rax,0Ah
000007fe`97d5029e 48050a000000 add rax,0Ah
000007fe`97d502a4 48050a000000 add rax,0Ah
000007fe`97d502aa 83c104 add ecx,4 ; increment j by 4

; Loop while j < 1000d
000007fe`97d502ad 81f9e8030000 cmp ecx,3E8h
000007fe`97d502b3 7cdd jl 000007fe`97d50292

You'll notice that the JIT is unrolling the inner loop, but the actual code in the loop differs greatly when it comes to the number of instructions made. Loop 1 is optimized to make a single add statement of 40, where Loop 2 makes 4 add statements of 10.

My (wild) guess is that the JITer can better optimize the variable p because it is defined in the inner scope of the first loop. Since it can detect that p is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside of the scope of both loops, and the optimization rules used in the x64 JIT doesn't recognize it as the same code that could have the same optimizations.
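The two shapes being compared can be sketched roughly like this. The method names and constants are illustrative; the original question's code is not reproduced here, only the scoping difference the guess above is about:

```csharp
using System;

class LoopScopeDemo
{
    // Shape 1: the accumulator p lives entirely inside the outer loop's scope,
    // so the JIT can treat it as a short-lived temporary.
    public static long InnerScoped(int n)
    {
        long ret = 0;
        for (int outer = 0; outer < n; outer++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++)
                p += 10;
            ret += p;
        }
        return ret;
    }

    // Shape 2: the accumulator is declared outside both loops and updated
    // directly in the inner loop.
    public static long OuterScoped(int n)
    {
        long ret = 0;
        for (int outer = 0; outer < n; outer++)
        {
            for (int j = 0; j < 1000; j++)
                ret += 10;
        }
        return ret;
    }
}
```

Both methods compute the same result (n * 10000); the inner loop runs 1000 * N times in each, matching the description above.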


