How Performant Is StackFrame?


Edit: some background


We have a similar feature which is disabled 99% of the time; we were using an approach like:

public void DoSomething()
{
    TraceCall(MethodBase.GetCurrentMethod().Name);
    // Do Something
}

public void TraceCall(string methodName)
{
    if (!loggingEnabled) { return; }
    // Log...
}


It was simple, but regardless of whether tracing was enabled we were incurring the cost of using reflection to look up the method name.

Our options were either to require more code in every method (and risk simple mistakes or omissions) or to switch to using StackFrame to determine the calling method only when logging was enabled.

Option A:

public void DoSomething()
{
    if (loggingEnabled)
    {
        TraceCall(MethodBase.GetCurrentMethod().Name);
    }
    // Do Something
}

public void TraceCall(string methodName)
{
    if (!loggingEnabled) { return; }
    // Log...
}

Option B:

public void DoSomething()
{
    TraceCall();
    // Do Something
}

public void TraceCall()
{
    if (!loggingEnabled) { return; }
    StackFrame stackFrame = new StackFrame(1);
    // Log...
}

We opted for Option B. When logging is disabled (99% of the time) it avoids the reflection cost entirely, and it is simpler to implement than Option A.

Here's an alteration of Michael's code to show the cost/benefit of this approach:

using System;
using System.Diagnostics;
using System.Reflection;

namespace ConsoleApplication
{
    class Program
    {
        static bool traceCalls;

        static void Main(string[] args)
        {
            Stopwatch sw;

            // warm up
            for (int i = 0; i < 100000; i++)
            {
                TraceCall();
            }

            // call 100K times, tracing *disabled*, passing method name
            sw = Stopwatch.StartNew();
            traceCalls = false;
            for (int i = 0; i < 100000; i++)
            {
                TraceCall(MethodBase.GetCurrentMethod());
            }
            sw.Stop();
            Console.WriteLine("Tracing Disabled, passing Method Name: {0}ms",
                sw.ElapsedMilliseconds);

            // call 100K times, tracing *enabled*, passing method name
            sw = Stopwatch.StartNew();
            traceCalls = true;
            for (int i = 0; i < 100000; i++)
            {
                TraceCall(MethodBase.GetCurrentMethod());
            }
            sw.Stop();
            Console.WriteLine("Tracing Enabled, passing Method Name: {0}ms",
                sw.ElapsedMilliseconds);

            // call 100K times, tracing *disabled*, determining method name
            sw = Stopwatch.StartNew();
            traceCalls = false;
            for (int i = 0; i < 100000; i++)
            {
                TraceCall();
            }
            sw.Stop();
            Console.WriteLine("Tracing Disabled, looking up Method Name: {0}ms",
                sw.ElapsedMilliseconds);

            // call 100K times, tracing *enabled*, determining method name
            sw = Stopwatch.StartNew();
            traceCalls = true;
            for (int i = 0; i < 100000; i++)
            {
                TraceCall();
            }
            sw.Stop();
            Console.WriteLine("Tracing Enabled, looking up Method Name: {0}ms",
                sw.ElapsedMilliseconds);

            Console.ReadKey();
        }

        private static void TraceCall()
        {
            if (traceCalls)
            {
                StackFrame stackFrame = new StackFrame(1);
                TraceCall(stackFrame.GetMethod().Name);
            }
        }

        private static void TraceCall(MethodBase method)
        {
            if (traceCalls)
            {
                TraceCall(method.Name);
            }
        }

        private static void TraceCall(string methodName)
        {
            // Write to log
        }
    }
}

The Results:

Tracing Disabled, passing Method Name: 294ms
Tracing Enabled, passing Method Name: 298ms
Tracing Disabled, looking up Method Name: 0ms
Tracing Enabled, looking up Method Name: 1230ms

The performance impact comes from the StackFrame constructor and the method-name lookup.

As of C# 5, it is definitely better to get the compiler to bake the caller's name into the call site instead, using [CallerMemberName]:

public void Log(string message, [CallerMemberName] string caller = null)
{
}

Then:

public void DoSomething()
{
    logger.Log("A message");
}

... is converted by the C# compiler into:

public void DoSomething()
{
    logger.Log("A message", "DoSomething");
}
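A runnable sketch of this pattern, assuming a simple illustrative Logger class with an Enabled flag (these names are not from the original code):

```csharp
using System;
using System.Runtime.CompilerServices;

class Logger
{
    public bool Enabled { get; set; }
    public string LastEntry { get; private set; }

    // caller is filled in by the compiler at the call site: no reflection,
    // no StackFrame, so the disabled path costs only a branch.
    public void Log(string message, [CallerMemberName] string caller = null)
    {
        if (!Enabled) return;
        LastEntry = caller + ": " + message;
    }
}

class Program
{
    static readonly Logger logger = new Logger { Enabled = true };

    static void DoSomething()
    {
        logger.Log("A message"); // compiled as logger.Log("A message", "DoSomething")
    }

    static void Main()
    {
        DoSomething();
        Console.WriteLine(logger.LastEntry); // DoSomething: A message
    }
}
```

Because the caller's name is resolved at compile time, this keeps the one-line call site of Option B without any of the runtime stack-walking cost.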

C# stack frames, curly brackets and performance

When I write this:

int x = 4;
int y = 3;
Console.WriteLine(x + y);
Console.WriteLine(x);

Or this:

int x = 4;
{
    int y = 3;
    Console.WriteLine(x + y);
}
Console.WriteLine(x);

I still get this in the IL:

.maxstack 2
.locals init (
[0] int32 x,
[1] int32 y
)

There's no extra stack frame: curly brackets only introduce a compile-time scope, while stack frames correspond to method calls, not to blocks.
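This can also be checked at runtime; a small sketch using StackTrace.FrameCount (the variable names are illustrative):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        int outer = new StackTrace().FrameCount;
        {
            // a brace-delimited block affects only compile-time visibility;
            // the runtime frame count is unchanged inside it
            int inner = new StackTrace().FrameCount;
            Console.WriteLine(inner == outer); // True
        }
    }
}
```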

Is it cheaper to get a specific StackFrame instead of StackTrace.GetFrame?

See the recommendations for building a correct benchmark: use a prime number of iterations (to suppress the JIT's loop-unrolling optimization), run the benchmark in Release mode without a debugger attached, warm up the cache, etc.
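A minimal hand-rolled harness along those lines (warm-up pass, prime iteration count, Stopwatch timing) might look like this; the timings it prints will of course vary by machine:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Program
{
    const int IterationCount = 100003; // prime, to discourage loop unrolling

    // NoInlining so the call really produces a frame to skip
    [MethodImpl(MethodImplOptions.NoInlining)]
    static StackFrame Capture() => new StackFrame(1, false);

    static void Main()
    {
        // warm-up pass so JIT compilation is not included in the measurement
        for (int i = 0; i < 1000; i++) Capture();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < IterationCount; i++) Capture();
        sw.Stop();
        Console.WriteLine("StackFrame: {0}ms", sw.ElapsedMilliseconds);
    }
}
```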

I added your example to BenchmarkDotNet; see StackFrameProgram.cs:

public class StackFrameProgram
{
    private const int IterationCount = 100001;

    public void Run()
    {
        var competition = new BenchmarkCompetition();
        competition.AddTask("StackFrame", () => StackFrame());
        competition.AddTask("StackTrace", () => StackTrace());
        competition.Run();
    }

    private StackFrame StackFrame()
    {
        StackFrame method = null;
        for (int i = 0; i < IterationCount; i++)
            method = new StackFrame(1, false);
        return method;
    }

    private StackFrame StackTrace()
    {
        StackFrame method = null;
        for (int i = 0; i < IterationCount; i++)
            method = new StackTrace().GetFrame(1);
        return method;
    }
}

Here are my results (Intel Core i7-3632QM CPU @ 2.20GHz):

x86, .NET 3.5:
StackFrame : 1035ms
StackTrace : 1619ms

x64, .NET 3.5:
StackFrame : 981ms
StackTrace : 1754ms

x86, .NET 4.0:
StackFrame : 735ms
StackTrace : 1150ms

x64, .NET 4.0:
StackFrame : 637ms
StackTrace : 880ms

Let's look inside:

public StackFrame.ctor(int skipFrames, bool fNeedFileInfo)
{
    this.InitMembers();
    this.BuildStackFrame(skipFrames, fNeedFileInfo);
}

private void StackFrame.BuildStackFrame(int skipFrames, bool fNeedFileInfo)
{
    StackFrameHelper sfh = new StackFrameHelper(fNeedFileInfo, null);
    StackTrace.GetStackFramesInternal(sfh, 0, null);
    int numberOfFrames = sfh.GetNumberOfFrames();
    skipFrames += StackTrace.CalculateFramesToSkip(sfh, numberOfFrames);
    if ((numberOfFrames - skipFrames) > 0)
    {
        this.method = sfh.GetMethodBase(skipFrames);
        this.offset = sfh.GetOffset(skipFrames);
        this.ILOffset = sfh.GetILOffset(skipFrames);
        if (fNeedFileInfo)
        {
            this.strFileName = sfh.GetFilename(skipFrames);
            this.iLineNumber = sfh.GetLineNumber(skipFrames);
            this.iColumnNumber = sfh.GetColumnNumber(skipFrames);
        }
    }
}

public StackTrace.ctor()
{
    this.m_iNumOfFrames = 0;
    this.m_iMethodsToSkip = 0;
    this.CaptureStackTrace(0, false, null, null);
}

private void StackTrace.CaptureStackTrace(int iSkip, bool fNeedFileInfo, Thread targetThread, Exception e)
{
    this.m_iMethodsToSkip += iSkip;
    StackFrameHelper sfh = new StackFrameHelper(fNeedFileInfo, targetThread);
    GetStackFramesInternal(sfh, 0, e);
    this.m_iNumOfFrames = sfh.GetNumberOfFrames();
    if (this.m_iMethodsToSkip > this.m_iNumOfFrames)
    {
        this.m_iMethodsToSkip = this.m_iNumOfFrames;
    }
    if (this.m_iNumOfFrames != 0)
    {
        this.frames = new StackFrame[this.m_iNumOfFrames];
        for (int i = 0; i < this.m_iNumOfFrames; i++)
        {
            bool flag = true;
            bool flag2 = true;
            StackFrame frame = new StackFrame(flag, flag2);
            frame.SetMethodBase(sfh.GetMethodBase(i));
            frame.SetOffset(sfh.GetOffset(i));
            frame.SetILOffset(sfh.GetILOffset(i));
            frame.SetIsLastFrameFromForeignExceptionStackTrace(sfh.IsLastFrameFromForeignExceptionStackTrace(i));
            if (fNeedFileInfo)
            {
                frame.SetFileName(sfh.GetFilename(i));
                frame.SetLineNumber(sfh.GetLineNumber(i));
                frame.SetColumnNumber(sfh.GetColumnNumber(i));
            }
            this.frames[i] = frame;
        }
        if (e == null)
        {
            this.m_iMethodsToSkip += CalculateFramesToSkip(sfh, this.m_iNumOfFrames);
        }
        this.m_iNumOfFrames -= this.m_iMethodsToSkip;
        if (this.m_iNumOfFrames < 0)
        {
            this.m_iNumOfFrames = 0;
        }
    }
    else
    {
        this.frames = null;
    }
}

public virtual StackFrame StackTrace.GetFrame(int index)
{
    if (((this.frames != null) && (index < this.m_iNumOfFrames)) && (index >= 0))
    {
        return this.frames[index + this.m_iMethodsToSkip];
    }
    return null;
}
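In other words, both paths capture the whole stack via GetStackFramesInternal; StackFrame(1) is cheaper mainly because it skips materializing a StackFrame object per frame. A small sketch showing that both return the same caller (the method names here are illustrative, and NoInlining is used so the caller really appears one frame up in Release builds):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static string ViaStackFrame() => new StackFrame(1, false).GetMethod().Name;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static string ViaStackTrace() => new StackTrace().GetFrame(1).GetMethod().Name;

    static void Main()
    {
        // both identify the calling method; only the amount of work differs
        Console.WriteLine(ViaStackFrame()); // Main
        Console.WriteLine(ViaStackTrace()); // Main
    }
}
```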

Is the stack frame of a function cleared, or is it left as such on the stack, when we return from the function

All data remains in memory. The only things changed are the stack top pointer (esp) and the stack frame base (ebp). A stack frame is a somewhat abstract concept; it is defined only by the values of these two registers. Hence the region above esp is "not visible", even though it may still contain data left over from previous stack frames.

Understanding fastcall stack frame

push rbp        ; save non-volatile rbp
mov  rbp, rsp   ; save old stack pointer
sub  rsp, 32    ; reserve space for 32 bytes of local variables = 8 integers
                ; or 4 pointers. This is per the MS/Intel guides; you can use
                ; it as temp storage for the parameters or for local variables.
and  spl, -16   ; align stack by 16 bytes (for SSE code)

mov  rsp, rbp   ; restore the old stack pointer
pop  rbp        ; restore rbp

How does subtracting 32 from RSP do anything at all

RSP is the stack pointer, not just another register. Doing anything to it affects the stack. In this case it reserves 8x4 = 32 bytes of space on the stack for local variables to be placed in.

What is SPL and why does masking it make something 16 byte aligned?

The `and spl,-16` forces the four least significant bits of RSP to zero, and because the stack grows down this aligns it by 16 bytes.

The alignment by 16 bytes is needed when using SSE code, which x64 uses for floating point math. Having 16 byte alignment allows the compiler to use the faster aligned SSE load and store instructions.

SPL is the lower 8 bits of RSP. Why the compiler chooses to use it here is unclear: both instructions are 4 bytes, and `and rsp,-16` is strictly better because it does not incur a partial register update.

Disassembly:

0:  40 80 e4 f0    and spl,-16   ; bad: partial register update
4:  48 83 e4 f0    and rsp,-16   ; good
8:  83 e4 f0       and esp,-16   ; not possible: would zero the upper 32 bits of rsp

[RSP is] just another register, right?

No, RSP is magically special.

It points to the stack, which is what the PUSH and POP instructions operate on.

All local variables and parameters (which do not fit into the registers) are stored in the stack.

Understanding fastcall

There is only one calling convention on x64. To make matters more confusing, if you specify a calling convention other than __fastcall, most compilers will silently remap it to __fastcall on x64.

Access stacktraces with good performance?

Thread.currentThread().getStackTrace() is basically the same thing, as @apangin points out, but it could be faster in the future if it avoided creating a Throwable.

However, you might find sub-sampling will give you the improvement you need. Instead of recording every access, record every N-th access. If you record every 10th access you can reduce the overhead by up to 90% and if you have a high number of accesses it will be almost as accurate.
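The sub-sampling idea is language-agnostic; here is a sketch in C# to match the rest of this page (the class name, interval, and counter are all illustrative):

```csharp
using System;
using System.Diagnostics;

class SampledTracer
{
    private readonly int interval;
    private int counter;

    public int Samples { get; private set; }

    public SampledTracer(int interval) { this.interval = interval; }

    public void OnAccess()
    {
        // cheap common path: one increment and one branch
        if (++counter % interval != 0) return;

        // the expensive stack walk runs only once per `interval` accesses
        var caller = new StackFrame(1, false).GetMethod().Name;
        Samples++; // a real tracer would record `caller` here
    }
}

class Program
{
    static void Main()
    {
        var tracer = new SampledTracer(10);
        for (int i = 0; i < 100; i++) tracer.OnAccess();
        Console.WriteLine(tracer.Samples); // 10
    }
}
```

With an interval of 10, the stack-walk cost is paid on only 10% of accesses, which is where the "reduce the overhead by up to 90%" figure comes from.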

Another option is to use a profiler like YourKit, which can do this much more efficiently. It can show you how many different callers a method has, along with their stack traces (typically it records every 10th call).
