Why Does Optimisation Kill This Function?

This code violates the strict aliasing rules, which make it illegal to access an object through a pointer of an incompatible type (access through a char * is the notable exception). The compiler is allowed to assume that pointers to different types do not point to the same memory and to optimize accordingly. It also means the code invokes undefined behavior and could really do anything.
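
For example, here is a minimal sketch of my own (not the OP's code) showing the difference: inspecting the bytes of a uint32_t through an unsigned char * is well-defined, while reading it through a uint16_t * falls foul of the rule:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t value = 0x11223344;

    /* OK: character types may alias any object, so byte-wise
       inspection through unsigned char * is well-defined       */
    unsigned char *bytes = (unsigned char *)&value;
    for (size_t i = 0; i < sizeof value; i++)
        printf("%02x ", (unsigned)bytes[i]);
    printf("\n");

    /* NOT OK: uint16_t is neither a character type nor compatible
       with uint32_t, so this read violates strict aliasing
       (undefined behavior)                                         */
    uint16_t *halves = (uint16_t *)&value;
    printf("%04x\n", halves[0]);

    return 0;
}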

One of the best references for this topic is Understanding Strict Aliasing, and we can see that its first example is in a similar vein to the OP's code:

uint32_t swap_words( uint32_t arg )
{
    uint16_t* const sp = (uint16_t*)&arg;
    uint16_t hi = sp[0];
    uint16_t lo = sp[1];

    sp[1] = hi;
    sp[0] = lo;

    return (arg);
}

The article explains that this code violates the strict aliasing rules, since sp is an alias of arg but they have different types, and says that although it will compile, it is likely arg will be unchanged after swap_words returns. In simple tests I am unable to reproduce that result with either the code above or the OP's code, but that does not mean much, since this is undefined behavior and therefore not predictable.

The article goes on to discuss many different cases and presents several working solutions, including type-punning through a union, which is well-defined in C99¹ and may be undefined in C++ but in practice is supported by most major compilers; for example, here is gcc's reference on type-punning. The previous thread Purpose of Unions in C and C++ goes into the gory details. Although there are many threads on this topic, this one seems to do the best job.

The code for that solution is as follows:

typedef union
{
    uint32_t u32;
    uint16_t u16[2];
} U32;

uint32_t swap_words( uint32_t arg )
{
    U32 in;
    uint16_t lo;
    uint16_t hi;

    in.u32 = arg;
    hi = in.u16[0];
    lo = in.u16[1];
    in.u16[0] = lo;
    in.u16[1] = hi;

    return (in.u32);
}
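
As a quick sanity check (my own test harness, assuming the U32/swap_words definitions above are in scope), swapping the two 16-bit halves turns 0x11112222 into 0x22221111 whatever the host byte order, since both halves travel together:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* U32 and swap_words() as defined above */

int main(void)
{
    uint32_t before = 0x11112222;
    uint32_t after  = swap_words(before);

    printf("0x%08" PRIX32 " -> 0x%08" PRIX32 "\n", before, after);
    /* prints: 0x11112222 -> 0x22221111 */
    return 0;
}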

For reference, the relevant section from the C99 draft standard on strict aliasing is 6.5 Expressions, paragraph 7, which says:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:76)

— a type compatible with the effective type of the object,

— a qualified version of a type compatible with the effective type of the object,

— a type that is the signed or unsigned type corresponding to the effective type of the object,

— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

— a character type.

and footnote 76 says:

The intent of this list is to specify those circumstances in which an object may or may not be aliased.

and the relevant section from the C++ draft standard is 3.10 Lvalues and rvalues, paragraph 10.
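
To make the quoted C99 list concrete, here is a small sketch of my own showing which kinds of access the rule permits for an object whose effective type is int:

int main(void)
{
    int x = -1;

    int           *p1 = &x;                   /* compatible type: OK           */
    const int     *p2 = &x;                   /* qualified version: OK         */
    unsigned int  *p3 = (unsigned int *)&x;   /* corresponding unsigned: OK    */
    unsigned char *p4 = (unsigned char *)&x;  /* character type: OK            */
    float         *p5 = (float *)&x;          /* none of the above: UB if read */

    (void)*p1; (void)*p2; (void)*p3; (void)*p4;
    /* (void)*p5;   reading through p5 would violate 6.5 paragraph 7 */
    (void)p5;

    return 0;
}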

The article Type-punning and strict-aliasing gives a gentler but less complete introduction to the topic and C99 revisited gives a deep analysis of C99 and aliasing and is not light reading. This answer to Accessing inactive union member - undefined? goes over the muddy details of type-punning through a union in C++ and is not light reading either.


Footnotes:

  1. Quoting a comment by Pascal Cuoq: [...] C99 that was initially clumsily worded, appearing to make type-punning through unions undefined. In reality, type-punning through unions is legal in C89, legal in C11, and it was legal in C99 all along, although it took until 2004 for the committee to fix incorrect wording, and the subsequent release of TC3. open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

Why does this optimization algorithm in R stop after a few function evaluations?

The parameters you arrive at by step 493 lead to an infinite loop in your qst function: not having any idea what this very complex code is actually doing, I'm afraid I can't diagnose further. Here's what I did to get that far:

  • I set cur.param <- NULL in the global environment, then put cur.param <<- param within stcopn11; this saves the current set of parameters to the global environment, so that when you break out of the optim() call manually (via Control-C or ESC depending on your platform) you can inspect the current set of parameters and restart from them easily
  • I put in old-school debugging statements (e.g. cat("entering stcopn11\n") and cat("leaving stcopn11\n") at the beginning and at the next-to-last line of the objective function, plus a few more within stcopn11 to mark progress)
  • once I had the "bad" parameters I used debug(stcopn11) and stcopn11(cur.param) to step through the function
  • I discovered that it was hanging on dimension 3 (j==3 in the for loop within stcopn11) and particularly on the first qst() call
  • I added a maxit=1e5 argument to qst; initialized it <- 1 before the while loop; set it <- it+1 each time through the loop; changed the stopping criterion to while (sum(nc) > 0 && it<maxit); and added if (it==maxit) stop("hit max number of iterations in qst") right after the loop

1e5 iterations in qst took 74 seconds; I have no idea whether it might stop eventually, but didn't want to wait to find out.

This was my modified version of stcopn11:

cur.param <- NULL  ## set parameter placeholder

##5. negative log likelihood for multivariate skew-t copula
stcopn11 <- function(param,debug=FALSE) {
    cat("stcopn11\n")
    cur.param <<- param ## record current params outside function
    N <- nrow(udat)
    mpoints <- 150
    npar <- length(param)
    nu <- exp(param[npar])+2
    R <- paramToExtCorr(param)
    Omega <- R[-1, -1]
    delta <- R[1, -1]
    zeta <- delta/sqrt(1-delta*delta)
    cat("... solving iOmega")
    iOmega <- solve(Omega)
    alpha <- iOmega %*% delta /
        sqrt(1-(t(delta) %*% iOmega %*% delta)[1,1])
    ix <- matrix(0, nrow=N, ncol=dim)
    lm <- matrix(0, nrow=N, ncol=dim)
    cat("... entering dim loop\n")
    for (j in 1:dim){
        if (debug) cat(j,"\n")
        minx <- qst(min(udat[,j]), alpha=zeta[j], nu=nu)
        maxx <- qst(max(udat[,j]), alpha=zeta[j], nu=nu)
        xx <- seq(minx, maxx, length=mpoints)
        px <- sort(pst(xx, alpha=zeta[j], nu=nu))
        ix[,j] <- pchip(px, xx, udat[,j])
        lm[,j] <- dst(ix[,j], alpha=zeta[j], nu=nu, log=TRUE)
    }
    lc <- dmst(ix, Omega=Omega, alpha=alpha, nu=nu, log=TRUE)
    cat("leaving stcopn11\n")
    -sum(lc)+sum(lm)
}

Why does the compiler not optimize away interrupt code?

  1. The language execution model says that an ordinary (non-volatile) variable cannot be changed by "external forces": if your code flow does not explicitly change the variable, then from that code flow's point of view the variable cannot possibly change (aside from what C11 defines for multithreaded execution). You have to explicitly designate variables that can be changed by interrupt handlers, by declaring them volatile (see the sketch after this list).

    This is one of the main factors that enables efficient optimizations of C code. It cannot be eliminated without making a significant negative impact on the performance of C programs.

  2. Firstly, compilers don't usually optimize away functions with external linkage. Is your interrupt handler declared with external linkage?

    Secondly, the decision to optimize out a function or keep it is not really based on whether the function is called or not. It is based on whether the corresponding symbol is referenced in your program in any way. Non-referenced symbols are removed, while referenced ones are kept. There are other ways to reference a function symbol besides calling it: for example, taking the address of a function also counts as a reference to its symbol. Functions whose addresses are taken anywhere in the program are never optimized away.

    Your interrupt vector entry gets initialized somehow at program startup, which normally involves taking the address of the handler function. That is already sufficient to protect the function from being optimized out.
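
Returning to point 1, here is a minimal sketch of the usual pattern (my own code, with made-up names): a flag shared between an interrupt handler and the main loop is declared volatile, so the compiler must re-read it on every iteration instead of caching it in a register:

#include <stdint.h>

/* Without 'volatile' the compiler may read data_ready once, conclude it
   can never change inside the loop, and turn the wait loop into 'while (1);' */
static volatile uint8_t data_ready = 0;

/* Hypothetical interrupt handler; the vector table set up by the startup
   code references its address, which is what keeps it from being discarded */
void uart_rx_isr(void)
{
    data_ready = 1;
}

void main_loop(void)
{
    for (;;)
    {
        while (!data_ready)
        {
            /* wait: volatile forces a fresh read of data_ready each pass */
        }
        data_ready = 0;
        /* ... handle the received data ... */
    }
}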

Using this pointer causes strange deoptimization in hot loop

Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather obscene possibility that you initialized:

this->target = &this

In that case, writing to this->target[0] would alter the contents of this (and thus, this->target).

The memory aliasing problem is not restricted to the above. In principle, any use of this->target[XX] given an (in)appropriate value of XX might point to this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.
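
As a minimal C sketch of the same idea (my own code, with made-up names; in standard C the keyword is spelled restrict, __restrict__ being the GCC/Clang spelling that also works in C++): without the qualifier, every store into the buffer might, as far as the compiler knows, overwrite the struct itself, so its fields get reloaded on each iteration; a restrict-qualified local promises there is no such overlap:

#include <stddef.h>

struct generator {
    unsigned *target;   /* output buffer the hot loop stores into */
    unsigned  value;    /* state the loop keeps updating          */
};

/* Pessimized form: the compiler must assume s->target could point back
   into *s, so s->value (and possibly s->target) are reloaded each pass */
void fill_slow(struct generator *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s->target[i] = s->value++;
}

/* With a restrict-qualified local pointer, the buffer is promised not to
   overlap the struct, so both the pointer and the counter can stay in
   registers across the whole loop                                        */
void fill_fast(struct generator *s, size_t n)
{
    unsigned *restrict out = s->target;
    unsigned value = s->value;

    for (size_t i = 0; i < n; i++)
        out[i] = value++;

    s->value = value;
}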

Deoptimizations kill the performance with binary trees

(V8 developer here.)

The premise of this question is incorrect: a few deopts don't matter, and don't move the needle regarding performance. Trying to avoid them is an exercise in futility.

The first step when trying to improve performance of something is to profile it. In this case, a profile reveals that the benchmark is spending:

  • about 46.3% of the time in optimized code (about 4/5 of that for tree creation and 1/5 for tree iteration)
  • about 0.1% of the time in unoptimized code
  • about 52.8% of the time in the garbage collector, tracing and freeing all those short-lived objects.

This is as artificial a microbenchmark as they come. 50% GC time never happens in real-world code that does useful things aside from allocating multiple gigabytes of short-lived objects as fast as possible.

In fact, calling them "short-lived objects" is a bit inaccurate in this case. While the vast majority of the individual trees being constructed are indeed short-lived, the code allocates one super-large long-lived tree early on. That fools V8's adaptive mechanisms into assuming that all future TreeNodes will be long-lived too, so it allocates them in "old space" right away -- which would save time if the guess was correct, but ends up wasting time because the TreeNodes that follow are actually short-lived and would be better placed in "new space" (which is optimized for quickly freeing short-lived objects). So just by reshuffling the order of operations, I can get a 3x speedup.

This is a typical example of one of the general problems with microbenchmarks: by doing something extreme and unrealistic, they often create situations that are not at all representative of typical real-world scenarios. If engine developers optimized for such microbenchmarks, engines would perform worse for real-world code. If JavaScript developers try to derive insights from microbenchmarks, they'll write code that performs worse under realistic conditions.

Anyway, if you want to optimize this code, avoid as many of those object allocations as you can.


Concretely:

An artificial microbenchmark like this, by its nature, intentionally does useless work (such as: computing the same value a million times). You said you wanted to optimize it, which means avoiding useless work, but you didn't specify which parts of the useless work you'd like to preserve, if any. So in the absence of a preference, I'll assume that all useless work is useless. So let's optimize!

Looking at the code, it creates perfect binary trees of a given depth and counts their nodes. In other words, it sums up the 1s in these examples:

depth=0:

1

depth=1:

  1
 / \
1   1

depth=2:

      1
    /   \
   1     1
  / \   / \
 1   1 1   1

and so on. Each level k contributes 2 ** k nodes, so summing the geometric series 1 + 2 + 4 + ... + 2 ** N shows that such a tree of depth N has (2 ** (N+1)) - 1 nodes. So we can replace:

itemCheck(bottomUpTree(depth));

with

(2 ** (depth+1)) - 1

(and analogously for the "stretchDepth" line).

Next, we can take care of the useless repetitions. Since x + x + x + x + ... N times is the same as x*N, we can replace:

let check = 0;
for (let i = 0; i < iterations; i++) {
    check += (2 ** (depth + 1)) - 1;
}

with just:

let check = ((2 ** (depth + 1)) - 1) * iterations;

With that we're down from 12 seconds to about 0.1 seconds. Not bad for five minutes of work, eh?

And that remaining time is almost entirely due to the longLivedTree. To apply the same optimizations to the operations creating and iterating that tree, we'd have to move them together, getting rid of its "long-livedness". Would you find that acceptable? You could get the overall time down to less than a millisecond! Would that make the benchmark useless? Not actually any more useless than it was to begin with, just more obviously so.

Is code clearness killing application performance?

Clean code doesn't kill performance. Bad code kills performance.


