Is Discard Bad for Program Performance in OpenGL

Is discard bad for program performance in OpenGL?

It's hardware-dependent. For PowerVR hardware, and other GPUs that use tile-based rendering, using discard means that the TBR can no longer assume that every fragment drawn will become a pixel. This assumption is important because it allows the TBR to evaluate all the depths first, then only evaluate the fragment shaders for the top-most fragments. A sort of deferred rendering approach, except in hardware.

Note that you would get the same issue from turning on alpha test.
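For concreteness, here is a minimal sketch of the kind of alpha-tested fragment shader this is about (the names u_texture and v_uv are illustrative, not taken from any answer here):

precision mediump float;
uniform sampler2D u_texture;   // hypothetical sampler name
varying vec2 v_uv;             // hypothetical texture coordinate

void main()
{
    vec4 color = texture2D(u_texture, v_uv);
    if (color.a < 0.5)
        discard;   // the fragment may vanish, so the TBR can no longer
                   // resolve depth before running the shader
    gl_FragColor = color;
}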

How can I use something like discard; to increase performance?

Using discard will actually prevent the GPU from doing early depth testing, i.e. from using the depth buffer to reject fragments before invoking the fragment shader.

It's much simpler to adjust the far plane of the projection matrix so that unwanted geometry falls outside the view volume and is clipped.
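As a rough sketch of what that looks like, assuming a hand-rolled, column-major OpenGL perspective matrix (the helper function and the values mentioned in its comment are hypothetical):

#include <math.h>

/* Standard OpenGL perspective matrix, column-major. Pulling "zfar" in
   (say from 5000.0f to 1500.0f) makes distant geometry get clipped
   before any fragment work happens. */
static void perspective(float out[16], float fovy_rad, float aspect,
                        float znear, float zfar)
{
    float f = 1.0f / tanf(fovy_rad * 0.5f);
    for (int i = 0; i < 16; ++i)
        out[i] = 0.0f;
    out[0]  = f / aspect;
    out[5]  = f;
    out[10] = (zfar + znear) / (znear - zfar);
    out[11] = -1.0f;
    out[14] = (2.0f * zfar * znear) / (znear - zfar);
}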

Also consider not issuing the glDraw* call for distant trees in the first place.

For example, you can group the trees per island, check per island whether it is close enough for its trees to be visible, and simply not call glDraw* when it is not.
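A hedged sketch of that idea in C; the Island struct and draw_tree_batch() are made-up names, not from the original answer:

struct Island { float x, y, z; float radius; };

/* hypothetical: issues the actual glDraw* call(s) for one island's trees */
void draw_tree_batch(const struct Island *island);

void draw_visible_trees(const struct Island *islands, int count,
                        float cam_x, float cam_y, float cam_z,
                        float max_dist)
{
    for (int i = 0; i < count; ++i) {
        float dx = islands[i].x - cam_x;
        float dy = islands[i].y - cam_y;
        float dz = islands[i].z - cam_z;
        float r  = max_dist + islands[i].radius;
        /* squared-distance test avoids a sqrt per island */
        if (dx*dx + dy*dy + dz*dz <= r*r)
            draw_tree_batch(&islands[i]);   /* close enough: draw */
        /* too far away: no draw call at all */
    }
}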

OpenGL ES 2.0 shader - discard pixels with alpha == 0 without discard and if

When you do

gl_FragColor = mix(p, c, c.a);

then the alpha channel of the texture is lost if c.a == 1.0, because the alpha channel of the result is also read from c.

You have to mix the .rgb components of the texture and the color, but keep the alpha channel of the texture in any case, to solve the issue:

gl_FragColor = vec4(mix(p.rgb, c.rgb, c.a), p.a);

If the color should be black at those parts where the alpha channel of the texture is 0.0, then the color has to be multiplied by the alpha channel of the texture before it is mixed:

gl_FragColor = vec4(mix(p.rgb, c.rgb*p.a, c.a), p.a);

But note that, because of texture filtering, some colors read from the texture may be neither completely opaque nor completely transparent. If you want the output color to be either completely opaque or completely transparent, then you have to compare the alpha channel to a threshold. This can be done with the GLSL function step, which returns either 1.0 or 0.0:

gl_FragColor = vec4(mix(p.rgb, c.rgb, c.a), step(0.5, p.a));
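Putting it together, a complete ES 2.0 fragment shader might look like this; the uniform and varying names (u_texture, u_color, v_uv) are assumptions for the sketch:

precision mediump float;

uniform sampler2D u_texture;   // source of p in the snippets above
uniform vec4      u_color;     // c in the snippets above
varying vec2      v_uv;

void main()
{
    vec4 p = texture2D(u_texture, v_uv);
    // mix the colors, but take alpha from the texture, hard-thresholded
    gl_FragColor = vec4(mix(p.rgb, u_color.rgb, u_color.a), step(0.5, p.a));
}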

Do conditional statements slow down shaders?

What is it about shaders that even potentially makes if statements performance problems? It has to do with how shaders get executed and where GPUs get their massive computing performance from.

Separate shader invocations are usually executed in parallel, executing the same instructions at the same time. They're simply executing them on different sets of input values; they share uniforms, but they have different internal registers. One term for a group of invocations all executing the same sequence of operations is "wavefront".

The potential problem with any form of conditional branching is that it can screw all that up. It causes different invocations within the wavefront to have to execute different sequences of code. That is a very expensive process, whereby a new wavefront has to be created, data copied over to it, etc.

Unless... it doesn't.

For example, if the condition is one that is taken by every invocation in the wavefront, then no runtime divergence is needed. As such, the cost of the if is just the cost of checking a condition.

So, let's say you have a conditional branch, and let's assume that all of the invocations in the wavefront will take the same branch. There are three possibilities for the nature of the expression in that condition (a GLSL sketch illustrating all three follows the list):

  • Compile-time static. The conditional expression is entirely based off of compile-time constants. As such, you know from looking at the code which branches will be taken. Pretty much any compiler handles this as part of basic optimization.
  • Statically uniform branching. The condition is based off of expressions involving things which are known at compile-time to be constant (specifically, constants and uniform values). But the value of the expression will not be known at compile-time. So the compiler can statically be certain that wavefronts will never be broken by this if, but the compiler cannot know which branch will be taken.
  • Dynamic branching. The conditional expression contains terms other than constants and uniforms. Here, a compiler cannot tell a priori if a wavefront will be broken up or not. Whether that will need to happen depends on the runtime evaluation of the condition expression.
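Here is the promised GLSL sketch of the three cases; all of the names are invented for illustration:

precision mediump float;

const bool  kDebugTint  = false;  // 1. compile-time static: folded away
uniform bool u_enabled;           // 2. statically uniform: same for every invocation
varying float v_intensity;        // 3. dynamic: can differ between invocations

void main()
{
    vec3 color = vec3(0.0);
    if (kDebugTint)                   // resolved entirely at compile time
        color = vec3(1.0, 0.0, 1.0);
    if (u_enabled)                    // can never split a wavefront
        color += vec3(0.1);
    if (v_intensity > 0.5)            // may or may not diverge at runtime
        color += vec3(0.5);
    gl_FragColor = vec4(color, 1.0);
}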

Different hardware can handle different branching types without divergence.

Also, even if a condition is taken by different wavefronts, the compiler could restructure the code to not require actual branching. You gave a fine example: output = input*enable + input2*(1-enable); is functionally equivalent to the if statement. A compiler could detect that an if is being used to set a variable, and thus execute both sides. This is frequently done for cases of dynamic conditions where the bodies of the branches are small.

Pretty much all hardware can handle var = bool ? val1 : val2 without having to diverge. This was possible way back in 2002.
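To make that concrete, here are three functionally equivalent GLSL formulations; a compiler is free to turn the first into one of the others, and the function and parameter names are made up for the sketch:

precision mediump float;

float pick_branch(bool enable, float a, float b)
{
    float result;
    if (enable) result = a;    // may or may not compile to a real branch
    else        result = b;
    return result;
}

float pick_select(bool enable, float a, float b)
{
    return enable ? a : b;     // a select, no divergence required
}

float pick_blend(float enable, float a, float b)
{
    return a * enable + b * (1.0 - enable);   // the arithmetic form above
}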

Since this is very hardware-dependent, it... depends on the hardware. There are however certain epochs of hardware that can be looked at:

Desktop, Pre-D3D10

There, it's kinda the wild west. NVIDIA's compiler for such hardware was notorious for detecting such conditions and actually recompiling your shader whenever you changed uniforms that affected such conditions.

In general, this era is where about 80% of the "never use if statements" advice comes from. But even here, it's not necessarily true.

You can expect optimization of static branching. You can hope that statically uniform branching won't cause any additional slowdown (though the fact that NVIDIA thought recompilation would be faster than executing it makes it unlikely at least for their hardware). But dynamic branching is going to cost you something, even if all of the invocations take the same branch.

Compilers of this era do their best to optimize shaders so that simple conditions can be executed simply. For example, your output = input*enable + input2*(1-enable); is something that a decent compiler could generate from your equivalent if statement.

Desktop, Post-D3D10

Hardware of this era is generally capable of handling statically uniform branches with little slowdown. For dynamic branching, you may or may not encounter slowdown.

Desktop, D3D11+

Hardware of this era is pretty much guaranteed to be able to handle dynamically uniform conditions with few performance issues. Indeed, it doesn't even have to be dynamically uniform; so long as all of the invocations within the same wavefront take the same path, you won't see any significant performance loss.

Note that some hardware from the previous epoch probably could do this as well. But this is the one where it's almost certain to be true.

Mobile, ES 2.0

Welcome back to the wild west. Though unlike Pre-D3D10 desktop, this is mainly due to the huge variance of ES 2.0-caliber hardware. There's a huge range of hardware that can handle ES 2.0, and it all works very differently from the rest.

Static branching will likely be optimized. But whether you get good performance from statically uniform branching is very hardware-dependent.

Mobile, ES 3.0+

Hardware here is rather more mature and capable than ES 2.0. As such, you can expect statically uniform branches to execute reasonably well. And some hardware can probably handle dynamic branches the way modern desktop hardware does.

OpenGL (ES): Can an implementation optimize fragments resulting from overdraw?

PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.

Alpha blending is very order-dependent, so rasterization and shading can no longer be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.

In all honesty, if you are trying to optimize for a platform based on PowerVR hardware, you should probably make this one of your goals. By that I mean: before optimizing shaders, first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than on other hardware... the operation itself is no more complicated; it just prevents PVR hardware from working the special way it was designed to.
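As a hedged sketch of what a TBDR-friendly drawing order and state setup can look like (the two draw functions are hypothetical):

#include <GLES2/gl2.h>

void draw_opaque_geometry(void);                 /* hypothetical */
void draw_blended_geometry_back_to_front(void);  /* hypothetical */

void draw_frame(void)
{
    /* opaque first, blending off: the deferred HSR can cull every
       completely obscured fragment before shading */
    glDisable(GL_BLEND);
    draw_opaque_geometry();

    /* blended geometry last, and as little of it as possible */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    draw_blended_geometry_back_to_front();
    glDisable(GL_BLEND);
}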

iOS - How to benchmark a shader

The OpenGL ES frame debugger in Xcode might give you good insight into how much time each shader takes and which parts of a shader consume more time than others.

https://developer.apple.com/library/mac/recipes/xcode_help-debugger/articles/debugging_opengl_es_frame.html


