How to Alpha Blend RGBA Unsigned Byte Color Fast

How to alpha blend RGBA unsigned byte color fast?

Use SSE - start around page 131.

The basic workflow

  1. Load 4 pixels from src (sixteen 1-byte values): RGBA RGBA RGBA RGBA (streaming load)

  2. Load 4 more pixels that you want to blend on top of them, from srcByteTop: RGBx RGBx RGBx RGBx

  3. Do some swizzling so that the A term from step 1 fills every slot, i.e.

    xxxA xxxB xxxC xxxD -> AAAA BBBB CCCC DDDD

    In my solution below I opted instead to re-use your existing "maskCurrent" array, but having alpha integrated into the "A" field of the step 1 vector would require fewer loads from memory and thus be faster. Swizzling in this case would probably be: AND with a mask to select A; shift right 8; OR with the original; shift right 16; OR again (see the SSE2 sketch after this list).

  4. Subtract the above from a vector that is all 255 in every slot, giving (255 - alpha) for every pixel

  5. Multiply 1 * 4 (source with 255 - alpha) and 2 * 3 (the top layer with alpha).

    You should be able to use the "multiply and discard bottom 8 bits" SSE2 instruction for this.

  6. Add those two products from step 5 together

  7. Store those somewhere else (if possible) or on top of your destination (if you must)
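
The shift-based swizzle mentioned in step 3 could look like this minimal SSE2 sketch (my assumption: `src` is the step 1 vector and A is the most significant byte of each little-endian 32-bit pixel):

__m128i a = _mm_and_si128(src, _mm_set1_epi32((int)0xFF000000)); // select A: A000 per pixel
a = _mm_or_si128(a, _mm_srli_epi32(a, 8));  // A now fills the top two bytes: AA00
a = _mm_or_si128(a, _mm_srli_epi32(a, 16)); // A fills all four bytes: AAAA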

Here is a starting point for you:

// Define your image with __declspec(align(16)), i.e.
// __declspec(align(16)) char image[640*480];
// so the first byte is aligned correctly for SIMD.
// Stride must be a multiple of 16.

for (int y = top; y < bottom; ++y)
{
    BYTE* resultByte = GET_BYTE(resultBits, left, y, stride, bytepp);
    BYTE* srcByte = GET_BYTE(srcBits, left, y, stride, bytepp);
    BYTE* srcByteTop = GET_BYTE(srcBitsTop, left, y, stride, bytepp);
    BYTE* maskCurrent = GET_GREY(maskSrc, left, y, width);
    for (int x = left; x < right; x += 4)
    {
        // If you can't align, use _mm_loadu_si128()
        // Step 1
        __m128i src = _mm_load_si128(reinterpret_cast<__m128i*>(srcByte));
        // Step 2
        __m128i srcTop = _mm_load_si128(reinterpret_cast<__m128i*>(srcByteTop));

        // Step 3
        // Fill the 4 byte positions of each pixel with that pixel's mask value.
        // _mm_set_epi8 lists bytes from the highest (e15) down to the lowest (e0),
        // so maskCurrent[0] goes last to land in the first pixel's bytes.
        // Could do better with shifts and so on, but this is clear.
        __m128i mask = _mm_set_epi8(
            maskCurrent[3], maskCurrent[3], maskCurrent[3], maskCurrent[3],
            maskCurrent[2], maskCurrent[2], maskCurrent[2], maskCurrent[2],
            maskCurrent[1], maskCurrent[1], maskCurrent[1], maskCurrent[1],
            maskCurrent[0], maskCurrent[0], maskCurrent[0], maskCurrent[0]);

        // Step 4 - (255 - alpha) in every byte; 0xFF is -1 as a signed char
        __m128i maskInv = _mm_subs_epu8(_mm_set1_epi8(-1), mask);

        // TODO: multiply, with saturate - find the correct instructions for steps 5..6
        // Note you can use multiply-and-add: _mm_madd_epi16
        // Scalar reference for what those steps must compute
        // (R, G, B and CLAMPTOBYTE come from your existing code):
        int alpha = *maskCurrent;
        int red = (srcByteTop[R] * alpha + srcByte[R] * (255 - alpha)) / 255;
        int green = (srcByteTop[G] * alpha + srcByte[G] * (255 - alpha)) / 255;
        int blue = (srcByteTop[B] * alpha + srcByte[B] * (255 - alpha)) / 255;
        CLAMPTOBYTE(red);
        CLAMPTOBYTE(green);
        CLAMPTOBYTE(blue);
        resultByte[R] = red;
        resultByte[G] = green;
        resultByte[B] = blue;
        //----

        // Step 7 - store result (`result` is the vector produced by step 6)
        // Store aligned if the output is aligned on a 16 byte boundary
        _mm_store_si128(reinterpret_cast<__m128i*>(resultByte), result);
        // Slower version if you can't guarantee alignment:
        //_mm_storeu_si128(reinterpret_cast<__m128i*>(resultByte), result);

        // Move the pointers forward 4 pixels
        srcByte += bytepp * 4;
        srcByteTop += bytepp * 4;
        resultByte += bytepp * 4;
        maskCurrent += 4;
    }
}

To find out which AMD processors will run this code (it currently uses SSE2 instructions), see Wikipedia's List of AMD Turion microprocessors. You could also look at other lists of processors on Wikipedia, but my research shows that AMD CPUs from around 4 years ago all support at least SSE2.

You should expect a good SSE2 implementation to run around 8-16 times faster than your current code, because we eliminate branches in the loop, process 4 pixels (12 channels) at once, and improve cache performance by using streaming instructions. As an alternative to SSE, you could probably make your existing code run much faster by eliminating the if checks you are using for saturation; see the branchless clamp sketch below. Beyond that I would need to run a profiler on your workload.
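
A branchless clamp along those lines might look like this (a sketch, assuming 32-bit int and arithmetic right shift of negative values, which holds on mainstream compilers):

// Hypothetical drop-in for CLAMPTOBYTE: sign masks instead of if checks
static inline unsigned char clamp_to_byte(int x)
{
    x &= ~(x >> 31);          // if x < 0, force it to 0
    x |= (255 - x) >> 31;     // if x > 255, set all low bits
    return (unsigned char)x;  // keep the low 8 bits (255 when saturated)
}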

Of course, the best solution is to use hardware support (i.e. code your problem up in DirectX) and have it done on the video card.

Alpha Blending 2 RGBA colors in C

void blend(unsigned char result[4], unsigned char fg[4], unsigned char bg[4])
{
    unsigned int alpha = fg[3] + 1;
    unsigned int inv_alpha = 256 - fg[3];
    result[0] = (unsigned char)((alpha * fg[0] + inv_alpha * bg[0]) >> 8);
    result[1] = (unsigned char)((alpha * fg[1] + inv_alpha * bg[1]) >> 8);
    result[2] = (unsigned char)((alpha * fg[2] + inv_alpha * bg[2]) >> 8);
    result[3] = 0xff;
}

I don't know how fast it is, but it's all integer. It works by turning alpha (and inv_alpha) into 8.8 fixed-point representations. Don't worry about the fact that alpha's min value is 1. In that case, fg[3] was 0, meaning the foreground is transparent. The blends will be 1*fg + 256*bg, which means that all the bits of fg will be shifted out of the result.
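
As a quick sanity check of the two extremes (hypothetical driver code, not part of the original answer):

unsigned char out[4];
unsigned char red[4]       = { 255, 0, 0, 255 }; // opaque red
unsigned char clear_red[4] = { 255, 0, 0, 0 };   // fully transparent red
unsigned char blue[4]      = { 0, 0, 255, 255 }; // opaque blue
blend(out, red, blue);       // alpha = 256, inv_alpha = 1: out = {255, 0, 0, 255}
blend(out, clear_red, blue); // alpha = 1, inv_alpha = 256: out = {0, 0, 255, 255}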

You could do it very fast, indeed, if you packed your RGBAs in 64 bit integers. You could then compute all three result colors in parallel with a single expression.
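
A minimal, untested sketch of that idea, assuming 0xAARRGGBB pixels: the spreading gives each channel a 16-bit lane with 8 bits of headroom, so one multiply scales all four channels at once (note it blends the alpha channel too, unlike blend() above).

#include <stdint.h>

uint32_t blend32(uint32_t fg, uint32_t bg, uint8_t a)
{
    uint64_t alpha = a + 1;    // 1..256, same trick as above
    uint64_t inv   = 256 - a;  // alpha + inv == 257, so a lane maxes out at 0xFFFF
    // spread 0xAARRGGBB into 0x00AA00GG00RR00BB
    uint64_t f = ((uint64_t)(fg & 0xFF00FF00) << 24) | (fg & 0x00FF00FF);
    uint64_t b = ((uint64_t)(bg & 0xFF00FF00) << 24) | (bg & 0x00FF00FF);
    // blend all four channels in one expression, then strip inter-lane bits
    uint64_t o = ((f * alpha + b * inv) >> 8) & 0x00FF00FF00FF00FFULL;
    // repack to 0xAARRGGBB
    return (uint32_t)((o >> 24) & 0xFF00FF00) | (uint32_t)(o & 0x00FF00FF);
}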

Optimizing Alpha blending

You might be able to eke out a little more performance by representing a1*(2^24) as an integer, doing the arithmetic in integers, then shifting the result down by 24 bits. On modern architectures I doubt it would gain you much, though. If you want better performance, you'll really need to go for SIMD operations.

Oh, one thing: You should express the calculation of a1 as a1 = ((col1 & 0x000000FF) * (1.0 / 255.0)). That'll avoid an expensive FP division. (Compilers won't usually do that on their own, due to the potential loss of precision.)
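
Combining both suggestions, a per-channel sketch (assuming, as the a1 expression above implies, that col1 carries alpha in its low byte; my code, not the original answer's):

#include <stdint.h>

uint8_t blend_channel(uint8_t c1, uint8_t c2, uint32_t col1)
{
    uint32_t a1 = ((col1 & 0x000000FF) << 24) / 255; // a1 * 2^24, in 0..2^24
    // 64-bit intermediates: 255 * 2^24 does not fit in 32 bits
    uint64_t mixed = (uint64_t)c1 * a1 + (uint64_t)c2 * ((1u << 24) - a1);
    return (uint8_t)(mixed >> 24); // shift the result down by 24 bits
}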

SIMD for alpha blending - how to operate on every Nth byte?

Seeing gRenderSurface, I wonder whether you shouldn't just blend the images on the GPU (e.g., using a GLSL shader); if nothing else, reading memory back from the render surface can be very slow. Anyway, here's my cup of tea using SSE4.1, as I didn't find anything fully similar linked in the comments.

This one shuffles the alpha bytes into all color channel positions using _aa and does the "one minus source alpha" blending via the final masking. With AVX2 it outperforms the scalar implementation by a factor of ~5.7x, while the SSE4.1 version with separate low and high quadword processing is ~3.14x faster than the scalar implementation (both measured using Intel Compiler 19.0).

The division by 255 comes from How to divide 16-bit integer by 255 with using SSE?

const __m128i _aa = _mm_set_epi8( 15,15,15,15, 11,11,11,11, 7,7,7,7, 3,3,3,3 );
const __m128i _mask1 = _mm_set_epi16( -1,0,0,0, -1,0,0,0 );
const __m128i _mask2 = _mm_set_epi16( 0,-1,-1,-1, 0,-1,-1,-1 );
const __m128i _v255 = _mm_set1_epi8( -1 );
const __m128i _v1 = _mm_set1_epi16( 1 );

const int xmax = 4*source.cols-15;
for ( int y=0; y<source.rows; ++y )
{
    // OpenCV CV_8UC4 input
    const unsigned char * pS = source.ptr<unsigned char>( y );
    const unsigned char * pD = dest.ptr<unsigned char>( y );
    unsigned char *pOut = out.ptr<unsigned char>( y );
    for ( int x=0; x<xmax; x+=16 )
    {
        __m128i _src = _mm_loadu_si128( (__m128i*)( pS+x ) );
        __m128i _src_a = _mm_shuffle_epi8( _src, _aa );

        __m128i _dst = _mm_loadu_si128( (__m128i*)( pD+x ) );
        __m128i _dst_a = _mm_shuffle_epi8( _dst, _aa );
        __m128i _one_minus_src_a = _mm_subs_epu8( _v255, _src_a );

        // widen the low 8 bytes to 16-bit lanes
        __m128i _s_a = _mm_cvtepu8_epi16( _src_a );
        __m128i _s = _mm_cvtepu8_epi16( _src );
        __m128i _d = _mm_cvtepu8_epi16( _dst );
        __m128i _d_a = _mm_cvtepu8_epi16( _one_minus_src_a );
        __m128i _out = _mm_adds_epu16( _mm_mullo_epi16( _s, _s_a ), _mm_mullo_epi16( _d, _d_a ) );
        // divide by 255: (x + 1 + (x >> 8)) >> 8
        _out = _mm_srli_epi16( _mm_adds_epu16( _mm_adds_epu16( _v1, _out ), _mm_srli_epi16( _out, 8 ) ), 8 );
        // splice the blended alpha (saturated src_a + dst_a) into the alpha words
        _out = _mm_or_si128( _mm_and_si128( _out, _mask2 ), _mm_and_si128( _mm_adds_epu16( _s_a, _mm_cvtepu8_epi16( _dst_a ) ), _mask1 ) );

        __m128i _out2;
        // compute _out2 the same way using the high quadword of _src and _dst
        //...
        __m128i _ret = _mm_packus_epi16( _out, _out2 );
        _mm_storeu_si128( (__m128i*)( pOut+x ), _ret );
    }
}
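
The elided _out2 computation could be filled in along these lines (my assumption, not the original author's code): _mm_cvtepu8_epi16 only widens the low 8 bytes, so shift the high quadword down first and repeat the same sequence.

// hypothetical sketch of the high-quadword half
__m128i _s_hi   = _mm_cvtepu8_epi16( _mm_srli_si128( _src, 8 ) );
__m128i _d_hi   = _mm_cvtepu8_epi16( _mm_srli_si128( _dst, 8 ) );
__m128i _s_a_hi = _mm_cvtepu8_epi16( _mm_srli_si128( _src_a, 8 ) );
__m128i _d_a_hi = _mm_cvtepu8_epi16( _mm_srli_si128( _one_minus_src_a, 8 ) );
// ...then the same multiply, divide-by-255 and alpha-masking steps
// applied to these four produce _out2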

Combining two 16 bits RGB colors with alpha blending

My (untested) solution: I split the foreground and background colors into (red + blue) and (green) components and multiply them by a 6-bit alpha value. Enjoy! (Only if it works :)

#include <stdint.h>
typedef uint8_t  uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;

//                              rrrrrggggggbbbbb
#define MASK_RB     63519   // 0b1111100000011111
#define MASK_G       2016   // 0b0000011111100000
#define MASK_MUL_RB 4065216 // 0b1111100000011111000000
#define MASK_MUL_G  129024  // 0b0000011111100000000000
#define MAX_ALPHA   64      // 6 bits + 1 with rounding

uint16 alphablend( uint16 fg, uint16 bg, uint8 alpha )
{
    // alpha for foreground multiplication:
    // convert from 8 bit to (6 bit + 1) with rounding;
    // will be in [0..64] inclusive
    alpha = ( alpha + 2 ) >> 2;
    // "beta" for background multiplication (6 bit + 1);
    // will be in [0..64] inclusive
    uint8 beta = MAX_ALPHA - alpha;
    // alpha + beta always sum to 64, so the red+blue and green
    // partial products can never overflow into each other

    return (uint16)((
        ( ( alpha * (uint32)( fg & MASK_RB )
            + beta * (uint32)( bg & MASK_RB )
          ) & MASK_MUL_RB )
        |
        ( ( alpha * ( fg & MASK_G )
            + beta * ( bg & MASK_G )
          ) & MASK_MUL_G )
        ) >> 6 );
}

/*
result masks of the multiplications
uppercase: usable bits of the multiplications

RRRRRrrrrrrBBBBBbbbbbb   // 5+5 bits of red+blue
1111100000011111         // from MASK_RB * 1
1111100000011111000000   // to MASK_RB * MAX_ALPHA  // 22 bits!

-----GGGGGGgggggg-----   // 6 bits of green
0000011111100000         // from MASK_G * 1
0000011111100000000000   // to MASK_G * MAX_ALPHA
*/
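
A quick sanity check (hypothetical usage, RGB565): blending pure red over pure blue at alpha 128 should give roughly half of each.

uint16 red  = 0xF800;                       // 11111 000000 00000
uint16 blue = 0x001F;                       // 00000 000000 11111
uint16 mix  = alphablend( red, blue, 128 ); // 0x780F: r = 15/31, b = 15/31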

Blend mode on a transparent and semi transparent background

The commonly used equation D = C1 * C1a + C2 * (1 - C1a) is a simplification of the general blending equation: it assumes the destination color is opaque and therefore drops the destination color's alpha term. The general form, which handles a semi-transparent destination, is:

D = C1 * C1a + C2 * C2a * (1 - C1a)

where D is the resultant color, C1 is the color of the first element, C1a is the alpha of the first element, C2 is the second element color, C2a is the alpha of the second element. The destination alpha is calculated with:

Da = C1a + C2a * (1 - C1a)

The resultant color is premultiplied with the alpha. To restore the color to the unmultiplied values, just divide by Da, the resultant alpha.
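
A minimal sketch of these two formulas (floating point, channels and alphas in 0..1; my code, not from the original answer):

typedef struct { float r, g, b, a; } RGBAf;

// the "over" operator described above: premultiplied result, then un-premultiply
RGBAf blend_over(RGBAf c1, RGBAf c2)
{
    RGBAf d;
    d.a = c1.a + c2.a * (1.0f - c1.a);               // Da = C1a + C2a * (1 - C1a)
    d.r = c1.r * c1.a + c2.r * c2.a * (1.0f - c1.a); // D, premultiplied by Da
    d.g = c1.g * c1.a + c2.g * c2.a * (1.0f - c1.a);
    d.b = c1.b * c1.a + c2.b * c2.a * (1.0f - c1.a);
    if (d.a > 0.0f) { d.r /= d.a; d.g /= d.a; d.b /= d.a; } // divide by Da to un-premultiply
    return d;
}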

Chrome slightly changes alpha-value of RGBA color after setting it

A thought: RGBA is represented in 32 bits, so there is in fact no such thing as an exact 0.25 in the 8-bit alpha channel. 0.25 * 255 = 63.75 falls between two byte values, and 63 / 255 = 0.247059, so that is the correctly extrapolated value. So is Chrome wrong, or is it in fact correct, while the other browsers give you a number that is not a true representation of what is rendered on the page?

You can then argue that the W3C standard is not entirely correct and that it should only allow values that are exactly representable with an 8-bit alpha. But then it is just a recommendation and not law...

Below is a stripped-down version of Chromium's customized WebKit Color.cpp code that appears to do the color conversion (though I'm no Chromium expert):
http://www.chromium.org/developers/how-tos/getting-around-the-chrome-source-code

sources: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/platform/graphics/Color.cpp

#include <iostream>
#include <algorithm> // std::max, std::min
#include <cmath>     // lroundf

using namespace std;
typedef unsigned RGBA32;

int colorFloatToRGBAByte(float f)
{
    return std::max(0, std::min(static_cast<int>(lroundf(255.0f * f)), 255));
}

RGBA32 makeRGBA32FromFloats(float r, float g, float b, float a)
{
    cout << "Alpha: " << a;
    return colorFloatToRGBAByte(a) << 24 | colorFloatToRGBAByte(r) << 16
         | colorFloatToRGBAByte(g) << 8 | colorFloatToRGBAByte(b);
}

int main()
{
    // note: r/g/b are passed as 255.0f and simply clamp to 255;
    // only the alpha value 0.25f exercises the rounding
    RGBA32 t = makeRGBA32FromFloats(255.0f, 255.0f, 255.0f, 0.25f);
    cout << static_cast<unsigned>(t) << std::endl;
    return 0;
}

