Is a Logical Right Shift by a Power of 2 Faster in Avr

Is a logical right shift by a power of 2 faster in AVR?

Let's look at the datasheet:

http://atmel.com/dyn/resources/prod_documents/8271S.pdf

As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.

Is logical/arithmetic shift fewer bits faster?

It would depend on the hardware implementation. For common operations involving a shift by a constant amount (e.g. pointer arithmetic), there could be a faster path (e.g. it might be fused with a related addition operation). For shifting by a variable, a barrel shifter circuitry is used, where any shift amount would have the same delay.

Which is faster: x1 or x10?

Potentially depends on the CPU.

However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time.

So the bottom line is... no. No difference.

Why are bitwise operators slower than multiplication/division/modulo?

*, %, and / all have fast paths for single-"limb" integers. <<, >>, and & don't. They're going through the general-purpose arbitrary-precision code path.

Bit duplication from 8-bit to 32-bit

It's simple - solve the simplest case, then do more complex ones.

Case 1: Duplicate 1 bit into an 4 bit value (the simplest).

+---+---------+
| 0 | _ _ _ A |
+---+---------+
| 1 | A A A A |
+---+---------+

This can be done as a simple set of shifts:

x = (x << 0) | (x << 1) | (x << 2) | (x << 3);

Or in a less obvious but faster way:

x = (x << 4) - x;

This step will be the last step in all following cases.

Case 2: Duplicate 2 bits into an 8 bit value.

+---+---------+---------+
| 0 | _ _ _ _ | _ _ A B |
+---+---------+---------+
| 1 | _ _ _ A | _ _ _ B |
+---+---------+---------+
| 2 | A A A A | B B B B |
+---+---------+---------+

Case 3: Duplicate 4 bits into a 16 bit value. How? Just move 2 bits to the upper part to turn it into the case 1! Divide and conquer!

+---+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D |
+---+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D |
+---+---------+---------+---------+---------+
| 2 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D |
+---+---------+---------+---------+---------+
| 3 | A A A A | B B B B | C C C C | D D D D |
+---+---------+---------+---------+---------+

Case 4: Duplicate 8 bits into a 32 bit value (the original).

+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | _ _ _ _ | _ _ _ _ | _ _ _ _ | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 2 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D | _ _ _ _ | _ _ E F | _ _ _ _ | _ _ G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 3 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D | _ _ _ E | _ _ _ F | _ _ _ G | _ _ _ H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 4 | A A A A | B B B B | C C C C | D D D D | E E E E | F F F F | G G G G | H H H H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+

Can be achieved by the code below:

uint32_t interleave(uint8_t value)
{
    uint32_t x = value;
    x = (x | (x << 12)) /* & 0x000F000F */; // GCC is not able to remove redundant & here
    x = (x | (x <<  6)) & 0x03030303;
    x = (x | (x <<  3)) & 0x11111111;
    x = (x << 4) - x;
    return x;
}

Some test cases to check that it works:

TEST_F(test, interleave)
{
    EXPECT_EQ(interleave(0x00), 0x00000000);
    EXPECT_EQ(interleave(0x11), 0x000F000F);
    EXPECT_EQ(interleave(0x22), 0x00F000F0);
    EXPECT_EQ(interleave(0x33), 0x00FF00FF);
    EXPECT_EQ(interleave(0x44), 0x0F000F00);
    EXPECT_EQ(interleave(0x55), 0x0F0F0F0F);
    EXPECT_EQ(interleave(0x66), 0x0FF00FF0);
    EXPECT_EQ(interleave(0x77), 0x0FFF0FFF);
    EXPECT_EQ(interleave(0x88), 0xF000F000);
    EXPECT_EQ(interleave(0x99), 0xF00FF00F);
    EXPECT_EQ(interleave(0xAA), 0xF0F0F0F0);
    EXPECT_EQ(interleave(0xBB), 0xF0FFF0FF);
    EXPECT_EQ(interleave(0xCC), 0xFF00FF00);
    EXPECT_EQ(interleave(0xDD), 0xFF0FFF0F);
    EXPECT_EQ(interleave(0xEE), 0xFFF0FFF0);
    EXPECT_EQ(interleave(0xFF), 0xFFFFFFFF);

    EXPECT_EQ(interleave(0x01), 0x0000000F);
    EXPECT_EQ(interleave(0x23), 0x00F000FF);
    EXPECT_EQ(interleave(0x45), 0x0F000F0F);
    EXPECT_EQ(interleave(0x67), 0x0FF00FFF);
    EXPECT_EQ(interleave(0x89), 0xF000F00F);
    EXPECT_EQ(interleave(0xAB), 0xF0F0F0FF);
    EXPECT_EQ(interleave(0xCD), 0xFF00FF0F);
    EXPECT_EQ(interleave(0xEF), 0xFFF0FFFF);
}

80286: Which is the fastest way to multiply by 10?

The 80286 did not have a barrel shifter, that was introduced with the 80386. According to the timing tables in the Microsoft Macro Assembler 5.0 documentation (1987), SHL reg, immed8 takes 5+n cycles, whereas SHL reg, 1 takes 2 cycles. ADD reg, reg takes 2 cycles, as does MOV reg, reg. IMUL reg16, immed takes 21 cycles. Therefore, the fastest way to multiply by ten would appear to be:

           ;       // cycles
shl ax, 1  ; *2    // 2
mov bx, ax ; *2    // 4
shl ax, 1  ; *4    // 6
shl ax, 1  ; *8    // 8
add ax, bx ; *10   // 10

or, alternatively:

           ;      // cycles
mov bx, ax ; *1   // 2
shl ax, 1  ; *2   // 4
shl ax, 1  ; *4   // 6
add ax, bx ; *5   // 8
shl ax, 1  ; *10  // 10

Ten cycles either way.

AVR bitwise C operations

for the sake of the demonstration:

here are two code compiled with avr-gcc-4.7.2:

void main() {
    DDRC |= (1<<5);
}

and another:

void main() {
    DDRC |= 0b100000;
}

 % diff -s t2.s t.s
Files t2.s and t.s are identical

that's because 1<<N is detected at compile time, and transform to its constant equivalent, making both expressions identical when sent to the microcontroller.

About operations, please have a look at truth tables:

| a b -> a&b |        | a b -> a|b |
| 0 0    0   |        | 0 0    0   |
| 0 1    0   |        | 0 1    1   |
| 1 0    0   |        | 1 0    1   |
| 1 1    1   |        | 1 1    1   |

the hint to remember both truth tables is the following:

if one of the operands is a 0 and you're doing a &, then the result will be 0 (force to 0)
if one of the operands is a 1 and you're doing a |, then the result will be 1 (force to 1)

So if you take an example a bit more complicated:

101010 | 010101 = 111111
101010 & 010101 = 000000

and finaly, when you want to set a bit:

REGISTER = 00000001
REGISTER |= 1<<5      <=> REGISTER = 00000001 | 00100000
REGISTER == 00100001

if you want to reset that bit:

REGISTER &= ~(1<<5)   <=> REGISTER = 00100001 & ~(00100000)  <=> REGISTER = 00100001 & 11011111
REGISTER == 00000001

I hope it's making more sense… Though you'd better lookup for a course on combinatory logics, which is the basic to perfectly handle when doing embedded programming.

Now to answer your question:

if we know anyway a bit combination to simply write and maybe its faster and simpler?:

it's not necessarily faster, and not really simpler.

Consider the following made up register FOO:

  7   6   5   4   3   2   1   0
[ A | B | C | D | W | X | Y | Z ]

now consider that we have build a header that has preprocessor variables with the right values:

FOOZ = 0
FOOY = 1
FOOX = 2
FOOW = 3
FOOD = 4
FOOC = 5
FOOB = 6
FOOA = 7

and now we need to set up A, C and 'X' bits, which can be done as follows:

FOO |= 1<<FOOA | 1<<FOOC | 1<<FOOX

instead of:

FOO |= 0b10100100

which could more easily lead to errors.

Is a Logical Right Shift by a Power of 2 Faster in Avr