Is a logical right shift by a power of 2 faster in AVR?
Let's look at the datasheet:
http://atmel.com/dyn/resources/prod_documents/8271S.pdf
As far as I can see, the ASR (arithmetic shift right) always shifts by one bit and cannot take the number of bits to shift; it takes one cycle to execute. Therefore, shifting right by n bits will take n cycles. Powers of two behave just the same as any other number.
Is logical/arithmetic shift fewer bits faster?
It would depend on the hardware implementation. For common operations involving a shift by a constant amount (e.g. pointer arithmetic), there could be a faster path (e.g. it might be fused with a related addition operation). For shifting by a variable, a barrel shifter circuitry is used, where any shift amount would have the same delay.
Which is faster: x1 or x10?
Potentially depends on the CPU.
However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time.
So the bottom line is... no. No difference.
Why are bitwise operators slower than multiplication/division/modulo?
*
, %
, and /
all have fast paths for single-"limb" integers. <<
, >>
, and &
don't. They're going through the general-purpose arbitrary-precision code path.
Bit duplication from 8-bit to 32-bit
It's simple - solve the simplest case, then do more complex ones.
Case 1: Duplicate 1 bit into an 4 bit value (the simplest).
+---+---------+
| 0 | _ _ _ A |
+---+---------+
| 1 | A A A A |
+---+---------+
This can be done as a simple set of shifts:
x = (x << 0) | (x << 1) | (x << 2) | (x << 3);
Or in a less obvious but faster way:
x = (x << 4) - x;
This step will be the last step in all following cases.
Case 2: Duplicate 2 bits into an 8 bit value.
+---+---------+---------+
| 0 | _ _ _ _ | _ _ A B |
+---+---------+---------+
| 1 | _ _ _ A | _ _ _ B |
+---+---------+---------+
| 2 | A A A A | B B B B |
+---+---------+---------+
Case 3: Duplicate 4 bits into a 16 bit value. How? Just move 2 bits to the upper part to turn it into the case 1! Divide and conquer!
+---+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D |
+---+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D |
+---+---------+---------+---------+---------+
| 2 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D |
+---+---------+---------+---------+---------+
| 3 | A A A A | B B B B | C C C C | D D D D |
+---+---------+---------+---------+---------+
Case 4: Duplicate 8 bits into a 32 bit value (the original).
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | _ _ _ _ | _ _ _ _ | _ _ _ _ | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 2 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D | _ _ _ _ | _ _ E F | _ _ _ _ | _ _ G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 3 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D | _ _ _ E | _ _ _ F | _ _ _ G | _ _ _ H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 4 | A A A A | B B B B | C C C C | D D D D | E E E E | F F F F | G G G G | H H H H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
Can be achieved by the code below:
uint32_t interleave(uint8_t value)
{
uint32_t x = value;
x = (x | (x << 12)) /* & 0x000F000F */; // GCC is not able to remove redundant & here
x = (x | (x << 6)) & 0x03030303;
x = (x | (x << 3)) & 0x11111111;
x = (x << 4) - x;
return x;
}
Some test cases to check that it works:
TEST_F(test, interleave)
{
EXPECT_EQ(interleave(0x00), 0x00000000);
EXPECT_EQ(interleave(0x11), 0x000F000F);
EXPECT_EQ(interleave(0x22), 0x00F000F0);
EXPECT_EQ(interleave(0x33), 0x00FF00FF);
EXPECT_EQ(interleave(0x44), 0x0F000F00);
EXPECT_EQ(interleave(0x55), 0x0F0F0F0F);
EXPECT_EQ(interleave(0x66), 0x0FF00FF0);
EXPECT_EQ(interleave(0x77), 0x0FFF0FFF);
EXPECT_EQ(interleave(0x88), 0xF000F000);
EXPECT_EQ(interleave(0x99), 0xF00FF00F);
EXPECT_EQ(interleave(0xAA), 0xF0F0F0F0);
EXPECT_EQ(interleave(0xBB), 0xF0FFF0FF);
EXPECT_EQ(interleave(0xCC), 0xFF00FF00);
EXPECT_EQ(interleave(0xDD), 0xFF0FFF0F);
EXPECT_EQ(interleave(0xEE), 0xFFF0FFF0);
EXPECT_EQ(interleave(0xFF), 0xFFFFFFFF);
EXPECT_EQ(interleave(0x01), 0x0000000F);
EXPECT_EQ(interleave(0x23), 0x00F000FF);
EXPECT_EQ(interleave(0x45), 0x0F000F0F);
EXPECT_EQ(interleave(0x67), 0x0FF00FFF);
EXPECT_EQ(interleave(0x89), 0xF000F00F);
EXPECT_EQ(interleave(0xAB), 0xF0F0F0FF);
EXPECT_EQ(interleave(0xCD), 0xFF00FF0F);
EXPECT_EQ(interleave(0xEF), 0xFFF0FFFF);
}
80286: Which is the fastest way to multiply by 10?
The 80286 did not have a barrel shifter, that was introduced with the 80386. According to the timing tables in the Microsoft Macro Assembler 5.0 documentation (1987), SHL reg, immed8 takes 5+n cycles, whereas SHL reg, 1 takes 2 cycles. ADD reg, reg takes 2 cycles, as does MOV reg, reg. IMUL reg16, immed takes 21 cycles. Therefore, the fastest way to multiply by ten would appear to be:
; // cycles
shl ax, 1 ; *2 // 2
mov bx, ax ; *2 // 4
shl ax, 1 ; *4 // 6
shl ax, 1 ; *8 // 8
add ax, bx ; *10 // 10
or, alternatively:
; // cycles
mov bx, ax ; *1 // 2
shl ax, 1 ; *2 // 4
shl ax, 1 ; *4 // 6
add ax, bx ; *5 // 8
shl ax, 1 ; *10 // 10
Ten cycles either way.
AVR bitwise C operations
for the sake of the demonstration:
here are two code compiled with avr-gcc-4.7.2:
void main() {
DDRC |= (1<<5);
}
and another:
void main() {
DDRC |= 0b100000;
}
% diff -s t2.s t.s
Files t2.s and t.s are identical
that's because 1<<N
is detected at compile time, and transform to its constant equivalent, making both expressions identical when sent to the microcontroller.
About operations, please have a look at truth tables:
| a b -> a&b | | a b -> a|b |
| 0 0 0 | | 0 0 0 |
| 0 1 0 | | 0 1 1 |
| 1 0 0 | | 1 0 1 |
| 1 1 1 | | 1 1 1 |
the hint to remember both truth tables is the following:
- if one of the operands is a
0
and you're doing a&
, then the result will be0
(force to 0) - if one of the operands is a
1
and you're doing a|
, then the result will be1
(force to 1)
So if you take an example a bit more complicated:
101010 | 010101 = 111111
101010 & 010101 = 000000
and finaly, when you want to set a bit:
REGISTER = 00000001
REGISTER |= 1<<5 <=> REGISTER = 00000001 | 00100000
REGISTER == 00100001
if you want to reset that bit:
REGISTER &= ~(1<<5) <=> REGISTER = 00100001 & ~(00100000) <=> REGISTER = 00100001 & 11011111
REGISTER == 00000001
I hope it's making more sense… Though you'd better lookup for a course on combinatory logics, which is the basic to perfectly handle when doing embedded programming.
Now to answer your question:
if we know anyway a bit combination to simply write and maybe its faster and simpler?:
it's not necessarily faster, and not really simpler.
Consider the following made up register FOO:
7 6 5 4 3 2 1 0
[ A | B | C | D | W | X | Y | Z ]
now consider that we have build a header that has preprocessor variables with the right values:
FOOZ = 0
FOOY = 1
FOOX = 2
FOOW = 3
FOOD = 4
FOOC = 5
FOOB = 6
FOOA = 7
and now we need to set up A
, C
and 'X' bits, which can be done as follows:
FOO |= 1<<FOOA | 1<<FOOC | 1<<FOOX
instead of:
FOO |= 0b10100100
which could more easily lead to errors.
Related Topics
How Performing Multiple Matrix Multiplications in Cuda
What Is C++ Version of Realloc(), to Allocate the New Buffer and Copy the Contents from the Old One
Can Raw Pointers Be Used Instead of Iterators with Stl Algorithms for Containers with Linear Storage
Stopping the Debugger When a Nan Floating Point Number Is Produced
Execute C++ from String Variable
How to Show Enter Password in the Form of Asterisks(*) on Terminal
How to Use Dynamic Name for Variables in C++
High Delay in Rs232 Communication on a Pxa270
Comparing Character Arrays and String Literals in C++
When Instantiating a Template, Should Members of Its Incomplete Argument Types Be Visible
Operator Overload Which Permits Capturing with Rvalue But Not Assigning To
Union for Uint32_T and Uint8_T[4] Undefined Behavior
Clang Doesn't See Basic Headers
How to Make Template Rvalue Reference Parameter Only Bind to Rvalue Reference