How to Do Memory Test on Arm Architecture Hardware? (Something Like Memtest86)

Unexpected DRAM behavior when MCU stopped at breakpoint

I eventually found an answer to the issue.

As it turned out, the DRAM setup defined wrong address pin. The manufacturer provided a recommended value for the register defining this bit, but this turned out to be wrong.

For future googlers: The Vybrid Reference Manual Revisions 7 & 8 recommends DDRMC_CR73[APREBIT]=0xA for both DDR3 and LPDDR2. This assigns DDR_A10/CA10 as the address pin. The JEDEC standard specifies DDR_A0/CA0 as the address pin, thus the correct value should be: DDRMC_CR73[APREBIT]=0x0.

Huge output binary file produced

of course it does, you asked it to convert the file to a binary right? So even if you have only a single byte of data in the data segment you need to have (0x10000000-0x00900000)+(amount of data) bytes in the .bin file...the raw binary file format does not have any address information therefore it needs to cover all loadable segments and pad with fill in between, so you get that size of a file (0x10000000-0x00900000)+(amount of data). Do an experiment and change your 0x10000000 to 0x20000000 and you will see a file more than twice the size.

If you use elf or coff or intel hex or srec or some binary format that includes addresses as well as data and does not require padding between the segments in the file, then your files wills stay small...

Quickly find whether a value is present in a C array?

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: Loop unrolling, cache prefetch and making use of the multiple load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

stmfd sp!,{r4-r11} ; save non-volatile registers
mov r3,r0,LSR #3 ; loop count = total count / 8
pld [r1,#128]
ldmia r1!,{r4-r7} ; pre load first set
loop_top:
pld [r1,#128]
ldmia r1!,{r8-r11} ; pre load second set
cmp r4,r2 ; search for match
cmpne r5,r2 ; use conditional execution to avoid extra branch instructions
cmpne r6,r2
cmpne r7,r2
beq found_it
ldmia r1!,{r4-r7} ; use 2 sets of registers to hide load delays
cmp r8,r2
cmpne r9,r2
cmpne r10,r2
cmpne r11,r2
beq found_it
subs r3,r3,#1 ; decrement loop count
bne loop_top
mov r0,#0 ; return value = false (not found)
ldmia sp!,{r4-r11} ; restore non-volatile registers
bx lr ; return
found_it:
mov r0,#1 ; return true
ldmia sp!,{r4-r11}
bx lr

Update:
There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
beq .L8
.L3: ldr r2, [r3, #4]!
cmp r2, r1
bne .L9
mov r0, #1
.L2: add sp, sp, #1024
bx lr
.L8: mov r0, #0
b .L2

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

Update 2:
I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
ldr r3,[r1],#4
cmp r3,r2
beq true_exit
subs r0,r0,#1
bne loop_top
false_exit: xxx
bx lr
true_exit: xxx
bx lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.

Update 3:
I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT 297ns 562ns
Surface RT 2 172ns 296ns

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

How to find the size of the L1 cache line size with IO timing measurements?

Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.

Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.

Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.

When n == your cache line size, each access will require reading a new line into the L1 cache. So the benchmark results should get slower quite sharply at that point.

If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.



Related Topics



Leave a reply



Submit