Loops make Less run out of memory (structure optimization)

A partial answer: there is a way to get rid of loops. Color and icon data can be defined as mixin calls; here's a mini example:

.colors() {
    // "array" of colors
    .-(Maroon, #800000);
    .-(Red,    #FF0000);
    .-(Orange, #FFA500);
    .-(Yellow, #FFFF00);
}

// generate color classes:

.some-prefix {
    .colors();
    .-(@name, @color) {
        &.@{name} {
            color: contrast(@color);
            background: @color;
            // etc.
        }
    }
}

Same for icons...

While loop vs. for loop: which is more memory efficient?

There is no difference in memory use between the for and while language constructs. A given compiler might optimize one or the other better, but you should not worry about this. Under memory-constrained conditions, the most important factors are your data structures and system design; a poor system design can require far more memory.
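To illustrate (a minimal sketch, not from the original answer): any for loop can be rewritten as a while loop with identical behaviour and identical memory use; the difference is purely syntactic.

#include <iostream>

int main() {
    // A for loop...
    for (int i = 0; i < 4; ++i)
        std::cout << i << ' ';

    // ...and the equivalent while loop: same variables, same memory use.
    int j = 0;
    while (j < 4) {
        std::cout << j << ' ';
        ++j;
    }
    std::cout << '\n';
    return 0;
}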

Why are elementwise additions much faster in separate loops than in a combined loop?

Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four-pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly about how you are allocating your arrays, they are likely to have the same alignment relative to a page boundary.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT: It does in fact look like you are allocating all the arrays separately.
Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.
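(A quick way to check this on your own system, as a side sketch rather than part of the test code below: print each allocation's offset within its 4 KiB page; large separate mallocs will often report the same offset.)

#include <cstdint>
#include <cstdlib>
#include <iostream>

int main() {
    const std::size_t n = 100000;
    void *ptrs[4];
    for (int k = 0; k < 4; ++k) {
        ptrs[k] = std::malloc(n * sizeof(double));
        // Offset of the returned pointer within its 4 KiB page.
        std::cout << "offset in page: "
                  << (reinterpret_cast<std::uintptr_t>(ptrs[k]) & 0xFFF) << "\n";
    }
    for (void *p : ptrs) std::free(p);
    return 0;
}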

Here's the test code:

#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
using namespace std;

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    // Zero the data to prevent any chance of denormals.
    memset(a1, 0, n * sizeof(double));
    memset(b1, 0, n * sizeof(double));
    memset(c1, 0, n * sizeof(double));
    memset(d1, 0, n * sizeof(double));

    // Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#ifdef ONE_LOOP
        for (int j = 0; j < n; j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for (int j = 0; j < n; j++){
            a1[j] += b1[j];
        }
        for (int j = 0; j < n; j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

Observations:

  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both versions are faster. Furthermore, the two-loop version is now the slower one, as you would normally expect.

As @Stephen Cannon points out in the comments, there is a very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/~amplifierxe/pmw_dp/events/partial_address_alias.html



5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data size increases, the amount of relative overhead goes down and the performance "saturates". The two-loop version is slower here because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still have an effect, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single-loop version is what we are observing here. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in the cache. So you're bound by memory bandwidth.


Benchmark plots (not reproduced here) cover three machines: 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz, Intel Core i7 870 @ 2.8 GHz, and Intel Core i7 2600K @ 4.4 GHz.

How to optimize for writes to memory in a hot loop

Do your counts truly need to be qwords, or could you use a narrower type, cutting your cache footprint in half with 32-bit counts (or even further with narrower types)? If you're getting cache misses, that's going to mean much more time spent waiting for loads/stores if out-of-order exec can't hide that latency.
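For illustration (a C++ sketch with assumed names, not the OP's assembly): with 256 buckets per radix pass, 32-bit counts halve the metadata footprint from 2 KiB to 1 KiB, making it more likely to stay hot in L1d while the payload streams through. The narrower type is safe as long as no single bucket can ever hold 2^32 or more elements.

#include <cstdint>
#include <cstddef>

// Hypothetical histogram for one radix-sort pass over 8-bit digits.
// 256 x uint64_t = 2 KiB of counts; 256 x uint32_t = 1 KiB.
// `counts` must be zero-initialized by the caller.
void histogram(const uint64_t *arr, std::size_t n, unsigned shift, uint32_t counts[256]) {
    for (std::size_t i = 0; i < n; ++i)
        ++counts[(arr[i] >> shift) & 0xFF];
}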

I guess copying the data around is going to account for most of the memory bandwidth / cache misses, though. This looks like Radix Sort, and the amount of metadata to manage is small compared to the data. (But having at least the metadata hit in cache can help, making it more resistant to eviction by all the other data you're throwing around.)


No matter what you do, the access pattern of Radix Sort is inherently not very cache-friendly, although it's not terrible. You're scattering writes across 256 possible destinations, along with updating the pointers. But those 256 destinations are sequential streams so they can hit in L1d cache if you're lucky.

Hopefully those destinations aren't multiples of 4k apart (initially or most of the time), otherwise they'll alias the same line in L1d cache and cause conflict misses. (i.e. force eviction of another partially-written cache line that you're soon going to write to.)
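One possible mitigation, as a sketch under assumptions rather than something from the answer: when carving the 256 output regions out of one big buffer, pad each region by a cache line so consecutive bucket starts don't sit an exact multiple of 4 KiB apart.

#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical layout helper (names are made up): `out` must hold
// 256 * (max_per_bucket + pad_elems) elements. Each bucket gets one extra
// cache line of padding, so even when max_per_bucket * 8 is a multiple of
// 4 KiB, only a few bucket starts share the same 4 KiB offset instead of all 256.
std::vector<uint64_t*> bucket_starts(uint64_t *out, std::size_t max_per_bucket) {
    const std::size_t pad_elems = 64 / sizeof(uint64_t);   // one 64-byte cache line
    std::vector<uint64_t*> starts(256);
    for (std::size_t b = 0; b < 256; ++b)
        starts[b] = out + b * (max_per_bucket + pad_elems); // staggered starts
    return starts;
}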


You have some redundant loads / stores which may be a bottleneck for load/store execution units, but if that's not the bottleneck then cache will absorb them just fine. This section is mostly about tuning the loop to use fewer uops, improving things in the no-cache-miss best case, and giving OoO exec less latency to hide.

Using a memory-destination dec and then reloading the dec's store seems obviously inefficient in terms of total back-end load/store operations, and latency for OoO exec to hide. (Although on AMD, dec mem is still a single uop for the front-end, vs. 3 on Intel; https://uops.info/ and https://agner.org/optimize/).

Similarly, aren't you loading [rdx + rdi * 8] ; arr[i] twice, with the same RDI? SHRX can copy-and-shift so you wouldn't even be saving uops by keeping around that load result for later. (You could also use a simple non-indexed addressing mode for arr[i], by doing a pointer-increment like add rdi,8 and cmp rdi, endp / jne, where endp is something you calculated earlier with lea endp, [rdx + size*8]. Looping forward over an array can be better for some HW prefetchers.)

x86-64 has 15 general-purpose registers (not counting the stack pointer), so if you need more for this inner loop, save/restore some call-preserved registers (like RBX or RBP) at the top/bottom of the function. Or spill some outer-loop stuff to memory if necessary.

mov r9, %1 looks loop-invariant, so hoist that shl r9, 3 calculation out of the loop, and don't overwrite R9 inside the inner loop.


You do need to zero-extend the byte you extracted, but and rsi, 255 is not as efficient as movzx eax, sil. (Or better, pick a register like ECX whose low byte can be accessed without a REX prefix.) AMD can't do mov-elimination on movzx the way Intel can, though, so on AMD this just saves code size, but it also optimizes latency if you ever run this on Intel CPUs.

Or better, AMD Zen has single-uop BMI1 bextr r64,r64,r64, so prepare a start/length pair in the low 2 bytes of a register. As discussed, that's loop invariant, i.e. before the loop: mov ecx, %k1 / shl cl, 3 / mov ch, 0x8. (AMD doesn't have partial-register stalls, just false dependencies. In this case the dependency is a true one, since we want to merge.) If that's inline asm syntax, %k1 specifies the 32-bit version of the register. Or if it's memory, you're just loading anyway, and hoisting it will save another load!

(Intel has 2-uop bextr, presumably shift and bzhi uops.)

Or if you really want to load twice, movzx esi, byte [rdi + rdx] where RDI is a pointer to arr[i] that you increment or decrement, and RDX is a byte offset. But probably BEXTR is a better bet.

Java optimization to prevent heapspace out of memory

You can try setting -XX:NewSize and -XX:MaxNewSize to 40% of your -Xmx (for example, with -Xmx1g that would be roughly -XX:NewSize=410m -XX:MaxNewSize=410m).
These parameters speed up the GC, because your object-creation rate is high.

For more help, check here.

Why does this function run so much faster when it makes an extra read of memory?

What

So @Peter Cordes was along the right lines in the comments: this is caused by stalls due to attempting to read a memory location less than three cycles after writing to it.

Here's a smaller program which produces the same effects, basically a minimal example of the program above:

    .bss
    .align 256
a:
    .zero 4

    .text
    .p2align 8, 0
    .globl _start
_start:
    mov $0x80000000, %ecx
1:
    mov a(%rip), %edx
    mov %edx, a(%rip)
    .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
    add $1, %ecx
    jno 1b

    mov $60, %rax
    mov $0, %rdi
    syscall

The eight no-op instructions inside the loop body slow the loop down to the point where it runs at exactly three cycles per iteration, causing the loop to run in around 3½ seconds (the exact time depends on frequency scaling). If you remove them, the loop takes more like five seconds to run.

Why

Looking at the small program above

That answers the question of what the trigger is, but it still doesn't really explain what's actually going on to cause the issue, or why it creates such a large performance deviation. The key clue to this was looking at the number of µops dispatched on each port. Here are the numbers for the minimized program above, together with a few other interesting statistics:

          slow            fast
 1,156,849,106   1,426,795,253  uops_dispatched_port.port_0:u
 1,386,792,998   1,432,967,110  uops_dispatched_port.port_1:u
 3,635,449,011   3,582,057,598  uops_dispatched_port.port_2:u
 4,169,272,248   5,003,425,117  uops_dispatched_port.port_3:u
18,538,282,061   6,427,027,115  uops_dispatched_port.port_4:u
 1,763,689,624   1,435,813,046  uops_dispatched_port.port_5:u
 4,299,495,754   4,286,287,760  uops_dispatched_port.port_6:u
   784,337,695      12,429,585  uops_dispatched_port.port_7:u
19,393,649,837  12,969,389,151  cpu-cycles:u
17,197,960,860  51,654,810,808  uops_issued.any:u
21,677,912,988  21,641,546,006  uops_executed.core:u
   5.245720659     3.491356466  seconds time elapsed
   5.220800000     3.483496000  seconds user

The only difference between the two programs is that the "fast" program is running an extra eight nops per iteration; with 2^32 iterations, that's an extra 34,359,738,368 nops. At the µop level, a nop is a µop that gets issued and then just disappears, so to get a fair comparison between the two programs, we can subtract 34,359,738,368 from 51,654,810,808 to get an "adjusted" count of 17,295,072,440 µops issued for the "fast" program.

This reveals some pretty strange results when it comes to the µop lifecycle. After removing the nops, the number of µops issued is pretty much the same between the two programs – unsurprising, because they have the same code in that case. The number of µops executed is also pretty much the same, which is again unsurprising. (The issue and execution counts don't match each other because µops have a tendency to be split up, fused, rearranged etc. during execution, but this should affect both programs in the same way.)

On the other hand, the µop dispatch counts don't seem to match at all, which initially seems like it should be impossible. The trick is that, despite the naming, the performance counters for dispatch aren't counting µops directly – rather, they're counting the number of cycles that the port is busy handling a µop. So what the crazy mismatch on port 4 is showing is that the µops are getting "stuck" in port 4 and taking a while to execute (around 3 cycles, based on the statistics).

On the processor I'm using, port 4 is the only execution port capable of doing stores to memory (a store MOV takes up two ports: port 2, 3, or 7 calculates the address, and port 4 does the actual storing). The mental model I originally had of the processor oddity in question was that reading from a memory location less than 3 cycles after storing to it would simply make the read take longer than normal. However, the actual effects seem to be worse than that: the processor's behaviour appears to be that if you attempt to read from a memory location less than 3 cycles after storing to it, all writes on the processor thread get blocked for 3 cycles.

Looking at the program in the question

Now let's look at the program in "Further experiments #2" above, in terms of port usage (a side note: for some reason I had to add an extra three nops to get the "fast" behaviour, so maybe there's been some extra optimisation in a microcode update):

          slow             fast
 8,881,288,565    3,278,012,510  uops_dispatched_port.port_0:u
 8,852,984,797    4,992,142,029  uops_dispatched_port.port_1:u
10,385,944,986   10,726,029,245  uops_dispatched_port.port_2:u
11,048,867,603   10,721,449,693  uops_dispatched_port.port_3:u
25,015,457,606   11,608,396,673  uops_dispatched_port.port_4:u
 9,255,886,026    5,102,583,196  uops_dispatched_port.port_5:u
11,265,935,146    7,295,609,724  uops_dispatched_port.port_6:u
 4,333,803,057    4,304,364,919  uops_dispatched_port.port_7:u
42,554,124,800   26,024,822,032  cpu-cycles:u
38,686,833,876  103,170,056,257  uops_issued.any:u
52,058,581,194   51,940,941,322  uops_executed.core:u
  11.504080112      7.024786187  seconds time elapsed
  11.488620000      7.018980000  seconds user

With fifteen nops being issued per iteration in the "fast" version, the number of non-nop µops issued there is 38,745,546,817. So we're seeing much the same results: not counting the nops, the µop issue and execution rate are the same between the two programs, but some µops are taking much longer to execute in the slow version and tying up the ports.

There's one small difference: the instruction that's reading memory too soon after writing it is now the read-modify-write instruction or $1, a(%rip). The write half of that needs to use both port 4 to do the writing, and a port capable of doing or (0, 1, 5, or 6) to calculate the value to write. So we now see that ports 0, 1, 5, and 6 are all showing as having stuck µops – the instruction is getting stuck due to the early read, and tying up its ALU port in addition to port 4 during the three-cycle "I can't write" period. The main consequence of this appears to be that the rest of the program slows down a little too; my guess at this (which is consistent with the numbers above but might be wrong anyway) is that when port 6 ends up getting stuck, the loop logic has to wait, because port 6 is the only port that can handle the add $1, %ebx; jno 1b macro-fused instruction required to return to the start of the loop. So the program is being slowed by 3 clock cycles per loop iteration, and sometimes more because the loop logic is jammed too.

The interesting effect, though, is how large the performance deviations are. With the small program above, port 4 was stuck for 3 cycles per loop iteration, but that didn't slow down the loop as a whole by that much, because port 4 throughput wasn't the limiting factor. With this more complex program, though, port 4 being stuck for 3 cycles per loop iteration does seem to slow down every loop iteration by 3 cycles (and a little more, perhaps due to port 6 being stuck too), so the port 4 jam is having a close to maximal effect on the program as a whole. It's still not entirely obvious why port 4 being jammed for 3 cycles is causing the program as a whole to stall for 3 cycles (this program isn't obviously limited on port 4 throughput either), but it's at least plausible that an effect like this could be occurring.

Anyway, to answer the main practical question in the OP, "In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?": the rule is "on this processor, reading a memory address less than three cycles after writing it will jam write instructions for 3 cycles, which will in turn slow down your program by up to 3 cycles, maybe even a little more; you can avoid it by preventing reads that happen that quickly after a write". Unfortunately, with out-of-order processors, it can be hard to know when the processor will attempt to run instructions, so it can be hard to guarantee that two instructions won't run too close together. For programs like the one I was initially working on, the main piece of practical advice is most likely to avoid repeatedly reading and writing the same memory address in a tight loop, as opposed to storing the value in a register and repeatedly reading and writing that (the original assembly code only ended up looking the way it did because I was calling a function through a function pointer, meaning that the compiler couldn't safely make this optimisation).
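As a concrete illustration of that last piece of advice (a hypothetical C++ sketch, not the code from the question): keep the running value in a local variable so the hot loop works on a register, and only touch the shared memory location once at the end. A compiler will usually do this itself, unless, as in the function-pointer case above, it can't prove the transformation is safe.

#include <cstddef>

// Load/store of *counter on every iteration: the pattern the answer warns about.
void count_through_pointer(long *counter, const int *data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 0) ++*counter;   // read-modify-write of the same address each time
}

// Accumulate in a register-resident local; a single store at the end.
void count_in_register(long *counter, const int *data, std::size_t n) {
    long local = *counter;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 0) ++local;
    *counter = local;
}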

Memory leakage in php with three for loops

I modified the code a little to free as much memory as I could see being freeable.
I've added a comment above each modification; the added comments start with "#" so you can find them more easily.
Not related to this question, but worth mentioning: your database insertion code is vulnerable to SQL injection.

<?php
require("simple_html_dom.php");
$thelink = "http://www.somelink.co.uk";

# do not keep raw contents of the file on memory
#$data1 = file_get_contents($thelink);
#$html1 = str_get_html($data1);
$html1 = str_get_html(file_get_contents($thelink));

$ret1 = $html1->find('#idResults2');

// first inception level, we know page has only links
if (!$ret1){
$es1 = $html1->find('table.litab a');

# free $html1, not used anymore
unset($html1);

$countlinks1 = 0;
foreach ($es1 as $aa1) {
$links1[$countlinks1] = $aa1->href;
$countlinks1++;
// echo (addslashes($aa->href));
}

# free memory used by $es1, not used anymore
unset($es1);

//for every link in array do the same

for ($i = 0; $i < $countlinks1; $i++) {
# do not keep raw contents of the file on memory
#$data2 = file_get_contents($links1[$i]);
#$html2 = str_get_html($data2);
$html2 = str_get_html(file_get_contents($links1[$i]));

$ret2 = $html2->find('#idResults2');

// if got information then send to DB
if ($ret2){
pullInfo($html2);
} else {
// continue inception

$es2 = $html2->find('table.litab a');

# free memory used by $html2, not used anymore.
# we would unset it at the end of the loop.
$html2 = null;

$countlinks2 = 0;
foreach ($es2 as $aa2) {
$links2[$countlinks2] = $aa2->href;
$countlinks2++;
}

# free memory used by $es2
unset($es2);

for ($j = 0; $j < $countlinks2; $j++) {
# do not keep raw contents of the file on memory
#$data3 = file_get_contents($links2[$j]);
#$html3 = str_get_html($data3);
$html3 = str_get_html(file_get_contents($links2[$j]));
$ret3 = $html3->find('#idResults2');
// if got information then send to DB
if ($ret3){
pullInfo($html3);
}

# free memory used by $html3, or on the last iteration the memory would not get freed
unset($html3);
}
}

# free memory used by $html2, or on the last iteration the memory would not get freed
unset($html2);
}
}

function pullInfo($html)
{
$tds = $html->find('td');
$count =0;
foreach ($tds as $td) {
$count++;
if ($count==1){
$name = addslashes($td->innertext);
}
if ($count==2){
$address = addslashes($td->innertext);
}
if ($count==3){
$number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
}

}

# check for available data:
if ($count) {
# free $tds and $td
unset($tds, $td);

mysql_query("INSERT INTO people (name, area, phone)
VALUES ('$name', '$address', '$number')");
}

}

Update:

You could trace your memory usage to see how much memory is being used in each section of your code. This can be done by calling memory_get_usage() and saving the result to a file, for example by placing the code below at the end of each of your loops, or before creating objects or calling heavy methods:

file_put_contents('memory.log', 'memory used in line ' . __LINE__ . ' is: ' . memory_get_usage() . PHP_EOL, FILE_APPEND);

So you could trace the memory usage of each part of your code.

In the end, remember that all this tracing and optimization might not be enough, since your application might really need more than 32 MB of memory. I've developed a system that analyzes several data sources, detects spammers, and then blocks their SMTP connections; since the number of connected users is sometimes over 30,000, even after a lot of code optimization I had to increase the PHP memory limit to 768 MB on the server, which is not a common thing to do.

Which is faster/preferred: memset or for loop to zero out an array of doubles?

Note that for memset you have to pass the number of bytes, not the number of elements, because it is an old C function:

memset(d, 0, sizeof(double)*length);

memset can be faster since it is usually implemented in hand-optimized assembly, whereas std::fill is a template function that conceptually does a loop internally (though modern standard libraries and compilers often turn std::fill over trivial types into the equivalent of a memset).

But for type safety and more readable code I would recommend std::fill() - it is the C++ way of doing things - and consider memset only if a performance optimization is needed at this place in the code.
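For completeness, here is the std::fill equivalent of the memset call above (a minimal sketch; note it takes an element range, not a byte count):

#include <algorithm>
#include <cstddef>

void zero(double *d, std::size_t length) {
    std::fill(d, d + length, 0.0);   // element count, not byte count
}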

Can I prevent the gcc optimizer from delaying memory allocation?

You seem to be misinterpreting the situation. Virtual memory within a user-space process (heap space in this case) does get allocated “immediately” (possibly after a few system calls that negotiate a larger heap).

However, each page-aligned page-sized chunk of virtual memory that you haven’t touched yet will initially lack a physical page backing. Virtual pages are mapped to physical pages lazily, (only) when the need arises.

That said, the “allocation” you are observing (as part of the first access to the big heap space) is happening a few layers of abstraction below what GCC can directly influence and is handled by your operating system’s paging mechanism.

Side note: Another consequence would be, for example, that allocating a 1 TB chunk of virtual memory on a machine with, say, 128 GB of RAM will appear to work perfectly fine, as long as you never access most of that huge (lazily) allocated space. (There are configuration options that can limit such memory overcommitment if need be.)

When you touch your newly allocated virtual memory pages for the first time, each of them causes a page fault and your CPU ends up in a handler in the kernel because of that. The kernel evaluates the situation and establishes that the access was in fact legit. So it “materializes” the virtual memory page, i.e. picks a physical page to back the virtual page and updates both its bookkeeping data structures and (equally importantly) the hardware page mapping mechanism(s) (e.g. page tables or TLB, depending on architecture). Then the kernel switches back to your userspace process, which will have no clue that all of this just happened. Repeat for each page.

Presumably, the description above is hugely oversimplified. (For example, there can be multiple page sizes to strike a balance between mapping maintenance efficiency and granularity / fragmentation etc.)

A simple and ugly way to ensure that the memory buffer gets its hardware backing would be to find the smallest possible page size on your architecture (4 KiB on x86_64, for example, which in most cases means 1024 of those integers per page) and then touch each (possible) page of that memory beforehand, as in: for (size_t i = 0; i < 0x80000000; i += 1024) buffer[i] = 1;.

There are (of course) more reasonable solutions than that↑; this is just an example to illustrate what’s happening and why.
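For example, here is a sketch of the same pre-faulting idea with the page size queried at run time instead of hardcoded (the buffer size and names are assumptions, and sysconf is POSIX-specific):

#include <unistd.h>   // sysconf (POSIX)
#include <cstdlib>    // malloc, free
#include <cstddef>

int main() {
    const std::size_t count = std::size_t(1) << 28;   // hypothetical buffer of ~268M ints (~1 GiB)
    int *buffer = static_cast<int *>(std::malloc(count * sizeof(int)));
    if (!buffer) return 1;

    // Virtual memory is reserved above, but physical pages are typically
    // mapped lazily. Touch one int per page so each page gets its backing now.
    const std::size_t ints_per_page =
        static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) / sizeof(int);
    for (std::size_t i = 0; i < count; i += ints_per_page)
        buffer[i] = 1;

    std::free(buffer);
    return 0;
}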


