Optimising C++ 2-D Arrays


If you're using GCC, the compiler can analyze your matrix accesses and change their layout in memory in certain cases. The relevant compiler flag is documented as:

-fipa-matrix-reorg

Perform matrix flattening and transposing. Matrix flattening tries to replace an m-dimensional matrix with its equivalent n-dimensional matrix, where n < m. This reduces the level of indirection needed for accessing the elements of the matrix. The second optimization is matrix transposing, which attempts to change the order of the matrix's dimensions in order to improve cache locality. Both optimizations require the -fwhole-program flag. Transposing is enabled only if profiling information is available.

Note that this option is not enabled by -O2 or -O3. You have to pass it yourself.

What is the most optimized way to assign a value to a two dimensional array?

Your code is fine: readable and simple, apart from the typos. Just make sure you use the same type for i and j as that of n and m.

#define N 10
#define M 10

int array[M][N];

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++)
        array[i][j] = 0;
}

The advantage of writing the code this way is that it works for a flat 2D array int array[M][N], for an array of pointers int *array[M], and for a pointer to pointer int **array.

If the array is a flat 2D array of integers and the value is 0, there is an alternative:

memset(array, 0, sizeof(array[0][0]) * M * N);

But it only works for very specific values that are represented in memory with identical bytes in all positions. Such is the case for 0 and -1 for regular two's complement architectures and 254 other less obvious values (out of 4 billion).

You could also initialize a single row and copy that to the following ones...

Such optimisations are confusing, error prone and unnecessary as modern optimizing compilers will generate very efficient code for this:

#define M 10
#define N 10

void clear_matrix(int array[M][N]) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++)
            array[i][j] = 0;
    }
}

Notes:

  • Code correctness always beats optimisation: what is the purpose of getting incorrect results fast?
  • Always benchmark and profile before you optimize, and focus first on the bottlenecks that stand out as obvious targets.
  • It is unlikely for the matrix initialization to be the bottleneck. If it is, your algorithm is probably at fault.
  • Finding a better algorithm, with reduced complexity, is a more productive effort than micro-optimisation.

optimize 2D array in C++

A general in-place matrix transposition is very difficult, but if you're okay with transposing it to another array, then it's pretty simple.

const int cols = 500;
const int rows = 100;

int arr[rows][cols];
// fill arr[][]

int arrT[cols][rows];
for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
        arrT[c][r] = arr[r][c];
    }
}

Of course, depending on how you're getting arr[][], you can just fill arrT[][] directly instead.

However, there may be a simpler solution: simply swapping the order of the loops.

for (int k = 0; k < T; ++k) {             // for each trainee
    myscore[k] = 0;
    for (int j = 0; j < rows; ++j) {      // for each expert
        for (int i = 0; i < cols; ++i) {  // for each sample
            myscore[k] += delta(i, anotherArray[k][i], arr[j][i]);
        }
    }
}

Is this a possible way to optimize multidimensional arrays on heap?

Yes, using arithmetic and a single base pointer is what the compiler does internally for non-dynamically allocated 2D (n-dimensional) arrays.

You gain the most performance because there's a single calculation and indexed lookup. With the 2D array shown, there are two pointer lookups and two index calculations per array access (one index calculation and lookup to get to the right array, and then the second to access the element in the right array). With a 3D array, there'd be three index calculations and three lookups.

You also allocate less memory, and need fewer memory allocations, but those are second order effects.

Also, as WhozCraig points out in a comment but I didn't mention, you get better locality of reference and potential for smarter prefetch with a single big chunk of memory compared with multiple smaller chunks (that add up to more memory than the single big chunk).


I tested this file (sim2d.c) compiled with GCC 4.9.1 on Mac OS X 10.10.2 Yosemite.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *MjMalloc(size_t nbytes)
{
    void *rv = malloc(nbytes);
    if (rv == 0)
    {
        fprintf(stderr, "Memory allocation failure (%zu bytes)\n", nbytes);
        exit(1);
    }
    return rv;
}

/* Mechanism 1 */
typedef struct ArrayInt {
    int *array;
    int length;
} ArrayInt;

static void ArrayIntCreate(ArrayInt *array, int length) {
    array->array = MjMalloc(length * sizeof(int));
    array->length = length;
}

static void ArrayIntDelete(ArrayInt *array) {
    free(array->array);
}

typedef struct ArrayArrayInt {
    ArrayInt *array;
    int length;
} ArrayArrayInt;

static void ArrayArrayIntCreate(ArrayArrayInt *array, int length, int length2) {
    array->array = MjMalloc(length * sizeof(ArrayInt));
    array->length = length;
    for (int i = 0; i < length; i += 1) {
        ArrayIntCreate(&array->array[i], length2);
    }
}

static void ArrayArrayIntDelete(ArrayArrayInt *array) {
    for (int i = 0; i < array->length; i += 1) {
        ArrayIntDelete(&array->array[i]);
    }
    free(array->array);
}

/* Mechanism 2 */
typedef struct ArrayArrayInt2 {
    int *array;
    int length;
    int length2;
} ArrayArrayInt2;

static void ArrayArrayInt2Create(ArrayArrayInt2 *array, int length, int length2) {
    array->array = MjMalloc(length * length2 * sizeof(int));
    array->length = length;
    array->length2 = length2;
}

static void ArrayArrayInt2Delete(ArrayArrayInt2 *array) {
    free(array->array);
}

#define aai2At(aai2, i) (&aai2.array[(i) * aai2.length2])
#define aai2At2(aai2, i, j) (aai2.array[(i) * aai2.length2 + (j)])

/* Head-to-head testing */
static double ElapsedTime(clock_t startTime, clock_t endTime) {
    return (double)(endTime - startTime) / CLOCKS_PER_SEC;
}

#define N 2000
#define N_CYCLES 1000

static void one_test_cycle(void)
{
    ArrayArrayInt aai;
    ArrayArrayInt2 aai2;
    long long int sum;
    clock_t startTime, endTime;

    startTime = clock();
    sum = 0;
    for (int k = 0; k < N_CYCLES; k += 1) {
        ArrayArrayIntCreate(&aai, N, N);
        for (int i = 0; i < aai.length; i += 1) {
            int j = 0;
            for (; j < aai.array[i].length; j += 1) {
                aai.array[i].array[j] = i;
            }
            while ((j -= 1) >= 0) {
                sum += aai.array[i].array[j] - i + 1;
            }
        }
        ArrayArrayIntDelete(&aai);
    }
    endTime = clock();
    printf("aai1: sum = %lld; time = %.2f\n", sum, ElapsedTime(startTime, endTime));

    startTime = clock();
    sum = 0;
    for (int k = 0; k < N_CYCLES; k += 1) {
        ArrayArrayInt2Create(&aai2, N, N);
        for (int i = 0; i < aai2.length; i += 1) {
            int j = 0;
            for (; j < aai2.length2; j += 1) {
                aai2At(aai2, i)[j] = i;
            }
            while ((j -= 1) >= 0) {
                sum += aai2At(aai2, i)[j] - i + 1;
            }
        }
        ArrayArrayInt2Delete(&aai2);
    }
    endTime = clock();
    printf("aai2: sum = %lld; time = %.2f\n", sum, ElapsedTime(startTime, endTime));

    startTime = clock();
    sum = 0;
    for (int k = 0; k < N_CYCLES; k += 1) {
        ArrayArrayInt2Create(&aai2, N, N);
        for (int i = 0; i < aai2.length; i += 1) {
            int j = 0;
            for (; j < aai2.length2; j += 1) {
                aai2At2(aai2, i, j) = i;
            }
            while ((j -= 1) >= 0) {
                sum += aai2At2(aai2, i, j) - i + 1;
            }
        }
        ArrayArrayInt2Delete(&aai2);
    }
    endTime = clock();
    printf("aai3: sum = %lld; time = %.2f\n", sum, ElapsedTime(startTime, endTime));
}

static void print_now(const char *tag)
{
    time_t now = time(0);
    struct tm *lt = localtime(&now);
    char buffer[32];
    strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S", lt);
    printf("%s: %s\n", tag, buffer);
}

int main(void)
{
    print_now("Started");
    for (int i = 0; i < 3; i++)
        one_test_cycle();
    print_now("Finished");
    return 0;
}

There are two slightly different ways of accessing the aai2 data. I also separated the array size (N = 2000) from the number of cycles in a single test (N_CYCLES = 1000). The timing results I got were:

Started: 2015-04-07 07:40:41
aai1: sum = 4000000000; time = 6.80
aai2: sum = 4000000000; time = 5.99
aai3: sum = 4000000000; time = 5.98
aai1: sum = 4000000000; time = 6.75
aai2: sum = 4000000000; time = 6.02
aai3: sum = 4000000000; time = 5.99
aai1: sum = 4000000000; time = 6.72
aai2: sum = 4000000000; time = 6.01
aai3: sum = 4000000000; time = 5.99
Finished: 2015-04-07 07:41:38

I was getting similar patterns with N_CYCLES = 2000, but it was taking twice as long to run — surprise, surprise.

I'm seeing a small but noticeable benefit (roughly an 11% reduction in time) from the single-allocation code, but no significant difference between the two timings for the 'aai2' tests.

Basic statistics:

# All data
# Count = 9
# Mean = 6.250000e+00
# Std Dev = 3.807230e-01

# aai1 only:
# Count = 3
# Mean = 6.756667e+00
# Std Dev = 4.041452e-02

# aai2 and aai3:
# Count = 6
# Mean = 5.996667e+00
# Std Dev = 1.505545e-02

# aai2 only:
# Count = 3
# Mean = 6.006667e+00
# Std Dev = 1.527525e-02

# aai3 only:
# Count = 3
# Mean = 5.986667e+00
# Std Dev = 5.773503e-03

Clearly, ensuring the machine is otherwise unloaded, running many more iterations of the test, and similar benchmarking precautions might improve the data, but the single-allocation aai2 mechanism performs better on this machine than the multi-allocation aai mechanism. (Tangential aside: why do people not put a suffix 1 on their first version when they have two or more versions of the code?)

Hardware: 17-inch MacBook Pro (early 2011), 2.3 GHz Intel Core i7, 16 GiB 1333 MHz DDR3 RAM.

Optimising 2D array indexing for cache line

I do not see justification for maximizing consecutive use of one cache line. The cache does not operate “one line at a time”, and there is generally no advantage to using one cache line repeatedly versus using any of the lines that are in cache.

A better goal is to maximize the number of accesses that are served from a line in L1 cache, instead of needing to be fetched from slower cache or memory. As long as an access “hits” a line currently in cache, we do not care which cache line it is.

The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and is eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache has 512 lines, and they are organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below. Within each set, eight cache lines are retained, from lines that have been recently used in that set. (The replacement algorithm is not least-recently-used, but it is an attempt to be useful and may have similar results to least-recently-used.)

Cache lookups work in this way:

  • Given a byte address x, let t = floor(x/64) (due to cache line size).
  • Let s = t % 64 (to select the set).
  • Check set s to see whether it contains the byte at address x.

Consider the effect of row length on these cache lookups. With a row length of 65,536 bytes, the addresses of array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values for t in the above lookup procedure differ by exactly 1024, and their values for s are identical. Therefore, they map to the same set.

Once the algorithm moves up or down by more than eight rows, without changing columns outside of a cache line, the single cache set being used cannot handle the nine recently used cache lines. One of them must be evicted. In effect, the cache size is eight lines (512 bytes) instead of 512 lines (32,768 bytes).

A simple way to address this is to pad the array so that the rows are 65,536+p bytes long, for some padding amount p. The array would be allocated with extra space and defined with longer rows than normal. The extra columns may generally be ignored. There is no need to initialize them; we do not care about their contents, just the effect they have on the addresses. (Alternately, they could be used for supplementary data if that is convenient for the program.)

With this padding, the distance between a[i][j] and a[i+1][j] is 65,536+p bytes, so the difference in t values is 1024+p/64, and the difference in s values is p/64 % 64. E.g., if p is 64 or 320, the difference in s values is 1 or 5, respectively.

I suggest testing 9*64 for p. Any value 64 or greater will ensure that array elements in the same column in consecutive rows map to different cache sets. However, the algorithm described in the question wanders in columns as well as in rows. So, if p were small, our fix to make consecutive rows map to different cache sets might be negated by column-wandering that meanders back to the same cache set. Other values of p should be tried as well.

This is not intended to be a complete solution to the problem, as there are many factors that affect performance.

Fastest way to loop through a 2d array?

Cache locality. When you scan horizontally, consecutive accesses are closer together in memory, so you get fewer cache misses and better performance. For a small enough rectangle, this won't matter.

Optimizing two-dimensional arrays

You need a different data structure for fast lookups: a dict rather than a list. Here's an example:

array1 = [
    [[1, 2], [1, 5]],
    [[3, 2], [7, 5]],
]

array2 = [
    [[3, 2], [9, 9]],
    [[1, 2], [1, 5]],
]

# Build a lookup table mapping each pair in array1 to its indexes.
lookup = {}
for r, row in enumerate(array1):
    for c, val in enumerate(row):
        lookup[tuple(val)] = (r, c)

# Scan array2 and report the pairs that also occur in array1.
for r, row in enumerate(array2):
    for c, val in enumerate(row):
        pair = tuple(val)
        if pair in lookup:
            print(dict(
                pair=pair,
                array2_indexes=(r, c),
                array1_indexes=lookup[pair],
            ))

Output:

{'pair': (3, 2), 'array2_indexes': (0, 0), 'array1_indexes': (1, 0)}
{'pair': (1, 2), 'array2_indexes': (1, 0), 'array1_indexes': (0, 0)}
{'pair': (1, 5), 'array2_indexes': (1, 1), 'array1_indexes': (0, 1)}

Fast Output of 2D Array C

You can write each row at once using fwrite():

for (int row = 0; row < 20; ++row) {
    fwrite(grid[row], sizeof grid[row], 1, stdout);
    putchar('\n');
}

BTW, your code that writes the whole grid using printf("%s", grid) is most likely causing undefined behavior. %s requires the argument to be a pointer to a null-terminated string, and grid presumably doesn't have a null at the end. If it seems to be working it's just accidental, because there happened to be a null byte in the memory following grid.

Note that this solution only works for an actual 2-D array of char. If it's an array of pointers, you can replace sizeof grid[row] with a variable or macro that specifies the size of each row pointed to by grid[row] (or just a hard-coded 30 as in the question).

Optimising C for performance vs memory optimisation using multidimensional arrays

There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given, then you need to find one which is possibly slower but fits within your constraints.

Beyond that, the only option is to implement both and compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest machines, will it be used? These are the questions we have to grapple with in computer science. But you can only answer them once you have numbers; otherwise you are just guessing, and a fair amount of the time our intuition about optimizations is not correct.
