1D or 2D Array, What's Faster

1D or 2D array, what's faster?

tl;dr : You should probably use a one-dimensional approach.

Note: One cannot dig into detail affecting performance when comparing dynamic 1d or dynamic 2d storage patterns without filling books since the performance of code is dependent one a very large number of parameters. Profile if possible.

1. What's faster?

For dense matrices the 1D approach is likely to be faster since it offers better memory locality and less allocation and deallocation overhead.

2. What's smaller?

Dynamic-1D consumes less memory than the 2D approach. The latter also requires more allocations.

Remarks

I laid out a pretty long answer beneath with several reasons but I want to make some remarks on your assumptions first.

I can imagine, that recalculating indices for 1D arrays (y + x*n) could be slower than using 2D array (x, y)

Let's compare these two functions:

int get_2d (int **p, int r, int c) { return p[r][c]; }
int get_1d (int *p, int r, int c)  { return p[c + C*r]; }

The (non-inlined) assembly generated by Visual Studio 2015 RC for those functions (with optimizations turned on) is:

?get_1d@@YAHPAHII@Z PROC
push    ebp
mov ebp, esp
mov eax, DWORD PTR _c$[ebp]
lea eax, DWORD PTR [eax+edx*4]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0

?get_2d@@YAHPAPAHII@Z PROC
push ebp
mov ebp, esp
mov ecx, DWORD PTR [ecx+edx*4]
mov eax, DWORD PTR _c$[ebp]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0

The difference is mov (2d) vs. lea (1d).
The former has a latency of 3 cycles and a a maximum throughput of 2 per cycle while the latter has a latency of 2 cycles and a maximum throughput of 3 per cycle. (According to Instruction tables - Agner Fog
Since the differences are minor, I think there should not be a big performance difference arising from index recalculation. I expect it to be very unlikely to identify this difference itself to be the bottleneck in any program.

This brings us to the next (and more interesting) point:

... but I could image that 1D could be in CPU cache ...

True, but 2d could be in CPU cache, too. See The Downsides: Memory locality for an explanation why 1d is still better.

The long answer, or why dynamic 2 dimensional data storage (pointer-to-pointer or vector-of-vector) is "bad" for simple / small matrices.

Note: This is about dynamic arrays/allocation schemes [malloc/new/vector etc.]. A static 2d array is a contiguous block of memory and therefore not subject to the downsides I'm going to present here.

The Problem

To be able to understand why a dynamic array of dynamic arrays or a vector of vectors is most likely not the data storage pattern of choice, you are required to understand the memory layout of such structures.

Example case using pointer to pointer syntax

int main (void)
{
    // allocate memory for 4x4 integers; quick & dirty
    int ** p = new int*[4];
    for (size_t i=0; i<4; ++i) p[i] = new int[4]; 

    // do some stuff here, using p[x][y] 

    // deallocate memory
    for (size_t i=0; i<4; ++i) delete[] p[i];
    delete[] p;
}

The downsides

Memory locality

For this “matrix” you allocate one block of four pointers and four blocks of four integers. All of the allocations are unrelated and can therefore result in an arbitrary memory position.

The following image will give you an idea of how the memory may look like.

For the real 2d case:

The violet square is the memory position occupied by p itself.
The green squares assemble the memory region p points to (4 x int*).
The 4 regions of 4 contiguous blue squares are the ones pointed to by each int* of the green region

For the 2d mapped on 1d case:

The green square is the only required pointer int *
The blue squares ensemble the memory region for all matrix elements (16 x int).

Real 2d vs mapped 2d memory layout

This means that (when using the left layout) you will probably observe worse performance than for a contiguous storage pattern (as seen on the right), due to caching for instance.

Let's say a cache line is "the amount of data transfered into the cache at once" and let's imagine a program accessing the whole matrix one element after another.

If you have a properly aligned 4 times 4 matrix of 32 bit values, a processor with a 64 byte cache line (typical value) is able to "one-shot" the data (4*4*4 = 64 bytes).
If you start processing and the data isn't already in the cache you'll face a cache miss and the data will be fetched from main memory. This load can fetch the whole matrix at once since it fits into a cache line, if and only if it is contiguously stored (and properly aligned).
There will probably not be any more misses while processing that data.

In case of a dynamic, "real two-dimensional" system with unrelated locations of each row/column, the processor needs to load every memory location seperately.
Eventhough only 64 bytes are required, loading 4 cache lines for 4 unrelated memory positions would -in a worst case scenario- actually transfer 256 bytes and waste 75% throughput bandwidth.
If you process the data using the 2d-scheme you'll again (if not already cached) face a cache miss on the first element.
But now, only the first row/colum will be in the cache after the first load from main memory because all other rows are located somewhere else in memory and not adjacent to the first one.
As soon as you reach a new row/column there will again be a cache miss and the next load from main memory is performed.

Long story short: The 2d pattern has a higher chance of cache misses with the 1d scheme offering better potential for performance due to locality of the data.

Frequent Allocation / Deallocation

As many as N + 1 (4 + 1 = 5) allocations (using either new, malloc, allocator::allocate or whatever) are necessary to create the desired NxM (4×4) matrix.
The same number of proper, respective deallocation operations must be applied as well.

Therefore, it is more costly to create/copy such matrices in contrast to a single allocation scheme.

This is getting even worse with a growing number of rows.

Memory consumption overhead

I'll asumme a size of 32 bits for int and 32 bits for pointers. (Note: System dependency.)

Let's remember: We want to store a 4×4 int matrix which means 64 bytes.

For a NxM matrix, stored with the presented pointer-to-pointer scheme we consume

N*M*sizeof(int) [the actual blue data] +
N*sizeof(int*) [the green pointers] +
sizeof(int**) [the violet variable p] bytes.

That makes 4*4*4 + 4*4 + 4 = 84 bytes in case of the present example and it gets even worse when using std::vector<std::vector<int>>.
It will require N * M * sizeof(int) + N * sizeof(vector<int>) + sizeof(vector<vector<int>>) bytes, that is 4*4*4 + 4*16 + 16 = 144 bytes in total, intead of 64 bytes for 4 x 4 int.

In addition -depending on the used allocator- each single allocation may well (and most likely will) have another 16 bytes of memory overhead. (Some “Infobytes” which store the number of allocated bytes for the purpose of proper deallocation.)

This means the worst case is:

N*(16+M*sizeof(int)) + 16+N*sizeof(int*) + sizeof(int**)

= 4*(16+4*4) + 16+4*4 + 4 = 164 bytes ! _Overhead: 156%_

The share of the overhead will reduce as the size of the matrix grows but will still be present.

Risk of memory leaks

The bunch of allocations requires an appropriate exception handling in order to avoid memory leaks if one of the allocations will fail!
You’ll need to keep track of allocated memory blocks and you must not forget them when deallocating the memory.

If new runs of of memory and the next row cannot be allocated (especially likely when the matrix is very large), a std::bad_alloc is thrown by new.

Example:

In the above mentioned new/delete example, we'll face some more code if we want to avoid leaks in case of bad_alloc exceptions.

  // allocate memory for 4x4 integers; quick & dirty
  size_t const N = 4;
  // we don't need try for this allocation
  // if it fails there is no leak
  int ** p = new int*[N];
  size_t allocs(0U);
  try 
  { // try block doing further allocations
    for (size_t i=0; i<N; ++i) 
    {
      p[i] = new int[4]; // allocate
      ++allocs; // advance counter if no exception occured
    }
  }
  catch (std::bad_alloc & be)
  { // if an exception occurs we need to free out memory
    for (size_t i=0; i<allocs; ++i) delete[] p[i]; // free all alloced p[i]s
    delete[] p; // free p
    throw; // rethrow bad_alloc
  }
  /*
     do some stuff here, using p[x][y] 
  */
  // deallocate memory accoding to the number of allocations
  for (size_t i=0; i<allocs; ++i) delete[] p[i];
  delete[] p;

Summary

There are cases where "real 2d" memory layouts fit and make sense (i.e. if the number of columns per row is not constant) but in the most simple and common 2D data storage cases they just bloat the complexity of your code and reduce the performance and memory efficiency of your program.

Alternative

You should use a contiguous block of memory and map your rows onto that block.

The "C++ way" of doing it is probably to write a class that manages your memory while considering important things like

What is The Rule of Three?
What is meant by Resource Acquisition is Initialization (RAII)?
C++ concept: Container (on cppreference.com)

Example

To provide an idea of how such a class may look like, here's a simple example with some basic features:

2d-size-constructible
2d-resizable
operator(size_t, size_t) for 2d- row major element access
at(size_t, size_t) for checked 2d-row major element access
Fulfills Concept requirements for Container

Source:

#include <vector>
#include <algorithm>
#include <iterator>
#include <utility>

namespace matrices
{

  template<class T>
  class simple
  {
  public:
    // misc types
    using data_type  = std::vector<T>;
    using value_type = typename std::vector<T>::value_type;
    using size_type  = typename std::vector<T>::size_type;
    // ref
    using reference       = typename std::vector<T>::reference;
    using const_reference = typename std::vector<T>::const_reference;
    // iter
    using iterator       = typename std::vector<T>::iterator;
    using const_iterator = typename std::vector<T>::const_iterator;
    // reverse iter
    using reverse_iterator       = typename std::vector<T>::reverse_iterator;
    using const_reverse_iterator = typename std::vector<T>::const_reverse_iterator;

    // empty construction
    simple() = default;

    // default-insert rows*cols values
    simple(size_type rows, size_type cols)
      : m_rows(rows), m_cols(cols), m_data(rows*cols)
    {}

    // copy initialized matrix rows*cols
    simple(size_type rows, size_type cols, const_reference val)
      : m_rows(rows), m_cols(cols), m_data(rows*cols, val)
    {}

    // 1d-iterators

    iterator begin() { return m_data.begin(); }
    iterator end() { return m_data.end(); }
    const_iterator begin() const { return m_data.begin(); }
    const_iterator end() const { return m_data.end(); }
    const_iterator cbegin() const { return m_data.cbegin(); }
    const_iterator cend() const { return m_data.cend(); }
    reverse_iterator rbegin() { return m_data.rbegin(); }
    reverse_iterator rend() { return m_data.rend(); }
    const_reverse_iterator rbegin() const { return m_data.rbegin(); }
    const_reverse_iterator rend() const { return m_data.rend(); }
    const_reverse_iterator crbegin() const { return m_data.crbegin(); }
    const_reverse_iterator crend() const { return m_data.crend(); }

    // element access (row major indexation)
    reference operator() (size_type const row,
      size_type const column)
    {
      return m_data[m_cols*row + column];
    }
    const_reference operator() (size_type const row,
      size_type const column) const
    {
      return m_data[m_cols*row + column];
    }
    reference at() (size_type const row, size_type const column)
    {
      return m_data.at(m_cols*row + column);
    }
    const_reference at() (size_type const row, size_type const column) const
    {
      return m_data.at(m_cols*row + column);
    }

    // resizing
    void resize(size_type new_rows, size_type new_cols)
    {
      // new matrix new_rows times new_cols
      simple tmp(new_rows, new_cols);
      // select smaller row and col size
      auto mc = std::min(m_cols, new_cols);
      auto mr = std::min(m_rows, new_rows);
      for (size_type i(0U); i < mr; ++i)
      {
        // iterators to begin of rows
        auto row = begin() + i*m_cols;
        auto tmp_row = tmp.begin() + i*new_cols;
        // move mc elements to tmp
        std::move(row, row + mc, tmp_row);
      }
      // move assignment to this
      *this = std::move(tmp);
    }

    // size and capacity
    size_type size() const { return m_data.size(); }
    size_type max_size() const { return m_data.max_size(); }
    bool empty() const { return m_data.empty(); }
    // dimensionality
    size_type rows() const { return m_rows; }
    size_type cols() const { return m_cols; }
    // data swapping
    void swap(simple &rhs)
    {
      using std::swap;
      m_data.swap(rhs.m_data);
      swap(m_rows, rhs.m_rows);
      swap(m_cols, rhs.m_cols);
    }
  private:
    // content
    size_type m_rows{ 0u };
    size_type m_cols{ 0u };
    data_type m_data{};
  };
  template<class T>
  void swap(simple<T> & lhs, simple<T> & rhs)
  {
    lhs.swap(rhs);
  }
  template<class T>
  bool operator== (simple<T> const &a, simple<T> const &b)
  {
    if (a.rows() != b.rows() || a.cols() != b.cols())
    {
      return false;
    }
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
  }
  template<class T>
  bool operator!= (simple<T> const &a, simple<T> const &b)
  {
    return !(a == b);
  }

}

Note several things here:

T needs to fulfill the requirements of the used std::vector member functions
operator() doesn't do any "of of range" checks
No need to manage data on your own
No destructor, copy constructor or assignment operators required

So you don't have to bother about proper memory handling for each application but only once for the class you write.

Restrictions

There may be cases where a dynamic "real" two dimensional structure is favourable. This is for instance the case if

the matrix is very large and sparse (if any of the rows do not even need to be allocated but can be handled using a nullptr) or if
the rows do not have the same number of columns (that is if you don't have a matrix at all but another two-dimensional construct).

Performance of 2-dimensional array vs 1-dimensional array

In C, 2-dimensional arrays are just a neat indexing scheme for 1-dimensional arrays. Just like with a 1D array, 2D arrays allocate a single block of contiguous memory, and the A[row][col] notation is similar to saying A[row*NCOLS+col].

Usually if you were to implement your own multidimensional arrays using single dimensional arrays, you'd write an indexing function:

int getIndex(int row, int col) { return row*NCOLS+col; }

Assuming your compiler inlines this function, the performance here would be precisely the same as if you used the built in 'indexing function' of 2D arrays.

To illustrate:

#define NROWS 10
#define NCOLS 20

This:

int main(int argc, char *argv[]) {
    int myArr[NROWS*NCOLS];
    for (int i=0; i<NROWS; ++i) {
       for (int j=0; j<NCOLS; ++j) {
          myArr[getIndex(i,j)] = i+j;
       }
    }
    return 0;
}

Should perform the same as this:

int main(int argc, char *argv[]) {
    int myArr[NROWS][NCOLS];
    for (int i=0; i<NROWS; ++i) {
       for (int j=0; j<NCOLS; ++j) {
          myArr[i][j] = i+j;
       }
    }
    return 0;
}

Though as AraK pointed out, if you are jumping around rows a lot, and the rows are very large, you could hit a lot of page faults... in that case the custom indexing function (with rows and cols switched around) could help, but so could simply changing which of the dimensions in a 2-dimensional array you treat as the rows and which you treat as the columns.

Why 2D arrays run significantly slower than 1D arrays in java

As suggested by @David Zimmerman I've changed the code into following one:

  int SIZE = 1000000000;
  int[] rnums = new int[SIZE];
  byte[][] d1 = new byte[2][SIZE]; 
  byte[][] d2 = new byte[2][SIZE];

And it works, runs normally.

using 2-dimensional arrays is faster than two of 1-dimensional arrays?

It depends on how did you use your one dimensional arrays.

Two dimensional array elements stored in memory consequently, while two one dimensional arrays may start at different memory addresses.

If you are accessing two dimensional array many times, it will be faster, because most of its values are already in cache. This is called "Spatial locality".
On the other hand, if you are accessing two one dimensional arrays one by one, it will force your computer to load data from memory into cache, which is way slower.

Anyway, it would be better, if you post both versions of your code.

2D array vs 1D array

If what you're asking is which is better, a 2D array of size 1000x44000 or a 1D array of size 44000000, well what's the difference as far as memory goes? You still have the same number of elements! In the case of performance and understandability, the 2D is probably better. Imagine having to manually find each column or row in a 1D array, when you know exactly where they are in a 2D array.

Is using one dimensional map array faster than 2d?

A 2D array in Java is an array of arrays. For comparison, the 1D array int[] a = {1,2,3,4} is stored as 4 contiguous integers in memory, as such:

|1|2|3|4|

while a 2D array int[][] b = {{1,2},{3,4}} is stored as such:

   +--------------+ 
   |              |
|.|.|   |1|2|    |3|4|
 |       |
 +-------+

I.e., b[0] and b[1] each points to an array of 2 integers somewhere in memory.

Thus b[0][0] requires that a reference be read and followed to access the element while in the 1D case, a[0] can access the element directly.

The other performance issue is that the subarrays are not necessarily stored contiguously in memory and therefore the whole 2D array would not be loaded in the processor cache in one go. Depending on the size of your array and how you processed the data this can make a big difference in performance.

However, in most cases, these performance differences would be small enough to not affect the performance of your program and you should choose the structure that makes your program easier to read and to reason about.

My advice would be to go with the 2D array if it makes your program simpler; you can always optimize this later.

You could also encapsulate the array in an abstract data type that hides the way that the array is accessed (I.e., it could be using either a 1D or 2D array with an access method such as getValue(x, y) with the method internally managing the array addressing).

Using 2-dimensional or 1-dimensional, which is fastest?

The root of the problem is that your storage order is different between the two schemes.

Your 2D structures are stored row-major. By dereferencing the row first, you arrive at a single buffer which can be directly indexed by column. Neighboring columns are in adjacent memory locations.

Your 1D structures are stored column-major. Neighboring columns are size elements apart in memory.

Trying both orders of iteration covers almost all of the effect. But what's left is the data dependence. By referring to D(r-1,c), the access patterns are completely different between row- and column- major.

Sure enough, changing the 1D indexing to d1[r*size + c] and d1[(r-1)*size + c] produces the following timing:

2D array    1000    78099
2D array C  1000    878527
1D array    1000    19661
1D array C  1000    729280
2D vector   1000    61641
2D vector C 1000    741249
1D vector   1000    18348
1D vector C 1000    726231

So, we still have to explain it. I'm going with the "loop-carried dependency". When you iterated the column-major 1D array in column-major order (good idea), each element depended on the element computed in the previous iteration. That means the loop can't be fully pipelined, as the result has to be fully computed and written back to cache, before it can be read again to compute the next element. In row-major, the dependence is now an element that was computed long ago, which means the loop can be unrolled and pipelined.

1D or 2D Array, What's Faster