Confusion between C++ and OpenGL matrix order (row-major vs column-major)

The matrix notation used in OpenGL documentation does not describe the in-memory layout of OpenGL matrices

I think it'll be easier if you drop/forget about the entire "row/column-major" thing. That's because, in addition to the notation, the programmer can also decide how to lay the matrix out in memory (whether adjacent elements form rows or columns), which adds to the confusion.

OpenGL matrices have the same memory layout as DirectX matrices:

x.x x.y x.z 0
y.x y.y y.z 0
z.x z.y z.z 0
p.x p.y p.z 1

or

{ x.x x.y x.z 0 y.x y.y y.z 0 z.x z.y z.z 0 p.x p.y p.z 1 }
  • x, y, z are 3-component vectors describing the matrix's coordinate system (a local coordinate system relative to the global coordinate system).

  • p is a 3-component vector describing the origin of the matrix's coordinate system.

Which means that the translation matrix should be laid out in memory like this:

{ 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, transX, transY, transZ, 1 }.

Leave it at that, and the rest should be easy.

--- citation from the old OpenGL FAQ ---



9.005 Are OpenGL matrices column-major or row-major?

For programming purposes, OpenGL matrices are 16-value arrays with base vectors laid out contiguously in memory. The translation components occupy the 13th, 14th, and 15th elements of the 16-element matrix, where indices are numbered from 1 to 16 as described in section 2.11.2 of the OpenGL 2.1 Specification.

Column-major versus row-major is purely a notational convention. Note that post-multiplying with column-major matrices produces the same result as pre-multiplying with row-major matrices. The OpenGL Specification and the OpenGL Reference Manual both use column-major notation. You can use any notation, as long as it's clearly stated.

Sadly, the use of column-major format in the spec and blue book has resulted in endless confusion in the OpenGL programming community. Column-major notation suggests that matrices are not laid out in memory as a programmer would expect.


I'm going to update this 9-year-old answer.

A mathematical matrix is defined as an m x n matrix, where m is the number of rows and n is the number of columns. For the sake of completeness: rows are horizontal, columns are vertical. When denoting a matrix element in mathematical notation, Mij, the first index (i) is the row index, the second one (j) is the column index. When two matrices are multiplied, i.e. A(m x n) * B(m1 x n1), the resulting matrix has the number of rows of the first argument (A) and the number of columns of the second (B), and the number of columns of the first argument (A) must match the number of rows of the second (B), so n == m1. Clear so far, yes?

Now, regarding in-memory layout: you can store a matrix two ways, row-major and column-major. Row-major means that the rows are effectively laid out one after another, linearly. So elements go from left to right, row after row, kinda like English text. Column-major means that the columns are effectively laid out one after another, linearly. So elements start at the top left and go from top to bottom.

Example:

//matrix
|a11 a12 a13|
|a21 a22 a23|
|a31 a32 a33|

//row-major
[a11 a12 a13 a21 a22 a23 a31 a32 a33]

//column-major
[a11 a21 a31 a12 a22 a32 a13 a23 a33]

Now, here's the fun part!

There are two ways to store a 3D transformation in a matrix. As I mentioned before, a matrix in 3D essentially stores the coordinate system's basis vectors and a position. So you can store those vectors in the rows or in the columns of the matrix. When they're stored as columns, you multiply the matrix with a column vector, like this:

//convention #1
|vx.x vy.x vz.x pos.x|   |p.x|   |res.x|
|vx.y vy.y vz.y pos.y|   |p.y|   |res.y|
|vx.z vy.z vz.z pos.z| x |p.z| = |res.z|
|   0    0    0     1|   |  1|   |res.w|

However, you can also store those vectors as rows, and then you'll be multiplying a row vector with a matrix:

//convention #2 (uncommon)
                  | vx.x  vx.y  vx.z 0|
                  | vy.x  vy.y  vy.z 0|
|p.x p.y p.z 1| x | vz.x  vz.y  vz.z 0| = |res.x res.y res.z res.w|
                  |pos.x pos.y pos.z 1|

So. Convention #1 often appears in mathematical texts. Convention #2 appeared in the DirectX SDK at some point. Both are valid.

And in regard to the question: if you're using convention #1, then your matrices are column-major, and if you're using convention #2, then they're row-major. However, the memory layout is the same in both cases:

[vx.x vx.y vx.z 0 vy.x vy.y vy.z 0 vz.x vz.y vz.z 0 pos.x pos.y pos.z 1]

Which is why I said, 9 years ago, that it is easier to just memorize which element is which.

Row-major vs Column-major confusion

Let's look at algebra first; algebra doesn't even have a notion of "memory layout" and stuff.

From an algebraic point of view, an MxN real matrix can act on an ℝ^N vector on its right side and yield an ℝ^M vector.

Thus, if you were sitting in an exam and given an MxN matrix and an ℝ^N vector, you could multiply them with trivial operations and get a result. Whether that result is right or wrong does not depend on whether the software your professor uses to check your results internally uses a column-major or a row-major layout; it depends only on whether you properly calculated the contraction of each row of the matrix with the (single) column of the vector.

To produce a correct output, the software will - by whatever means - essentially have to contract each row of the Matrix with the column vector, just like you did in the exam.


Thus, the difference between software that uses a column-major layout and software that uses a row-major layout is not what it calculates, but just how.

To put it more precisely, with regard to a single row's contraction with the column vector, the difference between those layouts is just the means of determining:

Where is the next element of the current row?
  • For a row-major-layout it's the element just in the next bucket in memory
  • For a column-major-layout it's the element in the bucket M buckets away.

And that's it.


To show you how that column/row magic is summoned in practice:

You haven't tagged your question with "c++", but since you mentioned glm, I assume you can get along with C++.

In C++'s standard library there's an infamous beast called valarray, which, besides other tricky features, has overloads of operator[], one of which can take a std::slice (which is essentially a very boring thing consisting of just three integer-type numbers).

This little slice thing, however, has everything one needs to access a row-major storage column-wise or a column-major storage row-wise: it has a start, a length, and a stride; the latter represents the "distance to the next bucket" I mentioned.

Graphics Row vs Column Major Transformations

Row-major order allows creating affine transformations from basic transformations in left-to-right order, for example this way:

M = Scale * Rotation * Translation

agreeing with the order of execution.
Of course, the basic transformations must be transposed relative to their definitions from linear algebra. With the column-major order used in math, the order in which transformations are executed is reversed, which is counterintuitive.
The efficiency of both approaches can be identical (it depends on the order in which the tables are stored in memory).
Using column-major matrices in graphics is therefore only a result of the linear algebra standard.

Are these functions column-major or row-major?

To my eyes, these two use the same format, in that in both the first index is treated as the row and the second index as the column.

Looks may be deceiving, but in fact the first index in linmath.h is the column. C and C++ specify that in a multidimensional array defined like this

sometype a[n][m];

there are n times m elements of sometype in succession. Whether that is row- or column-major order depends solely on how you interpret the indices. Now, OpenGL defines 4×4 matrices to be indexed in the following linear scheme:

0 4 8 c
1 5 9 d
2 6 a e
3 7 b f

If you apply the rules of C++ multidimensional arrays, you'd add the following column/row designation:

   ----> n

| 0 4 8 c
| 1 5 9 d
V 2 6 a e
m 3 7 b f

Which remaps the linear indices into 2-tuples of

0 -> 0,0
1 -> 0,1
2 -> 0,2
3 -> 0,3
4 -> 1,0
5 -> 1,1
6 -> 1,2
7 -> 1,3
8 -> 2,0
9 -> 2,1
a -> 2,2
b -> 2,3
c -> 3,0
d -> 3,1
e -> 3,2
f -> 3,3

Okay, OpenGL and some math libraries use column-major ordering, fine. But why do it this way and break with the usual mathematical convention that in Mi,j the index i designates the row and j the column? Because it makes things look nicer. You see, a matrix is just a bunch of vectors. Vectors that can, and usually do, form a coordinate base system.

Have a look at this picture:

[Image: a 3-dimensional Cartesian coordinate system with right-handed basis vectors]

The axes X, Y and Z are essentially vectors. They are defined as

X = (1,0,0)
Y = (0,1,0)
Z = (0,0,1)

Wait a moment, doesn't that look like an identity matrix? Indeed it does, and in fact it is!

However, written as it is, the matrix has been formed by stacking row vectors. And the rules of matrix multiplication essentially tell us that a matrix formed from row vectors transforms row vectors into row vectors by left-associative multiplication. Column-major matrices transform column vectors into column vectors by right-associative multiplication.

Now this is not really a problem, because left-associative can do the same stuff as right-associative; you just have to swap rows for columns (i.e. transpose everything) and reverse the order of the operands. Left vs. right and row vs. column are just notational conventions in which we write things.

And the typical mathematical notation is (for example)

v_clip = P · V · M · v_local

This notation makes it intuitively visible what's going on. Furthermore, in programming the character = usually designates assignment from right to left. Some programming languages are more mathematically influenced, like Pascal or Delphi, and write it :=. Anyway, with row-major ordering we'd have to write it

v_clip = v_local · M · V · P

and to the majority of mathematical folks this looks unnatural. Because, technically, M, V and P are in fact linear operators (yes, they're also matrices and linear transforms), and operators always go between the equality/assignment and the variable.

So that's why we use the column-major format: it looks nicer. Technically it could be done using the row-major format as well. And what does this have to do with the memory layout of matrices? When you want to use a column-major notation, you want direct access to the base vectors of the transformation matrices, without having to extract them element by element. With the numbers stored in column-major format, all it takes to access a certain base vector of a matrix is a simple offset into linear memory.

I can't speak for the code example of the other library, but I'd strongly assume that it also treats the first index as the slower-incrementing one, which makes it work in column-major if subjected to the notations of OpenGL. Remember: column-major & right-associativity == row-major & left-associativity.

What is the correct order to use matrices in OpenGL?

According to a careful reading of

Why does my translation matrix needs to be transposed?

the difference is that the first page you are looking at is showing you the mathematical notation for your matrices, in which each single column is arranged along a vertical line; whereas the other notation you are looking at is a sequence of characters resembling something that might occur in your C++ code. There are many other differences, such as the tall square brackets in the math notation (which are impossible to reproduce exactly in C++ code, as there are no characters that span multiple lines), and the fact that there are no commas between the numbers in math notation.

But the main thing is that since the matrix is stored in column-major order, if you initialize the entries in the order in which they are stored (such as by using the notation for initializing an array from a comma-separated list), then x, y, z, and 1 will be the last four entries you will set in the matrix. And since C++ is always parsed as a linear string reading left to right (the parser doesn't care what number happens to be directly under what other number when you look at the listing), if you happen to format a list of 16 values in 4 rows of 4 values each, the last four values in the list are the ones on the last row of your format.

Column Major Order for OpenGL

Matrix multiplication, mathematically speaking, works the same way regardless of column/row major ordering. What the column/row major ordering affects is how you create those matrices. How you build a rotation matrix, for example.

Furthermore:

should I be writing the first struct as follows

The whole column/row-major thing only matters for arrays of values: it is about how you store the matrix as an array. If you have an array of 9 values and the 3x3 matrix is stored in row-major form, then the element at the 2nd column, 1st row has zero-based index 1. In column-major storage, the 2nd column, 1st row element has index 3.

So look at your union. The point of the union is to access the same floats in different ways. The _21 member is supposed to be the 2nd column, 1st row. If that's the case, then it is your responsibility to make sure it maps to the right part of the array. If the array is supposed to be column-major, then _21 should map to the same value as _data[3]. If it's row-major, it should map to _data[1].

If C is row-major order, why does ARM intrinsic code assume column-major order?

You're not using C 2D arrays like C[i][j], so it's not a matter of how C stores anything, it's how 2D indexing is done manually in this code, using n * idx_1 + idx_2, with a choice of which you loop over in the inner vs. outer loops.

But the hard part of a matmul with both matrices non-transposed is that you need to make opposite choices for the two input matrices: a naive matmul has to stride through distant elements of one of the input matrices, so it's inherently screwed. That's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. (O(n^3) work over O(n^2) data - you want to get the most use out of it for every time you bring it into L1d cache, and/or into registers.)

Loop interchange can speed things up to take advantage of spatial locality in the inner-most loop, if you do it right.

See the cache-blocked matmul example in What Every Programmer Should Know About Memory, which traverses contiguous memory in all 3 matrices in the inner few loops, picking the index that isn't scaled in any of the 3 matrices as the inner one. That looks like this:

for (j_idx)
  for (k_idx)
    for (i_idx)
      C[n*j_idx + i_idx] += A[n*k_idx + i_idx] * B[k*j_idx + k_idx];

Notice that B[k * j_idx + k_idx] is invariant over the inner loop, and that you're doing a simple dst[0..n] += const * src[0..n] operation over contiguous memory (which is easy to SIMD-vectorize), although you're still doing 2 loads + 1 store for every FMA, so that's not going to max out your FP throughput.

Separate from the cache access pattern, that also avoids a long dependency chain into a single accumulator (element of C). But that's not a real problem for an optimized implementation: you can of course use multiple accumulators. FP math isn't strictly associative because of rounding error, but multiple accumulators are closer to pairwise summation and likely to have less bad FP rounding error than serially adding each element of the row x column dot product.
It will give different results from adding in the order the standard simple C loop does, but usually closer to the exact answer.



Your proposed loop order i,k,j is the worst possible.

You're striding through distant elements of 2 of the 3 matrices in the inner loop, including discontiguous access to C[], opposite of what you said in your last paragraph.

With j as the inner-most loop, you'd access C[0], C[n], C[2n], etc. on the first outer iteration. And same for B[], so that's really bad.

Interchanging the i and j loops would give you contiguous access to C[] in the middle loop instead of strided, and still rows of one, columns of the other, in the inner-most loop. So that would be strictly an improvement: yes you're right that this naive example is constructed even worse than it needs to be.

But the key issue is the strided access to something in the inner loop: that's a performance disaster; that's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. The only index that is never used with a scale factor is i.

Why did Doom3 switch column and row major matrices?

Like Oli Charlesworth already commented this might be a decision to improve caching behaviour. OpenGL's matrices are column major because on the client side you're more interested in the columns than the rows (the columns form the base of a coordinate system). If however the matrices are used for calculations like in physics or collision detection, a lot of operations will be row major. So this strongly depends on the kind of operations mostly performed on the matrix in question. The Doom3 engine makes heavy use of Plücker coordinates, so choosing the right memory layout has a very strong effect on overall performance and a simple switch between matrix majority may add/remove a significant number of operations involved.


