Inverting a 4x4 Matrix

Inverting a 4x4 matrix

Here is a self-contained C implementation:

bool gluInvertMatrix(const double m[16], double invOut[16])
{
    double inv[16], det;
    int i;

    // Each inv[k] below is an entry of the adjugate of m,
    // i.e. a signed 3x3 cofactor determinant.
    inv[0] = m[5] * m[10] * m[15] -
             m[5] * m[11] * m[14] -
             m[9] * m[6] * m[15] +
             m[9] * m[7] * m[14] +
             m[13] * m[6] * m[11] -
             m[13] * m[7] * m[10];

    inv[4] = -m[4] * m[10] * m[15] +
              m[4] * m[11] * m[14] +
              m[8] * m[6] * m[15] -
              m[8] * m[7] * m[14] -
              m[12] * m[6] * m[11] +
              m[12] * m[7] * m[10];

    inv[8] = m[4] * m[9] * m[15] -
             m[4] * m[11] * m[13] -
             m[8] * m[5] * m[15] +
             m[8] * m[7] * m[13] +
             m[12] * m[5] * m[11] -
             m[12] * m[7] * m[9];

    inv[12] = -m[4] * m[9] * m[14] +
               m[4] * m[10] * m[13] +
               m[8] * m[5] * m[14] -
               m[8] * m[6] * m[13] -
               m[12] * m[5] * m[10] +
               m[12] * m[6] * m[9];

    inv[1] = -m[1] * m[10] * m[15] +
              m[1] * m[11] * m[14] +
              m[9] * m[2] * m[15] -
              m[9] * m[3] * m[14] -
              m[13] * m[2] * m[11] +
              m[13] * m[3] * m[10];

    inv[5] = m[0] * m[10] * m[15] -
             m[0] * m[11] * m[14] -
             m[8] * m[2] * m[15] +
             m[8] * m[3] * m[14] +
             m[12] * m[2] * m[11] -
             m[12] * m[3] * m[10];

    inv[9] = -m[0] * m[9] * m[15] +
              m[0] * m[11] * m[13] +
              m[8] * m[1] * m[15] -
              m[8] * m[3] * m[13] -
              m[12] * m[1] * m[11] +
              m[12] * m[3] * m[9];

    inv[13] = m[0] * m[9] * m[14] -
              m[0] * m[10] * m[13] -
              m[8] * m[1] * m[14] +
              m[8] * m[2] * m[13] +
              m[12] * m[1] * m[10] -
              m[12] * m[2] * m[9];

    inv[2] = m[1] * m[6] * m[15] -
             m[1] * m[7] * m[14] -
             m[5] * m[2] * m[15] +
             m[5] * m[3] * m[14] +
             m[13] * m[2] * m[7] -
             m[13] * m[3] * m[6];

    inv[6] = -m[0] * m[6] * m[15] +
              m[0] * m[7] * m[14] +
              m[4] * m[2] * m[15] -
              m[4] * m[3] * m[14] -
              m[12] * m[2] * m[7] +
              m[12] * m[3] * m[6];

    inv[10] = m[0] * m[5] * m[15] -
              m[0] * m[7] * m[13] -
              m[4] * m[1] * m[15] +
              m[4] * m[3] * m[13] +
              m[12] * m[1] * m[7] -
              m[12] * m[3] * m[5];

    inv[14] = -m[0] * m[5] * m[14] +
               m[0] * m[6] * m[13] +
               m[4] * m[1] * m[14] -
               m[4] * m[2] * m[13] -
               m[12] * m[1] * m[6] +
               m[12] * m[2] * m[5];

    inv[3] = -m[1] * m[6] * m[11] +
              m[1] * m[7] * m[10] +
              m[5] * m[2] * m[11] -
              m[5] * m[3] * m[10] -
              m[9] * m[2] * m[7] +
              m[9] * m[3] * m[6];

    inv[7] = m[0] * m[6] * m[11] -
             m[0] * m[7] * m[10] -
             m[4] * m[2] * m[11] +
             m[4] * m[3] * m[10] +
             m[8] * m[2] * m[7] -
             m[8] * m[3] * m[6];

    inv[11] = -m[0] * m[5] * m[11] +
               m[0] * m[7] * m[9] +
               m[4] * m[1] * m[11] -
               m[4] * m[3] * m[9] -
               m[8] * m[1] * m[7] +
               m[8] * m[3] * m[5];

    inv[15] = m[0] * m[5] * m[10] -
              m[0] * m[6] * m[9] -
              m[4] * m[1] * m[10] +
              m[4] * m[2] * m[9] +
              m[8] * m[1] * m[6] -
              m[8] * m[2] * m[5];

    // Determinant via Laplace expansion using the cofactors just computed
    det = m[0] * inv[0] + m[1] * inv[4] + m[2] * inv[8] + m[3] * inv[12];

    if (det == 0)
        return false;   // singular matrix: no inverse exists

    det = 1.0 / det;

    // Scale the adjugate by 1/det to obtain the inverse
    for (i = 0; i < 16; i++)
        invOut[i] = inv[i] * det;

    return true;
}

This was lifted from the MESA implementation of the GLU library.
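
For completeness, here is a minimal usage sketch (my addition, not part of the original answer). It assumes gluInvertMatrix above is in scope, and uses the same test matrix that appears in the CUDA answer further down, whose inverse is known:

#include <cstdio>

// minimal usage sketch: invert one matrix and print the result (or report failure)
int main()
{
    const double m[16] = { 1, 1, 1, 0,
                           0, 3, 1, 2,
                           2, 3, 1, 0,
                           1, 0, 2, 1 };
    double minv[16];

    if (!gluInvertMatrix(m, minv)) {
        std::printf("matrix is singular\n");
        return 1;
    }
    for (int i = 0; i < 16; i++)
        std::printf("%g%c", minv[i], (i % 4 == 3) ? '\n' : ' ');
    return 0;
}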

Inverting a 4x4 Matrix algorithm

Your attempt to write down the inverse of a 4x4 matrix is utterly wrong. There's absolutely no point trying to fix it since it can never work.

You ask what the result of 1/0 is. Well, that is division by zero and the result is not defined. There is no real number x that satisfies 1/0 == x. If there were, then 1 == x*0 == 0, a contradiction. On a computer, attempting to perform division by zero sometimes leads to an error, and sometimes results in the special floating-point value Inf being returned. The latter appears to be what happens in your environment.
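
To make that concrete, here is a tiny C++ illustration (my addition). On hardware with IEEE 754 floating point, dividing a nonzero value by zero quietly yields Inf rather than raising an error, which matches the behaviour described above and is why gluInvertMatrix above bails out with "return false" instead of dividing by a zero determinant:

#include <cstdio>

int main()
{
    double det = 0.0;
    double a = 1.0 / det;   // IEEE 754: nonzero / 0.0 -> +Inf (no error by default)
    double b = 0.0 / det;   // 0.0 / 0.0 -> NaN
    std::printf("%f %f\n", a, b);   // typically prints: inf nan
    return 0;
}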

I don't know why you rejected the determinant-based code. Perhaps you found it tricky to implement. But that's just how it is: you aren't going to shortcut that complexity.

Invert 4x4 matrix - numerically most stable solution needed

I think the answer to this depends on the exact form of the matrix. A standard decomposition method (LU, QR, Cholesky, etc.) with pivoting (which is essential) is fairly good, even in fixed point, especially for a small 4x4 matrix. See the book 'Numerical Recipes' by Press et al. for a description of these methods.
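
To illustrate what pivoting buys you in the simplest setting, here is a sketch (mine, not from the book) of Gauss-Jordan elimination with partial pivoting for a 4x4 matrix in floating point; LU or QR as recommended above is organized along the same lines:

#include <cmath>
#include <algorithm>

// Minimal sketch: invert a 4x4 matrix (row-major) by Gauss-Jordan elimination
// with partial pivoting. Returns false if a pivot is exactly zero.
bool invert4x4_pivoted(const double in[16], double out[16])
{
    // augmented matrix [A | I]
    double a[4][8];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            a[r][c]     = in[r * 4 + c];
            a[r][c + 4] = (r == c) ? 1.0 : 0.0;
        }

    for (int col = 0; col < 4; ++col) {
        // partial pivoting: pick the row with the largest magnitude in this column
        int piv = col;
        for (int r = col + 1; r < 4; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[piv][col])) piv = r;
        if (a[piv][col] == 0.0) return false;   // singular (or numerically so)
        if (piv != col)
            for (int c = 0; c < 8; ++c) std::swap(a[col][c], a[piv][c]);

        // normalize the pivot row
        double inv_p = 1.0 / a[col][col];
        for (int c = 0; c < 8; ++c) a[col][c] *= inv_p;

        // eliminate this column from every other row
        for (int r = 0; r < 4; ++r) {
            if (r == col) continue;
            double f = a[r][col];
            for (int c = 0; c < 8; ++c) a[r][c] -= f * a[col][c];
        }
    }

    // the right half of the augmented matrix now holds the inverse
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r * 4 + c] = a[r][c + 4];
    return true;
}

The pivot search is the part that keeps the elimination well behaved when a leading entry is small; without it, even well-conditioned matrices can produce large errors.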

This paper gives some useful algorithms, but is behind a paywall unfortunately. They recommend a (pivoted) Cholesky decomposition with some additional features too complicated to list here.

Efficient 4x4 matrix inverse (affine transform)

You should be able to exploit the fact that the matrix is affine to speed things up over a full inverse. Namely, if your matrix looks like this

A = [ M  b ]
    [ 0  1 ]

where A is 4x4, M is 3x3, b is 3x1, and the bottom row is (0,0,0,1), then

inv(A) = [ inv(M)  -inv(M) * b ]
         [    0          1     ]

Depending on your situation, it may be faster to compute the result of inv(A) * x instead of actually forming inv(A). In that case, things simplify to

inv(A) * [ x ] = [ inv(M) * (x - b) ]
         [ 1 ]   [         1        ]

where x is a 3x1 vector (usually a 3D point).

Lastly, if M represents a rotation (i.e. its columns are orthonormal), then you can use the fact that inv(M) = transpose(M). Then computing the inverse of A is just a matter of subtracting the translation component, and multiplying by the transpose of the 3x3 part.

Note that whether or not the matrix is orthonormal is something that you should know from the analysis of the problem. Checking it at runtime would be fairly expensive, although you might want to do it in debug builds to verify that your assumptions hold.
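
As a concrete sketch of that rotation-plus-translation case (my code, not part of the original answer), assuming a row-major double[16] whose upper-left 3x3 block R is orthonormal and whose last column holds the translation t:

// Inverse of a rigid transform
//   A = [ R  t ]   with R a 3x3 rotation and t a 3-vector,
//       [ 0  1 ]
// stored row-major in a double[16].  Then inv(A) = [ R^T  -R^T t ; 0 1 ].
void invertRigid4x4(const double a[16], double out[16])
{
    // rotation block of the result: out.R = R^T
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            out[r * 4 + c] = a[c * 4 + r];

    // translation of the result: out.t = -R^T * t, with t = (a[3], a[7], a[11])
    for (int r = 0; r < 3; ++r)
        out[r * 4 + 3] = -(out[r * 4 + 0] * a[3] +
                           out[r * 4 + 1] * a[7] +
                           out[r * 4 + 2] * a[11]);

    // bottom row stays (0, 0, 0, 1)
    // (per the note above, a debug build might assert that the columns of R are orthonormal)
    out[12] = out[13] = out[14] = 0.0;
    out[15] = 1.0;
}

The general affine case replaces the transpose with a full 3x3 inverse, but the structure of the computation is the same.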

Hope all of that is clear...

Performing high number of 4x4 matrix inversion - PyCuda

As mentioned in the comments, numpy functions in general cannot be used from pycuda kernel code (or CUDA kernel code, or numba cuda kernels).

CUBLAS offers a batched matrix inversion function, but it is not currently exposed in either the pyculib cublas interface or the scikit-cuda cublas interface.

We could proceed to implement our own interface (e.g. using python ctypes), but since it's known that the matrices to be inverted are 4x4, I thought the suggestion in the comments from talonmies was an interesting one. Referring to the answer here, there is fairly concise C code to do a direct inversion of a 4x4 matrix.

What follows first is a realization of this in CUDA. The function inv4x4 is an adaptation of the previous code, allotting 16 threads per matrix (one per matrix element) and using that code as a model. Each thread is responsible for computing one result matrix element. First we will compare it to CUBLAS matinvBatched for performance:

$ cat t411.cu
#include <iostream>
#include <cublas_v2.h>
#include <cstdlib>
// 4x4 matrix inversion
// https://stackoverflow.com/questions/1148309/inverting-a-4x4-matrix

// assumes warp size is 32
// assumes block size is multiple of warp size
// therefore assumes number of matrices to be inverted (n) is even
// 16 threads per matrix to invert

const unsigned block_size = 256;
typedef float mt;

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__device__ unsigned pat[3][16];
const unsigned hpat[3][16] = {
{ 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50},
{ 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14},
{ 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618}};

__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}

const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable (i.e. out == in)
// T = float or double only
template <typename T>
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n){

__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[0][lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[1][lane];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[2][lane];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}

size_t nr = 2048;
int main(int argc, char *argv[]){
if (argc > 1) nr = atoi(argv[1]);

const mt m1[] = {1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0};
const mt i1[] = {-3.0, -0.5, 1.5, 1.0, 1.0, 0.25, -0.25, -0.5, 3.0, 0.25, -1.25, -0.5, -3.0, 0.0, 1.0, 1.0};
const mt m2[] = {1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0};
const mt i2[] = {1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0};

mt *h_d, *d_d;
h_d = (mt *)malloc(nr*2*16*sizeof(mt));
cudaMalloc(&d_d, nr*2*16*sizeof(mt));
cudaMemcpyToSymbol(pat, hpat, 3*16*sizeof(unsigned));
for (int i = 0; i < nr; i++){
memcpy(h_d+i*16*2, m1, sizeof(m1));
memcpy(h_d+i*16*2+16, m2, sizeof(m2));}
cudaMemcpy(d_d, h_d, nr*2*16*sizeof(mt), cudaMemcpyHostToDevice);
long long t = dtime_usec(0);
inv4x4<<<nr*2*16/block_size, block_size>>>(d_d, d_d, nr*2);
cudaDeviceSynchronize();
t = dtime_usec(t);
cudaMemcpy(h_d, d_d, nr*2*16*sizeof(mt), cudaMemcpyDeviceToHost);
for (int i = 0; i < 2; i++){
for (int j = 0; j < 16; j++) std::cout << h_d[i*16 + j] << ",";
std::cout << std::endl;
for (int j = 0; j < 16; j++) std::cout << ((i==0)?i1[j]:i2[j]) << ",";
std::cout << std::endl;}
std::cout << "kernel time: " << t << " microseconds" << std::endl;
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) std::cout << cudaGetErrorString(err) << std::endl;
//cublas
for (int i = 0; i < nr; i++){
memcpy(h_d+i*16*2, m1, sizeof(m1));
memcpy(h_d+i*16*2+16, m2, sizeof(m2));}
cudaMemcpy(d_d, h_d, nr*2*16*sizeof(mt), cudaMemcpyHostToDevice);
cublasHandle_t h;
cublasStatus_t cs = cublasCreate(&h);
if (cs != CUBLAS_STATUS_SUCCESS) std::cout << "cublas create error" << std::endl;
mt **A, **Ai, *Aid, **Ap, **Aip;
A = (mt **)malloc(nr*2*sizeof(mt *));
Ai = (mt **)malloc(nr*2*sizeof(mt *));
cudaMalloc(&Aid, nr*2*16*sizeof(mt));
cudaMalloc(&Ap, nr*2*sizeof(mt *));
cudaMalloc(&Aip, nr*2*sizeof(mt *));
for (int i = 0; i < nr*2; i++) A[i] = d_d + 16*i;
for (int i = 0; i < nr*2; i++) Ai[i] = Aid + 16*i;
cudaMemcpy(Ap, A, nr*2*sizeof(mt *), cudaMemcpyHostToDevice);
cudaMemcpy(Aip, Ai, nr*2*sizeof(mt *), cudaMemcpyHostToDevice);
int *info;
cudaMalloc(&info, nr*2*sizeof(int));
t = dtime_usec(0);
cs = cublasSmatinvBatched(h, 4, Ap, 4, Aip, 4, info, nr*2);
if (cs != CUBLAS_STATUS_SUCCESS) std::cout << "cublas matinv error" << std::endl;
cudaDeviceSynchronize();
t = dtime_usec(t);
cudaMemcpy(h_d, Aid, nr*2*16*sizeof(mt), cudaMemcpyDeviceToHost);
for (int i = 0; i < 2; i++){
for (int j = 0; j < 16; j++) std::cout << h_d[i*16 + j] << ",";
std::cout << std::endl;
for (int j = 0; j < 16; j++) std::cout << ((i==0)?i1[j]:i2[j]) << ",";
std::cout << std::endl;}
std::cout << "cublas time: " << t << " microseconds" << std::endl;
err = cudaGetLastError();
if (err != cudaSuccess) std::cout << cudaGetErrorString(err) << std::endl;
return 0;
}
$ nvcc -o t411 t411.cu -lcublas
$ ./t411
-3,-0.5,1.5,1,1,0.25,-0.25,-0.5,3,0.25,-1.25,-0.5,-3,-0,1,1,
-3,-0.5,1.5,1,1,0.25,-0.25,-0.5,3,0.25,-1.25,-0.5,-3,0,1,1,
1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,
1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,
kernel time: 49 microseconds
-3,-0.5,1.5,1,1,0.25,-0.25,-0.5,3,0.25,-1.25,-0.5,-3,0,1,1,
-3,-0.5,1.5,1,1,0.25,-0.25,-0.5,3,0.25,-1.25,-0.5,-3,0,1,1,
1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,
1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,
cublas time: 95 microseconds
$

We see that the code appears to provide the correct result for the 2 test matrices, and the overall time to invert 4096 matrices on a Tesla P100 is about 50us, roughly 2x faster than CUBLAS. Note that I have not exhaustively tested this code.

What follows next is a simple pycuda implementation of a similar function. Here, for simplicity we are just inverting 2 matrices:

$ cat t10.py
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}

const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable (i.e. out == in)
// T = float or double only
typedef float T;
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}

""")
# python function for inverting 4x4 matrices
# n should be an even number
def gpuinv4x4(inp, n):
    # internal constants not to be modified
    hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
    # Convert parameters into numpy array
    inpd = np.array(inp, dtype=np.float32)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype=np.float32)
    # Get kernel function
    matinv4x4 = kernel.get_function("inv4x4")
    # Define block, grid and compute
    blockDim = (256,1,1) # do not change
    gridDim = ((n//16)+1,1,1) # integer division so the grid dimension stays an int
    # Kernel function
    matinv4x4(
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output
#example/test case
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0)
n = 2
result = gpuinv4x4(inp, n)
print(result)
$ python t10.py
[-3. -0.5 1.5 1. 1. 0.25 -0.25 -0.5 3. 0.25 -1.25 -0.5 -3.
-0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
1. 0. 0. 0. 0. 1. ]
$

I've spent very little time creating this pycuda test case, so please consider it as a rough demonstration vehicle.

I suspect that if the only thing you need to do in CUDA is invert these matrices, this won't be an interesting or attractive use case. I expect that the cost to transfer the data to the device and return the results back would outweigh any speed-up benefit from using the GPU, vs. ordinary numpy. However I haven't tested or benchmarked a numpy case.

Note that the use of __syncwarp() means this kernel code requires CUDA 9.0 or later.

Also note that the code expects an even number of matrices to invert. If you don't have an even number, pad your array with any value to the next even number of matrices.

Also note that the code simply assumes the matrices are invertible. There is no check for singular input; if, for example, the computed determinant were zero, the matrix would not be invertible (by this method) and the results would typically be NaN due to division by zero.

It's not clear what the purpose is here, so this example should not be construed to suggest that general matrix inversion is a good idea or a proper solution method for a particular problem.

Probably a better pythonic method for inversion of dense matrices on the GPU would be to use cupy.

How to perform PyCUDA 4x4 matrix inversion with the same accuracy as numpy linalg inv or pinv functions

This question has previous related questions here and (to a lesser extent) here. It's unclear to me why the title of the question refers to 3x3, and the bolded text in the question refers to 3x3, but the presented problem is a 4x4 matrix inverse (and, as stated to the OP previously, this code can only be used to invert 4x4 matrices). I will proceed under the assumption that the example case is the desired case.

According to my testing, the only thing that is necessary is to take the previous answer code and convert it to use double (or, in pycuda, float64) rather than float (or, in pycuda, float32). I think this should be evident because the values involved (and those of the resulting inverse) exceed what a float32 type can represent. Here is a worked example:

$ cat t10.py
import numpy as np
# import matplotlib.pyplot as plt
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off = off >> 4;
return ret;
}

const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable (i.e. out == in)
// T = float or double only
typedef double T; // *** change this typedef to convert between float and double
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < n*16){
si[threadIdx.x] = in[idx];
unsigned lane = threadIdx.x & 15;
unsigned sibase = threadIdx.x & 0x03F0;
__syncwarp();
unsigned off = pat[lane];
T a,b;
a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
if (!getoff(off)) a = -a;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+16];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
off = pat[lane+32];
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
if (getoff(off)) a += b;
else a -=b;
T det = si[sibase + (lane>>2)]*a;
det += __shfl_down_sync(tmsk, det, 4, 16); // first add
det += __shfl_down_sync(tmsk, det, 8, 16); // second add
det = __shfl_sync(tmsk, det, 0, 16); // broadcast
out[idx] = a / det;
}
}

""")
# host code
def gpuinv4x4(inp, n):
    # internal constants not to be modified
    hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
    # Convert parameters into numpy array
    # *** change next line between float32 and float64 to match float or double
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    # *** change next line between float32 and float64 to match float or double
    output = np.empty((n*16), dtype=np.float64)
    # Get kernel function
    matinv4x4 = kernel.get_function("inv4x4")
    # Define block, grid and compute
    blockDim = (256,1,1) # do not change
    gridDim = ((n//16)+1,1,1) # integer division so the grid dimension stays an int
    # Kernel function
    matinv4x4(
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output

inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 2.120771107884677649e+09, 0.0, 0.0, 0.0, 0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17, 0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07, 0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27)
n = 2
result = gpuinv4x4(inp, n)
print(result)
$ python t10.py
[ -3.00000000e+00 -5.00000000e-01 1.50000000e+00 1.00000000e+00
1.00000000e+00 2.50000000e-01 -2.50000000e-01 -5.00000000e-01
3.00000000e+00 2.50000000e-01 -1.25000000e+00 -5.00000000e-01
-3.00000000e+00 -0.00000000e+00 1.00000000e+00 1.00000000e+00
4.71526605e-10 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 2.81114719e-28 -2.81114719e-48 -5.62229437e-38
0.00000000e+00 -2.81114719e-48 2.81114719e-28 -5.62229437e-48
0.00000000e+00 -5.62229437e-38 -5.62229437e-48 5.62229437e-28]
$

Apart from updating the input matrix to use the one supplied in this question, note that only 3 lines of code were changed from the previous answer to convert from 32-bit to 64-bit floating point. Each of those 3 lines is marked by a comment containing a triple asterisk (***).

Also, if you are concerned about accuracy beyond the first 9 or so digits displayed here, I haven't investigated that. It may be that this code is not numerically suitable for every case or need.

As an aside, the code in the question also shows a float128 type in the python section. There is no native CUDA type that corresponds to that.


