C++ 128/256-Bit Fixed Size Integer Types

Is there a 256-bit integer type?

Clang has _ExtInt extended integers that support operations other than division, but SIMD isn't useful for that because of carry between elements (footnote 1). Other mainstream x86-64 compilers don't even have that; you need a library or something to define a custom type and use the same add-with-carry instructions clang will use. (Or a less efficient emulation in pure C, footnote 2.)

__m256i is the AVX2 SIMD type holding 4x uint64_t (or a narrower element size like 8x uint32_t). It's not a 256-bit scalar integer type; you can't use it for scalar operations, and __m256i var = 1 won't even compile. There is no x86 SIMD support for integers wider than 64 bits, and the Intel intrinsic types like __m128i and __m256i are purely for SIMD.
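
For example (a minimal sketch using the standard intrinsics from immintrin.h), even broadcasting a constant has to go through an intrinsic, and there is no operator that treats the register as a single 256-bit number:

#include <immintrin.h>

__m256i ok() { return _mm256_set1_epi64x(1); } // four 64-bit lanes, each set to 1
// __m256i bad = 1;                            // error: no conversion from int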

GCC's __int128 / unsigned __int128 typically uses scalar add/adc, and/or scalar mul / imul, because AVX2 is generally not helpful for extended precision. (Only for stuff like bitwise AND/OR/XOR where element boundaries are irrelevant.)
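
For instance (a sketch; GCC and Clang provide __int128 on 64-bit targets only), a 128-bit add compiles to a scalar add/adc pair:

unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b) {
    return a + b; // typically one add plus one adc on x86-64, no SIMD involved
}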


Footnote 1: There actually is some scope for using SIMD for BigInteger types, but only with a specialized format. And more importantly, you have to manually choose when to re-normalize (propagate carry), so your calculations have to be designed around it; it's not a drop-in replacement. See Mysticial's answer on Can long integer routines benefit from SSE?

Footnote 2: Unfortunately C does not expose the carry-out from addition / subtraction, so it's not even convenient to write in C. sum = a+b; carry = sum<a; works for the carry out when there's no carry in, but it's much harder to write a full adder in C. And compilers typically make crap asm that doesn't just use the native add-with-carry instructions on machines where they're available. Extended-precision libraries for very big integers, like GMP, are typically written in asm.
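
A minimal sketch of what a portable full adder looks like, assuming carry_in is 0 or 1 (builtins such as clang's __builtin_addcll or x86's _addcarry_u64 from <immintrin.h> usually produce better code where they're available):

#include <stdint.h>

uint64_t full_add(uint64_t a, uint64_t b, unsigned carry_in, unsigned *carry_out) {
    uint64_t t   = a + carry_in;  // wraps only when a == UINT64_MAX and carry_in == 1
    uint64_t sum = t + b;         // wraps when the true sum needs a 65th bit
    // At most one of the two partial additions can wrap, so OR the carries.
    *carry_out = (t < a) | (sum < t);
    return sum;
}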

256-bit arithmetic in Clang (extended integers)

It looks like division with these types is not currently supported beyond 128 bits.

As of 2 August 2020, using clang trunk on godbolt, compiling the following code for x86-64

typedef unsigned _ExtInt(256) uint256;

uint256 div(uint256 a, uint256 b) {
    return a / b;
}

fails with the error message

fatal error: error in backend: Unsupported library call operation!


The same thing happens with _ExtInt(129) and everything larger that I tried. _ExtInt(128) and smaller seem to work, though they call the internal library function __udivti3 instead of inlining.

It has been reported as LLVM bug 45649. There is some discussion on that page, but the upshot seems to be that they do not really want to write a full arbitrary-precision divide routine.

Addition, subtraction and multiplication do work with _ExtInt(256) on this version.
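
For example, the following compiles fine on the same version (a sketch; newer clang versions spell this type _BitInt, following C23):

typedef unsigned _ExtInt(256) uint256;

uint256 add(uint256 a, uint256 b) { return a + b; } // add plus a chain of adc
uint256 mul(uint256 a, uint256 b) { return a * b; } // works, unlike division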

Representing 128-bit numbers in C++

Look into other libraries that have been developed. Lots of people have wanted to do this before you. :D

Try bigint C++

C++: How do I store a 256 bit number, and how do I convert it to hex?

ints only go to 32 bits, longs to 64 bits... so what do you do when you are working with a much larger number?

You use large number libraries.

Also, how easy would it be to switch between the binary representation and the hex representation?

I don't understand the question. A number's a number's a number. Are you asking how to print a number in a certain base? You can format output when using streams like so:

#include <iostream>
using namespace std;

int main() {
    int x = 100;

    cout << showbase;         // needed to get the 0 / 0x prefixes shown below
    cout << x << endl;        // print decimal value
    cout << oct << x << endl; // print octal value
    cout << hex << x << endl; // print hexadecimal value
}

100
0144
0x64
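
Reading works the same way: cin >> hex >> x; parses the input "64" as one hundred.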

Largest value in Visual Studio C or C++

There aren't data types for 128-bit integers that work like the ones for 64-bit sizes and below. If you want them, you'll have to implement them yourself. Using GMP or boost::multiprecision is always an option.
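
If you go the Boost route, the fixed-width aliases are header-only, so they work in Visual Studio without a separate build step (a sketch; GMP itself usually needs a Windows-friendly port such as MPIR):

#include <boost/multiprecision/cpp_int.hpp>

boost::multiprecision::int128_t  s = -1; // signed 128-bit
boost::multiprecision::uint256_t u = 1;  // unsigned 256-bit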

Multiword addition in C

256-bit version

__uint128_t a[2], b[2], c[2];        // c = a + b
c[0] = a[0] + b[0]; // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]); // add high part and carry

Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as @harold stated:

struct uint192_t {
    __uint128_t H;
    uint64_t    L;
} a, b, c; // c = a + b

c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);

Alternatively, you can use the integer overflow builtins or checked arithmetic builtins:

bool carry = __builtin_uaddl_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;

Demo on Godbolt
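
A hedged variation: the type-generic __builtin_add_overflow (GCC 5+ and Clang) avoids having to match uint64_t to the right suffixed builtin, since unsigned long is only 32 bits on LLP64 targets such as Windows:

bool carry = __builtin_add_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;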


If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array of struct of arrays, as suggested in practical BigNum AVX/SSE possible?

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh


With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

With AVX2 or AVX-512 you may also have a very fast horizontal add, so it may also be worth a try for 256-bit even if you don't have parallel addition chains. But for 192-bit addition, 3 add/adc instructions would be much faster.


There are also many libraries with fixed-width integer types. For example Boost.Multiprecision:

#include <boost/multiprecision/cpp_int.hpp>

using namespace boost::multiprecision;

uint256_t myUnsignedInt256 = 1;
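
Like a built-in type, uint256_t works with the usual stream manipulators, e.g. std::cout << std::hex << myUnsignedInt256;.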

Some other libraries:

  • ttmath: ttmath::UInt<3> (an integer type with 3 limbs, which is 192 bits on 64-bit computers)
  • uint256_t


256 bit fixed point arithmetic, the future?

SIMD will keep narrow types valuable forever. If you can do a 256-bit add, you can do eight 32-bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8-bit adds.
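
For example (a sketch with AVX2 intrinsics), one instruction performs eight independent 32-bit additions across the same 256-bit register:

#include <immintrin.h>

// No carry crosses the 32-bit lane boundaries.
__m256i add8x32(__m256i a, __m256i b) { return _mm256_add_epi32(a, b); }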

Hardware multiplier circuits get much more expensive as they get wider, so it's not safe to assume that a 256-bit x 256-bit multiplier will be practical to build.

Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.

So 4-byte float will continue to be excellent: precise enough to be useful, but small enough to pack many elements into a big vector, or into cache.

Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023 for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).

Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)


Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.

AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).

AVX512IFMA introduces a 52b * 52b + 64b => 64-bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52-bit input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64-bit integer multipliers. (A full vector width of 64-bit full-multipliers would be even more expensive than vpmullq. A compromise design like this, even for 64-bit integers, should be a big hint that wide multipliers are expensive.) Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
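
A minimal sketch of the corresponding intrinsic (assuming AVX512IFMA hardware; names per Intel's intrinsics guide):

#include <immintrin.h>

// In each 64-bit lane: acc + (low 52 bits of the 104-bit product of the
// low 52 bits of a and b).
__m512i mac52lo(__m512i acc, __m512i a, __m512i b) {
    return _mm512_madd52lo_epu64(acc, a, b);
}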


Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.


For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.

Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where they are used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.

The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.

Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)


