Are There in x86 Any Instructions to Accelerate SHA (SHA1/2/256/512) Encoding?

Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?

It's November 2016 and the answer is finally yes. But it's only SHA-1 and SHA-256 (and, by extension, SHA-224).

Intel CPUs with the SHA extensions recently hit the market. It looks like the processors that support them are the Goldmont microarchitecture:

  • Pentium J4205 (desktop)
  • Pentium N4200 (mobile)
  • Celeron J3455 (desktop)
  • Celeron J3355 (desktop)
  • Celeron N3450 (mobile)
  • Celeron N3350 (mobile)

I looked through offerings at Amazon for machines with that architecture or those processor numbers, but I did not find any available (yet). I believe HP or Acer had one laptop with the Pentium N4200, expected to be available in November or December 2016, that would meet testing needs.

For some of the technical details on why it's only SHA-1, SHA-224 and SHA-256, see crypto: arm64/sha256 - add support for SHA256 using NEON instructions on the kernel crypto mailing list. The short answer is that, above SHA-256, things are not easily parallelizable.


You can find source code for both Intel SHA intrinsics and ARMv8 SHA intrinsics at Noloader GitHub | SHA-Intrinsics. They are C source files, and provide the compress function for SHA-1, SHA-224 and SHA-256. The intrinsic-based implementations increase throughput approximately 3× to 4× for SHA-1, and approximately 6× to 12× for SHA-224 and SHA-256.

How can I access the SHA intrinsics?

SHA instructions are now available in the Goldmont microarchitecture, which was released around September 2016. According to the Intel Intrinsics Guide, these are the intrinsics of interest:

  • __m128i _mm_sha1msg1_epu32 (__m128i a, __m128i b)
  • __m128i _mm_sha1msg2_epu32 (__m128i a, __m128i b)
  • __m128i _mm_sha1nexte_epu32 (__m128i a, __m128i b)
  • __m128i _mm_sha1rnds4_epu32 (__m128i a, __m128i b, const int func)
  • __m128i _mm_sha256msg1_epu32 (__m128i a, __m128i b)
  • __m128i _mm_sha256msg2_epu32 (__m128i a, __m128i b)
  • __m128i _mm_sha256rnds2_epu32 (__m128i a, __m128i b, __m128i k)

GCC 5.0 and above make the intrinsics available at all times through Function Specific Option Pragmas; you will need Binutils 2.24 or above, however. Testing shows Clang 3.7 and 3.8 support the intrinsics. Visual Studio 2015 can also consume them, but VS2013 failed to compile them.
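
As a quick check that your toolchain exposes these intrinsics, here is a minimal sketch (my own example, not from Intel or the original answers) that touches each one inside a function-specific target attribute, so it compiles with GCC 5+/Clang even when the rest of the file is built without -msha:

#include <immintrin.h>

// Hypothetical toolchain smoke test: exercises each SHA intrinsic once with
// dummy inputs. The target attribute enables SHA/SSE4.1 code generation for
// this function only; you must still verify CPU support at runtime.
__attribute__((target("sha,sse4.1")))
__m128i sha_toolchain_smoke_test(__m128i a, __m128i b, __m128i k)
{
    a = _mm_sha1msg1_epu32(a, b);
    a = _mm_sha1msg2_epu32(a, b);
    a = _mm_sha1nexte_epu32(a, b);
    a = _mm_sha1rnds4_epu32(a, b, 0);     // 'func' must be a compile-time constant 0..3
    a = _mm_sha256msg1_epu32(a, b);
    a = _mm_sha256msg2_epu32(a, b);
    a = _mm_sha256rnds2_epu32(a, b, k);
    return a;
}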

You can detect the availability of SHA in the preprocessor on Linux by looking for the macro __SHA__. -march=native will enable it if it's native to the processor. If not, you can enable it with -msha.

$ gcc -march=native -dM -E - </dev/null | egrep -i '(aes|rdrnd|rdseed|sha)'
#define __RDRND__ 1
#define __SHA__ 1
#define __RDSEED__ 1
#define __AES__ 1
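
The __SHA__ macro only reflects what the compiler is allowed to emit. For runtime dispatch you can query CPUID yourself; the SHA extensions are reported in bit 29 of EBX for CPUID leaf 7, subleaf 0. A minimal sketch using GCC/Clang's <cpuid.h> (my own example; cpu_has_sha is a hypothetical helper name):

#include <cpuid.h>
#include <stdbool.h>

// Returns true if the CPU reports the SHA extensions
// (CPUID.(EAX=07H, ECX=0):EBX bit 29).
static bool cpu_has_sha(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, 0) < 7)
        return false;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (ebx & (1u << 29)) != 0;
}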

The code for using SHA-1 is shown below. It's based on Intel's blog titled Intel® SHA Extensions; another reference implementation is available from the miTLS project. The code works on full SHA-1 blocks, so const uint32_t *data points to 64 bytes (one 16-word block). You will have to add the padding for the final block and set the bit length yourself (a sketch of that wrapper follows the transform below).

It runs at about 1.7 cycles per byte (cpb) on a Celeron J3455. I believe Andy Polyakov has SHA-1 running at around 1.5 cpb in OpenSSL. For reference, an optimized C/C++ implementation will run somewhere around 9 to 10 cpb.

// Requires <immintrin.h> for the SHA and SSE intrinsics, and SSE4.1+SHA code
// generation (for example, gcc -msse4.1 -msha, or a target attribute).
#include <immintrin.h>
#include <stdint.h>

static void SHA1_SHAEXT_Transform(uint32_t *state, const uint32_t *data)
{
    __m128i ABCD, ABCD_SAVE, E0, E0_SAVE, E1;
    __m128i MASK, MSG0, MSG1, MSG2, MSG3;

    // Load initial values
    ABCD = _mm_loadu_si128((__m128i*) state);
    E0 = _mm_set_epi32(state[4], 0, 0, 0);
    ABCD = _mm_shuffle_epi32(ABCD, 0x1B);
    MASK = _mm_set_epi64x(0x0001020304050607ULL, 0x08090a0b0c0d0e0fULL);

    // Save current hash
    ABCD_SAVE = ABCD;
    E0_SAVE = E0;

    // Rounds 0-3
    MSG0 = _mm_loadu_si128((__m128i*) (data+0));
    MSG0 = _mm_shuffle_epi8(MSG0, MASK);
    E0 = _mm_add_epi32(E0, MSG0);
    E1 = ABCD;
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);

    // Rounds 4-7
    MSG1 = _mm_loadu_si128((__m128i*) (data+4));
    MSG1 = _mm_shuffle_epi8(MSG1, MASK);
    E1 = _mm_sha1nexte_epu32(E1, MSG1);
    E0 = ABCD;
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 0);
    MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);

    // Rounds 8-11
    MSG2 = _mm_loadu_si128((__m128i*) (data+8));
    MSG2 = _mm_shuffle_epi8(MSG2, MASK);
    E0 = _mm_sha1nexte_epu32(E0, MSG2);
    E1 = ABCD;
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);
    MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
    MSG0 = _mm_xor_si128(MSG0, MSG2);

    // Rounds 12-15
    MSG3 = _mm_loadu_si128((__m128i*) (data+12));
    MSG3 = _mm_shuffle_epi8(MSG3, MASK);
    E1 = _mm_sha1nexte_epu32(E1, MSG3);
    E0 = ABCD;
    MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 0);
    MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
    MSG1 = _mm_xor_si128(MSG1, MSG3);

    // Rounds 16-19
    E0 = _mm_sha1nexte_epu32(E0, MSG0);
    E1 = ABCD;
    MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);
    MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
    MSG2 = _mm_xor_si128(MSG2, MSG0);

    // Rounds 20-23
    E1 = _mm_sha1nexte_epu32(E1, MSG1);
    E0 = ABCD;
    MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
    MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
    MSG3 = _mm_xor_si128(MSG3, MSG1);

    // Rounds 24-27
    E0 = _mm_sha1nexte_epu32(E0, MSG2);
    E1 = ABCD;
    MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 1);
    MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
    MSG0 = _mm_xor_si128(MSG0, MSG2);

    // Rounds 28-31
    E1 = _mm_sha1nexte_epu32(E1, MSG3);
    E0 = ABCD;
    MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
    MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
    MSG1 = _mm_xor_si128(MSG1, MSG3);

    // Rounds 32-35
    E0 = _mm_sha1nexte_epu32(E0, MSG0);
    E1 = ABCD;
    MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 1);
    MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
    MSG2 = _mm_xor_si128(MSG2, MSG0);

    // Rounds 36-39
    E1 = _mm_sha1nexte_epu32(E1, MSG1);
    E0 = ABCD;
    MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
    MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
    MSG3 = _mm_xor_si128(MSG3, MSG1);

    // Rounds 40-43
    E0 = _mm_sha1nexte_epu32(E0, MSG2);
    E1 = ABCD;
    MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
    MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
    MSG0 = _mm_xor_si128(MSG0, MSG2);

    // Rounds 44-47
    E1 = _mm_sha1nexte_epu32(E1, MSG3);
    E0 = ABCD;
    MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 2);
    MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
    MSG1 = _mm_xor_si128(MSG1, MSG3);

    // Rounds 48-51
    E0 = _mm_sha1nexte_epu32(E0, MSG0);
    E1 = ABCD;
    MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
    MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
    MSG2 = _mm_xor_si128(MSG2, MSG0);

    // Rounds 52-55
    E1 = _mm_sha1nexte_epu32(E1, MSG1);
    E0 = ABCD;
    MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 2);
    MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
    MSG3 = _mm_xor_si128(MSG3, MSG1);

    // Rounds 56-59
    E0 = _mm_sha1nexte_epu32(E0, MSG2);
    E1 = ABCD;
    MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
    MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
    MSG0 = _mm_xor_si128(MSG0, MSG2);

    // Rounds 60-63
    E1 = _mm_sha1nexte_epu32(E1, MSG3);
    E0 = ABCD;
    MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);
    MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
    MSG1 = _mm_xor_si128(MSG1, MSG3);

    // Rounds 64-67
    E0 = _mm_sha1nexte_epu32(E0, MSG0);
    E1 = ABCD;
    MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 3);
    MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
    MSG2 = _mm_xor_si128(MSG2, MSG0);

    // Rounds 68-71
    E1 = _mm_sha1nexte_epu32(E1, MSG1);
    E0 = ABCD;
    MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);
    MSG3 = _mm_xor_si128(MSG3, MSG1);

    // Rounds 72-75
    E0 = _mm_sha1nexte_epu32(E0, MSG2);
    E1 = ABCD;
    MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
    ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 3);

    // Rounds 76-79
    E1 = _mm_sha1nexte_epu32(E1, MSG3);
    E0 = ABCD;
    ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);

    // Add values back to state
    E0 = _mm_sha1nexte_epu32(E0, E0_SAVE);
    ABCD = _mm_add_epi32(ABCD, ABCD_SAVE);

    // Save state
    ABCD = _mm_shuffle_epi32(ABCD, 0x1B);
    _mm_storeu_si128((__m128i*) state, ABCD);
    *(state+4) = _mm_extract_epi32(E0, 3);
}
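
The transform above only consumes whole 64-byte blocks. As a rough illustration of the padding and bit-length handling mentioned earlier, here is a hedged sketch of a wrapper (my own code, not part of Intel's blog example; sha1_shaext is a hypothetical name):

#include <stdint.h>
#include <string.h>

// Hashes an arbitrary-length message with SHA1_SHAEXT_Transform: processes
// whole blocks, then appends 0x80, zero padding and the 64-bit big-endian
// bit length, and finally serializes the state big-endian into the digest.
void sha1_shaext(const uint8_t *msg, size_t len, uint8_t digest[20])
{
    uint32_t state[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
    uint64_t bitlen = (uint64_t)len * 8;

    // Full 64-byte blocks. The cast is acceptable here because the transform
    // uses unaligned loads; a strict-aliasing-clean version would memcpy
    // each block into a local buffer first.
    while (len >= 64) {
        SHA1_SHAEXT_Transform(state, (const uint32_t*)msg);
        msg += 64; len -= 64;
    }

    // Final block(s): remaining bytes, 0x80, zeros, then the bit length
    uint8_t block[64] = { 0 };
    memcpy(block, msg, len);
    block[len] = 0x80;
    if (len >= 56) {                       // no room left for the length
        SHA1_SHAEXT_Transform(state, (const uint32_t*)block);
        memset(block, 0, 64);
    }
    for (int i = 0; i < 8; i++)
        block[56 + i] = (uint8_t)(bitlen >> (56 - 8*i));
    SHA1_SHAEXT_Transform(state, (const uint32_t*)block);

    // Serialize the state big-endian
    for (int i = 0; i < 5; i++) {
        digest[4*i+0] = (uint8_t)(state[i] >> 24);
        digest[4*i+1] = (uint8_t)(state[i] >> 16);
        digest[4*i+2] = (uint8_t)(state[i] >> 8);
        digest[4*i+3] = (uint8_t)(state[i]);
    }
}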

You can tell if your processor supports the SHA extensions under Linux by looking for the sha_ni flag:

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 92
model name : Intel(R) Celeron(R) CPU J3455 @ 1.50GHz
stepping : 9
microcode : 0x1a
cpu MHz : 799.987
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdseed smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
bugs : monitor
bogomips : 2995.20
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
...


Seeking information on hardware SHA-2 acceleration

Support for hardware-accelerated SHA-256 was added to OpenSSL 1.0.2 [22 Jan 2015]:
https://git.openssl.org/gitweb/?p=openssl.git;a=blob;f=CHANGES

Changes between 1.0.1l and 1.0.2 [22 Jan 2015]

 *) Support for new and upcoming Intel processors, including AVX2,
    BMI and SHA ISA extensions. This includes additional "stitched"
    implementations, AESNI-SHA256 and GCM, and multi-buffer support
    for TLS encrypt.

So, with supporting hardware and a recent OpenSSL, any PHP or Python library that uses OpenSSL to compute SHA-256 may get hardware-accelerated SHA-256 digest computation (if it is enabled in the OpenSSL build and if that implementation is selected by the library). The same goes for the command line: openssl dgst -sha256 -binary file_to_be_hashed.
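
For example, a C program that hashes through OpenSSL's EVP interface automatically gets whichever SHA-256 implementation the library selected at startup (SHA-NI, AVX2, SSSE3 or plain C). A minimal sketch using the standard EVP one-shot call (my own example; link with -lcrypto):

#include <openssl/evp.h>
#include <stdio.h>

int main(void)
{
    const unsigned char msg[] = "abc";
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen = 0;

    // One-shot digest; OpenSSL dispatches to its fastest SHA-256 code path.
    if (!EVP_Digest(msg, sizeof(msg) - 1, md, &mdlen, EVP_sha256(), NULL))
        return 1;

    for (unsigned int i = 0; i < mdlen; i++)
        printf("%02x", md[i]);
    printf("\n");
    return 0;
}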

PHP has raw bindings to the OpenSSL library: http://php.net/manual/en/function.openssl-digest.php. OPENSSL_ALGO_SHA256 has been available since PHP 5.5; see which openssl version support for sha256 in php.

The Linux kernel CryptoAPI can use hardware SHA-1/SHA-2 since version 4.4: https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.4-Crypto (http://lkml.iu.edu/hypermail/linux/kernel/1511.0/00383.html); but it is unlikely that PHP or other scripting libraries will use the kernel CryptoAPI.


