Different CPU Cache Size Reported by /Sys/Device/ and Dmidecode

different CPU cache size reported by /sys/device/ and dmidecode

A few things:

You have a quad-core CPU
The index<n> name in /sys/devices/system/cpu/cpu<n>/cache does not correspond to L1/L2/L3 etc. There is a .../index<n>/level file that will tell you the level of the cache.
Your L1 cache is split into two caches (likely index0 and index1), one for data, and one for instructions (see .../index<n>/type), per core. 4 cores * 2 halves * 32K matches the 256K that dmidecode reports.
The L2 cache is split per-core. 4 cores * 256K (from index2) = 1024K, which matches dmidecodes L2 number.

Intel CPU Cache Policy

This is not something you can query from CPUID or such, nor can you configure your CPU to do one or the other, thus there exists no tool for querying. What you can query is the cache associativity, the cache line size, and the cache size, for example via /proc/cpuinfo.

All Intel-compatible CPUs during the last one/two decades used a write-back strategy for caches (which presumes fetching a cache line first to allow partial writes). Of course that's the theory, reality is slighly more complex than that.

Virtually all processors (your model included) have one or several forms of write combining (or fill buffers as Intel calls it since Merom), and all but the most antique Intel-compatible CPUs support uncached writes from SSE registers (which again uses a form of write-combining). And then of course, there are things like on-chip cache coherence protocols and snoop filtering and other mechanisms to ensure cache coherency both between cores of one processor and between different processors in a multi-processor system.

Nevertheless -- the general cache policy is still write-back.

Measuring Cache Latencies

I would rather try to use the hardware clock as a measure. The rdtsc instruction will tell you the current cycle count since the CPU was powered up. Also it is better to use asm to make sure always the same instructions are used in both measured and dry runs. Using that and some clever statistics I have made this a long time ago:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>

int i386_cpuid_caches (size_t * data_caches) {
    int i;
    int num_data_caches = 0;
    for (i = 0; i < 32; i++) {

        // Variables to hold the contents of the 4 i386 legacy registers
        uint32_t eax, ebx, ecx, edx; 

        eax = 4; // get cache info
        ecx = i; // cache id

        asm (
            "cpuid" // call i386 cpuid instruction
            : "+a" (eax) // contains the cpuid command code, 4 for cache query
            , "=b" (ebx)
            , "+c" (ecx) // contains the cache id
            , "=d" (edx)
        ); // generates output in 4 registers eax, ebx, ecx and edx 

        // taken from http://download.intel.com/products/processor/manual/325462.pdf Vol. 2A 3-149
        int cache_type = eax & 0x1F; 

        if (cache_type == 0) // end of valid cache identifiers
            break;

        char * cache_type_string;
        switch (cache_type) {
            case 1: cache_type_string = "Data Cache"; break;
            case 2: cache_type_string = "Instruction Cache"; break;
            case 3: cache_type_string = "Unified Cache"; break;
            default: cache_type_string = "Unknown Type Cache"; break;
        }

        int cache_level = (eax >>= 5) & 0x7;

        int cache_is_self_initializing = (eax >>= 3) & 0x1; // does not need SW initialization
        int cache_is_fully_associative = (eax >>= 1) & 0x1;

        // taken from http://download.intel.com/products/processor/manual/325462.pdf 3-166 Vol. 2A
        // ebx contains 3 integers of 10, 10 and 12 bits respectively
        unsigned int cache_sets = ecx + 1;
        unsigned int cache_coherency_line_size = (ebx & 0xFFF) + 1;
        unsigned int cache_physical_line_partitions = ((ebx >>= 12) & 0x3FF) + 1;
        unsigned int cache_ways_of_associativity = ((ebx >>= 10) & 0x3FF) + 1;

        // Total cache size is the product
        size_t cache_total_size = cache_ways_of_associativity * cache_physical_line_partitions * cache_coherency_line_size * cache_sets;

        if (cache_type == 1 || cache_type == 3) {
            data_caches[num_data_caches++] = cache_total_size;
        }

        printf(
            "Cache ID %d:\n"
            "- Level: %d\n"
            "- Type: %s\n"
            "- Sets: %d\n"
            "- System Coherency Line Size: %d bytes\n"
            "- Physical Line partitions: %d\n"
            "- Ways of associativity: %d\n"
            "- Total Size: %zu bytes (%zu kb)\n"
            "- Is fully associative: %s\n"
            "- Is Self Initializing: %s\n"
            "\n"
            , i
            , cache_level
            , cache_type_string
            , cache_sets
            , cache_coherency_line_size
            , cache_physical_line_partitions
            , cache_ways_of_associativity
            , cache_total_size, cache_total_size >> 10
            , cache_is_fully_associative ? "true" : "false"
            , cache_is_self_initializing ? "true" : "false"
        );
    }

    return num_data_caches;
}

int test_cache(size_t attempts, size_t lower_cache_size, int * latencies, size_t max_latency) {
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0) {
        perror("open");
        abort();
    }
    char * random_data = mmap(
          NULL
        , lower_cache_size
        , PROT_READ | PROT_WRITE
        , MAP_PRIVATE | MAP_ANON // | MAP_POPULATE
        , -1
        , 0
        ); // get some random data
    if (random_data == MAP_FAILED) {
        perror("mmap");
        abort();
    }

    size_t i;
    for (i = 0; i < lower_cache_size; i += sysconf(_SC_PAGESIZE)) {
        random_data[i] = 1;
    }

    int64_t random_offset = 0;
    while (attempts--) {
        // use processor clock timer for exact measurement
        random_offset += rand();
        random_offset %= lower_cache_size;
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mov %4, %%al\n\t"  // load data
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            : "m" (random_data[random_offset])
            );
        // printf("%d\n", cycles_used);
        if (cycles_used < max_latency)
            latencies[cycles_used]++;
        else 
            latencies[max_latency - 1]++;
    }

    munmap(random_data, lower_cache_size);

    return 0;
} 

int main() {
    size_t cache_sizes[32];
    int num_data_caches = i386_cpuid_caches(cache_sizes);

    int latencies[0x400];
    memset(latencies, 0, sizeof(latencies));

    int empty_cycles = 0;

    int i;
    int attempts = 1000000;
    for (i = 0; i < attempts; i++) { // measure how much overhead we have for counting cyscles
        int32_t cycles_used, edx, temp1, temp2;
        asm (
            "mfence\n\t"        // memory fence
            "rdtsc\n\t"         // get cpu cycle count
            "mov %%edx, %2\n\t"
            "mov %%eax, %3\n\t"
            "mfence\n\t"        // memory fence
            "mfence\n\t"
            "rdtsc\n\t"
            "sub %2, %%edx\n\t" // substract cycle count
            "sbb %3, %%eax"     // substract cycle count
            : "=a" (cycles_used)
            , "=d" (edx)
            , "=r" (temp1)
            , "=r" (temp2)
            :
            );
        if (cycles_used < sizeof(latencies) / sizeof(*latencies))
            latencies[cycles_used]++;
        else 
            latencies[sizeof(latencies) / sizeof(*latencies) - 1]++;

    }

    {
        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                empty_cycles = j;
                fprintf(stderr, "Empty counting takes %d cycles\n", empty_cycles);
                break;
            }
        }
    }

    for (i = 0; i < num_data_caches; i++) {
        test_cache(attempts, cache_sizes[i] * 4, latencies, sizeof(latencies) / sizeof(*latencies));

        int j;
        size_t sum = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum += latencies[j];
        }
        size_t sum2 = 0;
        for (j = 0; j < sizeof(latencies) / sizeof(*latencies); j++) {
            sum2 += latencies[j];
            if (sum2 >= sum * .75) {
                fprintf(stderr, "Cache ID %i has latency %d cycles\n", i, j - empty_cycles);
                break;
            }
        }

    }

    return 0;

}

Output on my Core2Duo:

Cache ID 0:
- Level: 1
- Type: Data Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Total Size: 32768 bytes (32 kb)

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Total Size: 262144 bytes (256 kb)

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Total Size: 3145728 bytes (3072 kb)

Empty counting takes 90 cycles
Cache ID 0 has latency 6 cycles
Cache ID 2 has latency 21 cycles
Cache ID 3 has latency 168 cycles

Get hardware information from /proc filesytem in Linux

By reading the source code of lshw, I found that lshw read raw data from /sys/class/dmi/. Because lshw is written in CPP that I am not familiar with, there is a question Where does dmidecode get the SMBIOS table? mentioned that dmidecode.c read raw data from /sys/class/dmi that's same as lshw does.

Here are the definitions in dmidecode.c

#define SYS_ENTRY_FILE "/sys/firmware/dmi/tables/smbios_entry_point"
#define SYS_TABLE_FILE "/sys/firmware/dmi/tables/DMI"

I extract code from dmidecode.c to get the CPU and memory information, and use lsscsi to get disk information.

Thanks for your help.

How can I find out the total physical memory (RAM) of my linux box suitable to be parsed by a shell script?

If you're interested in the physical RAM, use the command dmidecode. It gives you a lot more information than just that, but depending on your use case, you might also want to know if the 8G in the system come from 2x4GB sticks or 4x2GB sticks.

sudo dmidecode | grep UUID' and '/sys/devices/virtual/dmi/id/product_uuid'. Are they same?

dmidecode | grep UUID and /sys/devices/virtual/dmi/id/product_uuidshould be equal but depending on your system the output can be different.

From dmidecode source code:

/*
 * As of version 2.6 of the SMBIOS specification, the first 3
 * fields of the UUID are supposed to be encoded on little-endian.
 * The specification says that this is the defacto standard,
 * however I've seen systems following RFC 4122 instead and use
 * network byte order, so I am reluctant to apply the byte-swapping
 * for older versions.
 */

See also: [PATCH] dmi, Use little-endian for sysfs PRODUCT UUID

I notice that Ansible also try /sys first and fall back to dmidecode executable to collect dmi related facts.

how to understand if a CPU support ECC?

No, you can't conclude that ECC DRAM is supported or not based on what the internal caches use to protect data in the cache. The two things are unrelated.

You need to check the CPU and motherboard specs to make sure that both support ECC DRAM. (In your case your Core2 doesn't have an onboard memory controller, so the DRAM is connected to the CPU northbridge. It was the last generation to not integrate the DRAM controllers.)

All recent Intel CPUs use ECC in their L2 / L3 caches, but L1D is actually just parity, not ECC. (To support efficient single-byte and unaligned stores without as much ECC overhead.)

Core2Quad doesn't have an L3 cache.

And BTW, its memory bandwidth is significantly worse than a Nehalem or newer with DDR3, which might be an issue for ZFS copying data around a lot. I think memory bandwidth might be part of the bottleneck in my old Core2Duo running an XFS RAID5 which doesn't saturate individual disk bandwidths for sequential read.

Different CPU Cache Size Reported by /Sys/Device/ and Dmidecode