Aligning Static String Literals


With C++11, the following may help: https://ideone.com/IDEdY0

#include <cstddef> // std::size_t

// Sequence of char
template <char...Cs> struct char_sequence
{
template <char C> using push_back = char_sequence<Cs..., C>;
};

// Remove every char of a char_sequence from the first '\0' onwards
template <typename, char...> struct strip_sequence;

template <char...Cs>
struct strip_sequence<char_sequence<>, Cs...>
{
using type = char_sequence<Cs...>;
};

template <char...Cs, char...Cs2>
struct strip_sequence<char_sequence<'\0', Cs...>, Cs2...>
{
using type = char_sequence<Cs2...>;
};

template <char...Cs, char C, char...Cs2>
struct strip_sequence<char_sequence<C, Cs...>, Cs2...>
{
using type = typename strip_sequence<char_sequence<Cs...>, Cs2..., C>::type;
};

// struct to create an aligned char array
template <std::size_t Alignment, typename chars> struct aligned_string;

template <std::size_t Alignment, char...Cs>
struct aligned_string<Alignment, char_sequence<Cs...>>
{
alignas(Alignment) static constexpr char str[sizeof...(Cs)] = {Cs...};
};

template <std::size_t Alignment, char...Cs>
alignas(Alignment) constexpr
char aligned_string<Alignment, char_sequence<Cs...>>::str[sizeof...(Cs)];

// helper to get the i_th character (`\0` for out of bound)
template <std::size_t I, std::size_t N>
constexpr char at(const char (&a)[N]) { return I < N ? a[I] : '\0'; }

// helper to check if the c-string will not be truncated
template <std::size_t max_size, std::size_t N>
constexpr bool check_size(const char (&)[N])
{
static_assert(N <= max_size, "string too long");
return N <= max_size;
}

// Helper macros to build char_sequence from c-string
#define PUSH_BACK_8(S, I) \
::push_back<at<(I) + 0>(S)>::push_back<at<(I) + 1>(S)> \
::push_back<at<(I) + 2>(S)>::push_back<at<(I) + 3>(S)> \
::push_back<at<(I) + 4>(S)>::push_back<at<(I) + 5>(S)> \
::push_back<at<(I) + 6>(S)>::push_back<at<(I) + 7>(S)>

#define PUSH_BACK_32(S, I) \
PUSH_BACK_8(S, (I) + 0) PUSH_BACK_8(S, (I) + 8) \
PUSH_BACK_8(S, (I) + 16) PUSH_BACK_8(S, (I) + 24)

#define PUSH_BACK_128(S, I) \
PUSH_BACK_32(S, (I) + 0) PUSH_BACK_32(S, (I) + 32) \
PUSH_BACK_32(S, (I) + 64) PUSH_BACK_32(S, (I) + 96)

// Macro to create char_sequence from c-string (limited to 128 chars)
#define MAKE_CHAR_SEQUENCE(S) \
strip_sequence<char_sequence<> \
PUSH_BACK_128(S, 0) \
>::type::template push_back<check_size<128>(S) ? '\0' : '\0'>

// Macro to return an aligned c-string
#define MAKE_ALIGNED_STRING(ALIGNMENT, S) \
aligned_string<ALIGNMENT, MAKE_CHAR_SEQUENCE(S)>::str

And so you have:

static const CommandStruct commands[] =
{
{ MAKE_ALIGNED_STRING(16, "Some literal"), 28 },
{ MAKE_ALIGNED_STRING(16, "Some other literal"), 29 },
{ MAKE_ALIGNED_STRING(16, "Yet another literal"), 8 },
};
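
A quick way to confirm the result is to test the address at run time. The sketch below is only an illustration: is_aligned, literal_16 and check_literal_16 are hypothetical names, and literal_16 is a hand-written stand-in for what MAKE_ALIGNED_STRING(16, "Some literal") is meant to produce, not output of the macro itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for aligned_string<16, ...>: a 16-byte aligned array.
struct literal_16
{
    alignas(16) static constexpr char str[] = "Some literal";
};

// Out-of-class definition, needed before C++17. The alignment is repeated
// because the defining declaration must specify it as well.
alignas(16) constexpr char literal_16::str[];

// Usage: aborts if the array does not start on a 16-byte boundary.
inline void check_literal_16()
{
    assert(is_aligned(literal_16::str, 16));
}
```

The same check can be pointed at any pointer stored in the commands table.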

How can I align a string literal to an address which is a multiple of 4?

In C99 you can do this using a union, for example

#define ALIGNED_STR_UNION(N, S) const union { char s[sizeof S]; int align; } N = { S }
ALIGNED_STR_UNION(sssu, "XYZABC");

Adjust the type int as necessary.

With that, sssu.s refers to the characters.

The .s can be avoided like

#define ALIGNED_STR(S) ((const union { char s[sizeof S]; int align; }){ S }.s)
const char *sss = ALIGNED_STR("XYZABC");

However, this version wastes space on the pointer (including a relative relocation for position independent code) and does not allow declaring static literals in functions.

It is probably better to use the non-standard alignment attributes instead of this.
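
In C++11 and later the alignas specifier is standard, so neither the union nor a compiler-specific attribute is strictly required there. The following sketch puts the two approaches side by side; padded_xyz, sss4, misalignment and check_both are hypothetical names chosen for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Union trick: the int member forces at least int alignment on the array.
union padded_xyz
{
    char s[sizeof "XYZABC"];
    int align;
};
static const padded_xyz sssu = { "XYZABC" };

// Standard alternative: request the alignment directly.
alignas(4) static const char sss4[] = "XYZABC";

// Hypothetical helper: address of p modulo n (0 means aligned to n).
inline std::uintptr_t misalignment(const void* p, std::uintptr_t n)
{
    return reinterpret_cast<std::uintptr_t>(p) % n;
}

// Usage: both arrays hold the same text on a suitably aligned address.
inline void check_both()
{
    assert(misalignment(sssu.s, alignof(int)) == 0);
    assert(misalignment(sss4, 4) == 0);
    assert(std::strcmp(sssu.s, sss4) == 0);
}
```

With alignas the requested boundary is stated explicitly instead of being inherited from whatever alignment int happens to have.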

How to 8-byte align each string in a static array?

Assuming you mean that you want the string literals themselves to be aligned, this is not possible. But you can get a similar effect by making arrays with custom alignment, e.g.:

_Alignas(8) static char const s1[] = {"a"};
_Alignas(8) static char const s2[] = {"longer string"};
_Alignas(8) static char const s3[] = {"bcd"};
_Alignas(8) static char const s4[] = {"wow a really long string"};
_Alignas(8) static char const s5[] = {"foo"};

char const *const strings[] = { s1, s2, s3, s4, s5 };

You could save typing by using a preprocessor macro for each entry.
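
One possible shape for such a macro is sketched below in C++11, where alignas plays the role of _Alignas. ALIGNED_STRING, all_aligned_to_8 and check_table are hypothetical names, and the table is shortened to three entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical macro: declares one 8-byte aligned string array per entry.
#define ALIGNED_STRING(name, text) alignas(8) static const char name[] = text

ALIGNED_STRING(s1, "a");
ALIGNED_STRING(s2, "longer string");
ALIGNED_STRING(s3, "bcd");

static const char* const strings[] = { s1, s2, s3 };

// Returns true when every entry of the table starts on an 8-byte boundary.
inline bool all_aligned_to_8()
{
    for (std::size_t i = 0; i < sizeof strings / sizeof strings[0]; ++i)
        if (reinterpret_cast<std::uintptr_t>(strings[i]) % 8 != 0)
            return false;
    return true;
}

// Usage: aborts if any entry is misaligned.
inline void check_table()
{
    assert(all_aligned_to_8());
}
```

Each entry still needs its own named array; the macro only removes the repeated specifier and qualifiers.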



According to the C17 standard you can also use compound literals with an alignment specifier:

char const *const strings[] = 
{
(_Alignas(8) char const[]){"a"},
(_Alignas(8) char const[]){"longer string"},
};

although some compilers don't support this yet.

Memory Allocation of Static String Literals

In your example, there are no absolute guarantees of the adjacency/placement of the two string literals with respect to each other. GCC in this case happens to demonstrate such behavior, but it has no obligation to exhibit this behavior.

In this example there is no padding between the literals, and undefined behavior (reading past the end of one literal into the next) even makes their adjacency visible. This happens to work with GCC here; a different compiler, linker, or set of optimization flags may lay the literals out differently, for instance by merging duplicate string literals across translation units to save memory in the final application.

Also, although the pointers you declared are of type char *, they should be const char *: string literals are typically stored in .rodata, and writing to that memory is undefined behavior that usually ends in a segfault.


Code Listing


#include <stdio.h>
#include <string.h>

struct example_t {
char * a;
char * b;
char * c;
};

int main(void) {

struct example_t test = {
"Chocolate",
"Cookies",
"And milk"
};
size_t len = strlen(test.a) + strlen(test.b) + strlen(test.c) + ((3-1) * sizeof(char));

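/* Undefined behavior: the loop deliberately reads past the end of the
   first literal to show that GCC placed the three literals back to back. */
char* t= test.a;
int i;
for (i = 0; i< len; i++) {
printf("%c", t[i]);
}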

return 0;
}

Sample output


./a.out 
ChocolateCookiesAnd milk

Output of gcc -S


.file "test.c"
.section .rodata
.LC0:
.string "Chocolate"
.LC1:
.string "Cookies"
.LC2:
.string "And milk"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $72, %rsp
.cfi_offset 3, -24
movq $.LC0, -48(%rbp)
movq $.LC1, -40(%rbp)
movq $.LC2, -32(%rbp)
movq -48(%rbp), %rax
movq %rax, %rdi
call strlen
movq %rax, %rbx
movq -40(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rax, %rbx
movq -32(%rbp), %rax
movq %rax, %rdi
call strlen
addq %rbx, %rax
addq $2, %rax
movq %rax, -64(%rbp)
movq -48(%rbp), %rax
movq %rax, -56(%rbp)
movl $0, -68(%rbp)
jmp .L2
.L3:
movl -68(%rbp), %eax
movslq %eax, %rdx
movq -56(%rbp), %rax
addq %rdx, %rax
movzbl (%rax), %eax
movsbl %al, %eax
movl %eax, %edi
call putchar
addl $1, -68(%rbp)
.L2:
movl -68(%rbp), %eax
cltq
cmpq -64(%rbp), %rax
jb .L3
movl $0, %eax
addq $72, %rsp
popq %rbx
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",@progbits
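
If adjacency is really required, it can be stated explicitly instead of being inferred from what one compiler happens to emit. The sketch below, written in C++ with the hypothetical names packed, join_fields and check_packed, keeps the three strings in a single array whose layout is fixed by the language rather than by the toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One array: the three strings are adjacent by construction,
// separated by the embedded '\0' characters.
static const char packed[] = "Chocolate\0Cookies\0And milk";

// Hypothetical helper: concatenates the NUL-separated fields of an array.
inline std::string join_fields(const char* data, std::size_t size)
{
    std::string out;
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] != '\0')
            out += data[i];
    return out;
}

// Usage: walking the whole array stays inside one object, so no
// undefined behavior is involved.
inline void check_packed()
{
    assert(join_fields(packed, sizeof packed) == "ChocolateCookiesAnd milk");
}
```

Here sizeof packed counts the two embedded terminators and the final one, so the loop never leaves the array.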

Can a const static string be allocated on the stack?

From the C++ specification, § 2.14.5/8 on string literals:

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).

It is also worth mentioning that static storage duration applies to all string literals, including L"", u"" and U"" (§ 2.14.5/10-12).

In turn, for static storage duration, § 3.7.1/1:

All variables which do not have dynamic storage duration, do not have thread storage duration, and are not local have static storage duration. The storage for these entities shall last for the duration of the program (3.6.2, 3.6.3).

Hence, your string "abcdef" shall exist for the duration of the program. The compiler can choose where to store it (and this may be a system constraint), but it must remain valid.

For the C language specification (C11 draft n1570), string literals are covered in § 6.4.5/6:

In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence.

And static storage duration in § 6.2.4/3:

An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup.

The same rationale applies to the location (it will most likely be a system constraint), but the string must remain valid for the duration of the program.
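
The practical consequence is easiest to see in code. In the C++ sketch below, literal_name, local_copy_length and check_storage are hypothetical names: returning a pointer to the literal is valid because the array behind it has static storage duration, whereas an array initialized from the literal is an ordinary local copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// A string literal is an lvalue of type "array of n const char".
static_assert(std::is_same<decltype("abcdef"), const char (&)[7]>::value,
              "six characters plus the terminating '\\0'");

// Valid: the array behind the literal outlives the call.
inline const char* literal_name()
{
    return "abcdef";
}

// A local array is a copy with automatic storage duration; returning
// its address instead of its length would leave a dangling pointer.
inline std::size_t local_copy_length()
{
    char copy[] = "abcdef"; // lives only until the function returns
    return std::strlen(copy);
}

// Usage: the returned pointer can be read long after the call.
inline void check_storage()
{
    assert(std::strcmp(literal_name(), "abcdef") == 0);
}
```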

How to properly align and format varying string lengths

After reading some more, I realized that what I was trying to achieve didn't necessarily require string formatting or printf statements. As the previous comments and the answer above suggested, I ended up just formatting the output with multiple print statements. It's definitely not the most beautiful code to look at, but it gives me the exact output I was after.

This was the code that did the trick:

System.out.println("  " + number1 + " +");
for(int i = 0; i < (number1.length() - number2.length()); i++){
System.out.print(" ");
}
System.out.println(" " + number2);
System.out.print(" ");
for(int j = 0; j < ((number1.length() - number2.length())+ number2.length()+1); j++){
System.out.print("-");
}
System.out.println();
System.out.print(" " + total);

Check whether equal string literals are stored at the same address

Is there any macro, in any C++ implementation, but mainly g++ and clang, whose definition guarantees that several equal string literals are stored at the same address?

  • gcc has the -fmerge-constants option (this is not a guarantee):

Attempt to merge identical constants (string constants and floating-point constants) across compilation units.

This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.

Enabled at levels -O, -O2, -O3, -Os.

  • Visual Studio has String Pooling (/GF option : "Eliminate Duplicate Strings")

String pooling allows what were intended as multiple pointers to multiple buffers to be multiple pointers to a single buffer. In the following code, s and t are initialized with the same string. String pooling causes them to point to the same memory:

char *s = "This is a character buffer";
char *t = "This is a character buffer";

Note: although MSDN uses char* for the string literals, const char* should be used.

  • clang apparently also has the -fmerge-constants option, but I can't find much about it, except in the --help section, so I'm not sure if it really is the equivalent of the gcc's one :

Disallow merging of constants


Anyway, how string literals are stored is implementation dependent (many do store them in the read-only portion of the program).

Rather than building your library on implementation-dependent hacks, I can only suggest using std::string instead of C-style strings: they compare by content, so they will behave exactly as you expect.

You can construct your std::string in-place in your containers with the emplace() methods:

std::unordered_set<std::string> my_set;
my_set.emplace("Hello");
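
To illustrate why this sidesteps the question of literal addresses, here is a small sketch; distinct_count and check_distinct are hypothetical helpers. Equal text is stored once whether or not the compiler pooled the literals, and even when the text comes from a separate buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <unordered_set>

// Hypothetical helper: inserts each C string and reports how many distinct
// values were stored. Equality is decided by content, never by address.
inline std::size_t distinct_count(std::initializer_list<const char*> items)
{
    std::unordered_set<std::string> my_set;
    for (const char* item : items)
        my_set.emplace(item);
    return my_set.size();
}

// Usage: the buffer has its own address but the same content as the literal.
inline void check_distinct()
{
    char buffer[] = "Hello";
    assert(distinct_count({"Hello", buffer, "World"}) == 2);
}
```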

String literals: Where do they go?

A common technique is for string literals to be put in "read-only-data" section which gets mapped into the process space as read-only (which is why you can't change it).

It does vary by platform. For example, simpler chip architectures may not support read-only memory segments so the data segment will be writable.

Rather than try to figure out a trick to make string literals changeable (it will be highly dependent on your platform and could change over time), just use arrays:

char foo[] = "...";

The compiler will arrange for the array to get initialized from the literal and you can modify the array.
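
A minimal sketch of that approach in C++, with the hypothetical names capitalize_copy and check_copy, showing that the array is a writable copy while the literal itself is never touched:

```cpp
#include <cassert>
#include <string>

// Hypothetical example: modifies an array initialized from a literal.
inline std::string capitalize_copy()
{
    char foo[] = "hello"; // array initialized from the literal
    foo[0] = 'H';         // fine: foo is the function's own writable storage
    return foo;           // copied into a std::string before foo goes away
}

// Usage: the modified copy is returned by value.
inline void check_copy()
{
    assert(capitalize_copy() == "Hello");
}
```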

Is it possible to force GCC to pad string constants in .rodata

There does not seem to be any way to accomplish this, at least with GCC. Testing seems to indicate that although the compiler will align integers, doubles and so on, it sees no need to align string constants, because they are made of characters and the alignment requirement for character data is a single byte.

The particulars of this bus error seem to indicate that glibc uses optimized routines that copy data words at a time without checking for alignment first (having not looked at the source, I don't know if this is true or not however).

This led me to investigate musl, an alternative libc implementation that is simple to install and use on a project-by-project basis. The C source code of the musl version of strcat takes care to copy unaligned bytes before copying words at a time, and thus this particular issue goes away, although naturally others remain.
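
The idea described above, handling the unaligned head byte by byte before switching to word-sized steps, can be sketched as follows. This is an illustration only and not musl's source: copy_head_then_words and check_copy_words are hypothetical names, the function copies a fixed byte count rather than searching for a terminator, and std::memcpy is used for the word moves so that the sketch stays well defined in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies n bytes, taking word-sized steps only once
// the source address has reached a word boundary.
inline void copy_head_then_words(unsigned char* dst, const unsigned char* src,
                                 std::size_t n)
{
    typedef std::uintptr_t word;

    // Head: single bytes until src reaches a word boundary.
    while (n > 0 && reinterpret_cast<std::uintptr_t>(src) % sizeof(word) != 0) {
        *dst++ = *src++;
        --n;
    }
    // Body: whole words at a time.
    while (n >= sizeof(word)) {
        word w;
        std::memcpy(&w, src, sizeof w);
        std::memcpy(dst, &w, sizeof w);
        src += sizeof w;
        dst += sizeof w;
        n -= sizeof w;
    }
    // Tail: whatever is left.
    while (n > 0) {
        *dst++ = *src++;
        --n;
    }
}

// Usage: copy from a source that starts one byte past the array's start.
inline void check_copy_words()
{
    const char text[] = "xAnd milk, cookies and chocolate";
    char out[sizeof text] = {};
    copy_head_then_words(reinterpret_cast<unsigned char*>(out),
                         reinterpret_cast<const unsigned char*>(text + 1),
                         sizeof text - 1);
    assert(std::strcmp(out, text + 1) == 0);
}
```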


