String-Interning at Compiletime for Profiling

String-interning at compiletime for profiling

Identical literal strings are not guaranty to be identical, but you can build type from it which can compare identical (without comparing string), something like:

// Sequence of char
template <char...Cs> struct char_sequence
{
template <char C> using push_back = char_sequence<Cs..., C>;
};

// Remove all chars from char_sequence from '\0'
template <typename, char...> struct strip_sequence;

template <char...Cs>
struct strip_sequence<char_sequence<>, Cs...>
{
using type = char_sequence<Cs...>;
};

template <char...Cs, char...Cs2>
struct strip_sequence<char_sequence<'\0', Cs...>, Cs2...>
{
using type = char_sequence<Cs2...>;
};

template <char...Cs, char C, char...Cs2>
struct strip_sequence<char_sequence<C, Cs...>, Cs2...>
{
using type = typename strip_sequence<char_sequence<Cs...>, Cs2..., C>::type;
};

// struct to create a aligned char array
template <typename chars> struct static_string;

template <char...Cs>
struct static_string<char_sequence<Cs...>>
{
static constexpr char str[sizeof...(Cs)] = {Cs...};
};

template <char...Cs>
constexpr
char static_string<char_sequence<Cs...>>::str[sizeof...(Cs)];

// helper to get the i_th character (`\0` for out of bound)
template <std::size_t I, std::size_t N>
constexpr char at(const char (&a)[N]) { return I < N ? a[I] : '\0'; }

// helper to check if the c-string will not be truncated
template <std::size_t max_size, std::size_t N>
constexpr bool check_size(const char (&)[N])
{
static_assert(N <= max_size, "string too long");
return N <= max_size;
}

// Helper macros to build char_sequence from c-string
#define PUSH_BACK_8(S, I) \
::push_back<at<(I) + 0>(S)>::push_back<at<(I) + 1>(S)> \
::push_back<at<(I) + 2>(S)>::push_back<at<(I) + 3>(S)> \
::push_back<at<(I) + 4>(S)>::push_back<at<(I) + 5>(S)> \
::push_back<at<(I) + 6>(S)>::push_back<at<(I) + 7>(S)>

#define PUSH_BACK_32(S, I) \
PUSH_BACK_8(S, (I) + 0) PUSH_BACK_8(S, (I) + 8) \
PUSH_BACK_8(S, (I) + 16) PUSH_BACK_8(S, (I) + 24)

#define PUSH_BACK_128(S, I) \
PUSH_BACK_32(S, (I) + 0) PUSH_BACK_32(S, (I) + 32) \
PUSH_BACK_32(S, (I) + 64) PUSH_BACK_32(S, (I) + 96)

// Macro to create char_sequence from c-string (limited to 128 chars)
#define MAKE_CHAR_SEQUENCE(S) \
strip_sequence<char_sequence<> \
PUSH_BACK_128(S, 0) \
>::type::template push_back<check_size<128>(S) ? '\0' : '\0'>

// Macro to return an static c-string
#define MAKE_STRING(S) \
aligned_string<MAKE_CHAR_SEQUENCE(S)>::str

So

MEASURE_SCOPE(MAKE_STRING("text_rendering_code"));

would still return same pointer than you can compare directly.

You can modify your Macro MEASURE_SCOPE to include directly MAKE_STRING.

gcc has an extension to simplify MAKE_STRING:

template <typename CHAR, CHAR... cs>
const char* operator ""_c() { return static_string<cs...>{}::str; }

and then

MEASURE_SCOPE("text_rendering_code"_c);

When to use intern() on String literals

This is a technique to ensure that CONSTANT is not actually a constant.

When the Java compiler sees a reference to a final static primitive or String, it inserts the actual value of that constant into the class that uses it. If you then change the constant value in the defining class but don't recompile the using class, it will continue to use the old value.

By calling intern() on the "constant" string, it is no longer considered a static constant by the compiler, so the using class will actually access the defining class' member on each use.


JLS citations:

  • definition of a compile-time constant: http://docs.oracle.com/javase/specs/jls/se6/html/expressions.html#5313

  • implication of changes to a compile-time constant (about halfway down the page): http://docs.oracle.com/javase/specs/jls/se6/html/binaryComp.html#45139

Is it good practice to use java.lang.String.intern()?

When would I use this function in favor to String.equals()

when you need speed since you can compare strings by reference (== is faster than equals)

Are there side effects not mentioned in the Javadoc?

The primary disadvantage is that you have to remember to make sure that you actually do intern() all of the strings that you're going to compare. It's easy to forget to intern() all strings and then you can get confusingly incorrect results. Also, for everyone's sake, please be sure to very clearly document that you're relying on the strings being internalized.

The second disadvantage if you decide to internalize strings is that the intern() method is relatively expensive. It has to manage the pool of unique strings so it does a fair bit of work (even if the string has already been internalized). So, be careful in your code design so that you e.g., intern() all appropriate strings on input so you don't have to worry about it anymore.

(from JGuru)

Third disadvantage (Java 7 or less only): interned Strings live in PermGen space, which is usually quite small; you may run into an OutOfMemoryError with plenty of free heap space.

(from Michael Borgwardt)

Prepend to static string at compile time

Arguments are not constexpr, you have to turn them in type:

gcc/clang have an extension to allow to build UDL from literal string:

// That template uses the extension
template<typename Char, Char... Cs>
constexpr auto operator"" _cs() -> std::integer_sequence<Char, Cs...> {
return {};
}

See my answer from String-interning at compiletime for profiling to have MAKE_STRING macro if you cannot used the extension (Really more verbose, and hard coded limit for accepted string length).

Then

template<char ... Cs>
static constexpr auto
create_hint_for_opt_cls(std::integer_sequence<char, Cs...>) {
return hint_for_opt_cls_holder<Cs...>::value;
}

constexpr static auto name = "foobar"_cs;

When is it beneficial to flyweight Strings in Java?

Don't use String.intern() in your code. At least not if you might get 20 or more different strings. In my experience using String.intern slows down the whole application when you have a few millions strings.

To avoid duplicated String objects, just use a HashMap.

private final Map<String, String> pool = new HashMap<String, String>();

private void interned(String s) {
String interned = pool.get(s);
if (interned != null) {
return interned;
pool.put(s, s);
return s;
}

private void readFile(CsvFile csvFile) {
for (List<String> row : csvFile) {
for (int i = 0; i < row.size(); i++) {
row.set(i, interned(row.get(i)));
// further process the row
}
}
pool.clear(); // allow the garbage collector to clean up
}

With that code you can avoid duplicate strings for one CSV file. If you need to avoid them on a larger scale, call pool.clear() in another place.

String interning in .Net Framework - What are the benefits and when to use interning

Interning is an internal implementation detail. Unlike boxing, I do not think there is any benefit in knowing more than what you have read in Richter's book.

Micro-optimisation benefits of interning strings manually are minimal hence is generally not recommended.

This probably describes it:

class Program
{
const string SomeString = "Some String"; // gets interned

static void Main(string[] args)
{
var s1 = SomeString; // use interned string
var s2 = SomeString; // use interned string
var s = "String";
var s3 = "Some " + s; // no interning

Console.WriteLine(s1 == s2); // uses interning comparison
Console.WriteLine(s1 == s3); // do NOT use interning comparison
}
}

How can I avoid string.intern() contention and keep the memory footprint low?

Add an extra indirection step: Have a second HashMap that keeps the keys, and look up the keys there first before inserting them in the in-memory structures. This will give you much more flexibility than String#intern().

However, if you need to parse that 200MB XML file on every tomcat startup, and the extra 10 seconds make people grumble (are they restarting tomcat every so often?) - that makes flags pop up (have you considered using a database, even Apache Derby, to keep the parsed data?).

store a string in a constexpr struct

You might do:

template<typename Char, Char... Cs>
struct CharSeq
{
static constexpr const Char s[] = {Cs..., 0}; // The unique address
};

// That template uses the extension
template<typename Char, Char... Cs>
constexpr CharSeq<Char, Cs...> operator"" _cs() {
return {};
}

See my answer from String-interning at compiletime for profiling to have MAKE_STRING macro if you cannot used the extension (Really more verbose, and hard coded limit for accepted string length).

Then

struct A
{
template <char ... Cs>
constexpr A(CharSeq<char, Cs...>) : m_name(CharSeq<char, Cs...>::s) {}

constexpr auto name(){ return m_name; }

std::string_view m_name;
};

With only valid usages similar to:

A a = {"Hello"_cs};
constexpr A b = {"World"_cs};


Related Topics



Leave a reply



Submit