Very poor boost::lexical_cast performance
Edit 2012-04-11
rve quite rightly commented about lexical_cast's performance, providing a link:
http://www.boost.org/doc/libs/1_49_0/doc/html/boost_lexical_cast/performance.html
I don't have access right now to boost 1.49, but I do remember making my code faster on an older version. So I guess:
- the following answer is still valid (if only for learning purposes)
- there was probably an optimization introduced somewhere between the two versions (I'll search that)
- which means that boost is still getting better and better
Original answer
Just to add info on Barry's and Motti's excellent answers:
Some background
Please remember that Boost is written by some of the best C++ developers on this planet, and reviewed by those same developers. If lexical_cast were so wrong, someone would have taken the library apart already, either with criticism or with code.
I guess you missed the point of lexical_cast's real value...
Comparing apples and oranges.
In Java, you are casting an integer into a Java String. You'll note I'm not talking about an array of characters, or a user defined string. You'll note, too, I'm not talking about your user-defined integer. I'm talking about strict Java Integer and strict Java String.
In Python, you are more or less doing the same.
As said by other posts, you are, in essence, using the Java and Python equivalents of sprintf (or the less standard itoa).
In C++, you are using a very powerful cast. Not powerful in the sense of raw speed performance (if you want speed, perhaps sprintf would be better suited), but powerful in the sense of extensibility.
Comparing apples.
If you want to compare the Java Integer.toString method, then you should compare it with either the C sprintf or the C++ ostream facilities.
The C++ std::string solution below (sprintf into a stack buffer, copied into a std::string) would be 6 times faster (on my g++) than lexical_cast, and quite a bit less extensible:
inline void toString(const int value, std::string & output)
{
    // The largest 32-bit integer is 4294967295, that is 10 chars
    // On the safe side, add 1 for sign, and 1 for trailing zero
    char buffer[12] ;
    sprintf(buffer, "%i", value) ;
    output = buffer ;
}
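For reference, a version that actually uses the C++ stream facilities mentioned above could look like this (a sketch for illustration, not taken from the original benchmark; the name toStringStream is invented):

```cpp
#include <sstream>
#include <string>

// Stream-based alternative: same interface as toString above, but using
// std::ostringstream instead of a raw sprintf buffer. Anything with an
// operator<< overload can be formatted this way.
inline void toStringStream(const int value, std::string & output)
{
    std::ostringstream stream;
    stream << value;
    output = stream.str();
}
```

This trades a little speed for safety and extensibility, which is exactly the spectrum the answer is describing.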
The C sprintf solution would be 8 times faster (on my g++) than lexical_cast, but a lot less safe:
inline void toString(const int value, char * output)
{
    sprintf(output, "%i", value) ;
}
Both solutions are either as fast as or faster than your Java solution (according to your data).
Comparing oranges.
If you want to compare a C++ lexical_cast, then you should compare it with this Java pseudo-code:
Source s ;
Target t = Target.fromString(Source(s).toString()) ;
Source and Target being of whatever type you want, including built-in types like boolean or int, which is possible in C++ because of templates.
Extensibility? Is that a dirty word?
No, but it has a well-known cost: when written by the same coder, general solutions to specific problems are usually slower than specific solutions written for those specific problems.
In the current case, in a naive viewpoint, lexical_cast will use the stream facilities to convert from a type A into a string stream, and then from this string stream into a type B.
This means that as long as your object can be output to a stream and input from a stream, you'll be able to use lexical_cast on it without touching a single line of code.
So, what are the uses of lexical_cast?
The main uses of lexical casting are:
- Ease of use (hey, a C++ cast that works for everything being a value!)
- Combining it with template-heavy code, where your types are parametrized, and as such you don't want to deal with specifics, and you don't want to know the types
- Still potentially relatively efficient, if you have basic template knowledge, as I will demonstrate below
Point 2 is very, very important here, because it means we have one and only one interface/function to cast a value of one type into an equal or similar value of another type.
This is the real point you missed, and this is the point that costs in performance terms.
But it's so slooooooowwww!
If you want raw speed performance, remember that you're dealing with C++, and that you have a lot of facilities to handle conversion efficiently while still keeping the lexical_cast ease-of-use feature.
It took me some minutes to look at the lexical_cast source and come up with a viable solution. Add the following code to your C++ code:
#ifdef SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT
namespace boost
{
    template<>
    std::string lexical_cast<std::string, int>(const int &arg)
    {
        // The largest 32-bit integer is 4294967295, that is 10 chars
        // On the safe side, add 1 for sign, and 1 for trailing zero
        char buffer[12] ;
        sprintf(buffer, "%i", arg) ;
        return buffer ;
    }
}
#endif
By enabling this specialization of lexical_cast for strings and ints (by defining the macro SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT), my code went 5 times faster on my g++ compiler, which means, according to your data, its performance should be similar to Java's.
And it took me 10 minutes of looking at the boost code to write a remotely efficient and correct 32-bit version. With some work, it could probably go faster and safer (for example, if we had direct write access to the std::string internal buffer, we could avoid the temporary external buffer).
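A sketch of that last idea (assuming C++11, where std::string storage is guaranteed contiguous; the name toStringDirect is invented): format straight into the string's own buffer and trim, avoiding the temporary char array.

```cpp
#include <cstdio>
#include <string>

// Write directly into the std::string's buffer: 10 digits + sign fits in
// 12 chars including the terminating '\0' that snprintf writes.
inline std::string toStringDirect(const int value)
{
    std::string output(12, '\0');
    const int written = std::snprintf(&output[0], output.size(), "%i", value);
    output.resize(written); // trim to the actual formatted length
    return output;
}
```

Using snprintf instead of sprintf also bounds the write, addressing the "a lot less safe" caveat from earlier.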
Why does boost::lexical_cast throw an exception even though it converted the value?
That's confusing.
Couldn't repro it at first: https://wandbox.org/permlink/MWJ3Ys7iUhNIaBek - you can change compiler versions and boost version there
However, changing the compiler to clang did the trick: https://wandbox.org/permlink/Ml8lQWESprfEplBi (even with boost 1.73)
Things get weirder: on my box, clang++-9 is fine even with asan/ubsan.
So I took to installing a few docker distributions.
It turns out that when using clang++ -stdlib=libc++, things break.
Conclusion
After a long chase through debuggers and standard library implementations, it turns out it's not that complicated. Here's the low-down:
#include <sstream>
#include <cassert>
#include <iostream>
#include <limits> // for std::numeric_limits

int main() {
    double v;
    std::cout << std::numeric_limits<double>::min_exponent10 << std::endl;
    std::cout << std::numeric_limits<double>::max_exponent10 << std::endl;
    assert(std::istringstream("1e308") >> v);
    assert(std::istringstream("1.03964e-312") >> v); // line 10
    assert(std::istringstream("1e309") >> v); // line 11
}
On libstdc++, this prints:
-307
308
sotest: /home/sehe/Projects/stackoverflow/test.cpp:11: int main(): Assertion `std::istringstream("1e309") >> v' failed.
On libc++:
-307
308
sotest: /home/sehe/Projects/stackoverflow/test.cpp:10: int main(): Assertion `std::istringstream("1.03964e-312") >> v' failed.
Summarizing, libstdc++ is allowing subnormal representations in some cases:
The 11-bit width of the exponent allows the representation of numbers between 10^-308 and 10^308, with full 15-17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10^-324.
It is likely that the library does do some checks to find whether there is acceptable loss of precision, but it could also be leaving this entirely to your own judgment.
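To see where 1.03964e-312 sits concretely (assuming IEEE-754 doubles), one can classify the value directly; the helper name is invented:

```cpp
#include <cmath>

// 1.03964e-312 lies below DBL_MIN (~2.2e-308) but above the smallest
// subnormal (~4.9e-324), so it is only representable as a subnormal.
inline bool is_subnormal_double(double d)
{
    return std::fpclassify(d) == FP_SUBNORMAL;
}
```

So both libraries are defensible: libstdc++ accepts the value as a subnormal with reduced precision, while libc++ treats it as out of the normal range and fails the extraction.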
Suggestions
If you need that kind of range, I'd suggest using a multiprecision library (GMP, MPFR, or indeed Boost).
For full fidelity with decimal input formats, consider e.g. cpp_dec_float:
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>

using Decimal = boost::multiprecision::cpp_dec_float_50;

int main() {
    Decimal v("1.03964e-312");
    std::cout << v << std::endl;
}
Prints
1.03964e-312
Is boost::lexical_cast redundant with C++11 stoi, stof and family?
boost::lexical_cast:
- handles more kinds of conversion, including iterator pairs, arrays, C strings, etc.
- offers the same generic interface (the sto* functions have different names for different types)
- is locale-sensitive (sto*/to_string are only in part; e.g. lexical_cast can process thousands separators, while stoul usually doesn't)
When does boost::lexical_cast to std::string fail?
It can fail for example if a user-defined conversion throws:
#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

enum class MyType {};

std::ostream& operator<<( std::ostream&, MyType const& )
{
    throw "error";
}

int main()
{
    try
    {
        boost::lexical_cast< std::string >( MyType{} );
    }
    catch(...)
    {
        std::cout << "lexical_cast exception";
    }
}
As you have no control over the type of exceptions thrown by user-defined conversions, catching boost::bad_lexical_cast won't even be enough. Your unit test has to catch all exceptions.
What overhead is there in performing an identity boost::lexical_cast?
Since the documentation doesn't offer anything on this topic, I dug into the lexical_cast source (1.51.0) and found that it does some compile-time checking on the types and selects a specific "caster class" that does the conversion. In case source and target are the same, this "caster class" will simply return the input.
Pseudo-codified and simplified from the source (boost/lexical_cast.hpp:2268):
template <typename Target, typename Source>
Target lexical_cast(const Source &arg)
{
    static if( is_character_type_to_character_type<Target, Source> ||
               is_char_array_to_stdstring<Target, Source> ||
               is_same_and_stdstring<Target, Source> )
    // ^-- optimization for std::string to std::string and similar stuff
    {
        return arg;
    }
    else
    {
        /* some complicated stuff */
    }
}
I can't directly see any optimizations for other identity casts, though, and looking through the normally selected lexical_cast_do_cast "caster class" is making my head hurt. :(
boost lexical_cast throws exception
Your problem is probably that the loop body executes one more time than you expect.
The last time through the loop, the read into word fails, setting the fail bit on iss, which is what while (iss) checks. To fix it, you need to do something like this:
string word;
istringstream iss(line);
while (iss >> word) // only process a word if the read actually succeeded
{
    double x = lexical_cast<double>(word);
    cout << x << endl;
}
How can I use boost::lexical_cast with folly::fbstring?
From looking at the lexical_cast documentation, it appears that std::string is explicitly the only string-like exception allowed to the normal lexical-cast semantics, to keep the meaning of the cast as straightforward as possible and to catch as many possible conversion errors as can be caught. The documentation also says to use alternatives such as std::stringstream for the other cases.
In your case, I think a to_fbstring method would be perfect:
#include <sstream>
#include <folly/FBString.h>

using folly::fbstring;

template <typename T>
fbstring to_fbstring(const T& item)
{
    std::ostringstream os;
    os << item;
    return fbstring(os.str());
}
Locale invariant guarantee of boost::lexical_cast
Can I force the boost::lexical_cast<> not to be aware of locales?
No, I don't think that is possible. The best you can do is call

std::locale::global(std::locale::classic());

to set the global locale to the "C" locale, as boost::lexical_cast relies on the global locale. However, the problem is that if somewhere else in the code the global locale is set to something else before calling boost::lexical_cast, then you still have the same problem.
Therefore, a more robust solution is to imbue a stringstream like so; then you can always be sure that it works:
std::ostringstream oss;
oss.imbue(std::locale::classic());
oss.precision(std::numeric_limits<double>::digits10);
oss << 0.15784465;
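Wrapped up as a helper (a sketch; the name to_string_classic is invented), this gives locale-independent formatting no matter what the global locale has been set to:

```cpp
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Always format with the classic "C" locale, regardless of what
// std::locale::global() has been set to elsewhere in the program.
inline std::string to_string_classic(double value)
{
    std::ostringstream oss;
    oss.imbue(std::locale::classic());
    oss.precision(std::numeric_limits<double>::digits10);
    oss << value;
    return oss.str();
}
```

Since the stream carries its own locale, no other part of the program can affect the result.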