Very poor boost::lexical_cast performance
Edit 2012-04-11
rve quite rightly commented about lexical_cast's performance, providing a link:
http://www.boost.org/doc/libs/1_49_0/doc/html/boost_lexical_cast/performance.html
I don't have access right now to boost 1.49, but I do remember making my code faster on an older version. So I guess:
- the following answer is still valid (if only for learning purposes)
- there was probably an optimization introduced somewhere between the two versions (I'll search that)
- which means that boost is still getting better and better
Original answer
Just to add info on Barry's and Motti's excellent answers:
Some background
Please remember that Boost is written by some of the best C++ developers on this planet, and reviewed by those same developers. If lexical_cast were so wrong, someone would have taken the library apart already, either with criticism or with code.
I guess you missed the point of lexical_cast's real value...
Comparing apples and oranges.
In Java, you are casting an integer into a Java String. You'll note I'm not talking about an array of characters, or a user defined string. You'll note, too, I'm not talking about your user-defined integer. I'm talking about strict Java Integer and strict Java String.
In Python, you are more or less doing the same.
As said by other posts, you are, in essence, using the Java and Python equivalents of sprintf (or the less standard itoa).
In C++, you are using a very powerful cast. Not powerful in the sense of raw speed performance (if you want speed, perhaps sprintf would be better suited), but powerful in the sense of extensibility.
Comparing apples.
If you want to compare the Java Integer.toString method, then you should compare it with either the C sprintf or the C++ ostream facilities.
The C++ std::string solution below (sprintf into a stack buffer, copied into a std::string) would be 6 times faster (on my g++) than lexical_cast, and quite a bit less extensible:
inline void toString(const int value, std::string & output)
{
    // The largest 32-bit integer is 4294967295, that is 10 chars
    // On the safe side, add 1 for sign, and 1 for trailing zero
    char buffer[12] ;
    sprintf(buffer, "%i", value) ;
    output = buffer ;
}
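For reference, a version that actually uses the C++ stream facilities mentioned above could look like this (a sketch for illustration, not taken from the original benchmark; the name toStringStream is invented):

```cpp
#include <sstream>
#include <string>

// Stream-based alternative: same interface as toString above, but using
// std::ostringstream instead of a raw sprintf buffer. Anything with an
// operator<< overload can be formatted this way.
inline void toStringStream(const int value, std::string & output)
{
    std::ostringstream stream;
    stream << value;
    output = stream.str();
}
```

This trades a little speed for safety and extensibility, which is exactly the spectrum the answer is describing.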
The C sprintf solution would be 8 times faster (on my g++) than lexical_cast, but a lot less safe:
inline void toString(const int value, char * output)
{
    sprintf(output, "%i", value) ;
}
Both solutions are either as fast as or faster than your Java solution (according to your data).
Comparing oranges.
If you want to compare a C++ lexical_cast, then you should compare it with this Java pseudo-code:
Source s ;
Target t = Target.fromString(Source(s).toString()) ;
Source and Target being of whatever type you want, including built-in types like boolean or int, which is possible in C++ because of templates.
Extensibility? Is that a dirty word?
No, but it has a well-known cost: when written by the same coder, general solutions to specific problems are usually slower than specific solutions written for those specific problems.
In the current case, in a naive viewpoint, lexical_cast will use the stream facilities to convert from a type A into a string stream, and then from this string stream into a type B.
This means that as long as your object can be output to a stream and input from a stream, you'll be able to use lexical_cast on it without touching a single line of code.
So, what are the uses of lexical_cast?
The main uses of lexical casting are:
- Ease of use (hey, a C++ cast that works for everything being a value!)
- Combining it with template-heavy code, where your types are parametrized, and as such you don't want to deal with specifics, and you don't want to know the types
- Still potentially relatively efficient, if you have basic template knowledge, as I will demonstrate below
Point 2 is very, very important here, because it means we have one and only one interface/function to cast a value of one type into an equal or similar value of another type.
This is the real point you missed, and this is the point that costs in performance terms.
But it's so slooooooowwww!
If you want raw speed performance, remember that you're dealing with C++, and that you have a lot of facilities to handle conversion efficiently while still keeping the lexical_cast ease-of-use feature.
It took me some minutes to look at the lexical_cast source and come up with a viable solution. Add the following code to your C++ code:
#ifdef SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT
namespace boost
{
    template<>
    std::string lexical_cast<std::string, int>(const int &arg)
    {
        // The largest 32-bit integer is 4294967295, that is 10 chars
        // On the safe side, add 1 for sign, and 1 for trailing zero
        char buffer[12] ;
        sprintf(buffer, "%i", arg) ;
        return buffer ;
    }
}
#endif
By enabling this specialization of lexical_cast for strings and ints (by defining the macro SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT), my code went 5 times faster on my g++ compiler, which means, according to your data, its performance should be similar to Java's.
And it took me 10 minutes of looking at the boost code to write a remotely efficient and correct 32-bit version. With some work, it could probably go faster and safer (for example, if we had direct write access to the std::string internal buffer, we could avoid the temporary external buffer).
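A sketch of that last idea (assuming C++11, where std::string storage is guaranteed contiguous; the name toStringDirect is invented): format straight into the string's own buffer and trim, avoiding the temporary char array.

```cpp
#include <cstdio>
#include <string>

// Write directly into the std::string's buffer: 10 digits + sign fits in
// 12 chars including the terminating '\0' that snprintf writes.
inline std::string toStringDirect(const int value)
{
    std::string output(12, '\0');
    const int written = std::snprintf(&output[0], output.size(), "%i", value);
    output.resize(written); // trim to the actual formatted length
    return output;
}
```

Using snprintf instead of sprintf also bounds the write, addressing the "a lot less safe" caveat from earlier.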
Why does boost::lexical_cast throw an exception even though it converted the value?
That's confusing.
Couldn't repro it at first: https://wandbox.org/permlink/MWJ3Ys7iUhNIaBek - you can change compiler versions and boost version there
However, changing the compiler to clang did the trick: https://wandbox.org/permlink/Ml8lQWESprfEplBi (even with boost 1.73)
Things get weirder: on my box, clang++-9 is fine even with asan/ubsan.
So I took to installing a few docker distributions.
It turns out that when using clang++ -stdlib=libc++, things break.
Conclusion
After a long chase through debuggers and standard library implementations, it turns out it's not that complicated. Here's the low-down:
#include <sstream>
#include <cassert>
#include <iostream>
#include <limits> // for std::numeric_limits

int main() {
    double v;
    std::cout << std::numeric_limits<double>::min_exponent10 << std::endl;
    std::cout << std::numeric_limits<double>::max_exponent10 << std::endl;
    assert(std::istringstream("1e308") >> v);
    assert(std::istringstream("1.03964e-312") >> v); // line 10
    assert(std::istringstream("1e309") >> v); // line 11
}
On libstdc++, this prints:
-307
308
sotest: /home/sehe/Projects/stackoverflow/test.cpp:11: int main(): Assertion `std::istringstream("1e309") >> v' failed.
On libc++:
-307
308
sotest: /home/sehe/Projects/stackoverflow/test.cpp:10: int main(): Assertion `std::istringstream("1.03964e-312") >> v' failed.
Summarizing, libstdc++ is allowing subnormal representations in some cases:
The 11-bit width of the exponent allows the representation of numbers between 10^-308 and 10^308, with full 15-17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10^-324.
It is likely that the library does do some checks to find whether there is acceptable loss of precision, but it could also be leaving this entirely to your own judgment.
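To see where 1.03964e-312 sits concretely (assuming IEEE-754 doubles), one can classify the value directly; the helper name is invented:

```cpp
#include <cmath>

// 1.03964e-312 lies below DBL_MIN (~2.2e-308) but above the smallest
// subnormal (~4.9e-324), so it is only representable as a subnormal.
inline bool is_subnormal_double(double d)
{
    return std::fpclassify(d) == FP_SUBNORMAL;
}
```

So both libraries are defensible: libstdc++ accepts the value as a subnormal with reduced precision, while libc++ treats it as out of the normal range and fails the extraction.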
Suggestions
If you need that kind of range, I'd suggest using a multiprecision library (GMP, MPFR, or indeed Boost).
For full fidelity with decimal input formats, consider e.g. cpp_dec_float:
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>

using Decimal = boost::multiprecision::cpp_dec_float_50;

int main() {
    Decimal v("1.03964e-312");
    std::cout << v << std::endl;
}
Prints
1.03964e-312
Is boost::lexical_cast redundant with C++11 stoi, stof and family?
boost::lexical_cast:
- handles more kinds of conversion, including iterator pairs, arrays, C strings, etc.
- offers the same generic interface (the sto* functions have different names for different types)
- is locale-sensitive (sto*/to_string are only in part; e.g. lexical_cast can process thousands separators, while stoul usually doesn't)
When does boost::lexical_cast to std::string fail?
It can fail for example if a user-defined conversion throws:
#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

enum class MyType {};

std::ostream& operator<<( std::ostream&, MyType const& )
{
    throw "error";
}

int main()
{
    try
    {
        boost::lexical_cast< std::string >( MyType{} );
    }
    catch(...)
    {
        std::cout << "lexical_cast exception";
    }
}
As you have no control over the type of exceptions thrown by user-defined conversions, catching boost::bad_lexical_cast won't even be enough. Your unit test has to catch all exceptions.
What overhead is there in performing an identity boost::lexical_cast?
Since the documentation doesn't offer anything on this topic, I dug into the lexical_cast source (1.51.0) and found that it does some compile-time checking on the types and selects a specific "caster class" that does the conversion. In case source and target are the same, this "caster class" will simply return the input.
Pseudo-codified and simplified from the source (boost/lexical_cast.hpp:2268):
template <typename Target, typename Source>
Target lexical_cast(const Source &arg)
{
    static if( is_character_type_to_character_type<Target, Source> ||
               is_char_array_to_stdstring<Target, Source> ||
               is_same_and_stdstring<Target, Source> )
    // ^-- optimization for std::string to std::string and similar stuff
    {
        return arg;
    }
    else
    {
        /* some complicated stuff */
    }
}
I can't directly see any optimizations for other identity casts, though, and looking through the normally selected lexical_cast_do_cast "caster class" is making my head hurt. :(
boost lexical_cast throws exception
Your problem is probably that the loop body executes one more time than you expect.
The last time through the loop, the read into word fails, setting the fail bit on iss, which is what while (iss) checks. To fix it, you need to do something like this:
string word;
istringstream iss(line);
while (iss >> word) // only process a word if the read actually succeeded
{
    double x = lexical_cast<double>(word);
    cout << x << endl;
}
How can I use boost::lexical_cast with folly::fbstring?
From looking at the lexical_cast documentation, it appears that std::string is explicitly the only string-like exception allowed to the normal lexical-cast semantics, to keep the meaning of the cast as straightforward as possible and to catch as many possible conversion errors as can be caught. The documentation also says to use alternatives such as std::stringstream for the other cases.
In your case, I think a to_fbstring method would be perfect:
#include <sstream>
#include <folly/FBString.h>

using folly::fbstring;

template <typename T>
fbstring to_fbstring(const T& item)
{
    std::ostringstream os;
    os << item;
    return fbstring(os.str());
}
Locale invariant guarantee of boost::lexical_cast
Can I force the boost::lexical_cast<> not to be aware of locales?
No, I don't think that is possible. The best you can do is call

std::locale::global(std::locale::classic());

to set the global locale to the "C" locale, as boost::lexical_cast relies on the global locale. However, the problem is that if somewhere else in the code the global locale is set to something else before calling boost::lexical_cast, then you still have the same problem.
Therefore, a more robust solution is to imbue a stringstream like so; then you can always be sure that it works:
std::ostringstream oss;
oss.imbue(std::locale::classic());
oss.precision(std::numeric_limits<double>::digits10);
oss << 0.15784465;
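Wrapped up as a helper (a sketch; the name to_string_classic is invented), this gives locale-independent formatting no matter what the global locale has been set to:

```cpp
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Always format with the classic "C" locale, regardless of what
// std::locale::global() has been set to elsewhere in the program.
inline std::string to_string_classic(double value)
{
    std::ostringstream oss;
    oss.imbue(std::locale::classic());
    oss.precision(std::numeric_limits<double>::digits10);
    oss << value;
    return oss.str();
}
```

Since the stream carries its own locale, no other part of the program can affect the result.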