How to Parse Space-Separated Floats in C++ Quickly

How to parse space-separated floats in C++ quickly?

If the conversion is the bottleneck (which is quite possible),
you should start by trying the different possibilities in the
standard library. Logically, one would expect them to be very
close, but in practice, they aren't always:

  • You've already determined that std::ifstream is too slow.

  • Converting your memory mapped data to an std::istringstream
    is almost certainly not a good solution; you'll first have to
    create a string, which will copy all of the data.

  • Writing your own streambuf to read directly from the memory,
    without copying (or using the deprecated std::istrstream)
    might be a solution, although if the problem really is the
    conversion... this still uses the same conversion routines.

  • You can always try fscanf, or scanf on your memory mapped
    stream. Depending on the implementation, they might be faster
    than the various istream implementations.

  • Probably faster than any of these is to use strtod. No need
    to tokenize for this: strtod skips leading white space
    (including '\n'), and has an out parameter where it puts the
    address of the first character not read. The end condition is
    a bit tricky; your loop should probably look something like this:


    char* begin;    // Set to point to the mmap'ed data...
                    // You'll also have to arrange for a '\0'
                    // to follow the data. This is probably
                    // the most difficult issue.
    char* end;
    errno = 0;
    double tmp = strtod( begin, &end );
    while ( errno == 0 && end != begin ) {
        // do whatever with tmp...
        begin = end;
        tmp = strtod( begin, &end );
    }
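
For the custom streambuf option mentioned in the list above, a minimal sketch might look like the following. The names MemoryStreambuf and readAllDoubles are made up for illustration, and the sketch assumes the data is only ever read through the stream. It avoids the copy, and it needs no trailing '\0' because the size is passed explicitly, but the stream still uses the same conversion routines, so it only helps if the copying was the expensive part.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <vector>

// Hypothetical helper: a read-only streambuf over an existing memory range.
class MemoryStreambuf : public std::streambuf {
public:
    MemoryStreambuf(char const* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        // setg() takes non-const pointers; the get area is only read here,
        // never written, so the const_cast does not modify the data.
        char* first = const_cast<char*>(data);
        setg(first, first, first + size);
    }
};

// Reads white-space-separated doubles straight from memory, without copying.
std::vector<double> readAllDoubles(char const* data, std::size_t size) {
    MemoryStreambuf buffer(data, size);
    std::istream in(&buffer);
    std::vector<double> values;
    double value;
    while (in >> value) {
        values.push_back(value);
    }
    return values;
}
```

Calling readAllDoubles(mappedPointer, mappedSize) on the mapped region would then replace the std::ifstream loop, with mappedPointer and mappedSize standing for whatever your mapping code provides.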

If none of these are fast enough, you'll have to consider the
actual data. It probably has some sort of additional
constraints, which means that you can potentially write
a conversion routine which is faster than the more general ones;
e.g. strtod has to handle both fixed and scientific, and it
has to be 100% accurate even if there are 17 significant digits.
It also has to be locale specific. All of this is added
complexity, which means added code to execute. But beware:
writing an efficient and correct conversion routine, even for
a restricted set of input, is non-trivial; you really do have to
know what you are doing.

EDIT:

Just out of curiosity, I've run some tests. In addition to the
aforementioned solutions, I wrote a simple custom converter,
which only handles fixed point (no scientific), with at most
five digits after the decimal, and the value before the decimal
must fit in an int:

    double
    convert( char const* source, char const** endPtr )
    {
        char* end;
        int left = strtol( source, &end, 10 );
        double results = left;
        if ( *end == '.' ) {
            char* start = end + 1;
            int right = strtol( start, &end, 10 );
            // Indexed by the number of digits after the decimal
            // point (at most five).
            static double const fracMult[]
                = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
            // Note: the fraction is always added, so this is only
            // correct for non-negative input.
            results += right * fracMult[ end - start ];
        }
        if ( endPtr != nullptr ) {
            *endPtr = end;
        }
        return results;
    }

(If you actually use this, you should definitely add some error
handling. This was just knocked up quickly for experimental
purposes, to read the test file I'd generated, and nothing
else.)

The interface is exactly that of strtod, to simplify coding.
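
As a rough idea of what the missing error handling could look like, here is a sketch under the same restrictions (fixed point only, at most five digits after the decimal). The name convertChecked and the bool return value are my own choices for illustration, so unlike the function above this does not keep the strtod interface. It also handles a leading sign and caps the integer part at nine digits so that it fits in a 32-bit int. It has not been benchmarked.

```cpp
#include <cassert>

// Hypothetical checked variant of the restricted converter.
// Returns false (and leaves *endPtr == source) if no valid number is found.
bool convertChecked(char const* source, char const** endPtr, double* result) {
    assert(source != nullptr && endPtr != nullptr && result != nullptr);
    // fracMult[n] is 10^-n, indexed by the number of fractional digits.
    static double const fracMult[] = { 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
    *endPtr = source;                        // default: nothing consumed

    char const* p = source;
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') ++p;

    bool negative = false;
    if (*p == '+' || *p == '-') {
        negative = (*p == '-');
        ++p;
    }

    long left = 0;
    int leftDigits = 0;
    for (; *p >= '0' && *p <= '9'; ++p) {
        if (++leftDigits > 9) return false;  // keeps `left` inside a 32-bit int
        left = left * 10 + (*p - '0');
    }

    long right = 0;
    int rightDigits = 0;
    if (*p == '.') {
        ++p;
        for (; *p >= '0' && *p <= '9'; ++p) {
            if (++rightDigits > 5) return false;  // more precision than supported
            right = right * 10 + (*p - '0');
        }
    }
    if (leftDigits == 0 && rightDigits == 0) return false;  // no digits at all

    // Apply the sign to the whole value, not just the integer part.
    double value = left + right * fracMult[rightDigits];
    *result = negative ? -value : value;
    *endPtr = p;
    return true;
}
```

Parsing the digits by hand rather than calling strtol twice also means white space or a sign after the decimal point is not silently accepted.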

I ran the benchmarks in two environments (on different machines,
so the absolute values of any times aren't relevant). I got the
following results:

Under Windows 7, compiled with VC 11 (/O2):

Testing Using fstream directly (5 iterations)...
6.3528e+006 microseconds per iteration
Testing Using fscan directly (5 iterations)...
685800 microseconds per iteration
Testing Using strtod (5 iterations)...
597000 microseconds per iteration
Testing Using manual (5 iterations)...
269600 microseconds per iteration

Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):

Testing Using fstream directly (5 iterations)...
784000 microseconds per iteration
Testing Using fscanf directly (5 iterations)...
526000 microseconds per iteration
Testing Using strtod (5 iterations)...
382000 microseconds per iteration
Testing Using strtof (5 iterations)...
360000 microseconds per iteration
Testing Using manual (5 iterations)...
186000 microseconds per iteration

In all cases, I'm reading 554000 lines, each with 3 randomly
generated floating point values in the range [0...10000).

The most striking thing is the enormous difference between
fstream and fscanf under Windows (and the relatively small
difference between fscanf and strtod). The second thing is
just how much the simple custom conversion function gains, on
both platforms. The necessary error handling would slow it down
a little, but the difference is still significant. I expected
some improvement, since it doesn't handle a lot of things that
the standard conversion routines do (like scientific format,
very, very small numbers, Inf and NaN, i18n, etc.), but not this
much.
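
If you want to repeat this kind of measurement on your own machine, a small harness along these lines may be enough to get started. This is not the test program used for the numbers above; makeTestData, countWithStrtod and timeMicroseconds are hypothetical names, and the data is generated in memory rather than read from a file. It only mimics the shape of the test input: three fixed-point values per line in the range [0, 10000).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <string>

// Hypothetical generator: `lines` lines, each with three values in [0, 10000).
std::string makeTestData(std::size_t lines, unsigned seed) {
    std::mt19937 engine(seed);
    std::uniform_real_distribution<double> dist(0.0, 10000.0);
    std::string data;
    char buffer[64];
    for (std::size_t i = 0; i < lines; ++i) {
        std::snprintf(buffer, sizeof buffer, "%.5f %.5f %.5f\n",
                      dist(engine), dist(engine), dist(engine));
        data += buffer;
    }
    return data;
}

// Counts the values strtod can read. Relies on the '\0' that
// std::string::c_str() guarantees after the data.
std::size_t countWithStrtod(std::string const& data) {
    char const* begin = data.c_str();
    char* end = nullptr;
    std::size_t count = 0;
    for (;;) {
        std::strtod(begin, &end);
        if (end == begin) break;   // nothing converted: end of data
        ++count;
        begin = end;
    }
    return count;
}

// Runs `parser(data)` once and returns the elapsed time in microseconds.
template <typename Parser>
long long timeMicroseconds(Parser parser, std::string const& data,
                           std::size_t* count) {
    assert(count != nullptr);
    auto start = std::chrono::steady_clock::now();
    *count = parser(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               stop - start).count();
}
```

Returning the count from the parser gives the compiler a result it cannot discard, and lets you check that every candidate read the same number of values before comparing times.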

Parse a C-string of floating numbers

Look at Boost Spirit:

  • How to parse space-separated floats in C++ quickly?

It supports NaN, positive and negative infinity just fine. Also it allows you to express the constraining grammar succinctly.

  1. Simple adaptation of the code

    Here is the adapted sample for your grammar:

    struct Point { float x,y; };
    typedef std::vector<Point> data_t;

    // And later:
    bool ok = phrase_parse(f,l,*(double_ > ',' > double_), space, data);

    The iterators can be any iterators. So you can hook it up with your C string just fine.

    Here's a straight adaptation of the linked benchmark case. This shows you how to parse from any std::istream or directly from a memory mapped file.

    Live On Coliru


  2. Further optimizations (strictly for C strings)

    Here's a version that doesn't need to know the length of the string up front (this is neat because it avoids the strlen call in case you didn't have the length available):

    template <typename OI>
    static inline void parse_points(OI out, char const* it, char const* last = std::numeric_limits<char const*>::max()) {
        namespace qi  = boost::spirit::qi;
        namespace phx = boost::phoenix;

        bool ok = qi::phrase_parse(it, last,
                *(qi::double_ >> ',' >> qi::double_) [ *phx::ref(out) = phx::construct<Point>(qi::_1, qi::_2) ],
                qi::space);

        if (!ok || !(it == last || *it == '\0')) {
            throw it; // TODO proper error reporting?
        }
    }

    Note how I made it take an output iterator so that you get to decide how to accumulate the results. The obvious wrapper to /just/ parse to a vector would be:

    static inline data_t parse_points(char const* szInput) {
        data_t pts;
        parse_points(back_inserter(pts), szInput);
        return pts;
    }

    But you can also do different things (like append to an existing container, that could have reserved a known capacity up front etc.). Things like this often allow truly optimized integration in the end.

    Here's that code fully demo-ed in ~30 lines of essential code:

    Live On Coliru

  3. Extra Awesome Bonus

    To show off the flexibility of this parser: if you just wanted to check the input and get a count of the points, you can replace the output iterator with a simple lambda function that increments a counter instead of adding a newly constructed point.

    int main() {
        int count = 0;
        // the output iterator comes first, matching parse_points(OI out, char const* it, ...)
        parse_points(boost::make_function_output_iterator([&](Point const&){count++;}), " 10,9 2.5, 3 4 ,150.32 ");
        std::cout << "elements in sample: " << count << "\n";
    }

    Live On Coliru

    Since everything is inlined the compiler will notice that the whole Point doesn't need to be constructed here and eliminate that code: http://paste.ubuntu.com/9781055/

    The main function is seen directly invoking the very parser primitives. Handcoding the parser won't get you better tuning here, at least not without a lot of effort.

Error Handling + Extracting space-separated float values from a string?

While using strtod() (for double) or strtof() (for float) is the correct approach for the conversion, and either can be used to work through all values in each line read, there is a simpler way to approach this problem. Since the sets of ASCII characters (tokens) to be converted to floating-point values are all separated by one or more spaces, you can simply use strtok() to split the line held in file_string into tokens and then convert each token with strtod() (or strtof()).

Doing so simplifies error handling and saves you from having to advance the pointer returned in endptr past any characters not used in a conversion. If you process the entire line with strtod(), then for your 12e or 4e.3 examples of bad values, it is up to you to check whether endptr points to something other than a space (endptr points one past the last character used in the conversion), and if it does, you have to loop manually, incrementing endptr until you reach the next space.
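
For comparison, the manual approach just described, without strtok(), could be sketched roughly as follows. The function name parseSkippingBad and the badCount parameter are made up for this illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: converts every well-formed value in `line` with
// strtod() alone, skipping over malformed tokens by hand.
std::vector<double> parseSkippingBad(char const* line, std::size_t* badCount) {
    assert(line != nullptr && badCount != nullptr);
    std::vector<double> values;
    *badCount = 0;
    char const* p = line;
    for (;;) {
        // skip white space so the end of the line is easy to detect
        while (std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (*p == '\0') break;

        char* end = nullptr;
        errno = 0;
        double d = std::strtod(p, &end);

        bool converted = (end != p);
        // a good value must be followed by white space or the end of the line
        bool atBoundary = (*end == '\0') ||
                          std::isspace(static_cast<unsigned char>(*end));
        if (converted && errno == 0 && atBoundary) {
            values.push_back(d);
            p = end;
        } else {
            ++*badCount;
            // step past the rest of the bad token, up to the next space
            p = end;
            while (*p != '\0' && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        }
    }
    return values;
}
```

This leaves the input untouched, at the cost of the extra boundary check and skip loop that tokenizing first lets you avoid.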

When you tokenize the line with strtok(), splitting all space-separated words in file_string into tokens, you only have to validate the conversion and check whether endptr points at something other than the '\0' (nul-terminating character) at the end of the token to know whether the value was bad.

A quick example of doing it this way could be:

    #include <stdio.h>
    #include <stdlib.h>     /* for strtod() */
    #include <string.h>     /* for strtok() */
    #include <errno.h>      /* for errno */
    #include <ctype.h>      /* for isspace() */

    #define DELIM " \t\n"   /* delimiters to split line into tokens */

    int main (void) {

        char file_string[] = "-123.45 1.2e7 boo 1.3e-7 12e 4e.3 1.1 2.22\n",
             *nptr = file_string,       /* nptr for strtod */
             *endptr = nptr;            /* endptr for strtod */
        size_t n = 0;

        nptr = strtok (file_string, DELIM);     /* get 1st token */

        while (nptr != NULL) {
            errno = 0;                          /* reset errno 0 */

            double d = strtod (nptr, &endptr);  /* call strtod() */

            printf ("\ntoken: '%s'\n", nptr);   /* output token (optional) */

            /* validate conversion */
            if (d == 0 && endptr == nptr) {     /* no characters converted */
                fprintf (stderr, "error, no characters converted in '%s'.\n", nptr);
                d = -1000;                      /* set bad value */
            }
            else if (errno) {   /* underflow or overflow in conversion */
                fprintf (stderr, "error: overflow or underflow converting '%s'.\n",
                         nptr);
                d = -1000;                      /* set bad value */
            }
            else if (*endptr) { /* not all characters used in conversion */
                fprintf (stderr, "error: malformed value '%s'.\n", nptr);
                d = -1000;                      /* set bad value */
            }

            printf ("double[%2zu]: %g\n", n, d);    /* output double */
            n += 1;                                 /* increment n */

            nptr = strtok (NULL, DELIM);        /* get next token */
        }
    }

(note: the string you pass to strtok(), e.g. file_string in your case, cannot be a read-only string literal, because strtok() modifies the original string. You can make a copy of the string if you need to preserve the original.)
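
If you do need to keep the original intact, one way to make that copy is sketched below in C++. The helper name splitPreservingOriginal is hypothetical. It tokenizes a scratch buffer, so passing a string literal is safe. Keep in mind that strtok() keeps internal state between calls, so this is not suitable for concurrent use.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a private copy, so `line` is never modified.
std::vector<std::string> splitPreservingOriginal(char const* line) {
    assert(line != nullptr);
    // copy the characters plus the terminating '\0' into a writable buffer
    std::vector<char> scratch(line, line + std::strlen(line) + 1);
    std::vector<std::string> tokens;
    for (char* token = std::strtok(scratch.data(), " \t\n");
         token != nullptr;
         token = std::strtok(nullptr, " \t\n")) {
        tokens.push_back(token);   // each token is copied out of the scratch buffer
    }
    return tokens;
}
```

Each returned token can then be passed to strtod() through c_str() and validated in the same way as in the loop above.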

Example Use/Output

With file_string containing "-123.45 1.2e7 boo 1.3e-7 12e 4e.3 1.1 2.22\n", you would have:

$ ./bin/strtod_strtok_loop

token: '-123.45'
double[ 0]: -123.45

token: '1.2e7'
double[ 1]: 1.2e+07

token: 'boo'
error, no characters converted in 'boo'.
double[ 2]: -1000

token: '1.3e-7'
double[ 3]: 1.3e-07

token: '12e'
error: malformed value '12e'.
double[ 4]: -1000

token: '4e.3'
error: malformed value '4e.3'.
double[ 5]: -1000

token: '1.1'
double[ 6]: 1.1

token: '2.22'
double[ 7]: 2.22

Look things over and let me know if you have further questions.

How do I extract space separated floating numbers from string into another array in c?

strtod() is the best tool for parsing floating point numbers.

To separate sub-strings using whitespace: at a minimum, OP's code is not appending the needed null character. Further, the copy from input1[] needs to copy more than just one character.

    #include <cs50.h>   // string, get_string() (assumed: CS50 library, as in OP's code)
    #include <ctype.h>  // isspace()
    #include <stdio.h>  // puts()

    // Avoid magic numbers, using define self-documents the code
    #define FLOAT_STRING_SIZE 100
    #define FLOAT_STRING_N 100

    int main(void) {
        // array of character buffers, one per sub-string
        char floats[FLOAT_STRING_N][FLOAT_STRING_SIZE];
        string input1 = get_string();
        char *p = input1;

        int count;
        for (count = 0; count < FLOAT_STRING_N; count++) {
            // skip leading white-space
            while (isspace((unsigned char) *p)) p++;

            // no more
            if (*p == '\0') break;

            int i;
            for (i = 0; i<FLOAT_STRING_SIZE-1; i++) {
                if (isspace((unsigned char) *p) || *p == '\0') break;
                floats[count][i] = *p++; // save character and advance to next one.
            }
            floats[count][i] = '\0';
        }

        for (int c = 0; c < count; c++) {
            puts(floats[c]);
        }

        return 0;
    }
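
If the goal is an array of float values rather than an array of sub-strings, strtof() can do the separating and the converting in one pass, since it skips leading whitespace itself. The following is only a sketch, written in C++ (the same calls exist in C); extractFloats is a hypothetical name, and it simply stops at the first token that is not a number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: stores up to `capacity` floats from `input` in `out`
// and returns how many were stored.
std::size_t extractFloats(char const* input, float* out, std::size_t capacity) {
    assert(input != nullptr && (out != nullptr || capacity == 0));
    std::size_t count = 0;
    char const* p = input;
    while (count < capacity) {
        char* end = nullptr;
        float value = std::strtof(p, &end);  // skips leading white space itself
        if (end == p) break;                 // nothing converted: end of input or bad token
        out[count++] = value;
        p = end;                             // continue after the converted text
    }
    return count;
}
```

With float values[FLOAT_STRING_N]; the call extractFloats(input1, values, FLOAT_STRING_N) would fill the array directly, without the intermediate character buffers.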

C++ getting floats from line

Split the line into components on spaces; now you have an array of strings. Convert each string to a float. Then format each float back into a string using a format specifier (%f) and check whether the string made from the float value equals the original corresponding string.
(e.g. (string)0.012 == @"0.012" but (string)0 != @"v")


(I'm assuming that 'v' converted to a float returns 0. It may return a Unicode value, I suppose, but in any event it won't produce 'v', because that's not a valid float. So this will still work.)


*Split the original file on \n to get the lines; once you have the line you want, split its components on spaces.
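
A C++ sketch of the line and token splitting follows. Note that it swaps in a different validity check from the round trip described above: comparing formatted text is sensitive to formatting (%f prints 0.012 as 0.012000), so the sketch instead accepts a token only if strtof() consumes all of it. The names parseWholeFloat and floatsPerLine are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true only if strtof() consumes the whole token.
bool parseWholeFloat(std::string const& token, float* value) {
    assert(value != nullptr);
    if (token.empty()) return false;
    char* end = nullptr;
    *value = std::strtof(token.c_str(), &end);
    return end == token.c_str() + token.size();
}

// Splits `text` on '\n', then each line on white space, and keeps the
// tokens that are valid floats (so a leading "v" is simply skipped).
std::vector<std::vector<float>> floatsPerLine(std::string const& text) {
    std::vector<std::vector<float>> result;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::vector<float> values;
        std::istringstream tokens(line);
        std::string token;
        while (tokens >> token) {
            float value;
            if (parseWholeFloat(token, &value)) {
                values.push_back(value);
            }
        }
        result.push_back(values);
    }
    return result;
}
```

Be aware that strtof() also accepts spellings such as "inf" and "nan", so add an explicit check if those should be rejected.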

Split float from beginning of string in C

You can use the strtof() function, which has an argument to (optionally) return a pointer to the 'rest' of the input string (after the float has been extracted):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char str[] = "-5.2foo";
        char* rest;
        float f;
        f = strtof(str, &rest);
        printf("%f | %s\n", f, rest);
        // And, if there's nothing left, then nothing will be printed ...
        char str2[] = "-5.2";
        f = strtof(str2, &rest);
        printf("%f | %s\n", f, rest);
        return 0;
    }

From the cppreference page linked above (str_end is the second argument):

The functions sets the pointer pointed to by str_end to point to the
character past the last character interpreted. If str_end is a null
pointer, it is ignored.

If there is nothing 'left' in the input string, then the returned value will point to that string's terminating nul character.
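
One case the example above does not cover is input that does not start with a number at all. In that case strtof() returns 0 and sets the 'rest' pointer back to the start of the string, so comparing the two pointers tells you whether anything was extracted. A small C++ sketch of that check, with the hypothetical names LeadingFloat and splitLeadingFloat:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

struct LeadingFloat {   // hypothetical result type
    bool found;         // false if the string does not start with a number
    float value;        // the extracted value (0 when found is false)
    std::string rest;   // everything after the number
};

LeadingFloat splitLeadingFloat(char const* text) {
    assert(text != nullptr);
    char* rest = nullptr;
    float value = std::strtof(text, &rest);
    LeadingFloat result;
    result.found = (rest != text);   // no conversion leaves rest == text
    result.value = result.found ? value : 0.0f;
    result.rest = rest;
    return result;
}
```

For "-5.2foo" this yields found == true and rest == "foo"; for "foo" it yields found == false and the whole string in rest.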


