How to Tokenize a String in C++

How do I tokenize a string in C++?

C++ standard library algorithms are almost universally based on iterators rather than concrete containers. Unfortunately, this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that it would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.

Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.

At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.

A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:

auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};

while (iss >> str) {
    process(str);
}

Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.

Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.

More advanced splitting requires regular expressions. C++ provides std::regex_token_iterator for this purpose:

auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
    std::sregex_token_iterator{begin(str), end(str), re, -1},
    std::sregex_token_iterator{}
);

How to Tokenize String without using strtok()

  1. You do not initialize the variable i.

  2. while(inputString[i] != '\0') can be written while(inputString[i]).

  3. testing[inputString[i]]++ makes sense to count the number of occurrences of a given character from inputString, but it does not make sense to print it. You may want to do something like:

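    /* assumes i has been initialized to 0 and MAX_WORD >= 2 */
    while(1)
    {
        char testing[MAX_WORD], *t=testing;
        /* copy one word, leaving room for the terminating '\0' */
        while(inputString[i]&&(inputString[i]!=' ')&&(t<testing+MAX_WORD-1))
            *t++=inputString[i++];
        *t='\0'; /* printf("%s") needs a terminated string */
        if (t>testing) printf("%s\n", testing);
        if (!inputString[i]) break;
        if (inputString[i]==' ') i++; /* skip the separator, not a word character */
    }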
  4. It would be clearer to name the constant MAX_WORD_LENGTH instead of MAX_WORD.

These are a few problems in your code.
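For a version that can be compiled on its own, here is a minimal sketch of tokenizing on spaces without strtok(), using strspn() and strcspn() from <string.h>. The function name next_word and its parameters are made up for this illustration; they are not from the question. Unlike strtok(), it does not modify the input string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies the next space-separated word starting at
 * *cursor into out (truncating to out_size - 1 characters) and advances
 * *cursor past it. Returns 1 if a word was found, 0 at end of string. */
static int next_word(const char **cursor, char *out, size_t out_size)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && out_size > 0);

    const char *p = *cursor + strspn(*cursor, " ");  /* skip leading spaces */
    size_t len = strcspn(p, " ");                    /* length of the word */
    if (len == 0)
        return 0;                                    /* nothing left */

    size_t copy = len < out_size - 1 ? len : out_size - 1;
    memcpy(out, p, copy);
    out[copy] = '\0';

    *cursor = p + len;                               /* continue after the word */
    return 1;
}
```

A caller would set a const char * cursor to the start of the input and call next_word(&cursor, word, sizeof word) in a while loop, printing word each time, until it returns 0.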

Fast way to tokenize a string in c

I don't know about strtok_r, but strtok is probably the fastest way to tokenize a string. Perhaps you were doing it wrong? Maybe that is why it appeared slow for you.

Here is how you tokenize a string in C...

#include <string.h>
#include <stdio.h>

int
main (void)
{
    char string[] = "root.ahmed.andre";
    char *token = strtok (string, ".");

    while (token) {
        // Do what you want with the token here...
        puts (token);

        // Get the next token
        token = strtok (NULL, ".");
    }
}

And just for the sake of argument, the code below tokenizes your string 1,000,000 times and displays how long it took to do so. For me, it took 90 ms. That's blazing fast.

#include <string.h>
#include <stdio.h>
#include <sys/time.h>

int
main (void)
{
    struct timeval tv;
    long int start;
    long int end;
    int i;

    // Get start time in milliseconds
    gettimeofday (&tv, NULL);
    start = (tv.tv_sec * 1000) + (tv.tv_usec / 1000);

    for (i = 0; i < 1000000; i++) {
        char string[] = "root.ahmed.andre";
        char *token = strtok (string, ".");

        while (token) {
            token = strtok (NULL, ".");
        }
    }

    // Get end time in milliseconds
    gettimeofday (&tv, NULL);
    end = (tv.tv_sec * 1000) + (tv.tv_usec / 1000);

    // Print execution time in milliseconds
    printf ("\nDone in %ld ms!\n\n", end - start);

    return 0;
}

Tokenizing a String - C

strtok stores "the point where the last token was found":

"The point where the last token was found is kept internally by the function to be used on the next call (particular library implementations are not required to avoid data races)."
-- reference

That's why you can call it with NULL the second time.

So calling it again with a different pointer inside your loop makes you lose the state of the initial call (meaning tk = strtok(NULL, "\r\n") will return NULL at the end of the while body, because it continues from the state left by the inner loop).

So the solution is probably to change the last line of the while from:

tk = strtok(NULL, "\r\n");

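to something like the following, where len is strlen(tk) saved before the inner loop runs (the inner strtok calls overwrite delimiters inside tk with '\0', so strlen(tk) may be shorter afterwards). Please check the bounds first: tk + len + 1 must not go past the end of the original buffer, that is buf + strlen(buf) as measured before the first strtok call:

tk = strtok(tk + len + 1, "\r\n");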

Or use strtok_r, which stores the state externally (like in this answer).

// first call
char *saveptr1;
tk = strtok_r(buf, "\r\n", &saveptr1);
while(tk != NULL) {
    //...
    tk = strtok_r(NULL, "\r\n", &saveptr1);
}
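To make the strtok_r suggestion concrete, here is a small sketch of nested tokenization: an outer loop over lines and an inner loop over space-separated words, each with its own save pointer. The function name count_words_per_line and the choice of a space as the inner delimiter are assumptions for illustration, since the original inner loop is not shown. strtok_r is POSIX, not ISO C, so this assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L  /* for strtok_r on strict compilers */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits buf into lines on "\r\n" characters, then each
 * line into words on spaces. Returns the total number of words and stores
 * the number of non-empty lines in *line_count. Modifies buf in place. */
static int count_words_per_line(char *buf, int *line_count)
{
    char *line_state;  /* state of the outer (line) tokenizer */
    char *word_state;  /* state of the inner (word) tokenizer */
    int words = 0;

    assert(buf != NULL && line_count != NULL);
    *line_count = 0;

    for (char *line = strtok_r(buf, "\r\n", &line_state);
         line != NULL;
         line = strtok_r(NULL, "\r\n", &line_state)) {
        ++*line_count;
        /* The inner loop uses its own state, so the outer one is untouched */
        for (char *word = strtok_r(line, " ", &word_state);
             word != NULL;
             word = strtok_r(NULL, " ", &word_state)) {
            printf("line %d: %s\n", *line_count, word);
            ++words;
        }
    }
    return words;
}
```

Because each loop keeps its position in a separate char * variable, the outer call with NULL resumes after the current line no matter what the inner loop did to it.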

Tokenizing a string in C?

Try this for size...

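#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef char * string;

int main(int argc, char *argv[])
{
    string inputS;
    string input[50];   /* Up to 50 tokens */
    char buffer[200];   /* Storage for the null-terminated tokens */
    int i;
    int strnum = 0;
    char *next = buffer;
    char c;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s expression\n", argv[0]);
        return 1;
    }
    inputS = argv[1];   /* Only read argv[1] once we know it exists */

    /* Each character needs at most 2 bytes of buffer (itself plus a '\0') */
    if (strlen(inputS) > 99)
    {
        fprintf(stderr, "Expression too long\n");
        return 1;
    }

    printf("input: <<%s>>\n", inputS);
    printf("parsing:\n");

    /* Stop at 50 tokens so input[] cannot overflow */
    while (strnum < 50 && (c = *inputS++) != '\0')
    {
        input[strnum++] = next;
        if (isdigit((unsigned char)c))
        {
            /* A run of digits forms a single token */
            printf("Digit: %c\n", c);
            *next++ = c;
            while (isdigit((unsigned char)*inputS))
            {
                c = *inputS++;
                printf("Digit: %c\n", c);
                *next++ = c;
            }
            *next++ = '\0';
        }
        else
        {
            /* Any other character is a one-character token */
            printf("Non-digit: %c\n", c);
            *next++ = c;
            *next++ = '\0';
        }
    }

    printf("parsed:\n");
    for (i = 0; i < strnum; i++)
    {
        printf("%d: <<%s>>\n", i, input[i]);
    }

    return 0;
}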

Given the program is called tokenizer and the command:

tokenizer '(3+2)*564/((3+4)*2)'

It gives me the output:

input: <<(3+2)*564/((3+4)*2)>>
parsing:
Non-digit: (
Digit: 3
Non-digit: +
Digit: 2
Non-digit: )
Non-digit: *
Digit: 5
Digit: 6
Digit: 4
Non-digit: /
Non-digit: (
Non-digit: (
Digit: 3
Non-digit: +
Digit: 4
Non-digit: )
Non-digit: *
Digit: 2
Non-digit: )
parsed:
0: <<(>>
1: <<3>>
2: <<+>>
3: <<2>>
4: <<)>>
5: <<*>>
6: <<564>>
7: <</>>
8: <<(>>
9: <<(>>
10: <<3>>
11: <<+>>
12: <<4>>
13: <<)>>
14: <<*>>
15: <<2>>
16: <<)>>

String tokenization in C

The problem is that strtok treats its second argument as a set of delimiter characters, not as a single delimiter string. So " /" splits on either a space ' ' or a slash '/', never on the two-character sequence as a whole.

That means name will be pointing to the single string "Mustafa" while phone points to "Baki" and note1 points to "Phone:123456789".

You should use only the slash "/" in the initial calls to strtok. Then, if needed, strip leading and trailing spaces from each token.
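As a rough sketch of that advice, the code below splits on "/" only and then trims whitespace around each field. The helper names trim and split_record are invented for this example, and the exact record layout in the question is not known here, so the layout used in the usage note is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: strips leading and trailing whitespace in place and
 * returns a pointer to the first non-space character. */
static char *trim(char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical helper: splits record on '/' only, trims every field and
 * stores up to max_fields pointers in fields. Returns the number stored.
 * Modifies record in place. */
static int split_record(char *record, char *fields[], int max_fields)
{
    int n = 0;

    assert(record != NULL && fields != NULL);
    for (char *tok = strtok(record, "/");
         tok != NULL && n < max_fields;
         tok = strtok(NULL, "/"))
        fields[n++] = trim(tok);
    return n;
}
```

With a writable record such as "Mustafa Baki / 123456789 / call back", fields[0] keeps the full name "Mustafa Baki" because the space inside it is no longer a delimiter.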

Tokenizing a string and return it as an array

Well, apart from the fact that this code cannot handle more than 3 tokens, it has another basic problem: it returns a pointer to memory that is no longer valid. temp and tokens are local variables in the stack frame of the parseString() function, so they are gone once its execution finishes. The usual solution here is to allocate tokens on the heap.

Here is my solution:

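#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TOKENS 3

char** parseString(char* cmd)
{
    char delimiters[] = " ";
    char* temp = strtok(cmd, delimiters);
    int i = 0;
    char** tokens;

    //If temp is NULL then the string contains no tokens
    if (temp == NULL)
    {
        return NULL;
    }

    //One extra slot for a terminating NULL; the caller must free() the array
    tokens = malloc((MAX_TOKENS + 1) * sizeof(char*));
    if (tokens == NULL)
    {
        return NULL;
    }

    //Stop at MAX_TOKENS so we never write past the end of the array
    while (temp != NULL && i < MAX_TOKENS)
    {
        printf("%s\n", temp);
        tokens[i++] = temp;
        temp = strtok(NULL, delimiters);
    }
    tokens[i] = NULL;
    return tokens;
}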

Tokenizing an array of string in c

A char ** can point to an array of char *, which is what you want to return.

In your loop, you want to assign the return value of strtok to elements of the array that new_arr points to. You also want to allocate space for elements of size sizeof(char *), not sizeof(char):

char ** new_arr = malloc(sizeof(char *) * n_tokens);
if(new_arr == NULL)
    return NULL;

new = strtok(str, " ");
while(new != NULL)
{
    printf("%s\n", new);
    new_arr[i++] = new;
    new = strtok(NULL, " ");
}

return new_arr;

This works under the assumption that the value of n_tokens is correct. If you don't know how many tokens there are, you can still do this by using realloc to grow the array:

char ** new_arr = malloc(sizeof(char *));
if(new_arr == NULL)
    return NULL;

new = strtok(str, " ");
while(new != NULL)
{
    printf("%s\n", new);
    new_arr[i++] = new;
    /* Use a temporary so the old block is not leaked if realloc fails */
    char ** tmp = realloc(new_arr, sizeof(char *) * (i + 1));
    if (tmp == NULL) {
        free(new_arr);
        return NULL;
    }
    new_arr = tmp;
    new = strtok(NULL, " ");
}

return new_arr;
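Putting the pieces together, here is one possible self-contained version. It is a sketch, not the only way to do it: the name split_words, the initial capacity of 4, the doubling strategy and the NULL terminator are all choices made for this example. The returned pointers point into str, so the caller frees only the array, not the individual tokens.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits str on spaces in place and returns a
 * malloc'd, NULL-terminated array of pointers into str, or NULL if an
 * allocation fails. If count is not NULL, it receives the token count. */
static char **split_words(char *str, size_t *count)
{
    size_t cap = 4;  /* current capacity, including the terminator slot */
    size_t n = 0;    /* number of tokens stored so far */
    char **tokens;

    assert(str != NULL);
    tokens = malloc(cap * sizeof *tokens);
    if (tokens == NULL)
        return NULL;

    for (char *tok = strtok(str, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (n + 1 == cap) {  /* keep one slot free for the NULL */
            size_t new_cap = cap * 2;
            char **tmp = realloc(tokens, new_cap * sizeof *tokens);
            if (tmp == NULL) {
                free(tokens);  /* realloc failed, old block still valid */
                return NULL;
            }
            tokens = tmp;
            cap = new_cap;
        }
        tokens[n++] = tok;
    }

    tokens[n] = NULL;
    if (count != NULL)
        *count = n;
    return tokens;
}
```

Doubling the capacity keeps the number of realloc calls small compared with growing by one element per token, and the NULL terminator lets a caller iterate without knowing the count.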

