How do I tokenize a string in C++?
C++ standard library algorithms are almost universally based on iterators rather than concrete containers. Unfortunately, this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that it would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};
while (iss >> str) {
    process(str);
}
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting requires regular expressions. C++ provides std::regex_token_iterator for this purpose in particular:
using namespace std::string_literals; // needed for the "..."s literal
auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
    std::sregex_token_iterator{begin(str), end(str), re, -1},
    std::sregex_token_iterator{}
);
How to Tokenize String without using strtok()
You do not initialize the variable i.
while(inputString[i] != '\0') can be written while(inputString[i]).
testing[inputString[i]]++ makes sense for counting the number of occurrences of a given character from inputString, but it does not make sense to print it. You may want to do something like:
while(1)
{
    /* assumes every word fits in MAX_WORD-1 characters */
    char testing[MAX_WORD], *t=testing;
    while(inputString[i]&&(inputString[i]!=' '))
        *t++=inputString[i++];
    *t='\0'; /* terminate the word before printing it */
    if (t>testing) printf("%s", testing);
    if (!inputString[i]) break;
    i++;
}
It would be better to use the name MAX_WORD_LENGTH instead of MAX_WORD.
These are a few problems in your code.
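The word-copying loop above can also be written without a fixed-size buffer. The following is only a sketch, assuming tokens are separated by any character from a delimiter set; next_token is a hypothetical helper name, not a standard function. It relies on the standard strspn and strcspn functions and leaves the input string unmodified.

```c
#include <stddef.h>
#include <string.h>

/*
 * Finds the next token starting at *cursor without modifying the string.
 * On success, returns a pointer to the token's first character, stores the
 * token length in *len, and advances *cursor past the token.
 * Returns NULL when no tokens remain.
 * (next_token is a hypothetical helper, not part of the C library.)
 */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    /* skip any leading delimiters */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }
    /* the token runs until the next delimiter or the end of the string */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}
```

Called in a loop such as while ((tok = next_token(&p, " ", &n)) != NULL), each token can be printed with printf("%.*s\n", (int)n, tok), since the tokens are not null-terminated.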
Fast way to tokenize a string in c
I don't know about strtok_r, but strtok is probably the fastest way to tokenize a string. Perhaps you were doing it wrong? Maybe that is why it appeared slow for you.
Here is how you tokenize a string in C...
#include <string.h>
#include <stdio.h>

int
main (void)
{
    char string[] = "root.ahmed.andre";
    char *token = strtok (string, ".");
    while (token) {
        // Do what you want with the token here...
        puts (token);
        // Get the next token
        token = strtok (NULL, ".");
    }
}
And just for the sake of argument, the code below tokenizes your string 1,000,000 times and displays how long it took to do so. For me, it took 90 ms. That's blazing fast.
#include <string.h>
#include <stdio.h>
#include <sys/time.h>

int
main (void)
{
    struct timeval tv;
    long int start;
    long int end;
    int i;
    // Get start time in milliseconds
    gettimeofday (&tv, NULL);
    start = (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
    for (i = 0; i < 1000000; i++) {
        char string[] = "root.ahmed.andre";
        char *token = strtok (string, ".");
        while (token) {
            token = strtok (NULL, ".");
        }
    }
    // Get end time in milliseconds
    gettimeofday (&tv, NULL);
    end = (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
    // Print execution time in milliseconds
    printf ("\nDone in %ld ms!\n\n", end - start);
    return 0;
}
Tokenizing a String - C
strtok stores "the point where the last token was found":
"The point where the last token was found is kept internally by the function to be used on the next call (particular library implementations are not required to avoid data races)."
-- reference
That's why you can call it with NULL the second time.
So calling it again with a different pointer inside your loop makes you lose the state of the initial call (meaning tk = strtok(NULL, "\r\n") will be NULL by the end of the while, because it will be using the state of the inner loops).
So the solution is probably to change the last line of the while from:
tk = strtok(NULL, "\r\n");
to something like (please check the bounds first; the pointer should not go past buf + strlen(buf)):
tk = strtok(tk + strlen(tk) + 1, "\r\n");
Or use strtok_r, which stores the state externally (like in this answer).
// first call
char *saveptr1;
tk = strtok_r(buf, "\r\n", &saveptr1);
while(tk != NULL) {
    //...
    tk = strtok_r(NULL, "\r\n", &saveptr1);
}
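To make the nested case concrete, here is a minimal sketch, assuming the outer loop splits on "\r\n" and the inner loop splits each line on spaces (the inner delimiter and the function name print_words_by_line are assumptions for illustration). Each level gets its own save pointer, so the inner loop cannot disturb the outer one. strtok_r is a POSIX function, not part of ISO C.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* asks POSIX systems to declare strtok_r */
#endif
#include <stdio.h>
#include <string.h>

/*
 * Splits buf into lines on "\r\n" and each line into words on spaces,
 * printing every word. Returns the total number of words found.
 * buf is modified in place, as with strtok.
 * (print_words_by_line is a hypothetical name used for illustration.)
 */
int print_words_by_line(char *buf)
{
    int words = 0;
    char *line_state; /* state of the outer (line) tokenizer */
    char *word_state; /* state of the inner (word) tokenizer */

    for (char *line = strtok_r(buf, "\r\n", &line_state);
         line != NULL;
         line = strtok_r(NULL, "\r\n", &line_state)) {
        for (char *word = strtok_r(line, " ", &word_state);
             word != NULL;
             word = strtok_r(NULL, " ", &word_state)) {
            printf("[%s] ", word);
            words++;
        }
        putchar('\n');
    }
    return words;
}
```

With plain strtok in both loops, the outer loop would stop after the first line, because the inner calls overwrite the single hidden state.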
Tokenizing a string in C?
Try this for size...
#include <stdio.h>
#include <ctype.h>

typedef char * string;

int main(int argc, char *argv[])
{
    string inputS = argv[1];
    string input[50]; /* Up to 50 tokens */
    char buffer[200];
    int i;
    int strnum = 0;
    char *next = buffer;
    char c;
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s expression\n", argv[0]);
        return 1;
    }
    printf("input: <<%s>>\n", inputS);
    printf("parsing:\n");
    while ((c = *inputS++) != '\0')
    {
        input[strnum++] = next;
        /* isdigit expects an unsigned char value, hence the casts */
        if (isdigit((unsigned char)c))
        {
            printf("Digit: %c\n", c);
            *next++ = c;
            while (isdigit((unsigned char)*inputS))
            {
                c = *inputS++;
                printf("Digit: %c\n", c);
                *next++ = c;
            }
            *next++ = '\0';
        }
        else
        {
            printf("Non-digit: %c\n", c);
            *next++ = c;
            *next++ = '\0';
        }
    }
    printf("parsed:\n");
    for (i = 0; i < strnum; i++)
    {
        printf("%d: <<%s>>\n", i, input[i]);
    }
    return 0;
}
Given the program is called tokenizer and the command:
tokenizer '(3+2)*564/((3+4)*2)'
It gives me the output:
input: <<(3+2)*564/((3+4)*2)>>
parsing:
Non-digit: (
Digit: 3
Non-digit: +
Digit: 2
Non-digit: )
Non-digit: *
Digit: 5
Digit: 6
Digit: 4
Non-digit: /
Non-digit: (
Non-digit: (
Digit: 3
Non-digit: +
Digit: 4
Non-digit: )
Non-digit: *
Digit: 2
Non-digit: )
parsed:
0: <<(>>
1: <<3>>
2: <<+>>
3: <<2>>
4: <<)>>
5: <<*>>
6: <<564>>
7: <</>>
8: <<(>>
9: <<(>>
10: <<3>>
11: <<+>>
12: <<4>>
13: <<)>>
14: <<*>>
15: <<2>>
16: <<)>>
String tokenization in C
The problem is that strtok uses the second argument as a set of characters to tokenize on. So the string " /" will tokenize on either a space ' ' or a slash '/', not on the full string.
That means name will be pointing to the single string "Mustafa" while phone points to "Baki" and note1 points to "Phone:123456789".
You should use only the slash "/" in the initial calls to strtok. Then, if needed, strip trailing spaces from the strings.
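A minimal sketch of that approach follows, assuming a record whose fields are separated by slashes with optional spaces around them. The helper names trim_spaces and split_record are hypothetical, and the exact input format of the original question is not known here.

```c
#include <string.h>

/* Removes leading and trailing spaces from s in place and returns a pointer
 * to the trimmed text. (trim_spaces is a hypothetical helper.) */
char *trim_spaces(char *s)
{
    s += strspn(s, " ");                  /* skip leading spaces */
    size_t len = strlen(s);
    while (len > 0 && s[len - 1] == ' ')
        s[--len] = '\0';                  /* cut trailing spaces */
    return s;
}

/* Splits record on '/' only, trims each field, and stores up to max_fields
 * pointers in fields. Returns the number of fields stored.
 * record is modified in place. (split_record is a hypothetical helper.) */
size_t split_record(char *record, char *fields[], size_t max_fields)
{
    size_t n = 0;
    for (char *tok = strtok(record, "/");
         tok != NULL && n < max_fields;
         tok = strtok(NULL, "/"))
        fields[n++] = trim_spaces(tok);
    return n;
}
```

Because the delimiter set contains only the slash, a field such as "Mustafa Baki" stays in one piece, and the spaces next to each slash are removed afterwards.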
Tokenizing a string and return it as an array
Well, apart from the fact that this code cannot handle more than 3 tokens, it has another basic problem: it returns an invalid pointer. temp and tokens are variables within the stack frame of the parseString() function, so when its execution finishes, those variables are gone. The ideal solution here is to allocate tokens on the heap.
Here is my solution:
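#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char** parseString(char* cmd)
{
    char delimiters[] = " ";
    char* temp = strtok(cmd, delimiters);
    //If temp is NULL then the string contains no tokens
    if (temp == NULL)
    {
        return NULL;
    }
    else
    {
        int i = 0;
        int count;
        char** tokens = malloc(3*sizeof(char*));
        if (tokens == NULL)
        {
            return NULL;
        }
        //Stop after 3 tokens so we never write past the array
        while (temp != NULL && i < 3)
        {
            tokens[i++] = temp;
            temp = strtok(NULL, delimiters);
        }
        //Print only the slots that were actually filled
        count = i;
        for (i = 0; i < count; i++)
        {
            printf("%s\n", tokens[i]);
        }
        //The caller is responsible for calling free(tokens)
        return tokens;
    }
}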
Tokenizing an array of string in c
A char ** can point to an array of char *, which is what you want to return.
In your loop, you want to assign the return value of strtok to elements of the array that new_arr points to. You also want to allocate space for elements of size sizeof(char *), not sizeof(char):
char ** new_arr = malloc(sizeof(char *) * n_tokens);
if(new_arr == NULL)
    return NULL;
new = strtok(str, " ");
while(new != NULL)
{
    printf("%s\n", new);
    new_arr[i++] = new;
    new = strtok(NULL, " ");
}
return new_arr;
This works under the assumption that the value of n_tokens is correct. If you don't know how many tokens there are, you can still do this by using realloc to expand the size of the array:
char ** new_arr = malloc(sizeof(char *));
if(new_arr == NULL)
    return NULL;
new = strtok(str, " ");
while(new != NULL)
{
    printf("%s\n", new);
    new_arr[i++] = new;
    /* use a temporary so the old block is not leaked if realloc fails */
    char ** tmp = realloc(new_arr, sizeof(char *) * (i + 1));
    if (tmp == NULL) {
        free(new_arr);
        return NULL;
    }
    new_arr = tmp;
    new = strtok(NULL, " ");
}
return new_arr;
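The fragment above leaves i, new and str to the surrounding function. As a self-contained sketch of the same realloc idea, the version below assumes space-separated tokens, doubles the capacity instead of growing by one element, and terminates the array with NULL so the caller can iterate without a separate count. The name split_words and the count parameter are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Splits str on spaces and returns a heap-allocated, NULL-terminated array
 * of pointers into str (str is modified in place). The caller releases the
 * array with free(); the tokens themselves live inside str.
 * Returns NULL on allocation failure. *count receives the token count.
 * (split_words is a hypothetical name used for illustration.)
 */
char **split_words(char *str, size_t *count)
{
    size_t used = 0, cap = 4;
    char **arr = malloc(cap * sizeof *arr);
    if (arr == NULL)
        return NULL;

    for (char *tok = strtok(str, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (used + 1 >= cap) {          /* keep one slot for the NULL terminator */
            cap *= 2;
            char **tmp = realloc(arr, cap * sizeof *arr);
            if (tmp == NULL) {          /* arr is still valid here, so free it */
                free(arr);
                return NULL;
            }
            arr = tmp;
        }
        arr[used++] = tok;
    }
    arr[used] = NULL;
    *count = used;
    return arr;
}
```

Doubling the capacity keeps the number of realloc calls logarithmic in the number of tokens, which is a design choice rather than a requirement.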