How to Split Text into Words

How to split text into words?

Split the text on whitespace, then trim punctuation:

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));

This agrees exactly with the example output.
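A rough Python equivalent of the same idea, in case you are not on .NET (the sample text is shortened here; note that string.punctuation covers ASCII punctuation only):

import string

text = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
# Split on whitespace, then strip punctuation from both ends of each token.
words = [w.strip(string.punctuation) for w in text.split()]
print(words)
# ['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here']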

How do I split a string into a list of words?

Given a string sentence, this stores each word in a list called words:

words = sentence.split()
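For instance, the no-argument split() already collapses runs of any whitespace, including tabs and newlines:

sentence = "the  quick\tbrown\nfox"
words = sentence.split()
print(words)  # ['the', 'quick', 'brown', 'fox']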

Split Strings into words with multiple word boundary delimiters

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
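If you would rather enumerate the delimiters than the word characters, re.split works too; the delimiter class below is just a guess tailored to this sample string:

import re

DATA = "Hey, you - what are you doing here!?"
# Split on runs of whitespace, commas, hyphens, '!' and '?'; re.split
# leaves an empty string when the text ends with delimiters, so drop it.
tokens = [t for t in re.split(r"[\s,\-!?]+", DATA) if t]
print(tokens)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']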

Splitting a text file into words in C

If you want to read from a file, fgets() might come to mind.

If you want to split a line into tokens by a delimiter (whitespace here), strtok() should come to mind.


So, you could do it like this:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *pFile;
    char mystring[100];
    char *pch;

    /* First file: read and echo each line. */
    pFile = fopen("text_newlines.txt", "r");
    if (pFile == NULL) perror("Error opening file");
    else {
        while (fgets(mystring, 100, pFile) != NULL)
            printf("%s", mystring);
        fclose(pFile);
    }

    /* Second file: read each line, echo it, then tokenize it on spaces. */
    pFile = fopen("text_wspaces.txt", "r");
    if (pFile == NULL) perror("Error opening file");
    else {
        while (fgets(mystring, 100, pFile) != NULL) {
            printf("%s", mystring);
            pch = strtok(mystring, " ");
            while (pch != NULL) {
                printf("%s\n", pch);
                pch = strtok(NULL, " ");
            }
        }
        fclose(pFile);
    }

    return 0;
}

Output:

linux25:/home/users/grad1459>./a.out
Milk
Work
Chair
And then she tried to run
And
then
she
tried
to
run

but she was stunned by the view of
but
she
was
stunned
by
the
view
of
//newline here as well

How to Split text by Numbers and Group of words

You can try splitting with this regex:

([\d,]+|[a-zA-Z]+ *[a-zA-Z]*) // note the space between + and *.
  • [0-9,]+ // matches one or more digits and commas
  • [a-zA-Z]+ *[a-zA-Z]* // matches a word, followed by a space (if any), followed by another word (if any)

    String regEx = "[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*";

You can use it like this:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String args[]) {
        String input = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB";
        System.out.println("Return Value :");

        Pattern pattern = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");

        ArrayList<String> result = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            System.out.println(">" + m.group(0) + "<");
            result.add(m.group(0));
        }
    }
}

The following is the output, together with a detailed explanation of the regex, as auto-generated by https://regex101.com:

[Image: regex101.com explanation and match output]

1st Alternative: [0-9,]+
  [0-9,] matches a single character from the list:
    0-9: a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
    ,: matches the character , literally (case sensitive)
  + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

2nd Alternative: [a-zA-Z]+ *[a-zA-Z]*
  [a-zA-Z] matches a single character from the list:
    a-z: a single character in the range between a (index 97) and z (index 122) (case sensitive)
    A-Z: a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
  ' ': matches the space character literally (case sensitive)
  * Quantifier: matches the space between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  [a-zA-Z]*: matches a single character from the same letter ranges, between zero and unlimited times (greedy)
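The pattern carries over to other engines as well. As a quick cross-check (mine, not part of the original answer), the same regex in Python produces the same groups; note that a word followed by a space but no second word, such as argA here, keeps a trailing space in its match because of the ` *`, so stripping each match is a good idea:

import re

input_str = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB"
matches = [m.strip() for m in re.findall(r"[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*", input_str)]
print(matches)
# ['2', 'Marine Cargo', '14,642', '10,528', '16,016', 'more text', '8,609', 'argA', '2,106', 'argB']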

Splitting a text into words using BufferedReader

In the line

words = line.split("\\s+");

you split by a regex, which is much slower than splitting by a single character (about 5 times slower on my machine); see Java split String performances.

If the words are separated by exactly one space, then the solution is simple:

words = line.split(" ");

Just replace your line with this one and your code will run faster.

If words can be separated by several spaces, then add the following line after the loop:

text.remove("");

and still replace your regex split with the one-character split.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Set;
import java.util.TreeSet;

public class Test {
    public static void main(String[] args) throws IOException {
        // The string contains 1, 2, and two spaces between the 1 and the 2; text.size() should be 2.
        String txt = "1  2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1";

        InputStream inpstr = new ByteArrayInputStream(txt.getBytes());

        BufferedReader read = new BufferedReader(new InputStreamReader(inpstr));
        Set<String> text = new TreeSet<>();
        String[] words;
        String line;
        long startTime = System.nanoTime();
        while ((line = read.readLine()) != null) {
            // words = line.split("\\s+"); // runs about 5 times slower
            words = line.split(" ");
            for (int i = 0; i < words.length; i++) {
                text.add(words[i]);
            }
        }
        text.remove(""); // add only if words can be separated by multiple spaces

        long endTime = System.nanoTime();
        System.out.println((endTime - startTime) + " " + text.size());
    }
}

Also, you can replace your for loop with:

text.addAll(Arrays.asList(words));
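The regex-versus-single-character trade-off is not specific to Java. Here is a minimal Python sketch of the same comparison (the data is synthetic and the exact ratio will vary by machine):

import re
import time

line = "1 " * 100000  # one long, single-space-separated line

start = time.perf_counter()
for _ in range(10):
    parts = line.split(" ")   # plain one-character split
print("plain split:", time.perf_counter() - start)

ws = re.compile(r"\s+")
start = time.perf_counter()
for _ in range(10):
    parts = ws.split(line)    # regex split, typically slower
print("regex split:", time.perf_counter() - start)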

How to split text from file into words?

If you count the characters you'll see that the s is the 57th character. 57 is 3 times 19, and 19 is the number of characters parsed in each cycle (20 - 1, as fgets() null-terminates the string and leaves the 20th character to be read by the next call).

As you are reading lines in batches of 19 characters, the line will be cut at every multiple of 19 characters, and the rest will be read by the next fgets() in the cycle.

The first two times you were lucky enough that the line was cut at a space (character 19 at the end of "beginning", character 38 at the end of "tired"); the third time it fell in the middle of "sister", so the word was cut in two.

Two possible fixes:

  • Replace:

    while (fgets(word, N, fileinput)!= NULL)

    With:

    while (fscanf(fileinput, "%19s", word) == 1)

    Provided that there are no words longer than 19 characters in the file, which is the case here.

  • Make word large enough to hold the whole line:

    char word[80];

    80 should be enough for the sample line.

Split text into chunks while keeping words whole

This sounds like a task for the built-in textwrap module. Here is an example using your data:

import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))

output

This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.

You will probably need a few trials to find the width that best suits your needs. If you need a list of strings, use textwrap.wrap instead, i.e. textwrap.wrap(text, 55).
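For instance, textwrap.wrap returns the wrapped lines as a list rather than a single string:

import textwrap

print(textwrap.wrap("the quick brown fox jumps over the lazy dog", 15))
# ['the quick brown', 'fox jumps over', 'the lazy dog']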

How to split text without spaces into list of words

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
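To get a feel for the scale of these costs, here is a quick check (my own illustration, not part of the original answer) for a dictionary of N = 125000 words, the size used below:

from math import log

N = 125000                 # dictionary size, matching the word list below
print(log(1 * log(N)))     # rank-1 (most frequent) word: cost ~ 2.46
print(log(N * log(N)))     # rank-N (rarest) word: cost ~ 14.20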

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.

After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter; otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string, it would be reasonable to split it to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality.
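A minimal sketch of that chunked processing, assuming the infer_spaces function above (the restart-at-a-word-boundary strategy and the parameter names are my own, not part of the original answer):

def infer_spaces_long(s, block=10000, margin=1000):
    """Segment a very long string piecewise, restarting at word boundaries."""
    out = []
    start = 0
    while start < len(s):
        chunk = s[start:start + block + margin]
        words = infer_spaces(chunk).split()
        if start + len(chunk) >= len(s):
            out.extend(words)  # last chunk: everything is final
            break
        # Keep only the words that end inside the block proper; the margin
        # is re-segmented on the next pass, which removes boundary effects.
        kept, consumed = [], 0
        for w in words[:-1]:   # always leave the final word for the next pass
            if consumed + len(w) > block:
                break
            kept.append(w)
            consumed += len(w)
        if not kept:           # defensive: guarantee forward progress
            kept, consumed = words[:1], len(words[0])
        out.extend(kept)
        start += consumed
    return " ".join(out)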


