How to Split Text Without Spaces into List of Words

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
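As a rough illustration of the Zipf assumption, the sketch below turns a ranked word list into approximate probabilities and costs. The five-word list is hypothetical and far too small to be realistic; for such a short list the values do not sum to 1, since the approximation only becomes reasonable for large N.

```python
from math import log

# Hypothetical word list, ordered from most to least frequent.
words = ["the", "of", "and", "to", "in"]
N = len(words)

# Zipf approximation: the word with rank n has probability ~ 1 / (n * log N).
prob = {w: 1.0 / ((rank + 1) * log(N)) for rank, w in enumerate(words)}

# Cost is the negative log-probability, i.e. log(n * log N).
cost = {w: -log(p) for w, p in prob.items()}

for w in words:
    print(w, round(prob[w], 3), round(cost[w], 3))
```

More frequent words get a higher probability and therefore a lower cost.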

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it's easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the inverse of the probability, which avoids floating-point underflow.
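To see why costs are summed instead of probabilities multiplied, here is a small sketch. The probability 1e-5 and the count of 100 words are made-up numbers chosen only to trigger the effect.

```python
from math import log

p = 1e-5   # assumed probability of each word (illustrative only)
n = 100    # assumed number of words in the sentence

# Multiplying probabilities: the true value 1e-500 is below the smallest
# positive double, so the product collapses to 0.0.
product = 1.0
for _ in range(n):
    product *= p

# Summing costs (negative log-probabilities) stays in a comfortable range.
total_cost = sum(-log(p) for _ in range(n))

print(product)      # 0.0
print(total_cost)   # roughly 1151.3
```

Minimizing the summed cost selects the same segmentation as maximizing the product, because the logarithm is monotonic.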

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = f.read().split()
wordcost = {k: log((i+1)*log(len(words))) for i, k in enumerate(words)}
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
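If you want to try the idea without the words-by-frequency.txt file, the following is a self-contained sketch of the same dynamic program using a tiny in-memory word list. The names make_splitter and toy_words are made up for this illustration, and the sketch stores the length of the best last word at each position instead of recomputing it while backtracking.

```python
from math import log

def make_splitter(words):
    """Return an infer_spaces function for a word list ordered by frequency."""
    # Zipf-based cost: cost = log(rank * log(N)).
    wordcost = {w: log((i + 1) * log(len(words))) for i, w in enumerate(words)}
    maxword = max(len(w) for w in words)

    def infer_spaces(s):
        cost = [0.0]   # cost[i] = minimal cost of segmenting s[:i]
        length = [0]   # length[i] = length of the last word in that segmentation
        for i in range(1, len(s) + 1):
            c, k = min(
                (cost[j] + wordcost.get(s[j:i], float("inf")), i - j)
                for j in range(max(0, i - maxword), i)
            )
            cost.append(c)
            length.append(k)
        # Walk back through the stored lengths to recover the words.
        out = []
        i = len(s)
        while i > 0:
            out.append(s[i - length[i]:i])
            i -= length[i]
        return " ".join(reversed(out))

    return infer_spaces

# Hypothetical toy list, most frequent first; a real list would be far larger.
toy_words = ["a", "the", "green", "apple", "thumb", "active", "app", "le"]
infer = make_splitter(toy_words)
print(infer("thumbgreenappleactive"))  # thumb green apple active
```

Here "apple" wins over "app" + "le" because one moderately frequent word is cheaper than two rarer ones.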

The results

I am using a quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.

After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see, it is essentially flawless. The most important part is to make sure your word list was built from a corpus similar to what you will actually encounter; otherwise the results will be very bad.

Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
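One possible way to shrink the candidate set is sketched below. Note that it uses a plain trie over reversed words, not a full suffix tree, and the function names are invented for this example. Walking backwards from position i, the lookup stops as soon as no dictionary word can end with the characters seen so far.

```python
END = "$"  # marker key for "a word ends here"; assumes no word contains "$"

def build_reversed_trie(words):
    """Build a nested-dict trie in which each word is stored back to front."""
    root = {}
    for w in words:
        node = root
        for ch in reversed(w):
            node = node.setdefault(ch, {})
        node[END] = w
    return root

def words_ending_at(s, i, trie):
    """Return all dictionary words that end exactly at position i of s."""
    found = []
    node = trie
    for j in range(i - 1, -1, -1):
        node = node.get(s[j])
        if node is None:
            break  # no word ends with s[j:i], so stop early
        if END in node:
            found.append(node[END])
    return found

trie = build_reversed_trie(["in", "fin", "muffin", "are"])
print(words_ending_at("muffin", 6, trie))  # ['in', 'fin', 'muffin']
```

Inside best_match, only the lengths of these words would need to be considered instead of every length up to maxword.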

If you need to process a very long contiguous string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality.
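The block layout described above could be computed along these lines. This is only a sketch of the windowing; the function name blocks is made up, and how the per-block results are stitched together at the core boundaries is left open.

```python
def blocks(n, block=10000, margin=1000):
    """Yield (lo, hi, core_lo, core_hi) index tuples for a string of length n.

    s[lo:hi] is the slice to segment; s[core_lo:core_hi] is the part whose
    result is kept, with up to `margin` extra characters on either side.
    """
    for core_lo in range(0, n, block):
        core_hi = min(core_lo + block, n)
        lo = max(0, core_lo - margin)
        hi = min(n, core_hi + margin)
        yield lo, hi, core_lo, core_hi

for window in blocks(25000):
    print(window)
# (0, 11000, 0, 10000)
# (9000, 21000, 10000, 20000)
# (19000, 25000, 20000, 25000)
```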

How to split words without spaces into a sentence in which each word starts with a capital letter, using JavaScript

You could use a regular expression:

function splitWords(s) {
  return s.match(/.[^A-Z]*/g).map((word, i) =>
    (i ? word[0].toLowerCase() : word[0].toUpperCase()) + word.slice(1)
  ).join(" ");
}

let words = splitWords("howToMakeALiving");
console.log(words);
words = splitWords("WhyIHadMyDNATakenInParis");
console.log(words);
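For comparison, the same regular-expression idea can be sketched in Python. The function name split_words is made up here; the pattern is the one used above.

```python
import re

def split_words(s):
    # Each match is one character followed by any run of non-uppercase
    # characters, so a new piece starts at every capital letter.
    parts = re.findall(r".[^A-Z]*", s)
    # Capitalize the first piece and lowercase the start of the others.
    fixed = [
        (p[0].lower() if i else p[0].upper()) + p[1:]
        for i, p in enumerate(parts)
    ]
    return " ".join(fixed)

print(split_words("howToMakeALiving"))          # How to make a living
print(split_words("WhyIHadMyDNATakenInParis"))  # Why i had my d n a taken in paris
```

As the second call shows, this approach splits acronyms such as "DNA" into single letters.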

Parse a string without spaces into an array of individual words

Although there are cases where multiple interpretations are possible and picking the best one can be difficult, you can always approach it with a fairly naïve algorithm like this:

WORDS = %w[
  blueberry
  blue
  berry
  fin
  fins
  muffin
  muffins
  are
  insane
  insanely
  in
  delicious
  deli
  us
].sort_by do |word|
  [ -word.length, word ]
end

WORD_REGEXP = Regexp.union(*WORDS)

def best_fit(string)
  string.scan(WORD_REGEXP)
end

This will parse your example:

best_fit("blueberrymuffinsareinsanelydelicious")
# => ["blueberry", "muffins", "are", "insanely", "delicious"]

Note that this skips any non-matching components.
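A rough Python counterpart of the same greedy, longest-first approach might look like this (a sketch using the same word list; re.escape is added in case a word contains regex metacharacters):

```python
import re

WORDS = [
    "blueberry", "blue", "berry", "fin", "fins", "muffin", "muffins",
    "are", "insane", "insanely", "in", "delicious", "deli", "us",
]
# Longest words first, so the alternation prefers the longest match.
WORDS.sort(key=lambda w: (-len(w), w))

WORD_RE = re.compile("|".join(re.escape(w) for w in WORDS))

def best_fit(string):
    return WORD_RE.findall(string)

print(best_fit("blueberrymuffinsareinsanelydelicious"))
# ['blueberry', 'muffins', 'are', 'insanely', 'delicious']
print(best_fit("bluexyzberry"))
# ['blue', 'berry']
```

The second call illustrates the skipping behavior: "xyz" matches nothing and is silently dropped.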

Splitting long string without spaces into words with certain length

Use the split function, which creates a list of tokens based on the provided delimiter. You can provide a '\n' delimiter, something like this:

with open('input.txt', 'r') as file:
    data = file.read()
separated_list = data.split('\n')
print(separated_list)

output:

['abc', 'def', 'hij']
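If the input has no newlines and the goal is pieces of a fixed length, as the question title suggests, slicing is one simple option. This is a sketch under that assumption; the helper name chunks is made up.

```python
def chunks(text, size):
    """Split text into consecutive pieces of `size` characters.

    The last piece is shorter when len(text) is not a multiple of size.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]

print(chunks("abcdefhij", 3))  # ['abc', 'def', 'hij']
print(chunks("abcdefg", 3))    # ['abc', 'def', 'g']
```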

Split a text without whitespaces by words in list

Use a one-liner that employs a (massive) look-behind built from allLists to insert a space after each word:

str = str.replaceAll("(?<=" + String.join("|", allLists) + ")", " ");

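Note that a look-behind is tested at every position, so the order of words in allLists does not stop a shorter word from matching inside a longer one: if both "book" and "booking" are in your list, "booking" becomes "book ing " whichever comes first. If you want longer words to take preference, match the words themselves instead of using a look-behind, with the longer words listed first, e.g. str.replaceAll("(" + String.join("|", allLists) + ")", "$1 ").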
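A comparable sketch in Python follows. Python's re module rejects look-behinds whose alternatives differ in length, so this version matches the words themselves, sorted longest first, and appends a space to each match. The function name insert_spaces is made up for illustration.

```python
import re

def insert_spaces(text, words):
    # Longest words first, so "booking" is tried before "book".
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(w) for w in ordered) + ")")
    # Put a space after every matched word, then drop the trailing one.
    return pattern.sub(r"\1 ", text).strip()

print(insert_spaces("bookingbook", ["book", "booking"]))  # booking book
```

Characters that belong to no listed word are left in place, attached to the text that follows them.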

How to split string without spaces into list of integers in Python?

You don't need to use split here. In Python 2:

>>> a = "12345"
>>> map(int, a)
[1, 2, 3, 4, 5]

Strings are iterable too.

For Python 3.x, where map returns an iterator:

list(map(int, a))
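An equivalent list comprehension, which behaves the same in Python 2 and Python 3, could be written as:

```python
a = "12345"

# Iterate over the characters of the string and convert each one.
digits = [int(ch) for ch in a]

print(digits)  # [1, 2, 3, 4, 5]
```

Either form raises ValueError if the string contains a character that is not a digit.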

How to split a string without spaces into an array of words?

One way might be to use a Map with the names as keys and the numbers as values.

Then extract the keys from the map, order them so that the longest string comes first, and create a regex with a capturing group and an alternation.

The regex would eventually look like:

(three|seven|eight|four|five|nine|one|two|six|o)

Then split the string using this regex. Map over the items removing all non digits when the map does not contain the key and remove all empty values from the array.

Finally get the value from the map by using the key.

let map = new Map([
  ["o", 0], ["one", 1], ["two", 2], ["three", 3], ["four", 4],
  ["five", 5], ["six", 6], ["seven", 7], ["eight", 8], ["nine", 9]
]);
let regex = new RegExp("(" + [...map.keys()]
  .sort((a, b) => b.length - a.length)
  .join('|') + ")");
let strings = [
  "69ooooneotwonine",
  "o",
  "testninetest",
  "10001",
  "7xxxxxxx6fivetimesfifefofourt",
  "test"
].map(s =>
  s.split(regex)
    .map(x => !map.has(x) ? x.replace(/\D+/, '') : x)
    .filter(Boolean)
    .map(x => map.has(x) ? map.get(x) : x)
    .join(''));
console.log(strings);
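The same steps can be sketched in Python, where re.split with a capturing group also keeps the matched names in the result. The names DIGITS and to_digits are made up for this port; count=1 mirrors the JavaScript replace without the g flag.

```python
import re

DIGITS = {"o": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

# Longest names first; sorted() is stable, so ties keep insertion order.
PATTERN = re.compile(
    "(" + "|".join(sorted(DIGITS, key=len, reverse=True)) + ")"
)

def to_digits(s):
    out = []
    for part in PATTERN.split(s):
        if part in DIGITS:
            out.append(str(DIGITS[part]))                    # name -> digit
        else:
            out.append(re.sub(r"\D+", "", part, count=1))    # drop non-digits
    return "".join(out)

for s in ["69ooooneotwonine", "o", "testninetest", "10001",
          "7xxxxxxx6fivetimesfifefofourt", "test"]:
    print(repr(to_digits(s)))
```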

Splitting and joining words in a string to remove extra spaces in between words

Try using the title() function!

name = "banAna   sPlit"
name = name.lower()
name = name.split()

array = []
for i in name: 
    array.append(i.title())

name = " ".join(array)

print(name)

This also removes the extra whitespace between words, because split() with no arguments splits on any run of whitespace!
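The same result can be written more compactly. This is a sketch; the separate lower() call is dropped because title() already lowercases the remaining letters of each word.

```python
name = "banAna   sPlit"

# split() with no arguments collapses runs of whitespace;
# title() capitalizes the first letter and lowercases the rest.
cleaned = " ".join(word.title() for word in name.split())

print(cleaned)  # Banana Split
```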
