Split a Text into Single Words

Reading a text file and splitting it into single words in Python

Given this file:

$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6

If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):

with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

Prints:

line1
word1
word2
line2
...
word6

Similarly, if you want to flatten the file into a single list of words, you can use a nested list comprehension:

with open('words.txt') as f:
    flat_list = [word for line in f for word in line.split()]

>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']

This produces the same output as the first example with print('\n'.join(flat_list)).
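
If the file is small enough to hold in memory, the same flat list can probably be built without an explicit loop, because str.split() with no arguments treats any run of whitespace (newlines included) as a separator. A minimal sketch: the helper name read_words is made up for illustration, and the sample file is recreated in a temporary directory only so that the snippet is self-contained.

```python
import os
import tempfile

def read_words(path):
    """Return every whitespace-separated word in the file as one flat list."""
    with open(path) as f:
        # split() with no arguments splits on spaces, tabs and newlines alike
        return f.read().split()

# Recreate the sample words.txt in a temporary directory (illustration only)
sample = "line1 word1 word2\nline2 word3 word4\nline3 word5 word6\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.txt")
    with open(path, "w") as f:
        f.write(sample)
    flat_list = read_words(path)

print(flat_list)
```

Unlike the line-by-line versions, this reads the whole file at once, so it is better suited to small files.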

Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):

with open('words.txt') as f:
    matrix = [line.split() for line in f]

>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
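
Once the words are in a nested list, rows and columns can be addressed by index. The sketch below hard-codes the matrix from the example file and uses zip(*matrix) to regroup rows into columns; the variable names cell, columns and first_column are illustrative only.

```python
# The nested list produced from the example words.txt
matrix = [
    ['line1', 'word1', 'word2'],
    ['line2', 'word3', 'word4'],
    ['line3', 'word5', 'word6'],
]

# Row/column access: second word on the third line
cell = matrix[2][1]

# zip(*matrix) regroups the rows into columns
columns = [list(col) for col in zip(*matrix)]
first_column = columns[0]

print(cell)
print(first_column)
```

Note that zip stops at the shortest row, so lines with fewer words than the others would silently shorten every column.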

If you want a regex solution, which lets you pick out the wordN words and skip the lineN words in the example file:

import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            # wordN by wordN with no lineN
            print(word)

Or, if you want a lazy, line-by-line generator with a regex (consume it inside the with block, before the file is closed):

with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
    for word in words:
        print(word)
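
A generator over a file reads lazily, so it only works while the file is still open. One way to make that explicit is a small generator function that accepts any iterable of lines and leaves the file's lifetime to the caller. This is a sketch under that assumption; iter_words is a hypothetical helper name, and io.StringIO stands in for an open file.

```python
import io
import re

def iter_words(lines, pattern=r'\w+'):
    """Lazily yield each regex match from an iterable of lines."""
    regex = re.compile(pattern)
    for line in lines:
        # findall returns the matches of one line; yield them one by one
        yield from regex.findall(line)

# io.StringIO stands in for an open file here (illustration only)
sample = io.StringIO("line1 word1 word2\nline2 word3 word4\n")
only_words = list(iter_words(sample, r'\bword\d+'))

print(only_words)
```

With a real file, the call would sit inside the with block, for example list(iter_words(f)) while f is open.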

Split a text into single words

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class. The u modifier makes PCRE treat the pattern and the subject as UTF-8, so multibyte punctuation is matched correctly.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

This splits on any run of one or more whitespace characters and also consumes any punctuation directly adjacent to that whitespace. It also matches punctuation at the beginning or end of the string. Punctuation inside a word is left alone, so "don't" stays one word, while the quotes and exclamation mark in "he said 'ouch!'" are stripped.
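
The same splitting idea can be sketched in Python. The standard re module does not support \p{P}, so this version assumes ASCII-only punctuation taken from string.punctuation; it is an approximation of the pattern above, not an equivalent. The name split_words is made up for illustration.

```python
import re
import string

# Character class of ASCII punctuation only (an assumption; \p{P} covers
# all Unicode punctuation, which the standard re module cannot express)
_P = "[" + re.escape(string.punctuation) + "]"

# Leading punctuation | whitespace with adjacent punctuation | trailing punctuation
_SPLIT = re.compile(rf"^{_P}+|{_P}*\s+{_P}*|{_P}+$")

def split_words(text):
    """Split on whitespace, dropping punctuation that touches the whitespace."""
    # Filtering empty strings plays the role of PREG_SPLIT_NO_EMPTY
    return [w for w in _SPLIT.split(text) if w]

print(split_words("he said 'ouch!'"))
print(split_words("don't stop!"))
```

Because the pattern has no capturing groups, re.split returns only the pieces between matches.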

Splitting Text information in a dataframe into single words and detect if they are part of a dictionary R

As I don't know which dictionary you are working with, here's a description of how in principle you can go about this task:

Data:

df <- data.frame(Descriptions = c("cyber"," &%@","aah ingds.", "823942 blc"))

Let's say you work with the GradyAugmented dictionary from the qdapDictionaries package. You can paste the dictionary words together, separated by the regex alternation marker |, and use grepl, which returns TRUE or FALSE, to check whether any dictionary word occurs in each of the df$Descriptions strings:

df$inDict <- grepl(paste0("\\b(", paste(GradyAugmented[1:100], collapse = "|"), ")\\b"), df$Descriptions)

Result:

df
  Descriptions inDict
1        cyber   TRUE
2         &%@  FALSE
3   aah ingds.   TRUE
4   823942 blc  FALSE

The dictionary may be very large and you may run into memory problems. In that case you can take a different route, via %in%:

df$inDict <- lapply(strsplit(df$Descriptions, " "), function(x) x %in% GradyAugmented)

Here each row of inDict is a logical vector with one value per word:

df
  Descriptions       inDict
1        cyber         TRUE
2         &%@        FALSE
3   aah ingds.  TRUE, FALSE
4   823942 blc FALSE, FALSE
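
For comparison, the per-word membership check can be sketched in Python with a set, which keeps each lookup cheap even for a large word list. The two-word dictionary below is a hypothetical stand-in for a real word list, and words_in_dict is an illustrative name.

```python
# Hypothetical stand-in for a real word list such as GradyAugmented
dictionary = {"cyber", "aah"}

descriptions = ["cyber", " &%@", "aah ingds.", "823942 blc"]

def words_in_dict(text, vocab):
    """Return one True/False per whitespace-separated token in text."""
    return [token in vocab for token in text.split()]

# One list of flags per description, like the list column above
per_word = [words_in_dict(d, dictionary) for d in descriptions]

# Collapse to a single flag per description: does any word match?
any_word = [any(flags) for flags in per_word]

print(per_word)
print(any_word)
```

Tokens are compared as-is, so a word with attached punctuation (for example "ingds.") only matches if the dictionary contains it in exactly that form.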

Hope this helps.

Split string into individual words Java

Use the split() method. For example:

String s = "I want to walk my dog";
String[] arr = s.split(" ");

for (String ss : arr) {
    System.out.println(ss);
}

How do I split a string into a list of words?

Given a string sentence, this stores each word in a list called words:

words = sentence.split()
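
Calling split() with no argument behaves differently from passing an explicit space. The short sketch below uses a made-up sentence with a double space, a tab and a trailing newline to show the difference; the name naive is illustrative.

```python
sentence = "I  want to\twalk my dog\n"

# No argument: any run of whitespace separates words; no empty strings result
words = sentence.split()

# Explicit single space: every space is a separator, so a run of spaces
# yields empty strings, and tabs/newlines stay attached to the words
naive = sentence.split(" ")

print(words)
print(naive)
```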

Split a text into single words, each on a line, with JavaScript

This replaces every space in the input value with a line break:

x.innerHTML = MyVar.value.replace(/ /g, "<br />");
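
Replacing each single space means that consecutive spaces produce empty lines. As a language-neutral illustration of the alternative (split on whitespace runs, then join with the line-break tag), here is a small Python sketch; words_to_html_lines is a hypothetical helper, and escaping the words is an added assumption about how the result will be displayed.

```python
import html

def words_to_html_lines(text):
    """Put each word on its own line by joining the words with <br /> tags."""
    # Escape each word so characters like < or & are displayed literally
    return "<br />".join(html.escape(w) for w in text.split())

print(words_to_html_lines("I  want to walk"))
```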

Splitting whole text to words using one regex

You can use the opposite approach - matching:

List<String> words = new ArrayList<>();
String line = " W metal, w liczbę, w trupie ciało, -";
Matcher m = Pattern.compile("\\p{L}+").matcher(line);
while (m.find()) {
    words.add(m.group());
}
System.out.println(words); // => [W, metal, w, liczbę, w, trupie, ciało]

The \\p{L}+ pattern matches one or more Unicode letters.

A splitting approach also works, but the input string must be pre-processed first:

String line = "    W metal, w liczbę, w trupie ciało, -";
String[] words = line.replaceFirst("^\\P{L}+", "").split("\\P{L}+");
System.out.println(Arrays.toString(words));

The .replaceFirst("^\\P{L}+", "") call removes all non-letter characters from the beginning of the string, so the split array has no leading empty element.
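
Both approaches can be sketched in Python as well. The standard re module has no \p{L}, so this sketch assumes the common substitute [^\W\d_], meaning word characters minus digits and underscore. That is an approximation rather than an exact equivalent of \p{L}.

```python
import re

# [^\W\d_] = word characters minus digits and underscore; an approximation
# of \p{L}, which the standard re module does not support
LETTERS = re.compile(r"[^\W\d_]+")

line = " W metal, w liczbę, w trupie ciało, -"

# Matching approach: collect every run of letters
matched = LETTERS.findall(line)

# Splitting approach: split on non-letter runs, then drop the empty
# strings left by leading and trailing separators
split_words = [w for w in re.split(r"[\W\d_]+", line) if w]

print(matched)
print(split_words)
```

Filtering out empty strings after the split plays the same role here as the replaceFirst pre-processing step in the Java version.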

Split Strings into words with multiple word boundary delimiters

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

