Reading a text file and splitting it into single words in Python
Given this file:
$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6
If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)
Prints:
line1
word1
word2
line2
...
word6
Similarly, if you want to flatten the file into a single list of words, you might do something like this:
with open('words.txt') as f:
    flat_list = [word for line in f for word in line.split()]
>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']
This can produce the same output as the first example with print('\n'.join(flat_list))
...
Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):
with open('words.txt') as f:
    matrix = [line.split() for line in f]
>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
If you want a regex solution, which would allow you to filter wordN vs. lineN type words in the example file:
import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            # wordN by wordN with no lineN
            print(word)
Or, if you want that as a lazy, line-by-line generator with a regex (consume it while the file is still open):
with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
    for word in words:
        print(word)
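A generator expression built inside a with block can only be consumed while the file is still open. One way around that is a generator function that owns the file handle. The following is a minimal sketch, not the original answer's code: iter_words and its pattern parameter are names made up for illustration, and a temporary file stands in for words.txt.

```python
import re
import tempfile
from pathlib import Path


def iter_words(path, pattern=r"\w+"):
    """Yield regex matches one at a time; the file stays open while iterating."""
    with open(path) as f:
        for line in f:
            yield from re.findall(pattern, line)


# Demo on a throwaway copy of the sample file shown at the top.
sample = "line1 word1 word2\nline2 word3 word4\nline3 word5 word6\n"
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "words.txt"
    path.write_text(sample)
    # list() exhausts the generator, which also closes the file.
    only_words = list(iter_words(path, r"\bword\d+"))

print(only_words)
```

Because the with statement lives inside the function, the file is closed once the generator is exhausted, so callers do not have to manage the handle themselves.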
Split a text into single words
Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.
$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);
This splits on a group of one or more whitespace characters and also consumes any punctuation characters directly around that whitespace. It also matches punctuation at the beginning or end of the string. Punctuation inside a word is left alone, so "don't" stays one word, while "he said 'ouch!'" yields he, said and ouch.
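For comparison, roughly the same idea can be sketched in Python. Python's built-in re module has no \p{P} class, so this sketch assumes ASCII punctuation from string.punctuation, which is narrower than the PHP pattern. The name split_words is made up for illustration.

```python
import re
import string

# ASCII punctuation only: an assumption, narrower than PHP's \p{P}.
PUNCT = "[" + re.escape(string.punctuation) + "]"

# Same three alternatives as the PHP pattern: leading punctuation,
# whitespace with any adjacent punctuation, trailing punctuation.
SPLITTER = re.compile(rf"^{PUNCT}+|{PUNCT}*\s+{PUNCT}*|{PUNCT}+$")


def split_words(text):
    """Split on whitespace and edge punctuation, dropping empty pieces."""
    return [piece for piece in SPLITTER.split(text) if piece]


print(split_words("he said 'ouch!'"))  # ['he', 'said', 'ouch']
print(split_words("don't stop!"))      # ["don't", 'stop']
```

The list comprehension plays the role of PREG_SPLIT_NO_EMPTY by discarding the empty strings that re.split returns at the edges.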
Splitting Text information in a dataframe into single words and detect if they are part of a dictionary R
As I don't know which dictionary you are working with, here's a description of how in principle you can go about this task:
Data:
df <- data.frame(Descriptions = c("cyber"," &%@","aah ingds.", "823942 blc"))
Let's say you work with the GradyAugmented dictionary from library(qdapDictionaries). You could paste the dictionary words together, separated by the regex alternation marker |, and use grepl, which returns TRUE or FALSE, to check whether any dictionary word occurs in each of the df$Descriptions strings (the example uses only the first 100 entries):
df$inDict <- grepl(paste0("\\b(", paste(GradyAugmented[1:100], collapse = "|"), ")\\b"), df$Descriptions)
Result:
df
  Descriptions inDict
1        cyber   TRUE
2          &%@  FALSE
3   aah ingds.   TRUE
4   823942 blc  FALSE
The dictionary may be very large and you may run into memory problems. In that case you can take a different route, via %in%:
df$inDict <- lapply(strsplit(df$Descriptions, " "), function(x) x %in% GradyAugmented)
Here each entry of inDict is a logical vector with one value per word:
df
  Descriptions       inDict
1        cyber         TRUE
2          &%@        FALSE
3   aah ingds.  TRUE, FALSE
4   823942 blc FALSE, FALSE
Hope this helps.
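The per-word lookup can also be sketched in Python, where a set gives constant-time membership tests. This is only an illustration: check_words is a made-up name, and the two-word dictionary below is a hypothetical stand-in for a real word list such as GradyAugmented.

```python
def check_words(descriptions, dictionary):
    """For each description, return one True/False per whitespace-separated word."""
    return [[word in dictionary for word in text.split()] for text in descriptions]


# Hypothetical stand-in for a real dictionary; a set keeps lookups cheap.
dictionary = {"cyber", "aah"}
descriptions = ["cyber", " &%@", "aah ingds.", "823942 blc"]

per_word = check_words(descriptions, dictionary)
# Collapse to one flag per row: True if any word is in the dictionary.
any_match = [any(flags) for flags in per_word]

print(per_word)   # [[True], [False], [True, False], [False, False]]
print(any_match)  # [True, False, True, False]
```

The per_word result mirrors the list column from the %in% approach, and any_match mirrors the single TRUE/FALSE column from the grepl approach.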
Split string into individual words Java
Use the split() method, e.g.:
String s = "I want to walk my dog";
String[] arr = s.split(" ");
for (String ss : arr) {
    System.out.println(ss);
}
How do I split a string into a list of words?
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
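Calling split() with no argument splits on any run of whitespace and drops empty strings, whereas split(" ") splits on every single space. A short sketch of the difference, using a made-up sample sentence:

```python
# Sample input with a double space, a tab and a trailing newline.
sentence = "I  want to\twalk my dog\n"

# No argument: any run of whitespace is one separator, no empty strings.
words = sentence.split()
print(words)   # ['I', 'want', 'to', 'walk', 'my', 'dog']

# Explicit " ": every single space separates, other whitespace is kept.
pieces = sentence.split(" ")
print(pieces)  # ['I', '', 'want', 'to\twalk', 'my', 'dog\n']
```

For plain word lists the no-argument form is usually what you want.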
Split a text into single words, each on a line, with JavaScript
This should work (it replaces every space with a line break):
x.innerHTML = MyVar.value.replace(/ /g, "<br />");
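Replacing each space individually produces empty lines when the input contains consecutive spaces. A split-then-join approach avoids that. Here is a sketch of the idea in Python rather than JavaScript; words_to_html_lines is a made-up name, and escaping the words with html.escape is an added assumption about what is safe to put into markup.

```python
import html


def words_to_html_lines(text):
    """Put each whitespace-separated word on its own <br />-separated line."""
    return "<br />".join(html.escape(word) for word in text.split())


result = words_to_html_lines("I  want <to> walk")
print(result)  # I<br />want<br />&lt;to&gt;<br />walk
```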
Splitting whole text to words using one regex
You can use the opposite approach - matching:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> words = new ArrayList<>();
String line = " W metal, w liczbę, w trupie ciało, -";
Matcher m = Pattern.compile("\\p{L}+").matcher(line);
while (m.find()) {
    words.add(m.group());
}
System.out.println(words); // => [W, metal, w, liczbę, w, trupie, ciało]
The \\p{L}+ pattern matches one or more Unicode letters.
A splitting approach also works, but the input string needs pre-processing first:
String line = " W metal, w liczbę, w trupie ciało, -";
String[] words = line.replaceFirst("^\\P{L}+", "").split("\\P{L}+");
System.out.println(Arrays.toString(words));
The .replaceFirst("^\\P{L}+", "") call removes all non-letter characters from the beginning of the string, so the split array has no empty leading element.
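The matching approach carries over to Python. The built-in re module does not support \p{L}, so this sketch assumes the character class [^\W\d_] as an approximation: word characters minus decimal digits and the underscore. It can still let through a few non-letter numeric characters, so treat it as a stand-in, not an exact equivalent.

```python
import re

line = " W metal, w liczbę, w trupie ciało, -"

# [^\W\d_]+ approximates \p{L}+ : word characters that are neither
# decimal digits nor underscores (Unicode-aware for str patterns).
letters_only = re.findall(r"[^\W\d_]+", line)

print(letters_only)  # ['W', 'metal', 'w', 'liczbę', 'w', 'trupie', 'ciało']
```

Because findall matches the words instead of splitting on the gaps, no pre-processing of the leading non-letter characters is needed.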
Split Strings into words with multiple word boundary delimiters
A case where regular expressions are justified:
import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']