Extracting Pairs of Words Using String.split()

As of Java 17 (the last version tested), this can be done with split(), but don't use this approach in real code. It appears to rely on a bug: look-behind in Java is supposed to have an obvious maximum length, yet this solution uses \w+, which does not respect that limitation and somehow still works. If that is a bug and it gets fixed in a later release, this solution will stop working.

Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+. Besides being safer, this spares whoever inherits the code from maintenance hell (remember: "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
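A minimal sketch of that Pattern/Matcher approach could look like the following. The class and method names (`WordPairs`, `pairs`) are made up for illustration, and the optional second word in the regex is my own tweak to \w+\s+\w+ so that a trailing unpaired word is not dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // One word, optionally followed by whitespace and a second word.
    // The optional group is an assumption added so a leftover last word is kept.
    private static final Pattern PAIR = Pattern.compile("\\w+(?:\\s+\\w+)?");

    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        // Each find() continues where the previous match ended.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one two, three four, five six, seven]
        System.out.println(pairs("one two three four five six seven"));
    }
}
```

Here the regex describes what you want to keep, so no look-behind is needed at all.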


Is this what you are looking for?

(You can replace \\w with \\S to include all non-space characters, but for this example I will keep \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G matches the position where the previous match ended, and (?<!regex) is a negative look-behind.

In split we are trying to

  1. find spaces -> \\s
  2. that are not preceded -> (?<!negativeLookBehind)
  3. by some word -> \\w+
  4. with the previously matched (space) -> \\G
  5. before it -> \\G\\w+.

The only thing that confused me at first was how this works for the first space, since we want that space to be ignored. The key point is that at the start \\G matches the start of the string, just like ^.

So for the first iteration the regex in the negative look-behind effectively looks like (?<!^\\w+), and since the first space does have ^\\w+ before it, it cannot be a match for split. The next space does not have this problem, so it is matched, and information about it (such as its position in the input string) is stored in \\G and used in the next negative look-behind.

So for the 3rd space the regex checks whether the previously matched space \\G and a word \\w+ come directly before it. Since they do, the negative look-behind rejects it and this space is not matched. The 4th space does not have this problem, because the space before it is not the one stored in \\G (it has a different position in the input string).
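The same decision rule can be mimicked with a plain loop, which makes each step visible. This is only a sketch of the rule, not of how the regex engine is implemented: the names `PairSplitTrace` and `splitEverySecondSpace` are hypothetical, `lastMatchEnd` stands in for \\G, and only the plain space character is treated as a separator (the regex uses \\s).

```java
import java.util.ArrayList;
import java.util.List;

class PairSplitTrace {
    static List<String> splitEverySecondSpace(String input) {
        List<String> parts = new ArrayList<>();
        int lastMatchEnd = 0; // plays the role of \G: start of input, then end of the last accepted space
        int partStart = 0;    // where the current output piece begins
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) != ' ') {
                continue;
            }
            // Look-behind check: is the text between \G and this space exactly one word?
            boolean singleWordBefore = input.substring(lastMatchEnd, i).matches("\\w+");
            if (!singleWordBefore) {
                // The negative look-behind succeeds, so this space is a split point.
                parts.add(input.substring(partStart, i));
                partStart = i + 1;
                lastMatchEnd = i + 1;
            }
        }
        parts.add(input.substring(partStart));
        return parts;
    }

    public static void main(String[] args) {
        // prints [one two, three four, five six, seven]
        System.out.println(splitEverySecondSpace("one two three four five six seven"));
    }
}
```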


Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use any bigger value, as long as it is at least the length of the longest word in the string.


I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered delimiter, such as every 3rd, 5th or 7th, for example:

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma

Java String: split String

If you really have to use split you can use something like

String[] array = string.split("(?<=\\G[^,]{1,100},[^,]{1,100},[^,]{1,100},[^,]{1,100}),");

The idea is explained in my previous answer on a similar but simpler topic.

Demo:

String string = "NNP,PERSON,true,?,IN,O,false,pobj,NNP,ORGANIZATION,true,?,p";
String[] array = string.split("(?<=\\G[^,]{1,100},[^,]{1,100},[^,]{1,100},[^,]{1,100}),");
for (String s : array)
    System.out.println(s);

output:

NNP,PERSON,true,?
IN,O,false,pobj
NNP,ORGANIZATION,true,?
p

But if you don't have to use split and still want to use regex, I encourage you to use the Pattern and Matcher classes with a simple regex that finds the parts you are interested in, rather than a complicated regex that finds the parts you want to get rid of. I mean something like

  1. any xx,xxx,xxx,xxx part where x is not ,
  2. any xx or xx,xx or xxx,xxx,xxx parts if they are placed at the end of the string (to catch the rest of the data left unmatched by the regex from point 1)

So

Pattern p = Pattern.compile("[^,]+(,[^,]+){3}|[^,]+(,[^,]+){0,2}$");

should do the trick.
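For illustration, here is one way that pattern could be driven with a Matcher. The wrapper class and method names (`GroupsOfFour`, `groups`) are assumptions of mine; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupsOfFour {
    // First alternative: exactly four comma-separated fields.
    // Second alternative: one to three leftover fields at the end of the string.
    private static final Pattern GROUP =
            Pattern.compile("[^,]+(,[^,]+){3}|[^,]+(,[^,]+){0,2}$");

    static List<String> groups(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(input);
        // The comma between two groups is never part of a match, so it is skipped.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String string = "NNP,PERSON,true,?,IN,O,false,pobj,NNP,ORGANIZATION,true,?,p";
        for (String s : groups(string)) {
            System.out.println(s);
        }
    }
}
```

With the demo string from above this prints the same four lines as the split() version.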


Another solution, probably the fastest (and quite easy to write), is to create your own parser. It iterates over all characters of the string, stores them in a buffer and counts how many , have occurred; when that count reaches a multiple of 4, it writes the buffer's content to an array (or better, a dynamic collection like a list) and clears the buffer. Such a parser can look like

public static List<String> parse(String s){
    List<String> tokens = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    int commaCounter = 0;

    for (char ch: s.toCharArray()){
        if (ch==',' && ++commaCounter == 4){
            tokens.add(sb.toString());
            sb.delete(0, sb.length());
            commaCounter = 0;
        }else{
            sb.append(ch);
        }
    }
    if (sb.length()>0)
        tokens.add(sb.toString());

    return tokens;
}

You can convert the List to an array later if you need to, but I would stay with List.

Splitting a string at a particular position in java

I would use String.substring(beginIndex, endIndex) and String.substring(beginIndex):

String a = "word1 word2 word3 word4";
int first = a.indexOf(" ");
int second = a.indexOf(" ", first + 1);
String b = a.substring(0, second);
String c = a.substring(second + 1); // start index only: runs to the end of the string

This results in b = "word1 word2" and c = "word3 word4".
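If the input might contain fewer than two spaces, indexOf returns -1 and the substring calls above would misbehave. A hedged sketch of a guarded helper follows; the class and method names (`SplitAtSecondSpace`, `splitAtSecondSpace`) and the choice to return the whole string unsplit in that case are my own assumptions.

```java
class SplitAtSecondSpace {
    // Splits s at its second space. If there is no second space,
    // the whole string is returned as a single element (an assumed policy).
    static String[] splitAtSecondSpace(String s) {
        int first = s.indexOf(' ');
        int second = (first < 0) ? -1 : s.indexOf(' ', first + 1);
        if (second < 0) {
            return new String[] { s };
        }
        return new String[] { s.substring(0, second), s.substring(second + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtSecondSpace("word1 word2 word3 word4");
        System.out.println(parts[0]); // word1 word2
        System.out.println(parts[1]); // word3 word4
    }
}
```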

Split string into list of words and separators

You'd need a regular expression:

(\w+)([#|*])

See example Dart code here that should get you going: https://dartpad.dartlang.org/ae3897b2221a94b5a4c9e6929bebcfce

Java. How can I split a string with multiple spaces on every nth space?

You could do:

String[] stringArray = string.split("(?<!\\G\\S+)\\s");
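If you would rather not depend on look-behind at all, a plain loop can group words n at a time. This is a swapped-in technique, not the regex above: the names `NthSpaceSplit` and `groupWords` are hypothetical, and unlike the regex this sketch collapses runs of whitespace inside a group into a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NthSpaceSplit {
    // Groups whitespace-separated words into chunks of n words each;
    // the last chunk may be shorter.
    static List<String> groupWords(String input, int n) {
        List<String> groups = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty() || n <= 0) {
            return groups;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += n) {
            int end = Math.min(i + n, words.length);
            groups.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [one two, three four, five]
        System.out.println(groupWords("one two  three four five", 2));
    }
}
```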

Split string into words with whitespace unless in between a pair of double quotation marks

A solution:

var str = 'get "something" from "any site"';
var tokens = [].concat.apply([], str.split('"').map(function(v,i){
    return i%2 ? v : v.split(' ')
})).filter(Boolean);

Result:

["get", "something", "from", "any site"]

There is probably a simpler way to do it. The idea here is to split on " first, and then split on spaces only the pieces that were outside the quotes (the even-indexed ones); the odd-indexed pieces were inside quotes and are kept whole.
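For readers working in Java, the same idea could be sketched roughly like this. It is a port under assumptions, not part of the original answer: the names `QuoteAwareSplit` and `tokenize` are invented, and like the JavaScript version it assumes the quotes are balanced and not escaped.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String[] pieces = input.split("\"");
        for (int i = 0; i < pieces.length; i++) {
            if (i % 2 == 1) {
                // Odd index: text that was inside quotes, kept as one token.
                if (!pieces[i].isEmpty()) {
                    tokens.add(pieces[i]);
                }
            } else {
                // Even index: text outside quotes, split on spaces.
                for (String word : pieces[i].split(" ")) {
                    if (!word.isEmpty()) { // same effect as filter(Boolean)
                        tokens.add(word);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [get, something, from, any site]
        System.out.println(tokenize("get \"something\" from \"any site\""));
    }
}
```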

If you want to keep the quotes, you may use

var tokens = [].concat.apply([], str.split('"').map(function(v,i){
    return i%2 ? '"'+v+'"' : v.split(' ')
})).filter(Boolean);

Result:

['get', '"something"', 'from', '"any site"']

Split string and concatenate removing whole word in R

Just needs a little tweaking, and now strings can be generalized to a vector of such strings:

Solution

sapply(
  # Split each string by "/" into its components.
  X = strsplit(x = strings, split = "/"),
  # Remove undesired components and then reassemble the strings.
  FUN = function(v) {
    paste0(
      # Use subscripting to filter out matches.
      v[!grepl(x = v, pattern = "^\\s*(Arts and Humanities|Social Sciences)\\s*$")],
      # Reassemble components as separated by "/".
      collapse = "/"
    )
  },
  # Make the result a vector like the original 'strings' (rather than a list).
  simplify = TRUE,
  USE.NAMES = FALSE
)

Result

Given a vector of strings like this

strings <- c(
  "Arts and Humanities Other Topics/Social Sciences Other Topics/Arts and Humanities/Social Sciences/Sociology",
  "Sociology/Arts and Humanities"
)

this solution should yield the following result:

[1] "Arts and Humanities Other Topics/Social Sciences Other Topics/Sociology"
[2] "Sociology"

Note

A solution that uses unlist() will collapse everything into a single, giant string, rather than reassembling each string in strings.

Split string into array of character strings

"cat".split("(?!^)")

This will produce

array ["c", "a", "t"]

Get word pairs from a sentence

There are lots of ways. One of them, which builds every pair of adjacent words (so the pairs overlap), is:

String string = "I want this split up into pairs";
String[] words = string.split(" ");
List<String> pairs = new ArrayList<String>();
for (int i = 0; i < words.length-1; ++i) {
    pairs.add(words[i] + " " + words[i+1]);
}
System.out.println(pairs);

How to extract a string between two delimiters

If you have just a pair of brackets ( [] ) in your string, you can use indexOf():

String str = "ABC[ This is the text to be extracted ]";    
String result = str.substring(str.indexOf("[") + 1, str.indexOf("]"));
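The one-liner above throws if either bracket is missing, because indexOf then returns -1. A slightly more defensive sketch is shown below; the names `BetweenDelimiters` and `between` are hypothetical, and returning an empty Optional when a delimiter is absent is my own choice.

```java
import java.util.Optional;

class BetweenDelimiters {
    // Returns the text between the first 'open' and the first 'close' after it,
    // or an empty Optional if either delimiter is missing.
    static Optional<String> between(String s, char open, char close) {
        int start = s.indexOf(open);
        int end = (start < 0) ? -1 : s.indexOf(close, start + 1);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(s.substring(start + 1, end));
    }

    public static void main(String[] args) {
        String str = "ABC[ This is the text to be extracted ]";
        // Surrounding spaces are preserved; call trim() on the result if unwanted.
        System.out.println(between(str, '[', ']').orElse("(no match)"));
    }
}
```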

