Extracting pairs of words using String.split()
Currently (last tested on Java 17) it is possible to do this with split(), but don't use this approach in real-world code. It appears to rely on a bug: look-behind in Java should have an obvious maximum length, yet this solution uses \w+, which doesn't respect that limitation and somehow still works. If this is a bug that gets fixed in a later release, the solution will stop working.
Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+, which aside from being safer also avoids maintenance hell for the person who inherits the code (remember to "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
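A minimal sketch of the Pattern and Matcher approach recommended above, using the \w+\s+\w+ regex from the text. The class and method names (WordPairs, pairs) are made up for illustration. With this exact regex a trailing unpaired word is not returned; if it should be kept, one option is to make the second word optional, for example \w+(?:\s+\w+)?.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // Two words separated by whitespace; no look-behind involved.
    private static final Pattern PAIR = Pattern.compile("\\w+\\s+\\w+");

    // Collects every non-overlapping pair of words, left to right.
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("one two three four five six seven"));
        // prints [one two, three four, five six]
    }
}
```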
Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will leave \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)
String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));
output:
[one two, three four, five six, seven]
\G is the end of the previous match, and (?<!regex) is a negative look-behind.
In split we are trying to
- find spaces -> \\s
- that are not preceded -> (?<!negativeLookBehind)
- by some word -> \\w+
- with the previously matched (space) -> \\G
- before it -> \\G\\w+
The only thing that confused me at first was how this works for the first space, since we want that space to be ignored. The important point is that at the start \\G matches the start of the String (^).
So before the first iteration the regex in the negative look-behind effectively looks like (?<!^\\w+), and since the first space does have ^\\w+ before it, it can't be a match for split. The next space doesn't have this problem, so it is matched, and information about it (like its position in the input String) is stored in \\G and used in the next negative look-behind.
So for the 3rd space the regex checks whether there is a previously matched space \\G and a word \\w+ before it. Since the result of this test is positive, the negative look-behind rejects it, so this space isn't matched. The 4th space doesn't have this problem, because the space before it isn't the one stored in \\G (it has a different position in the input String).
Also, if someone would like to split on, let's say, every 3rd space, this form can be used (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):
input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")
Instead of 100 you can use a bigger value, as long as it is at least the length of the longest word in the String.
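The bounded form above can be generalized to every n-th space by building the regex in a loop. This is only a sketch: the helper name splitEveryNthSpace and its maxWordLength parameter are assumptions for illustration, and the caller has to pick a bound that covers the longest word.

```java
import java.util.Arrays;

class NthSpaceSplit {
    // Splits on every n-th whitespace character. maxWordLength is an assumed
    // upper bound on word length; it keeps the look-behind bounded.
    static String[] splitEveryNthSpace(String input, int n, int maxWordLength) {
        String word = "\\w{1," + maxWordLength + "}";
        // Look-behind: n words, starting where the previous match ended (\G).
        StringBuilder regex = new StringBuilder("(?<=\\G").append(word);
        for (int i = 1; i < n; i++) {
            regex.append("\\s").append(word);
        }
        regex.append(")\\s");
        return input.split(regex.toString());
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        System.out.println(Arrays.toString(splitEveryNthSpace(input, 3, 100)));
        // prints [one two three, four five six, seven]
    }
}
```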
I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered delimiter, like every 3rd, 5th or 7th, for example
String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma
Java String: split String
If you really have to use split you can use something like
String[] array = string.split("(?<=\\G[^,]{1,100},[^,]{1,100},[^,]{1,100},[^,]{1,100}),");
The idea is explained in my previous answer on a similar but simpler topic.
Demo:
String string = "NNP,PERSON,true,?,IN,O,false,pobj,NNP,ORGANIZATION,true,?,p";
String[] array = string.split("(?<=\\G[^,]{1,100},[^,]{1,100},[^,]{1,100},[^,]{1,100}),");
for (String s : array)
System.out.println(s);
output:
NNP,PERSON,true,?
IN,O,false,pobj
NNP,ORGANIZATION,true,?
p
But if there is any chance that you don't have to use split but still want to use regex, then I encourage you to use the Pattern and Matcher classes to create a simple regex that finds the parts you are interested in, rather than a complicated regex that finds the parts you want to get rid of. I mean something like
- any xx,xxx,xxx,xxx part where x is not a comma,
- any xx or xx,xx or xxx,xxx,xxx part placed at the end of the string (to catch the rest of the data left unmatched by the regex from the first point).
So
Pattern p = Pattern.compile("[^,]+(,[^,]+){3}|[^,]+(,[^,]+){0,2}$");
should do the trick.
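To show how that pattern would be driven, here is a short sketch that runs it with Matcher.find() over the demo string used earlier. The class and method names (ChunkFinder, chunks) are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkFinder {
    // Either four comma-separated fields, or one to three fields at the end.
    private static final Pattern CHUNK =
            Pattern.compile("[^,]+(,[^,]+){3}|[^,]+(,[^,]+){0,2}$");

    // Returns each matched group of fields in order of appearance.
    static List<String> chunks(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String string = "NNP,PERSON,true,?,IN,O,false,pobj,NNP,ORGANIZATION,true,?,p";
        // One chunk per line, since the chunks themselves contain commas.
        chunks(string).forEach(System.out::println);
    }
}
```

The comma between two chunks is simply skipped, because neither alternative can start on a comma.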
Another solution, and probably the fastest (and quite easy to write), would be to create your own parser that iterates over all characters of the string, stores them in a buffer, counts how many , characters have occurred so far, and, when that count reaches a multiple of 4, writes the buffer's content to an array (or better, a dynamic collection like a List) and clears the buffer. Such a parser can look like
public static List<String> parse(String s){
    List<String> tokens = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    int commaCounter = 0;
    for (char ch: s.toCharArray()){
        if (ch==',' && ++commaCounter == 4){
            tokens.add(sb.toString());
            sb.delete(0, sb.length());
            commaCounter = 0;
        }else{
            sb.append(ch);
        }
    }
    if (sb.length()>0)
        tokens.add(sb.toString());
    return tokens;
}
You can later convert the List to an array if you need to, but I would stay with the List.
Splitting a string at a particular position in java
I would use String.substring(beginIndex, endIndex) and String.substring(beginIndex):
String a = "word1 word2 word3 word4";
int first = a.indexOf(" ");
int second = a.indexOf(" ", first + 1);
String b = a.substring(0, second);
String c = a.substring(second + 1); // start index only: runs to the end of the string, skipping the space
This results in b = "word1 word2" and c = "word3 word4".
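The same indexOf and substring idea can be wrapped in a small helper that splits at the n-th space. This is a sketch under assumptions: the name splitAtNthSpace is made up, and returning the whole text as a single element when there are fewer than n spaces is just one possible choice.

```java
import java.util.Arrays;

class SplitAtSpace {
    // Splits text into the part before the n-th space and the part after it.
    // If the text has fewer than n spaces, it is returned as a single element.
    static String[] splitAtNthSpace(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int index = -1;
        for (int i = 0; i < n; i++) {
            index = text.indexOf(' ', index + 1);
            if (index < 0) {
                return new String[] { text };
            }
        }
        return new String[] { text.substring(0, index), text.substring(index + 1) };
    }

    public static void main(String[] args) {
        String a = "word1 word2 word3 word4";
        System.out.println(Arrays.toString(splitAtNthSpace(a, 2)));
        // prints [word1 word2, word3 word4]
    }
}
```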
Split string into list of words and separators
You'd need a regular expression:
(\w+)([#|*])
See example Dart code here that should get you going: https://dartpad.dartlang.org/ae3897b2221a94b5a4c9e6929bebcfce
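For readers working in Java rather than Dart, a rough equivalent using the same regular expression might look like the sketch below. The input string and the names (WordsAndSeparators, tokens) are hypothetical, and the regex only matches words that are directly followed by one of the separators #, | or *.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsAndSeparators {
    // Group 1 is the word, group 2 is the separator that follows it.
    private static final Pattern TOKEN = Pattern.compile("(\\w+)([#|*])");

    // Returns words and separators interleaved in their original order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
            result.add(matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen only to exercise all three separators.
        System.out.println(tokens("alpha#beta|gamma*"));
        // prints [alpha, #, beta, |, gamma, *]
    }
}
```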
Java. How can I split a string with multiple spaces on every nth space?
You could do:
String[] stringArray = string.split("(?<!\\G\\S+)\\s");
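If relying on \G inside an unbounded look-behind feels too fragile, a plain alternative is to split on runs of whitespace and re-join the words n at a time. This is a different technique from the one-line regex above, not a variant of it. The names (WordGroups, groupWords) are illustrative, and the sketch assumes it is acceptable to normalize each run of whitespace to a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordGroups {
    // Groups words n at a time; runs of whitespace are collapsed to one space.
    static List<String> groupWords(String input, int n) {
        List<String> groups = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return groups;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += n) {
            int end = Math.min(i + n, words.length);
            groups.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupWords("one  two   three four five", 2));
        // prints [one two, three four, five]
    }
}
```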
Split string into words with whitespace unless in between a pair of double quotation marks
A solution :
var str = 'get "something" from "any site"';
var tokens = [].concat.apply([], str.split('"').map(function(v,i){
return i%2 ? v : v.split(' ')
})).filter(Boolean);
Result :
["get", "something", "from", "any site"]
It's probably possible to do this more simply. The idea here is to split on " and then split the odd results of that first split (the 1st, 3rd, ... parts, which lie outside the quotes) on spaces.
If you want to keep the quotes, you may use
var tokens = [].concat.apply([], str.split('"').map(function(v,i){
return i%2 ? '"'+v+'"' : v.split(' ')
})).filter(Boolean);
Result :
['get', '"something"', 'from', '"any site"']
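The same split-on-quotes idea carries over to Java. The sketch below is an assumed port of the first variant (quotes dropped), with made-up names (QuotedTokens, tokenize). Like the JavaScript version it assumes the quotes are balanced, but unlike filter(Boolean) it only drops empty strings produced by the space split.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedTokens {
    // Splits on double quotes first; parts at odd indexes were inside quotes
    // and are kept whole, parts at even indexes are split on spaces.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String[] parts = input.split("\"");
        for (int i = 0; i < parts.length; i++) {
            if (i % 2 == 1) {
                tokens.add(parts[i]);
            } else {
                for (String word : parts[i].split(" ")) {
                    if (!word.isEmpty()) {
                        tokens.add(word);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("get \"something\" from \"any site\""));
        // prints [get, something, from, any site]
    }
}
```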
Split string and concatenate removing whole word in R
Just needs a little tweaking, and now strings can be generalized to a vector of such strings:
Solution
sapply(
# Split each string by "/" into its components.
X = strsplit(x = strings, split = "/"),
# Remove undesired components and then reassemble the strings.
FUN = function(v){paste0(
# Use subscripting to filter out matches.
v[!grepl(x = v, pattern = "^\\s*(Arts and Humanities|Social Sciences)\\s*$")],
# Reassemble components as separated by "/".
collapse = "/"
)},
# Make the result a vector like the original 'string' (rather than a list).
simplify = TRUE,
USE.NAMES = FALSE
)
Result
Given a vector of strings like this
strings <- c(
"Arts and Humanities Other Topics/Social Sciences Other Topics/Arts and Humanities/Social Sciences/Sociology",
"Sociology/Arts and Humanities"
)
this solution should yield the following result:
[1] "Arts and Humanities Other Topics/Social Sciences Other Topics/Sociology"
[2] "Sociology"
Note
A solution that uses unlist() will collapse everything into a single, giant string, rather than reassembling each string in strings.
Split string into array of character strings
"cat".split("(?!^)")
This will produce the array ["c", "a", "t"]
Get word pairs from a sentence
There are lots of ways to do this. One of them builds a pair from each word and the word after it:
String string = "I want this split up into pairs";
String[] words = string.split(" ");
List<String> pairs = new ArrayList<String>();
for (int i = 0; i < words.length-1; ++i) {
pairs.add(words[i] + " " + words[i+1]);
}
System.out.println(pairs);
How to extract a string between two delimiters
If you have just a pair of brackets ( [] ) in your string, you can use indexOf():
String str = "ABC[ This is the text to be extracted ]";
String result = str.substring(str.indexOf("[") + 1, str.indexOf("]"));
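The one-liner above throws or returns something unexpected when a delimiter is missing, because indexOf() then returns -1. A slightly more defensive sketch is shown below; the helper name between and the choice to return an Optional are assumptions for illustration, not part of the original answer.

```java
import java.util.Optional;

class Between {
    // Returns the text between the first occurrence of open and the next
    // occurrence of close, or an empty Optional if either is missing.
    static Optional<String> between(String text, String open, String close) {
        int start = text.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        int from = start + open.length();
        int end = text.indexOf(close, from);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(from, end));
    }

    public static void main(String[] args) {
        String str = "ABC[ This is the text to be extracted ]";
        System.out.println(between(str, "[", "]").orElse("(no match)"));
        // prints " This is the text to be extracted " (without the quotes)
    }
}
```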