Java Split on Spaces and Special Characters

Just use:

String[] terms = input.split("[\\s@&.?$+-]+");

You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, \ and ^. However, & is special only when it comes as a pair (&&, the intersection operator), ^ is special only as the first character (negation), and - is treated as a literal character if it is placed at the beginning or the end of the character class.

Other languages may have different rules for parsing the pattern, but the rule about - applies in most regex engines.
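As a quick check of the character class above, here is a minimal sketch. The class name, the helper splitTerms and the sample input are made up for illustration; only the regex comes from the answer.

```java
import java.util.Arrays;

public class CharClassSplitDemo {
    // Splits on runs of whitespace, @, &, ., ?, $, + and -.
    // The hyphen is last in the class, so it is a literal, not a range.
    static String[] splitTerms(String input) {
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        String input = "foo@bar.com & baz-qux?";
        // The run " & " counts as one delimiter; the trailing "?" leaves
        // only an empty trailing element, which split() drops.
        System.out.println(Arrays.toString(splitTerms(input)));
        // prints [foo, bar, com, baz, qux]
    }
}
```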

As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. By default, \w in Java is equivalent to [a-zA-Z0-9_] (upper- and lower-case English letters, digits and underscore), and therefore \W matches every other character. If you want to treat Unicode letters and digits as word characters, look at the Unicode character classes such as \p{L} and \p{Nd}.
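To illustrate the difference, the sketch below splits the same input once with \W+ and once with a negated Unicode class. The class and method names are hypothetical, and [^\p{L}\p{Nd}_] is just one possible definition of "non-word"; pick the classes that match your own definition.

```java
import java.util.Arrays;

public class UnicodeWordSplitDemo {
    // Default \W: everything outside [a-zA-Z0-9_] is a delimiter,
    // so accented letters split words apart.
    static String[] splitAscii(String input) {
        return input.split("\\W+");
    }

    // Assumed alternative: anything that is not a Unicode letter,
    // a decimal digit or an underscore is a delimiter.
    static String[] splitUnicode(String input) {
        return input.split("[^\\p{L}\\p{Nd}_]+");
    }

    public static void main(String[] args) {
        // Escapes keep the source file ASCII-only: n-a-i(diaeresis)-v-e c-a-f-e(acute)
        String input = "na\u00EFve caf\u00E9";
        System.out.println(Arrays.toString(splitAscii(input)));   // 3 fragments: na, ve, caf
        System.out.println(Arrays.toString(splitUnicode(input))); // 2 whole words
    }
}
```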

Java regex - split string with leading special characters

split() is behaving as documented: because the input starts with a delimiter, the first element of the result is the zero-length string before the first comma.

To fix, first remove all splitting chars from the start:

String[] sArr = s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");

Note that I’ve altered the removal regex to trim any sequence of spaces and non-letters from the front.

You don’t need to remove from the tail because split discards empty trailing elements from the result.
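The following sketch shows the leading empty element and the effect of the fix side by side. The regexes are the ones from the answer; the class name, method names and sample input are assumptions for illustration.

```java
import java.util.Arrays;

public class LeadingDelimiterDemo {
    // Plain split: a delimiter at index 0 yields an empty first element.
    static String[] splitRaw(String s) {
        return s.split("[^a-zA-Z]+\\s*");
    }

    // Strip leading non-letters first, then split as before.
    static String[] splitTrimmed(String s) {
        return s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");
    }

    public static void main(String[] args) {
        String s = ",hello, world,";
        System.out.println(Arrays.toString(splitRaw(s)));     // prints [, hello, world]
        System.out.println(Arrays.toString(splitTrimmed(s))); // prints [hello, world]
    }
}
```

In both calls the trailing comma leaves no empty element, because split() without a limit argument removes trailing empty strings.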

Split Java string on spaces with special characters and complications

You may use this regex for matching with a lookahead assertion:

-?[a-z_]\w*(?:=".*?"(?=\h+(?:-[a-z](?=\h|$)|[a-z]\w*=)|$)|\S+)?

RegEx Explanation:

  • -?: Start with an optional hyphen
  • [a-z_]\w*: Match a name that starts with a lowercase letter or underscore, followed by 0+ word characters
  • (?:: Start non-capture group
    • =".*?"(?=...<expression>): Match = followed by a quoted string that starts and ends with a double quote. The lookahead asserts that another option or variable follows after horizontal whitespace, or that we are at the end of the line.
    • |: OR
    • \S+: Match 1+ non-whitespace characters
  • )?: End non-capture group; the whole group is optional
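Since this regex matches tokens rather than delimiters, it is used with a find() loop instead of split(). Here is a minimal sketch of that usage; the class name, the tokenize helper and the sample command line are hypothetical. \h (horizontal whitespace) requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionTokenizer {
    // The regex from the answer, with backslashes and quotes escaped for a Java string.
    private static final Pattern TOKEN = Pattern.compile(
        "-?[a-z_]\\w*(?:=\".*?\"(?=\\h+(?:-[a-z](?=\\h|$)|[a-z]\\w*=)|$)|\\S+)?");

    // Collects every match in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input is an assumption: a flag, a quoted value with a space,
        // an unquoted value, and another flag.
        String input = "-a name=\"John Smith\" path=/tmp/x -v";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
        // prints four lines: -a / name="John Smith" / path=/tmp/x / -v
    }
}
```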

How to split a string on whitespace and on special chars while getting their offset values in Java

Use Matcher#start, which returns the start index of the current match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // A run of non-whitespace delimited by word boundaries, or a single punctuation character
        Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
        Matcher matcher = pattern.matcher("I live, in India.");
        while (matcher.find()) {
            // start() is the zero-based offset of the match in the input
            System.out.println(matcher.group() + " => " + matcher.start());
        }
    }
}

Output:

I => 0
live => 2
, => 6
in => 8
India => 11
. => 16

Explanation of regex:

  1. \b specifies word boundary.
  2. | specifies OR.
  3. \p{Punct} specifies punctuation.
  4. \S+ specifies one or more non-whitespace characters.
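If you also need the end of each token, Matcher#end returns the offset just past the last matched character. The sketch below reuses the same pattern and records each token with its half-open range; the class name, the spans helper and the output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSpans {
    private static final Pattern TOKEN = Pattern.compile("\\b\\S+\\b|\\p{Punct}");

    // Returns entries such as "live[2,6)": the token plus its start (inclusive)
    // and end (exclusive) offsets in the input.
    static List<String> spans(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("I live, in India."));
        // prints [I[0,1), live[2,6), ,[6,7), in[8,10), India[11,16), .[16,17)]
    }
}
```

For every match, text.substring(m.start(), m.end()) equals m.group(), so the two offsets are enough to recover the token later.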

Splitting a string using special characters and keeping them

You want split() to return every character separately, except for spaces and commas. So split on runs of spaces/commas and also on "nothing", i.e. the zero-width position between two characters that are neither a space nor a comma.

String str = "g, i+, w+ | (d | (u+, f))+";
String[] chunks = str.split("[\\s,]+|(?<![\\s,])(?![\\s,])");
System.out.println(String.join(",", chunks));

Output

g,i,+,w,+,|,(,d,|,(,u,+,f,),),+

Alternative: Search for what you want, and collect it into an array or List (requires Java 9):

String str = "g, i+, w+ | (d | (u+, f))+";
String[] chunks = Pattern.compile("[^\\s,]").matcher(str).results()
        .map(MatchResult::group).toArray(String[]::new);
System.out.println(String.join(",", chunks));

Same output.

For older versions of Java, use a find() loop:

String str = "g, i+, w+ | (d | (u+, f))+";
List<String> chunkList = new ArrayList<>();
for (Matcher m = Pattern.compile("[^\\s,]").matcher(str); m.find(); )
    chunkList.add(m.group());
System.out.println(chunkList);

Output

[g, i, +, w, +, |, (, d, |, (, u, +, f, ), ), +]

You can always convert the List to an array:

String[] chunks = chunkList.toArray(new String[0]);

