Split Regex to Extract Strings of Contiguous Characters

It is totally possible to write the regex for splitting in one step:

"(?<=(.))(?!\\1)"

Since you want to split between every group of identical characters, we only need to find the boundary between two groups. A positive look-behind captures the previous character, and a negative look-ahead with a back-reference checks that the next character is not that same character.

As you can see, the regex is zero-width (only 2 look around assertions). No character is consumed by the regex.
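A minimal Java sketch of the one-step split, assuming the pattern is passed to String.split. The class name RunSplitDemo, the helper splitRuns and the sample string are illustrative only. Because both assertions are zero-width, joining the pieces gives back the original string:

```java
import java.util.Arrays;

public class RunSplitDemo {
    // Zero-width boundary: the previous char is captured, the next char must differ.
    static final String BOUNDARY = "(?<=(.))(?!\\1)";

    // Hypothetical helper: splits s into runs of identical characters.
    static String[] splitRuns(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String input = "aabbbc";
        String[] runs = splitRuns(input);
        System.out.println(Arrays.toString(runs));               // [aa, bbb, c]
        // Nothing is consumed, so the pieces concatenate back to the input.
        System.out.println(String.join("", runs).equals(input)); // true
    }
}
```

The pattern also matches at the very end of the string, but String.split drops trailing empty strings by default, so no empty element shows up in the result.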

Javascript Regex to split a string into array of grouped/contiguous characters

Your regex is fine, you're just using the wrong function. Use String.match, not String.split:

var matches = 'aaaabbbbczzxxxhhnnppp'.match(/((.)\2*)/g);

How to split string using regex in java

The idea is that (.)\\1+ is tried first and matches any run of two or more identical characters, while the |. alternative matches every remaining single character. Each match is added to a list, which is then printed.

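// Requires java.util.ArrayList, java.util.regex.Matcher, java.util.regex.Pattern
String s = "AABBABA";
ArrayList<String> fields = new ArrayList<String>();

// A run of two or more identical characters, or else any single character
Pattern regex = Pattern.compile("(.)\\1+|.");
Matcher m = regex.matcher(s);
while (m.find()) {
    fields.add(m.group(0)); // group(0) is the whole match
}
System.out.println(fields);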

Output:

[AA, BB, A, B, A]

The same approach applied to several inputs stored in an array:

String[] s = {"AA", "ABA", "AABBABA"};
Pattern regex = Pattern.compile("(.)\\1+|.");
for (String i : s) {
    ArrayList<String> fields = new ArrayList<String>();
    Matcher m = regex.matcher(i);
    while (m.find()) {
        fields.add(m.group(0));
    }
    System.out.println(fields);
}

Output:

[AA]
[A, B, A]
[AA, BB, A, B, A]
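The find loop can also be written as a stream. This is a sketch that assumes Java 9 or later, where Matcher.results() is available. The class name RunMatcherDemo and the method runs are hypothetical names used for illustration:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RunMatcherDemo {
    // Same pattern as above: a repeated run, or else any single character.
    private static final Pattern RUN = Pattern.compile("(.)\\1+|.");

    // Collects every match in order of appearance.
    static List<String> runs(String s) {
        return RUN.matcher(s)
                  .results()               // Java 9+: Stream<MatchResult>
                  .map(MatchResult::group) // whole match, like m.group(0)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(runs("AABBABA")); // [AA, BB, A, B, A]
    }
}
```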

Split string into repeated characters

Try this:

String   str = "aaaabbbccccaaddddcfggghhhh";
String[] out = str.split("(?<=(.))(?!\\1)");

System.out.println(Arrays.toString(out));
=> [aaaa, bbb, cccc, aa, dddd, c, f, ggg, hhhh]

Explanation: we want to split the string into groups of identical chars, so we need to find the "boundary" between each pair of groups. A positive look-behind captures the previous char, and a negative look-ahead with a back-reference verifies that the next char is not the same as the previous one. No characters are consumed, because the pattern consists of only two look-around assertions (that is, the regular expression is zero-width).

Extracting numbers from a String in Java by splitting on a regex

You could use a regex like this:

([-.]?\d+(?:\.\d+)?)

Match Information:

MATCH 1
1. [1-6] `0.286`
MATCH 2
1. [6-12] `-3.099`
MATCH 3
1. [12-17] `-0.44`
MATCH 4
1. [18-24] `-2.901`
MATCH 5
1. [25-31] `-0.436`
MATCH 6
1. [34-37] `123`
MATCH 7
1. [38-43] `0.123`
MATCH 8
1. [44-47] `.34`

Update

Jawee's approach

As Jawee pointed out in his comment, the regex above has a problem with input such as .34.34, which it matches as a single number. His regex fixes this:

(-?(?:\d+)?\.?\d+)

Engine explanation:

1st Capturing group (-?(?:\d+)?\.?\d+)
-? -> matches the character - literally, zero or one time
(?:\d+)? -> \d+ matches a digit [0-9] one or more times (inside an optional non-capturing group)
\.? -> matches the character . literally, zero or one time
\d+ -> matches a digit [0-9] one or more times
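Since the question is about Java, here is a minimal sketch that applies both regexes with Pattern and Matcher. The class name NumberExtractDemo, the helper findAll and the sample inputs are assumptions made for illustration; the backslashes are doubled because the patterns are written as Java string literals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberExtractDemo {
    // First regex from the answer (outer capturing group omitted).
    static final Pattern FIRST = Pattern.compile("[-.]?\\d+(?:\\.\\d+)?");
    // Jawee's regex from the update.
    static final Pattern UPDATED = Pattern.compile("-?(?:\\d+)?\\.?\\d+");

    // Hypothetical helper: returns every match of p in text, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll(UPDATED, "0.286 -3.099 abc 123 .34")); // [0.286, -3.099, 123, .34]
        System.out.println(findAll(FIRST, ".34.34"));   // [.34.34]
        System.out.println(findAll(UPDATED, ".34.34")); // [.34, .34]
    }
}
```

The last two lines show the difference on .34.34: the first regex swallows the whole input as one match, while the updated one finds two separate numbers.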

REGEX in java for extracting consecutive duplicate characters in a string

There is no single plain regex solution to this problem, because it would need a lookbehind with a backreference inside, which the Java regex engine does not support.

What you can do is either get all (\w)\1+ matches and then check their length using common string methods:

String s = "aaabbaa";
Pattern pattern = Pattern.compile("(\\w)\\1+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Keep only runs of exactly two identical characters
    if (matcher.group().length() == 2) {
        System.out.println(matcher.group(1));
    }
}

Or you can match either three or more repetitions or exactly two repetitions, and only use the match if Group 2 participated:

String s = "aaabbaa";
Pattern pattern = Pattern.compile("(\\w)\\1{2,}|(\\w)\\2");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Group 2 is non-null only when the second alternative matched
    if (matcher.group(2) != null) {
        System.out.println(matcher.group(2));
    }
}

Regex details:

  • (\w)\1{2,} - a word char and two or more occurrences of the same char right after
  • | - or
  • (\w)\2 - a word char and the same char right after.

How to split a string on regex in Python

You need to use re.split if you want to split a string according to a regex pattern.

tokens = re.split(r'[.:]', ip)

Note that [.:] matches a dot or a colon. Inside a character class, | is a literal pipe character, not alternation.

So remove | from the character class; otherwise the string would also be split on the pipe character.

Alternatively, use str.split together with a list comprehension:

>>> ip = '192.168.0.1:8080'
>>> [j for i in ip.split(':') for j in i.split('.')]
['192', '168', '0', '1', '8080']

