Parsing XML with Regex in Java

This should work in Java if you can assume that, between the DataElements tags, everything has the form <tag>value</tag>, i.e. no attributes and no nested elements.

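// Assumes subjectString holds the XML, and that list (a List<DataElement>)
// and the DataElement class are defined elsewhere.
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
// Matches <name>value</name>; the \\1 backreference makes the closing tag repeat the opening tag's name.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
Matcher matcher = regex.matcher(subjectString);
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}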
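
For reference, here is a self-contained sketch of the same two-step approach. The DataElement holder class, the class name DataElementsRegexDemo and the sample XML are assumptions made for illustration; the original snippet does not show them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataElementsRegexDemo {

    // Hypothetical holder for one <name>value</name> pair (not shown in the original answer).
    static final class DataElement {
        final String name;
        final String value;

        DataElement(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Step 1: isolate everything between the DataElements tags (DOTALL lets . span newlines).
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Step 2: match flat <name>value</name> pairs; \\1 repeats the opening tag's name.
    private static final Pattern PAIR =
            Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    static List<DataElement> parse(String xml) {
        List<DataElement> list = new ArrayList<>();
        Matcher block = BLOCK.matcher(xml);
        if (block.find()) {
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                list.add(new DataElement(pair.group(1), pair.group(2)));
            }
        }
        return list;
    }

    public static void main(String[] args) {
        String xml = "<Root><DataElements><Name>Alice</Name><Age>30</Age></DataElements></Root>";
        for (DataElement e : parse(xml)) {
            System.out.println(e.name + " = " + e.value);
        }
    }
}
```

Because the second pattern requires at least one character between the tags, empty elements are skipped, and elements with attributes never match at all.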

Use Java Regex to parse xml file

Give this pattern a try:

"<(tag[13])>(.+?)</tag[13]>"

Usage:

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not interested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not interested</tag2></element><element><tag1>Key3</tag1><tag2>Not interested</tag2><tag3>Value3</tag3></element></MainTag>";

    // Group 1 is the tag name (tag1 or tag3), group 2 is its text
    Matcher matcher = Pattern.compile("<(tag[13])>(.+?)</tag[13]>").matcher(xmlString);
    while (matcher.find()) {
        System.out.println(matcher.group(1) + " " + matcher.group(2));
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3
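
One possible tightening, offered as a sketch rather than as part of the original answer: the pattern above would also accept a mismatched pair such as <tag1>...</tag3>. Replacing the closing tag[13] with the backreference \\1 requires the closing tag to repeat the name captured in group 1. The class and method names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBackreferenceDemo {

    // \\1 repeats whatever group 1 captured, so <tag1> only closes with </tag1>.
    private static final Pattern TAG = Pattern.compile("<(tag[13])>(.+?)</\\1>");

    static List<String> extract(String xml) {
        List<String> results = new ArrayList<>();
        Matcher matcher = TAG.matcher(xml);
        while (matcher.find()) {
            results.add(matcher.group(1) + " " + matcher.group(2));
        }
        return results;
    }

    public static void main(String[] args) {
        String xml = "<element><tag1>Key1</tag1><tag2>skip</tag2><tag3>Value1</tag3></element>";
        extract(xml).forEach(System.out::println);
    }
}
```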

NON REGEX

Or you could use Document from the org.w3c.dom package together with DocumentBuilderFactory from javax.xml.parsers.

Something like:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static void main(String[] args) throws Exception {
    String xmlString = "<MainTag><element><tag1>Key1</tag1><tag2>Not interested</tag2><tag3>Value1</tag3></element><element><tag1>Key2</tag1><tag2>Not interested</tag2></element><element><tag1>Key3</tag1><tag2>Not interested</tag2><tag3>Value3</tag3></element></MainTag>";
    Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmlString.getBytes(StandardCharsets.UTF_8)));

    // The document element is <MainTag>
    Node rootNode = xmlDocument.getDocumentElement();
    if (rootNode.hasChildNodes()) {
        // Get each <element> child node
        NodeList elementsList = rootNode.getChildNodes();
        for (int i = 0; i < elementsList.getLength(); i++) {
            if (elementsList.item(i).hasChildNodes()) {
                // Get each tag child node of the <element> node
                NodeList tagsList = elementsList.item(i).getChildNodes();
                for (int i2 = 0; i2 < tagsList.getLength(); i2++) {
                    Node tagNode = tagsList.item(i2);
                    if (tagNode.getNodeName().matches("tag1|tag3")) {
                        System.out.println(tagNode.getNodeName() + " " + tagNode.getTextContent());
                    }
                }
            }
        }
    }
}

Results:

tag1 Key1
tag3 Value1
tag1 Key2
tag1 Key3
tag3 Value3

Why is it such a bad idea to parse XML with regex?

The real trouble is nested tags, which are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET and maybe a couple of other flavors; Java's java.util.regex does not support it. Even with the power of balanced matching, an ill-placed comment could throw off the regular expression.

For example, this is a tricky one to parse...

<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
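
To make the comment trap concrete, here is a small sketch that runs a naive regex and the JDK's DOM parser over the markup above. The class name, method names and the regex itself are illustrative assumptions, and the snippet happens to be well-formed XML, which is why an XML parser can read it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CommentTrapDemo {

    static final String MARKUP = String.join("\n",
            "<div>",
            "<div id=\"parse-this\">",
            "<!-- oops</div> -->",
            "try to get this value with regex",
            "</div>",
            "</div>");

    // Naive attempt: capture lazily up to the first </div>, which sits inside the comment.
    static String withRegex(String markup) {
        Matcher m = Pattern.compile("<div id=\"parse-this\">(.*?)</div>", Pattern.DOTALL)
                .matcher(markup);
        return m.find() ? m.group(1).trim() : null;
    }

    // The parser knows what a comment is; getTextContent() on an element leaves comments out.
    static String withParser(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList divs = doc.getElementsByTagName("div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("parse-this".equals(div.getAttribute("id"))) {
                    return div.getTextContent().trim();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("regex:  " + withRegex(MARKUP));
        System.out.println("parser: " + withParser(MARKUP));
    }
}
```

The regex stops at the </div> inside the comment and returns the comment opener, while the parser returns the intended text.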

RegEx for matching the CDATA from XML strings

You should never parse HTML with regex; use an HTML parser such as JSoup instead.

The problem here is that you must call matcherObject.find() (to find the pattern anywhere in the string) or matcherObject.matches() (to match the whole string against the pattern) before you can access the match. You should also always check that the value returned by find() or matches() is true, using an if or a while loop. Finally, call group(1) to access the contents of the first group; group(0) returns the whole match.

Change your code to this:

String neMsg = "<root>" + "   <CONTENT>"
        + " <![CDATA[00000:<ResponseClass Name=\"Response\"><ITEM>HAHA</ITEM></ResponseClass>]]>"
        + " </CONTENT>" + "</root>";

Pattern pP0 = Pattern.compile(".*<!\\[CDATA\\[00000:(.*)\\]\\]>.*");
java.util.regex.Matcher mP0 = pP0.matcher(neMsg);
// matches() would also work here because the pattern is wrapped in .* on both sides
if (mP0.find()) {
    System.out.println(mP0.group(1));
}

This prints the contents of group 1:

<ResponseClass Name="Response"><ITEM>HAHA</ITEM></ResponseClass>
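
Since the input here is XML rather than HTML, the parser route can also be sketched with the JDK's built-in DOM parser instead of JSoup. The class and method names are illustrative, and the code assumes the payload always sits in the first CONTENT element and may start with the 00000: prefix.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CdataParserDemo {

    // Returns the CDATA text of the first CONTENT element without the "00000:" prefix,
    // or null if there is no CONTENT element.
    static String contentPayload(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node content = doc.getElementsByTagName("CONTENT").item(0);
            if (content == null) {
                return null;
            }
            // getTextContent() includes the CDATA text; trim() drops the surrounding spaces.
            String text = content.getTextContent().trim();
            String prefix = "00000:";
            return text.startsWith(prefix) ? text.substring(prefix.length()) : text;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String neMsg = "<root>" + "   <CONTENT>"
                + " <![CDATA[00000:<ResponseClass Name=\"Response\"><ITEM>HAHA</ITEM></ResponseClass>]]>"
                + " </CONTENT>" + "</root>";
        System.out.println(contentPayload(neMsg));
    }
}
```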

Find everything between two XML tags with RegEx

It is not a good idea to use regex for HTML/XML parsing...

However, if you want to do it anyway, search for the regex pattern

<primaryAddress>[\s\S]*?<\/primaryAddress>

and replace it with an empty string.
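
In Java, that search-and-replace might look like the sketch below. The class name, method names and sample input are assumptions, and a capturing group has been added to the pattern so the inner text can be read as well as removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrimaryAddressDemo {

    // [\\s\\S] matches any character, newlines included; *? stops at the first closing tag.
    private static final Pattern BLOCK =
            Pattern.compile("<primaryAddress>([\\s\\S]*?)</primaryAddress>");

    // Returns the text between the first pair of tags, or null if there is none.
    static String between(String xml) {
        Matcher m = BLOCK.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // Removes every <primaryAddress>...</primaryAddress> block, tags included.
    static String strip(String xml) {
        return BLOCK.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml = "<person><primaryAddress>\n<street>1 Main St</street>\n</primaryAddress>"
                + "<phone>555</phone></person>";
        System.out.println(between(xml));
        System.out.println(strip(xml));
    }
}
```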


