Java Regex to Extract Text Between Tags

Java regex to extract text between tags

You're on the right track. Now you just need to extract the desired group, as follows:

final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
if (matcher.find()) { // group(1) is only valid after a successful find()
    System.out.println(matcher.group(1)); // Prints String I want to extract
}

If you want to extract multiple hits, try this:

public static void main(String[] args) {
    final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
    System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
}

private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}

However, I agree that regular expressions are not the best answer here. I'd use XPath to find elements I'm interested in. See The Java XPath API for more info.
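
For comparison, here is a minimal sketch of the XPath route using the standard javax.xml.xpath API. The class name XPathTagValues and the <root> wrapper element are my own assumptions: an XML parser needs a single root element, which the regex example above does not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTagValues {

    // Returns the text content of every <tag> element in a well-formed XML string.
    static List<String> getTagValues(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//tag", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression and also when the input cannot be parsed.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical <root> wrapper added so the input is a well-formed document.
        String xml = "<root><tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag></root>";
        System.out.println(getTagValues(xml)); // Prints [apple, orange, pear]
    }
}
```

Unlike the regex, this rejects malformed input instead of silently returning partial matches.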

Java Extracting Text Between Tags and Attributes

I don't think placing the entire file contents into a single string is such a great idea, but I suppose that depends on how much content the file holds. If it's a lot, I would read the content in a little differently. It would have been nice to see a fictitious example of what the file contains.

I suppose you can try this little method. The heart of it utilizes a regular expression (RegEx) along with Pattern/Matcher to retrieve the desired substring from between tags.

It is important to read the docs that come with the method:

/**
* This method will retrieve a string contained between string tags. You
* specify what the starting and ending tags are within the startTag and
* endTag parameters. It is you who determines what the start and end tags
* are to be which can be any strings.<br><br>
*
* @param inputString (String) Any string to process.<br>
*
* @param startTag (String) The start tag or string. Data content retrieved
* will be directly after this tag.<br><br>
*
* The supplied Start Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p~*~>" and of course
* your End Tag can be "</p>". The "<p~*~>" would be the same as supplying
* "<p style=\"padding-left:40px;\">". Anything between the characters <p and
* the supplied close chevron (>) is taken into consideration. This allows for
* contents extraction regardless of what HTML attributes are attached to the
* tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
* Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param endTag (String) The End Tag or String. Data content retrieval will
* end just before this Tag is reached.<br>
*
* The supplied End Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p style=\"padding-left:40px;\">"
* and your End Tag can be "</~*~>". The "</~*~>" would be the same as supplying
* "</p>". Anything between the characters </ and the supplied close chevron (>)
* is taken into consideration. This allows for contents extraction regardless of what the
* HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param trimFoundData (Optional - Boolean - Default is true) By default
* all retrieved data is trimmed of leading and trailing white-spaces. If
* you do not want this then supply false to this optional parameter.
*
* @return (1D String Array) If there is more than one pair of Start and End
* Tags contained within the supplied input String then each set is placed
* into the Array separately.<br>
*
* @throws IllegalArgumentException if any supplied method String argument
* is null or empty ("").
*/
public static String[] getBetweenTags(String inputString, String startTag,
        String endTag, boolean... trimFoundData) {
    if (inputString == null || inputString.equals("") || startTag == null ||
            startTag.equals("") || endTag == null || endTag.equals("")) {
        throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
                + "A supplied method argument is null or empty (\"\")!\n"
                + "Supplied Method Arguments:\n"
                + "==========================\n"
                + "inputString = \"" + inputString + "\"\n"
                + "startTag = \"" + startTag + "\"\n"
                + "endTag = \"" + endTag + "\"\n");
    }

    List<String> list = new ArrayList<>();
    // Trimming is on by default; an explicit first varargs value overrides it.
    boolean trimFound = true;
    if (trimFoundData.length > 0) {
        trimFound = trimFoundData[0];
    }

    // Quote both tags so regex metacharacters in them match literally. Each
    // ~*~ wildcard closes the quote, becomes a lazy ".*?", and reopens it.
    String startRegex = Pattern.quote(startTag).replace("~*~", "\\E.*?\\Q");
    String endRegex = Pattern.quote(endTag).replace("~*~", "\\E.*?\\Q");
    // (?iu) = case-insensitive and Unicode-aware; (?s:...) lets the content span lines.
    Pattern pattern = Pattern.compile("(?iu)" + startRegex + "(?s:(.*?))" + endRegex);
    Matcher matcher = pattern.matcher(inputString);

    while (matcher.find()) {
        String match = matcher.group(1);
        if (trimFound) {
            match = match.trim();
        }
        list.add(match);
    }
    return list.toArray(new String[list.size()]);
}
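
To see what the wildcard does in isolation, here is a small self-contained sketch. The class WildcardTagDemo and the helpers tagToRegex and between are hypothetical names of mine, not part of the method above. It assumes the literal parts of each tag are quoted with Pattern.quote, so only ~*~ behaves like a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WildcardTagDemo {

    // Hypothetical helper: quotes the tag literally, then reopens the quote
    // around each ~*~ so that only the wildcard becomes a lazy ".*?".
    static String tagToRegex(String tag) {
        return Pattern.quote(tag).replace("~*~", "\\E.*?\\Q");
    }

    // Returns the trimmed text found between every start/end tag pair.
    static List<String> between(String input, String startTag, String endTag) {
        Pattern pattern = Pattern.compile(
                tagToRegex(startTag) + "(.*?)" + tagToRegex(endTag), Pattern.DOTALL);
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p style=\"padding-left:40px;\">Hello</p><p> World </p>";
        System.out.println(tagToRegex("<p~*~>"));            // Prints \Q<p\E.*?\Q>\E
        System.out.println(between(html, "<p~*~>", "</p>")); // Prints [Hello, World]
    }
}
```

Note that "<p~*~>" will also match tags such as <pre>, because the wildcard accepts any characters up to the next closing chevron.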

Extracting tags and text between tags using regex on a string with XML tags

It is a bad idea to use regex to parse XML. When elements can nest, a regex has no way of matching a complete element from its opening tag to its closing tag (a regex cannot "remember" how many opening tags it has seen).
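
A quick illustration of that limit, as a sketch with a made-up class name and input: a lazy pattern pairs the outer opening tag with the first closing tag it finds, which belongs to the inner element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagPitfall {

    // Returns what a lazy tag regex captures for the first <div> in the input, or null.
    static String firstDivContent(String input) {
        Matcher matcher = Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical nested input.
        String nested = "<div>outer <div>inner</div> tail</div>";
        // The match stops at the first </div>, which closes the inner element.
        System.out.println(firstDivContent(nested)); // Prints outer <div>inner
    }
}
```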

However, here is why your regex fails in this specific case:

In re1, re2 and re3 you chose capturing groups that include < and > (and you did not include the / in re3). You could simply change them to:

String re1="<([^>]+)>"; // Tag 1
String re2="([^<]*)"; // Variable Name 1
String re3="</([^>]+)>"; // Tag 2

or use a suitable regex to remove < and > from tag1:

System.out.println(tag1.toString().replaceAll("<|>", ""));

or

System.out.println(tag1.toString().replaceAll("[<>]", ""));
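
Put together, the three corrected pieces could be used like this. It is only a sketch: the class name, the parse helper and the sample input are my own assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNameAndValue {

    // The three corrected pieces, with the angle brackets outside the capturing groups.
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>" + "([^<]*)" + "</([^>]+)>");

    // Returns {opening tag name, text, closing tag name}, or null if nothing matches.
    static String[] parse(String input) {
        Matcher matcher = ELEMENT.matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical input string.
        String[] parts = parse("<name>John</name>");
        System.out.println(String.join(" | ", parts)); // Prints name | John | name
    }
}
```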

Java Pattern regex to extract between tags

You should use an XML parser, as suggested by Lucero.

However, if you must use a regex, you can use the following:

<title.*?<\/link>

Working regex101 link https://regex101.com/r/EWG2Io/2

Edit

For the special case where you need everything inside <item></item>, use the following:

<item.*?>(.*?)<\/item>

Working example https://regex101.com/r/Ow1A5F/1

Here's a Java sample as well:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex {
    public static void main(String[] args) {
        String str = "<item value=\"key\" atr='none'><date><date><title val=\"has value\">Good</title><link>www</link></item><item value=\"key\" atr='none'><title val=\"has value\">Bad</title><link>http</link><author></author></item><item value=\"key\" atr='none'><title val=\"has value\">Neutral</title><link>ftp</link></item>";

        // Group 1 captures everything between an opening <item ...> and the next </item>.
        Pattern pattern = Pattern.compile("<item.*?>(.*?)<\\/item>");

        Matcher match = pattern.matcher(str);

        while (match.find()) {
            System.out.println(match.group(1));
        }
    }
}

Output

<date><date><title val="has value">Good</title><link>www</link>
<title val="has value">Bad</title><link>http</link><author></author>
<title val="has value">Neutral</title><link>ftp</link>
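
For comparison, the parser route recommended at the top of this answer might look like the sketch below. It uses the JDK's built-in DOM parser. The class name, the <items> root element and the simplified input are my assumptions, and the input has to be well formed, so the unclosed <date><date> from the sample above would be rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ItemTitlesWithDom {

    // Returns the text of every <title> element in a well-formed XML string.
    static List<String> titles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("title");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical well-formed input wrapped in a single <items> root.
        String xml = "<items>"
                + "<item value=\"key\"><title val=\"has value\">Good</title><link>www</link></item>"
                + "<item value=\"key\"><title val=\"has value\">Bad</title><link>http</link></item>"
                + "</items>";
        System.out.println(titles(xml)); // Prints [Good, Bad]
    }
}
```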

Regex for extracting text between Tags but not the tags

Easy. It's because you're taking the whole match, not just the value of group 1.

String nodestr = "<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>";
String regex = "<xpath>(.+?)</xpath>"; // "/" needs no escaping; "\/" is not a valid Java string escape
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(nodestr);
if (matcher.matches()) {
    String tag_value = matcher.group(1); // taking only group 1
    System.out.println(tag_value); // printing only group 1
}
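
To make the difference visible, this sketch (the class and method names are hypothetical) returns both the whole match and group 1 for the same input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeMatchVsGroup {

    private static final Pattern XPATH_TAG = Pattern.compile("<xpath>(.+?)</xpath>");

    // Returns {whole match, group 1}, or null when the input is not exactly one <xpath> element.
    static String[] wholeAndInner(String input) {
        Matcher matcher = XPATH_TAG.matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        // group() is the same as group(0): the entire matched text, tags included.
        return new String[] { matcher.group(), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] result = wholeAndInner("<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>");
        System.out.println(result[0]); // Prints <xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>
        System.out.println(result[1]); // Prints /Temporary/EIC/SpouseSSNDisqualification
    }
}
```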

Java regular expression to extract content between tags

When you use *, the regex tries to consume as many characters as possible (greedy).

If you want .* to match as few characters as possible, use a lazy match with *?.

So your regex becomes:

<tag class=\"(.*?)\">(.*?)</tag>

That is the easy way, but not necessarily the optimal one. A lazy match is often slower than a greedy one, because the engine has to test the rest of the pattern after every character it consumes, so avoid it when you can. For example, if you can assume the input is well formed (no broken tags, no missing closing tags, etc.), it is better to use negated character classes instead of .*?. Your regex can then be written as:

<tag class="([^"]*)">([^<]*)</tag>

Which is more efficient for the regex engine (although it is not always possible to convert a lazy match into a negated class).

And of course, if you are trying to parse a complete HTML or XML document in which you must make many different changes, it's better to use an XML (HTML) parser.
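
The three variants can be compared side by side with a sketch like this. The class name, the extract helper and the sample input are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazyNegated {

    // Collects "class=text" for every match of a two-group regex.
    static List<String> extract(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical input with two tags on one line.
        String input = "<tag class=\"a\">one</tag><tag class=\"b\">two</tag>";

        // Greedy: the first .* runs to the last "> it can reach, so both tags collapse into one match.
        System.out.println(extract("<tag class=\"(.*)\">(.*)</tag>", input).size()); // Prints 1
        // Lazy: each group stops at the first place the rest of the pattern can match.
        System.out.println(extract("<tag class=\"(.*?)\">(.*?)</tag>", input)); // Prints [a=one, b=two]
        // Negated classes: same result here, without a dot quantifier to backtrack over.
        System.out.println(extract("<tag class=\"([^\"]*)\">([^<]*)</tag>", input)); // Prints [a=one, b=two]
    }
}
```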

RegEx to extract text between tags in Java

Try this.

String data = ""
+ ":32B:xxx,\n"
+ ":59:yyy\n"
+ "something\n"
+ ":70:ACK1\n"
+ "ACK2\n"
+ ":21:something\n"
+ ":71A:something\n"
+ ":23E:something\n"
+ "value\n"
+ ":70:ACK2\n"
+ "ACK3\n"
+ ":71A:something\n";
Pattern pattern = Pattern.compile(":70:(.*?)\\s*:", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
    System.out.println("found=" + matcher.group(1));
}

result:

found=ACK1
ACK2
found=ACK2
ACK3

