How to Unescape HTML Character Entities in Java

How to unescape HTML character entities in Java?

I have used StringEscapeUtils.unescapeHtml4() from Apache Commons Lang 3 for this. Its Javadoc says:

Unescapes a string containing entity
escapes to a string containing the
actual Unicode characters
corresponding to the escapes. Supports
HTML 4.0 entities.

How to unescape HTML 5 entities in Java (&apos;)

unbescape does the job well:

import org.unbescape.html.HtmlEscape;
...
final String unescapedText = HtmlEscape.unescapeHtml("&apos;");
System.out.println(unescapedText);

Result:

'

Maven:

<!-- https://mvnrepository.com/artifact/org.unbescape/unbescape -->
<dependency>
<groupId>org.unbescape</groupId>
<artifactId>unbescape</artifactId>
<version>1.1.6.RELEASE</version>
</dependency>

Replace HTML codes with equivalent characters in Java

Also, is there any way to optimize this regex?

Yes: don't use a regex for this task. Use StringEscapeUtils from Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);

JavaDoc says:

Unescapes a string containing entity escapes to a string containing
the actual Unicode characters corresponding to the escapes. Supports
HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>".

If an entity is unrecognized, it is left alone and inserted verbatim into the result string, e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
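To make the documented behavior concrete, here is a minimal stdlib-only sketch of an entity decoder that leaves unrecognized entities untouched. It is not the Commons Lang implementation: the class name EntitySketch, the method unescape, and the tiny entity table are all assumptions made for illustration, and a real library covers the full entity lists.

```java
import java.util.Map;

final class EntitySketch {
    // Hypothetical, deliberately tiny table; real libraries ship the full HTML 4/5 lists.
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<", "gt", ">", "amp", "&", "quot", "\"", "apos", "'", "ccedil", "\u00E7");

    /** Decodes known entities in a single pass; anything unrecognized is copied verbatim. */
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Only an '&' followed later by ';' can start an entity.
            int semi = (c == '&') ? input.indexOf(';', i + 1) : -1;
            String decoded = (semi > i + 1) ? decode(input.substring(i + 1, semi)) : null;
            if (decoded != null) {
                out.append(decoded);
                i = semi + 1;          // skip past the whole entity
            } else {
                out.append(c);         // not an entity we know: keep the character as-is
                i++;
            }
        }
        return out.toString();
    }

    /** Returns the decoded text for an entity body such as "lt", "#39" or "#x27", or null. */
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        String digits = body.substring(hex ? 2 : 1);
        int radix = hex ? 16 : 10;
        // Reject empty bodies and leading signs, which Integer.parseInt would otherwise accept.
        if (digits.isEmpty() || Character.digit(digits.charAt(0), radix) < 0) {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("&lt;Fran&ccedil;ais&gt;"));
        System.out.println(unescape("&gt;&zzzz;x"));
    }
}
```

The sketch scans the string by hand instead of using a regex, in line with the advice above. For production use, a maintained library is still the safer choice.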

Is it possible to use unescapeHTML of StringEscapeUtils to unescape HTML twice?

A workaround would be to replace the strings that you know are not escaped correctly beforehand, inside an if clause.
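One way this workaround might look, as a sketch only: the class name DoubleEscapeFix, the method collapseKnownDoubleEscapes, and the list of double-escaped sequences are hypothetical, and you would fill the list with whatever sequences you know are escaped twice in your data. The result would then be passed to the library's unescape method once.

```java
final class DoubleEscapeFix {
    // Hypothetical list: sequences known to be escaped twice, and their single-escaped form.
    private static final String[][] KNOWN_DOUBLE_ESCAPES = {
        {"&amp;lt;", "&lt;"},
        {"&amp;gt;", "&gt;"},
        {"&amp;quot;", "&quot;"},
    };

    /** Collapses only the listed double-escaped sequences; everything else is untouched. */
    static String collapseKnownDoubleEscapes(String text) {
        String result = text;
        for (String[] pair : KNOWN_DOUBLE_ESCAPES) {
            if (result.contains(pair[0])) {
                result = result.replace(pair[0], pair[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The output of this step is what you would hand to the library's unescape call.
        System.out.println(collapseKnownDoubleEscapes("&amp;lt;b&amp;gt; Tom &amp; Jerry"));
    }
}
```

Restricting the replacement to a known list avoids turning a legitimately escaped "&amp;" into a bare "&" by accident.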

String not replacing characters

Html entity decoder java

The unbescape library might prove useful here. I found it versatile and very quick to integrate. I'm using it to convert huge lists of strings containing HTML entities back to normal strings, and it has been fast and accurate so far.


