Stripping HTML Tags in Java

Remove HTML tags from a String

Use an HTML parser instead of regex. This is dead simple with Jsoup.

import org.jsoup.Jsoup;

public static String html2text(String html) {
    // Parse the markup and return only its text content
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable safelist (the Safelist class, named Whitelist in older Jsoup releases), which is very useful if you want to allow only e.g. <b>, <i> and <u>.

See also:

  • RegEx match open tags except XHTML self-contained tags
  • What are the pros and cons of the leading Java HTML parsers?
  • XSS prevention in JSP/Servlet web application

Stripping HTML tags in Java

Use Jsoup. It's well documented, available on Maven, and after a day spent trying several libraries it is the best one I found. In my opinion, a job like converting HTML into plain text should be possible in one line of code; otherwise the library has failed somehow. So here is the Jsoup one-liner, using the HtmlToPlainText helper from Jsoup's org.jsoup.examples package. Nothing like it is possible in Markdown4J or MarkdownJ, and in htmlCleaner it is painful, taking about 50 lines of code.

String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));

What you get is real plain text (not just the HTML source code as a String, as with some other libraries); it really does a great job. The quality is more or less the same as Markdownify for PHP.

Remove HTML tags from a String with content

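You can try the regex below. It captures the tag name in a group and uses a backreference to find the matching closing tag, so the whole element is removed together with its content.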

string.replaceAll("(?s)<(\\w+)\\b[^<>]*>.*?</\\1>", "");
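
To see what this pattern does and where it stops working, here is a small sketch. The class and method names (RegexStripDemo, removeElements) are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the answer above.
class RegexStripDemo {
    // (?s) lets '.' match newlines; \1 refers back to the tag name captured by (\w+).
    private static final Pattern ELEMENT =
            Pattern.compile("(?s)<(\\w+)\\b[^<>]*>.*?</\\1>");

    static String removeElements(String html) {
        return ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Simple case: the element and its content are both removed.
        System.out.println(removeElements("Hello <b>bold</b> world"));
        // Nested tags with the same name: the lazy match ends at the first </div>,
        // so "c</div>d" is left behind.
        System.out.println(removeElements("<div>a<div>b</div>c</div>d"));
        // Elements without a closing tag, such as <br>, are not matched at all.
        System.out.println(removeElements("line one<br>line two"));
    }
}
```

The second and third cases are the reason the other answers here recommend a parser: the regex has no notion of nesting or of elements that have no closing tag.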

How to remove HTML tag in Java

You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

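// htmlCleaner is an HtmlCleaner instance, stream is the HTML input
TagNode root = htmlCleaner.clean( stream );
// attributes are addressed with @ in XPath
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// check the first result, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}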

Removing HTML tags

String is immutable in Java, and you never display anything.
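
As a minimal sketch of the immutability point (the class name ImmutableStringDemo and the helper stripBold are invented for this example): String methods return a new String, so the result has to be kept and printed.

```java
// Hypothetical demo class showing why the returned String must be kept.
class ImmutableStringDemo {
    static String stripBold(String html) {
        // replace() returns a new String; the argument itself is never modified.
        return html.replace("<b>", "").replace("</b>", "");
    }

    public static void main(String[] args) {
        String line = "<b>My web page</b>";
        line.replace("<b>", "");           // result discarded, line is unchanged
        System.out.println(line);          // still prints the tags
        String cleaned = stripBold(line);  // keep the returned value...
        System.out.println(cleaned);       // ...and display it
    }
}
```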

I recommend you close your Scanner when done with it (as a best practice), and read the HTML_1.txt file from the user's HOME directory. The simplest way to close it is a try-with-resources statement, like

import java.io.File;
import java.util.Scanner;

public static void main(String[] args) {
    // try-with-resources closes the Scanner automatically
    try (Scanner input = new Scanner(new File(
            System.getProperty("user.home"), "HTML_1.txt"))) {
        while (input.hasNextLine()) {
            String html = stripHtmlTags(input.nextLine().trim());
            if (!html.isEmpty()) { // <-- removes empty lines.
                System.out.println(html);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Because String is immutable, I would recommend a StringBuilder to remove the HTML tags, like

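static String stripHtmlTags(String html) {
    StringBuilder sb = new StringBuilder(html);
    int open;
    while ((open = sb.indexOf("<")) != -1) {
        int close = sb.indexOf(">", open + 1);
        if (close == -1) {
            // unterminated tag: stop here, otherwise delete() would get an
            // invalid range (exception) or remove nothing (endless loop)
            break;
        }
        sb.delete(open, close + 1);
    }
    return sb.toString();
}

When I run the above I get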

My web page
There are many pictures of my cat here,
as well as my very cool blog page,
which contains awesome
stuff about my trip to Vegas.
Here's my cat now:
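
If the repeated indexOf/delete calls bother you, the same idea can be written as a single pass over the characters. This is only a sketch, not part of the answer's code: the class name SinglePassStripper is made up, and it assumes every '<' starts a tag, so an unterminated tag swallows the rest of the line. Like the version above, it does not decode entities and does not understand comments, scripts, or '>' inside attribute values.

```java
// Hypothetical single-pass variant of stripHtmlTags.
class SinglePassStripper {
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false; // true while we are between '<' and '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // assumption: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;           // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);           // ordinary text, including a stray '>'
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<h1>My web page</h1>"));
    }
}
```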

How can I use a regex to remove HTML tags from a String?

Use a proper HTML parser like Jsoup instead of string manipulation or regex. Jsoup provides a very convenient API for extracting and manipulating HTML data and is intuitive to work with. Using Jsoup, your code could look like:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Example2 {
    public static void main(String[] args) {
        String html =
                "<html>\n"
                + "<head></head>"
                + "<body>"
                + " <table>"
                + " <tr class='list odd'>\n"
                + " <td class=\"list\" align=\"center\">Do</td>\n"
                + " <td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                + " <td class=\"list\" align=\"center\">---</td>\n"
                + " <td class=\"list\" align=\"center\"><s>Q1e14</s></td>\n"
                + " <td class=\"list\" align=\"center\">Arbeitsauftrag:</td>\n"
                + " <td class=\"list\" align=\"center\">entfällt</td></tr>\n"
                + " </table>"
                + "</body>\n"
                + "</html>";

        Document doc = Jsoup.parse(html);

        Elements tds = doc.select("td");
        tds.forEach(td -> System.out.println(td.text()));
    }
}

output:

Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt

Maven repo:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>

