Remove HTML tags from a String
Use an HTML parser instead of regex. This is dead simple with Jsoup.
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}
Jsoup also supports removing HTML tags against a customizable whitelist (the Safelist class, named Whitelist before jsoup 1.14.2), which is very useful if you want to allow only e.g. <b>, <i> and <u>.
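As a rough illustration of why a regex is fragile here, the sketch below uses only the JDK. The class name NaiveRegexPitfall and the sample strings are made up for this example. It strips everything between a '<' and the next '>', which works for simple markup but breaks as soon as an attribute value contains a '>'.

```java
// Hypothetical demo class; names and inputs are illustrative only.
class NaiveRegexPitfall {

    // Treats everything between '<' and the next '>' as a tag and removes it.
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup: the tags are removed as expected.
        System.out.println(naiveStrip("<p>Hello</p>"));

        // A '>' inside an attribute value ends the "tag" too early,
        // so part of the attribute leaks into the result.
        System.out.println(naiveStrip("<a title=\"1 > 0\">link</a>"));
    }
}
```

A real parser tracks quoting and nesting, which is why it does not trip over input like the second line.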
See also:
- RegEx match open tags except XHTML self-contained tags
- What are the pros and cons of the leading Java HTML parsers?
- XSS prevention in JSP/Servlet web application
Stripping HTML tags in Java
Use Jsoup. It's well documented, available on Maven, and after a day spent trying several libraries it is, for me, the best one I can imagine. In my opinion a job like this, converting HTML into plain text, should be possible in one line of code; otherwise the library has failed somehow. Just saying. In Markdown4J and Markdownj nothing like it is possible, and in htmlCleaner it is a pain that takes about 50 lines of code. So here is the Jsoup one-liner (HtmlToPlainText is an example class in the org.jsoup.examples package):
String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));
And what you get is real plain text (not just the HTML source code as a String, as with other libraries). It really does a great job at that. The quality is more or less the same as Markdownify for PHP.
Remove HTML tags from a String with content
You may try the capturing-group-based regex below. It removes each matched element together with its content.
string.replaceAll("(?s)<(\\w+)\\b[^<>]*>.*?</\\1>", "");
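To see what this pattern does and where it stops working, here is a small JDK-only sketch. The class name ElementRegexDemo and the sample inputs are assumptions made for illustration. The backreference \\1 forces the closing tag to repeat the captured tag name, and (?s) lets '.' match across line breaks.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class ElementRegexDemo {

    // Same pattern as above, precompiled once.
    private static final Pattern ELEMENT =
            Pattern.compile("(?s)<(\\w+)\\b[^<>]*>.*?</\\1>");

    // Removes every matched element, including the text between its tags.
    static String removeElements(String input) {
        return ELEMENT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // A simple element is removed together with its content.
        System.out.println(removeElements("Hello <b>bold</b> world"));

        // A tag without a closing counterpart never matches and is left alone.
        System.out.println(removeElements("a<br>b"));

        // Nested elements of the same name: the lazy match stops at the
        // first closing tag, so the outer element is only partly removed.
        System.out.println(removeElements("<div>x<div>y</div>z</div>"));
    }
}
```

The last two cases are the usual reason to prefer a parser when the input is not tightly controlled.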
How to remove HTML tag in Java
You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.
With htmlCleaner you can do:
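TagNode root = htmlCleaner.clean( stream );
// Attribute tests in XPath need the @ prefix.
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// Check the first result, not the array itself.
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}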
Removing html tags
String
is immutable in Java + You never display anything
I recommend you close
your Scanner
when done with it (as a best practice), and reading the HTML_1.txt
file from the user's HOME directory. The simplest way to close
is a try-with-resources
like
public static void main(String[] args) {
    try (Scanner input = new Scanner(new File(
            System.getProperty("user.home"), "HTML_1.txt"))) {
        while (input.hasNextLine()) {
            String html = stripHtmlTags(input.nextLine().trim());
            if (!html.isEmpty()) { // <-- removes empty lines.
                System.out.println(html);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Because String is immutable I would recommend a StringBuilder to remove the HTML tags like
static String stripHtmlTags(String html) {
    StringBuilder sb = new StringBuilder(html);
    int open;
    while ((open = sb.indexOf("<")) != -1) {
        int close = sb.indexOf(">", open + 1);
        if (close == -1) {
            break; // unclosed '<': stop instead of deleting with an invalid range
        }
        sb.delete(open, close + 1);
    }
    return sb.toString();
}
When I run the above I get
My web page
There are many pictures of my cat here,
as well as my very cool blog page,
which contains awesome
stuff about my trip to Vegas.
Here's my cat now:
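The immutability point can be shown with a short JDK-only sketch. The class name ImmutableStringDemo and the helper stripBold are made up for this illustration and are not part of the code above. Calling replace() without using its return value changes nothing; the stripped text only exists in the returned String.

```java
// Hypothetical demo class; names and inputs are illustrative only.
class ImmutableStringDemo {

    // replace() returns a new String; the argument itself is never modified.
    static String stripBold(String line) {
        return line.replace("<b>", "").replace("</b>", "");
    }

    public static void main(String[] args) {
        String line = "<b>cat</b>";

        line.replace("<b>", "");   // result discarded, line is unchanged
        System.out.println(line);  // still contains both tags

        line = stripBold(line);    // keep the returned value
        System.out.println(line);  // tags are gone
    }
}
```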
How can I use a regex to remove HTML tags from a String?
Use a proper HTML parser like Jsoup instead of string manipulation or regex. Jsoup provides a very convenient API for extracting and manipulating HTML data and is intuitive to work with. Using Jsoup, your code could look like:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Example2 {
    public static void main(String[] args) {
        String html =
                "<html>\n"
                + "<head></head>"
                + "<body>"
                + " <table>"
                + " <tr class='list odd'>\n"
                + " <td class=\"list\" align=\"center\">Do</td>\n"
                + " <td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                + " <td class=\"list\" align=\"center\">---</td>\n"
                + " <td class=\"list\" align=\"center\"><s>Q1e14</s></td>\n"
                + " <td class=\"list\" align=\"center\">Arbeitsauftrag:</td>\n"
                + " <td class=\"list\" align=\"center\">entfällt</td></tr>\n"
                + " </table>"
                + "</body>\n"
                + "</html>";
        Document doc = Jsoup.parse(html);
        Elements tds = doc.select("td");
        tds.forEach(td -> System.out.println(td.text()));
    }
}
output:
Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt
Maven repo:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>