Stripping Out HTML Tags from a String

How to strip HTML tags from string in JavaScript?

Using the browser's parser is probably the best bet in current browsers. The following will work, with these caveats:

  • Your HTML is valid within a <div> element. HTML contained within <body> or <html> or <head> tags is not valid within a <div> and may therefore not be parsed correctly.
  • textContent (the DOM standard property) and innerText (originally non-standard) are not identical. For example, textContent includes text within a <script> element while innerText does not (in most browsers). The innerText fallback only matters in IE <= 8, the only major browser that does not support textContent.
  • The HTML does not contain <script> elements.
  • The HTML is not null
  • The HTML comes from a trusted source. Using this with arbitrary HTML allows arbitrary untrusted JavaScript to be executed. This example is from a comment by Mike Samuel on the duplicate question: <img onerror='alert("could run arbitrary JS here")' src=bogus>

Code:

var html = "<p>Some HTML</p>";
var div = document.createElement("div");
div.innerHTML = html;
var text = div.textContent || div.innerText || "";

How do I remove all HTML tags from a string without knowing which tags are in it?

You can use a simple regex like this:

public static string StripHTML(string input)
{
    return Regex.Replace(input, "<.*?>", String.Empty);
}

Be aware that this solution has its own flaws. See "Remove HTML tags in String" for more information (especially the comments by Mark E. Haase, @mehaase).

Another solution would be to use the HTML Agility Pack.

You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?

Stripping out HTML tags from a string

Hmm, I tried your function and it worked on a small example:

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

//output " My First Heading My first paragraph. "

Can you give an example of a problem?

Swift 4 and 5 version:

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)

How to remove HTML tag (not a specific tag ) with content from a string in javascript

Removing all HTML tags together with their inner text can be done with the following snippet. The regex captures the opening tag's name, then matches all content between the opening and closing tags, then uses the captured tag name to match the closing tag.

const regexForStripHTML = /<([^</> ]+)[^<>]*?>[^<>]*?<\/\1> */gi;
const text = "OCEP <sup>®</sup> water product";
const stripContent = text.replaceAll(regexForStripHTML, '');
console.log(text);
console.log(stripContent);
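
Because the content part of this pattern is [^<>]*?, it only matches an element whose content holds no further tags, so nested markup survives a single pass. A minimal sketch of one workaround, assuming properly closed and nested tags, is to repeat the replacement until the string stops changing. The helper name stripElementsWithContent and the nested sample string are made up for illustration.

```javascript
// Same pattern as above: opening tag, tag-free content, matching closing tag.
const elementPattern = /<([^</> ]+)[^<>]*?>[^<>]*?<\/\1> */gi;

// Hypothetical helper: repeat the replacement until nothing changes, so the
// innermost elements go first and their parents match on a later pass.
function stripElementsWithContent(input) {
  let previous;
  let current = input;
  do {
    previous = current;
    current = current.replace(elementPattern, '');
  } while (current !== previous);
  return current;
}

console.log(stripElementsWithContent('OCEP <sup>®</sup> water product'));
console.log(stripElementsWithContent('a <div><b>x</b> y</div> b'));
```

Void elements such as <br> and tags that are never closed are left untouched by this approach.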

How to remove all html tags from a string

You can strip out all the html-tags with a regular expression: /<(.|\n)*?>/g

Described in detail here: http://www.pagecolumn.com/tool/all_about_html_tags.htm

In your JS code it would look like this:

item = item.replace(/<(.|\n)*?>/g, '');
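
The (.|\n) in this pattern is there because . does not match line terminators in JavaScript (unless the s flag is set), so a plain /<.*?>/g misses tags that span lines. The sketch below compares the variants and shows one known limitation. The helper name stripTags and the sample strings are assumptions for illustration; it uses [^>], which also covers \r and the other line terminators that (.|\n) misses.

```javascript
// Hypothetical helper: remove anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const multiLine = '<a\n  href="x">link</a>';

// "." stops at "\n", so the opening tag survives here.
console.log(JSON.stringify(multiLine.replace(/<.*?>/g, '')));

// Both of these remove the multi-line tag.
console.log(JSON.stringify(multiLine.replace(/<(.|\n)*?>/g, '')));
console.log(JSON.stringify(stripTags(multiLine)));

// Limitation: a ">" inside an attribute value ends the match early,
// leaving part of the tag behind.
console.log(JSON.stringify(stripTags('<img alt="a > b">text')));
```

If attribute values may contain >, a parser-based approach is the safer choice.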

Remove HTML tags from a String

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist (the Safelist class in recent Jsoup versions, formerly Whitelist), which is very useful if you want to allow only e.g. <b>, <i> and <u>.

See also:

  • RegEx match open tags except XHTML self-contained tags
  • What are the pros and cons of the leading Java HTML parsers?
  • XSS prevention in JSP/Servlet web application

Stripping out html tags in string

Description

This expression will:

  • find and replace all tags with nothing
  • avoid problematic edge cases

Regex: <(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Replace with: nothing

Example

Sample Text

Note the difficult edge case in the mouse over function

these are <a onmouseover=' href="NotYourHref" ; if (6/a>3) { funRotator(href) } ; ' href=abc.aspx?filter=3&prefix=&num=11&suffix=>the droids</a> you are looking for.

Code

Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "replace with your source string"
        Dim replacementstring As String = ""
        Dim matchpattern As String = "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>"
        Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementstring, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline))
    End Sub
End Module

String after replacement

these are the droids you are looking for.
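
For comparison, the same expression can be tried in JavaScript against the sample text above. This is only a sketch: the variable names are made up, and the naive pattern /<[^>]*>/g is included purely to show what the attribute-aware alternation buys you.

```javascript
// Attribute-aware pattern from this answer: quoted and unquoted attribute
// values are consumed as a unit, so a ">" inside them does not end the tag.
const tagPattern = /<(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>/g;

const sample =
  "these are <a onmouseover=' href=\"NotYourHref\" ; if (6/a>3) { funRotator(href) } ; ' " +
  "href=abc.aspx?filter=3&prefix=&num=11&suffix=>the droids</a> you are looking for.";

const careful = sample.replace(tagPattern, '');
const naive = sample.replace(/<[^>]*>/g, '');

console.log(careful);
console.log(naive); // stops at the ">" in "6/a>3" and leaves attribute debris behind
```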

How can I strip HTML tags from a string in ASP.NET?

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.

Note:

  1. There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
  2. The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
  3. As with all things HTML and regex:

    Use a proper parser if you must get it right under all circumstances.
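
The steps described above (strip tags, normalize whitespace, trim, optionally decode entities) can be sketched in JavaScript as follows. The function name htmlToPlainText and the entity table are assumptions for illustration; the table is deliberately tiny and nowhere near the full set of HTML character references.

```javascript
// Small, incomplete entity table (an assumption for this sketch).
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', '#39': "'" };

// Hypothetical helper following the steps from the answer above.
function htmlToPlainText(html) {
  return html
    .replace(/<[^>]*(>|$)/g, '')   // drop tags, including an unterminated one at the end
    .replace(/\s+/g, ' ')          // collapse whitespace runs into a single space
    .trim()
    // Optional: decode a few entities in one pass, so "&amp;lt;" becomes "&lt;" and not "<".
    .replace(/&(amp|lt|gt|quot|#39);/g, (match, name) => ENTITIES[name]);
}

console.log(htmlToPlainText('<p>Hello,   <b>world</b>!</p>\n<br'));
console.log(htmlToPlainText('a &lt; b &amp;amp; c'));
```

Once entities are decoded, the result can contain < and > again, so it needs to be escaped before being written back into a page.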

Remove HTML tags and its contents from a string - Javascript

Rather than trying to remove the HTML element via Regex, it's much more straightforward to create and populate a DOM Fragment using:

let myDiv = document.createElement('div');
myDiv.innerHTML = test;

and then remove the <title> element from that, using:

const myDivTitle = myDiv.querySelector('title');
myDiv.removeChild(myDivTitle);

Working Example (One Element):

const test = "This is outside the HTML tag. <title>How to remove an HTML element using JavaScript ?</title>";

let myDiv = document.createElement('div');
myDiv.innerHTML = test;
const myDivTitle = myDiv.querySelector('title');
myDiv.removeChild(myDivTitle);
const testAfter = myDiv.innerHTML;
console.log(testAfter);

How can I remove html tags from a string?

You can remove HTML tags from a string by using NSAttributedString.

The code is below:

let htmlString = "<p style=\"text-align: right;\"> text and text"

do {
    let encodedData = htmlString.dataUsingEncoding(NSUTF8StringEncoding)!
    let attributedOptions : [String: AnyObject] = [
        NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
        NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
    ]
    let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)

    print("final strings :",attributedString.string)

} catch {
    fatalError("Unhandled error: \(error)")
}

Hope it works for you!!!

You can also create a String extension for reusability:

extension String {
    init(htmlString: String) {
        do {
            let encodedData = htmlString.dataUsingEncoding(NSUTF8StringEncoding)!
            let attributedOptions : [String: AnyObject] = [
                NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
            ]
            let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
            self.init(attributedString.string)
        } catch {
            fatalError("Unhandled error: \(error)")
        }
    }
}

Swift 3.0 - (Xcode 8.2) Update

extension String {

    var normalizedHtmlString : String {

        do {
            if let encodedData = self.data(using: .utf8){
                let attributedOptions : [String: AnyObject] = [
                    NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType as AnyObject,
                    NSCharacterEncodingDocumentAttribute: NSNumber(value: String.Encoding.utf8.rawValue)
                ]
                let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
                if let stringNormalized = String.init(attributedString.string){
                    return stringNormalized
                }
            }
        }
        catch {
            assert(false, "Please check string")
            //fatalError("Unhandled error: \(error)")
        }
        return self
    }
}

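To use the String(htmlString:) initializer from the first extension:

let yourHtmlString = "<p style=\"text-align: right;\"> text and text"
let decodedString = String(htmlString:yourHtmlString)

With the Swift 3.0 version, read the normalizedHtmlString property instead:

let normalizedString = yourHtmlString.normalizedHtmlString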

