How to Filter All HTML Tags Except a Certain Whitelist

How do I filter all HTML tags except a certain whitelist?

Here's a function I wrote for this task:

// Requires: using System.Text.RegularExpressions;
static string SanitizeHtml(string html)
{
    // Tag names to leave untouched, separated by | for use in the regex.
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Edit: For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part, </?, matches an opening angle bracket and 0 or 1 slashes (in case it's a closing tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). I am checking whether the next part of the string is one of the acceptable tags. You can see that I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by vertical bars so that any of the terms will match. If the lookahead matches, the "then" branch requires the literal word "notag", which cannot match where an acceptable tag name begins, so the acceptable tag is left alone. Otherwise I move on to the else part, where I match any tag name: [a-zA-Z0-9]+

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". I group the part representing an attribute, but I use ?: to prevent this group from being captured, for speed: (?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*

Here I begin with the whitespace character that would be between the tag and attribute names, then match an attribute name: [a-zA-Z0-9\-]+

Next I match an equals sign, and then either quote. I group the quote so it will be captured, and I can use a backreference later, \1, to match the same type of quote. Between these two quotes, I use the period to match anything; however, I use the lazy version *? instead of the greedy version * so that it only matches up to the next quote that would end this value.

Next we put a * after closing the groups with parentheses so that it will match multiple attribute/value combinations (or none). Last, we match optional whitespace with \s*, and 0 or 1 ending slashes in the tag for XML-style self-closing tags.

You can see I'm replacing the tags with sausage, because I'm hungry, but you could replace them with empty string too to just clear them out.
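
For readers who want to try the same idea outside .NET, here is a minimal Python sketch. Python's re module has no lookahead-based conditional, so this version swaps in a plain negative lookahead instead of the (?(?=...)then|else) construct. The function name sanitize_html and its parameters are hypothetical, and the added \b (so that a name such as "linkage" is not mistaken for "link") is an assumption that goes beyond the original pattern.

```python
import re


def sanitize_html(html, allowed=("script", "link", "title"), replacement=""):
    """Replace every tag whose name is not in `allowed` (hypothetical helper)."""
    names = "|".join(re.escape(name) for name in allowed)
    pattern = re.compile(
        r"</?"                                  # opening bracket, optional slash
        r"(?!(?:" + names + r")\b)"             # refuse to match an allowed tag name
        r"[a-zA-Z0-9]+"                         # any other tag name
        r'''(?:\s[a-zA-Z0-9\-]+=?(?:(["']?).*?\1?)?)*'''  # zero or more attributes
        r"\s*/?>"                               # optional whitespace, optional slash
    )
    return pattern.sub(replacement, html)


if __name__ == "__main__":
    print(sanitize_html('<b class="x">hi</b> <title>t</title>'))
```

Like the original, this is a regex heuristic and not an HTML parser, so it should not be relied on for untrusted input.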

BeautifulSoup Remove all html tags except for those in whitelist such as img and a tags with python

You could select all of the descendant nodes by accessing the .descendants property.

From there, you could iterate over all of the descendants and filter them based on the name property. If the node doesn't have a name property, then it is likely a text node, which you want to keep. If the name property is a or img, then you keep it as well.

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

for node in container.descendants:
    if not node.name or node.name == 'a' or node.name == 'img':
        keep.append(node)

Here is an alternative where all the filtered elements are used to create the list directly:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name == 'a' or node.name == 'img']

Also, if you don't want strings that are empty to be returned, you can trim the whitespace and check for that as well:

keep = [node for node in container.descendants
        if (not node.name and len(node.strip())) or
           (node.name == 'a' or node.name == 'img')]

Based on the HTML that you provided, the following would be returned:

> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
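
The snippets above collect the nodes to keep. If the goal is instead to get back a string with the unwanted tags removed, one possible sketch uses BeautifulSoup's unwrap(), which removes a tag but leaves its children in place. The function name strip_except and the choice of the html.parser backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def strip_except(html, allowed=("a", "img")):
    """Remove every tag not in `allowed`, keeping its text and children."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so it is safe to
    # modify the tree while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()
    return str(soup)


if __name__ == "__main__":
    print(strip_except('<div>Hello <b>all</b> <a href="xx">link</a></div>'))
```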

Strip all HTML tags except certain ones?

<((?!p\s).)*?> will match all tags except the paragraph tags. So your program could delete all matches of this regex and then replace the remaining tags (all the p tags) with empty paragraph tags. (<p .*?> is the regex for finding those p tags.)
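
The two-step idea (delete everything that is not a p tag, then reduce the p tags to bare ones) can be sketched in Python as follows. The patterns here are a variant chosen for illustration and are not the ones quoted above: they use \b so that closing </p> tags and attribute-free <p> tags also survive the first step. The names NON_P_TAG, P_OPEN and keep_only_paragraphs are hypothetical.

```python
import re

# Any opening or closing tag whose name is not exactly "p".
NON_P_TAG = re.compile(r"</?(?!p\b)[a-zA-Z][^>]*>")
# An opening p tag, with or without attributes.
P_OPEN = re.compile(r"<p\b[^>]*>")


def keep_only_paragraphs(html):
    without_others = NON_P_TAG.sub("", html)    # step 1: drop all non-p tags
    return P_OPEN.sub("<p>", without_others)    # step 2: strip p attributes


if __name__ == "__main__":
    print(keep_only_paragraphs('<div><p class="x">Hi <b>there</b></p></div>'))
```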

Strip all HTML tags, except anchor tags

I suggest you use Html Agility Pack.

Also check this question and its answers: HTML Agility Pack strip tags NOT IN whitelist

Removing all html tags except for html with specific class using regex

Here you go.

Regex

/(<span(?![^>]*class="mention")[^>]*>)([^<]*)<\/span>/g

Replace pattern

\2

Test String

@test@test <span class="mention">@test</span> @test2@test <span class="mention">@test</span> test@test.com Test @test.com <br> <a></a> <hr></hr> <span>dsfsfdsdfsdfs asdf </span> <span>test</span> <a>f</a>

Result

@test@test <span class="mention">@test</span> @test2@test <span class="mention">@test</span> test@test.com Test @test.com <br> <a></a> <hr></hr> dsfsfdsdfsdfs asdf  test <a>f</a>

This strips all the span tags that do not have the attribute class="mention", keeping the text inside them.
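
As a quick way to experiment with the span pattern, here is a small Python sketch that applies it with re.sub. The names SPAN_WITHOUT_MENTION and unwrap_plain_spans are hypothetical; the pattern is the one from the answer, minus the /.../g delimiters that Python does not use.

```python
import re

SPAN_WITHOUT_MENTION = re.compile(
    r'(<span(?![^>]*class="mention")[^>]*>)([^<]*)</span>'
)


def unwrap_plain_spans(html):
    # \2 is the text between the opening and closing span tags.
    return SPAN_WITHOUT_MENTION.sub(r"\2", html)


if __name__ == "__main__":
    print(unwrap_plain_spans('<span class="mention">@a</span> <span>b</span>'))
```

Because the inner text is matched with [^<]*, a span that contains other tags is left untouched.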


EDIT

As requested, here is how you can strip all the HTML tags except those that have the required mention class:

Regex

/(<(\w+)(?![^>]*class="mention")[^>]*>)([^<]*)<\/\2>|(?:<br>|<br\/>)/g

Replace pattern

\3

Result

@test@test <span class="mention">@test</span> @test2@test <span class="mention">@test</span> test@test.com Test @test.com    dsfsfdsdfsdfs asdf  test f
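
A rough Python equivalent of this second pattern is sketched below. The names ANY_TAG_WITHOUT_MENTION and strip_unmentioned are hypothetical. A replacement function is used instead of the \3 string so that matches of the <br> alternative, where group 3 does not take part, are replaced with an empty string.

```python
import re

ANY_TAG_WITHOUT_MENTION = re.compile(
    r'(<(\w+)(?![^>]*class="mention")[^>]*>)([^<]*)</\2>|(?:<br>|<br/>)'
)


def strip_unmentioned(html):
    # Keep only the inner text (group 3); a bare <br> leaves nothing behind.
    return ANY_TAG_WITHOUT_MENTION.sub(lambda m: m.group(3) or "", html)


if __name__ == "__main__":
    sample = '<span class="mention">@a</span> <b>bold</b><br> <a>f</a>'
    print(strip_unmentioned(sample))
```

As with the span-only version, nested tags are not handled, because the inner text may not contain a < character.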

How to remove all html tags except img?

I tried a lot of things; this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html.replaceAll('(?i)<(?!img|/img).*?>', '');
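
The same pattern also works with Python's re module, as in this small sketch (the function name strip_except_img is hypothetical). Note that . does not match a newline by default, so a tag that spans several lines would be left in place.

```python
import re

NOT_IMG_TAG = re.compile(r"(?i)<(?!img|/img).*?>")


def strip_except_img(html):
    # Delete every tag that does not start with img or /img.
    return NOT_IMG_TAG.sub("", html)


if __name__ == "__main__":
    print(strip_except_img('<p>Hi <img src="a.png"> <b>x</b></p>'))
```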

Removing Html tags except few specific ones from String in java

I tried JSoup and it seems to be able to handle all such cases. Here is some example code.

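// Jsoup and Whitelist come from the jsoup library; StringEscapeUtils is
// assumed here to be the Apache Commons Lang class.
public String clean(String unsafe) {
    Whitelist whitelist = Whitelist.none();
    whitelist.addTags(new String[]{"p","br","ul"});

    String safe = Jsoup.clean(unsafe, whitelist);
    return StringEscapeUtils.unescapeXml(safe);
}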

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
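
For comparison, a parser-based whitelist can be sketched in Python using only the standard library. This is a rough analog of the idea, not Jsoup's algorithm, and it will not reproduce Jsoup's output for malformed input such as the example above. The names WhitelistCleaner and clean are hypothetical, and dropping all attributes from allowed tags is an assumption made to mirror the output shown.

```python
from html import escape
from html.parser import HTMLParser


class WhitelistCleaner(HTMLParser):
    """Re-emit allowed tags (without attributes) and all text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append(f"<{tag}>")       # attributes are dropped

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)        # treat <br/> like <br>

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text was unescaped by the parser, so escape it again on output.
        self.parts.append(escape(data, quote=False))


def clean(unsafe, allowed=("p", "br", "ul")):
    cleaner = WhitelistCleaner(allowed)
    cleaner.feed(unsafe)
    cleaner.close()
    return "".join(cleaner.parts)


if __name__ == "__main__":
    print(clean("<p class='p1'>paragraph</p> <a link='#'>Link</a>"))
```

This sketch is meant to show the structure of a whitelist filter; it has not been hardened as a security sanitizer.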

