How to Use C# to Sanitize Input on an HTML Page

How do I use C# to sanitize input on an HTML page?

We are using the HtmlSanitizer .NET library, which:

  • Is open source (MIT license) and hosted on GitHub
  • Is fully customizable, e.g. you can configure which elements should be removed (see the project wiki)
  • Is actively maintained
  • Doesn't have the problems of the Microsoft Anti-XSS library
  • Is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
  • Is purpose-built for this (in contrast to HTML Agility Pack, which is a parser, not a sanitizer)
  • Doesn't use regular expressions (HTML isn't a regular language!)

Also on NuGet
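
A minimal usage sketch for the library described above. It assumes the HtmlSanitizer NuGet package is installed and that its types live in the Ganss.Xss namespace (older releases used Ganss.XSS, so adjust the using line if needed). The SanitizerSketch class and its Clean method are hypothetical names chosen for illustration.

```csharp
using System;
using Ganss.Xss; // assumption: recent package versions; older ones use Ganss.XSS

// Hypothetical helper, not part of the library.
public static class SanitizerSketch
{
    public static string Clean(string untrustedHtml)
    {
        // The default constructor starts from the library's built-in allow lists.
        var sanitizer = new HtmlSanitizer();

        // Example customization: stop allowing <img> elements.
        sanitizer.AllowedTags.Remove("img");

        // Elements and attributes that are not allowed are dropped from the output.
        return sanitizer.Sanitize(untrustedHtml);
    }
}
```

Calling SanitizerSketch.Clean on markup that contains a script element should return the markup without that element, while ordinary tags such as p are kept.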

How can I sanitize input on a WebAPI model?

First create an ActionFilterAttribute:

using System;
using System.Linq;
using System.Reflection;
using System.Web.Http.Controllers;
using System.Web.Http.Filters;
using Microsoft.Security.Application;

/// <summary>
/// Sanitizes (HTML encodes) all strings on the model.
/// </summary>
/// <remarks>Use sparingly. Ideally don't use this and instead encode when outputting the values.
/// This is used because we don't have control of other applications that may consume the data.</remarks>
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class, AllowMultiple = false, Inherited = true)]
public class SanitizeInputAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(HttpActionContext actionContext)
    {
        // Only handle actions that take a single (model) argument.
        if (actionContext.ActionArguments != null && actionContext.ActionArguments.Count == 1)
        {
            var requestParam = actionContext.ActionArguments.First();
            if (requestParam.Value == null)
            {
                return; // nothing was bound, so there is nothing to encode
            }

            // Public, readable and writable string properties on the model.
            var properties = requestParam.Value.GetType().GetProperties(BindingFlags.Instance | BindingFlags.Public)
                .Where(x => x.CanRead && x.CanWrite && x.PropertyType == typeof(string) && x.GetGetMethod(true).IsPublic && x.GetSetMethod(true).IsPublic);
            foreach (var propertyInfo in properties)
            {
                // Replace each value with its HTML-encoded form.
                propertyInfo.SetValue(requestParam.Value, Encoder.HtmlEncode(propertyInfo.GetValue(requestParam.Value) as string));
            }
        }
    }
}

Then simply register it on the class or action like this:

[HttpPost]
[SanitizeInput]
public Response Post(Object model)
{...}
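
To see what the filter does to a model without hosting Web API, the reflection loop can be tried on its own. This is only a sketch under stated assumptions: it swaps the AntiXSS Encoder for the framework's System.Net.WebUtility.HtmlEncode, and CommentModel and ModelEncoder are hypothetical names that do not appear in the answer above.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Reflection;

// Hypothetical model used only for this illustration.
public class CommentModel
{
    public string Author { get; set; }
    public string Body { get; set; }
    public int Rating { get; set; }
}

// Hypothetical helper that mirrors the loop inside the filter.
public static class ModelEncoder
{
    public static void EncodeStringProperties(object model)
    {
        if (model == null)
        {
            return;
        }

        // Public instance string properties with a public getter and setter.
        var properties = model.GetType()
            .GetProperties(BindingFlags.Instance | BindingFlags.Public)
            .Where(p => p.PropertyType == typeof(string)
                        && p.GetIndexParameters().Length == 0
                        && p.GetGetMethod() != null
                        && p.GetSetMethod() != null);

        foreach (var property in properties)
        {
            var value = (string)property.GetValue(model);
            if (value != null)
            {
                // Assumption: WebUtility.HtmlEncode stands in for Encoder.HtmlEncode.
                property.SetValue(model, WebUtility.HtmlEncode(value));
            }
        }
    }
}
```

After calling ModelEncoder.EncodeStringProperties on a CommentModel whose Body is "<b>hi</b> & bye", Body holds "&lt;b&gt;hi&lt;/b&gt; &amp; bye" and the non-string Rating property is left untouched.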

How to sanitize input from MCE in ASP.NET?

I don't think there is a built-in sanitizer for C# that you can use, but here is what I did when I had the same issue. I used the HtmlAgilityPackSanitizerProvider that comes with AjaxControlToolkit. The code looks like this:

private static AjaxControlToolkit.Sanitizer.HtmlAgilityPackSanitizerProvider sanitizer = new AjaxControlToolkit.Sanitizer.HtmlAgilityPackSanitizerProvider();

private static Dictionary<string, string[]> elementWhitelist = new Dictionary<string, string[]>
{
    { "b", new string[] { "style" } },
    { "strong", new string[] { "style" } },
    { "i", new string[] { "style" } },
    { "em", new string[] { "style" } },
    { "u", new string[] { "style" } },
    { "strike", new string[] { "style" } },
    { "sub", new string[] { "style" } },
    { "sup", new string[] { "style" } },
    { "p", new string[] { "align" } },
    { "div", new string[] { "style", "align" } },
    { "ol", new string[] { } },
    { "li", new string[] { } },
    { "ul", new string[] { } },
    { "a", new string[] { "href" } },
    { "font", new string[] { "style", "face", "size", "color" } },
    { "span", new string[] { "style" } },
    { "blockquote", new string[] { "style", "dir" } },
    { "hr", new string[] { "size", "width", "id" } },
    { "img", new string[] { "src" } },
    { "h1", new string[] { "style" } },
    { "h2", new string[] { "style" } },
    { "h3", new string[] { "style" } },
    { "h4", new string[] { "style" } },
    { "h5", new string[] { "style" } },
    { "h6", new string[] { "style" } }
};

private static Dictionary<string, string[]> attributeWhitelist = new Dictionary<string, string[]>
{
    { "style", new string[] { } },
    { "align", new string[] { } },
    { "href", new string[] { } },
    { "face", new string[] { } },
    { "size", new string[] { } },
    { "color", new string[] { } },
    { "dir", new string[] { } },
    { "width", new string[] { } },
    { "id", new string[] { } },
    { "src", new string[] { } }
};

public string SanitizeHtmlInput(string unsafeStr)
{
    return sanitizer.GetSafeHtmlFragment(unsafeStr, elementWhitelist, attributeWhitelist);
}

Hope this helps.

Does @Html.Textarea sanitize input?

The Razor part does not implement any of this checking, since it has no way of knowing what you consider to be valid input. However, the database layer you are using in MVC (for example, an ORM that issues parameterized queries) almost certainly deals with SQL injection attacks.

Sanitize registration input?

Assuming that you're (manually?) de-serializing the registration input, you need to encode it as XML before further processing so that characters with special meaning in XML are escaped properly.

Note that there are only 5 of them so it's perfectly reasonable to do this with a manual replace:

  • < = &lt;
  • > = &gt;
  • & = &amp;
  • " = &quot;
  • ' = &apos;

You could use the built-in .NET function HttpUtility.HtmlEncode(input) to do this for you.
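
The manual replacement of the five reserved XML characters mentioned above could look roughly like the sketch below. XmlEscaper is a hypothetical name. The input is walked once, character by character, so an ampersand produced by one replacement is never escaped a second time.

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating the manual approach.
public static class XmlEscaper
{
    public static string Escape(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '&': sb.Append("&amp;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default: sb.Append(c); break; // everything else passes through
            }
        }

        return sb.ToString();
    }
}
```

For example, XmlEscaper.Escape("Tom & 'Jerry'") returns "Tom &amp; &apos;Jerry&apos;".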

UPDATE:

I just realized I didn't really answer your question. You seem to be looking for a way to transform Unicode characters into ASCII-safe HTML entities.

I'm not aware of any built-in functions in .NET that do this, so I wrote a little utility method which should illustrate the concept:

using System.Linq;
using System.Text;
using System.Web;

public static class StringUtilities
{
    public static string HtmlEncode(string input, Encoding source, Encoding destination)
    {
        // Escape the reserved HTML characters first.
        var sourceChars = HttpUtility.HtmlEncode(input).ToArray();
        var sb = new StringBuilder();

        foreach (var sourceChar in sourceChars)
        {
            // Round-trip the character through the destination encoding.
            byte[] sourceBytes = source.GetBytes(new[] { sourceChar });
            char destChar = destination.GetChars(sourceBytes).FirstOrDefault();

            // If it did not survive, emit a numeric character reference instead.
            if (destChar != sourceChar)
                sb.AppendFormat("&#{0};", (int)sourceChar);
            else
                sb.Append(sourceChar);
        }

        return sb.ToString();
    }
}

Then, given an input string which has both reserved XML characters and Unicode characters in it, you could use it like this:

string unicode = "<tag>some proӸematic text<tag>";

string escapedASCII = StringUtilities.HtmlEncode(
unicode, Encoding.Unicode, Encoding.ASCII);

// Result: &lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;

If you need to do this in several places, you could clean it up a bit by adding an extension method for your specific scenario:

public static class StringExtensions
{
    public static string ToEncodedASCII(this string input, Encoding sourceEncoding)
    {
        return StringUtilities.HtmlEncode(input, sourceEncoding, Encoding.ASCII);
    }

    public static string ToEncodedASCII(this string input)
    {
        return StringUtilities.HtmlEncode(input, Encoding.Unicode, Encoding.ASCII);
    }
}

You could then do:

string unicode = "<tag>some proӸematic text<tag>";

// Default to Unicode as input
string escapedASCII1 = unicode.ToEncodedASCII();

// Pass in a different encoding for your input
string escapedASCII2 = unicode.ToEncodedASCII(Encoding.BigEndianUnicode);
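
For comparison, here is an alternative sketch of the same idea that does not round-trip through Encoding objects. It is not the utility method above: it compares each character against the ASCII range directly, and it assumes System.Net.WebUtility.HtmlEncode for the reserved characters. AsciiHtml and ToAsciiHtml are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: HTML-encode, then turn every non-ASCII character
// into a numeric character reference.
public static class AsciiHtml
{
    public static string ToAsciiHtml(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Escape the reserved HTML characters first.
        string encoded = WebUtility.HtmlEncode(input);
        var sb = new StringBuilder(encoded.Length);

        for (int i = 0; i < encoded.Length; i++)
        {
            char c = encoded[i];
            if (c <= 127)
            {
                sb.Append(c); // plain ASCII passes through unchanged
            }
            else if (char.IsHighSurrogate(c) && i + 1 < encoded.Length
                     && char.IsLowSurrogate(encoded[i + 1]))
            {
                // Combine a surrogate pair into a single code point.
                sb.Append("&#").Append(char.ConvertToUtf32(c, encoded[i + 1])).Append(';');
                i++;
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }

        return sb.ToString();
    }
}
```

With this sketch, AsciiHtml.ToAsciiHtml("<tag>some pro\u04F8ematic text<tag>") returns "&lt;tag&gt;some pro&#1272;ematic text&lt;tag&gt;".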

UPDATE #2

Since you also asked for advice on adhering to standards, the most I can tell you is that you need to take into consideration where the input text will actually end up.

If the input for a certain user will only ever be displayed to that user (for instance when they manage their profile / account settings in your app), and your database supports Unicode, you could just leave everything as-is.

On the other hand, if the information can be displayed to other users (for instance when users can view each other's public profile information), then you need to take into consideration that not all users will be visiting your website on a device/browser that supports Unicode. In that case, UTF-8 is likely to be your best bet.

This is also why you can't really find that much useful information on it. If the world was able to agree on a standard then we would not have to deal with all these encoding variations in the first place. Think about your target group and what they need.

A useful blog post on the subject of encoding: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


