Characters in String Changed After Downloading HTML from the Internet

Here's a download wrapper class that supports gzip and deflate compression and checks both the Content-Type header and the page's meta tags so that the content is decoded correctly.

Instantiate the class, and call GetPage().

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class HttpDownloader
{
    private readonly string _referer;
    private readonly string _userAgent;

    public Encoding Encoding { get; set; }
    public WebHeaderCollection Headers { get; set; }
    public Uri Url { get; set; }

    public HttpDownloader(string url, string referer, string userAgent)
    {
        // Fallback used when neither the header nor a meta tag names a charset.
        Encoding = Encoding.GetEncoding("ISO-8859-1");
        Url = new Uri(url); // verify the uri
        _userAgent = userAgent;
        _referer = referer;
    }

    public string GetPage()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
        if (!string.IsNullOrEmpty(_referer))
            request.Referer = _referer;
        if (!string.IsNullOrEmpty(_userAgent))
            request.UserAgent = _userAgent;

        // Ask the server for a compressed response.
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Headers = response.Headers;
            Url = response.ResponseUri; // may differ from the requested URL after redirects
            return ProcessContent(response);
        }
    }

    private string ProcessContent(HttpWebResponse response)
    {
        SetEncodingFromHeader(response);

        // Wrap the response stream in a decompressor if the body is compressed.
        Stream s = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            s = new GZipStream(s, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            s = new DeflateStream(s, CompressionMode.Decompress);

        // Buffer the raw bytes so they can be decoded a second time if needed.
        MemoryStream memStream = new MemoryStream();
        int bytesRead;
        byte[] buffer = new byte[0x1000];
        for (bytesRead = s.Read(buffer, 0, buffer.Length); bytesRead > 0; bytesRead = s.Read(buffer, 0, buffer.Length))
        {
            memStream.Write(buffer, 0, bytesRead);
        }
        s.Close();

        string html;
        memStream.Position = 0;
        using (StreamReader r = new StreamReader(memStream, Encoding))
        {
            html = r.ReadToEnd().Trim();
            html = CheckMetaCharSetAndReEncode(memStream, html);
        }

        return html;
    }

    private void SetEncodingFromHeader(HttpWebResponse response)
    {
        string charset = null;
        if (string.IsNullOrEmpty(response.CharacterSet))
        {
            // Fall back to parsing the raw Content-Type header.
            Match m = Regex.Match(response.ContentType, @";\s*charset\s*=\s*(?<charset>.*)", RegexOptions.IgnoreCase);
            if (m.Success)
            {
                charset = m.Groups["charset"].Value.Trim(new[] { '\'', '"' });
            }
        }
        else
        {
            charset = response.CharacterSet;
        }

        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                Encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the current encoding.
            }
        }
    }

    private string CheckMetaCharSetAndReEncode(Stream memStream, string html)
    {
        Match m = new Regex(@"<meta\s+.*?charset\s*=\s*""?(?<charset>[A-Za-z0-9_-]+)""?", RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html);
        if (m.Success)
        {
            string charset = m.Groups["charset"].Value.ToLower();
            // The meta tag was readable with a single-byte decoder, so the page
            // cannot really be UTF-16; treat such declarations as UTF-8.
            if ((charset == "unicode") || (charset == "utf-16"))
            {
                charset = "utf-8";
            }

            try
            {
                Encoding metaEncoding = Encoding.GetEncoding(charset);
                if (Encoding != metaEncoding)
                {
                    // Decode the buffered bytes again with the charset from the meta tag.
                    memStream.Position = 0L;
                    StreamReader recodeReader = new StreamReader(memStream, metaEncoding);
                    html = recodeReader.ReadToEnd().Trim();
                    recodeReader.Close();
                }
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the text decoded with the header encoding.
            }
        }

        return html;
    }
}
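
The detection order the class uses (default to ISO-8859-1, then the Content-Type header, then the meta tag) can be sketched in Java as well. This is only an illustration under stated assumptions: CharsetSniffer and its methods are hypothetical names, the regular expressions are simplified, and the special handling of "unicode"/"utf-16" is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the C# class.
final class CharsetSniffer {
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            ";\\s*charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named by the pattern's first group, or fallback if absent or unsupported. */
    private static Charset find(Pattern pattern, String text, Charset fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return fallback; // illegal or unsupported charset name
        }
    }

    /** Header charset overrides the ISO-8859-1 default; a meta charset overrides the header. */
    static Charset detect(String contentType, String html) {
        Charset fromHeader = find(HEADER_CHARSET, contentType, StandardCharsets.ISO_8859_1);
        return find(META_CHARSET, html, fromHeader);
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; charset=ISO-8859-7", "<html></html>"));
        System.out.println(detect("text/html", "<meta charset=\"utf-8\">"));
        System.out.println(detect(null, ""));
    }
}
```

Letting the meta tag win mirrors what the class does when the two sources disagree; whether that is the right priority for a given site is a judgment call.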

HTML encoding issues - Â character showing up instead of &nbsp;

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are being encoded as ISO-8859-1 so that they show up incorrectly as an "Â" character.

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
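
The byte arithmetic above can be checked with a few lines of Java. This is a minimal sketch; NbspMojibake and misread are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces the stray character in front of a non-breaking space.
final class NbspMojibake {
    /** Encodes text as UTF-8, then (incorrectly) decodes those bytes as ISO-8859-1. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // U+00A0 NO-BREAK SPACE

        // Print the UTF-8 bytes of the non-breaking space.
        for (byte b : nbsp.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();

        // Print the code points seen by a reader that assumes ISO-8859-1:
        // U+00C2 (A with circumflex) followed by U+00A0 (the trailing nbsp).
        for (char c : misread(nbsp).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```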

What's the regexp, and how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.

Special character not displaying as expected

1 - Replace your

<meta charset="utf-8">

with

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

2 - Check that your HTML editor saves the file as UTF-8. This option is usually in the menu bar at the top of the program, as in Notepad++ (the Encoding menu).

3 - If you're importing a font, check that your browser supports it. Or try adding a CSS rule that sets the font to a widely available default, such as

body
{
    font-family: "Times New Roman", Times, serif;
}

Hope it helps :)

JAVA: Greek characters from downloaded HTML file aren't displayed, how can I fix this?

String encoding = "UTF-8"; // Or "ISO-8859-7"
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), encoding));

ISO-8859-7 is the 8-bit encoding used for Greek; UTF-8 is the multibyte Unicode encoding.

StringBuilder sb = new StringBuilder();
String temp;
while ((temp = br.readLine()) != null) {
    sb.append(temp).append("\n");
    System.out.println(temp);
}
String html = sb.toString();

readLine removes the line ending (\r old MacOS, \n Unix or \r\n Windows).
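
To see the effect of the encoding argument without a network connection, the same read loop can be run against an in-memory stream. This is a sketch under assumptions: GreekReader and readAll are hypothetical names, and the Greek sample word is written with Unicode escapes so that the source file itself stays ASCII.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: same read loop as in the answer, with the charset passed in.
final class GreekReader {
    /** Reads the whole stream as text in the given charset, ending every line with \n. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset greekCharset = Charset.forName("ISO-8859-7");
        // Greek "Kalimera" written with Unicode escapes.
        String greek = "\u039A\u03B1\u03BB\u03B7\u03BC\u03AD\u03C1\u03B1";
        byte[] bytes = greek.getBytes(greekCharset);

        String right = readAll(new ByteArrayInputStream(bytes), greekCharset);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        System.out.println("decoded with ISO-8859-7 matches: " + right.trim().equals(greek));
        System.out.println("decoded with ISO-8859-1 matches: " + wrong.trim().equals(greek));
    }
}
```

With a real page, the stream would come from url.openStream() as in the snippet above, and the charset should be whichever one the server or the page declares.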

What is this character ( Â ) and how do I remove it with PHP?

"Latin 1" is your problem here. There are approx 65256 UTF-8 characters available to a web page which you cannot store in a Latin-1 code page.

For your immediate problem you should be able to

$clean = str_replace(chr(194)," ",$dirty);

However I would switch your database to use utf-8 ASAP as the problem will almost certainly reoccur.

’ showing on page instead of '

Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.

Or use an HTML character reference for the apostrophe instead.
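
The three-character sequence in the title is what the UTF-8 bytes of a curly apostrophe (U+2019) look like when they are read as Windows-1252. A minimal Java sketch of that, using the hypothetical names QuoteMojibake, garble and repair:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows where the garbled apostrophe comes from.
final class QuoteMojibake {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Simulates the bug: UTF-8 bytes decoded as Windows-1252. */
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    /** Undoes garble(), provided every garbled character maps back to a Windows-1252 byte. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "It\u2019s"; // U+2019 RIGHT SINGLE QUOTATION MARK
        String garbled = garble(original);

        // U+2019 is E2 80 99 in UTF-8; Windows-1252 maps those bytes to
        // U+00E2, U+20AC and U+2122.
        System.out.println(garbled.equals("It\u00E2\u20AC\u2122s"));
        System.out.println(repair(garbled).equals(original));
    }
}
```

The repair step is a stopgap at best: a few byte values have no Windows-1252 character, so some garbled text cannot be round-tripped. Declaring and using UTF-8 consistently, as suggested above, avoids the problem at the source.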

Question mark characters display within text. Why is this?

The following sections of the MySQL reference manual will be useful:

10.3 Specifying Character Sets and Collations

10.4 Connection Character Sets and Collations

After you connect to the database, issue the following command:

SET NAMES 'utf8';

Ensure that your web page also uses the UTF-8 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

PHP also offers several functions that will be useful for conversions:

  • iconv
  • mb_convert_encoding
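
One common way literal question marks end up in text is a conversion into a character set that lacks the character, in which case the encoder substitutes "?". The Java sketch below illustrates that, along with a lossless conversion similar in spirit to what iconv or mb_convert_encoding do in PHP. Transcode and convert are hypothetical names, and this is not a description of how MySQL or PHP work internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: converts bytes between charsets.
final class Transcode {
    /** Decodes the input with one charset and re-encodes it with another. */
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) exists in both charsets: one byte in Latin-1, two in UTF-8.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes");

        // U+20AC (euro sign) has no Latin-1 byte, so the encoder substitutes '?'.
        byte[] euroUtf8 = "\u20AC".getBytes(StandardCharsets.UTF_8);
        byte[] lossy = convert(euroUtf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println((char) lossy[0]);
    }
}
```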

Writing HTML file with special characters

You should use the System.Web.HttpUtility class, specifically the HtmlEncode and HtmlDecode methods, to work with HTML strings.

The HtmlEncode method converts every special character in your string into the equivalent HTML entity; HtmlDecode does the exact opposite.

See the MSDN reference for more details.
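
To illustrate the idea only, here is a hand-rolled Java sketch that escapes the five characters that are special in HTML markup. HtmlEscape, encode and decode are hypothetical names; this is not the HttpUtility implementation, which handles more cases.

```java
// Hypothetical demo class: minimal HTML escaping, for illustration only.
final class HtmlEscape {
    /** Replaces the five characters that are special in HTML text and attribute values. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses encode(); "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;", not "<". */
    static String decode(String html) {
        return html.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "<a href=\"x\">Tom & Jerry's</a>";
        String escaped = encode(raw);
        System.out.println(escaped);
        System.out.println(decode(escaped).equals(raw));
    }
}
```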

C# and HtmlAgilityPack encoding problem

Actually the page is encoded with UTF-8.

GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.UTF8);

will work.

Or you could use the code in my SO answer, which detects the encoding from the HTTP headers or meta tags and re-encodes properly. (It also supports gzip to minimize your download.)

With the download class your code would look like:

HttpDownloader downloader = new HttpDownloader("http://www.alfa.lt",null,null);
GodLikeHTML.LoadHtml(downloader.GetPage());

