Kanji Characters from Webclient HTML Different from Actual Kanji in Website

The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.

Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?

The MSDN Docs state this:

This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String.

Thus, you have to know beforehand what encoding the page uses and specify it by setting the WebClient.Encoding property.
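
For this page, that means telling WebClient which charset to use before it converts the bytes. A minimal sketch, reusing the URL from the question:

using System.Net;
using System.Text;

using (var client = new WebClient())
{
    // EUC-JP is CodePage 51932; DownloadString now decodes with it
    // instead of the system default code page.
    client.Encoding = Encoding.GetEncoding("euc-jp");
    string html = client.DownloadString("http://www.kanji-a-day.com/level4/index.php");
}

This works when you already know the charset up front; the rest of this answer is about discovering it at run time.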

To verify this, check the .NET Reference Source for the WebClient.DownloadString method:

try {
    WebRequest request;
    byte[] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if (Logging.On) Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}

The encoding is determined from the Request settings, not from the Response: when the request specifies no charset, GetStringUsingEncoding falls back to the WebClient.Encoding property.

The result is that the downloaded string is decoded using the default code page instead of the charset the server actually declared.

What you can do now is:

  • Download the page twice, using the first download only to detect whether the WebClient encoding and the HTML page encoding match.
  • Re-encode the string with the correct encoding, taken from the underlying WebResponse.
  • Don't use WebClient: use HttpClient or WebRequest directly (see the HttpClient sketch after this list). Or, if you like this tool, create a custom WebClient class to handle the WebRequest/WebResponse in a more direct way.
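
If you go the HttpClient route, HttpContent already honors the charset declared in the response's Content-Type header when converting to a string (UTF-8 when none is declared), so no manual re-encoding is needed. A minimal sketch:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static async Task<string> DownloadPageAsync(Uri uri)
{
    using (var client = new HttpClient())
    {
        // GetStringAsync decodes the body using the response's declared
        // charset, so an EUC-JP page comes back as correct Japanese text.
        return await client.GetStringAsync(uri);
    }
}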

Here is a method that performs the re-encoding:

The string returned by WebClient is converted back to a byte array and written to a MemoryStream, then re-decoded by a StreamReader using the Encoding retrieved from the charset of the Content-Type response header.

EDIT:

Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.

using System;
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2: force TLS 1.2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        // First download: the string is decoded with client.Encoding,
        // which may not match the page's actual charset.
        string result = client.DownloadString(uri);

        // Reflection: read the private m_WebResponse field to reach the
        // HttpWebResponse of the request that just completed.
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            // The charset the server actually declared in Content-Type
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);

            // Convert the string back to its original bytes, then decode
            // those bytes with the correct page encoding.
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}

Now your code should get the Japanese characters in their correct form.

Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);

kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();

Text_DailyKanji.Text = kanji;

Can't get the web content with UTF-8

According to the third line of the page source, it's encoded in Shift_JIS:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="ja" id="masterChannel-enterprise"><head>
<meta http-equiv="content-type" content="text/html;charset=shift_jis">
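
So the same fix applies here: decode the content as Shift_JIS instead of UTF-8. A minimal sketch (url is a placeholder for the page's address):

using System.Net;
using System.Text;

using (var client = new WebClient())
{
    // The page declares charset=shift_jis, so decode with that code page
    // rather than assuming UTF-8.
    client.Encoding = Encoding.GetEncoding("shift_jis");
    string html = client.DownloadString(url);
}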

How to Identify a Website's Content Language (English, Japanese, Chinese, etc.)

Well, some webpages contain a "lang" or "xml:lang" attribute in the html element. For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
</head>
<body>

</body>
</html>

In this example the attributes "lang" and "xml:lang" are set to "en" (i.e. English). Additionally, some servers may set a "Content-Language" header, and you could check the value of that. (Although, to be honest, I haven't actually seen a server that sets this value.)
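
A rough way to read that attribute from the downloaded HTML in C# (a regex sketch; a real HTML parser such as HtmlAgilityPack would be more robust):

using System.Text.RegularExpressions;

// Matches lang="…" on the <html> element; the \b before "lang" also lets
// it match xml:lang="…".
var match = Regex.Match(htmlText, @"<html[^>]*\blang=""([^""]+)""", RegexOptions.IgnoreCase);
string lang = match.Success ? match.Groups[1].Value : null;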

However, the value of these attributes or headers could be anything, and some servers and webpages won't state a language at all. You'll probably want to search for the common language codes defined by ISO 639 and ISO 3166.

As for the implementation of this in C#, I'll admit it: I don't have much of a clue. But I think the WebResponse class has a property called Headers which you may want to look at.
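
It does; a minimal sketch of checking the Content-Language header through that property:

using System.Net;

var request = WebRequest.Create("http://www.kanji-a-day.com/level4/index.php");
using (var response = request.GetResponse())
{
    // Null when the server doesn't send a Content-Language header,
    // which, as noted above, is the common case.
    string contentLanguage = response.Headers["Content-Language"];
}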

Oh, and for languages like Hindi, I'm pretty sure they contain characters unique to that language's script, in which case you could search your htmlText string for any of those particular characters.

There's also the simple method of checking your htmlText string for words common to a particular language. For example, if you wanted to know whether the page was French, you could search for the word "bonjour", etc.
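
As a rough illustration of both ideas, here is a sketch that checks for Devanagari characters (the script Hindi is written in) and for one common French word; the "any single character" test and the one-word list are placeholders, not real language detection:

using System.Linq;

// Devanagari occupies the Unicode range U+0900..U+097F; any hit suggests
// Hindi (or a related language such as Marathi or Nepali).
bool looksLikeHindi = htmlText.Any(c => c >= '\u0900' && c <= '\u097F');

// Crude keyword check; a proper word list or a language-detection library
// would be far more reliable.
bool maybeFrench = htmlText.ToLowerInvariant().Contains("bonjour");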


