Odbcconnection Returning Chinese Characters as ""

OdbcConnection returning Chinese Characters as ?

Problems with character set are quite common, let me try to give some general notes.

In principle you have to consider four different character set settings.

1 and 2: NLS_CHARACTERSET and NLS_NCHAR_CHARACTERSET

Example: AL32UTF8

They are defined only on your database, you can interrogate them with

    SELECT * 
FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

These settings define which characters (in which format) can be stored in your database - no more, no less. It requires some effort (see Character Set Migration and/or Oracle Database Migration Assistant for Unicode) if you have to change it on existing database.

3: NLS_LANG

Example: AMERICAN_AMERICA.AL32UTF8

This value is defined only on your client. NLS_LANG has nothing to do with the ability to store characters in a database. It is used to let Oracle know what character set you are using on the client side. When you set NLS_LANG value (for example to AL32UTF8) then you just tell the Oracle database "my client uses character set AL32UTF8" - it does not necessarily mean that your client is really using AL32UTF8! (see below #4)

NLS_LANG can be defined by environment variable NLS_LANG or by Windows Registry at HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for 32 bit), resp. HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for 64 bit). Depending on your application there might be other ways to specify NLS_LANG, but let's stick to the basics. If NLS_LANG value is not provided then Oracle defaults it to AMERICAN_AMERICA.US7ASCII

Format of NLS_LANG is NLS_LANG=language_territory.charset. The {charset} part of NLS_LANG is not shown in any system table or view. All components of the NLS_LANG definition are optional, so following definitions are all valid: NLS_LANG=.WE8ISO8859P1, NLS_LANG=_GERMANY, NLS_LANG=AMERICAN, NLS_LANG=ITALIAN_.WE8MSWIN1252, NLS_LANG=_BELGIUM.US7ASCII.

As stated above the {charset} part of NLS_LANG is not available in database at any system table/view or any function. Strictly speaking this is true, however you can run this query:

SELECT DISTINCT CLIENT_CHARSET
FROM V$SESSION_CONNECT_INFO
WHERE (SID, SERIAL#) = (SELECT SID, SERIAL# FROM v$SESSION WHERE AUDSID = USERENV('SESSIONID'));

It should return character set from your current NLS_LANG setting - however based on my experience the value is often NULL or Unknown, i.e. not reliable.

Find more very useful information here: NLS_LANG FAQ

Note, some technologies do not utilize NLS_LANG, settings there do not have any effect, for example:

  • ODP.NET Managed Driver is not NLS_LANG sensitive. It is only .NET locale sensitive. (see Data Provider for .NET Developer's Guide)

  • OraOLEDB (from Oracle) always use UTF-16 (see OraOLEDB Provider Specific Features)

  • Java based JDBC (for example SQL Developer) has its own methods to deal with character sets (see Database JDBC Developer's Guide - Globalization Support for further details)

4: The "real" character set of your terminal, your application or the encoding of .sql files

Example: UTF-8

If you work with a terminal program (i.e. SQL*plus or isql) you can interrogate the code page with command chcp, on Unix/Linux the equivalent is locale charmap or echo $LANG. You can get a list of all Windows code pages identifiers from here: Code Page Identifiers. Note, for UTF-8 (chcp 65001) there are some issues, see this discussion.

If you work with .sql files and an editor like TOAD or SQL-Developer you have to check the save options. Usually you can choose values like UTF-8, ANSI, ISO-8859-1, etc.
ANSI means the Windows ANSI codepage, typically CP1252, you can check in your Registry at HKLM\SYSTEM\ControlSet001\Control\Nls\CodePage\ACP or here: National Language Support (NLS) API Reference

[Microsoft removed this reference, take it form web-archive [National Language Support (NLS) API Reference]
7]

How to set all these values?

The most important point is to match NLS_LANG and your "real" character set of your terminal, resp. application or the encoding of your .sql files

Some common pairs are:

  • CP850 -> WE8PC850

  • CP1252 or ANSI (in case of "Western" PC) -> WE8MSWIN1252

  • ISO-8859-1 -> WE8ISO8859P1

  • ISO-8859-15 -> WE8ISO8859P15

  • UTF-8 -> AL32UTF8

Or run this query to get some more:

SELECT VALUE AS ORACLE_CHARSET, UTL_I18N.MAP_CHARSET(VALUE) AS IANA_NAME
FROM V$NLS_VALID_VALUES
WHERE PARAMETER = 'CHARACTERSET';

Some technologies make you life easier, e.g. ODP.NET (unmanged driver) or ODBC driver from Oracle automatically inherits the character set from NLS_LANG value, so condition from above is always true.

Is it required to set client NLS_LANG value equal to database NLS_CHARACTERSET value?

No, not necessarily! For example, if you have the database character set NLS_CHARACTERSET=AL32UTF8 and the client character set NLS_LANG=.ZHS32GB18030 then it will work without any problem (provided your client really uses GB18030), although these character sets are completely different. GB18030 is a character set commonly used for Chinese, like UTF-8 it supports all Unicode characters.

If you have, for example NLS_CHARACTERSET=AL32UTF8 and NLS_LANG=.WE8ISO8859P1 it will also work (again, provided your client really uses ISO-8859-P1). However, the database may store characters which your client is not able to display, instead the client will display a placeholder (e.g. ¿).

Anyway, it is beneficial to have matching NLS_LANG and NLS_CHARACTERSET values, if suitable. If they are equal you can be sure that any character which may be stored in database can also be displayed and any character you enter in your terminal or write in your .sql file can also be stored in database and is not substituted by placeholder.

Supplement

So many times you can read advise like "The NLS_LANG character set must be the same as your database character set" (also here on SO). This is simply not true and a popular myth!

Here is the proof:

C:\>set NLS_LANG=.AL32UTF8

C:\>sqlplus ...

SQL> SET SERVEROUTPUT ON
SQL> DECLARE
2 CharSet VARCHAR2(20);
3 BEGIN
4 SELECT VALUE INTO Charset FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
5 DBMS_OUTPUT.PUT_LINE('Database NLS_CHARACTERSET is '||Charset);
6 IF UNISTR('\20AC') = '€' THEN
7 DBMS_OUTPUT.PUT_LINE ( '"€" is equal to U+20AC' );
8 ELSE
9 DBMS_OUTPUT.PUT_LINE ( '"€" is not the same as U+20AC' );
10 END IF;
11 END;
12 /

Database NLS_CHARACTERSET is AL32UTF8
"€" is not the same as U+20AC

PL/SQL procedure successfully completed.

Both, client and database character sets are AL32UTF8, however the characters do not match. The reason is, my cmd.exe and thus also SQL*Plus use Windows CP1252. Therefore I must set NLS_LANG accordingly:

C:\>chcp
Active code page: 1252

C:\>set NLS_LANG=.WE8MSWIN1252

C:\>sqlplus ...

SQL> SET SERVEROUTPUT ON
SQL> DECLARE
2 CharSet VARCHAR2(20);
3 BEGIN
4 SELECT VALUE INTO Charset FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
5 DBMS_OUTPUT.PUT_LINE('Database NLS_CHARACTERSET is '||Charset);
6 IF UNISTR('\20AC') = '€' THEN
7 DBMS_OUTPUT.PUT_LINE ( '"€" is equal to U+20AC' );
8 ELSE
9 DBMS_OUTPUT.PUT_LINE ( '"€" is not the same as U+20AC' );
10 END IF;
11 END;
12 /

Database NLS_CHARACTERSET is AL32UTF8
"€" is equal to U+20AC

PL/SQL procedure successfully completed.

Also consider this example:

CREATE TABLE ARABIC_LANGUAGE (
LANG_CHAR VARCHAR2(20),
LANG_NCHAR NVARCHAR2(20));

INSERT INTO ARABIC_LANGUAGE VALUES ('العربية', 'العربية');

You would need to set two different values for NLS_LANG for a single statement - which is not possible.

Oracle sqlplus Chinese characters garbled

Check your terminal settings with locale charmap or echo $LANG and verify if it matches with ZHS16GBK.

It is not required to use the same character set as your database. Using NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and UTF-8 in your terminal will also work. It is important that NLS_LANG matches your terminal character set.

See also OdbcConnection returning Chinese Characters as "?"

MS SQL - MySQL - Openquery Chinese characters imported as '?'

There are two ODBC drivers, one ANSI (myodbc5a.dll) and one Unicode (myodbc5w.dll). The ANSI driver will always return varchar data. You need to use the Unicode driver in order to return Chinese characters as Unicode. 

Check the relevant paragraph on the ODBC driver's installation page and this related question.

When converting from one codepage to another, unknown or unrepresentable characters are replaced with ? or the Unicode Replacement character �. Such characters in a string are a sure sign that a wrong conversion was attempted. 



Related Topics



Leave a reply



Submit