Is TCHAR Still Relevant

Is TCHAR still relevant?

I would still use the TCHAR syntax if I were starting a new project today. There's not much practical difference between it and the WCHAR syntax, and I prefer code that is explicit about the character type. Since most API functions and helper objects take or use TCHAR types (e.g., CString), it just makes sense to use it. Plus, it gives you flexibility if you ever decide to use the code in an ASCII app, or if Windows ever evolves to Unicode32, etc.

If you decide to go the WCHAR route, I would be explicit about it. That is, use CStringW instead of CString, and use conversion macros (e.g., CW2CT) when converting to TCHAR.

That's my opinion, anyway.

Should I use TCHAR today

In modern Windows, all ANSI ('A') functions internally convert char* to wchar_t* and call the Unicode ('W') version of the same function. Basically, by adopting TCHAR instead of wchar_t you gain nothing, but you have to deal with quirky syntax.

Should I eliminate TCHAR from Windows code?

Windows still uses UTF-16 and most likely always will. You therefore need to use wstring rather than string. The Windows APIs don't offer direct support for UTF-8, largely because Windows supported Unicode before UTF-8 was invented.

It is thus rather painful to write Unicode code that will compile on both Windows and Unix platforms.

Starting a new Windows app: Should I use _TCHAR or wchar_t for text?

No, there is not. Just go with wchar_t.

TCHAR is only useful if you want a conditional compilation switch that converts your program to operate in ASCII mode. Since Win2K and up are Unicode platforms, this switch provides no value. You'd instead be implying to other developers on your project that ASCII is a valid target when in fact it's not.

What are TCHAR strings and the 'A' or 'W' version of Win32 API functions?

Let me start off by saying that you should preferably not use TCHAR for new Windows projects and instead directly use Unicode. On to the actual answer:

Character Sets

The first thing we need to understand is how character sets work in Visual Studio. The project property page has an option to select the character set used:

  • Not Set
  • Use Unicode Character Set
  • Use Multi-Byte Character Set

[Screenshot: Project Property Page - Character Set]

Depending on which of the three options you choose, a lot of definitions change to accommodate the selected character set. There are three main categories: string types, string routines from tchar.h, and API functions:

  • 'Not Set' corresponds to TCHAR = char using ANSI encoding, where you use the standard 8-bit code page of the system for strings. All tchar.h string routines use the basic char versions. All API functions that work with strings will use the 'A' version of the API function.
  • 'Unicode' corresponds to TCHAR = wchar_t using UTF-16 encoding. All tchar.h string routines use the wchar_t versions. All API functions that work with strings will use the 'W' version of the API function.
  • 'Multi-Byte' corresponds to TCHAR = char, using some multi-byte encoding scheme. All tchar.h string routines use the multi-byte character set versions. All API functions that work with strings will use the 'A' version of the API function.

Related reading: About the "Character set" option in visual studio 2010

TCHAR.h header

The tchar.h header is a helper that provides generic names for the C string operations, which switch to the correct function for the given character set. For instance, _tcscat will switch to either strcat (not set), wcscat (unicode), or _mbscat (mbcs). _tcslen will switch to either strlen (not set), wcslen (unicode), or strlen (mbcs; _tcslen counts bytes, while the separate _tcsclen maps to _mbslen for counting characters).

The switch happens by defining all _txxx symbols as macros that evaluate to the correct function, depending on the compiler switches.

The idea behind it is that you can use the encoding-agnostic types TCHAR (or _TCHAR) and the encoding-agnostic functions that work on them, from tchar.h, instead of the regular string functions from string.h.

Similarly, _tmain is defined to be either main or wmain. See also: What is the difference between _tmain() and main() in C++?

A helper macro _T(..) is defined for getting string literals of the correct type, either "regular literals" or L"wchar_t literals".
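The dispatch mechanism described above can be illustrated with a trimmed-down sketch of what tchar.h effectively does. This is a simplification for illustration only; the real header is driven by _UNICODE and _MBCS and covers many more routines:

```cpp
#include <cstring>   // strlen, strcat
#include <cwchar>    // wcslen, wcscat

// Simplified sketch of the tchar.h mechanism (illustration only,
// omitting the MBCS branch that the real header also provides).
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define _T(x)   L##x        // L"wide literal"
#define _tcslen wcslen
#define _tcscat wcscat
#else
typedef char TCHAR;
#define _T(x)   x           // "narrow literal"
#define _tcslen strlen
#define _tcscat strcat
#endif
```

With _UNICODE defined, `_tcslen(_T("hi"))` expands to `wcslen(L"hi")`; without it, to `strlen("hi")`, which is exactly the switching behavior the real header provides.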

See the caveats mentioned here: Is TCHAR still relevant? -- dan04's answer

_tmain example

For the example of main in the question, the following code concatenates all the strings passed as command line arguments into one.

int _tmain(int argc, _TCHAR *argv[])
{
    TCHAR szCommandLine[1024];

    if (argc < 2) return 0;

    _tcscpy(szCommandLine, argv[1]);
    for (int i = 2; i < argc; ++i)
    {
        _tcscat(szCommandLine, _T(" "));
        _tcscat(szCommandLine, argv[i]);
    }

    /* szCommandLine now contains the command line arguments */

    return 0;
}

(Error and bounds checking are omitted.) This code works for all three character set settings, because everywhere we used TCHAR, the tchar.h string functions, and _T for string literals. Forgetting to surround string literals with _T(..) is a common source of compiler errors when writing such TCHAR programs.
If we had not done all these things, switching character sets would cause the code to either not compile, or worse, compile but misbehave at runtime.

Windows API functions

Windows API functions that work on strings, such as CreateFile and GetCurrentDirectory, are implemented in the Windows headers as macros that, like the tchar.h macros, switch to either the 'A' version or the 'W' version. For instance, CreateFile is a macro that is defined to CreateFileA for ANSI and MBCS, and to CreateFileW for Unicode.

Whenever you use the flat form (without 'A' or 'W') in your code, the actual function called will switch depending on the selected character set. You can force the use of a particular version by using the explicit 'A' or 'W' names.

The conclusion is that you should always use the flat, unqualified name, unless you specifically want a particular version regardless of the character set option.
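The mechanism can be mimicked with a hypothetical API function. ShowText is a made-up name used purely for illustration; the real Windows headers do the same thing for CreateFile, GetCurrentDirectory, and so on:

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical API with explicit 'A' and 'W' variants (made-up names,
// mirroring how windows.h declares CreateFileA / CreateFileW).
std::size_t ShowTextA(const char *s)    { return std::strlen(s); }
std::size_t ShowTextW(const wchar_t *s) { return std::wcslen(s); }

// The flat name is a macro that resolves according to the UNICODE switch.
#ifdef UNICODE
#define ShowText ShowTextW
#else
#define ShowText ShowTextA
#endif
```

Code that calls the flat `ShowText` compiles against either variant depending on the character set, while code that calls `ShowTextA` or `ShowTextW` always gets that specific version.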

For the example in the question, where we want to open the file given by the first argument:

int _tmain(int argc, _TCHAR *argv[])
{
    if (argc < 2) return 1;

    HANDLE hFile = CreateFile(argv[1], GENERIC_READ, 0, NULL, OPEN_EXISTING, 0, NULL);

    /* Read from file and do other stuff */
    ...

    CloseHandle(hFile);

    return 0;
}

(Error checking is omitted.) Note that for this example, nowhere did we need to use any of the TCHAR-specific machinery, because the macro definitions have already taken care of it for us.

Utilising C++ strings

We've seen how we can use the tchar.h routines to use C style string operations to work with TCHARs, but it would be nice if we could leverage C++ strings to work with this.

My advice would foremost be to not use TCHAR and instead use Unicode directly, see the Conclusion section, but if you want to work with TCHAR you can do the following.

To use TCHAR, what we want is an instance of std::basic_string that uses TCHAR. You can do this by typedefing your own tstring:

typedef std::basic_string<TCHAR> tstring;

For string literals, don't forget to use _T.

You'll also need to use the correct versions of cin and cout. You can use references to implement a tcin and tcout:

#if defined(_UNICODE)
std::wistream &tcin  = std::wcin;
std::wostream &tcout = std::wcout;
#else
std::istream  &tcin  = std::cin;
std::ostream  &tcout = std::cout;
#endif

This should allow you to do almost anything. There might be the occasional exception, such as std::to_string and std::to_wstring, for which you can find a similar workaround.
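For the std::to_string / std::to_wstring case just mentioned, the workaround follows the same pattern. A minimal sketch, where both `tstring` and `to_tstring` are our own names, not part of the standard library or tchar.h:

```cpp
#include <string>

// Sketch: pick the matching string type and conversion function for the
// active character set. tstring and to_tstring are our own names.
#ifdef _UNICODE
typedef std::wstring tstring;
#define to_tstring std::to_wstring
#else
typedef std::string tstring;
#define to_tstring std::to_string
#endif
```

With this in place, `tstring s = to_tstring(42);` compiles and behaves correctly regardless of the selected character set.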

Conclusion

This answer (hopefully) details what TCHAR is and how it's used and intertwined with Visual Studio and the Windows headers. However, we should also wonder if we want to use it.

My advice is to directly use Unicode for all new Windows programs and don't use TCHAR at all!

Others giving the same advice: Is TCHAR still relevant?

To use Unicode after creating a new project, first ensure the character set is set to Unicode. Then, remove the #include <tchar.h> from your source file (or from stdafx.h). Fix up any TCHAR or _TCHAR to wchar_t and _tmain to wmain:

int wmain(int argc, wchar_t *argv[])

For non-console projects, the entry point for Windows applications is WinMain and will appear in TCHAR-jargon as

int APIENTRY _tWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPTSTR    lpCmdLine, int nCmdShow)

and should become

int APIENTRY wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPWSTR    lpCmdLine, int nCmdShow)

After this, only use wchar_t strings and/or std::wstrings.

Further caveats

  • Be careful with sizeof(szMyString) on TCHAR arrays (strings): for ANSI it is the size both in characters and in bytes; for Unicode it is only the size in bytes, and the number of characters is at most half that; for MBCS it is the size in bytes, and the number of characters may or may not be equal. Both Unicode and MBCS can use multiple TCHARs to encode a single character.
  • Mixing TCHAR stuff and fixed char or wchar_t is very annoying; you have to convert the strings from one to the other, using the correct code page! A simple copy will not work in the general case.
  • There is a slight difference between _UNICODE and UNICODE, relevant if you want to conditionally define your own functions. See Why both UNICODE and _UNICODE?
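The first caveat above can be made concrete: sizeof always yields a byte count, so to get the element count you divide by the element size, which is what the Windows-specific _countof macro does. A portable sketch:

```cpp
#include <cstddef>

// sizeof yields a byte count; the element count requires dividing by the
// element size. This template is a portable equivalent of _countof.
template <typename T, std::size_t N>
constexpr std::size_t countof(T (&)[N]) { return N; }

char    ansiBuf[64];  // 64 bytes, 64 elements
wchar_t wideBuf[64];  // 64 * sizeof(wchar_t) bytes, still 64 elements
```

And remember from the caveat that even the element count is not the character count for UTF-16 (surrogate pairs) or MBCS (lead/trail bytes).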

A very good, complementary answer is: Difference between MBCS and UTF-8 on Windows

Char and string types, what should I use?

A heavily opinionated, but experience-based answer

Before I begin, let me state that I've been working on C++ software for five years, with millions of users globally; in doing so I've learned a hell of a lot about how things work in the real world.

The first thing to understand is that Windows inherently uses its (originally homegrown) UTF-16 encoding (a.k.a. wide characters), and in doing so makes your life much, much harder.
(Almost) every other operating system uses UTF-8. By that I mean OS X, *NIX, Android, iOS, pretty much anything you can throw a C++ compiler at.

Because of this, ask yourself: do you EVER intend to use your code outside of Windows?
If you don't, there's no reason not to do it the "Windows way", with std::wstring being your best friend here. You can very easily use .c_str() to get a const wchar_t * (which implicitly converts to an LPCWSTR). Many of these Windows names (such as LPCWSTR and TCHAR) are actually typedefs or macros (i.e., #defines). You can read more on that here.

Should you bother with UTF-16 wide characters at all?

It's very tempting to think "what if I just ignore languages that don't use a Latin alphabet"; trust me when I say, don't.
Yes, you could use multi-byte characters only, or implicitly call only the 'A' variants of API functions.
However, while this works (and works very well), if you support any language beyond the Latin family, you will run into problems. And even if you don't, users will still expect to input text in their native language.

TL;DR

English only, cross-platform? -
In short, there is nothing inherently wrong with using only ANSI 8-bit strings throughout your Windows programming; it won't crash the internet. If you're writing something that you know for certain will only ever be used by English speakers across platforms (software for America?), then I actually recommend switching your project to Multi-Byte and using std::string for everything. Just don't expect to open a single file with an international filename.

And keep the scale in mind: if your user base is in the thousands, consider going UTF-8 anyway; once it's in the tens of thousands, people are going to get mildly angered by not being able to load kanji filenames.

International, Windows only -
If your software is going to come even close to the internet border of Sweden (where it needs to load a filename written in Goa'uld), use std::wstring, use UTF-16, and be happy with Windows-only software. To be honest, this is the state of most Windows software today.

International, Macs are cool? -
Your project manager wants cross-platform software yesterday. It needs to run on Mac and PC, because 16% of the users it's being deployed to are Mac users (according to marketing), and it needs IME support for Arabic and Japanese.

Tell your project manager you are going to write a wrapper for all your API calls. It will take a week longer, but it will prevent any cross-platform language nonsense. If he doesn't agree, quit.

Then do just that: use UTF-8 under the bonnet, and route any API calls to the Windows or Mac system through a wrapper class you wrote yourself. Yes, it takes some effort and maintenance, but it saves you a lot of time in the long run.
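If you do keep UTF-8 under the bonnet, remember that std::string::size() counts bytes, not characters. A minimal code-point counter, written for this answer and assuming well-formed UTF-8 input, looks like:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte that
// is NOT a continuation byte (continuation bytes look like 0b10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t utf8_length(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For example, the two-byte sequence "\xC3\xA9" (e-acute) has size() == 2 but utf8_length == 1; your wrapper layer is where such byte-vs-character distinctions should be handled.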

EXTRA LINKS

If you need very complex Unicode support, check out the ICU library (OS X uses this under the hood!).

Learn to use Boost; the filesystem support alone makes cross-platform C++ development much, much faster.


