Unicode Character-Specific CSS - a Thought

This is entirely a problem of fonts. If you choose a well-balanced font whose glyph sizes are adjusted so that mixed-language text looks good together, there's no real problem. CSS can help here insofar as you can specify custom fonts for certain characters using @font-face:

@font-face {
  font-family:   'bangla';
  src:           url('http://example.com/mybangla.ttf');
  unicode-range: U+0980-09FF;
}

This fictional "bangla" font now applies only to the Unicode range U+0980 to U+09FF, which is the Bengali block. Choose your fonts wisely and you can create a well-balanced appearance in modern browsers.
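
To sketch how this might fit into a page: several @font-face rules can share one family name, each covering a different unicode-range, and the browser picks the matching face per character. The family name 'mixedtext', the Latin font file, and both URLs below are made up for illustration.

```css
/* Bengali characters (U+0980-09FF) come from the Bengali font file. */
@font-face {
  font-family:   'mixedtext';
  src:           url('http://example.com/mybangla.ttf');
  unicode-range: U+0980-09FF;
}

/* Basic Latin and Latin-1 come from a second, hypothetical font file. */
@font-face {
  font-family:   'mixedtext';
  src:           url('http://example.com/mylatin.ttf');
  unicode-range: U+0000-00FF;
}

/* Characters outside both ranges fall through to the generic fallback. */
body {
  font-family: 'mixedtext', sans-serif;
}
```

A browser that supports unicode-range should only need to download the files whose ranges actually occur in the text.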

Select Unicode character subset by culture

I think it's fair to say that any given culture uses only a small fraction of all Unicode characters.

Check out the current Unicode standard. I don't think there is a direct correlation between cultures and scripts; this previous question touches on the problem.

Can I use CSS unicode-range to specify a font across an entire (third party) page?

The answer is yes, in most browsers.

MDN - Unicode Range

The unicode-range CSS descriptor sets the specific range of characters
to be downloaded from a font defined by @font-face and made available
for use on the current page.

Example:

@font-face {
  font-family: 'Ampersand';
  src: local('Times New Roman');
  unicode-range: U+26;
}

Support: CanIUse.com

Also see this Article

Allowed characters for CSS identifiers

The charset doesn't matter. The allowed characters matter more. Check the CSS specification. Here's the relevant quote:

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
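
As a rough illustration of the rules quoted above, here are a few selectors that rely on escapes. The class names and declarations are invented for the example; only the escaping follows the spec text.

```css
/* Matches class="B&W?" using character escapes. */
.B\&W\? { color: gray; }

/* The same selector with numeric escapes; the space terminates the code \26. */
.B\26 W\3F { color: gray; }

/* Matches class="123": an identifier cannot start with a digit,
   so the leading 1 is written as its code point, \31. */
.\31 23 { font-weight: bold; }

/* Characters at U+00A0 and higher need no escaping. */
.café { font-style: italic; }
```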

Update: As to the regex question, you can find the grammar here:

ident      -?{nmstart}{nmchar}*

It consists of these parts:

nmstart    [_a-z]|{nonascii}|{escape}
nmchar     [_a-z0-9-]|{nonascii}|{escape}
nonascii   [\240-\377]
escape     {unicode}|\\[^\r\n\f0-9a-f]
unicode    \\{h}{1,6}(\r\n|[ \t\r\n\f])?
h          [0-9a-f]

This can be translated to a Java regex as follows. I only added parentheses around the parts containing the OR, escaped the backslashes, and wrote the octal range \240-\377 as \xA0-\xFF, because Java's regex syntax only accepts octal escapes with a leading zero:

// Build the regex bottom-up, substituting each {name} placeholder from the grammar.
String h = "[0-9a-f]";
String unicode = "\\\\{h}{1,6}(\\r\\n|[ \\t\\r\\n\\f])?".replace("{h}", h);
String escape = "({unicode}|\\\\[^\\r\\n\\f0-9a-f])".replace("{unicode}", unicode);
String nonascii = "[\\xA0-\\xFF]"; // same range as [\240-\377] in the grammar
String nmchar = "([_a-z0-9-]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String nmstart = "([_a-z]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
String ident = "-?{nmstart}{nmchar}*".replace("{nmstart}", nmstart).replace("{nmchar}", nmchar);

System.out.println(ident); // The full regex.
// The grammar is case-insensitive, so compile it with Pattern.CASE_INSENSITIVE.

Update 2: Oh, you're more of a PHP'er. Well, I think you can figure out how and where to use str_replace.

What is the HTML unicode character for a tall right chevron?

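Use '›' (U+203A, written in HTML as &rsaquo; or &#8250;).

› is the single right-pointing angle quotation mark. For its left-pointing counterpart, use ‹ (U+2039, &lsaquo; or &#8249;).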
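
If the chevron is purely decorative, for example as a breadcrumb separator, it could also be generated from CSS instead of being typed into the markup. A minimal sketch, assuming a hypothetical list with the class "breadcrumb":

```css
/* Insert U+203A before every breadcrumb item except the first. */
.breadcrumb li + li::before {
  content: "\203A";   /* CSS escape for the single right-pointing angle quote */
  padding: 0 0.5em;   /* spacing around the separator; adjust to taste */
}
```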
