Is There Any Reasonable Way to Access the Contents of a CharacterSet?

Is there any reasonable way to access the contents of a CharacterSet?

By your definition, no, there is no "reasonable" way. That's just how NSCharacterSet stores it. It's optimized for testing membership, not enumerating all members.

Your loop can increment a counter over the codepoints, or it can shift the bits (one per codepoint), but either way you have to loop and test. The highest "Ll" character on my Mac is U+1D7CB (#120,779), so if you want to compute this list of characters at runtime, your code will have to loop at least that many times. See the Objective-C version of the documentation for details on how the bit vector is organized.

The good news is that this is fast. With unoptimized code on my 10-year-old Mac, it takes less than 1/10th of a second to find all 1,841 lowercaseLetters. If that's still not fast enough, it's easy to hide the cost by doing it once, in the background, at startup time.
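
For example, here is a minimal sketch (Swift/Foundation; the type and method names are my own, not part of any API) of paying that cost once on a background queue at startup and caching the result:

import Foundation

final class LowercaseLetters {
    static var cached: [Unicode.Scalar] = []

    static func warmUp() {
        DispatchQueue.global(qos: .utility).async {
            var result: [Unicode.Scalar] = []
            let set = CharacterSet.lowercaseLetters
            // Loop over every codepoint of every plane the set occupies,
            // testing membership one scalar at a time (as described above).
            for plane: UInt8 in 0...16 where set.hasMember(inPlane: plane) {
                for value in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                    if let scalar = Unicode.Scalar(value), set.contains(scalar) {
                        result.append(scalar)
                    }
                }
            }
            DispatchQueue.main.async { LowercaseLetters.cached = result } // publish back on the main queue
        }
    }
}

Call LowercaseLetters.warmUp() early (for example at app launch) and the list will usually be ready long before you need it.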

NSArray from NSCharacterSet

The following code creates an array containing all characters of a given character set. It also works for characters outside the "basic multilingual plane" (characters > U+FFFF, e.g. U+10400 DESERET CAPITAL LETTER LONG I).

NSCharacterSet *charset = [NSCharacterSet uppercaseLetterCharacterSet];
NSMutableArray *array = [NSMutableArray array];
for (int plane = 0; plane <= 16; plane++) {
    if ([charset hasMemberInPlane:plane]) {
        UTF32Char c;
        for (c = plane << 16; c < (plane + 1) << 16; c++) {
            if ([charset longCharacterIsMember:c]) {
                UTF32Char c1 = OSSwapHostToLittleInt32(c); // To make it byte-order safe
                NSString *s = [[NSString alloc] initWithBytes:&c1 length:4 encoding:NSUTF32LittleEndianStringEncoding];
                [array addObject:s];
            }
        }
    }
}

For the uppercaseLetterCharacterSet this gives an array of 1467 elements. But note that characters > U+FFFF are stored as a UTF-16 surrogate pair in NSString, so for example U+10400 is actually stored in NSString as the two 16-bit characters "\uD801\uDC00".
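
That surrogate-pair storage is easy to observe; here is a small Swift/Foundation sketch (the printed values follow from the UTF-16 encoding of U+10400):

import Foundation

let s = "\u{10400}"                                          // DESERET CAPITAL LETTER LONG I
print(s.unicodeScalars.count)                                // 1 – a single Unicode scalar
print((s as NSString).length)                                // 2 – NSString counts UTF-16 code units
print(s.utf16.map { String(format: "0x%04X", UInt32($0)) })  // ["0xD801", "0xDC00"] – the surrogate pair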

Swift 2 code can be found in other answers to this question.
Here is a Swift 3 version, written as an extension method:

extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}

Example:

let charset = CharacterSet.uppercaseLetters
let chars = charset.allCharacters()
print(chars.count) // 1521
print(chars) // ["A", "B", "C", ...]

(Note that some characters may not be present in the font used to
display the result.)

For HTTP responses with Content-Types suggesting character data, which charset should be assumed by the client if none is specified?

All major browsers I've checked (IE, Firefox and Opera) completely ignore the RFC on this point.

If you are interested in the algorithms used to auto-detect a charset from the data itself, see the Mozilla Firefox link below.

Just a small note about content types: only text types have character sets. It's reasonable to assume that browsers handle application/x-javascript the same way they handle text/javascript (except IE6, but that's another subject).

Internet Explorer will use the default charset (probably stored in the registry), as noted:

By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.

Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx

Mozilla Firefox attempts to auto-detect the charset, as described here:

This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.

Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Opera uses auto-detection too, as documented:

If the transport protocol provides an encoding name, that is used. If not, Opera will look at the page for a charset declaration. If this is missing, Opera will attempt to auto-detect the encoding, using the domain name to see if the script is a CJK script, and if so which one. Opera can also auto-detect UTF-8.

Source: http://www.opera.com/docs/specs/opera9/

What does "be representable in execution character set" mean?

The default execution character set of GCC is UTF-8.

And therein lies the problem. Namely, this is not true. Or at least, not in the way that the C++ standard means it.

The standard defines the "basic character set" as a collection of 96 different characters. However, it does not define an encoding for them. That is, the character "A" is part of the "basic character set". But the value of that character is not specified.

When the standard defines the "basic execution character set", it adds some characters to the basic set, but it also specifies that there is a mapping from each character to a value. Beyond requiring that the NUL character be 0 (and that the digits be encoded in a contiguous sequence), it lets implementations decide for themselves what that mapping is.

Here's the issue: UTF-8 is not a "character set" by any reasonable definition of that term.

Unicode is a character set; it defines a series of characters which exist and what their meanings are. It also assigns each character in the Unicode character set a unique numeric value (a Unicode codepoint).

UTF-8 is... not that. UTF-8 is a scheme for encoding characters, typically ones in the Unicode character set (though it's not picky; it can work for any 21-bit number, and it can be extended to 32 bits).
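
One way to make that distinction concrete (a small Swift sketch; the character chosen is just an example):

let scalar: Unicode.Scalar = "\u{00E9}"  // U+00E9 LATIN SMALL LETTER E WITH ACUTE – a member of the character set
let s = String(Character(scalar))
print(scalar.value)                      // 233          – the Unicode codepoint, just a number
print(Array(s.utf8))                     // [195, 169]   – UTF-8: that codepoint encoded as two bytes
print(Array(s.utf16))                    // [233]        – UTF-16: the same codepoint as a single 16-bit code unit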

So when GCC's documentation says:

[The execution character set] is under control of the user; the default is UTF-8, matching the source character set.

This statement makes no sense, since as previously stated, UTF-8 is a text encoding, not a character set.

What seems to have happened in GCC's documentation (and likely in GCC's command-line options) is that they've conflated the concept of "execution character set" with "narrow character encoding scheme". UTF-8 is how GCC encodes narrow character strings by default. But that's different from saying what its "execution character set" is.

That is, you can use UTF-8 to encode just the basic execution character set defined by C++. Using UTF-8 as your narrow character encoding scheme has no bearing on what your execution character set is.

Note that Visual Studio has a similarly-named option and makes a similar conflation of the two concepts. They call it the "execution character set", but they explain the behavior of the option as:

The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps.

So... what is GCC's execution character set? Well, since their documentation has confused "execution character set" with "narrow string encoding", it's pretty much impossible to know.

So what does the standard require of GCC's behavior? Well, take the rule you quoted and turn it around. A single universal-character-name in a character literal will either be a char or an int, and it will only be the latter if the universal-character-name names a character not in the execution character set. So it's impossible for an implementation's execution character set to include more characters than a single char can represent.

That is, GCC's execution character set cannot be Unicode in its entirety. It must be some subset of Unicode. It can choose for it to be the subset of Unicode whose UTF-8 encoding takes up 1 char, but that's about as big as it can be.


While I've framed this as GCC's problem, it's also technically a problem in the C++ specification. The paragraph you quoted also conflates the encoding mechanism (i.e., what char means) with the execution character set (i.e., what characters are available to be stored).

This problem has been recognized and addressed by the addition of this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L. Such character-literals are conditionally-supported.

As these are proposed (and accepted) as resolutions for CWG issues, they also retroactively apply to previous versions of the standard.

Fastest way to get the next element in a string from a needle

Use this:

# The character set that we're using to iterate.
$CharacterSet = 'abcdefg';
$chr_array = str_split($CharacterSet);

foreach ($chr_array as $Character) {
    // Do whatever with $Character; each pass of the foreach loop hands you the next one.
}

I hope it helps

Get everything after the dash in a string in JavaScript

How I would do this:

// function you can use:
function getSecondPart(str) {
    return str.split('-')[1]; // element [1] is the text after the first dash (up to the next dash, if any)
}

// use the function:
alert(getSecondPart("sometext-20202")); // "20202"

Restricting user password character set

There is little reason to worry about SQL injection attacks unless you're actually inserting the password into the database in plain text (Danger, Will Robinson, Danger!), and even then, if you parameterize the query it won't be an issue. You should allow [a-zA-Z0-9] plus some set of special characters. Probably the only character to restrict is '<', which will trigger the ASP.NET validation warning. There are a number of fun tools out there to do password complexity checking on the client side. I like this one. It provides some instant feedback to the user as they are typing.
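
If it helps, here is a rough sketch of that allow-list idea (written in Swift to match the earlier examples; the function name and the particular set of special characters are placeholders of my own):

import Foundation

// [a-zA-Z0-9] plus an example set of special characters.
let allowedPasswordCharacters = CharacterSet(charactersIn:
    "abcdefghijklmnopqrstuvwxyz" +
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
    "0123456789" +
    "!@#$%^&*()-_")

func passwordUsesAllowedCharacters(_ password: String) -> Bool {
    // Valid only if every scalar in the password belongs to the allowed set.
    return password.unicodeScalars.allSatisfy { allowedPasswordCharacters.contains($0) }
}

print(passwordUsesAllowedCharacters("S3cure!Pass"))  // true
print(passwordUsesAllowedCharacters("has<angle>"))   // false – '<' and '>' are not in the set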

Is there a simple, portable way to determine the ordering of two characters in C?

For A-Z,a-z in a case-insensitive manner (and using compound literals):

char ch = foo();
// Base 36 maps '0'-'9' to 0-9 and 'A'-'Z'/'a'-'z' to 10-35, so upper- and
// lower-case letters receive the same rank.
int az_rank = strtol((char []){ch, 0}, NULL, 36);

For two char values that are known to be A-Z or a-z, but may be in either ASCII or EBCDIC:

int compare2alpha(char c1, char c2) {
    int mask = 'A' ^ 'a'; // Only 1 bit differs between upper- and lower-case
    // OR-ing both characters with that bit folds them to the same case before
    // comparing, and this works in both ASCII and EBCDIC.
    return (c1 | mask) - (c2 | mask);
}

Alternatively, if limited to 256 distinct char values, you could use a look-up table that maps each char to its rank. Of course, the table is platform dependent.


