What Do These "\E6##" Characters Mean

How does a file with Chinese characters know how many bytes to use per character?

If the encoding is UTF-8, then the following table shows how a Unicode code point (up to 21 bits) is converted into UTF-8 encoding:

Scalar Value                 1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy  yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

There are a number of non-allowed values - in particular, bytes 0xC1, 0xC2, and 0xF5 - 0xFF can never appear in well-formed UTF-8. There are also a number of other verboten combinations. The irregularities are in the 1st byte and 2nd byte columns. Note that the codes U+D800 - U+DFFF are reserved for UTF-16 surrogates and cannot appear in valid UTF-8.

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

These tables are lifted from the Unicode standard version 5.1.

In the question, the material from offset 0x0010 .. 0x008F yields:

0x61           = U+0061
0x61           = U+0061
0x61           = U+0061
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE8 0xAE 0xBA = U+8BBA
0xE5 0x9D 0x9B = U+575B
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE6 0x96 0xB0 = U+65B0
0xE9 0x97 0xBB = U+95FB
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE6 0xB4 0xB2 = U+6D32
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE7 0xBD 0x91 = U+7F51
0xE7 0xAB 0x99 = U+7AD9
0x2C           = U+002C
0xE6 0xBE 0xB3 = U+6FB3
0xE5 0xA4 0xA7 = U+5927
0xE5 0x88 0xA9 = U+5229
0xE4 0xBA 0x9A = U+4E9A
0xE6 0x9C 0x80 = U+6700
0xE5 0xA4 0xA7 = U+5927
0xE7 0x9A 0x84 = U+7684
0xE5 0x8D 0x8E = U+534E
0x2D           = U+002D
0x29           = U+0029
0xE5 0xA5 0xA5 = U+5965
0xE5 0xB0 0xBA = U+5C3A
0xE7 0xBD 0x91 = U+7F51
0x26           = U+0026
0x6C           = U+006C
0x74           = U+0074
0x3B           = U+003B

What character to use to put an item at the end of an alphabetic list?

I found this thread while wanting folders that sort after Z in Finder on Mac OSX. After several false paths and trial and error, here's what I found:

Characters that sort after Z in Finder (in sort-order)

z Lower case Z
ι Greek letter
Ι Greek letter, capital version of above character, not an "I")
Ω Omega
一 Japanese Character? (Thanks, Jam)
口 Japanese character? (Thanks, Jam)
末 Japanese character "End" (Thanks, Jam)
 (a private use character) (Thanks, Peter O.)

These are characters others here and in other places on the web, mentioned sort after Z, but that I found DO NOT sort at the end, at least when sorting by name in Finder on Mac:

† ∆ ~ - ſ [ ø ￭ |

wrong utf-8 characters when writing to file (python)

When you see in the text file something like ÃŠ (or more generally 2 characters the first of which is Ã) it is likely that the file was correctly written in UTF8, and that the editor (or the screen) does not process correctly UTF8.

Let's look at æ. It is the unicode character U+E6. When you encode it in utf8, it gives the two characters b'\xc3\xa6' and when decoded as latin1 it printf 'Ã¦'.

What can you do to confirm? Use the excellent vim editor that knows about multiple encoding and among others utf8, at least when you use its graphical interface gvim.

And just another general advice: never write non ascii characters in a python source file, unless you put a # -*- coding: ... -*- line as first line (or second if first is a hashbang one #! /usr/bin/env python)

And if you want to use unicode under Windows with Python, do use IDLE that natively processes it.

TL/DR: If you are using Linux, it is likely that your system is configured natively to use utf8 encoding, and you correctly write your text files in utf8 but your text editor just fails to display utf8 correctly

Regular expression to match a line that doesn't contain a word

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!:hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
    
index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).

\d less efficient than [0-9]

\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, Persian digits, ۱۲۳۴۵۶۷۸۹, are an example of Unicode digits which are matched with \d, but not [0-9].

You can generate a list of all such characters using the following code:

var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

Which generates:

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９

How to insert a character every nth character in a pandas dataframe column

Let us try findall with map (.. means N = 2)

df.mac_address.str.findall('..').map(':'.join)
Out[368]: 
0    00;03;E6;A5;84;C2
1    00;03;E6;A5;84;CC
2    00;03;E6;A5;84;DA
3    00;03;E6;A5;84;DC
4    00;03;E6;A5;84;E4
Name: mac_address, dtype: object

How to print the extended ASCII code in java from integer value

ASCII 153 (0x99) is different from Unicode U+0099 (Control character).

Solution

This program should do what you intend it to do:

public class ExtendedAscii {
    public static final char[] EXTENDED = { 0x00C7, 0x00FC, 0x00E9, 0x00E2,
            0x00E4, 0x00E0, 0x00E5, 0x00E7, 0x00EA, 0x00EB, 0x00E8, 0x00EF,
            0x00EE, 0x00EC, 0x00C4, 0x00C5, 0x00C9, 0x00E6, 0x00C6, 0x00F4,
            0x00F6, 0x00F2, 0x00FB, 0x00F9, 0x00FF, 0x00D6, 0x00DC, 0x00A2,
            0x00A3, 0x00A5, 0x20A7, 0x0192, 0x00E1, 0x00ED, 0x00F3, 0x00FA,
            0x00F1, 0x00D1, 0x00AA, 0x00BA, 0x00BF, 0x2310, 0x00AC, 0x00BD,
            0x00BC, 0x00A1, 0x00AB, 0x00BB, 0x2591, 0x2592, 0x2593, 0x2502,
            0x2524, 0x2561, 0x2562, 0x2556, 0x2555, 0x2563, 0x2551, 0x2557,
            0x255D, 0x255C, 0x255B, 0x2510, 0x2514, 0x2534, 0x252C, 0x251C,
            0x2500, 0x253C, 0x255E, 0x255F, 0x255A, 0x2554, 0x2569, 0x2566,
            0x2560, 0x2550, 0x256C, 0x2567, 0x2568, 0x2564, 0x2565, 0x2559,
            0x2558, 0x2552, 0x2553, 0x256B, 0x256A, 0x2518, 0x250C, 0x2588,
            0x2584, 0x258C, 0x2590, 0x2580, 0x03B1, 0x00DF, 0x0393, 0x03C0,
            0x03A3, 0x03C3, 0x00B5, 0x03C4, 0x03A6, 0x0398, 0x03A9, 0x03B4,
            0x221E, 0x03C6, 0x03B5, 0x2229, 0x2261, 0x00B1, 0x2265, 0x2264,
            0x2320, 0x2321, 0x00F7, 0x2248, 0x00B0, 0x2219, 0x00B7, 0x221A,
            0x207F, 0x00B2, 0x25A0, 0x00A0 };

    public static final char getAscii(int code) {
        if (code >= 0x80 && code <= 0xFF) {
            return EXTENDED[code - 0x7F];
        }
        return (char) code;
    }

    public static final void printChar(int code) {
        System.out.printf("%c%n", getAscii(code));
    }

    public static void main(String[] args) {
        printChar(153);
        printChar(63);
    }
}

Output:

Ü

?

As you can see from the output above, the intended character gets printed correctly.

Extension of Concept

Also, I wrote a program that prints out the Unicode values for the extended Ascii. As you can see in the output below, a lot of characters have trouble being displayed as native char.

Code:

public class ExtendedAscii {
    public static final char[] EXTENDED = { 0x00C7, 0x00FC, 0x00E9, 0x00E2,
            0x00E4, 0x00E0, 0x00E5, 0x00E7, 0x00EA, 0x00EB, 0x00E8, 0x00EF,
            0x00EE, 0x00EC, 0x00C4, 0x00C5, 0x00C9, 0x00E6, 0x00C6, 0x00F4,
            0x00F6, 0x00F2, 0x00FB, 0x00F9, 0x00FF, 0x00D6, 0x00DC, 0x00A2,
            0x00A3, 0x00A5, 0x20A7, 0x0192, 0x00E1, 0x00ED, 0x00F3, 0x00FA,
            0x00F1, 0x00D1, 0x00AA, 0x00BA, 0x00BF, 0x2310, 0x00AC, 0x00BD,
            0x00BC, 0x00A1, 0x00AB, 0x00BB, 0x2591, 0x2592, 0x2593, 0x2502,
            0x2524, 0x2561, 0x2562, 0x2556, 0x2555, 0x2563, 0x2551, 0x2557,
            0x255D, 0x255C, 0x255B, 0x2510, 0x2514, 0x2534, 0x252C, 0x251C,
            0x2500, 0x253C, 0x255E, 0x255F, 0x255A, 0x2554, 0x2569, 0x2566,
            0x2560, 0x2550, 0x256C, 0x2567, 0x2568, 0x2564, 0x2565, 0x2559,
            0x2558, 0x2552, 0x2553, 0x256B, 0x256A, 0x2518, 0x250C, 0x2588,
            0x2584, 0x258C, 0x2590, 0x2580, 0x03B1, 0x00DF, 0x0393, 0x03C0,
            0x03A3, 0x03C3, 0x00B5, 0x03C4, 0x03A6, 0x0398, 0x03A9, 0x03B4,
            0x221E, 0x03C6, 0x03B5, 0x2229, 0x2261, 0x00B1, 0x2265, 0x2264,
            0x2320, 0x2321, 0x00F7, 0x2248, 0x00B0, 0x2219, 0x00B7, 0x221A,
            0x207F, 0x00B2, 0x25A0, 0x00A0 };

    public static void main(String[] args) {
        for (char c : EXTENDED) {
            System.out.printf("%s, ", new String(Character.toChars(c)));
        }
    }
}