What Does Collation Mean

What do character set and collation mean exactly?

From MySQL docs:

A character set is a set of symbols
and encodings. A collation is a set of
rules for comparing characters in a
character set. Let's make the
distinction clear with an example of
an imaginary character set.

Suppose that we have an alphabet with
four letters: 'A', 'B', 'a', 'b'. We
give each letter a number: 'A' = 0,
'B' = 1, 'a' = 2, 'b' = 3. The letter
'A' is a symbol, the number 0 is the
encoding for 'A', and the combination
of all four letters and their
encodings is a character set.

Now, suppose that we want to compare
two string values, 'A' and 'B'. The
simplest way to do this is to look at
the encodings: 0 for 'A' and 1 for
'B'. Because 0 is less than 1, we say
'A' is less than 'B'. Now, what we've
just done is apply a collation to our
character set. The collation is a set
of rules (only one rule in this case):
"compare the encodings." We call this
simplest of all possible collations a
binary collation.

But what if we want to say that the
lowercase and uppercase letters are
equivalent? Then we would have at
least two rules: (1) treat the
lowercase letters 'a' and 'b' as
equivalent to 'A' and 'B'; (2) then
compare the encodings. We call this a
case-insensitive collation. It's a
little more complex than a binary
collation.

In real life, most character sets have
many characters: not just 'A' and 'B'
but whole alphabets, sometimes
multiple alphabets or eastern writing
systems with thousands of characters,
along with many special symbols and
punctuation marks. Also in real life,
most collations have many rules: not
just case insensitivity but also
accent insensitivity (an "accent" is a
mark attached to a character as in
German 'ö') and multiple-character
mappings (such as the rule that 'ö' =
'OE' in one of the two German
collations).
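
The imaginary four-letter character set from the quoted passage can be sketched in Python. This only illustrates the idea and is not how MySQL implements collations; the names ENCODING, binary_key and case_insensitive_key are made up for this example.

```python
# Toy character set from the example: each symbol maps to a numeric encoding.
ENCODING = {"A": 0, "B": 1, "a": 2, "b": 3}


def binary_key(s):
    """Binary collation: compare the encodings exactly as they are."""
    return [ENCODING[ch] for ch in s]


def case_insensitive_key(s):
    """Case-insensitive collation: fold lowercase to uppercase,
    then compare the encodings."""
    fold = {"a": "A", "b": "B"}
    return [ENCODING[fold.get(ch, ch)] for ch in s]


words = ["b", "A", "a", "B"]
binary_sorted = sorted(words, key=binary_key)
ci_sorted = sorted(words, key=case_insensitive_key)
```

Under the binary rule every uppercase letter sorts before every lowercase one, because that is how the encodings were assigned. Under the case-insensitive rule 'a' and 'A' produce the same key, so they compare as equal.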

What is collation useful for in MySQL?

I'm not sure if this question is too broad. But let me take a stab at it.

A character set is something that we see on the screen. So, you probably have a pretty good idea of what the letter "a" is. And, you can accept that "A" is also a letter.

You can change how letters appear with fonts. So, an a in one font and an a in another font are still the same letter, but they look different. This is not related to collation, but is intended to get you thinking about the subject.

Collations tell us whether two letters are the same. So, you can have a case-insensitive collation that says that "A" and "a" are the same. Or, it can be case-sensitive, saying that they are different.

This is a very basic example. As you extend characters into other languages, the problems multiply. Are accented characters the same as unaccented characters? Is the Spanish ñ (eñe) the same as an n? And so on.

These are the questions that collations answer. A character set translates sequences of bits into characters that can be displayed. A font tells you what the character looks like. A collation allows us to compare two characters to know if they are the same, and to know their ordering in a dictionary.
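
To make the ñ question concrete, here is a small Python sketch. It assumes a simplified Spanish alphabet in which ñ is its own letter placed after n; SPANISH_ALPHABET, spanish_key and n_insensitive_equal are names invented for this illustration, and real collations cover far more characters.

```python
# Assumed alphabet order for the sketch: ñ is a separate letter after n.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(SPANISH_ALPHABET)}


def spanish_key(word):
    """Sort key: the position of each letter in the assumed alphabet."""
    return [RANK[ch] for ch in word.lower()]


def n_insensitive_equal(a, b):
    """A different, made-up rule: treat ñ and n as the same letter."""
    return a.lower().replace("ñ", "n") == b.lower().replace("ñ", "n")


words = ["oso", "ñu", "nube", "nadar"]
by_code_point = sorted(words)                    # plain code point order
by_spanish_rules = sorted(words, key=spanish_key)
```

Sorting by code point puts "ñu" after "oso", while the alphabet-based key puts it between "nube" and "oso". Which answer is right depends on the collation you choose, and that is exactly the decision a collation encodes.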

purpose of collate in Postgres

Collation determines how strings (text) are sorted and compared: for example alphabetical order, whether or not case matters, how to deal with letters that have accents, etc. COLLATE "C" tells the database not to apply any locale-specific collation rules. One might use this when designing a database to hold data in different languages. Technically, COLLATE "C" uses byte order to drive text comparisons.
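
Byte-order comparison can be approximated in Python by sorting on the encoded bytes. This is a rough sketch that assumes a UTF-8 database encoding; c_collation_key is a hypothetical helper, not a Postgres function.

```python
def c_collation_key(text):
    """Approximate byte-order comparison: compare the raw UTF-8 bytes."""
    return text.encode("utf-8")


words = ["zebra", "éclair", "Banana", "apple"]
byte_order = sorted(words, key=c_collation_key)
```

In byte order, uppercase ASCII letters come before lowercase ones, and the multi-byte é comes after every ASCII letter, so "éclair" ends up after "zebra". A language-aware collation would usually place it among the words starting with e.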

The first answer on https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database provides a good example of the differences between using COLLATE "C" vs. COLLATE "fr_FR" which uses the French localization.

What is collation at a low level?

You are interested in the effect of collation on JOIN and UNION?

First, a review of charset/collation... Take é as an example. The hex C3A9 is its encoding, as determined by the CHARACTER SET utf8 or utf8mb4. The COLLATION says whether é > e or é < e or é = e, and likewise for any other pair of characters. It's an algorithm, not an encoding. In English, the collation says whether or not A = a (case sensitivity).
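
The split between encoding and comparison algorithm can be shown with Python's standard library. This is a minimal sketch: strip_accents and accent_insensitive_equal are illustrative helpers and only mimic what an accent-insensitive collation does.

```python
import unicodedata

# Character set: which bytes are used to store the character.
encoded = "é".encode("utf-8")


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_equal(a, b):
    """Collation-like rule: compare after removing accents."""
    return strip_accents(a) == strip_accents(b)


binary_equal = "é" == "e"                         # compares code points
accent_equal = accent_insensitive_equal("é", "e")
```

The stored bytes are the same in both comparisons. Only the rule applied to them changes the answer.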

For JOIN...

FROM a
JOIN b  ON a.x = b.x

The Optimizer needs x in the two tables to have the same charset and collation. That way, INDEX(x) can be used to efficiently do the JOIN..ON.
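
Why the collations have to match can be sketched with a dictionary standing in for INDEX(x). This is a loose analogy in Python, not a description of MySQL internals; ci_key, binary_key, build_index and join are all hypothetical names.

```python
def ci_key(value):
    """Hypothetical case-insensitive collation key."""
    return value.casefold()


def binary_key(value):
    """Hypothetical binary collation key: the value unchanged."""
    return value


def build_index(rows, key):
    """Map each collation key to the rows that carry it."""
    index = {}
    for row in rows:
        index.setdefault(key(row), []).append(row)
    return index


def join(left_rows, right_index, probe_key):
    """Probe the index once for every left-hand row."""
    return [(left, right)
            for left in left_rows
            for right in right_index.get(probe_key(left), [])]


a_rows = ["Alice", "BOB"]
b_index = build_index(["alice", "Bob"], ci_key)  # b.x indexed case-insensitively

matched = join(a_rows, b_index, ci_key)          # same rule on both sides
mismatched = join(a_rows, b_index, binary_key)   # probe keys miss the index
```

When both sides use the same key function, the lookups hit. When they differ, the index is useless for the comparison. A real server would typically convert one side or skip the index rather than return wrong rows.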

For UNION DISTINCT ...

The column(s) that are involved in DISTINCTifying need to be compared. Comparing is efficient with an index. Without the same collation, there may be a definitional problem of what to do.

For UNION ALL ...

The result is a "table" with columns. Those columns have datatypes including collation. The simple approach is to demand all the collations be the same. More complex would be to convert on the fly.

The manual (on UNION) says (imprecisely) "For example, the first column selected by the first statement should have the same type as the first column selected by the other statements. If the data types of corresponding SELECT columns do not match, the types and lengths of the columns in the UNION result take into account the values retrieved by all the SELECT statements."

what do collation and cardinality mean in mysql index?

Collation is the way sorting takes place, based on the character set you specify. See https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html

Cardinality is basically "How many unique elements does this column contain?" A column with low cardinality has few unique values. Lookup tables generally have much lower cardinality than tables of a particular entity such as Customers.
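
As a quick illustration, cardinality is just a count of distinct values. The sample columns below are made up, and the figure MySQL reports for an index is an estimate rather than an exact count like this one.

```python
def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))


# Made-up sample data for illustration.
status_column = ["active", "inactive", "active", "active", "inactive"]
customer_id_column = [101, 102, 103, 104, 105]

status_cardinality = cardinality(status_column)
customer_id_cardinality = cardinality(customer_id_column)
```

The status column has low cardinality (two distinct values across five rows). The id column has the highest possible cardinality, because every row is unique.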

http://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29

What is the best collation to use for MySQL with PHP?

The main differences between the collations are sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin, which compares characters by their binary values.

utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the most accurate for sorting in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.

You can read more about the specific Unicode character sets in the MySQL manual - http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

which table collation takes least space and least table size?

"Character set" is the encoding of characters; it is probably the term you wanted.

"Collation" controls how characters are compared; it does not involve space.

"Numbers" involve neither "character sets" nor "collations".

The typical way to store a yes/no (true/false) value is in a TINYINT, which takes 1 byte. The "(1)" after your tinyint example provides no information and will soon be removed from the syntax.

If you have lots of true/false values, consider using the SET datatype. It can pack up to 64 values in up to 8 bytes.

"ascii" and "latin1" are single-byte characters. latin1 can handle a limited number of accented letters -- as would be found in Western European languages.

"utf8mb4" should be used for general character support. It still takes 1 byte per English letter, 1 or 2 bytes for the rest of Europe, and 3-4 bytes per Chinese character.
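
The byte counts are easy to check in Python, whose "utf-8" codec is the same encoding that MySQL calls utf8mb4. The sample characters are arbitrary choices for this sketch.

```python
samples = ["a", "é", "中", "😀"]

# Bytes needed per character in UTF-8 (utf8mb4 in MySQL terms).
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

# latin1 stores the same accented letter in a single byte.
latin1_size = len("é".encode("latin-1"))
```

An English letter takes 1 byte, the accented é takes 2, the Chinese character takes 3, and the emoji takes 4.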

For a single boolean, there is nothing smaller than tinyint. For a small number of booleans, I recommend one tinyint each -- it is a tradeoff of a small amount of space versus complexity in the code. For a large number of booleans, use SET, BINARY(n) (n is bytes, not bits), or some size of INT plus masking operations.
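
The "INT plus masking operations" approach looks roughly like this on the application side. It is a sketch with hypothetical flag names; the packed integer is what would be stored in the INT column.

```python
# Hypothetical flag names; each one occupies a single bit of the integer.
IS_ACTIVE = 1 << 0
IS_ADMIN = 1 << 1
HAS_NEWSLETTER = 1 << 2


def set_flag(flags, mask):
    """Turn the given bit on."""
    return flags | mask


def clear_flag(flags, mask):
    """Turn the given bit off."""
    return flags & ~mask


def has_flag(flags, mask):
    """Check whether the given bit is on."""
    return (flags & mask) != 0


flags = 0
flags = set_flag(flags, IS_ACTIVE)
flags = set_flag(flags, HAS_NEWSLETTER)
packed = flags                      # value that would go into the INT column
flags = clear_flag(flags, IS_ACTIVE)
```

This saves space but costs readability, which is the tradeoff mentioned above: every read and write now needs a mask instead of a plain column name.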

For an app that is limited to Western European languages, latin1 is handy and slightly more compact for strings. Beyond that, use utf8mb4.

Note each "character set" has a variety of "collations"; the default collation is usually appropriate.

A Rule of Thumb that I like to apply: If I can't see at least 10% improvement (in space or speed or whatever), move on. That is, look for something else that might give more improvement. (Andy's Comment is another way of saying my RoT.)