Strtolower() for Unicode/Multibyte Strings

strtolower() for unicode/multibyte strings

Have you tried using mb_strtolower()?

Accents and UTF-8 in PHP using strtolower

Regarding the comment and edit about strtolower() (not shown in the original post), the manual states:

Note that 'alphabetic' is determined by the current locale. This means that e.g. in the default "C" locale, characters such as umlaut-A (Ä) will not be converted.

The manual page for mb_strtolower(), on the other hand, says:

By contrast to strtolower(), 'alphabetic' is determined by the Unicode character properties. Thus the behaviour of this function is not affected by locale settings and it can convert any characters that have 'alphabetic' property, such as A-umlaut (Ä).
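
As a quick illustration of the difference the two manual excerpts describe, the following sketch lowercases the same UTF-8 string with both functions. The sample string and variable names are made up for illustration, and it assumes the mbstring extension is loaded and the file is saved as UTF-8. What strtolower() does to the non-ASCII letters depends on the PHP version and locale, so only the mb_strtolower() result is stated.

```php
<?php
// Sample input (hypothetical): uppercase letters with umlauts, UTF-8 encoded.
$input = 'ÄPFEL ÜBER ÖL';

// Byte-oriented: only guaranteed to handle ASCII letters.
$byteLower = strtolower($input);

// Unicode-aware: case mapping comes from Unicode character properties.
$mbLower = mb_strtolower($input, 'UTF-8');

var_dump($mbLower === 'äpfel über öl'); // bool(true)
var_dump($byteLower === $mbLower);       // depends on PHP version and locale
```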

ucfirst() function for multibyte character encodings

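Before PHP 8.4 there is no built-in mb_ucfirst() function, as you've already noticed. You can emulate it with two mb_substr() calls:

if (!function_exists('mb_ucfirst')) {
    // Fallback for PHP versions without a built-in mb_ucfirst().
    function mb_ucfirst($string, $encoding)
    {
        // First character, to be uppercased with the multibyte-aware function.
        $firstChar = mb_substr($string, 0, 1, $encoding);
        // Everything after the first character, left unchanged.
        $then = mb_substr($string, 1, null, $encoding);
        return mb_strtoupper($firstChar, $encoding) . $then;
    }
}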

Does InnoDB store multibyte strings in expanded form in indexes?

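Every character in a utf8 string is stored in a variable-length encoding. Each character takes 1, 2, 3, or 4 bytes depending on its code point (MySQL's utf8mb3 character set, historically aliased as utf8, stops at 3 bytes; utf8mb4 covers the full 4-byte range). A single string can mix characters of different lengths, because the leading bits of the first byte of each character indicate how many bytes that character occupies.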

The characters that are in the ASCII subset will only use 1 byte.
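
To make the 1-to-4-byte rule concrete, here is a small sketch that compares byte length and character length for one sample character per length class. It illustrates UTF-8 itself from PHP, not InnoDB's on-disk format; the sample characters and the $byteLengths variable are chosen purely for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// One sample character per UTF-8 length class (chosen for illustration):
// "a" (U+0061), "é" (U+00E9), "€" (U+20AC), "😀" (U+1F600).
$samples = ['a', "\u{00E9}", "\u{20AC}", "\u{1F600}"];

$byteLengths = [];
foreach ($samples as $char) {
    // strlen() counts bytes; mb_strlen() counts characters.
    $byteLengths[$char] = strlen($char);
    printf(
        "%s: %d byte(s), %d character(s)\n",
        $char,
        strlen($char),
        mb_strlen($char, 'UTF-8')
    );
}
// The byte counts are 1, 2, 3 and 4; each character count is 1.
```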

Combine preg_replace and strtolower

It's better to split this into two lines and inspect the output of each step with var_dump() to see what's going on:

<?php

/* string with special chars */
$string = 'abczABCZ-#+´!"§123';

$no_special_chars = preg_replace("/[^a-zA-Z]/", "", $string); 

var_dump($no_special_chars);    // string 'abczABCZ' (length=8)   

$lowercased = strtolower($no_special_chars);

var_dump($lowercased);          // string 'abczabcz' (length=8)

As you may have noticed, you don't have to handle A-Z in preg_replace() if you lowercase the string first:

$res = preg_replace("/[^a-z]/", "", strtolower($string));

var_dump($res); // string 'abczabcz' (length=8)
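
The pattern above only keeps ASCII letters, so accented characters are stripped along with the punctuation. If the input may contain multibyte letters, one possible variant (a sketch, not part of the original answer) is to lowercase with mb_strtolower() and match on the Unicode letter property instead. The sample string is hypothetical, and the code assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical sample input containing non-ASCII letters.
$string = 'Ärger-123 Straße!';

// Lowercase with the multibyte-aware function first.
$lowercased = mb_strtolower($string, 'UTF-8');

// \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
// both the pattern and the subject as UTF-8.
$lettersOnly = preg_replace('/[^\p{L}]/u', '', $lowercased);

var_dump($lettersOnly === 'ärgerstraße'); // bool(true)
```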

Detecting and removing multibyte strings in R

This is probably an encoding issue, so try changing the encoding when you load the file. Try something like this:

df <- read.csv(file_path,
               encoding = "iso-8859-1",  # try a different encoding if this one does not match the file
               header = TRUE,
               stringsAsFactors = FALSE)

ucfirst() not working properly with Scandinavian characters

Your problem here is not ucfirst(), it's strtolower(). You have to use mb_strtolower() to get your string in lower case, e.g.:

echo ucfirst(mb_strtolower($str));
           //^^^^^^^^^^^^^^ See here

You can also find a multibyte version of ucfirst() in the comments of the manual:

Simple multi-bytes ucfirst():

<?php

   function my_mb_ucfirst($str) {
       $fc = mb_strtoupper(mb_substr($str, 0, 1));
       return $fc.mb_substr($str, 1);
   }

Code by plemieux, from the manual comments.
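
Putting the two pieces of this answer together, the following sketch lowercases the whole string and then uppercases only its first character, using multibyte-aware functions throughout. The helper name example_mb_capitalize and the sample names are hypothetical; UTF-8 input and the mbstring extension are assumed.

```php
<?php
// Hypothetical helper name, chosen to avoid clashing with any built-in function.
function example_mb_capitalize(string $str, string $encoding = 'UTF-8'): string
{
    // Lowercase the whole string with the multibyte-aware function.
    $lower = mb_strtolower($str, $encoding);

    // Uppercase only the first character, then append the rest unchanged.
    $first = mb_strtoupper(mb_substr($lower, 0, 1, $encoding), $encoding);
    $rest  = mb_substr($lower, 1, null, $encoding);

    return $first . $rest;
}

echo example_mb_capitalize('ÅSE ØSTERGÅRD'), "\n"; // Åse østergård
echo example_mb_capitalize('ærlig'), "\n";         // Ærlig
```

Passing the encoding explicitly keeps the result independent of the mbstring internal encoding setting.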
