Regex for Names with Special Characters (Unicode)

Try the following regular expression:

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
{
// valid
}

You should read it like this:

^   # start of subject
(?: # match this:
[ # match a:
\p{L} # Unicode letter, or
\p{Mn} # Unicode accents, or
\p{Pd} # Unicode hyphens, or
\' # single quote, or
\x{2019} # single quote (alternative)
]+ # one or more times
\s # any kind of space
[ # match a:
\p{L} # Unicode letter, or
\p{Mn} # Unicode accents, or
\p{Pd} # Unicode hyphens, or
\' # single quote, or
\x{2019} # single quote (alternative)
]+ # one or more times
\s? # any kind of space (0 or 1 times)
)+ # one or more times
$ # end of subject

I honestly don't know how to port this to JavaScript (I'm not even sure JavaScript supports Unicode properties), but in PHP's PCRE it seems to work flawlessly when run on IDEOne.com:

$names = array
(
'Alix',
'André Svenson',
'H4nn3 Andersen',
'Hans',
'John Elkjærd',
'Kristoffer la Cour',
'Marco d\'Almeida',
'Martin Henriksen!',
);

foreach ($names as $name)
{
echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
}

I'm sorry I can't help with the JavaScript part, but someone here probably will.


Validates:

  • John Elkjærd
  • André Svenson
  • Marco d'Almeida
  • Kristoffer la Cour

Invalidates:

  • Hans
  • H4nn3 Andersen
  • Martin Henriksen!
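
For readers on the JVM, the same check can be sketched in Java. This is a minimal port under stated assumptions, not part of the original answer: the class name NameValidator and the method isValid are made up for illustration. Java needs doubled backslashes in the string literal and takes no delimiters or u modifier, because java.util.regex understands Unicode properties such as \p{L} by default.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a Java port of the PHP pattern above.
class NameValidator {
    // One "word" of a name: letters, combining marks, dashes, or apostrophes.
    private static final String WORD = "[\\p{L}\\p{Mn}\\p{Pd}'\\x{2019}]+";

    // Two words separated by whitespace, optionally followed by whitespace, repeated.
    private static final Pattern NAME =
        Pattern.compile("^(?:" + WORD + "\\s" + WORD + "\\s?)+$");

    static boolean isValid(String name) {
        // matches() requires the whole input to match the pattern.
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"Alix", "André Svenson", "H4nn3 Andersen", "Marco d'Almeida"};
        for (String name : names) {
            System.out.println(name + " is " + (isValid(name) ? "valid" : "invalid"));
        }
    }
}
```

As in the PHP version, a single word such as Hans is rejected because the pattern insists on at least two words.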

To strip invalid characters instead (though I'm not sure why you need this), negate the character class and replace every match with an empty string:

$name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name);

Examples:

  • H4nn3 Andersen -> Hnn Andersen
  • Martin Henriksen! -> Martin Henriksen

Note that you always need to use the u modifier, so that PCRE treats both the pattern and the subject as UTF-8.
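
The same clean-up can be sketched in Java with String.replaceAll. The class NameCleaner and the method clean are hypothetical names chosen for this example, and the sketch assumes the same set of allowed characters as the pattern above.

```java
// Hypothetical helper: strips every character the name pattern would reject.
class NameCleaner {
    // Negated class: anything that is not a letter, mark, dash, apostrophe, or whitespace.
    private static final String INVALID = "[^\\p{L}\\p{Mn}\\p{Pd}'\\x{2019}\\s]";

    static String clean(String name) {
        // Each offending character is replaced by the empty string.
        return name.replaceAll(INVALID, "");
    }

    public static void main(String[] args) {
        System.out.println(clean("H4nn3 Andersen"));
        System.out.println(clean("Martin Henriksen!"));
    }
}
```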

Validate special characters by negating Unicode letters with a regex pattern?

You can put those \\p{...} properties inside [], which means you can negate them like any other character class. This is all you need:

Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));

That prints:

_-.;,!”#€%&/()=?`¨’<>

Which is exactly what you're looking for, no?

No need to mess with lookaheads here.
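
If the goal is a yes/no validation rather than a list of offenders, the same negated class can be used with find(). This is only a sketch: LettersOnly and isLettersOnly are illustrative names, and note that an empty string counts as valid here because it contains no non-letter.

```java
import java.util.regex.Pattern;

// Hypothetical helper built on the negated \p{L} class shown above.
class LettersOnly {
    private static final Pattern NON_LETTER = Pattern.compile("[^\\p{L}]");

    // True when no character outside \p{L} occurs anywhere in the input.
    static boolean isLettersOnly(String s) {
        return !NON_LETTER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isLettersOnly("DSJÄÖÅ"));
        System.out.println(isLettersOnly("ASKJKSDJK_-.;,"));
    }
}
```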

Regex for names

  • Hyphenated Names (Worthington-Smythe)

Add a - into the second character class. The easiest way to do that is to add it at the start so that it can't possibly be interpreted as a range modifier (as in a-z).

^[A-Z][-a-zA-Z]+$

  • Names with Apostrophes (D'Angelo)

A naive way of doing this would be as above, giving:

^[A-Z][-'a-zA-Z]+$

Don't forget you may need to escape it inside the string! A 'better' way, given your example might be:

^[A-Z]'?[-a-zA-Z]+$

Which will allow a possible single apostrophe in the second position.

  • Names with Spaces (Van der Humpton). Whether capitals in the middle are required or not is way beyond my interest at this stage.

Here I'd be tempted to just do our naive way again:

^[A-Z]'?[- a-zA-Z]+$

A potentially better way might be:

^[A-Z]'?[-a-zA-Z]+( [a-zA-Z]+)*$

Which looks for extra words at the end, each separated by a single space. This probably isn't a good idea if you're trying to match names in a body of extra text, but then again, the original wouldn't have done that well either.

  • Joint Names (Ben & Jerry)

At this point you're not looking at single names anymore?

Anyway, as you can see, regexes have a habit of growing very quickly...
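
To see how each step behaves, the patterns can be tried against the example names in Java. This is a rough sketch: the class AsciiNamePatterns, its constants, and the accepts helper are invented for illustration, and the multi-word pattern is a variant that uses + quantifiers so that each word may be longer than one letter (an assumption about what the answer intends).

```java
import java.util.regex.Pattern;

// Hypothetical test bench for the ASCII-only name patterns discussed above.
class AsciiNamePatterns {
    static final Pattern HYPHENATED = Pattern.compile("^[A-Z][-a-zA-Z]+$");
    static final Pattern APOSTROPHE = Pattern.compile("^[A-Z]'?[-a-zA-Z]+$");
    // Assumed variant: + quantifiers let every word have more than one letter.
    static final Pattern MULTI_WORD = Pattern.compile("^[A-Z]'?[-a-zA-Z]+( [a-zA-Z]+)*$");

    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"Worthington-Smythe", "D'Angelo", "Van der Humpton", "Ben & Jerry"};
        for (String name : names) {
            System.out.println(name
                + " hyphenated=" + accepts(HYPHENATED, name)
                + " apostrophe=" + accepts(APOSTROPHE, name)
                + " multiWord=" + accepts(MULTI_WORD, name));
        }
    }
}
```

Each pattern accepts everything the previous one did plus one more shape of name, which is the growth the answer warns about.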

Regex pattern including all special characters

Please don't do that... little Unicode BABY ANGELs like this one are dying! ◕◡◕ (← these are not images) (nor is the arrow!)

And you are killing 20 years of DOS :-) ☺ (the last smiley is called WHITE SMILING FACE... Now it's at U+263A... But in ancient times it was ALT-1)

and his friend ☻ (BLACK SMILING FACE... Now it's at U+263B... But in ancient times it was ALT-2)

Try a negative match:

Pattern regex = Pattern.compile("[^A-Za-z0-9]");

(This matches every character that is not a "standard" letter A-Z or a-z or a "standard" digit 0-9, so only those letters and digits pass.)

Java Regex - Allow all regular Unicode characters for names but not obscure variants

The challenge is that 𝕮𝖍𝖗𝖎𝖘 (the last name in the test code below) is composed of surrogate pairs, which the regex engine interprets as code points, not chars.

The solution is to match any letter using \p{L}, but exclude code points of high surrogates on up:

"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"

Trying to exclude the surrogate chars directly:

"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work

fails, because each surrogate pair is merged into a single code point before matching.


Test code:

String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};

for (String name : names) {
System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}

Output:

尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false
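
The difference between chars and code points can be made visible with a few String methods. This is a small sketch, not part of the original answer: SurrogateDemo and hasSupplementary are made-up names, and the sample is the first letter of the fancy name from the test code.

```java
// Hypothetical demo: one visible letter, two UTF-16 chars, one code point.
class SurrogateDemo {
    // True if any code point lies above U+FFFF, i.e. needs a surrogate pair in UTF-16.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String fancyC = "\uD835\uDD6E"; // a single letter stored as a surrogate pair
        System.out.println(fancyC.length());                            // number of UTF-16 chars
        System.out.println(fancyC.codePointCount(0, fancyC.length()));  // number of code points
        System.out.println(Integer.toHexString(fancyC.codePointAt(0))); // the merged code point
        System.out.println(hasSupplementary("Linus"));
    }
}
```

Because the regex engine sees the merged code point rather than the two chars, a range written in terms of surrogate chars never gets a chance to match.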

Concrete JavaScript regular expression for accented characters (diacritics)

The easiest way to accept all accents is this:

[A-zÀ-ú] // accepts lowercase and uppercase letters, accented ones up to ú (also matches [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, but extending the accented range up to ÿ (still matches [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but no longer matching [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but also no longer matching × ÷

See Unicode Character Table for characters listed in numeric order.
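
The effect of tightening the ranges can be checked with a short sketch. It is written in Java rather than JavaScript, on the assumption that plain character-class ranges behave the same way in both engines. AccentRanges, LOOSE, STRICT, and accepts are illustrative names only.

```java
// Hypothetical comparison of the loosest and strictest classes listed above.
class AccentRanges {
    // \u00C0-\u00FF is the range À-ÿ; \u00D7 is × and \u00F7 is ÷.
    static final String LOOSE = "[A-z\u00C0-\u00FF]";
    static final String STRICT = "[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]";

    static boolean accepts(String charClass, String s) {
        // Require one or more characters, all taken from the given class.
        return s.matches(charClass + "+");
    }

    public static void main(String[] args) {
        String[] samples = {"André", "_", "\u00D7"};
        for (String sample : samples) {
            System.out.println(sample + " loose=" + accepts(LOOSE, sample)
                + " strict=" + accepts(STRICT, sample));
        }
    }
}
```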

Java Regex names (with spaces and special characters)

Well, if there are no other rules for names, this piece of code should work:

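import java.util.regex.Matcher;
import java.util.regex.Pattern;

String name = "namé";
// Allowed: ASCII letters, apostrophe, è, é, whitespace, and hyphen.
Pattern pattern = Pattern.compile("[A-Za-z'èé\\s\\-]*");
Matcher matcher = pattern.matcher(name);
System.out.println(matcher.matches()); // true: every character is in the allowed set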

