Regexp Greek Chars by Number

PHP and regexp to accept only Greek characters in form

I'm not too current on the Greek alphabet, but if you wanted to do this with the Roman alphabet, you would do this:

/^[a-zA-Z\s]*$/

So to do this with Greek, you replace a and z with the first and last letters of the Greek alphabet, which are α and ω. Because the pattern then contains multibyte UTF-8 characters, PHP also needs the u modifier. So the code would be:

/^[α-ωΑ-Ω\s]*$/u
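
To illustrate the range-based idea, here is a minimal Python sketch (Python is used only for illustration, and the helper name is_basic_greek is hypothetical). It assumes that only the unaccented letters α-ω and Α-Ω plus whitespace should be accepted, so an accented letter such as ά is rejected:

```python
import re

# Unaccented Greek letters (both cases) plus whitespace.
# Because of '*', the empty string is accepted as well.
BASIC_GREEK = re.compile(r"[α-ωΑ-Ω\s]*")

def is_basic_greek(text: str) -> bool:
    """Return True if text holds only unaccented Greek letters and whitespace."""
    return BASIC_GREEK.fullmatch(text) is not None

print(is_basic_greek("αβγ ΔΘΩ"))   # Greek letters and a space
print(is_basic_greek("abc"))        # Latin letters
```

fullmatch is used here instead of ^ and $ so that the whole string has to fit the class.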

Regular expression to accept only greek characters

You may match chars with \p{Greek} and you must use the /u modifier:

'~^\p{Greek}{2,3}[0-9]{3,4}$~u'

Pattern details

  • ^ - start of string
  • \p{Greek}{2,3} - 2 or 3 Greek chars
  • [0-9]{3,4} - 3 or 4 ASCII digits
  • $ - end of string.
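
The same shape can be sketched in Python. Python's built-in re module has no \p{Greek}, so this sketch approximates it, as an assumption, with the Greek and Coptic block (U+0370 to U+03FF) and the Greek Extended block (U+1F00 to U+1FFF). The helper name is_valid_code is hypothetical:

```python
import re

# Assumption: these two Unicode blocks stand in for \p{Greek}.
# They also contain a few symbols and unassigned code points.
GREEK = r"[\u0370-\u03FF\u1F00-\u1FFF]"

# 2 or 3 Greek characters followed by 3 or 4 ASCII digits.
CODE = re.compile(GREEK + r"{2,3}[0-9]{3,4}")

def is_valid_code(text: str) -> bool:
    """Return True if the whole string has the shape described above."""
    return CODE.fullmatch(text) is not None

print(is_valid_code("ΩΨ123"))    # two Greek letters, three digits
print(is_valid_code("AB123"))    # Latin letters instead of Greek
```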

Regexp Greek chars by number

Are you using the UTF-8 pattern modifier?

/\p{Greek}{4,}/u

Regular expression testing with Greek characters php

The problem with your regex, /^[\p{Greek}\s\d a-zA-Z]+/u, is that it only tells the engine where to start matching. It gives no instruction about where the match has to end, so any string that merely begins with allowed characters passes. Changing your regex to /^[\p{Greek}\s\d a-zA-Z]+$/u (notice the $ at the end) should fix the problem.

The ^ and $ combo instructs the regex engine to start matching at the beginning of the string (^) and to stop only at the end ($), so the whole string must consist of allowed characters.
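
A small Python sketch shows the difference. It assumes an explicit α-ω / Α-Ω range in place of \p{Greek}, which Python's built-in re module does not support, and the sample string is made up for illustration:

```python
import re

# Assumption: an explicit range stands in for \p{Greek}.
ALLOWED = r"[α-ωΑ-Ω\s\da-zA-Z]+"

start_only = re.compile("^" + ALLOWED)         # anchored at the start only
both_ends = re.compile("^" + ALLOWED + "$")    # anchored at both ends

text = "αβγ 123 <script>"

# The valid prefix is enough for the pattern without '$'.
print(bool(start_only.search(text)))
# With '$' the trailing "<script>" prevents a match.
print(bool(both_ends.search(text)))
```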

Python regex greek characters

Just like the Latin alphabet, the basic Greek letters occupy a contiguous range of Unicode code points, so you can use \([α-ωΑ-Ω]*\) instead of \([A-Za-z]*\) to construct your regex.

I would personally prefer to use a regex like "[A-Za-z]* \([α-ωΑ-Ω]*\)" to check if the pattern holds and use string functions to do split jobs. But I believe it depends on your personal preference.
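
A minimal sketch of that approach, using the pattern above to validate and plain string methods to split. The helper name split_pair and the sample input are hypothetical:

```python
import re

# Latin word, a space, then an unaccented Greek word in parentheses.
PAIR = re.compile(r"[A-Za-z]* \([α-ωΑ-Ω]*\)")

def split_pair(text: str):
    """Return (latin, greek) if text has the expected shape, else None."""
    if PAIR.fullmatch(text) is None:
        return None
    latin, _, rest = text.partition(" (")
    return latin, rest[:-1]  # drop the closing parenthesis

print(split_pair("omega (ωμεγα)"))
print(split_pair("omega ωμεγα"))   # no parentheses, so the check fails
```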

Regular expression - preg_match Latin and Greek characters

Ok, can this replace your function?

$subject = 'OnCEΨΩ é-+@àupon</span> aαθ tIME !#%@$ in MEXIco in the year 1874 <or 1875';

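function format($str, $excludeRE = '/[^a-z0-9]+/u', $separator = '-') {
    // Drop HTML tags first
    $str = strip_tags($str);
    $str = strtolower($str);
    // Collapse every run of excluded characters into one separator
    $str = preg_replace($excludeRE, $separator, $str);
    // Remove separators left at either end
    $str = trim($str, $separator);
    return $str;
}
echo format($subject);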

Note that you will lose all characters after a < (because of strip_tags) until the next >.


// Old answer, from when I thought you wanted to preserve Greek characters

It's possible to build a character range such as α-ω, or one over any other characters you want. The reason your pattern doesn't work is that you don't tell the regex engine that you are dealing with a Unicode string. To do that, you must add the u modifier at the end of the pattern, like this:

/[^a-z0-9α-ω]+/u

You can also write the characters as hexadecimal code points:

/[^a-z0-9\x{3B1}-\x{3C9}]+/u 
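
For comparison, here is a rough Python sketch of the same idea. It assumes that a lowercase slug is wanted, and the name slugify is hypothetical. Note that Python's str.lower() also lowercases Greek letters, which is not guaranteed for PHP's strtolower:

```python
import re

# \u03b1 and \u03c9 are the code points of α and ω,
# so both classes describe exactly the same set of characters.
LITERAL = re.compile(r"[^a-z0-9α-ω]+")
BY_CODE = re.compile(r"[^a-z0-9\u03b1-\u03c9]+")

def slugify(text: str, pattern=LITERAL, separator: str = "-") -> str:
    """Lowercase text and collapse each run of other characters into separator."""
    return pattern.sub(separator, text.lower()).strip(separator)

print(slugify("aαθ tIME !#"))
print(slugify("aαθ tIME !#", BY_CODE))
```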

Note that if your string will not contain uppercase Greek characters, or if you want to preserve them as well, you can use the Unicode script property \p{Greek}, which matches any character of the Greek script, like this:

/[^a-z0-9\p{Greek}]+/u

(It's a little longer but more explicit)

Greek characters, Regular Expressions, and C#

In .NET languages, you can use \p{IsGreekandCoptic} to match Greek characters. So the resulting regex is

[^a-zA-Z0-9-()/\s\p{IsGreekandCoptic}]

\p{IsGreekandCoptic} matches the characters of the Greek and Coptic block (U+0370 to U+03FF). A chart of the matched characters: http://img203.imageshack.us/img203/3760/greekcoptic.png

Javascript - regex to remove special characters but also keep greek characters

The way these ranges are defined is based on their character code. So, since A has char code 65, and z has char code 122, the following regex:

[A-z]

would match every letter, but also every character whose code falls between those of Z and a, namely codes 91 through 96, which are the characters [\]^_`.
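
This is easy to check with a short Python sketch (Python is used only for illustration here; the behaviour of the range is the same in JavaScript):

```python
import re

# Every character whose code lies strictly between 'Z' (90) and 'a' (97).
between = "".join(chr(code) for code in range(ord("Z") + 1, ord("a")))

print(between)                            # the non-letter characters in the gap
print(re.findall(r"[A-z]", between))      # [A-z] matches all of them
print(re.findall(r"[A-Za-z]", between))   # two explicit ranges match none
```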

Now, for Greek letters, the character codes for the uppercase characters are 913-937 for alpha through omega, and the lowercase characters are 945-969 for alpha through omega (this includes both lowercase variants of sigma, namely ς (962) and σ (963)).
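
These code points can be verified with Python's standard unicodedata module. The sketch below only prints the official character names. It also shows that code 930, which lies inside the uppercase range, is not assigned to any character:

```python
import unicodedata

# First and last basic letters of each case, plus both lowercase sigmas.
for code in (913, 937, 945, 969, 962, 963):
    print(code, chr(code), unicodedata.name(chr(code)))

# 930 sits between capital rho (929) and capital sigma (931) and has no name.
print(unicodedata.name(chr(930), None))
```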

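So, to match Latin letters, Greek letters, and Arabic numerals, you need the following character class (put a ^ right after the opening bracket to match every character except these, for example to strip special characters):

[a-zA-Z0-9α-ωΑ-Ω]

So, for Greek characters, it works just like Latin letters.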


Edit: I've tested this with a Google-translated Lipsum, and it looks like this doesn't take accented letters into account. I checked the character codes of these accented letters, and it turns out they sit right before and right after the lowercase letters. So, the following regex also covers the accented lowercase letters:

[a-zA-Z0-9ά-ωΑ-ώ]

This expanded range now also includes άέήίΰ (char codes 940 through 944) and ϊϋόύώ (codes 970 through 974).

To also include whitespace (spaces, tabs, newlines), simply include a \s in the range:

[a-zA-Z0-9ά-ωΑ-ώ\s]


Edit: Apparently there are more Greek letters that needed to be included in this range, namely those in the range [Ά-Ϋ], which is the range of letters right before the ά, so the new regex would look like this:

[a-zA-Z0-9Ά-ωΑ-ώ\s]
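
To see what each step of the expansion buys, here is a small Python sketch that runs sample words through the three classes. The helper name accepts and the sample words are assumptions chosen for illustration:

```python
import re

BASIC = re.compile(r"[a-zA-Z0-9α-ωΑ-Ω\s]+")          # unaccented letters only
LOWER_ACCENTS = re.compile(r"[a-zA-Z0-9ά-ωΑ-ώ\s]+")  # adds codes 938-944 and 970-974
FULL = re.compile(r"[a-zA-Z0-9Ά-ωΑ-ώ\s]+")           # adds codes 902-912 as well

def accepts(pattern, text: str) -> bool:
    """Return True if every character of text is in the pattern's class."""
    return pattern.fullmatch(text) is not None

# "λάθος" contains ά (940); "Άλφα" starts with Ά (902).
for word in ("λάθος", "Άλφα"):
    print(word, accepts(BASIC, word), accepts(LOWER_ACCENTS, word), accepts(FULL, word))
```

Since Ά (902) comes before Α (913) and ώ (974) comes after ω (969), the two Greek ranges in the last class overlap and together cover codes 902 through 974.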


