Tilde Operator in Regular Expressions

In this case, it's just being used as a delimiter.

Generally, in PHP, the first and last characters of a regular expression are "delimiters" that mark where the pattern starts and ends, so that modifiers (such as ungreedy) can follow the closing delimiter.

PHP takes the first character of the pattern string as the delimiter and treats the next unescaped occurrence of that character as the closing delimiter. Picking a different delimiter is useful when the usual one occurs in the text you want to match (for example, occurrences of / in a path), because you then don't have to escape every one of them.

Matching for "//" with the delimiter set to "/"

/\/\//

Matching for "//" with the delimiter of "#"

#//#
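
To make the delimiter rule concrete, here is a rough Python sketch of how a delimited pattern string could be taken apart. It is only an illustration, not PHP's actual parser: the helper name split_delimited is hypothetical, and it assumes the opening and closing delimiter are the same character (bracket-style delimiter pairs are not handled).

```python
def split_delimited(pattern):
    """Split a delimited pattern into (delimiter, body, modifiers).

    Hypothetical helper for illustration only; not PHP's real parser.
    """
    if len(pattern) < 2:
        raise ValueError("pattern is too short to carry delimiters")
    delimiter = pattern[0]
    if delimiter.isalnum() or delimiter.isspace() or delimiter == "\\":
        raise ValueError("delimiter must not be alphanumeric, whitespace, or a backslash")
    i = 1
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # skip the escaped character, e.g. \/
        elif pattern[i] == delimiter:
            # Everything after the closing delimiter is treated as modifiers
            return delimiter, pattern[1:i], pattern[i + 1:]
        else:
            i += 1
    raise ValueError("no closing delimiter found")


print(split_delimited(r"/\/\//"))      # ('/', '\\/\\/', '')
print(split_delimited("#//#"))         # ('#', '//', '')
print(split_delimited("~^foo/bar~i"))  # ('~', '^foo/bar', 'i')
```

With ~ as the delimiter, the / inside the body needs no escaping, which is the whole point of choosing it.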

Match a pattern having tilde in Python using regex

You are accessing the matches incorrectly. You want to access the first capture group:

import re

text = "~~o3i320-4fjkhe~~"
pattern = r'\~\~(.*?)\~\~'
m = re.search(pattern, text)

print(m.group(1))
print(m.group(1)[2:-2])

o3i320-4fjkhe
i320-4fjk

Update: If you really wanted to work with the full match, we could try using lookarounds in the pattern instead:

text = "~~o3i320-4fjkhe~~"
pattern = r'(?<=~~)(.*)(?=~~)'
m = re.search(pattern, text)
print(m.group())

o3i320-4fjkhe
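
Note that the first pattern uses the lazy .*? while the lookaround version uses the greedy .* quantifier. With a single ~~...~~ pair both give the same result, but they differ once the text holds more than one pair. A small sketch with a made-up input string:

```python
import re

# Hypothetical input with two tilde-delimited tokens
text = "~~first~~ and ~~second~~"

# Lazy: stops at the nearest closing ~~, so each token is captured separately
lazy = re.findall(r"~~(.*?)~~", text)

# Greedy: runs to the last closing ~~, swallowing everything in between
greedy = re.findall(r"~~(.*)~~", text)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first~~ and ~~second']
```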

The equal tilde operator is not working in bash 4

Quoting the section on regular expressions from Greg's Wiki:

Before 3.2 it was safe to wrap your regex pattern in quotes but this has changed in 3.2. Since then, regex should always be unquoted.

This is the most compatible way of using =~:

e='tar xfz'
re='^tar'
[[ $e =~ $re ]] && echo 0 || echo 1

This should work on both versions of bash.

How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.

Here's how to do it, but as you can see, it's not very readable.

^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$

... and it still probably doesn't do exactly what you want.
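
If the filtering can be done in code rather than in a single POSIX ERE, inverting an ordinary match is much simpler. The sketch below is in Python and rests on an assumption about the goal: skip any file name containing a tilde, thumbs.db, or .part (the three things the pattern above tries to exclude). The names EXCLUDE and keep_names are hypothetical.

```python
import re

# Assumed exclusion rule: names containing "~", "thumbs.db", or ".part"
EXCLUDE = re.compile(r"~|thumbs\.db|\.part", re.IGNORECASE)


def keep_names(names):
    """Return the names that do NOT match any excluded fragment."""
    return [name for name in names if not EXCLUDE.search(name)]


candidates = ["report.docx", "~$report.docx", "Thumbs.db", "movie.avi.part"]
print(keep_names(candidates))  # ['report.docx']
```

Matching what you don't want and negating the result in the caller avoids encoding the negation into the expression itself.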

Groovy Regex: What does a tilde in a character class do?

The tilde doesn't have any special meaning in Groovy or Java regular expressions. Groovy doesn't change the Java interpretation of regexes at all. All the special characters are listed on the API reference page for java.util.regex.Pattern.

If you remove the \p{Alnum} character class and the escaped tilde, you can more easily see that ~ isn't being treated specially:

assert ("D" ==~ "(?:[^äöü~D~V~_])") == false
assert ("V" ==~ "(?:[^äöü~D~V~_])") == false
assert ("~" ==~ "(?:[^äöü~D~V~_])") == false
assert (" " ==~ "(?:[^äöü~D~V~_])") == true

I'd throw away these regexes. They're clearly wrong and obfuscated with extra characters. Word boundaries can be matched with \b, and the \p{Alnum}äöü should almost certainly be \p{Alphabetic}\p{Digit} to handle Unicode properly.
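
The same check can be reproduced outside Groovy. Python's re module is a different engine, but it also treats ~ as an ordinary character inside a character class, so this sketch mirrors the four assertions above (re.fullmatch stands in for Groovy's ==~, which requires the whole string to match):

```python
import re

# Same negated class as in the Groovy assertions: anything except ä ö ü ~ D V _
pattern = re.compile(r"(?:[^äöü~D~V~_])")

results = {ch: pattern.fullmatch(ch) is not None for ch in ["D", "V", "~", " "]}
print(results)  # {'D': False, 'V': False, '~': False, ' ': True}
```

Repeating ~ inside the brackets changes nothing; it is simply listed three times.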

nginx location tilde

The tilde (~) is a modifier that tells Nginx the location block uses a regular expression to match the request URI.

"~" = REGEX match, case-sensitive

"~*" = REGEX match, case-insensitive

Nginx Docs
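
The difference between the two modifiers can be sketched in Python. This is only an analogy: Nginx uses PCRE and has its own rules for choosing among location blocks, none of which is modelled here. The function name location_matches is hypothetical.

```python
import re


def location_matches(modifier, pattern, uri):
    """Mimic only the case-sensitivity difference between ~ and ~*."""
    if modifier == "~":
        flags = 0              # case-sensitive
    elif modifier == "~*":
        flags = re.IGNORECASE  # case-insensitive
    else:
        raise ValueError("only the regex modifiers ~ and ~* are sketched here")
    return re.search(pattern, uri, flags) is not None


print(location_matches("~", r"\.php$", "/index.PHP"))   # False
print(location_matches("~*", r"\.php$", "/index.PHP"))  # True
```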

Meaning of =~ operator in shell script

It's the equal-tilde operator, which allows regular expression matching inside a [[ ]] conditional expression, for example in an if statement.

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string.

http://linux.die.net/man/1/bash
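
The last sentence of that excerpt, about quoting forcing a literal match, is the part that most often surprises people. As a loose analogy in Python (not bash, and the function name matches is hypothetical), re.escape plays the role that quoting plays on the right-hand side of =~:

```python
import re


def matches(subject, pattern, literal=False):
    """Search subject for pattern, optionally treating pattern as plain text."""
    if literal:
        pattern = re.escape(pattern)  # analogous to quoting the pattern in bash
    return re.search(pattern, subject) is not None


print(matches("abc", "a.c"))                # True: "." matches any character
print(matches("abc", "a.c", literal=True))  # False: a literal dot is required
print(matches("a.c", "a.c", literal=True))  # True
```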

ANTLR4 grammar regex and tilde

In a lexer rule, the characters inside square brackets define a character set. So ["] is the set with the single character ". Being a set, every character is either in the set or not, so defining a character twice, as in [""] makes no difference, it's the same as ["].

~ negates the set, so ~["] means any character except ".
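
For readers more used to conventional regex syntax, ANTLR's ~["] corresponds to the negated class [^"]. The Python sketch below shows both points made above: duplicates in a set change nothing, and the negated set matches any character except the quote. The string-literal pattern is an illustrative assumption, not taken from the grammar in question.

```python
import re

# A set ignores duplicates, just like [""] versus ["]
print(set('""') == set('"'))  # True

# "[^"]*" : a quote, any run of non-quote characters, a closing quote
STRING_LITERAL = re.compile(r'"[^"]*"')
found = STRING_LITERAL.findall('say "hello" and "bye"')
print(found)  # ['"hello"', '"bye"']
```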

How do you match accented and tilde characters in a perl regular expression (regexp)?

use Unicode::Normalize;
($gutted = NFD($string)) =~ s/\pM//g;

However, this is almost always the wrong(est) thing to do. What are you going to do about

Ævar Arnfjörð
ǅenan ǈubović
King Henry Ⅷ
Carlos Ⅴº, el Emperador

Just embrace Unicode. The correct way to match things with or without diacritics is to instantiate a Unicode::Collate object with the level (comparison strength) set to ignore diacritics. Then just call the cmp or eq methods.
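
To see why stripping marks falls short for the names listed above, here is a Python sketch of the same decompose-and-strip idea (NFD, then drop every character whose general category is a Mark). The helper name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop all combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


for name in ["Ævar Arnfjörð", "ǅenan ǈubović", "King Henry Ⅷ"]:
    print(strip_marks(name))
```

The accents on ö and ć disappear, but Æ, ð, the digraph letters ǅ and ǈ, and the Roman numeral Ⅷ come through unchanged, because they carry no combining marks under canonical decomposition. A search for plain ASCII text would still miss them.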

EDIT

This is how you should go about these things. Witness:

«La Alberguería de Argañán»    sí tiene /AN/ en  un par de sitios «añ» y «án»
                               sí tiene /AL/ en     un solo sitio «Al»
«Bóveda del Río Almar»         sí tiene /AL/ en     un solo sitio «Al»
«Cabezón de Liébana»           sí tiene /AN/ en     un solo sitio «an»
                               sí tiene /ON/ en     un solo sitio «ón»
«Doña Mencía»                  sí tiene /EN/ en     un solo sitio «en»
                               sí tiene /ON/ en     un solo sitio «oñ»
«Gallegos de Argañán»          sí tiene /AN/ en  un par de sitios «añ» y «án»
                               sí tiene /AL/ en     un solo sitio «al»
«Griñón»                       sí tiene /IN/ en     un solo sitio «iñ»
                               sí tiene /ON/ en     un solo sitio «ón»
«Logroño»                      sí tiene /ON/ en     un solo sitio «oñ»
«Lliçà d’Amunt»                sí tiene /UN/ en     un solo sitio «un»
«Madroñal»                     sí tiene /ON/ en     un solo sitio «oñ»
                               sí tiene /AL/ en     un solo sitio «al»
«Mantilla»                     sí tiene /AN/ en     un solo sitio «an»
«Mañón»                        sí tiene /AN/ en     un solo sitio «añ»
                               sí tiene /ON/ en     un solo sitio «ón»
«Matilla de los Caños del Río» sí tiene /AN/ en     un solo sitio «añ»
«Montalbán de Córdoba»         sí tiene /AN/ en     un solo sitio «án»
                               sí tiene /ON/ en     un solo sitio «on»
                               sí tiene /AL/ en     un solo sitio «al»
«La Peña»                      sí tiene /EN/ en     un solo sitio «eñ»
«Piñuécar–Gandullas»           sí tiene /AN/ en     un solo sitio «an»
                               sí tiene /IN/ en     un solo sitio «iñ»
«A Pobra do Caramiñal»         sí tiene /IN/ en     un solo sitio «iñ»
                               sí tiene /AL/ en     un solo sitio «al»
«Prats de Lluçanès»            sí tiene /AN/ en     un solo sitio «an»
«Ribamontán al Monte»          sí tiene /AN/ en     un solo sitio «án»
                               sí tiene /ON/ en  un par de sitios «on» y «on»
                               sí tiene /AL/ en     un solo sitio «al»
«La Roca del Vallès»           sí tiene /AL/ en     un solo sitio «al»
«San Martín del Castañar»      sí tiene /AN/ en  un par de sitios «an» y «añ»
                               sí tiene /IN/ en     un solo sitio «ín»
«Santa Eulàlia de Ronçana»     sí tiene /AN/ en  un par de sitios «an» y «an»
                               sí tiene /ON/ en     un solo sitio «on»
                               sí tiene /AL/ en     un solo sitio «àl»
«Santa María de Cayón»         sí tiene /AN/ en     un solo sitio «an»
                               sí tiene /ON/ en     un solo sitio «ón»
«Valverde de Alcalá»           sí tiene /AL/ en          3 sitios «al», «Al» y «al»
«Villar de Argañán»            sí tiene /AN/ en  un par de sitios «añ» y «án»

And here is the code that generates that.

#!/usr/bin/env perl
#
# búsqueda-libre:
#
#    How one ought to sort and search words in Unicode
#    that may (or may not) carry diacritical marks, without
#    those marks affecting the search.  Also how to change
#    the ordering so that articles at the start of names
#    are ignored, as is done with book titles &c.
#
# Tom Christiansen <tchrist@perl.com>
# Fri Mar  4 21:06:35 MST 2011
#
#############################################

use utf8;
use 5.10.1;
use strict;
use warnings; # FATAL => "all";
use autodie;
use charnames qw< :full >;

use List::Util qw< max first >;
use Unicode::Collate;

my $INCLUÍR_NINGUNOS               = 0;
my $SI_IMPORTAN_MARCAS_DIACRÍTICAS = 0;

sub sí_ó_no(_) { $_[0] ? "sí" : "no" }

sub encomillar(_) {
    return join $_[0] =>
        "\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}",
        "\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}",
    ;
}

binmode(STDOUT, ":utf8");
# This one is too long for the screen. :(
#
#    La Ciudad de Nuestra Señora la Reina de Los Ángeles de Porciúncula, California Alta
#

my @ciudades_españolas = ordenar_a_la_española(<<'LA_ÚLTIMA' =~ /\S.*\S/g);
        Santa Eulàlia de Ronçana
        Mañón
        A Pobra do Caramiñal
        La Alberguería de Argañán
        Logroño
        La Puebla del Río
        Villar de Argañán
        Piñuécar–Gandullas
        Mantilla
        Gallegos de Argañán
        Madroñal
        Griñón
        Lliçà d’Amunt
        Valverde de Alcalá
        Montalbán de Córdoba
        San Martín del Castañar
        La Peña
        Cabezón de Liébana
        Doña Mencía
        Santa María de Cayón
        Bóveda del Río Almar
        La Roca del Vallès
        Matilla de los Caños del Río
        Prats de Lluçanès
        Ribamontán al Monte
LA_ÚLTIMA

my $cmáx = -(2 + max map { length } @ciudades_españolas);

my @búsquedas = < {A,E,I,O,U}N AL >;
my $bmáx = -(2 + max map { length } @búsquedas);

my $ordenador = new Unicode::Collate::
                    level           => $SI_IMPORTAN_MARCAS_DIACRÍTICAS ? 2 : 1,
                 ## variable        => "non-ignorable",  # blanked, non-ignorable, shifted, shift-trimmed
                    normalization   => undef,
                ;

for my $aldea (@ciudades_españolas) {
    my $déjà_imprimée;
    for my $búsqueda (@búsquedas) {
        my @resultados = $ordenador->gmatch($aldea, $búsqueda);
        next unless @resultados || $INCLUÍR_NINGUNOS;
        printf qq(%*s %s tiene %*s en %17s %s\n),
                $cmáx => !$déjà_imprimée++ && encomillar($aldea),
                sí_ó_no(@resultados),
                $bmáx => "/$búsqueda/",
                cuántos_sitios(@resultados),
                enfilar(@resultados);
    }
}

sub cuántos_sitios {
    my @lista = @_;
    my $cantidad = @_;
    given ($cantidad) {
        when (0)  { return    "ningún sitio"    }
        when (1)  { return   "un solo sitio"    }
        when (2)  { return "un par de sitios"   }
        default   { return "$cantidad sitios"   }
    }
}

sub enfilar {
    my @lista = map { encomillar } @_;

    my $separador  = "\N{COMMA}";
       $separador  = "\N{SEMICOLON}"   if first { /$separador/ } @lista;
       $separador .= "\N{SPACE}";

    given (scalar @lista) {
        when (0)  { return ""                       }
        when (1)  { return "@lista"                 }
        when (2)  { return join " y " => @lista     }
        default   { return
            join($separador  => @lista[ 0 .. ($#lista-1) ])
                     . " y $lista[$#lista]";
        }
    }
}

###################################################
# Sorts the elements of the list
# in the traditional Castilian style.
#
# We allow for city names that are not only
# Castilian but also Catalan and Galician, and
# perhaps others, such as Asturian or Aranese,
# though I have not thought much about those.
###################################################

sub ordenar_a_la_española {
    my @lista = @_;

    state $ordenador_a_la_española = new Unicode::Collate::

        # If Unicode::Collate::Locale with "es__traditional" were available,
        # this first mess with its special entry would not be needed,
        # except for the c-cedilla, which is sorted here
        # as if it were Catalan, not Castilian.

        # We put the new entries after these ones,
        # which are copied from DUCET v6.0.0.  I had to change some
        # values that this code carried over from an earlier version.
        #
        # 0043  ; [.123D.0020.0008.0043] # LATIN CAPITAL LETTER C
        # 00C7  ; [.123D.0020.0008.0043][.0000.0056.0002.0327] # LATIN CAPITAL LETTER C WITH CEDILLA; QQCM
        # 004C  ; [.1330.0020.0008.004C] # LATIN CAPITAL LETTER L
        # 004E  ; [.136D.0020.0008.004E] # LATIN CAPITAL LETTER N
        # 00D1  ; [.136D.0020.0008.004E][.0000.004E.0002.0303] # LATIN CAPITAL LETTER N WITH TILDE; QQCM

        entry => <<'SALIDA',   # :)

               00E7      ; [.123E.0020.0002.0327] # c-cedilla
               0063 0327 ; [.123E.0020.0002.0327] # c-cedilla
               00C7      ; [.123E.0020.0002.0327] # C-cedilla
               0043 0327 ; [.123E.0020.0002.0327] # C-cedilla

               0063 0068 ; [.123F.0020.0002.0043] # ch
               0043 0068 ; [.123F.0020.0007.0043] # Ch
               0043 0048 ; [.123F.0020.0008.0043] # CH

               006C 006C ; [.1331.0020.0002.004C] # ll
               004C 006C ; [.1331.0020.0007.004C] # Ll
               004C 004C ; [.1331.0020.0008.004C] # LL

               00F1      ; [.136E.0020.0002.0303] # n-tilde
               006E 0303 ; [.136E.0020.0002.0303] # n-tilde
               00D1      ; [.136E.0020.0008.0303] # N-tilde
               004E 0303 ; [.136E.0020.0008.0303] # N-tilde

SALIDA

       upper_before_lower => 1,

       normalization => "NFKD",  # And why not?

       preprocess => sub {
           local $_ = shift;   # was "my $_", which Perl 5.24 removed

       ###
       # strips definite and indefinite articles
       ###

           s/^L\p{QMARK}//;    # can occur in Catalan

           s{ ^

             (?:         # Castilian
                 El
               | Los
               | La
               | Las
                         # Catalan
               | Els
               | Les
               | Sa
               | Es
                         # Galician
               | O
               | Os
               | A
               | As
             )

             \h +

          }{}x;

        # Then remove the unimportant interior words.

           s/\b[dl]\p{QMARK}//g;   # Catalan

           s{
               \b
               (?:
                   el  | los | la | las | de  | del | y          # ES
                 | els | les | i  | sa  | es  | dels             # CA
                 | o   | os  | a  | as  | do  | da | dos | das   # GAL
               )
               \b
           }{}gx;

          return $_;

       },   # end of preprocessing routine

  ## Take care not to delete this mark!
  ##     This semicolon marks the end
  ##     of the constructor arguments
  ##     begun many lines above.
  ##   ˅
       ;  # ←←← Yes, that one: leave it alone or you will be very sorry.
  ##   ˄

    return $ordenador_a_la_española->sort(@lista);
}
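
For readers who don't use Perl, the search step can be approximated in Python. This is a crude sketch, not the Unicode Collation Algorithm that Unicode::Collate implements: it compares characters by a simplified "primary key" (NFD, combining marks dropped, case folded), and the names primary_key and find_ignoring_marks are hypothetical.

```python
import unicodedata


def primary_key(ch):
    """Rough stand-in for a primary collation weight: base letter, case folded."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))
    return base.casefold()


def find_ignoring_marks(text, needle):
    """Return the original substrings of text that match needle, ignoring marks and case."""
    text = unicodedata.normalize("NFC", text)
    keys = [primary_key(ch) for ch in text]
    target = "".join(primary_key(ch) for ch in unicodedata.normalize("NFC", needle))
    matches, i = [], 0
    while i < len(text):
        acc, j = "", i
        # Accumulate keys from position i until we have enough to compare
        while j < len(text) and len(acc) < len(target):
            acc += keys[j]
            j += 1
        if target and keys[i] and acc == target:
            matches.append(text[i:j])  # keep the accented original
            i = j                      # continue after this match
        else:
            i += 1
    return matches


print(find_ignoring_marks("La Alberguería de Argañán", "AN"))  # ['añ', 'án']
print(find_ignoring_marks("Valverde de Alcalá", "AL"))         # ['al', 'Al', 'al']
```

The two printed results agree with the corresponding rows of the output table above, but a real collator also handles contractions, expansions, and locale tailoring that this sketch ignores.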
