Regex to Match Words With Hyphens And/Or Apostrophes

Use this pattern:

(?=\S*['-])([a-zA-Z'-]+)

(?=             # Look-ahead
  \S            # <not a whitespace character>
  *             # (zero or more)(greedy)
  ['-]          # Character in ['-] character class
)               # End of look-ahead
(               # Capturing group (1)
  [a-zA-Z'-]    # Character in [a-zA-Z'-] character class
  +             # (one or more)(greedy)
)               # End of capturing group (1)
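
As a quick check of the pattern above, here is a minimal Python sketch. The sample sentence and the names PATTERN and sample are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Regex from the answer: the look-ahead requires an apostrophe or hyphen
# somewhere in the rest of the current non-whitespace run.
PATTERN = re.compile(r"(?=\S*['-])([a-zA-Z'-]+)")

# Hypothetical sample text, not from the original question.
sample = "a well-known fact: it's mother-in-law time"

# findall returns the contents of capturing group 1 for every match.
matches = PATTERN.findall(sample)
print(matches)  # ['well-known', "it's", 'mother-in-law']
```

One thing to watch: the look-ahead inspects the whole non-whitespace run, not just the letters that end up captured. For a token like x.y-z the pattern returns both x and y-z, so it may need tightening if such tokens occur in your input.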

Match words with hyphens and apostrophes

Your pattern \w+(?:'|\-\w+)? can only start a match at a word character (\w), so "words" that start with ' are not matched, contrary to the requirements.

In general, you can match words with or without hyphens using:

\w+(?:-\w+)*

In this case, you can put \w and ' into a character class and use:

'?\w[\w']*(?:-\w+)*'?

If a "word" can only have 1 hyphen, replace the * after the (?:-\w+) group with the ? quantifier.

Breakdown:

  • '? - an optional apostrophe
  • \w - a word character
  • [\w']* - 0 or more word characters or apostrophes
  • (?:-\w+)* - 0 or more sequences of:
    • - - a hyphen
    • \w+ - 1 or more word characters
  • '? - an optional apostrophe
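
To see the breakdown in action, here is a small Python sketch. The sample sentence and the names WORD, WORD_ONE_HYPHEN and sample are assumptions for illustration. Note that Python's \w is Unicode-aware by default and also matches digits and underscores.

```python
import re

# Pattern from the answer: optional leading/trailing apostrophe,
# inner apostrophes allowed, any number of hyphenated parts.
WORD = re.compile(r"'?\w[\w']*(?:-\w+)*'?")

# Variant with at most one hyphen per "word" (* replaced with ?).
WORD_ONE_HYPHEN = re.compile(r"'?\w[\w']*(?:-\w+)?'?")

# Hypothetical sample text.
sample = "'tis a well-known fact that dogs' toys aren't state-of-the-art"

print(WORD.findall(sample))
# ["'tis", 'a', 'well-known', 'fact', 'that', "dogs'", 'toys', "aren't", 'state-of-the-art']

print(WORD_ONE_HYPHEN.findall("state-of-the-art"))
# ['state-of', 'the-art']
```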

How to match a pattern with a hyphen or apostrophe

Your regex ^[a-zA-Z]['][-]$ matches exactly one letter followed by ' and then -, that is, a string like a'-.

You need to add quantifiers and an optional group (* will allow 0 or more occurrences), e.g.

^[a-zA-Z]+(?:['-][a-zA-Z]+)*$
          ^^^^^^^^^^^^^^^^^^^

The pattern is anchored with ^ and $, so it must match the whole string. It matches 1 or more letters ([a-zA-Z]+) followed by 0 or more occurrences of a ' or - (the ['-] class), each of which must in turn be followed by 1 or more letters.
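
A minimal Python sketch of using this pattern for validation follows. The helper name is_valid and the test strings are made up for illustration. fullmatch is used here so that a trailing newline, which $ alone would tolerate in Python, is rejected.

```python
import re

# Pattern from the answer: letters, with single ' or - separators
# allowed only between runs of letters.
NAME = re.compile(r"^[a-zA-Z]+(?:['-][a-zA-Z]+)*$")

def is_valid(text: str) -> bool:
    """Return True if the whole string fits the pattern (hypothetical helper)."""
    return NAME.fullmatch(text) is not None

for candidate in ["O'Neil", "Jean-Luc", "a'-", "-abc", "abc-", "a--b"]:
    print(candidate, is_valid(candidate))
```

The first two candidates should pass. The rest should fail, because a separator must always sit between two runs of letters.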

Regex to allow only alphabetical characters, hyphens, apostrophes and period

Remove the * quantifier so that the string must start with a letter, and require a letter at the end as well:

^[a-zA-Z](?:[ '.\-a-zA-Z]*[a-zA-Z])?$
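
Here is a small Python sketch of the pattern above, assuming it is meant to validate a whole input string. The helper name is_valid_name and the sample inputs are hypothetical. Note that the character class also allows a space in the middle of the string.

```python
import re

# Pattern from the answer: first and last characters must be letters;
# spaces, apostrophes, periods and hyphens may appear in between.
NAME = re.compile(r"^[a-zA-Z](?:[ '.\-a-zA-Z]*[a-zA-Z])?$")

def is_valid_name(text: str) -> bool:
    """Hypothetical helper: True if the whole string fits the pattern."""
    return NAME.fullmatch(text) is not None

for candidate in ["A", "St. John-Smith", "O'Neil Jr", "J.", " Ann", "Ann ", "a1"]:
    print(repr(candidate), is_valid_name(candidate))
```

The optional group is what lets a single letter such as A pass while still rejecting strings that end in punctuation or a space.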

Regex match hyphenated word with hyphen-less query

My solution to scenarios like this is always to introduce content- and query-processing.

Content processing is easier when you use the push model via the SDK, but you can achieve the same result by creating a shadow copy of your table in which the content is manipulated for indexing purposes. The original table stays intact, and you maintain a duplicate table where the text is processed.

Query processing is something you should use regardless. In its simplest form, you clean the end user's input before using it in a query. Additional steps can handle special characters such as a hyphen: escape it, strip it, or treat it some other way, depending on your requirements.

EXAMPLE

I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.

The requirement is that ABC-123-DE_F-4.56G should match any of

  • ABC-123-DE_F-4.56G
  • ABC123-DE_F-4.56G
  • ABC_123_DE_F_4_56G
  • ABC.123.DE.F.4.56G
  • ABC 123 DEF 456 G
  • ABC123DEF456G

I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:

<verbatim-user-input-cleaned> OR OrderCodeVariation:<verbatim-user-input-without-special-chars>

So, if the user entered ABC.123.DE.F.4.56G, I would effectively search for

ABC.123.DE.F.4.56G OR OrderCodeVariation:ABC123DEF456G
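
The two processing steps could be sketched in Python roughly as follows. This is not the author's actual code: the function names, the definition of "special character" (anything that is not an ASCII letter or digit), and the default field name are assumptions. A real search backend would also need its own query-syntax escaping for the verbatim part.

```python
import re

# Assumption: every character that is not an ASCII letter or digit
# counts as a "special character" and is stripped.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9]")

def strip_special_chars(code: str) -> str:
    """Content processing: value to store in the extra, normalized field."""
    return SPECIAL_CHARS.sub("", code)

def build_query(user_input: str, field: str = "OrderCodeVariation") -> str:
    """Query processing: verbatim input OR the normalized variation.

    The field name is a hypothetical default; escaping of the search
    engine's own query syntax is deliberately left out of this sketch.
    """
    cleaned = user_input.strip()
    return f"{cleaned} OR {field}:{strip_special_chars(cleaned)}"

print(strip_special_chars("ABC-123-DE_F-4.56G"))  # ABC123DEF456G
print(build_query("ABC.123.DE.F.4.56G"))
# ABC.123.DE.F.4.56G OR OrderCodeVariation:ABC123DEF456G
```

Because the stored code and the user input go through the same normalization, differently punctuated spellings of one code end up comparing equal.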

JavaScript regular expression for word boundaries, tolerating in-word hyphens and apostrophes

You can organize your word-boundary characters into two groups.

  1. Characters that cannot be alone.
  2. Characters that can be alone.

A regex that works with your example would be:

[\s.,'-]{2,}|[\s.]

Now all that's left is to keep adding all non-word characters into those two groups until it fits all of your needs. So you might start adding symbols and more punctuation to those character classes.
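
The question is about JavaScript, but the same idea can be sketched in Python by splitting text on the boundary regex. The function name split_words and the sample sentence are made up for illustration.

```python
import re
from typing import List

# Boundary regex from the answer: two or more characters from the
# "cannot be alone" group, or a single whitespace character or period.
BOUNDARY = re.compile(r"[\s.,'-]{2,}|[\s.]")

def split_words(text: str) -> List[str]:
    """Split on boundaries and drop the empty pieces (hypothetical helper)."""
    return [piece for piece in BOUNDARY.split(text) if piece]

# Hypothetical sample text.
sample = "It's a well-known fact - or so they say, 'quoted' words."
print(split_words(sample))
# ["It's", 'a', 'well-known', 'fact', 'or', 'so', 'they', 'say', 'quoted', 'words']
```

The in-word apostrophe and hyphen survive because each stands alone between letters. The quote marks around quoted are dropped because each one sits next to a space or a comma-space pair.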

Regular expression for alphabet, underscore, hyphen, apostrophe only

Your regex is wrong. Try this:

/^[0-9A-Za-z_@'-]+$/

OR

/^[\w@'-]+$/

A hyphen must be placed first or last inside a character class to be treated literally without escaping. Also, if an empty string is not allowed, use + (1 or more) instead of * (0 or more).

Explanation:

^ assert position at start of the string
[\w@'-]+ match a single character present in the list below
    Quantifier: Between one and unlimited times, as many times as possible
    \w match any word character [a-zA-Z0-9_]
    @'- a single character in the list @'- literally
$ assert position at end of the string
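
The regex literals above use JavaScript-style delimiters. As a rough Python equivalent, here is a minimal sketch; the helper name is_allowed and the sample inputs are hypothetical. The re.ASCII flag is an assumption added so that \w means [a-zA-Z0-9_] as in the explanation, since Python's \w is Unicode-aware by default.

```python
import re

# Pattern from the answer, without the JavaScript /.../ delimiters.
# re.ASCII restricts \w to [a-zA-Z0-9_].
ALLOWED = re.compile(r"^[\w@'-]+$", re.ASCII)

def is_allowed(text: str) -> bool:
    """Hypothetical helper: True if every character is in the allowed set."""
    return ALLOWED.fullmatch(text) is not None

for candidate in ["user_name@example", "o'neil-99", "", "has space", "dot.name"]:
    print(repr(candidate), is_allowed(candidate))
```

The empty string is rejected because of the + quantifier. Switching to * would accept it.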

