Why Is Uri.Escape() Marked as Obsolete and Where Is This Regexp::Unsafe Constant

Why is URI.escape() marked as obsolete and where is this REGEXP::UNSAFE constant?

I see you answered your question re: UNSAFE. As to this question:

Additionally, this code has the escape / unescape methods marked as 'obsolete' since 2009. Why are they obsolete?

There's some background in this Dec. 2010 issue: https://bugs.ruby-lang.org/issues/4167 In that thread Yui Naruse writes:

URI lib says it refers RFC2396, so current behavior is correct in its
spec.

Yes, I know current behavior is not what you expect. So we plan to
change the lib to refer RFC3986.

Moreover current URI.encode is simple gsub. But I think it should
split a URI to components, then escape each components, and finally
join them.

So current URI.encode is considered harmful and deprecated. This will
be removed or change behavior drastically.

What is the replacement at this time?


As I said above, current URI.encode is wrong on spec level. So we
won't provide the exact replacement. The replacement will vary by its
use case.

We thought most use case is to generate escaped URI from joined URI
componets. For this, people should use URI.join or
URI.encode_www_form; you should escape each components before join
them.

Ruby 2.7 says URI.escape is obsolete, what replaces it?

There is no official RFC 3986-compliant URI escaper in the Ruby standard library today.

See Why is URI.escape() marked as obsolete and where is this REGEXP::UNSAFE constant? for background.

There are several methods that have various issues with them as you have discovered and pointed out in the comment:

  • They produce deprecation warnings
  • They do not claim standards compliance
  • They are not escaping in accordance with RFC 3986
  • They are implemented in tangentially related libraries

Ruby 1.9.3 add unsafe characters to URI.escape

URI#escape is not what you are looking for. You want CGI#escape:

require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"

This will properly encode it to allow Sinatra to retrieve it.

How to URL encode a string in Ruby

str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts CGI.escape str

=> "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"

What's the difference between EscapeUriString and EscapeDataString?

Use EscapeDataString always (for more info about why, see Livven's answer below)

Edit: removed dead link to how the two differ on encoding

PHP FILTER_VALIDATE_EMAIL does not work correctly

Validating e-mail adresses is kinda complicated.
Take a look at this list:

Valid email addresses

  1. niceandsimple@example.com
  2. very.common@example.com
  3. a.little.lengthy.but.fine@dept.example.com
  4. disposable.style.email.with+symbol@example.com
  5. user@[IPv6:2001:db8:1ff::a0b:dbd0]
  6. "much.more unusual"@example.com
  7. "very.unusual.@.unusual.com"@example.com
  8. "very.(),:;<>[]".VERY."very@\
    "very".unusual"@strange.example.com
  9. postbox@com (top-level domains are valid hostnames)
  10. admin@mailserver1 (local domain name with no TLD)
  11. !#$%&'*+-/=?^_`{}|~@example.org
  12. "()<>[]:,;@\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
  13. " "@example.org (space between the quotes)
  14. üñîçøðé@example.com (Unicode characters in local part)

Invalid email addresses

  1. Abc.example.com (an @ character must separate the local and domain
    parts)
  2. A@b@c@example.com (only one @ is allowed outside quotation marks)
  3. a"b(c)d,e:f;gi[j\k]l@example.com (none of the special characters
    in this local part are allowed outside quotation marks)
  4. just"not"right@example.com (quoted strings must be dot separated, or
    the only element making up the local-part)
  5. this is"not\allowed@example.com (spaces, quotes, and backslashes may
    only exist when within quoted strings and preceded by a backslash)
  6. this\ still"not\allowed@example.com (even if escaped (preceded by
    a backslash), spaces, quotes, and backslashes must still be
    contained by quotes)

Source http://en.wikipedia.org/wiki/Email_address

Allmost all e-mail validation implementations are "bugged" but the php implementation is fine to work with because it accepts all common e-mail adresses

UPDATE:

Found on http://www.php.net/manual/en/filter.filters.validate.php

Regarding "partial" addresses with no . in the domain part, a comment in the source code (in ext/filter/logical_filters.c) justifies this rejection thus:

 * The regex below is based on a regex by Michael Rushton.
* However, it is not identical. I changed it to only consider routeable
* addresses as valid. Michael's regex considers a@b a valid address
* which conflicts with section 2.3.5 of RFC 5321 which states that:
*
* Only resolvable, fully-qualified domain names (FQDNs) are permitted
* when domain names are used in SMTP. In other words, names that can
* be resolved to MX RRs or address (i.e., A or AAAA) RRs (as discussed
* in Section 5) are permitted, as are CNAME RRs whose targets can be
* resolved, in turn, to MX or address RRs. Local nicknames or
* unqualified names MUST NOT be used.

And here is a link to the class from Michael Rushton (link broken see source below)
Which supports both RFC 5321/5322

<?php
/**
* Squiloople Framework
*
* LICENSE: Feel free to use and redistribute this code.
*
* @author Michael Rushton <michael@squiloople.com>
* @link http://squiloople.com/
* @package Squiloople
* @version 1.0
* @copyright © 2012 Michael Rushton
*/
/**
* Email Address Validator
*
* Validate email addresses according to the relevant standards
*/
final class EmailAddressValidator
{
// The RFC 5321 constant
const RFC_5321 = 5321;
// The RFC 5322 constant
const RFC_5322 = 5322;
/**
* The email address
*
* @access private
* @var string $_email_address
*/
private $_email_address;
/**
* A quoted string local part is either allowed (true) or not (false)
*
* @access private
* @var boolean $_quoted_string
*/
private $_quoted_string = FALSE;
/**
* An obsolete local part is either allowed (true) or not (false)
*
* @access private
* @var boolean $_obsolete
*/
private $_obsolete = FALSE;
/**
* A basic domain name is either required (true) or not (false)
*
* @access private
* @var boolean $_basic_domain_name
*/
private $_basic_domain_name = TRUE;
/**
* A domain literal domain is either allowed (true) or not (false)
*
* @access private
* @var boolean $_domain_literal
*/
private $_domain_literal = FALSE;
/**
* Comments and folding white spaces are either allowed (true) or not (false)
*
* @access private
* @var boolean $_cfws
*/
private $_cfws = FALSE;
/**
* Set the email address and turn on the relevant standard if required
*
* @access public
* @param string $email_address
* @param null|integer $standard
*/
public function __construct($email_address, $standard = NULL)
{
// Set the email address
$this->_email_address = $email_address;
// Set the relevant standard or throw an exception if an unknown is requested
switch ($standard)
{
// Do nothing if no standard requested
case NULL:
break;
// Otherwise if RFC 5321 requested
case self::RFC_5321:
$this->setStandard5321();
break;
// Otherwise if RFC 5322 requested
case self::RFC_5322:
$this->setStandard5322();
break;
// Otherwise throw an exception
default:
throw new Exception('Unknown RFC standard for email address validation.');
}
}
/**
* Call the constructor fluently
*
* @access public
* @static
* @param string $email_address
* @param null|integer $standard
* @return EmailAddressValidator
*/
public static function setEmailAddress($email_address, $standard = NULL)
{
return new self($email_address, $standard);
}
/**
* Validate the email address using a basic standard
*
* @access public
* @return EmailAddressValidator
*/
public function setStandardBasic()
{
// A quoted string local part is not allowed
$this->_quoted_string = FALSE;
// An obsolete local part is not allowed
$this->_obsolete = FALSE;
// A basic domain name is required
$this->_basic_domain_name = TRUE;
// A domain literal domain is not allowed
$this->_domain_literal = FALSE;
// Comments and folding white spaces are not allowed
$this->_cfws = FALSE;
// Return the EmailAddressValidator object
return $this;
}
/**
* Validate the email address using RFC 5321
*
* @access public
* @return EmailAddressValidator
*/
public function setStandard5321()
{
// A quoted string local part is allowed
$this->_quoted_string = TRUE;
// An obsolete local part is not allowed
$this->_obsolete = FALSE;
// Only a basic domain name is not required
$this->_basic_domain_name = FALSE;
// A domain literal domain is allowed
$this->_domain_literal = TRUE;
// Comments and folding white spaces are not allowed
$this->_cfws = FALSE;
// Return the EmailAddressValidator object
return $this;
}
/**
* Validate the email address using RFC 5322
*
* @access public
* @return EmailAddressValidator
*/
public function setStandard5322()
{
// A quoted string local part is disallowed
$this->_quoted_string = FALSE;
// An obsolete local part is allowed
$this->_obsolete = TRUE;
// Only a basic domain name is not required
$this->_basic_domain_name = FALSE;
// A domain literal domain is allowed
$this->_domain_literal = TRUE;
// Comments and folding white spaces are allowed
$this->_cfws = TRUE;
// Return the EmailAddressValidator object
return $this;
}
/**
* Either allow (true) or do not allow (false) a quoted string local part
*
* @access public
* @param boolean $allow
* @return EmailAddressValidator
*/
public function setQuotedString($allow = TRUE)
{
// Either allow (true) or do not allow (false) a quoted string local part
$this->_quoted_string = $allow;
// Return the EmailAddressValidator object
return $this;
}
/**
* Either allow (true) or do not allow (false) an obsolete local part
*
* @access public
* @param boolean $allow
* @return EmailAddressValidator
*/
public function setObsolete($allow = TRUE)
{
// Either allow (true) or do not allow (false) an obsolete local part
$this->_obsolete = $allow;
// Return the EmailAddressValidator object
return $this;
}
/**
* Either require (true) or do not require (false) a basic domain name
*
* @access public
* @param boolean $allow
* @return EmailAddressValidator
*/
public function setBasicDomainName($allow = TRUE)
{
// Either require (true) or do not require (false) a basic domain name
$this->_basic_domain_name = $allow;
// Return the EmailAddressValidator object
return $this;
}
/**
* Either allow (true) or do not allow (false) a domain literal domain
*
* @access public
* @param boolean $allow
* @return EmailAddressValidator
*/
public function setDomainLiteral($allow = TRUE)
{
// Either allow (true) or do not allow (false) a domain literal domain
$this->_domain_literal = $allow;
// Return the EmailAddressValidator object
return $this;
}
/**
* Either allow (true) or do not allow (false) comments and folding white spaces
*
* @access public
* @param boolean $allow
* @return EmailAddressValidator
*/
public function setCFWS($allow = TRUE)
{
// Either allow (true) or do not allow (false) comments and folding white spaces
$this->_cfws = $allow;
// Return the EmailAddressValidator object
return $this;
}
/**
* Return the regular expression for a dot atom local part
*
* @access private
* @return string
*/
private function _getDotAtom()
{
return "([!#-'*+\/-9=?^-~-]+)(?>\.(?1))*";
}
/**
* Return the regular expression for a quoted string local part
*
* @access private
* @return string
*/
private function _getQuotedString()
{
return '"(?>[ !#-\[\]-~]|\\\[ -~])*"';
}
/**
* Return the regular expression for an obsolete local part
*
* @access private
* @return string
*/
private function _getObsolete()
{
return '([!#-\'*+\/-9=?^-~-]+|"(?>'
. $this->_getFWS()
. '(?>[\x01-\x08\x0B\x0C\x0E-!#-\[\]-\x7F]|\\\[\x00-\xFF]))*'
. $this->_getFWS()
. '")(?>'
. $this->_getCFWS()
. '\.'
. $this->_getCFWS()
. '(?1))*';
}
/**
* Return the regular expression for a domain name domain
*
* @access private
* @return string
*/
private function _getDomainName()
{
// Return the basic domain name format if required
if ($this->_basic_domain_name)
{
return '(?>' . $this->_getDomainNameLengthLimit()
. '[a-z\d](?>[a-z\d-]*[a-z\d])?'
. $this->_getCFWS()
. '\.'
. $this->_getCFWS()
. '){1,126}[a-z]{2,6}';
}
// Otherwise return the full domain name format
return $this->_getDomainNameLengthLimit()
. '([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>'
. $this->_getCFWS()
. '\.'
. $this->_getDomainNameLengthLimit()
. $this->_getCFWS()
. '(?2)){0,126}';
}
/**
* Return the regular expression for an IPv6 address
*
* @access private
* @return string
*/
private function _getIPv6()
{
return '([a-f\d]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?';
}
/**
* Return the regular expression for an IPv4-mapped IPv6 address
*
* @access private
* @return string
*/
private function _getIPv4MappedIPv6()
{
return '(?3)(?>:(?3)){5}:|(?!(?:.*[a-f\d]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?';
}
/**
* Return the regular expression for an IPv4 address
*
* @access private
* @return string
*/
private function _getIPv4()
{
return '(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?6)){3}';
}
/**
* Return the regular expression for a domain literal domain
*
* @access private
* @return string
*/
private function _getDomainLiteral()
{
return '\[(?:(?>IPv6:(?>'
. $this->_getIPv6()
. '))|(?>(?>IPv6:(?>'
. $this->_getIPv4MappedIPv6()
. '))?'
. $this->_getIPv4()
. '))\]';
}
/**
* Return either the regular expression for folding white spaces or its backreference
*
* @access private
* @param boolean $define
* @return string
*/
private function _getFWS($define = FALSE)
{
// Return the backreference if $define is set to FALSE otherwise return the regular expression
if ($this->_cfws)
{
return !$define ? '(?P>fws)' : '(?<fws>(?>(?>(?>\x0D\x0A)?[\t ])+|(?>[\t ]*\x0D\x0A)?[\t ]+)?)';
}
}
/**
* Return the regular expression for comments
*
* @access private
* @return string
*/
private function _getComments()
{
return '(?<comment>\((?>'
. $this->_getFWS()
. '(?>[\x01-\x08\x0B\x0C\x0E-\'*-\[\]-\x7F]|\\\[\x00-\x7F]|(?P>comment)))*'
. $this->_getFWS()
. '\))';
}
/**
* Return either the regular expression for comments and folding white spaces or its backreference
*
* @access private
* @param boolean $define
* @return string
*/
private function _getCFWS($define = FALSE)
{
// Return the backreference if $define is set to FALSE
if ($this->_cfws && !$define)
{
return '(?P>cfws)';
}
// Otherwise return the regular expression
if ($this->_cfws)
{
return '(?<cfws>(?>(?>(?>'
. $this->_getFWS(TRUE)
. $this->_getComments()
. ')+'
. $this->_getFWS()
. ')|'
. $this->_getFWS()
. ')?)';
}
}
/**
* Establish and return the valid format for the local part
*
* @access private
* @return string
*/
private function _getLocalPart()
{
// The local part may be obsolete if allowed
if ($this->_obsolete)
{
return $this->_getObsolete();
}
// Otherwise the local part must be either a dot atom or a quoted string if the latter is allowed
if ($this->_quoted_string)
{
return '(?>' . $this->_getDotAtom() . '|' . $this->_getQuotedString() . ')';
}
// Otherwise the local part must be a dot atom
return $this->_getDotAtom();
}
/**
* Establish and return the valid format for the domain
*
* @access private
* @return string
*/
private function _getDomain()
{
// The domain must be either a domain name or a domain literal if the latter is allowed
if ($this->_domain_literal)
{
return '(?>' . $this->_getDomainName() . '|' . $this->_getDomainLiteral() . ')';
}
// Otherwise the domain must be a domain name
return $this->_getDomainName();
}
/**
* Return the email address length limit
*
* @access private
* @return string
*/
private function _getEmailAddressLengthLimit()
{
return '(?!(?>' . $this->_getCFWS() . '"?(?>\\\[ -~]|[^"])"?' . $this->_getCFWS() . '){255,})';
}
/**
* Return the local part length limit
*
* @access private
* @return string
*/
private function _getLocalPartLengthLimit()
{
return '(?!(?>' . $this->_getCFWS() . '"?(?>\\\[ -~]|[^"])"?' . $this->_getCFWS() . '){65,}@)';
}
/**
* Establish and return the domain name length limit
*
* @access private
* @return string
*/
private function _getDomainNameLengthLimit()
{
return '(?!' . $this->_getCFWS() . '[a-z\d-]{64,})';
}
/**
* Check to see if the domain can be resolved to MX RRs
*
* @access private
* @param array $domain
* @return integer|boolean
*/
private function _verifyDomain($domain)
{
// Return 0 if the domain cannot be resolved to MX RRs
if (!checkdnsrr(end($domain), 'MX'))
{
return 0;
}
// Otherwise return true
return TRUE;
}
/**
* Perform the validation check on the email address's syntax and, if required, call _verifyDomain()
*
* @access public
* @param boolean $verify
* @return boolean|integer
*/
public function isValid($verify = FALSE)
{
// Return false if the email address has an incorrect syntax
if (!preg_match(
'/^'
. $this->_getEmailAddressLengthLimit()
. $this->_getLocalPartLengthLimit()
. $this->_getCFWS()
. $this->_getLocalPart()
. $this->_getCFWS()
. '@'
. $this->_getCFWS()
. $this->_getDomain()
. $this->_getCFWS(TRUE)
. '$/isD'
, $this->_email_address
))
{
return FALSE;
}
// Otherwise check to see if the domain can be resolved to MX RRs if required
if ($verify)
{
return $this->_verifyDomain(explode('@', $this->_email_address));
}
// Otherwise return 1
return 1;
}
}

How can I validate an email address using a regular expression?

The fully RFC 822 compliant regex is inefficient and obscure because of its length. Fortunately, RFC 822 was superseded twice and the current specification for email addresses is RFC 5322. RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use.

One RFC 5322 compliant regex can be found at the top of the page at http://emailregex.com/ but uses the IP address pattern that is floating around the internet with a bug that allows 00 for any of the unsigned byte decimal values in a dot-delimited address, which is illegal. The rest of it appears to be consistent with the RFC 5322 grammar and passes several tests using grep -Po, including cases domain names, IP addresses, bad ones, and account names with and without quotes.

Correcting the 00 bug in the IP pattern, we obtain a working and fairly fast regex. (Scrape the rendered version, not the markdown, for actual code.)

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

or:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Here is diagram of finite state machine for above regexp which is more clear than regexp itself
Sample Image

The more sophisticated patterns in Perl and PCRE (regex library used e.g. in PHP) can correctly parse RFC 5322 without a hitch. Python and C# can do that too, but they use a different syntax from those first two. However, if you are forced to use one of the many less powerful pattern-matching languages, then it’s best to use a real parser.

It's also important to understand that validating it per the RFC tells you absolutely nothing about whether that address actually exists at the supplied domain, or whether the person entering the address is its true owner. People sign others up to mailing lists this way all the time. Fixing that requires a fancier kind of validation that involves sending that address a message that includes a confirmation token meant to be entered on the same web page as was the address.

Confirmation tokens are the only way to know you got the address of the person entering it. This is why most mailing lists now use that mechanism to confirm sign-ups. After all, anybody can put down president@whitehouse.gov, and that will even parse as legal, but it isn't likely to be the person at the other end.

For PHP, you should not use the pattern given in Validate an E-Mail Address with PHP, the Right Way from which I quote:

There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard.

That is no better than all the other non-RFC patterns. It isn’t even smart enough to handle even RFC 822, let alone RFC 5322. This one, however, is.

If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can't handle it is just rude and impolite from the user's perspective. A state engine for the purpose can both validate and even correct e-mail addresses that would otherwise be considered invalid as it disassembles the e-mail address according to each RFC. This allows for a potentially more pleasing experience, like

The specified e-mail address 'myemail@address,com' is invalid. Did you mean 'myemail@address.com'?

See also Validating Email Addresses, including the comments. Or Comparing E-mail Address Validating Regular Expressions.

Regular expression visualization

Debuggex Demo

How do I check if a given string is a legal/valid file name under Windows?

You can get a list of invalid characters from Path.GetInvalidPathChars and GetInvalidFileNameChars.

UPD: See Steve Cooper's suggestion on how to use these in a regular expression.

UPD2: Note that according to the Remarks section in MSDN "The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names." The answer provided by sixlettervaliables goes into more details.

Semicolon as URL query separator

The W3C Recommendation from 1999 is obsolete. The current status, according to the 2014 W3C Recommendation, is that semicolon is now illegal as a parameter separator:

To decode application/x-www-form-urlencoded payloads, the following algorithm should be used. [...] The output of this algorithm is a sorted list of name-value pairs. [...]

  1. Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).

In other words, ?foo=bar;baz means the parameter foo will have the value bar;baz; whereas ?foo=bar;baz=sna should result in foo being bar;baz=sna (although technically illegal since the second = should be escaped to %3D).



Related Topics



Leave a reply



Submit