Ruby Convert Idn Domain from Punycode to Unicode

Try the simpleidn gem. It works with Ruby 1.8.7 and 1.9.2.

Add it to your Gemfile:

gem 'simpleidn'

Then run bundle install and call it as follows:

SimpleIDN.to_unicode("xn--mllerriis-l8a.com")
=> "møllerriis.com"

SimpleIDN.to_ascii("møllerriis.com")
=> "xn--mllerriis-l8a.com"

Ruby - internationalized domain names

Thanks to this snippet, I finally found a solution that does not require libidn. It is built on punycode4r together with either the unicode gem (a prebuilt binary can be found here) or ActiveSupport. I will use ActiveSupport since I use Rails anyway, but for reference I include both methods.

With the unicode gem:

require 'unicode'
require 'punycode' # This is not a gem, but a standalone file.

def idn_encode(domain)
  parts = domain.split(".").map do |label|
    encoded = Punycode.encode(Unicode::normalize_KC(Unicode::downcase(label)))
    if encoded =~ /-$/ # Pure ASCII
      encoded.chop # Remove trailing '-'
    else # Contains non-ASCII characters
      "xn--" + encoded
    end
  end
  parts.join(".")
end

With ActiveSupport:

require "punycode"
require "active_support"
$KCODE = "UTF-8" # Needed on Ruby 1.8 to enable mb_chars; has no effect on Ruby 1.9+

def idn_encode(domain)
  parts = domain.split(".").map do |label|
    # Chars#normalize was deprecated in later ActiveSupport releases;
    # there, label.downcase.unicode_normalize(:nfkc) is the usual replacement.
    encoded = Punycode.encode(label.mb_chars.downcase.normalize(:kc))
    if encoded =~ /-$/ # Pure ASCII
      encoded.chop # Remove trailing '-'
    else # Contains non-ASCII characters
      "xn--" + encoded
    end
  end
  parts.join(".")
end

The ActiveSupport solution was found thanks to this StackOverflow question.
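
For reference, here is a minimal pure-Ruby sketch of the encoding step itself, following the algorithm and parameters published in RFC 3492. It is not the source of punycode4r; the module name PunycodeSketch and its method names are made up for illustration. It shows why a pure-ASCII label comes back with a trailing '-' (the delimiter is appended whenever there is at least one basic character), which is what the checks above rely on.

```ruby
# Sketch of the RFC 3492 encoder. PunycodeSketch is a hypothetical name,
# not part of punycode4r or any gem.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

  # Bias adaptation function from RFC 3492, section 6.1.
  def self.adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Encodes a single label (no dots) and returns it WITHOUT the "xn--" prefix.
  def self.encode(label)
    input = label.codepoints
    output = input.select { |c| c < 128 }.pack("U*") # copy basic (ASCII) code points
    basic_count = output.length
    handled = basic_count
    output << "-" if basic_count > 0 # delimiter, present even for pure-ASCII labels

    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while handled < input.length
      m = input.select { |c| c >= n }.min # next code point to insert
      delta += (m - n) * (handled + 1)
      n = m
      input.each do |c|
        delta += 1 if c < n
        next unless c == n
        # Emit delta as a generalized variable-length integer.
        q = delta
        k = BASE
        loop do
          t = if k <= bias then TMIN
              elsif k >= bias + TMAX then TMAX
              else k - bias
              end
          break if q < t
          output << DIGITS[t + (q - t) % (BASE - t)]
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << DIGITS[q]
        bias = adapt(delta, handled + 1, handled == basic_count)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

puts PunycodeSketch.encode("møllerriis") # mllerriis-l8a
puts PunycodeSketch.encode("example")    # example-
```

The sketch skips case folding, normalization and the length limits that IDNA adds on top of Punycode, so it is only meant to show where the trailing '-' comes from.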

Does a Punycode domain name (UName) store the IDN table used?

TL;DR: No.

You are mixing multiple things, and it is difficult to summarize everything (I wrote a very detailed answer at https://webmasters.stackexchange.com/a/122160/75842 which should help you).

For computers, whether ê is Portuguese or Norwegian makes no difference at the DNS level. Likewise, at the Unicode level ê is "U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX", which is defined simply as a "Latin" character, irrespective of which language might use it.

In short:

  • The IETF invented the Punycode algorithm, and more precisely the IDNA standard, just to make sure that people could use (almost) any character in their domain name. As such, the algorithm is just a translation from "any Unicode string" to "an ASCII string starting with xn--".

  • The domain name industry, with ICANN and all the registries, then decides on rules on top of that. For example, there is a major rule, "you cannot mix characters from multiple scripts in the same string", mostly to avoid IDN homograph attacks (so not really a technical constraint); my answer above goes into full detail on this.

  • At the EPP level, various actors created various extensions; there is no real standardized "IDN" specification here. This is also why you will find some people speaking about "scripts", others about "languages", others about "repertoires", etc. It is a mess (Unicode only speaks about scripts, not languages). Some registries do not use any extension, while others do. Some want you to always pass an IDN "table" (aka script/language/whatever) reference, some will require it only in some cases. For example, look at Verisign's IDN practices at https://www.verisign.com/en_US/channel-resources/domain-registry-products/idn/idn-policy/registration-rules/index.xhtml; it boils down to "all IDN registrations need a language tag; some of them are attached to a specific list of possible characters".
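
To illustrate the "no mixed scripts" rule from the list above, here is a rough Ruby sketch that reports which scripts appear in a label, using the Unicode script properties built into Ruby's regular expressions. The script list and the method names are assumptions chosen for illustration; real registry policies are defined by their IDN tables and are considerably more detailed than this.

```ruby
# Deliberately small, hypothetical script list; a registry's real policy
# lives in its IDN tables, not in a check like this one.
SCRIPT_PATTERNS = {
  latin:    /\p{Latin}/,
  greek:    /\p{Greek}/,
  cyrillic: /\p{Cyrillic}/,
  han:      /\p{Han}/
}.freeze

# Returns the scripts found in the label, in order of first appearance.
# Digits and hyphens belong to the "Common" script and are ignored here.
def scripts_in(label)
  label.each_char.flat_map do |ch|
    SCRIPT_PATTERNS.select { |_name, pattern| ch.match?(pattern) }.keys
  end.uniq
end

# True when the label draws characters from more than one listed script.
def mixed_script?(label)
  scripts_in(label).size > 1
end

p scripts_in("møllerriis")      # [:latin]
p mixed_script?("p\u0430ypal")  # true: U+0430 is CYRILLIC SMALL LETTER A
```

Note that the check only looks at the Unicode string. Nothing in it, and nothing in the xn-- form, says which language the registrant had in mind.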

In theory you can find all existing IDN tables, but in practice only most of them, at https://www.iana.org/domains/idn-tables. You can see they are per registry, which shows that this extra information is really not encoded in the ASCII form of the domain name after conversion by the Punycode algorithm.
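
To make that concrete, here is a small Ruby sketch of the decoding direction, again following RFC 3492. The module name PunycodeDecodeSketch is invented for this example. Everything the decoder recovers is a list of integer code points; there is simply no field in the encoded string that could carry a table, script or language tag.

```ruby
# Sketch of the RFC 3492 decoder. PunycodeDecodeSketch is a hypothetical
# name used only for this illustration.
module PunycodeDecodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

  # Bias adaptation function from RFC 3492, section 6.1.
  def self.adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Takes one label WITHOUT the "xn--" prefix, returns an array of code points.
  def self.decode(encoded)
    split = encoded.rindex("-") # last delimiter separates the ASCII part
    output = split ? encoded[0...split].codepoints : []
    rest = split ? encoded[(split + 1)..-1] : encoded

    n = INITIAL_N
    i = 0
    bias = INITIAL_BIAS
    pos = 0
    while pos < rest.length
      old_i = i
      w = 1
      k = BASE
      loop do
        ch = rest[pos] or raise ArgumentError, "truncated input"
        digit = DIGITS.index(ch.downcase) or raise ArgumentError, "bad digit #{ch.inspect}"
        pos += 1
        i += digit * w
        t = if k <= bias then TMIN
            elsif k >= bias + TMAX then TMAX
            else k - bias
            end
        break if digit < t
        w *= BASE - t
        k += BASE
      end
      bias = adapt(i - old_i, output.length + 1, old_i.zero?)
      n += i / (output.length + 1)  # which code point to insert
      i %= output.length + 1        # where to insert it
      output.insert(i, n)
      i += 1
    end
    output
  end
end

code_points = PunycodeDecodeSketch.decode("mllerriis-l8a")
p code_points             # plain integers, nothing else
puts code_points.pack("U*") # møllerriis
```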

I am trying to understand who is assuming the IDN table here...

There should be no assumption: either the table is explicitly given by the registrar, or no IDN table is needed at all (the registry will just do the Punycode conversion in reverse and decide, based on the characters found, which table the name falls under).

I can see the EPP transaction - it is not using the IDN extension and therefore cannot supply an IDN table to the server, even if it wanted to

Which registry? If you are a registrar, in practice the registry should be able to help you and answer this kind of question. Note that most of the time (I could write "all the time", but I am not sure that no counterexample exists; I just have none in mind right now), during an EPP domain:check you pass only the name (in ASCII form) without any IDN extension, while you pass the IDN extension, if any, during the domain:create. This also means that the domain:check might not give you the full, correct reply, simply because at that point not everything is known.
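
As a rough illustration of such a check, the Ruby sketch below builds a bare domain:check command in the shape defined by RFC 5730 and RFC 5731, with the name in its ASCII form and no extension element. The method name domain_check_xml and the clTRID value are made up for this example, and any given registry may expect more than this.

```ruby
EPP_NS = "urn:ietf:params:xml:ns:epp-1.0"
DOMAIN_NS = "urn:ietf:params:xml:ns:domain-1.0"

# Hypothetical helper: builds a domain:check for one name in ASCII (xn--) form.
# cl_trid is the client transaction id; its value here is arbitrary.
def domain_check_xml(ascii_name, cl_trid)
  unless ascii_name.ascii_only? && ascii_name.match?(/\A[a-z0-9.-]+\z/i)
    raise ArgumentError, "expected the ASCII form, got #{ascii_name.inspect}"
  end
  unless cl_trid.match?(/\A[A-Za-z0-9-]+\z/)
    raise ArgumentError, "unexpected characters in clTRID"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <epp xmlns="#{EPP_NS}">
      <command>
        <check>
          <domain:check xmlns:domain="#{DOMAIN_NS}">
            <domain:name>#{ascii_name}</domain:name>
          </domain:check>
        </check>
        <clTRID>#{cl_trid}</clTRID>
      </command>
    </epp>
  XML
end

puts domain_check_xml("xn--mllerriis-l8a.com", "CHECK-0001")
```

There is no extension element in the command at all, so the registry has to answer from the name alone. Any table or language reference would only appear later, in whatever extension that registry uses on domain:create.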

See these EPP documents on IDN extensions:

  • https://datatracker.ietf.org/doc/html/draft-ietf-eppext-idnmap-02
  • https://datatracker.ietf.org/doc/html/draft-wilcox-cira-idn-eppext
  • https://tools.ietf.org/id/draft-gould-idn-table-07.html
  • https://datatracker.ietf.org/doc/html/draft-sienkiewicz-epp-idn-00

ruby toUnicode fun does not return the idn site when there is no www. in the url

Finally figured it out. The problem was the http part of the URL; the toUnicode function works fine if we remove the http part from the URL before passing it in.
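
One way to do that, sketched with Ruby's standard URI library, is to extract the host before converting. The helper name host_only is made up for this example, and the sketch assumes the input is already in ASCII (xn--) form, since URI.parse rejects raw non-ASCII characters.

```ruby
require "uri"

# Hypothetical helper: returns just the host name, whether or not the
# input carries a scheme such as http://. Assumes ASCII (xn--) input.
def host_only(input)
  text = input.strip
  # URI.parse only fills in #host when a scheme is present, so add one if missing.
  text = "http://#{text}" unless text.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  URI.parse(text).host
end

puts host_only("http://xn--mllerriis-l8a.com/some/path") # xn--mllerriis-l8a.com
puts host_only("xn--mllerriis-l8a.com")                  # xn--mllerriis-l8a.com
```

The returned host can then be handed to whichever to-Unicode function you use.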

How to convert domain names with greek characters to an ascii URL?

You can use any tool that supports "Libidn". A quick search showed SimpleDNS might be of help to you.

There are also heaps of IDN converters online; if that's enough for you, you can use one of them.


