Validate Japanese Character in Active Record Callback

The following code may be enough to meet the requirement you've specified so far with minimal effort. It uses the Moji gem (documentation in Japanese), which provides many convenience methods for inspecting the content of a Japanese-language string.

It allows a maximum of 14 characters for a name that consists only of half-width characters, and a maximum of 7 characters otherwise. Names that mix half- and full-width characters fall into the second case: the presence of even one full-width character makes the whole string count as "full-width".

class Customer < ActiveRecord::Base

  # Up to 14 characters when the name is half-width only
  validates_length_of :name, :maximum => 14,
    :if => Proc.new { |customer| customer.half_width?(customer.name) }
  # Up to 7 characters otherwise
  validates_length_of :name, :maximum => 7,
    :unless => Proc.new { |customer| customer.half_width?(customer.name) }

  def half_width?(string)
    Moji.type?(string, Moji::HAN_KATA)
  end

end

Assumptions made:

  1. Data encoding within the system is UTF-8, and gets stored as such in the database; any further necessary re-encoding (such as for passing the data to a legacy system etc) is done in another module.
  2. No automatic conversion of half-width to full-width characters is done before the data is saved to the database, i.e. half-width characters are allowed in the database, perhaps for reasons of legacy system integration, proper preservation of actual user input(!), and/or the aesthetic value of half-width characters(!)
  3. Diacritics in half-width characters are treated as their own separate character (i.e. no parsing of カ and ゙ to be considered one character for purposes of determining string length)
  4. There is only one name field as you specify and not, say, four (for surname, surname furigana, given name, given name furigana) which is quite common nowadays.
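
If pulling in Moji is not an option, the same rule can be approximated in plain Ruby. The following is only a minimal sketch, not the gem's behaviour: the names HALF_WIDTH_ONLY, half_width_only?, max_name_length and valid_name_length? are made up for illustration, and it assumes "half-width" means printable ASCII (U+0020 to U+007E) plus the half-width katakana block (U+FF61 to U+FF9F).

```ruby
# Hypothetical helpers, not part of Moji or Rails.
# Assumption: "half-width" = printable ASCII plus half-width katakana (U+FF61..U+FF9F).
HALF_WIDTH_ONLY = /\A[\u0020-\u007E\uFF61-\uFF9F]*\z/

# True only when every character in the string is half-width.
def half_width_only?(string)
  HALF_WIDTH_ONLY.match?(string)
end

# 14 characters for all-half-width names, 7 as soon as one
# character outside the half-width ranges appears.
def max_name_length(name)
  half_width_only?(name) ? 14 : 7
end

def valid_name_length?(name)
  name.length <= max_name_length(name)
end

# Half-width katakana written as escapes so the example survives re-encoding.
half_width_name = "\uFF94\uFF8F\uFF80\uFF9E \uFF80\uFF9B\uFF73" # 8 code points
puts valid_name_length?(half_width_name) # true  (8 <= 14)
puts valid_name_length?("あ" * 8)         # false (8 > 7)
```

In line with assumption 3, the half-width voiced sound mark (U+FF9E) is counted as its own character here.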

Ruby: Checking for East Asian Width (Unicode)

Late to the party, but hopefully still helpful: in Ruby, you can use the unicode-display_width gem to check a string's East Asian Width:

require 'unicode/display_width'
"⚀".display_width #=> 1
'一'.display_width #=> 2

Using JavaScript to check whether a string contains Japanese characters (including kanji)

Check whether this works or not. I found this website that seems to list all the characters in Unicode that might be used in Japanese text.

The corresponding regex (for single character) would be:

/[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/
  -------------_____________-------------_____________-------------_____________
   Punctuation   Hiragana     Katakana    Full-width       CJK      CJK Ext. A
                                            Roman/      (Common &     (Rare)
                                          Half-width    Uncommon)
                                           Katakana

The ranges are (as quoted from the site):

  • 3000 - 303f: Japanese-style punctuation
  • 3040 - 309f: Hiragana
  • 30a0 - 30ff: Katakana
  • ff00 - ff9f: Full-width Roman characters and half-width Katakana
  • 4e00 - 9faf: CJK unified ideographs - Common and uncommon Kanji
  • 3400 - 4dbf: CJK unified ideographs Extension A - Rare Kanji

I have changed the ranges a bit:

  • I have changed from ff00 - ffef to ff00 - ff9f for Full-width Roman characters and half-width Katakana. The code points from ffa0 - ffdc contain half-width Hangul characters, which is not what you want. You may want to re-add the code points from ffe0 - ffef, but they are mostly half-width punctuation or full-width currency symbols.

You can check the site and remove any range you don't want, or that you are sure will not appear in your input.
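
Although the question is about JavaScript, the same character class carries over to Ruby almost unchanged, since Ruby regular expressions accept \uXXXX escapes as well. A minimal sketch, with the constant and method names invented for illustration:

```ruby
# Same ranges as the JavaScript regex above. JAPANESE_CHAR and
# contains_japanese? are illustrative names, not from any library.
JAPANESE_CHAR = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/

# True when at least one character falls in one of the listed ranges.
def contains_japanese?(text)
  JAPANESE_CHAR.match?(text)
end

puts contains_japanese?("hello")                    # false
puts contains_japanese?("hello \u4E16\u754C")       # true (two kanji)
puts contains_japanese?("\uFF76\uFF80\uFF76\uFF85") # true (half-width katakana)
```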

Postgresql convert Japanese Full-Width to Half-Width

How about using the translate() function?

-- prepare test data
CREATE TABLE address (
    id integer,
    name text
);
INSERT INTO address VALUES (1, 'ＳＹＳＫＥＮ, 松井ケ丘３, コメリＨ&Ｇ, 篠路７-１');

-- show test data
SELECT * from address;

-- convert Full-Width to Half-Width Japanese
UPDATE address SET name = translate(name,
    '０１２３４５６７８９ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ',
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
);

-- see the converted data
SELECT * from address;

After the update, the name column contains "SYSKEN, 松井ケ丘3, コメリH&G, 篠路7-1".
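
If the conversion has to happen in application code rather than in the database, Ruby's String#tr performs the same character-for-character mapping as translate(). A minimal sketch: the method name to_half_width_alnum is made up for illustration, and, like the SQL, it only touches digits and uppercase letters.

```ruby
# Map full-width digits (U+FF10..U+FF19) and full-width uppercase letters
# (U+FF21..U+FF3A) to their half-width forms; everything else is untouched.
# The full-width ranges are written as escapes so they survive re-encoding.
def to_half_width_alnum(text)
  text.tr("\uFF10-\uFF19\uFF21-\uFF3A", "0-9A-Z")
end

# Full-width "ABC" followed by kanji and a full-width "3".
sample = "\uFF21\uFF22\uFF23, \u677E\u4E95\uFF13"
puts to_half_width_alnum(sample) # ABC, 松井3
```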

Why do I get this undefined method error in a Rails model callback, passing the method as a symbol?

This may not be your primary issue (since your error message doesn't seem to relate to it), but you're not using changed? correctly. changed? needs to be called on your model object, optionally prefixed with your attribute name. So your condition method should look like:

def markdown_changed_or_html_nil?
  # based on your method name, shouldn't this be:
  # content_markdown_changed? || content_html.nil?
  content_markdown_changed? || content_markdown.nil?
end

Find more information about Dirty methods at http://api.rubyonrails.org/classes/ActiveModel/Dirty.html.

ALSO

I'm pretty sure Rails 4 hasn't moved Dirty out of ActiveRecord::Base, so you don't need to manually include ActiveModel::Dirty in your model.

ALSO

This line:

validates :user, :title, :content_markdown, { presence: true, on: create }

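Should be (note the symbol :create; a bare create is evaluated as a method call instead of being passed as an option value):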

validates :user, :title, :content_markdown, { presence: true, on: :create }

Determine whether a Unicode character is fullwidth or halfwidth in C++

You should use ICU u_getIntPropertyValue with the UCHAR_EAST_ASIAN_WIDTH property.

For example:

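#include <unicode/uchar.h>

// Returns true when the code point's East Asian Width is Fullwidth (F) or Wide (W).
bool is_fullwidth(UChar32 c) {
    int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}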

Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e followed by U+0301 COMBINING ACUTE ACCENT will only take up 1 cell.
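
The combining-character caveat is not specific to C++. As a small illustration in Ruby (used here only to match the Ruby snippets earlier on this page), String#grapheme_clusters, available since Ruby 2.5, groups a base character with its combining marks. Counting clusters instead of code points is only a first step towards a cell count, since wide characters still need the width check described above.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one user-perceived character.
accented = "e\u0301"

puts accented.length                   # 2 (code points)
puts accented.grapheme_clusters.length # 1 (grapheme clusters)
```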


