Ruby | Find a Way to Find an Exception on the Same Word to Capitalize

How can I capitalize one letter of a word at a time, then add each variant of the word (with its single capital letter) to an array?


str = "hello"
str.size.times.map { |i| str[0,i] << str[i].upcase << str[i+1..] }
#=> ["Hello", "hEllo", "heLlo", "helLo", "hellO"]
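
For comparison, here is a minimal Python sketch of the same idea. The function name is made up for illustration; it slices the word around each index and uppercases only the character at that index.

```python
def capitalize_each_letter(word):
    # Build one variant per position, uppercasing only the character at index i.
    return [word[:i] + word[i].upper() + word[i + 1:] for i in range(len(word))]


print(capitalize_each_letter("hello"))
# ['Hello', 'hEllo', 'heLlo', 'helLo', 'hellO']
```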

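Capitalize first letter in Ruby with UTF-8 strings with exceptions


# mb_chars comes from ActiveSupport (Rails), not core Ruby.
# On Ruby 2.4 and later, plain String#capitalize also handles these characters.
"åbc".mb_chars.capitalize
#=> "Åbc"
"ébc".mb_chars.capitalize.to_s
#=> "Ébc"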

Update

And to ignore leading non-word characters:

string = "-åbc"
str = string.match(/^(\W*)(.*)/)
str[1] + str[2].mb_chars.capitalize.to_s
#=> "-Åbc"
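
The same split-then-capitalize idea can be sketched in Python, where str.capitalize() is already Unicode-aware. The function name below is hypothetical; the regex mirrors the one above (leading non-word characters, then the rest).

```python
import re


def capitalize_after_prefix(text):
    # Group 1: any leading non-word characters; group 2: everything else.
    prefix, rest = re.match(r"(\W*)(.*)", text, re.DOTALL).groups()
    # capitalize() uppercases the first character and lowercases the remainder.
    return prefix + rest.capitalize()


print(capitalize_after_prefix("-åbc"))
```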

Regex to match only uppercase words with some exceptions

To some extent, this is going to vary by the "flavour" of RegEx you're using. The following is based on .NET RegEx, which uses \b for word boundaries. The last example also uses negative lookarounds, (?<!...) and (?!...), as well as non-capturing parentheses, (?:...).

Basically, though, if the terms always contain at least one uppercase letter followed by at least one number, you can use

\b[A-Z]+[0-9]+\b

For all-uppercase and numbers (total must be 2 or more):

\b[A-Z0-9]{2,}\b

For all-uppercase and numbers, but starting with at least one letter:

\b[A-Z][A-Z0-9]+\b
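
The three simpler patterns above use nothing .NET-specific, so they can be tried out in other flavours too. Here is a small sketch using Python's re module; the sample sentence is invented for illustration.

```python
import re

text = "Connect P1 to J9 via USB2 and the A port, ID 42"

# Uppercase letters followed by digits.
letters_then_digits = re.findall(r"\b[A-Z]+[0-9]+\b", text)
# Any mix of uppercase letters and digits, two characters or more.
upper_or_digits = re.findall(r"\b[A-Z0-9]{2,}\b", text)
# Same, but the first character must be a letter.
starts_with_letter = re.findall(r"\b[A-Z][A-Z0-9]+\b", text)

print(letters_then_digits)  # ['P1', 'J9', 'USB2']
print(upper_or_digits)      # ['P1', 'J9', 'USB2', 'ID', '42']
print(starts_with_letter)   # ['P1', 'J9', 'USB2', 'ID']
```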

The granddaddy, to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:

(?:(?<!^)[A-Z]\b|(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b(?![A-Z0-9 ]$))

breakdown:

The regex starts with (?:. The ?: signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).

Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol |. This is alternation -- like an "or". The regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line.

Now, let's look at each expression in the alternation.

The first expression is: (?<!^)[A-Z]\b. The main clause here is [A-Z]\b, which is any one capital letter followed by a word boundary, which could be punctuation, whitespace, linebreak, etc. The part before that is (?<!^), which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match -- not really important to understand that here. The syntax for negative lookbehind in .NET is (?<!x), where x is the expression that must not exist before our main clause. Here that expression is simply ^, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."
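
A quick way to see this first alternative in action is to run it on its own. The sketch below uses Python's re module rather than .NET (an assumption for illustration). Python accepts the zero-width lookbehind (?<!^), although it would reject the variable-length lookbehind used in the second alternative.

```python
import re

# First alternative only: a single capital letter followed by a word boundary,
# but not at the very start of the string.
single_letter = re.compile(r"(?<!^)[A-Z]\b")

print(single_letter.findall("A plan B and C"))
# ['B', 'C']
```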

Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting of all numbers and uppercase letters.

That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. The \bs represent word boundaries, and the [A-Z0-9]+ matches one or more numbers and capital letters together.

The rest of the expression consists of other lookarounds. (?<!^[A-Z0-9 ]*) is another negative lookbehind, where the expression is ^[A-Z0-9 ]*. This means what precedes must not be all capital letters and numbers.

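The second lookaround is (?![A-Z0-9 ]$), which is a negative lookahead. The intent is that what follows must not be all capital letters and numbers. Note that, as written, it only rejects a match that is followed by exactly one capital letter, number, or space and then the end of the line.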

So, altogether, we are capturing words of all capital letters and numbers, and excluding one-letter, uppercase characters from the start of the line and everything from lines that are all uppercase.

There is at least one weakness here in that the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.

It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work either into multiple regexes or a combination of regex and standard string processing commands in your programming language of choice.
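
As a rough illustration of breaking up the work, here is a Python sketch that combines two small regexes with ordinary string checks. The names find_codes, TOKEN and ALL_CAPS_LINE are hypothetical, and it processes one line at a time: skip lines that are entirely uppercase, then drop a single letter at the start of the line.

```python
import re

# A "word" made only of uppercase letters and digits.
TOKEN = re.compile(r"\b[A-Z0-9]+\b")
# A line consisting solely of uppercase letters, digits and spaces.
ALL_CAPS_LINE = re.compile(r"[A-Z0-9 ]+")


def find_codes(line):
    # Step 1: ignore lines that are all uppercase.
    if ALL_CAPS_LINE.fullmatch(line):
        return []
    results = []
    for match in TOKEN.finditer(line):
        token = match.group(0)
        # Step 2: drop a single letter at the very beginning of the line.
        if match.start() == 0 and len(token) == 1 and token.isalpha():
            continue
        results.append(token)
    return results


print(find_codes("A P1 should connect to the J9"))  # ['P1', 'J9']
print(find_codes("THIS LINE IS ALL CAPS 42"))       # []
```

Because each rule is checked separately, the "A P1 ... J9" sentence from above returns both P1 and J9.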

Sentence capitalization exception

Since JavaScript did not support lookbehinds before ES2018, you'll have a much easier time running exactly the function you've written and then correcting the mistakenly capitalized bits back to lowercase.

Working example:

String.prototype.capitalize = function(exception) {
  // Capitalize the first character of every sentence.
  var result = this.replace(/.+?[.?!](\s|$)/g, function (txt) {
    return txt.charAt(0).toUpperCase() + txt.slice(1);
  });
  // Lowercase the word that was wrongly capitalized right after the exception.
  var r = new RegExp(exception + "\\.\\s*(\\w+)", "i");
  return result.replace(r, function (re) { return re.toLowerCase(); });
};

alert("capitalization of string xy. is not correct.".capitalize("xy"));

You probably could enhance it to handle an array of exceptions, or even use a regular expression.

Here's a working example: http://jsfiddle.net/remus/4EZBb/
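
Handling several exceptions could look roughly like the sketch below. It is written in Python rather than JavaScript, and the function name and the exceptions parameter are hypothetical. Each exception is escaped so that it is matched literally, and every occurrence is corrected, not just the first.

```python
import re


def capitalize_sentences(text, exceptions=()):
    # Capitalize the first character of every sentence.
    result = re.sub(
        r".+?[.?!](?:\s|$)",
        lambda m: m.group(0)[0].upper() + m.group(0)[1:],
        text,
    )
    # Undo the capitalization that follows each listed abbreviation.
    for exc in exceptions:
        pattern = re.compile(re.escape(exc) + r"\.\s*\w+", re.IGNORECASE)
        result = pattern.sub(lambda m: m.group(0).lower(), result)
    return result


print(capitalize_sentences("capitalization of string xy. is not correct.", ["xy"]))
# Capitalization of string xy. is not correct.
```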

How can I capitalize the first letter of each word in a string?

The .title() method of a string (either ASCII or Unicode is fine) does this:

>>> "hello world".title()
'Hello World'
>>> u"hello world".title()
u'Hello World'

However, look out for strings with embedded apostrophes, as noted in the docs.

The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:

>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"

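If the apostrophe behaviour is a problem, two common workarounds are sketched below: a regex that treats an apostrophe suffix as part of the word (along the lines of the workaround shown in the str.title documentation), and string.capwords(), which splits on whitespace only. Note that both still lowercase the rest of each word, so "UK" becomes "Uk".

```python
import re
import string


def titlecase(s):
    # Match a run of letters plus an optional apostrophe suffix as one word,
    # then capitalize the whole match.
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        lambda mo: mo.group(0).capitalize(),
        s,
    )


sample = "they're bill's friends from the UK"
print(titlecase(sample))        # They're Bill's Friends From The Uk
print(string.capwords(sample))  # They're Bill's Friends From The Uk
```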
