Programmatically Determine Whether to Describe an Object with "A" or "An"

Programmatically determine whether to describe an object with a or an?

What you want is to determine the appropriate indefinite article. Lingua::EN::Inflect is a Perl module that does a great job of this. I've extracted the relevant code and pasted it below. It's just a series of cases and regular expressions, so it shouldn't be difficult to port to PHP. A friend ported it to Python here if anyone is interested.

# 2. INDEFINITE ARTICLES

# THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
# CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
# TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)

my $A_abbrev = q{
(?! FJO | [HLMNS]Y. | RY[EO] | SQU
| ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
[FHLMNRSX][A-Z]
};

# THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
# 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
# IMPLIES AN ABBREVIATION.

my $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';

# EXCEPTIONS TO EXCEPTIONS

my $A_explicit_an = enclose join '|',
(
"euler",
"hour(?!i)", "heir", "honest", "hono",
);

my $A_ordinal_an = enclose join '|',
(
"[aefhilmnorsx]-?th",
);

my $A_ordinal_a = enclose join '|',
(
"[bcdgjkpqtuvwyz]-?th",
);

sub A {
my ($str, $count) = @_;
my ($pre, $word, $post) = ( $str =~ m/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i );
return $str unless $word;
my $result = _indef_article($word,$count);
return $pre.$result.$post;
}

sub AN { goto &A }

sub _indef_article {
my ( $word, $count ) = @_;

$count = $persistent_count
if !defined($count) && defined($persistent_count);

return "$count $word"
if defined $count && $count!~/^($PL_count_one)$/io;

# HANDLE USER-DEFINED VARIANTS

my $value;
return "$value $word"
if defined($value = ud_match($word, @A_a_user_defined));

# HANDLE ORDINAL FORMS

$word =~ /^($A_ordinal_a)/i and return "a $word";
$word =~ /^($A_ordinal_an)/i and return "an $word";

# HANDLE SPECIAL CASES

$word =~ /^($A_explicit_an)/i and return "an $word";
$word =~ /^[aefhilmnorsx]$/i and return "an $word";
$word =~ /^[bcdgjkpqtuvwyz]$/i and return "a $word";

# HANDLE ABBREVIATIONS

$word =~ /^($A_abbrev)/ox and return "an $word";
$word =~ /^[aefhilmnorsx][.-]/i and return "an $word";
$word =~ /^[a-z][.-]/i and return "a $word";

# HANDLE CONSONANTS

$word =~ /^[^aeiouy]/i and return "a $word";

# HANDLE SPECIAL VOWEL-FORMS

$word =~ /^e[uw]/i and return "a $word";
$word =~ /^onc?e\b/i and return "a $word";
$word =~ /^uni([^nmd]|mo)/i and return "a $word";
$word =~ /^ut[th]/i and return "an $word";
$word =~ /^u[bcfhjkqrst][aeiou]/i and return "a $word";

# HANDLE SPECIAL CAPITALS

$word =~ /^U[NK][AIEO]?/ and return "a $word";

# HANDLE VOWELS

$word =~ /^[aeiou]/i and return "an $word";

# HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)

$word =~ /^($A_y_cons)/io and return "an $word";

# OTHERWISE, GUESS "a"
return "a $word";
}
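
To make the structure of the cascade easier to see, here is a minimal Python sketch of the same "first matching rule wins" idea. It is not the Python port mentioned above and not a complete translation: it covers only a subset of the patterns (no abbreviations, ordinals, counts, user-defined variants or the y-consonant rule), and the names `_RULES` and `indefinite_article` are made up for illustration.

```python
import re

# Ordered (pattern, article) rules; the first match wins.
# This is a reduced subset of the Perl rules above, not a full port.
_RULES = [
    (re.compile(r"^(euler|hour(?!i)|heir|honest|hono)", re.I), "an"),  # explicit "an" words
    (re.compile(r"^[aefhilmnorsx]$", re.I), "an"),       # single letters read aloud
    (re.compile(r"^[bcdgjkpqtuvwyz]$", re.I), "a"),
    (re.compile(r"^[^aeiouy]", re.I), "a"),              # ordinary consonants
    (re.compile(r"^e[uw]", re.I), "a"),                  # "euphemism", "ewe"
    (re.compile(r"^onc?e\b", re.I), "a"),                # "one", "once"
    (re.compile(r"^uni([^nmd]|mo)", re.I), "a"),         # "unicorn"
    (re.compile(r"^ut[th]", re.I), "an"),                # "utter"
    (re.compile(r"^u[bcfhjkqrst][aeiou]", re.I), "a"),   # "user", "ubiquity"
    (re.compile(r"^[aeiou]", re.I), "an"),               # remaining vowels
]


def indefinite_article(word):
    """Return the word prefixed with "a" or "an" using the reduced rule list."""
    for pattern, article in _RULES:
        if pattern.search(word):
            return f"{article} {word}"
    return f"a {word}"  # otherwise, guess "a"


print(indefinite_article("hour"))
print(indefinite_article("houri"))
print(indefinite_article("umbrella"))
```

The order of the list matters: the explicit "an" words have to be tested before the generic consonant rule, otherwise "hour" would be caught by the consonant rule first.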

A vs. An - any php library can handle this grammar rule?

See this answer for a somewhat usable solution. The answer contains an excerpt from the Lingua::EN::Inflect Perl module that seems to do a pretty good job of determining which indefinite article to use:

A("cat") # -> "a cat"
AN("cat") # -> "a cat"
A("euphemism") # -> "a euphemism"
A("Euler number") # -> "an Euler number"
A("hour") # -> "an hour"
A("houri") # -> "a houri"

The rules are defined as regular expressions so it shouldn't be too hard to port to PHP.

EDIT: I ended up converting this to PHP (also available on GitHub).

Usage: print IndefiniteArticle::A("umbrella"); // an umbrella

<?php

class IndefiniteArticle
{

public static function AN($input, $count=1) {
return self::A($input, $count);
}

public static function A($input, $count=1) {
$matches = array();
// Bail out early when nothing matches (e.g. an empty string), so that
// list() below never unpacks an empty array.
if(!preg_match("/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i", $input, $matches))
return $input;
list($all, $pre, $word, $post) = $matches;
if(!$word)
return $input;
$result = self::_indef_article($word, $count);
return $pre.$result.$post;
}

# THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
# CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
# TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)

private static $A_abbrev = "(?! FJO | [HLMNS]Y. | RY[EO] | SQU
| ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
[FHLMNRSX][A-Z]
";

# THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
# 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
# IMPLIES AN ABBREVIATION.

private static $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';

# EXCEPTIONS TO EXCEPTIONS

private static $A_explicit_an = "euler|hour(?!i)|heir|honest|hono";

private static $A_ordinal_an = "[aefhilmnorsx]-?th";

private static $A_ordinal_a = "[bcdgjkpqtuvwyz]-?th";

private static function _indef_article($word, $count) {
if($count != 1) // TODO: Check against $PL_count_one instead
return "$count $word";

# HANDLE USER-DEFINED VARIANTS
// TODO

# HANDLE ORDINAL FORMS
if(preg_match("/^(".self::$A_ordinal_a.")/i", $word)) return "a $word";
if(preg_match("/^(".self::$A_ordinal_an.")/i", $word)) return "an $word";

# HANDLE SPECIAL CASES

if(preg_match("/^(".self::$A_explicit_an.")/i", $word)) return "an $word";
if(preg_match("/^[aefhilmnorsx]$/i", $word)) return "an $word";
if(preg_match("/^[bcdgjkpqtuvwyz]$/i", $word)) return "a $word";

# HANDLE ABBREVIATIONS

if(preg_match("/^(".self::$A_abbrev.")/x", $word)) return "an $word";
if(preg_match("/^[aefhilmnorsx][.-]/i", $word)) return "an $word";
if(preg_match("/^[a-z][.-]/i", $word)) return "a $word";

# HANDLE CONSONANTS

if(preg_match("/^[^aeiouy]/i", $word)) return "a $word";

# HANDLE SPECIAL VOWEL-FORMS

if(preg_match("/^e[uw]/i", $word)) return "a $word";
if(preg_match("/^onc?e\b/i", $word)) return "a $word";
if(preg_match("/^uni([^nmd]|mo)/i", $word)) return "a $word";
if(preg_match("/^ut[th]/i", $word)) return "an $word";
if(preg_match("/^u[bcfhjkqrst][aeiou]/i", $word)) return "a $word";

# HANDLE SPECIAL CAPITALS

if(preg_match("/^U[NK][AIEO]?/", $word)) return "a $word";

# HANDLE VOWELS

if(preg_match("/^[aeiou]/i", $word)) return "an $word";

# HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)

if(preg_match("/^(".self::$A_y_cons.")/i", $word)) return "an $word";

# OTHERWISE, GUESS "a"
return "a $word";
}
}

C# Start with a vs an

Let's assume that any word that begins with a vowel will be preceded by "an" and that all other words will be preceded by "a".

string getArticle(string forWord)
{
    var vowels = new List<char> { 'A', 'E', 'I', 'O', 'U' };

    var firstLetter = forWord[0];
    var firstLetterCapitalized = char.ToUpper(firstLetter);
    var beginsWithVowel = vowels.Contains(firstLetterCapitalized);

    if (beginsWithVowel)
        return "an";

    return "a";
}

Could this be simplified and improved? Of course. However, it should serve as a starting point.

Less readable but shorter versions exist, such as:

string getArticle(string forWord) => (new List<char> { 'A', 'E', 'I', 'O', 'U' }).Contains(char.ToUpper(forWord[0])) ? "an" : "a";

However, both of these ignore edge cases such as forWord being null or empty.
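
For comparison, here is the same first-letter heuristic with a guard for missing input, sketched in Python rather than C#. The name `get_article` and the choice to fall back to "a" for `None` or an empty string are assumptions made for this example; raising an error would be just as defensible.

```python
def get_article(for_word):
    """Naive first-letter heuristic; None or empty input falls back to "a" (an assumption)."""
    if not for_word:
        return "a"
    return "an" if for_word[0].upper() in "AEIOU" else "a"


print(get_article("apple"))
print(get_article("banana"))
print(get_article(""))
```

Like the C# versions, this still gets sound-based cases such as "hour" or "unicorn" wrong, because it only looks at the spelling of the first letter.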

How can I correctly prefix a word with a and an?

  1. Download Wikipedia
  2. Unzip it and write a quick filter program that spits out only article text (the download is generally in XML format, along with non-article metadata too).
  3. Find all instances of a(n).... and make an index on the following word and all of its prefixes (you can use a simple suffixtrie for this). This should be case sensitive, and you'll need a maximum word-length - 15 letters?
  4. (optional) Discard all those prefixes which occur fewer than 5 times or where "a" vs. "an" achieves less than a 2/3 majority (or some other thresholds - tweak here). Preferably keep the empty prefix to avoid corner cases.
  5. You can optimize your prefix database by discarding all those prefixes whose parent shares the same "a" or "an" annotation.
  6. When determining whether to use "A" or "AN", find the longest matching prefix and follow its lead. If you didn't discard the empty prefix in step 4, then there will always be a matching prefix (namely the empty prefix); otherwise you may need a special case for a completely non-matching string (such input should be very rare).

You probably can't get much better than this - and it'll certainly beat most rule-based systems.
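
Steps 3, 4 and 6 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the AvsAn implementation: the function names, the `observations` list (standing in for article/word pairs extracted from a real corpus) and the default thresholds are all invented for the example, a plain dictionary replaces the trie, and the step 5 optimization is left out.

```python
from collections import defaultdict

MAX_PREFIX_LEN = 15  # cap on indexed word length suggested in step 3


def build_prefix_index(observations, min_count=1, min_majority=2 / 3):
    """Tally "a"/"an" for every prefix of each word and keep the majority article."""
    counts = defaultdict(lambda: {"a": 0, "an": 0})
    for article, word in observations:
        capped = word[:MAX_PREFIX_LEN]
        for end in range(len(capped) + 1):  # end == 0 is the empty prefix
            counts[capped[:end]][article] += 1

    index = {}
    for prefix, tally in counts.items():
        total = tally["a"] + tally["an"]
        best = "an" if tally["an"] > tally["a"] else "a"
        # Step 4: drop rare or ambiguous prefixes, but always keep the empty prefix.
        if prefix == "" or (total >= min_count and tally[best] / total >= min_majority):
            index[prefix] = best
    return index


def article_for(word, index):
    """Step 6: follow the longest prefix of `word` that is present in the index."""
    capped = word[:MAX_PREFIX_LEN]
    for end in range(len(capped), -1, -1):
        if capped[:end] in index:
            return index[capped[:end]]
    return "a"  # only reachable if the empty prefix was discarded


# Hypothetical observations; a real run would extract these from article text.
observations = [
    ("a", "house"), ("a", "honeysuckle"), ("an", "honest"), ("an", "hour"),
    ("a", "unanimous"), ("an", "unanticipated"), ("a", "cat"), ("an", "apple"),
]
index = build_prefix_index(observations)
print(article_for("honesty", index))
print(article_for("honeymoon", index))
```

With this toy data, ambiguous prefixes such as "h" and "hon" are dropped, so unseen words are decided by longer prefixes like "honest" or "honey", or by the empty prefix when nothing else matches.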

Edit: I've implemented this in JS/C#. You can try it in your browser, or download the small, reusable JavaScript implementation it uses. The .NET implementation is the AvsAn package on NuGet. The implementations are trivial, so it should be easy to port to any other language if necessary.

Turns out the "rules" are quite a bit more complex than I thought:

  • it's an unanticipated result but it's a unanimous vote
  • it's an honest decision but a honeysuckle shrub
  • Symbols: It's an 0800 number, or an ∞ of oregano.
  • Acronyms: It's a NASA scientist, but an NSA analyst; a FIAT car but an FAA policy.

...which just goes to underline that a rule-based system would be tricky to build!

Determine the type of an object?

There are two built-in functions that help you identify the type of an object. You can use type() if you need the exact type of an object, and isinstance() to check an object’s type against something. Most of the time you want isinstance(), since it is robust and also supports type inheritance.


To get the actual type of an object, you use the built-in type() function. Passing an object as the only parameter will return the type object of that object:

>>> type([]) is list
True
>>> type({}) is dict
True
>>> type('') is str
True
>>> type(0) is int
True

This of course also works for custom types:

>>> class Test1 (object):
...     pass
...
>>> class Test2 (Test1):
...     pass
...
>>> a = Test1()
>>> b = Test2()
>>> type(a) is Test1
True
>>> type(b) is Test2
True

Note that type() will only return the immediate type of the object, but won’t be able to tell you about type inheritance.

>>> type(b) is Test1
False

To cover that, you should use the isinstance function. This of course also works for built-in types:

>>> isinstance(b, Test1)
True
>>> isinstance(b, Test2)
True
>>> isinstance(a, Test1)
True
>>> isinstance(a, Test2)
False
>>> isinstance([], list)
True
>>> isinstance({}, dict)
True

isinstance() is usually the preferred way to ensure the type of an object because it will also accept derived types. So unless you actually need the type object (for whatever reason), using isinstance() is preferred over type().

The second parameter of isinstance() also accepts a tuple of types, so it’s possible to check for multiple types at once. isinstance() then returns True if the object is of any of those types:

>>> isinstance([], (tuple, list, set))
True
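
One place where the difference between type() and isinstance() shows up with built-in types is bool, which is a subclass of int. The helper below is a small sketch of how the tuple form might be used in practice; the name `is_number` and the decision to exclude booleans are assumptions for the example.

```python
def is_number(value):
    """True for ints and floats, but not for bools (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


print(isinstance(True, int))  # bool inherits from int
print(type(True) is int)      # the exact type is bool, not int
print(is_number(3), is_number(2.5), is_number(True), is_number("3"))
```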

How to programmatically determine if the class is a case class or a simple class?

Currently (2011), you can use reflection to find out if the class implements the interface scala.Product:

scala> def isCaseClass(o: AnyRef) = o.getClass.getInterfaces.find(_ == classOf[scala.Product]) != None
isCaseClass: (o: AnyRef)Boolean

scala> isCaseClass(Some(1))
res3: Boolean = true

scala> isCaseClass("")
res4: Boolean = false

This is just an approximation - you could go further and check if it has a copy method, if it implements Serializable, if it has a companion object with an appropriate apply or unapply method - in essence, check for all the things expected from a case class using reflection.

The Scala reflection package coming in one of the next releases should make case class detection easier and more precise.

EDIT:

You can now do it using the new Scala Reflection library -- see other answer.

How to check whether an object has certain method/property?

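You could write something like this:

public static class ObjectExtensions
{
    // Extension methods must live in a static class; the class name is arbitrary.
    public static bool HasMethod(this object objectToCheck, string methodName)
    {
        var type = objectToCheck.GetType();
        // GetMethod returns null when no public method has this name.
        return type.GetMethod(methodName) != null;
    }
}

Edit: since this is an extension method, you can call it like this: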

myObject.HasMethod("SomeMethod");

In python at runtime determine if an object is a class (old and new type) instance

While the poster most likely needs to rethink their design, in some cases there is a legitimate need to distinguish between instances of built-in/extension types, created in C, and instances of classes created in Python with the class statement. While both are types, the latter are a category of types that CPython internally calls "heap types" because their type structures are allocated at run-time. That Python continues to distinguish them can be seen in __repr__ output:

>>> int       # "type"
<type 'int'>
>>> class X(object): pass
...
>>> X # "class"
<class '__main__.X'>

The __repr__ distinction is implemented exactly by checking whether the type is a heap type.

Depending on the exact needs of the application, an is_class_instance function can be implemented in one of the following ways:

# Built-in types such as int or object do not have __dict__ by
# default. __dict__ is normally obtained by inheriting from a
# dictless type using the class statement. Checking for the
# existence of __dict__ is an indication of a class instance.
#
# Caveat: a built-in or extension type can still request instance
# dicts using tp_dictoffset, and a class can suppress it with
# __slots__.
def is_class_instance(o):
    return hasattr(o, '__dict__')

# A reliable approach, but one that is also more dependent
# on the CPython implementation.
Py_TPFLAGS_HEAPTYPE = (1<<9) # Include/object.h
def is_class_instance(o):
    return bool(type(o).__flags__ & Py_TPFLAGS_HEAPTYPE)

EDIT

Here is an explanation of the second version of the function. It tests whether the type is a "heap type" using the same test that CPython uses internally for its own purposes. That ensures that it will always return True for instances of heap types ("classes") and False for instances of non-heap types ("types", but also old-style classes, which is easy to fix). It does that by checking whether the tp_flags member of the C-level PyTypeObject structure has the Py_TPFLAGS_HEAPTYPE bit set. The weak part of the implementation is that it hardcodes the value of the Py_TPFLAGS_HEAPTYPE constant to the currently observed value. (This is necessary because the constant is not exposed to Python by a symbolic name.) While in theory this constant could change, that is highly unlikely in practice because such a change would gratuitously break the ABI of existing extension modules. Looking at the definitions of the Py_TPFLAGS constants in Include/object.h, it is apparent that new ones are being carefully added without disturbing the old ones. Another weakness is that this code has zero chance of running on a non-CPython implementation, such as Jython or IronPython.
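
The two checks disagree in exactly the cases named in the caveat above, which is easiest to see side by side. The sketch below assumes CPython 3 (where old-style classes no longer exist); the two functions are renamed `has_instance_dict` and `is_heap_type_instance` purely so both can live in one snippet, and `Plain`, `Slotted` and `some_function` are throwaway names.

```python
Py_TPFLAGS_HEAPTYPE = 1 << 9  # hardcoded from Include/object.h; CPython-specific


def has_instance_dict(o):
    """First approach: does the object expose __dict__?"""
    return hasattr(o, "__dict__")


def is_heap_type_instance(o):
    """Second approach: is the object's type a heap type?"""
    return bool(type(o).__flags__ & Py_TPFLAGS_HEAPTYPE)


class Plain:
    pass


class Slotted:
    __slots__ = ()  # suppresses the per-instance __dict__


def some_function():
    pass


# A plain function is an instance of a built-in type that still carries __dict__.
for obj in (Plain(), Slotted(), some_function, 42):
    print(type(obj).__name__, has_instance_dict(obj), is_heap_type_instance(obj))
```

The `Slotted` instance and the function object are the two rows where the answers differ: the first is a class instance without `__dict__`, the second a built-in type instance with one.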


