CamelCase conversion to friendly name, i.e. Enum constants; Problems? - c#

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
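For readers who want to try the two-pass idea outside .NET, here is a minimal Python sketch of the same two replacements. The function name friendly_name is made up for illustration, and the patterns use numbered groups instead of named ones; it is a port under those assumptions, not the original C# code.

```python
import re


def friendly_name(identifier: str) -> str:
    """Split an UpperCamelCase identifier into space-separated words."""
    # Pass 1: put a space before an uppercase letter that is followed by a
    # lowercase letter, unless that uppercase letter starts the string.
    result = re.sub(r"(?<!^)([A-Z][a-z])", r" \1", identifier)
    # Pass 2: put a space between a lowercase letter and a following
    # uppercase letter or digit (middle/trailing acronyms, numbers).
    result = re.sub(r"([a-z])([A-Z0-9])", r"\1 \2", result)
    return result
```

Calling friendly_name("MyTrailingAcronymID") should give "My Trailing Acronym ID", matching the behaviour described for the two Replace calls.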
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.

Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Related

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (say, 20 characters)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. the original said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think what you need is not a regex, but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table of all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. If you use SQLite, note that soundex() is available only when SQLite is built with the SQLITE_SOUNDEX compile-time option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln, soundex('mshpeln') = M214. There you go: this way you can get back the correct word.
But this would not look anything like regex - sorry.
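The lookup described above can be sketched in a few lines of Python, with an in-memory dict standing in for the indexed table. This is a simplified American Soundex written for illustration; the names soundex, build_index and candidates are assumptions, and a database's built-in soundex() may differ in edge cases.

```python
from collections import defaultdict

# Letter groups of American Soundex; vowels and h, w, y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or '' for input without letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit counts again
    return (code + "000")[:4]


def build_index(valid_words):
    """Map each Soundex code to the dictionary words that share it."""
    index = defaultdict(list)
    for word in valid_words:
        index[soundex(word)].append(word)
    return index


def candidates(bad_word, index):
    """Return dictionary words whose Soundex code matches the bad word's."""
    return list(index.get(soundex(bad_word), []))
```

With an index built from ["misspelling", "hello"], candidates("mshpeln", index) returns ["misspelling"], since both words reduce to M214.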
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.
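The three steps above can be sketched in Python roughly as follows. The function name possible_matches and the min_shared parameter are assumptions for illustration; the sketch splits on whitespace only and ignores letter order, exactly like the set-based description, so it will produce false positives.

```python
def possible_matches(text: str, target: str, min_shared: int = 5):
    """Return words in text that share at least min_shared distinct letters with target."""
    # Step 1: unique letters of the word we are looking for.
    target_letters = {ch for ch in target.lower() if ch.isalpha()}
    matches = []
    # Step 2: split the body of text into words.
    for word in text.split():
        # Step 3: intersect the word's unique letters with the target's.
        word_letters = {ch for ch in word.lower() if ch.isalpha()}
        if len(target_letters & word_letters) >= min_shared:
            matches.append(word)
    return matches
```

For "misspelling" the target set is {e, i, g, l, m, n, p, s}, so a garbled "mshpeln" shares six letters with it and is reported as a possible match.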
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error that was made by the OCR algorithm, as it can range between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Regex MatchCollection obj hangs / "Function evaluation timed out" after Regex.Matches

I'm kind of new to C#, and to regular expressions for that matter, but I've searched for a couple of hours to find a solution to this problem, so hopefully this is easy for you guys :)
My application uses a regex to match email addresses in a given string,
then loops through the matches:
String EmailPattern = "\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection (mcemail) object "hangs" the loop. When using a breakpoint and accessing the object, I get "Function evaluation timed out" on everything (.Count etc.).
Update
I've tried my pattern and other email patterns on the same string; every tool (regex designers, Python-based web pages etc.) fails or times out when trying to match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?
If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the @ with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
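As a rough illustration of that suggestion, here is a Python sketch that keeps the original local part (with a non-capturing group) and swaps in the simplified domain part. EMAIL_PATTERN and find_emails are names invented here, and the sketch has not been benchmarked against the page from the question; the loose domain part will also swallow things like a trailing dot.

```python
import re

# Local part as in the question (non-capturing group), then the simplified
# domain part suggested above: \w[-\w]*\.[-.\w]+
EMAIL_PATTERN = re.compile(r"\w+(?:[-+.]\w+)*@\w[-\w]*\.[-.\w]+")


def find_emails(text: str):
    """Return all substrings of text that look like email addresses."""
    return EMAIL_PATTERN.findall(text)
```

Because the domain part no longer has three overlapping sub-patterns competing for the same dots, there is only one way for it to consume a run such as example.co.uk.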
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 8 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.
I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(@"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
    rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
    //...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check, you could include only lines that contain the @ character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
    rawHtmlLines.Where(line => line.IndexOf('@') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());
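The same pre-filtering idea, sketched in Python for anyone testing outside .NET. The function name keep_candidate_lines is an assumption; it only mirrors the line filter described above and does no matching itself.

```python
def keep_candidate_lines(raw_html: str) -> str:
    """Keep only lines that could contain an email address.

    A line is kept when it contains an '@' and does not carry the
    ViewState field, which cannot hold a readable address.
    """
    kept = [
        line
        for line in raw_html.splitlines()
        if "@" in line and "_VIEWSTATE" not in line
    ]
    return "\n".join(kept)
```

The regex then only has to scan the (usually much shorter) filtered text.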
From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly short timeouts with regard to how long a method may take. Not everything happens quickly. I would suggest doing the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time: that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet doesn't mean that value will work well with Matches. The ultimate validation logic is up to you, but starting with the length of rawHtmlLines might be useful (i.e. if the length is 1000000 bytes, Matches might take a while). But you have to decide what to do if the length is too long, e.g. give an error to the user.

Camel casing acronyms?

This question may seem pedantic or just silly, but what is your practice for camel casing when it comes to acronyms? Do you insist that everything, even acronyms, must be camel cased, or do you make an exception for acronyms? Explanations would be great too. I'm not sure how this practice affects IDE features (autocomplete) or what the industry standards are.
For C#, check out Microsoft's guidelines:
Do capitalize both characters of two-character acronyms, except the first word of a camel-cased identifier.
A property named DBRate is an example of a short acronym (DB) used as the first word of a Pascal-cased identifier. A parameter named ioChannel is an example of a short acronym (IO) used as the first word of a camel-cased identifier.
Do capitalize only the first character of acronyms with three or more characters, except the first word of a camel-cased identifier.
A class named XmlWriter is an example of a long acronym used as the first word of a Pascal-cased identifier. A parameter named htmlReader is an example of a long acronym used as the first word of a camel-cased identifier.
Do not capitalize any of the characters of any acronyms, whatever their length, at the beginning of a camel-cased identifier.
A parameter named xmlStream is an example of a long acronym (xml) used as the first word of a camel-cased identifier. A parameter named dbServerName is an example of a short acronym (db) used as the first word of a camel-cased identifier.
Personal preference.
I tend to do it just because it doesn't merge well with other words, like, XMLHTTPParser, compared to XmlHttpParser. Do whatever makes you feel good, but do it in a standard way.
Here's what i like, and this is for Java: classes start with upper-case, fields with lower case, and acronyms do not affect that. That leads to things that look like this,
UrlConnection urlConnection;
The problem is that if you try to apply a rule where you always upper case acronyms, or even the first letter of an acronym irrespective of it being a field or class name, you get strange things like,
URLConnection URLConnection; // huh?
In other words, the field starts with lower case rule contradicts with a hypothetical uppercase acronym rule. You can't apply them both.
Even the Java SDK has examples of both, within a single class name: HttpURLConnection. You'd think it would be either HTTPURLConnection or HttpUrlConnection.
In general, treating acronyms the same as the overall word case is the most intuitive, for the following reasons.
It helps avoid memorization of a bunch of special rules (case in point, the MS guidelines above)
The lettercase of specific terms may change over time or between personal interpretations like with posix, unix, regex, radar, scuba, laser, and email
It can otherwise get visually clunky when you need to deal with (and/or combine) tech terms like miniiOSLayout, SSHHTTPSSession, CPUURLLinkC, and schemaeBay (granted these specific examples are unlikely, they help illustrate the problem)
It removes some confusion if the language uses different cases for different things, like capitalizing AClassName but not anObjectName
Yes, httpUrl looks dumb. But the more you code, the more you're likely to realize the above points on your own. And while not a huge deal in and of itself, a lot of little things can add up and create frustration with your work.
We have no hard & fast rule, but we generally do not camel case acronyms. A few with more than three letters are but most aren't.
Generally, our acronyms are PascalCased or camelCased as most have stated.
Some exceptions:
If an acronym used in a member name is well known in the business for which the software is being written, and it is a true acronym (the capital letters form a dictionary word, as opposed to just an initialism like XML), we often capitalize it to avoid confusion with the dictionary word.
Sometimes in ORMs working against existing DBs, I've just named the mapped variable the same as the DB column, capitalization and all, rather than having to map FdicId => FDICID explicitly in a case-sensitive DB. This does have its downside, as future developers can silently break functionality if they feel more strongly than I did that it should be properly cased, but didn't know why.
ID is a bit of a flip-flopper when used on the end of a member name; Whether it's ID or Id depends on the developer who writes the first such member in a class or namespace, and they're seldom revised.
Depends on the length of acronyms also.
DB--->looks decent
openDBConnection
HTTP--->looks odd
openHTTPConnection
Apart from rules:
Readability matters a lot to me.
Consistency shows your effort on programming.
So, do it consistently and legibly.

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.
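If a regex is still wanted after the user has typed a plain list of allowed characters, escaping each character individually avoids handling every special case by hand. Below is a minimal Python sketch of that idea; build_validator and is_valid are hypothetical names, and the approach assumes the input is a flat list of characters with no ranges or separators.

```python
import re


def build_validator(allowed_chars: str):
    """Compile a pattern matching strings made only of allowed_chars."""
    if not allowed_chars:
        # An empty character class is not valid regex syntax; with no
        # allowed characters, only the empty string can be accepted.
        return re.compile(r"")
    # re.escape neutralises characters such as ], \, ^ and - that would
    # otherwise change the meaning of the character class.
    char_class = "".join(re.escape(ch) for ch in sorted(set(allowed_chars)))
    return re.compile(f"[{char_class}]*")


def is_valid(validator, text: str) -> bool:
    """True when every character of text is in the allowed set."""
    return validator.fullmatch(text) is not None
```

For example, a validator built from "abc *{}()[]\\|-^" accepts "[a-b]^" but rejects "abd", because d was never listed.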

How can I correctly prefix a word with "a" and "an"?

I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that?
Before you think the answer is to simply check if the first letter is a vowel, consider phrases like:
an honest mistake
a used car
1) Download Wikipedia.
2) Unzip it and write a quick filter program that spits out only article text (the download is generally in XML format, along with non-article metadata too).
3) Find all instances of a(n).... and make an index on the following word and all of its prefixes (you can use a simple suffix trie for this). This should be case sensitive, and you'll need a maximum word length - 15 letters?
4) (optional) Discard all those prefixes which occur fewer than 5 times or where "a" vs. "an" achieves less than a 2/3 majority (or some other thresholds - tweak here). Preferably keep the empty prefix to avoid corner cases.
5) You can optimize your prefix database by discarding all those prefixes whose parent shares the same "a" or "an" annotation.
6) When determining whether to use "A" or "AN", find the longest matching prefix and follow its lead. If you didn't discard the empty prefix in step 4, then there will always be a matching prefix (namely the empty prefix); otherwise you may need a special case for a completely non-matching string (such input should be very rare).
You probably can't get much better than this - and it'll certainly beat most rule-based systems.
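The final lookup step can be sketched in Python as a longest-prefix search. The table below is a tiny hand-written stand-in, not data derived from Wikipedia, and the names PREFIX_ARTICLES and article_by_prefix are assumptions; unlike the case-sensitive index suggested above, this sketch lowercases its input to stay short.

```python
# Hand-written illustration only: a real table would be generated from
# counted "a ..." / "an ..." occurrences as described in the steps above.
PREFIX_ARTICLES = {
    "": "a",  # the empty prefix guarantees that a match always exists
    "a": "an", "e": "an", "i": "an", "o": "an", "u": "an",
    "eu": "a", "use": "a", "unani": "a",
    "hones": "an", "hour": "an", "heir": "an",
}


def article_by_prefix(word: str, table=PREFIX_ARTICLES) -> str:
    """Return the article stored for the longest prefix of word in table."""
    word = word.lower()
    # Try the whole word first, then ever shorter prefixes down to "".
    for length in range(len(word), -1, -1):
        article = table.get(word[:length])
        if article is not None:
            return article
    return "a"  # only reached if the table lacks the empty prefix
```

With this toy table, "honest" resolves through the prefix "hones" to "an", while "honeysuckle" falls through to the empty prefix and gets "a".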
Edit: I've implemented this in JS/C#. You can try it in your browser, or download the small, reusable javascript implementation it uses. The .NET implementation is package AvsAn on nuget. The implementations are trivial, so it should be easy to port to any other language if necessary.
Turns out the "rules" are quite a bit more complex than I thought:
it's an unanticipated result but it's a unanimous vote
it's an honest decision but a honeysuckle shrub
Symbols: It's an 0800 number, or an ∞ of oregano.
Acronyms: It's a NASA scientist, but an NSA analyst; a FIAT car but an FAA policy.
...which just goes to underline that a rule based system would be tricky to build!
You need to use a list of exceptions. I don't think all of the exceptions are well defined, because it sometimes depends on the accent of the person saying the word.
One stupid way is to ask Google for the two possibilities (using one of the search APIs) and use the most popular:
http://www.google.co.uk/search?q=%22a+europe%22 - 841,000 hits
http://www.google.co.uk/search?q=%22an+europe%22 - 25,000 hits
Or:
http://www.google.co.uk/search?q=%22a+honest%22 - 797,000 hits
http://www.google.co.uk/search?q=%22an+honest%22 - 8,220,000 hits
Therefore "a europe" and "an honest" are the correct versions.
If you could find a source of word spellings to word pronunciations, like:
"honest":"on-ist"
"horrible":"hawr-uh-buhl, hor-"
You could base your decision on the first character of the spelled pronunciation string.
For performance, perhaps you could use such a lookup to pre-generate exception sets and use those smaller lookup sets during execution instead.
Edited to add:
!!! - I think you could use this to generate your exceptions:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Not everything will be in the dictionary, of course - meaning not every possible exception would wind up in your exceptions sets - but in that case, you could just default to an for vowels/ a for consonants or use some other heuristic with better odds.
(Looking through the CMU dictionary, I was pleased to see it includes proper nouns for countries and some other places - so it will handle examples like "a Ukrainian", "a USA Today paper", "a Urals-inspired painting".)
Editing once more to add: The CMU dictionary does not contain common acronyms, and you have to worry about those starting with s,f,l,m,n,u,and x. But there are plenty of acronym lists out there, like in Wikipedia, which you could use to add to the exceptions.
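A minimal Python sketch of the pronunciation-based decision follows. The four entries are written by hand in the CMU dictionary's ARPAbet style purely for illustration (check them against the real file before relying on them), and the names ARPABET_VOWELS, SAMPLE_PRONUNCIATIONS and article_by_sound are assumptions. Words missing from the table fall back to the plain vowel-letter rule mentioned above.

```python
# The vowel phonemes of the ARPAbet set used by the CMU dictionary.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
VOWEL_LETTERS = set("aeiou")

# Illustrative entries in CMUdict style, not loaded from the real file.
SAMPLE_PRONUNCIATIONS = {
    "honest": "AA1 N AH0 S T",
    "hour": "AW1 ER0",
    "used": "Y UW1 Z D",
    "europe": "Y UH1 R AH0 P",
}


def article_by_sound(word: str, pronunciations=SAMPLE_PRONUNCIATIONS) -> str:
    """Pick 'a' or 'an' from the first phoneme, else from the first letter."""
    phonemes = pronunciations.get(word.lower())
    if phonemes:
        # Drop the stress digit (0, 1 or 2) that follows vowel phonemes.
        first = phonemes.split()[0].rstrip("012")
        return "an" if first in ARPABET_VOWELS else "a"
    # Fallback heuristic for words that are not in the dictionary.
    return "an" if word[:1].lower() in VOWEL_LETTERS else "a"
```

Here "honest" starts with the vowel phoneme AA and gets "an", while "used" and "europe" start with Y and get "a".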
You have to implement it manually and add the exceptions you want, for example when the first letter is 'H' followed by an 'O', as in honest, hour ... and also the opposite ones like europe, university, used ...
Since "a" and "an" is determined by phonetic rules and not spelling conventions, I would probably do it like this:
If the first letter of the word is a consonant -> 'a'
If the first letter of the word is a vowel-> 'an'
Keep a list of exceptions (hour, x-ray, university) as rjumnro says.
You need to look at the grammatical rules for indefinite articles (there are only two indefinite articles in English grammar - "a" and "an"). You may not agree these sound correct, but the rules of English grammar are very clear:
"The words a and an are indefinite articles. We use the indefinite article an before words that begin with a vowel sound (a, e, i, o, u) and the indefinite article a before words that begin with a consonant sound (all other letters)."
Note this means a vowel sound, and not a vowel letter. For instance, words beginning with a silent "h", such as "honour" or "heir", are treated as starting with vowels and so are preceded by "an" - for example, "It is an honour to meet you". Words beginning with a consonant sound are prefixed with "a" - which is why you say "a used car" rather than "an used car" - because "used" starts with a "yoose" sound rather than an "uhh" sound.
So, as a programmer, these are the rules to follow. You just need to work out a way of determining what sound a word begins with, rather than what letter. I've seen examples of this, such as this one in PHP by Jaimie Sirovich :
function aOrAn($next_word)
{
    // Stems that take "an" despite a consonant letter, and "a" despite a vowel letter.
    $_an = array('hour', 'honest', 'heir', 'heirloom');
    $_a = array('use', 'useless', 'user');
    $_vowels = array('a','e','i','o','u');
    $_endings = array('ly', 'ness', 'less', 'lessly', 'ing', 'ally', 'ially');
    $_endings_regex = implode('|', $_endings);
    // Take the first word, up to a hyphen, a space, or the end of the string.
    $tmp = preg_match('#(.*?)(-| |$)#', $next_word, $captures);
    $the_word = trim($captures[1]);
    //$the_word = Format::trimString(Utils::pregGet('#(.*?)(-| |$)#', $next_word, 1));
    // An "an" stem followed by one of the listed endings.
    $_an_regex = implode('|', $_an);
    if (preg_match("#($_an_regex)($_endings_regex)#i", $the_word)) {
        return 'an';
    }
    // An "a" stem followed by one of the listed endings.
    $_a_regex = implode('|', $_a);
    if (preg_match("#($_a_regex)($_endings_regex)#i", $the_word)) {
        return 'a';
    }
    // Otherwise the first letter decides.
    if (in_array(strtolower(substr($the_word, 0, 1)), $_vowels)) {
        return 'an';
    }
    return 'a';
}
It's probably easiest to create the rule and then create a list of exceptions and use that. I don't imagine there will be that many.
Man, I realize that this is probably a settled argument, but I think it can be settled easier than using ad hoc grammar rules from Wikipedia, which would derive vernacular grammar, at best.
The best solution, it seems, is to have the use of a or an trigger a phoneme-based matching of the following word, with certain phonemes always associated with "an" and the remaining belonging to "a".
Carnegie Mellon University has a great online tool for these kinds of checks - http://www.speech.cs.cmu.edu/cgi-bin/cmudict - with about 125k words and a matching set of 39 phonemes. Plugging a word in provides its entire phoneme sequence, of which only the first phoneme is important.
If the word does not appear in the dictionary, such as "NSA" and is all capitalized, then the system can assume the word is an Acronym and use the first letter to determine which indefinite article to use based on the same original rule set.
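For the all-capitals fallback, one possible reading is to go by how the first letter's name is pronounced when the acronym is spelled out. The Python sketch below encodes that assumption; the letter set and the name article_for_initialism are not from the original answer, they reflect common pronunciations of English letter names ("en", "eff", "ar" and so on). It gives the wrong result for acronyms spoken as words, such as NASA.

```python
# Letters whose spoken names begin with a vowel sound, e.g. "eff", "aitch",
# "el", "em", "en", "ar", "ess", "ex". This is an assumption about the
# dialect; speakers who say "haitch" would drop H from the set.
VOWEL_SOUND_LETTER_NAMES = set("AEFHILMNORSX")


def article_for_initialism(initialism: str) -> str:
    """Pick 'a' or 'an' for an acronym that is read letter by letter."""
    first = initialism[:1].upper()
    return "an" if first in VOWEL_SOUND_LETTER_NAMES else "a"
```

Under this rule "NSA" and "RPM" take "an", while "USA" takes "a" because the letter U is pronounced "you".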
@Nathan Long:
Downloading wikipedia is actually not a bad idea. All images, videos and other media is not needed.
I wrote a (crappy) program in PHP and JavaScript(!) to read the entire Swedish Wikipedia (or at least all articles that could be reached from the article about math, which was the starting point for my spider).
I collected all words and internal links in a database, and also kept track of the frequency of every word. I now use that as a word database for various tasks:
* Finding all words that can be created from a given set of letters (including wildcard)
* Created a simple syntax file for Swedish (all words not in the database are considered incorrect).
Oh, and downloading the entire wiki took about one week, using my laptop running most of the time, with 10Mbit connection.
When you're at it, log all occurrences that are inconsistent with the english language and see if some of them are mistakes. Go fix 'em and give something back to the community.
Note that there are differences between American and British dialects, as Grammar Girl pointed out in her episode A Versus An.
One complication is when words are pronounced differently in British and American English. For example, the word for a certain kind of plant is pronounced “erb” in American English and “herb” in British English. In the rare cases where this is a problem, use the form that will be expected in your country or by the majority of your readers.
Take a look at Perl's Lingua::EN::Inflect. See sub _indef_article in the source code.
I've ported a function from Python (originally from CPAN package Lingua-EN-Inflect) that correctly determines vowel sounds in C# and posted it as an answer to the question Programmatically determine whether to describe an object with a or an?. You can see the code snippet here.
Could you get an English dictionary that stores the words written in our regular alphabet, and in the International Phonetic Alphabet?
Then use the phonetics to figure out the beginning sound of the word, and thus whether “a” or “an” is appropriate?
Not sure if that would actually be easier than (or as much fun as) the statistical Wikipedia approach.
I would use a rule-based algorithm to cover as many as I could, then use a list of exceptions. If you wanted to get fancy, you could try to determine some new "rules" from your exception list.
It just looks like a set of heuristics. It needs to be a bit more complicated and answer some things which I never got a good answer for, for example how do you treat abbreviations ("a RPM" or "an RPM"? I always thought the latter one makes more sense).
A quick search yielded no linguistic libraries that talk about how to handle the English singular prefix, but you can probably find something if you dig deep enough. And if not - you can always write your own inflection library and gain world fame :-) .
I don't suppose you can just fill in some boilerplate stuff like 'a/an' as a one-step cover-all. Otherwise you will end up with assumption errors, like all words with 'h' followed by 'o' getting 'an' instead of 'a', as in 'home' (an home?). Basically, you will end up including the logic of the English language or occasionally find rare cases that will make you look foolish.
Check for whether a word starts with a vowel or a consonant. An initial "u" is generally a consonant plus a vowel ("yu"), hence belongs in the consonant group for your purposes.
The letter "h" stands for a glottal stop (a consonant) in French and in French words used in English. You can make a list of those (in fact, including "honor", "honour", and "hour" might be sufficient) and count them as starting with vowels (since English doesn't recognise a glottal stop).
Also count "eu" as a consonant etc.
It's not too difficult.
The choice of "an" or "a" depends on the way the word is pronounced. By looking at the word you can't necessarily tell its correct pronunciation, e.g. for jargon or an abbreviation.
One of the ways can be to have a dictionary with support for phonemes and use the phoneme information associated with the word to determine whether an "a" or an "an" should be used.
I can't be certain that it has the appropriate information in it to differentiate "a" and "an", but Princeton's WordNet database exists precisely for the purpose of similar sorts of tasks, so I think it's likely that the data is in there. It has some tens of thousands of words and hundreds of thousands of relationships between said words (IIRC; I can't find the current statistics on the site). Give it a look. It's freely downloadable.
How? How about when? Get the noun with article attached. Ask for it in a specific form.
Ask for the noun with the article. Many a MUD codebase store items as information consisting of:
one or more keywords
a short form
a long form
The keyword form might be "short sword rusty". The short form will be "a sword". The long form will be "a rusty short sword".
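The item layout described above could look something like this in Python. The class and field names are invented for this sketch and do not come from any particular MUD codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    keywords: tuple   # words a player can use to refer to the item
    short_form: str   # already includes the article
    long_form: str    # already includes the article


rusty_sword = Item(
    keywords=("short", "sword", "rusty"),
    short_form="a sword",
    long_form="a rusty short sword",
)


def describe(item: Item, verbose: bool = False) -> str:
    """Pick a stored form; no article logic is needed at display time."""
    return item.long_form if verbose else item.short_form
```

Because whoever authors the item writes the article once, the program never has to choose between "a" and "an" itself.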
Are you writing an "a vs. an" Web service? Take a step back and look at if you can attack this leak further upstream. You can build a dam, but unless you stop it from flowing, it will spill over eventually.
Determine how critical this is, and as others have suggested, go for "quick but crude", or "expensive but sturdy".
The rule is very simple. If the next word starts with a vowel sound then use 'an', if it starts with a consonant sound then use 'a'. The hard thing is that our school classification of vowels and consonants doesn't work. The 'h' in 'honour' is silent, so the word starts with a vowel sound, but the 'h' in 'hospital' is a consonant.
Even worse, some words like 'honest' start with a vowel or a consonant depending on who is saying them. Even worse, some words change depending on the words around them for some speakers.
The problem is bounded only by how much time and effort you want to put into it. You can write something using 'aeiou' as vowels in a couple of minutes, or you can spend months doing linguistic analysis of your target audience. Between them are a huge number of heuristics which will be right for some speakers and wrong for others -- but because different speakers have different determinations for the same word it simply isn't possible to be right all of the time no matter how you do it.
The ideal approach would be to find someplace online that can give you the answers, dynamically query them and cache the answers. You can prime the system with a few hundred words for starters.
(I don't know of such an online source, but I wouldn't be surprised if there is one.)
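The query-and-cache idea might be sketched like this in Python. Since no specific online source is known, the `lookup` callable is a stand-in for whatever remote query you end up with, and the class name is made up for this example.

```python
class ArticleCache:
    """Caches article decisions so each word is looked up at most once."""

    def __init__(self, lookup, seed=None):
        self._lookup = lookup            # hypothetical: word -> "a" or "an"
        self._cache = dict(seed or {})   # prime with known words

    def article_for(self, word: str) -> str:
        key = word.lower()
        if key not in self._cache:
            self._cache[key] = self._lookup(key)  # the slow call happens only on a miss
        return self._cache[key]
```

Seeding the cache with a few hundred common words means most requests never reach the slow lookup at all.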
So, a reasonable solution is possible without downloading all of the internet. Here's what I did:
I remembered that Google published the raw data for its Google Books N-Gram frequencies. So I downloaded the 2-gram files for "a_" and "an". It's about 26 gigs if I recall correctly. From that I produced a list of strings that were overwhelmingly preceded by the opposite article from the one you'd expect (expecting "an" before a vowel letter and "a" otherwise). That final list I was able to store in under 7 kilobytes.
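The filtering step can be sketched roughly as follows in Python. This is not the code actually used for that list: the function name, the `ratio` threshold, and the shape of `counts` are assumptions, and the sample numbers in the usage note are invented.

```python
VOWEL_LETTERS = frozenset("aeiou")


def find_exceptions(counts, ratio=10.0):
    """Return {word: article} for words that defy the naive vowel-letter rule.

    counts maps word -> (times seen after "a", times seen after "an").
    """
    exceptions = {}
    for word, (a_count, an_count) in counts.items():
        expects_an = word[:1].lower() in VOWEL_LETTERS
        if expects_an and a_count > ratio * an_count:
            exceptions[word] = "a"    # vowel letter, yet overwhelmingly "a"
        elif not expects_an and an_count > ratio * a_count:
            exceptions[word] = "an"   # consonant letter, yet overwhelmingly "an"
    return exceptions
```

With made-up counts such as `{"hour": (12, 9500), "university": (8700, 30), "apple": (5, 4000)}`, only "hour" and "university" end up in the result, which is why the final list can stay so small.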
Rather than writing code that could be culture-dependent and have numerous exceptions I tend to rework the statement that includes the indefinite article. For example, rather than saying "This customer wants to live in a Single-Family Home.", you could say "This customer wants a housing type of 'Single-Family Home'." That way, the indefinite article is not dependent on the variable - e.g., "This customer wants a housing type of 'Apartment'."
I'd like to synthesize a few of the given answers, and contribute my own solutions as well.
Let's start with some basic heuristics:
1. Start with the first letter of the word.
If it starts with an "a", "i" or "o", then use "an". As far as I know, words starting with those letters always begin with an actual vowel sound (rare exceptions such as "one" and "once", which begin with a "w" sound, aside).
If it starts with an "e", then it will be pronounced as a vowel, unless it is followed by a "u" (e.g., euphonium, eugenics, euphoric, euphemism, etc.). This would be the case with "i" as well, in the unlikely cases of "Iuka", "Iuliyanov", and "IUPAC". (https://en.wiktionary.org/w/index.php?title=Category:English_terms_with_IPA_pronunciation&from=iu)
If it starts with a "b", "c", "d", "g", "k", "p", "q", "t", "v", "w", or "z", then it is guaranteed to be a consonant, and pronounced like a consonant.
If it starts with an "f", "l", "m", "n", "r", "s", or "x", it may be pronounced with a vowel, but only if it's in an acronym. Otherwise, it's guaranteed to be pronounced as a consonant.
If it begins with a "u", or with an "h", "j", or "y", then it falls into a corner case.
2. Determine whether the word is an acronym.
Assume that the word is an acronym if it contains more than one consecutive capital letter, or contains periods. This can be checked with a simple regex (e.g. [A-Z][A-Z]+).
If the word is an acronym, then first turn it into a more "word-like" form (i.e., not all capitalized, not containing periods) before going to Step 3. If it isn't an acronym, then refer back to the information in Step 1.
3. Use a dictionary!
If the word is in this dictionary, and begins with an "a", "e", "i", "o", or "u", then it begins with a vowel. Otherwise, it begins with a consonant.
Wiktionary and Wikipedia use the IPA to represent the pronunciations of words. If the word begins with one of these letters, then it begins with a vowel.
Hopefully this helps. I suspect that it will be less resource intensive than any single option, given that much of it can be solved by either a simple "equals" statement (e.g. word[0] == 'a'), or by a regular expression (e.g. [aioAIO]), and by some simple knowledge of linguistics and the pronunciations of the English letter names. If the word doesn't fall into a simple case, then use one of the more complex solutions that the other answerers have provided.
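The letter and acronym checks above could be sketched in Python along these lines. The function and constant names are invented for this example. Returning None for "needs a dictionary lookup" is one way to hand the corner cases on to a heavier solution.

```python
import re

ACRONYM = re.compile(r"[A-Z][A-Z]+|\.")         # consecutive capitals, or periods
ALWAYS_CONSONANT = frozenset("bcdgkpqtvwz")
VOWEL_NAMED_IN_ACRONYMS = frozenset("flmnrsx")  # "ef", "el", "em", ...


def guess_article(word):
    """Return "a" or "an" for the simple cases, or None for corner cases."""
    if not word:
        return None
    first, second = word[0].lower(), word[1:2].lower()
    if first in ("a", "i", "o"):
        # "iu" words are rare, but are listed as exceptions above.
        return None if (first == "i" and second == "u") else "an"
    if first == "e":
        return "a" if second == "u" else "an"
    if first in ALWAYS_CONSONANT:
        return "a"
    if first in VOWEL_NAMED_IN_ACRONYMS:
        # Only uncertain when the word looks like an acronym.
        return None if ACRONYM.search(word) else "a"
    return None  # "u", "h", "j", "y", and anything else


print(guess_article("euphoric"), guess_article("elephant"), guess_article("FBI"))
```

The sketch follows the spelling rules as stated, so it inherits their gaps: a word like "one" would still get "an".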
You use "a" whenever the next word doesn't start with a vowel? And you use "an" whenever it does?
With that said, couldn't you just use a regular expression like "\ba\s(?=[aeiou])" (note that commas inside the brackets would be matched literally), and then replace the "a" with "an"?
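For what it's worth, here is how that substitution might look in Python. It is only a sketch of the spelling-based approach: the pattern and function name are illustrative, and it will wrongly produce "an university" because it looks at letters and not sounds.

```python
import re

# A standalone "a" or "A", its trailing whitespace, then a lookahead for a vowel letter.
A_BEFORE_VOWEL = re.compile(r"\b([Aa])(\s+)(?=[aeiouAEIOU])")


def fix_articles(text: str) -> str:
    """Turn "a" into "an" before vowel letters, keeping case and spacing."""
    return A_BEFORE_VOWEL.sub(r"\1n\2", text)
```

The lookahead keeps the following word out of the match, so only the article itself is rewritten.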
