Phone number normalization: Any pre-existing libraries? - c#

I have a system which is using phone numbers as unique identifiers. For this reason, I want to format all phone numbers as they come in using a normalized format. Because I have no control over my source data, I need to parse out these numbers myself and format them before adding them to my DB.
I'm about to write a parser that can read phone numbers in and output a normalized phone format, but before I do I was wondering if anyone knew of any pre-existing libraries I could use to format phone numbers.
If there are no pre-existing libraries out there, what things should I be keeping in mind when creating this feature that may not be obvious?
Although my system is only dealing with US numbers right now, I plan to try to include support for international numbers just in case since there is a chance it will be needed.
Edit: I forgot to mention I'm using C# with .NET 2.0.

You could use libphonenumber from Google. Here's a blog post:
http://blog.appharbor.com/2012/02/03/net-phone-number-validation-with-google-libphonenumber
Parsing numbers is as easy as installing the NuGet package and then doing this:
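using PhoneNumbers; // namespace of the libphonenumber-csharp package
var util = PhoneNumberUtil.GetInstance();
// "US" is the default region, used when the input carries no country code.
var number = util.Parse("555-555-5555", "US");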
You can then format the number like this:
util.Format(number, PhoneNumberFormat.E164);
libphonenumber supports several formats other than E.164.

I'm currently involved in the OpenMoko project, which is developing a completely open source cell phone (including hardware). There has been a lot of trouble around normalizing phone numbers. I don't know if anyone has come up with a good solution yet. The biggest problem seems to be with US phone numbers, since sometimes they come in with a 1 on the front and sometimes not. Depending on what you have stored in your contacts list, it may or may not display the caller ID info correctly. I'd recommend stripping off the 1 on the phone number (though I'd expect most people wouldn't enter it in the first place). You may also need to look for a plus sign or country code on the front of international numbers.
You can check around the OpenMoko website, mailing list, and source control to see if they've solved this bug yet.
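As a rough illustration of the "strip the leading 1" advice above, here is a minimal sketch. The class and method names (UsNumber, ToTenDigits) are made up for this example, and it assumes the input is meant to be a US number:

```csharp
using System;
using System.Text;

public static class UsNumber
{
    // Hypothetical helper: returns the 10-digit national number,
    // or null when the input does not look like a US number.
    public static string ToTenDigits(string raw)
    {
        if (raw == null) return null;

        // Keep ASCII digits only; this drops "+", spaces, dashes and parentheses.
        var digits = new StringBuilder();
        foreach (char c in raw)
        {
            if (c >= '0' && c <= '9') digits.Append(c);
        }

        string s = digits.ToString();

        // Eleven digits starting with 1: treat the 1 as the country code and drop it.
        if (s.Length == 11 && s[0] == '1') s = s.Substring(1);

        return s.Length == 10 ? s : null;
    }
}
```

With this, "1-555-555-5555", "+1 (555) 555-5555" and "555.555.5555" all end up as the same key, which is the point if the number is used as a unique identifier.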

Perl and Rails examples:
http://validates-as-phone.googlecode.com/svn/trunk/README
http://www.perlmonks.org/?node_id=159645

Just strip out any non-digits, possibly using a RegEx: [^\d]
The only exception might be if you want to handle extensions, to distinguish a number without an area code but with a 3 digit extension, or if you need to handle international numbers.
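A minimal sketch of that idea, with the extension split off before the non-digits are removed. The names (PhoneDigits, Normalize) and the set of extension markers ("x", "ext", "ext.") are assumptions for illustration, not a complete list:

```csharp
using System.Text.RegularExpressions;

public static class PhoneDigits
{
    // Trailing extension such as "x123", "ext 123" or "ext. 123" (assumed markers).
    private static readonly Regex Extension =
        new Regex(@"\s*(?:ext\.?|x)\s*(\d+)\s*$", RegexOptions.IgnoreCase);

    // [^0-9] rather than [^\d]: in .NET, \d also matches non-ASCII digits by default.
    private static readonly Regex NonDigits = new Regex(@"[^0-9]");

    // Returns the digits of the main number; the extension (if any) comes back separately.
    public static string Normalize(string raw, out string extension)
    {
        extension = "";
        Match m = Extension.Match(raw);
        if (m.Success)
        {
            extension = m.Groups[1].Value;
            raw = raw.Substring(0, m.Index);
        }
        return NonDigits.Replace(raw, "");
    }
}
```

Keeping the extension separate avoids the ambiguity mentioned above: "555-1234 x123" stays distinguishable from a ten-digit number.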

What you need is a list of all country codes. Match the first few characters of your string against that list to make sure the country code is valid, then check that the rest of the number is all digits and of a proper length, which usually varies from 5 to 10 digits.
To check against country codes, install the NGeoNames NuGet package, which uses data from www.geonames.org to provide a list of all country codes to match against.
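The prefix matching could look roughly like this. The table below is a tiny hand-written subset used only for illustration (a real one would be loaded from a maintained source such as the one mentioned above), and the names CallingCodes and TrySplit are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class CallingCodes
{
    // Illustrative subset of country calling codes, not a complete list.
    private static readonly Dictionary<string, string> Codes =
        new Dictionary<string, string>
        {
            { "1", "US/CA" },
            { "44", "GB" },
            { "49", "DE" },
            { "358", "FI" },
        };

    // digits: an international number without the "+", digits only.
    public static bool TrySplit(string digits, out string code, out string national)
    {
        // Calling codes are 1 to 3 digits long; try the longest prefix first.
        for (int len = 3; len >= 1; len--)
        {
            if (digits.Length > len && Codes.ContainsKey(digits.Substring(0, len)))
            {
                code = digits.Substring(0, len);
                national = digits.Substring(len);
                return true;
            }
        }

        code = "";
        national = "";
        return false;
    }
}
```

Once the code is split off, the length check on the remaining national number can be done per country.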

Related

How to convert a set of arabic numbers (order numbers) to speech in c#.net?

To make my question clear: I don't want to use Microsoft's System.Speech.Synthesis library, since it does not support Arabic at all.
I tried looking for other TTS engines but couldn't find anything helpful,
so I figured there should be another way without using TTS, such as playing a set of audio files corresponding to my numbers.
In short, it's a call system that calls out numbers in a queue. Can anyone with enough experience in this area show me a good place to start? Or are there good libraries for the .NET Framework that already do this?

algorithm: analyzing web pages for tags

I've been working on a project for the last few days, and there is one task in it that I don't know how to do: analyzing web pages to find tags that characterize the page.
What do I mean by tags? I mean keywords that summarize what the web page is about. For example, here on SO you write your own tags so people can find your question more easily. What I am talking about is building an algorithm that analyzes a web page and derives its tags from the text within the page.
I started by getting the text from the page; that part is done.
Generally, I'm looking for a way to find the keywords that sum up what the web page is about.
However, I don't really know what to do next. Does anyone have a suggestion?
For a really basic approach, you could use the TF-IDF algorithm to find the most important words in your page.
A quick overview from Wikipedia:
The tf–idf weight (term frequency–inverse document frequency) is a
weight often used in information retrieval and text mining. This
weight is a statistical measure used to evaluate how important a word
is to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the document
but is offset by the frequency of the word in the corpus. Variations
of the tf–idf weighting scheme are often used by search engines as a
central tool in scoring and ranking a document's relevance given a
user query. tf–idf can be successfully used for stop-words filtering
in various subject fields including text summarization and
classification
Once you find the most important words in your page, you can use them as tags.
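To make the scoring concrete, here is a small sketch of plain TF-IDF (term frequency times log(N / document frequency)). The tokenizer is deliberately naive, there is no stemming or stop-word list, and the names (TfIdf, Score, TopTags) are invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdf
{
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Naive tokenizer: lowercase, split on whitespace and basic punctuation.
    public static List<string> Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                   .ToList();
    }

    // Scores every word of corpus[docIndex] as tf * log(N / df).
    public static Dictionary<string, double> Score(IList<string> corpus, int docIndex)
    {
        List<List<string>> docs = corpus.Select(d => Tokenize(d)).ToList();
        List<string> doc = docs[docIndex];
        var scores = new Dictionary<string, double>();

        foreach (var group in doc.GroupBy(w => w))
        {
            double tf = (double)group.Count() / doc.Count;        // share of this document
            int df = docs.Count(d => d.Contains(group.Key));      // documents containing the word
            scores[group.Key] = tf * Math.Log((double)docs.Count / df);
        }
        return scores;
    }

    // Highest-scoring words first; ties are broken alphabetically.
    public static List<string> TopTags(IList<string> corpus, int docIndex, int count)
    {
        return Score(corpus, docIndex)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(count)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

A word that appears in every document of the corpus (such as "the") gets a score of zero, which is why the corpus you compare against matters as much as the page itself.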
If you want to improve your tags and make them more relevant, there are many ways to proceed. One is the following:
1. Collect a set of texts for which you already know the main tags.
2. Run TF-IDF on each text and build a vector from the highest-scoring words.
3. Find the main direction across all these vectors (by running a PCA, for example, or any other machine learning tool).
4. Use the known tag to represent the set of words along that main direction (the largest component of the PCA).
Hope it's understandable and it helps
Typically you look for certain words surrounded by certain html. For example, titles are typically in an H tag such as <h1>.
If you parse a page for all of its H1 tags then it stands to reason that the content following that tag is related. An example is this very page. It has an H1 tag surrounding the question title. This gives Google a hint that the page is about "algorithm", "analyzing", "web pages", etc.
The hard part is to determine context.
In our example here, the term "pages" is very generic and can relate to anything. However, "web pages" is a bit more specific. You can do this with an internal dictionary that is built up over time based on term frequency after analyzing a number of documents to find commonality. The frequency should provide a weighted value in determining the top X "tags" for a given page.
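For the heading part, a quick sketch that pulls the text out of h1 to h3 elements. It uses a regular expression, which is only good enough for simple, well-formed markup (a real HTML parser would be the safer choice), and the name HeadingExtractor is made up:

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadingExtractor
{
    // Matches <h1>...</h1> through <h3>...</h3>, attributes allowed on the opening tag.
    private static readonly Regex Heading = new Regex(
        @"<h([1-3])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining inline tag inside the heading, e.g. <b> or <a href=...>.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    public static List<string> Headings(string html)
    {
        var result = new List<string>();
        foreach (Match m in Heading.Matches(html))
        {
            string text = Tag.Replace(m.Groups[2].Value, "");
            result.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return result;
    }
}
```

The words found this way could then be given a higher weight than body text when the term-frequency dictionary is built.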
This is more of an Information Retrieval and Data Mining question. Reviewing some of Rao's lectures may help.
When you're spidering web pages, you're essentially trying to build an index. You do this by building a global Term-Frequency dictionary, where each word in the language (often stemmed to account for pluralization and other modifications) is stored as a key, and the number of times they occur in the document as values.
From there, you can use algorithms such as PageRank and Hubs and Authorities to do data analysis.
You can implement a number of heuristics:
Acronyms and words in all uppercase
Words that are not frequent, i.e. discard words that appear in all or most documents and favour the ones that appear relatively frequently only on this one.
Sequences of words that always appear in the same order in this document and possibly in others as well
etc.
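The first heuristic is the easiest one to sketch. This assumes "acronym" simply means a run of two or more ASCII capital letters, which is a simplification, and the names are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Heuristics
{
    // Two or more capital letters standing alone, e.g. "PDF" or "API".
    private static readonly Regex Acronym = new Regex(@"\b[A-Z]{2,}\b");

    // Returns the distinct acronyms found in the text, sorted alphabetically.
    public static SortedSet<string> Acronyms(string text)
    {
        var found = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match m in Acronym.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Pages written entirely in capitals would defeat this, so it probably works best combined with the frequency-based heuristics.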

Get users countrycode (ISO 3166-1) in MonoTouch

I see that the NSLocale bindings aren't complete in MonoTouch, so I am having some difficulty writing them myself.
Does anyone have the code to get the user's country code in ISO 3166-1 alpha-3 format, i.e. three letters for each country?
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
This is being ported from Android where we already have the API call:
Locale.getDefault().getISO3Country();
Have a look at Iphone, Obtaining a List of countries from MonoTouch - it should cover it (or easily be adapted to do so).
Note: this will be part of a future version of MonoTouch (I've got the code in a backup, waiting for my iMac to be repaired ;-)
EDIT
iOS NSLocale returns the two-letter ISO country code, not the three-letter one you're looking for. The best you can do is build a map from two-letter to three-letter codes and use the linked code to get the two-letter code. There's some code that does the reverse (which you can adapt), or even a map (to reverse), available in: Converting country codes in .NET
Note that depending on your requirements this could be incomplete and not exactly matching what Android provides you.
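The two-letter to three-letter map could be as simple as the sketch below. Only a handful of entries are shown, the full ISO 3166-1 table would have to be filled in, and the names (CountryCodes, TryToAlpha3) are hypothetical. How the two-letter code is obtained from NSLocale is left out, since that depends on the bindings discussed above:

```csharp
using System;
using System.Collections.Generic;

public static class CountryCodes
{
    // Partial ISO 3166-1 alpha-2 to alpha-3 map, for illustration only.
    private static readonly Dictionary<string, string> Alpha2ToAlpha3 =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "US", "USA" },
            { "GB", "GBR" },
            { "DE", "DEU" },
            { "FR", "FRA" },
            { "NO", "NOR" },
            { "JP", "JPN" },
        };

    // Returns false when the two-letter code is not in the map.
    public static bool TryToAlpha3(string alpha2, out string alpha3)
    {
        return Alpha2ToAlpha3.TryGetValue(alpha2, out alpha3);
    }
}
```

On the desktop framework, System.Globalization.RegionInfo exposes a ThreeLetterISORegionName property that may save you the table; whether it behaves the same on MonoTouch is something I'd verify before relying on it.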

Detect the language of a text is english in PDF or DOC files

The requirement is that I want to identify whether the text in a PDF or DOC file is English or non-English. If it contains even a single word in another language (Turkish, French, Arabic, etc.), the whole document has to be rejected.
It's urgent, please give me sample code for this functionality.
Have a look at the Google Translate API; it is the only free service I know of that could do this for you. Otherwise, the only solution I can see is maintaining your own dictionary, but that's a different story.
I guess you could use LangId. However there are some restrictions:
To use our API in live websites or services we suggest you to apply for a free API key, using the below form. The API key expands your developing possibilities allowing you to do up till 1,000 requests per hour (~720,000 per month).
I don't think this will solve your 'single word' issue however. I believe if the text has 6 words English and 4 words in another language it will see the text as English since that language is mainly used in the file. I haven't looked at the API myself though so there might be some solutions for that.
Hope it is of use to you.
Maybe the detect function of Google's Translate API could help you:
http://code.google.com/apis/language/translate/v2/getting_started.html#language_detect
This is not possible for single words.
Is "the" an English word? Well, yes, but it's also a Danish word (meaning tea). Does the word Schadenfreude indicate a non-english text? Not necessarily, it all depends on the context.
Adding to the list of APIs that support language determination, Bing API has a call that will determine the language for an array of strings.
http://msdn.microsoft.com/en-us/library/ff512412.aspx
Hope this helps somewhat.

Windows App spellcheck

I was wondering if there is another way to spell check a Windows app instead of what I've been using: "Microsoft.Office.Interop.Word". I can't buy a spell-checking add-on. I also cannot use open source, and I would like the spell check to be dynamic. Any suggestions?
EDIT:
I have seen several similar questions, the problem is they all suggest using open source applications (which I would love) or Microsoft Word.
I am currently using Word to spell check and it slows my current application down and causes several glitches in my application. Word is not a clean solution so I'm really wanting to find some other way.. Is my only other option to recreate my app as a WPF app so I can take advantage of the SpellCheck Class?
If I were you I would download the data from the English Wiktionary and parse it to obtain a list of all English words (for instance). Then you could rather easily write at least a primitive spell-checker yourself. In fact, I use a parsed version of the English Wiktionary in my own mathematical application AlgoSim. If you'd like, I could send you the data file.
Update
I have now published a parsed word list at english.zip (942 kB, 383735 entries, zip). The data originates from the English Wiktionary, and as such, is licensed under the Creative Commons Attribution/Share-Alike License.
To obtain a list like this, you can either download all articles on Wiktionary as a huge XML file containing all Wiki- and HTML-formatted articles. This is then more or less trivial to parse. Alternatively, you can run a bot on the site. I got help to obtain a parsed file from a user at Wiktionary (I seem to have forgotten his name, though...), and this file (english.txt in english.zip) is a further processed version of the file I got.
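A primitive checker on top of such a word list might look like this. It assumes the file holds one word per line (check the actual layout of english.txt first), only reports unknown words without suggesting corrections, and the class name is made up:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public sealed class WordListSpellChecker
{
    private readonly HashSet<string> words =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public WordListSpellChecker(IEnumerable<string> wordList)
    {
        foreach (string w in wordList)
        {
            string trimmed = w.Trim();
            if (trimmed.Length > 0) words.Add(trimmed);
        }
    }

    // Assumes one word per line in the file.
    public static WordListSpellChecker FromFile(string path)
    {
        return new WordListSpellChecker(File.ReadLines(path));
    }

    // Returns every word in the text that is not in the list, in order of appearance.
    public List<string> Misspelled(string text)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[A-Za-z']+"))
        {
            string word = m.Value.Trim('\'');
            if (word.Length > 0 && !words.Contains(word)) result.Add(word);
        }
        return result;
    }
}
```

Since the lookup is a hash set, this is fast enough to run on every keystroke, which should cover the "dynamic" requirement without going through Word.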
http://msdn.microsoft.com/en-us/library/system.windows.controls.spellcheck.aspx
I use Aspell-win32, it's old but it's open source, and works as well or better than the Word spell check. Came here looking for a built in solution.
