extracting postal code from addresses

extracting postal code from addresses - c#

I am looking for a solution in c# to extract postal code info from address.
The postal codes of following countries
Canada,US,Germany,UK,Turkey,France,Pakistan,India,Italy.
The address can be something like these
188 pleasant street, new minas, Nova Scotia b2p 6r6, Canada.
or
109 A, block 3, DHA, Karachi 75600, Pakistan.
what I want: I want to extract any alphanumerics that is adjacent to city or country name. But having difficulty creating regular expression for it

It's quite an open-ended task. You have to follow some specific format in there. Because what will happen if there'll be two numeric strings in the address (like a case where street is a number). So two options are possible:
Address is always in a specific format and you know the actual format
The zip is always of a given length
In both case regular expressions will lead you to the solution.
- For the first example, assuming the zip code is in the given order (let's say '6r6' in your original example), you can use the following regular expression pattern: "(\S+)\, ?\w+$"
- For the second case, assuming the zip code is a number of 5+ digits, which comes after the first ',', then the following pattern can be used to extract it: "(,.*)+(\d{5})". The second group will be the zip code in the match.
Here is the code you can use:
public static string GetSingleMatch(string address, string pattern, int group = 0)
{
return new Regex(pattern, RegexOptions.IgnoreCase).Match(address).Groups[group].Value;
}
The "group" optional parameter indicates the regex group which will contain the zip code.

I believe it's reasonable that you assume general rule in address which the country is the last and city or state before it, so post code can be placed between city or state and country and as you stated in the example ',' is used as separator, so it can be as following :
private string GetPostCode(string address )
{
string result = string.Empty;
string[] list = address.Split(',');
list.Reverse();
foreach (var item in list)
{
// if item contains numeric postcode
Regex re = new Regex(#"\d+");
Match m = re.Match(item);
result = m.Value;
if (!string.IsNullOrEmpty(result))
break;
}
return result;
}
I hope it would be helpful.

Related

check for valid string format

i am trying to make a validaing system where it checks a string is in the correct format.
the required format can only contain numbers and dashes - and be ordered like so ***-**-*****-**-* 3-2-5-2-1.
For example, 978-14-08855-65-2
can i use Regex like i have for a email checking system by change the format key #"^([\w]+)#([\w])\.([\w]+)$"
the email checking code is
public static bool ValidEmail(string email, out string error)
{
error = "";
string regexEmailCOM = #"^([\w]+)#([\w])\.([\w]+)$"; // allows for .com emails
string regexEmailCoUK = #"^([\w]+)#([\w])\.([\w]+)\.([\w]+)$"; // this allows fo .co.uk emails
var validEmail = new Regex(email);
return validEmail.IsMatch(regexEmailCOM) || validEmail.IsMatch(regexEmailCoUK) && error == "") // if the new instance matches with the string, and there is no error
}

Regex is indeed a good fit for this situation.
One possible expression would be:
^\d{3}-\d\d-\d{5}-\d\d-\d$
This matches exactly 5 groups of only digits (\d) separated by -. Use curly brackets to set a fixed number of repeats.

Is it possible to store a regex match and use part of it as a list enumerator?

I have created a MadLibs style game where the user enters responses to prompts which in turn replace blanks, represented by %s0, %s1 etc., in a story. I have this working using a for loop but someone else suggested I could do it using regex. What I have so far is below, which replaces all instances of %s+number with "wibble". What I was wondering is if it is possible to store the number found by the regex in a temporary variable and in turn use that to return a value from the list Words? E.g. return Regex.Replace(story, pattern, Global.Words[x]); where x is the number returned by the regex pattern as it goes over the string.
static void Main(string[] args)
{
Globals.Words = new List<string>();
Globals.Words.Add("nathan");
Globals.Words.Add("bob");
var text = "Once upon a time there was a %s0 and it was %s1";
Console.WriteLine(FindEscapeCharacters(text));
}
public static string FindEscapeCharacters(string story)
{
var pattern = #"%s([0-9]+)";
return Regex.Replace(story, "%s([0-9]+)", "wibble");
}
Thanks in advance, Nathan.

Not a direct answer to your question about regexes, but if I understand you correctly, there is an easier way to do this:
string baseString = "I have a {0} {1} in my {0} {2}.";
List<string> words = new List<string>() { "red", "cat", "hat" };
string outputString = String.Format(baseString, words.ToArray());
outputString will be I have a red cat in my red hat..
Is that not what you want, or is there more to your question that I'm missing?
Minor elaboration
String.Format uses the following signature:
string Format(string format, params object[] values)
The neat thing about params is that you can either list values separately:
var a = String.Format("...", valueA, valueB, valueC);
but you can also pass in an array directly:
var a = String.Format("...", valueArray);
Note that you can't mix and match the two approaches.

Yes, you are very close in your attempt with Regex.Replace; the last step is to change constant "wibble" into lambda match => how_to_replace_the_match:
var text = "Once upon a time there was a %s0 and it was %s1";
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s([0-9]+)",
match => Globals.Words[int.Parse(match.Groups[1].Value)]);
Edit: In case you don't want working with capturing groups by their numbers, you can name them explicitly:
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s(?<number>[0-9]+)",
match => Globals.Words[int.Parse(match.Groups["number"].Value)]);

There is an overload of Regex.Replace that, rather than taking a string for the last argument, takes a MatchEvaluator delegate - a function that takes a Match object and returns a string.
You could make that function parse the integer from the Match's Groups[1].Value property and then use that to index into your list, returning the string you find.

Split string with plus sign as a delimiter

I have an issue with a string containing the plus sign (+).
I want to split that string (or if there is some other way to solve my problem)
string ColumnPlusLevel = "+-J10+-J10+-J10+-J10+-J10";
string strpluslevel = "";
strpluslevel = ColumnPlusLevel;
string[] strpluslevel_lines = Regex.Split(strpluslevel, "+");
foreach (string line in strpluslevel_lines)
{
MessageBox.Show(line);
strpluslevel_summa = strpluslevel_summa + line;
}
MessageBox.Show(strpluslevel_summa, "summa sumarum");
The MessageBox is for my testing purpose.
Now... The ColumnPlusLevel string can have very varied entry but it is always a repeated pattern starting with the plus sign.
i.e. "+MJ+MJ+MJ" or "+PPL14.1+PPL14.1+PPL14.1" as examples.
(It comes form Another software and I cant edit the output from that software)
How can I find out what that pattern is that is being repeated?
That in this exampels is the +-J10 or +MJ or +PPL14.1
In my case above I have tested it by using only a MessageBox to show the result but I want the repeated pattering stored in a string later on.
Maybe im doing it wrong by using Split, maybe there is another solution.
Maybe I use Split in the wrong way.
Hope you understand my problem and the result I want.
Thanks for any advice.
/Tomas

How can I find out what that pattern is that is being repeated?
Maybe i didn't understand the requirement fully, but isn't it easy as:
string[] tokens = ColumnPlusLevel.Split(new[]{'+'}, StringSplitOptions.RemoveEmptyEntries);
string first = tokens[0];
bool repeatingPattern = tokens.Skip(1).All(s => s == first);
If repeatingPattern is true you know that the pattern itself is first.
Can you maybe explain how the logic works
The line which contains tokens.Skip(1) is a LINQ query, so you need to add using System.Linq at the top of your code file. Since tokens is a string[] which implements IEnumerable<string> you can use any LINQ (extension-)method. Enumerable.Skip(1) will skip the first because i have already stored that in a variable and i want to know if all others are same. Therefore i use All which returns false as soon as one item doesn't match the condition(so one string is different to the first). If all are same you know that there is a repeating pattern which is already stored in the variable first.

You should use String.Split function :
string pattern = ColumnPlusLevel.Split("+")[0];

...but it is always a repeated pattern starting with the plus sign.
Why do you even need String.Split() here if the pattern always only repeats itself?
string input = #"+MJ+MJ+MJ";
int indexOfSecondPlus = input.IndexOf('+', 1);
string pattern = input.Remove(indexOfSecondPlus, input.Length - indexOfSecondPlus);
//pattern is now "+MJ"
No need of string split, no need to use LinQ

String has a method called Split which let's you split/divide the string based on a given character/character-set:
string givenString = "+-J10+-J10+-J10+-J10+-J10"'
string SplittedString = givenString.Split("+")[0] ///Here + is the character based on which the string would be splitted and 0 is the index number
string result = SplittedString.Replace("-","") //The mothod REPLACE replaces the given string with a targeted string,i added this so that you can get the numbers only from the string

Convert string into three letter Abbreviation

I've recently been given a new project by work to convert Any given string into 1-3 letter abbreviations.
An example of something similar to what I must produce is below however the strings given could be anything:
switch (string.Name)
{
case "Emotional, Social & Personal": return "ESP";
case "Speech & Language": return "SL";
case "Physical Development": return "PD";
case "Understanding the World": return "UW";
case "English": return "E";
case "Expressive Art & Design": return "EAD";
case "Science": return "S";
case "Understanding The World And It's People"; return "UTW";
}
I figured that I could use string.Split & count the number of words in the array. Then add conditions for handling particular length strings as generally these sentences wont be longer than 4 words however problems I will encounter are.
If a string is longer than I expected it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks

Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.

It seems that if it doesn't matter, you could just go for the simplest thing. If the string is shorter than 4 words, take the first letter of each string.
If the string is longer than 4, eliminate all "ands", and "ors", then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep an 3D char array, in alphabetical order for quick lookup. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to linearly move through string to get a different 3 letter word abbreviation - sort of like codons on DNA.

Perfect place to use a dictionary
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];

Following snippet may help you:
string input = "Emotional, Social & Personal"; // an example from the question
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Regex.Replace(input, #"[^0-9A-Za-z ,]", "").ToLower()); // will produce a text without special charactors
string abbreviation = String.Join("",plainText.Split(" ".ToCharArray(),StringSplitOptions.RemoveEmptyEntries).Select(y =>y[0]).ToArray());// get first character from each word

C# Regex.Match to decimal

I have a string "-4.00 %" which I need to convert to a decimal so that I can declare it as a variable and use it later. The string itself is found in string[] rows. My code is as follows:
foreach (string[] row in rows)
{
string row1 = row[0].ToString();
Match rownum = Regex.Match(row1.ToString(), #"\-?\d+\.+?\d+[^%]");
string act = Convert.ToString(rownum); //wouldn't convert match to decimal
decimal actual = Convert.ToDecimal(act);
textBox1.Text = (actual.ToString());
}
This results in "Input string was not in a correct format." Any ideas?
Thanks.

I see two things happening here that could contribute.
You are treating the Regex Match as though you expect it to be a string, but what a Match retrieves is a MatchGroup.
Rather than converting rownum to a string, you need to lookat rownum.Groups[0].
Secondly, you have no parenthesised match to capture. #"(\-?\d+\.+?\d+)%" will create a capture group from the whole lot. This may not matter, I don't know how C# behaves in this circumstance exactly, but if you start stretching your regexes you will want to use bracketed capture groups so you might as well start as you want to go on.

Here's a modified version of your code that changes the regex to use a capturing group and explicitly look for a %. As a consequence, this also simplifies the parsing to decimal (no longer need an intermediary string):
EDIT : check rownum.Success as per executor's suggestion in comments
string[] rows = new [] {"abc -4.01%", "def 6.45%", "monkey" };
foreach (string row in rows)
{
//regex captures number but not %
Match rownum = Regex.Match(row.ToString(), #"(\-?\d+\.+?\d+)%");
//check for match
if(!rownum.Success) continue;
//get value of first (and only) capture
string capture = rownum.Groups[1].Value;
//convert to decimal
decimal actual = decimal.Parse(capture);
//TODO: do something with actual
}

If you're going to use the Match class to handle this, then you have to access the Match.Groups property to get the collection of matches. This class assumes that more than one occurrence appears. If you can guarantee that you'll always get 1 and only 1 you could get it with:
string act = rownum.Groups[0];
Otherwise you'll need to parse through it as in the MSDN documentation.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

extracting postal code from addresses - c#

Related

check for valid string format

Is it possible to store a regex match and use part of it as a list enumerator?

Split string with plus sign as a delimiter

Convert string into three letter Abbreviation

C# Regex.Match to decimal

Categories

Resources