Convert string into three letter Abbreviation - c#

I've recently been given a new project at work to convert any given string into a 1-3 letter abbreviation.
An example of something similar to what I must produce is below; however, the strings given could be anything:
switch (string.Name)
{
case "Emotional, Social & Personal": return "ESP";
case "Speech & Language": return "SL";
case "Physical Development": return "PD";
case "Understanding the World": return "UW";
case "English": return "E";
case "Expressive Art & Design": return "EAD";
case "Science": return "S";
case "Understanding The World And It's People": return "UTW";
}
I figured that I could use string.Split and count the number of words in the array, then add conditions for handling strings of particular lengths, as these sentences generally won't be longer than 4 words. However, the problems I expect to encounter are:
If a string is longer than I expected it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks

Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.

It seems that if it doesn't matter, you could just go for the simplest thing. If the string is 4 words or shorter, take the first letter of each word.
If the string is longer than 4 words, eliminate all "ands" and "ors", then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep a 3D char array, in alphabetical order, for quick lookup of the abbreviations already in use. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to move linearly through the string to get a different 3-letter abbreviation - sort of like codons on DNA.
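As a rough illustration of the "first letters, minus ignored words" idea, here is a minimal sketch. The Abbreviator class, the Abbreviate method and the contents of the StopWords list are all made up for this example; it does not handle the collision cases described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Abbreviator
{
    // Hypothetical ignore list; adjust it to the input you expect.
    static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "or", "the", "so" };

    public static string Abbreviate(string input, int maxLetters = 3)
    {
        var initials = input
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            // Skip symbols such as "&" and any word on the ignore list.
            .Where(w => char.IsLetter(w[0]) && !StopWords.Contains(w.Trim(',', '.')))
            .Select(w => char.ToUpperInvariant(w[0]))
            .Take(maxLetters);

        return new string(initials.ToArray());
    }
}
```

With this ignore list, "Emotional, Social & Personal" gives "ESP" and "Understanding the World" gives "UW", but "Understanding The World And It's People" gives "UWI" rather than the "UTW" in the question, so the list would need tuning to match a specific expected output.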

Perfect place to use a dictionary
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];

The following snippet may help you:
// needs: using System.Globalization; using System.Linq; using System.Text.RegularExpressions;
string input = "Emotional, Social & Personal"; // an example from the question
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Regex.Replace(input, @"[^0-9A-Za-z ,]", "").ToLower()); // strips special characters (commas and spaces are kept), then title-cases each word
string abbreviation = String.Join("", plainText.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(y => y[0]).ToArray()); // get the first character of each word

Related

How do you find a delimited/isolated substring with string.contains?

I am trying to parse out and identify some values from strings that I have in a list.
I am using string.Contains to identify the value I'm looking for, but I am getting hits even if the value is surrounded by other text. How can I make sure I only get a hit if the value is isolated?
Example parse:
Looking for value = "302"
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
var result = sale.ToLower().Contains("302");
In this example I will get a hit for "serialNumber302" and "F18529302E", which in the context is incorrect since I only want a hit if it finds “302” isolated, like “dontfind302 shouldfind 302”.
Any ideas on how to do this?
If you try Regex, you can define a word boundary using \b:
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
bool result = Regex.IsMatch(sale, @"\b302\b"); // false
sale = "A string with 302 isolated";
result = Regex.IsMatch(sale, @"\b302\b"); // true
So 302 will only be found if it is at the start of the string, at the end of the string, or if it is surrounded by non-word characters i.e. not a-z A-Z 0-9 or _
EDIT: From the comments I realised that it wasn't clear whether or not "serialNum302" should get a hit. I assumed so in this answer.
I see a few easy ways you could do this:
1) If the input is always a number as in the example, one option would be to only search for substrings not surrounded by more numbers, by examining all the results of an initial search and comparing their neighboring characters against the string "0123456789". I really don't think this is the best option though, because sooner or later it's going to break when it misinterprets one of the other bits of data.
2) If the string sale always has the serial number in the format "serialNumber[Num]", instead of just looking for Num, look for "serialNumber" + Num, as this is less likely to be messed up with the other data.
3) From your string, it looks like you have a standardized format that's being introduced to the system. In this case, parse it in a standardized way, e.g. by splitting it into substrings at the commas, then parsing each substring differently as it requires.
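For what it's worth, option 1 can probably be expressed as a single regex with lookarounds instead of checking neighbouring characters by hand. This is only a sketch under the assumption that "isolated" means "not adjacent to another digit"; the IsolatedNumber class and ContainsNumber method are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class IsolatedNumber
{
    public static bool ContainsNumber(string text, string number)
    {
        // (?<!\d) = not preceded by a digit, (?!\d) = not followed by a digit.
        // Regex.Escape guards against regex metacharacters in the search value.
        string pattern = @"(?<!\d)" + Regex.Escape(number) + @"(?!\d)";
        return Regex.IsMatch(text, pattern);
    }
}
```

Under that assumption, "serialNumber302." counts as a hit, while "F18529302E" does not, because the 302 there is preceded by the digit 9.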

Is it possible to store a regex match and use part of it as a list enumerator?

I have created a MadLibs style game where the user enters responses to prompts which in turn replace blanks, represented by %s0, %s1 etc., in a story. I have this working using a for loop but someone else suggested I could do it using regex. What I have so far is below, which replaces all instances of %s+number with "wibble". What I was wondering is if it is possible to store the number found by the regex in a temporary variable and in turn use that to return a value from the list Words? E.g. return Regex.Replace(story, pattern, Globals.Words[x]); where x is the number returned by the regex pattern as it goes over the string.
static void Main(string[] args)
{
Globals.Words = new List<string>();
Globals.Words.Add("nathan");
Globals.Words.Add("bob");
var text = "Once upon a time there was a %s0 and it was %s1";
Console.WriteLine(FindEscapeCharacters(text));
}
public static string FindEscapeCharacters(string story)
{
var pattern = @"%s([0-9]+)";
return Regex.Replace(story, pattern, "wibble");
}
Thanks in advance, Nathan.
Not a direct answer to your question about regexes, but if I understand you correctly, there is an easier way to do this:
string baseString = "I have a {0} {1} in my {0} {2}.";
List<string> words = new List<string>() { "red", "cat", "hat" };
string outputString = String.Format(baseString, words.ToArray());
outputString will be "I have a red cat in my red hat.".
Is that not what you want, or is there more to your question that I'm missing?
Minor elaboration
String.Format uses the following signature:
string Format(string format, params object[] values)
The neat thing about params is that you can either list values separately:
var a = String.Format("...", valueA, valueB, valueC);
but you can also pass in an array directly:
var a = String.Format("...", valueArray);
Note that you can't mix and match the two approaches.
Yes, you are very close in your attempt with Regex.Replace; the last step is to change constant "wibble" into lambda match => how_to_replace_the_match:
var text = "Once upon a time there was a %s0 and it was %s1";
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s([0-9]+)",
match => Globals.Words[int.Parse(match.Groups[1].Value)]);
Edit: In case you don't want to work with capturing groups by their numbers, you can name them explicitly:
// Once upon a time there was a nathan and it was bob
var result = Regex.Replace(
text,
"%s(?<number>[0-9]+)",
match => Globals.Words[int.Parse(match.Groups["number"].Value)]);
There is an overload of Regex.Replace that, rather than taking a string for the last argument, takes a MatchEvaluator delegate - a function that takes a Match object and returns a string.
You could make that function parse the integer from the Match's Groups[1].Value property and then use that to index into your list, returning the string you find.
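A minimal sketch of that evaluator, with one extra that is not in the question: a placeholder whose number has no matching word is left untouched instead of throwing. The MadLibs class and the Fill method are names invented for this example, and the word list is passed in rather than read from Globals.Words.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MadLibs
{
    public static string Fill(string story, IReadOnlyList<string> words)
    {
        // The lambda is converted to a MatchEvaluator and called once per match.
        return Regex.Replace(story, @"%s([0-9]+)", match =>
        {
            bool ok = int.TryParse(match.Groups[1].Value, out int index)
                      && index < words.Count;
            // Assumption: keep the original placeholder when the index is out of range.
            return ok ? words[index] : match.Value;
        });
    }
}
```

Calling MadLibs.Fill with the story from the question and the list { "nathan", "bob" } gives "Once upon a time there was a nathan and it was bob".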

extracting postal code from addresses

I am looking for a solution in c# to extract postal code info from address.
The postal codes of following countries
Canada,US,Germany,UK,Turkey,France,Pakistan,India,Italy.
The address can be something like these
188 pleasant street, new minas, Nova Scotia b2p 6r6, Canada.
or
109 A, block 3, DHA, Karachi 75600, Pakistan.
what I want: I want to extract any alphanumerics that is adjacent to city or country name. But having difficulty creating regular expression for it
It's quite an open-ended task, so the addresses have to follow some specific format. Otherwise, what happens if there are two numeric strings in the address (for example, when the street name is a number)? So two options are possible:
- The address is always in a specific format and you know the actual format
- The zip is always of a given length
In both cases, regular expressions will lead you to the solution.
- For the first example, assuming the zip code is in the given order (let's say '6r6' in your original example), you can use the following regular expression pattern: "(\S+)\, ?\w+$"
- For the second case, assuming the zip code is a number of 5+ digits, which comes after the first ',', then the following pattern can be used to extract it: "(,.*)+(\d{5})". The second group will be the zip code in the match.
Here is the code you can use:
public static string GetSingleMatch(string address, string pattern, int group = 0)
{
return new Regex(pattern, RegexOptions.IgnoreCase).Match(address).Groups[group].Value;
}
The "group" optional parameter indicates the regex group which will contain the zip code.
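To make the "known format" idea a bit more concrete, here is a sketch that tries a couple of postal-code shapes and searches from the end of the address, so that a street number is less likely to be picked up. The two patterns are assumptions for illustration (a Canadian-style "A1A 1A1" code and a plain 5-digit code) and do not cover every country in the question; the PostalCodes class and Extract method are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class PostalCodes
{
    // Assumed formats only; extend per country as needed.
    const string CanadianStyle = @"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b";
    const string FiveDigit = @"\b\d{5}\b";

    public static string Extract(string address)
    {
        foreach (var pattern in new[] { CanadianStyle, FiveDigit })
        {
            // RightToLeft makes the search start at the end of the address,
            // where the code usually sits next to the city or country.
            var m = Regex.Match(address, pattern,
                RegexOptions.IgnoreCase | RegexOptions.RightToLeft);
            if (m.Success)
                return m.Value;
        }
        return string.Empty;
    }
}
```

For the two sample addresses in the question this returns "b2p 6r6" and "75600" respectively.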
I believe it's reasonable to assume the general rule that the country comes last in an address, with the city or state before it, so the postcode sits between the city or state and the country. As your example uses ',' as the separator, it could be done as follows:
private string GetPostCode(string address)
{
    string result = string.Empty;
    // Matches the first run of digits in a segment (numeric postcodes only).
    Regex re = new Regex(@"\d+");
    string[] list = address.Split(',');
    // Array.Reverse reverses in place, so the search starts at the country end.
    Array.Reverse(list);
    foreach (var item in list)
    {
        result = re.Match(item).Value;
        if (!string.IsNullOrEmpty(result))
            break;
    }
    return result;
}
I hope it would be helpful.

Most efficient way to parse a delimited string in C#

This has been asked a few different ways but I am debating on "my way" vs "your way" with another developer. Language is C#.
I want to parse a pipe delimited string where the first 2 characters of each chunk is my tag.
The rules. Not my rules but rules I have been given and must follow.
I can't change the format of the string.
This function will be called possibly many times so efficiency is key.
I need to keep it simple.
The input string and tag I am looking for may/will change during runtime.
Example input string: AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4
Example tag I may need value for: AB
My way: I split the string into an array based on the delimiter and loop through the array each time the function is called. I then look at the first 2 characters of each chunk and return the value minus those first 2 characters.
The "other guy's" way is to use a combination of IndexOf and Substring to find the starting and ending points of the field I am looking for, then use Substring again to pull out the value minus the first 2 characters. So he would call IndexOf("|AB"), then find the next pipe in the string. These would be the start and end. Then Substring that out.
Now I would think that IndexOf and Substring scan the string char by char on every call, so this would be less efficient than splitting into large chunks and reading each chunk minus its first 2 characters. Or is there another way that is better than what both of us have proposed?
The other guy's approach is going to be more efficient in time, given that the input string needs to be re-evaluated each time. If the input string is long, it also won't require the extra memory that splitting the string would.
If I'm trying to code a really tight loop I prefer to directly use array/string operators rather than LINQ to avoid that additional overhead:
static string inputString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4"; // static so the static method below can read it
static string FindString(string tag)
{
int startIndex;
if (inputString.StartsWith(tag))
{
startIndex = tag.Length;
}
else
{
startIndex = inputString.IndexOf(string.Format("|{0}", tag));
if (startIndex == -1)
return string.Empty;
startIndex += tag.Length + 1;
}
int endIndex = inputString.IndexOf('|', startIndex);
if (endIndex == -1)
endIndex = inputString.Length;
return inputString.Substring(startIndex, endIndex - startIndex);
}
I've done a lot of parsing in C# and I would probably take the approach suggested by the "other guys" just because it would be a bit lighter on resources used and likely to be a little faster as well.
That said, as long as the data isn't too big, there's nothing wrong with the first approach and it will be much easier to program.
Something like this may work ok
string myString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4";
string selector = "AB";
// Substring strips only the leading tag; Replace would also remove any later occurrence of it inside the value
var results = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Substring(selector.Length));
Returns: a sequence of the matches, in this case just one, "VALUE2".
If you are just looking for the first or only match, this will work:
string result = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Substring(selector.Length)).FirstOrDefault();
SubString does not parse the string.
IndexOf does parse the string.
My preference would be the Split method, primarily for coding efficiency:
string[] inputArr = input.Split('|').Select(s => s.Substring(2)).ToArray(); // Substring(2) drops the 2-character tag
is pretty concise. How many LoC does the substring/indexof method take?
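If several tags are looked up against the same input string before it changes, one variation on the Split approach is to split once and keep the tags as dictionary keys. This is only a sketch; TagParser and Parse are names made up here, and it assumes every tag is exactly 2 characters as stated in the question. Whether it beats IndexOf/Substring depends on how many lookups happen per input string, so it is worth measuring.

```csharp
using System;
using System.Collections.Generic;

static class TagParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        var map = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (var chunk in input.Split('|'))
        {
            // Chunks shorter than a tag are skipped rather than throwing.
            if (chunk.Length >= 2)
                map[chunk.Substring(0, 2)] = chunk.Substring(2);
        }
        return map;
    }
}
```

Usage would be something like var map = TagParser.Parse(input); followed by map.TryGetValue("AB", out string value) for each tag needed.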

Levenshtein algorithm with custom character mapping

I want to use the Levenshtein algorithm to search in a list of strings. I want to implement a custom character mapping so that I can type Latin characters and search items written in Greek.
mapping example:
a = α, ά
b = β
i = ι,ί,ΐ,ϊ
... (etc)
u = ου, ού
So searching using abu in a list with
αbu
abού
αού (all greek characters)
should return all items in the list. (Item order is not a problem.)
How do I apply a mapping in the algorithm? (this is where I start)
I think the best way would be to preprocess your symbols into one definite form (e.g. all in Latin) and then use Levenshtein as you normally would.
In pseudocode:
int func(String latinStr, String greekStr) {
String mappedStr = convertToLatin(greekStr); // e.g. now αβ would be ab
return Levenshtein(latinStr, mappedStr);
}
In convertToLatin you can go symbol by symbol, look each one up in a Dictionary of mappings, and build the new string from the replacements.
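A possible C# version of that pseudocode is sketched below. The mapping table only contains the handful of entries listed in the question and would need to be completed; GreekSearch, ConvertToLatin and Levenshtein are names chosen for this example. Two assumptions are made: two-letter sequences such as "ου" are tried before single letters, and the input is normalised to composed form (FormC) so accented letters compare as single characters.

```csharp
using System;
using System.Text;

static class GreekSearch
{
    // Partial, illustrative mapping; digraphs come first so they win over single letters.
    static readonly (string Greek, string Latin)[] Map =
    {
        ("ου", "u"), ("ού", "u"),
        ("α", "a"), ("ά", "a"),
        ("β", "b"),
        ("ι", "i"), ("ί", "i"), ("ΐ", "i"), ("ϊ", "i"),
    };

    public static string ConvertToLatin(string s)
    {
        s = s.Normalize(NormalizationForm.FormC);
        var sb = new StringBuilder();
        int i = 0;
        while (i < s.Length)
        {
            bool mapped = false;
            foreach (var (greek, latin) in Map)
            {
                if (string.CompareOrdinal(s, i, greek, 0, greek.Length) == 0)
                {
                    sb.Append(latin);
                    i += greek.Length;
                    mapped = true;
                    break;
                }
            }
            if (!mapped) sb.Append(s[i++]); // unmapped characters pass through unchanged
        }
        return sb.ToString();
    }

    // Standard two-row dynamic-programming edit distance.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}
```

With this table, "αbu" and "abού" both convert to "abu" (distance 0 from the search term "abu"), while "αού" converts to "au" (distance 1), so a small distance threshold would be needed to return all three items.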
