Regex for a case change - c#

I am trying to insert an underscore '_' wherever the case changes in a string, to make it clearer to the user. For example, given the string 'personal IDnumber', I want to produce 'personal_ID_number'. This is in C#.
Thank you, I appreciate any help, suggestions.
-Sid

Replace on a case change
You are looking for this (I removed the space before ID, assuming it was a typo):
(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[a-z]))
It will transform personalIDnumber into personal_ID_number
See it here on Regexr
The lookbehind ((?<=[a-z])) and lookahead ((?=[A-Z])) construction matches the empty string between a lowercase and an uppercase letter (or the other way round in the second part, after the pipe |). The replacement then puts an underscore at that position.
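The answer gives only the pattern, so here is a minimal sketch of how it might be applied in C#. The class and method names (CaseChangeDemo, InsertUnderscores) are made up for illustration; only Regex.Replace and the pattern itself come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseChangeDemo
{
    // Zero-width match at every lower-to-upper or upper-to-lower boundary.
    private static readonly Regex CaseChange = new Regex(
        @"(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[a-z]))");

    // Hypothetical helper: inserts "_" at each case boundary.
    public static string InsertUnderscores(string input)
    {
        return CaseChange.Replace(input, "_");
    }
}
```

Calling CaseChangeDemo.InsertUnderscores("personalIDnumber") should return personal_ID_number. Because the match is empty, Replace inserts the underscore without consuming any letters.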
Replace on a case change with optional whitespace
If you want the replacement to swallow whitespace as well, just include it between the lookarounds:
(?:(?<=[a-z])\s*(?=[A-Z]))|(?:(?<=[A-Z])\s*(?=[a-z]))
It will transform personal IDnumber into personal_ID_number
See this here on Regexr
Replace on a case change, but each word starts with an uppercase
If every word starts with an uppercase letter, you can do this:
(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[A-Z][a-z]))
This will turn
PersonalIDNumber
JustAnotherTest
OtherWordWithID
SomeMORETest
into
Personal_ID_Number
Just_Another_Test
Other_Word_With_ID
Some_MORE_Test
See this here on Regexr
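As a rough check of the third pattern, the sketch below runs it through Regex.Replace. The names PascalCaseDemo and SplitWords are hypothetical; the pattern is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PascalCaseDemo
{
    // First half: lower followed by upper.
    // Second half: upper followed by "upper + lower", i.e. the end of an
    // acronym right before the next capitalized word.
    private static readonly Regex WordBoundary = new Regex(
        @"(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[A-Z][a-z]))");

    // Hypothetical helper: inserts "_" between words.
    public static string SplitWords(string input)
    {
        return WordBoundary.Replace(input, "_");
    }
}
```

The second half differs from the first pattern for a reason: with a plain upper-to-lower boundary, a capitalized word such as Personal would be split after its first letter (P_ersonal), since P followed by e is itself a case change.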

Non-regex solution (just in case)
string str = "personalIDnumber";
var result =
Enumerable.Range(0, str.Length - 1)
.Where(i => char.IsUpper(str[i]) != char.IsUpper(str[i + 1]))
.ToList();
string resultString = result.Aggregate(str, (current, i) => current.Insert(i + 1 + result.IndexOf(i), "_"));
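result stores the list of positions where an _ has to be inserted, and the Aggregate() function performs the insertions. The result.IndexOf(i) term shifts each position by the number of underscores already inserted before it.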

Related

How to remove multiple, repeating & unnecessary punctuation from string in C#?

Considering strings like this:
"This is a string....!"
"This is another...!!"
"What is this..!?!?"
...
// There are LOTS of examples of weird/angry sentence-endings like the ones above.
I want to replace the unnecessary punctuation at the end to make it look like this:
"This is a string!"
"This is another!"
"What is this?"
What I basically do is:
- split by space
- check if last char in string contains a punctuation
- start replacing with the patterns below
I have tried a very big ".Replace(string, string)" function, but it does not work - there has to be a simpler regex I guess.
Documentation:
Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
As well as:
Because this method returns the modified string, you can chain together successive calls to the Replace method to perform multiple replacements on the original string.
So something is wrong here.
EDIT: ALL the proposed solutions work fine! Thank you very much!
This one was the best suited solution for my project:
Regex re = new Regex("[.?!]*(?=[.?!]$)");
string output = re.Replace(input, "");
Your solution works almost fine (demo), the only issue is when the same sequence could be matched starting at different spots. For example, ..!?!? from your last line is not part of the substitution list, so ..!? and !? get replaced by two separate matches, producing ?? in the output.
It looks like your strategy is pretty straightforward: in a chain of multiple punctuation characters the last character wins. You can use regular expressions to do the replacement:
[!?.]*([!?.])
and replace it with $1, i.e. the capturing group that has the last character:
string s;
while ((s = Console.ReadLine()) != null) {
s = Regex.Replace(s, "[!?.]*([!?.])", "$1");
Console.WriteLine(s);
}
Demo
Simply
[.?!]*(?=[.?!]$)
should do it for you. Like
Regex re = new Regex("[.?!]*(?=[.?!]$)");
Console.WriteLine(re.Replace("This is a string....!", ""));
This removes every punctuation character except the last one.
[.?!]* matches any number of consecutive punctuation characters, and the (?=[.?!]$) is a positive lookahead making sure it leaves one at the end of the string.
See it here at ideone.
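To see the lookahead pattern applied to all three sample strings from the question, here is a small sketch. PunctuationDemo and Collapse are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationDemo
{
    // Matches the run of punctuation that precedes the final punctuation
    // character at the end of the string.
    private static readonly Regex Trailing = new Regex("[.?!]*(?=[.?!]$)");

    // Hypothetical helper: keeps only the last trailing punctuation mark.
    public static string Collapse(string input)
    {
        return Trailing.Replace(input, "");
    }
}
```

Because of the $ anchor, this variant only touches punctuation at the very end of the string. The [!?.]*([!?.]) pattern from the other answer is not anchored, so it also collapses runs in the middle of a sentence.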
Or you can do it without regExps:
string TrimPuncMarks(string str)
{
HashSet<char> punctMarks = new HashSet<char>() {'.', '!', '?'};
int i = str.Length - 1;
for (; i >= 0; i--)
{
if (!punctMarks.Contains(str[i]))
break;
}
// the last punctuation mark, or null if the string does not end with one
char? suffix = i < str.Length - 1 ? str[str.Length - 1] : (char?)null;
return str.Substring(0, i+1) + suffix;
}
Debug.Assert("What is this?" == TrimPuncMarks("What is this..!?!?"));
Debug.Assert("What is this" == TrimPuncMarks("What is this"));
Debug.Assert("What is this." == TrimPuncMarks("What is this."));

Use Regex to capitalize the first letters of the first and last words of a string?

Hopefully the title says it all, but I am wanting to upper case the first letters of both the first and the last words in a string like this:
Turn this:
this is a regular sentence
Into this:
This is a regular Sentence
Ideally, I'd like it to work on ANY characters such as à -> À, but I do not wish to over complicate this if that is a bigger deal to pull off.
Regular expressions alone can't do this, but you can pass a custom MatchEvaluator to the Replace method. This can be a lambda expression, like this:
var input = "this is a regular sentence";
var output = Regex.Replace(
input,
@"^(?<cap>\w)(?<rest>\w*)|(?<cap>\w)(?<rest>\w*)$",
m => m.Groups["cap"].Value.ToUpper() + m.Groups["rest"]);
Console.WriteLine(output); // This is a regular Sentence
Notice that in the pattern, I used named groups, so that I wouldn't have to worry about whether I was formatting the first or the last word.
Or perhaps more simply
var output = Regex.Replace(
input,
@"^(?<cap>\w)|\b(?<cap>\w)(?=\w*$)",
m => m.Groups["cap"].Value.ToUpper());
Here, I needed to use a lookahead assertion to identify the last word, but otherwise, the idea is the same.
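The question also asks about characters such as à. In .NET, \w matches Unicode letters, so the shorter pattern should cover that case as well. The sketch below wraps it in a hypothetical helper (FirstLastDemo.CapitalizeEnds) and uses ToUpperInvariant, which is a choice made here so the result does not depend on the current culture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstLastDemo
{
    // Hypothetical helper: upper-cases the first letter of the first
    // and of the last word.
    public static string CapitalizeEnds(string input)
    {
        return Regex.Replace(
            input,
            @"^(?<cap>\w)|\b(?<cap>\w)(?=\w*$)",
            // Only the single captured letter is matched; the lookahead
            // consumes nothing, so the rest of the word is left untouched.
            m => m.Groups["cap"].Value.ToUpperInvariant());
    }
}
```

With this, "this is a regular sentence" becomes "This is a regular Sentence", and an input starting with à comes back starting with À.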
If performance is a big concern, you can always do this:
int c = input.LastIndexOf(' ');
var output =
char.ToUpper(input[0]) +
input.Substring(1, c) +
char.ToUpper(input[c + 1]) +
input.Substring(c + 2);
However, this method does assume that the last word is preceded by a space.
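If that assumption is a problem, one possible variation (not from the answer above) works on a char array and tolerates empty strings, single words, and a trailing space. The names are hypothetical.

```csharp
using System;

public static class FirstLastNoRegexDemo
{
    // Hypothetical helper: same idea as the Substring version, but guarded
    // against inputs that contain no space.
    public static string CapitalizeEnds(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        char[] chars = input.ToCharArray();
        chars[0] = char.ToUpper(chars[0]);

        // Index of the first letter of the last word; 0 when there is no space.
        int last = input.LastIndexOf(' ') + 1;
        if (last < chars.Length)
            chars[last] = char.ToUpper(chars[last]);

        return new string(chars);
    }
}
```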

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the substring enclosed in round brackets (in the above case, it's 20) from this string. This substring can change dynamically at runtime; it might be any other number from 0 to infinity. My idea is to use a for loop that traverses the whole string: when a ( is found, it starts adding characters to another character array, and when ) is encountered, it stops and returns the array. But I think this might perform poorly. I know very little about regular expressions, so is there a regular expression solution, or any function that can do this efficiently?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
Demo
For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested from I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str, @"\d+");
and in this case, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int start = likes.IndexOf('(') + 1;
// length is the distance from the first digit to the closing parenthesis
int likesCount = int.Parse(likes.Substring(start, likes.IndexOf(')') - start));
Matching when the part in parentheses is supposed to be a number:
string inputstring = "Likes (20)";
Regex reg = new Regex(@"\((\d+)\)");
string num = reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stands for a digit. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for "the previous symbol or group repeated one or more times".
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses, we say that it should be between ( and ). These are special characters in regexp, so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, not including the parentheses themselves, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parentheses, things get a little bit harder.
Regex reg = new Regex(@"\((.+)\)");
would work for many strings (the dot matches any character). But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to match "any character except a closing parenthesis":
Regex reg = new Regex(@"\(([^)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses, I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
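When a string holds several parenthesized parts, the "anything except a closing parenthesis" idea can be combined with Regex.Matches to collect all of them. This is only a sketch; ParenthesesDemo and ExtractAll are illustrative names, and nested parentheses are still not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParenthesesDemo
{
    // Hypothetical helper: returns the content of every (...) group,
    // in order of appearance.
    public static List<string> ExtractAll(string input)
    {
        var results = new List<string>();
        // [^)]+ stops at the first closing parenthesis, so adjacent
        // groups are captured separately.
        foreach (Match m in Regex.Matches(input, @"\(([^)]+)\)"))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For "This is an example(parantesis1)(parantesis2)" this yields two entries, parantesis1 and parantesis2; for "Likes (20)" it yields the single entry 20.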

Capitalizing words in a string using C#

I need to take a string, and capitalize words in it. Certain words ("in", "at", etc.), are not capitalized and are changed to lower case if encountered. The first word should always be capitalized. Last names like "McFly" are not in the current scope, so the same rule will apply to them - only first letter capitalized.
For example: "of mice and men By CNN" should be changed to "Of Mice and Men by CNN". (Therefore ToTitleCase won't work here.)
What would be the best way to do that?
I thought of splitting the string by spaces, and go over each word, changing it if necessary, and concatenating it to the previous word, and so on.
It seems pretty naive and I was wondering if there's a better way to do it. I am using .NET 3.5.
Use
Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase("of mice and men By CNN");
to convert to proper case and then you can loop through the keywords as you have mentioned.
Depending on how often you plan on doing the capitalization I'd go with the naive approach. You could possibly do it with a regular expression, but the fact that you don't want certain words capitalized makes that a little trickier.
You can do it with two passes using regular expressions:
var result = Regex.Replace("of mice and men isn't By CNN", @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"(\s(of|in|by|and)|\'[st])\b", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
This outputs Of Mice and Men Isn't by CNN.
The first expression capitalizes every letter on a word boundary, and the second one lowercases any word from the list that follows white space (as well as an 's or 't suffix that the first pass capitalized).
The downsides to this approach are that you're using regexes (now you have two problems) and you'll need to keep that list of excluded words up to date. My regex-fu isn't good enough to be able to do it in one expression, but it might be possible.
An answer from another question, How to Capitalize names -
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Console.WriteLine(textInfo.ToTitleCase(title));
Console.WriteLine(textInfo.ToLower(title));
Console.WriteLine(textInfo.ToUpper(title));
Use ToTitleCase() first and then keep a list of applicable words and Replace back to the all-lower-case version of those applicable words (provided that list is small).
The list of applicable words could be kept in a dictionary and looped through pretty efficiently, replacing with the .ToLower() equivalent.
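A minimal sketch of that idea follows. It assumes a small, made-up word list and uses the invariant culture. Instead of calling string.Replace, it splits on spaces, which is a substitution made here so that a listed word is not accidentally replaced inside a longer word. TitleCaseDemo and ToHeadline are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TitleCaseDemo
{
    // Assumed list of words that stay lowercase; adjust as needed.
    private static readonly HashSet<string> LowercaseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "of", "in", "at", "by", "and" };

    public static string ToHeadline(string input)
    {
        // ToTitleCase leaves all-uppercase words such as "CNN" alone,
        // treating them as acronyms.
        string titled = CultureInfo.InvariantCulture.TextInfo.ToTitleCase(input);

        string[] words = titled.Split(' ');
        // Start at 1 so that the first word always stays capitalized.
        for (int i = 1; i < words.Length; i++)
        {
            if (LowercaseWords.Contains(words[i]))
                words[i] = words[i].ToLowerInvariant();
        }
        return string.Join(" ", words);
    }
}
```

For the question's input, TitleCaseDemo.ToHeadline("of mice and men By CNN") should give "Of Mice and Men by CNN".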
Try something like this:
public static string TitleCase(string input, params string[] dontCapitalize) {
var split = input.Split(' ');
for(int i = 0; i < split.Length; i++)
split[i] = i == 0
? CapitalizeWord(split[i])
: dontCapitalize.Contains(split[i])
? split[i]
: CapitalizeWord(split[i]);
return string.Join(" ", split);
}
public static string CapitalizeWord(string word)
{
return char.ToUpper(word[0]) + word.Substring(1);
}
You can then later update the CapitalizeWord method if you need to handle complex surnames.
Add those methods to a class and use it like this:
SomeClass.TitleCase("a test is a sentence", "is", "a"); // returns "A Test is a Sentence"
A slight improvement on jonnii's answer:
var result = Regex.Replace(s.Trim(), @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"\s(of|in|by|and)\s", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
result = result.Replace("'S", "'s");
You can have a Dictionary having the words you would like to ignore, split the sentence in phrases (.split(' ')) and for each phrase, check if the phrase exists in the dictionary, if it does not, capitalize the first character and then, add the string to a string buffer. If the phrase you are currently processing is in the dictionary, simply add it to the string buffer.
A non-clever approach that handles the simple case:
var s = "of mice and men By CNN";
var sa = s.Split(' ');
for (var i = 0; i < sa.Length; i++)
sa[i] = sa[i].Substring(0, 1).ToUpper() + sa[i].Substring(1);
var sout = string.Join(" ", sa);
Console.WriteLine(sout);
The easiest obvious solution (for English sentences) would be to:
Split the sentence on space characters - sentence.Split(' ')
Loop through each item
Capitalize the first letter of each item - char.ToUpper(item[0]) + item.Substring(1),
Remerge back into a string joined on a space.
Repeat this process with "." and "," using that new string.
You should create your own function like you're describing.

extract last match from string in c#

i have strings in the form [abc].[some other string].[can.also.contain.periods].[our match]
i now want to match the string "our match" (i.e. without the brackets), so i played around with lookarounds and whatnot. i now get the correct match, but i don't think this is a clean solution.
(?<=\.?\[) starts with '[' or '.['
([^\[]*) our match, i couldn't find a way to not use a negated character group
`.*?` non-greedy did not work as expected with lookarounds,
it would still match from the first match
(matches might contain escaped brackets)
(?=\]$) string ends with an ]
language is .net/c#. if there is an easier solution not involving a regex i'd be also happy to know
what really irritates me is the fact, that i cannot use (.*?) to capture the string, as it seems non-greedy does not work with lookbehinds.
i also tried: Regex.Split(str, @"\]\.\[").Last().TrimEnd(']');, but i'm not really proud of this solution either
The following should do the trick. Assuming the string ends after the last match.
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
var search = new Regex("\\.\\[(.*?)\\]$", RegexOptions.RightToLeft);
string ourMatch = search.Match(input).Groups[1].Value;
Assuming you can guarantee the input format, and it's just the last entry you want, LastIndexOf could be used:
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
int lastBracket = input.LastIndexOf("[");
string result = input.Substring(lastBracket + 1, input.Length - lastBracket - 2);
With String.Split():
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
char[] seps = {'[',']','\\'};
string[] splitted = input.Split(seps,StringSplitOptions.RemoveEmptyEntries);
you get "our match" in splitted[6] and can.also.contain.periods is left as one string (splitted[4])
Edit: the array will have the string inside [] and then . and so on, so if you have a variable number of groups, you can use that to get the value you want (or remove the strings that are just '.')
Edited to add the backslash to the separator to treat cases like '\[abc\]'
Edit2: for nested []:
string input = @"[abc].[some other string].[can.also.contain.periods].[our [the] match]";
string[] seps2 = { "].["};
string[] splitted = input.Split(seps2, StringSplitOptions.RemoveEmptyEntries);
you get our [the] match] in the last element (index 3) and you'd have to remove the extra ]
You have several options:
RegexOptions.RightToLeft - yes, .NET regex can do this! Use it!
Match the whole thing with greedy prefix, use brackets to capture the suffix that you're interested in
So generally, pattern becomes .*(pattern)
In this case, .*\[([^\]]*)\], then extract what \1 captures (see this on rubular.com)
References
regular-expressions.info/Grouping with brackets
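The greedy-prefix option can be sketched in C# as follows. In .NET the captured text is read from Groups[1] rather than \1. LastBracketDemo and LastBracketed are hypothetical names, and the sketch assumes the last entry contains no nested brackets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastBracketDemo
{
    // Hypothetical helper: returns the content of the last [...] entry,
    // or null when the input contains no bracketed entry.
    public static string LastBracketed(string input)
    {
        // The greedy .* consumes as much as possible, so the capture
        // lands on the final bracket pair.
        Match m = Regex.Match(input, @".*\[([^\]]*)\]");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the input from the question, "[abc].[some other string].[can.also.contain.periods].[our match]", this should return our match.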
