Regex: convert camel case to all caps with underscores - c#

What regular expression can be used to make the following conversions?
City -> CITY
FirstName -> FIRST_NAME
DOB -> DOB
PATId -> PAT_ID
RoomNO -> ROOM_NO
The following almost works - it just adds an extra underscore to the beginning of the word:
var rgx = @"(?x)( [A-Z][a-z,0-9]+ | [A-Z]+(?![a-z]) )";
var tests = new string[] { "City",
"FirstName",
"DOB",
"PATId",
"RoomNO"};
foreach (var test in tests)
Console.WriteLine("{0} -> {1}", test,
Regex.Replace(test, rgx, "_$0").ToUpper());
// output:
// City -> _CITY
// FirstName -> _FIRST_NAME
// DOB -> _DOB
// PATId -> _PAT_ID
// RoomNO -> _ROOM_NO

Following on from John M Gant's idea of adding underscores and then capitalizing, I think this regular expression should work:
([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])
replacing with:
$1$3_$2$4
You can rename the capture zones to make the replace string a little nicer to read. Only $1 or $3 should have a value, same with $2 and $4. The general idea is to add underscores when:
There are two capital letters followed by a lower case letter, place the underscore between the two capital letters. (PATId -> PAT_Id)
There is a small letter followed by a capital letter, place the underscore in the middle of the two. (RoomNO -> Room_NO and FirstName -> First_Name)
Hope this helps.
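The single-pattern replacement above can be sketched in C# as follows. The class and method names (SnakeCase, ToUpperSnake) are made up for illustration; the pattern and replacement string are the ones given in this answer. This is a minimal sketch for ASCII identifiers, not a general solution.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative, not from the original answer.
public static class SnakeCase
{
    // Alternative 1: upper, then upper+lower (e.g. "TId" in "PATId").
    // Alternative 2: lower-or-digit, then upper (e.g. "tN" in "FirstName").
    private static readonly Regex Boundary =
        new Regex(@"([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])");

    public static string ToUpperSnake(string name)
    {
        // Groups that did not take part in the match expand to an empty string,
        // so only one of $1/$3 and one of $2/$4 contributes text.
        return Boundary.Replace(name, "$1$3_$2$4").ToUpperInvariant();
    }
}
```

Calling SnakeCase.ToUpperSnake on each of the five test names from the question should give the conversions listed there, without the leading underscore.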

I suggest a simple Regex to insert the underscore, and then string.ToUpper() to convert to uppercase.
Regex.Replace(test, @"(\p{Ll})(\p{Lu})", "$1_$2").ToUpper()
It's two operations instead of one, but to me it's much easier to read than one big complicated regex replace.

I can probably come up with a regex that will do it... but I believe a transformative regex may not be the right answer. I suggest you take what you already have and just chop the first character (the leading underscore) off the output. The CPU time is probably going to be the same or less that way, and your coding time inconsequential.
Try: (?x)(.)( [A-Z][a-z,0-9]+ | [A-Z]+(?![a-z]) ) and change your code to output $0_$1 instead of _$0 <--misguided and failed attempt to dream up what I said was a silly idea.

Seems like Rails does it using more than one regular expression.
var rgx = @"([A-Z]+)([A-Z][a-z])";
var rgx2 = @"([a-z\d])([A-Z])";
foreach (var test in tests)
{
var result = Regex.Replace(test, rgx, "$1_$2");
result = Regex.Replace(result, rgx2, "$1_$2");
result = result.ToUpper();
Console.WriteLine("{0} -> {1}", test, result);
}

I realize this is an old question, but it is still something that comes up often, so I have decided to share my own approach to it.
Instead of trying to do it with replacements, the idea is to find all “words” in the string and then convert them to upper case and join:
var tests = new string[] { "City",
"FirstName",
"DOB",
"PATId",
"RoomNO"};
foreach (var test in tests)
Console.WriteLine("{0} -> {1}", test,
String.Join("_", new Regex(@"^(\p{Lu}(?:\p{Lu}*|[\p{Ll}\d]*))*$")
.Match(test)
.Groups[1]
.Captures
.Cast<Capture>()
.Select(c => c.Value.ToUpper())));
Not terribly concise, but it allows you to concentrate on defining what a “word” is, exactly, instead of struggling with anchors, separators and whatnot. In this case I've defined a word as something starting with an uppercase letter followed by either a sequence of uppercase letters or a mix of lowercase letters and digits. I could have wanted to separate sequences of digits, too. "^(\p{Lu}(?:\p{Lu}*|\p{Ll}*)|\d+)*$" would do the trick. Or maybe I wanted to have the digits as a part of the previous uppercase word, then I'd do "^(\p{Lu}(?:[\p{Lu}\d]*|[\p{Ll}\d]*))*$".

There is no javascript answer here, so may as well add it.
(This is using the regex from @John McDonald.)
var text = "fooBar barFoo";
var newText = text.replace(/([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])/g, "$1$3_$2$4");
newText.toLowerCase()

Related

Extract Menu from String

I want to extract a menu from a string whenever there is one.
recipe ABC: Quelle bonne idée!
L: 33348, C: 2130
1 Like
2 Comment
3 Next
4 See Comments
# Home
Since I am new to regex, I tried this for a start:
If Regex.IsMatch(text, "(\d\w*\n)*") Then
End If
And it returned true.
Am I doing this right?
I want to be able to extract the menu whenever there is one. Menus don't have a pre-defined format. So I used whatever starts with number \d followed by alphanumeric character \w and new line \n.
After regex returning true, how can I extract the text that did match the regex?
Any help would be appreciated.
You can use the regex (?sm).*?(?=^\d+\s+\p{L}+[\r\n]), which takes everything from the beginning up to the first line (due to ^) that starts with a number (\d+), then some spaces (\s+), then some letters (\p{L}+), then a newline ([\r\n]):
var txt ="Lorem ipsum:amet, consectetur adipiscing elit!!\r\nL: 33348, C: 2130\r\n\r\n1 Next\r\n\r\n2 Forward\r\n\r\n3 Last\r\n\r\n4 See more";
var rx = new Regex(@"(?sm).*?(?=^\d+\s+\p{L}+[\r\n])");
var res = rx.Match(txt).Value;
However, I believe your menu always starts with 1 at the line start, and all menu items are generally capitalized. That is why I suggest using another regex to reflect the following conditions: take all until a line that starts with 1 followed by some space(s), and then by an uppercase letter:
var rx = new Regex(@"(?sm).*(?=^1\s+\p{Lu})");
Or, you can try to split the string into lines, and check if a line starts with 1.
var out2 = string.Join("\r\n",txt.Split(new string[] { "\r\n" }, StringSplitOptions.None).TakeWhile(p => !p.StartsWith("1 ")).ToList());
You are using IsMatch, which only tells you whether the pattern matched anything.
You should use something like this :
Regex regex = new Regex(@"(\d\w*\n)*");
Match match = regex.Match(yourText);
if (match.Success)
{
Console.WriteLine(match.Value);
}
Disclaimer : As your question was not about your expression itself, I haven't checked what it does. You haven't asked help on that part so I didn't give any.
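For the original goal of pulling out the menu lines themselves, one possible sketch is to collect every line that starts with a number followed by a label. The shape of a "menu line" here is an assumption based on the sample in the question, and the names MenuExtractor and ExtractMenuLines are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names and the menu-line shape are assumptions.
public static class MenuExtractor
{
    // With RegexOptions.Multiline, ^ matches at the start of every line.
    // A menu line is assumed to be: digits, spaces or tabs, then the label.
    private static readonly Regex MenuLine =
        new Regex(@"^(\d+)[ \t]+([^\r\n]+)", RegexOptions.Multiline);

    public static List<string> ExtractMenuLines(string text)
    {
        var items = new List<string>();
        foreach (Match m in MenuLine.Matches(text))
        {
            // m.Groups[1] is the number and m.Groups[2] the label;
            // m.Value is the whole line without the line break.
            items.Add(m.Value);
        }
        return items;
    }
}
```

An empty list means no menu was found. Lines such as "# Home" are not picked up by this pattern, so it would need adjusting if non-numeric entries also count as menu items.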

c# regex to extract link after =

Couldn't find better title but i need a Regex to extract link from sample below.
snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip
assuming regex is the best way.
thanks
Consider the following sample code. It shows how one might extract from your supplied string. But I have expanded upon the string some. Generally, the use of .* is too all inclusive (as the example below demonstrates).
The main point is that there are several ways to do what you are asking: the first answer given uses "look-around", while the second suggests the "Groups" approach. The choice mainly depends upon your actual data.
string[] tests = {
@"snip... flashvars.image_url = 'http://domain.com/test.jpg' ..snip",
@"snip... flashvars.image_url = 'http://domain.com/test.jpg' flashvars2.image_url = 'http://someother.domain.com/test.jpg'",
};
string[] patterns = {
@"(?<==\s')[^']*(?=')",
@"=\s*'(.*)'",
@"=\s*'([^']*)'",
};
foreach (string pattern in patterns)
{
Console.WriteLine();
foreach (string test in tests)
foreach (Match m in Regex.Matches(test, pattern))
{
if (m.Groups.Count > 1)
Console.WriteLine("{0}", m.Groups[1].Value);
else
Console.WriteLine("{0}", m.Value);
}
}
A simple regex for this would be @"=\s*'(.*)'".
Edit: New regex matching your edited question:
You need to match what's between quotes, after a =, right?
@"(?<==\s*')[^']*(?=')"
should do.
(?<==\s*') asserts that there is a =, optionally followed by whitespace, followed by a ', just before our current position (positive lookbehind).
[^']* matches any number of non-' characters.
(?=') asserts that the match stops before the next '.
This regex doesn't check if there is indeed a URL inside those quotes. If you want to do that, use
@"(?<==\s*')(?=(?:https?|ftp|mailto)\b)[^']*(?=')"
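As a usage sketch, the look-around pattern explained above can be wrapped in a small helper. The names LinkExtractor and ExtractQuotedValue are made up for illustration; the pattern is the one from this answer, and it relies on .NET allowing variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class LinkExtractor
{
    // Returns the text between single quotes that follows an '=' sign,
    // or null when the input contains no such quoted value.
    public static string ExtractQuotedValue(string input)
    {
        Match m = Regex.Match(input, @"(?<==\s*')[^']*(?=')");
        return m.Success ? m.Value : null;
    }
}
```

Because the quotes and the equals sign sit inside look-arounds, m.Value is already the bare link and no capture group is needed.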

Capitalizing words in a string using C#

I need to take a string, and capitalize words in it. Certain words ("in", "at", etc.), are not capitalized and are changed to lower case if encountered. The first word should always be capitalized. Last names like "McFly" are not in the current scope, so the same rule will apply to them - only first letter capitalized.
For example: "of mice and men By CNN" should be changed to "Of Mice and Men by CNN". (Therefore ToTitleCase won't work here.)
What would be the best way to do that?
I thought of splitting the string by spaces, and go over each word, changing it if necessary, and concatenating it to the previous word, and so on.
It seems pretty naive and I was wondering if there's a better way to do it. I am using .NET 3.5.
Use
Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase("of mice and men By CNN");
to convert to proper case and then you can loop through the keywords as you have mentioned.
Depending on how often you plan on doing the capitalization I'd go with the naive approach. You could possibly do it with a regular expression, but the fact that you don't want certain words capitalized makes that a little trickier.
You can do it with two passes using regular expressions:
var result = Regex.Replace("of mice and men isn't By CNN", @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"(\s(of|in|by|and)|\'[st])\b", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
This outputs Of Mice and Men Isn't by CNN.
The first expression capitalizes every letter on a word boundary and the second one downcases any words matching the list that are surrounded by white space.
The downside to this approach is that you're using regexes (now you have two problems) and you'll need to keep that list of excluded words up to date. My regex-fu isn't good enough to be able to do it in one expression, but it might be possible.
An answer from another question, How to Capitalize names -
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Console.WriteLine(textInfo.ToTitleCase(title));
Console.WriteLine(textInfo.ToLower(title));
Console.WriteLine(textInfo.ToUpper(title));
Use ToTitleCase() first and then keep a list of applicable words and Replace back to the all-lower-case version of those applicable words (provided that list is small).
The list of applicable words could be kept in a dictionary and looped through pretty efficiently, replacing with the .ToLower() equivalent.
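A minimal sketch of the "ToTitleCase first, then lower-case the excluded words" idea might look like this. The class name, method name and the contents of the exclusion list are assumptions for illustration. It also assumes words are separated by single spaces and uses the invariant culture rather than the current thread's culture.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: names and the word list are illustrative only.
public static class HeadlineCaser
{
    // Assumed exclusion list; extend it to suit your data.
    private static readonly HashSet<string> SmallWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "of", "in", "at", "by", "and" };

    public static string ToHeadline(string input)
    {
        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;

        // ToTitleCase capitalizes each word but leaves all-uppercase words
        // such as "CNN" alone, treating them as acronyms.
        string[] words = textInfo.ToTitleCase(input).Split(' ');

        // Start at 1 so the first word always stays capitalized.
        for (int i = 1; i < words.Length; i++)
        {
            if (SmallWords.Contains(words[i]))
                words[i] = words[i].ToLowerInvariant();
        }
        return string.Join(" ", words);
    }
}
```

A HashSet with a case-insensitive comparer keeps the lookup cheap and avoids a separate ToLower call on every word.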
Try something like this:
public static string TitleCase(string input, params string[] dontCapitalize) {
var split = input.Split(' ');
for(int i = 0; i < split.Length; i++)
split[i] = i == 0
? CapitalizeWord(split[i])
: dontCapitalize.Contains(split[i])
? split[i]
: CapitalizeWord(split[i]);
return string.Join(" ", split);
}
public static string CapitalizeWord(string word)
{
return char.ToUpper(word[0]) + word.Substring(1);
}
You can then later update the CapitalizeWord method if you need to handle complex surnames.
Add those methods to a class and use it like this:
SomeClass.TitleCase("a test is a sentence", "is", "a"); // returns "A Test is a Sentence"
A slight improvement on jonnii's answer:
var result = Regex.Replace(s.Trim(), @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"\s(of|in|by|and)\s", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
result = result.Replace("'S", "'s");
You can have a Dictionary having the words you would like to ignore, split the sentence in phrases (.split(' ')) and for each phrase, check if the phrase exists in the dictionary, if it does not, capitalize the first character and then, add the string to a string buffer. If the phrase you are currently processing is in the dictionary, simply add it to the string buffer.
A non-clever approach that handles the simple case:
var s = "of mice and men By CNN";
var sa = s.Split(' ');
for (var i = 0; i < sa.Length; i++)
sa[i] = sa[i].Substring(0, 1).ToUpper() + sa[i].Substring(1);
var sout = string.Join(" ", sa);
Console.WriteLine(sout);
The easiest obvious solution (for English sentences) would be to:
"sentence".Split(" ") the sentence on space characters
Loop through each item
Capitalize the first letter of each item - item[i][0].ToUpper(),
Remerge back into a string joined on a space.
Repeat this process with "." and "," using that new string.
You should create your own function like you're describing.

How to capitalize first letter of each sentence?

I know how to capitalize first letter in each word. But I want to know how to capitalize first letter of each sentence in C#.
This is not necessarily a trivial problem. Sentences can end with a number of different punctuation marks, and those same punctuation marks don't always denote the end of a sentence (abbreviations like Dr. may pose a particular problem because there are potentially many of them).
That being said, you might be able to get a "good enough" solution by using regular expressions to look for words after a sentence-ending punctuation, but you would have to add quite a few special cases. It might be easier to process the string character by character or word by word. You would still have to handle all the same special cases, but it might be easier than trying to build that into a regex.
There are lots of weird rules for grammar and punctuation. Any solution you come up with probably won't be able to take them all into account. Some things to consider:
Sentences can end with different punctuation marks (. ! ?)
Some punctuation marks that end sentences might also be used in the middle of a sentence (e.g. abbreviations such as Dr. Mr. e.g.)
Sentences could contain nested sentences. Quotations could pose a particular problem (e.g. He said, "This is a hard problem! I wonder," he mused, "if it can be solved.")
As a first approximation, you could probably treat any sequence like [a-z]\.[ \n\t] as the end of a sentence.
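That first approximation can be sketched with a regex and a match evaluator. The names SentenceCaser and Capitalize are made up for illustration, and the pattern is a slight variation on the one above: it also accepts ! and ? as terminators and treats the start of the string as a sentence start. It only handles ASCII lowercase letters and makes no attempt to recognise abbreviations.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names and the exact pattern are assumptions.
public static class SentenceCaser
{
    // Group 1: start of string (plus optional whitespace), or a sentence
    //          terminator followed by at least one whitespace character.
    // Group 2: the lowercase letter that should be capitalized.
    private static readonly Regex SentenceStart =
        new Regex(@"(^\s*|[.!?]\s+)([a-z])");

    public static string Capitalize(string text)
    {
        return SentenceStart.Replace(
            text,
            m => m.Groups[1].Value + m.Groups[2].Value.ToUpperInvariant());
    }
}
```

Requiring whitespace after the terminator leaves strings like "www.example.com" untouched, but an abbreviation such as "dr." followed by a space is still treated as the end of a sentence.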
Consider a sentence as a word containing spaces and ending with a period.
There's some VB code on this page which shouldn't be too hard to convert to C#.
However, subsequent posts point out the errors in the algorithm.
This blog has some C# code which claims to work:
It auto capitalises the first letter after every full stop (period), question mark and exclamation mark.
UPDATE 16 Feb 2010: I’ve reworked it so that it doesn’t affect strings such as URL’s and the like
Don't forget sentences with parentheses. Also, * if used as an indicator for bold text.
http://www.grammarbook.com/punctuation/parens.asp
I needed to do something similar, and this served my purposes. I pass in my "sentences" as a IEnumerable of strings.
// Read sentences from text file (each sentence on a separate line)
IEnumerable<string> lines = File.ReadLines(inputPath);
// Call method below
lines = CapitalizeFirstLetterOfString(lines);
private static IEnumerable<string> CapitalizeFirstLetterOfString(IEnumerable<string> inputLines)
{
// Will output: Lorem lipsum et
List<string> outputLines = new List<string>();
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
foreach (string line in inputLines)
{
string lineLowerCase = textInfo.ToLower(line);
string[] lineSplit = lineLowerCase.Split(' ');
bool first = true;
for (int i = 0; i < lineSplit.Length; i++ )
{
if (first)
{
lineSplit[0] = textInfo.ToTitleCase(lineSplit[0]);
first = false;
}
}
outputLines.Add(string.Join(" ", lineSplit));
}
return outputLines;
}
I know I'm a little late, but just like you, I needed to capitalize the first character of each of my sentences.
I landed here (and on a lot of other pages while I was researching) and found nothing to help me out. So I burned some neurons and came up with an algorithm myself.
Here is my extension method to capitalize sentences:
public static string CapitalizeSentences(this string Input)
{
if (String.IsNullOrEmpty(Input))
return Input;
if (Input.Length == 1)
return Input.ToUpper();
Input = Regex.Replace(Input, @"\s+", " ");
Input = Input.Trim().ToLower();
Input = Char.ToUpper(Input[0]) + Input.Substring(1);
var objDelimiters = new string[] { ". ", "! ", "? " };
foreach (var objDelimiter in objDelimiters)
{
var varDelimiterLength = objDelimiter.Length;
var varIndexStart = Input.IndexOf(objDelimiter, 0);
while (varIndexStart > -1)
{
Input = Input.Substring(0, varIndexStart + varDelimiterLength) + (Input[varIndexStart + varDelimiterLength]).ToString().ToUpper() + Input.Substring((varIndexStart + varDelimiterLength) + 1);
varIndexStart = Input.IndexOf(objDelimiter, varIndexStart + 1);
}
}
return Input;
}
Details about the algorithm:
This simple algorithm starts by collapsing every run of whitespace into a single space and lower-casing the text. Then it capitalizes the first character of the string and searches for every delimiter. When it finds one, it capitalizes the very next character.
I made it easy to Add/Remove or Edit the delimiters, so You can change a lot how code works with a little change on it.
It doesn't check if the substrings go out of the string length, because the delimiters end with spaces, and the algorithm starts with a "Trim()", so every delimiter if found in the string will be followed by another character.
Important:
You didn't specify exactly what your needs are: whether this is for a grammar corrector, just to prettify some text, etc. So it's important to consider that my algorithm is just right for my needs, which may be different from yours.
*This algorithm was created to format a "Product Description" that isn't normalized (almost always it's entirely uppercased) in a nice format to the user (To be more specific, I need to show a pretty and "smaller" text for user. So, all characters in Upper Case is just opposite of what I want). So, it was not created to be grammatically perfect.
*Also, there maybe some exceptions where the character will not be uppercased because bad formatting.
*I chose to include spaces in the delimiters, so "http://www.stackoverflow.com" will not become "http://www.Stackoverflow.Com". On the other hand, sentences like "the box is blue.it's on the floor" will become "The box is blue.it's on the floor", and not "The box is blue.It's on the floor"
*Abbreviations will trigger capitalization too, but once again, that's not a problem because I only need to show a product description (where grammar is not extremely critical). And after abbreviations like Mr. or Dr. the very next word is a name, so it's fine for it to be capitalized.
If You, or somebody else needs a more accurate algorithm, I'll be glad to improve it.
Hope I could help somebody!
However you can make a class or method to convert each text in TitleCase. Here is the example you just need to call the method.
public static string ToTitleCase(string strX)
{
string[] aryWords = strX.Trim().Split(' ');
List<string> lstLetters = new List<string>();
List<string> lstWords = new List<string>();
foreach (string strWord in aryWords)
{
int iLCount = 0;
foreach (char chrLetter in strWord.Trim())
{
if (iLCount == 0)
{
lstLetters.Add(chrLetter.ToString().ToUpper());
}
else
{
lstLetters.Add(chrLetter.ToString().ToLower());
}
iLCount++;
}
lstWords.Add(string.Join("", lstLetters));
lstLetters.Clear();
}
string strNewString = string.Join(" ", lstWords);
return strNewString;
}

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces within double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for RegEx's. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is that it can be easily extended to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
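The pseudo code above translates fairly directly into C#. The following is a sketch that mirrors it branch for branch; the names QueryTokenizer and Tokenize are made up for illustration, and it makes no attempt to handle escaped quotes or the "-" exclusion prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: names are illustrative only.
public static class QueryTokenizer
{
    public static List<string> Tokenize(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the quoted phrase is complete.
                    words.Add(word.ToString());
                    word.Clear();
                    inQuotes = false;
                }
                else
                {
                    // Inside quotes, spaces are kept as part of the token.
                    word.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space outside quotes ends the current word, if any.
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Flush whatever is left after the last character.
        if (word.Length > 0)
            words.Add(word.ToString());

        return words;
    }
}
```

Like the pseudo code, this treats an unterminated quote as running to the end of the input.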
I was looking for a Java solution to this problem and came up with one using @Michael La Voie's regex. Thought I would share it here despite the question asking for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}
