I need to take a string, and capitalize words in it. Certain words ("in", "at", etc.), are not capitalized and are changed to lower case if encountered. The first word should always be capitalized. Last names like "McFly" are not in the current scope, so the same rule will apply to them - only first letter capitalized.
For example: "of mice and men By CNN" should be changed to "Of Mice and Men by CNN". (Therefore ToTitleCase won't work here.)
What would be the best way to do that?
I thought of splitting the string by spaces, and go over each word, changing it if necessary, and concatenating it to the previous word, and so on.
It seems pretty naive and I was wondering if there's a better way to do it. I am using .NET 3.5.
Use
Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase("of mice and men By CNN");
to convert to proper case and then you can loop through the keywords as you have mentioned.
Depending on how often you plan on doing the capitalization I'd go with the naive approach. You could possibly do it with a regular expression, but the fact that you don't want certain words capitalized makes that a little trickier.
You can do it with two passes using regular expressions:
var result = Regex.Replace("of mice and men isn't By CNN", @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"(\s(of|in|by|and)|\'[st])\b", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
This outputs Of Mice and Men Isn't by CNN.
The first expression capitalizes every letter on a word boundary and the second one downcases any words matching the list that are surrounded by white space.
The downsides to this approach are that you're using regexes (now you have two problems) and you'll need to keep that list of excluded words up to date. My regex-fu isn't good enough to be able to do it in one expression, but it might be possible.
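For what it's worth, the two passes can probably be folded into a single Regex.Replace call by moving the exclusion check into the MatchEvaluator. The following is only a sketch: the class name, method name and word list are made up for illustration, and it assumes the input has no leading whitespace (so the first word starts at index 0).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglePassTitleCase
{
    // Hypothetical exclusion list; extend it to suit your own rules.
    private static readonly HashSet<string> SmallWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "of", "in", "by", "and", "at" };

    public static string ToTitle(string input)
    {
        // [\w']+ keeps contractions such as "isn't" together as one word.
        return Regex.Replace(input, @"[\w']+", m =>
        {
            string word = m.Value;
            // The first word is always capitalized, even if it is on the list.
            if (m.Index > 0 && SmallWords.Contains(word))
                return word.ToLower();
            // Upper-case only the first character; the rest (e.g. "CNN") is left alone.
            return char.ToUpper(word[0]) + word.Substring(1);
        });
    }
}
```

Calling SinglePassTitleCase.ToTitle("of mice and men isn't By CNN") gives "Of Mice and Men Isn't by CNN". The list of excluded words still has to be maintained by hand.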
An answer from another question, How to Capitalize names -
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Console.WriteLine(textInfo.ToTitleCase(title));
Console.WriteLine(textInfo.ToLower(title));
Console.WriteLine(textInfo.ToUpper(title));
Use ToTitleCase() first and then keep a list of applicable words and Replace back to the all-lower-case version of those applicable words (provided that list is small).
The list of applicable words could be kept in a dictionary and looped through pretty efficiently, replacing with the .ToLower() equivalent.
Try something like this:
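// Requires: using System.Linq; (for Contains on the array)
public static string TitleCase(string input, params string[] dontCapitalize) {
    var split = input.Split(' ');
    for(int i = 0; i < split.Length; i++)
        split[i] = i == 0
            ? CapitalizeWord(split[i]) // the first word is always capitalized
            : dontCapitalize.Contains(split[i].ToLower())
                ? split[i].ToLower() // excluded words are forced to lower case
                : CapitalizeWord(split[i]);
    return string.Join(" ", split);
}
public static string CapitalizeWord(string word)
{
    if (word.Length == 0) return word; // consecutive spaces produce empty entries
    return char.ToUpper(word[0]) + word.Substring(1);
}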
You can then later update the CapitalizeWord method if you need to handle complex surnames.
Add those methods to a class and use it like this:
SomeClass.TitleCase("a test is a sentence", "is", "a"); // returns "A Test is a Sentence"
A slight improvement on jonnii's answer:
var result = Regex.Replace(s.Trim(), @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"\s(of|in|by|and)\s", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
result = result.Replace("'S", "'s");
You can have a Dictionary having the words you would like to ignore, split the sentence in phrases (.split(' ')) and for each phrase, check if the phrase exists in the dictionary, if it does not, capitalize the first character and then, add the string to a string buffer. If the phrase you are currently processing is in the dictionary, simply add it to the string buffer.
A non-clever approach that handles the simple case:
var s = "of mice and men By CNN";
var sa = s.Split(' ');
for (var i = 0; i < sa.Length; i++)
sa[i] = sa[i].Substring(0, 1).ToUpper() + sa[i].Substring(1);
var sout = string.Join(" ", sa);
Console.WriteLine(sout);
The easiest obvious solution (for English sentences) would be to:
Split the sentence on space characters - sentence.Split(' ')
Loop through each item
Capitalize the first letter of each item - char.ToUpper(item[0]) + item.Substring(1)
Rejoin the items into a string separated by spaces.
Repeat this process with "." and "," using that new string.
You should create your own function like you're describing.
Hopefully the title says it all, but I am wanting to upper case the first letters of both the first and the last words in a string like this:
Turn this:
this is a regular sentence
Into this:
This is a regular Sentence
Ideally, I'd like it to work on ANY characters such as à -> À, but I do not wish to over complicate this if that is a bigger deal to pull off.
Regular expressions alone can't do this, but you can pass a custom MatchEvaluator to the Replace method. This can be a lambda expression, like this:
var input = "this is a regular sentence";
var output = Regex.Replace(
input,
@"^(?<cap>\w)(?<rest>\w*)|(?<cap>\w)(?<rest>\w*)$",
m => m.Groups["cap"].Value.ToUpper() + m.Groups["rest"]);
Console.WriteLine(output); // This is a regular Sentence
Notice that in the pattern, I used named groups, so that I wouldn't have to worry about whether I was formatting the first or the last word.
Or perhaps more simply
var output = Regex.Replace(
input,
@"^(?<cap>\w)|\b(?<cap>\w)(?=\w*$)",
m => m.Groups["cap"].Value.ToUpper());
Here, I needed to use a lookahead assertion to identify the last word, but otherwise, the idea is the same.
If performance is a big concern, you can always do this:
int c = input.LastIndexOf(' ');
var output =
char.ToUpper(input[0]) +
input.Substring(1, c) +
char.ToUpper(input[c + 1]) +
input.Substring(c + 2);
However, this method does assume that the last word is preceded by a space.
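If the input might be a single word, the same index-based idea can be written so that it does not depend on a space being present. This is a minimal sketch under that assumption; the class and method names are invented for illustration.

```csharp
using System;

public static class FirstLastCapitalizer
{
    public static string Capitalize(string input)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        char[] chars = input.ToCharArray();
        chars[0] = char.ToUpper(chars[0]);

        // LastIndexOf returns -1 when there is no space, so c + 1 is 0
        // and a single-word input just gets its first letter capitalized.
        int c = input.LastIndexOf(' ');
        // Skip the second step if the string ends with a space.
        if (c + 1 < chars.Length)
            chars[c + 1] = char.ToUpper(chars[c + 1]);

        return new string(chars);
    }
}
```

char.ToUpper is culture-aware, so accented letters such as à should be handled as well.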
What would the syntax be to get all the words in a string after the first space? For example, bobs nice house. So the result should be " nice house" without the quotes.
([^\s]+) gives me all 3 words separated by ;
,[\s\S]*$ > not compiling.
I was really looking for the shortest possible code. The following did the job. Thanks, guys.
\s(.*)
I think it should be done this way:
[^ ]* (.*)
It allows 0 or more characters that are not a space, then a single space, and captures whatever comes after that space.
C# usage
var input = "bobs nice house";
var afterSpace = Regex.Match(input, "[^ ]* (.*)").Groups[1].Value;
afterSpace is "nice house".
To get that first space as well in result string change expression to [^ ]*( .*)
No regex solution
var afterSpace = input.Substring(input.IndexOf(' '));
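One caveat with the Substring version: IndexOf returns -1 when the input contains no space, and Substring(-1) throws an ArgumentOutOfRangeException. A small guard avoids that. The sketch below uses a made-up helper name and assumes that returning an empty string is acceptable when there is no space.

```csharp
using System;

public static class AfterFirstSpace
{
    public static string Get(string input)
    {
        int index = input.IndexOf(' ');
        // No space found: return an empty string instead of letting Substring throw.
        return index < 0 ? string.Empty : input.Substring(index);
    }
}
```

As in the line above, the result keeps the leading space (" nice house").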
Actually, you don't need to use regex for that process. You just need to use String.Split() method like this;
string s = "bobs nice house";
string[] s1 = s.Split(' ');
for(int i = 1; i < s1.Length; i++)
Console.WriteLine(s1[i]);
Output will be;
nice
house
Use /(?<=\s).*/g to find the string after the first space (JavaScript):
str="first second third forth fifth";
str=str.match(/(?<=\s).*/g)
console.log(str);
I have string. "12341234115151_log_1.txt" (this string length is not fixed. but "log" pattern always same)
I have a for loop.
each iteration, I want to set the number after "log" of i.
like "12341234115151_log_2.txt"
"12341234115151_log_3.txt"
....
to
"12341234115151_log_123.txt"
in c#, what is a good way to do so?
thanks.
A regex is ideal for this. You can use the Regex.Replace method and use a MatchEvaluator delegate to perform the numerical increment.
string input = "12341234115151_log_1.txt";
string pattern = @"(\d+)(?=\.)";
string result = Regex.Replace(input, pattern,
m => (int.Parse(m.Groups[1].Value) + 1).ToString());
The pattern breakdown is as follows:
(\d+): this matches and captures any digit, at least once
(?=\.): this is a look-ahead which ensures that a period (or dot) follows the number. A dot must be escaped to be a literal dot instead of a regex metacharacter. We know that the value you want to increment is right before the ".txt" so it should always have a dot after it. You could also use (?=\.txt) to make it clearer and be explicit, but you may have to use RegexOptions.IgnoreCase if your filename extension can have different cases.
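To tie this back to the loop in the question, here is a rough sketch that sets the counter to the loop index instead of incrementing it. The class and method names (LogNames, Build) are made up for illustration, and it assumes the file name always ends in a lower-case ".txt".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogNames
{
    public static List<string> Build(string template, int from, int to)
    {
        var names = new List<string>();
        for (int i = from; i <= to; i++)
        {
            // Replace the digits right before the final ".txt" with the loop counter.
            names.Add(Regex.Replace(template, @"\d+(?=\.txt$)", i.ToString()));
        }
        return names;
    }
}
```

For example, LogNames.Build("12341234115151_log_1.txt", 2, 4) yields the names ending in _log_2.txt, _log_3.txt and _log_4.txt; the leading digits are untouched because they are not followed by ".txt".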
You can use a Regex like this:
var r = new Regex("^(.*_log_)(\\d+)\\.txt$");
for ... {
var newname = r.Replace(filename, "${1}"+i+".txt");
}
Use regular expressions to get the counter, then just append them together.
If I've read your question right...
How about,
for (int i =0; i<some condition; i++)
{
string name = "12341234115151_log_"+ i.ToString() + ".txt";
}
I know how to capitalize first letter in each word. But I want to know how to capitalize first letter of each sentence in C#.
This is not necessarily a trivial problem. Sentences can end with a number of different punctuation marks, and those same punctuation marks don't always denote the end of a sentence (abbreviations like Dr. may pose a particular problem because there are potentially many of them).
That being said, you might be able to get a "good enough" solution by using regular expressions to look for words after a sentence-ending punctuation, but you would have to add quite a few special cases. It might be easier to process the string character by character or word by word. You would still have to handle all the same special cases, but it might be easier than trying to build that into a regex.
There are lots of weird rules for grammar and punctuation. Any solution you come up with probably won't be able to take them all into account. Some things to consider:
Sentences can end with different punctuation marks (. ! ?)
Some punctuation marks that end sentences might also be used in the middle of a sentence (e.g. abbreviations such as Dr. Mr. e.g.)
Sentences could contain nested sentences. Quotations could pose a particular problem (e.g. He said, "This is a hard problem! I wonder," he mused, "if it can be solved.")
As a first approximation, you could probably treat any sequence like [a-z]\.[ \n\t] as the end of a sentence.
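That first approximation could be sketched with a lookbehind, as below. This is only an illustration: the class and method names are invented, only the period is treated as a sentence end, and abbreviations such as "Dr." will still trigger a capital on the following word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveSentenceCapitalizer
{
    public static string Capitalize(string text)
    {
        // Match a lower-case letter at the very start of the text, or one that
        // follows "<lower-case letter><period><whitespace>".
        return Regex.Replace(
            text,
            @"^[a-z]|(?<=[a-z]\.[ \n\t]+)[a-z]",
            m => m.Value.ToUpper());
    }
}
```

Because the pattern requires whitespace after the period, something like "www.example.com" is left alone.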
Consider a sentence as a word containing spaces and ending with a period.
There's some VB code on this page which shouldn't be too hard to convert to C#.
However, subsequent posts point out the errors in the algorithm.
This blog has some C# code which claims to work:
It auto capitalises the first letter after every full stop (period), question mark and exclamation mark.
UPDATE 16 Feb 2010: I’ve reworked it so that it doesn’t affect strings such as URL’s and the like
Don't forget sentences with parentheses. Also, * if it is used as an indicator for bold text.
http://www.grammarbook.com/punctuation/parens.asp
I needed to do something similar, and this served my purposes. I pass in my "sentences" as an IEnumerable of strings.
// Read sentences from text file (each sentence on a separate line)
IEnumerable<string> lines = File.ReadLines(inputPath);
// Call method below
lines = CapitalizeFirstLetterOfString(lines);
private static IEnumerable<string> CapitalizeFirstLetterOfString(IEnumerable<string> inputLines)
{
// Will output: Lorem lipsum et
List<string> outputLines = new List<string>();
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
foreach (string line in inputLines)
{
string lineLowerCase = textInfo.ToLower(line);
string[] lineSplit = lineLowerCase.Split(' ');
bool first = true;
for (int i = 0; i < lineSplit.Length; i++ )
{
if (first)
{
lineSplit[0] = textInfo.ToTitleCase(lineSplit[0]);
first = false;
}
}
outputLines.Add(string.Join(" ", lineSplit));
}
return outputLines;
}
I know I'm a little late, but just like you, I needed to capitalize the first character of each of my sentences.
I landed here (and on a lot of other pages while researching) and found nothing that helped. So I burned some neurons and wrote an algorithm myself.
Here is my extension method to capitalize sentences:
public static string CapitalizeSentences(this string Input)
{
if (String.IsNullOrEmpty(Input))
return Input;
if (Input.Length == 1)
return Input.ToUpper();
Input = Regex.Replace(Input, @"\s+", " ");
Input = Input.Trim().ToLower();
Input = Char.ToUpper(Input[0]) + Input.Substring(1);
var objDelimiters = new string[] { ". ", "! ", "? " };
foreach (var objDelimiter in objDelimiters)
{
var varDelimiterLength = objDelimiter.Length;
var varIndexStart = Input.IndexOf(objDelimiter, 0);
while (varIndexStart > -1)
{
Input = Input.Substring(0, varIndexStart + varDelimiterLength) + (Input[varIndexStart + varDelimiterLength]).ToString().ToUpper() + Input.Substring((varIndexStart + varDelimiterLength) + 1);
varIndexStart = Input.IndexOf(objDelimiter, varIndexStart + 1);
}
}
return Input;
}
Details about the algorithm:
This simple algorithm starts by collapsing runs of whitespace into single spaces and lower-casing the whole string. Then it capitalizes the first character of the string and searches for every delimiter. When it finds one, it capitalizes the very next character.
I made it easy to add, remove or edit the delimiters, so you can change how the code behaves with a small edit.
It doesn't check whether the substrings run past the end of the string: the delimiters end with a space and the algorithm starts with a Trim(), so every delimiter found in the string is followed by another character.
Important:
You didn't specify exactly what your needs are (a grammar corrector, something to prettify text, etc.), so keep in mind that my algorithm fits my needs, which may differ from yours.
*This algorithm was created to turn a "Product Description" that isn't normalized (it is almost always entirely upper case) into a nicer format for the user. To be more specific, I need to show a pretty, "smaller" text, and all-upper-case characters are the opposite of what I want. So it was not created to be grammatically perfect.
*Also, there may be some exceptions where a character will not be upper-cased because of bad formatting.
*I chose to include spaces in the delimiters, so "http://www.stackoverflow.com" will not become "http://www.Stackoverflow.Com". On the other hand, a sentence like "the box is blue.it's on the floor" will become "The box is blue.it's on the floor", not "The box is blue.It's on the floor".
*With abbreviations, it will capitalize the next word, but once again that's not a problem because I only need to show a product description (where grammar is not extremely critical). And after abbreviations like Mr. or Dr. the next word is a name, so it should be capitalized anyway.
If you, or somebody else, need a more accurate algorithm, I'll be glad to improve it.
Hope I could help somebody!
You can also write your own method to convert any text to title case. Here is an example; you just need to call the method.
public static string ToTitleCase(string strX)
{
string[] aryWords = strX.Trim().Split(' ');
List<string> lstLetters = new List<string>();
List<string> lstWords = new List<string>();
foreach (string strWord in aryWords)
{
int iLCount = 0;
foreach (char chrLetter in strWord.Trim())
{
if (iLCount == 0)
{
lstLetters.Add(chrLetter.ToString().ToUpper());
}
else
{
lstLetters.Add(chrLetter.ToString().ToLower());
}
iLCount++;
}
lstWords.Add(string.Join("", lstLetters));
lstLetters.Clear();
}
string strNewString = string.Join(" ", lstWords);
return strNewString;
}
I have a string with multiple sentences. How do I capitalize the first letter of the first word in every sentence? Something like paragraph formatting in Word.
e.g. "this is some code. the code is in C#. "
The output must be "This is some code. The code is in C#".
One way would be to split the string on '.', capitalize the first letter of each piece and then rejoin.
Is there a better solution?
In my opinion, when it comes to potentially complex rules-based string matching and replacing - you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). This offers the best performance and memory efficiency, in my opinion - you'll be surprised at just how fast this'll be.
I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.
Here's the code:
public static string Capitalise(string input)
{
//now the first character
return Regex.Replace(input, @"(?<=(^|[.;:])\s*)[a-z]",
(match) => { return match.Value.ToUpper(); });
}
The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict matches to a-z characters preceded by the start of the string or by the punctuation marks you want. In the [.;:] bit you can add any extra ones you want (e.g. [.;:?""] to add the ? and " characters; the quote is doubled because the pattern is a verbatim string).
This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).
All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).
Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.
EDIT
Have put this solution up against Ahmad's as he quite rightly pointed out that a look-around might be less efficient than doing it his way.
Here's the crude benchmark I did:
public string LowerCaseLipsum
{
get
{
//went to lipsum.com and generated 10 paragraphs of lipsum
//which I then initialised into the backing field with @"[lipsumtext]".ToLower()
return _lowerCaseLipsum;
}
}
[TestMethod]
public void CapitaliseAhmadsWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1)));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
[TestMethod]
public void CapitaliseLookAroundWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
In a release build, my solution was about 12% faster than Ahmad's (1.48 seconds as opposed to 1.68 seconds).
Interestingly, however, if it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.
Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc. although you should certainly check if it covers your needs or set them explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section located on the MSDN Character Classes page.
string input = @"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = @"(^|\p{P}\s+)(\w+)";
// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);
string result = rx.Replace(input, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:
string pattern = @"(^|[.?!""]\s+)(\w+)";
EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.
The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:
P: all punctuations (comprises all of the categories below)
Pc: underscore _
Pd: dash -
Ps: open parenthesis, brackets and braces ( [ {
Pe: closing parenthesis, brackets and braces ) ] }
Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
Pf: final single/double quotes (MSDN Pi note applies)
Po: other punctuation such as commas, colons, semi-colons and slashes ,, :, ;, \, /
Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.
string input = @"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";
string[] patterns = {
@"(^|\p{P}\s+)(\w+)", // all punctuation chars
@"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
@"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};
// not compiled here, for the sake of the example
foreach (string pattern in patterns)
{
Console.WriteLine("*** Current pattern: {0}", pattern);
string result = Regex.Replace(input, pattern,
m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
Console.WriteLine(result);
Console.WriteLine();
}
Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.
Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.
You have a few different options:
Your approach of splitting the string, capitalizing and then re-joining
Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuilder may make this easier.
The code below takes the last approach, looping over the characters with a StringBuilder:
public static string ToSentenceCase( string someString )
{
var sb = new StringBuilder( someString.Length );
bool wasPeriodLastSeen = true; // We want first letter to be capitalized
foreach( var c in someString )
{
if( wasPeriodLastSeen && !char.IsWhiteSpace( c ) )
{
sb.Append( char.ToUpper( c ) );
wasPeriodLastSeen = false;
}
else
{
if( c == '.' ) // you may want to expand this to other punctuation
wasPeriodLastSeen = true;
sb.Append( c );
}
}
return sb.ToString();
}
I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.
static IEnumerable<char> CapitalLetters(string sentence)
{
//capitalize first letter
bool capitalize = true;
char lastLetter;
for (int i = 0; i < sentence.Length; i++)
{
lastLetter = sentence[i];
yield return (capitalize) ? Char.ToUpper(sentence[i]) : sentence[i];
if (Char.IsWhiteSpace(lastLetter) && capitalize == true)
continue;
capitalize = false;
if (lastLetter == '.' || lastLetter == '!') //etc
capitalize = true;
}
}
To use it:
string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());
Do your work in a StringBuilder.
Lowercase the whole thing.
Loop through and uppercase leading chars.
Call ToString.
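Those steps might look roughly like the sketch below. The class and method names are made up, and the choice of '.', '!' and '?' as sentence terminators is an assumption. Note that lower-casing everything first also lower-cases things like "C#".

```csharp
using System;
using System.Text;

public static class SentenceCase
{
    public static string Apply(string text)
    {
        // Step 1: lower-case the whole thing, working in a mutable buffer.
        var sb = new StringBuilder(text.ToLower());
        bool atSentenceStart = true;

        // Step 2: upper-case the first letter of each sentence.
        for (int i = 0; i < sb.Length; i++)
        {
            char c = sb[i];
            if (c == '.' || c == '!' || c == '?')
            {
                atSentenceStart = true;
            }
            else if (atSentenceStart && char.IsLetter(c))
            {
                sb[i] = char.ToUpper(c);
                atSentenceStart = false;
            }
            else if (!char.IsWhiteSpace(c))
            {
                // Digits, quotes, etc. use up the sentence start ("2.5" stays as is).
                atSentenceStart = false;
            }
        }

        // Step 3: materialize the result.
        return sb.ToString();
    }
}
```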