This is tough to explain enough to ask the question, but i'll try:
I have two possibilities of user input:
S01E05 or 0105 (two different input strings)
which both translate to season 01, episode 05
but if they user inputs it backwards E05S01 or 0501, i need to be able to return the same result, Season 01 Episode 05
The control for this would be the user defining the format of the original filename with something like this:
"SssEee" -- uppercase 'S' denoting that the following lowercase 's' belong to Season and uppercase 'E' denoting that the following lowercase 'e' belong to Episode. So if the user decides to define the format as EeeSss then my function should still return the same result since it knows which numbers belong to season or episode.
I don't have anything working quite yet to share, but what I was toying with is a loop that builds the regex pattern. The function, so far, accepts the user format and the file name:
public static int(string userFormat, string fileName)
{
}
the userFormat would be a string and look something like this:
t.t.t.SssEee
or even
t.SssEee
where t is for title, and the rest you know.
The file name might look like this:
battlestar.galactica.S01E05.mkv
Ive got the function that extracts the title from the file name by using the userFormat to build the regex string
public static string GetTitle(string userFormat, string fileName)
{
string pattern = "^";
char positionChar;
string fileTitle;
for (short i = 0; i < userFormat.Length; i++)
{
positionChar = userFormat[i];
//build the regex pattern
if (positionChar == 't')
{
pattern += #"\w+";
}
else if (positionChar == '#')
{
pattern += #"\d+";
}
else if (positionChar == ' ')
{
pattern += #"\s+";
}
else
pattern += positionChar;
}
//pulls out the title with or without the delimiter
Match title = Regex.Match(fileName, pattern, RegexOptions.IgnoreCase);
fileTitle = title.Groups[0].Value;
//remove the delimiter
string[] tempString = fileTitle.Split(#"\/.-<>".ToCharArray());
fileTitle = "";
foreach (string part in tempString)
{
fileTitle += part + " ";
}
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(fileTitle);
}
but im kind of stumped on how to do the extraction of the episode and season numbers. In my head im thinking the process would look something like:
Look through the userFormat string to find the uppercase S
Determine how many lowercase 's' are following the uppercase S
Build the regex expression that describes this
Search through the file name and find that pattern
Extract the number from that pattern
Sounds simple enough but im having a hard time putting it into actions. The complication being the the fact that the format in the filename could be S01E05 or it could be simply 0105. Either scenario would be identified by the user when they define the format.
Ex 1. the file name is battlestar.galactica.S01E05
the user format submitted will be t.t.?ss?ee
Ex 2. the file name is battlestar.galactica.0105
the user format submitted will be t.t.SssEee
Ex 3. the file name is battlestar.galactica.0501
the user format submitted will be t.t.EeeSss
Sorry for the book... the concept is simple, the regex function should be dynamic, allowing the user to define the format of a file name to where my method can generate the expression and use it to extract information from the file name. Something is telling me that this is simpler than it seems... but im at a loss. lol... any suggestions?
So if I read this right, you know where the the Season/Episode number is in the string because the user has told you. That is, you have t.t.<number>.more.stuff. And <number> can take one of these forms:
SssEee
EeeSss
ssee
eess
Or did you say that the user can define how many digits will be used for season and episode? That is, could it be S01E123?
I'm not sure you need a regex for this. Since you know the format, and it appears that things are separated by periods (I assume that there can't be periods in the individual fields), you should be able to use String.Split to extract the pieces, and you know from the user's format where the Season/Episode is in the resulting array. So you now have a string that takes one of the forms above.
You have the user's format definition and the Season/Episode number. You should be able to write a loop that steps through the two strings together and extracts the necessary information, or issues an error.
string UserFormat = "SssEee";
string EpisodeNumber = "0105";
int ifmt = 0;
int iepi = 0;
int season = 0;
int episode = 0;
while (ifmt <= UserFormat.Length && iepi < EpisodeNumber.Length)
{
if ((UserFormat[ifmt] == "S" || UserFormat[ifmt] == "E"))
{
if (EpisodeNumber[iepi] == UserFormat[ifmt])
{
++iepi;
}
else if (!char.IsDigit(EpisodeNumber[iepi]))
{
// Error! Chars didn't match, and it wasn't a digit.
break;
}
++ifmt;
}
else
{
char c = EpisodeNumber[iepi];
if (!char.IsDigit(c))
{
// error. Expected digit.
}
if (UserFormat[ifmt] == 'e')
{
episode = (episode * 10) + (int)c - (int)'0';
}
else if (UserFormat[ifmt] == 's')
{
season = (season * 10) + (int)c - (int)'0';
}
else
{
// user format is broken
break;
}
++iepi;
++ifmt;
}
}
Note that you'll probably have to do some checking to see that the lengths are correct. That is, the code above will accept S01E1 when the user's format is SssEee. There's a bit more error handling that you can add, depending on how worried you are about bad input. But I think this gives you the gist of the idea.
I have to think that's going to be a whole lot easier than trying to dynamically build regular expressions.
After #Sinaesthetic answered my question we can reduce his original post to:
The challenge is to receive any of these inputs:
0105 (if your input is 0105 you assume SxxEyy)
S01E05
E05S01 OR
1x05 (read as season 1 episode 5)
and transform any of these inputs into: S01E05
At this point title and file format are irrelevant, they just get tacked on to the ends.
Based on that the following code will always result in 'Battlestar.Galactica.S01E05.mkv'
static void Main(string[] args)
{
string[] inputs = new string[6] { "E05S01", "S01E05", "0105", "105", "1x05", "1x5" };
foreach (string input in inputs)
{
Console.WriteLine(FormatEpisodeTitle("Battlestar.Galactica", input, "mkv"));
}
Console.ReadLine();
}
private static string FormatEpisodeTitle(string showTitle, string identifier, string fileFormat)
{
//first make identifier upper case
identifier = identifier.ToUpper();
//normalize for SssEee & EeeSee
if (identifier.IndexOf('S') > identifier.IndexOf('E'))
{
identifier = identifier.Substring(identifier.IndexOf('S')) + identifier.Substring(identifier.IndexOf('E'), identifier.IndexOf('S'));
}
//now get rid of S and replace E with x as needed:
identifier = identifier.Replace("S", string.Empty).Replace("E", "X");
//at this point, if there isn't an "X" we need one, as in 105 or 0105
if (identifier.IndexOf('X') == -1)
{
identifier = identifier.Substring(0, identifier.Length - 2) + "X" + identifier.Substring(identifier.Length - 2);
}
//now split by the 'X'
string[] identifiers = identifier.Split('X');
// and put it back together:
identifier = 'S' + identifiers[0].PadLeft(2, '0') + 'E' + identifiers[1].PadLeft(2, '0');
//tack it all together
return showTitle + '.' + identifier + '.' + fileFormat;
}
Related
How to get whole text from document contacted into the string. I'm trying to split text by dot: string[] words = s.Split('.'); I want take this text from text document. But if my text document contains empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this first I need to process text file content to get whole text as single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't to do this same way as it would be with list content for example: string concat = String.Join(" ", text.ToArray());,
I'm not sure how to contact text into string from text document
I think this is what you want:
var fileLocation = #"c:\\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with any new line character your file uses
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, "");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
var withoutTwoSpaces = withoutUglyCharacters.Replace(" ", " ");
var result = withoutTwoSpaces.Split('.').Where(i => i != "").Select(i => i.TrimStart()).ToList();
So first you read all text from your file, then you remove all unwanted characters and then split by . and return non empty items
Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, #"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisory you surround the method call with a try/catch.
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
Creates a string, and ignores any lines which are purely whitespace or empty.
var sentences = Regex.Split(lines, #".[\s]{1,}?");
Creates a string array, where the string is split at every period and whitespace following the period.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
Update:
Here's the method as a one-liner, if that's your style?
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line))), #"") : null;
I would suggest you to iterate through all characters and just check if they are in range of 'a' >= char <= 'z' or if char == ' '. If it matches the condition then add it to the newly created string else check if it is '.' character and if it is then end your line and add another one :
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Working online example
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());
I may be wrong about this. I would think that you may not want to alter the string if you are splitting it. Example, there are double/single quote(s) (“) in part of the string. Removing them may not be desired which brings up the possibly of a question, reading a text file that contains single/double quotes (as your example data text shows) like below:
var stringFromFile = File.ReadAllText(fileLocation);
will not display those characters properly in a text box or the console because the default encoding using the ReadAllText method is UTF8. Example the single/double quotes will display (replacement characters) as diamonds in a text box on a form and will be displayed as a question mark (?) when displayed to the console. To keep the single/double quotes and have them display properly you can get the encoding for the OS’s current ANSI encoding by adding a parameter to the ReadAllText method like below:
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
Below is code using a simple split method to .split the string on periods (.) Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = #"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, "");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}
I have these data files comming in on a server that i need to split into [date time] and [value]. Most of them are delimited a single time between time and value and between date and time is a space. I already have a program processing the data with a simple split(char[]) but now found data where the delimiter is a space and i am wondering how to tackle this best.
So most files i encountered look like this:
18-06-2014 12:00:00|220.6
The delimiters vary, but i tackled that with a char[]. But today i ran into a problem on this format:
18-06-2014 12:00:00 220.6
This complicates things a little. The easy solution would be to just add a space to my split characters and when i find 3 splits combine the first two before processing?
I'm looking for a 2nd opining on this matter. Also the time format can change to something like d/m/yy and the amount of lines can run into the millions so i would like to keep it as efficient as possible.
Yes I believe the most efficient solution is to add space as a delimiter and then just combine the first two if you get three. That is going to be be more efficient than regex.
You've got a string 18-06-2014 12:00:00 220.6 where first 19 characters is a date, one character is a separation symbol and other characters is a value. So:
var test = "18-06-2014 12:00:00|220.6";
var dateString = test.Remove(19);
var val = test.Substring(20);
Added normalization:
static void Main(string[] args) {
var test = "18-06-2014 12:00:00|220.6";
var test2 = "18-6-14 12:00:00|220.6";
var test3 = "8-06-14 12:00:00|220.6";
Console.WriteLine(test);
Console.WriteLine(TryNormalizeImportValue(test));
Console.WriteLine(test2);
Console.WriteLine(TryNormalizeImportValue(test2));
Console.WriteLine(test3);
Console.WriteLine(TryNormalizeImportValue(test3));
}
private static string TryNormalizeImportValue(string value) {
var valueSplittedByDateSeparator = value.Split('-');
if (valueSplittedByDateSeparator.Length < 3) throw new InvalidDataException();
var normalizedDay = NormalizeImportDayValue(valueSplittedByDateSeparator[0]);
var normalizedMonth = NormalizeImportMonthValue(valueSplittedByDateSeparator[1]);
var valueYearPartSplittedByDateTimeSeparator = valueSplittedByDateSeparator[2].Split(' ');
if (valueYearPartSplittedByDateTimeSeparator.Length < 2) throw new InvalidDataException();
var normalizedYear = NormalizeImportYearValue(valueYearPartSplittedByDateTimeSeparator[0]);
var valueTimeAndValuePart = valueYearPartSplittedByDateTimeSeparator[1];
return string.Concat(normalizedDay, '-', normalizedMonth, '-', normalizedYear, ' ', valueTimeAndValuePart);
}
private static string NormalizeImportDayValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportMonthValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportYearValue(string value) {
return value.Length == 4 ? value : DateTime.Now.Year.ToString(CultureInfo.InvariantCulture).Remove(2) + value;
}
Well you can use this one to get the date and the value.
(((0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-(19|20)\d\d)\s((\d{2}:?){3})|(\d+\.?\d+))
This will give you 2 matches
1º 18-06-2014 12:00:00
2º 220.6
Example:
http://regexr.com/391d3
This regex matches both kinds of strings, capturing the two tokens to Groups 1 and 2.
Note that we are not using \d because in .NET it can match any Unicode digits such as Thai...
The key is in the [ |] character class, which specifies your two allowable delimiters
Here is the regex:
^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$
In the demo, please pay attention to the capture Groups in the right pane.
Here is how to retrieve the values:
var myRegex = new Regex(#"^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$", RegexOptions.IgnoreCase);
string mydate = myRegex.Match(s1).Groups[1].Value;
Console.WriteLine(mydate);
string myvalue = myRegex.Match(s1).Groups[1].Value;
Console.WriteLine(myvalue);
Please let me know if you have questions
Given the provided format I'd use something like
char delimiter = ' '; //or whatever the delimiter for the specific file is, this can be set in a previous step
int index = line.LastIndexOf(delimiter);
var date = line.Remove(index);
var value = line.Substring(++index);
If there are that many lines and efficiency matters, you could obtain the delimiter once on the first line, by looping back from the end and find the first index that is not a digit or dot (or comma if the value can contain those) to determine the delimiter, and then use something such as the above.
If each line can contain a different delimiter, you could always track back to the first not value char as described above and still maintain adequate performance.
Edit: for completeness sake, to find the delimiter, you could perform the following once per file (provided that the delimiter stays consistent within the file)
char delimiter = '\0';
for (int i = line.Length - 1; i >= 0; i--)
{
var c= line[i];
if (!char.IsDigit(c) && c != '.')
{
delimiter = c;
break;
}
}
I would like to check some string for invalid characters. With invalid characters I mean characters that should not be there. What characters are these? This is different, but I think thats not that importan, important is how should I do that and what is the easiest and best way (performance) to do that?
Let say I just want strings that contains 'A-Z', 'empty', '.', '$', '0-9'
So if i have a string like "HELLO STaCKOVERFLOW" => invalid, because of the 'a'.
Ok now how to do that? I could make a List<char> and put every char in it that is not allowed and check the string with this list. Maybe not a good idea, because there a lot of chars then. But I could make a list that contains all of the allowed chars right? And then? For every char in the string I have to compare the List<char>? Any smart code for this? And another question: if I would add A-Z to the List<char> I have to add 25 chars manually, but these chars are as I know 65-90 in the ASCII Table, can I add them easier? Any suggestions? Thank you
You can use a regular expression for this:
Regex r = new Regex("[^A-Z0-9.$ ]$");
if (r.IsMatch(SomeString)) {
// validation failed
}
To create a list of characters from A-Z or 0-9 you would use a simple loop:
for (char c = 'A'; c <= 'Z'; c++) {
// c or c.ToString() depending on what you need
}
But you don't need that with the Regex - pretty much every regex engine understands the range syntax (A-Z).
I have only just written such a function, and an extended version to restrict the first and last characters when needed. The original function merely checks whether or not the string consists of valid characters only, the extended function adds two integers for the numbers of valid characters at the beginning of the list to be skipped when checking the first and last characters, in practice it simply calls the original function 3 times, in the example below it ensures that the string begins with a letter and doesn't end with an underscore.
StrChr(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"));
StrChrEx(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", 11, 1));
BOOL __cdecl StrChr(CHAR* str, CHAR* chars)
{
for (int s = 0; str[s] != 0; s++)
{
int c = 0;
while (true)
{
if (chars[c] == 0)
{
return false;
}
else if (str[s] == chars[c])
{
break;
}
else
{
c++;
}
}
}
return true;
}
BOOL __cdecl StrChrEx(CHAR* str, CHAR* chars, UINT excl_first, UINT excl_last)
{
char first[2] = {str[0], 0};
char last[2] = {str[strlen(str) - 1], 0};
if (!StrChr(str, chars))
{
return false;
}
if (excl_first != 0)
{
if (!StrChr(first, chars + excl_first))
{
return false;
}
}
if (excl_last != 0)
{
if (!StrChr(last, chars + excl_last))
{
return false;
}
}
return true;
}
If you are using c#, you do this easily using List and contains. You can do this with single characters (in a string) or a multicharacter string just the same
var pn = "The String To ChecK";
var badStrings = new List<string>()
{
" ","\t","\n","\r"
};
foreach(var badString in badStrings)
{
if(pn.Contains(badString))
{
//Do something
}
}
If you're not super good with regular expressions, then there is another way to go about this in C#. Here is a block of code I wrote to test a string variable named notifName:
var alphabet = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z";
var numbers = "0,1,2,3,4,5,6,7,8,9";
var specialChars = " ,(,),_,[,],!,*,-,.,+,-";
var validChars = (alphabet + "," + alphabet.ToUpper() + "," + numbers + "," + specialChars).Split(',');
for (int i = 0; i < notifName.Length; i++)
{
if (Array.IndexOf(validChars, notifName[i].ToString()) < 0) {
errorFound = $"Invalid character '{notifName[i]}' found in notification name.";
break;
}
}
You can change the characters added to the array as needed. The Array IndexOf method is the key to the whole thing. Of course if you want commas to be valid, then you would need to choose a different split character.
Not enough reps to comment directly, but I recommend the Regex approach. One small caveat: you probably need to anchor both ends of the input string, and you will want at least one character to match. So (with thanks to ThiefMaster), here's my regex to validate user input for a simple arithmetical calculator (plus, minus, multiply, divide):
Regex r = new Regex(#"^[0-9\.\-\+\*\/ ]+$");
I'd go with a regex, but still need to add my 2 cents here, because all the proposed non-regex solutions are O(MN) in the worst case (string is valid) which I find repulsive for religious reasons.
Even more so when LINQ offers a simpler and more efficient solution than nesting loops:
var isInvalid = "The String To Test".Intersect("ALL_INVALID_CHARS").Any();
I could write my own algorithm to do it, but I feel there should be the equivalent to ruby's humanize in C#.
I googled it but only found ways to humanize dates.
Examples:
A way to turn "Lorem Lipsum Et" into "Lorem lipsum et"
A way to turn "Lorem lipsum et" into "Lorem Lipsum Et"
As discussed in the comments of #miguel's answer, you can use TextInfo.ToTitleCase which has been available since .NET 1.1. Here is some code corresponding to your example:
string lipsum1 = "Lorem lipsum et";
// Creates a TextInfo based on the "en-US" culture.
TextInfo textInfo = new CultureInfo("en-US",false).TextInfo;
// Changes a string to titlecase.
Console.WriteLine("\"{0}\" to titlecase: {1}",
lipsum1,
textInfo.ToTitleCase( lipsum1 ));
// Will output: "Lorem lipsum et" to titlecase: Lorem Lipsum Et
It will ignore casing things that are all caps such as "LOREM LIPSUM ET" because it is taking care of cases if acronyms are in text so that "IEEE" (Institute of Electrical and Electronics Engineers) won't become "ieee" or "Ieee".
However if you only want to capitalize the first character you can do the solution that is over here… or you could just split the string and capitalize the first one in the list:
string lipsum2 = "Lorem Lipsum Et";
string lipsum2lower = textInfo.ToLower(lipsum2);
string[] lipsum2split = lipsum2lower.Split(' ');
bool first = true;
foreach (string s in lipsum2split)
{
if (first)
{
Console.Write("{0} ", textInfo.ToTitleCase(s));
first = false;
}
else
{
Console.Write("{0} ", s);
}
}
// Will output: Lorem lipsum et
There is another elegant solution :
Define the function ToTitleCase in an static class of your projet
using System.Globalization;
public static string ToTitleCase(this string title)
{
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(title.ToLower());
}
And then use it like a string extension anywhere on your project:
"have a good day !".ToTitleCase() // "Have A Good Day !"
Use regular expressions for this looks much cleaner:
string s = "the quick brown fox jumps over the lazy dog";
s = Regex.Replace(s, #"(^\w)|(\s\w)", m => m.Value.ToUpper());
All the examples seem to make the other characters lowered first which isn't what I needed.
customerName = CustomerName <-- Which is what I wanted
this is an example = This Is An Example
public static string ToUpperEveryWord(this string s)
{
// Check for empty string.
if (string.IsNullOrEmpty(s))
{
return string.Empty;
}
var words = s.Split(' ');
var t = "";
foreach (var word in words)
{
t += char.ToUpper(word[0]) + word.Substring(1) + ' ';
}
return t.Trim();
}
If you just want to capitalize the first character, just stick this in a utility method of your own:
return string.IsNullOrEmpty(str)
? str
: str[0].ToUpperInvariant() + str.Substring(1).ToLowerInvariant();
There's also a library method to capitalize the first character of every word:
http://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase.aspx
CSS technique is ok but only changes the presentation of the string in the browser. A better method is to make the text itself capitalised before sending to browser.
Most of the above implimentations are ok, but none of them address the issue of what happens if you have mixed case words that need to be preserved, or if you want to use true Title Case, for example:
"Where to Study PHd Courses in the USA"
or
"IRS Form UB40a"
Also using CultureInfo.CurrentCulture.TextInfo.ToTitleCase(string) preserves upper case words as in
"sports and MLB baseball" which becomes "Sports And MLB Baseball" but if the whole string is put in upper case, then this causes an issue.
So I put together a simple function that allows you to keep the capital and mixed case words and make small words lower case (if they are not at the start and end of the phrase) by including them in a specialCases and lowerCases string arrays:
public static string TitleCase(string value) {
string titleString = ""; // destination string, this will be returned by function
if (!String.IsNullOrEmpty(value)) {
string[] lowerCases = new string[12] { "of", "the", "in", "a", "an", "to", "and", "at", "from", "by", "on", "or"}; // list of lower case words that should only be capitalised at start and end of title
string[] specialCases = new string[7] { "UK", "USA", "IRS", "UCLA", "PHd", "UB40a", "MSc" }; // list of words that need capitalisation preserved at any point in title
string[] words = value.ToLower().Split(' ');
bool wordAdded = false; // flag to confirm whether this word appears in special case list
int counter = 1;
foreach (string s in words) {
// check if word appears in lower case list
foreach (string lcWord in lowerCases) {
if (s.ToLower() == lcWord) {
// if lower case word is the first or last word of the title then it still needs capital so skip this bit.
if (counter == 0 || counter == words.Length) { break; };
titleString += lcWord;
wordAdded = true;
break;
}
}
// check if word appears in special case list
foreach (string scWord in specialCases) {
if (s.ToUpper() == scWord.ToUpper()) {
titleString += scWord;
wordAdded = true;
break;
}
}
if (!wordAdded) { // word does not appear in special cases or lower cases, so capitalise first letter and add to destination string
titleString += char.ToUpper(s[0]) + s.Substring(1).ToLower();
}
wordAdded = false;
if (counter < words.Length) {
titleString += " "; //dont forget to add spaces back in again!
}
counter++;
}
}
return titleString;
}
This is just a quick and simple method - and can probably be improved a bit if you want to spend more time on it.
if you want to keep the capitalisation of smaller words like "a" and "of" then just remove them from the special cases string array. Different organisations have different rules on capitalisation.
You can see an example of this code in action on this site: Egg Donation London - this site automatically creates breadcrumb trails at the top of the pages by parsing the url eg "/services/uk-egg-bank/introduction" - then each folder name in the trail has hyphens replaced with spaces and capitalises the folder name, so uk-egg-bank becomes UK Egg Bank. (preserving the upper case 'UK')
An extension of this code could be to have a lookup table of acronyms and uppercase/lowercase words in a shared text file, database table or web service so that the list of mixed case words can be maintained from one single place and apply to many different applications that rely on the function.
There is no prebuilt solution for proper linguistic captialization in .NET. What kind of capitialization are you going for? Are you following the Chicago Manual of Style conventions? AMA or MLA? Even plain english sentence capitalization has 1000's of special exceptions for words. I can't speak to what ruby's humanize does, but I imagine it likely doesn't follow linguistic rules of capitalization and instead does something much simpler.
Internally, we encountered this same issue and had to write a fairly large amount code just to handle proper (in our little world) casing of article titles, not even accounting for sentence capitalization. And it indeed does get "fuzzy" :)
It really depends on what you need - why are you trying to convert the sentences to proper capitalization (and in what context)?
I have achieved the same using custom extension methods. For First Letter of First sub-string use the method yourString.ToFirstLetterUpper(). For First Letter of Every sub-string excluding articles and some propositions, use the method yourString.ToAllFirstLetterInUpper(). Below is a console program:
class Program
{
static void Main(string[] args)
{
Console.WriteLine("this is my string".ToAllFirstLetterInUpper());
Console.WriteLine("uniVersity of lonDon".ToAllFirstLetterInUpper());
}
}
public static class StringExtension
{
public static string ToAllFirstLetterInUpper(this string str)
{
var array = str.Split(" ");
for (int i = 0; i < array.Length; i++)
{
if (array[i] == "" || array[i] == " " || listOfArticles_Prepositions().Contains(array[i])) continue;
array[i] = array[i].ToFirstLetterUpper();
}
return string.Join(" ", array);
}
private static string ToFirstLetterUpper(this string str)
{
return str?.First().ToString().ToUpper() + str?.Substring(1).ToLower();
}
private static string[] listOfArticles_Prepositions()
{
return new[]
{
"in","on","to","of","and","or","for","a","an","is"
};
}
}
OUTPUT
This is My String
University of London
Process finished with exit code 0.
Far as I know, there's not a way to do that without writing (or cribbing) code. C# nets (ha!) you upper, lower and title (what you have) cases:
http://support.microsoft.com/kb/312890/EN-US/
How would you convert names to proper case in C#?
I have a list of names that I'd like to proof.
For example: mcdonalds to McDonalds or o'brien to O'Brien.
You could consider using a search engine to help you. Submit a query and see how the results have capitalized the name.
I wrote the following extension methods. Feel free to use them.
public static class StringExtensions
{
public static string ToProperCase( this string original )
{
if( original.IsNullOrEmpty() )
return original;
string result = _properNameRx.Replace( original.ToLower( CultureInfo.CurrentCulture ), HandleWord );
return result;
}
public static string WordToProperCase( this string word )
{
if( word.IsNullOrEmpty() )
return word;
if( word.Length > 1 )
return Char.ToUpper( word[0], CultureInfo.CurrentCulture ) + word.Substring( 1 );
return word.ToUpper( CultureInfo.CurrentCulture );
}
private static readonly Regex _properNameRx = new Regex( #"\b(\w+)\b" );
private static readonly string[] _prefixes = { "mc" };
private static string HandleWord( Match m )
{
string word = m.Groups[1].Value;
foreach( string prefix in _prefixes )
{
if( word.StartsWith( prefix, StringComparison.CurrentCultureIgnoreCase ) )
return prefix.WordToProperCase() + word.Substring( prefix.Length ).WordToProperCase();
}
return word.WordToProperCase();
}
}
There is absolutely no way for a computer just to magically know that the first "D" in "McDonalds" should be capitalized. So, I think there are two choices.
Someone out there may have a piece of software or a library that will do this for you.
Barring that, your only choice is to take the following approach: First, I'd look up the name in a dictionary of words that have "interesting" capitalization. Obviously you'd have to provide this dictionary yourself, unless one exists already. Second, apply an algorithm that corrects some of the obvious ones, like Celtic names beginning with O' and Mac and Mc, although given a large enough pool of names, such an algorithm will undoubtedly have a lot of false positives. Lastly, capitalize the first letter of every name that doesn't meet the first two criteria.
The hard part of this is the algorithms to decide on the capitalization. The string manipulation itself is pretty easy. There isn't a perfect way, since there are no "rules" for cases. One strategy might be a set of rules, such as "capitalize the first letter...usually" and "capitalize the 3rd letter if the first two letters are mc...usually"
Starting with a dictionary of real names and comparing them to your own name for matches will help. You could also take a dictionary of real names, generate a Markhov chain from it, and throw any new names at the Markhov chain to determine the capitalization. That's a crazy, complicated solution.
The ultimate perfect solution is to use humans to correct the data.
Doing this requires that your program be able to interpret the english language to an extent. At the very least be able to break down a string into a set of words. There is no API built-into the .Net Framework that can achieve this.
However if there was, you could use the following code.
public string ProperCase(string str, Func<string,bool> isWord) {
var word = new StringBuilder();
var cur = new StringBuilder();
for ( var i = 0; i < str.Length; i++ ) {
cur.Append(cur.Length == 0 ? Char.ToUpper(str[i]) : str[i]));
if ( isWord(cur.ToString()) {
word.Append(cur.ToString());
cur.Length = 0;
}
}
if ( cur.Length > 0 ) {
word.Append(cur);
}
return word.ToString();
}
It's not a perfect solution but it gives you a general idea of the outline
You could check the lower/mixed case surname against a dictionary (file) that has the correct casings in it, then return the 'real' value from the dictionary.
I had a quick google to see if one exists, but to no avail!
I'm planning on writing such a function, but will probably not go into too many edge cases... Below in psuedo-code with regex for matching...
start with /\b[A-Z]+\b/ as set matching, so each sequence of letters up against a word boundary, match as a set.
if the string is all uppercase...
lower-case the string
upper-case the first letter
do the following beginning of string replacements
Vanb -> VanB
Vanh -> VanH
Mc? -> Mc? (uppercase wildcard character)
Mac[^kh] -> Mac? (uppercase wildcard match)
With the replaced whole-name string do matching against other replacement sets like...
"De La " -> "de la "
That should catch most cases for names in particular... but a nice database of common name casing would be very nice.
Here was my solution. This hard-codes the names into the program but with a little work you could keep a text file outside of the program and read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.
public static String toProperName(String name)
{
if (name != null)
{
if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc") // Changes mcdonald to "McDonald"
return "Mc" + Regex.Replace(name.ToLower().Substring(2), #"\b[a-z]", m => m.Value.ToUpper());
if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van") // Changes vanwinkle to "VanWinkle"
return "Van" + Regex.Replace(name.ToLower().Substring(3), #"\b[a-z]", m => m.Value.ToUpper());
return Regex.Replace(name.ToLower(), #"\b[a-z]", m => m.Value.ToUpper()); // Changes to title case but also fixes
// appostrophes like O'HARE or o'hare to O'Hare
}
return "";
}
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
string txt = textInfo.ToTitleCase("texthere");