Escape invalid XML characters in C#

I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:
void Main() {
string content = "\v\f\0";
Console.WriteLine(IsValidXmlString(content)); // False
content = RemoveInvalidXmlChars(content);
Console.WriteLine(IsValidXmlString(content)); // True
}
static string RemoveInvalidXmlChars(string text) {
var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
return new string(validXmlChars);
}
static bool IsValidXmlString(string text) {
try {
XmlConvert.VerifyXmlChars(text);
return true;
} catch {
return false;
}
}
To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method, which replaces each character that is not valid in an XML name with an escape of the form _xHHHH_. Here is a small sample:
void Main() {
const string content = "\v\f\0";
Console.WriteLine(IsValidXmlString(content)); // False
string encoded = XmlConvert.EncodeName(content);
Console.WriteLine(IsValidXmlString(encoded)); // True
string decoded = XmlConvert.DecodeName(encoded);
Console.WriteLine(content == decoded); // True
}
static bool IsValidXmlString(string text) {
try {
XmlConvert.VerifyXmlChars(text);
return true;
} catch {
return false;
}
}
Update:
Note that the encoded string is always at least as long as the source string. This matters when you store the encoded string in a database column with a length limit but validate only the length of the source string in your application.
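To make the length growth concrete, here is a minimal sketch. The class name EncodeNameLengthDemo and the helper Encode are illustrative names only, not part of the answer above; the sketch simply compares the lengths before and after XmlConvert.EncodeName.

```csharp
using System;
using System.Xml;

public static class EncodeNameLengthDemo
{
    // Hypothetical helper: returns the encoded form so the caller can
    // check its length against a column limit before storing it.
    public static string Encode(string source)
    {
        return XmlConvert.EncodeName(source);
    }

    public static void Main()
    {
        string source = "\v\f\0"; // 3 characters, none of them valid in an XML name
        string encoded = Encode(source);

        Console.WriteLine(source.Length);   // 3
        Console.WriteLine(encoded.Length);  // larger, because each character becomes an _xHHHH_ escape
        Console.WriteLine(XmlConvert.DecodeName(encoded) == source);
    }
}
```

If the target column has a fixed size, validating encoded.Length rather than source.Length avoids truncation surprises.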

Use SecurityElement.Escape
using System;
using System.Security;
class Sample {
static void Main() {
string text = "Escape characters : < > & \" \'";
string xmlText = SecurityElement.Escape(text);
//output:
//Escape characters : &lt; &gt; &amp; &quot; &apos;
Console.WriteLine(xmlText);
}
}

If you are writing XML, just use the classes provided by the framework to create it. You won't have to bother with escaping at all.
Console.Write(new XElement("Data", "< > &"));
This outputs:
<Data>&lt; &gt; &amp;</Data>
If you need to read a malformed XML file, do not use regular expressions. Instead, use the Html Agility Pack.
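As a small illustration of the XElement approach from this answer, the sketch below writes a value containing markup characters and reads it back. The class and method names (XElementEscapeDemo, Serialize, Deserialize) are made up for the example.

```csharp
using System;
using System.Xml.Linq;

public static class XElementEscapeDemo
{
    // XElement escapes markup characters in the value when it is serialized.
    public static string Serialize(string value)
    {
        return new XElement("Data", value).ToString();
    }

    // Parsing the serialized text yields the original, unescaped value.
    public static string Deserialize(string xml)
    {
        return XElement.Parse(xml).Value;
    }

    public static void Main()
    {
        string xml = Serialize("< > &");
        Console.WriteLine(xml);              // <Data>&lt; &gt; &amp;</Data>
        Console.WriteLine(Deserialize(xml)); // < > &
    }
}
```

This only covers characters that XML can represent through escaping; it does not make characters such as \0 legal.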

The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:
static void Main()
{
const string content = "\v\U00010330";
string newContent = RemoveInvalidXmlChars(content);
Console.WriteLine(newContent);
}
This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.
To support surrogate characters, I suggest using the following method:
public static string RemoveInvalidXmlChars(string text)
{
if (string.IsNullOrEmpty(text))
return text;
int length = text.Length;
StringBuilder stringBuilder = new StringBuilder(length);
for (int i = 0; i < length; ++i)
{
if (XmlConvert.IsXmlChar(text[i]))
{
stringBuilder.Append(text[i]);
}
else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
{
stringBuilder.Append(text[i]);
stringBuilder.Append(text[i + 1]);
++i;
}
}
return stringBuilder.ToString();
}

Here is an optimized version of the RemoveInvalidXmlChars method above. It avoids creating a new array on every call, which would stress the GC unnecessarily, and it allocates only when an invalid character is actually found:
public static string RemoveInvalidXmlChars(string text)
{
if (text == null)
return text;
if (text.Length == 0)
return text;
// a bit complicated, but avoids memory usage if not necessary
StringBuilder result = null;
for (int i = 0; i < text.Length; i++)
{
var ch = text[i];
if (XmlConvert.IsXmlChar(ch))
{
result?.Append(ch);
}
else if (result == null)
{
result = new StringBuilder();
result.Append(text.Substring(0, i));
}
}
if (result == null)
return text; // no invalid xml chars detected - return original text
else
return result.ToString();
}

// Replace invalid characters with empty strings.
Regex.Replace(inputString, @"[^\w\.@-]", "");
The regular expression pattern [^\w\.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced by String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add those characters to the character class in the regular expression pattern. For example, the regular expression pattern [^\w\.@-\\%] also allows a percentage symbol and a backslash in an input string.
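A minimal sketch of that whitelist pattern in use follows; the class name InputCleanerDemo and the method Clean are illustrative. Note that this filters general user input rather than checking XML validity, and that it also strips spaces, since a space is not a word character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputCleanerDemo
{
    // Keeps word characters, periods, @ symbols and hyphens; drops everything else.
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^\w\.@-]", "");
    }

    public static void Main()
    {
        Console.WriteLine(Clean("john.doe@example.com<script>")); // john.doe@example.comscript
        Console.WriteLine(Clean("a b"));                          // ab
    }
}
```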
Regex.Replace(inputString, @"[!@#$%_]", "");
See also:
Removing Invalid Characters from XML Name Tag - RegEx C#
Here is a function to remove the characters from a specified XML string:
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
namespace XMLUtils
{
class Standards
{
/// <summary>
/// Strips non-printable ascii characters
/// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
/// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
/// </summary>
/// <param name="tmpContents">contents</param>
/// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
/// <returns>The contents with all matched sequences removed</returns>
private string StripIllegalXMLChars(string tmpContents, string XMLVersion)
{
string pattern = String.Empty;
switch (XMLVersion)
{
case "1.0":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
break;
case "1.1":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
break;
default:
throw new Exception("Error: Invalid XML Version!");
}
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
// Remove every match and return the cleaned contents.
return regex.Replace(tmpContents, String.Empty);
}
}
}

If you are only escaping invalid XML characters for a string that is used inside of an XML tag you could do something simple like this.
This works when you aren't using an XML library.
public string EscapeXMLCharacters (string target)
{
return
target
.Replace("&", "&amp;")
.Replace("<", "&lt;")
.Replace(">", "&gt;")
.Replace("\"", "&quot;")
.Replace("'", "&apos;");
}
You could then call it like so:
public string GetXMLBody(string content)
{
return @"<input>" + EscapeXMLCharacters(content) + "</input>";
}

string XMLWriteStringWithoutIllegalCharacters(string UnfilteredString)
{
if (UnfilteredString == null)
return string.Empty;
return XmlConvert.EncodeName(UnfilteredString);
}
string XMLReadStringWithoutIllegalCharacters(string FilteredString)
{
if (FilteredString == null)
return string.Empty;
return XmlConvert.DecodeName(FilteredString);
}
These simple methods replace each invalid character with an escaped form that is accepted in an XML context, and convert it back again when reading.
To write a string, use XMLWriteStringWithoutIllegalCharacters(string UnfilteredString).
To read a string, use XMLReadStringWithoutIllegalCharacters(string FilteredString).

Related

Regex for string validation (with char tags)

I need a regex to validate a string.
string test = @"C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc";
I wrote a regex to get all fields between "<:" and ":>".
Regex.Matches(fileNameConfig, @"\<(.+?)\>")
.Cast<Match>()
.Select(m => m.Groups[0].Value).ToList();
Now I need to check whether there are any opening tags without closing tags, and whether there are any nested tags.
string test = @"C:\Dic\<:<:Id:>.<:Dic:>testtest<:Location:>.Test.doc";
string test = @"<:C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc:>";
The nesting can be tested by counting the opening and closing brackets.
At any position in the string, the number of opening brackets before this position must be greater than or equal to the number of closing brackets.
At the end of the string, the number of opening brackets must equal the number of closing brackets exactly.
public static bool IsBracketNestingValid(string input) {
if (string.IsNullOrWhiteSpace(input)) {
return true; // empty string is always nested correctly
}
const string openingBracket = "<:";
const string closingBracket = ":>";
if (input.Length < openingBracket.Length) {
// perform this check if we expect that input strings
// must contain at least one bracket (e.g. "<" should be invalid)
return false;
}
int openingCount = 0;
int closingCount = 0;
for (int pos = 0; pos < input.Length-1; pos++) {
string currentToken = string.Format("{0}{1}", input[pos], input[pos+1]);
if (currentToken == openingBracket) {
openingCount++;
// skip over this recognized token
// (so we do not count any ':' twice, e.g. "<:>" should be invalid)
pos++;
}
if (currentToken == closingBracket) {
closingCount++;
pos++; // skip over this recognized token
}
if (closingCount > openingCount) {
return false; // found closing bracket before opening bracket
}
}
return openingCount == closingCount;
}
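The counting check above accepts balanced but nested input such as "<:<:Id:>:>". Since the question also asks to detect nested tags, here is a hedged sketch of a stricter variant that tracks whether a tag is currently open and rejects a second "<:" before the matching ":>". The names TagNestingDemo and AreTagsFlat are hypothetical.

```csharp
using System;

public static class TagNestingDemo
{
    // Returns true when every "<:" is closed by a ":>" before the next "<:" opens,
    // i.e. the tags are balanced and never nested.
    public static bool AreTagsFlat(string input)
    {
        if (string.IsNullOrEmpty(input)) return true;

        bool insideTag = false;
        for (int pos = 0; pos < input.Length - 1; pos++)
        {
            if (input[pos] == '<' && input[pos + 1] == ':')
            {
                if (insideTag) return false; // nested opening tag
                insideTag = true;
                pos++;                       // skip the ':' so "<:>" is not read as a close
            }
            else if (input[pos] == ':' && input[pos + 1] == '>')
            {
                if (!insideTag) return false; // closing tag without an opening tag
                insideTag = false;
                pos++;
            }
        }
        return !insideTag; // an unclosed tag is invalid
    }

    public static void Main()
    {
        Console.WriteLine(AreTagsFlat(@"C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc"));     // True
        Console.WriteLine(AreTagsFlat(@"C:\Dic\<:<:Id:>.<:Dic:>testtest<:Location:>.Test.doc"));  // False
        Console.WriteLine(AreTagsFlat(@"<:C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc:>")); // False
    }
}
```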

How to unescape a sequence include \u and \U?

I have some strings in a .resx file that include sequences like this:
\u26A0 warning
So I use the following code to unescape them:
str = Regex.Unescape(str);
This works well for \u sequences and shows the related emoji.
But the Regex.Unescape(...) method does not work when the input string includes \U, like this:
\U0001F4D8 book
It returns this error:
Error: Unrecognized escape sequence \U
My question:
Is there another method in the .NET Framework to unescape sequences that include \u and \U?
If there is no built-in method, how can I write a helper method to do it manually?
Edit:
When I read the string from the resx file it has a double backslash, and I need to convert these Unicode sequences to their characters.
Indeed, according to the source code of Regex.Unescape (RegexParser.ScanCharEscape), \U is not handled.
Instead, you could consider a manual conversion with the help of char.ConvertFromUtf32:
string converted = char.ConvertFromUtf32(int.Parse("0001F4D8", NumberStyles.HexNumber));
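To show what that conversion produces, here is a short sketch. The names CodePointDemo and FromHex are illustrative. A code point above U+FFFF comes back as a two-char surrogate pair, which is why a single \uXXXX escape cannot express it.

```csharp
using System;
using System.Globalization;

public static class CodePointDemo
{
    // Converts the hexadecimal digits that follow \U (or \u) to the corresponding string.
    public static string FromHex(string hexDigits)
    {
        int codePoint = int.Parse(hexDigits, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return char.ConvertFromUtf32(codePoint);
    }

    public static void Main()
    {
        string book = FromHex("0001F4D8");
        Console.WriteLine(book.Length);                   // 2: a high and a low surrogate
        Console.WriteLine(char.IsSurrogatePair(book, 0)); // True
        Console.WriteLine(FromHex("000026A0").Length);    // 1: U+26A0 fits in a single char
    }
}
```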
The following is a draft implementation of such a helper. (The annoying complexity comes from an attempt to distinguish \U and \\U.)
static string Unescape(string str)
{
StringBuilder builder = new StringBuilder();
int startIndex = 0;
while(true)
{
int index = IndexOfBackslashU(str, startIndex);
if (index == -1)
return builder.Append(Regex.Unescape(str.Substring(startIndex))).ToString();
builder.Append(Regex.Unescape(str.Substring(startIndex, index - startIndex)));
string number = str.Substring(index + 2, 8);
builder.Append(char.ConvertFromUtf32(int.Parse(number, NumberStyles.HexNumber)));
startIndex = index + 10;
}
}
static int IndexOfBackslashU(string str, int startIndex)
{
while (true)
{
int index = str.IndexOf(@"\U", startIndex);
if (index == -1)
return index;
bool evenNumberOfPreviousBackslashes = true;
for (int k = index-1; k >= 0 && str[k] == '\\'; k--)
evenNumberOfPreviousBackslashes = !evenNumberOfPreviousBackslashes;
if (evenNumberOfPreviousBackslashes)
return index;
startIndex = index + 2;
}
}
I wrote this method and it solved the problem:
public static string UnescapeIt(string str)
{
var regex = new Regex(@"(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})", RegexOptions.Compiled);
return regex.Replace(str,
m =>
{
if (m.Value.IndexOf("\\U", StringComparison.Ordinal) > -1)
return char.ConvertFromUtf32(int.Parse(m.Value.Replace("\\U", ""), NumberStyles.HexNumber));
return Regex.Unescape(m.Value);
});
}
It unescapes \u sequences and converts \U sequences to the related characters, so the emojis are displayed.
Use:
str = UnescapeIt(str);
Update:
I changed the regex from
\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}
to
(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})
Now the match fails if there is a backslash before \u or \U.

Remove all non-ASCII characters from string

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.
I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?
Here is a simple solution:
public static bool IsASCII(this string value)
{
// ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
return Encoding.UTF8.GetByteCount(value) == value.Length;
}
source: http://snipplr.com/view/35806/
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Do it all at once:
public string ReturnCleanASCII(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
foreach(char c in s)
{
if((int)c > 127) // you probably don't want 127 either
continue;
if((int)c < 32) // I bet you don't want control characters
continue;
if(c == ',')
continue;
if(c == '"')
continue;
sb.Append(c);
}
return sb.ToString();
}
If you wanted to test a specific character, you could use
if ((int)myChar <= 127)
Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.
Here's an improvement upon the accepted answer:
string fallbackStr = "";
Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
new EncoderReplacementFallback(fallbackStr),
new DecoderReplacementFallback(fallbackStr));
string cleanStr = enc.GetString(enc.GetBytes(inputStr));
This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)
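The difference between the plain ASCII round trip and the fallback-based version can be seen in a small sketch. The class and method names (AsciiFallbackDemo, ReplaceNonAscii, DropNonAscii) are illustrative only.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // ASCII encoding whose fallbacks emit nothing, so unencodable characters are dropped.
    private static readonly Encoding DroppingAscii = Encoding.GetEncoding(
        Encoding.ASCII.CodePage,
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    // The default ASCII encoding substitutes '?' for each character it cannot encode.
    public static string ReplaceNonAscii(string input)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(input));
    }

    public static string DropNonAscii(string input)
    {
        return DroppingAscii.GetString(DroppingAscii.GetBytes(input));
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceNonAscii("caf\u00E9")); // caf?
        Console.WriteLine(DropNonAscii("caf\u00E9"));    // caf
    }
}
```

Keeping the encoding in a static field reflects the note above that enc can be defined outside the scope of a function.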
It seems kind of strange that dropping the non-ASCII characters is acceptable.
Also, I always recommend the excellent FileHelpers library for parsing CSV files.
strText = Regex.Replace(strText, @"[^\u0020-\u007E]", string.Empty);
public string RunCharacterCheckASCII(string s)
{
string str = s;
bool is_find = false;
char ch;
int ich = 0;
try
{
char[] schar = str.ToCharArray();
for (int i = 0; i < schar.Length; i++)
{
ch = schar[i];
ich = (int)ch;
if (ich > 127) // not ascii or extended ascii
{
is_find = true;
schar[i] = '?';
}
}
if (is_find)
str = new string(schar);
}
catch (Exception ex)
{
}
return str;
}

Best way to convert Pascal Case to a sentence

What is the best way to convert from Pascal Case (upper Camel Case) to a sentence?
For example starting with
"AwaitingFeedback"
and converting that to
"Awaiting feedback"
C# preferable but I could convert it from Java or similar.
public static string ToSentenceCase(this string str)
{
return Regex.Replace(str, "[a-z][A-Z]", m => m.Value[0] + " " + char.ToLower(m.Value[1]));
}
In Visual Studio 2015 and later (C# 6), you can use string interpolation:
public static string ToSentenceCase(this string str)
{
return Regex.Replace(str, "[a-z][A-Z]", m => $"{m.Value[0]} {char.ToLower(m.Value[1])}");
}
Based on: Converting Pascal case to sentences using regular expression
I prefer to use Humanizer for this. Humanizer is a Portable Class Library that meets all your .NET needs for manipulating and displaying strings, enums, dates, times, timespans, numbers and quantities.
Short Answer
"AwaitingFeedback".Humanize() => Awaiting feedback
Long and Descriptive Answer
Humanizer can do a lot more. Other examples are:
"PascalCaseInputStringIsTurnedIntoSentence".Humanize() => "Pascal case input string is turned into sentence"
"Underscored_input_string_is_turned_into_sentence".Humanize() => "Underscored input string is turned into sentence"
"Can_return_title_Case".Humanize(LetterCasing.Title) => "Can Return Title Case"
"CanReturnLowerCase".Humanize(LetterCasing.LowerCase) => "can return lower case"
Complete code is :
using Humanizer;
using static System.Console;
namespace HumanizerConsoleApp
{
class Program
{
static void Main(string[] args)
{
WriteLine("AwaitingFeedback".Humanize());
WriteLine("PascalCaseInputStringIsTurnedIntoSentence".Humanize());
WriteLine("Underscored_input_string_is_turned_into_sentence".Humanize());
WriteLine("Can_return_title_Case".Humanize(LetterCasing.Title));
WriteLine("CanReturnLowerCase".Humanize(LetterCasing.LowerCase));
}
}
}
Output
Awaiting feedback
Pascal case input string is turned into sentence
Underscored input string is turned into sentence
Can Return Title Case
can return lower case
If you prefer to write your own C# code, you can do so along the lines of the other answers.
Here you go...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace CamelCaseToString
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(CamelCaseToString("ThisIsYourMasterCallingYou"));
}
private static string CamelCaseToString(string str)
{
if (str == null || str.Length == 0)
return null;
StringBuilder retVal = new StringBuilder(32);
retVal.Append(char.ToUpper(str[0]));
for (int i = 1; i < str.Length; i++ )
{
if (char.IsLower(str[i]))
{
retVal.Append(str[i]);
}
else
{
retVal.Append(" ");
retVal.Append(char.ToLower(str[i]));
}
}
return retVal.ToString();
}
}
}
This works for me:
Regex.Replace(strIn, "([A-Z]{1,2}|[0-9]+)", " $1").TrimStart()
This is just like @SSTA's answer, but is more efficient than calling TrimStart.
Regex.Replace("ThisIsMyCapsDelimitedString", "(\\B[A-Z])", " $1")
Found this in the MvcContrib source, doesn't seem to be mentioned here yet.
return Regex.Replace(input, "([A-Z])", " $1", RegexOptions.Compiled).Trim();
Just because everyone has been using Regex (except this guy), here's an implementation with StringBuilder that was about 5x faster in my tests. Includes checking for numbers too.
"SomeBunchOfCamelCase2".FromCamelCaseToSentence() == "Some Bunch Of Camel Case 2"
public static string FromCamelCaseToSentence(this string input) {
if(string.IsNullOrEmpty(input)) return input;
var sb = new StringBuilder();
// start with the first character -- consistent camelcase and pascal case
sb.Append(char.ToUpper(input[0]));
// march through the rest of it
for(var i = 1; i < input.Length; i++) {
// any time we hit an uppercase OR number, it's a new word
if(char.IsUpper(input[i]) || char.IsDigit(input[i])) sb.Append(' ');
// add regularly
sb.Append(input[i]);
}
return sb.ToString();
}
Here's a basic way of doing it that I came up with using Regex
public static string CamelCaseToSentence(this string value)
{
var sb = new StringBuilder();
var firstWord = true;
foreach (var match in Regex.Matches(value, "([A-Z][a-z]+)|[0-9]+"))
{
if (firstWord)
{
sb.Append(match.ToString());
firstWord = false;
}
else
{
sb.Append(" ");
sb.Append(match.ToString().ToLower());
}
}
return sb.ToString();
}
It will also split off numbers which I didn't specify but would be useful.
string camel = "MyCamelCaseString";
string s = Regex.Replace(camel, "([A-Z])", " $1").ToLower().Trim();
Console.WriteLine(s.Substring(0,1).ToUpper() + s.Substring(1));
Edit: didn't notice your casing requirements, modified accordingly. You could use a MatchEvaluator to do the casing, but I think a substring is easier. You could also wrap it in a second regex replace where you change the first character
"^\w"
to upper
\U (I think)
I'd use a regex, inserting a space before each upper case character, then lowering all the string.
string spacedString = System.Text.RegularExpressions.Regex.Replace(yourString, @"\B([A-Z])", " $1");
spacedString = spacedString.ToLower();
It is easy to do in JavaScript (or PHP, etc.) where you can define a function in the replace call:
var camel = "AwaitingFeedbackDearMaster";
var sentence = camel.replace(/([A-Z].)/g, function (c) { return ' ' + c.toLowerCase(); });
alert(sentence);
Although I haven't solved the initial cap problem... :-)
Now, for the Java solution:
String ToSentence(String camel)
{
if (camel == null) return ""; // Or null...
String[] words = camel.split("(?=[A-Z])");
if (words == null) return "";
if (words.length == 1) return words[0];
StringBuilder sentence = new StringBuilder(camel.length());
if (words[0].length() > 0) // Just in case of camelCase instead of CamelCase
{
sentence.append(words[0] + " " + words[1].toLowerCase());
}
else
{
sentence.append(words[1]);
}
for (int i = 2; i < words.length; i++)
{
sentence.append(" " + words[i].toLowerCase());
}
return sentence.toString();
}
System.out.println(ToSentence("AwaitingAFeedbackDearMaster"));
System.out.println(ToSentence(null));
System.out.println(ToSentence(""));
System.out.println(ToSentence("A"));
System.out.println(ToSentence("Aaagh!"));
System.out.println(ToSentence("stackoverflow"));
System.out.println(ToSentence("disableGPS"));
System.out.println(ToSentence("Ahh89Boo"));
System.out.println(ToSentence("ABC"));
Note the trick to split the sentence without losing any character...
Pseudo-code:
NewString = "";
Loop through every char of the string (skip the first one)
If char is upper-case ('A'-'Z')
NewString = NewString + ' ' + lowercase(char)
Else
NewString = NewString + char
Better approaches might use a regex or string replacement routines (replace 'X' with ' x').
An xquery solution that works for both UpperCamel and lowerCamel case:
To output sentence case (only the first character of the first word is capitalized):
declare function content:sentenceCase($string)
{
let $firstCharacter := substring($string, 1, 1)
let $remainingCharacters := substring-after($string, $firstCharacter)
return
concat(upper-case($firstCharacter),lower-case(replace($remainingCharacters, '([A-Z])', ' $1')))
};
To output title case (first character of each word capitalized):
declare function content:titleCase($string)
{
let $firstCharacter := substring($string, 1, 1)
let $remainingCharacters := substring-after($string, $firstCharacter)
return
concat(upper-case($firstCharacter),replace($remainingCharacters, '([A-Z])', ' $1'))
};
Found myself doing something similar, and I appreciate having a point-of-departure with this discussion. This is my solution, placed as an extension method to the string class in the context of a console application.
using System;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string piratese = "avastTharMatey";
string ivyese = "CheerioPipPip";
Console.WriteLine("{0}\n{1}\n", piratese.CamelCaseToString(), ivyese.CamelCaseToString());
Console.WriteLine("For Pete\'s sake, man, hit ENTER!");
string strExit = Console.ReadLine();
}
}
public static class StringExtension
{
public static string CamelCaseToString(this string str)
{
StringBuilder retVal = new StringBuilder(32);
if (!string.IsNullOrEmpty(str))
{
string strTrimmed = str.Trim();
if (!string.IsNullOrEmpty(strTrimmed))
{
retVal.Append(char.ToUpper(strTrimmed[0]));
if (strTrimmed.Length > 1)
{
for (int i = 1; i < strTrimmed.Length; i++)
{
if (char.IsUpper(strTrimmed[i])) retVal.Append(" ");
retVal.Append(char.ToLower(strTrimmed[i]));
}
}
}
}
return retVal.ToString();
}
}
}
Most of the preceding answers split acronyms and numbers, adding a space in front of each character. I wanted acronyms and numbers to be kept together so I have a simple state machine that emits a space every time the input transitions from one state to the other.
/// <summary>
/// Add a space before any capitalized letter (but not for a run of capitals or numbers)
/// </summary>
internal static string FromCamelCaseToSentence(string input)
{
if (string.IsNullOrEmpty(input)) return String.Empty;
var sb = new StringBuilder();
bool upper = true;
for (var i = 0; i < input.Length; i++)
{
bool isUpperOrDigit = char.IsUpper(input[i]) || char.IsDigit(input[i]);
// any time we transition to upper or digits, it's a new word
if (!upper && isUpperOrDigit)
{
sb.Append(' ');
}
sb.Append(input[i]);
upper = isUpperOrDigit;
}
return sb.ToString();
}
And here are some tests:
[TestCase(null, ExpectedResult = "")]
[TestCase("", ExpectedResult = "")]
[TestCase("ABC", ExpectedResult = "ABC")]
[TestCase("abc", ExpectedResult = "abc")]
[TestCase("camelCase", ExpectedResult = "camel Case")]
[TestCase("PascalCase", ExpectedResult = "Pascal Case")]
[TestCase("Pascal123", ExpectedResult = "Pascal 123")]
[TestCase("CustomerID", ExpectedResult = "Customer ID")]
[TestCase("CustomABC123", ExpectedResult = "Custom ABC123")]
public string CanSplitCamelCase(string input)
{
return FromCamelCaseToSentence(input);
}
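For comparison, acronyms can also be kept together with a lookaround-based regex instead of a state machine. This is a sketch of a different technique, not the answer's method, and the names AcronymSplitDemo and SplitWords are made up. Unlike the state machine, it leaves digits attached to the preceding letters, so "Pascal123" stays unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymSplitDemo
{
    // Zero-width positions where a space is inserted:
    //   after a lower-case letter or digit and before an upper-case letter ("r|I" in CustomerID)
    //   after an upper-case letter and before upper + lower, the end of an acronym ("P|Se" in HTTPServer)
    private static readonly Regex WordBoundary =
        new Regex("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    public static string SplitWords(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        return WordBoundary.Replace(input, " ");
    }

    public static void Main()
    {
        Console.WriteLine(SplitWords("CustomerID"));      // Customer ID
        Console.WriteLine(SplitWords("HTTPServerError")); // HTTP Server Error
        Console.WriteLine(SplitWords("PascalCase"));      // Pascal Case
    }
}
```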
Mostly already answered here
A small change to the accepted answer converts the second and subsequent capitalised letters to lower case. Change
if (char.IsUpper(text[i]))
newText.Append(' ');
newText.Append(text[i]);
to
if (char.IsUpper(text[i]))
{
newText.Append(' ');
newText.Append(char.ToLower(text[i]));
}
else
newText.Append(text[i]);
Here is my implementation. This is the fastest that I got while avoiding creating spaces for abbreviations.
public static string PascalCaseToSentence(string input)
{
if (string.IsNullOrEmpty(input) || input.Length < 2)
return input;
var sb = new char[input.Length + ((input.Length + 1) / 2)];
var len = 0;
var lastIsLower = false;
for (int i = 0; i < input.Length; i++)
{
var current = input[i];
if (current < 97)
{
if (lastIsLower)
{
sb[len] = ' ';
len++;
}
lastIsLower = false;
}
else
{
lastIsLower = true;
}
sb[len] = current;
len++;
}
return new string(sb, 0, len);
}
