escaping tricky string to CSV format - c#

I have to create a CSV file from webservice output and the CSV file uses quoted strings with comma separator. I cannot change the format...
So if I have a string it becomes a "string"...
If the value has quotes already they are replaced with double quotes.
For example a str"ing becomes "str""ing"...
However, lately my import has been failing because of the following
The original input string is: "","word1,word2,..."
Every quote character is doubled, resulting in: """",""word1,word2,...""
Then it is prefixed and suffixed with a quote before being written to the CSV file: """"",""word1,word2,..."""
As you can see the final result is this:
""""",""word1,word2,..."""
which breaks my import (it sees it as another field)...
I think the issue is the appearance of "," in the original input string.
Is there a CSV escape sequence for this scenario?
Update
The reason the above breaks is the BCP mapping file (the BCP utility is used to load the CSV file into the SQL database), which defines the field terminator as ",". So instead of seeing 1 field it sees 2... But I cannot change the mapping file...

I use this code and it has always worked:
/// <summary>
/// Turn a string into a CSV cell output
/// </summary>
/// <param name="str">String to output</param>
/// <returns>The CSV cell formatted string</returns>
public static string StringToCSVCell(string str)
{
bool mustQuote = (str.Contains(",") || str.Contains("\"") || str.Contains("\r") || str.Contains("\n"));
if (mustQuote)
{
StringBuilder sb = new StringBuilder();
sb.Append("\"");
foreach (char nextChar in str)
{
sb.Append(nextChar);
if (nextChar == '"')
sb.Append("\"");
}
sb.Append("\"");
return sb.ToString();
}
return str;
}

Based on Ed Bayiates' answer:
/// <summary>
/// Turn a string into a CSV cell output
/// </summary>
/// <param name="value">String to output</param>
/// <returns>The CSV cell formatted string</returns>
private string ConvertToCsvCell(string value)
{
var mustQuote = value.Any(x => x == ',' || x == '\"' || x == '\r' || x == '\n');
if (!mustQuote)
{
return value;
}
value = value.Replace("\"", "\"\"");
return string.Format("\"{0}\"", value);
}
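To check the quoting rule from the answers above against the input from the question, here is a minimal sketch. The names CsvRoundTrip, Escape and ParseLine are hypothetical, and the reader handles a single record only (no multi-line fields). It suggests that a quote-aware reader sees the escaped value as one field, so a failure is more likely on the import side.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvRoundTrip
{
    // Quote the value if it contains a comma, quote, CR or LF; double embedded quotes.
    public static string Escape(string value)
    {
        if (value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return value;
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Minimal quote-aware reader for one CSV record (illustration only).
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote stands for one literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false; // closing quote
                else current.Append(c);              // commas inside quotes are data
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else current.Append(c);
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

A reader that splits on every "," sequence without tracking quotes, as the BCP terminator in the question does, would cut this same value in two.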

My penny thought:
String[] lines = new String[] { "\"\",\"word\",word,word2,1,34,5,2,\"details\"" };
for (int j = 0; j < lines.Length; j++)
{
String[] fields=lines[j].Split(',');
for (int i =0; i<fields.Length; i++)
{
if (fields[i].StartsWith("\"") && fields[i].EndsWith("\""))
{
char[] tmp = new char[fields[i].Length-2];
fields[i].CopyTo(1,tmp,0,fields[i].Length-2);
fields[i] = new string(tmp); // char[].ToString() would return "System.Char[]"
fields[i] = "\""+fields[i].Replace("\"","\"\"")+"\"";
}
else
fields[i] = fields[i].Replace("\"","\"\"");
}
lines[j]=String.Join(",",fields);
}

Based on the contribution of "Ed Bayiates", here's a helpful class to build a CSV document:
/// <summary>
/// helpful class to build csv document
/// </summary>
public class CsvBuilder
{
/// <summary>
/// create the csv builder
/// </summary>
public CsvBuilder(char csvSeparator)
{
m_csvSeparator = csvSeparator;
}
/// <summary>
/// append a cell
/// </summary>
public void appendCell(string strCellValue)
{
if (m_nCurrentColumnIndex > 0) m_strBuilder.Append(m_csvSeparator);
bool mustQuote = (strCellValue.Contains(m_csvSeparator)
|| strCellValue.Contains('\"')
|| strCellValue.Contains('\r')
|| strCellValue.Contains('\n'));
if (mustQuote)
{
m_strBuilder.Append('\"');
foreach (char nextChar in strCellValue)
{
m_strBuilder.Append(nextChar);
if (nextChar == '"') m_strBuilder.Append('\"');
}
m_strBuilder.Append('\"');
}
else
{
m_strBuilder.Append(strCellValue);
}
m_nCurrentColumnIndex++;
}
/// <summary>
/// end of line, new line
/// </summary>
public void appendNewLine()
{
m_strBuilder.Append(Environment.NewLine);
m_nCurrentColumnIndex = 0;
}
/// <summary>
/// Create the CSV file
/// </summary>
/// <param name="path"></param>
public void save(string path )
{
File.WriteAllText(path, ToString());
}
public override string ToString()
{
return m_strBuilder.ToString();
}
private StringBuilder m_strBuilder = new StringBuilder();
private char m_csvSeparator;
private int m_nCurrentColumnIndex = 0;
}
How to use it:
void exportAsCsv( string strFileName )
{
CsvBuilder csvStringBuilder = new CsvBuilder(';');
csvStringBuilder.appendCell("#Header col 1 : Name");
csvStringBuilder.appendCell("col 2 : Value");
csvStringBuilder.appendNewLine();
foreach (Data data in m_dataSet)
{
csvStringBuilder.appendCell(data.getName());
csvStringBuilder.appendCell(data.getValue());
csvStringBuilder.appendNewLine();
}
csvStringBuilder.save(strFileName);
}

The first step in parsing this is removing the extra " characters added around your string. Once you do this, you should be able to deal with the embedded " characters as well as the commas.

After much deliberation, it was decided that the import utility's format file needed to be fixed. The escaping of the string was correct (as users indicated), but the format file that the import utility used was incorrect and was breaking the import.
Thanks all and special thanks to @dbt (up vote)

Related

Remove html tags from MainBody

I have an issue where I try to remove all HTML tags from this line of EPiServer code:
@(Html.PropertyFor(m => m.MainBody))
because it is supposed to be inside an <a>example code here</a>.
What's a good way to solve this when running EPiServer?
First, it is bad practice to use XhtmlString this way; that being said, we don't always get to choose.
I'm using this, which is a modified version of Rob Volk's extension method.
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
public static class HtmlStringExtensions
{
/// <summary>
/// Truncates a string containing HTML to a number of text characters, keeping whole words.
/// The result contains HTML and any tags left open are closed.
/// by Rob Volk with modifications
/// http://robvolk.com/truncate-html-string-c-extension-method/
/// </summary>
/// <param name="html"></param>
/// <param name="maxCharacters"></param>
/// <param name="trailingText"></param>
/// <returns></returns>
public static string TruncateHtmlString(this string html, int maxCharacters, string trailingText)
{
if (string.IsNullOrEmpty(html))
return html;
// find the spot to truncate
// count the text characters and ignore tags
var textCount = 0;
var charCount = 0;
var ignore = false;
var newString = string.Empty;
foreach (char c in html)
{
newString += c;
charCount++;
if (c == '<')
{
ignore = true;
}
else if (!ignore)
{
textCount++;
}
if (c == '>')
{
ignore = false;
}
// stop once we hit the limit
if (textCount >= maxCharacters)
{
break;
}
}
// Truncate the html and keep whole words only
var trunc = new StringBuilder(newString);
//var trunc = new StringBuilder(html.TruncateWords(charCount));
// keep track of open tags and close any tags left open
var tags = new Stack<string>();
var matches = Regex.Matches(trunc.ToString(), // trunc.ToString()
@"<((?<tag>[^\s/>]+)|/(?<closeTag>[^\s>]+)).*?(?<selfClose>/)?\s*>",
RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);
foreach (Match match in matches)
{
if (match.Success)
{
var tag = match.Groups["tag"].Value;
var closeTag = match.Groups["closeTag"].Value;
// push to stack if open tag and ignore it if it is self-closing, i.e. <br />
if (!string.IsNullOrEmpty(tag) && string.IsNullOrEmpty(match.Groups["selfClose"].Value))
tags.Push(tag);
// pop from stack if close tag
else if (!string.IsNullOrEmpty(closeTag))
{
// pop the tag to close it.. find the matching opening tag
// ignore any unclosed tags
while (tags.Count > 0 && tags.Pop() != closeTag) // check Count first so Pop never runs on an empty stack
{ }
}
}
}
if (html.Length > charCount)
// add the trailing text
trunc.Append(trailingText);
// pop the rest off the stack to close remainder of tags
while (tags.Count > 0)
{
trunc.Append("</");
trunc.Append(tags.Pop());
trunc.Append('>');
}
return trunc.ToString();
}
/// <summary>
/// Truncates a string containing HTML to a number of text characters, keeping whole words.
/// The result contains HTML and any tags left open are closed.
/// </summary>
/// <param name="html"></param>
/// <param name="maxCharacters"></param>
/// <returns></returns>
public static string TruncateHtmlString(this string html, int maxCharacters)
{
return html.TruncateHtmlString(maxCharacters, null);
}
/// <summary>
/// Strips all HTML tags from a string
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
public static string StripHtml(this string html)
{
if (string.IsNullOrEmpty(html))
return html;
return Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
}
}
Implement it using ToHtmlString() from EPiServer.Core.
For example:
// @using EPiServer.Core
@(Html.PropertyFor(m => m.MainBody.ToHtmlString().TruncateHtmlString(160, "...")))
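If the goal is plain text for use inside a link, stripping tags alone leaves HTML entities such as &amp; in the output. The following is a minimal sketch, not EPiServer-specific, that strips tags and then decodes entities with WebUtility.HtmlDecode. The class and method names (PlainText, FromHtml) are made up for illustration, and a regex is only a rough tag remover, not a full HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PlainText
{
    // Removes anything that looks like a tag, then decodes entities such as &amp;.
    public static string FromHtml(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Because the result is decoded text, it should be HTML-encoded again when it is written into a view, which Razor does by default for plain strings.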
Why don't you use string backed by TextArea?
[UIHint(UIHint.Textarea)]
[Display(Name = "Main Body")]
public virtual string MainBody { get; set; }
What you are trying to do with XhtmlString is not a best practice, and it could have many negative effects on your rendering.

HttpStatusCode to readable string

HttpWebResponse response = (HttpWebResponse)await request.GetResponseAsync();
HttpStatusCode statusCode = response.StatusCode;
In this code, statusCode.ToString() returns, for example, "BadRequest", but I need "Bad Request".
I saw articles about response.ReasonPhrase, but that's not what I need, and it is not supported by HttpWebResponse; it is only supported by HttpResponseMessage from HttpClient.
Another example, which the Regex.Replace solution does not handle:
(414) RequestUriTooLong -> Request-Uri Too Long
Based on reference source, you can retrieve the English status description with a simple call into a static class, given the status code:
int code = 400;
/* will assign "Bad Request" to text */
var text = System.Web.HttpWorkerRequest.GetStatusDescription(code);
Texts are defined for ranges 100 - 507, returning empty strings for special codes like 418 and 506.
HttpStatusCode is an enum with Pascal-cased member names.
You can use this one-liner to insert a space before each inner capital letter:
return Regex.Replace(statusCode.ToString(), "(?<=[a-z])([A-Z])", " $1", RegexOptions.Compiled);
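As a sketch of how the one-liner could be combined with the hyphenated case from the question: the regex handles the common names, and a small override table covers names it cannot reconstruct. The StatusCodeText class and its Overrides table are hypothetical, and the table would need to be filled in by hand for any other special cases.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class StatusCodeText
{
    // Hypothetical overrides for names the regex cannot reconstruct (hyphens, acronyms).
    private static readonly Dictionary<HttpStatusCode, string> Overrides =
        new Dictionary<HttpStatusCode, string>
        {
            { HttpStatusCode.RequestUriTooLong, "Request-Uri Too Long" }
        };

    public static string ToReadable(HttpStatusCode code)
    {
        string text;
        if (Overrides.TryGetValue(code, out text)) return text;
        // Insert a space before every capital letter that follows a lowercase letter.
        return Regex.Replace(code.ToString(), "(?<=[a-z])([A-Z])", " $1");
    }
}
```

With this, StatusCodeText.ToReadable(HttpStatusCode.BadRequest) gives "Bad Request", and without the override 414 would come out as "Request Uri Too Long".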
Here, I pulled this out of a string utility class I have. Might be overkill, but it's a useful thing. Use the ToTitleCase extension method.
/// <summary>
/// A dictionary that holds the collection of previous title case
/// conversions so they don't have to be done again if needed more than once.
/// </summary>
private static Dictionary<string, string> _prevTitleCaseConversions = new Dictionary<String, String>();
/// <summary>
/// A collection of English words that should be lower-case in title-cased phrases.
/// </summary>
private static List<string> _englishTitleCaseLowerCaseWords = new List<string>() {"aboard", "about", "above", "across", "after",
"against", "along", "amid", "among", "anti", "around", "is", "as", "at", "before", "behind", "below",
"beneath", "beside", "besides", "between", "beyond", "but", "by", "concerning", "considering",
"despite", "down", "during", "except", "excepting", "excluding", "following", "for", "from", "in",
"inside", "into", "like", "minus", "near", "of", "off", "on", "onto", "opposite", "outside", "over",
"past", "per", "plus", "regarding", "round", "save", "since", "than", "through", "to", "toward",
"towards", "under", "underneath", "unlike", "until", "up", "upon", "versus", "via", "with", "within", "without",
"and", "but", "or", "nor", "for", "yet", "so", "although", "because", "since", "unless", "the", "a", "an"};
/// <summary>
/// Convert the provided alpha-numeric string to title case. The string may contain spaces in addition to letters and numbers, or it can be
/// one individual lowercase, uppercase, or camel case token.
/// </summary>
/// <param name="forValue">The input string which will be converted. The string can be a
/// normal string with spaces or a single token in all lowercase, all uppercase, or camel case.</param>
/// <returns>A version of the input string which has had spaces inserted between each internal "word" that is
/// delimited by an uppercase character and which has otherwise been converted to title case, i.e. all
/// words except for conjunctions, prepositions, and articles are upper case.</returns>
public static string ToTitleCase(this string forValue)
{
if (string.IsNullOrEmpty(forValue)) return forValue;
if (!Regex.IsMatch(forValue, "^[A-Za-z0-9 ]+$"))
throw new ArgumentException($@"""{forValue}"" is not a valid alpha-numeric token for this method.");
if (_prevTitleCaseConversions.ContainsKey(forValue)) return _prevTitleCaseConversions[forValue];
var tokenizedChars = GetTokenizedCharacterArray(forValue);
StringBuilder wordsSB = GetTitleCasedTokens(tokenizedChars);
string ret = wordsSB.ToString();
_prevTitleCaseConversions.Add(forValue, ret);
return ret;
}
/// <summary>
/// Convert the provided string such that first character is
/// uppercase and the remaining characters are lowercase.
/// </summary>
/// <param name="forInput">The string which will have
/// its first character converted to uppercase and
/// subsequent characters converted to lowercase.</param>
/// <returns>The provided string with its first character
/// converted to uppercase and subsequent characters converted to lowercase.</returns>
private static string FirstUpper(this string forInput)
{
return Alter(forInput, new Func<string, string>((input) => input.ToUpperInvariant()), new Func<string, string>((input) => input.ToLowerInvariant()));
}
/// <summary>
/// Return an array of characters built from the provided string with
/// spaces in between each word (token).
/// </summary>
private static ReadOnlyCollection<char> GetTokenizedCharacterArray(string fromInput)
{
var ret = new List<char>();
var tokenChars = fromInput.ToCharArray();
bool isPrevCharUpper = false;
bool isPrevPrevCharUpper = false;
bool isPrevPrevPrevCharUpper = false;
bool isNextCharUpper = false;
bool isNextCharSpace = false;
for (int i = 0; i < tokenChars.Length; i++)
{
char letter = tokenChars[i];
bool addSpace;
bool isCharUpper = char.IsUpper(letter);
if (i == 0) addSpace = false; // no space before first char.
else
{
bool isAtLastChar = i == tokenChars.Length - 1;
isNextCharUpper = !isAtLastChar && char.IsUpper(tokenChars[i + 1]);
isNextCharSpace = !isAtLastChar && !isNextCharUpper && tokenChars[i + 1].Equals(' ');
bool isInAcronym = (isCharUpper && isPrevCharUpper && (isAtLastChar || isNextCharSpace || isNextCharUpper));
addSpace = isCharUpper && !isInAcronym;
}
if (addSpace) ret.Add(' ');
ret.Add(letter);
isPrevPrevPrevCharUpper = isPrevPrevCharUpper;
isPrevPrevCharUpper = isPrevCharUpper;
isPrevCharUpper = isCharUpper;
}
return ret.AsReadOnly();
}
/// <summary>
/// Return a string builder that will produce a string which contains
/// all the tokens (words separated by spaces) in the provided collection
/// of characters and where the string conforms to title casing rules as defined above.
/// </summary>
private static StringBuilder GetTitleCasedTokens(IEnumerable<char> fromTokenChars)
{
StringBuilder wordsSB = new StringBuilder();
var comparer = StringComparer.Create(System.Globalization.CultureInfo.CurrentCulture, true);
var words = new string(fromTokenChars.ToArray()).Split(' ');
bool isFirstWord = true;
foreach (string word in words)
{
if (word.Length == 0) continue;
if (wordsSB.Length > 0) wordsSB.Append(' ');
bool isAcronym = word.Length > 1 && word.All((c) => char.IsUpper(c));
string toAppend;
// leave acronyms as-is, and lower-case all title case exceptions unless it's the first word.
if (isAcronym) toAppend = word;
else if (isFirstWord || !_englishTitleCaseLowerCaseWords.Contains(word, comparer)) toAppend = word.FirstUpper();
else toAppend = word.ToLower();
wordsSB.Append(toAppend);
isFirstWord = false;
}
return wordsSB;
}
/// <summary>
/// Convert the provided string such that first character is altered using
/// <paramref name="firstCharAlterationFunction"/> and the remaining characters
/// are altered using <paramref name="remainingCharsAlterationFunction"/>.
/// </summary>
/// <param name="forInput">The string which will have
/// its first character altered using <paramref name="firstCharAlterationFunction"/> and
/// subsequent characters altered using <paramref name="remainingCharsAlterationFunction"/>.</param>
/// <param name="firstCharAlterationFunction">The function which will
/// be used to alter the first character of the input string.</param>
/// <param name="remainingCharsAlterationFunction">The function which
/// will be used to alter every character in the string after the first character.</param>
/// <returns>The provided string with its first character
/// altered using <paramref name="firstCharAlterationFunction"/> and
/// subsequent characters altered using <paramref name="remainingCharsAlterationFunction"/>.</returns>
private static string Alter(string forInput, Func<string, string> firstCharAlterationFunction, Func<string, string> remainingCharsAlterationFunction)
{
if (string.IsNullOrWhiteSpace(forInput)) return forInput;
if (forInput.Length == 1) return firstCharAlterationFunction(forInput);
return firstCharAlterationFunction(forInput[0].ToString()) + remainingCharsAlterationFunction(forInput.Substring(1));
}

Get certain value in the string from text file

I have this in my text file:
000000000:Carrots:$1.99:214:03/11/2015:03/11/2016:$0.99
000000001:Bananas:$1.99:872:03/11/2015:03/11/2016:$0.99
000000002:Chocolate:$2.99:083:03/11/2015:03/11/2016:$1.99
000000003:Spaghetti:$3.99:376:03/11/2015:03/11/2016:$2.99
000000004:Tomato Sauce:$1.99:437:03/11/2015:03/11/2016:$0.99
000000005:Lettuce:$0.99:279:03/11/2015:03/11/2016:$0.99
000000006:Orange Juice:$2.99:398:03/11/2015:03/11/2016:$1.99
000000007:Potatoes:$2.99:792:03/11/2015:03/11/2016:$1.99
000000008:Celery:$0.99:973:03/11/2015:03/11/2016:$0.99
000000009:Onions:$1.99:763:03/11/2015:03/11/2016:$0.99
000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99
I need to get each of the "quantity" values, which is the fourth colon-separated field on each line (214 on the first line).
EDIT:
I want to also compare the values that I got and give an error if the quantity is low.
Here is a solution with minimal memory consumption for large input data.
Note: invalid data in the quantity column is not handled. To handle it, replace the int.Parse step.
These are several methods for processing the file data using LINQ expressions:
internal static class MyExtensions
{
/// <exception cref="OutOfMemoryException">There is insufficient memory to allocate a buffer for the returned string. </exception>
/// <exception cref="IOException">An I/O error occurs. </exception>
/// <exception cref="ArgumentException"><paramref name="stream" /> does not support reading. </exception>
/// <exception cref="ArgumentNullException"><paramref name="stream" /> is null. </exception>
public static IEnumerable<string> EnumerateLines(this Stream stream)
{
using (var reader = new StreamReader(stream))
{
do
{
var line = reader.ReadLine();
if (line == null) break;
yield return line;
} while (true);
}
}
/// <exception cref="ArgumentNullException"><paramref name="line"/> is <see langword="null" />.</exception>
public static IEnumerable<string> ChunkLine(this string line)
{
if (line == null) throw new ArgumentNullException("line");
return line.Split(':');
}
/// <exception cref="ArgumentNullException"><paramref name="chuckedData"/> is <see langword="null" />.</exception>
/// <exception cref="ArgumentException">Index should be not negative value</exception>
public static string GetColumnData(this IEnumerable<string> chuckedData, int columnIndex)
{
if (chuckedData == null) throw new ArgumentNullException("chuckedData");
if (columnIndex < 0) throw new ArgumentException("Column index should be >= 0", "columnIndex");
return chuckedData.Skip(columnIndex).FirstOrDefault();
}
}
This is an example of usage:
private void button1_Click(object sender, EventArgs e)
{
var values = EnumerateQuantityValues("largefile.txt");
// do whatever you need
}
private IEnumerable<int> EnumerateQuantityValues(string fileName)
{
const int columnIndex = 3;
using (var stream = File.OpenRead(fileName))
{
IEnumerable<int> enumerable = stream
.EnumerateLines()
.Select(x => x.ChunkLine().GetColumnData(columnIndex))
.Select(int.Parse);
foreach (var value in enumerable)
{
yield return value;
}
}
}
Consider the case where you have managed to get all these lines into a string array or list.
You can apply the code below to get the collection of quantities as IEnumerable<string>.
var quantity = arr.Select(c =>
{
var temp = c.Split('$');
if (temp.Length > 1)
{
temp = temp[1].Split(':');
if (temp.Length > 1)
{
return temp[1];
}
}
return null;
}).Where(c => c != null);
UPDATE
Check the Fiddle.
https://dotnetfiddle.net/HqKdeI
You simply need to split the string:
string data = @"000000000:Carrots:$1.99:214:03/11/2015:03/11/2016:$0.99
000000001:Bananas:$1.99:872:03/11/2015:03/11/2016:$0.99
000000002:Chocolate:$2.99:083:03/11/2015:03/11/2016:$1.99
000000003:Spaghetti:$3.99:376:03/11/2015:03/11/2016:$2.99
000000004:Tomato Sauce:$1.99:437:03/11/2015:03/11/2016:$0.99
000000005:Lettuce:$0.99:279:03/11/2015:03/11/2016:$0.99
000000006:Orange Juice:$2.99:398:03/11/2015:03/11/2016:$1.99
000000007:Potatoes:$2.99:792:03/11/2015:03/11/2016:$1.99
000000008:Celery:$0.99:973:03/11/2015:03/11/2016:$0.99
000000009:Onions:$1.99:763:03/11/2015:03/11/2016:$0.99
000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99";
string[] rows = data.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach(var row in rows)
{
string[] cols = row.Split(':');
var quantity = cols[3];
}
You can use String.Split to do this.
// Read all lines into an array
string[] lines = File.ReadAllLines(@"C:\path\to\your\file.txt");
// Loop through each one
foreach (string line in lines)
{
// Split into an array based on the : symbol
string[] split = line.Split(':');
// Get the column based on index
Console.WriteLine(split[3]);
}
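For the second part of the question (reporting an error when a quantity is low), here is a minimal sketch built on the same Split approach. The StockCheck class, the FindLowStock method and the threshold parameter are assumptions for illustration; the question does not say what counts as "low".

```csharp
using System.Collections.Generic;

public static class StockCheck
{
    // Returns the names (2nd field) of items whose quantity (4th field) is below the threshold.
    public static List<string> FindLowStock(IEnumerable<string> lines, int threshold)
    {
        var low = new List<string>();
        foreach (string line in lines)
        {
            string[] cols = line.Split(':');
            if (cols.Length < 4) continue;                      // skip malformed lines
            int quantity;
            if (!int.TryParse(cols[3], out quantity)) continue; // skip non-numeric quantities
            if (quantity < threshold) low.Add(cols[1]);
        }
        return low;
    }
}
```

It could be called as StockCheck.FindLowStock(File.ReadAllLines(path), 100), with the caller deciding how to report the returned names.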
Check out the example code below. The string you care about is named theValueYouWantInTheString.
char[] delimiterChar = { ':' };
string input = @"000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99";
string[] values = input.Split(delimiterChar);
string theValueYouWantInTheString = values[3];
If you have a problem, use regular expression. Now you have two problems.
Here is a program that uses your input as a text file. The function GetQuantity returns a List<int> containing the quantities. With this approach you can define more groups to extract other information from each line.
namespace RegExptester
{
class Program
{
private static List<int> GetQuantity(string txtFile)
{
string tempLineValue;
Regex regex = new Regex(@"[0-9]*:[^:]*:\$[0-9]*\.[0-9]*:([0-9]*).*", RegexOptions.Compiled); // [^:]* so names with spaces (e.g. "Tomato Sauce") also match
List<int> retValue = new List<int>();
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (null != (tempLineValue = inputReader.ReadLine()))
{
Match match = regex.Match(tempLineValue);
if (match.Success)
{
if(match.Groups.Count == 2)
{
int numberValue;
if (int.TryParse(match.Groups[1].Value, out numberValue))
retValue.Add(numberValue);
}
}
}
}
return retValue;
}
static void Main(string[] args)
{
var tmp = GetQuantity("c:\\tmp\\junk.txt");
}
}
}
Apparently, from each line you want the part between the 3rd and the 4th colon. LINQ can do that for you:
using (var textReader = new StreamReader(fileName))
{
// read all text and divide into lines:
var allText = textReader.ReadToEnd();
var allLines = allText.Split(new char[] {'\r','\n'}, StringSplitOptions.RemoveEmptyEntries);
// split each line based on ':', and take the fourth element
var myValues = allLines.Select(line => line.Split(new char[] {':'})
.Skip(3)
.FirstOrDefault());
}
If you want less readability, of course you can concatenate these statements into one line.
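A variation on the LINQ approach, as a sketch: File.ReadLines enumerates the file lazily, so the whole text does not have to be read into memory first. The QuantityReader class name is hypothetical, and lines with fewer than four fields are skipped rather than yielding null.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class QuantityReader
{
    // Lazily yields the 4th colon-separated field of every line that has one.
    public static IEnumerable<string> ReadQuantities(string fileName)
    {
        return File.ReadLines(fileName)
            .Select(line => line.Split(':'))
            .Where(cols => cols.Length > 3)
            .Select(cols => cols[3]);
    }
}
```

The file is only opened when the result is enumerated, so call ToList() on it if the values are needed more than once.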

Dealing with invalid XML hexadecimal characters

I'm trying to send an XML document over the wire but receiving the following exception:
"MY LONG EMAIL STRING" was specified for the 'Body' element. ---&gt; System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
at System.Xml.XmlRawWriter.WriteValue(String value)
at System.Xml.XmlWellFormedWriter.WriteValue(String value)
at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
--- End of inner exception stack trace ---
I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?
I'd like to keep the original characters one way or another.
The following code removes XML invalid characters from a string and returns a new string without them:
public static string CleanInvalidXmlChars(string text)
{
// From xml spec valid chars:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
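// .NET regex \x takes exactly two hex digits, so \uXXXX escapes are used here;
// valid surrogate pairs (code points above U+FFFF) are matched first and kept.
string re = @"[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]";
return Regex.Replace(text, re, m => m.Value.Length == 2 ? m.Value : "");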
}
Base64-encoding the string is one way of doing this:
byte[] toEncodeAsBytes
= System.Text.Encoding.UTF8.GetBytes(toEncode); // UTF8 rather than ASCII, so non-ASCII characters survive
string returnValue
= System.Convert.ToBase64String(toEncodeAsBytes);
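Since the question asks to keep the original characters, the receiving side needs the matching decode step. A minimal round-trip sketch follows; the Base64Text class name is made up for illustration, and UTF-8 is used here so that non-ASCII characters in the email text are preserved as well as control characters.

```csharp
using System;
using System.Text;

public static class Base64Text
{
    // Encodes arbitrary text (including control characters) as XML-safe Base64.
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Restores the original text from the Base64 form.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

The cost is that the element content is no longer human-readable, and both sides have to agree that the field is Base64-encoded.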
This works for me:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
Another way to remove invalid XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
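One caveat with a per-char filter, as far as I can tell from the documentation: XmlConvert.IsXmlChar looks at a single UTF-16 code unit, so the two halves of a surrogate pair (for example an emoji) are not reported as valid on their own, and XmlConvert.IsXmlSurrogatePair exists for that case. The sketch below keeps valid pairs and drops everything else that is invalid. The XmlSanitizer class name is hypothetical.

```csharp
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Drops characters that are invalid in XML 1.0 but keeps valid surrogate pairs.
    public static string RemoveInvalidXmlChars(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;
        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < content.Length
                     && XmlConvert.IsXmlSurrogatePair(content[i + 1], c)) // (lowChar, highChar)
            {
                sb.Append(c).Append(content[i + 1]);
                i++; // the low surrogate has been consumed as well
            }
        }
        return sb.ToString();
    }
}
```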
For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
The following solution removes any invalid XML characters, and I think it does so about as efficiently as it can be done. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains invalid characters. The hot spot ends up being a single for loop over the characters, where the check is often no more than two greater-than/less-than numeric comparisons per char. If no invalid characters are found, it simply returns the original string. This is particularly helpful when the vast majority of strings are fine to start with: they pass through as quickly as possible, with no wasted allocations.
-- update --
See below how one can also directly write an XElement that has these invalid characters, though it uses this code --
Some of this code was influenced by Mr. Tom Bogle's solution here. See also on that same thread the helpful information in the post by superlogical. All of these, however, always instantiate a new StringBuilder and string still.
USAGE:
string xmlStrBack = XML.ToValidXmlCharactersString("any string");
TEST:
public static void TestXmlCleanser()
{
string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
string goodString = "My name is Inigo Montoya!";
string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string
XElement x1 = new XElement("test", back1);
XElement x2 = new XElement("test", back2);
XElement x3WithBadString = new XElement("test", badString);
string xml1 = x1.ToString();
string xml2 = x2.ToString().Print();
string xmlShouldFail = x3WithBadString.ToString();
}
// --- CODE --- (I have these methods in a static utility class called XML)
/// <summary>
/// Determines if any invalid XML 1.0 characters exist within the string,
/// and if so it returns a new string with the invalid chars removed, else
/// the same string is returned (with no wasted StringBuilder allocated, etc).
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">The index to begin checking at.</param>
public static string ToValidXmlCharactersString(string s, int startIndex = 0)
{
int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
if (firstInvalidChar < 0)
return s;
startIndex = firstInvalidChar;
int len = s.Length;
var sb = new StringBuilder(len);
if (startIndex > 0)
sb.Append(s, 0, startIndex);
for (int i = startIndex; i < len; i++)
if (IsLegalXmlChar(s[i]))
sb.Append(s[i]);
return sb.ToString();
}
/// <summary>
/// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">Start index.</param>
public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
{
if (s != null && s.Length > 0 && startIndex < s.Length) {
if (startIndex < 0) startIndex = 0;
int len = s.Length;
for (int i = startIndex; i < len; i++)
if (!IsLegalXmlChar(s[i]))
return i;
}
return -1;
}
/// <summary>
/// Indicates whether a given character is valid according to the XML 1.0 spec.
/// This code represents an optimized version of Tom Bogle's on SO:
/// https://stackoverflow.com/a/13039301/264031.
/// </summary>
public static bool IsLegalXmlChar(char c)
{
if (c > 31 && c <= 55295)
return true;
if (c < 32)
return c == 9 || c == 10 || c == 13;
return (c >= 57344 && c <= 65533) || c > 65535;
// final comparison is useful only for integral comparison, if char c -> int c, useful for utf-32 I suppose
//c <= 1114111 */ // impossible to get a code point bigger than 1114111 because Char.ConvertToUtf32 would have thrown an exception
}
======== ======== ========
Write XElement.ToString directly
======== ======== ========
First, the usage of this extension method:
string result = xelem.ToStringIgnoreInvalidChars();
-- Fuller test --
public static void TestXmlCleanser()
{
string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
XElement x = new XElement("test", badString);
string xml1 = x.ToStringIgnoreInvalidChars();
//result: <test>My name is Inigo Montoya</test>
string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
//result: <test>My name is Inigo Montoya</test>
}
--- code ---
/// <summary>
/// Writes this XML to string while allowing invalid XML chars to either be
/// simply removed during the write process, or else encoded into entities,
/// instead of having an exception occur, as the standard XmlWriter.Create
/// XmlWriter does (which is the default writer used by XElement).
/// </summary>
/// <param name="xml">XElement.</param>
/// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
/// <param name="indent">Indent setting.</param>
/// <param name="indentChar">Indent char (leave null to use default)</param>
public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
{
if (xml == null) return null;
StringWriter swriter = new StringWriter();
using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {
// -- settings --
// unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
writer.Formatting = indent ? Formatting.Indented : Formatting.None;
if (indentChar != null)
writer.IndentChar = (char)indentChar;
// -- write --
xml.WriteTo(writer);
}
return swriter.ToString();
}
-- this uses the following XmlTextWriter --
public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
public bool DeleteInvalidChars { get; set; }
public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
{
DeleteInvalidChars = deleteInvalidChars;
}
public override void WriteString(string text)
{
if (text != null && DeleteInvalidChars)
text = XML.ToValidXmlCharactersString(text);
base.WriteString(text);
}
}
I'm on the receiving end of @parapurarajkumar's solution, where the illegal characters are loaded into XmlDocument properly but break XmlWriter when I try to save the output.
My Context
I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine I pretty-print the XML with XmlWriter.
During a website attack, I noticed that some XML documents weren't parsing, and I was receiving this exception: '.', hexadecimal value 0x00, is an invalid character.
NON-RESOLUTION: I converted the document to a byte[] and sanitized it of 0x00, but it found none.
When I scanned the xml document, I found the following:
...
<form>
...
<item name="SomeField">
<value
string="C:\boot.ini&#x0;.htm" />
</item>
...
There was the nul byte, encoded as the HTML entity &#x0; !!!
RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading it into my XmlDocument, because loading it will create the nul byte and it will be difficult to sanitize it from the object. Here's my entire process:
XmlDocument xml = new XmlDocument();
details.Xml = details.Xml.Replace("&#x0;", "[0x00]"); // in my case I wanted to see it, otherwise just replace with ""
xml.LoadXml(details.Xml);
string formattedXml = null;
// I stuff this all in a helper function, but put it in-line for this example
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
OmitXmlDeclaration = true,
Indent = true,
IndentChars = "\t",
NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
xml.Save(writer);
}
// read the buffer only after the writer has been disposed, so everything is flushed
formattedXml = sb.ToString();
LESSON LEARNED: sanitize for illegal bytes using the associated html entity, if your incoming data is html encoded on entry.
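The same idea can be generalised beyond the single nul entity. The following is only a sketch, not part of the answer above: the class and method names (XmlEntitySanitizer, ReplaceInvalidReferences) are made up for illustration, and it assumes the raw XML has no CDATA sections in which such text should be kept literally. It finds every numeric character reference, decimal or hexadecimal, and replaces the ones that would decode to a character XML 1.0 does not allow.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class XmlEntitySanitizer
{
    // Matches decimal (&#0;) and hexadecimal (&#x0;) numeric character references.
    private static readonly Regex NumericReference =
        new Regex(@"&#(?:[xX](?<hex>[0-9a-fA-F]+)|(?<dec>[0-9]+));");

    // Replaces every reference that would decode to a character forbidden in XML 1.0.
    // "replacement" is the marker to write instead (pass "" to drop the reference).
    public static string ReplaceInvalidReferences(string rawXml, string replacement)
    {
        if (string.IsNullOrEmpty(rawXml)) return rawXml;

        return NumericReference.Replace(rawXml, match =>
        {
            bool isHex = match.Groups["hex"].Success;
            string digits = isHex ? match.Groups["hex"].Value : match.Groups["dec"].Value;
            NumberStyles style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;

            int codePoint;
            bool parsed = int.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint);

            // Keep the reference untouched when it names a legal character.
            return parsed && IsValidXmlCodePoint(codePoint) ? match.Value : replacement;
        });
    }

    private static bool IsValidXmlCodePoint(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF) return false;
        if (codePoint > 0xFFFF) return true; // supplementary planes are legal in XML 1.0
        return XmlConvert.IsXmlChar((char)codePoint);
    }
}
```

With this in place, the Replace call in the process above could become details.Xml = XmlEntitySanitizer.ReplaceInvalidReferences(details.Xml, "[0x00]"); which would also catch spellings such as &#0; or &#x00;.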
There is a generic solution that works nicely:
public class XmlTextTransformWriter : System.Xml.XmlTextWriter
{
public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }
public Func<string, string> TextTransform = s => s;
public override void WriteString(string text)
{
base.WriteString(TextTransform(text));
}
public override void WriteCData(string text)
{
base.WriteCData(TextTransform(text));
}
public override void WriteComment(string text)
{
base.WriteComment(TextTransform(text));
}
public override void WriteRaw(string data)
{
base.WriteRaw(TextTransform(data));
}
public override void WriteValue(string value)
{
base.WriteValue(TextTransform(value));
}
}
Once this is in place, you can then create your override of THIS as follows:
public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }
void SetTransform()
{
TextTransform = XmlUtil.RemoveInvalidXmlChars;
}
}
where XmlUtil.RemoveInvalidXmlChars is defined as follows:
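public static string RemoveInvalidXmlChars(string content)
{
// nothing to strip from null or empty input (the writer methods may be handed null)
if (string.IsNullOrEmpty(content))
return content;
if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
else
return content;
}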
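One caveat with filtering char by char: as far as I can tell, XmlConvert.IsXmlChar returns false for each half of a surrogate pair, so the filter above would also strip characters outside the Basic Multilingual Plane (emoji, for example), even though they are legal in XML. Below is a sketch of a variant that keeps well-formed pairs. The class and method names are my own, not part of the answer above.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class XmlCharFilter
{
    // Drops characters that are invalid in XML 1.0, but keeps well-formed
    // surrogate pairs, which encode legal supplementary-plane characters.
    public static string RemoveInvalidXmlCharsKeepingPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char ch = content[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            else if (i + 1 < content.Length && char.IsSurrogatePair(ch, content[i + 1]))
            {
                // High surrogate followed by a low surrogate: copy both and skip ahead.
                sb.Append(ch).Append(content[i + 1]);
                i++;
            }
            // Anything else (control chars, lone surrogates, U+FFFE/U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

It can be plugged into the writer the same way, by assigning TextTransform = XmlCharFilter.RemoveInvalidXmlCharsKeepingPairs; in SetTransform().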
Can't the string be cleaned with:
System.Net.WebUtility.HtmlDecode()
?

Regex camelcase in c#

I'm trying to use regex to convert a string like this "North Korea"
to a string like "northKorea" - does someone know how I might accomplish this in c# ?
Cheers
If you know all your input strings are in title case (like "North Korea"), you can simply do:
string input = "North Korea";
input = input.Replace(" ",""); //remove spaces
string output = char.ToLower(input[0]) +
input.Substring(1); //make first char lowercase
// output = "northKorea"
If some of your input is not in title case, you can use TextInfo.ToTitleCase:
string input = "NoRtH kORea";
input = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase(input);
input = input.Replace(" ",""); //remove spaces
string output = char.ToLower(input[0]) +
input.Substring(1); //make first char lowercase
// output = "northKorea"
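One thing to watch for with ToTitleCase: it treats words that are entirely upper case as acronyms and leaves them alone, so "NORTH KOREA" would come out unchanged. A minimal sketch that lower-cases first to avoid this, and also guards against empty input, could look like the following. The class and method names are invented for the example, and it uses the invariant culture rather than the current one.

```csharp
using System.Globalization;

// Hypothetical helper, named for illustration only.
public static class TitleCaseCamel
{
    public static string ToCamel(string input)
    {
        // input[0] would throw on an empty string, so bail out early.
        if (string.IsNullOrWhiteSpace(input)) return string.Empty;

        TextInfo textInfo = CultureInfo.InvariantCulture.TextInfo;

        // Lower-case first so all-caps words are not mistaken for acronyms.
        string titled = textInfo.ToTitleCase(textInfo.ToLower(input));
        string joined = titled.Replace(" ", ""); // remove spaces

        return char.ToLowerInvariant(joined[0]) + joined.Substring(1);
    }
}
```

With this, TitleCaseCamel.ToCamel("NORTH KOREA") and TitleCaseCamel.ToCamel("NoRtH kORea") both give "northKorea".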
Forget regex.
All you need is a camelCase conversion algorithm:
See here:
http://www.codekeep.net/snippets/096fea45-b426-40fd-8beb-dec49d8a8662.aspx
Use this one:
string camelCase = ConvertCaseString(a, Case.CamelCase);
Copy-pasted in case it goes offline:
void Main() {
string a = "background color-red.brown";
string camelCase = ConvertCaseString(a, Case.CamelCase);
string pascalCase = ConvertCaseString(a, Case.PascalCase);
}
/// <summary>
/// Converts the phrase to specified convention.
/// </summary>
/// <param name="phrase"></param>
/// <param name="cases">The cases.</param>
/// <returns>string</returns>
static string ConvertCaseString(string phrase, Case cases)
{
string[] splittedPhrase = phrase.Split(' ', '-', '.');
var sb = new StringBuilder();
if (cases == Case.CamelCase)
{
sb.Append(splittedPhrase[0].ToLower());
splittedPhrase[0] = string.Empty;
}
foreach (String s in splittedPhrase)
{
char[] splittedPhraseChars = s.ToCharArray();
if (splittedPhraseChars.Length > 0)
{
splittedPhraseChars[0] = char.ToUpper(splittedPhraseChars[0]);
}
sb.Append(new String(splittedPhraseChars));
}
return sb.ToString();
}
enum Case
{
PascalCase,
CamelCase
}
You could just split it and put it back together:
string[] split = "North Korea".Split(' ');
StringBuilder sb = new StringBuilder();
for (int i = 0; i < split.Length; i++)
{
if (i == 0)
sb.Append(split[i].ToLower());
else
sb.Append(split[i]);
}
string output = sb.ToString(); // "northKorea"
Edit: Switched to a StringBuilder instead, like Bazzz suggested.
This builds on Paolo Falabella's answer as a String extension and handles a few boundary cases such as empty string. Since there is some confusion between CamelCase and camelCase, I called it LowerCamelCase as described on Wikipedia. I resisted the temptation to go with nerdCaps.
internal static string ToLowerCamelCase( this string input )
{
string output = "";
if( String.IsNullOrEmpty( input ) == false )
{
output = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase( input ); //in case not Title Case
output = output.Replace( " ", "" ); //remove any white spaces between words
if( String.IsNullOrEmpty( output ) == false ) //handles the case where input is " "
{
output = char.ToLower( output[0] ) + output.Substring( 1 ); //lowercase first (even if 1 character string)
}
}
return output;
}
Use:
string test = "Foo Bar";
test = test.ToLowerCamelCase();
... //test is now "fooBar"
Update:
toong raised a good point in the comments - this will not work for graphemes. See the link provided by toong. There are also examples of iterating graphemes here and here if you want to tweak the above code for graphemes.
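As a rough idea of what such a tweak might look like, here is a sketch that lower-cases the first text element instead of the first char, using StringInfo.GetNextTextElement. The class and method names are made up, and whether a given surrogate pair actually gets lower-cased depends on the runtime, so treat it as a starting point only.

```csharp
using System.Globalization;

// Hypothetical helper, named for illustration only.
public static class GraphemeCamel
{
    // Lower-cases the first text element (base character plus any combining
    // marks, or a whole surrogate pair) rather than just the first char.
    public static string LowerFirstTextElement(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        string first = StringInfo.GetNextTextElement(input, 0);
        return first.ToLower(CultureInfo.CurrentCulture) + input.Substring(first.Length);
    }
}
```

In ToLowerCamelCase above, the line that calls char.ToLower( output[0] ) could be swapped for output = GraphemeCamel.LowerFirstTextElement( output );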
String::Split definitely is one of my pet peeves. Also, none of the other answers deal with:
Cultures
All forms of word separators
Numbers
What happens when it starts with word separators
I tried to get it as close as possible to what you would find in base class library code:
static string ToCamelCaseInvariant(string value) { return ToCamelCase(value, true, CultureInfo.InvariantCulture); }
static string ToCamelCaseInvariant(string value, bool changeWordCaps) { return ToCamelCase(value, changeWordCaps, CultureInfo.InvariantCulture); }
static string ToCamelCase(string value) { return ToCamelCase(value, true, CultureInfo.CurrentCulture); }
static string ToCamelCase(string value, bool changeWordCaps) { return ToCamelCase(value, changeWordCaps, CultureInfo.CurrentCulture); }
/// <summary>
/// Converts the given string value into camelCase.
/// </summary>
/// <param name="value">The value.</param>
/// <param name="changeWordCaps">If set to <c>true</c> letters in a word (apart from the first) will be lowercased.</param>
/// <param name="culture">The culture to use to change the case of the characters.</param>
/// <returns>
/// The camel case value.
/// </returns>
static string ToCamelCase(string value, bool changeWordCaps, CultureInfo culture)
{
if (culture == null)
throw new ArgumentNullException("culture");
if (string.IsNullOrEmpty(value))
return value;
var result = new StringBuilder(value.Length);
var lastWasBreak = true;
for (var i = 0; i < value.Length; i++)
{
var c = value[i];
if (char.IsWhiteSpace(c) || char.IsPunctuation(c) || char.IsSeparator(c))
{
lastWasBreak = true;
}
else if (char.IsNumber(c))
{
result.Append(c);
lastWasBreak = true;
}
else
{
if (result.Length == 0)
{
result.Append(char.ToLower(c, culture));
}
else if (lastWasBreak)
{
result.Append(char.ToUpper(c, culture));
}
else if (changeWordCaps)
{
result.Append(char.ToLower(c, culture));
}
else
{
result.Append(c);
}
lastWasBreak = false;
}
}
return result.ToString();
}
// Tests
' This is a test. 12345hello world' = 'thisIsATest12345HelloWorld'
'--north korea' = 'northKorea'
'!nOrTH koreA' = 'northKorea'
'System.Console.' = 'systemConsole'
Try the following:
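var input = "Hi my name is Rony";
// RemoveEmptyEntries stops repeated spaces from producing empty words
var subStrs = input.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
var output = "";
for (var i = 0; i < subStrs.Length; i++)
{
var s = subStrs[i];
// compare by position, not by value, so a later repeat of the first word is still capitalised
if (i > 0)
output += char.ToUpper(s[0]) + s.Substring(1);
else
output += s;
}
This should give "hiMyNameIsRony" as the output.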
string toCamelCase(string s)
{
if (s.Length < 2) return s.ToLower();
return Char.ToLowerInvariant(s[0]) + s.Substring(1);
}
Similar to Paolo Falabella's code, but survives empty strings and 1-char strings.
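Since the question asked for a regex, here is one way it might be done with Regex.Replace and a match evaluator. This is only a sketch: the class name is invented, it treats whitespace as the only word separator, and it leaves the case of the other letters alone.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class RegexCamel
{
    public static string ToCamelCase(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return string.Empty;

        // Replace each run of whitespace plus the character after it
        // with that character in upper case, which removes the whitespace.
        string joined = Regex.Replace(
            input.Trim(),
            @"\s+(\S)",
            m => m.Groups[1].Value.ToUpperInvariant());

        // Finally lower-case the very first character.
        return char.ToLowerInvariant(joined[0]) + joined.Substring(1);
    }
}
```

RegexCamel.ToCamelCase("North Korea") gives "northKorea", and "Hi my name is Rony" gives "hiMyNameIsRony".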
