Dealing with invalid XML hexadecimal characters - c#

I'm trying to send an XML document over the wire but receiving the following exception:
"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
at System.Xml.XmlRawWriter.WriteValue(String value)
at System.Xml.XmlWellFormedWriter.WriteValue(String value)
at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
--- End of inner exception stack trace ---
I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?
I'd like to keep the original characters one way or another.

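The following code removes invalid XML characters from a string and returns a new string without them:
public static string CleanInvalidXmlChars(string text)
{
// Valid chars per the XML 1.0 spec:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
// .NET strings are UTF-16, so [#x10000-#x10FFFF] shows up as surrogate pairs:
// keep well-formed pairs and strip only unpaired surrogates.
string re = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\uD800-\uDFFF]"
+ @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
+ @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
return Regex.Replace(text, re, "");
}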

Base64-encoding the string is one way of doing this:
byte[] toEncodeAsBytes
= System.Text.Encoding.UTF8.GetBytes(toEncode); // UTF-8 rather than ASCII, so non-ASCII characters survive
string returnValue
= System.Convert.ToBase64String(toEncodeAsBytes);
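Since the question asks to keep the original characters, the Base64 idea only helps if the receiving side can decode it again. A minimal round-trip sketch follows; the class name Base64Text is made up for illustration, and UTF-8 is an assumption about what both sides agree on.

```csharp
using System;
using System.Text;

public static class Base64Text
{
    // Encode arbitrary text (including control characters such as 0x02) as Base64.
    public static string Encode(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    // Reverse of Encode: recover the original text on the receiving side.
    public static string Decode(string base64) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(base64));
}
```

For example, Base64Text.Encode("a\u0002b") gives "YQJi", which contains only characters that are legal in XML, and Base64Text.Decode("YQJi") returns the original three characters. The trade-off is that the body is no longer human-readable in the XML.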

This works for me:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
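A short sketch of how those settings might be used, assuming you create the XmlWriter yourself (in the question the writer lives inside the EWS library, so this may not apply there). The class and method names are hypothetical.

```csharp
using System.Text;
using System.Xml;

public static class UncheckedXml
{
    // Writes a single element whose text may contain characters that XML 1.0 forbids.
    public static string WriteElement(string name, string value)
    {
        var settings = new XmlWriterSettings
        {
            CheckCharacters = false,   // do not throw on characters such as 0x02
            OmitXmlDeclaration = true  // keep the output to just the element
        };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString(name, value);
        }
        return sb.ToString();
    }
}
```

Calling UncheckedXml.WriteElement("Body", "a\u0002b") no longer throws. In the runtimes I have looked at, the offending character comes out as a numeric character reference in element text, but treat that as an observation rather than a guarantee. Either way the result is not well-formed XML 1.0, so a strict parser on the receiving side may still reject it unless it is also configured leniently (for .NET, XmlReaderSettings.CheckCharacters = false).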

Another way to remove invalid XML characters in C# is the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.
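One caveat with filtering char by char: XmlConvert.IsXmlChar looks at a single UTF-16 code unit, so as far as I can tell it also rejects both halves of a surrogate pair, which means characters outside the BMP (emoji, for instance) get stripped as well. A sketch of a variant that keeps well-formed pairs is below; the class and method names are made up, and char.IsSurrogatePair is used for the pair test.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Removes characters that are invalid in XML 1.0, but keeps valid surrogate pairs.
    public static string RemoveInvalidXmlCharsKeepPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;
        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char ch = content[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            else if (i + 1 < content.Length && char.IsSurrogatePair(ch, content[i + 1]))
            {
                // High surrogate followed by a low surrogate: one supplementary character.
                sb.Append(ch).Append(content[i + 1]);
                i++; // skip the low surrogate we just copied
            }
            // Anything else (control chars, lone surrogates, FFFE/FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

With this version "a\vb" becomes "ab", while a string containing U+1F600 (written "\uD83D\uDE00" in C#) passes through unchanged.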

The following solution removes any invalid XML characters, and I think it does so about as efficiently as possible. In particular, it allocates a new StringBuilder and a new string only once it has determined that the string actually contains invalid characters. The hot path is therefore a single for loop over the characters, and the check is often no more than two greater-than/less-than comparisons per char. If no invalid characters are found, it simply returns the original string. This is particularly helpful when the vast majority of strings are fine to begin with: they pass straight through with no wasted allocations.
-- update --
See below how one can also directly write an XElement that has these invalid characters; it uses this code.
Some of this code was influenced by Tom Bogle's solution (https://stackoverflow.com/a/13039301/264031). See also the helpful information in the post by superlogical on that same thread. All of these, however, always instantiate a new StringBuilder and a new string.
USAGE:
string xmlStrBack = XML.ToValidXmlCharactersString("any string");
TEST:
public static void TestXmlCleanser()
{
string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
string goodString = "My name is Inigo Montoya!";
string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string
XElement x1 = new XElement("test", back1);
XElement x2 = new XElement("test", back2);
XElement x3WithBadString = new XElement("test", badString);
string xml1 = x1.ToString();
string xml2 = x2.ToString().Print();
string xmlShouldFail = x3WithBadString.ToString();
}
// --- CODE --- (I have these methods in a static utility class called XML)
/// <summary>
/// Determines if any invalid XML 1.0 characters exist within the string,
/// and if so it returns a new string with the invalid chars removed, else
/// the same string is returned (with no wasted StringBuilder allocated, etc).
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">The index to begin checking at.</param>
public static string ToValidXmlCharactersString(string s, int startIndex = 0)
{
int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
if (firstInvalidChar < 0)
return s;
startIndex = firstInvalidChar;
int len = s.Length;
var sb = new StringBuilder(len);
if (startIndex > 0)
sb.Append(s, 0, startIndex);
for (int i = startIndex; i < len; i++)
if (IsLegalXmlChar(s[i]))
sb.Append(s[i]);
return sb.ToString();
}
/// <summary>
/// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">Start index.</param>
public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
{
if (s != null && s.Length > 0 && startIndex < s.Length) {
if (startIndex < 0) startIndex = 0;
int len = s.Length;
for (int i = startIndex; i < len; i++)
if (!IsLegalXmlChar(s[i]))
return i;
}
return -1;
}
/// <summary>
/// Indicates whether a given character is valid according to the XML 1.0 spec.
/// This code represents an optimized version of Tom Bogle's on SO:
/// https://stackoverflow.com/a/13039301/264031.
/// </summary>
public static bool IsLegalXmlChar(char c)
{
if (c > 31 && c <= 55295)
return true;
if (c < 32)
return c == 9 || c == 10 || c == 13;
return (c >= 57344 && c <= 65533) || c > 65535;
// Note: 'c > 65535' can never be true for a 16-bit char. It only matters if c is widened to an
// int code point (e.g. for UTF-32); no upper bound check (c <= 1114111) is needed there, because
// Char.ConvertToUtf32 would already have thrown for anything larger.
}
======== ======== ========
Write XElement.ToString directly
======== ======== ========
First, the usage of this extension method:
string result = xelem.ToStringIgnoreInvalidChars();
-- Fuller test --
public static void TestXmlCleanser()
{
string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
XElement x = new XElement("test", badString);
string xml1 = x.ToStringIgnoreInvalidChars();
//result: <test>My name is Inigo Montoya</test>
string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
//result: <test>My name is Inigo Montoya</test>
}
--- code ---
/// <summary>
/// Writes this XML to string while allowing invalid XML chars to either be
/// simply removed during the write process, or else encoded into entities,
/// instead of having an exception occur, as the standard XmlWriter.Create
/// XmlWriter does (which is the default writer used by XElement).
/// </summary>
/// <param name="xml">XElement.</param>
/// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
/// <param name="indent">Indent setting.</param>
/// <param name="indentChar">Indent char (leave null to use default)</param>
public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
{
if (xml == null) return null;
StringWriter swriter = new StringWriter();
using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {
// -- settings --
// unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
writer.Formatting = indent ? Formatting.Indented : Formatting.None;
if (indentChar != null)
writer.IndentChar = (char)indentChar;
// -- write --
xml.WriteTo(writer);
}
return swriter.ToString();
}
-- this uses the following XmlTextWriter --
public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
public bool DeleteInvalidChars { get; set; }
public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
{
DeleteInvalidChars = deleteInvalidChars;
}
public override void WriteString(string text)
{
if (text != null && DeleteInvalidChars)
text = XML.ToValidXmlCharactersString(text);
base.WriteString(text);
}
}
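As a sanity check on the hand-written range test in IsLegalXmlChar, one could compare it with XmlConvert.IsXmlChar over every possible char value. The sketch below copies the range check (minus the 'c > 65535' branch, which cannot fire for a char) into a hypothetical class XmlCharCheck; I would expect the two to agree everywhere, but the point of the loop is to verify that rather than assume it.

```csharp
using System.Xml;

public static class XmlCharCheck
{
    // Copy of the range check from the answer above.
    public static bool IsLegalXmlChar(char c)
    {
        if (c > 31 && c <= 55295) return true;
        if (c < 32) return c == 9 || c == 10 || c == 13;
        return c >= 57344 && c <= 65533;
    }

    // Returns the number of UTF-16 code units on which the two checks disagree.
    public static int CountDisagreements()
    {
        int disagreements = 0;
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (IsLegalXmlChar(c) != XmlConvert.IsXmlChar(c))
                disagreements++;
        }
        return disagreements;
    }
}
```

If CountDisagreements() returns 0 on your runtime, the custom check is interchangeable with the framework method for single chars, and the remaining benefit of the answer's code is the allocation-free fast path.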

I'm on the receiving end of @parapurarajkumar's solution, where the illegal characters are loaded into XmlDocument without complaint but break XmlWriter when I try to save the output.
My Context
I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine I pretty-print the XML with XmlWriter.
During a website attack, I noticed that some XML documents weren't parsing, and I was receiving this exception: '.', hexadecimal value 0x00, is an invalid character.
NON-RESOLUTION: I converted the document to a byte[] and sanitized it of 0x00, but it found none.
When I scanned the xml document, I found the following:
...
<form>
...
<item name="SomeField">
<value
string="C:\boot.ini&#x0;.htm" />
</item>
...
There was the nul byte, encoded as an HTML entity (&#x0;)!
RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading the text into my XmlDocument, because loading it creates the nul byte, which is then difficult to sanitize out of the object. Here's my entire process:
XmlDocument xml = new XmlDocument();
details.Xml = details.Xml.Replace("&#x0;", "[0x00]"); // in my case I wanted to see it, otherwise just replace with ""
xml.LoadXml(details.Xml);
string formattedXml = null;
// I stuff this all in a helper function, but put it in-line for this example
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
OmitXmlDeclaration = true,
Indent = true,
IndentChars = "\t",
NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
xml.Save(writer);
formattedXml = sb.ToString();
}
LESSON LEARNED: if your incoming data is HTML-encoded on entry, sanitize illegal bytes by looking for the associated HTML entity.
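The same idea can be generalised beyond the single nul reference. The sketch below scans the raw XML text for numeric character references (decimal or hexadecimal) and replaces those that point at code points XML 1.0 does not allow with a visible marker, before the text is handed to XmlDocument.LoadXml. The class name, the marker format and the choice to mark rather than delete are all assumptions made for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class XmlCharRefSanitizer
{
    // Matches decimal (&#0;) and hexadecimal (&#x0;) numeric character references.
    private static readonly Regex CharRef =
        new Regex(@"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));", RegexOptions.Compiled);

    // True if the code point is allowed by the XML 1.0 Char production.
    private static bool IsValidXmlCodePoint(long cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    // Replaces references to invalid code points with a visible marker such as [0x00].
    public static string ReplaceInvalidCharRefs(string xmlText)
    {
        return CharRef.Replace(xmlText, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            long cp;
            bool parsed = long.TryParse(
                digits,
                isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
                CultureInfo.InvariantCulture,
                out cp);
            if (parsed && IsValidXmlCodePoint(cp))
                return m.Value; // legal reference: leave untouched
            return parsed ? "[0x" + cp.ToString("X2") + "]" : "[invalid]";
        });
    }
}
```

Applied to the fragment above, ReplaceInvalidCharRefs("C:\\boot.ini&#x0;.htm") yields "C:\\boot.ini[0x00].htm", while legal references such as &#65; or &#x9; are left alone.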

There is a generic solution that works nicely:
public class XmlTextTransformWriter : System.Xml.XmlTextWriter
{
public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }
public Func<string, string> TextTransform = s => s;
public override void WriteString(string text)
{
base.WriteString(TextTransform(text));
}
public override void WriteCData(string text)
{
base.WriteCData(TextTransform(text));
}
public override void WriteComment(string text)
{
base.WriteComment(TextTransform(text));
}
public override void WriteRaw(string data)
{
base.WriteRaw(TextTransform(data));
}
public override void WriteValue(string value)
{
base.WriteValue(TextTransform(value));
}
}
Once this is in place, you can then create your override of THIS as follows:
public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }
void SetTransform()
{
TextTransform = XmlUtil.RemoveInvalidXmlChars;
}
}
where XmlUtil.RemoveInvalidXmlChars is defined as follows:
public static string RemoveInvalidXmlChars(string content)
{
if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
else
return content;
}

Can't the string be cleaned with:
System.Net.WebUtility.HtmlDecode()
?

Related

Get certain value in the string from text file

I have this in my text file:
000000000:Carrots:$1.99:214:03/11/2015:03/11/2016:$0.99
000000001:Bananas:$1.99:872:03/11/2015:03/11/2016:$0.99
000000002:Chocolate:$2.99:083:03/11/2015:03/11/2016:$1.99
000000003:Spaghetti:$3.99:376:03/11/2015:03/11/2016:$2.99
000000004:Tomato Sauce:$1.99:437:03/11/2015:03/11/2016:$0.99
000000005:Lettuce:$0.99:279:03/11/2015:03/11/2016:$0.99
000000006:Orange Juice:$2.99:398:03/11/2015:03/11/2016:$1.99
000000007:Potatoes:$2.99:792:03/11/2015:03/11/2016:$1.99
000000008:Celery:$0.99:973:03/11/2015:03/11/2016:$0.99
000000009:Onions:$1.99:763:03/11/2015:03/11/2016:$0.99
000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99
000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99
I need to get each "quantity" value, which is the fourth colon-separated field on each line (214 in the first line).
EDIT:
I want to also compare the values that I got and give an error if the quantity is low.
This solution keeps memory consumption minimal for large input data.
Note that it does not handle incorrect data in the quantity column; to do that, replace the int.Parse step.
Here are several extension methods for processing the file data with LINQ expressions:
internal static class MyExtensions
{
/// <exception cref="OutOfMemoryException">There is insufficient memory to allocate a buffer for the returned string. </exception>
/// <exception cref="IOException">An I/O error occurs. </exception>
/// <exception cref="ArgumentException"><paramref name="stream" /> does not support reading. </exception>
/// <exception cref="ArgumentNullException"><paramref name="stream" /> is null. </exception>
public static IEnumerable<string> EnumerateLines(this Stream stream)
{
using (var reader = new StreamReader(stream))
{
do
{
var line = reader.ReadLine();
if (line == null) break;
yield return line;
} while (true);
}
}
/// <exception cref="ArgumentNullException"><paramref name="line"/> is <see langword="null" />.</exception>
public static IEnumerable<string> ChunkLine(this string line)
{
if (line == null) throw new ArgumentNullException("line");
return line.Split(':');
}
/// <exception cref="ArgumentNullException"><paramref name="chunkedData"/> is <see langword="null" />.</exception>
/// <exception cref="ArgumentException">The column index must not be negative.</exception>
public static string GetColumnData(this IEnumerable<string> chunkedData, int columnIndex)
{
if (chunkedData == null) throw new ArgumentNullException("chunkedData");
if (columnIndex < 0) throw new ArgumentException("Column index should be >= 0", "columnIndex");
return chunkedData.Skip(columnIndex).FirstOrDefault();
}
}
Example usage:
private void button1_Click(object sender, EventArgs e)
{
var values = EnumerateQuantityValues("largefile.txt");
// do whatever you need
}
private IEnumerable<int> EnumerateQuantityValues(string fileName)
{
const int columnIndex = 3;
using (var stream = File.OpenRead(fileName))
{
IEnumerable<int> enumerable = stream
.EnumerateLines()
.Select(x => x.ChunkLine().GetColumnData(columnIndex))
.Select(int.Parse);
foreach (var value in enumerable)
{
yield return value;
}
}
}
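The answer mentions replacing the int.Parse step to cope with bad data in the quantity column. One possible shape for that, as a sketch only: the class name is hypothetical, and silently skipping malformed lines is an assumption (you may prefer to log or throw).

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class QuantityParser
{
    // Yields the quantity (4th colon-separated field) of every well-formed line,
    // skipping lines that are too short or whose quantity is not an integer.
    public static IEnumerable<int> ParseQuantities(IEnumerable<string> lines)
    {
        const int columnIndex = 3;
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            string[] parts = line.Split(':');
            if (parts.Length <= columnIndex) continue;
            int quantity;
            if (int.TryParse(parts[columnIndex], NumberStyles.Integer,
                             CultureInfo.InvariantCulture, out quantity))
                yield return quantity;
        }
    }
}
```

Because it takes any IEnumerable<string>, it can be fed from File.ReadLines("largefile.txt"), which reads lazily and so keeps the low-memory property of the original approach.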
Suppose you have managed to read all these lines into a string array or list (arr below).
You can then apply the code below to get the quantities as an IEnumerable<string>.
var quantity = arr.Select(c =>
{
var temp = c.Split('$');
if (temp.Length > 1)
{
temp = temp[1].Split(':');
if (temp.Length > 1)
{
return temp[1];
}
}
return null;
}).Where(c => c != null);
UPDATE
Check the Fiddle.
https://dotnetfiddle.net/HqKdeI
You simply need to split the string:
string data = @"000000000:Carrots:$1.99:214:03/11/2015:03/11/2016:$0.99
000000001:Bananas:$1.99:872:03/11/2015:03/11/2016:$0.99
000000002:Chocolate:$2.99:083:03/11/2015:03/11/2016:$1.99
000000003:Spaghetti:$3.99:376:03/11/2015:03/11/2016:$2.99
000000004:Tomato Sauce:$1.99:437:03/11/2015:03/11/2016:$0.99
000000005:Lettuce:$0.99:279:03/11/2015:03/11/2016:$0.99
000000006:Orange Juice:$2.99:398:03/11/2015:03/11/2016:$1.99
000000007:Potatoes:$2.99:792:03/11/2015:03/11/2016:$1.99
000000008:Celery:$0.99:973:03/11/2015:03/11/2016:$0.99
000000009:Onions:$1.99:763:03/11/2015:03/11/2016:$0.99
000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99";
string[] rows = data.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach(var row in rows)
{
string[] cols = row.Split(':');
var quantity = cols[3];
}
You can use String.Split to do this.
// Read all lines into an array
string[] lines = File.ReadAllLines(@"C:\path\to\your\file.txt");
// Loop through each one
foreach (string line in lines)
{
// Split into an array based on the : symbol
string[] split = line.Split(':');
// Get the column based on index
Console.WriteLine(split[3]);
}
Check out the example code below. The string you care about is named theValueYouWantInTheString.
char[] delimiterChar = { ':' };
string input = @"000000010:Chicken:$8.99:345:03/11/2015:03/11/2016:$7.99";
string[] values = input.Split(delimiterChar);
string theValueYouWantInTheString = values[3];
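The question's edit also asks for an error when a quantity is low, which none of the snippets so far cover. A minimal sketch building on the same Split approach is below; the threshold value and the class and method names are assumptions, since the question does not say what counts as "low".

```csharp
using System;
using System.Collections.Generic;

public static class StockCheck
{
    // Returns "name (quantity)" for every line whose quantity is below the threshold.
    public static List<string> FindLowStock(IEnumerable<string> lines, int threshold)
    {
        var low = new List<string>();
        foreach (string line in lines)
        {
            string[] values = line.Split(':');
            int quantity;
            // values[1] is the product name, values[3] the quantity.
            if (values.Length > 3 && int.TryParse(values[3], out quantity) && quantity < threshold)
                low.Add(values[1] + " (" + quantity + ")");
        }
        return low;
    }
}
```

With the sample data and a threshold of 100, only the Chocolate line (quantity 083) is reported, as "Chocolate (83)". The caller can then show a message or throw, depending on how the error should surface.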
If you have a problem, use a regular expression. Now you have two problems.
Here is a program that reads your input from a text file. The function GetQuantity returns a list of ints containing the quantities. With this approach you can define more groups to extract further information from each line.
namespace RegExptester
{
class Program
{
private static List<int> GetQuantity(string txtFile)
{
string tempLineValue;
// [^:]* rather than [a-zA-Z]* so that names containing spaces ("Tomato Sauce") still match
Regex regex = new Regex(@"[0-9]*:[^:]*:\$[0-9]*\.[0-9]*:([0-9]*).*", RegexOptions.Compiled);
List<int> retValue = new List<int>();
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (null != (tempLineValue = inputReader.ReadLine()))
{
Match match = regex.Match(tempLineValue);
if (match.Success)
{
if(match.Groups.Count == 2)
{
int numberValue;
if (int.TryParse(match.Groups[1].Value, out numberValue))
retValue.Add(numberValue);
}
}
}
}
return retValue;
}
static void Main(string[] args)
{
var tmp = GetQuantity("c:\\tmp\\junk.txt");
}
}
}
Apparently, from each line you want the part between the third and the fourth colon. LINQ can do that for you:
using (var textReader = new StreamReader(fileName))
{
// read all text and divide into lines:
var allText = textReader.ReadToEnd();
var allLines = allText.Split(new char[] {'\r','\n'}, StringSplitOptions.RemoveEmptyEntries);
// split each line based on ':', and take the fourth element
var myValues = allLines.Select(line => line.Split(new char[] {':'})
.Skip(3)
.FirstOrDefault());
}
If you want less readability, of course you can concatenate these statements into one line.

Counting/sorting characters in a text file

I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.
class Program
{
static void Main(string[] args)
{
CharFrequency[] Charfreq = new CharFrequency[128];
try
{
string line;
System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
while ((line = file.ReadLine()) != null)
{
int ch = file.Read();
if (Charfreq.Contains(ch))
{
}
}
file.Close();
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("The process failed: {0}", e.ToString());
}
}
}
My question is, what should go in the if statement here?
I also have a Charfrequency class, which I'll include here in case it is helpful/necessary that I include it (and yes, it is necessary that I use an array versus a list or arraylist).
public class CharFrequency
{
private char m_character;
private long m_count;
public CharFrequency(char ch)
{
Character = ch;
Count = 0;
}
public CharFrequency(char ch, long charCount)
{
Character = ch;
Count = charCount;
}
public char Character
{
set
{
m_character = value;
}
get
{
return m_character;
}
}
public long Count
{
get
{
return m_count;
}
set
{
if (value < 0)
value = 0;
m_count = value;
}
}
public void Increment()
{
m_count++;
}
public override bool Equals(object obj)
{
bool equal = false;
CharFrequency cf = new CharFrequency('\0', 0);
cf = (CharFrequency)obj;
if (this.Character == cf.Character)
equal = true;
return equal;
}
public override int GetHashCode()
{
return m_character.GetHashCode();
}
public override string ToString()
{
String s = String.Format("'{0}' ({1}) = {2}", m_character, (byte)m_character, m_count);
return s;
}
}
Have a look at this post.
https://codereview.stackexchange.com/questions/63872/counting-the-number-of-character-occurrences
It uses LINQ to achieve your goal
You shouldn't use Contains.
First you need to initialize your Charfreq array:
CharFrequency[] Charfreq = new CharFrequency[128];
for (int i = 0; i < Charfreq.Length; i++)
{
Charfreq[i] = new CharFrequency((char)i);
}
try
Then you can read the file character by character:
int ch;
// -1 means that there are no more characters to read,
// otherwise ch is the char read
while ((ch = file.Read()) != -1)
{
CharFrequency cf = new CharFrequency((char)ch);
// This works because CharFrequency overrides the
// Equals method, and the Equals method checks only
// the Character property of CharFrequency
int ix = Array.IndexOf(Charfreq, cf);
// if there is the "right" charfrequency
if (ix != -1)
{
Charfreq[ix].Increment();
}
}
Note that this isn't the way I would write the program. These are the minimum changes needed to make your program work.
As a side note, this program will only count the "frequency" of ASCII characters (characters with code <= 127).
Also, in your Equals method:
CharFrequency cf = new CharFrequency('\0', 0);
cf = (CharFrequency)obj;
the first line is a useless initialization:
CharFrequency cf = (CharFrequency)obj;
is enough; otherwise you are creating a CharFrequency just to discard it on the next line.
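Because the array has one slot per ASCII code, the character code itself can serve as the index, which avoids the linear Array.IndexOf search for every character read. A sketch of that idea follows, using a plain long[] instead of the CharFrequency class to keep it self-contained; the class name AsciiCounter is made up.

```csharp
using System.IO;

public static class AsciiCounter
{
    // Counts occurrences of each ASCII character (codes 0..127) in the reader's text.
    // counts[c] holds the number of times character code c was seen.
    public static long[] Count(TextReader reader)
    {
        var counts = new long[128];
        int ch;
        // Read() returns -1 at end of input, otherwise the character code.
        while ((ch = reader.Read()) != -1)
        {
            if (ch < counts.Length)
                counts[ch]++; // characters above 127 are ignored
        }
        return counts;
    }
}
```

The same trick works with the initialized Charfreq array from above: since Charfreq[i] was created for (char)i, Charfreq[ch].Increment() can replace the IndexOf lookup whenever ch is below 128.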
A dictionary is well suited for a task like this. You didn't say which character set and encoding the file was in. So, because Unicode is so common, let's assume the Unicode character set and UTF-8 encoding. (After all, it is the default for .NET, Java, JavaScript, HTML, XML,….) If that's not the case then read the file using the applicable encoding and fix your code because you currently are using UTF-8 in your StreamReader.
Next comes iterating across the "characters". And then incrementing the count for a "character" in the dictionary as it is seen in the text.
Unicode does have a few complex features. One is combining characters, where a base character can be overlaid with diacritics and the like. Users perceive such combinations as a single "character", or, as Unicode calls them, graphemes. Thankfully, .NET gives us the StringInfo class, which iterates over them as "text elements".
So, if you think about it, using an array would be quite difficult. You'd have to build your own dictionary on top of your array.
The example below uses a Dictionary and is runnable using a LINQPad script. After it creates the dictionary, it orders and dumps it with a nice display.
var path = Path.GetTempFileName();
// Get some text we know is encoded in UTF-8 to simplify the code below
// and contains combining codepoints as a matter of example.
using (var web = new WebClient())
{
web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path);
}
// since the question asks to analyze a file
var content = File.ReadAllText(path, Encoding.UTF8);
var frequency = new Dictionary<String, int>();
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
while (itor.MoveNext())
{
var element = (String)itor.Current;
if (!frequency.ContainsKey(element))
{
frequency.Add(element, 0);
}
frequency[element]++;
}
var histogram = frequency
.OrderByDescending(f => f.Value)
// jazz it up with the list of codepoints in each text element
.Select(pair =>
{
var bytes = Encoding.UTF32.GetBytes(pair.Key);
var codepoints = new UInt32[bytes.Length/4];
Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
return new {
Count = pair.Value,
textElement = pair.Key,
codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp) ) };
});
histogram.Dump(); // For use in LINQPad

Regex camelcase in c#

I'm trying to use regex to convert a string like this "North Korea"
to a string like "northKorea" - does someone know how I might accomplish this in c# ?
Cheers
If you know all your input strings are in title case (like "North Korea"), you can simply do:
string input = "North Korea";
input = input.Replace(" ",""); //remove spaces
string output = char.ToLower(input[0]) +
input.Substring(1); //make first char lowercase
// output = "northKorea"
If some of your input is not in title case, you can use TextInfo.ToTitleCase:
string input = "NoRtH kORea";
input = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase(input);
input = input.Replace(" ",""); //remove spaces
string output = char.ToLower(input[0]) +
input.Substring(1); //make first char lowercase
// output = "northKorea"
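Since the question specifically asks about regex, here is one way it could look, as a sketch rather than a recommendation. It lower-cases the whole input first, then upper-cases each letter that follows whitespace; the class name and the use of invariant casing are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class RegexCamelCase
{
    public static string ToCamelCase(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return string.Empty;
        // Lower-case everything first so "NoRtH kORea" behaves like "north korea".
        string lowered = input.Trim().ToLowerInvariant();
        // Replace each run of whitespace plus the following letter with that letter upper-cased.
        string joined = Regex.Replace(lowered, @"\s+(\p{L})",
            m => m.Groups[1].Value.ToUpperInvariant());
        // Remove any whitespace that was not followed by a letter (e.g. before a digit).
        return Regex.Replace(joined, @"\s+", "");
    }
}
```

RegexCamelCase.ToCamelCase("North Korea") and RegexCamelCase.ToCamelCase("NoRtH kORea") both return "northKorea". Unlike ToTitleCase, this does not treat punctuation as a word break.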
Forget regex.
All you need is a camelCase conversion algorithm:
See here:
http://www.codekeep.net/snippets/096fea45-b426-40fd-8beb-dec49d8a8662.aspx
Use this one:
string camelCase = ConvertCaseString(a, Case.CamelCase);
Copy-pasted in case it goes offline:
void Main() {
string a = "background color-red.brown";
string camelCase = ConvertCaseString(a, Case.CamelCase);
string pascalCase = ConvertCaseString(a, Case.PascalCase);
}
/// <summary>
/// Converts the phrase to specified convention.
/// </summary>
/// <param name="phrase"></param>
/// <param name="cases">The cases.</param>
/// <returns>string</returns>
static string ConvertCaseString(string phrase, Case cases)
{
string[] splittedPhrase = phrase.Split(' ', '-', '.');
var sb = new StringBuilder();
if (cases == Case.CamelCase)
{
sb.Append(splittedPhrase[0].ToLower());
splittedPhrase[0] = string.Empty;
}
else if (cases == Case.PascalCase)
sb = new StringBuilder();
foreach (String s in splittedPhrase)
{
char[] splittedPhraseChars = s.ToCharArray();
if (splittedPhraseChars.Length > 0)
{
splittedPhraseChars[0] = ((new String(splittedPhraseChars[0], 1)).ToUpper().ToCharArray())[0];
}
sb.Append(new String(splittedPhraseChars));
}
return sb.ToString();
}
enum Case
{
PascalCase,
CamelCase
}
You could just split it and put it back together:
string[] split = ("North Korea").Split(' ');
StringBuilder sb = new StringBuilder();
for (int i = 0; i < split.Count(); i++)
{
if (i == 0)
sb.Append(split[i].ToLower());
else
sb.Append(split[i]);
}
Edit: Switched to a StringBuilder instead, like Bazzz suggested.
This builds on Paolo Falabella's answer as a String extension and handles a few boundary cases such as empty string. Since there is some confusion between CamelCase and camelCase, I called it LowerCamelCase as described on Wikipedia. I resisted the temptation to go with nerdCaps.
internal static string ToLowerCamelCase( this string input )
{
string output = "";
if( String.IsNullOrEmpty( input ) == false )
{
output = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase( input ); //in case not Title Case
output = output.Replace( " ", "" ); //remove any white spaces between words
if( String.IsNullOrEmpty( output ) == false ) //handles the case where input is " "
{
output = char.ToLower( output[0] ) + output.Substring( 1 ); //lowercase first (even if 1 character string)
}
}
return output;
}
Use:
string test = "Foo Bar";
test = test.ToLowerCamelCase();
... //test is now "fooBar"
Update:
toong raised a good point in the comments: this will not work for graphemes. See the link provided by toong if you want to tweak the above code to iterate over graphemes instead of chars.
String::Split definitely is one of my pet peeves. Also, none of the other answers deal with:
Cultures
All forms of word separators
Numbers
What happens when the input starts with word separators
I tried to get it as close as possible to what you would find in base class library code:
static string ToCamelCaseInvariant(string value) { return ToCamelCase(value, true, CultureInfo.InvariantCulture); }
static string ToCamelCaseInvariant(string value, bool changeWordCaps) { return ToCamelCase(value, changeWordCaps, CultureInfo.InvariantCulture); }
static string ToCamelCase(string value) { return ToCamelCase(value, true, CultureInfo.CurrentCulture); }
static string ToCamelCase(string value, bool changeWordCaps) { return ToCamelCase(value, changeWordCaps, CultureInfo.CurrentCulture); }
/// <summary>
/// Converts the given string value into camelCase.
/// </summary>
/// <param name="value">The value.</param>
/// <param name="changeWordCaps">If set to <c>true</c> letters in a word (apart from the first) will be lowercased.</param>
/// <param name="culture">The culture to use to change the case of the characters.</param>
/// <returns>
/// The camel case value.
/// </returns>
static string ToCamelCase(string value, bool changeWordCaps, CultureInfo culture)
{
if (culture == null)
throw new ArgumentNullException("culture");
if (string.IsNullOrEmpty(value))
return value;
var result = new StringBuilder(value.Length);
var lastWasBreak = true;
for (var i = 0; i < value.Length; i++)
{
var c = value[i];
if (char.IsWhiteSpace(c) || char.IsPunctuation(c) || char.IsSeparator(c))
{
lastWasBreak = true;
}
else if (char.IsNumber(c))
{
result.Append(c);
lastWasBreak = true;
}
else
{
if (result.Length == 0)
{
result.Append(char.ToLower(c, culture));
}
else if (lastWasBreak)
{
result.Append(char.ToUpper(c, culture));
}
else if (changeWordCaps)
{
result.Append(char.ToLower(c, culture));
}
else
{
result.Append(c);
}
lastWasBreak = false;
}
}
return result.ToString();
}
// Tests
' This is a test. 12345hello world' = 'thisIsATest12345HelloWorld'
'--north korea' = 'northKorea'
'!nOrTH koreA' = 'northKorea'
'System.Console.' = 'systemConsole'
Try the following:
var input = "Hi my name is Rony";
var subStrs = input.ToLower().Split(' ');
var output = "";
foreach(var s in subStrs)
{
if(s!=subStrs[0])
output += s.First().ToString().ToUpper() + String.Join("", s.Skip(1));
else
output += s;
}
This should produce "hiMyNameIsRony" as the output.
string toCamelCase(string s)
{
if (s.Length < 2) return s.ToLower();
return Char.ToLowerInvariant(s[0]) + s.Substring(1);
}
similar to Paolo Falabella's code but survives empty strings and 1 char strings.

escaping tricky string to CSV format

I have to create a CSV file from webservice output and the CSV file uses quoted strings with comma separator. I cannot change the format...
So if I have a string it becomes a "string"...
If the value has quotes already they are replaced with double quotes.
For example a str"ing becomes "str""ing"...
However, lately my import has been failing because of the following
original input string is: "","word1,word2,..."
every single quote is replaced by double resulting in: """",""word1,word2,...""
then it is prefixed and suffixed with a quote before being written to the CSV file: """"",""word1,word2,..."""
As you can see the final result is this:
""""",""word1,word2,..."""
which breaks my import (it sees it as another field)...
I think the issue is the appearance of "," in the original input string.
Is there a CSV escape sequence for this scenario?
Update
The reason why above breaks is due to BCP mapping file (BCP utility is used to load CSV file into SQL db) which has terminator defined as "," . So instead of seeing 1 field it sees 2...But I cannot change the mapping file...
I use this code and it has always worked:
/// <summary>
/// Turn a string into a CSV cell output
/// </summary>
/// <param name="str">String to output</param>
/// <returns>The CSV cell formatted string</returns>
public static string StringToCSVCell(string str)
{
bool mustQuote = (str.Contains(",") || str.Contains("\"") || str.Contains("\r") || str.Contains("\n"));
if (mustQuote)
{
StringBuilder sb = new StringBuilder();
sb.Append("\"");
foreach (char nextChar in str)
{
sb.Append(nextChar);
if (nextChar == '"')
sb.Append("\"");
}
sb.Append("\"");
return sb.ToString();
}
return str;
}
Based on Ed Bayiates' answer:
/// <summary>
/// Turn a string into a CSV cell output
/// </summary>
/// <param name="value">String to output</param>
/// <returns>The CSV cell formatted string</returns>
private string ConvertToCsvCell(string value)
{
var mustQuote = value.Any(x => x == ',' || x == '\"' || x == '\r' || x == '\n');
if (!mustQuote)
{
return value;
}
value = value.Replace("\"", "\"\"");
return string.Format("\"{0}\"", value);
}
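To see what this quoting rule produces for the input in the question, here is a minimal sketch. The names CsvQuoteDemo and Escape are hypothetical (they are not from the thread); the logic is the same "quote if needed, double embedded quotes" rule used in the two answers above.

```csharp
using System;

public static class CsvQuoteDemo
{
    // Hypothetical helper: wraps the value in quotes only when it contains
    // a comma, a double quote, or a line break, and doubles embedded quotes.
    public static string Escape(string value)
    {
        bool mustQuote = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!mustQuote)
        {
            return value;
        }
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Calling CsvQuoteDemo.Escape with the question's raw value "","word1,word2,..." returns """"",""word1,word2,...""", which is exactly the string the asker was already writing. The escaped cell still contains the character sequence ",", so a consumer that splits on that fixed terminator without tracking quotes will see two fields even though the quoting is consistent.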
My penny thought:
String[] lines = new String[] { "\"\",\"word\",word,word2,1,34,5,2,\"details\"" };
for (int j = 0; j < lines.Length; j++)
{
String[] fields=lines[j].Split(',');
for (int i =0; i<fields.Length; i++)
{
if (fields[i].StartsWith("\"") && fields[i].EndsWith("\""))
{
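// Strip the surrounding quotes, then re-escape any embedded quotes.
// (Calling ToString() on a char[] yields "System.Char[]", so use Substring instead.)
string inner = fields[i].Substring(1, fields[i].Length - 2);
fields[i] = "\"" + inner.Replace("\"", "\"\"") + "\"";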
}
else
fields[i] = fields[i].Replace("\"","\"\"");
}
lines[j]=String.Join(",",fields);
}
Based on the contribution of "Ed Bayiates", here's a helpful class to build a CSV document:
/// <summary>
/// helpful class to build csv document
/// </summary>
public class CsvBuilder
{
/// <summary>
/// create the csv builder
/// </summary>
public CsvBuilder(char csvSeparator)
{
m_csvSeparator = csvSeparator;
}
/// <summary>
/// append a cell
/// </summary>
public void appendCell(string strCellValue)
{
if (m_nCurrentColumnIndex > 0) m_strBuilder.Append(m_csvSeparator);
bool mustQuote = (strCellValue.Contains(m_csvSeparator)
|| strCellValue.Contains('\"')
|| strCellValue.Contains('\r')
|| strCellValue.Contains('\n'));
if (mustQuote)
{
m_strBuilder.Append('\"');
foreach (char nextChar in strCellValue)
{
m_strBuilder.Append(nextChar);
if (nextChar == '"') m_strBuilder.Append('\"');
}
m_strBuilder.Append('\"');
}
else
{
m_strBuilder.Append(strCellValue);
}
m_nCurrentColumnIndex++;
}
/// <summary>
/// end of line, new line
/// </summary>
public void appendNewLine()
{
m_strBuilder.Append(Environment.NewLine);
m_nCurrentColumnIndex = 0;
}
/// <summary>
/// Create the CSV file
/// </summary>
/// <param name="path"></param>
public void save(string path )
{
File.WriteAllText(path, ToString());
}
public override string ToString()
{
return m_strBuilder.ToString();
}
private StringBuilder m_strBuilder = new StringBuilder();
private char m_csvSeparator;
private int m_nCurrentColumnIndex = 0;
}
How to use it:
void exportAsCsv( string strFileName )
{
CsvBuilder csvStringBuilder = new CsvBuilder(';');
csvStringBuilder.appendCell("#Header col 1 : Name");
csvStringBuilder.appendCell("col 2 : Value");
csvStringBuilder.appendNewLine();
foreach (Data data in m_dataSet)
{
csvStringBuilder.appendCell(data.getName());
csvStringBuilder.appendCell(data.getValue());
csvStringBuilder.appendNewLine();
}
csvStringBuilder.save(strFileName);
}
The first step in parsing this is removing the extra " characters added around your string. Once you do this, you should be able to deal with the embedded " as well as the commas.
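The quote-aware parsing described above can be sketched as a small state machine. This is only an illustration under the usual CSV convention (a doubled quote inside a quoted field means one literal quote); CsvSplitDemo and SplitCsvLine are hypothetical names, and the sketch handles a single line with a comma separator.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSplitDemo
{
    // Hypothetical helper: splits one CSV line into fields, treating commas
    // inside quoted fields as data and "" inside quotes as a literal quote.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote: keep one, skip the other
                    i++;
                }
                else
                {
                    inQuotes = false;    // closing quote of the field
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote of the field
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

Given the line a,""""",""word1,word2,...""",b this returns three fields, and the middle one is the original value "","word1,word2,...". A reader that splits on a fixed terminator string instead of tracking quote state cannot make that distinction.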
After much deliberation, it was decided that the import utility's format file needed to be fixed. The escaping of the string was correct (as users indicated), but the format file that the import utility used was incorrect and was breaking the import.
Thanks all, and special thanks to @dbt (upvote).

How to check if directory 1 is a subdirectory of dir2 and vice versa

What is an easy way to check if directory 1 is a subdirectory of directory 2 and vice versa?
I checked the Path and DirectoryInfo helper classes but found no system-ready function for this. I thought it would be in there somewhere.
Do you guys have an idea where to find this?
I tried writing a check myself, but it's more complicated than I had anticipated when I started.
In response to the first part of the question: "Is dir1 a sub-directory of dir2?", this code should work:
public bool IsSubfolder(string parentPath, string childPath)
{
var parentUri = new Uri(parentPath);
var childUri = new DirectoryInfo(childPath).Parent;
while (childUri != null)
{
if(new Uri(childUri.FullName) == parentUri)
{
return true;
}
childUri = childUri.Parent;
}
return false;
}
The URIs (on Windows at least, might be different on Mono/Linux) are case-insensitive. If case sensitivity is important, use the Compare method on Uri instead.
Here's a simpler way to do it using the Uri class:
var parentUri = new Uri(parentPath);
var childUri = new Uri(childPath);
if (parentUri != childUri && parentUri.IsBaseOf(childUri))
{
//dowork
}
See original answer here: https://stackoverflow.com/a/31941159/134761
Case insensitive
Tolerates mix of \ and / folder delimiters
Tolerates ..\ in path
Avoids matching on partial folder names (c:\foobar not a subpath of c:\foo)
Code:
public static class StringExtensions
{
/// <summary>
/// Returns true if <paramref name="path"/> starts with the path <paramref name="baseDirPath"/>.
/// The comparison is case-insensitive, handles / and \ slashes as folder separators and
/// only matches if the base dir folder name is matched exactly ("c:\foobar\file.txt" is not a sub path of "c:\foo").
/// </summary>
public static bool IsSubPathOf(this string path, string baseDirPath)
{
string normalizedPath = Path.GetFullPath(path.Replace('/', '\\')
.WithEnding("\\"));
string normalizedBaseDirPath = Path.GetFullPath(baseDirPath.Replace('/', '\\')
.WithEnding("\\"));
return normalizedPath.StartsWith(normalizedBaseDirPath, StringComparison.OrdinalIgnoreCase);
}
/// <summary>
/// Returns <paramref name="str"/> with the minimal concatenation of <paramref name="ending"/> (starting from end) that
/// results in satisfying .EndsWith(ending).
/// </summary>
/// <example>"hel".WithEnding("llo") returns "hello", which is the result of "hel" + "lo".</example>
public static string WithEnding([CanBeNull] this string str, string ending)
{
if (str == null)
return ending;
string result = str;
// Right() is 1-indexed, so include these cases
// * Append no characters
// * Append up to N characters, where N is ending length
for (int i = 0; i <= ending.Length; i++)
{
string tmp = result + ending.Right(i);
if (tmp.EndsWith(ending))
return tmp;
}
return result;
}
/// <summary>Gets the rightmost <paramref name="length" /> characters from a string.</summary>
/// <param name="value">The string to retrieve the substring from.</param>
/// <param name="length">The number of characters to retrieve.</param>
/// <returns>The substring.</returns>
public static string Right([NotNull] this string value, int length)
{
if (value == null)
{
throw new ArgumentNullException("value");
}
if (length < 0)
{
throw new ArgumentOutOfRangeException("length", length, "Length is less than zero");
}
return (length < value.Length) ? value.Substring(value.Length - length) : value;
}
}
Test cases (NUnit):
[TestFixture]
public class StringExtensionsTest
{
[TestCase(@"c:\foo", @"c:", Result = true)]
[TestCase(@"c:\foo", @"c:\", Result = true)]
[TestCase(@"c:\foo", @"c:\foo", Result = true)]
[TestCase(@"c:\foo", @"c:\foo\", Result = true)]
[TestCase(@"c:\foo\", @"c:\foo", Result = true)]
[TestCase(@"c:\foo\bar\", @"c:\foo\", Result = true)]
[TestCase(@"c:\foo\bar", @"c:\foo\", Result = true)]
[TestCase(@"c:\foo\a.txt", @"c:\foo", Result = true)]
[TestCase(@"c:\FOO\a.txt", @"c:\foo", Result = true)]
[TestCase(@"c:/foo/a.txt", @"c:\foo", Result = true)]
[TestCase(@"c:\foobar", @"c:\foo", Result = false)]
[TestCase(@"c:\foobar\a.txt", @"c:\foo", Result = false)]
[TestCase(@"c:\foobar\a.txt", @"c:\foo\", Result = false)]
[TestCase(@"c:\foo\a.txt", @"c:\foobar", Result = false)]
[TestCase(@"c:\foo\a.txt", @"c:\foobar\", Result = false)]
[TestCase(@"c:\foo\..\bar\baz", @"c:\foo", Result = false)]
[TestCase(@"c:\foo\..\bar\baz", @"c:\bar", Result = true)]
[TestCase(@"c:\foo\..\bar\baz", @"c:\barr", Result = false)]
public bool IsSubPathOfTest(string path, string baseDirPath)
{
return path.IsSubPathOf(baseDirPath);
}
}
Update 2015-08-18: Fix bug matching on partial folder names. Add test cases.
Update 2016-01-29: Link to original question https://stackoverflow.com/a/31941159/134761
DirectoryInfo has a Parent property, which is also of type DirectoryInfo. You can use that to determine whether your directory is a subdirectory of a parent directory.
The second directory's (d2) full name will contain the full name of the first directory (d1) if it is a sub-folder of d1.
This assumes that you are using valid directories.
if (d2.FullName.Contains(d1.FullName))
{
//dowork
}
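One caveat with a plain Contains check is that it also matches sibling folders that merely share a name prefix. The sketch below contrasts it with a separator-terminated prefix check; ContainsPitfallDemo and both method names are hypothetical, and the paths are assumed to be already normalized full names.

```csharp
using System;

public static class ContainsPitfallDemo
{
    // The check from the text above: true whenever the parent string
    // appears anywhere inside the child string.
    public static bool NaiveIsUnder(string childFullName, string parentFullName)
    {
        return childFullName.Contains(parentFullName);
    }

    // Hypothetical alternative: require the parent path followed by a
    // separator at the very start of the child path.
    // Assumption: case-insensitive comparison, as on a typical Windows volume.
    public static bool PrefixIsUnder(string childFullName, string parentFullName, char separator)
    {
        string sep = separator.ToString();
        string prefix = parentFullName.EndsWith(sep, StringComparison.Ordinal)
            ? parentFullName
            : parentFullName + sep;
        return childFullName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase);
    }
}
```

With these, NaiveIsUnder(@"c:\foobar", @"c:\foo") is true even though c:\foobar is a sibling of c:\foo, while PrefixIsUnder(@"c:\foobar", @"c:\foo", '\\') is false and PrefixIsUnder(@"c:\foo\bar", @"c:\foo", '\\') is true.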
If you need to check for mapped drives you could try
static void Main(string[] args)
{
if (GetUNCPath(d2.FullName).ToLower().Contains(GetUNCPath(d1.FullName).ToLower()))
{
}
}
[DllImport("mpr.dll", CharSet = CharSet.Unicode, SetLastError = true)]
private static extern int WNetGetConnection(
[MarshalAs(UnmanagedType.LPTStr)] string localName,
[MarshalAs(UnmanagedType.LPTStr)] StringBuilder remoteName, ref int length);
private static string GetUNCPath(string originalPath)
{
StringBuilder sb = new StringBuilder(512);
int size = sb.Capacity;
// look for the {LETTER}: combination ...
if (originalPath.Length > 2 && originalPath[1] == ':')
{
// don't use char.IsLetter here - as that can be misleading
// the only valid drive letters are a-z && A-Z.
char c = originalPath[0];
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
{
int error = WNetGetConnection(originalPath.Substring(0, 2), sb, ref size);
if (error == 0)
{
DirectoryInfo dir = new DirectoryInfo(originalPath);
string path = Path.GetFullPath(originalPath).Substring(Path.GetPathRoot(originalPath).Length);
return Path.Combine(sb.ToString().TrimEnd(), path);
}
}
}
return originalPath;
}
Code for mapped drive taken from http://social.msdn.microsoft.com/Forums/en/csharpgeneral/thread/6f79f2b3-d092-431f-bc28-d15d93cf5d09
If you have two paths, then look at this:
Normalize directory names in C#
http://filedirectorypath.codeplex.com/ (I don't know the quality of it)
And use this:
var ancestor = new DirectoryPathAbsolute(ancestorPath);
var child = new DirectoryPathAbsolute(childPath);
var res = child.IsChildDirectoryOf(ancestor); //I don't think it actually checks for case-sensitive filesystems
Otherwise, if you want to know whether a directory exists as a subdirectory in a path, take a look at:
Directory.EnumerateDirectories
It was introduced in .NET 4.0. Example:
Does the path contain a directory whose name starts with "Console"?
//* is a wildcard. If you remove it, it searches for directories called "Console"
var res = Directory.EnumerateDirectories(@path, "Console*", SearchOption.AllDirectories).Any();
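public static bool IsSubfolder(DirectoryInfo parentPath, DirectoryInfo childPath)
{
// The child's full path must start with the parent's full path followed by a separator.
string parentPrefix = parentPath.FullName.TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
return childPath.FullName.StartsWith(parentPrefix);
}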
You can use Path.GetDirectoryName Method to get parent directory. It works for directories too.
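As a rough sketch of that idea, the helper below walks up from the child one level at a time with Path.GetDirectoryName and compares each ancestor with the candidate parent. ParentWalkDemo and IsSubdirectoryOf are hypothetical names, and the case-insensitive comparison is an assumption that suits Windows-style file systems.

```csharp
using System;
using System.IO;

public static class ParentWalkDemo
{
    private static readonly char[] Separators =
        { Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar };

    // Hypothetical helper: true when 'parent' is a strict ancestor of 'child'.
    // Assumption: directory names are compared case-insensitively.
    public static bool IsSubdirectoryOf(string child, string parent)
    {
        string target = Path.GetFullPath(parent).TrimEnd(Separators);
        string current = Path.GetFullPath(child).TrimEnd(Separators);

        while (!string.IsNullOrEmpty(current))
        {
            // Step up one level; the result is null once the root has been passed.
            current = Path.GetDirectoryName(current);
            if (current != null &&
                string.Equals(current.TrimEnd(Separators), target, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

Because whole path segments are removed at each step, a sibling such as foobar is never mistaken for a child of foo, and a directory is not reported as a subdirectory of itself.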
With help from the great test cases written in angularsen's answer, I wrote the following simpler extension method on .NET Core 3.1 for Windows:
public static bool IsSubPathOf(this string dirPath, string baseDirPath, StringComparison comparisonType = StringComparison.OrdinalIgnoreCase)
{
dirPath = dirPath.Replace(Path.AltDirectorySeparatorChar, Path.DirectorySeparatorChar);
if (!dirPath.EndsWith(Path.DirectorySeparatorChar))
{
dirPath += Path.DirectorySeparatorChar;
}
baseDirPath = baseDirPath.Replace(Path.AltDirectorySeparatorChar, Path.DirectorySeparatorChar);
if (!baseDirPath.EndsWith(Path.DirectorySeparatorChar))
{
baseDirPath += Path.DirectorySeparatorChar;
}
string dirPathUri = new Uri(dirPath).LocalPath;
string baseDirUri = new Uri(baseDirPath).LocalPath;
return dirPathUri.Contains(baseDirUri, comparisonType);
}
This is what I use, after first verifying that both directory path strings are non-empty and in a path format I understand: shouldnotbechilddirpath.ToUpper().StartsWith(maybeparentdirpath.ToUpper())
Be sure to take out the ToUpper() calls if you might be working on a case-sensitive file system.
You can compare directory2 to directory1's Parent property when using a DirectoryInfo in both cases.
DirectoryInfo d1 = new DirectoryInfo(@"C:\Program Files\MyApp");
DirectoryInfo d2 = new DirectoryInfo(@"C:\Program Files\MyApp\Images");
if(d2.Parent.FullName == d1.FullName)
{
Console.WriteLine ("Sub directory");
}
