XmlException when loading an XML file with certain characters


I need to use the XmlDocument class to load an XML file:
var doc = new XmlDocument();
doc.Load(filename);
Unfortunately, I get an XmlException when my XML contains certain characters that I use to represent my data. In particular, I have a node like the following:
<rect data="string with invalid characters: † ¶"/>
So, the forbidden characters are: † and ¶.
How can I load the file without exceptions while leaving these characters in my XML file?

You'll need to replace those characters with numeric character references. Just as you replace > and < with &gt; and &lt;, you would replace † with &#8224; and ¶ with &#182;.
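As a rough illustration of that replacement, here is a minimal sketch that rewrites every non-ASCII character in a string as a decimal numeric character reference. The class and method names (XmlEscaper, EscapeNonAscii) are made up for this example, and it assumes the text has already been decoded correctly and that you apply it to attribute values or text content, not to element names.

```csharp
using System.Text;

static class XmlEscaper
{
    // Hypothetical helper: replaces every non-ASCII character with a decimal
    // numeric character reference (for example, † becomes &#8224;).
    // ASCII characters are copied unchanged.
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c < 128)
            {
                sb.Append(c);
                continue;
            }

            int codePoint;
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Combine a surrogate pair into a single code point.
                codePoint = char.ConvertToUtf32(c, text[i + 1]);
                i++;
            }
            else
            {
                codePoint = c;
            }
            sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}
```

For example, XmlEscaper.EscapeNonAscii("† ¶") gives "&#8224; &#182;", which can then be written into the data attribute.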

Alternatively, if you have no control over the source of the XML and just need to read all of the values into a database or something, you could use an XmlTextReader to read through the XML node by node, stop on the element you know may contain bad data, and read the chars of that element. I've had to do this in the past. Something like this:
static void Main(string[] args)
{
    var xtr = new XmlTextReader("");
    xtr.Normalization = false;
    while (xtr.Read())
    {
        if (xtr.IsStartElement("Row")) // My xml doc contains many row elements
        {
            var fields = new string[6];
            while (xtr.Read())
            {
                for (int i = 0; i < 6; i++) // I know my xml only has six child elements per row
                {
                    while (!xtr.IsStartElement())
                    {
                        xtr.Read(); // We're not interested in hitting the end elements
                    }
                    if (i == 1) // I know my special characters are in the second child element of my row
                    {
                        var charBuff = new char[255];
                        // ReadChars returns the number of characters actually read
                        int read = xtr.ReadChars(charBuff, 0, 255); // I know there will be a maximum of 255 characters
                        fields[i] = new string(charBuff, 0, read); // Avoid trailing '\0' padding
                    }
                    else
                    {
                        fields[i] = xtr.ReadElementContentAsString();
                    }
                }
            }
        }
    }
}

Related

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I therefore need to remove all control character sequences of this kind to be able to parse these files, and I'm unsure how to check programmatically whether a character is part of a control sequence or not.

I found out that there are two wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
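private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        var start = input.IndexOf('\u0004');
        if (start == -1) break;
        if (start + 1 < input.Length && input[start + 1] == '<')
        {
            // Pattern "EOT<": drop both characters
            input = input.Remove(start, 2);
        }
        else if (start + 3 < input.Length && input[start + 2] == '\u0003')
        {
            // Pattern "EOT, comma, ETX, extra char": drop all four characters
            input = input.Remove(start, 4);
        }
        else
        {
            // Unrecognized pattern: drop the lone EOT so the loop always makes progress
            input = input.Remove(start, 1);
        }
    }
    return input;
}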
A further cleanup with this code:
static string StripExtended(string arg)
{
    StringBuilder buffer = new StringBuilder(arg.Length); // Max length
    foreach (char ch in arg)
    {
        UInt16 num = Convert.ToUInt16(ch); // In .NET, chars are UTF-16
        // The basic characters have the same code points as ASCII, and the extended characters are bigger
        if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
    }
    return buffer.ToString();
}
And now everything looks fine to parse.
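On the question of how to check a character programmatically: one option, sketched below, is to keep only the characters that XML 1.0 permits, using XmlConvert.IsXmlChar from System.Xml. The class and method names are hypothetical. This is a less aggressive filter than StripExtended, since accented letters such as è survive, but it does not remove the stray comma or the extra character after ETX, so the pattern-specific cleanup above is still needed. Characters outside the Basic Multilingual Plane (surrogate pairs) are not specifically handled in this sketch.

```csharp
using System.Linq;
using System.Xml;

static class XmlSanitizer
{
    // Hypothetical helper: keeps only characters that XML 1.0 allows.
    // Control characters such as EOT (U+0004) and ETX (U+0003) are dropped;
    // tab, CR, LF and ordinary printable characters are kept.
    public static string RemoveIllegalXmlChars(string input)
    {
        return new string(input.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }
}
```

A call such as XmlSanitizer.RemoveIllegalXmlChars(xml) could run before RemoveInvalidCharacters is no longer able to find its EOT markers, so apply the pattern-based removal first and this filter afterwards.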

Sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think the XML file you are trying to sanitize was originally an .xml.p7m file.
I believe the correct way to sanitize the file is to use a library such as BouncyCastle (available for Java and .NET) and its CmsSignedData class:
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
    using (var stream = new MemoryStream())
    {
        cmsObj.SignedContent.Write(stream);
        content = stream.ToArray();
    }
}

Check if list contains a string that matches closely

I'm trying to figure out the most efficient way to implement the following scenario:
I have a list like this:
public static IEnumerable<string> ValidTags = new List<string> {
    "ABC.XYZ",
    "PQR.SUB.UID",
    "PQR.ALI.OBD",
};
I have a huge CSV with multiple columns. One of the columns is tags. This column contains either blank values or one of the above values. The problem is that the tag column may contain values like "ABC.XYZ?#", i.e. a valid tag plus some extraneous characters. I need to update such cells with the valid tag, since they "closely match" one of our valid tags.
Example:
if the CSV contains PQR.ALI.OBD? update it with the valid tag PQR.ALI.OBD
if the CSV contains PQR.ALI.OBA, this is invalid, so just add the suffix -invalid and update it to PQR.ALI.OBA-invalid.
I'm trying to figure out the best possible way to do this.
My current approach is:
Iterate through each column in CSV, get the tagValue
Now check if our tagValue contains any of the string from list
If it contains but is not exactly the same, update it with the value it contains.
If it doesn't "contain" any value from the list, add the suffix -invalid.
Is there any better/more efficient way to do this?
Update:
The list has only 5 items, I have shown three here.
The extra chars are only at the end, and that's happening because people are editing those CSVs in excel web version and that messes up some entries.
My current code: (I'm sure there is a better way to do this, also new at C# so please tell me how I can improve this). I'm using CSVHelper to get CSV cells.
var record = csv.GetRecord<Record>();
string tag = csv.GetField(10); //tag column number in CSV is 10
/* Criteria for validation:
 * tag matches our list, but has extraneous chars - strip extraneous chars and update csv
 * tag doesn't match our list - add suffix invalid.*/
int listIndex = 0;
bool valid;
foreach (var validTags in ValidTags) //ValidTags is the enum above
{
    if (validTags.Contains(tag.ToUpper()) && !string.Equals(validTags, subjectIdentifier.ToUpper()))
    {
        valid = true;
        continue; //move on to next csv row.
        //this means that tag is valid but has some extra characters appended to it because of web excel, strip extra charts
    }
    listIndex++;
    if (listIndex == 3 && !valid) {
        //means we have reached the end of the list but not found valid tag
        //add suffix invalid and move on to next csv row
    }
}

Since you say that the extra characters are only at the end, and assuming that the original tag is still present before the extra characters, you could just search the list for each tag to see if the tag contains an entry from the list. If it does, then update it to the correct entry if it's not an exact match, and if it doesn't, append the "-invalid" tag to it.
Before doing this, we may need to first sort the list Descending so that when we're searching we find the closest (longest) match (in a case where one item in the list begins with another item in the list).
var csvPath = @"f:\public\temp\temp.csv";
var entriesUpdated = 0;
// Order the list so we match on the most similar match (ABC.DEF before ABC)
var orderedTags = ValidTags.OrderByDescending(t => t);
var newFileLines = new List<string>();
// Read each line in the file
foreach (var csvLine in File.ReadLines(csvPath))
{
    // Get the columns
    var columns = csvLine.Split(',');
    // Process each column
    for (int index = 0; index < columns.Length; index++)
    {
        var column = columns[index];
        switch (index)
        {
            case 0: // tag column
                var correctTag = orderedTags.FirstOrDefault(tag =>
                    column.IndexOf(tag, StringComparison.OrdinalIgnoreCase) > -1);
                if (correctTag != null)
                {
                    // This item contains a correct tag, so
                    // update it if it's not an exact match
                    if (column != correctTag)
                    {
                        columns[index] = correctTag;
                        entriesUpdated++;
                    }
                }
                else
                {
                    // This column does not contain a correct tag, so mark it as invalid
                    columns[index] += "-invalid";
                    entriesUpdated++;
                }
                break;
            // Other cases for other columns follow if needed
        }
    }
    newFileLines.Add(string.Join(",", columns));
}
// Write the new lines if any were changed
if (entriesUpdated > 0) File.WriteAllLines(csvPath, newFileLines);
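Because the stray characters only ever appear at the end of the cell, a prefix check is a slightly stricter variant of the "contains" search. The sketch below isolates that rule in one small function; TagFixer and NormalizeTag are names invented for this example, and it assumes blank cells should be left untouched and that cells carry no leading whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TagFixer
{
    // Hypothetical helper: returns the valid tag that the raw cell starts with,
    // or the raw cell plus "-invalid" when no valid tag matches.
    public static string NormalizeTag(string raw, IEnumerable<string> validTags)
    {
        if (string.IsNullOrWhiteSpace(raw)) return raw; // blank cells stay blank

        // Try the longest tags first so that "ABC.DEF" wins over "ABC".
        string match = validTags
            .OrderByDescending(t => t.Length)
            .FirstOrDefault(t => raw.StartsWith(t, StringComparison.OrdinalIgnoreCase));

        return match ?? raw + "-invalid";
    }
}
```

With the list from the question, NormalizeTag("PQR.ALI.OBD?", ValidTags) returns "PQR.ALI.OBD" and NormalizeTag("PQR.ALI.OBA", ValidTags) returns "PQR.ALI.OBA-invalid". With only five tags, the per-cell cost is negligible even for a very large CSV.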

Replacing XElement content with XElement

Is there a way to selectively replace XElement content with other XElements?
I have this XML:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
And I want to render it as this:
<prompt>
There is something I want to tell you.<break time="3s"/>
You are my favorite caller today.<break time="1s"/>
Have a great day!
</prompt>
I need to replace the placeholders with actual XElements, but when I try to alter the content of an XElement, .NET of course escapes all of the angle brackets. I understand why the content would normally need to be correctly escaped, but I need to bypass that behavior and inject XML directly into content.
Here's my code that would otherwise work.
MatchCollection matches = Regex.Matches(content, @"\[(\w+)=(\d+)]");
foreach (XElement element in voiceXmlDocument.Descendants("prompt"))
{
    if (matches[0] == null)
        continue;
    element.Value = element.Value.Replace(matches[0].Value, @"<break time=""5s""/>");
}
This is a work in progress, so don't worry so much about the validity of the RegEx pattern, as I will work that out later to match several conditions. This is proof of concept code and the focus is on replacing the placeholders as described. I only included the iteration and RegEx code here to illustrate that I need to be able to do this to a whole document that is already populated with content.

You can use XElement.Parse() method:
First, get the outer xml of your XElement, for example,
string outerXml = element.ToString();
Then you have exactly this string to work with:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
Then you can do your replacement
outerXml = outerXml.Replace(matches[0].Value, @"<break time=""5s""/>");
Then you can parse it back:
XElement repElement = XElement.Parse(outerXml);
And, finally, replace original XElement:
element.ReplaceWith(repElement);
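The steps above can be combined into one pass over the document. This is only a sketch: PauseExpander and ExpandPauses are hypothetical names, the pattern handles just the [pause=N] placeholder, and it assumes that prompt elements are not nested and that each one has a parent (ReplaceWith fails on a parentless root element).

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class PauseExpander
{
    private static readonly Regex Pause = new Regex(@"\[pause=(\d+)\]");

    // Hypothetical helper: replaces each [pause=N] placeholder inside
    // <prompt> elements with a real <break time="Ns"/> element.
    public static void ExpandPauses(XContainer document)
    {
        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (XElement prompt in document.Descendants("prompt").ToList())
        {
            string outerXml = prompt.ToString(SaveOptions.DisableFormatting);
            if (!Pause.IsMatch(outerXml)) continue;

            // Edit the markup as a string, then parse it back into an element.
            outerXml = Pause.Replace(outerXml, "<break time=\"${1}s\"/>");
            prompt.ReplaceWith(XElement.Parse(outerXml, LoadOptions.PreserveWhitespace));
        }
    }
}
```

Working on the serialized markup means any text that merely looks like a placeholder inside an attribute would also be rewritten, so the XText-based approaches are safer when the input is not under your control.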

The key to all of this is XText, which lets you add text as a node alongside elements.
This is the loop:
foreach (XElement prompt in voiceXmlDocument.Descendants("prompt"))
{
    string text = prompt.Value;
    prompt.RemoveAll();
    foreach (string phrase in text.Split('['))
    {
        string[] parts = phrase.Split(']');
        if (parts.Length > 1)
        {
            string[] pause = parts[0].Split('=');
            prompt.Add(new XElement("break", new XAttribute("time", pause[1])));
            // add a + "s" if you REALLY want it, but then you have to get rid
            // of it later in some other code.
        }
        prompt.Add(new XText(parts[parts.Length - 1]));
    }
}
This is the end result
<prompt>
There is something I want to tell you.<break time="3" />
You are my favorite caller today.<break time="1" />
Have a great day!
</prompt>

class Program
{
    static void Main(string[] args)
    {
        var xml =
            @"<prompt>There is something I want to tell you.[pause=3] You are my favorite caller today.[pause=1] Have a great day!</prompt>";
        var voiceXmlDocument = XElement.Parse(xml);
        var pattern = new Regex(@"\[(\w+)=(\d+)]");
        foreach (var element in voiceXmlDocument.DescendantsAndSelf("prompt"))
        {
            var matches = pattern.Matches(element.Value);
            foreach (var match in matches)
            {
                var matchValue = match.ToString();
                var number = Regex.Match(matchValue, @"\d+").Value;
                var newValue = string.Format(@"<break time=""{0}""/>", number);
                element.Value = element.Value.Replace(matchValue, newValue);
            }
        }
        Console.WriteLine(voiceXmlDocument.ToString());
    }
}

Oh, my goodness, you guys were quicker than I expected! So, thanks for that, however in the meantime, I solved it a slightly different way. The code here looks expanded from before because once I got it working, I added some specifics into this particular condition:
foreach (XElement element in voiceXmlDocument.Descendants("prompt").ToArray())
{
    // convert the element to a string and see if there are any instances
    // of pause placeholders in it
    string elementAsString = element.ToString();
    MatchCollection matches = Regex.Matches(elementAsString, @"\[pause=(\d+)]");
    // if there were no matches or an empty set, move on to the next one
    if (matches == null || matches.Count == 0)
        continue;
    // iterate through the indexed matches
    for (int i = 0; i < matches.Count; i++)
    {
        int pauseValue = 0; // capture the original pause value specified by the user
        int pauseMilliSeconds = 1000; // if things go wrong, use a 1 second default
        if (matches[i].Groups.Count == 2) // the value is expected to be in the second group
        {
            // if the value could be parsed to an integer, convert it from 1/8 seconds to milliseconds
            if (int.TryParse(matches[i].Groups[1].Value, out pauseValue))
                pauseMilliSeconds = pauseValue * 125;
        }
        // replace the specific match with the new <break> tag content
        elementAsString = elementAsString.Replace(matches[i].Value, string.Format(@"<break time=""{0}ms""/>", pauseMilliSeconds));
    }
    // finally replace the element by parsing
    element.ReplaceWith(XElement.Parse(elementAsString));
}

Oh, my goodness, you guys were quicker than I expected!
Doh! Might as well post my solution anyway!
foreach (var element in xml.Descendants("prompt"))
{
    Queue<string> pauses = new Queue<string>(Regex.Matches(element.Value, @"\[pause *= *\d+\]")
        .Cast<Match>()
        .Select(m => m.Value));
    Queue<string> text = new Queue<string>(element.Value.Split(pauses.ToArray(), StringSplitOptions.None));
    element.RemoveAll();
    while (text.Any())
    {
        element.Add(new XText(text.Dequeue()));
        if (pauses.Any())
            element.Add(new XElement("break", new XAttribute("time", Regex.Match(pauses.Dequeue(), @"\d+"))));
    }
}
For every prompt element, Regex match all your pauses and put them in a queue.
Then use these pauses to delimit the inner text, grab the 'other' text, and put that in a queue.
Clear all data from the element using RemoveAll and then iterate over your delimited data and re-add it as the appropriate data type. When you are adding in the new attributes you can use Regex to get the number value out of the original match.

How to format and read CSV file?

Here is just an example of the data I need to format.
The first column is simple; the problem is the second column.
What would be the best approach to format multiple data fields in one column?
How to parse this data?
Important: the second column needs to contain multiple values, as in the example below:
Name    Details
Alex    Age:25
        Height:6
        Hair:Brown
        Eyes:Hazel

A csv should probably look like this:
Name,Age,Height,Hair,Eyes
Alex,25,6,Brown,Hazel
Each cell should be separated by exactly one comma from its neighbor.
You can reformat it as such by using a simple regex which replaces certain newline and non-newline whitespace with commas (you can easily find each block because it has values in both columns).

A CSV file is normally defined using commas as field separators and a line break (CRLF) as the row separator. You are using line breaks within your second column, which will cause problems. You'll need to reformat your second column to use some other separator between multiple values. A common alternate separator is the | (pipe) character.
Your format would then look like:
Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel
In your parsing, you would first parse the comma separated fields (which would return two values), and then parse the second field as pipe separated.
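A minimal sketch of that two-level parse follows. PipeCsv and ParseLine are hypothetical names, and the code assumes every line has exactly the shape shown above: a name, one comma, then pipe-separated Key:Value pairs with no repeated keys and no quoting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PipeCsv
{
    // Hypothetical helper: parses "Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel".
    public static (string Name, Dictionary<string, string> Details) ParseLine(string line)
    {
        // First level: split on the first comma only, giving name and packed details.
        string[] fields = line.Split(new[] { ',' }, 2);

        // Second level: split the details on pipes, then each pair on its first colon.
        Dictionary<string, string> details = fields[1]
            .Split('|')
            .Select(pair => pair.Split(new[] { ':' }, 2))
            .ToDictionary(kv => kv[0].Trim(), kv => kv[1].Trim());

        return (fields[0].Trim(), details);
    }
}
```

For the sample row, ParseLine returns the name "Alex" and a dictionary in which Details["Hair"] is "Brown". A line without a comma or a pair without a colon would throw, so real input needs validation first.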

This is an interesting one. Files in an ad hoc format can be quite difficult to parse, which is why people often write specific classes to deal with them. More conventional file formats like CSV, or other delimited formats, are easier to read because they are formatted in a consistent way.
A problem like the above can be addressed in the following way:
1) What should the output look like?
In your instance, and this is just a guess, but I believe you are aiming for the following:
Name, Age, Height, Hair, Eyes
Alex, 25, 6, Brown, Hazel
In which case, you have to parse out this information based on the structure above. If it's repeated blocks of text like the above then we can say the following:
a. Every person is in a block starting with Name Details
b. The name value is the first text after Details, with the other columns being delimited in the format Column:Value
However, you might also have sections with additional attributes, or attributes that are missing if the original input was optional, so tracking the column and ordinal would be useful too.
So one approach might look like the following:
public void ParseFile()
{
    String currentLine;
    bool newSection = false;
    //Store the column names and ordinal position here.
    List<String> nameOrdinals = new List<String>();
    nameOrdinals.Add("Name"); //IndexOf == 0
    //Use this to store each person's details
    Dictionary<Int32, List<String>> nameValues = new Dictionary<Int32, List<String>>();
    Int32 rowNumber = 0;
    using (TextReader reader = File.OpenText("D:\\temp\\test.txt"))
    {
        //This will read the file one row at a time until there are no more rows to read
        while ((currentLine = reader.ReadLine()) != null)
        {
            string[] lineSegments = currentLine.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
            //Looking for a Name Details Line - Start of a new section
            if (lineSegments.Length == 2
                && String.Compare(lineSegments[0], "Name", StringComparison.InvariantCultureIgnoreCase) == 0
                && String.Compare(lineSegments[1], "Details", StringComparison.InvariantCultureIgnoreCase) == 0)
            {
                rowNumber++;
                newSection = true;
                continue;
            }
            if (newSection && lineSegments.Length > 1) //We can start adding a new person's details - we know that
            {
                nameValues.Add(rowNumber, new List<String>());
                nameValues[rowNumber].Insert(nameOrdinals.IndexOf("Name"), lineSegments[0]);
                //Get the first column:value item
                ParseColonSeparatedItem(lineSegments[1], nameOrdinals, nameValues, rowNumber);
                newSection = false;
                continue;
            }
            if (lineSegments.Length > 0 && lineSegments[0] != String.Empty) //Ignore empty lines
            {
                ParseColonSeparatedItem(lineSegments[0], nameOrdinals, nameValues, rowNumber);
            }
        }
    }
    //At this point we should have collected a big list of items. We can then write out the CSV.
    //We can use a StringBuilder for now, although your requirements will
    //be dependent upon how big the source files are.
    //Write out the columns
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < nameOrdinals.Count; i++)
    {
        if (i == nameOrdinals.Count - 1)
        {
            builder.Append(nameOrdinals[i]);
        }
        else
        {
            builder.AppendFormat("{0},", nameOrdinals[i]);
        }
    }
    builder.Append(Environment.NewLine);
    foreach (int key in nameValues.Keys)
    {
        List<String> values = nameValues[key];
        for (int i = 0; i < values.Count; i++)
        {
            if (i == values.Count - 1)
            {
                builder.Append(values[i]);
            }
            else
            {
                builder.AppendFormat("{0},", values[i]);
            }
        }
        builder.Append(Environment.NewLine);
    }
    //At this point you now have a StringBuilder containing the CSV data you can write to a file or similar
}

private void ParseColonSeparatedItem(string textToSeparate, List<String> columns, Dictionary<Int32, List<String>> outputStorage, int outputKey)
{
    if (String.IsNullOrWhiteSpace(textToSeparate)) { return; }
    string[] colVals = textToSeparate.Split(new[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
    List<String> outputValues = outputStorage[outputKey];
    if (!columns.Contains(colVals[0]))
    {
        //Add the column to the list of expected columns. The index of the column determines its index in the output
        columns.Add(colVals[0]);
    }
    if (outputValues.Count < columns.Count)
    {
        outputValues.Add(colVals[1]);
    }
    else
    {
        //We insert the value into the list at the place where the column index expects it to be.
        //That way we can miss values in certain sections yet still have the expected output
        outputStorage[outputKey].Insert(columns.IndexOf(colVals[0]), colVals[1]);
    }
}
After running this against your file, the string builder contains:
"Name,Age,Height,Hair,Eyes\r\nAlex,25,6,Brown,Hazel\r\n"
Which matches the above (\r\n is effectively the Windows new line marker)
This approach demonstrates how a custom parser might work. It is deliberately verbose, there is plenty of refactoring that could take place, and it is just an example.
Improvements would include:
1) This function assumes there are no spaces in the actual text items themselves. This is a pretty big assumption and, if wrong, would require a different approach to parsing out the line segments. However, this only needs to change in one place: as you read a line at a time, you could apply a regex, or just read in characters and assume that everything after the first "column:" section is a value, for example.
2) There is no exception handling.
3) Text output is not quoted. You could test each value to see if it's a date or number; if not, wrap it in quotes, as other programs (like Excel) will then attempt to preserve the underlying datatypes more effectively.
4) It assumes no column names are repeated. If they are, then you have to check whether a column item has already been added, and then create a ColName2 column in the parsing section.

"\r\n" appears as small square boxes in word document, C#

I am appending some text containing '\r\n' into a word document at run-time.
But when I see the word document, they are replaced with small square boxes :-(
I tried replacing them with System.Environment.NewLine but still I see these small boxes.
Any idea?

The answer is to use \v (vertical tab, U+000B), which Word treats as a manual line break.
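A minimal sketch of that substitution, assuming the text is handed to Word as a plain string (for example through automation; the question does not say how the text is appended). WordText and ToWordLineBreaks are names invented for this example.

```csharp
static class WordText
{
    // Hypothetical helper: converts Windows, old Mac and Unix line endings
    // to the vertical tab character before the text is given to Word.
    public static string ToWordLineBreaks(string text)
    {
        return text
            .Replace("\r\n", "\v") // handle CRLF first so it yields a single break
            .Replace('\r', '\v')
            .Replace('\n', '\v');
    }
}
```

The CRLF replacement has to come first; otherwise each Windows line ending would turn into two breaks.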

Have you tried one or the other in isolation, i.e. \r or \n? Word interprets a carriage return and a line feed separately. The only time you would use Environment.NewLine is in a plain ASCII text file; Word handles those characters differently. You could even try a Ctrl+M sequence. If that does not work, please post the code.

Word uses the <w:br/> XML element for line breaks.

After much trial and error, here is a function that sets the text for a Word XML node, and takes care of multiple lines:
//Sets the text for a Word XML <w:t> node
//If the text is multi-line, it replaces the single <w:t> node for multiple nodes
//Resulting in multiple Word XML lines
private static void SetWordXmlNodeText(XmlDocument xmlDocument, XmlNode node, string newText)
{
    //Is the text a single line or multiple lines?
    if (newText.Contains(System.Environment.NewLine))
    {
        //The new text is a multi-line string, split it to individual lines
        var lines = newText.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        //And add XML nodes for each line so that Word XML will accept the new lines
        var xmlBuilder = new StringBuilder();
        for (int count = 0; count < lines.Length; count++)
        {
            //Ensure the "w" prefix is set correctly, otherwise docFrag.InnerXml will fail with exception
            xmlBuilder.Append("<w:t xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\">");
            xmlBuilder.Append(lines[count]);
            xmlBuilder.Append("</w:t>");
            //Not the last line? add line break
            if (count != lines.Length - 1)
            {
                xmlBuilder.Append("<w:br xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\" />");
            }
        }
        //Create the XML fragment with the new multiline structure
        var docFrag = xmlDocument.CreateDocumentFragment();
        docFrag.InnerXml = xmlBuilder.ToString();
        node.ParentNode.AppendChild(docFrag);
        //Remove the single line child node that was originally holding the single line text, only required if there was a node there to start with
        node.ParentNode.RemoveChild(node);
    }
    else
    {
        //Text is not multi-line, let the existing node have the text
        node.InnerText = newText;
    }
}
