Replace character references for invalid XML characters

Replace character references for invalid XML characters - c#

I am projecting some data as XML from SQL Server using ADO.NET. Some of my data contains characters that are invalid in XML, such as CHAR(7) (known as BEL).
SELECT 'This is BEL: ' + CHAR(7) AS A FOR XML RAW
SQL Server encodes such invalid characters as numeric references:
<row A="This is BEL: " />
However, even the encoded form is invalid under XML 1.0, and will give rise to errors in XML parsers:
var doc = XDocument.Parse("<row A=\"This is BEL: \" />");
// XmlException: ' ', hexadecimal value 0x07, is an invalid character. Line 1, position 25.
I would like to replace all these invalid numeric references with the Unicode replacement character, '�'. I know how to do this for unencoded XML:
string str = "<row A=\"This is BEL: \u0007\" />";
if (str.Any(c => !XmlConvert.IsXmlChar(c)))
str = new string(str.Select(c => XmlConvert.IsXmlChar(c) ? c : '�').ToArray());
// <row A="This is BEL: �" />
Is there a straightforward way to make it work for encoded XML too? I would prefer to avoid having to HtmlDecode then HtmlEncode the whole string, in order not to risk introducing changes other than invalid character replacement.
Edit: The conversion needs to be done in my C# code, not SQL, in order for it to be implemented centrally.

I made another go at it using regular expressions. This should handle both decimal and hex character codes. Also, this will not affect anything but numerically encoded characters.
public string ReplaceXMLEncodedCharacters(string input)
{
const string pattern = #"&#(x?)([A-Fa-f0-9]+);";
MatchCollection matches = Regex.Matches(input, pattern);
int offset = 0;
foreach (Match match in matches)
{
int charCode = 0;
if (string.IsNullOrEmpty(match.Groups[1].Value))
charCode = int.Parse(match.Groups[2].Value);
else
charCode = int.Parse(match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
char character = (char)charCode;
input = input.Remove(match.Index - offset, match.Length).Insert(match.Index - offset, character.ToString());
offset += match.Length - 1;
}
return input;
}

You can wrap the special characters in the CDATA tag. This informs the parser to ignore text within the tag. To use your example:
SELECT 'This is BEL: <![CDATA[' + CHAR(7) + ']]>' AS A FOR XML RAW
This will allow the XML to be parsed at the very least, albeit requiring a slight change to the document structure.

For reference, this is my solution. I've built on Tonkleton's answer, but modified it to match the internal implementation of HtmlDecode more closely. The code below ignores surrogate pairs.
// numeric character references
static readonly Regex ncrRegex = new Regex("&#x?[A-Fa-f0-9]+;");
static string ReplaceInvalidXmlCharacterReferences(string input)
{
if (input.IndexOf("&#") == -1) // optimization
return input;
return ncrRegex.Replace(input, match =>
{
string ncr = match.Value;
uint num;
var frmt = NumberFormatInfo.InvariantInfo;
bool isParsed =
ncr[2] == 'x' ? // the x must be lowercase in XML documents
uint.TryParse(ncr.Substring(3, ncr.Length - 4), NumberStyles.AllowHexSpecifier, frmt, out num) :
uint.TryParse(ncr.Substring(2, ncr.Length - 3), NumberStyles.Integer, frmt, out num);
return isParsed && !XmlConvert.IsXmlChar((char)num) ? "�" : ncr;
});
}

Related

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.

I have find out that there are 2 wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
private static string RemoveInvalidCharacters(string input)
{
while (true)
{
var start = input.IndexOf('\u0004');
if (start == -1) break;
if (input[start + 1] == '<')
{
input = input.Remove(start, 2);
continue;
}
if (input[start + 2] == '\u0003')
{
input = input.Remove(start, 4);
}
}
return input;
}
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.

sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think originally the xml file you are trying to sanitize was a .xml.p7m file.
I believe the correct way to sanitize the file is by using a library such as Buoncycastle in java or dotnet and the class CmsSignedData.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}

Splitting data with inconsistent delimiters

I have these data files comming in on a server that i need to split into [date time] and [value]. Most of them are delimited a single time between time and value and between date and time is a space. I already have a program processing the data with a simple split(char[]) but now found data where the delimiter is a space and i am wondering how to tackle this best.
So most files i encountered look like this:
18-06-2014 12:00:00|220.6
The delimiters vary, but i tackled that with a char[]. But today i ran into a problem on this format:
18-06-2014 12:00:00 220.6
This complicates things a little. The easy solution would be to just add a space to my split characters and when i find 3 splits combine the first two before processing?
I'm looking for a 2nd opining on this matter. Also the time format can change to something like d/m/yy and the amount of lines can run into the millions so i would like to keep it as efficient as possible.

Yes I believe the most efficient solution is to add space as a delimiter and then just combine the first two if you get three. That is going to be be more efficient than regex.

You've got a string 18-06-2014 12:00:00 220.6 where first 19 characters is a date, one character is a separation symbol and other characters is a value. So:
var test = "18-06-2014 12:00:00|220.6";
var dateString = test.Remove(19);
var val = test.Substring(20);
Added normalization:
static void Main(string[] args) {
var test = "18-06-2014 12:00:00|220.6";
var test2 = "18-6-14 12:00:00|220.6";
var test3 = "8-06-14 12:00:00|220.6";
Console.WriteLine(test);
Console.WriteLine(TryNormalizeImportValue(test));
Console.WriteLine(test2);
Console.WriteLine(TryNormalizeImportValue(test2));
Console.WriteLine(test3);
Console.WriteLine(TryNormalizeImportValue(test3));
}
private static string TryNormalizeImportValue(string value) {
var valueSplittedByDateSeparator = value.Split('-');
if (valueSplittedByDateSeparator.Length < 3) throw new InvalidDataException();
var normalizedDay = NormalizeImportDayValue(valueSplittedByDateSeparator[0]);
var normalizedMonth = NormalizeImportMonthValue(valueSplittedByDateSeparator[1]);
var valueYearPartSplittedByDateTimeSeparator = valueSplittedByDateSeparator[2].Split(' ');
if (valueYearPartSplittedByDateTimeSeparator.Length < 2) throw new InvalidDataException();
var normalizedYear = NormalizeImportYearValue(valueYearPartSplittedByDateTimeSeparator[0]);
var valueTimeAndValuePart = valueYearPartSplittedByDateTimeSeparator[1];
return string.Concat(normalizedDay, '-', normalizedMonth, '-', normalizedYear, ' ', valueTimeAndValuePart);
}
private static string NormalizeImportDayValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportMonthValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportYearValue(string value) {
return value.Length == 4 ? value : DateTime.Now.Year.ToString(CultureInfo.InvariantCulture).Remove(2) + value;
}

Well you can use this one to get the date and the value.
(((0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-(19|20)\d\d)\s((\d{2}:?){3})|(\d+\.?\d+))
This will give you 2 matches
1º 18-06-2014 12:00:00
2º 220.6
Example:
http://regexr.com/391d3

This regex matches both kinds of strings, capturing the two tokens to Groups 1 and 2.
Note that we are not using \d because in .NET it can match any Unicode digits such as Thai...
The key is in the [ |] character class, which specifies your two allowable delimiters
Here is the regex:
^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$
In the demo, please pay attention to the capture Groups in the right pane.
Here is how to retrieve the values:
var myRegex = new Regex(#"^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$", RegexOptions.IgnoreCase);
string mydate = myRegex.Match(s1).Groups[1].Value;
Console.WriteLine(mydate);
string myvalue = myRegex.Match(s1).Groups[1].Value;
Console.WriteLine(myvalue);
Please let me know if you have questions

Given the provided format I'd use something like
char delimiter = ' '; //or whatever the delimiter for the specific file is, this can be set in a previous step
int index = line.LastIndexOf(delimiter);
var date = line.Remove(index);
var value = line.Substring(++index);
If there are that many lines and efficiency matters, you could obtain the delimiter once on the first line, by looping back from the end and find the first index that is not a digit or dot (or comma if the value can contain those) to determine the delimiter, and then use something such as the above.
If each line can contain a different delimiter, you could always track back to the first not value char as described above and still maintain adequate performance.
Edit: for completeness sake, to find the delimiter, you could perform the following once per file (provided that the delimiter stays consistent within the file)
char delimiter = '\0';
for (int i = line.Length - 1; i >= 0; i--)
{
var c= line[i];
if (!char.IsDigit(c) && c != '.')
{
delimiter = c;
break;
}
}

Check string for invalid characters? Smartest way?

I would like to check some string for invalid characters. With invalid characters I mean characters that should not be there. What characters are these? This is different, but I think thats not that importan, important is how should I do that and what is the easiest and best way (performance) to do that?
Let say I just want strings that contains 'A-Z', 'empty', '.', '$', '0-9'
So if i have a string like "HELLO STaCKOVERFLOW" => invalid, because of the 'a'.
Ok now how to do that? I could make a List<char> and put every char in it that is not allowed and check the string with this list. Maybe not a good idea, because there a lot of chars then. But I could make a list that contains all of the allowed chars right? And then? For every char in the string I have to compare the List<char>? Any smart code for this? And another question: if I would add A-Z to the List<char> I have to add 25 chars manually, but these chars are as I know 65-90 in the ASCII Table, can I add them easier? Any suggestions? Thank you

You can use a regular expression for this:
Regex r = new Regex("[^A-Z0-9.$ ]$");
if (r.IsMatch(SomeString)) {
// validation failed
}
To create a list of characters from A-Z or 0-9 you would use a simple loop:
for (char c = 'A'; c <= 'Z'; c++) {
// c or c.ToString() depending on what you need
}
But you don't need that with the Regex - pretty much every regex engine understands the range syntax (A-Z).

I have only just written such a function, and an extended version to restrict the first and last characters when needed. The original function merely checks whether or not the string consists of valid characters only, the extended function adds two integers for the numbers of valid characters at the beginning of the list to be skipped when checking the first and last characters, in practice it simply calls the original function 3 times, in the example below it ensures that the string begins with a letter and doesn't end with an underscore.
StrChr(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"));
StrChrEx(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", 11, 1));
BOOL __cdecl StrChr(CHAR* str, CHAR* chars)
{
for (int s = 0; str[s] != 0; s++)
{
int c = 0;
while (true)
{
if (chars[c] == 0)
{
return false;
}
else if (str[s] == chars[c])
{
break;
}
else
{
c++;
}
}
}
return true;
}
BOOL __cdecl StrChrEx(CHAR* str, CHAR* chars, UINT excl_first, UINT excl_last)
{
char first[2] = {str[0], 0};
char last[2] = {str[strlen(str) - 1], 0};
if (!StrChr(str, chars))
{
return false;
}
if (excl_first != 0)
{
if (!StrChr(first, chars + excl_first))
{
return false;
}
}
if (excl_last != 0)
{
if (!StrChr(last, chars + excl_last))
{
return false;
}
}
return true;
}

If you are using c#, you do this easily using List and contains. You can do this with single characters (in a string) or a multicharacter string just the same
var pn = "The String To ChecK";
var badStrings = new List<string>()
{
" ","\t","\n","\r"
};
foreach(var badString in badStrings)
{
if(pn.Contains(badString))
{
//Do something
}
}

If you're not super good with regular expressions, then there is another way to go about this in C#. Here is a block of code I wrote to test a string variable named notifName:
var alphabet = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z";
var numbers = "0,1,2,3,4,5,6,7,8,9";
var specialChars = " ,(,),_,[,],!,*,-,.,+,-";
var validChars = (alphabet + "," + alphabet.ToUpper() + "," + numbers + "," + specialChars).Split(',');
for (int i = 0; i < notifName.Length; i++)
{
if (Array.IndexOf(validChars, notifName[i].ToString()) < 0) {
errorFound = $"Invalid character '{notifName[i]}' found in notification name.";
break;
}
}
You can change the characters added to the array as needed. The Array IndexOf method is the key to the whole thing. Of course if you want commas to be valid, then you would need to choose a different split character.

Not enough reps to comment directly, but I recommend the Regex approach. One small caveat: you probably need to anchor both ends of the input string, and you will want at least one character to match. So (with thanks to ThiefMaster), here's my regex to validate user input for a simple arithmetical calculator (plus, minus, multiply, divide):
Regex r = new Regex(#"^[0-9\.\-\+\*\/ ]+$");

I'd go with a regex, but still need to add my 2 cents here, because all the proposed non-regex solutions are O(MN) in the worst case (string is valid) which I find repulsive for religious reasons.
Even more so when LINQ offers a simpler and more efficient solution than nesting loops:
var isInvalid = "The String To Test".Intersect("ALL_INVALID_CHARS").Any();

C# removing white spaces in an HTML string

is it possible to remove all white spaces in the following HTML string in C#:
"
<html>
<body>
</body>
</html>
"
Thanks

When dealing with HTML or any markup for that matter, it's usually best to run it through a parser that truly understands the rules of that markup.
The first benefit is that it can tell you if your initial input data is garbage to start with.
If the parser is smart enough it might even be able to correct badly formed markup automatically, or accept it with relaxed rules.
You can then modify the parsed content....and get the parser to write out the changes...this way you can be sure the markup rules are followed and you have correct output.
For some simple HTML markup scenarios or for markup that is so badly formed a parser just balks on it straight away, then yes you can revert to hacking the input string...with string replacements, etc....it all depends on your needs as to which approach you take.
Here are a couple of tools that can help you out:
HTML Tidy
You can use HTML Tidy and just specify some options/rules on how you want your HTML to be tidied up (e.g. remove superfluous whitespace).
It's a WIN32 DLL...but there are C# Wrappers for it.
http://tidy.sourceforge.net
http://robertbeal.com/37/sanitising-html
C# version of HTML Tidy?
http://geekswithblogs.net/mnf/archive/2011/06/08/implementations-of-html-tidylib-for-.net.aspx
HtmlAgilityPack
You can use HtmlAgilityPack to parse HTML if you need to understand the structure better and perhaps do your own tidying up/restructuring.
http://html-agility-pack.net

myString = myString.Replace(System.Environment.NewLine, "");

You can use a regular expression to match white space characters for the replace:
s = RegEx.Replace(s, #"\s+", String.Empty);

I used this solution (in my opinion it works well. See also test code):
Add an extension method to trim the HTML string:
public static string RemoveSuperfluousWhitespaces(this string input)
{
if (input.Length < 3) return input;
var resultString = new StringBuilder(); // Using StringBuilder is much faster than using regular expressions here!
var inputChars = input.ToCharArray();
var index1 = 0;
var index2 = 1;
var index3 = 2;
// Remove superfluous white spaces from the html stream by the following replacements:
// '<no whitespace>' '>' '<whitespace>' ==> '<no whitespace>' '>'
// '<whitespace>' '<' '<no whitespace>' ==> '<' '<no whitespace>'
while (index3 < inputChars.Length)
{
var char1 = inputChars[index1];
var char2 = inputChars[index2];
var char3 = inputChars[index3];
if (!Char.IsWhiteSpace(char1) && char2 == '>' && Char.IsWhiteSpace(char3))
{
// drop whitespace character in char3
index3++;
}
else if (Char.IsWhiteSpace(char1) && char2 == '<' && !Char.IsWhiteSpace(char3))
{
// drop whitespace character in char1
index1 = index2;
index2 = index3;
index3++;
}
else
{
resultString.Append(char1);
index1 = index2;
index2 = index3;
index3++;
}
}
// (index3 >= inputChars.Length)
resultString.Append(inputChars[index1]);
resultString.Append(inputChars[index2]);
var str = resultString.ToString();
return str;
}
// 2) add test code:
[Test]
public void TestRemoveSuperfluousWhitespaces()
{
var html1 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html2 = $"<td class=\"keycolumn\">{Environment.NewLine}<p class=\"mandatory\">Some recipe parameter name</p>{Environment.NewLine}</td>";
var html3 = $"<td class=\"keycolumn\">{Environment.NewLine} <p class=\"mandatory\">Some recipe parameter name</p> {Environment.NewLine}</td>";
var html4 = " <td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html5 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td> ";
var compactedHtml1 = html1.RemoveSuperfluousWhitespaces();
compactedHtml1.Should().BeEquivalentTo(html1);
var compactedHtml2 = html2.RemoveSuperfluousWhitespaces();
compactedHtml2.Should().BeEquivalentTo(html1);
var compactedHtml3 = html3.RemoveSuperfluousWhitespaces();
compactedHtml3.Should().BeEquivalentTo(html1);
var compactedHtml4 = html4.RemoveSuperfluousWhitespaces();
compactedHtml4.Should().BeEquivalentTo(html1);
var compactedHtml5 = html5.RemoveSuperfluousWhitespaces();
compactedHtml5.Should().BeEquivalentTo(html1);
}

Convert character entities to their unicode equivalents

I have html encoded strings in a database, but many of the character entities are not just the standard & and <. Entities like “ and —. Unfortunately we need to feed this data into a flash based rss reader and flash doesn't read these entities, but they do read the unicode equivalent (ex “).
Using .Net 4.0, is there any utility method that will convert the html encoded string to use unicode encoded character entities?
Here is a better example of what I need. The db has html strings like: <p>John & Sarah went to see $ldquo;Scream 4$rdquo;.</p> and what I need to output in the rss/xml document with in the <description> tag is: <p>John &#38; Sarah went to see &#8220;Scream 4&#8221;.</p>
I'm using an XmlTextWriter to create the xml document from the database records similar to this example code http://www.dotnettutorials.com/tutorials/advanced/rss-feed-asp-net-csharp.aspx
So I need to replace all of the character entities within the html string from the db with their unicode equivilant because the flash based rss reader doesn't recognize any entities beyond the most common like &.

My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.
If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 unicode value. Then re-insert it into the string.
EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):
string input = "Something with — or other character entities.";
StringBuilder output = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
if (input[i] == '&')
{
int startOfEntity = i; // just for easier reading
int endOfEntity = input.IndexOf(';', startOfEntity);
string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity);
int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
output.Append("&#" + unicodeNumber + ";");
i = endOfEntity; // continue parsing after the end of the entity
}
else
output.Append(input[i]);
}
I may have an off-by-one error somewhere in there, but it should be close.

would HttpUtility.HtmlDecode work for you?
I realize it doesn't convert to unicode equivalent entities, but instead converts it to unicode. Is there a specific reason you want the unicode equivalent entities?
updated edit
string test = "<p>John & Sarah went to see “Scream 4”.</p>";
string decode = HttpUtility.HtmlDecode(test);
string encode = HttpUtility.HtmlEncode(decode);
StringBuilder builder = new StringBuilder();
foreach (char c in encode)
{
if ((int)c > 127)
{
builder.Append("&#");
builder.Append((int)c);
builder.Append(";");
}
else
{
builder.Append(c);
}
}
string result = builder.ToString();

you can download a local copy of the appropriate HTML and/or XHTML DTDs from the W3C. Then set up an XmlResolver and use it to expand any entities found in the document.
You could use a regular expression to find/expand the entities, but that won't know anything about context (e.g., anything in a CDATA section shouldn't be expanded).

this might help you put input path in textbox
try
{
FileInfo n = new FileInfo(textBox1.Text);
string initContent = File.ReadAllText(textBox1.Text);
int contentLength = initContent.Length;
Match m;
while ((m = Regex.Match(initContent, "[^a-zA-Z0-9<>/\\s(&#\\d+;)-]")).Value != String.Empty)
initContent = initContent.Remove(m.Index, 1).Insert(m.Index, string.Format("&#{0};", (int)m.Value[0]));
File.WriteAllText("outputpath", initContent);
}
catch (System.Exception excep)
{
MessageBox.Show(excep.Message);
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Replace character references for invalid XML characters - c#

Related

Remove control characters sequence from string EOT comma ETX

Splitting data with inconsistent delimiters

Check string for invalid characters? Smartest way?

C# removing white spaces in an HTML string

Convert character entities to their unicode equivalents

Categories

Resources