How to read/write text and avoid special character signs (&lt;, &gt;, etc.) - C#

I am currently parsing some C# scripts that are stored in a database, extracting the body of some methods in the code, and then writing an XML file that shows the id, the body of the extracted methods, etc.
The problem I have right now is that when I write the code into the XML I have to write it as a literal string, so I thought I'd need to add " at the beginning and end:
new XElement("MethodName", @"""" + Extractor.GetMethodBody(rule.RuleScript, "MethodName") + @"""")
This works, but I have a problem, things that are written in the DB as
for (int n = 1; n < 10; n++)
are written into the XML file (or printed to console) as:
for (int n = 1; n &lt; 10; n++)
How can I get it to print the actual character and not its code? The code in the database is written with the actual characters, not the "safe" &lt;-like ones.

Inside XML (as a text value) it is correct for < to be encoded as &lt;. The internal representation of the XML doesn't affect the value, so let it get encoded. You can get around this by forcing a CDATA section, but in all honesty it isn't worth it. Here is an example using CDATA anyway:
string noEncoding = new XElement("foo", new XCData("a < b")).ToString();

Why do you think that you have to write it as a literal string? That is not so. Besides, you are not writing it as a literal string at all; it's still a dynamic string value, you have only added quotation marks around it.
A literal string is a string that is written literally in the code, like "Hello world". If you get the string in any other way, it's not a literal string.
The quotation marks that you have added simply become part of the value; they don't do anything else to the string. You can add the string without the quotation marks just fine:
new XElement("MethodName", Extractor.GetMethodBody(rule.RuleScript, "MethodName"))
Now, the characters are encoded when they are put in the XML because they need to be encoded. You can't put a < character inside a value without encoding it.
If you show the XML, you will see the encoded values, and that is just a sign that it works as it should. When you read the XML, the encoded characters will be decoded, and you end up with the original string.
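To make the round trip concrete, here is a minimal sketch. The class and method names (XmlRoundTripSketch, Serialize, ReadBack) are made up for illustration, and the method body is passed in as a plain string instead of coming from the asker's Extractor:

```csharp
using System.Xml.Linq;

public static class XmlRoundTripSketch
{
    // Builds an element from raw text; XElement escapes markup characters on output.
    public static string Serialize(string methodBody)
    {
        return new XElement("MethodName", methodBody).ToString();
    }

    // Parses the serialized element and returns its decoded text value.
    public static string ReadBack(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Calling Serialize("for (int n = 1; n < 10; n++)") gives markup that contains &lt;, and passing that markup to ReadBack returns the original text with the real < character.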

I don't know what software he's going to use to read the XML, but any that I know of will throw an error on parsing any XML that does not escape < and > chars which aren't used as tag starts and ends. It's just part of the XML specification; these chars are reserved as part of the structure.
If I were you, then, I'd part ways with the System.Xml utilities and write this file myself. Any decent XML tool is going to encode those chars for you, so you should probably not use one. Go with a StreamWriter and create the output the way you are being told to. That way you control the XML output yourself, even if it means breaking the XML specification.
using (StreamWriter sw = new StreamWriter("c:\\xmlText.xml", false, Encoding.UTF8))
{
    sw.WriteLine("<?xml version=\"1.0\"?>");
    sw.WriteLine("<Class>");
    sw.Write("\t<Method Name=\"MethodName\">");
    sw.Write(@"""" + Extractor.GetMethodBody(rule.RuleScript, "MethodName") + @"""");
    sw.WriteLine("</Method>");
    // ... and so on and so forth
    sw.WriteLine("</Class>");
}

Related

HtmlAgilityPack treats everything after < (less than sign) as attributes

I have some input I get via a textarea and I convert that input into a html document, that is later parsed into a PDF document.
When my users input the less than sign (<), everything breaks in my HtmlDocument. HtmlAgilityPack suddenly handles everything after the less than sign as an attribute. See the output:
Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">
It gets a little better if I just add the
htmlDocument.OptionOutputOptimizeAttributeValues = true;
which gives me:
Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>
I have tried all of the options on the htmldocument and none of them lets me specify that the parser should not be strict. On the other hand I might be able to live with it stripping away the <, but adding all the equal signs doesn't really work for me.
void Main()
{
    var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
    var htmlDoc = WrapContentInHtml(input);
    htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}
private HtmlDocument WrapContentInHtml(string content)
{
    var htmlBuilder = new StringBuilder();
    htmlBuilder.AppendLine("<!DOCTYPE html>");
    htmlBuilder.AppendLine("<html>");
    htmlBuilder.AppendLine("<head>");
    htmlBuilder.AppendLine("<title></title>");
    htmlBuilder.AppendLine("</head>");
    htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
    htmlBuilder.AppendLine(content);
    htmlBuilder.AppendLine("</div></body></html>");
    var htmlDocument = new HtmlDocument();
    htmlDocument.OptionOutputOptimizeAttributeValues = true;
    var htmlDoc = htmlBuilder.ToString();
    htmlDocument.LoadHtml(htmlDoc);
    return htmlDocument;
}
Does anybody have an idea how I can solve this problem?
The closest question I can find is this:
Losing the 'less than' sign in HtmlAgilityPack loadhtml
Where he actually complains about the < disappearing which would be fine for me. Of course fixing the parsing error is the best solution.
EDIT:
I am using HtmlAgilityPack 1.4.9
Your content is blatantly wrong. This is not about "strictness", it's really about the fact that you're pretending a piece of text is valid HTML. In fact, the results you are getting are exactly because the parser is not strict.
When you need to insert plain text into HTML, you need to encode it first, so that all the various HTML control characters are converted to HTML properly - for example, < must be changed to &lt; and & to &amp;.
One way to handle this is to use the DOM - use InnerText on the target div, instead of slapping strings together and pretending they're HTML. Another is to use some explicit encoding method - for example HttpUtility.HtmlEncode.
You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (which also has HttpServerUtility.HtmlEncode):
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());
Result:
Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).
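As a rough sketch of how the encoding step could slot into the asker's wrapper, the version below encodes the user text before it is concatenated into the markup. It returns a plain string and leaves out HtmlAgilityPack so that it only depends on the base class library; HtmlWrapSketch is a made-up name, and the shortened markup is an assumption for brevity:

```csharp
using System.Net;
using System.Text;

public static class HtmlWrapSketch
{
    // Encodes the user-supplied text first, then wraps it in hand-built markup.
    public static string WrapContentInHtml(string content)
    {
        var htmlBuilder = new StringBuilder();
        htmlBuilder.Append("<body><div id='sagsfremstillingContainer'>");
        // HtmlEncode turns < into &lt; and & into &amp;, so the parser sees text, not tags.
        htmlBuilder.Append(WebUtility.HtmlEncode(content));
        htmlBuilder.Append("</div></body>");
        return htmlBuilder.ToString();
    }
}
```

With this, WrapContentInHtml("a < b & c") puts a &lt; b &amp; c inside the div, and the resulting string can then be handed to LoadHtml as before.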

Best Method of standard string to XML legal string - C#

Currently my understanding of XML-legal strings is that all that is required is to replace any instances of &, ", ', <, > with &amp; &quot; &apos; &lt; &gt;
So I made the following parser:
private static string ToXmlCompliantStr(string uriStr)
{
    string uriXml = uriStr;
    // The ampersand must be replaced first, or the entities added below would be escaped again.
    uriXml = uriXml.Replace("&", "&amp;");
    uriXml = uriXml.Replace("\"", "&quot;");
    uriXml = uriXml.Replace("'", "&apos;");
    uriXml = uriXml.Replace("<", "&lt;");
    uriXml = uriXml.Replace(">", "&gt;");
    return uriXml;
}
I am aware that there are similar questions out there with good answers (which is how I was able to write this function). I am asking whether this code will translate ANY string that C# can throw at it so that XDocument parses it as part of a whole document without complaint. All the questions I've found state that these are the only escape characters, not that escaping them yields a 100% valid XML string. I've gone as far as reading through the decompiled XNode class to see how it parses it.
Thanks
Firstly, you should absolutely not do this yourself. Use an XML API - that way you can trust that to do the right thing, rather than worrying about covering corner cases etc. You generally shouldn't be trying to come up with an "escaped string" at all - you should pass the string to the XElement constructor (or XAttribute, or whatever your situation is).
In other words, I think you should try really hard to design your application so that you don't need a method of the kind you've shown in your question at all. Look at where you'd be using that method, and see whether you can just create an XElement (or whatever) instead. In my experience, you'll have a much better time if you treat XML as a data structure in itself rather than just as text.
Secondly, you need to understand that in XML 1.0 at least, there are Unicode characters that cannot be validly represented in XML, no matter how much escaping you use. In particular, values U+0000 to U+001F are unrepresentable other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return). Also if you have a string which contains invalid UTF-16 (e.g. an unmatched half of a surrogate pair), that can't be correctly represented in XML.
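If the input might contain such characters, one option is to filter them out before building the element. The sketch below is only an illustration: XmlCharSketch and RemoveInvalidXmlChars are made-up names, and silently dropping characters is an assumption that may not suit every application (rejecting the input could be the better choice). It relies on XmlConvert.IsXmlChar for single characters and handles surrogates separately:

```csharp
using System.Text;
using System.Xml;

public static class XmlCharSketch
{
    // Returns the text without characters that XML 1.0 cannot represent.
    public static string RemoveInvalidXmlChars(string text)
    {
        var result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsSurrogate(c))
            {
                // Keep a well-formed high/low pair; drop an unmatched half.
                if (char.IsSurrogatePair(text, i))
                {
                    result.Append(c).Append(text[i + 1]);
                    i++; // skip the low surrogate that was just copied
                }
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

The cleaned string can then be passed to the XElement constructor, which takes care of the ordinary escaping.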

Strip < Character from XML content

I have an XML document that contains data with a < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
It gives the error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
Is there a way to strip those characters from the XML before the Load call?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers not to parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
As the comments state, you do not have an XML document. If you know that the only way that these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This will find a < directly followed by a digit and properly encode it, leaving the digit in place.
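A small sketch of that repair step, assuming the only defect is a < directly in front of a digit. It uses a lookahead so that the digit itself is not consumed by the replacement. NotesRepairSketch and its method names are invented for this example:

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class NotesRepairSketch
{
    // Escapes every '<' that is immediately followed by a digit.
    public static string EscapeLessThanBeforeDigit(string rawXml)
    {
        return Regex.Replace(rawXml, @"<(?=\d)", "&lt;");
    }

    // Repairs the text, loads it, and returns the decoded content of <Notes>.
    public static string ReadNotes(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(EscapeLessThanBeforeDigit(rawXml));
        return doc.SelectSingleNode("/Tunings/Notes").InnerText;
    }
}
```

For the sample document, ReadNotes returns the note text with the < restored by the parser, so nothing is lost from the value.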

Escape Sequence in String

I have an issue where if I pass a hardcoded string \x1B|200uF as a parameter the command accepts it correctly.
But, when I retrieve the same value from an XML element into a new string variable I get the following value : \\x1B|200uF
As you can see there is an extra escape sequence.
So in summary the problem :
using (XmlReader xmlReader = XmlReader.Create(@"PosPrinter.xml"))
{
while (xmlReader.Read())
{
if (xmlReader.NodeType == XmlNodeType.Element)
{
switch (xmlReader.Name)
{
case "PosPrinter":
_printer.LogicalName = xmlReader.GetAttribute("LogicalName");
break;
case "FeedReceiptCommand":
_printer.FeedReceiptCommand = xmlReader.GetAttribute("value");
break;
I retrieve the value into my 'FeedReceiptCommand' string. As I mentioned above, the value is stored in the XML as \x1B|200uF but is retrieved into the string as \\x1B|200uF, with the extra escape sequence at the beginning.
I then call my command using the string variable FeedReceiptCommand :
_posPrinter.PrintNormal(PrinterStation.Receipt, PrinterSettings.FeedReceiptCommand );
But the command doesn't execute because of the extra escape sequence.
But if I call the same command with the value hardcoded:
_posPrinter.PrintNormal(PrinterStation.Receipt, "\x1B|200uF");
... then the command gets executed successfully..
The value \x1B|200uF is the ESC command to send to an Epson TM-T88V printer using Microsoft.PointOfService, in which the '\x' is for hex, I think, and 1B is the hex value.
I have tried to get rid of the extra escape sequence by using 'Trim', 'Substring', and even doing a foreach loop to loop each char in the string to build a new one. I also tried stringbuilder.
But I'm missing the point somewhere here.
So any help would be appreciated in how I can pass a variable in place of the \x1B|200uF
The problem lies (as @OlafDietsche pointed out) indeed in the XML file. In the C# string, \x1B means "the character with code 1B (hex) or 27 (dec)"; in XML it's just the four characters.
So you'll have to encode the special character inside your XML document differently. Theoretically, you'd simply replace \x1B with &#x1B;, which is the XML way of saying "the character number 1B (hex)". The problem in this specific case, however, is that &#x1B; is not allowed in XML. The valid characters in an XML document are defined here: http://www.w3.org/TR/xml/#charsets
Note how #x1B is not part of this range.
You could use a different character to represent Escape in the XML and replace it inside your C# code. Make sure to use a surrogate character that a) is a valid XML character and b) would never be used as actual data.
For example, choose xFEED as your escape char (as it is easy to recognize). So your document looks like:
<FeedReceiptCommand value="&#xFEED;|200uF"/>
In your C# code, replace it accordingly:
string actualValue = reader.GetAttribute("value").Replace('\xFEED', '\x1B');
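Put together, the placeholder idea could look roughly like this. The sketch parses the element with XElement instead of the asker's XmlReader loop, purely to keep it short; EscapePlaceholderSketch and ReadFeedCommand are hypothetical names:

```csharp
using System.Xml.Linq;

public static class EscapePlaceholderSketch
{
    private const char Placeholder = '\uFEED'; // stand-in stored in the XML file
    private const char Escape = '\u001B';      // the real ESC control character

    // Reads the "value" attribute and swaps the placeholder for ESC.
    public static string ReadFeedCommand(string xml)
    {
        string raw = (string)XElement.Parse(xml).Attribute("value");
        return raw.Replace(Placeholder, Escape);
    }
}
```

The string returned for the sample element starts with the actual ESC character followed by |200uF, which is what the hardcoded C# literal contains.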
In the hardcoded string, you have the character hex 1B for ESC plus the string |200uF. I haven't seen your XML, but I guess in XML, you have the string \x1B literally, which is four characters \, x, 1 and B.
That's the difference between the two, hardcoded and XML.
AFAIK, there is no way to include control characters such as ESC literally in an XML 1.0 document. You might try to encode it as &#x1B; and parse it yourself, if it is not delivered properly by your XML parser.

How to read double quotes (") in a text file in C#?

I have to read a text file and then to parse it, in C# using VS 2010. The sample text is as follows,
[TOOL_TYPE]
; provides the name of the selected tool for programming
“Phoenix Select Advanced”;
[TOOL_SERIAL_NUMBER]
; provides the serial number for the tool
7654321;
[PRESSURE_CORRECTION]
; provides the Pressure correction information requirement
“Yes”;
[SURFACE_MOUNT]
; provides the surface mount information
“Yes”;
[SAPPHIRE_TYPE]
; provides the sapphire type information
“No”;
Now I have to parse only the string data (in double quotes) and headers (in square brackets[]), and then save it into another text file. I can successfully parse the headers but the string data in double quotes is not appearing correctly, as shown below.
[TOOL_TYPE]
�Phoenix Select Advanced�;
[TOOL_SERIAL_NUMBER]
7654321;
[PRESSURE_CORRECTION]
�Yes�;
[SURFACE_MOUNT]
�Yes�;
[SAPPHIRE_TYPE]
�No�;
[EXTENDED_TELEMETRY]
�Yes�;
[OVERRIDE_SENSE_RESISTOR]
�No�;
Please note a special character (�) which is appearing every time whenever a double quotes appear.
How can I write the double quotes(") in the destination file and avoid (�) ?
Update
I am using the following line for my parsing
temporaryconfigFileWriter.WriteLine(configFileLine, false, Encoding.Unicode);
Here is the complete code I am using:
string temporaryConfigurationFileName = System.Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "\\Temporary_Configuration_File.txt";
//Pointers to read from Configuration File 'configFileReader' and to write to Temporary Configuration File 'temporaryconfigFileWriter'
StreamReader configFileReader = new StreamReader(CommandLineVariables.ConfigurationFileName);
StreamWriter temporaryconfigFileWriter = new StreamWriter(temporaryConfigurationFileName);
//Check whether the 'END_OF_FILE' header is specified or not, to avoid searching for end of file indefinitely
if ((File.ReadAllText(CommandLineVariables.ConfigurationFileName)).Contains("[END_OF_FILE]"))
{
//Read the file until it reaches the 'END_OF_FILE'
while (!((configFileLine = configFileReader.ReadLine()).Contains("[END_OF_FILE]")))
{
configFileLine = configFileLine.Trim();
if (!(configFileLine.StartsWith(";")) && !(string.IsNullOrEmpty(configFileLine)))
{
temporaryconfigFileWriter.WriteLine(configFileLine, false, Encoding.UTF8);
}
}
// to write the last header [END_OF_FILE]
temporaryconfigFileWriter.WriteLine(configFileLine);
configFileReader.Close();
temporaryconfigFileWriter.Close();
}
Your input file doesn't contain straight double quotes. It contains typographic opening and closing double quotes (“ and ”), not the standard " character.
First you must ensure that you are reading your input with the correct encoding. (Try several and display the string in a text box in C#; you'll see pretty quickly whether the characters show correctly.)
If you want such characters to appear in your output, you must write the output file in something other than ASCII. If you write it as UTF-8, for example, you should ensure that it starts with the byte order mark. (Otherwise it will be readable, but some software such as Notepad will display two characters, as it won't detect that the file isn't ASCII.)
Another choice is to simply replace “ and ” with ".
It appears that you are using proper typographic quotes (“...”) instead of the straight ASCII ones ("..."). My guess would be that you read the text file with the wrong encoding.
If you can see them properly in Notepad and neither ASCII nor one of the Unicode encodings works, then it's probably code page 1252. You can get that encoding via
Encoding.GetEncoding(1252)
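For the alternative of replacing the typographic quotes with straight ones, a minimal sketch follows. QuoteSketch, NormalizeQuotes, and CopyNormalized are made-up names, and the source encoding is left as a parameter because the right one depends on how the file was saved (on newer .NET runtimes, code page 1252 may need an encoding provider to be registered first):

```csharp
using System.IO;
using System.Text;

public static class QuoteSketch
{
    // Replaces the opening (U+201C) and closing (U+201D) quotes with a straight ".
    public static string NormalizeQuotes(string line)
    {
        return line.Replace('\u201C', '"').Replace('\u201D', '"');
    }

    // Reads the source with the given encoding and writes UTF-8 with a byte order mark.
    public static void CopyNormalized(string sourcePath, string targetPath, Encoding sourceEncoding)
    {
        string[] lines = File.ReadAllLines(sourcePath, sourceEncoding);
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = NormalizeQuotes(lines[i]);
        }
        File.WriteAllLines(targetPath, lines, new UTF8Encoding(true)); // true = emit the BOM
    }
}
```

This only helps once the file is being read with the correct encoding; if the quotes have already been decoded into the replacement character, there is nothing left to replace.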
