Problems with XSLT and Special Characters - c#

On my web app (ASP.net 4,C#) I use FOR XML PATH('') to convert Data from SQL Server to XML,
and use the following lines to feed it to XSLT:
MemoryStream stream = new MemoryStream(UTF8Encoding.UTF8.GetBytes(xml));
XPathDocument document = new XPathDocument(stream);
StringWriter writer = new StringWriter();
XslCompiledTransform transform = new XslCompiledTransform();
transform.Load(xsltPath);
transform.Transform(document, null, writer);
return writer.ToString();
Now when I feed messages from my forum, in sunny day scenarios, there should be no problem at all and there isn't.
When a user decides to use special characters like < > in their messages though, there we have the rainy day.
I get an error which by the way differs from time to time (From message to message depending on what they write there).
I have already tried disable-output-escaping="yes"
Needless to say, I want the users to be able to use some tags like
<a href... or <font ...>
Below is an example of one of the messages that causes the issue:
setting-->about phone----< software update
Any possible solutions?

You need to encode such special characters. As far as XML is concerned, there are 5 of them:
> - &gt;
< - &lt;
& - &amp;
" - &quot;
' - &apos;
You need to encode these in the user input.
An alternative is to place all user generated content within <![CDATA[ ]]> sections, which effectively achieves the same.
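As a minimal sketch of the encoding step, assuming the message text is still available as a plain C# string before it is placed into the XML: SecurityElement.Escape replaces the five characters listed above, and building the element with XElement escapes text content automatically. The names XmlTextEscaper, Escape, WrapInElement and the Message element are made up for illustration and do not come from the question.

```csharp
using System;
using System.Security;
using System.Xml.Linq;

public static class XmlTextEscaper
{
    // Replaces <, >, &, " and ' with their predefined XML entities.
    public static string Escape(string userText)
    {
        return SecurityElement.Escape(userText);
    }

    // Hypothetical helper: lets LINQ to XML do the escaping while
    // wrapping the text in an element with the given name.
    public static string WrapInElement(string elementName, string userText)
    {
        return new XElement(elementName, userText).ToString();
    }
}
```

For the sample message above, Escape should return setting--&gt;about phone----&lt; software update. Tags that users are allowed to post, such as <a href...>, would have to be handled separately, since this escapes every angle bracket.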

Related

YASR - Yet another search and replace question

Environment: asp.net c# openxml
Ok, so I've been reading a ton of snippets and trying to recreate the wheel, but I'm hoping that someone can help me get to my destination faster. I have multiple documents that I need to merge together... check... I'm able to do that with openxml sdk. Birds are singing, sun is shining so far. Now that I have the document the way I want it, I need to search and replace text and/or content controls.
I've tried using my own text - {replace this} but when I look at the xml (rename docx to zip and view the file), the { is nowhere near the text. So I either need to know how to protect that within the document so they don't diverge or I need to find another way to search and replace.
I'm able to search/replace if it is an xml file, but then I'm back to not being able to combine the documents easily.
Code below... and as I mentioned... document merge works fine... just need to replace stuff.
* Update * changed my replace call to go after the tag instead of regex. I have the right info now, but the .Replace call doesn't seem to want to work. Last four lines are for validation that I was seeing the right tag contents. I simply want to replace those contents now.
protected void exeProcessTheDoc(object sender, EventArgs e)
{
string doc1 = Server.MapPath("~/Templates/doc1.docx");
string doc2 = Server.MapPath("~/Templates/doc2.docx");
string final_doc = Server.MapPath("~/Templates/extFinal.docx");
File.Delete(final_doc);
File.Copy(doc1, final_doc);
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(final_doc, true))
{
string altChunkId = "AltChunkId2";
MainDocumentPart mainPart = myDoc.MainDocumentPart;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(doc2, FileMode.Open))
chunk.FeedData(fileStream);
AltChunk altChunk = new AltChunk();
altChunk.Id = altChunkId;
mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
mainPart.Document.Save();
}
exeSearchReplace(final_doc);
}
public static void GetPropertyFromDocument(string document, string outdoc)
{
XmlDocument xmlProperties = new XmlDocument();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, false))
{
ExtendedFilePropertiesPart appPart = wordDoc.ExtendedFilePropertiesPart;
xmlProperties.Load(appPart.GetStream());
}
XmlNodeList chars = xmlProperties.GetElementsByTagName("Company");
chars.Item(0).InnerText.Replace("{ClientName}", "Penn Inc.");
StreamWriter sw;
sw = File.CreateText(outdoc);
sw.WriteLine(chars.Item(0).InnerText);
sw.Close();
}
}
}
If I'm reading this right, you have something like "{replace me}" in a .docx and then when you loop through the XML, you're finding things like <t>{replace</t><t> me</t><t>}</t> or some such havoc. Now, with XML like that, it's impossible to create a routine that will replace "{replace me}".
If that's the case, then it's very, very likely related to the fact that it's considered a proofing error. i.e. it's misspelled as far as Word is concerned. The cause of it is that you've opened the document in Word and have proofing turned on. As such, the text is marked as "isDirty" and split up into different runs.
The two ways about fixing this are:
Client-side. In Word, just make sure all proofing errors are either corrected or ignored.
Format-side. Use the MarkupSimplifier tool that is part of Open XML Package Editor Power Tool for Visual Studio 2010 to fix this outside of the client. Eric White has a great (and timely for you - just a few days old) write up here on it: Getting Started with Open XML PowerTools Markup Simplifier
If you want to search and replace text in a WordprocessingML document, there is a fairly easy algorithm that you can use:
Break all runs into runs of a single character. This includes runs that have special characters such as a line break, carriage return, or hard tab.
It is then pretty easy to find a set of runs that match the characters in your search string.
Once you have identified a set of runs that match, then you can replace that set of runs with a newly created run (which has the run properties of the run containing the first character that matched the search string).
After replacing the single-character runs with a newly created run, you can then consolidate adjacent runs with identical formatting.
I've written a blog post and recorded a screen-cast that walks through this algorithm.
Blog post: http://openxmldeveloper.org/archive/2011/05/12/148357.aspx
Screen cast: http://www.youtube.com/watch?v=w128hJUu3GM
-Eric
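
The four steps of that algorithm can be sketched on a simplified model. This is not Open XML SDK code: the Run class here is a made-up stand-in that holds only a text and a format key, and RunSearchReplace is a hypothetical name. It only illustrates splitting into single-character runs, matching, replacing, and consolidating.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for a WordprocessingML run (assumption, not the SDK type).
public sealed class Run
{
    public Run(string text, string format) { Text = text; Format = format; }
    public string Text { get; private set; }
    public string Format { get; private set; }
}

public static class RunSearchReplace
{
    public static List<Run> Replace(IEnumerable<Run> runs, string search, string replacement)
    {
        if (string.IsNullOrEmpty(search)) throw new ArgumentException("search must not be empty");

        // Step 1: break every run into runs of a single character.
        List<Run> chars = runs
            .SelectMany(r => r.Text.Select(c => new Run(c.ToString(), r.Format)))
            .ToList();

        // Steps 2 and 3: find matching sequences and replace each with one new run
        // that takes the format of the first matched character.
        var replaced = new List<Run>();
        int i = 0;
        while (i < chars.Count)
        {
            if (Matches(chars, i, search))
            {
                replaced.Add(new Run(replacement, chars[i].Format));
                i += search.Length;
            }
            else
            {
                replaced.Add(chars[i]);
                i++;
            }
        }

        // Step 4: consolidate adjacent runs with identical formatting.
        var merged = new List<Run>();
        foreach (Run run in replaced)
        {
            if (run.Text.Length == 0) continue;
            Run last = merged.Count > 0 ? merged[merged.Count - 1] : null;
            if (last != null && last.Format == run.Format)
                merged[merged.Count - 1] = new Run(last.Text + run.Text, run.Format);
            else
                merged.Add(run);
        }
        return merged;
    }

    private static bool Matches(List<Run> chars, int start, string search)
    {
        if (start + search.Length > chars.Count) return false;
        for (int k = 0; k < search.Length; k++)
            if (chars[start + k].Text[0] != search[k]) return false;
        return true;
    }
}
```

Because matching happens on single characters, it does not matter how Word originally split "{replace me}" across runs.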

How to handle xml that contains nested xml using c# xmlreader?

I'm using c# to interact with a database that has an exposed REST API. The table that I'm interested in contains forum posts, some of which themselves contain xml.
Whenever my result set contains a post that has xml, my application throws an error as follows:
Exception Details: System.Xml.XmlException: '>' is an unexpected token. The expected token is '"' or '''. Line 1, position 62.
And this is the line that fails:
Line 44: ds.ReadXml(xmlData);
And this is the code I'm using:
var webClient = new WebClient();
string searchString = searchValue.Text;
string requestUrl = "http://myserver/restapi.ashx/search.xml?pagesize=4&pageindex=0&query=";
requestUrl += searchString;
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
XmlReader xmlData = XmlReader.Create(webClient.OpenRead(requestUrl),settings);
DataSet ds = new DataSet();
ds.ReadXml(xmlData);
Repeater1.DataSource = ds.Tables[1];
Repeater1.DataBind();
And this is the type of XML record that it's choking on (the stuff in the <Body> node is causing the problem):
<SearchResults PageSize="1" PageIndex="0" TotalCount="342">
<SearchResult>
<ContentId>994</ContentId>
<Title>Help Files: What are they written in?</Title>
<Url>http://myserver/linktest.aspx</Url>
<Date>2008-10-16T16:18:00+01:00</Date><ContentType>post</ContentType>
<Body><div class="ForumPostBodyArea"> <div class="ForumPostContentText"> <p>Can anyone see anything obviously wrong with this xml, when its fired to CRM Its creating 13 null records.</p> <p><?xml version="1.0" encoding="UTF-8"?><soap:Envelope xmlns:typens="http://tempuri.org/type" soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:wsdlns="http://tempuri.org/wsdl/" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Header><SessionHeader><sessionId xsi:type="xsd:long">18208442035524</sessionId></SessionHeader></soap:Header><soap:Body><typens:add><entityname xsi:type="xsd:string">lead</entityname><records xsi:nil="true" xsi:type="typens:ewarebase" /><status xsi:type="xsd:string">PreRegistration</status><requester xsi:type="xsd:string">Mimnagh</requester><personfirstname xsi:type="xsd:string">Sean</personfirstname><personlastname xsi:type="xsd:string">Test2</personlastname><personsalutation xsi:type="xsd:string">Mr</personsalutation><details xsi:type="xsd:string">test project details</details><description xsi:type="xsd:string">test description details</description><comments xsi:type="xsd:string">test project comments</comments><personemail xsi:type="xsd:string">smimnagh#mac.com</personemail><personphonenumber xsi:type="xsd:string">12334566777</personphonenumber><type xsi:type="xsd:string">PreReg</type><companyname xsi:type="xsd:string">Site Client</companyname></typens:add></soap:Body></soap:Envelope></p> <p>Many thanks</p> </div> </div>
</Body>
<Tags>
<Tag>xml</Tag>
</Tags>
<IndexedAt>2010-07-08T11:53:46.848+01:00</IndexedAt>
</SearchResult>
</SearchResults>
Is there something that I can do with the xmlreader to make it ignore whatever's causing the problem?
Please note that I can't change the XML prior to consuming it - so if it's malformed then I wonder if there's a way to ignore or modify that particular record without generating an error?
Thanks!
It looks like some of your quotes need escaping in the contents of some of your elements. Try using
&quot;
for quote marks that aren't wrapping attribute values.
UPDATE:
Because the data you want to read isn't strictly XML (it's nearly XML), your best bet is one of the following:
1. Either you or your boss, if you have one, screams at the third party because they're not sending you well-formed XML.
2. Perform some horrible hack to try and convert whatever you might get to XML.
If you have to go with point 2, the simplest thing that pops into my head is to read the characters of the 'XML' counting in and out of angle brackets. If you find any " characters and you're not within any angle brackets, replace the " with
&quot;
But note that doing that is a complete last resort.
The Content of your <Body> tag is not well formed. XML is very strict with the syntax of data. Either embed a CDATA section into your XML or escape the string properly.
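For the producing side, a minimal sketch of the CDATA option, assuming the service builds its response with LINQ to XML (the question does not say how it is built). BodyEmbedding, BuildResult and ReadBody are made-up names; only the SearchResult and Body element names come from the question.

```csharp
using System;
using System.Xml.Linq;

public static class BodyEmbedding
{
    // Wraps arbitrary post markup in a CDATA section so the outer
    // document stays well formed whatever the post contains.
    public static string BuildResult(string bodyMarkup)
    {
        var result = new XElement("SearchResult",
            new XElement("Body", new XCData(bodyMarkup)));
        return result.ToString(SaveOptions.DisableFormatting);
    }

    // A consumer gets the original markup back as plain text.
    public static string ReadBody(string xml)
    {
        return XElement.Parse(xml).Element("Body").Value;
    }
}
```

A consumer that cannot change the feed, as in the question, gains nothing from this directly; it only shows what the third party would need to emit.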

How do I write an xml document to an asp.net response formatted nicely?

I have xml documents in a database field. The xml documents have no whitespace between the elements (no line feeds, no indenting).
I'd like to output them to the browser, formatted nicely. I would simply like linefeeds in there with some indenting. Is there an easy, preferably built-in way to do this?
I am using ASP.NET 3.5 and C#. This is what I have so far, which is outputting the document all in one line:
I'm about 99.9977% sure I am using the XmlWriter incorrectly. What I am accomplishing now can be done by writing directly to the response. But am I on the right track at least? :)
int id = Convert.ToInt32(Request.QueryString["id"]);
var auditLog = webController.DB.Manager.AuditLog.GetByKey(id);
var xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Indent = true;
xmlWriterSettings.OmitXmlDeclaration = true;
var xmlWriter = XmlWriter.Create(Response.OutputStream, xmlWriterSettings);
if (xmlWriter != null)
{
Response.Write("<pre>");
// ObjectChanges is a string property that contains an XML document
xmlWriter.WriteRaw(Server.HtmlEncode(auditLog.ObjectChanges));
xmlWriter.Flush();
Response.Write("</pre>");
}
This is the working code, based on dtb's answer:
int id = Convert.ToInt32(Request.QueryString["id"]);
var auditLog = webController.DB.Manager.AuditLog.GetByKey(id);
var xml = XDocument.Parse(auditLog.ObjectChanges, LoadOptions.None);
Response.Write("<pre>" + Server.HtmlEncode(xml.ToString(SaveOptions.None)) + "</pre>");
Thank you for helping me!
WriteRaw just writes the input unchanged to the underlying stream.
If you want to use built-in formatting, you first need to parse the XML and then convert it back to a string.
The simplest solution is possibly to use XLinq:
var xml = XDocument.Parse(auditLog.ObjectChanges);
Response.Write(Server.HtmlEncode(xml.ToString(SaveOptions.None)));
(This assumes auditLog.ObjectChanges is a string that represents well-formed XML.)
If you need more control over the formatting (indentation, line-breaks) save the XDocument to a MemoryStream-backed XmlWriter, decode the MemoryStream back to a string, and write the string HtmlEncoded.
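A minimal sketch of that last suggestion, with one swap named plainly: it writes to a StringBuilder-backed XmlWriter rather than a MemoryStream, which avoids the encode/decode round trip. XmlPrettyPrinter and Format are hypothetical names, and the input is assumed to be well-formed XML.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlPrettyPrinter
{
    // Parses the raw XML and re-serialises it with the given indentation.
    public static string Format(string rawXml, string indentChars)
    {
        XDocument doc = XDocument.Parse(rawXml);
        var settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.IndentChars = indentChars;
        settings.OmitXmlDeclaration = true;

        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            doc.Save(writer);
        }
        return sb.ToString();
    }
}
```

The returned string would then go through Server.HtmlEncode before being written into the <pre> element, as in the code above.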
If auditLog.ObjectChanges is the XML content that needs to be formatted, then you've stored it in an unformatted way. To format it, treat it as XML and write it to an XmlWriter to format it. Then include the formatted XML in the response, with the HTML encoding.

Fastest way to add new node to end of an xml?

I have a large xml file (approx. 10 MB) in following simple structure:
<Errors>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
<Error>.......</Error>
</Errors>
I need to add a new <Error> node at the end, before the </Errors> tag. What is the fastest way to achieve this in .NET?
You need to use the XML inclusion technique.
Your error.xml (doesn't change, just a stub. Used by XML parsers to read):
<?xml version="1.0"?>
<!DOCTYPE logfile [
<!ENTITY logrows
SYSTEM "errorrows.txt">
]>
<Errors>
&logrows;
</Errors>
Your errorrows.txt file (changes, the xml parser doesn't understand it):
<Error>....</Error>
<Error>....</Error>
<Error>....</Error>
Then, to add an entry to errorrows.txt:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
XmlTextWriter xtw = new XmlTextWriter(sw);
xtw.WriteStartElement("Error");
// ... write error message here
xtw.WriteEndElement();
xtw.Close();
}
Or you can even use .NET 3.5 XElement, and append the text to the StreamWriter:
using (StreamWriter sw = File.AppendText("errorrows.txt"))
{
XElement element = new XElement("Error");
// ... write error message here
sw.WriteLine(element.ToString());
}
See also Microsoft's article Efficient Techniques for Modifying Large XML Files
First, I would disqualify System.Xml.XmlDocument because it is a DOM which requires parsing and building the entire tree in memory before it can be appended to. This means your 10 MB of text will be more than 10 MB in memory. This means it is "memory intensive" and "time consuming".
Second, I would disqualify System.Xml.XmlReader because it requires parsing the entire file first before you can get to the point of when you can append to it. You would have to copy the XmlReader into an XmlWriter since you can't modify it. This requires duplicating your XML in memory first before you can append to it.
A faster alternative to XmlDocument and XmlReader would be string manipulation (which has its own memory issues):
string xml = @"<Errors><error />...<error /></Errors>";
int idx = xml.LastIndexOf("</Errors>");
xml = xml.Substring(0, idx) + "<error>new error</error></Errors>";
Chop off the end tag, add in the new error, and add the end tag back.
I suppose you could go crazy with this and truncate your file by 9 characters and append to it. Wouldn't have to read in the file and would let the OS optimize page loading (only would have to load in the last block or something).
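System.IO.FileStream fs = System.IO.File.Open("log.xml", System.IO.FileMode.Open, System.IO.FileAccess.ReadWrite);
fs.Seek(-("</Errors>".Length), System.IO.SeekOrigin.End);
// FileStream.Write takes bytes, not a string, so encode the text first.
byte[] tail = System.Text.Encoding.UTF8.GetBytes("<error>new error</error></Errors>");
fs.Write(tail, 0, tail.Length);
fs.Close();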
That will hit a problem if your file is empty or contains only "<Errors></Errors>", both of which can easily be handled by checking the length.
The fastest way would probably be a direct file access.
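// File.AppendText opens the file in append mode, which does not allow
// seeking back over existing data, so open it for read/write instead.
using (FileStream stream = File.Open("my.log", FileMode.Open, FileAccess.ReadWrite))
using (StreamWriter file = new StreamWriter(stream))
{
stream.Seek(-"</Errors>".Length, SeekOrigin.End);
file.Write(" <Error>New error message.</Error></Errors>");
}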
But you lose all the nice XML features and may easily corrupt the file.
I would use XmlDocument or XDocument to Load your file and then manipulate it accordingly.
I would then look at the possibility of caching this XmlDocument in memory so that you can access the file quickly.
What do you need the speed for? Do you have a performance bottleneck already or are you expecting one?
How is your XML file represented in code? Do you use the System.Xml classes? In this case you could use XmlDocument.AppendChild.
Try this out:
var doc = new XmlDocument();
doc.LoadXml("<Errors><error>This is my first error</error></Errors>");
XmlNode root = doc.DocumentElement;
//Create a new node.
XmlElement elem = doc.CreateElement("error");
elem.InnerText = "This is my error";
//Add the node to the document.
if (root != null) root.AppendChild(elem);
doc.Save(Console.Out);
Console.ReadLine();
Here's how to do it in C, .NET should be similar.
The game is to simply jump to the end of the file, skip back over the tag, append the new error line, and write a new tag.
#include <stdio.h>
#include <string.h>
#include <errno.h>
int main(int argc, char** argv) {
FILE *f;
// Open the file
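f = fopen("log.xml", "rb+");
if (f == NULL) { perror("log.xml"); return 1; }
// Small buffer holding the line ending that follows the end tag.
// In memory "\n" is always one character; if the file uses Windows
// line endings (\r\n), put "\r\n" here instead so the offset is 2 bytes.
char nlbuf[10];
sprintf(nlbuf, "\n");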
// How long is our end tag?
long offset = strlen("</Errors>");
// Add in an \n char.
offset += strlen(nlbuf);
// Seek to the END OF FILE, and then GO BACK the end tag and newline
// so we use a NEGATIVE offset.
fseek(f, offset * -1, SEEK_END);
// Print out your new error line
fprintf(f, "<Error>New error line</Error>\n");
// Print out new ending tag.
fprintf(f, "</Errors>\n");
// Close and you're done
fclose(f);
}
The quickest method is likely to be reading in the file using an XmlReader, and simply replicating each read node to a new stream using XmlWriter. When you get to the point at which you encounter the closing </Errors> tag, you just need to output your additional <Error> element before continuing the 'read and duplicate' cycle. This way is inevitably going to be harder than reading the entire document into the DOM (XmlDocument class), but for large XML files, much quicker. Admittedly, using StreamReader/StreamWriter would be somewhat faster still, but pretty horrible to work with in code.
Using string-based techniques (like seeking to the end of the file and then moving backwards the length of the closing tag) is vulnerable to unexpected but perfectly legal variations in document structure.
The document could end with any amount of whitespace, to pick the likeliest problem you'll encounter. It could also end with any number of comments or processing instructions. And what happens if the top-level element isn't named Error?
And here's a situation that using string manipulation fails utterly to detect:
<Error xmlns="not_your_namespace">
...
</Error>
If you use an XmlReader to process the XML, while it may not be as fast as seeking to EOF, it will also allow you to handle all of these possible exception conditions.
I attempted to use code other answers had suggested but ran into an issue where sometimes calling .Length on my strings was not the same as the number of bytes for the string, so I was inconsistently losing characters. I modified it to get the byte count instead.
var endTag = "</Errors>";
var nodeText = GetNodeText();
using (FileStream file = File.Open("my.log", FileMode.Open, FileAccess.ReadWrite))
{
// Encode once and use the byte lengths, not the string lengths.
byte[] nodeBytes = Encoding.UTF8.GetBytes(nodeText);
byte[] endTagBytes = Encoding.UTF8.GetBytes(endTag);
file.Seek(-endTagBytes.Length, SeekOrigin.End);
file.Write(nodeBytes, 0, nodeBytes.Length);
file.Write(endTagBytes, 0, endTagBytes.Length);
}
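Several answers above warn that blind seeking breaks on trailing whitespace or an unexpected file ending. The following is a rough sketch of a slightly more defensive variant, assuming a UTF-8 file whose last non-whitespace content is </Errors> and which lies within the final 256 bytes. ErrorLogAppender and Append are hypothetical names; the sketch does not handle an empty <Errors/> element, comments after the end tag, or namespaces.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class ErrorLogAppender
{
    private const string EndTag = "</Errors>";

    public static void Append(string path, string message)
    {
        // XElement escapes <, > and & in the message text.
        string node = new XElement("Error", message).ToString(SaveOptions.DisableFormatting);
        byte[] payload = Encoding.UTF8.GetBytes(node + EndTag);
        byte[] endTagBytes = Encoding.UTF8.GetBytes(EndTag);

        using (FileStream file = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
        {
            // Read only the tail of the file, not the whole document.
            int tailLength = (int)Math.Min(file.Length, 256);
            byte[] tail = new byte[tailLength];
            file.Seek(-tailLength, SeekOrigin.End);
            int read = 0;
            while (read < tailLength)
            {
                int n = file.Read(tail, read, tailLength - read);
                if (n == 0) break;
                read += n;
            }

            int index = LastIndexOf(tail, read, endTagBytes);
            if (index < 0)
                throw new InvalidDataException("Closing </Errors> tag not found near the end of the file.");
            for (int i = index + endTagBytes.Length; i < read; i++)
            {
                byte b = tail[i];
                if (b != ' ' && b != '\t' && b != '\r' && b != '\n')
                    throw new InvalidDataException("Unexpected content after </Errors>.");
            }

            // Cut the file just before the end tag, then write node + end tag.
            file.SetLength(file.Length - read + index);
            file.Seek(0, SeekOrigin.End);
            file.Write(payload, 0, payload.Length);
        }
    }

    private static int LastIndexOf(byte[] buffer, int length, byte[] pattern)
    {
        for (int start = length - pattern.Length; start >= 0; start--)
        {
            int i = 0;
            while (i < pattern.Length && buffer[start + i] == pattern[i]) i++;
            if (i == pattern.Length) return start;
        }
        return -1;
    }
}
```

Searching the raw bytes is safe here because the end tag is pure ASCII, and ASCII bytes never occur inside a multi-byte UTF-8 sequence.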

Is there a quick way to format an XmlDocument for display in C#?

I want to output my InnerXml property for display in a web page. I would like to see indentation of the various tags. Is there an easy way to do this?
Here's a little class that I put together some time ago to do exactly this.
It assumes that you're working with the XML in string format.
public static class FormatXML
{
public static string FormatXMLString(string sUnformattedXML)
{
XmlDocument xd = new XmlDocument();
xd.LoadXml(sUnformattedXML);
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
XmlTextWriter xtw = null;
try
{
xtw = new XmlTextWriter(sw);
xtw.Formatting = Formatting.Indented;
xd.WriteTo(xtw);
}
finally
{
if(xtw!=null)
xtw.Close();
}
return sb.ToString();
}
}
You should be able to do this with code formatters. You would have to html encode the xml into the page first.
Google has a nice prettifyer that is capable of visualizing XML as well as several programming languages.
Basically, put your XML into a pre tag like this:
<pre class="prettyprint">
<link href="prettify.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="prettify.js"></script>
</pre>
Use the XML Web Server Control to display the content of an xml document on a web page.
EDIT: You should pass the entire XmlDocument to the Document property of the XML Web Server Control to display it. You don't need to use the InnerXml property.
If indentation is your only concern and if you can afford to launch an external process, you can process the xml file with the HTML Tidy console tool (~100K).
The code is:
tidy --input-xml y --output-xhtml y --indent "1" $(FilePath)
Then you can display the indented string on the web page once you get rid of special chars.
It would also be easy to create a recursive function that makes such output - simply iterate nodes starting from the root and enter the next recursion step for each child node, passing the indentation as a parameter to each new recursion call.
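A rough sketch of such a recursive function, using LINQ to XML rather than XmlDocument and assuming a document without namespaces or mixed content (text and child elements side by side). XmlOutline, Render and Append are made-up names.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class XmlOutline
{
    public static string Render(XElement root)
    {
        var sb = new StringBuilder();
        Append(root, 0, sb);
        return sb.ToString();
    }

    // Writes one element at the given depth, then recurses into its children
    // with depth + 1 so each level is indented by two more spaces.
    private static void Append(XElement element, int depth, StringBuilder sb)
    {
        string indent = new string(' ', depth * 2);
        if (!element.HasElements)
        {
            // Leaf: let XElement serialise (and escape) itself on one line.
            sb.Append(indent).Append(element.ToString(SaveOptions.DisableFormatting)).Append('\n');
            return;
        }

        sb.Append(indent).Append('<').Append(element.Name.LocalName);
        foreach (XAttribute attribute in element.Attributes())
            sb.Append(' ').Append(attribute.ToString());
        sb.Append(">\n");

        foreach (XElement child in element.Elements())
            Append(child, depth + 1, sb);

        sb.Append(indent).Append("</").Append(element.Name.LocalName).Append(">\n");
    }
}
```

The result still has to be HTML-encoded before it is written into the page.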
Check out the free Actipro CodeHighlighter for ASP.NET - it can neatly display XML and other formats.
Or are you more interested in actually formatting your XML? Then have a look at the XmlTextWriter - you can specify things like Formatting (indented or not) and the Indentation level, and then write out your XML to e.g. a MemoryStream and read it back from there into a string for display.
Marc
Use an XmlWriter created with XmlWriterSettings that have indentation enabled (with XmlTextWriter you would set its Formatting property instead). You can use a StringWriter as "temporary storage" if you want to write the resulting string to the screen.
