Convert XML to Plain Text

Convert XML to Plain Text - c#

My goal is to build an engine that takes the latest HL7 3.0 CDA documents and makes them backward compatible with HL7 2.5, which is a radically different beast.
The CDA document is an XML file which, when paired with its matching XSL file, renders an HTML document fit for display to the end user.
In HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80 character lines to populate the HL7 2.5 message.
So far, I'm taking an approach of using XslCompiledTransform to transform my XML document using XSLT and produce a resultant HTML document.
My next step is to take that document (or perhaps at a step before this) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping it's something easy that I'm just overlooking, or I just can't find the magical search terms. Can anyone offer some help?
FWIW, I've read the 5 or 10 other questions in SO which embrace or admonish using RegEx for this, and don't think that I want to go down that road. I need the rendered text.
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;
public class TransformXML
{
public static void Main(string[] args)
{
try
{
string sourceDoc = "C:\\CDA_Doc.xml";
string resultDoc = "C:\\Result.html";
string xsltDoc = "C:\\CDA.xsl";
XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
XslCompiledTransform myXslTransform = new XslCompiledTransform();
XmlTextWriter writer = new XmlTextWriter(resultDoc, null);
myXslTransform.Load(xsltDoc);
myXslTransform.Transform(myXPathDocument, null, writer);
writer.Close();
StreamReader stream = new StreamReader (resultDoc);
}
catch (Exception e)
{
Console.WriteLine ("Exception: {0}", e.ToString());
}
}
}

Since you have the XML source, consider writing an XSL that will give you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
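A minimal sketch of that idea, under stated assumptions: the stylesheet and the `paragraph` element below are hypothetical stand-ins, not the real CDA schema or the CDA.xsl from the question. With `<xsl:output method='text'/>`, XslCompiledTransform writes plain text straight to a TextWriter, so no intermediate HTML is produced.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TextTransform
{
    // Hypothetical stylesheet: emits each <paragraph> as one line of text.
    // A real CDA stylesheet would match the CDA elements instead.
    public const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='//paragraph'>
      <xsl:value-of select='normalize-space(.)'/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the given XSLT to the given XML and returns the text output.
    public static string ToPlainText(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling `TextTransform.ToPlainText(xml, TextTransform.Stylesheet)` on a document with two `paragraph` elements should give the two paragraphs on separate lines, with inline markup dropped and runs of whitespace collapsed.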

This will leave you with just the text:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
var blah = new System.IO.StringReader(sourceDoc);
var reader = System.Xml.XmlReader.Create(blah);
StringBuilder result = new StringBuilder();
// reader.Value is empty for element nodes, so only the text content accumulates
while (reader.Read())
{
result.Append( reader.Value);
}
Console.WriteLine(result);
}
static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";
}

Or you can use a regular expression:
public static string StripHtml(String htmlText)
{
// replace all tags with spaces...
htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");
// .. then eliminate all double spaces
while (htmlText.Contains("  "))
{
htmlText = htmlText.Replace("  ", " ");
}
// clear out non-breaking spaces and & character code
htmlText = htmlText.Replace("&nbsp;", " ");
htmlText = htmlText.Replace("&amp;", "&");
return htmlText;
}

Can you use something like this which uses lynx and perl to render the html and then convert that to plain text?

This is a great use-case for XSL:FO and FOP. FOP isn't just for PDF output, one of the other major outputs that is supported is text. You should be able to construct a simple xslt + fo stylesheet that has the specifications (i.e. line width) that you want.
This solution is a bit more heavyweight than just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), it will become much easier to express in fo than to mock up in xslt.
I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80 character lines, the default xslt template will only print element text. Once you have only the text, you can apply whatever text processing is necessary.
Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.
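Once only the text is left, folding it into 80 character lines is ordinary string processing. The following is a rough sketch of a greedy word wrap; `LineWrapper` and `Wrap` are made-up names for illustration, and it assumes that breaking on whitespace (and hard-splitting any single word longer than the line) is acceptable for the HL7 2.5 message.

```csharp
using System;
using System.Collections.Generic;

public static class LineWrapper
{
    // Greedy word wrap: packs whitespace-separated words into lines of at most
    // `width` characters. A single word longer than `width` is split hard.
    public static List<string> Wrap(string text, int width = 80)
    {
        if (width < 1) throw new ArgumentOutOfRangeException(nameof(width));
        var lines = new List<string>();
        var current = "";
        // A null separator array means "split on any whitespace".
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (var rawWord in words)
        {
            var word = rawWord;
            // Hard-split words that cannot fit on any line.
            while (word.Length > width)
            {
                if (current.Length > 0) { lines.Add(current); current = ""; }
                lines.Add(word.Substring(0, width));
                word = word.Substring(width);
            }
            if (current.Length == 0)
                current = word;
            else if (current.Length + 1 + word.Length <= width)
                current += " " + word;
            else
            {
                lines.Add(current);
                current = word;
            }
        }
        if (current.Length > 0) lines.Add(current);
        return lines;
    }
}
```

Each returned string can then be written out as one line of the message; paragraph boundaries are not preserved by this sketch, so text that must keep its line breaks would need to be wrapped paragraph by paragraph.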

Related

c# create word document with openXML : XML Parsing Error (when replacement string contains spaces)

I am trying to create a word document using a word template in my C# application using openXML. Here is my code so far:
DirectoryInfo tempDir = new DirectoryInfo(Server.MapPath("~\\Files\\WordTemplates\\"));
DirectoryInfo docsDir = new DirectoryInfo(Server.MapPath("~\\Files\\FinanceDocuments\\"));
string ype = "test Merge"; //if ype string contains spaces then I get this error
string sourceFile = tempDir + "\\PaymentOrderTemplate.dotx";
string destinationFile = docsDir + "\\" + "PaymentOrder.doc";
// Create a copy of the template file and open the copy
File.Copy(sourceFile, destinationFile, true);
// create key value pair, key represents words to be replace and
//values represent values in document in place of keys.
Dictionary<string, string> keyValues = new Dictionary<string, string>();
keyValues.Add("ype", ype);
SearchAndReplace(destinationFile, keyValues);
Process.Start(destinationFile);
And the SearchAndReplace function:
public static void SearchAndReplace(string document, Dictionary<string, string> dict)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
foreach (KeyValuePair<string, string> item in dict)
{
Regex regexText = new Regex(item.Key);
docText = regexText.Replace(docText, item.Value);
}
using (StreamWriter sw = new StreamWriter(
wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}
But when I try to open the exported file I get this error:
XML parsing error
Location: Part: /word/document.xml, line: 2, Column: 2142
Document.xml first lines:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid wp14">
<w:body>
<w:tbl>
<w:tblPr>
<w:tblW w:w="10348" w:ttest Merge="dxa"/>
<w:tblInd w:w="108" w:ttest Merge="dxa"/>
<w:tblBorders>
Edit
I found out that the problem occurred because I was using mergefields in the word template. If I use plain text it works. But in this case it will be slow because it has to check every single word in the template and replace it if it matches. Is it possible to do it in another way?

Disclaimer: You seem to be using the OpenXML SDK, because your code looks virtually identical to that found here: https://msdn.microsoft.com/en-us/library/bb508261(v=office.12).aspx - I've never in my life used this SDK and I'm basing this answer on an educated guess at what's happening
It seems that the operation you're carrying out on this Word document is affecting parts of the document that you didn't intend.
I believe that calling document.MainDocumentPart.GetStream() is just giving you more or less raw, direct access to the XML of the document, and you're then treating it as a plain xml file, manipulating it as text, and carrying out a list of straight text replacements. I think that is the likely cause of the problem: you intend to edit document text, but you accidentally damage the xml node structure in the process.
By way of an example, here is a simple HTML document:
<html>
<head><title>Damage report</title></head>
<body>
<p>The soldier was shot once in the body and twice in the head</p>
</body>
</html>
You decide to run a find/replace to make the places the soldier was shot, a bit more specific:
var html = File.ReadAllText(@"c:\my.html");
html = html.Replace("body", "chest");
html = html.Replace("head", "forehead");
File.WriteAllText(@"c:\my.html", html);
Only thing, your document is now ruined:
<html>
<forehead><title>Damage report</title></forehead>
<chest>
<p>The soldier was shot once in the chest and twice in the forehead</p>
</chest>
</html>
A browser can't parse it (well, it's still valid I suppose, but it's meaningless) any more because the replacement operation broke some things.
You're replacing "ype" with "test Merge" but this seems to be clobbering an occurrence of the word "type" - something that it seems pretty likely would appear in the XML attribute or element names - and turning it into "ttest Merge".
To correctly change the content of an XML document's node texts, it should be parsed from text to an XML document object model representation, the nodes iterated, the texts altered, and the whole thing re-serialized back to xml text. Office SDK does seem to provide ways to do this, because you can treat a document like a collection of class object instances, and say things like this code snippet (also from MSDN):
// Create a Wordprocessing document.
using (WordprocessingDocument myDoc = WordprocessingDocument.Create(docName, WordprocessingDocumentType.Document))
{
// Add a new main document part.
MainDocumentPart mainPart = myDoc.AddMainDocumentPart();
//Create DOM tree for simple document.
mainPart.Document = new Document();
Body body = new Body();
Paragraph p = new Paragraph();
Run r = new Run();
Text t = new Text("Hello World!");
//Append elements appropriately.
r.Append(t);
p.Append(r);
body.Append(p);
mainPart.Document.Append(body);
// Save changes to the main document part.
mainPart.Document.Save();
}
You should be looking for another way, not using streams/direct low level xml access, to access the document elements. Something like these:
https://blogs.msdn.microsoft.com/brian_jones/2009/01/28/traversing-in-the-open-xml-dom/
https://www.gemboxsoftware.com/document/articles/find-replace-word-csharp
Or possibly starting with a related SO question like this: Search And Replace Text in OPENXML (Added file) (though the answer you need may be in something linked inside this question)
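To illustrate the parse, edit, re-serialize approach in general terms, here is a small sketch that uses System.Xml.Linq on generic XML rather than the Open XML SDK (a deliberate swap, so it runs without extra packages). `XmlTextReplacer` is a made-up name, and the sample markup is only loosely modelled on the document.xml fragment above.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlTextReplacer
{
    // Replaces `find` with `replacement` in text nodes only; element names,
    // attribute names, and attribute values are left untouched.
    public static string ReplaceInTextNodes(string xml, string find, string replacement)
    {
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        // ToList() takes a snapshot so the tree is not edited while it is enumerated.
        foreach (var text in doc.DescendantNodes().OfType<XText>().ToList())
        {
            text.Value = text.Value.Replace(find, replacement);
        }
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

With this, replacing "ype" with "test Merge" in `<tbl type="dxa"><t>ype</t></tbl>` changes only the text inside `<t>` and leaves the `type` attribute alone. In a real Word document a placeholder may be split across several runs, so a per-text-node replacement like this can still miss matches.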

How to open an existing doc/x file and then write on it?

I want to open an existing MS Word document, then write a specific text on it.
The existing document has this content:
First car costs ... $.
I would like to add a specific text right after the word costs.
The final text should be :
First car costs 15000 $.
I want to do this using a simple c# application.
I'm having difficulties to find a way to add text at the desired position.
I used a NuGet package called DocX.
Here is my code:
using Novacode;
using System.Text.RegularExpressions;
...
string fileName = @"C:\Users\DAFPC\Documents\WordDoc1.docx";
var doc = DocX.Load(fileName);
doc.ReplaceText("...", "15000",false, RegexOptions.None, null, null, MatchFormattingOptions.SubsetMatch, true, false);
doc.Save();
Process.Start("WINWORD.EXE", fileName);
The code does not replace "..." with "15000". But if I try to replace the word "Car" with "15000" the code works.

Check edit
doc.ReplaceText(toReplace, replacement);
You should wrap it in a function like the following:
static void Main(string[] args)
{
string filename = "test.docx";
ReplaceInDocx(filename, "...", "15000");
}
static void ReplaceInDocx(string filename, string toReplace, string replacement)
{
var doc = DocX.Load(filename);
doc.ReplaceText(toReplace, replacement);
doc.Save();
}
You might find this useful.
Edit: Ok, I see what you mean. The problem is with the way MS Word works. When you enter "..." it automatically gets converted into "…" (ellipsis). What you should do, then, is to search for that instead. Or better, change the search parameter.
ReplaceInDocx(filename, "…", "car");

streamReader.ReadToEnd() return just header OpenXML

I want to find a word and replace it with another word in a Word document using OpenXML.
I use this method:
public static void AddTextToWord(string filepath, string txtToFind,string ReplaceTxt)
{
WordprocessingDocument wordDoc = WordprocessingDocument.Open(filepath, true);
string docText = null;
StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream());
docText = sr.ReadToEnd();
System.Diagnostics.Debug.WriteLine(docText);
Regex regexText = new Regex(txtToFind);
docText = regexText.Replace(docText, ReplaceTxt);
StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create));
System.Diagnostics.Debug.WriteLine(docText);
wordDoc.Close();
}
But docText returns just the head of the page, that is, the XML schema header of the document:
<?xml version="1.0" encoding=.......

Check Your Strings
If you want to replace a specific word or phrase within your existing content, you may just want to use the String.Replace() method as opposed to performing a Regex.Replace() which may not work as expected (as it expects a regular expression as opposed to a traditional string). This may not matter if you expect to use regular expressions, but it's worth noting.
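The difference is easy to see with a search string made of regex metacharacters. The sketch below (with made-up names `ReplaceDemo`, `AsPattern` and `AsLiteral`) contrasts the two calls; only standard `Regex.Replace` and `String.Replace` are used.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Treats `find` as a regular expression: "..." means "any three characters".
    public static string AsPattern(string input, string find, string replacement)
        => Regex.Replace(input, find, replacement);

    // Treats `find` as literal text, which is what a plain word replacement wants.
    // Regex.Escape(find) would be the equivalent if a Regex is required elsewhere.
    public static string AsLiteral(string input, string find, string replacement)
        => input.Replace(find, replacement);
}
```

For example, `AsPattern("abcdef", "...", "X")` returns "XX" because each group of three characters matches, while `AsLiteral("a...b", "...", "X")` returns "aXb".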
Ensure You Are Pulling The Content
Word Documents are obviously not as easy to parse as plain text, so in order to get the actual "content", you may have to use an approach similar to the one mentioned here that targets the Document.Body properties instead of reading using a StreamReader object :
docText = wordDoc.MainDocumentPart.Document.Body.InnerText;
Performing Your Replacement
With that said, you currently appear to be reading the contents of your file and storing it in a string called docText. Since you have that string and know your values to find and replace, just call the Replace() method as seen below :
docText = docText.Replace(txtToFind,ReplaceTxt);
Writing Out Your Content
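After performing the replacement, you'll just need to write your updated text out to a stream. Keep in mind that this stream holds the part's XML, so this is only safe when docText is the full XML read from that stream; writing Body.InnerText back would replace the markup with bare text :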
using (var sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}

How do I hide images that have a certain class when creating a pdf from html?

I am having an issue trying to hide image elements that contain a certain class when converting the html to pdf, using iTextSharp (5.x).
I do not have access over the original Html as it comes from another source, however, I can do basic things like Regex and string.replace in C# after I get it.
A simple example of the Html string would be something like this:
<div>
<div>
<img src="somepath/desktop.jpg" class="img-desktop">Desktop</img>
<img src="somepath/mobile.jpg" class="img-mobile">Mobile</img>
</div>
</div>
This string is then getting created into a PDF using the XMLWorker in iTextSharp.
I need to hide the second image and, more generically, any image element with the "img-mobile" class.
What I've tried:
Add img.img-mobile {display:none} to the CSS that is sent in when creating the pdf
Add img.img-mobile {width:0;height:0} to the CSS
Add @media print { img.img-mobile: display:none} to the CSS
Add @media print { img.img-mobile: width:0;height:0} to the CSS
Use Regex to find an img element with that classes, then loop through the matches, replace the source with empty source and replace the original html of that string with the new string (my Regex isn't grabbing any matches, unfortunately)
var pattern = "<img.*?class=\"img-mobile.*\"\\s?>.*</img>";
var mobileImages = Regex.Matches(innerHtml, pattern);
var srcPattern = "src=\".*\" ";
foreach (var imageElement in mobileImages)
{
var replaceString = Regex.Replace(imageElement.ToString(), srcPattern, " ");
innerHtml.Replace(imageElement.ToString(), replaceString);
}
I am quickly running out of ideas on how to handle this... The only saving grace is that the Html that comes in is consistent since a tool is generating it, somewhere else. So, when a user "adds an image to that html" it will always be structured the same, so Regex and replace methods are acceptable, although a CSS method would be much more preferred...

Even if you're a Regex expert and your input is predictable as mentioned, parsing HTML is hard. A better and easier way is to use a tested/proven parser, which is available in pretty much every programming language. For .NET it's HtmlAgilityPack. If you know a bit of XPath, which is quite similar to CSS selectors, it's pretty simple to setup and select the specific nodes you want to remove:
string RemoveImage(string htmlToParse)
{
var hDocument = new HtmlDocument()
{
OptionWriteEmptyNodes = true,
OptionAutoCloseOnEnd = true
};
hDocument.LoadHtml(htmlToParse);
var root = hDocument.DocumentNode;
var imagesDesktop = root.SelectNodes("//img[@class='img-desktop']");
foreach (var image in imagesDesktop)
{
var imageText = image.NextSibling;
imageText.Remove();
image.Remove();
}
return root.WriteTo();
}
And then pass your parsed HTML to iTextSharp:
var parsedHtml = RemoveImage(HTML);
using (var xmlSnippet = new StringReader(parsedHtml))
{
using (FileStream stream = new FileStream(
outputFile,
FileMode.Create,
FileAccess.Write))
{
using (var document = new Document())
{
PdfWriter writer = PdfWriter.GetInstance(
document, stream
);
document.Open();
XMLWorkerHelper.GetInstance().ParseXHtml(
writer, document, xmlSnippet
);
}
}
}
works for me with the HTML snippet you provided.
UPDATE, after comment about 'approved' code:
Aah, the dreaded CCB. Know how that goes. :( If HtmlAgilityPack doesn't pass, here's an alternate solution, although it's probably not the best Regex ever written. ;)
const string HTML = @"
<div>
<p class='img-desktop'>Paragraph</p>
<div>
<img src='somepath/desktop.jpg' class='img-desktop'>Desktop</img>
<img src='somepath/mobile.jpg' class='img-mobile'>Mobile</img>
</div>
<div>
<img src='somepath/desktop.jpg' alt='img-desktop' title='img-desktop' class=""img-desktop"">Desktop
</IMG>
<img src='somepath/mobile.jpg' class='img-mobile'>Mobile</img>
</div>
</div>";
public void Go()
{
var regex = new Regex(
// initial update
// @"<img[^>]*class='?""?'?img-desktop""?[^>]*>.*?</img>",
// after seeing accepted answer, noticed a bad copy/paste.
// above works, but for readability should have been this:
@"<img[^>]*class='?""?img-desktop""?'?[^>]*>.*?</img>",
// and also noticed above can be shortened to this, which works too
// @"<img[^>]*class=[^>]*img-desktop[^>]*>.*?</img>"
RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline
);
Console.WriteLine(regex.Replace(HTML, ""));
}
The Regex gives you a little extra leeway in case the actual HTML you're dealing with isn't exactly as posted above.
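Since the question asked about the "img-mobile" class, here is the shortened pattern from the comment above adapted to that class, as a sketch only. `MobileImageStripper` is a made-up name, and the pattern assumes the generated HTML always closes images with `</img>`, as in the posted snippet. It is deliberately loose: it would also match an `<img>` tag where "img-mobile" appears in a later attribute after `class=`.

```csharp
using System.Text.RegularExpressions;

public static class MobileImageStripper
{
    // Matches an <img ...> tag whose attributes mention img-mobile after class=,
    // plus everything up to the nearest closing </img>.
    private static readonly Regex MobileImage = new Regex(
        @"<img[^>]*class=[^>]*img-mobile[^>]*>.*?</img>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the HTML with every matching image element removed.
    public static string Strip(string html) => MobileImage.Replace(html, "");
}
```

The result of `MobileImageStripper.Strip(html)` can then be handed to XMLWorker in place of the original string.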

How to determine if XML is well formed?

I've got a large xml document in a string. What's the best way to determine if the xml is well formed?

Something like:
static void Main() {
Test("<abc><def/></abc>");
Test("<abc><def/><abc>");
}
static void Test(string xml) {
using (XmlReader xr = XmlReader.Create(
new StringReader(xml))) {
try {
while (xr.Read()) { }
Console.WriteLine("Pass");
} catch (Exception ex) {
Console.WriteLine("Fail: " + ex.Message);
}
}
}
If you need to check against an xsd, then use XmlReaderSettings.

Simply run it through a parser. That will perform the appropriate checks (whether it parses ok).
If it's a large document (as indicated) then an event-based parser (e.g. SAX) will be appropriate since it won't store the document in memory.
It's often useful to have XML utilities around to check this sort of stuff. I use XMLStarlet, which is a command-line set of tools for XML checking/manipulation.

XmlReader seems a good choice as it should stream the data (not load the whole xml in one go)
http://msdn.microsoft.com/en-us/library/9d83k261.aspx

Try using an XmlReader with an XmlReaderSettings that has ConformanceLevel.Document set.
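Putting these suggestions together, a well-formedness check might look like the sketch below. `XmlCheck.IsWellFormed` is a made-up helper name; it assumes a boolean answer is enough and that the default reader settings are acceptable (by default XmlReaderSettings prohibits DTDs, so a document with a DOCTYPE is reported as failing here).

```csharp
using System.IO;
using System.Xml;

public static class XmlCheck
{
    // Returns true if the string parses as a single well-formed XML document.
    public static bool IsWellFormed(string xml)
    {
        var settings = new XmlReaderSettings
        {
            // Require exactly one root element, i.e. a full document, not a fragment.
            ConformanceLevel = ConformanceLevel.Document
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                // Reading to the end forces the parser over the whole input
                // without building a DOM in memory.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For the two inputs in the first answer, `IsWellFormed("<abc><def/></abc>")` is true and `IsWellFormed("<abc><def/><abc>")` is false.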
