Rtf to WordML Convert in C# - c#

I have a Windows application that generates reports.
It has templates in RTF, such as "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n", which are written to a Word doc file. The document is then saved as XML (Save As) and closed. Then, tags like (say) are extracted and some new
The problem here is Word, which is used as the converter in this process: it consumes valuable time in a loop where it opens a Word instance, saves, closes, and deletes.
Please correct any mistakes I have made and help me with an alternative way to convert to WordML.

Use Aspose.Words:
// Requires: using System.IO; using System.Text; using Aspose.Words;
// your RTF string
string rtfStrx = "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n";
// convert the string to bytes for the memory stream
byte[] rtfBytex = Encoding.UTF8.GetBytes(rtfStrx);
MemoryStream rtfStreamx = new MemoryStream(rtfBytex);
// Aspose.Words loads the RTF content from the stream
Document rtfDocx = new Document(rtfStreamx);
rtfDocx.Save(@"C:\Temp.xml", SaveFormat.WordML);
This saves your RTF text in a new document as WordML. I cannot say how long it will take in a loop, but it should take much less time than physically opening and closing MS Word.

Unless I am missing something, I assume that you are trying to create an Office XML file from an RTF template? I think you can use the Open XML SDK to create the XML file. Specifically, the DocumentReflector tool that comes with that SDK seems to be a good fit for that. See this example. Also, there is an article at http://www.codeguru.com/cpp/controls/richedit/conversions/article.php/c5377/ that shows how to convert from RTF to HTML, which might guide you.

Use the WPF RichTextBox to convert RTF => XAML. Since XAML is XML, use XSLT or LINQ to convert it to your desired XML structure.

Related

Creating Word file from ObservableCollection with C#

I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but if there are 1000 items in the collection it takes about 10sec, much much more if the number is 10k+. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); and that takes less than a sec, but I can't set the word to be bold that way.
Then I switched to using Open XML SDK 2.5, but even after reading the MSDN documentation I still have no idea how to make just a part of the text bold, and I don't know if it's even possible to set the PageLayout column count. The only thing I could do was to make it look the same as with Interop.Word, but with just 1 column and <1sec creation time.
Should I be using Interop.Word or Open XML (or maybe both combined) for this? And can someone please show me how to write this properly, so it doesn't take forever if the collection is relatively large? Any help is appreciated. :)
OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
Your Word template would look like this:
Here's some code to get you started, assuming you have the SDK installed. Here stream is the opened template .docx and YourDictionary maps each word to its translation:
// Requires: using System.Linq; using System.Text.RegularExpressions;
// using DocumentFormat.OpenXml; using DocumentFormat.OpenXml.Packaging;
// using DocumentFormat.OpenXml.Wordprocessing;
var templateRegex = new Regex("\\[templateForWords\\]");
var wordPlacementRegex = new Regex("\\[word\\]");
var translationPlacementRegex = new Regex("\\[translation\\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mainPart = document.MainDocumentPart;
    // Find the template paragraph by its marker text.
    Paragraph paragraphTemplate = mainPart.Document.Body
        .Descendants<Paragraph>()
        .First(p => templateRegex.IsMatch(p.InnerText));
    // Track the last inserted paragraph so the output keeps the dictionary order.
    OpenXmlElement insertAfter = paragraphTemplate;
    foreach (string word in YourDictionary.Keys)
    {
        var paraClone = (Paragraph)paragraphTemplate.CloneNode(true);
        // Each placeholder must sit entirely inside one <w:t> element;
        // Word sometimes splits text across several runs, so check the template's XML.
        foreach (Text t in paraClone.Descendants<Text>())
        {
            t.Text = templateRegex.Replace(t.Text, "");
            t.Text = wordPlacementRegex.Replace(t.Text, word);
            t.Text = translationPlacementRegex.Replace(t.Text, YourDictionary[word]);
        }
        insertAfter = insertAfter.InsertAfterSelf(paraClone);
    }
    // Remove the original template paragraph, then save the main part.
    paragraphTemplate.Remove();
    mainPart.Document.Save();
}
OpenXML is absolutely better, because it is faster, has fewer bugs, and is more reliable and flexible at runtime (especially in a server environment). And it's not really difficult to find out how to build a given element using OpenXML. Since a docx file is just a zip file with XML files inside, I open it and read the XML to get an idea of how Word itself does it. First of all, I create a document, then format it (in your case, you can create a file with two columns and bold words inside), save it, and rename it to a .zip file. Then I open it, open the "word" directory inside, and open the file "document.xml" in that directory. This file contains the essential part of the XML; looking at it, it's not difficult to figure out how to recreate it in OpenXML.
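For the two-column layout with bold words from the question, a minimal sketch with the Open XML SDK might look like the following. GlossaryWriter and Write are hypothetical names, and the entries are assumed to arrive as key/value pairs (word, translation). The bold part goes into its own run with run properties, and the column count is set through the section properties, which must be the last child of the body.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration; not part of the SDK.
public static class GlossaryWriter
{
    public static void Write(string path, IEnumerable<KeyValuePair<string, string>> entries)
    {
        using (WordprocessingDocument doc =
            WordprocessingDocument.Create(path, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = doc.AddMainDocumentPart();
            var body = new Body();

            foreach (KeyValuePair<string, string> entry in entries)
            {
                // First run: the word, marked bold via its run properties.
                var boldRun = new Run(
                    new RunProperties(new Bold()),
                    new Text(entry.Key));

                // Second run: " = translation"; preserve the leading space.
                var plainRun = new Run(
                    new Text(" = " + entry.Value) { Space = SpaceProcessingModeValues.Preserve });

                body.AppendChild(new Paragraph(boldRun, plainRun));
            }

            // Section properties carry the page layout; two columns here.
            body.AppendChild(new SectionProperties(new Columns { ColumnCount = (short)2 }));

            mainPart.Document = new Document(body);
            mainPart.Document.Save();
        }
    }
}
```

Because everything is built in memory and written once, this avoids the per-paragraph COM calls that make the Interop loop slow. Treat it as a starting point and compare the output against a document.xml produced by Word itself.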
Open XML is a much better option than Office COM. But the problem is that it is a low-level file format library that unlike Office COM doesn’t work on a high abstraction level. You might want to go that route but I recommend you to first consider looking into a commercial library that will give you the benefits of a high-level DOM without the need to have MS Word installed on the production machine. Our company recently purchased this toolkit which allows you to use template based approach and also DOM/programmatic approach to generate/modify/create documents.

Is there a less convoluted, preferably automatable, way to convert PDFs to HTML?

I need to convert PDF files to HTML.
I can do this manually via several steps, using this (Rube) Goldberg variation:
0) Save PDF as text
1) Copy-and-paste text into MS Word
2) Save MS Word doc as HTML
I feel like I'm walking on my hands doing that, though.
Is there a programmatic way to accomplish the same? So that I could do something like:
string htmlFile = ConvertPDFToHTML("FrumiousBandersnatch.PDF");

How do I use Linq-to-XML on the contents of a string (not document)?

Hi and thanks for looking!
Background
I am working on a developer tool for our dev team that parses content from MS Word into a Windows form with text boxes. We do some processing on the text, then submit the form to a database.
Some of the textboxes in the form contain Word XML which we need to clean up and convert to our own XML to later use with XSLT.
When the form populates, I would like to take the Word XML and use Linq to search for certain tags (example: <w:t>SOME TEXT</w:t>) and convert it to our own XML (<Text>SOME TEXT</Text>) before it gets to the textbox.
Question
How do I use Linq-to-Xml on the contents retrieved from a string in the pre-processing stage? I know how to instantiate an XDocument, but this is just a string so I am stumped. Probably missing something simple.
Thanks!
You can use the XDocument.Parse Method to create an XDocument from a string.
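As a rough illustration of that approach, the sketch below parses the string and maps every <w:t> element to a <Text> element. WordXmlConverter, ConvertTextRuns and the <Texts> wrapper element are hypothetical names, and the code assumes the string is a well-formed XML document whose root element declares the w prefix.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class WordXmlConverter
{
    public static XElement ConvertTextRuns(string wordXml)
    {
        // Parse builds an XDocument directly from the string contents.
        XDocument doc = XDocument.Parse(wordXml);

        // Look up whichever namespace the document binds to the "w" prefix.
        XNamespace w = doc.Root.GetNamespaceOfPrefix("w");
        if (w == null)
        {
            throw new ArgumentException("The root element does not declare the 'w' prefix.");
        }

        // Map each <w:t> to <Text>, keeping only the text content.
        return new XElement("Texts",
            doc.Descendants(w + "t").Select(t => new XElement("Text", t.Value)));
    }
}
```

If the textbox content is only a fragment without a single root element or without the namespace declaration, it would need to be wrapped in a root element that declares the prefix before parsing.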

Search and Replace of text in a memorystream in C# .NET

I have loaded a memorystream with a word document and I want to be able to alter specific text within the memorystream and save it back to the word document, like search and replace functionality. Please can anyone help me with this as I don't want to use the Word Interop libraries. I have the code to load and save the document already, please see below. The problem is, if I convert the memorystream to a string and use the string replace method, when I save the string all the formatting within the word document is lost and when I open the document all it shows is black boxes all over the place.
private void ReplaceInFile(string filePath, string searchText, string replaceText)
{
byte[] inputFile = File.ReadAllBytes(filePath);
MemoryStream memory = new MemoryStream(inputFile);
byte[] data = memory.ToArray();
string pathStr = Request.PhysicalApplicationPath + "\\Docs\\OutputDocument.doc";
FileInfo wordFile = new FileInfo(pathStr);
FileStream fileStream = wordFile.Open(FileMode.Create, FileAccess.Write, FileShare.None);
fileStream.Write(data, 0, data.Length);
fileStream.Close();
memory.Close();
}
I copied the code from sample code on the internet, which is why a memorystream was used, as I had no idea how to do it. My issue is that the company I work for doesn't want to use the Word interop, as they have found that Word can occasionally display popup dialog boxes that prevent the coded functionality from executing. This is why I want to look at ways of achieving mail merge functionality programmatically. I did a very similar thing many years ago, but in Delphi rather than C#, and I have since lost the code. So if anyone can shed any light on this then I would be grateful.
You will have to use the Word interop libraries - or at least something similar. It's not like Word documents are just plain text documents - they're binary files. Converting the bytes into a string and doing a replace that way is going to break the document completely.
With the new open formats you may be able to write your own code to parse them, but it's going to be significantly harder than using a library.
Your best bet is to convert the file to OOXML - then that's an XML file which you can update programmatically using a string find / replace, System.Xml or LINQ.
(See http://blogs.msdn.com/b/ericwhite/archive/2008/09/19/bulk-convert-doc-to-docx.aspx for more info on the server side conversion process.)
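Once the file is a .docx, the replace step can be sketched with only the .NET base class library, since the package is a zip archive whose main text lives in word/document.xml. DocxTextReplacer and Replace are hypothetical names; the sketch assumes the search text sits entirely inside a single <w:t> element (Word can split text across runs) and it does not handle the legacy binary .doc format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class DocxTextReplacer
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static byte[] Replace(byte[] docxBytes, string searchText, string replaceText)
    {
        using (var memory = new MemoryStream())
        {
            // Copy into an expandable stream so the archive can be updated in place.
            memory.Write(docxBytes, 0, docxBytes.Length);

            using (var zip = new ZipArchive(memory, ZipArchiveMode.Update, leaveOpen: true))
            {
                ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
                if (entry == null)
                {
                    throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
                }

                using (Stream part = entry.Open())
                {
                    XDocument xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);

                    // Only text nodes are touched, so formatting markup stays intact.
                    foreach (XElement t in xml.Descendants(W + "t"))
                    {
                        t.Value = t.Value.Replace(searchText, replaceText);
                    }

                    // Overwrite the part with the updated XML.
                    part.Position = 0;
                    part.SetLength(0);
                    xml.Save(part, SaveOptions.DisableFormatting);
                }
            }

            // The archive is flushed when disposed, so read the bytes afterwards.
            return memory.ToArray();
        }
    }
}
```

The returned bytes can be written to disk with File.WriteAllBytes using a .docx extension. Headers, footers and footnotes live in separate parts of the package and are not covered by this sketch.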

iText - how to do search/replace on existing RTF document

Currently I'm working on a simple Mail-Merge module.
I need to load a plain *.RTF template, then replace all words enclosed in [[field]] tags, and at the end print the results out.
I found the iText library which is free and capable of loading/saving pdfs and rtf.
I managed to load rtf, merge a few copies to one huge doc but I have no idea how to replace [[field]] by custom data like customer name/address.
Is that feature present, and if yes - how to do it?
The solution platform is c#/.NET
I don't think that pdf is the way you want to go.
According to this article it is extremely difficult at best and not possible at worst.
Would something like RTFLib work better for you?
G-Man
Finally I decided to use *.docx and "Open XML SDK 2.0 for Microsoft Office" .NET strongly typed wrapper.
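You can use the WinForms RichTextBox control to find/replace placeholders.
// Requires: using System.Windows.Forms;
RichTextBox rtb = new RichTextBox();
rtb.LoadFile("template.rtf");
string placeHolder = "[[placeholder_name]]";
int pos = rtb.Find(placeHolder);
// Find returns -1 when the text is not present; loop to replace every occurrence.
while (pos >= 0)
{
    rtb.Select(pos, placeHolder.Length);
    rtb.SelectedText = "new value";
    pos = rtb.Find(placeHolder, pos, RichTextBoxFinds.None);
}
After this you can get the RTF-formatted text with:
string mergedRtf = rtb.Rtf;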
