iText - how to do search/replace on existing RTF document

iText - how to do search/replace on existing RTF document - c#

Currently I'm working on a simple Mail-Merge module.
I need to load plain *.RTF template, then replace all words enclosed in [[field]] tags and at the end and print them out.
I found the iText library which is free and capable of loading/saving pdfs and rtf.
I managed to load rtf, merge a few copies to one huge doc but I have no idea how to replace [[field]] by custom data like customer name/address.
Is that feature present, and if yes - how to do it?
The solution platform is c#/.NET

I don't think that pdf is the way you want to go.
According to this article it is extremely difficult at best and not possible at worst.
Would something like RTFLib work better for you?
G-Man

Finally I decided to use *.docx and "Open XML SDK 2.0 for Microsoft Office" .NET strongly typed wrapper.

You can use RichTextBox control to find/replace placeholders.
RichTextBox rtb = new RichTextBox();
rtb.LoadFile("template.rtf");
string placeHolder = "[[placeholder_name]]";
int pos = rtb.Find(placeHolder);
rtb.Select(pos, placeHolder.Length);
rtb.SelectedText = "new value";
After this you can get rtf formatted text with:
rtb.Rtf;

Related

Creating Word file from ObservableCollection with C#

I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but if there are 1000 items in the collection it takes about 10sec, much much more if the number is 10k+. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); and that takes less than a sec, but I can't set the word to be bold that way.
Then I switched to using Open XML SDK 2.5, but even after reading the msdn documentation I still have no idea how to make just a part of the text bold, and I don't know if it's even possible to set PageLayout Columns count. The only thing I could do was to make it look the same as with Interop.Word, but with just 1 column and <1sec creation time.
Should I be using Interop.Word or Open XML (or maybe combined) for this? And can someone pls show me how to write this properly, so it doesn't take forever if the collection is relatively large? Any help is appreciated. :)

OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
Your Word template would look like this:
Here's some pseudo code to get you started, assuming you have the SDK installed
var templateRegex = new Regex("\\[templateForWords\\]");
var wordPlacementRegex = new Regex("\\[word\\]");
var translationPlacementRegex = new Regex("\\[translation]\\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
MainDocumentPart mainPart = document.MainDocumentPart;
// do your work here...
var paragraphTemplate = mainPart.Document.Body
.Descendants<Paragraph>()
.Where(p=>templateRegex.IsMatch(p.InnerText)); //pseudo
//... or whatever gives you the text of the Para, I don't have the SDK right now
foreach (string word in YourDictionary){
var paraClone = paragraphTemplate.Clone(); // pseudo
// you may need to do something like
// paraClone.Descendents<Text>().Where(t=>regex.IsMatch(t.Value))
// to find the exact element containing template text
paraClone.Text = templateRegex.Replace(paraClone.Text,"");// pseudo
paraClone.Text = wordPlacementRegex.Replace(paraClone.Text,word);
paraClone.Text = translationPlacementRegex.Replace(paraClone.Text,YourDictionary[word]);
paragraphTemplate.Parent.InsertAfter(paraClone,ParagraphTemplate); // pseudo
}
paragraphTemplate.Remove();
// document should auto-save
document.Package.Flush();
}

OpenXML is absolutely better, because it is faster, has less bugs, more reliable and flexible in runtime (especially in server environment). And it's not really difficult to find out how to make one or another element using OpenXML. As docx file is just a zip file with xml files inside, I open it and read the xml to get the idea, how word itself makes it. First of all, I create a document, then format it (in your case, you can create some file with two columns and bold words inside), save it, rename it to .zip file. Then open it, open "word" directory inside and the file "document.xml" inside the directory. This document contains essential part of xml, looking at this it's not difficult to figure out how to recreate it in OpenXML

Open XML is a much better option than Office COM. But the problem is that it is a low-level file format library that unlike Office COM doesn’t work on a high abstraction level. You might want to go that route but I recommend you to first consider looking into a commercial library that will give you the benefits of a high-level DOM without the need to have MS Word installed on the production machine. Our company recently purchased this toolkit which allows you to use template based approach and also DOM/programmatic approach to generate/modify/create documents.

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working on paragraph level, concatenating the content and then replacing the content of the paragraph, I fear I'm losing the formatting that should be kept as in
Do not use the name admin

Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.

OpenXML is indeed fragmenting your text:
I created a library that does exactly this : render a word template with the values from a JSON.
From the documenation of docxtemplater :
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space=\"preserve\"> to ensure that the space is not stripped out if they is any in your variables.

Remove Markdown tags from string

I have a string which has Markdown tags embedded inside it. I do not want to encode the Markdown as anything else, I just want to rip out all of the tags.
How can I do this quickly? I need to do this as part of a batch processing job which processes around 5 million pieces of text, so speed is very important.
I looked at MarkdownSharp, and using Transform, but I'm not sure it's the best way of doing this. I just want plaintext output, with no tags inside. I'm even considering a regex removal, but I'm not sure what the most performant option would be.

You could probably use MarkdownSharp or any other similar library (I recommend Strike, since it is surprisingly fast!) to convert the Markdown to Html and then use HtmlAgilityPack to extract the text.
A faster option, but more work for you, would be to modify an existing Markdown parser to produce plain text instead.

The solution was a bit hard to get from the comments, but this works for .NET 6:
Install Markdown Deep from NuGet. I needed something for .NET 6 so I used the Core version https://www.nuget.org/packages/MarkdownDeep.NET.Core/
Create a Markdown object:
using MarkdownDeep;
var markdownRemover = new Markdown()
{
SummaryLength = -1
};
Strip the markdown from the text:
var plainText = markdownRemover.Transform(mdText);

Rtf to WordML Convert in C#

I have a windows application to generate report.
It has templates in RTF as "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n", which is written to word doc file. then the word is Saved-As XML and close. Then, tags like (say) are extracted and some new
The problem here is Word, which is used as converter in the process and it consumes valuable time in Loop, where it opens word instance, save, close, delete.
Please correct any mistake if i have made and help me with an alternative to convert to WordML .

Use Aspose .Words
//your rtf string
string rtfStrx = "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n"
//convert string to bytes for memory stream
byte[] rtfBytex = Encoding.UTF8.GetBytes(rtfStrx);
MemoryStream rtfStreamx = new MemoryStream(rtfBytex);
Document rtfDocx = new Document(rtfStreamx);
rtfDocx.Save(#"C:\Temp.xml", SaveFormat.WordML);
This saves your RTF text in new document as WordML. I cannot say about time it will take in loop. But it will surely have much less time then MS Word being physically opened and closed.

Unless I am missing something, I assume that you are trying to create Office XML file from RTF template? I think you can use Open XML SDK for creation of the xml file. Specifically, DocumentReflector that comes with that SDK seems to a good fit for that. See this example. Also, there is a http://www.codeguru.com/cpp/controls/richedit/conversions/article.php/c5377/ which shows how to convert from RTF to HTML that might guide you.

use wpf richtextbox. Rtf => xaml. Since xaml is xml_ use xslt or linq to convert it to your desired xml structure

Subscript text in pdf C#

How do I insert a subscript character in a string in C#?
I have problems appending a superscript "2" in the same string using char.ConvertFromUtf32(178);, but I struggle with finding a similar solution for the subscripted text. Actually, I'm struggling with finding any solution at all to this rather embarrassing issue.

Plain text doesn't have formatting, like superscript, subscript, bold, italic and/or colors.
You need to use some "rich text" format.
The type of "rich text" depends on where you want to use it. Examples: HTML, RTF.
For PDF you need to look into the formatting options provided by your PDF creation library.

The PDF creation library I'm using did not offer much.
One work around I could figure out was to pick equalent ascii values from charecter map and append it to the existing string.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.