Add header and footer to an RTF file using C#

We have an MVC app that outputs RTF files based on templates (which themselves are RTF files).
The code that my colleague wrote uses System.Windows.Forms.RichTextBox to convert text to an RTF file (to be more exact, it uses the Rtf property of RichTextBox). I was thinking of adding headers and footers to the template RTF files, but RichTextBox appears to remove those. Additionally, some of the documents that we generate are composed of multiple templates (more often than not, a single template does not equal a single page, and one template can be injected in the middle of another), so that's one more reason why including headers and footers in the templates would not work.
Is there any way to add headers and footers in C# to RTF documents created in the way described above?
I tried searching the internet on the subject, but I wasn't able to find anything concrete.

I was searching for a library that could possibly solve my problem and I came across this one:
.NET RTF Writer Library in C#
The library itself doesn't exactly solve my problem on its own, but the documents generated by it are easy to read and without all the crap Word would put into them. The demo for this library generates a document that has a header and a footer. The code of those two looks more or less like this:
{\header
{\pard\fi0\qd
This is a header
\par}
}
{\footer
{\pard\fi0\qc
{\fs30
This is a footer
}\par}
}
I still need to figure out how to apply correct formatting here, but that should be relatively easy to find. So, I can solve my initial problem by injecting the code above into the RTF code generated by RichTextBox. I'm not sure if the position of those two groups matters, but I guess I will find that out soon enough...
Here is the code that I use to inject the header and footer:
public string AddHeaderAndFooter(string rtf)
{
    // Read the file that stores the header and footer RTF code
    string headerCode = System.IO.File.ReadAllText(Server.MapPath("~/DocTemplates/header.txt"));
    // Inject header and footer code immediately before the last "}" character.
    // Insert() places the text at the given index, i.e. directly in front of that brace.
    return rtf.Insert(rtf.LastIndexOf('}'), headerCode);
}
Note I have the header and footer in a static txt file, because it actually contains images in RTF readable format and that would be too big to put in the code. I haven't noticed any problems related to the fact that header and footer are defined at the end of the RTF file.
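For anyone reusing this, here is a slightly more defensive sketch of the same injection step. RtfInjector and InsertBeforeFinalBrace are made-up names, and the sketch assumes that the last "}" in the string is the one closing the outermost RTF group. It only guards against input that has no closing brace at all; it does not validate the RTF.

```csharp
using System;

public static class RtfInjector
{
    // Hypothetical helper: inserts 'groups' (e.g. "{\header ...}{\footer ...}")
    // immediately before the brace that closes the outermost RTF group.
    // Assumption: the last '}' in the string is that closing brace.
    public static string InsertBeforeFinalBrace(string rtf, string groups)
    {
        if (rtf == null) throw new ArgumentNullException(nameof(rtf));
        if (string.IsNullOrEmpty(groups)) return rtf;

        int closing = rtf.LastIndexOf('}');
        if (closing < 0)
            throw new FormatException("No closing brace found; input does not look like RTF.");

        // String.Insert places the new text at 'closing', i.e. directly before the brace.
        return rtf.Insert(closing, groups);
    }
}
```

Calling InsertBeforeFinalBrace(@"{\rtf1 Body\par}", @"{\header Hi\par}") gives {\rtf1 Body\par{\header Hi\par}}, so the header group ends up as the last thing inside the document group.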

Related

Generate a table of contents in HTML with IronPDF

We're in the process of generating a PDF file, using IronPDF, from some HTML we've generated.
This document will contain an unknown number of pages. Aside from showing the page number at the bottom of the page, which we can probably handle using the {page} placeholder, we also need a Table of Contents at the beginning of the document.
While this is probably doable, I fail to see how we should go about implementing something like this. We only have the generated HTML at our disposal, so it's hard to come up with page numbers up front.
I'm guessing using the 'Advanced Templating With Handlebars.Net' functionality can be (mis)used for this scenario, but I'm struggling to get my head around this.
Any suggestions or pointers on how I can proceed in adding a table of contents at the beginning of a document (created from HTML)?
Have you seen these objects? (PdfDocument, its BookMarks property, and PdfOutline)
It looks like IronPDF supports this by inserting bookmarks into the outline (the BookMarks property) with the InsertBookMark method.
From my understanding, it might be possible to first render the document and then add bookmarks based on the resulting pages in the PDF. However, this could be difficult depending on the nature of the document being generated.
https://ironpdf.com/c%23-pdf-documentation/html/T_IronPdf_PdfOutline.htm
https://ironpdf.com/c%23-pdf-documentation/html/P_IronPdf_PdfDocument_BookMarks.htm
https://ironpdf.com/c%23-pdf-documentation/html/T_IronPdf_PdfDocument.htm
https://ironpdf.com/c%23-pdf-documentation/html/M_IronPdf_PdfOutline_InsertBookMark.htm

.net Interop.Word insert from docx without screwing up formatting

I have an issue I've been stuck with for over a year now. I made a Forms application in VB.NET which allows the user to type in some information and select items which represent docx files with tables with special formatting, pictures and other formatting quirks in them.
At the end the software creates a Word document via Office.Interop, using the information the user provided in text fields in the Forms and the items they selected (e.g. it creates a table in Word, listing the user's selections with some extra info) and then appends the content from multiple docx-files depending on the user's selection to the document created via Interop.
The problem is: To achieve this I had to use a pretty dirty method:
I open the respective docx files, select all content (Range.WholeStory()) and copy it (Range.Copy()). Then I insert this content from the clipboard into my newly created document with the following option:
Selection.PasteAndFormat (wdFormatOriginalFormatting)
This produces a satisfactory result but it feels super dirty since it uses the user's clipboard (which I save at the beginning of the runtime and restore at the end).
I originally tried to use the Selection.InsertFile-Method and tried this again today but it completely screws the formatting.
When the content of the docx is inserted this way it neither has the formatting of the original docx nor the one of the file I created with the program. E.g. the SpaceBefore and SpaceAfter values are wrong, even if I explicitly define them in my created file. Changing the formatting afterwards is no option since the source files contain a lot of special formatting and can change all the time.
Another factor which makes it hard: I cannot save the file before it is presented to the user, using temp folder is not an option in the environment this application is deployed into, so basically everything happens in RAM.
Summary:
Basically what I want is to create the same outcome as with my "Copy and Paste" method utilizing the OriginalFormatting WITHOUT using the clipboard. The problem is, the InsertFile-Method doesn't provide an option for the formatting.
Any idea or help would be greatly appreciated.
Edit:
The FormattedText option as suggested by Rich Michaels produces the same result as the InsertFile-Method. Here is the relevant part of what I did (word is the Microsoft.Office.Interop.Word.Application):
' Opening the source file
Dim doctemp As Microsoft.Office.Interop.Word.Document
doctemp = word.Documents.Open(doctempfilepath)
' Selecting whole document; this is what I did for the "Copy/Paste" method, too
doctemp.Range.WholeStory()
Dim insert_range As wordoptions.Range
doc_destination.Activate()
' Jumping to the end and selecting the range
word.Selection.EndKey(Unit:=Microsoft.Office.Interop.Word.WdUnits.wdStory)
insert_range = word.Selection.Range
' Inserting the text
insert_range.FormattedText = doctemp.Range.FormattedText
doctemp.Close(False)
This is the problem:
Use the Range.FormattedText property. It doesn't touch the clipboard and it maintains the source formatting. The process is ...
Set the range in the Source document you want "copied" and set the insertion point in the Destination document and then,
DestinationRange.FormattedText = SourceRange.FormattedText

Get all lines from MS Word header and footer using Interop PIA in C#

I am trying to get all lines from the header and footer in a word doc. I am using the following code:
HeaderFooter header = this.Doc.Sections[1].Headers[Word.WdHeaderFooterIndex.wdHeaderFooterPrimary];
string text = header.Range.Text;
But the Range.Text property always seems to return the last line. If my header has:
Header line 1
Header line 2
Header line 3
My code always returns "Header line 3". I get similar results for the footer. I have tried calling header.Range.WholeStory(). I have tried calling header.Range.Paragraphs and tried iterating over the Paragraphs collection. I am at a complete loss. The documentation on MSDN is cryptic and I can't find my way through it. Any help would be appreciated.
FWIW: I am writing a utility in C# to verify that each document in a large library of Word documents conforms to a company-wide template. The template calls for the header and footer to contain certain information (title, document number, date, etc.). Each datum will also be checked against a regex. I know the use of fields, bookmarks, etc. would be elegant, but I am afraid the documents I am working with simply have the information embedded as text in the header and the footer.
You can try the rest of the headers just in case:
foreach (Section section in this.Doc.Sections)
    foreach (HeaderFooter header in section.Headers)
        if (header.Range.Text.Contains("1")) Debugger.Break();
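I found the answer. Like most things, it was hiding in plain sight. I was debugging from the console using Console.WriteLine(), and the \r characters (bare carriage returns) were getting the best of me: each one sent the cursor back to the start of the line, so every header line overwrote the previous one and only the last was visible.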
private string GetUsableTextFromRange(Word.HeaderFooter headerFooter)
{
    Word.Range range = headerFooter.Range;
    string textWithSlashRs = range.Text;
    string usableText = textWithSlashRs.Replace("\r", Environment.NewLine);
    return usableText;
}
In the end I had the data all along... Hope this helps someone else.
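Since the stated goal was to check each header datum against a regex, it may be handier to split on the \r separators than to replace them. A minimal sketch, with made-up names (HeaderText, SplitLines, AnyLineMatches) and operating on the plain string that Range.Text returns, so it needs no Word reference:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderText
{
    // Splits the text on the '\r' separators described above and drops the
    // empty entry left behind by the trailing separator.
    public static string[] SplitLines(string rangeText)
    {
        return (rangeText ?? string.Empty)
            .Split('\r')
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToArray();
    }

    // True if at least one line matches the caller-supplied pattern.
    public static bool AnyLineMatches(string rangeText, string pattern)
    {
        return SplitLines(rangeText).Any(line => Regex.IsMatch(line, pattern));
    }
}
```

For the example header in the question, SplitLines("Header line 1\rHeader line 2\rHeader line 3\r") yields the three lines as separate strings, which can then be checked one by one.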

Extracting text from PDF with iTextSharp is not working for some PDF

I am using the following code to extract text from the first page of PDF files with iTextSharp :
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem to be quite similar, but one works and the other does not.
I guess the fact that their producer tag is not the same is a clue here.
Another clue is that this function works for any other page of the PDF without a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences array, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the glyph used.
The first document works, because the mapping is not changed at all and iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means, for example, that character code 2 should map to the glyph named MT76, which is an unofficial/private glyph name that iText doesn't know. iText therefore has no information other than the character code 2 and will use this code in the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it is impossible to get the correct text out of this document. However, it is not defined anywhere that a glyph name beginning with MT followed by an integer can be mapped to the ASCII value. That is simply by accident, or a convention of the font designer or creation tool, wherever it came from.
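To make the "logic for the MT-prefixed glyph names" concrete, here is a rough sketch in plain C#. It is not iText code, and MtGlyphDecoder, BuildMap and Decode are invented names. It relies entirely on the undocumented assumption discussed above that MT<n> stands for ASCII code n, and on the further guess that the extracted text contains the raw character codes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class MtGlyphDecoder
{
    // Parses the text of a /Differences array into a code -> char map,
    // ASSUMING the unofficial convention "MT<n>" == ASCII code n.
    public static Dictionary<int, char> BuildMap(string differences)
    {
        var map = new Dictionary<int, char>();
        int code = 0;
        // A token is either a /Name or a bare integer that sets the next code.
        foreach (Match token in Regex.Matches(differences, @"/(?<name>[^\s/\[\]]+)|(?<num>\d+)"))
        {
            if (token.Groups["num"].Success)
            {
                code = int.Parse(token.Groups["num"].Value);
                continue;
            }
            string name = token.Groups["name"].Value;
            int ascii;
            if (name.StartsWith("MT", StringComparison.Ordinal)
                && int.TryParse(name.Substring(2), out ascii)
                && ascii >= 0 && ascii < 128)
            {
                map[code] = (char)ascii;
            }
            code++; // every name occupies one character code
        }
        return map;
    }

    // Replaces each raw character whose code is in the map; others pass through.
    public static string Decode(string raw, IDictionary<int, char> map)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            char mapped;
            sb.Append(map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```

With the second document's array, codes 2 to 6 map to MT76, MT105, MT103, MT104 and MT116, i.e. the letters L, i, g, h, t. The map belongs to one font only, so a page that mixes fonts would need one map per font, which this sketch ignores.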
The second PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded for some reason when this PDF was generated. If you have a lot of files like this, then a working approach could be:
detect broken pages while extracting text by searching for some phrase that should appear on every page, maybe like "service"
process these pages separately using OCR with tools like Tesseract with a .NET wrapper
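The detection step could look roughly like the following. This is only a heuristic sketch: ExtractionCheck and LooksBroken are hypothetical names, and the control-character check and its 5% threshold are my own assumptions, added on top of the phrase search suggested above.

```csharp
using System;
using System.Linq;

public static class ExtractionCheck
{
    // Heuristic: a page is suspect if the expected phrase is missing or if a
    // noticeable share of its characters are non-whitespace control codes.
    // The default threshold of 5% is an arbitrary assumption.
    public static bool LooksBroken(string pageText, string expectedPhrase, double maxControlRatio = 0.05)
    {
        if (string.IsNullOrWhiteSpace(pageText)) return true;

        bool phraseMissing = pageText.IndexOf(expectedPhrase, StringComparison.OrdinalIgnoreCase) < 0;
        int controls = pageText.Count(c => char.IsControl(c) && !char.IsWhiteSpace(c));
        double ratio = (double)controls / pageText.Length;

        return phraseMissing || ratio > maxControlRatio;
    }
}
```

Pages for which LooksBroken returns true would then be handed to the OCR path instead of being used as extracted.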

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
    var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working at paragraph level (concatenating the content and then replacing the content of the paragraph) is that I fear losing the formatting that should be kept, as in
Do not use the name admin
Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.
OpenXML is indeed fragmenting your text:
I created a library that does exactly this: render a Word template with the values from JSON.
From the documentation of docxtemplater:
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag with <w:t xml:space="preserve"> to ensure that spaces are not stripped out if there are any in your variables.
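The merging idea quoted above can be sketched in C# on plain strings, independent of any library. FragmentReplacer and ReplaceFirst are invented names, and this is not how docxtemplater itself is implemented. The list stands for the text of consecutive <w:t> elements in one paragraph, and only the first occurrence of the placeholder is handled.

```csharp
using System;
using System.Collections.Generic;

public static class FragmentReplacer
{
    // Replaces the first occurrence of 'placeholder' even when it is spread
    // over several fragments. The value goes into the fragment where the
    // placeholder starts; the rest of the placeholder is cut from later ones.
    // Fragments are assumed to be non-null.
    public static bool ReplaceFirst(IList<string> fragments, string placeholder, string value)
    {
        if (string.IsNullOrEmpty(placeholder))
            throw new ArgumentException("Placeholder must not be empty.", nameof(placeholder));

        int start = string.Concat(fragments).IndexOf(placeholder, StringComparison.Ordinal);
        if (start < 0) return false;
        int end = start + placeholder.Length; // exclusive

        int offset = 0;
        bool inserted = false;
        for (int i = 0; i < fragments.Count; i++)
        {
            string part = fragments[i];
            int partStart = offset;
            offset += part.Length;

            int from = Math.Max(start, partStart);
            int to = Math.Min(end, offset);
            if (from >= to) continue; // fragment does not overlap the placeholder

            string before = part.Substring(0, from - partStart);
            string after = part.Substring(to - partStart);
            fragments[i] = before + (inserted ? string.Empty : value) + after;
            inserted = true;
        }
        return true;
    }
}
```

For the fragments "Hello", "{name", "} !", "How are you ?" and the placeholder {name}, the list becomes "Hello", "John", " !", "How are you ?". Unlike the quoted result, the fragments are not merged, so each one keeps the formatting of the run it came from. To use this on a real document, one would presumably copy the Text property of each Text element in a paragraph into the list and write the values back afterwards.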
