Word OpenXML replace token text - c#

I'm using OpenXML to amend Word templates. These templates contain simple tokens that are identifiable by delimiter characters (currently double chevrons, « and », characters 171 and 187).
I would like to replace these tokens with my own text, which could be multiline (e.g. coming from a database).

Firstly you need to open the template:
//read file into memory
byte[] docByteArray = File.ReadAllBytes(templateName);
using (MemoryStream ms = new MemoryStream())
{
//write file to memory stream
ms.Write(docByteArray, 0, docByteArray.Length);
//apply the token replacement (shown below)
ReplaceText(ms);
//reset stream
ms.Seek(0L, SeekOrigin.Begin);
//save output
using (FileStream outputStream = File.Create(docName))
ms.CopyTo(outputStream);
}
The simple approach of searching and replacing in the raw XML text of the document body is the quickest, but it doesn't allow for insertion of multiline text and doesn't give you a basis to expand to more complicated changes.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(ms, true))
{
string docText = null;
//read the entire main document part into a string
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
docText = sr.ReadToEnd();
//replace the text (String.Replace returns a new string, so assign the result)
docText = docText.Replace(oldString, myNewString);
//write the text back
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
sw.Write(docText);
}
Instead you need to work with the elements and structure:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(ms, true))
{
//get all the text elements
IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
//filter to the ones that contain the token (materialise the list before modifying the tree)
var tokenTexts = texts.Where(t => t.Text.Contains(oldString)).ToList();
foreach (var token in tokenTexts)
{
//get the parent element
var parent = token.Parent;
//deep clone this Text element
var newToken = token.CloneNode(true);
//split the text into an array using a regex of all line terminators
var lines = Regex.Split(myNewString, "\r\n|\r|\n");
//change the original text element to the first line
((Text) newToken).Text = lines[0];
//if more than one line
for (int i = 1; i < lines.Length; i++)
{
//append a break to the parent
parent.AppendChild<Break>(new Break());
//then append the next line
parent.AppendChild<Text>(new Text(lines[i]));
}
//insert it after the token element
token.InsertAfterSelf(newToken);
//remove the token element
token.Remove();
}
wordDoc.MainDocumentPart.Document.Save();
}
Basically you find the Text element (a Word body is built from Paragraphs containing Runs containing Text), clone it, change it (appending new Break and Text elements if needed), insert the clone after the original token Text element and finally remove the original.
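Note that the loop above overwrites the whole Text element with the first replacement line, so any characters around the token in the same element are lost. A minimal sketch of a variant that keeps the surrounding text, inserting the new nodes in place; the method name and parameters are illustrative, not from the original code:
using System;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

static void ReplaceToken(Text token, string tokenText, string replacement)
{
    //keep whatever text sits before/after the token in the same Text element
    int idx = token.Text.IndexOf(tokenText, StringComparison.Ordinal);
    string before = token.Text.Substring(0, idx);
    string after = token.Text.Substring(idx + tokenText.Length);

    //split the replacement into lines on any line terminator
    string[] lines = Regex.Split(replacement, "\r\n|\r|\n");

    //insert the replacement nodes in order, right where the token sits
    OpenXmlElement anchor = token;
    anchor = anchor.InsertAfterSelf(new Text(before + lines[0]) { Space = SpaceProcessingModeValues.Preserve });
    for (int i = 1; i < lines.Length; i++)
    {
        anchor = anchor.InsertAfterSelf(new Break());
        anchor = anchor.InsertAfterSelf(new Text(lines[i]) { Space = SpaceProcessingModeValues.Preserve });
    }
    if (after.Length > 0)
        anchor.InsertAfterSelf(new Text(after) { Space = SpaceProcessingModeValues.Preserve });

    //finally drop the original token element
    token.Remove();
}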

Related

C# iText7 - 'Trailer Not Found' when using PdfReader with PDF string from database

I'm saving the contents of my PDF file (pdfAsString) to the database. The file is of type IFormFile (uploaded by the user).
string pdfAsString;
using (var reader = new StreamReader(indexModel.UploadModel.File.OpenReadStream()))
{
pdfAsString = await reader.ReadToEndAsync();
// pdfAsString = ; // encoding function or lack thereof
}
Later I fetch these contents and use them to initialize a new MemoryStream, use that to create a PdfReader, and then use that to create a PdfDocument, but at this point I get the 'Trailer not found' exception. I have verified that the trailer part of the PDF is present inside the contents of the string I use to create the MemoryStream, and I have made sure the position is set to the beginning of the stream.
The issue seems related to the format of the PDF contents fetched from the database; iText7 doesn't seem able to navigate them beyond the beginning of the file.
I'm expecting to be able to create an instance of PdfDocument with the contents of the PDF saved to my database.
Note 1: Using the Stream created from OpenReadStream() works when trying to create a PdfReader and then PdfDocument, but I don't have access to that IFormFile when reading from the DB, so this doesn't help me in my use case.
Note 2: If I use the PDF from my device by giving a path, it works correctly, same for using a FileStream created from a path. However, this doesn't help my use case.
So far, I've tried saving it raw and then using that right out of the gate (1) or encoding special symbols like \n \t to ASCII hexadecimal notation (2). I've also tried HttpUtility.UrlEncode on save and UrlDecode after getting the database record (3), and also tried ToBase64String on save and FromBase64String on get (4).
// var pdfContent = databaseString; // 1
// var pdfContent = databaseString.EncodeSpecialCharacters(); // encode special symbols // 2
// var pdfContent = HttpUtility.UrlDecode(databaseString); // decode urlencoded string // 3
var pdfContent = Convert.FromBase64String(databaseString); // decode base64 // 4
using (var stream = new MemoryStream(pdfContent))
{
PdfReader pdfReader = new PdfReader(stream).SetUnethicalReading(true);
PdfWriter pdfWriter = new PdfWriter("new-file.pdf");
PdfDocument pdf = new PdfDocument(pdfReader, pdfWriter); // exception here :(
// some business logic...
}
Any help would be appreciated.
EDIT: on a separate project, I'm trying to run this code:
using (var stream = File.OpenRead("C:\\<path>\\<filename>.pdf"))
{
var formFile = new FormFile(stream, 0, stream.Length, null, "<filename>.pdf");
var reader = new StreamReader(formFile.OpenReadStream());
var pdfAsString = reader.ReadToEnd();
var pdfAsBytes = Encoding.UTF8.GetBytes(pdfAsString);
using (var newStream = new MemoryStream(pdfAsBytes))
{
newStream.Seek(0, SeekOrigin.Begin);
var pdfReader = new PdfReader(newStream).SetUnethicalReading(true);
var pdfWriter = new PdfWriter("Test-PDF-1.pdf");
PdfDocument pdf = new PdfDocument(pdfReader, pdfWriter);
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<string, PdfFormField> fields = form.GetFormFields();
foreach (var field in fields)
{
field.Value.SetValue(field.Key);
}
//form.FlattenFields();
pdf.Close();
}
}
and if I replace "newStream" inside of PdfReader with formFile.OpenReadStream() it works fine, otherwise I get the 'Trailer not found' exception.
Answer: use a BinaryReader and ReadBytes instead of a StreamReader when initially reading the data, so the raw PDF bytes are never run through a text decoder. Example below:
using (var stream = File.OpenRead("C:\\<filepath>\\<filename>.pdf"))
{
// FormFile - my starting point inside of the web application
var formFile = new FormFile(stream, 0, stream.Length, null, "<filename>.pdf");
var reader = new BinaryReader(formFile.OpenReadStream());
var pdfAsBytes = reader.ReadBytes((int)formFile.Length); // store this in the database
using (var newStream = new MemoryStream(pdfAsBytes))
{
newStream.Seek(0, SeekOrigin.Begin);
var pdfReader = new PdfReader(newStream).SetUnethicalReading(true);
var pdfWriter = new PdfWriter("Test-PDF-1.pdf");
PdfDocument pdf = new PdfDocument(pdfReader, pdfWriter);
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<string, PdfFormField> fields = form.GetFormFields();
foreach (var field in fields)
{
field.Value.SetValue(field.Key);
}
//form.FlattenFields();
pdf.Close();
}
}
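The root cause is that a PDF is binary data: reading it through a StreamReader decodes the bytes as text (UTF-8 by default), which silently mangles byte sequences that aren't valid in that encoding, so iText can no longer locate the cross-reference table and trailer. If the database column has to be a string, Base64 (option 4 in the question) does work, but only if the bytes never pass through a text decoder first. A minimal sketch, reusing the question's IFormFile and assuming an async context; names other than that are illustrative:
//copy the raw bytes without any text decoding
byte[] pdfBytes;
using (var ms = new MemoryStream())
{
    await indexModel.UploadModel.File.CopyToAsync(ms);
    pdfBytes = ms.ToArray();
}
string forDatabase = Convert.ToBase64String(pdfBytes); //store this string

//...later, after reading the string back from the database...
byte[] restored = Convert.FromBase64String(forDatabase);
using (var stream = new MemoryStream(restored))
{
    var pdfReader = new PdfReader(stream).SetUnethicalReading(true);
    var pdfWriter = new PdfWriter("new-file.pdf");
    PdfDocument pdf = new PdfDocument(pdfReader, pdfWriter);
    // some business logic...
    pdf.Close();
}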

count word pages from byte array

sorry for my English
I have the contents of a Word document in a byte array and I want to know how many pages it has.
I already did this with a PDF file using this code:
public void MssGetNumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
int pageCount;
MemoryStream stream = new MemoryStream(ssFileBinaryData);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
// TODO: Write implementation for action
}
How do I do something similar with a Word document?
In the PDF I simply search, via the regex, for text that matches this:
Regex(@"/Type\s*/Page[^s]")
What do I have to put in the regex to match the pages of the Word document?
Well, I solved this myself by converting the Word document into a PDF with Aspose (Aspose.Words):
public void MssGet_Word_NumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
// Load Word Document from this byte array
Document loadedFromBytes = new Document(new MemoryStream(ssFileBinaryData));
// Save Word to PDF byte array
MemoryStream pdfStream = new MemoryStream();
loadedFromBytes.Save(pdfStream, SaveFormat.Pdf);
byte[] pdfBytes = pdfStream.ToArray();
int pageCount;
MemoryStream stream = new MemoryStream(pdfBytes);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
}
Can you perhaps elaborate on the tool(s) you used to convert the word doc to PDF?
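The conversion above uses Aspose.Words (its Document and SaveFormat types). If that library is already referenced, the detour through PDF may be unnecessary: Aspose.Words can report the page count directly. A minimal sketch, assuming the Document.PageCount property is available in your version:
using System.IO;
using Aspose.Words;

public void MssGet_Word_NumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages)
{
    //load the Word document from the byte array
    Document doc = new Document(new MemoryStream(ssFileBinaryData));
    //PageCount lays out the document and returns the number of pages
    ssNumberOfPages = doc.PageCount;
}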

PDF text replace not working

I'm trying to replace text in a PDF using the iTextSharp dll, but it's not working in all cases. The PDF document doesn't have AcroForm fields.
If the replacement text is longer than the original text, not all characters are printed. Finding some special characters also doesn't work.
I have tried this code:
using (PdfReader reader = new PdfReader(sourceFileName))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace("SOMETEXT", "NEWBIGGERTEXT");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(newFileName, FileMode.Create, FileAccess.Write)).Close();
}
Please let me know how this can be achieved.

Get substring from MemoryStream without converting entire stream to string

I would like to be able to efficiently get a substring from a MemoryStream (which originally comes from an XML file inside a zip). Currently, I read the entire MemoryStream into a string and then search for the start and end tags of the XML node I want. This works fine, but the file may be very large, so I would like to avoid converting the entire MemoryStream into a string and instead extract the desired section of XML text directly from the stream.
What is the best way to go about this?
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using(var sr = new StreamReader(ms))
{
xmlText = sr.ReadToEnd();
}
}
}
string startTag = "<someTag>";
string endTag = "</someTag>";
int startIndex = xmlText.IndexOf(startTag, StringComparison.Ordinal);
int endIndex = xmlText.IndexOf(endTag, startIndex, StringComparison.Ordinal) + endTag.Length - 1;
xmlText = xmlText.Substring(startIndex, endIndex - startIndex + 1);
If your file is valid XML then you should be able to use an XmlReader to avoid loading the entire file into memory:
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var xml = XmlReader.Create(ms))
{
if(xml.ReadToFollowing("someTag"))
{
xmlText = xml.ReadInnerXml();
}
else
{
// <someTag> not found
}
}
}
}
You'll likely want to catch potential exceptions if the file is not valid xml.
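For example, a minimal guard around the reader, assuming malformed XML is the only failure you care about:
//wrap the read in a try/catch so malformed XML doesn't bubble up unhandled
try
{
    using (var xml = XmlReader.Create(ms))
    {
        if (xml.ReadToFollowing("someTag"))
            xmlText = xml.ReadInnerXml();
    }
}
catch (XmlException ex)
{
    //the zip entry was not well-formed XML
    Console.WriteLine($"Invalid XML: {ex.Message}");
}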
Assuming that, since it is XML, it will have line breaks, it would probably be best to use StreamReader.ReadLine and search for your tags in each line. (Also note: put your StreamReader in a using block as well.)
Something like
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var sr = new StreamReader(ms))
{
bool adding = false;
string startTag = "<someTag>";
string endTag = "</someTag>";
StringBuilder text = new StringBuilder();
while (sr.Peek() >= 0)
{
string tmp = sr.ReadLine();
if (!adding && tmp.Contains(startTag))
{
adding = true;
}
if (adding)
{
text.Append(tmp);
}
if (tmp.Contains(endTag))
break;
}
xmlText = text.ToString();
}
}
This assumes that the start and end tags are on lines by themselves. If not, you could clean up the resulting string by finding the start and end indexes again, as you originally did.
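A minimal sketch of that cleanup, reusing the index logic and variable names from the question:
//trim the accumulated text back to exactly <someTag>...</someTag>
//in case the tags were not alone on their lines
string raw = text.ToString();
int startIndex = raw.IndexOf(startTag, StringComparison.Ordinal);
int endIndex = raw.IndexOf(endTag, startIndex, StringComparison.Ordinal) + endTag.Length;
xmlText = raw.Substring(startIndex, endIndex - startIndex);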

Merge multiple word documents into one using OpenXML and XElement

As the title states, I am trying to merge multiple Word (.docx) files into one Word document. Each of these documents is one page long. I am using some of the code from this post in my implementation. The issue I am running into is that only the first document gets written properly; every other iteration appends a new page, but its contents are identical to the first document's.
Here is the code I am using:
//list that holds the file paths
List<String> fileNames = new List<string>();
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
//get the first document
MemoryStream mainStream = new MemoryStream();
byte[] buffer = File.ReadAllBytes(fileNames[0]);
mainStream.Write(buffer, 0, buffer.Length);
using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(mainStream, true))
{
//xml for the new document
XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
//iterate through each file
for (int i = 1; i < fileNames.Count; i++)
{
//read in the document
byte[] tempBuffer = File.ReadAllBytes(fileNames[i]);
WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(tempBuffer), true);
//new documents XML
XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
//add the new xml
newBody.Add(tempBody);
string str = newBody.ToString();
//write to the main document and save
mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
mainDocument.MainDocumentPart.Document.Save();
mainDocument.Package.Flush();
tempBuffer = null;
}
//write entire stream to new file
FileStream fileStream = new FileStream("xmltest.docx", FileMode.Create);
mainStream.WriteTo(fileStream);
//ret = mainStream.ToArray();
mainStream.Close();
mainStream.Dispose();
}
Again, the problem is that each new page being appended has the same content as the first document, so when I run this the output is a document with five identical pages. I've tried switching the documents' order around in the list and get the same result, so it is nothing specific to one document.
Could anyone suggest what I am doing wrong here? I'm looking through it and I can't explain the behavior I am seeing. Any suggestions would be appreciated. Thanks much!
Edit: I'm thinking this may have something to do with the fact that the documents I am trying to merge have been generated with custom XML parts. I suspect the XPath expressions in the documents are somehow pointing to the same content. The thing is, I can open each of these documents and see the proper content; it's only when I merge them that I see the issue.
This solution uses DocumentFormat.OpenXml
public static void Join(params string[] filepaths)
{
//filepaths = new[] { "D:\\one.docx", "D:\\two.docx", "D:\\three.docx", "D:\\four.docx", "D:\\five.docx" };
if (filepaths != null && filepaths.Length > 1)
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(filepaths[0], true))
{
MainDocumentPart mainPart = myDoc.MainDocumentPart;
for (int i = 1; i < filepaths.Length; i++)
{
string altChunkId = "AltChunkId" + i;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(filepaths[i], FileMode.Open))
{
chunk.FeedData(fileStream);
}
DocumentFormat.OpenXml.Wordprocessing.AltChunk altChunk = new DocumentFormat.OpenXml.Wordprocessing.AltChunk();
altChunk.Id = altChunkId;
//new page, if you like it...
mainPart.Document.Body.AppendChild(new Paragraph(new Run(new Break() { Type = BreakValues.Page })));
//next document
mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
}
mainPart.Document.Save();
myDoc.Close();
}
}
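A note on using the Join method above: it opens filepaths[0] for writing and saves into it, so the merged result ends up in the first file passed in; pass a copy if the original must stay untouched. A minimal call sketch (paths are placeholders):
//the first path receives the merged output
Join("D:\\one.docx", "D:\\two.docx", "D:\\three.docx");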
The way you are merging may not work properly at times. You can try one of these approaches:
Using AltChunk, as in http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx
Using the DocumentBuilder.BuildDocument method from http://powertools.codeplex.com/ (see the sketch after this answer)
If you still face a similar issue, you can find the data-bound controls prior to the merge and assign data to them from the CustomXml part. You can find this approach in the AssignContentFromCustomXmlPartForDataboundControl method of the OpenXmlHelper class; the code can be downloaded from http://worddocgenerator.codeplex.com/
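For the second option, a minimal sketch using the Open XML PowerTools DocumentBuilder (this assumes the OpenXmlPowerTools package; Source, WmlDocument and DocumentBuilder.BuildDocument are its types):
using System.Collections.Generic;
using System.Linq;
using OpenXmlPowerTools;

public static void MergeWithDocumentBuilder(IEnumerable<string> filePaths, string outputPath)
{
    //each Source wraps one input document; 'true' keeps its section properties
    List<Source> sources = filePaths
        .Select(path => new Source(new WmlDocument(path), true))
        .ToList();
    //stitch the sources together into a single .docx
    DocumentBuilder.BuildDocument(sources, outputPath);
}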
