PDF text replace not working - c#

I'm trying to replace text in a PDF using the iTextSharp DLL, but it's not working in all cases. The PDF document doesn't have AcroForm fields.
If the replacement text is longer than the original text, not all of its characters are printed. Finding some special characters also doesn't work.
I have tried this code:
using (PdfReader reader = new PdfReader(sourceFileName))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace("SOMETEXT", "NEWBIGGERTEXT");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(newFileName, FileMode.Create, FileAccess.Write)).Close();
}
Please let me know how this can be achieved.

Related

Maori Macrons aren't recognized when converting the letter to PDF using aspose word

I'm using the code below to convert HTML to PDF with the Aspose.Words plugin, but the Maori macrons aren't recognized, which causes problems.
MemoryStream stream = new MemoryStream(data);
// load HTML file
Aspose.Words.LoadOptions loadOptions = new Aspose.Words.LoadOptions();
loadOptions.LoadFormat = LoadFormat.Html;
Aspose.Words.Document doc = new Aspose.Words.Document(stream, loadOptions);
foreach (Aspose.Words.Section section in doc)
{
section.PageSetup.PaperSize = PaperSize.Letter;
section.PageSetup.RightMargin = 20;
section.PageSetup.LeftMargin = 10;
section.PageSetup.TopMargin = 10;
section.PageSetup.BottomMargin = 10;
}
doc.Save(strPath, SaveFormat.Pdf);
Sample text before converting:
What we get in Pdf after conversion.

Adding a HTML page as a last page to PDF document

I am creating a PDF document consisting of 6 images (one image per page) using iTextSharp.
I need to add an HTML page as the last page, after the 6th image.
I have tried the code below, but the HTML does not get added on a new page; instead it gets attached immediately below the 5th image.
Please advise how to make the HTML appear on the last page.
Code for reference:
string ImagePath = HttpContext.Current.Server.MapPath("~/Images/");
string[] fileNames = System.IO.Directory.GetFiles(ImagePath);
string outputFileNames = "Test.pdf";
string outputFilePath = System.Web.Hosting.HostingEnvironment.MapPath("~/Pdf/" + outputFileNames);
Document doc = new Document(PageSize.A4, 20, 20, 20, 20);
System.IO.Stream st = new FileStream(outputFilePath, FileMode.Create, FileAccess.Write);
PdfWriter writer = PdfWriter.GetInstance(doc, st);
doc.Open();
writer.PageEvent = new Footer();
for (int i = 0; i < fileNames.Length; i++)
{
string fname = fileNames[i];
if (System.IO.File.Exists(fname) && Path.GetExtension(fname) == ".png")
{
iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(fname);
img.Border = iTextSharp.text.Rectangle.BOX;
img.BorderColor = iTextSharp.text.BaseColor.BLACK;
doc.Add(img);
}
}
byte[] pdf; // result will be here
var cssText = File.ReadAllText(MapPath("~/Style1.css"));
var html = File.ReadAllText(MapPath("~/HtmlPage1.html"));
using ( var memoryStream = new MemoryStream())
{
using (var cssMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(cssText)))
{
using (var htmlMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(html)))
{
XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlMemoryStream, cssMemoryStream);
}
}
pdf = memoryStream.ToArray();
//document.Add(new Paragraph(Encoding.UTF8.GetString(pdf)));
}
doc.NewPage();
doc.Add(new Paragraph(Encoding.UTF8.GetString(pdf)));
doc.Close();
writer.Close();
Any help is appreciated
In contrast to what you assume according to your code comments, pdf is not where the result will be. It remains empty:
byte[] pdf; // result will be here
...
using ( var memoryStream = new MemoryStream())
{
... code not accessing memoryStream ...
pdf = memoryStream.ToArray();
//document.Add(new Paragraph(Encoding.UTF8.GetString(pdf)));
}
doc.NewPage();
doc.Add(new Paragraph(Encoding.UTF8.GetString(pdf)));
Thus, you add the new page before adding an empty paragraph, after the converted html already has been added to the document.
Actually it is added during
XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlMemoryStream, cssMemoryStream);
So you have to add the new page before that call. Replacing everything from your byte[] pdf; line onward with the following should do the job:
var cssText = File.ReadAllText(MapPath("~/Style1.css"));
var html = File.ReadAllText(MapPath("~/HtmlPage1.html"));
using (var cssMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(cssText)))
{
using (var htmlMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(html)))
{
doc.NewPage();
XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlMemoryStream, cssMemoryStream);
}
}
doc.Close();
As an aside, don't close the writer! It is closed implicitly when the doc is closed. Closing it again does nothing at best and may cause damage otherwise.
In a comment you claimed
but this also does not resolve the issue... the pdf content still get added after the image and then continued on new page.
So I tested the proposed change. Obviously I don't have your environment, nor your image, html, and css files. Thus, I used my own: a small screenshot and "<html><body><h1>Test</h1><p>This is a test piece of html</p></body></html>".
With your code I get:
With the code changed as described above I get
My impression here is that the proposed code change does resolve the issue. The html content is added on a new page.
Thus, apparently you either applied the proposed change incorrectly, executed old code, or inspected an old result.

count word pages from byte array

sorry for my English
I have the contents of a word document in a byte array and I want to know how many pages it has.
I already did this with a pdf file using this code:
public void MssGetNumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
int pageCount;
MemoryStream stream = new MemoryStream(ssFileBinaryData);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
// TODO: Write implementation for action
}
How do I do something similar, with a word document?
In the pdf I simply have to search through the regex the text that matches this:
Regex(@"/Type\s*/Page[^s]")
What do I have to put in the regex to match the pages of the word document?
Well, I solved this myself by converting the word document into pdf with Aspose.dll
public void MssGet_Word_NumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
// Load Word Document from this byte array
Document loadedFromBytes = new Document(new MemoryStream(ssFileBinaryData));
// Save Word to PDF byte array
MemoryStream pdfStream = new MemoryStream();
loadedFromBytes.Save(pdfStream, SaveFormat.Pdf);
byte[] pdfBytes = pdfStream.ToArray();
int pageCount;
MemoryStream stream = new MemoryStream(pdfBytes);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
}
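The page-counting regex used above can be tried on its own, without Aspose. The following is only a minimal sketch: the class and method names (PdfPageCounter, CountPages) are made up for illustration, and the sample bytes are a hand-written fragment, not a real PDF. It decodes the bytes as ISO-8859-1, so every byte maps to exactly one character, instead of relying on the default StreamReader encoding. Keep in mind that this is a heuristic: PDFs that store their page objects in compressed object streams will not contain the plain text "/Type /Page", so the count can come out too low.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class PdfPageCounter
{
    // Matches "/Type /Page" followed by any character other than 's',
    // so the page-tree node "/Type /Pages" is not counted.
    private static readonly Regex PageObject = new Regex(@"/Type\s*/Page[^s]");

    // Hypothetical helper: counts page-object markers in raw PDF bytes.
    public static int CountPages(byte[] pdfBytes)
    {
        // ISO-8859-1 maps each byte to one char, so binary data is not mangled.
        string pdfText = Encoding.GetEncoding("ISO-8859-1").GetString(pdfBytes);
        return PageObject.Matches(pdfText).Count;
    }

    static void Main()
    {
        // Hand-written fragment for illustration only (assumption, not a valid PDF).
        byte[] sample = Encoding.ASCII.GetBytes(
            "1 0 obj << /Type /Pages /Count 2 >> endobj\n" +
            "2 0 obj << /Type /Page /Parent 1 0 R >> endobj\n" +
            "3 0 obj << /Type/Page /Parent 1 0 R >> endobj\n");

        Console.WriteLine(CountPages(sample)); // prints 2
    }
}
```

The same CountPages call could be given the pdfBytes array produced by the Word-to-PDF conversion above.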
Can you perhaps elaborate on the tool(s) you used to convert the word doc to PDF?

Word OpenXML replace token text

I'm using OpenXML to amend Word templates. These templates contain simple tokens that are identifiable by certain characters (currently the double chevrons « and », character codes 171 and 187, which lie outside 7-bit ASCII).
I would like to replace these tokens with my text, which could be multiline (i.e. from a database).
Firstly you need to open the template:
//read file into memory
byte[] docByteArray = File.ReadAllBytes(templateName);
using (MemoryStream ms = new MemoryStream())
{
//write file to memory stream
ms.Write(docByteArray, 0, docByteArray.Length);
//
ReplaceText(ms);
//reset stream
ms.Seek(0L, SeekOrigin.Begin);
//save output
using (FileStream outputStream = File.Create(docName))
ms.CopyTo(outputStream);
}
The simple approach searching the inner text xml of the body is the quickest way, but doesn't allow for insertion of multiline text and doesn't give you the basis to expand to more complicated changes.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(ms, true))
{
string docText = null;
//read the entire document into a text
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
docText = sr.ReadToEnd();
//replace the text
docText = docText.Replace(oldString, myNewString);
//write the text back
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
sw.Write(docText);
}
Instead you need to work with the elements and structure:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(ms, true))
{
//get all the text elements
IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
//filter them to the ones that contain the QuoteLeft char
//materialize the matches first, because the loop below modifies the element tree
var tokenTexts = texts.Where(t => t.Text.Contains(oldString)).ToList();
foreach (var token in tokenTexts)
{
//get the parent element
var parent = token.Parent;
//deep clone this Text element
var newToken = token.CloneNode(true);
//split the text into an array using a regex of all line terminators
var lines = Regex.Split(myNewString, "\r\n|\r|\n");
//set the cloned text element to the first line
((Text) newToken).Text = lines[0];
//if more than one line
for (int i = 1; i < lines.Length; i++)
{
//append a break to the parent
parent.AppendChild<Break>(new Break());
//then append the next line
parent.AppendChild<Text>(new Text(lines[i]));
}
//insert it after the token element
token.InsertAfterSelf(newToken);
//remove the token element
token.Remove();
}
wordDoc.MainDocumentPart.Document.Save();
}
Basically you find the Text element (Word is built from Paragraphs of Runs of Text), clone it, change it (inserting new Break and Text elements if needed), then add it after the original token Text element and finally remove the original token Text element.
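The line-splitting step can be looked at in isolation. The sketch below uses only the base class library: the helper name ToRunPieces is hypothetical, and the string "<br/>" is just a stand-in for an OpenXML Break element, so this shows the order in which pieces would be produced rather than real document editing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineSplitDemo
{
    // Hypothetical helper: returns the pieces a run would receive, in order.
    // "<br/>" stands in for an OpenXML Break element (assumption for illustration).
    public static List<string> ToRunPieces(string replacement)
    {
        // "\r\n" is listed first so a Windows line ending is not split twice.
        string[] lines = Regex.Split(replacement, "\r\n|\r|\n");

        var pieces = new List<string> { lines[0] };
        for (int i = 1; i < lines.Length; i++)
        {
            pieces.Add("<br/>");   // a Break would be appended here
            pieces.Add(lines[i]);  // followed by a Text element with the next line
        }
        return pieces;
    }

    static void Main()
    {
        List<string> pieces = ToRunPieces("first\r\nsecond\nthird");
        Console.WriteLine(string.Join("|", pieces)); // prints first|<br/>|second|<br/>|third
    }
}
```

Text without any line terminator yields a single piece, which corresponds to the case where only the cloned Text element is changed.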

Merge multiple word documents into one using OpenXML and XElement

As the title states, I am trying to merge multiple Word (.docx) files into one Word document. Each of these documents is one page long. I am using some of the code from this post in this implementation. The issue I am running into is that only the first document gets written properly; every later iteration appends a new document, but its content is the same as the first.
Here is the code I am using:
//list that holds the file paths
List<String> fileNames = new List<string>();
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
fileNames.Add("filePath");
//get the first document
MemoryStream mainStream = new MemoryStream();
byte[] buffer = File.ReadAllBytes(fileNames[0]);
mainStream.Write(buffer, 0, buffer.Length);
using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(mainStream, true))
{
//xml for the new document
XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
//iterate through each file
for (int i = 1; i < fileNames.Count; i++)
{
//read in the document
byte[] tempBuffer = File.ReadAllBytes(fileNames[i]);
WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(tempBuffer), true);
//new documents XML
XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
//add the new xml
newBody.Add(tempBody);
string str = newBody.ToString();
//write to the main document and save
mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
mainDocument.MainDocumentPart.Document.Save();
mainDocument.Package.Flush();
tempBuffer = null;
}
//write entire stream to new file
FileStream fileStream = new FileStream("xmltest.docx", FileMode.Create);
mainStream.WriteTo(fileStream);
//ret = mainStream.ToArray();
mainStream.Close();
mainStream.Dispose();
}
Again the problem is that each new document being created has the same content as the first document. So when I run this the output will be a document with five identical pages. I've tried switching the documents order around in the list and get the same result so it is nothing specific to one document.
Could anyone suggest what I am doing wrong here? I'm looking through it and I can't explain the behavior I am seeing. Any suggestions would be appreciated. Thanks much!
Edit: I'm thinking this may have something to do with the fact that the documents I am trying to merge have been generated with custom XML parts. I suspect that the XPath expressions in the documents are somehow pointing to the same content. The thing is, I can open each of these documents and see the proper content; it's only when I merge them that I see the issue.
This solution uses DocumentFormat.OpenXml
public static void Join(params string[] filepaths)
{
//filepaths = new[] { "D:\\one.docx", "D:\\two.docx", "D:\\three.docx", "D:\\four.docx", "D:\\five.docx" };
if (filepaths != null && filepaths.Length > 1)
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(filepaths[0], true))
{
MainDocumentPart mainPart = myDoc.MainDocumentPart;
for (int i = 1; i < filepaths.Length; i++)
{
string altChunkId = "AltChunkId" + i;
AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(filepaths[i], FileMode.Open))
{
chunk.FeedData(fileStream);
}
DocumentFormat.OpenXml.Wordprocessing.AltChunk altChunk = new DocumentFormat.OpenXml.Wordprocessing.AltChunk();
altChunk.Id = altChunkId;
//new page, if you like it...
mainPart.Document.Body.AppendChild(new Paragraph(new Run(new Break() { Type = BreakValues.Page })));
//next document
mainPart.Document.Body.InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
}
mainPart.Document.Save();
myDoc.Close();
}
}
The way you are merging may not always work properly. You can try one of these approaches:
Using AltChunk, as in http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx
Using the DocumentBuilder.BuildDocument method from http://powertools.codeplex.com/
If you still face a similar issue, you can find the databound controls prior to the merge and assign data to these controls from the CustomXml part. You can find this approach in the method AssignContentFromCustomXmlPartForDataboundControl of the OpenXmlHelper class. The code can be downloaded from http://worddocgenerator.codeplex.com/
