Merge PDF based on size using Aspose - c#

I am new to Aspose and need help with the following case.
I want to merge multiple PDFs into one PDF using Aspose. I can do that easily, but the problem is that I want to limit the PDF size to 200 MB.
That means that if the merged PDF is larger than 200 MB, I need to split it into multiple PDFs. For example, if the merged PDF is 300 MB, the first PDF should be 200 MB and the second should be 100 MB.
The main problem is that I cannot find the size of the document in the code below:
Document destinationPdfDocument = new Document();
Document sourcePdfDocument = new Document();
//Merge PDF one by one
for (int i = 0; i < filesFromDirectory.Count(); i++)
{
    if (i == 0)
    {
        destinationPdfDocument = new Document(filesFromDirectory[i].FullName);
    }
    else
    {
        // Open second document
        sourcePdfDocument = new Document(filesFromDirectory[i].FullName);
        // Add pages of second document to the first
        destinationPdfDocument.Pages.Add(sourcePdfDocument.Pages);
        //** I need to check size of destinationPdfDocument over here to limit the size of resultant PDF**
    }
}
// Encrypt PDF
destinationPdfDocument.Encrypt("userP", "ownerP", 0, CryptoAlgorithm.AESx128);
string finalPdfPath = Path.Combine(destinationSourceDirectory, destinatedPdfPath);
// Save concatenated output file
destinationPdfDocument.Save(finalPdfPath);
Any other way of merging PDFs based on size would also be appreciated.
Thanks in advance.

I am afraid there is no direct way to determine the size of a PDF file before saving it physically. We have therefore logged a feature request as PDFNET-43073 in our issue tracking system, and the product team is investigating the feasibility of this feature. As soon as we have significant updates regarding its availability, we will inform you. Please spare us a little time.
However, as a workaround, you may save the document into a memory stream and check whether the size of that stream exceeds your desired PDF size. Please check the following code snippet, where we have generated PDFs with a desired size of 200 MB using this approach.
// Instantiate document objects
Document destinationPdfDocument = new Document();
Document sourcePdfDocument;
// Size limit in bytes (200 MB)
const long maxSizeInBytes = 200L * 1024 * 1024;
// Load source files which are to be merged
string[] filesFromDirectory = Directory.GetFiles(dataDir, "*.pdf");
for (int i = 0; i < filesFromDirectory.Length; i++)
{
    if (i == 0)
    {
        destinationPdfDocument = new Document(filesFromDirectory[i]);
    }
    else
    {
        // Open the next document
        sourcePdfDocument = new Document(filesFromDirectory[i]);
        // Add its pages to the current output document
        destinationPdfDocument.Pages.Add(sourcePdfDocument.Pages);
        // Save to a memory stream to measure the current size of the merged document
        long filesize;
        using (MemoryStream ms = new MemoryStream())
        {
            destinationPdfDocument.Save(ms);
            filesize = ms.Length;
        }
        if (i == filesFromDirectory.Length - 1)
        {
            // Last file: save whatever has been merged so far
            destinationPdfDocument.Save(dataDir + "PDFOutput_" + i + ".pdf");
        }
        else if (filesize < maxSizeInBytes)
            continue;
        else
        {
            // Limit reached: save this part and start a new document.
            // The part is saved after the limit was crossed, so it can be slightly larger than the limit.
            destinationPdfDocument.Save(dataDir + "PDFOutput_" + i + ".pdf");
            destinationPdfDocument = new Document();
        }
    }
}
I hope this will be helpful. Please let us know if you need any further assistance.
I work with Aspose as Developer Evangelist.
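Another possible way to split by size, sketched here under assumptions, is to group the input files up front by their on-disk sizes and then merge each group into its own output PDF. The sum of the input sizes is only an estimate of the merged size (shared resources, compression, and encryption can change it), so the memory-stream check above remains the more accurate measure. The names SizeBatcher and GroupBySize are hypothetical and not part of Aspose; the sketch uses only the .NET base class library.

```csharp
using System;
using System.Collections.Generic;

static class SizeBatcher
{
    // Greedily groups items into consecutive batches whose estimated total
    // size stays at or below maxBytes. An item that is larger than maxBytes
    // on its own ends up in a batch by itself.
    public static List<List<T>> GroupBySize<T>(IEnumerable<T> items, Func<T, long> sizeOf, long maxBytes)
    {
        var batches = new List<List<T>>();
        var current = new List<T>();
        long currentTotal = 0;
        foreach (T item in items)
        {
            long size = sizeOf(item);
            // Start a new batch if adding this item would cross the limit
            if (current.Count > 0 && currentTotal + size > maxBytes)
            {
                batches.Add(current);
                current = new List<T>();
                currentTotal = 0;
            }
            current.Add(item);
            currentTotal += size;
        }
        if (current.Count > 0)
        {
            batches.Add(current);
        }
        return batches;
    }
}
```

With file paths as items, a call could look like SizeBatcher.GroupBySize(filesFromDirectory, f => new FileInfo(f).Length, 200L * 1024 * 1024); each resulting batch would then be merged and saved as one output file.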

Related

c# Novacode.Picture to System.Drawing.Image

I'm reading in a .docx file using the Novacode API, and am unable to create or display any images within the file to a WinForm app due to not being able to convert from a Novacode Picture (pic) or Image to a system image. I've noticed that there's very little info inside the pic itself, with no way to get any pixel data that I can see. So I have been unable to utilize any of the usual conversion ideas.
I've also looked up how Word saves images inside the files as well as Novacode source for any hints and I've come up with nothing.
My question, then: is there a way to convert a Novacode Picture to a system one, or should I use something different, like OpenXML, to gather the image data? If so, would Novacode and OpenXML conflict in any way?
There's also this answer that might be another place to start.
Any help is much appreciated.
Okay. This is what I ended up doing. Thanks to gattsbr for the advice. This only works if you can grab all the images in order, and have descending names for all the images.
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.IO.Compression; // Had to add an assembly for this
using System.Linq;
using Novacode;
// Have to specify to remove ambiguous error from Novacode
Dictionary<string, System.Drawing.Image> images = new Dictionary<string, System.Drawing.Image>();
void LoadTree()
{
    // In case of previous exception
    if(File.Exists("Images.zip")) { File.Delete("Images.zip"); }
    // Allow the file to be open while parsing
    using(FileStream stream = File.Open("Images.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        using(DocX doc = DocX.Load(stream))
        {
            // Work rest of document
            // Still parse here to get the names of the images
            // Might have to drag and drop images into the file, rather than insert through Word
            foreach(Picture pic in doc.Pictures)
            {
                string name = pic.Description;
                if(null == name) { continue; }
                name = name.Substring(name.LastIndexOf("\\") + 1);
                name = name.Substring(0, name.Length - 4);
                images[name] = null;
            }
            // Save while still open
            doc.SaveAs("Images.zip");
        }
    }
    // Use temp zip directory to extract images
    using(ZipArchive zip = ZipFile.OpenRead("Images.zip"))
    {
        // Gather all image names, in order
        // They're retrieved from the bottom up, so reverse
        string[] keys = images.Keys.OrderByDescending(o => o).Reverse().ToArray();
        // Stop at the last known name so that keys[i - 1] stays in range
        for(int i = 1; i <= keys.Length; i++)
        {
            // Also had to add an assembly for ZipArchiveEntry
            ZipArchiveEntry entry = zip.GetEntry(String.Format("word/media/image{0}.png", i));
            if(null == entry) { break; }
            Stream stream = entry.Open();
            images[keys[i - 1]] = new Bitmap(stream);
        }
    }
    // Remove temp directory
    File.Delete("Images.zip");
}

Why does switching my logo give me a corrupt PDF file?

I am creating a PDF file in an ASP.NET C# Windows Console Application, using iTextSharp.
I have this one piece of code. If Site is 'LMH', I get a good PDF that I can open with Adobe Reader. If not, I get the error: There was an error processing a page. There was a problem reading this document (114).
Here is my code:
string ApplicationPath = System.IO.Directory.GetCurrentDirectory();
PdfPTable table = new PdfPTable(4) { TotalWidth = 800.0F, LockedWidth = true };
float[] widths = new[] { 80.0F, 80.0F, 500.0F, 140.0F };
table.SetWidths(widths);
table.HorizontalAlignment = 0;
table.DefaultCell.Border = 0;
if (ActiveProfile.Site == "LMH")
{
    Image hmsImage = Image.GetInstance(ApplicationPath + "\\" + "HMS Logo.png");
    hmsImage.ScaleToFit(80.0F, 40.0F);
    PdfPCell hmslogo = new PdfPCell(hmsImage);
    hmslogo.Border = 0;
    hmslogo.FixedHeight = 60;
    table.AddCell(hmslogo);
}
else
{
    Image blankImage = Image.GetInstance(ApplicationPath + "\\" + "emptyLogo.png");
    blankImage.ScaleToFit(80.0F, 40.0F);
    PdfPCell emptyCell = new PdfPCell(blankImage);
    emptyCell.Border = 0;
    emptyCell.FixedHeight = 60;
    table.AddCell(emptyCell);
}
And the main trunk:
System.IO.FileStream file = new System.IO.FileStream(("C:/") + keyPropertyId + ".pdf", System.IO.FileMode.OpenOrCreate);
Document document = new Document(PageSize.A4.Rotate(), 20, 20, 6, 4);
PdfWriter writer = PdfWriter.GetInstance(document, file);
document.AddTitle(_title);
document.Open();
addHeader(document);
addGeneralInfo(document, keyPropertyId);
addAppliances(document, _LGSRobj);
addFaults(document, _LGSRobj);
addAlarms(document, _LGSRobj);
addFinalCheck(document, _LGSRobj);
addSignatures(document, _LGSRobj);
addFooter(document, writer);
document.Close();
writer.Close();
file.Close();
All that has changed is the logo. Both logo files have the same dimensions. What can possibly be wrong? BTW, the file opens fine using Foxit, but this is not an acceptable solution.
This is the PDF I can't open with Adobe Reader: https://dl.dropboxusercontent.com/u/20086858/1003443.pdf
This is the emptyLogo.png file: https://dl.dropboxusercontent.com/u/20086858/emptylogo.png
This is the logo that works: https://dl.dropboxusercontent.com/u/20086858/HMS%20Logo.png
This is a 'good' version of the pdf with the logo that works: https://dl.dropboxusercontent.com/u/20086858/1003443-good.pdf
It is very suspicious that both the 'good' and the not-so-good version have the identical size in spite of very different image sizes. Comparing them one sees that both files differ completely in their first 192993 bytes but from there on only very little. Furthermore the broken PDF contains EOF markers in this region denoting file ends at indices 140338 and 192993 but the following bytes do not at all look like a clean incremental update.
Cutting the file at the first denoted file end, 140338, one gets the file the OP wanted to have.
Thus:
The code overwrites existing files with new data; if the former file was longer, the remainder of that longer file remains as trash at the end and, therefore, renders the new file broken.
The OP opens the stream like this:
new System.IO.FileStream(("C:/") + keyPropertyId + ".pdf", System.IO.FileMode.OpenOrCreate);
FileMode.OpenOrCreate causes the observed behavior.
The FileMode values are documented here on MSDN, especially:
OpenOrCreate Specifies that the operating system should open a file if it exists; otherwise, a new file should be created.
Create Specifies that the operating system should create a new file. If the file already exists, it will be overwritten. ... FileMode.Create is equivalent to requesting that if the file does not exist, use CreateNew; otherwise, use Truncate.
Thus, use
new System.IO.FileStream(("C:/") + keyPropertyId + ".pdf", System.IO.FileMode.Create);
instead.
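The difference between the two modes can be reproduced without iTextSharp. The following is a minimal sketch using only System.IO; the class and method names (FileModeDemo, WriteAndMeasure) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.IO;

static class FileModeDemo
{
    // Writes `length` zero bytes to `path` using the given FileMode
    // and returns the size of the file on disk afterwards.
    public static long WriteAndMeasure(string path, FileMode mode, int length)
    {
        using (var fs = new FileStream(path, mode, FileAccess.Write))
        {
            fs.Write(new byte[length], 0, length);
        }
        return new FileInfo(path).Length;
    }
}
```

Writing 300 bytes with FileMode.Create and then 100 bytes with FileMode.OpenOrCreate to the same path leaves a 300-byte file: the first 100 bytes are new and the remaining 200 are left over from the earlier, longer file. Writing the 100 bytes with FileMode.Create instead truncates the file first, so the result is 100 bytes.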

File.WriteAllBytes generates one pdf fine, but multiple pdfs with errors?

I have a loop that goes through some data and generates PDF files. If I generate one PDF by itself, it works just fine (the PDF opens), but if I create two PDF files, the first one opens fine while the second one displays an error saying the file is corrupt or something similar. Is there something I am doing wrong in the loop with the stream, etc.?
foreach (report r in reports)
{
byte[] pdf;
ReportName = r.ReportName;
switch (r.ReportId.ToLower())
{
case "pdf":
pdfBuilder = new pdfHelper(candidate,
pdfTemplates[(Guid)case_report.TemplateId], r.XMLFieldData, DCFormats,
r.ProjectReportName, dependants, DepCount, SpoCount);
pdf = pdfBuilder.GenerateCasePDF();
break;
}
//Add Bookmarks for each report in candidate
ChapterCount++;
ChapterReport = new Chapter(new Paragraph(case_report.ReportName), ChapterCount);
tDoc.Add(ChapterReport);
reader = new PdfReader(pdf);
n = reader.NumberOfPages;
for (int page = 1; page <= n; page++)
copy.AddPage(copy.GetImportedPage(reader, page));
copy.FreeReader(reader);
reader.Close();
}
//Save pdf to folder
ReportName = null;
tDoc.Close();
PubResult = outputStream.ToArray();
File.WriteAllBytes(string.Format(@"{0}\{1}.pdf", JobRootPath, CaseFileName), PubResult);
//Reset for next case
outputStream = new MemoryStream();
tDoc = new iTextSharp.text.Document();
copy = new PdfSmartCopy(tDoc, outputStream);
copy.ViewerPreferences = PdfWriter.PageModeUseOutlines;
copy.SetFullCompression();
tDoc.Open();
}
I'm guessing ChapterCount should get reset to its initial value like all the other variables you're resetting at the end of your for loop.
That aside, I'd recommend moving the body of your for loop, and all related variables, into a new method. Reusing variables tends to lead to errors like this.

Remove outer print marks on PDF iTextSharp

I have a pdf file with a cover that looks like the following:
Now, I need to remove the so-called 'galley marks' around the edges of the cover. I am using iTextSharp with C# and I need code using iTextSharp to create a new document with only the intended cover or use PdfStamper to remove that. Or any other solution using iTextSharp that would deliver the results.
I have been unable to find any good code samples in my search to this point.
Do you have to actually remove them, or can you just crop them out? If you can just crop them out, then the code below will work. If you have to actually remove them from the file, then to the best of my knowledge there isn't a simple way to do that, because those objects aren't explicitly marked as meta-objects. The only way I can think of to remove them would be to inspect everything and see whether it fits into the document's active area.
Below is sample code that reads each page in the input file and finds the various boxes that might exist, trim, art and bleed. (See this page.)
As long as it finds at least one it sets the page's crop box to the first item in the list. In your case you might actually have to perform some logic to find the "smallest" of all of those items or you might be able to just know that "art" will always work for you. See the code for additional comments. This targets iTextSharp 5.4.0.0.
//Sample input file
var inputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Binder1.pdf");
//Sample output file
var outputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Cropped.pdf");
//Bind a reader to our input file
using (var r = new PdfReader(inputFile)) {
    //Get the number of pages
    var pageCount = r.NumberOfPages;
    //See this for a list: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfReader.html#getBoxSize(int, java.lang.String)
    var boxNames = new string[] { "trim", "art", "bleed" };
    //We'll create a list of all possible boxes to pick from later
    List<iTextSharp.text.Rectangle> boxes;
    //Loop through each page
    for (var i = 1; i <= pageCount; i++) {
        //Initialize our list for this page
        boxes = new List<iTextSharp.text.Rectangle>();
        //Loop through the list of known boxes
        for (var j = 0; j < boxNames.Length; j++) {
            //If the box exists
            if (r.GetBoxSize(i, boxNames[j]) != null) {
                //Add it to our collection
                boxes.Add(r.GetBoxSize(i, boxNames[j]));
            }
        }
        //If we found at least one box
        if (boxes.Count > 0) {
            //Get the page's entire dictionary
            var dict = r.GetPageN(i);
            //At this point we might want to apply some logic to find the "inner most" box if our trim/bleed/art aren't all the same
            //I'm just hard-coding the first item in the list for demonstration purposes
            //Set the page's crop box to the specified box
            dict.Put(PdfName.CROPBOX, new PdfRectangle(boxes[0]));
        }
    }
    //Create our output file
    using (var fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
        //Bind a stamper to our reader and output file
        using (var stamper = new PdfStamper(r, fs)) {
            //We did all of our PDF manipulation above so we don't actually have to do anything here
        }
    }
}
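The code above simply takes the first box it finds. One possible version of the "find the smallest box" logic mentioned in its comments is sketched below. It is only an illustration: Box is a hypothetical stand-in for iTextSharp.text.Rectangle so that the snippet runs without the library, and "smallest" is assumed here to mean smallest area, which may not be the right rule for every file.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for iTextSharp.text.Rectangle (coordinates in points)
sealed class Box
{
    public readonly float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public float Area
    {
        get { return (Right - Left) * (Top - Bottom); }
    }
}

static class BoxPicker
{
    // Returns the box with the smallest area, or null if the list is empty.
    public static Box Smallest(IEnumerable<Box> boxes)
    {
        Box best = null;
        foreach (Box b in boxes)
        {
            if (best == null || b.Area < best.Area)
            {
                best = b;
            }
        }
        return best;
    }
}
```

In the real code the same comparison could be done on the rectangles collected in boxes, and the winner passed to new PdfRectangle(...) in place of boxes[0].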

Do you need Adobe PDF installed on server to work with iTextSharp?

I've developed a solution on my development machine where it:
Opens PDFs for a file path server side via C#
Merges them together
Does a Response.BinaryWrite to push to a browser the merged PDF
Works great on local DEV. When pushed to server, it gets some 'binary gibberish' in the browser window.
Adobe or Foxit Reader is NOT installed on the server, however it is installed on my local dev machine. My understanding is that iTextSharp allowed you to not need PDF Readers installed at all, but does it? Or maybe this is an IIS thing where .pdf is not listed as a filetype...
Here is some sample code:
// First set up the response and let the browser know a PDF is coming
context.Response.Buffer = true;
context.Response.ContentType = "application/pdf";
context.Response.AddHeader("Content-Disposition", "inline");
List<string> PDFs = new List<string>();
PDFs.Add(@"c:\users\shane\documents\visual studio 2010\Projects\PDFMultiPrintTester\PDFMultiPrintTester\TEST1.pdf");
PDFs.Add(@"c:\users\shane\documents\visual studio 2010\Projects\PDFMultiPrintTester\PDFMultiPrintTester\TEST2.pdf");
PDFs.Add(@"c:\users\shane\documents\visual studio 2010\Projects\PDFMultiPrintTester\PDFMultiPrintTester\TEST3.pdf");
// Second, some setup stuff
System.IO.MemoryStream MemStream = new System.IO.MemoryStream();
iTextSharp.text.Document doc = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader reader = default(iTextSharp.text.pdf.PdfReader);
int numberOfPages = 0;
int currentPageNumber = 0;
iTextSharp.text.pdf.PdfWriter writer = iTextSharp.text.pdf.PdfWriter.GetInstance(doc, MemStream);
doc.Open();
iTextSharp.text.pdf.PdfContentByte cb = writer.DirectContent;
iTextSharp.text.pdf.PdfImportedPage page = default(iTextSharp.text.pdf.PdfImportedPage);
int rotation = 0;
foreach (string f in PDFs)
{
// Third, append all the PDFs--THIS IS THE MAGIC PART
byte[] sqlbytes = null;
sqlbytes = ReadFile(f);
reader = new iTextSharp.text.pdf.PdfReader(sqlbytes);
numberOfPages = reader.NumberOfPages;
currentPageNumber = 0;
while ((currentPageNumber < numberOfPages))
{
currentPageNumber += 1;
doc.SetPageSize(PageSize.LETTER);
doc.NewPage();
page = writer.GetImportedPage(reader, currentPageNumber);
rotation = reader.GetPageRotation(currentPageNumber);
if ((rotation == 90) | (rotation == 270))
{
cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(currentPageNumber).Height);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
}
// Finally Spit the stream out
if (MemStream == null)
{
context.Response.Write("No Data is available for output");
}
else
{
doc.Close();
// ToArray() returns only the bytes written; GetBuffer() can include unused trailing buffer space
context.Response.BinaryWrite(MemStream.ToArray());
context.Response.End();
MemStream.Close();
}
}
}
public static byte[] ReadFile(string filePath)
{
byte[] buffer;
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
}
finally
{
fileStream.Close();
}
return buffer;
}
"My understanding is that iTextSharp allowed you to not need PDF Readers installed at all, but does it?"
iTextSharp is used to generate PDF files. It has nothing to do with the way those files are browsed. If you don't have a PDF reader installed on the client machine that is browsing the application streaming this PDF file in the response don't expect to get anything other than gibberish on this client machine.
Unfortunately you haven't shown the code that is used to generate this PDF file on the server so it's difficult to say whether the problem might be somehow related to it. The important thing is to set the ContentType of the response to application/pdf and send a valid PDF file to the response. The way this response is interpreted on the client will greatly depend on the browser being used and the different plugins and PDF readers installed on this client machine.
You may need to set the Response.ContentType to application/pdf. See related SO post.
When you render Content-Disposition: inline it uses Adobe Plugin - "Adobe PDF Link Helper" (or FoxIt Reader) in IE. Since you probably don't have this ActiveX plugin on your server (AcroIEHelperShim.dll), it will just render the byte content inline as text/html since it doesn't have an inline interpreter.
Finally figured it out. You don't need Adobe PDF reader or Foxit Reader installed on the server. You only need iTextSharp installed on the server (by installed I mean the assembly exists within your solution). The thing you do need is the MIME type in IIS. We had to add that MIME type and it worked right away after that. Funny thing is that even with that, Chrome was able to figure it out and render it properly. I'm assuming that IIS puts the proper headers in place that associate with that MIME type and that was not happening. IE8 couldn't figure it out.
