Split a large PDF file into multiple PDFs in C#

I have a large PDF file that I need to split into multiple PDFs or chunks before I upload it to the server (another WCF service).
I see two approaches for sending large files (>2 MB) to the server: splitting the file into multiple PDFs, or splitting one PDF into chunks. Can anyone tell me how to achieve this?
The articles I found, such as the one below, use iTextSharp, which is deprecated, and I don't use the licensed version. Is there a feasible way to achieve this?
https://www.c-sharpcorner.com/article/splitting-pdf-file-in-c-sharp-using-itextsharp/

using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System.IO;
class Program
{
    // Output folder
    static string outputFolder = @"D:\PDFSplit\Example\outputFolder";
    static void Main(string[] args)
    {
        // Input folder
        var inputFolder = @"D:\PDFSplit\Example\inputFolder";
        // Input file name
        var inputPDFFileName = "sample.pdf";
        // Input file path
        string inputPDFFilePath = Path.Combine(inputFolder, inputPDFFileName);
        // Open the input file in Import mode
        PdfDocument inputPDFFile = PdfReader.Open(inputPDFFilePath, PdfDocumentOpenMode.Import);
        // Get the total number of pages in the PDF
        var totalPagesInInputPDFFile = inputPDFFile.PageCount;
        // Walk from the last page to the first, writing one page per output file
        while (totalPagesInInputPDFFile != 0)
        {
            // Create an instance of the PDF document in memory
            PdfDocument outputPDFDocument = new PdfDocument();
            // Add the current page to the PdfDocument instance
            outputPDFDocument.AddPage(inputPDFFile.Pages[totalPagesInInputPDFFile - 1]);
            // Save the PDF document, named after its 1-based page number
            SaveOutputPDF(outputPDFDocument, totalPagesInInputPDFFile);
            totalPagesInInputPDFFile--;
        }
    }
    private static void SaveOutputPDF(PdfDocument outputPDFDocument, int pageNo)
    {
        // Output file path
        string outputPDFFilePath = Path.Combine(outputFolder, pageNo.ToString() + ".pdf");
        // Save the document
        outputPDFDocument.Save(outputPDFFilePath);
    }
}
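The code above writes one page per output file. If fewer, larger pieces are preferable, the same PDFsharp calls can group several pages into each output file. The following is a minimal sketch under that assumption; the class name PdfChunkSplitter, the pagesPerChunk parameter, and the "chunk-N.pdf" naming are hypothetical. Note that a fixed page count does not guarantee that each output file stays under a given size such as 2 MB.

```csharp
using System;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class PdfChunkSplitter
{
    // Splits inputPath into files of at most pagesPerChunk pages each.
    // pagesPerChunk is a hypothetical tuning parameter, not a PDFsharp setting.
    // Returns the number of files written.
    public static int SplitByPageCount(string inputPath, string outputFolder, int pagesPerChunk)
    {
        if (pagesPerChunk <= 0)
            throw new ArgumentOutOfRangeException(nameof(pagesPerChunk));

        Directory.CreateDirectory(outputFolder);

        // Import mode is required to copy pages into another document.
        PdfDocument input = PdfReader.Open(inputPath, PdfDocumentOpenMode.Import);

        int chunkCount = 0;
        for (int start = 0; start < input.PageCount; start += pagesPerChunk)
        {
            PdfDocument chunk = new PdfDocument();
            int end = Math.Min(start + pagesPerChunk, input.PageCount);
            for (int i = start; i < end; i++)
                chunk.AddPage(input.Pages[i]);

            chunkCount++;
            // Hypothetical naming scheme: chunk-1.pdf, chunk-2.pdf, ...
            chunk.Save(Path.Combine(outputFolder, "chunk-" + chunkCount + ".pdf"));
        }
        return chunkCount;
    }
}
```

For example, PdfChunkSplitter.SplitByPageCount(@"D:\PDFSplit\Example\inputFolder\sample.pdf", @"D:\PDFSplit\Example\outputFolder", 10) would write ten pages per file, with the last file holding the remainder.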

To split the PDF into several smaller files, first load the PDF document you want to split, then call the split method with a pattern for the output file names.
PdfLoadedDocument document = new PdfLoadedDocument("sample.pdf");
document.Split("Document-{0}.pdf");
document.Close(true);
There is another way: first load the PDF document using the Document class, then collect the pages to be split off into a Page[] array. After that, create a new Document and add the pages to it using the Document.Pages.Add(Page[]) method.
Save the PDF file using the Document.Save(String) method.

Try using streams, for instance StreamReader.
See meatvest's answer here
And the docs
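As a rough illustration of the stream approach, the sketch below reads a file in fixed-size byte chunks. It uses a plain Stream (for example a FileStream) rather than StreamReader, because StreamReader decodes text and a PDF is binary data. The class name FileChunker and the chunk size are hypothetical. The chunks are raw byte ranges, not valid PDFs on their own, so the receiving service would have to concatenate them in order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FileChunker
{
    // Yields the stream's content in pieces of at most chunkSize bytes.
    public static IEnumerable<byte[]> ReadChunks(Stream source, int chunkSize)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var buffer = new byte[chunkSize];
        while (true)
        {
            // Stream.Read may return fewer bytes than requested,
            // so keep reading until the buffer is full or the stream ends.
            int filled = 0;
            while (filled < chunkSize)
            {
                int read = source.Read(buffer, filled, chunkSize - filled);
                if (read == 0) break;
                filled += read;
            }
            if (filled == 0) yield break;

            // Copy out exactly the bytes that were read.
            var chunk = new byte[filled];
            Buffer.BlockCopy(buffer, 0, chunk, 0, filled);
            yield return chunk;

            // A short chunk means the end of the stream was reached.
            if (filled < chunkSize) yield break;
        }
    }
}
```

A caller could open the file with File.OpenRead(path) inside a using block and send each chunk from a foreach loop over FileChunker.ReadChunks(file, 1024 * 1024).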

Related

Writing and Reading CustomValues takes too long in PdfSharp

I have a C# method that writes a custom value to a given PDF file. To write the custom value, I am using PdfSharp 1.50.5147.
The problem is that PdfReader.Open takes too long for the PDF below:
https://www.mouser.com.tr/catalog/English/103/dload/pdf/mouser.pdf
public bool WritePropertyToFile(string filePath, string extension, string key, string value)
{
    try
    {
        document = PdfReader.Open(filePath); // This call takes 2.5 minutes!
        var properties = document.CustomValues.Elements;
        properties.SetString("/" + key, value);
        document.Save(filePath);
        document = null;
        return true;
    }
    catch (Exception)
    {
        if (document != null)
            document = null;
        throw;
    }
}
My requirement is to write and read custom values in milliseconds for a given file. Although the custom values of many PDF files can be written and read in milliseconds, some files, such as this one, cause problems.
Do I need to open the whole document to write or read a custom value? Is there a different technique for this? Do you have any suggestions for this problem?
Currently, PdfSharp has no way to open a large PDF like this one quickly, because it first loads the entire PDF into memory. The PDF you're trying to open is a whopping 168 MB.
You could extend PdfSharp to load the trailer contents first and then read each block of content according to the trailer entries.
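To give an idea of what reading the trailer first involves, here is a minimal sketch that is not part of PdfSharp. A PDF file normally ends with the keyword startxref, the byte offset of the cross-reference section, and %%EOF, so that offset can be found by reading only the tail of the file. The class name PdfTrailerLocator and the 1 KB tail size are assumptions. Parsing the cross-reference section and the trailer dictionary at that offset is the larger part of the work and is not shown.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

static class PdfTrailerLocator
{
    // Returns the byte offset that follows the last "startxref" keyword,
    // or -1 if the keyword or its offset cannot be found in the file tail.
    public static long FindStartXref(Stream pdf)
    {
        const int tailSize = 1024; // assumption: the keyword lies in the last 1 KB
        long tailStart = Math.Max(0, pdf.Length - tailSize);
        pdf.Seek(tailStart, SeekOrigin.Begin);

        // Read only the tail of the file, not the whole document.
        var tail = new byte[(int)(pdf.Length - tailStart)];
        int filled = 0;
        while (filled < tail.Length)
        {
            int read = pdf.Read(tail, filled, tail.Length - filled);
            if (read == 0) break;
            filled += read;
        }

        // ASCII decoding maps each byte to one char, so indices stay aligned.
        string text = Encoding.ASCII.GetString(tail, 0, filled);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        // Skip whitespace after the keyword, then collect the decimal digits.
        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int digitsStart = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == digitsStart) return -1;

        return long.Parse(text.Substring(digitsStart, pos - digitsStart),
                          CultureInfo.InvariantCulture);
    }
}
```

Whether this helps with custom values specifically depends on how much of the object graph must be resolved to reach them, which this sketch does not address.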

GemBox DocumentModel.Load() cannot read Pdf file

Currently I am unable to load the original PDF document using GemBox; it gives me the error shown in the image. I am using Acrobat 9.
I have also tried the 8/16/2018 fixes. Any suggestions will be highly appreciated.
The basic code I am using is:
using GemBox.Document;
using System;
namespace Pdf2Text
{
    class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            ComponentInfo.SetLicense("My-License");
            DocumentModel document = null;
            document = DocumentModel.Load(@"E:\data\testing\HA021.pdf");
            document.Save(@"E:\data\testing\HA021.docx");
        }
    }
}
The current implementation of the PDF reader in GemBox.Document is still in beta and cannot handle one feature of this PDF: "iref streams", which are cross-reference tables stored in streams.
However, GemBox.Pdf can handle cross-reference streams, so as a workaround you could do something like the following:
// Load PDF with GemBox.Pdf.
var pdfDocument = PdfDocument.Load("Sample.pdf");
pdfDocument.SaveOptions.CrossReferenceType = PdfCrossReferenceType.Table;
// Save PDF with GemBox.Pdf.
var pdfStream = new MemoryStream();
pdfDocument.Save(pdfStream);
// Load PDF with GemBox.Document.
var document = DocumentModel.Load(pdfStream, LoadOptions.PdfDefault);
Lastly, regarding the conversion of PDF to DOCX: GemBox.Document's PDF reader is currently intended for extracting text and tables from PDF files; it is not intended for high-fidelity conversion.

Add text and picture in .docx file

I use an Office Word file as a template. The file contains repeated default text and a photo that I have to replace with another photo and text.
How can I define specific zones in the template and then find those zones in C# to replace them?
I think the best way is to find out how to manipulate the Word XML structure to include the data you want.
For filling and altering templates you can use the Open XML SDK from Microsoft.
You can also follow this manual approach here without using the SDK.
Manual approach: you add a custom XML resource that contains your changes/resources for the template.
If you don't need to be that flexible, you can use the standard content control / picture content control in Word and replace their contents afterwards in C#. It depends on how flexible you want to be in replacing elements.
You can find a good and complete example of using the picture content control here: Picture content control handling
OK, I finally tried this approach: use a Word file with content controls and an XML file to bind data to them.
For that I use the following code:
string outFile = @"D:\template_created.docx";
string docPath = @"D:\template.docx";
string xmlPath = @"D:\template.xml";
// Work on a copy so the template itself stays untouched
File.Copy(docPath, outFile);
using (WordprocessingDocument doc = WordprocessingDocument.Open(outFile, true))
{
    MainDocumentPart mdp = doc.MainDocumentPart;
    // Remove any custom XML parts already in the document
    if (mdp.CustomXmlParts != null)
    {
        mdp.DeleteParts<CustomXmlPart>(mdp.CustomXmlParts);
    }
    // Add a new custom XML part and fill it from the XML file
    CustomXmlPart cxp = mdp.AddCustomXmlPart(CustomXmlPartType.CustomXml);
    FileStream fs = null;
    try
    {
        fs = new FileStream(xmlPath, FileMode.Open);
        cxp.FeedData(fs);
        mdp.Document.Save();
    }
    finally
    {
        if (fs != null)
        {
            fs.Dispose();
        }
    }
}
When I run the app, it creates the custom XML part and appends it to my Word file. When I open the Word file there is no error, but none of the content controls are filled.
My final approach was to use content controls in my Word document, each with a unique ID. Then I can find those IDs with C# and replace the content.
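The post does not show the final code, so the following is only a sketch of how that could look with the Open XML SDK. It assumes the unique ID is stored in each content control's Tag property and that the control holds plain text. The class name ContentControlFiller is hypothetical, and pictures and nested content controls are not handled.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class ContentControlFiller
{
    // Replaces the text of every content control whose Tag equals tagValue.
    // Returns the number of content controls that were updated.
    public static int SetText(string docxPath, string tagValue, string newText)
    {
        int updated = 0;
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            // SdtElement is the common base class of block, run and cell content controls.
            foreach (SdtElement sdt in doc.MainDocumentPart.Document.Descendants<SdtElement>())
            {
                // Assumption: the unique ID lives in the Tag property of the control.
                Tag tag = sdt.GetFirstChild<SdtProperties>()?.GetFirstChild<Tag>();
                if (tag?.Val?.Value != tagValue) continue;

                // Reuse the first text element so its run formatting survives,
                // and blank out any further text elements in the control.
                var texts = sdt.Descendants<Text>().ToList();
                if (texts.Count == 0) continue;
                texts[0].Text = newText;
                foreach (Text extra in texts.Skip(1)) extra.Text = string.Empty;
                updated++;
            }
            doc.MainDocumentPart.Document.Save();
        }
        return updated;
    }
}
```

A call such as ContentControlFiller.SetText(@"D:\template_created.docx", "customerName", "Jane Doe") would then fill the control tagged "customerName"; both the tag and the value here are made up for illustration.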

Read a stored PDF from memory stream

I'm working on a database project using C# and SQL Server 2012. In one of my forms I have a PDF file, along with some other information, that is stored in a table. Storing it works, but when I retrieve the stored information I don't know how to display the PDF file.
I read some articles that said a PDF cannot be displayed with the Adobe PDF viewer from a memory stream. Is there any way to do that?
This is my code for retrieving the data from the database:
sql_com.CommandText = "select * from incoming_boks_tbl where [incoming_bok_id]=@incoming_id and [incoming_date]=@incoming_date";
sql_com.Parameters.AddWithValue("@incoming_id", up_inco_num_txt.Text);
sql_com.Parameters.AddWithValue("@incoming_date", up_inco_date_txt.Text);
sql_dr = sql_com.ExecuteReader();
if(sql_dr.HasRows)
{
while(sql_dr.Read())
{
up_incoming_id_txt.Text = sql_dr[0].ToString();
up_inco_num_txt.Text = sql_dr[1].ToString();
up_inco_date_txt.Text = sql_dr[2].ToString();
up_inco_reg_txt.Text = sql_dr[3].ToString();
up_inco_place_txt.Text = sql_dr[4].ToString();
up_in_out_txt.Text = sql_dr[5].ToString();
up_subj_txt.Text = sql_dr[6].ToString();
up_note_txt.Text = sql_dr[7].ToString();
string file_ext = sql_dr[8].ToString();//pdf file extension
byte[] inco_file = (byte[])(sql_dr[9]);//the pdf file
MemoryStream ms = new MemoryStream(inco_file);
//here I don't know what to do with memory stream file data and where to store it. How can i display it?
}
}
This answer should give you some options: How to render pdfs using C#
In the past I have used Google's open-source PDF rendering project, PDFium.
There is a C# NuGet package called PdfiumViewer, which provides a C# wrapper around PDFium and allows PDFs to be displayed and printed.
It works directly with streams, so it doesn't require any data to be written to disk.
This is my example from a WinForms app:
public void LoadPdf(byte[] pdfBytes)
{
    var stream = new MemoryStream(pdfBytes);
    LoadPdf(stream);
}
public void LoadPdf(Stream stream)
{
    // Create PDF Document
    var pdfDocument = PdfDocument.Load(stream);
    // Load PDF Document into WinForms Control
    pdfRenderer.Load(pdfDocument);
}

How to open an existing PDF file with the MigraDoc PDF library

I am trying to use the MigraDoc library from PDFsharp (http://www.pdfsharp.net/) to print PDF files. So far I have found that MigraDoc does support printing through its MigraDoc.Rendering.Printing.MigraDocPrintDocument class. However, I have not found a way to actually open an existing PDF file with MigraDoc.
I did find a way to open an existing PDF file using PDFSharp, but I cannot successfully convert a PDFSharp.Pdf.PdfDocument into a MigraDoc.DocumentObjectModel.Document object. So far I have not found the MigraDoc and PDFSharp documentation to be very helpful.
Does anyone have any experience using these libraries to work with existing PDF files?
I wrote the following code with help from this sample, but the result when my input PDF is 2 pages is an output PDF with 2 blank pages.
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
...
public void PrintPDF(string filePath, string outFilePath)
{
    var document = new Document();
    var docRenderer = new DocumentRenderer(document);
    docRenderer.PrepareDocument();
    var inPdfDoc = PdfReader.Open(filePath, PdfDocumentOpenMode.Modify);
    for (var i = 0; i < inPdfDoc.PageCount; i++)
    {
        document.AddSection();
        docRenderer.PrepareDocument();
        var page = inPdfDoc.Pages[i];
        var gfx = XGraphics.FromPdfPage(page);
        docRenderer.RenderPage(gfx, i + 1);
    }
    var renderer = new PdfDocumentRenderer();
    renderer.Document = document;
    renderer.RenderDocument();
    renderer.PdfDocument.Save(outFilePath);
}
Your code modifies the inPdfDoc in memory without saving the changes. Complicated code without any visual effect.
MigraDoc cannot open PDF files, MigraDoc cannot print PDF files, PDFsharp cannot print PDF files.
http://www.pdfsharp.net/wiki/PDFsharpFAQ.ashx
