Extract pages in memory iTextSharp

Can someone help me with this problem? How can I extract some pages from a PDF and return them as a byte array or Stream, without using a physical file as output?
Here is how I do it with a FileStream:
public static void ExtractPages(string sourcePdfPath, string outputPdfPath, int startPage, int endPage)
{
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage = null;
try
{
reader = new PdfReader(sourcePdfPath);
sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
sourceDocument.Open();
for(int i = startPage; i <= endPage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();
}
catch(Exception ex)
{
throw; // rethrow without resetting the stack trace
}
}
I need something like:
public static byte[] ExtractPages(string sourcePdfPath, int startPage, int endPage)
{
....
return byte[];
}

Replace the new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create) with a MemoryStream and return its contents once the document is closed.
Something like this (untested, but should work):
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static byte[] ExtractPages(string sourcePdfPath, int startPage, int endPage)
{
    PdfReader reader = new PdfReader(sourcePdfPath);
    MemoryStream target = new MemoryStream();
    // Size the new document after the first extracted page
    Document sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
    // PdfCopy writes into the in-memory stream instead of a file
    PdfCopy pdfCopyProvider = new PdfCopy(sourceDocument, target);
    sourceDocument.Open();
    for (int i = startPage; i <= endPage; i++)
    {
        PdfImportedPage importedPage = pdfCopyProvider.GetImportedPage(reader, i);
        pdfCopyProvider.AddPage(importedPage);
    }
    // Close the document first so the PDF is complete, then read the bytes
    sourceDocument.Close();
    reader.Close();
    return target.ToArray();
}
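If the caller wants a Stream rather than a byte[], one option is to wrap the returned array in a fresh read-only MemoryStream. This is only a minimal sketch using System.IO; the names PdfBytes and AsReadOnlyStream are made up for illustration. As far as I know, closing the Document also closes the target stream by default, so handing back the original MemoryStream for reading would not work, which is why the sketch wraps the bytes afterwards.

```csharp
using System.IO;

public static class PdfBytes
{
    // Wraps already-generated PDF bytes in a read-only stream positioned at 0.
    public static Stream AsReadOnlyStream(byte[] pdfBytes)
    {
        return new MemoryStream(pdfBytes, writable: false);
    }
}
```

A caller could then write something like Stream s = PdfBytes.AsReadOnlyStream(ExtractPages(path, 1, 3)); and is responsible for disposing the stream.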

Related

ITextsharp: Error reading a pdf file in Byte[] content (PdfReader)

I'm trying to merge several PDFs into a single file from a list that holds their content as byte[]. When I open a document from that list with PdfReader, the program throws the exception "the document has no pages". When I inspect the byte[] list, the contents look complete, but the exception is always thrown.
When I save the content of a single page separately, the generated document fails to open. Splitting the PDF works when each page is written to a physical file: every page comes out as a valid PDF.
I'd appreciate any help or opinions on this.
This is the code I use to split and merge documents:
public List<byte[]> SplitPDF(byte[] contentPdf)
{
try
{
var listBythe = new List<byte[]>();
PdfImportedPage page = null;
PdfCopy PdfCopy = null;
PdfReader reader = new PdfReader(contentPdf);
for (int numPage = 1; numPage <= reader.NumberOfPages; numPage++)
{
Document doc = new Document(PageSize.LETTER);
var mStream = new MemoryStream();
PdfCopy = new PdfCopy(doc, mStream);
doc.Open();
page = PdfCopy.GetImportedPage(reader, numPage);
PdfCopy.AddPage(page);
listBythe.Add(mStream.ToArray());
doc.Close();
}
MergePdfToPage(listBythe);
return listBythe;
}
catch (Exception ex)
{
throw; // rethrow without resetting the stack trace
}
}
private byte[] MergePdfToPage(List<byte[]>contentPage)
{
byte[] docPdfByte = null;
var ms = new MemoryStream();
using (Document doc = new Document(PageSize.LETTER))
{
PdfCopy copy = new PdfCopy(doc, ms);
doc.Open();
var num = doc.PageNumber;
foreach (var file in contentPage.ToArray())
{
using (var reader = new PdfReader(file))
{
copy.AddDocument(reader);
}
}
doc.Close();
docPdfByte = ms.ToArray();
}
return docPdfByte;
}

In your loop you do
Document doc = new Document(PageSize.LETTER);
var mStream = new MemoryStream();
PdfCopy = new PdfCopy(doc, mStream);
doc.Open();
page = PdfCopy.GetImportedPage(reader, numPage);
PdfCopy.AddPage(page);
listBythe.Add(mStream.ToArray());
doc.Close();
In particular, you retrieve the mStream bytes before closing doc. Until doc is closed, the PDF in mStream is incomplete, because the final parts of the file are only written during Close().
To get a complete PDF from mStream, change the order of the instructions and do
Document doc = new Document(PageSize.LETTER);
var mStream = new MemoryStream();
PdfCopy = new PdfCopy(doc, mStream);
doc.Open();
page = PdfCopy.GetImportedPage(reader, numPage);
PdfCopy.AddPage(page);
doc.Close();
listBythe.Add(mStream.ToArray());
instead.
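The same ordering effect can be seen without iTextSharp. The following is a minimal sketch by analogy only: it uses a plain StreamWriter in place of PdfCopy, and the name FlushOrderDemo is made up for illustration. It reads the MemoryStream once before and once after the writer is closed.

```csharp
using System.IO;

public static class FlushOrderDemo
{
    // Returns the byte counts seen before and after the writer is closed.
    public static int[] Run()
    {
        var target = new MemoryStream();
        var writer = new StreamWriter(target);  // buffers internally before writing to target
        writer.Write("hello");

        int before = target.ToArray().Length;   // taken too early: the writer has not flushed

        writer.Dispose();                       // flushes the buffer and closes target
        int after = target.ToArray().Length;    // ToArray still works on a closed MemoryStream

        return new[] { before, after };
    }
}
```

The count taken before the close is smaller than the one taken after it, which mirrors why the bytes must be read only after doc.Close().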

I created something for you, hopefully it will work as well as it did for me.
Class:
public class PDFFactory
{
    public PDFFactory()
    {
        PDFDocument = new Document(iTextSharp.text.PageSize.A4, 65, 65, 60, 60);
        // The original post did not show how the stream and writer were set up;
        // the three lines below are an assumption that makes the class usable.
        PDFMemoryStream = new MemoryStream();
        PDFWriter = PdfWriter.GetInstance(PDFDocument, PDFMemoryStream);
        PDFDocument.Open();
    }

    public Document PDFDocument { get; set; }
    public MemoryStream PDFMemoryStream { get; set; }
    public PdfWriter PDFWriter { get; set; }
    public bool DocumentClosed { get; private set; }

    private string _pdfBase64;
    // Only available once the document has been closed
    public string PDFBase64
    {
        get { return DocumentClosed ? _pdfBase64 : null; }
        set { _pdfBase64 = value; }
    }

    private byte[] _pdfBytes;
    // Only available once the document has been closed
    public byte[] PDFBytes
    {
        get { return DocumentClosed ? _pdfBytes : null; }
        set { _pdfBytes = value; }
    }

    public byte[] GetPDFBytes()
    {
        closeDocument();
        return PDFBytes;
    }

    public void closeDocument()
    {
        if (!DocumentClosed)
        {
            PDFDocument.Close();
            DocumentClosed = true;
        }
        // ToArray() returns only the bytes written; GetBuffer() would also return unused capacity
        PDFBytes = PDFMemoryStream.ToArray();
        PDFBase64 = Convert.ToBase64String(PDFBytes);
    }
}
Service:
public byte[] ()
{
PDFFactory pdf_1 = new PDFFactory();
PDFFactory pdf_2 = new PDFFactory();
List<byte[]> sourceFiles = new List<byte[]>();
// pdf_1 and pdf_2 need content added before this point; closing an empty document fails
sourceFiles.Add(pdf_1.GetPDFBytes());
sourceFiles.Add(pdf_2.GetPDFBytes());
PDFFactory pdfFinal = new PDFFactory();
for (int fileCounter = 0; fileCounter <= sourceFiles.Count - 1; fileCounter += 1)
{
PdfReader reader2 = new PdfReader(sourceFiles[fileCounter]);
int numberOfPages = reader2.NumberOfPages;
for (int currentPageIndex = 1; currentPageIndex <= numberOfPages; currentPageIndex++)
{
// Determine page size for the current page
pdfFinal.PDFDocument.SetPageSize(reader2.GetPageSizeWithRotation(currentPageIndex));
// Create page
pdfFinal.PDFDocument.NewPage();
PdfImportedPage importedPage = pdfFinal.PDFWriter.GetImportedPage(reader2, currentPageIndex);
// Determine page orientation
int pageOrientation = reader2.GetPageRotation(currentPageIndex);
if ((pageOrientation == 90) || (pageOrientation == 270))
pdfFinal.PDFWriter.DirectContent.AddTemplate(importedPage, 0, -1.0F, 1.0F, 0, 0, reader2.GetPageSizeWithRotation(currentPageIndex).Height);
else
pdfFinal.PDFWriter.DirectContent.AddTemplate(importedPage, 1.0F, 0, 0, 1.0F, 0, 0);
}
}
pdfFinal.closeDocument();
return pdfFinal.PDFBytes;
}
Let me know if it helped.
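A note on MemoryStream.GetBuffer() versus ToArray() when returning PDF bytes from a factory like the one above: GetBuffer() hands back the stream's whole internal buffer, which is usually larger than the data written, while ToArray() copies only the written bytes. The sketch below uses only System.IO; BufferDemo is a made-up name.

```csharp
using System.IO;

public static class BufferDemo
{
    // Returns { ToArray length, GetBuffer length } after writing three bytes.
    public static int[] Run()
    {
        var ms = new MemoryStream();
        ms.Write(new byte[] { 1, 2, 3 }, 0, 3);
        return new[] { ms.ToArray().Length, ms.GetBuffer().Length };
    }
}
```

The first value is exactly the number of bytes written; the second is the buffer capacity, so a PDF or Base64 string built from GetBuffer() may carry trailing zero bytes.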

How to use iTextSharp to save certain pages to a MemoryStream and return selected page as base64 string

Using iTextSharp, I was able to read a base64 string and convert certain pages into local files using a FileStream. I want to do the same without saving to the local filesystem, using MemoryStream and only returning the selected pages to the calling function as a base64 string.
// function takes reader and start and end page and destination file path as parameter.
public static void ExtractPages(PdfReader pdfReader, string sourcePdfPath,
string outputPdfPath, int startPage, int endPage)
{
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage = null;
try
{
reader = pdfReader;
sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
// old code to save selected pdf into local file system.
FileStream fileStream = new FileStream(outputPdfPath, FileMode.Create);
// memory stream created but not been used yet!
MemoryStream memoryStream = new MemoryStream();
pdfCopyProvider = new PdfCopy(sourceDocument, fileStream);
sourceDocument.Open();
// save selected page into local file system
for (int i = startPage; i <= endPage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
#region TestMemoryStream
// TODO: New code to be added and return selected page data as bas64 string
#endregion
sourceDocument.Close();
reader.Close();
}
catch (Exception ex)
{
throw; // rethrow without resetting the stack trace
}
}

I was able to sort out the issue. My modified code, which replaces the body of the try block, is below.
reader = pdfReader;
sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
//FileStream fileStream = new FileStream(outputPdfPath, FileMode.Create);
MemoryStream memoryStream = new MemoryStream();
pdfCopyProvider = new PdfCopy(sourceDocument, memoryStream);
sourceDocument.Open();
for (int i = startPage; i <= endPage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();
byte[] content1 = memoryStream.ToArray();
// var bas64 = Convert.ToBase64String(content1);
string bas64New = Convert.ToBase64String(content1);
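On the receiving side, the Base64 string has to be turned back into bytes. Here is a minimal sketch of that round trip, assuming the caller only wants a cheap sanity check that the result looks like a PDF (PDF files begin with "%PDF-"). The class name Base64Pdf and its methods are made up for illustration.

```csharp
using System;
using System.Text;

public static class Base64Pdf
{
    // Encodes PDF bytes as a Base64 string.
    public static string Encode(byte[] pdfBytes)
    {
        return Convert.ToBase64String(pdfBytes);
    }

    // Decodes the string and checks that the data starts with the PDF header.
    public static byte[] Decode(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        if (bytes.Length < 5 || Encoding.ASCII.GetString(bytes, 0, 5) != "%PDF-")
            throw new FormatException("Decoded data does not start with a PDF header.");
        return bytes;
    }
}
```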

Concatenating PDF using PDFsharp returns empty PDF

I have a list of PDFs stored as List<byte[]>. I try to concatenate all these PDF files using PDFsharp, but after the operation I get a PDF with the proper page count in which all pages are blank. It looks like I lose some header or something, but I can't find where.
My code:
PdfDocument output = new PdfDocument();
try
{
foreach (var report in reports)
{
using (MemoryStream stream = new MemoryStream(report))
{
PdfDocument input = PdfReader.Open(stream, PdfDocumentOpenMode.Import);
foreach (PdfPage page in input.Pages)
{
output.AddPage(page);
}
}
}
if (output.Pages.Count <= 0)
{
throw new Exception("Empty Document");
}
MemoryStream final = new MemoryStream();
output.Save(final);
output.Close();
return final.ToArray();
}
catch (Exception e)
{
throw new Exception(e.ToString());
}
I want to return it as byte[] because I use it later:
return File(report, System.Net.Mime.MediaTypeNames.Application.Octet, "test.pdf");
This returns PDF with proper page count, but all blank.

You mention in a comment that the files come from SSRS.
Older versions of PDFsharp require a special SSRS setting:
For the DeviceSettings parameter for the Render method on the ReportExecutionService object, pass this value:
theDeviceSettings = "<DeviceInfo><HumanReadablePDF>True</HumanReadablePDF></DeviceInfo>";
Source:
http://forum.pdfsharp.net/viewtopic.php?p=1613#p1613

I use iTextSharp. Look at this sample code (it works for me):
public static byte[] PdfJoin(List<String> pdfs)
{
byte[] mergedPdf = null;
using (MemoryStream ms = new MemoryStream())
{
using (iTextSharp.text.Document document = new iTextSharp.text.Document())
{
using (iTextSharp.text.pdf.PdfCopy copy = new iTextSharp.text.pdf.PdfCopy(document, ms))
{
document.Open();
for (int i = 0; i < pdfs.Count; ++i)
{
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(pdfs[i]);
// loop over the pages in that document
int n = reader.NumberOfPages;
for (int page = 0; page < n; )
{
copy.AddPage(copy.GetImportedPage(reader, ++page));
}
}
}
}
mergedPdf = ms.ToArray();
}
return mergedPdf;
}
public static byte[] PdfJoin(List<byte[]> pdfs)
{
byte[] mergedPdf = null;
using (MemoryStream ms = new MemoryStream())
{
using (iTextSharp.text.Document document = new iTextSharp.text.Document())
{
using (iTextSharp.text.pdf.PdfCopy copy = new iTextSharp.text.pdf.PdfCopy(document, ms))
{
document.Open();
for (int i = 0; i < pdfs.Count; ++i)
{
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(pdfs[i]);
// loop over the pages in that document
int n = reader.NumberOfPages;
for (int page = 0; page < n; )
{
copy.AddPage(copy.GetImportedPage(reader, ++page));
}
}
}
}
mergedPdf = ms.ToArray();
}
return mergedPdf;
}

Image is not created from stream

I am trying to extract an image from a PDF using this code:
#region ExtractImagesFromPDF
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
// NOTE: This will only get the first image it finds per page.
PdfReader pdf = new PdfReader(sourcePdf);
RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
PdfDictionary res =
(PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj =
(PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName type =
(PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(type))
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
if ((bytes != null))
{
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
{
memStream.Position = 0;
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
// must save the file while stream is open.
if (!Directory.Exists(outputPath))
Directory.CreateDirectory(outputPath);
string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
// GetImageEncoder is found below this method
System.Drawing.Imaging.ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
img.Save(path, jpegEncoder, parms);
break;
}
}
}
}
}
}
}
}
catch
{
throw;
}
finally
{
pdf.Close();
}
}
#endregion
Everything works until the line
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
which throws the error “Parameter not valid”.
I can't work out what the problem is: if the stream is not an image stream, why does iTextSharp read it as an image?
Can anyone help me out?
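A likely cause, offered as a guess since the PDF itself is not shown: GetStreamBytesRaw returns the stream bytes exactly as stored in the PDF, without applying the stream's filter. Only image streams stored with the DCTDecode filter are JPEG files as they stand; images stored with other filters (FlateDecode, for example) are compressed pixel data with no image file header, and Image.FromStream rejects those. The sketch below checks for the JPEG start-of-image marker before handing the bytes to System.Drawing. ImageSniffer and LooksLikeJpeg are hypothetical names, not part of iTextSharp.

```csharp
public static class ImageSniffer
{
    // True when the data begins with the JPEG start-of-image marker (FF D8).
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0xFF && data[1] == 0xD8;
    }
}
```

In the code above, calling ImageSniffer.LooksLikeJpeg(bytes) before Image.FromStream would at least skip streams that cannot be loaded directly; decoding the other image types needs more work than this check.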

Extract image from pdf using Itext

I have been using iText functions to read plain text from a PDF file. Is it possible to read images from a PDF file using iText in C#?

You can try something like this:
using System.IO;
using System.Drawing.Imaging;
using iTextSharp.text;
using iTextSharp.text.pdf;
#region ExtractImagesFromPDF
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
// NOTE: This will only get the first image it finds per page.
PdfReader pdf = new PdfReader(sourcePdf);
RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
PdfDictionary res =
(PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj =
(PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName type =
(PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(type))
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
if ((bytes != null))
{
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
{
memStream.Position = 0;
System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
// must save the file while stream is open.
if (!Directory.Exists(outputPath))
Directory.CreateDirectory(outputPath);
string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
// GetImageEncoder is found below this method
System.Drawing.Imaging.ImageCodecInfo jpegEncoder = GetImageEncoder("JPEG");
img.Save(path, jpegEncoder, parms);
break;
}
}
}
}
}
}
}
}
catch
{
throw;
}
finally
{
pdf.Close();
}
}
#endregion
#region GetImageEncoder
public static System.Drawing.Imaging.ImageCodecInfo GetImageEncoder(string imageType)
{
imageType = imageType.ToUpperInvariant();
foreach (ImageCodecInfo info in ImageCodecInfo.GetImageEncoders())
{
if (info.FormatDescription == imageType)
{
return info;
}
}
return null;
}
#endregion
I hope it helps.

Hi, this is not C# but my Java code. I hope you can use it as a basis for extracting images in C#.
public ByteArrayOutputStream extractImages(byte[] pdf) throws IOException{
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ZipOutputStream zip = new ZipOutputStream(baos);
MyImageRenderer listener = new MyImageRenderer(zip);
for(int i=1;i<=reader.getNumberOfPages();i++){
parser.processContent(i, listener);
}
zip=listener.getZip();
zip.close();
return baos;
}
MyImageRenderer is a class that implements the RenderListener interface. Here's the method I wrote for rendering the images:
public void renderImage(ImageRenderInfo renderInfo) {
try {
PdfImageObject image = renderInfo.getImage();
if (image == null)
return;
ZipEntry entry = new ZipEntry(String.format(img, renderInfo
.getRef().getNumber(), image.getFileType()));
System.out.println(image.getFileType());
zip.putNextEntry(entry);
zip.write(image.getImageAsBytes());
zip.closeEntry();
} catch (IOException ioex) {
ioex.printStackTrace();
}
}
I know this code is in Java, but it should give you the general idea.
