OpenXml Convert from Word document to HTML with Header - c#

I want to read a .docx file and send its content as the body of an email, not as an attachment.
For this I use Open XML and OpenXmlPowerTools to convert the .docx file to HTML. This worked fine until I got a document that has a header and footer with images.
Here is my code to convert .docx to HTML:
using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
{
HtmlConverterSettings convSettings = new HtmlConverterSettings()
{
FabricateCssClasses = true,
CssClassPrefix = "cls-",
RestrictToSupportedLanguages = false,
RestrictToSupportedNumberingFormats = false,
ImageHandler = imageInfo =>
{
DirectoryInfo localDirInfo = new DirectoryInfo(imageDirectoryName);
if (!localDirInfo.Exists)
{
localDirInfo.Create();
}
++imageCounter;
string extension = imageInfo.ContentType.Split('/')[1].ToLower();
ImageFormat imageFormat = null;
if (extension == "png")
{
extension = "jpeg";
imageFormat = ImageFormat.Jpeg;
}
else if (extension == "bmp")
{
imageFormat = ImageFormat.Bmp;
}
else if (extension == "jpeg")
{
imageFormat = ImageFormat.Jpeg;
}
else if (extension == "tiff")
{
imageFormat = ImageFormat.Tiff;
}
// If the image format is not one that you expect, ignore it,
// and do not return markup for the link.
if (imageFormat == null)
{
return null;
}
string imageFileName = imageDirectoryName + "/image" + imageCounter.ToString() + "." + extension;
try
{
imageInfo.Bitmap.Save(imageFileName, imageFormat);
}
catch (System.Runtime.InteropServices.ExternalException)
{
return null;
}
XElement img = new XElement(Xhtml.img, new XAttribute(NoNamespace.src, imageFileName), imageInfo.ImgStyleAttribute, imageInfo.AltText != null ? new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
return img;
}
};
// Pass the document opened in the using statement above (it is named doc, not doc1)
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(doc, convSettings);
}
The code above works fine and converts images as well, but if the document has a header and footer, those are not converted.
Is there any workaround to include the header and footer in the HTML file?
Please advise. Thanks!

OpenXmlPowerTools ignores headers and footers when converting a .docx document to HTML, so they won't show up in the resulting HTML (you can browse the source code on GitHub).
Perhaps it's because the concept of a 'page' doesn't apply to HTML, so there's no obvious equivalent to a document header.
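If you want to check whether a given document actually carries header or footer content before looking for a workaround, here is a minimal sketch. It relies only on the fact that a .docx file is a ZIP package, and it assumes the header and footer parts use the conventional word/header*.xml and word/footer*.xml names that Word normally writes. The class and method names are made up for illustration, and nothing here converts those parts to HTML.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class DocxParts
{
    // Lists the package entries that look like header or footer parts.
    // Assumption: parts follow the usual word/headerN.xml / word/footerN.xml naming.
    public static List<string> FindHeaderFooterParts(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return zip.Entries
                .Select(e => e.FullName)
                .Where(n => n.StartsWith("word/header", StringComparison.OrdinalIgnoreCase)
                         || n.StartsWith("word/footer", StringComparison.OrdinalIgnoreCase))
                .OrderBy(n => n, StringComparer.Ordinal)
                .ToList();
        }
    }
}
```

If the list is not empty, those parts would have to be read and rendered by your own code (for example through the Open XML SDK), since the converter will not emit them.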

Related

How To Open File According To Their File Extension Format Without Downloading File?

Here I am trying to open a file of any format in ASP.NET MVC Core 2.0. The file has been saved in the database, and it should open according to its extension without being downloaded. For example, if I have an MS Word file, clicking it should open it in Word; if I have a PDF, it should open as a PDF without downloading the file.
Now, the problem with my code is that it forces a download instead of opening the file according to its extension.
Any help would be great. Thank you.
Below is my code:
public IActionResult ViewFileByFileId (int id)
{
DoctorCredentialDocsModel DoctorCredential = new DoctorCredentialDocsModel();
DoctorCredential = _doctorService.GetDoctorCredentialDetails(id);
string AttachPath = ConfigPath.DoctorCredentialsAttachmentPath;
string strFileFullPath = Path.Combine(AttachPath, DoctorCredential.AttachedFile);
string contentType = MimeTypes.GetMimeType(strFileFullPath);
if (!strFileFullPath.Contains("..\\"))
{
byte[] filedata = System.IO.File.ReadAllBytes(strFileFullPath);
var cd = new System.Net.Mime.ContentDisposition
{
FileName = DoctorCredential.FileName,
Inline = false,
};
Request.HttpContext.Response.Headers.Add("Content-Disposition", cd.ToString());
return File(filedata, contentType);
}
else
{
return new NotFoundResult();
}
}
Try sending it like this. FileContentResult may help you :)
Also take a look at this :)
What's the difference between the four File Results in ASP.NET MVC
public FileContentResult ViewFileByFileId (int id)
{
DoctorCredentialDocsModel DoctorCredential = new DoctorCredentialDocsModel();
DoctorCredential = _doctorService.GetDoctorCredentialDetails(id);
string AttachPath = ConfigPath.DoctorCredentialsAttachmentPath;
string strFileFullPath = Path.Combine(AttachPath, DoctorCredential.AttachedFile);
string contentType = MimeTypes.GetMimeType(strFileFullPath);
if (!strFileFullPath.Contains("..\\"))
{
byte[] filedata = System.IO.File.ReadAllBytes(strFileFullPath);
var cd = new System.Net.Mime.ContentDisposition
{
FileName = DoctorCredential.FileName,
Inline = false,
};
Request.HttpContext.Response.Headers.Add("Content-Disposition", cd.ToString());
return File(filedata, contentType);
}
else
{
return new NotFoundResult();
}
}
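One detail in both versions above is worth a look: Inline = false produces an "attachment" disposition, which asks the browser to download the file. The sketch below, using only System.Net.Mime, shows how the header value changes with the Inline flag. The helper name is hypothetical, and whether a file is actually displayed still depends on the browser: PDFs and images usually open in place, while a Word document will typically still be downloaded because the browser cannot render it.

```csharp
using System;
using System.Net.Mime;

public static class DispositionDemo
{
    // Builds a Content-Disposition header value for the given file name.
    // inline == true asks the browser to display the file if it can;
    // inline == false (as in the code above) asks it to download the file.
    public static string BuildHeader(string fileName, bool inline)
    {
        var cd = new ContentDisposition
        {
            FileName = fileName,
            Inline = inline
        };
        return cd.ToString();
    }
}
```

In the action above, that would mean setting Inline = true before adding the header, assuming the rest of the method stays as it is.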

Render SSRS 2008 Report in web page without Report Viewer

I have a web app written in C# that I need to be able to render an SSRS report on an aspx page without using the Report Viewer control.
As HTML inside a div tag would be perfect. I have the app attached to my SSRS instance via ReportingService2010 reference.
I've found some examples online, but they are for ReportingServices2005 and I couldn't port them over.
How can I do this?
I pulled this out of a project I put together about a year ago.
A few key points:
You need to pass credentials to the report server.
You need to create an images path so that any images in your report are rendered and displayed in the HTML. The code uses Report/GraphFiles/, which should be relative to your app URL.
If your report has any parameters, you will need to add them.
You will definitely need to tweak the code to get it going.
It uses the ReportExecutionService reference. You will have to play around with it, but the nuts and bolts should all be here.
I'd really love to spend time cleaning it up a bit, but I don't have the time, sorry. I hope it helps.
class RenderReport
{
public struct ReportServerCreds
{
public string UserName { get; set; }
public string Password { get; set; }
public string Domain { get; set; }
}
public ReportServerCreds GetReportCreds()
{
ReportServerCreds rsc = new ReportServerCreds();
rsc.UserName = ConfigurationManager.AppSettings["reportserveruser"].ToString();
rsc.Password = ConfigurationManager.AppSettings["reportserverpassword"].ToString();
rsc.Domain = ConfigurationManager.AppSettings["reportserverdomain"].ToString();
return rsc;
}
public enum SSRSExportType
{
HTML,PDF
}
public string RenderReport(string reportpath,SSRSExportType ExportType)
{
using (ReportExecutionService.ReportExecutionServiceSoapClient res = new ReportExecutionService.ReportExecutionServiceSoapClient("ReportExecutionServiceSoap"))
{
ReportExecutionService.ExecutionHeader ExecutionHeader = new ReportExecutionService.ExecutionHeader();
ReportExecutionService.TrustedUserHeader TrusteduserHeader = new ReportExecutionService.TrustedUserHeader();
res.ClientCredentials.Windows.AllowedImpersonationLevel = System.Security.Principal.TokenImpersonationLevel.Impersonation;
ReportServerCreds rsc = GetReportCreds();
res.ClientCredentials.Windows.ClientCredential.Domain = rsc.Domain;
res.ClientCredentials.Windows.ClientCredential.UserName = rsc.UserName;
res.ClientCredentials.Windows.ClientCredential.Password = rsc.Password;
res.Open();
ReportExecutionService.ExecutionInfo ei = new ReportExecutionService.ExecutionInfo();
string format =null;
string deviceinfo =null;
string mimetype = null;
if (ExportType.ToString().ToLower() == "html")
{
format = "HTML4.0";
deviceinfo = @"<DeviceInfo><StreamRoot>/</StreamRoot><HTMLFragment>True</HTMLFragment></DeviceInfo>";
}
else if (ExportType.ToString().ToLower() == "pdf")
{
format = "PDF";
mimetype = "";
}
byte[] results = null;
string extension = null;
string Encoding = null;
ReportExecutionService.Warning[] warnings;
string[] streamids = null;
string historyid = null;
ReportExecutionService.ExecutionHeader Eheader;
ReportExecutionService.ServerInfoHeader serverinfoheader;
ReportExecutionService.ExecutionInfo executioninfo;
// Get available parameters from specified report.
ParameterValue[] paramvalues = null;
DataSourceCredentials[] dscreds = null;
ReportParameter[] rparams = null;
using (ReportService.ReportingService2005SoapClient lrs = new ReportService.ReportingService2005SoapClient("ReportingService2005Soap"))
{
lrs.ClientCredentials.Windows.AllowedImpersonationLevel = System.Security.Principal.TokenImpersonationLevel.Impersonation;
lrs.ClientCredentials.Windows.ClientCredential.Domain = rsc.Domain;
lrs.ClientCredentials.Windows.ClientCredential.UserName = rsc.UserName;
lrs.ClientCredentials.Windows.ClientCredential.Password = rsc.Password;
lrs.GetReportParameters(reportpath,historyid,false,paramvalues,dscreds,out rparams);
}
// Set report parameters here
// This list must be declared because SetExecutionParameters below calls parametervalues.ToArray()
List<ReportExecutionService.ParameterValue> parametervalues = new List<ReportExecutionService.ParameterValue>();
//string enumber = Session["ENumber"] as string;
//parametervalues.Add(new ReportExecutionService.ParameterValue() { Name = "ENumber", Value = enumber });
//if (date != null)
//{
// DateTime dt = DateTime.Today;
//parametervalues.Add(new ReportExecutionService.ParameterValue() { Name = "AttendanceDate", Value = dt.ToString("MM/dd/yyyy")});
//}
//if (ContainsParameter(rparams, "DEEWRID"))
//{
//parametervalues.Add(new ReportExecutionService.ParameterValue() { Name = "DEEWRID", Value = deewrid });
//}
//if (ContainsParameter(rparams, "BaseHostURL"))
//{
// parametervalues.Add(new ReportExecutionService.ParameterValue() { Name = "BaseHostURL", Value = string.Concat("http://", Request.Url.Authority) });
//}
//parametervalues.Add(new ReportExecutionService.ParameterValue() {Name="AttendanceDate",Value=null });
//parametervalues.Add(new ReportExecutionService.ParameterValue() { Name = "ENumber", Value = "E1013" });
try
{
Eheader = res.LoadReport(TrusteduserHeader, reportpath, historyid, out serverinfoheader, out executioninfo);
serverinfoheader = res.SetExecutionParameters(Eheader, TrusteduserHeader, parametervalues.ToArray(), null, out executioninfo);
res.Render(Eheader, TrusteduserHeader, format, deviceinfo, out results, out extension, out mimetype, out Encoding, out warnings, out streamids);
string exportfilename = string.Concat(enumber, reportpath);
if (ExportType.ToString().ToLower() == "html")
{
//write html
string html = string.Empty;
html = System.Text.Encoding.Default.GetString(results);
html = GetReportImages(res, Eheader, TrusteduserHeader, format, streamids, html);
return html;
}
else if (ExportType.ToString().ToLower() == "pdf")
{
//write to pdf
Response.Buffer = true;
Response.Clear();
Response.ContentType = mimetype;
//Response.AddHeader("content-disposition", string.Format("attachment; filename={0}.pdf", exportfilename));
Response.BinaryWrite(results);
Response.Flush();
Response.End();
}
}
catch (Exception e)
{
Response.Write(e.Message);
}
}
}
string GetReportImages(ReportExecutionService.ReportExecutionServiceSoapClient res,
ReportExecutionService.ExecutionHeader EHeader,
ReportExecutionService.TrustedUserHeader tuh,
string reportFormat, string[] streamIDs, string html)
{
if (reportFormat.Equals("HTML4.0") && streamIDs.Length > 0)
{
string devInfo;
string mimeType;
string Encoding;
int startIndex;
int endIndex;
string fileExtension = ".jpg";
string SessionId;
Byte[] image;
foreach (string streamId in streamIDs)
{
SessionId = Guid.NewGuid().ToString().Replace("}", "").Replace("{", "").Replace("-", "");
//startIndex = html.IndexOf(streamId);
//endIndex = startIndex + streamId.Length;
string reportreplacementname = string.Concat(streamId, "_", SessionId, fileExtension);
html = html.Replace(streamId, string.Concat(@"Report\GraphFiles\", reportreplacementname));
//html = html.Insert(endIndex, fileExtension);
//html = html.Insert(startIndex, @"Report/GraphFiles/" + SessionId + "_");
devInfo = "";
//Image = res.RenderStream(reportFormat, streamId, devInfo, out encoding, out mimeType);
res.RenderStream(EHeader,tuh, reportFormat, streamId, devInfo, out image , out Encoding, out mimeType);
System.IO.FileStream stream = System.IO.File.OpenWrite(HttpContext.Current.Request.PhysicalApplicationPath + "Report\\GraphFiles\\" + reportreplacementname);
stream.Write(image, 0, image.Length);
stream.Close();
mimeType = "text/html";
}
}
return html;
}
bool ContainsParameter(ReportParameter[] parameters, string paramname)
{
if(parameters.Where(i=>i.Name.Contains(paramname)).Count() != 0)
{
return true;
}
return false;
}
}
To execute:
The first parameter is the location of the report on the server.
The second is an SSRSExportType enum value.
RenderReport("ReportPathOnServer",SSRSExportType.HTML);
If you are just trying to show the HTML render of a report and you want it to look like a native object to the application without any parameters or toolbar, then you could call the URL for the report directly and include "&rc:Toolbar=false" in the URL. This will hide the toolbar for the Report Viewer control. This method is described under the URL Access Parameter Reference msdn article. Not exactly what you asked for, but it may achieve the purpose.
Here's a sample call that omits the HTML and Body sections if you are embedding the results in an existing HTML document:
http://ServerName/ReportServer?%2fSome+Folder%2fSome+Report+Name&rs:Command=Render&rc:Toolbar=false&rc:HTMLFragment=true
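A URL like the one above can be assembled in code. This is a small sketch with a hypothetical helper name; it percent-encodes the report path with Uri.EscapeDataString, so spaces come out as %20 rather than +, which should be equivalent for the report server.

```csharp
using System;

public static class ReportUrl
{
    // Builds a URL-access link that renders the report as an HTML fragment
    // without the toolbar. reportPath is the report's folder path on the
    // server, for example "/Some Folder/Some Report Name".
    public static string Build(string reportServerUrl, string reportPath)
    {
        return reportServerUrl.TrimEnd('/')
            + "?" + Uri.EscapeDataString(reportPath)
            + "&rs:Command=Render"
            + "&rc:Toolbar=false"
            + "&rc:HTMLFragment=true";
    }
}
```

Calling ReportUrl.Build("http://ServerName/ReportServer", "/Some Folder/Some Report Name") gives a link you could use as the src of an iframe or fetch server-side.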
Definitely an old question but if you're using ASP.NET MVC you could try this open source solution. It uses an HTML helper and renders an .aspx page in an iframe. The repo has a server-side, local render, and anonymous example.

Corrupted .ttf files when extracting embedded fonts from PDF with iTextSharp

I managed to create a piece of C# code that extracts embedded fonts from a PDF file. I am testing it with documents that have TrueType fonts embedded.
The problem is that when I write the font bytes into a file, I cannot open it with any font viewer as it seems to be corrupted.
The funny thing is that the code seems to be quite OK, as executing MuPDF mutool.exe extract I get the same exact files with the same result (corrupted .ttf files).
Here's the code (sorry, it's quite large, but I found no shorter way to achieve this so far):
public void ExtractFonts(PdfReader reader, PdfParserDocument document)
{
foreach (var fontData in BaseFont.GetDocumentFonts(reader))
{
// Get font name and indirect reference
var name = (string)fontData[0];
var reference = (PRIndirectReference)fontData[1];
// Get font bytes from PDF
var XrefIndex = Convert.ToInt32(reference.Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
var fontDictionary = (PdfDictionary)reader.GetPdfObject(XrefIndex);
var fontDescriptor = fontDictionary.GetAsDict(PdfName.FONTDESCRIPTOR);
PRIndirectReference fontBytesReference = null;
if (fontDescriptor != null)
{
if (fontDescriptor.Get(PdfName.FONTFILE) != null)
{
fontBytesReference = (PRIndirectReference)fontDescriptor.Get(PdfName.FONTFILE);
}
else if (fontDescriptor.Get(PdfName.FONTFILE2) != null)
{
fontBytesReference = (PRIndirectReference)fontDescriptor.Get(PdfName.FONTFILE2);
}
else if (fontDescriptor.Get(PdfName.FONTFILE3) != null)
{
fontBytesReference = (PRIndirectReference)fontDescriptor.Get(PdfName.FONTFILE3);
}
}
// Only add embedded fonts
if (fontBytesReference != null)
{
var fontBytesReferenceIndex = Convert.ToInt32(fontBytesReference.Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
var pdfObject = reader.GetPdfObject(fontBytesReferenceIndex);
var pdfStream = (PdfStream)pdfObject;
var bytes = PdfReader.FlateDecode(PdfReader.GetStreamBytesRaw((PRStream)pdfStream));
// Write to file
using (var file = File.OpenWrite(name + ".ttf"))
{
file.Write(bytes, 0, bytes.Length);
}
}
}
}
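One thing worth checking: the code above writes every embedded font with a .ttf extension, whichever key it came from. Per the PDF specification, FontFile holds a Type 1 font program, FontFile2 holds TrueType data, and the format of FontFile3 depends on its Subtype (for example Type1C, CIDFontType0C or OpenType). Embedded fonts may also be subsets that lack tables a font viewer expects, so even genuine TrueType data can be rejected. The sketch below is a hypothetical helper, not part of iTextSharp, that looks at the first four bytes of the decoded data to guess what was extracted.

```csharp
using System;
using System.Text;

public static class FontSniffer
{
    // Guesses the font container type from the leading signature bytes.
    // Returns "unknown" for anything it does not recognise, such as bare CFF data.
    public static string GuessFontKind(byte[] data)
    {
        if (data == null || data.Length < 4)
        {
            return "unknown";
        }
        // TrueType outlines start with the version number 0x00010000
        if (data[0] == 0x00 && data[1] == 0x01 && data[2] == 0x00 && data[3] == 0x00)
        {
            return "truetype";
        }
        string tag = Encoding.ASCII.GetString(data, 0, 4);
        if (tag == "true") return "truetype";             // older Apple TrueType tag
        if (tag == "OTTO") return "opentype-cff";         // OpenType with CFF outlines
        if (tag == "ttcf") return "truetype-collection";  // TrueType collection
        if (tag.StartsWith("%!", StringComparison.Ordinal)) return "type1"; // PostScript Type 1 text header
        return "unknown";
    }
}
```

Calling FontSniffer.GuessFontKind(bytes) before writing the file would at least tell you whether a .ttf extension is appropriate for that particular font.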

Changing PDF font size in iTextSharp

Following is the function I am writing to create PDF using iTextSharp.
Let me explain the function ...
Here I am creating a PDF file from another template PDF file. The template PDF file is sent to this function as a byte[], and I create a pdfReader from it.
From the pdfReader I create a pdfStamper (i.e. the new PDF file) and write the response values to its fields. It is working fine; the only issue is that the font size of the values is much too large.
public void GeneratePrintPDFTest(ResponseGroup actual, Pages page, byte[] filebyte, out string pdfname, string localstorage)
{
string rootPath = @"D:/FOP-PDF/";
var pdfReader = new PdfReader(filebyte);
var pdfStamper = new PdfStamper(pdfReader,new FileStream(rootPath.ToString(CultureInfo.InvariantCulture) + page.PageId.ToString(CultureInfo.InvariantCulture)
+ ".pdf",FileMode.Create));
pdfname = rootPath.ToString(CultureInfo.InvariantCulture) + page.PageId.ToString(CultureInfo.InvariantCulture) + ".pdf";
AcroFields pdfFormFields = pdfStamper.AcroFields;
foreach (DictionaryEntry de in pdfReader.AcroFields.Fields)
{
var response = actual.Responses.Where(obj => obj.ITPPageFieldKeyId == Convert.ToInt32(de.Key.ToString())).Select(obj => obj).FirstOrDefault();
if (response != null)
{
if (response.ResponseValues != null && !string.IsNullOrEmpty(response.ResponseValues.ToString())
&& response.ResponseValues.ToString() != "0" && !string.IsNullOrEmpty(response.DataItemID)
&& response.DataItemID != "0")
{
if (response.PrintFormulaResult || response.PageFieldFormulaId == 0)
{
pdfFormFields.SetField(de.Key.ToString(), response.ResponseValues.ToString());
}
}
}
}
pdfStamper.FormFlattening = false;
pdfStamper.Close();
}
I tried the following solution, but to no avail:
float fSize = 10;
pdfFormFields.SetFieldProperty(de.Key.ToString(), de.Key.ToString(), fSize, null);
I also suspect it may be coming from the template PDF file, but if so, how could I change it programmatically?
Please help me with this. Thanks in advance.

Extract image from PDF using itextsharp

I am trying to extract all the images from a PDF using iTextSharp but can't seem to overcome this one hurdle.
The error occurs on the line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS); with the message "Parameter is not valid".
I think it works when the image is a bitmap but not for any other format.
I have the following code (sorry for the length):
private void Form1_Load(object sender, EventArgs e)
{
FileStream fs = File.OpenRead(@"reader.pdf");
byte[] data = new byte[fs.Length];
fs.Read(data, 0, (int)fs.Length);
List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();
iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
iTextSharp.text.pdf.PdfObject PDFObj = null;
iTextSharp.text.pdf.PdfStream PDFStremObj = null;
try
{
RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data);
PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
{
PDFObj = PDFReaderObj.GetPdfObject(i);
if ((PDFObj != null) && PDFObj.IsStream())
{
PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
{
byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);
if ((bytes != null))
{
try
{
System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);
MS.Position = 0;
System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
ImgList.Add(ImgPDF);
}
catch (Exception)
{
}
}
}
}
}
PDFReaderObj.Close();
}
catch (Exception ex)
{
throw new Exception(ex.Message);
}
} //Form1_Load
Resolved.
I was getting the same "Parameter is not valid" exception. After a lot of work,
and with the help of the link provided by der_chirurg
(http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx), I resolved it.
The code follows:
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using iTextSharp.text.pdf.parser;
using Dotnet = System.Drawing.Image;
using iTextSharp.text.pdf;
namespace PDF_Parsing
{
partial class PDF_ImgExtraction
{
string imgPath;
private void ExtractImage(string pdfFile)
{
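// Open the file passed to the method (files[fileIndex] was not defined in this snippet)
PdfReader pdfReader = new PdfReader(pdfFile);
for (int pageNumber = 1; pageNumber <= pdfReader.NumberOfPages; pageNumber++)
{
// Reuse the reader opened above instead of re-opening the file for every page
PdfDictionary pg = pdfReader.GetPageN(pageNumber);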
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
string width = tg.Get(PdfName.WIDTH).ToString();
string height = tg.Get(PdfName.HEIGHT).ToString();
ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new Matrix(float.Parse(width), float.Parse(height)), (PRIndirectReference)obj, tg);
RenderImage(imgRI);
}
}
}
}
private void RenderImage(ImageRenderInfo renderInfo)
{
PdfImageObject image = renderInfo.GetImage();
using (Dotnet dotnetImg = image.GetDrawingImage())
{
if (dotnetImg != null)
{
using (MemoryStream ms = new MemoryStream())
{
dotnetImg.Save(ms, ImageFormat.Tiff);
Bitmap d = new Bitmap(dotnetImg);
d.Save(imgPath);
}
}
}
}
}
}
You need to check the stream's /Filter to see what image format a given image uses. It may be a standard image format:
DCTDecode (jpeg)
JPXDecode (jpeg 2000)
JBIG2Decode (jbig is a B&W only format)
CCITTFaxDecode (fax format, PDF supports group 3 and 4)
Other than that, you'll need to get the raw bytes (as you are), and build an image using the image stream's width, height, bits per component, number of color components (could be CMYK, indexed, RGB, or Something Weird), and a few others, as defined in section 8.9 of the ISO PDF SPECIFICATION (available for free).
So in some cases your code will work, but in others, it'll fail with the exception you mentioned.
PS: When you have an exception, PLEASE include the stack trace every single time. Pretty please with sugar on top?
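To make the filter check concrete, here is a minimal sketch. The helper is hypothetical and not part of iTextSharp; it takes the filter name as text (with or without the leading slash that PDF names carry) and returns a file extension only when the raw stream bytes are already a complete image file. For JBIG2Decode and CCITTFaxDecode the data needs to be wrapped or decoded first, and for filters such as FlateDecode the bytes are pixel data that must be rebuilt from width, height and color information, so the helper returns null for those.

```csharp
using System;

public static class PdfImageFilters
{
    // Returns the extension to use when the raw stream bytes can be written
    // to disk unchanged, or null when the image has to be reconstructed.
    public static string ExtensionForRawBytes(string filterName)
    {
        if (string.IsNullOrEmpty(filterName))
        {
            return null;
        }
        switch (filterName.TrimStart('/'))
        {
            case "DCTDecode":
                return ".jpg";   // stream is a complete JPEG file
            case "JPXDecode":
                return ".jp2";   // stream is JPEG 2000 data
            default:
                return null;     // JBIG2Decode, CCITTFaxDecode, FlateDecode, ...
        }
    }
}
```

In the question's loop, a null result would be the cue to skip Image.FromStream on the raw bytes and let a decoder handle that image instead.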
Works for me like this, using these two methods:
public static List<System.Drawing.Image> ExtractImagesFromPDF(byte[] bytes)
{
var imgs = new List<System.Drawing.Image>();
var pdf = new PdfReader(bytes);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
List<PdfObject> objs = FindImageInPDFDictionary(pg);
foreach (var obj in objs)
{
if (obj != null)
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
var pdfImage = new PdfImageObject((PRStream)pdfStrem);
var img = pdfImage.GetDrawingImage();
imgs.Add(img);
}
}
}
}
finally
{
pdf.Close();
}
return imgs;
}
private static List<PdfObject> FindImageInPDFDictionary(PdfDictionary pg)
{
var res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
var xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
var pdfObgs = new List<PdfObject>();
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
var tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
var type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(type)) // image at the root of the pdf
{
pdfObgs.Add(obj);
}
else if (PdfName.FORM.Equals(type)) // image inside a form
{
FindImageInPDFDictionary(tg).ForEach(o => pdfObgs.Add(o));
}
else if (PdfName.GROUP.Equals(type)) // image inside a group
{
FindImageInPDFDictionary(tg).ForEach(o => pdfObgs.Add(o));
}
}
}
}
return pdfObgs;
}
In newer versions of iTextSharp, the first parameter of ImageRenderInfo.CreateForXObject is no longer a Matrix but a GraphicsState. @der_chirurg's approach should work. I tested it myself with the information from the following link and it worked beautifully:
http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/
To extract all Images on all Pages, it is not necessary to implement different filters. iTextSharp has an Image Renderer, which saves all Images in their original image type.
Just do the following found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx You don't need to implement HttpHandler...
I added a library on GitHub which extracts images from a PDF and compresses them.
It could be useful when you are starting to play with the very powerful iTextSharp library.
Here is the link: https://github.com/rock-walker/PdfCompression
This works for me and I think it's a simple solution:
Write a custom RenderListener and implement its RenderImage method, something like this
public void RenderImage(ImageRenderInfo info)
{
PdfImageObject image = info.GetImage();
Parser.Matrix matrix = info.GetImageCTM();
var fileType = image.GetFileType();
ImageFormat format;
switch (fileType)
{//you may add more types here
case "jpg":
case "jpeg":
format = ImageFormat.Jpeg;
break;
case "png":
format = ImageFormat.Png;
break;
case "bmp":
format = ImageFormat.Bmp;
break;
case "tif":
case "tiff":
format = ImageFormat.Tiff;
break;
case "gif":
format = ImageFormat.Gif;
break;
default:
format = ImageFormat.Jpeg;
break;
}
var pic = image.GetDrawingImage();
var x = matrix[Parser.Matrix.I31];
var y = matrix[Parser.Matrix.I32];
var width = matrix[Parser.Matrix.I11];
var height = matrix[Parser.Matrix.I22];
if (x < <some value> && y < <some value>)
{
return;//ignore these images
}
pic.Save(<path and name>, format);
}
I have used this library in the past without any problems.
http://www.winnovative-software.com/PdfImgExtractor.aspx
private void btnExtractImages_Click(object sender, EventArgs e)
{
if (pdfFileTextBox.Text.Trim().Equals(String.Empty))
{
MessageBox.Show("Please choose a source PDF file", "Choose PDF file", MessageBoxButtons.OK);
return;
}
// the source pdf file
string pdfFileName = pdfFileTextBox.Text.Trim();
// start page number
int startPageNumber = int.Parse(textBoxStartPage.Text.Trim());
// end page number
// when it is 0 the extraction will continue up to the end of document
int endPageNumber = 0;
if (textBoxEndPage.Text.Trim() != String.Empty)
endPageNumber = int.Parse(textBoxEndPage.Text.Trim());
// create the PDF images extractor object
PdfImagesExtractor pdfImagesExtractor = new PdfImagesExtractor();
pdfImagesExtractor.LicenseKey = "31FAUEJHUEBQRl5AUENBXkFCXklJSUlQQA==";
// the demo output directory
string outputDirectory = Path.Combine(Application.StartupPath, @"DemoFiles\Output");
Cursor = Cursors.WaitCursor;
// set the handler to be called when an image was extracted
pdfImagesExtractor.ImageExtractedEvent += pdfImagesExtractor_ImageExtractedEvent;
try
{
// start images counting
imageIndex = 0;
// call the images extractor to raise the ImageExtractedEvent event when an image is extracted from a PDF page
// the pdfImagesExtractor_ImageExtractedEvent handler below will be executed for each extracted image
pdfImagesExtractor.ExtractImagesInEvent(pdfFileName, startPageNumber, endPageNumber);
// Alternatively you can use the ExtractImages() and ExtractImagesToFile() methods
// to extract the images from a PDF document in memory or to image files in a directory
// uncomment the line below to extract the images to an array of ExtractedImage objects
//ExtractedImage[] pdfPageImages = pdfImagesExtractor.ExtractImages(pdfFileName, startPageNumber, endPageNumber);
// uncomment the lines below to extract the images to image files in a directory
//string outputDirectory = System.IO.Path.Combine(Application.StartupPath, @"DemoFiles\Output");
//pdfImagesExtractor.ExtractImagesToFile(pdfFileName, startPageNumber, endPageNumber, outputDirectory, "pdfimage");
}
catch (Exception ex)
{
// The extraction failed
MessageBox.Show(String.Format("An error occurred. {0}", ex.Message), "Error");
return;
}
finally
{
// uninstall the event handler
pdfImagesExtractor.ImageExtractedEvent -= pdfImagesExtractor_ImageExtractedEvent;
Cursor = Cursors.Arrow;
}
try
{
System.Diagnostics.Process.Start(outputDirectory);
}
catch (Exception ex)
{
MessageBox.Show(string.Format("Cannot open output folder. {0}", ex.Message));
return;
}
}
/// <summary>
/// The ImageExtractedEvent event handler called after an image was extracted from a PDF page.
/// The event is raised when the ExtractImagesInEvent() method is used
/// </summary>
/// <param name="args">The handler argument containing the extracted image and the PDF page number</param>
void pdfImagesExtractor_ImageExtractedEvent(ImageExtractedEventArgs args)
{
// get the image object and page number from the event handler argument
Image pdfPageImageObj = args.ExtractedImage.ImageObject;
int pageNumber = args.ExtractedImage.PageNumber;
// save the extracted image to a PNG file
string outputPageImage = Path.Combine(Application.StartupPath, @"DemoFiles\Output",
"pdfimage_" + pageNumber.ToString() + "_" + imageIndex++ + ".png");
pdfPageImageObj.Save(outputPageImage, ImageFormat.Png);
args.ExtractedImage.Dispose();
}
