Read a stored PDF from memory stream - c#

I'm working on a database project using C# and SQL Server 2012. One of my forms stores a PDF file, along with some other information, in a table. Saving works, but when I retrieve the stored record I don't know how to display the PDF file.
I read some articles that said a PDF cannot be displayed in the Adobe PDF viewer from a memory stream. Is there any way to do that?
This is my code for retrieving the data from the database:
sql_com.CommandText = "select * from incoming_boks_tbl where [incoming_bok_id]=@incoming_id and [incoming_date]=@incoming_date";
sql_com.Parameters.AddWithValue("@incoming_id", up_inco_num_txt.Text);
sql_com.Parameters.AddWithValue("@incoming_date", up_inco_date_txt.Text);
sql_dr = sql_com.ExecuteReader();
if (sql_dr.HasRows)
{
    while (sql_dr.Read())
    {
        up_incoming_id_txt.Text = sql_dr[0].ToString();
        up_inco_num_txt.Text = sql_dr[1].ToString();
        up_inco_date_txt.Text = sql_dr[2].ToString();
        up_inco_reg_txt.Text = sql_dr[3].ToString();
        up_inco_place_txt.Text = sql_dr[4].ToString();
        up_in_out_txt.Text = sql_dr[5].ToString();
        up_subj_txt.Text = sql_dr[6].ToString();
        up_note_txt.Text = sql_dr[7].ToString();
        string file_ext = sql_dr[8].ToString(); // PDF file extension
        byte[] inco_file = (byte[])(sql_dr[9]); // the PDF file
        MemoryStream ms = new MemoryStream(inco_file);
        // Here I don't know what to do with the memory stream data or where to store it. How can I display it?
    }
}

This answer should give you some options: How to render pdfs using C#
In the past I have used Google's open source PDF rendering project, PDFium.
There is a NuGet package called PdfiumViewer which provides a C# wrapper around PDFium and allows PDFs to be displayed and printed.
It works directly with streams, so it doesn't require any data to be written to disk.
This is my example from a WinForms app:
// Requires: using System.IO; using PdfiumViewer;
public void LoadPdf(byte[] pdfBytes)
{
    var stream = new MemoryStream(pdfBytes);
    LoadPdf(stream);
}
public void LoadPdf(Stream stream)
{
    // Create PDF Document
    var pdfDocument = PdfDocument.Load(stream);
    // Load PDF Document into WinForms Control (pdfRenderer is the PdfiumViewer control on the form)
    pdfRenderer.Load(pdfDocument);
}
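If adding a rendering library is not an option, a common fallback is to write the bytes to a temporary file and let the operating system open it in whatever viewer is registered for .pdf files. The following is only a minimal sketch of that idea, using nothing but the base class library; the class and method names (PdfTempFileViewer, SaveToTempFile, OpenWithDefaultViewer) are hypothetical and do not come from the question or from PdfiumViewer.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class PdfTempFileViewer
{
    // Writes the PDF bytes to a uniquely named file in the temp folder
    // and returns the full path of that file.
    public static string SaveToTempFile(byte[] pdfBytes)
    {
        if (pdfBytes == null) throw new ArgumentNullException(nameof(pdfBytes));
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".pdf");
        File.WriteAllBytes(path, pdfBytes);
        return path;
    }

    // Asks the shell to open the file with the application registered for .pdf.
    public static void OpenWithDefaultViewer(string path)
    {
        var startInfo = new ProcessStartInfo(path) { UseShellExecute = true };
        Process.Start(startInfo);
    }
}
```

With the question's variables, this could be called as PdfTempFileViewer.OpenWithDefaultViewer(PdfTempFileViewer.SaveToTempFile(inco_file)). Unlike the stream-based approach above, this does write the data to disk, so the temporary file should be deleted once it is no longer needed.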

Related

Split large PDF file in to multiple pdfs in C#

I have a large PDF file that I need to split into multiple PDFs or chunks before I upload it to the server (another WCF service).
I see two approaches for sending large files (>2 MB) to the server: splitting the file into multiple PDFs, or splitting one PDF into chunks. Can anyone tell me how to achieve this?
I found articles that use iTextSharp, but it is deprecated, and I don't use the licensed version. Is there a feasible way to achieve this?
I followed the article below, but it uses iTextSharp, which is deprecated.
https://www.c-sharpcorner.com/article/splitting-pdf-file-in-c-sharp-using-itextsharp/
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System.IO;
class Program
{
    // Output Folder
    static string outputFolder = @"D:\PDFSplit\Example\outputFolder";
    static void Main(string[] args)
    {
        // Input Folder
        var inputFolder = @"D:\PDFSplit\Example\inputFolder";
        // Input File name
        var inputPDFFileName = "sample.pdf";
        // Input file path
        string inputPDFFilePath = Path.Combine(inputFolder, inputPDFFileName);
        // Open the input file in Import Mode
        PdfDocument inputPDFFile = PdfReader.Open(inputPDFFilePath, PdfDocumentOpenMode.Import);
        // Get the total pages in the PDF
        var totalPagesInInputPDFFile = inputPDFFile.PageCount;
        // Walk the pages from last to first, writing each one to its own file
        while (totalPagesInInputPDFFile != 0)
        {
            // Create an instance of the PDF document in memory
            PdfDocument outputPDFDocument = new PdfDocument();
            // Add a specific page to the PdfDocument instance
            outputPDFDocument.AddPage(inputPDFFile.Pages[totalPagesInInputPDFFile - 1]);
            // Save the PDF document
            SaveOutputPDF(outputPDFDocument, totalPagesInInputPDFFile);
            totalPagesInInputPDFFile--;
        }
    }
    private static void SaveOutputPDF(PdfDocument outputPDFDocument, int pageNo)
    {
        // Output file path
        string outputPDFFilePath = Path.Combine(outputFolder, pageNo.ToString() + ".pdf");
        // Save the document
        outputPDFDocument.Save(outputPDFFilePath);
    }
}
Splitting the PDF reduces the size of each file: you split it into several files and then process them individually. First load the PDF document that you need to split, then call the split function with the output file name pattern.
PdfLoadedDocument document = new PdfLoadedDocument("sample.pdf");
document.Split("Document-{0}.pdf");
document.Close(true);
There is another way: first load the PDF document using the Document class, then choose the pages to be split out into a Page[] array. After that, create a new Document and add the pages to it using the Document.Pages.Add(Page[]) method.
Save the PDF file using the Document.Save(String) method.
Try using streams, for instance StreamReader.
See meatvest's answer here
And the docs
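For the "chunks" approach mentioned in the question, the stream suggestion can be sketched roughly as follows. StreamReader is meant for text, so this sketch reads the binary file with a FileStream instead. The names FileChunker and ReadChunks are hypothetical, and the chunk size is left to the caller. Note that these chunks are raw byte ranges, not valid PDF files on their own, so the receiving service would have to reassemble them in order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileChunker
{
    // Lazily reads the file at 'path' and yields it as consecutive byte arrays.
    // Each array holds at most 'chunkSize' bytes; a chunk can be shorter,
    // and the final one usually is.
    public static IEnumerable<byte[]> ReadChunks(string path, int chunkSize)
    {
        // Because this is an iterator, the check runs when enumeration starts.
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[chunkSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copy only the bytes that were actually read.
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                yield return chunk;
            }
        }
    }
}
```

A caller could then loop with foreach (var chunk in FileChunker.ReadChunks(inputPDFFilePath, 1024 * 1024)) and send each chunk to the service together with its index, so that the server can append the chunks in the right order.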

GemBox DocumentModel.Load() cannot read Pdf file

Currently I am unable to load an original PDF document using GemBox. It gives me the error shown in the image below. I am using Acrobat 9.
I have also tried the 8/16/2018 fixes. Any suggestion will be highly appreciated.
The basic code I am using is:
using GemBox.Document;
using System;
namespace Pdf2Text
{
    class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            ComponentInfo.SetLicense("My-License");
            DocumentModel document = null;
            document = DocumentModel.Load(@"E:\data\testing\HA021.pdf");
            document.Save(@"E:\data\testing\HA021.docx");
        }
    }
}
The current implementation of the PDF reader in GemBox.Document is still in beta and cannot handle one feature of this PDF: "iref streams", which are cross-reference tables stored in streams.
However, GemBox.Pdf can handle cross-reference streams so as a workaround what you could do is something like the following:
// Load PDF with GemBox.Pdf.
var pdfDocument = PdfDocument.Load("Sample.pdf");
pdfDocument.SaveOptions.CrossReferenceType = PdfCrossReferenceType.Table;
// Save PDF with GemBox.Pdf.
var pdfStream = new MemoryStream();
pdfDocument.Save(pdfStream);
// Load PDF with GemBox.Document.
var document = DocumentModel.Load(pdfStream, LoadOptions.PdfDefault);
Lastly, regarding the conversion of PDF to DOCX: GemBox.Document's PDF reader is currently intended for extracting text and tables from PDF files; it is not intended for high-fidelity conversion.

How to get text from image using C# .NET [duplicate]

Is there an API to use OneNote's OCR capabilities to recognise text in images automatically?
If the OneNote client is installed on the same machine where your program will run, you can create a page in OneNote and insert the image through the COM API. Then you can read the page in XML format, which will include the OCR'ed text.
You want to use
Application.CreateNewPage to create a page
Application.UpdatePageContent to insert the image
Application.GetPageContent to read the page content and look for OCRData and OCRText elements in the XML.
OneNote COM API is documented here: http://msdn.microsoft.com/en-us/library/office/jj680120(v=office.15).aspx
When you put an image on a page in OneNote through the API, any images will automatically be OCR'd. The user will then be able to search any text in the images in OneNote. However, you cannot pull the image back and read the OCR'd text at this point.
If this is a feature that interests you, I invite you to go to our UserVoice site and submit this idea: http://onenote.uservoice.com/forums/245490-onenote-developers
update: vote on the idea: https://onenote.uservoice.com/forums/245490-onenote-developer-apis/suggestions/10671321-make-ocr-available-in-the-c-api
-- James
There is a really good sample of how to do this here:
http://www.journeyofcode.com/free-easy-ocr-c-using-onenote/
The main bit of code is:
private string RecognizeIntern(Image image)
{
    this._page.Reload();
    this._page.Clear();
    this._page.AddImage(image);
    this._page.Save();
    int total = 0;
    do
    {
        Thread.Sleep(PollInterval);
        this._page.Reload();
        string result = this._page.ReadOcrText();
        if (result != null)
            return result;
    } while (total++ < PollAttempts);
    return null;
}
As I will be deleting my blog (which was mentioned in another post), I thought I should add the content here for future reference:
Usage
Let's start by taking a look at how to use the component. The class OnenoteOcrEngine implements the core functionality and the interface IOcrEngine, which provides a single method:
public interface IOcrEngine
{
    string Recognize(Image image);
}
Excluding any error handling, it can be used in a way similar to the following one:
using (var ocrEngine = new OnenoteOcrEngine())
using (var image = Image.FromFile(imagePath))
{
    var text = ocrEngine.Recognize(image);
    if (text == null)
        Console.WriteLine("nothing recognized");
    else
        Console.WriteLine("Recognized: " + text);
}
Implementation
The implementation is far less straightforward. Prior to Office 2010, Microsoft Office Document Imaging (MODI) was available for OCR. Unfortunately, this is no longer the case. Further research confirmed that OneNote's OCR functionality is not directly exposed in the form of an API, but suggestions were made to manually parse OneNote documents for the text (see Is it possible to do OCR on a Tiff image using the OneNote interop API? or need a document to extract text from image using onenote Interop?). And that's exactly what I did:
Connect to OneNote using COM interop
Create a temporary page containing the image to process
Show the temporary page (important because OneNote won't perform the OCR otherwise)
Poll for an OCRData tag containing an OCRText tag in the XML code of the page.
Delete the temporary page
Challenges included parsing the XML code, for which I decided to use LINQ to XML. For example, inserting the image was done using the following code:
private XElement CreateImageTag(Image image)
{
    var img = new XElement(XName.Get("Image", OneNoteNamespace));
    var data = new XElement(XName.Get("Data", OneNoteNamespace));
    data.Value = this.ToBase64(image);
    img.Add(data);
    return img;
}
private string ToBase64(Image image)
{
    using (var memoryStream = new MemoryStream())
    {
        image.Save(memoryStream, ImageFormat.Png);
        var binary = memoryStream.ToArray();
        return Convert.ToBase64String(binary);
    }
}
Note the usage of XName.Get("Image", OneNoteNamespace) (where OneNoteNamespace is the constant "http://schemas.microsoft.com/office/onenote/2013/onenote") for creating the element with the correct namespace, and the method ToBase64, which serializes a GDI image from memory into the Base64 format. Unfortunately, polling (see What is wrong with polling? for a discussion of the topic) in combination with a timeout is necessary to determine whether the detection process has completed successfully:
int total = 0;
do
{
    Thread.Sleep(PollInterval);
    this._page.Reload();
    string result = this._page.ReadOcrText();
    if (result != null)
        return result;
} while (total++ < PollAttempts);
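The post does not show what ReadOcrText does internally. As a rough sketch only, and assuming the page XML has already been fetched as a string, the "look for an OCRData tag containing an OCRText tag" step could be written with LINQ to XML as below. The class name OneNoteOcrXml and the string parameter are assumptions made for illustration; the original implementation may differ.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OneNoteOcrXml
{
    // Namespace constant quoted in the post above.
    private const string OneNoteNamespace =
        "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Returns the text of all OCRText elements found directly under OCRData
    // elements, joined by new lines, or null if the page XML has none yet.
    public static string ReadOcrText(string pageXml)
    {
        XNamespace one = OneNoteNamespace;
        var texts = XDocument.Parse(pageXml)
            .Descendants(one + "OCRData")
            .Elements(one + "OCRText")
            .Select(e => e.Value)
            .ToList();

        return texts.Count == 0 ? null : string.Join(Environment.NewLine, texts);
    }
}
```

Returning null when no OCRText element is present matches the way the polling loop above treats a null result as "not finished yet".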
Results
The results are not perfect. Considering the quality of the images, however, they are more than satisfactory in my opinion. I could successfully use the component in my project. One very annoying issue remains: sometimes OneNote crashes during the process. Most of the time a simple restart fixes this, but trying to recognise text from some images reproducibly crashes OneNote.
Code / Download
Check out the code at GitHub
Not sure about OCR, but the documentation site for the OneNote API is this:
http://msdn.microsoft.com/en-us/library/office/dn575425.aspx#sectionSection1

Strip Adobe Reader and Version requirements from PDF before outputting it to browser

I am planning on using pdf.js to display PDF content in the browser with JavaScript. The problem is that some PDFs, including the ones I am using, require Adobe Reader at a specific version. pdf.js does not yet (ever?) support spoofing these requirements. What I need to know is whether there is a way in C# to open the PDF and remove these Reader and version requirements, and how to do it. I was planning on using iTextSharp for other server-side PDF manipulation, so an example using it would be most helpful. I plan on serving these as an ActionResult from an AJAX request via MVC 4, so ending the manipulation with a MemoryStream would be ideal.
So in the end pdf.js was unable to do what I needed it to. What I was able to do, however, was convert the XFA/PDF to a C# object and then send the pages as needed via JSON to my JavaScript for rendering in the HTML5 canvas. The code below takes an XFA-in-a-PDF file and turns it into a C# object with the help of iTextSharp:
PdfReader.unethicalreading = true;
PdfReader reader = new PdfReader(new FileStream(Statics.PdfUploadLocation + PdfFileName, FileMode.Open, FileAccess.Read));
XfaForm xfaForm = new XfaForm(reader);
XDocument xDoc = XDocument.Parse(xfaForm.DomDocument.InnerXml);
string xfaNamespace = @"{http://www.xfa.org/schema/xfa-template/2.6/}";
List<XElement> formPages = xDoc.Descendants(xfaNamespace + "subform").Descendants(xfaNamespace + "subform").ToList();
TotalPages = formPages.Count();
var fieldIndex = 0;
RawPdfFields = new List<XfaField>();
for (int page = 0; page < formPages.Count(); page++)
{
    RawPdfFields.AddRange(formPages[page].Descendants(xfaNamespace + "field")
        .Select(x => new XfaField
        {
            Page = page,
            Index = fieldIndex++,
            Name = (string)x.Attribute("name"),
            Height = GetUnitFromPossibleString((string)x.Attribute("h")),
            Width = GetUnitFromPossibleString((string)x.Attribute("w")),
            XPosition = GetUnitFromPossibleString((string)x.Attribute("x")),
            YPosition = GetUnitFromPossibleString((string)x.Attribute("y")),
            Reference = GetReference(x.Descendants(xfaNamespace + "traverse")),
            AssistSpeak = GetAssistSpeak(x.Descendants(xfaNamespace + "speak"))
        }).ToList());
}
Your PDF file n-400.pdf uses the Adobe XML Forms Architecture (XFA). This means you need a viewer that also supports XFA, which pdf.js seemingly does not.
Such a PDF normally contains some standard PDF content indicating that the PDF requires a viewer that supports XFA. In your case the content contains
If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.
This actually indicates what an XFA-enabled viewer does: it renders pages based on the information in the XFA XML data and displays them instead of the PDF-style page descriptions.
While XFA is defined proprietarily by Adobe, the PDF specification ISO 32000-1 describes how XFA data is to be embedded in a PDF document, cf. section 12.7.8 XFA Forms.
If you only need those forms in a flattened state, you might want to have a look at iText Demo: Dynamic XFA forms in PDF.
