I am writing a program that uses OCR (tessnet2) to scan an image file and extract certain information. This was easy before I found out that I was going to be scanning attachments of PDFs from an Exchange server.
The first problem I am working on is how to convert my PDFs to BMP files. From what I can tell so far of TessNet2, it can only read in image files - specifically BMP. So I am now tasked with converting a PDF of indeterminate size (2 - 15 pages) to BMP image. After that is done I can easily scan each image using the code I have built already with TessNet2.
I have seen solutions that use Ghostscript for this task. I'm just wondering if there is another free solution, or if one of you fine humans could give me a crash course on how to do this using Ghostscript.
You can use ImageMagick too. And it's totally free! No trial or payment.
Just download the ImageMagick .exe from here. Install it, then download the NuGet package from here.
Here is the code. I hope it helps (even though the question was asked 6 years ago...)
Procedure:
using System.IO;
using ImageMagick;
public void PDFToBMP(string output)
{
    MagickReadSettings settings = new MagickReadSettings();
    // Setting the density to 500 dpi produces an image with better quality
    settings.Density = new Density(500);
    string[] files = GetFiles();
    foreach (string file in files)
    {
        string nameWithoutExtension = Path.GetFileNameWithoutExtension(file);
        string path = Path.Combine(output, nameWithoutExtension);
        using (MagickImageCollection images = new MagickImageCollection())
        {
            // Pass the settings so the density is applied when the PDF is rasterized
            images.Read(file, settings);
            int page = 1;
            foreach (var image in images)
            {
                image.Format = MagickFormat.Bmp; // if you want another image format, change the format here...
                // ...and the extension here. The page number keeps pages from overwriting each other.
                image.Write(path + "_" + page + ".bmp");
                page++;
            }
        }
    }
}
Function GetFiles():
// Requires: using System.Collections; using System.IO;
public string[] GetFiles()
{
    if (!Directory.Exists(@"your\path"))
    {
        Directory.CreateDirectory(@"your\path");
    }
    DirectoryInfo dirInfo = new DirectoryInfo(@"your\path");
    FileInfo[] fileInfos = dirInfo.GetFiles();
    ArrayList list = new ArrayList();
    foreach (FileInfo info in fileInfos)
    {
        // HACK: Just skip the protected samples file...
        if (info.Name.IndexOf("protected") == -1)
            list.Add(info.FullName);
    }
    return (string[])list.ToArray(typeof(string));
}
Found a CodeProject article on converting PDFs to Images:
http://www.codeproject.com/Articles/57100/Simple-and-Free-PDF-to-Image-Conversion
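For the Ghostscript route the question asks about, here is a minimal sketch of calling the Ghostscript command-line tool from C#. The class and method names are made up for illustration, and it assumes the console executable (gswin64c.exe on 64-bit Windows) is installed and on the PATH. The bmp16m device writes 24-bit BMP files, -r sets the resolution in dpi, and a %03d in the -o pattern produces one numbered file per page.

```csharp
using System.Diagnostics;

public static class GhostscriptCli
{
    // Builds (but does not start) a Ghostscript process that renders each
    // page of a PDF to its own 24-bit BMP file.
    // outputPattern should contain %03d (or similar) so every page gets a file.
    public static ProcessStartInfo BuildBmpConversion(
        string gsExePath, string pdfPath, string outputPattern, int dpi)
    {
        string args = "-dSAFER -sDEVICE=bmp16m -r" + dpi +
                      " -o \"" + outputPattern + "\" \"" + pdfPath + "\"";
        return new ProcessStartInfo(gsExePath, args)
        {
            UseShellExecute = false, // run the exe directly, no shell
            CreateNoWindow = true
        };
    }
}
```

To run it, pass the result to Process.Start and call WaitForExit on the returned process, for example with "gswin64c.exe", your PDF path, "page-%03d.bmp", and 300 as arguments.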
I recognize this is a very old question, but it is an ongoing problem. If you are targeting .NET 6 or later, I hope you would take a look at my library Melville.PDF.
Melville.Pdf is a MIT-Licensed C# implementation of a PDF renderer. I hope this serves a need that I have felt for some time.
If you are trying to get text out of PDF documents, render + OCR may be the hard way around. Some PDF files are just a thin wrapper around image objects, but many actually have text inside of them. Melville.PDF does not do text extraction (yet), but extracting the embedded text directly might be an easier way to get text out of some files.
Is there an API to use Onenote OCR capabilities to recognise text in images automatically?
If you have OneNote client on the same machine as your program will execute you can create a page in OneNote and insert the image through the COM API. Then you can read the page in XML format which will include the OCR'ed text.
You want to use
Application.CreateNewPage to create a page
Application.UpdatePageContent to insert the image
Application.GetPageContent to read the page content and look for OCRData and OCRText elements in the XML.
OneNote COM API is documented here: http://msdn.microsoft.com/en-us/library/office/jj680120(v=office.15).aspx
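As a rough sketch of the last step (finding the OCRText elements in the page XML), something like the following LINQ to XML code could work. The sample XML is a hypothetical, trimmed-down page made up for illustration, and the 2013 schema namespace is an assumption; other OneNote versions use a different namespace. The string you would actually pass in is the one returned by GetPageContent.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OneNoteOcrXml
{
    // Namespace of the OneNote 2013 page schema (assumed; varies by version).
    public static readonly XNamespace One =
        "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Hypothetical, trimmed-down page XML used only for illustration.
    public const string SamplePageXml =
        "<one:Page xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\">" +
        "<one:Image><one:Data>...</one:Data>" +
        "<one:OCRData lang=\"en-US\">" +
        "<one:OCRText><![CDATA[Hello OCR]]></one:OCRText>" +
        "</one:OCRData></one:Image></one:Page>";

    // Returns the text of every OCRText element that sits directly
    // inside an OCRData element, in document order.
    public static List<string> ExtractOcrText(string pageXml)
    {
        return XDocument.Parse(pageXml)
            .Descendants(One + "OCRData")
            .Elements(One + "OCRText")
            .Select(e => e.Value)
            .ToList();
    }
}
```

An empty list means OneNote has not (yet) attached any OCR data to the page, so the caller may want to wait and read the page again.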
When you put an image on a page in OneNote through the API, any images will automatically be OCR'd. The user will then be able to search any text in the images in OneNote. However, you cannot pull the image back and read the OCR'd text at this point.
If this is a feature that interests you, I invite you to go to our UserVoice site and submit this idea: http://onenote.uservoice.com/forums/245490-onenote-developers
update: vote on the idea: https://onenote.uservoice.com/forums/245490-onenote-developer-apis/suggestions/10671321-make-ocr-available-in-the-c-api
-- James
There is a really good sample of how to do this here:
http://www.journeyofcode.com/free-easy-ocr-c-using-onenote/
The main bit of code is:
private string RecognizeIntern(Image image)
{
this._page.Reload();
this._page.Clear();
this._page.AddImage(image);
this._page.Save();
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
return null;
}
As I will be deleting my blog (which was mentioned in another post), I thought I should add the content here for future reference:
Usage
Let's start by looking at how to use the component: The class OnenoteOcrEngine provides the core functionality and implements the interface IOcrEngine, which has a single method:
public interface IOcrEngine
{
string Recognize(Image image);
}
Excluding any error handling, it can be used in a way similar to the following one:
using (var ocrEngine = new OnenoteOcrEngine())
using (var image = Image.FromFile(imagePath))
{
var text = ocrEngine.Recognize(image);
if (text == null)
Console.WriteLine("nothing recognized");
else
Console.WriteLine("Recognized: " + text);
}
Implementation
The implementation is far less straightforward. Prior to Office 2010, Microsoft Office Document Imaging (MODI) was available for OCR. Unfortunately, this is no longer the case. Further research confirmed that OneNote's OCR functionality is not directly exposed in the form of an API, but suggestions were made to manually parse OneNote documents for the text (see "Is it possible to do OCR on a Tiff image using the OneNote interop API?" or "need a document to extract text from image using onenote Interop?"). And that's exactly what I did:
Connect to OneNote using COM interop
Create a temporary page containing the image to process
Show the temporary page (important because OneNote won't perform the OCR otherwise)
Poll for an OCRData tag containing an OCRText tag in the XML code of the page.
Delete the temporary page
Challenges included the parsing of the XML code for which I decided to use LINQ to XML. For example, inserting the image was done using the following code:
private XElement CreateImageTag(Image image)
{
var img = new XElement(XName.Get("Image", OneNoteNamespace));
var data = new XElement(XName.Get("Data", OneNoteNamespace));
data.Value = this.ToBase64(image);
img.Add(data);
return img;
}
private string ToBase64(Image image)
{
using (var memoryStream = new MemoryStream())
{
image.Save(memoryStream, ImageFormat.Png);
var binary = memoryStream.ToArray();
return Convert.ToBase64String(binary);
}
}
Note the usage of XName.Get("Image", OneNoteNamespace) (where OneNoteNamespace is the constant "http://schemas.microsoft.com/office/onenote/2013/onenote") for creating the element with the correct namespace, and the method ToBase64, which serializes a GDI image from memory into the Base64 format. Unfortunately, polling (see "What is wrong with polling?" for a discussion of the topic) in combination with a timeout is necessary to determine whether the detection process has completed successfully:
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
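For reference, the same poll-with-timeout idea can be written as a small standalone helper. This is only a sketch: the class and method names are made up and are not part of the component. It also differs slightly from the loop above, since it probes before the first sleep and makes exactly maxAttempts probes, whereas the do/while loop above sleeps first and makes PollAttempts + 1 probes.

```csharp
using System;
using System.Threading;

public static class Polling
{
    // Calls probe up to maxAttempts times, sleeping for interval between
    // tries, and returns the first non-null result (or null on timeout).
    public static T PollUntilNotNull<T>(Func<T> probe, int maxAttempts, TimeSpan interval)
        where T : class
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            T result = probe();
            if (result != null)
                return result;

            // No point in sleeping after the final failed attempt.
            if (attempt < maxAttempts - 1)
                Thread.Sleep(interval);
        }
        return null;
    }
}
```

With such a helper, the loop above would become a single call that passes a lambda which reloads the page and reads the OCR text.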
Results
The results are not perfect. Considering the quality of the images, however, they are more than satisfactory in my opinion. I could successfully use the component in my project. One very annoying issue remains: sometimes, OneNote crashes during the process. Most of the time, a simple restart will fix this issue, but trying to recognise text from some images reproducibly crashes OneNote.
Code / Download
Check out the code at GitHub
not sure about OCR, but the documentation site for onenote API is this
http://msdn.microsoft.com/en-us/library/office/dn575425.aspx#sectionSection1
Right now I am using ghostscript in Unity to convert pdfs to jpgs and view them in my project.
Currently it flows like so:
-Pdfs are converted into multiple jpegs (one for each page)
-The converted jpegs are written to disk
-They are then read in by bytes into a 2D texture
-And this 2D texture is assigned to a GameObject's RawImage component
This works perfectly in Unity, but... (now comes the hiccup) my project is intended to run on the Microsoft Hololens.
The Hololens runs on the Windows 10 API, but in a limited capacity.
Where the issue arises is when I try to convert pdfs and view them on the Hololens. Quite simply, the Hololens cannot create or delete files outside of its known folders (Pictures, Documents, etc).
My imagined solution to this problem is, instead of writing the converted jpeg files to disk, to write them to memory and view them from there.
In talking with GhostScript devs, I was told GhostScript.NET does what I am looking to do - convert pdfs and view them from memory (It does this with the Rasterizer/Viewer classes, I believe, but again I don't understand it quite well).
I've been led to look at the latest GhostScript.NET docs to root out how this is done, but I simply don't understand them well enough to approach this.
My question is then, based on how I'm using ghostscript now, how do I use GhostScript.NET in my project to write the converted jpegs into memory and view them there?
Here's how I'm doing it now (code-wise):
//instantiate
byte[] fileData;
Texture2D tex = null;
//if a PDF file exists at the current head path
if (File.Exists(CurrentHeadPath))
{
//Transform pdf to jpg
PdfToImage.PDFConvert pp = new PDFConvert();
pp.OutputFormat = "jpeg"; //format
pp.JPEGQuality = 100; //100% quality
pp.ResolutionX = 300; //dpi
pp.ResolutionY = 500;
pp.OutputToMultipleFile = true;
CurrentPDFPath = "Data/myFiles/pdfconvimg.jpg";
//this call is what actually converts the pdf to jpeg files
pp.Convert(CurrentHeadPath, CurrentPDFPath);
//this just loads the first image
if (File.Exists("Data/myFiles/pdfconvimg" + 1 + ".jpg"))
{
//reads in the jpeg file by bytes
fileData = File.ReadAllBytes("Data/myFiles/pdfconvimg" + 1 + ".jpg");
tex = new Texture2D(2, 2);
tex.LoadImage(fileData); //..this will auto-resize the texture dimensions.
//Read Texture into RawImage component
PdfObject.GetComponent<RawImage>().texture = tex;
PdfObject.GetComponent<RawImage>().rectTransform.sizeDelta = new Vector2(288, 400);
PdfObject.GetComponent<RawImage>().enabled = true;
}
else
{
Debug.Log("reached eof");
}
}
The convert function is from a script called PDFConvert which I obtained from code project. Specifically How To Convert PDF to Image Using Ghostscript API.
From the GhostScript.Net documentation, take a look at the example code labeled: "Using GhostscriptRasterizer class". Specifically the following lines:
Image img = _rasterizer.GetPage(desired_x_dpi, desired_y_dpi, pageNumber);
img.Save(pageFilePath, ImageFormat.Png);
The Image class seems to be part of the System.Drawing package, and System.Drawing.Image has another Save method where the first parameter is a System.IO.Stream.
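A minimal sketch of that idea: save the image into a MemoryStream instead of a file and hand the resulting bytes to whatever needs them. The helper name is made up, and it assumes System.Drawing is actually available on your target platform, which I have not verified for the HoloLens.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class ImageBytes
{
    // Encodes the image as PNG entirely in memory and returns the bytes,
    // so nothing is ever written to disk.
    public static byte[] ToPngBytes(Image image)
    {
        using (var stream = new MemoryStream())
        {
            image.Save(stream, ImageFormat.Png);
            return stream.ToArray();
        }
    }
}
```

In the question's code, the byte array returned here would take the place of the File.ReadAllBytes result that is passed to tex.LoadImage.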
So I learned about System.IO.Packaging.ZipPackage in .NET. I am trying to use it to extract the thumbnail that is stored in Word documents if you 'Save as Thumbnail'. The general advice seems to be to use a third-party library instead, but does anyone perhaps know how to do this?
If you know you are only working with .docx files, you can read the thumbnail, if the document has one, using this code:
ZipPackage zip = ZipPackage.Open(@"C:\Test Documents\thumbnail.docx") as ZipPackage;
Uri thumbnailUri = new Uri("/docProps/thumbnail.emf", UriKind.Relative);
// GetPart throws if the part is missing, so check PartExists first
if (zip.PartExists(thumbnailUri))
{
    Image i = Image.FromStream(zip.GetPart(thumbnailUri).GetStream());
    pictureBox1.Image = i;
}
Does anyone know how to load an .AI file (Adobe Illustrator) and then rasterize/render the vectors into a Bitmap so I could generate e.g. a JPG or PNG from it?
I would like to produce thumbnails + render the big version with transparent background in PNG if possible.
Of course it's "possible" if you know the specs of the .AI format, but does anyone have any knowledge or code to share for a start? Or perhaps just a link to some components?
C# .NET please :o)
Code is most interesting as I know nothing about reading vector points and drawing splines.
Well, if Gregory is right that ai files are pdf-compatible, and you are okay with using GPL code, there is a project called GhostscriptSharp on github that is a .NET interface to the Ghostscript engine that can render PDF.
With the newer AI versions, you should be able to convert from PDF to image. There are plenty of libraries that do this that are cheap, so I would choose buy over build on this one. If you need to convert the older AI files, all bets are off. I am not sure what format they were in.
private void btnGetAIThumb_Click(object sender, EventArgs e)
{
Illustrator.Application app = new Illustrator.Application();
Illustrator.Document doc = app.Open(@"F:/AI_Prog/2009Calendar.ai", Illustrator.AiDocumentColorSpace.aiDocumentRGBColor, null);
doc.Export(@"F:/AI_Prog/2009Calendar.png", Illustrator.AiExportType.aiPNG24, null);
doc.Close(Illustrator.AiSaveOptions.aiDoNotSaveChanges);
doc = null; //
}
Instead of Illustrator.AiExportType.aiPNG24, you can choose the JPEG, GIF, Flash, SVG, or Photoshop export type.
I have tested this with Pdf2Png and it worked fine with both .pdf and .ai files.
But I don't know how it handles transparency.
string pdf_filename = @"C:\test.ai";
//string pdf_filename = @"C:\test.pdf";
string png_filename = "converted.png";
List<string> errors = cs_pdf_to_image.Pdf2Image.Convert(pdf_filename, png_filename);