I'm trying to implement this feature in my application.
Just like in windows, I type into the search box and if the File contents is checked in the settings, than no matter its a text file or pdf/word file, the search returns me the file that contains the string in the search box.
So, I already have come up with a application for files and folder search which works pretty good for the file content search for text files and word file. I'm using interop word for word files.
I know, I can use iTextSharp or some other 3rd party stuff to do this for pdf files. But that doesn't satisfy me. I just wanted to find out how windows does it? Or if anyone else has done it in a different way? I just didn't wanted to use any 3rd party tool but doesn't mean I can't. I just wanted to keep my application light and not dump it with many tools.
As far as I know, it is not possible to search for pdf content with out having 3rd party tool, software or utility installed. So there are pdfgrep for example. But if you manage to any way make a c# program, I would include a third party library to do the job.
I made a solution for some thing similar in this answer Read specific value based on label name from PDF in C#, with a bit of tweak you can have what you are looking for. The only thing is with PdfClown, it is for .net framework, but at the other hand it is open source, free and has no limitation. But if you are looking for .net core you might find some free (with limitation) or paid pdf libraries.
As you request in the comment here is a sample solution to find text in side pdf pages. I have left comments inside the code:
//The found content
private List<string> _contentList;
//Search for content in a given pdf file
public bool SearchPdf(FileInfo fileInfo, string word)
{
_contentList = new List<string>();
ExtractPages(fileInfo.FullName);
var content = string.Join(" ", _contentList);
return content.Contains(word);
}
//Extract content for each page of given pdf file
private void ExtractPages(string filePath)
{
using (var file = new File(filePath))
{
var document = file.Document;
foreach (var page in document.Pages)
{
Extract(new ContentScanner(page));
}
}
}
//Extract content of pdf page and put the found result inside _contentList
private void Extract(ContentScanner level)
{
if (level == null)
return;
while (level.MoveNext())
{
var content = level.Current;
switch (content)
{
case ShowText text:
{
var font = level.State.Font;
_contentList.Add(font.Decode(text.Text));
break;
}
case Text _:
case ContainerObject _:
Extract(level.ChildLevel);
break;
}
}
}
Now lets do quick test, so we assume all your invoice are in c:\temp folder:
static void Main(string[] args)
{
var program = new SearchPdfContent();
DirectoryInfo d = new DirectoryInfo(#"c:\temp");
FileInfo[] Files = d.GetFiles("*.pdf");
var word = "Sushi";
foreach (FileInfo file in Files)
{
var found = program.SearchPdf(file, word);
if (found)
{
Console.WriteLine($"{file.FullName} contains word {word}");
}
}
}
In my case I have for example word sushi inside the invoice:
c:\temp\invoice0001.pdf contains word Sushi
All that said, this is an example of solution. You can take it from here bring it to the next level. Enjoy your day.
I leave some links of what I have searched for:
Searching for files with specific file content
How to search contents of multiple pdf files?
Windows search PDF contents
https://superuser.com/questions/402673/how-to-search-inside-pdfs-with-windows-search
If your application is meant to search for file contents from binaries stored into your DB, the SQL Full-Text search feature can achieve this for you.
You just need to make sure that you have the required IFilters installed and create a full-text index on the table where the binary data is stored.
But if your application must access a folder in real time and search for file contents, you will probably need a third party tool just like #maytham-ɯɐɥʇʎɐɯ said.
Related
Is there an API to use Onenote OCR capabilities to recognise text in images automatically?
If you have OneNote client on the same machine as your program will execute you can create a page in OneNote and insert the image through the COM API. Then you can read the page in XML format which will include the OCR'ed text.
You want to use
Application.CreateNewPage to create a page
Application.UpdatePageContent to insert the image
Application.GetPageContent to read the page content and look for OCRData and OCRText elements in the XML.
OneNote COM API is documented here: http://msdn.microsoft.com/en-us/library/office/jj680120(v=office.15).aspx
When you put an image on a page in OneNote through the API, any images will automatically be OCR'd. The user will then be able to search any text in the images in OneNote. However, you cannot pull the image back and read the OCR'd text at this point.
If this is a feature that interests you, I invite you to go to our UserVoice site and submit this idea: http://onenote.uservoice.com/forums/245490-onenote-developers
update: vote on the idea: https://onenote.uservoice.com/forums/245490-onenote-developer-apis/suggestions/10671321-make-ocr-available-in-the-c-api
-- James
There is a really good sample of how to do this here:
http://www.journeyofcode.com/free-easy-ocr-c-using-onenote/
The main bit of code is:
private string RecognizeIntern(Image image)
{
this._page.Reload();
this._page.Clear();
this._page.AddImage(image);
this._page.Save();
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
return null;
}
As I will be deleting my blog (which was mentioned in another post), I thought I should add the content here for future reference:
Usage
Let's start by taking a look on how to use the component: The class OnenoteOcrEngine implements the core functionality and implements the interface IOcrEngine which provides a single method:
public interface IOcrEngine
{
string Recognize(Image image);
}
Excluding any error handling, it can be used in a way similar to the following one:
using (var ocrEngine = new OnenoteOcrEngine())
using (var image = Image.FromFile(imagePath))
{
var text = ocrEngine.Recognize(image);
if (text == null)
Console.WriteLine("nothing recognized");
else
Console.WriteLine("Recognized: " + text);
}
Implementation
The implementation is far less straight-forward. Prior to Office 2010, Microsoft Office Document Imaging (MODI) was available for OCR. Unfortunately, this no longer is the case. Further research confirmed that OneNote's OCR functionality is not directly exposed in form of an API, but the suggestions were made to manually parse OneNote documents for the text (see Is it possible to do OCR on a Tiff image using the OneNote interop API? or need a document to extract text from image using onenote Interop?. And that's exactly what I did:
Connect to OneNote using COM interop
Create a temporary page containing the image to process
Show the temporary page (important because OneNote won't perform the OCR otherwise)
Poll for an OCRData tag containing an OCRText tag in the XML code of the page.
Delete the temporary page
Challenges included the parsing of the XML code for which I decided to use LINQ to XML. For example, inserting the image was done using the following code:
private XElement CreateImageTag(Image image)
{
var img = new XElement(XName.Get("Image", OneNoteNamespace));
var data = new XElement(XName.Get("Data", OneNoteNamespace));
data.Value = this.ToBase64(image);
img.Add(data);
return img;
}
private string ToBase64(Image image)
{
using (var memoryStream = new MemoryStream())
{
image.Save(memoryStream, ImageFormat.Png);
var binary = memoryStream.ToArray();
return Convert.ToBase64String(binary);
}
}
Note the usage of XName.Get("Image", OneNoteNamespace) (where OneNoteNamespace is the constant "http://schemas.microsoft.com/office/onenote/2013/onenote" ) for creating the element with the correct namespace and the method ToBase64 which serializes an GDI-image from memory into the Base64 format. Unfortunately, polling (See What is wrong with polling? for a discussion of the topic) in combination with a timeout is necessary to determine whether the detection process has completed successfully:
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
Results
The results are not perfect. Considering the quality of the images, however, they are more than satisfactory in my opinion. I could successfully use the component in my project. One issue remains which is very annoying: Sometimes, OneNote crashes during the process. Most of the times, a simple restart will fix this issue, but trying to recognise text from some images reproducibly crashes OneNote.
Code / Download
Check out the code at GitHub
not sure about OCR, but the documentation site for onenote API is this
http://msdn.microsoft.com/en-us/library/office/dn575425.aspx#sectionSection1
I am currently building an Excel file by hand using OpenXml. I'm in the process of adding the sheets, however, I have come across an issue. I have a loop that adds the names of each sheet in but once it runs and I try to open the file, I get the following message:
"We found a problem with some content in 'FileName.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, Click Yes."
I think the issue might be due that I am adding in the name of each sheet using a string variable. When I take it out and add something else, it works. Below is my code where I am looping through and adding my sheets.
//Technology Areas
foreach (DataRow dr in techAreaDS.Rows)
{
var data = dr["TechAreaName"].ToString().Split('-');
var techArea = data[2].TrimStart();
var techAreaSheet = new Sheet { Id = workbookPart.GetIdOfPart(worksheetPart),
SheetId = sheetId, Name = techArea };
sheets.Append(techAreaSheet);
sheetId++;
}
I've seen people mention it is an issue with cells having strings that can be converted into strings, but in this case, the string will always be a string. Any help would be appreciated.
EDIT: I've figured out the problem. The issue is the Name property has a Max Length of 31. One of my items has a 42 length, hence the error. I did find a cool set of code to validate my OpenXml. Link.
UPDATE:
Oddly enough, someone thinks this question was about finding some code to help validate what I was doing. It was not... The question is clear: why was I receiving an error when trying to name sheets. I was not asking for validation code, though I found some.
I do ask that if you wish to help, please read the question versus assume what I was asking, and if you don't know what I wish to have answered, ask...
In order to find out the issue(s) causing this error, you need to validate the generated document.
Besides using the built in validation method as described here, which doesn't show you all issues as I found out, I suggest that you download and install Microsoft's Open XML SDK 2.5 for Microsoft Office.
It contains Microsoft's Open XML SDK 2.5 Productivity Tool, which is very helpful here:
Create a copy of the damaged XLSX file, and apply the fixes as Microsoft Excel is suggesting (suppose you have the files FileName_corrupt.xlsx and FileName_fixed.xlsx
Then, run Microsoft's Open XML SDK 2.5 Productivity Tool, open FileName_corrupt.xlsx, select "Compare Files" and specify the 2nd file FileName_fixed.xlsx. This allows you to compare the XML structure of both files.
Let Microsoft's Open XML SDK 2.5 Productivity Tool generate C# code from both files: Open them first, then right-click on the root level and select "Reflect Code". This will create C# code which allows you to generate the same file. Save both C# code versions (i.e. FileName_corrupt.cs and FileName_fixed.cs)
Now you can compare the differences via Visual Studio: Either use
devenv.exe /diff FileName_corrupt.cs FileName_fixed.cs
to compare them, or use the batch file I've created to launch the VS compare - this is a hidden feature in Visual Studio, it allows to compare 2 local files being not part of TFS.
This way you should be able to work out the differences and allow you to fix your code.
NOTE: For a first validation, I do suggest to use the validation code. Only if it still fails, use the steps above. For validation you can use
public static string ValidateOpenXmlDocument(OpenXmlPackage pXmlDoc, bool throwExceptionOnValidationFail=false)
{
using (var docToValidate = pXmlDoc)
{
var validator = new DocumentFormat.OpenXml.Validation.OpenXmlValidator();
var validationErrors = validator.Validate(docToValidate).ToList();
var errors = new System.Text.StringBuilder();
if (validationErrors.Any())
{
var errorMessage = string.Format("ValidateOpenXmlDocument: {0} validation error(s) with document", validationErrors.Count);
errors.AppendLine(errorMessage);
errors.AppendLine();
}
foreach (var error in validationErrors)
{
errors.AppendLine("Description: " + error.Description);
errors.AppendLine("ErrorType: " + error.ErrorType);
errors.AppendLine("Node: " + error.Node);
errors.AppendLine("Path: " + error.Path.XPath);
errors.AppendLine("Part: " + error.Part.Uri);
if (error.RelatedNode != null)
{
errors.AppendLine("Related Node: " + error.RelatedNode);
errors.AppendLine("Related Node Inner Text: " + error.RelatedNode.InnerText);
}
errors.AppendLine();
errors.AppendLine("==============================");
errors.AppendLine();
}
if (validationErrors.Any() && throwExceptionOnValidationFail)
{
throw new Exception(errors.ToString());
}
if (errors.Length > 0)
{
System.Diagnostics.Debug.WriteLine(errors.ToString());
}
return errors.ToString();
}
along with
public static void ValidateExcelDocument(string fileName)
{
using (var xlsx = SpreadsheetDocument.Open(fileName, true))
{
ValidateOpenXmlDocument(xlsx);
}
}
With a slight modification, you can easily use the code above for Microsoft Word validation too:
public static void ValidateWordDocument(string fileName)
{
using (var docx = WordprocessingDocument.Open(fileName, true))
{
ValidateOpenXmlDocument(docx);
}
}
I've figured out the problem. The issue is the Name property has a Max Length of 31 characters. The text I'm trying to use sometimes exceeds that limit (one has 42 characters). I also found a pretty cool set of code to validate my Open Xml to find out what the specific issue is. Link
Is there an API to use Onenote OCR capabilities to recognise text in images automatically?
If you have OneNote client on the same machine as your program will execute you can create a page in OneNote and insert the image through the COM API. Then you can read the page in XML format which will include the OCR'ed text.
You want to use
Application.CreateNewPage to create a page
Application.UpdatePageContent to insert the image
Application.GetPageContent to read the page content and look for OCRData and OCRText elements in the XML.
OneNote COM API is documented here: http://msdn.microsoft.com/en-us/library/office/jj680120(v=office.15).aspx
When you put an image on a page in OneNote through the API, any images will automatically be OCR'd. The user will then be able to search any text in the images in OneNote. However, you cannot pull the image back and read the OCR'd text at this point.
If this is a feature that interests you, I invite you to go to our UserVoice site and submit this idea: http://onenote.uservoice.com/forums/245490-onenote-developers
update: vote on the idea: https://onenote.uservoice.com/forums/245490-onenote-developer-apis/suggestions/10671321-make-ocr-available-in-the-c-api
-- James
There is a really good sample of how to do this here:
http://www.journeyofcode.com/free-easy-ocr-c-using-onenote/
The main bit of code is:
private string RecognizeIntern(Image image)
{
this._page.Reload();
this._page.Clear();
this._page.AddImage(image);
this._page.Save();
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
return null;
}
As I will be deleting my blog (which was mentioned in another post), I thought I should add the content here for future reference:
Usage
Let's start by taking a look on how to use the component: The class OnenoteOcrEngine implements the core functionality and implements the interface IOcrEngine which provides a single method:
public interface IOcrEngine
{
string Recognize(Image image);
}
Excluding any error handling, it can be used in a way similar to the following one:
using (var ocrEngine = new OnenoteOcrEngine())
using (var image = Image.FromFile(imagePath))
{
var text = ocrEngine.Recognize(image);
if (text == null)
Console.WriteLine("nothing recognized");
else
Console.WriteLine("Recognized: " + text);
}
Implementation
The implementation is far less straight-forward. Prior to Office 2010, Microsoft Office Document Imaging (MODI) was available for OCR. Unfortunately, this no longer is the case. Further research confirmed that OneNote's OCR functionality is not directly exposed in form of an API, but the suggestions were made to manually parse OneNote documents for the text (see Is it possible to do OCR on a Tiff image using the OneNote interop API? or need a document to extract text from image using onenote Interop?. And that's exactly what I did:
Connect to OneNote using COM interop
Create a temporary page containing the image to process
Show the temporary page (important because OneNote won't perform the OCR otherwise)
Poll for an OCRData tag containing an OCRText tag in the XML code of the page.
Delete the temporary page
Challenges included the parsing of the XML code for which I decided to use LINQ to XML. For example, inserting the image was done using the following code:
private XElement CreateImageTag(Image image)
{
var img = new XElement(XName.Get("Image", OneNoteNamespace));
var data = new XElement(XName.Get("Data", OneNoteNamespace));
data.Value = this.ToBase64(image);
img.Add(data);
return img;
}
private string ToBase64(Image image)
{
using (var memoryStream = new MemoryStream())
{
image.Save(memoryStream, ImageFormat.Png);
var binary = memoryStream.ToArray();
return Convert.ToBase64String(binary);
}
}
Note the usage of XName.Get("Image", OneNoteNamespace) (where OneNoteNamespace is the constant "http://schemas.microsoft.com/office/onenote/2013/onenote" ) for creating the element with the correct namespace and the method ToBase64 which serializes an GDI-image from memory into the Base64 format. Unfortunately, polling (See What is wrong with polling? for a discussion of the topic) in combination with a timeout is necessary to determine whether the detection process has completed successfully:
int total = 0;
do
{
Thread.Sleep(PollInterval);
this._page.Reload();
string result = this._page.ReadOcrText();
if (result != null)
return result;
} while (total++ < PollAttempts);
Results
The results are not perfect. Considering the quality of the images, however, they are more than satisfactory in my opinion. I could successfully use the component in my project. One issue remains which is very annoying: Sometimes, OneNote crashes during the process. Most of the times, a simple restart will fix this issue, but trying to recognise text from some images reproducibly crashes OneNote.
Code / Download
Check out the code at GitHub
not sure about OCR, but the documentation site for onenote API is this
http://msdn.microsoft.com/en-us/library/office/dn575425.aspx#sectionSection1
I need to extract data from .PDF files and load it in to SQL 2008.
Can any one tell me how to proceed??
Here is an example of how to use iTextSharp to extract text data from a PDF. You'll have to fiddle with it some to make it do exactly what you want, I think it's a good outline. You can see how the StringBuilder is being used to store the text, but you could easily change that to use SQL.
static void Main(string[] args)
{
PdfReader reader = new PdfReader(#"c:\test.pdf");
StringBuilder builder = new StringBuilder();
for (int x = 1; x <= reader.NumberOfPages; x++)
{
PdfDictionary page = reader.GetPageN(x);
IRenderListener listener = new SBTextRenderer(builder);
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
PdfDictionary pageDic = reader.GetPageN(x);
PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
processor.ProcessContent(ContentByteUtils.GetContentBytesForPage(reader, x), resourcesDic);
}
}
public class SBTextRenderer : IRenderListener
{
private StringBuilder _builder;
public SBTextRenderer(StringBuilder builder)
{
_builder = builder;
}
#region IRenderListener Members
public void BeginTextBlock()
{
}
public void EndTextBlock()
{
}
public void RenderImage(ImageRenderInfo renderInfo)
{
}
public void RenderText(TextRenderInfo renderInfo)
{
_builder.Append(renderInfo.GetText());
}
#endregion
}
Imagine if you asked this question. How can I load data from arbitrary text files into a SQL table. The challenge isn't opening the text file and reading it, its getting meaningful data out of the files automatically.
So you can use either iText or pdfSharp to read the PDF files, but its the getting meaningful data out that's going to be the challenge.
what you need to do is to use a tool to extract the text from PDF first and then read the file into a binary reader .. then store it into your database .. for extracting the text there are several tools to use. the first to mention are:
iTextsharp which is a Library that can be downloaded and used to do extensive work and in-depth edits and builds when dealing with PDF documents, and there are a lot of examples available online along with a full book that explains the ins and outs of itThe second tool is Adobe PDF iFilter which is a tool from adobe to deal with PDF modifications and manipulation.
Also Foxit iFilter also is a similar assembly that can do just what u r asking for!
PDF Boxwill also serve you!
these are the most well known and well documented ones!
check the following examples:
try the following examples on code project:
Parsing PDF files in .NET using PDFBox and IKVM.NET.A simple class to extract plain text from PDF documents with ITextSharpUsing the IFilter interface to extract text from various document types A parser for PDF Forms written in C#.NET
These do the job and they ain't hard to understand. Hope they help you :-)
A final note: as for me, i would iTextSharp as it's the most well documented library with most available examples.
If you mean metadata, try this question (first answer)
Read/Modify PDF Metadata using iTextSharp
You'll have to do the database stuff yourself though.
I'm developing a solution that allows people to upload a DOCX file as a template. This template is used for generating Word documents with database info.
What I would like to do is once a template gets uploaded, to check it for errors. (I don't want my parser crashing when a template is used.)
I've seen the question about checking a signature of a Word template, but that isn't enough to validate the integrity of the file. Of course it is possible to try to unzip the file, validate the XML in there, and so on, but this is rather CPU intensive and I'd like a different approach if there is one.
Are there any solutions that are part of the Open XML SDK or other standard approaches to this? Any ideas are apreciated.
in C# off the MSDN site
public static bool IsDocumentValid(WordprocessingDocument mydoc)
{
OpenXmlValidator validator = new OpenXmlValidator();
var errors = validator.Validate(mydoc);
foreach (ValidationErrorInfo error in errors)
Debug.Write(error.Description);
return (errors.Count() == 0);
}