I am using iTextSharp and the code listed below to extract text from a PDF.
But I have found that some rows give me incorrect results:
in excel - "11 3 11"
in Visual Studio - "11 \u0085\u0014\u0016\u001c 3 11"
in pdf - "11 £139 3 11"
One more example:
in excel - "2 45 1"
in Visual Studio - "2 \u0085\u0019\u0018\u001b 45 1"
in pdf - "2 £658 45 1"
After investigation I have found that the PDF file contains the font
french-script-mt-58fbba579ea99.ttf
using (PdfReader reader = new PdfReader(pfile.path))
{
    StringBuilder text = new StringBuilder();
    if (pagenum == 0)
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string page = PdfTextExtractor.GetTextFromPage(reader, i,
                new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy());
            string[] lines = page.Split('\n');
            allData.Add(lines);
            output = lines;
        }
    }
}
Questions:
How can I add a font that I have loaded to the extraction strategy?
Is it possible to create a mapping so I can convert \u0085\u0014\u0016\u001c to £139?
Have I maybe missed something with the encoding?
All the entries with the pound currency symbol "£" are drawn using fonts (named C2_0 and C2_2 respectively) that lack the information required for PDF text extraction as described in the PDF specification ISO 32000-1 section 9.10 "Extraction of Text Content": they use the encoding Identity-H (which does not imply any mapping to Unicode) and have no ToUnicode mapping.
The fonts used for the other entries either use a meaningful encoding (T1_0 and T1_1 use WinAnsiEncoding) or have a ToUnicode map (C2_1).
As text extraction in iText essentially follows the description in that section 9.10, iText cannot extract the actual text of these £ entries; instead it returns the raw glyph codes, just like copy&paste in Adobe Reader does.
Usually this means that one has to resort to OCR, either applied to the page as a whole to extract all the text, or applied individually to the glyphs of the fonts in question to build ToUnicode tables for those fonts and then extract the text as above.
In this case, though, the C2_0 and C2_2 embedded font programs themselves contain information mapping the contained glyphs to Unicode code points. Thus, one can also build ToUnicode tables making use of the information in those font programs. Such information can be read from the font programs using a font library which can handle TrueType fonts.
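If the affected fonts are stable across your files, another pragmatic option is to post-process the extracted text yourself, which is one way to approach the mapping question above. Below is a minimal sketch of such a replacement step: the codes for £, 1, 3, 9, 6, 5 and 8 are taken from the two examples in the question, while the remaining digit codes are an assumed contiguous range (\u0013 = '0' through \u001c = '9') that you would have to verify against the actual font:

using System.Collections.Generic;
using System.Text;

static string MapRawGlyphCodes(string extracted)
{
    // Raw glyph code -> character table. '\u0085' -> '£' and six of the digit
    // codes come from the question's examples; the rest assume a contiguous range.
    var glyphToChar = new Dictionary<char, char>
    {
        { '\u0085', '£' },
        { '\u0013', '0' }, { '\u0014', '1' }, { '\u0015', '2' }, { '\u0016', '3' },
        { '\u0017', '4' }, { '\u0018', '5' }, { '\u0019', '6' }, { '\u001a', '7' },
        { '\u001b', '8' }, { '\u001c', '9' }
    };

    var sb = new StringBuilder(extracted.Length);
    foreach (char c in extracted)
        sb.Append(glyphToChar.TryGetValue(c, out char mapped) ? mapped : c);
    return sb.ToString();
}

Applied to "11 \u0085\u0014\u0016\u001c 3 11" this yields "11 £139 3 11". Note that this is a workaround for files from this one producer, not a fix for the missing ToUnicode maps.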
Related
I want to read the content of some PDF files. I have just started; before getting into it, I just want to know the right approach to do so.
The iTextSharp reader may be helpful in that case, so I converted the PDF into text using:
public static string pdfText(string path)
{
    PdfReader reader = new PdfReader(path);
    string text = string.Empty;
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    return text;
}
I'm still wondering whether this approach seems OK, or whether I should instead convert this PDF into Excel and then read the content I want.
Professional thoughts will be appreciated.
With iText, you can also choose a specific strategy for extracting the text. But keep in mind that this is always a heuristic process.
PDF documents essentially contain only the instructions needed to render the document for a viewer. So there is no concept of "text", more something like "draw character A at position 420, 890".
In order for any text extraction to work, it needs to make some guesses about when two characters are close enough together that they should be concatenated, and when they should be kept apart.
Coincidentally, iText does this based on the width of a single space character in the font that is being used.
Keep in mind there could also be ActualText (a sort of text that is hidden in the document and used only during extraction; it makes it possible for the document to render a character like "œ" (the ligature version) while extracting it as "oe" (the non-ligature version)).
Depending on your input documents, you might want to look into the different implementations of ITextExtractionStrategy.
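For example, here is a minimal sketch (against iTextSharp 5, with a placeholder file path and page number) that runs both built-in strategies over the same page so you can compare their output:

using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static void CompareStrategies(string path, int pageNumber)
{
    using (var reader = new PdfReader(path))
    {
        // SimpleTextExtractionStrategy keeps the order of the drawing
        // instructions; LocationTextExtractionStrategy sorts the text chunks
        // by position first, which often helps with multi-column layouts.
        string simple = PdfTextExtractor.GetTextFromPage(
            reader, pageNumber, new SimpleTextExtractionStrategy());
        string byLocation = PdfTextExtractor.GetTextFromPage(
            reader, pageNumber, new LocationTextExtractionStrategy());

        Console.WriteLine(simple == byLocation
            ? "Both strategies agree on this page."
            : "The strategies differ on this page; inspect both outputs.");
    }
}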
My PDF contains a list of persons, and I'm currently looking for an ideal solution to extract these persons, but in some cases I ended up reading this sentence:
It is not possible because PDFs don't have a structure.
Now the thing is, there are tagged PDFs that show you the "structure" of your PDF. In my case I have a tagged PDF where each value of a person has its own row and each person is in a column. This means that there is, or should be, a simple way to parse through this "table" in my PDF to get each person's values, right?
So my question is: when tagged PDFs have a structure, how can I benefit from it so that I can read all the values I need?
(Small side question: do PDF-to-Excel applications use the tags from the PDF to create the Excel file?)
EDIT #1:
This is an example of the PDF file:
I already tried your suggested way, @Lara, with Syncfusion, but the problem is the string I receive:
John Peter Smith Smithstrasse 1 0101 Smithikon am See 010 010 01 01 020 020 02 02
It is impossible to use regex with such an output. The problem is you never know whether, in this case, "Peter" belongs to the first or the last name, and "Smith" could be part of the street. That's why I can't use it, and that's the reason I'm searching for a solution where I can use the tags in the PDF. Everything is nicely separated, so I only need a way to get the values from the tags.
EDIT #2:
As @Balasubramanian asked, here's a tagged PDF example:
http://www.sh.ch/fileadmin/Redaktoren/Dokumente/Aufsichtsbehoerde_ueber_das_Anwaltswesen/Verzeichnis_SH_Anwaelte.pdf
With Syncfusion, this PDF gives exactly the output I added in Edit #1.
I don't have any special requirements for the output, so it doesn't matter whether I receive the data as a JSON file, an array, or something similar. What is important, on the other hand, is that each value of each person is separated so that I can get at those values. But the big question is how I can do that. The tags must be saved somewhere in the PDF file (metadata?).
iTextSharp is an open-source .NET library you can use to read the contents of a PDF file. The code below does just that.
public static string GetTextFromAllPages(string pdfPath)
{
    using (PdfReader reader = new PdfReader(pdfPath))
    using (StringWriter output = new StringWriter())
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i,
                new SimpleTextExtractionStrategy()));
        return output.ToString();
    }
}
You can get the library from https://sourceforge.net/projects/itextsharp/ or from NuGet. Just download it, reference it in your application, and use the code snippet above; you will be able to extract the PDF into text.
Update:
Below is what I would suggest you try:
string pDFExtract = "John Peter Smith Smithstrasse 1 0101 Smithikon am See 010 010 01 01 020 020 02 02";
string[] arrpDFExtract = pDFExtract.Split(' ');
string Name = arrpDFExtract[0] + " " + arrpDFExtract[1];
Here, you have to find out the size of the string array and, based on that, build the conditions that give you the exact values you want. You have to analyse which array lengths occur and which values sit at which positions; after that, just use the approach above to get the data out of the PDF.
I have done a lot of document processing by building these kinds of heuristics, and everything works like a charm.
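As a rough illustration of that idea (all the index choices below are assumptions about this particular layout and would have to be adjusted for yours):

using System;

string pdfExtract = "John Peter Smith Smithstrasse 1 0101 Smithikon am See 010 010 01 01 020 020 02 02";
string[] parts = pdfExtract.Split(' ');

// Assumed layout for a 17-token record: three name tokens, street plus house
// number, postcode plus town, and two four-token phone numbers.
if (parts.Length == 17)
{
    string name = string.Join(" ", parts, 0, 3);    // "John Peter Smith"
    string street = string.Join(" ", parts, 3, 2);  // "Smithstrasse 1"
    string town = string.Join(" ", parts, 5, 4);    // "0101 Smithikon am See"
    string phone1 = string.Join(" ", parts, 9, 4);  // "010 010 01 01"
    string phone2 = string.Join(" ", parts, 13, 4); // "020 020 02 02"
    Console.WriteLine($"{name} | {street} | {town} | {phone1} | {phone2}");
}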
At present, Syncfusion does not support extracting text from tagged PDF documents. However, could you please provide the expected output structure for the tagged PDF document, as well as the PDF document from which you are trying to extract the text?
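For what it's worth, iTextSharp 5 ships a helper aimed at exactly this situation: TaggedPdfReaderTool walks the structure tree of a tagged PDF and writes it out as XML, which preserves the per-cell separation you are after. A minimal sketch (the file paths are placeholders):

using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static void DumpTagStructure(string pdfPath, string xmlPath)
{
    using (var reader = new PdfReader(pdfPath))
    using (var output = new FileStream(xmlPath, FileMode.Create))
    {
        // Walks the marked-content/structure tree of a tagged PDF and emits
        // it as XML; an untagged PDF has no such tree and yields little output.
        new TaggedPdfReaderTool().ConvertToXml(reader, output);
    }
}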
I have a PDF file which I have a problem extracting text from, using the iTextSharp API.
Some of the numbers are replaced by other numbers or by slashes: "//".
The PDF file originally came from MS Word and was exported to PDF using "Save as PDF", and I have to work with the PDF file, not the Doc.
You can see the problem very clearly when you try to copy and paste some numbers from the file
For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.
You can also see the problem with the date string: 11/4/2016 turns into // // 11110
When I print the PDF file using the Adobe PDF converter printer on my computer, it gets fixed, but I need to fix it automatically, using C# for example
Thanks
The file is shared here :
https://www.dropbox.com/s/j6w9350oyit0od8/OnePageGili.pdf?dl=0
In a nutshell
iTextSharp text extraction results exactly reflect what the PDF claims the characters in question mean. Thus, text extraction as recommended by the PDF specification (which relies on this information) will always return this result.
The embedded fonts contain different information. Thus, text extraction methods that disbelieve this information may return more satisfying results.
In more detail
First of all, you say
I have a PDF file which I have a problem extracting text from, using the iTextSharp API.
and so make it sound like an iTextSharp-specific issue. Later, though, you state
You can see the problem very clearly when you try to copy and paste some numbers from the file
If you can also see the issue with copy&paste, it is not an iTextSharp-specific issue but either an issue of multiple PDF processors (including the viewer you copied and pasted with) or simply an issue of the PDF you have.
As it turns out, it is the latter: you have a PDF that lies about its contents.
For example, let's look at the text you pointed out:
For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.
Inspecting the PDF page content stream, you'll find those six digits generated by these instructions:
/F3 11.04 Tf
...
[<00150013>-4<0014>8<00160016>-4<0016>] TJ
I.e. the font F3 is selected (which uses Identity-H encoding, so each glyph is represented by two bytes) and the glyphs drawn are from left to right:
0015
0013
0014
0016
0016
0016
The ToUnicode mapping of the font F3 in your PDF now claims:
1 beginbfrange
<0013> <0016> [<0033> <0033> <0033> <0032>]
endbfrange
I.e. it says
glyph 0013 represents Unicode codepoint 0033, the digit 3
glyph 0014 represents Unicode codepoint 0033, the digit 3
glyph 0015 represents Unicode codepoint 0033, the digit 3
glyph 0016 represents Unicode codepoint 0032, the digit 2
So the string of glyphs drawn using the instructions above represent 333222 according to the ToUnicode map.
The PDF specification presents the ToUnicode mapping as the highest priority method to map a character code to a Unicode value. Thus, a text extractor working according to the specification will return 333222 here.
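If you want to run this kind of check on a suspect file yourself, a hedged sketch using the iTextSharp 5 low-level API (the method and its page/resource handling are my own, not part of the original answer) that dumps the ToUnicode CMap of every font on a page could look like this:

using System;
using System.Text;
using iTextSharp.text.pdf;

static void DumpToUnicodeMaps(string path, int pageNumber)
{
    using (var reader = new PdfReader(path))
    {
        PdfDictionary page = reader.GetPageN(pageNumber);
        PdfDictionary fonts = page.GetAsDict(PdfName.RESOURCES)?.GetAsDict(PdfName.FONT);
        if (fonts == null)
            return;

        foreach (PdfName name in fonts.Keys)
        {
            PdfDictionary font = fonts.GetAsDict(name);
            // Resolve the (usually indirect) /ToUnicode stream, if any.
            var toUnicode = PdfReader.GetPdfObject(font.Get(PdfName.TOUNICODE)) as PRStream;
            if (toUnicode == null)
                Console.WriteLine(name + ": no ToUnicode map");
            else
                Console.WriteLine(name + ":\n" +
                    Encoding.ASCII.GetString(PdfReader.GetStreamBytes(toUnicode)));
        }
    }
}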
I am using the following code to extract text from the first page of PDF files with iTextSharp:
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default,
            Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem to be quite similar, but one works and the other does not.
I guess the fact that their Producer tags are not the same is a clue here.
Another clue is that this function works for any other page of the PDF without a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward, though: the number following the MT prefix is the ASCII code of the glyph used.
The first document works because the mapping is not changed at all and iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means, e.g., that the character code 2 should map to the glyph named MT76, which is an unofficial/private glyph name that iText doesn't know, so it has no information beyond the character code 2 and will use that code in the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it is impossible to get the correct text out of this document. In any case, it is nowhere defined that a glyph name consisting of MT followed by an integer maps to that ASCII value; that is simply an accident, or a convention of the font designer/creation tool, wherever it came from.
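A sketch of that logic, assuming iTextSharp 5 and a font dictionary you have already located (this is my illustration of the heuristic, not code from the answer):

using System.Collections.Generic;
using iTextSharp.text.pdf;

static Dictionary<int, char> BuildMtDifferencesMap(PdfDictionary font)
{
    var map = new Dictionary<int, char>();
    PdfArray differences = font.GetAsDict(PdfName.ENCODING)?.GetAsArray(PdfName.DIFFERENCES);
    if (differences == null)
        return map;

    int code = 0;
    for (int i = 0; i < differences.Size; i++)
    {
        PdfObject item = differences.GetDirectObject(i);
        if (item.IsNumber())
        {
            // A number in /Differences starts a new run of character codes.
            code = ((PdfNumber)item).IntValue;
        }
        else if (item.IsName())
        {
            // Interpret "MT<number>" as the ASCII code of the glyph, e.g.
            // MT76 -> 'L'. This convention is NOT defined anywhere; it just
            // happens to hold for this producer.
            string glyphName = PdfName.DecodeName(item.ToString());
            if (glyphName.StartsWith("MT") && int.TryParse(glyphName.Substring(2), out int ascii))
                map[code] = (char)ascii;
            code++;
        }
    }
    return map;
}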
The second PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded for some reason during the generation of this PDF. If you have a lot of files like this, then a working approach could be the following (see the sketch after the list):
detect broken pages while extracting text, by searching for some phrase that should appear on every page, for example "service"
process these pages separately using OCR, with tools like Tesseract with a .NET wrapper
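A minimal sketch of the detection step (the phrase and the handling of the flagged pages are placeholders for your own logic):

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static List<int> FindBrokenPages(string path, string expectedPhrase)
{
    var broken = new List<int>();
    using (var reader = new PdfReader(path))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Pages whose extracted text lacks the expected phrase are
            // candidates for separate OCR processing (e.g. with Tesseract).
            string text = PdfTextExtractor.GetTextFromPage(reader, i);
            if (!text.Contains(expectedPhrase))
                broken.Add(i);
        }
    }
    return broken;
}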
How to read the texts from a pdf file created by Adobe Distiller tool?
I'm currently using the ABCpdf tool and I have a code sample to read PDF contents, but it can only read the text from PDFs which have been created by the Adobe PDF Library:
public string ExtractTextsFromAllPages(string pdfFileName)
{
    var sb = new StringBuilder();
    using (var doc = new Doc())
    {
        doc.Read(pdfFileName);
        for (var currentPageNumber = 1; currentPageNumber <= doc.PageCount; currentPageNumber++)
        {
            doc.PageNumber = currentPageNumber;
            sb.Append(doc.GetText("Text"));
        }
    }
    return sb.ToString();
}
I have other PDF files which have been created by Adobe Distiller, and the above code doesn't work for them; I mean it returns the strange data below, which seems to be encoded:
\0\a\b\0\t\n\0\r\n\0\a\b\t\n\n\b\v\f\0\t\r\f\b\0\r\0\r\n\v\b\v\f\f\n\r\0\r\0\0\0\b\r\n\0\a\r\0\0\b\r\b\b\t\n\r\0\b\r\n\t\b\v\n\b\v\v\0\a\b\r\n\r\n\v\r\0\b\b\b\v\r\0\r\n\v\f\r\f\f\r\n !\"\"\v#\t $ %&$% $'\v\"% \0( )% ! !\"\"'*$'\r\n\t $ %&$% $'\v\"% \0( \r\n\f\f\f\f\b\f\f\f\f\a \b\b\f\f\f!\"\r\n\f\a#$\f\f\f\b\f\f\a%\a \b\b\f\a\a&\a\a' \b\a\b\r\n(\f)\f)
How can I read the text from a PDF file created by the Adobe Distiller tool?
It should be said that I can open such PDF files easily in my browser, just like other PDFs.
Thanks,
I've had similar problems working with PDFs. I haven't used ABCpdf, but you may want to check out iTextSharp; I've created a tool to extract strings from PDF files with it before. However, you're still going to have a problem if the font is embedded. If you are able to switch to iTextSharp, here is a question on SO that covers the topic:
Reading PDF content with itextsharp dll in VB.NET or C#
The first thing to try is to copy and paste text from the PDF using Adobe Reader or any other PDF viewer.
If you cannot copy and paste text at all, then the text extraction feature might be disabled via permissions in the file. Permissions are usually ignored by PDF libraries, however, and do not affect text extraction.
If you can copy and paste text from the file but it looks garbled/incorrect, then the PDF does not contain some of the information required for text extraction to be performed properly. Such files will still be displayed properly.
Adobe Distiller produces files without the information required for proper text extraction if it is configured to produce the smallest files possible.
EDIT:
If you need to discriminate garbage characters from meaningful text, then you should implement an algorithm that measures the readability of the text.
Some links for that:
Calculating entropy of a string
Is there any way to detect strings like putjbtghguhjjjanika?
this answer about text scoring systems
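For instance, a toy scorer along the lines of those links computes the Shannon entropy of the extracted string; the thresholds are guesses you would calibrate on your own data:

using System;
using System.Linq;

static double ShannonEntropy(string s)
{
    // Sum of -p * log2(p) over the character frequencies of the string.
    return s.GroupBy(c => c)
            .Select(g => (double)g.Count() / s.Length)
            .Sum(p => -p * Math.Log(p, 2));
}

// Ordinary English prose usually lands around 4 bits per character; output
// dominated by a few repeated control characters, like the dump quoted above,
// tends to score noticeably lower. Pick cut-offs empirically.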
So the fact that you just do not see readable text might be caused by a strange encoding being used.
We normally assume that an ASCII character set is used for encoding. Imagine the sentence "Hello world" (in ASCII-to-hex this would be: 48 65 6C 6C 6F 20 77 6F 72 6C 64).
In a straightforward reading we would assume that 48 stands for "H", 65 for "e", and so on.
But imagine an engineer doing his own subsetting of fonts: to encode "H", the first letter to appear, he uses 00; for "e", 01; and so on. The sentence would then be encoded as 00 01 02 02 03 04 05 03 06 02 07.
This results in a number of unreadable characters, just like ancient secret scripts which encode and decode via a secret table.
The answer to your question is simply: you can read text generated by Distiller only when you know the right encoding vector for reassembling it.
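To make that concrete, here is a toy demonstration of such a "first occurrence" subset encoding; it only shows why the bytes are unreadable without the table, it does not decode real Distiller output:

using System;
using System.Collections.Generic;
using System.Linq;

var table = new Dictionary<char, byte>();
byte[] encoded = "Hello world".Select(c =>
{
    // Each character gets the next free code on its first appearance.
    if (!table.ContainsKey(c))
        table[c] = (byte)table.Count;
    return table[c];
}).ToArray();

// Prints "00 01 02 02 03 04 05 03 06 02 07", as in the example above.
Console.WriteLine(string.Join(" ", encoded.Select(b => b.ToString("X2"))));

// Decoding is only possible with the inverse of the encoding table.
var reverse = table.ToDictionary(kv => kv.Value, kv => kv.Key);
Console.WriteLine(new string(encoded.Select(b => reverse[b]).ToArray())); // "Hello world"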
ABCpdf can extract text from all PDFs that contain valid text. It infers spaces, de-hyphenates, clips to an area of interest and many other things that are required to ensure that the text you get is the same as the text you see.
However all this assumes that the PDF is valid - that it conforms to the PDF spec - that it is not corrupt.
The most common cause of text extraction problems is corrupt Identity-encoded fonts. Identity-encoded fonts are referenced by glyph rather than by character code. The fonts include a ToUnicode map to allow the glyph IDs to be converted to characters.
However, we sometimes see documents from which this entry has been removed. This means that the only way to identify the characters would be to OCR the document.
You can see this yourself if you open the documents in Acrobat and copy the text. When you paste the copied text into an application such as Notepad, you will be able to see that it is wrong. ABCpdf just sees the same as Acrobat.
The fact that these documents have been so thoroughly and effectively mangled may be intentional. It is certainly a good way to ensure no one can copy your text.
I wrote the ABCpdf .NET text extraction so I should know. :-)