Need some help
I have a PDF, and I just need to read it and store its content in a DB.
For some reason, I couldn't find a simple example of doing that using iText 7.
Another thing: the content is in Hebrew. At first I used iTextSharp, but the content I got is in reverse order, so I have two options:
1. fix the reversing code
2. find more conventional code, maybe in iText 7, which doesn't have this problem.
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
    PdfReader pdfReader = new PdfReader(fileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        // Re-encode the extracted text from the system default encoding to UTF-8.
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        // Reverse the right-to-left runs so the Hebrew reads in logical order.
        var res = ConvertToHebrew(currentText);
        text.Append(res);
    }
    pdfReader.Close();
}
The ConvertToHebrew function doesn't work perfectly for me, so I'm hoping to find something that works without me having to patch things up.
If a PDF document that contains right-to-left scripts like Hebrew or Arabic is properly formed, then the content stream of the page will contain /ReversedChars instructions that wrap right-to-left text snippets. iText 7 is able to deal with such instructions and extract right-to-left text correctly from properly formed documents.
This functionality is implemented as part of LocationTextExtractionStrategy. To use it, you basically have to replace SimpleTextExtractionStrategy with LocationTextExtractionStrategy in your code. You should also call SetRightToLeftRunDirection(true) on the new LocationTextExtractionStrategy instance, but you should notice a difference in the result even without this flag.
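A minimal sketch of that suggestion, assuming the itext7 NuGet package (note that in iText 7 the reader is wrapped in a PdfDocument, unlike the iTextSharp code above):

using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

static string ExtractRightToLeftText(string fileName)
{
    var text = new StringBuilder();
    using (var pdfDoc = new PdfDocument(new PdfReader(fileName)))
    {
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            var strategy = new LocationTextExtractionStrategy();
            // Honor /ReversedChars runs so Hebrew comes out in logical order.
            strategy.SetRightToLeftRunDirection(true);
            text.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }
    }
    return text.ToString();
}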
That being said, if the document was formed improperly (or not completely properly, depending on how you look at it) and does not contain /ReversedChars instructions, then iText 7 cannot help you at the moment. At some point, extraction of right-to-left scripts even from not-completely-proper PDFs will likely be possible with iText 7, but that is something for the future.
Related
I want to read the content of some PDF files. I'm just getting started, and before diving in I want to know the right approach to do so.
The iTextSharp reader may be helpful in that case, so I converted the PDF into text using:
public static string pdfText(string path)
{
    PdfReader reader = new PdfReader(path);
    var text = new StringBuilder();
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Append the extracted text of each page.
        text.Append(PdfTextExtractor.GetTextFromPage(reader, page));
    }
    reader.Close();
    return text.ToString();
}
I'm still wondering whether this approach is OK, or whether I should convert the PDF into Excel and then read the content I want from there instead.
Thoughts from professionals would be appreciated.
With iText, you can also choose a specific strategy for extracting the text. But keep in mind that this is always a heuristic process.
PDF documents essentially contain only the instructions needed to render the document in a viewer. So there is no concept of "text", more something like "draw character A at position 420, 890".
For any text extraction to work, it needs to guess when two characters are close enough together that they should be concatenated, and when they should be kept apart.
Concretely, iText does this based on the width of a single space character in the font that is being used.
Keep in mind there can also be ActualText (a sort of text that is hidden in the document and used only during extraction; it makes it possible for the document to render a character like "œ" (the ligature version) while extracting it as "oe" (the non-ligature version)).
Depending on your input documents, you might want to look into the different implementations of ITextExtractionStrategy.
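For instance, a quick way to compare the two built-in iTextSharp strategies on the same page (path is a placeholder for your file):

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

using (var reader = new PdfReader(path))
{
    // Emits text in content-stream order; fast, but the order can be surprising.
    string raw = PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
    // Sorts snippets by position first; usually closer to visual reading order.
    string ordered = PdfTextExtractor.GetTextFromPage(reader, 1, new LocationTextExtractionStrategy());
}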
I am using the following code to extract text from the first page of PDF files with iTextSharp:
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF: http://data.hexagosoft.com/LFBO.pdf
Not working PDF: http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem quite similar, but one works and the other does not.
I guess the fact that their producer tags differ is a clue here.
Another clue is that this function works for any other page of the PDF without a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the glyph used.
The first document works because the mapping is not changed at all, and iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means, e.g., that character code 2 should map to the glyph named MT76, which is an unofficial/private glyph name that iText doesn't know. iText then has no information beyond the character code 2 and will use that code in the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it's impossible to get the correct text out of this document. In any case, it is nowhere defined that a glyph name consisting of MT followed by an integer maps to that ASCII value; that is simply an accident, or a convention of the font designer/creation tool, wherever it came from.
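If you did want to implement that logic, a hypothetical helper for the MT naming convention described above might look like this. It only works for fonts that happen to follow the convention; it is not part of any standard:

static char? GlyphNameToChar(string glyphName)
{
    // "MT76" -> (char)76 == 'L', per the ad-hoc convention in these fonts.
    if (glyphName.StartsWith("MT")
        && int.TryParse(glyphName.Substring(2), out int code)
        && code >= 32 && code <= 126)
    {
        return (char)code;
    }
    return null; // not an MT-style name; no mapping known
}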
The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded during the generation of this PDF. If you have a lot of files like this, then a working approach could be (see the sketch after this list):
1. detect broken pages while extracting text by searching for some phrase that should appear on every page, maybe something like "service"
2. process those pages separately using OCR with tools like Tesseract with a .NET wrapper
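A rough sketch of the first step, reusing the iTextSharp namespaces from the snippets above and assuming "service" really does appear on every good page:

var brokenPages = new List<int>();
using (var reader = new PdfReader(path))
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Pages whose extracted text lacks the marker phrase are
        // candidates for the OCR fallback.
        string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
        if (!pageText.Contains("service"))
            brokenPages.Add(page);
    }
}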
I have an observable collection of a class that has 2 string properties: Word and Translation. I want to create a Word file in the format:
word = translation word = translation
word = translation word = translation...
The Word document needs a 2-column page layout, and the Word part should be bold.
I first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the page layout. For the text itself I used a foreach loop, and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but if there are 1000 items in the collection it takes about 10 seconds, and much more if the number is 10k+. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); that takes less than a second, but I can't make the word bold that way.
Then I switched to the Open XML SDK 2.5, but even after reading the MSDN documentation I still have no idea how to make just part of the text bold, and I don't know if it's even possible to set the page layout column count. The only thing I could do was make it look the same as with Interop.Word, but with just 1 column and <1 sec creation time.
Should I be using Interop.Word or Open XML (or maybe both) for this? And can someone please show me how to write this properly, so it doesn't take forever when the collection is relatively large? Any help is appreciated. :)
OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused, unzip the docx and browse its contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply that paragraph, replacing the text in the template with each word.
Your Word template would contain a single placeholder paragraph, something along the lines of: [templateForWords] [word] = [translation]
Here's some code to get you started, assuming you have the SDK installed (stream and yourDictionary are placeholders for your document stream and your word/translation collection):
using System.Linq;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

var templateRegex = new Regex(@"\[templateForWords\]");
var wordPlacementRegex = new Regex(@"\[word\]");
var translationPlacementRegex = new Regex(@"\[translation\]");

using (var document = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mainPart = document.MainDocumentPart;

    // Find the template paragraph by its marker text.
    Paragraph paragraphTemplate = mainPart.Document.Body
        .Descendants<Paragraph>()
        .First(p => templateRegex.IsMatch(p.InnerText));

    foreach (var entry in yourDictionary)
    {
        var paraClone = (Paragraph)paragraphTemplate.CloneNode(true);

        // The placeholder text lives in the clone's Text elements.
        // Note: Word sometimes splits a paragraph across several Text runs;
        // type each placeholder in one go so it stays in a single run.
        foreach (var text in paraClone.Descendants<Text>())
        {
            text.Text = templateRegex.Replace(text.Text, "");
            text.Text = wordPlacementRegex.Replace(text.Text, entry.Key);
            text.Text = translationPlacementRegex.Replace(text.Text, entry.Value);
        }

        // Insert before the template so the output keeps dictionary order.
        paragraphTemplate.Parent.InsertBefore(paraClone, paragraphTemplate);
    }

    paragraphTemplate.Remove();
    // Content is saved on dispose; Save() just makes it explicit.
    mainPart.Document.Save();
}
OpenXML is absolutely better, because it is faster, has fewer bugs, and is more reliable and flexible at runtime (especially in a server environment). And it's not really difficult to find out how to produce one element or another using OpenXML. Since a docx file is just a zip file with XML files inside, I open it and read the XML to get an idea of how Word itself does it. First I create a document and format it (in your case, you can create a file with two columns and bold words inside), save it, and rename it to a .zip file. Then I open it, open the "word" directory inside, and the file "document.xml" inside that directory. This file contains the essential part of the XML; looking at it, it's not difficult to figure out how to recreate the same structure in OpenXML.
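Applied to the two open questions here (a partly bold paragraph and a two-column layout), the XML you'd find translates to roughly this Open XML SDK sketch; word, translation, and body are placeholders:

// A paragraph with a bold run for the word and a plain run for the translation.
var paragraph = new Paragraph(
    new Run(
        new RunProperties(new Bold()),
        new Text(word)),
    new Run(
        new Text(" = " + translation) { Space = SpaceProcessingModeValues.Preserve }));

// Two-column page layout: a Columns element inside the body's SectionProperties.
body.Append(new SectionProperties(new Columns { ColumnCount = 2 }));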
Open XML is a much better option than Office COM. But the problem is that it is a low-level file-format library that, unlike Office COM, doesn't work at a high abstraction level. You might want to go that route, but I recommend you first consider looking into a commercial library that gives you the benefits of a high-level DOM without needing MS Word installed on the production machine. Our company recently purchased this toolkit, which allows a template-based approach as well as a DOM/programmatic approach to generate/modify/create documents.
I have a PDF file which I need to read and validate for correctness; if any wrong data comes up, the corresponding line should be marked in red. So far I am able to read and validate the contents of the PDF file by extracting them into a string, but I don't know how to color a line when it contains bad data. So my question is: "How do I search for particular line contents in a PDF and mark that line in color?"
Here is my code in C#:
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
{
    int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
    int endPosition = currentText.IndexOf("Total");
    // result will contain everything from the start marker up to the Total line.
    string result = currentText.Substring(startPosition, endPosition - startPosition);
    using (StringReader reader = new StringReader(result))
    {
        // Loop over the lines in the string.
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] split = line.Split(new Char[] { ' ' });
        }
    }
}
If the line contents validate as correct, that's OK; otherwise that line should be marked in red in the PDF file.
Please read the documentation before posting semi-duplicate questions, such as:
Edit an existing PDF file using iTextSharp
How to Read and Mark(Highlight) a pdf file using C#
You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). Especially the comment by mkl should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from StackOverflow).
In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.
The method you're using returns the content of the PDF as a string, similar to result1.txt, which is the output of said example:
Hello World
In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):
ld
Wor
llo
He
The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist; you can add characters to a page in any order you want. You don't even need to add complete words!
When you use the GetTextFromPage() method, you tell iText you don't want any info about the position of the text. mkl has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have implemented the RenderListener interface in a class named MyTextRenderListener. Now the output looks like this (see result3.txt):
<>
<<ld><Wor><llo><He>>
<<Hello People>>
This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.
The example is really simple: it just shows how text snippets are stored in the PDF. We get all the TextRenderInfo objects, and we use the GetText() method to get the text. The order in which we get the text is the order used in the PDF's content stream.
When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and uses the GetBaseline() method to sort all the text snippets.
<<ld><Wor><llo><He>>
results in:
<<He><llo><Wor><ld>>
Then iText looks at the distances between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippets.
You are now looking to do the same thing: you are going to write a system that retrieves all the text snippets, orders them, examines them, and, based on the composed content, adds a background color at the corresponding locations.
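To give an idea of that first step, here is a minimal iTextSharp sketch (the class name SnippetCollector is made up) that collects each snippet together with the start of its baseline:

using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

class SnippetCollector : IRenderListener
{
    // Each entry is a text snippet plus the start point of its baseline.
    public readonly List<Tuple<string, Vector>> Snippets = new List<Tuple<string, Vector>>();

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Snippets.Add(Tuple.Create(renderInfo.GetText(), start));
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

// Usage: feed a page through the parser, then sort and examine Snippets,
// and draw the red background (e.g. with PdfStamper) at the flagged positions:
// new PdfReaderContentParser(pdfReader).ProcessContent(page, new SnippetCollector());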
I am looking at the feasibility of creating something using C# and iTextSharp that can take a PDF template and replace various placeholder values with actual values retrieved from a database; essentially a PDF mail merge. I have the iText in Action book, but it covers rather a lot of stuff I don't need, and I am struggling to find anything related to what I want to do. I am happy to use PDF fields as the placeholders, so long as the merged/flattened form does not look like it has fields in it; the output document should look like a mail-merged letter, not a form that has been filled in. In an ideal world I just want to search the text content of the PDF and replace text placeholders with their correct field values, à la Word mail merge.
Can anyone advise me of the best approach to this and point me in the direction of the most helpful iTextSharp classes to use? Or, if you know the iText in Action book, a pointer to the most helpful section for me to read.
1. Build your template sans fields in your page-layout/text-editor of choice.
2. Save to PDF.
3. Open that PDF and add fields to it. This is easy to do in Acrobat Pro (you could download a trial if need be). It's also possible in iText, just much harder.
In either case, you want to set your form fields to have no border and no background; that way only their contents will be visible, with no boxes to make your fields look like fields.
Merging field data into a form is quite trivial with iText (forgive my Java, I don't know much about C#):
void fillPDF(String filePath, OutputStream outputFileStream, Map<String, String> fieldVals)
        throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filePath);
    PdfStamper stamper = new PdfStamper(reader, outputFileStream);
    // Flatten the form so the filled-in values become ordinary page content.
    stamper.setFormFlattening(true);
    AcroFields fields = stamper.getAcroFields();
    for (String fldName : fieldVals.keySet()) {
        fields.setField(fldName, fieldVals.get(fldName));
    }
    stamper.close();
}
This ignores list boxes with multiple selections (and exception handling), but other than that it should be ready to go. Given that you're doing a mail-merge type thing, I don't think multiple selections will be much of an issue.
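Since the question is about C#, the same thing with iTextSharp would look roughly like this; a sketch under the same assumptions, i.e. that the field names in the template match the dictionary keys:

using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

static void FillPdf(string inputPath, string outputPath, Dictionary<string, string> fieldVals)
{
    var reader = new PdfReader(inputPath);
    using (var output = new FileStream(outputPath, FileMode.Create))
    {
        var stamper = new PdfStamper(reader, output);
        stamper.FormFlattening = true; // flatten so fields don't look like fields
        AcroFields fields = stamper.AcroFields;
        foreach (var pair in fieldVals)
        {
            fields.SetField(pair.Key, pair.Value);
        }
        stamper.Close();
    }
    reader.Close();
}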