iText GetTextFromPage exception with inline image

iText GetTextFromPage exception with inline image - c#

I have the same problem as was discussed here, which was not solved. My objective is to extract the text from an existing pdf file. I get the error message Could not find image data or EI for a certain pdf, which I cannot share as a sample. It works for other pdfs, with the following code
string fileURI = "C:\\Test\\Sample.pdf";
PdfReader reader = new PdfReader(fileURI);
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
Debug.WriteLine(s);
I am using iTextSharp 5.5.0 and tried changing found == 1 to found <= 1 as suggested in other posts. It does not help.
Would it help to remove all images in the pdf? I really just need the text. Which commands from iText could help me with this?

I downloaded the trial version of Acrobat to create a version of the pdf file, that I could share. After opening the file and saving it again as "Optimized PDF" over the Acrobat, the code was working and I could extract the text.
So the solution to the problem is probably opening each file in Acrobat and saving it again with the right settings using the Acrobat reference and then extracting the text.

Related

C# - Save XML's plain text as PDF instead of converting [duplicate]

I am trying to save xml file as PDF as it is. In other words, I am trying to create PDF file that shows content of XML like a screenshot (like raw screenshot). My client somehow needs it like this. I couldn't really find the same question on stackoverflow. Is there anyway I can do this using iText or some other library?
Thank you!

First extract your text from XML file:
XmlDocument doc = new XmlDocument();
doc.Load("c:\\temp.xml");
string myText;
foreach(XmlNode node in doc.DocumentElement.ChildNodes){
myText= node.InnerText; //or loop through its children as well
}
Then create a PDF file and pass this text into it
in this case here I use PDFFlow library to create pdf documents
var DocumentBuilder.New()
.AddSection().AddParagraphToSection(myText).ToDocument()
.Build("Result.PDF");

If you load the xml in a browser you can easily save to searchable PDF
In a shell call (replace msedge with chrome if necessary)
"path to\msedge.exe" --headless --disable-gpu --print-to-pdf="out path\xml.pdf" --enable-logging "file://path to A\file.xml"
Enable logging helps as it can take a long time to process without any visual progress.
[0818/231038.640:INFO:headless_shell.cc(648)] Written to file ...\xml.pdf.
You can also add --print-to-pdf-no-header. Also if adding some style consider --run-all-compositor-stages-before-draw but I have no idea if that works for xml.
ForGet Image --screenshot as a 40 Page high XML as JPEG does NOT translate well to PDF. I tried :-)
If you want that as an image PDF then Re-Print the PDF to PDF using a command line viewer such as this since it is ONLY Print As Image output :-) also note it can in addition read the XML in Black and White (NO linting).
But have not tested how well it does XML2PDF via command line print
SumatraPDF -print-to "My Print to PDF" "path to\filename.pdf" (or xml in mono)
Note "My Print to PDF" is a promptless port you need to configure as required.

Corrupted file after `WordprocessingDocument.Open`

I have problem with this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true)) { }
Using just above and trying open document in Word showing error message that file is corrupted. It is interesting that for LibreOffice file is OK. I compared xml files (in docx) in WinMarge file before and after using this code and both are identical. Difference is only in size of docx file - why?

OK.. I resolved problem.. it's not nice solution but it's works..
var document = "template.docx";
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
// some editing stuff
wordDoc.Clone("ready.docx");
}
Now template.docx is corrupted, but ready.docx is fine.

I had the same problem, although your solution worked for me
wordDoc.Clone("ready.docx");
I found that in my case it was problem with letter encoding. I had the file generated from abby, image text to docx generation.
In order to check if encoding makes you trouble:
change youreditedword.docx to youreditedword.zip
open .zip, go to word folder, open document.xml
check document.xml' text that appears in word. If it is garbled, then you have encoding problem.
I fixed it this way - removed one phrase in my original unedited word file and writed it down again and saved it. Probably this way the encoding for the file is changed. Then after using openxml library and opening the file did not produce file is corrupted error

How to convert docx to html file using open xml with formatting

I know there are lot of question having same title but I am currently having some issue for them I didn't get the correct way to go.
I am using Open xml sdk 2.5 along with Power tool to convert .docx file to .html file which uses HtmlConverter class for conversion.
I am successfully able to convert the docx file into the Html file but the problem is, html file doesn't retain the original formatting of the document file. eg. Font-size,color,underline,bold etc doesn't reflect into the html file.
Here is my existing code:
public void ConvertDocxToHtml(string fileName)
{
byte[] byteArray = File.ReadAllBytes(fileName);
using (MemoryStream memoryStream = new MemoryStream())
{
memoryStream.Write(byteArray, 0, byteArray.Length);
using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
{
HtmlConverterSettings settings = new HtmlConverterSettings()
{
PageTitle = "My Page Title"
};
XElement html = HtmlConverter.ConvertToHtml(doc, settings);
File.WriteAllText(#"E:\Test.html", html.ToStringNewLineOnAttributes());
}
}
}
So I just want to know if is there any way by which I can retain the formatting in converted HTML file.
I know about some third party APIs which does the same thing. But I would prefer if there any way using open xml or any other open source to do this.

PowerTools for Open XML just released a new HtmlConverter module. It now contains an open source, free implementation of a conversion from DOCX to HTML formatted with CSS. The module HtmlConverter.cs supports all paragraph, character, and table styles, fonts and text formatting, numbered and bulleted lists, images, and more. See https://openxmldeveloper.org/

Your end result will not look exactly the way your Word Document turns out, but this link might help.

You might want to find an external tool to help you do this, like Aspose Words

You can use OpenXML Viewer extension for Firefox for Converting with formatting.
http://openxmlviewer.codeplex.com
This works for me. Hope this helps.

PDF generated with itext becomes 'corrupted' when using SetSimpleColumn()

First I would like to point out that stackowerflow helped me with many problems in the past, so thank you all. But now I have come to problem that I haven't fount a solution for yet and it's driving me crazy. I'm not native english speaker, so sorry for any language mistakes.
So here it is:
I'm generating pdf with itextsharp library(great library by the way). I'm starting with some kind of pdf form/template, to which i'm adding 'fill-out' data. I'm using PdfReader to read template pdf and by caling PdfStamper method GetOverContent(pageNum) for individual pages I get PdfContentByte. With that PdfContentByte I'm adding my text/data (BeginText and EndText is used on every page). Most of text I add with method ShowTextAligned. That all ok, generated pdf contains my text. The problem begins where i have to add 'columned' text. I do that with following code:
ColumnText ct = new ColumnText(cb);//cb is PdfContentByte
Phrase p = new Phrase(txt, FontFactory.GetFont(DEFAULT_FONT, BaseFont.CP1250, true, font_size));
ct.SetSimpleColumn(p, x, y, x+width, y+height, 10, alignment);
ct.Go();
setDefaultFont();//sets font to PdfContentByte again with setFontAndSize and SetColorFill
Columned text is added with this code OK, but the text(on that same page/same PdfContentByte) added AFTER this with ShowTextAligned is not visible in Acrobat Reader.
Here is the 'fun' part - that text in same pdf file opened with foxit reader is fine/visible/ok.
So text added with ShowTextAligned after adding ColumnText is not visible in acrobat reader but visible in foxit reader just fine. This problem exists inside one page, new page resets this problem (PdfContentByte for next page is new).
My workaround for that was to add all ColumnText AFTER all calls of ShowTextAligned. That worked till today, when customer printed out generated pdf with acrobat reader, which after printing the document, displayed message that pdf contains error and that author of pdf should be contacted. Version of Adobe Reader is 10.1.1. Problem is not in customer computer, same thing hapens on my computer.
After researching the web I installed Adobe Acrodat Pro Trial which contains tool Preflight, which is purposed for analyzing pdfs (as far I understand). This tool outputs warning "Invalid content state stream for operator". And here I'm stucked. I belive the problem exists inside added ColumnText, because document generated without them causes no problem displaying/printing and Preflight states "No problem found".
It is possible that i'm missing some fact and that the problem is in my code...
Please help me, because i'm runnig out of ideas.
I hope this post will help someday someone else with the same problem.
I cannot attach sample pdf because it contains sensitive data, but if there is no other way, i'll recreate the scenario/code.

So to answer my question/problem:
When writing to pdf using PdfContentByte and using method ShowTextAligned you have to call BeginText before writing and after you are finished you have to call EndText. So i did. BUT if you want to add some other element(like ColumnText, Image and probably anything else) you can't do that before you call EndText. If you do, generated pdf will be 'problematical'/corrupted.
So in pseudocode following is wrong:
BeginText();
ShowtextAligned();
AddImage();
ShowtextAligned();
EndText();
Correct usage is:
BeginText();
ShowtextAligned();
EndText();
AddImage();
BeginText();
ShowtextAligned();
EndText();
I hope this will help someone someday somewhere.

iTextSharp for PDF - how add file attachments?

I am using iTextSharp to create a PDF document in C#. I would like to attach another file to the PDF. I'm having just loads of trouble trying to do so. The examples here show some annotations, which apparently attachments are.
This is what I've tried:
writer.AddAnnotation(its.pdf.PdfAnnotation.CreateFileAttachment(writer, new iTextSharp.text.Rectangle(100,100,100,100), "File Attachment", its.pdf.PdfFileSpecification.FileExtern(writer, "C:\\test.xml")));
Well, what happens is it does add an annotation on the PDF (appears as a little comment voice balloon), which i don't want. test.xml is shown in the attachments pane in Adobe Reader, but it can't be read or saved, and its file size is unknown so it's likely that it's never being properly attached.
Any suggestions?

Well, I got some code working to attach it:
its.Document PDFD = new its.Document(its.PageSize.LETTER);
its.pdf.PdfWriter writer;
writer = its.pdf.PdfWriter.GetInstance(PDFD, new FileStream(targetpath, FileMode.Create));
its.pdf.PdfFileSpecification pfs = its.pdf.PdfFileSpecification.FileEmbedded(writer, "C:\\test.xml", "New.xml", null);
writer.AddFileAttachment(pfs);
where "its"="iTextSharp.text"
Now to read the attachment!

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.