Using the code below, I am able to read only one comment per page. How can I read all the comments from all the pages of a PDF? Or is there any way to get the list of all comments from a PDF in one shot?
for (int page = 1; page <= pdfRead.NumberOfPages; ++page)
{
    PdfDictionary pagedic = pdfRead.GetPageN(page);
    PdfArray annotarray = (PdfArray)PdfReader.GetPdfObject(pagedic.Get(PdfName.ANNOTS));
    if (annotarray == null || annotarray.Size == 0)
        continue;
    string all_string = "";
    foreach (PdfObject A in annotarray.ArrayList)
    {
        PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);
        if (AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.TEXT))
        {
            all_string += AnnotationDictionary.GetAsString(PdfName.T).ToString() + "\n";
            all_string += AnnotationDictionary.GetAsString(PdfName.CONTENTS).ToString() + "\n";
        }
    }
}
Pages are organized in a page tree. Each leaf of the page tree refers to a page dictionary. The /Annots entry is one of the optional keys of a page dictionary. It contains an array of annotations that belong to a specific page.
You are looping over every page dictionary of every page in the page tree, retrieving the /Annots array of every page dictionary. This is the correct procedure.
The premise of your question "How to read all the comments from all the pages from PDF?" is wrong: you are already reading all the comments from all the pages of the PDF in the correct way. It is inherent to PDF that annotations are organized per page. Even if there were an iTextSharp method that returned all the annotations in a single call, it would use the exact same code you are using now. What would that gain you? It would take the same amount of processing time.
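If the goal is simply one list for the whole document, the same loop can be wrapped in a helper. The following is a minimal sketch, assuming iTextSharp 5.x; the class name PdfComments and the method name ReadAll are made up for illustration and are not part of iTextSharp. Note that in the question's code all_string is declared inside the page loop, so it starts empty again on every page; here the list is created once, outside the loop. As in the question, only annotations of subtype /Text (sticky notes) are collected.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class PdfComments
{
    // Returns one "author: text" entry for every /Text annotation on every page.
    public static List<string> ReadAll(PdfReader reader)
    {
        var comments = new List<string>(); // created once, so nothing is lost between pages
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            PdfDictionary pageDict = reader.GetPageN(page);
            PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS); // /Annots is optional
            if (annots == null)
                continue;

            for (int i = 0; i < annots.Size; i++)
            {
                PdfDictionary annot = annots.GetAsDict(i);
                if (annot == null || !PdfName.TEXT.Equals(annot.GetAsName(PdfName.SUBTYPE)))
                    continue;

                // /T (author) and /Contents are both optional entries.
                PdfString author = annot.GetAsString(PdfName.T);
                PdfString contents = annot.GetAsString(PdfName.CONTENTS);
                comments.Add((author?.ToUnicodeString() ?? "") + ": " +
                             (contents?.ToUnicodeString() ?? ""));
            }
        }
        return comments;
    }
}
```

Calling PdfComments.ReadAll(pdfRead) then gives the complete list in one call, but it still visits every page dictionary exactly as described above.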
Using
var shapes = currentDocument.Shapes;
foreach (Shape shape in shapes)
if (shape.Type == MsoShapeType.msoPicture)
{
InlineShapeHelper.ReplaceInlineShape(...);
break;
}
I can replace the first image in "currentDocument".
How can I detect on which page the image is located (or in this case also sufficient: if it's on the first page)?
I want to replace a specific image on the first page, so is it maybe even possible to extract the image or otherwise check whether the image is the one I'm looking for?
To answer your specific question: How can I detect on which page the image is located?
The get_Information method can return the page number of a given Range using the enumeration Word.WdInformation.wdActiveEndPageNumber.
A Shape is always anchored to a specific character in the document; the Shape.Anchor property returns that position as a Range.
The following code sample demonstrates how to loop over the Shapes in a document and get their names and page numbers. Note that if the Shape.Name is known, it's possible to pick up a Shape object directly (Shapes["Name As String"]). But you need to be careful with names generated by the Word application when a Shape is inserted, as Word can change a name it assigned itself at any time. If a name is assigned using code, that name remains static; Word won't change it.
Word.ShapeRange shpRange = doc.Content.ShapeRange;
foreach (Word.Shape shp in shpRange)
{
    System.Diagnostics.Debug.Print(shp.Name + ", " + shp.Anchor.get_Information(Word.WdInformation.wdActiveEndPageNumber).ToString());
}
One way I found just after posting the question is to generate the hash-code of the image:
var shapes = currentDocument.Shapes;
foreach (Shape shape in shapes)
if (shape.Type == MsoShapeType.msoPicture)
{
int hash = shape.GetHashCode();
InlineShapeHelper.ReplaceInlineShape(...);
break;
}
But I would still be interested in other, better, more elegant solutions and the possibility to get to know the page number.
I'm trying to write an extension method for Aspose's DocumentBuilder class that lets you check whether inserting a number of paragraphs into a document will cause a page break. I hoped this would be rather simple, but it turns out otherwise:
public static bool WillPageBreakAfter(this DocumentBuilder builder, int numParagraphs)
{
    // Get current number of pages
    int pageCountBefore = builder.Document.PageCount;
    for (int i = 0; i < numParagraphs; i++)
    {
        builder.InsertParagraph();
    }
    // Get the number of pages after adding those paragraphs
    int pageCountAfter = builder.Document.PageCount;
    // Delete the paragraphs, we don't need them anymore
    ...
    if (pageCountBefore != pageCountAfter)
    {
        return true;
    }
    else
    {
        return false;
    }
}
My problem is that inserting paragraphs does not seem to update the builder.Document.PageCount property. Even plugging in something crazy like 5000 paragraphs does not seem to modify that property. I've also tried InsertBreak() (including using BreakType.PageBreak) and Writeln(), but those don't work either.
What's going on here? Is there any way I can achieve the desired result?
UPDATE
It seems that absolutely nothing done on the DocumentBuilder parameter actually happens on the DocumentBuilder that is calling the method. In other words:
If I modify the for loop to do something like builder.InsertParagraph(i.ToString()); and then remove the code that deletes the paragraphs afterwards, I can call:
myBuilder.WillPageBreakAfter(10);
and expect to see 0-9 written to the document when it is saved; however, they are not. None of the Writeln()s in the extension method seem to do anything at all.
UPDATE 2
It appears that, for whatever reason, I cannot write anything with the DocumentBuilder after accessing the page count. So calling something like Writeln() before the int pageCountBefore = builder.Document.PageCount; line works, but trying to write after that line simply does nothing.
The Document.PageCount invokes page layout. You are modifying the document after using this property. Note that when you modify the document after using this property, Aspose.Words will not update the page layout automatically. In this case you should call Document.UpdatePageLayout method.
I work with Aspose as Developer Evangelist.
And it seems I've figured it out.
From the Aspose docs:
// This invokes page layout which builds the document in memory so note that with large documents this
// property can take time. After invoking this property, any rendering operation e.g rendering to PDF or image
// will be instantaneous.
int pageCount = doc.PageCount;
The most important line here:
This invokes page layout
By "invokes page layout", they mean it calls UpdatePageLayout(), for which the docs contain this note:
However, if you modify the document after rendering and then attempt to render it again - Aspose.Words will not update the page layout automatically. In this case you should call UpdatePageLayout() before rendering again.
So basically, given my original code, I have to call UpdatePageLayout() after my Writeln()s in order to get the updated page count.
// Get current number of pages
int pageCountBefore = builder.Document.PageCount;
for (int i = 0; i < numParagraphs; i++)
{
    builder.InsertParagraph();
}
// Update the page layout.
builder.Document.UpdatePageLayout();
// Get the number of pages after adding those paragraphs
int pageCountAfter = builder.Document.PageCount;
I am trying to build an application that can convert a PDF to an Excel file with C#.
I have searched for a library to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it is free, but I can rarely find good open documentation for it.
These are some of the links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
So this is the most common code I see for iText with C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Please tell me whether my understanding is correct, and explain the parts of the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can:
- get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
- get the graphics state when the object was rendered (which font, font size, color, etc.)
This can allow you to write your own parsing logic.
For instance:
1. gather all the PathRenderInfo objects
2. remove everything that is not a perfectly horizontal or vertical line
3. make clusters of everything that intersects at 90-degree angles
4. if a cluster contains more than a given threshold of lines, consider it a table
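The steps above can be sketched in plain C#, independent of iText. This is only an illustration under assumptions: Segment is a hypothetical stand-in for whatever line coordinates you extract from the PathRenderInfo objects, TableFinder and its tolerance Eps are made-up names and values, and "intersects at 90 degrees" is simplified to "a horizontal and a vertical line touch or cross". Clusters are built with a small union-find.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for line data extracted from PathRenderInfo objects.
public readonly struct Segment
{
    public readonly double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public bool IsHorizontal => Math.Abs(Y1 - Y2) < TableFinder.Eps;
    public bool IsVertical => Math.Abs(X1 - X2) < TableFinder.Eps;
}

public static class TableFinder
{
    public const double Eps = 0.5; // assumed tolerance in PDF user-space units

    // True when horizontal segment h and vertical segment v touch or cross.
    static bool Crosses(Segment h, Segment v)
    {
        double left = Math.Min(h.X1, h.X2), right = Math.Max(h.X1, h.X2);
        double bottom = Math.Min(v.Y1, v.Y2), top = Math.Max(v.Y1, v.Y2);
        return v.X1 >= left - Eps && v.X1 <= right + Eps
            && h.Y1 >= bottom - Eps && h.Y1 <= top + Eps;
    }

    // Returns every cluster of perpendicular, intersecting lines
    // that contains more than 'threshold' lines.
    public static List<List<Segment>> FindTables(IEnumerable<Segment> segments, int threshold)
    {
        // Step 2: keep only horizontal or vertical lines.
        var lines = segments.Where(s => s.IsHorizontal || s.IsVertical).ToList();

        // Step 3: union-find over lines that intersect at right angles.
        int[] parent = Enumerable.Range(0, lines.Count).ToArray();
        int Find(int i) { while (parent[i] != i) i = parent[i] = parent[parent[i]]; return i; }

        for (int i = 0; i < lines.Count; i++)
            for (int j = i + 1; j < lines.Count; j++)
            {
                Segment a = lines[i], b = lines[j];
                bool perpendicular = (a.IsHorizontal && b.IsVertical && Crosses(a, b))
                                  || (a.IsVertical && b.IsHorizontal && Crosses(b, a));
                if (perpendicular) parent[Find(i)] = Find(j);
            }

        // Step 4: a cluster with more than 'threshold' lines counts as a table.
        return Enumerable.Range(0, lines.Count)
            .GroupBy(Find)
            .Where(g => g.Count() > threshold)
            .Select(g => g.Select(i => lines[i]).ToList())
            .ToList();
    }
}
```

The pairwise comparison is quadratic in the number of lines, which is usually fine for a single page. Real documents also draw table borders as thin filled rectangles rather than stroked lines, which this sketch does not handle.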
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/
I am trying to cast AcroFields to the specific types so I can set properties on them.
When I call AcroFields.GetField(string name); all I get is a string.
When I call AcroFields.GetFieldItem(string name); I get an object but cannot cast it to the specific type.
I have also tried:
AcroFields.SetFieldProperty("myfield", "CheckType", RadioCheckField.TYPE_STAR, null);
This returned false every time.
To better explain my scenario:
I have an existing PDF ( I am NOT generating this file ).
There is a checkbox in it. I want to change the "CheckType" like this:
myRadioCheckField.CheckType = RadioCheckField.TYPE_STAR
But since I cannot cast the AcroField to the specific type I cannot access that property "CheckType".
Is there a way to achieve this?
Please provide a working sample if possible.
Thank you.
Your post contains two different questions. One question is easy to answer. The other question is impossible to answer.
Let's start with the easy question:
How to get specific types from AcroFields? Like PushButtonField, RadioCheckField, etc
This is explained in the FormInformation example, which is part of chapter 6 of my book:
PdfReader reader = new PdfReader(datasheet);
// Get the fields from the reader (read-only!!!)
AcroFields form = reader.AcroFields;
// Loop over the fields and get info about them
StringBuilder sb = new StringBuilder();
foreach (string key in form.Fields.Keys) {
sb.Append(key);
sb.Append(": ");
switch (form.GetFieldType(key)) {
case AcroFields.FIELD_TYPE_CHECKBOX:
sb.Append("Checkbox");
break;
case AcroFields.FIELD_TYPE_COMBO:
sb.Append("Combobox");
break;
case AcroFields.FIELD_TYPE_LIST:
sb.Append("List");
break;
case AcroFields.FIELD_TYPE_NONE:
sb.Append("None");
break;
case AcroFields.FIELD_TYPE_PUSHBUTTON:
sb.Append("Pushbutton");
break;
case AcroFields.FIELD_TYPE_RADIOBUTTON:
sb.Append("Radiobutton");
break;
case AcroFields.FIELD_TYPE_SIGNATURE:
sb.Append("Signature");
break;
case AcroFields.FIELD_TYPE_TEXT:
sb.Append("Text");
break;
default:
sb.Append("?");
break;
}
sb.Append(Environment.NewLine);
}
// Get possible values for field "CP_1"
sb.Append("Possible values for CP_1:");
sb.Append(Environment.NewLine);
string[] states = form.GetAppearanceStates("CP_1");
for (int i = 0; i < states.Length; i++) {
sb.Append(" - ");
sb.Append(states[i]);
sb.Append(Environment.NewLine);
}
// Get possible values for field "category"
sb.Append("Possible values for category:");
sb.Append(Environment.NewLine);
states = form.GetAppearanceStates("category");
for (int i = 0; i < states.Length - 1; i++) {
sb.Append(states[i]);
sb.Append(", ");
}
sb.Append(states[states.Length - 1]);
This code snippet stores the types of the fields in a StringBuilder, as well as the possible values of a radio field and a check box.
If you execute it on datasheet.pdf, you get form_info.txt as a result.
So far so good, but then comes the difficult question:
How do I find the check type? How do I change it?
That question reveals a lack of understanding of PDF. When we looked for the possible values of a check box (or radio button) in the previous answer, we asked for the different appearance states. These appearance states are small pieces of content that are expressed in PDF syntax.
For instance: take a look at the buttons.pdf form.
When we look at it on the outside, we see:
The check box next to "English" can be an empty square or a square with a pinkish background and a cross. Now let's take a look at the inside:
We see that this is the table check box, and we see that there are two appearance states: /Yes and /Off. What these states look like when selected, is described in a stream.
The stream of the /Off state is rather simple:
You immediately see that we are constructing a rectangle (re) and drawing it without filling it (S).
The /Yes state is slightly more complex:
We see that the fill color is being changed (rg), and that we stroke the rectangle in black and fill it using the fill color that was defined (B). Then we define two lines with moveTo (m) and lineTo (l) operations and we stroke them (S).
If you are proficient in PDF syntax, it is easy to see that we're drawing a cross inside a colored rectangle. So that answers your question on condition that you're proficient in PDF...
If you want to replace the appearance, then you have to replace the stream that draws the rectangle and the cross. That's not impossible, but it's a different question than the one you've posted.
In summary: there is no such thing as a TYPE_STAR in the PDF reference (ISO-32000-1), nor in any PDF. If you have an existing PDF, you cannot cast a check box or radio button to a RadioCheckField. You could try to reconstruct the RadioCheckField object, but if you want to know whether a check box is visualized using a check mark, a star, and so on, then you have to interpret the PDF syntax. If you don't want to do that, you can't create the "original" RadioCheckField object that was used to create the PDF, because the PDF does not contain that information in a ready-to-use form.
After some research I have come to the conclusion that at this point in time, with the current version of iTextSharp v5.5.7.0:
It is NOT possible to grab an AcroField and cast it back to the original class ( RadioCheckField ) that was used to generate the field in the first place.
So the specific classes like PushButtonField and RadioCheckField are only useful for generating a new PDF but not for editing an existing PDF.
This cannot be done.
As far as I know, iTextSharp does not support casting from an AcroField back to the original class (RadioCheckField) that was used to generate the field.
You would have to write your own code to parse and inspect the PDF to achieve this.
I have been challenged with producing a method that will read very large text files into a program. These files can range from 2 GB to 100 GB.
The idea so far has been to read, say, a couple of thousand lines of text at a time into the method.
At the moment the program is set up to use a StreamReader that reads the file line by line and processes the necessary areas of data found on each line.
using (StreamReader reader = new StreamReader("FileName"))
{
string nextline = reader.ReadLine();
string textline = null;
while (nextline != null)
{
textline = nextline;
Row rw = new Row();
var property = from matchID in xmldata
from matching in matchID.MyProperty
where matchID.ID == textline.Substring(0, 3).TrimEnd()
select matching;
string IDD = textline.Substring(0, 3).TrimEnd();
foreach (var field in property)
{
Field fl = new Field();
fl.Name = field.name;
fl.Data = textline.Substring(field.startByte - 1, field.length).TrimEnd();
fl.Order = order;
fl.Show = true;
order++;
rw.ID = IDD;
rw.AddField(fl);
}
rec.Rows.Add(rw);
nextline = reader.ReadLine();
if ((nextline == null) || (NewPack == nextline.Substring(0, 3).TrimEnd()))
{
d.ID = IDs.ToString();
d.Records.Add(rec);
IDs++;
DataList.Add(d.ID, d);
rec = new Record();
d = new Data();
}
}
}
The program goes on further and populates a class. (I just decided not to post the rest.)
I know that once the program is given an extremely large file, out-of-memory exceptions will occur.
That is my current problem. So far I have been googling several approaches, with many people just answering "use a StreamReader and reader.ReadToEnd()". I know ReadToEnd won't work for me, as I will get those memory errors.
Finally, I have been looking into async as a way of creating a method that reads a certain number of lines and waits for a call before processing the next batch of lines.
This brings me to my problem: I am struggling to understand async and can't seem to find any material that will help me learn it, so I was hoping someone here could help me understand it.
Of course if anyone knows of a better way to solve this problem I am all ears.
EDIT: Added the remainder of the code to put an end to any confusion.
Your problem isn't synchronous vs. asynchronous; it's that you're reading the entire file and storing parts of it in memory before you do something with that data.
If you were reading each line, processing it, and writing the result to another file or database, then StreamReader would let you process multi-GB (or TB) files.
There's only a problem if you're storing portions of the file until you finish reading it; then you can run into memory issues (but you'd be surprised how large you can let Lists and Dictionaries get before you run out of memory).
What you need to do is save your processed data as soon as you can, and not keep it in memory (or keep as little in memory as possible).
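As a minimal sketch of that idea (LineStreamer and the transform delegate are names made up here, and "processing" is reduced to mapping one input line to one output line), the loop below holds only the current line in memory and writes each result straight away:

```csharp
using System;
using System.IO;

// Hypothetical helper illustrating read-process-write without buffering the file.
public static class LineStreamer
{
    // Reads 'input' one line at a time, transforms it, and writes the result
    // immediately. Returns the number of lines processed.
    public static long Process(TextReader input, TextWriter output, Func<string, string> transform)
    {
        long count = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            output.WriteLine(transform(line)); // nothing is kept after this point
            count++;
        }
        return count;
    }
}
```

With real files you would pass a StreamReader and a StreamWriter, each inside a using block. Your own code would need some restructuring to fit this shape, because it currently adds every Data object to DataList and so keeps the whole file's contents alive.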
With files that large you may need to keep your working set (your processing data) in a database. Something like SQL Server Express or SQLite might do (but again, it depends on how large your working set gets).
Hope this helps, don't hesitate to ask further questions in the comments, or edit your original question, I'll update this answer if I can help in any way.
Update - Paging/Chunking
You need to read the text file in chunks of one page and allow the user to scroll through the "pages" in the file. As the user scrolls, you read the next page and present it to them.
Now, there are a couple of things you can do to help yourself. Always keep about 10 pages in memory; this allows your app to stay responsive if the user pages up or down a couple of pages very quickly. In the application's idle time (Application Idle event) you can read in the next few pages, and you throw away pages that are more than five pages before or after the current page.
Paging backwards is a problem, because you don't know where each line begins or ends in the file, and therefore you don't know where each page begins or ends. So for paging backwards, as you read down through the file, keep a list of offsets to the start of each page (Stream.Position); then you can quickly Seek to a given position and read the page in from there.
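Here is a rough sketch of such an offset index. The names PageIndex, Build and ReadPage are invented for this example, and it assumes a seekable stream, a fixed number of lines per page, and an encoding in which a line feed is the single byte 0x0A (true for ASCII and UTF-8, not for UTF-16). It scans raw bytes rather than using a StreamReader for the index, because StreamReader reads ahead into its own buffer, so the underlying stream's Position does not tell you where the current line starts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: remembers where each "page" of lines starts in the file.
public static class PageIndex
{
    // Scans the stream once and returns the byte offset of the first line of each page.
    public static List<long> Build(Stream stream, int linesPerPage)
    {
        var offsets = new List<long>();
        var buffer = new byte[64 * 1024];
        long position = 0;
        int linesOnPage = 0;
        bool atLineStart = true;
        int read;

        stream.Seek(0, SeekOrigin.Begin);
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++, position++)
            {
                if (atLineStart)
                {
                    if (linesOnPage == 0) offsets.Add(position); // a new page begins here
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n')
                {
                    atLineStart = true;
                    linesOnPage = (linesOnPage + 1) % linesPerPage;
                }
            }
        }
        return offsets;
    }

    // Seeks to the start of the requested page (zero-based) and reads its lines.
    public static List<string> ReadPage(Stream stream, List<long> offsets, int page, int linesPerPage)
    {
        stream.Seek(offsets[page], SeekOrigin.Begin);
        var lines = new List<string>();
        // leaveOpen: true, so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 4096, true))
        {
            string line;
            while (lines.Count < linesPerPage && (line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }
}
```

The index costs one long per page, so even a 100 GB file stays manageable, and jumping to any page, forwards or backwards, becomes a single Seek.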
If you need to allow the user to search through the file, then you pretty much read through the file line by line (remembering the page offsets as you go) looking for the text, then when you find something, read in and present them with that page.
You can speed everything up by pre-processing the file into a database. There are grid controls that will work off a dynamic dataset (they will do the paging for you), and you get the benefit of built-in searches and filters.
So, from a certain point of view, this is reading the file asynchronously, but only from the user's point of view. From a technical point of view, we tend to mean something else when we talk about doing something asynchronously in programming.