Aspose.Words adding paragraphs via DocumentBuilder does not update Document info - c#

I'm trying to write an extension method for Aspose's DocumentBuilder class that checks whether inserting a number of paragraphs into a document will cause a page break. I hoped this would be rather simple, but it turns out otherwise:
public static bool WillPageBreakAfter(this DocumentBuilder builder, int numParagraphs)
{
// Get current number of pages
int pageCountBefore = builder.Document.PageCount;
for (int i = 0; i < numParagraphs; i++)
{
builder.InsertParagraph();
}
// Get the number of pages after adding those paragraphs
int pageCountAfter = builder.Document.PageCount;
// Delete the paragraphs, we don't need them anymore
...
if (pageCountBefore != pageCountAfter)
{
return true;
}
else
{
return false;
}
}
My problem is that inserting paragraphs does not seem to update the builder.Document.PageCount property. Even plugging in something crazy like 5000 paragraphs does not seem to modify that property. I've also tried InsertBreak() (including using BreakType.PageBreak) and Writeln(), but those don't work either.
What's going on here? Is there any way I can achieve the desired result?
UPDATE
It seems that absolutely nothing done on the DocumentBuilder parameter actually happens on the DocumentBuilder that is calling the method. In other words:
If I modify the for loop to do something like builder.InsertParagraph(i.ToString()); and then remove the code that deletes the paragraphs afterwards, I can call:
myBuilder.WillPageBreakAfter(10);
And expect to see 0-9 written to the document when it is saved; however, nothing is written. None of the Writeln()s in the extension method seem to do anything at all.
UPDATE 2
It appears that, for whatever reason, I cannot write anything with the DocumentBuilder after accessing the page count. So calling something like Writeln() before the int pageCountBefore = builder.Document.PageCount; line works, but trying to write after that line simply does nothing.

The Document.PageCount property invokes page layout. You are modifying the document after using this property. Note that when you modify the document after using this property, Aspose.Words will not update the page layout automatically. In this case you should call the Document.UpdatePageLayout method.
I work with Aspose as Developer Evangelist.

And it seems I've figured it out.
From the Aspose docs:
// This invokes page layout which builds the document in memory so note that with large documents this
// property can take time. After invoking this property, any rendering operation e.g rendering to PDF or image
// will be instantaneous.
int pageCount = doc.PageCount;
The most important line here:
This invokes page layout
By "invokes page layout", they mean it calls UpdatePageLayout(), for which the docs contain this note:
However, if you modify the document after rendering and then attempt to render it again - Aspose.Words will not update the page layout automatically. In this case you should call UpdatePageLayout() before rendering again.
So basically, given my original code, I have to call UpdatePageLayout() after my Writeln()s in order to get the updated page count.
// Get current number of pages
int pageCountBefore = builder.Document.PageCount;
for (int i = 0; i < numParagraphs; i++)
{
builder.InsertParagraph();
}
// Update the page layout.
builder.Document.UpdatePageLayout();
// Get the number of pages after adding those paragraphs
int pageCountAfter = builder.Document.PageCount;

Related

Is there any way to optimize pdfreader itextsharp?

I have a method that repeatedly reads text from different pages of a PDF document, using a rectangle to select each area. The larger the file, the slower the processing. I tried Parallel.ForEach, but I didn't get a substantial increase in processing speed; everything seems to be held up by PdfReader.
The method is something like this:
var lst = new ConcurrentBag<Test3>();
using(var reader = new PdfReader(byteArr))
{
Parallel.ForEach(areas, t =>
{
var pageSize = reader.GetPageSize(t.PageNumber);
var rectangle = GetRectangle(t.AreaData, pageSize);
var text = GetTextFromRectangle(reader, rectangle, t.PageNumber);
lst.Add(text);
});
}
public string GetTextFromRectangle(PdfReader reader, Rectangle rect, int pageNum)
{
RenderFilter[] filter = {
new RegionTextRenderText()
};
ITextExtractionStrategy strategy =
new FilteredTextRenderListener(new
LocationTextExtractionStrategy(), filter);
return PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
}
After you mentioned in a comment that there are
Approximately 900 rectangle areas per page
and added your GetTextFromRectangle code, the cause of the problem became clear: For each of your pre-defined rectangles you make iText parse the whole content of the page the rectangle is on into a filtered text extraction strategy you expect to be focused on the respective rectangle area.
By the way, even worse, I don't see you using the Rectangle rect parameter in your GetTextFromRectangle method, thus after all you actually do not even focus on the respective rectangle!
So you parse each page approximately 900 times, each time throwing away most of the parsed information, instead of only once and retrieving the text from the pre-parsed data from each of those 900 rectangles per page.
This is a waste of resources in its purest form!
What you should do instead is:
sort and separate your areas by their respective page, and
for each of the pages:
  once (and only once) parse the content of that page into an unfiltered LocationTextExtractionStrategy, and
  for each rectangle on that page, use the GetResultantText(TextChunkFilter) method of the strategy instance with a TextChunkFilter that filters by position (whether the chunk in question is inside the rectangle at hand) to retrieve the area text.
As an aside, in case of iText 7 instead of iText 5 (for .Net, formerly called iTextSharp) that GetResultantText overload with a TextChunkFilter is missing but you can emulate it, cf. this answer.
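The parse-once-per-page approach described above might look roughly like this for iText 5 for .NET. This is only a sketch: the Area type, the RectangleChunkFilter class, and the ReadAreas method are hypothetical names, and it assumes an iTextSharp 5.5.x version in which the filter interface is exposed as LocationTextExtractionStrategy.ITextChunkFilter and text chunks expose StartLocation and EndLocation. Check these member names against the version you use.

```csharp
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical input type: one rectangle to read on a given page.
public class Area
{
    public int PageNumber;
    public Rectangle Rect;
}

// Accepts only the text chunks that start and end inside the rectangle.
public class RectangleChunkFilter : LocationTextExtractionStrategy.ITextChunkFilter
{
    private readonly Rectangle rect;

    public RectangleChunkFilter(Rectangle rect) { this.rect = rect; }

    public bool Accept(LocationTextExtractionStrategy.TextChunk chunk)
    {
        return Contains(chunk.StartLocation) && Contains(chunk.EndLocation);
    }

    private bool Contains(Vector point)
    {
        float x = point[Vector.I1];
        float y = point[Vector.I2];
        return x >= rect.Left && x <= rect.Right && y >= rect.Bottom && y <= rect.Top;
    }
}

public static class AreaTextReader
{
    public static List<string> ReadAreas(byte[] pdfBytes, IEnumerable<Area> areas)
    {
        var results = new List<string>();
        using (var reader = new PdfReader(pdfBytes))
        {
            var parser = new PdfReaderContentParser(reader);
            // Group the areas by page so that each page is parsed exactly once.
            foreach (var pageGroup in areas.GroupBy(a => a.PageNumber).OrderBy(g => g.Key))
            {
                // Unfiltered strategy: it keeps every text chunk of the page.
                var strategy = parser.ProcessContent(pageGroup.Key, new LocationTextExtractionStrategy());
                // Query the already parsed chunks once per rectangle.
                foreach (var area in pageGroup)
                {
                    results.Add(strategy.GetResultantText(new RectangleChunkFilter(area.Rect)));
                }
            }
        }
        return results;
    }
}
```

With roughly 900 rectangles per page, this replaces 900 parses of each page with a single parse followed by 900 in-memory filter passes. Whether a chunk that only partly overlaps a rectangle should count is a design choice; the filter here requires both end points to be inside.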

How to determine whether a MS-Word paragraph is more than one line?

What I want is not to find where enter was pressed in a paragraph (the end of a paragraph). I need to determine whether a paragraph contains a single line or multiple lines so that it can be formatted accordingly (centered or left-justified).
For example, the paragraph should be centered if it fits on one line, or left-justified if it spans multiple lines.
How can I determine whether a paragraph is more than one line in VSTO?
Since "lines" are not objects in the Word object model, due to its dynamic layout algorithms, this needs to be approached via the old WordBasic technology still built into the APIs. (WordBasic worked based on selections, rather than objects, which is why this capability is present in these old methods.)
In this case, the Word.WdInformation enumeration offers parameters that work with "lines", more specifically for this problem wdFirstCharacterLineNumber.
The following sample code contains a code snippet that calls IsParaOneLine on a specific paragraph of a document.
IsParaOneLine duplicates the paragraph Range passed to it twice: once for the starting point and once for the end point. These Ranges are then collapsed to their starting and end points, respectively, and the line number of each is determined. If the two are the same, true is returned to the calling code, otherwise false.
Notes:
rngEnd.MoveEnd(Word.WdUnits.wdCharacter, -1); moves the end point back by one character because after collapsing to the end of a paragraph Range, the Range is at the start of the following paragraph. This moves it back to the original paragraph.
The example applies a style rather than "direct formatting". Rather than formatting with centered and left alignment throughout a document I strongly recommend using Styles. If there's not a built-in style with the formatting required, create the custom styles you need. If you're familiar with CSS you know the advantages of using styles. With Word there's an additional reason: it massively reduces the temp files Word generates so that you're less likely to run out of memory.
Word.Range rng = doc.Paragraphs[2].Range;
if (IsParaOneLine(rng))
{
rng.set_Style(Word.WdBuiltinStyle.wdStyleHeading1);
}
else
{
Debug.Print("Not one line");
}
public bool IsParaOneLine(Word.Range rng)
{
Word.Range rngStart = rng.Duplicate;
rngStart.Collapse(Word.WdCollapseDirection.wdCollapseStart);
Word.Range rngEnd = rng.Duplicate;
rngEnd.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
rngEnd.MoveEnd(Word.WdUnits.wdCharacter, -1);
int posLineStart = (int) rngStart.get_Information(Word.WdInformation.wdFirstCharacterLineNumber);
int posLineEnd = (int) rngEnd.get_Information(Word.WdInformation.wdFirstCharacterLineNumber);
bool isSameLine = false;
if (posLineStart == posLineEnd)
isSameLine = true;
return isSameLine;
}

iTextSharp - How to read all comments and reply from PDF in C#

Using the code below, I am able to read only one comment per page. How can I read all the comments from all the pages of a PDF? Or is there a way to get the list of all comments from a PDF in one shot?
for (int page = 1; page <= pdfRead.NumberOfPages; ++page)
{
PdfDictionary pagedic = pdfRead.GetPageN(page);
PdfArray annotarray = (PdfArray)PdfReader.GetPdfObject(pagedic.Get(PdfName.ANNOTS));
if (annotarray == null || annotarray.Size == 0)
continue;
string all_string = "";
foreach (PdfObject A in annotarray.ArrayList)
{
PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);
if (AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.TEXT))
{
all_string += AnnotationDictionary.GetAsString(PdfName.T).ToString() +"\n";
all_string += AnnotationDictionary.GetAsString(PdfName.CONTENTS).ToString()+ "\n";
}
}
}
Pages are organized in a page tree. Each leaf of the page tree refers to a page dictionary. The /Annots entry is one of the optional keys of a page dictionary. It contains an array of annotations that belong to a specific page.
You are looping over every page dictionary of every page in the page tree, retrieving the /Annots array of every page dictionary. This is the correct procedure.
Your question "How to read all the comments from all the pages from PDF?" is wrong. You are already reading all the comments from all the pages from a PDF the correct way. It is inherent to PDF that the annotations are organized in this way. Even if there was an iTextSharp method to give you all the annotations in one single method, it would use the exact same code you are using now. What would that gain you? It would take the same amount of processing time.

Get next file in the folder

When you open a picture in Windows Photo Viewer, you can navigate backward and forward between supported files using arrow keys (next photo / previous photo).
The question is: how to get path of the next file given path of the current file in the folder?
You can do this easily by getting all paths into a collection and keeping a counter. If you don't want to load all file paths into memory, you can use Directory.EnumerateFiles and the Skip method to get the next or previous file. For example:
int counter = 0;
string NextFile(string path, ref int counter)
{
var filePath = Directory.EnumerateFiles(path).Skip(counter).First();
counter++;
return filePath;
}
string PreviousFile(string path, ref int counter)
{
var filePath = Directory.EnumerateFiles(path).Skip(counter - 1).First();
counter--;
return filePath;
}
Of course you need some additional checks. For example, in NextFile you need to check whether you have reached the last file and, if so, reset the counter. Likewise, in PreviousFile you need to make sure the counter is not 0 and, if it is, return the first file, and so on.
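The checks described above can be sketched with the simpler "all paths in a collection" variant. The FolderCursor class below is a hypothetical helper, not part of any library. It assumes the folder is small enough to hold all paths in memory, sorts the names so the order is stable, wraps around after the last file, and stays on the first file when moving backward from it.

```csharp
using System;
using System.IO;

// Hypothetical helper: steps through the files of the folder that contains a given file.
public sealed class FolderCursor
{
    private readonly string[] files;
    private int index;

    public FolderCursor(string currentFilePath)
    {
        string fullPath = Path.GetFullPath(currentFilePath);
        files = Directory.GetFiles(Path.GetDirectoryName(fullPath));
        // Directory listings have no guaranteed order, so sort for stable navigation.
        Array.Sort(files, StringComparer.OrdinalIgnoreCase);
        index = Array.FindIndex(files,
            f => string.Equals(f, fullPath, StringComparison.OrdinalIgnoreCase));
        if (index < 0)
        {
            throw new FileNotFoundException("File is not in its folder listing.", fullPath);
        }
    }

    // Moves to the next file; after the last file it wraps around to the first.
    public string Next()
    {
        index = (index + 1) % files.Length;
        return files[index];
    }

    // Moves to the previous file; at the first file it stays where it is.
    public string Previous()
    {
        if (index > 0)
        {
            index--;
        }
        return files[index];
    }
}
```

A caller would create the cursor once from the path of the file being shown and then call Next() or Previous() on each key press.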
Given your concern about the large number of files in a given folder, and your desire to load them on demand, I'd recommend the following approach.
(Note: the suggestion of calling Directory.EnumerateFiles().Skip... in the other answer works, but it is not efficient, especially for directories with a large number of files, because every call restarts the enumeration from the beginning.)
// Local field to store the files enumerator;
IEnumerator<string> filesEnumerator;
// You would want to make this call, at appropriate time in your code.
filesEnumerator = Directory.EnumerateFiles(folderPath).GetEnumerator();
// You can wrap the calls to MoveNext, and Current property in a simple wrapper method..
// Can also add your error handling here.
public string GetNextFile()
{
if (filesEnumerator != null && filesEnumerator.MoveNext())
{
return filesEnumerator.Current;
}
// You can choose to throw exception if you like..
// How you handle things like this, is up to you.
return null;
}
// Call GetNextFile() whenever your user clicks the next button on your UI.
Edit: Previous files can be tracked in a linked list, as the user moves to next file.
The logic will essentially look like this -
Use the linked list for your previous and next navigation.
On initial load or click of Next, if the linked list, or its next node is null, then use the GetNextFile method above, to find the next path, display on UI, and add it to the linked list.
For Previous use the linked list to identify the previous path.
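The linked-list logic above could be sketched as follows. LazyFileNavigator is a hypothetical class name, and the choice to return null at either end of the list is an assumption; you may prefer to wrap around or throw instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: pulls paths from the directory on demand and
// remembers the ones already visited so the user can step back.
public sealed class LazyFileNavigator : IDisposable
{
    private readonly IEnumerator<string> filesEnumerator;
    private readonly LinkedList<string> visited = new LinkedList<string>();
    private LinkedListNode<string> current;

    public LazyFileNavigator(string folderPath)
    {
        filesEnumerator = Directory.EnumerateFiles(folderPath).GetEnumerator();
    }

    // Returns the next path, or null when the folder has no more files.
    public string Next()
    {
        // Reuse a path that was already discovered, if there is one ahead of us.
        if (current != null && current.Next != null)
        {
            current = current.Next;
            return current.Value;
        }
        // Otherwise ask the directory enumerator for one more path.
        if (filesEnumerator.MoveNext())
        {
            current = visited.AddLast(filesEnumerator.Current);
            return current.Value;
        }
        return null;
    }

    // Returns the previous path, or null when already at the first visited file.
    public string Previous()
    {
        if (current == null || current.Previous == null)
        {
            return null;
        }
        current = current.Previous;
        return current.Value;
    }

    public void Dispose()
    {
        filesEnumerator.Dispose();
    }
}
```

Only the paths the user has actually reached are kept in memory, so a folder with many files costs nothing up front.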

Locate only non-hidden elements using Selenium WebDriver in C#

I have a collection of records on a web page, and when a record is clicked, a 'Delete' link is displayed (actually 'unhidden', as it's always there).
When trying to access this 'Delete' link, I am using its value.
When I use Driver.FindElement, it returns the first Delete link, even though it's hidden, and therefore can't click it (and shouldn't as it is not the right link).
So, what I basically want to do is find only non-hidden links. The code below works, but as it iterates through every Delete link I am afraid it may be inefficient.
Is there a better way?
public class DataPageModel : BasePageModel
{
private static readonly By DeleteSelector = By.CssSelector("input[value=\"Delete\"]");
private IWebElement DeleteElement
{
get
{
var elements = Driver.FindElements(DeleteSelector);
foreach (var element in elements.Where(e => e.Displayed))
{
return element;
}
Assert.Fail("Could not locate a visible Delete Element");
return null;
}
}
}
While I agree with @Torbjorn that you should be wary about where you spend your time optimizing, I do think this code is a bit inefficient.
Basically, what is slowing the code down is the back-and-forth checking of each element to see if it's displayed. To speed up the code, you need to get the element you want in one go.
Two options (both involve javascript):
jQuery
Take a look at the different ways to bring jQuery selectors to Selenium (I wrote about it here). Once you have that, you can make use of jQuery's :visible selector.
Alternatively if you know for sure the page already has jQuery loaded and you don't want to do all the extra code, you can simply use ExecuteScript:
IWebElement element = (IWebElement)driver.ExecuteScript("return $('input[value=\"Delete\"]:visible').first().get(0)");
Javascript
If you want to avoid jQuery you can just write a javascript function to do the same thing you are doing now in C#: Get all the possible elements and return the first visible one.
Then you would do something similar:
string script = //your javascript
IWebElement element = (IWebElement)driver.ExecuteScript(script);
You trade off readability to different degrees depending on which option you pick, but they should all be more efficient. Of course, these all require that JavaScript be enabled in the browser.
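For the plain JavaScript option, a sketch could look like the following. VisibleElementFinder and FirstVisible are hypothetical names. The visibility test is an assumption modeled on the size-based check that jQuery's :visible uses, so it may disagree with Selenium's own Displayed property in some cases, for example for elements hidden with visibility: hidden.

```csharp
using OpenQA.Selenium;

// Hypothetical helper: finds the first visible match in one round trip to the browser.
public static class VisibleElementFinder
{
    // Runs in the browser: returns the first matching element that takes up space, or null.
    private const string Script = @"
        var candidates = document.querySelectorAll(arguments[0]);
        for (var i = 0; i < candidates.length; i++) {
            var el = candidates[i];
            if (el.offsetWidth > 0 || el.offsetHeight > 0 || el.getClientRects().length > 0) {
                return el;
            }
        }
        return null;";

    public static IWebElement FirstVisible(IWebDriver driver, string cssSelector)
    {
        var executor = (IJavaScriptExecutor)driver;
        // The selector is passed as an argument rather than concatenated into the script.
        return executor.ExecuteScript(Script, cssSelector) as IWebElement;
    }
}
```

It could be called as VisibleElementFinder.FirstVisible(Driver, "input[value=\"Delete\"]"), and the caller should handle a null result when no visible element exists.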
