PDFClown Find and replace page

Using PDF Clown, I was wondering what the best practice is for finding a page in an existing PDF document and replacing it with a page from another PDF document.
I have the bookmark and page label of both pages.

A simple example for replacing pages can be derived from the PageManager cli examples:
string inputA = @"A.pdf";
string inputB = @"B.pdf";
string output = @"A-withPage1FromB-simple.pdf";
org.pdfclown.files.File fileA = new org.pdfclown.files.File(inputA);
org.pdfclown.files.File fileB = new org.pdfclown.files.File(inputB);
// replace page 0 in fileA by page 0 from fileB
Document mainDocument = fileA.Document;
Bookmarks bookmarks = mainDocument.Bookmarks;
PageManager manager = new PageManager(mainDocument);
manager.Remove(0, 1);
manager.Add(0, fileB.Document.Pages.GetSlice(0, 1));
fileA.Save(output, SerializationModeEnum.Standard);
This indeed replaces the first page in A.pdf by the first page in B.pdf and saves the result as A-withPage1FromB-simple.pdf.
Unfortunately, though, the PageManager does not update bookmarks. In the result of the code above, therefore, there is still a bookmark which used to point to the original first page; as this page is no longer there, the bookmark now points nowhere. And the bookmark pointing to the first page in fileB is ignored completely.
Other document level, page related properties also are not transferred, e.g. the page label. In case of the page labels, though, the original label for the first page remains associated to the first page after replacement. This is due to a different kind of reference (by page number, not by object).

Related

Itext7 Reading a PDF in c# (with Hebrew)

I need some help.
I have a PDF, and I just need to read it and store its content in a DB.
For some reason, I couldn't find a simple example of doing that using iText 7.
Another thing: the content is in Hebrew. At first I used iTextSharp, but the content I got was in reverse order, so I have two options:
1. fix the reversing code
2. find more standard code, maybe in iText 7, which doesn't have this problem.
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
var res = ConvertToHebrew(currentText);
text.Append(res);
}
pdfReader.Close();
}
The ConvertToHebrew function does not work perfectly for me, so I hope to find something that works without me having to fix things.

If the PDF document that contains right to left scripts like Hebrew or Arabic is properly formed, then the content stream of the page will contain /ReversedChars instructions that wrap right-to-left text snippets. iText 7 is able to deal with such instructions and extract right to left text correctly from properly formed documents.
This functionality is implemented as a part of LocationTextExtractionStrategy. To use it you basically have to replace SimpleTextExtractionStrategy with LocationTextExtractionStrategy in your code. You should also call SetRightToLeftRunDirection(true) for the new LocationTextExtractionStrategy instance but you should notice the difference in the result even without this flag.
That being said, if the document was formed improperly (or not completely properly depending on how you consider it) and does not contain ReversedChars instructions then iText 7 cannot help you at the moment. At some point extraction of right to left scripts even for not completely proper PDFs will likely be possible with iText 7 but this is something for the future.
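The strategy swap described above might look roughly like the following sketch. It assumes the iText 7 for .NET packages are referenced; the class name HebrewTextReader and the method name ExtractText are made up for illustration, and the result still depends on the PDF containing ReversedChars instructions as explained above.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, not part of iText.
static class HebrewTextReader
{
    public static string ExtractText(string fileName)
    {
        var text = new StringBuilder();
        var pdfDocument = new PdfDocument(new PdfReader(fileName));
        try
        {
            for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
            {
                // A fresh strategy per page, so text from earlier pages is not carried over.
                var strategy = new LocationTextExtractionStrategy();
                strategy.SetRightToLeftRunDirection(true);
                text.AppendLine(
                    PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), strategy));
            }
        }
        finally
        {
            pdfDocument.Close();
        }
        return text.ToString();
    }
}
```

No extra encoding conversion or manual reversal step is applied here; whether the output needs further handling depends on the document.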

How can I export a piece of a DOCX file and keep the same paragraph numbering?

TL;DR:
How can I capture the paragraph numbering as a 'part' of the text and export it to a DOCX?
Problem
I have a document that's split into sections and sub-sections that reads similarly to a set of state statutes (Statute 208, with subsections Statute 208.1, Statute 208.2, etc.). We created this by modifying the numbering.xml file within the .docx zip.
I want to export a 'sub-section' (208.5) and its text to a separate .docx file. My VSTO add-in exports the text well enough, but the numbering resets to 208.1. This does make some sense as it's now the first paragraph with that <ilvl> in the document.
PDF works okay
Funnily enough, I'm able to call Word.Range's ExportAsFixedFormat function and export this selection to PDF just fine - even retaining the numbering. This led me down a path of trying to 'render' the selection, possibly as it would be printed, in order to throw it into a new .docx file, but I haven't figured that out, either.
What I've tried:
Range.ExportFragment() using both wdFormatStrictOpenXMLDocument and wdFormatDocumentDefault as the wdSaveType values.
These export but also reset the numbering.
Document.PrintOut() using PrintToFile = true and a valid filename. I realize now that this, quite literally, generates 'printout instructions' and won't inject a new file at path filename with any valid file structure.
Plainly doesn't work. :)
Application.Selection.XML to a variable content and calling Document.Content.InsertXML(content) on a newly added Document object.
Still resets the numbering.
Code Section for Context
using Word = Microsoft.Office.Interop.Word;
Word.Range range = Application.ActiveDocument.Range(startPosition, endPosition);
range.Select();
//export to DOCX?
Application.Selection.Range.ExportFragment(
filename, Word.WdSaveFormat.wdFormatDocumentDefault);

You could use ConvertNumbersToText(wdNumberAllNumbers) before exporting, then _Document.Undo() or close without saving after the export.
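A minimal sketch of that suggestion in C#, assuming C# 4 or later (so the COM ref arguments can be omitted) and assuming the fragment consists of whole paragraphs. The class and method names are hypothetical, and it is also an assumption that Word records the conversion as a single undo step; check the document state after the call.

```csharp
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of the Word object model.
static class FragmentExporter
{
    public static void ExportWithLiteralNumbers(Word.Document doc, Word.Range range, string filename)
    {
        // Turn automatic list numbers into plain text so they survive the export.
        doc.ConvertNumbersToText(Word.WdNumberType.wdNumberAllNumbers);
        try
        {
            // The number text is inserted at the start of each paragraph,
            // so widen the range to whole paragraphs to be sure it is included.
            range.Expand(Word.WdUnits.wdParagraph);
            range.ExportFragment(filename, Word.WdSaveFormat.wdFormatDocumentDefault);
        }
        finally
        {
            // Restore automatic numbering in the source document
            // (assumes the conversion was one undo step).
            doc.Undo(1);
        }
    }
}
```

Closing the source document without saving, as mentioned above, avoids relying on the undo stack at all.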

There is some good information at this (dated) link that still should work with current Word APIs:
https://forums.windowssecrets.com/showthread.php/27711-Determining-which-ListTemplates-item-is-in-use-(VBA-Word-2000)
Information at that link suggests that you can create a name/handle for your ListTemplate so that you can reference it in code--as long as your statute-style bullets are associated with a named style for the document. The idea is to first name the ListTemplate that's associated with the statute bullet style for the active document and then reference that name when accessing the ListLevels collection.
For instance, you could have code that looks something like this:
ActiveDocument.Styles("StatutesBulletStyle").ListTemplate.Name = "StatuteBulletListTemplate"
After the above assignment, you can refer to the template by name:
ActiveDocument.ListTemplates("StatuteBulletListTemplate").ListLevels(1).StartAt = 5
Using the above technique no longer requires that you try to figure out what the active template is...
Does that help?

iTextSharp - PDF Bookmark not pointing to a page

I have built a tree view to show the bookmarks of a given PDF document.
Using iTextSharp I get the bookmarks in a List object and use the Title value to show on the tree view, no problem.
The problem comes when I want the tree view node to reference a page number in the PDF document.
Some PDF documents have a value for Title, Page and Action for example:
Title: "Title Page",
Page: "1 XYZ -3 845 1.0",
Action: "GoTo"
However, others are in this format:
Title: "Title Page",
Named: "G1.1009819",
Action: "GoTo"
I have no idea what to do with the "Named" value. I have tried going through all the links in the document and comparing the value to the destination value of the link, but with no luck.
Does anyone know what this "Named" property represents?

It's a named destination, see the keyword list for some examples. It is a very common way to mark destinations in a document.
What do you want to do with the named destinations?
Do you want to consolidateNamedDestinations() so that they are no longer named destinations, but links to the specific place in the document?
Or do you want to create a link to a named destination? (That's probably more work. I don't think there are examples at hand.)
If you browse the examples, you'll discover the LinkActions where we use the SimpleNamedDestination object to retrieve the named destinations almost the same way you retrieve bookmarks using the SimpleBookmark class.
This code snippet gives us the named destinations in the form of an XML file:
public void createXml(String src, String dest) throws IOException {
PdfReader reader = new PdfReader(src);
HashMap<String,String> map = SimpleNamedDestination.getNamedDestination(reader, false);
SimpleNamedDestination.exportToXML(map, new FileOutputStream(dest),
"ISO8859-1", true);
reader.close();
}
See destinations.xml for the result.
The code is much easier because the structure isn't nested: each name corresponds with a single destination.
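For the tree view in the question, the name-to-destination map can be used to turn a "Named" bookmark into a page number. The sketch below is plain C# with no iTextSharp dependency. It assumes the map would come from SimpleNamedDestination.GetNamedDestination(reader, false) and that its values have the same shape as the "Page" entry of a bookmark (page number first, then the destination type and coordinates). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of iTextSharp.
static class BookmarkPageResolver
{
    // Returns the 1-based page number a bookmark points to, or null if it cannot be resolved.
    public static int? ResolvePage(
        IDictionary<string, object> bookmark,
        IDictionary<string, string> namedDestinations)
    {
        string destination = null;
        object value;

        if (bookmark.TryGetValue("Page", out value))
        {
            // Explicit destination, e.g. "1 XYZ -3 845 1.0".
            destination = value as string;
        }
        else if (bookmark.TryGetValue("Named", out value))
        {
            // Named destination: look the name up in the map.
            string name = value as string;
            if (name != null)
            {
                namedDestinations.TryGetValue(name, out destination);
            }
        }

        if (string.IsNullOrWhiteSpace(destination))
        {
            return null;
        }

        // The first space-separated token is assumed to be the page number.
        string firstToken = destination.Trim().Split(' ')[0];
        int page;
        return int.TryParse(firstToken, NumberStyles.None, CultureInfo.InvariantCulture, out page)
            ? page
            : (int?)null;
    }
}
```

Each tree node could then store the resolved page, falling back to "no target" when the result is null (for example for bookmarks with a URI action).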

Footer still active after reading it (rtf formular file)

An RTF document is generated by a database application, using information from that database. I have written a program (C#, .NET Framework 4.5) that picks up the data and records it in an Excel file.
I have to read the footer of the RTF file, which I can do.
But when the program accesses the footer, the document view switches to the state it has when the header/footer is active (the same effect as double-clicking the header/footer in Word). This action adds a carriage return to the header (Word adds it so that something can be typed), and this \r produces an additional page.
Here is the code:
Sections oSection = cGlobalVar.varWordApp.ActiveDocument.Sections;
HeaderFooter oFooter = oSection[1].Footers[WdHeaderFooterIndex.wdHeaderFooterFirstPage];
Range oRange = oFooter.Range.Tables[1].Range;//<= at this point, footer is accessible, the empty header of original document has a\r character, causing 2nd page to document that I don't want
strBuffer = oRange.Text;//<= information I need
oRange = oSection[1].Range.Tables[1].Range;//<= try to affect something else to oRange
oFooter = null;//<= try to null the object
oSection = null;//<= same as above
//cGlobalVar.varWordDoc.ActiveWindow.View.Type = WdViewType.wdPrintView;//<= try to use this to return to a normal state
I have tried to manipulate Word to find something to get back to my original document (one page), but without any success.

Nulling the object won't clear its content. If you want to clear it, change the text of the range object
oFooter.Range.Text = "";
oSection[1].Range.Text = "";
Note: These objects have a reference type. This means that the variable points to the actual object, which is somewhere else. If you set the variable to null, you are just losing the link to the object, but you are not changing the object. See my answer to the SO question Setting a type reference type to null doesn't affect copied type?
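The point about reference types can be seen without Word at all. The following is a small stand-alone sketch in which a StringBuilder stands in for the footer object; the names are made up for illustration.

```csharp
using System.Text;

static class NullReferenceDemo
{
    public static string Run()
    {
        var original = new StringBuilder("footer text");

        // Second variable pointing at the same object.
        StringBuilder alias = original;

        // Only the variable is cleared; the object itself is untouched.
        alias = null;

        return original.ToString();
    }
}
```

Calling NullReferenceDemo.Run() returns "footer text", because setting alias to null only drops one reference to the object.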
UPDATE
I ran an experiment in Word, using a VBA macro that reads the table range of the footer as you did above. It does not change the view type of Word.
Sub Macro1()
Dim oSection As Sections
Dim oFooter As HeaderFooter
Dim oRange As Range
Dim strBuffer As String
Set oSection = Application.ActiveDocument.Sections
Set oFooter = oSection(1).Footers(WdHeaderFooterIndex.wdHeaderFooterPrimary)
Set oRange = oFooter.Range.Tables(1).Range
strBuffer = oRange.Text
Debug.Print strBuffer
End Sub

Interop Word - Delete Page from Document

What is the easiest and most efficient way to delete a specific page from a Document object using the Word Interop Libraries?
I have noticed there is a Pages property that extends/implements IEnumerable. Can one simply remove the elements in the array and the pages will be removed from the Document?
I have also seen the Ranges and Section examples, but they don't look very elegant to use.
Thanks.

The short answer to your question is that there is no elegant way to do what you are trying to achieve.
Word heavily separates the content of a document from its layout. As far as Word is concerned, a document doesn't have pages; rather, pages are something derived from a document by viewing it in a certain way (e.g. print view). The Pages collection belongs to the Pane interface (accessed, for example, by Application.ActiveWindow.ActivePane), which controls layout. Consequently, there are no methods on Page that allow you to change (or delete) the content that leads to the existence of the page.
If you have control over the document(s) that you are processing in your code, I suggest that you define sections within the document that represent the parts you want to programmatically delete. Sections are a better construct because they represent content, not layout (a section may, in turn, contain page breaks). If you were to do this, you could use the following code to remove a specific section:
object missing = Type.Missing;
foreach (Microsoft.Office.Interop.Word.Section section in doc.Sections) {
if (/* some criteria */) {
section.Range.Delete(ref missing, ref missing);
break;
}
}

One possible option is to bookmark the whole pages (Select the whole page, go to Tools | Insert Bookmark then type in a name). You can then use the Bookmarks collection of the Document object to refer to the text and delete it.
Alternatively, try the C# equivalent of this code:
Doc.ActiveWindow.Selection.GoTo wdGoToPage, wdGoToAbsolute, PageNumber
Doc.Bookmarks("\Page").Range.Text = ""
The first line moves the cursor to page "PageNumber". The second one uses a predefined bookmark which always refers to the page the cursor is currently on, including the page break at the end of the page if it exists.
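A possible C# rendering of those two lines, as a sketch only: it assumes C# 4 or later (named arguments and omitted ref parameters on COM interfaces) and a document that is open in a window. The class and method names are hypothetical.

```csharp
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of the Word object model.
static class PageRemover
{
    // Clears the content of one page; pageNumber is 1-based.
    public static void DeletePage(Word.Document doc, int pageNumber)
    {
        // Move the cursor to the requested page.
        doc.ActiveWindow.Selection.GoTo(
            What: Word.WdGoToItem.wdGoToPage,
            Which: Word.WdGoToDirection.wdGoToAbsolute,
            Count: pageNumber);

        // "\Page" is the predefined bookmark covering the page the cursor is on.
        doc.Bookmarks[@"\Page"].Range.Text = "";
    }
}
```

Because pages are derived from layout, deleting one page can reflow the following ones, so when removing several pages it is probably safer to work from the last page backwards.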
