I've parsed html into a PDF and created a table of contents from the Header tags. The bookmarks in the document work fine, but clicking on the line in the table of contents doesn't do anything. The cursor doesn't change icons like it does if I put a URL in the link.
I used iText RUPS to inspect the final PDF, and the named destinations are present in the file.
I tried hard-coding a couple of the names just to see what happens, but they didn't work either. Using PdfAction.CreateURI with google.com works fine.
The one thing I'm doing that may or may not be an issue is I'm creating the body document, then creating the table of contents and merging the two documents.
Maybe Bruno can make a cameo on this one.
private static List ProcessOutlineChildren(PdfDocument pdfDocument, List tableOfContents, IEnumerable<PdfOutline> pdfOutlines, IDictionary<String, PdfObject> names = null)
{
    List<TabStop> tabStops = new List<TabStop>();
    tabStops.Add(new TabStop(580, TabAlignment.RIGHT));
    foreach (var o in pdfOutlines)
    {
        ListItem currentOutlineItem = new ListItem();
        Paragraph paragraph = new Paragraph();
        paragraph.AddTabStops(tabStops);
        paragraph.Add(o.GetTitle());
        paragraph.Add(new Tab());
        paragraph.Add((pdfDocument.GetPageNumber((PdfDictionary) o.GetDestination().GetDestinationPage(names))).ToString());
        paragraph.SetAction(PdfAction.CreateGoTo(o.GetDestination()));
        currentOutlineItem.Add(paragraph);
        if (o.GetAllChildren().Any())
        {
            currentOutlineItem.Add(ProcessOutlineChildren(pdfDocument, new List(), o.GetAllChildren(), names));
        }
        tableOfContents.Add(currentOutlineItem);
    }
    return tableOfContents;
}
public class CustomOutlineHandler : OutlineHandler
{
    // PDFs require a unique name for each destination; this is how the actions/bookmarks jump to a location.
    protected override string GenerateUniqueDestinationName(IElementNode element)
    {
        string destinationName = base.GenerateUniqueDestinationName(element);
        if ("p".Equals(element.Name()))
        {
            destinationName = destinationName.Replace(GetDestinationNamePrefix(), "paragraph-prefix-");
        }
        return destinationName;
    }
}
// From my main method, which converts things into PDF.
OutlineHandler customOutlineHandler = new CustomOutlineHandler().PutAllTagPriorityMappings(priorityMappings);
customOutlineHandler.SetDestinationNamePrefix("destination-name-");
properties.SetOutlineHandler(customOutlineHandler);
Earlier versions of the API had an Anchor object that worked as a link inside the text. I have been looking through many tutorials for a way to add a TOC that takes the reader from the index entry to the page where the title is.
I am generating PDFs from XML documents, so the TOC varies with the XML content.
I started with iText7 for C# from zero and, thanks to the tutorials at itextpdf.com, some Stack Overflow answers, and many other sources, I can already do almost everything I want except this. I am quite confused, though, because I only find references to iText5 or older, or iText7 Java references that I am sometimes unable to "translate" to C#.
I have tried setting a property on the Cell that contains the title and the page I want to go to.
I also found some posts about adding the content from files. My code already does that, so I need to deal with strings directly, and not even with an array or similar, since I work string by string.
For example: my program receives the XML value "Chapter 1" from a node and stores it in a string. I would then add this string both to the relevant paragraph and to the TOC.
It is done this way because the XML file is really complex (848 lines in one file, for example).
Paragraph indice = new Paragraph(new Text("Here the TOC line"));
Paragraph title = new Paragraph(new Text("Here the title"));
/* This goes on page 20. I am still investigating how to find the page (for example, by searching the document for the string and getting its page). */
indice.SetProperty(Property.ACTION, PdfAction.CreateGoToR(@"C:\aplic\pdfPruebasIText\pdfPruebasIText\docPrueba.pdf", 20));
docPDF.Add(indice); // docPDF is a Document; the TOC goes first
// More content
docPDF.Add(title); // What I want indexed
I add more text, paragraphs, tables, etc. to my docPrueba.pdf document; I only included what I think are the main lines for my problem. If necessary, I will add more.
In theory, clicking the entry should go to page 20 of my document docPrueba.pdf.
In practice, it does nothing except change the shape of the mouse pointer.
No error messages are shown and no exception is thrown. Everything I added is generated; only this link fails.
It seems the actual problem was that I was opening the document in a web browser instead of Adobe's PDF software.
I feel ashamed about this...
I will answer again with the whole solution, since I think it will be useful for iText7 beginners like me.
Based on this thread: iText7 Table of Content
I wrote code that makes it possible to add lines to the TOC while working on the document.
The main problem now is that the TOC is created AT THE END of the document (whoops!), so I am working on solving that.
For now, here is my modified code:
string tit0 = "Titulo 0";//Title strings to add to a Dictionary
string tit1 = "Titulo 1";
string tit2 = "Titulo 2";
Dictionary<string, KeyValuePair<string, int>> indiceRelaciones = new Dictionary<string, KeyValuePair<string, int>>();
indiceRelaciones=EscribirIndicePDF(innerDoc, DocPDF, tit0, indiceRelaciones);//HERE the title is added to the document and the dictionary flushed
DocPDF.Add(parrafo1);//Paragraphs, tables and stuff
DocPDF.Add(tabla1);
DocPDF.Add(tabla2);
indiceRelaciones = EscribirIndicePDF(innerDoc, DocPDF, tit1, indiceRelaciones);
DocPDF.Add(parrafo2);
indiceRelaciones = EscribirIndicePDF(innerDoc, DocPDF, tit2, indiceRelaciones);
DocPDF.Add(parrafo3);
crearIndice(innerDoc, DocPDF, indiceRelaciones);//This method generates the TOC
DocPDF.Close();
// Note: `outline` is not declared in this snippet; it is presumably a static PdfOutline field of the enclosing class.
private static Dictionary<string, KeyValuePair<string, int>> EscribirIndicePDF(PdfDocument innerDoc, Document docPDF, string tit, Dictionary<string, KeyValuePair<string, int>> indice)
{
    Paragraph titulo = new Paragraph().Add(new Text(tit));
    titulo.SetKeepTogether(true);
    outline = CreateOutline(outline, innerDoc, tit);
    KeyValuePair<string, int> paginaTitulo = new KeyValuePair<string, int>(tit, innerDoc.GetNumberOfPages());
    titulo.SetFontColor(ColorConstants.BLUE).SetDestination(tit).SetKeepWithNext(true).SetNextRenderer(new UpdatePageRenderer(titulo, paginaTitulo));
    docPDF.Add(titulo);
    indice.Add(tit, paginaTitulo);
    return indice;
}
private static void crearIndice(PdfDocument innerDoc, Document docPDF, Dictionary<string, KeyValuePair<string, int>> indiceRelaciones)
{
    // This method builds the index on the first page! Do not touch it. The other method is the one that gets modified so that it is called and adds entries to the dictionary.
    List<TabStop> tabStops = new List<TabStop>();
    tabStops.Add(new TabStop(580, TabAlignment.RIGHT, new DottedLine()));
    foreach (KeyValuePair<string, KeyValuePair<string, int>> entrada in indiceRelaciones)
    {
        KeyValuePair<string, int> texto = entrada.Value;
        string claveEntrada = entrada.Key;
        string claveTexto = texto.Key;
        string valorTexto = texto.Value.ToString();
        Paragraph p = new Paragraph().AddTabStops(tabStops).Add(claveTexto).Add(new Tab()).Add(valorTexto).SetAction(PdfAction.CreateGoTo(claveEntrada));
        docPDF.Add(p);
    }
}
private static PdfOutline CreateOutline(PdfOutline outline, PdfDocument innerDoc, string line)
{
    if (outline == null)
    {
        outline = innerDoc.GetOutlines(false);
        outline = outline.AddOutline(line);
        outline.AddDestination(PdfDestination.MakeDestination(new PdfString(line)));
        return outline;
    }
    PdfOutline hijo = outline.AddOutline(line);
    hijo.AddDestination(PdfDestination.MakeDestination(new PdfString(line)));
    return outline;
}
internal class UpdatePageRenderer : ParagraphRenderer
{
    // KeyValuePair is an immutable struct, so this field is a private copy of the entry.
    private KeyValuePair<string, int> Entrada;
    public UpdatePageRenderer(Paragraph modelElement, KeyValuePair<string, int> entrada) : base(modelElement)
    {
        Entrada = entrada;
    }
    public override LayoutResult Layout(LayoutContext layoutContext)
    {
        LayoutResult result = base.Layout(layoutContext);
        int pagina = layoutContext.GetArea().GetPageNumber();
        // Only the local copy is replaced here; the dictionary entry added in EscribirIndicePDF keeps its original page number.
        Entrada = new KeyValuePair<string, int>(Entrada.Key, pagina);
        return result;
    }
}
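One thing to keep in mind with the renderer above: `KeyValuePair<TKey, TValue>` is an immutable struct in .NET, so the renderer can only replace its own copy, and the dictionary that `crearIndice` reads later never sees the new page number. Below is a minimal, library-free sketch of the difference. `TocEntry` and `TocEntryDemo` are hypothetical names used only for illustration; they are not part of iText or of the original code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable holder. It is a class, so everyone holding the reference sees updates.
public sealed class TocEntry
{
    public string Title { get; }
    public int Page { get; set; }

    public TocEntry(string title, int page)
    {
        Title = title;
        Page = page;
    }
}

public static class TocEntryDemo
{
    // Mimics the pattern from the post: store a KeyValuePair, then replace the local copy.
    public static int PageSeenByDictionaryWithStruct(int initialPage, int laidOutPage)
    {
        var index = new Dictionary<string, KeyValuePair<string, int>>();
        var entry = new KeyValuePair<string, int>("Titulo 0", initialPage);
        index.Add(entry.Key, entry); // the dictionary stores its own copy of the struct

        // Equivalent to what Layout() does: only the local variable changes.
        entry = new KeyValuePair<string, int>(entry.Key, laidOutPage);

        return index["Titulo 0"].Value;
    }

    // Same flow with a reference type: the dictionary and the caller share one object.
    public static int PageSeenByDictionaryWithClass(int initialPage, int laidOutPage)
    {
        var index = new Dictionary<string, TocEntry>();
        var entry = new TocEntry("Titulo 0", initialPage);
        index.Add(entry.Title, entry);

        entry.Page = laidOutPage; // mutates the shared object

        return index["Titulo 0"].Page;
    }
}
```

With an initial page of 1 and a laid-out page of 5, the struct version still reads 1 from the dictionary, while the class version reads 5. Passing a holder like `TocEntry` to the renderer instead of a `KeyValuePair` would be one way to let the layout pass report the real page.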
I apologize for using Spanish names. I will edit this post to use English names so that everybody can understand the code.
And now...
I will try to get the TOC onto the first page (where it belongs).
I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a succinct way to pull a specific page's worth of text, given a string that should appear only on that page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber
{
    ExtractionOptions =
    {
        FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
    }
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: recreate the TextAbsorber for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps the text from previous iterations.
public List<string> ProcessPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    foreach (Page page in pdfDocument.Pages)
    {
        var textAbsorber = new TextAbsorber
        {
            ExtractionOptions =
            {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        page.Accept(textAbsorber);
        var ext = textAbsorber.Text;
        var exts = ext.Replace("\n", "").Split('\r').ToList();
        if (ext.Contains(find))
            return exts;
    }
    return null;
}
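The effect of reusing one absorber can be illustrated without Aspose. In the sketch below, `PageTextCollector` is a hypothetical stand-in for a stateful absorber that appends whatever it visits. It is not an Aspose type, and the method names are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a stateful text absorber: every Accept call appends to one buffer.
public sealed class PageTextCollector
{
    private readonly StringBuilder _buffer = new StringBuilder();

    public void Accept(string pageText) => _buffer.Append(pageText);

    public string Text => _buffer.ToString();
}

public static class PageSearchDemo
{
    // One collector for all pages: the returned text also contains every earlier page.
    public static string TextWithSharedCollector(IReadOnlyList<string> pages, string find)
    {
        var collector = new PageTextCollector();
        foreach (string page in pages)
        {
            collector.Accept(page);
            if (collector.Text.Contains(find))
                return collector.Text;
        }
        return null;
    }

    // A fresh collector per page: the returned text is only the matching page.
    public static string TextWithFreshCollector(IReadOnlyList<string> pages, string find)
    {
        foreach (string page in pages)
        {
            var collector = new PageTextCollector();
            collector.Accept(page);
            if (collector.Text.Contains(find))
                return collector.Text;
        }
        return null;
    }
}
```

For two pages "alpha" and "beta" and the search string "beta", the shared collector returns both pages' text joined together, while the per-page collector returns only "beta". That is the behaviour the answer works around by constructing the absorber inside the loop.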
Using C#, I need to pull data from a word document. I have NetOffice for word installed in the project. The data is in two parts.
First, I need to pull data from the document settings.
Second, I need to pull the content of controls in the document. The content of the fields includes checkboxes, a date, and a few paragraphs. The input method is via controls, so there must be some way to interact with the controls via the api, but I don't know how to do that.
Right now, I've got the following code to pull the flat text from the document:
private static string wordDocument2String(string file)
{
    NetOffice.WordApi.Application wordApplication = new NetOffice.WordApi.Application();
    NetOffice.WordApi.Document newDocument = wordApplication.Documents.Open(file);
    string txt = newDocument.Content.Text;
    wordApplication.Quit();
    wordApplication.Dispose();
    return txt;
}
So the question is: how do I pull the data from the controls from the document, and how do I pull the document settings (such as the title, author, etc. as seen from word), using either NetOffice, or some other package?
I did not bother to implement NetOffice, but the commands should mostly be the same (except probably for implementation and disposal methods).
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
string file = "C:\\Hello World.docx";
Microsoft.Office.Interop.Word.Document doc = word.Documents.Open(file);
// look for a specific type of Field (there are about 200 to choose from).
foreach (Field f in doc.Fields)
{
    if (f.Type == WdFieldType.wdFieldDate)
    {
        //do something
    }
}
// example of the myriad properties that could be associated with "document settings"
WdProtectionType protType = doc.ProtectionType;
if (protType.Equals(WdProtectionType.wdAllowOnlyComments))
{
    //do something else
}
The MSDN reference on Word Interop is where you will find information on just about anything you need access to in a Word document.
UPDATE:
After reading your comment, here are a few document settings you can access:
string author = doc.BuiltInDocumentProperties["Author"].Value; // C# indexes with brackets; this relies on the property being exposed as dynamic (interop types embedded)
string name = doc.Name; // this gives you the file name.
// not clear what you mean by "title"
As for understanding what text you are getting from a "legacy control", I need more information about exactly what kind of control you are extracting from. Try getting the name of the control/textbox/form/etc. from within the document itself and then search for that property online.
As a stab in the dark, here is an (incomplete) example of getting text from textboxes in the document:
List<string> textBoxText = new List<string>();
foreach (Microsoft.Office.Interop.Word.Shape s in doc.Shapes)
{
    textBoxText.Add(s.TextFrame.TextRange.Text); // this could result in an error if there are shapes that don't contain text.
}
Another possibility is Content Controls, of which there are several types. They are often used to gather user input.
Here is some code to catch a rich text Content Control:
List<string> contentControlText = new List<string>();
foreach (ContentControl CC in doc.ContentControls)
{
    if (CC.Type == WdContentControlType.wdContentControlRichText)
    {
        contentControlText.Add(CC.Range.Text);
    }
}
I have a requirement where I would like users to type some string tokens into a Word document so that they can be replaced via a C# application with some values. So say I have a document as per the image
Now using the SDK I can read the document as follows:
private void InternalParseTags(WordprocessingDocument aDocumentToManipulate)
{
    StringBuilder sbDocumentText = new StringBuilder();
    using (StreamReader sr = new StreamReader(aDocumentToManipulate.MainDocumentPart.GetStream()))
    {
        sbDocumentText.Append(sr.ReadToEnd());
    }
however as this comes back as the raw XML I cannot search for the tags easily as the underlying XML looks like:
<w:t><:</w:t></w:r><w:r w:rsidR="002E53FF" w:rsidRPr="000A794A"><w:t>Person.Meta.Age
(and obviously is not something I would have control over) instead of what I was hoping for namely:
<w:t><: Person.Meta.Age
OR
<w:t><: Person.Meta.Age
So my question is how do I actually work on the string itself namely
<: Person.Meta.Age :>
and still preserve formatting etc. so that when I have replaced the tokens with values I have:
Note: Bolding of the value of the second token value
Do I need to iterate document elements or use some other approach? All pointers greatly appreciated.
This is a bit of a thorny problem with OpenXML. The best solution I've come across is explained here:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
Basically Eric expands the content such that each character is in a run by itself, then looks for the run that starts a '<:' sequence and then the end sequence. Then he does the substitution and recombines all runs that have the same attributes.
The example is for PowerPoint, which is generally much less content-intensive, so performance might be a factor in Word; I expect there are ways to narrow down the scope of paragraphs or whatever you have to blow up.
For example, you can extract the text of the paragraph to see if it includes any placeholders and only do the expand/replace/condense operation on those paragraphs.
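The narrowing idea in the last sentence can be sketched with nothing but `System.Xml.Linq`, working on the raw WordprocessingML. This is not Eric's expand/replace/condense implementation. It simply joins the text of all `w:t` elements in a paragraph, skips paragraphs that contain no placeholder, and writes the replaced text into the first `w:t`, so formatting carried by later runs of that paragraph is lost. The `<: name :>` token syntax comes from the question; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ParagraphTokenReplacer
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Matches "<: Some.Name :>" and captures the name between the delimiters.
    private static readonly Regex Token = new Regex(@"<:\s*([A-Za-z0-9_.]+)\s*:>");

    // Replaces known tokens paragraph by paragraph; returns the number of paragraphs changed.
    public static int ReplaceTokens(XElement body, IDictionary<string, string> values)
    {
        int changed = 0;
        // ToList() so the tree can be edited safely while looping.
        foreach (XElement paragraph in body.Descendants(W + "p").ToList())
        {
            List<XElement> texts = paragraph.Descendants(W + "t").ToList();
            string joined = string.Concat(texts.Select(t => t.Value));

            // Cheap pre-check: most paragraphs have no placeholder and are left alone.
            if (texts.Count == 0 || !Token.IsMatch(joined))
                continue;

            // Unknown token names are left in place unchanged.
            string replaced = Token.Replace(joined, m =>
                values.TryGetValue(m.Groups[1].Value, out string value) ? value : m.Value);
            if (replaced == joined)
                continue;

            // Simplification: all text moves into the first w:t; the other w:t elements are emptied.
            texts[0].Value = replaced;
            texts[0].SetAttributeValue(XNamespace.Xml + "space", "preserve");
            foreach (XElement extra in texts.Skip(1))
                extra.Value = string.Empty;
            changed++;
        }
        return changed;
    }
}
```

Only the paragraphs that pass the pre-check are rewritten, which keeps the cost proportional to the number of placeholders rather than to the size of the document. A version that preserves per-run formatting would still need the run-splitting approach described above.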
Instead of doing find/replace of tokens directly with OpenXML, you could use a third-party OpenXML-based templating toolkit, which is trivial to use and can soon pay for itself.
As Scanny pointed out, OpenXML is full of nasty details that one has to master one by one. The learning curve is long and steep. If you want to become an OpenXML guru, then go for it and start climbing. If you want to have time for a decent social life, there are alternatives: just pick a third-party toolkit that is based on OpenXML. I've evaluated Docentric Toolkit. It offers a template-based approach: you prepare a template, which is a file in Word format containing placeholders for data that gets merged in by the application at runtime. The templates can use any formatting that MS Word supports, as well as conditional content, tables, etc.
You can also create or change a document using a DOM approach. The final document can be .docx or .pdf.
Docentric is a licensed product, but you will soon recover the cost through the time you save by using a tool like this.
If you will be running your application on a server, don't use interop - see this link for more details: (http://support2.microsoft.com/kb/257757).
Here is some code I slapped together pretty quickly to account for tokens spread across runs in the xml. I don't know the library much, but was able to get this to work. This could use some performance enhancements too because of all the looping.
/// <summary>
/// Iterates through texts, concatenates them and looks for tokens to replace
/// </summary>
/// <param name="texts"></param>
/// <param name="tokenNameValuePairs"></param>
/// <returns>T/F whether a token was replaced. Should loop this call until it returns false.</returns>
private bool IterateTextsAndTokenReplace(IEnumerable<Text> texts, IDictionary<string, object> tokenNameValuePairs)
{
    List<Text> tokenRuns = new List<Text>();
    string runAggregate = String.Empty;
    bool replacedAToken = false;
    foreach (var run in texts)
    {
        if (run.Text.Contains(prefixTokenString) || runAggregate.Contains(prefixTokenString))
        {
            runAggregate += run.Text;
            tokenRuns.Add(run);
            if (run.Text.Contains(suffixTokenString))
            {
                if (possibleTokenRegex.IsMatch(runAggregate))
                {
                    string possibleToken = possibleTokenRegex.Match(runAggregate).Value;
                    string innerToken = possibleToken.Replace(prefixTokenString, String.Empty).Replace(suffixTokenString, String.Empty);
                    if (tokenNameValuePairs.ContainsKey(innerToken))
                    {
                        //found token!!!
                        string replacementText = runAggregate.Replace(prefixTokenString + innerToken + suffixTokenString, Convert.ToString(tokenNameValuePairs[innerToken]));
                        Text newRun = new Text(replacementText);
                        run.InsertAfterSelf(newRun);
                        foreach (Text runToDelete in tokenRuns)
                        {
                            runToDelete.Remove();
                        }
                        replacedAToken = true;
                    }
                }
                runAggregate = String.Empty;
                tokenRuns.Clear();
            }
        }
    }
    return replacedAToken;
}
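// Class members. The delimiters are const so that the regex field initializer is allowed to reference them.
private const string prefixTokenString = "{";
private const string suffixTokenString = "}";
// Regex.Escape keeps the delimiters literal; the hyphen sits last in the character class so it cannot be read as a range.
private static readonly Regex possibleTokenRegex = new Regex(Regex.Escape(prefixTokenString) + "[a-zA-Z0-9_-]+" + Regex.Escape(suffixTokenString));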
And some samples of calling the function:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(memoryStream, true))
{
    bool replacedAToken = true;
    // Keep looping over the document until a pass replaces no tokens. Some tokens are spread across 'runs' and may need a second pass to be caught.
    while (replacedAToken)
    {
        //get all the text elements
        IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
        replacedAToken = this.IterateTextsAndTokenReplace(texts, tokenNameValuePairs);
    }
    wordDoc.MainDocumentPart.Document.Save();
    foreach (FooterPart footerPart in wordDoc.MainDocumentPart.FooterParts)
    {
        if (footerPart != null)
        {
            Footer footer = footerPart.Footer;
            if (footer != null)
            {
                replacedAToken = true;
                while (replacedAToken)
                {
                    IEnumerable<Text> footerTexts = footer.Descendants<Text>();
                    replacedAToken = this.IterateTextsAndTokenReplace(footerTexts, tokenNameValuePairs);
                }
                footer.Save();
            }
        }
    }
    foreach (HeaderPart headerPart in wordDoc.MainDocumentPart.HeaderParts)
    {
        if (headerPart != null)
        {
            Header header = headerPart.Header;
            if (header != null)
            {
                replacedAToken = true;
                while (replacedAToken)
                {
                    IEnumerable<Text> headerTexts = header.Descendants<Text>();
                    replacedAToken = this.IterateTextsAndTokenReplace(headerTexts, tokenNameValuePairs);
                }
                header.Save();
            }
        }
    }
}
I am currently trying to parse an HTML document to retrieve all of the footnotes inside it; the document contains dozens and dozens of them. I can't really figure out the expressions to use to extract all of the content I want. The thing is, the classes (e.g. "calibre34") are randomized in every document. The only way to see where the footnotes are located is to search for "hide"; the footnote text always comes afterwards and is closed with a </td> tag. Below is an example of one of the footnotes in the HTML document; all I want is the text. Any ideas? Thanks guys!
<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>
Use HtmlAgilityPack to load the HTML document and then extract the footnotes with this XPath:
//td[text()='[hide]']/following-sibling::td
Basically, it first selects all td nodes that contain [hide] and then moves to their following sibling, that is, the next td. Once you have this collection of nodes, you can extract their inner text (in C#, with the support provided by HtmlAgilityPack).
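As a rough illustration, the same idea can be tried with the XPath support built into .NET on a well-formed fragment. One assumption to flag: in the sample markup from the question, `[hide]` is the text of an `<a>` nested inside the first `td`, not a direct text node of that `td`, so this sketch matches on a descendant `a` and takes only the first following sibling. Real-world HTML is rarely well-formed XML, which is why HtmlAgilityPack remains the practical choice for actual pages. The class name `FootnoteExtractor` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FootnoteExtractor
{
    // Variant of the XPath above: find the td whose nested <a> reads "[hide]",
    // then take the nearest following td, which holds the footnote text.
    private const string FootnoteCells =
        "//td[.//a[normalize-space(.)='[hide]']]/following-sibling::td[1]";

    // Expects well-formed markup (XML-parsable); returns each footnote with whitespace collapsed.
    public static List<string> Extract(string wellFormedMarkup)
    {
        XDocument document = XDocument.Parse(wellFormedMarkup);
        return document.XPathSelectElements(FootnoteCells)
            .Select(cell => Regex.Replace(cell.Value, @"\s+", " ").Trim())
            .ToList();
    }
}
```

Because the class names are randomized per document, the expression deliberately ignores `class` attributes and relies only on the `[hide]` marker and the sibling relationship between the two cells.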
How about using MSHTML to parse the HTML source?
Here is the demo code. Enjoy.
using System;
using System.Collections.Generic;
using System.Net;
using mshtml;

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public mshtml.IHTMLDocument2 oHtmlDoc;
    public CHtmlPraseDemo(string url)
    {
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
    }
    // Returns the inner HTML of every td whose class attribute equals TdClassName.
    public List<String> GetTdNodes(string TdClassName)
    {
        List<String> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = (IHTMLElementCollection)ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }
    // Downloads the page source into strHtmlSource.
    void GetWebContent(string strUrl)
    {
        WebClient wc = new WebClient();
        strHtmlSource = wc.DownloadString(strUrl);
    }
}
class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");
        Console.Write(oH.oHtmlDoc.title);
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n.ToString());
        }
        Console.Read();
    }
}