I am trying to build a PDF tool with the XFINIUM library and I would like to know if it is possible to retrieve the links in a PDF so that I can make them clickable when the document is displayed in my app. For now I can only see them as text and it is not possible to click on them, so they are not useful. I have looked through the XFINIUM samples but I couldn't find any hint of what I should change to make them work.
Any help would be great.
Thanks a lot.
Links in the PDF file are stored as link annotations. You can retrieve these links in this way: load your file in a PdfFixedDocument. The document's Pages collection is populated automatically with all the pages in the document.
Each page has an Annotations collection which is populated automatically with all the annotations on the page. Loop through this collection and test which annotation is a link annotation. The position of the link on the page is given by the VisualRectangle property.
If you need the link's URL, you have to inspect the Action property of the link annotation. If it is a URI action, then the URI property of the action will give you the link's URL.
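In code, that loop looks roughly like this (the namespaces and the PdfLinkAnnotation / PdfUriAction class names below are from memory, so check the XFINIUM.PDF reference and samples for your platform):

using System;
using System.IO;
using Xfinium.Pdf;
using Xfinium.Pdf.Annotations;
using Xfinium.Pdf.Actions;

class LinkExtractor
{
    public static void ListLinks(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            PdfFixedDocument document = new PdfFixedDocument(stream);

            foreach (PdfPage page in document.Pages)
            {
                foreach (PdfAnnotation annotation in page.Annotations)
                {
                    PdfLinkAnnotation link = annotation as PdfLinkAnnotation;
                    if (link == null)
                    {
                        continue;
                    }

                    // Where the link sits on the page; use this to place your own hotspot.
                    var area = link.VisualRectangle;

                    // If the link points to a web address, the URI action carries the URL.
                    PdfUriAction uriAction = link.Action as PdfUriAction;
                    if (uriAction != null)
                    {
                        Console.WriteLine("Link at ({0}) -> {1}", area, uriAction.URI);
                    }
                }
            }
        }
    }
}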
Disclaimer: I work for the company that develops XFINIUM.PDF library.
Related
I want to be able to upload a file to the CMS, to a field on a page, and then, from the actual site page, have a link to download that file.
I think a nice way to do this is with Kentico page attachments. On the page in the content tree you can go to Attachments and add, for example, a PDF file. The attachment can then be retrieved in the backend code, and you can build a download link in the view with it. Also see the documentation about displaying page attachments. You can also take a look at attaching file groups to pages.
Example:
Controller:
viewModel.FileUrl = treeNode.AllAttachments?.FirstOrDefault()?.GetPath() ?? string.Empty;
View:
<a href="@Model.FileUrl">Download file</a>
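Putting it together, a rough sketch of the controller side (FileViewModel is an illustrative name, treeNode stands for the current page's TreeNode, and the namespace that provides the GetPath() helper differs between Kentico versions):

using System.Linq;
using CMS.DocumentEngine;

public class FileViewModel
{
    public string FileUrl { get; set; }
}

public static class AttachmentLinkBuilder
{
    // treeNode is assumed to be the current page's TreeNode; how you obtain it
    // depends on your Kentico version (DocumentHelper, page data context, etc.).
    public static FileViewModel BuildViewModel(TreeNode treeNode)
    {
        return new FileViewModel
        {
            // First attachment on the page, or an empty URL if there is none.
            FileUrl = treeNode.AllAttachments?.FirstOrDefault()?.GetPath() ?? string.Empty
        };
    }
}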
I want to capture some blog posts from blog sites. I know how to use HttpClient to get the HTML string, and then use Html Agility Pack to capture the content under a specific HTML tag. But if you use a WebView to show this HTML string, you will find that it does not display well on mobile. For example, CSS styles are not loaded correctly, some code blocks do not wrap automatically, and some pictures do not show (an 'x' is displayed instead).
Some advertisements also show up, which I don't want.
Does anyone know how to handle this? Any suggestions would be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, ...
Then update the HTML content so that resources are retrieved from your local data storage (for example, relative URLs will no longer work if you save the HTML page locally).
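A minimal HtmlAgilityPack sketch of that rewriting step, here turning relative resource URLs into absolute ones against the page's original address (if you download the resources themselves, point the attributes at your local copies instead):

using System;
using HtmlAgilityPack;

class ResourceRewriter
{
    public static string MakeUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // img/script use "src"; link (stylesheets) uses "href".
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]|//script[@src]|//link[@href]");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                string attr = node.Name == "link" ? "href" : "src";
                string value = node.GetAttributeValue(attr, string.Empty);
                if (Uri.TryCreate(baseUri, value, out Uri absolute))
                    node.SetAttributeValue(attr, absolute.AbsoluteUri);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}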
You may also send your HTTP request with a User-Agent header that corresponds to the one used by the Microsoft browser, in order to obtain the corresponding version from the website (if it does some kind of User-Agent sniffing).
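A small sketch of sending such a request with HttpClient; the User-Agent string below is only an example value, so substitute the one for the mobile browser/WebView you actually target:

using System.Net.Http;
using System.Threading.Tasks;

class MobileFetcher
{
    // Example mobile User-Agent value; replace with the one you want to mimic.
    private const string MobileUserAgent =
        "Mozilla/5.0 (Windows Phone 10.0; Android 6.0.1) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/52.0 Mobile Safari/537.36 Edge/14.14393";

    public static async Task<string> GetMobileHtmlAsync(string url)
    {
        using (var client = new HttpClient())
        {
            // Add the header without validation so arbitrary UA strings are accepted.
            client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", MobileUserAgent);
            return await client.GetStringAsync(url);
        }
    }
}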
I am building a news reader and I have an option for users to share an article from a blog, website, etc. by entering a link to the page. I am using two methods for now to determine the content of the page:
I try to extract the RSS feed link from the page the user entered and then match that URL against the feed to get the right item.
If the site doesn't contain a feed, or the feed is malformed, or the entered address differs from the item link in the RSS (which happens in about 50% of cases, if not more), I try to find the og meta tags. That works great, but only bigger sites have them; smaller sites and blogs usually have the same meta description for the whole website.
I am wondering how, for example, Google does it? When a website doesn't contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract content from pages and my own methods to clean HTML down to text.
Can someone explain the logic or best approach to this? If I try to crawl the page directly from the top, I usually end up with content from the sidebar, navigation, etc.
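For reference, the og-tag lookup described above comes down to a couple of XPath queries with HtmlAgilityPack; a minimal sketch (class and method names are illustrative):

using HtmlAgilityPack;

class MetaExtractor
{
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Prefer the Open Graph description when the site provides one.
        var og = doc.DocumentNode.SelectSingleNode("//meta[@property='og:description']");
        if (og != null)
            return og.GetAttributeValue("content", string.Empty);

        // Fall back to the plain meta description.
        var plain = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");
        return plain?.GetAttributeValue("content", string.Empty) ?? string.Empty;
    }
}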
I ended up using Boilerpipe, which is written in Java; I imported it using IKVM and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
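Once the converted assembly is referenced, the call itself is a one-liner; a sketch assuming the IKVM-converted assembly keeps Boilerpipe's original Java package names:

using de.l3s.boilerpipe.extractors;

class ArticleText
{
    public static string Extract(string html)
    {
        // ArticleExtractor is Boilerpipe's main-content extractor; getText strips
        // navigation, sidebars and other boilerplate from the raw HTML.
        return ArticleExtractor.INSTANCE.getText(html);
    }
}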
I am trying to obtain html from the WebBrowser control, but it must include the value attributes of input elements on the page as well.
If I use webBrowser.DocumentText, I get the full HTML of the page as it was initially loaded. The input field values are not included.
If I use webBrowser.Document.Body.OuterHtml, I get the values, but not the rest of the document outside the body (the head contents), which I need so I can get the stylesheet links, etc.
Is there a clean dependable way to obtain the full HTML of the DOM in its current state from the WebBrowser? I am passing the HTML to a library for it to be rendered to PDF, so suggestions for programmatically saving from the WebBrowser control to PDF will also be appreciated.
Thanks
There are some undocumented ways (changing the registry, using an undocumented DLL export) to print the document to XPS or PDF printers without parsing the page, that is, if you can afford to roll out the required printer drivers to your customers' network.
If you want to parse the web page, documentElement.outerHTML should give you the full canonicalized document, but not the linked image, script or stylesheet files. You need to parse the page, enumerate the elements, check the element types and collect the resource URLs before digging through the WinInet cache or downloading the additional resources. To get the documentElement property, cast HtmlDocument.DomDocument to mshtml.IHTMLDocument3 (the MSHTML interface that exposes documentElement) if you use Windows Forms, or cast WebBrowser.Document to the same interface if you use WPF. If you need to wait for Ajax code to finish executing, start a timer when the DocumentComplete event is raised.
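A minimal Windows Forms sketch of that cast (add a COM reference to the Microsoft HTML Object Library to get the mshtml interop types):

using System.Windows.Forms;
using mshtml; // COM reference: Microsoft HTML Object Library

static class DomSnapshot
{
    // Serializes the document element of the current DOM (rather than the
    // originally downloaded source returned by DocumentText).
    public static string GetCurrentHtml(WebBrowser browser)
    {
        var document = (IHTMLDocument3)browser.Document.DomDocument;
        return document.documentElement.outerHTML;
    }
}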
At this stage, I would parse the HTML DOM and extract the necessary data in order to generate a report via a template, so you always have the option to generate other formats supported by the report engine, such as Microsoft Word. Very rarely do I need to render the HTML exactly as parsed, for example when printing a long table without adding a customized header and footer on each page. That said, you can check Convert HTML to PDF in .NET and test which of the suggested components works best with your target web site, if you do not have long tables.
I am trying to implement a feature where I open a PDF file (multiple pages), for example in an iframe, highlight a section of the document and get the page number (the one that is displayed in the PDF toolbar).
E.g. if the toolbar displays 2/7, meaning I am currently on page 2, I need to capture that page number. Sounds simple, but I am not able to find a .dll/function that exposes this property.
Any help would be appreciated. Thanks.
I wouldn't think this would be possible; there's no way to control PDFs with JavaScript in the browser, which is what you'd need to do.
This article suggests the same: http://codingforums.com/showthread.php?t=43436.
Content of link:
in short, no, you can't do that.
I really don't think JS can read properties of PDFs, since PDFs are viewed in the browser through a plugin, i.e. a viewport for another application (for want of a better explanation).
You may be better off trying a different route, such as generating the pages as images and implementing your own paging. It depends on your content and requirements, of course. ABCpdf from http://www.websupergoo.com/ is free (with a link-back); not sure if that's any help for you.
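If you go the images route, the "current page" becomes your own pager state rather than something read from the PDF plugin. A rough sketch with ABCpdf (the ABCpdf11 namespace version and member names are assumptions to verify against the release you install):

using WebSupergoo.ABCpdf11;

class PdfToImages
{
    public static int RenderPages(string pdfPath, string outputFolder)
    {
        using (Doc doc = new Doc())
        {
            doc.Read(pdfPath);
            for (int i = 1; i <= doc.PageCount; i++)
            {
                doc.PageNumber = i;                   // select the page to rasterize
                doc.Rect.String = doc.CropBox.String; // render the full page area
                doc.Rendering.Save(System.IO.Path.Combine(outputFolder, "page-" + i + ".png"));
            }
            // You now know the total page count and can track the current page yourself.
            return doc.PageCount;
        }
    }
}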