PDFsharp cannot read text from Crystal Reports CR23-encoded documents - c#

We are using Crystal Reports, C# and PDFsharp to generate PDF documents by individual users. Crystal Reports is first used to create a single monolithic PDF document with all the users' entries, with each user's respective portion delineated by text "tags." Afterwards, a C# program generates individual PDFs from the monolith, by extracting its text with PDFsharp, searching for the tags, and then generating a PDF from each between-tag portion.
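The tag-search step described above can be sketched roughly as follows. This is only an illustration, assuming the text of each page has already been extracted into a list of strings and that each user's section starts on a page carrying a tag of the hypothetical form `[[USER:id]]`; the real tag format and the PDFsharp extraction code are not shown in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: groups pages into per-user ranges based on text tags.
public static class TagSplitter
{
    // Assumed tag format, e.g. "[[USER:alice]]"; adjust to the real tags.
    private static readonly Regex Tag = new Regex(@"\[\[USER:(\w+)\]\]");

    // Returns 0-based, inclusive page ranges, one per tag found.
    // Pages before the first tag are ignored.
    public static List<(string User, int First, int Last)> FindRanges(IReadOnlyList<string> pageTexts)
    {
        var ranges = new List<(string User, int First, int Last)>();
        string current = null;
        int first = -1;

        for (int i = 0; i < pageTexts.Count; i++)
        {
            Match m = Tag.Match(pageTexts[i]);
            if (!m.Success) continue;

            // A new tag closes the previous user's range.
            if (current != null) ranges.Add((current, first, i - 1));
            current = m.Groups[1].Value;
            first = i;
        }

        // Close the last open range at the end of the document.
        if (current != null) ranges.Add((current, first, pageTexts.Count - 1));
        return ranges;
    }
}
```

Each returned range can then be copied into its own output document with whatever PDF library is in use.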
This process worked fine for many years, but starting with Crystal Reports Service Pack 23, the encoding of the generated PDFs is no longer readable by PDFsharp, and hence the tags cannot be found. (No such problem occurs when copying from these documents if they are rendered in Chrome or Firefox.)
Is there a setting that can be changed in Crystal Reports to restore the old encoding, or must we either modify PDFsharp or use a different PDF processing library?

I posted this answer but it was deleted. I can't figure out why, given that it addresses an explicit question: "or must we either modify PDFsharp or use a different PDF processing library?"
I have no financial interest in the suggested library! I'm not the developer of it. I only use it.
Perhaps whoever decided to delete it didn't bother to read the whole question.
Consider using a different library. I use the Quick PDF library (Foxit, formerly Debenu) to do PDF splitting by tag in Crystal exports. It works fine for PDFs exported from any version of Crystal, including the latest runtime.

The SP16-generated PDFs used WinAnsi encoding, but the SP23 ones use Unicode. SAP said there is no setting in Crystal Reports to force the encoding to WinAnsi.
Solving this problem required adding ToUnicode CMap-retrieval to PDFsharp and using the CMaps at runtime to map each CString text index to its corresponding Unicode character.
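For readers unfamiliar with ToUnicode CMaps, the mapping idea can be sketched as below. This is not the actual PDFsharp modification; it is a minimal, self-contained illustration that assumes the CMap stream has already been decompressed to text, that entries are laid out one per line, and that character codes are two bytes wide (as with Identity-H). The class and method names are hypothetical, and the array form of `bfrange` is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of PDFsharp.
public static class ToUnicodeMap
{
    private const string Hex = @"<([0-9A-Fa-f]+)>";

    // Builds a code -> Unicode string table from the text of a ToUnicode CMap.
    public static Dictionary<int, string> Parse(string cmapText)
    {
        var map = new Dictionary<int, string>();

        // bfchar sections: pairs of <sourceCode> <UTF-16BE destination>.
        foreach (Match section in Regex.Matches(cmapText, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, Hex + @"\s*" + Hex))
                map[Convert.ToInt32(m.Groups[1].Value, 16)] = DecodeUtf16BE(m.Groups[2].Value);

        // bfrange sections: <lo> <hi> <destinationStart>, one entry per line (assumed).
        // Lines using the [ ... ] array form do not match and are skipped.
        string triple = @"^\s*" + Hex + @"\s*" + Hex + @"\s*" + Hex + @"\s*$";
        foreach (Match section in Regex.Matches(cmapText, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, triple, RegexOptions.Multiline))
            {
                int lo = Convert.ToInt32(m.Groups[1].Value, 16);
                int hi = Convert.ToInt32(m.Groups[2].Value, 16);
                char[] dst = DecodeUtf16BE(m.Groups[3].Value).ToCharArray();
                if (dst.Length == 0) continue;

                // Each successive code maps to the destination with its last unit incremented.
                for (int code = lo; code <= hi; code++)
                {
                    map[code] = new string(dst);
                    dst[dst.Length - 1]++;
                }
            }
        return map;
    }

    // Decodes a hex string operand made of 2-byte codes, e.g. "00030005".
    public static string Decode(string hexText, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hexText.Length; i += 4)
        {
            int code = Convert.ToInt32(hexText.Substring(i, 4), 16);
            sb.Append(map.TryGetValue(code, out string s) ? s : "\uFFFD");
        }
        return sb.ToString();
    }

    private static string DecodeUtf16BE(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Locating the ToUnicode stream on each font and pulling the string operands out of the content stream still has to be done with the PDF library itself; that part is deliberately left out here.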

Related

How to get RDLC report render PDF with ToUnicode entry for copypasting non-ansi text from the resulting PDF

Preface: we have reports generated in a C# application using the Microsoft.Reporting.WebForms.LocalReport class from an RDLC file. They are rendered in PDF format. The text in the report is mostly in Cyrillic. The problem is: it's impossible to copy it from the resulting PDF file, you get garbage.
The reason you get garbage is that the text is written using the "Identity-H" encoding for the font. It's not a real encoding, it's just an assignment of CIDs (basically, numbers) to the glyphs used in the PDF file. Adobe's PDF format has the "ToUnicode" entry for this reason: that's what should store the correspondence between CIDs and Unicode characters. If this information were present, it would be possible to copy/paste text from the file correctly.
Obviously, this class doesn't write it. While researching the problem, I came across this page that recognizes the lack of copy/paste support and praises it finally being implemented... in SQL Server 2016 Reporting Services.
Well, we don't use the ServerReport class or SQL Server RS. Or SQL Server 2016. Moving to it would be a weird and far too large architectural change just because managers complain they cannot copy text from PDFs.
So, is there a workaround? I doubt no one has faced this problem before. Maybe writing this ToUnicode entry was implemented in LocalReport in a newer version of .NET? Did someone write some sort of wrapper class that takes a byte array of the PDF and enhances it? Or maybe people render the report to DOCX and then use some other library to make a correct PDF out of that?

Simple Template Based PDF Reports in .Net

My customer gave me some Word and PowerPoint documents which specify how certain 'reports' generated by our product are supposed to look.
That means, I need to modify those documents (replace placeholders etc.) and then I need to export them as PDF.
How would you solve this problem in C# ?
TL;DR: Editing the office document is no problem at all, but exporting that document to PDF (using Interop) allegedly causes issues when running it as a web server application. That's the whole problem here.
I agree that Interop is not suitable for document manipulation in a server environment. I would approach this problem by preparing MS Word template documents with placeholders for data. Then I would use C# to load the data for the reports and merge the data with the templates to get the final documents (DOCX, PDF, XPS or various image formats). There are third-party toolkits which make this quite easy. Here is the code one such toolkit needs to merge XML data with the template to get a PDF document:
XElement customers = XElement.Load("Customers.xml");
DocumentGenerator dg = new DocumentGenerator(customers);
DocumentGenerationResult result = dg.GenerateDocument("MyTemplate.docx", "MyReport.pdf");
You can of course also use free libraries and SDKs based on OpenXML but you should expect a steep learning curve, lots of debugging and lots of time invested.
wkhtmltopdf might be an option.
A completely different "report approach" could be to save those Office documents with the placeholders as MHT (that is MHTML, a web archive format). This can be done directly in MS Office or even programmatically.
The placeholders can then be exchanged easily by string search and replace. The MHT files can be used directly to show the report instead of the PDF. A clear disadvantage of the MHT format is the HTML formatting. With PDF you have clear and fixed positioning.
We are using this kind of report creation. There are some flaws, but it works, and the customer can edit the MHT templates directly via right-click, Open With, and the preferred MS Office flavor.
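The search-and-replace step could look something like the sketch below. It assumes placeholders of the hypothetical form `{{Name}}` (the answer does not say which syntax is used) and HTML-encodes each value before inserting it. One caveat to check on real templates: MHT files saved by Office may be quoted-printable encoded with soft line breaks, which can split a placeholder across lines, so the HTML part may need decoding first.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: fills {{Name}}-style placeholders in template text.
public static class TemplateFiller
{
    // Assumed placeholder syntax: two braces around a word, e.g. {{Customer}}.
    private static readonly Regex Placeholder = new Regex(@"\{\{(\w+)\}\}");

    public static string Fill(string template, IDictionary<string, string> values)
    {
        // Single pass over the template, so inserted values are never rescanned.
        return Placeholder.Replace(template, m =>
            values.TryGetValue(m.Groups[1].Value, out string value)
                ? WebUtility.HtmlEncode(value)   // keep the HTML well-formed
                : m.Value);                      // leave unknown placeholders untouched
    }
}
```

Leaving unknown placeholders in place makes missing data easy to spot in the rendered report.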
You can use report generators, like FastReport.Net for solving your problems. It can assign different data for placeholders and also allow export to PDF.

Convert PDF document to Word document programmatically without any third-party tool (SSRS 2005)

I am using SQL Server Reporting Services 2005 (SSRS 2005) with VS2008 to export reports to Excel and PDF. Now I also want an option to export to Word, but after googling I learned that this is not possible in SSRS 2005. The problem is that I CAN'T USE SSRS 2008. So I thought I would follow these steps:
-- Export to Word
1. Export to PDF
2. Convert that PDF to Word document
Even after a lot of googling I didn't get a proper answer. As I said, I can't use any third-party tools, so please don't point me down that path.
There are many fundamental differences between PDF and Word making the approach you want highly undesirable as a general workflow. I'll give just one example: PDF typically does not store information about document structure - sentences, paragraphs, columns, tables... All it stores is the actual text at certain locations at a page. Word of course does have those concepts.
Is it possible to do what you want? Yes, to some extent. In the general case with guesswork and approximation. If you know which information you want to convert it might be possible to search for it in the PDF file generated by SSRS and then generate a Word file out of it. However, if SSRS allows export to text, XML, RTF or any other structure based file format (however slightly structure based), you'd have a much easier time.
If you insist on doing what you suggest here, you would have to:
1) Write code to take the PDF exported from SSRS and interpret it (find the textual content you want)
2) Recreate the necessary structural information from that information (what are paragraphs, where and what are the tables, what's the formatting etc...)
3) Write that into a file Word can read (or create a new Word document directly using automation).
This would be a considerable amount of work, but you have all of the necessary information as the PDF specification is freely downloadable from the Adobe web site and it contains all of the information you need.
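As a small illustration of step 3 only, the sketch below writes already-extracted paragraphs into RTF, which Word can open and which needs nothing beyond the base class library. It assumes steps 1 and 2 have produced a plain list of paragraph strings; the class name is hypothetical, and tables, fonts and other formatting are ignored entirely.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: turns plain paragraphs into a minimal RTF document.
public static class RtfWriter
{
    public static string Build(IEnumerable<string> paragraphs)
    {
        var sb = new StringBuilder(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0\fs22 ");
        foreach (string paragraph in paragraphs)
        {
            foreach (char c in paragraph)
            {
                if (c == '\\' || c == '{' || c == '}')
                    sb.Append('\\').Append(c);              // escape RTF control characters
                else if (c < 0x80)
                    sb.Append(c);                           // plain ASCII passes through
                else
                    sb.Append(@"\u")                        // \uN takes a signed 16-bit value
                      .Append(((short)c).ToString(CultureInfo.InvariantCulture))
                      .Append('?');                         // fallback for non-Unicode readers
            }
            sb.Append(@"\par ");
        }
        return sb.Append('}').ToString();
    }
}
```

The result contains only ASCII characters, so it can be saved with `File.WriteAllText(path, rtf, Encoding.ASCII)` and opened in Word.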

How can I convert PDF to doc without microsoft.office.interop?

I need to convert PDF files into .doc files using C#. The computer has no file system though it doesn't have Office installed. Any good ideas on how I can approach this? I did some research and most people use the interop services.
You need to understand that PDF is not really implemented as a single document format.
If your PDF docs are created by rendering text to a PDF file, then direct PDF conversion is not only possible, but can be very good (reliable).
If the source of your PDF is either a scanner or a fax (essentially a scanner...) then what you have is a document with a "picture" of text. This scenario is more difficult to deal with. If you open up the markup for this, there is no 'text' to be converted. In this situation you have to deal with some manner of OCR (optical character recognition), which is less reliable due to a variety of issues.
If you have the option of intercepting the data before it is rendered to PDF (say like in SSRS or Crystal) then it would be better for you to bypass the PDF stage and move your data to a Word document.
If you are constrained to receiving faxes and then needing to interpret their content, prepare for OCR hell. It has been a while since I was there, so I hope that it has gotten better.
Even without Office installed on your machine, you have access (with Visual Studio) to the Office developer toolkit, which will allow you to build documents to be distributed in the Word formats (.doc/.docx).
An option/idea may be to convert the PDF to Html, which can be opened in Word?
Use Aspose PDF Kit to convert the PDF to text, and then convert the text to .doc using a FileStream or Aspose doc.

Concatenating Word documents and converting them to PDF

What is the best possible way to merge multiple documents and convert them to PDF? We also need to insert blank pages for every odd page.
A fully supported, server-side automated version of this (mostly baked into the MS camp though) involves using the OpenXML SDK to do any field inserts, then using SharePoint's Word Automation Services (SP 2010) to convert the documents to PDF, and then picking your favorite PDF toolkit (iTextSharp for me) for any post-processing (merging documents, inserting blank pages, or images that must be positioned relative to specific pages).
The reason for doing the document merge in PDF rather than OpenXML is simplicity - you don't have to deal with merging styles, headers etc.
The reason for doing the blank-page and image insertion in PDF is that OpenXML has no idea how to render the content, and so it has no idea where page breaks would occur naturally (you can still insert breaks like you would in Word though).
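Once the documents are PDFs and their page counts are known, deciding where blank pages go is simple arithmetic. The sketch below assumes the requirement means "pad every document with an odd page count so the next one starts on an odd (right-hand) page", which is one reading of the question; the class name is hypothetical, and the actual merging and page insertion would be done with the PDF toolkit of choice.

```csharp
using System.Collections.Generic;

// Hypothetical helper: plans blank-page padding for a merged PDF.
public static class BlankPagePlanner
{
    // For each input document, returns the 1-based page it starts on in the
    // merged output and whether a blank page must be appended after it.
    public static List<(int StartPage, bool BlankAfter)> Plan(IReadOnlyList<int> pageCounts)
    {
        var plan = new List<(int StartPage, bool BlankAfter)>();
        int nextStart = 1;

        foreach (int count in pageCounts)
        {
            bool blankAfter = count % 2 == 1;   // odd length: pad to an even length
            plan.Add((nextStart, blankAfter));
            nextStart += count + (blankAfter ? 1 : 0);
        }
        return plan;
    }
}
```

With this padding every document occupies an even number of pages, so every start page is odd.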
If you are using C# and you are OK with a server based solution then have a look at this post. It uses a .net friendly web services interface.
There is an optional SharePoint version available as well, but as you did not include a SharePoint tag I assume that won't be of interest to you.
Full disclosure, I wrote that post.
