I created an application where we want to create .rtf and .pdf documents.
The documents also contain characters like ä, ü, ö, and ß, and we have a big issue: those special characters are not shown correctly in the RTF document.
For creating the RTF document, we are using MigraDoc and the RtfDocumentRenderer.
The PDF is created correctly. For the RTF document, we have already tried a few things:
- setting the UTF encoding before calling the renderer
- changing the culture info
- creating the document as a byte array and re-encoding that byte array, but without success
- writing the character as a Unicode escape instead of the literal character
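For reference, the rendering call looks roughly like this (a sketch; the document construction and file name are placeholders):

var document = CreateDocument();  // hypothetical: builds the MigraDoc Document
var renderer = new MigraDoc.RtfRendering.RtfDocumentRenderer();
renderer.Render(document, "output.rtf", Environment.CurrentDirectory);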
The current version 1.51 of PDFsharp/MigraDoc targets .NET 2.0/.NET 3.5.
The next version (coming soon) targets .NET 6 and properly deals with the change of the default encoding.
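Until the new version ships, if you are running 1.51 on .NET Core/.NET 5+, a common workaround (an assumption on my side, not an official fix) is to register the legacy Windows code pages before rendering, since they are no longer available there by default:

// Requires the System.Text.Encoding.CodePages NuGet package.
// Makes Windows-1252 and the other ANSI code pages available again.
System.Text.Encoding.RegisterProvider(
    System.Text.CodePagesEncodingProvider.Instance);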
I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But some of this output is becoming mangled; specifically, the symbol '£' is output as U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows-1252 ('ANSI') encoded.
The questions are:
1. How do I determine whether the file is encoded as Windows-1252 or UTF-8? It could be either.
2. How do I convert it to UTF-8 if it is in Windows-1252, preserving symbols such as £?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
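For the conversion itself, a minimal sketch (file names are placeholders): read with the source encoding, write with UTF-8:

// Read the file as Windows-1252, then rewrite it as UTF-8 (with BOM).
string text = File.ReadAllText("input.txt", Encoding.GetEncoding("Windows-1252"));
File.WriteAllText("output.txt", text, Encoding.UTF8);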
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
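One heuristic that works reasonably well for the UTF-8 vs. Windows-1252 case (a sketch, not 100% reliable): try a strict UTF-8 decode first and fall back to 1252 if it fails, since random 1252 text is rarely valid UTF-8.

byte[] bytes = File.ReadAllBytes("filename");
var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
string text;
try
{
    text = strictUtf8.GetString(bytes);                  // valid UTF-8 (or pure ASCII)
}
catch (DecoderFallbackException)
{
    text = Encoding.GetEncoding(1252).GetString(bytes);  // assume ANSI
}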
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.
I am using Microsoft.Reporting.WinForms.dll to render my RDLC as a PDF file, but when I open the PDF with Adobe Reader I have a problem with some special Turkish characters. In the PDF they look normal, but when I use Ctrl+F to search for words containing them, I cannot find those words, even though the PDF contains them. Also, when I copy these words out of the file and paste them elsewhere, I get garbled characters. Interestingly, I use the same DLL, the same class, the same code, and the same method to render the RDLC as an Excel file, and I don't have this problem in the Excel file.
I use the byte[] Render(string format) method in WinForms.dll for the rendering. Maybe some special character's code is out of range for the byte array, and because of this it cannot render every character for the PDF format, but I am not sure about this.
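For reference, the rendering call is roughly this (a sketch; the report path, data source name, and data table are placeholders):

var report = new Microsoft.Reporting.WinForms.LocalReport();
report.ReportPath = "MyReport.rdlc";
report.DataSources.Add(
    new Microsoft.Reporting.WinForms.ReportDataSource("DataSet1", myDataTable));
byte[] pdfBytes = report.Render("PDF");   // the same call with the Excel format works fine
File.WriteAllBytes("output.pdf", pdfBytes);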
Thanks...
According to a Microsoft article, there is an issue with special characters that was fixed in SQL Server 2014; the corresponding ReportViewer DLL would be the 2015 runtime.
Maybe you should upgrade.
I had a similar problem. My application generates PDFs using LocalReport.
My solution was:
1- Modify the RDLC XML schema to use the 2016 version. Replace your Report element with this:
<Report xmlns="http://schemas.microsoft.com/sqlserver/reporting/2016/01/reportdefinition" xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner">
Then adjust the rest of the schema; it is no longer the same as in previous versions (the DataSources element moves up, etc.).
2- Remove <EmbedFonts>None</EmbedFonts> from DeviceInfo.
With these changes, the special characters are rendered and printed correctly.
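If it helps, this is roughly how a DeviceInfo without the EmbedFonts element is passed (a sketch; the remaining element and its value are assumptions):

// DeviceInfo with no <EmbedFonts>None</EmbedFonts>, so fonts get embedded
// and the PDF text stays searchable and copy-paste-able.
string deviceInfo = "<DeviceInfo><HumanReadablePDF>False</HumanReadablePDF></DeviceInfo>";
byte[] pdf = report.Render("PDF", deviceInfo);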
Arabic Special Characters with Unicode
\u064b
\u064d
\u0647
Are rendered incorrectly when trying to save as PDF.
Attached is a sample VS2017 project, with a sample word file, and the code used for conversion, only omitting the license part.
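For context, the conversion in the project is essentially a plain load-and-save (a hypothetical minimal version, not the attached code; file names are placeholders):

// Assumption: standard Aspose.Words load-and-save, license setup omitted.
var doc = new Aspose.Words.Document("sample.docx");
doc.Save("sample.pdf");   // output format inferred from the extension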
Also check the support ticket on the Aspose free forums:
https://forum.aspose.com/t/arabic-special-characters-unicode-u064b-u064d-u0647-rendered-incorrectly-save-as-pdf/193389?u=mohamed.atia88
Thanks in advance,
I have tried reading a text file and an XML file with the File class, and it works fine. I was wondering if we can read Excel, Word, or other file types the same way.
var str = File.ReadAllLines("Test.xlsx");
While debugging, str shows special characters.
I hope I have made the question clear. Kindly advise.
Down votes are welcome, if accompanied by a proper comment on how to improve :).
Thanks in advance.
XML and text files are plain files, where the text on screen appears just as it is stored in the file. That's why File.ReadAllLines works.
Excel is different. The file has an encoded internal structure, which a special program (read: MS Excel) decodes and displays correctly on screen.
Think of it as an encoded or obfuscated file that is read by programs specifically built to decode it.
To read an Excel file in .NET, you can load it into a DataSet/DataTable as described here: Read Excel File in C# (Example)
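For example, a sketch using the OLE DB ACE provider (the provider must be installed; the connection string and sheet name are assumptions):

string cs = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=Test.xlsx;" +
            "Extended Properties='Excel 12.0 Xml;HDR=YES'";
using (var conn = new System.Data.OleDb.OleDbConnection(cs))
using (var adapter = new System.Data.OleDb.OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
{
    var table = new System.Data.DataTable();
    adapter.Fill(table);   // rows from the first worksheet, decoded properly
}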
With File.ReadAllLines you can read text files (and XML is, as we know, also a text file).
Of course the function reads other kinds of data files as well, but you will not get meaningful results: the binary data is interpreted as characters. This will not work for Office files.
The MSDN documentation for File.ReadAllLines() states that:
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
Therefore you can read text files with one of the UTF encodings it supports. To read files that use other encodings (e.g. Windows ANSI, non-Latin text) you should use the overload that takes an Encoding parameter.
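For example (a sketch; the file name is a placeholder):

// Read an ANSI (Windows-1252) text file with an explicit encoding.
var lines = File.ReadAllLines("Test.txt", Encoding.GetEncoding(1252));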
I have a binary stored in a database. Now I want to convert it to a Word doc. I have tried ASCII encoding, but it adds special characters or symbols in between and doesn't look good.
For example, I have resumes as .doc files saved in a SQL database in binary format. Now what I want is to convert that binary back to a Word-compatible format and display it in an editor/textarea.
A Word .doc document is not a text file. It contains lots of binary data, the stuff that keeps track of styles, fonts, paragraph formatting, etcetera, etcetera. Which is the junk you see. You cannot realistically read such a file yourself, or for that matter display the document accurately, you have to use Word. You can automate it with the classes in the Microsoft.Office.Interop.Word namespace.
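A sketch of that route (an assumption on my side; the database fetch and paths are placeholders): write the stored bytes back to a .doc file and let Word extract the text:

byte[] blob = GetResumeBytes();                       // hypothetical DB fetch
string path = Path.Combine(Path.GetTempPath(), "resume.doc");
File.WriteAllBytes(path, blob);

var word = new Microsoft.Office.Interop.Word.Application();
var doc = word.Documents.Open(path);
string text = doc.Content.Text;                       // plain text for the textarea
doc.Close();
word.Quit();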
An intermediate solution is to store Word documents in the RTF file format. As long as the formatting doesn't get too fancy, a RichTextBox can display it accurately. Storing it in a dbase column isn't hard either; it is text.
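If you go the RTF route, displaying it is a one-liner (a sketch; the variable holding the database value is a placeholder):

string rtf = GetResumeRtf();     // hypothetical: RTF text read from the dbase column
richTextBox1.Rtf = rtf;          // RichTextBox parses and renders the formatting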
The Word .doc format is pretty much proprietary and closed, meaning there is no interface to which you can pass an array of bytes that Word understands and get a string out of it.