How to read a special character from rtf file using C#? - c#

I am using C# to convert RTF to XML.
When reading the rich text, there are special characters that are represented by special RTF tags. E.g., for the closing double quote symbol ” the RTF tag is \'d3.
Whenever I read this symbol I need to write "&rdquo;" in the XML.
Is there any documentation that lists the RTF tags for all the special characters, or is it possible to convert them through some encoding?

Check out the RTF Specification and read up on the Document Area.
The article lists the code values for all the special characters (RTF tags), e.g.:
\rdblquote 0xD3
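A lookup table is one way to turn such control words into the entities the question asks for. The sketch below is only an illustration: the class name RtfSpecialChars and the choice of entities are assumptions, and the table covers just a handful of control words rather than the full list in the specification. Note that entities such as &rdquo; are predefined in HTML but not in plain XML, so the target document would need to declare them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RtfSpecialChars
{
    // Partial, illustrative table; the RTF specification lists many more.
    private static readonly Dictionary<string, string> ControlWordToEntity =
        new Dictionary<string, string>
        {
            { "lquote", "&lsquo;" },
            { "rquote", "&rsquo;" },
            { "ldblquote", "&ldquo;" },
            { "rdblquote", "&rdquo;" },
            { "emdash", "&mdash;" },
            { "endash", "&ndash;" },
            { "bullet", "&bull;" }
        };

    public static string ReplaceControlWords(string rtf)
    {
        // Match a backslash, the control word letters, and one optional
        // delimiting space that belongs to the control word.
        return Regex.Replace(rtf, @"\\([a-z]+) ?", delegate (Match m)
        {
            string entity;
            return ControlWordToEntity.TryGetValue(m.Groups[1].Value, out entity)
                ? entity
                : m.Value; // leave unknown control words untouched
        });
    }
}
```

Hex escapes such as \'d3 are not handled here; they are byte values in the document's code page and would need to be decoded with the matching encoding.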

Related

Special characters for Docx with ooxml

I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.
Most of the characters are read properly, but some special characters such as •, “ and ” are being displayed incorrectly (for example, • comes out as â€¢). What should I be doing to correct this?
The HTML that I was passing to HTMLtoDocx was also not being read properly: special characters were displayed as '?'. After changing the encoding to Encoding.Default it returns the correct characters.
In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding away from Encoding.UTF8, but it isn't helping.
StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);
A sequence such as â€¢ in place of • indicates a UTF-8 byte sequence incorrectly interpreted as ANSI (= Encoding.Default).
You should check whether the HTML file is read with the correct encoding.
While the encoding info is available in the HTTP Header or in HTML META tags, this encoding may not be correct if the HTML is read from a file.
Since .NET treats string characters as 2-byte Unicode values, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
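The effect described above can be reproduced in a few lines. This is a sketch under an assumption: ISO-8859-1 stands in for the ANSI code page, because that encoding is always available, whereas Encoding.Default depends on the machine and runtime. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Correct: the bytes are UTF-8, so decode them as UTF-8.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Wrong on purpose: treats every byte as one character,
    // which is what a single-byte ANSI code page does.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

Encoding.UTF8.GetBytes("\u2022") yields the three bytes E2 80 A2 for the bullet. DecodeAsUtf8 turns them back into one character, while DecodeAsLatin1 produces three separate characters, which is the kind of garbage seen in the document.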

Accented characters are not showing properly after copying from a text box

I am using the code below to copy text from a control. Please note that the text could be in Spanish or English. Later I show it inside a rich text box.
Clipboard.Clear();
MyDocBodyControl.Range.Copy();
html = Convert.ToString(Clipboard.GetData(DataFormats.Html));
But when I display the text in the rich text box, the accented characters are not shown properly. If I use any other format, such as Text, I get the proper accented characters. But I have to use the HTML format because some styles need to be carried over with the copied text.
Is there any way to show the accented characters properly with the HTML data format?
Set a correct encoding? UTF-8/Unicode/... ?
Also have a look on these topics: How to convert a Unicode character to its ASCII equivalent
The DataFormats.Html specification states that the data is encoded in UTF-8. But there is a bug in .NET Framework 4 and lower, which actually reads the UTF-8 as Windows-1252.
You get a lot of wrongly decoded text, which leads to funny/bad characters such as
'Å','‹','Å’','Ž','Å¡','Å“','ž','Ÿ','Â','¡','¢','£','¤','Â¥','¦','§','¨','©'
For example, '€' is wrongly decoded as 'â‚¬' in Windows-1252.
Full explanation here at this dedicated website
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
But by using the conversion tables you will not lose any UTF-8 characters. You can get the original pristine UTF-8 characters from DataFormats.Html. (Note: Ppm solutions defaults to ASCII on a fail and you lose encoding information!)
Also, Chrome adds Apple-converted-* characters to the clip that appear as, for example, 'Â ', although they are claimed to be removed.
Solution: Create a translation dictionary and do a search and replace.
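A translation dictionary along those lines might look like the sketch below. The class name MojibakeFixer and the four entries are assumptions chosen for illustration; a real table would be built from the full Windows-1252 to UTF-8 chart mentioned above. The keys are written as Unicode escapes so the source file's own encoding cannot corrupt them.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MojibakeFixer
{
    // Hypothetical, deliberately tiny table: each key is a UTF-8 byte
    // sequence that was misread as Windows-1252 characters.
    private static readonly Dictionary<string, string> Table =
        new Dictionary<string, string>
        {
            { "\u00E2\u201A\u00AC", "\u20AC" }, // euro sign
            { "\u00E2\u20AC\u00A2", "\u2022" }, // bullet
            { "\u00C3\u00A9", "\u00E9" },       // e with acute accent
            { "\u00C3\u00B1", "\u00F1" }        // n with tilde
        };

    public static string Fix(string text)
    {
        var sb = new StringBuilder(text);
        foreach (KeyValuePair<string, string> pair in Table)
        {
            sb.Replace(pair.Key, pair.Value);
        }
        return sb.ToString();
    }
}
```

Calling MojibakeFixer.Fix on the string returned by Clipboard.GetData(DataFormats.Html) would then repair the sequences that are in the table and leave everything else alone.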

Escape an xml string while creating xml file

I need to create an XML file which is to be converted to an Excel file (.xls), and this means that the XML has a lot of meta info in it. It's easy to write all the contents into the XML file as a text file.
var sw = new FileInfo(tempReportFilePath).CreateText();
sw.WriteLine("meta info and other tags");
However, this method does not escape characters, and when the data contains '<' or '>' or '&' etc. the XML is rendered invalid and the .xls file does not open. I can easily do a replace ('<' with '&lt;' and so on), but for performance reasons, this method is not suitable.
The other alternative is to use an XML text writer, but with a ton of meta info, it will mean writing a lot of tags in code. With sw.WriteLine('stuff'), I could simply put parts of the meta info in one tag (as a string) and write them to the file. Using XSLT, the problem I faced was that tags required spaces. For example, for tabular data, the top row fields could have spaces.
How do I go about creating a well-formed XML file with a lot of meta info, where the characters ('<', '>' etc.) are escaped?
Uri.EscapeDataString(string stringToEscape);
XDocument tutorials.
Why not create the .xls in the first place? There is a nice library to do so:
http://npoi.codeplex.com/
I used the WriteRaw method for writing the meta info tags. For the other data, which was required to be escaped, I used WriteString method.
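A minimal sketch of that combination, using System.Xml.XmlWriter. The element names Report and Cell and the helper name Build are made up for the example; the real meta info would be whatever the Excel XML format requires. WriteRaw emits the string unchanged, so it must already be well-formed, while WriteString escapes markup characters in the data.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReportWriter
{
    public static string Build(string rawMetaXml, string cellText)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("Report");
            // Pre-built meta info: written as-is, no escaping applied.
            writer.WriteRaw(rawMetaXml);
            writer.WriteStartElement("Cell");
            // User data: '<', '>' and '&' are escaped by the writer.
            writer.WriteString(cellText);
            writer.WriteEndElement();
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

For example, ReportWriter.Build("<Meta>x</Meta>", "a < b & c") keeps the Meta element intact and writes the cell text with &lt; and &amp; in place of the raw characters.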

How to convert UTF-8 to text in HTML entity?

I have a downloader program that downloads pages from the internet.
The encoding of each page is different: some are in UTF-8 and some are in Unicode.
For example: a character reference that displays as the 'a' character; the pages are full of these. We need to convert them to normal text.
I used the UnicodeEncoding class in C#, but it does not help.
How can I decode these to real characters? Is there a class or method that does this conversion?
Thanks.
That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)
Sequences in HTML pages that start with & and end with ; are HTML-encoded characters.
You can decode these by using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see Characters in string changed after downloading HTML from the internet for code on how to make sure you download the page in the correct character set.
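For a quick check of what the decoding does, here is a small sketch. It uses System.Net.WebUtility.HtmlDecode instead of the HttpUtility method above, as an assumption that avoiding the System.Web.dll reference is acceptable. The class name EntityDemo is hypothetical.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Turns named and numeric character references back into characters.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

EntityDemo.Decode("&#97; &amp; &lt;b&gt;") returns "a & <b>".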
You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need - but you should definitely try to understand what's going on first. For example, you may well want to only decode certain fragments of the HTML - if you decode the whole document, then you could end up with text which looks it contains like HTML tags, but actually just contained text in the original document.
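To illustrate the valid-XML case mentioned above, here is a sketch that parses a fragment with XDocument and reads the text back. The fragment and the class name TextNodeDemo are made up for the example, and the approach only works when the input really is well-formed XML.

```csharp
using System;
using System.Xml.Linq;

public static class TextNodeDemo
{
    // Parses the markup and returns the concatenated text content;
    // the parser resolves escapes such as &lt; and &#97; on the way.
    public static string ExtractText(string xml)
    {
        return XDocument.Parse(xml).Root.Value;
    }
}
```

TextNodeDemo.ExtractText("<p>a &lt; b &amp; &#97;</p>") returns "a < b & a", with no separate decoding step.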

Subscript text in pdf C#

How do I insert a subscript character in a string in C#?
I have no problems appending a superscript "2" in the same string using char.ConvertFromUtf32(178);, but I struggle with finding a similar solution for the subscripted text. Actually, I'm struggling with finding any solution at all to this rather embarrassing issue.
Plain text doesn't have formatting, like superscript, subscript, bold, italic and/or colors.
You need to use some "rich text" format.
The type of "rich text" depends on where you want to use it. Examples: HTML, RTF.
For PDF you need to look into the formatting options provided by your PDF creation library.
The PDF creation library I'm using did not offer much.
One workaround I could figure out was to pick the equivalent character values from the character map and append them to the existing string.
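That workaround can be sketched for digits, which have dedicated subscript characters in Unicode (U+2080 to U+2089). The helper name ToSubscriptDigits is hypothetical, and whether the result shows up correctly still depends on the font used in the PDF containing those glyphs.

```csharp
using System;
using System.Text;

public static class SubscriptDemo
{
    // Replaces ASCII digits with their Unicode subscript counterparts
    // and leaves every other character unchanged.
    public static string ToSubscriptDigits(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(c >= '0' && c <= '9' ? (char)(0x2080 + (c - '0')) : c);
        }
        return sb.ToString();
    }
}
```

"H" + SubscriptDemo.ToSubscriptDigits("2") + "O" gives the string "H\u2082O", i.e. H followed by a subscript two and O.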
