How to convert a string to RTF in C#? - c#

Question
How do I convert the string "Européen" to the RTF-formatted string "Europ\'e9en"?
[TestMethod]
public void Convert_A_Word_To_Rtf()
{
// Arrange
string word = "Européen";
string expected = "Europ\'e9en";
string actual = string.Empty;
// Act
// actual = ... // How?
// Assert
Assert.AreEqual(expected, actual);
}
What I have found so far
RichTextBox
RichTextBox can be used for certain things. Example:
RichTextBox richTextBox = new RichTextBox();
richTextBox.Text = "Européen";
string rtfFormattedString = richTextBox.Rtf;
But then rtfFormattedString turns out to be the entire RTF-formatted document, not just the string "Europ\'e9en".
Stackoverflow
Insert string with special characters into RTF
How to output unicode string to RTF (using C#)
Output RTF special characters to Unicode
Convert Special Characters for RTF (iPhone)
Google
I've also found a bunch of other resources on the web, but nothing quite solved my problem.
Answer
Brad Christie's answer
Had to add Trim() to remove the preceding space in the result. Other than that, Brad Christie's solution seems to work.
I'll run with this solution for now even though I have a bad gut feeling since we have to SubString and Trim the heck out of RichTextBox to get a RTF-formatted string.
Test case:
[TestMethod]
public void Test_To_Verify_Brad_Christies_Stackoverflow_Answer()
{
Assert.AreEqual(@"Europ\'e9en", "Européen".ConvertToRtf());
Assert.AreEqual(@"d\'e9finitif", "définitif".ConvertToRtf());
Assert.AreEqual(@"\'e0", "à".ConvertToRtf());
Assert.AreEqual(@"H\'e4user", "Häuser".ConvertToRtf());
Assert.AreEqual(@"T\'fcren", "Türen".ConvertToRtf());
Assert.AreEqual(@"B\'f6den", "Böden".ConvertToRtf());
}
Logic as an extension method:
public static class StringExtensions
{
public static string ConvertToRtf(this string value)
{
RichTextBox richTextBox = new RichTextBox();
richTextBox.Text = value;
int offset = richTextBox.Rtf.IndexOf(@"\f0\fs17") + 8; // offset = 118;
int len = richTextBox.Rtf.LastIndexOf(@"\par") - offset;
string result = richTextBox.Rtf.Substring(offset, len).Trim();
return result;
}
}

Doesn't RichTextBox always have the same header/footer? You could just read the content based on off-set location, and continue using it to parse. (I think? please correct me if I'm wrong)
There are libraries available, but I've never had good luck with them personally (though I always found another method before fully exhausting the possibilities). In addition, most of the better ones usually carry a nominal fee.
EDIT
Kind of a hack, but this should get you through what you need to get through (I hope):
RichTextBox rich = new RichTextBox();
Console.Write(rich.Rtf);
String[] words = { "Européen", "Apple", "Carrot", "Touché", "Résumé", "A Européen eating an apple while writing his Résumé, Touché!" };
foreach (String word in words)
{
rich.Text = word;
Int32 offset = rich.Rtf.IndexOf(@"\f0\fs17") + 8;
Int32 len = rich.Rtf.LastIndexOf(@"\par") - offset;
Console.WriteLine("{0,-15} : {1}", word, rich.Rtf.Substring(offset, len).Trim());
}
EDIT 2
The RTF control codes break down as follows:
Header
\f0 - Use the 0-index font (first font in the list, which is typically Microsoft Sans Serif (noted in the font table in the header: {\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}}))
\fs17 - Font formatting, specify the size is 17 (17 being in half-points)
Footer
\par is specifying that it's the end of a paragraph.
Hopefully that clears some things up. ;-)

I found a nice solution that actually uses the RichTextBox itself to do the conversion:
private static string FormatAsRTF(string DirtyText)
{
System.Windows.Forms.RichTextBox rtf = new System.Windows.Forms.RichTextBox();
rtf.Text = DirtyText;
return rtf.Rtf;
}
http://www.baltimoreconsulting.com/blog/development/easily-convert-a-string-to-rtf-in-net/

This is how I went:
private string ConvertString2RTF(string input)
{
//first take care of special RTF chars
StringBuilder backslashed = new StringBuilder(input);
backslashed.Replace(@"\", @"\\");
backslashed.Replace(@"{", @"\{");
backslashed.Replace(@"}", @"\}");
//then convert the string char by char
StringBuilder sb = new StringBuilder();
foreach (char character in backslashed.ToString())
{
if (character <= 0x7f)
sb.Append(character);
else
sb.Append("\\u" + Convert.ToUInt32(character) + "?");
}
return sb.ToString();
}
I think using a RichTextBox is:
1) overkill
2) I don't like RichTextBox after spending days of trying to make it work with an RTF document created in Word.
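A variation on the manual approach above, as a minimal sketch rather than a definitive implementation. It produces the `\'e9` form asked for in the question by emitting a hex escape for characters in the range U+00A0 to U+00FF, and falls back to `\uN?` for everything else above ASCII. It assumes the RTF document declares Windows-1252 (or another ANSI code page whose bytes 0xA0 to 0xFF match Latin-1); for other code pages the hex escapes would be wrong. The RTF specification defines the `\u` argument as a signed 16-bit number, so the sketch casts to `short`. `RtfEscaper` and `Escape` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtfEscaper
{
    // Escapes a plain string for use inside an RTF document body.
    // Assumption: the document's ANSI code page is Windows-1252.
    public static string Escape(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                // RTF syntax characters are escaped with a backslash.
                sb.Append('\\').Append(c);
            }
            else if (c <= 0x7f)
            {
                sb.Append(c);
            }
            else if (c >= 0xa0 && c <= 0xff)
            {
                // Code point equals the Windows-1252 byte in this range.
                sb.Append(@"\'").Append(((int)c).ToString("x2", CultureInfo.InvariantCulture));
            }
            else
            {
                // \uN takes a signed 16-bit decimal; '?' is the fallback
                // character for readers that do not understand \u.
                short signed = unchecked((short)c);
                sb.Append(@"\u").Append(signed.ToString(CultureInfo.InvariantCulture)).Append('?');
            }
        }
        return sb.ToString();
    }
}
```

For example, `RtfEscaper.Escape("Européen")` yields `Europ\'e9en`, and a box-drawing character such as U+2554 yields `\u9556?`. No RichTextBox instance is needed, so this also works outside Windows Forms.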

Below is an ugly example of converting a string to an RTF string:
class Program
{
static RichTextBox generalRTF = new RichTextBox();
static void Main()
{
string foo = @"Européen";
string output = ToRtf(foo);
Trace.WriteLine(output);
}
private static string ToRtf(string foo)
{
string bar = string.Format("!!##!!{0}!!##!!", foo);
generalRTF.Text = bar;
int pos1 = generalRTF.Rtf.IndexOf("!!##!!");
int pos2 = generalRTF.Rtf.LastIndexOf("!!##!!");
if (pos1 != -1 && pos2 != -1 && pos2 > pos1 + "!!##!!".Length)
{
pos1 += "!!##!!".Length;
return generalRTF.Rtf.Substring(pos1, pos2 - pos1);
}
throw new Exception("Not sure how this happened...");
}
}

I know it has been a while, hope this helps..
This code is working for me after trying every conversion code I could put my hands on:
titleText and contentText are simple text filled in a regular TextBox
var rtb = new RichTextBox();
rtb.AppendText(titleText);
rtb.AppendText(Environment.NewLine);
rtb.AppendText(contentText);
rtb.Refresh();
rtb.Rtf now holds the RTF text.
The following code will save the RTF text so that you can open the file, edit it and then load it back into a RichTextBox:
rtb.SaveFile(path, RichTextBoxStreamType.RichText);

Here's an improved version of @Vladislav Zalesak's answer:
public static string ConvertToRtf(string text)
{
// using default template from wiki
StringBuilder sb = new StringBuilder(@"{\rtf1\ansi\ansicpg1250\deff0{\fonttbl\f0\fswiss Helvetica;}\f0\pard ");
foreach (char character in text)
{
if (character <= 0x7f)
{
// escaping rtf characters
switch (character)
{
case '\\':
case '{':
case '}':
sb.Append('\\');
break;
case '\r':
sb.Append("\\par");
break;
}
sb.Append(character);
}
// converting special characters
else
{
sb.Append("\\u" + Convert.ToUInt32(character) + "?");
}
}
sb.Append("}");
return sb.ToString();
}

Not the most elegant, but quite optimal and fast method:
public static string PlainTextToRtf(string plainText)
{
if (string.IsNullOrEmpty(plainText))
return "";
string escapedPlainText = plainText.Replace(@"\", @"\\").Replace("{", @"\{").Replace("}", @"\}");
escapedPlainText = EncodeCharacters(escapedPlainText);
string rtf = @"{\rtf1\ansi\ansicpg1250\deff0{\fonttbl\f0\fswiss Helvetica;}\f0\pard ";
rtf += escapedPlainText.Replace(Environment.NewLine, "\\par\r\n ");
rtf += " }";
return rtf;
}
Encode characters (Polish ones) method:
private static string EncodeCharacters(string text)
{
if (string.IsNullOrEmpty(text))
return "";
return text
.Replace("ą", @"\'b9")
.Replace("ć", @"\'e6")
.Replace("ę", @"\'ea")
.Replace("ł", @"\'b3")
.Replace("ń", @"\'f1")
.Replace("ó", @"\'f3")
.Replace("ś", @"\'9c")
.Replace("ź", @"\'9f")
.Replace("ż", @"\'bf")
.Replace("Ą", @"\'a5")
.Replace("Ć", @"\'c6")
.Replace("Ę", @"\'ca")
.Replace("Ł", @"\'a3")
.Replace("Ń", @"\'d1")
.Replace("Ó", @"\'d3")
.Replace("Ś", @"\'8c")
.Replace("Ź", @"\'8f")
.Replace("Ż", @"\'af");
}

Related

Display formatted text on console using padding

I'm planning to write a description of the parameters of my console app in a formatted way similar to the following:
The following options are possible:
myOption: Text do describe the option, but that should be splitted
          to several lines if too big. Text should automatically
          align by a fixed offset.
I already have a method that splits the text at the right positions (assuming we do not care whether we split in the middle of a word; things only get complicated if we care about splitting at word boundaries). However, I am stuck on aligning the option's explanation.
This is the code so far:
public void DisplayHelpEx()
{
var offset = this._options.Max(x => x.ParameterName.Length) + 6;
Console.WriteLine("The following options are possible:");
foreach (var option in this._corrections)
{
Console.Write((option.ParameterName + ": ").PadLeft(offset));
WriteOffset(offset, option.Explanation);
}
}
public void WriteOffset(int offset, string text)
{
var numChars = TOTAL_NUMBER_CHARS_PER_LINE - offset;
string line;
while ((line = new String(text.Take(numChars).ToArray())).Any())
{
var s = line.PadLeft(numChars);
Console.Write(s);
Console.WriteLine();
text= new String(text.Skip(numChars).ToArray());
}
}
I have tried many combinations of .PadLeft and .PadRight but can´t get it to work.
With the approach above I get the following output:
The following options are possible:
myOption: Text do describe the option, but that should be splitted
to several lines if too big. Text should automatically
align by a fixed offset.
PadLeft takes the text and adds spaces on the left (PadRight adds them on the right) so that the full text has a defined width; see https://msdn.microsoft.com/en-us/library/system.string.padleft(v=vs.110).aspx .
However, in your case, you don't want to have the whole text to have a fixed width (especially not if you in future want to split nicely at word boundaries), but rather the offset at the beginning. So why don't you just add the offset spaces at the beginning of each line, like so?
private const string optionParameterName = "myOption";
private const string optionText =
"Text do describe the option, but that should be splitted to several lines if too big.Text should automatically align by a fixed offset.";
private const int TOTAL_NUMBER_CHARS_PER_LINE = 60;
public void DisplayHelpEx()
{
var offset = optionParameterName.Length + 6;
Console.WriteLine("The following options are possible:");
WriteOffset(offset, optionParameterName + ": ", optionText);
}
public void WriteOffset(int offset, string label, string text)
{
var numChars = TOTAL_NUMBER_CHARS_PER_LINE - offset;
string offsetString = new string(' ', offset);
string line;
bool firstLine = true;
while ((line = new String(text.Take(numChars).ToArray())).Any())
{
if (firstLine)
{
Console.Write(label.PadRight(offset));
}
else
{
Console.Write(offsetString);
}
firstLine = false;
Console.Write(line);
Console.WriteLine();
text = new String(text.Skip(numChars).ToArray());
}
}
// output:
// The following options are possible:
// myOption:     Text do describe the option, but that should b
//               e splitted to several lines if too big.Text sh
//               ould automatically align by a fixed offset.
Note that I used label.PadRight(offset) in the first line to make sure that the string with the label is padded to the correct length -- here the padding is useful because it allows us to make the label string have exactly the same width as the other offsets.
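If splitting at word boundaries is wanted later, the same offset idea carries over. The following is a minimal sketch under stated assumptions, not part of the answer above: `HelpFormatter` and `Format` are hypothetical names, the result is returned as a string rather than written to the console, and a single word longer than the available width is left to overflow its line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HelpFormatter
{
    // Wraps text at spaces and indents every line by a fixed offset.
    // The label occupies the offset columns of the first line.
    public static string Format(string label, string text, int offset, int totalWidth)
    {
        int width = totalWidth - offset;
        if (width <= 0) throw new ArgumentException("offset must be smaller than totalWidth");

        var lines = new List<string>();
        var current = new StringBuilder();
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Start a new line when the next word would not fit.
            if (current.Length > 0 && current.Length + 1 + word.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            if (current.Length > 0) current.Append(' ');
            current.Append(word);
        }
        if (current.Length > 0) lines.Add(current.ToString());

        var result = new StringBuilder();
        string pad = new string(' ', offset);
        for (int i = 0; i < lines.Count; i++)
        {
            // First line gets the padded label, later lines get spaces only.
            result.Append(i == 0 ? label.PadRight(offset) : pad);
            result.Append(lines[i]).Append('\n');
        }
        return result.ToString();
    }
}
```

Calling `Console.Write(HelpFormatter.Format("myOption: ", optionText, 14, 60))` would print the explanation with every continuation line starting in column 14.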
In your WriteOffset method the text is modified in ways that are hard to follow.
Here is a modified program, tested in a fiddle:
public class Program
{
static int TOTAL_NUMBER_CHARS_PER_LINE = 64;
public static void Main()
{
DisplayHelpEx();
}
// i only set test params here
public static void DisplayHelpEx()
{
string ParameterName = "myOption";
string Explanation = "Text do describe the option, but that should be splitted to several lines if too big. Text should automatically align by a fixed offset";
int offset = ParameterName.Length + 6;
Console.WriteLine("The following options are possible:");
// print caption
Console.Write((ParameterName + ": ").PadLeft(offset));
// print help
WriteOffset(offset, TOTAL_NUMBER_CHARS_PER_LINE - offset, Explanation);
}
private static void WriteOffset(int offset, int width, string text)
{
string pad = new String(' ', offset);
int i = 0;
while (i < text.Length)
{
// print text by 1 symbol
Console.Write(text[i]);
i++;
if (i%width == 0)
{
// when line end reached, go to next
Console.WriteLine();
// make offset for new line
Console.Write(pad);
}
}
}
}

XElement.Parse breaks due to an invalid xml text [duplicate]

I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?
To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is available in Silverlight too. Here is a small sample:
void Main() {
string content = "\v\f\0";
Console.WriteLine(IsValidXmlString(content)); // False
content = RemoveInvalidXmlChars(content);
Console.WriteLine(IsValidXmlString(content)); // True
}
static string RemoveInvalidXmlChars(string text) {
var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
return new string(validXmlChars);
}
static bool IsValidXmlString(string text) {
try {
XmlConvert.VerifyXmlChars(text);
return true;
} catch {
return false;
}
}
To escape invalid XML characters, I suggest using the XmlConvert.EncodeName method. Here is a small sample:
void Main() {
const string content = "\v\f\0";
Console.WriteLine(IsValidXmlString(content)); // False
string encoded = XmlConvert.EncodeName(content);
Console.WriteLine(IsValidXmlString(encoded)); // True
string decoded = XmlConvert.DecodeName(encoded);
Console.WriteLine(content == decoded); // True
}
static bool IsValidXmlString(string text) {
try {
XmlConvert.VerifyXmlChars(text);
return true;
} catch {
return false;
}
}
Update:
It should be mentioned that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store the encoded string in a database column with a length limit and validate the source string's length in your app against that limit.
Use SecurityElement.Escape
using System;
using System.Security;
class Sample {
static void Main() {
string text = "Escape characters : < > & \" \'";
string xmlText = SecurityElement.Escape(text);
//output:
//Escape characters : &lt; &gt; &amp; &quot; &apos;
Console.WriteLine(xmlText);
}
}
If you are writing xml, just use the classes provided by the framework to create the xml. You won't have to bother with escaping or anything.
Console.Write(new XElement("Data", "< > &"));
Will output
<Data>&lt; &gt; &amp;</Data>
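The framework classes escape markup characters, but they still reject characters that XML does not allow at all, such as "\0". A minimal sketch, assuming it is acceptable to drop those characters silently, combines the char-by-char IsXmlChar filter from earlier in this thread with XElement. `SafeXml` and `ToElementString` are hypothetical names, and because the filter looks at one UTF-16 code unit at a time it also drops characters outside the Basic Multilingual Plane.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class SafeXml
{
    // Removes characters that are not legal in XML, then lets XElement
    // escape the remaining markup characters when serializing.
    public static string ToElementString(string name, string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        // Keep only code units accepted by the XML character rules.
        char[] valid = value.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        string cleaned = new string(valid);

        // XElement.ToString() serializes without an XML declaration.
        return new XElement(name, cleaned).ToString();
    }
}
```

The returned string can be parsed back with `XElement.Parse`, and its `Value` is the cleaned input with the markup characters intact.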
If you need to read an XML file that is malformed, do not use regular expressions. Instead, use the Html Agility Pack.
The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:
static void Main()
{
const string content = "\v\U00010330";
string newContent = RemoveInvalidXmlChars(content);
Console.WriteLine(newContent);
}
This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.
To support surrogate characters, I suggest using the following method:
public static string RemoveInvalidXmlChars(string text)
{
if (string.IsNullOrEmpty(text))
return text;
int length = text.Length;
StringBuilder stringBuilder = new StringBuilder(length);
for (int i = 0; i < length; ++i)
{
if (XmlConvert.IsXmlChar(text[i]))
{
stringBuilder.Append(text[i]);
}
else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
{
stringBuilder.Append(text[i]);
stringBuilder.Append(text[i + 1]);
++i;
}
}
return stringBuilder.ToString();
}
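Here is an optimized version of the RemoveInvalidXmlChars method above. It avoids creating a new array on every call, which would stress the GC unnecessarily. Note that, as written, it checks one char at a time and therefore does not keep surrogate pairs: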
public static string RemoveInvalidXmlChars(string text)
{
if (text == null)
return text;
if (text.Length == 0)
return text;
// a bit complicated, but avoids memory usage if not necessary
StringBuilder result = null;
for (int i = 0; i < text.Length; i++)
{
var ch = text[i];
if (XmlConvert.IsXmlChar(ch))
{
result?.Append(ch);
}
else if (result == null)
{
result = new StringBuilder();
result.Append(text.Substring(0, i));
}
}
if (result == null)
return text; // no invalid xml chars detected - return original text
else
return result.ToString();
}
// Replace invalid characters with empty strings.
Regex.Replace(inputString, @"[^\w\.@-]", "");
The regular expression pattern [^\w\.@-] matches any character that is not a word character, a period, an @ symbol, or a hyphen. A word character is any letter, decimal digit, or punctuation connector such as an underscore. Any character that matches this pattern is replaced by String.Empty, which is the string defined by the replacement pattern. To allow additional characters in user input, add those characters to the character class in the regular expression pattern. For example, the regular expression pattern [^\w\.@-\\%] also allows a percentage symbol and a backslash in an input string.
Regex.Replace(inputString, @"[!@#$%_]", "");
See also:
Removing Invalid Characters from XML Name Tag - RegEx C#
Here is a function to remove the characters from a specified XML string:
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
namespace XMLUtils
{
class Standards
{
/// <summary>
/// Strips non-printable ascii characters
/// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
/// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
/// </summary>
/// <param name="tmpContents">contents</param>
/// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
/// <returns>The contents with every match of the pattern removed</returns>
private string StripIllegalXMLChars(string tmpContents, string XMLVersion)
{
string pattern = String.Empty;
switch (XMLVersion)
{
case "1.0":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
break;
case "1.1":
pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
break;
default:
throw new Exception("Error: Invalid XML Version!");
}
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
// Return the cleaned string instead of discarding it
return regex.Replace(tmpContents, String.Empty);
}
}
}
If you are only escaping invalid XML characters for a string that is used inside of an XML tag you could do something simple like this.
This works when you aren't using an XML library.
public string EscapeXMLCharacters (string target)
{
return
target
.Replace("&", "&amp;")
.Replace("<", "&lt;")
.Replace(">", "&gt;")
.Replace("\"", "&quot;")
.Replace("'", "&apos;");
}
you could then call it like so:
public string GetXMLBody(string content)
{
return @"<input>" + EscapeXMLCharacters(content) + "</input>";
}
string XMLWriteStringWithoutIllegalCharacters(string UnfilteredString)
{
if (UnfilteredString == null)
return string.Empty;
return XmlConvert.EncodeName(UnfilteredString);
}
string XMLReadStringWithoutIllegalCharacters(string FilteredString)
{
if (FilteredString == null)
return string.Empty;
return XmlConvert.DecodeName(FilteredString);
}
These simple methods replace the invalid characters with encoded equivalents that are accepted in an XML context.
To write a string, use XMLWriteStringWithoutIllegalCharacters(string UnfilteredString).
To read a string, use XMLReadStringWithoutIllegalCharacters(string FilteredString).

RichTextBox and special chars c#

I need to put RTF-formatted text into a RichTextBox. I tried assigning it through the richtextbox.Rtf = TextString property, but the string contains special characters and the RichTextBox does not display all of it correctly. The string and code that I am using:
String (TextString):
╔═══This is only an example, the special characters may change═══╗
C# Code:
String TextString = System.Text.Encoding.UTF8.GetString(TextBytes);
String TextRTF = @"{\rtf1\ansi " + TextString + "}";
richtextbox1.Rtf = TextRTF;
With this code, the RichTextBox shows "+---This is only an example, the special characters may change---+" and in some cases "??????".
How can I solve this problem? Changing \rtf1\ansi to \rtf1\utf-8 makes no difference.
You can simply use the Text property:
richTextBox1.Text = "╔═══This is only an example, the special characters may change═══╗";
If you want to use the RTF property:
Take a look at this question: How to output unicode string to RTF (using C#)
You need to use something like this to convert the special characters to rtf format:
static string GetRtfUnicodeEscapedString(string s)
{
var sb = new StringBuilder();
foreach (var c in s)
{
if(c == '\\' || c == '{' || c == '}')
sb.Append(@"\" + c);
else if (c <= 0x7f)
sb.Append(c);
else
sb.Append("\\u" + Convert.ToUInt32(c) + "?");
}
return sb.ToString();
}
Then use:
richtextbox1.Rtf = @"{\rtf1\ansi " + GetRtfUnicodeEscapedString(TextString) + "}";

Remove all non-ASCII characters from string

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.
I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?
Here is a simple solution:
public static bool IsASCII(this string value)
{
// ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
return Encoding.UTF8.GetByteCount(value) == value.Length;
}
source: http://snipplr.com/view/35806/
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Do it all at once
public string ReturnCleanASCII(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
foreach(char c in s)
{
if((int)c > 127) // you probably don't want 127 either
continue;
if((int)c < 32) // I bet you don't want control characters
continue;
if(c == ',')
continue;
if(c == '"')
continue;
sb.Append(c);
}
return sb.ToString();
}
If you wanted to test a specific character, you could use
if ((int)myChar <= 127)
Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.
Here's an improvement upon the accepted answer:
string fallbackStr = "";
Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
new EncoderReplacementFallback(fallbackStr),
new DecoderReplacementFallback(fallbackStr));
string cleanStr = enc.GetString(enc.GetBytes(inputStr));
This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)
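To make the difference between the two encoding-based approaches concrete, here is a minimal sketch that puts them side by side. `AsciiCleaner`, `ReplaceNonAscii` and `RemoveNonAscii` are hypothetical names; the calls themselves are the ones already used in the answers above.

```csharp
using System;
using System.Text;

public static class AsciiCleaner
{
    // Round trip through the default ASCII encoding:
    // characters that cannot be encoded come back as '?'.
    public static string ReplaceNonAscii(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    // Same round trip, but with an empty replacement fallback,
    // so characters that cannot be encoded are left out entirely.
    public static string RemoveNonAscii(string s)
    {
        Encoding enc = Encoding.GetEncoding(
            Encoding.ASCII.CodePage,
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        return enc.GetString(enc.GetBytes(s));
    }
}
```

With the input "Européen", `ReplaceNonAscii` gives "Europ?en" while `RemoveNonAscii` gives "Europen". Which one fits depends on whether the field length or position of the original characters matters to the downstream matching.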
It sounds kind of strange that it's acceptable to drop the non-ASCII characters.
Also I always recommend the excellent FileHelpers library for parsing CSV-files.
strText = Regex.Replace(strText, #"[^\u0020-\u007E]", string.Empty);
public string RunCharacterCheckASCII(string s)
{
string str = s;
bool is_find = false;
char ch;
int ich = 0;
try
{
char[] schar = str.ToCharArray();
for (int i = 0; i < schar.Length; i++)
{
ch = schar[i];
ich = (int)ch;
if (ich > 127) // not ascii or extended ascii
{
is_find = true;
schar[i] = '?';
}
}
if (is_find)
str = new string(schar);
}
catch (Exception ex)
{
}
return str;
}
