How do I replace all special characters with their respective hex codes? - c#

I have an XML file that contains multiple special characters.
I want to replace every special character with its hex character reference.
So & becomes &#x0026; and so on, but only for special characters.
Please help.

You can use HttpUtility.HtmlEncode to encode special characters; HttpUtility.HtmlDecode is its inverse and turns the encoded form back into characters. The official documentation for the decode direction: https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode
You cannot apply the encoding to the whole XML string, though, because the < and > of the markup would be replaced too. Apply it only to text nodes and attribute values.
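Because the question asks for hex references specifically, one option is to do the replacement by hand. The following is a minimal sketch, not a definitive implementation. It assumes that "special" means the five XML-reserved characters plus anything above printable ASCII; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class HexEntityEncoder
{
    // Assumption: "special" means & < > " ' plus every code point above 0x7E.
    private static bool IsSpecial(int codePoint)
    {
        if (codePoint > 0x7E) return true;
        return codePoint == '&' || codePoint == '<' || codePoint == '>'
            || codePoint == '"' || codePoint == '\'';
    }

    // Replaces each special character in a piece of text content
    // with a hex character reference such as &#x0026;.
    public static string Encode(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int cp = text[i];
            // Combine a surrogate pair into a single code point.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                cp = char.ConvertToUtf32(text[i], text[i + 1]);
                i++;
            }

            if (IsSpecial(cp))
                sb.Append("&#x").Append(cp.ToString("X4")).Append(';');
            else
                sb.Append((char)cp);
        }
        return sb.ToString();
    }
}
```

With this sketch, HexEntityEncoder.Encode("a & b") gives "a &#x0026; b". It is meant for text content only. If the result is then handed to an XML writer as a text value, the writer will most likely escape the ampersands a second time, so the encoded text has to be placed into the markup directly.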

Related

Detect Special Characters in a text in C#

In my program, I process strings that can be in any language (e.g. Japanese, Portuguese, Mandarin, English, etc.).
Sometimes these strings contain HTML special characters such as the trademark symbol (™), registered symbol (®), copyright symbol (©), etc.
I then generate an Excel sheet from these details. When a string contains a special character, the Excel file is still created, but it cannot be opened because it appears to be corrupted.
So I encoded each string before writing it to Excel. The result was that every non-English string was encoded as well. For example, the asset description, which is Japanese text, was also converted into encoded text.
゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to a run of numeric character references, but I only wanted to encode the special characters.
So what I need is to identify whether a string contains that kind of special character. Since I am dealing with multiple languages, is there any way to identify whether a string contains HTML special characters?
Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
    Console.WriteLine("Special character(s) detected.");
}
Try the Regex.Replace method:
// Replace letters and numbers with nothing, then check whether any characters are left.
// Whatever remains will be something like $, #, or ^.
//
// [\p{L}\p{Nd}]+ matches letters and decimal digits in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
    // Do whatever with the string.
}
I suppose you could start by treating your string as a Char array:
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed, on a second read of that manual page, why not use Char.IsSymbol:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca)
{
    if (Char.IsSymbol(c))
        Console.WriteLine("found symbol:{0} ", c);
}
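The Char.IsSymbol idea can be extended from detection to encoding only the symbols. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, decimal character references are an arbitrary choice, and only characters in the Basic Multilingual Plane are considered. Char.IsSymbol also returns true for math and currency characters such as + and $, so those would be encoded too.

```csharp
using System;
using System.Linq;
using System.Text;

public static class SymbolEncoder
{
    // True if any character falls in a Unicode symbol category
    // (math, currency, modifier, or other symbol).
    public static bool ContainsSymbol(string s)
    {
        return s.Any(c => char.IsSymbol(c));
    }

    // Encodes only symbol characters as decimal character references
    // and leaves letters of every language untouched.
    public static string EncodeSymbols(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (char.IsSymbol(c))
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For example, SymbolEncoder.EncodeSymbols("Acme™") gives "Acme&#8482;", while a string made only of letters, such as "駤びょ菣", is returned unchanged.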

Regex: How to specify a range in hex covering a collection of Unicode symbols

I have an XML document that contains an invalid character (hex: 0x2642). I want to remove it before deserializing the document. The XML is held as a string while we strip the invalid characters from it. So far, we've used:
xmlString = Regex.Replace(xmlString, @"[^\u0000-\uF000]", string.Empty);
It worked for control characters, but instead of specifying 0x2642 alone, I want a regex range that covers a whole group of symbols, to avoid this issue in the future (specifically the block of symbols that contains it).
To match the linked symbols of the MiscellaneousSymbols block in C#, you can use the regex \p{IsMiscellaneousSymbols}.
C# uses \p{IsBlockName} for Unicode blocks.
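Applied to the question's string cleanup, the block name can be used as follows. This is a minimal sketch; the class and method names are hypothetical, and it assumes that dropping every character of the Miscellaneous Symbols block (U+2600 to U+26FF) is acceptable for the document.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSymbolStripper
{
    // Matches any character in the Unicode block "Miscellaneous Symbols".
    private static readonly Regex MiscSymbols =
        new Regex(@"\p{IsMiscellaneousSymbols}");

    // Removes all Miscellaneous Symbols characters from the XML string.
    public static string Strip(string xmlString)
    {
        return MiscSymbols.Replace(xmlString, string.Empty);
    }
}
```

For example, XmlSymbolStripper.Strip("<a>x\u2642y</a>") gives "<a>xy</a>". Several blocks can be combined in one character class, such as [\p{IsMiscellaneousSymbols}\p{IsDingbats}].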

Regex match a hash that has been split over multiple lines

I want to match a hash that has been word wrapped by an author, and received over multiple lines.
Example:
SHA256: AB76235776BC87DBAB76235776BC87DBAB76235776BC87
DBAB76235776BC87DB
This is what was received. My usual regex to match a SHA-256 hash like this is of course: [0-9A-Fa-f]{64}
But this does not work here, because the line break interrupts the run of hex digits. I would like to leave the file unmodified while searching for this match. Any ideas on how to match the split hash without removing newlines?
I'd like to have a regex that basically says 'look for 64 sequential hexadecimal values, but allow for one or more newlines in the mix, kthx'
Thanks in advance. C# is the language.
Try this:
\b(?:[a-fA-F0-9]\s*){64}\b
It allows any kind of whitespace, not just line separators. If it really has to allow only line separators, you can use this:
\b(?:[a-fA-F0-9][\r\n]*){64}\b
This will also include the line separator following the number, if there is one, and if it's followed by a word character. You can prevent that like this:
\b(?:[a-fA-F0-9][\r\n]*){63}[a-fA-F0-9]\b
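The last pattern can be used to locate the wrapped hash without touching the source text, and then normalise only the matched value. A minimal sketch, with hypothetical class and method names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WrappedHashFinder
{
    // 64 hex digits, with optional line breaks between (not after) them.
    private static readonly Regex Sha256Wrapped =
        new Regex(@"\b(?:[a-fA-F0-9][\r\n]*){63}[a-fA-F0-9]\b");

    // Returns the first hash found with its line breaks stripped,
    // or null if the text contains no match. The input is not modified.
    public static string Find(string text)
    {
        Match m = Sha256Wrapped.Match(text);
        if (!m.Success) return null;
        return m.Value.Replace("\r", "").Replace("\n", "");
    }
}
```

Called on the example from the question, Find returns the 64 hex digits as one unbroken string, while the file content itself stays as it was received.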
Change your regex to include newline characters:
[A-Fa-f0-9\r\n ]{64,}
You could set an upper bound to limit the number of line breaks.
Keep in mind that a line break can be two characters long (\r\n), depending on the OS, so:
1 line break --> up to 66 chars
2 line breaks --> up to 68 chars
Continue as far as you like.
On a side note: while parsing a file, you generally leave the file itself untouched. All your modifications are made to the variables you read the file into, which is why I do not see the point of keeping the line breaks.

Best way to separate two base64 strings

I am using standard input and output to pass two Base64 strings from one application to another. What would be the best way to separate them so that I can read them as two separate strings in the other application? I was thinking of using a simple comma to separate them and then just calling
string[] s = output.Split(',');
Where output is the data I read in from standard output.
Example with the comma:
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCv5E5Y0Wrad5FrLjeUsA71Gipl3mhjIuCw1xhj
jDwXN87lIhpE32UvItf+mvp8flQ+fhi5H0PditDCzUFg8lXuiuOXxelLWEXA8hs7jc+4zzR5ps3R
fOv3M6H8K5XGkwWLhVNQX47sAGyY/43JdbfX7+FsYUFeHW/wa2yKSMZS3wIDAQAB
,HNJpFQyeyJoVbeEgkw/WNtzR0JTPIa1hlK1C8LbFcVcJfL33ssq3gbzi0zxn0n2WxBYKJZj2Kqbs
lVrmFbQJRgvq4ZNF4F8z+xjL9RVVE/rk5x243c3Szh05Phzx+IUyXJe6GkITDmsxcwovvzSaGhzU
3qQkNbhIN0fVyynpg0Kfm0WytuW71ku1eq45ibcczgwQLRJX1GKzC9wH7x/V36i6SpyrxZ/+uCIL
4QgnKt6x4QG7Gfk3Msam6h6JTFdzkeHJjq6JzeapdQn5LxeMY0jLGc4cadMCvy/Jdrcg02pG2wOO
/gJT77xvX+d1igi+BQ/YpFlwXI0BIuRwMAeLojmZdRYjJ+LY69auxgpnQvSF4A+Wc6Jo8m1pzzHB
yQvA8KyiRwbyijoBOsg+oK18UPFWeJ5hE3e+8l/WSEcii+oPgXyXTnK+seesGdOPeem3HukNyIps
L/StHZEkzeJFTr8LIB9HLqDikYU2mQjTiK5cIExoyy2Go+0ndL84rCzMZAlfFlffocL9x+SGyeer
M1mxmyDtmiQfDphEZixHOylciKUhWR00dhxkVRQ4Q9LYCeyGfDiewL+rm5se/ePCklWtTGycV9HM
H5vYLhgIkf5W6+XcqcJlE6vp4WWxmKHQYqRAdfW5MYWskx7jBDTMV2MLy7N6gQRQa/OpK8ruAbVf
MwWP1sGyhAxgrw/UxTH1tW498WI5JtQR3oub3+Uj5AqydhwzQtWM58WfVQXdv2bFZmGH7d9A+C95
DQ8QXKrV7Ot/wVq5KKLgpJy8iMe/G/iyXOmQhkLnZ3qvBaIJd+E2ZIVPty6XGMwgC4JebArr+a6V
Cb/SO+vR+eZmXLln/w==
All you have to do is use a separator that is not a valid Base64 character. A comma is not a Base64 character, so you can use it.
Base64 characters are [0-9a-zA-Z/=+] (all digits, uppercase and lowercase letters, forward slash, plus, and the equals sign).
This seems like a good solution. The comma cannot be part of the Base64 index table, so it is a safe separator.
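The comma approach can be sketched for both sides of the pipe as follows. This is a minimal sketch with hypothetical class and method names. Convert.FromBase64String ignores white space, so line-wrapped Base64 like the example in the question still decodes.

```csharp
using System;

public static class Base64Pair
{
    // Sender side: join two payloads with a comma, which never
    // occurs in standard Base64 output.
    public static string Join(byte[] first, byte[] second)
    {
        return Convert.ToBase64String(first) + "," + Convert.ToBase64String(second);
    }

    // Receiver side: split on the comma and decode both halves.
    public static (byte[] First, byte[] Second) Split(string output)
    {
        string[] parts = output.Split(',');
        if (parts.Length != 2)
            throw new FormatException("Expected exactly two comma-separated Base64 strings.");
        return (Convert.FromBase64String(parts[0]), Convert.FromBase64String(parts[1]));
    }
}
```

Checking the part count catches truncated or concatenated input early instead of failing later inside the Base64 decoder.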
You can also wrap it in some XML. A CDATA section is perfect for that.

Removing String Escape Codes

My program outputs strings like "Wzyryrff}av{v5~fvzu: Bb``igbuz~+\177Ql\027}C5]{H5LqL{" and the problem is the escape codes (\\\ instead of \, \177 instead of the character, etc.)
I need a way to unescape the string of all escape codes (mainly just the \\\ and octal \027 types). Is there something that already does this?
Thanks
Reference: http://www.tailrecursive.org/postscript/escapes.html
The strings are an encrypted value and I need to decrypt them, but I'm getting the wrong values since the strings are escaped
It sounds more like it's encoded rather than simply escaped (if \177 is really a character). So, try decoding it.
There is nothing built in to do exactly this kind of escaping.
You will need to parse and replace these sequences yourself.
The \xxx octal escapes can be found with a regex (\\[0-7]{3}); iterating over the matches lets you parse out the octal part and get the replacement character for it (then a simple replace will do).
The others appear to be simple to replace with string.Replace.
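The parse-and-replace approach can be sketched in a single pass with a match evaluator. This is a minimal sketch, not a complete PostScript unescaper. It assumes only three-digit octal escapes and doubled backslashes need handling, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OctalUnescaper
{
    // A backslash followed by either three octal digits (group 1)
    // or a second backslash (group 2).
    private static readonly Regex Escape =
        new Regex(@"\\(?:([0-7]{3})|(\\))");

    public static string Unescape(string input)
    {
        return Escape.Replace(input, m =>
        {
            if (m.Groups[1].Success)
            {
                // Parse the digits as base 8 and emit that character.
                int code = Convert.ToInt32(m.Groups[1].Value, 8);
                return ((char)code).ToString();
            }
            // An escaped backslash becomes a single backslash.
            return "\\";
        });
    }
}
```

Doing both replacements in one pass matters: if doubled backslashes were collapsed first, a literal backslash followed by digits would afterwards be misread as an octal escape.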
If the string is encrypted then you probably need to treat it as binary and not text. You need to know how it is encoded and decode it accordingly. The fact that you can view it as text is incidental.
If you want to replace specific contents, you can just use the .Replace() method,
i.e. myInput.Replace(@"\\", @"\")
I am not sure why the "\" is a problem for you. If it is actually an escape code, then it should be fine, since \\ represents a single \ in a string.
What is the reason you need to "remove" the escape codes?
