Using a Regex to clean string versus Base64 Encoded string - c#

I have an extension method that uses Regex.Replace to clean up invalid characters in a user-entered string before it is added to an XML document.
The intent of the regex is to strip out some random hi-ASCII characters that are occasionally in the input when the user pastes text from Microsoft Word and replace them with a space:
public static string CleanInput(this string inputString) {
if (string.IsNullOrEmpty(inputString))
return string.Empty;
// Replace invalid characters with a space.
return Regex.Replace(inputString, @"[^\w\.#-]", " ");
}
As fate would have it, someone is now using this extension method on a string that contains base64-encoded data.
I believe the regex will leave MOST of the base64 data unmodified, but I think it might be changing some of it.
So, knowing that \w in the regex matches [A-Za-z0-9_] and that Base64 uses effectively the same range, should this regex be changing the string or not?
If it is changing the string, why, and how would you change it so that hi-ASCII garbage is still cleaned up in regular non-encoded text without mucking up the encoded string?

Base64 also uses +, /, and =, and your regex replaces all three with a space.
You can add these to your character class:
[^\w\.#+/=-]
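Note that - has to be last (or first, or escaped as \-) in order for it to be a literal hyphen-minus instead of specifying a range.
It is also worth considering that, according to Microsoft, \w isn't necessarily the same as [A-Za-z0-9_]: by default it also matches Unicode letters and digits, and it is only limited to the ASCII range when RegexOptions.ECMAScript is set.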
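As a rough sketch of the change described above, the extension method could look like the following. The class name InputCleaner and the cached Regex field are assumptions made for illustration; only the widened character class comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputCleaner
{
    // Original character class plus the Base64 symbols + / =.
    // The hyphen stays last so it is a literal rather than a range operator.
    private static readonly Regex Invalid = new Regex(@"[^\w\.#+/=-]");

    public static string CleanInput(this string inputString)
    {
        if (string.IsNullOrEmpty(inputString))
            return string.Empty;

        // Replace every character outside the allowed set with a space.
        return Invalid.Replace(inputString, " ");
    }
}
```

With this class, a Base64 string produced by Convert.ToBase64String should pass through unchanged, while a symbol such as © is still replaced by a space. Accented letters are kept either way, because \w is Unicode-aware by default.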

Related

C# regex does not allow special characters correctly?

For example I have the following string:
thats a\n\ntest\nwith multiline \n\nthings...
I tried the following code, which does not work correctly and still does not include all of the characters:
string text = "thats a\n\ntest\nwith multiline \n\nthings and so on";
var res = Regex.IsMatch(text, @"^([a-zA-Z0-9äöüÄÖÜß\-|()[\]/%'<>_?!=,*. ':;#+\\])+$");
Console.WriteLine(res);
I want the regex to return true only when the string consists of the following chars (it does not have to contain all of them, but it must contain at least one and no others):
a-z, A-Z, 0-9, äüöÄÖÜß and !#'-.:,; ^"§$%&/()=?\}][{³²°*+~'_<>|.
This is a list of known keyboard characters that I thought would be nice to allow inside a message.
If you specified all the chars you want to allow, the regex declaration in C# will look like
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|]+$"
However, the test string you supplied contains line feed (LF, \n, \x0A) chars, so you need to either test on a string with no newlines, or add \n to the character class:
@"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$"
Note that the " char is doubled since this is the only way to put a double quote into a verbatim string literal.
Also, the capturing parentheses in your pattern create redundant overhead, so you should remove them.
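A minimal sketch of the whitelist check, using the pattern with \n added. The class and method names (MessageValidator, IsAllowed) are hypothetical, and the source file is assumed to be saved as UTF-8 so the non-ASCII characters survive compilation.

```csharp
using System.Text.RegularExpressions;

public static class MessageValidator
{
    // Whitelist of allowed characters, including the line feed.
    // "" inside the verbatim string stands for one double quote.
    private static readonly Regex AllowedOnly = new Regex(
        @"^[a-zA-Z0-9äüöÄÖÜß!#'\-.:,; ^""§$%&/()=?\\}\][{³²°*+~'_<>|\n]+$");

    // True when the text is non-empty and contains only whitelisted characters.
    public static bool IsAllowed(string text) => AllowedOnly.IsMatch(text);
}
```

Note that text with Windows line endings would still be rejected, because \r is not in the class; adding \r next to \n would be one way to allow it.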

Convert backslash-escaped characters to literals, within a string

Are there any .NET provided functions to convert a string with backslash-escaped characters into the literal string?
For example, the string @"this\x20is a\ntest" should become "this is a\ntest", where \n is a literal newline character and \x20 is a literal space. These would (preferably) be Microsoft escape characters.
Try using Regex.Unescape:
using System.Text.RegularExpressions;
...
string result = Regex.Unescape(@"this\x20is a\ntest");
This results in:
this is a
test
https://dotnetfiddle.net/y2f5GE
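It might not always work as expected, because it processes regex escapes rather than C# escapes (for example, it also turns \. into a plain dot). Please read the docs for details.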
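The following sketch wraps Regex.Unescape to show both the intended use and the caveat. The names UnescapeDemo, Decode and TryDecode are made up for this example; the ArgumentException handling is based on the documented behaviour for unrecognized escape sequences.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Converts backslash escapes such as \x20, \n and \t into the characters they name.
    public static string Decode(string escaped) => Regex.Unescape(escaped);

    // Same conversion, but reports failure instead of throwing when the
    // input contains an escape sequence that Regex.Unescape rejects.
    public static bool TryDecode(string escaped, out string result)
    {
        try
        {
            result = Regex.Unescape(escaped);
            return true;
        }
        catch (ArgumentException)
        {
            result = null;
            return false;
        }
    }
}
```

Because the method targets regex syntax, escaped metacharacters are unescaped too, so input that legitimately contains sequences like \. may come out differently than a C#-style decoder would produce.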

Detect Special Characters in a text in C#

In my program, I'm going to process some strings. These strings can be in any language (e.g. Japanese, Portuguese, Mandarin, English, etc.).
Sometimes these strings contain HTML special characters such as the trademark symbol (™), the registered symbol (®), the copyright symbol (©), etc.
Then I generate an Excel sheet with these details. But when there is a special character, the Excel file is created but cannot be opened because it appears to be corrupted.
So I encoded each string before writing it into Excel. But then every string that was not English was encoded. The picture shows that the asset description, which is Japanese text, was also converted into encoded text.
For example, ゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to ゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で, but I wanted to encode only the special characters.
So what I need is to identify whether a string contains that kind of special character. Since I am dealing with multiple languages, is there a way to identify whether a string contains HTML special characters?
Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
Console.WriteLine("Special character(s) detected.");
}
Try the Regex.Replace method:
// Replace letters and numbers with nothing then check if there are any characters left.
// The only characters will be something like $, #, ^, or $.
//
// [\p{L}\p{Nd}]+ checks for words/numbers in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
// Do whatever with the string.
}
I suppose that you could start by treating your string as a Char array:
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed, on a second read of that manual page, why not use this:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca){
if (Char.IsSymbol(c))
Console.WriteLine("found symbol:{0} ",c );
}
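Building on the per-character approach, here is a hedged sketch that encodes only symbol characters and leaves letters from any language alone. It narrows the test to UnicodeCategory.OtherSymbol (which covers ™, ® and ©), because Char.IsSymbol also returns true for characters such as +, = and $. The class and method names are hypothetical, numeric character references are an assumed output format, and whether this resolves the Excel corruption is not something the question confirms.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class SymbolEncoder
{
    private static bool IsOtherSymbol(char c) =>
        char.GetUnicodeCategory(c) == UnicodeCategory.OtherSymbol;

    // True when the string contains at least one character such as (tm), (R) or (c).
    public static bool ContainsOtherSymbol(string s) => s.Any(IsOtherSymbol);

    // Replaces each such symbol with a decimal numeric character reference
    // and copies every other character through unchanged.
    public static string EncodeOtherSymbols(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (IsOtherSymbol(c))
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Characters outside the Basic Multilingual Plane (many emoji, for instance) arrive as surrogate pairs and are not treated as symbols by this per-char check.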

Is there a way to differentiate a string argument between non-hexadecimal and a hexadecimal?

Let's say we have the following signature
void doSomething(string s)
When the user calls the function, they can call
doSomething("hello") or doSomething("\x15\x3C\xFF")
Is there a way to tell when the argument is the second form, a hexadecimal value?
I want to do something like
if(isHex(s))
// do this
else
// do that
No. This is not possible. To the runtime environment, a string is essentially just an array of characters (which is essentially just a collection of bytes). It has no idea how those characters were originally represented either in plain text or escaped sequences of hexadecimal.
You can use a regex to check for valid hex strings. But to do this you must provide the string in hex notation as is, i.e. without C#'s interpretation and transformation into a normal string. Use a verbatim string (introduced by an "@") for this:
string s = @"\x15\x3C\xFF";
In verbatim strings, backslashes are not interpreted as escape characters by C#. The downside, of course, is that you no longer get the intended resulting string.
public static bool IsHexString(string s)
{
return Regex.IsMatch(s, @"^(\\x[0-9A-F]{2})+$");
}
Explanation of the regular expression:
^ beginning of string.
\\ escaped backslash ("\"). Not a C# escape here, but a regex escape.
x the letter "x".
[0-9A-F]{2} two consecutive hex digits.
(...)+ at least one occurrence of a hex number.
$ end of string (or just before a final newline).
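To make the distinction concrete, the sketch below puts the method from this answer into a small class (the class name HexNotation is an assumption) so the verbatim and regular forms of the same literal can be compared.

```csharp
using System.Text.RegularExpressions;

public static class HexNotation
{
    // Matches text that consists solely of \xHH groups with uppercase hex digits.
    // The check only works on the raw notation, not on the characters it denotes.
    public static bool IsHexString(string s) =>
        Regex.IsMatch(s, @"^(\\x[0-9A-F]{2})+$");
}
```

Passing the verbatim literal @"\x15\x3C\xFF" (twelve characters of notation) should return true, while the regular literal "\x15\x3C\xFF" has already been compiled down to three characters and returns false, which is why the first answer says the two call forms cannot be told apart at runtime.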

C# .NET Regex remove all quotes of quotes excluding one instance in a sentence

I have description field which is:
16" Alloy Upgrade
In CSV format it appears like this:
"16"" Alloy Upgrade "
What would be the best use of regex to maintain the original format? As I'm learning, I would appreciate it being broken down for my understanding.
I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.
Thanks in advance.
If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:
Regex.Replace(text, @"(?<!"")""(?!"")", string.Empty)
That regex will match every " character not preceded or followed by another ".
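As an illustration, the sketch below applies a lookbehind/lookahead pattern for a quote that is neither preceded nor followed by another quote, then collapses the remaining doubled quotes. The second step and the names CsvQuoteDemo and Unquote are assumptions added for the example; the answer itself only covers the stripping step.

```csharp
using System.Text.RegularExpressions;

public static class CsvQuoteDemo
{
    // A double quote with no double quote directly before or after it.
    // Inside the verbatim string, "" stands for one double quote.
    private static readonly Regex LoneQuote = new Regex(@"(?<!"")""(?!"")");

    public static string Unquote(string field)
    {
        // Step 1: drop the lone quotes that wrap the field.
        string stripped = LoneQuote.Replace(field, string.Empty);

        // Step 2: turn each escaped "" back into a single ".
        return stripped.Replace("\"\"", "\"");
    }
}
```

For the field "16"" Alloy Upgrade " this yields 16" Alloy Upgrade followed by the trailing space. Fields whose content starts or ends with a quote put three or more quotes in a row, which this simple pattern does not handle, so a real CSV parser remains the safer choice.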
I wouldn't use a regex, since regexes are usually confusing and it is often unclear what they do (like the one in your question, for example). Instead, this method should do the trick:
public string CleanField(string input)
{
if (input.StartsWith("\"") && input.EndsWith("\""))
{
string output = input.Substring(1,input.Length-2);
output = output.Replace("\"\"","\"");
return output;
}
else
{
//If it doesn't start and end with quotes then it doesn't look like its been escaped so just hand it back
return input;
}
}
It may need tweaking, but in essence it checks whether the string starts and ends with a quote (which it should if it is an escaped field) and, if so, takes the inside part (with Substring) and then replaces each doubled quote with a single quote character. The code is a bit ugly due to all the escaping, but there is no avoiding that.
The nice thing is that this can easily be used with a bit of LINQ to take an existing array and convert it.
processedFieldArray = inputfieldArray.Select(CleanField).ToArray();
I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.
