Remove all "invisible" chars from a string? - c#

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line breaks and stuff like that? I can probably use Replace to remove spaces and tabs, but are there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If so, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show an "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespace at all, even inside a key or a value.

I did this one recently when I finally got pissed off at too much undocumented garbage, forming bad XML, coming through in a feed. It effectively trims off anything that doesn't fall between a space and the ~ in the ASCII table:
public static string StripControlChars(this string s)
{
    // \x20 is the space and \x7E is '~'; \x7F (DEL) is a control character.
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.

If you use Regex (Regular Expressions) you can filter out all of that with one function.
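string newVariable = Regex.Replace(variable, @"\s", "");
That will remove every whitespace character, including spaces, tabs, \n, and \r. It does not touch invisible characters that are not whitespace, such as other control characters or the zero-width space.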
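As a rough illustration of what \s covers in .NET, the sketch below strips a string that contains a non-breaking space (U+00A0), a zero-width space (U+200B) and a tab. The class and method names are made up for this example, and the second pattern, which adds the Unicode "Other" category \p{C}, is one possible extension rather than something taken from the answers here.

```csharp
using System.Text.RegularExpressions;

public static class InvisibleChars
{
    // Removes only what .NET's \s matches: whitespace, including U+00A0.
    public static string StripWhitespace(string s) =>
        Regex.Replace(s, @"\s", "");

    // Also removes control and format characters (\p{C}), e.g. U+200B.
    // Assumption: losing surrogates (also part of \p{C}) is acceptable.
    public static string StripWhitespaceAndControl(string s) =>
        Regex.Replace(s, @"[\s\p{C}]", "");
}
```

For the input "a\u00A0b\u200Bc\td", StripWhitespace leaves the zero-width space in place, while StripWhitespaceAndControl returns "abcd". Since \p{C} also covers surrogates, the second method would remove characters outside the Basic Multilingual Plane as well, which is probably fine for simple keys and values.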

One of the "white" spaces that regularly bites us is the non-breaking space. Also our system must be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps the 8-bit characters to their approximate 7-bit counterparts, then I removed anything that was not in the x20 to x7F range, further limited by the Dynamics interface.
Regex.Replace(s, @"[^\x20-\x7F]", "")
should do that job.

The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
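A minimal sketch of reading the quoted "{key}":"{value}" grammar suggested above, assuming that neither keys nor values contain a double quote (no escaping is handled). The QuotedPairs class and its Parse method are hypothetical names for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedPairs
{
    // Matches "key":"value"; whitespace around the colon is tolerated.
    private static readonly Regex Pair =
        new Regex(@"""(?<key>[^""]*)""\s*:\s*""(?<value>[^""]*)""");

    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        // Anything between pairs (commas, line breaks, stray text) is ignored.
        foreach (Match m in Pair.Matches(text))
            result[m.Groups["key"].Value] = m.Groups["value"].Value;
        return result;
    }
}
```

Because everything outside the quotes is skipped, spaces and colons inside a quoted key or value survive untouched, which is the point of adding the container characters.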

var split = textLine.Split(':').Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.

You can use string.Trim() to remove white-space characters:
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = pair[0].Trim(),
            Value = pair[1].Trim(),
        };
    }).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
            Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
        };
    }).ToList();
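Putting the pieces together, here is a rough sketch of the whole reader the question describes: it strips all whitespace, skips blank lines, and throws on anything that is not exactly one key and one value. KeyValueFile and Parse are hypothetical names, and letting a repeated key overwrite the earlier value is an assumption, not a requirement from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueFile
{
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Parses "key:value" lines; blank lines are skipped, anything else
    // that is not exactly one key and one value raises FormatException.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (var raw in lines)
        {
            // Drop every whitespace character, even inside keys and values.
            var line = Whitespace.Replace(raw, string.Empty);
            if (line.Length == 0) continue;

            var parts = line.Split(':');
            if (parts.Length != 2 || parts[0].Length == 0 || parts[1].Length == 0)
                throw new FormatException("Invalid format: " + raw);

            result[parts[0]] = parts[1]; // a repeated key overwrites the earlier value
        }
        return result;
    }
}
```

It could be called as KeyValueFile.Parse(File.ReadLines(path)), with the call wrapped in the try/catch the question mentions so that the FormatException becomes the "Invalid format" message.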

If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
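The same "all-except-some" idea can also be written with a negated character class. This is only a sketch: the Whitelist class is a made-up name, and the allowed set (ASCII letters, digits, ":" and line breaks) is my reading of the question rather than a fixed requirement.

```csharp
using System.Text.RegularExpressions;

public static class Whitelist
{
    // Keeps ASCII letters, digits, ':' and line breaks; removes everything else.
    public static string KeepAllowed(string s) =>
        Regex.Replace(s, @"[^A-Za-z0-9:\r\n]", "");
}
```

Note that this silently drops characters such as "-" or "_" from keys and values, so the allowed set should be widened if those are legitimate in the file.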

Related

Finding the beginning and end of a substring using regex

I have an awful time with regular expressions, so I usually resort to lousy kludges and workarounds when parsing strings. I need to get better at using regex. This one seems simple to me, but I don't even know where to start.
Here's the string output from my device:
testString = IP:192.168.5.210\rPlaylist:1\rEnable:On\rMode:HDMI\rLineIn:unbal\r
Example:
I want to find if the device is off or on. I need to search for the string "Enable:" then locate the carriage return and determine if the word between Enable: and \r is off or on. It seems like that's what regex is for or do I totally misunderstand it.
Can someone point me in the right direction?
Additional information - Maybe I need to expand on the question.
Based on the answers, finding whether or not the device is enabled appears to be fairly simple. Since the return string is similar to a list of key/value pairs, what's more vexing is determining the substring between the : and the carriage return. A number of these pairs have responses whose lengths vary significantly, such as DeviceLocation, DeviceName, IPAddress. In fact, the device responds to every command sent to it by returning the entire status list, 48 key/value pairs, which I then must parse even if I only need to know one property.
Also, based on your answers ... regular expressions are not the way to go.
Thanks for any help.
Norm
I would suggest, for a simple line as shown, checking for one or the other, but verifying as well. Based partially on Ken White's suggestions.
if (input.Contains(":On"))
{
    // DoWork()
}
else if (input.Contains(":Off"))
{
    // DoOtherWork()
}
This presumes that ":On" and ":Off" will not appear anywhere else in the string, for example as the value of a different key.
Consider the following code:
// This regular expression matches the text 'Enable: ' followed by one or more non-'\r' characters, followed by '\r'.
// RegexOptions.Multiline only changes how ^ and $ match, so it is optional for this pattern.
// Also, as far as ^ and $ are concerned, '\r' is not a line break. '\n' is.
Regex regex = new Regex("Enable: ([^\r]+)\r", RegexOptions.Multiline);
string input = "IP:192.168.5.210\rPlaylist: 1\rEnable: On\rMode: HDMI\rLineIn: unbal\r";
var match = regex.Match(input);
Debug.Assert(match.Success);
// The match will contain 2 Groups:
// The first is the whole match, 'Enable: On\r'.
// The other is 'On', since we enclosed ([^\r]+) in ().
Console.WriteLine(match.Groups[1].Value);
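Since the device returns the whole status list every time, another option is to split the response once and look properties up by name. A minimal sketch, assuming every pair has the form Key:Value and is terminated by '\r' as in the question's sample; DeviceStatus is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

public static class DeviceStatus
{
    // Splits "Key:Value\rKey:Value\r..." into a dictionary.
    public static Dictionary<string, string> Parse(string response)
    {
        var status = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var pair in response.Split(new[] { '\r' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first ':' only, so values may contain colons.
            var parts = pair.Split(new[] { ':' }, 2);
            if (parts.Length == 2)
                status[parts[0].Trim()] = parts[1].Trim();
        }
        return status;
    }
}
```

With the sample string from the question, DeviceStatus.Parse(testString)["Enable"] gives "On", and values of any length (DeviceLocation, DeviceName and so on) come out the same way without a regex.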

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). So that led me into thinking about using the "*" or "+" expression rather than ".", however I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
// replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
// control string for the expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
// prints True
Console.WriteLine(result.Equals(shouldBe).ToString());

How to check the repeated characters in a string

I am creating a program that filters words and checks whether each word exists in a dictionary. The problem is how to know if the word has repeated characters.
For example:
string string1 = "sorrrrrrry";
That string does not exist in the dictionary, but if you remove the repeated r it becomes "sorry".
I am using hunspell to check if the word exists in the dictionary. Any solution please? Thanks in advance
For your case what you can do is:
replace the repeated characters but 2 => "sorry"
look if the word exists on the dictionary
if not, replace the 2 repeated characters by 1 character => "sory" (if you have for example "caat")
look if the word exists on the dictionary
Using the regex (\w)\1+ (matches repeated characters) and replacing the first time by $1$1 (2 repeated matched characters) and then by $1:
string input = "sorrrrrrry";
Regex regex = new Regex(@"(\w)\1+");
string replacement = "$1$1";
string res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sorry
replacement = "$1";
res = regex.Replace(input, replacement);
Console.WriteLine(res);
//will output => sory
Warning
This can give some results BUT it has some limitations and can produce unexpected results:
you need to handle all the combinations if more than two characters are repeated: if you have "soooorrrry" it will give you 1. "soorry" and then 2. "sory", so the algorithm will not work.
what to do with the case "gooood", is it "good" or "god" ?
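One way around the first limitation is to try every combination, reducing each repeated run to either two copies or one, and keep the spellings the dictionary accepts. This is only a sketch: RepeatedLetters is a made-up class, and isWord is a placeholder delegate standing in for the real Hunspell lookup, whose API is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedLetters
{
    private static readonly Regex Run = new Regex(@"(\w)\1+");

    // Yields every spelling in which each run of a repeated character is
    // reduced to either two copies or one copy of that character.
    public static IEnumerable<string> Candidates(string word)
    {
        int runs = Run.Matches(word).Count;
        // Bit i of mask decides whether the i-th run keeps two copies (1) or one (0).
        for (int mask = (1 << runs) - 1; mask >= 0; mask--)
        {
            int index = 0;
            yield return Run.Replace(word, m =>
            {
                bool keepTwo = (mask & (1 << index++)) != 0;
                string c = m.Groups[1].Value;
                return keepTwo ? c + c : c;
            });
        }
    }

    // isWord is a placeholder for the real dictionary lookup.
    public static List<string> Suggest(string word, Func<string, bool> isWord) =>
        Candidates(word).Where(isWord).Distinct().ToList();
}
```

For "soooorrrry" the candidates are "soorry", "sorry", "soory" and "sory", so a dictionary containing "sorry" finds it. For "gooood" both "good" and "god" are candidates, so the ambiguity remains and the caller has to choose or show a list. The number of candidates doubles with every repeated run, which should be harmless for ordinary words but is worth capping for hostile input.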
You can only try to guess, using some fuzzy matching method, which dictionary word was meant and, if more than one candidate is found, show a list.
Perhaps you know how a smartphone keyboard tries to help you.
Doing this during typing is more or less the proper way.
Doing it afterwards is also possible, but needs more effort.
You may want to look into storing the dictionary in Lucene.Net and using its loose matching capability to match the words.

regex approach for extracting strings surrounded with double quotes

I have a search string that is getting passed
Eg: "a+b",a, b, "C","d+e",a-b,d
I want to filter out all sub strings surrounded by double quotes("").
In above sample Output should contain:
"a+b","C","d+e"
Is there a way to do this without looping?
Also I then need to extract a string without above values to do further processing
Eg: a,b,a-b,d
Any suggestions on how to do this with minimal performance impact?
Thank you in advance for all your comments and suggestions
Since you didn't say anything about how exactly you want your output (do you need to keep the commas and extra whitespace? Is it comma delimited to begin with?), let's assume that it is NOT comma delimited and you are just trying to remove the occurrences of the "xyz":
string strRegex = @"""([^""])+""";
string strTargetString = @" ""a+b"",a, b, ""C"",""d+e"",a-b,d";
string strOutput = Regex.Replace(strTargetString, strRegex, x => "");
Will remove all of the items (leaving the extra commas and whitespace).
If you are trying to do something where you need each individual match then you might want to try:
var y = (from Match m in Regex.Matches(strTargetString, strRegex) select m.Value).ToList<string>();
y.ForEach(s => Console.WriteLine(s));
To get the list of items without the surrounding quotes, you could either reverse the regex pattern OR use the replace method in the first code sample and then split on the commas, trimming white space (again, assuming you are splitting on commas which it sounds like you are)
First, add a comma to the end of your output:
"a+b",a, b, "C","d+e",a-b,d,
Then, use this regular expression:
((?<quoted>\".+?\")|(?<unquoted>.+?)),\s*
Now you have 2 problems. Kidding!
You'll have to find a way of extracting the matches without using a loop, but at least they are separated into quoted and unquoted strings by using the group. You could use a lambda expression to pull the data out and join it, one each for quoted and unquoted, but it's just doing a loop behind the scenes, and may add more overhead than a simple for loop. It sounds like you're trying to eke out performance here, so time and test each method to see what gives the best results.
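For comparison, here is a small sketch that collects both lists in a single pass over the matches. It uses a simpler pattern than the one above and does loop explicitly; QuotedSplitter is a hypothetical name, and the code assumes quoted items never contain an embedded double quote.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedSplitter
{
    // A token is either a double-quoted run or a run without commas and quotes.
    private static readonly Regex Token = new Regex(@"""[^""]*""|[^,""]+");

    public static void Split(string input, List<string> quoted, List<string> unquoted)
    {
        foreach (Match m in Token.Matches(input))
        {
            string value = m.Value.Trim();
            if (value.Length == 0) continue;          // whitespace between commas
            if (value[0] == '"') quoted.Add(value);   // keeps the surrounding quotes
            else unquoted.Add(value);
        }
    }
}
```

With the sample input, the quoted list comes out as "a+b", "C", "d+e" and the unquoted list as a, b, a-b, d, which can then be joined with string.Join(",", ...) for further processing.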

C# - Fastest way to find one of a set of strings in another string

I need to check whether a string contains any swear words.
Following some advice from another question here, I made a HashSet containing the words:
HashSet<string> swearWords = new HashSet<string>() { "word_one", "word_two", "etc" };
Now I need to see if any of the values contained in swearWords are in my string.
I've seen it done the other way round, eg:
swearWords.Contains(myString)
But this will return false.
What's the fastest way to check if any of the words in the HashSet are in myString?
NB: I figure I can use a foreach loop to check each word in turn, and break if a match is found, I'm just wondering if there's a faster way.
If you place your swears in an IEnumerable<> implementing container:
var containsSwears = swearWords.Any(w => myString.Contains(w));
Note: HashSet<> implements IEnumerable<>
You could try a regex, but I'm not sure it's faster.
Regex rx = new Regex("(" + string.Join("|", swearWords.Select(Regex.Escape)) + ")");
rx.IsMatch(myString)
If you have really large set of swear words you could use Aho–Corasick algorithm: http://tomasp.net/blog/ahocorasick.aspx
The main problem with such schemes is defining what a word is in the context of the string you want to check.
Naive implementations such as those using input.Contains simply do not have the concept of a word; they will "detect" swear words even when that was not the intent.
Breaking words on whitespace is not going to cut it (consider also punctuation marks, etc).
Breaking on characters other than whitespace is going to raise culture issues: what characters are considered word-characters exactly?
Assuming that your stopword list only uses the latin alphabet, a practical choice would be to assume that words are sequences consisting of only latin characters. So a reasonable starting solution would be
var words = Regex.Split(myString, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Pc}\p{Lm}]");
The regex above is the standard class \W modified so that digits are also treated as non-word characters; for more info, see http://msdn.microsoft.com/en-us/library/20bw873z.aspx. For other approaches, see this question and possibly the CodeProject link supplied in the accepted answer.
Having split the input string, you can iterate over words and replace those that match anything in your list (use swearWords.Contains(word) to check) or simply detect if there are any matches at all with
var anySwearWords = words.Intersect(swearWords).Any();
You could split "myString" into an IEnumerable type, and then use "Overlaps" on them?
http://msdn.microsoft.com/en-us/library/bb355623(v=vs.90).aspx
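The Overlaps idea could look roughly like this. It is a sketch under a few assumptions: SwearFilter is a made-up name, words are taken to be runs of Unicode letters (a simplification of the character class in the earlier answer), and case-insensitive matching comes from the comparer the caller gives the HashSet.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SwearFilter
{
    // Splits on anything that is not a letter, then checks for shared words.
    public static bool ContainsAny(string text, HashSet<string> swearWords)
    {
        string[] words = Regex.Split(text, @"\P{L}+");
        // Overlaps uses the set's own comparer for each lookup.
        return swearWords.Overlaps(words);
    }
}
```

With a set built as new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "darn", "heck" }, the text "Well, DARN it!" is flagged while "darning socks" is not, because only whole words are compared.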
(P.S. Long time no see...)
EDIT: Just noticed error in my previous answer.
