String Comparison, .NET and non breaking space - c#

I have an app written in C# that does a lot of string comparison. The strings are pulled in from a variety of sources (including user input) and are then compared. However I'm running into problems when comparing space '32' to non-breaking space '160'. To the user they look the same and so they expect a match. But when the app does the compare, there is no match.
What is the best way to go about this? Am I going to have to go to all parts of the code that do a string compare and manually normalize non-breaking spaces to spaces? Does .NET offer anything to help with that? (I've tried all the compare options but none seem to help.)
It has been suggested that I normalize the strings upon receipt and then let the string compare method simply compare the normalized strings. I'm not sure that would be straightforward, because what is a normalized string in the first place? What do I normalize it to? Sure, for now I can convert non-breaking spaces to breaking spaces. But what else can show up? Can there potentially be very many of these rules? Might they even conflict? (In one case I want to use a rule and in another I don't.)

I went through lots of pain to find this simple answer. The code below uses a regular expression to replace non-breaking spaces with normal spaces.
string cellText = "String with non breaking spaces.";
cellText = Regex.Replace(cellText, @"\u00A0", " ");
Hope this helps, Dan

It needs to be
text.Replace('\u00A0',' ')
where \u00A0 is non breaking space
This will replace the non breaking space with normal space.

If it were me, I would 'normalize' the strings as I 'pulled them in'; probably with a string.Replace(). Then you won't need to change your comparisons anywhere else.
Edit: Mark, that's a tough one. It's really up to you, or your clients, as to what is a 'normalized' string. I've been in a similar situation where the customer demanded that strings like:
I have 4 apples.
I have four apples.
were actually equal. You may need separate normalizers for different situations. Either way, I would still do the normalization upon retrieval of the original strings.
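A minimal sketch of normalizing on receipt, assuming the only rule needed for now is "every Unicode space separator becomes an ordinary space". NormalizeSpaces is a hypothetical helper name, not a framework method; other rules (such as the "4" versus "four" case) would need their own normalizers.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: maps every character in Unicode category Zs
// (space separators, which include U+00A0) to an ordinary U+0020 space.
static string NormalizeSpaces(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        bool isSpaceSeparator =
            char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;
        sb.Append(isSpaceSeparator ? ' ' : c);
    }
    return sb.ToString();
}

// Normalize once, at the point where each string enters the app.
string fromUser = NormalizeSpaces("I have\u00A04 apples.");
string fromFile = NormalizeSpaces("I have 4 apples.");

Console.WriteLine(fromUser == fromFile); // True
```

Because the strings are cleaned where they enter the app, the existing comparisons elsewhere can stay as they are.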

I'd suggest creating your own string comparer that extends one of the original ones -- do the "normalization" there (replace non-breaking space with regular space). You can then call the comparer's own Equals method wherever you compare. Note that the static String.Equals overloads take a StringComparison value rather than a comparer object.
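One way the comparer idea could look, as a sketch only. NbspInsensitiveComparer is a made-up name; it derives from StringComparer and delegates to StringComparer.Ordinal after swapping U+00A0 for U+0020, so it assumes ordinal (culture-insensitive, case-sensitive) comparison is what you want.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

var comparer = new NbspInsensitiveComparer();
Console.WriteLine(comparer.Equals("a\u00A0b", "a b")); // True

// The comparer also plugs into collections that accept an IEqualityComparer<string>.
var seen = new HashSet<string>(comparer) { "a b" };
Console.WriteLine(seen.Contains("a\u00A0b")); // True

// Hypothetical comparer: treats U+00A0 as U+0020, otherwise behaves like Ordinal.
sealed class NbspInsensitiveComparer : StringComparer
{
    private static string? Clean(string? s) => s?.Replace('\u00A0', ' ');

    public override int Compare(string? x, string? y) =>
        StringComparer.Ordinal.Compare(Clean(x), Clean(y));

    public override bool Equals(string? x, string? y) =>
        StringComparer.Ordinal.Equals(Clean(x), Clean(y));

    // Equal strings must hash equally, so hash the cleaned form.
    public override int GetHashCode(string obj) =>
        StringComparer.Ordinal.GetHashCode(obj.Replace('\u00A0', ' '));
}
```

The trade-off against normalizing on receipt is that the original strings stay untouched, but every comparison site has to remember to use the comparer.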

The same without regex, mostly for myself when I need it later:
text.Replace('\u00A0', ' ')

Related

Password complexity regex with number or special character

I've got some regex that'll check incoming passwords for complexity requirements. However it's not robust enough for my needs.
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,20})
It ensures that a password meets minimum length, contains characters of both cases and includes a number.
However I need to modify this so that it can contain a number and/or an allowed special character. I've even been given a list of allowed special characters.
I have two problems, delimiting the special characters and making the first condition do an and/or match for number or special.
I'd really appreciate advice from one of the regex gods round these parts.
The allowed special characters are: @%+\/'!#$^?:.(){}[]~-_
If I understand your question correctly, you're looking for a possibility to require another special character. This could be done as follows (see the last lookahead):
((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!§$%&/(/)]).{8,20})
See a demo for this approach here on regex101.com.
However, you can make your expression even better with further improvements: the dot-star (.*) runs to the end of the line and backtracks afterwards. If you have a password of, say, 10 characters and four lookaheads must be satisfied, you'll need at least 40 steps (even more, as the engine needs to backtrack).
To optimize your expression, you could use the exact opposite of your required characters, thus making the engine come to an end faster. Additionally, as already pointed out in the comments, do not limit your maximum password length.
In the language of regular expressions, this would come down to:
((?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[!§$%&/(/)]).{8,})
With the first approach, 63 steps are needed, while the optimized version only needs 29 steps (fewer than half!). Regarding your second question, allowing a digit or a special character, you could simply use an alternation (|) like so:
((?:(?=\D*\d)|(?=.*[!§$%&/(/)]))(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z]).{8,})
Or put the \d in the brackets as well, like so:
((?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[\d!§$%&/(/)]).{8,})
This one would consider both ConsideredAgoodPassw!rd and C0nsideredAgoodPassword to be good passwords.
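For reference, here is how the last pattern might be tried out in C#. This is a sketch: the ^ and $ anchors are an addition not present in the answer (without them IsMatch could start matching partway through the input), and the special-character class is the answer's example set, not the asker's full list.

```csharp
using System;
using System.Text.RegularExpressions;

// Final pattern from the answer, anchored at both ends (the anchors are an addition).
// Requires: one lowercase letter, one uppercase letter, one digit OR special
// character from the example set, and at least 8 characters overall.
var policy = new Regex(
    @"^(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[\d!§$%&/(/)]).{8,}$");

string[] candidates =
{
    "ConsideredAgoodPassw!rd",  // special character, no digit
    "C0nsideredAgoodPassword",  // digit, no special character
    "consideredagoodpassword",  // no uppercase letter
    "Sh0rt!"                    // too short
};

foreach (string candidate in candidates)
{
    Console.WriteLine($"{candidate}: {policy.IsMatch(candidate)}");
}
```

The first two candidates pass and the last two fail, which matches the "digit and/or special character" requirement.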

How does html decoding work?

In my app I compare strings. I have strings that look the same, but some of them contain white space and others contain nbsp, so when I compare them I get that they are different. However, they represent the same entity, so I have issues when I compare them. That's why I want to decode the strings I compare. That way nbsp will be converted to space in both of the strings and they will be treated as equal when I do the comparison. So here's what I do:
HttpUtility.HtmlDecode(string1)[0]
HttpUtility.HtmlDecode(string2)[0]
But I still get that string1[0] has ascii code of 160, and string2[0] has ascii code of 32.
Obviously I am not understanding the concept. What am I doing wrong?
You are trying to compare two different characters, no matter how similar they might look to you.
The fact that they have different character codes is enough to make the comparison fail. The easiest thing to do is replace the non-breaking space by a regular space and then compare them.
bool c = html.Replace('\u00A0', ' ').Equals(regular);
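To make the point concrete, here is a small sketch showing that decoding does not turn nbsp into a space; it turns the entity into the character U+00A0 (160). It uses System.Net.WebUtility.HtmlDecode rather than the HttpUtility call from the question, and the sample strings are made up.

```csharp
using System;
using System.Net;

// Decoding "&nbsp;" yields the non-breaking space character, not U+0020.
string html = WebUtility.HtmlDecode("&nbsp;apple");
string regular = " apple";

Console.WriteLine((int)html[0]);    // 160
Console.WriteLine((int)regular[0]); // 32

// Replace after decoding, then compare.
bool same = html.Replace('\u00A0', ' ').Equals(regular);
Console.WriteLine(same); // True
```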

Parse directories from a string

First, I have spent three hours trying to solve this. Also, please don't suggest not using regex. I appreciate other comments and can easily use other methods, but I am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
Works fine. However, what if the directory name contains white space? I have tried to match all strings that start with one uppercase letter followed by a colon and then anything else. This needs to be matched up until a whitespace, one uppercase letter, and a colon. Then the same sequence should be matched again.
Hope I have made sense.
How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead" which, according to Regular expression to match a line that doesn't contain a word? is the way to get around the lack of inverse matching in regexes.
Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.
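Here is a rough sketch of the first suggested pattern in action. It is written in C# rather than VB.Net (the Regex class is the same in both), the second input with spaces in directory names is an invented example, and the Trim() call is an addition that removes the space left in front of the next drive letter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: a drive letter and colon, then any characters that do
// not begin another "drive letter + colon" pair.
static List<string> FindPaths(string input) =>
    Regex.Matches(input, @"[A-Z]:((?![A-Z]:).)*")
         .Cast<Match>()
         .Select(m => m.Value.Trim())
         .ToList();

var simple = FindPaths(@"Hello world this is a string C:\Example\Test E:\AnotherExample");
Console.WriteLine(string.Join(" | ", simple));
// C:\Example\Test | E:\AnotherExample

var spaced = FindPaths(@"C:\Program Files\App E:\Another Example");
Console.WriteLine(string.Join(" | ", spaced));
// C:\Program Files\App | E:\Another Example
```

Note that anything following the last path is swallowed into that match, which is exactly the ambiguity described above.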

Counting special UTF-8 character

I'm looking for a way to count special characters that are formed from more than one character, but I have found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters when we use the normal way to find the length. I am wondering whether Tamil is the only script that causes this problem and whether there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes. This is the old C way of looking at things, usually.
2. Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
4. Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
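The four meanings of "length" can be put side by side with a short sketch. Measure is a hypothetical helper; the byte count assumes UTF-8 as the encoding, and the code point count assumes the string is well-formed UTF-16 (no unpaired surrogates). The emoji is an extra example added to show a case where 2. and 3. differ.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper returning all four notions of length at once.
static (int Utf8Bytes, int Utf16Units, int CodePoints, int Graphemes) Measure(string s) =>
    (Encoding.UTF8.GetByteCount(s),                // 1. bytes (in UTF-8)
     s.Length,                                     // 3. UTF-16 code units
     s.Count(c => !char.IsLowSurrogate(c)),        // 2. code points (well-formed input)
     new StringInfo(s).LengthInTextElements);      // 4. graphemes (text elements)

Console.WriteLine(Measure("வாழைப்பழம"));   // (27, 9, 9, 6)
Console.WriteLine(Measure("\U0001F600"));  // (4, 2, 1, 1)
```

For the Tamil string every code point fits in one UTF-16 code unit, so 2. and 3. agree; for the emoji one code point needs a surrogate pair, so they do not.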

C# - Fastest way to find one of a set of strings in another string

I need to check whether a string contains any swear words.
Following some advice from another question here, I made a HashSet containing the words:
HashSet<string> swearWords = new HashSet<string>() { "word_one", "word_two", "etc" };
Now I need to see if any of the values contained in swearWords are in my string.
I've seen it done the other way round, eg:
swearWords.Contains(myString)
But this will return false.
What's the fastest way to check if any of the words in the HashSet are in myString?
NB: I figure I can use a foreach loop to check each word in turn, and break if a match is found, I'm just wondering if there's a faster way.
If you place your swears in an IEnumerable<> implementing container:
var containsSwears = swearWords.Any(w => myString.Contains(w));
Note: HashSet<> implements IEnumerable<>
You could try a regex, but I'm not sure it's faster.
Regex rx = new Regex("(" + string.Join("|", swearWords.Select(w => Regex.Escape(w))) + ")"); // escape each word so regex metacharacters are matched literally
rx.IsMatch(myString)
If you have really large set of swear words you could use Aho–Corasick algorithm: http://tomasp.net/blog/ahocorasick.aspx
The main problem with such schemes is defining what a word is in the context of the string you want to check.
Naive implementations such as those using input.Contains simply do not have the concept of a word; they will "detect" swear words even when that was not the intent.
Breaking words on whitespace is not going to cut it (consider also punctuation marks, etc).
Breaking on characters other than whitespace is going to raise culture issues: what characters are considered word-characters exactly?
Assuming that your stopword list only uses the latin alphabet, a practical choice would be to assume that words are sequences consisting of only latin characters. So a reasonable starting solution would be
var words = Regex.Split(myString, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Pc}\p{Lm}]");
The regex above is the standard class \W modified to not include digits; for more info, see http://msdn.microsoft.com/en-us/library/20bw873z.aspx. For other approaches, see this question and possibly the CodeProject link supplied in the accepted answer.
Having split the input string, you can iterate over words and replace those that match anything in your list (use swearWords.Contains(word) to check) or simply detect if there are any matches at all with
var anySwearWords = words.Intersect(swearWords).Any();
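Putting the pieces of this answer together, a sketch might look like the following. ContainsAnyWord is a hypothetical helper, "darn" stands in for a real word list, and the case-insensitive HashSet is an assumption the answer does not make.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Assumption: matching should ignore case, so the set uses OrdinalIgnoreCase.
var swearWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "darn", "word_one" };

// Hypothetical helper: split on anything that is not a letter or connector
// punctuation, then look each resulting word up in the set.
static bool ContainsAnyWord(string text, HashSet<string> words) =>
    Regex.Split(text, @"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Pc}\p{Lm}]")
         .Any(w => words.Contains(w));

Console.WriteLine(ContainsAnyWord("Well, DARN!", swearWords));           // True
Console.WriteLine(ContainsAnyWord("I like darning socks.", swearWords)); // False

// The naive substring check flags the second sentence anyway.
Console.WriteLine("I like darning socks.".Contains("darn"));             // True
```

Each lookup in the HashSet is a constant-time operation on average, so the cost is dominated by splitting the input once.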
You could split "myString" into an IEnumerable type, and then use "Overlaps" on them?
http://msdn.microsoft.com/en-us/library/bb355623(v=vs.90).aspx
(P.S. Long time no see...)
EDIT: Just noticed error in my previous answer.
