I ran into this problem today and wonder if anyone knows why this test may fail (depending on culture). The aim is to check whether the test text contains two spaces next to each other, which it does according to string.IndexOf (even after I tell the string to replace all occurrences of two adjacent spaces). After some testing, it seems \xAD is somehow causing the issue.
public class ReplaceIndexOfSymmetryTest
{
[Test]
public void IndexOfShouldNotFindReplacedString()
{
string testText = "\x61\x20\xAD\x20\x62";
const string TWO_SPACES = "  ";
const string ONE_SPACE = " ";
string result = testText.Replace(TWO_SPACES, ONE_SPACE);
Assert.IsTrue(result.IndexOf(TWO_SPACES) < 0);
}
}
Yes, I've come across the same thing before (although with different characters). Basically IndexOf will take various aspects of "special" Unicode characters into account when finding matches, whereas Replace just treats the strings as a sequence of code points.
From the IndexOf docs:
This method performs a word (case-sensitive and culture-sensitive) search using the current culture. The search begins at the first character position of this instance and continues until the last character position.
... and from Replace:
This method performs an ordinal (case-sensitive and culture-insensitive) search to find oldValue.
You could use the overload of IndexOf which takes a StringComparison, and force it to perform an ordinal comparison though.
Like Jon said, use StringComparison.Ordinal to get it right.
Assert.IsTrue(result.IndexOf(TWO_SPACES, StringComparison.Ordinal) < 0);
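To make the difference easier to see, here is a minimal sketch. The class and method names (OrdinalSearchSketch, ContainsOrdinal, Demo) are made up for illustration. The culture-sensitive line is printed without a stated result on purpose, since what it returns depends on the runtime and the current culture.

```csharp
using System;

public static class OrdinalSearchSketch
{
    // Hypothetical helper: true when value occurs in text,
    // compared code unit by code unit (no culture rules applied).
    public static bool ContainsOrdinal(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal) >= 0;
    }

    public static void Demo()
    {
        const string TwoSpaces = "  ";

        // string.Replace(string, string) is ordinal, so the soft hyphen
        // keeps the two spaces apart and nothing is replaced.
        string result = "a \u00AD b".Replace(TwoSpaces, " ");

        // Ordinal search agrees with Replace: no two adjacent spaces.
        Console.WriteLine(ContainsOrdinal(result, TwoSpaces));

        // Culture-sensitive search may treat U+00AD as ignorable;
        // the value printed here depends on runtime and culture.
        Console.WriteLine(result.IndexOf(TwoSpaces, StringComparison.CurrentCulture));
    }
}
```

Calling OrdinalSearchSketch.Demo() prints both results side by side, which is a quick way to check how a given machine behaves.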
Related
Why doesn't RegularExpressionAttribute validation compare the input string with the concatenated value of all matches?
I asked a question here about the scenario below, but I found a solution the next day and thought it better to raise the issue here.
[Required(
AllowEmptyStrings = false,
ErrorMessage = "Required")]
[RegularExpression(
"^[^0]{1}|..+",
ErrorMessage = "Expressao Regular")]
public string EncryptedValue { get; set; }
Apart from an empty string or "0", any value should make the property valid in ModelState, but validation fails for the value below.
You can test the expression and value HERE.
Expression
^[^0]{1}|..+
Value
+iCMEBYZQtWbnU2RPX/MmqrDPuVJzSGGWhkFd+9/zpMbHVoOlZFuF9ND1xAxsQy3YFCPIsUBEgg2RJNkPefrmQ==
You will notice that the expression matches, but with two matches. The first match on its own is not equal to the input string; you need to concatenate both match values to get that.
But apparently this is not done when ModelState is validated. The same happens with jquery.validate.unobtrusive (with jQuery I need to click the submit button twice to see it, but it happens).
Solution
You need to build an expression that matches the input string completely in the first match.
When you build an expression to validate a field, every alternative (OR) in the expression must match the entire input string.
So whenever you build an expression with OR operators, always order the alternatives from the longest input to the shortest.
In this case:
From ^[^0]{1}|..+ to ..+|^[^0]{1}
Let's take a look at the class System.ComponentModel.DataAnnotations.RegularExpressionAttribute. We are interested in the following method:
[__DynamicallyInvokable]
public override bool IsValid(object value)
{
this.SetupRegex();
string input = Convert.ToString(value, (IFormatProvider) CultureInfo.CurrentCulture);
if (string.IsNullOrEmpty(input))
return true;
Match match = this.Regex.Match(input);
if (match.Success && match.Index == 0)
return match.Length == input.Length;
return false;
}
In our case we have the regular expression "^[^0]{1}|..+" and the input string +iCMEBYZQtWbnU2RPX/MmqrDPuVJzSGGWhkFd+9/zpMbHVoOlZFuF9ND1xAxsQy3YFCPIsUBEgg2RJNkPefrmQ== . The regex produces two matches: the first is + (the first character) and the second is the rest of the string. The first match is shorter than the input string, which is why IsValid returns false.
Why doesn't RegularExpressionAttribute validation compare the input string with the concatenated value of all matches? Because it only works with the first match.
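The first-match rule can be reproduced outside the attribute. The following is only a sketch that mirrors the check in the decompiled method above; FullMatchSketch and IsFullMatch are names I made up, and this is not the attribute's actual code path (it skips SetupRegex and the culture-aware conversion).

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullMatchSketch
{
    // Hypothetical helper mirroring the IsValid logic shown above:
    // only the FIRST match counts, and it must span the whole input.
    public static bool IsFullMatch(string pattern, string input)
    {
        if (string.IsNullOrEmpty(input))
            return true; // empty input is left to [Required]

        Match match = Regex.Match(input, pattern);
        return match.Success
            && match.Index == 0
            && match.Length == input.Length;
    }
}
```

With the pattern ^[^0]{1}|..+ and an input starting with +, the first match is just the + character, so IsFullMatch returns false. With the alternatives swapped (..+|^[^0]{1}), the first match covers the entire string and the check passes.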
So, if your regex is ^[^0]{1}|..+
And your value is: +iCMEBYZQt...
One match will be +, and the other iCMEBYZQt....
The thing is, in the first part, ^[^0]{1}, your regex tries to match any character that is not 0, once, at the beginning of the line/string.
So the regex looks at the string and says something like:
Voilà! I have something that is not zero, at the beginning, and it is one character: +. Thanks, thanks, where is my cookie?!
Also, the or instruction (|) tells the regex to be more flexible about the patterns, something like: if you don't find the first expression, don't feel bad, you can also look for this other thing. But once the regex has its first cookie, it makes no sense to go back and try the second expression (..+), because it has already found a solution with the first expression, and the regex knows there will be no extra cookies for the same work. So life continues and we have to move on.
Here you can see that the first and the second expressions are independent ways to satisfy the regex.
https://regex101.com/r/UWtcF8/3
Your second regex is: ..+|^[^0]{1}
https://regex101.com/r/la3wDa/1
Here the order of the expressions is swapped. It works the same way, but the first expression the regex tries to satisfy is ..+, which basically means: give me anything with 2 or more characters. So, for example, if you have 00, the regex will say:
Yumm, cookies !
And that can be fine or not, it will depend on what you want to solve.
Now another option is:
^(?!0$).+
https://regex101.com/r/gxJdch/1
Where we are going to look for anything but a single zero.
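If it helps, here is a rough way to compare the three patterns under the same "first match must cover the whole input" rule that the attribute applies. AlternationOrderSketch and Accepts are illustrative names only, and the sample inputs are mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderSketch
{
    // Hypothetical helper: accepts the input only when the first
    // match starts at index 0 and spans the whole string.
    public static bool Accepts(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }

    public static void Demo()
    {
        string[] patterns = { "^[^0]{1}|..+", "..+|^[^0]{1}", "^(?!0$).+" };
        string[] inputs = { "0", "00", "+abc" };

        foreach (string pattern in patterns)
            foreach (string input in inputs)
                Console.WriteLine(pattern + "  " + input + "  " + Accepts(pattern, input));
    }
}
```

The lookahead version rejects a lone 0 and accepts everything else in this sample, without relying on the order of alternatives.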
Useful and fun links:
https://regex101.com/ (even when you are not working in PHP, you can select the PHP flavor and check the regex debugger to learn more about regex!)
https://regexper.com/
I have a string which I would like to remove the first three characters from. How do I go about this using substrings, or is there any other way?
string temp = "01_Barnsley";
string textIWant = "Barnsley";
Thank you
You can use the String.Substring(Int32) method.
Retrieves a substring from this instance. The substring starts at a
specified character position and continues to the end of the string.
string textIWant = temp.Substring(3);
Here is a demonstration.
As an alternative, you can also use the String.Remove(Int32, Int32) method.
Returns a new string in which a specified number of characters in the
current instance beginning at a specified position have been deleted.
string textIWant = temp.Remove(0, 3);
Here is a demonstration.
You can use the String.Substring(Int32) method
string textIWant = temp.Substring(3);
or
String.Remove Method (Int32, Int32)
string textIWant = temp.Remove(0,3);
If there is a pattern to the data, one can use that to extract out what is needed using Regular Expressions.
So if one knows there are numbers (\d is the regex for a digit, and + means one or more) followed by an underscore, that is the pattern to exclude. Now we tell the parser what we want to capture by asking for a group match using ( ) notation. Within that subgroup we say capture everything by using .+. The period (.) means any character, and the +, as seen before, means one or more.
The full match for the whole pattern (not what we want) is the group at index zero. We want the first subgroup match, at index 1, which is our data.
Console.WriteLine (Regex.Match("01_Barnsley", @"\d+_(.+)").Groups[1].Value); // Barnsley
Console.WriteLine (Regex.Match("9999_Omegaman", @"\d+_(.+)").Groups[1].Value); // Omegaman
Notice how we don't have to worry if it's more than two digits? Whereas Substring can fail because the number grew, that is not a problem for the regex parser, thanks to the flexibility of our pattern.
Summary
If there is a distinct pattern to the data, and the data may change, use regex. The minimal learning curve can pay off handsomely. If you truly just need something at a specific point that is unchanging, use substring.
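One thing to watch with the regex approach: Groups[1].Value is an empty string when nothing matches. Below is a small sketch that falls back to the original text in that case. PrefixSketch and StripNumericPrefix are hypothetical names, and I have anchored the pattern with ^ and $, which is my own assumption about the data (the prefix is always at the start).

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixSketch
{
    // Hypothetical helper: drops a leading "digits + underscore" prefix
    // and returns the input unchanged when the pattern is absent.
    public static string StripNumericPrefix(string value)
    {
        Match match = Regex.Match(value, @"^\d+_(.+)$");
        return match.Success ? match.Groups[1].Value : value;
    }
}
```

For example, PrefixSketch.StripNumericPrefix("01_Barnsley") gives Barnsley, and a value with no numeric prefix comes back as it was.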
A solution using LINQ:
string temp = "01_Barnsley";
string textIWant = new string(temp.Skip(3).ToArray());
Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (It replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worrying that this will only work for single numbers (\"1\" rather than \"11\").. So that led me into thinking about using the "*" or "+" expression rather than ".", however I foresaw the problem of this picking up all of the text inbetween the desired characters (which are dotted all over the place) whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());
Does anyone have an idea which would be better for potential string replacement?
I have a collection of varying size containing strings of varying lengths, some of which might need special replacement of encoded hex values (e.g. =0A, %20, etc.).
The "replacements" (there could be multiple) for each string would be handled by a regular expression that detects the appropriate escaped hex values.
Which would be more efficient?
To simply run the replacement on every string in the collection ensuring by brute force that all needed replacements are done
To perform a test if a replacement is needed and only run the replacement on the strings that need it.
I'm working in C#.
Update
A little additional info from the answers and comments.
This is primarily for VCARD processing that is loaded from a QR Code
I currently have a regex that uses capture groups to get the KEY, PARAMETERS and VALUE from each KEY;PARAMETERS:VALUE in the VCARD.
Since I'm supporting v2.1 and v3.0, the encoding and line folding are VERY different, so I need to know the version before I decode.
Now it doesn't make sense to me to run the entire regular expression JUST to get the version, apply the appropriate replace to the whole vCard block of text, and THEN rerun the regular expression.
To me it makes more sense to just get my capture groups loaded up, then snag the version and do the appropriate decoding replacement on each match.
When you just call Replace, it will perform slightly slower when there is no match because of the additional checks that Replace does, e.g.
if (replacement == null)
{
throw new ArgumentNullException("replacement");
}
Regex.Replace does return the input if no matches are found so there's no memory issue here.
Match match = regex.Match(input, startat);
if (!match.Success)
{
return input;
}
When there is a match, Regex.Match fires twice: once when you do it, and again when Replace does it. This means check-and-replace will perform slower in that case.
So your results will be based on
Do you expect a lot of matches or a lot of misses?
When there are matches, does the cost of Regex.Match running twice outweigh the extra parameter checks? My guess is it probably does.
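For reference, the two options being compared look roughly like this. It is a sketch only: the %XX pattern and the names HexDecodeSketch, ReplaceAlways and CheckThenReplace are assumptions for illustration, not the asker's actual regex, and it decodes each escape to a single char without any charset handling.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexDecodeSketch
{
    // Assumed pattern: a percent sign followed by two hex digits.
    private static readonly Regex Hex = new Regex("%([0-9A-Fa-f]{2})");

    // Turns one escape such as %20 into the character it encodes.
    private static string DecodeMatch(Match m)
    {
        int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return ((char)code).ToString();
    }

    // Option 1: always call Replace; one regex pass per string.
    public static string ReplaceAlways(string input)
    {
        return Hex.Replace(input, DecodeMatch);
    }

    // Option 2: test first; strings with escapes are scanned twice.
    public static string CheckThenReplace(string input)
    {
        return Hex.IsMatch(input) ? Hex.Replace(input, DecodeMatch) : input;
    }
}
```

Both methods return the same text; the only difference is how many times the regex engine runs, so timing them on representative data is the way to settle it.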
You could use something along the lines of a very specialized lexer with look-forward checking, e.g.,
outputBuffer := new StringBuilder
index := 0
max := input.Length
while index < max
    if input[ index ] == '%'
        && index + 2 < max
        && IsHexDigit( input[ index + 1 ] )
        && IsHexDigit( input[ index + 2 ] )
        outputBuffer.Append( ( char )int.Parse( input.Substring( index + 1, 2 ), NumberStyles.HexNumber ) )
        index += 3
        continue
    else
        outputBuffer.Append( input[ index ] )
        index ++
        continue
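A runnable C# take on that pseudocode could look like the sketch below. PercentDecoderSketch and Decode are made-up names; it only handles %XX escapes (not the =0A quoted-printable form), and it adds a bounds check so a % near the end of the string is copied through as-is.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class PercentDecoderSketch
{
    // Single pass over the input: decode %XX, copy everything else.
    public static string Decode(string input)
    {
        var output = new StringBuilder(input.Length);
        int index = 0;

        while (index < input.Length)
        {
            // Look ahead two characters, but only if they exist.
            if (input[index] == '%'
                && index + 2 < input.Length
                && Uri.IsHexDigit(input[index + 1])
                && Uri.IsHexDigit(input[index + 2]))
            {
                int code = int.Parse(
                    input.Substring(index + 1, 2),
                    NumberStyles.HexNumber,
                    CultureInfo.InvariantCulture);
                output.Append((char)code);
                index += 3; // skip the '%' and both hex digits
            }
            else
            {
                output.Append(input[index]);
                index++;
            }
        }

        return output.ToString();
    }
}
```

PercentDecoderSketch.Decode("a%20b") returns "a b", while malformed input such as "%zz" or a trailing "%4" is left untouched.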
If you go with string replacement, it may be better to use StringBuilder.Replace than string.Replace. (Will not create many temporary strings while replacing....)
(Posted on behalf of the question author).
Taking some inspiration from some of the fine folks who chimed in, I managed to isolate and test the code in question.
In both cases I have a Parser Regex that handles breaking up each "line" of the vcard and a Decode Regex that handles capturing any encoded Hex numbers.
It occurred to me that regardless of my use of string.Replace or not I still had to depend on the Decode Regex to pick up the potential replacement hex codes.
I ran through several different scenarios to see if the numbers would change; including: Casting the Regex MatchCollection to a Dictionary to remove the complexity of the Match object and projecting the Decoding regex into a collection of distinct simple anonymous object with an Old and New string value for simple string.Replace calls
In the end, no matter how I massaged the test using String.Replace, it came close but was always slower than letting the Decode Regex do its Replace thing.
The closest was about a 12% difference in speed.
In the end for those curious this is what ended up as the winning block of code
var ParsedCollection = Parser.Matches(UnfoldedEncodeString).Cast<Match>().Select(m => new
{
Field = m.Groups["FIELD"].Value,
Params = m.Groups["PARAM"].Captures.Cast<Capture>().Select(c => c.Value),
Encoding = m.Groups["ENCODING"].Value,
Content = m.Groups["ENCODING"].Value.FirstOrDefault() == 'Q' ? QuotePrintableDecodingParser.Replace(m.Groups["CONTENT"].Value, me => Convert.ToChar(Convert.ToInt32(me.Groups["HEX"].Value, 16)).ToString()) : m.Groups["CONTENT"].Value,
Base64Content = ((m.Groups["ENCODING"].Value.FirstOrDefault() == 'B') ? Convert.FromBase64String(m.Groups["CONTENT"].Value.Trim()) : null)
});
Gives me everything I need in one shot. All the Fields, their values, any parameters and the two most common encodings decoded all projected into a nicely packaged anonymous object.
And on the plus side, it takes only a little over 1000 nanoseconds to go from string to parsed and decoded anonymous object (thank goodness for LINQ and extension methods), based on 100,000 tests with a vCard of around 4,000 characters.
I wrote the following snippet to get rid of excessive spaces in slabs of text
int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}
Generally this works fine, albeit rather primitive and possibly inefficient.
Problem
When the text contains " - ", for some bizarre reason IndexOf returns a match!
The Replace function doesn't remove anything, and then it is stuck in an endless loop.
Any ideas what is going on with the string.IndexOf?
Ah, the joys of text.
What you most likely have there, but got lost when posting on SO, is a "soft hyphen".
To reproduce the problem, I tried this code in LINQPad:
void Main()
{
    var text = "Test1 \u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}
And sure enough, the above code just gets stuck in a loop.
Note that \u00ad is the Unicode symbol for Soft Hyphen, according to CharMap. You can always copy and paste the character from CharMap as well, but posting it here on SO will replace it with its much more common cousin, the Hyphen-Minus, Unicode symbol U+002D (the one on your keyboard).
You can read a small section in the documentation for the String Class which has this to say on the subject:
String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.
I've highlighted the relevant part, but I also remember a blog post about this exact problem a while back but my Google-Fu is failing me tonight.
The problem here is that IndexOf and Replace use different methods for locating the text.
Whereas IndexOf will consider the soft hyphen as "not really there", and thus discover the two spaces on each side of it as "two joined spaces", the Replace method won't, and thus won't remove either of them. Therefore the criteria is present for the loop to continue iterating, but since Replace doesn't remove the spaces that fit the criteria, it will never end. Undoubtedly there are other such characters in the Unicode symbol space that exhibit similar problems, but this is the most typical case I've seen.
There are at least two ways of handling this:
You can use Regex.Replace, which seems to not have this problem:
text = Regex.Replace(text, " +", " ");
Personally I would probably use the whitespace special character in the Regular Expression, which is \s, but if you only want spaces, the above should do the trick.
You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:
index = text.IndexOf("  ", StringComparison.Ordinal);
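Putting the second option back into the original loop gives something like the sketch below. SpaceCollapseSketch and CollapseSpaces are names I picked for illustration. Because both the search and the replace are ordinal here, the loop condition can only be true when Replace has something to remove, so it terminates. I also used >= 0 instead of > 0 so that a run of spaces at the very start of the text is handled too.

```csharp
using System;

public static class SpaceCollapseSketch
{
    // Collapses runs of spaces into a single space using ordinal
    // comparisons on both the search and the replace side.
    public static string CollapseSpaces(string text)
    {
        const string TwoSpaces = "  ";

        while (text.IndexOf(TwoSpaces, StringComparison.Ordinal) >= 0)
        {
            // string.Replace(string, string) is an ordinal replace.
            text = text.Replace(TwoSpaces, " ");
        }

        return text;
    }
}
```

With "Test1 \u00ad Test2" the method returns immediately and leaves the soft hyphen and both spaces in place; whether those should be removed as well is a separate decision.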