Why is one character missing in the query result? - c#

Take a look at the code:
string expression = "x & ~y -> (s + t) & z";
var exprCharsNoWhitespace = expression.Except( new[]{' ', '\t'} ).ToList();
var exprCharsNoWhitespace_2 = expression.Replace( " ", "" ).Replace( "\t", "" ).ToList();
// output for examination
Console.WriteLine( exprCharsNoWhitespace.Aggregate( "", (a,x) => a+x ) );
Console.WriteLine( exprCharsNoWhitespace_2.Aggregate( "", (a,x) => a+x ) );
// Output:
// x&~y->(s+t)z
// x&~y->(s+t)&z
I want to remove all whitespace from the original string and then get the individual characters.
The result surprised me.
The variable exprCharsNoWhitespace contains no whitespace, as expected, but it is also missing one of the other characters. The last occurrence of '&' is gone, and the Count of the list is 12.
Whereas exprCharsNoWhitespace_2 is completely as expected: Count is 13, all characters other than whitespace are contained.
The framework used was .NET 4.0.
I also just pasted this to csharppad (web-based IDE/compiler) and got the same results.
Why does this happen?
EDIT:
All right, I was unaware that Except is, as pointed out by Ryan O'Hara, a set operation. I hadn't used it before.
// So I'll continue just using something like this:
expression.Where( c => c!=' ' && c!='\t' )
// or for more characters this can be shorter:
expression.Where( c => ! new[]{'a', 'b', 'c', 'd'}.Contains(c) ).

Except produces a set difference. Your expression isn’t a set, so it’s not the right method to use. As to why the & specifically is missing: it’s because it’s repeated. None of the other characters is.
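A minimal sketch of the difference, assuming a hypothetical helper class named WhitespaceDemo (not part of the question's code). It runs the same input through Except and through a plain Where filter so the two results can be compared directly:

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class WhitespaceDemo
{
    private static readonly char[] Whitespace = { ' ', '\t' };

    // Set difference: each distinct character is yielded at most once,
    // in order of first appearance.
    public static string WithExcept(string s)
    {
        return new string(s.Except(Whitespace).ToArray());
    }

    // Plain filter: duplicates and order are preserved.
    public static string WithWhere(string s)
    {
        return new string(s.Where(c => !Whitespace.Contains(c)).ToArray());
    }
}
```

For the expression in the question, WithExcept returns "x&~y->(s+t)z" (the second '&' is dropped as a duplicate), while WithWhere returns "x&~y->(s+t)&z".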

Ryan already answered your question as asked, but I'd like to provide you an alternative solution to the problem you are facing. If you need to do a lot of string manipulation, you may find regular expression pattern matching to be helpful. The examples you've given would work something like this:
string expression = "x & ~y -> (s + t) & z";
string pattern = @"\s";
string replacement = "";
string noWhitespace = new Regex(pattern).Replace(expression, replacement);
Or for the second example, keep everything the same except the pattern:
string pattern = "[abcd]";
Keep the Regex object stored somewhere rather than creating it each time if you're going to use the same pattern a lot.
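One way to keep the Regex around is a static readonly field. This is only a sketch; the class name WhitespaceStripper and its members are made up for illustration:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative.
public static class WhitespaceStripper
{
    // Constructed once when the class is first used, then reused by every call.
    private static readonly Regex AnyWhitespace = new Regex(@"\s");

    public static string Strip(string input)
    {
        return AnyWhitespace.Replace(input, "");
    }
}
```

Calling WhitespaceStripper.Strip(expression) then avoids constructing a new Regex on every call.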

As already mentioned .Except(...) is a set operation so it drops duplicates.
Try just using .Where(...) instead:
string expression = "x & ~y -> (s + t) & z";
var exprCharsNoWhitespace =
String.Join(
"",
expression.Where(c => !new[] { ' ', '\t' }.Contains(c)));
This gives:
x&~y->(s+t)&z

Related

Replace a part of string containing Password

Slightly similar to this question, I want to replace argv contents:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
to this:
"-help=none\n-URL=(default)\n-password=********\n-uname=Khanna\n-p=100"
I have tried very basic string find and search operations (using IndexOf, SubString etc.). I am looking for more elegant solution so as to replace this part of string:
-password=AnyPassword
to:
-password=*******
And keep other part of string intact. I am looking if String.Replace or Regex replace may help.
What I've tried (not much of error-checks):
var pwd_index = argv.IndexOf("-password=");
string converted;
if (pwd_index >= 0)
{
var leftPart = argv.Substring(0, pwd_index);
var pwdStr = argv.Substring(pwd_index);
var rightPart = pwdStr.Substring(pwdStr.IndexOf("\n") + 1);
converted = leftPart + "-password=********\n" + rightPart;
}
else
converted = argv;
Console.WriteLine(converted);
Solution
Similar to Rubens Farias' solution but a little bit more elegant:
string argv = "-help=none\n-URL=(default)\n-password=\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, @"(password=)[^\n]*", "$1********");
It matches password= literally, stores it in capture group $1, and then keeps matching until a \n is reached.
This yields a constant number of *'s, though. But revealing how many characters a password has might already convey too much information to hackers anyway.
Working example: https://dotnetfiddle.net/xOFCyG
Regular expression breakdown
( // Store the following match in capture group $1.
password= // Match "password=" literally.
)
[ // Match one from a set of characters.
^ // Negate a set of characters (i.e., match anything not
// contained in the following set).
\n // The character set: consists only of the new line character.
]
* // Match the preceding character set zero or more times.
This code replaces the password value by several "*" characters:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, @"(password=)([\s\S]*?\n)",
match => match.Groups[1].Value + new String('*', match.Groups[2].Value.Length - 1) + "\n");
You can also remove the new String() part and replace it with a string constant.
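Note that the pattern above requires a trailing \n, so it does not match when the password is the last entry in the string. A hedged variant that stops at the next newline or at the end of the string might look like this (PasswordMasker is a hypothetical name, and like the code above it reveals the password length):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PasswordMasker
{
    // Group 1 is the literal "password="; group 2 is everything up to
    // (but not including) the next newline or the end of the string.
    private static readonly Regex PasswordValue = new Regex(@"(password=)([^\n]*)");

    public static string Mask(string argv)
    {
        // Emit one '*' per character of the original value.
        return PasswordValue.Replace(argv,
            m => m.Groups[1].Value + new string('*', m.Groups[2].Value.Length));
    }
}
```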

RegEx matching for a filename

I have no experience using regular expressions, and although I should spend some time training in them, I have a need for a simple one.
I want to find a match of P*.txt in a given string (meaning anything that starts with a P, followed by anything, and ending in ".txt").
eg:
string myString = "P671221.txt";
Regex reg = new Regex("P*.txt"); //<--- what goes here?
if (reg.IsMatch(myString))
{
Console.WriteLine("Match!");
}
This example doesn't work because it will return a match for ".txt" or "x.txt" etc. How do I do this?
myString.StartsWith("P") && myString.EndsWith(".txt")
EDIT: Removed my regex
Updated:
string start + (p) + any characters + .txt + string end
^(?i:p).*\.txt$
A more precise alternative would be:
string start + (p) + [specific characters] + .txt + string end
( currently specified are: "a-z", "0-9", space, & underscore )
^(?i:p)(?i:[a-z0-9 _])*\.txt$
Original Solution
( quotes were included, as I overlooked that quotes are part of the code but not
the string )
preceding quotes + (p) + any characters + .txt + following quotes
(?<=")(?i:p).*\.txt(?=")
P[\d]+\.txt will work. If you have a fixed number of digits, you can use P[\d]{6}\.txt. Just replace the 6 with the number of digits you need.
If the value in between the starting letter P and extension .txt can be alphanumeric use P[\w]+\.txt
string myString = "P671221.txt";
Regex reg = new Regex("P(.*?)\\.txt"); //--> if anything goes after P
if (reg.IsMatch(myString))
Console.WriteLine("Match!");
This should meet the requirements that you have presented.
[Pp].*\.(?:txt)+$
The best option to get files that start with P & end with .txt with regex is:
^P\w+\.txt$
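To see why anchoring and escaping the dot matter, here is a small sketch that compares the question's original pattern with an anchored one. The class and method names are hypothetical:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class FileNameCheck
{
    // The question's attempt: "P*" means zero or more 'P' characters,
    // and the unescaped "." matches any single character.
    public static bool Naive(string name)
    {
        return Regex.IsMatch(name, "P*.txt");
    }

    // Anchored: starts with P, anything in between, literal ".txt" at the end.
    public static bool Anchored(string name)
    {
        return Regex.IsMatch(name, @"^P.*\.txt$");
    }
}
```

Naive accepts "x.txt" as well as "P671221.txt", whereas Anchored accepts "P671221.txt" and rejects both "x.txt" and "P671221.txt.bak".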

Regex to split "&" in URL parameters only if they are followed by content ending with "="

I have a dilemma that I have been attempting to resolve with malformed URLs, where certain parameters can have values containing characters that conflict with parsing the URL.
if( remaining.Contains( "?" ) || remaining.Contains( "#" ) )
{
if( remaining.Contains( "?" ) )
{
Path = remaining.Substring( 0, temp = remaining.IndexOf( "?" ) );
remaining = remaining.Substring( temp + 1 );
// Re-encode for URLs
if( remaining.Contains( "?" ) )
{
remaining = URL.Substring( URL.IndexOf( "?" ) + 1 );
}
if( remaining.IndexOf("=") >= 0 )
{
string[] qsps = Regex.Split( remaining, @"[&]\b" );// Original Method: remaining.Split( '&' );
qsps.ToList().ForEach( qsp =>
{
string[] vals = qsp.Split( '=' );
if( vals.Length == 2 )
{
Parameters.Add( vals[0], vals[1] );
}
else
{
string key = (string) vals[0].Clone();
vals[0] = "";
Parameters.Add( key, String.Join( "=", vals ).Substring( 1 ) );
}
} );
}
}
I added the line "Regex.Split( remaining, @"[&]\b" );" to split on "&" only when it is immediately followed by a word character, which seems useful.
Is there a better approach that splits only on the "&" characters that actually separate parameters?
Example to test against (which caused this needed update):
www.myURL.com/shop/product?utm_src=bm23&utm_med=email&utm_term=apparel&utm_content=02/15/2016&utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
A working regex should only grab the &'s for the following:
utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
It should NOT count the "& More" as a match, since the text after it is not a parameter name ending in "=".
I would suggest a regex using a look-ahead:
/&(?=[^&=]+=)/
You can see this in effect here: version1. It looks first for the & character, and then "peeks" forward to ensure that a = follows, but only if it does not contain another & or a = in between.
You can also ensure that there are no whitespace characters (like newlines, etc.) which aren't valid in URLs anyway (version 2):
&(?=[^\s&=]+=)
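In C#, the look-ahead pattern can be passed straight to Regex.Split. A minimal sketch, assuming the query string has already been separated from the path (QuerySplitter is a hypothetical name):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class QuerySplitter
{
    // Split only on an '&' that is followed by "name=", where the name
    // contains no whitespace, '&' or '='.
    private static readonly Regex ParamSeparator = new Regex(@"&(?=[^\s&=]+=)");

    public static string[] Split(string query)
    {
        return ParamSeparator.Split(query);
    }
}
```

Because the look-ahead consumes nothing, each piece keeps its full "name=value" text, and the "&" in "Free Shipping & More!" stays inside the last value.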
I'd use this regex:
Regex.Split(url, @"(?<=(?:=\S+?))&",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
If you pass your test string in as url, which is:
www.myURL.com/shop/product?utm_src=bm23&utm_med=email&utm_term=apparel&utm_content=02/15/2016&utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
The output should be.
www.myURL.com/shop/product?utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping & More!
Note the first line of the output:
www.myURL.com/shop/product?utm_src=bm23
It still contains the path part of the URL, but that can easily be split off at the ?.
Not sure what you're trying to do, but if you want to find errant
ampersands, this is a good regex for that.
&(?=[^&=]*(?:&|$))
You could either replace with a %26 or split with it.
If you split with it, just recombine and the errant ampersand will be gone.
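The replace option can be sketched as follows, assuming the input is only the query-string part of the URL. AmpersandFixer is a hypothetical name, and note that a value-less parameter such as "&flag" would also be treated as errant by this pattern:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AmpersandFixer
{
    // An '&' counts as errant when no '=' appears before the next '&'
    // or before the end of the string.
    private static readonly Regex ErrantAmpersand = new Regex(@"&(?=[^&=]*(?:&|$))");

    public static string Encode(string query)
    {
        return ErrantAmpersand.Replace(query, "%26");
    }
}
```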
(?<=[?&])([^&]*)(?=.*[&=])
Explanation:
(?<=[?&]) positive lookbehind for either '&' or '?'
([^&]*) capture as many characters as possible that aren't '&'
(?=.*[&=]) positive lookahead for either an '&' or '='
Output:
utm_src=bm23
utm_med=email
utm_term=apparel
utm_content=02/15/2016
utm_campaign=Last Chance! Presidents' Day Sales Event: Free Shipping
So to get the matches:
string str = "www.myURL.com/...";
Regex reg = new Regex("(?<=[?&])([^&]*)(?=.*[&=])");
List<string> result = reg.Matches(str).Cast<Match>().Select(m => m.Value).ToList();
Edit for the question edit:
(?<=[?&])\S.*?(?=&\S)|(?<=[?&])\S.*(?=\s)

C# Regex wildcard multiple replace

Doing a search for different strings using wildcards, such as doing a search for test0? (there is a space after the ?). The strings the search produces are:
test01
test02
test03
(and so on)
The replacement text should be for example:
test0? -
The wildcard above in test0? - represents the 1, 2, or 3...
So, the replacement strings should be:
test01 -
test02 -
test03 -
string pattern = WildcardToRegex(originalText);
fileName = Regex.Replace(originalText, pattern, replacementText);
public string WildcardToRegex(string pattern)
{
return "^" + System.Text.RegularExpressions.Regex.Escape(pattern).
Replace("\\*", ".*").Replace("\\?", ".") + "$";
}
The problem is saving the new string with the original character(s) plus the added characters. I could search the string and save the original with some string manipulation, but that seems like too much overhead. There has to be an easier way.
Thanks for any input.
EDIT:
Search for strings using the wildcard ?
Possible string are:
test01 someText
test02 someotherText
test03 moreText
Using Regex, the search string pattern will be:
test0? -
So, each string should then read:
test01 - someText
test02 - someotherText
test03 - moreText
How to keep the character that was replaced by the regex wildcard '?'
As my code stands, it will come out as test? - someText
That is wrong.
Thanks.
EDIT Num 2
First, thanks everyone for their answers and direction.
It did help and led me to the right track, and now I can better ask the exact question:
It has to do with substitution.
Inserting text after the Regex.
The sample strings I gave may not always be in that format. I have been looking into substitution but just can't seem to get the syntax right. And I am using VS 2008.
Any more suggestions?
Thanks
If you want to replace "test0? " with "test0? -", you would write:
string bar = Regex.Replace(foo, "^test0. ", "$0- ");
The key here is the $0 substitution, which will include the matched text.
So if I understand your question correctly, you just want your replacementText to be "$0- ".
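Applied to the sample lines from the question, the $0 substitution can be sketched like this (PrefixInserter is a hypothetical name):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PrefixInserter
{
    // Pattern: "test0", any one character, then a space, at the start of the line.
    // "$0" in the replacement stands for the whole matched text.
    public static string InsertDash(string line)
    {
        return Regex.Replace(line, "^test0. ", "$0- ");
    }
}
```

With this, "test01 someText" becomes "test01 - someText", and lines that do not match the pattern are returned unchanged.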
If I understand the question correctly, couldn't you just use a match?
// Build the regex from your pattern string (I'm assuming "pattern" can be produced from your "originalText")
Regex regex = new Regex(pattern);
// Replace each match with the matched text followed by " -"
// ("input" stands for the string being searched)
string result = regex.Replace(input, m => m.Groups[0].Value + " -");
So I'm not 100% clear on what you're doing, but I'll give it a try.
I'm going with the assumption that you want to use "file wildcards" (?/*) and search for a set of values that match (while retaining the values stored using the placeholder itself), then replace it with the new value (re-inserting those placeholders). given that, and probably a lot of overkill (since your requirement is kind of weird) here's what I came up with:
// Helper function to turn the file search pattern in to a
// regex pattern.
private Regex BuildRegexFromPattern(String input)
{
String pattern = String.Concat(input.ToCharArray().Select(i => {
String c = i.ToString();
return c == "?" ? "(.)"
: c == "*" ? "(.*)"
: c == " " ? "\\s"
: Regex.Escape(c);
}));
return new Regex(pattern);
}
// perform the actual replacement
private IEnumerable<String> ReplaceUsingPattern(IEnumerable<String> items, String searchPattern, String replacementPattern)
{
Regex searchRe = BuildRegexFromPattern(searchPattern);
return items.Where(s => searchRe.IsMatch(s)).Select (s => {
Match match = searchRe.Match(s);
Int32 m = 1;
return String.Concat(replacementPattern.ToCharArray().Select(i => {
String c = i.ToString();
if (m > match.Groups.Count)
{
throw new InvalidOperationException("Replacement placeholders exceeds locator placeholders.");
}
return c == "?" ? match.Groups[m++].Value
: c == "*" ? match.Groups[m++].Value
: c;
}));
});
}
Then, in practice:
String[] samples = new String[]{
"foo01", "foo02 ", "foo 03",
"bar0?", "bar0? ", "bar03 -",
"test01 ", "test02 ", "test03 "
};
String searchTemplate = "test0? ";
String replaceTemplate = "test0? -";
var results = ReplaceUsingPattern(samples, searchTemplate, replaceTemplate);
Which, from the samples list above, gives me:
matched: & modified to:
test01 test01 -
test02 test02 -
test03 test03 -
However, if you really want to save headaches, you should be using replacement references. There's no need to re-invent the wheel. The above, with replacements, could have been changed to:
Regex searchRe = new Regex(@"test0(.*)\s");
samples.Select(x => searchRe.Replace(x, "test0$1 -"));
You can capture any piece of your matched string and place it anywhere in the replacement, using the symbol $ followed by the index of the captured element (indexes start at 1).
You capture an element with parentheses "()".
Example:
If I have several strings with testXYZ, being XYZ a 3-digit number, and I need to replace it, say, with testZYX, inverting the 3 digits, I would do:
string result = Regex.Replace(source, "test([0-9])([0-9])([0-9])", "test$3$2$1");
So, in your case, it can be done:
string result = Regex.Replace(source, "test0([0-9]) ", "test0$1 - ");

Tokenizing with RegEx when delimiter can be in token

I'm parsing some input in C#, and I'm hitting a wall with RegEx processing.
A disclaimer: I'm not a regular expression expert, but I'm learning more.
I have an input string that looks like this:
ObjectType [property1=value1, property2=value2, property3=AnotherObjectType [property4=some value4]]
(a contrived value, but the important thing is that these can be nested).
I'm doing the following to tokenize the string:
Regex Tokenizer = new Regex(@"([=\[\]])|(,\s)");
string[] tokens = Tokenizer.Split(s);
This gets me about 98% of the way. This splits the string on known separators, and commas followed by a whitespace.
The tokens in the above example are:
ObjectType
[
property1
=
value1
,
property2
=
value2
,
property3
=
AnotherObjectType
[
property4
=
some value4
]
]
But I have two issues:
1) The property values can contain commas. This is a valid input:
ObjectType [property1=This is a valid value, and should be combined,, property2=value2, property3=AnotherObjectType [property4=value4]]
I would like the token after property1= to be:
This is a valid value, and should be combined,
And I'd like the whitespace inside the token to be preserved. Currently, it's split when a comma is found.
2) When split, the comma tokens contain whitespace. I'd like to get rid of this if possible, but this is a much less important priority.
I've tried various options, and they have all gotten me partially there. The closest that I've had is this:
Regex Tokenizer = new Regex(@"([=\[\]])|(,\s)|([\w]*\s*(?=[=\[\]]))|(.[^=]*(?=,\s))");
This matches the separators, a comma followed by whitespace, word characters followed by whitespace before a literal, and text before a comma and whitespace (that doesn't include the = sign).
When I get the matches instead of calling split, I get this:
ObjectType
[
property1
=
value1
,
property2
=
value2
,
property3
=
AnotherObjectType
[
property4
=
value4
]
]
Notice the missing information from property4. More complex inputs sometimes have the close brackets included in the token, like this: value4]
I'm not sure why that's happening. Any ideas on how to improve upon this?
Thanks,
Phil
This is easiest to answer with a lexer and parser tool. Many argue that they are too complex for these "simple" use cases, though I have always found them clearer and easier to reason about. You don't get bogged down in stupid if logic.
For C#, GPLEX and GPPG seem to be some good ones. See here for why you might want to use them.
In your case you have a grammar, that is how you define the interaction between different tokens based upon context. And also, you have the details of implementing this grammar in your language and toolchain of choice. The grammar is relatively easy to define, you have informally done so already. The details are the tricky part. Wouldn't it be nice if you had a framework that could read some defined way of writing out the grammar bit and just generate the code to actually do it?
That is how these tools work in a nutshell. The docs are pretty short, so read through all of them, taking the time up front will help immensely.
In essence, you would declare a scanner and parser. The scanner takes in a text stream/file and compares it to various regular expressions until it has a match. That match is passed up to the parser as a token. Then the next token is matched and passed up, round and round until the text stream empties out.
Each matched token can have arbitrary C# code attached to it, and the same with each of the rules in the parser.
I don't normally use C#, but I've written quite a few lexers and parsers. The principles are the same across languages. This is the best solution for your problem, and will help you again and again throughout your career.
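To make the scanner idea concrete, here is a hand-rolled sketch in C#. It is not GPLEX output, and the token names and the MiniScanner class are invented for illustration. It tries an ordered list of regular expressions at the current position and emits the first one that matches, which is the loop described above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical scanner, for illustration only.
public static class MiniScanner
{
    // Ordered rules: the first pattern that matches at the current position wins.
    // \G anchors each match to the start position passed to Match().
    private static readonly KeyValuePair<string, Regex>[] Rules =
    {
        new KeyValuePair<string, Regex>("LBRACKET", new Regex(@"\G\[")),
        new KeyValuePair<string, Regex>("RBRACKET", new Regex(@"\G\]")),
        new KeyValuePair<string, Regex>("EQUALS",   new Regex(@"\G=")),
        new KeyValuePair<string, Regex>("COMMA",    new Regex(@"\G,")),
        new KeyValuePair<string, Regex>("SPACE",    new Regex(@"\G\s+")),
        new KeyValuePair<string, Regex>("TEXT",     new Regex(@"\G[^\[\]=,\s]+")),
    };

    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            bool matched = false;
            foreach (var rule in Rules)
            {
                Match m = rule.Value.Match(input, pos);
                if (!m.Success) continue;
                // Whitespace is consumed but not reported as a token.
                if (rule.Key != "SPACE") tokens.Add(rule.Key + ":" + m.Value);
                pos += m.Length;
                matched = true;
                break;
            }
            if (!matched) throw new FormatException("Unexpected character at " + pos);
        }
        return tokens;
    }
}
```

The scanner deliberately does not decide whether a comma belongs to a value or separates two properties. In this design that decision is left to the parser, which sees the token stream and can check whether a COMMA is followed by TEXT and EQUALS.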
You can do this with two regular expressions and a recursive function with one caveat: special characters must be escaped. From what I can see, "=", "[" and "]" have special meaning, so you must insert a "\" before those characters if you want them to appear as part of your property value. Note that commas are not considered "special". A comma before a "property=" string is ignored, but otherwise they are treated in no special way (and, in fact, are optional between properties).
Input
ObjectType
[
property1=value1,val\=value2
property2=value2 \[property2\=this is not an object\], property3=
AnotherObjectType [property4=some
value4]]
Regular Expressions
The regex for discovering "complex" types (beginning with a type name followed by square brackets). The regex includes a mechanism for balancing square brackets to make sure that each open bracket is paired with a close bracket (so that the match does not end too soon or too late):
^\s*(?<TypeName>\w+)\s*\[(?<Properties>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*(?(Depth)(?!)))\]\s*$
The regex for discovering properties within a complex type. Note that this also includes balanced square brackets to ensure that the properties of a sub-complex type are not accidentally consumed by the parent.
(?<PropertyName>\w+)\s*=\s*(?<PropertyValue>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*?(?(Depth)(?!))(?=$|(?<!\\)\]|,?\s*\w+\s*=))
Code
private static Regex ComplexTypeRegex = new Regex( @"^\s*(?<TypeName>\w+)\s*\[(?<Properties>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*(?(Depth)(?!)))\]\s*$" );
private static Regex PropertyRegex = new Regex( @"(?<PropertyName>\w+)\s*=\s*(?<PropertyValue>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*?(?(Depth)(?!))(?=$|(?<!\\)\]|,?\s*\w+\s*=))" );
private static string Input =
@"ObjectType" + "\n" +
@"[" + "\n" +
@" property1=value1,val\=value2 " + "\n" +
@" property2=value2 \[property2\=this is not an object\], property3=" + "\n" +
@" AnotherObjectType [property4=some " + "\n" +
@"value4]]";
static void Main( string[] args )
{
Console.Write( Process( 0, Input ) );
Console.WriteLine( "\n\nPress any key..." );
Console.ReadKey( true );
}
private static string Process( int level, string input )
{
var l_complexMatch = ComplexTypeRegex.Match( input );
var l_indent = string.Join( "", Enumerable.Range( 0, level * 3 ).Select( i => " " ).ToArray() );
var l_output = new StringBuilder();
l_output.AppendLine( l_indent + l_complexMatch.Groups["TypeName"].Value );
foreach ( var l_match in PropertyRegex.Matches( l_complexMatch.Groups["Properties"].Value ).Cast<Match>() )
{
l_output.Append( l_indent + "#" + l_match.Groups["PropertyName"].Value + " = " );
var l_value = l_match.Groups["PropertyValue"].Value;
if ( Regex.IsMatch( l_value, @"(?<!\\)\[" ) )
{
l_output.AppendLine();
l_output.Append( Process( level + 1, l_value ) );
}
else
{
l_output.AppendLine( "\"" + l_value + "\"" );
}
}
return l_output.ToString();
}
Output
ObjectType
#property1 = "value1,val\=value2 "
#property2 = "value2 \[property2\=this is not an object\]"
#property3 =
AnotherObjectType
#property4 = "some value4"
If you cannot escape the delimiters, then I doubt even a human could parse such a string. For example, how would a human reliably know whether the value of property 3 should be considered a literal string or a complex type?
