Replacing a Regex Match Collection in C#

Replacing a Regex Match Collection in C# - c#

I need to find in a text everything that starts with [ and ends with ] and replace it with the value that a function returns. So, here is an example of what I'm doing:
public string ProcessTemplate(string input)
{
return Regex.Replace(input, #"\[(.*?)\]", new MatchEvaluator(delegate(Match match)
{
return ProcessBlock(match.Result("$1"));
}));
}
public string ProcessBlock(string value)
{
return Block.FromString(value).Process();
}
Now, my problem is when I need to edit blocks. So I thought to find blocks, edit them, and then replacing them in the text.
So, I created a List of blocks, and separated the ProcessTemplate method to two methods: FindBlocks and ReplaceBlocks:
public void FindBlocks(string input)
{
Input = input;
foreach (Match match in Regex.Matches(input, #"\[(.*?)\]"))
Blocks.Add(Block.FromString(match.Result("$1")));
}
public string ReplaceBlocks()
{
string input = Input;
foreach (Block block in Blocks)
input = input.Replace("[" + block.OrginalText + "]", block.Process);
return input;
}
public IList<Block> Blocks
{
get;
set;
}
public string Input
{
get;
set;
}
It's working, but the problem is that it is pretty slow. I measured with System.Diagnostics.Stopwatch every part, and I found out that the String.Replace in the ReplaceBlocks method is pretty slow.
How can I improve it?
Thanks.

replacing the string in ReplaceBlock with a StringBuilder may provide a performance increase as every time you perform a string.replace it will have to deallocate the string and reallocate the string. string builder doesnt need to do this.
Replace the contents of the ReplaceBlock with the following.
// This will require a reference to System.Text
StringBuilder input =new StringBuilder(Input);
foreach (Block block in Blocks)
{
input = input.Replace("[" + block.OrginalText + "]", block.Process);
}
return input.ToString();
I have also found replacing the foreach loops with a for loop which are quicker.

I think it is slow
Don't optimize until you've profiled. Find out why your code is slow, then optimize those parts.
http://c2.com/cgi/wiki?ProfileBeforeOptimizing

Related

Replacing first 16 digits in a string with Regex.Replace

I'm trying to replace only the first 16 digits of a string with Regex. I want it replaced with "*". I need to take this string:
"Request=Credit Card.Auth
Only&Version=4022&HD.Network_Status_Byte=*&HD.Application_ID=TZAHSK!&HD.Terminal_ID=12991kakajsjas&HD.Device_Tag=000123&07.POS_Entry_Capability=1&07.PIN_Entry_Capability=0&07.CAT_Indicator=0&07.Terminal_Type=4&07.Account_Entry_Mode=1&07.Partial_Auth_Indicator=0&07.Account_Card_Number=4242424242424242&07.Account_Expiry=1024&07.Transaction_Amount=142931&07.Association_Token_Indicator=0&17.CVV=200&17.Street_Address=123
Road SW&17.Postal_Zip_Code=90210&17.Invoice_Number=INV19291"
And replace the credit card number with an asterisk, which is why I say the first 16 digits, as that is how many digits are in a credit card. I am first splitting the string where there is a "." and then checking if it contains "card" and "number". Then if it finds it I want to replace the first 16 numbers with "*"
This is what I've done:
public void MaskData(string input)
{
if (input.Contains("."))
{
string[] userInput = input.Split('.');
foreach (string uInput in userInput)
{
string lowerCaseInput = uInput.ToLower();
string containsCard = "card";
string containsNumber = "number";
if (lowerCaseInput.Contains(containsCard) && lowerCaseInput.Contains(containsNumber))
{
tbStoreInput.Text += Regex.Replace(lowerCaseInput, #"[0-9]", "*") + Environment.NewLine;
}
else
{
tbStoreInput.Text += lowerCaseInput + Environment.NewLine;
}
}
}
}
I am aware that the Regex is wrong, but not sure how to only get the first 16, as right now its putting an asterisks in the entire line like seen here:
"account_card_number=****************&**"
I don't want it to show the asterisks after the "&".

Same answer as in the comments but explained.
your regex pattern "[0-9]" is a single digit match, so each individual digit
including the digits after & will be a match and so would be replaced.
What you want to do is add a quantifier which restricts the matching to a number of characters ie 16, so your regex changes to "[0-9]{16}" to ensure those are the only characters affected by your replace operation

Disclaimer
My answer is purposely broader than what is asked by OP but I saw it as an opportunity to raise awareness of other tools that are available in C# (which are objects).
String replacement
Regex is not the only tool available to replace a simple string by another. Instead of
Regex.Replace(lowerCaseInput, #"[0-9]{16}", "****************")
it can also be
new StringBuilder()
.Append(lowerCaseInput.Take(20))
.Append(new string('*', 16))
.Append(lowerCaseInput.Skip(36))
.ToString();
Shifting from procedural to object
Now the real meat comes in the possibility to encapsulate the logic into an object which holds a kind of string representation of a dictionary (entries being separated by '.' while keys and values are separated by '=').
The only behavior this object has is to give back a string representation of the initial input but with some value (1 in your case) masked to user (I assume for some security reason).
public sealed class CreditCardRequest
{
private readonly string _input;
public CreditCardRequest(string input) => _input = input;
public static implicit operator string(CreditCardRequest request) => request.ToString();
public override string ToString()
{
var entries = _input.Split(".", StringSplitOptions.RemoveEmptyEntries)
.Select(entry => entry.Split("="))
.ToDictionary(kv => kv[0].ToLower(), kv =>
{
if (kv[0] == "Account_Card_Number")
{
return new StringBuilder()
.Append(new string('*', 16))
.Append(kv[1].Skip(16))
.ToString();
}
else
{
return kv[1];
}
});
var output = new StringBuilder();
foreach (var kv in entries)
{
output.AppendFormat("{0}={1}{2}", kv.Key, kv.Value, Environment.NewLine);
}
return output.ToString();
}
}
Usage becomes as follow:
tbStoreInput.Text = new CreditCardRequest(input);
The concerns of your code are now independant of each other (the rule to parse the input is no more tied to UI component) and the implementation details are hidden.
You can even decide to use Regex in CreditCardRequest.ToString() if you wish to, the UI won't ever notice the change.
The class would then becomes:
public override string ToString()
{
var output = new StringBuilder();
if (_input.Contains("."))
{
foreach (string uInput in _input.Split('.'))
{
if (uInput.StartsWith("Account_Card_Number"))
{
output.AppendLine(Regex.Replace(uInput.ToLower(), #"[0-9]{16}", "****************");
}
else
{
output.AppendLine(uInput.ToLower());
}
}
}
return output.ToString();
}

You can match 16 digits after the account number, and replace with 16 times an asterix:
(?<=\baccount_card_number=)[0-9]{16}\b
Regex demo
Or you can use a capture group and use that group in the replacement like $1****************
\b(account_card_number=)[0-9]{16}\b
Regex demo

C# Method to Check if a String Contains Certain Letters

I'm trying to create a method which takes two parameters, "word" and "input". The aim of the method is to print any word where all of its characters can be found in "input" no more than once (this is why the character is removed if a letter is found).
Not all the letters from "input" must be in "word" - eg, for input = "cacten" and word = "ace", word would be printed, but if word = "aced" then it would not.
However, when I run the program it produces unexpected results (words being longer than "input", containing letters not found in "input"), and have coded the solution several ways all with the same outcome. This has stumped me for hours and I cannot work out what's going wrong. Any and all help will be greatly appreciated, thanks. My full code for the method is written below.
static void Program(string input, string word)
{
int letters = 0;
List<string> remaining = new List<string>();
foreach (char item in input)
{
remaining.Add(item.ToString());
}
input = remaining.ToString();
foreach (char letter in word)
{
string c = letter.ToString();
if (input.Contains(c))
{
letters++;
remaining.Remove(c);
input = remaining.ToString();
}
}
if (letters == word.Length)
{
Console.WriteLine(word);
}
}

Ok so just to go through where you are going wrong.
Firstly when you assign remaining.ToString() to your input variable. What you actually assign is this System.Collections.Generic.List1[System.String]. Doing to ToString on a List just gives you the the type of list it is. It doesnt join all your characters back up. Thats probably the main thing that is casuing you issues.
Also you are forcing everything into string types and really you don't need to a lot of the time, because string already implements IEnumerable you can get your string as a list of chars by just doing myString.ToList()
So there is no need for this:
foreach (char item in input)
{
remaining.Add(item.ToString());
}
things like string.Contains have overloads that take chars so again no need for making things string here:
foreach (char letter in word)
{
string c = letter.ToString();
if (input.Contains(c))
{
letters++;
remaining.Remove(c);
input = remaining.ToString();
}
}
you can just user the letter variable of type char and pass that into contains and beacuse remaining is now a List<char> you can remove a char from it.
again Don't reassign remaining.ToString() back into input. use string.Join like this
string.Join(string.empty,remaining);
As someone else has posted there is a probably better ways of doing this, but I hope that what I've put here helps you understand what was going wrong and will help you learn

You can also use Regular Expression which was created for such scenarios.
bool IsMatch(string input, string word)
{
var pattern = string.Format("\\b[{0}]+\\b", input);
var r = new Regex(pattern);
return r.IsMatch(word);
}
I created a sample code for you on DotNetFiddle.
You can check what the pattern does at Regex101. It has a pretty "Explanation" and "Quick Reference" panel.

There are a lot of ways to achieve that, here is a suggestion:
static void Main(string[] args)
{
Func("cacten","ace");
Func("cacten", "aced");
Console.ReadLine();
}
static void Func(string input, string word)
{
bool isMatch = true;
foreach (Char s in word)
{
if (!input.Contains(s.ToString()))
{
isMatch = false;
break;
}
}
// success
if (isMatch)
{
Console.WriteLine(word);
}
// no match
else
{
Console.WriteLine("No Match");
}
}

Not really an answer to your question but its always fun to do this sort of thing with Linq:
static void Print(string input, string word)
{
if (word.All(ch => input.Contains(ch) &&
word.GroupBy(c => c)
.All(g => g.Count() <= input.Count(c => c == g.Key))))
Console.WriteLine(word);
}
Functional programming is all about what you want without all the pesky loops, ifs and what nots... Notice that this code does what you'd do in your head without needing to painfully specify step by step how you'd actually do it:
Make sure all characters in word are in input.
Make sure all characters in word are used at most as many times as they are present in input.
Still, getting the basics right is a must, posted this answer as additional info.

Iterating over IEnumerable, special casing last element

I'm building a string based on an IEnumerable, and doing something like this:
public string BuildString()
{
var enumerable = GetEnumerableFromSomewhere(); // actually an in parameter,
// but this way you don't have to care
// about the type :)
var interestingParts = enumerable.Select(v => v.TheInterestingStuff).ToArray();
stringBuilder.Append("This is it: ");
foreach(var part in interestingParts)
{
stringBuilder.AppendPart(part);
if (part != interestingParts.Last())
{
stringBuilder.Append(", ");
}
}
}
private static void AppendPart(this StringBuilder stringBuilder, InterestingPart part)
{
stringBuilder.Append("[");
stringBuilder.Append(part.Something");
stringBuilder.Append("]");
if (someCondition(part))
{
// this is in reality done in another extension method,
// similar to the else clause
stringBuilder.Append(" = #");
stringBuilder.Append(part.SomethingElse");
}
else
{
// this is also an extension method, similar to this one
// it casts the part to an IEnumerable, and iterates over
// it in much the same way as the outer method.
stringBuilder.AppendInFilter(part);
}
}
I'm not entirely happy with this idiom, but I'm struggling to formulate something more succinct.
This is, of course, part of a larger string building operation (where there are several blocks similar to this one, as well as other stuff in between) - otherwise I'd probably drop the StringBuilder and use string.Join(", ", ...) directly.
My closest attempt at simplifying the above, though, is constructs like this for each iterator:
stringBuilder.Append(string.Join(", ", propertyNames.Select(prop => "[" + prop + "]")));
but here I'm still concatenating strings with +, which makes it feel like the StringBuilder doesn't really contribute much.
How could I simplify this code, while keeping it efficient?

You can replace this:
string.Join(", ", propertyNames.Select(prop => "[" + prop + "]"))
With c# 6 string interpolation:
string.Join(", ", propertyNames.Select(prop => $"[{prop}]"))
In both cases the difference is semantic only and it doesn't really matter. String concatenation like in your case in the select isn't a problem. The compiler still creates only 1 new string for it (and not 4, one for each segment and a 4th for the joint string).
Putting it all together:
var result = string.Join(", ", enumerable.Select(v => $"[{v.TheInterestingStuff}]"));
Because body of foreach is more complex that to fit in a String Interpolation scope you can just remove the last N characters of the string once calculated, as KooKiz suggested.
string separator = ", ";
foreach(var part in interestingParts)
{
stringBuilder.Append("[");
stringBuilder.Append(part);
stringBuilder.Append("]");
if (someCondition(part))
{
// Append more stuff
}
else
{
// Append other thingd
}
stringBuilder.Append(separator);
}
stringBuilder.Length = stringBuilder.Lenth - separator;
In any case I think that for better encapsulation the content of the loop's scope should sit in a separate function that will receive a part and the separator and will return the output string. It can also be an extension method for StringBuilder as suggested by user734028

Use Aggregate extension method with StringBuilder.
Will be more efficiently then concatenate strings if your collection are big
var builder = new StringBuilder();
list.Aggregate(builder, (sb, person) =>
{
sb.Append(",");
sb.Append("[");
sb.Append(person.Name);
sb.Append("]");
return sb;
});
builder.Remove(0, 1); // Remove first comma
As pure foreach is always more efficient then LINQ then just change logic for delimeter comma
var builder = new StringBuilder();
foreach(var part in enumerable.Select(v => v.TheInterestingStuff))
{
builder.Append(", ");
builder.Append("[");
builder.Append(part);
builder.Append("]");
}
builder.Remove(0, 2); //Remove first comma and space

Aggregate solution:
var answer = interestingParts.Select(v => "[" + v + "]").Aggregate((a, b) => a + ", " + b);
Serialization solution:
var temp = JsonConvert.SerializeObject(interestingParts.Select(x => new[] { x }));
var answer = temp.Substring(1, temp.Length - 2).Replace(",", ", ");

the code:
public string BuildString()
{
var enumerable = GetEnumerableFromSomewhere();
var interestingParts = enumerable.Select(v => v.TheInterestingStuff).ToArray();
stringBuilder.Append("This is it: ");
foreach(var part in interestingParts)
{
stringBuilder.AppendPart(part)
}
if (stringBuilder.Length>0)
stringBuilder.Length--;
}
private static void AppendPart(this StringBuilder stringBuilder, InterestingPart part)
{
if (someCondition(part))
{
stringBuilder.Append(string.Format("[{0}] = #{0}", part.Something));
}
else
{
stringBuilder.Append(string.Format("[{0}]", part.Something));
stringBuilder.AppendInFilter(part); //
}
}
much better now IMO.
Now a little discussion on making it very fast. We can use Parallel.For. But you would think (if you would think) the Appends are all happening to a single shareable resource, aka the StringBuilder, and then you would have to lock it to Append to it, not so efficient! Well, if we can say that each iteration of the for loop in the outer function creates one single string artifact, then we can have a single array of string, allocated to the count of interestingParts before the Parallel for starts, and each index of the Parallel for would store its string to its respective index.
Something like:
string[] iteration_buckets = new string[interestingParts.Length];
System.Threading.Tasks.Parallel.For(0, interestingParts.Length,
(index) =>
{
iteration_buckets[index] = AppendPart(interestingParts[index]);
});
your function AppendPart will have to be adjusted to make it a non-extension to take just a string and return a string.
After the loop ends you can do a string.Join to get a string, which is what you may be doing with the stringBuilder.ToString() too.

count occurrences in a string similar to run length encoding c#

Say I have a string like
MyString1 = "ABABABABAB";
MyString2 = "ABCDABCDABCD";
MyString3 = "ABCAABCAABCAABCA";
MyString4 = "ABABACAC";
MyString5 = "AAAAABBBBB";
and I need to get the following output
Output1 = "5(AB)";
Output2 = "3(ABCD)";
Output3 = "4(ABCA)";
Output4 = "2(AB)2(AC)";
Output5 = "5(A)5(B)";
I have been looking at RLE but I can't figure out how to do the above.
The code I have been using is
public static string Encode(string input)
{
return Regex.Replace(input, #"(.)\1*", delegate(Match m)
{
return string.Concat(m.Value.Length, "(", m.Groups[1].Value, ")");
});
}
This works for Output5 but can I do the other Outputs with Regex or should I be using something like Linq?
The purpose of the code is to display MyString in a simple manner as I can get MyString being up to a 1000 characters generally with a pattern to it.
I am not too worried about speed.

Using RLE with single characters is easy, there never is an overlap between matches. If the number of characters to repeat is variable, you'd have a problem:
AAABAB
Could be:
3(A)BAB
Or
AA(2)AB
You'll have to define what rules you want to apply. Do you want the absolute best compression? Does speed matter?
I doubt Regex can look forward and select "the best" combination of matches - So to answer your question I would say "no".

RLE is of no help here - it's just an extremely simple compression where you repeat a single code-point a given number of times. This was quite useful for e.g. game graphics and transparent images ("next, there's 50 transparent pixels"), but is not going to help you with variable-length code-points.
Instead, have a look at Huffman encoding. Expanding it to work with variable-length codewords is not exactly cheap, but it's a start - and it saves a lot of space, if you can afford having the table there.
But the first thing you have to ask yourself is, what are you optimizing for? Are you trying to get the shortest possible string on output? Are you going for speed? Do you want as few code-words as possible, or do you need to balance the repetitions and code-word counts in some way? In other words, what are you actually trying to do? :))
To illustrate this on your "expected" return values, Output4 results in a longer string than MyString4. So it's not the shortest possible representation. You're not trying for the least amounts of code-words either, because then Output5 would be 1(AAAAABBBBB). Least amount of repetitions is of course silly (it would always be 1(...)). You're not optimizing for low overhead either, because that's again broken in Output4.
And whichever of those are you trying to do, I'm thinking it's not going to be possible with regular expressions - those only work for regular languages, and encoding like this doesn't seem all that regular to me. The decoding does, of course; but I'm not so sure about the encoding.

Here is a Non-Regex way given the data that you provided. I'm not sure of any edge cases, right now, that would trip this code up. If so, I'll update accordingly.
string myString1 = "ABABABABAB";
string myString2 = "ABCDABCDABCD";
string myString3 = "ABCAABCAABCAABCA";
string myString4 = "ABABACAC";
string myString5 = "AAAAABBBBB";
CountGroupOccurrences(myString1, "AB");
CountGroupOccurrences(myString2, "ABCD");
CountGroupOccurrences(myString3, "ABCA");
CountGroupOccurrences(myString4, "AB", "AC");
CountGroupOccurrences(myString5, "A", "B");
CountGroupOccurrences() looks like the following:
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
while (str.Length > 0)
{
foreach (string pattern in patterns)
{
int count = 0;
int index = str.IndexOf(pattern);
while (index > -1)
{
count++;
str = str.Remove(index, pattern.Length);
index = str.IndexOf(pattern);
}
result += string.Format("{0}({1})", count, pattern);
}
}
Console.WriteLine(result);
}
Results:
5(AB)
3(ABCD)
4(ABCA)
2(AB)2(AC)
5(A)5(B)
UPDATE
This worked with Regex
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
foreach (string pattern in patterns)
{
result += string.Format("{0}({1})", Regex.Matches(str, pattern).Count, pattern);
}
Console.WriteLine(result);
}

How to split at every second quotation mark

I have a string that looks like this
2,"E2002084700801601390870F"
3,"E2002084700801601390870F"
1,"E2002084700801601390870F"
4,"E2002084700801601390870F"
3,"E2002084700801601390870F"
This is one whole string, you can imagine it being on one row.
And I want to split this in the way they stand right now like this
2,"E2002084700801601390870F"
I cannot change the way it is formatted. So my best bet is to split at every second quotation mark. But I haven't found any good ways to do this. I've tried this https://stackoverflow.com/a/17892392/2914876 But I only get an error about invalid arguements.
Another issue is that this project is running .NET 2.0 so most LINQ functions aren't available.
Thank you.

Try this
var regEx = new Regex(#"\d+\,"".*?""");
var lines = regex.Matches(txt).OfType<Match>().Select(m => m.Value).ToArray();
Use foreach instead of LINQ Select on .Net 2
Regex regEx = new Regex(#"\d+\,"".*?""");
foreach(Match m in regex.Matches(txt))
{
var curLine = m.Value;
}

I see three possibilities, none of them are particularly exciting.
As #dvnrrs suggests, if there's no comma where you have line-breaks, you should be in great shape. Replace ," with something novel. Replace the remaining "s with what you need. Replace the "something novel" with ," to restore them. This is probably the most solid--it solves the problem without much room for bugs.
Iterate through the string looking for the index of the next " from the previous index, and maintain a state machine to decide whether to manipulate it or not.
Split the string on "s and rejoin them in whatever way works the best for your application.

I realize regular expressions will handle this but here's a pure 2.0 way to handle as well. It's much more readable and maintainable in my humble opinion.
using System;
using System.Collections.Generic;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
const string data = #"2,""E2002084700801601390870F""3,""E2002084700801601390870F""1,""E2002084700801601390870F""4,""E2002084700801601390870F""3,""E2002084700801601390870F""";
var parsedData = ParseData(data);
foreach (var parsedDatum in parsedData)
{
Console.WriteLine(parsedDatum);
}
Console.ReadLine();
}
private static IEnumerable<string> ParseData(string data)
{
var results = new List<string>();
var split = data.Split(new [] {'"'}, StringSplitOptions.RemoveEmptyEntries);
if (split.Length % 2 != 0)
{
throw new Exception("Data Formatting Error");
}
for (var index = 0; index < split.Length / 2; index += 2)
{
results.Add(string.Format(#"""{0}""{1}""", split[index], split[index + 1]));
}
return results;
}
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.