string "search and replace" using a .NET regex - c#

I need to do a 2 rule "replace" -- my rules are, replace all open parens, "(" with a hyphen "-" and strip out all closing parens ")".
So for example this:
"foobar(baz2)" would become
"foobar-baz2"
I currently do it like this, but my hunch is that a regex would be cleaner.
myString.Replace("(", "-").Replace(")", "");

I wouldn't go to RegEx for this - what you're doing is just right. It's clear and straightforward ... regular expressions are unlikely to make this any simpler or clearer. You would still need to make two calls to Replace because your substitutions are different for each case.

You CAN use one regex to replace both those occurrences in one line, but it would be less 'forgiving' than two single rule string replacements.
Example:
The code that would be used to do what you want with regex would be:
Regex.Replace(myString, @"([^\(]*?)\(([^\)]*?)\)", "$1-$2");
This would work fine for EXACTLY the example that you provided. If there was the slightest change in where, and how many '(' and ')' characters there are, the regex would break. You could then fix that with more regex, but it would just get uglier and uglier from there.
Regex is an awesome choice, however, for applications that are more rigid.
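To make the "less forgiving" point concrete, here is a small sketch that runs the chained Replace from the question and the single regex above on a few inputs. The class and method names are made up for illustration; only the two expressions come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to compare the two approaches.
public static class ParenReplaceComparison
{
    // The chained version from the question: each rule is applied independently.
    public static string WithStringReplace(string s)
    {
        return s.Replace("(", "-").Replace(")", "");
    }

    // The single-regex version: it only fires when it finds a "(" followed
    // later by a ")", so unbalanced or nested parentheses are handled differently.
    public static string WithSingleRegex(string s)
    {
        return Regex.Replace(s, @"([^\(]*?)\(([^\)]*?)\)", "$1-$2");
    }
}
```

Both methods turn "foobar(baz2)" into "foobar-baz2". They diverge on less tidy input: for "f(g(h))" the chained version gives "f-g-h" while the regex gives "f-g(h)", and for "foo(bar" the chained version gives "foo-bar" while the regex leaves the string untouched.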

Jamie Zawinski suddenly comes to my mind:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
So I also think LBushkin is right in this case. Your solution works and is readable.

Nope. This is perfectly clean.
Point is, you'd have to have two regexes anyway, because your substitution strings are different.

I'd say use what you have - it's more-easily readable/maintainable. Regexes are super powerful but also sometimes super confusing. For something this simple, I'd say don't even use Regexes.

I'd think a regex is going to be kind of brittle for this kind of thing. If your version of .NET has extension methods and you'd like a cleaner syntax that scales you might introduce an extension method like this:
public static class StringExtensions
{
    public static string ReplaceMany(this string s, Dictionary<string, string> replacements)
    {
        var sb = new StringBuilder(s);
        foreach (var replacement in replacements)
        {
            sb = sb.Replace(replacement.Key, replacement.Value);
        }
        return sb.ToString();
    }
}
So now you build up your dictionary of replacements...
var replacements = new Dictionary<string, string> { {"(", "-"}, {")", ""} };
And call ReplaceMany:
var result = "foobar(baz2)".ReplaceMany(replacements); // result = foobar-baz2
If you really want to show your intent you can alias Dictionary<string,string> to StringReplacements:
//At the top
using StringReplacements = System.Collections.Generic.Dictionary<string,string>;
//In your function
var replacements = new StringReplacements() { {"(", "-"}, {")", ""} };
var result = "foobar(baz2)".ReplaceMany(replacements);
Might be overkill for only two replacements, but if you have many to make it'll be cleaner than .Replace().Replace().Replace().Replace()....

Regex is overkill for such a simple scenario. What you have is perfect. Although your question has already been answered, I wanted to post to demonstrate that one regex pattern is sufficient:
string input = "foobar(baz2)";
string pattern = "([()])";
string result = Regex.Replace(input, pattern, m => m.Value == "(" ? "-" : "");
Console.WriteLine(result);
The idea is to capture the parentheses in a group. I used [()] which is a character class that'll match what we're after. Notice that inside a character class they don't need to be escaped. Alternately the pattern could've been @"(\(|\))" in which case escaping is necessary.
Next, the Replace method uses a MatchEvaluator and we check whether the captured value is an opening ( or not. If it is, a - is returned. If not we know, based on our limited pattern, that it must be a closing ) and we return an empty string.

Here's a fun LINQ-based solution to the problem. It may not be the best option, but it's an interesting one anyways:
public string SearchAndReplace(string input)
{
    var openParen = '(';
    var closeParen = ')';
    var hyphen = '-';
    var newChars = input
        .Where(c => c != closeParen)
        .Select(c => c == openParen ? hyphen : c);
    return new string(newChars.ToArray());
}
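Two interesting notes about this implementation:
It requires no complicated regex, so it is easier to maintain; whether it is also faster is something you would have to measure.
Unlike chained string.Replace calls, which can produce an intermediate string per call, this method allocates exactly one string (plus the temporary char array created by ToArray).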
Not bad!

Related

Regex that returns a list

I have a string that I am looking up that can have two possible values:
stuff 1
grouped stuff 1-3
I am not very familiar with using regex, but I know it can be very powerful when used correctly, so forgive me if this question sounds ridiculous in any way. I was wondering if it would be possible to have some sort of regex code that would leave only the numbers of my string (in this case 1 and 1-3). For the 1-3 example, perhaps I could return the 1 and the 3 separately and pass them into a function to get the numbers in between.
I hope I am making sense. It is hard to put what I am looking for into words. If anyone needs any further clarification I would be more than happy to answer questions/edit my own question.
To create a list of numbers in string y, use the following:
var listOfNumbers = Regex.Matches(y, @"\d+")
    .OfType<Match>()
    .Select(m => m.Value)
    .ToList();
This is entirely possible, but best done with two separate regexes, say SingleRegex and RangedRegex: check for one or the other, and pass the values into a function when the match comes from RangedRegex.
As long as you're checking for "numbers in a specific place", extra numbers won't confuse your algorithm. There are also several regex testers out there; a simple Google search will give you an interface for checking syntax and matches.
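A minimal sketch of that two-regex idea might look like the following. The names NumberExtractor and Extract are hypothetical, and the patterns are assumptions about what SingleRegex and RangedRegex would contain; the answer does not spell them out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the SingleRegex / RangedRegex suggestion.
public static class NumberExtractor
{
    // Assumed patterns. [0-9] is used instead of \d so that int.Parse
    // only ever sees ASCII digits.
    private static readonly Regex RangedRegex = new Regex(@"([0-9]+)\s*-\s*([0-9]+)");
    private static readonly Regex SingleRegex = new Regex(@"[0-9]+");

    public static List<int> Extract(string text)
    {
        // Try the range form first, since a range also contains single numbers.
        Match ranged = RangedRegex.Match(text);
        if (ranged.Success)
        {
            int a = int.Parse(ranged.Groups[1].Value);
            int b = int.Parse(ranged.Groups[2].Value);
            int start = Math.Min(a, b);
            int end = Math.Max(a, b);
            // Expand the range into every number it covers, inclusive.
            return Enumerable.Range(start, end - start + 1).ToList();
        }

        Match single = SingleRegex.Match(text);
        return single.Success
            ? new List<int> { int.Parse(single.Value) }
            : new List<int>();
    }
}
```

With this sketch, "stuff 1" yields a list containing 1 and "grouped stuff 1-3" yields 1, 2, 3. Checking the ranged pattern first matters, because the single pattern would also match the first number of a range.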
Are you just wanting to loop through all of the numbers in the string?
Here's one way you can loop through each match in a regular expression.
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        Regex r = new Regex(@"\d+");
        string s = "grouped stuff 1-3";
        Match m = r.Match(s);
        while (m.Success)
        {
            string matchText = m.Groups[0].Value;
            Console.WriteLine(matchText);
            m = m.NextMatch();
        }
    }
}
This outputs
1
3

Regex Find Replace distinct parts of a string for a porting application

If I have a line of code:
String s = "\t\tgets(buf);";
and need to convert it to
s = "\t\tbuf";
What Regex pattern would I use in the Regex.Replace() method to get rid of
gets() and leave behind buf assuming that buf is a random string. I would also like to preserve the other formatting characters etc that may exist in that string.
Apologies for editing the question.
I used:
s = Regex.Replace(s, "gets", "");
s = s.Replace("(", "").Replace(")", "").Replace(";", "").Replace(" ", "");
s += " = Console.ReadLine();";
But this wouldn't work if the line gets(buf) was surrounded by some other parentheses or formatting like
s= "/rgets (buf)"; OR s= "(gets (buf))/n";
So ideally I would just like to get rid of the gets(), leaving behind 'buf' and the other content in the line as is, and concatenate to it later.
Thanks
You read the documentation. Then you write the appropriate regular expression.
string s = "\t\tgets(buf);" ;
string s1 = Regex.Replace( s , @"^(\t\t)gets\((.+)\)(;)" , @"$1$2 = Console.ReadLine()$3" ) ;
One might note that gets(3) takes an arbitrary expression that evaluates to a char*. That expression is not necessarily going to be an identifier or something that can be used on the left side of an assignment (an lvalue, in C terms).
You could just use a couple methods from the string class to achieve this:
s = string.Concat(s.Replace("gets(", "").Replace(");", ""), " = Console.ReadLine();");
Regex is powerful, but can sometimes be overkill.
If you really feel you need it, then please add more details into your question.
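For the variations mentioned in the question (optional whitespace before the parenthesis, extra parentheses around the call), a more tolerant pattern could be sketched as below. GetsPorter and Convert are hypothetical names, and the sketch assumes the argument to gets is a plain identifier, which, as noted above, is not guaranteed in real C code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for the porting scenario in the question.
public static class GetsPorter
{
    // \b keeps names such as fgets from matching; \s* tolerates "gets (buf)".
    // Assumption: the argument is a simple identifier (\w+).
    private static readonly Regex GetsCall =
        new Regex(@"\bgets\s*\(\s*(\w+)\s*\)");

    public static string Convert(string line)
    {
        // Only the gets(...) call is rewritten; everything around it is kept as is.
        return GetsCall.Replace(line, "$1 = Console.ReadLine()");
    }
}
```

Under those assumptions, "\t\tgets(buf);" becomes "\t\tbuf = Console.ReadLine();" and "(gets (buf))\n" becomes "(buf = Console.ReadLine())\n", with the surrounding characters preserved.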

What is the best way(lambda, extension or Regular expression) of removing the white space

At present I am using the below
private static string RemoveWhiteSpace(string text)
{
string trim = text.Replace(" ", "");
trim = trim.Replace("\r", "");
trim = trim.Replace("\n", "");
trim = trim.Replace("\t", "");
return trim;
}
Looking for a better way
Thanks
Try with:
Regex r = new Regex(@"\s+");
s = r.Replace(s,string.Empty);
that is shorter and probably still readable.
What sort of "better" are you looking for? You certainly could use a regular expression, either to replace your existing behaviour or to make it match more whitespace, but you need to be clear about what the objective is.
It's possible that a change will improve performance - but you'd have to measure it to be sure. Is this currently a performance bottleneck?
What I like about the current method is that it's incredibly obvious what's going on. If you're going to lose that simplicity, you'd better be sure that the benefit is worth it. In particular, if you use a regular expression whitespace matcher, you need to be sure that you really do want all the various kinds of whitespace removed - it won't be doing exactly the same job as your current code. Work out what you want the behaviour to be, and then find the simplest way of implementing that exact behaviour.
One slight simplification is to use method chaining:
private static string RemoveWhiteSpace(string text)
{
    return text.Replace(" ", "")
               .Replace("\r", "")
               .Replace("\n", "")
               .Replace("\t", "");
}
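To illustrate the point that a regex whitespace matcher does not do exactly the same job as the four Replace calls, here is a small comparison sketch. The class and method names are made up for this example; the two bodies are the approaches already shown in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper comparing the two whitespace-removal strategies.
public static class WhitespaceComparison
{
    // Removes only the four characters listed in the question.
    public static string RemoveListed(string text)
    {
        return text.Replace(" ", "")
                   .Replace("\r", "")
                   .Replace("\n", "")
                   .Replace("\t", "");
    }

    // Removes everything the regex engine classifies as whitespace, which
    // also covers characters such as vertical tab and no-break space.
    public static string RemoveAllWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", "");
    }
}
```

For the input "a b\tc\u00A0d\ve", RemoveListed leaves the no-break space and the vertical tab in place, while RemoveAllWhitespace returns "abcde". Which behaviour is right depends on what you actually want removed.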
private static string RemoveWhiteSpace(string text) {
    StringBuilder ret = new StringBuilder(text.Length);
    foreach (char c in text) {
        if (!char.IsWhiteSpace(c)) { ret.Append(c); }
    }
    return ret.ToString();
}
You can try using a for loop instead of foreach. It may be faster, but measure it.
I would recommend the regex approach personally if you are sure you want to eliminate all white space.
return Regex.Replace(text, @"\s+", "");
This is better:
return trim.Trim(' ', '\r', '\n', '\t');
but it only trims the ends and will not remove anything from inside the string... I am not sure if that's what you need, though.
You could replace the last three lines with the Trim method of the string class, keeping in mind that it only strips leading and trailing whitespace:
trim = trim.Trim();

Capitalizing words in a string using C#

I need to take a string, and capitalize words in it. Certain words ("in", "at", etc.), are not capitalized and are changed to lower case if encountered. The first word should always be capitalized. Last names like "McFly" are not in the current scope, so the same rule will apply to them - only first letter capitalized.
For example: "of mice and men By CNN" should be changed to "Of Mice and Men by CNN". (Therefore ToTitleCase won't work here.)
What would be the best way to do that?
I thought of splitting the string by spaces, and go over each word, changing it if necessary, and concatenating it to the previous word, and so on.
It seems pretty naive and I was wondering if there's a better way to do it. I am using .NET 3.5.
Use
Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase("of mice and men By CNN");
to convert to proper case and then you can loop through the keywords as you have mentioned.
Depending on how often you plan on doing the capitalization I'd go with the naive approach. You could possibly do it with a regular expression, but the fact that you don't want certain words capitalized makes that a little trickier.
You can do it with two passes using regular expressions:
var result = Regex.Replace("of mice and men isn't By CNN", @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"(\s(of|in|by|and)|\'[st])\b", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
This outputs Of Mice and Men Isn't by CNN.
The first expression capitalizes every letter on a word boundary and the second one downcases any words matching the list that are surrounded by white space.
The downsides to this approach are that you're using regexes (now you have two problems) and that you'll need to keep the list of excluded words up to date. My regex-fu isn't good enough to do it in one expression, but it might be possible.
An answer from another question, How to Capitalize names -
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Console.WriteLine(textInfo.ToTitleCase(title));
Console.WriteLine(textInfo.ToLower(title));
Console.WriteLine(textInfo.ToUpper(title));
Use ToTitleCase() first and then keep a list of applicable words and Replace back to the all-lower-case version of those applicable words (provided that list is small).
The list of applicable words could be kept in a dictionary and looped through pretty efficiently, replacing with the .ToLower() equivalent.
Try something like this:
public static string TitleCase(string input, params string[] dontCapitalize) {
    var split = input.Split(' ');
    for (int i = 0; i < split.Length; i++)
        split[i] = i == 0
            ? CapitalizeWord(split[i])
            : dontCapitalize.Contains(split[i])
                ? split[i]
                : CapitalizeWord(split[i]);
    return string.Join(" ", split);
}

public static string CapitalizeWord(string word)
{
    return char.ToUpper(word[0]) + word.Substring(1);
}
You can then later update the CapitalizeWord method if you need to handle complex surnames.
Add those methods to a class and use it like this:
SomeClass.TitleCase("a test is a sentence", "is", "a"); // returns "A Test is a Sentence"
A slight improvement on jonnii's answer:
var result = Regex.Replace(s.Trim(), @"\b(\w)", m => m.Value.ToUpper());
result = Regex.Replace(result, @"\s(of|in|by|and)\s", m => m.Value.ToLower(), RegexOptions.IgnoreCase);
result = result.Replace("'S", "'s");
You can have a Dictionary having the words you would like to ignore, split the sentence in phrases (.split(' ')) and for each phrase, check if the phrase exists in the dictionary, if it does not, capitalize the first character and then, add the string to a string buffer. If the phrase you are currently processing is in the dictionary, simply add it to the string buffer.
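The approach just described could be sketched roughly as follows. This is only an illustration: TitleCaser and ToTitle are hypothetical names, a case-insensitive HashSet stands in for the dictionary since only the keys are needed, and excluded words are lower-cased (rather than copied unchanged) so that "By" becomes "by" as the question requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper sketching the "ignore list plus string buffer" idea.
public static class TitleCaser
{
    public static string ToTitle(string sentence, HashSet<string> lowerCaseWords)
    {
        var buffer = new StringBuilder();
        // Dropping empty entries avoids indexing into "" when there are double spaces.
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            string word = words[i];
            if (i > 0) buffer.Append(' ');

            // The first word is always capitalized, even if it is in the ignore list.
            if (i > 0 && lowerCaseWords.Contains(word))
            {
                buffer.Append(word.ToLowerInvariant());
            }
            else
            {
                // Only the first letter is changed; the rest (e.g. "CNN") is kept.
                buffer.Append(char.ToUpperInvariant(word[0]));
                buffer.Append(word.Substring(1));
            }
        }
        return buffer.ToString();
    }
}
```

Called with the set { "of", "and", "by", "in", "at" } built with StringComparer.OrdinalIgnoreCase, the input "of mice and men By CNN" comes out as "Of Mice and Men by CNN".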
A non-clever approach that handles the simple case:
var s = "of mice and men By CNN";
var sa = s.Split(' ');
for (var i = 0; i < sa.Length; i++)
sa[i] = sa[i].Substring(0, 1).ToUpper() + sa[i].Substring(1);
var sout = string.Join(" ", sa);
Console.WriteLine(sout);
The easiest obvious solution (for English sentences) would be to:
Split the sentence on space characters - sentence.Split(' ')
Loop through each item
Capitalize the first letter of each item - char.ToUpper(item[0]) + item.Substring(1)
Merge the items back into a string joined on a space.
Repeat this process with "." and "," using that new string.
You should create your own function like you're describing.

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces within double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for regexes. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    string group = m.Groups[0].Value;
}
The real benefit of this method is that it can easily be extended to include your "-" requirement, like so:
string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
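Since the group holds the token while the trailing \s* sits outside it, reading Groups[1] instead of Groups[0] gives the token without the swallowed spaces. The sketch below builds the token list that way and also strips the surrounding quotes; QueryTokenizer and Tokenize are hypothetical names, and removing the quotes is an assumption based on the output the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern explained above.
public static class QueryTokenizer
{
    private static readonly Regex TokenPattern =
        new Regex(@"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*");

    public static List<string> Tokenize(string query)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            // Groups[1] is the captured token; Groups[0] would also include
            // the trailing whitespace matched by \s*.
            string token = m.Groups[1].Value;
            // Drop the double quotes but keep a leading "-" for exclusions.
            tokens.Add(token.Replace("\"", ""));
        }
        return tokens;
    }
}
```

For the query from the question this produces the, quick, brown fox, jumps, over, the, lazy dog. With the extended input, the last two tokens come out as -lazy cat and -energetic, so the caller can test for a leading "-" to detect exclusions.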
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go through the string char by char, like this (sort of pseudocode):
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
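Translated fairly literally into C#, the pseudocode above might look like this. CharTokenizer and Tokenize are made-up names, and the sketch keeps the pseudocode's behaviour as is, including the fact that an empty pair of quotes produces an empty token.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical C# rendering of the character-by-character pseudocode.
public static class CharTokenizer
{
    public static List<string> Tokenize(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the quoted phrase is complete.
                    words.Add(word.ToString());
                    word.Length = 0;
                    inQuotes = false;
                }
                else
                {
                    // Inside quotes, spaces are part of the token.
                    word.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space outside quotes ends the current word, if any.
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Length = 0;
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Flush whatever is left after the last character.
        if (word.Length > 0)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

Because it is a plain state machine, adding the "-" exclusion rule would mean adding one more branch rather than growing a pattern, at the cost of more code than the regex version.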
I was looking for a Java solution to this problem and came up with a solution using @Michael La Voie's. Thought I would share it here despite the question being asked for in C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
    List<String> words = new ArrayList<>();
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        MatchResult result = matcher.toMatchResult();
        if (result != null && result.group() != null) {
            if (result.group().contains("\"")) {
                words.add(result.group().trim().replaceAll("\"", "").trim());
            } else {
                words.add(result.group().trim());
            }
        }
    }
    return words;
}
