Regex performance issue on a really big string - c#

I am new to regexes, so I would really appreciate your help.
I have a really large string (I am parsing an AS3 file to JSON) and I need to locate the trailing commas in the objects.
This is the regex I am using:
public static string TrimTraillingCommas(string jsonCode)
{
var regex = new Regex(@"(.*?),\s*(\}|\])", (RegexOptions.Multiline));
return regex.Replace(jsonCode, m => String.Format("{0} {1}", m.Groups[1].Value, m.Groups[2].Value));
}
The problem is that it's really slow. Without it, the program completes in 00:00:00.0289668; with it, it takes 00:00:00.4096293.
Could someone suggest an improved regex or algorithm for replacing those trailing commas faster?
Here is where I start from (the string with the trailing commas)
Here is the end string I need

You can simplify your regular expression by eliminating your capture groups, replacing the purpose of the latter one by a lookahead:
var regex = new Regex(@",\s*(?=\}|\])");
return regex.Replace(jsonCode, " ");

You don't need the first expression .*? and you can convert the alternation
into a character class. That's about the best you could do.
var regex = new Regex(@",[^\S\r\n]*([}\]])");
return regex.Replace(jsonCode, " $1");
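To make the two suggestions above concrete, here is a minimal sketch that combines the lookahead with the character class and keeps the Regex in a static field so it is built only once. The class name, the sample input, and the use of RegexOptions.Compiled are assumptions for illustration and are not from the answers. Like the original pattern, it knows nothing about JSON strings, so a comma followed by } or ] inside a quoted value would also be rewritten.

```csharp
using System;
using System.Text.RegularExpressions;

static class TrailingCommaDemo
{
    // Built once and reused. RegexOptions.Compiled is an assumption here,
    // not something the answers above asked for.
    private static readonly Regex TrailingComma =
        new Regex(@",\s*(?=[}\]])", RegexOptions.Compiled);

    public static string TrimTrailingCommas(string jsonCode)
    {
        // The lookahead leaves the closing bracket in place, so only the
        // comma and the whitespace after it are replaced by one space.
        return TrailingComma.Replace(jsonCode, " ");
    }

    static void Main()
    {
        string input = "{ \"a\": [1, 2, ], \"b\": 3, }";
        Console.WriteLine(TrimTrailingCommas(input));
    }
}
```

Whether this is actually faster on your input is something to measure with a Stopwatch rather than assume.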

Related

Regex to disallow all symbols that windows doesnt allow in its file/folder names

The symbols not allowed in a file name or folder name in Windows are \ / : * ? " < > |. I wrote a regex for them, but I am not able to write the regex so that it also excludes " (double quotes).
Regex regex = new Regex(@"^[\\\/\:\*\?\'\<\>\|]+$");
I tried the following, but it didn't work:
Regex regex = new Regex(@"^[\\\/\:\*\?\'\<\>\|"]+$");
EDIT:
I am particularly looking for code with regex itself and hence its not a duplicate.
Help is greatly appreciated. Thanks.
Instead of using Regex you could do something like this:
var invalidChars = new HashSet<char>(Path.GetInvalidFileNameChars());
var invalid = input.Any(chr => invalidChars.Contains(chr));
@Allrameest's answer using Path.GetInvalidFileNameChars() is probably the one you should use.
What I wanted to address is the fact that your regex is actually wrong or very ineffective (I don't know exactly what you wanted to do).
So, using:
var regex = new Regex(@"^[\\\/\:\*\?\'\<\>\|]+$");
means you match a string which consists ONLY of "forbidden" characters (BTW, I think the single quote ' is not invalid). Is that what you want? I don't think so. What you did is:
^ start of the string
[...]+ invalid characters only (at least once)
$ end of the string
Maybe you wanted @"^[^...]+$" (hat used twice)?
Anyway, the solution to your problem (with regex) is:
don't use ^ or $; just try to find any of those characters and bomb out quickly
in a verbatim string (the one starting with @") you escape a double quote by doubling it.
So, the right regex is:
var regex = new Regex(@"[\\\/\:\*\?\""\<\>\|]");
if (regex.Match(filename).Success) {
throw new ArgumentException("Bad filename");
}
Just find any and bomb out.
UPDATE by @JohnLBevan
var regex = new Regex(
"[" + Regex.Escape(new string(Path.GetInvalidFileNameChars())) + "]");
if (regex.Match(filename).Success) {
throw new ArgumentException("Bad filename");
}
(Not using string.Format(...) as this Regex should be static and precompiled anyway)
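As a rough sketch of the "static and precompiled" remark, the check could live in a small helper like the one below. The class and method names are hypothetical, and the character list is hard-coded to the nine symbols from the question, so unlike Path.GetInvalidFileNameChars() it does not cover control characters or platform differences.

```csharp
using System;
using System.Text.RegularExpressions;

static class FileNameCheck
{
    // In a verbatim string (@"..."), a double quote is written as "".
    // Inside the character class only the backslash needs escaping.
    private static readonly Regex InvalidChar =
        new Regex(@"[\\/:*?""<>|]", RegexOptions.Compiled);

    public static bool HasInvalidChar(string fileName)
    {
        // IsMatch stops at the first offending character.
        return InvalidChar.IsMatch(fileName);
    }

    static void Main()
    {
        Console.WriteLine(HasInvalidChar("report.txt"));     // False
        Console.WriteLine(HasInvalidChar("a:b.txt"));        // True
        Console.WriteLine(HasInvalidChar("say \"hi\".txt")); // True
    }
}
```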

Using regex to remove everything that is not in between '<#'something'#>' and replace it with commas

I have a string, for example
<#String1#> + <#String2#> , <#String3#> --<#String4#>
And I want to use regex/string manipulation to get the following result:
<#String1#>,<#String2#>,<#String3#>,<#String4#>
I don't really have any experience doing this, any tips?
There are multiple ways to do something like this, and it depends on exactly what you need. However, if you want to use a single regex operation to do it, and you only want to fix stuff that comes between the bracketed strings, then you could do this:
string input = "<#String1#> + <#String2#> , <#String3#> --<#String4#>";
string pattern = "(?<=>)[^<>]+(?=<)";
string replacement = ",";
string result = Regex.Replace(input, pattern, replacement);
The pattern uses [^<>]+ to match any non-pointy-bracket characters, but it combines it with a look-behind statement ((?<=>)) and a look-ahead statement (?=<) to make sure that it only matches text that occurs between a closing and another opening set of brackets.
If you need to remove text that comes before the first < or after the last >, or if you find the look-around statements confusing, you may want to consider simply matching the text that comes between the brackets and then looping through all the matches to build a new string yourself, rather than using the Regex.Replace method. For instance:
string input = "sdfg<#String1#> + <#String2#> , <#String3#> --<#String4#>ag";
string pattern = @"<[^<>]+>";
List<String> values = new List<string>();
foreach (Match m in Regex.Matches(input, pattern))
values.Add(m.Value);
string result = String.Join(",", values);
Or, the same thing using LINQ:
string input = "sdfg<#String1#> + <#String2#> , <#String3#> --<#String4#>ag";
string pattern = @"<[^<>]+>";
string result = String.Join(",", Regex.Matches(input, pattern).Cast<Match>().Select(x => x.Value));
If you're just after string manipulation and don't necessarily need a regex, you could simply use the string.Replace method.
yourString = yourString.Replace("#> + <#", "#>,<#");

How can I split a regex into exact words?

I need a little help regarding Regular Expressions in C#
I have the following string
"[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r"
The string could also contain spaces and theoretically any other characters, so I really need to match the [[words]].
What I need is a text array like this
"[[Sender.Name]]",
"[[Sender.AdditionalInfo]]",
"[[Sender.Street]]",
// ... And so on.
I'm pretty sure that this is perfectly doable with:
var stringArray = Regex.Split(line, @"\[\[+\]\]")
I'm just too stupid to find the correct Regex for the Regex.Split() call.
Anyone here that can tell me the correct Regular Expression to use in my case?
As you can tell I'm not that experienced with RegEx :)
Why don't you split on "\r"?
You don't need a regex for that; just use the standard string function:
string[] delimiters = { "\r" }; // regular (non-verbatim) literal, so this is the carriage-return character
string[] split = line.Split(delimiters, StringSplitOptions.None);
Do matching if you want to get the [[..]] block.
Regex rgx = new Regex(@"\[\[.*?\]\]");
foreach (Match m in rgx.Matches(input))
Console.WriteLine(m.Groups[0].Value);
The regex you are using (\[\[+\]\]) matches two or more literal [s followed by exactly two literal ]s, with nothing in between.
A regex solution is to match all the non-] characters inside the doubled [ and ] (the string inside the brackets should not be empty, I guess?) and cast the MatchCollection to a list or array (here is an example with a list):
var str = "[[Sender.Name]]\r[[Sender.AdditionalInfo]]\r[[Sender.Street]]\r[[Sender.ZipCode]] [[Sender.Location]]\r[[Sender.Country]]\r";
var rgx22 = new Regex(@"\[\[[^]]+?\]\]");
var res345 = rgx22.Matches(str).Cast<Match>().ToList();
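Since the question asks for a text array rather than a list of Match objects, one more projection step gets there. This is a minimal sketch; the class and method names are made up for illustration, and the pattern writes the closing bracket inside the class as \] purely for readability.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PlaceholderDemo
{
    public static string[] ExtractPlaceholders(string line)
    {
        // Match [[...]] blocks with at least one non-] character inside,
        // then project each Match to its text.
        return Regex.Matches(line, @"\[\[[^\]]+\]\]")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        string line = "[[Sender.Name]]\r[[Sender.ZipCode]] [[Sender.Location]]\r";
        foreach (string placeholder in ExtractPlaceholders(line))
            Console.WriteLine(placeholder);
    }
}
```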

How to find a string with missing fragments?

I'm building a chatbot in C# using AIML files. At the moment I have this code to process:
<aiml>
<category>
<pattern>a * is a *</pattern>
<template>when a <star index="1"/> is not a <star index="2"/>?</template>
</category>
</aiml>
I would like to do something like:
if (user_string == pattern_string) return template_string;
but I don't know how to tell the computer that the star character can be anything, and especially that it can be more than one word!
I was thinking to do it with regular expressions, but I don't have enough experience with it. Can somebody help me? :)
Using Regex
static bool TryParse(string pattern, string text, out string[] wildcardValues)
{
// ^ and $ means that whole string must be matched
// Regex.Escape (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape(v=vs.110).aspx)
// (.+) means capture at least one character and place it in match.Groups
var regexPattern = string.Format("^{0}$", Regex.Escape(pattern).Replace(@"\*", "(.+)"));
var match = Regex.Match(text, regexPattern, RegexOptions.Singleline);
if (!match.Success)
{
wildcardValues = null;
return false;
}
//skip the first one since it is the whole text
wildcardValues = match.Groups.Cast<Group>().Skip(1).Select(i => i.Value).ToArray();
return true;
}
Sample usage
string[] wildcardValues;
if(TryParse("Hello *. * * to *", "Hello World. Happy holidays to all", out wildcardValues))
{
//it's a match
//wildcardValues contains the values of the wildcard which is
//['World','Happy','holidays','all'] in this sample
}
By the way, you don't really need Regex for this; it's overkill. Just implement your own algorithm by splitting the pattern into tokens using string.Split and then finding each token using string.IndexOf. Using Regex does result in shorter code, though.
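The Split/IndexOf idea could look roughly like the sketch below. The class and method names are hypothetical, and it is an approximation of the regex version rather than an equivalent: it takes the leftmost occurrence of each literal token and never backtracks, so in ambiguous cases it may capture different values than (.+) would, or reject a text the regex accepts.

```csharp
using System;
using System.Collections.Generic;

static class WildcardMatcher
{
    public static bool TryMatch(string pattern, string text, out string[] wildcardValues)
    {
        wildcardValues = null;
        string[] tokens = pattern.Split('*');   // literal pieces between the stars

        if (tokens.Length == 1)                 // no wildcard: plain comparison
        {
            if (text != pattern) return false;
            wildcardValues = new string[0];
            return true;
        }

        if (!text.StartsWith(tokens[0], StringComparison.Ordinal)) return false;
        int pos = tokens[0].Length;             // where the current wildcard starts
        var values = new List<string>();

        for (int i = 1; i < tokens.Length; i++)
        {
            int next;
            if (i == tokens.Length - 1)
            {
                // The last literal must sit at the very end of the text.
                if (!text.EndsWith(tokens[i], StringComparison.Ordinal)) return false;
                next = text.Length - tokens[i].Length;
            }
            else
            {
                // Search from pos + 1 so the wildcard captures at least one character.
                if (pos + 1 > text.Length) return false;
                next = text.IndexOf(tokens[i], pos + 1, StringComparison.Ordinal);
            }

            if (next < pos + 1) return false;   // token not found, or empty capture
            values.Add(text.Substring(pos, next - pos));
            pos = next + tokens[i].Length;
        }

        wildcardValues = values.ToArray();
        return true;
    }

    static void Main()
    {
        string[] values;
        if (TryMatch("a * is a *", "a cat is a mammal", out values))
            Console.WriteLine(string.Join("|", values));
    }
}
```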
Do you think this should work for you?
Match match = Regex.Match(pattern_string, @"<pattern>a [^<]+ is a [^<]+</pattern>");
if (match.Success)
{
// do something...
}
Here [^<]+ stands for one or more characters that are not <.
If you think a * may contain a < character, you can simply use .+ instead of [^<]+.
But this is risky, as .+ matches one or more of any character.

Simple Regex Befuddlement

I have some strings of the form
string strA = @"Cmd:param1:'C:\SomePath\SomeFileName.ext':param2";
string strB = @"Cmd:'C:\SomePath\SomeFileName.ext':param2:param3";
I want to split this string on ':' so I can extract the N parameters. Some parameters can contain file paths, as shown, and I don't want to split on the ':'s that are within the single quotes. I can do this with a regex, but I am confused as to how to get the regex to split only when the colon is not inside a quoted section.
I have attempted
string regex = @"(?<!'):(?!')";
string regex = @"(?<!'(?=')):";
that is continue matching only if no "'" on the left and no "'" on the right (negative look behind/ahead), but this is still splitting on the colon contained in 'C:\SomePath\SomeFileName.ext'.
How can I amend this regex to do as I require?
Thanks for your time.
Note: I have found that the following regex works. However, I would like to know if there is a better way of doing this?
string regex = @"(?<!.*'.*):|:(?!.*'.*)";
Consider this approach:
var guid = Guid.NewGuid().ToString();
var r = Regex.Replace(strA, @"'.*'", m =>
{
return m.Value.Replace(":", guid);
})
.Split(':')
.Select(s => s.Replace(guid, ":"))
.ToList();
Rather than try to construct a lookbehind regex to split on, you could construct a regex to match the fields themselves and take the set of matches of that regex. E.g. a field is either a quoted sequence of non-quotes (i.e. it can include :), or a sequence that can't include the separator:
string regex = "'[^']*'|[^':]*";
var result = Regex.Matches(strA, regex);
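A small sketch of how the matches might be turned into a string array follows. It deviates from the pattern above in one labeled way: the second alternative uses + instead of *, which avoids zero-length matches at each separator but also means empty fields (two colons in a row) are dropped. The class and method names are made up, and the surrounding single quotes are kept in the returned field.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class FieldSplitter
{
    public static string[] SplitFields(string input)
    {
        // Either a quoted run of non-quote characters (colons allowed),
        // or a non-empty run of characters that are neither ' nor :.
        return Regex.Matches(input, @"'[^']*'|[^':]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        string strA = @"Cmd:param1:'C:\SomePath\SomeFileName.ext':param2";
        foreach (string field in SplitFields(strA))
            Console.WriteLine(field);
    }
}
```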
You want to split on (?<!\b[a-z]):(?!\\) (use RegexOptions.IgnoreCase).
Not as pretty but you could replace :\ with safe characters and then return them back to :\ after the split.
string[] param = strA.Replace(@":\", "|||").Split(':').Select(x => x.Replace("|||", @":\")).ToArray();
