I'm trying to extract information from rc-files. In these files, " characters inside strings are escaped by doubling them (""), analogous to C# verbatim strings. Is there a way to extract the string?
For example, if I have the following string "this is a ""test""" I would like to obtain this is a ""test"". It also must be non-greedy (very important).
I've tried to use the following regular expression:
"(?<text>[^""]*(""(.|""|[^"])*)*)"
However, the performance was awful.
I based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php
Does anybody have an idea how to handle this with a regular expression?
You've got some nested repetition quantifiers there. That can cause catastrophic backtracking, which ruins performance.
Try something like this:
(?<=")(?:[^"]|"")*(?=")
That can now only consume either two quotes at once or a non-quote character. The lookbehind and lookahead assert that the actual match is preceded and followed by a quote.
This also gets you around having to capture anything. The full match is simply the string you want (without the outer quotes).
I do not assert that the outer quotes are not doubled, because if they were, there would be no way to distinguish them from an empty string anyway.
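As a minimal sketch of how that pattern could be called from C#: the class and method names below are made up for illustration, and the sketch assumes the input holds exactly one quoted literal (with several literals on one line, the text between two literals also sits between two quotes and could match).

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Hypothetical helper: returns the text between the outer quotes,
    // leaving doubled quotes ("") untouched. Returns null if nothing matches.
    public static string ExtractInner(string literal)
    {
        // Verbatim string: every "" below is one literal quote in the regex,
        // so the engine sees (?<=")(?:[^"]|"")*(?=")
        Match m = Regex.Match(literal, @"(?<="")(?:[^""]|"""")*(?="")");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // The input is: "this is a ""test"""
        string input = "\"this is a \"\"test\"\"\"";
        Console.WriteLine(ExtractInner(input)); // this is a ""test""
    }
}
```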
This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:
"Some ""escaped"" quotes"
"Some " + "escaped" + " quotes"
So this is all you need to match it:
(?:"[^"]*")+
You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").
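The three steps (match, strip the outer quotes, unescape) could look roughly like this in C#. This is only a sketch: the class name, the method name and the sample rc-style line are invented for the example, and it assumes "" is the only escape in use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RcStrings
{
    // One or more simple "..." literals run together: (?:"[^"]*")+
    static readonly Regex Literal = new Regex(@"(?:""[^""]*"")+");

    // Hypothetical helper: returns every string literal on the line, unescaped.
    public static List<string> Extract(string line)
    {
        var result = new List<string>();
        foreach (Match m in Literal.Matches(line))
        {
            // Step 1: strip the leading and trailing quote.
            string inner = m.Value.Substring(1, m.Value.Length - 2);
            // Step 2: turn each doubled quote back into a single quote.
            result.Add(inner.Replace("\"\"", "\""));
        }
        return result;
    }

    static void Main()
    {
        // Made-up input line: IDS_GREETING "Some ""escaped"" quotes"
        string line = "IDS_GREETING \"Some \"\"escaped\"\" quotes\"";
        foreach (string s in Extract(line))
            Console.WriteLine(s); // Some "escaped" quotes
    }
}
```

Because the character class excludes quotes, the engine never has to backtrack over alternatives, so there is nothing left to make non-greedy.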
I don't know if this is better or worse than m.buettner's (guessing not, since he seems to know his stuff), but I thought I'd throw it out there for critique.
"(([^"]+(""[^"]+"")*)*)"
Try this: (?<=^")(.*?"{2}.*?"{2})(?="$)
It may be faster than the two previous answers, and without any bugs.
Match a " beginning the string
Multiple times match a non-" or two "
Match a " ending the string
"([^"]|(""))*?"
What can I use to replace characters in a string that is in the following format?
I' have a "car" that runs very well
Basically, I have a search function,
and if users typed just a ' it wasn't finding anything, so I did
mySearchWord.Replace("'", "''")
and then it found it. But what if there is both a ' and a " in the same sentence or word? How can I check for both in mySearchWord?
For both cases I would do something like
mySearchWord.Replace("'", "''")
mySearchWord.Replace("\"", "\"") //have no idea about this one
or something like that. Is there a way to do it all at once?
I think someone below is pointing me in the right direction. I just need to be able to pass apostrophes or quotation marks to my search, but it was throwing an error, maybe because, just as in SQL, a quotation mark or apostrophe has to be escaped when it is passed.
This actually replaces both at once:
string text = "I' have a \"car\" that runs very well";
string pattern = "['\"]";
var result = Regex.Replace(text, pattern, m => (m.Value == "'") ? "''" : "\"\"");
I should explain.
This uses regular expressions. The pattern variable holds the regular expression that is matched against the string text. In this case the pattern matches every ' and " character in the text, just as the pattern [abc] would match every a, b and c character.
Regular expressions seem complex at first, but they are very powerful.
You'll find the Regex class in the System.Text.RegularExpressions namespace.
Here is the documentation for it: http://msdn.microsoft.com/en-us/library/c75he57e(v=VS.100).aspx
The code m => (m.Value == "'") ? "''" : "\"\"" is a lambda expression, which is shorthand for a MatchEvaluator delegate.
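To make the lambda/delegate relationship concrete, here is a small sketch that writes the same replacement twice, once with a named method wrapped in a MatchEvaluator and once with the lambda. The class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuoteDoubler
{
    // A named method with the MatchEvaluator signature: Match in, string out.
    static string DoubleQuote(Match m)
    {
        return m.Value == "'" ? "''" : "\"\"";
    }

    // Explicit delegate form.
    public static string WithDelegate(string text)
    {
        return Regex.Replace(text, "['\"]", new MatchEvaluator(DoubleQuote));
    }

    // Lambda form; the compiler converts it to a MatchEvaluator.
    public static string WithLambda(string text)
    {
        return Regex.Replace(text, "['\"]", m => m.Value == "'" ? "''" : "\"\"");
    }

    static void Main()
    {
        string text = "I' have a \"car\" that runs very well";
        Console.WriteLine(WithDelegate(text)); // I'' have a ""car"" that runs very well
        Console.WriteLine(WithLambda(text));   // same output
    }
}
```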
mySearchWord.Replace("''", "[{just_to_replace}]").Replace("'", "''").Replace("[{just_to_replace}]", "''");
Cool, ain't it?
How can I remove a comma (,) that sits between a pair of double quotes (")? For example, given "a","b","c","d,d","e","f", the comma inside "d,d" should be removed, giving "a","b","c","dd","e","f". How can this be done with a regex in C#?
EDIT: I forgot to specify that there may be more than one comma between the quotes, as in "a","b","c","d,d,d","e","f"; for that input the regex does not work. There can be any number of commas between quotes.
The string can also look like a,b,c,"d,d",e,f, in which case the result should be a,b,c,dd,e,f; for a,b,c,"d,d,d",e,f the result should be a,b,c,ddd,e,f.
Assuming the input is as simple as your examples (i.e., not full-fledged CSV data), this should do it:
string input = @"a,b,c,""d,d,d"",e,f,""g,g"",h";
Console.WriteLine(input);
string result = Regex.Replace(input,
    @",(?=[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)",
    String.Empty);
Console.WriteLine(result);
output: a,b,c,"d,d,d",e,f,"g,g",h
a,b,c,"ddd",e,f,"gg",h
The regex matches any comma that is followed by an odd number of quotation marks.
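When the quotes are balanced, "followed by an odd number of quotation marks" is the same as "currently inside a quoted field". The sketch below spells that out with a plain loop and a flag. It is a different technique from the lookahead above, shown only to illustrate the idea; the names are invented, and it assumes quotes never appear escaped inside a field.

```csharp
using System;
using System.Text;

static class CommaStripper
{
    // Hypothetical helper: drops every comma that appears inside a quoted field.
    public static string RemoveQuotedCommas(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool inQuotes = false;
        foreach (char c in input)
        {
            if (c == '"')
                inQuotes = !inQuotes; // every quote flips the state
            if (c == ',' && inQuotes)
                continue;             // skip commas inside a quoted field
            sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveQuotedCommas("a,b,c,\"d,d,d\",e,f,\"g,g\",h"));
        // a,b,c,"ddd",e,f,"gg",h
    }
}
```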
EDIT: If fields are quoted with apostrophes (') instead of quotation marks ("), the technique is exactly the same--except you don't have to escape the quotes:
string input = @"a,b,c,'d,d,d',e,f,'g,g',h";
Console.WriteLine(input);
string result = Regex.Replace(input,
    @",(?=[^']*'(?:[^']*'[^']*')*[^']*$)",
    String.Empty);
Console.WriteLine(result);
If some fields were quoted with apostrophes while others were quoted with quotation marks, a different approach would be needed.
EDIT: Probably should have mentioned this in the previous edit, but you can combine those two regexes into one regex that will handle either apostrophes or quotation marks (but not both):
@",(?=[^']*'(?:[^']*'[^']*')*[^']*$|[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)"
Actually, it will handle simple strings like 'a,a',"b,b". The problem is that there would be nothing to stop you from using one of the quote characters in a quoted field of the other type, like '9" Nails' (sic) or "Kelly's Heroes". That's taking us into full-fledged CSV territory (if not beyond), and we've already established that we're not going there. :D
They're called regular expressions for a reason — they are used to process strings that meet a very specific and academic definition for what is "regular". It looks like you have some fairly typical csv data here, and it happens that csv strings are outside of that specific definition: csv data is not formally "regular".
In spite of this, it can be possible to use regular expressions to handle csv data. However, to do so you must either use certain extensions that take the engine beyond formally regular expressions, know certain constraints about your specific csv data that are not guaranteed in the general case, or both. Either way, the expressions required to do this are unwieldy and difficult to manage. It's often just not a good idea, even when it's possible.
A much better (and usually faster) solution is to use a dedicated CSV parser. There are two good ones hosted at Code Project (FastCSV and Linq-to-CSV), there is one (actually several) built into the .NET Framework (Microsoft.VisualBasic.FileIO.TextFieldParser), and I have one here on Stack Overflow. Any of these will perform better and just plain work better than a solution based on regular expressions.
Note here that I'm not arguing it can't be done. Most regular expression engines today have the necessary extensions to make this possible, and most people parsing csv data know enough about the data they're handling to constrain it appropriately. I am arguing that it's slower to execute, harder to implement, harder to maintain, and more error-prone compared to a dedicated parser alternative, which is likely built into whichever platform you're using, and is therefore not in your best interests.
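For the built-in option, a rough sketch with TextFieldParser might look like the following. The class and method names are made up, the sketch handles a single line only, and on the .NET Framework it assumes the project references the Microsoft.VisualBasic assembly.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class CsvCommaRemover
{
    // Hypothetical helper: parses one CSV line, removes the commas inside
    // each field, and joins the fields again without quotes.
    public static string RemoveCommasInsideFields(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            string[] fields = parser.ReadFields();
            for (int i = 0; i < fields.Length; i++)
                fields[i] = fields[i].Replace(",", "");
            return string.Join(",", fields);
        }
    }

    static void Main()
    {
        Console.WriteLine(RemoveCommasInsideFields("a,b,c,\"d,d,d\",e,f"));
        // a,b,c,ddd,e,f
    }
}
```

The parser strips the enclosing quotes while reading, so the output here has the unquoted shape a,b,c,ddd,e,f; if the quotes must be kept, they would have to be added back when joining.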
var input = "\"a\",\"b\",\"c\",\"d,d\",\"e\",\"f\"";
var regex = new Regex("(\"\\w+),(\\w+\")");
var output = regex.Replace(input,"$1$2");
Console.WriteLine(output);
You'd need to evaluate whether or not \w is what you want to use.
You can use this:
var result = Regex.Replace(yourString, "([a-z]),", "$1");
Sorry, after seeing your edits, regular expressions are not appropriate for this.
This should be very simple using Regex.Replace and a callback:
string pattern = @"
    ""       # open quotes
    [^""]*   # some not quotes
    ""       # closing quotes
    ";
data = Regex.Replace(data, pattern, m => m.Value.Replace(",", ""),
    RegexOptions.IgnorePatternWhitespace);
You can even make a slight modification to allow escaped quotes (here I use \", and the comments explain how to use "" instead):
string pattern = @"
    \\.                    # escaped character (the alternative is """")
    |
    (?<Quotes>
        ""                 # open quotes
        (?:\\.|[^""])*     # some not quotes or escaped characters
                           # the alternative is (?:""""|[^""])*
        ""                 # closing quotes
    )
    ";
data = Regex.Replace(data, pattern,
    m => m.Groups["Quotes"].Success ? m.Value.Replace(",", "") : m.Value,
    RegexOptions.IgnorePatternWhitespace);
If you need single quotes instead, replace every "" in the pattern with a single '.
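Put together as a complete program, the callback approach could look like the sketch below. It uses the "" escaping alternative mentioned in the pattern comments rather than \", which makes the separate branch for escaped characters outside quotes unnecessary. The class and method names and the sample input are invented.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedCommaCallback
{
    // With IgnorePatternWhitespace, spaces and # comments in the pattern are
    // ignored. In this verbatim string, "" stands for one quote in the regex.
    const string Pattern = @"
        ""                # opening quote
        (?:""""|[^""])*   # doubled quotes or any non-quote characters
        ""                # closing quote
        ";

    // Hypothetical helper: removes commas inside each quoted field.
    public static string Strip(string data)
    {
        return Regex.Replace(data, Pattern,
            m => m.Value.Replace(",", ""),
            RegexOptions.IgnorePatternWhitespace);
    }

    static void Main()
    {
        // Input: a,"d,d,d",e,"say ""hi, you""",f
        string data = "a,\"d,d,d\",e,\"say \"\"hi, you\"\"\",f";
        Console.WriteLine(Strip(data)); // a,"ddd",e,"say ""hi you""",f
    }
}
```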
Something like the following, perhaps?
"(,)"
I'm in the process of updating a program that fixes subtitles.
Until now I got away without using regular expressions, but the latest problem that has come up might benefit from their use. (I've already solved it without regular expressions, but with a very unoptimized method that slows my program significantly.)
TL;DR:
I'm trying to make the following work:
I want all instances of:
"! ." , "!." and "! . " to become: "!"
unless the dot is followed by another dot, in which case I want all instances of:
"!.." , "! .." , "! . . " and "!. ." to become: "!..."
I've tried this code:
the_str = Regex.Replace(the_str, "\\! \\. [^.]", "\\! [^.]");
That comes close to the first part of what I want to do, but I can't make the [^.] in the replacement string be the same character that was matched in the original string. Please help!
I'm interested in both C# and PHP implementations...
$str = preg_replace('/!(?:\s*\.){2,3}/', '!...', $str);
$str = preg_replace('/!\s*\.(?!\s*\.)/', '!', $str);
This does the work in two PCREs. You could probably do some magic to merge them into one, but it wouldn't be readable anymore. The first PCRE is for !..., the second one for !. They are quite straightforward.
C#
s = Regex.Replace(s, @"!\s?\.\s?(\.?)\s?", "!$1$1$1");
PHP
$s = preg_replace('/!\s?\.\s?(\.?)\s?/', '!$1$1$1', $s);
The first dot is consumed but not captured; you're effectively throwing that one away. Group #1 captures the second dot if there is one, or an empty string if not. In either case, plugging it into the replacement string three times yields the desired result.
I used \s instead of literal spaces to make it more obvious what I was doing, and added the ? quantifier to make the spaces optional. If you really need to restrict it to actual space characters (not tabs, newlines, etc.) you can change them back to spaces. If you want to allow more than one space at a time, you can change ? to * where appropriate--e.g.:
@"!\s*\.\s*(\.?)\s*"
Also, notice the use of C#'s verbatim string literals--the antidote for backslashitis. ;)
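As a quick check, here is a sketch that runs the single-pass C# pattern over the seven variants listed in the question. The class and method names are invented; note that the final \s? also consumes one whitespace character after the dots, which may or may not be what you want in running text.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubtitleDots
{
    // Hypothetical helper wrapping the replacement shown above.
    public static string Fix(string s)
    {
        // The first dot is consumed; group 1 holds the second dot or nothing,
        // so "!$1$1$1" yields either "!" or "!...".
        return Regex.Replace(s, @"!\s?\.\s?(\.?)\s?", "!$1$1$1");
    }

    static void Main()
    {
        string[] samples = { "! .", "!.", "! . ", "!..", "! ..", "! . . ", "!. ." };
        foreach (string sample in samples)
            Console.WriteLine("[{0}] -> [{1}]", sample, Fix(sample));
        // The first three print [!], the last four print [!...]
    }
}
```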
Alright, so I need to be able to match strings in ways that give me more flexibility.
Here is an example. If I had the string "This is my random string", I would want some way to make
" *random str* ",
" *is __ ran* ",
" *is* ",
" *this is * string ",
all match it. At this point a simple true or false for whether it matches would be okay, but basically I'd like * to match any number of any characters and _ to match any one character. I can't think of a way, although I'm sure there is one, so if possible, could answers please contain code examples? Thanks in advance!
I can't quite figure out what you're trying to do, but in response to:
but id like basically * to be any length of any characters, also that _ would match any one character
In regex, you can use . to match any single character and .+ to match any number of characters (at least one), or .* to match any number of characters (or none if necessary).
So your *is __ ran* example might turn into the regex .+is .. ran.+, whilst this is * string could be this is .+ string.
If this doesn't answer your question then you might want to try re-wording it to make it clearer.
For learning more regex syntax, a popular site is regular-expressions.info, which provides pretty much everything you need to get started.
Use Regular Expressions.
In C#, you would use the Regex class.
For example:
var str = "This is my random string";
Console.WriteLine(Regex.IsMatch(str, ".*is .. ran.*")); //Prints "True"
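If the patterns come from users in the * and _ notation, one possible sketch is to translate them into a regex first. Everything below is an assumption made for illustration rather than something stated in the question: the names, the case-insensitive matching, and the choice to anchor the pattern so it must cover the whole string.

```csharp
using System;
using System.Text.RegularExpressions;

static class Wildcard
{
    // Hypothetical helper: * becomes .* (any run of characters, possibly empty)
    // and _ becomes . (exactly one character).
    public static string ToRegex(string wildcard)
    {
        // Escape first so regex metacharacters in the input match literally.
        // Regex.Escape turns * into \* and leaves _ unchanged.
        string escaped = Regex.Escape(wildcard);
        return "^" + escaped.Replace(@"\*", ".*").Replace("_", ".") + "$";
    }

    public static bool IsMatch(string input, string wildcard)
    {
        return Regex.IsMatch(input, ToRegex(wildcard), RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string str = "This is my random string";
        Console.WriteLine(IsMatch(str, "*random str*"));      // True
        Console.WriteLine(IsMatch(str, "*is __ ran*"));       // True
        Console.WriteLine(IsMatch(str, "*is*"));              // True
        Console.WriteLine(IsMatch(str, "*this is * string")); // True
        Console.WriteLine(IsMatch(str, "random"));            // False: no * around it
    }
}
```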