Converting a basic search string to a regex - C#

Say I have a search string like:
"Hello [NAME], how are you today? I am fine."
If I were to use a regex pattern to search text, I would have to convert it to something like this (assuming that '\ ' is a valid regex for a single space):
"Hello\ \[NAME\],\ how\ are\ you\ today\?\ I\ am\ fine\."
Now, before I go off and try to write a function to do this myself, is anyone aware of something that already does this sort of conversion? (Eclipse does something a bit like this; it converts all its searches into regular expressions before searching, even if you're not setting the search pattern to be a regex.)
I'm targeting C# in this instance, but feel free to add answers for other languages, as other people might be interested in a similar thing for Java, Python et al.

Regex.Escape(string) will return a Regex pattern that matches the supplied literal string.
Specifically, it escapes \*+?|{[()^$.# and white space.
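A minimal sketch of how this could be used on the search string from the question. The class name, method name and sample text are made up for illustration; only Regex.Escape and Regex.IsMatch come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: turns a plain search string into a regex
    // pattern that matches exactly that text.
    public static string ToLiteralPattern(string searchText)
    {
        return Regex.Escape(searchText);
    }

    public static void Main()
    {
        string search = "Hello [NAME], how are you today? I am fine.";
        string pattern = ToLiteralPattern(search);

        // Print the escaped pattern to see which characters were escaped.
        Console.WriteLine(pattern);

        // The brackets, '?' and '.' are now matched literally.
        string text = "She wrote: Hello [NAME], how are you today? I am fine. Bye.";
        Console.WriteLine(Regex.IsMatch(text, pattern));  // True

        // Without the literal "[NAME]" the text no longer matches.
        Console.WriteLine(Regex.IsMatch("Hello N, how are you today? I am fine.", pattern));  // False
    }
}
```

Regex.Unescape does the reverse if you ever need the original text back.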

Related

Converting wildcard pattern to regular expression

I am new to regular expressions. Recently I was presented with a task to convert a wildcard pattern to regular expression. This will be used to check if a file path matches the regex.
For example if my pattern is *.jpg;*.png;*.bmp
I was able to generate the regex by splitting on semicolons, escaping each string and replacing the escaped * with .*
String regex = "((?i)" + Regex.Escape(extension).Replace("\\*", ".*") + "$)";
So my resulting regex for jpg will be ((?i).*\.jpg$)
Then I combine all my extensions using the OR operator.
Thus my final expression for this example will be:
((?i).*\.jpg$)|((?i).*\.png$)|((?i).*\.bmp$)
I have tested it and it works, but I am not sure whether I should add or remove anything to cover other cases, or whether there is a better way to format the whole thing.
Also bear in mind that I can encounter a wildcard like *myfile.jpg where it should match all files whose names end with myfile.jpg
I can encounter patterns like *myfile.jpg;*.png;*.bmp
There's a lot of grouping going on there which isn't really needed. Unless there's something you haven't mentioned, this regex would do the same with less:
/.*\.(jpg|png|bmp)$/i
That's in regex notation, in C# that would be:
Regex regex = new Regex(@".*\.(jpg|png|bmp)$", RegexOptions.IgnoreCase);
If you have to programmatically translate between the two, you've started on the right track: split by semicolon, then group your extensions into the set (without the preceding dot). If your wildcard patterns can be more complicated (extensions with wildcards, multi-wildcard starting matches) it might need a bit more work ;)
Edit: (For your update)
If the wild cards can be more complicated, then you're almost there. There's an optimization in my above code that pulls the dot out (for extension) which has to be put back in so you'd end up with:
/.*(myfile\.jpg|\.png|\.bmp)$/i
Basically '*' -> '.*', '.' -> '\.' (it gets escaped), and the rest goes into the set. It says: match anything ending (the dollar sign anchors to the end) in myfile.jpg, .png or .bmp.
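A rough sketch of the programmatic translation, under the assumptions that ';' is the only separator and '*' the only wildcard. It does not factor out the common leading .* as in the hand-written regex above; it simply joins the escaped patterns with |. The class and method names are made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Hypothetical helper: converts a list such as "*myfile.jpg;*.png"
    // into one case-insensitive regex.
    // Assumption: ';' separates patterns and '*' is the only wildcard.
    public static Regex FromWildcards(string wildcardList)
    {
        var alternatives = wildcardList
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            // Escape everything, then turn the escaped '*' back into ".*".
            .Select(p => Regex.Escape(p.Trim()).Replace(@"\*", ".*"));

        // ^ and $ anchor the whole input; (?: ) groups the alternatives.
        string pattern = "^(?:" + string.Join("|", alternatives) + ")$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex regex = FromWildcards("*myfile.jpg;*.png;*.bmp");
        Console.WriteLine(regex.IsMatch(@"C:\pics\holiday.PNG"));    // True
        Console.WriteLine(regex.IsMatch(@"C:\pics\myfile.jpg"));     // True
        Console.WriteLine(regex.IsMatch(@"C:\pics\other.jpg"));      // False
        Console.WriteLine(regex.IsMatch(@"C:\pics\image.png.txt"));  // False
    }
}
```

Because every other character goes through Regex.Escape first, file names containing characters like ( or + are still matched literally.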

Google Search Position Regex

I am trying to get the search position of keywords in Google using the regex below:
string lookup = "(<h3 class=\"r\"><a href=\")(\\w+[a-zA-Z0-9.-?=/]*)";
But this is not working for URLs containing hyphens (-), like:
www.example-xyz.com
Can anyone help me to fix this?
Escape your hyphen with backslash and escape that escaping backslash with another backslash:
string lookup = "(<h3 class=\"r\"><a href=\")(\\w+[a-zA-Z0-9.\\-?=/]*)";
Since - denotes a range within [], you need to escape it with a backslash (written as \\ in a regular C# string literal):
string lookup = "(<h3 class=\"r\"><a href=\")(\\w+[a-zA-Z0-9.\\-?=/]*)";
By the way, there are many questions on Stack Overflow about matching URLs with regex; search the tags [regex] and [url] if you want a more refined regex.
Read a decent book on Regular Expressions, like Jeffrey E.F. Friedl's "Mastering Regular Expressions".
Not only will it show you that the - makes a character range in a character class -
[a-z]
and so must be escaped -
[a\-z]
or put at the beginning -
[-az]
or at the end -
[az-]
when meant verbatim, but also that it is usually a mistake to parse such markup (a context-free language, in Chomsky terms) with one Regular Expression alone.
You are looking for a markup parser (like BeautifulSoup or lxml, but in C#), and RFC 3986, Appendix B for a proper URI-matching expression instead.
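To see the effect of the hyphen's position, here is a small sketch run against the host name from the question. The helper name is made up; the character classes are the one from the question plus the escaped and hyphen-last variants.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenDemo
{
    // Hypothetical helper: returns the leading run of characters
    // that belong to the given character class.
    public static string MatchPrefix(string input, string characterClass)
    {
        return Regex.Match(input, "^" + characterClass + "*").Value;
    }

    public static void Main()
    {
        string host = "www.example-xyz.com";

        // ".-?" is read as the range '.' through '?', which excludes '-'.
        Console.WriteLine(MatchPrefix(host, @"[a-zA-Z0-9.-?=/]"));   // www.example

        // Escaped, the hyphen is a literal character.
        Console.WriteLine(MatchPrefix(host, @"[a-zA-Z0-9.\-?=/]"));  // www.example-xyz.com

        // Placed last, it is literal as well.
        Console.WriteLine(MatchPrefix(host, @"[a-zA-Z0-9.?=/-]"));   // www.example-xyz.com
    }
}
```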

Extracting information from a SPARQL query using regular expressions

I am having a hard time creating a regular expression that extracts the namespaces from this SPARQL query:
SELECT *
WHERE {
?Vehicle rdf:type umbel-sc:CompactCar ;
skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
dbp-prop:assembly ?Place.
?Place geo-ont:parentFeature dbpedia:United_States .
}
I need to get:
"rdf", "umbel-sc", "skos", "dbp-prop", "geo-ont", "dbpedia"
I need an expression like this:
\\s+([^\\:]*):[^\\s]+
But the above one does not work, because it also eats spaces before reaching :. What am I doing wrong?
The regular expression will eat those spaces, yes, but the group captured by your parentheses won't contain them. Is that a problem? You can access this group by reading Groups[1].Value from the Match object returned by Regex.Match.
If you really need the regex to not match these spaces, you can use a so-called look-behind assertion:
(?<=\s)([^:]*):[^\s]+
As an aside, you don’t need to double all your backslashes. Use a verbatim string instead, like this:
Regex.Match(input, @"(?<=\s)([^:]*):[^\s]+")
I don't know the details of SPARQL syntax, but I would imagine that it is not a regular language so regular expressions won't be able to do this perfectly. However you can get pretty close if you search for something that looks like a word and is surrounded by space on the left and a colon on the right.
This method might be good enough for a quick solution, or if your input format is known and sufficiently restricted. For a more general solution, I suggest you look for or create a proper parser for the SPARQL language.
With that said, try this:
string s = @"SELECT *
WHERE {
?Vehicle rdf:type umbel-sc:CompactCar ;
skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
dbp-prop:assembly ?Place.
?Place geo-ont:parentFeature dbpedia:United_States .
}";
foreach (Match match in Regex.Matches(s, @"\s([\w-]+):"))
{
Console.WriteLine(match.Groups[1].Value);
}
Result:
rdf
umbel-sc
skos
dbp-prop
geo-ont
dbpedia

Convert an Extended Regular expression to .NET compatible RegEx

I've got a RegEx that works great on *NIX systems and languages that support Extended Regular Expressions (ERE). I haven't found a freely available library for .NET that supports EREs, nor have I had any luck trying to translate this into a non-ERE and get the same result. Here is the RegEx:
^\+(<{7} \.|={7}$|>{7} \.)
Background: the point of the RegEx is to identify if a given string appears to have the markers from an unresolved subversion merge.
It looks to me like ERE syntax is mostly upward-compatible with .NET's regex flavor, as it is with most other "Perl-compatible" flavors (Perl, PHP, Python, JavaScript, Ruby, Java...). In other words, anything you can do in an ERE regex, you should be able to do in an identical .NET regex. Certainly your example:
^\+(<{7} \.|={7}$|>{7} \.)
means the same thing in .NET as it does in ERE. The only major exception I can see is in the area of POSIX bracket expressions; .NET follows the Unicode standard instead.
It's when you go to apply the regex that things really get different. In C# you might apply that regex like this:
string result = Regex.Match(targetString, @"^\+(<{7} \.|={7}$|>{7} \.)").Value;
C#'s verbatim strings save you having to escape backslashes like in some other languages' string literals; you only have to escape quotation marks, which you do by doubling them:
@"He said, ""Look out!""";
Does that answer your question?
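For what it's worth, here is a small sketch that runs the unchanged pattern over some diff-like text. The class name, method name and sample lines are invented, and RegexOptions.Multiline is my own addition (an assumption, so that ^ and $ apply to each line of a multi-line string).

```csharp
using System;
using System.Text.RegularExpressions;

public static class MergeMarkerDemo
{
    // The pattern is the ERE from the question, unchanged.
    // Assumption: Multiline is added so ^ and $ work per line.
    private static readonly Regex Marker =
        new Regex(@"^\+(<{7} \.|={7}$|>{7} \.)", RegexOptions.Multiline);

    // Hypothetical helper: true if any added line looks like a merge marker.
    public static bool HasConflictMarker(string diffText)
    {
        return Marker.IsMatch(diffText);
    }

    public static void Main()
    {
        Console.WriteLine(HasConflictMarker("+int x = 1;\n+<<<<<<< .mine\n+int y = 2;"));  // True
        Console.WriteLine(HasConflictMarker("+======="));                                  // True
        Console.WriteLine(HasConflictMarker("+========"));                                 // False (eight '=')
        Console.WriteLine(HasConflictMarker("+int x = 7;"));                               // False
    }
}
```

Note that in .NET $ matches before \n but not before \r, so text with Windows line endings may need normalizing first.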
Are you sure you don't have a typo in that? RegexBuddy (when set to either POSIX ERE or GNU ERE) says that the "+" quantifier must be preceded by a token that can be repeated. Other than that, this appears to be a valid .NET Regex. You might want to check out one of the great O'Reilly books on regular expressions as well. If this doesn't help, please post some examples of text you're trying to match/not match.

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search and replace a string for a specific pattern, but the patterns can be different and the replacement needs to "remember" what it's replacing.
For example, say I have an arbitrary string: 134kshflskj9809hkj
and I want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and in this case only) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some use \1 instead of $1 and some allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
e.g. for Perl:
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
See http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
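Since the Perl example uses the whole match, here is a sketch of the same idea in C#. .NET replacement strings also accept $& for the entire match; the second method uses a MatchEvaluator, which is one option if the replacement has to be computed in code. The class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WrapDigitsDemo
{
    // "$&" in a .NET replacement string stands for the entire match,
    // so no capture group is needed.
    public static string WrapWholeMatch(string input)
    {
        return Regex.Replace(input, @"\d+", "($&)");
    }

    // A MatchEvaluator builds each replacement in code instead.
    public static string WrapWithEvaluator(string input)
    {
        return Regex.Replace(input, @"\d+", m => "(" + m.Value + ")");
    }

    public static void Main()
    {
        Console.WriteLine(WrapWholeMatch("134kshflskj9809hkj"));     // (134)kshflskj(9809)hkj
        Console.WriteLine(WrapWithEvaluator("134kshflskj9809hkj"));  // (134)kshflskj(9809)hkj
    }
}
```

Regex.Replace replaces every match by default, so there is no equivalent of the g flag to remember.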
Depending on your language, you're looking to match groups.
So typically you'll make a pattern of the form
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups in a way specific to your language.
