Extracting information from a SPARQL query using regular expressions - c#

I am having a hard time creating a regular expression that extracts the namespaces from this SPARQL query:
SELECT *
WHERE {
?Vehicle rdf:type umbel-sc:CompactCar ;
skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
dbp-prop:assembly ?Place.
?Place geo-ont:parentFeature dbpedia:United_States .
}
I need to get:
"rdf", "umbel-sc", "skos", "dbp-prop", "geo-ont", "dbpedia"
I need an expression like this:
\\s+([^\\:]*):[^\\s]+
But the above one does not work, because it also eats spaces before reaching :. What am I doing wrong?

So I need an expression like this:
\\s+([^\\:]*):[^\\s]+
But the above one does not work, because it also eats spaces before reaching ":".
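The regular expression will eat those spaces, yes, but the group captured by your parentheses won’t contain the leading whitespace matched by \s+. Is that a problem? You can access this group by reading Groups[1].Value from the Match object returned by Regex.Match. Note that [^:]* can itself match whitespace, so the group may still span several words; [^\s:]* avoids that.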
If you really need the regex to not match these spaces, you can use a so-called look-behind assertion:
(?<=\s)([^:]*):[^\s]+
As an aside, you don’t need to double all your backslashes. Use a verbatim string instead, like this:
Regex.Match(input, @"(?<=\s)([^:]*):[^\s]+")
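To illustrate the look-behind idea, here is a minimal sketch in which the match value itself is the prefix, so no capture group is needed. The class and method names (PrefixDemo, ExtractPrefixes) are made up for this example, and the [\w-]+ character class is an assumption about what a prefix looks like in the sample query, not a statement about SPARQL syntax in general.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PrefixDemo
{
    // Hypothetical helper: returns the distinct prefixes found in the query.
    public static string[] ExtractPrefixes(string query)
    {
        // (?<=\s) requires whitespace before the prefix without consuming it;
        // (?=:) requires a colon after it without consuming it.
        return Regex.Matches(query, @"(?<=\s)[\w-]+(?=:)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Distinct()
                    .ToArray();
    }

    static void Main()
    {
        string query = "SELECT * WHERE { ?Vehicle rdf:type umbel-sc:CompactCar ; dbp-prop:assembly ?Place . }";
        // Prints: rdf, umbel-sc, dbp-prop
        Console.WriteLine(string.Join(", ", ExtractPrefixes(query)));
    }
}
```

Because both assertions are zero-width, neither the whitespace nor the colon ends up in the match.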

I don't know the details of SPARQL syntax, but I would imagine that it is not a regular language so regular expressions won't be able to do this perfectly. However you can get pretty close if you search for something that looks like a word and is surrounded by space on the left and a colon on the right.
This method might be good enough for a quick solution or if your input format is known and sufficiently restricted. For a more general solution, I suggest you look for or create a proper parser for the SPARQL language.
With that said, try this:
string s = @"SELECT *
WHERE {
?Vehicle rdf:type umbel-sc:CompactCar ;
skos:subject <http://dbpedia.org/resource/Category:Vehicles_with_CVT_transmission>;
dbp-prop:assembly ?Place.
?Place geo-ont:parentFeature dbpedia:United_States .
}";
foreach (Match match in Regex.Matches(s, @"\s([\w-]+):"))
{
    Console.WriteLine(match.Groups[1].Value);
}
Result:
rdf
umbel-sc
skos
dbp-prop
geo-ont
dbpedia

Related

Regex find and replace pattern matches with other pattern

It is common to replace a pattern match with a fixed string, but I need to replace every substring that matches one pattern with text that matches another pattern. Is that possible?
For example, is it possible to replace all matches of
[0-9]{2}'[0-9]{2}, which represents strings like 99'99 or 85'55,
with [0-9]{2}.[0-9]{2}, which represents strings like 99.99 or 85.55?
How can I do this kind of replacement,
or do I have to handle it manually by looping over the matches?
Use the Regex.Replace() instance function along with Regex capture groups like this:
var regex = new Regex("([0-9]{2})'([0-9]{2})");
string result = regex.Replace(input, "$1.$2");
More details about capture groups can be found here.
Also, check out this answer. It shows how to use 'named' groups which might help in future.
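As a rough sketch of the named-group variant mentioned above: the group names whole and frac, and the class and method names, are invented for this example. In .NET, (?<name>...) defines a named group and ${name} refers to it in the replacement string.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Hypothetical helper: turns 99'99 into 99.99 using named capture groups.
    public static string ApostropheToDot(string input)
    {
        // ${whole} and ${frac} insert the text captured by the groups of the same name.
        return Regex.Replace(
            input,
            @"(?<whole>[0-9]{2})'(?<frac>[0-9]{2})",
            "${whole}.${frac}");
    }

    static void Main()
    {
        // Prints: 99.99 and 85.55
        Console.WriteLine(ApostropheToDot("99'99 and 85'55"));
    }
}
```

Named groups behave like $1 and $2 here, but they keep working if more groups are later added to the pattern.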
So as I understand you, you have something like 99'99 or 85'55 and want it replaced by 99.99 or 85.55?
What you are looking for are capture groups, i.e. find a match, capture parts of it and put them into the result.
The regex here would be s/([0-9]{2})'([0-9]{2})/$1.$2/g
Explanation:
([0-9]{2}) the parentheses declare a capture group. It means that whatever is matched inside them will be stored in a variable.
These variables are $1 and $2, because there are two capture groups.
When building the replacement string, just insert those variables and put the dot between them.

Converting wildcard pattern to regular expression

I am new to regular expressions. Recently I was presented with a task to convert a wildcard pattern to regular expression. This will be used to check if a file path matches the regex.
For example if my pattern is *.jpg;*.png;*.bmp
I was able to generate the regex by splitting on semicolons, escaping the string and replacing the escaped * with .*
String regex = "((?i)" + Regex.Escape(extension).Replace("\\*", ".*") + "$)";
So my resulting regex for jpg will be ((?i).*\.jpg$)
Then I combine all my extensions using the OR operator.
Thus my final expression for this example will be:
((?i).*\.jpg$)|((?i).*\.png$)|((?i).*\.bmp$)
I have tested it and it worked, yet I am not sure whether I should add or remove anything to cover other cases, or whether there is a better way to format the whole thing.
Also bear in mind that I can encounter a wildcard like *myfile.jpg where it should match all files whose names end with myfile.jpg
I can encounter patterns like *myfile.jpg;*.png;*.bmp
There's a lot of grouping going on there which isn't really needed... well unless there's something you haven't mentioned this regex would do the same for less:
/.*\.(jpg|png|bmp)$/i
That's in regex notation, in C# that would be:
Regex regex = new Regex(@".*\.(jpg|png|bmp)$", RegexOptions.IgnoreCase);
If you have to programmatically translate between the two, you've started on the right track - split by semicolon, group your extensions into the set (without the preceding dot). If your wildcard patterns can be more complicated (extensions with wildcards, multi-wildcard starting matches) it might need a bit more work ;)
Edit: (For your update)
If the wild cards can be more complicated, then you're almost there. There's an optimization in my above code that pulls the dot out (for extension) which has to be put back in so you'd end up with:
/.*(myfile\.jpg|\.png|\.bmp)$/i
Basically '*' -> '.*', '.' -> '\.' (gets escaped), and the rest goes into the set. It says: match anything ending (the dollar sign anchors to the end) in myfile.jpg, .png or .bmp.
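The translation described above could be sketched roughly as follows. The class and method names (WildcardDemo, FromWildcards) are made up for illustration, only the * wildcard is handled, and anchoring the whole alternation with ^(?:...)$ is one possible design choice rather than the only correct one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WildcardDemo
{
    // Hypothetical helper: builds one case-insensitive regex from a
    // semicolon-separated wildcard list such as "*myfile.jpg;*.png;*.bmp".
    public static Regex FromWildcards(string patterns)
    {
        var alternatives = patterns
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            // Escape everything, then turn the escaped '*' back into '.*'.
            .Select(p => Regex.Escape(p.Trim()).Replace(@"\*", ".*"));

        // ^(?:a|b|c)$ makes each alternative match the whole path.
        return new Regex("^(?:" + string.Join("|", alternatives) + ")$",
                         RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Regex regex = FromWildcards("*myfile.jpg;*.png;*.bmp");
        Console.WriteLine(regex.IsMatch(@"C:\pics\holiday_myfile.JPG")); // True
        Console.WriteLine(regex.IsMatch(@"C:\pics\photo.jpg"));          // False
    }
}
```

Passing RegexOptions.IgnoreCase replaces the inline (?i) in each alternative, and the non-capturing group (?:...) avoids creating groups that are never read.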

c# regex to match specific text

I'm looking to match all text in the format foo:12345 that is not contained within an HTML anchor. For example, I'd like to match lines 1 and 3 from the following:
foo:123456
<a href="...">foo:123456</a>
foo:123456
I've tried these regexes with no success:
Negative lookahead attempt ( incorrectly matches, but doesn't include the last digit )
foo:(\d+)(?!</a>)
Negative lookahead with non-capturing grouping
(?:foo:(\d+))(?!</a>)
Negative lookbehind attempt ( wildcards don't seem to be supported )
(?<!<a[^>]>)foo:(\d+)
If you want to start analysing HTML like this then you probably want to actually parse the HTML instead of using regular expressions. The HTML Agility Pack is the usual first port of call. With regular expressions it becomes hard to deal with things like <a></a>foo:123456<a></a>, which of course should pull out the middle bit, but it's extremely hard to write a regex that will do that.
I should add that I am assuming that you do in fact have a block of HTML rather than just individual short strings such as each of your lines above. Partly I ruled that out because matching it when it is the only thing on the line is pretty easy, so I figured you'd have got it if you wanted that. :)
Regex is usually not the best tool for the job, but if your case is very specific like in your example you could use:
foo:((?>\d+))(?!</a>)
Your first expression didn't work because \d+ would backtrack till (?!</a>) matches. This can be fixed by not allowing \d+ to backtrack, as above with help of an atomic/nonbacktracking group, or you could also make the lookahead fail in case \d+ backtracks, like:
foo:(\d+)(?!</a>|\d)
Although that is not as efficient.
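A small sketch comparing the plain negative lookahead with the atomic-group version may make the backtracking issue easier to see. The class and method names and the sample HTML are invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AtomicGroupDemo
{
    // Hypothetical helper: returns the digits captured by group 1 for every match.
    public static string[] Digits(string html, string pattern)
    {
        return Regex.Matches(html, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    static void Main()
    {
        string html = "foo:111 <a href=\"#\">foo:222</a> foo:333";

        // \d+ gives back one digit so that (?!</a>) succeeds: 111, 22, 333
        Console.WriteLine(string.Join(", ", Digits(html, @"foo:(\d+)(?!</a>)")));

        // (?>\d+) cannot give digits back, so foo:222 is skipped: 111, 333
        Console.WriteLine(string.Join(", ", Digits(html, @"foo:((?>\d+))(?!</a>)")));
    }
}
```

This only covers the narrow case where the digits are immediately followed by </a>; it is not a substitute for parsing the HTML.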
Note that many regex engines do not support lookbehind with a variable-length pattern inside (.NET is one of the few that does), so you may want to work it out differently,
for example:
Find and mark all foo-s that are contained in anchor
Find and do your goal with all other
Remove marks
This is probably a long-winded way of doing this, but you could simply bring back all occurrences of foo: followed by some digits and then exclude the unwanted ones afterwards.
string pattern = @"foo:\d+ |" +
                 @"foo:\d+[<]";
Then use matchcollection
MatchCollection m0 = Regex.Matches(file, pattern, RegexOptions.Singleline);
Then loop through each occurrence:
foreach (Match m in m0)
{
    // Exclude the matches that contain the "<"
    if (m.Value.Contains("<")) continue;
    // ... handle the remaining match here
}
I would use LINQ to XML and treat the HTML like XML, for example:
var query = MyHtml.Descendants().ToArray();
foreach (XElement result in query)
{
    if (Regex.IsMatch(result.Value, @"foo:123456") && result.Name.ToString() != "a")
    {
        //do something...
    }
}
perhaps there's a better way, but i don't know it...this seems pretty straight forward to me :P

Regular Expression to reject special characters other than commas

I am working in ASP.NET and using a RegularExpressionValidator.
Could you please help me create a regular expression that disallows special characters other than the comma? The comma has to be allowed.
I checked regexlib, but I could not find a match. I tried ^(a-z|A-Z|0-9)*[^#$%^&*()']*$ . When I add other characters as invalid, it does not work.
Also, could you please suggest a good resource for regular expressions? regexlib seems to be big; is there any other place which lists a limited set of the most used examples?
Also, can I create expressions using C# code? Any articles for that?
[\w\s,]+
works fine, as you can see below.
RegExr is a great place to test your regular expressions with real time results, it also comes with a very complete list of common expressions.
[] character class
\w matches any word character (alphanumeric & underscore)
\s matches any whitespace character (spaces, tabs, line breaks)
, includes the comma
+ is a greedy quantifier; it matches the previous element one or more times
[\d\w\s,]*
Just a guess
To answer on any articles, I got started here, find it to be an excellent resource:
http://www.regular-expressions.info/
For your current problem, try something like this:
[\w\s,]*
Here's a breakdown:
Match a single character present in the list below «[\w\s,]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A word character (letters, digits, etc.) «\w»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “,” «,»
For a single character that is not a comma, [^,] should work perfectly fine.
You can try the [\w\s,] regular expression. This character class matches only word characters (letters, digits and the underscore), whitespace and the comma. If any other character appears within the text, then it won't match that character.
For your second question regarding regular expression resources, you can go to
http://www.regular-expressions.info/
This website has a lot of tutorials on regex, plus a lot of useful information.
Also, can I create expressions using C# code? Any articles for that?
By this, do you mean that you want to know which classes and methods are used to execute regular expressions? Or do you want a tool that will create regular expressions for you?
You can create expressions with C#, something like this usually does the trick:
Regex regex = new Regex(@"^[a-z0-9,]*$", RegexOptions.IgnoreCase);
System.Console.Write("Enter Text");
String s = System.Console.ReadLine();
Match match = regex.Match(s);
if (match.Success)
{
    System.Console.WriteLine("True");
}
else
{
    System.Console.WriteLine("False");
}
System.Console.ReadLine();
You need to import the System.Text.RegularExpressions namespace.
The regular expression above, accepts only numbers, letters (both upper and lower case) and the comma.
For a small introduction to Regular Expressions, I think that the book for MCTS 70-536 can be of a big help, I am pretty sure that you can either download it from somewhere or obtain a copy.
I am assuming that you never messed around with regular expressions in C#, hence I provided the code above.
Hope this helps.
Thank you, all..
[\w\s,]* works
Let me go through regular-expressions.info and come back if I need further support.
Let me try the C# code approach and come back if I need further support.
[This forum is awesome. Quality replies so quick..]
Thanks again
(…) denotes a group, not a character set; a character set is written […]. So try this:
^[a-zA-Z0-9,]*$
This will only allow alphanumeric characters and the comma.
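One thing worth checking when moving such a pattern into C# code is anchoring. The sketch below, with made-up class and method names, contrasts the anchored pattern above with an unanchored character class; with Regex.IsMatch, an unanchored pattern ending in * succeeds on any input because it can match the empty string.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommaValidatorDemo
{
    // Anchored: the whole input must consist of letters, digits and commas.
    static readonly Regex Allowed = new Regex(@"^[a-zA-Z0-9,]*$");

    // Hypothetical helper used only for this illustration.
    public static bool IsValid(string input)
    {
        return Allowed.IsMatch(input);
    }

    static void Main()
    {
        Console.WriteLine(IsValid("abc,123"));                 // True
        Console.WriteLine(IsValid("a$b"));                     // False

        // Unanchored: an empty match is found, so this is True for any input.
        Console.WriteLine(Regex.IsMatch("a$b", @"[\w\s,]*"));  // True
    }
}
```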

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search a string for a specific pattern and replace it, but the matched text can differ and the replacement needs to "remember" what it's replacing.
For example, say I have an arbitrary string: 134kshflskj9809hkj
and I want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and in this case only) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some will use \1 instead of $1 and some will allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
e.g. for Perl:
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
see http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
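For C#, the same whole-match idea can be sketched in two ways: with the $& substitution, which .NET replacement strings also support, or with a MatchEvaluator delegate that builds the replacement in code. The class and method names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeMatchDemo
{
    // $& inserts the entire match, so no capture group is needed.
    public static string WrapWithSubstitution(string input)
    {
        return Regex.Replace(input, @"\d+", "($&)");
    }

    // A MatchEvaluator receives each Match and returns its replacement text.
    public static string WrapWithEvaluator(string input)
    {
        return Regex.Replace(input, @"\d+", m => "(" + m.Value + ")");
    }

    static void Main()
    {
        // Both lines print: (134)kshflskj(9809)hkj
        Console.WriteLine(WrapWithSubstitution("134kshflskj9809hkj"));
        Console.WriteLine(WrapWithEvaluator("134kshflskj9809hkj"));
    }
}
```

The evaluator form is handy when the replacement depends on the matched value, for example when it has to be parsed or looked up first.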
Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups in a way that is specific to your language.
