Help with a tag removal regex - C#

I have strings in the form "[user:fred][priority:3]Lorem ipsum dolor sit amet.", where each area enclosed in square brackets is a tag (in the format [key:value]). I need to be able to remove a specific tag given its key with the following extension method:
public static void RemoveTagWithKey(this string message, string tagKey) {
    if (message.ContainsTagWithKey(tagKey)) {
        var regex = new Regex(@"\[" + tagKey + @":[^\]]");
        message = regex.Replace(message, string.Empty);
    }
}
public static bool ContainsTagWithKey(this string message, string tagKey) {
    return message.Contains(string.Format("[{0}:", tagKey));
}
Only the tag with the specified key should be removed from the string. My regex doesn't work because it's daft. I need help to write it properly. Alternatively, an implementation without regex is welcome.

I know there are much more feature-rich tools out there, but I like the simplicity and cleanliness of Code Architects Regex Tester (aka YART: Yet Another Regex Tester). Shows groups and captures in a tree view, quite fast, very small, open source. It also generates code in C++, VB, and C# and can automatically escape or unescape regexes for these languages. I dump it in my VS tools folder (C:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools) and set a menu item to it in the Tools menu with Tools > External Tools so I can fire it up quickly from inside VS.
Regexes can be really hard to write sometimes and I know it really helps to be able to test the regex and see the results as you go.
Another really popular (but not free) option is Regex Buddy.

If you want to do this without a Regex it isn't difficult. You're already searching for a specific tag key, so you can just search for "[" + tagKey + ":", then search from there for the closing "]", and remove everything from the opening bracket through the closing one. Something like...
int posStart = message.IndexOf("[" + tagKey + ":");
if (posStart >= 0)
{
    int posEnd = message.IndexOf("]", posStart);
    if (posEnd > posStart)
    {
        // +1 so that the closing "]" is removed as well
        message = message.Remove(posStart, posEnd - posStart + 1);
    }
}
Is that better than a Regex solution? Since you're only looking for a specific key I think it probably is, on the grounds of simplicity. I love Regexes but they're not always the clearest answer.
Edit: Another reason the IndexOf() solution could be seen as better is that it means there is only one rule for finding the start of the tag, whereas the original code uses a Contains() which searches for something like '[tag:' and then uses a regex which uses a slightly different expression to do the substitution / removal. In theory you could have text which matches one criterion but not the other.

Try this instead:
new Regex(@"\[" + tagKey + @":[^\]]+\]");
I added + to the [^\]] pattern so that it matches one or more characters that are not a closing bracket, followed by the closing bracket itself.

I think this is the regex you're looking for:
string regex = @"\[" + tagKey + @":[^\]]+\]";
Also, you don't need to do a separate check to see if there are tags of that type. Just do a regex replace; if there are no matches, the original string is returned.
public static string RemoveTagWithKey(string message, string tagKey) {
    string regex = @"\[" + tagKey + @":[^\]]+\]";
    return Regex.Replace(message, regex, string.Empty);
}
You seem to be writing an extension method, but I wrote this as a static utility method to keep things simple.
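For completeness, here is a minimal sketch of the same idea as an extension method. Because strings are immutable in C#, the method has to return the new string rather than assign to its parameter. The class name TagExtensions is hypothetical, and the Regex.Escape call is an addition that assumes tag keys may contain regex metacharacters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods must live in a static class.
public static class TagExtensions
{
    // Returns a copy of message with every [tagKey:value] tag removed.
    public static string RemoveTagWithKey(this string message, string tagKey)
    {
        // Regex.Escape guards against keys such as "a.b" or "c+d" (assumption:
        // keys are arbitrary text). [^\]]* also matches an empty value.
        string pattern = @"\[" + Regex.Escape(tagKey) + @":[^\]]*\]";
        return Regex.Replace(message, pattern, string.Empty);
    }
}
```

The caller then uses the return value, for example: message = message.RemoveTagWithKey("priority");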

Related

"Evaluate" a c# string

I am reading a C# source file.
When I encounter a string, I want to get its value.
For instance, in the following example:
public class MyClass
{
    public MyClass()
    {
        string fileName = "C:\\Temp\\A Weird\"FileName";
    }
}
I would like to retrieve
C:\Temp\A Weird"FileName
Is there an existing procedure to do that?
Coding a solution with all the possible cases should be quite tricky (@, escape sequences, ...).
I am convinced such procedure exists...
I would like to have the dual function too (to inject a string into a C# source file)
Thanks in advance.
Philippe
P.S:
I gave an example with a filename, but I look for a solution working for all kinds of strings.
I'm pretty sure you can use CodeDOM to read a C# code file and parse its elements. It generates a code tree, and then you can look for nodes representing strings.
http://www.codeproject.com/Articles/2502/C-CodeDOM-parser
Other CodeDom parsers:
http://www.codeproject.com/Articles/14383/An-Expression-Parser-for-CodeDom
NRefactory: https://github.com/icsharpcode/NRefactory and http://www.codeproject.com/Articles/408663/Using-NRefactory-for-analyzing-Csharp-code
There is a way of extracting these strings using a regular expression:
("(\\"|[^"])*")
This particular one works on your simple example and gives the filename (complete with leading and trailing quote characters); whether it would work on more complex ones I can't easily tell unfortunately.
For clarity, (\\"|[^"]) matches any character apart from ", except where it has a leading \ character.
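As a rough sketch of that approach, the variant below consumes each backslash together with the character after it, so an escaped backslash right before the closing quote does not confuse the match. The names StringLiteralReader and FirstLiteralValue are hypothetical, and the sketch makes two loud assumptions: it only handles regular (non-verbatim) literals, and it only unescapes \\ and \", so sequences such as \n or \t would need their own lookup.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not an existing library API.
public static class StringLiteralReader
{
    // An opening quote, then any run of escape pairs (backslash + any char)
    // or characters that are neither a quote nor a backslash, then a closing quote.
    private static readonly Regex Literal =
        new Regex(@"""((?:\\.|[^""\\])*)""");

    // Returns the value of the first regular string literal in source,
    // or null when none is found.
    public static string FirstLiteralValue(string source)
    {
        Match m = Literal.Match(source);
        if (!m.Success)
        {
            return null;
        }
        // Assumption: only \\ and \" are unescaped; other escapes are left as-is.
        return Regex.Replace(m.Groups[1].Value, @"\\([""\\])", "$1");
    }
}
```

For anything beyond simple cases, a real C# parser such as the ones mentioned above is likely the safer route.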
Just use the regex ".*" to match all string values, then remove the surrounding quotation marks and unescape the result.
This allows \" and "" sequences inside your string,
so both "C:\\Temp\\A Weird\"FileName" and "Hello ""World""" will match.

I need a regex expression which can return to me the relative URL + query string from an HTML content string

I have found useful regex expressions from the site, but this particular one eludes me.
Basically, I need to extract this:
/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965
from this string using regex:
<p>test james lafferty joseph <strong>swami</strong> is a great guy.<img src=\"/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965\" alt=\"nasa1\" title=\"nasa1\" style=\"width: 100px; height: 57px; \" width=\"100\" height=\"57\" /></p>\r\n<p><br /></p>\r\n<p><br /></p>
The regex expression I have extracts the URL without the query string. It is ok if the regex hard codes the string '/uploadedimages/'. However, other than this hard-coding, everything else needs to be generic. This could be anything - not just an image, could be an href linked to a pdf file. Query string could be anything valid as well.
Other regex expressions I have found work only with the absolute URLs starting with http, etc.
I am not sure why nobody was able to provide an acceptable answer for this question. As this would be a very real problem for any developer who needs to extract URLs of any kind fully from an HTML fragment which may or may not be valid HTML, here is the answer which I have verified as working in C#:
matches = Regex.Matches(target, "(?<=\")(http:|https:)?[/\\\\](?:[A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2})*([/\\\\](?:([A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2}))*)*(?:\\?[a-zA-Z0-9=/\\\\&]+)?(?=\")", RegexOptions.IgnoreCase);
This will extract any number of URLs in the HTML fragment with query string, and I have also gone ahead and modified the REGEX so that it works properly with escape characters in C# regex. The pure REGEX will not work as-is in C# as we have to escape the "\" and """ characters.
Assuming you want a regex like this?
<([^=<>]+)=\\?"([^\\"]+)
Otherwise, please be less ambiguous about what you are actually trying to parse out. Thanks!
I'd recommend doing this in stages, since it will be much simpler. You can use .NET in a cleaner way: regexes are not needed here, and neither is a full DOM parser if you know the format the data will come in. Assuming for the moment that what you really want is the relative URL of the image source, and that there is only ever one image in the HTML, I would recommend something like the following.
string Parse(string html)
{
    // Skip past src=" (5 characters), then read up to the next quote.
    var temp = html.Substring(html.IndexOf("src=") + 5);
    return temp.Substring(0, temp.IndexOf("\""));
}
To do it using regular expressions, based off kgoedtel's answer (modified slightly) you'll need to do something like:
string Parse(string html)
{
    var r = new Regex("<img [^=<>]+=\\\\?\"([^\\\\\"]+)");
    return r.Match(html).Groups[1].Value;
}
IEnumerable<string> ParseMany(string html)
{
    var r = new Regex("[^=<>]+=\\\\?\"([^\\\\\"]+)");
    return r.Matches(html).OfType<Match>().Select(m => m.Groups[1].Value);
}

removing #region

I had to take over a C# project. The guy who developed the software in the first place was deeply in love with #region: he wrapped everything in regions.
It makes me almost crazy and I was looking for a tool or add-on to remove all #region directives from the project. Is there something around?
Just use Visual Studio's built-in "Find and Replace" (or "Replace in Files", which you can open by pressing Ctrl + Shift + H).
To remove #region, you'll need to enable Regular Expression matching; in the "Replace In Files" dialog, check "Use: Regular Expressions". Then, use the following pattern: "\#region .*\n", replacing matches with "" (the empty string).
To remove #endregion, do the same, but use "\#endregion.*\n" as your pattern. Regular Expressions might be overkill for #endregion, but it wouldn't hurt (in case the previous developer ever left comments on the same line as an #endregion or something).
Note: Others have posted patterns that should work for you as well, they're slightly different than mine but you get the general idea.
Use one regex ^[ \t]*\#[ \t]*(region|endregion).*\n to find both: region and endregion. After replacing by empty string, the whole line with leading spaces will be removed.
[ \t]* - finds leading spaces
\#[ \t]*(region|endregion) - finds #region or #endregion (and also very rare case with spaces after #)
.*\n - finds everything after #region or #endregion (but in the same line)
EDIT: Answer changed to be compatible with old Visual Studio regex syntax. Was: ^[ \t]*\#(end)?region.*\n (question marks do not work for old syntax)
EDIT 2: Added [ \t]* after # to handle very rare case found by @Volkirith
In Find and Replace use {[#]<region[^]*} for Find what: and replace it with empty string.
#EndRegion is simple enough to replace.
Should you have to cooperate with region lovers (and keep regions untouched), then I would recommend the "I hate #Regions" Visual Studio extension. It makes regions tolerable: all regions are expanded by default and #region directives are rendered in a very small font.
For anyone using ReSharper, it's just a simple Alt+Enter on the region line. You will then have the option to remove regions in file, in project, or in solution.
More info on JetBrains.
To remove #region with a newline after it, replace following with empty string:
^(?([^\r\n])\s)*\#region\ ([^\r\n])*\r?\n(?([^\r\n])\s)*\r?\n
To replace #endregion with a leading empty line, replace following with an empty string:
^(?([^\r\n])\s)*\r?\n(?([^\r\n])\s)*\#endregion([^\r\n])*\r?\n
How about writing your own program for it, to replace regions with nothing in all *.cs files in basePath recursively?
(Hint: Be careful about reading files as UTF-8 if they aren't.)
public static void StripRegions(string fileName, System.Text.RegularExpressions.Regex re)
{
    string input = System.IO.File.ReadAllText(fileName, System.Text.Encoding.UTF8);
    string output = re.Replace(input, "");
    System.IO.File.WriteAllText(fileName, output, System.Text.Encoding.UTF8);
}
public static void StripRegions(string basePath)
{
    System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(@"(^[ \t]*\#[ \t]*(region|endregion).*)(\r)?\n", System.Text.RegularExpressions.RegexOptions.Multiline);
    foreach (string file in System.IO.Directory.GetFiles(basePath, "*.cs", System.IO.SearchOption.AllDirectories))
    {
        StripRegions(file, re);
    }
}
Usage:
StripRegions(@"C:\sources\TestProject");
You can use the wildcard find/replace:
*\#region *
*\#endregion
And replace with no value. (Note the # needs to be escaped, as Visual Studio uses it to match "any number".)

Extracting a string starting with x and ending with y

First of all, I did a search on this and was able to find how to use something like String.Split() to extract the string based on a condition. I wasn't able to find however, how to extract it based on an ending condition as well. For example, I have a file with links to images: http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg
You will notice that all the images start with http:// and end with .jpg. However, .jpg is followed by http:// without a space, making this a little more difficult.
So basically I'm trying to find a way (Regex?) to extract a string from a string that starts with http:// and ends with .jpg
Regex is the easiest way to do this. If you're not familiar with regular expressions, you might check out Regex Buddy. It's a relatively cheap little tool that I found extremely useful when I was learning. For your particular case, a possible expression is:
(http://.+?\.jpg)
It probably requires some more refinement, as there are boundary cases that could trip this up, but it would work if the file is a simple list.
You can also do free quick testing of expressions here.
Per your latest comment, if you have links to other non-images as well, then you need to make sure it doesn't start at the http:// for one link and read all the way to the .jpg for the next image. Since URLs are not allowed to have whitespace, you can do it like this:
(http://[^\s]+\.jpg)
This basically says, "match a string starting with http:// and ending with .jpg where there is at least one character between the two and none of those characters are whitespace".
Regex RegexObj = new Regex("http://.+?\\.jpg");
Match MatchResults = RegexObj.Match(subject);
while (MatchResults.Success) {
    // Do something with it
    MatchResults = MatchResults.NextMatch();
}
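The same loop can be sketched as a small helper that collects every match into a list. The names ImageLinkExtractor and ExtractJpgLinks are made up for illustration; the pattern is the lazy one from above, which stops at the first .jpg after each http:// and therefore copes with links that run together without a separator.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
public static class ImageLinkExtractor
{
    // Returns every substring that starts with http:// and ends at the
    // nearest following .jpg (lazy quantifier).
    public static List<string> ExtractJpgLinks(string text)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(text, @"http://.+?\.jpg"))
        {
            links.Add(m.Value);
        }
        return links;
    }
}
```

Applied to the two concatenated photobucket links from the question, this should yield two separate entries, each still ending in .jpg.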
In your specific case, you could always split it by ".jpg". You will probably end up with one empty element at the end of the array, and have to append the .jpg at the end of each file if you need that. Apart from that I think it would work.
Tested the following code and it worked fine:
public void SplitTest()
{
    string test = "http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";
    string[] items = test.Split(new string[] { ".jpg" }, StringSplitOptions.RemoveEmptyEntries);
}
It even gets rid of the empty entry...
The following LINQ will separate by http: and make sure to only get values that end with .jpg.
var images = from i in imageList.Split(new[] { "http:" },
                 StringSplitOptions.RemoveEmptyEntries)
             where i.EndsWith(".jpg")
             select "http:" + i;

Easiest way to convert a URL to a hyperlink in a C# string?

I am consuming the Twitter API and want to convert all URLs to hyperlinks.
What is the most effective way you've come up with to do this?
from
string myString = "This is my tweet check it out http://tinyurl.com/blah";
to
This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
Regular expressions are probably your friend for this kind of task:
Regex r = new Regex(@"(https?://[^\s]+)");
myString = r.Replace(myString, "<a href=\"$1\">$1</a>");
The regular expression for matching URLs might need a bit of work.
I did this exact same thing with jQuery consuming the JSON API. Here is the linkify function:
String.prototype.linkify = function() {
    return this.replace(/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/, function(m) {
        return m.link(m);
    });
};
This is actually an ugly problem. URLs can contain (and end with) punctuation, so it can be difficult to determine where a URL actually ends, when it's embedded in normal text. For example:
http://example.com/.
is a valid URL, but it could just as easily be the end of a sentence:
I buy all my witty T-shirts from http://example.com/.
You can't simply parse until a space is found, because then you'll keep the period as part of the URL. You also can't simply parse until a period or a space is found, because periods are extremely common in URLs.
Yes, regex is your friend here, but constructing the appropriate regex is the hard part.
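One common heuristic, sketched below, is to match up to the next whitespace and then push trailing sentence punctuation back outside the link. This is only an illustration: the names Linkifier and Linkify and the choice of punctuation characters are assumptions, the heuristic will wrongly trim a URL that really does end in one of those characters, and the sketch does not HTML-encode anything.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
public static class Linkifier
{
    private static readonly Regex UrlPattern = new Regex(@"https?://\S+");

    // Assumption: these characters are treated as sentence punctuation
    // when they appear at the very end of a match.
    private static readonly char[] TrailingPunctuation =
        { '.', ',', ';', ':', '!', '?', ')' };

    public static string Linkify(string text)
    {
        return UrlPattern.Replace(text, m =>
        {
            // Split the match into the URL proper and any trailing punctuation.
            string url = m.Value.TrimEnd(TrailingPunctuation);
            string rest = m.Value.Substring(url.Length);
            return "<a href=\"" + url + "\">" + url + "</a>" + rest;
        });
    }
}
```

With the T-shirt sentence above, the final period ends up after the closing </a> tag rather than inside the href.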
Check out this as well: Expanding URLs with Regex in .NET.
You can get more control over this by using a MatchEvaluator delegate with the regular expression.
Suppose I have this string:
find more on http://www.stackoverflow.com
Now try this code:
private void ModifyString()
{
    string input = "find more on http://www.authorcode.com ";
    Regex regx = new Regex(@"\b((http|https|ftp|mailto)://)?(www.)+[\w-]+(/[\w- ./?%&=]*)?");
    string result = regx.Replace(input, new MatchEvaluator(ReplaceURl));
}
static string ReplaceURl(Match m)
{
    // Wrap the matched text in an anchor tag.
    string x = m.ToString();
    x = "<a href=\"" + x + "\">" + x + "</a>";
    return x;
}
/cheer for RedWolves
from: this.replace(/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/, function(m){...
see: /[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/
There's the code for the addresses "anyprotocol"://"anysubdomain/domain"."anydomainextension and address",
and it's a perfect example for other uses of string manipulation. you can slice and dice at will with .replace and insert proper "a href"s where needed.
I used jQuery to change the attributes of these links to "target=_blank" easily in my content-loading logic even though the .link method doesn't let you customize them.
I personally love tacking on a custom method to the string object for on the fly string-filtering (the String.prototype.linkify declaration), but I'm not sure how that would play out in a large-scale environment where you'd have to organize 10+ custom linkify-like functions. I think you'd definitely have to do something else with your code structure at that point.
Maybe a vet will stumble along here and enlighten us.
