Using Regex on encoded strings - c#

I have the following regex:
@"{0}(.+?)(?:{1}(.{4}?))*(?:{2}(.+?))?{3}", "\\[\\[\\[", "\\|\\|\\|", "\\/\\/\\/", "\\]\\]\\]
To find items wrapped in [[[something]]], [[[something///comment]]].
I am using this to parse something on a web response ...
The problem is that in my web response I have a few things encoded as follows:
%5B%5B%5BPedido%20de%20Informa%C3%A7%C3%A3o%5D%5D%5D
So I am not able to identify that it starts with [[[ and finishes with ]]], along with the other items.
Is there a way to solve this on the regex side?

You can unescape this string with helper functions like:
Uri.UnescapeDataString("%5B%5B%5BPedido%20de%20Informa%C3%A7%C3%A3o%5D%5D%5D");
will produce:
"[[[Pedido de Informação]]]"
Note: There is also HttpUtility.UrlDecode, but it requires adding a reference to System.Web, which is not always wanted.
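A minimal sketch of the decode-first approach: unescape the response text, then run the regex on the result. The pattern here is a simplified stand-in for the one in the question (it only captures the text between [[[ and ]]]), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DecodeThenMatch
{
    // Simplified stand-in for the question's pattern: captures only the text
    // between [[[ and ]]] and ignores the ||| and /// parts.
    private static readonly Regex Wrapped = new Regex(@"\[\[\[(.+?)\]\]\]");

    // Percent-decodes the input, then returns the first wrapped item,
    // or an empty string when nothing matches.
    public static string FirstWrappedItem(string encoded)
    {
        string decoded = Uri.UnescapeDataString(encoded);
        Match match = Wrapped.Match(decoded);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Prints: Pedido de Informação
        Console.WriteLine(FirstWrappedItem("%5B%5B%5BPedido%20de%20Informa%C3%A7%C3%A3o%5D%5D%5D"));
    }
}
```

Decoding first also means the captured text comes back readable, rather than still containing %20 and friends.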

If unescaping the string is not an option, you can use a Noncapturing Group (?:...) and an Alternation Construct | to allow %5B alternatively to [ (same for %5D and ]).
For example, \\[\\[\\[ could be replaced by (?:\\[\\[\\[|%5B%5B%5B). Adapting the complete regex is left as an exercise to the reader.
Note, however, that this will also match [[[...%5D%5D%5D, which might or might not be a problem in your case.
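Here is a rough sketch of what that could look like, again with a simplified pattern that only handles the opening and closing delimiters (the ||| and /// parts would need the same treatment). The names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EncodedBrackets
{
    // Each delimiter is accepted in literal or percent-encoded form.
    private const string Open = @"(?:\[\[\[|%5B%5B%5B)";
    private const string Close = @"(?:\]\]\]|%5D%5D%5D)";

    private static readonly Regex Wrapped = new Regex(Open + "(.+?)" + Close);

    // Returns the text between the delimiters, or an empty string when nothing matches.
    // The returned text is NOT decoded, so %20 etc. stay as they are.
    public static string Inner(string input)
    {
        Match match = Wrapped.Match(input);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Inner("%5B%5B%5BPedido%20de%5D%5D%5D")); // Pedido%20de
        Console.WriteLine(Inner("[[[something]]]"));               // something
    }
}
```

Percent-encoding allows lowercase hex digits too (%5b), so RegexOptions.IgnoreCase may be worth adding if the response is not consistent about that.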

Related

Regular expressions redirection

I want to set redirection from
www.somesite.com/products/dynamicstring/randomtext1/randomtext2
to www.somesite.com/products/dynamicstring
Is it possible to do that through regex?
That means if my incoming URL is
www.somesite.com/products/myproducts/test1/test2 it should redirect to www.somesite.com/products/myproducts/
Some more detail about this:
@TomLord I am using HttpContext.Current.Response.RedirectPermanent(matchingDefinition.To). I have all the redirects' "From" and "To" values in a class object, in the form of regex expressions, for example From "/product/*" and To "/products". I read these objects and try to redirect with them, but I am not able to redirect something like /products/dynamicstring/randomtext1/ to /products/dynamicstring, where dynamicstring is a random string; I can't find any regular expression that can be used to do this. For example, /products/samples/randomtext1 should redirect to /products/samples/
Redirection cannot be done with regex alone. Google a bit about what a regular expression really is. The short answer: it's a string-like expression that describes a search pattern. So it can't redirect, or even replace one substring with another; all it does is match and capture parts of the matched string.
That being said, regex can help us do what you wanna. I am gonna assume you can use Javascript, cause I can't put a solution in every language. I am also gonna assume you will try to go over the code not copy paste and press enter. If you only need that hire a programmer. If you use another language, principle should be the same:
obtain URL
define regex
use capture group to extract the part of your URL that you need
construct a new URL
redirect to it
While matching the URLs in general is a fair bit more complex, like:
^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$
As long as you are sure you will only be getting URLs and in the format you wanna, we will do it far simpler.
Basically, let's say you have:
some text with 2 dots that ends in com
then a /products/dynamicstring/
then text
then /
then text
As a regex that is:
/\w*\.\w*\.com\/products\/dynamicstring\/\w*\/\w*/g
Crude matching is done, but we still need to add a capture group that we will use to extract the part of the string we need:
/(\w*\.\w*\.com\/products\/)dynamicstring\/\w*\/\w*/g
OK, now let's leverage this regex to do the rest of the work:
Define regex (the version with the capture group, so that match[1] is populated):
var regex = /(\w*\.\w*\.com\/products\/)dynamicstring\/\w*\/\w*/g;
Get current URL. If you already have URL use it.
var currUrl = window.location.href;
Extract capture group from string:
var match = regex.exec(currUrl);
Use that to get a new URL from old one:
var redirectUrl = match[1] + "myproducts/";
Finally, we redirect with:
window.location.replace(redirectUrl);
I wrote all this straight from my head so I recommend you go over each step, look how it works, read some documentation about functions used. You might find an error as well as learn a lot.
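Since the question mentions HttpContext.Current.Response.RedirectPermanent, here is a rough C# sketch of the same idea that works on the path only and keeps whatever the second segment happens to be. The class and method names are made up, and it assumes the path always starts with /products/.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProductRedirect
{
    // Captures "/products/<segment>" and requires at least one more character
    // after the following slash, so already-short paths are left alone.
    private static readonly Regex ExtraSegments = new Regex(@"^(/products/[^/]+)/.+$");

    // Returns the shortened path, or the original path when there is nothing to trim.
    public static string Shorten(string path)
    {
        Match match = ExtraSegments.Match(path);
        return match.Success ? match.Groups[1].Value + "/" : path;
    }

    public static void Main()
    {
        Console.WriteLine(Shorten("/products/myproducts/test1/test2")); // /products/myproducts/
        Console.WriteLine(Shorten("/products/samples/randomtext1"));    // /products/samples/
        Console.WriteLine(Shorten("/products/myproducts/"));            // unchanged
    }
}
```

The result of Shorten could then be handed to RedirectPermanent when it differs from the incoming path.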

How do I do the following using only regex?

Say I have the following string
[id={somecomplexuniquestring}test1],
[id={somecomplexuniquestring}test2],[id={somecomplexuniquestring}test3],
[id={somecomplexuniquestring}test4],[id={somecomplexuniquestring}test5],
[id={somecomplexuniquestring}test6],[id={somecomplexuniquestring}test7],
[id={somecomplexuniquestring}test8],[id={somecomplexuniquestring}test9]
is there a way just using regex to get the following result [id={somecomplexuniquestring}test6]
{somecomplexuniquestring} are unknown strings which cannot be used in the regex.
For example, the following will not work @"[id=[\s\S]+?test6]" as it starts from the very first id.
Is using RegEx the best solution? You have tagged C#, so would
variableWithString.Split(',').Any(x => x.Contains("test6"));
tell you whether a match exists, or
result = variableWithString.Split(',').Where(x => x.Contains("test6"));
give you the match value you are seeking?
This doesn't work??
\[id={.*?}test6\]
This all depends on exactly what the limitations of somecomplexuniquestring are. For example, if you have a guarantee that they do not contain any [ or ] characters, you can use this simple one:
"\[[^\[\]]*test6\]"
Similarly, if it could contain square brackets but no curly braces, you can do something similar:
"\[id={[^{}]*}test6\]"
HOWEVER, if you have no such guarantee, and there's some sort of escaping system for including {} or [] in that string, then you need to let us know how that works to properly answer.
You can use this pattern:
@"\[[^]]*]"
If you want a specific test number you can do this:
@"\[id={[^}]*}test6]"
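A small sketch of that last pattern in use, assuming (as discussed above) that the unique string never contains a closing curly brace. The braces and brackets are escaped here for readability only, and the sample ids are invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketItem
{
    // Finds the whole "[id={...}suffix]" item for the given suffix.
    // [^}]* cannot cross a closing brace, so the match cannot begin at an earlier item.
    public static string Find(string input, string suffix)
    {
        string pattern = @"\[id=\{[^}]*\}" + Regex.Escape(suffix) + @"\]";
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        string input = "[id={a1}test1],[id={b-2}test6],[id={c3}test7]";
        Console.WriteLine(Find(input, "test6")); // [id={b-2}test6]
    }
}
```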

Deal with '#' through regex

Quick question: I have been trying to match any word containing a '#' in a list of strings and remove it, but I don't know how to handle it. I've been playing around on http://regexhero.net/tester/ but to no avail.
Essentially, if it comes across #ff or wha#s up, I will just Regex.Replace them.
Any ideas on the regular expression to use?
Thanks.
Don't use regex - just use string.replace - it's a lot faster.
I have a previous answer that covers some hashtag matching approaches.
In summary, if you are pulling statuses containing hashtags from Twitter, you no longer need to find them yourself. You can now specify the include_entities parameter to have Twitter automatically call out mentions, links, and hashtags (if the method you are calling, like statuses/show, supports this parameter).
If you just need the regular expression to locate the hashtags and capture their elements, Twitter provides it in an open source library that contains the following pattern.
(^|[^0-9A-Z&/]+)(#|\uFF03)([0-9A-Z_]*[A-Z_]+[a-z0-9_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff]*)
More detail and additional links are provided in the original answer.
So you're trying to remove any words containing a #?
If so, give this a try...
\w*#\w*
And replace with nothing, like so...
http://regexhero.net/tester/?id=cda1e713-bdab-4aa2-b63d-a87e9b2c9bce
apple# orange ban#ana becomes orange
But if you're simply trying to remove all instances of #, then String.Replace is the better choice. myString = myString.Replace("#", "");
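For reference, a minimal sketch of the word-removal variant in C#. Collapsing the leftover whitespace is an extra step added here, not part of the pattern above, and the names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashWords
{
    // Removes every word that contains a '#', then tidies up the gaps left behind.
    public static string Remove(string input)
    {
        string stripped = Regex.Replace(input, @"\w*#\w*", "");
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Remove("apple# orange ban#ana")); // orange
        Console.WriteLine(Remove("#ff or wha#s up"));       // or up
    }
}
```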

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.
Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx
Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.
What about modifying the string so that it is well-formed XML, then loading it as XML and extracting all the values of the supportemail attribute?
Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Groups[1].Value);
See it here.
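Put together as a self-contained sketch, with the emails joined by commas as the question asks. It uses the same pattern, so it assumes double-quoted attribute values; the class name and the sample addresses are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SupportEmails
{
    // Collects every supportemail value and joins them with commas.
    public static string Extract(string input)
    {
        var emails = new List<string>();
        foreach (Match m in Regex.Matches(input, "supportemail=\"([^\"]+)"))
        {
            emails.Add(m.Groups[1].Value);
        }
        return string.Join(",", emails);
    }

    public static void Main()
    {
        string input =
            "<appid=\"1\" appname=\"a\" supportemail=\"me@mydomain.com\" />" +
            "<appid=\"2\" appname=\"b\" supportemail=\"you@example.org\" />";
        Console.WriteLine(Extract(input)); // me@mydomain.com,you@example.org
    }
}
```

The joined string can then be written out in one go, for example with File.WriteAllText and a path of your choosing.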
Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backed-up files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way, I should replace the tags with the information. I could do that using Regex.Replace and another regex, but I feel it's a bit weird doing that and there might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]
... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simply replace the %? with their actual values.
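A minimal sketch of that replacement, assuming %m and %y are a two-digit month and year (to line up with the \d{2} groups in the question) and %u is the user name. The class and method names are made up.

```csharp
using System;

public static class BackupNameBuilder
{
    // Fills the placeholders in a format such as "backup_%m_%y_%u.bak".
    // The user name is substituted last so that its text is never re-scanned
    // for the date placeholders.
    public static string Build(string format, DateTime date, string userName)
    {
        return format
            .Replace("%m", date.Month.ToString("00"))
            .Replace("%y", (date.Year % 100).ToString("00"))
            .Replace("%u", userName);
    }

    public static void Main()
    {
        // Prints: backup_03_24_alice.bak
        Console.WriteLine(Build("backup_%m_%y_%u.bak", new DateTime(2024, 3, 5), "alice"));
    }
}
```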
It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains an error: an unescaped "." matches any character, and \w only matches one word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...
With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.
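If the regex is kept for reading names back, the named groups of the corrected pattern give the parts directly. A rough sketch, with a made-up class name and sample file name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackupNameParser
{
    // The corrected pattern from the answer above, written as a verbatim string.
    private static readonly Regex BackupName = new Regex(
        @"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$");

    // Returns "month/year user" for a matching file name, or an empty string otherwise.
    public static string Describe(string fileName)
    {
        Match m = BackupName.Match(fileName);
        if (!m.Success) return string.Empty;
        return m.Groups["Month"].Value + "/" + m.Groups["Year"].Value + " " + m.Groups["Username"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("backup_03_24_alice.bak")); // 03/24 alice
    }
}
```

Unlike splitting on underscores, this still works when the user name itself contains an underscore, because \w+ matches underscores as well.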
Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
Regex rgx = new Regex(@"\(\?\<Month\>.+?\)");
string fileName = rgx.Replace(@"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
, DateTime.Now.Month.ToString());
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?
