Regular expression to get facebook profile - c#

I am trying to match a Facebook URL and extract the profile from it in a single regular expression:
I have:
http://www.facebook.com/profile.php?id=123456789
https://facebook.com/someusername
I need:
123456789
someusername
using this regular expression:
(?<=(https?://(www.)?facebook.com/(profile.php?id=)?))([^/#?]+)
I get:
profile.php
someusername
What's wrong?

I advise you to use the System.Uri class to get this information. It does the difficult work for you and can handle all sorts of edge cases.
var profileUri = new Uri(@"http://www.facebook.com/profile.php?id=123456789");
var usernameUri = new Uri(@"https://facebook.com/someusername");
Console.Out.WriteLine(profileUri.Query); // prints "?id=123456789"
Console.Out.WriteLine(usernameUri.AbsolutePath); // prints "/someusername"
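To go from those two properties to the bare values asked for in the question, something like the following might work. It is only a sketch: the class and method names (FacebookProfileSketch, GetProfile) are made up for illustration, the host is not checked, and the query string is parsed by hand rather than with a library helper.

```csharp
using System;

static class FacebookProfileSketch
{
    // Hypothetical helper: returns the numeric id for profile.php URLs,
    // otherwise the first path segment. Returns null when neither is present.
    public static string GetProfile(string url)
    {
        var uri = new Uri(url);

        if (uri.AbsolutePath.Equals("/profile.php", StringComparison.OrdinalIgnoreCase))
        {
            // Query looks like "?id=123456789&foo=bar"; scan the pairs by hand.
            foreach (var pair in uri.Query.TrimStart('?').Split('&'))
            {
                var parts = pair.Split(new[] { '=' }, 2);
                if (parts.Length == 2 && parts[0] == "id")
                    return Uri.UnescapeDataString(parts[1]);
            }
            return null;
        }

        // "/someusername" or "/someusername/photos" -> "someusername"
        var segments = uri.AbsolutePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        return segments.Length > 0 ? segments[0] : null;
    }
}
```

Calling FacebookProfileSketch.GetProfile on the two URLs from the question should give back 123456789 and someusername.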

I agree with others on using System.Uri but your regex needs two modifications to work:
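Escape the ? with a \ so it is matched literally: (profile\.php\?id=)
Add (\n|$) at the end so the match has to run to the end of the line. Without it, the lookbehind is already satisfied right after facebook.com/ (the optional group matches nothing), so the match is still profile.php.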
(?<=(https?://(www\.)?facebook\.com/(profile\.php\?id=)?))([^/#?]+)(\n|$)

The following example writes the query ?id=123456789 to the console.
Uri baseUri = new Uri ("http://www.facebook.com/");
Uri myUri = new Uri (baseUri, "/profile.php?id=123456789");
Console.WriteLine(myUri.Query);
Hope this Helps!

Try this:
https?://(?:www\.)?facebook\.com/(?:profile\.php\?id=)?(.+)
or
https?://(?:www\.)?facebook\.com/(?:profile\.php\?id=)?([^/#\?]+)
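With a capture group instead of a lookbehind, the value is read from Groups[1] rather than from the whole match. A minimal sketch of how the second pattern could be used (the wrapper class and method names are hypothetical, and the dots are escaped so they only match a literal dot):

```csharp
using System;
using System.Text.RegularExpressions;

static class FacebookRegexSketch
{
    // The optional non-capturing group swallows "profile.php?id=" when present,
    // so group 1 holds either the numeric id or the username.
    private static readonly Regex ProfileRegex = new Regex(
        @"https?://(?:www\.)?facebook\.com/(?:profile\.php\?id=)?([^/#?]+)");

    // Returns the captured id/username, or null if the URL does not match.
    public static string GetProfile(string url)
    {
        Match match = ProfileRegex.Match(url);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```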

Related

Regex to validate url contains culture code

I have a web application that reads the current URL, which contains a culture code.
There are two URL patterns in which I have to validate the culture code.
First pattern:
var url = "http://localhost:1469/en-US";
Second pattern:
var url = "http://localhost:1469/somepagename/en-US";
Please suggest a regex that checks whether the URL contains a culture code.
Just to answer your question, you can try something like this:
string pat = @"[a-z]{2}-[A-Z]{2}";
Regex regex = new Regex(pat);
var url1 = "http://localhost:1469/en-US";
var url2 = "http://localhost:1469/somepagename/en-US";
Console.WriteLine(regex.Matches(url1)[0]);
Console.WriteLine(regex.Matches(url2)[0]);
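The pattern above matches anywhere in the string, so something like xx-YY in the middle of a page name would also count. If the culture code is always the last path segment, as in both examples, a stricter check could look like this sketch (CultureUrlSketch and TryGetCulture are illustrative names, and the "last segment" rule is an assumption based on the two sample URLs):

```csharp
using System;
using System.Text.RegularExpressions;

static class CultureUrlSketch
{
    // The last path segment must look like "en-US"; a trailing slash is tolerated.
    private static readonly Regex CultureAtEnd = new Regex(@"/([a-z]{2}-[A-Z]{2})/?$");

    // Hypothetical helper: true when the URL path ends in a culture code.
    public static bool TryGetCulture(string url, out string culture)
    {
        // Match against the path only, so the port and query string are ignored.
        Match match = CultureAtEnd.Match(new Uri(url).AbsolutePath);
        culture = match.Success ? match.Groups[1].Value : null;
        return match.Success;
    }
}
```

This only checks the shape of the code; whether en-US is a culture the application actually supports would need a separate lookup.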

RegEx to extract a sub level from url

I have the following set of URLs:
http://test/mediacenter/Photo Gallery/Conf 1/1.jpg
http://test/mediacenter/Photo Gallery/Conf 2/3.jpg
http://test/mediacenter/Photo Gallery/Conf 3/Conf 4/1.jpg
All I want to do is extract Conf 1, Conf 2 and Conf 3 from the URLs, that is, the level after 'Photo Gallery'. (The URLs are not static, but they all share the common level Photo Gallery.)
Any help is appreciated.
Is it necessary to use Regex? You can get it without Regex like this:
string str = @"http://test/mediacenter/Photo Gallery/Conf 1/1.jpg";
var z = str.Split('/')[5]; // "Conf 1"
or
var x = new Uri(str).Segments[3]; // "Conf%201/" (escaped, with a trailing slash)
This ought to do you:
var s = @"http://test/mediacenter/Photo Gallery/Conf 11/1.jpg";
var regex = new Regex(@"(Conf \d*)");
var match = regex.Match(s);
Console.WriteLine(match.Groups[0].Value); // Prints "Conf 11"
Of course, you'd have to be confident the 'Conf x' (where x is a number) wasn't going to be elsewhere in the URL.
This will improve it slightly by stripping off multiple folders (Conf 3/Conf 4) in your example.
var regex = new Regex(@"((Conf \d*/*)+)");
It leaves the trailing / though.
No need for regex.
string testCase = "http://test/mediacenter/Photo Gallery/Conf 1/1.jpg";
string urlBase = "http://test/mediacenter/Photo Gallery/";
if (!testCase.StartsWith(urlBase))
{
    throw new Exception("URL supplied doesn't belong to base URL.");
}
Uri uriTestCase = new Uri(testCase);
Uri uriBase = new Uri(urlBase);
if (uriTestCase.Segments.Length > uriBase.Segments.Length)
{
    System.Console.Out.WriteLine(uriTestCase.Segments[uriBase.Segments.Length]);
}
else
{
    Console.Out.WriteLine("No child segment...");
}
Try a RegEx like this.
Conf[^\/]*
This should give you all "Conf" Parts of the URLs.
I hope that helps.
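Since the only fixed point is the Photo Gallery level, another option is to look that segment up by name instead of relying on a fixed index. A rough sketch (GallerySegmentSketch and GetSegmentAfter are made-up names; it assumes the parent name appears once in the path):

```csharp
using System;

static class GallerySegmentSketch
{
    // Hypothetical helper: returns the path segment directly after parentName,
    // unescaped and without the trailing slash, or null if parentName is not
    // found or is the last segment.
    public static string GetSegmentAfter(string url, string parentName)
    {
        // Uri.Segments yields escaped pieces such as "Photo%20Gallery/".
        string[] segments = new Uri(url).Segments;
        for (int i = 0; i < segments.Length - 1; i++)
        {
            string name = Uri.UnescapeDataString(segments[i]).TrimEnd('/');
            if (name == parentName)
                return Uri.UnescapeDataString(segments[i + 1]).TrimEnd('/');
        }
        return null;
    }
}
```

Note that for a file sitting directly under Photo Gallery this returns the file name, so a caller may want to check for that case.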

How to decode Javascript Unicode into C# strings

For example the JSON callback we get on a google autosearch:
window.google.td && window.google.td('tljp1322487273527014', 4,{e:"HY7TTtmRFZPe8QPCvf30Dw",c:1,u:"http://www.google.co.uk/s?hl\x3den\x26cp\x3d5\x26gs_id\x3d17\x26xhr\x3dt\x26q\x3dowasp\x26pf\x3dp\x26sclient\x3dpsy-ab\x26source\x3dhp\x26pbx\x3d1\x26oq\x3d\x26aq\x3d\x26aqi\x3d\x26aql\x3d\x26gs_sm\x3d\x26gs_upl\x3d\x26bav\x3don.2,or.r_gc.r_pw.,cf.osb\x26fp\x3dbd20912ccdf288ab\x26biw\x3d387\x26bih\x3d362\x26tch\x3d4\x26ech\x3d15\x26psi\x3d5o3TTqCqCsnD0QXA7sUI.1322487273527.1\x26wrapid\x3dtljp1322487273527014",d:"[\x22owasp\x22,[[\x22owasp\x22,0,\x220\x22],[\x22owasp\\u003Cb\\u003E top 10\\u003C\\/b\\u003E\x22,0,\x221\x22],[\x22owasp\\u003Cb\\u003E top 10 2011\\u003C\\/b\\u003E\x22,0,\x222\x22],[\x22owasp\\u003Cb\\u003E zap\\u003C\\/b\\u003E\x22,0,\x223\x22]],{\x22j\x22:\x2217\x22}]"});window.google.td && window.google.td('tljp1322487273527014', 4,{e:"HY7TTtmRFZPe8QPCvf30Dw",c:0,u:"http://www.google.co.uk/s?hl\x3den\x26cp\x3d5\x26gs_id\x3d17\x26xhr\x3dt\x26q\x3dowasp\x26pf\x3dp\x26sclient\x3dpsy-ab\x26source\x3dhp\x26pbx\x3d1\x26oq\x3d\x26aq\x3d\x26aqi\x3d\x26aql\x3d\x26gs_sm\x3d\x26gs_upl\x3d\x26bav\x3don.2,or.r_gc.r_pw.,cf.osb\x26fp\x3dbd20912ccdf288ab\x26biw\x3d387\x26bih\x3d362\x26tch\x3d4\x26ech\x3d15\x26psi\x3d5o3TTqCqCsnD0QXA7sUI.1322487273527.1\x26wrapid\x3dtljp1322487273527014",d:""});
more specifically, how to go from:
"\x22te\\u003Cb\\u003Esco\\u003C\\/b\\u003E\x22,0,\x220\x22"
to
"te\u003Cb\u003Esco\u003C\/b\u003E",0,"0"
to
"te<b>sco</b>"
Note that the System.Web UrlDecode and HtmlDecode are not able to handle this.
Interestingly, the AntiXss almost does the reverse, since it can go from:
"te<b>sco</b>"
To
te\00003Cb\00003Esco\00003C\00002Fb\00003E
Security angle
These decodings have a number of security implications since they will be rendered by the browser. For example if in Javascript/jQuery we have a variable with the payload
var xss = "te\u003Cscript\u003Ealert\u002812\u0029\u003C\u002Fscript\u003E"
it will be triggered if assigned to a div's HTML:
$("#header").html(xss)
The \x.. escapes are a surprise; the \u ones are fine.
According to previous answer:
string str = @"P\u003e\u003cp\u003e Notes \u003cstrong\u003e Разработчик: \u003c/STRONG\u003e \u003cbr /\u003eЕсли игра Безразлично";
Regex regex = new Regex(@"\\u([0-9a-z]{4})",RegexOptions.IgnoreCase);
str = regex.Replace(str, match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value , System.Globalization.NumberStyles.HexNumber)));
It appears that "\x22te\\u003Cb\\u003Esco\\u003C\\/b\\u003E\x22,0,\x220\x22" is hex encoded. Nothing decodes this string out of the box, but the following should work:
var regex = new Regex(@"\\x([a-fA-F0-9]{2})");
var replaced = regex.Replace(input, match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)));
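The two regexes above each handle one kind of escape, and neither deals with the doubled backslashes in the question's string. A possible way to combine them is a single pattern that also unescapes \\, \/ and quotes, applied once per encoding layer. This is a sketch under assumptions: JsEscapeSketch and DecodeOnce are invented names, and control escapes such as \n or \t are deliberately left untouched.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class JsEscapeSketch
{
    // Group 1: \xNN, group 2: \uNNNN, group 3: an escaped \ / " or '.
    private static readonly Regex Escape = new Regex(
        @"\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|([\\/""']))");

    // Removes exactly one layer of JavaScript string escaping.
    public static string DecodeOnce(string input)
    {
        return Escape.Replace(input, m =>
        {
            if (m.Groups[3].Success)
                return m.Groups[3].Value; // "\\" -> "\", "\/" -> "/"

            string hex = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            // Cast to char so every UTF-16 code unit, surrogates included, is accepted.
            return ((char)int.Parse(hex, NumberStyles.HexNumber)).ToString();
        });
    }
}
```

Because the sample string is encoded twice, DecodeOnce has to be called twice: the first call gives the intermediate form with single \u escapes, the second gives the HTML.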

C# regex to get video id from youtube and vimeo by url

I'm trying to create two regular expressions to extract the ID from YouTube and Vimeo video URLs. I already have the following expressions:
YouTube: (youtube\.com/)(.*)v=([a-zA-Z0-9-_]+)
Vimeo: vimeo\.com/([0-9]+)$
As shown below, there are two types of URLs matched by the regular expressions I have already created. Several other types of Vimeo and YouTube URLs aren't covered by them. Ideally, all of this would be covered by two expressions: one for all Vimeo videos and one for all YouTube videos. I've been experimenting with different expressions, but with no success so far. I'm still trying to master regular expressions, so I hope I'm on the right track and somebody can help me out. If more information is required, please let me know!
VIMEO URLs NOT MATCHED:
http://vimeo.com/channels/hd#11384488
http://vimeo.com/groups/brooklynbands/videos/7906210
http://vimeo.com/staffpicks#13561592
YOUTUBE URLs NOT MATCHED
http://www.youtube.com/user/username#p/a/u/1/bpJQZm_hkTE
http://www.youtube.com/v/bpJQZm_hkTE
http://youtu.be/bpJQZm_hkTE
URLs Matched
http://www.youtube.com/watch?v=bWTyFIYPtYU&feature=popular
http://vimeo.com/834881
The idea is to match all the URLs mentioned above with two regular expressions: one for Vimeo and one for YouTube.
UPDATE AFTER ANSWER Sedith:
This is how my expressions look now
public static readonly Regex VimeoVideoRegex = new Regex(@"vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
public static readonly Regex YoutubeVideoRegex = new Regex(@"youtu(?:\.be|be\.com)/(?:(.*)v(/|=)|(.*/)?)([a-zA-Z0-9-_]+)", RegexOptions.IgnoreCase);
And in code i have
var youtubeMatch = url.match(YoutubeVideoRegex );
var vimeoMatch = url.match(VimeoVideoRegex );
var youtubeIndex = (youtubeMatch.length - 1)
var youtubeId = youtubeMatch[youtubeIndex];
As you can see, I now need to find the index of the video ID in the array of matches returned by the regex. But I want it to return only the ID itself, so I don't need to modify the code if YouTube or Vimeo ever decide to change their URLs. Any tips on this?
I had a play around with the examples and came up with these:
Youtube: youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)
Vimeo: vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)
And they should match all those given. The (?: ...) means that everything inside the bracket won't be captured. So only the id should be obtained.
I'm a bit of a regex novice myself, so don't be surprised if someone else comes in here screaming not to listen to me, but hopefully these will be of help.
I find this website extremely useful in working out the patterns: http://www.regexpal.com/
Edit:
get the id like so:
string url = ""; //url goes here!
Match youtubeMatch = YoutubeVideoRegex.Match(url);
Match vimeoMatch = VimeoVideoRegex.Match(url);
string id = string.Empty;
if (youtubeMatch.Success)
id = youtubeMatch.Groups[1].Value;
if (vimeoMatch.Success)
id = vimeoMatch.Groups[1].Value;
That works in plain old c#.net, can't vouch for asp.net
In case you are writing some application with view model (e.g. ASP.NET MVC):
public string YouTubeUrl { get; set; }
public string YouTubeVideoId
{
    get
    {
        var youtubeMatch =
            new Regex(@"youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)")
                .Match(this.YouTubeUrl);
        return youtubeMatch.Success ? youtubeMatch.Groups[1].Value : string.Empty;
    }
}
Vimeo:
vimeo\.com/(?:.*#|.*/)?([0-9]+)
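To check the patterns against every URL listed in the question, they can be wrapped in one small helper. The following is a sketch, not a hardened parser: VideoIdSketch and GetVideoId are illustrative names, the Vimeo pattern is the /videos/ variant from the earlier answer, and the hyphen in the character class is moved to the end so it is unambiguously literal.

```csharp
using System;
using System.Text.RegularExpressions;

static class VideoIdSketch
{
    // Only group 1 captures; every other group is non-capturing.
    private static readonly Regex YouTube = new Regex(
        @"youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9_-]+)");

    private static readonly Regex Vimeo = new Regex(
        @"vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)");

    // Hypothetical helper: returns the video id, or null if neither pattern matches.
    public static string GetVideoId(string url)
    {
        Match match = YouTube.Match(url);
        if (match.Success)
            return match.Groups[1].Value;

        match = Vimeo.Match(url);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Since the id is always in Groups[1], the calling code does not depend on how many groups the pattern happens to contain.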

Regex implementation

I have encountered this piece of code that is supposed to determine the parent url in a hierarchy of dynamic (rewritten) urls. The basic logic goes like this:
"/testing/parent/default.aspx" --> "/testing/default.aspx"
"/testing/parent.aspx" --> "/testing/default.aspx"
"/testing/default.aspx" --> "/default.aspx"
"/default.aspx" --> null
...
private string GetParentUrl(string url)
{
    string parentUrl = url;
    if (parentUrl.EndsWith("Default.aspx", StringComparison.OrdinalIgnoreCase))
    {
        parentUrl = parentUrl.Substring(0, parentUrl.Length - 12);
        if (parentUrl.EndsWith("/"))
            parentUrl = parentUrl.Substring(0, parentUrl.Length - 1);
    }
    int i = parentUrl.LastIndexOf("/");
    if (i < 2) return null;
    parentUrl = parentUrl.Substring(0, i + 1);
    return string.Format(CultureInfo.InvariantCulture, "{0}Default.aspx", parentUrl);
}
This code works but it smells to me. It will not work with urls that have a querystring. How can I improve it using regex?
Have a look at the answers to SO question "Getting the parent name of a URI/URL from absolute name C#"
This will show you how to use System.Uri to access the segments of an URL. System.Uri also allows to manipulate the URL in the way you want (well, not the custom logic) without the danger of creating invalid URLs. There is no need to hack your own functions to dissect URLs.
A straightforward approach is to split the URL at "?", work on the path part only, and append the query string again at the end.
I recommend not using Regex in this scenario. A regex that solves this task would be the real code smell. The code above isn't so bad; use f3lix's and Leon Shmulevich's recommendations to make it better.
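The split-at-"?" idea could be sketched like this. It is an illustration under assumptions, not a drop-in replacement: ParentUrlSketch is a made-up name, the query string is simply carried over to the parent URL, and the null check uses i < 0 (the original uses i < 2) so that the root-relative examples from the question come out as listed.

```csharp
using System;

static class ParentUrlSketch
{
    private const string DefaultPage = "Default.aspx";

    public static string GetParentUrl(string url)
    {
        // Split off the query string so the path logic never sees it.
        int q = url.IndexOf('?');
        string path = q >= 0 ? url.Substring(0, q) : url;
        string query = q >= 0 ? url.Substring(q) : string.Empty;

        // ".../parent/default.aspx" -> ".../parent"
        if (path.EndsWith(DefaultPage, StringComparison.OrdinalIgnoreCase))
        {
            path = path.Substring(0, path.Length - DefaultPage.Length);
            if (path.EndsWith("/", StringComparison.Ordinal))
                path = path.Substring(0, path.Length - 1);
        }

        // Assumption: no slash left means we were already at the root.
        int i = path.LastIndexOf('/');
        if (i < 0) return null;

        // Keep everything up to and including the last slash, then re-append the query.
        return path.Substring(0, i + 1) + DefaultPage + query;
    }
}
```

For example, "/testing/parent/default.aspx?id=1" should come back as "/testing/Default.aspx?id=1", and "/default.aspx?x=1" as null.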
