C# regex to get video id from youtube and vimeo by url - c#

I'm trying to create two regular expressions to extract the ID from YouTube and Vimeo video URLs. I've already got the following expressions:
YouTube: (youtube\.com/)(.*)v=([a-zA-Z0-9-_]+)
Vimeo: vimeo\.com/([0-9]+)$
As shown below, my current expressions match only two types of URL. Several other types of URL from Vimeo and YouTube aren't covered by them. Ideally all of this would be covered by two expressions: one for all Vimeo videos and one for all YouTube videos. I've been experimenting with different expressions, but with no success so far. I'm still trying to master regular expressions, so I hope I'm on the right track and somebody can help me out! If more information is required, please let me know!
VIMEO URLs NOT MATCHED:
http://vimeo.com/channels/hd#11384488
http://vimeo.com/groups/brooklynbands/videos/7906210
http://vimeo.com/staffpicks#13561592
YOUTUBE URLs NOT MATCHED
http://www.youtube.com/user/username#p/a/u/1/bpJQZm_hkTE
http://www.youtube.com/v/bpJQZm_hkTE
http://youtu.be/bpJQZm_hkTE
URLs Matched
http://www.youtube.com/watch?v=bWTyFIYPtYU&feature=popular
http://vimeo.com/834881
The idea is to match all the URLs mentioned above with two regular expressions: one for Vimeo and one for YouTube.
UPDATE AFTER ANSWER Sedith:
This is how my expressions look now
public static readonly Regex VimeoVideoRegex = new Regex(@"vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
public static readonly Regex YoutubeVideoRegex = new Regex(@"youtu(?:\.be|be\.com)/(?:(.*)v(/|=)|(.*/)?)([a-zA-Z0-9-_]+)", RegexOptions.IgnoreCase);
And in code i have
var youtubeMatch = url.match(YoutubeVideoRegex );
var vimeoMatch = url.match(VimeoVideoRegex );
var youtubeIndex = (youtubeMatch.length - 1)
var youtubeId = youtubeMatch[youtubeIndex];
As you can see, I now need to find the index of the video ID in the array of matches returned from the regex. But I want it to return only the ID itself, so I don't need to modify the code if YouTube or Vimeo ever decide to change their URLs. Any tips on this?

I had a play around with the examples and came up with these:
Youtube: youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)
Vimeo: vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)
And they should match all those given. The (?: ...) syntax is a non-capturing group: everything inside the brackets is matched but not captured, so the only capture group left is the ID.
I'm a bit of a regex novice myself, so don't be surprised if someone else comes in here screaming not to listen to me, but hopefully these will be of help.
I find this website extremely useful in working out the patterns: http://www.regexpal.com/
Edit:
get the id like so:
string url = ""; //url goes here!
Match youtubeMatch = YoutubeVideoRegex.Match(url);
Match vimeoMatch = VimeoVideoRegex.Match(url);
string id = string.Empty;
if (youtubeMatch.Success)
id = youtubeMatch.Groups[1].Value;
if (vimeoMatch.Success)
id = vimeoMatch.Groups[1].Value;
That works in plain old c#.net, can't vouch for asp.net
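To return only the ID without hunting for an index, the two patterns above could be wrapped in one helper. This is a minimal sketch, not a definitive implementation: the class name VideoIdParser and the method GetVideoId are made up for illustration, and the character class is written as [a-zA-Z0-9_-] (hyphen last) purely to avoid any ambiguity about the hyphen.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the question.
public static class VideoIdParser
{
    // Both patterns use non-capturing groups, so the ID is always group 1.
    private static readonly Regex YoutubeVideoRegex = new Regex(
        @"youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9_-]+)",
        RegexOptions.IgnoreCase);

    private static readonly Regex VimeoVideoRegex = new Regex(
        @"vimeo\.com/(?:.*#|.*/videos/)?([0-9]+)",
        RegexOptions.IgnoreCase);

    // Returns the video ID, or null when the URL matches neither pattern.
    public static string GetVideoId(string url)
    {
        Match match = YoutubeVideoRegex.Match(url);
        if (!match.Success)
        {
            match = VimeoVideoRegex.Match(url);
        }
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Because the calling code only ever reads Groups[1], a later change to the URL formats should only require editing the patterns.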

In case you are writing some application with view model (e.g. ASP.NET MVC):
public string YouTubeUrl { get; set; }
public string YouTubeVideoId
{
get
{
var youtubeMatch =
new Regex(@"youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)")
.Match(this.YouTubeUrl);
return youtubeMatch.Success ? youtubeMatch.Groups[1].Value : string.Empty;
}
}

Vimeo:
vimeo\.com/(?:.*#|.*/)?([0-9]+)

Related

Replacing html content in a string

I have a string that has html contents such as:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
What I need in the end is:
string myMessage = "Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given";
I can do this by replacing each string with myMessage = myMessage.Replace("string to replace", ""); but then I have to list each string and replace it with an empty one. Could there be a better solution?
If I understand you correctly you have a larger text with multiple occurrences of "<a ....>" and actually you want to replace that entire thing by simply only the URL given in the href.
Not sure if this makes it so much easier for you but you could use Regex.Matches something like e.g.
var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var matches = Regex.Matches(myMessage, "(.+?)<a.+?href=\"(.+?)\".+?<\\/a>(.+?)");
var strBuilder = new StringBuilder();
foreach (Match match in matches)
{
var groups = match.Groups;
strBuilder.Append(groups[1]) // Please the website for more information (
.Append(groups[2]) // http://www.africau.edu/images/default/sample.pdf
.Append(groups[3]); // )
}
Debug.Log(strBuilder.ToString());
So what does this do?
(.+?) will create a group for everything before the first encounter of the following <a => groups[1]
<a.+?href=" matches everything starting with <a and ending with href=" => ignored
(.+?) will create a group for everything between href=" and the next " (so the URL) => groups[2]
".+?<\/a> matches everything from the " until the next </a> => ignored
(.+?) will create a group for everything after the </a> => groups[3]
and groups[0] is the entire match.
so finally we just want to combine
groups[1] + groups[2] + groups[3]
but in a loop so we find possibly multiple matches within the same string and it is simply more efficient to use a StringBuilder for that.
Result
Please the website for more information (http://www.africau.edu/images/default/sample.pdf)
you can simply adjust this to e.g. also remove the ( ) or include the text between the tags but I figured actually this makes the most sense for now.
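As one possible adjustment, the same idea can be written as a single Regex.Replace call that also drops the surrounding parentheses and keeps the link text, which gives the string the question asks for. This is only a sketch: the class name AnchorFlattener and method ReplaceAnchorsWithUrls are invented here, and the pattern assumes double-quoted href values and no nested tags inside the link.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class AnchorFlattener
{
    // Replaces each <a ... href="URL" ...>text</a>, together with an optional
    // "(" before and ")" after it, with the URL followed by the link text.
    public static string ReplaceAnchorsWithUrls(string html)
    {
        // Group 1 = href value, group 2 = text between the tags.
        const string pattern = @"\(?<a\b[^>]*?href=""([^""]*)""[^>]*>(.*?)</a>\)?";
        return Regex.Replace(html, pattern, "$1$2",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Regex.Replace handles every occurrence in the string, so no explicit loop or StringBuilder is needed for this variant.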
I personally don't like to rely on the string format always being what I expect as this can lead to errors down the road.
Instead, I offer two ways I can think of doing this:
Use regular expressions:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var capturePattern = @"(.+)\(<a .*href.*?=""(.*?)"".*>(.*)</a>\)";
var regex = new Regex(capturePattern);
var captures = regex.Match(myMessage);
var newString = $"{captures.Groups[1]}{captures.Groups[2]}{captures.Groups[3]}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given
Of course, regular expressions are only as good as the cases you can think of/test. I wrote this up quickly just to illustrate so make sure to verify for other string variations.
The other way is using HTMLAgilityPack:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var doc = new HtmlDocument();
doc.LoadHtml(myMessage);
var prefix = doc.DocumentNode.ChildNodes[0].InnerText;
var url = doc.DocumentNode.SelectNodes("//a[@href]").First().GetAttributeValue("href", string.Empty);
var suffix= doc.DocumentNode.ChildNodes[1].InnerText + doc.DocumentNode.ChildNodes[2].InnerText;
var newString = $"{prefix}{url}{suffix}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information (http://www.africau.edu/images/default/sample.pdf easy details given)
Notice this method preserves the parentheses around the link. This is because, from the agility pack's perspective, the opening parenthesis is part of the text of the first node. You can always remove them with a quick replace.
This method adds a dependency, but the library is very mature and has been around for a long time.
It goes without saying that for both methods, you should make sure to add [error handling] checks for unexpected conditions.

Regex required for renaming file in C#

I need a regex for renaming files in C#. My file name is 22px-Flag_Of_Sweden.svg.png. I want to rename it to sweden.png.
So for that I need a regex. Please help me.
I have more than 300 files like the ones below:
22px-Flag_Of_Sweden.svg.png - should become sweden.png
13px-Flag_Of_UnitedStates.svg.png - unitedstates.png
17px-Flag_Of_India.svg.png - india.png
22px-Flag_Of_Ghana.svg.png - ghana.png
These are actually country flags. I want to extract Countryname.Fileextension. That's all.
var fileNames = new [] {
"22px-Flag_Of_Sweden.svg.png"
,"13px-Flag_Of_UnitedStates.svg.png"
,"17px-Flag_Of_India.svg.png"
,"22px-Flag_Of_Ghana.svg.png"
,"asd.png"
};
var regEx = new Regex(@"^.+Flag_Of_(?<country>.+)\.svg\.png$");
foreach ( var fileName in fileNames )
{
if ( regEx.IsMatch(fileName))
{
var newFileName = regEx.Replace(fileName,"${country}.png").ToLower();
//File.Save(Path.Combine(root, newFileName));
}
}
I am not exactly sure how this would look in c# (although the regex is important and not the language), but in Java this would look like this:
String input = "22px-Flag_Of_Sweden.svg.png";
Pattern p = Pattern.compile(".+_(.+?)\\..+?(\\..+?)$");
Matcher m = p.matcher(input);
System.out.println(m.matches());
System.out.println(m.group(1).toLowerCase() + m.group(2));
Where the relevant for you is this part :
".+_(.+?)\\..+?(\\..+?)$"
Just concat the two groups.
I wish I knew a bit of C# right now :)
Cheers Eugene.
This will return country in the first capture group: ([a-zA-Z]+)\.svg\.png$
I don't know C#, but the regex could be (.NET needs the braced form \p{L} for a Unicode category):
^.+_(\p{L}+)\.svg\.png
and the replace part is : $1.png
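For completeness, here is roughly how that last pattern might look in C#. It is a sketch under a few assumptions: the names FlagFileNames and ToCountryFileName are invented, a trailing $ anchor is added, and the method only computes the new name; the actual rename (for example with File.Move) is left to the caller.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class FlagFileNames
{
    // \p{L}+ captures the run of letters right before ".svg.png".
    private static readonly Regex FlagRegex =
        new Regex(@"^.+_(\p{L}+)\.svg\.png$");

    // Returns e.g. "sweden.png", or null when the name does not fit the pattern.
    public static string ToCountryFileName(string fileName)
    {
        Match match = FlagRegex.Match(fileName);
        return match.Success
            ? match.Groups[1].Value.ToLowerInvariant() + ".png"
            : null;
    }
}
```

Returning null for non-matching names (such as asd.png) lets the caller skip files that should not be renamed.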

Pulling the Video ID of a YouTube video using substrings

I am currently trying to extract the ID of a YouTube video from the embed url YouTube supplies.
I am currently using this as an example:
<iframe width="560" height="315" src="http://www.youtube.com/embed/aSVpBqOsC7o" frameborder="0" allowfullscreen></iframe>
So far my code currently looks like this,
else if (TB_VideoLink.Text.Trim().Contains("http://www.youtube.com/embed/"))
{
youtube_url = TB_VideoLink.Text.Trim();
int Count = youtube_url.IndexOf("/embed/", 7);
string cutid = youtube_url.Substring(Count,youtube_url.IndexOf("\" frameborder"));
LB_VideoCodeLink.Text = cutid;
}
I Seem to be getting there, however the code falls over on CutID and I am not sure why???
Cheers
I always find it much easier to use regular expressions for this sort of thing. Substring and IndexOf always seem dated to me, but that's just my personal opinion.
Here is how I would solve this problem.
Regex regexPattern = new Regex(@"src=\""\S+/embed/(?<videoId>\w+)");
Match videoIdMatch = regexPattern.Match(TB_VideoLink.Text);
if (videoIdMatch.Success)
{
LB_VideoCodeLink.Text = videoIdMatch.Groups["videoId"].Value;
}
This will perform a regular expression match, locating src=", ignoring all characters up until /embed/ then extracting all the word characters after it as a named group.
You can then get the value of this named group. The advantage is, this will work even if frameborder does not occur directly after the src.
Hope this is useful,
Luke
The second parameter of the Substring method is a length, not an end index. Subtract the start index from the end index to get the required length. Note also that IndexOf("/embed/") points at the start of "/embed/", so skip past it to get only the ID.
else if (TB_VideoLink.Text.Trim().Contains("http://www.youtube.com/embed/"))
{
youtube_url = TB_VideoLink.Text.Trim();
// Find the embed marker and skip past it, so 'Count' points at the first character of the ID
int Count = youtube_url.IndexOf("/embed/", 7) + "/embed/".Length;
// From the start of the ID, search for the next "
int endIndex = youtube_url.IndexOf("\"", Count);
// The ID is from the 'Count' variable, for the next (endIndex-Count) characters
string cutid = youtube_url.Substring(Count, endIndex - Count);
LB_VideoCodeLink.Text = cutid;
}
You probably should have some more exception handling for when either of the two test strings do not exist.
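A sketch of what that extra checking could look like, with IndexOf results tested for -1 before Substring is called. The names EmbedIdExtractor and ExtractEmbedId are made up for this example, and returning null on failure is just one possible convention.

```csharp
using System;

// Hypothetical helper; the names are illustrative only.
public static class EmbedIdExtractor
{
    // Returns the text between "/embed/" and the next double quote,
    // or null when either marker is missing.
    public static string ExtractEmbedId(string html)
    {
        const string marker = "/embed/";

        int markerIndex = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (markerIndex < 0)
        {
            return null; // no embed URL in the text
        }

        int start = markerIndex + marker.Length; // first character of the ID
        int end = html.IndexOf('"', start);      // closing quote of the src attribute
        if (end < 0)
        {
            return null; // src attribute is not terminated
        }

        // Substring takes (startIndex, length), so the length is end - start.
        return html.Substring(start, end - start);
    }
}
```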
Similar to answer above, but was beaten to it.. doh
//Regex with YouTube Url and Group () any Word character a-z0-9 and expect 1 or more characters +
var youTubeIdRegex = new Regex(@"http://www.youtube.com/embed/(?<videoId>\w+)",RegexOptions.IgnoreCase|RegexOptions.Compiled);
var youTubeUrl = TB_VideoLink.Text.Trim();
var match = youTubeIdRegex.Match(youTubeUrl);
var youTubeId = match.Groups["videoId"].Value; //Group[1] is (\w+) -- first group ()
LB_VideoCodeLink.Text = youTubeId;

How to use C# regular expressions to emulate forum tags

I am building a forum and I want to be able to use simple square bracket tags to allow users to format text. I am currently accomplishing this by parsing the string and looking for the tags. It's very tedious, especially when I run into a tag like this [url=http://www.something.com]Some text[/url]. Having to parse the attribute, and the value, and make sure it has proper opening and closing tags is kind of a pain and seems silly. I know how powerful regular expressions are but I'm not good at them and they frustrate me to no end.
Any of you regex gurus willing to help me out? I think an example would get me started. Just a regex for finding tags like [b]bolded text[/b] and tags with attributes like the link one I listed above would be helpful. Thanks in advance!
Edit: Links to laymen's terms tutorials for regex are also helpful.
This should work. The "=something.com" is optional and accommodates single or double quotes and it also makes sure that the closing tag matches the opening tag.
protected void Page_Load(object sender, EventArgs e)
{
string input = @"My link: [url='http://www.something.com'][b]Some text[/b][/url] is awesome. Jazz hands activate!!";
string result = Parse(input);
}
//Result: My link: <a href="http://www.something.com"><b>Some text</b></a> is awesome. Jazz hands activate!!
private static string Parse(string input)
{
string regex = @"\[([^=]+)[=\x22']*(\S*?)['\x22]*\](.+?)\[/(\1)\]";
MatchCollection matches = new Regex(regex).Matches(input);
for (int i = 0; i < matches.Count; i++)
{
var tag = matches[i].Groups[1].Value;
var optionalValue = matches[i].Groups[2].Value;
var content = matches[i].Groups[3].Value;
if (Regex.IsMatch(content, regex))
{
content = Parse(content);
}
content = HandleTags(content, optionalValue, tag);
input = input.Replace(matches[i].Groups[0].Value, content);
}
return input;
}
private static string HandleTags(string content, string optionalValue, string tag)
{
switch (tag.ToLower())
{
case "url":
return string.Format("<a href=\"{0}\">{1}</a>", optionalValue, content);
case "b":
return string.Format("<b>{0}</b>", content);
default:
return string.Empty;
}
}
UPDATE
Now I'm just having fun with this. I cleaned it up a bit and replaced the " with \x22 so the entire string can easily be escaped, per @Brad Christie's suggestion. Also, this regex won't break if there are "[" or "]" characters in the content, and it handles tags recursively.
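A simpler alternative, if only a handful of tags is needed, is one Regex.Replace call per tag. This is a rough sketch rather than a full BBCode parser: the names ForumTags and ToHtml are invented, it does not handle a tag nested inside the same tag, and it assumes the input has already been HTML-encoded and that URLs are validated elsewhere.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class ForumTags
{
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Converts [b]...[/b] and [url=...]...[/url] into HTML.
    public static string ToHtml(string input)
    {
        // [b]text[/b] -> <b>text</b>
        string html = Regex.Replace(
            input, @"\[b\](.+?)\[/b\]", "<b>$1</b>", Options);

        // [url=target]text[/url] -> <a href="target">text</a>
        // The target may be wrapped in single or double quotes.
        html = Regex.Replace(
            html,
            @"\[url=['""]?([^'""\]]+)['""]?\](.+?)\[/url\]",
            "<a href=\"$1\">$2</a>",
            Options);

        return html;
    }
}
```

Running the [b] replacement first means a [b] tag inside a [url] tag is already converted by the time the link is built.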
I'm not saying that you can't do this with regular expressions, but I think you're going to find it very, very difficult. You'll have to decide what to do with things like [b]this is [bold text[/b], and other cases where the user has [ or ] characters. And will you allow nesting? (i.e. [b]this is bold, [i]italic[/i] text[/b]).
I would suggest that you look into using something like Markdown.

Regex implementation

I have encountered this piece of code that is supposed to determine the parent url in a hierarchy of dynamic (rewritten) urls. The basic logic goes like this:
"/testing/parent/default.aspx" --> "/testing/default.aspx"
"/testing/parent.aspx" --> "/testing/default.aspx"
"/testing/default.aspx" --> "/default.aspx"
"/default.aspx" --> null
...
private string GetParentUrl(string url)
{
string parentUrl = url;
if (parentUrl.EndsWith("Default.aspx", StringComparison.OrdinalIgnoreCase))
{
parentUrl = parentUrl.Substring(0, parentUrl.Length - 12);
if (parentUrl.EndsWith("/"))
parentUrl = parentUrl.Substring(0, parentUrl.Length - 1);
}
int i = parentUrl.LastIndexOf("/");
if (i < 2) return null;
parentUrl = parentUrl.Substring(0, i + 1);
return string.Format(CultureInfo.InvariantCulture, "{0}Default.aspx", parentUrl);
}
This code works but it smells to me. It will not work with urls that have a querystring. How can I improve it using regex?
Have a look at the answers to SO question "Getting the parent name of a URI/URL from absolute name C#"
This will show you how to use System.Uri to access the segments of an URL. System.Uri also allows to manipulate the URL in the way you want (well, not the custom logic) without the danger of creating invalid URLs. There is no need to hack your own functions to dissect URLs.
A straightforward approach would be to split the URL at "?" and concatenate the query string back on at the end...
I recommend not using Regex in this scenario. A regex that solves this task would be a "real code smell". The code above isn't so bad; use f3lix's and Leon Shmulevich's recommendations to make it better.
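The split-at-"?" suggestion could be sketched like this. The wrapper name GetParentUrlKeepingQuery and the class ParentUrls are made up, the inner GetParentUrl is the question's logic copied in (with ordinal comparisons) so the example compiles on its own, and whether the child's query string should really be carried over to the parent is an assumption to check against the site's needs.

```csharp
using System;
using System.Globalization;

// Hypothetical wrapper; the names are illustrative only.
public static class ParentUrls
{
    // Strips the query string, computes the parent of the path,
    // then appends the original query string again.
    public static string GetParentUrlKeepingQuery(string url)
    {
        int queryStart = url.IndexOf('?');
        string path = queryStart < 0 ? url : url.Substring(0, queryStart);
        string query = queryStart < 0 ? string.Empty : url.Substring(queryStart);

        string parent = GetParentUrl(path);
        return parent == null ? null : parent + query;
    }

    // Same logic as the method in the question, reproduced here unchanged
    // apart from explicit ordinal comparisons.
    private static string GetParentUrl(string url)
    {
        string parentUrl = url;
        if (parentUrl.EndsWith("Default.aspx", StringComparison.OrdinalIgnoreCase))
        {
            parentUrl = parentUrl.Substring(0, parentUrl.Length - 12);
            if (parentUrl.EndsWith("/", StringComparison.Ordinal))
                parentUrl = parentUrl.Substring(0, parentUrl.Length - 1);
        }
        int i = parentUrl.LastIndexOf("/", StringComparison.Ordinal);
        if (i < 2) return null;
        parentUrl = parentUrl.Substring(0, i + 1);
        return string.Format(CultureInfo.InvariantCulture, "{0}Default.aspx", parentUrl);
    }
}
```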
