How to extract specific names from a URL? - C#

I already have a ListBox full of URLs like this one, which I convert to strings:
http://example.com/1392/Music/1392/Shahrivar/21/Avicii%20-%20True/01.%20Avicii%20Ft.%20Aloe%20Blacc%20-%20Wake%20Me%20Up%20(CDQ)%20%5b320%5d.mp3
From this link, for example, I want to extract the name of the song: "Avicii Ft. Aloe Blacc - Wake Me Up". I'm using C#. I have already extracted the links from a web page, and now I only need to extract the names from the links. Thanks in advance for any suggestions or help.

Try this:
using System;
using System.Linq;
using System.Net;
namespace ConsoleApplication1
{
class Program
{
static void Main (string[] args)
{
var url = "http://example.com/1392/Music/1392/Shahrivar/21/Avicii%20-%20True/01.%20Avicii%20Ft.%20Aloe%20Blacc%20-%20Wake%20Me%20Up%20(CDQ)%20%5b320%5d.mp3";
var uri = new Uri (url);
string[] segments = uri.Segments.Select (x => WebUtility.UrlDecode (x).TrimEnd ('/')).ToArray ();
}
}
}

If you know the structure of the URL you are scraping, you should be able to break off the useless part of the string.
For example, if you know that the URL follows the form:
http://example.com/1392/Music/1392/Shahrivar/21/{Artist}-{Album}/{Track Information}
Roughly, I think the following would allow you to extract the information you want from a single link:
void Main (string[] args)
{
    var example = @"http://example.com/1392/Music/1392/Shahrivar/21/Avicii%20-%20True/01.%20Avicii%20Ft.%20Aloe%20Blacc%20-%20Wake%20Me%20Up%20(CDQ)%20%5b320%5d.mp3";
    // Splitting on '/' yields "http:", "", "example.com", ... so the album sits at index 8.
    var parts = example.Split('/');
    var album = parts[8];
    var trackInfo = parts[9];
    var trackParts = trackInfo.Split('-');
    var artist = trackParts[0];
    var trackTitle = trackParts[1];
    Console.WriteLine(trackTitle);
}
Here I am splitting the URL by '/', which is a messy solution, but for a simple case it works. Then I find the index within the tokenized string where the desired information can be found. Once I have the track information, I know the convention is to separate the artist from the title with a '-', so I split again and then have both artist and title.
You can refactor this into a method which takes the URL, and returns an object containing the Artist and song title. After that, you might want to use a string.Replace on the '%20' with ' '.
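The refactoring suggested above could look roughly like the sketch below. The TrackInfo and TrackUrlParser names are hypothetical, and the code assumes the last path segment follows the pattern "NN. Artist - Title (tags) [tags].ext" seen in the example link; other naming schemes would need different rules. It decodes the segment with Uri.UnescapeDataString rather than replacing '%20' by hand.

```csharp
using System;
using System.IO;

// Hypothetical container for the parsed result (not part of the original answer).
public sealed class TrackInfo
{
    public string Artist { get; set; }
    public string Title { get; set; }
}

public static class TrackUrlParser
{
    // Assumes the last segment looks like "NN. Artist - Title (tags) [tags].ext".
    public static TrackInfo Parse(string url)
    {
        var uri = new Uri(url);
        // Decode %20, %5b, ... in the final path segment only.
        string file = Uri.UnescapeDataString(uri.Segments[uri.Segments.Length - 1]);
        string name = Path.GetFileNameWithoutExtension(file);

        // Split "artist - title" on the first " - " only.
        int dash = name.IndexOf(" - ", StringComparison.Ordinal);
        if (dash < 0) return new TrackInfo { Artist = "", Title = name.Trim() };

        string artist = name.Substring(0, dash).Trim();
        string title = name.Substring(dash + 3).Trim();

        // Drop a leading track number such as "01. ".
        int dot = artist.IndexOf(". ", StringComparison.Ordinal);
        int ignored;
        if (dot > 0 && int.TryParse(artist.Substring(0, dot), out ignored))
            artist = artist.Substring(dot + 2);

        // Cut trailing tags such as "(CDQ) [320]".
        int tag = title.IndexOfAny(new[] { '(', '[' });
        if (tag > 0) title = title.Substring(0, tag).Trim();

        return new TrackInfo { Artist = artist, Title = title };
    }
}
```

Called with the example link, Parse should give "Avicii Ft. Aloe Blacc" as the artist and "Wake Me Up" as the title.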

First of all, use HttpUtility.UrlDecode. This function decodes URL-encoded (percent-encoded) characters such as %20, leaving a plain string to work with. You can then simply split by /.

Related

Retrieving a portion of a url

I need your collective wisdom. I need to get a portion of a URL so that I can pass it as a parameter to make some stuff happen. Here's what I've got:
Here is an example of the URL: "somesite/somepage/johndoe21911". I need to get the "21911" so that I can pass it into this:
var url = Request.ApplicationPath.Replace("/", "");
Session["agencyId"] = _Apps.GetGehaAgencyData(portion needed goes here);
Any direction is greatly appreciated
If your URL looks like an actual URL (with the http:// part) then you could use Uri class:
private static void Extract()
{
Uri uri = new Uri("http://somesite/somepage/johndoe21911");
string last = uri.Segments.LastOrDefault();
string numOnly = Regex.Replace(last, "[^0-9 _]", string.Empty);
Console.WriteLine(last);
Console.WriteLine(numOnly);
}
If it's exactly like in your example (without the http:// part) then you could do something like this:
private static void Extract()
{
string uri = "somesite/somepage/johndoe21911";
string last = uri.Substring(uri.LastIndexOf('/') + 1);
string numOnly = Regex.Replace(last, "[^0-9 _]", string.Empty);
Console.WriteLine(last);
Console.WriteLine(numOnly);
}
Above is assuming you want ALL numerics from the last segment of the URL, which is what you've said your requirement is. That is, if your URL were to look like this:
somesite/somepage/john123doe456
This will extract 123456.
If you want only the last 5 characters, you could simply use string.Substring() to extract the last five characters.
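A minimal sketch of the Substring idea, assuming the value of interest is always the tail of the last segment. The TailExtractor class and LastChars method are hypothetical names; the length check avoids an ArgumentOutOfRangeException when the segment is shorter than the requested count.

```csharp
using System;

public static class TailExtractor
{
    // Returns the last `count` characters of the final URL segment,
    // or the whole segment when it is shorter than `count`.
    public static string LastChars(string url, int count)
    {
        string last = url.Substring(url.LastIndexOf('/') + 1);
        return last.Length <= count ? last : last.Substring(last.Length - count);
    }
}
```

For "somesite/somepage/johndoe21911" and a count of 5, this returns "21911".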
If you want numerics which are at the end of the string then this would work.
private static void Extract()
{
string uri = "somesite/somepage/john123doe21911";
string last = uri.Substring(uri.LastIndexOf('/') + 1);
string numOnly = Regex.Match(last, @"\d+$").Value;
Console.WriteLine(last);
Console.WriteLine(numOnly);
}
Oh, and saying "I've come across some stuff on Google, but wasn't really sure how to implement them" is a very lazy answer. If you Google, you can find countless examples of how to do all of these things, even on this site itself. Next time, please do your research and try it yourself first.

How can I trim the string in C# after each file extension name?

I have a string of attachments like this:
"SharePoint_Health Check Assessment.docx<br>Tes‌​t Workflow.docx<br>" .
and I used this method:
AttachmentName = System.Text.RegularExpressions.Regex.Replace(AttachmentName, @"<(.|\n)*?>", String.Empty);
and I got this result:
SharePoint_Health Check Assessment.docxTest Workflow.docx
How can I split the string in C# and get each file name separately, like:
SharePoint_Health Check Assessment.docx
Test Workflow.docx
and then show them in some control one by one.
After that, I want just the URL of the string, like
"http://srumos1/departments/Attachments/2053_3172016093545_ITPCTemplate.txt"
and
"http://srumos1/departments/Attachments/2053_3172016093545_ITPCTemplate.txt"
How can I do that?
I got it this way:
AttachmentName = Regex.Replace(AttachmentName, @"<(.|\n)*?>", string.Empty);
Well, there's your problem. You had valid delimiters but stripped them away for some reason. Leave the delimiters there and use String.Split to split on them.
Or replace the HTML with a delimiter instead of an empty string:
AttachmentName = Regex.Replace(AttachmentName, @"<(.|\n)*?>", "|");
And then split based off of that:
string[] filenames = AttachmentName.Split(new [] {'|'},
StringSplitOptions.RemoveEmptyEntries);
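The first suggestion, keeping the delimiter and splitting on it directly, might look like the sketch below. It assumes the separator is always the literal "<br>" (not "<br/>" or "<BR>"); AttachmentSplitter and SplitNames are hypothetical names.

```csharp
using System;

public static class AttachmentSplitter
{
    // Splits on the literal "<br>" tag and drops empty entries,
    // such as the one left by the trailing "<br>".
    public static string[] SplitNames(string attachments)
    {
        return attachments.Split(new[] { "<br>" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Each element of the returned array can then be added to a control one by one.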
You can use a regex to extract the file names if you do not have any other clear way to do that. Can you try the code below?
using System;
using System.Collections.Generic;
using System.Text;
using System.Linq;
using System.Text.RegularExpressions;
namespace ExtensionExtractingTest
{
class Program
{
static void Main(string[] args)
{
string fileNames = "test.docxtest2.txttest3.pdftest.test.xlxtest.docxtest2.txttest3.pdftest.test.xlxtest.docxtest2.txttest3.pdftest.test.xlxourtest.txtnewtest.pdfstackoverflow.pdf";
//Add your extensions to regex definition
Regex fileNameMatchRegex = new Regex(@"[a-zA-Z0-9]*(\.txt|\.pdf|\.docx|\.xlx)", RegexOptions.IgnoreCase);
MatchCollection matchResult = fileNameMatchRegex.Matches(fileNames);
List<string> fileNamesList = new List<string>();
foreach (Match item in matchResult)
{
fileNamesList.Add(item.Value);
}
fileNamesList = fileNamesList.Distinct().ToList();
Console.WriteLine(string.Join(";", fileNamesList));
}
}
}
And a working example is here http://ideone.com/gbopSe
PS: Please keep in mind you have to know your file name extensions or you have to predict filename extension length 3 or 4 and that will be a painful string parsing operation.
Hope this helps

How can I get a part/subdomain of my URL in C#?

I have a URL like the following
http://yellowcpd.testpace.net
How can I get yellowcpd from this? I know I can do that with string parsing, but is there a builtin way in C#?
Assuming your URLs will always be testpace.net, try this:
var subdomain = Request.Url.Host.Replace("testpace.net", "").TrimEnd('.');
It'll just give you the non-testpace.net part of the Host. If you don't have Request.Url.Host, you can do new Uri(myString).Host instead.
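As a sketch of the new Uri(myString).Host variant, the helper below strips a known base domain from the end of the host. SubdomainHelper and GetSubdomain are hypothetical names, and the base domain is assumed to be known in advance, as in the answer above. Unlike Replace, it only removes the base domain when it appears as a suffix.

```csharp
using System;

public static class SubdomainHelper
{
    // Returns the part of the host in front of `baseDomain`, or "" if there is none.
    public static string GetSubdomain(string url, string baseDomain)
    {
        string host = new Uri(url).Host; // e.g. "yellowcpd.testpace.net"
        string suffix = "." + baseDomain;
        return host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)
            ? host.Substring(0, host.Length - suffix.Length)
            : string.Empty;
    }
}
```

For "http://yellowcpd.testpace.net" and "testpace.net" this returns "yellowcpd"; for the bare "http://testpace.net" it returns an empty string.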
Try this:
string host = Request.Url.Host;
var myvalues = host.Split('.'); // myvalues[0] is "yellowcpd" for yellowcpd.testpace.net
How can I get yellowcpd from this? I know I can do that with string
parsing, but is there a builtin way in C#?
.NET doesn't provide a built-in feature to extract specific parts from Uri.Host. You will have to use string manipulation or a regular expression yourself.
The only constant part of the domain string is the TLD. The TLD is the very last part of the domain string, e.g. .com, .net, .uk. The position of everything else depends on the particular TLD, so you can't assume the next-to-last part is the "domain name": for .co.uk it would be .co.
This fits the bill.
Split over two lines:
string rawURL = Request.Url.Host;
string domainName = rawURL.Split('.')[1];
Or over one:
string domainName = Request.Url.Host.Split('.')[1];
The simple answer to your question is no there isn't a built in method to extract JUST the sub-domain. With that said this is the solution that I use...
public enum GetSubDomainOption
{
ExcludeWWW,
IncludeWWW
};
public static class Extentions
{
public static string GetSubDomain(this Uri uri,
GetSubDomainOption getSubDomainOption = GetSubDomainOption.IncludeWWW)
{
var subdomain = new StringBuilder();
for (var i = 0; i < uri.Host.Split(new char[]{'.'}).Length - 2; i++)
{
//Ignore any www values if the ExcludeWWW option is set
if(getSubDomainOption == GetSubDomainOption.ExcludeWWW && uri.Host.Split(new char[]{'.'})[i].ToLowerInvariant() == "www") continue;
//I use a ternary operator here... this could easily be converted to an if/else if you are in the "ternary operators are evil" crowd
subdomain.Append((i < uri.Host.Split(new char[]{'.'}).Length - 3 &&
uri.Host.Split(new char[]{'.'})[i+1].ToLowerInvariant() != "www") ?
uri.Host.Split(new char[]{'.'})[i] + "." :
uri.Host.Split(new char[]{'.'})[i]);
}
return subdomain.ToString();
}
}
USAGE:
var subDomain = Request.Url.GetSubDomain(GetSubDomainOption.ExcludeWWW);
or
var subDomain = Request.Url.GetSubDomain();
I currently have the default set to include the WWW. You could easily reverse this by switching the optional parameter value in the GetSubDomain() method.
In my opinion this allows for an option that looks nice in code and without digging in appears to be 'built-in' to c#. Just to confirm your expectations...I tested three values and this method will always return just the "yellowcpd" if the exclude flag is used.
www.yellowcpd.testpace.net
yellowcpd.testpace.net
www.yellowcpd.www.testpace.net
One assumption that I use is that...splitting the hostname on a . will always result in the last two values being the domain (i.e. something.com)
As others have pointed out, you can do something like this:
var req = new HttpRequest(filename: "search", url: "http://www.yellowcpd.testpace.net", queryString: "q=alaska");
var host = req.Url.Host;
var yellow = host.Split('.')[1];
The portion of the URL you want is part of the hostname. You may hope to find some method that directly addresses that portion of the name, e.g. "the subdomain (yellowcpd) within testpace.net", but this is probably not possible, because the rules for valid host names allow for any number of labels (see Valid Host Names), separated by periods. You will have to add additional restrictions to get what you want, e.g. "Separate the host name into labels, discard www if present and take the next label".
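The restriction quoted above ("separate the host name into labels, discard www if present and take the next label") can be sketched as follows. HostLabels and FirstLabelAfterWww are hypothetical names, and the rule itself is an assumption about what the caller wants, not something the URL format guarantees.

```csharp
using System;
using System.Linq;

public static class HostLabels
{
    // Returns the first host label that is not "www",
    // or null when every label is "www".
    public static string FirstLabelAfterWww(string url)
    {
        return new Uri(url).Host
            .Split('.')
            .FirstOrDefault(label => !label.Equals("www", StringComparison.OrdinalIgnoreCase));
    }
}
```

Note that for a host without a subdomain, such as testpace.net, this rule returns "testpace", so it only fits hosts that are known to carry a subdomain.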

In C#, what is the best way to parse out this value from a string?

I have to parse out the system name from a larger string. The system name has a prefix of "ABC" and then a number. Some examples are:
ABC500
ABC1100
ABC1300
the full string where i need to parse out the system name from can look like any of the items below:
ABC1100 - 2ppl
ABC1300
ABC 1300
ABC-1300
Managers Associates Only (ABC1100 - 2ppl)
before I saw the last one, i had this code that worked pretty well:
string[] trimmedStrings = jobTitle.Split(new char[] { '-', '–' },StringSplitOptions.RemoveEmptyEntries)
.Select(s => s.Trim())
.ToArray();
return trimmedStrings[0];
but it fails on the last example where there is a bunch of other text before the ABC.
Can anyone suggest a more elegant and future proof way of parsing out the system name here?
One way to do this:
string[] strings =
{
"ABC1100 - 2ppl",
"ABC1300",
"ABC 1300",
"ABC-1300",
"Managers Associates Only (ABC1100 - 2ppl)"
};
var reg = new Regex(@"ABC[\s,-]?[0-9]+");
var systemNames = strings.Select(line => reg.Match(line).Value);
systemNames.ToList().ForEach(Console.WriteLine);
prints:
ABC1100
ABC1300
ABC 1300
ABC-1300
ABC1100
You really could leverage a Regex and get better results. This one should do the trick [A-Za-z]{3}\d+, and here is a Rubular to prove it. Then in the code use it like this:
var matches = Regex.Match(someInputString, @"[A-Za-z]{3}\d+");
if (matches.Success) {
var val = matches.Value;
}
You can use a regular expression to parse this. There may be better expressions, but this one works for your case:
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string txt="ABC500";
string re1="((?:[a-z][a-z]+))";
string re2="(\\d+)";
Regex r = new Regex(re1+re2,RegexOptions.IgnoreCase|RegexOptions.Singleline);
Match m = r.Match(txt);
if (m.Success)
{
String word1=m.Groups[1].ToString();
String int1=m.Groups[2].ToString();
Console.Write("("+word1.ToString()+")"+"("+int1.ToString()+")"+"\n");
}
}
}
}
You should definitely use Regex for this. Depending on the exact nature of the system name, something like this could prove to be enough:
Regex systemNameRegex = new Regex(@"ABC[0-9]+");
If the ABC part of the name can change, you can modify the Regex to something like this:
Regex systemNameRegex = new Regex(@"[a-zA-Z]+[0-9]+");
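If inputs such as "ABC 1300" and "ABC-1300" should all come out as the same system name, one option is to capture only the digits and rebuild the name. The sketch below assumes the prefix is always "ABC" and that a single space or hyphen may separate it from the number; SystemNameParser is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SystemNameParser
{
    // "ABC", an optional whitespace character or hyphen, then the digits (captured).
    private static readonly Regex Pattern = new Regex(@"ABC[\s-]?(\d+)");

    // Returns the normalized name (prefix plus digits), or null when nothing matches.
    public static string Parse(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? "ABC" + m.Groups[1].Value : null;
    }
}
```

With this, "ABC 1300", "ABC-1300" and "ABC1300" all yield "ABC1300", and "Managers Associates Only (ABC1100 - 2ppl)" yields "ABC1100".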

C# replace multiple href values

I have a block of html that looks something like this;
<p>33</p>
There are basically hundreds of anchor links which I need to replace the href based on the anchor text. For example, I need to replace the link above with something like;
33.
I will need to take the value 33 and do a lookup on my database to find the new link to replace the href with.
I need to keep it all in the original html as above!
How can I do this? Help!
Although this doesn't answer your question, the HTML Agility Pack is a great tool for manipulating and working with HTML: http://html-agility-pack.net
It could at least make grabbing the values you need and doing the replaces a little easier.
Contains links to using the HTML Agility Pack: How to use HTML Agility pack
Slurp your HTML into an XmlDocument (your markup is valid, isn't it?) Then use XPath to find all the <a> tags with an href attribute. Apply the transform and assign the new value to the href attribute. Then write the XmlDocument out.
Easy!
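A rough sketch of the XmlDocument approach, assuming the markup is well-formed XML with a single root element (real-world HTML often is not, and entities such as &nbsp; will make LoadXml throw). HrefRewriter is a hypothetical name, and the lookup delegate stands in for the database query that maps the anchor text to the new link.

```csharp
using System;
using System.Xml;

public static class HrefRewriter
{
    // `lookup` stands in for the database call that maps anchor text to a new link.
    public static string Rewrite(string xhtml, Func<string, string> lookup)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;   // keep the original layout
        doc.LoadXml(xhtml);              // throws XmlException if the markup is not well-formed

        // Visit every <a> element that has an href attribute.
        foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
        {
            a.SetAttribute("href", lookup(a.InnerText));
        }
        return doc.OuterXml;
    }
}
```

A call such as HrefRewriter.Rewrite(html, text => text + ".html") rewrites each href from its anchor text and leaves the rest of the markup in place.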
Use a regexp to find the values and replace them.
A regexp like "<p><a href=\"[^\"]+\">([^<]+)<\\/a><\\/p>" will match and capture the anchor text.
Consider using the following rough algorithm.
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
static class Program
{
static void Main ()
{
string html = "<p>33</p>"; // read the whole html file into this string.
StringBuilder newHtml = new StringBuilder (html);
Regex r = new Regex (@"\<a href=\""([^\""]+)\"">([^<]+)"); // 1st capture for the replacement and 2nd for the find
foreach (var match in r.Matches(html).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = DBTranslate (text);
newHtml.Remove (match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert (match.Groups[1].Index, newHref);
}
Console.WriteLine (newHtml);
}
static string DBTranslate(string s)
{
return "junk_" + s;
}
}
(The OrderByDescending makes sure the indexes don't change as you modify the StringBuilder.)
So, what you want to do is generate the replacement string based on the contents of the match. Consider using one of the Regex.Replace overloads that take a MatchEvaluator. Example:
static void Main()
{
Regex r = new Regex(@"<a href=""[^""]+"">([^<]+)");
string s0 = @"<p>33</p>";
string s1 = r.Replace(s0, m => GetNewLink(m));
Console.WriteLine(s1);
}
static string GetNewLink(Match m)
{
return string.Format(@"<a href=""{0}.html"">{0}", m.Groups[1]);
}
I've actually taken it a step further and used a lambda expression instead of explicitly creating a delegate method.
