Most appropriate way of getting absolute URLs from crawled URLs - C#

Assume that I have the following root URL:
http://www.monstermmorpg.com
Below are several example URLs and the target URL each one should resolve to.
url1: http://www.monstermmorpg.com/
url2: http://www.monstermmorpg.com/Register#21312
url3: Register#21312
url4: /Register
url5: Register
url6: /Register?news=true&news2=true
// there may be more forms that resolve to the same URL, but I don't have a full list atm
I need a function that, with the help of the root URL, resolves each of the above to the following absolute URLs:
url1: http://www.monstermmorpg.com
url2: http://www.monstermmorpg.com/Register
url3: http://www.monstermmorpg.com/Register
url4: http://www.monstermmorpg.com/Register
url5: http://www.monstermmorpg.com/Register
url6: http://www.monstermmorpg.com/Register?news=true&news2=true
There is this method, but I believe it is insufficient. Is there a better method?
This is a C# .NET 4.5 WPF application.
Uri baseUri = new Uri("http://www.contoso.com");
Uri myUri = new Uri(baseUri,"catalog/shownew.htm?date=today");
Console.WriteLine(myUri.AbsoluteUri);

static void Main(string[] args)
{
    var baseUrl = "http://www.monstermmorpg.com";
    var urls = new string[] {
        "http://www.monstermmorpg.com/",
        "http://www.monstermmorpg.com/Register#21312",
        "Register#21312",
        "/Register",
        "Register",
        "/Register?news=true&news2=true" };
    var absoluteUrls = new List<string>();
    foreach (var url in urls)
    {
        Uri uri;
        if (url.StartsWith("http"))
        {
            uri = new Uri(url);
        }
        else
        {
            var urlWithSlash = url.StartsWith("/") ? url : "/" + url;
            uri = new Uri(baseUrl + urlWithSlash);
        }
        // Scheme + host + path + query, without the #fragment (PathAndQuery
        // already excludes it) and without a trailing slash, so the output
        // matches the expected list above.
        absoluteUrls.Add(uri.Scheme + "://" + uri.Host + uri.PathAndQuery.TrimEnd('/'));
    }
    // Now absoluteUrls contains
    // url1: http://www.monstermmorpg.com
    // url2: http://www.monstermmorpg.com/Register
    // url3: http://www.monstermmorpg.com/Register
    // url4: http://www.monstermmorpg.com/Register
    // url5: http://www.monstermmorpg.com/Register
    // url6: http://www.monstermmorpg.com/Register?news=true&news2=true
}
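For comparison, the same result can come straight out of the Uri class, which already implements relative-reference resolution, so the manual slash handling is unnecessary. A minimal sketch (the TrimEnd call is only there to match the expected output for the bare root URL, which Uri normalizes to end in "/"):

```csharp
using System;

class UrlResolver
{
    // Resolve a possibly-relative crawled link against the site root.
    // new Uri(baseUri, link) handles "Register", "/Register", absolute URLs,
    // and query strings; GetLeftPart(UriPartial.Query) then returns
    // scheme + host + path + query, i.e. everything before the "#fragment".
    public static string ToAbsolute(string baseUrl, string link)
    {
        var absolute = new Uri(new Uri(baseUrl), link);
        // Trim the trailing slash so the bare root comes back as
        // "http://www.monstermmorpg.com" rather than ".../".
        return absolute.GetLeftPart(UriPartial.Query).TrimEnd('/');
    }

    static void Main()
    {
        Console.WriteLine(ToAbsolute("http://www.monstermmorpg.com", "Register#21312"));
        // http://www.monstermmorpg.com/Register
        Console.WriteLine(ToAbsolute("http://www.monstermmorpg.com", "/Register?news=true&news2=true"));
        // http://www.monstermmorpg.com/Register?news=true&news2=true
    }
}
```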

Related

HttpUtility.ParseQueryString missing some characters

I'm trying to extract an email address containing the + special character, but for some reason ParseQueryString drops it:
namespace ParsingProblem
{
    class Program
    {
        static void Main(string[] args)
        {
            var uri = new System.Uri("callback://gmailauth/#email=mypersonalemail15+1#gmail.com");
            var parsed = System.Web.HttpUtility.ParseQueryString(uri.Fragment);
            var email = parsed["#email"];
            // Email is: mypersonalemail15 1#gmail.com and it should be mypersonalemail15+1#gmail.com
        }
    }
}
The + symbol in a URL is interpreted as a space character. To fix that, you need to URL encode the email address first. For example:
var urlEncodedEmail = System.Web.HttpUtility.UrlEncode("mypersonalemail15+1#gmail.com");
var uri = new System.Uri($"callback://gmailauth/#email={urlEncodedEmail}");
var parsed = System.Web.HttpUtility.ParseQueryString(uri.Fragment);
var email = parsed["#email"];
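The underlying rule can be seen in isolation: in form-urlencoded data a literal + means a space, while %2B means a plus sign. A small sketch (the addresses are placeholders, not the asker's data):

```csharp
using System;
using System.Web; // HttpUtility; available in the System.Web.HttpUtility assembly on .NET Core

class PlusDemo
{
    static void Main()
    {
        // A raw "+" is decoded as a space by ParseQueryString,
        // whereas the percent-encoded "%2B" survives as "+".
        var raw     = HttpUtility.ParseQueryString("email=a+b@example.com");
        var escaped = HttpUtility.ParseQueryString("email=a%2Bb@example.com");

        Console.WriteLine(raw["email"]);     // a b@example.com
        Console.WriteLine(escaped["email"]); // a+b@example.com
    }
}
```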

How to download a picture from the site?

I am trying to make a parser based on "AngleSharp".
I use the following code for the download:
var itemsAttr = document.QuerySelectorAll("img[id='print_user_photo']");
string foto_url = itemsAttr[0].GetAttribute("src");
string path = pathFolderIMG + id_source + ".jpg";
WebClient webClient = new WebClient();
webClient.DownloadFile(foto_url, path);
For "type_1" pages (link), the code works.
For "type_2" pages (link), the code does not work.
How can I download photos for "type_2" pages?
Please read the AngleSharp documentation carefully; for example, the FAQ gives:
var imageUrl = @"https://via.placeholder.com/150";
var localPath = @"g:\downloads\image.jpg";
var download = context.GetService<IDocumentLoader>().FetchAsync(new DocumentRequest(new Url(imageUrl)));
using (var response = await download.Task)
{
    using (var target = File.OpenWrite(localPath))
    {
        await response.Content.CopyToAsync(target);
    }
}
where we used a configuration like
var config = Configuration.Default.WithDefaultLoader(new LoaderOptions { IsResourceLoadingEnabled = true }).WithCookies();
var context = BrowsingContext.New(config);
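If the "type_2" pages are rejecting the plain WebClient request (one common cause is a missing browser-like User-Agent header; this is an assumption, since the question does not say why the download fails), an alternative sketch is to fetch the bytes with HttpClient and explicit headers:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ImageDownloader
{
    // Hypothetical sketch: download an image URL to a local file,
    // sending a User-Agent header that some servers require.
    public static async Task DownloadAsync(string imageUrl, string localPath)
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0");
            var bytes = await client.GetByteArrayAsync(imageUrl);
            File.WriteAllBytes(localPath, bytes);
        }
    }
}
```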

(C#) How to GET a download url to a certain path?

So let's say I have a download URL that, when you GET it, downloads a file.
Now, this file is not a .txt or anything; it has no extension.
How would I code a GET request to the URL, but make it download to a certain path?
EDIT: Also, how would I convert it to a .txt and read from the .txt afterwards?
NOTE: It's a GET-request site that instantly downloads the file, not a file on a site you can open in your browser.
EDIT 2: It actually returns XML, not the file, sorry; just using a browser downloads it.
What is the real content of that file?
You can try setting the Accept header to "application/octet-stream", which asks the server for raw byte content.
If the content is already regular text, you can simply add ".txt" to the file name and read it whenever you want.
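As a compact sketch of the idea above (the URL is a placeholder, not the asker's real endpoint): request the raw bytes, save them under a path that ends in ".txt", and read the file back if the content turns out to be text:

```csharp
using System;
using System.IO;
using System.Net.Http;

class DownloadToPath
{
    static void Main()
    {
        // Placeholder endpoint; substitute the real download URL.
        var url  = "https://example.com/download";
        var path = Path.Combine(Path.GetTempPath(), "payload.txt");

        using (var client = new HttpClient())
        {
            // Ask the server for raw bytes, as suggested above.
            client.DefaultRequestHeaders.Accept.ParseAdd("application/octet-stream");
            var bytes = client.GetByteArrayAsync(url).Result;
            File.WriteAllBytes(path, bytes);
        }

        // If the payload was text (or XML), it is now readable as a .txt file.
        Console.WriteLine(File.ReadAllText(path));
    }
}
```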
You can do it like this; it shouldn't matter whether your link has a clear file ending like the one I used. If you want to make the GET part explicit, use RestSharp. You can even change the file extension from within the code, not that it matters much. I tossed in some LINQ to XML since you mentioned your file was XML and I thought you might need to do something with it.
using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http;
using System.Xml.Linq;
using System.Linq;
using RestSharp;

namespace Get2File
{
    internal class Program
    {
        private const string FallbackUrl = @"https://gist.github.com/Rusk85/8d189cd35295cfbd272d8c2121110e38/raw/4885d9ba37528faab50d9307f76800e2e1121ea2/example-xml-with-embedded-html.xml";
        private const string FileNameWithoutExtension = "File";
        private string _downloadedContent = null;

        private static void Main(string[] args)
        {
            var p = new Program();
            p.Get2FileWithRestSharp(fileExtensions: ".xml");
            p.UseLinq2XmlOnFile();
        }

        private void Get2File(string altUrl = null, string fileExtensions = ".txt")
        {
            var url = !string.IsNullOrEmpty(altUrl) ? altUrl : FallbackUrl;
            var client = new HttpClient();
            var content = client.GetStringAsync(url).Result;
            _downloadedContent = content;
            var outputPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, $"{FileNameWithoutExtension}{fileExtensions}");
            File.WriteAllText(outputPath, content);
        }

        private void Get2FileWithRestSharp(string altUrl = null, string fileExtensions = ".txt")
        {
            var url = !string.IsNullOrEmpty(altUrl) ? altUrl : FallbackUrl;
            var client = new RestClient(new Uri(url));
            var request = new RestRequest(Method.GET);
            var result = client.Execute(request);
            _downloadedContent = result.Content;
            var output = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, $"{FileNameWithoutExtension}{fileExtensions}");
            File.WriteAllText(output, _downloadedContent);
        }

        private void UseLinq2XmlOnFile()
        {
            XElement xElement = XElement.Parse(_downloadedContent);
            var stringElement = xElement.Elements().FirstOrDefault(e => e.Name == "String");
            var translateAttribute = stringElement.Attributes().FirstOrDefault(attr => attr.Name == "translate");
            Debug.WriteLine(translateAttribute.Value);
        }
    }
}

Check if it is root domain in string

I'm new to C#. Let's say I have a string:
string testurl = "http://www.mytestsite.com/hello";
if (testurl == rootdomain) {
    // do something
}
I want to check whether the string "testurl" is the root domain, i.e. http://www.mytestsite.com or http://mytestsite.com etc.
Thanks.
Use the Uri class:
var testUrl = new Uri("http://www.mytestsite.com/hello");
if (testUrl.AbsolutePath == "/")
{
    Console.WriteLine("At root");
}
else
{
    Console.WriteLine("Not at root");
}
This nicely handles any normalization that may be required (e.g. it treats http://www.mytestsite.com and http://www.mytestsite.com/ the same).
You may try something like this, where GetDomain.GetDomainFromUrl is a third-party helper rather than part of the BCL:
string testurl = "http://www.mytestsite.com/hello";
if (GetDomain.GetDomainFromUrl(testurl) == rootdomain) {
    // do something
}
You can also try using the Uri.Host property.
The following example writes the host name (www.contoso.com) of the server to the console.
Uri baseUri = new Uri("http://www.contoso.com:8080/");
Uri myUri = new Uri(baseUri, "shownew.htm?date=today");
Console.WriteLine(myUri.Host);
If the host returned equals the host of your root domain (e.g. "www.mytestsite.com" — note that Host does not include the "http://" scheme), you are done.
string testurl = "http://www.mytestsite.com/hello";
string prefix = testurl.Split(new string[] { "//" }, StringSplitOptions.None)[0] + "//";
string url = testurl.Replace(prefix, "");
string root = prefix + url.Split('/')[0];
if (testurl == root) {
    // do something
}
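Combining the answers above into a small helper (the query-string check is an extra judgment call: a URL that carries a query is treated as not being the bare root):

```csharp
using System;

class RootCheck
{
    // True when the URL points at the site root, regardless of whether it
    // ends with "/" (Uri normalizes an empty path to "/") and regardless
    // of a "www." prefix, since only the path and query are inspected.
    public static bool IsRoot(string url)
    {
        var uri = new Uri(url);
        return uri.AbsolutePath == "/" && string.IsNullOrEmpty(uri.Query);
    }

    static void Main()
    {
        Console.WriteLine(IsRoot("http://www.mytestsite.com"));       // True ("/" is implied)
        Console.WriteLine(IsRoot("http://mytestsite.com/"));          // True
        Console.WriteLine(IsRoot("http://www.mytestsite.com/hello")); // False
    }
}
```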

Comparing different URLs for same domain

Introduction:
I have a start URL, say www.example.com. I run a scraper on this URL to collect all the internal links belonging to the same site, as well as the external links.
Problem:
I am using the code below to compare a found URL with the main URL www.example.com to see whether they share the same domain; if they do, I treat the URL as internal.
Uri baseUri = new Uri(url); //main URL
Uri myUri = new Uri(baseUri, strRef); //strRef is new found link
//domain = baseUri.Host;
domain = baseUri.Host.Replace("www.", string.Empty).Replace("http://", string.Empty).Replace("https://", string.Empty).Trim();
string domain2=myUri.Host.Replace("www.", string.Empty).Replace("http://", string.Empty).Replace("https://", string.Empty).Trim();
strRef = myUri.ToString();
if (domain2 == domain)
{
    // DO STUFF
}
Is the above logic correct? Suppose I find a new URL http://news.example.com: the domain extracted becomes news.example.com, which does not match the main URL's domain. Is this right? Should it match or not? And what is a better way, if mine is not good enough?
Here is a solution for finding the main domain from a subdomain:
string url = "http://www.xxx.co.uk";
string strRef = "http://www.news.xxx.co.uk";
Uri baseUri = new Uri(url);           // main URL
Uri myUri = new Uri(baseUri, strRef); // strRef is the newly found link
var domain = baseUri.Host;
domain = baseUri.Host.Replace("www.", string.Empty).Replace("http://", string.Empty).Replace("https://", string.Empty).Trim();
// here is the solution
string domain2 = GetDomainName(strRef);
strRef = myUri.ToString();
if (domain2 == domain)
{
    // DO STUFF
}

private static string GetDomainName(string url)
{
    string domain = new Uri(url).DnsSafeHost.ToLower();
    var tokens = domain.Split('.');
    if (tokens.Length > 2)
    {
        // Add only second-level exceptions to the "< 3" rule here
        string[] exceptions = { "info", "firm", "name", "com", "biz", "gen", "ltd", "web", "net", "pro", "org" };
        var validTokens = 2 + ((tokens[tokens.Length - 2].Length < 3 || exceptions.Contains(tokens[tokens.Length - 2])) ? 1 : 0);
        domain = string.Join(".", tokens, tokens.Length - validTokens, validTokens);
    }
    return domain;
}
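If subdomains such as news.example.com should count as internal, a hedged alternative is to compare hosts after stripping a leading "www." and accept suffix matches. Note this simple heuristic does not understand public suffixes like .co.uk, which is the case the GetDomainName helper tries to address:

```csharp
using System;

class SameSiteCheck
{
    // Treats exact host matches and subdomains (news.example.com) of the
    // base host as internal. A suffix check like this is only a heuristic;
    // multi-part public suffixes (.co.uk etc.) need a public-suffix list.
    public static bool IsInternal(Uri baseUri, Uri candidate)
    {
        var baseHost = baseUri.Host.StartsWith("www.")
            ? baseUri.Host.Substring(4)
            : baseUri.Host;
        return string.Equals(candidate.Host, baseHost, StringComparison.OrdinalIgnoreCase)
            || candidate.Host.EndsWith("." + baseHost, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        var root = new Uri("http://www.example.com");
        Console.WriteLine(IsInternal(root, new Uri("http://news.example.com/a"))); // True
        Console.WriteLine(IsInternal(root, new Uri("http://example.com/b")));      // True
        Console.WriteLine(IsInternal(root, new Uri("http://other.com/c")));        // False
    }
}
```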
