Need to extract a specific url from source code in c# console - c#

I'm making a bot that needs to display images from page links that users supply. The only way I can see to do this is to get the href value from the page's source code.
using (WebClient client = new WebClient())
{
string htmlCode = client.DownloadString("url that is input by the user");
Console.WriteLine(htmlCode);
Console.ReadKey();
}
This is my current code, which downloads the page source for a given URL. If it helps, the query targets the card pages on the duelmaster wiki, so the page layout is always identical. What I'm really asking is: how do I get that href value out of the entire source?

You can use a regex to extract the href values from a string.
Regular expression:
href[\s]*=[\s]*\"(.*?)[\s]*\"
C# code
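Include the namespaces
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
Updated Code
static void Main()
{
Console.WriteLine("Enter the URL you want to extract data from");
string userInput = Console.ReadLine();
Console.WriteLine("Downloading page...");
// Pass the user's URL to the download method and block until it finishes
DownloadPageAsync(userInput).Wait();
Console.ReadLine();
}
// Returns Task rather than void so the caller can wait for completion
static async Task DownloadPageAsync(string requestUrl)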
{
// ... Use HttpClient instead of webclient
using (HttpClient client = new HttpClient())
using (HttpResponseMessage response = await client.GetAsync(requestUrl))
using (HttpContent content = response.Content)
{
string mydata = await content.ReadAsStringAsync();
Regex regex = new Regex("href[\\s]*=[\\s]*\"(.*?)[\\s]*\\\"");
foreach (Match htmlPath in regex.Matches(mydata))
{
// Here you can write your custom logic
Console.WriteLine(htmlPath.Groups[1].Value);
}
}
}
Code explanation
Regex regex = new Regex("href[\\s]*=[\\s]*\"(.*?)[\\s]*\\\"");
This line creates a Regex object from the given regular expression. Pasting the expression into an online regex explainer gives a part-by-part breakdown.
foreach (Match htmlPath in regex.Matches(mydata))
This line iterates over every match the regex finds in the downloaded string.
Console.WriteLine(htmlPath.Groups[1].Value);
Notice the (.*?) in the regex: it is a capture group. The line above prints the contents of that group, which in your case is the text between the quotes of each href attribute.
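If you want to try the pattern without hitting the network, here is a minimal sketch that runs the same regex over a hard-coded string. The class name HrefExtractor, the method ExtractHrefs and the sample markup are made up for illustration; they are not from the wiki or from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same idea as the pattern above: capture the text between the quotes after href=
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"(.*?)\\s*\"", RegexOptions.IgnoreCase);

    // Returns the value of every double-quoted href attribute found in the HTML
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample markup, used only to exercise the regex
        string sample = "<a href=\"/wiki/Card\">Card</a> <a href = \"https://example.com/card.jpg\">img</a>";
        foreach (string link in ExtractHrefs(sample))
        {
            Console.WriteLine(link);
        }
    }
}
```

From there you could filter the list, for example keeping only links that end in .jpg or .png, before displaying anything. Note that the pattern only handles double-quoted attributes.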

Related

Problem With Cyrillic Characters as URL Parameter

I'm trying to translate some text by sending a GET request to https://translate.googleapis.com/ from a C# application.
The request should be formatted as following:
"/translate_a/single?client=gtx&sl=BG&tl=EN&dt=t&q=Здравей Свят!"
where sl= is the source language, tl= is the target language and q= is the text to be translated.
The response is a JSON array with the translated text and other details.
The problem is that when I try to translate from Bulgarian to English, the result comes back garbled, like: "Р-РґСЂР ° РІРμР№ РЎРІСЏС,!"
There is no problem when I'm translating from English to Bulgarian (no Cyrillic in the URL), so my guess is that the problem is in the request.
Also whenever I'm sending the request directly from the browser the result is properly translated text.
How I'm doing it:
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
using System.Net.Http;
using System.Web;
class Program
{
static void Main(string[] args)
{
string ApiUrl = "https://translate.googleapis.com/translate_a/single?client=gtx&sl={0}&tl={1}&dt=t&q={2}";
string targetLang = "en";
string sourceLang = "bg";
string text = "Здравей Свят!";
text = HttpUtility.UrlPathEncode(text);
string url = string.Format(ApiUrl, sourceLang, targetLang, text);
using (var client = new HttpClient())
{
var result = client.GetStringAsync(url).Result;
var jRes = (JArray)JsonConvert.DeserializeObject(result);
var translatedText = jRes[0][0][0].ToString();
var originalText = jRes[0][0][1].ToString();
var sourceLanguage = jRes[2].ToString();
}
}
}
Any suggestion will be appreciated.
Thanks to a comment, I have managed to receive a properly formatted response.
The problem was that I was leaving out two important parameters in the URL:
ie=UTF-8
oe=UTF-8
The URL should look like this:
https://translate.googleapis.com/translate_a/single?client=gtx&sl=BG&tl=EN&dt=t&q=Здравей%20Свят!&ie=UTF-8&oe=UTF-8
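For reference, a rough sketch of how the URL could be assembled in code with both parameters included. The helper name TranslateUrlBuilder.Build is invented for this example, and using Uri.EscapeDataString for the query text is my own choice (it percent-encodes the text as UTF-8), not something the endpoint is documented here to require.

```csharp
using System;

public static class TranslateUrlBuilder
{
    // Builds the request URL with ie=UTF-8 and oe=UTF-8 included.
    // The query text is percent-encoded as UTF-8 by Uri.EscapeDataString.
    public static string Build(string sourceLang, string targetLang, string text)
    {
        return "https://translate.googleapis.com/translate_a/single?client=gtx"
            + "&sl=" + Uri.EscapeDataString(sourceLang)
            + "&tl=" + Uri.EscapeDataString(targetLang)
            + "&dt=t&ie=UTF-8&oe=UTF-8"
            + "&q=" + Uri.EscapeDataString(text);
    }

    public static void Main()
    {
        Console.WriteLine(Build("bg", "en", "Здравей Свят!"));
    }
}
```

The resulting string can be passed to client.GetStringAsync exactly as in the question.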

Using Regex to insert domain name into url

I am pulling in text from a database that is formatted like the sample below. I want to insert the domain name in front of every URL within this block of text.
<p>We recommend you check out the article
<a id="navitem" href="/article/why-apples-new-iphones-may-delight-and-worry-it-pros/" target="_top">
Why Apple's new iPhones may delight and worry IT pros</a> to learn more</p>
So with the example above in mind I want to insert http://www.mydomainname.com/ into the URL so it reads:
href="http://www.mydomainname.com/article/why-apples-new-iphones-may-delight-and-worry-it-pros/"
I figured I could use regex and replace href=" with href="http://www.mydomainname.com but this appears to not be working as I intended. Any suggestions or better methods I should be attempting?
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
"^href=\"$", "href=\"https://www.mydomainname.com/");
You could use regex...
...but it's very much the wrong tool for the job.
Uri has some handy constructors/factory methods for just this purpose:
Uri ConvertHref(Uri sourcePageUri, string href)
{
//could really just be return new Uri(sourcePageUri, href);
//but TryCreate gives more options...
Uri newAbsUri;
if (Uri.TryCreate(sourcePageUri, href, out newAbsUri))
{
return newAbsUri;
}
throw new Exception();
}
so, say sourcePageUri is
var sourcePageUri = new Uri("https://somehost/some/page");
the output of our method with a few different values for href:
https://www.foo.com/woo/har => https://www.foo.com/woo/har
/woo/har => https://somehost/woo/har
woo/har => https://somehost/some/woo/har
...so it's the same interpretation as the browser makes. Perfect, no?
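To apply that to a whole block of text like the one in the question, something along these lines might work. This is only a sketch: it still uses a regex to locate the href attributes (a real HTML parser would be more robust), and the names HrefRewriter and MakeAbsolute are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Resolves every double-quoted href in the HTML against baseUri.
    // Hrefs that cannot be resolved are left untouched.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        return Regex.Replace(html, "href\\s*=\\s*\"([^\"]*)\"", match =>
        {
            Uri absolute;
            return Uri.TryCreate(baseUri, match.Groups[1].Value, out absolute)
                ? "href=\"" + absolute.AbsoluteUri + "\""
                : match.Value;
        }, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var baseUri = new Uri("http://www.mydomainname.com/");
        string html = "<a id=\"navitem\" href=\"/article/some-article/\" target=\"_top\">link</a>";
        Console.WriteLine(MakeAbsolute(html, baseUri));
    }
}
```

Because the resolution goes through Uri, links that are already absolute keep their own host instead of getting the domain prepended a second time.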
Try this code:
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
"(href=[ \t]*\")/", "$1https://www.mydomainname.com/", RegexOptions.Multiline);
Use an HTML parser, such as CsQuery.
var html = "your html text here";
var path = "http://www.mydomainname.com";
CQ dom = html;
CQ links = dom["a"];
foreach (var link in links)
link.SetAttribute("href", path + link["href"]);
html = dom.Html();

StreamReader get string between certain characters

I have a program that sends emails using templates via a web service. To test the templates, I made a simple program that reads a template, fills it with dummy values and sends it. The problem is that the templates have different 'fill in' variable names. So what I want to do is open the template, make a list of the variables and then fill them with dummy text.
Right now I have something like:
StreamReader SR = new StreamReader(myPath);
.... //Email code here
Msg.Body = SR.ReadToEnd();
SR.Close();
Msg.Body = Msg.Body.Replace("%myFillInVariable%", "Test String");
....
So my plan is to open the template, search for the values between "%" characters and put them in an ArrayList, then do the Msg.Body = SR.ReadToEnd(); part, and finally loop over the ArrayList and do the Replace for each value.
What I can't find is how to read the value between the % tags. Any suggestions on what method to use will be greatly appreciated.
Thanks,
MORE DETAILS:
Sorry if I wasn't clear. I'm passing the name of the TEMPLATE to the script from a drop-down. I might have a few dozen templates, and they all have different %VariableToBeReplace% names. That is why I want to read the template with the StreamReader, find all the %value names%, put them into an array AND THEN fill them in, which I already know how to do. Getting the names of what I need to replace is the part I don't know how to do in code.
I am not sure I understand your question either, but here is a sample of how to do the replacement.
You can run and play with this example in LinqPad.
Copy this content into a file and change the path to what you want. Content:
Hello %FirstName% %LastName%,
We would like to welcome you and your family to our program at the low cost of %currentprice%. We are glad to offer you this %Service%
Thanks,
Some Person
Code:
var content = string.Empty;
using(var streamReader = new StreamReader(@"C:\EmailTemplate.txt"))
{
content = streamReader.ReadToEnd();
}
var matches = Regex.Matches(content, @"%(.*?)%", RegexOptions.ExplicitCapture);
var extractedReplacementVariables = new List<string>(matches.Count);
foreach(Match match in matches)
{
extractedReplacementVariables.Add(match.Value);
}
extractedReplacementVariables.Dump("Extracted KeyReplacements");
//Do your code here to populate these, this part is just to show it still works
//Modify to meet your needs
var replacementsWithValues = new Dictionary<string, string>(extractedReplacementVariables.Count);
for(var i = 0; i < extractedReplacementVariables.Count; i++)
{
replacementsWithValues.Add(extractedReplacementVariables[i], "TestValue" + i);
}
content.Dump("Template before Variable Replacement");
foreach(var key in replacementsWithValues.Keys)
{
content = content.Replace(key, replacementsWithValues[key]);
}
content.Dump("Template After Variable Replacement");
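Outside LinqPad, the same idea can be sketched as a small helper. Two things here are my own assumptions rather than part of the answer above: placeholder names are taken to contain no whitespace (so a stray "%" in normal text is less likely to match), and duplicates are removed so a variable that appears twice is only listed once. The names TemplateFiller, FindPlaceholders and FillWithDummyValues are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemplateFiller
{
    // Returns each distinct %Name% token (including the percent signs),
    // in order of first appearance. Assumes names contain no whitespace.
    public static List<string> FindPlaceholders(string template)
    {
        return Regex.Matches(template, "%([^%\\s]+)%")
            .Cast<Match>()
            .Select(match => match.Value)
            .Distinct()
            .ToList();
    }

    // Replaces every placeholder with "TestValue" plus its index in the list
    public static string FillWithDummyValues(string template)
    {
        List<string> placeholders = FindPlaceholders(template);
        for (int i = 0; i < placeholders.Count; i++)
        {
            template = template.Replace(placeholders[i], "TestValue" + i);
        }
        return template;
    }

    public static void Main()
    {
        string template = "Hello %FirstName% %LastName%, welcome at %currentprice%.";
        Console.WriteLine(FillWithDummyValues(template));
    }
}
```

The list returned by FindPlaceholders is the "names of what I need to replace" from the question, so you can also use it on its own and supply real values instead of the dummy ones.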
I am not really sure that I understood your question, but you can try putting your 'fill in' variable on the first line of the template.
Something like:
StreamReader SR = new StreamReader(myPath);
String fill_in_var=SR.ReadLine();
String line;
while((line = SR.ReadLine()) != null)
{
Msg.Body += line + Environment.NewLine; // keep the line breaks that ReadLine strips
}
SR.Close();
Msg.Body = Msg.Body.Replace(fill_in_var, "Test String");

How to search a downloaded string of a website?

I have downloaded the string and found the index but am not able to get the text which I am searching for. Here is my code:
System.Net.WebClient client = new System.Net.WebClient();
string downloadedString = client.DownloadString("http://www.gmail.com");
int ss = downloadedString.IndexOf("fun");
string mm = downloadedString.Substring(ss);
textBox1.Text = mm;
Try the following:
if (downloadedString.Contains("fun"))
{
// Process...
}
Visiting www.gmail.com will perform 3 redirects. Try the following URL instead:
https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2
Also, consider using a proper HTML Parser like the HTML Agility Pack.
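If you stay with plain string searching, keep in mind that IndexOf returns -1 when the term is not found, and Substring(-1) then throws. A minimal sketch of a safer lookup follows; the helper name SnippetFrom and the idea of capping the snippet length are my own additions, not from the question.

```csharp
using System;

public static class TextSearch
{
    // Returns up to maxLength characters starting at the first occurrence
    // of term (case-insensitive), or null when the term is not present.
    public static string SnippetFrom(string text, string term, int maxLength)
    {
        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
        {
            return null;
        }
        int length = Math.Min(maxLength, text.Length - index);
        return text.Substring(index, length);
    }

    public static void Main()
    {
        // Hypothetical downloaded text
        string downloaded = "<p>Have fun today</p>";
        Console.WriteLine(SnippetFrom(downloaded, "fun", 9) ?? "not found");
    }
}
```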

Equivalent of Python code in C#?

I really have to ask this question, as I do not know Python.
The following lines are taken from this place. I would appreciate it if someone could guide me in translating them to C#.
#Step 1: Get a session key
servercontent = myhttp.request(baseurl + '/services/auth/login', 'POST',
headers={}, body=urllib.urlencode({'username':username, 'password':password}))[1]
sessionkey = minidom.parseString(servercontent).getElementsByTagName('sessionKey')[0].childNodes[0].nodeValue
print "====>sessionkey: %s <====" % sessionkey
I can't translate it to C#, but I can explain what this code does:
Login to baseurl + '/services/auth/login' using the username and password provided.
Read the contents of that URL.
Parse the content for the first <sessionKey> tag, and read the value of its first child node.
Here's a quick-n-dirty translation:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Web;
using System.Xml.Linq;
// ...
var client = new WebClient();
var parameters = new Dictionary<string, string>
{
{ "username", username },
{ "password", password }
};
var result = client.UploadString(String.Format("{0}/services/auth/login", BaseUrl), UrlEncode(parameters));
var doc = XDocument.Parse(result); // parse the response body into an XML document (LINQ to XML)
var key = doc.Descendants("sessionKey").Single().Value; // get the one-and-only <sessionKey> element, wherever it is nested
Console.WriteLine("====>sessionkey: {0} <====", key);
// ...
// Utility function:
private static string UrlEncode(IDictionary<string, string> parameters)
{
var sb = new StringBuilder();
foreach(var val in parameters)
{
// add each parameter to the query string, url-encoding the value.
sb.AppendFormat("{0}={1}&", val.Key, HttpUtility.UrlEncode(val.Value));
}
sb.Remove(sb.Length - 1, 1); // remove last '&'
return sb.ToString();
}
This code checks that the response has exactly one sessionKey element: Single() throws an exception if there are zero or more than one. Then it prints the key.
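The parsing step can be tried on its own, without a server, with something like the sketch below. The response shape in the sample string is an assumption made up for this example (I don't know the exact XML the service returns), and SessionKeyParser is just an illustrative name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SessionKeyParser
{
    // Parses the response body and returns the text of its single
    // <sessionKey> element; Single() throws if there are zero or several.
    public static string Parse(string responseXml)
    {
        return XDocument.Parse(responseXml)
            .Descendants("sessionKey")
            .Single()
            .Value;
    }

    public static void Main()
    {
        // Hypothetical response body, for illustration only
        string sample = "<response><sessionKey>abc123</sessionKey></response>";
        Console.WriteLine("====>sessionkey: {0} <====", Parse(sample));
    }
}
```

If you would rather mirror the Python exactly and take the first match instead of insisting on exactly one, First() in place of Single() does that.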
