How to search a downloaded string of a website? - c#

I have downloaded the string and found the index but am not able to get the text which I am searching for. Here is my code:
System.Net.WebClient client = new System.Net.WebClient();
string downloadedString = client.DownloadString("http://www.gmail.com");
int ss = downloadedString.IndexOf("fun");
string mm = downloadedString.Substring(ss);
textBox1.Text = mm;

Try the following:
if (downloadedString.Contains("fun"))
{
// Process...
}

Visiting www.gmail.com will perform 3 redirects. Try the following URL instead:
https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2
Also, consider using a proper HTML Parser like the HTML Agility Pack.
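
A minimal sketch of the check the question's code is missing (it is not taken from any of the answers): IndexOf returns -1 when the term is absent, and passing -1 to Substring throws. The class name, method name and maxLength parameter are hypothetical, chosen only for illustration.

```csharp
using System;

public static class SnippetFinder
{
    // Returns up to maxLength characters starting at the first occurrence
    // of term, or null when term does not occur in text.
    public static string ExtractFrom(string text, string term, int maxLength)
    {
        int start = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
        {
            return null; // IndexOf returns -1 when nothing matches
        }

        // Clamp the length so Substring never reads past the end of text.
        int length = Math.Min(maxLength, text.Length - start);
        return text.Substring(start, length);
    }
}
```

With the question's variables this could be called as textBox1.Text = SnippetFinder.ExtractFrom(downloadedString, "fun", 200) ?? "not found";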

Related

Using Regex to insert domain name into url

I am pulling in text from a database that is formatted like the sample below. I want to insert the domain name in front of every URL within this block of text.
<p>We recommend you check out the article
<a id="navitem" href="/article/why-apples-new-iphones-may-delight-and-worry-it-pros/" target="_top">
Why Apple's new iPhones may delight and worry IT pros</a> to learn more</p>
So with the example above in mind I want to insert http://www.mydomainname.com/ into the URL so it reads:
href="http://www.mydomainname.com/article/why-apples-new-iphones-may-delight-and-worry-it-pros/"
I figured I could use regex and replace href=" with href="http://www.mydomainname.com but this appears to not be working as I intended. Any suggestions or better methods I should be attempting?
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
"^href=\"$", "href=\"https://www.mydomainname.com/");
You could use regex...
...but it's very much the wrong tool for the job.
Uri has some handy constructors/factory methods for just this purpose:
Uri ConvertHref(Uri sourcePageUri, string href)
{
//could really just be return new Uri(sourcePageUri, href);
//but TryCreate gives more options...
Uri newAbsUri;
if (Uri.TryCreate(sourcePageUri, href, out newAbsUri))
{
return newAbsUri;
}
throw new Exception();
}
so, say sourcePageUri is
var sourcePageUri = new Uri("https://somehost/some/page");
the output of our method with a few different values for href:
https://www.foo.com/woo/har => https://www.foo.com/woo/har
/woo/har => https://somehost/woo/har
woo/har => https://somehost/some/woo/har
...so it's the same interpretation as the browser makes. Perfect, no?
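
As a rough sketch of how this applies to the question's HTML, the helper below resolves an href string against a base address and returns the absolute URL as text. The class and method names are made up for illustration, and it assumes the base address is itself a valid absolute URL.

```csharp
using System;

public static class HrefResolver
{
    // Resolves href against pageUrl the way a browser would.
    // Returns null when the combination is not a valid URI.
    public static string Resolve(string pageUrl, string href)
    {
        var pageUri = new Uri(pageUrl); // throws if pageUrl is not absolute
        Uri absolute;
        return Uri.TryCreate(pageUri, href, out absolute)
            ? absolute.AbsoluteUri
            : null;
    }
}
```

For example, Resolve("http://www.mydomainname.com/", "/article/why-apples-new-iphones-may-delight-and-worry-it-pros/") gives the href the question asks for, while an href that is already absolute comes back unchanged.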
Try this code:
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
"(href=[ \t]*\")/", "$1https://www.mydomainname.com/", RegexOptions.Multiline);
Use an HTML parser, such as CsQuery.
var html = "your html text here";
var path = "http://www.mydomainname.com";
CQ dom = html;
CQ links = dom["a"];
foreach (var link in links)
link.SetAttribute("href", path + link["href"]);
html = dom.Html();
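
If the regex route from this thread is used anyway, a small self-contained sketch could look like the following. The class and method names are hypothetical, and the (?!/) lookahead is an addition not present in the answers: it leaves protocol-relative links such as href="//cdn.example.com/x" alone. It assumes the domain is passed without a trailing slash and contains no $ characters.

```csharp
using System.Text.RegularExpressions;

public static class HrefPrefixer
{
    // Prefixes root-relative hrefs (href="/...") with the given domain.
    // Absolute and protocol-relative hrefs are left untouched.
    public static string PrefixRootRelativeHrefs(string html, string domain)
    {
        // ${1} re-emits the captured 'href="' part; the braces keep the
        // group number separate from whatever the domain starts with.
        return Regex.Replace(html, "(href=[ \\t]*\")/(?!/)", "${1}" + domain + "/");
    }
}
```

This only handles double-quoted attributes, which is one reason the parser-based answers are more robust.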

How to get XML-code of webpage that is opened in IE (without using WebRequest)?

I'm trying to get the XML text of a webpage that is already open in IE. Web requests are not allowed because of the target page's security (a long, boring story involving certificates etc.). I use a method that walks through all open pages and, if I find a match with a page's URI, I need to get its XML.
Some time ago I needed to get the HTML between the body tags. I used a method with IHTMLDocument2 like this:
private string GetSourceHTML()
{
Regex reg = new Regex(patternURL);
Match match;
string result;
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
match = reg.Match(ie.LocationURL.ToString());
if (!string.IsNullOrEmpty(match.Value))
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
result = doc.body.innerHTML.ToString();
return result;
}
}
result = string.Empty;
return result;
}
So now I need to get a whole XML-code of a target page. I've googled a lot, but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse to XML?
Retrieving the HTML source code

How do I get the friendly URL?

I have written this code.
I'm using it to check whether a certain URL is found on a web page.
private void checkUrls (){
WebClient client;
for (int i = 0; i < Convert.ToInt32(txtnum.Text); i++) {
try
{
string Url = "http://www." + txtUrl.Text + i.ToString();
client = new WebClient();
string result = client.DownloadString(Url);
if (result.Contains(txtsearch.Text))
MessageBox.Show(Url);
}
catch (Exception ex) { }
}
}
The base URL looks like this:
http://www.example.com/?p=35
But on two sites, when I request this:
http://www.example.com/?p=35
I get redirected to something like this:
http://www.example.com/some_categoery/postitle/
I need to iterate over the site using the first form, but download the content of the friendly URL.
Can anyone point me in the right direction?
I am checking websites where I do not know how many pages there are.
You could try the HtmlAgilityPack to get all the anchor tags and check the href attribute for the value you want.

C# Html Agility Pack ( SelectSingleNode )

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is correctly loaded. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
Try this:
var name = doc.DocumentNode.SelectSingleNode("//@id='my_name'").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public List<string> GetAllTagLinkContent(string content)
{
string html = string.Format("<html><head></head><body>{0}</body></html>", content);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
// SelectNodes returns null when nothing matches
if (nodes == null) return new List<string>();
return nodes.Select(r => r.InnerText).ToList();
}
It also works with ("//a[@href]"). You can try it as above. Hope this helps.

Screen scraping HTTPS using C#

How to screen scrape HTTPS using C#?
You can use System.Net.WebClient to start an HTTPS connection, and pull down the page to scrape with that.
Look into the Html Agility Pack.
You can use System.Net.WebClient to grab web pages. Here is an example: http://www.codersource.net/csharp_screen_scraping.html
If for some reason you're having trouble with accessing the page as a web-client or you want to make it seem like the request is from a browser, you could use the web-browser control in an app, load the page in it and use the source of the loaded content from the web-browser control.
Here's a concrete (albeit trivial) example. You can pass a ship name to VesselFinder in the querystring, but even if it only finds one ship with that name it still shows you the search results screen with one ship. This example detects that case and takes the user straight to the tracking map for the ship.
string strName = "SAFMARINE MAFADI";
string strURL = "https://www.vesselfinder.com/vessels?name=" + HttpUtility.UrlEncode(strName);
string strReturnURL = strURL;
string strToSearch = "/?imo=";
string strPage = string.Empty;
byte[] aReqtHTML;
WebClient objWebClient = new WebClient();
objWebClient.Headers.Add("User-Agent: Other"); //You must do this or HTTPS won't work
aReqtHTML = objWebClient.DownloadData(strURL); //Do the name search
UTF8Encoding utf8 = new UTF8Encoding();
strPage = utf8.GetString(aReqtHTML); // get the string from the bytes
if (strPage.IndexOf(strToSearch) != strPage.LastIndexOf(strToSearch))
{
//more than one instance found, so leave return URL as name search
}
else if (strPage.Contains(strToSearch))
{
//exactly one instance found: extract the ship's IMO link
strPage = strPage.Substring(strPage.IndexOf(strToSearch)); //cut off the stuff before
strPage = strPage.Substring(0, strPage.IndexOf("\"")); //cut off the stuff after
//only replace the return URL here, so the other cases keep the name search
strReturnURL = "https://www.vesselfinder.com" + strPage;
}
