How to screen scrape HTTPS using C#?
You can use System.Net.WebClient to open an HTTPS connection and pull down the page you want to scrape.
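For example, a minimal sketch (the URL here is just a placeholder):

using System;
using System.Net;

class Scraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // WebClient negotiates the HTTPS handshake for you
            string html = client.DownloadString("https://example.com/");
            Console.WriteLine(html);
        }
    }
}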
Look into the Html Agility Pack.
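A minimal sketch, assuming the HtmlAgilityPack NuGet package is installed (the URL and XPath are just examples):

using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
HtmlDocument doc = web.Load("https://example.com/");

// Print the text of every link; SelectNodes returns null when nothing matches
var links = doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.InnerText);
    }
}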
You can use System.Net.WebClient to grab web pages. Here is an example: http://www.codersource.net/csharp_screen_scraping.html
If for some reason you're having trouble accessing the page as a web client, or you want the request to look like it came from a browser, you could use the WebBrowser control in an app, load the page in it, and read the source of the loaded content from the control.
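A rough sketch of that approach inside a WinForms app (the URL is a placeholder; the control needs a running message loop for the event to fire):

var browser = new System.Windows.Forms.WebBrowser();
browser.ScriptErrorsSuppressed = true;
browser.DocumentCompleted += (s, e) =>
{
    // DocumentText is the source of the page the control has loaded
    string source = browser.DocumentText;
    // ... scrape 'source' here ...
};
browser.Navigate("https://example.com/");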
Here's a concrete (albeit trivial) example. You can pass a ship name to VesselFinder in the query string, but even if it finds only one ship with that name, it still shows the search results screen with that single ship. This example detects that case and takes the user straight to the tracking map for the ship.
string strName = "SAFMARINE MAFADI";
string strURL = "https://www.vesselfinder.com/vessels?name=" + HttpUtility.UrlEncode(strName);
string strReturnURL = strURL;
string strToSearch = "/?imo=";
string strPage = string.Empty;
byte[] aReqtHTML;
WebClient objWebClient = new WebClient();
objWebClient.Headers.Add("User-Agent: Other"); // the site rejects requests without a User-Agent header
aReqtHTML = objWebClient.DownloadData(strURL); // do the name search
UTF8Encoding utf8 = new UTF8Encoding();
strPage = utf8.GetString(aReqtHTML); // decode the response bytes into a string
if (strPage.IndexOf(strToSearch) != strPage.LastIndexOf(strToSearch))
{
    // more than one ship found, so leave the return URL as the name search
}
else if (strPage.Contains(strToSearch))
{
    // exactly one ship found, so extract its IMO link...
    strPage = strPage.Substring(strPage.IndexOf(strToSearch)); // cut off the text before the link
    strPage = strPage.Substring(0, strPage.IndexOf("\"")); // cut off the text after it
    // ...and point the return URL straight at the tracking map
    strReturnURL = "https://www.vesselfinder.com" + strPage;
}
I'm working in ASP.NET and had to rewrite some URLs. The rewriting works fine; for example, I changed mywebsite.com/search.aspx?cat=1 to mywebsite.com/search/cameras. Now I have to change the page meta tags, but when I try to get the URL using
HttpContext.Current.Request.Url.PathAndQuery
I get search.aspx?cat=1,
while I want the address written in the address bar, which is search/cameras.
If that's not possible, is there any way to set meta tags for specific pages?
Here is the code for the URL rewrite:
m_boolIsCustomPage = true;
m_strPageBaseUrl = "search.aspx";
if (m_intIDSearch > -1)
{
    l_strQueryContents = m_intIDSearch.ToString();
    m_intIDSearch = -1;
}
else
{
    l_strQueryContents = "-1";
    m_intIDSearch = -1;
}
HttpContext.Current.Request.RawUrl
This is the URL as received by IIS, prior to any manipulation, so it matches what is in the address bar.
Request.RawUrl vs. Request.Url
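A sketch of how you might use it to set meta tags for the rewritten URL (the URL check and tag values here are hypothetical, and Page.MetaDescription/Page.MetaKeywords require ASP.NET 4.0 or later):

// RawUrl reflects the address bar, e.g. "/search/cameras",
// not the rewritten "/search.aspx?cat=1"
string rawUrl = HttpContext.Current.Request.RawUrl;

if (rawUrl.EndsWith("/search/cameras", StringComparison.OrdinalIgnoreCase))
{
    Page.MetaDescription = "Browse our cameras"; // hypothetical values
    Page.MetaKeywords = "cameras, photography";
}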
Okay, so I've googled this on several occasions, but each result suggests fairly heavyweight choices like Selenium, which works fine but isn't usable without Firefox (or, to my knowledge, from within an API).
I have this code :
public byte[] GetFileViaHttp(string url)
{
using (WebClient client = new WebClient())
{
return client.DownloadData(url);
}
}
Then I also have this code :
byte[] result = GetFileViaHttp(@"http://ip-lookup.net/");
string str = Encoding.UTF8.GetString(result);
richTextBox1.Text = str;
This works fine and returns my IP's information, but I want to automate it with other IP addresses rather than returning my own.
How would this be done? I want the app to take txtBox1.Text (the IP) and print the details into richTextBox1.Text (Host/Country).
I looked around the site and found a help document that details exactly what you want.
Simply pass the IP value as an unnamed query string parameter:
http://ip-lookup.net/?127.0.0.1
In your code:
byte[] result = GetFileViaHttp(string.Format("http://ip-lookup.net?{0}", ipAddress));
where ipAddress is the IP address string you are injecting.
You can find their help page here. I looked for a legal agreement but I wasn't able to find one, so please use at your own risk and discretion.
UPDATE:
If you are getting 403s, you need to pass along a user agent header. Your WebClient instance can be modified to include a header in the request.
public byte[] GetFileViaHttp(string url)
{
using (WebClient client = new WebClient())
{
client.Headers.Add("User-Agent: Other");
return client.DownloadData(url);
}
}
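To wire it up the way you describe, something like this in a button click handler should work (btnLookup is a hypothetical control name; txtBox1 and richTextBox1 are taken from your post):

private void btnLookup_Click(object sender, EventArgs e)
{
    string ip = txtBox1.Text.Trim();
    byte[] result = GetFileViaHttp(string.Format("http://ip-lookup.net?{0}", ip));
    richTextBox1.Text = Encoding.UTF8.GetString(result);
}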
I'm trying to get XML text from a webpage that is already open in IE. Web requests are not allowed because of the security of the target page (a long, boring story involving certificates, etc.). I use a method to walk through all open pages and, if I find a match with a page's URI, I need to get its XML.
Some time ago I needed to get the HTML code between the body tags. I used a method with IHTMLDocument2, like this:
private string GetSourceHTML()
{
    Regex reg = new Regex(patternURL);
    Match match;
    string result;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        match = reg.Match(ie.LocationURL.ToString());
        if (!string.IsNullOrEmpty(match.Value))
        {
            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
            result = doc.body.innerHTML.ToString();
            return result;
        }
    }
    result = string.Empty;
    return result;
}
So now I need to get the whole XML code of the target page. I've googled a lot but haven't found anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which you could hopefully then parse as XML:
Retrieving the HTML source code
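If you need the whole document rather than just the body, one option is to cast to IHTMLDocument3 and read documentElement.outerHTML, which covers the full source including the root element. An untested sketch, adapted from your GetSourceHTML method:

private string GetSourceXML()
{
    Regex reg = new Regex(patternURL);
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        if (!string.IsNullOrEmpty(reg.Match(ie.LocationURL.ToString()).Value))
        {
            // documentElement is the root node, so outerHTML returns
            // the entire document, not just what's inside <body>
            mshtml.IHTMLDocument3 doc = (mshtml.IHTMLDocument3)ie.Document;
            return doc.documentElement.outerHTML;
        }
    }
    return string.Empty;
}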
I'm trying to allow users to post videos on my site by supplying only the URL. Right now I'm able to allow YouTube videos by just parsing the URL and obtaining the ID, and then inserting that ID into their given "embed" code and putting that on the page.
This limits me to YouTube videos, however. What I'm looking to do is something similar to Facebook, where you can put in the YouTube "Share" URL, the URL of the video's page directly, or any other video URL, and it loads the video into their player.
Any idea how they do this, or any other comparable way to show a video based on just a URL? Keep in mind that YouTube (which would probably be the most popular source anyway) doesn't give you the video file's URL, only the URL of the video's page, which is why its embed code is needed with just the ID.
Hopefully this made sense, and I hope somebody might be able to offer me some advice on where to look!
Thanks guys.
I would suggest adding support for OpenGraph attributes, which are common among content services that want other sites to be able to embed their content. The information will be in the pages' <meta> tags, which means you would have to load the URL via something like the HtmlAgilityPack:
var webClient = new WebClient();
var doc = new HtmlDocument();
doc.Load(webClient.OpenRead(url)); // not exactly production quality

var openGraph = new Dictionary<string, string>();
foreach (var meta in doc.DocumentNode.SelectNodes("//meta"))
{
    var property = meta.Attributes["property"];
    var content = meta.Attributes["content"];
    if (property != null && property.Value.StartsWith("og:"))
    {
        openGraph[property.Value]
            = content != null ? content.Value : String.Empty;
    }
}

// Supported by: YouTube, Vimeo, CollegeHumor, etc.
if (openGraph.ContainsKey("og:video"))
{
    // 1. Get the MIME type
    string mime;
    if (!openGraph.TryGetValue("og:video:type", out mime))
    {
        mime = "application/x-shockwave-flash"; // should error
    }

    // 2. Get width/height
    string _w, _h;
    if (!openGraph.TryGetValue("og:video:width", out _w)
        || !openGraph.TryGetValue("og:video:height", out _h))
    {
        _w = _h = "300"; // probably an error :)
    }
    int w = Int32.Parse(_w), h = Int32.Parse(_h);

    Console.WriteLine(
        "<embed src=\"{0}\" type=\"{1}\" width=\"{2}\" height=\"{3}\" />",
        openGraph["og:video"],
        mime,
        w,
        h);
}
I have an ASPX page which has some JavaScript code like:
<script>
setTimeout("document.write('" + place.address + "');",1);
</script>
As is clear from the code, it will write something on the page after a very short delay of 1 ms. I have created another page that executes this page with some query string parameters and gets its output. The problems are:
I cannot avoid the delay: simply writing document.write(place.address); prints nothing, because it takes a little time to get the values, whereas putting it in setTimeout for a delayed output of 1 ms always returns a value.
If I request the output from another page using
System.Net.WebClient wc = new System.Net.WebClient();
System.IO.StreamReader sr = new System.IO.StreamReader(wc.OpenRead("http://localhost:4859/Default.aspx?lat=" + lat + "&lng=" + lng));
string strData = sr.ReadToEnd();
I get the source code of the document instead of the desired output.
I would like to either avoid that delay, or else delay the client request's output, so that I get the desired value and not the source code.
The JavaScript on Default.aspx is:
<script type="text/javascript">
    var geocoder;
    var address;
    function initialize() {
        geocoder = new GClientGeocoder();
        var qs = new Querystring();
        if (qs.get("lat") && qs.get("lng"))
        {
            geocoder.getLocations(new GLatLng(qs.get("lat"), qs.get("lng")), showAddress);
        }
        else
        {
            document.write("Invalid access, or no valid lat/long was provided.");
        }
    }
    function getAddress(overlay, latlng) {
        if (latlng != null) {
            address = latlng;
            geocoder.getLocations(latlng, showAddress);
        }
    }
    function showAddress(r) {
        place = r.Placemark[0];
        setTimeout("document.write('" + place.address + "');", 1);
        //document.write(place.address);
    }
</script>
and the code on requestClient.aspx is as follows:
System.Net.WebClient wc = new System.Net.WebClient();
System.IO.StreamReader sr = new System.IO.StreamReader(wc.OpenRead("http://localhost:4859/Default.aspx?lat=" + lat + "&lng=" + lng));
string strData = sr.ReadToEnd();
I'm not a JavaScript expert, but I believe using document.write after the page has finished loading is a bad thing. You should create an HTML element that your JavaScript can manipulate once the calculation is complete.
Elaboration
In your page markup, create a placeholder for where you want the address to appear:
<p id="address">Placeholder For Address</p>
In your JavaScript function, update that placeholder:
function showAddress(r) {
    place = r.Placemark[0];
    setTimeout(function () {
        // a function callback avoids the string-eval form, which breaks
        // if the address contains a quote character
        document.getElementById('address').innerHTML = place.address;
    }, 1);
}
string strData = sr.ReadToEnd();
I get the source code of the document instead of the desired output
(Could you give a sample of the output? I don't think I've seen a web scraper work that way, so that would help me be sure. If not, this is a good example web scraper.)
Exactly what are you doing with the string strData? If you are just writing it out, I recommend putting it in a server-side control (like a Literal). If at all possible, I'd recommend doing this server-side in .NET rather than waiting 1 ms in JavaScript, which isn't ideal given that 1 ms may or may not be enough time on a particular user's machine (hence "client side"). If I had to do it client-side, I would use the element.onload event to determine whether the page has finished loading.
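For the server-side control route, the sketch is just this (assuming a hypothetical <asp:Literal ID="litOutput" runat="server" /> in the markup):

// requestClient.aspx.cs: render the fetched text through a server-side control
litOutput.Text = Server.HtmlEncode(strData);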