Getting the source code of a website - C#

I used the following code to get the source code of a SharePoint 2010 site:
try
{
    WebRequest req = WebRequest.Create("myLink");
    req.Method = "GET";
    req.Credentials = System.Net.CredentialCache.DefaultNetworkCredentials;
    string source = "";
    using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
    {
        source = reader.ReadToEnd();
    }
}
catch (WebException ex)
{
    // log or handle the failed request here
}
From the source string I was able to search for the keywords I was looking for on the website.
Now the SharePoint site has been migrated to 2016, and I can no longer find the specific content in the source code.
However, it is possible to view the structure of the site with, for example, Chrome's built-in developer tools, and there the content I am looking for is visible.
How can I get this information programmatically using C#?

Try this:
using (WebClient client = new WebClient()) // WebClient implements IDisposable
{
    client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
    string htmlCode = client.DownloadString("myLink");
    //...
}
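One caveat: the original SharePoint snippet sent Windows credentials, which WebClient does not do by default. A minimal sketch with credentials enabled, assuming the migrated site still accepts your network login ("myLink" is the placeholder from the question):
using (WebClient client = new WebClient())
{
    client.UseDefaultCredentials = true; // same effect as CredentialCache.DefaultNetworkCredentials
    client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
    string htmlCode = client.DownloadString("myLink");
}
Note also that if the content is injected by JavaScript after the page loads (which would explain why it shows up in Chrome's developer tools but not in the raw source), no plain HTTP download will contain it, regardless of which client class you use.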

Related

Cannot get page source using C# HttpWebRequest

I tried to get the source of a particular site page using the code below, but it failed.
I can get the page source in 1-2 seconds using a WebBrowser control or WebDriver, but HttpWebRequest fails.
I tried copying the actual browser cookies into the HttpWebRequest, but that failed too.
(Exception - The operation has timed out)
I wonder why it failed and want to learn from the failure.
Thank you in advance!
string Html = String.Empty;
CookieContainer cc = new CookieContainer();
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://www.coupang.com/");
req.Method = "GET";
req.Host = "www.coupang.com";
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36";
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3";
req.Headers.Add("Accept-Language", "ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7");
req.CookieContainer = cc;
using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
using (StreamReader str = new StreamReader(res.GetResponseStream(), Encoding.UTF8))
{
    Html = str.ReadToEnd();
}
Removing req.Host from your code should do the trick.
According to the documentation:
If the Host property is not set, then the Host header value to use in an HTTP request is based on the request URI.
You already set the URI in (HttpWebRequest)WebRequest.Create("https://www.coupang.com/"), so I don't think setting it again is necessary.
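A minimal sketch of the corrected request (the same code with the Host line dropped):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://www.coupang.com/");
req.Method = "GET";
// no req.Host here - the Host header is derived from the request URI
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36";
req.CookieContainer = new CookieContainer();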
Please let me know if it helps.

C# Empty WebClient DownloadString

I'm trying to download the HTML string of a website. The website has the following URL:
https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de
First I tried a simple WebClient request:
var wc = new WebClient();
string websitenstring = "";
websitenstring = wc.DownloadString("http://www.gastrosg.ch/default.asp?id=3020000&siteid=1&langid=de");
But, the websiteString was empty. Then, I read in some posts, that I have to send some additional headerinformations :
var wc = new WebClient();
string websitenstring = "";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate, br";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7";
wc.Headers[HttpRequestHeader.CacheControl] = "max-age=0";
wc.Headers[HttpRequestHeader.Host] = "www.gastrobern.ch";
wc.Headers[HttpRequestHeader.Upgrade] = "www.gastrobern.ch";
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
websitenstring = wc.DownloadString("https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de");
I tried this, but got no response. Then I also tried setting some cookies:
wc.Headers.Add(HttpRequestHeader.Cookie,
"CFID=10609582;" +
"CFTOKEN=32721418;" +
"_ga=GA1.2.37" +
"_ga=GA1.2.379124242.1539000256;" +
"_gid=GA1.2.358798732.1539000256;" +
"_dc_gtm_UA-1237799-1=1;");
But this didn't work either. I also found out that the browser somehow makes multiple requests, while my C# application makes just one and shows only the first response headers.
But I don't know how to make a follow-up request. I'm thankful for every answer.
Try HttpClient instead. Here is an example of how to use it:
public static async Task<string> GetString(string url)
{
    using (HttpClient client = new HttpClient())
    {
        // ConfigureAwait(false) avoids a deadlock when the caller blocks with .Result
        HttpResponseMessage message = await client.GetAsync(url).ConfigureAwait(false);
        return await message.Content.ReadAsStringAsync().ConfigureAwait(false);
    }
}
To call this method:
string dataFromServer = GetString("https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de").Result;
I checked, and dataFromServer contains the HTML content of that page.
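If the calling code can itself be async, awaiting the task is safer than blocking with .Result (a small sketch of the same call, to be placed inside an async method):
string dataFromServer = await GetString("https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de");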

UPS: AWB tracking via web crawling

Problem
Here at work, people spend a lot of time tracking AWBs (air waybills) from different sources (UPS, FedEx, DHL, ...). I was asked to improve the process in order to save valuable time. I was thinking of using Excel as the platform with Excel-DNA & C#, but my tests so far (crawling UPS) have had no success.
Tests
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://wwwapps.ups.com/WebTracking/track?HTMLVersion=5.0&loc=es_MX&Requester=UPSHome&WBPM_lid=homepage%2Fct1.html_pnl_trk&trackNums=5007052424&track.x=Rastrear");
request.Method = "GET";
request.UserAgent = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Headers.Add("Accept-Language: es-ES,es;q=0.8");
request.Headers.Add("Accept-Encoding: gzip,deflate,sdch");
request.KeepAlive = false;
request.Referer = @"http://www.ups.com/";
request.ContentType = "text/html; charset=utf-8";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
Or...
using (var client = new WebClient())
{
    var values = new NameValueCollection();
    values.Add("HTMLVersion", "5.0");
    values.Add("loc", "es_MX");
    values.Add("Requester", "UPSHome");
    values.Add("WBPM_lid", "homepage/ct1.html_pnl_trk");
    values.Add("trackNums", "5007052424");
    values.Add("track.x", "Rastrear");
    client.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    client.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
    client.Headers[HttpRequestHeader.AcceptLanguage] = "es-ES,es;q=0.8";
    client.Headers[HttpRequestHeader.Referer] = @"http://www.ups.com/";
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
    string url = @"https://wwwapps.ups.com/WebTracking/track?";
    byte[] result = client.UploadValues(url, values);
    System.IO.File.WriteAllText(@"C:\UPSText.txt", Encoding.UTF8.GetString(result));
}
But none of the above examples worked as expected.
Question
Is it possible to web-crawl UPS in order to keep track of AWBs?
Note
Currently, I have no access to the UPS API.
I just finished writing my script for it. The trick is that there is another URL where you can put the tracking number directly in the query string and land on the results page. You then have to parse the tables; treating the markup as XML won't work, so offset from a known header instead.
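A minimal sketch of that idea; the direct-tracking URL pattern and the "Shipment Progress" anchor text below are assumptions to illustrate the approach, not values taken from the answer:
string trackingNumber = "5007052424";
// hypothetical direct-tracking URL - verify the exact pattern against the UPS site
string url = "https://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=" + trackingNumber;
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
    string html = client.DownloadString(url);
    // offset from a known header instead of parsing the page as XML
    int anchor = html.IndexOf("Shipment Progress", StringComparison.OrdinalIgnoreCase);
    if (anchor >= 0)
    {
        string tableChunk = html.Substring(anchor, Math.Min(2000, html.Length - anchor));
        // extract the rows you need from tableChunk here
    }
}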

Can't access site using WebClient method?

I am making a desktop yellow pages application. I can access every country's yellow pages site but not the Australian one. I don't know why.
Here is the code:
class Program
{
    static void Main(string[] args)
    {
        WebClient wb = new WebClient();
        wb.Headers.Add("user-agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
        string html = wb.DownloadString("http://www.yellowpages.com.au");
        Console.WriteLine(html);
    }
}
For every other site I get the HTML of the website; for the Australian site I get null. I even tried HttpWebRequest as well.
Here is the Australian yellow pages site: http://www.yellowpages.com.au
Thanks in advance.
It looks like that website will only send gzipped data. Try switching to HttpWebRequest and using automatic decompression:
var request = (HttpWebRequest)WebRequest.Create("http://www.yellowpages.com.au");
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705;)";
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
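Reading the body then works like the other snippets on this page (a small sketch continuing the code above):
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd(); // AutomaticDecompression has already unzipped the stream
}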
In addition to @bkaid's correct (and upvoted) answer, you can use your own class inherited from WebClient to handle the gzip-compressed HTML:
public class GZipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}
Having done this, the following works just fine:
WebClient wb = new GZipWebClient();
string html = wb.DownloadString("http://www.yellowpages.com.au");
When I view the transfer from that website in Wireshark, it says it's a malformed HTTP packet: the server announces chunked transfer, then says the following chunk has 0 bytes, and then sends the body of the page. That's why WebClient returns an empty string (not null), and I think that's correct behavior.
It seems browsers ignore this error, so they can display the page properly.
EDIT:
As bkaid pointed out, the server does send a correct gzipped response when asked to. The following code works for me:
WebClient wb = new WebClient();
wb.Headers.Add("Accept-Encoding", "gzip");
string html;
using (var webStream = wb.OpenRead("http://www.yellowpages.com.au"))
using (var gzipStream = new GZipStream(webStream, CompressionMode.Decompress))
using (var streamReader = new StreamReader(gzipStream))
    html = streamReader.ReadToEnd();

Getting HTML source from URL, inline CSS problem

I have a strange problem.
I am getting the HTML source from a URL using this:
string html;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
    {
        html = reader.ReadToEnd();
        reader.Close();
    }
    response.Close();
}
The page that I am requesting has inline CSS like this:
<span class="VL" style="display:inline-block;height:20px;width:0px;"></span>
But the html variable contains only:
<span class="VL" style="display:inline-block;"></span>
Does anyone know why? I have tested with several encodings, and with both WebRequest and WebClient, but it doesn't work either way.
You might need to send a User-Agent header so that the site doesn't think you are a bot; some sites don't bother sending CSS to bots. Reading the remote HTML can also be simplified using a WebClient:
using (var client = new WebClient())
{
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
    string html = client.DownloadString(url);
}
Are you viewing the source through a browser development tool, by clicking "inspect element"? It is possible that you are viewing the source in a browser which adds the height and width properties on the client side through JavaScript, showing you the modified style.
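If that is the case, a plain HTTP download will never contain the JavaScript-added styles; you would have to drive a real browser and read the DOM after the scripts have run. A minimal sketch using Selenium WebDriver (an assumption on my part, since the question doesn't mention Selenium; it needs the Selenium.WebDriver and ChromeDriver NuGet packages):
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless"); // run Chrome without a visible window
using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl(url); // 'url' as in the snippet above
    string renderedHtml = driver.PageSource; // the DOM after JavaScript has run
}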
