Problem pulling data from website in .NET and C#

I have written a web scraping program to go to a list of pages and write all the html to a file. The problem is that when I pull a block of text some of the characters get written as '�'. How do I pull those characters into my text file? Here is my code:
string baseUri = String.Format("http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}", id.ToString(), name.Trim());
// our third request is for the actual webpage after the login.
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(baseUri);
request.Method = "GET";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
//get the response object, so that we may get the session cookie.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
// and read the response
string page = reader.ReadToEnd();
StreamWriter SW;
string filename = string.Format("{0}.txt", id.ToString());
SW = File.AppendText("C:\\Share\\" + filename);
SW.Write(page);
SW.Close();
reader.Close();
response.Close();

You're saving a page named loadimage to a text file. Are you sure that's really all text?
Either way, you can save yourself a lot of code by using System.Net.WebClient.DownloadFile().
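For example, a minimal sketch of the WebClient approach, reusing the baseUri and filename variables from the question:
using (var client = new System.Net.WebClient())
{
    // same User-Agent the question's request sends, in case the site checks it
    client.Headers[System.Net.HttpRequestHeader.UserAgent] =
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
    // writes the response body straight to disk; no reader/writer pair needed
    client.DownloadFile(baseUri, "C:\\Share\\" + filename);
}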

You need to specify the encoding in this line:
StreamReader reader = new StreamReader(response.GetResponseStream());
Note that File.AppendText("C:\\Share\\" + filename); writes UTF-8 by default, so the problem is on the read side: if the response isn't decoded with the charset the page actually uses, the bad characters end up as '�' in the file.

Specify Unicode encoding, like so:
new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8)
...and the same for the StreamWriter.
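A minimal sketch applying that advice to the question's code (UTF-8 is an assumption here; use whatever charset the page actually declares):
using (var reader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8))
using (var writer = new StreamWriter("C:\\Share\\" + filename, true, System.Text.Encoding.UTF8))
{
    // decode and re-encode with the same, explicit encoding so no
    // characters degrade to '�' between the response and the file
    writer.Write(reader.ReadToEnd());
}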

Related

I am unable to create a PDF file from a URL using ASPOSE.PDF in C# and am getting the following problem

I want to create a PDF file from a URL using ASPOSE.PDF and tried the following code:
string dataDir = @"C:\Users\UbaidUllah\Documents\Visual Studio 2015\Projects\aspose\Data\AsposePDF\DocumentConversion\";
WebRequest request = WebRequest.Create("https://www.cricbuzz.com/cricket-stats/icc-rankings/men/batting");
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
string responseFromServer = reader.ReadToEnd();
reader.Close();
dataStream.Close();
response.Close();
MemoryStream stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(responseFromServer));
HtmlLoadOptions options = new HtmlLoadOptions("https://www.cricbuzz.com/");
Document pdfDocument = new Document(stream, options); // ---- execution never gets past this line, and no error is thrown ----
pdfDocument.Save(dataDir + "WebPageToPDF_out.pdf");
When I debug this code, it never moves past the second-to-last line (as noted in the comment) and gives no error; I have waited a long time without getting any response.
I don't know where the mistake is. Please review the code and help me solve the issue.
Thank you very much!

how to open webpage in c# without using webbrowser class

I want to know how to open a webpage in C# without using the WebBrowser class. This is my first time with C#. I tried the code below, but it did not work. Can anyone help?
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create("http://google.com");
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
If you simply want to open a website without doing anything else with it, you can do something like this to open it in the system's default browser:
string url = "http://google.com";
System.Diagnostics.Process.Start(url);

programmatically clicking links in c#

I'm trying to make a script to click links programmatically.
I managed to get the whole HTML page into a string; now I want to somehow click the elements I have there. I'm kind of lost, so any info could help.
I've tried to get the document as an HtmlDocument, but for some reason when I use the GetElementById method it doesn't find the element.
Please, any info would help.
Thanks.
Currently this is the code I've got.
It brings me up to the point where I have a string whose value is the HTML document; now I need to somehow extract the relevant tag and click it programmatically.
Thanks for your inputs; still waiting for one that can help me.
string email = "someemail*";
string pw = "somepass";
string PostData = String.Format("email={0}&pass={1}", email, pw);
CookieContainer cookieContainer = new CookieContainer();
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("http://www.facebook.com/*******");
req.CookieContainer = cookieContainer;
req.Method = "POST";
req.ContentLength = PostData.Length;
req.ContentType = "application/x-www-form-urlencoded";
req.AllowAutoRedirect = true;
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
ASCIIEncoding encoding = new ASCIIEncoding();
byte[] loginDataBytes = encoding.GetBytes(PostData);
req.ContentLength = loginDataBytes.Length;
Stream stream = req.GetRequestStream();
stream.Write(loginDataBytes, 0, loginDataBytes.Length);
HttpWebResponse webResp = (HttpWebResponse)req.GetResponse();
Stream datastream = webResp.GetResponseStream();
StreamReader reader = new StreamReader(datastream);
string sLine = "";
string json = "";
while (sLine != null)
{
sLine = reader.ReadLine();
json += sLine;
}
Perhaps you might want to look at WatiN; it allows you to do all of this really easily.
A "click on a link" is the same thing as sending a HTTP request. If you can parse the URI from the document you have, you can create the HTTP request separately and send that.
Clicking on a link is done by issuing a HTTP-Get for the href of the link.
If there is JavaScript interactivity, then you need to take a webbrowser element, and inject a javascript, that on document.ready executes document.getElementById("whatever").click()
See
How do I programmatically click a link with javascript?
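For example, a rough sketch using the WinForms WebBrowser control; it fires the click via InvokeMember rather than injecting script, and the element id "whatever" is a placeholder:
webBrowser1.DocumentCompleted += (s, e) =>
{
    // runs once the page and its scripts have loaded
    var element = webBrowser1.Document.GetElementById("whatever");
    if (element != null)
        element.InvokeMember("click"); // triggers the element's click handler
};
webBrowser1.Navigate("http://www.facebook.com/*******");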
You can use the HTML Agility Pack to parse an HTML document and extract the href attribute.
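A minimal sketch of that approach, assuming the HTML Agility Pack package plus the json string and cookieContainer from the question's code; the link id is a placeholder:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(json); // the page HTML collected above
var link = doc.DocumentNode.SelectSingleNode("//a[@id='whatever']");
if (link != null)
{
    // "click" the link by issuing a GET for its href
    // (if the href is relative, combine it with the page's base URI first)
    string href = link.GetAttributeValue("href", null);
    var linkReq = (HttpWebRequest)WebRequest.Create(href);
    linkReq.CookieContainer = cookieContainer; // reuse the session cookies
    using (var linkResp = (HttpWebResponse)linkReq.GetResponse())
    using (var linkReader = new StreamReader(linkResp.GetResponseStream()))
    {
        string nextPage = linkReader.ReadToEnd();
    }
}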

Grabbing HTML from URL doesn't work - any tips?

I have tried several methods in C# using WebClient and WebResponse, and they all return
<html><head><meta http-equiv="REFRESH" content="0; URL=http://www.windowsphone.com/en-US/games?list=xbox"><script type="text/javascript">function OnBack(){}</script></head></html>
instead of the actual rendered page you get when you use a browser to go to http://www.windowsphone.com/en-US/games?list=xbox
How would you go about grabbing the HTML from that location?
http://www.windowsphone.com/en-US/games?list=xbox
Thanks!
/edit: examples added:
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
Uri inputUri = new Uri(inputUrl);
WebRequest request = WebRequest.CreateDefault(inputUri);
request.Method = "GET";
WebResponse response;
try
{
response = request.GetResponse();
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
resultHTML = reader.ReadToEnd();
}
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebClient webClient = new WebClient();
try
{
resultHTML = webClient.DownloadString(inputUrl);
}
catch { }
Tried:
string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
string resultHTML = String.Empty;
WebResponse objResponse;
WebRequest objRequest = HttpWebRequest.Create(inputUrl);
try
{
objResponse = objRequest.GetResponse();
using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
{
resultHTML = sr.ReadToEnd();
sr.Close();
}
}
catch { }
I checked this URL, and you need to handle the cookies.
When you try to access the page for the first time, you are redirected to an https URL on login.live.com and then redirected back to the original URL. The https page sets a cookie called MSPRequ for the domain login.live.com. If you do not have this cookie, you cannot access the site.
I tried disabling cookies in my browser and it ends up looping infinitely back to the URL https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1328303901&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fgames%3Flist%3Dxbox&lc=1033&id=268289. It's been going on for several minutes now and doesn't appear it will ever stop.
So you will have to grab the cookie from the https page when it is set, and persist that cookie for your subsequent requests.
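A minimal sketch of that: share one CookieContainer across requests so the cookies set during the redirect chain (such as MSPRequ) are replayed automatically:
var cookies = new CookieContainer();
var req = (HttpWebRequest)WebRequest.Create("http://www.windowsphone.com/en-US/games?list=xbox");
req.CookieContainer = cookies;  // cookies set along the redirects land in here
req.AllowAutoRedirect = true;   // follow the login.live.com round-trip
using (var resp = (HttpWebResponse)req.GetResponse())
using (var sr = new StreamReader(resp.GetResponseStream()))
{
    string resultHTML = sr.ReadToEnd();
}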
This might be because the server you are requesting HTML from returns different HTML depending on the User-Agent string. You might try something like this:
webClient.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
That particular header may not work, but you could try others that would mimic standard browsers.

Deserialize problem for timestamp in ASP.NET

We have a page in our project that an external website posts XML data to.
string data = string.Concat("XMLParameter=", SampleXML, "&", "AccessCode=", "XYZ");
if (uri.Scheme == Uri.UriSchemeHttp)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(uri);
request.Method = WebRequestMethods.Http.Post;
request.ContentLength = data.Length;
request.ContentType = "application/x-www-form-urlencoded";
StreamWriter writer = new StreamWriter(request.GetRequestStream());
writer.Write(data);
writer.Close();
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string tmp = reader.ReadToEnd();
response.Close();
Response.Write(tmp);
}
When we pass the following XML with the POST:
<notification timestamp="2009-09-11T11:51:07+02:00">
<reservation creation_date="2010-09-10T12:03:13">
</reservation>
</notification>
On the receiving end, the timestamp we receive does not contain the +; instead of the +, we get a space character, so we get an error while deserializing it.
We read the data in the page using Request.Form["XMLParameter"].
Any solution?
You have a funny service where the XML data is posted as if it were coming from an HTML form with a text field containing the XML data.
As a consequence, the web server expects your data to be URL encoded. You even say in the content type that it's encoded, but it isn't: you don't encode it anywhere in your code. That is exactly why the + disappears: in a form-encoded body, a literal + is decoded as a space.
So if you want to stick with your funny service, you need to run your XML data through HttpUtility.UrlEncode first.
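A minimal sketch of that fix, applied to the first line of your code (System.Web reference assumed for HttpUtility):
// encode the values so the '+' is sent as %2B instead of decoding to a space
string data = string.Concat(
    "XMLParameter=", System.Web.HttpUtility.UrlEncode(SampleXML),
    "&AccessCode=", System.Web.HttpUtility.UrlEncode("XYZ"));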
