How does Cite Bite achieve its cache? - C#

Can anyone shed light on how Cite Bite achieves its cache, and in particular how it is able to display the cache with the same layout as the original page?
I am looking to achieve something very similar. I pulled the HTML from the source using:
public static string sourceCache(string URL)
{
    string sourceURL = URL;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();
        StreamReader readStream = null;
        // honour the character set reported by the server, if any
        if (response.CharacterSet == null)
        {
            readStream = new StreamReader(receiveStream);
        }
        else
        {
            readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
        }
        string data = readStream.ReadToEnd();
        response.Close();
        readStream.Close();
        return data;
    }
    return "couldn't retrieve cache";
}
which I then store in my database as nvarchar(max). When loading the page to display the cache, I pull the field and set it as the InnerHtml of a div control.
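On the display side it is roughly this (the div is a runat="server" control; the control ID and the data-access helper are placeholders for my actual ones):

// .aspx markup: <div id="cacheDiv" runat="server"></div>
protected void Page_Load(object sender, EventArgs e)
{
    // pull the nvarchar(max) field back out of the database (placeholder helper)
    string cachedHtml = LoadCachedHtmlFromDatabase();
    cacheDiv.InnerHtml = cachedHtml;
}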
However, whereas Cite Bite's cache retains the styling and layout of the source page, mine does not.
Where am I going wrong?
I have an ASP.NET 4.5 C# Web Forms website.

Create one for this page and look at the source. The secret is:
<base href="http://stackoverflow.com/questions/28432505/how-does-cite-bite-acheive-its-cache" />
The HTML <base> element specifies the base URL to use for all relative URLs contained within a document. There can be only one <base> element in a document.

As per @Alex K above, the base element appears to be the issue.
I have amended the code to check whether the existing HTML already contains "base href" and, if not, to insert a base element with its href set to the source URL:
public static string sourceCache(string URL)
{
    string sourceURL = URL;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();
        StreamReader readStream = null;
        if (response.CharacterSet == null)
        {
            readStream = new StreamReader(receiveStream);
        }
        else
        {
            readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
        }
        string data = readStream.ReadToEnd();
        response.Close();
        readStream.Close();
        if (data.Contains("base href"))
        {
            return data;
        }
        else
        {
            //we need to insert the base href with the source url
            data = basecache(data, URL);
            return data;
        }
    }
    return "couldn't retrieve cache";
}
public static string basecache(string htmlsource, string urlsource)
{
    //make sure there is a head tag
    if (htmlsource.IndexOf("<head>") != -1)
    {
        int headtag = htmlsource.IndexOf("<head>");
        string newhtml = htmlsource.Insert(headtag + "<head>".Length, "<base href='" + urlsource + "'/>");
        return newhtml;
    }
    else if (htmlsource.IndexOf("<head ") != -1)
    {
        //head tag with attributes: insert just after its closing '>'
        int headtag = htmlsource.IndexOf("<head ");
        int endOfTag = htmlsource.IndexOf(">", headtag);
        string newhtml = htmlsource.Insert(endOfTag + 1, "<base href='" + urlsource + "'/>");
        return newhtml;
    }
    else
    {
        return htmlsource;
    }
}
So far I've only tested it on a few sites/domains, but it appears to work. Thank you so much, Alex, for your help.
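For reference, this is roughly how I call it and store the result; the connection string, table and column names here are just placeholders for my actual schema (uses System.Data.SqlClient):

string url = "http://example.com/some-article";
string cachedHtml = sourceCache(url);

// store as nvarchar(max); "PageCache" and its columns are illustrative
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO PageCache (Url, Html) VALUES (@url, @html)", conn))
{
    cmd.Parameters.AddWithValue("@url", url);
    cmd.Parameters.AddWithValue("@html", cachedHtml);
    conn.Open();
    cmd.ExecuteNonQuery();
}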

Related

How would I extract certain information from a website into a table in C#?

I have read in the information, but I can't seem to figure out how to get the information I want (i.e. the name, type, etc.) into a table.
Here is what I did to get the data (wrapped in a try/finally):
string url = ("http://www.oiseaux-birds.com/card-laughing-dove.html");
WebResponse response = null;
StreamReader reader = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
response = request.GetResponse();
reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
result = reader.ReadToEnd();
}
finally
{
if (reader != null)
reader.Close();
if (response != null)
response.Close();
Console.WriteLine(result);
}

How can I store a part of a very large HTML stream?

I have to get the HTML code of a web page and then find this element:
<span class='uccResultAmount'>0,896903</span>
I have tried with regular expressions, and also with streams, i.e. storing the whole HTML code in a string. However, the code is too large for a single string, which makes it impossible: the amount 0,896903 I am searching for does not exist in the string.
Is there any way to only read a little block of the Stream?
A part of the method:
public static string getValue()
{
    string data = "not found";
    string urlAddress = "http://www.xe.com/es/currencyconverter/convert/?Amount=1&From=USD&To=EUR";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();
        StreamReader readStream = null;
        if (response.CharacterSet == null)
        {
            readStream = new StreamReader(receiveStream);
        }
        else
        {
            readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
        }
        data = readStream.ReadToEnd(); // the string in which I should search for the amount
        response.Close();
        readStream.Close();
    }
If you find an easier way to fix my problem, let me know.
I would use HtmlAgilityPack and XPath:
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.xe.com/es/currencyconverter/convert/?Amount=1&From=USD&To=EUR");
var value = doc.DocumentNode.SelectSingleNode("//span[@class='uccResultAmount']")
                            .InnerText;
A LINQ version is also possible:
var value = doc.DocumentNode.Descendants("span")
                            .Where(s => s.Attributes["class"] != null && s.Attributes["class"].Value == "uccResultAmount")
                            .First()
                            .InnerText;
Don't use the following in practice; it is only to show that the claim "this HTML code does not fit in a single string" is not correct:
string html = new WebClient().DownloadString("http://www.xe.com/es/currencyconverter/convert/?Amount=1&From=USD&To=EUR");
var val = Regex.Match(html, @"<span[^>]+?class='uccResultAmount'>(.+?)</span>")
               .Groups[1]
               .Value;
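If you really do want to avoid keeping the whole page in memory, you could also read the StreamReader from the question in small chunks and stop as soon as the value appears. A rough, untested sketch (buffer and overlap sizes are arbitrary):

// Read the page in 4 KB chunks and stop once the value after the marker is complete.
// Uses the readStream / data variables from the method in the question.
const string marker = "class='uccResultAmount'>";
char[] buffer = new char[4096];
var window = new StringBuilder();
int read;
while ((read = readStream.Read(buffer, 0, buffer.Length)) > 0)
{
    window.Append(buffer, 0, read);
    string s = window.ToString();
    int idx = s.IndexOf(marker);
    if (idx >= 0)
    {
        int start = idx + marker.Length;
        int end = s.IndexOf('<', start);
        if (end > start)
        {
            data = s.Substring(start, end - start); // e.g. "0,896903"
            break;
        }
        // marker seen but the value is still incomplete: keep reading
    }
    else if (window.Length > 4096)
    {
        // keep only a short tail so the marker cannot be split across chunk boundaries
        window.Remove(0, window.Length - marker.Length);
    }
}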

C# downloading page returns old page

I have a problem: I wrote a method to get the current song on a Czech radio station. They do not have an API, so I had to extract the song from the HTML via HtmlAgilityPack.
The problem is that even though the song title changes on the page, my method downloads the old page; usually I have to wait about 20 seconds with my app closed, and then it works.
I suspected a caching problem, but I could not fix it.
I also tried the DownloadString method, which did not refresh either.
public static string[] GetEV2Songs()
{
    List<string> songy = new List<string>();
    string urlAddress = "http://www.evropa2.cz/";
    string data = "";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStream = response.GetResponseStream();
        StreamReader readStream = null;
        if (response.CharacterSet == null)
            readStream = new StreamReader(receiveStream);
        else
            readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
        data = readStream.ReadToEnd();
        response.Close();
        readStream.Close();
    }
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(data);
    string temp = "";
    foreach (var node in doc.DocumentNode.SelectNodes("//body//h2"))
    {
        if (node.InnerText.Contains("&ndash"))
        {
            temp = node.InnerText.Replace("–", "-");
            songy.Add(temp);
        }
    }
    return songy.ToArray();
}
Sounds like a caching problem. Try replacing the line that sets urlAddress with something like this:
string urlAddress = "http://www.evropa2.cz/?_=" + System.Guid.NewGuid().ToString();

Can't get ResponseStream from WebException

I have a desktop client that communicates with the server side via HTTP.
When the server has issues with data processing, it returns a description of the error as JSON in the HTTP response body, with an appropriate HTTP status code (mainly HTTP 400).
When I read an HTTP 200 response, everything is fine and this code works:
using (var response = await httpRequest.GetResponseAsync(token))
{
    using (var reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("utf-8")))
    {
        return await reader.ReadToEndAsync();
    }
}
But when an error occurs and a WebException is thrown and caught, there is this code:
catch (WebException ex)
{
    if (ex.Status == WebExceptionStatus.ProtocolError)
    {
        using (var response = (HttpWebResponse)ex.Response)
        {
            using (var stream = response.GetResponseStream())
            {
                using (var reader = new StreamReader(stream, Encoding.GetEncoding("utf-8")))
                {
                    var json = reader.ReadToEnd();
                }
            }
        }
    }
}
I have already tried a few things to make it work, but this is what happens:
response.ContentLength is valid (184),
but stream.Length is 0,
and after that I can't read the JSON (it's "").
I don't even know where to look, because everything looks like it should work.
What might be the problem?
After a month of almost daily thinking, I've found a workaround.
The thing is that WebException.Response.GetResponseStream() does not return exactly the same stream that was obtained during the request (I can't find the link to MSDN right now), and by the time we get to catch the exception and read this stream, the actual response stream is lost (or something like that; I don't really know and was unable to find any info on the net, except by looking into CoreCLR, which is now open source).
To preserve the actual response until the WebException is caught, you must set the KeepAlive property on your HttpWebRequest, and voila, you get your response while catching the exception.
So the working code looks like this:
try
{
    var httpRequest = WebRequest.CreateHttp(Protocol + ServerUrl + ":" + ServerPort + ServerAppName + url);
    if (HttpWebRequest.DefaultMaximumErrorResponseLength < int.MaxValue)
        HttpWebRequest.DefaultMaximumErrorResponseLength = int.MaxValue;
    httpRequest.ContentType = "application/json";
    httpRequest.Method = method;
    var encoding = Encoding.GetEncoding("utf-8");
    if (httpRequest.ServicePoint != null)
    {
        httpRequest.ServicePoint.ConnectionLeaseTimeout = 5000;
        httpRequest.ServicePoint.MaxIdleTime = 5000;
    }
    //----HERE--
    httpRequest.KeepAlive = true;
    //----------
    using (var response = await httpRequest.GetResponseAsync(token))
    {
        using (var reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
catch (WebException ex)
{
    if (ex.Status == WebExceptionStatus.ProtocolError)
    {
        using (var response = (HttpWebResponse)ex.Response)
        {
            using (var stream = response.GetResponseStream())
            {
                using (var reader = new StreamReader(stream, Encoding.GetEncoding("utf-8")))
                {
                    return reader.ReadToEnd();
                    //or handle it like you want
                }
            }
        }
    }
}
I don't know whether it is good to keep every connection alive like that, but since it helped me read the actual responses from the server, I think it might help someone who has faced the same problem.
EDIT: Also it is important not to mess with HttpWebRequest.DefaultMaximumErrorResponseLength.
I remember facing a similar issue before, and there was something related to setting the stream's position. Here is one of my solutions for reading the web response that worked for me earlier. Please try whether a similar approach works for you:
private ResourceResponse readWebResponse(HttpWebRequest webreq)
{
    HttpWebRequest.DefaultMaximumErrorResponseLength = 1048576;
    HttpWebResponse webresp = null; // = webreq.GetResponse() as HttpWebResponse;
    var memStream = new MemoryStream();
    Stream webStream;
    try
    {
        webresp = (HttpWebResponse)webreq.GetResponse();
        webStream = webresp.GetResponseStream();
        byte[] readBuffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = webStream.Read(readBuffer, 0, readBuffer.Length)) > 0)
            memStream.Write(readBuffer, 0, bytesRead);
    }
    catch (WebException e)
    {
        var r = e.Response as HttpWebResponse;
        webStream = r.GetResponseStream();
        memStream = Read(webStream); // Read is a helper (not shown) that copies the stream into a MemoryStream
        var wrongLength = memStream.Length;
    }
    memStream.Position = 0;
    StreamReader sr = new StreamReader(memStream);
    string webStreamContent = sr.ReadToEnd();
    byte[] responseBuffer = Encoding.UTF8.GetBytes(webStreamContent);
    //......
    //.......
Hope this helps!

Handling Cookies with the Browser Control in C#

I am automating a process on a website. For simplicity's sake, let's say there are two parts to the process: the first is logging into the site and the second is clicking a button on the page. I believe that the login mechanism uses a cookie to handle authentication. I have used Fiddler and was able to view this cookie.
The issue is that, as of now, I can automate the login and click the button, but I can only do it for one control. I have only one login, and the system does not allow me to log in again using another browser. What I want to do is issue multiple requests to click the button at the same time, but right now I am stuck doing them sequentially.
Is there a way that I can get the cookies from the browser control and use them for other web requests?
You can make your own requests using the HttpWebRequest and response objects.
The function below might help you with this. By using it, you don't need to log in again and again; you just add the cookies to the request, which provides authentication, and send the requests in a loop:
public static bool SessionRequest(Fiddler.Session oS, ref string htmlContent, string requestMethod)
{
    try
    {
        WebRequest request = WebRequest.Create(oS.fullUrl);
        if (oS != null && oS.oRequest.headers != null && oS.oRequest.headers.Count() > 0)
        {
            NameValueCollection coll = new NameValueCollection();
            request.Headers = new WebHeaderCollection();
            foreach (Fiddler.HTTPHeaderItem rh in oS.oRequest.headers)
            {
                if (rh.Name.Contains("Cookie"))
                {
                    ((HttpWebRequest)request).CookieContainer = new CookieContainer();
                    string[] cookies = UtilitiesScreenScrapper.UtilityMethods.SplitString(rh.Value, ";");
                    if (cookies != null && cookies.Length > 0)
                    {
                        foreach (string c in cookies)
                        {
                            string[] cookie = UtilitiesScreenScrapper.UtilityMethods.SplitString(c, "=");
                            if (cookie != null && cookie.Length > 0)
                            {
                                cookie[0] = cookie[0].Replace(" ", "%");
                                cookie[1] = cookie[1].Replace(" ", "%");
                                ((HttpWebRequest)request).CookieContainer.Add(new Uri(oS.fullUrl), new Cookie(cookie[0].Trim(), cookie[1].Trim()));
                            }
                        }
                    }
                    else
                    {
                        string[] cookie = UtilitiesScreenScrapper.UtilityMethods.SplitString(rh.Value, "=");
                        if (cookie != null && cookie.Length > 0)
                        {
                            ((HttpWebRequest)request).CookieContainer.Add(new Uri(oS.url), new Cookie(cookie[0], cookie[1]));
                        }
                    }
                }
                else if (rh.Name.Contains("User-Agent"))
                {
                    ((HttpWebRequest)request).UserAgent = rh.Value;
                }
                else if (rh.Name.Contains("Host"))
                {
                    ((HttpWebRequest)request).Host = "www." + oS.host;
                }
                else if (rh.Name.Equals("Accept"))
                {
                    ((HttpWebRequest)request).Accept = rh.Value;
                }
                else if (rh.Name.Contains("Content-Type"))
                {
                    ((HttpWebRequest)request).ContentType = rh.Value;
                }
                else if (rh.Name.Contains("Content-Length"))
                {
                    ((HttpWebRequest)request).ContentLength = oS.RequestBody.Length;
                }
                else if (rh.Name.Contains("Connection"))
                {
                    //((HttpWebRequest)request).Connection = rh.Value;
                }
                else if (rh.Name.Equals("Referer"))
                {
                    ((HttpWebRequest)request).Referer = oS.host;
                }
                else
                {
                    ((HttpWebRequest)request).Headers.Add(rh.Name + ":");
                    ((HttpWebRequest)request).Headers[rh.Name] = rh.Value;
                }
            }
            ((HttpWebRequest)request).Headers.Add("Conneciton:");
            ((HttpWebRequest)request).Headers["Conneciton"] = "keep-alive";
            ((HttpWebRequest)request).AllowAutoRedirect = true;
            Stream dataStream;
            if (oS.RequestBody.Length > 0)
            {
                request.Method = "POST";
                // Get the request stream.
                dataStream = request.GetRequestStream();
                // Write the data to the request stream.
                dataStream.Write(oS.RequestBody, 0, oS.RequestBody.Length);
                // Close the Stream object.
                dataStream.Close();
            }
            else
            {
                request.Method = "GET";
            }
            //string postData = string.Empty;
            //byte[] byteArray = Encoding.UTF8.GetBytes(postData);
            // Set the ContentType property of the WebRequest.
            //request.ContentType = "application/x-www-form-urlencoded";
            // Get the response.
            WebResponse response = request.GetResponse();
            //resp = response;
            // Display the status.
            Console.WriteLine(((HttpWebResponse)response).StatusDescription);
            // Get the stream containing content returned by the server.
            dataStream = response.GetResponseStream();
            // Open the stream using a StreamReader for easy access.
            StreamReader reader = new StreamReader(dataStream);
            // Read the content.
            string responseFromServer = reader.ReadToEnd();
            // Display the content.
            //Console.WriteLine(responseFromServer);
            htmlContent = responseFromServer;
            // Clean up the streams.
            reader.Close();
            dataStream.Close();
            response.Close();
        }
    }
    catch (Exception ex)
    {
        throw ex;
    }
    return false;
}
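If you would rather take the cookies straight from the WebBrowser control instead of a Fiddler session, its Document.Cookie string can be parsed the same way. A rough sketch (needs System.Net and System.Windows.Forms; note that Document.Cookie does not expose HttpOnly cookies, so it may miss some authentication cookies):

// Copy cookies out of a WinForms WebBrowser control into a CookieContainer.
public static CookieContainer CookiesFromBrowser(System.Windows.Forms.WebBrowser browser, Uri siteUri)
{
    var container = new CookieContainer();
    // Document.Cookie looks like "name1=value1; name2=value2"
    string raw = browser.Document != null ? browser.Document.Cookie : null;
    if (!string.IsNullOrEmpty(raw))
    {
        foreach (string pair in raw.Split(';'))
        {
            string[] parts = pair.Split(new[] { '=' }, 2);
            if (parts.Length == 2)
                container.Add(siteUri, new Cookie(parts[0].Trim(), parts[1].Trim()));
        }
    }
    return container;
}

The returned container can then be assigned to request.CookieContainer before sending each of the parallel requests.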
