How to search for HTML elements in a StreamReader or string - C#

I've been writing a simple web crawler, and I need to search for elements inside my StreamReader or string. For example, I need to get all the content inside a div with id "bodyDiv". Which tool can help me with this?
private static string GetPage(string url)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "Simple crawler";
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
string htmlText = reader.ReadToEnd();
return htmlText;
}

I would use HtmlAgilityPack
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlText);
var div = doc.DocumentNode.SelectSingleNode("//div[@id='bodyDiv']");
if(div!=null)
{
var yourtext = div.InnerText;
}

Related

How to store data in a session and retrieve it in C# when it is declared using var

I'm getting data from a web service as follows:
string serviceUrl = "https://www.mscholid.com/assings/handlqueryrs/myprod.ashx";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(serviceUrl);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string result = sr.ReadToEnd();
sr.Close();
var rootResult = XElement.Parse(result);
Now I want to put this root result into a session:
Session["rootv"] = rootResult;
Then I want to retrieve it.
The store operation should happen inside a class:
public class NileResult
{
public dynamic nilecruiseFinalData_Images(string selectedID)
{
string serviceUrl = "https://www.mscholid.com/assings/handlqueryrs/myprod.ashx";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(serviceUrl);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string result = sr.ReadToEnd();
sr.Close();
var rootResult = XElement.Parse(result);
//in here I want to store in to a session
}
}
How can I do this?
To access the session of the request, you can use:
HttpContext.Current.Session["rootv"] = rootResult;
HttpContext.Current is the current context of the request.

C# downloading page returns old page

I have a problem. I wrote a method to get the current song on a Czech radio station. They do not have an API, so I had to get the song from the HTML via HtmlAgilityPack.
The problem is that even though the song title changes on the page, my method downloads the old page. Usually I have to wait about 20 seconds with my app closed, and then it works.
I thought it was a caching problem, but I could not fix it.
I also tried the DownloadString method, and it did not refresh either.
public static string[] GetEV2Songs()
{
List<string> songy = new List<string>();
string urlAddress = "http://www.evropa2.cz/";
string data = "";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = null;
if (response.CharacterSet == null)
readStream = new StreamReader(receiveStream);
else
readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
data = readStream.ReadToEnd();
response.Close();
readStream.Close();
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);
string temp = "";
foreach (var node in doc.DocumentNode.SelectNodes("//body//h2"))
{
if (node.InnerText.Contains("&ndash"))
{
temp = node.InnerText.Replace("&ndash;", "-");
songy.Add(temp);
}
}
return songy.ToArray();
}
Sounds like a caching problem. Try replacing the 4th line (the urlAddress assignment) with something like this:
string urlAddress = "http://www.evropa2.cz/?_=" + System.Guid.NewGuid().ToString();
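As a sketch of the same idea, the unique query parameter can be combined with an explicit no-cache policy on the request. The class and method names below (FreshDownload, AddCacheBuster, GetFreshPage) are made up for illustration, and this assumes the stale page comes from a cache that honors either the changed URL or the request's cache policy; it may not help if the site itself serves delayed data.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Cache;

public static class FreshDownload
{
    // Appends a unique "_" query parameter so the URL differs on every call.
    // Uses "&" when the URL already has a query string, "?" otherwise.
    // (Hypothetical helper; URLs containing a "#fragment" are not handled.)
    public static string AddCacheBuster(string url)
    {
        string separator = url.Contains("?") ? "&" : "?";
        return url + separator + "_=" + Guid.NewGuid().ToString("N");
    }

    // Downloads the page while asking the local HTTP cache not to serve
    // or store the response.
    public static string GetFreshPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(AddCacheBuster(url));
        request.CachePolicy = new HttpRequestCachePolicy(HttpRequestCacheLevel.NoCacheNoStore);

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the method from the question, data = FreshDownload.GetFreshPage("http://www.evropa2.cz/"); would then replace the manual request and stream handling.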

How to convert string data to HTML in a RichTextBox

I am fetching a response using HttpWebRequest in a WinForms app, and I want to display it as an HTML page in my form. I am using a RichTextBox for this, but it simply shows the text, not the rendered HTML. How can I do this? Here is my code:
Uri uri = new Uri("http://www.google.com");
if (uri.Scheme == Uri.UriSchemeHttp) {
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(uri);
request.Method = WebRequestMethods.Http.Get;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string tmp = reader.ReadToEnd();
richTextBox1.Text = tmp;
}
There is a .NET control which allows editing of HTML in WinForms.
Take a look at http://winformhtmltextbox.codeplex.com/

XmlDocument failed to load XHTML string because of error "Reference to undeclared entity 'nbsp'"

I use the following code to translate the HTTP response stream into an XmlDocument.
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
Stream responseStream = response.GetResponseStream();
StreamReader responseReader = new StreamReader(responseStream);
String responseString = responseReader.ReadToEnd();
Console.WriteLine(responseString);
Int32 htmlTagIndex = responseString.IndexOf("<html",
StringComparison.OrdinalIgnoreCase);
XmlDocument responseXhtml = new XmlDocument();
responseString = responseString.Substring(htmlTagIndex); // MARK 1
responseString = responseString.Replace("&nbsp", " "); // MARK 2
responseXhtml.LoadXml(responseString);
return responseXhtml;
The MARK 1 line skips the DOCTYPE definition line.
The MARK 2 line avoids the error "Reference to undeclared entity 'nbsp'".
Is there a better way to do this? There are too many string operations in the above code.
Thanks!
I would use HtmlAgilityPack directly to parse the HTML. Even if you have to convert the HTML to XML, you can use it:
using (WebClient wc = new WebClient())
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(wc.DownloadString("http://www.google.com"));
doc.OptionOutputAsXml = true;
StringWriter writer = new StringWriter();
doc.Save(writer);
var xDoc = XDocument.Load(new StringReader(writer.ToString()));
}

How to get the URL response (content: XML data) from the Solr search engine in ASP.NET

My URL is:
http://localhost:8983/solr/db/select/?q=searchtext&version=2.2&start=0&rows=10&indent=on
How can I get the response (XML data) from this URL in ASP.NET? My attempt so far is:
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();
String a = response.ResponseUri.ToString();
But I can't get the content of the XML data.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
String a = readStream.ReadToEnd();
ResponseUri is just the URL for the response. You need to use GetResponseStream().
You should probably be using the XmlDocument class.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Your code would look something like this:
XmlDocument doc = new XmlDocument();
doc.Load(response.GetResponseStream());
string root = doc.DocumentElement.OuterXml;
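Once the XML is available as a string (for example via ReadToEnd() as in the first answer), individual field values can be pulled out with LINQ to XML. This is a minimal sketch that assumes Solr's usual XML layout, with each hit in a <doc> element whose children carry a name attribute; the class name SolrXml, the method GetFieldValues, and the field name "title" used in the example call are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SolrXml
{
    // Returns the text of every child element of a <doc> whose
    // name attribute equals fieldName, in document order.
    public static List<string> GetFieldValues(string xml, string fieldName)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("doc")
                  .Elements()
                  .Where(e => (string)e.Attribute("name") == fieldName)
                  .Select(e => e.Value)
                  .ToList();
    }
}
```

For example, SolrXml.GetFieldValues(a, "title") would collect the title of each returned document, where a is the response string read from the stream.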
