Parsing XML from webbrowser control? - c#

I need to parse an XML file (generated by PHP) from a webbrowser control, because the page I am trying to parse requires cookies to track some things. When I use something like:
string xmlUrl = "urltophpfile";
XmlTextReader reader = new XmlTextReader(xmlUrl);
to parse it, cookies aren't used, so I need a webbrowser control or something else that will allow me to use cookies.
The problem I am having is that when I put the webbrowser text into a string (string info = webBrowser2.DocumentText.ToString();), it gives the full source of the web page, so I can't parse it.
Does anyone have any suggestions on how I can work this out please?

You should use HttpWebRequest and specify the CookieContainer property.
This URL has a good example of it: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx
EDIT: To be clear, I mean use HttpWebRequest to fetch the XML, and then load the XML with XmlReader.Create (or an XmlTextReader constructor) using one of the overloads that accepts a stream or a string.
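A minimal sketch of that approach, assuming you already know (or have captured from the webbrowser control) the cookie the site expects; the URL, cookie name, value and domain below are placeholders:

using System.Net;
using System.Xml;

// Placeholder URL and cookie values - substitute your own.
string xmlUrl = "http://example.com/generate.php";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(xmlUrl);
request.CookieContainer = new CookieContainer();
request.CookieContainer.Add(new Cookie("session", "your-session-id", "/", "example.com"));

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (XmlReader reader = XmlReader.Create(response.GetResponseStream()))
{
    while (reader.Read())
    {
        // handle each node here
    }
}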

Related

How to read only a small part of a .XML

I built an application to read a feed, but even though my connection is fast, the page takes several seconds to load. I would like to know how to read only the first records of this .xml:
string rssURL = "http://www.cnt.org.br/Paginas/feed.aspx?t=n";

// Request the feed and load the whole response into an XmlDocument
System.Net.WebRequest myRequest = System.Net.WebRequest.Create(rssURL);
System.Net.WebResponse myResponse = myRequest.GetResponse();
System.IO.Stream rssStream = myResponse.GetResponseStream();

System.Xml.XmlDocument rssDoc = new System.Xml.XmlDocument();
rssDoc.Load(rssStream);

// Select every <item> element in the feed
System.Xml.XmlNodeList rssItems = rssDoc.SelectNodes("rss/channel/item");
Thanks.
As the previous posters mention, you can't download only part of a web request. But you can start parsing the XML before the request has finished. XmlDocument is the wrong approach for your use case, because it needs the complete response before it can build the object. Try XmlTextReader instead (see the sketch below).
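A minimal sketch of that idea using the streaming XmlReader API, stopping after the first few <item> elements (the count of 5 is arbitrary):

using System.Xml;

string rssURL = "http://www.cnt.org.br/Paginas/feed.aspx?t=n";
int itemsRead = 0;

// XmlReader pulls nodes from the response stream as they arrive,
// so we can stop long before the whole document has been downloaded.
using (XmlReader reader = XmlReader.Create(rssURL))
{
    while (itemsRead < 5 && reader.ReadToFollowing("item"))
    {
        string itemXml = reader.ReadOuterXml(); // consumes the whole <item> element
        itemsRead++;
        // process itemXml here
    }
}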
There is no easy way to download part of a web request and ensure it is what you want. One workaround would be to use the Google Feed API.
You'd have to use the JSON interface since they don't provide a library for C#, but since it's going through Google's servers it will be much faster. You'd have to modify your code a little bit, since it returns JSON by default instead of XML, but that is a trivial change to make. You can also change the parameter output=xml to retrieve the XML representation of the data.
Try going to that page; it is your same feed with fewer elements and loads much faster. It only returns a few elements by default, but if you want 10 elements, all you have to do is add num=10 to the URL. Read the API documentation a little more to see what parameters you can add to tailor the request to what you want to do.

Access to the content of a process

I create an instance of IE with this code:
System.Diagnostics.Process p =
    System.Diagnostics.Process.Start("IEXPLORE.EXE",
        @"http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
I want to get the contents of the opened window. Is it possible to read the HTML content by this process?
Use the following code:
using (var client = new WebClient())
{
    string result = client.DownloadString("http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
    // TODO: your logic here
}
No. Your processes run in separate virtual address spaces. It would be a serious security vulnerability if you could read the memory allocated to another process.
Edit: Consider using something like a WebBrowser control in your original process. That way you could easily retrieve the page it displays.
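A minimal sketch of that suggestion, assuming the code runs inside a Windows Forms application (the WebBrowser control needs a message loop, and is normally hosted on a form):

using System.Windows.Forms;

WebBrowser browser = new WebBrowser();

// DocumentText is only complete once the page has finished loading,
// so read it from the DocumentCompleted event.
browser.DocumentCompleted += (sender, e) =>
{
    string html = browser.DocumentText;
    // use the HTML here
};

browser.Navigate("http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");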
It might be possible, but I'd use an HttpWebRequest to obtain the HTML content instead. If you really just want the HTML content for a given http URL, launching IE as a separate process is definitely not the way to go.
You should use the WebClient class to retrieve web page content. Check this link:
http://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.80).aspx

How to get raw page source (not generated source) from c#

The goal is to get the raw source of the page; I mean do not run the scripts or let the browser format the page at all. For example: suppose the source is <table><tr></table>; after the response I don't want to get <table><tbody><tr></tr></tbody></table>. How do I do this from C# code?
More info: for example, typing "view-source:http://feeds.gawker.com/kotaku/full" in the browser's address bar gives you an XML file, but if you just browse to "http://feeds.gawker.com/kotaku/full" it renders an HTML page. What I want is the XML file. Hope this is clear.
Here's one way, but it's not really clear what you actually want.
using (var wc = new WebClient())
{
    // DownloadString returns the response body exactly as the server sent it
    var source = wc.DownloadString("http://google.com");
}
If you mean when rendering your own page: you can get access to the raw page content using a response filter, or by overriding the page's Render method. I would question your motives for doing this, though.
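For the "override page render" route, a minimal sketch inside an ASP.NET WebForms page (an illustration only, not tied to the feed scenario in the question):

using System.IO;
using System.Web.UI;

// Inside your Page class: capture the exact markup the page emits.
protected override void Render(HtmlTextWriter writer)
{
    using (StringWriter sw = new StringWriter())
    using (HtmlTextWriter htw = new HtmlTextWriter(sw))
    {
        base.Render(htw);               // render into our own buffer first
        string rawHtml = sw.ToString(); // the raw source, before the browser touches it
        writer.Write(rawHtml);          // still send it to the client unchanged
    }
}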
Scripts run client-side, so they have no bearing on any C# code.
You can use a tool such as Fiddler to see what is actually being sent over the wire.
Disclaimer: I think Fiddler is amazing.

Read only the title and/or META tag of HTML file, without loading complete HTML file

Scenario:
I need to parse millions of HTML files/pages (as fast as I can) and then read only the title and/or META part of each and dump it to a database.
What I am doing is using the System.Net.WebClient class's DownloadString(url_path) to download the page and then saving it to the database with LINQ to SQL.
But DownloadString gives me the complete HTML source; I only need the title and META tag part.
Any ideas, to download only that much content?
I think you can open a stream for this URL and read just the first x bytes. I can't tell you the exact number, but you can set it to a reasonable value that covers the title and the description.
// Read only the first 'charCount' characters of the response instead of the whole page.
const int charCount = 4096; // arbitrary; pick whatever is enough to cover <title> and the meta tags

HttpWebRequest fileToDownload = (HttpWebRequest)WebRequest.Create("YourURL");
using (WebResponse fileDownloadResponse = fileToDownload.GetResponse())
using (Stream fileStream = fileDownloadResponse.GetResponseStream())
using (StreamReader fileStreamReader = new StreamReader(fileStream))
{
    char[] buffer = new char[charCount];
    int charsRead = fileStreamReader.Read(buffer, 0, charCount);
    string data = new string(buffer, 0, charsRead);
    // parse <title> / <meta> out of 'data' here
}
I suspect that WebClient will try to download the whole page first, in which case you'd probably want a raw client socket (sketched below). Send the appropriate HTTP request (manually, since you're using raw sockets), start reading the response (which will not arrive immediately), and kill the connection when you've read enough. However, the rest will probably already have been sent from the server and be winging its way to your PC whether you want it or not, so you might not save much bandwidth, if any.
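A minimal sketch of that raw-socket approach (plain HTTP only; the host and path are placeholders), dropping the connection as soon as enough bytes have arrived:

using System.Net.Sockets;
using System.Text;

string host = "example.com"; // placeholder host
string path = "/";           // placeholder path

using (TcpClient client = new TcpClient(host, 80))
using (NetworkStream stream = client.GetStream())
{
    // Hand-written HTTP request; Connection: close asks the server not to keep the socket open.
    string request = "GET " + path + " HTTP/1.1\r\n" +
                     "Host: " + host + "\r\n" +
                     "Connection: close\r\n\r\n";
    byte[] requestBytes = Encoding.ASCII.GetBytes(request);
    stream.Write(requestBytes, 0, requestBytes.Length);

    // Read only the first few kilobytes, then let the using blocks close the connection.
    // A single Read may return fewer bytes than the buffer holds; fine for a sketch.
    byte[] buffer = new byte[4096];
    int read = stream.Read(buffer, 0, buffer.Length);
    string start = Encoding.UTF8.GetString(buffer, 0, read);
    // the headers plus the beginning of the body are in 'start'; parse <title>/<meta> from it
}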
Depending on what you want it for, many half decent websites have a custom 404 page which is a lot simpler than a known page. Whether that has the information you're after is another matter.
You can use the "HEAD" verb in an HttpWebRequest to return only the response headers (not the HTML elements). To get the <title> and meta elements you'll need to download the page and parse out the metadata you want.
var request = (HttpWebRequest)System.Net.WebRequest.Create(uri);
request.Method = "HEAD";

How can I import a raw RSS feed in C#?

Does anyone know an easy way to import a raw XML RSS feed into C#? I'm looking for an easy way to get the XML as a string so I can parse it with a Regex.
Thanks,
-Greg
This should be enough to get you going...
using System.IO;
using System.Net;

using (WebClient wc = new WebClient())
using (Stream st = wc.OpenRead("http://example.com/feed.rss"))
using (StreamReader sr = new StreamReader(st))
{
    string rss = sr.ReadToEnd();
}
If you're on .NET 3.5 you now have built-in support for syndication feeds (RSS and Atom). Check out this MSDN Magazine article for a good introduction.
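A minimal sketch of the built-in route with System.ServiceModel.Syndication (on .NET 3.5 this lives in the System.ServiceModel.Web assembly; the feed URL is a placeholder):

using System.ServiceModel.Syndication;
using System.Xml;

using (XmlReader reader = XmlReader.Create("http://example.com/feed.rss")) // placeholder URL
{
    SyndicationFeed feed = SyndicationFeed.Load(reader);
    foreach (SyndicationItem item in feed.Items)
    {
        string title = item.Title.Text;
        // process each item here
    }
}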
If you really want to parse the string using a regex (and parsing XML is not what regex was intended for), the easiest way to get the content is to use the WebClient class. It has a DownloadString method that is straightforward to use: just give it the URL of your feed. Check this link for an example of how to use it.
I would load the feed into an XmlDocument and use XPath instead of regex, like so:
XmlDocument doc = new XmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(feedUrl);

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    doc.Load(reader);

    // e.g. pull out every item in the feed with XPath
    XmlNodeList items = doc.SelectNodes("rss/channel/item");
}
What are you trying to accomplish?
I found the System.ServiceModel.Syndication classes very helpful when working with feeds.
You might want to have a look at this: http://www.codeproject.com/KB/cs/rssframework.aspx
XmlDocument (located in System.Xml; you will need to add a reference to the DLL if it isn't added for you) is what you would use to get the XML into C#. At that point, just call the InnerXml property, which gives the inner XML as a string, and then parse it with the Regex.
The best way to grab an RSS feed as a string would be to use the System.Net.HttpWebRequest class. Once you've set up the HttpWebRequest's parameters (URL, etc.), call the HttpWebRequest.GetResponse() method. From there, you can get a Stream with WebResponse.GetResponseStream(). Then wrap that stream in a System.IO.StreamReader and call StreamReader.ReadToEnd(). Voila.
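Spelled out as code, roughly (the feed URL is a placeholder):

using System.IO;
using System.Net;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/feed.rss"); // placeholder URL

using (WebResponse response = request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    string rss = reader.ReadToEnd(); // the raw feed XML as a string
}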
The RSS is just xml and can be streamed to disk easily. Go with Darrel's example - it's all you'll need.
