This is basically what I'm doing: I select a science article from en.wikipedia.org and get a list of the users that have edited it and how many edits each of them has made. To get this I follow links from the article that lead me to the toolserver. I use this page http://toolserver.org/~daniel/WikiSense/Contributors.php?wikilang=en&wikifam=.wikipedia.org&page=Quantum_mechanics&since=&until=&grouped=on&hideanons=on&order=-edit_count&max=100&order=-edit_count&format=wiki to retrieve the editors in a sorted list, excluding anonymous users. This works well because the result comes back as a nicely formatted list (even though it includes dates, which I don't need).
However, to judge their credibility, I need to look at the top users and see which articles they contribute to most, to tell whether they're editing a lot of science articles or just random junk. I'm having a hard time getting data on each of these users; currently the only page I can find that shows a user's history is http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=5000&target=Aquirata
However, it takes quite a while to fetch a single user's page, at least 20 seconds, and then I still have to parse out the useless data. I don't need anywhere near as much data as I'm forced to download. This is my code so far for getting a user's data:
static string getWebPage(string url)
{
    WebClient client = new WebClient();
    // Send a browser-like user agent so the request isn't rejected.
    client.Headers.Add("user-agent",
        "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    return client.DownloadString(url);
}

static void Main(string[] args)
{
    string url = "http://en.wikipedia.org/w/index.php?title=Special:Contributions&limit=50&target=Aquirata";
    string page = getWebPage(url);

    // Contributions show up in the page source as lines starting with "<li class".
    var lines = page.Split('\n', '\r');
    var edits = lines.Where(t => t.StartsWith("<li class"));

    foreach (string s in edits)
        Console.WriteLine(s);
    Console.ReadLine();
}
Is there an alternative that would be faster and/or easier? Maybe there's a database somewhere for this? (I'm not sure whether Wikimedia keeps statistics on users' contributions.)
Also, I'm using C# because it's what I'm most familiar with. I might switch over to Java to make it cross-platform, but I'm open to any other suggestions.
I think Wikipedia provides its data for download (so you don't have to strip it out of the HTML pages).
See: http://dumps.wikimedia.org/enwiki/
HTH
Selecting only a certain part of a document can be done with a range request; these are documented in RFC 2616 (the Range request header, Section 14.35, and Content-Range, Section 14.16).
For example:
$ curl -H"range: bytes=1-20" www.apache.org
!DOCTYPE HTML PUBLIC
$
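If you end up doing this from C#, HttpWebRequest exposes the same mechanism through its AddRange method. A minimal sketch mirroring the curl example (the server has to support byte ranges for this to do anything; otherwise you just get the full document back):

using System;
using System.IO;
using System.Net;

class RangeRequestExample
{
    static void Main()
    {
        // Ask the server for only bytes 1-20 of the document.
        var request = (HttpWebRequest)WebRequest.Create("http://www.apache.org/");
        request.AddRange(1, 20);    // equivalent to "Range: bytes=1-20"

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Servers that honour the range reply with 206 PartialContent.
            Console.WriteLine(response.StatusCode);
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}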
I think you can retrieve the wiki data as XML, so you can use XPath to pull out the required data.
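If you go that route, the MediaWiki API can return a user's contributions as XML directly, which you can then query with XPath. A rough sketch, assuming the usercontribs list and its item/title attributes are what you're after (double-check the parameter names against the API documentation):

using System;
using System.Net;
using System.Xml;

class UserContribsExample
{
    static void Main()
    {
        // Ask the MediaWiki API for a user's recent contributions as XML.
        string url = "http://en.wikipedia.org/w/api.php?action=query&list=usercontribs" +
                     "&ucuser=Aquirata&uclimit=50&format=xml";

        var client = new WebClient();
        client.Headers.Add("user-agent", "ContribChecker/0.1 (example)");
        string xml = client.DownloadString(url);

        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Each contribution comes back as an <item> element with a title attribute.
        foreach (XmlNode item in doc.SelectNodes("//item"))
            Console.WriteLine(item.Attributes["title"].Value);
    }
}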
I create an instance of IE with this code:
System.Diagnostics.Process p =
    System.Diagnostics.Process.Start("IEXPLORE.EXE",
        @"http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
I want to get the contents of the opened window. Is it possible to read the HTML content from this process?
Use the following code:
using (var client = new WebClient())
{
    string result = client.DownloadString("http://www.asnaf.ir/moreinfounit.php?sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de=1&iIkjkkewr782332ihdsfJHLKDSJKHWPQ397iuhdf87D3dffR=2009585&gGtkh87KJg89jhhJG75gjhu64HGKvuttt87guyr6e67JHGVt=117&cCli986gjdfJK755jh87KJ87hgf9871g00113kjJIZAEQ798=0a26e8ea07358781d128aa4bc98dd89a");
    // TODO: your logic here
}
No. Your processes run in different virtual address spaces. It would be a serious security vulnerability if you could read the memory allocated to another process.
Edit: Consider using something like a WebBrowser control in your original process. That way you could easily retrieve the page it displays.
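A rough sketch of that idea, assuming a WinForms app: host a WebBrowser control, navigate it, and read DocumentText once DocumentCompleted fires (the URL is trimmed here for readability):

using System;
using System.Windows.Forms;

class BrowserHostExample
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser();
        browser.ScriptErrorsSuppressed = true;

        // DocumentCompleted fires once the page (and its frames) have loaded.
        browser.DocumentCompleted += (sender, e) =>
        {
            string html = browser.DocumentText;   // the HTML of the loaded page
            Console.WriteLine(html.Length);
            Application.ExitThread();
        };

        browser.Navigate("http://www.asnaf.ir/moreinfounit.php?...");   // your URL here
        Application.Run();   // pump messages so the control can load the page
    }
}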
It might be possible, but I'd actually use an HttpWebRequest to obtain the HTML content. If you really just want the HTML for a given HTTP URL, launching IE as a separate process is definitely not the way to go.
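For instance, a minimal HttpWebRequest version might look like this (just a sketch of the request/response plumbing; the URL is trimmed for readability):

using System;
using System.IO;
using System.Net;

class HttpWebRequestExample
{
    static string GetHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0";   // some sites reject requests without a user agent

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(GetHtml("http://www.asnaf.ir/moreinfounit.php?..."));   // your URL here
    }
}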
You should use the WebClient class to retrieve web page content. Check this link:
http://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.80).aspx
I have to download some content from a website every day, so I figure it would be nice to have a program that does it for me... The problem is that the website requires authentication.
My current solution uses the System.Windows.Forms.WebBrowser control. I currently do something like this:
/* Create browser */
System.Windows.Forms.WebBrowser browser = new System.Windows.Forms.WebBrowser();

/* Navigate to the desired site */
browser.Navigate("http://stackoverflow.com/");

// wait for the browser to download the DOM

/* Get all tags of type input */
var elements = browser.Document.Body.GetElementsByTagName("input");

/* Look for the one we are interested in */
foreach (System.Windows.Forms.HtmlElement curInput in elements)
{
    if (curInput.GetAttribute("name") == "q")
    {
        curInput.SetAttribute("value", "I changed the value of this input");
        break;
    }
}
// etc
I think this approach works, but it's not the best solution. I have tried using the WebClient class, and it looks like it should work, but for some reason it doesn't. I believe the reason it fails is that I have to save the cookies?
So my question is: how can I track all the bytes that get sent to the server and all the bytes that come back in the response, so I can download what I need? In other words, I'd like the WebClient to act like a web browser, and once I get to the part I need, I should be able to parse the data I want just by looking at the source.
I would appreciate it if someone could show me an example of how to do this. Google Chrome's developer tools do a pretty good job of displaying this kind of information.
Thanks in advance,
Antonio
Answering your question:
The best utility I know of for tracking traffic is Fiddler (it's free).
For sending advanced HTTP requests, you should use the System.Net.HttpWebRequest class, which has CookieContainer and Headers properties, allowing you to do whatever you need.
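For the authentication part, the usual pattern is to POST the login form once and reuse the cookies for the later requests. A rough sketch, assuming a plain form-based login (the login URL and field names below are made up; use Fiddler to see what your site actually posts):

using System;
using System.IO;
using System.Net;
using System.Text;

class AuthenticatedDownload
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // 1) POST the login form; the server's session cookie lands in the container.
        //    "login.aspx", "username" and "password" are placeholders for your site's form.
        var login = (HttpWebRequest)WebRequest.Create("http://example.com/login.aspx");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] form = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (var stream = login.GetRequestStream())
            stream.Write(form, 0, form.Length);
        login.GetResponse().Close();

        // 2) Request the protected page with the same cookie container.
        var page = (HttpWebRequest)WebRequest.Create("http://example.com/protected/data.html");
        page.CookieContainer = cookies;

        using (var response = page.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}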
Hope it helps.
The goal is to get the raw source of the page; I mean do not run the scripts or let the browser reformat the page at all. For example, if the source is <table><tr></table>, I don't want to get back <table><tbody><tr></tr></tbody></table> after the response. How can I do this in C# code?
More info: for example, typing "view-source:http://feeds.gawker.com/kotaku/full" in the browser's address bar gives you an XML file, but if you just navigate to "http://feeds.gawker.com/kotaku/full" it renders an HTML page. What I want is the XML file. I hope this is clear.
Here's one way, but it's not really clear what you actually want.
using (var wc = new WebClient())
{
    var source = wc.DownloadString("http://google.com");
}
If you mean when rendering your own page: you can get access to the raw page content using a response filter (Response.Filter), or by overriding the page's Render method. I would question your motives for doing this, though.
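For the "your own page" case, a minimal sketch of the Render override approach in ASP.NET Web Forms (MyPage is just a placeholder page class):

using System.IO;
using System.Web.UI;

public partial class MyPage : Page
{
    protected override void Render(HtmlTextWriter writer)
    {
        using (var buffer = new StringWriter())
        using (var bufferedWriter = new HtmlTextWriter(buffer))
        {
            base.Render(bufferedWriter);          // render the page into the buffer
            string rawHtml = buffer.ToString();   // the markup exactly as it would be sent

            // inspect or rewrite rawHtml here before writing it to the real response
            writer.Write(rawHtml);
        }
    }
}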
Scripts run client-side, so they have no bearing on any C# code.
You can use a tool such as Fiddler to see what is actually being sent over the wire.
disclaimer: I think Fiddler is amazing
I was wondering if someone could give me some guidance here. I'd like to be able to programmatically grab every image on a webpage as quickly as possible. This is what I'm currently doing (note that clear is a WebBrowser control):
if (clear.ReadyState == WebBrowserReadyState.Complete)
{
    doc = (IHTMLDocument2)clear.Document.DomDocument;
    sobj = doc.selection;
    body = doc.body as HTMLBody;
    sobj.clear();
    range = body.createControlRange() as IHTMLControlRange;

    for (int j = 0; j < clear.Document.Images.Count; j++)
    {
        img = (IHTMLControlElement)clear.Document.Images[j].DomElement;
        HtmlElement ele = clear.Document.Images[j];
        string test = ele.OuterHtml;
        string test2 = ele.InnerHtml;

        // Copy the image to the clipboard via the MSHTML control range...
        range.add(img);
        range.select();
        range.execCommand("Copy", false, null);

        // ...then read it back out as a Bitmap.
        Image image = Clipboard.GetImage();
        if (image != null)
        {
            temp = new Bitmap(image);
            Clipboard.Clear();
            // ...rest of code...
        }
    }
}
However, I find this can be slow for a lot of images, and it also hijacks my clipboard. I was wondering if there is a better way?
I suggest using HttpWebRequest and HttpWebResponse. In your comment you asked about efficiency/speed.
From the standpoint of data being transferred, using HttpWebRequest will be at worst the same as using a browser control, and almost certainly much better. When you (or a browser) request a page from a web server, you initially get only the markup for the page itself. This markup may include image references, objects like Flash, and resources (such as scripts and CSS files) that are referenced but not actually included in the page itself. A web browser will then proceed to request all the associated resources needed to render the page, but with HttpWebRequest you can request only the things you actually want (the images).
From the standpoint of the resources or processing power required to extract entities from a page, there is no comparison: using a browser control is far more resource-intensive than scanning an HttpWebResponse. Scanning some data using C# code is extremely fast. Rendering a web page involves JavaScript, graphics rendering, CSS parsing, layout, caching, and so on; it's a pretty intensive operation, actually. Using a browser under programmatic control, this will quickly become apparent: I doubt you could process more than a page every second or so.
On the other hand, a C# program dealing directly with a web server (with no rendering engine involved) could probably handle dozens if not hundreds of pages per second. For all practical purposes, you'd really be limited only by the response time of the server and your internet connection.
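To make that concrete, here is a rough sketch: download the markup once, pull the img src values out with a (deliberately naive) regex, and fetch each image directly. The URL is a placeholder, and for real pages an HTML parser such as the Agility Pack mentioned in the next answer is more robust than a regex:

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class ImageScraper
{
    static void Main()
    {
        var baseUri = new Uri("http://example.com/page.html");   // page to scrape (placeholder)

        using (var client = new WebClient())
        {
            string html = client.DownloadString(baseUri);

            // Naive extraction of src="..." from <img> tags; good enough for a sketch.
            var matches = Regex.Matches(html, "<img[^>]+src=[\"']([^\"']+)[\"']",
                                        RegexOptions.IgnoreCase);

            foreach (Match m in matches)
            {
                // Resolve relative URLs against the page address.
                var imageUri = new Uri(baseUri, m.Groups[1].Value);
                string fileName = Path.GetFileName(imageUri.LocalPath);
                if (fileName.Length == 0) continue;

                client.DownloadFile(imageUri, fileName);
                Console.WriteLine("Saved " + fileName);
            }
        }
    }
}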
There are multiple approaches here.
If it's a one-time thing, just browse to the site, select File > Save Page As..., and let the browser save all the images locally for you.
If it's a recurring thing, there are lots of different ways:
Buy a program that does this. I'm sure there are hundreds of implementations.
Use the HTML Agility Pack to grab the page and compile a list of all the images you want, then spin up a thread for each image to download and save it (see the sketch after this answer). You might limit the number of threads depending on various factors like your (and the site's) bandwidth and local disk speed. Note that some sites place arbitrary limits on the number of concurrent requests per connection they will handle; depending on the site this might be as few as 3.
This is by no means an exhaustive list; there are lots of other ways. I probably wouldn't do it through a WebBrowser control, though. That code looks brittle.
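Here is the sketch referred to above, using the HTML Agility Pack (single-threaded for brevity; the page URL is a placeholder):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class AgilityPackImageDownloader
{
    static void Main()
    {
        var pageUri = new Uri("http://example.com/gallery.html");   // placeholder URL

        // Load and parse the page in one step.
        var doc = new HtmlWeb().Load(pageUri.AbsoluteUri);

        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return;   // page has no images

        using (var client = new WebClient())
        {
            foreach (var img in images)
            {
                var src = new Uri(pageUri, img.GetAttributeValue("src", ""));
                string fileName = Path.GetFileName(src.LocalPath);
                if (fileName.Length == 0) continue;

                client.DownloadFile(src, fileName);
                Console.WriteLine("Saved " + fileName);
            }
        }
        // To parallelise, hand each download to the thread pool and cap the
        // number of simultaneous requests per host as described above.
    }
}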
I have a webpage which I would like users to be able to send to a friend at the click of a button. I am currently using Chilkat's MailMan, but I keep getting intermittent problems with it. Occasionally, on the first attempt to mail, it throws a null pointer exception; then if I try the exact same page again, it sends with no problem.
Are there any other components out there that will do what I am trying to do?
Would it be easier to write my own lightweight component to do it?
Has anyone run into the above problem and found an easy fix, so that I don't have to worry about either of the above?
EDIT:
Maybe I should clear something up. I know how to send emails; that is not the problem. The Chilkat component I was using could take a webpage, put it into an email, and send it. The recipient then gets an email with all the CSS, the pictures, and everything included.
This is actually not a trivial exercise.
What you want to do is download the HTML (which is the easy part). You then have to parse it and extract all of the CSS and image references, and either:
Embed them into the email, or
Convert all links to absolute links.
When you look at all the bad HTML out there, you find out this isn't trivial. The reason I know this is that I wrote this functionality into aspNetEmail (www.aspNetEmail.com) and had to account for all sorts of bad HTML.
Could you use the WebClient class to get the webpage that the user is requesting? You'd want to change any relative links to absolute links (e.g. from "/images/logo.gif" to "http://myapp.com/images/logo.gif"), then take the output and use it as the body of the MailMessage object.
i.e.
public void MailToAFriend(string friendEmailAddress, Uri uriToEmail) {
    MailMessage message = new MailMessage();
    message.From = new MailAddress("your_email_address@yourserver.com");
    message.To.Add(friendEmailAddress);
    message.Subject = "Check out this awesome page!";
    message.Body = GetPageContents(uriToEmail);
    message.IsBodyHtml = true;

    SmtpClient mailClient = new SmtpClient();
    mailClient.Send(message);
}

private string GetPageContents(Uri uri) {
    var webClient = new WebClient();
    string dirtyHtml = webClient.DownloadString(uri);
    string cleanedHtml = MakeReadyForEmailing(dirtyHtml);
    return cleanedHtml;
}

private string MakeReadyForEmailing(string html) {
    // some implementation to replace any significant relative links
    // with absolute links, strip javascript, etc.
    return html;
}
There are lots of resources on Google to get you started on the regex to do the replacement.
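For example, a very rough MakeReadyForEmailing along those lines, using a regex and Uri to rewrite src/href values against the page's base address (it ignores plenty of edge cases, which is why people often reach for an HTML parser instead):

using System;
using System.Text.RegularExpressions;

static class LinkFixer
{
    public static string MakeLinksAbsolute(string html, Uri baseUri)
    {
        // Rewrite src="..." and href="..." values that are not already absolute.
        return Regex.Replace(html, "(src|href)=[\"']([^\"']+)[\"']", match =>
        {
            string attribute = match.Groups[1].Value;
            string value = match.Groups[2].Value;

            Uri absolute;
            if (Uri.TryCreate(baseUri, value, out absolute))
                value = absolute.AbsoluteUri;   // "/images/logo.gif" -> "http://myapp.com/images/logo.gif"

            return attribute + "=\"" + value + "\"";
        }, RegexOptions.IgnoreCase);
    }
}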
1) .NET comes with a reasonably adequate class for sending mail, in System.Net.Mail.
2) If it happens only rarely and does not repeat, just put it in a try block and retry two more times before considering it a failure. While it may sound crude, it's a very effective solution.
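For example (SendTheMail is just a stand-in for whatever call is failing intermittently):

// Retry the flaky send a couple of times before giving up.
static void SendWithRetry()
{
    const int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            SendTheMail();   // placeholder for the call that occasionally throws
            return;          // success, stop retrying
        }
        catch (Exception)
        {
            if (attempt == maxAttempts)
                throw;       // still failing after three tries, give up
        }
    }
}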