How to find a link in an HTML document? (C#)

I have a C# form with a WebBrowser control.
The control holds an HTML document.
One link in that document has no identifying attributes (no id and no name).
How can I access this element?
I tried this:
webBrowser1.Document.GetElementsByTagName("a")[n]
But it is not very robust: if a new link is added to the page, the index changes and I have to rebuild the whole program.
I also cannot loop through the document text or take a substring of Document.ToString(), because then I cannot click the link.
Any advice would be appreciated.

In this kind of situation the best approach is to find an "anchor", meaning a place in the document that never changes.
Let's say the link with the text "dada" doesn't have an ID or Name. The closest you can get is to check whether the parent of the element you're looking for has an ID:
<div id="parentDiv">
Some text
Some other stuff
The link you're looking for
</div>
That way you can get parentDiv, which you know doesn't change, and then the A tag inside that parent. Its position should stay stable unless the website completely changes its structure, which is one of the inherent problems with parsing external HTML pages.
Shai.
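A minimal sketch of the parent-anchor idea with the WinForms WebBrowser API, assuming the page really has a container with id "parentDiv" and that the link text is stable. The class and method names (AnchorLookup, FindLinkIn) are made up for illustration.

```csharp
using System.Windows.Forms;

public static class AnchorLookup
{
    // Returns the first <a> inside the element with the given id whose
    // trimmed text equals linkText, or null if nothing matches.
    public static HtmlElement FindLinkIn(HtmlDocument document, string parentId, string linkText)
    {
        if (document == null) return null;

        // The stable "anchor": an element that is known to have an id.
        HtmlElement parent = document.GetElementById(parentId);
        if (parent == null) return null;

        // Only links inside the parent are searched, so new links
        // elsewhere on the page do not shift the result.
        foreach (HtmlElement link in parent.GetElementsByTagName("a"))
        {
            string text = link.InnerText;
            if (text != null && text.Trim() == linkText) return link;
        }
        return null;
    }
}
```

Called as AnchorLookup.FindLinkIn(webBrowser1.Document, "parentDiv", "The link you're looking for") once the document has finished loading, the returned element can then be clicked with InvokeMember("click") if it is not null.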

You can use the Html Agility Pack and select links by XPath:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(/* url */);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
// do stuff
}

You need some information that identifies the link. It may be the id, the name, or the text. If the text is always the same, check the inner text of the link.

Related

How to get first-level elements from HTML file with HTML Agility Pack & c#

I want to get the first-level elements by parsing an HTML file with the HTML Agility Pack. For example, the result would look like this:
<html>
<body>
<div class="header">....</div>
<div class="main">.....</div>
<div class="right">...</div>
<div class="left">....</div>
<div class="footer">...</div>
</body>
</html>
Each of these contains other tags.
I want to extract all the text on the website, but separately: for example, the right side on its own, the left side on its own, the footer on its own, and so on.
Can anyone help me?
Thanks.
Use HtmlAgilityPack to load the webpage from the given URL, then parse it by selecting the correct corresponding tags.
HtmlWeb page = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc = page.Load("http://www.google.com");
If you want to select a specific div with the class name 'main', you do so through the DocumentNode property of your document object.
string mainText = doc.DocumentNode.SelectSingleNode("//div[@class=\"main\"]").InnerText;
Chances are, though, that several tags in your HTML are members of the 'main' class, so you either have to select them all and iterate over the collection, or be more precise when you select your single node.
To get a collection of all tags in a class such as 'main', use the DocumentNode.SelectNodes method instead.
I suggest you take a look at this question at SO where some of the basics and links to tutorials are available.
How to use HTML Agility pack
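As a rough sketch of the SelectNodes approach described above, assuming the HtmlAgilityPack NuGet package is referenced and the page layout matches the question (div elements directly under body). The class and method names here are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class SectionText
{
    // Prints the text of each first-level <div> under <body>, one per line,
    // prefixed with the div's class name.
    public static void PrintTopLevelSections(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection sections = doc.DocumentNode.SelectNodes("//body/div");
        if (sections == null) return;

        foreach (HtmlNode section in sections)
        {
            string cls = section.GetAttributeValue("class", "(no class)");
            Console.WriteLine(cls + ": " + section.InnerText.Trim());
        }
    }
}
```

Changing the XPath to "//body/*" would pick up every first-level element rather than only div elements.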

How to remove html tag and just leave text in C#?

I see a similar case which answer my question in JQuery way:
Remove tag but leave contents - jQuery/Javascript
But I need to know how to do this in C#. Is there any API in C# that I can use to strip out the tag and leave the plain text?
From:
some content and more and more content
To:
some content and more and more content
Thanks.
Use HtmlAgilityPack ( http://htmlagilitypack.codeplex.com/ ), load your HTML document into a HtmlDocument instance and query document.DocumentNode.InnerText.

Regular expression to find anchor link with special href?

I just need to find a regular expression for the following:
I have some content in a div tag that includes a lot of anchor links. My task is to find the anchor links whose href has the format "components/showdoc.aspx?docid=" and add an onclick event to those links only, leaving the rest of the anchor links untouched.
<div id="content" runat="server">
test doc
</div>
This expression matches every anchor tag and adds a target attribute to it:
Regex.Replace(inputString, "<(a)([^>]+)>", "<$1 target=\"_blank\"$2>")
Thanks
If you are looking to make permanent changes to your HTML file, first manage your HTML parsing by loading it into a System.Windows.Forms.WebBrowser control. From there you can perform DOM-like modifications to the HTML without the dangerous repercussions of parsing corruption that can be caused by performing Regex.Replace on the raw file. (Apparently RegEx + HTML is a serious issue for some).
So first in your code you would:
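WebBrowser myBrowser = new WebBrowser();
myBrowser.Url = new Uri(@"C:\MyPath\MyFile.HTML");
// Loading is asynchronous: read Document only after the DocumentCompleted event has fired.
HtmlElement myDocBody = myBrowser.Document.Body;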
Then you can navigate through your document body, seeking out your div tag and looking for your anchor tags by using the HtmlElement.Id property and HtmlElement.GetAttribute method.
Note: feel free to still use RegEx matching on the URL strings but only after extracting them from a GetAttribute("href") method.
To add the onClick method, simply invoke the HtmlElement.SetAttribute method.
When you have finished all your modifications, save the changes by writing the WebBrowser.DocumentText to file.
Here is a reference:
http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.aspx
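The note above about matching on the extracted href string, rather than on raw HTML, could look roughly like this. It is a sketch that assumes the docid value runs up to the next '&' or '#'; the names DocLinkMatcher and TryGetDocId are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DocLinkMatcher
{
    // Matches an href that contains "components/showdoc.aspx?docid=" and
    // captures the value up to the next '&' or '#'.
    private static readonly Regex DocLink = new Regex(
        @"components/showdoc\.aspx\?docid=(?<id>[^&#]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns true and sets docId when href points at a showdoc page.
    public static bool TryGetDocId(string href, out string docId)
    {
        docId = null;
        if (string.IsNullOrEmpty(href)) return false;

        Match match = DocLink.Match(href);
        if (!match.Success) return false;

        docId = match.Groups["id"].Value;
        return true;
    }
}
```

The string passed in would come from GetAttribute("href") on each anchor element; only links for which TryGetDocId returns true would then get the onclick attribute.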
Don't use regex to parse html, it's evil.
You could use the HTML Agility Pack, it even has a nice NuGet Package.
Alternatively, you could do this on the client side with a single line of jQuery:
$('a[href*="components/showdoc.aspx?docid="]').on('click', myClickFunction);
This is making use of the Attribute Contains Selector.
If you want to find the docid in your click function, you could write something like this in your click function:
function myClickFunction(e){
var href = $(this).attr('href');
var docId = href.split('=')[1];
alert(docId);
}
Note that this assumes there's only ever one query string value. If you wanted to make this more robust, you could do something like in this answer: https://stackoverflow.com/a/1171731/21200

Embedded WebBrowser in Windows Form C# project

I have a form with an embedded web browser control on it. I am currently using WebBrowser and use it like so:
webBrowser1.Navigate("about:blank");
HtmlDocument doc = this.webBrowser1.Document;
doc.Write(string.Empty);
String htmlContent = GetHTML();
doc.Write(htmlContent);
This writes the HTML correctly to the web browser control BUT it never clears the existing data and it just appends, so I end up with N web pages stacked on top of each other.
Is this the best control to use? If so why is it not clearing existing data?
You need to use:
HtmlDocument doc = this.webBrowser1.Document.OpenNew(true);
now the contents of the document will be cleared before writing.
"All calls to Write should be preceded by a call to OpenNew, which will clear the current document and all of its variables. Your calls to Write will create a new HTML document in its place. To change only a specific portion of the document, obtain the appropriate HtmlElement and set its InnerHtml property."
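Put together, the OpenNew-then-Write sequence might be wrapped like this. It is a sketch that assumes the control already holds a document (for example after navigating to about:blank); BrowserContent and ReplaceHtml are illustrative names.

```csharp
using System;
using System.Windows.Forms;

public static class BrowserContent
{
    // Replaces whatever the control currently shows with the given HTML
    // instead of appending to it.
    public static void ReplaceHtml(WebBrowser browser, string html)
    {
        if (browser.Document == null)
        {
            // Assumption: the caller has navigated somewhere first.
            throw new InvalidOperationException("The browser has no document yet.");
        }

        // OpenNew clears the current document; passing true replaces the
        // current history entry rather than adding a new one.
        HtmlDocument doc = browser.Document.OpenNew(true);
        doc.Write(html);
    }
}
```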
Yes, it is.
You should be able to call the Clear method if you need to clear contents.
Check this article for in-depth details and sample code:
http://www.codeproject.com/KB/miscctrl/simplebrowserformfc.aspx
Call HtmlDocument.OpenNew between pages:
"OpenNew will clear the previous loaded document, including any associated state, such as variables. It will not cause navigation events in WebBrowser to be raised."

Click an HTML link inside a WebBrowser Control

C# Visual Studio 2010
I am loading a complex html page into a webbrowser control. But, I don't have the ability to modify the webpage. I want to click a link on the page automatically from the windows form. But, the ID appears to be randomly generated each time the page is loaded (so I believe referencing the ID will not work).
This is the content of the a href link:
<a
id="u_lp_id_58547"
href="javascript:void(0)"
class="SGLeftPanelText" onclick="setStoreParams('cases;212', 212); window.leftpanel.onClick('cases_ss_733');return false; ">
My Assigned</a>
Is there any way to click the link from C#?
Thanks!
UPDATE:
I feel like this is close but it is just not working:
HtmlElementCollection links = helpdeskWebBrowser.Document.Window.Frames["main_pending_events_frame"].Document.GetElementsByTagName("a");
MessageBox.Show(links.Count.ToString());
I have tried plugging in every single frame name and tried both "a" and "A" in the TagName field, but have not had any luck. I just cannot find any links; the message box always shows 0. What am I missing?
Something like this should work:
HtmlElement link = webBrowser.Document.GetElementById("u_lp_id_58547");
link.InvokeMember("Click");
EDIT:
Since the IDs are generated randomly, another option may be to identify the links by their InnerText; along these lines.
HtmlElementCollection links = webBrowser.Document.GetElementsByTagName("A");
foreach (HtmlElement link in links)
{
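// InnerText can be null for links without text, so guard before comparing.
if (link.InnerText != null && link.InnerText.Trim().Equals("My Assigned"))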
link.InvokeMember("Click");
}
UPDATE:
You can get the links within an IFrame using:
webBrowser.Document.Window.Frames["MyIFrame"].Document.GetElementsByTagName("A");
Perhaps you will have to isolate the link ID value using more of the surrounding HTML context as a "target" and then extract the new random ID.
In the past I have used the "HtmlAgilityPack" to easily parse "screen-scraped" HTML to isolate areas of interest within a page - this library seems to be easy to use and reliable.
