I wrote some code in VB.Net a while ago that is using XElement, XDocument, etc... to store and manipulate HTML. Some of the HTML makes use of attribute names that contain a hyphen/dash (-). I encountered issues using LinqToXml to search for XElements by these attributes.
Back then I found an article (can't find it now) that indicated the solution in VB.net was to use syntax like this:
Dim rootElement as XElement = GetARootXElement()
Dim query = From p In rootElement.<div> Where p.@<data-qid> = 5 Select p
The "magic" syntax is the @<> which somehow translates the hyphenated attribute name into a format that can be successfully used by Linq. This code works great in VB.Net.
The problem is that we have now converted all the VB.Net code to C# and the conversion utility choked on this syntax. I can't find anything about this "magic" syntax in VB.Net and so I was hoping someone could fill in the details for me, specifically, what the C# equivalent is. Thanks.
Here is an example:
<div id='stuff'>
<div id='stuff2'>
<div id='stuff' data-qid=5>
<!-- more html -->
</div>
</div>
</div>
In my code above the rootElement would be the stuff div and I would want to search for the inner div with the attribute data-qid=5.
I can get the following to compile in C# - I think it's equivalent to the original VB (note that the original VB had Option Strict Off):
XElement rootElement = GetARootXElement();
var query = from p in rootElement.Elements("div")
where p.Attribute("data-qid").Value == 5.ToString()
select p;
Here's my (revised) test, which finds the div with the 'data-qid' attribute:
var xml = System.Xml.Linq.XElement.Parse("<div id='stuff'><div id='stuff2'><div id='stuff3' data-qid='5'><!-- more html --></div></div></div>");
var rootElement = xml.Element("div");
var query = from p in rootElement.Elements("div")
where p.Attribute("data-qid").Value == 5.ToString()
select p;
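As a hedged variation on the query above: calling .Value on a missing attribute throws a NullReferenceException, so a div without data-qid would break the query. The sketch below casts the attribute to string instead, which yields null when the attribute is absent. The class and method names (DataAttributeQuery, FindByQid) are made up for illustration, and Descendants is used so that nesting depth does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DataAttributeQuery
{
    // Returns every descendant <div> whose data-qid attribute equals qid.
    // The (string) cast yields null when the attribute is missing,
    // so divs without data-qid are skipped instead of throwing.
    public static IEnumerable<XElement> FindByQid(XElement root, string qid)
    {
        return root.Descendants("div")
                   .Where(d => (string)d.Attribute("data-qid") == qid);
    }

    public static void Main()
    {
        var root = XElement.Parse(
            "<div id='stuff'><div id='stuff2'>" +
            "<div id='stuff3' data-qid='5'><!-- more html --></div>" +
            "</div></div>");

        foreach (var div in FindByQid(root, "5"))
        {
            Console.WriteLine((string)div.Attribute("id")); // prints "stuff3"
        }
    }
}
```

The hyphen needs no special handling in C# because the attribute name is passed as an ordinary string.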
Use HtmlAgilityPack (available from NuGet) to parse HTML. Here is an example:
HtmlDocument doc = new HtmlDocument();
doc.Load("index.html");
var innerDiv =
doc.DocumentNode.SelectSingleNode("//div[@id='stuff']/*/div[@data-qid=5]");
This XPath query gets the inner div tag whose data-qid equals 5, and it also requires the outer div to have an id equal to 'stuff'. And here is the way to get the data-qid attribute value:
var qid = innerDiv.Attributes["data-qid"].Value; // 5
Instead of using HtmlAgilityPack as suggested by Sergey Berezovskiy, there's an easier way that works without it: the Extensions class in System.Xml.XPath, which contains extension methods for using XPath with LINQ to XML:
using System.Xml.Linq;
using System.Xml.XPath;
var xml = XElement.Parse(html);
var innerDiv = xml.XPathSelectElement("//div[@id='stuff' and @data-qid=5]");
I am using the HtmlAgilityPack.
It is an excellent tool for parsing data, however every instance I have used it, I have always had either a class or id to aim at, i.e. -
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
However I have come across a piece of text that isn't nested in any particular pattern with a class or id I can aim at. E.g. -
<p>Example Header</p>: This is the text I want!<br>
However the example given always follows the same pattern, i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression however would prefer to use the agility pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me :
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>
[(normalize-space())] excludes text nodes that contain only whitespace (two are excluded in this HTML sample: the newline before <p> and the one after <br>)
Result :
UPDATE I :
Every element must have a parent, like the <div> in the example above. If you mean that the text sits at the root of the document, the same approach should still work. The key is to use the /text() XPath step to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;
I want to process/manipulate some HTML markup
e.g.
<a id="flFileList_gvDoItFiles_btnContent_1" href="javascript:__doPostBack('flFileList$gvDoItFiles$ctl03$btnContent','')">Untitled.png.3154ROGG635264188946573079.png</a>
changed to
<a id="flFileList_gvDoItFiles_btnContent_1" href="javascript:__doPostBack('flFileList$gvDoItFiles$ctl03$btnContent','')">Untitled.png</a>
I want to achieve this using C# string processing.
I have no idea how to approach this.
I have already written the logic to convert
Untitled.png.3154ROGG635264188946573079.png to
Untitled.png
I am stuck on how to identify and replace the string in the markup.
String.Split()??
I suggest you use HtmlAgilityPack for parsing HTML. You can easily get the a element by its id, and then replace its inner text:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html_string);
string xpath = "//a[@id='flFileList_gvDoItFiles_btnContent_1']";
var a = doc.DocumentNode.SelectSingleNode(xpath);
a.InnerHtml = ConvertValue(a.InnerHtml); // call your logic for converting value
string result = a.OuterHtml;
I'm new to both XML and C#; I'm trying to find a way to efficiently parse a given XML file to retrieve relevant numerical values, based on the "proj_title" value=heat_run or any other possible values. For example, calculating the duration of a particular test run (proj_end val - proj_start val).
ex.xml:
<proj ID="2">
<proj_title>heat_run</proj_title>
<proj_start>100</proj_start>
<proj_end>200</proj_end>
</proj>
...
We can't search by proj ID since this value is not fixed from test run to test run. The above file is huge: ~8 MB, and there are ~2000 tags w/ the name proj_title. Is there an efficient way to first find all tags w/ proj_title="heat_run", then retrieve the proj start and end values for this particular proj_title using C#?
Here's my current C# code:
public class parser
{
public static void Main()
{
XmlDocument xmlDoc= new XmlDocument();
xmlDoc.Load("ex.xml");
//~2000 tags w/ proj_title
//any more efficient way to just look for proj_title="heat_run" specifically?
XmlNodeList heat_run_nodes=xmlDoc.GetElementsByTagName("proj_title");
}
}
8MB really isn't very large at all by modern standards. Personally I'd use LINQ to XML:
XDocument doc = XDocument.Load("ex.xml");
var projects = doc.Descendants("proj_title")
.Where(x => (string) x == "heat_run")
.Select(x => x.Parent) // Just for simplicity
.Select(x => new {
Start = (int) x.Element("proj_start"),
End = (int) x.Element("proj_end")
});
foreach (var project in projects)
{
Console.WriteLine("Start: {0}; End: {1}", project.Start, project.End);
}
(Obviously adjust this to your own requirements - it's not really clear what you need to do based on the question.)
Alternative query:
var projects = doc.Descendants("proj")
.Where(x => (string) x.Element("proj_title") == "heat_run")
.Select(x => new {
Start = (int) x.Element("proj_start"),
End = (int) x.Element("proj_end")
});
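Since the question asks for the duration of a run (proj_end minus proj_start), here is a minimal sketch that builds on the alternative query above. It assumes the proj elements sit under some root element (called projects here purely for illustration; the real root name is not shown in the question) and that the start and end values are integers. ProjectDurations and Durations are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ProjectDurations
{
    // Yields (proj_end - proj_start) for every <proj> whose proj_title matches.
    // XContainer is the common base of XDocument and XElement.
    public static IEnumerable<int> Durations(XContainer doc, string title)
    {
        return doc.Descendants("proj")
                  .Where(p => (string)p.Element("proj_title") == title)
                  .Select(p => (int)p.Element("proj_end") - (int)p.Element("proj_start"));
    }

    public static void Main()
    {
        // Inline sample; the <projects> root is an assumption for this sketch.
        var doc = XDocument.Parse(
            "<projects>" +
            "<proj ID='1'><proj_title>cool_run</proj_title>" +
            "<proj_start>10</proj_start><proj_end>50</proj_end></proj>" +
            "<proj ID='2'><proj_title>heat_run</proj_title>" +
            "<proj_start>100</proj_start><proj_end>200</proj_end></proj>" +
            "</projects>");

        foreach (int duration in Durations(doc, "heat_run"))
        {
            Console.WriteLine("Duration: {0}", duration); // prints "Duration: 100"
        }
    }
}
```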
You can use XPath to find all nodes that match, for example:
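XmlNodeList matches = xmlDoc.SelectNodes("//proj[proj_title='heat_run']");
matches will contain all proj nodes that match the criteria (the leading // finds proj elements at any depth; drop it if proj is the document's root element). Learn more about XPath: http://www.w3schools.com/xsl/xpath_syntax.asp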
MSDN Documentation on SelectNodes
Use XDocument and use the LINQ api.
http://msdn.microsoft.com/en-us/library/bb387098.aspx
If the performance is not what you expect after trying it, you will have to look for a SAX parser.
A SAX parser does not load the whole document into memory and then apply an XPath expression to everything in memory. It works in a more event-driven way, and in some cases this can be a lot faster and use much less memory.
There are probably SAX parsers for .NET out there; I haven't used them myself for .NET but I did for C++.
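In .NET, the built-in forward-only option is System.Xml.XmlReader. It is a pull parser rather than a SAX-style push parser, but it likewise avoids loading the whole file. The sketch below is one possible way to combine it with LINQ to XML: it streams through the input and materializes only one proj element at a time. The class name, method name, and the projects root element are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingProjReader
{
    // Streams through the XML and yields each <proj> whose title matches,
    // holding only one <proj> subtree in memory at a time.
    public static IEnumerable<XElement> ReadProjects(TextReader input, string title)
    {
        using (var reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "proj")
                {
                    // ReadFrom consumes the whole <proj> subtree and leaves
                    // the reader on the node that follows it.
                    var proj = (XElement)XNode.ReadFrom(reader);
                    if ((string)proj.Element("proj_title") == title)
                    {
                        yield return proj;
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Inline sample; for a real file, pass new StreamReader("ex.xml").
        var xml =
            "<projects>" +
            "<proj ID='1'><proj_title>cool_run</proj_title>" +
            "<proj_start>10</proj_start><proj_end>50</proj_end></proj>" +
            "<proj ID='2'><proj_title>heat_run</proj_title>" +
            "<proj_start>100</proj_start><proj_end>200</proj_end></proj>" +
            "</projects>";

        foreach (var proj in ReadProjects(new StringReader(xml), "heat_run"))
        {
            int duration = (int)proj.Element("proj_end") - (int)proj.Element("proj_start");
            Console.WriteLine("Duration: {0}", duration);
        }
    }
}
```

For an 8 MB file this is probably unnecessary, as noted above, but the same loop scales to files that do not fit in memory comfortably.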
I'm trying to do something simple, but somehow it doesnt work for me, here's my code:
var items = html.DocumentNode.SelectNodes("//div[@class='itembox']");
foreach(HtmlNode e in items)
{
int x = items.Count; // equals 10
HtmlNode node = e;
var test = e.SelectNodes("//a[@class='head']");// I need this to return the
// anchor of the current itembox
// but instead it returns the
// anchor of each itembox element
int y = test.Count; // also equals 10!! supposed to be only 1
}
my html page looks like this:
....
<div class="itembox">
<a Class="head" href="one.com">One</a>
</div>
<div class="itembox">
<a Class="head" href="two.com">Two</a>
</div>
<!-- 10 itembox elements-->
....
Is my XPath expression wrong? Am I missing something?
Use
var test = e.SelectNodes(".//a[@class='head']");
instead. Your current code ( //a[]) searches all a elements starting from the root node. If you prefix it with a dot instead (.//a[]) only the descendants of the current node will be considered. Since it is a direct child in your case you could of course also do:
var test = e.SelectNodes("a[@class='head']");
As always see the Xpath spec for details.
var test = e.SelectNodes("//a[@class='head']");
This is an absolute expression, but you need a relative XPath expression -- to be evaluated off e.
Therefore use:
var test = e.SelectNodes("a[@class='head']");
Do note: Avoid using the XPath // pseudo-operator as much as possible, because such use may result in significant inefficiencies (slowdown).
In this particular XML document the a elements are just children of div -- not at indefinite depth off div.
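The difference between the absolute and relative forms can be sketched without HtmlAgilityPack by using the XPath extension methods for LINQ to XML as a stand-in; this assumes the same XPath rules apply to both, which is what both answers describe. The sample markup is trimmed to two itembox elements inside a made-up root element, and RelativeXPathDemo and CountPerBox are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class RelativeXPathDemo
{
    // Two itembox elements instead of ten, wrapped in an assumed <root>.
    public const string Sample =
        "<root>" +
        "<div class='itembox'><a class='head' href='one.com'>One</a></div>" +
        "<div class='itembox'><a class='head' href='two.com'>Two</a></div>" +
        "</root>";

    // Evaluates xpath with the first itembox as the context node
    // and returns how many elements it selects.
    public static int CountPerBox(string xpath)
    {
        var doc = XDocument.Parse(Sample);
        var firstBox = doc.XPathSelectElements("//div[@class='itembox']").First();
        return firstBox.XPathSelectElements(xpath).Count();
    }

    public static void Main()
    {
        // Absolute: starts at the document root, so every anchor is counted.
        Console.WriteLine(CountPerBox("//a[@class='head']"));
        // Relative: only descendants of the current itembox.
        Console.WriteLine(CountPerBox(".//a[@class='head']"));
        // Relative: only direct children of the current itembox.
        Console.WriteLine(CountPerBox("a[@class='head']"));
    }
}
```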
I'm looking for a regular expression to isolate the src value of an img.
(I know that this is not the best way to do this but this is what I have to do in this case)
I have a string which contains simple html code, some text and an image. I need to get the value of the src attribute from that string. I have managed only to isolate the whole tag till now.
string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
I know you say you have to use regex, but if possible i would really give this open source project a chance:
HtmlAgilityPack
It is really easy to use, I just discovered it and it helped me out a lot, since I was doing some heavier html parsing. It basically lets you use XPATHS to get your elements.
Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.
The code for your query would look something like this: (uncompiled code)
List<string> imgScrs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);//or doc.Load(htmlFileStream)
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
foreach (var img in nodes)
{
HtmlAttribute att = img["src"];
imgScrs.Add(att.Value);
}
I tried what Francisco Noriega suggested, but it looks like the API of the HtmlAgilityPack has been altered. Here is how I solved it:
List<string> images = new List<string>();
WebClient client = new WebClient();
string site = "http://www.mysite.com";
var htmlText = client.DownloadString(site);
var htmlDoc = new HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true
};
htmlDoc.LoadHtml(htmlText);
foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
{
HtmlAttribute att = img.Attributes["src"];
images.Add(att.Value);
}
This should capture all img tags and just the src part no matter where its located (before or after class etc) and supports html/xhtml :D
<img.+?src="(.+?)".+?/?>
The regex you want should be along the lines of:
(<img.*?src="([^"]*)".*?>)
Hope this helps.
you can also use a look behind to do it without needing to pull out a group
(?<=<img.*?src=")[^"]*
remember to escape the quotes if needed
This is what I use to get the tags out of strings:
</? *img[^>]*>
Here is the one I use:
<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>
The good part is that it matches any of the below:
<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">
And it can also match some unexpected scenarios like extra attributes, e.g:
<img src = "test.jpg" width="300">
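As a rough sketch of how that last pattern might be used from C#: the named group src holds the attribute value whichever branch matched. The wrapper class ImgSrcExtractor and the method ExtractSrc are made-up names, and RegexOptions.IgnoreCase is an addition that is not part of the pattern as posted. Inside a verbatim string the double quote has to be doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // Pattern from the answer above; "" is an escaped quote in a verbatim string.
    // IgnoreCase is an assumption added here so that <IMG SRC=...> also matches.
    private static readonly Regex ImgSrc = new Regex(
        @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>",
        RegexOptions.IgnoreCase);

    // Returns the src value of the first img tag, or null if there is none.
    public static string ExtractSrc(string html)
    {
        Match m = ImgSrc.Match(html);
        return m.Success ? m.Groups["src"].Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<img src='test.jpg'>",
            "<img src=test.jpg>",
            "<img src=\"test.jpg\">",
            "<img src = \"test.jpg\" width=\"300\">"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(ExtractSrc(sample)); // prints "test.jpg" each time
        }
    }
}
```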