Matching tags in C# - c#

I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could really be any number: 1, 2, 3, or anything else. Is there a way to match this tag in C# using IndexOf, RegEx, or a better method?
So to give an example of how I want to use it. I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!

I would suggest first making the document valid XML by replacing those equals signs, then using any XML parser.

There is only one valid answer to this need, unless you are doing homework and need to learn how to code this yourself...
Avoid reinventing things from scratch and use Html Agility Pack.
It is called Html but it also handles XML files, in case you have to do more complex things, like parsing, and don't want to or cannot use pure XPath and the XML-related .NET Framework classes.
see here for some examples: How to use HTML Agility pack
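If the tags have to be matched as-is, without first converting them to XML, a regular expression can cover the simple case from the question. The following is only a minimal sketch: the class name CategoryTags and the method TryGetCategoryId are made up for illustration, and it assumes the opening and closing tags carry the same number and are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryTags
{
    // Matches <categories=N>...</categories=N>; \k<id> requires the closing
    // tag to repeat the number captured from the opening tag.
    private static readonly Regex Pattern = new Regex(
        @"<categories=(?<id>\d+)>(?<body>.*?)</categories=\k<id>>",
        RegexOptions.Singleline);

    // Hypothetical helper: returns true and sets id (as text) when a tag pair is found.
    public static bool TryGetCategoryId(string input, out string id)
    {
        Match m = Pattern.Match(input);
        id = m.Success ? m.Groups["id"].Value : null;
        return m.Success;
    }
}
```

A call such as CategoryTags.TryGetCategoryId(text, out string id) can then sit directly inside the if statement from the question.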

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to do parsing so that each answer (p) belongs to the category(h2) above.
If nothing is above, then the category will be null.
Look like this :
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse HTML and allow you to use LINQ to select whatever you need from the DOM structure.
In addition to HTMLAgilityPack, I've also written a lightweight HTML parser for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.
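For a fragment as small and well-formed as the one in the question, the grouping itself can be sketched with LINQ to XML alone. This is an assumption-heavy illustration, not how either library above does it: it only works when the fragment parses as XML (real-world HTML often does not), and the names AnswerGrouping and Group are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class AnswerGrouping
{
    // Pairs each <p> with the nearest preceding <h2>, or null when there is none.
    public static List<(string Category, string Answer)> Group(string fragment)
    {
        // Wrap in a synthetic root so the fragment parses as a single XML element.
        XElement root = XElement.Parse("<root>" + fragment + "</root>");
        var result = new List<(string Category, string Answer)>();
        string category = null;

        foreach (XElement el in root.Elements())
        {
            if (el.Name.LocalName == "h2")
                category = el.Value;               // remember the current heading
            else if (el.Name.LocalName == "p")
                result.Add((category, el.Value));  // attach the answer to it
        }
        return result;
    }
}
```

The same single pass over sibling nodes carries over to a real HTML parser: walk the children in document order and remember the last heading seen.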

How to retrieve data from an html string from a span tag by using Regular Expressions?

I need to retrieve some info from an HTML document, since the web service that will return JSON or XML is not ready yet. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to extract the div I want to work with from the whole HTML string, but now I'm having trouble getting the info inside the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content of the first span tag.
Here's the regular expression I've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use HtmlAgilityPack since my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the structure of the HTML will match that of XML.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting an external third-party library in your solution. One way around that is to fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
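If a regex is the only thing allowed, the attempts in the question mostly fail because (\r|\n|.)* is greedy and runs to the last closing span. A lazy quantifier stops at the first one. The sketch below is one way to write that in C#; the names SpanExtractor and FirstTypeSpan are made up, and it assumes the span has exactly the attribute class="type" and contains no nested spans.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Lazy .*? stops at the first </span>; IgnoreCase also accepts <SPAN>,
    // and Singleline lets . match newlines inside the span.
    private static readonly Regex TypeSpan = new Regex(
        @"<span\s+class=""type""\s*>(?<Content>.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: trimmed text of the first matching span, or null if none.
    public static string FirstTypeSpan(string html)
    {
        Match m = TypeSpan.Match(html);
        return m.Success ? m.Groups["Content"].Value.Trim() : null;
    }
}
```

Because the pattern requires the closing quote right after type, the second span in the sample (class="type2") is not picked up by mistake.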

removing xml tag with regex

I need to remove the tag "image" with regex.
I'm working with C# .Net
example <rrr><image from="91524" to="92505" /></rrr> should become:
<rrr></rrr>
Anyone???
You shouldn't really be using regex for this task, especially when .NET provides such powerful tools to handle XML:
XElement xml = XElement.Parse("<rrr><image from=\"91524\" to=\"92505\" /></rrr>");
xml.Descendants("image").Remove();
However if you insist on doing this with regex, let's see what happens:
string xml = "<rrr><image from=\"91524\" to=\"92505\" /></rrr>";
string output = Regex.Replace(xml, "<image.*?>", "");
This method has some problems though that the first method solves for you. Example problems:
Doesn't handle case sensitivity.
> characters in attributes can confuse the regex.
Newlines won't be matched correctly.
Incorrectly matches other tags that start with image like <image2 />.
XML comments can cause problems.
Doesn't handle both <image /> and <image></image>.
etc...
Some of these are easy to fix, some are more tricky. But in the end it's not worth spending time improving the regular expression solution to handle all the special cases when the LINQ to XML solution is so simple and does all this for you.
Even though XML is very regular and suffers from a draconian "validate or die" policy, this Stack Overflow question will prove very enlightening.
Regular expressions are powerful--but the XML tools in .NET are better for this task, because they are designed to handle this sort of thing. You can manipulate the XML based upon its structure, something Regexes can't do because they see your XML as text.
XML is text, but it's text with a particular structure. Take advantage of that known quality.
Try this:
<image[^>]*>

Find href attribute values that do not contain “javascript:”

I have a RegEx which nicely finds the href's in a URL:
<[aA][^>]*? href=[\"'](?<url>[^\"]+?)[\"'][^>]*?>
However, I want it to NOT find any href that contains the text, 'javascript:' in it.
The reason is that I sometimes need to mod the href and sometimes don't. When there is a 'javascript:' text in the href I want it not to be found by the regex.
(ASP.NET, C#)
I really wouldn't recommend using a regexp for this, since HTML isn't regular and there are no end of edge cases to cater for. If at all possible, please use an HTML parser. I think you'll find it a lot less grief.
The word javascript can be written in other ways. Look at the ha.ckers.org article.
Simply excluding the word javascript doesn't provide you any safety at all.
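For the narrow case of skipping links while rewriting them (not for sanitizing untrusted input, for the reasons given above), a negative lookahead can be added to the original pattern. This is a sketch under that assumption; HrefFinder and NonScriptUrls are illustrative names, and obfuscated spellings of javascript: are deliberately not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // (?![^"']*javascript:) rejects any href value that contains "javascript:"
    // before its closing quote; IgnoreCase covers <A HREF=...> and JavaScript:.
    private static readonly Regex Href = new Regex(
        @"<a\b[^>]*?\shref=[""'](?![^""']*javascript:)(?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value that does not contain "javascript:".
    public static List<string> NonScriptUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
            urls.Add(m.Groups["url"].Value);
        return urls;
    }
}
```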

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
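HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// Select every <a> element that has an href attribute.
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}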
Regular Expression would be my way. ;)
If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
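One way to keep a plain regex replace from touching tag names and attribute values is to let the pattern consume whole tags first and only wrap matches found outside them. The sketch below assumes that approach; Highlighter and Highlight are made-up names, the wrapper is the same strong tag used in the earlier example, and a phrase that is split across tags (like the B/ook case above) is still not found.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Wraps each occurrence of phrase in <strong>, skipping text inside tags.
    public static string Highlight(string html, string phrase)
    {
        // The first alternative swallows a whole tag, so the phrase can only
        // match in the text between tags. Regex.Escape keeps the phrase literal.
        string pattern = @"(?<tag><[^>]*>)|(?<hit>" + Regex.Escape(phrase) + ")";

        return Regex.Replace(
            html,
            pattern,
            m => m.Groups["tag"].Success
                ? m.Value                                // leave tags untouched
                : "<strong>" + m.Value + "</strong>",   // wrap real hits
            RegexOptions.IgnoreCase);
    }
}
```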
