removing xml tag with regex - c#

I need to remove the tag "image" with regex.
I'm working with C# .Net
example <rrr><image from="91524" to="92505" /></rrr> should become:
<rrr></rrr>
Anyone???

You shouldn't really be using regex for this task, especially when .NET provides such powerful tools to handle XML:
XElement xml = XElement.Parse("<rrr><image from=\"91524\" to=\"92505\" /></rrr>");
xml.Descendants("image").Remove();
However if you insist on doing this with regex, let's see what happens:
string xml = "<rrr><image from=\"91524\" to=\"92505\" /></rrr>";
string output = Regex.Replace(xml, "<image.*?>", "");
This method has some problems though that the first method solves for you. Example problems:
Doesn't handle case sensitivity.
> characters in attributes can confuse the regex.
Newlines won't be matched correctly.
Incorrectly matches other tags that start with image like <image2 />.
XML comments can cause problems.
Doesn't handle both <image /> and <image></image>.
etc...
Some of these are easy to fix, some are more tricky. But in the end it's not worth spending time improving the regular expression solution to handle all the special cases when the LINQ to XML solution is so simple and does all this for you.
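To make the comparison concrete, here is a minimal sketch of the LINQ to XML approach wrapped in a helper. The class name ImageStripper and the method name RemoveImages are made up for illustration; only XElement.Parse, Descendants and Remove come from the answer above.

```csharp
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not part of any library.
public static class ImageStripper
{
    // Removes every <image> element, whether written as <image /> or <image>...</image>.
    public static string RemoveImages(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Element names are compared exactly, so <image2 /> is left alone.
        root.Descendants("image").Remove();

        // DisableFormatting avoids re-indenting the remaining markup.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling ImageStripper.RemoveImages on the string from the question leaves an empty rrr element. Attributes containing > characters, newlines inside the tag, and comments are all handled by the parser rather than by the pattern.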

Even though XML is very regular and suffers from a draconian "validate or die" policy, this Stack Overflow question will prove very enlightening.
Regular expressions are powerful--but the XML tools in .NET are better for this task, because they are designed to handle this sort of thing. You can manipulate the XML based upon its structure, something Regexes can't do because they see your XML as text.
XML is text, but it's text with a particular structure. Take advantage of that known quality.

Try this:
<image[^>]*>


XML: how to pre-parse when only SOME data is escaped?

XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue is that feeding the XML as-is to a parser, as in the snippet below, throws an exception because some of the content is not escaped properly:
string xml = "<field>& is not escaped</field>";
using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    while (reader.Read()) { } // XmlException is thrown when the reader reaches the bare '&'
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
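Putting the two steps together, a minimal sketch might look like the following. The class and method names are hypothetical, and it assumes each input string holds a single element; the pattern is the one quoted above, which treats anything starting with &# as an already-escaped numeric reference.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; names are illustrative only.
public static class AmpersandFixer
{
    // An ampersand that does not begin a predefined entity or a numeric character reference.
    private static readonly Regex RogueAmpersand =
        new Regex(@"&(?!(?:apos|quot|[gl]t|amp);|#)");

    // Escapes only the ampersands that are not already part of an entity.
    public static string EscapeRogueAmpersands(string xml)
    {
        return RogueAmpersand.Replace(xml, "&amp;");
    }

    // Repairs the text, parses it, and returns the element's decoded text content.
    public static string ReadFieldValue(string xml)
    {
        return XElement.Parse(EscapeRogueAmpersands(xml)).Value;
    }
}
```

Already-escaped input such as &amp; passes through unchanged, so the repair can be applied to every document without first checking whether it needs it. It does not help with other damage, such as a bare < in text.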
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle cases such as a reference to &nbsp; with no definition, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

Matching tags in C#

I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could really be any number: 1, 2, 3 or any other. Is there a way to match this tag in C# using IndexOf, Regex, or a better method?
So to give an example of how I want to use it. I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!
I would suggest first making the document valid XML by replacing those equals signs, then using any XML parser.
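One way to read that suggestion, as a rough sketch: move the number into an attribute so the tag becomes legal XML, then parse. The attribute name id, the class name and the method names are assumptions made for this example, and the code assumes the text contains a single categories element.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; the "id" attribute is an arbitrary choice for this sketch.
public static class CategoryTags
{
    // Rewrites <categories=N>...</categories=N> as <categories id="N">...</categories>.
    public static string ToValidXml(string text)
    {
        string opened = Regex.Replace(text, @"<categories=(\d+)>", "<categories id=\"$1\">");
        return Regex.Replace(opened, @"</categories=\d+>", "</categories>");
    }

    // Returns the number carried by the tag, or null if the attribute is missing.
    public static string GetCategoryId(string text)
    {
        XElement element = XElement.Parse(ToValidXml(text));
        return (string)element.Attribute("id");
    }
}
```

The condition from the question could then be written as if (CategoryTags.GetCategoryId(text) == "1") { ... }.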
There is only one sensible answer to this need, unless you are doing homework and need to learn how to code this yourself:
avoid reinventing things from scratch and use Html Agility Pack.
It is called Html, but it also handles XML files, in case you have to do more complex things, like parsing, and don't want to or cannot use pure XPath and the XML-related .NET Framework classes.
see here for some examples: How to use HTML Agility pack

Parsing XML-ish data

Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.
No, there is no way to get XmlDocument to parse a document which isn't xml, no matter how close to xml it might look!
If it's possible to do, then I would definitely recommend that you convert your documents to actual XML (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone for a format that doesn't appear to be rigorously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also, I don't understand why you would need to place the something_here bit in CDATA sections. All characters that "break" XML can be escaped fairly easily (for example by writing < as &lt;). CDATA sections are generally only used when the contents of a node need so much escaping that it's easier and less messy just to use a CDATA section instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way, as long as everything is working fine (which it will be as long as nothing changes), users don't need to modify their documents; however, if they run into problems or want to use any new features, explain to them that they must update their documents to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performs the conversion (as best it can).
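The "try XML first, fall back to the old method" idea from the update could be sketched roughly as follows. TemplateLoader and TryLoadAsXml are invented names; the only library behaviour relied on is that XmlDocument.LoadXml throws XmlException for text that is not well-formed.

```csharp
using System.Xml;

// Hypothetical helper; names are illustrative only.
public static class TemplateLoader
{
    // Returns true and the loaded document when the text is well-formed XML;
    // returns false (and a null document) when it is in the old, non-XML format.
    public static bool TryLoadAsXml(string text, out XmlDocument document)
    {
        document = new XmlDocument();
        try
        {
            document.LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            document = null;
            return false;
        }
    }
}
```

When the method returns false, the caller would hand the text to the existing regex-based routine. If exceptions on every old-format document turn out to be too slow, a cheaper check, such as looking for a version attribute on the root element, could decide the path instead.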
Can't you replace <$ something_here $> with that big CDATA section at run-time and then load the XML document as usual?
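A minimal sketch of that run-time replacement, assuming the special tags never nest and their contents never contain the sequence ]]> (which would end the CDATA section early). The class and method names are made up; some_special_tag is the element name used in the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class SpecialTagConverter
{
    // Matches <$ ... $> lazily; Singleline lets the contents span several lines.
    private static readonly Regex SpecialTag =
        new Regex(@"<\$(.*?)\$>", RegexOptions.Singleline);

    // Rewrites each <$ ... $> as a CDATA-wrapped element so the result can be loaded as XML.
    public static string WrapSpecialTags(string template)
    {
        return SpecialTag.Replace(
            template,
            "<some_special_tag><![CDATA[$1]]></some_special_tag>");
    }
}
```

The converted string can then be passed to XmlDocument.LoadXml, and the documents on disk stay in their current format.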

C# regex html table inside a table

I am using the following regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
I've found it to work quite well in the documents we are using (documents converted with word save as filtered html), however I have a problem that if the table contains a table inside it the regex will match the initial table start tag and the second table end tag rather than the initial table end tag.
Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?
Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular expression parsing is too likely to fail on real HTML to be responsible coding.
HtmlAgilityPack is a MsPL-licensed solution I've used in the past that has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways ("Extract all text, dropping out tags" being a popular one for importing HTML mail for search, for example, that is nigh trivial after letting a DOM parser rip through the HTML and only coding the part that adds value for your specific business case).
"Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?"
Since nobody's actually answered this part, I will—No.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TLDNR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML Parser does just that - since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
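To illustrate what "counting" means here, the sketch below uses a regex only to locate individual table tags and keeps the tally of open tables in ordinary code. It is not an HTML parser: it ignores comments, scripts and malformed markup, and all names in it are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; a real HTML parser is the more robust choice.
public static class TableExtractor
{
    // Matches a single opening or closing table tag; group 1 is "/" for a closing tag.
    private static readonly Regex TableTag =
        new Regex(@"<(/?)table\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns each outermost <table>...</table> block, nested tables included.
    public static List<string> OutermostTables(string html)
    {
        var tables = new List<string>();
        int depth = 0;   // how many tables are currently open
        int start = 0;   // index where the current outermost table began

        foreach (Match tag in TableTag.Matches(html))
        {
            bool isClosing = tag.Groups[1].Length > 0;
            if (!isClosing)
            {
                if (depth == 0) start = tag.Index;
                depth++;
            }
            else if (depth > 0)
            {
                depth--;
                if (depth == 0)
                    tables.Add(html.Substring(start, tag.Index + tag.Length - start));
            }
        }
        return tables;
    }
}
```

The depth variable is exactly the state a plain regular expression has no place to keep, which is why the single-pattern approach from the question stops at the first closing tag.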
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
Regular expressions are not really suited for this as what you're trying to do contains knowledge about the fact that this is a nested language. Without this knowledge it will be really hard (and also hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me#mydomain.com" /><appid="928" appname="extractapp" supportemail="me#mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.
Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx
Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.
What about modifying the string to be well-formed XML, then loading it as XML to extract all the values of the supportemail attribute?
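A rough sketch of that idea, assuming every tag starts with the literal text <appid= as in the question. The element name app, the wrapper element apps and the class name are invented here so that the text becomes parseable.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; "app" and "apps" are made-up element names.
public static class SupportEmails
{
    public static List<string> Extract(string tags)
    {
        // "<appid=..." has no element name, so supply one and add a single root element.
        string wellFormed = "<apps>" + tags.Replace("<appid=", "<app appid=") + "</apps>";

        return XElement.Parse(wellFormed)
            .Elements("app")
            .Select(app => (string)app.Attribute("supportemail"))
            .Where(email => email != null)   // skip tags without the attribute
            .ToList();
    }
}
```

The comma-separated output the question asks for is then string.Join(",", SupportEmails.Extract(input)). Because a parser reads the attributes, single-quoted values, attributes in a different order, and entity references in the values are handled without extra work.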
Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach(Match m in matches)
Console.WriteLine(m.Groups[1].Value);
See it here.
Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.
