Hi guys, just had a quick question about using multi-line in regex:
The Regex:
string content = Regex.Match(onix.Substring(startIndex, endIndex - startIndex), @">(.+)<", RegexOptions.Multiline).Groups[1].Value;
Here is the string of text I am reading:
<Title>
<TitleType>01</TitleType>
<TitleText textcase="02">18th Century Embroidery Techniques</TitleText>
</Title>
Here is what I am getting:
01
What I want is everything between the
<Title> and </Title>.
This works perfectly when everything is on one line, but when the content starts on another line the pattern seems to skip it.
Any assistance is much appreciated.
You must also use the Singleline option, along with Multiline:
string content = Regex.Match(onix.Substring(startIndex, endIndex - startIndex), @">(.+)<", RegexOptions.Multiline | RegexOptions.Singleline).Groups[1].Value;
But do yourself a favor and stop parsing XML using Regular Expressions! Use an XML parser instead!
You can parse the XML text using the XmlDocument class, and use XPath selectors to get to the element you're interested in:
XmlDocument doc = new XmlDocument();
doc.LoadXml(...); // load your XML text here
XmlNode root = doc.SelectSingleNode("Title"); // this selects the <Title>..</Title> element
// modify the selector depending on your outer XML
Console.WriteLine(root.InnerXml); // displays the contents of the selected node
RegexOptions.Multiline will just change the meaning of ^ and $ to beginning/end of lines instead of beginning/end of the entire string.
You want to use RegexOptions.Singleline instead, which makes . match line breaks (as well as everything else).
You might want to parse what is probably XML instead. If possible this is the preferred way of working instead of parsing it by employing regular expressions. Please disregard if not applicable.
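To see the difference between the two options, here is a minimal sketch that runs the question's pattern over the question's sample text. The class and method names (SinglelineDemo, Capture) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Returns whatever the first group of the question's pattern captures.
    public static string Capture(string input, RegexOptions options)
    {
        return Regex.Match(input, @">(.+)<", options).Groups[1].Value;
    }

    public static void Main()
    {
        string xml = "<Title>\n<TitleType>01</TitleType>\n" +
                     "<TitleText textcase=\"02\">18th Century Embroidery Techniques</TitleText>\n</Title>";

        // Multiline only changes ^ and $; '.' still stops at '\n'.
        Console.WriteLine(Capture(xml, RegexOptions.Multiline)); // prints: 01

        // Singleline lets '.' cross line breaks, so the greedy group runs
        // from the first '>' to the last '<' in the input.
        Console.WriteLine(Capture(xml, RegexOptions.Singleline));
    }
}
```

Because .+ is greedy, the Singleline version captures everything between <Title> and </Title>, including the line breaks.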
I am trying to figure out how to write a regex that will strip out the values enclosed in an xml tag. For example,
string xml = "<MyElement1 attribute=\"bla\"><MyElement1>12345</MyElement1></MyElement1>";
I want to know how to do the following:
match on MyElement1 nodes that do not have an attribute
So specifically, using my example I would match <MyElement1>12345</MyElement1> and replace <MyElement1> and </MyElement1> so that my final node looks like this: <MyElement1 attribute="bla">12345</MyElement1>
I've tried: [<][^>]*[>] but this matches on all elements. I'm not sure how to specify specific elements I want to match on.
I have made edits to make the question more focused and clearer, as suggested, based on the downvotes. I understand that I can parse and navigate my document tree, but I prefer to use a regex replace of some sort because I want to apply this logic to any number of XML files with different tree structures, elements, and attributes.
Well you really don't need to use regular expressions, you just need to parse your XML using an XML parser.
One option is to use the XDocument.Parse( xml ) method together with XElement: the first parses the string, and the second lets you read each element's tag name and value. An example of reading it:
using System;
using System.Linq;
using System.Xml.Linq;

string xml = "<MyElement1>12345</MyElement1><MyElement2>abcd</MyElement2><MyElement3>12345</MyElement3><MyElement4>12345</MyElement4>";
// wrap your elements in a root node (you seem to be missing one in your example)
var document = XDocument.Parse( $"<root>{xml}</root>");
// get the root node and loop over its children (filtering XNode down to XElement in the process)
foreach (var node in document.Root.Nodes().OfType<XElement>()) {
    // Name is the tag, Value is, well, its value
    Console.WriteLine($"{node.Name}: {node.Value}");
}
Note that for the example to parse correctly you must add a root node, as an XML document can have only one root node. In my sample, I added the root node during parsing.
This sample code uses the System.Xml.Linq namespace, so don't forget to import that one.
One additional comment: your supplied XML code had an error in it (a MyElemen4 opening tag with a MyElement4 closing tag).
I would recommend using a XML Parser but if you want, you can use a simple regex like <([\w]*)>(.*?)<\/[\w]*>, this would return the name of the tag and the value inside.
Output:
Match 1
Full match 0-30 <MyElement1>12345</MyElement1>
Group 1. 1-11 MyElement1
Group 2. 12-17 12345
Match 2
Full match 30-59 <MyElement2>abcd</MyElement2>
Group 1. 31-41 MyElement2
Group 2. 42-46 abcd
Match 3
Full match 59-89 <MyElement3>12345</MyElement3>
Group 1. 60-70 MyElement3
Group 2. 71-76 12345
Match 4
Full match 89-118 <MyElemen4>12345</MyElement4>
Group 1. 90-99 MyElemen4
Group 2. 100-105 12345
Keep in mind that it doesn't take tag attributes into consideration. If you want to fetch a specific tag, you can replace [\w]* with the tag name you want.
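For reference, a rough C# sketch of running that pattern with Regex.Matches. The TagPairs class and Extract method are hypothetical names, and the input is assumed to be a flat run of attribute-less elements as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagPairs
{
    // Returns (tag name, inner text) pairs for every attribute-less element found.
    public static List<KeyValuePair<string, string>> Extract(string xml)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(xml, @"<([\w]*)>(.*?)<\/[\w]*>"))
        {
            pairs.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string xml = "<MyElement1>12345</MyElement1><MyElement2>abcd</MyElement2>";
        foreach (var pair in Extract(xml))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

An element that carries an attribute is skipped entirely, because the pattern requires > immediately after the tag name.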
I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.
Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx
Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.
What about modifying the string so it is well-formed XML, then loading it as XML to extract all the values of the supportemail attribute?
Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach(Match m in matches)
Console.WriteLine(m.Groups[1].Value);
Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.
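A minimal LINQ to XML sketch that sidesteps the first three problems. It rests on loud assumptions: the tags in the question have no element name, so the sketch assumes they can be rewritten as <app ... /> (a hypothetical name) and wrapped in a made-up <root> element. SupportEmails and Extract are illustrative names too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SupportEmails
{
    // Assumes the fragment is a run of well-formed elements; the <root>
    // wrapper is added only so the parser sees a single root element.
    public static List<string> Extract(string fragment)
    {
        XElement root = XElement.Parse("<root>" + fragment + "</root>");
        return root.Elements()
                   .Select(e => (string)e.Attribute("supportemail")) // null when the attribute is absent
                   .Where(v => v != null)
                   .ToList();
    }

    public static void Main()
    {
        // Second tag uses single quotes, a different attribute order, and an entity.
        string fragment =
            "<app appid=\"928\" appname=\"extractapp\" supportemail=\"me@mydomain.com\" />" +
            "<app supportemail='a&amp;b@example.com' appname='other' appid='929' />";

        Console.WriteLine(string.Join(",", Extract(fragment).ToArray()));
        // prints: me@mydomain.com,a&b@example.com
    }
}
```

The parser handles quote style, attribute order, and entity references, so none of them needs special treatment in the calling code.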
How can I replace a certain part of an XML file with a defined string?
<tag1></tag2>
<tag2></tag2>
...etc
<soundcard num=0>
<name>test123</name>
</soundcard>
<soundcard num=1>
<name>test123</name>
</soundcard>
<soundcard num=2>
<name>test123</name>
</soundcard>
<tag5></tag5>
Replace all soundcard parts so that the result looks like this:
<tag1></tag2>
<tag2></tag2>
...etc
{0}
<tag5></tag5>
I'm using C# with .NET 3.5 and I thought of a regex solution.
If it has to be a regex, your XML file is well-formed, and you know (say, from the DTD) that <soundcard> tags can't be nested, then you can use
(<soundcard.*?</soundcard>\s*)+
and replace all with {0}.
In C#:
resultString = Regex.Replace(subjectString, @"(<soundcard.*?</soundcard>\s*)+", "{0}", RegexOptions.Singleline);
For a quick-and-dirty fix to a one-off problem, I think that's OK. It's not OK to think of regex as the proper tool to handle XML in general.
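As a quick check of what that replacement does, here is a small self-contained sketch. The SoundcardCollapse class, the Collapse method, and the sample input (with quoted num attributes) are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SoundcardCollapse
{
    // Replaces each run of consecutive <soundcard> blocks with the literal text {0}.
    public static string Collapse(string xml)
    {
        return Regex.Replace(xml, @"(<soundcard.*?</soundcard>\s*)+", "{0}", RegexOptions.Singleline);
    }

    public static void Main()
    {
        string xml =
            "<tag2></tag2>\n" +
            "<soundcard num=\"0\">\n<name>test123</name>\n</soundcard>\n" +
            "<soundcard num=\"1\">\n<name>test123</name>\n</soundcard>\n" +
            "<tag5></tag5>";

        Console.WriteLine(Collapse(xml));
    }
}
```

Note that the trailing \s* also swallows the line break after the last </soundcard>, so {0} ends up directly in front of <tag5>; append a newline to the replacement string if that matters.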
Personally I would use Linq to XML and remove the entities and replace it with a Text Node.
Update Apr 16/2010 4:40PM MST
Here's an example of Linq to XML, I'm a bit rusty but it should at least give you an idea of how this is done.
XElement root = XElement.Load("myxml.xml");
// Materialize the query with ToList() so removing nodes doesn't disturb the enumeration
var soundcards = (from el in root.Elements() where el.Name == "soundcard" select el).ToList();
var prev_node = soundcards.First().PreviousNode;
// Remove Nodes
foreach (XElement card in soundcards)
    card.Remove();
// Build your content here into a variable called newChild
prev_node.AddAfterSelf(newChild);
My suggestion would be to use an XSLT transformation to replace the tags you want to replace with a known tag, say , and then String.Replace('', '{0}');.
I echo what Johannes said: do NOT try to build REs to do this. As your XML gets more complex, your error rate will increase.
How would you find the value of string that is repeated and the data between it using regexes? For example, take this piece of XML:
<tagName>Data between the tag</tagName>
What would be the correct regex to find these values? (Note that tagName could be anything).
I have found a way that works: find all the tagNames that sit in between a set of < >, search for the first instance of each tagName from the opening tag to the end of the string, then find the closing </tagName> and work out the data between them. However, this is extremely inefficient and complex. There must be an easier way!
EDIT: Please don't tell me to use XMLReader; I doubt I will ever use my custom class for reading XML, I am trying to learn the best way to do it (and the wrong ways) through attempting to make my own.
Thanks in advance.
You can use: <(\w+)>(.*?)<\/\1>
Group #1 is the tag, Group #2 is the content.
Using regular expressions to parse XML is a terrible error.
This is efficient (it doesn't parse the XML into a DOM) and simple enough:
string s = "<tagName>Data between the tag</tagName>";
using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
xr.Read();
Console.WriteLine(xr.ReadElementContentAsString());
}
Edit:
Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:
Consider this fairly trivial test case:
<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>
There are two elements with a tag name of "a" in that XML. The first has one text-node child with a value of "text1", and the second has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't because it's enclosed in a CDATA section.
You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.
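To make the stack idea concrete, here is a rough sketch that lets XmlReader do the tokenizing and keeps only the stack of open elements by hand. The StackWalk class and Walk method are hypothetical names, and the input is assumed to be a well-formed document with a proper CDATA section.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StackWalk
{
    // Returns one "element: text" entry per text or CDATA node, in document order.
    public static List<string> Walk(string xml)
    {
        var open = new Stack<string>();  // names of the currently open elements
        var found = new List<string>();
        using (XmlReader xr = XmlReader.Create(new StringReader(xml)))
        {
            while (xr.Read())
            {
                switch (xr.NodeType)
                {
                    case XmlNodeType.Element:
                        // Empty elements like <x/> never produce an EndElement.
                        if (!xr.IsEmptyElement) open.Push(xr.Name);
                        break;
                    case XmlNodeType.EndElement:
                        open.Pop();
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        // CDATA content arrives as plain text, never as markup.
                        found.Add(open.Peek() + ": " + xr.Value);
                        break;
                }
            }
        }
        return found;
    }

    public static void Main()
    {
        string xml = "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>";
        foreach (string line in Walk(xml))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the sample input this reports text1 and text3 under an "a" element and the literal string <a>text2</a> under "b", which is exactly the distinction a plain pattern match cannot make.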
You can use a backreference like \1 to refer to an earlier match:
@"<([^>]*)>(.*)</\1>"
The \1 will match what was captured by the first parenthesized group.
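A short sketch of the backreference in action, using the <(\w+)>(.*?)<\/\1> form suggested earlier in the thread. BackrefDemo and Find are made-up names, and this only illustrates the regex mechanics on simple, non-nested input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDemo
{
    // Group 1 is the tag name, group 2 the content; \1 forces the closing
    // tag to repeat whatever name group 1 captured.
    public static Match Find(string xml)
    {
        return Regex.Match(xml, @"<(\w+)>(.*?)<\/\1>");
    }

    public static void Main()
    {
        Match m = Find("<tagName>Data between the tag</tagName>");
        Console.WriteLine(m.Groups[1].Value); // prints: tagName
        Console.WriteLine(m.Groups[2].Value); // prints: Data between the tag

        // A mismatched closing tag does not match at all.
        Console.WriteLine(Find("<a>x</b>").Success); // prints: False
    }
}
```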
With Perl:
my $tagName = 'some tag';
my $i; # some line of XML
$i =~ /\<$tagName\>(.+)\<\/$tagName\>/;
where $1 is now filled with the data you captured.
Going forward, if you get stuck check out regexlib.com
It's the first place I go when I get stuck on regex.
I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot@somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast@somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, so if this is under your control you should consider dealing with it there.
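A minimal sketch of the encode-before-concatenating idea. It uses System.Net.WebUtility.HtmlEncode as a stand-in so that it runs without a System.Web reference; the EncodeFirst class, the BuildReply method, and the div markup are assumptions for illustration.

```csharp
using System;
using System.Net;

public static class EncodeFirst
{
    // Encodes the plain-text part first, then wraps it in markup, so the
    // markup itself is never touched by the encoder.
    public static string BuildReply(string quotedText)
    {
        return "<div class=\"quotedReply\">" + WebUtility.HtmlEncode(quotedText) + "</div>";
    }

    public static void Main()
    {
        Console.WriteLine(BuildReply("> Ok, got it"));
        // prints: <div class="quotedReply">&gt; Ok, got it</div>
    }
}
```

Done this way there is nothing to untangle afterwards, because text and tags are never mixed before the text is encoded.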
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
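Here is a small runnable sketch of that capture-and-reinsert replacement, under the assumption that the input contains no comments or CDATA sections. EscapeLooseBrackets and Escape are hypothetical names, and the sample string is a shortened, well-formed variant of the one in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeLooseBrackets
{
    // \G anchors each match where the previous one ended, the atomic group
    // consumes whole tags and plain text, and the final '>' is a stray bracket.
    public static string Escape(string html)
    {
        return Regex.Replace(html, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
    }

    public static void Main()
    {
        string html = "<div class=\"quotedReply\">> Ok, got it.<br/>><br/></div>";
        Console.WriteLine(Escape(html));
        // prints: <div class="quotedReply">&gt; Ok, got it.<br/>&gt;<br/></div>
    }
}
```

Everything captured in group 1 is written back unchanged, so only the stray brackets are rewritten.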
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside the content of an HTML tag (like the element's inner text), or in the attribute list of an HTML tag?
If you just want to sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char and replace it with &gt;. (I'd also do the same with the < character), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this RegEx. It captures the name of any HTML tag in group 1 and the text between the tags in group 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>