Find and replace text inside xml document using regular expression - c#

I am using a C# console app to load an XML document. Once the XmlDocument is loaded, I want to search for a specific href tag:
href="/abc/def
inside the XML document.
Once that node is found, I want to strip the tag completely and just show Hello.
Hello
I think I can simply find the tag using a regex. Can anyone tell me how to remove the href tag completely using a regex?

XML and HTML are the same difference: tagged content. XML is just stricter in its formatting.
For this use case I would use transformations and XPath queries to rebuild the document. As @Yahia stated, regex on tagged documents is typically a bad idea. The regex needed for parsing is far too complex to be effective as a generic solution.

The most popular technology for similar tasks is called XPath. (It is also a key component of XQuery and XSLT.) Would the following perhaps solve your task, too?
root.SelectSingleNode("//a[@href='/abc/def']").InnerText = "Hello";

You could try
string x = @"<?xml version='1.0'?>
<EXAMPLE>
<a href='/abc/def'>Hello</a>
</EXAMPLE>";
XmlDocument doc = new XmlDocument(); // requires: using System.Xml;
doc.LoadXml(x);
// Find the anchor by its href attribute (XPath attribute names are prefixed with @)
XmlNode n = doc.SelectSingleNode("//a[@href='/abc/def']");
XmlNode p = n.ParentNode;
p.RemoveChild(n);
// Append a fresh <a> element that no longer carries the href attribute
XmlNode newNode = doc.CreateNode("element", "a", "");
newNode.InnerXml = "Hello";
p.AppendChild(newNode);
Not really sure if this is what you are trying to do, but it should be enough to get you headed in the right direction.
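If the goal is to drop the <a> element entirely and keep only its text, one possible variation is to swap the element for a plain text node. This is a minimal sketch, assuming the document shape from the answer above; the class name Program is only for illustration.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<EXAMPLE><a href='/abc/def'>Hello</a></EXAMPLE>");

        // Locate the anchor by its href attribute.
        XmlNode anchor = doc.SelectSingleNode("//a[@href='/abc/def']");
        if (anchor != null)
        {
            // Replace the whole element with a text node holding its inner text.
            XmlText text = doc.CreateTextNode(anchor.InnerText);
            anchor.ParentNode.ReplaceChild(text, anchor);
        }

        Console.WriteLine(doc.OuterXml); // prints <EXAMPLE>Hello</EXAMPLE>
    }
}
```

ReplaceChild keeps the text at the position the element occupied, whereas RemoveChild followed by AppendChild moves it to the end of the parent.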

Related

C# Load XML with special characters inside node

We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
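A custom string replacement along those lines might look like the sketch below. It rests on several assumptions: the element is always written exactly as <SomeNode> with no attributes, it holds only text, its content never contains the literal </SomeNode>, and nothing inside it is already escaped. The input is a shortened version of the string from the question.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;
using System.Xml;

class Program
{
    static void Main()
    {
        // Shortened sample of the payload from the question.
        string xml = @"<SomeNode>10040:<->10110:<->10330:Value=><->10418:No
</SomeNode>";

        // Escape only the text between the opening and closing SomeNode tags.
        string escaped = Regex.Replace(
            xml,
            @"(?<=<SomeNode>).*?(?=</SomeNode>)",
            m => SecurityElement.Escape(m.Value),
            RegexOptions.Singleline);

        var doc = new XmlDocument();
        doc.LoadXml(escaped); // no longer throws XmlException

        // InnerText returns the original, unescaped content.
        Console.WriteLine(doc.DocumentElement.InnerText);
    }
}
```

SecurityElement.Escape replaces <, >, &, and both quote characters with their XML entities, so content that was already escaped would end up double-escaped.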

Parsing HTML string using C#

I have a string with html text as shown below.
string htmlText = "<h1>This is heading 1</h1><p>This is some text.</p>" +
    "<hr><h2>This is heading 2</h2><p>This is some other text.</p><hr>";
Can we convert this HTML string into the text we would see in a browser after it has been parsed, so that we can later use the parsed string wherever required?
Later I want to copy this data to a SharePoint list multiline rich text column. There I don't need these tags to come through, but
This answer provides an example using HtmlAgilityPack, which is much more robust than rolling your own parsing or regular expressions.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}
Your question isn't entirely clear and cuts off at the end. But you can actually parse the data if you want. Just examine each character to find the tags using string indexes (e.g. htmlText[i]).
If you need something a little more robust, use HtmlMonkey or HtmlAgilityPack to parse it for you.
The best way is to use a regular expression to extract the inner text between some of the HTML tags. Something like this might work:
((.+?)</h.?>)+((.+?)</p.?>)

Find HTML / XML node using RegEx

I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:
var parser = new HtmlParser();
var source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postode EC1A 4NP</body></html>";
var document = parser.Parse(source);
Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.
Edit
I don't know in advance the doc structure or even the tag that the postcode will be within which is why I'm using regex. Once I have the result I am planning on traversing the dom to obtain the rest of the address so I don't just want to treat the doc as a string
If you are looking to extract a particular node within a well-formed HTML/XML document then have a look at utilising XPath. There's some examples here on MSDN
You can use utilities libraries such as HTML Tidy to "clean-up" the html and make it well formed if it isn't already.
OK, I took a different approach in the end. I searched the HTML doc as a string with the regex, not to parse the HTML but simply to find the exact matching value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP and the following XPath:
//*[contains(text(),'EC1A 4NP')]
returns the required node. For XPath ease, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing.
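The two-step approach (regex on the raw string, then XPath for the node) can be sketched as follows. To stay within the standard library, this sketch swaps HtmlAgilityPack for System.Xml.XmlDocument, which only works on well-formed markup, so the sample source here closes its <p> tag. The predicate text()[contains(., ...)] is a slight variation on the XPath above that checks every direct text node of an element rather than only the first.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

class Program
{
    static void Main()
    {
        // Well-formed variant of the sample document (assumption for XmlDocument).
        string source = "<html><body><h1>Some example source</h1>" +
                        "<p>Example postcode EC1A 4NP</p></body></html>";

        // Step 1: find the exact postcode value in the raw string.
        var searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
        Match match = searchTerm.Match(source);
        if (!match.Success) return;

        // Step 2: select only elements whose own text nodes contain that value.
        var doc = new XmlDocument();
        doc.LoadXml(source);
        XmlNodeList nodes = doc.SelectNodes(
            "//*[text()[contains(., '" + match.Value + "')]]");

        foreach (XmlNode node in nodes)
            Console.WriteLine(node.Name); // prints p
    }
}
```

Building the XPath by string concatenation is acceptable here only because a postcode cannot contain quote characters.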
I've had a quick look at the parser's documentation. Below is what you need to do if you want to check only the text in <p> tags.
var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

How can I put the <!CDATA> in a XML tag

I'm trying to put <!CDATA> in a specific tag in my XML file, but the result is &lt;![CDATA[mystring]]&gt;
Can someone help me?
The encoding
XmlProcessingInstruction pi = doc.CreateProcessingInstruction("xml", "version=\"1.0\" encoding=\"utf-8\"");
How I'm doing it
texto.InnerText = "<![CDATA[" + elemento.TextoComplementar.ToString() + "]]>";
XmlNode xnode = xdoc.SelectSingleNode("entry/entry_status");
XmlCDataSection CData;
InnerText performs whatever escaping is required.
xnode.InnerText = "Hi, How are you..??";
If you want to work with a CDATA node, then:
CData = xdoc.CreateCDataSection("Hi, How are you..??");
xnode.AppendChild(CData);
You haven't explained how you are creating the XML - but it looks like it's via XmlDocument.
You can therefore use CreateCDataSection.
You create the CData node first, supplying the text to go in it, and then add it as a child to an XmlElement.
You should probably consider Linq to XML for working with XML - in my most humble of opinions, it has a much more natural API for creating XML, doing away with the XML DOM model in favour of one which allows you to create whole document trees inline. This, for example, is how you'd create an element with an attribute and a cdata section:
var node = new XElement("root",
    new XAttribute("attribute", "value"),
    new XCData("5 is indeed > 4 & 3 < 4"));
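For comparison, the XmlDocument route described earlier (create the CDATA node, then add it as a child of an element) could look roughly like this. The element name entry is borrowed from the other answer and is only illustrative.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        XmlElement entry = doc.CreateElement("entry");
        doc.AppendChild(entry);

        // Create the CDATA node first, then attach it to the element.
        XmlCDataSection cdata = doc.CreateCDataSection("5 is indeed > 4 & 3 < 4");
        entry.AppendChild(cdata);

        Console.WriteLine(doc.OuterXml);
        // prints <entry><![CDATA[5 is indeed > 4 & 3 < 4]]></entry>
    }
}
```

Assigning the same string to InnerText instead would escape the markup characters rather than wrap them in a CDATA section.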

Replace xml tag with regex

How can I replace a certain part of an XML file with a defined string?
<tag1></tag1>
<tag2></tag2>
...etc
<soundcard num=0>
<name>test123</name>
</soundcard>
<soundcard num=1>
<name>test123</name>
</soundcard>
<soundcard num=2>
<name>test123</name>
</soundcard>
<tag5></tag5>
Replace all soundcard parts so that the result looks like this:
<tag1></tag1>
<tag2></tag2>
...etc
{0}
<tag5></tag5>
I'm using C# .NET 3.5 and I thought of a regex solution.
If it has to be a regex, your XML file is well-formed, and you know (say, from the DTD) that <soundcard> tags can't be nested, then you can use
(<soundcard.*?</soundcard>\s*)+
and replace all with {0}.
In C#:
resultString = Regex.Replace(subjectString, @"(<soundcard.*?</soundcard>\s*)+", "{0}", RegexOptions.Singleline);
For a quick-and-dirty fix to a one-off problem, I think that's OK. It's not OK to think of regex as the proper tool to handle XML in general.
Personally I would use Linq to XML to remove the elements and replace them with a text node.
Update Apr 16/2010 4:40PM MST
Here's an example of Linq to XML, I'm a bit rusty but it should at least give you an idea of how this is done.
// requires: using System.Linq; using System.Xml.Linq;
XElement root = XElement.Load("myxml.xml");
// Materialize the query so that removing nodes does not disturb the enumeration
var soundcards = (from el in root.Elements() where el.Name == "soundcard" select el).ToList();
var prev_node = soundcards.First().PreviousNode;
// Remove Nodes
foreach(XElement card in soundcards)
    card.Remove();
// Build your content here into a variable called newChild
prev_node.AddAfterSelf(newChild);
My suggestion would be to use an XSLT transformation to replace the tags you want to replace with a known tag, say , and then String.Replace('', '{0}');.
I echo what Johannes said: do NOT try to build REs to do this. As your XML gets more complex, your error rate will increase.
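As a rough end-to-end illustration of the Linq to XML route, the sketch below puts a {0} placeholder where the first soundcard element was and removes all of them. It assumes the fragment from the question is wrapped in a single root element and that attribute values are quoted, since the fragment as posted is not well-formed XML. The element name root is hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Hypothetical well-formed version of the question's fragment.
        XElement root = XElement.Parse(
            "<root><tag1></tag1>" +
            "<soundcard num='0'><name>test123</name></soundcard>" +
            "<soundcard num='1'><name>test123</name></soundcard>" +
            "<tag5></tag5></root>");

        // Copy the matches to a list so removal does not disturb enumeration.
        var soundcards = root.Elements("soundcard").ToList();
        if (soundcards.Count > 0)
        {
            // Put the placeholder where the first soundcard was.
            soundcards[0].AddBeforeSelf(new XText("{0}"));
            foreach (XElement card in soundcards)
                card.Remove();
        }

        Console.WriteLine(root.ToString(SaveOptions.DisableFormatting));
    }
}
```

Inserting before the first match avoids depending on a previous sibling, which would be null if a soundcard element came first.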
