I am attempting to learn regex by using it to edit some scripts I have.
My scripts contain text like this:
<person name="John">Will be out of town</person><person name="Julie">Will be in town.</person>
I need to replace the name values in the script - the addition to the name is always the same, but I might have names that I don't want to update.
Quick example of what I have:
string[] names = new string[2];
names[0] = "John-Example";
names[1] = "Paul-Example";
string ToFix = "<person name=\"John\">Will be out of town</person><person name=\"Julie\">Will be in town.</person>";
for (int i=0; i<names.Length; i++)
{
string Name = names[i];
ToFix = Regex.Replace(ToFix, "(<.*name=\")(" + Name.Replace("-Example", "") + ".*)(\".*>)", "$1" + Name + "$3", RegexOptions.IgnoreCase);
}
This works for the most part, but I have two problems with it. Sometimes it removes too much: if I have multiple persons in the string, it removes everything between the first person and the last person, like so:
Hello <person name="John">This is John</person><person name="Paul">This is Paul</person>
becomes
Hello <person name="John-Example">This is Paul</person>
Also, I would like to remove any extra text after the name value and before the closing angle bracket, so that:
<person name="John" hello>
Should be corrected to:
<person name="John-Example">
I have read several articles on regex and feel that I am just missing something small here. How and why would I go about fixing this?
EDIT: I don't think these scripts that I am working with classify as XML - the entire script may or may not have <> tags. Back to my original goal with this question, can someone explain the behavior of the regex? And how would I remove extra text after the name value before the closing tag?
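Your regex is too greedy: .* matches as much as it can, so with several persons in the string it runs from the first tag all the way to the last one. Try .*? rather than just .*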
Also, please don't use regex to parse XML.
Here's an example of how to do what I think you want, using XDocument:
var xdoc = XDocument.Parse(ToFix);
foreach (var person in xdoc.Descendants("person"))
{
var name = person.Attribute("name");
if (person.LastAttribute != name)
{
person.RemoveAttributes();
person.SetAttributeValue(name.Name, name.Value + "-Example");
}
}
var output = xdoc.ToString();
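For input that is not well-formed XML, the greedy-versus-lazy point can also be shown with a regex-only sketch. It assumes attribute values never contain < or >; NameFixer and AddSuffix are names made up for this illustration. Using [^<>]* in place of .* keeps each match inside a single tag, and it also drops any extra text after the name value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameFixer
{
    // Hypothetical helper: rewrites name="John" to name="John-Example" for each listed name.
    public static string AddSuffix(string text, string[] names)
    {
        foreach (string name in names)
        {
            string baseName = Regex.Escape(name.Replace("-Example", ""));
            // [^<>]* cannot cross a tag boundary, so a match never spans two tags.
            string pattern = "(<[^<>]*?name=\")" + baseName + "\"[^<>]*>";
            // Keep group 1, write the full name, then close the tag (trailing text is dropped).
            text = Regex.Replace(
                text,
                pattern,
                m => m.Groups[1].Value + name + "\">",
                RegexOptions.IgnoreCase);
        }
        return text;
    }

    public static void Main()
    {
        string toFix = "Hello <person name=\"John\" hello>This is John</person>"
                     + "<person name=\"Paul\">This is Paul</person>"
                     + "<person name=\"Julie\">This is Julie</person>";
        Console.WriteLine(AddSuffix(toFix, new[] { "John-Example", "Paul-Example" }));
    }
}
```

Names that are not in the list (Julie here) are left untouched, and running the replacement twice does not append the suffix a second time because the pattern requires the closing quote right after the base name.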
I've been working on trying to get this string split in a couple different places which I managed to get to work, except if the name had a forward-slash in it, it would throw all of the groups off completely.
The string:
123.45.678.90:00000/98765432109876541/[CLAN]PlayerName joined [windows/12345678901234567]
I essentially need the following:
IP group: 123.45.678.90:00000 (without the following /)
id group: 98765432109876541
name group: [CLAN]PlayerName
id1 group: 12345678901234567
The text "joined" also has to be there. However windows does not.
Here is what I have so far:
(?<ip>.*)\/(?<id>.*)\/(.*\/)?(?<name1>.*)( joined.*)\[(.*\/)?(?<id1>.*)\]
This works like a charm unless the player name contains a "/". How would I go about escaping that?
Any help with this would be much appreciated!
Since you tagged your question with C# and Regex, not only Regex, I will propose an alternative. I am not sure whether it will be more efficient, but I find it easier to read and to debug if you simply use String.Split():
public void Main()
{
string input = "123.45.678.90:00000/98765432109876541/[CLAN]Player/Na/me joined [windows/12345678901234567]";
// we want "123.45.678.90:00000/98765432109876541/[CLAN]Player/Na/me joined" and "12345678901234567]"
// Also, you can remove " joined" by adding it before " [windows/"
var content = input.Split(new string[]{" [windows/"}, StringSplitOptions.None);
// we want ip + groupId + everything else
var tab = content[0].Split('/');
var ip = tab[0];
var groupId = tab[1];
var groupName = String.Join("/", tab.Skip(2)); // merge everything else. We use Linq to skip ip and groupId
var groupId1 = RemoveLast(content[1]); // cut the trailing ']'
Console.WriteLine(groupName);
}
private static string RemoveLast(string s)
{
return s.Remove(s.Length - 1);
}
Output:
[CLAN]Player/Na/me joined
If you are using a class for ip, groupId, etc., and I guess you are, just put all of this in a constructor that accepts a string as its parameter.
You shouldn't use greedy quantifiers (*) with an open-ended character such as the dot (.). It won't work as intended and will result in a lot of backtracking.
This is slightly more efficient, but not overly strict:
^(?<ip>[^\/\n]+)\/(?<id>[^\/]+)\/(?<name1>\S+)\D+(?<id1>\d+)]$
You basically need to use non-greedy (lazy) quantifiers (*?). Try this:
(?<ip>.*?)\/(?<id>.*?)\/(?<name1>.*?)( joined )\[(.*?\/)?(?<id1>.*?)\]
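As a rough sketch of how these suggestions combine, the pattern below uses negated character classes for the two slash-free fields and a lazy quantifier for the name, so the name may contain slashes and ends at the literal " joined [". The group names follow the question; the JoinLineParser class and Parse method are invented for this example, and the pattern assumes the name itself never contains " joined [".

```csharp
using System;
using System.Text.RegularExpressions;

public static class JoinLineParser
{
    // ip and id stop at the first slash; name is lazy and ends at " joined [".
    // The optional (?:[^/\]]*/)? part skips a platform prefix such as "windows/".
    static readonly Regex Pattern = new Regex(
        @"^(?<ip>[^/]+)/(?<id>[^/]+)/(?<name>.+?) joined \[(?:[^/\]]*/)?(?<id1>[^\]]+)\]$");

    public static Match Parse(string line)
    {
        return Pattern.Match(line);
    }

    public static void Main()
    {
        Match m = Parse("123.45.678.90:00000/98765432109876541/[CLAN]Player/Na/me joined [windows/12345678901234567]");
        if (m.Success)
        {
            Console.WriteLine(m.Groups["ip"].Value);
            Console.WriteLine(m.Groups["id"].Value);
            Console.WriteLine(m.Groups["name"].Value);
            Console.WriteLine(m.Groups["id1"].Value);
        }
    }
}
```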
I've really just started using regex and can only do basic things.
I want to convert hyphens in the element tags of an XML file to underscores. I already have a C# app that reads a config file with find-and-replace elements and does some other cleaning work with regex, but I can't figure this one out.
Currently it goes in as
< convert-these-here > but-not-these < / convert-these-here >
And I want it spat out as
< convert_these_here > but-not-these < / convert_these_here >
The C# script just reads the file line by line; it doesn't treat it as an XML file.
So basically I thought I just need a pattern that looks for any and all hyphens BETWEEN a < and a >.
Thanks
Ditch the regex. Parse your XML, and fix it. Using the XDocument class makes this really easy.
Say we start with the following XML document:
<this-is>
<an-xml>
<doc but-I="remain-untouched">look-at-me</doc>
</an-xml>
</this-is>
We can load it into an XDocument and fix up the element names.
var str = "<this-is><an-xml><doc but-I=\"remain-untouched\">look-at-me</doc></an-xml></this-is>";
var xdoc = XDocument.Parse(str);
foreach(var el in xdoc.Descendants())
{
var name = el.Name.LocalName;
name = name.Replace("-", "_");
el.Name = el.Name.Namespace + name;
}
var fixedXmlString = xdoc.ToString();
Now this gives us the following:
<this_is>
<an_xml>
<doc but-I="remain-untouched">look-at-me</doc>
</an_xml>
</this_is>
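If the file really has to be processed line by line, as the question describes, a regex with a match evaluator is one possible fallback. This is only a sketch: TagNameFixer and HyphensToUnderscores are hypothetical names, and the pattern matches just the opening of a tag (the < or </ plus the element name), so attributes and text content are left alone. It does not understand CDATA sections, so a parser remains the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameFixer
{
    // Matches "<" or "</" (with optional spaces) followed by the element name only.
    static readonly Regex TagName = new Regex(@"<\s*/?\s*[\w.:-]+");

    public static string HyphensToUnderscores(string line)
    {
        // Replace hyphens inside the matched tag name; the rest of the line is untouched.
        return TagName.Replace(line, m => m.Value.Replace('-', '_'));
    }

    public static void Main()
    {
        Console.WriteLine(HyphensToUnderscores(
            "< convert-these-here > but-not-these < / convert-these-here >"));
        Console.WriteLine(HyphensToUnderscores(
            "<doc but-I=\"remain-untouched\">look-at-me</doc>"));
    }
}
```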
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML entities such as &rdquo; and &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
// Rejoin with the same separator used by Split so the spaces between words survive.
var newInnerHtml = String.Join(" ", surroundedWords);
node.InnerHtml = newInnerHtml;
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
}
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
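A small sketch of fixes 2) and 3) on plain text is shown below. It deliberately leaves out the tag-aware lookarounds from the original pattern and the entity handling from fix 1), and it puts the number alternative first so that "$2,500" is tried before the plain-word alternative. WordMatcher and Words are names invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Alternatives, in order: currency/number, word with inner ' or - (repeated), plain word.
    const string Pattern = @"\$?(?:\d+[,.])+\d+|(?:\w+['-])+\w+|\w+";

    public static string[] Words(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string word in Words("The one-legged-man paid $2,500 today"))
        {
            Console.WriteLine(word);
        }
    }
}
```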
So I have an XML document I'm trying to import using XmlTextReader in C#, and my code works well except for one case: when the tag is not on the same line as the actual text/content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple others that made less sense, but none of them seem to be able to consistently deal with this issue. Obviously I could fix this one case, but then it would break the regular cases. So my question is, would there be a way to have it after I find the "product_name" tag to go to the next line that contains text and extract it?
I should have mentioned, I am outputting it to an HTML table after and the element is coming up blank so I'm fairly certain it is not reading it correctly.
Thanks in advance!
I think you will find Linq To Xml easier to use
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);
It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. It should probably be. If possible correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If White space is important, use CDATA -->
<product_name><![CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still included the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\r = carriage return (0x0D / 13)
\n = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)
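For comparison, the built-in String.Trim() also strips leading and trailing whitespace (including \r, \n, tabs and spaces), so the same clean-up can be sketched without a regex. The example below combines it with the Linq to Xml approach mentioned earlier; ProductReader and ReadName are hypothetical names, and the XML is the product sample from the question.

```csharp
using System;
using System.Xml.Linq;

public static class ProductReader
{
    // Reads <product_name> from a <product> document and trims surrounding whitespace.
    public static string ReadName(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        string raw = (string)doc.Root.Element("product_name");
        return raw.Trim();
    }

    public static void Main()
    {
        string xml = "<product>\n"
                   + "  <sku>27939</sku>\n"
                   + "  <product_name>\n"
                   + "    Sof-Therm Warm-Up Jacket\n"
                   + "  </product_name>\n"
                   + "  <supplier_number>ALNN1064</supplier_number>\n"
                   + "</product>";
        Console.WriteLine(ReadName(xml));
    }
}
```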
I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"
I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace each one with the result of the code run inside it. I just need help regexing the tag/code string, not parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
string code = abcTagMatch.Groups["code"].Value;
string result = process(code);
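// Substring takes (start, length), so copy only the text between the previous tag and this one.
newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos));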
newSource.Append(result);
curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\s*\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling you need to go beyond regex I believe.
I designed a template engine using Antlr but that's way more complex ;)
exp = new Regex(@"<\?abc print '(.+?)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes how you see fit
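The manual StringBuilder loop shown earlier can also be expressed with Regex.Replace and a match evaluator, which handles the position bookkeeping itself. This is a sketch only: TagExpander, Expand and Process are names invented here, and Process is a stand-in that understands nothing except print 'text'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagExpander
{
    // Lazy .*? stops at the first "?>", so several tags on one line are handled separately.
    static readonly Regex Tag = new Regex(
        @"<\?abc\s+(?<code>.*?)\s*\?>", RegexOptions.Singleline);

    // Hypothetical stand-in for a real interpreter: supports only  print 'text'.
    static string Process(string code)
    {
        Match m = Regex.Match(code, @"^print\s+'(?<text>[^']*)'$");
        return m.Success ? m.Groups["text"].Value : "";
    }

    public static string Expand(string source)
    {
        // Each tag is replaced by the result of processing its code.
        return Tag.Replace(source, m => Process(m.Groups["code"].Value));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("<b><?abc print 'test' ?></b>"));
    }
}
```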