I have a string that is basically an XML node, and I need to extract the values of the attributes. I am trying to use the following C# code to accomplish this:
string line = "<log description="Reset Controls - MFB - SkipSegment = True" start="09/13/2011 10:29:58" end="09/13/2011 10:29:58" timeMS="0" serviceCalls="0">"
string pattern = "\"[\\w ]*\"";
Regex r = new Regex(pattern);
foreach (Match m in Regex.Matches(line, pattern))
{
MessageBox.Show(m.Value.Substring(1, m.Value.Length - 2));
}
The problem is that this is only returning the last occurrence from the string ("0" in the above example), when each string contains 5 occurrences. How do I get every occurrence using C#?
Don't try to parse XML with regular expressions. Use an XML API instead. It's simply a really bad idea to try to hack together "just enough of an XML parser" - you'll end up with incredibly fragile code.
Now your line isn't actually a valid XML element at the moment - but if you add a </log> it will be.
XElement element = XElement.Parse(line + "</log>");
foreach (XAttribute attribute in element.Attributes())
{
Console.WriteLine("{0} = {1}", attribute.Name, attribute.Value);
}
That's slightly hacky, but it's better than trying to fake XML parsing yourself.
As an aside, you need to escape your string for double-quotes and add a semi-colon:
string line = "<log description=\"Reset Controls - MFB - SkipSegment = True\" start=\"09/13/2011 10:29:58\" end=\"09/13/2011 10:29:58\" timeMS=\"0\" serviceCalls=\"0\">";
To actually answer your question, your pattern should probably be "\"[^\"]*\""because \w won't match spaces, symbols, etc.
Related
I would like to replace words in a html string with another word, but it must only replace the exact word and not if it is part of the spelling of part of a word. The problem that I am having is that the html open or closing tags or other html elements are affecting what words are matched in the regex or it is replacing parts of words.
PostTxt = “<div>The <b>cat</b> sat on the mat, what a catastrophe.
The <span>cat</span> is not allowed on the mat. This makes things complicated; the cat  must go!
</div><p>cat cat cat</p>”;
string pattern = "cat";
//replacement string to use
string replacement = "******";
//Replace words
PostTxt = Regex.Replace(PostTxt, pattern, replacement, RegexOptions.IgnoreCase);
}
I would like it to return.
<div>The <b>***</b> sat on the mat, what a catastrophe. The <span>***</span> is not allowed on the mat. This makes things complicated; the ***  must go! </div><p>*** *** ***</p>
Any suggestions and help will be greatly appreciated.
This is the simplified solution of the code I implemented using html-agility-pack.net. Regex is not a solution to this problem as noted by See: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms. – Olivier Jacot-Descombes
PostTxt = "<div>The <b>cat</b> sat on the mat, what a catastrophe.
The <span>cat</span> is not allowed on the mat. This makes things complicated; the cat must go!
</div><p>Cat cat cat</p>";
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.LoadHtml(PostTxt);
//replacement string to use
string replacement = “*****”;
string pattern = #"\b" + Regex.Escape("cat") + #"\b";
var nodes = mainDoc.DocumentNode.SelectNodes("//*") ?? new HtmlNodeCollection(null);
foreach (var node in nodes)
{
node.InnerHtml = Regex.Replace(node.InnerHtml, pattern, replacement, RegexOptions.IgnoreCase);
}
PostTxt = mainDoc.DocumentNode.OuterHtml;
I have a string, for example
<#String1#> + <#String2#> , <#String3#> --<#String4#>
And I want to use regex/string manipulation to get the following result:
<#String1#>,<#String2#>,<#String3#>,<#String4#>
I don't really have any experience doing this, any tips?
There are multiple ways to do something like this, and it depends on exactly what you need. However, if you want to use a single regex operation to do it, and you only want to fix stuff that comes between the bracketed strings, then you could do this:
string input = "<#String1#> + <#String2#> , <#String3#> --<#String4#>";
string pattern = "(?<=>)[^<>]+(?=<)";
string replacement = ",";
string result = Regex.Replace(input, pattern, replacement);
The pattern uses [^<>]+ to match any non-pointy-bracket characters, but it combines it with a look-behind statement ((?<=>)) and a look-ahead statement (?=<) to make sure that it only matches text that occurs between a closing and another opening set of brackets.
If you need to remove text that comes before the first < or after the last >, or if you find the look-around statements confusing, you may want to consider simply matching the text that comes between the brackets and then loop through all the matches and build a new string yourself, rather than using the RegEx.Replace method. For instance:
string input = "sdfg<#String1#> + <#String2#> , <#String3#> --<#String4#>ag";
string pattern = #"<[^<>]+>";
List<String> values = new List<string>();
foreach (Match m in Regex.Matches(input, pattern))
values.Add(m.Value);
string result = String.Join(",", values);
Or, the same thing using LINQ:
string input = "sdfg<#String1#> + <#String2#> , <#String3#> --<#String4#>ag";
string pattern = #"<[^<>]+>";
string result = String.Join(",", Regex.Matches(input, pattern).Cast<Match>().Select(x => x.Value));
If you're just after string manipulation and don't necessarily need a regex, you could simply use the string.Replace method.
yourString = yourString.Replace("#> + <#", "#>,<#");
I'm trying to get some text from a large text file, the text I'm looking for is:
Type:Production
Color:Red
I pass the whole text in the following method to get (Type:Production , Color:Red)
private static void FindKeys(IEnumerable<string> keywords, string source)
{
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, #"(?<key>" + #"\B\s" + keys + #"\B\s" + "):",
RegexOptions.Singleline);
foreach (Match m in matches)
{
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found)
{
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
}
My problems are the following:
The search returns _Type: as well, where I only need Type:
The search return Color:Red/n/n/n/n/n (with the rest of the text, where I only need Color:Red
So, basically:
- How can I force Regex to get the exact match for Type and ignore _Type
- How to get only the text after : and ignore /n/n/ and any other text
I hope this is clear
Thanks,
Your regex currently looks like this:
(?<key>\B\sWord1|Word2|Word3\B\s):
I see the following issues here:
First, Word1|Word2|Word3 should be put in parenthesis. Otherwise, it will search for \B\sWord1 or Word2 or Word3\B\s, which is not what you want (I guess).
Why \B\s? A non-boundary followed by a whitespace? That doesn't make sense. I guess you want just \b (= word boundary). There's no need to use it in the end, because the colon already constitutes a word boundary.
So, I would suggest to use the following. It will fix the _Type problem, because there is no word boundary between _ and Type (since _ is considered to be a word character).
\b(?<key>Word1|Word2|Word3):
If the text following the key is always just a single word, I'd match it in the regex as well: (\s* allows for whitespace after the colon, I don't know if you need this. \w+ ensures that only word characters -- i.e. no line breaks etc. -- are matched as the value.)
\b(?<key>Word1|Word2|Word3):\s*(?<value>\w+)
Then you just need to iterate through all the matches and extract the key and value groups. No need for any string operations or index arithmetic.
So if I understand correctly, you have:
Pairs of key:values
Each pair is separated by a space
Within each pair, the key and value is separated by “:”
Then I would not use regex at all. I would:
use String.Split(' ') to get an array of pairs
loop over all the pairs
use String.Split(':') to get the key and value from each pair
I need a regular expression to basically get the first part of a string, before the first slash ().
For example in the following:
C:\MyFolder\MyFile.zip
The part I need is "C:"
Another example:
somebucketname\MyFolder\MyFile.zip
I would need "somebucketname"
I also need a regular expression to retrieve the "right hand" part of it, so everything after the first slash (excluding the slash.)
For example
somebucketname\MyFolder\MyFile.zip
would return
MyFolder\MyFile.zip.
You don't need a regular expression (it would incur too much overhead for a simple problem like this), try this instead:
yourString = yourString.Substring(0, yourString.IndexOf('\\'));
And for finding everything after the first slash you can do this:
yourString = yourString.Substring(yourString.IndexOf('\\') + 1);
This problem can be handled quite cleanly with the .NET regular expression engine. What makes .NET regular expressions really nice is the ability to use named group captures.
Using a named group capture allows you to define a name for each part of regular expression you are interested in “capturing” that you can reference later to get at its value. The syntax for the group capture is "(?xxSome Regex Expressionxx). Remember also to include the System.Text.RegularExpressions import statement when using regular expression in your project.
Enjoy!
//Regular expression
string _regex = #"(?<first_part>[a-zA-Z:0-9]+)\\{1}(?<second_part>(.)+)";
//Example 1
{
Match match = Regex.Match(#"C:\MyFolder\MyFile.zip", _regex, RegexOptions.IgnoreCase);
string firstPart = match.Groups["first_part"].Captures[0].Value;
string secondPart = match.Groups["second_part"].Captures[0].Value;
}
//Example 2
{
Match match = Regex.Match(#"somebucketname\MyFolder\MyFile.zip", _regex, RegexOptions.IgnoreCase);
string firstPart = match.Groups["first_part"].Captures[0].Value;
string secondPart = match.Groups["second_part"].Captures[0].Value;
}
You are aware that .NET's file handling classes do this a lot more elegantly, right?
For example in your last example, you could do:
FileInfo fi = new FileInfo(#"somebucketname\MyFolder\MyFile.zip");
string nameOnly = fi.Name;
The first example you could do:
FileInfo fi = new FileInfo(#"C:\MyFolder\MyFile.zip");
string driveOnly = fi.Root.Name.Replace(#"\", "");
This matches all non \ chars
[^\\]*
Here is the regular expression solution using the "greedy" operator '?'...
var pattern = "^.*?\\\\";
var m = Regex.Match("c:\\test\\gimmick.txt", pattern);
MessageBox.Show(m.Captures[0].Value);
Split on slash, then get first item
words = s.Split('\\');
words[0]
I have a string like
A150[ff;1];A160;A100;D10;B10'
in which I want to extract A150, D10, B10
In between these valid string, i can have any characters. The one part that is consistent is the semicolumn between each legitimate strings.
Again the junk character that I am trying to remove itself can contain the semi column
Without having more detail for the specific rules, it looks like you want to use String.Split(';') and then construct a regex to parse out the string you really need foreach string in your newly created collection. Since you said that the semi colon can appear in the "junk" it's irrelevant since it won't match your regex.
var input = "A150[ff+1];A160;A150[ff-1]";
var temp = new List<string>();
foreach (var s in input.Split(';'))
{
temp.Add(Regex.Replace(s, "(A[0-9]*)\\[*.*", "$1"));
}
foreach (var s1 in temp.Distinct())
{
Console.WriteLine(s1);
}
produces the output
A150
A160
First,you should use
string s="A150[ff;1];A160;A100;D10;B1";
s.IndexOf("A160");
Through this command you can get the index of A160 and other words.
And then s.Remove(index,count).
If you only want to remove the 'junk' inbetween the '[' and ']' characters you can use regex for that
Regex regex = new Regex(#"\[([^\}]+)\]");
string result = regex.Replace("A150[ff;1];A160;A100;D10;B10", "");
Then String.Split to get the individual items