Is there a way to selectively replace XElement content with other XElements?
I have this XML:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
And I want to render it as this:
<prompt>
There is something I want to tell you.<break time="3s"/>
You are my favorite caller today.<break time="1s"/>
Have a great day!
</prompt>
I need to replace the placeholders with actual XElements, but when I try to alter the content of an XElement, .NET of course escapes all of the angle brackets. I understand why the content would normally need to be correctly escaped, but I need to bypass that behavior and inject XML directly into content.
Here's my code that would otherwise work.
MatchCollection matches = Regex.Matches(content, @"\[(\w+)=(\d+)]");
foreach (XElement element in voiceXmlDocument.Descendants("prompt"))
{
if (matches[0] == null)
continue;
element.Value = element.Value.Replace(matches[0].Value, @"<break time=""5s""/>");
}
This is a work in progress, so don't worry so much about the validity of the RegEx pattern, as I will work that out later to match several conditions. This is proof of concept code and the focus is on replacing the placeholders as described. I only included the iteration and RegEx code here to illustrate that I need to be able to do this to a whole document that is already populated with content.
You can use XElement.Parse() method:
First, get the outer xml of your XElement, for example,
string outerXml = element.ToString();
Then you have exactly this string to work with:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
Then you can do your replacement
outerXml = outerXml.Replace(matches[0].Value, @"<break time=""5s""/>");
Then you can parse it back:
XElement repElement = XElement.Parse(outerXml);
And, finally, replace original XElement:
element.ReplaceWith(repElement);
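The steps above can be condensed with Regex.Replace, so that each placeholder keeps its own number instead of a hard-coded 5s. This is only a sketch: the class and method names are made up for illustration, and the pattern assumes placeholders always look like [pause=N].

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class PausePlaceholders
{
    // Hypothetical helper (not from the original answer): returns a copy of the
    // element in which every [pause=N] placeholder has become <break time="Ns"/>.
    public static XElement Expand(XElement element)
    {
        // Serialize the element, markup included, so the inserted tag is parsed as XML later.
        string outerXml = element.ToString();

        // ${1} inserts the digits captured by (\d+) in the pattern.
        string replaced = Regex.Replace(outerXml, @"\[pause=(\d+)\]", "<break time=\"${1}s\"/>");

        // Parsing turns the injected text back into real XElement nodes.
        return XElement.Parse(replaced);
    }
}
```

Usage would be element.ReplaceWith(PausePlaceholders.Expand(element)), iterating over a ToArray() copy of the descendants so that replacing elements does not disturb the enumeration.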
The key to all of this is the XText, which allows you to work with text as an element.
This is the loop:
foreach (XElement prompt in voiceXmlDocument.Descendants("prompt"))
{
string text = prompt.Value;
prompt.RemoveAll();
foreach (string phrase in text.Split('['))
{
string[] parts = phrase.Split(']');
if (parts.Length > 1)
{
string[] pause = parts[0].Split('=');
prompt.Add(new XElement("break", new XAttribute("time", pause[1])));
// add a + "s" if you REALLY want it, but then you have to get rid
// of it later in some other code.
}
prompt.Add(new XText(parts[parts.Length - 1]));
}
}
This is the end result
<prompt>
There is something I want to tell you.<break time="3" />
You are my favorite caller today.<break time="1" />
Have a great day!
</prompt>
class Program
{
static void Main(string[] args)
{
var xml =
@"<prompt>There is something I want to tell you.[pause=3] You are my favorite caller today.[pause=1] Have a great day!</prompt>";
var voiceXmlDocument = XElement.Parse(xml);
var pattern = new Regex(@"\[(\w+)=(\d+)]");
foreach (var element in voiceXmlDocument.DescendantsAndSelf("prompt"))
{
var matches = pattern.Matches(element.Value);
foreach (var match in matches)
{
var matchValue = match.ToString();
var number = Regex.Match(matchValue, @"\d+").Value;
var newValue = string.Format(@"<break time=""{0}""/>", number);
element.Value = element.Value.Replace(matchValue, newValue);
}
}
Console.WriteLine(voiceXmlDocument.ToString());
}
}
Oh, my goodness, you guys were quicker than I expected! So, thanks for that, however in the meantime, I solved it a slightly different way. The code here looks expanded from before because once I got it working, I added some specifics into this particular condition:
foreach (XElement element in voiceXmlDocument.Descendants("prompt").ToArray())
{
// convert the element to a string and check whether it contains any
// pause placeholders
string elementAsString = element.ToString();
MatchCollection matches = Regex.Matches(elementAsString, @"\[pause=(\d+)]");
// if there were no matches, move on to the next element
if (matches.Count == 0)
continue;
// iterate through the indexed matches
for (int i = 0; i < matches.Count; i++)
{
int pauseValue = 0; // capture the original pause value specified by the user
int pauseMilliSeconds = 1000; // if things go wrong, use a 1 second default
if (matches[i].Groups.Count == 2) // the value is expected to be in the second group
{
// if the value could be parsed to an integer, convert it from 1/8 seconds to milliseconds
if (int.TryParse(matches[i].Groups[1].Value, out pauseValue))
pauseMilliSeconds = pauseValue * 125;
}
// replace the specific match with the new <break> tag content
elementAsString = elementAsString.Replace(matches[i].Value, string.Format(@"<break time=""{0}ms""/>", pauseMilliSeconds));
}
// finally replace the element by parsing
element.ReplaceWith(XElement.Parse(elementAsString));
}
Oh, my goodness, you guys were quicker than I expected!
Doh! Might as well post my solution anyway!
foreach (var element in xml.Descendants("prompt"))
{
Queue<string> pauses = new Queue<string>(Regex.Matches(element.Value, @"\[pause *= *\d+\]")
.Cast<Match>()
.Select(m => m.Value));
Queue<string> text = new Queue<string>(element.Value.Split(pauses.ToArray(), StringSplitOptions.None));
element.RemoveAll();
while (text.Any())
{
element.Add(new XText(text.Dequeue()));
if (pauses.Any())
element.Add(new XElement("break", new XAttribute("time", Regex.Match(pauses.Dequeue(), @"\d+").Value)));
}
}
For every prompt element, use Regex to match all the pause placeholders and put them in a queue.
Then use those placeholders as delimiters to split the inner text, and put the remaining text segments in a second queue.
Clear the element with RemoveAll, then iterate over the two queues and re-add each piece as the appropriate node type: XText for text and XElement for a pause. When adding the time attribute, use Regex to pull the number out of the original match.
Related
I have a line like
"In this task, you need to use the following equipment:
Perforator
Screwdriver
Drill
After finishing work, the tool must be cleaned."
How do I extract elements from this string? As a result, I need an array like {"Perforator", "Screwdriver", "Drill"}
This is how I would do this using regular expressions (assuming the input text is similar to your example and there is always numbering in front of each item):
string input = @"
In this task, you need to use the following equipment:
1. Perforator
2. Screwdriver
3. Drill
After finishing work, the tool must be cleaned.";
string pattern = @"(\d\. )([a-zA-Z]*)";
var results = Regex.Matches(input, pattern);
foreach (Match result in results)
{
Console.WriteLine(result.Groups[2].Value); // ...or insert each in a List
}
Result:
Perforator
Screwdriver
Drill
One possible way to do it would be to break the string into lines, then try to convert each line into a number plus text, and take only the lines where the conversion is successful.
IEnumerable<string> GetNumberedItems(string input)
{
var lines = input.Split(new [] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
var items = line.Split('.');
if (items.Length != 2) continue;
var ok = int.TryParse(items[0].Trim(), out _);
if (ok) yield return items[1].Trim();
}
}
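The two ideas above can also be combined: a multiline regular expression that only accepts lines starting with a number and a dot, and that captures the rest of the line, so multi-word items survive. This is a rough sketch; the NumberedList class and Extract method are invented names, and it assumes every item is written as "N. text" on its own line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberedList
{
    // Hypothetical helper: returns the text of every "N. item" line as an array.
    public static string[] Extract(string input)
    {
        var items = new List<string>();

        // With RegexOptions.Multiline, ^ and $ match at the start and end of each line.
        // Group 1 lazily captures the item text, leaving trailing whitespace (such as \r) out.
        foreach (Match m in Regex.Matches(input, @"^\s*\d+\.\s*(.+?)\s*$", RegexOptions.Multiline))
        {
            items.Add(m.Groups[1].Value);
        }
        return items.ToArray();
    }
}
```

Lines without a leading number, such as the introduction and the closing sentence, do not match and are skipped.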
I have done something like:
var a = "77,82,83";
foreach (var group in a.Split(','))
{
a = group.Replace("83", string.Empty);
}
I want to remove 83, but each pass through the loop overwrites the previous value of a, so I end up with an empty string instead of the list with that value removed.
For example, given var a = "77,82,83";
I want output like 77,82
Edit:
"83" can be in any position.
If you want output as string you don't need to Split. Just get the LastIndexOf the , character and perform Substring on the variable:
var a = "77,82,83";
var newString = a.Substring(0, a.LastIndexOf(',')); // 77,82
If you are unsure whether the string contains at least one ,, you can check before performing the Substring:
var a = "77,82,83";
var newString = a; // fall back to the original string when there is no comma
var lastIndex = a.LastIndexOf(',');
if (lastIndex > 0)
newString = a.Substring(0, lastIndex);
Update:
If you want to remove specific values from any position:
Split the string -> Remove the values using Where -> Join them with , separator
a = string.Join(",", a.Split(',').Where(i => i != "83"));
You might need to clarify the question slightly but I think you're asking for the following:
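var a = "72,82,83";
var group = a.Split(',').ToList();
int position = group.FindIndex(p => p == "83"); // exact match, so "183" is left alone
if (position >= 0) // FindIndex returns -1 when nothing matches
group.RemoveAt(position);
a = string.Join(",", group);
You can make the item you're looking for in the FindIndex predicate a parameter.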
I think the problem you're having with your original code is that the foreach is a loop over each item in the array, so you're trying to remove "83" on each pass.
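To make that concrete: in the original loop, a is reassigned on every pass, so after the last item ("83") it holds "83".Replace("83", string.Empty), which is an empty string. A small helper that collects the items to keep avoids this. The sketch below uses made-up names (CsvValues, Remove) and assumes items are compared as exact strings after trimming.

```csharp
using System;
using System.Collections.Generic;

public static class CsvValues
{
    // Hypothetical helper: removes every item equal to 'value' from a comma-separated list.
    public static string Remove(string list, string value)
    {
        var kept = new List<string>();
        foreach (string item in list.Split(','))
        {
            // Compare whole items, so removing "83" does not touch "183".
            if (item.Trim() != value)
            {
                kept.Add(item);
            }
        }
        // Build the result once, after the loop, instead of overwriting it on each pass.
        return string.Join(",", kept.ToArray());
    }
}
```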
I have a string that looks like this
2,"E2002084700801601390870F"
3,"E2002084700801601390870F"
1,"E2002084700801601390870F"
4,"E2002084700801601390870F"
3,"E2002084700801601390870F"
This is one whole string, you can imagine it being on one row.
And I want to split this in the way they stand right now like this
2,"E2002084700801601390870F"
I cannot change the way it is formatted. So my best bet is to split at every second quotation mark. But I haven't found any good ways to do this. I've tried this https://stackoverflow.com/a/17892392/2914876 but I only get an error about invalid arguments.
Another issue is that this project is running .NET 2.0 so most LINQ functions aren't available.
Thank you.
Try this
var regEx = new Regex(@"\d+\,"".*?""");
var lines = regEx.Matches(txt).OfType<Match>().Select(m => m.Value).ToArray();
Use foreach instead of LINQ Select on .NET 2.0:
Regex regEx = new Regex(@"\d+\,"".*?""");
foreach (Match m in regEx.Matches(txt))
{
var curLine = m.Value;
}
I see three possibilities, none of them are particularly exciting.
As @dvnrrs suggests, if there's no comma where you have line-breaks, you should be in great shape. Replace ," with something novel. Replace the remaining "s with what you need. Replace the "something novel" with ," to restore them. This is probably the most solid--it solves the problem without much room for bugs.
Iterate through the string looking for the index of the next " from the previous index, and maintain a state machine to decide whether to manipulate it or not.
Split the string on "s and rejoin them in whatever way works the best for your application.
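The second option (walking the string and tracking state) might look roughly like the sketch below. It sticks to types available in .NET 2.0 and avoids LINQ. The class and method names are invented for illustration, and it assumes the quoted values never contain an escaped quotation mark.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteSplitter
{
    // Hypothetical helper: cuts the text into records, ending one after every second quote.
    public static List<string> SplitAfterEverySecondQuote(string text)
    {
        List<string> records = new List<string>();
        int start = 0;             // index where the current record begins
        bool insideQuotes = false; // the only state the "state machine" needs

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '"')
            {
                continue;
            }
            if (insideQuotes)
            {
                // A closing quote ends the current record.
                records.Add(text.Substring(start, i - start + 1).Trim());
                start = i + 1;
            }
            insideQuotes = !insideQuotes;
        }
        return records;
    }
}
```

Any trailing text after the last closing quote, or an unterminated quoted value, is silently ignored here; real input may need an explicit error for that case.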
I realize regular expressions will handle this but here's a pure 2.0 way to handle as well. It's much more readable and maintainable in my humble opinion.
using System;
using System.Collections.Generic;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
const string data = @"2,""E2002084700801601390870F""3,""E2002084700801601390870F""1,""E2002084700801601390870F""4,""E2002084700801601390870F""3,""E2002084700801601390870F""";
var parsedData = ParseData(data);
foreach (var parsedDatum in parsedData)
{
Console.WriteLine(parsedDatum);
}
Console.ReadLine();
}
private static IEnumerable<string> ParseData(string data)
{
var results = new List<string>();
var split = data.Split(new [] {'"'}, StringSplitOptions.RemoveEmptyEntries);
if (split.Length % 2 != 0)
{
throw new Exception("Data Formatting Error");
}
// split alternates between the "N," prefix and the quoted value, so step through it in pairs
for (var index = 0; index < split.Length; index += 2)
{
results.Add(string.Format(@"{0}""{1}""", split[index], split[index + 1]));
}
return results;
}
}
}
In C#, how do I get the text of a System.Windows.Forms.HtmlElement not including the text from its children?
If I have
<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>
then the InnerText property of the whole thing is "aaabbbcccddd" and I just want "aaa".
I figure this should be trivial, but I haven't found anything to produce the "immediate" text of an HtmlElement in C#. More ludicrous ideas are "subtracting" the InnerText of the children from the parent, but that's an insane amount of work for something that I'm sure is trivial.
(All I want is access to the Text Node of the HtmlElement.)
I'd certainly appreciate any help (or pointer) that anyone can supply.
Many thanks.
Examples:
<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div> -> Produce "aaa"
<div><div>ccc</div><div>ddd</div></div> -> Produce ""
<div>ccc</div> -> Produce "ccc"
Edit
There are a number of ways to skin this particular cat, none of them elegant. However, given my constraints (not my HTML, quite possibly not valid), I think Aleksey Bykov's solution is closest to what I needed (and indeed, I did implement the same solution he suggested in the last comment.)
I've selected his solution and upvoted all the other ones that I think would work, but weren't optimal for me. I'll check back to upvote any other solutions that seem likely to work.
Many thanks.
Maybe it's simpler than that, if you're willing to use XmlDocument instead of HtmlDocument - you can just use the 'Value' property of the XmlElement.
This code gives the output you want for the 3 cases you mentioned:
class Program
{
private static string[] htmlTests = {@"<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>",
@"<div><div>ccc</div><div>ddd</div></div>",
@"<div>ccc</div>" };
static void Main(string[] args)
{
var page = new XmlDocument();
foreach (var test in htmlTests)
{
page.LoadXml(test);
Console.WriteLine(page.DocumentElement.FirstChild.Value);
}
}
}
Output:
aaa
(empty line)
ccc
I am not sure what you mean by HtmlElement, but with XmlElement you would do it like this:
using System;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
using System.Text;
public static class XmlUtils {
public static IEnumerable<String> GetImmediateTextValues(XmlNode node) {
var values = node.ChildNodes.Cast<XmlNode>().Aggregate(
new List<String>(),
(xs, x) => { if (x.NodeType == XmlNodeType.Text) { xs.Add(x.Value); } return xs; }
);
return values;
}
public static String GetImmediateJoinedTextValues(XmlNode node, String delimiter) {
var values = GetImmediateTextValues(node);
var text = String.Join(delimiter, values.ToArray());
return text;
}
}
EDIT:
Well, if your HtmlElement comes from System.Windows.Forms, then what you need to do is to use its DomElement property trying to cast it to one of the COM interfaces defined in mshtml. So all you need to do is to be able to tell if the element you are looking at is a text node and get its value. First you gotta add a reference to the mshtml COM library. You can do something like this (I cannot verify this code immediately).
public bool IsTextNode(HtmlElement element) {
var result = false;
var nativeNode = element.DomElement as mshtml.IHTMLDOMNode;
if (nativeNode != null) {
var nodeType = nativeNode.nodeType;
result = nodeType == 3; // -- TextNode: http://msdn.microsoft.com/en-us/library/aa704085(v=vs.85).aspx
}
return result;
}
Well, you could do something like this (assuming your input is in a string called 'input'):
string pattern = @">.*?<";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
var first_match = matches[0].ToString();
string result = first_match.Substring(1, first_match.Length - 2);
I probably wouldn't do it (or just rely on matching the string for the first <div> and </div>) ... here, for extra credit:
int start = input.IndexOf(">") + 1;
int end = input.IndexOf("<", start);
string result = input.Substring(start, end - start);
I am parsing a webpage for http links by first parsing out all the anchored tags, then parsing out the href tags, then running a regex to remove all tags that aren't independent links (like href="/img/link.php"). The following code works correctly, but also appends lots of blank lines in between the parsed links.
while (parse.ParseNext("a", out tag))
{
string value;
//A REGEX value, this one finds proper http address'
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResults = regexObj.Match(value);
value2 = matchResults.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
To fix this, I added the following code and it works to clean up the list:
if (value2 != "")
{
lstPages.AppendText(value2 + "\r\n");
}
However, I:
1. Don't believe this is the most efficient way to go about this, and
2. Still don't understand where the blank ("") lines come from.
My actual question is on both of these but more for issue #2, as I would like to learn why I receive these results, but also if there is a more efficient method for this.
The reason you are getting an empty string in value2 is that matchResults.Value == "" if the regular expression fails to match. Instead of checking if value2 != "", you could directly check matchResults.Success to see if the regular expression matched. You're basically doing that, anyway, since your particular regular expression would never match an empty string, but checking matchResults.Success would be more straightforward.
Another thing to consider is that it's not necessary to create the Regex object every iteration of your loop. Here are the modifications I suggest:
//A REGEX value, this one finds proper http address'
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
while (parse.ParseNext("a", out tag))
{
string value;
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResult = regexObj.Match(value);
if (matchResult.Success)
{
value2 = matchResult.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
}
Use Html Agility Pack instead
static void Main(string[] args)
{
var html = new HtmlDocument();
var request = WebRequest.Create("http://stackoverflow.com/questions/6256982/parsing-links-and-recieving-extra-blanks/6257328#6257328") as HttpWebRequest;
using (var response = request.GetResponse())
using (var responseStream = response.GetResponseStream())
{
html.Load(responseStream);
}
foreach (var absoluteHref in html.DocumentNode.SelectNodes("//a[starts-with(@href, 'http')]"))
{
Console.WriteLine(absoluteHref.Attributes["href"].Value);
}
}
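TryGetValue is a generic method (the value is of type TValue). If it doesn't have any value to return, it returns false and sets the out parameter to the default value of that type. For Dictionary<TKey, TValue> with a string value, that default is null rather than String.Empty or "".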
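One detail worth checking for yourself: with the standard Dictionary<string, string>, a failed TryGetValue leaves the out parameter as null. The sketch below assumes tag.Attributes behaves like such a dictionary, which the question does not confirm; the class and method names are invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;

public static class TryGetValueDemo
{
    // Hypothetical helper: reports what TryGetValue leaves in the out parameter.
    public static string Describe(Dictionary<string, string> attributes, string key)
    {
        string value;
        if (attributes.TryGetValue(key, out value))
        {
            return "found: " + value;
        }
        // On a miss, value holds default(string), which is null.
        return value == null ? "missing: value is null" : "missing: value is \"" + value + "\"";
    }
}
```

Because the lookup also returns false on a miss, code guarded by if (TryGetValue(...)) never sees that default at all; the blank lines in the question come from the failed regex match instead.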