I have gone through a lot of answers on this but could not solve my issue, so I am asking.
I am getting my XML in a string. Some node values contain "< 6" as content.
As a result I am getting this exception:
Name cannot begin with the ' ' character, hexadecimal value 0x20. Line 3270, position 54.
Here is the code:
string patternToReplaceAnd = "&(?![a-z#]+;)";
Regex reg = new Regex(patternToReplaceAnd);
xml = reg.Replace(xml, "&amp;");
XDocument xDoc = XDocument.Parse(xml);
Can anyone help me out?
You say you're getting your XML in a string. You're not. You're getting garbage in a string.
If the garbage is really important to you then you can try and convert it to XML. How you do that depends on just how bad it is, which we can't really judge.
Much better: refuse to accept shoddy goods. Go back to the supplier and tell them to generate real XML.
I do realize that this question is old but I came across the same problem today and I hope my answer will help someone who may land on this question in the future.
The problem is content that includes < followed by a space. You will have to replace that < with &lt; so that it is not recognised as a malformed XML start tag.
xml = xml.Replace("< ", "&lt; "); // include the space after < to avoid replacing actual tags
XDocument xDoc = XDocument.Parse(xml);
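If the stray < is not always followed by a space, a slightly broader variant of the same idea is to escape every < that cannot start a tag. This is only a rough sketch: it assumes element names start with an ASCII letter, underscore or colon, and that the input has no CDATA sections or comments containing such characters. EscapeStrayLessThan is a name made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: escapes any '<' that is NOT followed by a character
// that could begin a tag, a closing tag, a comment/DOCTYPE or a processing instruction.
static string EscapeStrayLessThan(string xml)
{
    return Regex.Replace(xml, @"<(?![A-Za-z_:/!?])", "&lt;");
}

string broken = "<root><value>5 < 6</value></root>";
string repaired = EscapeStrayLessThan(broken);

// The repaired string now parses; the text value is read back unescaped.
XDocument doc = XDocument.Parse(repaired);
Console.WriteLine(doc.Root.Element("value").Value);
```

Content such as "a <b" would still slip through, because <b looks like the start of a tag, so treat this as a stopgap rather than a general repair.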
We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
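For what it's worth, one possible shape for that custom replacement is sketched below. It assumes the problem element has no attributes, is never nested inside itself, and contains only text. EscapeElementContent is a hypothetical helper name; SecurityElement.Escape is the existing .NET method that escapes <, >, &, " and '.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: escapes everything between <name> and </name>.
// Assumes no attributes, no nesting of the same element, and text-only content.
static string EscapeElementContent(string xml, string name)
{
    string pattern = "(<" + Regex.Escape(name) + ">)(.*?)(</" + Regex.Escape(name) + ">)";
    return Regex.Replace(
        xml,
        pattern,
        m => m.Groups[1].Value + SecurityElement.Escape(m.Groups[2].Value) + m.Groups[3].Value,
        RegexOptions.Singleline); // Singleline lets '.' match newlines inside the element
}

string raw = "<SomeNode>10040:<->10110:Value=><->10418:No</SomeNode>";
string fixedXml = EscapeElementContent(raw, "SomeNode");

var doc = new XmlDocument();
doc.LoadXml(fixedXml);
Console.WriteLine(doc.DocumentElement.InnerText);
```

Note that content which is already partly escaped (for example an existing &amp;) would be escaped a second time by this approach.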
I need to figure out a good way using C# to parse an XML file for (NULL) and remove it from the tags and replace it with the word BAD.
For example:
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
should be replaced with
<GC5_BAD DIRTY="False"></GC5_BAD>
Part of the problem is I have no control over the original XML, I just need to fix it once I receive it. The second problem is that the (NULL) can appear in zero, one, or many tags. It appears to be an issue with users filling in additional fields or not. So I might get
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
or
<MH_OTHSECTION_TXT_(NULL) DIRTY="False"></MH_OTHSECTION_TXT_(NULL)>
or
<LCDATA_(NULL) DIRTY="False"></LCDATA_(NULL)>
I am a newbie to C# and programming.
EDIT:
So I have come up with the following function that, while not pretty, works so far.
public static string CleanInvalidXmlChars(string fileText)
{
List<char> charsToSubstitute = new List<char>();
charsToSubstitute.Add((char)0x19);
charsToSubstitute.Add((char)0x1C);
charsToSubstitute.Add((char)0x1D);
foreach (char c in charsToSubstitute)
fileText = fileText.Replace(Convert.ToString(c), string.Empty);
StringBuilder b = new StringBuilder(fileText);
b.Replace("", string.Empty);
b.Replace("", string.Empty);
b.Replace("<(null)", "<BAD");
b.Replace("(null)>", "BAD>");
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String result = nullMatch.Replace(b.ToString(), "<$1_BAD$2>");
result = result.Replace("(NULL)", "BAD");
return result;
}
I have only been able to find 6 or 7 bad XML files to test this code on, but it has worked on each of them and not removed good data. I appreciate the feedback and your time.
In general, regular expressions are not the right way of handling XML files. There's a range of solutions to handle XML files correctly - you can read up on System.Xml.Linq for a good start. If you're a newbie, it's certainly something you should learn at some point. As Ed Plunkett pointed out in the comments, though, your XML is not actually XML: ( and ) characters are not allowed in XML element names.
Since you will have to do it as an operation on a string, Corak's comment to use
contentOfXml.Replace("(NULL)", "BAD");
may be a good idea, but will break if any elements can contain the string (NULL) as anything other than their name.
If you want a regex approach, this might work decently, but I'm not sure if it's not missing any edge cases:
var regex = new Regex(@"(<\/?[^_]*_)\(NULL\)([^>]*>)");
var result = regex.Replace(contentOfXml, "$1BAD$2");
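One edge case worth checking: [^_]* stops at the first underscore, so a name with several underscores, such as MH_OTHSECTION_TXT_(NULL) from the question, is not matched by the pattern above. A possible variant (a sketch only, not tested against the real files) lets the name part run over any characters that are not <, > or whitespace:

```csharp
using System;
using System.Text.RegularExpressions;

// [^<>\s]* may cross underscores, so MH_OTHSECTION_TXT_(NULL) is matched too.
// The pattern only matches inside a tag name, so (NULL) in element text is left alone.
var tagName = new Regex(@"(</?[^<>\s]*_)\(NULL\)([^>]*>)");

string input = "<MH_OTHSECTION_TXT_(NULL) DIRTY=\"False\">(NULL)</MH_OTHSECTION_TXT_(NULL)>";
string output = tagName.Replace(input, "$1BAD$2");
Console.WriteLine(output);
```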
Will it be suitable for you to read this XML as a string and perform a regex replacement? Like:
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String processedXmlString = nullMatch.Replace(originalXmlString, "<$1_BAD$2>");
I have some invalid XML from a vendor that I need to process. Here is an example:
<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>
So, we have a few problems. First, there is no root document. I overcome that by adding a root document. No problem. The second, and more difficult problem, is the less than symbol. I can just encode the whole thing but it will encode the XML tags. Is there a library or simple method out there somewhere for handling this? I really don't want to reinvent the wheel as I'm sure hundreds of people have dealt with "quasi-XML" like this. Appreciate any help.
I would read the file line by line and use a regex to get the values between the nodes. Your example doesn't have nested elements so this is pretty easy. While reading line by line you can encode the inner values. The named capture group (?<xml>.*?) will get everything between the nodes into the group named xml.
var regex = "<.*?>(?<xml>.*?)</.*?>";
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;
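Put together, the line-by-line idea might look roughly like the sketch below. It assumes exactly one flat <tag>value</tag> element per line with no attributes, and it wraps the result in a made-up <root> element. RepairFlatXml is a hypothetical name, and lines that do not fit the assumed shape are simply skipped, which may or may not be acceptable for your data.

```csharp
using System;
using System.Security;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: escapes the text of each flat element and adds a root element.
static string RepairFlatXml(string input)
{
    // \k<tag> requires the closing tag to repeat the opening tag's name.
    var lineShape = new Regex(@"^\s*<(?<tag>[A-Za-z_][\w.-]*)>(?<xml>.*)</\k<tag>>\s*$");
    var sb = new StringBuilder("<root>");
    foreach (string line in input.Split('\n'))
    {
        Match m = lineShape.Match(line);
        if (!m.Success) continue; // skip lines that do not fit the assumed shape
        string tag = m.Groups["tag"].Value;
        sb.Append('<').Append(tag).Append('>')
          .Append(SecurityElement.Escape(m.Groups["xml"].Value))
          .Append("</").Append(tag).Append('>');
    }
    return sb.Append("</root>").ToString();
}

string vendor = "<a>foo</a>\n<b>bar</b>\n<c>foobar is < $15</c>";
XElement root = XElement.Parse(RepairFlatXml(vendor));
Console.WriteLine(root.Element("c").Value);
```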
I tried to save an apostrophe ' to XML, but I always get an error.
When I want to save a new item, I first try to find it. I use this:
XmlNode letters = root.SelectSingleNode("//letters");
XmlNode oldFileLetter = letters.SelectSingleNode("letter[@name='"+letterName+"']");
but when letterName contains apostrophe ' I get an error, that path isn't closed
I also found this c# parsing xml with and apostrophe throws exception but when I did what Steven said, it's OK for apostrophe, but double quotes throw exception.
I need to pass " and ' too.
You could also replace the apostrophe with &apos;
letterName = letterName.Replace("'", "&apos;");
XmlNode letters = root.SelectSingleNode("//letters");
XmlNode oldFileLetter = letters.SelectSingleNode("letter[@name='"+letterName+"']");
Take a look at this thread about special chars on a xml file.
The issue here is that your XPath already has an apostrophe indicating the beginning of a string within the XPath, so any apostrophe in your letterName value would be interpreted as closing the string value.
Contrary to Felipe's advice, XPaths are not themselves XML, so replacing the apostrophes with &apos; will not work. It will avoid the error, but you won't find the node you're looking for if letterName contains an apostrophe. Also, there is no difference in C# between "'" and "\'", so that will not help either.
I'd suggest looping through the letter elements and identifying the one where @name has the value you're looking for:
XmlNode oldFileLetter = null;
foreach(XmlNode letterNameNode in letters.SelectNodes("letter/@name"))
{
if(letterNameNode.Value.Equals(letterName))
{
oldFileLetter = letterNameNode.ParentNode;
break;
}
}
The only other approach I know of involves rigging up a system to allow defining and using XPath variables in your paths, but that's usually overkill.
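One more option that stays within plain XPath 1.0 is to build the string literal with concat() whenever the value contains both kinds of quote. The sketch below is an illustration under that assumption; ToXPathLiteral is a hypothetical helper name, not part of System.Xml.

```csharp
using System;
using System.Xml;

// Hypothetical helper: turns an arbitrary string into a valid XPath 1.0 string expression.
static string ToXPathLiteral(string value)
{
    if (!value.Contains("'")) return "'" + value + "'";
    if (!value.Contains("\"")) return "\"" + value + "\"";
    // Both quote kinds present: split on apostrophes and glue the pieces with concat(),
    // writing each apostrophe as the double-quoted literal "'".
    return "concat('" + value.Replace("'", "',\"'\",'") + "')";
}

var xmlDoc = new XmlDocument();
xmlDoc.LoadXml("<letters><letter name=\"plain\"/><letter name=\"it's &quot;quoted&quot;\"/></letters>");

string letterName = "it's \"quoted\"";
XmlNode found = xmlDoc.SelectSingleNode("//letters/letter[@name=" + ToXPathLiteral(letterName) + "]");
Console.WriteLine(found != null);
```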
Have you tried escaping it as such:
\'
You have to write it as an entity, I think...
I'm not sure, but I can recall having come across this issue once before.
Look at this wikipedia thread...
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
const string apo = "\'";
XmlNode letters = root.SelectSingleNode("//letters");
XmlNode oldFileLetter = letters.SelectSingleNode("letter[@name="+apo+letterName+apo+"]");
I'm using the new System.Xml.Linq to create HTML documents (yes, I know about HtmlDocument, but I much prefer the XDocument/XElement classes). I'm having a problem inserting &nbsp; (or any other HTML entity). What I've tried already:
Just putting the text in directly doesn't work because the & gets turned into &amp;.
new XElement("h1", "Text&nbsp;to&nbsp;keep&nbsp;together.");
I tried parsing in the raw XML using the following, but it barfs with this error:
XElement.Parse("Text&nbsp;to&nbsp;keep&nbsp;together.");
--> Reference to undeclared entity 'nbsp'.
Try number three looks like the following. If I save to a file, there is just a space; the &#160; gets lost.
var X = new XDocument(new XElement("Name", KeepTogether("Hi Mom!")));
private static XNode KeepTogether(string p)
{
return XElement.Parse("<xml>" + p.Replace(" ", "&#160;") + "</xml>").FirstNode;
}
I couldn't find a way to just shove the raw text through without it getting escaped. Am I missing something obvious?
I couldn't find a way to just shove the raw text through without it getting escaped.
Just put the Unicode character that &nbsp; refers to (U+00A0 NO-BREAK SPACE) directly in a text node, then let the serializer worry about whether it needs escaping to &#160; or not. (Probably not: if you are using UTF-8 or ISO-8859-1 as your page encoding the character can be included directly without having to worry about encoding it into an entity reference or character reference).
new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together");
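A quick way to convince yourself that the character survives is a small round trip like the one below. It is just a sketch using XElement.ToString() and XElement.Parse; the variable names are made up.

```csharp
using System;
using System.Xml.Linq;

// Build the element with real U+00A0 characters in the text node.
var h1 = new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together");
string markup = h1.ToString();

// The numeric character reference &#160; parses to the same character,
// so both elements end up with identical text values.
XElement parsed = XElement.Parse("<h1>Text&#160;to&#160;keep&#160;together</h1>");

Console.WriteLine(markup.IndexOf('\u00A0') >= 0);
Console.WriteLine(parsed.Value == h1.Value);
```

The no-break space looks like an ordinary space in most editors, which is probably why it seemed to get lost when the file was inspected.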
Replacing the & with a marker like ##AMP## solved my problem. Maybe not the prettiest solution but I got a demo for a customer in 10 mins so...I don't care :)
Thx for the idea
I know this is old, but I found something and I'm rather surprised!
XElement element = new XElement("name",new XCData("<br /> & etc"));
And there you go! CDATA text!
You could also try using numbered entities, they need no declaration.
The numbered entity equivalent to the named entity &nbsp; is &#160;.
Unlike &amp;, &lt; and the other predefined entities, &nbsp; is not an entity known to XML, so you need to declare it.
In XML, something like &xyz; is treated as an entity reference; the parser substitutes its declared value when producing the output.
// the xml, plz remove '.' within xml
string xml = "<xml>test&.n.b.s.p;test</xml>";
// declare nbsp as xml entity and its value is " " in this case.
string declareEntity = "<!DOCTYPE xml [<!ENTITY nbsp \" \">]>";
XElement x = XElement.Parse( declareEntity + xml );
// output with a space between tests
// <xml>test test</xml>
or
// plz remove '.' in the string
XElement.Parse("<xml>" + HttpUtility.HtmlEncode("Text&.n.b.s.p;keep everything") + "</xml>");
You can paste the character itself, exactly as you wish to see it, by copying it from somewhere else. Visual Studio allows that.
Though this is hard to do if you need &nbsp;, it is easy if you need visible symbols, for example:
&bull; ...just paste •
&harr; ...just paste ↔
I came up with this slightly daft approach which suits me:
String replace all the & with ##AMP## when you store the data....
And reverse that operation on output.
I am using this in conjunction with XElement SQL column and works a treat.
Regards
Neil