remove illegal 0x1f charector from xml - c#

I have a program that generates some data and saves it as an xml, unfortunately for my purposes I cant save it in the newer XML that allows for characters like 0x1f. As a result, I need to eliminate this character from my xml. All I have been able to find that seems to do this is this http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ but I don't know java-script, and would like to be able to use a script that I am able to understand. I do know basic C#, but am not great in it. Anyway, what would be the easiest way to filter this character? I do think this is a good question for the online community anyway as finding a working method in C# from Google proves to be challenging.

From this post: How can you strip non-ASCII characters from a string? (in C#)
Adjusting it for your case:
string s = File.ReadAllText(filepath);
s = Regex.Replace(s, #"[\u0000-\u001F]", string.Empty);
File.WriteAllText(newFilepath, s);
Then test the new file. Don't overwrite the old until you know if this works or not.

Related

XML: how to pre-parse when only SOME data is escaped?

XML snippet:
<field>& is escaped</field>
<field>"also escaped"</field>
<field>is & "not" escaped</field>
<field>is " and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue I'm running into is that running the XML as is into a parser, such as (below) will throw an exception due to the XML being bad due to some of it not being escaped properly
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml))
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&");
And then you'll be able to parse your XML.
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like when there's no definition of &npsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

character to use when splitting strings in visual c#?

Ok, I'm racking my brains over this one. It's pretty simple though (I think).
I'm currently creating a text file as a comma separated string of values.
Later, I read in that file data and then use the .split function to split the data by commas.
I discovered that sometimes one of the description fields in the data conatins an embedded comma, which ends up throwing the split command off.
Is there any special character I could use that could pretty much guarantee wouldn't be in the data, or is there a better way to accomplish this? Thanks!
// Initial Load
fullString = fileName + "," + String.Join(",", fieldValues);
// Access later
String[] valuesArray = myString.Split(',');
Short answer, there's no "simple" way to do it using Split. The best you can hope for is to set the deliminator as something cooky that wouldn't ever get used (but even that's not a guarantee).
The simple method would be to used something like CsvHelper (get it through Nuget) or any of the other dozen or so packages that are designed for parsing CSV.

c# equivalent to stripcslashes function?

I am working with a project that includes getting MMS from a mms-gateway and storing the image on disk.
This includes using a received base64encoded string and storing it as a zip to a web server. This zip is then opened, and the image is retrieved.
We have managed to store it as a zip file, but it is corrupted and cannot be opened.
The documentation from the gateway is pretty sparse, and we have only a php example to rely on. I think we have figured out how to "translate" most of it, except for the PHP function stripcslashes(inputvalue). Can anyone shed shed any light on how to do the same thing in c#?
We are thankful for any help!
stripcslashes() looks for "\x" type elements within longer strings (where 'x' could be any character, or perhaps, more than one). If the 'x' is not recognised as meaningful, it just removes the '\' but if it does recognise it as a valid C-style escape sequence (i.e. "\n" is newline; "\t" is tab, etc.), as I understand it, the recognised character is inserted instead: \t will be replaced by a tab character (0x09, I think) in your string.
I'm not aware of any simple way to get the .net framework to do the same thing without building a similar function yourself. This obviously isn't very hard, but you need to know which escape sequences to process.
If you happen to know (or find out by inspecting your base64 text) that the only thing in your input that will need processing is a particular one or two sequences (say, tab characters), it becomes very easy and the following snippet shows use of String.Replace():
string input = #"Some\thing"; // '#' means string stored without processing '\t'
Console.WriteLine(input);
string output = input.Replace(#"\t", "\t");
Console.WriteLine(output);
Of course, if you really do simply want to remove all the slashes:
string output = input.Replace(#"\", "");

Culture specific characters to nice URL format

I need some functionality to make the following string in a url-friendly format:
"knæ som gør" should be "kna-som-gor"
That is, replacing culture specific characters to characters that can be used in urls.
Using .Net and C#
Please help me :)
/Andreas
Don't complicate things. :)
Either use a regexp, or simply use String.Replace.
You can find a solution that removes diacritics here: How do I remove diacritics (accents) from a string in .NET?. This solution does not help you with æ or ø, though.
Maybe that removes enough of your special characters that the rest can be translated using simple replacing?
If "url-friendly" does not mean pretty, you could also use HttpUtility.UrlEncode, which produces
"kn%c3%a6+som+g%c3%b8r".
Edit: Added possible solution (end of post).
I had a very similar problem, albeit for file names rather than URLs. The main problem seems to be that there is no standard way to ask for the "best ASCII replacement for ø", so even if you can locate all the unwanted characters it is hard to automate which replacement to insert.
I posted quite a bit of code that might be helpful. See this StackOverflow question for details.
Edit: I think the solution to this problem lies with StringInfo, which allows you to iterate through the sub-characters (Unicode surrogates or combining characters) in a string. This should make it possible to detect and convert something like å (which can be encoded in Unicode as either A-WITH-RING or RINGED-A; filter out the decorator and keep the part that is a normal character).

C# Special Characters not displayed propely in XML

i have a string that contains special character like (trademark sign etc). This string is set as an XML node value. But the special character is not rendered properly in XML, shows ??. This is how im using it.
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.htmlEncode(xxxx) but it converts it into "&amp ;#8482;" so the output of xml is "&#8482"; instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.

Categories