I have an issue with serializing XML: I have an object with a DateTime property whose millisecond value is 990, but when I view the output string it shows up like this:
<ReadingsDateTime>2016-07-04T10:10:00.99Z</ReadingsDateTime>
The code used to convert the object to XML is below. What is going on? I cannot find a reason for this behaviour.
string xml;
try
{
    var serializer = new XmlSerializerFactory().CreateSerializer(typeof(T), xmlNamespace);
    using (var memoryStream = new MemoryStream())
    {
        var settings = new XmlWriterSettings
        {
            Indent = false,
            NamespaceHandling = NamespaceHandling.OmitDuplicates,
            CloseOutput = false,
            WriteEndDocumentOnClose = true,
        };
        using (var xmlWriter = XmlWriter.Create(memoryStream, settings))
        {
            serializer?.Serialize(xmlWriter, obj);
        }
        memoryStream.Seek(0, SeekOrigin.Begin);
        using (var streamReader = new StreamReader(memoryStream))
        {
            xml = streamReader.ReadToEnd();
        }
    }
}
catch (Exception ex)
{
    throw new ApplicationException("Unable to convert to XML from an object", ex);
}
return xml;
.990 is the same as .99: it is a fractional part, so the trailing 0 digit is dropped. In a fraction, significance runs from the left-hand side to the right. Example:
1.0000 is the same value as 1
2.94 is the same value as 2.940, 2.9400 or 2.94000.
The serializer simply removes the trailing 0 digits. If you want to always capture trailing 0 digits (not sure why you would), you can add a custom string property, specify the exact output to be serialized and read there, and ignore the DateTime property; see this previous SO post as an example.
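If you do want to force the trailing zero, the linked approach looks roughly like this. This is only a minimal sketch; the class, property names and format string are illustrative, not taken from the question:

// using System; using System.Xml.Serialization;
public class Reading
{
    // Hidden from the XmlSerializer; used by the rest of the code as normal.
    [XmlIgnore]
    public DateTime ReadingsDateTime { get; set; }

    // Serialized in its place, with an explicit format that always emits three fractional digits.
    [XmlElement("ReadingsDateTime")]
    public string ReadingsDateTimeXml
    {
        get { return ReadingsDateTime.ToString("yyyy-MM-ddTHH:mm:ss.fffZ"); }
        set { ReadingsDateTime = DateTime.Parse(value, null, System.Globalization.DateTimeStyles.RoundtripKind); }
    }
}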
I just want to ask whether there is any possibility of getting the total number of characters in a file while reading it as a CSV. I don't want to load the file into memory twice (once for parsing, a second time for counting).
I need to parse a CSV file, but I also need the total number of characters in the file (including delimiters). Does anyone have an idea of the most efficient way to do that?
using (TextReader stream = new StreamReader(file.OpenReadStream()))
{
    CsvReader reader = new CsvReader(stream, GetCsvReaderOptions());
    while (reader.Read())
    {
        // parsing
    }
}
One option is to iterate through all the fields in the current reader row and at the end increment the length by the delimiters (the number of fields corresponds to the number of delimiters).
I also have the idea of counting characters on the parsed objects via reflection (getting all property values from each object).
I don't think either of these options would be efficient.
Thanks in advance.
You can use reader.Context.RawRecord and remove the line endings (assuming you don't want to count those):
using (TextReader stream = new StreamReader(file.OpenReadStream()))
{
    var count = 0;
    CsvReader reader = new CsvReader(stream, GetCsvReaderOptions());
    while (reader.Read())
    {
        count += reader.Context.RawRecord.Replace("\n", "").Replace("\r", "").Length;
        // parsing
    }
}
The basic way of doing this could be the following:
using (TextReader stream = new StreamReader(file.OpenReadStream()))
{
    var content = stream.ReadToEnd();
    var length = content.Length;
}
The variable length will then contain the count of all characters in the passed file.
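If holding the file contents in memory once is acceptable, the same string can then be fed back into CsvHelper through a StringReader, so the file itself is only read one time. A sketch, reusing the GetCsvReaderOptions() helper from the question:

using (TextReader stream = new StreamReader(file.OpenReadStream()))
{
    // One pass over the file: the string gives the total character count, delimiters included.
    var content = stream.ReadToEnd();
    var length = content.Length;

    // Parse the in-memory string instead of re-reading the file.
    using (var reader = new CsvReader(new StringReader(content), GetCsvReaderOptions()))
    {
        while (reader.Read())
        {
            // parsing
        }
    }
}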
In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.
It looks like this method ignores the BOM (byte order mark), which may legitimately be part of the binary representation of a UTF-8 string, and treats it as a character.
I know I can use a TextReader to digest the BOM as needed, but I thought the GetString method would act as a kind of shortcut that keeps the code shorter.
Am I missing something? Is this intentional?
Here's the reproduction code:
static void Main(string[] args)
{
    string s1 = "abc";
    byte[] abcWithBom;
    using (var ms = new MemoryStream())
    using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
    {
        sw.Write(s1);
        sw.Flush();
        abcWithBom = ms.ToArray();
        Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
    }
    byte[] abcWithoutBom;
    using (var ms = new MemoryStream())
    using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
    {
        sw.Write(s1);
        sw.Flush();
        abcWithoutBom = ms.ToArray();
        Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
    }
    var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
    Console.WriteLine(restore1.Length); // 3
    Console.WriteLine(restore1); // abc
    var restore2 = Encoding.UTF8.GetString(abcWithBom);
    Console.WriteLine(restore2.Length); // 4 (!)
    Console.WriteLine(restore2); // ?abc
}

private static string FormatArray(byte[] bytes1)
{
    return string.Join(", ", from b in bytes1 select b.ToString("x"));
}
It looks like this method ignores the BOM (byte order mark), which may legitimately be part of the binary representation of a UTF-8 string, and treats it as a character.
It doesn't look like it "ignores" it at all - it faithfully converts it to the BOM character. That's what it is, after all.
If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
Note that if you either use Encoding.GetBytes followed by Encoding.GetString, or use StreamWriter followed by StreamReader, both pairings will either produce and then swallow the BOM, or not produce it at all. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
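To make that concrete: the three preamble bytes decode to the single character U+FEFF, so if you are stuck with a raw byte array you can trim that character off yourself. A minimal sketch using the abcWithBom array from the question:

var text = Encoding.UTF8.GetString(abcWithBom);
// The UTF-8 preamble ef bb bf decodes to U+FEFF (ZERO WIDTH NO-BREAK SPACE); drop it if present.
if (text.Length > 0 && text[0] == '\uFEFF')
{
    text = text.Substring(1);
}
Console.WriteLine(text.Length); // 3
Console.WriteLine(text);        // abc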
Based on the answer by Jon Skeet (thanks!), this is how I just did it:
var memoryStream = new MemoryStream(byteArray);
var s = new StreamReader(memoryStream).ReadToEnd();
Note that this will probably only work reliably if there is a BOM in the byte array you are reading from. If not, you might want to look into another StreamReader constructor overload which takes an Encoding parameter so you can tell it what the byte array contains.
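For completeness, the overload mentioned above looks like this (a sketch; byteArray is the same array as in the snippet above):

var memoryStream = new MemoryStream(byteArray);
// Tell the reader what the bytes contain; it will still detect and skip a matching BOM if one is present.
var s = new StreamReader(memoryStream, Encoding.UTF8).ReadToEnd();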
For those who do not want to use streams, I found a quite simple solution using LINQ:
public static string GetStringExcludeBOMPreamble(this Encoding encoding, byte[] bytes)
{
    var preamble = encoding.GetPreamble();
    if (preamble?.Length > 0 && bytes.Length >= preamble.Length && bytes.Take(preamble.Length).SequenceEqual(preamble))
    {
        return encoding.GetString(bytes, preamble.Length, bytes.Length - preamble.Length);
    }
    else
    {
        return encoding.GetString(bytes);
    }
}
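Note that this needs a using System.Linq directive for Take and SequenceEqual, and the method has to live in a static class. Usage with the arrays from the reproduction code above would look like this:

var withBom = Encoding.UTF8.GetStringExcludeBOMPreamble(abcWithBom);       // "abc", Length 3
var withoutBom = Encoding.UTF8.GetStringExcludeBOMPreamble(abcWithoutBom); // "abc", Length 3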
I know I am kind of late to the party, but here's the code I am using (feel free to adapt it to C#) if you need it:
Public Function Serialize(Of YourXMLClass)(ByVal obj As YourXMLClass,
                                            Optional ByVal omitXMLDeclaration As Boolean = True,
                                            Optional ByVal omitXMLNamespace As Boolean = True) As String
    Dim serializer As New XmlSerializer(obj.GetType)
    Using memStream As New MemoryStream()
        Dim settings As New XmlWriterSettings() With {
            .Encoding = Encoding.UTF8,
            .Indent = True,
            .OmitXmlDeclaration = omitXMLDeclaration}
        Using writer As XmlWriter = XmlWriter.Create(memStream, settings)
            Dim xns As New XmlSerializerNamespaces
            If (omitXMLNamespace) Then xns.Add("", "")
            serializer.Serialize(writer, obj, xns)
        End Using
        Return Encoding.UTF8.GetString(memStream.ToArray())
    End Using
End Function

Public Function Deserialize(Of YourXMLClass)(ByVal obj As YourXMLClass, ByVal xml As String) As YourXMLClass
    Dim result As YourXMLClass
    Dim serializer As New XmlSerializer(GetType(YourXMLClass))
    Using memStream As New MemoryStream()
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(xml)
        memStream.Write(bytes, 0, bytes.Length)
        memStream.Seek(0, SeekOrigin.Begin)
        Using reader As XmlReader = XmlReader.Create(memStream)
            result = DirectCast(serializer.Deserialize(reader), YourXMLClass)
        End Using
    End Using
    Return result
End Function
I have some JSON from a third-party system that contains backslashes in the value. For example:
string extract = @"{""key"": ""\/Date(2015-02-02)\/""}";
which, without the C# string escaping, corresponds to the string:
{"key": "\/Date(2015-02-02)\/"}
I'd like to be able to format (e.g. indent) this JSON.
Typically for formatting, I might use something like JsonConvert like so:
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract), Formatting.Indented)
This doesn't quite work: it sees the value as a date, but because it's not in the standard MS format of \/Date(ticks)\/, it converts it to a date of 1 Jan 1970:
{
  "key": "1970-01-01T00:00:02.015+00:00"
}
The next approach is to use the serializer settings to stop it converting dates (I'm not bothered whether it recognises the field as a date, although that would probably be handy later on):
JsonSerializerSettings settings = new JsonSerializerSettings
{
    DateParseHandling = DateParseHandling.None,
};
JsonConvert.SerializeObject(JsonConvert.DeserializeObject(extract, settings), Formatting.Indented);
This appears to have treated the backslash as an escape character during deserialization, so it is "lost" by the time I see the final result:
{
  "key": "/Date(2015-02-02)/"
}
Is there a way to format the JSON in C# (with or without JsonConvert) that preserves the backslash in the value?
Note that the real JSON I am dealing with is (a) reasonably large, but not too large for a regex/find-replace solution if really necessary, and (b) not under my control, so I can't change the format. I'm sure the answer is already on Stack Overflow, but I'm finding it difficult to find the right search terms...
Have you tried:
extract = extract.Replace("\\","\\\\");
before parsing the string?
The basic problem is that, in a JSON string literal, the escaped solidus "\/" means exactly the same as the unescaped solidus "/", and Json.NET parses and interprets this escaping at a very low level, namely JsonTextReader.ReadStringIntoBuffer(). Thus there's no way for higher level code to detect and remember whether a string literal was formatted as "\/Date(2015-02-02)\/" or "/Date(2015-02-02)/" and later write back one or the other as appropriate.
If you are OK with always adding the extra escaping to strings that start with /Date( and end with )/, you can use a custom subclass of JsonTextWriter to do this:
public class DateLiteralJsonTextWriter : JsonTextWriter
{
    public DateLiteralJsonTextWriter(TextWriter writer) : base(writer) { }

    public override void WriteValue(string value)
    {
        const string startToken = @"/Date(";
        const string replacementStartToken = @"\/Date(";
        const string endToken = @")/";
        const string replacementEndToken = @")\/";

        if (value != null && value.StartsWith(startToken) && value.EndsWith(endToken))
        {
            var sb = new StringBuilder();
            // Add the initial quote.
            sb.Append(QuoteChar);
            // Add the new start token.
            sb.Append(replacementStartToken);
            // Add any necessary escaping to the innards of the "/Date(.*)/" string.
            using (var writer = new StringWriter(sb))
            using (var jsonWriter = new JsonTextWriter(writer) { StringEscapeHandling = this.StringEscapeHandling, Culture = this.Culture, QuoteChar = '\"' })
            {
                var content = value.Substring(startToken.Length, value.Length - startToken.Length - endToken.Length);
                jsonWriter.WriteValue(content);
            }
            // Strip the embedded quotes from the above.
            sb.Remove(replacementStartToken.Length + 1, 1);
            sb.Remove(sb.Length - 1, 1);
            // Add the replacement end token and final quote.
            sb.Append(replacementEndToken);
            sb.Append(QuoteChar);
            // Write without any further escaping.
            WriteRawValue(sb.ToString());
        }
        else
        {
            base.WriteValue(value);
        }
    }
}
Then parse with DateParseHandling = DateParseHandling.None as you are currently doing:
var settings = new JsonSerializerSettings { DateParseHandling = DateParseHandling.None };

var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
using (var jsonWriter = new DateLiteralJsonTextWriter(writer) { Formatting = Formatting.Indented })
{
    JsonSerializer.CreateDefault(settings).Serialize(jsonWriter, JsonConvert.DeserializeObject(extract, settings));
}

Console.WriteLine(sb);
This prints:
{
  "key": "\/Date(2015-02-02)\/"
}
I have a string of XML (UTF-8). I need to store the string in the database (MS SQL Server), and the encoding of the string must be UTF-16.
This code does not work; utf16Xml ends up empty:
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
utf8Xml is a string that contains a serialized object (UTF-8 encoding).
How do I convert the XML string from UTF-8 to UTF-16?
This might help you
MemoryStream ms = new MemoryStream();
XmlWriterSettings xws = new XmlWriterSettings();
xws.OmitXmlDeclaration = true;
xws.Indent = true;

XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
using (XmlWriter xw = XmlWriter.Create(ms, xws))
{
    xDoc.WriteTo(xw);
}

Encoding utf8 = Encoding.UTF8;
Encoding utf16 = Encoding.Unicode;
byte[] utf16XmlArray = Encoding.Convert(utf8, utf16, ms.ToArray());
var utf16Xml = Encoding.Unicode.GetString(utf16XmlArray);
Given that XDocument.Parse only accepts a string, and that string in .NET is always UTF-16 Little Endian, it looks like you are going through a lot of steps to effectively do nothing. Either:
The string – utf8Xml – is already UTF-16 LE and can be inserted into SQL Server as is (i.e. do nothing) as SqlDbType.Xml or SqlDbType.NVarChar,
or
utf8Xml somehow contains UTF-8 byte sequences, which would be invalid UTF-16 LE (i.e. "Unicode" in Microsoft-land) byte sequences. If this is the case, then you might be able to simply:
add the XML Declaration, stating that the encoding is UTF-8:
xDoc.Declaration.Encoding = "utf-8";
do not omit the XML declaration:
OmitXmlDeclaration = false;
pass utf8Xml into SQL Server as DbType.VarChar
For further explanation, please see my answer to the related question (here on S.O.):
How to solve “unable to switch the encoding” error when inserting XML into SQL Server
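As a rough sketch of the first option, the string can be passed straight into an xml (or nvarchar) column through a parameter. The connection string, table and column names here are made up for illustration:

// using System.Data; using System.Data.SqlClient; (or Microsoft.Data.SqlClient)
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("INSERT INTO dbo.Readings (Payload) VALUES (@payload)", connection))
{
    // SqlDbType.Xml (or NVarChar) keeps the string as UTF-16 on the way in; no re-encoding needed.
    command.Parameters.Add(new SqlParameter("@payload", SqlDbType.Xml) { Value = utf8Xml });
    connection.Open();
    command.ExecuteNonQuery();
}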
.NET's XmlTextWriter creates invalid XML files.
In XML, some control characters are allowed, like 'horizontal tab' (U+0009), but others are not, like 'vertical tab' (U+000B). (See the spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is of course still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello &#xB; world</Test>
This behaviour is documented, somewhat hidden away, in the documentation of the WriteString method, but it sounds like it applies to the whole class:
The default behavior of an XmlWriter created using Create is to throw an ArgumentException when attempting to write character values in the range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD). These invalid XML characters can be written by creating the XmlWriter with the CheckCharacters property set to false. Doing so will result in the characters being replaced with numeric character entities (&#0; through &#x1F;). Additionally, an XmlTextWriter created with the new operator will replace the invalid characters with numeric character entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution would be to use the XmlWriter class instead.
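To illustrate the difference, a writer created through XmlWriter.Create refuses the character outright instead of quietly emitting an escaped form. A minimal sketch:

// using System; using System.Text; using System.Xml;
var sb = new StringBuilder();
var writer = XmlWriter.Create(sb);
writer.WriteStartDocument();
writer.WriteStartElement("Test");
try
{
    // CheckCharacters is true by default for writers created via Create,
    // so the vertical tab is rejected rather than escaped.
    writer.WriteValue("hello \xb world");
}
catch (ArgumentException ex)
{
    Console.WriteLine(ex.Message); // reports that hexadecimal value 0x0B is an invalid character
}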
I just found this question when I was struggling with the same issue, and I ended up solving it with a regex:
return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
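For example, wrapped in a small helper and applied to the string from the question (the helper name is mine, and writer stands for whichever XmlWriter you are using):

// using System.Text.RegularExpressions;
static string StripInvalidXmlChars(string s)
{
    // Removes the C0 control characters XML 1.0 does not allow
    // (everything below 0x20 except tab, line feed and carriage return).
    return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
}

// "hello \xb world" becomes "hello  world" before it ever reaches the writer.
writer.WriteValue(StripInvalidXmlChars("hello \xb world"));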
Built-in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid, though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If setting CheckCharacters to true (which it is by default) is a bit too strict, since it will simply throw an exception, an alternative approach that is more lenient towards invalid XML characters is to just strip them:
Googling a bit yielded the whitelist-based XmlTextEncoder; however, it will also remove DEL and other characters in the ranges U+007F–U+0084 and U+0086–U+009F which, according to the Valid XML Characters section on Wikipedia, are only valid in certain contexts, and which the spec mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
    private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
        { '&', "&amp;" }, { '<', "&lt;" }, { '>', "&gt;" },
        { '"', "&quot;" }, { '\'', "&apos;" }
    };

    public static string ToValidXmlString(this string str)
    {
        var stripped = str
            .Select((c, i) => new
            {
                c1 = c,
                c2 = i + 1 < str.Length ? str[i + 1] : default(char),
                v = XmlConvert.IsXmlChar(c),
                p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
                pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
            })
            .Aggregate("", (s, c) => {
                if (c.pp)
                    return s;
                if (textEntities.ContainsKey(c.c1))
                    s += textEntities[c.c1];
                else if (c.v)
                    s += c.c1.ToString();
                else if (c.p)
                    s += c.c1.ToString() + c.c2.ToString();
                return s;
            });
        return stripped;
    }
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL, which XmlConvert.IsXmlChar, Wikipedia, and the spec all mark as a valid (although discouraged) character.
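For example (the second line shows that markup characters come back entity-encoded, so the result is intended to be written as raw markup rather than passed through WriteString again):

var a = "hello \xb world".ToValidXmlString(); // "hello  world" (the vertical tab is dropped)
var b = "5 < 6 & 7 > 2".ToValidXmlString();   // "5 &lt; 6 &amp; 7 &gt; 2"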