Read Parts of an Xml File through Stream instead of only one - c#

So I've been working on an old piece of code for a project.
I've managed to optimize it for 64-bit usage.
There's only one issue: XmlSerializer.Deserialize breaks because the input text / deserialized data is too big (it overflows the 2 GB int limit).
I've tried to find a fix, but no answer was helpful.
Here's the code in question:
if (File.Exists(dir + "/" + fileName))
{
    string XmlString = File.ReadAllText(dir + "/" + fileName, Encoding.UTF8);
    BXML_LIST deserialized;
    using (MemoryStream input = new MemoryStream(Encoding.UTF8.GetBytes(XmlString)))
    {
        using (XmlTextReader xmlTextReader = new XmlTextReader(input))
        {
            xmlTextReader.Normalization = false;
            XmlSerializer xmlSerializer = new XmlSerializer(typeof(BXML_LIST));
            deserialized = (BXML_LIST)xmlSerializer.Deserialize(xmlTextReader);
        }
    }
    xml_list.Add(deserialized);
}
Following many questions asked here, I thought I could use a method to "split" the XML file (while keeping the same BXML_LIST type),
then deserialize each part and finally combine the parts to match the original content. That would avoid the overflow error that occurs when deserializing the whole file at once.
Thing is, I have no idea how to implement this. Any help or guidance would be amazing!
// Edit 1:
I've found a piece of code on another site; I don't know whether it is a reliable way to combine the split XML files:
var xml1 = XDocument.Load("file1.xml");
var xml2 = XDocument.Load("file2.xml");
// Combine and remove duplicates
var combinedUnique = xml1.Descendants("AllNodes")
    .Union(xml2.Descendants("AllNodes"));
// Combine and keep duplicates
var combinedWithDups = xml1.Descendants("AllNodes")
    .Concat(xml2.Descendants("AllNodes"));

Your code uses memory very inefficiently:
string XmlString = File.ReadAllText - Here you load the entire file into memory for the first time.
Encoding.UTF8.GetBytes(XmlString) - Here you spend memory on the same data a second time.
new MemoryStream(...) - Here you spend memory on the same data a third time.
xmlSerializer.Deserialize - Here, memory is spent again on the deserialized data, but there's no getting away from that.
Write it like this instead:
using (XmlReader xmlReader = XmlReader.Create(dir + "/" + fileName))
{
    XmlSerializer xmlSerializer = new XmlSerializer(typeof(BXML_LIST));
    deserialized = (BXML_LIST)xmlSerializer.Deserialize(xmlReader);
}
In this case, xmlSerializer will read data from the file using xmlReader in a stream, in parts.
Perhaps, this may be enough to solve your problem.
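If a single streamed Deserialize call is still too much, another option is to deserialize one child element at a time. The following is only a sketch under assumptions: the Entry class and the "Entry" element name are hypothetical stand-ins, because the real layout of BXML_LIST is not shown in the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical item type; replace it with the real child type of BXML_LIST.
public class Entry
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class ChunkedReader
{
    // Lazily yields one deserialized Entry per <Entry> element,
    // so only the current item has to be materialized at a time.
    public static IEnumerable<Entry> ReadEntries(TextReader source)
    {
        var serializer = new XmlSerializer(typeof(Entry));
        using (XmlReader reader = XmlReader.Create(source))
        {
            reader.MoveToContent(); // position on the root element
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Entry")
                {
                    // ReadSubtree confines the serializer to this one element.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        yield return (Entry)serializer.Deserialize(subtree);
                    }
                    // Once the subtree reader is closed, the outer reader sits
                    // on </Entry>; the Read() below moves past it.
                }
                reader.Read();
            }
        }
    }
}
```

For a file, pass a StreamReader opened on the path. Note that collecting every item into one list still needs memory for all of the deserialized objects; the gain is that the raw XML is never held in memory as a whole.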


How to get full path of file using File.WriteAllText in C#?

I have added a key to the settings file: <add key="Test.Directory" value="Data/Test/XML_Files" />. I need to pass this path to File.WriteAllText and append the file name to get c:/Data/Test/XML_Files/TestFile, but the resulting path only goes as far as c:/Data/Test/XML_Files. Please help me concatenate these to get the full path.
var xmlFilePath = ConfigurationManager.AppSettings["Test.Directory"];
string _xmlFileName = new DirectoryInfo(Path.GetFullPath(xmlFilePath));
string Records = string.Empty;
using (StringWriter Writer = new Utf8StringWriter())
{
    xmlSerializer.Serialize(Writer, itemList);
    Records = Writer.ToString();
}
File.WriteAllText(string.Format(@_xmlFileName + "'\'TestFile" + ".dat" + DateTime.Now.ToString("yyyyMMddHHmmssfff") + Guid.NewGuid().ToString().Substring(1, 5) ), Records);
Your error seems to be the quotes added around the backslash before the TestFile constant. But I strongly suggest that you build the file name in clearer steps and use Path.Combine to create the final full file name:
string timestamp = DateTime.Now.ToString("yyyyMMddHHmmssfff") +
                   Guid.NewGuid().ToString().Substring(1, 5);
string file = Path.Combine(_xmlFileName, $"TestFile-{timestamp}.dat");
File.WriteAllText(file, Records);
Of course you could put everything on a single line, but that brings no noticeable performance advantage and makes the code much harder to understand. (Note, for example, that your original code puts the datetime/guid part after the extension, which is probably an oversight caused by the complexity of the expression.)
Maybe you can try something like this:
var folderName = Path.Combine(@_xmlFileName, "TestFile");
var fileName = $@"{folderName}\{DateTime.Now:yyyyMMddHHmmssfff}.dat";
File.WriteAllText(fileName, Records);

C# XmlReader reads XML wrong and different based on how I invoke the reader's methods

So my current understanding of how the C# XmlReader works is that it takes a given XML File and reads it node-by-node when I wrap it in a following construct:
using System.Xml;
using System;
using System.Diagnostics;
...
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
settings.IgnoreWhitespace = true;
settings.IgnoreProcessingInstructions = true;
using (XmlReader reader = XmlReader.Create(path, settings))
{
    while (reader.Read())
    {
        // All reader methods I call here will reference the current node
        // until I move the pointer to some further node by calling methods like
        // reader.Read(), reader.MoveToContent(), reader.MoveToElement() etc
    }
}
Why will the following two snippets (within the above construct) produce two very different results, even though they both call the same methods?
I used this example file for testing.
Snippet 1:
Debug.WriteLine(new string(' ', reader.Depth * 2) + "<" + reader.NodeType.ToString() + "|" + reader.Name + ">" + reader.ReadString() + "</>");
vs.
Snippet 2:
string xmlcontent = reader.ReadString();
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Debug.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
Output of Snippet 1:
<XmlDeclaration|xml></>
<Element|rss></>
<Element|head></>
<Text|>Test Xml File</>
<Element|description>This will test my xml reader</>
<EndElement|head></>
<Element|body></>
<Element|g:id>1QBX23</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<Element|item></>
<Text|>2QXB32</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Yes, this is formatted as it was in my output window. As you can see, it skipped certain elements and output a wrong depth for a few others. The NodeTypes, however, are correct, unlike in Snippet 2, which outputs:
<XmlDeclaration|xml></>
<Element|xml></>
<Element|title></>
<EndElement|title>Test Xml File</>
<EndElement|description>This will test my xml reader</>
<EndElement|head></>
<Element|item></>
<EndElement|g:id>1QBX23</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<Element|g:id></>
<EndElement|g:id>2QXB32</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Once again, the depth is messed up, but it's not as critical as with Snippet Number 1. It also skipped some elements and assigned wrong NodeTypes.
Why can't it output the expected result? And why do these two snippets produce two totally different outputs with different depths, NodeTypes and skipped nodes?
I'd appreciate any help on this. I searched a lot for any answers on this but it seems like I'm the only one experiencing these issues. I'm using the .NET Framework 4.6.2 with Asp.net Web Forms in Visual Studio 2017.
Firstly, you are using XmlReader.ReadString(), a method that the documentation recommends against:
XmlReader.ReadString Method
... reads the contents of an element or text node as a string. However, we recommend that you use the ReadElementContentAsString method instead, because it provides a more straightforward way to handle this operation.
However, beyond warning us off the method, the documentation doesn't precisely specify what it actually does. To determine that, we need to go to the reference source:
public virtual string ReadString() {
    if (this.ReadState != ReadState.Interactive) {
        return string.Empty;
    }
    this.MoveToElement();
    if (this.NodeType == XmlNodeType.Element) {
        if (this.IsEmptyElement) {
            return string.Empty;
        }
        else if (!this.Read()) {
            throw new InvalidOperationException(Res.GetString(Res.Xml_InvalidOperation));
        }
        if (this.NodeType == XmlNodeType.EndElement) {
            return string.Empty;
        }
    }
    string result = string.Empty;
    while (IsTextualNode(this.NodeType)) {
        result += this.Value;
        if (!this.Read()) {
            break;
        }
    }
    return result;
}
This method does the following:
1. If the current node is an empty element node, return an empty string.
2. If the current node is an element that is not empty, advance the reader.
3. If the now-current node is the end of the element, return an empty string.
4. While the current node is a text node, add the text to a string and advance the reader. As soon as the current node is not a text node, return the accumulated string.
Thus we can see that this method is designed to advance the reader. We can also see that, given mixed-content XML like <head>text <b>BOLD</b> more text</head>, ReadString() will only partially read the <head> element, leaving the reader positioned on <b>. This oddity is likely why Microsoft discourages use of the method.
We can also see why your two snippets function differently. In the first, you get reader.Depth and reader.NodeType before calling ReadString() and advancing the reader. In the second you get these properties after advancing the reader.
Since your intent is to iterate through the nodes and get the value of each, rather than ReadString() or ReadElementContentAsString() you should just use XmlReader.Value:
gets the text value of the current node.
Thus your corrected code should look like:
string xmlcontent = reader.Value;
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Console.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
XmlReader is tricky to work with. You always need to check the documentation to determine exactly where a given method positions the reader. For instance, XmlReader.ReadElementContentAsString() moves the reader past the end of the element, whereas XmlReader.ReadSubtree() moves the reader to the end of the element. But as a general rule any method named Read is going to advance the reader, so you need to be careful using a Read method inside an outer while (reader.Read()) loop.
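To make the last point concrete, here is a minimal sketch of a loop that uses ReadElementContentAsString() without skipping nodes. The TitleReader helper and the "title" element name are assumptions chosen for illustration. Because ReadElementContentAsString() already leaves the reader on the node after the end tag, the loop calls Read() only when it did not consume an element itself.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TitleReader
{
    // Collects the text of every <title> element (hypothetical element name).
    public static List<string> ReadTitles(string xml)
    {
        var titles = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent(); // position on the root element
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "title")
                {
                    // Reads the text and moves PAST </title>,
                    // so no extra Read() must follow in this branch.
                    titles.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return titles;
    }
}
```

With input such as <items><title>A</title><title>B</title></items>, the same body wrapped in while (reader.Read()) would step over the second <title>, because the reader is already positioned on it when the loop calls Read() again.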

Roslyn Formatter.Format() takes too long

I am using Roslyn to generate a large amount of code (about 60k lines).
The problem comes when I use Formatter.Format() to format the whitespace. The actual formatting takes way too long (~60k lines in ~200s).
This is the code I use:
public string GenerateCode()
{
    var workspace = new AdhocWorkspace();
    OptionSet options = workspace.Options;
    options = options.WithChangedOption(CSharpFormattingOptions.NewLinesForBracesInMethods, true);
    options = options.WithChangedOption(CSharpFormattingOptions.NewLinesForBracesInProperties, true);
    CompilationUnitSyntax compilationUnit = CreateCompilationUnit(); // this method builds the syntax tree.
    SyntaxNode formattedNode = Formatter.Format(compilationUnit, workspace, options);
    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        formattedNode.WriteTo(writer);
    }
    return sb.ToString();
}
I came to the realization that human-readable formatting is not essential (though it would still be nice). I stopped formatting the code, but then the generated code does not compile, because some keywords lack the necessary whitespace around them. For example: "publicstaticclassMyClass".
I tried different options of the Formatter, but none were sufficient.
Then I looked for an alternative "minimal" formatter. To my knowledge, there isn't one.
Finally, I managed to solve this by putting extra whitespace in the identifiers themselves.
var className = "MyClass";
SyntaxFactory.ClassDeclaration(" " + className)
    .AddModifiers(
        // Using explicit identifier with extra whitespace
        SF.Identifier(" public "),
        SF.Identifier(" static "));
        // Instead of the SyntaxKind enum
        //SF.Token(SyntaxKind.PublicKeyword),
        //SF.Token(SyntaxKind.StaticKeyword));
And for the code generation.
public string GenerateCode()
{
    var workspace = new AdhocWorkspace();
    CompilationUnitSyntax compilationUnit = CreateCompilationUnit(); // this method builds the syntax tree.
    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        compilationUnit.WriteTo(writer);
    }
    return sb.ToString();
}
This way the generation is much faster (~60k lines in ~2s; not really lines, since the output is not formatted, but it is the same amount of code). Although this works, it seems kind of hacky. Another solution might be to create an alternative Formatter, but that is a task I don't wish to undertake.
Did anyone come up with a better solution? Is there some way to use the Formatter more efficiently?
Note: The time measurements provided include the time of building the syntax tree and several other procedures. In both cases the formatting is about 98% of the measured time. So it is still possible to use them for comparison.
The "minimal formatter" you're looking for is the .NormalizeWhitespace() method.
It's not suitable for code you intend humans to maintain, but I'm assuming that shouldn't be an issue since you're dealing with a 60k line file!
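As a rough illustration, the sketch below builds the class from the question with ordinary keyword tokens and lets NormalizeWhitespace() insert the separators. It assumes the Microsoft.CodeAnalysis.CSharp package that the question already uses; the MinimalFormatDemo wrapper is a hypothetical name, not part of the original code.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class MinimalFormatDemo
{
    public static string Generate()
    {
        // Plain keyword tokens, no whitespace smuggled into identifiers.
        ClassDeclarationSyntax cls = SyntaxFactory.ClassDeclaration("MyClass")
            .AddModifiers(
                SyntaxFactory.Token(SyntaxKind.PublicKeyword),
                SyntaxFactory.Token(SyntaxKind.StaticKeyword));

        CompilationUnitSyntax unit = SyntaxFactory.CompilationUnit().AddMembers(cls);

        // NormalizeWhitespace rewrites trivia with default indentation and
        // line breaks; it needs no workspace and no formatting options.
        return unit.NormalizeWhitespace().ToFullString();
    }
}
```

The returned text starts with "public static class MyClass", so the workaround of padding identifiers with spaces is no longer needed. How much faster this is than Formatter.Format() on a 60k-line tree would have to be measured.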

How to get only xml from file in c#?

I have a problem with parsing file with XmlReader. I have a file containing info like this:
<Users>
  <User>
    <Email>email</Email>
    <Key>23456</Key>
  </User>
</Users>
asdfsof48f43uf489f3yf3y39fh3f489f3hf94[t]45.54tv,]5t
The file contains XML values followed by encrypted data from a byte[] array.
The problem I've encountered is that when I use:
using (var reader = XmlReader.Create(fileName))
{
    while (reader.Read())
    {
        //parsing
    }
}
I get a 'System.Xml.XmlException' at the line where the encrypted bytes begin.
My question is: how do I retrieve only the XML part and only the byte[] part?
If the encrypted data is always the last line, you can use the snippet below to read only the XML part, provided the XML data is limited in size:
var fileLines = File.ReadAllLines(@"c:\temp\file.txt");
var xmlFromFile = string.Join("", fileLines, 0, fileLines.Length - 1);
using (var reader = XmlReader.Create(new StringReader(xmlFromFile)))
{
    // Your logic goes here
}
You can do string parsing:
int start, end;
string myFile = File.ReadAllText("...");
start = myFile.IndexOf("<Users>");
end = myFile.IndexOf("</Users>") + 8; // 8 is the length of "</Users>"
myFile = myFile.Substring(start, end - start);
At that point you can load it into an XML document if you want. This all depends on you being 100% sure about the file format. This is a pretty fragile answer, so don't use it unless you have total trust in your input file.
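Neither snippet returns the byte[] part, and reading encrypted bytes as text can corrupt them. A byte-level variant of the same idea might look like the sketch below. It is just as fragile as the string version and rests on assumptions: the file is UTF-8 without a byte order mark, the closing tag occurs only once before the payload, and a single line break separates the two parts. XmlPayloadSplitter is a hypothetical helper name.

```csharp
using System;
using System.Text;

public static class XmlPayloadSplitter
{
    // Returns the XML text up to and including closingTag;
    // the bytes after it are returned through 'payload'.
    public static string Split(byte[] data, string closingTag, out byte[] payload)
    {
        byte[] tag = Encoding.UTF8.GetBytes(closingTag);
        int tagStart = IndexOf(data, tag);
        if (tagStart < 0)
            throw new FormatException("Closing tag not found: " + closingTag);

        int xmlEnd = tagStart + tag.Length;
        string xml = Encoding.UTF8.GetString(data, 0, xmlEnd);

        // Assumption: at most one line break (CR LF or LF) precedes the payload.
        int payloadStart = xmlEnd;
        if (payloadStart < data.Length && data[payloadStart] == (byte)'\r') payloadStart++;
        if (payloadStart < data.Length && data[payloadStart] == (byte)'\n') payloadStart++;

        payload = new byte[data.Length - payloadStart];
        Array.Copy(data, payloadStart, payload, 0, payload.Length);
        return xml;
    }

    // Naive search for the first occurrence of 'needle' in 'haystack'.
    private static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

Typical usage would be to pass File.ReadAllBytes(fileName) and "</Users>", feed the returned string to XmlReader.Create(new StringReader(xml)), and hand the payload to the decryption code.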

Using XDocument to write raw XML

I'm trying to create a spreadsheet in XML Spreadsheet 2003 format (so Excel can read it). I'm writing out the document using the XDocument class, and I need to get a newline in the body of one of the <Cell> tags. Excel, when it reads and writes, requires the files to have the literal string &#10; embedded in the string to correctly show the newline in the spreadsheet. It also writes it out as such.
The problem is that XDocument is writing CR-LF (\r\n) when I have newlines in my data, and it automatically escapes ampersands for me when I try to do a .Replace() on the input string, so I end up with &amp;#10; in my file, which Excel just happily writes out as a string literal.
Is there any way to make XDocument write out the literal &#10; as part of the XML stream? I know I can do it by deriving from XmlTextWriter, or literally just writing out the file with a TextWriter, but I'd prefer not to if possible.
I wonder if it might be better to use XmlWriter directly, and WriteRaw?
A quick check shows that XmlDocument makes a slightly better job of it, but xml and whitespace gets tricky very quickly...
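A minimal sketch of the WriteRaw idea, assuming a bare <Data> element rather than the full workbook. WriteRaw emits its argument without escaping, so the character reference reaches the output unchanged. RawNewlineDemo is a hypothetical name used only for this example.

```csharp
using System.Text;
using System.Xml;

public static class RawNewlineDemo
{
    public static string BuildCell()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Data");
            writer.WriteString("Hello"); // escaped as usual
            writer.WriteRaw("&#10;");    // written verbatim, no escaping
            writer.WriteString("World");
            writer.WriteEndElement();
        }
        return sb.ToString(); // <Data>Hello&#10;World</Data>
    }
}
```

Because WriteRaw bypasses escaping and validation, the caller is responsible for passing only well-formed markup to it.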
I battled with this problem for a couple of days and finally came up with this solution. I used the XmlDocument.Save(Stream) method, then got the formatted XML string from the stream. Then I replaced the &amp;#10; occurrences with &#10; and used a TextWriter to write the string to a file.
string xml = "<?xml version=\"1.0\"?><?mso-application progid='Excel.Sheet'?><Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\" xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:x=\"urn:schemas-microsoft-com:office:excel\" xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\" xmlns:html=\"http://www.w3.org/TR/REC-html40\">";
xml += "<Styles><Style ss:ID=\"s1\"><Alignment ss:Vertical=\"Center\" ss:WrapText=\"1\"/></Style></Styles>";
xml += "<Worksheet ss:Name=\"Default\"><Table><Column ss:Index=\"1\" ss:AutoFitWidth=\"0\" ss:Width=\"75\" /><Row><Cell ss:StyleID=\"s1\"><Data ss:Type=\"String\">Hello&amp;#10;&amp;#10;World</Data></Cell></Row></Table></Worksheet></Workbook>";
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.LoadXml(xml); //load the xml string
System.IO.MemoryStream stream = new System.IO.MemoryStream();
doc.Save(stream); //save the xml as a formatted string
stream.Position = 0; //reset the stream position since it will be at the end from the Save method
System.IO.StreamReader reader = new System.IO.StreamReader(stream);
string formattedXML = reader.ReadToEnd(); //fetch the formatted XML into a string
formattedXML = formattedXML.Replace("&amp;#10;", "&#10;"); //Replace the unhelpful &amp;#10;'s with the wanted endline entity
System.IO.TextWriter writer = new System.IO.StreamWriter("C:\\Temp\\test1.xls"); //both backslashes escaped; "\t" would be a tab
writer.Write(formattedXML); //write the XML to a file
writer.Close();
