Fastest way to parse byte array? - c#

I'm currently trying to parse an XML string to get various datapoints. My code below works but is eating up a ton of CPU usage, so I want to optimize it any way possible.
public static List<Purchase> ParsePurchases(Profile profile, byte[] data)
{
    // Parse the profile XML and extract purchases
    using (var ms = new MemoryStream(data))
    {
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            // read the data into a string
            var xmlString = reader.ReadToEnd();
            // create the DOM over it
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xmlString);
            var purchaseElements = doc.GetElementsByTagName("purchase");
            List<Purchase> purchases = new List<Purchase>();
            for (var e = 0; e < purchaseElements.Count; e++)
            {
                var ele = (XmlElement)purchaseElements[e];
                purchases.Add(
                    new Purchase(
                        profile,
                        Int32.Parse(ele.GetAttribute("id")),
                        Int32.Parse(((XmlElement)ele.GetElementsByTagName("price")[0]).InnerText),
                        Int32.Parse(((XmlElement)ele.GetElementsByTagName("quantity")[0]).InnerText),
                        ((XmlElement)ele.GetElementsByTagName("description")[0]).InnerText
                    ));
            }
            return purchases;
        }
    }
}
My LoadXml call is eating up the most CPU usage, around 44%, and my ReadToEnd call is eating up another 22%. Any ideas of how to optimize this?
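One direction worth trying (a sketch, not benchmarked, and assuming the same Purchase and Profile types as above) is to drop both the ReadToEnd string and the XmlDocument DOM, and instead walk the byte array with XmlReader, materializing only one <purchase> element at a time:
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static List<Purchase> ParsePurchasesStreaming(Profile profile, byte[] data)
{
    var purchases = new List<Purchase>();
    using (var ms = new MemoryStream(data))
    using (var reader = XmlReader.Create(ms))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "purchase")
            {
                // Materializes only this <purchase> element; ReadFrom also
                // advances the reader past it, so no extra Read() here.
                var ele = (XElement)XNode.ReadFrom(reader);
                purchases.Add(new Purchase(
                    profile,
                    (int)ele.Attribute("id"),
                    (int)ele.Element("price"),
                    (int)ele.Element("quantity"),
                    (string)ele.Element("description")));
            }
            else
            {
                reader.Read();
            }
        }
    }
    return purchases;
}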

Related

Convert response stream to XML

I sent an XML post to demo API and the response comes back as a stream of XML something like this:
API=3CProductData&XML=%3CProductData+Name%3D%22NameTest%22%3E%0D%0A++%3CId%3EXXXXXXXXX%3C%2FId%3E%0D%0A%3C%2FProductData%3E
I'm guessing this is what the stream looks like. My goal is to take that response and store it in a new ProductData object. Here is what I have done so far:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    // as an xml: deserialise into your own object or parse as you wish
    StreamReader respStream = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Default);
    string receivedResponse = respStream.ReadToEnd();

    XmlSerializer x = new XmlSerializer(typeof(ProductData));
    ProductData product = (ProductData)x.Deserialize(new StringReader(receivedResponse));

    Console.WriteLine("Node1: " + product.Id.ToString());
    Console.WriteLine("Node2: " + product.Name);
    Console.ReadKey();
}
The error comes back with System.InvalidOperationException: 'There is an error in XML document (0, 0).'
XmlException: Root element is missing.
Here are two different solutions.
public ProductData TestFunction()
{
    ProductData result = new ProductData();
    string apiResponse = "API=3CProductData&XML=%3CProductData+Name%3D%22NameTest%22%3E%0D%0A++%3CId%3EXXXXXXXXX%3C%2FId%3E%0D%0A%3C%2FProductData%3E";
    string xml = HttpUtility.UrlDecode(apiResponse.Substring(apiResponse.IndexOf("XML=") + 4));

    XmlDocument document = new XmlDocument();
    document.LoadXml(xml);
    XmlNode newNode = document.DocumentElement;
    // Name is actually an attribute on the ProductData
    result.Name = ((XmlAttribute)newNode.Attributes["Name"]).InnerText;
    // Id is an actual node
    result.ID = ((XmlNode)newNode.FirstChild).InnerText;

    using (TextReader reader = new StringReader(xml))
    {
        var serializer = new XmlSerializer(typeof(ProductData));
        result = (ProductData)serializer.Deserialize(reader);
    }
    return result;
}
[Serializable]
[XmlRoot("ProductData")]
public class ProductData
{
    [XmlElement("Id")]
    public string ID { get; set; }

    [XmlAttribute("Name")]
    public string Name { get; set; }
}
There is one very fragile part of this code, and I didn't spend a lot of time trying to handle it. The XML is not really well-formed in my opinion, so you're going to have to substring after the XML=, which is why I added the +4 at the end. There is probably a smoother way to do it, but again the issue is really with converting the XML. Since the XML is really simple, you can just target the values via SelectSingleNode. If you want to go the XmlSerializer route, you need to make sure your class/properties have the attributes set up (i.e. [XmlRoot("ProductData")]).
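For completeness, a sketch of the SelectSingleNode route mentioned above, assuming xml already holds the decoded <ProductData> document:
var document = new XmlDocument();
document.LoadXml(xml);

var result = new ProductData
{
    // Name is an attribute on the root <ProductData> element
    Name = document.DocumentElement.Attributes["Name"].Value,
    // Id is a child element, so target it directly with XPath
    ID = document.DocumentElement.SelectSingleNode("Id").InnerText
};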
You must remove the API=3CProductData&XML= part from your string and then decode the XML part.
Look at this working code:
string strRegex = @"<ProductData Name=""NameTest"">
  <Id>XXXXXXXXX</Id>
</ProductData>";
ProductData result = null;
using (TextReader reader = new StringReader(strRegex))
{
    var serializer = new XmlSerializer(typeof(ProductData));
    result = (ProductData)serializer.Deserialize(reader);
}

How to improve performance of CSV upload via datatable

I have a working solution for uploading a CSV file. Currently, I use the IFormCollection for a user to upload multiple CSV files from a view.
The CSV files are saved as temp files as follows:
List<string> fileLocations = new List<string>();
foreach (var formFile in files)
{
    filePath = Path.GetTempFileName();
    if (formFile.Length > 0)
    {
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await formFile.CopyToAsync(stream);
        }
    }
    fileLocations.Add(filePath);
}
I send the list of file locations to another method (just below). I loop through the file locations and stream the data from the temp files; I then use a data table and SqlBulkCopy to insert the data. I currently upload between 50 and 200 files at a time, and each file is around 330KB. Inserting a hundred files (around 30-35MB of data) takes around 6 minutes.
public void SplitCsvData(string fileLocation, Guid uid)
{
    MetaDataModel MetaDatas;
    List<RawDataModel> RawDatas;

    var reader = new StreamReader(File.OpenRead(fileLocation));
    List<string> listRows = new List<string>();
    while (!reader.EndOfStream)
    {
        listRows.Add(reader.ReadLine());
    }

    var metaData = new List<string>();
    var rawData = new List<string>();

    foreach (var row in listRows)
    {
        var rowName = row.Split(',')[0];
        bool parsed = int.TryParse(rowName, out int result);

        if (parsed == false)
        {
            metaData.Add(row);
        }
        else
        {
            rawData.Add(row);
        }
    }

    // Assigns the vertical header name and value to the object by splitting string
    RawDatas = GetRawData.SplitRawData(rawData);
    SaveRawData(RawDatas);

    MetaDatas = GetMetaData.SplitRawData(rawData);
    SaveRawData(RawDatas);
}
This code then passes the object to the method that creates the data table and inserts the data.
private DataTable CreateRawDataTable
{
    get
    {
        var dt = new DataTable();
        dt.Columns.Add("Id", typeof(int));
        dt.Columns.Add("SerialNumber", typeof(string));
        dt.Columns.Add("ReadingNumber", typeof(int));
        dt.Columns.Add("ReadingDate", typeof(string));
        dt.Columns.Add("ReadingTime", typeof(string));
        dt.Columns.Add("RunTime", typeof(string));
        dt.Columns.Add("Temperature", typeof(double));
        dt.Columns.Add("ProjectGuid", typeof(Guid));
        dt.Columns.Add("CombineDateTime", typeof(string));
        return dt;
    }
}
public void SaveRawData(List<RawDataModel> data)
{
    DataTable dt = CreateRawDataTable;
    var count = data.Count;

    for (var i = 1; i < count; i++)
    {
        DataRow row = dt.NewRow();
        row["Id"] = data[i].Id;
        row["ProjectGuid"] = data[i].ProjectGuid;
        row["SerialNumber"] = data[i].SerialNumber;
        row["ReadingNumber"] = data[i].ReadingNumber;
        row["ReadingDate"] = data[i].ReadingDate;
        row["ReadingTime"] = data[i].ReadingTime;
        row["CombineDateTime"] = data[i].CombineDateTime;
        row["RunTime"] = data[i].RunTime;
        row["Temperature"] = data[i].Temperature;
        dt.Rows.Add(row);
    }

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (SqlTransaction tr = conn.BeginTransaction())
        {
            using (var sqlBulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.Default, tr))
            {
                sqlBulk.BatchSize = 1000;
                sqlBulk.DestinationTableName = "RawData";
                sqlBulk.WriteToServer(dt);
            }
            tr.Commit();
        }
    }
}
Is there another way to do this, or a better way to improve performance, so that the upload time is reduced? It can take a long time, and I am seeing ever-increasing memory use of around 500MB.
TIA
You can improve performance by removing the DataTable and reading from the input stream directly.
SqlBulkCopy has a WriteToServer overload that accepts an IDataReader instead of an entire DataTable.
CsvHelper can parse CSV files using a StreamReader as input. It provides CsvDataReader as an IDataReader implementation on top of the CSV data. This allows reading directly from the input stream and writing to SqlBulkCopy.
The following method will read from an IFormFile, parse the stream using CsvHelper and use the CSV's fields to configure a SqlBulkCopy instance:
public async Task ToTable(IFormFile file, string table)
{
    using (var stream = file.OpenReadStream())
    using (var tx = new StreamReader(stream))
    using (var reader = new CsvReader(tx))
    using (var rd = new CsvDataReader(reader))
    {
        var headers = reader.Context.HeaderRecord;

        var bcp = new SqlBulkCopy(_connection)
        {
            DestinationTableName = table
        };

        // Assume the file headers and table fields have the same names
        foreach (var header in headers)
        {
            bcp.ColumnMappings.Add(header, header);
        }

        await bcp.WriteToServerAsync(rd);
    }
}
This way nothing is ever written to a temp file or cached in memory. The uploaded files are parsed and written to the database directly.
In addition to @Panagiotis's answer, why don't you interleave your file processing with the file upload? Wrap your file-processing logic in an async method, change the loop to a Parallel.ForEach, and process each file as it arrives instead of waiting for all of them.
private static readonly object listLock = new Object(); // only once at class level

List<string> fileLocations = new List<string>();
Parallel.ForEach(files, async (formFile) =>
{
    var filePath = Path.GetTempFileName();
    if (formFile.Length > 0)
    {
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await formFile.CopyToAsync(stream);
        }
        await ProcessFileInToDbAsync(filePath);
    }

    // Added lock for thread safety of the List
    lock (listLock)
    {
        fileLocations.Add(filePath);
    }
});
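Note that an async lambda passed to Parallel.ForEach runs as async void, so the loop itself does not wait for the awaited work to finish. A sketch of the same idea that does wait, using Task.WhenAll with the method names from the snippet above (ProcessFileInToDbAsync is the hypothetical method from that answer; Select needs System.Linq):
// Start one task per uploaded file and wait for all of them to complete.
var uploadTasks = files.Select(async formFile =>
{
    var filePath = Path.GetTempFileName();
    if (formFile.Length > 0)
    {
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await formFile.CopyToAsync(stream);
        }
        await ProcessFileInToDbAsync(filePath);
    }
    return filePath;
});

// No lock needed: the file paths are collected from the completed tasks.
string[] fileLocations = await Task.WhenAll(uploadTasks);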
Thanks to @Panagiotis Kanavos, I was able to work out what to do. Firstly, the way I was calling the methods was leaving them in memory. The CSV file I have is in two parts: vertical metadata and then the usual horizontal information, so I needed to split them into two. Saving them as tmp files was also causing an overhead. It has gone from taking 5-6 minutes to now taking a minute, which for 100 files containing 8,500 rows isn't bad I suppose.
Calling the method:
public async Task<IActionResult> UploadCsvFiles(ICollection<IFormFile> files, IFormCollection fc)
{
    foreach (var f in files)
    {
        var getData = new GetData(_configuration);
        await getData.SplitCsvData(f, uid);
    }
    return whatever;
}
This is the method doing the splitting:
public async Task SplitCsvData(IFormFile file, string uid)
{
    var data = string.Empty;
    var m = new List<string>();
    var r = new List<string>();
    var records = new List<string>();

    using (var stream = file.OpenReadStream())
    using (var reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            var header = line.Split(',')[0].ToString();
            bool parsed = int.TryParse(header, out int result);
            if (!parsed)
            {
                m.Add(line);
            }
            else
            {
                r.Add(line);
            }
        }
    }

    // TODO: Validation
    // This splits the list into the Meta data model. This is just a single object, with static fields.
    var metaData = SplitCsvMetaData.SplitMetaData(m, uid);
    DataTable dtm = CreateMetaData(metaData);
    var serialNumber = metaData.LoggerId;
    await SaveMetaData("MetaData", dtm);

    var lrd = new List<RawDataModel>();
    foreach (string row in r)
    {
        lrd.Add(new RawDataModel
        {
            Id = 0,
            SerialNumber = serialNumber,
            ReadingNumber = Convert.ToInt32(row.Split(',')[0]),
            ReadingDate = Convert.ToDateTime(row.Split(',')[1]).ToString("yyyy-MM-dd"),
            ReadingTime = Convert.ToDateTime(row.Split(',')[2]).ToString("HH:mm:ss"),
            RunTime = row.Split(',')[3].ToString(),
            Temperature = Convert.ToDouble(row.Split(',')[4]),
            ProjectGuid = uid.ToString(),
            CombineDateTime = Convert.ToDateTime(row.Split(',')[1] + " " + row.Split(',')[2]).ToString("yyyy-MM-dd HH:mm:ss")
        });
    }

    await SaveRawData("RawData", lrd);
}
I then use a data table for the metadata (which takes 20 seconds for 100 files), as I map the field names to the columns.
public async Task SaveMetaData(string table, DataTable dt)
{
    using (SqlBulkCopy sqlBulk = new SqlBulkCopy(_configuration.GetConnectionString("DefaultConnection"), SqlBulkCopyOptions.Default))
    {
        sqlBulk.DestinationTableName = table;
        await sqlBulk.WriteToServerAsync(dt);
    }
}
I then use FastMember for the raw data, the large part of the file, which is more like a traditional CSV.
public async Task SaveRawData(string table, IEnumerable<LogTagRawDataModel> lrd)
{
    using (SqlBulkCopy sqlBulk = new SqlBulkCopy(_configuration.GetConnectionString("DefaultConnection"), SqlBulkCopyOptions.Default))
    using (var reader = ObjectReader.Create(lrd, "Id", "SerialNumber", "ReadingNumber", "ReadingDate", "ReadingTime", "RunTime", "Temperature", "ProjectGuid", "CombineDateTime"))
    {
        sqlBulk.DestinationTableName = table;
        await sqlBulk.WriteToServerAsync(reader);
    }
}
I am sure this can be improved on, but for now, this works really well.

Loop through large XML file using XDocument

I have to copy nodes from an existing XML file to a newly created XML file.
I'm using an XDocument instance to access the existing XML file. The problem is that the XML file can be quite large (let's say 500K lines; OpenStreetMap data).
What would be the best way to loop through large XML files without causing memory errors?
I currently just use XDocument.Load(path) and loop through doc.Descendants(), but this causes the program to freeze until the loop is done. So I think I have to loop async, but I don't know the best way to achieve this.
You can use XmlReader and an IEnumerable<XElement> iterator to yield the elements you need.
This approach isn't asynchronous, but it saves memory because you don't need to load the whole file into memory for handling, only the elements you select to copy.
public IEnumerable<XElement> ReadFile(string pathToTheFile)
{
    using (XmlReader reader = XmlReader.Create(pathToTheFile))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name.Equals("yourElementName"))
                {
                    XElement element = XElement.ReadFrom(reader) as XElement;
                    yield return element;
                }
            }
        }
    }
}
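Because the iterator streams one element at a time, the copy into the new file can be streamed as well. A minimal sketch of consuming it with an XmlWriter (the file paths and the "root" element name are placeholders):
// Copy only the selected elements into a new document without loading
// either file fully into memory.
using (XmlWriter writer = XmlWriter.Create("output.xml", new XmlWriterSettings { Indent = true }))
{
    writer.WriteStartElement("root");
    foreach (XElement element in ReadFile("input.xml"))
    {
        element.WriteTo(writer);
    }
    writer.WriteEndElement();
}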
You can also read files asynchronously:
public async Task<IEnumerable<XElement>> ReadFileAsync(string pathToTheFile)
{
    var elements = new List<XElement>();
    var xmlSettings = new XmlReaderSettings { Async = true };

    using (XmlReader reader = XmlReader.Create(pathToTheFile, xmlSettings))
    {
        await reader.MoveToContentAsync();
        while (await reader.ReadAsync())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name.Equals("yourElementName"))
                {
                    XElement element = XElement.ReadFrom(reader) as XElement;
                    elements.Add(element);
                }
            }
        }
    }

    return elements;
}
Then you can read all the files asynchronously and await the results:
var fileTask1 = ReadFileAsync(filePath1);
var fileTask2 = ReadFileAsync(filePath2);
var fileTask3 = ReadFileAsync(filePath3);

await Task.WhenAll(new Task[] { fileTask1, fileTask2, fileTask3 });

// use results
var elementsFromFile1 = fileTask1.Result;

Deserialized object not same as source

I'm having a problem deserializing previously serialized XML.
My class is generated from an .xsd by the xsd.exe utility. I have no influence on the structure of the .xsd, as it was issued by the government and is used to standardize communication...
The problem is, when I create an object, set some properties on it, serialize it using XmlSerializer, and then deserialize it back, I do not get the same "contents" as I started with. Some of the elements from the XML deserialize into the "Any" property instead of the properties from which they were serialized in the first place.
Perhaps a clumsy explanation, but I've created a sample project that reproduces my issue.
Sample project can be found here.
Edit:
Ok, here is some sample code. Unfortunately I can't paste everything here, because the file generated by xsd.exe is over 4000 lines long. But everything required is in the linked file.
My test console app:
static void Main(string[] args)
{
    Pismeno pismeno = new Pismeno();

    #region Build sample content
    pismeno.Sadrzaj = new SadrzajTip();
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml("<SomeXml xmlns=\"http://some.namespace.com/\">Some content goes here</SomeXml>");
    pismeno.Sadrzaj.Item = xmlDoc.DocumentElement;

    pismeno.Prilog = new PismenoPrilog[1];
    pismeno.Prilog[0] = new PismenoPrilog();
    pismeno.Prilog[0].VrijemeNastanka = DateTime.Now.ToString("s");

    XmlDocument xmlTitle = new XmlDocument();
    xmlTitle.LoadXml("<Title xmlns=\"http://some.namespace.com/\">Test title 1</Title>");
    pismeno.Prilog[0].Any = new XmlElement[1];
    pismeno.Prilog[0].Any[0] = xmlTitle.DocumentElement;

    pismeno.Prilog[0].Sadrzaj = new SadrzajTip();
    EAdresaTip eat = new EAdresaTip();
    eat.URL = "http://www.example.com/testfile.doc";
    pismeno.Prilog[0].Sadrzaj.Item = eat;
    #endregion

    // Serialize object, and then deserialize it again
    string pismenoSer = Serialize(pismeno);
    Pismeno pismeno2 = Deserialize<Pismeno>(pismenoSer);

    // Objects to compare. "source" has source.Sadrzaj and source.Prilog properties set
    // "shouldBeTheSameAsSource" has shouldBeTheSameAsSource.Any property set
    Pismeno source = pismeno;
    Pismeno shouldBeTheSameAsSource = pismeno2;
}
public static string Serialize(object o)
{
    string ret = null;
    using (var stream = new MemoryStream())
    {
        XmlWriter xw = new XmlTextWriter(stream, Encoding.UTF8) { Formatting = Formatting.Indented };
        new XmlSerializer(o.GetType()).Serialize(xw, o);
        stream.Flush();
        stream.Seek(0, SeekOrigin.Begin);
        ret = (new StreamReader(stream, Encoding.UTF8)).ReadToEnd();
    }
    return ret;
}
public static T Deserialize<T>(string xml)
{
    return (T)new XmlSerializer(typeof(T)).Deserialize(XmlReader.Create(new StringReader(xml)));
}
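For what it's worth, the symptom described (typed properties coming back empty while Any is populated) is consistent with how XmlSerializer handles unrecognized elements: anything whose element name or namespace does not match a typed member is routed into the member marked [XmlAnyElement], which is what the xsd.exe-generated Any properties are. A minimal illustration of that mechanism with hypothetical types (not the generated Pismeno classes):
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

[XmlRoot("Envelope")]
public class Envelope
{
    [XmlElement("Known")]
    public string Known { get; set; }

    // Catch-all: elements the serializer cannot map to a typed member
    // (wrong name or wrong namespace) end up here.
    [XmlAnyElement]
    public XmlElement[] Any { get; set; }
}

class Demo
{
    static void Main()
    {
        var xml = "<Envelope><Known>a</Known><Other>b</Other></Envelope>";
        var envelope = (Envelope)new XmlSerializer(typeof(Envelope))
            .Deserialize(new StringReader(xml));

        Console.WriteLine(envelope.Known);       // "a"
        Console.WriteLine(envelope.Any[0].Name); // "Other" - no matching member, so it lands in Any
    }
}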

How to append data in a serialized file on disk

I have a program written in C# that serializes data into binary and writes it to disk. If I want to add more data to this file, first I have to deserialise the whole file and then append more serialized data to it. Is it possible to append data to this serialized file without deserialising the existing data, so that I can save some time during the whole process?
You don't have to read all the data in the file to append data.
You can open it in append mode and write the data.
var fileStream = File.Open(fileName, FileMode.Append, FileAccess.Write, FileShare.Read);
var binaryWriter = new BinaryWriter(fileStream);
binaryWriter.Write(data);
Now that we know (from the comments) that we're talking about a DataTable/DataSet via BinaryFormatter, it becomes clearer. If your intention is for that to appear as extra rows in the existing table, then no: that isn't going to work. What you could do is append, but then deserialize each table in turn and manually merge the contents. That is probably your best bet with what you describe. Here's an example just using two batches, but obviously you'd repeat the deserialize/merge until EOF:
var dt = new DataTable();
dt.Columns.Add("foo", typeof(int));
dt.Columns.Add("bar", typeof(string));
dt.RemotingFormat = SerializationFormat.Binary;

var ser = new BinaryFormatter();
using (var ms = new MemoryStream())
{
    dt.Rows.Add(123, "abc");
    ser.Serialize(ms, dt); // batch 1

    dt.Rows.Clear();
    dt.Rows.Add(456, "def");
    ser.Serialize(ms, dt); // batch 2

    ms.Position = 0;
    var table1 = (DataTable)ser.Deserialize(ms);

    // the following is the merge loop that you'd repeat until EOF
    var table2 = (DataTable)ser.Deserialize(ms);
    foreach (DataRow row in table2.Rows)
    {
        table1.ImportRow(row);
    }

    // show the results
    foreach (DataRow row in table1.Rows)
    {
        Console.WriteLine("{0}, {1}", row[0], row[1]);
    }
}
However! Personally I have misgivings about both DataTable and BinaryFormatter. If you know what your data is, there are other techniques. For example, this could be done very simply with "protobuf", since protobuf is inherently appendable. In fact, you need to do extra work to not append (although that is simple enough too):
[ProtoContract]
class Foo
{
    [ProtoMember(1)]
    public int X { get; set; }

    [ProtoMember(2)]
    public string Y { get; set; }
}

[ProtoContract]
class MyData
{
    private readonly List<Foo> items = new List<Foo>();

    [ProtoMember(1)]
    public List<Foo> Items { get { return items; } }
}
then:
var batch1 = new MyData { Items = { new Foo { X = 123, Y = "abc" } } };
var batch2 = new MyData { Items = { new Foo { X = 456, Y = "def" } } };

using (var ms = new MemoryStream())
{
    Serializer.Serialize(ms, batch1);
    Serializer.Serialize(ms, batch2);

    ms.Position = 0;
    var merged = Serializer.Deserialize<MyData>(ms);
    foreach (var row in merged.Items)
    {
        Console.WriteLine("{0}, {1}", row.X, row.Y);
    }
}
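The same idea carries over to a file on disk, which is what the question asks about: serialize each new batch to a stream opened with FileMode.Append, and deserialize the whole file in one go later. A sketch using the same MyData/Foo types and a placeholder file name:
// Append a new batch; every call just extends the existing file.
using (var file = File.Open("data.bin", FileMode.Append, FileAccess.Write))
{
    Serializer.Serialize(file, new MyData { Items = { new Foo { X = 789, Y = "ghi" } } });
}

// Later: read everything back as one merged MyData.
using (var file = File.OpenRead("data.bin"))
{
    var merged = Serializer.Deserialize<MyData>(file);
    Console.WriteLine(merged.Items.Count); // total rows across all appended batches
}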
