My program generates some numeric results which are held together in a class:
[Serializable]
public class Examination
{
public string _examiner { get; set; }
public string _interpretation { get; set; }
DateTime _examination_date { get; set; }
// two following properties about 100x100 in size
public Point3D[,] _surface_coordinates { get; set; }
public double [,] _mapa_curvatura { get; set; }
public Point3DCollection _symmetry_line { get; set; }
}
Now I want to persist this using serialization (no need for ORM/Database in principle), and I am having some doubts:
I need the serialized data to be accessible to scripts in other languages (Python mostly) so I wouldn't use Binary Serialization, using XmlSerialization instead;
Multidimensional data is not supported, so I had to convert [,] arrays to [][] arrays, which looked a bit "dirty" to me (not big deal, though, if that were the only matter);
The resulting XML is a bit too big (2Mb per file), while I was getting results a lot smaller with quick'n'dirty binary formats in python (for example, saving array of doubles to a long string, putting rows and columns in the filename itself, then parsing it and reshaping during deserialization: not pretty too!).
Since I am a complete beginner in the Serialization field, I would like to ask: "Which would be an adviseable strategy/tactic to serialize and deserialize this class to disk?" Primary requirements would be:
Readable in other languages;
Compact file size;
Respecting C#/.NET good-practices and common idioms;
Thanks for reading!
Readable in other languages
This is one of the reasons XML was designed, personally I would stick with it.
Compact file size
Have you considered storing it as a compressed file? e.g. MyXml.zip.
Respecting C#/.NET good-practises and common idioms
Just stick to the docs and you should be fine.
Related
Introduction to the goal:
I am currently trying to optimize performance and memory usage of my code. (mainly Ram bottleneck)
The program will have many instances of the following element at the same time. Especially when historic prices should be processed at the fastest possible rate.
The struct looks like this in it's simplest way:
public struct PriceElement
{
public DateTime SpotTime { get; set; }
public decimal BuyPrice { get; set; }
public decimal SellPrice { get; set; }
}
I realized the performance benefits of using the struct just like an empty bottle and refill it after consumption. This way, I do not have to reallocate memory for each single element in the line.
However, it also made my code a little more dangerous for human errors in the program code. Namely I wanted to make sure that I always update the whole struct at once rather than maybe ending up with just an updated sellprice and buyprice because I forgot to update an element.
The element is very neat like this but I have to offload methods into functions in another classes in order to have the functionality I require - This in turn would be less intuitive and thus less preferable in code.
So I added some basic methods which make my life a lot easier:
public struct PriceElement
{
public PriceElement(DateTime spotTime = default(DateTime), decimal buyPrice = 0, decimal sellPrice = 0)
{
// assign datetime min value if not happened already
spotTime = spotTime == default(DateTime) ? DateTime.MinValue : spotTime;
this.SpotTime = spotTime;
this.BuyPrice = buyPrice;
this.SellPrice = sellPrice;
}
// Data
public DateTime SpotTime { get; private set; }
public decimal BuyPrice { get; private set; }
public decimal SellPrice { get; private set; }
// Methods
public decimal SpotPrice { get { return ((this.BuyPrice + this.SellPrice) / (decimal)2); } }
// refills/overwrites this price element
public void UpdatePrice(DateTime spotTime, decimal buyPrice, decimal sellPrice)
{
this.SpotTime = spotTime;
this.BuyPrice = buyPrice;
this.SellPrice = sellPrice;
}
public string ToString()
{
System.Text.StringBuilder output = new System.Text.StringBuilder();
output.Append(this.SpotTime.ToString("dd/MM/yyyy HH:mm:ss"));
output.Append(',');
output.Append(this.BuyPrice);
output.Append(',');
output.Append(this.SellPrice);
return output.ToString();
}
}
Question:
Let's say I have PriceElement[1000000] - will those additional methods put additional strain on the system memory or are they "shared" between all structs of type PriceElement?
Will those additional methods increase the time to create a new PriceElement(DateTime, buy, sell) instance, respectively the load on the garbage collector?
Will there be any negative impacts, I have not mentioned here?
will those additional methods put additional strain on the system memory or are they "shared" between all structs of type PriceElement?
Code is shared between all instances. So no additional memory will be used.
Code is stored separately from any data, and the memory for the code is only dependent on the amount of code, not how many instance of objects there are. This is true for both classes and structs. The main exception is generics, this will create a copy of the code for each type combination that is used. It is a bit more complicated since the code is Jitted, cached etc, but that is irrelevant in most cases since you cannot control it anyway.
I would recommend making your struct immutable. I.e. change UpdatePrice so it returns a new struct instead of changing the existing one. See why is mutable structs evil for details. Making the struct immutable allow you to mark the struct as readonly and that can help avoid copies when passing the struct with an in parameter. In modern c# you can take references to structs in an array, and that also helps avoiding copies (as you seem to be aware of).
I need to be able to search over a collection of approx 2 million items in C#. Search should be possible over multiple fields. Simple string-matching is good enough.
Using an external dependency like a database is not an option, but using an in-memory database would be OK.
The main goal is to do this memory-efficient.
The type in the collection is quite simple and has no long strings:
public class Item
{
public string Name { get; set; } // Around 50 chars
public string Category { get; set; } // Around 20 chars
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; } // 2-3 items
}
Focus and requirements
Clarification of focus and requirements:
No external dependencies (like a database)
Memory-efficient (below 2 GB for 2 million items)
Searchable items in collection (must be performant)
Today's non-optimal solution
Using a simple List<T> over above type, either as a class or a struct, still requires about 2 GB of memory.
Is there a better way?
The most significant memory hog in your class is the use of a read-only list. Get rid of it and you will reduce memory footprint by some 60% (tested with three tags):
public class Item
{
public string Name { get; set; }
public string Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public string Tags { get; set; } // Semi-colon separated
}
Also, consider using DateTime instead of DateTimeOffset. That will further reduce memory footprint with around 10%.
There are many things you can do in order to reduce the memory footprint of your data, but probably the easiest thing to do with the greatest impact would be to intern all strings. Or at least these that you expect to be repeated a lot.
// Rough example (no checks for null values)
public class Item
{
private string _name;
public string Name
{
get { return _name; }
set { _name = String.Intern(value); }
}
private string _category;
public string Category
{
get { return _category; }
set { _category = String.Intern(value); }
}
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
private IReadOnlyList<string> _tags;
public IReadOnlyList<string> Tags
{
get { return _tags; }
set { _tags = Array.AsReadOnly(value.Select(s => String.Intern(s)).ToArray()); }
}
}
Another thing you could do, more difficult and with smaller impact, would be to assign the same IReadOnlyList<string> object to items with identical tags (assuming that many items with identical tags exist in your data).
Update: Also don't forget to call TrimExcess to the list after you fill it with items, in order to get rid of the unused capacity.
This method can be used to minimize a collection's memory overhead if no new elements will be added to the collection.
With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1000 bytes per item, which should be more than enough to do this in polynomial time.
If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.
Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.
You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?
With that out of the way, let's get into an answer.
Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in linear time time linear in the total word count, and then allows searches in logarithmic time time linear in the length of the word. Although web pages are generally smaller than 2 GB, linear creation and logarithmic searching scale very well.
Find or implement a Suffix Tree.
The suffix tree allows you to find substrings (with time complexity O(log N) O(m), where m is the word length) and get back the original objects they occur in.
Construct the suffix tree once, with the strings of each object pointing back to that object.
Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.
Edit: On memory usage
As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.
For example, the words fire and firm will be stored as a parent node fir with two leaf nodes, e and m, thus forming the words. Should the word fish be introduced, the node fir will be split: a parent node fi, with child nodes sh and r, and the r having child nodes e and m. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.
With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a through z, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.
The only overhead comes from storing separate substrings and the nodes that represent and connect them.
You can do theses dots, then you will see if there is trouble:
you can enable gcAllowVeryLargeObjects to enables arrays that are greater than 2 gigabytes.
Let the class implementation. When you choose between class and struct, the performance is not the main factor. I think there is no reason to use struct here. See Choosing Between Class and Struct.
Depending your search filter, you must override GetHashCode and Equal.
Do you need to mutate properties, or just search object in the collection?
If you just want research, and if your properties repeat themselves a lot, you can have one property used by many objects.
In this way, the value is stored only one time, and the object only store the reference.
You can do this only if you dont want to mutate the property.
As exemple, if two objects have the same category:
public class Category
{
public string Value { get; }
public Category(string category)
{
Value = category;
}
}
public class Item
{
public string Name { get; set; }
public Category Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; }
}
class Program
{
public void Init()
{
Category category = new Category("categoryX");
var obj1 = new Item
{
Category = category
};
var obj2 = new Item
{
Category = category
};
}
}
I would not expect any major memory issues with 2M objects if you are running 64-bits. There is a max size limit of lists of 2Gb, but a reference is only 8 bytes, so the list should be well under this limit. The total memory usage will depend mostly on how large the strings are. There will also be some object overhead, but this is difficult to avoid if you need to store multiple strings.
Also, how do you measure memory? The .Net runtime might over allocate memory, so the actual memory usage of your object might be significantly lower than the memory reported by windows. Use a memory profiler to get an exact count.
If strings are duplicated between many objects there might be a major win if you can deduplicate them, making use of the same instance.
using a struct instead of a class could avoid some overhead, so I made some tests:
list of objects using LINQ - 46ms
list of objects using for loop - 16ms
list of structs using for loop - 250ms
list of readonly structs with ref-return using for loop: 180ms
The exact times will depend on what query you are doing, these numbers are mostly for comparison.
Conclusion is that a regular List of objects with a regular for loop is probably the fastest. Also, iterating over all objects is quite fast, so in most cases it should not cause a major performance issue.
If you need better performance you will need to create some kind of index so you can avoid iterating over all items. Exact strategies for this is difficult to know without knowing what kinds of queries you are doing.
One option could be to use some variant of in memory database, this could provide most of the indexing functionality. SQLite would be one example
If the categories could be defined as an Enum, you can map it to bits, that would help in reducing the size pretty much. From 20bytes to say 2bytes(short int), this could approximately save around 36M bytes for 2M objects.
I'm trying to deserialize a JSON object array in C#. The array represents a row of a table, mainly consisting of plain strings. However, one or more of the items in the array may not be a string but infact a JSON object in itself, e.g.
"rows":[[{"date":"20140521","category":"one"},"data","moredata","evenmoredata"],...]
or on a different response from the server, the order may be different
"rows":[["data","moredata",{"date":"20140521","category":"one"},"evenmoredata"],...]
I'm trying to just treat this as a list of objects, with a known type called RowObject below:
[DataContract]
[KnownType(typeof(RowObject))]
public class Table
{
// other members to deserialize
[DataMember(Name = "rows")]
public List<List<object>> Rows { get; set; }
}
[DataContract]
public class RowObject
{
[DataMember(Name = "date")]
public DateTime date { get; set; }
[DataMember(Name = "category")]
public string category { get; set; }
}
This approach kind of worked in that the plain strings in the row were deserialized, however the RowObjects do not seem to be recognised even though I have tried to put them down as a KnownType. When I look at my deserialized List<object> Row, the RowObject just seems to be a blank object shown as {object} in the debugger.
I've managed to do this with known types and derived types elsewhere in the project but this problem dealing with plain strings has got me stuck. I've had a look at this question which seems pretty similar, but my problem with this answer is that I don't know which elements in the list are going to be the complex type. Also, I'm just using the .Net DataContractJsonSerializer throughout the project so would like to avoid third party libraries if at all possible.
Is it possible to deserialize a JSON array of different types like this?
Set EmitTypeInformation in DataContractJsonSerializerSettings to EmitTypeInformation.Always on the server side. That way you will get information about the types of your objexts, inside the json string.
I've been working on a project for a while to parse a list of entries from a csv file and use that data to update a database.
For each entry I create a new user instance that I put in a collection. Now I want to iterate that collection and compare the user entry to the user from the database (if it exists). My question is, how can I compare that user (entry) object to the user (db) object, while returning a list with differences?
For example following classes generated from database:
public class User
{
public int ID { get; set; }
public string EmployeeNumber { get; set; }
public string UserName { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
public Nullable<int> OfficeID { get; set; }
public virtual Office Office { get; set; }
}
public class Office
{
public int ID { get; set; }
public string Code { get; set; }
public virtual ICollection<User> Users { get; set; }
}
To save some queries to the database, I only fill the properties that I can retrieve from the csv file, so the ID's (for example) are not available for the equality check.
Is there any way to compare these objects without defining a rule for each property and returning a list of properties that are modified? I know this question seems similar to some earlier posts. I've read a lot of them but as I'm rather inexperienced at programming, I'd appreciate some advice.
From what I've gathered from what I've read, should I be combining 'comparing properties generically' with 'ignoring properties using data annotations' and 'returning a list of CompareResults'?
There are several approaches that you can solve this:
Approach #1 is to create separate DTO-style classes for the contents of the CSV files. Though this involves creating new classes with a lot of similar fields, it decouples the CSV file format from your database and gives you the ability to change them later without influencing the other part. In order to implement the comparison, you could create a Comparer class. As long as the classes are almost identical, the comparison can get all the properties from the DTO class and implement the comparison dynamically (e.g. by creating and evaluating a Lambda expression that contains a BinaryExpression of type Equal).
Approach #2 avoids the DTOs, but uses attributes to mark the properties that are part of the comparison. You'd need to create a custom attribute that you assign to the properties in question. In the compare, you analyze all the properties of the class and filter out the ones that are marked with the attribute. For the comparison of the properties you can use the same approach as in #1. Downside of this approach is that you couple the comparison logic tightly with the data classes. If you'd need to implement several different comparisons, you'd clutter the data classes with the attributes.
Of course, #1 results in a higher effort than #2. I understand that it is not what you are looking for, but maybe having a separate, strongly-typed compared class is also an approach one can think about.
Some more details on a dynamic comparison algorithm: it is based on reflection to get the properties that need to be compared (depending on the approach you get the properties of the DTO or the relevant ones of the data class). Once you have the properties (in case of DTOs, the properties should have the same name and data type), you can create a LamdaExpression and compile and evaluate it dynamically. The following lines show an excerpt of a code sample:
public static bool AreEqual<TDTO, TDATA>(TDTO dto, TDATA data)
{
foreach(var prop in typeof(TDTO).GetProperties())
{
var dataProp = typeof(TDATA).GetProperty(prop.Name);
if (dataProp == null)
throw new InvalidOperationException(string.Format("Property {0} is missing in data class.", prop.Name));
var compExpr = GetComparisonExpression(prop, dataProp);
var del = compExpr.Compile();
if (!(bool)del.DynamicInvoke(dto, data))
return false;
}
return true;
}
private static LambdaExpression GetComparisonExpression(PropertyInfo dtoProp, PropertyInfo dataProp)
{
var dtoParam = Expression.Parameter(dtoProp.DeclaringType, "dto");
var dataParam = Expression.Parameter(dataProp.DeclaringType, "data");
return Expression.Lambda(
Expression.MakeBinary(ExpressionType.Equal,
Expression.MakeMemberAccess(
dtoParam, dtoProp),
Expression.MakeMemberAccess(
dataParam, dataProp)), dtoParam, dataParam);
}
For the full sample, see this link. Please note that this dynamic approach is just an easy implementation that leaves room for improvement (e.g. there is no check for the data type of the properties). It also does only check for equality and does not collect the properties that are not equal; but that should be easy to transfer.
While the dynamic approach is easy to implement, the risk for runtime errors is bigger than in a strongly-typed approach.
I have a need to parse (and build) fixed length text based messages that may in some cases contain array fields.
Example:
PARTA LOTA 02SUBLOT1 SUBLOT2 03TEST1 RESULT1 TEST2 RESULT2 TEST3 RESULT3
If this were an object, it might use the Lot object below.
Part Number (PARTA)
Lot Number (LOTA)
An Array of 2 SubLot Objects (SUBLOT1 with quantity 150 and SUBLOT2 with Quantity 999)
An Array of 3 Test Results (TEST1 with result 1234.67890, ...)
Note that the number of array items is specified in the message.
I was hoping to use the FileHelpers library that I've seen people talking about, but it doesn't appear to support multiple array fields where there is another field specifying the quantity, and it doesn't support field types that themselves have the attribute of [FixedLengthRecord()].
This is what I would like to be able to do. Note that the field length of 10 is just an artifact of keeping this simple. Not all fields would normally be defined with the same length.
[FixedLengthRecord()]
public class Lot
{
[FieldFixedLength(10)]
public string PartNumber { get; set; }
[FieldFixedLength(10)]
public string LotNumber { get; set; }
[FieldFixedLength(10)]
public SubLot[] SubLots { get; set; }
[FieldFixedLength(10)]
public Test[] Tests { get; set; }
}
[FixedLengthRecord()]
public class SubLot
{
[FieldFixedLength(10)]
public string SubLotNumber { get; set; }
[FieldFixedLength(10)]
public int Quantity { get; set; }
}
[FixedLengthRecord()]
public class Test
{
[FieldFixedLength(10)]
public string Description { get; set; }
[FieldFixedLength(10)]
public double Result { get; set; }
}
Anyone have any idea if this is possible with FileHelpers? Any other ideas? I have many different message types so I would rather not manually code for each one. The attribute decoration method in FileHelpers seems like a great clean solution and I'm considering just extending it, but I want to make sure I'm not missing a better solution out there.
I believe that I done something very similar in the past.
The way that I tackled this issue is to use custom attributes. This allowed me to create classes and nested objects which described my data exactly as described in the specification and use custom attributes to describe the data attributes (lenght, type, padding requirements, if required etc).
I also ended up writing a custom serialization/deserialization for the classes and attributes however that was really specific to the actual application as the data was coming through a custom government protocol which also sent and received data in fixed sized chunks or packets over encrypted sockets with continuation codes etc.
Tutorials
http://msdn.microsoft.com/en-us/library/aa288454%28v=vs.71%29.aspx
http://www.codeproject.com/KB/cs/attributes.aspx
http://www.devx.com/dotnet/Article/11579