Yes, I know, serialized data produced by boost is intended only for the library's internal use and not to be read by third parties. However, I find myself in a position where I have to mimic binary serialized data originating from .NET (a std::vector of tiny PODs) which would later be deserialized by boost (native C++). C++/CLI interop with native boost is not possible since the assembly has to be pure CLR.
Is it feasible to just write the right sequence of bytes? Is there any binary format spec? I didn't find any.
EDIT001: Some background: I have a table in a database with hundreds of millions of rows. Each row consists of two IDs - an entity ID and a parent entity ID - plus an additional column for entity data (all entity data is JSON, but that doesn't matter; I can't change it). Now, in native C++ I have to select entities by parent ID to get all the entities it has, which would (sometimes) yield 5M rows; as one can guess, it takes ages to query, receive, iterate, parse and load into a vector of C++ structs. So I tested what happens if I have my own table with parent ID as the PK and a column containing all entities belonging to that ID, binary serialized. The result (aside from data transfer over the network, etc.) is that I can parse it (actually, boost can) in ~400ms, which is not blazing fast but good enough for me. Now, how do I get my table populated with the binary data? Obviously the DBA team can't help here - they know nothing about the boost binary format - so I resorted to a CLR user-defined function which MUST be implemented as "pure" CLR. This UDF is supposed to be called from a stored procedure which populates the table with individual entities and, at the end, runs over these and creates the binary bulk. But how can I mimic the boost binary format if I can't call boost (C++/CLI) in my assembly???
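There is no published spec - the layout is an implementation detail of boost.serialization and depends on the boost version, the archive flags, endianness and the width of the size types. A pragmatic approach is to serialize a small sample vector once with the native C++ code, hex-dump it, and replicate that exact byte layout from C#. A minimal sketch under those assumptions (the header bytes must be copied verbatim from your reference dump, the 64-bit little-endian count is an assumption to verify against the dump, and MyPod and its fields are hypothetical):

using System.Collections.Generic;
using System.IO;

struct MyPod                           // hypothetical POD; field order and widths must match the C++ struct
{
    public int Id;
    public short Flags;
}

static class BoostArchiveMimic
{
    // Copy these bytes verbatim from a hex dump of a small archive
    // produced by the native boost code (signature, versions, any per-type info).
    static readonly byte[] HeaderFromReferenceDump = { /* ... */ };

    public static byte[] Serialize(IReadOnlyList<MyPod> items)
    {
        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms))
        {
            w.Write(HeaderFromReferenceDump);
            w.Write((ulong)items.Count);   // assumption: count stored as 64-bit little-endian; verify against the dump
            foreach (var p in items)
            {
                w.Write(p.Id);             // write fields in the same order and width as the C++ side reads them
                w.Write(p.Flags);
            }
            return ms.ToArray();
        }
    }
}

Even then, treat it as fragile: a boost upgrade or compiler/platform change on the C++ side can alter the header or size encoding, so keep a round-trip test that deserializes a C#-produced blob with the native code.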
Related
I currently follow a pattern where I store objects which are serialized and deserialized to a particular column.
This pattern was fine before; however, due to the frequency of transactions, the cost of serializing the object to a JSON string and then later retrieving the string and deserializing it back to an object has become too expensive.
Is it possible to store an object directly to a column to avoid this cost? I am using Entity Framework, and I would like to work with the data stored in this column as type Object.
Please advise.
JSON serialization is not fast. It's faster and less verbose than XML, but a lot slower than binary serialization. I would look at third party binary serializers, namely ZeroFormatter, or Wire/Hyperion. For my own stuff I use Wire as a "fast enough" and simple to implement option.
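For reference, a minimal round-trip sketch assuming the Wire NuGet package (Hyperion, its maintained fork, exposes essentially the same Serializer API):

using System.IO;
using Wire;

static class BlobCodec
{
    static readonly Serializer Serializer = new Serializer();

    public static byte[] ToBytes<T>(T value)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(value, ms);   // binary, much more compact and faster than JSON text
            return ms.ToArray();
        }
    }

    public static T FromBytes<T>(byte[] data)
    {
        using (var ms = new MemoryStream(data))
        {
            return Serializer.Deserialize<T>(ms);
        }
    }
}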
As far as table structure goes, I would recommend storing serialized data in a separate 1..0-1 associated table. So if I had an Order table for which I wanted to serialize some extra order-related structure (coming from a 3rd-party delivery system, for example), I'd create another table called OrderDeliveryInfo with a PK of OrderID to join to the Order table, housing the binary (varbinary) column for the serialized data. The reason for this is to avoid the cost of retrieving and transmitting the binary blob every time I query Order records, unless I explicitly request the delivery info.
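A rough sketch of that shape with Entity Framework (the names come from the example above; the attribute-based shared-primary-key mapping shown here is one way to get the 1..0-1 association, assuming EF6-style annotations):

using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

public class Order
{
    public int OrderID { get; set; }
    // ... regular Order columns ...
    public virtual OrderDeliveryInfo DeliveryInfo { get; set; }   // 1..0-1; only loaded when asked for
}

public class OrderDeliveryInfo
{
    [Key, ForeignKey("Order")]
    public int OrderID { get; set; }              // shared primary key, also the FK back to Order

    public byte[] SerializedData { get; set; }    // maps to varbinary(max)

    public virtual Order Order { get; set; }
}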
I have been using the following way to save dictionaries into the database:
Convert the dictionary to XML (a C# sketch of this step follows the SP snippet below).
Pass this XML to an SP (stored procedure).
In the SP use:
DECLARE @handle INT;
EXEC sp_xml_preparedocument @handle OUTPUT, @xml;   -- @xml: the XML text passed into the SP

SELECT
    Key1,
    Value1
INTO #TempTable
FROM OPENXML(@handle, '//ValueSet/Values', 1)
WITH
(
    Key1 VARCHAR(MAX),
    Value1 VARCHAR(100)
);

EXEC sp_xml_removedocument @handle;
Done.
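To illustrate step 1, here is a minimal C# sketch that builds XML in the attribute-centric shape the OPENXML call above expects (//ValueSet/Values elements with Key1/Value1 attributes); how the string is passed to the SP (for example as an xml or nvarchar(max) parameter) is left out:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class DictionaryXml
{
    // Produces <ValueSet><Values Key1="..." Value1="..." /> ... </ValueSet>
    public static string ToXml(IDictionary<string, string> dict)
    {
        var root = new XElement("ValueSet",
            dict.Select(kv => new XElement("Values",
                new XAttribute("Key1", kv.Key),
                new XAttribute("Value1", kv.Value))));
        return root.ToString(SaveOptions.DisableFormatting);
    }
}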
Is there a way to save dictionaries to a database without converting it to XML?
It depends whether...
You want the data to be stored: the fastest way to do that (both to implement and in performance) is binary serialization (Protocol Buffers, for example). However, the data is not readable with a SELECT, and every application that needs to read it must use the same serialization (assuming an implementation exists for its technology/language). From my point of view, that defeats part of the purpose of storing it in a SQL database.
You want the data to be readable by humans: XML is an option; it is not as fast, a bit harder to read, and still not queryable, but it is quite fast to implement. You can also dump the result to a file and it remains readable. Moreover, you can share the data with other applications, since XML is a widespread format.
You want the data to be queryable: depending on how you go about it, this can be harder to implement. You would need two tables (one for keys and one for values). Then you could either write your own custom mapping code to map columns to properties, or use an object-relational mapping framework such as Entity Framework or NHibernate.
While Entity Framework or NHibernate may look like an oversized Swiss Army knife for a small problem, it is always worth building some expertise in them, as the underlying concepts are reusable and they can really speed up development once you have a working setup.
Serialize the Dictionary and store the binary data.
Then deserialize the data back into a Dictionary.
Tutorial1 Tutorial2
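Note that BinaryFormatter, which older tutorials typically use for this, is now considered insecure and obsolete; a simple alternative sketch that writes the pairs out by hand (assuming string keys and values going into a varbinary(max) column):

using System.Collections.Generic;
using System.IO;

static class DictionaryBlob
{
    public static byte[] Serialize(IDictionary<string, string> dict)
    {
        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms))
        {
            w.Write(dict.Count);              // pair count first
            foreach (var kv in dict)
            {
                w.Write(kv.Key);              // BinaryWriter length-prefixes strings
                w.Write(kv.Value);
            }
            return ms.ToArray();              // store this in the binary column
        }
    }

    public static Dictionary<string, string> Deserialize(byte[] data)
    {
        using (var r = new BinaryReader(new MemoryStream(data)))
        {
            var count = r.ReadInt32();
            var dict = new Dictionary<string, string>(count);
            for (var i = 0; i < count; i++)
                dict[r.ReadString()] = r.ReadString();
            return dict;
        }
    }
}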
Loop through the dictionary using a foreach statement.
What is the best way to store big objects? In my case it's something like a tree or a linked list.
I tried the following:
1) Relational db
Is not good for tree structures.
2) Document db
I tried RavenDB, but it raised a System.OutOfMemoryException when I called the SaveChanges method.
3) .Net Serialization
It works very slowly.
4) Protobuf
It cannot deserialize List<List<>> types, and I'm not sure about linked structures.
So...?
You mention protobuf - I routinely use protobuf-net with objects that are many hundreds of megabytes in size, but: it does need to be suitably written as a DTO, and ideally as a tree (not a bidirectional graph, although that usage is supported in some scenarios).
In the case of a doubly-linked list, that might mean simply: marking the "previous" links as not serialized, then doing a fix-up in an after-deserialize callback, to correctly set the "previous" links. Pretty easy normally.
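A sketch of that pattern with protobuf-net (Node and LinkedModel are made-up names; only the forward links are marked for serialization and the back links are rebuilt in an after-deserialization callback, assuming protobuf-net's callback attributes):

using ProtoBuf;

[ProtoContract]
public class Node
{
    [ProtoMember(1)]
    public string Payload { get; set; }

    [ProtoMember(2)]
    public Node Next { get; set; }       // forward link, serialized

    public Node Previous { get; set; }   // no [ProtoMember] => skipped by the serializer
}

[ProtoContract]
public class LinkedModel
{
    [ProtoMember(1)]
    public Node Head { get; set; }

    [ProtoAfterDeserialization]
    public void FixUpBackLinks()
    {
        // Walk the forward chain and restore the "previous" pointers.
        for (var node = Head; node != null && node.Next != null; node = node.Next)
            node.Next.Previous = node;
    }
}

For very long chains you may prefer to serialize a flat list of nodes instead and rebuild both links in the callback, to avoid deeply nested messages.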
You are correct in that it doesn't currently support nested lists. This is usually trivial to side-step by using a list of something that has a list, but I'm tempted to make this implicit - i.e. the library should be able to simulate this without you needing to change your model. If you are interested in me doing this, let me know.
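The wrapper side-step for nested lists looks roughly like this (Row is a made-up name); a List<Row> round-trips where a List<List<int>> currently would not:

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class Row
{
    [ProtoMember(1)]
    public List<int> Items { get; set; }
}

// Serialize a List<Row> wherever you would otherwise want a List<List<int>>.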
If you have a concrete example of a model you'd like to serialize, and want me to offer guidance, let me know - if you can't post it here, then my email is in my profile. Entirely up to you.
Did you try Json.NET and storing the result in a file?
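For what it's worth, a minimal Json.NET sketch; PreserveReferencesHandling is worth enabling so that the back/shared references a tree or linked list tends to have survive the round trip (TreeNode is a hypothetical node type):

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class TreeNode                        // hypothetical node type
{
    public string Name { get; set; }
    public TreeNode Parent { get; set; }     // back reference creates a cycle
    public List<TreeNode> Children { get; set; } = new List<TreeNode>();
}

public static class TreeFile
{
    static readonly JsonSerializerSettings Settings = new JsonSerializerSettings
    {
        // $id/$ref markers let shared and cyclic references survive the round trip
        PreserveReferencesHandling = PreserveReferencesHandling.Objects
    };

    public static void Save(TreeNode root, string path)
    {
        File.WriteAllText(path, JsonConvert.SerializeObject(root, Settings));
    }

    public static TreeNode Load(string path)
    {
        return JsonConvert.DeserializeObject<TreeNode>(File.ReadAllText(path), Settings);
    }
}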
Option [2]: NoSQL (Document) Database
I suggest Cassandra.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities: any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values.
So if your files are < 10MB you should be fine, just make sure to limit the file size, or break large files up into chunks.
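A minimal sketch of that chunking workaround (the 64MB figure comes from the wiki quote above; how you key the row and the chunk columns depends on your schema):

using System;
using System.Collections.Generic;

static class Chunker
{
    const int ChunkSize = 64 * 1024 * 1024;   // 64MB, matching the wiki's example

    // One file => one row; each returned chunk becomes a separate column value.
    public static List<byte[]> Split(byte[] blob)
    {
        var chunks = new List<byte[]>();
        for (var offset = 0; offset < blob.Length; offset += ChunkSize)
        {
            var size = Math.Min(ChunkSize, blob.Length - offset);
            var chunk = new byte[size];
            Buffer.BlockCopy(blob, offset, chunk, 0, size);
            chunks.Add(chunk);
        }
        return chunks;
    }
}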
CouchDb does a very good job with challenges like that one.
storing a tree in CouchDb
storing a tree in relational databases
A generic data import module:
I am reading data from any of 6 data source types (CSV, Active Directory, SQL, Access, Oracle, Sharepoint) into a datatable.
This data is then possibly changed by the users by casting, and calculation per column and written to a SQL table (any table selected by the user).
Doing this seems easy, except that the user must also be able to replace some fields in the datatable with values from fields in the target SQL database (lookups).
I would really like to do all of the above to the datatable before sending it to the target database, but I cannot, repeat NOT, use LINQ, since the table structures (both source and target) are unknown and do not represent a specific business object.
tl;dr I need to do data transformations on any datatable. What is a good way to do this (no LINQ!)?
EDIT: The source and target tables are different in structure.
I ended up writing a class for each database type, all implementing one interface, and using a GenericConnection based on DbConnection for the different types of sources.
I broke the process up into:
Import
Transform
Write
stages that can be saved and reopened for re-use or editing.
The Transform part consists of:
Casting
Calculations (string, int, decimal, date, bool) such as Add, Subtract, Divide, Multiply, AND, OR, Substring, Replace
Lookups against other tables
Direct copying
The transformations can be queued so that one column of data can go through any number of transformations to match the target before being written to it.
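A rough, LINQ-free sketch of that queued-transform idea over a DataTable (the interface and class names are made up):

using System.Collections.Generic;
using System.Data;

// One step in the queue: transforms the value of a single column.
public interface IColumnTransform
{
    string ColumnName { get; }
    object Apply(object currentValue, DataRow row);
}

public class TransformPipeline
{
    private readonly List<IColumnTransform> _steps = new List<IColumnTransform>();

    public void Enqueue(IColumnTransform step)
    {
        _steps.Add(step);
    }

    public void Run(DataTable table)
    {
        foreach (IColumnTransform step in _steps)            // plain loops, no LINQ
        {
            foreach (DataRow row in table.Rows)
            {
                row[step.ColumnName] = step.Apply(row[step.ColumnName], row);
            }
        }
    }
}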
Currently, I'm sitting on an ugly business application written in Access that takes a spreadsheet on a bi-daily basis and imports it into an MDB. I am currently converting a major project that includes this into SQL Server and .NET, specifically in C#.
To house this information there are two tables (alias names here) that I will call Master_Prod and Master_Sheet joined on an identity key parent to the Master_Prod table, ProdID. There are also two more tables to store history, History_Prod and History_Sheet. There are more tables that extend off of Master_Prod but keeping this limited to two tables for explanation purposes.
Since this was written in Access, the subroutine to handle this file is littered with manually coded triggers to deal with history that were and have been a constant pain to keep up with, one reason why I'm glad this is moving to a database server rather than a RAD tool. I am writing triggers to handle history tracking.
My plan is/was to create an object modeling the spreadsheet, parse the data into it and use LINQ to do some checks client side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (Unless none exist, then its new). If any of the fields have been altered I want to send the update.
Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list since I'll have the spreadsheet in this form already but I've recently learned this is going to be paired with a rather important database server that I am very concerned with bogging down.
Is this worth putting a CLR stored procedure in for? There are other points of entry where data enters and if I could build a procedure to handle them given the objects passed in then I could take a lot of business rule away from the application at the expense of potential database performance.
Basically I want to take the update checking away from the client and put it on the database so the data system manages whether or not the table should be updated so the history trigger can fire off.
Thoughts on a better way to implement this along the same direction?
Use SSIS. Use Excel Source to read the spreadsheets, perhaps use a Lookup Transformation to detect new items and finally use a SQL Server Destination to insert the stream of missing items into SQL.
SSIS is a way better fit for these kinds of jobs than writing something from scratch, no matter how much fun LINQ is. SSIS packages are easier to debug, maintain and refactor than some DLL with forgotten sources. Besides, you will not be able to match the refinements SSIS has in managing its buffers for high-throughput Data Flows.
"Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list since I'll have the spreadsheet in this form already but I've recently learned this is going to be paired with a rather important database server that I am very concerned with bogging down."
That does not work. Any input into a C#-written CLR procedure STILL has to follow normal SQL semantics. All that can change is the internal setup. Any communication with the client has to be done in SQL, which means statement executions / method calls. There is no way to directly pass in an enumerable of objects.
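To make that concrete, a SQLCLR procedure can only expose SQL-compatible parameter types and talks to the data through the context connection; a hedged sketch (the procedure name, parameter and Staging_Sheet table are made up):

using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class ImportProcedures
{
    [SqlProcedure]
    public static void ReconcileSheet(SqlInt32 batchId)     // only SQL types here, no IEnumerable<T>
    {
        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            // The comparison/update logic still has to be expressed as T-SQL
            // (or row-by-row ADO.NET calls) against tables the data was loaded into.
            using (var cmd = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.Staging_Sheet WHERE BatchId = @b", conn))
            {
                cmd.Parameters.AddWithValue("@b", batchId.Value);
                SqlContext.Pipe.Send("Staged rows: " + cmd.ExecuteScalar());
            }
        }
    }
}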
"My plan is/was to create an object modeling the spreadsheet, parse the data into it and use LINQ to do some checks client side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (Unless none exist, then its new). If any of the fields have been altered I want to send the update."
You probably need to pick a "centricity" for your approach - i.e. data-centric or object-centric.
I would probably model the data appropriately first. This is because relational databases (or even non-normalized models represented in relational databases) will often outlive client tools, libraries and applications. I would probably start by trying to model in a normal form, and think about the triggers to maintain the audit/history you mention during this time as well.
I would typically then think of the data coming in (not an object model or an entity, really). So then I focus on the format and semantics of the inputs and see if there is a misfit in my data model - perhaps there were assumptions in my data model which were incorrect. And yes, I'm not planning to make an object model that validates the spreadsheet, even though spreadsheets are notoriously fickle input sources. Like Remus, I would simply use SSIS to bring it in - perhaps to a staging table, and then do some more validation before applying it to production tables with some T-SQL.
Then I would think about a client tool which had an object model based on my good solid data model.
Alternatively, the object approach would mean modeling the spreadsheet, but also an object model which needs to be persisted to the database - and perhaps you now have two object models (spreadsheet and full business domain) and database model (storage persistence), if the spreadsheet object model is not as complete as the system's business domain object model.
I can think of an example where I had a throwaway external object model kind of like this. It read a "master file" which was a layout file describing an input file. This object model allowed the program to build SSIS packages (and BCP and SQL scripts) to import/export/do other operations on these files. Effectively it was a throwaway object model - it was not used as the actual model for the data in the rows or any kind of navigation between parent and child rows, etc., but simply an internal representation for internal purposes - it didn't necessarily correspond to a "domain" entity.