Persisting with XML: multiple files or single file? - c#

I've already asked about general guidelines in What would be a good approach for using XML as data persistence for a small C# app?.
I decided to use XML under the hood. It's a small app, a report card app for teachers. These are my main entities:
Student
Course
Teacher (there should be only one, but I'll store it because of future integration possibilities)
Grade (a student can have more than one grade in each course)
I have some points I would like suggestions on:
Under the hood, should I have one XML file per entity or one big XML file?
How does that compare from a performance perspective?
How does that compare from a data-joining perspective?
Should I use LINQ to XML under the hood? Is there anything else worth considering?

One XML file is probably easier.
That file might look like this:
<ReportCardData>
  <Students>
    <Student>...data for student A...</Student>
    ...
  </Students>
  <Teachers>
    <Teacher> ...
    ...
  </Teachers>
  <Courses>
    <Course> ...
    ...
  </Courses>
</ReportCardData>
The code in C# would look like this:
public class ReportCardData
{
    public List<Student> Students;
    public List<Teacher> Teachers;
    public List<Course> Courses;
}
And the code to de-serialize (read) the data looks like this:
ReportCardData rcd = null;
var s = new System.Xml.Serialization.XmlSerializer(typeof(ReportCardData));
using (System.IO.StreamReader reader = System.IO.File.OpenText(filepath))
{
    rcd = (ReportCardData)s.Deserialize(reader);
}
...be sure to add in the appropriate exception handling, etc.
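For completeness, writing the data back out is the mirror image; here is a minimal sketch (rcd is the populated ReportCardData instance, filepath as above):
// Serialize (write) the data back to the XML file.
var serializer = new System.Xml.Serialization.XmlSerializer(typeof(ReportCardData));
using (System.IO.StreamWriter writer = System.IO.File.CreateText(filepath))
{
    serializer.Serialize(writer, rcd);
}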
Using XML serialization works fine for something like this, even for large data sets of multiple tens of megabytes. (If you are talking about hundreds of megabytes of data, then maybe consider a real database like SQL Server Express.)
The reading and writing performance will likely be fine.
Keep in mind that when you de-serialize data from the XML file, the entire dataset will be held in memory in your app. So if it is 15 MB worth of grade data, then it is all in memory at one time.
You also asked about a "data joining perspective" - not sure what that means, but using LINQ to XML (or plain LINQ to Objects over the deserialized lists) you can perform queries across that data. The performance of these in-memory queries is also fine.
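To make the "joining" point concrete, here is a hedged sketch of such an in-memory query, assuming you also keep the grades in a List<Grade> Grades on ReportCardData and that Grade exposes StudentId, CourseId, and Value (those member names are invented; yours may differ):
// Average grade per student per course, joining the deserialized lists with LINQ to Objects.
var averages =
    from g in rcd.Grades
    join s in rcd.Students on g.StudentId equals s.Id
    join c in rcd.Courses on g.CourseId equals c.Id
    group g by new { Student = s.Name, Course = c.Name } into grp
    select new
    {
        grp.Key.Student,
        grp.Key.Course,
        Average = grp.Average(x => x.Value)
    };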


'Streaming' data into Sql server

I'm working on a project where we're receiving data from multiple sources that needs to be saved into various tables in our database. Fast.
I've played with various methods, and the fastest I've found so far is using a collection of table-valued parameters, filling them up and periodically sending them to the database via a corresponding collection of stored procedures.
The results are quite satisfying. However, looking at disk usage (% Idle Time in Perfmon), I can see that the disk is getting periodically 'thrashed' (a 'spike' down to 0% every 13-18 seconds), whilst in between the %Idle time is around 90%. I've tried varying the 'batch' size, but it doesn't have an enormous influence.
Should I be able to get better throughput by (somehow) avoiding the spikes while decreasing the overall idle time?
What are some things I should be looking at to work out where the spiking is happening? (The database is in Simple recovery mode and pre-sized to 'big', so it's not the log file growing.)
Bonus: I've seen other questions referring to 'streaming' data into the database, but this seems to involve having a Stream from another database (last section here). Is there any way I could shoe-horn 'pushed' data into that?
A very easy way of inserting loads of data into SQL Server is, as mentioned, the 'bulk insert' method. ADO.NET offers a very easy way of doing this without the need for external files. Here's the code:
var bulkCopy = new SqlBulkCopy(myConnection);
bulkCopy.DestinationTableName = "MyTable";
bulkCopy.WriteToServer(myDataTable);
That's easy.
But: myDataTable needs to have exactly the same structure as MyTable, i.e. names, field types, and order of fields must be exactly the same. If not, well, there's a solution for that: column mapping. And this is even easier to do:
bulkCopy.ColumnMappings.Add("ColumnNameOfDataSet", "ColumnNameOfTable");
That's still easy.
But: myDataTable needs to fit into memory. If not, things become a bit trickier, as we then need an IDataReader derivative that can be instantiated with an IEnumerable.
You might get all the information you need in this article.
Building on the code referred to in alzaimar's answer, I've got a proof of concept working with IObservable (just to see if I can). It seems to work ok. I just need to put together some tidier code to see if this is actually any faster than what I already have.
(The following code only really makes sense in the context of the test program in the code download from the aforementioned article.)
Warning: NSFW, copy/paste at your peril!
private static void InsertDataUsingObservableBulkCopy(IEnumerable<Person> people,
                                                      SqlConnection connection)
{
    var sub = new Subject<Person>();

    var bulkCopy = new SqlBulkCopy(connection);
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.ColumnMappings.Add("Name", "Name");
    bulkCopy.ColumnMappings.Add("DateOfBirth", "DateOfBirth");

    using (var dataReader = new ObjectDataReader<Person>(people))
    {
        var task = Task.Factory.StartNew(() =>
        {
            bulkCopy.WriteToServer(dataReader);
        });

        var stopwatch = Stopwatch.StartNew();
        foreach (var person in people) sub.OnNext(person);
        sub.OnCompleted();
        task.Wait();

        Console.WriteLine("Observable Bulk copy: {0}ms",
                          stopwatch.ElapsedMilliseconds);
    }
}
It's difficult to comment without knowing the specifics, but one of the fastest ways to get data into SQL Server is Bulk Insert from a file.
You could write the incoming data to a temp file and periodically bulk insert it.
Streaming data into a SQL Server table-valued parameter also looks like a good solution for fast inserts, as the data is held in memory. In answer to your question, yes, you could use this; you just need to turn your data into an IDataReader. There are various ways to do this, for example from a DataTable; see here.
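As a rough sketch of that table-valued parameter route (the table type dbo.PersonType, the procedure dbo.InsertPeople, and the column list are invented for illustration, reusing the Person shape from the answer above; 'connection' is assumed to be an open SqlConnection):
// Build a DataTable matching the user-defined table type dbo.PersonType.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("DateOfBirth", typeof(DateTime));
foreach (var person in people)
    table.Rows.Add(person.Name, person.DateOfBirth);

// Pass it as a structured (table-valued) parameter to a stored procedure.
using (var command = new SqlCommand("dbo.InsertPeople", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    var parameter = command.Parameters.AddWithValue("@People", table);
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "dbo.PersonType";
    command.ExecuteNonQuery();
}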
If your disk is a bottleneck you could always optimise your infrastructure; put the database on a RAM disk or SSD, for example.

LINQ to SQL Attaching collection of object from XML file Database

I am developing an HRM application that imports and exports XML data to and from the database. The application receives exported XML data for employee entries. I imported the XML file using LINQ to XML and converted the XML into the respective objects. Now I want to attach (update) the employee objects.
I tried to use
// linqoper class imports the XML data and converts it into an IEnumerable of Employee objects.
var emp = linqoper.importxml("filename.xml");
using (EmployeeDataContext db = new EmployeeDataContext())
{
    db.Employee.AttachAll(emp, true);
    db.SubmitChanges();
}
But I got the error:
"An entity can only be attached as modified without original state if it declares a version member or does not have an update check policy."
I also have the option to retrieve each employee and assign values to it from the XML data, in this format:
// Import the IEnumerable of Employee objects.
var employees = linqoper.importxml("filename.xml");
using (EmployeeDataContext db = new EmployeeDataContext())
{
    foreach (var empobj in employees)
    {
        Employee emp = db.Employee.Single(m => m.Id == empobj.Id);
        emp.FirstName = empobj.FirstName;
        emp.BirthDate = empobj.BirthDate;
        // ... continue for the remaining properties
    }
    db.SubmitChanges();
}
But the problem with the above is that I have to iterate through all of the employee objects, which is very tedious.
So is there any other way I could attach (update) the employee entities in the database using LINQ to SQL?
I have seen some similar links on SO, but none of them seems to help.
https://stackoverflow.com/questions/898267/linq-to-sql-attach-refresh-entity-object
When LINQ to SQL saves changes to the database, it has to know which properties of the object have been changed. It also checks whether a potentially conflicting update has been made to the database in the meantime (optimistic concurrency).
To handle those cases, LINQ to SQL needs two copies of the object when attaching: one with the original values (as present in the DB) and one with the new, changed values. There is also a more advanced mechanism involving a version member that is mapped to a rowversion column.
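For completeness, a hedged sketch of what such a version member looks like when mapped with attributes on the Employee class (the designer generates the equivalent from a .dbml; the property name here is invented, and System.Data.Linq.Mapping is assumed to be imported):
// Hypothetical rowversion member on the Employee entity.
// With this in place, Employee entities can be attached as modified without
// their original values, e.g. db.Employee.AttachAll(employees, true).
[Column(DbType = "rowversion NOT NULL", IsVersion = true, IsDbGenerated = true)]
public System.Data.Linq.Binary RowVersion { get; set; }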
The LINQ to SQL way to update a set of data is to first read all the data from the database, then update the objects retrieved from the database, and finally call SubmitChanges(). That would be my first approach in your situation.
If you experience performance problems, then it's time to go outside LINQ to SQL's toolbox. A solution with better performance is to load the new data into a separate staging table (for best performance, use bulk insert), then run a SQL command or stored procedure that does the actual merging of data. The SQL MERGE statement is excellent for this kind of update.
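A rough sketch of that staging-plus-MERGE route (the staging table, column names, and the ToDataTable helper are all invented for illustration):
// 1. Bulk-load the imported employees into a staging table.
using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.Employee_Staging";
    bulkCopy.WriteToServer(ToDataTable(employees)); // ToDataTable: a helper you would write
}

// 2. Merge the staging rows into the real table in one set-based statement.
using (var db = new EmployeeDataContext(connectionString))
{
    db.ExecuteCommand(@"
        MERGE dbo.Employee AS target
        USING dbo.Employee_Staging AS source ON target.Id = source.Id
        WHEN MATCHED THEN
            UPDATE SET target.FirstName = source.FirstName,
                       target.BirthDate = source.BirthDate
        WHEN NOT MATCHED THEN
            INSERT (Id, FirstName, BirthDate)
            VALUES (source.Id, source.FirstName, source.BirthDate);");
}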
LINQ to SQL is a proper ORM, but if you want to take control of create/update/delete into your own hands, you can try one of the simple ORMs that just provide ways to do CRUD operations. I can recommend one: http://crystalmapper.codeplex.com. It is simple yet powerful.
Why CrystalMapper?
I built this for a large financial transaction system with lots of insert and update operations. What I needed was speed and control over inserts/updates to serve complex business scenarios ... hitting multiple tables for just one transaction.
When I put it to use in a social text-processing platform, it served very well there too.

Making a Simple Database Program in C#

I'm currently writing a simple text analysis program in C#. Currently it takes simple statistics from the text and prints them out. However, I need to get it to the point where in input mode you input sample text, specifying an author, and it writes the statistics to a database entry of that specific author. Then in a different mode the program will take text, and see if it can accurately identify the author by pulling averages from the DB files and comparing the text's statistics to sample statistics. What I need help with is figuring out the best way to make a database out of text statistics. Is there some library I could use for this? Or should I simply do simple reading and writing from text files that I'll store the information in? Any and all ideas are welcome, as I'm struggling to come up with a solution to this problem.
Thanks,
PardonMyRhetoric
You can use an XmlSerializer to persist your data to a file really easily. There are numerous tutorials you can find on Google that will teach you how in just a few minutes. However, most of them want to show you how to add attributes to your properties to customize the way it serializes, so I'll just point out that those aren't really necessary. As long as your class is public, has a parameterless constructor, and exposes public members, all you need is something that looks like this to save:
void Save()
{
    using (var sw = new StreamWriter("somefile.xml"))
        (new XmlSerializer(typeof(MyClass))).Serialize(sw, this);
}
and something like this in a function to read it:
MyClass Load()
{
    XmlSerializer xSer = new XmlSerializer(typeof(MyClass));
    using (var sr = new StreamReader("somefile.xml"))
        return (MyClass)xSer.Deserialize(sr);
}
I don't think you'll need a database at this stage. Try to select appropriate data structures from the .NET Framework itself: use a dictionary or lists rather than arrays for this, and the methods you write will become simpler. Try to learn LINQ - it's like writing queries against a database, but over regular data structures. Once you've got that working and the project grows, try adding a database.
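For example, here is a hedged sketch of the sort of in-memory structure and LINQ query this suggests (the AuthorStats class and its properties are invented; System.Linq is assumed):
// Hypothetical per-author statistics record, persisted with XmlSerializer as shown above.
public class AuthorStats
{
    public string Author { get; set; }
    public double AverageWordLength { get; set; }
    public double AverageSentenceLength { get; set; }
}

// Find the stored author whose averages are closest to a freshly analysed sample.
static AuthorStats ClosestMatch(List<AuthorStats> known, AuthorStats sample)
{
    return known
        .OrderBy(a => Math.Abs(a.AverageWordLength - sample.AverageWordLength)
                    + Math.Abs(a.AverageSentenceLength - sample.AverageSentenceLength))
        .FirstOrDefault();
}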

receiving everyday XML files - 12 types need to do search on these everyday

Asp.NET - C#.NET
I need advice regarding the design problem below:
I'll receive XML files every day. The quantity varies: e.g. yesterday 10 XML files were received, today 56, maybe tomorrow 161, etc.
There are 12 types (12 XSDs), and at the top there is an attribute called FormType, e.g. FormType="1", FormType="2", FormType="12", etc., up to 12 form types.
All of them have common fields like Name, Address, Phone.
But e.g. FormType=1 is for Construction, FormType=2 is for IT, FormType=3 is for Hospital, FormType=4 is for Advertisement, etc.
As I said, all of them have common attributes.
Requirements:
I need a search screen so the user can search the XML contents, e.g. search the text in some attributes of the XMLs received between Date_From and Date_To. But I don't have any clue how to approach this.
Problem:
I've heard about putting the XMLs in a binary field and doing XPath queries, or something like that, but I don't know the terms to search for on Google.
I was thinking of creating one big database table, reading all the XMLs, and putting them in that table. But the issue is that some XML attributes are very large, like 2-3 pages, while the same attributes in other XML files are empty.
So if I create an NVARCHAR(MAX) column for every XML attribute and put them all in table fields, after some time my database will become a big, big monster...
Can someone advice what is the best approach to handle this issue?
I'm not 100% sure I understand your problem. I'm guessing that the query's supposed to return individual XML documents that meet some kind of user-specified criteria.
In that event, my starting point would probably be to implement a method for querying a single XML document, i.e. one that returns true if the document's a hit and false otherwise. In all likelihood, I'd make the query parameter an XPath query, but who knows? Here's a simple example:
// XPathSelectElements is an extension method from System.Xml.XPath.
public bool TestXml(XDocument d, string query)
{
    return d.XPathSelectElements(query).Any();
}
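For example, a call against one of the incoming documents might look like this (the file path and XPath expression are only illustrative):
// Does this document declare FormType="1" and mention the name anywhere?
XDocument doc = XDocument.Load(@"C:\incoming\form-123.xml");
bool isHit = TestXml(doc, "//*[@FormType='1' and contains(., 'Smith')]");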
Next, I need a store of XML documents to query. Where does that store live, and what form does it take? At a certain level, those are implementation details that my application doesn't care about. They could live in a database, or the file system. They could be cached in memory. I'd start by keeping it simple, something like:
public IEnumerable<XDocument> XmlDocuments()
{
    DirectoryInfo di = new DirectoryInfo(XmlDirectoryPath);
    foreach (FileInfo fi in di.GetFiles())
    {
        yield return XDocument.Load(fi.FullName);
    }
}
Now I can get all of the documents that fulfill a request like this:
public IEnumerable<XDocument> GetDocuments(string query)
{
    return XmlDocuments().Where(x => TestXml(x, query));
}
The thing that jumps out at me when I look at this problem: I have to parse my documents into XDocument objects to query them. That's going to happen whether they live in a database or the file system. (If I stick them in a database and write a stored procedure that does XPath queries, as someone suggested, I'm still parsing all of the XML every time I execute a query; I've just moved all that work to the database server.)
That's a lot of I/O and CPU time spent doing the exact same thing over and over again. If the volume of queries is anything other than tiny, I'd consider building a List<XDocument> the first time GetDocuments() is called and coming up with a scheme for keeping that list in memory until new XML documents are received (or possibly updating it when new documents are received).
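A minimal sketch of that caching idea, building on the methods above (how and when the cache is invalidated depends on how new files arrive, so that part is only a placeholder):
private List<XDocument> _cache;

public IEnumerable<XDocument> GetDocuments(string query)
{
    // Parse every file once, then reuse the in-memory XDocuments for later queries.
    if (_cache == null)
        _cache = XmlDocuments().ToList();

    return _cache.Where(x => TestXml(x, query));
}

// Call this when new XML files arrive so the next query re-reads the directory.
public void InvalidateCache()
{
    _cache = null;
}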

Storing Data from Forms without creating 100's of tables: ASP.NET and SQL Server

Let me first describe the situation. We host many Alumni events over the course of each year and provide online registration forms for each event. There is a large chunk of data that is common for each event:
An Event with dates, times, managers, internal billing info, etc.
A Registration record with info about the payment and total amount charged per form submission
Bio/Demographic and alumni data about the 1 or more attendees (name, address, degree, etc.)
We store all of the above data within columns in tables as you would expect.
The trouble comes with the 'extra' fields we are asked to put on the forms. Maybe it is a dinner and there is a Veggie or Carnivore option, perhaps there is lodging and there are bed or smoking options, or perhaps there is an optional transportation option. There are tons of weird little "can you add this to the form?" types of requests we receive.
Currently, we JSONify any non-standard data and store it all in one column (per attendee) called 'extras'. We can read this data out in code but it is not well suited to querying. Our internal staff would like to generate a quick report on Veggie dinners needed for instance.
Other than creating a separate table for each form that holds the specific 'extra' data items, are there any other approaches that could make my life (and reporting) easier? Anyone working in a similar environment?
This is actually one of the toughest problems to solve efficiently. The SQL Server Customer Advisory Team has dedicated a white paper to the topic, which I highly recommend you read: Best Practices for Semantic Data Modeling for Performance and Scalability.
You basically have 3 options:
semantic database (entity-attribute-value)
XML column
sparse columns
Each solution comes with ups and downs. Off the top of my head I'd say XML is probably the one that gives you the best balance of power and flexibility, but the optimal solution really depends on many factors: data set sizes, the frequency at which new attributes are created, the actual process (human operators) that creates, populates, and uses these attributes, and not least your team's skill set (some might fare better with an EAV solution, some with an XML solution). If the attributes are created/managed under a central authority and adding new attributes is a reasonably rare event, then sparse columns may be the better answer.
Well you could also have the following db structure:
Have a table to store custom attributes
AttributeID
AttributeName
Have a mapping table between events and attributes with:
AttributeID
EventID
AttributeValue
This means you will be able to store custom information per event, and you will be able to reuse your attributes. You can include some metadata such as
AttributeType
AllowBlankValue
on the attribute to handle it more easily afterwards; a query sketch against this structure follows below.
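As an illustration of how reporting works against this shape (the DataContext and table names here are invented), a query for the values of a given attribute for one event might look like:
// Hypothetical LINQ to SQL query over the two tables described above:
// Attribute(AttributeID, AttributeName) and
// EventAttribute(AttributeID, EventID, AttributeValue).
var dinnerChoices =
    from ea in db.EventAttributes
    join a in db.Attributes on ea.AttributeID equals a.AttributeID
    where a.AttributeName == "DinnerChoice" && ea.EventID == eventId
    group ea by ea.AttributeValue into g
    select new { Choice = g.Key, Count = g.Count() };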
Have you considered using XML instead of JSON? The difference: SQL Server supports XML natively (it has a dedicated xml data type) and offers query integration ;)
Quick and dirty, but actually nice for querying: simply add new columns. It's not as if the empty entries in the existing rows will cost much.
A more database-y solution: you'll have something like an event ID in your table. You can link this to an n:m table connecting events to additional fields, and then store the additional field data in a table with additional_field_id, record_id (from the original table), and the actual value. It probably creates ugly queries, but it seems politically correct in terms of database design.
I understand "NoSQL" (not only sql ;) databases like couchdb let you store arbitrary fields per record, but since you're already with SQL Server, I guess that's not an option.
This is the solution that we first proposed in ASP.NET Forums (that later became Community Server), and that the ASP.NET team built a similar version of in the ASP.NET 2.0 Membership when they released it:
Property Bags on your domain objects
For example:
Event.Profile() or in your case, Event.Extras().
Basically, a property bag is a serialized collection of data stored in a name/value pair in a column (or columns). The ASP.NET 2.0 Membership went the route of storing names in a semi-colon delimited list, and values in the same:
Table: aspnet_Profile
Column: PropertyNames (separated by semi-colons, and has start index and end index)
Column: PropertyValues (separated by semi-colons, and only stores the string value)
The downside to that approach is that it is all strings and has to be parsed manually (even though the membership system does it for you automatically).
More recently, my current method has been to build FormCollection and NameValueCollection C# extension methods that automatically serialize the collections to XML. I store that XML in the table in its own column associated with that entity, and I also have a deserializer C# extension method on XElement that turns that data back into the collection at runtime.
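A hedged sketch of the general shape of such extension methods (this is not the author's actual code; names are illustrative, and System.Collections.Specialized, System.Linq, and System.Xml.Linq are assumed):
public static class PropertyBagExtensions
{
    // Serialize a NameValueCollection to an <extras> XML fragment.
    public static XElement ToXml(this NameValueCollection collection)
    {
        return new XElement("extras",
            collection.AllKeys.Select(key =>
                new XElement("property",
                    new XAttribute("name", key),
                    collection[key])));
    }

    // Deserialize the fragment back into a NameValueCollection at runtime.
    public static NameValueCollection ToCollection(this XElement element)
    {
        var collection = new NameValueCollection();
        foreach (var property in element.Elements("property"))
            collection.Add((string)property.Attribute("name"), property.Value);
        return collection;
    }
}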
This gives you the power to actually query those properties in the XML via SQL (though that can be slow - always flatten out your read-only data).
The final note is runtime querying: the general rule we follow is, if you are going to query a property of an entity in normal application logic, then you move that property to an actual column on the table and create the appropriate indexes. If that data will never be queried directly (for example, via LINQ to SQL or EF), then leave it in the XML property bag.
Property Bags gives you the power of extending your domain models however you like, without having to modify the db schema.
