Writing your own storage system: where to start? - C#

I've been reading about NoSQL (http://nosql.eventbrite.com/), a movement aimed at encouraging developers to drop traditional relational databases in favor of custom, application-suited storage systems.
I'm intrigued by the idea of trying to write a small personal storage system (for the .NET Framework) as a learning pet project. What are your suggestions or useful links? Where to start? How do I balance what's on the hard drive and what's in memory?
I think this could be an interesting opportunity to learn the inner workings of databases, but I really lack even the most basic theory.
Thanks.

The NoSQL movement is aimed at huge-scale systems, at sizes where the relational model truly breaks down. Before you start writing your own storage, I highly recommend understanding the relational model, as it is one of the best-documented and best-understood domains in CS. Start with Gray and Reuter's Transaction Processing; this book explains everything there is to know about implementing a classic RDBMS. Next on your list should be Readings in Database Systems, a collection of the most relevant scientific papers and articles.

Before you get going, I would recommend looking into SQL Server's ability to store XML documents as BLOB-like objects inside the relational database. Perhaps your storage system doesn't need to be "from scratch"; it could be a hybrid built on top of SQL Server's XML storage capability.
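If you want a taste of that hybrid approach, here is a minimal sketch of inserting an XML document into SQL Server from C#. The Docs table, connection string, and column type are assumptions for illustration, not part of any existing schema:

    using System.Data.SqlClient;
    using System.Xml.Linq;

    class XmlStoreDemo
    {
        static void Main()
        {
            // Assumed schema: CREATE TABLE Docs (Id INT IDENTITY PRIMARY KEY, Body XML)
            using (var conn = new SqlConnection(@"Server=.;Database=HybridStore;Integrated Security=true"))
            {
                conn.Open();
                string doc = new XElement("user", new XAttribute("name", "Ada")).ToString();
                var cmd = new SqlCommand("INSERT INTO Docs (Body) VALUES (@xml)", conn);
                cmd.Parameters.AddWithValue("@xml", doc); // string converts implicitly to the xml type
                cmd.ExecuteNonQuery();
            }
        }
    }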

Well, it all depends on the app you are building.
For example, if your app just needs to persist a few hundred objects, cut through them in a few ways, and doesn't care if stuff gets corrupted once in a while, you could potentially just use LINQ to query a List and persist the List to disk once in a while (see the sketch at the end of this answer).
If you need anything with the magic ACID properties, it's going to take tons of work.
If you need something that supports transactions, it's going to take tons of work.
If you need something that understands ANSI SQL, you are going to have to write a parser, which is lots of work.
Before embarking on writing any kind of database, I think you should understand a lot of database theory: get a book and read it.
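A minimal sketch of the List-plus-LINQ approach mentioned above, assuming XML serialization is acceptable for persistence (the User type and file name are just illustrations):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Xml.Serialization;

    public class User
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public static class SimpleStore
    {
        static readonly XmlSerializer Serializer = new XmlSerializer(typeof(List<User>));

        // Rewrites the whole list on every save; fine for a few hundred objects.
        public static void Save(List<User> users, string path)
        {
            using (var stream = File.Create(path))
                Serializer.Serialize(stream, users);
        }

        public static List<User> Load(string path)
        {
            if (!File.Exists(path)) return new List<User>();
            using (var stream = File.OpenRead(path))
                return (List<User>)Serializer.Deserialize(stream);
        }
    }

    class Demo
    {
        static void Main()
        {
            var users = SimpleStore.Load("users.xml");
            users.Add(new User { Id = 1, Name = "Ada" });

            // Query in memory with LINQ instead of SQL.
            var aNames = users.Where(u => u.Name.StartsWith("A")).ToList();

            SimpleStore.Save(users, "users.xml");
        }
    }

None of the ACID guarantees apply here; a crash mid-save can lose or corrupt the file, which is exactly the trade-off described above.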

Take a look at the work done by the Prevayler guys. They make the observation that if you can fit the data in RAM, most usage scenarios show much better performance, and require a lot less code, without an RDBMS. On the other hand, the Google and Amazon people show that for large amounts of data you do not want to use an RDBMS either. As we're moving to 64-bit OSes and PCs with lots of RAM, RDBMSs are caught between a rock and a hard place.

The SO question "Implementing a database - How to get started" has some useful answers to your question!

Although this is a late response, there are a few basic scenarios you need to take into account before you do this, even if you have prior knowledge of how a database and its engine work.
1. Is it for heavy storage?
If so, you need to fine-tune the pages and work on a file format that doesn't take much time to load and retrieve.
2. Does it need to handle many connections?
Again the pages are important, but you may also need to create an engine for a service, or an app-based instance working behind the scenes.
3. Is it for application usage or web usage?
If it is for the web, then really just use MySQL or MS SQL.
Do not opt for in-memory storage as your DB storage, because that nullifies the purpose of a database. The database was created so that you can free up memory, releasing table objects after an amount of time and giving that memory back to the system. If it is for light use, create a simple XML/custom-file database system, because you are not saving or altering mass amounts of data at a time. Better than that, use SQLite, which is very well suited to that purpose. If it is for open-source or commercial use, do not go with in-memory storage, because you don't want to force someone to meet a high memory requirement; memory costs money, and some folks are still running a 32-bit OS.

Related

Most effective way of storing and managing a moderate number of users

In a current project of mine I need to manage and store a moderate number (from 10-100 to 5000+) of users (ID, username, and some other data).
This means I have to be able to find users quickly at runtime, and I have to be able to save and restore the database to continue statistics after a restart of the program. I will also need to register every connect/disconnect/login/logout of a user for the statistics. (And some other data as well, but you get the idea).
In the past, I saved settings and other stuff in encoded text files, or serialized the needed objects and wrote them to disk. But these methods require me to rewrite the whole database on each change, and that increasingly slows things down (especially with a growing number of users/entries), doesn't it?
Now the question is: What is the best way to do this kind of thing in C#?
Unfortunately, I don't have any experience in SQL or other query languages (except for a bit of LINQ), but that's not posing any problem for me, as I have the time and motivation to learn one (or more if required) for this task.
"Most effective" is highly subjective based on who you ask, even if you narrow this question down to specific needs. If you are storing non-relational data, Mongo or some other NoSQL type of database, such as RavenDB, would be effective. If your data has a relational shape, then an RDBMS such as MySQL, SQL Server, or Oracle would be effective. Relational databases are ideal if you are going to have heavy reporting requirements, as they allow non-developers to write simple SQL queries against the data with relative ease.
Also keep in mind the performance benefit of the disk caching that databases provide: commonly accessed data is stored in memory to save round trips to the disk (with hybrid drives, I suppose, accessing some files directly accomplishes the same thing; however, SSDs are still not as fast as RAM access).
So you really need to ask yourself some questions to identify the best solution for you: what is the shape of your data (flat, relational, etc.)? Do you have reporting requirements where less technical team members need to be able to query the data repository? And what are your performance metrics?
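If a relational store fits, here is a hedged sketch of the user-storage scenario from the question using SQLite, assuming the Microsoft.Data.Sqlite NuGet package (the table layout and file name are illustrative):

    using System;
    using Microsoft.Data.Sqlite;

    class UserStore
    {
        static void Main()
        {
            using (var conn = new SqliteConnection("Data Source=users.db"))
            {
                conn.Open();

                // One-time setup: a table plus an index for fast username lookups.
                var create = conn.CreateCommand();
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS users (" +
                    " id INTEGER PRIMARY KEY," +
                    " username TEXT NOT NULL," +
                    " last_login TEXT);" +
                    "CREATE INDEX IF NOT EXISTS ix_users_username ON users(username);";
                create.ExecuteNonQuery();

                // Each change is one small INSERT, not a rewrite of the whole file.
                var insert = conn.CreateCommand();
                insert.CommandText = "INSERT INTO users (username, last_login) VALUES ($u, $t)";
                insert.Parameters.AddWithValue("$u", "alice");
                insert.Parameters.AddWithValue("$t", DateTime.UtcNow.ToString("o"));
                insert.ExecuteNonQuery();

                // Fast lookup at runtime.
                var query = conn.CreateCommand();
                query.CommandText = "SELECT id FROM users WHERE username = $u";
                query.Parameters.AddWithValue("$u", "alice");
                object id = query.ExecuteScalar();
            }
        }
    }

This directly addresses the "rewrite the whole database on each change" problem from the question: only the changed rows are written.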

Reduce SQL Server overhead by caching query results

I have a piece of software that does heavy processing based on some files.
I have to query some tables in SQL Server during the process, and this is killing the DB and the application's performance (other applications use the same tables).
After optimizing queries and code, I'm getting better results, but not enough. After some research I reached a solution: caching some query results. My idea is to cache the rows of one specific table (identified as the overhead) that the file being processed needs.
I was thinking of using AppFabric Caching (I'm on the MS stack). I made some tests; it has large memory usage for small objects (the AppFabric cache service uses ~350 MB of RAM without any objects). But I also need to run some queries against these cached rows (like searching by last name, SSN, birth date, etc.).
My second option is MongoDB as a cache store. I've researched this, and most of the people I read recommend using memcached or Redis, but I'm using Windows servers and they're not officially supported there.
Is using Mongo as a cache store a good approach in this case? Or is AppFabric Caching + tag search better?
It is hard to tell what is better, because we don't know enough about your bottlenecks. A lot depends on the nature of the data you're discussing. If the data is very static, is not called constantly, but is time-consuming to compile into a data set, a good solution might be a materialized view. If the data is frequently called, then you would be better off caching it on some server (e.g. AppFabric).
There are many techniques and possibilities, but you really need to think about the network traffic, demand, size, etc. It is hard to answer here without knowing all the details.
It looks like you are on the right track, but maybe all you need is just a parameterized query; it's hard to tell. I would, however, add a materialized view to the roster you just posted. Maybe all you need is to build that view from the data you need and just access its contents.
My question to you would be: what are your long-term goals or estimates for your application? If this is the highest load you are going to experience, then tuning the DB or using a materialized view would be an answer. But the long-term solution is distributed caching, and you are already thinking along those lines. Your data is what we'd call "reference data" or "lookup data", and once you are executing multiple lookups with limited DB resources, there will be performance issues and your DB will become a bottleneck.
So the solution, which you are already thinking of, is to cache this "reference" data so you don't need to go to the database, while at the same time keeping the cache synchronized with the database.
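As a hedged sketch of that idea using the in-process System.Runtime.Caching API (the Person type, the loader, and the five-minute expiry are illustrative assumptions, not details from the question):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Runtime.Caching; // reference the System.Runtime.Caching assembly

    public class Person
    {
        public string LastName { get; set; }
        public string Ssn { get; set; }
        public DateTime BirthDate { get; set; }
    }

    public static class ReferenceDataCache
    {
        static readonly MemoryCache Cache = MemoryCache.Default;

        // Hypothetical loader; in reality this runs the expensive SQL query once.
        static List<Person> LoadPeopleFromDb()
        {
            return new List<Person>();
        }

        public static List<Person> GetPeople()
        {
            var people = Cache.Get("people") as List<Person>;
            if (people == null)
            {
                people = LoadPeopleFromDb();
                // Expire after 5 minutes so the cache stays roughly in sync with the DB.
                Cache.Set("people", people, new CacheItemPolicy
                {
                    AbsoluteExpiration = DateTimeOffset.UtcNow.AddMinutes(5)
                });
            }
            return people;
        }

        // The cached rows can then be searched in memory (last name, SSN, etc.).
        public static List<Person> FindByLastName(string lastName)
        {
            return GetPeople().Where(p => p.LastName == lastName).ToList();
        }
    }

A distributed cache (AppFabric, NCache, etc.) follows the same get-or-load pattern, just shared across servers instead of living inside one process.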
I wouldn't be too sure about AppFabric, as it will have the same support issues that you mention. What is your budget like? Could you consider spending on a caching solution like NCache?

ADO and Microsoft SQL database backup and archival

I am working on the re-engineering/upgrade of a tool. The database communication is in C++ (unmanaged ADO) and connects to SQL Server 2005.
I have a few queries regarding archiving and backup/restore techniques.
1. Generally, archiving is different from backup/restore. Can someone provide a link that explains this? Presently the solution uses the bcp tool for archival, and I see a lot of dependency on table names in the code. What do I have to consider in choosing the design (given that I have to run the backup/archival on a button click, with a database size of 100 MB at most)?
2. Will moving the entire communication to .NET be of any help, considering the many ORM tools available? Also, all the business logic and UI is in C#.
3. What is the best method to verify the archived data?
PS: The question might be too high-level, but I did not find any proper link to understand this. It would be really helpful if someone could answer. I can provide more details!
Thanks in advance!
At 100 MB, I would say you should probably not spend too much time on archiving, and just use traditional backup strategies. Your database is so small that archiving would be quite an elaborate operation with very little gain; the archiving process is typically only relevant for huge databases.
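Since the question mentions taking a backup on a button click, here is a minimal sketch of issuing a full backup from C#. The database name, connection string, and file path are illustrative, and the path must be writable by the SQL Server service account:

    using System.Data.SqlClient;

    class BackupDemo
    {
        static void Main()
        {
            using (var conn = new SqlConnection(@"Server=.;Database=master;Integrated Security=true"))
            {
                conn.Open();
                // WITH INIT overwrites any previous backup set in the file.
                var cmd = new SqlCommand(
                    "BACKUP DATABASE [MyToolDb] TO DISK = @path WITH INIT", conn);
                cmd.Parameters.AddWithValue("@path", @"C:\Backups\MyToolDb.bak");
                cmd.CommandTimeout = 0; // backups can outlive the default 30-second timeout
                cmd.ExecuteNonQuery();
            }
        }
    }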
Generally speaking, a backup in database terms is a way to provide recoverability in case of a disaster (accidental data deletion, server crash, etc). Archiving mostly means you partition your data.
A possible goal of archiving is to keep specific data available for querying, but without the ability to alter it. When dealing with high-volume databases, this is an excellent way to increase performance, as read-only data can be indexed much more densely than "hot" data. It also allows you to move the read-only data to an isolated RAID partition that is optimized for READ operations and doesn't have to bother with the typical RDBMS I/O. Also, removing the non-active data from the regular database means the size of the data contained in your tables will decrease, which should boost performance of the overall system.
Archiving is typically done for legal reasons. The data in question might not be important for the business anymore, but the IRS or banking rules require it to be available for a certain amount of time.
Using SQL Server, you can archive your data using partitioning strategies. This normally involves figuring out the criteria on which you will split the data. An example of this could be a date (i.e. data older than 3 years will be moved to the archive part of the database). In the case of huge systems, it might also make sense to split data based on geographical criteria (e.g. the Americas on one server, Europe on another).
To answer your questions:
1) See the explanation written above
2) It really depends on what the goal of the upgrade is. Moving to .NET will make the code managed, but how important is that to the business?
3) If you do decide to partition, verifying that it works could include issuing a query on the original database for data that contains values both before and after the threshold you will be using for partitioning, then splitting the data, and re-issuing the query afterwards to verify it still returns the same record set. If you configure the system to use an automatic sliding window, you could also keep an eye on the system to ensure that data is automatically moved to the archive partition.
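A hedged sketch of that verification step; the Orders table, OrderDate column, and date range are hypothetical stand-ins for your own schema and partitioning threshold:

    using System;
    using System.Data.SqlClient;

    class PartitionCheck
    {
        // Count rows straddling the partitioning threshold (hypothetical schema).
        static long CountAroundThreshold(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "SELECT COUNT_BIG(*) FROM Orders " +
                    "WHERE OrderDate >= @from AND OrderDate < @to", conn);
                cmd.Parameters.AddWithValue("@from", new DateTime(2005, 1, 1));
                cmd.Parameters.AddWithValue("@to", new DateTime(2012, 1, 1));
                return (long)cmd.ExecuteScalar();
            }
        }
        // Run once before partitioning and once after; the two counts should match.
    }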
Again, if the 100 MB is not a typo, I would think your database is too small to really benefit from archiving. If your goal is to speed things up, put the system on a server that can load the whole database into RAM, or use SSD drives.
If you need to establish a data archive for legal or administrative reasons, give horizontal table partitioning a look. It's a pretty straightforward process that is mostly handled by SQL Server automatically.
Hope this helps you out!

How to store my data (C#.net)

I'm having a bit of a problem deciding how to store some data. Seen from a simple perspective, it will be a simple table of data, but there will be many tables: about 7 columns in each, and the tables will be created at runtime, whenever the customer wants a clean grid.
The data has to be stored locally in a file (and there will not be multiple instances of the software running).
I'm using C# 4.0, and I have been looking at using XML files (one file per table, or multiple tables per file), SQLite, SQL Server CE, Access, etc. I would be happy if someone here had comments or suggestions on what to do or not to do. Stability and reliability (e.g. no trashed databases because of unstable third-party software) is probably my biggest concern.
If you are looking to store the data locally in a file, I would recommend the SQLite option, since your data is already shaped like database tables. SQLite is built to handle multiple tables and columns, so it means less mental overhead for you, the developer.
http://web.archive.org/web/20100208133236/http://www.mikeduncan.com/sqlite-on-dotnet-in-3-mins/ is a decent tutorial to give a quick overview on how to set it up and get going.
As for what NOT to do: don't try to make your own scheme to save the data to a file; it's a well-understood problem that has been solved many times over, so why reinvent the wheel?
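To illustrate the runtime-table point from the question, a short sketch, again assuming the Microsoft.Data.Sqlite NuGet package (the column layout is illustrative; if the table name comes from user input, validate it first, since identifiers cannot be parameterized):

    using Microsoft.Data.Sqlite;

    static class GridStore
    {
        // Create a fresh 7-column table whenever the customer wants a clean grid.
        public static void CreateGrid(SqliteConnection conn, string tableName)
        {
            var cmd = conn.CreateCommand();
            cmd.CommandText =
                "CREATE TABLE IF NOT EXISTS [" + tableName + "] (" +
                " col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT," +
                " col5 TEXT, col6 TEXT, col7 TEXT);";
            cmd.ExecuteNonQuery();
        }
    }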
XML won't be a good choice if you are planning to run several queries, since loading text files becomes painful as they grow (we're talking about files over 1 MB). If you plan to keep the data small, XML would be good for keeping things simple. I still wouldn't use it, but if you have a background in it, the benefits may outweigh the learning curve.
If you have no expertise in any of them and the data is light, my suggestion is SQLite. I believe it is the best lightweight DB for .NET, and the provider is very good; you can find it easily on Google.
I would tell you that Access is not recommendable, but this is a personal opinion. Many people use it, and I imagine that's for a reason, so you should check it out and try it.
Again, my final recommendation is SQLite, unless you know another one very well, in which case you'll have to think about how much your data is going to grow. If you plan to have a DB around 100 MB, any of them except XML would do; if you think it will grow bigger than that, consider SQLite heavily.

Simple Object to Database Product

I've been taking a look at some different products for .NET which propose to speed up development time by providing a way for business objects to map seamlessly to an automatically generated database. I've never had a problem writing a data access layer, but I'm wondering if this type of product will really save the time it claims. I also worry that I will be giving up too much control over the database and make it harder to track down any data level problems. Do these type of products make it better or worse in the already tough case that the database and business object structure must change?
For example:
Object Relational Mapping from DevExpress
In essence, is it worth it? Will I save "THAT" much time, effort, and future bugs?
I have used SubSonic and EntitySpaces. Once you get the hang of them, I believe they can save you time, but as the complexity of your app and the volume of data grow, you may outgrow these tools. You start to lose time trying to figure out whether something like a performance issue is caused by the ORM or by your code. So, to answer your question, I think it depends. I tend to agree with Eric on this: high-volume enterprise apps are not a good place for general-purpose ORMs, but in standard-fare, smaller CRUD-type apps you might see some time saved.
I've found iBatis from the Apache group to be an excellent solution to this problem. My team is currently using iBatis to map all of our calls from Java to our MySQL backend. It's been a huge benefit as it's easy to manage all of our SQL queries and procedures because they're all located in XML files, not in our code. Separating SQL from your code, no matter what the language, is a great help.
Additionally, iBatis allows you to write your own data mappers to map data to and from your objects to the DB. We wanted this flexibility, as opposed to a Hibernate type solution that does everything for you, but also (IMO) limits your ability to perform complex queries.
There is a .NET version of iBatis as well.
I recently set up ActiveRecord from the Castle Project for an app. It was pretty easy to get going. After creating a new app with it, I even used MyGeneration to script out class files for a legacy app that ActiveRecord could use, in a pretty short time. It uses NHibernate to interact with the database, but takes away all the XML mapping that comes with NHibernate. The nice thing, though, is that since you already have NHibernate in your project, you can use its full power for any special cases. I'd suggest taking a look at it.
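A hedged sketch of what the attribute-based mapping looks like in Castle ActiveRecord; the class and table names are illustrative, and the one-time ActiveRecordStarter initialization is omitted:

    using Castle.ActiveRecord;

    // Maps this class to a "Users" table; no XML mapping file needed.
    [ActiveRecord("Users")]
    public class User : ActiveRecordBase<User>
    {
        [PrimaryKey]
        public virtual int Id { get; set; }

        [Property]
        public virtual string Name { get; set; }
    }

    // Usage, after the framework has been initialized:
    //   var user = User.Find(1);
    //   user.Name = "New name";
    //   user.Save();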
There are lots of ORM choices: LINQ to SQL, NHibernate, and others. For pure object databases there is db4o.
It depends on the application, but for a high volume enterprise application, I would not go this route. You need more control of your data.
I was discussing this with a friend over the weekend, and it seems like the gains you make in ease of storage are lost if you need to be able to query the database outside of the application. My understanding is that these databases work by storing your object data in a denormalized fashion. This makes it fast to retrieve entire sets of objects, but if you need to select data from a perspective that doesn't match your object model, the ODBMS might have a hard time getting at the particular data you want.
