Storing a large amount of analytical data - C#

I normally use SQL Server and C# for all my projects, but I am looking at a project that could potentially grow to billions of rows of data, and I don't feel comfortable doing that in SQL Server.
The data I will be storing is:
datetime
ipAddress
linkId
possibly other string related data
I have only ever dealt with relational databases, so I am looking for some guidance on which database technology would be best suited for this type of data storage: one that can scale, and do so at a low cost (compared to sharding SQL Server).
I would then need to pull this data out based on linkId.
Also, would I be able to do the ordering within the query to the DB, or would that be better done in the application?
EDIT: It will be cloud-based. Hence I was looking at SQL Azure, which I have used extensively; however, it starts causing issues as the row count goes up.

Since you are looking for general guidance, I feel it is OK to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table with those fields and billions of rows). I have personally worked on a data warehouse that had 4 nodes, each of which had a main fact table holding 1.2 - 1.5 billion rows (and growing) and responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It was a web-based application with many users hitting it all day long (though some periods of the day much harder than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to model that properly as well).
True, the free Express edition might not meet your needs, but Standard Edition likely would, and it is not super expensive. Enterprise has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.
Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But given that you seem to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
Good luck :)
EDIT:
When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.
I intentionally left out the Table Partitioning feature, not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/index partitioning in SQL Server is not a means of "sharding".
I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating. They were introduced in SQL Server 2012 and came with the restriction that the table could not be updated once the Column Store index was created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
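To make the "good data modeling, good index design" point a bit more concrete, here is a minimal sketch of how such a table could be laid out and queried from C# (all table, column, and connection-string names are invented for the example; the clustered index on (LinkId, EventTime) is just one reasonable choice, not the only one). It also answers the ordering question from the post: when an index supports the sort, letting the database do the ORDER BY is usually the right call.

    using System;
    using System.Data.SqlClient;

    class ClickStore
    {
        // Connection string is a placeholder; point it at your own server/database.
        const string ConnStr = "Server=.;Database=Analytics;Integrated Security=true";

        // One-time setup: a narrow fact table clustered on (LinkId, EventTime)
        // so that "all rows for a link, in time order" is a single range scan.
        public static void CreateTable()
        {
            const string ddl = @"
                CREATE TABLE dbo.LinkHit (
                    LinkId    INT           NOT NULL,
                    EventTime DATETIME2(0)  NOT NULL,
                    IpAddress VARCHAR(45)   NOT NULL,   -- wide enough for IPv6 text form
                    Extra     NVARCHAR(400) NULL
                );
                CREATE CLUSTERED INDEX IX_LinkHit_LinkId_EventTime
                    ON dbo.LinkHit (LinkId, EventTime);";

            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(ddl, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }

        // Pull the hits for one link, letting SQL Server do the ordering.
        public static void PrintHits(int linkId)
        {
            const string sql = @"
                SELECT EventTime, IpAddress
                FROM dbo.LinkHit
                WHERE LinkId = @linkId
                ORDER BY EventTime DESC;";

            using (var conn = new SqlConnection(ConnStr))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@linkId", linkId);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine($"{reader.GetDateTime(0):o}  {reader.GetString(1)}");
            }
        }
    }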

Given that this needs to be cloud-based and that you use .NET / C#, if you really are only talking about a few tables (so far just the stated one and the implied "Link" table, the source of LinkId) and hence might not need relationships or some of the other RDBMS features, then one option is to use Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development, and even the initial stage of rolling out a project, are made a bit easier by their low-end free tier. As of 2013-11-04, the main DynamoDB page states that:
AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.
Here is some documentation: Overview, How to Query with .Net, and general .Net SDK.
BE AWARE: When looking into how much you think it might cost, be sure to include related AWS pieces, such as Network usage, etc.
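If it helps to picture the DynamoDB side from C#, here is a rough sketch using the AWS SDK for .NET (the table name, attribute names, and region are assumptions for the example; the table would need LinkId as its hash key and EventTime as its range key for this query shape, which is also what gives you server-side ordering):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    class LinkHitQuery
    {
        // Table/attribute names are made up for the example; the table would need
        // LinkId as its hash key and EventTime as its range key.
        public static async Task PrintHitsAsync(string linkId)
        {
            using (var client = new AmazonDynamoDBClient(RegionEndpoint.USEast1))
            {
                var request = new QueryRequest
                {
                    TableName = "LinkHits",
                    KeyConditionExpression = "LinkId = :link",
                    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
                    {
                        [":link"] = new AttributeValue { S = linkId }
                    },
                    ScanIndexForward = false   // newest first; ordering is by the range key
                };

                QueryResponse response = await client.QueryAsync(request);
                foreach (var item in response.Items)
                    Console.WriteLine($"{item["EventTime"].S}  {item["IpAddress"].S}");
            }
        }
    }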

Related

Most effective way of storing and managing moderate number of users

In a current project of mine I need to manage and store a moderate number (from 10-100 to 5000+) of users (ID, username, and some other data).
This means I have to be able to find users quickly at runtime, and I have to be able to save and restore the database to continue statistics after a restart of the program. I will also need to register every connect/disconnect/login/logout of a user for the statistics. (And some other data as well, but you get the idea).
In the past, I saved settings and other data in encoded text files, or serialized the needed objects and wrote them to disk. But these methods require me to rewrite the whole database on each change, and that increasingly slows things down (especially with a growing number of users/entries), doesn't it?
Now the question is: What is the best way to do this kind of thing in C#?
Unfortunately, I don't have any experience in SQL or other query languages (except for a bit of LINQ), but that's not posing any problem for me, as I have the time and motivation to learn one (or more if required) for this task.
"Most effective" is highly subjective, depending on who you ask, even after narrowing this question down to specific needs. If you are storing non-relational data, Mongo or some other NoSQL type of database, such as RavenDB, would be effective. If your data has a relational shape, then an RDBMS such as MySQL, SQL Server, or Oracle would be effective. Relational databases are ideal if you are going to have heavy reporting requirements, as they allow non-developers more ease of access in writing simple SQL queries. Also keep in mind the performance that a database's disk caching provides: commonly accessed data is kept in memory to save the round trips to the disk (with hybrid drives, I suppose, accessing some files directly accomplishes much the same thing; however, SSDs are still not as fast as RAM). So you really need to ask yourself some questions to identify the best solution for you: What is the shape of your data (flat, relational, etc.)? Do you have reporting requirements where less technical team members need to be able to query the data repository? And what are your performance metrics?
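To make the relational option a bit more concrete for the "moderate number of users plus an event log" case, here is a minimal sketch using an embedded SQLite file via the Microsoft.Data.Sqlite package (the schema, file name, and class names are invented for the example). Because each insert only touches the affected rows, it avoids the "rewrite the whole file on every change" problem from the question.

    using Microsoft.Data.Sqlite;   // embedded database, no server required

    class UserStore
    {
        readonly string _connStr = "Data Source=users.db";   // file name is just an example

        public void EnsureSchema()
        {
            using (var conn = new SqliteConnection(_connStr))
            {
                conn.Open();
                var cmd = conn.CreateCommand();
                cmd.CommandText = @"
                    CREATE TABLE IF NOT EXISTS Users (
                        Id       INTEGER PRIMARY KEY,
                        Username TEXT NOT NULL UNIQUE
                    );
                    CREATE TABLE IF NOT EXISTS UserEvents (
                        UserId     INTEGER NOT NULL REFERENCES Users(Id),
                        EventType  TEXT    NOT NULL,          -- connect/disconnect/login/logout
                        OccurredAt TEXT    NOT NULL           -- ISO-8601 timestamp
                    );";
                cmd.ExecuteNonQuery();
            }
        }

        public void LogEvent(long userId, string eventType)
        {
            using (var conn = new SqliteConnection(_connStr))
            {
                conn.Open();
                var cmd = conn.CreateCommand();
                cmd.CommandText =
                    "INSERT INTO UserEvents (UserId, EventType, OccurredAt) VALUES ($u, $e, $t)";
                cmd.Parameters.AddWithValue("$u", userId);
                cmd.Parameters.AddWithValue("$e", eventType);
                cmd.Parameters.AddWithValue("$t", System.DateTime.UtcNow.ToString("o"));
                cmd.ExecuteNonQuery();
            }
        }
    }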

using nosql database as a replacement for sql server

I'm developing a website that (if successful) is going to have a rapidly growing database (maybe terabytes or more). Up to now I have always used SQL Server and didn't know anything about NoSQL.
I just found out about NoSQL while doing research about the database size, and now I'm not sure if it will fulfil my needs. Will I have the same power that I had with SQL Server?
My question may seem silly as I'm a newbie in NoSQL, but I just wanted to know: if it doesn't support SQL queries, how can we do something like:
select *, (select name from cities where id = cityid) from users
How do you join tables? Can you use something like stored procedures, views, or things like that?
That's a big question. NoSQL is a broad term, pretty much used to describe a bunch of non-relational data stores. They can range from MongoDB and RavenDB (which are document stores) to things like Redis and other variants of key/value stores. They all operate very differently from SQL relational models (and the resulting T-SQL).
Document databases like Mongo or Raven typically have a C# driver that (in most cases) allows you to use LINQ queries across the datastore (there is a Mongo example here on this thread and a RavenDB example on their documentation page). They are all specific to their engine and different.
None of these engines is specifically designed to address the 'space' issue you are describing; rather, they try to provide a low-friction, fast way of interacting with a datastore. All these data stores will still grow in size in the same way SQL does when you throw massive amounts of data at them. SQL Server will handle massive databases, as will most of the document stores and other NoSQL variants. To be honest, I'd trust SQL Server more than the newer NoSQL stores simply because it has been field-tested for longer; however, as already stated, these document stores (and other stores like Apache Cassandra) can all handle large volumes of data. My only suggestion is to look at how you want to query the data. Document stores typically don't have the concepts of relational integrity, like foreign keys, and so normalisation rules do not apply. In addition, you need to assess your reporting needs, as SQL typically has an advantage in this area with more tooling. You can also choose a hybrid approach, using SQL for your relational data and document stores for other object blobs and the like.
I would suggest looking into how you want to access your data first, and then assessing which one best suits your needs. One thing to note, too, is that SQL Server has some great features, but often only in the Enterprise editions, and that costs a lot. Document databases tend to cost a LOT less for licensing, some being free, with many companies offering hosting, removing the need for you to worry about it. Finally, if going with SQL, I would suggest looking into sharding approaches from the very beginning, given the amount of data you will be processing, as this will make it much more manageable and also allow better query performance.
I've used MongoDB quite a bit. I'd suggest signing up for a sandbox account on MongoLab and playing around with it. There is an excellent C# driver for it too. NoSQL is not really relational, although you can relate documents via IDs. In your example you'd store an array of cities (if I am reading your example correctly) against the User document and query that, or vice versa. There's less of a concern about data repetition because storage concerns aren't as important as they used to be. I write my scripts (the equivalent of stored procs) in JavaScript and run them directly against Mongo; it's incredibly flexible and powerful. Of course, if you have tons of related objects, perhaps a relational database is your best bet.
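As a rough illustration of what that looks like with the official C# driver (database, collection, and class names are invented for the example), you either embed the city data in the user document, or store just a CityId and do the "join" yourself with a second lookup:

    using MongoDB.Driver;

    // Document shapes are invented for the example.
    class City { public int Id { get; set; } public string Name { get; set; } }
    class User { public int Id { get; set; } public string Name { get; set; } public int CityId { get; set; } }

    class Example
    {
        public static void Run()
        {
            var db     = new MongoClient("mongodb://localhost").GetDatabase("demo");
            var users  = db.GetCollection<User>("users");
            var cities = db.GetCollection<City>("cities");

            // No server-side join: fetch the user, then look up the referenced city.
            var user = users.Find(u => u.Name == "alice").FirstOrDefault();
            var city = cities.Find(c => c.Id == user.CityId).FirstOrDefault();

            System.Console.WriteLine($"{user.Name} lives in {city.Name}");
        }
    }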

ADO and Microsoft SQL database backup and archival

I am working on the re-engineering/upgrade of a tool. The database communication is in C++ (unmanaged ADO) and connects to SQL Server 2005.
I have a few questions regarding archiving and backup/restore techniques:
1) Generally, archiving is different from backup/restore. Can someone provide a link that explains the difference? Presently the solution uses the bcp tool for archival, and I see a lot of dependency on table names in the code. What do I have to consider in choosing the design (given that I have to perform the backup/archival on a button click, with a database size of 100 MB at most)?
2) Will moving the entire communication layer to .NET be of any help, considering the many ORM tools available? Also, all the business logic and UI is in C#.
3) What is the best method to verify the archived data?
PS: The question might be too high-level, but I did not find any proper link to help me understand this. It would be really helpful if someone could answer; I can provide more details!
Thanks in advance!
At 100 MB, I would say you should probably not spend too much time on archiving, and just use traditional backup strategies. The size of your database is so small that archiving would be quite an elaborate operation with very little gain, as the archiving process would typically only be relevant in the case of huge databases.
Generally speaking, a backup in database terms is a way to provide recoverability in case of a disaster (accidental data deletion, server crash, etc). Archiving mostly means you partition your data.
A possible goal with archiving is to keep specific data available for querying, but without the ability to alter it. When dealing with high-volume databases, this is an excellent way to increase performance, as read-only data can be indexed much more densely than "hot" data. It also allows you to move the read-only data to an isolated RAID partition that is optimized for READ operations and does not have to bother with the typical RDBMS IO. Also, removing the non-active data from the regular database means the size of the data contained in your tables will decrease, which should boost performance of the overall system.
Archiving is typically done for legal reasons. The data in question might not be important for the business anymore, but the IRS or banking rules require it to be available for a certain amount of time.
Using SQL Server, you can archive your data using partitioning strategies. This normally involves figuring out the criteria based on which you will split the data. An example of this could be a date (e.g. data older than 3 years will be moved to the archive part of the database). In the case of huge systems, it might also make sense to split data based on geographical criteria (e.g. the Americas on one server, Europe on another).
To answer your questions:
1) See the explanation written above
2) It really depends on what the goal of upgrading is. Moving it to .NET will get the code to be managed, but how important is that for the business?
3) If you do decide to partition, verifying that it works could include issuing a query on the original database for data that contains values both before and after the threshold you will be using for partitioning, then splitting the data, and re-issuing the query afterwards to verify it still returns the same record set. If you configure the system to use an automatic sliding window, you could also keep an eye on the system to ensure that data is automatically moved to the archive partition.
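As a sketch of what that verification could look like from C# (connection strings, the table name, and the date column are placeholders), you can capture a row count for the affected range before the split and compare it against the live and archive databases afterwards:

    using System;
    using System.Data.SqlClient;

    class ArchiveVerifier
    {
        // Table and column names are trusted placeholder constants for the example.
        static long CountOlderThan(string connStr, DateTime threshold)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT COUNT_BIG(*) FROM dbo.Orders WHERE CreatedAt < @t", conn))
            {
                cmd.Parameters.AddWithValue("@t", threshold);
                conn.Open();
                return (long)cmd.ExecuteScalar();
            }
        }

        public static bool ArchiveLooksComplete(string liveConnStr, string archiveConnStr,
                                                DateTime threshold, long countBeforeSplit)
        {
            // After the split, rows older than the threshold should all be in the archive
            // and none of them should remain in the live table.
            long inArchive = CountOlderThan(archiveConnStr, threshold);
            long stillLive = CountOlderThan(liveConnStr, threshold);
            return stillLive == 0 && inArchive == countBeforeSplit;
        }
    }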
Again, if the 100MB is not a typo, I would think your database is too small to really benefit from archiving. If your goal is to speed things up, put the system on a server that is able to load the whole database into RAM, or use SSD drives.
If you need to establish a data archive for legal or administrative reasons, give horizontal table partitioning a look. It's a pretty straight-forward process that is mostly handled by SQL Server automatically.
Hope this helps you out!

C# WinForms as a front-end for Access

I'm in the early stages of building a WinForms C# app based on an Access DB (I can't use other types of DB for various reasons).
My main issue is how to design the DB, since the amount of data is vast (it is based on daily data) and it will hit its size limit within a month or so.
I thought of creating a new DB for every month, but then how will I be able to compare data between the different DBs, for example between months? I want the C# app to execute the queries.
Are there any tutorials or books? I have no experience of using and linking Access front ends and back ends.
Any ideas?
Thanks!
You probably don't want to hear this, but starting a project with MS Access as the backend is not a good idea when you already know in advance that you will hit Access' size limit after only a month.
You say in a comment:
I'm stuck with Access because of these: 1. High cost of SQL Server. 2. I'm not familiar with SQL Server. 3. SQL Server Express (the free edition) also has a size limit, though larger (10 GB). Are there other DBs that are free and without a size limitation?
I agree with you that SQL Server is not a good solution in your situation - high price and size limitation are valid arguments against SQL Server (both full version and Express Edition).
So, in my opinion using a different database engine is the only real solution here.
Your other argument against SQL Server was "I'm not familiar with it", but I strongly advise you to become familiar with another database engine than Access, because using Access in your situation (size limit!!!) will be a pain in the long run.
(Note to all nitpickers: No, I'm not bashing Access in general - I'm making a living with it myself.
However, it has its limits and when you know in advance that you'll hit its size limit within a month, it's not a good idea to use it here.)
Yes, you could do some hack and use a different Access database for each month, but you will really feel the pain as soon as your users need to load data from several months at once, or as soon as your boss asks you for a "quick report about our sales in the last three years" :-)
But you can use a different database engine. Yes, you will have to invest time to become familiar with it, learn how to set it up and so on.
But believe me, it will pay off in the long run because you don't have to deal with the hassle of one database file per month.
There are lots of free and capable database engines available, the most known are:
PostgreSQL
MySQL
Firebird
To connect to the MS Access database(s) you can use the code shown here and then you can go about 'joining' the data in your C# front-end.
You might end up writing a subset of a DB engine in C#, though, and I thoroughly support the comments provided by Bernard and Bryan.
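For reference, one common way to connect from C# to an Access file is via OleDb; a minimal sketch (the file path and table name are placeholders, and the ACE provider has to be installed on the machine) might look like this:

    using System;
    using System.Data.OleDb;

    class AccessReader
    {
        public static void Dump(string accdbPath)
        {
            // Requires the Microsoft Access Database Engine (ACE) provider to be installed.
            string connStr = $"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={accdbPath}";

            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand("SELECT TOP 10 * FROM DailyData", conn))  // table name is a placeholder
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine(reader[0]);
            }
        }
    }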
Your DB design should be isolated from your front-end technology decisions, and the reverse is also true. See Multitier Architecture.
Using a multi-tier architecture will help you separate presentation from business logic from data access, allowing you to design and implement each of these components in a modular and robust fashion.
Searches for N-Tier architecture or Multitier architecture will find a wealth of information and help on how to implement multi-tier solutions and why you should go to the trouble.
Here are a few links that might help.
http://msdn.microsoft.com/en-us/library/aa288452(v=vs.71).aspx
http://bytes.com/topic/net/answers/516465-c-insert-statement-using-oledb
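To make the separation concrete, the idea is roughly this (interface and class names are invented for the example): the WinForms layer depends only on an interface, and the Access-specific (or any other) data-access code lives behind it, so swapping the storage engine later does not touch the forms.

    using System;
    using System.Collections.Generic;

    // Data-access contract the WinForms layer depends on.
    public interface IMeasurementRepository
    {
        void Add(Measurement m);
        IReadOnlyList<Measurement> GetForMonth(int year, int month);
    }

    public class Measurement
    {
        public DateTime TakenAt { get; set; }
        public double Value { get; set; }
    }

    // One possible implementation; an Access/OleDb-backed class would implement the
    // same interface, and the forms would never know which one they are using.
    public class InMemoryMeasurementRepository : IMeasurementRepository
    {
        private readonly List<Measurement> _items = new List<Measurement>();

        public void Add(Measurement m) => _items.Add(m);

        public IReadOnlyList<Measurement> GetForMonth(int year, int month) =>
            _items.FindAll(x => x.TakenAt.Year == year && x.TakenAt.Month == month);
    }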

SQLite, Berkeley DB benchmarking

I want to create a desktop application in C#, and for that I want to use an embedded database like SQLite or Berkeley DB. How can I start benchmarking these databases?
Recently, Oracle added the sqlite3 interface on top of BDB's B-tree storage, so you should be able to write your code against sqlite3 and then plug in BDB. The catch is licensing: BDB forces you to either pay or go open source; SQLite lets you do whatever you want.
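One way to get started, sketched here with the System.Data.SQLite package (the schema and row count are arbitrary choices for the example), is simply to time the operations you actually care about, e.g. bulk inserts inside a transaction, and then repeat the same workload against the BDB build of the sqlite3 API:

    using System;
    using System.Data.SQLite;      // System.Data.SQLite NuGet package
    using System.Diagnostics;

    class Benchmark
    {
        public static void Run()
        {
            using (var conn = new SQLiteConnection("Data Source=bench.db"))
            {
                conn.Open();
                new SQLiteCommand("CREATE TABLE IF NOT EXISTS t (id INTEGER, payload TEXT)", conn).ExecuteNonQuery();

                var sw = Stopwatch.StartNew();
                using (var tx = conn.BeginTransaction())
                using (var cmd = new SQLiteCommand("INSERT INTO t (id, payload) VALUES (@i, @p)", conn, tx))
                {
                    for (int i = 0; i < 100000; i++)     // arbitrary workload size
                    {
                        cmd.Parameters.Clear();
                        cmd.Parameters.AddWithValue("@i", i);
                        cmd.Parameters.AddWithValue("@p", "some payload text");
                        cmd.ExecuteNonQuery();
                    }
                    tx.Commit();
                }
                sw.Stop();
                Console.WriteLine($"100k inserts: {sw.ElapsedMilliseconds} ms");
            }
        }
    }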
Before thinking about benchmarking, you need to compare the features of the databases.
SQLite and BDB are completely different in the features they support, and if the data is complicated, I'd suggest SQLite for easier querying of relational data (if that's how your data is laid out).
I agree with Osama that you should compare the features you're after first.
However, I disagree that "complicated" data should automatically drive you toward SQLite. While I haven't seen any benchmarks (nor have I cared to write any), I have a gut reaction (for whatever that's worth) that says Berkeley DB is going to outperform nearly every time.
That said, I don't think that's what I'd use to make my own decision. It goes back to those features. If all I want is a simple data store, then I'd probably choose SQLite because it's going to be easier. Likewise, if I want to be able to arbitrarily query my data on any field, or possibly one day store it in an "enterprise" SQL database, I'd likely go with SQLite because future migration will be easier. If, however, I intend to move beyond a simple data store, and am eyeing transactional safety, high concurrency, high availability, having many readers and writers, etc., and I have a set of fairly well-defined "queries", then I probably want BDB.
Notice that the "complexity" of my data doesn't really enter into these equations. The reason is simple: BDB can hold my object in its native serialized format. SQL of any flavor comes with the famous impedance mismatch, which, IMO, complicates my application.
If you are seriously considering BDB, I need to warn you that you should decide the type of storage you're going to use up front, as the different types of stores that BDB offers are not necessarily compatible.
