We have a relatively large-scale application that uses a relational DB (MSSQL).
After a lot of reading, I've decided that I want to examine using MongoDB instead of MSSQL, mainly because of performance and scale issues.
I've read and studied Mongo, but couldn't figure out the answers to the following questions:
Should we do it? Bear in mind we have the time to invest; the only question is "is it good for us?"
How to model our data?
My problem with Mongo is that we have a lot of one-to-many relations in our DB.
After reading this great post (and the second part as well), I've realized a good practice is to divide the decision into 3 scenarios:
One-to-few
One-to-many
One-to-squillions
In our DB we use one-to-many most of the time, but the problem is that most of the time it's the same "one".
For example, we have users and transactions tables.
Each user can perform transactions, so basically what I should do is model the user as follows:
{
  "name": "John",
  ...,
  "Transactions": [ObjectId("..."), ObjectId("..."), ...]
}
So far so good. The problem is that we have a lot more than just transactions: for example, we could have posts, requests, and many more features like transactions, and then my users collection becomes huge (more than 25 "columns"). And when I want to retrieve a data set, I have to do several queries, unlike in MSSQL where I'd just use a JOIN statement.
Another issue is that I'll have to save a lot of extra data. For example, for each transaction I have to save the terminal ID, but in the report I'll have to show the terminal name. In that case (as I understand it) I have 2 choices: one is to do 2 queries, and the other is to save the terminal name as well. In a relational DB this is a simple join.
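For concreteness, here's a minimal sketch of those two choices using the MongoDB C# driver (all type, collection, and variable names here are hypothetical):

using MongoDB.Bson;
using MongoDB.Driver;

public class Terminal
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
}

public class Transaction
{
    public ObjectId Id { get; set; }
    public ObjectId TerminalId { get; set; }
    public string TerminalName { get; set; }  // choice 2: denormalized copy
}

public static class ReportQueries
{
    // Choice 1: two queries (the document-db stand-in for the SQL join).
    public static string TerminalNameTwoQueries(IMongoDatabase db, ObjectId txnId)
    {
        var txn = db.GetCollection<Transaction>("transactions")
                    .Find(t => t.Id == txnId).First();
        return db.GetCollection<Terminal>("terminals")
                 .Find(t => t.Id == txn.TerminalId).First().Name;
    }

    // Choice 2: the terminal name was copied onto the transaction at write
    // time, so the report needs one query. The cost is that renaming a
    // terminal means updating every transaction that embeds its name.
    public static string TerminalNameDenormalized(IMongoDatabase db, ObjectId txnId)
    {
        return db.GetCollection<Transaction>("transactions")
                 .Find(t => t.Id == txnId).First().TerminalName;
    }
}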
So maybe for schemas like ours, Mongo (or any other document-based DB) is not the best choice?
I know these are newbie questions :)
We use C# for our server side (ASP.NET Web API)
Thanks in advance!
You can face some serious issues when modeling your data with approaches 2 and 3:
For one-to-many you may face data inconsistency and/or eventual consistency. Here, you store inside the document an index (an array of references) to external documents. So, in your example, adding a new transaction takes two requests: create the transaction, then add its reference to the user (update the document). MongoDB has ACID transactions only at the document level, so in your case the application could, for some reason, create a transaction but fail to add its reference to the user. This can happen through app failures, network problems, bugs, and so on. Of course, you can simulate a DB transaction in the app with a try/catch block, cleaning up the data when an error occurs. That helps, but not fully, because the app can crash between the two requests.
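As a minimal sketch (with hypothetical POCOs and collection names), that two-request pattern plus the try/catch cleanup might look like this with the C# driver:

using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class Transaction
{
    public ObjectId Id { get; set; }
    public decimal Amount { get; set; }
}

public class User
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
    public List<ObjectId> Transactions { get; set; }
}

public class TransactionService
{
    private readonly IMongoCollection<Transaction> _transactions;
    private readonly IMongoCollection<User> _users;

    public TransactionService(IMongoDatabase db)
    {
        _transactions = db.GetCollection<Transaction>("transactions");
        _users = db.GetCollection<User>("users");
    }

    public async Task AddTransactionAsync(ObjectId userId, Transaction txn)
    {
        // Request 1: create the transaction document.
        await _transactions.InsertOneAsync(txn);
        try
        {
            // Request 2: add its reference to the user document.
            var push = Builders<User>.Update.Push(u => u.Transactions, txn.Id);
            await _users.UpdateOneAsync(u => u.Id == userId, push);
        }
        catch
        {
            // Best-effort cleanup. It won't run if the process crashes
            // between the two requests, which is exactly the risk above.
            await _transactions.DeleteOneAsync(t => t.Id == txn.Id);
            throw;
        }
    }
}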
So, if your app is under high load, after some time you can end up with a number of "dead" transactions that are not linked to any user. That might not be a big problem if your app never queries transactions directly (only via users); you'd merely have useless data in the DB. Otherwise you will have data inconsistency.
To fix that, you need a background job that performs the cleanup. So for some period of time your data can be inconsistent: eventual consistency. For some applications that's OK; for others it is not.
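A sketch of such a background cleanup job, reusing the User and Transaction classes from the sketch above (a real job would batch rather than load every reference into memory):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class OrphanCleanup
{
    public static async Task RunAsync(
        IMongoCollection<User> users,
        IMongoCollection<Transaction> transactions)
    {
        // Collect every transaction id that some user still references.
        List<List<ObjectId>> perUser = await users
            .Find(FilterDefinition<User>.Empty)
            .Project(u => u.Transactions)
            .ToListAsync();
        List<ObjectId> referenced = perUser
            .Where(ids => ids != null)
            .SelectMany(ids => ids)
            .Distinct()
            .ToList();

        // Delete the "dead" transactions that no user points to.
        await transactions.DeleteManyAsync(t => !referenced.Contains(t.Id));
    }
}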
You can face the same problem when deleting transactions.
I agree that a document with 25 arrays of references ("columns") does not look very good. Working with such objects manually will be harder (testing, manual data fixes, and so on).
One-to-squillions doesn't have this effect, but you need indexes to query efficiently. For a large, sharded DB you can still get bad performance.
In general, I'd say document DBs are pretty good if your app works mostly with one document (aggregate), doesn't have a lot of references to other docs, and doesn't need transactions across docs. Denormalization can also be a source of inconsistency.
Key-value data is very easy to scale. Document DBs are one step closer to a key-value store. Column-oriented DBs are closer still, so they can be scaled even better.
Also, I recommend you consider the following measures to improve your SQL Server DB's performance:
Caching: perhaps you can cache some of your app's aggregates instead of assembling them (making joins) in the SQL DB all the time. For instance, Stack Overflow uses a SQL Server DB plus Redis for caching aggregates (questions with answers, comments, and so on); see the cache-aside sketch after this list.
Tune query performance with indexes, DB structure, denormalization, and so on.
If your DB is hosted on an on-premises SQL Server, then additional memory, SSD disks, table partitioning, data compression, and replication can help. As a rule, SQL Server gives good performance with these approaches for DBs up to 1 TB.
Consider a CQRS approach.
Consider storing your app's data in different databases. Every type of DB has its own strengths and weaknesses: a document DB is good for storing aggregates, a SQL DB for relational data, and so on. Complex apps, as a rule, use a few DB types.
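For the caching measure above, here's a minimal cache-aside sketch using StackExchange.Redis and Json.NET; UserAggregate and LoadUserAggregateFromSqlAsync are hypothetical stand-ins for your own aggregate and its SQL joins:

using System;
using System.Threading.Tasks;
using Newtonsoft.Json;
using StackExchange.Redis;

public class UserAggregate   // hypothetical: a user plus posts, transactions, etc.
{
    public string Name { get; set; }
}

public class AggregateCache
{
    private readonly IDatabase _redis;

    public AggregateCache(ConnectionMultiplexer mux)
    {
        _redis = mux.GetDatabase();
    }

    public async Task<UserAggregate> GetUserAggregateAsync(int userId)
    {
        string key = $"user-aggregate:{userId}";

        // Cache hit: skip the SQL joins entirely.
        RedisValue cached = await _redis.StringGetAsync(key);
        if (cached.HasValue)
            return JsonConvert.DeserializeObject<UserAggregate>(cached);

        // Cache miss: build the aggregate with the usual joins...
        UserAggregate aggregate = await LoadUserAggregateFromSqlAsync(userId);

        // ...and keep it for a few minutes so repeated loads are cheap.
        await _redis.StringSetAsync(
            key, JsonConvert.SerializeObject(aggregate), TimeSpan.FromMinutes(5));
        return aggregate;
    }

    private Task<UserAggregate> LoadUserAggregateFromSqlAsync(int userId)
    {
        // Hypothetical: the expensive multi-join SQL query would go here.
        throw new NotImplementedException();
    }
}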
I am making an application in C#, which is the first professional project I will be adding to my portfolio. The application connects to a remote SQL Server and queries specific tables to get data such as Student information, details on courses, and data of student enrollment in each course.
My question is, even though the query will return a single record most of the time, is there a need to cache the tables? There are only 3 tables (Students, Courses, Enrollment), and Courses is the only table that doesn't change all that often, at least in comparison to the other two.
In a nutshell, the app is a CLI that lets the user view school courses, registered students, and each student's enrollment in those courses. The app can record student info such as name, mailing address, and contact information, which is then persisted to SQL Server. The same goes for course details like the CourseID, name, and description, as well as for enrollment, where the server joins the StudentID and CourseID in a record to show that the specified student is enrolled in that course.
I am currently running a local instance of MSSQL, but plan to create a lightweight Virtual Machine to hold the SQL server to replicate a remote access scenario.
I figure that if the application is ever deployed to a large scale environment, the tables will grow to a large size and a simple query may take some time to execute remotely.
Should I be implementing a cache system if I envision the tables growing to a large size? Or should I just do it out of good practice?
So far, the query executes very quickly. However, that could be because the MSSQL installation is local, or because the tables currently hold only 2-3 records of sample data. I plan to create more sample data in the future to see whether the execution time stays manageable.
Caching is an optimisation tool. Try to avoid premature optimisation, especially in cases when you do not know (and can't even guess) what you are optimising for (CPU, Network, HD speed etc).
Keep in mind that databases are extremely efficient at searching and retrieving data. Provided adequate hardware is available, a database engine will always outperform C# cache structures.
Where caching is useful is in scenarios where network latency (between DB and the app) is an issue or chatty application designs (multiple simple DB calls into small tables in one interaction/page load).
Well, for a C# app (desktop/mobile), a cache system is good practice. But you can build a project for a school without one, because its absence won't hurt your app's performance much at this scale. It's up to you whether you want to use it or not.
Caching is a good option for data that is accessed frequently but does not change frequently. In your case that applies to 'Courses', which you said won't change often.
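A minimal in-process sketch for the 'Courses' case, using System.Runtime.Caching.MemoryCache (the Course class and LoadCoursesFromDb are hypothetical):

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public class Course                  // hypothetical shape of the Courses table
{
    public int CourseId { get; set; }
    public string Name { get; set; }
}

public static class CourseCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static IReadOnlyList<Course> GetCourses()
    {
        var courses = Cache.Get("courses") as IReadOnlyList<Course>;
        if (courses == null)
        {
            courses = LoadCoursesFromDb();
            // Courses rarely change, so an hour of staleness is acceptable.
            Cache.Set("courses", courses, DateTimeOffset.Now.AddHours(1));
        }
        return courses;
    }

    private static IReadOnlyList<Course> LoadCoursesFromDb()
    {
        // Hypothetical: the usual SELECT against the Courses table goes here.
        throw new NotImplementedException();
    }
}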
However, for data that is going to grow in size and will see frequent inserts/updates, it is better to think about optimizing how it is stored and retrieved from the data store. In your case, 'Students' and 'Enrollment' are the tables that can expect lots of inserts/updates over time.
So it is better to write optimized procedures to perform CRUD operations on those tables and to keep the right sort of indexes on them as well. That will not only make the data more manageable but also give the performance you are looking for, compared to caching the results.
I'm creating a website content management system which stores a whole bunch of website articles and lets users modify those articles through the system. I'm a typical SQL Server developer, but I'm thinking maybe this system could be done in DocumentDB. We are using C# plus Web API to do the reads and writes.
I'm testing different data access technologies to see which one performs better. I have been trying LINQ, LINQ lambda expressions, SQL, and stored procedures. The thing is, all these query methods seem to run at around 600 ms to 700 ms when I test via Postman. For example, one of my tests is a simple GET to http://localhost:xxxxxx/multilanguage/resources/1, which takes 600 ms+. That is only a 1 KB document, and there are only 5 documents stored in my collection so far.
So I guess what I want to ask is: is there a quicker way to query DocumentDB than this? The reason I ask is because I did something similar in SQL Server before (not querying documents, but relational tables). A much more complex query in a stored procedure over multiple joined tables only takes around 300 ms. So I guess there should be a quicker way to do this. Thanks for any suggestions!
Most probably, if you swapped the implementation for a stub you would see the same timings, since what you are actually measuring is the connection time between your server and the client (Postman).
There are a couple of things you can do, but keep in mind that DocumentDB and other NoSQL solutions behave very differently from standard SQL Server. For example, the more nodes and RAM available to DocumentDB, the better it will perform overall. A development instance of DocumentDB on Azure will understandably use fewer resources than a production instance. Since Azure takes care of scaling, one way to think about it is that the more data you have, the better it will perform.
That said, something you are probably not used to is sharing your connection object across the whole application. That avoids the start-up penalties every time you want to get your data. Summarizing the performance tips (a sketch applying the first few follows the list):
Use TCP connection instead of HTTPS when you can
Use await client.OpenAsync() to avoid pausing on start up latency for the first request
Connect to DocumentDB in the same region (keep this in mind if you host across regions)
Use a singleton to access DocumentDB (it's threadsafe)
Cache your SelfLinks for quick access
Tune your page sizes so that you get only the data you intend to use
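Here's a minimal sketch combining the first few tips; the endpoint URI and auth key are placeholders:

using System;
using Microsoft.Azure.Documents.Client;

public static class DocumentDbClientFactory
{
    // One DocumentClient for the whole app: it is thread-safe, and sharing it
    // avoids paying connection start-up costs on every request.
    private static readonly Lazy<DocumentClient> Client =
        new Lazy<DocumentClient>(() =>
        {
            var policy = new ConnectionPolicy
            {
                ConnectionMode = ConnectionMode.Direct,  // direct TCP instead
                ConnectionProtocol = Protocol.Tcp        // of the HTTPS gateway
            };
            var client = new DocumentClient(
                new Uri("https://your-account.documents.azure.com:443/"),
                "your-auth-key",
                policy);
            // Pay the first-request latency once, at start-up.
            client.OpenAsync().Wait();
            return client;
        });

    public static DocumentClient Instance => Client.Value;
}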
The more advanced performance tips cover index policies, etc. DocumentDB and other NoSQL databases behave differently than SQL databases. That also means your assumptions about how the APIs work are probably wrong. Make sure you are testing similar concepts. The SQL Server database connection object needs you to create/dispose of objects for each transaction so it can return those connections back to a connection pool. Treating DocumentDB the same way is going to cause the same kind of performance problems as if you didn't use a connection pool.
In a current project of mine I need to manage and store a moderate number (from 10-100 to 5000+) of users (ID, username, and some other data).
This means I have to be able to find users quickly at runtime, and I have to be able to save and restore the database to continue statistics after a restart of the program. I will also need to register every connect/disconnect/login/logout of a user for the statistics. (And some other data as well, but you get the idea).
In the past, I saved settings and other stuff in encoded textfiles, or serialized the needed objects and wrote them down. But these methods require me to rewrite the whole database on each change, and that's increasingly slowing it down (especially with a growing number of users/entries), isn't it?
Now the question is: What is the best way to do this kind of thing in C#?
Unfortunately, I don't have any experience in SQL or other query languages (except for a bit of LINQ), but that's not posing any problem for me, as I have the time and motivation to learn one (or more if required) for this task.
"Most effective" is highly subjective and depends on who you ask, even after narrowing the question down to specific needs. If you are storing non-relational data, Mongo or some other NoSQL-type database such as RavenDB would be effective. If your data has a relational shape, then an RDBMS such as MySQL, SQL Server, or Oracle would be effective. Relational databases are ideal if you are going to have heavy reporting requirements, as they allow non-developers more ease of access in writing simple SQL queries.
Also keep in mind the performance of the disk-cache persistence that databases provide: commonly accessed data is kept in memory to save round trips to the disk (with hybrid drives, I suppose, accessing some files directly accomplishes the same thing; however, SSDs are still not as fast as RAM access).
So you really need to ask yourself some questions to identify the best solution for you: What is the shape of your data (flat, relational, etc.)? Do you have reporting requirements where less technical team members need to be able to query the data repository? And what are your performance metrics?
I'm developing a website that (if successful) is going to have a rapidly growing database (maybe terabytes or more). Up to now I have always used SQL Server and didn't know anything about NoSQL.
I just found out about NoSQL while researching database size, and now I'm not sure it will fulfill my needs. Will I have the same power that I had with SQL Server?
My question may seem silly, as I'm a NoSQL newbie, but I just want to know: if it doesn't support SQL queries, how can we do something like:
select *, (select name from cities where id = cityid) from users
How do you join tables? Do you use something like stored procedures, views, or things like that?
That's a big question. NoSQL is a broad term used to describe a bunch of non-relational data stores. They range from MongoDB and RavenDB (which are document stores) to things like Redis and other variants of key/value stores. They all operate very differently from the SQL relational model (and the resulting T-SQL).
Document databases like Mongo or Raven typically have a C# driver that (in most cases) allows you to run LINQ queries against the data store (there is a Mongo example on this thread and a RavenDB example on their documentation page). They are all specific to their engine, and each is different.
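For example, a LINQ query through the MongoDB C# driver might look like this (the connection string and the User POCO are hypothetical):

using System.Collections.Generic;
using System.Linq;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Linq;

public class User
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
    public string City { get; set; }
}

public static class UserQueries
{
    public static List<User> LondonersByName()
    {
        var users = new MongoClient("mongodb://localhost:27017")
            .GetDatabase("mydb")
            .GetCollection<User>("users");

        // The driver translates this LINQ expression into a MongoDB query.
        return users.AsQueryable()
                    .Where(u => u.City == "London")
                    .OrderBy(u => u.Name)
                    .ToList();
    }
}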
None of these engines is specifically designed to address the 'space' issue you describe; rather, they aim for a low-friction, fast way of interacting with a data store. All of them will still grow in size the same way SQL does when you throw massive amounts of data at them. SQL Server will handle massive databases, as will most of the document stores and other NoSQL variants. To be honest, I'd trust SQL Server more than the newer NoSQL stores simply because it has been field-tested for longer; that said, these document stores (and others like Apache Cassandra) can all handle large volumes of data. My only suggestion is to look at how you want to query the data. Document stores typically don't have relational-integrity concepts like foreign keys, so normalisation rules do not apply. In addition, assess your reporting needs, as SQL typically has an advantage there with more tooling. You can also choose a hybrid approach, using SQL for your relational data and a document store for other object blobs and the like.
I would suggest first looking into how you want to access your data, and then assessing which option best suits your needs. One thing to note is that SQL Server has some great features, but often only in the enterprise editions, which cost a lot. Document databases tend to cost a LOT less to license (some are free), and many companies offer hosting, removing the need for you to worry about it. Finally, if you go with SQL, I would suggest looking into sharding approaches from the very beginning, given the amount of data you will be processing; this will make the data much more manageable and also allow better query performance.
I've used MongoDB quite a bit. I'd suggest signing up for a sandbox account on Mongolabs and playing around with it. There is an excellent C# driver for it too. NoSQL is not really relational, although you can relate documents via IDs. In your example, you'd store the cities (if I'm reading your example correctly) against the User document and query that, or vice versa. There's less concern about data repetition because storage isn't as expensive as it used to be. I write my scripts (the equivalent of stored procs) in JavaScript and run them directly against Mongo; it's incredibly flexible and powerful. Of course, if you have tons of related objects, perhaps a relational database is your best bet.
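As a minimal sketch of that embedding idea (hypothetical POCOs), the subquery from the question disappears because the city name travels inside each user document:

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

public class EmbeddedCity
{
    public ObjectId CityId { get; set; }
    public string Name { get; set; }
}

public class User
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
    public EmbeddedCity City { get; set; }  // embedded, not joined
}

public static class UserLookups
{
    // select *, (select name from cities where id = cityid) from users
    // becomes a single find: the city name is already in each document.
    public static List<User> AllUsersWithCityNames(IMongoDatabase db)
    {
        return db.GetCollection<User>("users")
                 .Find(FilterDefinition<User>.Empty)
                 .ToList();
    }
}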
I have a piece of software that does heavy processing based on some files.
I have to query some tables in SQL Server during the process, and this is killing the DB and the application's performance (other applications use the same tables).
After optimizing queries and code I got better results, but not enough. After some research I arrived at a solution: caching some query results. My idea is to cache the rows of one specific table (identified as the source of the overhead) that the file being processed needs.
I was thinking of using AppFabric Caching (I'm on the MS stack). I ran some tests, and it has large memory usage for small objects (the AppFabric service uses ~350 MB of RAM without any objects in it). But I also need to run queries against the cached rows (like searching by last name, SSN, birth date, etc.).
My second option is MongoDB as a cache store. I've researched this, and most people recommend memcached or Redis, but I'm on Windows servers, where they're not officially supported.
Is using Mongo as a cache store a good approach in this case? Or is AppFabric Caching + tag search better?
It is hard to tell what is better, because we don't know enough about your bottlenecks. A lot depends on the nature of the data you're discussing. If the data is very static and isn't queried constantly, but compiling the data set is time-consuming, a good solution might be a materialized view. If the data is frequently queried, then you'd be better off caching it on some server (e.g. AppFabric).
There are many techniques and possibilities. But you really need to think of the network traffic, demand, size, etc, etc. And it is hard to answer this here without knowing all the details.
It looks like you are on the right track, but maybe all you need is just a parameterized query. It's hard to tell, but I would add a materialized view to the list of options you posted. Maybe all you need is to build that view from all the data you need and simply access its contents.
My question to you would be: what are your long-term goals or estimates for your application? If this is the highest load you are ever going to experience, then tuning the DB or using a materialized view would be an answer. But the long-term solution is distributed caching, and you are already thinking along those lines. Your data is what we'd call "reference data" or "lookup data", and once you are executing multiple lookups against limited DB resources there will be performance issues, and your DB will become the bottleneck.
So the solution, which you are already thinking of, is to cache this "reference" data so you don't need to go to the database, while at the same time keeping the cache synchronized with the database.
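A minimal sketch of that synchronization using SQL Server query notifications (SqlDependency); it assumes Service Broker is enabled on the database, and the table and column names are hypothetical:

using System.Data;
using System.Data.SqlClient;

public class ReferenceDataCache
{
    private readonly string _connStr;
    private DataTable _rows;   // the cached "reference"/"lookup" rows

    public ReferenceDataCache(string connStr)
    {
        _connStr = connStr;
        SqlDependency.Start(_connStr);   // once per connection string
        Reload();
    }

    public DataTable Rows => _rows;

    private void Reload()
    {
        using (var conn = new SqlConnection(_connStr))
        using (var cmd = new SqlCommand(
            "SELECT PersonId, LastName, Ssn, BirthDate FROM dbo.Persons", conn))
        {
            // Each notification fires only once, so resubscribe on every reload.
            var dependency = new SqlDependency(cmd);
            dependency.OnChange += (s, e) => Reload();

            conn.Open();
            var table = new DataTable();
            table.Load(cmd.ExecuteReader());
            _rows = table;
        }
    }
}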
I wouldn't be too sure about AppFabric, as it will have the same support issues you mention. What is your budget like? Could you consider spending on a caching solution like NCache?