Practical performance comparision of Neo4j and MSSQL for C# developers - c#

Assume we have a web site with a small social graph that people (say ~1M users) can "like" stuff, follow each other, comment on each other posts and ... (the usual scenario).
In .NET for this we have two options:
Using EF (currently 6.1) and MSSQL (v2012 or above) to implement the social graph (the hard way)
Using Neo4j (currently 2.1.4) and Neo4jClient (which as far as I know is the best driver for .NET users)
Given the above scenario and the fact that Neo4j doesn't have a native driver for .NET and the current version of Neo4jClient (1.0.0.657) uses REST api to connect to the database engine, which one would be faster for questions like "Who likes stuff like I do" or "What a person would like (based on the people it follow)" and some other usual question regarding the social graphs?

You haven't specified that much information; your question may be likely to elicit a lot of opinion, but I'll try to give this a fair shake. (Disclaimer: I'm from the neo4j side of this, but I've worked with most of the other things you mention)
Your question has three elements I want to split apart:
Graph or Relational? (MySQL vs. Neo4J)
Driver/Engineering issues (Neo4jClient/REST vs EF/MySQL)
Modeling practicalities (implementing the social graph "the hard way" vs. in neo4j)
Graph or Relational?
You should read another answer I posted about general parameters of the performance of graph databases and graph database query. I won't recap all of that (since it's already on SO) but here's the executive summary: graph databases are very good and fast at path-associative queries where you need to traverse a bunch of edges. Those operations correspond to things in the relational world where you'd join a whole pile of tables together, or where the join depth is variable. In those situations, graph will be better than relational (performance wise). If you want to do bulk scans of users or single joins, you're probably better off with relational (again, see other answer for more detail here). So on this criteria, I am inferring that you only really want to traverse one edge at a time - e.g. "Show me all of the stuff that Bob likes" and that you don't need to do deeper queries like "Show me everyone who is separated by 3-4 degrees from Bob".
Driver/Engineering Issues
Speed wise, it's generally known that the java API is faster than the REST API for neo4j. Performance for the REST API would be variable, and depend on a lot of other factors like whether the DB is hosted on the same machine, or how "network far" away it is. You always have extra overhead with REST that comes with things like HTTP and serializing/deserializing JSON that you wouldn't have if you used the java API. So all other things being equal (disclaimer: they never are ;) the REST API will generally be slower than something like EF.
Modeling Practicalities
Here, neo4j is going to win by a lot. With MySQL, you'll have the ever-present object-relational impedance mismatch; neo4j lessens (but does not eliminate) those impedance mismatch problems. Modeling wise, neo4j is schemaless, which comes with lots of pros and cons. You can probably cobble together a working model faster with neo4j because your domain is fundamentally graphy-y.

Related

Storing a large amount of analytical data

I normally use SQL Server and C# for all projects I do, however I am looking upon a project that could potentially span to billions of rows of data and I don't feel comfortable doing this in SQL Server .
The data I will be storing is
datetime
ipAddress
linkId
possibly other string related data
I have only ever dealt with relational databases before and hence was looking for some guidance on what database technology would be best suited for this type of data storage. One that could scale and do so at a low cost (when compared to sharding SQL Server)
I would then need to pull this data out based on linkId.
Also would I be able to do ordering within the query to the DB or would that be best done in the application?
EDIT: It will be cloud based. Hence I was looking at SQL Azure, which I have used extensively, however it just starts causing issues as the row count goes up.
Since you are looking for general guidance, I feel it is ok to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a Data Warehouse that had 4 nodes, each of which had the main fact table holding 1.2 - 1.5 Billion rows (and growing) and responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It is a web-based application with many users hitting it all day long (though some periods of the day much harder than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to properly model that as well). True, the free Express edition might not meet your needs, but Standard Edition likely would and it is not super expensive. Enterprise has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.
Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seemed to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
Good luck :)
EDIT:
When I said (above) that the queries came back "quickly enough", I
meant anywhere from 1 to 90 seconds, depending on various factors.
Keep in mind that these were not simple queries, and in my opinion,
several improvements could be made to the data modeling and index
strategy.
I intentionally left out the Table Partitioning feature not only
because it is only in Enterprise Edition, but also because it is more
often misunderstood and hence misused than understood and used
properly. Table/Index partitioning in SQL Server is not a means of
"sharding".
I also did not mention Column Store indexes because they are only
available in Enterprise Edition. However, for projects large enough
to justify the cost, Column Store indexes are certainly worth
investigating. They were introduced in SQL Server 2012 and came with
the restriction that the table could not be updated once the Column
Store index was created. You can get around that, to a degree, using
Table Partitioning, but in SQL Server 2014 that restriction will be
removed.
Given that this needs to be cloud-based and that you use .Net / C#, if you really are only talking about a few tables (so far just the stated one and the implied "Link" table--source of LinkID) and hence might not need relationships or some of the other RDBMS features, then one option is to use Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development and even the initial stage of rolling out a project are made a bit easier by their low-end, free tier. As of 2013-11-04, the main DynamoDB page states that:
AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity,
and 10 Units of Read Capacity with Amazon DynamoDB.
Here is some documentation: Overview, How to Query with .Net, and general .Net SDK.
BE AWARE: When looking into how much you think it might cost, be sure to include related AWS pieces, such as Network usage, etc.

Data layer design creation

I am planning to create a public facing website somethings on the lines of each user having a single profile page which they can maintain/update regularly. On this page the user can upload some pics and update their personal information.
I have 3 tier structure in mind.
I need inputs in creating my data layer. I have read many posts but I am not convinced on which particular approach to finalize. I have read about entity framework, Microsoft enterprise library, core ado.net etc. Many blogs say that its best to use plain ado.net for better performance.
Could you point out which could be the best approach for my case where I am looking for faster processing and performance. In terms of technology I am looking for asp.net, c#, data calls with WCF and No MVC.
Also in case of plain ado.net are there any ready to use Library available which I could use and get started with.
Thanks
I would not go with plane ADO.NET, if you're looking at whole picture, I would consider that as an micro optimization - by using caching, smart data structuring you would achieve much more than by using plain ado.net.
Entity Framework adds some cost, there's no doubt about it, it is shown here (although it may be outdated):
http://www.servicestack.net/benchmarks/
You could use some micro orm framework, that is mentioned in benchmark, but usually micro comes at is own cost, for example most of micro frameworks I've seen have problems with joins (they allow them in pure sql, but have no tools for writing in typed c#).
For example Stackoverflow has people profiles and is using micro ORM Dapper and their performance is great, because if I remember correctly ~95% of requests are served from Reddis cache, not database.
If your public profiles will be full text searchable and you'll have millions of them, may be relational database is not the right choice.

using nosql database as a replacement for sql server

I'm developing website that (if successful) its going to have a rapidly growing database (maybe terabytes or more). up to now I have always used sql server and didn't know anything about nosql.
I just found out about nosql doing research about the database size, and now I'm not sure if it will fullfil my needs. will I have the same power that I had with sql-server?
my question may seem silly as I'm a newbie in nosql but I just wanted to know if it doesn't support sql queries. how can we do something like:
select *, (select name from cities where id = cityid) from users
how to join tables? use something like stored procedures, views or things like these?
Thats a big question. NoSQL is a broad term pretty much used to describe a bunch of non relational data stores. They can range from MongoDB, RavenDB (which are document stores) to things like Redis and other variants of key/value stores. They all operate very differently to SQL relational models (and the resulting T-SQL).
Document databases like Mongo or Raven typically have a C# driver that (in most cases) allows you to use LinQ queries across the datastore (Mongo example here on this thread and a RavenDB example on their documentation page). They are all specific to their engine and different.
All these engines are not specifically designed to address the 'space' issue you are describing but rather try and have a low friction way of interacting with a datastore, in a fast way. All these data stores will still grow in size in the same way SQL does when throwing massive amounts of data at it. SQL Server will handle massive databases, as will most of the document stores and other NoSQL variants. To be honest, I'd trust SQL Server more than the newer NoSQL stores simply because it has been field tested for longer however as already stated, these document stores (and other stores like Apache Cassandra) can all handle large volumes of data. My only suggestion is to look at how you want to query the data. Document stores typically dont have the concepts of relational integrity like foriegn keys and so normalisation rules do not apply. In addition, you need to assess your reporting needs as SQL typically has an advantage in this area with more tooling. You can also choose a hybrid approach using SQL for your relational data and document stores for other object blobs and the like.
I would suggest looking into how you want to access your data first and then assess which one best suits your needs. One thing to note too is that SQL has some great features but often only in the enterprise versions. This costs a lot. Document databases tend to cost a LOT less for licencing, some being free, with many companies offering hosting so removing the need for you to worry about it. Finally, if going with SQL, I would suggest looking into sharding approaches from the very beginning given the amount of data you will be processing as this will make it much more manageable and also allow better query performance.
I've used MongoDB quite a bit. Id suggest signing up for a sandbox account on Mongolabs and playing around with it. There is an excellent C# driver for it too. NoSql is not really relational although you can relate documents via Ids. In your example you'd store an array of cities (if I am reading your example clearly) against the User document and query that or vice versa. There's less of a concern on data repetition because storage concerns aren't as important as they used to be. I write my scripts (equilivent of stored procs) using JavaScript and run it directly against Mongo, its incredibly flexible and powerful. Of course if you have tons of related objects, perhaps a relational database is your best bet.

What is best multi-user database C# app approach?

I would like to know what is the best method for developing a multi-user C# app using the SQL Server2005 as database. This is what I have in mind:
using nhibernate or telerik's openacces orm.
linq
using wrappers. all data from tables load into corresponding objects (at startup) and from that point only delete&update transactions affect the database.
...
I've looked at orm tools but in my opinion they generate a lot of code and i do not know if
it's necessary.
What is the best solution having in mind future changes in the application?
If i would choose the 3rd option how can i ensure that only one users modifies a row in a table(how can i lock a table row which is under modification) ?
Any suggestions or reading material will help!
Thanks!
There are hundreds of ways to solve this, but don't discount ORM. Microsoft's Entity Framework is getting better with every revision. The framework 4.0 bits are pretty good and play extremely well with LINQ.
As for generated code vs your own, try something like Entity Spaces... You have complete control over how the code gets generated and the data access layer is extremely powerful and flexible (not to mention very easy to use). It also plays nicely with LINQ.
I have written a lot of data access code over the years. In the beginning, the ORM tools were rough around the edges and left a lot to be desired. These tools have gone through many iterations since and have become indispensable in my opinion. I can't imagine writing routine after routine that does the same basic CRUD. I did that for years and spent lots of time correcting hardcoded SQL and vow to avoid it at all costs from here on out.
As for concurrency / locking issues, that's a question unto itself. There are many ways to provide locking (the major categories being optimistic and pessimistic). Each has its pros and cons.
If it's multiuser do NOT do #3. The purpose of an DBMS is to handle the multi-user aspects for you. Everything from transactions to access rights are built right in. Going down the path of mimicking that in your code will be difficult to get right. In the past some "engines" like Borland's BDE and MS Access did this. The end result is that you end up dealing with little things like data corruption and consistency errors.
Never mind that as your database grows the is going to take exponentially longer to start.
We typically stay away from ORM tools for a number of reasons, mostly feature / benefit / security concerns. Of course, we are extremely well versed in SQL and can take advantage of the specific features a given db server can offer, which most ORMs can't do. We also tend to tweak the queries based on performance metrics after product release, which would force a recompile of an app for most ORMs. By staying away from this, we can let production DBAs do their job. That may or may not be a concern of yours.
That said a lot of dev teams both like and successfully use the ones you spoke about. I would say to skip Linq-to-SQL in favor of Entity Framework if you're going that route. Linq-to-SQL has all but been replaced by EF.
Save yourself a load of effort and time and use an ORM. In terms of helping you decide which one, there is loads of information/opinion on the web (and StackOverflow!) about which one to use but that'll depend on what your application requirements are (which you haven't described).
I like Linq-to-SQL for small/mid sized apps. It's quick and easy and almost efficient. For bigger apps it'll depend on what types of data transformations and design you have in mind but Linq-to-Entities or nHibernate are probably the most appropriate.

Simple Object to Database Product

I've been taking a look at some different products for .NET which propose to speed up development time by providing a way for business objects to map seamlessly to an automatically generated database. I've never had a problem writing a data access layer, but I'm wondering if this type of product will really save the time it claims. I also worry that I will be giving up too much control over the database and make it harder to track down any data level problems. Do these type of products make it better or worse in the already tough case that the database and business object structure must change?
For example:
Object Relation Mapping from Dev Express
In essence, is it worth it? Will I save "THAT" much time, effort, and future bugs?
I have used SubSonic and EntitySpaces. Once you get the hang of them, I beleive they can save you time, but as complexity of your app and volume of data grow, you may outgrow these tools. You start to lose time trying to figure out if something like a performance issue is related to the ORM or to your code. So, to answer your question, I think it depends. I tend to agree with Eric on this, high volume enterprise apps are not a good place for general purpose ORMs, but in standard fare smaller CRUD type apps, you might see some saved time.
I've found iBatis from the Apache group to be an excellent solution to this problem. My team is currently using iBatis to map all of our calls from Java to our MySQL backend. It's been a huge benefit as it's easy to manage all of our SQL queries and procedures because they're all located in XML files, not in our code. Separating SQL from your code, no matter what the language, is a great help.
Additionally, iBatis allows you to write your own data mappers to map data to and from your objects to the DB. We wanted this flexibility, as opposed to a Hibernate type solution that does everything for you, but also (IMO) limits your ability to perform complex queries.
There is a .NET version of iBatis as well.
I've recently set up ActiveRecord from the Castle Project for an app. It was pretty easy to get going. After creating a new app with it, I even used MyGeneration to script out class files for a legacy app that ActiveRecord could use in a pretty short time. It uses NHibernate to interact with the database, but takes away all the xml mapping that comes with NHibernate. The nice thing is though, if necessary, you already have NHibernate in your project, you can use its full power if you have some special cases. I'd suggest taking a look at it.
There are lots of choices of ORMs. Linq to Sql, nHibernate. For pure object databases there is db4o.
It depends on the application, but for a high volume enterprise application, I would not go this route. You need more control of your data.
I was discussing this with a friend over the weekend and it seems like the gains you make on ease of storage are lost if you need to be able to query the database outside of the application. My understanding is that these databases work by storing your object data in a de-normalized fashion. This makes it fast to retrieve entire sets of objects, but if you need to select data from a perspective that doesn't match your object model, the odbms might have a hard time getting at the particular data you want.

Categories