I have a general question about large in-memory objects and distributed computing.
I have a large object (call it Class.Object) that stores a lot of data: upwards of 200,000 objects and counting. Right now it is a simple object graph created and held in memory, and clients query the data in it. Because speed also matters, I serialize this monster to disk with the C# BinaryFormatter and load it back to run it. This is a WCF project, so the object stays in memory.

My question is: how should I scale this across multiple servers, distributed-computing style? Is there a tool or mechanism in .NET for something like "database sharding"? Is there a database I can save this information to? The object isn't just a simple database table: it has references up and down the class hierarchy, everything references everything, there are hashtables, and so on. Google seems to handle monster indexes like this by using "shards" and splitting the data across different servers. What is the approach here? I'm using Windows Server AppFabric to keep it in memory and load it, but it seems like I need to split this monster object into pieces?
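To illustrate the current setup, here is a minimal sketch of the BinaryFormatter round-trip described above (Monster is a stand-in for the real class):

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Monster { /* ~200,000 interlinked objects, hashtables, etc. */ }

public static class MonsterStore
{
    // Persist the whole in-memory graph to disk in one shot.
    public static void Save(Monster monster, string path)
    {
        var formatter = new BinaryFormatter();
        using (var stream = File.Create(path))
            formatter.Serialize(stream, monster);
    }

    // Load the entire graph back into memory at startup.
    public static Monster Load(string path)
    {
        var formatter = new BinaryFormatter();
        using (var stream = File.OpenRead(path))
            return (Monster)formatter.Deserialize(stream);
    }
}
```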
Any pointers or help would be appreciated.
I personally haven't heard of any ready-to-run DB sharding solutions for .NET; it would be interesting to read what others post on this question.
But for general knowledge, and perhaps also for practical use, this link may help:
CodeProject: distributed computing with Silverlight
An excellent article, in my opinion.
Good luck.
I guess I'll get no "upvotes" for this answer, but the solution to your problem is not some sharding technique; it's better design. If you need to keep that many objects in memory all the time, you need a really good incentive for it. Isn't it possible to load only a portion at a time? When a client "calls data in it", the client doesn't have to get the whole monster back, does it? If not, try breaking the thing down into the manageable parts a client really needs.
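As a minimal sketch of what that could look like (IDataService and DataItem are hypothetical names, not anything from the question), a WCF contract that returns only the piece a client asks for:

```csharp
using System.Collections.Generic;
using System.Runtime.Serialization;
using System.ServiceModel;

[DataContract]
public class DataItem
{
    [DataMember] public string Key { get; set; }
    [DataMember] public string Payload { get; set; }
}

[ServiceContract]
public interface IDataService
{
    // One item by key, instead of shipping the whole graph to the client.
    [OperationContract]
    DataItem GetItem(string key);

    // A bounded page of results for browsing scenarios.
    [OperationContract]
    IList<DataItem> GetPage(int pageIndex, int pageSize);
}
```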
Short description of our application: we analyze .NET assemblies and detect dependencies between them (e.g. method calls). We save those dependencies in an MSSQL Server database. From a class or method in code we can then find all direct and indirect dependencies, and can work out which code may break if we change an interface or implementation.
Although we make good use of indexes (which hurt our import performance, but the import runs overnight anyway), we still have performance issues. Since we import many, many versions of the same assembly, we have quite a heavy amount of data, and queries take a few seconds, which is just not fast enough (< 1.5 s is the target).
As dependencies are a graph-like structure we're wondering if switching from MSSQL to a NoSQL graph database may help. This would take some time so we're hoping for some external input first.
If yes, you can of course also post a recommended .NET graph database :-)
Call me an old fogey, but I would be quite careful making such a technology switch - as this SO question shows, the technology choice is fairly limited, and I think you run the risk of turning your project into a "Neo4J" project, rather than a "dependency management" project. If you've really hit the buffers, that's worth considering, but it doesn't sound like you should be there with the data volumes you're discussing.
The first thing I'd consider is looking at the "nested set" model - this specifically solves the performance problem when retrieving all children for a given node.
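For illustration, the classic nested-set "all descendants" query might look like the sketch below, assuming a hypothetical Dependencies(Id, Name, Lft, Rgt) table with the usual left/right numbering:

```csharp
using System;
using System.Data.SqlClient;

public static class DependencyQueries
{
    // In the nested set model, each node's Lft/Rgt pair brackets the
    // Lft/Rgt values of all of its descendants, so one range join
    // fetches the whole subtree without recursion.
    private const string AllDescendantsSql = @"
        SELECT child.Id, child.Name
        FROM Dependencies AS child
        JOIN Dependencies AS parent
            ON child.Lft > parent.Lft AND child.Rgt < parent.Rgt
        WHERE parent.Id = @parentId;";

    public static void PrintDescendants(string connectionString, int parentId)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(AllDescendantsSql, connection))
        {
            command.Parameters.AddWithValue("@parentId", parentId);
            connection.Open();
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader["Name"]);
        }
    }
}
```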
We have been having a lot of internal debate about staging tables. Some view staging tables as archaic and believe they will prevent us from building re-usable services, etc. It is also being said that they will keep the business from growing and handling expanded business channels.
I am not necessarily for or against either option, but I do know that having the staged data has been a lifesaver on many occasions and has made it really easy to re-import orders we've had issues with.
I just wanted to see what others think about staging data, and what other methods are being used to handle scenarios similar to ours (taking orders from external partners, Amazon, etc., and importing them into our ERP system).
Thanks,
S
At some places I've worked I've used staging tables; at others I've used other techniques.
Each one has its own advantages and disadvantages.
That said, don't worry about it.
If some data feed comes along that requires some method other than what you are doing, then you'll come up with a new solution.
Change is driven by requirements.
(Personally, when someone comes to me and says "We have to change to X because what we do now is inefficient and bad and witches will come and eat our children," they have this image in their minds that on Tuesday we'll have an opportunity to triple our client base, but only if we do this new thing; and if we don't get cracking on it now we'll miss the opportunity, because none of those potential clients is willing to wait even a minute, and they'll all demand the exact same thing, and we can build exactly what they want right now even though we have no idea what they want, HURRY HURRY HURRY AND DON'T BREAK ANYTHING. Which, of course, isn't how anything works. A single client (or whatever) comes along and says "Hey, we want your services, can you accept our XML?", to which the response is always "Sure thing," and then you get tasked with it and can make intelligent decisions and plan things out. As opposed to the "chicken with its head cut off" methodology preferred by people who like technical words but hate knowing anything technical.)
There is no reason for a debate - you have a working system. Anyone who thinks their "re-usable services" theory can do it better should put up or shut up.
Let them build a test implementation on your development servers for a common high volume scenario, and compare it to the current system - including criteria for recovery and re-import after a failure.
I hear this all the time where I work as well (usually from managers who have just read an article about SOA and XML). In situations dealing with large amounts of data, bulk imports into staging tables can handle a much higher data volume than any type of web service.
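As a rough illustration, a staging-table import can stream whole batches with SqlBulkCopy; a minimal sketch, assuming a hypothetical dbo.StagingOrders table and an already-populated DataTable:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class OrderImporter
{
    // Bulk-load a batch of orders into a staging table in one streamed
    // operation, instead of one service call per order.
    public static void BulkLoad(string connectionString, DataTable orders)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.StagingOrders"; // hypothetical staging table
                bulk.BatchSize = 5000; // commit in chunks to keep the transaction log manageable
                bulk.WriteToServer(orders);
            }
        }
    }
}
```

From the staging table you can then validate, de-duplicate, and re-import into the ERP system at your leisure, which is exactly the recovery story the question describes.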
I am starting work on my first business (application + database) type application, using C# and SQL. I'm completely new to this.
What tips do you have for me?
What should I look for?
What concepts should I understand?
My tip is to get started and come back when you actually have a concrete question. If it makes you feel more prepared, go read some more C# and SQL books first.
While your question is broad, here is the number one thing I 'wish I knew' when I started out with business apps, kinda specific but it's pretty universal:
Writing business apps is made considerably easier by setting up a solid DAL (Data access layer) and abstracting your data access out and away from the rest of your application. That way all of your SQL is in one place and not strewn throughout your code.
Some golden ideas to read up on in this area are ORM (object-relational mapping; since you're using C#, LINQ to SQL could be a nice place to start): this maps your database tables to actual classes. If you have a good database design, you might even find you have very little SQL to write at all.
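For a flavor of what that mapping looks like, here's a minimal LINQ to SQL sketch (the Customers table and connection string are hypothetical):

```csharp
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;

[Table(Name = "Customers")]
public class Customer
{
    [Column(IsPrimaryKey = true)] public int Id { get; set; }
    [Column] public string Name { get; set; }
}

public static class Demo
{
    public static void Run()
    {
        // DataContext translates the LINQ query below into SQL for you.
        using (var db = new DataContext("Data Source=.;Initial Catalog=Shop;Integrated Security=True"))
        {
            var customers = db.GetTable<Customer>()
                              .Where(c => c.Name.StartsWith("A"))
                              .ToList();
        }
    }
}
```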
Another nice practice is the Repository pattern, which encapsulates all your data access in a single class (at least in the simple case; in bigger apps you might have several). To access any data, you always go via the repository. This is typically done through an interface that defines the repository, which lets you provide multiple concrete implementations. For example, you might fetch your data directly from a SQL Server at first, but later on, or in an alternative application, fetch it from a web service instead: no need to rewrite everything, just drop in a new repository class! The interface stays the same, so the rest of your application doesn't know the difference. :D
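A minimal sketch of the idea, with hypothetical Order names:

```csharp
using System.Collections.Generic;

public class Order
{
    public int Id { get; set; }
    public decimal Total { get; set; }
}

// The rest of the app depends only on this interface.
public interface IOrderRepository
{
    Order GetById(int id);
    IEnumerable<Order> GetAll();
}

// One concrete implementation hits SQL Server directly.
public class SqlOrderRepository : IOrderRepository
{
    public Order GetById(int id) { /* ADO.NET / ORM query here */ throw new System.NotImplementedException(); }
    public IEnumerable<Order> GetAll() { throw new System.NotImplementedException(); }
}

// Later: swap in a web-service-backed version without touching callers.
public class WebServiceOrderRepository : IOrderRepository
{
    public Order GetById(int id) { /* service call here */ throw new System.NotImplementedException(); }
    public IEnumerable<Order> GetAll() { throw new System.NotImplementedException(); }
}
```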
This is a pretty broad overview (and a bit of a mind dump, sorry), and I'm certainly no expert, but believe me, good data access practices certainly make your life easier!
Just start writing code. You're going to have to throw it away later when you figure out what's going on, but that's alright.
Well, I'd say you've come to the right site if you start asking specific questions.
Some of our highest-rated questions, however, will give you tons and tons of reading material, books, links to other sites, etc. Here is the URL:
https://stackoverflow.com/questions?sort=votes
General tips:
Get some help from somebody who has done this before. There is no way you're going to pull this off by yourself unless you allow for plenty of time to learn on the job.
Don't get distracted by the technical details -- make sure you understand the business. If you don't know why you're building the app (or your clients don't know why they need it) no good can come from it.
As far as what you should look for or how much you need to understand, I don't know the scope of the application you are trying to build -- thus I can't give any intelligent advice. A real-time financial system used by thousands of concurrent users is different from a small retail site that gets hit by hundreds. So my only look for/understand advice is this: don't overengineer your solution.
I'm currently learning C# and .NET (coming from a UNIX background) and have just started writing a media player. I was hoping for some suggestions on the best way to store the internal database of songs. SQL? Some kind of text file? I don't really have any experience in this area, so any pointers will be really appreciated.
Cheers!
You should probably use SQLite, and you can use LINQ on it to take full advantage of .NET 3.5.
http://www.codeproject.com/KB/linq/linqToSql_7.aspx
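If you go that route, basic usage with the System.Data.SQLite ADO.NET provider looks roughly like this (the songs table is a hypothetical schema):

```csharp
using System.Data.SQLite; // System.Data.SQLite ADO.NET provider

public static class SongStore
{
    // Create the library database file and schema if they don't exist yet.
    public static void Init()
    {
        using (var connection = new SQLiteConnection("Data Source=media.db"))
        {
            connection.Open();
            using (var command = connection.CreateCommand())
            {
                command.CommandText =
                    @"CREATE TABLE IF NOT EXISTS songs (
                          id     INTEGER PRIMARY KEY,
                          title  TEXT NOT NULL,
                          artist TEXT,
                          path   TEXT NOT NULL
                      );";
                command.ExecuteNonQuery();
            }
        }
    }
}
```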
There is also SQL Server Compact. LINQ to SQL works with this as well.
There is a whole spectrum of requirements involved here, to name a few:
multi-user?
expected size(s)?
do you want to store the multimedia binaries as well?
For complex structured data, text files won't do very well.
For storing binaries, I wouldn't use XML.
So it's probably going to come down to: which SQL database to use? You can search for discussions about SQLite, SQL Server Express, SQL Server CE, etc.
A more fundamental question should probably be asked before we move along toward recommending one technology over another...
That of architecture. From the brief description above, it sounds like what you are building is Windows Media Player Library-like functionality. If that's the case, the suggestion of a SQL database might seem appropriate, but then you take on the complication of keeping it synchronized with the filesystem (you weren't planning on folding the media files themselves into a monolithic datastore, were you?).
If you are instead only worried about persisting playlists, a text-based format seems appropriate.
Playlists might want to be text-based (which, to me, includes XML representations of an object graph), but library information would seem to want to be in a more robust, more queryable datastore.
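As a sketch of the text-based side, XmlSerializer can round-trip a simple playlist graph; the Playlist and Track types are hypothetical:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Track
{
    public string Title { get; set; }
    public string Path { get; set; }
}

public class Playlist
{
    public string Name { get; set; }
    public List<Track> Tracks { get; set; }
}

public static class PlaylistIo
{
    // Write the playlist as a human-readable XML file.
    public static void Save(Playlist playlist, string file)
    {
        var serializer = new XmlSerializer(typeof(Playlist));
        using (var stream = File.Create(file))
            serializer.Serialize(stream, playlist);
    }

    // Read it back into the same object graph.
    public static Playlist Load(string file)
    {
        var serializer = new XmlSerializer(typeof(Playlist));
        using (var stream = File.OpenRead(file))
            return (Playlist)serializer.Deserialize(stream);
    }
}
```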
An object database could also be appropriate, as it gives you a much more transparent view of persistence than the other suggestions. Limiting the number of new topics you're dealing with while you learn can be an important way to manage your learning curve. db4o has a .NET version that I haven't looked at recently.
I believe the best way to save your application state is a traditional relational database, whose table structure usually pretty much represents the data model of the system plus metadata.
However, the other guys on my team think that today it's best to simply serialize the entire object graph to a binary or XML file.
Needless to say (but I'll still say it), World War 3 is raging between us, and I would like to hear your opinion on this issue.
Personally I hate serialization because:
1. The saved data is tied to your development platform (C# in my case); other platforms like Java or C++ can't use it.
2. The entire object graph (including the whole inheritance chain) is saved, not only the data we need.
3. Changing the data model can cause severe backward-compatibility issues when loading old states.
4. Sharing parts of the data between applications is problematic.
You didn't say what kind of data it is -- much depends on your performance, simultaneity, installation, security, and availability/centralization requirements.
If this data is very large (e.g. many instances of the objects in question), a database can help performance via its indexing capabilities. Otherwise it probably hurts performance, or is indistinguishable.
If your app is being run by multiple users simultaneously, and they may want to write this data, a database helps because you can rely on transactions to ensure data integrity. With file-based persistence you have to handle that yourself. If the data is single-user or single-instance, a database is very likely overkill.
If your app has its own soup-to-nuts installation, using a database places an additional burden on the user, who must set up and maintain (apply patches etc.) the database server. If the database can be guaranteed to be available and is handled by someone else, this is less of an issue.
What are the security requirements for the data? If the data is centralized, with multiple users (either simultaneous or sequential), you may need to manage security and permissions on the data. Without seeing the data it's hard to say whether it would be easier to manage with file-based persistence or a database.
If the data is local-only, many of the above questions about the data have answers pointing toward file-based persistence. If you need centralized access, the answers generally point toward a database.
My guess is that you probably don't need a database, based solely on the fact that you're asking about it mainly from a programming-convenience perspective and not a data-requirements perspective. Serialization, especially in .NET, is highly customizable and can be easily tailored to persist only the essential pieces you need. There are well-known best practices for versioning this data as well, so I'm not sure there's an advantage on the database side from that perspective.
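For example, a minimal sketch of that tailoring (SessionState and its cache are hypothetical): [NonSerialized] keeps derived state out of the stream, and an [OnDeserialized] callback rebuilds it on load.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;

[Serializable]
public class SessionState
{
    public string UserName;

    // Derived/transient data: excluded from the serialized stream.
    [NonSerialized]
    private Dictionary<string, string> _cache;

    // Rebuild the excluded state right after deserialization.
    [OnDeserialized]
    private void RebuildCache(StreamingContext context)
    {
        _cache = new Dictionary<string, string>();
    }
}
```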
About cross-platform concerns: If you do not know for certain that cross-platform functionality will be required in the future, do not build for it now. It's almost certainly easier overall to solve that problem when the time comes (migration etc.) than to constrain your development now. More often than not, YAGNI.
About sharing data between parts of the application: That should be architected into the application itself, e.g. into the classes that access the data. Don't overload the persistence mechanism to also be a data conduit between parts of the application; if you overload it that way, you're turning the persisted state into a cross-object contract instead of properly treating it as an extension of the private state of the object.
It depends on what you want to serialize, of course. In some cases serialization is ridiculously easy.
(I once wrote a kind of timeline program in Java, where you could draw, drag around, and resize objects. When you were done, you could save it to a file (like myTimeline.til). At that moment hundreds of objects were saved: their positions on the canvas, their sizes, their colors, their inner texts, their special effects, and so on.
You could then, of course, open myTimeline.til and keep working.
All this took only a few lines of code (I just made all the classes and their dependencies serializable), and my coding time was less than five minutes. I was astonished myself! (It was the first time I had ever used serialization.)
Working on a timeline, you could also "Save As" for different versions, and the .til files were very easy to back up and mail.
I think in my particular case it would be a bit idiotic to use a database. But of course that applies to document-like structures only, like Word documents, to name one.)
So my first point: there are certainly scenarios in which a database wouldn't be the best solution. Serialization wasn't invented just because developers were bored.
As for your specific points:
1. Not true if you use XML serialization or SOAP.
2. Not really relevant anymore.
3. Only if you are not careful; there are plenty of best practices for that.
4. Only if you want it to be problematic; see point 1.
Of course, besides speed of implementation, serialization has other important advantages, like not needing a database at all in some cases!
See this Stackoverflow posting for a commentary on the applicability of XML vs. the applicability of a database management system. It discusses an issue that's quite similar to the subject of the debate in your team.
You have some good points. I pretty much agree with you, but I'll play the devil's advocate.
1. Well, you could always write a converter in C# to extract the data later if needed.
2. That's a weak point: disk space is cheap, and the extra bytes cost far less than the time we'd waste trying to get all this to work your way.
3. That's the way of the world. Burn the bridges and require upgrades. Convert the data, or make a tool to do it, and then stop supporting the old version's format.
4. Not if the C# program hands the data off to the other applications. Other applications shouldn't be accessing data that belongs to this application directly, should they?
For transfer and offline storage, serialization is fine; but for active use, some kind of database is far preferable.
Typically (as you say), without a database you need to deserialize the entire stream to perform any query, which makes it hard to scale. Add the inherent issues with threading, etc., and you're asking for pain.
Some of your other pain points about serialization aren't all true - as long as you pick wisely. Obviously, BinaryFormatter is a bad choice for portability and versioning, but "protocol buffers" (Google's serialization format) has versions for Java, C++, C#, and a lot of others, and is designed to be version tolerant.
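For example, with protobuf-net (one of the .NET implementations of the format), a contract might look like the sketch below; the Person type is a hypothetical example:

```csharp
using System.IO;
using ProtoBuf; // protobuf-net

[ProtoContract]
public class Person
{
    // Field numbers, not names or types from the CLR, go on the wire,
    // which is what makes adding fields later version-tolerant.
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
}

public static class PersonIo
{
    public static void Save(Person person, string file)
    {
        using (var stream = File.Create(file))
            Serializer.Serialize(stream, person);
    }

    public static Person Load(string file)
    {
        using (var stream = File.OpenRead(file))
            return Serializer.Deserialize<Person>(stream);
    }
}
```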
Just make sure you have a component that handles saving/loading state with a clean interface to the rest of your application. Then whatever choice you make for persistence can easily be revisited later.
Serializing an object graph to a file might be a good quick-and-dirty initial solution that is very fast to implement.
But if you start to run into issues that make a database a better choice you can plug in a new version with little or no impact on the rest of the application.
Yes, probably true. The downside is that you must retrieve the whole object, which is like retrieving all rows from a table. If it's big, that is a real drawback. But if it isn't so big (and my hobby projects are not), maybe it's a perfect match?