Master Data Management - Data redundance - c#

I am currently developing a system that holds all relevant master data like for example a customer, or information about an operational system that exists within our system landscape.
The assigned IDs to these entities are unique within the enterprise. When some system stores e.g. customer related data, it has to hold the master data ID of the customer as well.
The master data system is based on .Net and a MSSQL 2005.
Now my question is, when developing another system with it's own assemblies, database, etc., that uses data of the MDM system, would you store that data redundantly in the other systems database, create its own business entities (like customer) and hard code the required master data from the MDM in the other database (or by ETL)? That way the other system is detached from the MDM but only stores the global Master Data IDs.
Or would you integrate the assemblies of the MDM into other systems (if .Net of course) and use the data layer of the MDM to load global entities (like a customer)?
Or would you let the other system create it's own entities but for retrieving master data you would use a SOAP interface provided by the MDM.
I tend to use the approach no. 1 because I think it's better to detach other systems from the MDM solution (separation of concerns). Since the MDM solution can hold much more data of a customer entity than I would need in some other system where a just the customer's name is required. Option 3 would be possible but web services might slow down an operational system alot. What do you think?

This carries a high risk of data getting out of sync followed by a major headache trying to unravel it.
This is a viable option, but you'd have to maintain a decent set of assemblies for people to use. This is a non-trivial task as they need to be robust, well documented, have a usable API plus some sensible release management, much like you'd expect from any 3rd party framework. This is usually a level of rigour above normal LOB development practices in my experience.
Got to be the way to go I'd say. A Service Oriented Architecture allowing other people to access the data, but giving them the flexibility for how they access/consume it.
As you say, performance may be the deciding factor - in which case 1 combined with 3 could be the best. i.e. the local copies are viewed and treated as cached data only rather than reliable up-to-date copies. The application could do a quick check with the master DB to see if the local cache is still valid (much like a HEAD request in HTTP land) and then either use the local data or refresh it from the master database.

The pros and cons of solution 1 are:
Pros:
- Faster response time (vs. having to consult the Master at each operation, you could cache something to alleviate this)
- Your satellite system can work even if the Master is temporarily unavailable.
Cons:
- You risk to work with obsolete data (if you got your ETL refresh at midnight, you won't get any new or updated record until the next midnight cycle)
- Of course you can't allow the satellite to ever touch any local copy of the MDM (two-way aligns, especially with multiple different satellites becomes a nightmare).
So depending on the specifics, Solution 1 may be ok. I'd prefer to go for querying the Master every time (again, possibly caching answers for a bit of time) but is more of a personal preference.

Related

Architecture Question - One Central Database and Many Different Programs Accessing It

I am designing a program that will build and maintain a database, and act as a central server. This is the 'first stage' of a grander plan. Coming later will be 3-5 remote programs built around the information put into this database.
The requirements are:
The remote programs must be able to access the information in the database.
The remote programs must be able to set alerts when information in the database changes.
The remote programs must be able to request the central server to go out and fetch new / different data.
So, the question is this: how do I expose this data and events to the outside world? My two choices are:
Have them communicate directly with my 'server' application. This seems easier to:
do event notifications (although I suppose I'm probably missing something in SQL).
It also seems like this is more 'upgradeable' - that is I don't need to worry about the database updating and crashing all my remote programs because something changed. I can account for this and transform it the data to a version the child program will understand.
Just go ahead and let them connect directly to the database.
This nice thing about this is that it's solved. I can use LINQ for SQL. The only thing the main server application needs to do is let the remote programs know where the database is.
I'm unsure how to trigger / relay 'events' for field changes in a database over different programs that may or may not be on the same computer.
Forgive my ignorance on this question. I feel woefully unprepared to ask it, but I'm having a hard time figuring out where to get started with this. It is my first real DB project :-/
Thanks!
If the other programs are going to need to know about updates to the database, then the best solution is to manage all db updates through your server application so it can alert clients of the changes. Otherwise it will be tough for the clients to be aware of changes to the db. This also has the advantage of hiding the implementation details of your storage solution from the clients, so you are free to change databases, etc...
My suggestion would be to go with option 1. Build out a web service that can provide the information they all need. This will be the most flexible and allow you to reduce duplicate backend code that would happen with direct communication with the database.
I would recommend looking at some Data Source design patterns first. This types of patterns will help you come up with solutions about how to manage the states of your data. Otherwise I think that I would require some more information about your requirements for the clients to make any further useful suggestions.
I recommend you learn about SQL Server and/or databases first. You don't appear to realize that most of what you want from your "central server" can all be done by SQL Server itself.
A central databse is the simplest option and the cheapest to both build and maintain.
There are however a few scenarios where a central database could cause problems:
High load on one of the systems: A high load on one of the systems could reduce performance on the other systems. For example someone running an internal report stops you being able to take orders on your eCommerce site.
With several systems writing to the same database there is a greater chance of locking.
With several systems dependent on the same database schema, how do you upgrade? All systems at the same time?
If you need to take down the database all systems stop.

How can I cache private data in a webfarm for ASP MVC

I am making a member based web app in ASP MVC3 and I am trying to plan ahead, at first our user base will not be huge, but as with any software the potential for a sudden volume spike is always a possibility.
Thinking ahead to this scenario, I know that the database is the bottleneck area on most web apps. We are using MSSQL 2008RS we will have dedicated servers with several client databases each client has there own database so if one server begins to bottle neck we can scale vertically or move some of the databases to a new server and begin filling it up.
To access the databases we use primarily LINQ 2 SQL and are currently re-factoring some of our code to make use of the IQueryable mechanisms to do a lazy load of content. but each page contains quite a bit of content from various parts of the database.
We also have a few large databases that are used for widgets in the program that rarely change but have millions of rows. The goal with those is to somehow sync them to the primary source and distribute them across several machines and then load balance those servers.
With this layout should I even worry about caching, or will the built-in caching mechanisms in MSSQL be sufficient?
If so where should I begin? I have looked briefly at app fabric but it looks as tho it is for Azure only?
Resources:
How to cache data in a MVC application
http://stephenwalther.com/blog/archive/2008/08/28/asp-net-mvc-tip-39-use-the-velocity-distributed-cache.aspx
http://stephenwalther.com/blog/archive/2008/08/29/asp-net-mvc-tip-40-don-t-cache-pages-that-require-authentication.aspx
Lazy loading is a performance killer. Its better to load the entire object graph with one join than to lazy load other properties. This is especially the case with a list of objects. If you iterate you'll end up lazy loading for each item in the list. Furthermore every call to the db has overhead. Less calls = better performance.
SO was a top 1000 website before it needed two database servers. I think you'll be ok.
If your revenue model says "each client will have its own database" than your scaling issues should be really easy to solve. Sounds like you already have a plan to scale up with more servers as your client base increases. Whats the problem?
Caching on the web tier is usually the first scaling fix you'll have to worry about. You probably don't need to do a fresh db call with each page request.
Overall this sounds like a lot of premature optimization. Your traffic hasn't reached a point where you need to be worried about scaling. Make these kinds of decisions at the last second possible.
The database cache is different to most caches - it can if course load used data into memory and re-use query plans, but that isn't really a cache as such.
AppFabric is definitely not just azure; after all, I it was you wouldnt be able to install it (and use it) locally :) but in truth there is little between AppFabroc, redis and memcached (the latter lacks persistance, of course).
But I think you should initially look at using the inbuilt asp.net caching; both data caching via HttpContext.Cache, and caching of entire responses (or, in MVC 3, partials). Obviously you should have a broad idea of what data is used heavily by lots of requests, and is safe to re-use : cache that!
Just make sure you treat all cached FAA as immutable (if you need to update the cache, re-add a modified value; don't modify the existing objects) - reason: it won't work the same if you start needing to use distributed caching, as that uses serialization, and any changes you make won't be seen by the next request.

Business layer design dilemma: memory or IO?

The project I am working on is facing a design dilemma on how to get objects and collections of objects from a database. Sometimes it is useful to buffer *all* objects from the database with its properties into memory, sometimes it is useful to just set an object id and query its properties on-demand (1 db call per object to get all properties). And in many cases, collections need to support both buffering objects into memory and being initialized with minimum information for on-demand access. After all, not everything can be buffered into memory and not everything can be read on-demand. It is a ubiquitous memory vs IO problem.
Did anyone have to face the same problem? How did affect your design? What are the tough lessons learned? Any other thoughts and recommendations?
EDIT: my project is a classic example of a business layer dll, consumed by a web application, web services and desktop application. When a list of products is requested for a desktop application and displayed only by product name, it is ok to have this sequence of steps to display all products (lets say there is a million of products in the database):
1. One db call to get all product names
2. One db call to get all product information if the user clicks on the product to see details (on-demand access)
However, if this same API is going to be consumed by a web service to display all products with details, the network traffic will become chatty. The better sequence in this case would be:
1. What the heck, buffer all products and product fields from just one db call (in this case buffering 1 million products also looks scary)
It depends how often the data changes. It is common to cache static and near static data (usually with a cache expiry window).
Databases are already designed to cache data, so provided network I/O is not a bottleneck, let the database do what it is good at.
Have you looked at some of the caching technologies available?
.NET Framework 4 ObjectCache Class
Cache Class: Using the ASP.NET Cache outside of ASP.NET
Velocity: Build Better Data-Driven Apps With Distributed Caching
Object cache for C#
This is not a popular position, but avoid all caching unless absolutely necessary or if you know for sure immediately that you're going to need to "Internet-scale." Tried to scale out a layered-cache atop the database? Are you going to write-through the cache and only read or wait for an LRU object to write changes? what happens when another app or web services tier sit atop the DB and get inconsistent reads?
Most modern databases already have cache and likely can implement them better than you, just determine if you want to hit the DB wire every time you need something. In a large majority of the cases, the DB'll perform just fine and you'll keep your consistency. BASE and CAP theory is nice and fun to talk about and imagine, but you sometimes just can't beat the cost-to-market of just hitting the good old database. Stress test and determine your hotspots, implement your cache conservatively if needed.

is a database intermediary good system design?

background: we've got a number of server processes and client apps that are used entirely internally, in a fairly controlled environment. we capture a significant amount of data every day that goes into a couple database machines. most everything is c#, with a few c++ apps.
just about every app has some basic (if not extensive) dependence on database data, whether it's for historical data, daily-calculated values, or assorted parameters. as the whole environment has gotten a bit more sprawling, I've been wondering about the sense in sticking an intermediary in between all client and server apps and the database, a sort of "database data broker". any app that needs values from the db makes a request to the data broker, instead of a dll wrapper function that calls a stored proc.
one immediate downside is that the data would make two trips across the network: from db to broker, and from broker to calling app. seems like poor form, but the amount of data would be small enough in each request that I'm ok with it as far as performance goes.
one (seeming) upside is that it would be trivial to set up a test environment, as it would entail just setting up a test data broker, and there's no maintaining of db connection strings locally anywhere else. also, I've been pondering creating a mini request language so you wouldn't have to enumerate functions for each dataset you might request (instead of GetX() and GetY(), there would be Get("name = X")
am I over-engineering this, or is it possibly a worthy architecture?
edit: thanks for all the great comments so far, great food for thought.
It depends on what you're trying to accomplish with it. According to Rocky Lhotka, you should only add a tier if you are forced to, kicking and screaming all the way.
I agree with him: don't tier unless you need to. I think there are valid reasons to add additional tiers, usually for purposes of security, scalability and maintainability. The question becomes: is yours a valid reason?
It looks like the major reason is maintainability. Does it outweigh the benefits you get by not having the tier?
only you can answer these:
what are the benefits of doing this?
what are the problems/risks of doing this?
do you need this to make testing easier or even possible?
if you make this change and when it goes live and crashes will you be fired?
if you make the changes and it goes live will you get a promotion?
etc...
As the former architect of a system that also used a database heavily as a "hub," I can say that there are several drawbacks that you should be aware of. Our system used databases:
As a transaction store (typical OLTP stuff)
As a staging queue (submitted but unprocessed transactions)
As a historical data store (results of processed transactions)
As an interoperation layer (untranslated commands or transactions issued from other systems)
One of the major drawbacks is ownership costs. When your databases become the single point of failure for so many types of operations, it becomes necessary to ensure that they are all hosted in high-availability environments. This not only expensive from a hardware perspective, but it is also expensive to support deployments to HA environments, since developers typically have very limited visibility to the internals.
A second drawback is that you have to seriously design integrity in to all of your tables. In a typical SOA environment, you have complete control over how data is modified. When you expose it through database tables, you must consider that any application with the right credentials will have the ability to modify data. Because of this, you must carefully consider utilitarian implementations of constraints. If you had a single service managing persistence, you could be much looser in constraints on the database and enforce them in code.
Third, if you ever want to expose any functionality that the database tables currently allow you to provide to outside parties, you must write service code anyway, so you might be better served doing it strategically as opposed to reacting to requests.
Fourth, UI interaction directly with the data layer creates security risks, especially if the client is a thick client.
Finally, writing code that responds to events (service calls) is much easier than polling code. Typically, organizations that rely heavily on database polling end up reinventing the wheel every time a new project requires a new "monitoring service." It can be avoided by creating a "framework," but those have their own pitfalls (primarily around prescription versus adoption).
This is just a laundry list of problems I have encountered. It's not necessarily meant to dissuade you from using databases for these functions, but it helps to know the dangers ahead of time so you can at least plan for them if they ever do become issues.
EDIT
Just thought of another scenario that caused us pains. Versioning your changes can be difficult. For example, if you need to change the shape of a table (normalize/denormalize), it has a cascading effect if multiple applications rely on it. In a SOA scenario, it is much easier, because you can keep your old API, change the internal interaction so that it works with the changed tables, and allow consumers to migrate to the new version on their own schedule.
A data broker sounds like a really good way to abstract out the multiple data sources for your apps. It would be easy to consolidate, change repositories, or otherwise move data around if needed in the future.
I may be misunderstanding something, but it seems to me like you should consider some entity framework. That is a framework you can use to "map" your interaction with the db to some domain objects. That way you work locally on domain objects that gets filled form your db, and when it is time to persist the state of your objects to the base, the framework handles all the connections back and forth. In this way you can also easily mock up these domain objects for unit testing without needing a db connection.
Check out NHibernate for a good entity framework alternative.
If you already have the database related know-how I think it's not a bad decission.
Good things that I can think of:
if the data model is consistent you can plug in new tools easily without making any changes in the other apps.
maybe you can have running the database more reliabily than your apps, so if one of them fails, the other one can still be working.
you can make backups and rollbacks using the database tools.
you can do emergency fixes manipulating the data directly with sql or some visual tool.
But if you have to learn new frameworks along the way, maybe the benefits are not worth the extra initial effort.
"any app that needs values from the db makes a request to the data broker"
When database technology was being invented over 40 years ago, the people doing that inventing had ideas along the lines of "any app that needs values from the db makes a request to the dbms".
Have you ever pondered the possibility that YOU ALREADY HAVE a "data broker", and that there might be very little added value in creating a second one of your own ?

Sometimes Connected CRUD application DAL

I am working on a Sometimes Connected CRUD application that will be primarily used by teams(2-4) of Social Workers and Nurses to track patient information in the form of a plan. The application is a revisualization of a ASP.Net app that was created before my time. There are approx 200 tables across 4 databases. The Web App version relied heavily on SP's but since this version is a winform app that will be pointing to a local db I see no reason to continue with SP's. Also of note, I had planned to use Merge Replication to handle the Sync'ing portion and there seems to be some issues with those two together.
I am trying to understand what approach to use for the DAL. I originally had planned to use LINQ to SQL but I have read tidbits that state it doesn't work in a Sometimes Connected setting. I have therefore been trying to read and experiment with numerous solutions; SubSonic, NHibernate, Entity Framework. This is a relatively simple application and due to a "looming" verion 3 redesign this effort can be borderline "throwaway." The emphasis here is on getting a desktop version up and running ASAP.
What i am asking here is for anyone with any experience using any of these technology's(or one I didn't list) to lend me your hard earned wisdom. What is my best approach, in your opinion, for me to pursue. Any other insights on creating this kind of App? I am really struggling with the DAL portion of this program.
Thank you!
If the stored procedures do what you want them to, I would have to say I'm dubious that you will get benefits by throwing them away and reimplementing them. Moreover, it shouldn't matter if you use stored procedures or LINQ to SQL style data access when it comes time to replicate your data back to the master database, so worrying about which DAL you use seems to be a red herring.
The tricky part about sometimes connected applications is coming up with a good conflict resolution system. My suggestions:
Always use RowGuids as your primary keys to tables. Merge replication works best if you always have new records uniquely keyed.
Realize that merge replication can only do so much: it is great for bringing new data in disparate systems together. It can even figure out one sided updates. It can't magically determine that your new record and my new record are actually the same nor can it really deal with changes on both sides without human intervention or priority rules.
Because of this, you will need "matching" rules to resolve records that are claiming to be new, but actually aren't. Note that this is a fuzzy step: rarely can you rely on a unique key to actually be entered exactly the same on both sides and without error. This means giving weighted matches where many of your indicators are the same or similar.
The user interface for resolving conflicts and matching up "new" records with the original needs to be easy to operate. I use something that looks similar to the classic three way merge that many source control systems use: Record A, Record B, Merged Record. They can default the Merged Record to A or B by clicking a header button, and can select each field by clicking against them as well. Finally, Merged Records fields are open for edit, because sometimes you need to take parts of the address (say) from A and B.
None of this should affect your data access layer in the slightest: this is all either lower level (merge replication, provided by the database itself) or higher level (conflict resolution, provided by your business rules for resolution) than your DAL.
If you can install a db system locally, go for something you feel familiar with. The greatest problem I think will be the syncing and merging part. You must think of several possibilities: Changed something that someone else deleted on the server. Who does decide?
Never used the Sync framework myself, just read an article. But this may give you a solid foundation to built on. But each way you go with data access, the solution to the businesslogic will probably have a much wider impact...
There is a sample app called issueVision Microsoft put out back in 2004.
http://windowsclient.net/downloads/folders/starterkits/entry1268.aspx
Found link on old thread in joelonsoftware.com. http://discuss.joelonsoftware.com/default.asp?joel.3.25830.10
Other ideas...
What about mobile broadband? A couple 3G cellular cards will work tomorrow and your app will need no changes sans large pages/graphics.
Excel spreadsheet used in the field. DTS or SSIS to import data into application. While a "better" solution is created.
Good luck!
If by SP's you mean stored procedures... I'm not sure I understand your reasoning from trying to move away from them. Considering that they're fast, proven, and already written for you (ie. tested).
Surely, if you're making an app that will mimic the original, there are definite merits to keeping as much of the original (working) codebase as possible - the least of which is speed.
I'd try installing a local copy of the db, and then pushing all affected records since the last connected period to the master db when it does get connected.

Categories