How should I go about warehousing data from different sources? - c#

I am starting on an analytics project that will be getting data from several different sources and comparing them to one another. Sources can be anything from an API, such as the Google Analytics API, to a locally hosted database.
Should I build a single database to import this data into on a regular basis?
Can anyone suggest some best practices, patterns or articles? I really don't know where to start with this so any information would be great! Thanks!
I will be using SQL Server 2008 R2, C# 4.0.

That's a big question, Mike - plenty of people have entire careers doing nothing but Data Warehousing.
I would give a qualified "yes" to your first question - one of the main attractions of a DWH is that you can consolidate multiple data sources into a single source of information. (The qualification is that there may be circumstances where you don't want to do this - for example, for security or performance reasons.)
As ever, Wikipedia is a reasonable first stop for information on this subject. Since your question is already tagged with data-warehouse, StackOverflow is another possible source.
The canonical books on the subject are probably:
Building the Data Warehouse - WH Inmon
The Data Warehouse Toolkit - Ralph Kimball, Margy Ross
The Data Warehouse Lifecycle Toolkit - Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy, Bob Becker
Note that the Inmon and Kimball approaches are radically different - Inmon concentrates on a top-down, normalised relational approach to constructing an enterprise DWH, while Kimball's approach is more bottom-up, dimensional, functional datamart-based.
The DWH Toolkit concentrates on the technical aspects of building a DWH, while The DWH Lifecycle Toolkit is based as much on the organisational challenges as on the technical details.
Good luck!

I would start with SSIS, which is a data integration technology that comes with SQL Server. It may handle a lot of the data sources you need. If you are using APIs such as Google's to get data, you may need to put that data in a staging table first.
Start with a single staging database which you will use as your primary source to load data into Analysis Services and see how that works out. Use SSIS to populate that staging database.
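SSIS handles the database-to-database movement, but for the API side a small C# loader is often the glue. Here is a minimal sketch, assuming a hypothetical dbo.StagingAnalytics table and connection string; the actual API call is stubbed out:

```csharp
// Hypothetical sketch: land API data in a staging table for SSIS/SSAS to
// pick up. Table name, columns, and connection string are placeholders;
// the actual API call (e.g. Google Analytics) is stubbed out.
using System;
using System.Data;
using System.Data.SqlClient;

class StagingLoader
{
    static void Main()
    {
        // Shape the API results into a DataTable that mirrors the staging table.
        var rows = new DataTable();
        rows.Columns.Add("Source", typeof(string));
        rows.Columns.Add("Metric", typeof(string));
        rows.Columns.Add("Value", typeof(decimal));
        rows.Columns.Add("CapturedAt", typeof(DateTime));
        rows.Rows.Add("GoogleAnalytics", "PageViews", 1234m, DateTime.UtcNow);

        using (var conn = new SqlConnection(
            "Data Source=.;Initial Catalog=Staging;Integrated Security=True"))
        using (var bulk = new SqlBulkCopy(conn))
        {
            conn.Open();
            bulk.DestinationTableName = "dbo.StagingAnalytics";
            bulk.WriteToServer(rows); // fast set-based load into staging
        }
    }
}
```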

You need to take the following steps:
1. First, pick an ETL platform, such as SSIS, Informatica, or another ETL tool.
2. Then, pick an appropriate database, such as Oracle or SQL Server.
3. Next, design the logical data warehouse model (star or snowflake schema), and
4. Finally, develop the whole data warehouse.
I would advise making two databases, i.e.
1. an ODS (operational data store) for storing the data from the different sources and for cleansing, and
2. a warehouse database for storing all the relevant data.
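For illustration, the hand-off between those two databases could be run from C# roughly like this; all table and column names below are invented and depend entirely on your dimensional model:

```csharp
// Hedged sketch of the ODS-to-warehouse hand-off: cleansed rows move from
// the ODS into the warehouse fact table. All table/column names (Ods.*,
// Dw.*) are invented; the real mapping depends on your star/snowflake design.
using System.Data.SqlClient;

class WarehouseLoad
{
    static void Main()
    {
        const string sql = @"
            INSERT INTO Dw.FactSales (DateKey, ProductKey, Amount)
            SELECT d.DateKey, p.ProductKey, s.Amount
            FROM   Ods.CleanSales s
            JOIN   Dw.DimDate    d ON d.FullDate  = s.SaleDate
            JOIN   Dw.DimProduct p ON p.SourceRef = s.ProductRef;";

        using (var conn = new SqlConnection(
            "Data Source=.;Initial Catalog=Warehouse;Integrated Security=True"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery(); // load cleansed ODS rows into the fact table
        }
    }
}
```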


Best approach to incrementally update application data

I have been working on an application for a couple of years that I update using a back-end database. The whole key is that everything is cached on the client, so that it never requires a network connection to operate, but when it does have a connection it will always pick up the latest updates. Every application update is shipped with the latest version of the database, and I want it to download only the minimum amount of data when the database has been updated.
I currently use a table with a timestamp to check for updates. It looks something like this:
ID - Name - Description - Severity - LastUpdated
0 - test.exe - KnownVirus - Critical - 2009-09-11 13:38
1 - test2.exe - Firewall - None - 2009-09-12 14:38
This approach was fine for what I previously needed, but I am looking to expand more of the application's functionality to use this type of dynamic approach. All the data is currently stored as XML, but I do not want to store complete XML files in the database; I only want to transmit the changed data.
So how would you go about allowing a fairly simple approach to storing dynamic content (text/XML/JSON/XAML) in a database, and have the client only download new updates? I was thinking of having logic that can handle XML inserted directly:
ID - Data - Revision
15 - XXX - 15
XXX would be something like <Content><File>Test.dll</File><Description>New DLL to load.</Description></Content> and would be inserted into the cache, but this would obviously get complicated, as I would need to load the revisions in sequence.
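To make the idea concrete, here is a rough sketch of what I imagine applying those revision rows in order could look like (the Updates table matches the example above; connection handling and the actual cache write are assumed):

```csharp
// Rough sketch of applying revision rows in order. The Updates table
// matches the ID - Data - Revision example above; the cache write itself
// is left as a comment because it depends on the client store.
using System.Data.SqlClient;
using System.Xml.Linq;

class RevisionApplier
{
    public static int Apply(SqlConnection conn, int localRevision)
    {
        const string sql = "SELECT Data, Revision FROM Updates " +
                           "WHERE Revision > @rev ORDER BY Revision";
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@rev", localRevision);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    var content = XElement.Parse(reader.GetString(0));
                    var file = (string)content.Element("File");
                    // ... apply 'file' (and the rest of the payload) to the
                    // local cache here, strictly in revision order ...
                    localRevision = reader.GetInt32(1);
                }
            }
        }
        return localRevision; // persist this as the new local revision
    }
}
```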
Another approach that has been mentioned was to base it on something similar to source control: storing the version in the root of the file and calculating the delta to figure out the minimal amount of data that needs to be sent to the client.
Anyone got any suggestions on how to approach this with no risk of data corruption? I would also like to add features that allow me to revert possibly bad revisions and replace them with new working ones.
It really depends on the tools you are using and the architecture you already have. Is there already a server with some logic and a data access layer?
Dynamic approaches might get complicated, slow and limit the number of solutions. Why do you need a dynamic structure? Would it be feasible to just add data by using a name-value pair approach in a relational database? Static and uniform data structures are much easier to handle.
Before going into detail, you should consider the different scenarios.
Items can be added
Items can be changed
Items can be removed (I assume)
Adding is not a big problem. The client needs to remember the last revision number it got from the server, and you write a query which gets everything since then.
Changing is basically the same. You should take care with the identification of items: you need an unchangeable surrogate key, which seems to be the ID you already have. (Guids may be useful here.)
Removing is tricky. You need to either flag items as deleted instead of actually removing them, or have a list of removed IDs with the revision number when they had been removed.
Storing the data in the client: Consider using a relational database like SQLite in the client. (It doesn't need installation, it is just storing in a file. Firefox for instance stores quite a lot in SQLite databases.) When using the same in the server, you can probably reuse some code. It is also transaction based, which helps to keep it consistent (rollback in case of error during synchronization).
XML - if you really need it - can be stored just as a string in the database.
When using an abstraction layer or ORM that supports SQLite (e.g. NHibernate), you may also reuse some code even when the server uses another database. Note that the learning curve for such an ORM might be rather steep; if you don't already know a tool like this, it could be too much.
You don't need to force reuse of code in the client and server.
Synchronization itself shouldn't be very complicated. You have a revision number in the client and a last revision on the server. You get all new, changed, and deleted items since then and apply them to the local store. Update the local revision number. Commit. Done.
I would never update only part of a revision, because then you can't really know what changed since the last synchronization. Because you are doing differential updates, it is essential to have a well-defined state on the client.
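Putting that together, here is a minimal sketch of such a sync step against a local System.Data.SQLite store. The ChangeSet shape, how it is fetched from the server, and the table names (Items, SyncState) are all assumptions for illustration:

```csharp
// Minimal sketch of a differential sync step against a local SQLite store.
// ChangeSet, how it arrives from the server, and the Items/SyncState table
// names are all invented for illustration.
using System.Collections.Generic;
using System.Data.SQLite;

class ChangedItem { public long Id; public string Data; public long Revision; }

class ChangeSet
{
    public long ServerRevision;
    public List<ChangedItem> Upserts = new List<ChangedItem>();
    public List<long> DeletedIds = new List<long>();
}

class Synchronizer
{
    public void Sync(SQLiteConnection local, ChangeSet changes)
    {
        using (var tx = local.BeginTransaction()) // rollback on error keeps the client consistent
        {
            foreach (var item in changes.Upserts)
                using (var cmd = new SQLiteCommand(
                    "INSERT OR REPLACE INTO Items (Id, Data, Revision) VALUES (@id, @data, @rev)",
                    local, tx))
                {
                    cmd.Parameters.AddWithValue("@id", item.Id);
                    cmd.Parameters.AddWithValue("@data", item.Data);
                    cmd.Parameters.AddWithValue("@rev", item.Revision);
                    cmd.ExecuteNonQuery();
                }

            foreach (var id in changes.DeletedIds)
                using (var cmd = new SQLiteCommand(
                    "DELETE FROM Items WHERE Id = @id", local, tx))
                {
                    cmd.Parameters.AddWithValue("@id", id);
                    cmd.ExecuteNonQuery();
                }

            // Data and revision number change in the same transaction.
            using (var cmd = new SQLiteCommand(
                "UPDATE SyncState SET LastRevision = @rev", local, tx))
            {
                cmd.Parameters.AddWithValue("@rev", changes.ServerRevision);
                cmd.ExecuteNonQuery();
            }
            tx.Commit(); // done
        }
    }
}
```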
I would go with a solution using Sync Framework.
Quote from Microsoft:
Microsoft Sync Framework is a comprehensive synchronization platform enabling collaboration and offline access for applications, services and devices. Developers can build synchronization ecosystems that integrate any application, any data from any store using any protocol over any network. Sync Framework features technologies and tools that enable roaming, sharing, and taking data offline.
A key aspect of Sync Framework is the ability to create custom providers. Providers enable any data sources to participate in the Sync Framework synchronization process, allowing peer-to-peer synchronization to occur.
I have just built an application pretty much exactly as you described. I built it on top of the Microsoft Sync Framework that DjSol mentioned.
I use a C# front-end application with a SqlCe database, and SQL Server 2005 at the other end.
The following articles were extremely useful for me:
Tutorial: Synchronizing SQL Server and SQL Server Compact
Walkthrough: Creating a Sync service
Step by step N-tier configuration of Sync services for ADO.NET 2.0
How to Sync schema changed database using sync framework?
You don't say what your back-end database is, but if it's SQL Server you can use SqlCE (SQL Server Compact Edition) as the client DB and then use RDA or merge replication to update the client DB as desired. This will handle all your requirements for sure; there is no need to reinvent the wheel for such a common requirement.
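For illustration, kicking off a merge replication sync from the SqlCe side can look roughly like this - every name (agent URL, publication, subscriber, file path) is a placeholder, and the publication itself is configured on the server:

```csharp
// For illustration only: triggering a merge replication sync from the
// SqlCe client. All names here are placeholders.
using System.Data.SqlServerCe;

class ReplicationSync
{
    static void Main()
    {
        using (var repl = new SqlCeReplication())
        {
            repl.InternetUrl = "https://server/sync/sqlcesa35.dll"; // agent endpoint
            repl.Publisher = "SERVER";
            repl.PublisherDatabase = "MasterDb";
            repl.Publication = "ClientPublication";
            repl.Subscriber = "Client01";
            repl.SubscriberConnectionString = @"Data Source=|DataDirectory|\client.sdf";

            repl.Synchronize(); // pushes local changes up, pulls server changes down
        }
    }
}
```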

C# WinForms as front-end for Access

I'm in the early stages of building a WinForms C# app based on an Access DB (I can't use other types of DB, for various reasons).
My main issue is how to design the DB, since the amount of data is vast (it grows daily) and it will hit Access's size limit within a month or so.
I thought of creating a new DB for every month, but then how will I be able to compare data between the different DBs, for example between months? I want the C# app to execute the queries.
Are there any tutorials or books? I have no experience with linking a front-end to an Access back-end.
Any ideas?
Thanks!
You probably don't want to hear this, but starting a project with MS Access as the backend is not a good idea when you already know in advance that you will hit Access' size limit after only a month.
You say in a comment:
I'm stuck with Access because of these: 1. The high cost of SQL Server. 2. I'm not familiar with SQL Server. 3. SQL Server Express (the free edition) also has a size limit, though a larger one (10 GB). Are there other DBs that are free and without a size limitation?
I agree with you that SQL Server is not a good solution in your situation - high price and size limitation are valid arguments against SQL Server (both full version and Express Edition).
So, in my opinion using a different database engine is the only real solution here.
Your third argument against SQL Server was "I'm not familiar with it", but I strongly advise you to become familiar with a database engine other than Access, because using Access in your situation (size limit!!!) will be a pain in the long run.
(Note to all nitpickers: No, I'm not bashing Access in general - I'm making a living with it myself.
However, it has its limits and when you know in advance that you'll hit its size limit within a month, it's not a good idea to use it here.)
Yes, you could do some hack and use a different Access database for each month, but you will really feel the pain as soon as your users need to load data from several months at once, or as soon as your boss asks you for a "quick report about our sales in the last three years" :-)
But you can use a different database engine. Yes, you will have to invest time to become familiar with it, learn how to set it up and so on.
But believe me, it will pay off in the long run because you don't have to deal with the hassle of one database file per month.
There are lots of free and capable database engines available; the best known are:
PostgreSQL
MySQL
Firebird
To connect to the MS Access database(s) you can use the code shown here and then you can go about 'joining' the data in your C# front-end.
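Since the linked code isn't reproduced here, this is a minimal sketch of reading one monthly Access file via OleDb; the provider string, file path, and query are placeholders:

```csharp
// Minimal sketch of reading one monthly Access file from C# via OleDb.
// Provider string, file path, and query are placeholders.
using System.Data.OleDb;

class AccessReader
{
    static void Main()
    {
        const string connStr =
            @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\2013-01.accdb";

        using (var conn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand("SELECT * FROM DailyData", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // accumulate rows here, then repeat for the other monthly
                    // files and do the cross-month comparison in C#
                }
            }
        }
    }
}
```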
You might end up writing a subset of a DB engine in C# though and I thoroughly support the comments provided by Bernard and Bryan.
Your DB design should be isolated from your front-end technology decisions, and the reverse is also true. See Multitier Architecture.
Using a multi-tier architecture will help you separate presentation from business logic from data access, allowing you to design and implement each of these components in a modular and robust fashion.
Searches for N-Tier architecture or Multitier architecture will find a wealth of information and help on how to implement multi-tier solutions and why you should go to the trouble.
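As a tiny illustrative sketch of that separation (all type names are invented): the form talks to a service, the service talks to a repository interface, and only the repository implementation knows it is backed by Access.

```csharp
// Illustrative tier separation; every type name here is invented.
using System.Collections.Generic;

public interface IReportRepository                    // data tier contract
{
    IList<decimal> GetMonthlyTotals(int year, int month);
}

public class ReportService                            // business logic tier
{
    private readonly IReportRepository _repository;
    public ReportService(IReportRepository repository) { _repository = repository; }

    public decimal MonthlyTotal(int year, int month)
    {
        decimal total = 0;
        foreach (var amount in _repository.GetMonthlyTotals(year, month))
            total += amount;                          // rules/aggregation live here
        return total;
    }
}
// The WinForms code-behind (presentation tier) only ever calls ReportService,
// so swapping Access for another engine later touches one class, not the UI.
```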
Here are a few links that might help.
http://msdn.microsoft.com/en-us/library/aa288452(v=vs.71).aspx
http://bytes.com/topic/net/answers/516465-c-insert-statement-using-oledb

Database design and hosting solution

I'm trying to prepare to build a database driven .net application and I have hit a roadblock early on due to my lack of knowledge on this topic. Searching around didn't yield anything so here I am asking for help.
I'm receiving weekly data in XML format that will be added to a database, and then reports will be generated using that data. I have a limited license on the XML files, so only I can download them, and I need to get the results to my end users as well. As far as I can see, I have 2 options:
Feed the data from the xml files into a web hosted database and then have each user connect to the database.
Upload the xml data to a server, have each user download it and keep a local copy of their own database. I'm thinking this will invalidate my license to the original data.
Things / questions of note:
The database holds weekly sports historical data for about the last 10 years.
I need to limit access to the database to only subscribed users.
I'll need to decide how the database will be built.
I need to decide what kind of hosting I'll need.
As you can see, it's quite an ambitious project for someone new to this. Here are the specific questions I have so far:
What kind of hosting solutions shall I look for?
Should I use SQL? (Complete newbie on this subject)
Should I use ClickOnce and then host the application?
Do you have any book or tutorial recommendations that would cover a project like this?
Do I need a script to feed the XML into the database if I go that route? Would that script reside on the server and run automatically, even if I'm not there to start it?
I hope the general topic isn't too vague. I tried to ask specific questions, and I'm aware I don't have any code to show, as the project is just in the early stages of thinking.
The question is a bit vague since you are early on in the decision-making process. However, I do believe that I can offer some help in directing your thinking as you proceed. I think in the situation you are describing, one key thing you should consider is to host your data via JSON/WCF/REST. If you look into these technologies, you will see that there are different ways you can offer your data based upon your developing requirements. For example, how are you going to do authentication? Are you going to allow third-party clients?
What you really don't want to do is allow direct database access, even for authenticated users. Instead, put something in front of it. If you are working in the .NET space, look into all of the different things WCF offers and pick one based upon what fits best. Once you pick that, then you will know what you need for hosting and deployment. Even if you are going to provide the clients as well as the server, this is still a good way to protect your data and provide a way to expand your offering in the future.
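As a hedged sketch of that idea, a WCF REST contract might look roughly like this; the service name, URI template, and DTO are invented for illustration:

```csharp
// Hedged sketch of "put something in front of the database" as a WCF REST
// service: subscribers call this over HTTP; only the service touches the
// database. Contract name, URI template, and DTO are invented.
using System.Collections.Generic;
using System.ServiceModel;
using System.ServiceModel.Web;

[ServiceContract]
public interface ISportsDataService
{
    [OperationContract]
    [WebGet(UriTemplate = "results/{season}", ResponseFormat = WebMessageFormat.Json)]
    List<GameResult> GetResults(string season);
}

public class GameResult
{
    public string HomeTeam { get; set; }
    public string AwayTeam { get; set; }
    public int HomeScore { get; set; }
    public int AwayScore { get; set; }
}

public class SportsDataService : ISportsDataService
{
    public List<GameResult> GetResults(string season)
    {
        // authenticate the subscriber, then query the database here;
        // the connection string never leaves this layer
        return new List<GameResult>();
    }
}
```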

NOSQL database Selection for forum

Hi, I am developing a forum in ASP.NET with C# for the code.
I have read an article about NoSQL and was inspired by its advantages over an RDBMS (SQL),
so I am wondering whether I should use NoSQL for the forum database. I am not an expert
in databases, so can you suggest whether I should use NoSQL? Currently I am using SQL (an RDBMS).
Depends on what you want to do with your forum.
If you want to store and retrieve user-written messages, then SQL will do fine.
If you want to analyze user relationships (Graph problem), you will want to examine Neo4J.
If you want to store a lot of large documents, but not on the file system, you will want to use NoSQL.
If you want to be able to change the table structure over and over, NoSQL is the way to go.
Else, stick with SQL.
Since a forum is remotely related to what Twitter does, I would look at what Twitter uses.
There are a few questions to answer before you make a decision about your database type. Will scalability be an issue? Are you designing your software to be used by hundreds of users concurrently? Also the previous poster is right about NoSQL offering schema flexibility.
Two main NoSQL products for .Net are RavenDB and FatDB. I'm using the latter with great performance results.
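As a rough illustration of how little ceremony the document-store style involves (based on RavenDB's standard client API of that era; the Post class and URL are made up):

```csharp
// Rough illustration of storing a forum post in RavenDB. The Post class
// and server URL are invented; the client API shown is RavenDB's standard
// DocumentStore/session pattern.
using Raven.Client.Document;

public class Post
{
    public string Id { get; set; }      // RavenDB assigns e.g. "posts/1"
    public string Author { get; set; }
    public string Body { get; set; }
}

class Program
{
    static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" })
        {
            store.Initialize();
            using (var session = store.OpenSession())
            {
                session.Store(new Post { Author = "mike", Body = "First post!" });
                session.SaveChanges(); // no schema migration when Post grows a field
            }
        }
    }
}
```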

Sometimes Connected CRUD application DAL

I am working on a Sometimes Connected CRUD application that will be primarily used by teams (2-4) of social workers and nurses to track patient information in the form of a plan. The application is a reimplementation of an ASP.NET app that was created before my time. There are approx. 200 tables across 4 databases. The web app version relied heavily on stored procedures (SPs), but since this version is a WinForms app that will be pointing to a local DB, I see no reason to continue with SPs. Also of note: I had planned to use merge replication to handle the syncing portion, and there seem to be some issues with those two together.
I am trying to understand what approach to use for the DAL. I originally had planned to use LINQ to SQL, but I have read tidbits that state it doesn't work well in a sometimes connected setting. I have therefore been trying to read about and experiment with numerous solutions: SubSonic, NHibernate, Entity Framework. This is a relatively simple application, and due to a "looming" version 3 redesign this effort can be considered borderline "throwaway." The emphasis here is on getting a desktop version up and running ASAP.
What I am asking here is for anyone with experience using any of these technologies (or one I didn't list) to lend me their hard-earned wisdom. What, in your opinion, is the best approach for me to pursue? Any other insights on creating this kind of app? I am really struggling with the DAL portion of this program.
Thank you!
If the stored procedures do what you want them to, I would have to say I'm dubious that you will get benefits by throwing them away and reimplementing them. Moreover, it shouldn't matter if you use stored procedures or LINQ to SQL style data access when it comes time to replicate your data back to the master database, so worrying about which DAL you use seems to be a red herring.
The tricky part about sometimes connected applications is coming up with a good conflict resolution system. My suggestions:
Always use RowGuids as your primary keys to tables. Merge replication works best if you always have new records uniquely keyed.
Realize that merge replication can only do so much: it is great for bringing new data in disparate systems together. It can even figure out one sided updates. It can't magically determine that your new record and my new record are actually the same nor can it really deal with changes on both sides without human intervention or priority rules.
Because of this, you will need "matching" rules to resolve records that claim to be new but actually aren't. Note that this is a fuzzy step: you can rarely rely on a unique key actually being entered exactly the same on both sides and without error. This means using weighted matches where many of your indicators are the same or similar (a rough sketch follows this list).
The user interface for resolving conflicts and matching up "new" records with the original needs to be easy to operate. I use something that looks similar to the classic three way merge that many source control systems use: Record A, Record B, Merged Record. They can default the Merged Record to A or B by clicking a header button, and can select each field by clicking against them as well. Finally, Merged Records fields are open for edit, because sometimes you need to take parts of the address (say) from A and B.
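Here is the kind of weighted match I mean, as a rough sketch - the field names and weights are entirely invented and would come from your own data and business rules:

```csharp
// Rough sketch of a weighted matching rule for records claiming to be new.
// Field names and weights are entirely invented for illustration.
using System;

class PatientRecord
{
    public string LastName, FirstName, Phone;
    public DateTime? DateOfBirth;
}

static class RecordMatcher
{
    public static double MatchScore(PatientRecord a, PatientRecord b)
    {
        double score = 0;
        if (Same(a.LastName, b.LastName))   score += 0.4; // strong indicator
        if (Same(a.FirstName, b.FirstName)) score += 0.2;
        if (a.DateOfBirth.HasValue && a.DateOfBirth == b.DateOfBirth) score += 0.3;
        if (Same(a.Phone, b.Phone))         score += 0.1; // weak, often mistyped
        return score; // e.g. > 0.7 => show the pair in the three-way merge UI
    }

    static bool Same(string x, string y)
    {
        return !string.IsNullOrEmpty(x) && y != null &&
               string.Equals(x.Trim(), y.Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```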
None of this should affect your data access layer in the slightest: this is all either lower level (merge replication, provided by the database itself) or higher level (conflict resolution, provided by your business rules for resolution) than your DAL.
If you can install a DB system locally, go for something you feel familiar with. The greatest problem, I think, will be the syncing and merging part. You must think through several possibilities: for example, you changed something that someone else deleted on the server. Who decides?
I've never used the Sync Framework myself, just read an article, but it may give you a solid foundation to build on. Whichever way you go with data access, though, the solution for the business logic will probably have a much wider impact...
There is a sample app called IssueVision that Microsoft put out back in 2004.
http://windowsclient.net/downloads/folders/starterkits/entry1268.aspx
Found the link in an old thread on joelonsoftware.com: http://discuss.joelonsoftware.com/default.asp?joel.3.25830.10
Other ideas...
What about mobile broadband? A couple of 3G cellular cards would work tomorrow, and your app would need no changes apart from large pages/graphics.
Or Excel spreadsheets used in the field, with DTS or SSIS to import the data into the application, while a "better" solution is created.
Good luck!
If by SPs you mean stored procedures... I'm not sure I understand your reasoning for trying to move away from them, considering that they're fast, proven, and already written for you (i.e. tested).
Surely, if you're making an app that will mimic the original, there are definite merits to keeping as much of the original (working) codebase as possible - not least of which is speed.
I'd try installing a local copy of the db, and then pushing all affected records since the last connected period to the master db when it does get connected.
