Manipulate values in a DataTable? - C#

A generic data import module:
I am reading data from any of six data source types (CSV, Active Directory, SQL, Access, Oracle, SharePoint) into a DataTable.
Users may then change this data with per-column casting and calculations before it is written to a SQL table (any table selected by the user).
Doing this seems easy, except that the user must also be able to replace some fields in the DataTable with values from fields in the target SQL database (lookups).
I would really like to do all of the above to the DataTable before sending it to the target database, but I cannot, repeat NOT, use LINQ, since the table structures (both source and target) are unknown and do not represent a specific business object.
tl;dr I need to do data transformations on any DataTable. What is a good way to do this (no LINQ)?
EDIT: The source and target tables are different in structure.

I ended up writing a class for each database type, all implementing one interface, and using a GenericConnection based on DbConnection for the different types of sources.
I broke the process up into:
Import
Transform
Write
stages that can be saved and reopened for re-use or editing.
The Transform part consists of:
Casting
Calculations (string, int, decimal, date, bool) such as Add, Subtract, Divide, Multiply, AND, OR, Substring, Replace
Lookups against other tables
Direct copying
The transformations can be queued so that one column of data can go through any number of transformations to match the target before being written to the target.
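For illustration, here is a minimal sketch of what a queued, per-column transformation against a DataTable could look like without LINQ. The interface and class names are invented for this example and are not the actual module:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical shape of one queued, per-column transformation step.
public interface IColumnTransform
{
    string ColumnName { get; }
    object Apply(object currentValue, DataRow row);
}

// Example step from the list above: a string Replace on one column.
public class ReplaceTransform : IColumnTransform
{
    private readonly string _oldValue;
    private readonly string _newValue;

    public string ColumnName { get; private set; }

    public ReplaceTransform(string columnName, string oldValue, string newValue)
    {
        ColumnName = columnName;
        _oldValue = oldValue;
        _newValue = newValue;
    }

    public object Apply(object currentValue, DataRow row)
    {
        if (currentValue == DBNull.Value) return DBNull.Value;
        return Convert.ToString(currentValue).Replace(_oldValue, _newValue);
    }
}

public static class TransformRunner
{
    // Apply each queued transform, in order, to every row of the table.
    // A cast that changes a column's CLR type would instead write into a new
    // DataColumn of the target DataType, since DataColumn.DataType cannot be
    // changed once the column holds data.
    public static void Run(DataTable table, IEnumerable<IColumnTransform> queue)
    {
        foreach (IColumnTransform step in queue)
            foreach (DataRow row in table.Rows)
                row[step.ColumnName] = step.Apply(row[step.ColumnName], row);
    }
}
```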


Copying data from one oracle database to another oracle database using C#

What is the standard way of copying data from one Oracle database to another?
1) Read data from the source table and copy it to a temp table on the destination, driven by configuration (i.e. there is more than one table, and each table has a separate temp table)
2) Right now there is no CLOB data, but in the future CLOB data might be used.
3) Read everything into memory (if the data is large, read it in chunks)
Should not use Oracle database links
Should not use files
The code should use only C#, not any database procedures.
One way that I've used to do this is to use a DataReader on the source database and just perform inserts on the target database (using Bind Parameters for sure).
Note that the DataReader is excellent at not using much memory as it moves through a table (I believe that by default it uses a Fast Forward, Read Only cursor). This means that only a small amount of data is held in memory at a given time.
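A minimal sketch of that pattern, assuming the ODP.NET managed provider and a hypothetical SOURCE_TABLE/DEST_TABLE pair with two columns:

```csharp
using Oracle.ManagedDataAccess.Client; // assumed provider; unmanaged ODP.NET works the same way

public static class OracleTableCopier
{
    // Stream rows from the source and insert them one at a time with bind parameters.
    // SOURCE_TABLE / DEST_TABLE and their columns are placeholders for illustration.
    public static void Copy(string sourceConnString, string destConnString)
    {
        using (var source = new OracleConnection(sourceConnString))
        using (var dest = new OracleConnection(destConnString))
        {
            source.Open();
            dest.Open();

            using (var readCmd = new OracleCommand("SELECT ID, NAME FROM SOURCE_TABLE", source))
            using (var writeCmd = new OracleCommand(
                "INSERT INTO DEST_TABLE (ID, NAME) VALUES (:id, :name)", dest))
            {
                OracleParameter idParam = writeCmd.Parameters.Add("id", OracleDbType.Int32);
                OracleParameter nameParam = writeCmd.Parameters.Add("name", OracleDbType.Varchar2);

                // The DataReader keeps only the current row in memory.
                using (OracleDataReader reader = readCmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        idParam.Value = reader.GetValue(0);
                        nameParam.Value = reader.GetValue(1);
                        writeCmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }
}
```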
Here are the things to watch out for:
Relationships
If you're working with data that has relationships, you're going to need to deal with that. There are two ways that I've seen to deal with this:
Temporarily drop the relationships in the target database before doing the copy, then recreate them after.
Copy the data in the correct order for the relationships to work correctly (this is usually pretty difficult / inefficient)
Auto Generated Id Values
These columns are usually handled by disabling the auto-increment functionality for the given table and allowing identity inserts (I'm using SQL Server terms; I can't remember how it works in Oracle).
Transactions
If you're moving a lot of data, transactions will be expensive.
Repeatability / Deleting Target Data
Unless you're way more awesome than the rest of us, you'll probably have to run this thing more than once (at least during development). That means you might want a way to delete the target data.
Platform Specific Methods
In SQL Server, there are ways to perform bulk inserts that are blazingly fast (by giving up little things like referential integrity checking). There might be a similar feature within the Oracle toolset.
Table / Column Metadata
I haven't had to do this in Oracle yet, but it looks like you can get metadata on tables and columns using the views mentioned here.
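For instance, ALL_TAB_COLUMNS is one of the standard Oracle data dictionary views that exposes column names and types. A rough sketch of reading it from C# (managed ODP.NET provider assumed):

```csharp
using System;
using Oracle.ManagedDataAccess.Client; // assumed provider

public static class OracleMetadata
{
    // Prints the column names, data types, and nullability for one table.
    public static void PrintColumns(string connString, string tableName)
    {
        using (var conn = new OracleConnection(connString))
        using (var cmd = new OracleCommand(
            "SELECT COLUMN_NAME, DATA_TYPE, NULLABLE FROM ALL_TAB_COLUMNS " +
            "WHERE TABLE_NAME = :tbl ORDER BY COLUMN_ID", conn))
        {
            cmd.Parameters.Add("tbl", OracleDbType.Varchar2).Value = tableName.ToUpperInvariant();
            conn.Open();
            using (OracleDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0} {1} nullable={2}",
                        reader.GetString(0), reader.GetString(1), reader.GetString(2));
            }
        }
    }
}
```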

Customizable database

What would be the best database/technique to use if I'd like to create a database that can "add", "remove" and "edit" tables and columns?
I'd like it to be scalable and fast.
Should I use one table with five columns for this (Id, Table, Column, Type, Value)? Are there any good articles about this, or are there other solutions?
Maybe three tables: one that holds the tables, one that holds the columns, and one for the values?
Maybe someone has already created a DB for this purpose?
My requirement is that I'm using .NET (I guess the database doesn't have to be on Windows, but I would prefer that).
Since (in the comments on the question) you are aware of the pitfalls of the "inner platform effect", I'll note that this is nonetheless a very common requirement - in particular, storing custom user-defined columns - and most teams have needed it at some point. Having tried various approaches, the one I have found most successful is to keep the extra data in-line with the record: this makes it simple to obtain the data without extra steps like a second complex query against an external table, and it means that all the values share things like the timestamp/rowversion used for concurrency.
In particular, I've found a CustomValues column (for example text or binary; typically json / xml, but could be more exotic) a very effective way to work, acting as a property-bag for the additional data. And you don't have to parse it (or indeed, SELECT it) until you know you need the extra data.
All you then need is a way to tie named keys to expected types, but you need that metadata anyway.
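As a rough sketch of the property-bag idea, assuming Json.NET as the serializer and a hypothetical Customer table with a CustomValues column:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Newtonsoft.Json; // assumed serializer; any portable JSON library will do

public static class CustomValuesStore
{
    // Write the user-defined extras as a JSON property bag, in-line with the record.
    public static void SaveExtras(string connString, int customerId,
                                  IDictionary<string, object> extras)
    {
        string json = JsonConvert.SerializeObject(extras);
        using (var conn = new SqlConnection(connString))
        using (var cmd = new SqlCommand(
            "UPDATE Customer SET CustomValues = @bag WHERE CustomerId = @id", conn))
        {
            cmd.Parameters.AddWithValue("@bag", json);
            cmd.Parameters.AddWithValue("@id", customerId);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    // Only parse the bag when the extra data is actually needed.
    public static IDictionary<string, object> LoadExtras(string json)
    {
        if (string.IsNullOrEmpty(json))
            return new Dictionary<string, object>();
        return JsonConvert.DeserializeObject<Dictionary<string, object>>(json);
    }
}
```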
I will, however, stress the importance of making the data portable; don't (for example) store any specific platform-bespoke serialization (for example, BinaryFormatter for .NET) - things like xml / json are fine.
Finally, your RDBMS may also work with this column; for example, SQL Server has the xml data type that allows you to run specific queries and other operations on xml data. You must make your own decision whether that is a help or a hindrance ;p
If you also need to add tables, I wonder whether you are truly using the RDBMS as an RDBMS; at that point I would consider switching from an RDBMS to a document database such as CouchDB or RavenDB.

How to generically read text files and save into databases without evaluating data type?

I need to write a C# program that reads any flat text file and, for every line, parses it using a regular expression previously saved in a configuration file. The regex splits the line into several fields, which are stored in a List where each element is a field. Each field is then saved into the corresponding column previously created in a database table, and so on for each line.
Up to this point the data is orchestrated without interference, like a pipe running from the text file to the database. However, I face the difficulty that some of the fields can represent dates/times in different formats, bools, doubles, etc. For these few cases I would need to modify my program and add some sort of if-statements, which would at minimum add overhead to processing time.
The question is whether there is any methodology or technique in C# that, despite the special cases mentioned, allows me to record the data in the database without the overhead of first evaluating each value to establish its data type.
Thank you.
Why not just write all of the data from the text file into a database whose fields are all strings (varchar, text, etc.)? At some point you will need to query the database to use the values, and that's when you can convert from the strings into their respective data types.
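A rough sketch of that approach, with placeholder table and column names: insert each parsed field as text, and only convert (for example via TryParse) at the point where a query actually needs a typed value.

```csharp
using System;
using System.Data.SqlClient;

public static class RawImport
{
    // Store every parsed field as text; no type evaluation at import time.
    public static void InsertLine(SqlConnection conn, string[] fields)
    {
        using (var cmd = new SqlCommand(
            "INSERT INTO RawLines (Field1, Field2, Field3) VALUES (@f1, @f2, @f3)", conn))
        {
            cmd.Parameters.AddWithValue("@f1", fields[0]);
            cmd.Parameters.AddWithValue("@f2", fields[1]);
            cmd.Parameters.AddWithValue("@f3", fields[2]);
            cmd.ExecuteNonQuery();
        }
    }

    // Convert only when the value is actually consumed.
    public static DateTime? AsDate(string raw)
    {
        DateTime value;
        return DateTime.TryParse(raw, out value) ? (DateTime?)value : null;
    }
}
```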

Converting project to SQL Server, design thoughts?

Currently I'm sitting on an ugly business application written in Access that takes a spreadsheet on a bi-daily basis and imports it into an MDB. I am now converting a major project that includes this into SQL Server and .NET, specifically C#.
To house this information there are two tables (alias names here) that I will call Master_Prod and Master_Sheet, joined on ProdID, an identity key on the parent Master_Prod table. There are also two more tables to store history, History_Prod and History_Sheet. There are more tables that extend off of Master_Prod, but I'm keeping this limited to two tables for explanation purposes.
Since this was written in Access, the subroutine that handles this file is littered with manually coded triggers to deal with history, which have been a constant pain to keep up with; that's one reason I'm glad this is moving to a database server rather than a RAD tool. I am writing triggers to handle the history tracking.
My plan is/was to create an object modeling the spreadsheet, parse the data into it, and use LINQ to do some checks client-side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (unless none exists, in which case it's new). If any of the fields have been altered I want to send the update.
Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list since I'll have the spreadsheet in this form already but I've recently learned this is going to be paired with a rather important database server that I am very concerned with bogging down.
Is this worth putting a CLR stored procedure in for? There are other points of entry where data enters and if I could build a procedure to handle them given the objects passed in then I could take a lot of business rule away from the application at the expense of potential database performance.
Basically I want to take the update checking away from the client and put it on the database, so that the data system manages whether or not the table should be updated and the history trigger can fire off.
Thoughts on a better way to implement this along the same direction?
Use SSIS. Use Excel Source to read the spreadsheets, perhaps use a Lookup Transformation to detect new items and finally use a SQL Server Destination to insert the stream of missing items into SQL.
SSIS is a much better fit for these kinds of jobs than writing something from scratch, no matter how much fun LINQ is. SSIS packages are easier to debug, maintain, and refactor than some DLL with forgotten sources. Besides, you will not be able to match the refinements SSIS has in managing its buffers for high-throughput Data Flows.
Originally I was hoping to put this procedure into some sort of CLR assembly that accepts an IEnumerable list since I'll have the spreadsheet in this form already but I've recently learned this is going to be paired with a rather important database server that I am very concerned with bogging down.
That does not work. Any input into a CLR procedure written in C# still has to follow normal SQL semantics; all that can change is the internal implementation. Any communication with the client has to be done in SQL, which means executions / method calls. There is no way to directly pass in an enumerable of objects.
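To make the constraint concrete, here is a minimal sketch of what a CLR procedure ends up looking like: its parameters must map to SQL types, and data access inside it still goes through the context connection via SQL. The LastChecked column is purely hypothetical:

```csharp
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class ImportProcedures
{
    // Parameters must be SQL-compatible types (SqlInt32, SqlString, ...);
    // there is no way to declare an IEnumerable<SpreadsheetRow> parameter
    // and pass the client's object graph in directly.
    [SqlProcedure]
    public static void TouchProdHistory(SqlInt32 prodId)
    {
        if (prodId.IsNull) return;

        // Inside the server you still work through SQL, via the context connection.
        using (var conn = new SqlConnection("context connection=true"))
        using (var cmd = conn.CreateCommand())
        {
            conn.Open();
            // Hypothetical statement against the tables described in the question.
            cmd.CommandText = "UPDATE Master_Prod SET LastChecked = GETDATE() WHERE ProdID = @id";
            cmd.Parameters.AddWithValue("@id", prodId.Value);
            cmd.ExecuteNonQuery();
        }
    }
}
```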
My plan is/was to create an object modeling the spreadsheet, parse the data into it, and use LINQ to do some checks client-side before sending the data to the server... Basically I need to compare the data in the sheet to a matching record (unless none exists, in which case it's new). If any of the fields have been altered I want to send the update.
You probably need to pick a "centricity" for your approach - i.e. data-centric or object-centric.
I would probably model the data appropriately first. This is because relational databases (or even non-normalized models represented in relational databases) will often outlive client tools, libraries, and applications. I would probably start by trying to model in a normal form, and think during this time about the triggers to maintain the audit/history you mention.
I would typically then think of the data coming in (not an object model or an entity, really). So then I focus on the format and semantics of the inputs and see if there is a misfit with my data model - perhaps there were assumptions in my data model which were incorrect. Yes, I'm not thinking of making an object model which validates the spreadsheet, even though spreadsheets are notoriously fickle input sources. Like Remus, I would simply use SSIS to bring it in - perhaps to a staging table, and then do some more validation before applying it to production tables with some T-SQL.
Then I would think about a client tool which had an object model based on my good solid data model.
Alternatively, the object approach would mean modeling the spreadsheet, but also an object model which needs to be persisted to the database - and perhaps you now have two object models (spreadsheet and full business domain) and database model (storage persistence), if the spreadsheet object model is not as complete as the system's business domain object model.
I can think of an example where I had a throwaway external object model kind of like this. It read a "master file" which was a layout file describing an input file. This object model allowed the program to build SSIS packages (and BCP and SQL scripts) to import/export/do other operations on these files. Effectively it was a throwaway object model - it was not used as the actual model for the data in the rows or any kind of navigation between parent and child rows, etc., but simply an internal representation for internal purposes - it didn't necessarily correspond to a "domain" entity.

Performance issues with transpose and insert large, variable column data files into SQL Server

I'm currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can't use SSIS File Connection Managers as the columns can be totally different from file to file, so we have a custom object model in C# which transposes rows and columns of data into two base tables: one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
Example - Data Files: (example omitted)
Example - DB tables: (example omitted)
The SQL insert is performed currently by looping through all the data rows and appending the values to a SQL string. This constructs a large dynamic string which is then executed at the end via SqlCommand.
The problem is that even a 1MB file takes about a minute to run, so large files (200MB, etc.) take hours to process. I'm looking for suggestions on other ways to approach the insert that will improve performance and speed up the process.
There are a few things I can do with the structure of the loop to cut down on the string size and number of SQL commands present in the string but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well, I'll try and provide more detail if required.
Any ideas on how to speed up this process?
The dynamic string is going to be SLOW. Each SqlCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema you would be better off either using separate connection managers and the built-in unpivot operator, or in a script task adding multiple rows to the data flow in the common output (just like you are currently doing in building your SQL INSERT...INSERT...INSERT for each input row) and then letting it all stream into a destination.
i.e. Read your data and in the script source, assign the FileID, RowId, AttributeName and Value to multiple rows (so this is doing the unpivot in code, but instead of generating a varying number of inserts, you are just inserting a varying number of rows into the dataflow based on the input row).
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
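If the load stays in C# rather than moving to SSIS, SqlBulkCopy is the usual way to stream the unpivoted rows instead of concatenating INSERT statements. A minimal sketch, assuming an EAV value table with illustrative column names:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class EavBulkLoader
{
    // Stream the unpivoted cell values into the value table in one bulk operation.
    public static void Load(string connString, DataTable cellValues)
    {
        // cellValues is expected to carry FileId, RowId, AttributeId, Value columns
        // matching the destination table (names here are placeholders).
        using (var bulk = new SqlBulkCopy(connString))
        {
            bulk.DestinationTableName = "dbo.CellValues";
            bulk.BatchSize = 10000;      // commit in batches rather than one giant statement
            bulk.BulkCopyTimeout = 0;    // no timeout for large files
            bulk.ColumnMappings.Add("FileId", "FileId");
            bulk.ColumnMappings.Add("RowId", "RowId");
            bulk.ColumnMappings.Add("AttributeId", "AttributeId");
            bulk.ColumnMappings.Add("Value", "Value");
            bulk.WriteToServer(cellValues);
        }
    }
}
```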
One thought - are you repeatedly going back to the database to find the appropriate attribute value? If so, switching the repeated queries to a query against a recordset that you keep on the client side will speed things up enormously.
This is something I have done before - 4 reference tables involved. Creating a local recordset and filtering that as appropriate caused a speed up of a process from 2.5 hours to about 3 minutes.
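As a minimal sketch of that kind of client-side cache (the dbo.Attributes table and its columns are illustrative): load the reference table once, then resolve IDs from a dictionary instead of re-querying per row.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public class AttributeCache
{
    private readonly Dictionary<string, int> _ids =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    // One round trip to load the whole reference table up front.
    public AttributeCache(string connString)
    {
        using (var conn = new SqlConnection(connString))
        using (var cmd = new SqlCommand(
            "SELECT AttributeName, AttributeId FROM dbo.Attributes", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    _ids[reader.GetString(0)] = reader.GetInt32(1);
        }
    }

    // Per-row lookups now hit memory, not the database.
    public bool TryGetId(string attributeName, out int attributeId)
    {
        return _ids.TryGetValue(attributeName, out attributeId);
    }
}
```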
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.
