We've got to load large pipe-delimited files. When loading these into a SQL Server DB by using Rhino ETL (relying upon FileHelpers), is it mandatory to provide a record class?
We have to load files into different tables, each of which has dozens of columns - writing the record classes by hand might take us a whole day. I guess we could write a small tool to generate the record classes from the SQL Server tables.
Another approach would be to write an IDataReader wrapper around a FileStream and then pass it on to SqlBulkCopy.
SqlBulkCopy does require column mappings as well, but it allows mapping by column ordinals - that's easy.
Any ideas/suggestions?
Thanks.
I don't know much about Rhino ETL, but FileHelpers has a ClassBuilder which allows you to generate the record class at run time. See the documentation for some examples.
So it would be easy to generate a class with something like the following:
SqlCommand command = new SqlCommand("SELECT TOP 1 * FROM Customers;", connection);
connection.Open();
// get the schema for the customers table
SqlDataReader reader = command.ExecuteReader();
DataTable schemaTable = reader.GetSchemaTable();
// create the FileHelpers record class
// alternatively there is a 'FixedClassBuilder'
DelimitedClassBuilder cb = new DelimitedClassBuilder("Customers", ",");
cb.IgnoreFirstLines = 1;
cb.IgnoreEmptyLines = true;
// populate the fields based on the columns
foreach (DataRow row in schemaTable.Rows)
{
cb.AddField(row.Field<string>("ColumnName"), row.Field<Type>("DataType"));
cb.LastField.TrimMode = TrimMode.Both;
}
// load the dynamically created class into a FileHelpers engine
FileHelperEngine engine = new FileHelperEngine(cb.CreateRecordClass());
// import your records
DataTable dt = engine.ReadFileAsDT("testCustomers.txt");
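If the destination is SQL Server, you can then hand that DataTable straight to SqlBulkCopy with ordinal column mappings, roughly like this (a minimal sketch; the connection string and destination table name are placeholders):
// Sketch: push the DataTable produced by FileHelpers into SQL Server.
using (SqlConnection destination = new SqlConnection("YourSqlServerConnectionString"))
{
    destination.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(destination))
    {
        bulkCopy.DestinationTableName = "Customers";
        // map columns by ordinal, as mentioned in the question
        for (int i = 0; i < dt.Columns.Count; i++)
        {
            bulkCopy.ColumnMappings.Add(i, i);
        }
        bulkCopy.WriteToServer(dt);
    }
}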
Related
I have created a tool that currently uses Excel (VBA); however, I would like to make it independent of Excel, since there can be various problems with Excel update packages and so on. Basically, I want to eliminate one more factor that can go wrong.
My current Excel application:
Imports two database tables (Customers) into two different worksheets sheet1 and sheet2
Merges these two tables into another sheet (sheet3) by comparing column 2 (Name) and column 6 (Postanumber)
If there are duplicates (some of the Customers appear in both tables), the VBA code writes only one row to sheet3
After sheet3 is ready, a foreach loop over sheet3 exports all the Customers to another system
I have started with C# code that can connect to the database and get all the Customers from both tables by merging them.
I started with a DataGrid to get an idea:
private Task<DataView> GetDataAsync()
{
return Task.Run(() =>
{
string connectionStringDE = "Driver={Pervasive ODBC Client Interface};ServerName=DE875;dbq=#DEDBFS;Uid=DEUsername;Pwd=DEPassword;";
string queryStringDE = "select NRO,NAME,NAMEA,NAMEB,ADDRESS,POSTA,POSTN,POSTADR,COMPANYN,COUNTRY,ID,ACTIVE from COMPANY";
string connectionStringFR = "Driver={Pervasive ODBC Client Interface};ServerName=FR875;dbq=#FRDBFS;Uid=FRUsername;Pwd=FRPassword;";
string queryStringFR = "select NRO,NAME,NAMEA,NAMEB,ADDRESS,POSTA,POSTN,POSTADR,COMPANYN,COUNTRY,ID,ACTIVE from COMPANY";
DataTable dataTable = new DataTable("COMPANY");
// using-statement will cleanly close and dispose unmanaged resources i.e. IDisposable instances
using (OdbcConnection dbConnectionDE = new OdbcConnection(connectionStringDE))
{
dbConnectionDE.Open();
OdbcDataAdapter dadapterDE = new OdbcDataAdapter();
dadapterDE.SelectCommand = new OdbcCommand(queryStringDE, dbConnectionDE);
dadapterDE.Fill(dataTable);
}
using (OdbcConnection dbConnectionFR = new OdbcConnection(connectionStringFR))
{
dbConnectionFR.Open();
OdbcDataAdapter dadapterFR = new OdbcDataAdapter();
dadapterFR.SelectCommand = new OdbcCommand(queryStringFR, dbConnectionFR);
var newTable = new DataTable("COMPANY");
dadapterFR.Fill(newTable);
dataTable.Merge(newTable);
}
return dataTable.DefaultView;
});
}
However, because of my lack of knowledge, I am not confident which method I should use in C# to store all the data (the sheet3 equivalent - more than 1000 records) before exporting it to another system with a foreach loop. Should I create a local database, or can I use a list? I guess I shouldn't import all the values from the two tables into two different lists/database tables and compare/merge them into another one afterwards (as I do in my current Excel VBA setup), but can I perform this merging and assignment to a list/database table while still connected?
However, because of my lack of knowledge, I am not confident which method I should use in C# to store all the data (more than 1000 records) before exporting it to another system with a foreach loop.
"What should I use" is subjective/opinion based and can't really be answered here. It would be fair to say that anything you can find a reasonable tutorial on, such as generic collections or DataSets/DataTables, will easily handle a dataset this small.
Should I create a local database, or can I use a list?
You can do either, but I would only use a database if I were persisting the information. It sounds to me like your need for data storage is temporary, so I would just use an in-memory collection of data.
I guess I shouldn't import all the values from the two tables into two different lists/database tables and compare/merge them into another one afterwards
I can't see any reason why you shouldn't.
but can I perform this merging and assignment to a list/database table while still connected?
It is certainly possible to merge two sets of data by uploading one of them into the database where the other one is and having the database do the merge.
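If you go the in-memory route suggested above, the deduplication step might look roughly like this (a minimal sketch; it assumes the merged DataTable returned by your GetDataAsync method, that NAME and POSTN are the comparison columns, and that you have System.Linq plus a reference to System.Data.DataSetExtensions for AsEnumerable/CopyToDataTable):
// Sketch: deduplicate the merged COMPANY table in memory by NAME + POSTN.
// Column names are assumptions taken from the query above.
DataView companyView = await GetDataAsync();   // inside an async method
DataTable merged = companyView.Table;

DataTable distinct = merged.AsEnumerable()
    .GroupBy(r => new { Name = r["NAME"], Post = r["POSTN"] })
    .Select(g => g.First())
    .CopyToDataTable();

// The export loop that replaces the VBA foreach over sheet3:
foreach (DataRow row in distinct.Rows)
{
    // push the customer to the other system here
}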
I have a DataTable that comes from a SQL query. While I am really working against a table using OLEDB, I have the same problem even if I get the table from my SQL Server.
If I fill the DataTable and then inspect the DataColumns, they all say AllowDBNull == true. But if I look at the table in SSMS, it states otherwise.
string selectStmt = "Select * from foobar;";
DataSet NewData = new DataSet();
using (SqlConnection DataConn = new SqlConnection(MyConnectionString))
{
SqlDataAdapter DataAdapter = new SqlDataAdapter(selectStmt, DataConn );
var Results = DataAdapter.Fill(NewData, tableName);
}
DataColumn Col = NewData.Tables[0].Columns[0];
// Col.AllowDBNull is always true here
I also can't seem to figure out where to get the length of a string field.
This makes it a little difficult to implement some simple client side error checking before I try to upload data.
If I were only dealing with SQL server based tables, I could use Microsoft.SqlServer.Management.Sdk and Microsoft.SqlServer.Management.Smo. Since I am not, that's out.
Try
var Results = DataAdapter.FillSchema(NewData, SchemaType.Source, tableName);
See if that gives you the level of schema detail you need.
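For example, a quick sketch along those lines (reusing the connection string, statement and table name from the question); after FillSchema the column metadata should reflect the source table, including nullability and maximum string length:
// Sketch: call FillSchema before Fill so the DataTable carries the source schema.
DataSet newData = new DataSet();
using (SqlConnection dataConn = new SqlConnection(MyConnectionString))
{
    SqlDataAdapter dataAdapter = new SqlDataAdapter(selectStmt, dataConn);
    dataAdapter.FillSchema(newData, SchemaType.Source, tableName);
    dataAdapter.Fill(newData, tableName);
}

DataColumn col = newData.Tables[0].Columns[0];
bool allowsNull = col.AllowDBNull;  // now matches the source column's nullability
int maxLength = col.MaxLength;      // string length for text columns, -1 otherwise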
A result set isn't going to know column schema data like that; it would be too intensive an operation to fetch it on every command execution. Instead, the runtime creates schema information on the fly using only the data it gets back in the data/result set. For a full-blown schema you'd have to use something like EF or code the schema yourself. The only thing you can rely on in a runtime schema is the data type (unless the data columns were specifically coded with their attributes).
To properly test for DBNull you do this:
if (dataRow[colNameOrIndex] == DBNull.Value)
{
    // null
}
I am using this code to insert 1 million records into an empty table in the database. Without posting too much code, I will start from the point where I have already interacted with the data and read the schema into a DataTable:
DataTable returnedDtViaLocalDbV11 = DtSqlLocalDb.GetDtViaConName(strConnName, queryStr, strReturnedDtName);
And now that we have returnedDtViaLocalDbV11, let's create a new DataTable to be a clone of the source database table:
DataTable NewDtForBlkInsert = returnedDtViaLocalDbV11.Clone();
Stopwatch SwSqlMdfLocalDb11 = Stopwatch.StartNew();
NewDtForBlkInsert.BeginLoadData();
for (int i = 0; i < 1000000; i++)
{
NewDtForBlkInsert.LoadDataRow(new object[] { null, "NewShipperCompanyName"+i.ToString(), "NewShipperPhone" }, false);
}
NewDtForBlkInsert.EndLoadData();
DBRCL_SET.UpdateDBWithNewDtUsingSQLBulkCopy(NewDtForBlkInsert, tblClients._TblName, strConnName);
SwSqlMdfLocalDb11.Stop();
var ResSqlMdfLocalDbv11_0 = SwSqlMdfLocalDb11.ElapsedMilliseconds;
This code populates 1 million records into an embedded SQL database (localDb) in 5200 ms. The rest of the code just implements the bulk copy, but I will post it anyway.
public string UpdateDBWithNewDtUsingSQLBulkCopy(DataTable TheLocalDtToPush, string TheOnlineSQLTableName, string WebConfigConName)
{
//Open a connection to the database.
using (SqlConnection connection = new SqlConnection(ConfigurationManager.ConnectionStrings[WebConfigConName].ConnectionString))
{
connection.Open();
// Perform an initial count on the destination table.
SqlCommand commandRowCount = new SqlCommand("SELECT COUNT(*) FROM "+TheOnlineSQLTableName +";", connection);
long countStart = System.Convert.ToInt32(commandRowCount.ExecuteScalar());
var nl = "\r\n";
string retStrReport = "";
retStrReport = string.Concat(string.Format("Starting row count = {0}", countStart), nl);
retStrReport += string.Concat("==================================================", nl);
// Create a table with some rows.
//DataTable newCustomers = TheLocalDtToPush;
// Create the SqlBulkCopy object.
// Note that the column positions in the source DataTable
// match the column positions in the destination table so
// there is no need to map columns.
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
bulkCopy.DestinationTableName = TheOnlineSQLTableName;
try
{
// Write from the source to the destination.
for (int colIndex = 0; colIndex < TheLocalDtToPush.Columns.Count; colIndex++)
{
bulkCopy.ColumnMappings.Add(colIndex, colIndex);
}
bulkCopy.WriteToServer(TheLocalDtToPush);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
// Perform a final count on the destination
// table to see how many rows were added.
long countEnd = System.Convert.ToInt32(
commandRowCount.ExecuteScalar());
retStrReport += string.Concat("Ending row count = ", countEnd, nl);
retStrReport += string.Concat("==================================================", nl);
retStrReport += string.Concat((countEnd - countStart)," rows were added.", nl);
retStrReport += string.Concat("New Customers Was updated successfully", nl, "END OF PROCESS !");
//Console.ReadLine();
return retStrReport;
}
}
Trying it via a connection to SQL Server took around 7000 ms at best and ~7700 ms on average. A random key-value NoSQL database took around 40 seconds (I did not even keep records of it, as it was more than twice as slow as the SQL variants). So... is there a faster way than what I was testing in my code?
Edit
I am using Win7 x64 with 8 GB of RAM, and most importantly, I think, the CPU (an i5 at 3 GHz) is not so great by now.
The 3x 500 GB WD drives in RAID-0 do the job even better.
I am just saying: if you try this on your own PC, compare it against any other method in your configuration.
Have you tried SSIS? I have never written an SSIS package with a localdb connection, but this is the sort of activity SSIS should be well suited for.
If your data source is a SQL Server, another idea would be setting up a linked server. Not sure if this would work with localdb. If you can set up a linked server, you could bypass the C# altogether and load your data with an INSERT ... SELECT ... FROM ... SQL statement.
You can use Dapper.NET.
Dapper is a micro-ORM; it executes a query and maps the results to a strongly typed List (see the sketch below).
Object-relational mapping (ORM, O/RM, and O/R mapping) in computer software is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language
For more info:
check out https://code.google.com/p/dapper-dot-net/
GitHub Repository: https://github.com/SamSaffron/dapper-dot-net
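For the insert case in this question, a minimal Dapper sketch might look like the following (the Shipper POCO, the Shippers table and the method name are assumptions, and Dapper runs the statement once per list element, so it will not beat SqlBulkCopy for a million rows):
// Sketch: Dapper insert of a strongly typed list.
// Assumes "using Dapper;" and the Dapper NuGet package; Shipper/Shippers are hypothetical.
public class Shipper
{
    public string CompanyName { get; set; }
    public string Phone { get; set; }
}

public string InsertShippersWithDapper(string WebConfigConName)
{
    var shippers = Enumerable.Range(0, 1000000)
        .Select(i => new Shipper { CompanyName = "NewShipperCompanyName" + i, Phone = "NewShipperPhone" })
        .ToList();

    using (var connection = new SqlConnection(ConfigurationManager.ConnectionStrings[WebConfigConName].ConnectionString))
    {
        connection.Open();
        connection.Execute(
            "INSERT INTO Shippers (CompanyName, Phone) VALUES (@CompanyName, @Phone)",
            shippers); // executed once per element of the list
        return shippers.Count + " rows were inserted.";
    }
}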
Hope it helps.
Remove the looping... In SQL, try to build the 1 million rows as a table in one set-based statement (for example with a LEFT JOIN) and use that for an INSERT/SELECT of the data.
Try sending it without storing it in a DataTable.
See the example at the end of this post, which lets you do it with an enumerator: http://www.developerfusion.com/article/122498/using-sqlbulkcopy-for-high-performance-inserts/
If you are just creating nonsense data, create a stored procedure and just call that through .NET.
If you are passing real data, again passing it to a stored proc would be quicker, but you would be best off dropping the table and recreating it with the data.
If you insert one row at a time, it will take longer than inserting it all at once. It will take even longer if you have indexes to write.
Create a single XML document for all the rows you want to save into the database. Pass this XML to a SQL stored procedure and save all the records in one call.
The stored procedure must be written so that it can read the whole XML and then insert the rows into the table.
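A rough sketch of the C# side of that idea (the dbo.ImportShippersXml procedure, its @Rows parameter and the column names are hypothetical; the procedure itself would shred the XML with nodes()/value() and do a single INSERT ... SELECT):
// Sketch: serialize the rows to XML and pass them to a stored procedure in one call.
// Namespaces assumed: System.Data, System.Data.SqlClient, System.Xml.Linq, System.Linq.
var xml = new XElement("Rows",
    from DataRow r in NewDtForBlkInsert.Rows
    select new XElement("Row",
        new XElement("CompanyName", r["CompanyName"]),
        new XElement("Phone", r["Phone"])));

using (var connection = new SqlConnection(ConfigurationManager.ConnectionStrings[WebConfigConName].ConnectionString))
using (var command = new SqlCommand("dbo.ImportShippersXml", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add("@Rows", SqlDbType.Xml).Value = xml.ToString();
    connection.Open();
    command.ExecuteNonQuery();
}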
If this is a new project, I recommend you use Entity Framework. In this case you can create a List<> of objects with all the data you need and then simply add it entirely to the corresponding table.
This way you quickly get the needed data and then send it to the database in one go.
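A minimal sketch of that idea (the ShippersContext and Shipper entity are assumptions; note that for a full million rows EF's change tracking will still be much slower than SqlBulkCopy):
// Sketch: Entity Framework insert of a prepared list.
// ShippersContext and Shipper are hypothetical EF types.
var shippers = new List<Shipper>();
for (int i = 0; i < 1000000; i++)
{
    shippers.Add(new Shipper { CompanyName = "NewShipperCompanyName" + i, Phone = "NewShipperPhone" });
}

using (var context = new ShippersContext())
{
    context.Shippers.AddRange(shippers); // EF6 AddRange avoids per-entity DetectChanges
    context.SaveChanges();
}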
I agree with Mike on SSIS, but it may not suit your environment; however, for ETL processes that involve cross-server calls and general data flow, it is a great built-in tool and highly integrated.
With 1 million rows you will likely have to do a bulk insert. Depending on the row size, you would not really be able to use a stored procedure unless you did this in batches. A DataTable will fill memory pretty quickly, again depending on the row size. You could make a stored procedure that takes a table type and call it every X number of rows, but why do this when you already have a better, more scalable solution? That million rows could be 50 million next year.
I have used SSIS a bit, and if it is an organizational fit I would suggest looking at it; but as a one-time answer it wouldn't be worth the dependencies.
I'm building an offline C# application that will import data from spreadsheets and store it in a SQL database that I have created (inside the project). Through some research I have been able to use code that imports a static table into a database table whose layout exactly matches the columns in the worksheet.
What I'm looking to do is have specific columns go to their correct tables based on name. This way I have the database designed correctly and don't just have one giant table to store everything.
Below is the code I'm using to import a few static fields into one table, I want to be able to split the imported data into more than one.
What is the best way to do this?
public partial class Form1 : Form
{
string strConnection = ConfigurationManager.ConnectionStrings
["Test3.Properties.Settings.Test3ConnectionString"].ConnectionString;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
//Create connection string to Excel work book
string excelConnectionString =
#"Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=C:\Test.xls;
Extended Properties=""Excel 8.0;HDR=YES;""";
//Create Connection to Excel work book
OleDbConnection excelConnection = new OleDbConnection(excelConnectionString);
//Create OleDbCommand to fetch data from Excel
OleDbCommand cmd = new OleDbCommand
("Select [Failure_ID], [Failure_Name], [Failure_Date], [File_Name], [Report_Name], [Report_Description], [Error] from [Failures$]", excelConnection);
excelConnection.Open();
using (OleDbDataReader dReader = cmd.ExecuteReader())
using (SqlBulkCopy sqlBulk = new SqlBulkCopy(strConnection))
{
    sqlBulk.DestinationTableName = "Failures";
    sqlBulk.WriteToServer(dReader);
}
}
You can try an ETL (extract-transform-load) architecture:
Extract: One class will open the file and get all the data in chunks you know how to work with (usually you take a single row from the file and parse its data into a POCO object containing fields that hold pertinent data), and put those into a Queue that other work processes can take from. In this case, maybe the first thing you do is have Excel open the file and re-save it as a CSV, so you can reopen it as basic text in your process and chop it up efficiently. You can also read the column names and build a "mapping dictionary"; this column is named that, so it goes to this property of the data object. This process should happen as fast as possible, and the only reason it should fail is because the format of a row doesn't match what you're looking for given the structure of the file.
Transform: Once the file's contents have been extracted into an instance of a basic row, perform any validation, calculations or other business rules necessary to turn a row from the file into a set of domain objects that conform to your domain model. This process can be as complex as you need it to be, but again it should be as straightforward as you can make it while obeying all the business rules given in your requirements.
Load: Now you've got an object graph in your own domain objects, you can use the same persistence framework you'd call to handle domain objects created any other way. This could be basic ADO, an ORM like NHibernate or MSEF, or an Active Record pattern where objects know how to persist themselves. It's no bulk load, but it saves you having to implement a completely different persistence model just to get file-based data into the DB.
An ETL workflow can help you separate the repetitive tasks into simple units of work, and from there you can identify the tasks that take a lot of time and consider parallel processes.
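A bare-bones sketch of that extract/transform/load split (CustomerRow, Customer, SaveCustomer, the file path and the CSV layout are all placeholders, not part of your project; one way to wire the queue is a BlockingCollection):
// Sketch of the extract -> transform -> load pipeline described above.
// Namespaces assumed: System.Collections.Concurrent, System.IO, System.Linq, System.Threading.Tasks.
var rows = new BlockingCollection<CustomerRow>(boundedCapacity: 1000);

// Extract: parse each CSV line into a raw row object and queue it.
var extract = Task.Run(() =>
{
    foreach (var line in File.ReadLines(@"C:\import\customers.csv").Skip(1))
    {
        var parts = line.Split(',');
        rows.Add(new CustomerRow { Name = parts[0], PostCode = parts[1] });
    }
    rows.CompleteAdding();
});

// Transform + Load: validate each row, map it to a domain object, persist it.
var load = Task.Run(() =>
{
    foreach (var row in rows.GetConsumingEnumerable())
    {
        if (string.IsNullOrWhiteSpace(row.Name)) continue;  // example business rule
        var customer = new Customer { Name = row.Name.Trim(), PostCode = row.PostCode.Trim() };
        SaveCustomer(customer);  // whatever persistence framework you already use
    }
});

Task.WaitAll(extract, load);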
Alternately, you can take the file and massage its format by detecting the columns you want to work with, and arranging them into a format that matches your bulk input spec, before calling a bulk insert routine to process the data. This file-processor routine can do anything you want it to, including separating data into several files. However, it's one big process that works on a whole file at a time and has limited opportunities for optimization or parallel processing. That said, if your loading mechanism is slow, or you've got a LOT of data that is simple to digest, it may end up faster than even a well-designed ETL.
In any case, I would get away from an Office format and into a plain-text (or XML) format as soon as I possibly could, and I would DEFINITELY avoid having to install Office on a server. If there is ANY way you can require the files be in some easily-parseable format like CSV BEFORE they're loaded, so much the better. Having an Office installation on a server is a Really Bad Thing in general, and OLE operations in a server app is not much better. The app will be very brittle, and anything Office wants to tell you will cause the app to hang until you log onto the server and clear the dialog box.
If you were looking for a more code-related answer, you could use the following to modify your code to work with different column names / different tables:
private void button1_Click(object sender, EventArgs e)
{
//Create connection string to Excel work book
string excelConnectionString =
#"Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=C:\Test.xls;
Extended Properties=""Excel 8.0;HDR=YES;""";
//Create Connection to Excel work book
OleDbConnection excelConnection = new OleDbConnection(excelConnectionString);
//Create OleDbCommand to fetch data from Excel
OleDbCommand cmd = new OleDbCommand
("Select [Failure_ID], [Failure_Name], [Failure_Date], [File_Name], [Report_Name], [Report_Description], [Error] from [Failures$]", excelConnection);
excelConnection.Open();
DataTable dataTable = new DataTable();
dataTable.Columns.Add("Id", typeof(System.Int32));
dataTable.Columns.Add("Name", typeof(System.String));
// TODO: Complete other table columns
using (OleDbDataReader dReader = cmd.ExecuteReader())
{
    while (dReader.Read())
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["Id"] = dReader.GetInt32(0);
        dataRow["Name"] = dReader.GetString(1);
        // TODO: Complete other table columns
        dataTable.Rows.Add(dataRow);
    }
}
SqlBulkCopy sqlBulk = new SqlBulkCopy(strConnection);
sqlBulk.DestinationTableName = "Failures";
sqlBulk.WriteToServer(dataTable);
}
Now you can control the names of the columns and which tables the data gets imported into. SqlBulkCopy is good for inserting large amounts of data. If you only have a small number of rows, you might be better off creating a standard data access layer to insert your records.
If you are only interested in the text (not the formatting, etc.), you can alternatively save the Excel file as a CSV file and parse the CSV file instead; it's simple.
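A quick sketch of that approach (it assumes the exported CSV has a header row; TextFieldParser lives in the Microsoft.VisualBasic assembly and copes with quoted fields):
// Sketch: parse the exported CSV instead of reading the .xls via OLEDB.
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(@"C:\Test.csv"))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters(",");
    string[] headers = parser.ReadFields();  // header row, useful for mapping columns to tables
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        // route the fields to the appropriate tables here
    }
}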
Depending on the lifetime of the program, I would recommend one of two options.
If the program is to be short-lived in use, or is generally a "throw-away" project, I would recommend a series of routines which parse and input data into another set of tables using standard SQL with some string processing as needed.
If the program will stick around longer and/or find more use on a day-to-day basis, I would recommend implementing a solution similar to the one recommended by @KeithS. With a set of well-defined steps for working with the data, much flexibility is gained. More specifically, the .NET Entity Framework would probably be a great fit.
As a bonus, if you're not already well versed in this area, you might find you learn a great deal about working with data between boundaries (xls -> sql -> etc.) during your first stint with an ORM such as EF.
I have a data analysis application, and I need to be able to export database tables to a delimited text file using C#. Because of the application architecture, the data must be brought into the C# application; no database exporting functionality can be used. The tables can range from a few columns and a few hundred rows to ~100 columns and over a million rows.
Further clarification based on comments --
I have a Windows Service acting as the data access layer that will be getting the request for the export from the presentation layer. Once the export is complete, the service will then need to pass the export back to the presentation layer, which would either be a WPF app or a Silverlight app, as a stream object. The user will then be given an option to save or open the export.
What is the fastest way to do this?
Thanks
Hmm, first of all, if it's not a must to use C#, the SQL management console is capable of such a task.
To achieve the best performance I would use a producer-consumer concept with two threads:
One thread will be the reader, responsible for reading items from the DB - in which case I highly recommend using a data reader (IDataReader) to read the values - and putting them in a concurrent queue.
The other will be the writer, which will simply use a FileStream to write the data from the queue.
You can also achieve much greater performance by reading the information in a paged manner; that is, if you know you'll have 100,000 records, divide them into chunks of 1,000, have the reader read those chunks from the DB and put them in the queue.
Although the latter solution is more complicated, it will allow you to utilize your CPU in the best way possible and avoid latency.
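A minimal sketch of the two-thread idea (the connection string, table name, delimiter and file path are placeholders):
// Sketch: one thread streams rows out of the database as delimited lines,
// the other drains the queue into a file.
// Namespaces assumed: System.Collections.Concurrent, System.Data.SqlClient, System.IO, System.Threading.Tasks.
var lines = new BlockingCollection<string>(boundedCapacity: 10000);

var reader = Task.Run(() =>
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT * FROM dbo.SomeTable", connection))
    {
        connection.Open();
        using (var dr = command.ExecuteReader())
        {
            var fields = new object[dr.FieldCount];
            while (dr.Read())
            {
                dr.GetValues(fields);
                lines.Add(string.Join("|", fields));
            }
        }
    }
    lines.CompleteAdding();
});

var writer = Task.Run(() =>
{
    using (var sw = new StreamWriter(@"C:\export\SomeTable.txt"))
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            sw.WriteLine(line);
        }
    }
});

Task.WaitAll(reader, writer);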
For SQL Server: use BCP.
http://www.simple-talk.com/sql/database-administration/creating-csv-files-using-bcp-and-stored-procedures/
If you are using SQL Server 2008 (or maybe 2005), you can right-click the database and choose "Tasks->Export Data". Choose your database as input, and choose the "Flat file destination" as output. Specify the file name, specify double-quote as the text qualifier, click "next" a few times and you're done. You can even save the task as an SSIS package that you can run again.
Doing it this way uses SSIS under the covers. It has very high performance, as it uses multiple threads in a pipeline.
I would look at using the SqlBulkCopy object.
If you really need to use C#, the fastest way would be to use ADO.NET's DataReader; it is read-only and forward-only, which may suit you well. Just be careful with null fields, which it doesn't handle very well; if you need to deal with them, other ADO.NET resources may be more interesting for you.
If you need to just query the data quickly you can use 'Firehose' cursors in one or more threads and just read straight from the database.
var sqlConnection = new SqlConnection(ConfigurationManager.ConnectionStrings["connstr"].ToString());
var sqlDataAdapter = new SqlDataAdapter("select * from tnm_story_status", sqlConnection);
sqlConnection.Open();
var dataSet = new DataSet();
sqlDataAdapter.Fill(dataSet);
sqlConnection.Close();
var dataTable = dataSet.Tables[0];
using (var streamWriter = new StreamWriter(@"C:\db.txt", false))
{
    var sb = new StringBuilder();
    // write the header row
    for (var col = 0; col < dataTable.Columns.Count; col++)
    {
        if (sb.Length > 0) sb.Append(",");
        sb.Append(dataTable.Columns[col].ColumnName);
    }
    streamWriter.WriteLine(sb.ToString());
    sb.Clear();
    // write the data rows
    for (var row = 0; row < dataTable.Rows.Count; row++)
    {
        for (var col = 0; col < dataTable.Columns.Count; col++)
        {
            if (sb.Length > 0) sb.Append(",");
            sb.Append(dataTable.Rows[row][col].ToString());
        }
        streamWriter.WriteLine(sb.ToString());
        sb.Clear();
    }
}