Compare and get difference of 2 CSV files with C# - c#

I'm reading a CSV file a few times a day. It's about 300MB, and each time I have to read it through, compare it with the existing data in the database, add new rows, hide old ones and update existing ones. There is also a lot of data that doesn't change at all.
I have access to all the files, both old and new, and I'd like to compare the new one with the previous one and update only what has changed. I have no idea how to do this, and I'm using C# for all my work. The most problematic part may be that a row in the previous feed can sit at a different position in the new feed even if it hasn't been updated at all. I want to handle that case as well, if possible.
Any idea would help.

Use one of the existing CSV parsers.
Parse each row into a mapped class object.
Override Equals and GetHashCode for your object.
Keep a List<T> or HashSet<T> in memory. Initialize it with no contents at the start.
On reading each line from the CSV file, check whether the object exists in your in-memory collection (List, HashSet).
If the object doesn't exist in your in-memory collection, add it to the collection and insert it into the database.
If the object exists in your in-memory collection, ignore it. The check relies on your Equals and GetHashCode implementation, so it is as simple as if (inMemoryCollection.Contains(currentRowObject)).
I guess you have a Windows service reading CSV files periodically from a file location. You can repeat the above process every time you read a new CSV file. This way you maintain an in-memory collection of the previously inserted objects and ignore them, irrespective of their position in the CSV file.
If you have a primary key defined for your data, you can use a Dictionary<TKey, T> whose key is the unique field. Lookups by key are faster, and you can skip the Equals and GetHashCode implementation.
As a backup to this process, your DB writing routine or stored procedure should first check whether the record already exists in the table; if it does, UPDATE it, otherwise INSERT a new record. This is an UPSERT.
Remember, if you end up maintaining an in-memory collection, clear it periodically, otherwise you could end up with an out-of-memory exception.
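The keyed-dictionary idea above can be sketched as follows. This is only a minimal illustration under stated assumptions: FeedRow, FeedDiff and FeedComparer are hypothetical names, each row is assumed to have a unique Id, and the remaining columns are assumed to be comparable as one Payload string. It needs C# 9 or later for records. Because rows are matched by key, their position in either file does not matter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row shape: a unique key plus the remaining columns as one string.
public sealed record FeedRow(string Id, string Payload);

public sealed class FeedDiff
{
    public List<FeedRow> Added { get; } = new();
    public List<FeedRow> Changed { get; } = new();
    public List<string> RemovedIds { get; } = new();
}

public static class FeedComparer
{
    // Compares two snapshots by key, so row order in either file is irrelevant.
    public static FeedDiff Compare(IEnumerable<FeedRow> previous, IEnumerable<FeedRow> current)
    {
        // Index the previous snapshot by key; the last occurrence wins if a key repeats.
        var old = new Dictionary<string, string>();
        foreach (var row in previous)
            old[row.Id] = row.Payload;

        var diff = new FeedDiff();
        var seen = new HashSet<string>();
        foreach (var row in current)
        {
            if (!seen.Add(row.Id))
                continue; // duplicate key in the new file: keep the first occurrence only

            if (!old.TryGetValue(row.Id, out var oldPayload))
                diff.Added.Add(row);      // key not in the previous file
            else if (!string.Equals(oldPayload, row.Payload, StringComparison.Ordinal))
                diff.Changed.Add(row);    // same key, different content
        }

        // Keys that were in the previous file but never appeared in the new one.
        foreach (var id in old.Keys)
            if (!seen.Contains(id))
                diff.RemovedIds.Add(id);

        return diff;
    }
}
```

Added rows would then be inserted, Changed rows updated and RemovedIds hidden. For a 300MB file the dictionary holds the whole previous snapshot in memory; storing a hash of each payload instead of the payload itself is one way to reduce that.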

Just curious, why do you have to compare the old file with the new file? Isn't the data from the old file in SQL Server already? (When you say database, you mean SQL Server, right? I'm assuming SQL Server because you use C# and .NET.)
My approach is simple:
Load the new CSV file into a staging table.
Use stored procs to insert, update, and set rows inactive.
// Required namespaces
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static void ProcessCSV(FileInfo file)
{
    foreach (string line in ReturnLines(file))
    {
        //break the lines up and parse the values into parameters
        //naive split: use a real CSV parser if fields can contain quoted commas
        string[] fields = line.Split(',');
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand command = conn.CreateCommand())
        {
            command.CommandType = CommandType.StoredProcedure;
            command.CommandText = "[dbo].sp_InsertToStaging";
            //some values from the line; the column positions 0 and 1 are placeholders
            command.Parameters.Add("@id", SqlDbType.BigInt).Value = long.Parse(fields[0]);
            command.Parameters.Add("@SomethingElse", SqlDbType.VarChar).Value = fields[1];
            //execute
            if (conn.State != ConnectionState.Open)
                conn.Open();
            try
            {
                command.ExecuteNonQuery();
            }
            catch (SqlException exc)
            {
                //throw or do something
            }
        }
    }
}
public static IEnumerable<string> ReturnLines(FileInfo file)
{
    // Stream the file line by line so the whole 300MB is never held in memory
    using (FileStream stream = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Now you write stored procs to insert, update, set inactive fields based on Ids. You'll know if a row is updated if Field_x(main_table) != Field_x(staging_table) for a particular Id, and so on.
Here's how you detect changes and updates between your main table and staging table.
/* SECTION: SET INACTIVE */
UPDATE main_table
SET IsActiveTag = 0
WHERE unique_identifier IN
(
    SELECT a.unique_identifier
    FROM main_table AS a
    INNER JOIN staging_table AS b
        --inner join because you only want existing records
        ON a.unique_identifier = b.unique_identifier
    --detect any updates
    WHERE a.field1 <> b.field1
       OR a.field2 <> b.field2
       OR a.field3 <> b.field3
    --etc
)
/* SECTION: INSERT UPDATED AND NEW */
INSERT INTO main_table
--take staging columns only (assumes staging_table has the same columns as main_table)
SELECT b.*
FROM staging_table AS b
LEFT JOIN
    (SELECT *
     FROM main_table
     --only get active records
     WHERE IsActiveTag = 1) AS a
    ON b.unique_identifier = a.unique_identifier
--select only staging rows that have no active match in the main table
WHERE a.unique_identifier IS NULL

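How big is the CSV file? If it's small, try the following:
string[] File1Lines = File.ReadAllLines(pathOfFileA);
string[] File2Lines = File.ReadAllLines(pathOfFileB);
List<string> NewLines = new List<string>();
// Compare line by line; stop at the shorter file to avoid an out-of-range index
int commonLength = Math.Min(File1Lines.Length, File2Lines.Length);
for (int lineNo = 0; lineNo < commonLength; lineNo++)
{
    if (!String.IsNullOrEmpty(File1Lines[lineNo]) &&
        !String.IsNullOrEmpty(File2Lines[lineNo]))
    {
        // Both lines have content: keep the new line only if it differs
        if (String.Compare(File1Lines[lineNo], File2Lines[lineNo]) != 0)
            NewLines.Add(File2Lines[lineNo]);
    }
    else if (!String.IsNullOrEmpty(File1Lines[lineNo]))
    {
        // The line has content only in the old file: nothing to add
    }
    else
    {
        // The old line is empty: take whatever the new file has
        NewLines.Add(File2Lines[lineNo]);
    }
}
if (NewLines.Count > 0)
{
    File.WriteAllLines(newfilepath, NewLines);
}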
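The loop above compares lines position by position, so a row that merely moved shows up as a difference. A set-based variant avoids that. The sketch below is an illustration rather than part of the answer above: LineSetDiff and AddedOrChanged are hypothetical names, and it treats each whole line as the unit of comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineSetDiff
{
    // Returns the non-empty lines of newLines that do not occur anywhere in oldLines.
    // Position is ignored, so reordered but unchanged rows are not reported.
    public static List<string> AddedOrChanged(IEnumerable<string> oldLines, IEnumerable<string> newLines)
    {
        var known = new HashSet<string>(oldLines, StringComparer.Ordinal);
        return newLines
            .Where(line => !string.IsNullOrEmpty(line) && !known.Contains(line))
            .ToList();
    }
}
```

It could be called as LineSetDiff.AddedOrChanged(File.ReadLines(pathOfFileA), File.ReadLines(pathOfFileB)). File.ReadLines streams the new file, but the set still holds every line of the old file in memory, so this also suits smaller files best.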

Related

How do I batch 1000 inserts in the given loop scenario?

ORIGINAL QUESTION:
I have some code which looks like this:
for (int i = start_i; i <= i_s; i++)
{
var json2 = JObject.Parse(RequestServer("query_2", new List<JToken>(){json1["result"]}));
foreach (var data_1 in json2["result"]["data_1"])
{
var json3 = JObject.Parse(RequestServer("query_3", new List<JToken>(){data_1, 1}));
foreach (var data_2 in json3["result"]["data_2"])
{
var data_1 = data_2["id"];
var index = data_2["other"];
}
foreach (var other in json3["result"]["other"])
{
var data_3_1 = other["data_3"]["data_3_1"];
var data_4 = other["data_4"];
var data_5 = other["data_5"];
foreach (var data_3_1 in other["data_3"]["data_3_1"])
{
//Console.WriteLine(data_3_1); <- very fast
insert_data((string)data_3_1); // <- very slow
}
}
}
}
This code was able to generate about 5000 WriteLines in less than a minute. However, I now want to insert that data into a database. When I try to do that, the code now takes much much longer to get through the 5000 sets of data.
My question is: how do I batch the database inserts into about 1000 inserts at a time, instead of doing one at a time? I have tried creating the insert statement using a StringBuilder, which is fine; what I can't figure out is how to generate 1000 at a time. I have tried using for loops up to 1000, and then breaking out of the foreach loop before starting the next 1000, but it just makes a big mess.
I have looked at questions like this example, but they are no good for my loop scenario. I know how to do bulk inserts at the SQL level; I just can't figure out how to generate the bulk SQL inserts from the very specific nested loops in the example code above.
The 5000 records was just a test run. The end code will have to deal with millions, if not billions of inserts. Based on rough calculations, the end result will use about 500GB of drive space when inserted into a database, so I will need to batch an optimum amount into RAM before inserting into the database.
UPDATE 1:
This is what happens in insert_data:
public static string insert_data(string data_3_1)
{
    string str_conn = @"server=localhost;port=3306;uid=username;password=password;database=database";
    MySqlConnection conn = null;
    conn = new MySqlConnection(str_conn);
    conn.Open();
    MySqlCommand cmd = new MySqlCommand();
    cmd.Connection = conn;
    cmd.CommandText = "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)";
    cmd.Prepare();
    cmd.Parameters.AddWithValue("@data_3_1", data_3_1);
    cmd.ExecuteNonQuery();
    cmd.Parameters.Clear();
    return null;
}
You're correct that doing bulk inserts in batches can be a big throughput win. Here's why it's a win: When you do INSERT operations one at a time, the database server does an implicit COMMIT operation after every insert, and that can be slow. So, if you can wrap every hundred or so INSERTs in a single transaction, you'll reduce that overhead.
Here's an outline of how to do that. I'll try to put it in the context of your code, but you didn't show your MySQLConnection object or query objects, so this solution of mine necessarily will be incomplete.
var batchSize = 100;
var batchCounter = batchSize;
var beginBatch = new MySqlCommand("START TRANSACTION;", conn);
var endBatch = new MySqlCommand("COMMIT;", conn);
beginBatch.ExecuteNonQuery();
for (int i = start_i; i <= i_s; i++)
{
    ....
    foreach (var data_1 in json2["result"]["data_1"])
    {
        ...
        foreach (var other in json3["result"]["other"])
        {
            ...
            foreach (var data_3_1 in other["data_3"]["data_3_1"])
            {
                //Console.WriteLine(data_3_1); <- very fast
                /****************** batch handling **********************/
                if (--batchCounter <= 0)
                {
                    /* commit one batch, start the next */
                    endBatch.ExecuteNonQuery();
                    beginBatch.ExecuteNonQuery();
                    batchCounter = batchSize;
                }
                insert_data((string)data_3_1); // <- very slow
            }
        }
    }
}
/* commit the last batch. It's OK if it contains no records */
endBatch.ExecuteNonQuery();
If you want, you can try different values of batchSize to find a good value. But generally something like the 100 I suggest works well.
Batch sizes of 1000 are also OK. But the larger each transaction gets, the more server RAM it uses before it's committed, and the longer it might block other programs using the same MySQL server.
There's a nice and popular extension called MoreLinq that offers an extension method called Batch(int batchSize). To get an IEnumerable containing up to 1000 elements:
foreach (var upTo1000 in other["data_3"]["data_3_1"].Batch(1000))
{
// Build a query using the (up to) 1000 elements in upTo1000
}
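If adding MoreLinq is not an option, the batching itself is small enough to write by hand. The following is a rough sketch, not a definitive implementation: InsertBatcher, InBatches and BuildInsertSql are hypothetical names, the table and column names are taken from the question's insert_data, and the @p0, @p1, ... parameter names are an assumption. The idea is to flatten the nested loops into one sequence of values (for example with an iterator method that does yield return for each data_3_1), split that sequence into batches, and send one multi-row INSERT per batch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class InsertBatcher
{
    // Splits any sequence into lists of at most batchSize items, preserving order.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        // Emit the final, possibly shorter batch.
        if (batch.Count > 0) yield return batch;
    }

    // Builds "INSERT ... VALUES (@p0), (@p1), ..." with one placeholder per row.
    public static string BuildInsertSql(int rowCount)
    {
        var sql = new StringBuilder("INSERT INTO database_table (data_3_1) VALUES ");
        for (int i = 0; i < rowCount; i++)
        {
            if (i > 0) sql.Append(", ");
            sql.Append("(@p").Append(i).Append(')');
        }
        return sql.ToString();
    }
}
```

For each batch you would create one MySqlCommand with BuildInsertSql(batch.Count) on a single open connection, add the values as @p0 through @p(n-1) with Parameters.AddWithValue, and call ExecuteNonQuery once. Keeping values in parameters rather than concatenating them into the SQL text avoids quoting problems.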
The best approach for me was using LOAD DATA LOCAL INFILE statement. To make it work first you have to turn ON MySQL server parameter local_infile.
I used mysql2 package for NodeJS and query function:
db.query({
sql: "LOAD DATA LOCAL INFILE .......",
infileStreamFactory: <readable stream which provides your data in flat file format>
}, function(err, results) {....});
The trick is to provide the readable stream properly. By default, LOAD DATA expects a tab-delimited text file. LOAD DATA also expects a file name; in your case, if you provide a stream, the file name can be an arbitrary string.

Backup SQL Server Schema With Data

I've been tasked with creating a backup of the data in our "default schema" database dbo to the same database using a new schema called dbobackup.
I honestly do not understand what this means as far as a database goes. Apparently, it is like having a database backup inside the existing database. I guess there is some advantage to doing that.
Anyway, I can't seem to find anywhere online that will allow me to do this.
I have found a few posts on here about copying the schema without data, but I need the data too.
Backup SQL Schema Only?
How do I check to see if a schema exists, delete it if it does, and then create a schema that accepts data in the current database?
Once I have the new schema created, can I dump data in there with a simple command like this?
SELECT * INTO [dbobackup].Table1 FROM [dbo].Table1;
That line only backs up one table, though. If I need to do this to 245 tables for this particular customer, I'd need a script.
We have several customers, too, and their databases are not structured identically.
Could I do something along these lines?
I was thinking about creating a small console program to walk through the tables.
How would I modify something like the code below to do what I want?
public static void Backup(string sqlConnection)
{
using (var conn = new SqlConnection(sqlConnection))
{
conn.Open();
var tables = new List<String>();
var sqlSelectTables = "SELECT TableName FROM [dbo];";
using (var cmd = new SqlCommand(sqlSelectTables, conn))
{
using (var r = cmd.ExecuteReader())
{
while (r.Read())
{
var item = String.Format("{0}", r["TableName"]).Trim();
tables.Add(item);
}
}
}
var fmtSelectInto = "SELECT * INTO [dbobackup].{0} FROM [dbo].{0}; ";
using (var cmd = new SqlCommand(null, conn))
{
foreach (var item in tables)
{
cmd.CommandText = String.Format(fmtSelectInto, item);
cmd.ExecuteNonQuery();
}
}
}
}
SQL Server already has this built in. If you open SQL Server Management Studio and right click on the database you want to back up, then select all tasks then backup, you will get an option to back up your database into an existing database.
This is the important part and why you should use the built in functionality: You must copy the data from one DB to the other DB in the correct order or you'll get foreign key errors all over the place. If you have a lot of data tables with a lot of relationships, this will really be hard to nail down on your own. You could write code to make a complete graph of all of the dependencies and then figure out what order to copy the table data (which is essentially what SQL Server already does).
Additionally, there are third-party programs available to do this type of backup as well (see: Google).
This is sort of a "work in progress" approach I got started with that looks promising:
public static void CopyTable(
string databaseName, // i.e. Northwind
string tableName, // i.e. Employees
string schema1, // i.e. dbo
string schema2, // i.e. dboarchive
SqlConnection sqlConn)
{
var conn = new Microsoft.SqlServer.Management.Common.ServerConnection(sqlConn);
var server = new Microsoft.SqlServer.Management.Smo.Server(conn);
var db = new Microsoft.SqlServer.Management.Smo.Database(server, databaseName);
db.Tables.Refresh();
for (var itemId = 0; itemId < db.Tables.Count; itemId++)
{
var table = db.Tables.ItemById(itemId);
if (table.Name == tableName)
{
table.Schema = String.Format("{0}", DatabaseSchema.dboarchive);
table.Create();
}
}
}
The only issue I am currently running into is that my db variable always comes back with Tables.Count == 0.
If I get a chance to fix this, I will update.
For now, I've been told to remove this piece of code and check my code in.
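Coming back to the console-program idea from the question, one possible way to generate the per-table statements is sketched below. Treat it as an illustration under assumptions: SchemaBackupScript and its members are hypothetical names, the table list is expected to come from running ListTablesSql with a @schema parameter, and the generated SELECT ... INTO statements copy columns and data only, not keys, indexes or constraints.

```csharp
using System;
using System.Collections.Generic;

public static class SchemaBackupScript
{
    // Query that lists the user tables of one schema; pass the schema name as @schema.
    public const string ListTablesSql =
        "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES " +
        "WHERE TABLE_SCHEMA = @schema AND TABLE_TYPE = 'BASE TABLE';";

    // Wraps a name in brackets, doubling any ']' so unusual table names stay valid.
    public static string Quote(string name) => "[" + name.Replace("]", "]]") + "]";

    // CREATE SCHEMA must be alone in its batch, hence the EXEC wrapper.
    public static string CreateSchemaIfMissing(string schema) =>
        "IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE name = N'" + schema.Replace("'", "''") + "') " +
        "EXEC('CREATE SCHEMA " + Quote(schema).Replace("'", "''") + "');";

    // For each table: drop a previous backup copy if present, then copy the data again.
    public static IEnumerable<string> CopyStatements(IEnumerable<string> tables, string from, string to)
    {
        foreach (var table in tables)
        {
            var source = Quote(from) + "." + Quote(table);
            var target = Quote(to) + "." + Quote(table);
            yield return "IF OBJECT_ID(N'" + target.Replace("'", "''") + "', N'U') IS NOT NULL DROP TABLE " + target + ";";
            yield return "SELECT * INTO " + target + " FROM " + source + ";";
        }
    }
}
```

Each generated string can be run with SqlCommand.ExecuteNonQuery, much like the loop in the question. Dropping the old backup tables one by one sidesteps dropping the schema itself, which SQL Server refuses to do while the schema still contains objects.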

Inserting to SQL performance issues

I wrote a program some time ago that reads in and delimits pretty big text files. The program works, but the problem is that it basically freezes the computer and takes a long time to finish. On average each text file has around 10K to 15K lines, and each line represents a new row in a SQL table.
The way my program works is that I first read all of the lines (this is where the delimiting happens) and store them in an array; after that I go through each array element and insert it into the SQL table. This is all done at once, and I suspect it is eating up too much memory, which is causing the program to freeze the computer.
Here is my code for reading file:
private void readFile()
{
//String that will hold each line read from the file
String line;
//Instantiate new stream reader
System.IO.StreamReader file = new System.IO.StreamReader(txtFilePath.Text);
try
{
while (!file.EndOfStream)
{
line = file.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
{
if (this.meetsCondition(line))
{
badLines++;
continue;
} // end if
else
{
collection.readIn(line);
counter++;
} // end else
} // end if
} // end while
file.Close();
} // end try
catch (Exception exceptionError)
{
//Placeholder
}
Code for inserting:
for (int i = 0; i < counter; i++)
{
    //Iterates through the collection array starting at first index and going through until the end
    //and inserting each element into our SQL Table
    //if (!idS.Contains(collection.getIdItems(i)))
    //{
    da.InsertCommand.Parameters["@Id"].Value = collection.getIdItems(i);
    da.InsertCommand.Parameters["@Date"].Value = collection.getDateItems(i);
    da.InsertCommand.Parameters["@Time"].Value = collection.getTimeItems(i);
    da.InsertCommand.Parameters["@Question"].Value = collection.getQuestionItems(i);
    da.InsertCommand.Parameters["@Details"].Value = collection.getDetailsItems(i);
    da.InsertCommand.Parameters["@Answer"].Value = collection.getAnswerItems(i);
    da.InsertCommand.Parameters["@Notes"].Value = collection.getNotesItems(i);
    da.InsertCommand.Parameters["@EnteredBy"].Value = collection.getEnteredByItems(i);
    da.InsertCommand.Parameters["@WhereReceived"].Value = collection.getWhereItems(i);
    da.InsertCommand.Parameters["@QuestionType"].Value = collection.getQuestionTypeItems(i);
    da.InsertCommand.Parameters["@AnswerMethod"].Value = collection.getAnswerMethodItems(i);
    da.InsertCommand.Parameters["@TransactionDuration"].Value = collection.getTransactionItems(i);
    da.InsertCommand.ExecuteNonQuery();
    //}
    //Updates the progress bar using the i in addition to 1
    _worker.ReportProgress(i + 1);
} // end for
If you can map your collection to a DataTable then you could use an SqlBulkCopy to import your data. SqlBulkCopy is the fastest way to import data from .Net into SqlServer.
Use SqlBulkCopy class for bulk inserts.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
You will cut down the time to mere seconds.
+1 for SqlBulkCopy as others have stated, but be aware that it requires INSERT permission. If you work in a strictly controlled environment, as I do, where you aren't allowed to use dynamic SQL an alternative approach is to have your stored proc use Table-Valued parameters. That way you can still pass in chunks of records and have the proc do the actual inserting.
As an example of how to use the functionality of the SqlBulkCopy class (this is just pseudocode to convey the idea):
First, change your collection class to host an internal DataTable, and in the constructor define the schema used by your readIn method.
public class MyCollection
{
    private DataTable loadedData = null;
    public MyCollection()
    {
        loadedData = new DataTable();
        loadedData.Columns.Add("Column1", typeof(string));
        // .... and so on for every field expected
    }
    // A property to return the collected data
    public DataTable GetData
    {
        get { return loadedData; }
    }
    public void readIn(string line)
    {
        // split the line into fields ('|' is a placeholder; use your file's actual delimiter)
        string[] splittedLine = line.Split('|');
        DataRow r = loadedData.NewRow();
        r["Column1"] = splittedLine[0];
        // .... and so on
        loadedData.Rows.Add(r);
    }
}
Finally, the code that uploads the data to your server:
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "destinationTable";
        try
        {
            // GetData is a property, so no parentheses
            bulkCopy.WriteToServer(collection.GetData);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
As mentioned, using SqlBulkCopy will be faster than inserting one-by-one, but there are other things that you could look at:
Is there a clustered index on the table? If so, will you be inserting rows with values in the middle of that index? It's much more efficient to add values at the end of a clustered index, since otherwise the server has to rearrange data to insert in the middle (this applies only to CLUSTERED indexes). One example I've seen used an SSN as the clustered primary key. Since SSNs are distributed randomly, you rearrange the physical structure on virtually every insert. Having a date as part of the clustered key may be OK if you are MOSTLY inserting data at the end (e.g. adding daily records).
Are there a lot of indexes on that table? It may be more efficient to drop the indexes, add the data, and re-create the indexes after the inserts (or just drop the indexes you don't need).

Continuously loop thru a database table until record found, then delete record.

I have this method that has to scan a database table, Announce, continuously until a new record appears. It compares the new record to a record from another table and, if it matches, deletes it from the Announce table; it then continues to search until another record appears. Is there a better way of doing this than using a while(true) statement? Note: I am using SQL Server.
//Begin method
public void Begin()
{
string announce;
double announceID;
try
{
using (SqlConnection connStr = new SqlConnection(ConfigurationManager.ConnectionStrings["AnnounceConnString"].ConnectionString))
{
while (true)
{
//Selects Last record written to tblAnnounce
SqlCommand sqlcommandStart = new SqlCommand("AnnounceSelect", connStr);
sqlcommandStart.CommandType = CommandType.StoredProcedure;
connStr.Open();
SqlDataReader dr = sqlcommandStart.ExecuteReader();
if (dr.HasRows)
{
while (dr.Read())
{
announce = dr["AnnounceID"].ToString();
announceID = Convert.ToDouble(announce);
//Compares Values
//if it matches then DELETE record from TblAnnounce
}
connStr.Close();
}
else
{
connStr.Close();
}
dr.Close();
}
}
}
catch(Exception ex)
{
string exception = ex.Message;
MessageBox.Show(exception);
}
}
Rather than checking continuously for an inserted record, you can handle this at insert time: check the other table before inserting, and if a matching record exists, skip the insert.
Or you can use an insert trigger on this table at the database level to delete the record when a matching record is found.
From C# code you can do this using CLR triggers. Check the sample at the end of the MSDN page.
EDIT
As per your new comments, you are not inserting data; you want to compare and delete records. You can do it as below. Change the SQL query as you need.
using (var sc = new SqlConnection(ConnectionString))
using (var cmd = sc.CreateCommand())
{
sc.Open();
cmd.CommandText = "delete from TableB where OtherID in (select distinct ID from tableA)";
cmd.ExecuteNonQuery();
}
What are you actually trying to do here? If you are accepting values into the announce and deleting them if they already exist then a trigger is fine. However, you can also write a query or view to just select the rows where there is or is not a matching row. Both these ways mean you don't actually have to constantly monitor the table at all.
Another way is to consider using a dependency
http://msdn.microsoft.com/en-us/library/62xk7953.aspx
Alternatively, depending on your situation, a queue that delivers the message into your announce table might be a better model than a constantly scanning process, because you can run the logic as each new message arrives.
http://msdn.microsoft.com/en-us/library/ms345108(v=sql.90).aspx
I think the dependency will be the most suited to you from what you've written so far.
You can use an inner join in a DELETE statement in T-SQL:
DELETE FROM tbl1
FROM table1 AS tbl1
INNER JOIN table2 AS tbl2 ON tbl1.Id = tbl2.Id

how to compare elements in a string with the database table values

In my project I have to take a string input from a text field and fill a database table with these values. I should first check the values of a specific table column, and add the input string only if it is not already in the table.
I tried to convert the table values to a string array, but it wasn't possible.
If anyone has an idea about this, your reply would be really valuable.
Thanks in advance.
Since you say your strings in the database table must be unique, just put a unique index on that field and let the database handle the problem.
CREATE UNIQUE INDEX UIX_YourTableName_YourFieldName
ON dbo.YourTableName(YourFieldName)
Whenever you try to insert another row with the same string, SQL Server (or any other decent RDBMS) will raise an error and not insert the value. Problem solved.
If you need to handle the error on the front-end GUI already, you'll need to load the existing entries from your database, using whatever technology you're familiar with, e.g. in ADO.NET (C#, SQL Server) you could do something like:
public List<string> FindExistingValues()
{
    List<string> results = new List<string>();
    string getStringsCmd = "SELECT (YourFieldName) FROM dbo.YourTableName";
    using (SqlConnection _con = new SqlConnection("your connection string here"))
    using (SqlCommand _cmd = new SqlCommand(getStringsCmd, _con))
    {
        _con.Open();
        // ExecuteReader belongs to the command, not the connection
        using (SqlDataReader rdr = _cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                results.Add(rdr.GetString(0));
            }
            rdr.Close();
        }
        _con.Close();
    }
    return results;
}
You would get back a List<string> from that method and then you could check in your UI whether a given string already exists in the list:
List<string> existing = FindExistingValues();
if(!existing.Contains(yournewstring))
{
// store the new value to the database
}
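If the list of existing values grows large, List<string>.Contains scans every entry on each check. A HashSet<string> gives a constant-time lookup instead. The sketch below is only an illustration: KnownValues and TryAdd are hypothetical names, and the comparer you pass in is an assumption that should mirror how your database column compares strings (for example, case-insensitively).

```csharp
using System;
using System.Collections.Generic;

public sealed class KnownValues
{
    private readonly HashSet<string> values;

    // The comparer is an assumption: choose one that matches your column's comparison rules.
    public KnownValues(IEnumerable<string> existing, StringComparer comparer)
    {
        values = new HashSet<string>(existing, comparer);
    }

    // Returns true when the value was not known yet and has now been recorded,
    // false when an equal value already exists.
    public bool TryAdd(string candidate) => values.Add(candidate);
}
```

You could build it once from FindExistingValues() and store a new string in the database only when TryAdd returns true. The unique index remains the real safeguard, since another user may insert the same value between your check and your insert.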
Or, as a third option, you could write a stored procedure that handles the storing of your new string. Inside it, first check whether the string already exists in the database (here @NewValue stands for the procedure's parameter holding your new string)
IF NOT EXISTS(SELECT * FROM dbo.YourTableName WHERE YourFieldName = @NewValue)
INSERT INTO dbo.YourTableName(YourFieldName) VALUES(@NewValue)
and if not, insert it. You'll just need a strategy for the cases where the new string being passed in already existed (ignore it, or report back an error of some sort).
Lots of options - up to you which one works best in your scenario!
