Delete Duplicate records from large csv file C# .Net


I have created a solution that reads a large CSV file, currently 20-30 MB in size. I tried to delete the duplicate rows, based on certain column values that the user chooses at run time, using the usual technique for finding duplicate rows, but it is so slow that the program seems not to be working at all.
What other technique can be applied to remove duplicate records from a CSV file?
Here's the code; I am definitely doing something wrong:
DataTable dtCSV = ReadCsv(file, columns);
// columns is a list of strings: List<string> columns
DataTable dt=RemoveDuplicateRecords(dtCSV, columns);
private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
{
DataView dv = dtCSV.DefaultView;
string RowFilter=string.Empty;
if(dt==null)
dt = dv.ToTable().Clone();
DataRow row = dtCSV.Rows[0];
foreach (DataRow row in dtCSV.Rows)
{
try
{
RowFilter = string.Empty;
foreach (string column in columns)
{
string col = column;
RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'","''") + "' and ";
}
RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
dv.RowFilter = RowFilter;
DataRow dr = dt.NewRow();
bool result = RowExists(dt, RowFilter);
if (!result)
{
dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
dt.Rows.Add(dr);
}
}
catch (Exception ex)
{
}
}
return dt;
}

One way to do this would be to go through the table, building a HashSet<string> that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. Something like:
HashSet<string> ScannedRecords = new HashSet<string>();
foreach (DataRow row in dtCSV.Rows) // DataRowCollection is non-generic, so name the type explicitly
{
// Build a string that contains the combined column values
StringBuilder sb = new StringBuilder();
foreach (string col in columns)
{
sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
}
// Try to add the string to the HashSet.
// If Add returns false, then there is a prior record with the same values
if (!ScannedRecords.Add(sb.ToString()))
{
// This record is a duplicate.
}
}
That should be very fast.
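Applied to the method from the question, the idea might look like the sketch below. The method name and signature are taken from the question; the class name, the sample data, and the use of the unit-separator character as a delimiter between values are assumptions made for illustration. It keeps the first row seen for each combination of key values.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Text;

static class CsvDeduplicator
{
    // Returns a new table holding the first row seen for each distinct
    // combination of values in the given key columns.
    public static DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
    {
        DataTable result = dtCSV.Clone();   // same schema, no rows
        var seen = new HashSet<string>();
        var key = new StringBuilder();

        foreach (DataRow row in dtCSV.Rows)
        {
            key.Clear();
            foreach (string col in columns)
            {
                // '\u001F' (unit separator) is an assumed delimiter that is
                // unlikely to occur inside CSV data.
                key.Append(row[col]).Append('\u001F');
            }

            // Add returns false when the key is already in the set.
            if (seen.Add(key.ToString()))
            {
                result.ImportRow(row);
            }
        }
        return result;
    }

    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Name");
        table.Columns.Add("City");
        table.Rows.Add("Ann", "Oslo");
        table.Rows.Add("Bob", "Rome");
        table.Rows.Add("Ann", "Oslo");

        DataTable unique = RemoveDuplicateRecords(table, new List<string> { "Name", "City" });
        Console.WriteLine(unique.Rows.Count);   // prints 2
    }
}
```

Each row is visited once and each set lookup takes roughly constant time, so the work grows linearly with the number of rows instead of quadratically as with a per-row RowFilter.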

If you've implemented your de-duplication routine as a couple of nested for or foreach loops, you could optimise it by sorting the data by the columns you wish to de-duplicate against and then simply comparing each row to the last row you looked at.
Posting some code is a sure-fire way to get better answers, though; without an idea of how you've implemented it, anything you get will just be conjecture.
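A rough sketch of the sort-then-compare idea follows. The class and method names, the sample data, and the unit-separator delimiter are assumptions for illustration only. Note that the result comes back in sorted order rather than in the original file order.

```csharp
using System;
using System.Data;
using System.Linq;

static class SortedDeduplicator
{
    public static DataTable RemoveDuplicatesBySorting(DataTable source, string[] columns)
    {
        DataTable result = source.Clone();   // same schema, no rows

        // Pair every row with a key built from the chosen columns,
        // then sort by that key so that duplicates become neighbours.
        var sorted = source.Rows.Cast<DataRow>()
            .Select(r => new
            {
                Row = r,
                Key = string.Join("\u001F", columns.Select(c => r[c].ToString()))
            })
            .OrderBy(x => x.Key, StringComparer.Ordinal);

        string previousKey = null;
        foreach (var item in sorted)
        {
            // Only the previous key has to be remembered.
            if (item.Key != previousKey)
            {
                result.ImportRow(item.Row);
            }
            previousKey = item.Key;
        }
        return result;
    }

    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Code");
        table.Columns.Add("Qty");
        table.Rows.Add("b", "1");
        table.Rows.Add("a", "1");
        table.Rows.Add("b", "1");
        table.Rows.Add("a", "2");

        DataTable unique = RemoveDuplicatesBySorting(table, new[] { "Code", "Qty" });
        Console.WriteLine(unique.Rows.Count);   // prints 3
    }
}
```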

Have you tried wrapping the rows in a class and using LINQ?
LINQ gives you options such as Distinct for getting unique values.

You're currently creating a string-defined filter condition for each and every row and then running that against the entire table - that is going to be slow.
Much better to take a Linq2Objects approach where you read each row in turn into an instance of a class and then use the Linq Distinct operator to select only unique objects (non-uniques will be thrown away).
The code would look something like:
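// inputCSV.Rows stands for a sequence of row objects; the row class must
// implement Equals and GetHashCode for Distinct to compare by value
var uniqueRows = (from row in inputCSV.Rows
                  select row).Distinct();
If you don't know the fields your CSV file is going to have, then you may have to modify this slightly, possibly by using an object that reads the CSV cells into a List or Dictionary for each row.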
For reading objects from file using Linq, this article by someone-or-other might help - http://www.developerfusion.com/article/84468/linq-to-log-files/
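When the columns are only known at run time, one possible LINQ to Objects sketch is to keep each row as a string array and group by the key cells. Everything here is hypothetical: the class and method names, the sample lines, and in particular the naive Split(','), which assumes no quoted fields containing commas.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqDeduplicator
{
    // keyIndexes: positions of the columns that define a duplicate
    // (assumed to have been looked up from the header row beforehand).
    public static IEnumerable<string[]> DistinctByColumns(
        IEnumerable<string[]> rows, int[] keyIndexes)
    {
        return rows
            .GroupBy(cells => string.Join("\u001F", keyIndexes.Select(i => cells[i])))
            .Select(group => group.First());   // keep the first row of each group
    }

    static void Main()
    {
        var lines = new[] { "Ann,Oslo,1", "Bob,Rome,2", "Ann,Oslo,3" };

        // Naive parsing: assumes no quoted fields that contain commas.
        IEnumerable<string[]> rows = lines.Select(line => line.Split(','));

        foreach (string[] cells in DistinctByColumns(rows, new[] { 0, 1 }))
        {
            Console.WriteLine(string.Join(",", cells));
        }
        // prints:
        // Ann,Oslo,1
        // Bob,Rome,2
    }
}
```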

Based on the new code you've included in your question, I'll provide this second answer - I still prefer the first answer, but if you have to use DataTable and DataRows, then this second answer might help:
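class DataRowEqualityComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // cell-by-cell comparison: ItemArray holds the values of every column
        return x.ItemArray.SequenceEqual(y.ItemArray);
    }
    public int GetHashCode(DataRow obj)
    {
        // derive the hash from the row's values so that equal rows hash alike
        return string.Join("\u001F", obj.ItemArray).GetHashCode();
    }
}
// ...
var comparer = new DataRowEqualityComparer();
// needs using System.Linq; AsEnumerable() comes from System.Data.DataSetExtensions
var filteredRows = dtCSV.AsEnumerable().Distinct(comparer);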

Related

C# - check if column in CSV exists before assigning it to datatable

I have this:
var productDetailsFromFile = (from row in dt.AsEnumerable()
select new ProductDetails
{
ItemNumber = row.Field<string>("Item Number"),
Cost = row.Field<string>("Cost").ToDecimal(),//custom method .ToDecimal
WHQtyList = new List<int>()
{
row.Field<string>("foo").ToInteger(),//custom method .ToInteger
row.Field<string>("bar").ToInteger(),
row.Field<string>("foo2").ToInteger(),
}
}).ToList();
It reads the info from a .csv file. What I am trying to achieve is an elegant way of checking whether the fields "foo", "bar" or "foo2" exist.
Right now the issue is that if I remove one of the columns from the CSV file, a "column not in DataTable" error pops up. I have been trying to get this to work for 2 hours now.
What I am essentially seeking is a way to check whether a column exists as I use it to initialize the list and, if the column doesn't exist, to default the value to 0 for each row.

I did it through a method. I was wondering if there was a way to do it faster without having to add additional lines of code or make the additional lines of code less than what they are now.
int ContainsColumn (string columnName, DataTable table, DataRow row)
{
DataColumnCollection columns = table.Columns;
if (columns.Contains(columnName))
{
return int.Parse(row.Field<string>(columnName));
}
else
{
return 0;
}
}
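One way to keep the call sites short is to wrap the same check in an extension method. The sketch below is only a suggestion: the class name and GetIntOrDefault are hypothetical names, and it uses int.TryParse so that empty or malformed cells also fall back to the default instead of throwing.

```csharp
using System;
using System.Data;

static class DataRowColumnExtensions
{
    // Hypothetical helper: returns the cell parsed as an int, or defaultValue
    // when the column is missing, the cell is null, or the text is not a number.
    public static int GetIntOrDefault(this DataRow row, string columnName, int defaultValue = 0)
    {
        if (!row.Table.Columns.Contains(columnName) || row.IsNull(columnName))
        {
            return defaultValue;
        }

        int value;
        return int.TryParse(row[columnName].ToString(), out value) ? value : defaultValue;
    }
}

static class Demo
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("foo");
        table.Rows.Add("7");

        DataRow row = table.Rows[0];
        Console.WriteLine(row.GetIntOrDefault("foo"));   // prints 7
        Console.WriteLine(row.GetIntOrDefault("bar"));   // prints 0 (column missing)
    }
}
```

Inside the query from the question, each list entry would then read something like row.GetIntOrDefault("foo").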

How to add rows and columns into datatable in single loop?

I have Json stored in DataBase which I deserialize into DataTable with the help of Newtonsoft.Json like this
string jsonString = "[myJsonfromDB....]";
//Deserialize to DataTable
DataTable dtSerialized = (DataTable)JsonConvert.DeserializeObject(jsonString, (typeof(DataTable)));
This gives me a result like the following (the other columns are not shown in the image).
Here label holds the column name and value holds the column value. Both of these columns will be moved to a new DataTable, which I'll process further. My problem is that I want to do this in one loop, whereas I currently do it in multiple loops: I add the columns first (in the first loop) and then add the column values (in the second loop). Currently I'm doing it like this:
string colName = string.Empty;
// First Loop to add columns
foreach (DataRow dr in dtSerialized.Rows)
{
if (!string.IsNullOrEmpty(Utility.Instance.ToString(dr["label"])))
{
colName = prefix + "_" + Utility.Instance.ToString(dr["label"]).Replace(" ", string.Empty).Replace("/", "_").Replace("-", "_");
if (!dtResult.Columns.Contains(colName))
dtResult.Columns.Add(colName, typeof(string));
}
}
DataRow drSelect = dtResult.NewRow();
//Second loop to add column values
foreach (DataRow dr in dtSerialized.Rows)
{
if (!string.IsNullOrEmpty(Utility.Instance.ToString(dr["label"])))
{
colName = prefix + "_" + Utility.Instance.ToString(dr["label"]).Replace(" ", "").Replace("/", "_").Replace("-", "_");
drSelect[colName] = dr["value"];
}
}
dtResult.Rows.Add(drSelect);
dsResult.Tables.Add(dtResult);
After this I have the result table (second image, not shown here).
As far as I know, the DataRow schema is first built from the DataTable and only then can values be added, which is what the code above does. Now, how can I do it in one loop? Or should I look for an alternative method that I don't know about?
Thanks in advance

I am guessing I am missing something here. This looks like a transpose function and I cannot think of a way to accomplish this without two loops or transposing the data as you read it in. But going from what is posted it appears the column label holds the new DataTable’s column names. The first column is the first row of data to this new DataTable.
If this is the case, then while you are looping through the rows to get the column names from column 1 (label), you can also get the "value" from column 0 (value) and put it in a List<string>, named valuesList below.
Then after you have looped through all the rows and set the columns in the new DataTable dtResults you can add a single row from the valuesList by setting the list to a string array like below. This will produce the second picture you showed in one loop. Again I am guessing there is more to it than this simple transpose. Since a DataTable does not have a built in transpose function you will have to write your own. Not sure how you would do this in one loop though. Hope this helps.
private DataTable Transpose2ColDT(DataTable dtSource) {
string prefix = "DIAP_";
string colName = "";
DataTable dtResult = new DataTable();
List<string> valuesList = new List<String>();
if (dtSource.Rows.Count > 0) {
foreach (DataRow dr in dtSource.Rows) {
if (!dr.IsNull("Label")) {
if (dr.ItemArray[1].ToString() != "" ) {
colName = prefix + "_" + dr.ItemArray[1].ToString();
if (!dtResult.Columns.Contains(colName)) {
dtResult.Columns.Add(colName, typeof(string));
valuesList.Add(dr.ItemArray[0].ToString());
}
}
}
}
dtResult.Rows.Add(valuesList.ToArray<string>());
} // no rows in the original source
return dtResult;
}

c# datatable select last row on a specific condition

I have a DataTable with data in this format:
........ IVR........
.........IVR........
.........IVR........
.........City1......
.........City1......
.........City1......
.........City2......
.........City2......
.........City2......
I want to take the last row for each value; in other words, the last IVR row, the last City1 row and the last City2 row (shown in bold in the original post).
The challenge is that I want these three rows in a DataTable. I tried searching the internet, but I didn't know what this feature is called. Could you help me, please?

You can GroupBy() and then select last row with the help of the Last() method.
var result = from b in myDataTable.AsEnumerable()
group b by b.Field<string>("Your_Column_Name") into g
select g.Last();
DataTable filtered = myDataTable.Clone();
foreach(DataRow row in result)
{
filtered.ImportRow(row);
}
Clone clones the structure of the DataTable, including all DataTable schemas and constraints.

This can be implemented in a simple loop using a Dictionary to hold found rows:
var cRows = new Dictionary<string, DataRow>(StringComparer.InvariantCultureIgnoreCase);
foreach (DataRow oRow in oTable.Rows)
{
var sKey = oRow["KeyValue"].ToString();
if (!cRows.ContainsKey(sKey))
{
cRows.Add(sKey, oRow);
}
else
{
cRows[sKey] = oRow;
}
}
This approach will store the last row for each unique value in the column that you nominate.
To move the selected rows into a new DataTable:
var oNewTable = oTable.Clone();
foreach (var oRow in cRows.Values)
{
// ImportRow copies the row; Rows.Add would throw because oRow still belongs to oTable
oNewTable.ImportRow(oRow);
}
Clone just clones the structure of the current table, not the rows.

how to convert the entire content of a DataTable column to a delimited string in C#?

I am retrieving data from an MSSQL server using the SqlDataAdapter and DataSet. From that DataSet I am creating a DataTable. My goal is to convert each column of the table into a string where the elements are comma delimited. I figured that I would try the string conversion first before making the delimiter work.
The code runs in the code-behind of an ASP.Net page. The ultimate goal is to pass the string to a jscript variable; it's a "functional requirement" that I create a delimited string from the columns and that it ends up as a jscript variable.
Here's what I have thus far:
DataSet myDataSet = new DataSet();
mySqlDataAdapter.Fill(myDataSet);
DataTable temperature = myDataSet.Tables["Table"];
// LOOP1
foreach (DataRow row in temperature.Rows)
// this loop works fine and outputs all elements
// of the table to the web page, this is just to
// test things out
{
foreach (DataColumn col in temperature.Columns)
{
Response.Write(row[col] + " ### ");
}
Response.Write("<br>");
}
// LOOP2
foreach (DataColumn column in temperature.Columns)
// this loop was meant to take all elements for each
// column and create a string, then output that string
{
Response.Write(column.ToString() + "<br>");
}
In LOOP1 things work fine. My data has 4 columns, all are appropriately rendered with one record per row on the web page.
I saw the code for LOOP2 at http://msdn.microsoft.com/en-us/library/system.data.datacolumn.tostring.aspx which seems to do exactly what I need except it does not actually do what I want.
The only thing LOOP2 does is write 4 lines to the web page. Each line has the header of the respective table column but none of the additional data. Clearly there's either a logic flaw on my part or I misunderstand how DataColumn and .toString for it works. Please help me out on this one. Thanks in advance.
EDIT:
Here's an SQL query result example, this is what the Table looks like:
(image: table query result, hosted on ImageShack)
What I want to end up are four strings, here's an example for the string that would be created from the second column: "-6.7, -7, -7.2, -7.3, -7.3".

This code will concatenate values from cells under each column with ", ":
foreach (var column in temperature.Columns)
{
DataColumn dc = column as DataColumn;
string s = string.Join(", ", temperature.Rows.OfType<DataRow>()
.Select(r => r[dc]));
// do whatever you need with s now
}
For example, for DataTable defined as:
DataTable table = new DataTable();
table.Columns.Add(new DataColumn("Column #1"));
table.Columns.Add(new DataColumn("Column #2"));
table.Rows.Add(1, 2);
table.Rows.Add(11, 22);
table.Rows.Add(111, 222);
... it will produce "1, 11, 111" and "2, 22, 222" strings.
Edit, in reply to the comment "I saw you chose to declare column as var as opposed to DataColumn, is that a matter of personal preference/style or is there an issue with coding?":
Consider following scenario (on the same data table example as above):
// we decide we'll use results later, storing them temporarily here
List<IEnumerable<string>> columnsValues = new List<IEnumerable<string>>();
foreach (DataColumn column in temperature.Columns)
{
var values = temperature.Rows.OfType<DataRow>()
.Select(r => r[column].ToString());
columnsValues.Add(values);
}
We assume we now got list of list of column values. So, when we print them, like this:
foreach (var listOfValues in columnsValues)
{
foreach (var value in listOfValues)
{
Debug.Write(value + " ");
}
Debug.WriteLine("");
}
We expect to see 1 11 111 followed by 2 22 222. Right?
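Wrong, at least with compilers before C# 5.0.
There, this code will output 2 22 222 twice. Why? Our .Select(r => r[column].ToString()) captures the column variable (not its value, but the variable itself), and since the query is not evaluated immediately, by the time we are out of the loop all it sees is the last value of column. Starting with C# 5.0, foreach declares a fresh loop variable for each iteration, so the same code prints the expected values; the variable of a for loop is still shared across iterations.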
To learn more about this concept search for closures and captured variables - for example, in posts like this.
Summary:
In this very case you can go with DataColumn in foreach statement. It doesn't matter here because we're enumerating through our .Select(r => r[dc]) either way inside the loop (precisely, string.Join does that for us), producing results before we get to next iteration - whatever we capture, is used immediately.
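The capture behaviour is easy to reproduce with a plain for loop, whose loop variable is shared across iterations in every C# version. The sketch below is illustrative only; the class and variable names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

static class CaptureDemo
{
    static void Main()
    {
        var captured = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
        {
            // The lambda captures the variable i itself, not its current value.
            captured.Add(() => i);
        }
        foreach (Func<int> f in captured)
        {
            Console.Write(f() + " ");   // prints "3 3 3 "
        }
        Console.WriteLine();

        var copied = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
        {
            int copy = i;               // a new variable for every iteration
            copied.Add(() => copy);
        }
        foreach (Func<int> f in copied)
        {
            Console.Write(f() + " ");   // prints "0 1 2 "
        }
        Console.WriteLine();
    }
}
```

Copying the loop variable into a local, or forcing evaluation inside the loop with ToList(), avoids the surprise.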

The link you have posted clearly states
The Expression value, if the property
is set; otherwise, the ColumnName
property.
and that is what is happening. You get column names.

This could help: How to convert a DataTable to a string in C#?

Remove row from generic datatable in C#

I ran into a problem trying to remove a row from a datatable in C#. The problem is that the datatable is built from SQL, so it can have any number of columns and may or may not have a primary key. So, I can't remove a row based on a value in a certain column or on a primary key.
Here's the basic outline of what I'm doing:
//Set up a new datatable that is an exact copy of the datatable from the SQL table.
newData = data.Copy();
//...(do other things)
foreach (DataRow dr in data.Rows)
{
//...(do other things)
// Check if the row is already in a data copy log. If so, we don't want it in the new datatable.
if (_DataCopyLogMaintenance.ContainedInDataCopyLog(dr))
{
newData.Rows.Remove(dr);
}
}
But, that gives me an error message, "The given DataRow is not in the current DataRowCollection". Which doesn't make any sense, given that newData is a direct copy of data. Does anyone else have any suggestions? The MSDN site wasn't much help.
Thanks!

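Your foreach needs to be on the copy, not the original set. You cannot remove an object contained in collection1 from collection2.
foreach (DataRow dr in newData.Rows)
Be aware, though, that removing rows from a collection while a foreach is enumerating it throws an InvalidOperationException ("Collection was modified"). Otherwise you could use a counter to remove at an index. Walk backwards so that removing a row does not shift the indexes of the rows you have not visited yet. Something like this:
for (int i = data.Rows.Count - 1; i >= 0; i--)
{
    // data and newData hold the same rows in the same order, so index i matches in both
    if (_DataCopyLogMaintenance.ContainedInDataCopyLog(data.Rows[i]))
    {
        newData.Rows.RemoveAt(i);
    }
}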
