More efficient way of assigning values in DataTable? - C#

I have a DataTable with two columns: JobDetailID and CalculatedID. JobDetailID is not always unique. I want the first instance of CalculatedID for a given JobDetailID to be JobDetailID + "A", and when there are multiple rows with the same JobDetailID, I want successive rows to be JobDetailID + "B", "C", etc. There are never more than four or five rows with the same JobDetailID.
I currently have it implemented as follows, but it's unacceptably slow:
private void AddCalculatedID(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);

    var enumerableData = data.AsEnumerable();
    foreach (DataRow row in data.Rows)
    {
        var jobDetailID = row["JobDetailID"].ToString();
        // Give calculated ID of JobDetailID + A, B, C, etc. for multiple rows with same JobDetailID
        int x = 65; // ASCII value for A
        string calculatedID = jobDetailID + (char)x;
        while (string.IsNullOrEmpty(row["CalculatedID"].ToString()))
        {
            if (enumerableData.Any(r => r.Field<string>("CalculatedID") == calculatedID))
            {
                x++;
                calculatedID = jobDetailID + (char)x;
            }
            else
            {
                row["CalculatedID"] = calculatedID;
                break;
            }
        }
    }
}
Assuming I need to adhere to this format of output, how might I improve this performance?

It would be better to add the code for generating CalculatedID in the place where you are getting the data, but if that is not an option, you might want to avoid scanning the entire table each time a duplicate is found. You could use a Dictionary for the used keys, like this:
private void AddCalculatedID(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);

    // Maps each JobDetailID to the last suffix character used for it
    var usedKeyIndex = new Dictionary<string, char>();
    foreach (DataRow row in data.Rows)
    {
        string jobDetailID = row["JobDetailID"].ToString();
        string calculatedID;
        if (!usedKeyIndex.ContainsKey(jobDetailID))
        {
            calculatedID = jobDetailID + 'A';
            usedKeyIndex.Add(jobDetailID, 'A');
        }
        else
        {
            char nextKey = (char)(usedKeyIndex[jobDetailID] + 1);
            calculatedID = jobDetailID + nextKey;
            usedKeyIndex[jobDetailID] = nextKey;
        }
        row["CalculatedID"] = calculatedID;
    }
}
This will essentially trade memory for speed, as it will cache all used JobDetailIDs along with the last char used for the generated key. If you have lots and lots of these JobDetailIDs, this might get a bit memory intensive, but I doubt that you'll have problems unless you have millions of rows to process.

If I understand your idea about setting CalculatedID for the rows, then the following algorithm will do the trick; the pass itself is linear, with the sort inside data.Select dominating at O(n log n). The most important part is data.Select("", "JobDetailID"), which gives me a sorted list of rows.
I haven't compiled it myself, so there could be syntax errors.
private void AddCalculatedID(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);

    int jobDetailID = -1;
    int letter = 65; // ASCII value for A
    foreach (DataRow row in data.Select("", "JobDetailID"))
    {
        if ((int)row["JobDetailID"] != jobDetailID)
        {
            // First row of a new JobDetailID: reset the suffix
            letter = 65;
            jobDetailID = (int)row["JobDetailID"];
        }
        row["CalculatedID"] = row["JobDetailID"].ToString() + (char)letter;
        letter++;
    }
}

You tagged this as LINQ, but you are using iterative methods. Probably the best way to do this would be to use a combination of both, iterating over each "grouping" and assigning the calculated ID for each row in the grouping.
foreach (var groupRows in data.AsEnumerable().GroupBy(d => d["JobDetailID"].ToString()))
{
    if (string.IsNullOrEmpty(groupRows.Key))
        continue;

    // We now have each "grouping" of duplicate JobDetailIDs.
    int x = 65; // ASCII value for A
    foreach (var duplicate in groupRows)
    {
        string calcID = groupRows.Key + (char)x++;
        duplicate["CalculatedID"] = calcID;
        // Can also do this and achieve the same results:
        // duplicate["CalculatedID"] = groupRows.Key + ((char)x++);
    }
}
The first thing you do is group on the column that's going to have duplicates. You're going to iterate over each of these groupings, and reset the suffix value for every grouping. For every row in the grouping, you're going to build the calculated ID (incrementing the suffix value at the same time) and assign the ID back to the duplicate row. As a side note, we're altering the items we're enumerating here, which is normally a bad thing. However, we're changing data that isn't associated with our enumeration declaration (GroupBy), so it will not alter the behavior of our enumeration.

This method gets the job done in a single pass. You can optimize it further if, for example, "JobDetailID" is an integer instead of a string, or if the DataTable is always receiving the data sorted by "JobDetailID" (you could get rid of the dictionary), but here's a draft:
private static void AddCalculatedID(DataTable data)
{
    data.BeginLoadData();
    try
    {
        var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
        data.Columns.Add(calculatedIDColumn);
        data.Columns["CalculatedID"].SetOrdinal(0);

        var jobDetails = new Dictionary<string, int>(data.Rows.Count);
        foreach (DataRow row in data.Rows)
        {
            var jobDetailID = row["JobDetailID"].ToString();
            int lastSuffix;
            if (jobDetails.TryGetValue(jobDetailID, out lastSuffix))
            {
                lastSuffix++;
            }
            else
            {
                // ASCII value for A
                lastSuffix = 65;
            }
            row["CalculatedID"] = jobDetailID + (char)lastSuffix;
            jobDetails[jobDetailID] = lastSuffix;
        }
    }
    finally
    {
        data.EndLoadData();
    }
}
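For illustration, here is a rough sketch of that dictionary-free variant; this is my own untested draft, and it assumes the rows really do arrive pre-sorted by JobDetailID:
// Sketch only: assumes data arrives already sorted by JobDetailID.
private static void AddCalculatedIDSorted(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);

    string previousID = null;
    int suffix = 'A';
    foreach (DataRow row in data.Rows)
    {
        var jobDetailID = row["JobDetailID"].ToString();
        if (jobDetailID != previousID)
        {
            // A new JobDetailID starts here: reset the suffix
            suffix = 'A';
            previousID = jobDetailID;
        }
        row["CalculatedID"] = jobDetailID + (char)suffix;
        suffix++;
    }
}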

Related

C# Constructing a Dynamic Query From DataTable

Trying to generate a dynamic LINQ query based on a DataTable returned to me... The column names in the DataTable will change, but I will know which ones I want to total and which ones I want to group by.
I can get this to work with loops and writing the output to a variable, then recasting the parts back into a data table, but I'm hoping there is a more elegant way of doing this.
// C#
DataTable dt = new DataTable();
dt.Columns.Add("DynamicData1");
dt.Columns.Add("DynamicData2");
dt.Columns.Add("DynamicCount");
In this case the columns are LastName, FirstName, Age. I want to total Age by the LastName and FirstName columns (yes, both in the group by). So one of my parameters would specify GroupBy = LastName, FirstName and another TotalBy = Age. The next query may return different column names.
DataRow dr = ...
dr[0] = {"Smith","John",10}
dr[1] = {"Smith","John",11}
dr[2] = {"Smith","Sarah",8}
Given these different potential column names, I'm looking to generate a LINQ query that creates a generic group-by and total output.
Result:
LastName, FirstName, AgeTotal
Smith, John = 21
Smith, Sarah = 8
If you use a simple converter class for LINQ you can achieve that easily.
Here is the quick data generation I did for the sample:
// create dummy table
var dt = new DataTable();
dt.Columns.Add("LastName", typeof(string));
dt.Columns.Add("FirstName", typeof(string));
dt.Columns.Add("Age", typeof(int));

// action to create the records easily
var addData = new Action<string, string, int>((ln, fn, age) =>
{
    var dr = dt.NewRow();
    dr["LastName"] = ln;
    dr["FirstName"] = fn;
    dr["Age"] = age;
    dt.Rows.Add(dr);
});

// add 3 datarow records
addData("Smith", "John", 10);
addData("Smith", "John", 11);
addData("Smith", "Sarah", 8);
This is how to use my simple transformation class:
// create a linq version of the table
var lqTable = new LinqTable(dt);

// make the group by query
var groupByNames = lqTable.Rows
    .GroupBy(row => row["LastName"].ToString() + "-" + row["FirstName"].ToString())
    .ToList();

// for each group create a brand new linq row
var linqRows = groupByNames.Select(grp =>
{
    // get all items, so we can use the first item for last and first name and sum the age easily at the same time
    var items = grp.ToList();
    // return a new linq row
    return new LinqRow()
    {
        Fields = new List<LinqField>()
        {
            new LinqField("LastName", items[0]["LastName"].ToString()),
            new LinqField("FirstName", items[0]["FirstName"].ToString()),
            new LinqField("Age", items.Sum(item => Convert.ToInt32(item["Age"]))),
        }
    };
}).ToList();

// create a new LinqTable, since it handles the DataTable format, and transform it directly
var finalTable = new LinqTable() { Rows = linqRows }.AsDataTable();
And finally, here are the custom classes that are used:
public class LinqTable
{
    public LinqTable()
    {
    }

    public LinqTable(DataTable sourceTable)
    {
        LoadFromTable(sourceTable);
    }

    public List<LinqRow> Rows = new List<LinqRow>();

    public List<string> Columns
    {
        get
        {
            var columns = new List<string>();
            if (Rows != null && Rows.Count > 0)
            {
                Rows[0].Fields.ForEach(field => columns.Add(field.Name));
            }
            return columns;
        }
    }

    public void LoadFromTable(DataTable sourceTable)
    {
        sourceTable.Rows.Cast<DataRow>().ToList().ForEach(row => Rows.Add(new LinqRow(row)));
    }

    public DataTable AsDataTable()
    {
        var dt = new DataTable("data");
        if (Rows != null && Rows.Count > 0)
        {
            Rows[0].Fields.ForEach(field =>
            {
                dt.Columns.Add(field.Name, field.DataType);
            });
            Rows.ForEach(row =>
            {
                var dr = dt.NewRow();
                row.Fields.ForEach(field => dr[field.Name] = field.Value);
                dt.Rows.Add(dr);
            });
        }
        return dt;
    }
}

public class LinqRow
{
    public List<LinqField> Fields = new List<LinqField>();

    public LinqRow()
    {
    }

    public LinqRow(DataRow sourceRow)
    {
        sourceRow.Table.Columns.Cast<DataColumn>().ToList().ForEach(col => Fields.Add(new LinqField(col.ColumnName, sourceRow[col], col.DataType)));
    }

    public object this[int index]
    {
        get { return Fields[index].Value; }
        set { Fields[index].Value = value; }
    }

    public object this[string name]
    {
        get { return Fields.Find(f => f.Name == name).Value; }
        set
        {
            var fieldIndex = Fields.FindIndex(f => f.Name == name);
            if (fieldIndex >= 0)
            {
                Fields[fieldIndex].Value = value;
            }
        }
    }

    public DataTable AsSingleRowDataTable()
    {
        var dt = new DataTable("data");
        if (Fields != null && Fields.Count > 0)
        {
            Fields.ForEach(field =>
            {
                dt.Columns.Add(field.Name, field.DataType);
            });
            var dr = dt.NewRow();
            Fields.ForEach(field => dr[field.Name] = field.Value);
            dt.Rows.Add(dr);
        }
        return dt;
    }
}

public class LinqField
{
    public Type DataType;
    public object Value;
    public string Name;

    public LinqField(string name, object value, Type dataType)
    {
        DataType = dataType;
        Value = value;
        Name = name;
    }

    public LinqField(string name, object value)
    {
        DataType = value.GetType();
        Value = value;
        Name = name;
    }

    public override string ToString()
    {
        return Value.ToString();
    }
}
I think I'd just use a dictionary:
public Dictionary<string, int> GroupTot(DataTable dt, string[] groupBy, string tot)
{
    var d = new Dictionary<string, int>();
    foreach (DataRow ro in dt.Rows)
    {
        string key = "";
        foreach (string col in groupBy)
            key += (string)ro[col] + '\n';
        if (!d.ContainsKey(key))
            d[key] = 0;
        d[key] += (int)ro[tot];
    }
    return d;
}
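A quick usage sketch, assuming dt has string LastName/FirstName columns and an int Age column:
// Group by LastName + FirstName and total the Age column
var totals = GroupTot(dt, new[] { "LastName", "FirstName" }, "Age");
foreach (var pair in totals)
{
    // The key is the group-by values joined with '\n'
    Console.WriteLine("{0}= {1}", pair.Key.Replace('\n', ' '), pair.Value);
}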
If you want the total on each row, we could get cute and create a column that is an array of one int instead of an int:
public void GroupTot(DataTable dt, string[] groupBy, string tot)
{
    var d = new Dictionary<string, int[]>();
    var dc = dt.Columns.Add("Total_" + tot, typeof(int[]));
    foreach (DataRow ro in dt.Rows)
    {
        string key = "";
        foreach (string col in groupBy)
            key += (string)ro[col] + '\n'; // build a grouping key from first and last name
        if (!d.ContainsKey(key))           // have we seen this name pair before?
            d[key] = new int[1];           // no we haven't; create a total tracker for this first+last name
        d[key][0] += (int)ro[tot];         // add to the total
        ro[dc] = d[key];                   // link the row to the total tracker
    }
}
At the end of the operation every row will have an array of int in the "Total_age" column that represents the total for that First+Last name. The reason I used int[] rather than int is that int is a value type, whereas int[] is a reference type. As the table is iterated, each row gets assigned a reference to an int[], so rows with the same First+Last name end up with their int[] references pointing to the same object in memory; incrementing a later one increments all the earlier ones too (all the "John Smith" rows' Total column holds a reference to the same int[]). If we'd made the column an int type, then every row would point to a different counter, because every time we say ro[dc] = d[key] it would copy the current value of d[key]'s int into ro[dc]'s int. Any reference type would do for this trick to work, but value types wouldn't. If you wanted your column to be a value type, you'd have to iterate the table again, or keep a second dictionary that maps DataRow -> total and iterate its keys, assigning the totals back into the rows.
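If you did want the value-type column, here is a rough sketch of that two-pass variant (same key building as above; my own untested draft):
// Sketch: value-type int column; a second pass copies the finished totals back
public void GroupTotTwoPass(DataTable dt, string[] groupBy, string tot)
{
    var d = new Dictionary<string, int>();
    var dc = dt.Columns.Add("Total_" + tot, typeof(int));
    foreach (DataRow ro in dt.Rows)
    {
        string key = "";
        foreach (string col in groupBy)
            key += (string)ro[col] + '\n';
        if (!d.ContainsKey(key))
            d[key] = 0;
        d[key] += (int)ro[tot];
    }
    // Second pass: every row now receives the final total for its group
    foreach (DataRow ro in dt.Rows)
    {
        string key = "";
        foreach (string col in groupBy)
            key += (string)ro[col] + '\n';
        ro[dc] = d[key];
    }
}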

Casting a String as Integer in a .net DataTable

Disclaimer: This is my very first .net c# project
I am attempting to import a CSV into MSSQL but need to iterate through the CSV values first for sanitization purposes. Some of the columns in the CSV will be integer (they will be used for calculations later) and some are regular varchar.
My script below appears to force all values (that is, row column values) in the DataTable to be strings, which throws an exception later in my application when SQL cannot write a string as an integer.
Here is the method I am using, getCsvData, which creates a DataTable and populates it.
What I am thinking is to add another condition which checks if the value is an integer and then casts it as an integer (this kind of thing is new to me, as PHP does not handle types so strictly), but I fear that won't work, as I am not sure if I can mix values of various types within a DataTable.
So my question is: is there a way for me to have different values in a DataTable as different types? My code below takes the line as a whole and writes it as a string; I need the values to be assigned either as string or as integer.
/*
 * getCsvData()
 * This method will create a DataTable from the CSV file. We'll take the CSV file as is
 * and collect the data as needed:
 *
 * - Remove those original 4 lines (worthless info)
 * - Line 5 starts with the headers; remove any of the brackets around the values
 * - Iterate through the rest of the fields and sanitize them before we add them to the DataTable
 */
private DataTable getCsvData(string csv_file_path)
{
    // Create a new csvData table object:
    DataTable csvData = new DataTable();
    try
    {
        using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
        {
            csvReader.SetDelimiters(new string[] { "," });
            csvReader.HasFieldsEnclosedInQuotes = true;
            int row = 1;
            while (!csvReader.EndOfData)
            {
                // Read the string and collect the row data
                string[] rowData = csvReader.ReadFields();
                if (row <= 4)
                {
                    // We want to start on row 5 as the first rows are nonsense :)
                    // Increment the row so that we can do our magic above
                    row++;
                    continue;
                }
                if (row == 5)
                {
                    // Row 5 is the headers; we need to sanitize and continue:
                    foreach (string column in rowData)
                    {
                        // Remove the [ ] from the values:
                        var col = column.Substring(1, column.Length - 2);
                        DataColumn datecolumn = new DataColumn(col);
                        datecolumn.AllowDBNull = true;
                        csvData.Columns.Add(datecolumn);
                    }
                    // Increment the row so that we can do our magic above
                    row++;
                }
                else
                {
                    // These are all of the actual rows; sanitize and add the rows.
                    // Making empty values null:
                    for (int i = 0; i < rowData.Length; i++)
                    {
                        // First remove the brackets:
                        if (rowData[i].Substring(0, 1) == "[")
                        {
                            rowData[i] = rowData[i].Substring(1, rowData[i].Length - 2);
                        }
                        // Set blank to null:
                        if (rowData[i] == "" || rowData[i] == "-")
                        {
                            rowData[i] = null;
                        }
                        // Lastly, we need to do some calculations:
                    }
                    // Add the sanitized row to the DataTable:
                    csvData.Rows.Add(rowData);
                }
            }
        }
    }
    catch (Exception ex)
    {
        throw new Exception("Could not parse the CSV file: " + ex.Message);
    }
    return csvData;
}
You can parse the string to an int:
int j;
bool parsed = Int32.TryParse("-105", out j);
With TryParse you can check whether it succeeded.
Then when you want to save it to the table again, convert it back to a string. You can simply do <variable>.ToString()
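A minimal sketch of that round trip, with a typed column so the value stays an int (the table and column names here are made up):
// Sketch: parse a string into a typed int column, then back to text
var table = new DataTable();
table.Columns.Add("Age", typeof(int));  // typed column instead of the default string

int age;
string csvValue = "-105";
if (Int32.TryParse(csvValue, out age))  // did the parse succeed?
{
    var row = table.NewRow();
    row["Age"] = age;                   // stored as an int, so SQL won't reject it later
    table.Rows.Add(row);
}
string backToText = age.ToString();     // and back to a string when needed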
By default, data columns are initialized to a string data type.
There's an overload that allows you to specify the type, so I'd suggest you try that. Since your columns are known beforehand, you can easily handle this in your code.
private DataColumn AddColumn(string columnName, Type columnType)
{
    // Remove the [ ] from the values:
    var col = columnName.Substring(1, columnName.Length - 2);
    DataColumn dataColumn = new DataColumn(col, columnType);
    dataColumn.AllowDBNull = true;
    return dataColumn;
}
if (row == 5)
{
    csvData.Columns.Add(AddColumn(rowData[0], typeof(string)));
    csvData.Columns.Add(AddColumn(rowData[1], typeof(int)));
    csvData.Columns.Add(AddColumn(rowData[2], typeof(DateTime)));
    csvData.Columns.Add(AddColumn(rowData[3], typeof(string)));
    // etc
}
I'm not sure you'll even need to convert the other values before adding them to the DataTable, but if you do, many built-in types have TryParse methods, such as DateTime.TryParse and Int32.TryParse. You can call each of them in succession, and when one of the "tries" succeeds, you'll know your type.
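A sketch of that succession of "tries" (a hypothetical helper, checking int first, then DateTime, falling back to string):
// Sketch: try the most specific parse first; fall back to string
static object ParseCell(string raw)
{
    int i;
    if (Int32.TryParse(raw, out i))
        return i;

    DateTime dtValue;
    if (DateTime.TryParse(raw, out dtValue))
        return dtValue;

    return raw; // nothing matched, keep it as a string
}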
Alternatively, since you know the column types beforehand, you can just cast each value.
csvData.Rows.Add(Convert.ToString(rowData[0]),
                 Convert.ToInt32(rowData[1]),
                 Convert.ToDateTime(rowData[2]),
                 Convert.ToString(rowData[3]));
I would use *.TryParse(), i.e., with this sample CSV:
*A sample csv file with
*some comment lines at top
-- with different comment
// comment strings.
[charField],[dateField],[intField],[decimalField]
"Sample char data 1",2016/1/2,123,123.45
"Sample char data 2",,2,1.5
"Sample char data 3",,3,
"Sample char data 4",,,
,,,
"Sample char data 6",2016/2/29 10:20,10,20.5
You might use TryParse on those datetime, int, decimal fields:
void Main()
{
    var myData = ReadMyCSV(@"c:\MyPath\MyFile.csv");
    // do whatever with myData
}

public IEnumerable<MyRow> ReadMyCSV(string fileName)
{
    using (TextFieldParser tfp = new TextFieldParser(fileName))
    {
        tfp.HasFieldsEnclosedInQuotes = true;
        tfp.SetDelimiters(new string[] { "," });
        //tfp.CommentTokens = new string[] { "*", "--", "//" };
        // instead of using comment tokens we are going to skip 4 lines
        for (int j = 0; j < 4; j++)
        {
            tfp.ReadLine();
        }
        // header line.
        tfp.ReadLine();

        DateTime dt;
        int i;
        decimal d;
        while (!tfp.EndOfData)
        {
            var data = tfp.ReadFields();
            yield return new MyRow
            {
                MyCharData = data[0],
                MyDateTime = DateTime.TryParse(data[1], out dt) ? dt : (DateTime?)null,
                MyIntData = int.TryParse(data[2], out i) ? i : 0,
                MyDecimal = decimal.TryParse(data[3], System.Globalization.NumberStyles.Any, null, out d) ? d : 0M
            };
        }
    }
}

public class MyRow
{
    public string MyCharData { get; set; }
    public int MyIntData { get; set; }
    public DateTime? MyDateTime { get; set; }
    public decimal MyDecimal { get; set; }
}
I could further sanitize the data loaded, such as:
myData.Where(d => d.MyIntData != 0);
Note: I didn't use a DataTable, which I could if I wanted to. For MSSQL loading, I would probably use an intermediate in-memory SQLite instance to save the sanitized data and then push to MSSQL using SqlBulkCopy class. A DataTable is of course an option (I just think it is less flexible).
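For that last push, a minimal SqlBulkCopy sketch (requires System.Data.SqlClient; the connection string and destination table name are placeholders):
// Sketch: bulk-load a sanitized DataTable into SQL Server
using (var connection = new SqlConnection("Server=...;Database=...;Integrated Security=true"))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.MyImportTable"; // placeholder name
        bulkCopy.WriteToServer(csvData);                     // csvData: the DataTable built above
    }
}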

Best performance to find a value in a DataTable? For? LINQ? Other?

I have a big txt file loaded into a DataTable in a C# program.
I need to search for several values in this DataTable.
For the moment I use a simple for loop, and it's very slow!
I really need to gain time.
Is there a better way to perform this? Using LINQ? Or another method?
Here is a basic sample of my code:
foreach (DataRow row in DataTables[0].Rows)
{
    for (int i = 0; i <= DataTables[1].Rows.Count - 1; i++)
    {
        if (DataTables[1].Rows[i]["PRODUCT_CODE"].ToString().Trim() == row["PRODUCT_CODE"].ToString().Trim())
        {
            // Do Some Stuff
            // When the value is found, don't break the for... continue, because there are
            // several rows with the same "PRODUCT_CODE", not just one.
        }
    }
}
HashSet<string> dt0 = new HashSet<string>();
foreach (DataRow row in DataTables[0].Rows)
    dt0.Add(row["PRODUCT_CODE"].ToString().Trim());

for (int i = 0; i <= DataTables[1].Rows.Count - 1; i++)
{
    if (dt0.Contains(DataTables[1].Rows[i]["PRODUCT_CODE"].ToString().Trim()))
    {
        // Do Some Stuff
        // When the value is found, don't break the for... continue, because there are
        // several rows with the same "PRODUCT_CODE", not just one.
    }
}
That just went from O(n·m) to O(n+m).
If you need the whole row, then use a Dictionary rather than a HashSet:
Dictionary<string, DataRow> dt0 = new Dictionary<string, DataRow>();
You should use the HashSet / Dictionary for the larger table.
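Since the same PRODUCT_CODE occurs several times, a sketch that maps each code to all of its rows (my own variant of the same idea):
// Sketch: index every row of the larger table by PRODUCT_CODE
var index = new Dictionary<string, List<DataRow>>();
foreach (DataRow row in DataTables[0].Rows)
{
    string code = row["PRODUCT_CODE"].ToString().Trim();
    List<DataRow> rows;
    if (!index.TryGetValue(code, out rows))
    {
        rows = new List<DataRow>();
        index[code] = rows;
    }
    rows.Add(row);
}
// Then one O(1) lookup per row of the other table
foreach (DataRow row in DataTables[1].Rows)
{
    List<DataRow> matches;
    if (index.TryGetValue(row["PRODUCT_CODE"].ToString().Trim(), out matches))
    {
        // Do Some Stuff with each matching row in 'matches'
    }
}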
I would give you more but you had the insolence to ask me if I thought this would be faster.
Why are you using DataTables in the first place?
A short example using more than one core:
Parallel.ForEach(dt.AsEnumerable(), row =>
{
    if (row["value1"].ToString() == "test")
    {
        Console.WriteLine(row["value1"]);
    }
});
Other solution:
Comparing keys is very fast.
Dictionary<string, Product> file1 = new Dictionary<string, Product>();
Dictionary<string, Product> file2 = new Dictionary<string, Product>();

// Add ProductCode as key
var product = new Product();
product.Code = "EAN1202";
product.Manufacturer = "Company";
product.Name = "Test";
product.Price = 12.05;
file1.Add(product.Code, product);

// One thread
foreach (var item in file1)
{
    if (file2.ContainsKey(item.Key))
    {
        // Do Some Stuff
    }
}

// Multi thread
Parallel.ForEach(file1, item =>
{
    if (file2.ContainsKey(item.Key))
    {
        // Do Some Stuff
    }
});
Product Class
public class Product
{
    public string Code;
    public string Manufacturer;
    public string Name;
    public double Price;
}
This could probably be a bit better if we knew what you were doing in the loop, but this should work:
var dt1 = DataTables[0].AsEnumerable();
var dt2 = DataTables[1].AsEnumerable();
var results = dt1.Join(
    dt2,
    d1 => d1.Field<string>("PRODUCT_CODE").Trim(),
    d2 => d2.Field<string>("PRODUCT_CODE").Trim(),
    (d1, d2) => new { d1, d2 });
foreach (var row in results)
{
    // Do stuff with row.d1 / row.d2
}
If, for example, your data tables are created from a SQL source, it would be better to use a join there instead, which allows the SQL server to do the joining rather than doing it client-side. Also, not using DataTables and using a POCO class would improve your performance somewhat as well, as you won't need to box/unbox the product code during the join.
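For illustration, a sketch of that POCO version, assuming two List<Product> sources (productsA and productsB are my names) with a string Code property:
// Sketch: the same join over plain objects, avoiding DataTable boxing/unboxing
var results = productsA.Join(
    productsB,
    p1 => p1.Code.Trim(),
    p2 => p2.Code.Trim(),
    (p1, p2) => new { p1, p2 });

foreach (var pair in results)
{
    // Do stuff with pair.p1 / pair.p2
}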

C# - Looking for the list of duplicated rows (need optimization)

Please, I would like to optimize this code in C#, if possible.
When there are fewer than 1000 lines, it's fine. But when we have at least 10000, it starts to take some time...
Here is a little benchmark:
5000 lines => ~2s
15000 lines => ~20s
25000 lines => ~50s
Indeed, I'm looking for duplicated lines.
The SequenceEqual method used to check the values may be the problem (in my "benchmark", I have 4 fields considered as "keyFields"...).
Here is the code:
private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
    List<List<object>> duplicatedKeys = new List<List<object>>();             // List of duplicated key values
    List<DataRow> duplicatedRows = new List<DataRow>();                       // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Find key field values for the row
        List<object> rowKeys = new List<object>();
        keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));

        // Check if those keys are already defined
        bool alreadyDefined = false;
        foreach (List<object> keyValue in keys.Keys)
        {
            if (rowKeys.SequenceEqual(keyValue))
            {
                alreadyDefined = true;
                break;
            }
        }

        if (alreadyDefined)
        {
            duplicatedRows.Add(row);
            // If first duplicate for this key, add the first occurrence of this key
            if (!duplicatedKeys.Contains(rowKeys))
            {
                duplicatedKeys.Add(rowKeys);
                int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
                duplicatedRows.Add(table.Rows[i]);
            }
        }
        else
        {
            keys.Add(rowKeys, table.Rows.IndexOf(row));
        }
    }
    return duplicatedRows;
}
Any ideas?
I think this is the fastest and shortest way to find duplicate rows. For 100,000 rows it executes in about 250 ms.
Main and test data:
static void Main(string[] args)
{
    var dt = new DataTable();
    dt.Columns.Add("Id");
    dt.Columns.Add("Value1");
    dt.Columns.Add("Value2");

    var rnd = new Random(DateTime.Now.Millisecond);
    for (int i = 0; i < 100000; i++)
    {
        var dr = dt.NewRow();
        dr[0] = rnd.Next(1, 1000);
        dr[1] = rnd.Next(1, 1000);
        dr[2] = rnd.Next(1, 1000);
        dt.Rows.Add(dr);
    }

    Stopwatch sw = new Stopwatch();
    sw.Start();
    var duplicates = GetDuplicateRows(dt, "Id", "Value1", "Value2");
    sw.Stop();

    Console.WriteLine(
        "Found {0} duplicates in {1} milliseconds.",
        duplicates.Count,
        sw.ElapsedMilliseconds);
    Console.ReadKey();
}
GetDuplicateRows with LINQ:
private static List<DataRow> GetDuplicateRows(DataTable table, params string[] keys)
{
    var duplicates =
        table
        .AsEnumerable()
        .GroupBy(dr => String.Join("-", keys.Select(k => dr[k])),
                 (groupKey, groupRows) => new { Key = groupKey, Rows = groupRows })
        .Where(g => g.Rows.Count() > 1)
        .SelectMany(g => g.Rows)
        .ToList();
    return duplicates;
}
Explanation (for those who are new to LINQ):
The most tricky part is the GroupBy, I guess. Here the first parameter takes a DataRow, and for each row I create a group key from the values of the specified keys, joined to create a string like 1-1-2. The second parameter then selects the group key and the group rows into a new anonymous object. Then I check if there is more than 1 row and flatten the groups back into a list with SelectMany.
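One design note: joining the key values with "-" can collide in theory (values "1-1" and "2" produce the same key as "1" and "1-2"). A sketch of the same method with an unlikely separator, if that ever matters for your data:
// Sketch: same query, but the group key uses a control character as separator,
// so composite values like "1-1" + "2" can't collide with "1" + "1-2"
private static List<DataRow> GetDuplicateRowsSafeKey(DataTable table, params string[] keys)
{
    return table
        .AsEnumerable()
        .GroupBy(dr => String.Join("\u0001", keys.Select(k => dr[k])))
        .Where(g => g.Count() > 1)
        .SelectMany(g => g)
        .ToList();
}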
Try this. Use more LINQ; that improves performance. Also try PLINQ if possible.
Regards
private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
    List<List<object>> duplicatedKeys = new List<List<object>>();             // List of duplicated key values
    List<DataRow> duplicatedRows = new List<DataRow>();                       // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Find key field values for the row
        List<object> rowKeys = new List<object>();
        keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));

        // Check if those keys are already defined
        bool alreadyDefined = keys.Keys.Any(keyValue => rowKeys.SequenceEqual(keyValue));

        if (alreadyDefined)
        {
            duplicatedRows.Add(row);
            // If first duplicate for this key, add the first occurrence of this key
            if (!duplicatedKeys.Contains(rowKeys))
            {
                duplicatedKeys.Add(rowKeys);
                int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
                duplicatedRows.Add(table.Rows[i]);
            }
        }
        else
        {
            keys.Add(rowKeys, table.Rows.IndexOf(row));
        }
    }
    return duplicatedRows;
}

Is there a code pattern for mapping a CSV with random column order to defined properties?

I have a CSV that is delivered to my application from various sources. The CSV will always have the same number of columns, and the header values for the columns will always be the same.
However, the columns may not always be in the same order.
Day 1 CSV may look like this
ID,FirstName,LastName,Email
1,Johh,Lennon,jlennon#applerecords.com
2,Paul,McCartney,macca#applerecords.com
Day 2 CSV may look like this
Email,FirstName,ID,LastName
resident1#friarpark.com,George,3,Harrison
ringo#allstarrband.com,Ringo,4,Starr
I want to read in the header row for each file and have a simple mechanism for associating each "column" of data with the associated property I have defined in my class.
I know I can use selection statements to figure it out, but that seems like a "bad" way to handle it.
Is there a simple way to map "columns" to properties using a dictionary or class at runtime?
Use a Dictionary to map column heading text to column position.
Hard-code mapping of column heading text to object property.
Example:
// Parse the first line of text to add column heading strings and positions to your dictionary
...
// Parse a data row into an array, indexed by column position
...
// Assign data to object properties
x.ID = row[myDictionary["ID"]];
x.FirstName = row[myDictionary["FirstName"]];
...
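A fleshed-out sketch of that idea; the Person class and the literal header/data lines here are just for illustration:
// Sketch: map header text to column position once, then read fields by name
class Person
{
    public string ID { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

// ...
var myDictionary = new Dictionary<string, int>();
string headerLine = "Email,FirstName,ID,LastName";        // first line of the file
string dataLine = "ringo#allstarrband.com,Ringo,4,Starr"; // one data row

string[] headers = headerLine.Split(',');
for (int i = 0; i < headers.Length; i++)
    myDictionary[headers[i]] = i;

string[] row = dataLine.Split(',');
var x = new Person
{
    ID = row[myDictionary["ID"]],
    FirstName = row[myDictionary["FirstName"]],
    LastName = row[myDictionary["LastName"]],
    Email = row[myDictionary["Email"]]
};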
You don't need a design pattern for this purpose.
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I have used this reader; while it is pretty good, it offers functionality such as row["firstname"] or row["id"], which you can parse to create your objects.
I have parsed both CSV files using Microsoft.VisualBasic.FileIO.TextFieldParser. I populated a DataTable after parsing both CSV files:
DataTable dt;

private void button1_Click(object sender, EventArgs e)
{
    dt = new DataTable();
    ParseCSVFile("day1.csv");
    ParseCSVFile("day2.csv");
    dataGridView1.DataSource = dt;
}

private void ParseCSVFile(string sFileName)
{
    var dIndex = new Dictionary<string, int>();
    using (TextFieldParser csvReader = new TextFieldParser(sFileName))
    {
        csvReader.Delimiters = new string[] { "," };
        var colFields = csvReader.ReadFields();
        for (int i = 0; i < colFields.Length; i++)
        {
            string sColField = colFields[i];
            if (sColField != string.Empty)
            {
                dIndex.Add(sColField, i);
                if (!dt.Columns.Contains(sColField))
                    dt.Columns.Add(sColField);
            }
        }
        while (!csvReader.EndOfData)
        {
            string[] fieldData = csvReader.ReadFields();
            if (fieldData.Length > 0)
            {
                DataRow dr = dt.NewRow();
                foreach (var kvp in dIndex)
                {
                    int iVal = kvp.Value;
                    if (iVal < fieldData.Length)
                        dr[kvp.Key] = fieldData[iVal];
                }
                dt.Rows.Add(dr);
            }
        }
    }
}
day1.csv and day2.csv as mentioned in the question.
Here is how the output dataGridView1 looks: [screenshot omitted]
Here is a simple generic method that will take a CSV file (broken into string[]) and create from it a list of objects. The assumption is that the object properties will have the same names as the headers. If this is not the case, you might look into the DataMemberAttribute property and modify accordingly.
private static List<T> ProcessCSVFile<T>(string[] lines)
{
    List<T> list = new List<T>();
    Type type = typeof(T);

    string[] headerArray = lines[0].Split(new char[] { ',' });
    PropertyInfo[] properties = new PropertyInfo[headerArray.Length];
    for (int prop = 0; prop < properties.Length; prop++)
    {
        properties[prop] = type.GetProperty(headerArray[prop]);
    }

    for (int count = 1; count < lines.Length; count++)
    {
        string[] valueArray = lines[count].Split(new char[] { ',' });
        T t = Activator.CreateInstance<T>();
        list.Add(t);
        for (int value = 0; value < valueArray.Length; value++)
        {
            properties[value].SetValue(t, valueArray[value], null);
        }
    }
    return list;
}
Now, in order to use it just pass your file formatted as an array of strings. Let's say the class you want to read into looks like this:
class Music
{
    public string ID { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}
So you can call this:
List<Music> newlist = ProcessCSVFile<Music>(list.ToArray());
...and everything gets done with one call.
