Disclaimer: This is my very first .NET C# project.
I am attempting to import a CSV into MSSQL but need to iterate through the CSV values first for sanitization purposes. Some of the columns in the CSV will be integers (they will be used for calculations later) and some are regular varchar.
My script below appears to force all values (that is, the row's column values) in the DataTable to be strings, which throws an exception later in my application when SQL cannot write a string as an integer.
Here is the method I am using for the CSV import, which creates a DataTable and populates it.
What I am thinking is to add another condition which checks if the value is an integer and then converts it to an integer (this kind of thing is new to me, as PHP does not handle types so strongly), but I fear that won't work, as I am not sure if I can mix values of various types within a DataTable.
So my question is: is there a way for me to have different values in a DataTable as different types? My code below takes the line as a whole and writes it as strings; I need the values to be assigned either as string or as integer.
/*
* getCsvData()
* This method will create a datatable from the CSV file. We'll take the CSV file as is.
* and collect the data as needed:
*
* - Remove those original 4 lines (worthless info)
* - Line 5 starts with the headers, remove any of the brackets around the values
* - Iterate through the rest of the fields and sanitize them before we add it to the datatable
*
*/
private DataTable getCsvData(string csv_file_path)
{
// Create a new csvData DataTable object:
DataTable csvData = new DataTable();
try
{
using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
{
csvReader.SetDelimiters(new string[] { "," });
csvReader.HasFieldsEnclosedInQuotes = true;
int row = 1;
while (!csvReader.EndOfData)
{
// Read the string and collect the row data
string[] rowData = csvReader.ReadFields();
if (row <= 4)
{
// We want to start on row 5 as first rows are nonsense :)
// Increment the row so that we can do our magic above
row++;
continue;
} if(row == 5)
{
// Row 5 is the headers, we need to sanitize and continue:
foreach (string column in rowData)
{
// Remove the [ ] from the values:
var col = column.Substring(1, column.Length - 2);
DataColumn datecolumn = new DataColumn(col);
datecolumn.AllowDBNull = true;
csvData.Columns.Add(datecolumn);
}
// Increment the row so that we can do our magic above
row++;
} else
{
// These are all of the actual rows, sanitize and add the rows:
//Making empty value as null
for (int i = 0; i < rowData.Length; i++)
{
// First remove the brackets:
if (rowData[i].Substring(0,1) == "[")
{
rowData[i] = rowData[i].Substring(1, rowData[i].Length - 2);
}
// Set blank to null:
if (rowData[i] == "" || rowData[i] == "-")
{
rowData[i] = null;
}
// Lastly, we need to do some calculations:
}
// Add the sanitized row to the DataTable:
csvData.Rows.Add(rowData);
}
}
}
}
catch (Exception ex)
{
throw new Exception("Could not parse the CSV file: "+ ex.Message);
}
return csvData;
}
You can parse the string to an int:
int j;
bool parsed = Int32.TryParse("-105", out j);
With TryParse you can check whether it succeeded.
Then, when you want to save it to the table again, convert it back to a string. You can simply call <variable>.ToString().
By default, data columns are initialized to a string data type.
There's an overload that allows you to specify the type, so I'd suggest you try that. Since your columns are known beforehand, you can easily handle this in your code.
private DataColumn AddColumn(string columnName, Type columnType)
{
    // Remove the [ ] from the values:
    var col = columnName.Substring(1, columnName.Length - 2);
    DataColumn dataColumn = new DataColumn(col, columnType);
    dataColumn.AllowDBNull = true;
    return dataColumn;
}
if (row == 5)
{
csvData.Columns.Add(AddColumn(rowData[0], typeof(string)));
csvData.Columns.Add(AddColumn(rowData[1], typeof(int)));
csvData.Columns.Add(AddColumn(rowData[2], typeof(DateTime)));
csvData.Columns.Add(AddColumn(rowData[3], typeof(string)));
// etc
}
I'm not sure you'll even need to convert the other values before adding them to the DataTable, but if you do, many built-in types have TryParse methods, such as DateTime.TryParse and Int32.TryParse. You can call each of them in succession, and when one of the "tries" succeeds, you'll know your type.
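As a rough sketch of the "try each parser in succession" idea: the helper name ParseCell is made up for illustration, the order int, decimal, DateTime is my own choice, and invariant culture is assumed so that '.' is the decimal separator.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: returns the value as int, decimal, DateTime or string,
// whichever parser succeeds first. Blank input becomes DBNull.Value.
static object ParseCell(string raw)
{
    if (string.IsNullOrWhiteSpace(raw))
        return DBNull.Value;

    int i;
    if (int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out i))
        return i;

    decimal d;
    if (decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out d))
        return d;

    DateTime dt;
    if (DateTime.TryParse(raw, CultureInfo.InvariantCulture, DateTimeStyles.None, out dt))
        return dt;

    // Nothing matched, so keep the original text.
    return raw;
}

Console.WriteLine(ParseCell("123").GetType().Name);
Console.WriteLine(ParseCell("abc").GetType().Name);
```

Guessing the type per value is only useful when the column types are not known; a typed DataColumn will still reject a value of the wrong type.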
Alternatively, since you know the column types beforehand, you can just convert each value.
csvData.Rows.Add(Convert.ToString(rowData[0]),
Convert.ToInt32(rowData[1]),
Convert.ToDateTime(rowData[2]),
Convert.ToString(rowData[3]));
I would use *.TryParse(). For example, with this sample CSV:
*A sample csv file with
*some comment lines at top
-- with different comment
// comment strings.
[charField],[dateField],[intField],[decimalField]
"Sample char data 1",2016/1/2,123,123.45
"Sample char data 2",,2,1.5
"Sample char data 3",,3,
"Sample char data 4",,,
,,,
"Sample char data 6",2016/2/29 10:20,10,20.5
You might use TryParse on those datetime, int, decimal fields:
void Main()
{
var myData = ReadMyCSV(@"c:\MyPath\MyFile.csv");
// do whatever with myData
}
public IEnumerable<MyRow> ReadMyCSV(string fileName)
{
using (TextFieldParser tfp = new TextFieldParser(fileName))
{
tfp.HasFieldsEnclosedInQuotes = true;
tfp.SetDelimiters(new string[] { "," });
//tfp.CommentTokens = new string[] { "*","--","//" };
// instead of using comment tokens we are going to skip 4 lines
for (int j = 0; j < 4; j++)
{
tfp.ReadLine();
}
// header line.
tfp.ReadLine();
DateTime dt;
int i;
decimal d;
while (!tfp.EndOfData)
{
var data = tfp.ReadFields();
yield return new MyRow
{
MyCharData = data[0],
MyDateTime = DateTime.TryParse(data[1], out dt) ? dt : (DateTime?)null,
MyIntData = int.TryParse(data[2], out i) ? i : 0,
MyDecimal = decimal.TryParse(data[3], System.Globalization.NumberStyles.Any, null, out d) ? d : 0M
};
}
}
}
public class MyRow
{
public string MyCharData { get; set; }
public int MyIntData { get; set; }
public DateTime? MyDateTime { get; set; }
public decimal MyDecimal { get; set; }
}
I could further sanitize the data loaded, such as:
myData.Where( d => d.MyIntData != 0 );
Note: I didn't use a DataTable, which I could if I wanted to. For MSSQL loading, I would probably use an intermediate in-memory SQLite instance to save the sanitized data and then push to MSSQL using SqlBulkCopy class. A DataTable is of course an option (I just think it is less flexible).
Related
Trying to Generate a Dynamic Linq Query, based on DataTable returned to me... The column names in the DataTable will change, but I will know which ones I want to total, and which ones I will want to be grouped.
I can get this to work with loops and writing the output to a variable, then recasting the parts back into a data table, but I'm hoping there is a more elegant way of doing this.
//C#
DataTable dt = new DataTable();
dt.Columns.Add(DynamicData1);
dt.Columns.Add(DynamicData2);
dt.Columns.Add(DynamicCount);
In this case the columns are LastName, FirstName, Age. I want to total ages by LastName,FirstName columns (yes both in the group by). So one of my parameters would specify group by = LastName, FirstName and another TotalBy = Age. The next query may return different column names.
Datarow dr =..
dr[0] = {"Smith","John",10}
dr[1] = {"Smith","John",11}
dr[2] = {"Smith","Sarah",8}
Given these different potential columns names...I'm looking to generate a linq query that creates a generic group by and Total output.
Result:
LastName, FirstName, AgeTotal
Smith, John = 21
Smith, Sarah = 8
If you use a simple converter for LINQ you can achieve that easily.
Here is some quick data generation I did for the sample:
// create dummy table
var dt = new DataTable();
dt.Columns.Add("LastName", typeof(string));
dt.Columns.Add("FirstName", typeof(string));
dt.Columns.Add("Age", typeof(int));
// action to create easily the records
var addData = new Action<string, string, int>((ln, fn, age) =>
{
var dr = dt.NewRow();
dr["LastName"] = ln;
dr["FirstName"] = fn;
dr["Age"] = age;
dt.Rows.Add(dr);
});
// add 3 datarows records
addData("Smith", "John", 10);
addData("Smith", "John", 11);
addData("Smith", "Sarah", 8);
This is how to use my simple transformation class:
// create a linq version of the table
var lqTable = new LinqTable(dt);
// make the group by query
var groupByNames = lqTable.Rows.GroupBy(row => row["LastName"].ToString() + "-" + row["FirstName"].ToString()).ToList();
// for each group create a brand new linqRow
var linqRows = groupByNames.Select(grp =>
{
//get all items. so we can use first item for last and first name and sum the age easily at the same time
var items = grp.ToList();
// return a new linq row
return new LinqRow()
{
Fields = new List<LinqField>()
{
new LinqField("LastName",items[0]["LastName"].ToString()),
new LinqField("FirstName",items[0]["FirstName"].ToString()),
new LinqField("Age",items.Sum(item => Convert.ToInt32(item["Age"]))),
}
};
}).ToList();
// create a new LinqTable, since it handles the DataTable format and transforms it directly
var finalTable = new LinqTable() { Rows = linqRows }.AsDataTable();
And finally, here are the custom classes that are used:
public class LinqTable
{
public LinqTable()
{
}
public LinqTable(DataTable sourceTable)
{
LoadFromTable(sourceTable);
}
public List<LinqRow> Rows = new List<LinqRow>();
public List<string> Columns
{
get
{
var columns = new List<string>();
if (Rows != null && Rows.Count > 0)
{
Rows[0].Fields.ForEach(field => columns.Add(field.Name));
}
return columns;
}
}
public void LoadFromTable(DataTable sourceTable)
{
sourceTable.Rows.Cast<DataRow>().ToList().ForEach(row => Rows.Add(new LinqRow(row)));
}
public DataTable AsDataTable()
{
var dt = new DataTable("data");
if (Rows != null && Rows.Count > 0)
{
Rows[0].Fields.ForEach(field =>
{
dt.Columns.Add(field.Name, field.DataType);
});
Rows.ForEach(row =>
{
var dr = dt.NewRow();
row.Fields.ForEach(field => dr[field.Name] = field.Value);
dt.Rows.Add(dr);
});
}
return dt;
}
}
public class LinqRow
{
public List<LinqField> Fields = new List<LinqField>();
public LinqRow()
{
}
public LinqRow(DataRow sourceRow)
{
sourceRow.Table.Columns.Cast<DataColumn>().ToList().ForEach(col => Fields.Add(new LinqField(col.ColumnName, sourceRow[col], col.DataType)));
}
public object this[int index]
{
get
{
return Fields[index].Value;
}
set
{
Fields[index].Value = value;
}
}
public object this[string name]
{
get
{
return Fields.Find(f => f.Name == name).Value;
}
set
{
var fieldIndex = Fields.FindIndex(f => f.Name == name);
if (fieldIndex >= 0)
{
Fields[fieldIndex].Value = value;
}
}
}
public DataTable AsSingleRowDataTable()
{
var dt = new DataTable("data");
if (Fields != null && Fields.Count > 0)
{
Fields.ForEach(field =>
{
dt.Columns.Add(field.Name, field.DataType);
});
var dr = dt.NewRow();
Fields.ForEach(field => dr[field.Name] = field.Value);
dt.Rows.Add(dr);
}
return dt;
}
}
public class LinqField
{
public Type DataType;
public object Value;
public string Name;
public LinqField(string name, object value, Type dataType)
{
DataType = dataType;
Value = value;
Name = name;
}
public LinqField(string name, object value)
{
DataType = value.GetType();
Value = value;
Name = name;
}
public override string ToString()
{
return Value.ToString();
}
}
I think I'd just use a dictionary:
public Dictionary<string, int> GroupTot(DataTable dt, string[] groupBy, string tot){
var d = new Dictionary<string, int>();
foreach(DataRow ro in dt.Rows){
string key = "";
foreach(string col in groupBy)
key += (string)ro[col] + '\n';
if(!d.ContainsKey(key))
d[key] = 0;
d[key]+= (int)ro[tot];
}
return d;
}
If you want the total on each row, we could get cute and create a column that is an array of one int instead of an int:
public void GroupTot(DataTable dt, string[] groupBy, string tot){
var d = new Dictionary<string, int[]>();
var dc = dt.Columns.Add("Total_" + tot, typeof(int[]));
foreach(DataRow ro in dt.Rows){
string key = "";
foreach(string col in groupBy)
key += (string)ro[col] + '\n'; //build a grouping key from first and last name
if(!d.ContainsKey(key)) //have we seen this name pair before?
d[key] = new int[1]; //no we haven't, ensure we have a tracker for our total, for this first+last name
d[key][0] += (int)ro[tot]; //add the total
ro[dc] = d[key]; //link the row to the total tracker
}
}
At the end of the operation every row will have an int[] in the "Total_Age" column that holds the total for that First+Last name. The reason I used int[] rather than int is that int is a value type, whereas int[] is a reference type. As the table is iterated, each row is assigned a reference to an int[]; rows with the same First+Last name end up with references pointing to the same object in memory, so incrementing a later one increments all the earlier ones too (every "John Smith" row's total column holds a reference to the same int[]). If we'd made the column an int type, every row would hold its own copy, because each ro[dc] = d[key] would copy the current value of d[key] into ro[dc]. Any reference type would make this trick work, but value types wouldn't. If you wanted your column to be a value type, you'd have to iterate the table again, or keep a second dictionary that maps DataRow -> total and iterate its keys, assigning the totals back into the rows.
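For the value-type variant mentioned at the end, here is a minimal sketch of the "iterate the table again" route. The name AddGroupTotals and the sample table are only for illustration, and it assumes the totalled column converts cleanly to int.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Sketch: a plain int total column, filled in a second pass over the rows.
static void AddGroupTotals(DataTable dt, string[] groupBy, string tot)
{
    var totals = new Dictionary<string, int>();
    var keys = new List<string>(dt.Rows.Count);

    // Pass 1: build each row's grouping key and accumulate the totals.
    foreach (DataRow row in dt.Rows)
    {
        var parts = new List<string>();
        foreach (string col in groupBy)
            parts.Add(Convert.ToString(row[col]));
        string key = string.Join("\n", parts);
        keys.Add(key);

        int sum;
        totals.TryGetValue(key, out sum);   // sum stays 0 for a new key
        totals[key] = sum + Convert.ToInt32(row[tot]);
    }

    // Pass 2: copy the finished totals into a value-typed column.
    DataColumn totalColumn = dt.Columns.Add("Total_" + tot, typeof(int));
    for (int i = 0; i < dt.Rows.Count; i++)
        dt.Rows[i][totalColumn] = totals[keys[i]];
}

var table = new DataTable();
table.Columns.Add("LastName", typeof(string));
table.Columns.Add("FirstName", typeof(string));
table.Columns.Add("Age", typeof(int));
table.Rows.Add("Smith", "John", 10);
table.Rows.Add("Smith", "John", 11);
table.Rows.Add("Smith", "Sarah", 8);

AddGroupTotals(table, new[] { "LastName", "FirstName" }, "Age");
foreach (DataRow row in table.Rows)
    Console.WriteLine("{0}, {1} = {2}", row["LastName"], row["FirstName"], row["Total_Age"]);
```

The second pass costs one more walk over the rows, but the column stays an ordinary int that binds and serializes like any other.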
I have a CSV file and would like to count how many times the 2nd column contains 111.
The CSV file has 46 columns separated by ';'.
"first col" "second col" "....."
abc 111 a
abc 112 b
abc 113 c
abc 111 d
abc 112 e
abc 113 f
I would like to count the 111 values.
First I filled the DataGridView from a DataTable.
dgv.DataSource = dgv_table;
string[] raw_text = File.ReadAllLines("d:\\"+lb_csv.Text);
string[] data_col = null;
int x = 0;
foreach (string text_line in raw_text)
{
// MessageBox.Show(text_line);
data_col = text_line.Split(';');
if (x == 0)
{
for (int i = 0; i <= data_col.Count() - 1; i++)
{
dgv_table.Columns.Add(data_col[i]);
}
//header
x++;
}
else
{
//data
dgv_table.Rows.Add(data_col);
}
I found lots of solutions for counting specific data (111) in the 2nd column,
but every time I had problems.
int xCount = dgv.Rows.Cast<DataGridViewRow>().Select(row => row.Cells["second col"].Value).Where(s => s !=null && Equals(111)).Count();
this.lb_qty.Text = xCount.ToString();
But it gives an error for row.Cells["second col"].Value:
An unhandled exception of type 'System.ArgumentException' occurred in System.Windows.Forms.dll
Additional information: Column named second col cannot be found.
Can someone help me solve this problem and get the needed result?
I would suggest that you skip the DataGridView and use a counter variable in your loop, like Arkadiusz suggested.
If you still want to work with DataTable, count values like this:
int xCount = dgv_table.Rows.Cast<DataRow>().Count(r => r["second col"] != null && r["second col"].ToString() == "111");
I would try to read the file into a DataTable and use it as DataSource for the DataGridView.
DataTable d_Table = new DataTable();
//fill the DataTable
this.dgv_table.DataSource = d_Table;
To count the rows which contain 111 in the second column, you can filter the DataTable like this:
DataTable d_Table = new DataTable();
//fill the DataTable
DataRow[] rowCount = d_Table.Select("secondCol = '111'");
this.lb_qty.Text = rowCount.Length.ToString();
Or you can do it in a foreach-loop:
int count = 0;
foreach(DataGridViewRow dgr in this.dgv_table.Rows)
{
if(dgr.Cells["secondCol"].Value?.ToString() == "111") count++; // ?. guards the empty "new row" placeholder
}
this.lb_qty.Text = count.ToString();
You can use this method to read the CSV into a list of arrays (List<string[]>):
public static List<string[]> readCSV(String filename)
{
List<string[]> result = new List<string[]>();
try
{
string[] line = File.ReadAllLines(filename);
foreach (string l in line)
{
string[] value = l.Split(',');
result.Add(value);
}
}
catch (Exception e)
{
Console.WriteLine("Error: '{0}'", e);
}
return result;
}
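Every array represents one row (line) of the file, so you can find the frequency of any value in a column using LINQ or even a loop. For the second column, with tmp holding the list returned by readCSV:
foreach (var item in tmp.Select(row => row[1]).GroupBy(c => c))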
{
Console.WriteLine("{0} : {1}", item.Key, item.Count());
}
int CountValues(string input, string searchedValue, int ColumnNumber, bool skipFirstLine = false)
{
int numberOfSearchedValue= 0;
string line;
using (StreamReader reader = new StreamReader (input))
{
if(skipFirstLine)
reader.ReadLine();
while ((line = reader.ReadLine()) != null)
{
if(line.Split(';')[ColumnNumber] == searchedValue)
numberOfSearchedValue++;
}
}
return numberOfSearchedValue;
}
Edit:
StreamReader.ReadLine() reads the current line and advances to the next one, which is how the optional first call skips the header. When there are no more lines it returns null, so that is our ending condition. The rest of the code is readable, I think :)
I didn't test this, so be careful :)
It might be necessary to use Trim() or ToUpper() in some places (as usual when you are searching).
As in the title, I have a problem parsing data from CSV files. When I choose a file with different formatting, all I get is "Input string was not in a correct format".
My code works with files formatted like this:
16.990750 4.0
17.000250 5.0
17.009750 1.0
17.019250 6.0
But it can't handle files formatted like this one:
Series1 - X;Series1 - Y;
285.75;798
285.79;764
285.84;578
285.88;690
This is the code responsible for reading data from the file and creating a chart from it:
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
string cos = File.ReadAllText(openFileDialog1.FileName);
string[] rows = cos.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
DataTable table = new DataTable();
table.Columns.Add("xValue", typeof(decimal));
table.Columns.Add("yValue", typeof(decimal));
foreach (string row in rows)
{
string[] values = row.Split(' ');
DataRow ch = table.NewRow();
ch[0] = Decimal.Parse(values[0], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);
ch[1] = Decimal.Parse(values[1], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);
table.Rows.Add(ch);
}
if (seria == false)
{
wykres.Series.Add("series");
wykres.Series["series"].ChartType = System.Windows.Forms.DataVisualization.Charting.SeriesChartType.Line;
wykres.Series["series"].XValueMember = "xValue";
wykres.Series["series"].YValueMembers = "yValue";
wykres.DataSource = table;
wykres.DataBind();
seria = true;
}
}
EDIT
I changed the parsing method to this one:
foreach (string row in rows)
{
var values = row.Split(';');
var ch = table.NewRow();
decimal num = 0;
if (decimal.TryParse(values[0], out num))
ch[0] = num;
if (decimal.TryParse(values[1], out num))
ch[1] = num;
table.Rows.Add(ch);
}
It works okay, with one exception: it can't read decimals, only integers, from the CSV file (see the picture below).
View of table in locals
Why is this happening?
I suggest you don't reinvent the wheel, but use a well-tested library to parse the CSV. For example, your implementation does not handle quoted values well, and it doesn't allow the separator as part of a value.
And guess what: .NET includes something that could help you, the TextFieldParser class. Don't worry about the Microsoft.VisualBasic namespace; it works in C#, too :-)
foreach (string row in rows)
{
var values = row.Split(';');
var ch = table.NewRow();
decimal num = 0;
if (decimal.TryParse(values[0], out num))
ch[0] = num;
if (decimal.TryParse(values[1], out num))
ch[1] = num;
table.Rows.Add(ch);
}
In the second text format the separator is ';' and the first row of the text contains two strings rather than numbers. Therefore, to convert a string to a decimal, use decimal.TryParse() instead of decimal.Parse(). The return type of TryParse() is bool,
so if it returns true, the string was converted successfully.
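On the "integers only" symptom from the question's edit: one likely cause, assuming the machine runs under a culture whose decimal separator is a comma, is that decimal.TryParse(string, out decimal) uses the current culture, so "285.75" is rejected while "798" is accepted. A minimal sketch that passes the invariant culture explicitly (ParseSeries is a name made up for this example):

```csharp
using System;
using System.Data;
using System.Globalization;

// Sketch: parse "x;y" lines into a two-column decimal table.
static DataTable ParseSeries(string[] rows)
{
    var table = new DataTable();
    table.Columns.Add("xValue", typeof(decimal));
    table.Columns.Add("yValue", typeof(decimal));

    foreach (string row in rows)
    {
        string[] values = row.Split(';');
        if (values.Length < 2)
            continue;

        decimal x, y;
        // The invariant culture always treats '.' as the decimal separator.
        if (decimal.TryParse(values[0], NumberStyles.Float, CultureInfo.InvariantCulture, out x) &&
            decimal.TryParse(values[1], NumberStyles.Float, CultureInfo.InvariantCulture, out y))
        {
            table.Rows.Add(x, y);
        }
        // Rows that do not parse (such as the header line) are skipped.
    }
    return table;
}

var parsed = ParseSeries(new[] { "Series1 - X;Series1 - Y;", "285.75;798", "285.79;764" });
Console.WriteLine(parsed.Rows.Count);
```

Skipping unparsable rows, rather than adding a half-filled row, also keeps empty points out of the chart.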
I have a CSV that is delivered to my application from various sources. The CSV will always have the same number of columns, and the header values for the columns will always be the same.
However, the columns may not always be in the same order.
Day 1 CSV may look like this
ID,FirstName,LastName,Email
1,John,Lennon,jlennon@applerecords.com
2,Paul,McCartney,macca@applerecords.com
Day 2 CSV may look like this
Email,FirstName,ID,LastName
resident1@friarpark.com,George,3,Harrison
ringo@allstarrband.com,Ringo,4,Starr
I want to read in the header row for each file and have a simple mechanism for associating each "column" of data with the associated property I have defined in my class.
I know I can use selection statements to figure it out, but that seems like a "bad" way to handle it.
Is there a simple way to map "columns" to properties using a dictionary or class at runtime?
Use a Dictionary to map column heading text to column position.
Hard-code mapping of column heading text to object property.
Example:
// Parse first line of text to add column heading strings and positions to your dictionary
...
// Parse data row into an array, indexed by column position
...
// Assign data to object properties
x.ID = row[myDictionary["ID"]];
x.FirstName = row[myDictionary["FirstName"]];
...
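To make the outline above concrete, here is a minimal sketch. It assumes a naive Split(',') (so quoted fields containing commas are not handled) and uses a named tuple as a stand-in for the asker's class; ReadPeople is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Sketch: the header line decides where each property's value is read from.
static List<(string ID, string FirstName, string LastName, string Email)> ReadPeople(string[] lines)
{
    // Map each column heading to its position in this particular file.
    var columnIndex = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    string[] headers = lines[0].Split(',');
    for (int i = 0; i < headers.Length; i++)
        columnIndex[headers[i].Trim()] = i;

    var people = new List<(string ID, string FirstName, string LastName, string Email)>();
    for (int n = 1; n < lines.Length; n++)
    {
        string[] row = lines[n].Split(',');
        // The heading-to-property mapping is hard-coded here.
        people.Add((row[columnIndex["ID"]],
                    row[columnIndex["FirstName"]],
                    row[columnIndex["LastName"]],
                    row[columnIndex["Email"]]));
    }
    return people;
}

var day2 = ReadPeople(new[]
{
    "Email,FirstName,ID,LastName",
    "ringo@allstarrband.com,Ringo,4,Starr"
});
Console.WriteLine(day2[0].ID + " " + day2[0].LastName);
```

The dictionary is rebuilt per file, so the column order can differ from one delivery to the next without any change to the mapping code.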
You don't need a design pattern for this purpose.
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I have used this reader. It is pretty good, and it lets you access fields by header name, such as row["firstname"] or row["id"], which you can parse to create your objects.
I have parsed both CSV files using Microsoft.VisualBasic.FileIO.TextFieldParser and populated a DataTable from them:
DataTable dt;
private void button1_Click(object sender, EventArgs e)
{
dt = new DataTable();
ParseCSVFile("day1.csv");
ParseCSVFile("day2.csv");
dataGridView1.DataSource = dt;
}
private void ParseCSVFile(string sFileName)
{
var dIndex = new Dictionary<string, int>();
using (TextFieldParser csvReader = new TextFieldParser(sFileName))
{
csvReader.Delimiters = new string[] { "," };
var colFields = csvReader.ReadFields();
for (int i = 0; i < colFields.Length; i++)
{
string sColField = colFields[i];
if (sColField != string.Empty)
{
dIndex.Add(sColField, i);
if (!dt.Columns.Contains(sColField))
dt.Columns.Add(sColField);
}
}
while (!csvReader.EndOfData)
{
string[] fieldData = csvReader.ReadFields();
if (fieldData.Length > 0)
{
DataRow dr = dt.NewRow();
foreach (var kvp in dIndex)
{
int iVal = kvp.Value;
if (iVal < fieldData.Length)
dr[kvp.Key] = fieldData[iVal];
}
dt.Rows.Add(dr);
}
}
}
}
day1.csv and day2.csv are as given in the question.
Here is how the output in dataGridView1 looks:
Here is a simple generic method that will take a CSV file (broken into string[]) and create from it a list of objects. The assumption is that the object properties will have the same name as the headers. If this is not the case you might look into the DataMemberAttribute property and modify accordingly.
private static List<T> ProcessCSVFile<T>(string[] lines)
{
List<T> list = new List<T>();
Type type = typeof(T);
string[] headerArray = lines[0].Split(new char[] { ',' });
PropertyInfo[] properties = new PropertyInfo[headerArray.Length];
for (int prop = 0; prop < properties.Length; prop++)
{
properties[prop] = type.GetProperty(headerArray[prop]);
}
for (int count = 1; count < lines.Length; count++)
{
string[] valueArray = lines[count].Split(new char[] { ',' });
T t = Activator.CreateInstance<T>();
list.Add(t);
for (int value = 0; value < valueArray.Length; value++)
{
properties[value].SetValue(t, valueArray[value], null);
}
}
return list;
}
Now, in order to use it just pass your file formatted as an array of strings. Let's say the class you want to read into looks like this:
class Music
{
public string ID { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
public string Email { get; set; }
}
So you can call this:
List<Music> newlist = ProcessCSVFile<Music>(list.ToArray());
...and everything gets done with one call.
I have a DataTable with two columns: JobDetailID and CalculatedID. JobDetailID is not always unique. I want one/the first instance of CalculatedID for a given JobDetailID to be JobDetailID + "A", and when there are multiple rows with the same JobDetailID, I want successive rows to be JobDetailID + "B", "C", etc. There aren't more than four or five rows with the same JobDetailID.
I currently have it implemented as follows, but it's unacceptably slow:
private void AddCalculatedID(DataTable data)
{
var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
data.Columns.Add(calculatedIDColumn);
data.Columns["CalculatedID"].SetOrdinal(0);
var enumerableData = data.AsEnumerable();
foreach (DataRow row in data.Rows)
{
var jobDetailID = row["JobDetailID"].ToString();
// Give calculated ID of JobDetailID + A, B, C, etc. for multiple rows with same JobDetailID
int x = 65; // ASCII value for A
string calculatedID = jobDetailID + (char)x;
while (string.IsNullOrEmpty(row["CalculatedID"].ToString()))
{
if ((enumerableData
.Any(r => r.Field<string>("CalculatedID") == calculatedID)))
{
calculatedID = jobDetailID + (char)x;
x++;
}
else
{
row["CalculatedID"] = calculatedID;
break;
}
}
}
}
Assuming I need to adhere to this format of output, how might I improve this performance?
It would be better to add the code for generation of CalculatedID in the place where you are getting the data, but, if that is unavailable, you might want to avoid scanning the entire table each time a duplicate is found. You could use a Dictionary for the used keys, like this:
private void AddCalculatedID(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);
    // Last suffix letter handed out for each JobDetailID
    var usedKeyIndex = new Dictionary<string, char>();
    foreach (DataRow row in data.Rows)
    {
        string jobDetailID = row["JobDetailID"].ToString();
        char lastKey;
        // First occurrence gets 'A'; later ones get the next letter
        char nextKey = usedKeyIndex.TryGetValue(jobDetailID, out lastKey)
            ? (char)(lastKey + 1)
            : 'A';
        usedKeyIndex[jobDetailID] = nextKey;
        row["CalculatedID"] = jobDetailID + nextKey;
    }
}
This will essentially trade memory for speed, as it will cache all used JobDetailID's along with the last char used for the generated key. If you have lots and lots of these JobDetailID, this might get a bit memory intensive, but I doubt that you'll have problems unless you have millions of rows to process.
If I understand your idea about setting CalculatedID for the rows, then the following algorithm would do the trick, and it needs only a single pass once the rows are sorted. The most important part is data.Select("", "JobDetailID"), which returns the rows sorted by JobDetailID.
I didn't compile it myself, so there could be syntax errors.
private void AddCalculatedID(DataTable data)
{
    var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
    data.Columns.Add(calculatedIDColumn);
    data.Columns["CalculatedID"].SetOrdinal(0);
    // Assumes JobDetailID is an int column and that -1 never occurs as an ID
    int jobDetailID = -1;
    int letter = 65; // ASCII value for A
    foreach (DataRow row in data.Select("", "JobDetailID"))
    {
        int currentID = (int)row["JobDetailID"];
        if (currentID != jobDetailID)
        {
            // New group: start again at A
            letter = 65;
            jobDetailID = currentID;
        }
        row["CalculatedID"] = currentID.ToString() + (char)letter;
        letter++;
    }
}
You tagged this as LINQ, but you are using iterative methods. Probably the best way to do this would be to use a combination of both, iterating over each "grouping" and assigning the calculated ID for each row in the grouping.
foreach (var groupRows in data.AsEnumerable().GroupBy(d => d["JobDetailID"].ToString()))
{
if(string.IsNullOrEmpty(groupRows.Key))
continue;
// We now have each "grouping" of duplicate JobDetailIDs.
int x = 65; // ASCII value for A
foreach (var duplicate in groupRows)
{
string calcID = groupRows.Key + ((char)x++);
duplicate["CalculatedID"] = calcID;
//Can also do this and achieve same results.
//duplicate["CalculatedID"] = groupRows.Key + ((char)x++);
}
}
First thing you do is group on the column that's going to have duplicates. You're going to iterate over each of these groupings, and reset the suffix value for every grouping. For every row in the grouping, you're going to get the calculated ID (incrementing the suffix value at the same time) and assign the ID back to the duplicate row. As a side note, we're altering the items we're enumerating here, which is normally a bad thing. However, we're changing data that isn't associated with our enumeration declaration (GroupBy), so it will not alter the behavior of our enumeration.
This method gets the job done in a single pass. You can optimize it further if, for example, "JobDetailID" is an integer instead of a string, or if the DataTable is always receiving the data sorted by "JobDetailID" (you could get rid of the dictionary), but here's a draft:
private static void AddCalculatedID(DataTable data)
{
data.BeginLoadData();
try
{
var calculatedIDColumn = new DataColumn { DataType = typeof(string), ColumnName = "CalculatedID" };
data.Columns.Add(calculatedIDColumn);
data.Columns["CalculatedID"].SetOrdinal(0);
var jobDetails = new Dictionary<string, int>(data.Rows.Count);
foreach (DataRow row in data.Rows)
{
var jobDetailID = row["JobDetailID"].ToString();
int lastSuffix;
if (jobDetails.TryGetValue(jobDetailID, out lastSuffix))
{
lastSuffix++;
}
else
{
// ASCII value for A
lastSuffix = 65;
}
row["CalculatedID"] = jobDetailID + (char)lastSuffix;
jobDetails[jobDetailID] = lastSuffix;
}
}
finally
{
data.EndLoadData();
}
}