Count the number of occurrences of a string in a row - C#

I have a combobox which contains different values:
public static DataTable GetStates()
{
DataTable myStates = new DataTable();
myStates.Columns.Add("Name", typeof(string));
myStates.Columns.Add("Location", typeof(string));
myStates.Rows.Add("1", "USA");
myStates.Rows.Add("2", "USA");
myStates.Rows.Add("3", "Canada");
return myStates;
}
I want this to be in a BindStates function which will count the number of occurrences of "USA" and so on.

public void BindStates(DataTable states)
{
int numberUsa = 0;
foreach (DataRow row in states.Rows)
{
if (row[1].ToString() == "USA")
{
numberUsa++;
}
}
Console.WriteLine(numberUsa.ToString());
}

You can do this by using LINQ against the DataTable. What I'm doing is creating an IGrouping that is grouped by the second object in the DataRow's ItemArray. The second step is just to count how many items are in the "USA" group.
// Get datatable with data
var states = GetStates();
// Create grouping of rows
var grouping = states.AsEnumerable().GroupBy(row => row.ItemArray[1]).ToList();
// Count how many rows are in the "USA" group
var numOfUSA = grouping.First(group => (string)group.Key == "USA").Count();
Obviously this code can be improved with null checking etc, this is just to get you started. Good Luck!
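If you only need the count for a single value such as "USA", a direct Count is arguably even simpler; a minimal sketch using the same GetStates table:
// Count rows whose Location column equals "USA" without building groups first.
var states = GetStates();
int numberUsa = states.AsEnumerable()
    .Count(row => row.Field<string>("Location") == "USA");
Console.WriteLine(numberUsa); // prints 2 for the sample data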

Related

Validation in DataTable C#

I have a DataTable with 20 columns and 25000 Rows. There is a column called URL and a column Language.
I need to ensure that all same URLs have the same Language.
Presently I have achieved this with the following steps:
Get all distinct (unique) URLs
Create a foreach loop over the URLs and create a DataView (filtered on the URL)
Now in the DataView I can check whether all values in the Language column are the same.
List<string> all_Distinct_Urls = helperFunction.DataTableToList(master_table, "URL");
foreach (var url in all_Distinct_Urls)
{
if (!string.IsNullOrEmpty(url))
{
DataView dv = new DataView(master_table);
dv.RowFilter = "[URL] = '" + url + "'";
DataTable temp_MasterTable = dv.ToTable();
List<string> all_languages = helperFunction.DataTableToList(temp_MasterTable, "Language");
if (all_languages.Count > 1)
{
Assert.Fail();
}
}
}
public List<string> DataTableToList(DataTable masterDataTable, string columnName, bool isDistinct = true)
{
List<string> list = new List<string>();
foreach (DataRow dataRow in masterDataTable.Rows)
{
string ID = dataRow[columnName].ToString().Trim();
list.Add(ID);
}
if (isDistinct)
{
list = list.Distinct().ToList();
}
return list;
}
But the problem is that this is consuming a lot of time, given the number of rows and columns. Is there a faster way to achieve this?
I would use LINQ. I'm sure this approach will be a lot faster:
var invalidUrlLanguageGroups = master_table.AsEnumerable()
.GroupBy(r => r.Field<string>("Url"))
.Where(g => g.Select(r => r.Field<string>("Language")).Distinct().Skip(1).Any())
.ToList();
It groups by the URL, then selects all distinct languages and checks whether there is more than one.
Testcase:
var master_table = new DataTable();
master_table.Columns.Add("Url");
master_table.Columns.Add("Language");
master_table.Rows.Add("/en-us/sample-page1", "english");
master_table.Rows.Add("/en-us/sample-page1", "german"); // fail
master_table.Rows.Add("/de-de/sample-page2", "german");
master_table.Rows.Add("/en-de/sample-page2", "english");
Note that the query collects all invalid URLs and their DataRows. If you want an even more efficient query that only determines whether there is at least one (to make the test fail), use:
bool anyInvalidUrlLanguageGroups = master_table.AsEnumerable()
.GroupBy(r => r.Field<string>("Url"))
.Any(g => g.Select(r => r.Field<string>("Language")).Distinct().Skip(1).Any());
How about if I want to validate that all columns are the same, not
just the Language column? So if the URL is the same, then all column values should be the same.
Well, then this method would be helpful to check whether all columns (for each URL group) are equal. You can use it in many other cases too, so it would be a good candidate for an extension:
public static bool AllItemsEqual<T>(IEnumerable<IEnumerable<T>> allSequences, IEqualityComparer<T> comparer = null)
{
if (comparer == null) comparer = EqualityComparer<T>.Default;
IEnumerable<T> first = null;
foreach(IEnumerable<T> items in allSequences)
{
if (first == null)
first = items;
else
{
if (!items.SequenceEqual(first, comparer))
return false;
}
}
return true;
}
You would then use it in this way:
List<string> columnsExceptUrl = master_table.Columns.Cast<DataColumn>()
.Select(c => c.ColumnName)
.Where(n => n != "Url")
.ToList();
var urlRowsWithDifferentColumns = master_table.AsEnumerable()
.GroupBy(r => r.Field<string>("Url"))
.Where(g => !AllItemsEqual(g.Select(r => columnsExceptUrl.Select(c => r[c].ToString()))))
.ToList();
Again, if you just want to know whether it fails, you can make it more efficient:
bool anyUrlRowsWithDifferentColumns = master_table.AsEnumerable()
.GroupBy(r => r.Field<string>("Url"))
.Any(g => !AllItemsEqual(g.Select(r => columnsExceptUrl.Select(c => r[c].ToString()))));
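As a follow-up to the "good candidate for an extension" remark above: here is a sketch of what the extension-method form could look like (my wrapper around the same logic, not part of the original answer), so the check reads fluently inside the LINQ query. It needs System.Linq and System.Collections.Generic.
public static class SequenceExtensions
{
    // Same logic as AllItemsEqual above, exposed as an extension method.
    public static bool AllItemsEqual<T>(this IEnumerable<IEnumerable<T>> allSequences,
        IEqualityComparer<T> comparer = null)
    {
        comparer = comparer ?? EqualityComparer<T>.Default;
        IEnumerable<T> first = null;
        foreach (IEnumerable<T> items in allSequences)
        {
            if (first == null)
                first = items;
            else if (!items.SequenceEqual(first, comparer))
                return false;
        }
        return true;
    }
}
// which lets you write:
// .Where(g => !g.Select(r => columnsExceptUrl.Select(c => r[c].ToString())).AllItemsEqual())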

C# Constructing a Dynamic Query From DataTable

Trying to generate a dynamic LINQ query based on a DataTable returned to me... The column names in the DataTable will change, but I will know which ones I want to total and which ones I want to group by.
I can get this to work with loops, writing the output to a variable and then recasting the parts back into a DataTable, but I'm hoping there is a more elegant way of doing this.
// C#
DataTable dt = new DataTable();
dt.Columns.Add("DynamicData1");
dt.Columns.Add("DynamicData2");
dt.Columns.Add("DynamicCount");
In this case the columns are LastName, FirstName, Age. I want to total Age by the LastName and FirstName columns (yes, both in the group by). So one of my parameters would specify GroupBy = LastName, FirstName and another TotalBy = Age. The next query may return different column names.
// Example rows
dt.Rows.Add("Smith", "John", 10);
dt.Rows.Add("Smith", "John", 11);
dt.Rows.Add("Smith", "Sarah", 8);
Given these different potential column names, I'm looking to generate a LINQ query that creates a generic group-by and total output.
Result:
LastName, FirstName, AgeTotal
Smith, John = 21
Smith, Sarah = 8
If you use a simple converter for LINQ you can achieve that easily.
Here is the quick data generation I did for the sample:
// create dummy table
var dt = new DataTable();
dt.Columns.Add("LastName", typeof(string));
dt.Columns.Add("FirstName", typeof(string));
dt.Columns.Add("Age", typeof(int));
// action to create easily the records
var addData = new Action<string, string, int>((ln, fn, age) =>
{
var dr = dt.NewRow();
dr["LastName"] = ln;
dr["FirstName"] = fn;
dr["Age"] = age;
dt.Rows.Add(dr);
});
// add 3 datarows records
addData("Smith", "John", 10);
addData("Smith", "John", 11);
addData("Smith", "Sarah", 8);
This is how to use my simple transformation class:
// create a linq version of the table
var lqTable = new LinqTable(dt);
// make the group by query
var groupByNames = lqTable.Rows.GroupBy(row => row["LastName"].ToString() + "-" + row["FirstName"].ToString()).ToList();
// for each group create a brand new linqRow
var linqRows = groupByNames.Select(grp =>
{
//get all items. so we can use first item for last and first name and sum the age easily at the same time
var items = grp.ToList();
// return a new linq row
return new LinqRow()
{
Fields = new List<LinqField>()
{
new LinqField("LastName",items[0]["LastName"].ToString()),
new LinqField("FirstName",items[0]["FirstName"].ToString()),
new LinqField("Age",items.Sum(item => Convert.ToInt32(item["Age"]))),
}
};
}).ToList();
// create a new LinqTable since it handles the DataTable format and transforms it directly
var finalTable = new LinqTable() { Rows = linqRows }.AsDataTable();
And finally, here are the custom classes that are used:
public class LinqTable
{
public LinqTable()
{
}
public LinqTable(DataTable sourceTable)
{
LoadFromTable(sourceTable);
}
public List<LinqRow> Rows = new List<LinqRow>();
public List<string> Columns
{
get
{
var columns = new List<string>();
if (Rows != null && Rows.Count > 0)
{
Rows[0].Fields.ForEach(field => columns.Add(field.Name));
}
return columns;
}
}
public void LoadFromTable(DataTable sourceTable)
{
sourceTable.Rows.Cast<DataRow>().ToList().ForEach(row => Rows.Add(new LinqRow(row)));
}
public DataTable AsDataTable()
{
var dt = new DataTable("data");
if (Rows != null && Rows.Count > 0)
{
Rows[0].Fields.ForEach(field =>
{
dt.Columns.Add(field.Name, field.DataType);
});
Rows.ForEach(row =>
{
var dr = dt.NewRow();
row.Fields.ForEach(field => dr[field.Name] = field.Value);
dt.Rows.Add(dr);
});
}
return dt;
}
}
public class LinqRow
{
public List<LinqField> Fields = new List<LinqField>();
public LinqRow()
{
}
public LinqRow(DataRow sourceRow)
{
sourceRow.Table.Columns.Cast<DataColumn>().ToList().ForEach(col => Fields.Add(new LinqField(col.ColumnName, sourceRow[col], col.DataType)));
}
public object this[int index]
{
get
{
return Fields[index].Value;
}
set
{
Fields[index].Value = value;
}
}
public object this[string name]
{
get
{
return Fields.Find(f => f.Name == name).Value;
}
set
{
var fieldIndex = Fields.FindIndex(f => f.Name == name);
if (fieldIndex >= 0)
{
Fields[fieldIndex].Value = value;
}
}
}
public DataTable AsSingleRowDataTable()
{
var dt = new DataTable("data");
if (Fields != null && Fields.Count > 0)
{
Fields.ForEach(field =>
{
dt.Columns.Add(field.Name, field.DataType);
});
var dr = dt.NewRow();
Fields.ForEach(field => dr[field.Name] = field.Value);
dt.Rows.Add(dr);
}
return dt;
}
}
public class LinqField
{
public Type DataType;
public object Value;
public string Name;
public LinqField(string name, object value, Type dataType)
{
DataType = dataType;
Value = value;
Name = name;
}
public LinqField(string name, object value)
{
DataType = value.GetType();
Value = value;
Name = name;
}
public override string ToString()
{
return Value.ToString();
}
}
I think I'd just use a dictionary:
public Dictionary<string, int> GroupTot(DataTable dt, string[] groupBy, string tot){
var d = new Dictionary<string, int>();
foreach(DataRow ro in dt.Rows){
string key = "";
foreach(string col in groupBy)
key += (string)ro[col] + '\n';
if(!d.ContainsKey(key))
d[key] = 0;
d[key]+= (int)ro[tot];
}
return d;
}
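A quick usage sketch for the dictionary approach, assuming the sample table built earlier in this answer thread (where Age is typed as int); the '\n'-joined keys are only split back apart here for display:
// Group by last and first name, totalling the Age column.
Dictionary<string, int> totals = GroupTot(dt, new[] { "LastName", "FirstName" }, "Age");
foreach (KeyValuePair<string, int> pair in totals)
{
    // keys look like "Smith\nJohn\n", so trim and replace the separator when printing
    Console.WriteLine("{0} = {1}", pair.Key.TrimEnd('\n').Replace("\n", ", "), pair.Value);
}
// Expected output:
// Smith, John = 21
// Smith, Sarah = 8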
If you want the total on each row, we could get cute and create a column that is an array of one int instead of an int:
public void GroupTot(DataTable dt, string[] groupBy, string tot){
var d = new Dictionary<string, int>();
var dc = dt.Columns.Add("Total_" + tot, typeof(int[]));
foreach(DataRow ro in dt.Rows){
string key = "";
foreach(string col in groupBy)
key += (string)ro[col] + '\n'; //build a grouping key from first and last name
if(!d.ContainsKey(key)) //have we seen this name pair before?
d[key] = new int[1]; //no we haven't, ensure we have a tracker for our total, for this first+last name
d[key][0] += (int)ro[tot]; //add the total
ro[dc] = d[key]; //link the row to the total tracker
}
}
At the end of the operation every row will have an array of int in the "Total_Age" column that represents the total for that First+Last name. The reason I used int[] rather than int is that int is a value type, whereas int[] is a reference type. As the table is iterated, each row gets assigned a reference to an int[], and rows with the same First+Last name end up with their references pointing to the same object in memory, so incrementing a later one also increments all the earlier ones (all "John Smith" rows' total column hold a reference to the same int[]). If we had made the column an int type, every row would point to a different counter, because every time we say ro[dc] = d[key] it would copy the current value of d[key] into ro[dc]. Any reference type would do for this trick to work, but value types wouldn't. If you wanted the column to be a value type, you would have to iterate the table again, or keep a second dictionary that maps DataRow -> total and iterate its keys, assigning the totals back into the rows.
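A tiny standalone sketch of the value-type vs. reference-type point made above (nothing DataTable-specific, just illustrating why the shared int[] keeps a running total while a plain int would not):
// Reference type: several "rows" can share ONE counter object.
int[] counter = new int[1];
object rowA = counter;               // both variables hold a reference
object rowB = counter;               // to the very same array
counter[0] += 10;
counter[0] += 11;
Console.WriteLine(((int[])rowA)[0]); // 21
Console.WriteLine(((int[])rowB)[0]); // 21 - same object, same running total
// Value type: assignment copies the value at that moment.
int total = 10;
object rowC = total;                 // boxes a copy of 10
total += 11;
Console.WriteLine((int)rowC);        // still 10 - the copy does not follow later changes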

C# - Looking for the list of duplicated rows (need optimization)

Please, I would like to optimize this code in C#, if possible.
When there are fewer than 1,000 rows, it's fine. But when we have at least 10,000, it starts to take some time...
Here is a little benchmark:
5000 lines => ~2s
15000 lines => ~20s
25000 lines => ~50s
Indeed, I'm looking for duplicated rows.
The SequenceEqual method used to check the values may be a problem (in my "benchmark", I have 4 fields considered as "keyField"...).
Here is the code:
private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
List<List<object>> duplicatedKeys = new List<List<object>>(); // List of duplicated keys values
List<DataRow> duplicatedRows = new List<DataRow>(); // Rows that are duplicated
foreach (DataRow row in table.Rows)
{
// Find keys fields values for the row
List<object> rowKeys = new List<object>();
keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));
// Check if those keys are already defined
bool alreadyDefined = false;
foreach (List<object> keyValue in keys.Keys)
{
if (rowKeys.SequenceEqual(keyValue))
{
alreadyDefined = true;
break;
}
}
if (alreadyDefined)
{
duplicatedRows.Add(row);
// If first duplicate for this key, add the first occurence of this key
if (!duplicatedKeys.Contains(rowKeys))
{
duplicatedKeys.Add(rowKeys);
int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
duplicatedRows.Add(table.Rows[i]);
}
}
else
{
keys.Add(rowKeys, table.Rows.IndexOf(row));
}
}
return duplicatedRows;
}
Any ideas?
I think this is the fastest and shortest way to find duplicate rows:
For 100,000 rows it executes in about 250 ms.
Main and test data:
static void Main(string[] args)
{
var dt = new DataTable();
dt.Columns.Add("Id");
dt.Columns.Add("Value1");
dt.Columns.Add("Value2");
var rnd = new Random(DateTime.Now.Millisecond);
for (int i = 0; i < 100000; i++)
{
var dr = dt.NewRow();
dr[0] = rnd.Next(1, 1000);
dr[1] = rnd.Next(1, 1000);
dr[2] = rnd.Next(1, 1000);
dt.Rows.Add(dr);
}
Stopwatch sw = new Stopwatch();
sw.Start();
var duplicates = GetDuplicateRows(dt, "Id", "Value1", "Value2");
sw.Stop();
Console.WriteLine(
"Found {0} duplicates in {1} miliseconds.",
duplicates.Count,
sw.ElapsedMilliseconds);
Console.ReadKey();
}
GetDuplicateRows with LINQ:
private static List<DataRow> GetDuplicateRows(DataTable table, params string[] keys)
{
var duplicates =
table
.AsEnumerable()
.GroupBy(dr => String.Join("-", keys.Select(k => dr[k])), (groupKey, groupRows) => new { Key = groupKey, Rows = groupRows })
.Where(g => g.Rows.Count() > 1)
.SelectMany(g => g.Rows)
.ToList();
return duplicates;
}
Explanation (for those who are new to LINQ):
The trickiest part is the GroupBy, I guess. Here I take a DataRow as the first parameter, and for each row I create a group key from the values of the specified keys, which I join to create a string like 1-1-2. The second parameter then just selects the group key and the group rows into a new anonymous object. Then I check whether there is more than 1 row and flatten the groups back into a list with SelectMany.
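As a side note (my own sketch, not part of the answer above): when the key columns are fixed at compile time, an anonymous-type key avoids the small risk of collisions that joining values with "-" can introduce (e.g. "1" + "1-2" versus "1-1" + "2"):
// Group on the three known columns directly instead of a joined string key.
var duplicates = dt.AsEnumerable()
    .GroupBy(dr => new { Id = dr["Id"], Value1 = dr["Value1"], Value2 = dr["Value2"] })
    .Where(g => g.Count() > 1)
    .SelectMany(g => g)
    .ToList();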
Try this. Using more LINQ should improve performance; you could also try PLINQ if possible.
Regards
private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // List of key values + their index in table
List<List<object>> duplicatedKeys = new List<List<object>>(); // List of duplicated keys values
List<DataRow> duplicatedRows = new List<DataRow>(); // Rows that are duplicated
foreach (DataRow row in table.Rows)
{
// Find keys fields values for the row
List<object> rowKeys = new List<object>();
keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));
// Check if those keys are already defined
bool alreadyDefined = keys.Keys.Any(keyValue => rowKeys.SequenceEqual(keyValue));
if (alreadyDefined)
{
duplicatedRows.Add(row);
// If first duplicate for this key, add the first occurence of this key
if (!duplicatedKeys.Contains(rowKeys))
{
duplicatedKeys.Add(rowKeys);
int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
duplicatedRows.Add(table.Rows[i]);
}
}
else
{
keys.Add(rowKeys, table.Rows.IndexOf(row));
}
}
return duplicatedRows;
}

Re-arrange rows of a DataTable at runtime

I have multiple rows in a DataTable; see a sample below:
Existing Table
Name     Date        Value   Type
ABC(I)   11/11/2013  12.36   I
DEF(I)   11/11/2013  1       I
GHI(I)   -do-        -do-    I
JKL(P)                       P
MNO(P)                       P
PQR(D)                       D
STU(D)   -do-        -do-    D
Required Table
Name     Date        Value   Type
JKL(P)                       P
MNO(P)                       P
PQR(D)                       D
STU(D)   -do-        -do-    D
ABC(I)   11/11/2013  12.36   I
DEF(I)   11/11/2013  1       I
GHI(I)   -do-        -do-    I
Condition to use
Sorting should be as per the Type column. Now I need a small change in the order of the rows shown in the GridView: rows of type Payment (P) should come first, then all Dues (D), and finally all Interest (I) types.
What I tried:
Sorting the column, but it was not what I need.
Custom grouping suggested by Tim Schmelter here
Code was:
public DataTable GroupBy(string i_sGroupByColumn, string i_sAggregateColumn, DataTable i_dSourceTable)
{
DataView dv = new DataView(i_dSourceTable);
//getting distinct values for group column
DataTable dtGroup = dv.ToTable(true, new string[] { i_sGroupByColumn });
//adding column for the row count
dtGroup.Columns.Add("Count", typeof(int));
//looping thru distinct values for the group, counting
foreach (DataRow dr in dtGroup.Rows) {
dr["Count"] = i_dSourceTable.Compute("Count(" + i_sAggregateColumn + ")", i_sGroupByColumn + " = '" + dr[i_sGroupByColumn] + "'");
}
//returning grouped/counted result
return dtGroup;
}
I don't know where or what I am lacking/missing. Kindly help.
Try LINQ to order your table:
var query = dtGroup.AsEnumerable()
.OrderBy(c=> c.Field<DateTime?>("Date"))
.ThenByDescending(c=> c.Field<string>("Name"));
DataView dv2 = query.AsDataView();
If I understand correctly, you want to sort first on P, D, I and then on the Name column:
Dictionary<string, int> sortDictionary = new Dictionary<string, int>();
sortDictionary.Add("P", 1);
sortDictionary.Add("D", 2);
sortDictionary.Add("I", 3);
var q = from row in dtGroup.AsEnumerable()
let type = sortDictionary[row.Field<string>("Name").Substring(4, 1)]
orderby type, row.Field<string>("Name")
select row;
foreach (var r in q)
{
string x = r["Name"].ToString() + r["Date"].ToString();
}
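If the re-ordered rows need to go back into a DataTable (for example to bind the GridView mentioned in the question), CopyToDataTable can materialize the query; a small sketch based on the query above, with a hypothetical GridView name:
// Materialize the ordered rows into a new DataTable.
DataTable ordered = q.CopyToDataTable();
// Hypothetical binding, assuming a GridView control named myGridView:
// myGridView.DataSource = ordered;
// myGridView.DataBind();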

How to count and sum total of DataTable with LINQ?

I have a DataTable which has an "amount" column in each row, and I'd like to have the total sum of all the rows. I'd also like to get the total number of rows in the DataTable. Could anyone show me how to do this with LINQ instead of the ordinary way?
Number of rows:
DataTable dt; // ... populate DataTable
var count = dt.Rows.Count;
Sum of the "amount" column:
DataTable dt; // ... populate DataTable
var sum = dt.AsEnumerable().Sum(dr => dr.Field<int>("amount"));
Aggregate allows you to avoid enumerating the rows twice (you could get the row count from the Rows collection, but this is more to show how to extract multiple aggregates in one pass):
var sumAndCount = table.AsEnumerable().Aggregate(new { Sum = 0d, Count = 0},
(data, row) => new { Sum = data.Sum + row.Field<double>("amount"), Count = data.Count + 1});
double sum = sumAndCount.Sum;
int count = sumAndCount.Count;
decimal[] Amount = {2,3,5 };
var sum = Amount.Sum();
var count = Amount.Count();
Based on Roy Goode's answer, you could also create an extension method:
public static int Sum(this DataTable table, string Column)
{
return table.AsEnumerable().Sum(dr => dr.Field<int>(Column));
}
Unfortunately you can't be more generic here, because there is no "where T : numeric" constraint in C#.
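One possible workaround (my own sketch, not from the answer above) is to let the caller supply the conversion, so the extension is not tied to int columns; the method name SumColumn is just a placeholder:
// Sums any numeric column by converting each cell with the supplied (or default) converter.
public static decimal SumColumn(this DataTable table, string column, Func<object, decimal> convert = null)
{
    convert = convert ?? (value => value == DBNull.Value ? 0m : Convert.ToDecimal(value));
    return table.AsEnumerable().Sum(dr => convert(dr[column]));
}
// Usage:
// decimal total = dt.SumColumn("amount");               // relies on Convert.ToDecimal
// decimal total2 = dt.SumColumn("amount", v => (int)v); // explicit cast if the column is int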
