I have a list of this model:
public class Contact
{
public string MobileNumber {get;set;}
public string PhoneNumber2 {get;set;}
}
I have a method that compares this list against a list of phone numbers and returns the non-matching values:
private List<ContactDto> GetNewContactsNotFoundInCrm(ContactPostModel model)
{
var duplicates = GetAllNumbers(); // Returns a list of 5 million phone number records
var mobile = duplicates.Select(x => x.MobilePhone).ToList();
var telephone2 = duplicates.Select(x => x.Telephone2).ToList();
// I'm trying to compare Telephone2 and MobilePhone against the
// duplicates list of 5 million numbers. It works, but it's slow
// and can take over a minute searching for around 5000 numbers.
return model.Contacts
.Where(y =>
!mobile.Contains(y.Phonenumber.ToPhoneNumber()) &&
!telephone2.Contains(y.Phonenumber.ToPhoneNumber()) &&
!mobile.Contains(y.Phonenumber2.ToPhoneNumber()) &&
!telephone2.Contains(y.Phonenumber2.ToPhoneNumber()))
.ToList();
}
// Extension method used
public static string ToPhoneNumber(this string phoneNumber)
{
if (phoneNumber == null || phoneNumber == string.Empty)
return string.Empty;
return phoneNumber.Replace("(", "").Replace(")", "")
.Replace(" ", "").Replace("-", "");
}
What data structure can I use to compare the Mobile and Telephone2 to the list of 5 million numbers for better performance?
Creating a HashSet will probably solve your problems:
var mobile = new HashSet<string>(duplicates.Select(x => x.MobilePhone));
var telephone2 = new HashSet<string>(duplicates.Select(x => x.Telephone2));
There are other performance improvements you can make, but they'll be micro-optimizations compared to avoiding iterating over 5 million items for each number you check.
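For illustration, a sketch of the method with only the collection types changed (names as in the question). HashSet<string>.Contains is an average O(1) hash lookup instead of a scan over all 5 million entries:
private List<ContactDto> GetNewContactsNotFoundInCrm(ContactPostModel model)
{
    var duplicates = GetAllNumbers();
    // Build the sets once per call; the rest of the filter is unchanged.
    var mobile = new HashSet<string>(duplicates.Select(x => x.MobilePhone));
    var telephone2 = new HashSet<string>(duplicates.Select(x => x.Telephone2));

    return model.Contacts
        .Where(y =>
            !mobile.Contains(y.Phonenumber.ToPhoneNumber()) &&
            !telephone2.Contains(y.Phonenumber.ToPhoneNumber()) &&
            !mobile.Contains(y.Phonenumber2.ToPhoneNumber()) &&
            !telephone2.Contains(y.Phonenumber2.ToPhoneNumber()))
        .ToList();
}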
You can use Enumerable.Except.
Enumerable.Except uses a HashSet internally to improve lookup performance. You want to use the overload which allows you to pass in a custom IEqualityComparer as an argument.
Also note that ToList() is a LINQ query finalizer, meaning ToList() executes the LINQ expression immediately, which results in a complete iteration over the collection. LINQ's power is that queries are executed deferred, which can improve performance significantly. All sub-queries (whether chained or split up into separate statements) are merged into one single iteration using yield return internally:
Good performance:
// Instead of two iterations, LINQ will defer both
// iterations and merge them into a single iteration
var filtered = collection.Where(); // Deferred iteration #1
var projected = filtered.Select(); // Deferred iteration #2
var results = projected.ToList(); // Results in one single iteration
Bad performance:
// LINQ will execute each iteration immediately
// resulting in two complete iterations
var filtered = collection.Where().ToList(); // Executed iteration #1
var projected = filtered.Select().ToList(); // Executed iteration #2
Avoid calling a finalizer before you actually want to execute the query:
// Executes deferred
var mobile = duplicates.Select(x => x.MobilePhone);
Instead of:
// Executes immediately
var mobile = duplicates.Select(x => x.MobilePhone).ToList();
Also note that each Enumerable.Contains executes a separate iteration. Contains is a finalizer and will execute immediately:
return model.Contacts
.Where(y =>
!mobile.Contains(y.Phonenumber.ToPhoneNumber()) // Iteration #1
&& !telephone2.Contains(y.Phonenumber.ToPhoneNumber()) // Iteration #2
&& !mobile.Contains(y.Phonenumber2.ToPhoneNumber()) // Iteration #3
&& !telephone2.Contains(y.Phonenumber2.ToPhoneNumber())) // Iteration #4
.ToList(); // Iteration #5
In the worst case that is n contacts * 4 Contains calls * 5*10^6 reference elements in mobile and telephone2, just for the comparison!
Enumerable.Except
ContactEqualityComparer.cs
class ContactEqualityComparer : IEqualityComparer<Contact>
{
public bool Equals(Contact contact1, Contact contact2)
{
if (ReferenceEquals(contact1, contact2))
return true;
else if (ReferenceEquals(contact1, null) || ReferenceEquals(contact2, null))
return false;
else if (contact1.MobileNumber.Equals(contact2.MobileNumber, StringComparison.OrdinalIgnoreCase)
&& contact1.PhoneNumber2.Equals(contact2.PhoneNumber2, StringComparison.OrdinalIgnoreCase))
return true;
else
return false;
}
// Will be used by Enumerable.Except to generate item keys
// for the lookup table
public int GetHashCode(Contact contact)
{
unchecked
{
return ((contact.MobileNumber != null
? contact.MobileNumber.GetHashCode()
: 0) * 397) ^ (contact.PhoneNumber2 != null
? contact.PhoneNumber2.GetHashCode()
: 0);
}
}
}
Contact.cs
Consider using two properties for each value: one for computations and one for display, e.g. MobileNumber and MobileNumberDisplay. The computation properties should ideally be numeric. (A sketch of this split follows the class below.)
public class Contact
{
private string mobileNumber;
public string MobileNumber
{
get => this.mobileNumber;
set => this.mobileNumber = ToPhoneNumber(value);
}
private string phoneNumber2;
public string PhoneNumber2
{
get => this.phoneNumber2;
set => this.phoneNumber2 = ToPhoneNumber(value);
}
public string ToPhoneNumber(string phoneNumber)
{
if (phoneNumber == null || phoneNumber == string.Empty)
return string.Empty;
return phoneNumber.Replace("(", "").Replace(")", "")
.Replace(" ", "").Replace("-", "");
}
}
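A sketch of the display/computation split suggested above: an extra property keeps the original formatting for display, while MobileNumber (whose setter already normalizes via ToPhoneNumber) remains the value used for comparisons. MobileNumberDisplay is only a suggested name, and MobileNumber is kept as a string here to match the class above rather than switching to a numeric type:
// Additional property for the Contact class above (sketch).
private string mobileNumberDisplay;
public string MobileNumberDisplay
{
    get => this.mobileNumberDisplay;
    set
    {
        this.mobileNumberDisplay = value;   // keep e.g. "(555) 123-4567" for display
        this.MobileNumber = value;          // MobileNumber's setter normalizes the value
    }
}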
Example
private List<Contact> GetNewContactsNotFoundInCrm(ContactPostModel model)
{
List<Contact> duplicates = GetAllNumbers();
return model.Contacts
.Except(duplicates, new ContactEqualityComparer())
.ToList();
}
One good option is to remove the need for the phone number conversion (the call to ToPhoneNumber) on every iteration step, by storing the mobile and telephone2 numbers in the same normalized format as the numbers you compare against.
The other improvement is to cache the mobile and telephone2 collections. You can move their calculation outside of the GetNewContactsNotFoundInCrm method and refresh them only when the underlying data changes.
Finally, consider using a HashSet to remove duplicates and make the lookups fast.
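A rough sketch of the last two points (the field and method names here are assumptions, not from the original code): build the normalized sets once, outside the comparison method, and rebuild them only when the CRM data changes.
// Cached, pre-normalized lookup sets; rebuild only when the source data changes.
private HashSet<string> cachedMobileNumbers;
private HashSet<string> cachedTelephone2Numbers;

private void RefreshNumberCache()
{
    var duplicates = GetAllNumbers();
    cachedMobileNumbers = new HashSet<string>(duplicates.Select(x => x.MobilePhone));
    cachedTelephone2Numbers = new HashSet<string>(duplicates.Select(x => x.Telephone2));
}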
Side note:
If the numbers live in a database, consider moving this logic into a SQL stored procedure.
Related
I have run a profiler on my .NET winforms app (compiled with .NET 4.7.1) and it is pointing at the following function as consuming 73% of my application's CPU time, which seems like far too much for a simple utility function:
public static bool DoesRecordExist(string keyColumn1, string keyColumn2, string keyColumn3,
string keyValue1, string keyValue2, string keyValue3, DataTable dt)
{
if (dt != null && dt.Rows.Count > 0) {
bool exists = dt.AsEnumerable()
.Where(r =>
string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase)
)
.Any();
return exists;
} else {
return false;
}
}
The purpose of this function is to pass in some key column names and matching key values, and check whether any matching record exists in the in-memory C# DataTable.
My app is processing hundreds of thousands of records and for each record, this function must be called multiple times. The app is doing a lot of inserts, and before any insert, it must check whether that record already exists in the database. I figured that an in-memory check against the DataTable would be much faster than going back to the physical database each time, so that's why I'm doing this in-memory check. Each time I do a database insert, I do a corresponding insert into the DataTable, so that subsequent checks as to whether the record exists will be accurate.
So to my question: Is there a faster approach? (I don't think I can avoid checking for record existence each and every time, else I'll end up with duplicate inserts and key violations.)
EDIT #1
In addition to trying the suggestions that have been coming in, which I'm trying now, it occurred to me that I should also maybe do the .AsEnumerable() only once and pass in the EnumerableRowCollection<DataRow> instead of the DataTable. Do you think this will help?
EDIT #2
I just did a controlled test and found that querying the database directly to see if a record already exists is dramatically slower than doing an in-memory lookup.
You should try parallel execution; this is a good candidate for it since you mentioned you are working with a huge set, and no ordering is needed if you just want to check whether a record already exists.
bool exists = dt.AsEnumerable().AsParallel().Any(r =>
    string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
    string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
    string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase));
Your solution finds all occurrences for which the condition evaluates to true and then asks whether there are any. Instead, use Any directly: replace Where with Any. It will stop processing as soon as the condition first evaluates to true.
bool exists = dt.AsEnumerable().Any(r => condition);
I suggest keeping the key columns of the existing records in a HashSet. I'm using tuples here, but you could also create your own Key struct or class by overriding GetHashCode and Equals.
private HashSet<(string, string, string)> _existingKeys =
new HashSet<(string, string, string)>();
Then you can test the existence of a key very quickly with
if (_existingKeys.Contains((keyValue1, keyValue2, keyValue3))) {
...
}
Don't forget to keep this HashSet in sync with your additions and deletions. Note that tuples cannot be compared with CurrentCultureIgnoreCase. Therefore either convert all the keys to lower case, or use the custom struct approach where you can use the desired comparison method.
public readonly struct Key
{
public Key(string key1, string key2, string key3) : this()
{
Key1 = key1?.Trim() ?? "";
Key2 = key2?.Trim() ?? "";
Key3 = key3?.Trim() ?? "";
}
public string Key1 { get; }
public string Key2 { get; }
public string Key3 { get; }
public override bool Equals(object obj)
{
if (!(obj is Key)) {
return false;
}
var key = (Key)obj;
return
String.Equals(Key1, key.Key1, StringComparison.CurrentCultureIgnoreCase) &&
String.Equals(Key2, key.Key2, StringComparison.CurrentCultureIgnoreCase) &&
String.Equals(Key3, key.Key3, StringComparison.CurrentCultureIgnoreCase);
}
public override int GetHashCode()
{
int hashCode = -2131266610;
unchecked {
hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key1);
hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key2);
hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key3);
}
return hashCode;
}
}
Another question is whether it is a good idea to use the current culture when comparing db keys. Users with different cultures might get different results. Better explicitly specify the same culture used by the db.
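For illustration, a sketch of how the struct above could be used and kept in sync with the inserts; the Add call after each database insert is what keeps the set accurate:
// Build once from the existing data, then query and maintain it.
var existingKeys = new HashSet<Key>();

// ... fill from the DataTable / database once at startup ...

// Fast existence check before an insert:
var candidate = new Key(keyValue1, keyValue2, keyValue3);
if (!existingKeys.Contains(candidate))
{
    // do the database insert (and the DataTable insert), then record the new key
    existingKeys.Add(candidate);
}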
It might be that you want to transpose your data structure. Instead of having a DataTable where each row has keyColumn1, keyColumn2 and keyColumn3, have 3 HashSet<string>, where the first contains all of the keyColumn1 values, etc.
Doing this should be a lot faster than iterating through each of the rows:
var hashSetColumn1 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn1])),
    StringComparer.CurrentCultureIgnoreCase);
var hashSetColumn2 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn2])),
    StringComparer.CurrentCultureIgnoreCase);
var hashSetColumn3 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn3])),
    StringComparer.CurrentCultureIgnoreCase);
Obviously, create these once, and then maintain them (as you're currently maintaining your DataTable). They're expensive to create, but cheap to query.
Then:
bool exists = hashSetColumn1.Contains(keyValue1) &&
hashSetColumn2.Contains(keyValue2) &&
hashSetColumn3.Contains(keyValue3);
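To maintain the sets alongside the DataTable (as mentioned above), add the key values whenever you insert a new record, for example:
// Sketch: after inserting into the database and the DataTable,
// add the key values to the sets so subsequent lookups stay accurate.
hashSetColumn1.Add(keyValue1);
hashSetColumn2.Add(keyValue2);
hashSetColumn3.Add(keyValue3);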
Alternatively (and more cleanly), you can define your own struct which contains values from the 3 columns, and use a single HashSet:
public struct Row : IEquatable<Row>
{
// Convenience
private static readonly IEqualityComparer<string> comparer = StringComparer.CurrentCultureIgnoreCase;
public string Value1 { get; }
public string Value2 { get; }
public string Value3 { get; }
public Row(string value1, string value2, string value3)
{
Value1 = value1;
Value2 = value2;
Value3 = value3;
}
public override bool Equals(object obj) => obj is Row row && Equals(row);
public bool Equals(Row other)
{
return comparer.Equals(Value1, other.Value1) &&
comparer.Equals(Value2, other.Value2) &&
comparer.Equals(Value3, other.Value3);
}
public override int GetHashCode()
{
unchecked
{
int hash = 17;
hash = hash * 23 + comparer.GetHashCode(Value1);
hash = hash * 23 + comparer.GetHashCode(Value2);
hash = hash * 23 + comparer.GetHashCode(Value3);
return hash;
}
}
public static bool operator ==(Row left, Row right) => left.Equals(right);
public static bool operator !=(Row left, Row right) => !(left == right);
}
Then you can make a:
var hashSet = new HashSet<Row>(dt.AsEnumerable().Select(x => new Row(SafeTrim(x[keyColumn1]), SafeTrim(x[keyColumn2]), SafeTrim(x[keyColumn3]))));
And cache that. Query it like:
hashSet.Contains(new Row(keyValue1, keyValue2, keyValue3));
In some cases LINQ won't optimize as well as a plain sequential loop, so you might be better off writing the query the old-fashioned way:
public static bool DoesRecordExist(string keyColumn1, string keyColumn2, string keyColumn3,
string keyValue1, string keyValue2, string keyValue3, DataTable dt)
{
if (dt != null)
{
foreach (DataRow r in dt.Rows)
{
if (string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase))
{
return true;
}
}
}
return false;
}
There might also be more structural improvements, depending on whether your situation allows them.
Option 1: Making the selection already in the database
You are using a DataTable, so there is a chance that you fetch the data from the database. If you have a lot of records, then it might make more sense to move this check to the database. With the proper indexes it might be way faster than an in-memory table scan.
Option 2: Replace string.Equals+SafeTrim with a custom method
You are using SafeTrim up to three times per row, which creates a lot of new strings. If you create your own method that compares both strings (string.Equals) while ignoring leading/trailing whitespace (what SafeTrim does), but without creating new strings, this could be much faster, reduce memory load, and reduce garbage collection. If the implementation is small enough to be inlined, you'll gain even more performance.
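A rough sketch of such a method (TrimmedEquals is a hypothetical name): it compares the trimmed region of the cell value against the key without allocating a trimmed copy. It assumes the cell value has already been converted to a string (e.g. via Convert.ToString(r[keyColumn])).
// Compare a cell value against a key, ignoring leading/trailing whitespace
// in the cell, without allocating a new trimmed string.
static bool TrimmedEquals(string cell, string key)
{
    if (cell == null) return string.IsNullOrEmpty(key);

    // Find the non-whitespace region of the cell without calling Trim().
    int start = 0, end = cell.Length - 1;
    while (start <= end && char.IsWhiteSpace(cell[start])) start++;
    while (end >= start && char.IsWhiteSpace(cell[end])) end--;
    int length = end - start + 1;

    if (string.IsNullOrEmpty(key)) return length == 0;
    if (length != key.Length) return false;

    // Compare only the trimmed region against the key.
    return string.Compare(cell, start, key, 0, length,
        StringComparison.CurrentCultureIgnoreCase) == 0;
}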
Option 3: Check the columns in the proper order
Make sure you use the proper order and pass the column that is least likely to match as keyColumn1. This makes the if-statement evaluate to false sooner. If keyColumn1 matches in 80% of the cases, you end up performing many more comparisons.
I am writing a small program that takes in a .csv file as input with about 45k rows. I am trying to compare the contents of this file with the contents of a table on a database (SQL Server through dynamics CRM using Xrm.Sdk if it makes a difference).
In my current program (which takes about 25 minutes to compare; the file and database are exactly the same here, both 45k rows with no differences), I have all existing records from the database in a DataCollection<Entity>, which inherits Collection<T> and IEnumerable<T>.
In my code below I am filtering using the Where method and then doing logic based on the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach than this? I am by no means a LINQ expert.
foreach (var record in inputDataLines)
{
var fields = record.Split(',');
var fund = fields[0];
var bps = Convert.ToDecimal(fields[1]);
var withdrawalPct = Convert.ToDecimal(fields[2]);
var percentile = Convert.ToInt32(fields[3]);
var age = Convert.ToInt32(fields[4]);
var bombOutTerm = Convert.ToDecimal(fields[5]);
var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
&& Convert.ToDecimal(r["field_2"]) == bps
&& Convert.ToDecimal(r["field_3"]) == withdrawalPct
&& Convert.ToDecimal(r["field_4"]) == percentile
&& Convert.ToDecimal(r["field_5"]) == age);
entitiesFound.AddRange(matchingRows);
if (matchingRows.Count() == 0)
{
rowsToAdd.Add(record);
}
else if (matchingRows.Count() == 1)
{
if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
{
rowsToUpdate.Add(record);
entitiesToUpdate.Add(matchingRows.First());
}
}
else
{
entitiesToDelete.AddRange(matchingRows);
rowsToAdd.Add(record);
}
}
EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.
Himbrombeere is right, you should execute the query first and put the result into a collection before you use Any, Count, AddRange or whatever method will execute the query again. In your code it's possible that the query is executed 5 times in every loop iteration.
Watch out for the term deferred execution in the documentation. If a method is implemented that way, it means the method can be used to construct a LINQ query (so you can chain it with other methods, and at the end you have a query). Only methods that don't use deferred execution, like Count, Any or ToList (or a plain foreach), will actually execute it. If you don't want the whole query to be executed every time and you need to access it multiple times, it's better to store the result in a collection (e.g. with ToList).
However, you could use a different approach which should be much more efficient: a Lookup<TKey, TElement>, which is similar to a dictionary and can be used with an anonymous type as the key:
var lookup = existingRecords.Entities.ToLookup(r => new
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct = Convert.ToDecimal(r["field_3"]),
    percentile = Convert.ToInt32(r["field_4"]),
    age = Convert.ToInt32(r["field_5"])
});
Now you can access this lookup in the loop very efficiently.
foreach (var record in inputDataLines)
{
var fields = record.Split(',');
var fund = fields[0];
var bps = Convert.ToDecimal(fields[1]);
var withdrawalPct = Convert.ToDecimal(fields[2]);
var percentile = Convert.ToInt32(fields[3]);
var age = Convert.ToInt32(fields[4]);
var bombOutTerm = Convert.ToDecimal(fields[5]);
var matchingRows = lookup[new {fund, bps, withdrawalPct, percentile, age}].ToList();
entitiesFound.AddRange(matchingRows);
if (matchingRows.Count() == 0)
{
rowsToAdd.Add(record);
}
else if (matchingRows.Count() == 1)
{
if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
{
rowsToUpdate.Add(record);
entitiesToUpdate.Add(matchingRows.First());
}
}
else
{
entitiesToDelete.AddRange(matchingRows);
rowsToAdd.Add(record);
}
}
Note that this will work even if the key does not exist(an empty list is returned).
Add a ToList after your Convert.ToDecimal(r["field_5"]) == age) line to force immediate execution of the query.
var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
&& Convert.ToDecimal(r["field_2"]) == bps
&& Convert.ToDecimal(r["field_3"]) == withdrawalPct
&& Convert.ToDecimal(r["field_4"]) == percentile
&& Convert.ToDecimal(r["field_5"]) == age)
.ToList();
The Where doesn't actually execute your query, it just prepares it. The actual execution happens later, in a deferred way. In your case that happens when calling Count, which itself will iterate the entire collection of items. And if the first condition fails, the second one is checked, leading to a second iteration of the complete collection when calling Count again. In this case you actually execute that query a third time when calling matchingRows.First().
By forcing immediate execution you execute the query only once, and thus iterate the entire collection only once, which decreases your overall time.
Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).
This is a (1) clean your data, (2) query/join your data, (3) process your data approach.
Do this:
(1)
var inputs =
inputDataLines
.Select(record =>
{
var fields = record.Split(',');
return new
{
fund = fields[0],
bps = Convert.ToDecimal(fields[1]),
withdrawalPct = Convert.ToDecimal(fields[2]),
percentile = Convert.ToInt32(fields[3]),
age = Convert.ToInt32(fields[4]),
bombOutTerm = Convert.ToDecimal(fields[5]),
record
};
})
.ToArray();
var entities =
existingRecords
.Entities
.Select(entity => new
{
fund = entity["field_1"].ToString(),
bps = Convert.ToDecimal(entity["field_2"]),
withdrawalPct = Convert.ToDecimal(entity["field_3"]),
percentile = Convert.ToInt32(entity["field_4"]),
age = Convert.ToInt32(entity["field_5"]),
bombOutTerm = Convert.ToDecimal(entity["field_6"]),
entity
})
.ToArray()
.GroupBy(x => new
{
x.fund,
x.bps,
x.withdrawalPct,
x.percentile,
x.age
}, x => new
{
x.bombOutTerm,
x.entity,
});
(2)
var query =
from i in inputs
join e in entities on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key
select new { input = i, matchingRows = e };
(3)
foreach (var x in query)
{
entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));
if (x.matchingRows.Count() == 0)
{
rowsToAdd.Add(x.input.record);
}
else if (x.matchingRows.Count() == 1)
{
if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
{
rowsToUpdate.Add(x.input.record);
entitiesToUpdate.Add(x.matchingRows.First().entity);
}
}
else
{
entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
rowsToAdd.Add(x.input.record);
}
}
I suspect this will be among the fastest approaches presented.
I am very new to LINQ queries. I have a set of records in a CSV like the one below:
ProdID,Name,Color,Availability
P01,Product1,Red,Yes
P02,Product2,Blue,Yes
P03,Product1,Yellow,No
P01,Product1,Red,Yes
P04,Product1,Black,Yes
I need to check the Name of each product, and if it is not the same in all the records then I need to send an error message. I know the below query is used to find duplicates in the records, but I'm not sure how to modify it to check whether they all have the same value.
ProductsList.GroupBy(p => p.Name).Where(p => p.Count() > 1).SelectMany(x => x);
var first = myObjects.First();
bool allSame = myObjects.All(x=>x.Name == first.Name);
Enumerable.All() will return true if the lambda returns true for all elements of the collection. In this case we're checking that every object's Name property is equal to the first (and thus that they're all equal to each other; the transitive property is great, innit?). You can one-line this by inlining myObjects.First() but this will slow performance as First() will execute once for each object in the collection. You can also theoretically Skip() the first element as we know it's equal to itself.
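A small variation that also covers an empty collection and uses Skip(1), as mentioned above:
// The first element is trivially equal to itself, so only the rest need checking.
// FirstOrDefault guards against an empty collection (First would throw).
var firstItem = myObjects.FirstOrDefault();
bool allSame = firstItem == null || myObjects.Skip(1).All(x => x.Name == firstItem.Name);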
If I understand correctly, you want to check whether a product exists in the list:
using System.Linq;
private bool ItemExists(string nameOfProduct) {
return ProductsList.Any(p=> p.Name== nameOfProduct);
}
Update after the author's comment:
To know all the records that are not having the same name as the first record:
var firstName = ProductsList[0].Name;
var differentNames = ProductsList.Where(p => p.Name != firstName);
Another option (just to have all other names): ProductsList.Select(p => p.Name).Where(n => n != firstName).Distinct()
Old version
So, if there are at least two different names then you should return an error?
LINQ way: return ProductsList.Select(p => p.Name).Distinct().Count() <= 1
A more optimized way:
if (ProductsList.Count == 0)
return true;
var name = ProductsList[0].Name;
for (var i = 1; i < ProductsList.Count; i++)
{
if (ProductsList[i].Name != name)
return false;
}
return true;
I am trying to use Linq2Sql to return all rows that contain values from a list of strings. The linq2sql class object has a string property that contains words separated by spaces.
public class MyObject
{
public string MyProperty { get; set; }
}
Example MyProperty values are:
MyObject1.MyProperty = "text1 text2 text3 text4"
MyObject2.MyProperty = "text2"
For example, using a string collection, I pass the below list
var list = new List<string>() { "text2", "text4" };
This would return both items in my example above as they both contain "text2" value.
I attempted this with the code below; however, because of my extension method, Linq2Sql cannot evaluate it.
public static IQueryable<MyObject> WithProperty(this IQueryable<MyProperty> qry,
IList<string> p)
{
return from t in qry
where t.MyProperty.Contains(p, ' ')
select t;
}
I also wrote an extension method
public static bool Contains(this string str, IList<string> list, char seperator)
{
if (str == null) return false;
if (list == null) return true;
var splitStr = str.Split(new char[] { seperator },
StringSplitOptions.RemoveEmptyEntries);
bool retval = false;
int matches = 0;
foreach (string s in splitStr)
{
foreach (string l in list)
{
if (String.Compare(s, l, true) == 0)
{
retval = true;
matches++;
}
}
}
return retval && (splitStr.Length > 0) && (list.Count == matches);
}
Any help or ideas on how I could achieve this?
You're on the right track. The first parameter of your extension method WithProperty has to be of type IQueryable<MyObject>, not IQueryable<MyProperty>.
Anyway, you don't need an extension method for the IQueryable. Just use your Contains method in a lambda for filtering. This should work:
List<string> searchStrs = new List<string>() { "text2", "text4" };
IEnumerable<MyObject> myFilteredObjects = dataContext.MyObjects
.Where(myObj => myObj.MyProperty.Contains(searchStrs, ' '));
Update:
The above code snippet does not work. This is because the Contains method cannot be converted into a SQL statement. I thought about the problem for a while and came to a solution by asking myself 'how would I do that in SQL?': you could query for each single keyword and union all results together. Sadly, the deferred execution of Linq-to-SQL prevents doing that all in one query. So I came up with this compromise of a compromise. It queries for every single keyword, which can be one of the following:
equal to the string
in between two seperators
at the start of the string and followed by a seperator
or at the end of the string and headed by a seperator
This builds a valid expression tree and is translatable into SQL via Linq-to-SQL. After each query I don't defer the execution; I immediately fetch the data and store it in a list. All lists are unioned afterwards.
public static IEnumerable<MyObject> ContainsOneOfTheseKeywords(
this IQueryable<MyObject> qry, List<string> keywords, char sep)
{
List<List<MyObject>> parts = new List<List<MyObject>>();
foreach (string keyw in keywords)
parts.Add((
from obj in qry
where obj.MyProperty == keyw ||
obj.MyProperty.IndexOf(sep + keyw + sep) != -1 ||
obj.MyProperty.IndexOf(keyw + sep) >= 0 ||
obj.MyProperty.IndexOf(sep + keyw) ==
obj.MyProperty.Length - keyw.Length - 1
select obj).ToList());
IEnumerable<MyObject> union = null;
bool first = true;
foreach (List<MyObject> part in parts)
{
if (first)
{
union = part;
first = false;
}
else
union = union.Union(part);
}
return union.ToList();
}
And use it:
List<string> searchStrs = new List<string>() { "text2", "text4" };
IEnumerable<MyObject> myFilteredObjects = dataContext.MyObjects
.ContainsOneOfTheseKeywords(searchStrs, ' ');
That solution is anything but elegant. For 10 keywords, I have to query the db 10 times and each time fetch the data and store it in memory. This wastes memory and performs badly. I just wanted to demonstrate that it is possible in Linq (maybe it can be optimized here or there, but I don't think it will ever be great).
I would strongly recommend moving the logic of that function into a stored procedure on your database server. One single query, optimized by the database server, and no waste of memory.
Another alternative would be to rethink your database design. If you want to query the contents of one field (you are treating this field like an array of keywords separated by spaces), you may simply have chosen an inappropriate database design. Instead, create a new table with a foreign key to your table, with exactly one keyword per row. The queries would be much simpler, faster and more understandable.
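Under such a normalized design, the query could look roughly like this (a sketch; Keywords and Word are hypothetical names for the new table/association and its column):
// Sketch assuming a Keywords table with one keyword per row and a foreign key
// back to MyObjects, exposed as an association on the Linq-to-SQL entity.
var searchStrs = new List<string> { "text2", "text4" };

var myFilteredObjects =
    from obj in dataContext.MyObjects
    where obj.Keywords.Any(k => searchStrs.Contains(k.Word))
    select obj;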
I haven't tried, but if I remember correctly, this should work:
from t in ctx.Table
where list.Any(x => t.MyProperty.Contains(x))
select t
you can replace Any() with All() if you want all strings in list to match
EDIT:
To clarify what I was trying to do with this, here is a similar query written without linq, to explain the use of All and Any
where list.Any(x => t.MyProperty.Contains(x))
Translates to:
where t.MyProperty.Contains(list[0]) || t.MyProperty.Contains(list[1]) ||
t.MyProperty.Contains(list[n])
And
where list.All(x => t.MyProperty.Contains(x))
Translates to:
where t.MyProperty.Contains(list[0]) && t.MyProperty.Contains(list[1]) &&
t.MyProperty.Contains(list[n])
I've got a lot of ugly code that looks like this:
if (!string.IsNullOrEmpty(ddlFileName.SelectedItem.Text))
results = results.Where(x => x.FileName.Contains(ddlFileName.SelectedValue));
if (chkFileName.Checked)
results = results.Where(x => x.FileName == null);
if (!string.IsNullOrEmpty(ddlIPAddress.SelectedItem.Text))
results = results.Where(x => x.IpAddress.Contains(ddlIPAddress.SelectedValue));
if (chkIPAddress.Checked)
results = results.Where(x => x.IpAddress == null);
...etc.
results is an IQueryable<MyObject>.
The idea is that for each of these innumerable dropdowns and checkboxes, if the dropdown has something selected, the user wants to match that item. If the checkbox is checked, the user wants specifically those records where that field is null or an empty string. (The UI doesn't let both be selected at the same time.) This all adds to the LINQ Expression which gets executed at the end, after we've added all the conditions.
It seems like there ought to be some way to pull out an Expression<Func<MyObject, bool>> or two so that I can put the repeated parts in a method and just pass in what changes. I've done this in other places, but this set of code has me stymied. (Also, I'd like to avoid "Dynamic LINQ", because I want to keep things type-safe if possible.) Any ideas?
I'd convert it into a single Linq statement:
var results =
//get your inital results
from x in GetInitialResults()
//either we don't need to check, or the check passes
where string.IsNullOrEmpty(ddlFileName.SelectedItem.Text) ||
x.FileName.Contains(ddlFileName.SelectedValue)
where !chkFileName.Checked ||
string.IsNullOrEmpty(x.FileName)
where string.IsNullOrEmpty(ddlIPAddress.SelectedItem.Text) ||
x.IpAddress.Contains(ddlIPAddress.SelectedValue)
where !chkIPAddress.Checked ||
string.IsNullOrEmpty(x.IpAddress)
select x;
It's no shorter, but I find this logic clearer.
In that case:
//list of predicate functions to check
var conditions = new List<Predicate<MyClass>>
{
x => string.IsNullOrEmpty(ddlFileName.SelectedItem.Text) ||
x.FileName.Contains(ddlFileName.SelectedValue),
x => !chkFileName.Checked ||
string.IsNullOrEmpty(x.FileName),
x => string.IsNullOrEmpty(ddlIPAddress.SelectedItem.Text) ||
x.IpAddress.Contains(ddlIPAddress.SelectedValue),
x => !chkIPAddress.Checked ||
string.IsNullOrEmpty(x.IpAddress)
};
//now get results
var results =
from x in GetInitialResults()
//all the condition functions need checking against x
where conditions.All( cond => cond(x) )
select x;
I've just explicitly declared the predicate list, but these could be generated, something like:
ListBoxControl lbc;
CheckBoxControl cbc;
foreach( Control c in this.Controls)
if( (lbc = c as ListBoxControl ) != null )
conditions.Add( ... );
else if ( (cbc = c as CheckBoxControl ) != null )
conditions.Add( ... );
You would need some way to check the property of MyClass that you needed to check, and for that you'd have to use reflection.
Have you seen the LINQKit? The AsExpandable sounds like what you're after (though you may want to read the post Calling functions in LINQ queries at TomasP.NET for more depth).
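For reference, a rough sketch of what the LINQKit route might look like, assuming the library's PredicateBuilder and AsExpandable APIs (check the documentation for the exact calls in your version):
// Sketch using LINQKit's PredicateBuilder to build the filter incrementally,
// keeping everything strongly typed.
using LinqKit;

var predicate = PredicateBuilder.True<MyObject>();

if (!string.IsNullOrEmpty(ddlFileName.SelectedItem.Text))
    predicate = predicate.And(x => x.FileName.Contains(ddlFileName.SelectedValue));
if (chkFileName.Checked)
    predicate = predicate.And(x => string.IsNullOrEmpty(x.FileName));

// AsExpandable lets the combined predicate be translated by the query provider.
var filtered = results.AsExpandable().Where(predicate);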
Don't use LINQ if it's impacting readability. Factor out the individual tests into boolean methods which can be used as your where expression.
IQueryable<MyObject> results = ...;
results = results
.Where(TestFileNameText)
.Where(TestFileNameChecked)
.Where(TestIPAddressText)
.Where(TestIPAddressChecked);
So the individual tests are simple methods on the class. They're even individually unit testable.
bool TestFileNameText(MyObject x)
{
return string.IsNullOrEmpty(ddlFileName.SelectedItem.Text) ||
x.FileName.Contains(ddlFileName.SelectedValue);
}
bool TestIPAddressChecked(MyObject x)
{
return !chkIPAddress.Checked ||
x.IpAddress == null;
}
results = results.Where(x =>
(string.IsNullOrEmpty(ddlFileName.SelectedItem.Text) || x.FileName.Contains(ddlFileName.SelectedValue))
&& (!chkFileName.Checked || string.IsNullOrEmpty(x.FileName))
&& ...);
Neither of these answers so far is quite what I'm looking for. To give an example of what I'm aiming at (I don't regard this as a complete answer either), I took the above code and created a couple of extension methods:
static public IQueryable<Activity> AddCondition(
this IQueryable<Activity> results,
DropDownList ddl,
Expression<Func<Activity, bool>> containsCondition)
{
if (!string.IsNullOrEmpty(ddl.SelectedItem.Text))
results = results.Where(containsCondition);
return results;
}
static public IQueryable<Activity> AddCondition(
this IQueryable<Activity> results,
CheckBox chk,
Expression<Func<Activity, bool>> emptyCondition)
{
if (chk.Checked)
results = results.Where(emptyCondition);
return results;
}
This allowed me to refactor the code above into this:
results = results.AddCondition(ddlFileName, x => x.FileName.Contains(ddlFileName.SelectedValue));
results = results.AddCondition(chkFileName, x => x.FileName == null || x.FileName.Equals(string.Empty));
results = results.AddCondition(ddlIPAddress, x => x.IpAddress.Contains(ddlIPAddress.SelectedValue));
results = results.AddCondition(chkIPAddress, x => x.IpAddress == null || x.IpAddress.Equals(string.Empty));
This isn't quite as ugly, but it's still longer than I'd prefer. The pairs of lambda expressions in each set are obviously very similar, but I can't figure out a way to condense them further...at least not without resorting to dynamic LINQ, which makes me sacrifice type safety.
Any other ideas?
#Kyralessa,
You can create an extension method AddCondition for predicates that accepts a parameter of type Control plus a lambda expression and returns a combined expression. Then you can combine conditions using a fluent interface and reuse your predicates. To see an example of how it can be implemented, see my answer to this question:
How do I compose existing Linq Expressions
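The linked answer isn't reproduced here, but for illustration, this is a common way to compose two typed Expression<Func<T, bool>> predicates with AndAlso by rebinding the parameter (a sketch of the general technique, not necessarily the linked answer's exact code):
using System;
using System.Linq.Expressions;

static class PredicateExtensions
{
    // Combines two predicates into "left && right" while keeping a single parameter,
    // so the result remains translatable by IQueryable providers.
    public static Expression<Func<T, bool>> AndAlso<T>(
        this Expression<Func<T, bool>> left,
        Expression<Func<T, bool>> right)
    {
        var parameter = left.Parameters[0];
        // Rewrite the right-hand body so it uses the left expression's parameter.
        var rewrittenRight = new ReplaceParameterVisitor(right.Parameters[0], parameter)
            .Visit(right.Body);
        return Expression.Lambda<Func<T, bool>>(
            Expression.AndAlso(left.Body, rewrittenRight), parameter);
    }

    private sealed class ReplaceParameterVisitor : ExpressionVisitor
    {
        private readonly ParameterExpression from;
        private readonly ParameterExpression to;

        public ReplaceParameterVisitor(ParameterExpression from, ParameterExpression to)
        {
            this.from = from;
            this.to = to;
        }

        protected override Expression VisitParameter(ParameterExpression node) =>
            node == from ? to : base.VisitParameter(node);
    }
}
Usage would look like predicate = predicate.AndAlso(x => x.FileName == null); and the combined expression can then be passed to Where on the IQueryable.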
I'd be wary of the solutions of the form:
// from Keith
from x in GetInitialResults()
//either we don't need to check, or the check passes
where string.IsNullOrEmpty(ddlFileName.SelectedItem.Text) ||
x.FileName.Contains(ddlFileName.SelectedValue)
My reasoning is variable capture. If you execute immediately, just the once, you probably won't notice a difference. However, in LINQ, evaluation isn't immediate but happens each time iteration occurs. Delegates can capture variables and use them outside the scope you intended.
It feels like you're querying too close to the UI. Querying is a layer down, and LINQ isn't the way for the UI to communicate down.
You may be better off doing the following. Decouple the searching logic from the presentation - it's more flexible and reusable - fundamentals of OO.
// my search parameters encapsulate all valid ways of searching.
public class MySearchParameter
{
public string FileName { get; private set; }
public bool FindNullFileNames { get; private set; }
public void ConditionallySearchFileName(bool getNullFileNames, string fileName)
{
FindNullFileNames = getNullFileNames;
FileName = null;
// enforce either/or and disallow empty string
if(!getNullFileNames && !string.IsNullOrEmpty(fileName) )
{
FileName = fileName;
}
}
// ...
}
// search method in a business logic layer.
public IQueryable<MyClass> Search(MySearchParameter searchParameter)
{
IQueryable<MyClass> result = ...; // something to get the initial list.
// search on Filename.
if (searchParameter.FindNullFileNames)
{
result = result.Where(o => o.FileName == null);
}
else if( searchParameter.FileName != null )
{ // intermixing a different style, just to show an alternative.
result = from o in result
where o.FileName.Contains(searchParameter.FileName)
select o;
}
// search on other stuff...
return result;
}
// code in the UI ...
MySearchParameter searchParameter = new MySearchParameter();
searchParameter.ConditionallySearchFileName(chkFileNames.Checked, drpFileNames.SelectedItem.Text);
searchParameter.ConditionallySearchIPAddress(chkIPAddress.Checked, drpIPAddress.SelectedItem.Text);
IQueryable<MyClass> result = Search(searchParameter);
// inform control to display results.
searchResults.Display( result );
Yes it's more typing, but you read code around 10x more than you write it. Your UI is clearer, the search parameters class takes care of itself and ensures mutually exclusive options don't collide, and the search code is abstracted away from any UI and doesn't even care if you use Linq at all.
Since you are wanting to repeatedly reduce the original results query with innumerable filters, you can use Aggregate(), (which corresponds to reduce() in functional languages).
The filters are of predictable form, consisting of two values for every member of MyObject - according to the information I gleaned from your post. If every member to be compared is a string, which may be null, then I recommend using an extension method, since an extension method can still be called on a null reference of its intended type.
public static class MyObjectExtensions
{
public static bool IsMatchFor(this string property, string ddlText, bool chkValue)
{
if(ddlText!=null && ddlText!="")
{
return property!=null && property.Contains(ddlText);
}
else if(chkValue==true)
{
return property==null || property=="";
}
// no filtering selected
return true;
}
}
We now need to arrange the property filters in a collection, to allow for iterating over many. They are represented as Expressions for compatibility with IQueryable.
var filters = new List<Expression<Func<MyObject,bool>>>
{
x=>x.FileName.IsMatchFor(ddlFileName.SelectedItem.Text,chkFileName.Checked),
x=>x.IpAddress.IsMatchFor(ddlIPAddress.SelectedItem.Text,chkIPAddress.Checked),
x=>x.Other.IsMatchFor(ddlOther.SelectedItem.Text,chkOther.Checked),
// ... innumerable associations
};
Now we aggregate the innumerable filters onto the initial results query:
var filteredResults = filters.Aggregate(results, (r,f) => r.Where(f));
I ran this in a console app with simulated test values, and it worked as expected. I think this at least demonstrates the principle.
One thing you might consider is simplifying your UI by eliminating the checkboxes and using an "<empty>" or "<null>" item in your drop down list instead. This would reduce the number of controls taking up space on your window, remove the need for complex "enable X only if Y is not checked" logic, and would enable a nice one-control-per-query-field.
Moving on to your result query logic, I would start by creating a simple object to represent a filter on your domain object:
interface IDomainObjectFilter {
bool ShouldInclude( DomainObject o, string target );
}
You can associate an appropriate instance of the filter with each of your UI controls, and then retrieve that when the user initiates a query:
sealed class FileNameFilter : IDomainObjectFilter {
public bool ShouldInclude( DomainObject o, string target ) {
return string.IsNullOrEmpty( target )
|| o.FileName.Contains( target );
}
}
...
ddlFileName.Tag = new FileNameFilter( );
You can then generalize your result filtering by simply enumerating your controls and executing the associated filter (thanks to hurst for the Aggregate idea):
var finalResults = ddlControls.Aggregate( initialResults, ( r, c ) => {
var filter = c.Tag as IDomainObjectFilter;
var target = c.SelectedValue;
return r.Where( o => filter.ShouldInclude( o, target ) );
} );
Since your queries are so regular, you might be able to simplify the implementation even further by using a single filter class taking a member selector:
sealed class DomainObjectFilter {
private readonly Func<DomainObject,string> memberSelector_;
public DomainObjectFilter( Func<DomainObject,string> memberSelector ) {
this.memberSelector_ = memberSelector;
}
public bool ShouldInclude( DomainObject o, string target ) {
string member = this.memberSelector_( o );
return string.IsNullOrEmpty( target )
|| member.Contains( target );
}
}
...
ddlFileName.Tag = new DomainObjectFilter( o => o.FileName );