Performance mongodb query with starts-with

For a proof of concept I have loaded ~54 million records into mongodb. The goal is to investigate the query speed of mongodb.
I use the following class to store the data:
[BsonDiscriminator("Part", Required = true)]
public class Part
{
[BsonId]
public ObjectId Id { get; set; }
[BsonElement("pgc")]
public int PartGroupCode { get; set; }
[BsonElement("sc")]
public int SupplierCode { get; set; }
[BsonElement("ref")]
public string ReferenceNumber { get; set; }
[BsonElement("oem"), BsonIgnoreIfNull]
public List<OemReference> OemReferences { get; set; }
[BsonElement("alt"), BsonIgnoreIfNull]
public List<AltReference> AltReferences { get; set; }
[BsonElement("crs"), BsonIgnoreIfNull]
public List<CrossReference> CrossReferences { get; set; }
[BsonElement("old"), BsonIgnoreIfNull]
public List<FormerReference> FormerReferences { get; set; }
[BsonElement("sub"), BsonIgnoreIfNull]
public List<SubPartReference> SubPartReferences { get; set; }
}
And I created the following indexes:
Compound Index on ref, sc, pgc
Ascending Index on oem.refoem
Ascending Index on alt.refalt
Ascending Index on crs.refcrs
Ascending Index on old.refold
Ascending Index on sub.refsub
I perform the following queries to test the performance:
var searchValue = "345";
var start = DateTime.Now;
var result1 = collection.AsQueryable<Part>().OfType<Part>().Where(part => part.ReferenceNumber == searchValue);
long count = result1.Count();
var finish = DateTime.Now;
start = DateTime.Now;
var result2 = collection.AsQueryable<Part>().OfType<Part>().Where(part =>
part.ReferenceNumber.Equals(searchValue) ||
part.OemReferences.Any(oem => oem.ReferenceNumber.Equals(searchValue)) ||
part.AltReferences.Any(alt => alt.ReferenceNumber.Equals(searchValue)) ||
part.CrossReferences.Any(crs => crs.ReferenceNumber.Equals(searchValue)) ||
part.FormerReferences.Any(old => old.ReferenceNumber.Equals(searchValue))
);
count = result2.Count();
finish = DateTime.Now;
start = DateTime.Now;
var result3 = collection.AsQueryable<Part>().OfType<Part>().Where(part =>
part.ReferenceNumber.StartsWith(searchValue) ||
part.OemReferences.Any(oem => oem.ReferenceNumber.StartsWith(searchValue)) ||
part.AltReferences.Any(alt => alt.ReferenceNumber.StartsWith(searchValue)) ||
part.CrossReferences.Any(crs => crs.ReferenceNumber.StartsWith(searchValue)) ||
part.FormerReferences.Any(old => old.ReferenceNumber.StartsWith(searchValue))
);
count = result3.Count();
finish = DateTime.Now;
var regex = new Regex("^345"); //StartsWith regex
start = DateTime.Now;
var result4 = collection.AsQueryable<Part>().OfType<Part>().Where(part =>
regex.IsMatch(part.ReferenceNumber) ||
part.OemReferences.Any(oem => regex.IsMatch(oem.ReferenceNumber)) ||
part.AltReferences.Any(alt => regex.IsMatch(alt.ReferenceNumber)) ||
part.CrossReferences.Any(crs => regex.IsMatch(crs.ReferenceNumber)) ||
part.FormerReferences.Any(old => regex.IsMatch(old.ReferenceNumber))
);
count = result4.Count();
finish = DateTime.Now;
The results are not what I would have expected:
Search 1 on 345 results in: 3 records (00:00:00.3635937)
Search 2 on 345 results in: 58 records (00:00:00.0671566)
Search 3 on 345 results in: 6189 records (00:01:17.6638459)
Search 4 on 345 results in: 6189 records (00:01:17.0727802)
Why is the StartsWith query (3 and 4) so much slower?
The StartsWith query performance is the make or break decision.
Did I create the wrong indexes? Any help is appreciated.
Using mongodb with the 10gen C# driver
UPDATE:
The way the query is translated from LINQ to a MongoDB query is very important for performance. I built the same query (like 3 and 4) again, but with the Query object:
var query5 = Query.And(
Query.EQ("_t", "Part"),
Query.Or(
Query.Matches("ref", "^345"),
Query.Matches("oem.refoem", "^345"),
Query.Matches("alt.refalt", "^345"),
Query.Matches("crs.refcrs", "^345"),
Query.Matches("old.refold", "^345")));
start = DateTime.Now;
var result5 = collection.FindAs<Part>(query5);
count = result5.Count();
finish = DateTime.Now;
The result of this query is returned in 00:00:00.4522972
The query translated as
command: { count: "PSG", query: { _t: "Part", $or: [ { ref: /^345/ }, { oem.refoem: /^345/ }, { alt.refalt: /^345/ }, { crs.refcrs: /^345/ }, { old.refold: /^345/ } ] } }
Compared with Query 3 and 4 the difference is big:
command: { count: "PSG", query: { _t: "Part", $or: [ { ref: /^345/ }, { oem: { $elemMatch: { refoem: /^345/ } } }, { alt: { $elemMatch: { refalt: /^345/ } } }, { crs: { $elemMatch: { refcrs: /^345/ } } }, { old: { $elemMatch: { refold: /^345/ } } } ] } }
So why are queries 3 and 4 not using the indexes?

From the index documentation:
Every query, including update operations, uses one and only one index.
In other words, MongoDB doesn't support index intersection. Thus, creating a large number of indexes is pointless unless there are queries that use each index, and that index only. Also, make sure you're calling the correct Count() method here. If you call the LINQ-to-Objects extension (IEnumerable's Count() rather than MongoCursor's Count()), it will actually have to fetch and hydrate all objects.
It is probably easier to put these in a single multi-key index like this:
{
"References" : [ { id: new ObjectId("..."), "_t" : "OemReference", ... },
{ id: new ObjectId("..."), "_t" : "CrossReferences", ...} ],
...
}
where References.id is indexed. Now, a query db.foo.find({"References.id" : new ObjectId("...")}) will automatically search for any match in the array of references. Since I assume the different types of references must be distinguished, it makes sense to use a discriminator so the driver can support polymorphic deserialization. In C#, you'd declare this like
[BsonDiscriminator(Required=true)]
[BsonKnownTypes(typeof(OemReference), typeof(...), ...)]
class Reference { ... }
class OemReference : Reference { ... }
The driver will automatically serialize the type name in a field called _t. That behaviour can be adjusted to your needs, if required.
Also note that shortening the property names will decrease storage requirements, but won't affect index size.

rearrange a list of objects by type field in C#

I have an incoming list of alerts and I use a MapFunction as:
private static BPAlerts MapToAlerts(List<IntakeAlert> intakeAlerts)
{
// Make sure that there are alerts
if (intakeAlerts.IsNullOrEmpty()) return new BPAlerts { AllAlerts = new List<BPAlert>(), OverviewAlerts = new List<BPAlert>() };
// All Alerts
var alerts = new BPAlerts
{
AllAlerts = intakeAlerts.Select(
alert => new BPAlert
{
AlertTypeId = alert.AlertTypeId ?? 8100,
IsOverview = alert.IsOverviewAlert.GetValueOrDefault(),
Text = alert.AlertText,
Title = alert.AlertTitle,
Type = alert.AlertTypeId == 8106 ? "warning" : "report",
Severity = alert.AlertSeverity.GetValueOrDefault(),
Position = alert.Position.GetValueOrDefault()
}).OrderBy(a => a.Position).ToList()
};
// Alerts displayed on the overview page
alerts.OverviewAlerts =
alerts.AllAlerts
.ToList()
.Where(a => a.IsOverview && !string.IsNullOrEmpty(a.Title))
.Take(3)
.ToList();
return alerts;
}
The BPAlerts type contains two lists:
public class BPAlerts
{
public List<BPAlert> AllAlerts { get; set; }
public List<BPAlert> OverviewAlerts { get; set; }
}
And the BPAlert type is defined as:
public class BPAlert
{
public short AlertTypeId { get; set; }
public string Type { get; set; }
public int Severity { get; set; }
public bool IsOverview { get; set; }
public string Title { get; set; }
public string Text { get; set; }
public int Position { get; set; }
public string Id { get; internal set; } = Guid.NewGuid().ToString();
}
I want the MapToAlerts function to return an alerts object whose OverviewAlerts are sorted by the type of BPAlert. To be more clear, in the following order, if present:
Confirmed Out of Business - 8106
Bankruptcy - 8105
Lack of Licensing - 8111
Investigations - 8109
Government Actions - 8103
Pattern of Complaints - 8104
Customer Reviews - 8112
Accreditation - 8110
Misuse of BBB Name - 8101
Advisory - 8107
Advertising Review - 8102

Solution #1 Order values array
I would just define the order of those ids in some kind of collection, can be an array:
var orderArray = new int[]
{
8106, // Confirmed Out of Business
8105, // Bankruptcy
8111, // Lack of Licensing
8109, // Investigations
8103, // Government Actions
8104, // Pattern of Complaints
8112, // Customer Reviews
8110, // Accreditation
8101, // Misuse of BBB Name
8107, // Advisory
8102, // Advertising Review
};
Then iterate through the array, incrementing the order value. On each iteration, check whether the array element equals the alert type id whose order value is being evaluated:
for (int orderValue = 0; orderValue < orderArray.Length; orderValue++)
{
if (alertTypeId == orderArray[orderValue])
{
return orderValue;
}
}
If not found in the array, return highest value possible:
return int.MaxValue;
Whole method would look like this and it would evaluate the order for alert type id:
public int GetAlertTypeIdOrder(short alertTypeId)
{
var orderArray = new int[]
{
8106, // Confirmed Out of Business
8105, // Bankruptcy
8111, // Lack of Licensing
8109, // Investigations
8103, // Government Actions
8104, // Pattern of Complaints
8112, // Customer Reviews
8110, // Accreditation
8101, // Misuse of BBB Name
8107, // Advisory
8102, // Advertising Review
};
for (int orderValue = 0; orderValue < orderArray.Length; orderValue++)
{
if (alertTypeId == orderArray[orderValue])
{
return orderValue;
}
}
return int.MaxValue;
}
Usage:
var sortedAlerts = alerts
.AllAlerts
.OrderBy(a => GetAlertTypeIdOrder(a.AlertTypeId))
.ToList();
It also works in descending order: use OrderByDescending instead of OrderBy.
Solution #2 Order values dictionary
You could achieve better performance by removing the redundancy of recreating the array of order values on every call. A better idea is to store the order rules in a dictionary. The code below creates an array too, but the idea is that it is called once to build the dictionary, which is then passed around.
public Dictionary<int, int> GetOrderRules()
{
var alertTypeIds = new int[]
{
8106, // Confirmed Out of Business
8105, // Bankruptcy
8111, // Lack of Licensing
8109, // Investigations
8103, // Government Actions
8104, // Pattern of Complaints
8112, // Customer Reviews
8110, // Accreditation
8101, // Misuse of BBB Name
8107, // Advisory
8102, // Advertising Review
};
var orderRules = new Dictionary<int, int>();
for (int orderValue = 0; orderValue < alertTypeIds.Length; orderValue++)
{
orderRules.Add(alertTypeIds[orderValue], orderValue);
}
return orderRules;
}
So the GetAlertIdOrder() method would look different, but still keeping the idea from previous solution:
public int GetAlertIdOrder(short alertTypeId, IDictionary<int, int> orderRules)
{
if (orderRules.TryGetValue(alertTypeId, out int orderValue))
{
return orderValue;
}
else
{
return int.MaxValue;
}
}
Usage:
var orderRules = GetOrderRules();
var sortedAlerts = alerts
.AllAlerts
.OrderBy(a => GetAlertIdOrder(a.AlertTypeId, orderRules))
.ToList();
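As a rough end-to-end illustration of the dictionary idea, here is a minimal self-contained sketch. The class name AlertOrdering, the method name RankOf, the trimmed-down BPAlert and the sample alerts are assumptions made for the example, not part of the original code; the dictionary is built once in a static field instead of being passed around.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for the BPAlert class from the question.
public class BPAlert
{
    public short AlertTypeId { get; set; }
    public string Title { get; set; }
}

// Hypothetical helper class; the name is chosen for this example only.
public static class AlertOrdering
{
    // Priority list from the question; the array index is the sort rank.
    private static readonly int[] AlertTypeIds =
    {
        8106, 8105, 8111, 8109, 8103, 8104, 8112, 8110, 8101, 8107, 8102
    };

    // Built once when the class is first used, then reused for every lookup.
    private static readonly Dictionary<int, int> OrderRules =
        AlertTypeIds
            .Select((id, rank) => new { id, rank })
            .ToDictionary(x => x.id, x => x.rank);

    // Ids missing from the list get int.MaxValue so they sort last.
    public static int RankOf(short alertTypeId) =>
        OrderRules.TryGetValue(alertTypeId, out int rank) ? rank : int.MaxValue;

    public static void Main()
    {
        var alerts = new List<BPAlert>
        {
            new BPAlert { AlertTypeId = 8107, Title = "Advisory" },
            new BPAlert { AlertTypeId = 8100, Title = "Unlisted type" },
            new BPAlert { AlertTypeId = 8106, Title = "Confirmed Out of Business" },
            new BPAlert { AlertTypeId = 8105, Title = "Bankruptcy" },
        };

        var sorted = alerts.OrderBy(a => RankOf(a.AlertTypeId)).ToList();

        foreach (var alert in sorted)
            Console.WriteLine($"{alert.AlertTypeId} {alert.Title}");
        // Prints the ids in the order 8106, 8105, 8107, 8100.
    }
}
```

Because OrderBy is a stable sort, alerts that share a rank (for example several unlisted ids) keep their original relative order.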

(a) I wouldn't mix sorting with the mapper. Let the mapper just do its thing, with no ordering or sorting (this is separation of concerns). IMHO, you'll otherwise end up with too much voodoo in the mapper that is hard to understand. You're already on this path with the above code.
(b) If "OverviewAlerts" is a subset of AllAlerts (i.e. AllAlerts is the superset), then hydrate AllAlerts and create a read-only "get" property that filters AllAlerts down to the subset by its rules. Optionally, consider an AllAlertsSorted get property. This way your consumers can choose whether they want raw or sorted data, since sorting has a cost.
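public class BPAlerts
{
public List<BPAlert> AllAlerts { get; set; }
// Read-only subset of AllAlerts. The filter below is the one from the question; adjust it to your rules.
public List<BPAlert> OverviewAlerts
{
get
{
return null == this.AllAlerts ? null : this.AllAlerts.Where(a => a.IsOverview && !string.IsNullOrEmpty(a.Title)).Take(3).ToList();
}
}
// Read-only sorted view. Sorting by Position is only an example; put your own ordering here.
public List<BPAlert> AllAlertsSorted
{
get
{
return null == this.AllAlerts ? null : this.AllAlerts.OrderBy(a => a.Position).ToList();
}
}
}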
If you use the read-only properties, then you have simpler LINQ operations like
OrderBy(x => x.PropertyAbc).ThenByDescending(x => x.PropertyDef);
99% of my mapping code looks like this. I don't even throw an error if you give null input; I just return null.
public static class MyObjectMapper {
public static ICollection<MyOtherObject> ConvertToMyOtherObject(ICollection<MyObject> inputItems) {
ICollection<MyOtherObject> returnItems = null;
if (null != inputItems) {
returnItems = new List<MyOtherObject>();
foreach (MyObject inputItem in inputItems) {
MyOtherObject moo = new MyOtherObject();
/* map here */
returnItems.Add(moo);
}
}
return returnItems;
}
}

Filtering nested lists with nullable property

Say I have the following class structures
public class EmailActivity {
public IEnumerable<MemberActivity> Activity { get; set; }
public string EmailAddress { get; set; }
}
public class MemberActivity {
public EmailAction? Action { get; set; }
public string Type { get; set; }
}
public enum EmailAction {
None = 0,
Open = 1,
Click = 2,
Bounce = 3
}
I wish to filter a list of EmailActivity objects based on the presence of a MemberActivity with a non-null EmailAction matching a provided list of EmailAction matches. I want to return just the EmailAddress property as a List<string>.
This is as far as I've got
List<EmailAction> activityTypes; // [ EmailAction.Open, EmailAction.Bounce ]
List<string> activityEmailAddresses =
emailActivity.Where(
member => member.Activity.Where(
activity => activityTypes.Contains(activity.Action)
)
)
.Select(member => member.EmailAddress)
.ToList();
However I get an error message "CS1503 Argument 1: cannot convert from 'EmailAction?' to 'EmailAction'"
If I then modify activityTypes to allow null values (List<EmailAction?>), I get the following: "CS1662 Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type".
The issue is that the nested .Where returns a list, but the parent .Where requires a bool result. How would I tackle this problem?
I realise I could do this with nested loops, however I'm trying to brush up my C# skills!

Using List.Contains is not ideal in terms of performance; a HashSet is a better option. Also, if you want to select the email address as soon as it contains one of the searched actions, you can use Any:
var activityTypes = new HashSet<EmailAction>() { EmailAction.Open, EmailAction.Bounce };
List<string> activityEmailAddresses =
emailActivity.Where(
member => member.Activity.Any(
activity => activity.Action.HasValue &&
activityTypes.Contains(activity.Action.Value)
)
)
.Select(member => member.EmailAddress)
.ToList();

Use All or Any, depending on whether you want every activity to match or at least one...
HashSet<EmailAction> activityTypes = new HashSet<EmailAction> { EmailAction.None };
var emailActivity = new List<EmailActivity>
{
new EmailActivity { Activity = new List<MemberActivity>{ new MemberActivity { Action = EmailAction.None } }, EmailAddress = "a" },
new EmailActivity { Activity = new List<MemberActivity>{ new MemberActivity { Action = EmailAction.Click } }, EmailAddress = "b" }
};
// Example with Any but All can be used as well
var activityEmailAddresses = emailActivity
.Where(x => x.Activity.Any(_ => _.Action.HasValue && activityTypes.Contains(_.Action.Value)))
.Select(x => x.EmailAddress)
.ToArray();
// Result is [ "a" ]
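To make the difference between the two operators concrete, here is a small sketch that runs the same predicate through Any and All. The class name AnyVersusAll and the sample lists are assumptions made for the example. Note the empty-list case: All returns true when there is nothing to check, which may or may not be what you want for members without any activity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum EmailAction { None = 0, Open = 1, Click = 2, Bounce = 3 }

// Hypothetical demo class; the name is chosen for this example only.
public static class AnyVersusAll
{
    public static void Main()
    {
        var wanted = new HashSet<EmailAction> { EmailAction.Open, EmailAction.Bounce };

        // One matching action, one non-matching action, one null action.
        var mixed = new List<EmailAction?> { EmailAction.Open, EmailAction.Click, null };
        var empty = new List<EmailAction?>();

        // Same null-safe predicate as in the answers above.
        Func<EmailAction?, bool> matches = a => a.HasValue && wanted.Contains(a.Value);

        Console.WriteLine(mixed.Any(matches)); // True: at least one action matches
        Console.WriteLine(mixed.All(matches)); // False: Click and null do not match
        Console.WriteLine(empty.Any(matches)); // False: nothing to match
        Console.WriteLine(empty.All(matches)); // True: vacuously true for an empty sequence
    }
}
```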

Querying a RavenDB index against an external List<T>

I have the following RavenDB Index:
public class RidesByPostcode : AbstractIndexCreationTask<Ride, RidesByPostcode.IndexEntry>
{
public class IndexEntry
{
public string PostcodeFrom { get; set; }
public string PostcodeTo { get; set; }
}
public RidesByPostcode()
{
Map = rides => from doc in rides
select new
{
doc.DistanceResult.PostcodeFrom,
doc.DistanceResult.PostcodeTo
};
StoreAllFields(FieldStorage.Yes);
}
}
I also have a list of strings representing postcodes, and I want to get all the Rides for which the PostcodeFrom is in the list of postcodes:
var postcodes = new List<string> { "postcode 1", "postcode 2" };
var rides = _database.Query<RidesByPostcode.IndexEntry, RidesByPostcode>()
.Where(x => postcodes.Contains(x.PostcodeFrom))
.OfType<Ride>()
.ToList();
But of course RavenDb says it cannot understand the .Contains expression.
How can I achieve such a query in RavenDb without having to call .ToList() before the where clause?

Ok, I found the answer: RavenDb's .In() extension method (see the "Where + In" section of the docs).
Apparently I was thinking from the outside in, instead of from the inside out :)
This is the final query:
var rides = _database.Query<RidesByPostcode.IndexEntry, RidesByPostcode>()
.Where(x => !x.IsAccepted && x.PostcodeFrom.In(postcodes))
.OfType<Ride>()
.ToList();

How do I search nested criteria using MongoDB c# driver (version 2)?

I have a collection of documents that can contain criteria grouped into categories. The structure could look like this:
{
"Name": "MyDoc",
"Criteria" : [
{
"Category" : "Areas",
"Values" : ["Front", "Left"]
},
{
"Category" : "Severity",
"Values" : ["High"]
}
]
}
The class I'm using to create the embedded documents for the criteria looks like this:
public class CriteriaEntity
{
public string Category { get; set; }
public IEnumerable<string> Values { get; set; }
}
The user can choose criteria from each category to search (which comes into the function as IEnumerable<CriteriaEntity>) and the document must contain all the selected criteria in order to be returned. This was my first attempt:
var filterBuilder = Builders<T>.Filter;
var filters = new List<FilterDefinition<T>>();
filters.Add(filterBuilder.Exists(entity =>
userCriterias.All(userCriteria =>
entity.Criteria.Any(entityCriteria =>
entityCriteria.Category == userCriteria.Category
&& userCriteria.Values.All(userValue =>
entityCriteria.Values.Any(entityValue =>
entityValue == userValue))))));
However I get the error: "Unable to determine the serialization information for entity...". How can I get this to work?

MongoDB.Driver 2.0 doesn't support LINQ's All. Anyway, your task can be solved the following way:
var filterDefinitions = new List<FilterDefinition<DocumentEntity>>();
foreach (var criteria in searchCriterias)
{
filterDefinitions
.AddRange(criteria.Values
.Select(value => new ExpressionFilterDefinition<DocumentEntity>(doc => doc.Criterias
.Any(x => x.Category == criteria.Category && x.Values.Contains(value)))));
}
var filter = Builders<DocumentEntity>.Filter.And(filterDefinitions);
return await GetCollection<DocumentEntity>().Find(filter).ToListAsync();

Sort enum column based on language

Say I have the following simple setup:
public enum Status
{
Pending = 0,
Accepted = 1,
Rejected = 2
}
public class MyEntity
{
public Status Status { get; set; }
}
...and I desire the following ascending sorting for Status for two different languages:
Language 1: 1, 0, 2
Language 2: 0, 1, 2
Is there a simple way to do this in entity framework? I feel like this is probably a common scenario.
I'm thinking the only way is to maintain a separate table with all the translations for Status... then sort by the name column in that table based on the users current language. I wanted to see if anyone had any other ideas though.

You could write a query without having a separate table, but it's not pretty. Nor is it very flexible. Since EF relies on well known methods to map to the equivalent SQL function, you can't interject custom functions. However, you can manually set the order using a let:
var query = from e in MyEntity
let sortOrder = UserLanguage == Language1 // Handle Language 1 sort
? e.Status == Status.Pending
? 1 : e.Status == Status.Accepted ? 0 : 2
: e.Status == Status.Pending // Handle Language 2 sort
? 0 : e.Status == Status.Accepted ? 1 : 2
orderby sortOrder
select e;
And this is only for two languages. Another way I can think of is to write the expression tree yourself. This would have the advantage that you can split the logic for each language. We can extract the ternary condition logic and place it in a static extension class. Here's a sample of what it could look like:
public static class QuerySortExtension
{
private static readonly Dictionary<string, Expression<Func<MyEntity, int>>> _orderingMap;
private static readonly Expression<Func<MyEntity, int>> _defaultSort;
public static IOrderedQueryable<MyEntity> LanguageSort(this IQueryable<MyEntity> query, string language)
{
Expression<Func<MyEntity, int>> sortExpression;
if (!_orderingMap.TryGetValue(language, out sortExpression))
sortExpression = _defaultSort;
return query.OrderBy(sortExpression);
}
static QuerySortExtension()
{
_orderingMap = new Dictionary<string, Expression<Func<MyEntity, int>>>(StringComparer.OrdinalIgnoreCase) {
{ "EN", e => e.Status == Status.Pending ? 1 : e.Status == Status.Accepted ? 0 : 2 },
{ "JP", e => e.Status == Status.Pending ? 2 : e.Status == Status.Accepted ? 1 : 0 }
};
// Default ordering
_defaultSort = e => (int)e.Status;
}
}
And you can use the following method this way:
var entities = new[] {
new MyEntity { Status = Status.Accepted },
new MyEntity { Status = Status.Pending },
new MyEntity { Status = Status.Rejected }
}.AsQueryable();
var query = from e in entities.LanguageSort("EN")
select e;
var queryJp = from e in entities.LanguageSort("JP")
select e;
Console.WriteLine("Sorting with EN");
foreach (var e in query)
Console.WriteLine(e.Status);
Console.WriteLine("Sorting with JP");
foreach (var e in queryJp)
Console.WriteLine(e.Status);
Which outputs the following:
Sorting with EN
Accepted
Pending
Rejected
Sorting with JP
Rejected
Accepted
Pending

EF will sort based on the type of DB column you have mapped the enum to (presumably an int), so it will sort the same way it sorts ints.
If you want some kind of custom sorter, you will have to write your own and sort in memory.
I'm not really sure I understand your separate table concept; could you give more details?
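A minimal sketch of what sorting in memory could look like, assuming the entities have already been loaded (for example with ToList()) and that the translated status names live in an in-memory lookup. The class name InMemoryStatusSort, the method SortByStatusName, the language codes and the German names are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Status { Pending = 0, Accepted = 1, Rejected = 2 }

public class MyEntity
{
    public Status Status { get; set; }
}

// Hypothetical helper; names and translations are illustrative only.
public static class InMemoryStatusSort
{
    // Display name of each status per language code.
    private static readonly Dictionary<string, Dictionary<Status, string>> Names =
        new Dictionary<string, Dictionary<Status, string>>
        {
            ["EN"] = new Dictionary<Status, string>
            {
                [Status.Pending] = "Pending",
                [Status.Accepted] = "Accepted",
                [Status.Rejected] = "Rejected",
            },
            ["DE"] = new Dictionary<Status, string>
            {
                [Status.Pending] = "Ausstehend",
                [Status.Accepted] = "Angenommen",
                [Status.Rejected] = "Abgelehnt",
            },
        };

    // Sorts already-loaded entities by the translated status name.
    public static List<MyEntity> SortByStatusName(IEnumerable<MyEntity> entities, string language)
    {
        var names = Names[language];
        return entities.OrderBy(e => names[e.Status], StringComparer.Ordinal).ToList();
    }

    public static void Main()
    {
        var entities = new[]
        {
            new MyEntity { Status = Status.Pending },
            new MyEntity { Status = Status.Accepted },
            new MyEntity { Status = Status.Rejected },
        };

        foreach (var language in new[] { "EN", "DE" })
        {
            var order = SortByStatusName(entities, language).Select(e => e.Status);
            Console.WriteLine($"{language}: {string.Join(", ", order)}");
        }
        // EN: Accepted, Pending, Rejected
        // DE: Rejected, Accepted, Pending
    }
}
```

The trade-off is that every row has to be fetched before it can be sorted, so this only suits result sets that fit comfortably in memory. For real translations a culture-aware comparer would likely be a better fit than StringComparer.Ordinal.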

I accepted Simon's answer since it was an alternative to what I mentioned in the question. Here's a way I figured out how to do it with a separate table that works. Obvious downside is that you have to maintain translations in the database...
public enum Status
{
Pending = 0,
Accepted = 1,
Rejected = 2
}
public class MyEntity
{
public int MyEntityID { get; set; }
public Status Status { get; set; }
}
public enum Language
{
Language1 = 0,
Language2 = 1
}
public class StatusTranslation
{
public int StatusTranslationID { get; set; }
public Language Language { get; set; }
public Status Status { get; set; }
public string Name { get; set; }
}
Insert all the translations into the SQL table in whatever way:
INSERT INTO StatusTranslations (Language, Status, Name) VALUES (0, 0, 'Pending')
-- etc...
INSERT INTO StatusTranslations (Language, Status, Name) VALUES (1, 0, 'En attendant')
-- etc...
Now join and sort:
var userLanguage = Language.Language1;
var results = db.MyEntities.Join(
db.StatusTranslations,
e => e.Status,
s => s.Status,
(e, s) => new { MyEntity = e, Status = s }
)
.Where(e => e.Status.Language == userLanguage)
.Select(e => new {
MyEntityID = e.MyEntity.MyEntityID,
StatusName = e.Status.Name
}).OrderBy(e => e.StatusName).ToList();
I never compiled this example code (only my own code) so I apologise for any mistakes, but it should give the idea well enough.
