I have the following example:
public class Commands
{
public int ID { get; set; }
public List<string> Alias { get; set; }
}
public class UserAccess
{
public int AccessID { get; set; }
// other stuff not needed for the question
public List<Commands> AllowedCommands { get; set; }
}
Now I wanted to implement on UserAccess a way to return the command, or null if no alias was found in the list; see a rough example of what I mean below in HasCommand:
public class UserAccess
{
public int ID { get; set; }
// other stuff not needed for the question
public List<Commands> AllowedCommands { get; set; }
public Commands HasCommand(string cmd)
{
foreach (Commands item in this.AllowedCommands)
{
if (item.Alias.Find(x => string.Equals(x, cmd, StringComparison.OrdinalIgnoreCase)) != null)
return item;
}
return null;
}
}
My question is: what would be the most efficient way to implement the HasCommand method?
Or is there a better way to implement it on UserAccess?
It can be shortened a little bit:
public Commands HasCommand(string cmd)
{
return AllowedCommands.FirstOrDefault(c => c.Alias.Contains(cmd, StringComparer.OrdinalIgnoreCase));
}
but it's pretty much the same thing.
public Commands HasCommand(string cmd)
{
return this.AllowedCommands.FirstOrDefault(item => item.Alias.Find(x => string.Equals(x, cmd, StringComparison.OrdinalIgnoreCase)) != null);
}
You do not need to use Where + FirstOrDefault; FirstOrDefault can take the condition directly.
Also, 3 suggestions for further improvement:
(1) I would encourage the use of IEnumerable instead of List, if possible.
(2) I would call "Commands" just "Command".
(3) I would make all commands be able to be easily referenced via a class like this:
public class Command {
public Command(int id, IEnumerable<string> aliases) {
Id = id;
Aliases = aliases;
}
public int Id { get; set; }
public IEnumerable<string> Aliases { get; set; }
}
public class Commands {
public static readonly Command CommandNameHere1 = new Command(yourIdHere1, yourAliasesHere1);
public static readonly Command CommandNameHere2 = new Command(yourIdHere2, yourAliasesHere2);
//etc.
}
Assuming that by "efficient" you mean fast: anytime you are looking up a string in a collection of strings, and that collection is likely to contain more than a few entries, you should use a hash lookup. A simple scan of the list takes time proportional to the item count, while the count has little effect on a hash lookup. In .NET, this has traditionally been handled by the Dictionary class, which is commonly used to index a collection of objects with a key (often a string). However, a dictionary requires a value for every key, and this led to passing the same string in as both the key and the value - rather ugly. Finally, .NET 3.5 introduced HashSet<T>, which you should use for such a case of having only a key and no value.
In your case, you have the (not uncommon) need for a case-insensitive compare. The common solution is to lower-case the string keys when adding them to the dictionary (or HashSet). This tiny overhead on add is vastly outweighed by the savings on lookups, since case-insensitive compares are considerably slower than case-sensitive ones, especially with Unicode - the CPU can't just do a block compare of the data, but must check each pair of characters specially (even using a table look-up, this is much slower).
If your alias names can be stored in lower case, change them from List<string> to HashSet<string>. If not, use a Dictionary<string, string> where the key is added in lower case and the value is the original (mixed-case) alias string. Assuming the use of Dictionary, your code would become:
public Commands HasCommand(string cmd)
{
foreach (Commands item in AllowedCommands)
{
if (item.Alias.ContainsKey(cmd.ToLowerInvariant())) // keys were added in lower case
return item;
}
return null;
}
Finally, also on the subject of performance, using LINQ is almost always going to result in slower performance - somewhere between a little slower and a lot slower, depending upon the situation. It does make nice, compact source for simple things, and I use it quite a bit myself, but if you're certain that performance is an issue for a piece of a code, you probably shouldn't use it (unless it's PLINQ, of course).
So, if you want as few lines of code as possible, use the other answer posted here. If you want speed, use mine.
It almost goes without saying, but when you're worried about the performance of some small chunk of code like this, just wrap it in a for loop and repeat it until it takes 5-10 seconds to execute - just add orders of magnitude as needed, whether it's 1,000 or 1,000,000 reps, and time it with System.Diagnostics.Stopwatch. Try alternative logic, and repeat the test. The 5-10 seconds is a minimum designed to mask the fluctuations caused by a managed environment and other stuff executing on the same machine (you should obviously also avoid running other apps during the test). Of course, for overall performance testing of a complicated application, a performance analyzer tool would be recommended.
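A minimal sketch of that timing harness (the helper name is mine, not from the original):

```csharp
using System;
using System.Diagnostics;

static class MicroBench
{
    // Runs the code under test `reps` times and returns elapsed milliseconds.
    // Pick reps so the total is 5-10 seconds, to mask GC and OS noise.
    public static long Time(Action body, int reps)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < reps; i++)
            body();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Time each alternative with the same rep count and compare the returned milliseconds.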
Related
I need to be able to search over a collection of approx 2 million items in C#. Search should be possible over multiple fields. Simple string-matching is good enough.
Using an external dependency like a database is not an option, but using an in-memory database would be OK.
The main goal is to do this in a memory-efficient way.
The type in the collection is quite simple and has no long strings:
public class Item
{
public string Name { get; set; } // Around 50 chars
public string Category { get; set; } // Around 20 chars
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; } // 2-3 items
}
Focus and requirements
Clarification of focus and requirements:
No external dependencies (like a database)
Memory-efficient (below 2 GB for 2 million items)
Searchable items in collection (must be performant)
Today's non-optimal solution
Using a simple List<T> of the above type, whether defined as a class or a struct, still requires about 2 GB of memory.
Is there a better way?
The most significant memory hog in your class is the use of a read-only list. Get rid of it and you will reduce memory footprint by some 60% (tested with three tags):
public class Item
{
public string Name { get; set; }
public string Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public string Tags { get; set; } // Semi-colon separated
}
Also, consider using DateTime instead of DateTimeOffset. That will further reduce the memory footprint by around 10%.
There are many things you can do to reduce the memory footprint of your data, but probably the easiest thing to do with the greatest impact would be to intern all strings, or at least those that you expect to be repeated a lot.
// Rough example (no checks for null values)
public class Item
{
private string _name;
public string Name
{
get { return _name; }
set { _name = String.Intern(value); }
}
private string _category;
public string Category
{
get { return _category; }
set { _category = String.Intern(value); }
}
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
private IReadOnlyList<string> _tags;
public IReadOnlyList<string> Tags
{
get { return _tags; }
set { _tags = Array.AsReadOnly(value.Select(s => String.Intern(s)).ToArray()); }
}
}
Another thing you could do, more difficult and with smaller impact, would be to assign the same IReadOnlyList<string> object to items with identical tags (assuming that many items with identical tags exist in your data).
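A sketch of that sharing idea (the cache key format is my own choice for illustration): identical tag sequences resolve to one shared read-only list instance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagListPool
{
    private static readonly Dictionary<string, IReadOnlyList<string>> _cache =
        new Dictionary<string, IReadOnlyList<string>>();

    // Returns one shared IReadOnlyList<string> per distinct tag sequence.
    public static IReadOnlyList<string> Get(IEnumerable<string> tags)
    {
        var array = tags.Select(String.Intern).ToArray();
        var key = string.Join(";", array); // assumes tags contain no ';'
        if (!_cache.TryGetValue(key, out var shared))
            _cache[key] = shared = Array.AsReadOnly(array);
        return shared;
    }
}
```

Items constructed with the same tags then all point at the same list object instead of each holding their own copy.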
Update: Also don't forget to call TrimExcess on the list after you fill it with items, to get rid of the unused capacity.
This method can be used to minimize a collection's memory overhead if no new elements will be added to the collection.
With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1,000 bytes per item, which should be a generous budget.
If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.
Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.
You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?
With that out of the way, let's get into an answer.
Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in time linear in the total text length, and then allows searches in time linear in the length of the search term. Although web pages are generally smaller than 2 GB, linear creation and fast searching scale very well.
Find or implement a Suffix Tree.
The suffix tree allows you to find substrings (with time complexity O(m), where m is the length of the search string) and get back the original objects they occur in.
Construct the suffix tree once, with the strings of each object pointing back to that object.
Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.
Edit: On memory usage
As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.
For example, the words fire and firm will be stored as a parent node fir with two leaf nodes, e and m, thus forming the words. Should the word fish be introduced, the node fir will be split: a parent node fi, with child nodes sh and r, and the r having child nodes e and m. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.
With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a through z, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.
The only overhead comes from storing separate substrings and the nodes that represent and connect them.
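For a feel of the idea without implementing a full suffix tree, here is a naive suffix-index sketch (my own illustration; it stores every suffix explicitly, trading memory for simplicity, whereas a real suffix tree or suffix array shares that storage):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SubstringIndex<T>
{
    private readonly (string Suffix, T Owner)[] _entries;

    public SubstringIndex(IEnumerable<(string Text, T Owner)> source)
    {
        var list = new List<(string, T)>();
        foreach (var (text, owner) in source)
            for (int i = 0; i < text.Length; i++)
                list.Add((text.Substring(i), owner));
        // Sorted suffixes: all suffixes starting with a query form a contiguous run.
        _entries = list.OrderBy(e => e.Item1, StringComparer.Ordinal).ToArray();
    }

    public IEnumerable<T> Search(string query)
    {
        // Binary search for the first suffix >= query ...
        int lo = 0, hi = _entries.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(_entries[mid].Suffix, query) < 0) lo = mid + 1;
            else hi = mid;
        }
        // ... then scan while the query is a prefix of the suffix.
        var seen = new HashSet<T>();
        for (int i = lo;
             i < _entries.Length &&
             _entries[i].Suffix.StartsWith(query, StringComparison.Ordinal);
             i++)
        {
            if (seen.Add(_entries[i].Owner))
                yield return _entries[i].Owner;
        }
    }
}
```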
You can try these points, then see whether the trouble remains:
You can enable gcAllowVeryLargeObjects to allow arrays greater than 2 gigabytes.
Keep the class implementation. When you choose between class and struct, performance is not the main factor; I see no reason to use a struct here. See Choosing Between Class and Struct.
Depending on your search filter, you may need to override GetHashCode and Equals.
Do you need to mutate the properties, or just search for objects in the collection?
If you only need to search, and your property values repeat themselves a lot, you can share one value object between many items.
That way, the value is stored only once, and each object stores only a reference.
You can do this only if you don't need to mutate the property.
For example, if two objects have the same category:
public class Category
{
public string Value { get; }
public Category(string category)
{
Value = category;
}
}
public class Item
{
public string Name { get; set; }
public Category Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; }
}
class Program
{
public void Init()
{
Category category = new Category("categoryX");
var obj1 = new Item
{
Category = category
};
var obj2 = new Item
{
Category = category
};
}
}
I would not expect any major memory issues with 2M objects if you are running 64-bit. There is a maximum size of 2 GB for a single list, but a reference is only 8 bytes, so the list itself should be well under this limit. The total memory usage will depend mostly on how large the strings are. There will also be some per-object overhead, but this is difficult to avoid if you need to store multiple strings.
Also, how do you measure memory? The .NET runtime might over-allocate memory, so the actual usage of your objects might be significantly lower than the memory reported by Windows. Use a memory profiler to get an exact count.
If strings are duplicated between many objects there might be a major win if you can deduplicate them, making use of the same instance.
Using a struct instead of a class could avoid some overhead, so I made some tests:
list of objects using LINQ - 46ms
list of objects using for loop - 16ms
list of structs using for loop - 250ms
list of readonly structs with ref-return using for loop: 180ms
The exact times will depend on what query you are doing, these numbers are mostly for comparison.
Conclusion is that a regular List of objects with a regular for loop is probably the fastest. Also, iterating over all objects is quite fast, so in most cases it should not cause a major performance issue.
If you need better performance you will need to create some kind of index so you can avoid iterating over all items. Exact strategies for this is difficult to know without knowing what kinds of queries you are doing.
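As a sketch of such an index (the field choice is illustrative): bucket the items by one queried field up front, so lookups on that field no longer scan all 2M items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public string Name { get; set; }
    public string Category { get; set; }
}

public class CategoryIndex
{
    private readonly Dictionary<string, List<Item>> _byCategory;

    public CategoryIndex(IEnumerable<Item> items)
    {
        // Build once; each lookup afterwards is a single hash probe.
        _byCategory = items
            .GroupBy(i => i.Category, StringComparer.Ordinal)
            .ToDictionary(g => g.Key, g => g.ToList(), StringComparer.Ordinal);
    }

    public IReadOnlyList<Item> ByCategory(string category) =>
        _byCategory.TryGetValue(category, out var list)
            ? (IReadOnlyList<Item>)list
            : Array.Empty<Item>();
}
```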
One option could be to use some variant of an in-memory database, which could provide most of the indexing functionality; SQLite would be one example.
If the categories can be defined as an enum, you can map them to a small integer, which would reduce the size considerably: from around 20 bytes to 2 bytes (a short), saving roughly 36 MB across 2M objects.
I have a part of my application which queries a database for records. One of the fields in this query is a string status with known values:
Open, Closed, Cancel
The user has 3 check boxes and can select any combination to determine which types of records they get back. So in my view model I have a status filter property with 3 bools:
public class SalesOrderStatusFilter
{
public bool Open { get; set; }
public bool Closed { get; set; }
public bool Canceled { get; set; }
}
Now when a query is run, I'd like to filter the results based on the chosen status types. Right now I've got a linq query like this:
public IEnumerable<SalesOrders> GetSalesOrders(SalesOrderParams parameters)
{
return _dbContext.SalesOrderLookup()
.Where(x => (x.Status.EqualsTrim("Open") && parameters.SalesOrderStatusFilter.Open)
|| (x.Status.EqualsTrim("Closed") && parameters.SalesOrderStatusFilter.Closed)
|| (x.Status.EqualsTrim("Cancel") && parameters.SalesOrderStatusFilter.Canceled)).ToList();
}
This is a common pattern across my application and I'd like to find a better solution that I can reuse without having to keep typing out the query every time. I've already tested converting my db string statuses to enums using some custom attributes, reflection, etc., but I'm worried it's a bit overkill when I'm doing mostly view-only queries for these various reports, so I'm not sure I'm going to stick with it. It also added a bit of a performance hit to do the enum conversion (the enum values didn't always match the database values, which is why I was using reflection and custom attributes).
Can anybody recommend a good approach to dealing with this problem?
Edit:
for clarity, the SalesOrderStatusFilter is a property of another class:
public class SalesOrderParams
{
public string SalesOrderNumber { get; set; }
public SalesOrderStatusFilter SalesOrderStatusFilter { get; set; }
}
I think the main challenge I'm trying to solve is mapping the bools to their string equivalents, which may not always match by name (sometimes there's a space, for example), and then making a more concise and reusable call.
Please try it like this: compare each status string match with the corresponding parameter flag, i.e. replace && with ==:
_dbContext.SalesOrderLookup()
.Where(x => (x.Status.EqualsTrim("Open") == parameters.SalesOrderStatusFilter.Open)
|| (x.Status.EqualsTrim("Closed") == parameters.SalesOrderStatusFilter.Closed)
|| (x.Status.EqualsTrim("Cancel") == parameters.SalesOrderStatusFilter.Canceled)).ToList();
Why don't you use a bool type? Even if you use enums defined as byte, that takes 1 byte per record, while three bit columns take only 3 bits. Checking a bit is also faster than comparing a string. So I think the better choice is to define three bool status variables matching your check boxes. That way you will also be able to read from and write to the db directly without any data conversion.
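Another reusable sketch of the mapping the question asks about: project the flags into the list of matching database status strings (values copied from the question), then filter with Contains, which LINQ providers translate into a SQL IN clause:

```csharp
using System.Collections.Generic;

public class SalesOrderStatusFilter
{
    public bool Open { get; set; }
    public bool Closed { get; set; }
    public bool Canceled { get; set; }

    // Maps each selected flag to its database string; note the db value
    // "Cancel" differs from the property name, as the question mentions.
    public List<string> ToStatusList()
    {
        var statuses = new List<string>();
        if (Open) statuses.Add("Open");
        if (Closed) statuses.Add("Closed");
        if (Canceled) statuses.Add("Cancel");
        return statuses;
    }
}

// Usage sketch against the query from the question:
// var statuses = parameters.SalesOrderStatusFilter.ToStatusList();
// return _dbContext.SalesOrderLookup()
//     .Where(x => statuses.Contains(x.Status.Trim()))
//     .ToList();
```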
How do I check if a nested model object has any items?
Ie. if I have an object/viewmodel:
public class CarViewModel
{
public string Type { get; set; }
public long ID { get; set; }
public virtual IQueryable<Feature> Features { get; set; }
}
public class Feature
{
public string Offer { get; set; }
public decimal Rate { get; set; }
public virtual CarViewModel CarViewModel { get; set; }
}
...and it is populated as follows - so that 1 car object has 2 additional features, and the other car object, has no additional features:
[
{"Type":"SoftTop","ID":1,
"Features":
[{"Offer":"Alloys","Rate":"500"},{"Offer":"Standard","Rate":"100"}]},
{"Type":"Estate","ID":2,
"Features":[]}
]
So in my code, I had "Cars" populated with the data above:
foreach (var car in Cars)
{
if (!car.Features.Any())
{
car.Type = "Remove";
}
}
However, I get the message: This method is not supported against a materialized query result. at the if (!car.Features.Any()) line.
I got the same error when trying if (car.Features.Count()==0)
Is there a way of checking if the number of Features is 0?
Or is there a linq way of removing any items from the object, where the number of features is 0?
Thank you,
Mark
UPDATE
I changed the viewModel to use IEnumerable and then the following:
cars = cars.Where(x => x.Features.Any()).ToList();
That seems to work - although I'm not 100% sure. If anyone can say whether this is a "bad" fix or not, I'd appreciate it.
Thanks, Mark
Try fetching the results first, then checking the count:
car.Features.ToList().Count
I don't think there is anything wrong with the fix - when you're using an IQueryable<T> that came from a LINQ-to-database provider (L2S, Entity Framework, etc.) you pretty much have to materialise it before you can use things like Any() or Count() inside a foreach.
As to why this is - I am actually not 100% certain, and I believe the error is a bit misleading in this respect, but I think what it's complaining about is that neither Cars nor car.Features has actually been fully evaluated and run yet (i.e. you only start to hit the database at the point where you go foreach ... in your code, because it's IQueryable<T>).
However on a broader note I'd recommend you not use IQueryable<T> in your Viewmodels, much safer to use IEnumerable<T> - no chance of accidentally setting off a database access when rendering your view, for example.
And also, when you are returning data from your DataLayer or wherever, a good rule of thumb is to materialise it as quickly as possible, so as to be able to move on with an actual list of actual things as opposed to a "promise to go and look" for certain things in the database at some unspecified point in the future :) So your DataLayers should only ever return IEnumerable<T>s.
You can always convert an IEnumerable<T> back to an IQueryable<T> with AsQueryable() if for some reason you need to...
I need to know if there are some performance problem/consideration if I do something like this:
public Hashtable Properties = ...;
public double ItemNumber
{
get { return (double)Properties["ItemNumber"]; }
set
{
Properties["ItemNumber"] = value;
}
}
public string Property2 ...
public ... Property3 ...
Instead of accessing the property directly:
public string ItemNumber { get; set; }
public string prop2 { get; set; }
public string prop3 { get; set; }
It depends on your performance requirements... Accessing a Hashtable and casting the result is obviously slower than just accessing a field (auto-properties create a field implicitly), but depending on what you're trying to do, it might or might not make a significant difference. Complexity is O(1) in both cases, but accessing a hashtable obviously takes more cycles...
Well, compared to the direct property access it will surely be slower because much more code needs to be executed for the get and set operations. But since you are using a Hashtable the access should be pretty fast. You are also getting an additional overhead because of the casting since you are using weakly typed collection. Things like boxing and unboxing come with a cost. The question is whether all this will affect noticeably the performance of your application. It would really depend on your requirements. I would recommend you performing some load tests to see if this could be a bottleneck.
Yes, I know, yet another question about mutable objects. See this for general background and this for the closest analogue to my question. (though it has some C++ specific overtones that don't apply here)
Let's assume that the following pseudo code represents the best interface design. That is, it's the clearest expression of the business semantics (as they stand today) into OO type. Naturally, the UglyData and the things we're tasked to do with it are subject to incremental change.
public class FriendlyWrapper
{
public FriendlyWrapper(UglyDatum u)
{
Foo = u.asdf[0].f[0].o.o;
Bar = u.barbarbar.ToDooDad();
Baz = u.uglyNameForBaz;
// etc
}
public Widget Foo { get; private set; }
public DooDad Bar { get; private set; }
public DooDad Baz { get; private set; }
// etc
public WhizBang Expensive1 { get; private set; }
public WhizBang Expensive2 { get; private set; }
public void Calculate()
{
Expensive1 = Calc(Foo, Bar);
Expensive2 = Calc(Foo, Baz);
}
private WhizBang Calc(Widget a, DooDad b) { /* stuff */ }
public override string ToString()
{
return string.Format("{0}{1}{2}{3}{4}", Foo, Bar, Baz, (object)Expensive1 ?? "", (object)Expensive2 ?? "");
}
}
// Consumer 1 is happy to work with just the basic wrapped properties
public string Summarize()
{
var myStuff = from u in data
where IsWhatIWant(u)
select new FriendlyWrapper(u);
var sb = new StringBuilder();
foreach (var s in myStuff)
{
sb.AppendLine(s.ToString());
}
return sb.ToString();
}
// Consumer 2's job is to take the performance hit up front. His callers might do things
// with expensive properties (eg bind one to a UI element) that should not take noticeable time.
public IEnumerable<FriendlyWrapper> FetchAllData(Predicate<UglyDatum> pred)
{
var myStuff = from u in data
where pred(u)
select new FriendlyWrapper(u);
foreach (var s in myStuff)
{
s.Calculate(); // as written, this doesn't do what you intend...
}
return myStuff;
}
What's the best route here? Options I can see:
Mutable object with an explicit Calculate() method, as above
Mutable object where expensive calculations are done in the getters (and probably cached)
Split into two objects where one inherits (or perhaps composes?) from the other
Some sort of static + locking mechanism, as in the C++ question linked above
I'm leaning toward #2 myself. But every route has potential pitfalls.
If you choose #1 or #2, then how would you implement Consumer2's loop over mutables in a clear, correct manner?
If you choose #1 or #3, how would you handle future situations where you only want to calculate some properties but not others? Willing to create N helper methods / derived classes?
If you choose #4, I think you're crazy, but feel free to explain
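For concreteness, option #2 can be sketched with Lazy<T> (types simplified to strings here; the real Widget/DooDad/WhizBang types would slot in the same way):

```csharp
using System;

public class LazyWrapper
{
    private readonly Lazy<string> _expensive1;

    public LazyWrapper(string foo, string bar)
    {
        Foo = foo;
        Bar = bar;
        // Deferred: Calc runs on first access to Expensive1, then is cached.
        _expensive1 = new Lazy<string>(() => Calc(Foo, Bar));
    }

    public string Foo { get; }
    public string Bar { get; }
    public string Expensive1 => _expensive1.Value;

    private static string Calc(string a, string b) => a + "/" + b; // stand-in for the real work
}
```

Consumer 1 never touches Expensive1 and pays nothing; consumer 2 forces the computation simply by reading the property, with no mutate-in-a-loop pitfalls.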
In your case, since you're using LINQ, you're only going to construct these objects in cases where you want the calculation.
If that is your standard usage pattern, I would just put the expensive calculation directly in the constructor. Using lazy initialization is always slower unless you plan to have some cases where you do not calculate. Doing the calculation in the getters will not save anything (at least in this specific case).
As for mutability - mutable objects with reference syntax and identity (ie: classes in C#) are really okay - it's more a problem when you're dealing with mutable value types (ie: structs). There are many, many mutable classes in the .NET BCL - and they don't cause issues. The problem is typically more of one when you start dealing with value types. Mutable value types lead to very unexpected behavior.
In general, I'd turn this question upside down - How and where are you going to use this object? How can you make this object the most performant (if it's been determined to be problematic) without affecting usability? Your 1), 3) and 4) options would all make usability suffer, so I'd avoid them. In this case, doing 2) won't help. I'd just put it in the constructor, so your object's always in a valid state (which is very good for usability and maintainability).