I need to be able to search over a collection of approx 2 million items in C#. Search should be possible over multiple fields. Simple string-matching is good enough.
Using an external dependency like a database is not an option, but using an in-memory database would be OK.
The main goal is to do this memory-efficiently.
The type in the collection is quite simple and has no long strings:
public class Item
{
public string Name { get; set; } // Around 50 chars
public string Category { get; set; } // Around 20 chars
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; } // 2-3 items
}
Focus and requirements
Clarification of focus and requirements:
No external dependencies (like a database)
Memory-efficient (below 2 GB for 2 million items)
Searchable items in collection (must be performant)
Today's non-optimal solution
Using a simple List<T> over above type, either as a class or a struct, still requires about 2 GB of memory.
Is there a better way?
The most significant memory hog in your class is the use of a read-only list. Get rid of it and you will reduce memory footprint by some 60% (tested with three tags):
public class Item
{
public string Name { get; set; }
public string Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public string Tags { get; set; } // Semi-colon separated
}
Also, consider using DateTime instead of DateTimeOffset. That will further reduce the memory footprint by around 10%.
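With Tags stored as one semicolon-separated string, a tag search no longer needs a list, but a plain Contains call would also match partial tags. The following is a minimal sketch of an exact tag match that avoids allocating a split array; the TagString class and the HasTag name are assumptions made for illustration, and the comparison is ordinal and case-sensitive.

```csharp
using System;

public static class TagString
{
    // Returns true if 'tag' is one of the ';'-separated entries in 'tags' (ordinal, case-sensitive).
    public static bool HasTag(string tags, string tag)
    {
        if (string.IsNullOrEmpty(tags) || string.IsNullOrEmpty(tag)) return false;

        int start = 0;
        while (start <= tags.Length)
        {
            // Find the end of the current entry (next separator or end of string).
            int end = tags.IndexOf(';', start);
            if (end < 0) end = tags.Length;

            // Compare in place, without creating substrings.
            if (end - start == tag.Length &&
                string.CompareOrdinal(tags, start, tag, 0, tag.Length) == 0)
                return true;

            start = end + 1;
        }
        return false;
    }
}
```

For example, TagString.HasTag("red;green;blue", "green") matches, while "gre" does not.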
There are many things you can do to reduce the memory footprint of your data, but probably the easiest thing with the greatest impact would be to intern all strings, or at least those that you expect to be repeated a lot.
// Rough example (no checks for null values)
public class Item
{
private string _name;
public string Name
{
get { return _name; }
set { _name = String.Intern(value); }
}
private string _category;
public string Category
{
get { return _category; }
set { _category = String.Intern(value); }
}
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
private IReadOnlyList<string> _tags;
public IReadOnlyList<string> Tags
{
get { return _tags; }
set { _tags = Array.AsReadOnly(value.Select(s => String.Intern(s)).ToArray()); }
}
}
Another thing you could do, more difficult and with smaller impact, would be to assign the same IReadOnlyList<string> object to items with identical tags (assuming that many items with identical tags exist in your data).
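Sharing one list object between items with identical tags could look roughly like the sketch below. TagListPool and GetShared are hypothetical names, the key separator '\u001F' is assumed never to occur inside a tag, and two tag sets only share a list if their tags appear in the same order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: hands out one shared read-only list per distinct tag sequence.
public sealed class TagListPool
{
    private readonly Dictionary<string, IReadOnlyList<string>> _pool =
        new Dictionary<string, IReadOnlyList<string>>(StringComparer.Ordinal);

    public IReadOnlyList<string> GetShared(IEnumerable<string> tags)
    {
        // Intern each tag, then build a lookup key (assumes tags never contain '\u001F').
        string[] interned = tags.Select(s => string.Intern(s)).ToArray();
        string key = string.Join("\u001F", interned);

        if (!_pool.TryGetValue(key, out IReadOnlyList<string> shared))
        {
            shared = Array.AsReadOnly(interned);
            _pool.Add(key, shared);
        }
        return shared;
    }
}
```

The pool is only needed while loading; once every item has its Tags assigned, the pool itself can be dropped so that its keys are garbage collected.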
Update: Also don't forget to call TrimExcess on the list after you fill it with items, in order to get rid of the unused capacity.
This method can be used to minimize a collection's memory overhead if no new elements will be added to the collection.
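A small sketch of the effect, using a deliberately over-allocated list; the TrimExcessDemo class is only for illustration and the numbers are arbitrary.

```csharp
using System.Collections.Generic;

public static class TrimExcessDemo
{
    // Returns the capacity before and after trimming an over-allocated list.
    public static (int Before, int After) Run()
    {
        var items = new List<int>(capacity: 1000);
        for (int i = 0; i < 10; i++)
            items.Add(i);

        int before = items.Capacity;  // the backing array still has room for 1000 elements
        items.TrimExcess();           // shrink the backing array to fit the 10 stored elements
        return (before, items.Capacity);
    }
}
```

Note that TrimExcess does nothing when the list is already at more than 90% of its capacity, so the saving depends on how the list grew.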
With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1000 bytes per item, which should be more than enough to do this in polynomial time.
If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.
Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.
You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?
With that out of the way, let's get into an answer.
Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in time linear in the total word count, and then allows searches in time linear in the length of the word. Although web pages are generally smaller than 2 GB, linear creation and searching in time proportional to the word length scale very well.
Find or implement a Suffix Tree.
The suffix tree allows you to find substrings (with time complexity O(m), where m is the word length) and get back the original objects they occur in.
Construct the suffix tree once, with the strings of each object pointing back to that object.
Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.
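To make the suffix array idea concrete, here is a deliberately naive sketch. It is not a suffix tree and not a linear-time construction: it simply sorts all suffixes with an ordinal comparison and then binary-searches them. SuffixArrayIndex and Find are hypothetical names, the input strings are assumed to be non-null, and matching is ordinal and case-sensitive.

```csharp
using System;
using System.Collections.Generic;

// Naive suffix array over many short strings (illustrative sketch only).
public sealed class SuffixArrayIndex
{
    private readonly string[] _texts;
    private readonly (int Text, int Offset)[] _suffixes;

    public SuffixArrayIndex(IReadOnlyList<string> texts)
    {
        _texts = new string[texts.Count];
        var suffixes = new List<(int, int)>();
        for (int t = 0; t < texts.Count; t++)
        {
            _texts[t] = texts[t];
            for (int o = 0; o < texts[t].Length; o++)
                suffixes.Add((t, o)); // one entry per character position
        }
        _suffixes = suffixes.ToArray();

        // Sort all suffixes by ordinal order without materializing substrings.
        Array.Sort(_suffixes, (a, b) => string.CompareOrdinal(
            _texts[a.Text], a.Offset, _texts[b.Text], b.Offset, int.MaxValue));
    }

    // Returns the indices of all texts that contain 'pattern'.
    public HashSet<int> Find(string pattern)
    {
        var hits = new HashSet<int>();
        if (string.IsNullOrEmpty(pattern)) return hits;

        // Binary search for the first suffix whose leading characters are >= pattern.
        int lo = 0, hi = _suffixes.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (ComparePrefix(_suffixes[mid], pattern) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All suffixes starting with the pattern are contiguous from 'lo' onwards.
        for (int i = lo; i < _suffixes.Length && ComparePrefix(_suffixes[i], pattern) == 0; i++)
            hits.Add(_suffixes[i].Text);
        return hits;
    }

    // Compares at most pattern.Length characters of the suffix against the pattern.
    private int ComparePrefix((int Text, int Offset) s, string pattern) =>
        string.CompareOrdinal(_texts[s.Text], s.Offset, pattern, 0, pattern.Length);
}
```

Built over "fire", "firm" and "fish", a search for "ir" returns the indices 0 and 1. The array holds one 8-byte entry per character of the indexed text, so for 2 million items with names of around 50 characters it is worth measuring the result against the memory budget before committing to this approach.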
Edit: On memory usage
As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.
For example, the words fire and firm will be stored as a parent node fir with two leaf nodes, e and m, thus forming the words. Should the word fish be introduced, the node fir will be split: a parent node fi, with child nodes sh and r, and the r having child nodes e and m. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.
With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a through z, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.
The only overhead comes from storing separate substrings and the nodes that represent and connect them.
You can try the following points and then see whether there is still a problem:
You can enable gcAllowVeryLargeObjects to allow arrays that are larger than 2 gigabytes.
Keep the class implementation. When you choose between class and struct, performance is not the main factor, and I see no reason to use a struct here. See Choosing Between Class and Struct.
Depending on your search filter, you must override GetHashCode and Equals.
Do you need to mutate properties, or just search for objects in the collection?
If you only need to search, and if your property values repeat a lot, many objects can share a single property instance.
That way, the value is stored only once, and each object stores only the reference.
You can do this only if you do not need to mutate the property.
For example, if two objects have the same category:
public class Category
{
public string Value { get; }
public Category(string category)
{
Value = category;
}
}
public class Item
{
public string Name { get; set; }
public Category Category { get; set; }
public bool IsActive { get; set; }
public DateTimeOffset CreatedAt { get; set; }
public IReadOnlyList<string> Tags { get; set; }
}
class Program
{
public void Init()
{
Category category = new Category("categoryX");
var obj1 = new Item
{
Category = category
};
var obj2 = new Item
{
Category = category
};
}
}
I would not expect any major memory issues with 2M objects if you are running as a 64-bit process. Lists have a maximum size of 2 GB, but a reference is only 8 bytes, so the list should be well under this limit. The total memory usage will depend mostly on how large the strings are. There will also be some object overhead, but this is difficult to avoid if you need to store multiple strings.
Also, how do you measure memory? The .Net runtime might over allocate memory, so the actual memory usage of your object might be significantly lower than the memory reported by windows. Use a memory profiler to get an exact count.
If strings are duplicated between many objects there might be a major win if you can deduplicate them, making use of the same instance.
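One way to deduplicate strings without using String.Intern is a small pool that is only kept alive while loading. This is a sketch under the assumption that ordinal equality is the right notion of "duplicate"; StringPool and GetOrAdd are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool that returns one canonical instance per distinct string value.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count => _pool.Count;

    public string GetOrAdd(string value)
    {
        if (value == null) return null;

        // Reuse the instance seen first; later duplicates become garbage.
        if (_pool.TryGetValue(value, out string existing))
            return existing;

        _pool.Add(value, value);
        return value;
    }
}
```

Unlike interned strings, which stay alive for the lifetime of the process, strings in such a pool can be collected once the pool and the items referencing them are gone.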
Using a struct instead of a class could avoid some overhead, so I ran some tests:
list of objects using LINQ - 46ms
list of objects using for loop - 16ms
list of structs using for loop - 250ms
list of readonly structs with ref-return using for loop: 180ms
The exact times will depend on what query you are doing, these numbers are mostly for comparison.
Conclusion is that a regular List of objects with a regular for loop is probably the fastest. Also, iterating over all objects is quite fast, so in most cases it should not cause a major performance issue.
If you need better performance you will need to create some kind of index so you can avoid iterating over all items. Exact strategies for this are difficult to recommend without knowing what kinds of queries you are doing.
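As one illustration of such an index, assuming exact-match queries on a single field such as Category, a dictionary from field value to item positions avoids the full scan. FieldIndex, Add and Lookup are hypothetical names, and case-insensitive ordinal matching is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical exact-match index: maps a field value to the positions of the items that have it.
public sealed class FieldIndex
{
    private readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    // Registers that the item at 'itemPosition' (its index in the main list) has 'fieldValue'.
    public void Add(string fieldValue, int itemPosition)
    {
        if (!_positions.TryGetValue(fieldValue, out List<int> list))
        {
            list = new List<int>();
            _positions.Add(fieldValue, list);
        }
        list.Add(itemPosition);
    }

    // Returns the positions of all items with the given value, or an empty list.
    public IReadOnlyList<int> Lookup(string fieldValue) =>
        _positions.TryGetValue(fieldValue, out List<int> list)
            ? list
            : (IReadOnlyList<int>)Array.Empty<int>();
}
```

The index costs 4 bytes per item plus the dictionary overhead, and it only helps for whole-value matches; substring queries need a different structure.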
One option could be to use some variant of an in-memory database, which could provide most of the indexing functionality. SQLite would be one example.
If the categories can be defined as an enum, you can map them to bits, which would reduce the size considerably: from 20 bytes to, say, 2 bytes (a short), which could save roughly 36 MB for 2 million objects.
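A sketch of what that could look like, assuming the set of categories is fixed and known in advance. The category names below are invented placeholders, and ItemCategoryParser is a hypothetical helper for converting the incoming text.

```csharp
using System;

// Hypothetical fixed set of categories stored in 2 bytes; the real names would come from the data.
public enum ItemCategory : short
{
    Unknown = 0,
    Books = 1,
    Toys = 2,
    Garden = 3
}

public static class ItemCategoryParser
{
    // Converts the textual category to its enum value (case-insensitive).
    // Unrecognized or purely numeric input that is not a defined member maps to Unknown.
    public static ItemCategory Parse(string text) =>
        Enum.TryParse(text, ignoreCase: true, out ItemCategory parsed)
            && Enum.IsDefined(typeof(ItemCategory), parsed)
            ? parsed
            : ItemCategory.Unknown;
}
```

Searching by category then becomes an integer comparison, and the display name can be recovered with ToString() when needed.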
Related
Introduction to the goal:
I am currently trying to optimize performance and memory usage of my code. (mainly Ram bottleneck)
The program will have many instances of the following element at the same time. Especially when historic prices should be processed at the fastest possible rate.
The struct looks like this in its simplest form:
public struct PriceElement
{
public DateTime SpotTime { get; set; }
public decimal BuyPrice { get; set; }
public decimal SellPrice { get; set; }
}
I realized the performance benefits of using the struct just like an empty bottle and refill it after consumption. This way, I do not have to reallocate memory for each single element in the line.
However, it also made my code a little more dangerous for human errors in the program code. Namely I wanted to make sure that I always update the whole struct at once rather than maybe ending up with just an updated sellprice and buyprice because I forgot to update an element.
The element is very neat like this but I have to offload methods into functions in another classes in order to have the functionality I require - This in turn would be less intuitive and thus less preferable in code.
So I added some basic methods which make my life a lot easier:
public struct PriceElement
{
public PriceElement(DateTime spotTime = default(DateTime), decimal buyPrice = 0, decimal sellPrice = 0)
{
// assign datetime min value if not happened already
spotTime = spotTime == default(DateTime) ? DateTime.MinValue : spotTime;
this.SpotTime = spotTime;
this.BuyPrice = buyPrice;
this.SellPrice = sellPrice;
}
// Data
public DateTime SpotTime { get; private set; }
public decimal BuyPrice { get; private set; }
public decimal SellPrice { get; private set; }
// Methods
public decimal SpotPrice { get { return ((this.BuyPrice + this.SellPrice) / (decimal)2); } }
// refills/overwrites this price element
public void UpdatePrice(DateTime spotTime, decimal buyPrice, decimal sellPrice)
{
this.SpotTime = spotTime;
this.BuyPrice = buyPrice;
this.SellPrice = sellPrice;
}
public override string ToString()
{
System.Text.StringBuilder output = new System.Text.StringBuilder();
output.Append(this.SpotTime.ToString("dd/MM/yyyy HH:mm:ss"));
output.Append(',');
output.Append(this.BuyPrice);
output.Append(',');
output.Append(this.SellPrice);
return output.ToString();
}
}
Question:
Let's say I have PriceElement[1000000] - will those additional methods put additional strain on the system memory or are they "shared" between all structs of type PriceElement?
Will those additional methods increase the time to create a new PriceElement(DateTime, buy, sell) instance, respectively the load on the garbage collector?
Will there be any negative impacts, I have not mentioned here?
will those additional methods put additional strain on the system memory or are they "shared" between all structs of type PriceElement?
Code is shared between all instances. So no additional memory will be used.
Code is stored separately from any data, and the memory for the code is only dependent on the amount of code, not how many instance of objects there are. This is true for both classes and structs. The main exception is generics, this will create a copy of the code for each type combination that is used. It is a bit more complicated since the code is Jitted, cached etc, but that is irrelevant in most cases since you cannot control it anyway.
I would recommend making your struct immutable, i.e. change UpdatePrice so that it returns a new struct instead of changing the existing one. See "Why are mutable structs evil?" for details. Making the struct immutable allows you to mark it as readonly, which can help avoid copies when passing the struct as an in parameter. In modern C# you can take references to structs in an array, and that also helps avoid copies (as you seem to be aware).
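A minimal sketch of the immutable variant, assuming C# 7.2 or later. WithPrices and the PriceMath helper are hypothetical names introduced only to show the pattern.

```csharp
using System;

public readonly struct PriceElement
{
    public PriceElement(DateTime spotTime, decimal buyPrice, decimal sellPrice)
    {
        SpotTime = spotTime;
        BuyPrice = buyPrice;
        SellPrice = sellPrice;
    }

    public DateTime SpotTime { get; }
    public decimal BuyPrice { get; }
    public decimal SellPrice { get; }

    public decimal SpotPrice => (BuyPrice + SellPrice) / 2m;

    // Replaces UpdatePrice: returns a new value instead of mutating this one.
    public PriceElement WithPrices(DateTime spotTime, decimal buyPrice, decimal sellPrice) =>
        new PriceElement(spotTime, buyPrice, sellPrice);
}

public static class PriceMath
{
    // 'in' passes the struct by read-only reference; because the struct is readonly,
    // the compiler does not need to make defensive copies when members are accessed.
    public static decimal Spread(in PriceElement price) => price.SellPrice - price.BuyPrice;
}
```

Updating an array slot then becomes prices[i] = prices[i].WithPrices(...), which always replaces all three values at once.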
I have a method in an ASP.NET application which gets hit a lot and needs to be runtime cached. It accepts the following:
public List<ModelTwo> SomeMethod(List<ModelOne> models, List<Guid> guids)
I can loop through each list selecting unique values and concatenating into a large string. But I'm wondering if there is a faster and more efficient way of doing this?
If these two items are somehow related, you could just create a class to group them together with a unique id. You might have to edit some of this code (I haven't compiled this), but you can get a basic idea of how you can relate the two collections with a unique id(the key) and use that key for a cache. This would be more efficient than looping through the lists and concatenating it into a large string.
Strings are immutable, meaning they can't be changed once they are created. So every time you concatenate, you are in fact creating a new string, which might slow down performance if you have a lot of values.
public class MyValue
{
public Guid Key { get; set; }
public List<ModelOne> Models { get; set; }
public List<Guid> Guids { get; set; }
public MyValue(List<ModelOne> modelsIn, List<Guid> guidsIn)
{
Key = Guid.NewGuid();
Models = modelsIn;
Guids = guidsIn;
}
}
// Cached results, keyed by MyValue.Key
public Dictionary<Guid, List<ModelTwo>> Cache = new Dictionary<Guid, List<ModelTwo>>();
public List<ModelTwo> SomeMethod(MyValue valueIn)
{
if (Cache.TryGetValue(valueIn.Key, out List<ModelTwo> cached))
return cached;
// Do stuff... (ComputeResult is a placeholder for the actual work)
List<ModelTwo> result = ComputeResult(valueIn.Models, valueIn.Guids);
// Put in cache...
Cache[valueIn.Key] = result;
return result;
}
I have an entity that needed a list of type int. Due to this being an internal tool that only I would use, I didn't want to spend a lot of time making a UI/view to edit the list and I sort of cheated.
So, I have the following class:
myitem.cs
public int ID { get; set; }
public string Name { get; set; }
public string Description { get; set; }
virtual public ICollection<Size> Sizes {get;set;}
size.cs
public int ID { get; set; }
public int Size { get; set; }
Between the controller and view controller, I did some funky bits. I had a single text field in the view controller called "Sizes" and I then split the input on a comma to an array, and assign the list to Sizes.
This works perfectly and as expected.
string[] sizes = model.sizes.Split(',');
myitem.size = new List<sizes>();
foreach (string item in sizes)
{
myitem.size.Add( new sizes {Size=int.Parse(item)});
}
In the edit one, I find the object, and create a new text string that basically gets all of them and this also works.
In the edit saving controller, no matter what I try, it seems to append. So, I basically did the following:
MyItem myitem = db.myitems.find(id);
...auto mapper stuff from viewmodel to model...
myitem.sizes=null;
...call same bits as create to split and add to sizes...
db.savechanges();
However, I am now finding that whatever I try to do in edit, it simply adds to the list in addition to what is already there - I can't seem to find a way to remove it.
I have tried many different things (instead of = null, foreach and remove(), and a few others) without much luck.
In the end, I don't think this is the best approach at all as I am going to end up dropping the items and recreating them by the thousands for the sake of saving a few minutes, so, I am going to create a DBSet for sizes and do an ajax interface to list/delete/add them separate to the main model. (If there is an easy way, please let me know?!)
However, the fact that this didn't work has annoyed me and I was wondering if anyone knows why?
I have this class :
public class Item
{
public int Id { get; set; }
public string Name { get; set; }
public decimal Price { get; set; }
}
I want to store instances of Item in a list, and keep it ordered like the user has ordered them (Likely to be in a GUI with up-down arrows while selecting an Item)...
Should I be adding an order member to my Item class, or is there a specific datastructure that can keep an arbitrary user-specified order.
Note: I'm going to use this to keep a list of items, in the order a person has seen them, walking in a store.
If you intend to persist the list to a database then you may want to include an Order property in your Item class; databases such as SQL Server do not guarantee the order of the result set.
List/Array/Collection are names for ordered sequences of items.
List<Item> is enough to keep items in a particular order. Note that re-ordering will be a "slow" ( O(n) ) operation in this case, since moving a single item to a new place shifts the items in between. If you just need Add, a regular List<T> is probably the easiest choice, and it does not require any additional fields.
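The re-ordering mentioned above can be sketched with RemoveAt and Insert. ListReorder.Move is a hypothetical helper; 'to' is the index the item should have after the move.

```csharp
using System;
using System.Collections.Generic;

public static class ListReorder
{
    // Moves the element at index 'from' so that it ends up at index 'to'.
    // The elements in between shift by one position, hence O(n).
    public static void Move<T>(List<T> list, int from, int to)
    {
        if (from < 0 || from >= list.Count) throw new ArgumentOutOfRangeException(nameof(from));
        if (to < 0 || to >= list.Count) throw new ArgumentOutOfRangeException(nameof(to));
        if (from == to) return;

        T item = list[from];
        list.RemoveAt(from);
        list.Insert(to, item);
    }
}
```

An up arrow in the GUI would call Move(items, i, i - 1) and a down arrow Move(items, i, i + 1). For a list that a person orders by hand, the O(n) cost is unlikely to matter.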
I have the following example:
public class Commands
{
public int ID { get; set; }
public List<string> Alias { get; set; }
}
public class UserAccess
{
public int AccessID { get; set; }
// other stuff not needed for the question
public List<Commands> AllowedCommands { get; set; }
}
Now I wanted to implement on the UserAccess a way to return the command ID or NULL if no Alias were found on the list, see a dirty example of what I am saying below HasCommand:
public class UserAccess
{
public int AccessID { get; set; }
// other stuff not needed for the question
public List<Commands> AllowedCommands { get; set; }
public Commands HasCommand(string cmd)
{
foreach (Commands item in this.AllowedCommands)
{
if (item.Alias.Find(x => string.Equals(x, cmd, StringComparison.OrdinalIgnoreCase)) != null)
return item;
}
return null;
}
}
My question is what would be the most efficient way to run or implement the HasCommand method ?
Or is there a better way to implement it into the UserAccess ?
Can be shortened a little bit
public Commands HasCommand(string cmd)
{
return AllowedCommands.FirstOrDefault(c => c.Alias.Contains(cmd, StringComparer.OrdinalIgnoreCase));
}
but it's pretty much the same thing.
public Commands HasCommand(string cmd)
{
return this.AllowedCommands.FirstOrDefault(item => item.Alias.Find(x => string.Equals(x, cmd, StringComparison.OrdinalIgnoreCase)) != null);
}
You do not need to use Where + FirstOrDefault. FirstOrDefault can take a predicate directly.
Also, 3 suggestions for further improvement:
(1) I would encourage the use of IEnumerable instead of List, if possible.
(2) I would call "Commands" just "Command".
(3) I would make all commands be able to be easily referenced via a class like this:
public class Command {
public Command(int id, IEnumerable<string> aliases) {
Id = id;
Aliases = aliases;
}
public int Id { get; set; }
public IEnumerable<string> Aliases { get; set; }
}
public class Commands {
public static readonly Command CommandNameHere1 = new Command(yourIdHere1, yourAliasesHere1);
public static readonly Command CommandNameHere2 = new Command(yourIdHere2, yourAliasesHere2);
//etc.
}
Assuming that by "efficient" you mean fast: any time you are looking up a string in a collection of strings, and that collection is likely to contain more than a few entries, you should use a hash lookup. A simple scan of the list takes time that grows linearly with the number of items, while the count has little effect on a hash lookup. In .NET, this has traditionally been handled by the Dictionary class, which is commonly used to index a collection of objects with a key (which is often a string). However, a Dictionary needs a value for every key, and this led to passing the same string in as both the key and value, which is rather ugly. HashSet, available since .NET 3.5, is what you should use when you only have a key and no value.
In your case, you have the (not uncommon) situation of needing a case-insensitive compare. The common solution for this is to lower-case the string keys when adding them to the dictionary (or HashSet). This tiny overhead on add is vastly outweighed by the savings on lookups, since all programmers should know and understand that case-insensitive compares are vastly slower than case-sensitive, especially with Unicode - the CPU can't just do a block compare of data, but must check each pair of characters specially (even using a table look-up, this is vastly slower).
If your Alias names can be in lower case, change them from List to HashSet. If not, use Dictionary where the key is added as lower case, and the value is the (mixed-case) Alias string. Assuming the use of Dictionary, your code would become:
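public Commands HasCommand(string cmd)
{
// Keys were lower-cased when added, so lower-case the lookup value as well
string key = cmd.ToLowerInvariant();
foreach (Commands item in AllowedCommands)
{
if (item.Alias.ContainsKey(key))
return item;
}
return null;
}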
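As a variation on the approach above (not what this answer describes), the lower-casing step can be avoided by giving the HashSet a case-insensitive comparer, which also keeps the aliases in their original casing. The Command and CommandAccess classes below are renamed stand-ins for the question's types, chosen only to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Generic;

public class Command
{
    public int ID { get; set; }

    // The comparer makes Contains case-insensitive without lower-casing the stored aliases.
    public HashSet<string> Alias { get; } =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
}

public class CommandAccess
{
    public List<Command> AllowedCommands { get; } = new List<Command>();

    // Returns the first command that has 'cmd' as an alias, or null if there is none.
    public Command HasCommand(string cmd)
    {
        foreach (Command item in AllowedCommands)
        {
            if (item.Alias.Contains(cmd))
                return item;
        }
        return null;
    }
}
```

Whether this is faster or slower than lower-casing the keys up front depends on the data, so it is worth timing both.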
Finally, also on the subject of performance, using LINQ is almost always going to result in slower performance - somewhere between a little slower and a lot slower, depending upon the situation. It does make nice, compact source for simple things, and I use it quite a bit myself, but if you're certain that performance is an issue for a piece of a code, you probably shouldn't use it (unless it's PLINQ, of course).
So, if you want as few lines of code as possible, use the other answer posted here. If you want speed, use mine.
It almost goes without saying, but when you're worried about the performance of some small chunk of code like this, just wrap it in a for loop and repeat it until it takes 5-10 seconds to execute - just add orders of magnitude as needed, whether it's 1,000 or 1,000,000 reps, and time it with System.Diagnostics.Stopwatch. Try alternative logic, and repeat the test. The 5-10 seconds is a minimum designed to mask the fluctuations caused by a managed environment and other stuff executing on the same machine (you should obviously also avoid running other apps during the test). Of course, for overall performance testing of a complicated application, a performance analyzer tool would be recommended.
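The timing loop described above could be sketched as follows. MicroBenchmark.Time is a hypothetical helper, and the warm-up call is an addition to keep JIT compilation out of the measurement.

```csharp
using System;
using System.Diagnostics;

public static class MicroBenchmark
{
    // Runs 'action' 'repetitions' times and returns the total elapsed time.
    public static TimeSpan Time(Action action, int repetitions)
    {
        action(); // warm-up call so that JIT compilation is not part of the measurement

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
            action();
        stopwatch.Stop();

        return stopwatch.Elapsed;
    }
}
```

Increase repetitions by orders of magnitude until the returned time reaches the 5-10 second range, then compare the alternatives using the same repetition count.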