I need to be able to search over a collection of approx 2 million items in C#. Search should be possible over multiple fields. Simple string-matching is good enough.
Using an external dependency like a database is not an option, but using an in-memory database would be OK.
The main goal is to do this memory-efficiently.
The type in the collection is quite simple and has no long strings:
public class Item
{
    public string Name { get; set; } // Around 50 chars
    public string Category { get; set; } // Around 20 chars
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public IReadOnlyList<string> Tags { get; set; } // 2-3 items
}
Focus and requirements
Clarification of focus and requirements:
No external dependencies (like a database)
Memory-efficient (below 2 GB for 2 million items)
Searchable items in collection (must be performant)
Today's non-optimal solution
Using a simple List<T> over above type, either as a class or a struct, still requires about 2 GB of memory.
Is there a better way?
The most significant memory hog in your class is the use of a read-only list. Get rid of it and you will reduce memory footprint by some 60% (tested with three tags):
public class Item
{
    public string Name { get; set; }
    public string Category { get; set; }
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public string Tags { get; set; } // Semi-colon separated
}
Also, consider using DateTime instead of DateTimeOffset. That will further reduce the memory footprint by around 10%.
There are many things you can do in order to reduce the memory footprint of your data, but probably the easiest thing to do with the greatest impact would be to intern all strings, or at least those that you expect to be repeated a lot.
// Rough example (no checks for null values)
public class Item
{
    private string _name;
    public string Name
    {
        get { return _name; }
        set { _name = String.Intern(value); }
    }

    private string _category;
    public string Category
    {
        get { return _category; }
        set { _category = String.Intern(value); }
    }

    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }

    private IReadOnlyList<string> _tags;
    public IReadOnlyList<string> Tags
    {
        get { return _tags; }
        set { _tags = Array.AsReadOnly(value.Select(s => String.Intern(s)).ToArray()); }
    }
}
Another thing you could do, more difficult and with smaller impact, would be to assign the same IReadOnlyList<string> object to items with identical tags (assuming that many items with identical tags exist in your data).
Update: Also don't forget to call TrimExcess on the list after you fill it with items, in order to get rid of the unused capacity.
This method can be used to minimize a collection's memory overhead if no new elements will be added to the collection.
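The idea of sharing one tag list between items with identical tags could look roughly like the sketch below. TagListPool and GetShared are hypothetical names, not part of any library, and the sketch assumes that the separator character U+001F never occurs inside a tag.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: hands out one shared, read-only list per distinct
// combination of tags, so items with identical tags reference the same object.
public sealed class TagListPool
{
    // Key: the tags joined with U+001F (assumed never to appear inside a tag).
    private readonly Dictionary<string, IReadOnlyList<string>> _pool =
        new Dictionary<string, IReadOnlyList<string>>(StringComparer.Ordinal);

    public IReadOnlyList<string> GetShared(IEnumerable<string> tags)
    {
        // Intern the individual tags as well, as suggested above.
        string[] interned = tags.Select(t => string.Intern(t)).ToArray();
        string key = string.Join("\u001F", interned);

        if (!_pool.TryGetValue(key, out IReadOnlyList<string> shared))
        {
            shared = Array.AsReadOnly(interned);
            _pool.Add(key, shared);
        }
        return shared;
    }
}
```

While loading, you would assign item.Tags = pool.GetShared(rawTags), and once the list of items is complete call TrimExcess on it and drop the pool itself.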
With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1000 bytes per item, which should be more than enough to do this in polynomial time.
If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.
Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.
You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?
With that out of the way, let's get into an answer.
Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in time linear in the total word count, and then allows searches in time linear in the length of the word. Although web pages are generally smaller than 2 GB, linear-time creation and a search time that depends only on the length of the search word scale very well.
Find or implement a Suffix Tree.
The suffix tree allows you to find substrings (with time complexity O(m), where m is the word length) and get back the original objects they occur in.
Construct the suffix tree once, with the strings of each object pointing back to that object.
Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.
Edit: On memory usage
As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.
For example, the words fire and firm will be stored as a parent node fir with two leaf nodes, e and m, thus forming the words. Should the word fish be introduced, the node fir will be split: a parent node fi, with child nodes sh and r, and the r having child nodes e and m. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.
With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a through z, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.
The only overhead comes from storing separate substrings and the nodes that represent and connect them.
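To make the substring lookup concrete, here is a minimal sketch of the suffix array alternative mentioned above, not of a full suffix tree. It assumes ordinal comparison and builds the array with a plain comparison sort, so construction is not linear-time as a real implementation would be. SuffixArrayIndex and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Naive suffix array over several strings (illustration only).
public sealed class SuffixArrayIndex
{
    private readonly string[] _texts;

    // One entry per suffix: which text it belongs to and where it starts.
    private readonly (int Text, int Start)[] _suffixes;

    public SuffixArrayIndex(string[] texts)
    {
        _texts = texts;
        var all = new List<(int Text, int Start)>();
        for (int t = 0; t < texts.Length; t++)
            for (int s = 0; s < texts[t].Length; s++)
                all.Add((t, s));
        _suffixes = all.ToArray();

        // Sort the suffixes ordinally (naive: each comparison may scan whole suffixes).
        Array.Sort(_suffixes, (a, b) => string.CompareOrdinal(
            _texts[a.Text], a.Start, _texts[b.Text], b.Start, int.MaxValue));
    }

    // Returns the indexes of all texts that contain the pattern as a substring.
    public SortedSet<int> Find(string pattern)
    {
        // Binary search for the first suffix whose prefix is >= the pattern.
        int lo = 0, hi = _suffixes.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (ComparePrefix(_suffixes[mid], pattern) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All suffixes that start with the pattern are adjacent from here on.
        var hits = new SortedSet<int>();
        for (int i = lo; i < _suffixes.Length && ComparePrefix(_suffixes[i], pattern) == 0; i++)
            hits.Add(_suffixes[i].Text);
        return hits;
    }

    // Compares at most pattern.Length characters of the suffix with the pattern.
    private int ComparePrefix((int Text, int Start) suffix, string pattern)
    {
        return string.CompareOrdinal(_texts[suffix.Text], suffix.Start, pattern, 0, pattern.Length);
    }
}
```

With the texts fire, firm and fish, Find("ir") yields the indexes of fire and firm. The returned indexes are what would point back to the original objects.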
You can try the following points, and then see whether there is still a problem:
You can enable gcAllowVeryLargeObjects to allow arrays that are larger than 2 gigabytes.
Keep the class implementation. When you choose between a class and a struct, performance is not the main factor. I see no reason to use a struct here. See Choosing Between Class and Struct.
Depending on your search filter, you may need to override GetHashCode and Equals.
Do you need to mutate properties, or just search for objects in the collection?
If you only need to search, and if your property values repeat themselves a lot, you can have one property value shared by many objects.
In this way, the value is stored only once, and each object only stores a reference to it.
You can do this only if you don't need to mutate the property.
For example, if two objects have the same category:
public class Category
{
    public string Value { get; }

    public Category(string category)
    {
        Value = category;
    }
}

public class Item
{
    public string Name { get; set; }
    public Category Category { get; set; }
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public IReadOnlyList<string> Tags { get; set; }
}

class Program
{
    public void Init()
    {
        Category category = new Category("categoryX");
        var obj1 = new Item
        {
            Category = category
        };
        var obj2 = new Item
        {
            Category = category
        };
    }
}
I would not expect any major memory issues with 2M objects if you are running as a 64-bit process. There is a maximum size limit of 2 GB for a list, but a reference is only 8 bytes, so the list should be well under this limit. The total memory usage will depend mostly on how large the strings are. There will also be some object overhead, but this is difficult to avoid if you need to store multiple strings.
Also, how do you measure memory? The .NET runtime might over-allocate memory, so the actual memory usage of your objects might be significantly lower than the memory reported by Windows. Use a memory profiler to get an exact count.
If strings are duplicated between many objects, there might be a major win if you can deduplicate them so that they share the same instance.
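Deduplication could be sketched with a small dictionary-based pool like the one below. StringPool and GetOrAdd are hypothetical names. Compared with String.Intern, a pool like this can be dropped once loading is finished, while the deduplicated strings stay alive through the items that reference them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns one shared instance per distinct string value.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public string GetOrAdd(string value)
    {
        if (value == null) return null;

        // Reuse the instance seen first; later duplicates become garbage.
        if (_pool.TryGetValue(value, out string existing)) return existing;

        _pool.Add(value, value);
        return value;
    }
}
```

While loading you would write something like item.Category = pool.GetOrAdd(rawCategory) for every field that is expected to repeat.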
Using a struct instead of a class could avoid some overhead, so I made some tests:
list of objects using LINQ - 46ms
list of objects using for loop - 16ms
list of structs using for loop - 250ms
list of readonly structs with ref-return using for loop: 180ms
The exact times will depend on what query you are doing; these numbers are mostly for comparison.
The conclusion is that a regular List of objects with a regular for loop is probably the fastest. Also, iterating over all objects is quite fast, so in most cases it should not cause a major performance issue.
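For reference, a plain for loop over a List of objects could look like the sketch below. The Item type is a trimmed-down copy of the one in the question, ItemSearch is a hypothetical name, and the ordinal, case-insensitive substring match is an assumption about what "simple string-matching" means here.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down version of the question's Item type, for illustration only.
public class Item
{
    public string Name { get; set; }
    public string Category { get; set; }
    public bool IsActive { get; set; }
}

public static class ItemSearch
{
    // Returns every item whose Name or Category contains the search text.
    public static List<Item> Find(List<Item> items, string text)
    {
        var matches = new List<Item>();
        for (int i = 0; i < items.Count; i++)
        {
            Item item = items[i];
            if (ContainsText(item.Name, text) || ContainsText(item.Category, text))
            {
                matches.Add(item);
            }
        }
        return matches;
    }

    // Ordinal, case-insensitive substring test (an assumption; use the comparison you need).
    private static bool ContainsText(string field, string text)
    {
        return field != null && field.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```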
If you need better performance, you will need to create some kind of index so you can avoid iterating over all items. The exact strategy is difficult to choose without knowing what kinds of queries you are doing.
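As one possible shape for such an index, the sketch below maps each distinct field value to the list positions of the items that have it. It only helps with exact-match queries, which is an assumption about the workload. FieldIndex and Lookup are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical exact-match index: field value -> positions in the item list.
public sealed class FieldIndex<T>
{
    private readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>(StringComparer.Ordinal);

    public FieldIndex(IReadOnlyList<T> items, Func<T, string> keySelector)
    {
        for (int i = 0; i < items.Count; i++)
        {
            string key = keySelector(items[i]);
            if (key == null) continue; // items without a value are not indexed

            if (!_positions.TryGetValue(key, out List<int> list))
            {
                list = new List<int>();
                _positions.Add(key, list);
            }
            list.Add(i);
        }
    }

    // Returns the positions of all items with exactly this value (empty if none).
    public IReadOnlyList<int> Lookup(string key)
    {
        return _positions.TryGetValue(key, out List<int> list)
            ? list
            : (IReadOnlyList<int>)Array.Empty<int>();
    }
}
```

For the question's data this might be built once as new FieldIndex<Item>(items, i => i.Category). Storing int positions rather than references keeps each index entry at 4 bytes.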
One option could be to use some variant of in-memory database, which could provide most of the indexing functionality. SQLite would be one example.
If the categories can be defined as an enum, you can map them to bits, which would reduce the size considerably: from 20 bytes to, say, 2 bytes (a short) per item. This could save roughly 36 MB for 2M objects.
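A rough sketch of the enum idea follows. The category names are invented for illustration, and ItemCategory, CompactItem and ItemCategoryParser are hypothetical names. The real saving per object also depends on field layout and padding, so it is worth measuring.

```csharp
using System;

// Hypothetical fixed set of categories, stored in 2 bytes each.
public enum ItemCategory : short
{
    Unknown = 0,
    Books,
    Electronics,
    Clothing
}

public class CompactItem
{
    public string Name { get; set; }
    public ItemCategory Category { get; set; } // replaces the string property
    public bool IsActive { get; set; }
}

public static class ItemCategoryParser
{
    // Converts the original category text to the enum; unrecognised text maps to Unknown.
    public static ItemCategory Parse(string text)
    {
        return Enum.TryParse(text, true, out ItemCategory category)
            ? category
            : ItemCategory.Unknown;
    }
}
```

Searching by category then becomes an integer comparison, for example item.Category == ItemCategory.Books.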
I am struggling to find a good solution for this. It's fairly straightforward to find the orphaned elements, but the trouble is storing them in such a way that they can easily be merged back into the hierarchy at a later point.
I have the following abstract class that has multiple implementations:
public abstract class FilterElement
{
    public abstract string ID { get; }
    public abstract IEnumerable<FilterElement> Children { get; set; }
    public FilterElement Parent { get; set; }
}
I have two hierarchies of FilterElement - the "master" (i.e. the main structure), and the "filters". The filters point at elements in the master - however, if these master elements do not exist, I wish to create a third structure, the "orphans".
I'm struggling to do this. While it's easy to identify the orphaned elements, I don't know how to store them effectively. This is the current solution:
Note: "GetFlatKey" returns a unique key for the element based on its parents and children, and "RecursiveSelect" effectively flattens the hierarchy:
private IEnumerable<FilterElement> GetOrphanedFilterElements(
    IEnumerable<FilterElement> filters,
    IEnumerable<IFilterFileViewModel> visibleList)
{
    var flattenedMasterList = visibleList.Cast<IFilterViewModel>()
        .RecursiveSelect(f => f.Children)
        .Select(x => x.GetFlatKey).ToList();
    var orphanedFilterFiles = new List<FilterElement>();
    foreach (var f in filters.RecursiveSelect(f => f.Children))
    {
        // Remove non orphaned files.
        if (!flattenedMasterList.Contains(f.GetFlatKey))
        {
            orphanedFilterFiles.Add(f);
        }
    }
    return orphanedFilterFiles;
}
The problem with this is that the elements in the orphanedFilterFiles list contain references to other elements - e.g. An orphan will have a parent, which may have non-orphaned Children. This makes it difficult to merge back into the final hierarchy, which is the main issue.
Can anyone help me find a better solution, or just tell me what I'm doing wrong?
I am using System.Linq.Dynamic, version on github's repository.
I am NOT interested in NON System.Linq.Dynamic solution.
I am trying to perform select on nested collection's property. Let us imagine we have following situation:
public class Region
{
    public int Id { get; set; }
    public List<Town> Towns { get; set; }
}

public class Town
{
    public int Id { get; set; }
    public string Name { get; set; }
}
Would it be possible to 'Select' a region's id and its towns' names?
Something of a kind:
someListofRegions.Select("new(Id, Towns.Name)")
where "new(Id, Towns.Name)" is the dynamic Linq expression.
Of course, the example above fails.
You can't perform a select on a nested collection's property, as this would require flattening the resulting collection.
Usually you would use SelectMany() to do this flattening, but you want to use System.Linq.Dynamic and I don't believe this library has a dynamic SelectMany(), so this probably isn't possible. You could write your own SelectMany() using expression trees, though, which shouldn't be too difficult.
Alternatively, you may find GroupBy is better suited to your needs here anyway. I can't personally see the benefit of a collection of Region IDs and Town Names, since there would be loads of duplicate region IDs.
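To show what the flattening means, here is a sketch using the standard LINQ SelectMany(), not System.Linq.Dynamic, so it does not solve the dynamic case. RegionQueries and Flatten are hypothetical names, and the Region and Town classes are repeated from the question so the snippet stands alone.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Town
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Region
{
    public int Id { get; set; }
    public List<Town> Towns { get; set; }
}

public static class RegionQueries
{
    // Produces one (region id, town name) pair per town, so region ids repeat.
    public static List<(int RegionId, string TownName)> Flatten(IEnumerable<Region> regions)
    {
        return regions
            .SelectMany(r => r.Towns, (r, t) => (RegionId: r.Id, TownName: t.Name))
            .ToList();
    }
}
```

A region with three towns yields three rows that all carry the same region id, which is the duplication mentioned above.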
I have a List<Leaf> named items in C#. A Leaf has the following properties:
public class Leaf
{
    public int ID { get; set; }
    public int ParentID { get; set; }
    public bool IsFlagged { get; set; }
}
If a Leaf has the IsFlagged property set then I need to remove it from the collection of items. In addition, I need to remove all of that Leaf entity's children. I'm trying to figure out the most elegant way to write this code. Currently, I have a loop within a loop, but it seems sloppy.
Does anyone know of an elegant way to do this?
Perhaps:
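void RemoveItAndChildren(Leaf leaf)
{
    // Take a snapshot of the children first: removing from items while
    // enumerating it directly throws an InvalidOperationException.
    foreach (Leaf item in items.Where(i => i.ParentID == leaf.ID).ToList())
        RemoveItAndChildren(item);
    items.Remove(leaf);
}
And use it like so (again iterating over a snapshot, which needs System.Linq):
foreach (Leaf leaf in items.Where(l => l.IsFlagged).ToList())
    RemoveItAndChildren(leaf);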
Note that, as in a comment above, something like the following might be more appropriate:
public class Leaf2
{
    public List<Leaf2> Children { get; set; }
    public bool IsFlagged { get; set; }
}
The most reasonable (and probably "the most elegant") way of dealing with a tree is to store it as a tree, not as an array or list. In that case you don't need to walk all the elements to find the children.
Note that, depending on your actual requirements, a tree may not be the best data structure, but for removing a node together with all of its child nodes it would be hard to beat a regular tree.
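As a sketch of why removal becomes simple with a tree: dropping a flagged node from its parent's child list discards its whole subtree with it. TreeLeaf and TreePruner are hypothetical names, and handling a flagged root is left to the caller.

```csharp
using System.Collections.Generic;

// Hypothetical tree node: children are stored directly on the parent.
public class TreeLeaf
{
    public int ID { get; set; }
    public bool IsFlagged { get; set; }
    public List<TreeLeaf> Children { get; } = new List<TreeLeaf>();
}

public static class TreePruner
{
    // Removes every flagged descendant of node, together with its subtree.
    public static void RemoveFlagged(TreeLeaf node)
    {
        // Dropping a flagged child drops all of its descendants as well.
        node.Children.RemoveAll(child => child.IsFlagged);

        // Continue with the children that survived.
        foreach (TreeLeaf child in node.Children)
        {
            RemoveFlagged(child);
        }
    }
}
```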
I'm originally a C# developer (as a hobby), but as of late I have been digging into Ruby on Rails and really enjoying it. Right now I am building an application in C#, and I was wondering if there is any collection implementation for C# that could match (or "semi-match") the find_by method of ActiveRecord.
What I am essentially looking for is a list that would hold Rectangles:
class Rectangle
{
    public int Width { get; set; }
    public int Height { get; set; }
    public string Name { get; set; }
}
Where I could query this list and find all entries with Height = 10, Width = 20, or name = "Block". This was done with ActiveRecord by doing a call similar to Rectangle.find_by_name('Block'). The only way I can think of doing this in C# is to create my own list implementation and iterate through each item manually checking each item against the criteria. I fear I would be reinventing the wheel (and one of poorer quality).
I am not necessarily trying to match the naming convention find_by_..., but rather to have the functionality of the method.
Any input or suggestions are much appreciated.
The "Linq methods", namely Where, that were added in .NET 3.5 are pretty close to what you're looking for.
myCollection.Where(r => r.Name == "Block")
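A slightly fuller sketch of the same idea, wrapping Where in find_by-style helpers. RectangleQueries, FindByName and FindBySize are hypothetical names, and the Rectangle class is repeated from the question (made public here) so the snippet stands alone.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Rectangle
{
    public int Width { get; set; }
    public int Height { get; set; }
    public string Name { get; set; }
}

public static class RectangleQueries
{
    // Rough equivalent of Rectangle.find_by_name('Block'), returning all matches.
    public static List<Rectangle> FindByName(IEnumerable<Rectangle> rectangles, string name)
    {
        return rectangles.Where(r => r.Name == name).ToList();
    }

    // Criteria can be combined inside a single predicate.
    public static List<Rectangle> FindBySize(IEnumerable<Rectangle> rectangles, int width, int height)
    {
        return rectangles.Where(r => r.Width == width && r.Height == height).ToList();
    }
}
```

Where is lazy, so the ToList() calls are what actually run the query. If only one match is expected, FirstOrDefault with the same predicate is an alternative.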