I have a dataset. This dataset will serve as a lookup table. Given a number, I should be able to look up a corresponding value for that number.
The dataset (let's say it's CSV) has a few caveats though. Instead of:
1,ABC
2,XYZ
3,LMN
The numbers are ranges (- being "through", not minus):
1-3,ABC // 1, 2, and 3 = ABC
4-8,XYZ // 4, 5, 6, 7, 8 = XYZ
11-11,LMN // 11 = LMN
All the numbers are signed ints. No range overlaps another range. There are some gaps; there are ranges that aren't defined in the dataset (like 9 and 10 in the last snippet above).
How might I model this dataset in C# so that I have the most-performant lookup while keeping my in-memory footprint low?
The only option I've come up with suffers from overconsumption of memory. Let's say my dataset is:
1-2,ABC
4-6,XYZ
Then I create a Dictionary<int,string>() whose key/values are:
1/ABC
2/ABC
4/XYZ
5/XYZ
6/XYZ
Now I have hash-table lookup performance, but tons of wasted space in the hash table.
Any ideas? Maybe just use PLINQ instead and hope for good performance? ;)
If your dictionary is going to truly store a wide range of key values, an approach that expands all possible ranges into explicit keys will rapidly consume more memory than you likely have available.
Your best option is to use a data structure that supports some variation of binary search (or another O(log N) lookup technique). Here's a link to a generic RangeDictionary for .NET that uses an OrderedList internally, and has O(log N) performance.
Achieving constant-time O(1) lookup requires that you expand all ranges into explicit keys. This requires a lot of memory, and it can actually degrade performance when you need to split or insert a new range. This probably isn't what you want.
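As a rough illustration of the binary-search idea, here is a minimal sketch. It is not the RangeDictionary linked above; the RangeLookup type and its members are hypothetical names, and it assumes the ranges are inclusive and never overlap, as the question states.

```csharp
using System;

// Hypothetical type for illustration only; not the linked RangeDictionary.
public sealed class RangeLookup
{
    private readonly int[] _starts;     // range starts, sorted ascending
    private readonly int[] _ends;       // inclusive range ends, parallel to _starts
    private readonly string[] _values;  // value for each range, parallel to _starts

    // Assumes the ranges do not overlap, as stated in the question.
    public RangeLookup((int Start, int End, string Value)[] ranges)
    {
        var sorted = ((int Start, int End, string Value)[])ranges.Clone();
        Array.Sort(sorted, (a, b) => a.Start.CompareTo(b.Start));
        _starts = Array.ConvertAll(sorted, r => r.Start);
        _ends = Array.ConvertAll(sorted, r => r.End);
        _values = Array.ConvertAll(sorted, r => r.Value);
    }

    public bool TryGetValue(int key, out string value)
    {
        int index = Array.BinarySearch(_starts, key);
        // A negative result is the bitwise complement of the index of the first
        // start greater than key, so the candidate range is the one before it.
        if (index < 0) index = ~index - 1;
        if (index >= 0 && key <= _ends[index])
        {
            value = _values[index];
            return true;
        }
        value = null; // key falls in a gap or outside every range
        return false;
    }
}
```

Memory use is one array slot per range rather than one dictionary entry per number, at the cost of an O(log N) lookup instead of O(1).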
You can create a doubly-indirected lookup:
Dictionary<int, int> keys;
Dictionary<int, string> values;
Then store the data like this:
keys.Add(1, 1);
keys.Add(2, 1);
keys.Add(3, 1);
//...
keys.Add(11, 3);
values.Add(1, "ABC");
//...
values.Add(3, "LMN");
And then look the data up:
return values[keys[3]]; //returns "ABC"
I'm not sure how much memory footprint this will save with trivial strings, but once you get beyond "ABC" it should help.
EDIT
After Dan Tao's comment below, I went back and checked on what he was asking about. The following code:
var abc = "ABC";
var def = "ABC";
Console.WriteLine(ReferenceEquals(abc, def));
will write "True" to the console, which means that either the compiler or the runtime (clarification?) is maintaining a single reference to "ABC" and assigning it as the value of both variables.
After reading up some more on Interned strings, if you're using string literals to populate the dictionary, or Interning computed strings, it will in fact take more space to implement my suggestion than the original dictionary would have taken. If you're not using Interned strings, then my solution should take less space.
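To make the point about interning computed strings concrete, here is a small sketch. The InternDemo class and Compare method are hypothetical names used only for illustration. It builds two equal strings at run time and compares references before and after passing them through string.Intern.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class InternDemo
{
    public static (bool BeforeIntern, bool AfterIntern) Compare()
    {
        // Two equal strings built at run time are separate instances.
        string first = new string(new[] { 'A', 'B', 'C' });
        string second = new string(new[] { 'A', 'B', 'C' });
        bool before = ReferenceEquals(first, second);

        // string.Intern returns the single pooled instance for a given value.
        bool after = ReferenceEquals(string.Intern(first), string.Intern(second));
        return (before, after);
    }
}
```

Interned strings stay in the pool for the lifetime of the process, so this is probably only worth doing when the same values repeat many times.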
FINAL EDIT
If you're treating your strings correctly, there should be no excess memory usage from the original Dictionary<int, string>, because you can assign each string to a variable and then assign that reference as the value (or, if you need to, you can Intern them).
Just make sure your assignment code includes an intermediate variable assignment:
while (thereAreStringsLeftToAssign)
{
var theString = theStringToAssign;
foreach (var i in range)
{
strings.Add(i, theString);
}
}
As arootbeer has mentioned in his answer, the following code does not create multiple instances of the string "ABC"; rather, it interns a single instance and assigns a reference to that instance to each KeyValuePair<int, string> in dictionary:
var dictionary = new Dictionary<int, string>();
dictionary[0] = "ABC";
dictionary[1] = "ABC";
dictionary[2] = "ABC";
// etc.
OK, so in the case of string literals, you're only using one string instance per range of keys. Is there a scenario where this wouldn't be the case--that is, where you would be using a separate string instance for each key within the range (this is what I assume you're concerned about when you speak of "overconsumption of memory")?
Honestly, I don't think so. There are scenarios where multiple equivalent string instances may be created without the benefit of interning, yes. But I can't imagine these scenarios would affect what you're trying to do here.
My reasoning is this: you want to assign certain values to different ranges of keys, right? So any time you are defining a key-range-value pairing of this sort, you have a single value and several keys. The single part is what leads me to doubt that you'll ever have multiple instances of the same string, unless it is defined as the value for more than one range.
To illustrate: yes, the following code will instantiate two identical strings:
string x = "ABC";
Console.Write("Type 'ABC' and press Enter: ");
string y = Console.ReadLine();
Console.WriteLine(Equals(x, y));
Console.WriteLine(ReferenceEquals(x, y));
The above program, assuming the user follows instructions and types "ABC," outputs True, then False. So you might think, "Ah, so when a string is only provided at run-time, it isn't interned! So this could be where my values could be duplicated!"
But... again: I don't think so. It all comes back to the fact that you are going to be assigning a single value to a range of keys. So let's say your values come from user input; then your code would look something like this:
var dictionary = new Dictionary<int, string>();
int start, count;
GetRange(out start, out count);
string value = GetValue();
foreach (int key in Enumerable.Range(start, count))
{
// Look, you're using the same string instance to assign
// to each key... how could it be otherwise?
dictionary[key] = value;
}
Now, if you were actually thinking more along the lines of what LBushkin mentions in his answer--that you may potentially have huge ranges, making it impractical to define a KeyValuePair<int, string> for each key within that range (e.g., if you have a range of 1-1000000)--then I would agree that you're best off with some sort of data structure that bases its lookup on a binary search. If that's more your scenario, say so and I will be happy to offer more ideas on that front. (Or you could just take a look at the link LBushkin already posted.)
Use a balanced ordered tree (or something similar) mapping start-of-range to end-of-range and data. This will be easy to implement for non-overlapping ranges.
arootbeer has a good solution, but one you may find confusing to work with.
Another choice is to use a reference type instead of a string, so that you point to the same reference:
class StringContainer {
public string Value { get; set; }
}
Dictionary<int, StringContainer> values;
var value1 = new StringContainer { Value = "ABC" };
values.Add(1, value1);
values.Add(2, value1);
They will both point to the same instance of StringContainer.
EDIT: Thanks for the comments everyone. This method handles value types other than string, so it might be useful for more than the given example. Also, it is my understanding that strings don't always behave in the manner you would expect from reference values, but I could be wrong.
Related
I have a bunch of txt files that each contain 300k lines. Each line has a URL. E.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718
In some string[] array I have a list of web-sites
amazon.com
google.com
ieee.org
...
I need to check whether each URL contains one of the web-sites and update a counter that corresponds to that web-site.
For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900*300K comparisons (for 1 file). I believe that IndexOf will be slow as well.
Can someone help me with a faster approach? Thank you in advance.
A good solution would leverage hashing. My approach would be the following:
Hash all your known hosts (the string[] collection that you mention)
Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode()))
Sort the list (hashes.Sort())
When looking up a url:
Parse out host name from the url (get ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
Preprocess it to always expect same case. Use lower case (if you have http://www.IEee.COM/ take www.ieee.com)
Hash parsed host name, and look for it in the hashes list. Use BinarySearch method to find the hash.
If the hash exists, then you have this host in your list
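The steps above can be sketched as follows. HostHashIndex and MightContain are hypothetical names. Two caveats apply: different hosts can share a hash code, so a hit means "probably present", and only exact host matches are found (www.ieee.org and ieee.org hash differently).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration only.
public sealed class HostHashIndex
{
    private readonly List<int> _hashes = new List<int>();

    public HostHashIndex(IEnumerable<string> hosts)
    {
        // Hash every known host in lower case, then sort for binary search.
        foreach (var host in hosts)
            _hashes.Add(host.ToLowerInvariant().GetHashCode());
        _hashes.Sort();
    }

    // True when the URL's host hash is in the list. Hash collisions are
    // possible, so a true result is not proof that the host is known.
    public bool MightContain(string url)
    {
        string host = new Uri(url).Host.ToLowerInvariant();
        return _hashes.BinarySearch(host.GetHashCode()) >= 0;
    }
}
```

Because string hash codes are not guaranteed to be stable across processes or runtime versions, the list should be rebuilt at startup rather than saved to disk.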
An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positives (it can tell you that a value is in a collection even though it's not), so it's useful for optimization only. It never gives a false negative, though: if it tells you that something is not in the collection, it really is not.
Using a Dictionary<TKey, TValue> is also an option, but if you only need to count the number of occurrences, it's more efficient to maintain a collection of hashes yourself.
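For readers who have not met Bloom filters, here is a minimal sketch of the idea. It is not the CodePlex implementation; SimpleBloomFilter, its parameters, and the FNV-1a-style hash with two seeds are assumptions chosen to keep the example short.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical minimal Bloom filter for illustration only.
public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item)) _bits[index] = true;
    }

    // False means "definitely not added"; true means "possibly added".
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index]) return false;
        return true;
    }

    // Derives _hashCount bit positions from two base hashes (double hashing).
    private IEnumerable<int> Indexes(string item)
    {
        uint h1 = Hash(item, 2166136261u);
        uint h2 = Hash(item, 0x9747b28cu) | 1u; // force a non-zero step
        for (int i = 0; i < _hashCount; i++)
            yield return (int)(unchecked(h1 + (uint)i * h2) % (uint)_bits.Length);
    }

    // FNV-1a-style hash over UTF-16 code units, with a caller-supplied seed.
    private static uint Hash(string text, uint seed)
    {
        uint hash = seed;
        foreach (char c in text)
            hash = unchecked((hash ^ c) * 16777619u);
        return hash;
    }
}
```

A hit from MightContain would still need confirming against the real host list before a counter is incremented.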
Create a Dictionary of domain to counter.
For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment the counter.
I assume we're talking about domains since this is what you showed in your array as examples. If this can be any part of the URL instead, storing all your strings in a trie-like structure could work.
You can read this question; the answers will help you:
High performance "contains" search in list of strings in C#
Well, in a somewhat similar situation, though with IndexOf, I achieved a huge performance improvement with a simple loop,
something like
int l = url.Length;
int position = 0;
while (position < l)
{
if (url[position] == website[0])
{
// test the rest of the web site from position in another loop
if (exactMatch(url, position, website))
{
// found: update the counter for this web site
break;
}
}
position++;
}
Seems a bit wrong, but in an extreme case, searching for a set of strings (about 10) in a large structured (1.2Mb) file (so regex was out), I went from 3 minutes to < 1 second.
Your problem as you describe it should not involve searching for substrings at all. Split your source file up into lines (or read it in line by line) which you already know will each contain a URL, and run it through some function to extract the domain name, then compare this with some fast access tally of your target domains such as a Dictionary<string, int>, incrementing as you go, e.g.:
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x=> x.Split('/')[0];
foreach(var line in source)
{
var domain = naiveDomainExtractor(line);
if(tally.ContainsKey(domain)) tally[domain]++;
}
...which takes a third of a second on my not particularly speedy machine, including generation of test data.
Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
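A sketch of the parallel variant might look like this. ParallelTally and Count are hypothetical names, and the domain extractor is passed in as a delegate so that the naive one above (or a better one) can be reused.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class ParallelTally
{
    public static ConcurrentDictionary<string, int> Count(
        IEnumerable<string> lines,
        IEnumerable<string> targets,
        Func<string, string> extractDomain)
    {
        // Seed every target domain with a zero counter.
        var tally = new ConcurrentDictionary<string, int>(
            targets.Distinct().Select(t => new KeyValuePair<string, int>(t, 0)));

        Parallel.ForEach(lines, line =>
        {
            string domain = extractDomain(line);
            // Only count domains that were seeded from the target list;
            // AddOrUpdate performs the increment atomically per key.
            if (tally.ContainsKey(domain))
                tally.AddOrUpdate(domain, 1, (key, count) => count + 1);
        });
        return tally;
    }
}
```

Whether this beats the single-threaded loop depends on how expensive the extractor is, so it is worth measuring both.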
You'd have to test the performance but you might try converting the urls to the actual System.Uri object.
Store the list of websites as a HashSet<string> - then use the HashSet to look up the Uri's Host:
IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
I have a text file, and on each line is a single word followed by specific values.
For example:
texture_menu_label_1 = 0 0 512 512
What I want to do is read that text in and basically convert it to the following command:
texture_menu_label_1 = new int[]{0, 0, 512, 512};
Parsing the line and extracting the integer values for the constructor is trivial, but I'm wondering if there is any way to use the "texture_menu_label_1" string from the file to reference a pre-existing variable by the same name...
Is there any way to do this without manually constructing a lookup table?
You really don't want to do this. I know you think you do, I remember when I was learning how to program and I thought the same thing, but really, you don't.
There are better ways to store a collection of values; in your case, this would be a multi-dimensional array (or a List<List<int>>). If not that, then perhaps a hash table (Dictionary<string,int[]>).
Better yet, if this data is 'regular' and logically connected, create your own custom type and maintain a collection of those. You really don't want to go down the road of tying your logic to the names of your variables... very messy.
That data looks like a rectangle. Why not just maintain a Dictionary<string,Rectangle>?
var dict = new Dictionary<string, Rectangle>();
dict.Add("some_name", new Rectangle(0, 0, 512, 512));
// ... later
var rect = dict["some_name"]; // get the rectangle that maps to "some_name"
Before you try to implement an answer please consider why you are doing this, and whether there may be a better solution.
I recommend using a Dictionary to store the data by name as strings.
dataDictionary["texture_menu_label_1"] = new int[] { ... };
Another approach is to use a separate class with fields, since fields can be accessed by name. You may experience performance issues though, and it's definitely not an optimal solution.
class Data
{
public int[] texture_menu_label_1;
...
}
You can use reflection to set the field value. Something like this:
typeof(Data).GetField("texture_menu_label_1").SetValue(data, new int [] { ... });
Use a Hashtable (or Dictionary, its generic counterpart) or similar. The key would be the string (texture_menu_label_1) and the value would be the array.
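As a sketch of how the file could be loaded into such a dictionary, assuming every useful line has the form "name = v1 v2 v3 v4" as in the question (TextureTable and Parse are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class TextureTable
{
    public static Dictionary<string, int[]> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<string, int[]>();
        foreach (var line in lines)
        {
            // Split into the name and the value list at the first '='.
            var parts = line.Split(new[] { '=' }, 2);
            if (parts.Length != 2) continue; // skip lines without "name = values"

            string name = parts[0].Trim();
            int[] values = parts[1]
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(s => int.Parse(s))
                .ToArray();
            table[name] = values; // a repeated name overwrites the earlier entry
        }
        return table;
    }
}
```

The values are then read back with table["texture_menu_label_1"], so no variable names need to be resolved at run time.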
What if you wrap it up in a struct?
struct TextureMenu
{
public string MenuString;
public int[] Values;
}
Then, instead of dealing directly with either type, you just deal with the struct.
I started reading C# in depth. Now I'm in the journey of Generics. I came across the first example of Generics in this book as:
static Dictionary<string,int> CountWords(string text)
{
Dictionary<string,int> frequencies;
frequencies = new Dictionary<string,int>();
... //other code goes here..
And after this code, author says that:
The CountWords method first creates an empty map from string to int
This looks vague to me as a novice in C#. What does the author mean by "string to int" in the above statement? I'm a bit confused by this line.
Thanks in advance.
Let's say we want to count the words in a paragraph:
I started reading C# in depth. Now I am in the journey of Generics.
I came across the first example of Generics in this book as
In order to count the words, you'll need some data structure that will be able to store a number of occurrences for each of the words, that will basically attach a number to a string, like
I - 3 times
in - 3 times
Generics - 2 times
etc...
That structure maps a string to an integer, and in C# generics, that structure is a Dictionary<string,int>.
BTW, if you are a C# beginner, I would recommend against C# in Depth, which, while being a great book, assumes quite an advanced reader.
He means that string is your key and int is the value paired with the key.
Dictionary<string,int> maps a string key (or lookup) to an int value.
Consider Dictionary<string,int> frequencies.
When you try to add an item you use (for example)
frequencies.Add("key3", 3)
When you add another item you cannot repeat "key3", because in Dictionary that's a unique key; so you create a "map" because you are sure you have unique keys and you can recall values using their key: frequencies["key3"]...
Dictionary<string, int> frequencies = new Dictionary<string, int>();
frequencies.Add("key3", 3);
frequencies.Add("key4", 4);
frequencies.Add("key3", 5); // This raises an error
int value = frequencies["key3"];
This function counts all the words in a given string. The returned dictionary contains one entry for every word found, with the word as the key. The int value stores how many times that word was found in the string.
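The book's listing is elided in the question, so the following is only a guess at what such a method could look like, not the author's code. The separator list and the case-insensitive comparer are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reconstruction for illustration; not the book's listing.
public static class WordCounter
{
    public static Dictionary<string, int> CountWords(string text)
    {
        // The "map from string to int": each word (key) maps to its count (value).
        var frequencies = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // TryGetValue leaves count at 0 when the word has not been seen yet.
            frequencies.TryGetValue(word, out int count);
            frequencies[word] = count + 1;
        }
        return frequencies;
    }
}
```

Calling CountWords on a sentence and then reading frequencies["in"] returns how many times "in" occurred.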
It means from the Key to the Value
I have a known-good Dictionary, and at run time I need to create a new Dictionary and run a check to see if it has the same key-value pairs as the known-good Dictionary (potentially inserted in different orders), and take one path if it does and another if it doesn't. I don't necessarily need to serialize the entire known-good Dictionary (I could use a hash, for example), but I need some on-disk data that has enough information about the known-good Dictionary to allow for comparison, if not for recreation. What is the quickest way to do this? I can use a SortedDictionary, but the amount of time required to initialize and add values counts in the speed of this task.
Concrete example:
Consider a Dictionary<String,List<String>> that looks something like this (in no particular order, obviously):
{ {"key1", {"value1", "value2"} }, {"key2", {"value3", "value4"} } }
I create that Dictionary once and save some form of information about it on disk (a full serialization, a hash, whatever). Then, at runtime, I do the following:
Dictionary<String,List<String>> dict1 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict2 = new Dictionary<String,List<String>> ();
Dictionary<String,List<String>> dict3 = new Dictionary<String,List<String>> ();
String key11 = "key1";
String key12 = "key1";
String key13 = "key1";
String key21 = "key2";
String key22 = "key2";
String key23 = "key2";
List<String> value11 = new List<String> {"value1", "value2"};
List<String> value12 = new List<String> {"value1", "value2"};
List<String> value13 = new List<String> {"value1", "value2"};
List<String> value21 = new List<String> {"value3", "value4"};
List<String> value22 = new List<String> {"value3", "value4"};
List<String> value23 = new List<String> {"value3", "value5"};
dict1.Add(key11, value11);
dict1.Add(key21, value21);
dict2.Add(key22, value22);
dict2.Add(key12, value12);
dict3.Add(key13, value13);
dict3.Add(key23, value23);
dict1.compare(fileName); //Should return true
dict2.compare(fileName); //Should return true
dict3.compare(fileName); //Should return false
Again, if the overall time from startup to the return from compare() is quicker, I can change this code to use a SortedDictionary (or anything else) instead, but I can't guarantee ordering and I need some consistent comparison. compare() could load a serialization and iterate through the dictionaries, it could serialize the in-memory dictionary and compare the serialization to the file name, or it could do any number of other things.
Solution one: use set equality.
If the dictionaries are of different sizes, you know they are unequal.
If they are of the same size then build a mutable hash set of keys from one dictionary. Remove from it all the keys from the other dictionary. If you attempted to remove a key that wasn't there, then the key sets are unequal and you know which key was the problem.
Alternatively, build two hash sets and take their intersection; the resulting intersection should be the size of the original sets.
This takes O(n) time and O(n) space.
Once you know that the key sets are equal then go through all the keys one at a time, fetch the values, and do comparison of the values. Since the values are sequences, use SequenceEqual. This takes O(n) time and O(1) space.
Solution two: sort the keys
Again, if the dictionaries are of different size, you know they are unequal.
If they are of the same size, sort both sets of keys and do a SequenceEqual on them; if the sequences of keys are unequal then the dictionaries are unequal.
This takes O(n lg n) time and O(n) space.
If that succeeds, then again, go through the keys one at a time and compare the values.
Solution three:
Again, check the dictionaries to see if they are the same size.
If they are, then iterate over the keys of one dictionary and check to see if the key exists in the other dictionary. If it does not, then they are not equal. If it does, then check the corresponding values for equality.
This is O(n) in time and O(1) in space.
How to choose amongst these possible solutions? It depends on what the likely failure mode is, and whether you need to know what the missing or extra key is. If the likely failure mode is a bad key then it might be more performant to choose a solution that concentrates on finding the bad key first, and only checking for bad values if all the keys turn out to be OK. If the likely failure mode is a bad value, then the third solution is probably best, since it prioritizes checking values early.
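Solution three, for example, could be sketched like this. DictionaryComparer and ContentEquals are hypothetical names, and the sketch treats the value lists as ordered sequences.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class DictionaryComparer
{
    public static bool ContentEquals(
        Dictionary<string, List<string>> expected,
        Dictionary<string, List<string>> actual)
    {
        // Different sizes can never be equal; this also rules out extra keys.
        if (expected.Count != actual.Count) return false;

        foreach (var pair in expected)
        {
            // A missing key or a differing value list means the dictionaries differ.
            if (!actual.TryGetValue(pair.Key, out var other)) return false;
            if (!pair.Value.SequenceEqual(other)) return false;
        }
        return true;
    }
}
```

Because the counts match and every key of one dictionary is found in the other, no second pass in the opposite direction is needed.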
Due to my comments on the accepted answer, here's a stricter check.
goodDictionary.Keys.All(k=>
{
List<string> otherVal;
if(!testDictionary.TryGetValue(k,out otherVal))
{
return false;
}
return goodDictionary[k].SequenceEqual(otherVal);
})
If you already have serialisation, then take the hash (I recommend SHA-1) of each serialised dictionary and then compare them.
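One way to make such a hash independent of insertion order is to write the entries in sorted key order before hashing. The sketch below is an assumption-laden illustration: DictionaryFingerprint and Compute are hypothetical names, and it uses SHA-256 where this answer suggests SHA-1.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only.
public static class DictionaryFingerprint
{
    public static string Compute(Dictionary<string, List<string>> dictionary)
    {
        using (var buffer = new MemoryStream())
        using (var writer = new BinaryWriter(buffer, Encoding.UTF8))
        {
            // Sort keys ordinally so insertion order cannot affect the bytes.
            foreach (var key in dictionary.Keys.OrderBy(k => k, StringComparer.Ordinal))
            {
                writer.Write(key);                   // length-prefixed string
                writer.Write(dictionary[key].Count); // marks where this value list ends
                foreach (var value in dictionary[key]) writer.Write(value);
            }
            writer.Flush();

            using (var sha = SHA256.Create())
                return Convert.ToBase64String(sha.ComputeHash(buffer.ToArray()));
        }
    }
}
```

The resulting string can be saved to disk for the known-good dictionary and compared with the fingerprint of the dictionary built at run time.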
I don't think there is a magic bullet here; you just need to do a lookup for each key pair:
public bool IsDictionaryAMatch(Dictionary<string, List<string>> dictionaryToCheck)
{
foreach(var kvp in dictionaryToCheck)
{
// Do the Keys Match
if(!goodDictionary.ContainsKey(kvp.Key))
return false;
foreach(var valueElement in kvp.Value)
{
// Do the Values in each list match
if(!goodDictionary[kvp.Key].Exists(x => x == valueElement))
return false;
}
}
return true;
}
Well, at some point you need to compare that each key has the same value, but before that you can do quick things, like checking how many keys each dictionary has, then checking that the lists of keys match. Those should be fairly quick, and if either of those tests fails you can skip the more expensive testing.
After that, you might be able to build separate lists of keys and then fire off a parallel query to compare the actual values.
I know Dictionaries don't store their key-value pairs in the order that they are added, but if I add the same key-value pairs (potentially in different orders) to two different Dictionaries and serialize the results, will the data on file be the same?
Edit: To clarify, I'm asking specifically about the output of GetObjectData(), not any particular serializer. Consider the following code:
Dictionary<string,List<string>> dict1 = new Dictionary<string,List<string>>();
Dictionary<string,List<string>> dict2 = new Dictionary<string,List<string>>();
string key11 = "key1";
string key12 = "key1";
string key21 = "key2";
string key22 = "key2";
List<string> values11 = new List<string>(1);
List<string> values12 = new List<string>(1);
List<string> values21 = new List<string>(1);
List<string> values22 = new List<string>(1);
values11.Add("value1");
values12.Add("value1");
values21.Add("value2");
values22.Add("value2");
dict1.Add(key11, values11);
dict2.Add(key22, values22);
dict1.Add(key21, values21);
dict2.Add(key12, values12);
Will dict1 and dict2 return the same thing for GetObjectData()? If not, why not?
Whether or not it is the same would end up being an implementation detail that is unlikely to be guaranteed in future versions and/or alternate implementations. As such, I would recommend at the very least writing a test that verifies it and can run as part of your standard tests. But if you are implementing a solution that absolutely depends on it, then it may be worth writing your own serializer...
Almost certainly not! The Dictionary works by hashing, and there has to be some method of hash collision resolution. So let's say that the first time you go through the dictionary you add key1 and key2 in that order. key1 ends up in the "normal" spot for keys that hash to that particular value. key2 is stored "somewhere else" (dependent on the implementation).
Now change the order that you add keys. key2 goes in the normal spot and key1 goes "somewhere else."
You cannot make any assumptions about the order of the items in your dictionary.
Even if you could guarantee the order, that guarantee could be invalidated with the next change to the .NET Framework because the implementation of string.GetHashCode might change (it has in the past). That would completely change the order in which keys are stored in the dictionary's underlying data structures, so any saved data created by a previous version of the Framework would likely not agree with data you create when running with the new version.
That would depend on internal implementation details of the particular (version of the) Dictionary.
So in general, No.
There have been topics here about using serialization to determine equality; it fails in several corner cases.