I have a redis db that has thousands of keys and I'm currently running the following line to get all the keys:
string[] keysArr = keys.Select(key => (string)key).ToArray();
But because I have a lot of keys this takes a long time. I want to limit the number of keys being read. So I'm trying to run an execute command where I get 100 keys at a time:
var keys = Redis.Connection.GetDatabase(dbNum).Execute("scan", 0, "count", 100);
This successfully runs the command; however, I'm unable to access the value as it is private, and unable to cast it, even though the RedisResult class provides an explicit cast to it:
public static explicit operator string[] (RedisResult result);
Any ideas on how to get x keys at a time from Redis?
Thanks
SE.Redis has a .Keys() method on the IServer API which fully encapsulates the semantics of SCAN. If possible, just use this method and consume the data 100 at a time. It is usually pretty easy to write a batching function, e.g.
ExecuteInBatches(server.Keys(), 100, batch => DoSomething(batch));
with:
public void ExecuteInBatches<T>(IEnumerable<T> source, int batchSize,
    Action<List<T>> action)
{
    List<T> batch = new List<T>();
    foreach (var item in source) {
        batch.Add(item);
        if (batch.Count == batchSize) {
            action(batch);
            batch = new List<T>(); // in case the callback stores it
        }
    }
    if (batch.Count != 0) {
        action(batch); // any leftovers
    }
}
The enumerator will worry about advancing the cursor.
You can use Execute, but: that is a lot of work! Also, SCAN makes no guarantees about how many will be returned per page; it can be zero - it can be 3 times what you asked for. It is ... guidance only.
Incidentally, the reason the cast fails is that SCAN doesn't return a string[] - it returns an array of two items, the first of which is the "next" cursor and the second of which is the keys. So maybe:
var arr = (RedisResult[])server.Execute("scan", 0);
var nextCursor = (int)arr[0];
var keys = (RedisKey[])arr[1];
But all this is doing is re-implementing IServer.Keys, the hard way (and significantly less efficiently - RedisResult is not the ideal way to store data; it is simply necessary in the case of Execute and ScriptEvaluate).
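For reference, getting hold of an IServer and paging through the keys might look something like this (a sketch only - "muxer" here is a hypothetical ConnectionMultiplexer instance, and a single-endpoint topology is assumed):

// Sketch: "muxer" is a hypothetical ConnectionMultiplexer instance.
var server = muxer.GetServer(muxer.GetEndPoints()[0]);

// Keys() issues SCAN under the hood and advances the cursor for you;
// pageSize feeds SCAN's COUNT hint - it is not a hard page size.
IEnumerable<RedisKey> keys = server.Keys(dbNum, pattern: "*", pageSize: 100);

ExecuteInBatches(keys, 100, batch => DoSomething(batch));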
I would use the .Take() method, which Microsoft documents as follows:
Returns a specified number of contiguous elements from the start of a
sequence.
It would look something like this:
//limit to 100
var keysArr = keys.Select(key => (string)key).Take(100).ToArray();
I have this class:
public class SimHasher {
    int count = 0;

    //take each string and make an int[] out of it
    //should call Hash method lines.Count() times
    public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
        //return lines.Select(il => Hash(il));
        var linesCount = lines.Count();
        var hashes = new int[linesCount][];
        for (var i = 0; i < linesCount; ++i) {
            hashes[i] = Hash(lines.ElementAt(i));
        }
        return hashes;
    }

    public int[] Hash(string line) {
        Debug.WriteLine(++count);
        //stuff
    }
}
When I run a program that calls HashAll and passes it an IEnumerable<string> with 1000 elements, it acts as expected: it loops 1000 times, writing the numbers 1 to 1000 to the debug console, and the program finishes in under 1 second. However, if I replace the body of the HashAll method with the LINQ statement, like so:
public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
    return lines.Select(il => Hash(il));
}
the behavior seems to depend on where HashAll gets called from.
If I call it from this test method
[Fact]
public void SprutSequentialIntegrationTest() {
    var inputContainer = new InputContainer(new string[] {
        @"D:\Solutions\SimHash\SimHashTests\R.in"
    });
    var simHasher = new SimHasher();
    var documentSimHashes = simHasher.HashAll(inputContainer.InputLines); //right here
    var queryRunner = new QueryRunner(documentSimHashes);
    var queryResults = queryRunner.RunAllQueries(inputContainer.Queries);
    var expectedQueryResults = System.IO.File.ReadAllLines(
        @"D:\Solutions\SimHash\SimHashTests\R.out")
        .Select(eqr => int.Parse(eqr));
    Assert.Equal(expectedQueryResults, queryResults);
}
the counter in the debug console reaches around 13,000, even though there are only 1000 input lines. It also takes around 6 seconds to finish, but still manages to produce the same results as the loop version.
If I run it from the Main method like so
static void Main(string[] args) {
    var inputContainer = new InputContainer(args);
    var simHasher = new SimHasher();
    var documentSimHashes = simHasher.HashAll(inputContainer.InputLines);
    var queryRunner = new QueryRunner(documentSimHashes);
    var queryResults = queryRunner.RunAllQueries(inputContainer.Queries);
    foreach (var queryResult in queryResults) {
        Console.WriteLine(queryResult);
    }
}
it starts writing to the output console right away, although very slowly, while the counter in the debug console climbs into the tens of thousands. When I try to debug it line by line, it goes straight to the foreach loop and writes out the results one by one. After some Googling, I found out that this is due to LINQ queries being lazily evaluated. However, each time a result is lazily evaluated, the counter in the debug console increases by more than 1000, which is even more than the number of input lines.
What is causing so many calls to the Hash method? Can it be deduced from these snippets?
The reason why you get more iterations than you would expect is that there are LINQ calls that iterate the IEnumerable<T> multiple times.
When you call Count() on an IEnumerable<T>, LINQ tries to see if there is a Count or Length to avoid iterating, but when there is no shortcut, it iterates IEnumerable<T> all the way to the end.
Similarly, when you call ElementAt(i), LINQ tries to see if there is an indexer, but generally it iterates the collection up to position i. This renders your loop O(n²).
You can easily fix your problem by storing your IEnumerable<T> in a list or an array by calling ToList() or ToArray(). This would iterate through IEnumerable<T> once, and then use Count and indexes to avoid further iterations.
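For instance, a minimal rewrite of the original HashAll along those lines (a sketch of the fix, not tested against the rest of the code):

public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
    // Materialize once: the source is iterated a single time here,
    // and Count / the indexer are O(1) afterwards.
    var lineList = lines.ToList();
    var hashes = new int[lineList.Count][];
    for (var i = 0; i < lineList.Count; ++i) {
        hashes[i] = Hash(lineList[i]);
    }
    return hashes;
}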
IEnumerable<T> does not allow random access.
The ElementAt() method will actually loop through the sequence from the beginning until it reaches the n-th element.
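Conceptually, when the source has no indexer, ElementAt behaves something like this (a simplified sketch, not the actual BCL source):

public static T ElementAt<T>(this IEnumerable<T> source, int index) {
    // Walks the sequence from the start on every call: O(index) per lookup.
    using (var e = source.GetEnumerator()) {
        while (e.MoveNext()) {
            if (index-- == 0) return e.Current;
        }
    }
    throw new ArgumentOutOfRangeException("index");
}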
I have such a scenario at hand (using C#): I need to use a parallel "foreach" on a list of objects: Each object in this list is working like a data source, which is generating series of binary vector patterns (like "0010100110"). As each vector pattern is generated, I need to update the occurrence count of the current vector pattern on a shared ConcurrentDictionary. This ConcurrentDictionary acts like a histogram of specific binary patterns among ALL data sources. In a pseudo-code it should work like this:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();
Parallel.Foreach(var dataSource in listOfDataSources)
{
    for (int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);
        //Add the pattern to concDict if it does not exist,
        //or increment its current value, in a thread-safe fashion among all
        //dataSource objects in parallel steps.
    }
}
I have read about the TryAdd() and TryUpdate() methods of the ConcurrentDictionary class in the documentation, but I am not sure I have clearly understood them. TryAdd() obtains access to the dictionary for the current thread and looks for the existence of a specific key (a binary pattern in this case); if it does not exist, it creates its entry and sets its value to 1, as it is the first occurrence of this pattern. TryUpdate() gains access to the dictionary for the current thread and checks whether the entry with the specified key has its current value equal to a "known" value; if so, it updates it. By the way, TryGetValue() checks whether a key exists in the dictionary and returns the current value if it does.
Now I think of the following usage and wonder if it is a correct implementation of a thread-safe population of the ConcurrentDictionary:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();
Parallel.Foreach(var dataSource in listOfDataSources)
{
    for (int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);
        while (true)
        {
            //Look whether the pattern is currently in the dictionary;
            //if it is, get its current value.
            int currOccurenceOfPattern;
            bool isPatternInDict = concDict.TryGetValue(pattern, out currOccurenceOfPattern);
            //Not in the dict; try to add it.
            if (!isPatternInDict)
            {
                //If the pattern was not added in the meanwhile, add it to the dict.
                //If added, exit the while loop.
                //If not added, skip this step and try updating again.
                if (concDict.TryAdd(pattern, 1))
                    break;
            }
            //The pattern is already in the dictionary.
            //Try to increment its current occurrence value instead.
            else
            {
                //If the pattern's occurrence value was not incremented by another thread
                //in the meanwhile, update it. If this succeeds, exit the loop.
                //If TryUpdate fails, the value has been updated by another thread
                //in the meanwhile, so we need to try our chances in the next
                //pass of the while loop.
                int newValue = currOccurenceOfPattern + 1;
                if (concDict.TryUpdate(pattern, newValue, currOccurenceOfPattern))
                    break;
            }
        }
    }
}
I tried to summarize my logic in the comments in the above code snippet. From what I gather from the documentation, a thread-safe update scheme can be coded in this fashion, given the atomic "TryXXX()" methods of ConcurrentDictionary. Is this a correct approach to the problem? How can it be improved or corrected, if it is not?
You can use the AddOrUpdate method, which encapsulates the add-or-update logic as a single thread-safe operation:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();
Parallel.ForEach(listOfDataSources, dataSource =>
{
    for (int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);
        concDict.AddOrUpdate(
            pattern,
            _ => 1,                       // if pattern doesn't exist - add with value "1"
            (_, previous) => previous + 1 // if pattern exists - increment the existing value
        );
    }
});
Please note that the AddOrUpdate operation is not atomic as a whole: the value factories run outside the dictionary's internal locks and may be invoked more than once under contention. I'm not sure if this matters for your requirements, but if you need to know the exact iteration when a value was added to the dictionary, you can keep your code (or extract it into a kind of extension method).
You might also want to go through this article
I don't know what BinaryPattern is here, but I would probably address this in a different way. Instead of copying value types around, inserting things into dictionaries, etc., if performance is critical I would be more inclined to simply place your occurrence counter in BinaryPattern itself, and then use Interlocked.Increment() to bump the counter whenever the pattern is found.
Unless there is a reason to separate the count from the pattern, in which case the ConcurrentDictionary is probably a good choice.
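A minimal sketch of that idea, assuming BinaryPattern is a class you are free to extend (the member names here are hypothetical):

public class BinaryPattern
{
    // Hypothetical counter carried by the pattern itself.
    private int occurrenceCount;

    public void RecordOccurrence()
    {
        // Atomic increment; safe to call from many threads at once.
        System.Threading.Interlocked.Increment(ref occurrenceCount);
    }

    public int OccurrenceCount
    {
        // CompareExchange with identical values is a safe, volatile-style read.
        get { return System.Threading.Interlocked.CompareExchange(ref occurrenceCount, 0, 0); }
    }
}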
First, the question is a little confusing because it's not clear what you mean by Parallel.Foreach. I would naively expect this to be System.Threading.Tasks.Parallel.ForEach(), but that's not usable with the syntax you show here.
That said, assuming you actually mean something like Parallel.ForEach(listOfDataSources, dataSource => { ... } )…
Personally, unless you have some specific need to show intermediate results, I would not bother with ConcurrentDictionary here. Instead, I would let each concurrent operation generate its own dictionary of counts, and then merge the results at the end. Something like this:
var results = listOfDataSources.Select(dataSource =>
    Tuple.Create(dataSource, new Dictionary<BinaryPattern, int>())).ToList();

Parallel.ForEach(results, result =>
{
    for (int i = 0; i < result.Item1.OperationCount; i++)
    {
        BinaryPattern pattern = result.Item1.GeneratePattern(i);
        int count;
        result.Item2.TryGetValue(pattern, out count);
        result.Item2[pattern] = count + 1;
    }
});

var finalResult = new Dictionary<BinaryPattern, int>();
foreach (var result in results)
{
    foreach (var kvp in result.Item2)
    {
        int count;
        finalResult.TryGetValue(kvp.Key, out count);
        finalResult[kvp.Key] = count + kvp.Value;
    }
}
This approach would avoid contention between the worker threads (at least where the counts are concerned), potentially improving efficiency. The final aggregation operation should be very fast and can easily be handled in the single, original thread.
If I had a statement such as:
var item = Core.Collections.Items.FirstOrDefault(itm => itm.UserID == bytereader.readInt());
Does this code read an integer from my stream each iteration, or does it read the integer once, store it, then use its value throughout the lookup?
Consider this code:
static void Main(string[] args)
{
    new[] { 1, 2, 3, 4 }.FirstOrDefault(j => j == Get());
    Console.ReadLine();
}

static int i = 5;

static int Get()
{
    Console.WriteLine("GET:" + i);
    return i--;
}
This shows that it calls the method once per element inspected, until it meets the first element matching the condition. The output will be:
GET:5
GET:4
GET:3
I don't know without checking but would expect it to read it each time.
But this is very easily remedied with the following version of your code.
int val = bytereader.readInt();
var item = Core.Collections.Items.FirstOrDefault(itm => itm.UserID == val);
Myself, I would take this approach automatically anyway, just to remove any doubt. It might be a good habit to form, as there is no reason to read the value once per item.
It's actually quite obvious that the call is performed for each item - FirstOrDefault() takes a delegate as an argument. This fact is a bit obscured by the use of a lambda expression, but in the end the method only sees a delegate that it can call for each item to check the predicate. In order to evaluate the right-hand side only once, some magic mechanism would have to understand and rewrite the method, and (sometimes sadly) there is no real magic inside compilers and runtimes.
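To make that concrete, FirstOrDefault with a predicate does roughly this internally (a simplified sketch, not the actual BCL source):

public static T FirstOrDefault<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (T item in source)
    {
        // The delegate - your lambda - runs once per item inspected,
        // so any call buried inside it runs that many times too.
        if (predicate(item))
            return item;
    }
    return default(T);
}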
I have a simple method that compares an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed list is then returned.
The loop in this method iterates over about 250,000 FileInfo objects, and it is taking an obscene amount of time to complete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First how can I check to make sure my suspicion is true about the cause and secondly, how can I improve the method to speed the process up?
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence or non-existence of an item. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e. it shouldn't depend on the number of items in the set.
This small change should make the entire method run in linear time:
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles,
    List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    HashSet<string> processedFileSet = new HashSet<string>(processedFiles);
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFileSet.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
I would suggest 3 improvements, if possible:
For extra efficiency, store the processed files in a set at the source, so that this method takes an ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
Try not to mix and match different representations of the same entity (string and FileInfo) in this fashion. Pick one and go with it.
You might also want to consider the HashSet<T>.ExceptWith method instead of doing the looping yourself; a sketch follows this list. Bear in mind that it will mutate the collection.
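A quick sketch of the ExceptWith idea, assuming you only need the names back (note that the set is modified in place):

// Build a set of all file names, then strip out the processed ones.
HashSet<string> unprocessedNames = new HashSet<string>(allFiles.Select(f => f.Name));
unprocessedNames.ExceptWith(processedFiles); // mutates unprocessedNames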
If you can use LINQ, and you can afford to build up a set on every call, here's another way:
public static IEnumerable<string> GetUnprocessedFiles(
    IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
{
    // null-checks here
    return allFiles.Except(processedFiles);
}
I would try converting the processedFiles List to a HashSet. With a list, it needs to iterate the list every time you call Contains. A HashSet lookup is an O(1) operation.
You could use a dictionary/hashtable-like class to speed up the lookup process significantly. Even translating the incoming List into a hashtable once, then using that one, will be much quicker than what you're using.
Sort the searched array by file name, then employ Array.BinarySearch<T>() to search the array; each lookup should come out at about O(log N) efficiency.
Checking whether a list contains an element is faster with a sorted list.
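A sketch of that approach (assuming ordinal, case-insensitive comparison is right for your file system):

string[] sortedNames = processedFiles.ToArray();
Array.Sort(sortedNames, StringComparer.OrdinalIgnoreCase);

// Each lookup is O(log N) against the sorted array.
bool alreadyProcessed =
    Array.BinarySearch(sortedNames, fileInfo.Name, StringComparer.OrdinalIgnoreCase) >= 0;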
Just to be excessively pedantic ...
If you know that both lists are sorted (FileInfo lists often come pre-sorted, so this approach might be applicable to you), then you can achieve truly linear performance without the time and memory overhead of a hash set. Hash-set construction still requires linear time to build, so complexity is closer to O(n + m); the hash set has to internally allocate additional object references for at most 250k strings in your case, and that's going to cost in GC terms.
Something like this half-baked generalisation might help:
public static IEnumerable<string> GetMismatches(IList<string> fileNames,
    IList<string> processedFileNames, StringComparer comparer)
{
    var filesIndex = 0;
    var procFilesIndex = 0;

    while (filesIndex < fileNames.Count)
    {
        if (procFilesIndex >= processedFileNames.Count)
        {
            yield return fileNames[filesIndex++];
        }
        else
        {
            var rc = comparer.Compare(fileNames[filesIndex], processedFileNames[procFilesIndex]);
            if (rc != 0)
            {
                if (rc < 0)
                {
                    yield return fileNames[filesIndex++];
                }
                else
                {
                    procFilesIndex++;
                }
            }
            else
            {
                filesIndex++;
                procFilesIndex++;
            }
        }
    }
}
I would strongly agree with Ani that sticking to a generic or canonical type is A Very Good Thing Indeed.
But I'll give mine -1 for unfinished generalisation and -1 for elegance...
I have a very simple function which takes in a matching bitfield, a grid, and a square. It used to use a delegate but I did a lot of recoding and ended up with a bitfield & operation to avoid the delegate while still being able to perform matching within reason. Basically, the challenge is to find all contiguous elements within a grid which match the match bitfield, starting from a specific "leader" square.
Square is somewhat small (but not tiny) class. Any tips on how to push this to be even faster? Note that the grid itself is pretty small (500 elements in this test).
Edit: It's worth noting that this function is called over 200,000 times per second. In truth, my long-run goal is to call it less often, but that's really tough, considering that my end goal is to have the grouping system handled by scripts rather than hardcoded. That said, this function will always be called more than any other.
Edit: To clarify, the function does not check if leader matches the bitfield, by design. The intention is that the leader is not required to match the bitfield (though in some cases it will).
Things tried unsuccessfully:
Initializing the dictionary and stack with a capacity.
Casting the int to an enum to avoid a cast.
Moving the dictionary and stack outside the function and clearing them each time they are needed. This makes things slower!
Things tried successfully:
Writing a hashcode function instead of using the default: Hashcodes are precomputed and are equal to x + y * parent.Width. Thanks for the reminder, Jim Mischel.
mquander's Technique: See GetGroupMquander below.
Further optimization: Once I switched to HashSets, I got rid of the Contains test and replaced it with an Add test. Both Contains and Add have to seek the key, so just checking whether an Add succeeds is more efficient than adding only after a Contains check fails. That is: if (RetVal.Add(s)) curStack.Push(s);
public static List<Square> GetGroup(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        if (Retval.ContainsKey(curItem)) continue;
        Retval.Add(curItem, true);
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)(s.RoomType) & match))
            {
                curStack.Push(s);
            }
        }
    }
    return new List<Square>(Retval.Keys);
}
=====
public static List<Square> GetGroupMquander(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    Retval.Add(leader, true);
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)(s.RoomType) & match))
            {
                if (!Retval.ContainsKey(s))
                {
                    curStack.Push(s);
                    Retval.Add(s, true);
                }
            }
        }
    }
    return new List<Square>(Retval.Keys);
}
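Putting the successful changes together, a HashSet-based version might look roughly like this (a sketch combining mquander's seeding and the Add-instead-of-Contains trick; not the code as it currently stands):

public static List<Square> GetGroupHashSet(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    HashSet<Square> retVal = new HashSet<Square>();
    retVal.Add(leader);
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)s.RoomType & match))
            {
                // Add returns false if s is already present, so one hash
                // lookup does the work of Contains + Add.
                if (retVal.Add(s))
                    curStack.Push(s);
            }
        }
    }
    return new List<Square>(retVal);
}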
The code you posted assumes that the leader square matches the bitfield. Is that by design?
I assume your Square class has implemented a GetHashCode method that's quick and provides a good distribution.
You did say micro-optimization . . .
If you have a good idea how many items you're expecting, you'll save a little bit of time by pre-allocating the dictionary. That is, if you know you won't have more than 100 items that match, you can write:
Dictionary<Square, bool> Retval = new Dictionary<Square, bool>(100);
That will avoid having to grow the dictionary and re-hash everything. You can also do the same thing with your stack: pre-allocate it to some reasonable maximum size to avoid resizing later.
Since you say that the grid is pretty small it seems reasonable to just allocate the stack and the dictionary to the grid size, if that's easy to determine. You're only talking grid_size references each, so memory isn't a concern unless your grid becomes very large.
Adding a check to see if an item is in the dictionary before you do the push might speed it up a little. It depends on the relative speed of a dictionary lookup as opposed to the overhead of having a duplicate item in the stack. Might be worth it to give this a try, although I'd be surprised if it made a big difference.
if (0 != ((int)(s.RoomType) & match))
{
    if (!Retval.ContainsKey(s))
        curStack.Push(s);
}
I'm really stretching on this last one. You have that cast in your inner loop. I know that the C# compiler sometimes generates a surprising amount of code for a seemingly simple cast, and I don't know if that gets optimized away by the JIT compiler. You could remove that cast from your inner loop by creating a local variable of the enum type and assigning it the value of match:
RoomEnumType matchType = (RoomEnumType)match;
Then your inner loop comparison becomes:
if (0 != (s.RoomType & matchType))
No cast, which might shave some cycles.
Edit: Micro-optimization aside, you'll probably get better performance by modifying your algorithm slightly to avoid processing any item more than once. As it stands, items that do match can end up in the stack multiple times, and items that don't match can be processed multiple times. Since you're already using a dictionary to keep track of items that do match, you can keep track of the non-matching items by giving them a value of false. Then at the end you simply create a List of those items that have a true value.
public static List<Square> GetGroup(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    curStack.Push(leader);
    Retval.Add(leader, true);
    int numMatch = 1;
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (Retval.ContainsKey(s))
                continue;
            if (0 != ((int)(s.RoomType) & match))
            {
                curStack.Push(s);
                Retval.Add(s, true);
                ++numMatch;
            }
            else
            {
                Retval.Add(s, false);
            }
        }
    }
    // LINQ makes this easier, but since you're using .NET 2.0...
    List<Square> matches = new List<Square>(numMatch);
    foreach (KeyValuePair<Square, bool> kvp in Retval)
    {
        if (kvp.Value == true)
        {
            matches.Add(kvp.Key);
        }
    }
    return matches;
}
Here are a couple of suggestions -
If you're using .NET 3.5, you could change RetVal to a HashSet<Square> instead of a Dictionary<Square,bool>, since you're never using the values (only the keys) in the Dictionary. This would be a small improvement.
Also, if you changed the return to IEnumerable, you could just return the HashSet's enumerator directly. Depending on the usage of the results, it could potentially be faster in certain areas (and you can always use ToList() on the results if you really need a list).
However, there is a BIG optimization that could be added here -
Right now, you're always adding in every neighbor, even if that neighbor has already been processed. For example, when leader is processed, it adds in leader+1y; then, when leader+1y is processed, it puts leader BACK in (even though you've already handled that Square), and the next time leader is popped off the stack, you continue. This is a lot of extra processing.
Try adding:
foreach (Square s in curItem.Neighbors)
{
    if ((0 != ((int)(s.RoomType) & match)) && (!Retval.ContainsKey(s)))
    {
        curStack.Push(s);
    }
}
This way, if you've already processed the square of your neighbor, it doesn't get re-added to the stack, just to be skipped when it's popped later.