I have a list of objects which can have one or more relationships to one another. I'd like to run through this list and compare each object to all other objects in the list, setting the relationships as I compare the objects. Because this comparison is, in real life, fairly complex and time-consuming, I'm trying to do this asynchronously.
I've quickly put together some sample code which illustrates the issue at hand in a fairly simple fashion.
class Program
{
    private static readonly Word[] _words =
    {
        new Word("Beef"),
        new Word("Bull"),
        new Word("Space")
    };

    static void Main()
    {
        var tasks = new List<Task>();
        foreach (var word in _words)
        {
            tasks.Add(CheckRelationShipsAsync(word));
        }
        Task.WhenAll(tasks);
    }

    static async Task CheckRelationShipsAsync(Word leftWord)
    {
        await Task.Run(() =>
        {
            foreach (var rightWord in _words)
            {
                if (leftWord.Text.First() == rightWord.Text.First())
                {
                    leftWord.RelationShips.Add(rightWord);
                }
            }
        });
    }
}
class Word
{
    public string Text { get; }
    public List<Word> RelationShips { get; } = new List<Word>();

    public Word(string text)
    {
        if (string.IsNullOrEmpty(text)) throw new ArgumentException();
        Text = text;
    }

    public override string ToString()
    {
        return $"{Text} ({RelationShips.Count} relationships)";
    }
}
The expected result would be that "Space" has no relationships, whereas the words "Bull" and "Beef" have one relationship to one another. What I get is that no word has any relationships at all. I'm having trouble understanding what exactly the problem is.
You should make the Main method async as well and await Task.WhenAll. Otherwise the task returned by WhenAll is never awaited, so Main can return before the work has completed. You can also simplify the task creation using LINQ:
static async Task Main()
{
    var tasks = _words.Select(CheckRelationShipsAsync);
    await Task.WhenAll(tasks);
}
You can also use the Wait() or WaitAll method, which runs synchronously and blocks the current thread (so it isn't a recommended approach), but it doesn't require making the Main method async:
var tasks = _words.Select(CheckRelationShipsAsync);
var task = Task.WhenAll(tasks);
task.Wait();
or
static void Main()
{
    var tasks = _words.Select(CheckRelationShipsAsync);
    Task.WaitAll(tasks.ToArray());
}
The second point is that when you check the relationships you don't skip the word itself, so every word ends up with a relationship to itself. You should add a leftWord != rightWord condition inside the foreach loop to get the expected result:
The expected result would be that "Space" has no relationships whereas
the words "Bull" and "Beef" have one relationship to one another.
Your algorithm has O(n^2) time complexity. This is a problem if you have a large number of items to compare against each other. E.g., if you have 1000 items, that gives you 1000 * 1000 = 1,000,000 (one million) comparisons.
Consider using another approach. I don't know if this is applicable to your real problem, but for this example, assuming that each word starts with a capital letter A..Z, you could store the related words by first letter in an array of length 26 of word lists.
var a = new List<Word>[26];

// Initialize array with empty lists
for (int i = 0; i < a.Length; i++) {
    a[i] = new List<Word>();
}

// Fill array with related words
foreach (var word in _words) {
    a[word.Text[0] - 'A'].Add(word); // Subtracting 'A' yields a zero-based index.
}
Note that your original solution has two nested loops (where one is hidden inside the call of CheckRelationShipsAsync). This solution has only one level of loops and a time complexity of O(n) up to here.
Now all related words appear together in the list at the corresponding array position. Taking this information, you can wire up the words that share a list, as sketched below. This part is still O(n^2); however, here n is much smaller, because it refers only to the words within a single related list, not to the length of the initial _words array.
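A sketch of that wiring step, continuing from the array a built above:

// Every pair of distinct words within the same bucket is related.
foreach (var bucket in a) {
    foreach (var left in bucket) {
        foreach (var right in bucket) {
            if (left != right) {
                left.RelationShips.Add(right);
            }
        }
    }
}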
Depending on how your real problem is formulated, it may be better to use a Dictionary<char, List<Word>> in place of my array a. The array solution requires an index, and in a real-world problem the relation condition may not be expressible as an index. A dictionary requires a key instead, and any kind of object can be used as a key. See the Remarks section of the Dictionary<TKey,TValue> documentation.
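The grouping step from above, rewritten against a dictionary, might look like this (a sketch keyed on the first character):

var groups = new Dictionary<char, List<Word>>();
foreach (var word in _words) {
    List<Word> bucket;
    if (!groups.TryGetValue(word.Text[0], out bucket)) {
        bucket = new List<Word>();
        groups[word.Text[0]] = bucket;
    }
    bucket.Add(word);
}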
An algorithm optimized in this way may be even faster than a multitasking solution.
Related
I have a redis db that has thousands of keys and I'm currently running the following line to get all the keys:
string[] keysArr = keys.Select(key => (string)key).ToArray();
But because I have a lot of keys this takes a long time. I want to limit the number of keys being read. So I'm trying to run an execute command where I get 100 keys at a time:
var keys = Redis.Connection.GetDatabase(dbNum).Execute("scan", 0, "count", 100);
This command successfully runs, however I'm unable to access the value as it is private, and unable to cast it even though the RedisResult class provides an explicit cast to it:
public static explicit operator string[] (RedisResult result);
Any ideas on how to get x keys at a time from Redis?
Thanks
SE.Redis has a .Keys() method on the IServer API which fully encapsulates the semantics of SCAN. If possible, just use this method, and consume the data 100 at a time. It is usually pretty easy to write a batching function, i.e.
ExecuteInBatches(server.Keys(), 100, batch => DoSomething(batch));
with:
public void ExecuteInBatches<T>(IEnumerable<T> source, int batchSize,
    Action<List<T>> action)
{
    List<T> batch = new List<T>();
    foreach (var item in source) {
        batch.Add(item);
        if (batch.Count == batchSize) {
            action(batch);
            batch = new List<T>(); // in case the callback stores it
        }
    }
    if (batch.Count != 0) {
        action(batch); // any leftovers
    }
}
The enumerator will worry about advancing the cursor.
You can use Execute, but: that is a lot of work! Also, SCAN makes no guarantees about how many will be returned per page; it can be zero - it can be 3 times what you asked for. It is ... guidance only.
Incidentally, the reason that the cast fails is because SCAN doesn't return a string[] - it returns an array of two items, the first of which is the "next" cursor, the second is the keys. So maybe:
var arr = (RedisResult[])server.Execute("scan", 0);
var nextCursor = (int)arr[0];
var keys = (RedisKey[])arr[1];
But all this is doing is re-implementing IServer.Keys, the hard way (and significantly less efficiently - RedisResult is not the ideal way to store data; it is simply necessary in the case of Execute and ScriptEvaluate).
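For completeness, acquiring an IServer to call .Keys() on looks something like this (a sketch assuming a single-endpoint connection, reusing the question's Redis.Connection multiplexer):

var muxer = Redis.Connection;
var server = muxer.GetServer(muxer.GetEndPoints().First());
foreach (var key in server.Keys(dbNum, pageSize: 100))
{
    // keys stream in lazily; SCAN cursor advancement is handled internally
}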
I would use the .Take() method, outlined by Microsoft here.
Returns a specified number of contiguous elements from the start of a
sequence.
It would look something like this:
//limit to 100
var keysArr = keys.Select(key => (string)key).Take(100).ToArray();
I have this class:
public class SimHasher {
    int count = 0;

    //take each string and make an int[] out of it
    //should call Hash method lines.Count() times
    public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
        //return lines.Select(il => Hash(il));
        var linesCount = lines.Count();
        var hashes = new int[linesCount][];
        for (var i = 0; i < linesCount; ++i) {
            hashes[i] = Hash(lines.ElementAt(i));
        }
        return hashes;
    }

    public int[] Hash(string line) {
        Debug.WriteLine(++count);
        //stuff
    }
}
When I run a program that calls HashAll and passes it an IEnumerable<string> with 1000 elements, it acts as expected: loops 1000 times, writing numbers from 1 to 1000 in the debug console with the program finishing in under 1 second. However if I replace the code of the HashAll method with the LINQ statement, like so:
public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
    return lines.Select(il => Hash(il));
}
the behavior seems to depend on where HashAll gets called from.
If I call it from this test method
[Fact]
public void SprutSequentialIntegrationTest() {
    var inputContainer = new InputContainer(new string[] {
        @"D:\Solutions\SimHash\SimHashTests\R.in"
    });
    var simHasher = new SimHasher();
    var documentSimHashes = simHasher.HashAll(inputContainer.InputLines); //right here
    var queryRunner = new QueryRunner(documentSimHashes);
    var queryResults = queryRunner.RunAllQueries(inputContainer.Queries);
    var expectedQueryResults = System.IO.File.ReadAllLines(
            @"D:\Solutions\SimHash\SimHashTests\R.out")
        .Select(eqr => int.Parse(eqr));
    Assert.Equal(expectedQueryResults, queryResults);
}
the counter in the debug console reaches around 13,000, even though there are only 1000 input lines. It also takes around 6 seconds to finish, but still manages to produce the same results as the loop version.
If I run it from the Main method like so
static void Main(string[] args) {
    var inputContainer = new InputContainer(args);
    var simHasher = new SimHasher();
    var documentSimHashes = simHasher.HashAll(inputContainer.InputLines);
    var queryRunner = new QueryRunner(documentSimHashes);
    var queryResults = queryRunner.RunAllQueries(inputContainer.Queries);
    foreach (var queryResult in queryResults) {
        Console.WriteLine(queryResult);
    }
}
it starts writing out to the output console right away, although very slowly, while the counter in the debug console goes into the tens of thousands. When I try to debug it line by line, it goes straight to the foreach loop and writes out the results one by one. After some Googling, I've found out that this is due to LINQ queries being lazily evaluated. However, each time it lazily evaluates a result, the counter in the debug console increases by more than 1000, which is even more than the number of input lines.
What is causing so many calls to the Hash method? Can it be deduced from these snippets?
The reason why you get more iterations than you would expect is that there are LINQ calls that iterate the IEnumerable<T> multiple times.
When you call Count() on an IEnumerable<T>, LINQ tries to see if there is a Count or Length to avoid iterating, but when there is no shortcut, it iterates IEnumerable<T> all the way to the end.
Similarly, when you call ElementAt(i), LINQ tries to see if there is an indexer, but generally it iterates the collection up to point i. This renders your loop O(n^2).
You can easily fix your problem by storing your IEnumerable<T> in a list or an array by calling ToList() or ToArray(). This would iterate through IEnumerable<T> once, and then use Count and indexes to avoid further iterations.
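Applied to the HashAll method from the question, the fix could look like this:

public IEnumerable<int[]> HashAll(IEnumerable<string> lines) {
    var lineList = lines.ToList(); // single enumeration of the source
    var hashes = new int[lineList.Count][];
    for (var i = 0; i < lineList.Count; ++i) {
        hashes[i] = Hash(lineList[i]); // Count and the indexer are now O(1)
    }
    return hashes;
}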
IEnumerable<T> does not allow random access.
The ElementAt() method will actually loop through the sequence from the start until it reaches the N-th element.
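Conceptually (ignoring the IList<T> fast path that LINQ checks for first), ElementAt behaves something like this sketch:

static T ElementAtSketch<T>(IEnumerable<T> source, int index) {
    using (var e = source.GetEnumerator()) {
        // Walk the sequence from the start, one element at a time.
        for (int i = 0; i <= index; i++) {
            if (!e.MoveNext()) throw new ArgumentOutOfRangeException("index");
        }
        return e.Current;
    }
}

Calling it for every index from 0 to n-1, as the question's loop does, therefore re-walks the sequence each time.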
This example is for a method called "WriteLines", which takes an array of strings and adds them to an asynchronous file writer. It works, but I am curious whether there is an interesting way to support any collection of strings, rather than relying on the programmer to convert to an array.
I came up with something like:
public void AddLines(IEnumerable<string> lines)
{
    // grab the queue
    lock (_queue)
    {
        // loop through the collection and enqueue each line
        for (int i = 0, count = lines.Count(); i < count; i++)
        {
            _queue.Enqueue(lines.ElementAt(i));
        }
    }
    // notify the thread it has work to do.
    _hasNewItems.Set();
}
This appears to work but I have no idea of any performance implications it has, or any logic implications either (What happens to the order? I assume this will allow even unordered collections to work, e.g. HashSet).
Is there a more accepted way to achieve this?
You've been passed an IEnumerable<string> - that means you can iterate over it. Heck, there's even a language feature specifically for it - foreach:
foreach (string line in lines)
{
    _queue.Enqueue(line);
}
Unlike your existing approach, this will only iterate over the sequence once. Your current code will behave differently based on the underlying implementation - in some cases Count() and ElementAt are optimized, but in some cases they aren't. You can see this really easily if you use an iterator block and log:
public IEnumerable<string> GetItems()
{
    Console.WriteLine("yielding a");
    yield return "a";
    Console.WriteLine("yielding b");
    yield return "b";
    Console.WriteLine("yielding c");
    yield return "c";
}
Try calling AddLines(GetItems()) with your current implementation, and look at the console...
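With the Count()-then-ElementAt implementation, you should see something like this: Count() enumerates the whole sequence once, then each ElementAt(i) restarts the iterator and walks it up to element i, giving

yielding a
yielding b
yielding c
yielding a
yielding a
yielding b
yielding a
yielding b
yielding c

whereas the foreach version prints each line exactly once.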
Adding this answer as well since you are using threads: use a BlockingCollection (which wraps a ConcurrentQueue by default) instead, like so:
// the provider method
// _queue = new BlockingCollection<string>()
public void AddLines(IEnumerable<string> lines)
{
    foreach (var line in lines)
    {
        _queue.Add(line);
    }
}
No locks are required, and it allows for multiple consumers and producers, since the collection signals for each element added.
The consumer basically only has to do var workitem = _queue.Take();
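A consumer loop over a BlockingCollection can be as simple as this sketch (Process is a hypothetical per-item handler):

// GetConsumingEnumerable blocks until an item is available, and
// completes once CompleteAdding() has been called and the collection drains.
foreach (var workItem in _queue.GetConsumingEnumerable())
{
    Process(workItem); // hypothetical per-item handler
}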
I have the following scenario at hand (using C#): I need to use a parallel "foreach" on a list of objects. Each object in this list works like a data source, generating a series of binary vector patterns (like "0010100110"). As each vector pattern is generated, I need to update the occurrence count of that pattern in a shared ConcurrentDictionary. This ConcurrentDictionary acts like a histogram of specific binary patterns among ALL data sources. In pseudo-code it should work like this:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();

Parallel.Foreach(var dataSource in listOfDataSources)
{
    for(int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);
        //Add the pattern to concDict if it does not exist,
        //or increment the current value of it, in a thread-safe fashion among all
        //dataSource objects in parallel steps.
    }
}
I have read about the TryAdd() and TryUpdate() methods of the ConcurrentDictionary class in the documentation, but I am not sure that I have clearly understood them. TryAdd() obtains access to the dictionary for the current thread and looks for the existence of a specific key, a binary pattern in this case; if it does not exist, it creates its entry and sets its value to 1, as it is the first occurrence of this pattern. TryUpdate() gains access to the dictionary for the current thread and looks at whether the entry with the specified key has its current value equal to a "known" value; if so, it updates it. By the way, TryGetValue() checks whether a key exists in the dictionary and returns the current value if it does.
Now I think of the following usage and wonder if it is a correct implementation of a thread-safe population of the ConcurrentDictionary:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();

Parallel.Foreach(var dataSource in listOfDataSources)
{
    for(int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);

        while(true)
        {
            //Look whether the pattern is in the dictionary currently;
            //if it is, get its current value.
            int currOccurenceOfPattern;
            bool isPatternInDict = concDict.TryGetValue(pattern, out currOccurenceOfPattern);

            //Not in dict, try to add.
            if(!isPatternInDict)
            {
                //If the pattern has not been added in the meanwhile, add it to the dict.
                //If added, then exit from the while loop.
                //If not added, then skip this step and try updating again.
                if(concDict.TryAdd(pattern, 1))
                    break;
            }
            //The pattern is already in the dictionary.
            //Try to increment its current occurrence value instead.
            else
            {
                //If the pattern's occurrence value has not been incremented by another thread
                //in the meanwhile, update it. If this succeeds, then exit from the loop.
                //If TryUpdate fails, then the value has been updated by another thread
                //in the meanwhile, and we need to try our chances in the next
                //step of the while loop.
                int newValue = currOccurenceOfPattern + 1;
                if(concDict.TryUpdate(pattern, newValue, currOccurenceOfPattern))
                    break;
            }
        }
    }
}
I tried to summarize my logic in the comments in the above code snippet. From what I gather from the documentation, a thread-safe updating scheme can be coded in this fashion, given the atomic "TryXXX()" methods of ConcurrentDictionary. Is this a correct approach to the problem? How can it be improved or corrected if it is not?
You can use the AddOrUpdate method, which encapsulates the add-or-update logic in a single thread-safe operation:
ConcurrentDictionary<BinaryPattern,int> concDict = new ConcurrentDictionary<BinaryPattern,int>();

Parallel.ForEach(listOfDataSources, dataSource =>
{
    for(int i = 0; i < dataSource.OperationCount; i++)
    {
        BinaryPattern pattern = dataSource.GeneratePattern(i);
        concDict.AddOrUpdate(
            pattern,
            _ => 1,                        // if pattern doesn't exist - add with value "1"
            (_, previous) => previous + 1  // if pattern exists - increment existing value
        );
    }
});
Please note that the AddOrUpdate operation is not atomic (the delegates run outside the dictionary's internal locks and may be invoked more than once). Not sure if it's your requirement, but if you need to know the exact iteration at which a value was added to the dictionary, you can keep your code (or extract it into a kind of extension method, as sketched below).
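A hypothetical extension method along those lines, wrapping the question's retry loop and reporting whether this call performed the initial add:

static class ConcurrentDictionaryExtensions
{
    //Returns true when this call added the key, false when it incremented an existing value.
    public static bool AddOrIncrement<TKey>(this ConcurrentDictionary<TKey, int> dict, TKey key)
    {
        while (true)
        {
            if (dict.TryAdd(key, 1))
                return true;  // this call created the entry

            int current;
            if (dict.TryGetValue(key, out current) &&
                dict.TryUpdate(key, current + 1, current))
                return false; // this call incremented an existing entry
        }
    }
}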
You might also want to go through this article
I don't know what BinaryPattern is here, but I would probably address this in a different way. Instead of copying value types around, inserting things into dictionaries, etc., if performance is critical I would be more inclined to simply place the occurrence counter in BinaryPattern itself, then use Interlocked.Increment() to bump the counter whenever the pattern is found.
Unless there is a reason to separate the count from the pattern, in which case the ConcurrentDictionary is probably a good choice.
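A minimal sketch of that idea, assuming BinaryPattern is a class you are free to extend (the member names here are hypothetical):

public class BinaryPattern
{
    private int _occurrences;

    public int Occurrences { get { return _occurrences; } }

    public void RecordOccurrence()
    {
        //Atomic increment; safe from any number of threads without locks.
        Interlocked.Increment(ref _occurrences);
    }
}

Note that this only aggregates correctly if equal patterns are represented by the same BinaryPattern instance.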
First, the question is a little confusing because it's not clear what you mean by Parallel.Foreach. I would naively expect this to be System.Threading.Tasks.Parallel.ForEach(), but that's not usable with the syntax you show here.
That said, assuming you actually mean something like Parallel.ForEach(listOfDataSources, dataSource => { ... } )…
Personally, unless you have some specific need to show intermediate results, I would not bother with ConcurrentDictionary here. Instead, I would let each concurrent operation generate its own dictionary of counts, and then merge the results at the end. Something like this:
var results = listOfDataSources.Select(dataSource =>
    Tuple.Create(dataSource, new Dictionary<BinaryPattern, int>())).ToList();

Parallel.ForEach(results, result =>
{
    for(int i = 0; i < result.Item1.OperationCount; i++)
    {
        BinaryPattern pattern = result.Item1.GeneratePattern(i);

        int count;
        result.Item2.TryGetValue(pattern, out count);
        result.Item2[pattern] = count + 1;
    }
});

var finalResult = new Dictionary<BinaryPattern, int>();

foreach (var result in results)
{
    foreach (var kvp in result.Item2)
    {
        int count;
        finalResult.TryGetValue(kvp.Key, out count);
        finalResult[kvp.Key] = count + kvp.Value;
    }
}
This approach would avoid contention between the worker threads (at least where the counts are concerned), potentially improving efficiency. The final aggregation operation should be very fast and can easily be handled in the single, original thread.
The following code is a simplified version of the code that I am trying to optimize.
void Main()
{
    var words = new List<string> { "abcd", "wxyz", "1234" };
    foreach (var character in SplitItOut(words))
    {
        Console.WriteLine(character);
    }
}

public IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    foreach (string word in words)
    {
        var characters = GetCharacters(word);
        foreach (char c in characters)
        {
            yield return c;
        }
    }
}

char[] GetCharacters(string word)
{
    Thread.Sleep(5000);
    return word.ToCharArray();
}
I cannot change the signature of the SplitItOut method. The GetCharacters method is expensive to call but is thread safe. The input to the SplitItOut method can contain 100,000+ entries, and a single call to GetCharacters() can take around 200 ms. It can also throw exceptions, which I can ignore. The order of the results does not matter.
In my first attempt I came up with the following implementation using the TPL, which speeds things up quite a bit, but blocks until I am done processing all the words.
public IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    Task<char[][]> tasks = Task<char[][]>.Factory.StartNew(() =>
    {
        ConcurrentBag<char[]> taskResults = new ConcurrentBag<char[]>();
        Parallel.ForEach(words, word =>
        {
            taskResults.Add(GetCharacters(word));
        });
        return taskResults.ToArray();
    });

    foreach (var wordResult in tasks.Result)
    {
        foreach (var c in wordResult)
        {
            yield return c;
        }
    }
}
I am looking for any better implementation of the SplitItOut() method than this. Lower processing time is my priority here.
If I'm reading your question correctly, you're not looking to just speed up the parallel processing that creates the chars from the words - you would like your enumerable to produce each one as soon as it's ready. With the implementation you currently have (and the other answers I currently see), the SplitItOut will wait until all of the words have been sent to GetCharacters, and all results returned before producing the first one.
In cases like this, I like to think of things as splitting my process into producers and a consumer. Your producer thread(s) will take the available words and call GetCharacters, then dump the results somewhere. The consumer will yield up characters to the caller of SplitItOut as soon as they are ready. Really, the consumer is the caller of SplitItOut.
We can make use of the BlockingCollection as both a way to yield up the characters, and as the "somewhere" to put the results. We can use the ConcurrentBag as a place to put the words that have yet to be split:
static void Main()
{
    var words = new List<string> { "abcd", "wxyz", "1234" };
    foreach (var character in SplitItOut(words))
    {
        Console.WriteLine(character);
    }
}

static char[] GetCharacters(string word)
{
    Thread.Sleep(5000);
    return word.ToCharArray();
}
No changes to your main or GetCharacters - since these represent your constraints (can't change caller, can't change expensive operation)
public static IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    var source = new ConcurrentBag<string>(words);
    var chars = new BlockingCollection<char>();

    var tasks = new[]
    {
        Task.Factory.StartNew(() => CharProducer(source, chars)),
        Task.Factory.StartNew(() => CharProducer(source, chars)),
        //add more, tweak away, or use a factory to create tasks.
        //measure before you simply add more!
    };

    Task.Factory.ContinueWhenAll(tasks, t => chars.CompleteAdding());

    return chars.GetConsumingEnumerable();
}
Here, we change the SplitItOut method to do four things:
Initialize a ConcurrentBag with all of the words we wish to split. (Side note: if you want to enumerate over words on demand, you can start a new task to push them in rather than doing it in the constructor.)
Start up our char "producer" Tasks. You can start a set number, use a factory, whatever. I suggest not going task-crazy before you measure.
Signal the BlockingCollection that we are done when all tasks have completed.
"Consume" all of the produced chars (we make it easy on ourselves and just return an IEnumerable<char> rather than foreach and yield, but you could do it the long way if you wish)
All that's missing is our producer implementation. I've expanded out all the linq shortcuts to make it clear, but it's super simple:
private static void CharProducer(ConcurrentBag<string> words, BlockingCollection<char> output)
{
    while(!words.IsEmpty)
    {
        string word;
        if(words.TryTake(out word))
        {
            foreach (var c in GetCharacters(word))
            {
                output.Add(c);
            }
        }
    }
}
This simply
Takes a word out of the ConcurrentBag (unless it's empty - if it is, task is done!)
Calls the expensive method
Puts the output in the BlockingCollection
I put your code through the profiler built into Visual Studio, and it looks like the overhead of the Task was hurting you. I refactored it slightly to remove the Task, and it improved the performance a bit. Without your actual algorithm and dataset, it's hard to tell exactly what the issue is or where the performance can be improved. If you have VS Premium or Ultimate, there are built-in profiling tools that will help you out a lot. You can also grab the trial of ANTS.
One thing to bear in mind: Don't try to prematurely optimize. If your code is performing acceptably, don't add stuff to possibly make it faster at the expense of readability and maintainability. If it's not performing to an acceptable level, profile it before you start messing with it.
In any case, here's my refactoring of your algorithm:
public static IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    var taskResults = new ConcurrentBag<char[]>();
    Parallel.ForEach(words, word => taskResults.Add(GetCharacters(word)));
    return taskResults.SelectMany(wordResult => wordResult);
}