Using the Task Parallel Library in an IEnumerable implementation to achieve a speed improvement - C#

The following code is a simplified version of the code that I am trying to optimize.
void Main()
{
    var words = new List<string> { "abcd", "wxyz", "1234" };
    foreach (var character in SplitItOut(words))
    {
        Console.WriteLine(character);
    }
}
public IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    foreach (string word in words)
    {
        var characters = GetCharacters(word);
        foreach (char c in characters)
        {
            yield return c;
        }
    }
}
char[] GetCharacters(string word)
{
    Thread.Sleep(5000);
    return word.ToCharArray();
}
I cannot change the signature of the SplitItOut method. The GetCharacters method is expensive to call but is thread safe. The input to the SplitItOut method can contain 100,000+ entries, and a single call to GetCharacters() can take around 200ms. It can also throw exceptions, which I can ignore. The order of the results does not matter.
In my first attempt I came up with the following implementation using the TPL, which speeds things up quite a bit but blocks until all the words have been processed.
public IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    Task<char[][]> tasks = Task<char[][]>.Factory.StartNew(() =>
    {
        ConcurrentBag<char[]> taskResults = new ConcurrentBag<char[]>();
        Parallel.ForEach(words,
            word =>
            {
                taskResults.Add(GetCharacters(word));
            });
        return taskResults.ToArray();
    });
    foreach (var wordResult in tasks.Result)
    {
        foreach (var c in wordResult)
        {
            yield return c;
        }
    }
}
I am looking for any better implementation of the SplitItOut() method than this. Lower processing time is my priority here.

If I'm reading your question correctly, you're not looking to just speed up the parallel processing that creates the chars from the words - you would like your enumerable to produce each one as soon as it's ready. With the implementation you currently have (and the other answers I currently see), the SplitItOut will wait until all of the words have been sent to GetCharacters, and all results returned before producing the first one.
In cases like this, I like to think of things as splitting my process into producers and a consumer. Your producer thread(s) will take the available words and call GetCharacters, then dump the results somewhere. The consumer will yield up characters to the caller of SplitItOut as soon as they are ready. Really, the consumer is the caller of SplitItOut.
We can make use of the BlockingCollection as both a way to yield up the characters, and as the "somewhere" to put the results. We can use the ConcurrentBag as a place to put the words that have yet to be split:
static void Main()
{
    var words = new List<string> { "abcd", "wxyz", "1234" };
    foreach (var character in SplitItOut(words))
    {
        Console.WriteLine(character);
    }
}
static char[] GetCharacters(string word)
{
    Thread.Sleep(5000);
    return word.ToCharArray();
}
No changes to your main or GetCharacters - since these represent your constraints (can't change caller, can't change expensive operation)
public static IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    var source = new ConcurrentBag<string>(words);
    var chars = new BlockingCollection<char>();
    var tasks = new[]
    {
        Task.Factory.StartNew(() => CharProducer(source, chars)),
        Task.Factory.StartNew(() => CharProducer(source, chars)),
        // add more, tweak away, or use a factory to create tasks.
        // measure before you simply add more!
    };
    Task.Factory.ContinueWhenAll(tasks, t => chars.CompleteAdding());
    return chars.GetConsumingEnumerable();
}
Here, we change the SplitItOut method to do four things:
1. Initialize a ConcurrentBag with all of the words we wish to split. (Side note: if you want to enumerate over words on demand, you can start a new task to push them in rather than doing it in the constructor - see the sketch after this list.)
2. Start up our char "producer" Tasks. You can start a set number, use a factory, whatever. I suggest not going task-crazy before you measure.
3. Signal the BlockingCollection that we are done when all tasks have completed.
4. "Consume" all of the produced chars (we make it easy on ourselves and just return an IEnumerable<char> rather than foreach and yield, but you could do it the long way if you wish)
All that's missing is our producer implementation. I've expanded out all the LINQ shortcuts to make it clear, but it's super simple:
private static void CharProducer(ConcurrentBag<string> words, BlockingCollection<char> output)
{
    while (!words.IsEmpty)
    {
        string word;
        if (words.TryTake(out word))
        {
            foreach (var c in GetCharacters(word))
            {
                output.Add(c);
            }
        }
    }
}
This simply:
1. Takes a word out of the ConcurrentBag (unless it's empty - if it is, the task is done!)
2. Calls the expensive method
3. Puts the output in the BlockingCollection
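One caveat worth adding (my sketch, not part of the original answer): the question says GetCharacters can throw and those exceptions can be ignored, but an unhandled exception here would fault the producer task. A per-word try/catch keeps the pipeline alive:
private static void CharProducer(ConcurrentBag<string> words, BlockingCollection<char> output)
{
    while (!words.IsEmpty)
    {
        string word;
        if (words.TryTake(out word))
        {
            try
            {
                foreach (var c in GetCharacters(word))
                {
                    output.Add(c);
                }
            }
            catch
            {
                // The question states failures from GetCharacters may be
                // ignored, so we drop this word and keep going.
            }
        }
    }
}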

I put your code through the profiler built into Visual Studio, and it looks like the overhead of the Task was hurting you. I refactored it slightly to remove the Task, and it improved the performance a bit. Without your actual algorithm and dataset, it's hard to tell exactly what the issue is or where the performance can be improved. If you have VS Premium or Ultimate, there are built-in profiling tools that will help you out a lot. You can also grab the trial of ANTS.
One thing to bear in mind: Don't try to prematurely optimize. If your code is performing acceptably, don't add stuff to possibly make it faster at the expense of readability and maintainability. If it's not performing to an acceptable level, profile it before you start messing with it.
In any case, here's my refactoring of your algorithm:
public static IEnumerable<char> SplitItOut(IEnumerable<string> words)
{
    var taskResults = new ConcurrentBag<char[]>();
    Parallel.ForEach(words, word => taskResults.Add(GetCharacters(word)));
    return taskResults.SelectMany(wordResult => wordResult);
}

Related

C# Parallel.ForEach and Task.WhenAll sometimes returning fewer values than expected

I have this:
Parallel.ForEach(numbers, (number) =>
{
    var value = Regex.Replace(number, @"\s+", "%20");
    tasks.Add(client.GetAsync(url + value));
});
await Task.WhenAll(tasks).ConfigureAwait(false);
foreach (var task in tasks)
{
    ...
}
Sometimes there are fewer tasks than expected when reaching the foreach (var task in tasks), but after a few requests, it starts returning all the tasks.
I've changed the ConfigureAwait to true and it still sometimes returns fewer tasks.
BTW, I'm using Parallel.ForEach because each client.GetAsync(url + value) is a request to an external API, with the particularity that its latency SLA is lower than 1s for 99% of its requests.
Can you guys explain to me why it sometimes returns fewer tasks?
And is there a way to guarantee that all tasks are always returned?
Thanks
And is there a way to guarantee that all tasks are always returned?
Several people in the comments are pointing out you should just do this, on the assumption that numbers is a non-threadsafe List:
foreach (var number in numbers)
{
    var value = Regex.Replace(number, @"\s+", "%20");
    tasks.Add(client.GetAsync(url + value));
}
await Task.WhenAll(tasks).ConfigureAwait(false);
foreach (var task in tasks)
{
    ...
}
There doesn't seem to be any considerable benefit in parallelizing the creation of the tasks that do the download; this happens very quickly. The waiting for the downloads to complete is done in the WhenAll.
P.S. There are a variety of more involved ways of escaping data for a URL, but if you're specifically looking to convert any kind of whitespace to %20, I guess it makes sense to do it with a regex.
Edit: you asked when to use a Parallel.ForEach, and I'm going to say "don't, generally, because you have to be more careful about the contexts within which you use it", but if you made the Parallel.ForEach do more synchronous work, it might make sense:
Parallel.ForEach(numbers, number =>
{
    var value = Regex.Replace(number, @"\s+", "%20");
    var r = client.Get(url + value);
    // do something meaningful with r here, i.e. whatever ... is in your foreach (var task in tasks)
});
but be mindful that if you're performing updates to some shared thing, for coordination purposes, from within the body, then it'll need to be thread-safe.
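For example, if you wanted to collect the responses from inside the body, a thread-safe collection does the coordination for you (a sketch of mine, assuming a synchronous client.Get that returns the response body as a string - the same hypothetical call used in the snippet above):
var results = new ConcurrentBag<string>();
Parallel.ForEach(numbers, number =>
{
    var value = Regex.Replace(number, @"\s+", "%20");
    var r = client.Get(url + value);
    results.Add(r); // ConcurrentBag<T> is safe for concurrent adds
});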
You haven't shown it, so we can only guess but I assume that tasks is a List<>. This collection type is not thread-safe; your parallel loop is likely "overwriting" values. Either perform manual locking of your list or switch to a thread-safe collection such as a ConcurrentQueue<>
var tasks = new ConcurrentQueue<Task<string>>();
Parallel.ForEach(numbers, number =>
{
    var value = Regex.Replace(number, @"\s+", "%20");
    tasks.Enqueue(client.GetAsync(url + value));
});
await Task.WhenAll(tasks.ToArray()).ConfigureAwait(false);
foreach (var task in tasks)
{
    // whatever
}
That said, your use of Parallel.ForEach is quite suspect. You aren't performing anything of real significance inside the loop. Use of Parallel, especially with proper locking, likely has overhead that negates any potential gains you may observe or that are realized by parallelizing the Regex calls. I would convert this to a normal foreach loop and precompile the Regex to offset (some of) its overhead:
// in class
private static readonly Regex SpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);

// in method
var tasks = new List<Task<string>>();
foreach (var number in numbers)
{
    var value = SpaceRegex.Replace(number, "%20");
    tasks.Add(client.GetAsync(url + value));
}
await Task.WhenAll(tasks).ConfigureAwait(false);
foreach (var task in tasks)
{
    // whatever
}
Alternatively, don't use a regex at all. Use a proper Uri escaping mechanism which will have the added benefit of fixing more than just spaces:
var value = Uri.EscapeDataString(number);
// or
var fullUri = Uri.EscapeUriString(url + number);
Note that there are two different methods there. The proper one to use depends on the values of url and number. There are also other mechanisms, such as the HttpUtility.UrlEncode method... but I think these are the preferred ones.

Modifying list of related objects asynchronously yields unexpected results

I have a list of objects which can have one or more relationships to one another. I'd like to run through this list and compare each object to all other objects in the list, setting the relationships as I compare the objects. Because this comparison in real life is fairly complex and time consuming I'm trying to do this asynchronously.
I've quickly put together some sample code which illustrates the issue at hand in a fairly simple fashion.
class Program
{
    private static readonly Word[] _words =
    {
        new Word("Beef"),
        new Word("Bull"),
        new Word("Space")
    };

    static void Main()
    {
        var tasks = new List<Task>();
        foreach (var word in _words)
        {
            tasks.Add(CheckRelationShipsAsnc(word));
        }
        Task.WhenAll(tasks);
    }

    static async Task CheckRelationShipsAsnc(Word leftWord)
    {
        await Task.Run(() =>
        {
            foreach (var rightWord in _words)
            {
                if (leftWord.Text.First() == rightWord.Text.First())
                {
                    leftWord.RelationShips.Add(rightWord);
                }
            }
        });
    }
}

class Word
{
    public string Text { get; }
    public List<Word> RelationShips { get; } = new List<Word>();

    public Word(string text)
    {
        if (string.IsNullOrEmpty(text)) throw new ArgumentException();
        Text = text;
    }

    public override string ToString()
    {
        return $"{Text} ({RelationShips.Count} relationships)";
    }
}
The expected result would be that "Space" has no relationships whereas the words "Bull" and "Beef" have one relationship to one another. What I get is that all words have no relationships at all. I'm having trouble understanding what exactly the problem is.
You should make the Main method async as well and await Task.WhenAll. Otherwise the combined task returned by WhenAll is never awaited, so Main can finish before the work completes. You can also simplify the task creation using LINQ:
static async Task Main()
{
    var tasks = _words.Select(CheckRelationShipsAsync);
    await Task.WhenAll(tasks);
}
You can also use the Wait() or WaitAll method, which runs synchronously and blocks the current thread (so it isn't a recommended approach), but it doesn't require making Main async:
var tasks = _words.Select(CheckRelationShipsAsync);
var task = Task.WhenAll(tasks);
task.Wait();
or
static void Main()
{
    var tasks = _words.Select(CheckRelationShipsAsync);
    Task.WaitAll(tasks.ToArray());
}
The second point is that when you check the relationships you don't skip the word itself, so every word ends up with a relationship to itself. You should add a leftWord != rightWord condition inside the foreach loop to get the expected result.
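A minimal sketch of that fix inside the CheckRelationShipsAsnc loop:
foreach (var rightWord in _words)
{
    // Skip the word itself so it doesn't end up related to itself
    if (leftWord != rightWord && leftWord.Text.First() == rightWord.Text.First())
    {
        leftWord.RelationShips.Add(rightWord);
    }
}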
The expected result would be that "Space" has no relationships whereas
the words "Bull" and "Beef" have one relationship to one another.
Your algorithm has an O(n^2) time complexity. This is a problem if you have a great number of items to compare against each other. E.g., if you have 1000 items, this gives you 1000 * 1000 = 1000000 (one million) comparisons.
Consider using another approach. I don't know if this is applicable to your real problem, but for this example, assuming that each word starts with a capital letter A..Z, you could store the related words by first letter in an array of length 26 of word lists.
var a = new List<Word>[26];

// Initialize array with empty lists
for (int i = 0; i < a.Length; i++) {
    a[i] = new List<Word>();
}

// Fill array with related words
foreach (var word in _words) {
    a[word.Text[0] - 'A'].Add(word); // Subtracting 'A' yields a zero-based index.
}
Note that your original solution has two nested loops (where one is hidden inside the call of CheckRelationShipsAsnc). This solution has only one level of loops and has a time complexity of O(n) up to here.
Now you find all related words in one list at the corresponding array positions. Taking this information, you can now wire up the words being in the same list. This part is still O(n^2); however, here n is much smaller, because it refers only to the words in the related lists, not to the length of the initial _words array.
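The wiring-up step could then look like this (my sketch, continuing from the array a above; the left != right check keeps a word from relating to itself):
// The quadratic part now runs over small buckets instead of the
// whole _words array.
foreach (var bucket in a)
{
    foreach (var left in bucket)
    {
        foreach (var right in bucket)
        {
            if (left != right)
            {
                left.RelationShips.Add(right);
            }
        }
    }
}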
Depending on how your real problem is formulated, it may be better to use a Dictionary<char, List<Word>> in place of my array a. The array solution requires an index. In a real-world problem, the relation condition may not be expressible as an index. The dictionary requires a key, and any kind of object can be used as a key. See: Remarks section of Dictionary<TKey,TValue> Class.
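A dictionary-based version of the bucketing step might look like this (again a sketch of mine, not tuned for the real problem):
var map = new Dictionary<char, List<Word>>();
foreach (var word in _words)
{
    // Any equatable key works here, not just a char index
    List<Word> bucket;
    if (!map.TryGetValue(word.Text[0], out bucket))
    {
        bucket = new List<Word>();
        map[word.Text[0]] = bucket;
    }
    bucket.Add(word);
}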
An algorithm optimized in this way may be even faster than a multitasking solution.

Skip first and last in IEnumerable, deferring execution

I have this huge JSON file, neatly formatted, starting with the characters "[\r\n" and ending with "]". I have this piece of code:
foreach (var line in File.ReadLines(@"d:\wikipedia\wikipedia.json").Skip(1))
{
    if (line[0] == ']') break;
    // Do stuff
}
I'm wondering what would be best performance-wise: what machine code would be most optimal in terms of clock cycles and memory consumed if I were to compare the above code to one where I have replaced "break" with "continue"? Or would both of those pieces of code compile to the same MSIL and machine code? If you know the answer, please explain exactly how you reached your conclusion; I'd really like to know.
EDIT: Before you close this as nonsensical, consider that this code is equivalent to the above code, and consider that the C# compiler optimizes when the code path is flat and does not fork in a lot of ways. Would all of the following examples generate the same amount of work for the CPU?
IEnumerable<char> text = new[] { '[', 'a', 'b', 'c', ']' };

foreach (var c in text.Skip(1))
{
    if (c == ']') break;
    // Do stuff
}

foreach (var c in text.Skip(1))
{
    if (c == ']') continue;
    // Do stuff
}

foreach (var c in text.Skip(1))
{
    if (c != ']')
    {
        // Do stuff
    }
}
foreach (var c in text.Skip(1))
{
    if (c != ']')
    {
        // Do stuff
    }
    else
    {
        break;
    }
}
EDIT2: Here's another way of putting it: what's the prettiest way to skip the first and last item in an IEnumerable while still deferring the execution until // Do stuff?
Q: Different MSIL for break or continue in loop?
Yes, that's because it works like this:
foreach (var item in foo)
{
    // more code...
    if (...) { continue; } // jump to #1
    if (...) { break; }    // jump to #2
    // more code...
    // #1 -- just before the '}'
}
// #2 -- after the exit of the loop.
Q: What will give you the most performance?
Branches are branches for the compiler. If you have a goto, a continue or a break, it will eventually be compiled as a branch (opcode br), which will be analyzed as such. In other words: it doesn't make a difference.
What does make a difference is having predictable patterns of both data and code flow in the code. Branching breaks code flow, so if you want performance, you should avoid irregular branches.
In other words, prefer:
for (int i=0; i<10 && someCondition; ++i)
to:
for (int i=0; i<10; ++i)
{
    // some code
    if (someCondition) { ... }
    // some code
}
As always with performance, the best thing to do is to run benchmarks. There's no surrogate.
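A throwaway benchmark for this particular question might look like this (my sketch; the data and sizes are made up, and the counter just stands in for "Do stuff" so neither loop is empty):
var data = Enumerable.Repeat('x', 10000000).Concat(new[] { ']' }).ToArray();

var sw = Stopwatch.StartNew();
long count = 0;
foreach (var c in data.Skip(1))
{
    if (c == ']') break;
    count++; // stand-in for "Do stuff"
}
sw.Stop();
Console.WriteLine("break:    " + sw.ElapsedMilliseconds + " ms (count=" + count + ")");

sw = Stopwatch.StartNew();
count = 0;
foreach (var c in data.Skip(1))
{
    if (c == ']') continue;
    count++; // stand-in for "Do stuff"
}
sw.Stop();
Console.WriteLine("continue: " + sw.ElapsedMilliseconds + " ms (count=" + count + ")");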
Q: What will give you the most performance? (#2)
You're doing a lot with IEnumerables. If you want raw performance and have the option, it's best to use an array or a string. There's no better alternative in terms of raw performance for sequential access of elements.
If an array isn't an option (for example because it doesn't match the access pattern), it's best to use a data structure that best suits the access pattern. Learn about the characteristics of hash tables (Dictionary), red black trees (SortedDictionary) and how List works. Knowledge about how stuff really works is the thing you need. If unsure, test, test and test again.
Q: What will give you the most performance? (#3)
I'd also try JSON libraries if your intent is to parse that. These people probably already invented the wheel for you - if not, it'll give you a baseline "to beat".
Q: [...] what's the prettiest way to skip the first and last item [...]
If the underlying data structure is a string, List or array, I'd simply do this:
for (int i=1; i<str.Length-1; ++i)
{ ... }
To be frank, other data structures don't really make sense here IMO. That said, people sometimes like to put Linq code everywhere, so...
Using an enumerator
You can easily make a method that returns all but the first and last element. In my book, enumerators are always accessed in code through things like foreach to ensure that Dispose is called correctly.
public static IEnumerable<T> GetAllButFirstAndLast<T>(IEnumerable<T> myEnum)
{
    T jtem = default(T);
    bool first = true;
    // Skip(1) drops the first element; holding each item back in jtem for
    // one iteration means the last element is never yielded.
    foreach (T item in myEnum.Skip(1))
    {
        if (first) { first = false; } else { yield return jtem; }
        jtem = item;
    }
}
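Usage against the earlier example, with the deferred execution preserved (nothing is enumerated until the foreach runs):
IEnumerable<char> text = new[] { '[', 'a', 'b', 'c', ']' };
foreach (var c in GetAllButFirstAndLast(text))
{
    Console.WriteLine(c); // prints a, b, c
}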
Note that this has little to do with "getting the best performance out of your code". One look at the IL tells you all you need to know.

Is there a way to handle any type of collection, instead of solely relying on Array, List, etc?

This example is for a method called "WriteLines", which takes an array of strings and adds them to an asynchronous file writer. It works, but I am curious if there is an interesting way to support -any- collection of strings, rather than relying on the programmer to convert to an array.
I came up with something like:
public void AddLines(IEnumerable<string> lines)
{
    // grab the queue
    lock (_queue)
    {
        // loop through the collection and enqueue each line
        for (int i = 0, count = lines.Count(); i < count; i++)
        {
            _queue.Enqueue(lines.ElementAt(i));
        }
    }
    // notify the thread it has work to do.
    _hasNewItems.Set();
}
This appears to work but I have no idea of any performance implications it has, or any logic implications either (What happens to the order? I assume this will allow even unordered collections to work, e.g. HashSet).
Is there a more accepted way to achieve this?
You've been passed an IEnumerable<string> - that means you can iterate over it. Heck, there's even a language feature specifically for it - foreach:
foreach (string line in lines)
{
    _queue.Enqueue(line);
}
Unlike your existing approach, this will only iterate over the sequence once. Your current code will behave differently based on the underlying implementation - in some cases Count() and ElementAt are optimized, but in some cases they aren't. You can see this really easily if you use an iterator block and log:
public IEnumerable<string> GetItems()
{
    Console.WriteLine("yielding a");
    yield return "a";
    Console.WriteLine("yielding b");
    yield return "b";
    Console.WriteLine("yielding c");
    yield return "c";
}
Try calling AddLines(GetItems()) with your current implementation, and look at the console...
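For what it's worth, here is what I'd expect the console to show (reasoning from the code above, not a captured log): Count() runs the whole iterator once, then each ElementAt(i) restarts it from the beginning:
yielding a
yielding b
yielding c   <- Count() enumerated everything
yielding a   <- ElementAt(0)
yielding a
yielding b   <- ElementAt(1) started over
yielding a
yielding b
yielding c   <- ElementAt(2) started over again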
Adding this answer as well since you are using threads: use a BlockingCollection (which wraps a ConcurrentQueue by default) instead, like so:
// the provider method
// _queue = new BlockingCollection<string>()
public void AddLines(IEnumerable<string> lines)
{
    foreach (var line in lines)
    {
        _queue.Add(line);
    }
}
No locks required, and it allows for multiple consumers and providers, since we signal for each element added.
The consumer basically only has to do var workitem = _queue.Take();
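The consuming side could then be sketched like so (my assumption for the writer loop; GetConsumingEnumerable blocks until items arrive and completes once CompleteAdding() is called):
foreach (var line in _queue.GetConsumingEnumerable())
{
    // writer is a stand-in for the asynchronous file writer
    writer.WriteLine(line);
}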

Can the compiler optimize away ToString() on a string?

I'm sure everyone's encountered their share of developers who love the ToString() method. We've all likely seen code similar to the following:
public static bool CompareIfAnyMatchesOrHasEmpty(List<string> list1, List<string> list2)
{
    bool result = false;
    foreach (string item1 in list1)
    {
        foreach (string item2 in list2)
        {
            if (item1.ToString() == item2.ToString())
            {
                result = true;
            }
            if (item1.ToString() == "")
            {
                result = true;
            }
        }
    }
    return result;
}
What I'm wondering is whether the ToString() method (the empty, no-formatting one) can be optimized away by the compiler? My assumption is that it cannot, since it's originally defined on object. Thus I provide this second question: would any effort to clean up such instances be worthwhile?
The C# compiler will not optimize this away. However, at runtime, I believe this will likely get inlined by the JIT compiler in the CLR, as string.ToString() just returns itself.
String.ToString is even declared with TargetedPatchingOptOutAttribute, which allows it to be inlined by NGEN as well when it's called from other assemblies, so it's obviously an inline target.
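You can see that string.ToString() just returns the same instance with a quick check (mine, not from the original answer):
string s = "hello";
// string.ToString() returns the instance itself, so the references match.
Console.WriteLine(object.ReferenceEquals(s, s.ToString())); // True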
It certainly could be optimized away by the compiler, but they probably don't because it's trivial. Before deciding whether any optimization is worthwhile, try some tests first. Let's try it!
List<string> strings = Enumerable.Range(1, 10000000).Select(x => Guid.NewGuid().ToString()).ToList();

var sw = Stopwatch.StartNew();
foreach (var str in strings) {
    if (!str.ToString().Equals(str.ToString())) {
        throw new ApplicationException("The world is ending");
    }
}
sw.Stop();
Console.WriteLine("Took: " + sw.Elapsed.TotalMilliseconds);

sw = Stopwatch.StartNew();
foreach (var str in strings) {
    if (!str.Equals(str)) {
        throw new ApplicationException("The world is ending");
    }
}
sw.Stop();
Console.WriteLine("Took: " + sw.Elapsed.TotalMilliseconds);
Ok, so we're in a loop with 10 million items. How long does the ToString (called twice) version take compared to the non-ToString version?
Here's what I get on my machine:
Took: 261.6189
Took: 231.2615
So, yeah. I saved 30 whole milliseconds over 10 million iterations. So...yeah, I'm going to say no, not worth it. At all.
Now, should the code be changed because it's stupid? Yes. I would make the argument as, "This is unnecessary and makes me think at a glance that this is NOT a string. It takes me brain cycles to process, and serves literally no purpose. Don't do it." Don't argue from an optimization point of view.
