Fastest Way to Parse Large Strings (multi-threaded) - C#

I am about to start a project that will take blocks of text, parse a lot of data from them into some sort of object that can then be serialized, stored, and have statistics / data gleaned from it. This needs to be as fast as possible, as I have more than 10,000,000 blocks of text to start on and will be getting hundreds of thousands more a day.
I am running this on a system with 12 Xeon cores + hyper-threading. I also have access to / know a bit about CUDA programming, but for string work I think it's not appropriate. From each string I need to parse a lot of data; some of it I know the exact positions of, some I don't, and for that I need to use regexes / something smart.
So consider something like this:
object[] ParseAll(string[] stringsToParse)
{
    var results = new object[stringsToParse.Length];
    Parallel.For(0, stringsToParse.Length, n => results[n] = Parse(stringsToParse[n]));
    return results;
}

object Parse(string s)
{
    // try to use exact positions / Substring etc. here instead of regexes
    return null;
}
So my questions are:
How much slower are regexes compared to substring operations?
Is .NET going to be significantly slower than other languages?
What sort of optimizations (if any) can I do to maximize parallelism?
Anything else I haven't considered?
Thanks for any help! Sorry if this is long winded.

How much slower are regexes compared to substring operations?
If you are looking for an exact string, a plain Substring / IndexOf will be faster. Regular expressions, however, are highly optimized. They (or at least parts of them) are compiled to IL, and you can even store these compiled versions in a separate assembly using Regex.CompileToAssembly. See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more information.
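To make that concrete, here is a minimal sketch of the two options; the pattern and the sample block are made up for illustration:

using System;
using System.Text.RegularExpressions;

// Interpreted vs. compiled regex: RegexOptions.Compiled emits IL for the pattern
// up front, which costs more at construction but matches faster when the same
// pattern is reused millions of times. Regex.CompileToAssembly goes one step
// further and persists that IL in a separate assembly.
var interpreted = new Regex(@"\d{4}-\d{2}-\d{2}");
var compiled = new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);

// Exact-position alternative with no regex at all.
string block = "order 2011-03-01 shipped";
int pos = block.IndexOf("order ", StringComparison.Ordinal);
string date = block.Substring(pos + "order ".Length, 10);   // "2011-03-01"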
What you really need to do is perform measurements. Using something like Stopwatch is by far the easiest way to verify whether one code construct works faster than the other.
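A small Stopwatch harness along those lines; the helper class and the two methods in the usage comment are placeholders, not part of any real API:

using System;
using System.Diagnostics;

static class Measure
{
    public static TimeSpan Time(string label, int iterations, Action body)
    {
        body();                               // warm-up run so JIT cost isn't measured
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            body();
        sw.Stop();
        Console.WriteLine($"{label}: {sw.Elapsed}");
        return sw.Elapsed;
    }
}

// Usage, with hypothetical parsing methods:
// Measure.Time("substring", 100_000, () => ParseWithSubstring(sampleBlock));
// Measure.Time("regex",     100_000, () => ParseWithRegex(sampleBlock));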
What sort of optimizations (if any) can I do to maximize parallelism?
With Task.Factory.StartNew, you can schedule tasks to run on the thread pool. You may also have a look at the TPL (Task Parallel Library, of which Task is a part). This has lots of constructs that help you parallelize work and allows constructs like Parallel.ForEach() to execute an iteration on multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more information.
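As a rough sketch of the Task.Factory.StartNew route (the input array and the parse delegate are placeholders standing in for the question's data and Parse method):

using System;
using System.Linq;
using System.Threading.Tasks;

string[] stringsToParse = { "block one", "block two", "block three" };  // placeholder input
Func<string, object> parse = s => s.Length;                             // placeholder for the real Parse

// Schedule one thread-pool task per block and wait for all of them.
// For millions of blocks, Parallel.For/ForEach is usually preferable because it
// batches the work instead of creating one Task object per item.
Task<object>[] tasks = stringsToParse
    .Select(s => Task.Factory.StartNew(() => parse(s)))
    .ToArray();
Task.WaitAll(tasks);
object[] results = tasks.Select(t => t.Result).ToArray();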
Anything else I haven't considered?
One of the things that will hurt you with this volume of data is memory management. A few things to take into account:
Limit memory allocation: try to re-use the same buffers for a single document instead of copying them when you only need a part. Say you need to work on a range starting at char 1000 and ending at char 2000: don't copy that range into a new buffer, but construct your code to work only within that range (see the sketch after the next point). This will make your code more complex, but it saves you memory allocations;
StringBuilder is an important class. If you don't know of it yet, have a look.
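A small sketch of both points, working inside a range of the original string instead of copying it and reusing one StringBuilder across documents; the class and method names are invented for illustration:

using System.Text;

static class BlockScanner
{
    // Reused across documents to avoid one allocation per block.
    // NOTE: not thread-safe; in a parallel version give each partition its own builder.
    private static readonly StringBuilder Scratch = new StringBuilder(4096);

    // Scan only chars [start, start + length) of the original string; no Substring copy.
    public static int CountSemicolons(string s, int start, int length)
    {
        int count = 0;
        for (int i = start; i < start + length; i++)
            if (s[i] == ';') count++;
        return count;
    }

    public static string BuildKey(string s, int start, int length)
    {
        Scratch.Clear();                  // reuse the same buffer
        Scratch.Append(s, start, length); // Append overload that copies just the range
        return Scratch.ToString();
    }
}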

I don't know what kind of processing you're doing here, but if you're talking hundreds of thousands of strings per day, it seems like a pretty small number. Let's assume that you get 1 million new strings to process every day, and you can fully task 10 of those 12 Xeon cores. That's 100,000 strings per core per day. There are 86,400 seconds in a day, so we're talking 0.864 seconds per string. That's a lot of parsing.
I'll echo the recommendations made by @Pieter, especially where he suggests making measurements to see how long it takes to do your processing. Your best bet is to get something up and working, then figure out how to make it faster if you need to. I think you'll be surprised at how often you don't need to do any optimization. (I know that's heresy to the optimization wizards, but processor time is cheap and programmer time is expensive.)
How much slower are regexes compared to substring operations?
That depends entirely on how complex your regexes are. As @Pieter said, if you're looking for a single string, String.Contains will probably be faster. You might also consider using String.IndexOfAny if you're looking for constant strings. Regular expressions aren't necessary unless you're looking for patterns that can't be represented as constant strings.
Is .NET going to be significantly slower than other languages?
In processor-intensive applications, .NET can be slower than native apps. Sometimes. If so, it's typically in the range of 5 to 20 percent, and most often between 7 and 12 percent. That's just the code executing in isolation. You have to take into account other factors like how long it takes you to build the program in that other language and how difficult it is to share data between the native app and the rest of your system.

Google recently announced its internal text processing language (which seems like a Python/Perl subset made for heavily parallel processing).
http://code.google.com/p/szl/ - Sawzall.

If you want to do fast string parsing in C#, you might want to have a look at the new NLib project. It contains string extensions to facilitate searching strings rapidly in various ways, such as IndexOfAny(string[]) and IndexOfNotAny. They also include overloads with a StringComparison argument.

Related

C# How Parallel.ForEach / Parallel.For partitioning works

I have some basic questions about Parallel.ForEach with the partition approach, and I'm facing some problems with it, so I'd like to understand how this code works and what its flow is.
Code sample
var result = new StringBuilder();
Parallel.ForEach(Enumerable.Range(1, 5), () => new StringBuilder(), (x, option, sb) =>
{
    sb.Append(x);
    return sb;
}, sb =>
{
    lock (result)
    {
        result.Append(sb.ToString());
    }
});
Questions related to the code above:
Is some partitioning work being done inside Parallel.ForEach?
When I debug the code, I can see that the iteration (execution) of the code happens more than 5 times, but as I understand it, it is supposed to fire only 5 times - Enumerable.Range(1, 5).
When will this code be fired? In both Parallel.ForEach and Parallel.For there are two blocks separated by {}. How do these two blocks execute and interact with each other?
lock (result)
{
    result.Append(sb.ToString());
}
Bonus Q:
See this block of code, where 5 iterations are not occurring; rather, more iterations are taking place when I use Parallel.For instead of ForEach. See the code and tell me where I made the mistake.
var result = new StringBuilder();
Parallel.For(1, 5, () => new StringBuilder(), (x, option, sb) =>
{
    sb.Append("line " + x + System.Environment.NewLine);
    MessageBox.Show("aaa" + x.ToString());
    return sb;
}, sb =>
{
    lock (result)
    {
        result.Append(sb.ToString());
    }
});
There are several misunderstandings regarding how Parallel.XYZ works.
A couple of great points and suggestions have been mentioned in the comments, so I won't repeat them. Rather, I would like to share some thoughts about parallel programming.
The Parallel Class
Whenever we are talking about parallel programming we are usually distinguishing two kinds: Data parallelism and Task parallelism. The former is executing the same function(s) over a chunk of data in parallel. The latter is executing several independent functions in parallel.
(There is also a 3rd model called pipeline, which is kind of a mixture of these two. I won't spend time on it; if you are interested in that one, I would suggest searching for Task Parallel Library Dataflow or System.Threading.Channels.)
The Parallel class supports both of these models. For and ForEach are designed for data parallelism, while Invoke is for task parallelism.
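For illustration, the two shapes side by side (the data and the operations are placeholders):

using System;
using System.Threading.Tasks;

string[] documents = { "a", "bb", "ccc" };   // placeholder data

// Data parallelism: the same operation over every element of a collection.
Parallel.ForEach(documents, doc => Console.WriteLine(doc.Length));

// Task parallelism: several unrelated operations run concurrently.
Parallel.Invoke(
    () => Console.WriteLine("load configuration"),
    () => Console.WriteLine("warm up cache"),
    () => Console.WriteLine("open database connection"));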
Partitioning
In the case of data parallelism, the tricky part is how you slice your data to get the best throughput / performance. You have to take into account the size of the data collection, the structure of the data, the processing logic and the available cores (and many other aspects as well). So there is no one-rule-for-all suggestion.
The main concern about partitioning is neither to under-use the resources (some cores are idle while others are working hard) nor to over-use them (there are way more waiting jobs than available cores, so the synchronization overhead can become significant).
Let's suppose your processing logic is firmly stable (in other words, varying input data will not change the processing time significantly). In this case you can load-balance the data between the executors. If an executor finishes, it can grab the next piece of data to be processed.
The way you choose which data should go to which executor can be defined via a Partitioner. By default .NET supports Range, Chunk, Hash and Striped partitioning. Some are static (the partitioning is done before any processing) and some are dynamic (depending on the processing speed, some executors might receive more data than others). A short Partitioner.Create sketch follows the two links below.
The following two excellent articles can give you better insight into how each kind of partitioning works:
Dixin's blog
Nima's blog
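As promised above, a hedged sketch of the built-in partitioners via Partitioner.Create; the array, the chunk size and the per-item work are made up:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

int[] data = new int[1_000_000];          // placeholder input
Func<int, int> process = x => x * 2;      // placeholder per-item work

// Static range partitioning: the index space is split up front,
// one contiguous [from, to) range per chunk.
Parallel.ForEach(Partitioner.Create(0, data.Length, 10_000), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        data[i] = process(data[i]);
});

// Dynamic (load-balancing) partitioning over an IList:
// faster workers keep grabbing additional chunks.
Parallel.ForEach(Partitioner.Create(data, loadBalance: true), item =>
{
    process(item);
});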
Thread Safety
If each of the executors can execute its processing task without the need to interact with the others, then they are considered independent. If you can design your algorithm to have independent processing units, then you minimize the synchronization.
In the case of For and ForEach, each partition can have its own partition-local storage. That means the computations are independent because the intermediate results are stored in partition-aware storage. But as usual, you eventually want to merge these into a single collection or even into a single value.
That's the reason why these Parallel methods have body and localFinally parameters. The former is used to define the individual processing, while the latter is the aggregate-and-merge function. (It is somewhat similar to the Map-Reduce approach.) In the latter you have to take care of thread safety yourself.
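A minimal sketch of that three-part shape, summing the lengths of many text blocks with a per-partition subtotal and one interlocked merge at the end (the input collection is a placeholder):

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

IEnumerable<string> blocks = new[] { "alpha", "beta", "gamma" };  // placeholder input
long total = 0;

Parallel.ForEach(
    blocks,                              // source
    () => 0L,                            // localInit: one running subtotal per partition
    (block, loopState, subtotal) =>      // body: no locking needed here
        subtotal + block.Length,
    subtotal =>                          // localFinally: merge once per partition
        Interlocked.Add(ref total, subtotal));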
PLINQ
I don't want to explore this topic, which is outside the scope of the question. But I would like to give you a pointer on where to get started:
MS Whitepaper about when to use Parallel and when to use PLINQ
Common pitfalls of PLINQ
Useful resources
Joe Albahari's Parallel Programming
BlackWasp's Parallel Programming
EDIT: How to decide whether it's worth running in parallel?
There is no single formula (at least to my knowledge) that will tell you when it makes sense to use parallel execution. As I tried to highlight in the Partitioning section, it is quite a complex topic, so several experiments and some fine-tuning are needed to find the optimal solution.
I highly encourage you to measure and try several different settings.
Here is my guideline for how you should tackle this:
Try to understand the current characteristics of your application
Perform several different measurements to spot the execution bottleneck
Capture the current solution's performance metrics as your baseline
If possible, try to extract that piece of code from the code base to ease the fine-tuning
Try to tackle the same problem from several different angles and with various inputs
Measure them and compare them to your baseline
If you are satisfied with the result then put that piece of code into your code base and measure again under different workloads
Try to capture as many relevant metrics as you can
If possible, consider executing both (sequential and parallel) solutions and comparing their results.
If you are satisfied then get rid of the sequential code
Details
There are several really good tools that can help you get insight into your application. For .NET profiling I would encourage you to give CodeTrack a try. Concurrency Visualizer is also a good tool if you don't need custom metrics.
By several measurements I mean that you should measure several times with several different tools to exclude special circumstances. If you measure only once, you can get a false positive result. So, measure twice, cut once.
Your sequential processing should serve as a baseline. Over-parallelization can introduce a certain overhead, which is why it makes sense to be able to compare your new shiny solution with the current one. Under-utilization can also cause significant performance degradation.
If you can extract your problematic code, then you can write micro-benchmarks for it. I encourage you to take a look at the awesome BenchmarkDotNet tool to create benchmarks.
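A minimal BenchmarkDotNet skeleton for such an extracted piece of code; the benchmark bodies here are stand-ins, not the question's real parsing logic:

using System;
using System.Linq;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ParseBenchmarks
{
    // Placeholder input; swap in a representative block of real data.
    private readonly string _block =
        string.Concat(Enumerable.Repeat("order 2011-03-01 shipped; ", 1000));

    private static readonly Regex DateRegex =
        new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);

    [Benchmark(Baseline = true)]
    public int SubstringScan() => _block.IndexOf("2011", StringComparison.Ordinal);

    [Benchmark]
    public int RegexScan() => DateRegex.Matches(_block).Count;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ParseBenchmarks>();
}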
The same problem can be solved in many ways, so try to find several different approaches (for example, Parallel and PLINQ can be used for more or less the same problems).
As I said earlier: measure, measure and measure. You should also keep in mind that .NET tries to be smart. What I mean by that is, for example, that AsParallel does not give you a guarantee that it will run in parallel. .NET analyzes your solution and data structures and decides how to run it. On the other hand, you can enforce parallel execution if you are certain that it will help.
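For example, PLINQ decides on its own whether to parallelize a query, and that decision can be overridden (a sketch; the input and the per-item work are placeholders):

using System;
using System.Linq;

string[] blocks = Enumerable.Repeat("some text block", 1_000).ToArray();  // placeholder input
Func<string, int> parse = s => s.Length;                                  // placeholder work

// PLINQ may fall back to sequential execution if its own analysis says
// parallelism won't pay off; ForceParallelism overrides that decision.
var parsed = blocks
    .AsParallel()
    .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
    .WithDegreeOfParallelism(8)
    .Select(b => parse(b))
    .ToList();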
There are libraries like Scientist.NET which can help you perform this sort of parallel run-and-compare process.
Enjoy :D

Fastest way to check over 1500 regex pattern match on the same string

I have over 1500 given regular expression patterns that need to be run on the same 100-200 KB text files, returning the list of patterns that matched. The files come from outside, so I can't make any assumptions about them.
The question is, can I somehow make processing faster than running all of these regexes against the same text?
Logically the input file is the same, and later regexes could reuse some information that has already been computed. If we take it that each regex is a finite automaton, then running 1500 finite automata over the same text is definitely slower than running one joined automaton. So the question is, can I somehow create that joined regex?
This is a perfect opportunity to take advantage of threading. Read your to-be-processed file into a string, then spin up a series of consumer threads. Have your main thread put each regular expression into a queue, then have the consumers take the next item off the queue, compile the regex, and run it on the string. The shared memory means you can have several expressions running on the same string, and even on a weak computer (2 cores, not hyper-threaded) you'll notice a significant speed boost if you keep your consumer pool to a reasonable size. On a really big server - say 32 cores with hyper-threading? You can have a nice fat pool and blast through those regular expressions in no time.
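A hedged sketch of that idea using the TPL rather than hand-rolled consumer threads; the worker count, the ConcurrentBag result collection, and the class/method names are my own choices, not from the answer:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class RegexFanOut
{
    public static IList<string> MatchAll(string text, IEnumerable<string> patterns, int workers)
    {
        var successes = new ConcurrentBag<string>();

        // The text is shared, read-only state; each worker runs its own subset of the patterns.
        Parallel.ForEach(
            patterns,
            new ParallelOptions { MaxDegreeOfParallelism = workers },
            pattern =>
            {
                if (Regex.IsMatch(text, pattern))
                    successes.Add(pattern);
            });

        return successes.ToList();
    }
}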
I think it's possible in theory, but it seems like a non-trivial task. A possible approach could be:
Convert all regexes to finite state machines.
Combine these into a single fsm.
Optimize the generated states.
Optimization would be a key step, since the inputs are lengthy (100-200 KB); otherwise memory could be a concern and performance could actually get worse. I don't know if a library exists for this purpose, but here's a theoretical answer.
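One pragmatic approximation, short of building a true combined DFA: join the patterns into a single alternation with named groups and record which groups matched. The .NET engine is backtracking rather than DFA-based, leftmost matching can hide overlapping matches, and patterns with conflicting options or backreferences can break the combination, so treat this strictly as a sketch (the sample patterns and input are made up):

using System.Linq;
using System.Text.RegularExpressions;

string inputText = "foo42 barbaz\nqux at line start";           // placeholder input
string[] patterns = { @"\bfoo\d+\b", @"bar(baz)?", @"^qux" };   // placeholder patterns

// Wrap each pattern in a named group p0..pN and OR them together.
string combinedPattern = string.Join("|", patterns.Select((p, i) => $"(?<p{i}>{p})"));
var combined = new Regex(combinedPattern, RegexOptions.Compiled | RegexOptions.Multiline);

// Record which of the original patterns matched at least once.
bool[] matched = new bool[patterns.Length];
foreach (Match m in combined.Matches(inputText))
    for (int i = 0; i < patterns.Length; i++)
        if (m.Groups["p" + i].Success)
            matched[i] = true;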

Inefficient Parallel.For?

I'm using a parallel for loop in my code to run a long running process on a large number of entities (12,000).
The process parses a string, goes through a number of input files (I've read that given the number of IO based things the benefits of threading could be questionable, but it seems to have sped things up elsewhere) and outputs a matched result.
Initially, the process goes quite quickly - however it ends up slowing to a crawl. It's possible that it has just hit some particularly tricky input data, but this seems unlikely on closer inspection.
Within the loop, I added some debug code that prints "Started Processing: " and "Finished Processing: " when it begins/ends an iteration and then wrote a program that pairs a start and a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched IDs, it looks like the program is processing in excess of 400 different entities at once. This seems like, with the large amount of IO, it could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched IDs properly, or is there some clever stuff going on behind the scenes that I'm missing, or even something obvious?
If you'd agree what I've spotted is correct, how can I limit the number it spins off and does at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.
Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the thread that is processing that data is being de-prioritized. This could be due to IO or to the thread pool trying to optimize; however, if you are strongly IO-bound then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of tasks to a reasonable number. I would suggest trial and error to determine the optimum degree, starting around the number of processor cores you have.
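A minimal sketch of capping the degree of parallelism as suggested; the entity collection, the cap of 8, and the loop body are placeholders:

using System;
using System.Linq;
using System.Threading.Tasks;

var entities = Enumerable.Range(0, 12_000).ToArray();              // placeholder for the 12,000 entities
var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };  // start near your physical core count

Parallel.ForEach(entities, options, entity =>
{
    // placeholder for the long-running parse + file IO per entity
    Console.WriteLine(entity);
});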
Good luck!
Let me start by confirming that it is indeed a very bad idea to read 2 files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone whatever number your whole thing is using.
The use of parallelism serves to optimize processing using an actually parallelizable resource, which is CPU power. If your parallelized process reads from a hard drive, then you're losing most of the benefit.
And even then, even CPU power is not open to infinite parallelization. A normal desktop CPU has the capacity to run up to about 10 threads at the same time (it depends on the model, obviously, but that's the order of magnitude).
So, two things:
First, I am going to make the assumption that your entities use all your files, but that your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e. into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache to avoid rereading your files every time you need them, and your hard drive's cache is far smaller than your memory (1000-fold).
Second, you shouldn't be running Parallel.For on 12,000 items. Parallel.For will actually (try to) create 12,000 threads, and that is actually worse than 10 threads, because of the big overhead that parallelizing will create, and the fact that your CPU will not benefit from it at all since it cannot run more than 10 threads at a time.
You should probably use a more efficient method: the IEnumerable<T>.AsParallel() extension (which comes with .NET 4.0). At runtime, it will determine the optimal number of threads to run, then divide your enumerable into that many batches. Basically, it does the job for you - but it creates a big overhead too, so it's only useful if the processing of one element is actually costly for the CPU.
From my experience, using anything parallel should always be evaluated against not using it in real-life, i.e. by actually profiling your application. Don't assume it's going to work better.

List vs. Dictionary (Maximum Size, Number of Elements)

I am attempting to ascertain the maximum sizes (in RAM) of a List and a Dictionary. I am also curious as to the maximum number of elements / entries each can hold, and their memory footprint per entry.
My reasons are simple: I, like most programmers, am somewhat lazy (this is a virtue). When I write a program, I like to write it once, and try to future-proof it as much as possible. I am currently writing a program that uses Lists, but noticed that the indexer wants an integer. Since the capabilities of my program are only limited by available memory / coding style, I'd like to write it so I can use a List with Int64s or possibly BigInts (as the indexes). I've seen IEnumerable as a possibility here, but would like to find out if I can just stuff an Int64 into a Dictionary object as the key, instead of rewriting everything. If I can, I'd like to know what the cost of that might be compared to rewriting it.
My hope is that should my program prove useful, I need only hit recompile in 5 years time to take advantage of the increase in memory.
Is it specified in the documentation for the class? No? Then it's unspecified.
In terms of current implementations, there's no maximum size in RAM in the classes themselves. If you create a value type that's 2 MB in size, push a few thousand into a list, and receive an out-of-memory exception, that has nothing to do with List<T>.
Internally, List<T>'s workings would prevent it from ever having more than 2 billion items. It's harder to come to a quick answer with Dictionary<TKey, TValue>, since the way things are positioned within it is more complicated, but really, if I were looking at dealing with a billion items (of a 32-bit value, for example, then 4 GB), I'd be looking to store them in a database and retrieve them using data-access code.
At the very least, once you're dealing with a single data structure that's 4GB in size, rolling your own custom collection class no longer counts as reinventing the wheel.
I am using a ConcurrentDictionary to rank 3x3 patterns in half a million games of Go. Obviously there are a lot of possible patterns. With C# 4.0 the ConcurrentDictionary runs out of memory at around 120 million objects. It is using 8 GB at that point (on a 32 GB machine) but wants to grow far too much, I think (table growths happen in large chunks with ConcurrentDictionary). Using a database would slow me down at least a hundredfold, I think. And the process is already taking 10 hours.
My solution was to use a multi-phase approach, actually doing multiple passes, one for each subset of patterns: for example, one pass for odd patterns and one for even patterns. When using more objects no longer fails, I can reduce the number of passes.
.NET 4.5 adds support for larger arrays in 64-bit by using unsigned 32-bit pointers for arrays (the mentioned limit goes from 2 billion to 4 billion). See also http://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx. Not sure which objects will benefit from this; List<> might.
I think you have bigger issues to solve before even wondering if a Dictionary with an int64 key will be useful in 5 or 10 years.
Having a List or Dictionary of around 2e+9 elements in memory (Int32-indexed) doesn't seem to be a good idea, never mind 9e+18 elements (Int64-indexed). Anyhow, the framework will never allow you to create a monster of that size (not even close) and probably never will. (Keep in mind that a simple int[int.MaxValue] array already far exceeds the framework's limit for memory allocation of any given object.)
And the question remains: why would you ever want your application to hold in memory a list of so many items? You are better off using a specialized data storage backend (a database) if you have to manage that amount of information.

Is there any scenario where the Rope data structure is more efficient than a string builder

Related to this question, based on a comment by user Eric Lippert.
Is there any scenario where the Rope data structure is more efficient than a string builder? It is some people's opinion that rope data structures are almost never better in terms of speed than the native string or string builder operations in typical cases, so I am curious to see realistic scenarios where indeed ropes are better.
The documentation for the SGI C++ implementation goes into some detail on the big-O behaviours versus the constant factors, which is instructive.
Their documentation assumes very long strings are involved; the examples posited for reference talk about 10 MB strings. Very few programs will be written which deal with such things, and for many classes of problems with such requirements, reworking them to be stream-based rather than requiring the full string to be available (where possible) will lead to significantly superior results. As such, ropes are for non-streaming manipulation of multi-megabyte character sequences, when you are able to treat the rope appropriately as sections (themselves ropes) rather than just a sequence of characters.
Significant Pros:
Concatenation/Insertion become nearly constant time operations
Certain operations may reuse the previous rope sections to allow sharing in memory.
Note that .NET strings, unlike Java strings, do not share the character buffer on substrings - a choice with pros and cons in terms of memory footprint. Ropes tend to avoid this sort of issue.
Ropes allow deferred loading of substrings until required
Note that this is hard to get right, very easy to render pointless due to excessive eagerness of access and requires consuming code to treat it as a rope, not as a sequence of characters.
Significant Cons:
Random read access becomes O(log n)
The constant factors on sequential read access seem to be between 5 and 10
Efficient use of the API requires treating it as a rope, not just dropping in a rope as a backing implementation behind the 'normal' string API.
This leads to a few 'obvious' uses (the first mentioned explicitly by SGI).
Edit buffers on large files allowing easy undo/redo
Note that, at some point you may need to write the changes to disk, involving streaming through the entire string, so this is only useful if most edits will primarily reside in memory rather than requiring frequent persistence (say through an autosave function)
Manipulation of DNA segments where significant manipulation occurs, but very little output actually happens
Multi-threaded algorithms which mutate local subsections of a string. In theory such cases can be parcelled off to separate threads and cores without needing to take local copies of the subsections and then recombine them, saving considerable memory as well as avoiding a costly serial combining operation at the end.
There are cases where domain specific behaviour in the string can be coupled with relatively simple augmentations to the Rope implementation to allow:
Read only strings with significant numbers of common substrings are amenable to simple interning for significant memory savings.
Strings with sparse structures, or significant local repetition are amenable to run length encoding while still allowing reasonable levels of random access.
Where the substring boundaries are themselves 'nodes' where information may be stored, though such structures are quite possibly better done as a radix trie if they are rarely modified but often read.
As you can see from the examples listed, all fall well into the 'niche' category. Further, several may well have superior alternatives if you are willing/able to rewrite the algorithm as a stream processing operation instead.
The short answer to this question is yes, and that requires little explanation. Of course there are situations where the rope data structure is more efficient than a string builder. They work differently, so they are better suited to different purposes.
(From a C# perspective)
The rope data structure as a binary tree is better in certain situations. When you're looking at extremely large string values (think 100+ MB of XML coming in from SQL), the rope data structure could keep the entire process off the large object heap, which a string object hits once it passes 85,000 bytes.
If you're looking at strings of 5-1000 characters, it probably doesn't improve the performance enough to be worth it. This is another case of a data structure designed for the 5% of people who have an extreme situation.
The 10th ICFP Programming Contest relied, basically, on people using the rope data structure for efficient solving. That was the big trick to get a VM that ran in reasonable time.
Rope is excellent if there is a lot of prefixing (apparently the word "prepending" is made up by IT folks and isn't a proper word!) and potentially better for insertions; StringBuilders use contiguous memory, so they only work efficiently for appending.
Therefore, StringBuilder is great for building strings by appending fragments - a very normal use-case. As developers need to do this a lot, StringBuilders are a very mainstream technology.
Ropes are great for edit buffers, e.g. the data structure behind, say, an enterprise-strength TextArea. So a relaxation of ropes (e.g. a linked list of lines rather than a binary tree) is very common in the UI controls world, but that's not often exposed to the developers and users of those controls.
You need really, really big amounts of data and churn to make the rope pay off - processors are very good at stream operations, and if you have the RAM then simply reallocating for prefixing works acceptably for normal use-cases. That competition mentioned above was the only time I've seen it needed.
Most advanced text editors represent the text body as a "kind of rope" (though in implementation, leaves aren't usually individual characters but text runs), mainly to speed up the frequent inserts and deletes on large texts.
Generally, StringBuilder is optimized for appending and tries to minimize the total number of reallocations without overallocating too much. The typical guarantee is log2 N allocations and less than 2.5x the memory. Normally the string is built once and may then be used for quite a while without being modified.
Rope is optimized for frequent inserts and removals, and tries to minimize the amount of data copied (at the cost of a larger number of allocations). In a linear buffer implementation, each insert and delete becomes O(N), and you usually have to handle single-character inserts.
Javascript VMs often use ropes for strings.
Maxime Chevalier-Boisvert, developer of the Higgs Javascript VM, says:
In JavaScript, you can use arrays of strings and eventually Array.prototype.join to make string concatenation reasonably fast, O(n), but the "natural" way JS programmers tend to build strings is to just append using the += operator to incrementally build them. JS strings are immutable, so if this isn't optimized internally, incremental appending is O(n²). I think it's probable that ropes were implemented in JS engines specifically because of the SunSpider benchmarks which do string appending. JS engine implementers used ropes to gain an edge over others by making something that was previously slow faster. If it wasn't for those benchmarks, I think that cries from the community about string appending performing poorly may have been met with "use Array.prototype.join, dummy!".
