I want to run three threads: (i) one should append strings to a file, (ii) a second thread should remove special characters from the written stream, and (iii) a third thread should sort the words in ascending order. How can I do this in a thread-safe (synchronized) manner?
I mean
Thread 1
sample.txt
Apple
Ma#22ngo
G#ra&&pes
Thread 2
(after removing special characters sample.txt)
Apple
Mango
Grapes
Thread 3 (sample.txt)
Apple
Grapes
Mango
Why do you want to do this using several threads? Perhaps your example has been oversimplified, but if you can avoid threading then do so - and it would appear that this problem doesn't need it.
Do you have lots of files, or one really large file, or something else? Why can't you simply perform the three actions one after another?
UPDATE
I felt I should at least try and help you solve the underlying problem you're facing. I think you need to consider the task you're looking at as a pipeline. You start with strings, you remove special characters (clean up), you write them to file. When you've finally written all the strings you need to sort them.
Everything up to the sorting stage can, and should, be done by a single thread. Read string, clean it, write it to file, move to next string. The final task of sorting can't easily happen until all the strings are written cleanly to the file.
If you have many files to write and sort then each of these can be dealt with by a separate thread, but I would avoid involving multiple threads in the processing of any one given file.
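For illustration, a minimal single-threaded sketch of that read-clean-write-then-sort pipeline (the regex and file name are assumptions based on the sample data in the question):

    using System;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Pipeline
    {
        static void Main()
        {
            // Hypothetical input; in practice the strings come from wherever Thread 1 gets them.
            var input = new[] { "Apple", "Ma#22ngo", "G#ra&&pes" };

            // Clean each string and write it to the file in a single pass.
            var cleaned = input.Select(s => Regex.Replace(s, "[^A-Za-z]", ""));
            File.WriteAllLines("sample.txt", cleaned);

            // Only after everything is written can the file be sorted as a whole.
            var sorted = File.ReadAllLines("sample.txt").OrderBy(s => s, StringComparer.Ordinal);
            File.WriteAllLines("sample.txt", sorted);
        }
    }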
I would perform operation 1 and 2 in the same thread by removing special characters before writing to the file. Operation 3 cannot be run in parallel with others because while the file is being written you cannot read it and sort. So basically these operations are sequential and it makes no sense to put them into separate threads.
You should implement a thread-safe queue (or use the one that comes with the Parallel Extensions).
Threads have the problem of sharing information, so even if in theory your scenario looks perfectly parallel, in reality you can't just let the threads access the shared data freely; when you do, bad things happen.
Instead you can use a synchronization mechanism, in this case a ConcurrentQueue. That way you'll have this:
ConcurrentQueue<string> queueStrings;
ConcurrentQueue<string> queueFile;
Thread1 inserts strings into the queueStrings queue.
Thread2 reads strings from queueStrings, processes them, and then inserts them into the queueFile queue.
Finally, Thread3 reads the processed strings from queueFile and writes them to the file.
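A rough sketch of that three-thread arrangement (the termination flags and regex are assumptions, and the busy-waiting is kept only for brevity; a BlockingCollection would avoid the spinning):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class QueuePipeline
    {
        static readonly ConcurrentQueue<string> queueStrings = new ConcurrentQueue<string>();
        static readonly ConcurrentQueue<string> queueFile = new ConcurrentQueue<string>();
        static volatile bool producingDone, cleaningDone;

        static void Main()
        {
            // Thread1: produce the raw strings.
            var t1 = Task.Run(() =>
            {
                foreach (var s in new[] { "Apple", "Ma#22ngo", "G#ra&&pes" })
                    queueStrings.Enqueue(s);
                producingDone = true;
            });

            // Thread2: clean strings and pass them on.
            var t2 = Task.Run(() =>
            {
                string s;
                while (!producingDone || !queueStrings.IsEmpty)
                    if (queueStrings.TryDequeue(out s))
                        queueFile.Enqueue(Regex.Replace(s, "[^A-Za-z]", ""));
                cleaningDone = true;
            });

            // Thread3: write the cleaned strings to the file.
            var t3 = Task.Run(() =>
            {
                using (var writer = new StreamWriter("sample.txt"))
                {
                    string s;
                    while (!cleaningDone || !queueFile.IsEmpty)
                        if (queueFile.TryDequeue(out s))
                            writer.WriteLine(s);
                }
            });

            Task.WaitAll(t1, t2, t3);
        }
    }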
Related
Hi, I am trying to mimic multi-threading with a Parallel.ForEach loop. Below is my function:
public void PollOnServiceStart()
{
    constants = new ConstantsUtil();
    constants.InitializeConfiguration();

    HashSet<string> newFiles = new HashSet<string>();

    //string serviceName = MetadataDbContext.GetServiceName();
    var dequeuedItems = MetadataDbContext
        .UpdateOdfsServiceEntriesForProcessingOnStart();
    var handlers = Producer.GetParserHandlers(dequeuedItems);

    while (handlers.Any())
    {
        Parallel.ForEach(handlers,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            handler =>
            {
                Logger.Info($"Started processing a file remaining in Parallel ForEach");
                handler.Execute();
                Logger.Info($"Enqueing one file for next process");
                dequeuedItems = MetadataDbContext
                    .UpdateOdfsServiceEntriesForProcessingOnPollInterval(1);
                handlers = Producer.GetParserHandlers(dequeuedItems);
            });

        int filesRemovedCount = Producer.RemoveTransferredFiles();
        Logger.Info($"{filesRemovedCount} files removed from {Constants.OUTPUT_FOLDER}");
    }
}
So, to explain what's going on: the function UpdateOdfsServiceEntriesForProcessingOnStart() gets 4 file names (4 because of the parallel count) and adds them to a thread-safe object called ParserHandler. These objects are then put into the list var handlers.
My idea here is to loop through this handler list and call the handler.Execute().
Handler.Execute() copies files from the network location onto a local drive, parses through the file and creates multiple output files, then sends said files to a network location and updates a DB table.
What I expect in this Parallel.ForEach loop is that after the handler.Execute() call, the UpdateOdfsServiceEntriesForProcessingOnPollInterval(1) function will add a new file name, from the DB table it reads, to the dequeued items container, which will then be passed as one item to the recreated handler list. In this way, after one file is done executing, a new file will take its place in each parallel loop.
However, what happens is that while I do get a new file added, it doesn't get executed by the next available thread. Instead, the Parallel.ForEach has to finish executing the first 4 files, and only then will it pick up the very next file. Meaning, after the first 4 are run in parallel, only 1 file is run at a time, thereby nullifying the whole point of the parallel looping. The files added before all 4 initial files finish the Execute() call are never executed.
IE:
(Start1, Start2, Start3, Start4) all at once. What should happen is something like (End2, Start5), then (End3, Start6). But what is happening is (End2, End3, End1, End4), then Start5, End5, Start6, End6.
Why is this happening?
Because we want to deploy multiple instances of this service app on a machine, it is not beneficial to have a giant list waiting in a queue. This is wasteful, as the other app instances won't be able to process anything.
I am writing what should be a long comment as an answer, although it's an awful answer because it doesn't answer the question.
Be aware that parallelizing filesystem operations is unlikely to make them faster, especially if the storage is a classic hard disk. The head of the disk cannot be in N places at the same moment, and if you tell it to be, it will just waste most of its time traveling instead of reading or writing.
The best way to overcome the bottleneck imposed by accessing the filesystem is to make sure that there is work for the disk to do at all moments. Don't stop the disk's work to make a computation or to fetch/save data from/to the database. To make this happen you must have multiple workflows running concurrently: one workflow does nothing but I/O with the disk, another talks continuously with the database, a third utilizes the CPU by doing one calculation after the other, and so on. This approach is called task parallelism (doing heterogeneous work in parallel), as opposed to data parallelism (doing homogeneous work in parallel, the speciality of Parallel.ForEach). It is also called pipelining, because in order to make all workflows run concurrently you must place intermediate buffers between them, so you create a pipeline with the data flowing from buffer to buffer. Another term used for this kind of operation is the producer-consumer pattern, which describes a short pipeline consisting of only two building blocks, the first being the producer and the second the consumer.
The most powerful tool currently available¹ for creating pipelines is the TPL Dataflow library. It offers a variety of "blocks" (pipeline segments) that can be linked with each other and can cover most scenarios. What you do is instantiate the blocks that will compose your pipeline, configure them, tell each one what work it should do, link them together, feed the first block with the initial raw data that should be processed, and finally await the Completion of the last block. You can look at an example of using the TPL Dataflow library here.
¹ Available as built-in library in the .NET platform. Powerful third-party tools also exist, like the Akka.NET for example.
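As a rough sketch of what such a pipeline could look like for the copy/parse/save workflow described in the question (the block bodies, degrees of parallelism, and helper names below are placeholders, not the actual ParserHandler logic):

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class DataflowPipelineSketch
    {
        static async Task Main()
        {
            // Stage 1: disk-bound work (copy the file locally). Placeholder body.
            var copyBlock = new TransformBlock<string, string>(
                remotePath => CopyToLocal(remotePath),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

            // Stage 2: CPU-bound work (parse the local file). Placeholder body.
            var parseBlock = new TransformBlock<string, string>(
                localPath => Parse(localPath),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

            // Stage 3: database-bound work (record the result). Placeholder body.
            var saveBlock = new ActionBlock<string>(result => SaveToDb(result));

            // Link the blocks so data and completion flow through the pipeline.
            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
            copyBlock.LinkTo(parseBlock, linkOptions);
            parseBlock.LinkTo(saveBlock, linkOptions);

            foreach (var file in new[] { "file1", "file2", "file3" })
                copyBlock.Post(file);

            copyBlock.Complete();
            await saveBlock.Completion;
        }

        // Hypothetical helpers standing in for the real work.
        static string CopyToLocal(string path) => path;
        static string Parse(string path) => path;
        static void SaveToDb(string result) { }
    }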
I have a desktop application that validates certain CSV files.
I'm given this CSV file which I need to parse and validate against multiple business rules. Those business rules can apply to each record individually, or they can check constraints that span all of the records in the file. The file is almost 800k records long.
Here's how I approach the problem currently:
I upload the CSV file and convert every line to a custom object (a for loop is used here) that I end up storing in a list. This step usually takes 3 to 6 seconds, so I don't consider it a problem.
I pass the list to a validator class, which thanks to StructureMap gets all the business rules as separate classes.
I iterate through the business rules. My first business rule throws an exception like this:
The CLR has been unable to transition from COM context 0xa4234fc8 to COM context 0xa42350f0 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.
I understand that this can be hidden, but I don't want to hide the error; I want to understand what I can do to make the code more efficient. I have already eliminated all thrown exceptions in the code, and it does work better.
For each record I run the following code inside a business rule:
// Find all properties on the record marked as mandatory via the custom attribute.
var mandatoryFields = GetFieldsWithAttribute<MandaroryFieldAttribute>(package);
foreach (var field in mandatoryFields)
{
    // A non-empty value satisfies the rule, so move on.
    var fieldValue = field.GetValue(package, null).ToString();
    if (!string.IsNullOrWhiteSpace(fieldValue))
        continue;

    var errorMessage = GetErrorMessage(package.RowNumber, field.Name,
        field.GetAttributeForPackage<CsvFieldNameAttribute>().Name);

    // Avoid reporting the same failure twice.
    if (FailedResults.Contains(errorMessage))
        continue;

    FailedResults.Add(errorMessage);
}
Since there are a lot of fields - I decided to validate fields using custom attributes, to make the process more generic. System.Reflection is used inside the two extension methods: GetAttributeForPackage and GetFieldsWithAttribute.
Write a report summarizing the validation into a text file.
The problem, as I see it, is that I have to iterate over every single record, and for some rules over all the records at once.
I do not have experience parsing large amounts of data. Can anyone suggest an approach for how to handle this?
There are a few things that may help you:
Since you have a large file, I suggest you use memory-mapped files. These enable programmers to work with extremely large files.
Since you have a lot of records to validate, you could consider using threading or parallel programming (Tasks). This way the execution will be faster (see the sketch below).
I guess you are using StreamReader.ReadLine to read every line.
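As a hedged sketch of the threading/Tasks suggestion, assuming the records are already loaded into a list and each rule reports failures into a thread-safe collection (all type and method names below are placeholders):

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ParallelValidationSketch
    {
        // Thread-safe container so rules running in parallel can report failures safely.
        static readonly ConcurrentBag<string> failedResults = new ConcurrentBag<string>();

        static void ValidateAll(IReadOnlyList<CsvRecord> records, IReadOnlyList<IBusinessRule> rules)
        {
            // Per-record rules are independent of each other,
            // so the records can be partitioned across cores.
            Parallel.ForEach(records, record =>
            {
                foreach (var rule in rules)
                {
                    var error = rule.Validate(record);   // hypothetical rule API
                    if (error != null)
                        failedResults.Add(error);
                }
            });
        }
    }

    // Hypothetical shapes for the record and rule types used above.
    class CsvRecord { public int RowNumber; }
    interface IBusinessRule { string Validate(CsvRecord record); }

Rules that span the whole file would still run after this parallel pass, since they need to see every record.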
The Problem
I have a constant flow of data packages. Every time a new package comes in (at a 100 ms interval), an event fires in which I send the package to a class that processes the data and visualizes it. Unfortunately, the processing of a package can be aborted if a new package is sent before the current one has been processed.
What I have now
Is code that is used as a DLL by another program.
When I started coding the DLL I was new to C#, so I didn't want to make it too complicated. Everything is working, but I faced some ugly frame skips (in my visualization part) when the CPU is very busy.
I have several classes and one of them handles all packages. This class has roughly 50 attributes, 25 functions, and 1000 lines of code; only 6 functions are needed for the calculations.
The rest is there to set the attributes correctly (if the user changes settings).
What I need to change
So now I want to buffer all incoming data by using a List.
The list should be handled by another thread, so it is very unlikely that writing the data to the list takes longer than 100 ms ^^ (2 arrays with ca. 40 elements each, which should amount to next to nothing).
What I have in mind
Splitting the mentioned class into 2 separate classes.
One that handles the packages and one that handles the settings. So I would split the user and the "program" input.
Creating a thread that uses the "package handling" class to process the buffered data.
What I have no clue about
Since the settings class contains important attributes that are needed by the handling class, I don't know how best to do this, because the handling class also needs to change/fill buffers in the settings class, but that one will be invoked by the main thread. Or is it better not to split the settings and handling classes and to leave it as it is? I am not so familiar with threading and have read the first chapter of this free e-book: Threading in C#.
I would just add a queue and implement a thread for the processing. That will help you with the skips and requires little change. Refactoring the code by splitting settings apart seems like a lot of work with little benefit and possible new bugs.
To add the queue;
Create a ConcurrentQueue (this is a thread-safe FIFO collection, which is what you need)
var cq = new ConcurrentQueue<Packets>();
Add all your datapackets to that Queue
cq.Enqueue(newPacket);
Create another thread that loops around and processes the Queue
if (cq.TryDequeue(out newPacket))
{
    // Visualize new packet
}
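Putting those pieces together, a minimal sketch of the buffering class might look like this (the Packet type, the visualization call, and the stop flag are assumptions):

    using System.Collections.Concurrent;
    using System.Threading;

    class PacketBuffer
    {
        private readonly ConcurrentQueue<Packet> cq = new ConcurrentQueue<Packet>();
        private volatile bool running = true;

        // Called from the event handler every ~100 ms; returns immediately.
        public void OnPacketReceived(Packet newPacket) => cq.Enqueue(newPacket);

        // Runs on its own background thread and drains the queue.
        public void ProcessLoop()
        {
            while (running)
            {
                if (cq.TryDequeue(out var packet))
                {
                    // Visualize / process the packet here.
                }
                else
                {
                    Thread.Sleep(1); // nothing queued; avoid spinning at 100% CPU
                }
            }
        }

        public void Stop() => running = false;
    }

    // Hypothetical packet type standing in for the real data package.
    class Packet { }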
Scenario:
Data is received and written to database with timestamps. I need to process the raw data in the order that is received based on the time stamp and write it back to the database, different table, again maintaining the order based on the timestamp.
I came up with the following design: I created two queues, one for storing raw data from the database and another for storing processed data before it's written back to the DB. I have two threads, one feeding the Initial queue and another reading from the Result queue. In between I spawn multiple threads to process data from the Initial queue and write it to the Result queue.
I have experimented with SortedList (manual locking) and BlockingCollection. I have used two approaches to process in parallel: Parallel.For(ForEach) and Task.Factory.StartNew.
Each unit of data may take variable amount of time to process, based on several factors. One thread can still be processing the first data point while other threads are done with three or four datapoints each, messing up the timestamp order.
I found out about OrderablePartitioner recently and thought it would solve the problem, but following MSDN's example I can see that it does not sort the underlying collection either. Maybe I need to implement a custom partitioner to order my collection of complex data types? Or maybe there's a better way of approaching the problem?
Any suggestions and/or links to articles discussing a similar problem are highly appreciated.
Personally, I would at least try to start with using a BlockingCollection<T> for the input and a ConcurrentQueue<T> instance for the results.
I would use Parallel LINQ to process the results. In order to preserve the order during your processing, you could use AsOrdered() on the PLINQ statement.
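A minimal sketch of that combination (the item types and the Transform method are placeholders):

    using System.Collections.Concurrent;
    using System.Linq;

    class OrderedProcessingSketch
    {
        static void Process(BlockingCollection<RawItem> input, ConcurrentQueue<Result> output)
        {
            // AsOrdered() makes PLINQ yield results in the same order as the source,
            // even though individual items are processed on different threads.
            var results = input.GetConsumingEnumerable()
                               .AsParallel()
                               .AsOrdered()
                               .Select(item => Transform(item));   // hypothetical per-item work

            // Enumeration happens in source (timestamp) order, so enqueueing here
            // preserves the ordering for the writer thread.
            foreach (var result in results)
                output.Enqueue(result);
        }

        static Result Transform(RawItem item) => new Result();

        // Hypothetical data types standing in for the real records.
        class RawItem { }
        class Result { }
    }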
Have you considered PLINQ and AsOrdered()? It might be helpful for what you're trying to achieve.
http://msdn.microsoft.com/en-us/library/dd460719.aspx
Maybe you've considered these things, but...
Why not just pass the timestamp to the database and then either let the database do the ordering or fix the ordering in the database after all processing threads have returned? Do the sql statements have to be executed sequentially?
PLINQ is great but I would try to avoid thread synchronization requirements and simply pass more ordering data to the database if you can.
How do I write to a text file that can be accessed by multiple sources (possibly concurrently), ensuring that no write operation gets lost?
Like, if two different processes are writing to the file at the same moment, this can lead to problems. The simplest solution (not very fast and not very elegant) would be locking the file when beginning the process (create a .lock file or similar) and releasing it (delete the lock) when the writing is done.
Before beginning to write, I would check whether the .lock file exists and delay the writing until the file is released.
What is the recommended pattern to follow for this kind of situation?
Thanks
EDIT
I mean processes, like different programs from different clients, different users and so on, not threads within the same program
Consider using a simple database. You will get all this built-in safety relatively easy.
The fastest way of synchronizing access between processes is to use mutexes/semaphores. This thread answers how to use them to simulate a reader-writer lock pattern:
Is there a global named reader/writer lock?
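For the simple one-writer-at-a-time case across processes, a minimal sketch using a named (system-wide) Mutex might look like this (the mutex name and file path are placeholders):

    using System;
    using System.IO;
    using System.Threading;

    class CrossProcessWriter
    {
        static void AppendLine(string line)
        {
            // The same name must be used by every process that writes to the file.
            using (var mutex = new Mutex(false, @"Global\MyAppSharedLogFile"))
            {
                mutex.WaitOne();            // block until no other process holds the mutex
                try
                {
                    File.AppendAllText(@"C:\temp\shared.log", line + Environment.NewLine);
                }
                finally
                {
                    mutex.ReleaseMutex();   // always release, even if the write throws
                }
            }
        }
    }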
I suggest using the ReaderWriterLock. It's designed for multiple readers but ensures only a single writer can write data at any one time (MSDN).
I would look at something like the Command Pattern for doing this. Concurrent writing would be a nightmare for data integrity.
Essentially you use this pattern to queue your write commands so that they are done in order of request.
You should also use the ReaderWriterLock to ensure that you can read from the file while writing occurs. This would be a second line of defense behind the command pattern so that only one thread could write to the file at a given time.
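If all readers and the writer live in the same process, a ReaderWriterLockSlim sketch could look like the following (for coordination across separate processes you would still need a named Mutex or file locking; the path and member names are placeholders):

    using System.IO;
    using System.Threading;

    class GuardedFile
    {
        private static readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
        private const string Path = @"C:\temp\shared.log";   // placeholder path

        public static string[] ReadAll()
        {
            rwLock.EnterReadLock();          // many readers may hold this at once
            try { return File.ReadAllLines(Path); }
            finally { rwLock.ExitReadLock(); }
        }

        public static void Append(string line)
        {
            rwLock.EnterWriteLock();         // exclusive: no readers or other writers
            try { File.AppendAllText(Path, line + System.Environment.NewLine); }
            finally { rwLock.ExitWriteLock(); }
        }
    }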
You can try lock too. It's easy - "lock ensures that one thread does not enter a critical section while another thread is in the critical section of code. If another thread attempts to enter a locked code, it will wait (block) until the object is released."
http://msdn.microsoft.com/en-us/library/c5kehkcz%28VS.71%29.aspx
I would also recommend you look for examples of having multiple readers and only 1 writer in a critical section; here's a short paper with a good solution: http://arxiv.org/PS_cache/cs/pdf/0303/0303005v1.pdf
Alternatively, you could look at creating a copy of the file each time it is requested, and when it comes time to write any changes back you merge them with the original file.