Open thread in foreach loop - c#

I am getting an XML feed and I parse it the my MQ server, then I have a service that listen to the MQ server and reading all its messages.
I have a foreach loop that opens a new thread each iteration, in order to make the parsing faster, cause there are around 500 messages in the MQ (means there are 500 XMLs)
foreach (System.Messaging.Message m in msgs)
{
byte[] bytes = new byte[m.BodyStream.Length];
m.BodyStream.Read(bytes, 0, (int)m.BodyStream.Length);
System.Text.ASCIIEncoding ascii = new System.Text.ASCIIEncoding();
ParserClass tst = new ParserClass(ascii.GetString(bytes, 0, (int)m.BodyStream.Length));
new Thread( new ThreadStart(tst.ProcessXML)).Start();
}
In the ParserClass I have this code:
private static object thLockMe = new object();
public string xmlString { get; set; }
public ParserClass(string xmlStringObj)
{
this.xmlString = xmlStringObj;
}
public void ProcessXML()
{
lock (thLockMe)
{
XDocument reader = XDocument.Parse(xmlString);
//Some more code...
}
}
The problem is, when I run this foreach loop with 1 thread only, it works perfect, but slow.
When I run it with more then 1 thread, I get an error "Object reference not set to an instance of an object".
I guess there is something wrong with my locking since I am not very experienced with threading.
I am kinda hopeless, hope you can help!
Cheers!

I note that you are running a bunch of threads with their entire code wrapped inside a lock statement. You might as well run the methods in a sequence this way, because you are not getting any parallelism.

Since you are creating a new ParserClass instance on every iteration of your loop, and also creating and starting a new thread every iteration, you do not need a lock in your ParseXML method.
Your object on which you lock is currently static, so it is not instance bound, which means, once one thread is inside your ParseXML method, no other will be able to do anything, until the first has finished.
You are not sharing any data (from the code I can see) in your Parser class amongst threads, so you don't need a lock, inside your ParseXML function.
If you are using data that is shared between threads, then you should have a lock.
If you're going to be using lots of threads, then you're better of using a ThreadPool, and taking a finite (4 perhaps) from your pool, assigning them some work, and recycling them for the next 4 tasks.
Creating threads is an expensive operation, which requires a call into the OS kernel, so you do not want to do that 500 times. This is too costly. Also, the min reserved memory for a threadstack in Windows is 1MB, so that is 500MB in stackspace alone for your threads.
An optimal number of threads should be equal to the number of cores in your machine, however since that's not real for most purposes, you can do double or triple that, but then you're better off with a threadpool, where you recycle threads, instead of creating new one's all the time.

Even though this probably won't solve your problem, instead of creating 500 simultaneous threads you should just use the ThreadPool, which manages threads in a much more efficient way:
foreach (System.Messaging.Message m in msgs)
{
byte[] bytes = new byte[m.BodyStream.Length];
m.BodyStream.Read(bytes, 0, (int)m.BodyStream.Length);
System.Text.ASCIIEncoding ascii = new System.Text.ASCIIEncoding();
ParserClass tst = new ParserClass(ascii.GetString(bytes, 0, (int)m.BodyStream.Length));
ThreadPool.QueueUserWorkItem(x => tst.ProcessXML());
}
And to make sure they run as simultaneously as possible change your code in the ParserClass like this (assuming you indeed have resources you share between threads - if you don't have any, you don't have to lock at all):
private static object thLockMe = new object();
public string XmlString { get; set; }
public ParserClass(string xmlString)
{
XmlString = xmlString;
}
public void ProcessXML()
{
XDocument reader = XDocument.Parse(xmlString);
// some more code which doesn't need to access the shared resource
lock (thLockMe)
{
// the necessary code to access the shared resource (and only that)
}
// more code
}
Regarding your actual question:
Instead of calling OddService.InsertEvent(...) multiple times with the same parameters (that method reeks of remote calls and side effects...) you should call it once, store the result in a variable and do all subsequent operations on that variable. That way you can also conveniently check if it's not that precise method which returns null sometimes (when accessed simultaneously?).
Edit:
Does it work if you put all calls to OddService.* in lock blocks?

Related

How to periodically flush c# FileStream to the disk?

Context:
I am implementing a logging mechanism for a Web API project that writes serialized objects to a file from multiple methods which in turn is read by an external process (nxLog to be more accurate). The application is hosted on IIS and uses 18 worker processes. The App pool is recycled once a day. The expected load on the services that will incorporate the logging methods is 10,000 req/s. In short this is a classic produces/consumer problem with multiple producers (the methods that produce logs) and one consumer (the external process who reads from the log files). Update: Each process uses multiple threads as well.
I used BlockingCollection to store data (and solve the race condition) and a long running task that writes the data from the collection to the disk.
To write to the disk I am using a StreamWriter and a FileStream.
Because the write frequency is almost constant ( as I said 10,000 write per second) I decided to keep the streams open for the entire lifetime of the application pool and periodically write logs to the disk. I rely on the App Pool recycle and my DI framework to dispose my logger daily. Also note that this class will be singleton, because I didn't want to have more than one thread dedicated to writing from my thread pool.
Apparently the FileStream object will not write to the disk until it is disposed. Now I don't want the FileStream to wait for an entire day until it writes to the disk. The memory it will require to hold all that serialized object will be tremendous, not to mention that any crash on the application or the server will cause data loss or corrupted file.
Now my question:
How can I have the underlying streams (FileStream and StreamWriter) write to the disk periodically without disposing them? My initial assumption was that it will write to the disk once FileSteam exceeds its buffer size which is 4K by default.
UPDATE: The inconsistencies mentioned in the answer have been fixed.
Code:
public class EventLogger: IDisposable, ILogger
{
private readonly BlockingCollection<List<string>> _queue;
private readonly Task _consumerTask;
private FileStream _fs;
private StreamWriter _sw;
public EventLogger()
{
OpenFile();
_queue = new BlockingCollection<List<string>>(50);
_consumerTask = Task.Factory.StartNew(Write, CancellationToken.None, TaskCreationOptions.LongRunning, TaskScheduler.Default);
}
private void OpenFile()
{
_fs?.Dispose();
_sw?.Dispose();
_logFilePath = $"D:\Log\log{DateTime.Now.ToString(yyyyMMdd)}{System.Diagnostic.Process.GetCurrentProcess().Id}.txt";
_fs = new FileStream(_logFilePath, FileMode.Append, FileAccess.Write, FileShare.ReadWrite);
_sw = new StreamWriter(_fs);
}
public void Dispose()
{
_queue?.CompleteAdding();
_consumerTask?.Wait();
_sw?.Dispose();
_fs?.Dispose();
_queue?.Dispose();
}
public void Log(List<string> list)
{
try
{
_queue.TryAdd(list, 100);
}
catch (Exception e)
{
LogError(LogLevel.Error, e);
}
}
private void Write()
{
foreach (List<string> items in _queue.GetConsumingEnumerable())
{
items.ForEach(item =>
{
_sw?.WriteLine(item);
});
}
}
}
There are a few "inconsistencies" with your question.
The application is hosted on IIS and uses 18 worker processes
.
_logFilePath = $"D:\Log\log{DateTime.Now.ToString(yyyyMMdd)}{System.Diagnostic.Process.GetCurrentProcess().Id}.txt";
.
writes serialized objects to a file from multiple methods
Putting all of this together, you seem to have a single threaded situation as opposed to a multi-threaded one. And since there is a separate log per process, there is no contention problem or need for synchronization. What I mean to say is, I don't see why the BlockingCollection is needed at all. It's possible that you forgot to mention that there are multiple threads within your web process. I will make that assumption here.
Another problems is that your code does not compile
class name is Logger but the EventLogger function looks like a constructor.
some more incorrect syntax with string, etc
Putting all that aside, if you really have a contention situation and want to write to the same log via multiple threads or processes, your class seems to have most of what you need. I have modified your class to do some more things. Chief to note are the below items
Fixed all the syntax errors making assumptions
Added a timer, which will call the flush periodically. This will need a lock object so as to not interrupt the write operation
Used an explicit buffer size in the StreamWriter constructor. You should heuristically determine what size works best for you. Also, you should disable AutoFlush from StreamWriter so you can have your writes hit the buffer instead of the file, providing better performance.
Below is the code with the changes
public class EventLogger : IDisposable, ILogger {
private readonly BlockingCollection<List<string>> _queue;
private readonly Task _consumerTask;
private FileStream _fs;
private StreamWriter _sw;
private System.Timers.Timer _timer;
private object streamLock = new object();
private const int MAX_BUFFER = 16 * 1024; // 16K
private const int FLUSH_INTERVAL = 10 * 1000; // 10 seconds
public EventLogger() {
OpenFile();
_queue = new BlockingCollection<List<string>>(50);
_consumerTask = Task.Factory.StartNew(Write, CancellationToken.None, TaskCreationOptions.LongRunning, TaskScheduler.Default);
}
void SetupFlushTimer() {
_timer = new System.Timers.Timer(FLUSH_INTERVAL);
_timer.AutoReset = true;
_timer.Elapsed += TimedFlush;
}
void TimedFlush(Object source, System.Timers.ElapsedEventArgs e) {
_sw?.Flush();
}
private void OpenFile() {
_fs?.Dispose();
_sw?.Dispose();
var _logFilePath = $"D:\\Log\\log{DateTime.Now.ToString("yyyyMMdd")}{System.Diagnostics.Process.GetCurrentProcess().Id}.txt";
_fs = new FileStream(_logFilePath, FileMode.Append, FileAccess.Write, FileShare.ReadWrite);
_sw = new StreamWriter(_fs, Encoding.Default, MAX_BUFFER); // TODO: use the correct encoding here
_sw.AutoFlush = false;
}
public void Dispose() {
_timer.Elapsed -= TimedFlush;
_timer.Dispose();
_queue?.CompleteAdding();
_consumerTask?.Wait();
_sw?.Dispose();
_fs?.Dispose();
_queue?.Dispose();
}
public void Log(List<string> list) {
try {
_queue.TryAdd(list, 100);
} catch (Exception e) {
LogError(LogLevel.Error, e);
}
}
private void Write() {
foreach (List<string> items in _queue.GetConsumingEnumerable()) {
lock (streamLock) {
items.ForEach(item => {
_sw?.WriteLine(item);
});
}
}
}
}
EDIT:
There are 4 factors controlling the performance of this mechanism, and it is important to understand their relationship. Below example will hopefully make it clear
Let's say
average size of List<string> is 50 Bytes
Calls/sec is 10,000
MAX_BUFFER is 1024 * 1024 Bytes (1 Meg)
You are producing 500,000 Bytes of data per second, so a 1 Meg buffer can hold only 2 seconds worth of data. i.e. Even if FLUSH_INTERVAL is set to 10 seconds the buffer will AutoFlush every 2 seconds (on an average) when it runs out of buffer space.
Also remember that increasing the MAX_BUFFER blindly will not help, since the actual flush operation will take longer due to the bigger buffer size.
The main thing to understand is that when there is a difference in incoming data rates (to your EventLog class) and outgoing data rates (to the disk), you will either need an infinite sized buffer (assuming continuously running process) or you will have to slow down your avg. incoming rate to match avg. outgoing rate
Maybe my answer won't address your concrete concern, but I believe that your scenario could be a good use case for memory-mapped files.
Persisted files are memory-mapped files that are associated with a
source file on a disk. When the last process has finished working with
the file, the data is saved to the source file on the disk. These
memory-mapped files are suitable for working with extremely large
source files.
This could be very interesting because you'll be able to do logging from different processes (i.e. IIS worker processes) without locking issues. See MemoryMappedFile.OpenExisting method.
Also, you can log to a non-persistent shared memory-mapped file and, using a task scheduler or a Windows service, you can take pending logs to their final destination using a persistable memory-mapped file.
I see a lot of potential on using this approach because of your multi/inter-process scenario.
Approach #2
If you don't want to re-invent the wheel, I would go for a reliable message queue like MSMQ (very basic, but still useful in your scenario) or RabbitMQ. Enqueue logs in persistent queues, and a background process may consume these log queues to write logs to the file system.
This way, you can create log files once, twice a day, or whenever you want, and you're not tied to the file system when logging actions within your system.
Use the FileStream.Flush() method - you might do this after each call to .Write. It will clear buffers for the stream and causes any buffered data to be written to the file.
https://msdn.microsoft.com/en-us/library/2bw4h516(v=vs.110).aspx

Prevent Reading and Writing to a File at the Same Time

I have a process that needs to read and write to a file. The application has a specific order to its reads and writes and I want to preserve this order. What I would like to do is implement something that lets the first operation start and makes the second operation wait until the first is done with a first come first served like of queue to access the file. From what I have read file locking seems like it might be what I am looking for but I have not been able to find a very good example. Can anyone provide one?
Currently I am using a TextReader/Writer with .Synchronized but this is not doing what I hoped it would.
Sorry if this is a very basic question, threading gives me a headache :S
It should be as simple as this:
public static readonly object LockObj = new object();
public void AnOperation()
{
lock (LockObj)
{
using (var fs = File.Open("yourfile.bin"))
{
// do something with file
}
}
}
public void SomeOperation()
{
lock (LockObj)
{
using (var fs = File.Open("yourfile.bin"))
{
// do something else with file
}
}
}
Basically, define a lock object, then whenever you need to do something with your file, make sure you get a lock using the C# lock keyword. On reaching the lock statement, execution will block indefinitely until a lock has been obtained.
There are other constructs you can use for locking, but I find the lock keyword to be the most straightforward.
If you're using a current version of the .Net Framework, you can benefit from Task.ContinueWith.
If your units of work are logically always, "read some, then write some", the following expresses that intent succinctly and should scale:
string path = "file.dat";
// Start a reader task
var task = Task.Factory.StartNew(() => ReadFromFile(path));
// Continue with a writer task
task.ContinueWith(tt => WriteToFile(path));
// We're guaranteed that the read will occur before the write
// and that the write will occur once the read completes.
// We also can check the antecedent task's result (tt.Result in our
// example) for any special error logic we need.

C# Downloader: should I use Threads, BackgroundWorker or ThreadPool?

I'm writing a downloader in C# and stopped at the following problem: what kind of method should I use to parallelize my downloads and update my GUI?
In my first attempt, I used 4 Threads and at the completion of each of them I started another one: main problem was that my cpu goes 100% at each new thread start.
Googling around, I found the existence of BackgroundWorker and ThreadPool: stating that I want to update my GUI with the progress of each link that I'm downloading, what is the best solution?
1) Creating 4 different BackgroundWorker, attaching to each ProgressChanged event a Delegate to a function in my GUI to update the progress?
2) Use ThreadPool and setting max and min number of threads to the same value?
If I choose #2, when there are no more threads in the queue, does it stop the 4 working threads? Does it suspend them? Since I have to download different lists of links (20 links each of them) and move from one to another when one is completed, does the ThreadPool start and stop threads between each list?
If I want to change the number of working threads on live and decide to use ThreadPool, changing from 10 threads to 6, does it throw and exception and stop 4 random threads?
This is the only part that is giving me an headache.
I thank each of you in advance for your answers.
I would suggest using WebClient.DownloadFileAsync for this. You can have multiple downloads going, each raising the DownloadProgressChanged event as it goes along, and DownloadFileCompleted when done.
You can control the concurrency by using a queue with a semaphore or, if you're using .NET 4.0, a BlockingCollection. For example:
// Information used in callbacks.
class DownloadArgs
{
public readonly string Url;
public readonly string Filename;
public readonly WebClient Client;
public DownloadArgs(string u, string f, WebClient c)
{
Url = u;
Filename = f;
Client = c;
}
}
const int MaxClients = 4;
// create a queue that allows the max items
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>(MaxClients);
// queue of urls to be downloaded (unbounded)
Queue<string> UrlQueue = new Queue<string>();
// create four WebClient instances and put them into the queue
for (int i = 0; i < MaxClients; ++i)
{
var cli = new WebClient();
cli.DownloadProgressChanged += DownloadProgressChanged;
cli.DownloadFileCompleted += DownloadFileCompleted;
ClientQueue.Add(cli);
}
// Fill the UrlQueue here
// Now go until the UrlQueue is empty
while (UrlQueue.Count > 0)
{
WebClient cli = ClientQueue.Take(); // blocks if there is no client available
string url = UrlQueue.Dequeue();
string fname = CreateOutputFilename(url); // or however you get the output file name
cli.DownloadFileAsync(new Uri(url), fname,
new DownloadArgs(url, fname, cli));
}
void DownloadProgressChanged(object sender, DownloadProgressChangedEventArgs e)
{
DownloadArgs args = (DownloadArgs)e.UserState;
// Do status updates for this download
}
void DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
DownloadArgs args = (DownloadArgs)e.UserState;
// do whatever UI updates
// now put this client back into the queue
ClientQueue.Add(args.Client);
}
There's no need for explicitly managing threads or going to the TPL.
I think you should look into using the Task Parallel Library, which is new in .NET 4 and is designed for solving these types of problems
Having 100% cpu load has nothing to do with the download (as your network is practically always the bottleneck). I would say you have to check your logic how you wait for the download to complete.
Can you post some code of the thread's code you start multiple times?
By creating 4 different backgroundworkers you will be creating seperate threads that will no longer interfere with your GUI. Backgroundworkers are simple to implement and from what I understand will do exactly what you need them to do.
Personally I would do this and simply allow the others to not start until the previous one is finished. (Or maybe just one, and allow it to execute one method at a time in the correct order.)
FYI - Backgroundworker

Passing Data Between Threads

I have the following code, in which I’m trying to process a large amount of data, and update the UI. I’ve tried the same thing using a background worker, but I get a similar issue. The problem seems to be that I’m trying to use a class that was not instantiated on the new thread (the actual error is that the current thread doesn't "own" the instance). My question is, is there a way that I can pass this instance between threads to avoid this error?
DataInterfaceClass dataInterfaceClass = new DataInterfaceClass();
private void OutputData(List<MyResult> Data)
{
progressBar1.Maximum = Data.Count;
progressBar1.Minimum = 1;
progressBar1.Value = 1;
foreach (MyResult res in Data)
{
// Add data to listview
UpdateStatus("Processing", res.Name);
foreach (KeyValuePair<int, string> dets in res.Details)
{
ThreadPool.QueueUserWorkItem((o) =>
{
// Get large amount of data from DB based on key
// – gives error because DataInterfaceClass was
// created in different thread.
MyResult tmpResult = dataInterfaceClass
.GetInfo(dets.DataKey);
if (tmpResult == null)
{
// Updates listview
UpdateStatus("Could not get details",
dets.DataKey);
}
else
{
UpdateStatus("Got Details", dets.DataKey);
}
progressBar1.Dispatcher.BeginInvoke(
(Action)(() => progressBar1.Value++));
});
}
}
}
EDIT:
DataInterfaceClass is actually definated and created outside of the function that it is used in, but it is an instance and not static.
UPDATE:
You seem to have modified the posted source code, so...
You should create an instance of the DataInterfaceClass exclusively for each background thread or task. Provide your task with enough input to create its own instance.
That being said, if you try to access data in a single database in a highly parallel way, this might result in database timeouts. Even if you can get your data access to work in a multithreaded way, I would recommend limiting the number of simultaneous background tasks to prevent this from occurring.
You could use a Semaphore (or similar) to ensure that no more than a certain amount of tasks are running at the same time.
Create a global instance for DataInterfaceClass inside the class that has OutputData method defined, that way you would be able to use it within the method.
However, you would need to be cautious in using it. If all the threads would use the same instance to read from the database, it would result in errors.
You should either create a new instance of the DataInterfaceClass in each thread, or have some lock implemented inside your GetInfo method to avoid multiple access issues.

Safe way to start and use a new Thread?

In a previous question I asked how to improve a bit of code. It was said that I should move it to a new thread. I'd never thought about it before so it seems like a great idea to me. So this morning I went ahead and reused a bit of code I already have for processing emails and updated the way I handle image uploads into my site.
So is this a good way to start a new thread and process the images? Is there even a need to lock it like I am?
private static object dummy = new object();
public static void Save(int nProjId, byte[] bData)
{
var worker = new ThreadStart(() => ProcessImage(nProjId,bData));
var thread = new Thread(worker);
thread.Start();
}
private static void ProcessImage(int nProjId, byte[] bData)
{
lock (dummy)
{
try
{
byte[] xlargeImage = Thumbs.ResizeImageFile(bData, 700);
byte[] largeImage = Thumbs.ResizeImageFile(bData, 500);
//improved based on previous question to use the already reduced image
byte[] mediumImage = Thumbs.ResizeImageFile(xlargeImage, 200);
byte[] smallImage = Thumbs.ResizeImageFile(xlargeImage, 100);
//existing code to actually save the images
MyGlobals.GetDataAccessComponent().File_Save(
ConfigurationManager.ConnectionStrings["ImgStore"],
nProjId,
xlargeImage,
largeImage,
mediumImage,
smallImage);
}
catch (Exception)
{
//ToDo: add error handleing
{ }
throw;
}
}
}
Oh and the images now upload and process nearly instantly (locally) so it's a HUGE help so far. I just want to make sure it's the best way to do it. Oh and I'm using a dual core machine running Server 2008 with 6gb or ram, so I have a little wiggle room to make it faster or use more threads.
I would suggest using a ThreadPool class, specifically because it will re-use a thread for you rather than you creating a new thread each time, which is a little bit more intensive.
Check out the QueueUserWorkItem method.
Also if you are not using a static resource to write to (I am not sure what exactly File_Save does) I dont think there is a need for your lock. However if you are using a static resource then you should lock just the code that is using it.
Is this for any production code? Or just a sample ? If it is not for production code, apart from using ThreadPool, you can use TPL from .NET4.0 . MS recommends using TPL instead of ThreadPool.

Categories