I have written an application for my company that sends records from a text or CSV file to a web service in arrays of (typically) 100 records; the service returns responses, also in arrays of 100 records, which the application writes to another file. Currently it is single-threaded (processing sequentially using a BackgroundWorker in a Windows Forms app), but I am looking to multi-thread it using a ThreadPool and a ConcurrentQueue.
#region NonDataCollectionService
if (!userService.needsAllRecords)
{
ConcurrentQueue<Record[]> outputQueue = new ConcurrentQueue<Record[]>();
while ((!inputFile.checkForEnd()) && (!backgroundWorker1.CancellationPending))
{
//Get array of (typically) 100 records from input file
Record[] serviceInputRecord = inputFile.getRecords(userService.maxRecsPerRequest);
//Queue array to be processed by threadpool, send in output concurrentqueue to be filled by threads
ThreadPool.QueueUserWorkItem(new WaitCallback(sendToService), new object[]{serviceInputRecord, outputQueue});
}
}
#endregion
void sendToService(Object stateInfo)
{
//The following block is in progress, I basically need to create copies of my class that calls the service for each thread
IWS threadService = Activator.CreateInstance(userService.GetType()) as IWS;
threadService.errorStatus = userService.errorStatus;
threadService.inputColumns = userService.inputColumns;
threadService.outputColumns = userService.outputColumns;
threadService.serviceOptions = userService.serviceOptions;
threadService.userLicense = userService.userLicense;
object[] objectArray = stateInfo as object[];
//Send input records to service
threadService.sendToService((Record[])objectArray[0]);
//The line below returns correctly
Record[] test123 = threadService.outputRecords;
ConcurrentQueue<Record[]> threadQueue = objectArray[1] as ConcurrentQueue<Record[]>;
//threadQueue has records here
threadQueue.Enqueue(test123);
}
However, when I check "outputQueue" in the top block, it is empty, even when threadQueue has records queued. Is this because I'm passing by value instead of passing by reference? If so, how would I pass by reference syntactically with ThreadPool.QueueUserWorkItem?
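For what it's worth, ConcurrentQueue&lt;T&gt; is a reference type, so QueueUserWorkItem hands each worker a reference to the same queue object; the usual cause of seeing an empty queue is checking it before the work items have actually run. Below is a minimal sketch (with `int[]` standing in for `Record[]`) that waits for the workers before reading the queue:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class QueuePassingSketch
{
    static void Main()
    {
        var outputQueue = new ConcurrentQueue<int[]>();
        using (var done = new CountdownEvent(3))
        {
            for (int i = 0; i < 3; i++)
            {
                int[] batch = { i, i + 1 };
                // The queue is a reference type: the worker receives a reference
                // to the same object, so its Enqueue calls are visible to the caller.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    var args = (object[])state;
                    var queue = (ConcurrentQueue<int[]>)args[1];
                    queue.Enqueue((int[])args[0]);
                    done.Signal();
                }, new object[] { batch, outputQueue });
            }
            // Wait for all work items before reading the queue; checking it
            // before the workers finish is what makes it look "empty".
            done.Wait();
        }
        Console.WriteLine(outputQueue.Count); // prints 3
    }
}
```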
I have the following application for generating bank account numbers:
static void Main(string[] args)
{
string path = @"G:\BankNumbers";
var bans = BankAcoutNumbers.BANS;
const int MAX_FILES = 80;
const int BANS_PER_FILE = 81818182/80;
int bansCounter = 0;
var part = new List<int>();
var maxNumberOfFiles = 10;
Stopwatch timer = new Stopwatch();
var fileCounter = 0;
if (!Directory.Exists(path))
{
DirectoryInfo di = Directory.CreateDirectory(path);
}
try
{
while (fileCounter <= maxNumberOfFiles)
{
timer.Start();
foreach (var bank in BankAcoutNumbers.BANS)
{
part.Add(bank);
if (++bansCounter >= BANS_PER_FILE)
{
string fileName = string.Format("{0}-{1}", part[0], part[part.Count - 1]);
string outputToFile = "";// Accumulate lines here; otherwise you only get a single line in the file!!
Console.WriteLine("NR{0}", fileName);
string subString = System.IO.Path.Combine(path, "BankNumbers");//Needed, because otherwise the files will not be stored in the correct folder!!
fileName = subString + fileName;
foreach (var partBan in part)
{
Console.WriteLine(partBan);
outputToFile += partBan + Environment.NewLine;//Writing the lines to the file
}
System.IO.File.WriteAllText(fileName, outputToFile);//Writes to file system.
part.Clear();
bansCounter = 0;
//System.IO.File.WriteAllText(fileName, part.ToString());
if (++fileCounter >= MAX_FILES)
break;
}
}
}
timer.Stop();
Console.WriteLine(timer.Elapsed.Seconds);
}
catch (Exception)
{
throw;
}
System.Console.WriteLine("Press any key to exit.");
System.Console.ReadKey();
}
This generates 81 million bank account records separated over 80 files. But can I speed up the process with threading?
You're talking about speeding up a process whose bottleneck is overwhelmingly likely the file write speed. You can't really effectively parallelize writing to a single disk.
You may see a slight increase in speed if you spawn a worker thread responsible for just the file I/O. In other words, create a buffer, have your main thread dump contents into it while the other thread writes it to disk. It's the classic producer/consumer dynamic. I wouldn't expect serious speed gains, however.
Also keep in mind that writing to the console will slow you down, but you can keep that in the main thread and you'll probably be fine. Just make sure you put a limit on the buffer size and have the producer thread hang back when the buffer is full.
Edit: Also have a look at the link L-Three provided; using a BufferedStream would be an improvement (and would probably render a consumer thread unnecessary).
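A minimal sketch of that producer/consumer setup, assuming BlockingCollection&lt;T&gt; for the bounded buffer (the output path here is a placeholder):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // Bounded buffer: the producer blocks when 1000 chunks are pending,
        // which gives you the "hang back when the buffer is full" behaviour.
        using (var buffer = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            string path = Path.Combine(Path.GetTempPath(), "banknumbers.txt"); // placeholder path

            // Consumer: the only thread that touches the file.
            Task writer = Task.Run(() =>
            {
                using (var stream = new StreamWriter(path))
                    foreach (string chunk in buffer.GetConsumingEnumerable())
                        stream.WriteLine(chunk);
            });

            // Producer: the main thread generates data and dumps it into the buffer.
            for (int i = 0; i < 10000; i++)
                buffer.Add(i.ToString());

            buffer.CompleteAdding(); // signal the consumer that no more data is coming
            writer.Wait();
            Console.WriteLine(File.ReadAllLines(path).Length); // prints 10000
        }
    }
}
```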
Your process can be divided into two steps:
Generate an account
Save the account in the file
The first step can be done in parallel, as there is no dependency between accounts. That is, while creating account number xyz you don't have to rely on data from account xyz - 1 (as it may not yet be created).
The problematic bit is writing the data to the file. You don't want several threads trying to access and write to the same file, and adding locks will likely make your code a nightmare to maintain. The other issue is that it's the writing to the file that slows the whole process down.
At the moment, in your code creating account and writing to the file happens in one process.
What you can try is to separate these processes. So first you create all the accounts and keep them in some collection; here multi-threading can be used safely. Only when all the accounts are created do you save them.
Improving the saving process will take a bit more work. You will have to divide all the accounts into 8 separate collections. For each collection you create a separate file. Then you can take the first collection and first file, and create a thread that writes the data to the file; the same for the second collection and second file, and so on. These 8 processes can run in parallel, and you do not have to worry that more than one thread will try to access the same file.
Below some pseudo-code to illustrate the idea:
public void CreateAndSaveAccounts()
{
List<Account> accounts = this.CreateAccounts();
// Divide the accounts into separate batches
// Of course the process can (and should) be automated.
List<List<Account>> accountsInSeparateBatches =
new List<List<Account>>
{
accounts.GetRange(0, 10000000), // First batch of 10 million
accounts.GetRange(10000000, 10000000), // Second batch of 10 million
accounts.GetRange(20000000, 10000000) // Third batch of 10 million
// ...
};
// Save accounts in parallel
Parallel.For(0, accountsInSeparateBatches.Count,
i =>
{
string filePath = string.Format(@"C:\file{0}", i);
this.SaveAccounts(accountsInSeparateBatches[i], filePath);
}
);
}
public List<Account> CreateAccounts()
{
// Create accounts here
// and return them as a collection.
// Use parallel processing wherever possible
}
public void SaveAccounts(List<Account> accounts, string filePath)
{
// Save accounts to file
// The method creates a thread to do the work.
}
I have code that starts 4 threads at 15-minute intervals. The last time I ran it, the first 15-minute batch copied fast (20 files in 6 minutes), but the second 15-minute batch was much slower. It's sporadic, and I want to make certain that, if there's any bottleneck, it's a bandwidth limitation with the remote server.
EDIT: I'm monitoring the last run and the 15:00 and :45 copied in under 8 minutes each. The :15 hasn't finished and neither has :30, and both began at least 10 minutes before :45.
Here's my code:
static void Main(string[] args)
{
Timer t0 = new Timer((s) =>
{
Class myClass0 = new Class();
myClass0.DownloadFilesByPeriod(taskRunDateTime, 0, cts0.Token);
Copy0Done.Set();
}, null, TimeSpan.FromMinutes(20), TimeSpan.FromMilliseconds(-1));
Timer t1 = new Timer((s) =>
{
Class myClass1 = new Class();
myClass1.DownloadFilesByPeriod(taskRunDateTime, 1, cts1.Token);
Copy1Done.Set();
}, null, TimeSpan.FromMinutes(35), TimeSpan.FromMilliseconds(-1));
Timer t2 = new Timer((s) =>
{
Class myClass2 = new Class();
myClass2.DownloadFilesByPeriod(taskRunDateTime, 2, cts2.Token);
Copy2Done.Set();
}, null, TimeSpan.FromMinutes(50), TimeSpan.FromMilliseconds(-1));
Timer t3 = new Timer((s) =>
{
Class myClass3 = new Class();
myClass3.DownloadFilesByPeriod(taskRunDateTime, 3, cts3.Token);
Copy3Done.Set();
}, null, TimeSpan.FromMinutes(65), TimeSpan.FromMilliseconds(-1));
}
public struct FilesStruct
{
public string RemoteFilePath;
public string LocalFilePath;
}
private void DownloadFilesByPeriod(DateTime TaskRunDateTime, int Period, Object obj)
{
FilesStruct[] Array = GetAllFiles(TaskRunDateTime, Period);
//Array has 20 files for the specific period.
using (Session session = new Session())
{
// Connect
session.Open(sessionOptions);
TransferOperationResult transferResult;
foreach (FilesStruct u in Array)
{
if (session.FileExists(u.RemoteFilePath)) //File exists remotely
{
if (!File.Exists(u.LocalFilePath)) //File does not exist locally
{
transferResult = session.GetFiles(u.RemoteFilePath, u.LocalFilePath);
transferResult.Check();
foreach (TransferEventArgs transfer in transferResult.Transfers)
{
//Log that File has been transferred
}
}
else
{
using (StreamWriter w = File.AppendText(Logger._LogName))
{
//Log that File exists locally
}
}
}
else
{
using (StreamWriter w = File.AppendText(Logger._LogName))
{
//Log that File exists remotely
}
}
if (token.IsCancellationRequested)
{
break;
}
}
}
}
Something is not quite right here. The first thing is that you're setting up 4 timers to run in parallel. If you think about it, there is no need: you don't need 4 threads running in parallel all the time, you just need to initiate tasks at specific intervals. So how many timers do you need? ONE.
The second problem is why TimeSpan.FromMilliseconds(-1)? What is the purpose of that? I can't figure out why you put that in there, but I wouldn't.
The third problem (not related to multithreading, but worth pointing out anyway) is that you create a new instance of Class each time, which is unnecessary. It would be necessary if your class needed constructor setup, or if your logic accessed different methods or fields of the class in some order. In your case, all you want to do is call the method, so you don't need a new instance of the class every time; just make the method you're calling static.
Here is what I would do:
Store the files you need to download in an array / List<>. Can't you see that you're doing the same thing every time? There's no need to write 4 different versions of that code. Store the items in an array, then just change the index in the call!
Set up the timer at perhaps a 5-second interval. When it reaches the 20-minute / 35-minute / etc. mark, spawn a new thread to do the task. That way a new task can start even if the previous one is not finished.
Wait for all threads to complete (terminate). When they do, check if they throw exceptions, and handle them / log them if necessary.
After everything is done, terminate the program.
For step 2, you have the option to use the new async keyword if you're using .NET 4.5. But it won't make a noticeable difference if you use threads manually.
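A rough sketch of the steps above, using Task.Delay offsets in place of a polling timer for brevity. `DownloadFilesByPeriod` here is a stand-in for the real download method, and the offsets are in seconds only so the sketch finishes quickly; the real code would use minutes:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ScheduleSketch
{
    static void Main()
    {
        // One schedule instead of four timers; the period index replaces
        // the four copy-pasted callbacks.
        var offsets = new List<(int period, int delaySeconds)>
        {
            (0, 1), (1, 2), (2, 3), (3, 4)
        };

        var tasks = new List<Task>();
        foreach (var (period, delay) in offsets)
        {
            // foreach variables are per-iteration in C# 5+, so capturing
            // them in the lambda is safe.
            tasks.Add(Task.Run(async () =>
            {
                await Task.Delay(TimeSpan.FromSeconds(delay));
                DownloadFilesByPeriod(DateTime.Now, period); // placeholder for the real download
            }));
        }

        // Wait for all downloads; exceptions surface here in one place.
        Task.WaitAll(tasks.ToArray());
        Console.WriteLine("All periods done.");
    }

    static void DownloadFilesByPeriod(DateTime runTime, int period)
    {
        Console.WriteLine($"Period {period} at {runTime:T}");
    }
}
```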
As for why it is so slow: why don't you check your system status using Task Manager? Is the CPU maxed out, or is the network throughput occupied by something else? You can easily find the answer yourself from there.
The problem was the SFTP client.
The purpose of the console application was to loop through a List<> and download the files. I tried WinSCP and, even though it did the job, it was very slow. I also tested SharpSSH, and it was even slower than WinSCP.
I finally ended up using SSH.NET which, at least in my particular case, was much faster than both WinSCP and SharpSSH. I think the problem with WinSCP is that there was no evident way of disconnecting after I was done. With SSH.NET I could connect/disconnect after every file download, something I couldn't do with WinSCP.
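As an illustration of that per-file connect/disconnect pattern, here is a sketch using SSH.NET's SftpClient; the host, credentials, and paths are all placeholders:

```csharp
using System.IO;
using Renci.SshNet;

class SftpDownloadSketch
{
    static void Download(string host, string user, string password,
                         string remotePath, string localPath)
    {
        // Connect, download a single file, and disconnect again --
        // the per-file connect/disconnect cycle described above.
        using (var sftp = new SftpClient(host, user, password))
        {
            sftp.Connect();
            using (var output = File.Create(localPath))
                sftp.DownloadFile(remotePath, output);
            sftp.Disconnect();
        }
    }
}
```

Opening and closing the session per file trades some connection overhead for predictable cleanup, which matches the behaviour described above.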
I had some trouble trying to move a piece of code into another thread to increase performance.
I have the following code below (with the thread additions marked in comments), where I parse a large XML file (the final goal is 100,000 rows) and then write it to a SQL Server CE 3.5 database file (.sdf) using a record insert (SqlCeResultSet/SqlCeUpdatableRecord).
The two lines of code in the if statement inside the while loop,
xElem = (XElement)XNode.ReadFrom(xmlTextReader);
and
rs.Insert(record);
take about the same amount of time to execute. I was thinking of running rs.Insert(record); while parsing the next line of the XML file. However, I have been unable to do it using either Thread or ThreadPool.
I have to make sure that the record I pass to the thread is not changed until rs.Insert(record); has finished executing in the existing thread. So I tried to place thread.Join() before writing the new record (record.SetValue(i, values[i]);), but I still get a conflict when I run the program: it crashes with a bunch of errors from trying to write the identical row several times (especially for the index).
Can anyone help me with some advise? How can I move rs.Insert(record); into another thread to increase performance?
XmlTextReader xmlTextReader = new XmlTextReader(modFunctions.InFName);
XElement xElem = new XElement("item");
using (SqlCeConnection cn = new SqlCeConnection(connectionString))
{
if (cn.State == ConnectionState.Closed)
cn.Open();
using (SqlCeCommand cmd = new SqlCeCommand())
{
cmd.Connection = cn;
cmd.CommandText = "item";
cmd.CommandType = CommandType.TableDirect;
using (SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable))
{
SqlCeUpdatableRecord record = rs.CreateRecord();
// Thread code addition
Thread t = new Thread(() => rs.Insert(record));
while (xmlTextReader.Read())
{
if (xmlTextReader.NodeType == XmlNodeType.Element &&
xmlTextReader.LocalName == "item" &&
xmlTextReader.IsStartElement() == true)
{
xElem = (XElement)XNode.ReadFrom(xmlTextReader);
values[0] = (string)xElem.Element("Index"); // 0
values[1] = (string)xElem.Element("Name"); // 1
~~~
values[13] = (string)xElem.Element("Notes"); // 13
// Thread code addition -- Wait until previous thread finishes
if (ThreadStartedS == 1)
{
t.Join();
}
// SetValues to record
for (int i = 0; i < values.Length; i++)
{
record.SetValue(i, values[i]); // 0 to 13
}
// Thread code addition -- Start thread to execute rs.Insert(record)
ThreadStartedS = 1;
t.Start();
// Original code without threads
// Insert Record
//rs.Insert(record);
}
}
}
}
}
If all of your processing is going to be done on the device (reading from the XML file on the device then parsing the data on the device), then you will see no performance increase from threading your work.
These Windows Mobile devices have only a single processor, so multithreading on them means one piece of work runs for a while, then another runs for a while. You will never have processes truly running simultaneously.
On the other hand, if the data from your XML file were located on a remote server, you could call the data in chunks. As a chunk arrives, you could process that data in another thread while waiting on the next chunk of data to arrive in the main thread.
If all of this work is being done on one device, you will not have good luck with multithreading.
You can still display a progress bar (from 0 to NumberOfRecords) with a cancel button so the person waiting for the data collection to complete does not go insane with anticipation.
As part of an effort to automate starting/stopping some of our NServiceBus services, I'd like to know when a service has finished processing all the messages in its input queue.
The problem is that, while the NServiceBus service is running, my C# code is reporting one less message than is actually there. So it thinks that the queue is empty when there is still one message left. If the service is stopped, it reports the "correct" number of messages. This is confusing because, when I inspect the queues myself using the Private Queues view in the Computer Management application, it displays the "correct" number.
I'm using a variant of the following C# code to find the message count:
var queue = new MessageQueue(path);
return queue.GetAllMessages().Length;
I know this will perform horribly when there are many messages. The queues I'm inspecting should only ever have a handful of messages at a time.
I have looked at other related questions, but haven't found the help I need.
Any insight or suggestions would be appreciated!
Update: I should have mentioned that this service is behind a Distributor, which is shut down before trying to shut down this service. So I have confidence that new messages will not be added to the service's input queue.
The thing is that it's not actually "one less message"; rather, it depends on the number of messages currently being processed by the endpoint, which, in a multi-threaded process, can be as high as the number of threads.
There's also the issue of client processes that continue to send messages to that same queue.
Probably the only "sure" way of handling this is to count the messages multiple times with a delay in between; if the number stays zero over a certain number of attempts, you can assume the queue is empty.
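A sketch of that repeated-count idea; `getCount` stands in for whatever counting method is used (GetAllMessages().Length, the WMI counter below, etc.), and the simulated queue in Main exists only to make the sketch self-contained:

```csharp
using System;
using System.Threading;

class DrainCheckSketch
{
    // Returns true once the count has been zero for `requiredZeroReads`
    // consecutive polls, spaced `delay` apart; gives up after `timeout`.
    static bool WaitUntilDrained(Func<int> getCount, int requiredZeroReads,
                                 TimeSpan delay, TimeSpan timeout)
    {
        int zeroReads = 0;
        DateTime deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            // A non-zero read resets the streak: in-flight messages can
            // reappear in the count between polls.
            zeroReads = getCount() == 0 ? zeroReads + 1 : 0;
            if (zeroReads >= requiredZeroReads)
                return true;
            Thread.Sleep(delay);
        }
        return false;
    }

    static void Main()
    {
        int fakeCount = 3; // simulated queue that drains over time
        bool drained = WaitUntilDrained(
            () => fakeCount > 0 ? fakeCount-- : 0,
            requiredZeroReads: 3,
            delay: TimeSpan.FromMilliseconds(10),
            timeout: TimeSpan.FromSeconds(5));
        Console.WriteLine(drained); // prints True
    }
}
```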
WMI was the answer! Here's a first pass at the code. It could doubtless be improved.
public int GetMessageCount(string queuePath)
{
const string wql = "select * from Win32_PerfRawData_MSMQ_MSMQQueue";
var query = new WqlObjectQuery(wql);
var searcher = new ManagementObjectSearcher(query);
var queues = searcher.Get();
foreach (ManagementObject queue in queues)
{
var name = queue["Name"].ToString();
if (AreTheSameQueue(queuePath, name))
{
// Depending on the machine (32/64-bit), this value is a different type.
// Casting directly to UInt64 or UInt32 only works on the relative CPU architecture.
// To work around this run-time unknown, convert to string and then parse to int.
var countAsString = queue["MessagesInQueue"].ToString();
var messageCount = int.Parse(countAsString);
return messageCount;
}
}
return 0;
}
private static bool AreTheSameQueue(string path1, string path2)
{
// Tests whether two queue paths are equivalent, accounting for differences
// in case and length (if one path was truncated, for example by WMI).
string sanitizedPath1 = Sanitize(path1);
string sanitizedPath2 = Sanitize(path2);
if (sanitizedPath1.Length > sanitizedPath2.Length)
{
return sanitizedPath1.StartsWith(sanitizedPath2);
}
if (sanitizedPath1.Length < sanitizedPath2.Length)
{
return sanitizedPath2.StartsWith(sanitizedPath1);
}
return sanitizedPath1 == sanitizedPath2;
}
private static string Sanitize(string queueName)
{
var machineName = Environment.MachineName.ToLowerInvariant();
return queueName.ToLowerInvariant().Replace(machineName, ".");
}
I have an application that crawls a site and writes the content as Lucene index files to a physical directory.
When I use threads for this, I get write errors or errors due to the locks.
I want to use multiple threads and write into the index files without losing the work of any of the threads.
public class WriteDocument
{
private static Analyzer _analyzer;
private static IndexWriter indexWriter;
private static string Host;
public WriteDocument(string _Host)
{
Host = _Host;
Lucene.Net.Store.Directory _directory = FSDirectory.GetDirectory(Host, false);
_analyzer = new StandardAnalyzer();
bool indexExists = IndexReader.IndexExists(_directory);
bool createIndex = !indexExists;
indexWriter = new IndexWriter(_directory, _analyzer, createIndex);
}
public void AddDocument(object obj)
{
DocumentSettings doc = (DocumentSettings)obj;
Document document = new Document();
Field urlField = new Field("Url", doc.downloadedDocument.Uri.ToString(), Field.Store.YES, Field.Index.TOKENIZED);
document.Add(urlField);
indexWriter.AddDocument(document);
document = null;
doc.downloadedDocument = null;
indexWriter.Optimize();
indexWriter.Close();
}
}
To the above class, I am passing the values like this:
DocumentSettings writedoc = new DocumentSettings()
{
Host = Host,
downloadedDocument = downloadDocument
};
Thread t = new Thread(() =>
{
doc.AddDocument(writedoc);
});
t.Start();
If I add t.Join(); after t.Start();, the code works without any errors, but this slows down my process so much that it is effectively the same as the output I get without using threads.
I am getting an error like:
Cannot rename /indexes/Segments.new to /indexes/Segments:
the file is in use by another process.
Can anyone help me on this code?
Sharing the IndexWriter across threads the way this code does is not safe, so this is not possible as written.
If you want to use multiple threads for downloading, you will need to build some sort of single-threaded "message pump" to which you can feed the Documents you are downloading and creating, and which puts them in a queue.
For example, in your AddDocument method, instead of using the index directly, just send the documents off to a service that will index them eventually.
That service should keep indexing everything in its queue, and sleep for a while whenever the queue is empty.
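A sketch of such a queue-backed indexing service, using BlockingCollection so the consumer blocks on an empty queue instead of sleep-polling; `Doc` and the body of `Index` are placeholders for the real Lucene types and calls:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Crawler threads enqueue documents; a single consumer thread owns the index.
class IndexingService
{
    private readonly BlockingCollection<Doc> _queue = new BlockingCollection<Doc>();
    private readonly Task _pump;
    private int _indexed;

    public IndexingService()
    {
        // The only thread that ever touches the index.
        _pump = Task.Run(() =>
        {
            foreach (Doc doc in _queue.GetConsumingEnumerable())
                Index(doc); // e.g. indexWriter.AddDocument(...)
        });
    }

    public int IndexedCount => _indexed;

    public void Enqueue(Doc doc) => _queue.Add(doc); // safe from any thread

    public void Shutdown()
    {
        _queue.CompleteAdding(); // no more documents coming
        _pump.Wait();            // drain what is left, then stop
    }

    private void Index(Doc doc) { _indexed++; /* write to the Lucene index here */ }
}

class Doc { public string Url; }
```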
An alternative approach is to create a separate index for each of the threads and merge them all back at the end, e.g. index1, index2, ..., indexN (corresponding to threads 1..N), and then merge them into the final index.
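A sketch of that merge step, written in the same older Lucene.NET API style as the question's code; exact method availability varies between Lucene.NET versions, so treat the AddIndexes call here as an assumption to verify against your version:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class MergeSketch
{
    // Merge the per-thread indexes (index1..indexN) into one final index.
    static void MergeIndexes(string finalPath, string[] partPaths)
    {
        Directory finalDir = FSDirectory.GetDirectory(finalPath, true);
        IndexWriter writer = new IndexWriter(finalDir, new StandardAnalyzer(), true);

        // Open each per-thread index directory read-only.
        Directory[] parts = new Directory[partPaths.Length];
        for (int i = 0; i < partPaths.Length; i++)
            parts[i] = FSDirectory.GetDirectory(partPaths[i], false);

        writer.AddIndexes(parts); // merges all part indexes into finalDir
        writer.Optimize();
        writer.Close();
    }
}
```

Since each thread writes only to its own directory, no locking is needed during the crawl; the single-threaded merge at the end is the only place the indexes meet.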