How to process IEnumerable in batches?

How to process IEnumerable in batches? - c#

A lot of time my lists are too big tor process and i have to break them up into batches to process.
Is there a way to encapsulate this in a method or an extension. I seem to have to write this batching logic everywhere.
const int TAKE = 100;
int skip = 0;
while (skip < contacts.Count)
{
var batch = contacts.Skip(skip).Take(TAKE).ToList();
DoSomething(batch);
skip += TAKE;
}
I would like to do something like -
Batch(contacts, DoSomething);
Or anything similar so i do not have to write this batching logic again and again.

Using the batch solution from this thread it seems trivial:
const int batchSize = 100;
foreach (var batch in contacts.Batch(batchSize))
{
DoSomething(batch);
}
If you want to also wrap it up:
public static void ProcessInBatches<TSource>(
this IEnumerable<TSource> source,
int batchSize,
Action<IEnumerable<TSource>> action)
{
foreach (var batch in source.Batch(batchSize))
{
action(batch);
}
}
So, your code can be transformed into:
const int batchSize = 100;
contacts.ProcessInBatches(batchSize, DoSomething);

Related

C# thread safe class and list

The current need is to pull many records from an SQL database and then submit those records in much smaller blocks to an API call. The volume of data pull from SQL is inconsistent based on how much data is loaded by another process and will sometimes be small enough to handle with only one worker. When the data pull is large (max 5k rows) it will require more threads to help query the API to obtain speed. The process works great but when I run one at a time it is slow at large volumes. However, I'm finding the list I pass to the class is changing as multiple threads are launched. How can I achieve thread safety?
I've read about this and tried at length - locks and ConcurrentBag for example but I'm not quite sure where to apply or how to use these to achieve what I am looking for.
Here is what I have:
class Start
{
public void Execute()
{
SQLSupport sqlSupport = new SQLSupport();
List<SQLDataList> sqlDataList = new List<SQLDataList>();
List<SQLDataList> sqlAPIList = new List<SQLDataList>();
sqlDataList = sqlSupport.sqlQueryReturnList<sqlDataList>("SELECT * FROM TABLE");
if (sqlDataList.Count > 200)
{
int iRow = 0;
int iRowSQLCount = 0;
foreach (var item in sqlDataList)
{
if (iRowSQLCount == 100)
{
APIProcess apiProcess= new APIProcess (sqlAPIList);
Thread thr = new Thread(new ThreadStart(apiProcess.Execute));
thr.Start();
sqlAPIList.Clear();
iRowSQLCount = 0;
}
sqlAPIList.Add(sqlDataList[iRow]);
iRowSQLCount++;
iRow++;
}
}
}
}
class APIProcess
{
List<SQLDataList> sqlAPIList= new List<SQLDataList>();
public APIProcess(List<SQLDataList> sqlList)
{
sqlAPIList = sqlList;
}
public void Execute()
{
foreach (var item in sqlAPIList)
{
//loop through the list, interact with the API, update the list and ultimately update SQL with the API data.
}
}

Write records to a file when 500 Records reached or 2 mins of Time Passed from the previous file

I have a scenario, where I have a worker method (DoWork()) which is getting called by the web service continuously.
This worker method writes the data to a file using the (writeToFile()) and I need to write this to a file under 2 conditions
a) when the number of records reached in 500 OR
b) 2 minutes has been passed from the previous file been written
my sample code is as follows:
public override void DoWork()
{
//-- This worker method is called continuously by the webservice
List<string> list1 = new List<string>();
List<string> list2 = new List<string>();
int count =0;
//--- Some code that writes the RawData
list1.Add(rawData);
If(list1.Count<=500)
{
list2=list1;
count = WriteToFile(list2);
}
}
public static int WriteToFile(List<string> list)
{
//-- The writing of File Should happen when 500 records reached OR 2 mins passed.
// ---- Logic for writing the list file using the Streamwriter is working fine
}
I need a logic check if
500 records reached in the List OR
2 mins passed from the previous
file generated,
only then the file writing should happen.
Thanks
:)

To make it a little more readable, I'd use a little helper, here...
Time check:
class TimeCheck{
private TimeSpan timeoutDuration;
private DateTime nextTimeout;
// Default = 2 minutes.
public TimeCheck(): this(TimeSpan.FromMinutes(2))
{}
public TimeCheck( TimeSpan timeout ){
this.timeoutDuration= timeout;
this.nextTimeout = DateTime.Now + timeoutDuration;
}
public bool IsTimeoutReached => DateTime.Now >= nextTimeout;
public void Reset(){
nextTimeout = DateTime.Now + timeoutDuration;
}
}
Usage:
// on class level
const int MAXITEMS = 500;
private Lazy<TimeCheck> timecheck = new Lazy<TimeCheck>( () => return new TimeCheck() );
private List<string> list1 = new List<string>();
private readonly object Locker = new object();
public override void DoWork()
{
lock(Locker){
// ... add items to list1
// Write if a) list1 has 500+ items or b) 2 minutes since last write have passed
if( list1.Count >= MAXITEMS || timecheck.Value.IsTimeoutReached )
{
WriteToFile(list1);
list1.Clear(); // assuming _all_ Items are written.
timecheck.Value.Reset();
}
}
}
Attention:
If code is called by multiple threads, you need to make sure it's thread safe. I used lock which will create a bottleneck. You may want to figure out a more sophisticated manner, but this answer concentrates on the condition requirements.
Above snippet assumes, list1 is not accessed elsewhere.

Writing a unit test for concurrent C# code?

I've been trying to solve this issue for quite some time now. I've written some example code showcasing the usage of lock in C#. Running my code manually I can see that it works the way it should, but of course I would like to write a unit test that confirms my code.
I have the following ObjectStack.cs class:
enum ExitCode
{
Success = 0,
Error = 1
}
public class ObjectStack
{
private readonly Stack<Object> _objects = new Stack<object>();
private readonly Object _lockObject = new Object();
private const int NumOfPopIterations = 1000;
public ObjectStack(IEnumerable<object> objects)
{
foreach (var anObject in objects) {
Push(anObject);
}
}
public void Push(object anObject)
{
_objects.Push(anObject);
}
public void Pop()
{
_objects.Pop();
}
public void ThreadSafeMultiPop()
{
for (var i = 0; i < NumOfPopIterations; i++) {
lock (_lockObject) {
try {
Pop();
}
//Because of lock, the stack will be emptied safely and no exception is ever caught
catch (InvalidOperationException) {
Environment.Exit((int)ExitCode.Error);
}
if (_objects.Count == 0) {
Environment.Exit((int)ExitCode.Success);
}
}
}
}
public void ThreadUnsafeMultiPop()
{
for (var i = 0; i < NumOfPopIterations; i++) {
try {
Pop();
}
//Because there is no lock, an exception is caught when popping an already empty stack
catch (InvalidOperationException) {
Environment.Exit((int)ExitCode.Error);
}
if (_objects.Count == 0) {
Environment.Exit((int)ExitCode.Success);
}
}
}
}
And Program.cs:
public class Program
{
private const int NumOfObjects = 100;
private const int NumOfThreads = 10000;
public static void Main(string[] args)
{
var objects = new List<Object>();
for (var i = 0; i < NumOfObjects; i++) {
objects.Add(new object());
}
var objectStack = new ObjectStack(objects);
Parallel.For(0, NumOfThreads, x => objectStack.ThreadUnsafeMultiPop());
}
}
I'm trying to write a unit that tests the thread unsafe method, by checking the exit code value (0 = success, 1 = error) of the executable.
I tried to start and run the application executable as a process in my test, a couple of 100 times, and checked the exit code value each time in the test. Unfortunately, it was 0 every single time.
Any ideas are greatly appreciated!

Logically, there is one, very small, piece of code where this problem can happen. Once one of the threads enters the block of code that pops a single element, then either the pop will work in which case the next line of code in that thread will Exit with success OR the pop will fail in which case the next line of code will catch the exception and Exit.
This means that no matter how much parallelization you put into the program, there is still only one single point in the whole program execution stack where the issue can occur and that is directly before the program exits.
The code is genuinely unsafe, but the probability of an issue happening in any single execution of the code is extremely low as it requires the scheduler to decide not to execute the line of code that will exit the environment cleanly and instead let one of the other Threads raise an exception and exit with an error.
It is extremely difficult to "prove" that a concurrency bug exists, except for really obvious ones, because you are completely dependent on what the scheduler decides to do.
Looking up some other posts I see this post which is written related to Java but references C#: How should I unit test threaded code?
It includes a link to this which might be useful to you: http://research.microsoft.com/en-us/projects/chess/
Hope this is useful and apologies if it is not. Testing concurrency is inherently unpredictable as is writing example code to cause it.

Thanks for all the input! Although I do agree that this is a concurrency issue quite hard to detect due to the scheduler execution among other things, I seem to have found an acceptable solution to my problem.
I wrote the following unit test:
[TestMethod]
public void Executable_Process_Is_Thread_Safe()
{
const string executablePath = "Thread.Locking.exe";
for (var i = 0; i < 1000; i++) {
var process = new Process() {StartInfo = {FileName = executablePath}};
process.Start();
process.WaitForExit();
if (process.ExitCode == 1) {
Assert.Fail();
}
}
}
When I ran the unit test, it seemed that the Parallel.For execution in Program.cs threw strange exceptions at times, so I had to change that to traditional for-loops:
public class Program
{
private const int NumOfObjects = 100;
private const int NumOfThreads = 10000;
public static void Main(string[] args)
{
var objects = new List<Object>();
for (var i = 0; i < NumOfObjects; i++) {
objects.Add(new object());
}
var tasks = new Task[NumOfThreads];
var objectStack = new ObjectStack(objects);
for (var i = 0; i < NumOfThreads; i++)
{
var task = new Task(objectStack.ThreadUnsafeMultiPop);
tasks[i] = task;
}
for (var i = 0; i < NumOfThreads; i++)
{
tasks[i].Start();
}
//Using this seems to throw exceptions from unit test context
//Parallel.For(0, NumOfThreads, x => objectStack.ThreadUnsafeMultiPop());
}
Of course, the unit test is quite dependent on the machine you're running it on (a fast processor may be able to empty the stack and exit safely before reaching the critical section in all cases).

1.) You could inject IL Inject Context switches on a post build of your code in the form of Thread.Sleep(0) using ILGenerator which would most likely help these issues to arise.
2.) I would recommend you take a look at the CHESS project by Microsoft research team.

Huge performance hit for inlining sync method into async method

I have a simple performance test, that indirectly calls WriteAsync many times. It performs reasonably as long as WriteAsync is implemented as shown below. However, when I inline WriteByte into WriteAsync, performance degrades by about factor 7.
(To be clear: The only change that I make is replacing the statement containing the WriteByte call with the body of WriteByte.)
Can anybody explain why this happens? I've had a look at the differences in the generated code with Reflector, but nothing struck me as so totally different as that it would explain the huge perf hit.
public sealed override async Task WriteAsync(
byte[] buffer, int offset, int count, CancellationToken cancellationToken)
{
var writeBuffer = this.WriteBuffer;
var pastEnd = offset + count;
while ((offset < pastEnd) && ((writeBuffer.Count < writeBuffer.Capacity) ||
await writeBuffer.FlushAsync(cancellationToken)))
{
offset = WriteByte(buffer, offset, writeBuffer);
}
this.TotalCount += count;
}
private int WriteByte(byte[] buffer, int offset, WriteBuffer writeBuffer)
{
var currentByte = buffer[offset];
if (this.previousWasEscapeByte)
{
this.previousWasEscapeByte = false;
this.crc = Crc.AddCrcCcitt(this.crc, currentByte);
currentByte = (byte)(currentByte ^ Frame.EscapeXor);
++offset;
}
else
{
if (currentByte < Frame.InvalidStart)
{
this.crc = Crc.AddCrcCcitt(this.crc, currentByte);
++offset;
}
else
{
currentByte = Frame.EscapeByte;
this.previousWasEscapeByte = true;
}
}
writeBuffer[writeBuffer.Count++] = currentByte;
return offset;
}

async methods are rewritten by the compiler into a giant state machine, very similar to methods using yield return. All of your locals become fields in the state machine's class. The compiler currently doesn't try to optimize this at all, so any optimization is up to the coder.
Every local which would have sat happily in a register is now being read from and written to memory. Refactoring synchronous code out of async methods and into a sync method is a very valid performance optimization -- you're just doing the reverse!

How performant is StackFrame?

I am considering using something like StackFrame stackFrame = new StackFrame(1) to log the executing method, but I don't know about its performance implications. Is the stack trace something that is build anyway with each method call so performance should not be a concern or is it something that is only build when asked for it? Do you recommend against it in an application where performance is very important? If so, does that mean I should disable it for the release?

edit: Some background
We have a similar feature which is disabled 99% of the time; we were using an approach like:
public void DoSomething()
{
TraceCall(MethodBase.GetCurrentMethod().Name);
// Do Something
}
public void TraceCall(string methodName)
{
if (!loggingEnabled) { return; }
// Log...
}
TraceCall(MethodBase.GetCurrentMethod().Name)
It was simple, but regardless of whether or not tracing was enabled we were incurring the performance hit of using Reflection to lookup the method name.
Our options were to either require more code in every method (and risk simple mistakes or refusal) or to switch to using StackFrame to determine the calling method only when logging was enabled.
Option A:
public void DoSomething()
{
if (loggingEnabled)
{
TraceCall(MethodBase.GetCurrentMethod().Name);
}
// Do Something
}
public void TraceCall(string methodName)
{
if (!loggingEnabled) { return; }
// Log...
}
Option B:
public void DoSomething()
{
TraceCall();
// Do Something
}
public void TraceCall()
{
if (!loggingEnabled) { return; }
StackFrame stackFrame = new StackFrame(1);
// Log...
}
We opted for Option B. It offers significant performance improvements over Option A when logging is disabled, 99% of the time and is very simple to implement.
Here's an alteration of Michael's code, to display the cost / benefit of this approach
using System;
using System.Diagnostics;
using System.Reflection;
namespace ConsoleApplication
{
class Program
{
static bool traceCalls;
static void Main(string[] args)
{
Stopwatch sw;
// warm up
for (int i = 0; i < 100000; i++)
{
TraceCall();
}
// call 100K times, tracing *disabled*, passing method name
sw = Stopwatch.StartNew();
traceCalls = false;
for (int i = 0; i < 100000; i++)
{
TraceCall(MethodBase.GetCurrentMethod());
}
sw.Stop();
Console.WriteLine("Tracing Disabled, passing Method Name: {0}ms"
, sw.ElapsedMilliseconds);
// call 100K times, tracing *enabled*, passing method name
sw = Stopwatch.StartNew();
traceCalls = true;
for (int i = 0; i < 100000; i++)
{
TraceCall(MethodBase.GetCurrentMethod());
}
sw.Stop();
Console.WriteLine("Tracing Enabled, passing Method Name: {0}ms"
, sw.ElapsedMilliseconds);
// call 100K times, tracing *disabled*, determining method name
sw = Stopwatch.StartNew();
traceCalls = false;
for (int i = 0; i < 100000; i++)
{
TraceCall();
}
Console.WriteLine("Tracing Disabled, looking up Method Name: {0}ms"
, sw.ElapsedMilliseconds);
// call 100K times, tracing *enabled*, determining method name
sw = Stopwatch.StartNew();
traceCalls = true;
for (int i = 0; i < 100000; i++)
{
TraceCall();
}
Console.WriteLine("Tracing Enabled, looking up Method Name: {0}ms"
, sw.ElapsedMilliseconds);
Console.ReadKey();
}
private static void TraceCall()
{
if (traceCalls)
{
StackFrame stackFrame = new StackFrame(1);
TraceCall(stackFrame.GetMethod().Name);
}
}
private static void TraceCall(MethodBase method)
{
if (traceCalls)
{
TraceCall(method.Name);
}
}
private static void TraceCall(string methodName)
{
// Write to log
}
}
}
The Results:
Tracing Disabled, passing Method Name: 294ms
Tracing Enabled, passing Method Name: 298ms
Tracing Disabled, looking up Method Name: 0ms
Tracing Enabled, looking up Method Name: 1230ms

I know this is an old post, but just in case anyone comes across it there is another alternative if you are targeting .Net 4.5
You can use the CallerMemberName attribute to identify the calling method name. It is far faster than reflection or StackFrame. Here are the results from a quick test iterating a million times. StackFrame is extremely slow compared to reflection, and the new attribute makes both look like they are standing still. This was ran in the IDE.
Reflection Result:
00:00:01.4098808
StackFrame Result
00:00:06.2002501
CallerMemberName attribute Result:
00:00:00.0042708
Done
The following is from the compiled exe:
Reflection Result:
00:00:01.2136738
StackFrame Result
00:00:03.6343924
CallerMemberName attribute Result:
00:00:00.0000947
Done
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
sw.Stop();
Console.WriteLine("Reflection Result:");
sw.Start();
for (int i = 0; i < 1000000; i++)
{
//Using reflection to get the current method name.
PassedName(MethodBase.GetCurrentMethod().Name);
}
Console.WriteLine(sw.Elapsed);
Console.WriteLine("StackFrame Result");
sw.Restart();
for (int i = 0; i < 1000000; i++)
{
UsingStackFrame();
}
Console.WriteLine(sw.Elapsed);
Console.WriteLine("CallerMemberName attribute Result:");
sw.Restart();
for (int i = 0; i < 1000000; i++)
{
UsingCallerAttribute();
}
Console.WriteLine(sw.Elapsed);
sw.Stop();
Console.WriteLine("Done");
Console.Read();
}
static void PassedName(string name)
{
}
static void UsingStackFrame()
{
string name = new StackFrame(1).GetMethod().Name;
}
static void UsingCallerAttribute([CallerMemberName] string memberName = "")
{
}

A quick and naive test indicates that for performance-sensitive code, yes, you want to pay attention to this:
Don't generate 100K frames: 3ms
Generate 100K frames: 1805ms
About 20 microseconds per generated frame, on my machine. Not a lot, but a measurable difference over a large number of iterations.
Speaking to your later questions ("Should I disable StackFrame generation in my application?"), I'd suggest you analyze your application, do performance tests like the one I've done here, and see if the performance difference amounts to anything with your workload.
using System;
using System.Diagnostics;
namespace ConsoleApplication
{
class Program
{
static bool generateFrame;
static void Main(string[] args)
{
Stopwatch sw;
// warm up
for (int i = 0; i < 100000; i++)
{
CallA();
}
// call 100K times; no stackframes
sw = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
CallA();
}
sw.Stop();
Console.WriteLine("Don't generate 100K frames: {0}ms"
, sw.ElapsedMilliseconds);
// call 100K times; generate stackframes
generateFrame = true;
sw = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
CallA();
}
Console.WriteLine("Generate 100K frames: {0}ms"
, sw.ElapsedMilliseconds);
Console.ReadKey();
}
private static void CallA()
{
CallB();
}
private static void CallB()
{
CallC();
}
private static void CallC()
{
if (generateFrame)
{
StackFrame stackFrame = new StackFrame(1);
}
}
}
}

From the MSDN documentation, it seems like StackFrames are being created all the time:
A StackFrame is created and pushed on
the call stack for every function call
made during the execution of a thread.
The stack frame always includes
MethodBase information, and optionally
includes file name, line number, and
column number information.
The constructor new StackFrame(1) you'd call would do this:
private void BuildStackFrame(int skipFrames, bool fNeedFileInfo)
{
StackFrameHelper sfh = new StackFrameHelper(fNeedFileInfo, null);
StackTrace.GetStackFramesInternal(sfh, 0, null);
int numberOfFrames = sfh.GetNumberOfFrames();
skipFrames += StackTrace.CalculateFramesToSkip(sfh, numberOfFrames);
if ((numberOfFrames - skipFrames) > 0)
{
this.method = sfh.GetMethodBase(skipFrames);
this.offset = sfh.GetOffset(skipFrames);
this.ILOffset = sfh.GetILOffset(skipFrames);
if (fNeedFileInfo)
{
this.strFileName = sfh.GetFilename(skipFrames);
this.iLineNumber = sfh.GetLineNumber(skipFrames);
this.iColumnNumber = sfh.GetColumnNumber(skipFrames);
}
}
}
GetStackFramesInternal is an external method. CalculateFramesToSkip has a loop that operates exactly once, since you specified only 1 frame. Everything else looks pretty quick.
Have you tried measuring how long it would take to create, say, 1 million of them?

I am considering using something like StackFrame stackFrame = new StackFrame(1) to log the executing method
Out of interest: Why? If you only want the current method, then
string methodName = System.Reflection.MethodBase.GetCurrentMethod().Name;
seems better. Maybe not more performant (I didn't compare, but Reflection shows that GetCurrentMethod() does not simply create a StackFrame but does some "magic"), but clearer in it's intent.

I know what you mean, but this example result is overshoot. Executing the GetCurrentMethod even when logging is turned off is a waste.
It should be something like:
if (loggingEnabled) TraceCall(MethodBase.GetCurrentMethod());
Or if you want the TraceCall always executed:
TraceCall(loggingEnabled ? MethodBase.GetCurrentMethod() : null);

I think Paul Williams hit the nail on the head with the work being done. If you dig deeper into StackFrameHelper you'll find that fNeedFileInfo is actually the performance killer - especially in debug mode. Try setting it false if performance is important. You won't get much useful information in release mode anyway.
If you pass false in here you'll still get method names when doing a ToString() and without outputting any information, just moving stack pointers around, it's very fast.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to process IEnumerable in batches? - c#

Related

C# thread safe class and list

Write records to a file when 500 Records reached or 2 mins of Time Passed from the previous file

Writing a unit test for concurrent C# code?

Huge performance hit for inlining sync method into async method

How performant is StackFrame?

Categories

Resources