I need to process a single file in parallel by passing skip/take ranges such as 1-1000, 1001-2000, 2001-3000, and so on.
Code for the parallel processing:
var line = File.ReadAllLines("D:\\OUTPUT.CSV").Length;
Parallel.For(1, line, new ParallelOptions { MaxDegreeOfParallelism = 10 }, x
=> {
DoSomething(skip,take);
});
Function
public static void DoSomething(int skip, int take)
{
//code here
}
How can I pass the skip and take values to each parallel iteration, as described above?
You can do this rather easily with PLINQ. If you want batches of 1000, you can do:
const int BatchSize = 1000;
var source = File.ReadAllLines("D:\\OUTPUT.CSV");
// Number of batches, rounded up so that a final partial batch is included
var pageAmount = (int)Math.Ceiling((double)source.Length / BatchSize);
Enumerable.Range(0, pageAmount)
.AsParallel()
.ForAll(page => DoSomething(source, page));
public static void DoSomething(string[] source, int page)
{
var currentLines = source.Skip(page * BatchSize).Take(BatchSize);
// do something with the selected lines
}
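If DoSomething should keep the (skip, take) signature from the question, the ranges can be computed up front and handed to Parallel.ForEach. This is only a minimal sketch: GetRanges is a hypothetical helper, skip values are assumed to be zero-based, and the DoSomething body here just records what it received.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: splits totalCount items into (skip, take) pairs.
static IEnumerable<(int Skip, int Take)> GetRanges(int totalCount, int batchSize)
{
    for (int skip = 0; skip < totalCount; skip += batchSize)
        yield return (skip, Math.Min(batchSize, totalCount - skip));
}

var processed = new ConcurrentBag<(int Skip, int Take)>();

// Stand-in for the real work; it only records the range it was given.
void DoSomething(int skip, int take) => processed.Add((skip, take));

int lineCount = 2500; // in the question this would be the file's line count

Parallel.ForEach(
    GetRanges(lineCount, 1000),
    new ParallelOptions { MaxDegreeOfParallelism = 10 },
    range => DoSomething(range.Skip, range.Take));

foreach (var range in processed.OrderBy(r => r.Skip))
    Console.WriteLine($"skip={range.Skip}, take={range.Take}");
```

With 2500 lines and a batch size of 1000, the last range is shortened to 500 so that no iteration reads past the end.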
Related
I want to divide a database query result into as many tasks as I choose. How can I do this? For example, I want to give every 300 rows to the same process at the same time, but each group of 300 rows must be a different set of rows.
I don't know what you mean by:
I want to give every 300 rows to the same process at the same time
However, one possible solution for dividing the query result into a list of tasks could be this:
Count total records:
var count = await context.Entities.CountAsync();
Calculate the number of database calls you need:
const int take = 300;
var dbCallsCount = Math.Ceiling((double)count / take);
Create a method to fetch data (note that you cannot run parallel queries through the same DbContext object):
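public async Task<List<Entity>> FetchDataAsync(int page, int take)
{
using(var context = new DbContext("ConnectionString"))
{
// page is 1-based: page 1 skips 0 rows
var result = await context.Entities
.AsNoTracking()
.OrderBy(e => e.Id) // assumption: Entity has an Id key; paging needs a stable order
.Skip((page - 1) * take)
.Take(take)
.ToListAsync();
return result;
}
}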
Create a list of tasks to fetch data (pages are 1-based, because FetchDataAsync skips (page - 1) * take rows):
var taskList = new List<Task<List<Entity>>>();
for(var i = 1; i <= dbCallsCount; i++)
taskList.Add(FetchDataAsync(i, take));
var result = await Task.WhenAll(taskList);
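The paging arithmetic can be checked without a database. The following is a rough sketch in which an in-memory list stands in for the table; FetchPageAsync, the row count of 1000, and the Task.Delay that simulates the round trip are all assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for the table: 1000 rows with ids 1..1000 (an assumption for the demo).
var table = Enumerable.Range(1, 1000).ToList();

// Hypothetical fetch: pages are 1-based, so page 1 skips nothing.
async Task<List<int>> FetchPageAsync(int page, int pageSize)
{
    await Task.Delay(10); // simulates the database round trip
    return table.OrderBy(id => id).Skip((page - 1) * pageSize).Take(pageSize).ToList();
}

const int PageSize = 300;
int pageCount = (table.Count + PageSize - 1) / PageSize; // integer ceiling: 4 pages

// Start all fetches first, then wait for them together.
var tasks = Enumerable.Range(1, pageCount)
    .Select(page => FetchPageAsync(page, PageSize))
    .ToList();

List<int>[] pages = await Task.WhenAll(tasks);

Console.WriteLine(string.Join(", ", pages.Select(p => p.Count)));
// prints "300, 300, 300, 100"
```

Task.WhenAll returns the results in the order of the tasks, so the pages come back in page order and together cover every row exactly once.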
This can be turned into a generic method that returns the list of fetch tasks:
public async Task<List<Task<List<TEntity>>>> DivideDbQueryIntoTasks<TEntity>(int take) where TEntity : class
{
int count;
using(var context = new DbContext("ConnectionString"))
{
count = await context.Set<TEntity>().CountAsync();
}
var dbCallsCount = (int)Math.Ceiling((double)count / take);
// Local function; it reuses TEntity and take from the enclosing method
async Task<List<TEntity>> FetchDataAsync(int page)
{
using(var context = new DbContext("ConnectionString"))
{
// Add an OrderBy on a key here so that paging is deterministic
var result = await context.Set<TEntity>()
.AsNoTracking()
.Skip((page - 1) * take)
.Take(take)
.ToListAsync();
return result;
}
}
var taskList = new List<Task<List<TEntity>>>();
// Pages are 1-based because FetchDataAsync skips (page - 1) * take rows
for(var i = 1; i <= dbCallsCount; i++)
taskList.Add(FetchDataAsync(i));
return taskList;
}
And call it in this way:
var tasks = await DivideDbQueryIntoTasks<MyEntity>(300);
foreach (Task<List<MyEntity>> task in tasks)
{
...
}
Below is my current code, which gets 500 documents (JSON format) from DocumentDB per call. I can only fetch 500 per search, and I add them to a concurrent bag (in parallel). The data fetched is based on the id number I pass to the API, which returns the documents that follow that id, e.g. id = 500 gets documents 501 - 1000. The code below fills the concurrent bag with 25k documents as expected.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
id = 500;
while (id <= 25000)
{
docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
docClass lastdoc = docs.Documents.Last();
lastid = lastdoc.Id.Id;
Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
{
concurrentBag.Add(item);
});
id = id + 500;
}
}
I wanted to run this whole while loop on multiple threads so that I can make multiple calls to the API and fetch batches of 500 documents in parallel. I tried to modify the code as below, but after the whole run I still see only 500 documents in the concurrent bag 'concurrentBag', and the skip id stays at 500 and doesn't increment.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
id = 500;
Task[] tasks = new Task[threadNumber];
for (int j = 0; j < threadNumber; j++)
{
tasks[j] = Task.Run(async() =>
{
while (id <= 25000)
{
docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
docClass lastdoc = docs.Documents.Last();
lastid = lastdoc.Id.Id;
Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
{
concurrentBag.Add(item);
});
id = id + 500;
}
});
}
}
Can you please help me see what I am doing wrong here?
For loading documents from external resources, use an asynchronous approach without extra threads.
Note that when you download external resources in parallel, the extra threads do no work; they just wait for the response, so those threads are wasted ;)
The asynchronous approach makes it possible to launch multiple requests almost simultaneously, without waiting for each task to complete, and to wait only once, until all tasks are done.
var maxDocuments = 25000;
var step = 500;
// Start every request without awaiting it; each one gets its own id value
var documentTasks = Enumerable.Range(1, int.MaxValue)
.Select(offset => step * offset)
.TakeWhile(id => id <= maxDocuments)
.Select(id => client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions))
.ToArray();
await Task.WhenAll(documentTasks);
// All tasks have completed here, so reading Result does not block
var allDocuments = documentTasks
.Select(task => task.Result)
.SelectMany(result => result.Documents)
.ToArray();
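In the question's second version, id, docs and lastid appear to be shared by all five tasks, so the tasks race on the same counter instead of each working on their own range. The sketch below shows the pattern with every request receiving its own id. SearchAsync is a hypothetical stand-in for the real client call and simply returns the 500 ids that follow the one it was given.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the search API: given an id, it returns the next 500 ids.
async Task<List<int>> SearchAsync(int id)
{
    await Task.Delay(20); // simulates network latency; no thread is blocked while waiting
    return Enumerable.Range(id + 1, 500).ToList();
}

const int Step = 500;
const int MaxId = 25000;

// Each request gets its own id value, so no shared counter is mutated concurrently.
var requests = Enumerable.Range(1, MaxId / Step)
    .Select(n => SearchAsync(n * Step))
    .ToArray();

List<int>[] responses = await Task.WhenAll(requests);
var allDocuments = responses.SelectMany(r => r).ToList();

Console.WriteLine(allDocuments.Count); // prints 25000
```

Because all 50 requests are in flight at once, a real API may need a concurrency cap; that is a separate concern from the shared-variable problem shown here.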
I want to limit the total number of queries that I submit to my database server across all Dataflow blocks to 30. In the following scenario, the throttling of 30 concurrent tasks is per block so it always hits 60 concurrent tasks during execution. Obviously I could limit my parallelism to 15 per block to achieve a system wide total of 30 but this wouldn't be optimal.
How do I make this work? Do I limit (and block) my awaits using SemaphoreSlim, etc, or is there an intrinsic Dataflow approach that works better?
public class TPLTest
{
private long AsyncCount = 0;
private long MaxAsyncCount = 0;
private long TaskId = 0;
private object MetricsLock = new object();
public async Task Start()
{
ExecutionDataflowBlockOptions execOption
= new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 30 };
DataflowLinkOptions linkOption = new DataflowLinkOptions()
{ PropagateCompletion = true };
var doFirstIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doCPUWork = new TransformBlock<Data, Data>(
data => DoCPUBoundWork(data));
var doSecondIOWorkAsync = new TransformBlock<Data, Data>(
async data => await DoIOBoundWorkAsync(data), execOption);
var doProcess = new TransformBlock<Data, string>(
i => $"Task finished, ID = : {i.TaskId}");
var doPrint = new ActionBlock<string>(
s => Debug.WriteLine(s));
doFirstIOWorkAsync.LinkTo(doCPUWork, linkOption);
doCPUWork.LinkTo(doSecondIOWorkAsync, linkOption);
doSecondIOWorkAsync.LinkTo(doProcess, linkOption);
doProcess.LinkTo(doPrint, linkOption);
int taskCount = 150;
for (int i = 0; i < taskCount; i++)
{
await doFirstIOWorkAsync.SendAsync(new Data() { Delay = 2500 });
}
doFirstIOWorkAsync.Complete();
await doPrint.Completion;
Debug.WriteLine("Max concurrent tasks: " + MaxAsyncCount.ToString());
}
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
lock(MetricsLock)
{
AsyncCount++;
if (AsyncCount > MaxAsyncCount)
MaxAsyncCount = AsyncCount;
}
if (data.TaskId <= 0)
data.TaskId = Interlocked.Increment(ref TaskId);
await Task.Delay(data.Delay);
lock (MetricsLock)
AsyncCount--;
return data;
}
private Data DoCPUBoundWork(Data data)
{
data.Step = 1;
return data;
}
}
Data Class:
public class Data
{
public int Delay { get; set; }
public long TaskId { get; set; }
public int Step { get; set; }
}
Starting point:
TPLTest tpl = new TPLTest();
await tpl.Start();
Why don't you marshal everything to an action block that has the actual limitation?
var count = 0;
var ab1 = new TransformBlock<int, string>(l => $"1:{l}");
var ab2 = new TransformBlock<int, string>(l => $"2:{l}");
var doPrint = new ActionBlock<string>(
async s =>
{
var c = Interlocked.Increment(ref count);
Console.WriteLine($"{c}:{s}");
await Task.Delay(5);
Interlocked.Decrement(ref count);
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 15 });
ab1.LinkTo(doPrint);
ab2.LinkTo(doPrint);
for (var i = 100; i > 0; i--)
{
if (i % 3 == 0) await ab1.SendAsync(i);
if (i % 5 == 0) await ab2.SendAsync(i);
}
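ab1.Complete();
ab2.Complete();
await ab1.Completion;
await ab2.Completion;
// Both sources are drained, so the shared block can be completed and awaited
doPrint.Complete();
await doPrint.Completion;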
This is the solution I ended up going with (unless I can figure out how to use a single generic DataFlow block for marshalling every type of database access):
I defined a SemaphoreSlim at the class level:
private SemaphoreSlim ThrottleDatabaseQuerySemaphore = new SemaphoreSlim(30, 30);
I modified the I/O method to call a throttling method. It passes a delegate rather than an already started task, so the operation only begins once a semaphore slot is free:
private async Task<Data> DoIOBoundWorkAsync(Data data)
{
if (data.TaskId <= 0)
data.TaskId = Interlocked.Increment(ref TaskId);
await ThrottleDatabaseQueryAsync(() => Task.Delay(data.Delay));
return data;
}
The throttling method: (I also have a generic version of the throttling routine because I couldn't figure out how to write one routine to handle both Task and Task<TResult>)
private async Task ThrottleDatabaseQueryAsync(Func<Task> operation)
{
await ThrottleDatabaseQuerySemaphore.WaitAsync();
try
{
lock (MetricsLock)
{
AsyncCount++;
if (AsyncCount > MaxAsyncCount)
MaxAsyncCount = AsyncCount;
}
// The operation starts here, after the semaphore has been acquired
await operation();
}
finally
{
lock (MetricsLock)
AsyncCount--;
ThrottleDatabaseQuerySemaphore.Release();
}
}
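One way to cover both Task and Task<TResult> with a single implementation is to write the generic version and let the non-generic one delegate to it. The sketch below is not the code from the answer: ThrottleAsync, ThrottleNoResultAsync and FakeQueryAsync are hypothetical names, and the limit is 3 instead of 30 to keep the demo small. Both methods take a delegate, so the operation does not start until a semaphore slot is free.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var semaphore = new SemaphoreSlim(3, 3);
var gate = new object();
int running = 0, maxRunning = 0;

// Generic version: the operation is started only after a slot has been acquired.
async Task<T> ThrottleAsync<T>(Func<Task<T>> operation)
{
    await semaphore.WaitAsync();
    try { return await operation(); }
    finally { semaphore.Release(); }
}

// Version for operations without a result; it reuses the generic one.
Task ThrottleNoResultAsync(Func<Task> operation) =>
    ThrottleAsync(async () => { await operation(); return true; });

// Stand-in for a database query; it tracks how many copies run at once.
async Task<int> FakeQueryAsync(int i)
{
    lock (gate) { running++; maxRunning = Math.Max(maxRunning, running); }
    await Task.Delay(20);
    lock (gate) { running--; }
    return i * 2;
}

int[] results = await Task.WhenAll(
    Enumerable.Range(1, 20).Select(i => ThrottleAsync(() => FakeQueryAsync(i))));
await ThrottleNoResultAsync(() => Task.Delay(5));

Console.WriteLine($"Max concurrent operations: {maxRunning}");
```

The reported maximum should never exceed the semaphore's limit, because FakeQueryAsync is only invoked inside ThrottleAsync after WaitAsync has returned.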
The simplest solution to this problem is to configure all your blocks with a limited-concurrency TaskScheduler:
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxConcurrencyLevel: 30).ConcurrentScheduler;
ExecutionDataflowBlockOptions execOption = new()
{
TaskScheduler = scheduler,
MaxDegreeOfParallelism = scheduler.MaximumConcurrencyLevel,
};
TaskSchedulers can only limit the concurrency of work done on threads. They can't throttle asynchronous operations that are not running on threads. So in order to enforce the MaximumConcurrencyLevel policy, unfortunately you must pass synchronous delegates to all the Dataflow blocks. For example:
TransformBlock<Data, Data> doFirstIOWorkAsync = new(data =>
{
return DoIOBoundWorkAsync(data).GetAwaiter().GetResult();
}, execOption);
This change will increase the demand for ThreadPool threads, so you'd better increase the number of threads that the ThreadPool creates instantly on demand to a higher value than the default Environment.ProcessorCount:
ThreadPool.SetMinThreads(100, 100); // At the start of the program
I am proposing this solution not because it is optimal, but because it is easy to implement. My understanding is that wasting some RAM on ~30 threads that are going to be blocked most of the time won't have any measurable negative effect on the type of application that you are working with.
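The effect of the limited-concurrency scheduler can be observed without Dataflow by scheduling blocking work items directly on it. This is only an illustrative sketch: BlockingWork is a hypothetical stand-in for a synchronous database call, and the limit is 3 rather than 30.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
    TaskScheduler.Default, maxConcurrencyLevel: 3).ConcurrentScheduler;

var gate = new object();
int running = 0, maxRunning = 0;

// Synchronous (blocking) work item, as the approach above requires.
void BlockingWork()
{
    lock (gate) { running++; maxRunning = Math.Max(maxRunning, running); }
    Thread.Sleep(20); // stands in for a blocking database call
    lock (gate) { running--; }
}

// Queue 12 work items on the limited scheduler.
var tasks = Enumerable.Range(0, 12)
    .Select(_ => Task.Factory.StartNew(
        BlockingWork, CancellationToken.None, TaskCreationOptions.None, scheduler))
    .ToArray();

await Task.WhenAll(tasks);
Console.WriteLine($"Max concurrent work items: {maxRunning}");
```

If BlockingWork awaited instead of blocking, the scheduler would consider the work item finished at the first await and the limit would no longer hold, which is why the answer insists on synchronous delegates.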
I have a list, and currently we pass a single item at a time to another method:
foreach (var contact in tracker.Parse())
Basically, we are selecting contacts from Azure Blob Storage and importing them into Dynamics CRM. tracker.Parse() returns a list of contacts.
I want to select 1000 at a time and wait until the other method has finished processing them before I pass in the next 1000.
Need guidance on how to do this.
Appreciate the assistance!
Group the data from the data source into the desired group size and process them.
Using LINQ, this can be achieved with the Select and GroupBy extensions:
int groupSize = 1000;
//Batch the data
var batches = tracker.Parse()
.Select((contact, index) => new { contact, index })
.GroupBy(_ => _.index / groupSize, _ => _.contact);
//Process the batches.
foreach (var batch in batches) {
//Each batch would be IEnumerable<TContact>
ProcessBatch(batch);
}
This can be converted into a reusable generic method if needed:
public static void ProcessInBatches<TItem>(this IEnumerable<TItem> items, int batchSize, Action<IEnumerable<TItem>> processBatch) {
//Batch the data
var batches = items
.Select((item, index) => new { item, index })
.GroupBy(_ => _.index / batchSize, _ => _.item);
//Process the batches.
foreach (var batch in batches) {
//Each batch would be IEnumerable<TItem>
processBatch(batch);
}
}
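On .NET 6 or later, the built-in Enumerable.Chunk method produces the same kind of batches without the Select/GroupBy step. A small sketch, assuming .NET 6+ and using a sequence of integers in place of the contacts:

```csharp
using System;
using System.Linq;

// Sample data standing in for the contacts (an assumption for the demo).
var items = Enumerable.Range(0, 5841);

// Chunk splits the sequence into arrays of at most 1000 items each.
int[][] batches = items.Chunk(1000).ToArray();

foreach (int[] batch in batches)
    Console.WriteLine($"Processing {batch.Length} items, first = {batch[0]}");
```

Chunk streams the source and yields each batch as an array, so the final batch simply contains whatever is left over (841 items here).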
Maybe something like this...
static void Main()
{
const int batchSize = 1000;
// Populate list with 5841 items of data
var lotsOfItems = new List<int>();
for (int i = 0; i < 5841; i++)
{
lotsOfItems.Add(i);
}
// Process items in batches, waiting for each batch to complete before the next
int itemsProcessed = 0;
// Compare against Count (not Count - 1) so a trailing single item is not skipped
while (itemsProcessed < lotsOfItems.Count)
{
var itemsTaken = lotsOfItems.Skip(itemsProcessed).Take(batchSize).ToList();
ProcessItems(itemsTaken);
itemsProcessed += itemsTaken.Count;
}
Console.Write("Done. Press any key to quit...");
Console.ReadKey();
}
static void ProcessItems(IEnumerable<int> input)
{
// do something with input
Console.WriteLine(new string('-', 15));
Console.WriteLine($"Processing a new batch of {input.Count()} items:");
Console.WriteLine(string.Join(",", input));
}
I have a for loop inside of which
First : I want to compute the SQL required to run
Second : Run the SQL asynchronously without waiting for them individually to finish in a loop
My code looks like:
for (
int i = 0;
i < gm.ListGroupMembershipUploadDetailsInput.GroupMembershipUploadInputList.Count;
i++)
{
// Compute
SQL.Upload.UploadDetails.insertGroupMembershipRecords(
gm.ListGroupMembershipUploadDetailsInput.GroupMembershipUploadInputList[i],max_seq_key++,max_unit_key++,
out strSPQuery,
out listParam);
//Run the out SPQuery async
Task.Run(() => rep.ExecuteStoredProcedureInputTypeAsync(strSPQuery, listParam));
}
The insertGroupMembershipRecords method in a separate DAL class looks like :
public static GroupMembershipUploadInput insertGroupMembershipRecords(GroupMembershipUploadInput gm, List<ChapterUploadFileDetailsHelper> ch, long max_seq_key, long max_unit_key, out string strSPQuery, out List<object> parameters)
{
GroupMembershipUploadInput gmHelper = new GroupMembershipUploadInput();
gmHelper = gm;
int com_unit_key = -1;
foreach(var item in com_unit_key_lst){
if (item.nk_ecode == gm.nk_ecode)
com_unit_key = item.unit_key;
}
int intNumberOfInputParameters = 42;
List<string> listOutputParameters = new List<string> { "o_outputMessage" };
strSPQuery = SPHelper.createSPQuery("dw_stuart_macs.strx_inst_cnst_grp_mbrshp", intNumberOfInputParameters, listOutputParameters);
var ParamObjects = new List<object>();
ParamObjects.Add(SPHelper.createTdParameter("i_seq_key", max_seq_key, "IN", TdType.BigInt, 10));
ParamObjects.Add(SPHelper.createTdParameter("i_chpt_cd", "S" + gm.appl_src_cd.Substring(1), "IN", TdType.VarChar, 4));
ParamObjects.Add(SPHelper.createTdParameter("i_nk_ecode", gm.nk_ecode, "IN", TdType.Char, 5));
// rest of the method
}
But when I tried it with a list of 2k items, it did not insert 2k records into the DB, only 1.
Why does this not insert all the records in the input list?
What am I missing?
Task.Run in a for loop
Even though this is not the question, the title itself is what I'm going to address. For CPU-bound operations you could use Parallel.For or Parallel.ForEach, but since we are I/O bound (i.e., database calls) we should rethink this approach.
The obvious answer here is to create a list of tasks that represent the asynchronous operations and then await them using the Task.WhenAll API like this:
public async Task InvokeAllTheSqlAsync()
{
var list = gm.ListGroupMembershipUploadDetailsInput.GroupMembershipUploadInputList;
var tasks = Enumerable.Range(0, list.Count).Select(i =>
{
var value = list[i];
string strSPQuery;
List<object> listParam; // matches the out List<object> parameter of insertGroupMembershipRecords
SQL.Upload.UploadDetails.insertGroupMembershipRecords(
value,
max_seq_key++,
max_unit_key++,
out strSPQuery,
out listParam
);
return rep.ExecuteStoredProcedureInputTypeAsync(strSPQuery, listParam);
}).ToList(); // materialize once so the counters are incremented exactly once per item
await Task.WhenAll(tasks);
}
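One plausible reason the original loop inserts far fewer rows than expected is that strSPQuery and listParam are declared outside the loop, so every Task.Run lambda captures the same two variables and may read them after a later iteration has overwritten them. The sketch below reproduces that capture behaviour deterministically by invoking the lambdas after the loop; the query strings are made up for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var shared = new List<Func<string>>();
var perIteration = new List<Func<string>>();

string query = ""; // declared outside the loop, like strSPQuery in the question
for (int i = 0; i < 3; i++)
{
    query = $"CALL proc({i})";
    shared.Add(() => query);            // captures the variable, not its current value

    string localQuery = query;          // fresh variable for each iteration
    perIteration.Add(() => localQuery);
}

Console.WriteLine(string.Join("; ", shared.Select(f => f())));
// prints "CALL proc(2); CALL proc(2); CALL proc(2)"
Console.WriteLine(string.Join("; ", perIteration.Select(f => f())));
// prints "CALL proc(0); CALL proc(1); CALL proc(2)"
```

Declaring the variables inside the lambda, as the answer above does, gives each operation its own copy. The original loop also never awaits the tasks it starts, so failures would go unobserved.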