Looping through records and processing them in batches - c#

I have a text file with X number of records that have 24 pipe delimited fields.
ABCDEFG|123456|BILLING|1234567|12345678|12345678|...
My concern is with the BILLING column. I need to append the current date and a sequential number to this word, e.g. BILLING-20131021-1, but here is the trick: the number must increment only for each 10% of the records. So for example if I have 100 records, the first ten of them will end with 1, the next ten will end with 2, and so on. If the count is uneven, then the remainder will acquire the next sequence number.
I started with two loops, but that didn't produce the results. The first loop iterates through the record count and the second iterates through the first 10% of the records, but then I can't figure out how to get to the next batch of records.
for (uint recordCount = 0; recordCount < RecordsPerBatch; recordCount++)
{
    for (uint smallCount = 0; smallCount < (RecordsPerBatch / 10); smallCount++)
    {
    }
}

You can simply loop through, keep a counter, and only increment the "small count" when you hit a defined condition, i.e.:
int smallCount = 0;
for (int recordCount = 0; recordCount < totalRecords; ++recordCount)
{
    // Bump the batch number each time a 10% boundary is crossed;
    // smallCount is then 1 for the first 10%, 2 for the next 10%, and so on.
    if (recordCount % (totalRecords / 10) == 0)
        ++smallCount;
}
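Inside that loop you could then build the suffix from the counter; a one-line sketch of the idea (the date formatting here is an assumption, not taken from your file handling):
string billing = string.Format("BILLING-{0:yyyyMMdd}-{1}", DateTime.Today, smallCount);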

Maintaining your current logic, you could add another variable that keeps the batch counter and simplify the condition in the inner loop by calculating the batch size (10% of the total records).
You also need to check that the index in the inner loop doesn't exceed the total record count.
uint TotalRecordCounter = 101;
uint currentBatch = 1;
uint batchSize = TotalRecordCounter / 10;
// This will account for a total record count that is not exactly divisible by 10.
// But if it is allowed to have more than 10 batches, then remove it:
// if ((TotalRecordCounter % 10) != 0)
//     batchSize++;
for (uint recordCount = 0; recordCount < TotalRecordCounter; recordCount += batchSize)
{
    for (uint smallCount = 0;
         smallCount < batchSize && (recordCount + smallCount) < TotalRecordCounter;
         smallCount++)
    {
        // "MM" is the month specifier; "mm" would give minutes.
        string billing = string.Format("BILLING-{0:yyyyMMdd}-{1}", DateTime.Today, currentBatch);
    }
    currentBatch++;
}

If I'm understanding correctly, you're saying the issue is figuring out when you hit that 10% (and 20%, 30%, etc.) of the file threshold. Giving a good answer depends a lot on what your system is capable of, but there are many ways to do this and, even in the worst case, you can do it without nested loops.
Can you find the exact number of lines in the file?
If your file isn't gigantic and it is line-delimited (one record per line), this is the easiest solution: just read the file in as a string array. Then you go through each line and generate the last number each time, using current_record / exact_count; a sketch of that idea follows below.
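A minimal sketch, assuming one record per line (the path and the field handling are placeholders):
using System;
using System.IO;

string[] lines = File.ReadAllLines(@"C:\data\records.txt");  // hypothetical path
int batchSize = Math.Max(1, lines.Length / 10);               // 10% of the records
string stamp = DateTime.Today.ToString("yyyyMMdd");

for (int i = 0; i < lines.Length; i++)
{
    // 1 for the first 10%, 2 for the next 10%, ...; any remainder falls into the next number.
    int batchNumber = (i / batchSize) + 1;
    string suffix = string.Format("BILLING-{0}-{1}", stamp, batchNumber);
    // split lines[i] on '|' and replace the BILLING field with suffix here
}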
Can you calculate the exact number of lines in the file?
If your records are fixed-length, you can take the file size, divide by the record size, and therefore calculate the exact number of records, and then generate the last number as above.
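For example (the path is a placeholder and the record size is an assumption; it must include any line-terminator bytes):
using System.IO;

const long recordSizeInBytes = 256;                            // assumption: your fixed record length
long fileBytes = new FileInfo(@"C:\data\records.txt").Length;  // hypothetical path
long recordCount = fileBytes / recordSizeInBytes;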
Is it sufficient to estimate the number of lines in the file?
Same idea as the previous suggestion, only use an estimate of your average record size.
What type of stream are you using to read the file?
If it's one where you can find the total length of the stream and your current position, you can calculate your percentage using that, instead of using row indices.
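A rough sketch for a seekable stream such as FileStream (the stream variable is a placeholder):
// Position / Length gives the fraction of the file consumed so far.
double fractionDone = (double)stream.Position / stream.Length;
int batchNumber = Math.Min(10, (int)(fractionDone * 10) + 1);  // clamp the very last record into batch 10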
Final fallback suggestion
If none of the other suggestions work, you can always do a simple two pass solution. In the first one, simply read through and count the number of records. If you can fit it into memory, store each record and then parse it in-memory. If you can't, just read the file to count the number of entries, then read it again.
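A streaming two-pass sketch (File.ReadLines enumerates lazily, so neither pass loads the whole file; the path is a placeholder):
using System;
using System.IO;
using System.Linq;

string path = "records.txt";                       // placeholder path
int total = File.ReadLines(path).Count();          // pass 1: just count the records
int batchSize = Math.Max(1, total / 10);
int index = 0;
foreach (string line in File.ReadLines(path))      // pass 2: process, knowing the total
{
    int batchNumber = (index / batchSize) + 1;
    // process 'line' with batchNumber here
    index++;
}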

Related

Why is Queue consuming so much memory?

Basically I was doing a code kata on the Codewars site to kind of 'warm up' before starting to code, and noticed a problem that I don't know is because of my code or just a regular thing.
public static string WhoIsNext(string[] names, long n)
{
Queue<string> fifo = new Queue<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = fifo.Dequeue();
fifo.Enqueue(name);
fifo.Enqueue(name);
}
return fifo.Peek();
}
And is called like this:
// Test 1
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 1;
var nth = CodeKata.WhoIsNext(names, n); // n = 1 Should return Sheldon.
// test 2
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 52;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Penny.
// test 3
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 7230702951;
var nth = CodeKata.WhoIsNext(names, n); // n = 7230702951 Should return Leonard.
In this code, when I give long n the value 7230702951 (a really high number...), it throws an out-of-memory exception. Is the number just that high, or is the queue not optimized for such numbers?
I say this because I tried using a List: the list's memory usage stayed under 500 MB (the plateau was around 327 MB, btw) while running for about 2-3 minutes, whereas the queue threw the exception in a matter of seconds and went over 2 GB in just that time alone.
Can someone explain to me why this is happening? I'm just curious.
edit 1
I forgot to add the List code:
public static string WhoIsNext(string[] names, long n)
{
List<string> test = new List<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = test[0];
test.RemoveAt(0);
test.Add(name);
test.Add(name);
}
return test[0];
}
edit 2
For those saying that the code doubles the names and is inefficient: I already know that, the code isn't meant to be useful, it's just a kata. (I updated the link now!)
My question is why Queue is so much more inefficient than List with high count numbers.
Part of the reason is that the queue code is way faster than the List code, because queues are optimised for deletes due to the fact that they are a circular buffer. Lists aren't - the list copies the array contents every time you remove that first element.
Change the input value to 72307000, for example. On my machine, the queue finishes that in less than a second. The list is still chugging away minutes (and at this rate, hours) later; after 4 minutes, i is only at 752408 (it has done almost 1% of the work).
Thus, I am not sure the queue is less memory efficient. It is just so fast that you run into the memory issue sooner. The list almost certainly has the same issue (the way that List and Queue do array size doubling is very similar) - it will just likely take days to run into it.
To a certain extent, you could predict this even without running your code. A queue with 7230702951 entries in it (running 64-bit) will take a minimum of 8 bytes per entry. So 57845623608 bytes. Which is larger than 50GB. Clearly your machine is going to struggle to fit that in RAM (plus .NET won't let you have an array that large)...
Additionally, your code has a subtle bug. The loop can't ever end (if n is greater than int.MaxValue). Your loop variable is an int but the parameter is a long. Your int will overflow (from int.MaxValue to int.MinValue with i++). So the loop will never exit, for large values of n (meaning the queue will grow forever). You likely should change the type of i to long.
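A minimal fix for the overflow part (the runaway queue growth is a separate matter, of course):
for (long i = 0; i < n - 1; i++)   // long loop variable, so it can actually reach n - 1
{
    var name = fifo.Dequeue();
    fifo.Enqueue(name);
    fifo.Enqueue(name);
}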

C# Using ArrayList.IndexOf with multiple identical items

I've got a situation where I'm using an ArrayList to store a list of file names (full file path). When I add multiple items of the same file to the array, then use ArrayList.IndexOf to find the index (I'm reporting progress to a BackgroundWorker), it always returns the index of the first item since it's searching by the file name. This causes a problem with a progress bar, i.e. I'm processing 3 files, and when complete, the bar is only 1/3 full.
Here's some sample code (I'm simply adding items here, but in the real code they are added from a ListBox):
ArrayList list = new ArrayList();
list.Add(@"C:\Test\TestFile1.txt");
list.Add(@"C:\Test\TestFile1.txt");
list.Add(@"C:\Test\TestFile1.txt");
var files = list;
foreach (string file in files)
    backgroundWorker1.ReportProgress(files.IndexOf(file) + 1);
When this runs, only 1 "percent" of progress is reported, since the IndexOf is finding the same file every time. Is there any way around this? Or does anyone have a suggestion on another way to get the index (or any unique identifier for each item)?
The simplest approach is just to use the index to iterate:
for (int i = 0; i < list.Count; i++)
{
    backgroundWorker1.ReportProgress(i + 1);
    // Do stuff with list[i]
}
To achieve what you want, I would recommend using a for loop. You won't have to search for any indexes and can report the progress easily to backgroundWorker1:
for (int counter = 0; counter < files.Count; counter++)
{
backgroundWorker1.ReportProgress(counter + 1);
}
By doing this, you don't get problems with identical filenames.
This would be the equivalent with foreach:
int counter = 1;
foreach (string file in files)
{
backgroundWorker1.ReportProgress(counter);
counter++;
}
But I think it's better to use for in this case.
You could make a copy of your list, and then remove each item from it as it is processed. Then you could return the percentage complete as a function of the count of the temp list divided by the count of the original list.
You could also just add to an int every time your background workers process a file and use that int to track the progress (percent complete should be the int divided by the number of files).
Of course with either approach you need to make sure you are modifying the variables in question in a thread-safe manner.
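As a rough sketch of that second idea, using Interlocked so several workers can bump the counter safely (processedCount and totalFiles are placeholder names, not taken from your code):
using System.Threading;

// Shared state (e.g. fields on the form):
int processedCount = 0;
int totalFiles;          // set this to files.Count before starting the workers

// ...inside whatever code finishes processing one file:
int done = Interlocked.Increment(ref processedCount);
backgroundWorker1.ReportProgress((int)(done * 100.0 / totalFiles));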
Try this; it works fine for me when reading files with many lines of data.
int percentage = (int)((count / (double)lineCount) * 100.0);
backgroundWorker5.ReportProgress(percentage);
Here count is the current running count and lineCount is the total line count. Using these two, calculate the percentage. Keep this code inside your for or while loop.
Assuming the progress bar's max value is 100, and that with one file processed out of three it should show 33% progress, you can compute a per-file progress step and recalculate the overall progress:
// list contains x file paths
var files = list;
// Use 100.0 to force floating-point division; 1 / files.Count would truncate to 0.
double progressStep = 100.0 / files.Count;
for (int i = 0; i < files.Count; i++)
{
    backgroundWorker1.ReportProgress((int)((i + 1) * progressStep));
...
}
All that remains is to assign the progress value to the progress bar in the background worker's ProgressChanged handler.

Video rate image construction from binary data performance

First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like the following C-style struct:
struct {
    unsigned TimeTag  :16;
    unsigned Channel  :12;
    unsigned Route    :2;
    unsigned Valid    :1;
    unsigned Reserved :1;
} TTTRrecord;
Whether we are dealing with a photon record or a sync event, TimeTag will always hold the time of the event relative to the start of the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give either a value indicating start of frame or end of scan line in a frame.
The last relevant bit of info is that Timetag is 16 bit and thus obviously limited. If the Timetag counter rolls over, the rollover counter is incremented. This rollover (overflow) count can easily be obtained from channel overflow = Channel & 2048.
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can look over all available data, find all overflows, which allows me to reconstruct the sequential macro time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
// Will hold the pixels obviously
int[] linePixels = new int[pixelCount];
// Calculate everything (sync, overflow, ...) from the raw records
int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();
int[] absTime = new int[overflow.Length];
absTime[0] = 0;
Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();
long lineStartTime = absTime[0];
int tempIdx = 0;
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
for (int i = tempIdx; i < someRecords.Length; i++)
{
if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
{
count++;
}
}
// Avoid checking records in the raw data that were already binned to a pixel.
linePixels[j] = count;
tempIdx += count;
}
return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
lineStartTime += pixelduration;
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
}
// Doing it here is faster.
linePixels[j] = count;
tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now my programming has dealt with rather trivial things, so I lack some experience, but here are things I think I might consider:
Matlab is/seems very efficient with vectorized operations. Could I achieve similar things in C#, i.e. by using Microsoft.Bcl.Simd (a rough sketch of what I mean follows after this list)? Is my case suited for something like this? Would I see gains even in my VM or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
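For reference, this is the kind of vectorized counting I have in mind, written against the System.Numerics vector types (the successor to Microsoft.Bcl.Simd). It is an untested sketch and assumes the records for a line have already been reduced to a sorted array of photon times:
using System.Numerics;

static int CountBelow(int[] photonTimes, int start, int boundary)
{
    // Counts entries in photonTimes[start..] that are < boundary, one vector's worth at a time.
    var boundaryVec = new Vector<int>(boundary);
    int count = 0;
    int i = start;
    for (; i <= photonTimes.Length - Vector<int>.Count; i += Vector<int>.Count)
    {
        var v = new Vector<int>(photonTimes, i);
        // LessThan sets a lane to -1 (all bits) where the comparison holds, 0 otherwise,
        // so the dot product with a vector of ones is minus the number of matching lanes.
        count -= Vector.Dot(Vector.LessThan(v, boundaryVec), Vector<int>.One);
    }
    for (; i < photonTimes.Length; i++)          // scalar tail
        if (photonTimes[i] < boundary) count++;
    return count;
}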
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
else
{
break;
}
}

batch commit rows to database in c# using a counter

I am developing a file parser in C#. I have a file with many, many rows that I want to write to a database.
I have created a loop in which I read the file and issue an insert.
However, what I would like to do is read, say, 100 rows, then commit all 100, then read the next 100 and commit them, and so on and so forth.
My problem (a bit conceptual) is how to handle the last set of rows, as it could be fewer than 100.
For example, if my file has 550 rows, I can read 5 chunks of 100 and commit them using the modulus operator on the line counter, but what happens to the last batch of 50?
Any ideas in this area will be highly appreciated.
Thanks.
Off the top of my head here:
First option: do a quick count of '\n', thus getting the number of lines. Then read chunks in multiples of 2 such that the result comes very near the number of lines; for example, if 501 lines exist, read chunks of 250 and 250, and only 1 line is left. When you first run the code, check for odd or even: if odd, remember to commit the last line; if the number of lines is even, there is no need to compensate for the last line. Hope this helps.
P.S.: After dividing the number into multiples of 2, if it is still very large, you can further divide by 2 (divide and conquer!).
If your read function returns false when there isn't anything else to read, then you can try this:
int i = 1;
while (read())
{
    if (i % 100 == 0)
    {
        // Commit the 100 rows read since the last commit.
        for (int j = 0; j < 100; j++)
        {
            write();
        }
    }
    i++;
}
// i is now (rows read) + 1, so the leftover batch holds (i - 1) % 100 rows.
for (int j = 0; j < (i - 1) % 100; j++)
{
    write();
}
Something like this, perhaps? The following is pseudocode:
int rowCount = 0;
while more records to read
{
// Get record
// Parse record
// Do something with results
if(++rowCount % 100 == 0)
{
// Commit
rowCount = 0;
}
}
// If rowCount is 0 here, the last thing you did was a commit (unless you had zero records
// to read, but I assume you can code your way around that!), so you only need a final
// commit if there are still uncommitted records.
if(rowCount > 0)
{
// Commit remaining records
}
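For completeness, a rough sketch of how the counting approach could look with real transactions (the connection string, table, column and file path are placeholders, and System.Data.SqlClient is assumed):
using System.Data.SqlClient;
using System.IO;

string connectionString = "...";                 // placeholder
string path = "records.txt";                     // placeholder
const int BatchSize = 100;
int pending = 0;

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    SqlTransaction tx = connection.BeginTransaction();

    foreach (string line in File.ReadLines(path))
    {
        using (var cmd = new SqlCommand(
            "INSERT INTO MyTable (Value) VALUES (@value)", connection, tx))
        {
            cmd.Parameters.AddWithValue("@value", line);   // real code would parse the row first
            cmd.ExecuteNonQuery();
        }

        if (++pending == BatchSize)
        {
            tx.Commit();                                   // commit a full batch of 100
            tx = connection.BeginTransaction();
            pending = 0;
        }
    }

    if (pending > 0)
        tx.Commit();                                       // the last, possibly smaller batch
    else
        tx.Dispose();
}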

Series calculation

I have some random integers like
99 20 30 1 100 400 5 10
I have to find a sum from any combination of these integers that is closest (equal or more, but not less) to a given number like
183
What is the fastest and most accurate way of doing this?
If your numbers are small, you can use a simple Dynamic Programming(DP) technique. Don't let this name scare you. The technique is fairly understandable. Basically you break the larger problem into subproblems.
Here we define the problem in terms of can[number]: if number can be constructed from the integers in your file, then can[number] is true, otherwise it is false. Obviously 0 is constructible by not using any numbers at all, so can[0] is true. Now you try to use every number from the input file, and for each sum j you check whether it is achievable: if some already-achievable sum plus the current number equals j, then j is clearly achievable too. If you want to keep track of which numbers made a particular sum, use an additional prev array, which stores the last number used to make that sum. See the code below for an implementation of this idea:
int N = numbers.Length;                    // numbers[] holds the integers read from the file
int SUM = 183;                             // the target
int UPPER_BOUND = numbers.Sum();           // the largest number you can construct (needs System.Linq)
bool[] can = new bool[UPPER_BOUND + 1];    // can[number] is true if number can be constructed
can[0] = true;                             // 0 is always achievable by not using any number
int[] prev = new int[UPPER_BOUND + 1];     // prev[number] is the last number used to achieve sum "number"
for (int i = 0; i < N; i++)                // Try to use every number (numbers[i]) from the input file
{
    for (int j = UPPER_BOUND; j >= 1; j--) // Try to see if j is an achievable sum
    {
        if (can[j]) continue;              // It is already an achieved sum, so go to the next j
        // If (an already achievable sum) + numbers[i] == j, then j is obviously achievable
        if (j - numbers[i] >= 0 && can[j - numbers[i]])
        {
            can[j] = true;
            prev[j] = numbers[i];          // To achieve j we used numbers[i]
        }
    }
}
int CLOSEST_SUM = -1;
for (int i = SUM; i <= UPPER_BOUND; i++)
    if (can[i])
    {
        // The closest number to SUM (not smaller than SUM) is i
        CLOSEST_SUM = i;
        break;
    }
int currentSum = CLOSEST_SUM;
do
{
    int usedNumber = prev[currentSum];
    Console.WriteLine(usedNumber);
    currentSum -= usedNumber;
} while (currentSum > 0);
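For the sample input in the question (99 20 30 1 100 400 5 10) with SUM = 183, this reports CLOSEST_SUM = 199, built from 99 + 100: no subset of those numbers sums to anything in the range 183..198, so 199 is the smallest achievable sum that does not fall below the target.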
This seems to be a Knapsack-like problem, where the value of your integers would be the "weight" of each item, the "profit" of each item is 1, and you are looking for the least number of items to exactly sum to the maximum allowable weight of the knapsack.
This is a variant of the SUBSET-SUM problem, and is also NP-Hard like SUBSET-SUM.
But if the numbers involved are small, pseudo-polynomial time algorithms exist. Check out:
http://en.wikipedia.org/wiki/Subset_sum_problem
OK, more details.
The following problem is NP-Hard: given an array of integers and two integers a and b, is there some subset whose sum lies in the interval [a, b]?
This is so because we can solve subset-sum by choosing a=b=0.
Now this problem easily reduces to your problem and so your problem is NP-Hard too.
Now you can use the polynomial time approximation algorithm mentioned in the wiki link above.
Given an array of N integers, a target S and an approximation threshold c,
there is a polynomial time approximation algorithm (involving 1/c) which tells if there is a subset sum in the interval [(1-c)S, S].
You can use this repeatedly (by some form of binary search) to find the best approximation to S you need. Note you can also use this on intervals of the form [S, (1+c)S], while the knapsack will only give you a solution <= S.
Of course there might be better algorithms, in fact I can bet on it. There should be plenty of literature on the web. Some search terms you can use: approximation algorithms for subset-sum, pseudo-polynomial time algorithms, dynamic programming algorithm etc.
A simple brute-force method would be to read the text in, parse it into numbers, and then go through all combinations until you find the required sum.
A quicker solution would be to sort the numbers, then...
Add the largest number to your sum. Is it too big? If so, take it off and try the next smallest.
If the sum is too small, add the next largest number and repeat.
Continue adding numbers, not letting the sum exceed the target. Finish when you hit the target.
Note that when you backtrack, you may need to backtrack more than one level. Sounds like a good case for recursion...
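That pruning (stop as soon as the running sum reaches or passes the target) maps naturally onto recursion. A minimal exhaustive sketch, practical only for small inputs, with all names being placeholders:
using System;

// Returns the smallest achievable sum >= target, or int.MaxValue if none exists.
// Sorting the input in descending order makes the early "current >= target" exit fire sooner.
static int ClosestAtLeast(int[] sortedDesc, int index, int current, int target)
{
    if (current >= target)
        return current;                  // at or past the target; adding more only makes it worse
    if (index == sortedDesc.Length)
        return int.MaxValue;             // ran out of numbers without reaching the target

    // Branch 1: take sortedDesc[index]; Branch 2: skip it (this is the backtracking step).
    int take = ClosestAtLeast(sortedDesc, index + 1, current + sortedDesc[index], target);
    int skip = ClosestAtLeast(sortedDesc, index + 1, current, target);
    return Math.Min(take, skip);
}

// For the question's input, ClosestAtLeast(new[] { 400, 100, 99, 30, 20, 10, 5, 1 }, 0, 0, 183) returns 199.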
If the numbers are large you can turn this into an integer program. Using Mathematica's solver, it might look something like this:
nums = {99, 20, 30, 1, 100, 400, 5, 10};
vars = a /@ Range@Length@nums;
Minimize[(vars.nums - 183)^2, vars, Integers]
You can sort the list of values, find the first value that's greater than the target, and start concentrating on the values that are less than the target. Find the sum that's closest to the target without going over, then compare that to the first value greater than the target. If the difference between the closest sum and the target is less than the difference between the first value greater than the target and the target, then you have the sum that's closest.
Kinda hokey, but I think the logic hangs together.
