I am developing a file parser in C#: I have a file with many rows that I want to write to a database.
I have created a loop in which I read the file and issue an insert.
However, what I would like to do is read, say, 100 rows, commit all 100, then read the next 100 and commit them, and so on.
My problem (a bit conceptual) is how to handle the last set of rows, since it may contain fewer than 100.
For example, if my file has 550 rows, I can read 5 chunks of 100 and commit them using the modulus operator on the line counter, but what happens to the last batch of 50?
Any ideas in this area will be highly appreciated.
Thanks.
Off the top of my head:
First option: do a quick count of '\n' characters, thus getting the number of lines. Then read chunks in multiples of 2 such that the result comes very near the number of lines; for example, if 501 lines exist, read chunks of 250 and 250, leaving only 1 line. When you first run the code, check for odd or even: if odd, remember to commit the last line; if even, there is no need to compensate for the last line. Hope this helps.
P.S.: After dividing the number by 2, if the halves are still very large, you can divide by 2 again (divide and conquer!)
If your read function returns false when there is nothing left to read, then you can try this:

int i = 0;
while (read())
{
    i++;
    if (i % 100 == 0)
    {
        for (int j = 0; j < 100; j++)
        {
            write();
        }
    }
}
// Flush the final partial batch (i % 100 rows).
for (int j = 0; j < i % 100; j++)
{
    write();
}
Something like this, perhaps? The following is pseudocode:

int rowCount = 0;
while (more records to read)
{
    // Get record
    // Parse record
    // Do something with results
    if (++rowCount % 100 == 0)
    {
        // Commit
        rowCount = 0;
    }
}
// If rowCount is 0 here, you just committed the last batch (unless there were
// zero records to read, but I assume you can code your way around that!),
// so you only need a final commit if some records are still uncommitted.
if (rowCount > 0)
{
    // Commit remaining records
}
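To make the batching concrete, here is a minimal C# sketch of the read-100-then-commit loop. `CommitBatch` is a hypothetical stand-in for whatever insert-and-commit call you issue against your database:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class Importer
{
    // Hypothetical: wrap your INSERTs in a transaction and commit here.
    static void CommitBatch(List<string> rows) { /* ... */ }

    public static void ImportFile(string path, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                CommitBatch(batch);   // full batch of 100
                batch.Clear();
            }
        }
        if (batch.Count > 0)
        {
            CommitBatch(batch);       // final partial batch, e.g. the last 50
        }
    }
}
```

The final `if` is the whole trick: whatever is left over after the loop, whether 50 rows or 1, gets committed in one last call.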
I am doing some schema JavaScript/C# coding and I have a question. For a little more context, I am working on breadcrumb schema. My goal is to find some way to increment counter so that it is equal to k by the end, so I can stop a comma from showing on the last iteration of the foreach loop. Right now, obviously, they both increment at the same rate. I am having the biggest brain fart on where to place counter++ to get it to increment and end up equal after the foreach is completed. Both starting integer values should stay as they are; just changing the counter to 1 is not what I am looking for :) Also, the counter has to be in the foreach loop.
Pseudo code below:
k = 1;
counter = 0;
foreach (string str in array1)
{
    Schema Code
    Schema Code
    for (i = 0; i > 0; i++)
    {
        k++
        counter++ (not the right location, just for reference)
    }
    if (k < counter)
    {
        print a comma
    }
    else if (k >= counter)
    {
        print a space
    }
}
Update: my real code would be where position is. I don't have access to my code at this moment, but the foreach runs through the number of positions there are on the page; at the last position it should not write the comma.
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [{
    "@type": "ListItem",
    "position": 5
  }]
}
</script>
Rather than having two separate counters, I recommend starting by taking the length of the array and using a counter to print a comma while the counter is less than the number of total items in the array. That way, when you finish you'll have one less comma than array item, and you won't have a comma at the end of what you've printed.
This is kind of involved but, since you really don't want to have the comma at the end of what you've printed, it may best fit your needs.
Here's some pseudocode based on what you wrote above:
// The length of the array needs to be captured up front because, depending on
// what your code does, it may change the array length as your loop runs.
const arrayLength = the length of your array when you start
int commaCount = 0
foreach (string str in array1)
{
    Schema Code
    Schema Code
    if (commaCount < (arrayLength - 1))
    // (< arrayLength - 1) so that no comma is printed after the last item
    {
        print a comma
    }
    commaCount++   // count this item; without this the check never changes
}
Let me know if that helps.
(Please disregard my syntax. This is pseudocode, so definitely don't copy and paste what I just wrote.)
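For what it's worth, in real C# (as opposed to pseudocode) the trailing comma is usually avoided entirely with `string.Join`, which puts the separator only between items. A small sketch with made-up sample data:

```csharp
using System;

class JoinDemo
{
    static void Main()
    {
        string[] array1 = { "Home", "Products", "Widgets" };  // sample data

        // string.Join inserts the separator only *between* elements,
        // so there is never a trailing comma to suppress.
        Console.WriteLine(string.Join(",", array1));  // Home,Products,Widgets

        // Counter-based equivalent: print the separator *before*
        // every item except the first.
        for (int i = 0; i < array1.Length; i++)
        {
            if (i > 0) Console.Write(",");
            Console.Write(array1[i]);
        }
        Console.WriteLine();
    }
}
```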
I debugged my code and it seems correct, but it never writes to the console for some reason.
Here is my code:
long largest = 0;
for (long i = 1; i < 600851475144; i++)
{
    long check = 0;
    for (long j = 1; j < i + 1; j++)
    {
        if ((i % j) == 0)
        {
            check++;
        }
    }
    if (check == 2)
    {
        largest = i;
    }
}
Console.WriteLine(largest);
Console.ReadKey();
Question: How do I write to console?
Your algorithm is too slow to complete in a reasonable time, so you need to come up with an alternative approach.
First, the algorithm has to stop using the naive definition of primality (counting all divisors and checking for exactly two): if you check all divisors up to the square root of the number and find none, the number is prime. Second, if you are looking for the largest prime in a range, start at the top of the range, go down, and stop as soon as you find the first prime. Third, there is no point in trying even numbers.
Implementing these three changes will get your algorithm running in time.
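A sketch of those three changes (my own illustration, not the poster's code): check divisors only up to the square root, skip even candidates, and walk down from the top of the range, stopping at the first prime:

```csharp
using System;

class PrimeSearch
{
    static bool IsPrime(long n)
    {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;           // 2 is the only even prime
        for (long d = 3; d * d <= n; d += 2)     // odd divisors up to sqrt(n)
        {
            if (n % d == 0) return false;
        }
        return true;
    }

    static void Main()
    {
        // Largest prime below an (arbitrary) upper bound: search downward
        // and stop at the first hit instead of scanning the whole range.
        for (long i = 99999; i >= 2; i--)
        {
            if (IsPrime(i))
            {
                Console.WriteLine(i);
                break;
            }
        }
    }
}
```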
What?
It will finish, but it will take forever because of all the iterations it has to do.
Prime finding is a very intensive calculation, especially done the way you have done it.
In short, it does not return because you would have to wait minutes/hours/days/years for it to compute.
Your algorithm is poor: it makes a huge number of iterations. As others have already mentioned, there is no sense in dividing by even numbers, so start with 3 and increment by two, and you can limit the search to the square root of the given number. Mine is not perfect either, but it finishes in a blink. The idea is to reduce the iteration count by dividing the given number by each divisor found.
Try at your own risk!
long FindLargestPrimeDivisor(long number)
{
    long largestPrimeDivisor = 2;
    while (largestPrimeDivisor <= number)
    {
        // Divide out this factor completely, so later composite
        // divisors (4, 6, 8, ...) can never match.
        while (number % largestPrimeDivisor == 0)
        {
            number /= largestPrimeDivisor;
        }
        if (number == 1)
        {
            break;
        }
        largestPrimeDivisor++;
    }
    return largestPrimeDivisor;
}
First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like the following C-style struct:

struct {
    unsigned TimeTag  : 16;
    unsigned Channel  : 12;
    unsigned Route    : 2;
    unsigned Valid    : 1;
    unsigned Reserved : 1;
} TTTRrecord;
Whether we are dealing with a photon record or a sync event, TimeTag will always hold the time of the event relative to the start of the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give a value indicating either the start of a frame or the end of a scan line in a frame.
The last relevant bit of info is that TimeTag is 16 bits and thus obviously limited. If the TimeTag counter rolls over, a rollover counter is incremented. This rollover (overflow) flag can easily be obtained from overflow = Channel & 2048.
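For reference, the bit fields described above can be unpacked from each Int32 record with shifts and masks. A self-contained sketch (the sample values are mine, not from the data set):

```csharp
using System;

class RecordDemo
{
    static void Main()
    {
        // Pack a sample record: TimeTag = 1234, Channel = 2055
        // (overflow bit set, sync type 7), Route = 0, Valid = 1.
        int record = 1234 | (2055 << 16) | (1 << 30);

        int timeTag  = record & 0xFFFF;          // bits 0-15
        int channel  = (record >> 16) & 0xFFF;   // bits 16-27
        int route    = (record >> 28) & 0x3;     // bits 28-29
        int valid    = (record >> 30) & 0x1;     // bit 30
        int overflow = (channel & 2048) >> 11;   // rollover flag inside Channel
        int syncType = channel & 7;              // frame / line marker

        // prints: 1234 2055 0 1 1 7
        Console.WriteLine("{0} {1} {2} {3} {4} {5}",
            timeTag, channel, route, valid, overflow, syncType);
    }
}
```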
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can loop over all available data and find all overflows, which allows me to reconstruct the sequential macro-time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
    // Will hold the pixels obviously
    int[] linePixels = new int[pixelCount];

    // Calculate everything (sync, overflow, ...) from the raw records
    int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
    int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
    int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
    int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();

    int[] absTime = new int[overflow.Length];
    absTime[0] = 0;
    Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
    absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();

    long lineStartTime = absTime[0];
    int tempIdx = 0;

    for (int j = 0; j < linePixels.Length; j++)
    {
        int count = 0;
        for (int i = tempIdx; i < someRecords.Length; i++)
        {
            if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
            {
                count++;
            }
        }
        // Avoid checking records in the raw data that were already binned to a pixel.
        linePixels[j] = count;
        tempIdx += count;
    }

    return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
    int count = 0;
    lineStartTime += pixelduration;
    for (int i = tempIdx; i < absTime.Length; i++)
    {
        //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
        // Seems quicker to calculate the boundary before...
        //if (valid[i] == 1 && lineStartTime >= absTime[i])
        // Quicker still...
        if (lineStartTime > absTime[i] && valid[i] == 1)
        {
            // Slow... looking into linePixels[] each iteration is a bad idea.
            //linePixels[j]++;
            count++;
        }
    }
    // Doing it here is faster.
    linePixels[j] = count;
    tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
Outlook
Until now my programming has dealt with rather trivial things, so I lack some experience, but here are things I think I might consider:
Matlab is/seems very efficient with vectorized operations. Could I achieve similar things in C#, e.g. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM, or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
    //if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
    // Seems quicker to calculate the boundary before...
    //if (valid[i] == 1 && lineStartTime >= absTime[i])
    // Quicker still...
    if (lineStartTime > absTime[i] && valid[i] == 1)
    {
        // Slow... looking into linePixels[] each iteration is a bad idea.
        //linePixels[j]++;
        count++;
    }
    else
    {
        break;
    }
}
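Going one step further: since absTime is sorted, the end index of each pixel's bin can also be found with a binary search instead of a linear scan. A self-contained sketch of the idea (my own illustration, not code from the repo; it ignores the valid flag, which you would filter out beforehand):

```csharp
using System;

class BinDemo
{
    // Count events per pixel by binary-searching each pixel's time boundary
    // in the sorted absTime array.
    static int[] BinByBoundary(long[] absTime, long lineStart, long pixelDuration, int pixelCount)
    {
        var pixels = new int[pixelCount];
        int prev = 0;
        for (int j = 0; j < pixelCount; j++)
        {
            long boundary = lineStart + (j + 1) * pixelDuration;
            int idx = Array.BinarySearch(absTime, boundary);
            if (idx < 0) idx = ~idx;        // bitwise complement = insertion point
            pixels[j] = idx - prev;         // events strictly before the boundary
            prev = idx;
        }
        return pixels;
    }

    static void Main()
    {
        long[] absTime = { 1, 2, 5, 6, 7, 12, 14 };   // sorted macro-times
        int[] px = BinByBoundary(absTime, 0, 5, 3);
        Console.WriteLine(string.Join(",", px));      // prints: 2,3,2
    }
}
```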
I can think of some very convoluted methods with loops and nested loops to solve this problem, but I'm trying to be more professional than that.
My scenario is that I need to enter a section of code at every ten percent, but it isn't quite working as expected: it enters the code about every percent. That is due to my code, but I lack the knowledge to know how to change it.
int currentPercent = (int)Math.Truncate((current * 100M) / total);

// avoid divide by zero error
if (currentPercent > 0)
{
    if (IsDivisible(100, currentPercent))
    {
        ....my code that works fine other than coming in too many times
    }
}
Helper referenced above where the trouble is:
private bool IsDivisible(int x, int y)
{
    return (x % y) == 0;
}
So it works as written: mod eliminates a currentPercent of 3, but 1 and 2 pass, when really I don't want a true value until currentPercent is 10, and then not again until 20, etc.
Thank you, and my apologies for the elementary question.
Mod will only catch exact occurrences of your interval. Try keeping track of your next milestone, you'll be less likely to miss them.
const int cycles = 100;
const int interval = 10;
int nextPercent = interval;
for (int index = 0; index <= cycles; index++)
{
    int currentPercent = (index * 100) / cycles;
    if (currentPercent >= nextPercent)
    {
        // Milestone reached: run your every-10% code here.
        nextPercent = currentPercent - (currentPercent % interval) + interval;
    }
}
I might misunderstand you, but it seems like you're making something extremely simple more complex than it needs to be. What about this?
for (int i = 1; i <= 100; i++)
{
    if (i % 10 == 0)
    {
        // Here, you can do what you want - this will happen
        // every ten iterations ("percent")
    }
}
Or, if your entire code enters from somewhere else (so no loop in this scope), the important part is the i % 10 == 0.
if (IsDivisible(100, currentPercent))
{
    ....my code that works fine other than coming in too many times
}
try changing that 100 to a 10. And I think your x and y are also backwards.
You can try a few sample operations using google calculator.
(20 mod 10) = 0
Not sure if I fully understand, but I think this is what you want? You also reversed the order of modulo in your code (100 mod percent, rather than the other way around):
int currentPercent = current * 100 / total;
if (currentPercent % 10 == 0)
{
    // your code here, every 10%, starting at 0%
}
Note that code this way only works properly if you are guaranteed to hit every percentage-mark. If you could, say, skip from 19% to 21% then you'll need to keep track of which percentage the previous time was to see if you went over a 10% mark.
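That bookkeeping can be sketched like this (loop bounds and names are mine; a deliberately small total makes the percentage jump over marks):

```csharp
using System;

class MilestoneDemo
{
    static void Main()
    {
        int total = 7;          // few items => percent jumps, e.g. 28% -> 42%
        int lastPercent = 0;
        for (int current = 1; current <= total; current++)
        {
            int percent = current * 100 / total;
            // Fire when one or more 10%-marks were crossed since last time,
            // even if no iteration landed exactly on a multiple of 10.
            if (percent / 10 > lastPercent / 10)
            {
                Console.WriteLine("crossed {0}%", (percent / 10) * 10);
            }
            lastPercent = percent;
        }
    }
}
```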
try this:
for (int percent = 1; percent <= 100; percent++)
{
    if (percent % 10 == 0)
    {
        // code goes here
    }
}
Depending on how you increment your % value, % 10 == 0 may or may not work. For example, jumping from 89 to 91 percent would skip the code execution entirely. You should store the last executed value (80 in this case), then check whether the interval since then is >= 10, so 90 would work, and so would 91.
I have a text file with X number of records that have 24 pipe delimited fields.
ABCDEFG|123456|BILLING|1234567|12345678|12345678|...
My concern is with the BILLING column. I need to append this word with the current date and a sequential number, e.g. BILLING-20131021-1, but here is the trick: the digit must increment only for each 10% of the records. So, for example, if I have 100 records, the first ten of them will end with 1, the next ten with 2, and so on. If the count is uneven, the remainder will acquire the next sequence number.
I started with two loops, but that didn't produce the results. The first loop iterates through the record count and the second iterates through the first 10% of records, but then I can't figure out how to get the next batch of records.
for (uint recordCount = 0; recordCount < RecordsPerBatch; recordCount++)
{
    for (uint smallCount = 0; smallCount < (RecordsPerBatch / 10); smallCount++)
    {
    }
}
You can simply loop through, keep a counter and only increment the "small count" when you hit a defined condition.
i.e.:

int smallCount = 0;
for (int recordCount = 0; recordCount < totalRecords; ++recordCount)
{
    if (recordCount % (totalRecords / 10) == 0)
        ++smallCount;
    // smallCount is now the suffix for this record
}
Maintaining your current logic, you could add another variable that keeps the batch counter, and simplify the condition on the inner loop by calculating the batch size (10% of the total records) up front.
It is also necessary to check that the indexer in the inner loop doesn't exceed the total record count.
uint TotalRecordCounter = 101;
uint currentBatch = 1;
uint batchSize = TotalRecordCounter / 10;

// This will account for totals that are not exactly divisible by 10.
// But if it is allowed to have more than 10 batches, then remove it.
// if ((TotalRecordCounter % 10) != 0)
//     batchSize++;

for (uint recordCount = 0; recordCount < TotalRecordCounter; recordCount += batchSize)
{
    for (uint smallCount = 0;
         smallCount < batchSize && (recordCount + smallCount) < TotalRecordCounter;
         smallCount++)
    {
        // Note: "MM" is months; lowercase "mm" would be minutes.
        string billing = string.Format("BILLING-{0:yyyyMMdd}-{1}", DateTime.Today, currentBatch);
    }
    currentBatch++;
}
If I'm understanding correctly, the issue is figuring out when you hit that 10% (and 20% and 30%, etc.) threshold of the file. A good answer depends a lot on what your system is capable of, but there are several ways to do this with a single, non-nested loop.
Can you find the exact number of lines in the file?
If your file isn't gigantic and it is line-delimited (one record per line), this is the easiest solution: just read the whole file into a string array. Then go through each line and generate the final number each time, using current_record / exact_count.
Can you calculate the exact number of lines in the file?
If your records are fixed-length, you can take the file size, divide by the record size, and therefore calculate the exact number of records, and then generate the last number as above.
Is it sufficient to estimate the number of lines in the file?
Same idea as the previous suggestion, only use an estimate of your average record size.
What type of stream are you using to read the file?
If it's one where you can find the total length of the stream and your current position, you can calculate your percentage using that, instead of using row indices.
Final fallback suggestion
If none of the other suggestions work, you can always do a simple two pass solution. In the first one, simply read through and count the number of records. If you can fit it into memory, store each record and then parse it in-memory. If you can't, just read the file to count the number of entries, then read it again.
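Once the exact record count is known (by any of the routes above), the 10% suffix for each record falls out of one division, with no nested loops. A sketch, assuming (as the question states) that remainder records take the next sequence number:

```csharp
using System;

class SuffixDemo
{
    // Map a 0-based record index to its 1-based 10%-bucket suffix.
    // With 100 records: indices 0-9 -> 1, 10-19 -> 2, ..., 90-99 -> 10.
    static int Suffix(int index, int totalRecords)
    {
        int batchSize = Math.Max(totalRecords / 10, 1);  // guard tiny files
        return index / batchSize + 1;  // leftover records get the next number
    }

    static void Main()
    {
        Console.WriteLine(Suffix(0, 100));    // prints: 1
        Console.WriteLine(Suffix(99, 100));   // prints: 10
        Console.WriteLine(Suffix(100, 101));  // prints: 11 (the remainder record)
    }
}
```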