I am using C# client "StackExchange.Redis" for benchmarking Redis.
The dataset is a text file of close to 16 million records. Each record has six entries: three doubles and three integers.
When I push the data with RPUSH (ListRightPush in the API), it takes close to 4 minutes for all the data to be added to Redis.
Afterwards, when I retrieve the data with LRANGE (ListRange in the API), it takes almost 1.5 minutes to read back the whole list.
I am using the following code:
Connection:
ConnectionMultiplexer redis = ConnectionMultiplexer.Connect("localhost");
IDatabase db = redis.GetDatabase();
Insertion:
IEnumerable<string> lines = File.ReadLines(@"C:\Hep.xyz");
List<string> linesList = lines.ToList();
int count = linesList.Count;

string[] toks;
RedisValue[] redisToks = { "", "", "", "", "", "" };

for (int i = 0; i < count; i++)
{
    toks = linesList[i].Split(' ', '\t');
    for (int j = 0; j < 6; j++)
    {
        redisToks[j] = toks[j];
    }

    db.ListRightPushAsync("PS:DATA:", redisToks); // fire-and-forget; the returned tasks are never awaited

    if (i % 1000000 == 0)
    {
        Console.WriteLine("Lines Read: {0}", i);
    }
}
Console.WriteLine("Press any key to continue ...");
Console.ReadLine();
Retrieval:
long len = db.ListLength("PS:DATA:");
long start = 0;
long end = 99999;
while (end < len)
{
    RedisValue[] val = db.ListRange("PS:DATA:", start, end);
    int length = val.Length;
    start += 100000;
    end += 100000;
}
Console.WriteLine("Press any key to continue ...");
Console.ReadLine();
For configuration:
I have set maxmemory to 4 GB and maxmemory-policy to volatile-lru.
I am running all of this locally on my system. My system specs are:
8 GB RAM
Intel Core i7-5500U CPU @ 2.4 GHz (4 CPUs)
Could you please help me identify the factors I need to look into to improve performance? Also, is Redis suitable for this kind of dataset?
This is not because Redis is slow. When you save data into Redis, the time cost also includes file I/O (disk I/O) and network I/O, and in particular reading the lines from the file on disk takes a large share of the time. That is why retrieval takes only about 1.5 minutes while insertion takes about 4 minutes.
In conclusion, Redis works well in your case. One more thing you can do to speed up the insertion is to use Redis pipelining to reduce the network transport time.
An async write is not the same as pipelining: an async write merely avoids blocking the client, whereas pipelining batches the sending of commands and the reading of replies, so the two are different.
See https://redis.io/topics/pipelining, in particular the "It's not just a matter of RTT" section: pipelining also saves the Redis server read() and write() time.
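As a rough illustration only (the chunk size below is an arbitrary placeholder, and linesList, count and db are reused from the code in the question), batching the pushes with StackExchange.Redis could look something like this:
// Commands queued on an IBatch are sent to the server as one contiguous block
// when Execute() is called, instead of one fire-and-forget write per line.
const int chunkSize = 10000; // arbitrary placeholder
List<Task> pending = new List<Task>(chunkSize);
IBatch batch = db.CreateBatch();
for (int i = 0; i < count; i++)
{
    string[] parts = linesList[i].Split(' ', '\t');
    RedisValue[] values = new RedisValue[6]; // a fresh array per queued command
    for (int j = 0; j < 6; j++)
    {
        values[j] = parts[j];
    }
    pending.Add(batch.ListRightPushAsync("PS:DATA:", values));
    if (pending.Count == chunkSize)
    {
        batch.Execute();                 // flush the queued commands to the server
        Task.WaitAll(pending.ToArray()); // wait for their replies before continuing
        pending.Clear();
        batch = db.CreateBatch();
    }
}
batch.Execute();                         // flush the final partial batch
Task.WaitAll(pending.ToArray());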
Yes, what you are facing is the behaviour described in the official Redis documentation:
Redis lists are implemented via Linked Lists. This means that even if you have millions of elements inside a list, the operation of adding a new element in the head or in the tail of the list is performed in constant time. The speed of adding a new element with the LPUSH command to the head of a list with ten elements is the same as adding an element to the head of list with 10 million elements.
This is the reason why your insertion operation is very fast. The documentation continues:
What's the downside? Accessing an element by index is very fast in
lists implemented with an Array (constant time indexed access) and not
so fast in lists implemented by linked lists (where the operation
requires an amount of work proportional to the index of the accessed
element).
The documentation further suggests using sorted sets if you want fast access:
When fast access to the middle of a large collection of elements is important, there is a different data structure that can be used, called sorted sets. Sorted sets will be covered later in this tutorial.
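Purely as a sketch of what the documentation is hinting at (the key name and the choice of the line index as the score are my own, not part of the question), the same records could be stored in a sorted set and then sliced by score range:
// Store each record as a sorted-set member whose score is its line index,
// so any contiguous slice can be fetched directly by score range.
for (int i = 0; i < count; i++)
{
    // prefix the member with the index so duplicate lines remain unique
    db.SortedSetAdd("PS:DATA:ZSET", i + ":" + linesList[i], i);
}

// fetch records 100,000 .. 199,999 in a single call
RedisValue[] slice = db.SortedSetRangeByScore("PS:DATA:ZSET", 100000, 199999);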
I am trying to optimize a data collection process in C#. I would like to understand why a certain method of parallelism I am trying is not working as expected (more details below; see the "Question" section at the very bottom).
BACKGROUND
I have an external .NET Framework DLL, which is an API for an external data source; I can only consume the API, I do not have access to what goes on behind the scenes.
The API provides a function like: GetInfo(string partID, string fieldValue). Using this function, I can get information about a specific part, filtered for a single field/criteria. One single call (for just one part ID and one field value) takes around 20 milliseconds in an optimal case.
A part can have many values for the field criteria. So in order to get all the info for a part, I have to enumerate through all possible field values (13 in this case). And to get all the info for many parts (~260), I have to enumerate through all the part IDs and all the field values.
I already have all the part IDs and possible field values. The problem is performance. Using a serial approach (2 nested for-loops) is too slow (takes ~70 seconds). I would like to get the time down to ~5 seconds.
WHAT I HAVE TRIED
For different mediums of parallelizing work, I have tried:
calling the API in parallel via Tasks within a single main application.
calling the API in parallel via Parallel.ForEach within a single main application (see the sketch just after this list).
Wrapping the API call with a WCF service, and having multiple WCF service instances (just like having multiple Tasks, but this is multiple processes instead); the single main client application will call the API in parallel through these services.
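For reference, a minimal sketch of the Parallel.ForEach medium, assuming the same api.GetInfo, partIDs and fieldValues used in the code further below (the degree-of-parallelism value is an arbitrary placeholder):
// Every (partID, fieldValue) pair becomes one work item run on the thread pool.
// Requires System.Linq, System.Threading.Tasks and System.Collections.Concurrent.
var results = new ConcurrentBag<Info>();
var pairs = partIDs.SelectMany(id => fieldValues.Select(fv => (partID: id, fieldValue: fv)));

Parallel.ForEach(
    pairs,
    new ParallelOptions { MaxDegreeOfParallelism = 13 }, // placeholder
    pair => results.Add(api.GetInfo(pair.partID, pair.fieldValue)));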
For different logic of parallelizing work, I have tried:
experiment 0 has 2 nested for-loops; this is the base case without any parallel calls (so that's ~260 part IDs * 13 field values = ~3400 API calls in series).
experiment 1 has 13 parallel branches, and each branch deals with smaller 2 nested for-loops; essentially, it is dividing experiment 0 into 13 parallel branches (so rather than iterating over ~260 part IDs * 13 field values, each branch will iterate over ~20 part IDs * all 13 field values = ~260 API calls in series per branch).
experiment 2 has 13 parallel branches, and each branch deals with ALL part IDs but only 1 specific field value for each branch (so each branch will iterate over ~260 part IDs * 1 field value = ~260 API calls in series per branch).
experiment 3 has 1 for-loop which iterates over the part IDs in series, but inside the loop makes 13 parallel calls (for 13 field values); only when all 13 info is retrieved for one part ID will the loop move on to the next part ID.
I have tried experiments 1, 2, and 3 combined with the different mediums (Tasks, Parallel.ForEach, separate processes via WCF Services); so there is a total of 9 combinations. Plus the base case experiment 0 which is just 2 nested for-loops (no parallelizing there).
I also ran each combination 4 times (each time with a different set of ~260 part IDs), to test for repeatability.
In every experiment/medium combination, I am timing only the direct API call using Stopwatch; so the time is not affected by any other parts of the code (like Task creation, etc.).
Here is how I am wrapping the API call in WCF service (also shows how I am timing the API call):
public async Task<Info[]> GetInfosAsync(string[] partIDs, string[] fieldValues)
{
Info[] infos = new Info[partIDs.Length * fieldValues.Length];
await Task.Run(() =>
{
for (int i = 0; i < partIDs.Length; i++)
{
for (int j = 0; j < fieldValues.Length; j++)
{
Stopwatch timer = new Stopwatch();
timer.Restart();
infos[i * fieldValues.Length + j] = api.GetInfo(partIDs[i], fieldValues[j]);
timer.Stop();
// log timer.ElapsedMilliseconds to file (each parallel branch writes to its own file)
}
}
});
return infos;
}
And to better illustrate the 3 different experiments, here is how they are structured. These are run from the main application. I am only including how the experiments were done using the inter-process communication (GetInfosAsync defined above), as that gave me the most significant results (as explained under "Results" further below).
// experiment 1
Task<Info[]>[] tasks = new Task<Info[]>[numBranches]; // numBranches = 13
for (int k = 0; k < numBranches; k++)
{
tasks[k] = services[k].GetInfosAsync(partIDsForBranch[k], fieldValues); // each service/branch gets partIDsForBranch[k] (a subset of ~20 partIDs only used for branch k) and all 13 fieldValues
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]
// experiment 2
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int j = 0; j < fieldValues.Length; j++)
{
tasks[j] = services[j].GetInfosAsync(partIDs, new string[] { fieldValues[j] }); // each service/branch gets all ~260 partIDs and only 1 unique fieldValue
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]
// experiment 3
for (int i = 0; i < partIDs.Length; i++)
{
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int j = 0; j < fieldValues.Length; j++)
{
tasks[j] = services[j].GetInfosAsync(new string[] { partIDs[i] }, new string[] { fieldValues[j] }); // each branch/service gets the currently iterated partID and only 1 unique fieldValue
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll is complete to get Info[]
}
RESULTS
For experiments 1 and 2...
Task (within same application) and Parallel.ForEach (within same application) perform almost just like the base case experiment (approximately 70 to 80 seconds).
inter-process communication (i.e. making parallel calls to multiple WCF services separate from the main application) performs significantly better than Task/Parallel.ForEach. This made sense to me (I've read about how multi-process could potentially be faster than multi-thread). Experiment 2 performs better than experiment 1, with the best experiment 2 run being around 8 seconds.
For experiment 3...
Task and Parallel.ForEach (within same application) perform close to their experiment 1 and 2 counterparts (but around 10 to 20 seconds more).
inter-process communication was significantly worse compared to all other experiments, taking around 200 to 300 seconds in total. This is the result I don't understand (see "What I Expected" section further below).
The graphs below give a visual representation of these results. Except for the bar chart summary, I only included the charts for the inter-process communication results, since those were the most significant (both good and bad).
Figure 1 (above). Elapsed times of each individual API call for experiments 1, 2, and 3 for a particular run, for inter-process communication; experiment 0 is also included (top-left).
Figure 2 (above). Summary for each method/experiment for all 4 runs (top-left). And aggregate versions for the experiment graphs above (for experiments 1 and 2, this is the sum of each branch, and the total time would be the max of these sums; for experiment 3, this is the max of each loop, and the total time would be the sum of all these maxes). So in experiment 3, almost every iteration of the outer loop is taking around 1 second, meaning there is one parallel API call in every iteration that is taking 1 second...
WHAT I EXPECTED
The best performance I got was experiment 2 with inter-process communication (the best run was around 8 seconds in total). Since the experiment 2 runs were better than the experiment 1 runs, perhaps there is some optimization behind the scenes on the field value: experiment 1 could potentially have different branches clash by calling the same field value at the same point in time, whereas in experiment 2 each branch calls its own unique field value at any point in time.
I understand that the backend will be restricted by a certain number of calls per time period, so the spikes I see in experiments 1 and 2 make sense (and why there is almost no spikes in experiment 0).
That's why I thought that, for experiment 3 using inter-process communication, I am only making 13 API calls in parallel at any single point in time (for a single part ID, each branch having its own field value), and not proceeding to the next part ID until all 13 are done. This seemed like fewer API calls per time period than experiment 2, which continuously makes calls on each branch. So for experiment 3, I expected few spikes and all 13 calls to complete in the time it takes a single API call (~20 ms).
But what actually happened is that experiment 3 took the most time, and the majority of API call times spike significantly (i.e. each iteration has a call that takes around 1 second).
I also understand that experiments 1 and 2 only have 13 long-lasting parallel branches that last throughout the lifetime of a single run, whereas experiment 3 creates 13 new short-lived parallel branches ~260 times (so there would be ~3,400 short-lived parallel branches created over the lifetime of the run). If I were timing the task creation, I would understand the increased time due to overhead, but if I am timing the API call directly, how does this impact the API call itself?
QUESTION
Is there a possible explanation to why experiment 3 behaved this way? Or is there a way to profile/investigate this? I understand that this is not much to go off of without knowing what happens behind-the-scenes of the API... But what I am asking is how to go about investigating this, if possible.
This may be because experiment 3 uses two loops, and the objects created inside the outer loop (the task array and the argument arrays) are re-created on every iteration, which adds to the workload.
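If you want to test that hypothesis, a sketch that hoists the per-iteration allocations out of experiment 3's outer loop (reusing the services, partIDs and fieldValues variables from the question) could look like this:
// Allocate the task array and the single-element argument arrays once,
// then reuse them on every iteration of the outer loop.
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
string[] singlePartID = new string[1];
string[][] singleFieldValue = fieldValues.Select(fv => new[] { fv }).ToArray();

for (int i = 0; i < partIDs.Length; i++)
{
    singlePartID[0] = partIDs[i];
    for (int j = 0; j < fieldValues.Length; j++)
    {
        tasks[j] = services[j].GetInfosAsync(singlePartID, singleFieldValue[j]);
    }
    Task.WaitAll(tasks); // all 13 calls for this part ID finish before the arrays are reused
}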
I use a SQL Server CE database for some simulation results. When I test the reading speed, the benchmark times differ greatly.
The databases are split into 4 .SDF files (one per quarter)
302 * 525,000 entries overall
four SqlCeConnections
they are opened before reading and stay open
all four databases are located on the same disk (SSD)
I use SqlCeDataReader (IMHO as low-level and fast as you can get)
the reading process is parallel
Simplified code:
for (int run = 0; run < 4; run++)
{
    InitializeConnections();
    for (int reading = 0; reading < 6; reading++)
    {
        ResetMemoryObjects();
        Parallel.For(0, quartals.Count, (i) =>
        {
            values[i] = ReadFromSqlCeDb(i);
        });
    }
}
the connections are initialized once per run
all readings take place in a simple for loop and are exactly the same
before each reading, all the objects are reinitialized
These are the benchmark results I get:
At this point I'll be honest: I have no idea why SQL Server CE behaves that way. Maybe someone can give me a hint?
Edit 1:
I made a more in depth analysis of each step during the parallel reading. The following chart shows the steps, the "actual reading" is the part inside the while(readerdt.Read()) section.
Edit 2:
After ErikEJ's suggestion I added a TableDirect approach and made 150 runs, 75 for SELECT and 75 for TableDirect. I summed up the pre- and post-processing of the reading process, because that remains stable and nearly the same for all runs. What differs vastly is the actual reading process.
Every second run was done via TableDirect, so both start to get drastically better results at around run 65 simultaneously. The range goes from 5.7 seconds up to 37.4 seconds.
This is the "actual reading" code. (There are four different databases/.sdf files with four different SqlCe connections. Tested on a Ryzen 7 8-core CPU.)
private static List<List<double>> ReadDataFromDbTableDirect((double von, double bis) timepoints, List<(string compName, string resName)> components, int dbIdx)
{
var values = new List<List<double>>();
for (int j = 0; j < components.Count; j++)
{
values.Add(new List<double>());
}
using (var command = sqlCeCon[dbIdx].CreateCommand())
{
command.CommandType = CommandType.TableDirect;
command.CommandText = table.TableName;
command.IndexName = "PKTIME";
command.SetRange(DbRangeOptions.InclusiveStart | DbRangeOptions.InclusiveEnd, new object[]{timepoints.von},new object[]{timepoints.bis});
using (var reader = (SqlCeDataReader)command.ExecuteReader(CommandBehavior.Default))
{
while (reader.Read())
{
for (int j = 0; j < components.Count; j++)
{
if (!reader.IsDBNull(j))
{
if (j == 0)
{
values[j].Add(reader.GetInt32(j));
}
else
{
values[j].Add(reader.GetDouble(j));
}
}
}
}
}
}
return values;
}
Still no idea why it has such a great delay.
SDF looks like this
Edit 3:
Today I took the same approach with a single database instead of four (to exclude problems with Parallel/Tasks). While TableDirect has a small advantage here, the main problem of varying reading speed persists (the sdf data is the same, so it is comparable).
Edit 4:
These are the results on another machine. Still large spikes, but a bit more stable. Overall, still the same issue.
Edit 5:
These are the results of a 4x smaller DB, over 500 runs. TableDirect & SELECT are (as in the previous benchmarks) run alternately, but to make the results easier to see in the graph they are shown in sequence. Notice that the overall time here is not 4 times smaller, as you'd expect, but ~8 times smaller. Same problem with the high reading times in the beginning. Next I'll optimize ArrayPooling and so on...
Edit 6:
To investigate further, I tried this on 2 different machines. PC#2 has Win11 Home, a Ryzen 5, an SSD (quite new) and no antivirus; PC#3 has Win10 Pro (pristine installation, everything deactivated, including Windows Defender), an SSD and a Ryzen 7. On PC#3 there is just one peak (the first run); on PC#2 there are several besides the initial requests.
Edit 7 :
After ErikEJ's suggestion that it might be due to an index rebuild, I tried several things:
A: test the reading times after a fresh simulation (the DB is freshly built by the simulation)
B: test the reading times after copying and loading a DB from another folder and applying SqlCeEngine.Verify to it (Verify)
C: test the reading times after copying and loading a DB from another folder (no special DB treatment)
D: test the reading times after copying and loading a DB from another folder, then make a quick first call that reads one row of all columns (preparation call)
I also tested SqlCeEngine Repair & Compact. They had nearly the same results as B
It seems like verification solves the problem with the initial reading speed. Unfortunately, the verification itself takes quite long (>10 s on big DBs). Is there a quicker solution for this?
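For reference, a sketch of how the maintenance calls mentioned above (Verify, plus the Repair/Compact I also tested) and the "preparation call" from test D are invoked; the connection string and table name are placeholders, not taken from my actual code:
// Requires System.Data.SqlServerCe.
string connStr = @"Data Source=C:\data\Quartal1.sdf"; // placeholder path

using (var engine = new SqlCeEngine(connStr))
{
    bool ok = engine.Verify();                             // B: full verification (slow on big DBs)
    engine.Repair(connStr, RepairOption.RecoverAllOrFail); // also tested, nearly the same results as B
    engine.Compact(connStr);
}

// D: a cheap "preparation call" that touches one row of all columns
using (var con = new SqlCeConnection(connStr))
using (var cmd = con.CreateCommand())
{
    con.Open();
    cmd.CommandText = "SELECT TOP (1) * FROM SimulationResults"; // placeholder table name
    using (var reader = cmd.ExecuteReader())
    {
        reader.Read();
    }
}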
Result D is a complete surprise to me (mind the different scale). I don't understand what is happening here... any guesses?
Result C shows the long reading times on initial readings, but not always, which is irritating. Maybe it is not the index rebuild that is causing this?
I still have very huge variations in the reading speed on bigger databases.
I'm currently working on an improved reading process with pointers/memory to reduce GC pressure.
I will make the same tests today on another machine.
If anyone has an idea how to improve/stabilize reading speeds please let me know! Thanks in advance!
Given the task of improving the performance of a piece of code, I came across the following phenomenon. I have a large collection of reference types in a generic Queue; I'm removing and processing the elements one by one, then adding them to another generic collection.
It seems that the larger the elements are, the more time it takes to add an element to the collection.
Trying to narrow down the problem to the relevant part of the code, I've written a test (omitting the processing of elements, just doing the insert):
class Small
{
public Small()
{
this.s001 = "001";
this.s002 = "002";
}
string s001;
string s002;
}
class Large
{
public Large()
{
this.s001 = "001";
this.s002 = "002";
...
this.s050 = "050";
}
string s001;
string s002;
...
string s050;
}
static void Main(string[] args)
{
const int N = 1000000;
var storage = new List<object>(N);
for (int i = 0; i < N; ++i)
{
//storage.Add(new Small());
storage.Add(new Large());
}
List<object> outCollection = new List<object>();
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = N-1; i > 0; --i)
{
outCollection.Add(storage[i]);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
On the test machine, using the Small class, it takes about 25-30 ms to run, while it takes 40-45 ms with Large.
I know that outCollection has to grow from time to time to be able to store all the items, so there is some dynamic memory allocation. But giving the collection an initial size makes the difference even more obvious: 11-12 ms with Small and 35-38 ms with Large objects.
I am somewhat surprised, as these are reference types, so I was expecting the collections to work only with references to the Small/Large instances. I have read Eric Lippert's relevant article and know that references should not be treated as pointers. At the same time, AFAIK they are currently implemented as pointers, and their size and the collection's performance should be independent of the element size.
I've decided to put up a question here hoping that someone could explain or help me to understand what's happening here. Aside the performance improvement, I'm really curious what is happening behind the scenes.
Update:
Profiling data using the diagnostic tools didn't help me much, although I have to admit I'm not an expert using the profiler. I'll collect more data later today to find where the bottleneck is.
The pressure on the GC is quite high, of course, especially with the Large instances. But once the instances are created and stored in the storage collection and the program enters the loop, no collection is triggered any more, and memory usage does not increase significantly (outCollection is already pre-allocated).
Most of the CPU time is of course spent on memory allocation (JIT_New), around 62% of inclusive samples; the only other significant entry is System.Collections.Generic.List`1[System.__Canon].Add, with about 7%.
With 1 million items the preallocated outCollection size is 8 million bytes (the same as the size of storage); one can suspect 64 bit addresses being stored in the collections.
Probably I'm not using the tools properly or don't have the experience to interpret the results correctly, but the profiler didn't help me to get closer to the cause.
If the loop does not trigger collections and only copies pointers between 2 pre-allocated collections, how could the item size cause any difference? The cache hit/miss ratio is supposed to be more or less the same in both cases, as the loop is iterating over a list of "addresses" either way.
Thanks for all the help so far, I will collect more data, and put an update here if anything found.
I suspect that at least one action in the above (maybe some type checks) will require a dereference. Then the fact that many Smalls probably sit close together on the heap, and thus share cache lines, could account for some of the difference (certainly many more of them can share a single cache line than Larges can).
Added to which, you are also accessing them in the reverse of the order in which they were allocated, which maximises such a benefit.
I'm an extremely amateur C# developer who's trying to make this console program on macOS using Visual Studio. I do it in school, but I'm self-taught and have been working on this for less than two weeks, so it's entirely possible that I'm missing some simple solution.
I've made a program that reads a text file filled with prime numbers and converts it into a list, then begins to generate prime numbers, adding them to the list and the file, and reporting information every time it finds a new one.
Here's the code I have:
String fileLocation = "Prime Number List.txt"; //sets the file location to the root of where the program is stored
if (!File.Exists(fileLocation)) //tests if the file has already been created
{
using (FileStream fs = File.Create(fileLocation))
{
Byte[] info = new UTF8Encoding(true).GetBytes("2"); //if not, it creates the file and creates the initial prime number of 2
fs.Write(info, 0, info.Length);
}
}
List<string> fileContents = File.ReadAllLines(fileLocation).ToList(); //imports the list of prime numbers from the file
List<int> listOfPrimeNumbers = fileContents.ConvertAll(s => Int32.Parse(s)); //converts the list into the integer variable type
int currentNumber = listOfPrimeNumbers[listOfPrimeNumbers.Count() - 1]; //sets the current number to the most recent prime number
bool isPrime; //initializing the primality test variable
int numbersGeneratedThisSession = 0; //initializing the variable for the amount of primes found in this session
var loopStart = DateTime.Now; //initializes the program start time, ignoring the time taken to load the file list
while (true)
{
isPrime = true; //defaults the number to prime
currentNumber++; //repeats the cycle for the next number
double currentNumberRoot = Math.Sqrt(System.Convert.ToDouble(currentNumber));
for (int i = 0; i < listOfPrimeNumbers.Count; i++) //cyles through all of the primes in the list. no reason to divide by composites, as any number divisible by a
//composite would be divisible by the prime factors of that composite anyway, thus if we were to divide by
//every number it would slow down the program
{
if (listOfPrimeNumbers[i] <= currentNumberRoot) //only tests primes up to the square root of the current number, as any potential
                                                //factor pair would have one of the values less than or equal to the square root
{
if (currentNumber % listOfPrimeNumbers[i] == 0) //checks for the even division of the current number by the current prime
{
isPrime = false; //if an even division is found, it reports that the number isn't prime and breaks the loop
break;
}
}
else
break; //if no even divisons are found, then it reaches this point with the primality test variable still true, and breaks the loop
}
if (isPrime) //this section of the code activates when the primality test variable is true
{
listOfPrimeNumbers.Add(currentNumber); //adds the new prime to the list
File.AppendAllText(fileLocation, Environment.NewLine + currentNumber); //adds the new prime to the file on a new line
numbersGeneratedThisSession++; //raises the counter for the prime numbers generated in this session
var runtime = DateTime.Now - loopStart; //calculates the runtime of the program, excluding the time taken to load the file into the list
int runtimeInSecs = (int)runtime.TotalSeconds; //converts the TimeSpan into a whole number of seconds
int generationSpeed = runtimeInSecs == 0 ? 0 : numbersGeneratedThisSession / runtimeInSecs;
Console.WriteLine("\nI've generated {0} prime numbers, {1} of those being in the current session." +
"\nI've been running for {2}, which means I've been generating numbers at a speed of {3} primes per second. " +
"\nThe largest prime I've generated so far is {4}, which is {5} digits long.",
listOfPrimeNumbers.Count(), numbersGeneratedThisSession, runtime, generationSpeed, currentNumber, currentNumber.ToString().Length);
}
}
I keep getting the exception on the "listOfPrimeNumbers.Add(currentNumber);" line. I've read up on similar questions, and the most common solution to other people's problems was to set gcAllowVeryLargeObjects to true, to break the 2 GB limit. That would only be a temporary fix for me, however, as the list will keep getting larger over time, so there will be a point when it hits the limits of my computer's capabilities rather than the runtime's cap.
I'm wondering if there's some sort of technique that more experienced developers use to circumvent this issue, like splitting the data into multiple lists, or doing something differently to streamline the code. I know that due to the nature of my program it's unavoidable that the data will eventually grow too large, but I'm trying to postpone that for as long as possible, as the file right now is less than half a gig, which seems like an unreasonably small amount of data to be crashing the program.
I'd also like to note that I ran this program for around an hour a day while I was working on the statistic feedback (meaning that the file reading, writing, and generation code itself were largely untouched during this time) for the past week. I had no problems booting it up any of those times, and the final time it ran went smoothly (didn't crash due to the out of memory exception). I only encountered this problem today when I tried starting it up again.
Individual arrays or lists in .NET are bound by all of:
the 2GiB object limit (unless gcAllowVeryLargeObjects is enabled)
the available process memory (especially relevant for 32-bit processes)
2,146,435,071 items per dimension (2,147,483,591 for single-byte values)
If you are getting anywhere near these limits, then yes: you need another approach. Moving to multiple individual lists that you treat as one composite block should serve as a stop-gap (see the sketch below), but I don't think this is ultimately a very scalable approach for computing prime numbers.
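A minimal sketch of that stop-gap, assuming a simple wrapper (the class name and chunk size are placeholders) that spreads items over several inner lists so that no single backing array approaches the limits above:
// A composite list that splits its items across many smaller List<T> instances
// so no single backing array gets close to the per-object limits.
public class ChunkedList<T>
{
    private const int ChunkSize = 1000000; // placeholder chunk size
    private readonly List<List<T>> chunks = new List<List<T>> { new List<T>() };

    public long Count { get; private set; }

    public void Add(T item)
    {
        List<T> last = chunks[chunks.Count - 1];
        if (last.Count == ChunkSize)
        {
            last = new List<T>();
            chunks.Add(last);
        }
        last.Add(item);
        Count++;
    }

    public T this[long index]
    {
        get { return chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
    }
}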
Since you're searching for primes with this program, you're going to run out of memory if you just attempt to store them all in memory.
Splitting your lists will help a little, as stated, but in the end you'll run into the same problem; 5 groups of 3 items is still 15 items, grouped apart or not. You're going to fill up your memory quickly.
I think your problem may be here:
List<string> fileContents = File.ReadAllLines(fileLocation).ToList(); //imports the list of prime numbers from the file
List<int> listOfPrimeNumbers = fileContents.ConvertAll(s => Int32.Parse(s)); //converts the list into the integer variable type
Both of these List<T>s are unnecessary. Your file has line breaks in it (you're inserting Environment.NewLine between your entries), so presuming you want to just continue where you left off, you need exactly one value from that file:
//note that I used ReadLines, not ReadAllLines
int lastNumber;
if (!int.TryParse(File.ReadLines(fileLocation).Last(), out lastNumber))
{
//last value wasn't a valid integer. Start over.
lastNumber = 1;
}
Then, execute all of your logic using lastNumber, write to the file when a number is prime, and don't store collections in memory at all. This will make your new limiting factor the storage space on the destination computer. If you run out of memory loading the file and getting its last string, you'll need to put together a bit of code that reads the file backward, but since this is a more academic project, I doubt you need to take it that far.
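Should it ever come to that, a rough sketch of reading only the last line by scanning backward from the end of the file (assuming UTF-8 text, as in the question) might look like this:
// Requires System.IO, System.Text and System.Collections.Generic.
// Finds the last line of a large text file without reading the whole file,
// by scanning backward from the end for the previous line break.
static string ReadLastLine(string path)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        if (fs.Length == 0) return string.Empty;

        long pos = fs.Length - 1;
        var bytes = new List<byte>();

        while (pos >= 0)
        {
            fs.Seek(pos, SeekOrigin.Begin);
            int b = fs.ReadByte();
            if (b == '\n' && bytes.Count > 0) break;        // reached the previous line break
            if (b != '\n' && b != '\r') bytes.Add((byte)b); // skip the trailing newline itself
            pos--;
        }

        bytes.Reverse();
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}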
I'm trying to read all of the feature data from a particular shapefile. In this case, I'm using DotSpatial to open the file and iterating through the features. This particular shapefile is only 9 MB in size, and the dbf file is 14 MB. There are roughly 75k features to loop through.
Note, this is all programmatically through a console app, so there is no rendering or anything involved.
When loading the shapefile, I reproject and then iterate. The loading and reprojecting are super quick. However, as soon as the code reaches my foreach block, it takes nearly 2 full minutes to load the data and uses roughly 2 GB of memory when debugging in Visual Studio. This seems very, very excessive for what's a reasonably small data file.
I've run the same code outside of Visual Studio, from the command line; however, the time is still roughly 2 full minutes, and the process uses about 1.3 GB of memory.
Is there any way to speed this up at all?
Below is my code:
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);
// Gets slow here and takes forever to get to the first item
foreach(IFeature feature in indexMapFile.Features)
{
// Once inside the loop, it's blazingly quick.
}
Interestingly, when I use the VS immediate window, it's super super fast, no delay at all...
I've managed to figure this out...
For some reason, calling foreach on the features is painfully slow.
However, as these files have a 1-to-1 mapping between features and data rows (each feature has a corresponding data row), I've modified the code slightly to the following. It's now very quick: less than a second to start the iterations.
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);
// Get the map index from the Feature data
for(int i = 0; i < indexMapFile.DataTable.Rows.Count; i++)
{
// Get the feature
IFeature feature = indexMapFile.Features.ElementAt(i);
// Now it's very quick to iterate through and work with the feature.
}
I wonder why this would be. I think I need to look at the iterator on the IFeatureList implementation.
I had the same problem with very large files (1.2 million features): populating the .Features collection never ends.
But if you ask for each feature individually, you do not incur the memory or delay overheads.
int lRows = fs.NumRows();
for (int i = 0; i < lRows; i++)
{
// Get the feature
IFeature pFeat = fs.GetFeature(i);
StringBuilder sb = new StringBuilder();
{
sb.Append(Guid.NewGuid().ToString());
sb.Append("|");
sb.Append(pFeat.DataRow["MAPA"]);
sb.Append("|");
sb.Append(pFeat.BasicGeometry.ToString());
}
pLinesList.Add(sb.ToString());
lCnt++;
if (lCnt % 10 == 0)
{
pOld = Console.ForegroundColor;
Console.ForegroundColor = ConsoleColor.DarkGreen;
Console.Write("\r{0} de {1} ({2}%)", lCnt.ToString(), lRows.ToString(), (100.0 * ((float)lCnt / (float)lRows)).ToString());
Console.ForegroundColor = pOld;
}
}
Look for the GetFeature method.