I'm trying to read all of the feature data from a particular shapefile. In this case, I'm using DotSpatial to open the file, and I'm iterating through the features. This particular shapefile is only 9 MB in size, and the .dbf file is 14 MB. There are roughly 75k features to loop through.
Note, this is all programmatically through a console app, so there is no rendering or anything involved.
When loading the shapefile, I reproject, then I iterate. The loading and reprojecting are super quick. However, as soon as the code reaches my foreach block, it takes nearly 2 full minutes to load the data and uses roughly 2 GB of memory when debugging in Visual Studio. This seems very, very excessive for what's a reasonably small data file.
I've run the same code outside of Visual Studio, from the command line, but the time is still roughly 2 full minutes, and the process uses about 1.3 GB of memory.
Is there any way to speed this up at all?
Below is my code:
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);

// Gets slow here and takes forever to get to the first item
foreach (IFeature feature in indexMapFile.Features)
{
    // Once inside the loop, it's blazingly quick.
}
Interestingly, when I use the VS immediate window, it's super super fast, no delay at all...
I've managed to figure this out...
For some reason, calling foreach on the features is painfully slow.
However, as these files have a 1-1 mapping between features and data rows (each feature has a corresponding data row), I've modified it slightly to the following. It's now very quick: less than a second to start the iterations.
// Load the shape file and project to GDA94
Shapefile indexMapFile = Shapefile.OpenFile(shapeFilePath);
indexMapFile.Reproject(KnownCoordinateSystems.Geographic.Australia.GeocentricDatumofAustralia1994);

// Get the map index from the Feature data
for (int i = 0; i < indexMapFile.DataTable.Rows.Count; i++)
{
    // Get the feature
    IFeature feature = indexMapFile.Features.ElementAt(i);

    // Now it's very quick to iterate through and work with the feature.
}
I wonder why this would be. I think I need to look at the iterator on the IFeatureList implementation.
Cheers,
Justin
I hit the same problem with very large files (1.2 million features): populating the .Features collection never ends.
But if you ask for each feature individually, there is no memory or delay overhead.
var pLinesList = new List<string>();
int lCnt = 0;
int lRows = fs.NumRows();

for (int i = 0; i < lRows; i++)
{
    // Get the feature on demand instead of materializing the whole Features list
    IFeature pFeat = fs.GetFeature(i);

    var sb = new StringBuilder();
    sb.Append(Guid.NewGuid().ToString());
    sb.Append("|");
    sb.Append(pFeat.DataRow["MAPA"]);
    sb.Append("|");
    sb.Append(pFeat.BasicGeometry.ToString());
    pLinesList.Add(sb.ToString());

    lCnt++;
    if (lCnt % 10 == 0)
    {
        // Simple console progress indicator every 10 features
        ConsoleColor pOld = Console.ForegroundColor;
        Console.ForegroundColor = ConsoleColor.DarkGreen;
        Console.Write("\r{0} de {1} ({2}%)", lCnt, lRows, 100.0 * lCnt / lRows);
        Console.ForegroundColor = pOld;
    }
}
Look for the GetFeature method.
Related
I use a SQL Server CE database for some simulation results. When I test the reading speed, the benchmark times differ greatly.
The database is split into 4 .SDF files (one per quarter)
302*525000 entries overall
four SqlCeConnections
they are opened before reading and stay open
all four databases are located on the same disk (SSD)
I use SqlCeDataReader (IMHO as low-level and fast as you can get)
Reading process is parallel
Simplified code
for (int run = 0; run < 4; run++)
{
    InitializeConnections();
    for (int reading = 0; reading < 6; reading++)
    {
        ResetMemoryObjects();
        Parallel.For(0, quartals.Count, i =>
        {
            values[i] = ReadFromSqlCeDb(i);
        });
    }
}
Connections are initialized once per run.
All readings take place in a simple for loop and are exactly the same.
Before each reading, all the objects are reinitialized.
These are the benchmark results I get:
To be honest, at this point I have no idea why SQL Server CE behaves that way. Maybe someone can give me a hint?
Edit 1:
I made a more in-depth analysis of each step during the parallel reading. The following chart shows the steps; the "actual reading" is the part inside the while (readerdt.Read()) section.
Edit 2:
After ErikEJ's suggestion I added a TableDirect approach and made 150 runs, 75 with SELECT and 75 with TableDirect. I summed up the pre- and post-processing of the reading process, because this remains stable and nearly the same for all runs. What differs vastly is the actual reading process.
Every second run was done via TableDirect, so both approaches start to get drastically better results at around run 65 simultaneously. The range goes from 5.7 seconds up to 37.4 seconds.
This is the "actual reading" code. (There are four different databases/.sdf files with four different SqlCe connections. Tested on a Ryzen 7 8-core CPU.)
private static List<List<double>> ReadDataFromDbTableDirect((double von, double bis) timepoints, List<(string compName, string resName)> components, int dbIdx)
{
    var values = new List<List<double>>();
    for (int j = 0; j < components.Count; j++)
    {
        values.Add(new List<double>());
    }
    using (var command = sqlCeCon[dbIdx].CreateCommand())
    {
        command.CommandType = CommandType.TableDirect;
        command.CommandText = table.TableName;
        command.IndexName = "PKTIME";
        command.SetRange(DbRangeOptions.InclusiveStart | DbRangeOptions.InclusiveEnd, new object[] { timepoints.von }, new object[] { timepoints.bis });
        using (var reader = (SqlCeDataReader)command.ExecuteReader(CommandBehavior.Default))
        {
            while (reader.Read())
            {
                for (int j = 0; j < components.Count; j++)
                {
                    if (!reader.IsDBNull(j))
                    {
                        if (j == 0)
                        {
                            values[j].Add(reader.GetInt32(j));
                        }
                        else
                        {
                            values[j].Add(reader.GetDouble(j));
                        }
                    }
                }
            }
        }
    }
    return values;
}
Still no idea why there is such a large delay.
SDF looks like this
Edit 3:
Today I tried the same approach with a single database instead of four (to exclude problems with Parallel/Tasks). While TableDirect has a small advantage here, the main problem of varying reading speed persists (the .sdf data is the same, so it is comparable).
Edit 4:
These are the results on another machine. Still large spikes, but a bit more stable. Overall, still the same issue.
Edit 5:
These are the results of a 4x smaller db, 500 runs. TableDirect & SELECT are (as in previous benchmarks) run alternately, but to better see the results in the graph they are shown in sequence. Notice that the overall time here is not 4 times smaller, as you'd expect, but ~8 times smaller. Same problem with the high reading time in the beginning. Next I'll optimize ArrayPooling and so on...
Edit 6:
To investigate further I tried this on 2 different machines. PC #2 has Windows 11 Home, a Ryzen 5, a fairly new SSD and no antivirus; PC #3 has a pristine Windows 10 Pro installation with everything (including Windows Defender) deactivated, an SSD and a Ryzen 7. On PC #3 there is just one peak (the first run); on PC #2 there are several besides the initial requests.
Edit 7 :
Following ErikEJ's suggestion that it might be due to an index rebuild, I tried several things:
A: test the reading times after a fresh simulation (the db is freshly built by the simulation)
B: test the reading times after copying and loading a db from another folder and applying a Verify (SqlCeEngine) to the db
C: test the reading times after copying and loading a db from another folder (no special db treatment)
D: test the reading times after copying and loading a db from another folder, then make a quick first call that reads one row of all columns (preparation call)
I also tested SqlCeEngine Repair & Compact. They had nearly the same results as B
It seems like a verification solves the problem with the initial reading speed. Unfortunately, the verification itself takes quite a long time (>10 s on big dbs). Is there a quicker solution for this?
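For reference, a minimal sketch of how the maintenance calls are invoked through the standard System.Data.SqlServerCe API (the connection string / file path is an assumption, not the real one):
using System.Data.SqlServerCe;
// Sketch: the maintenance calls used in test B and for Repair/Compact.
string connStr = @"Data Source=C:\data\results.sdf"; // assumed path
using (var engine = new SqlCeEngine(connStr))
{
    // B: verify the copied database; this is the step that takes >10 s on big dbs
    bool ok = engine.Verify(VerifyOption.Enhanced);
    // Also tested, with nearly the same results as B:
    // engine.Repair(connStr, RepairOption.RecoverAllPossibleRows);
    // engine.Compact(null);
}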
Result D is a complete surprise to me (mind the different scale). I don't understand what is happening here. Any guesses?
Result C shows the long reading times on initial readings, but not always, which is confusing. Maybe it is not the index rebuild that is causing this?
I still have very large variations in the reading speed on bigger databases.
I'm currently working on an improved reading process with pointers/memory to reduce GC pressure.
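As a sketch of what I mean by reducing GC pressure (not the final code; the buffer capacity is an assumption), the idea is to rent the per-column buffers from System.Buffers.ArrayPool instead of growing a new List&lt;double&gt; on every call:
using System;
using System.Buffers;
// Sketch: rent a reusable buffer instead of allocating a new List<double> per column.
const int MaxRowsPerRange = 1_000_000; // assumed upper bound for a time range
double[] buffer = ArrayPool<double>.Shared.Rent(MaxRowsPerRange);
int count = 0;
try
{
    // inside while (reader.Read()):  buffer[count++] = reader.GetDouble(j);
    // then process buffer[0..count) before returning
}
finally
{
    ArrayPool<double>.Shared.Return(buffer);
}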
I will make the same tests today on another machine.
If anyone has an idea how to improve/stabilize reading speeds please let me know! Thanks in advance!
I am trying to write a program to perform external merge sort on a massive dataset. As a first step, I need to split the dataset into chunks that would fit into RAM. I have the following questions:
Suppose my machine has x amount of RAM installed; is there a theoretical maximum limit on how much of it could be made available to my process?
When I run the program below, I get a non-zero value for available memory when it fails: there is still 2.8 GB of free RAM when the memory allocation fails. Why does the allocation fail when there is still unused RAM left? What explains the observed behavior?
List<string> list = new List<string>();
try
{
    while (true)
    {
        list.Add("random string");
    }
}
catch (Exception e)
{
    Microsoft.VisualBasic.Devices.ComputerInfo CI = new ComputerInfo();
    Console.WriteLine(CI.AvailablePhysicalMemory);
}
If there are other programs running concurrently, how do I determine how much RAM is available for use by my process?
Here is what you're looking for: ComputerInfo.AvailablePhysicalMemory
Gets the total amount of free physical memory for the computer.
private ulong GetMaxAvailableRAM()
{
Microsoft.VisualBasic.Devices.ComputerInfo CI = new ComputerInfo();
return CI.AvailablePhysicalMemory;
}
NOTE: You will need to add a reference to Microsoft.VisualBasic.
UPDATE:
Your sample that fills the RAM will run into a few other limits first.
It will first hit an OutOfMemoryException if you're not building for 64-bit. You should change your solution to build for x64 (64-bit) in the solution configuration.
Secondly, your List has a maximum supported array dimension.
By adding many small objects you will hit that limit first.
Here is a quick and dirty example making a List of Lists of strings.
(This could be done with less code using images etc., but I was trying to stay similar to your example.)
When this is run it will consume all of your RAM and eventually start paging to disk. Remember that Windows has virtual memory, which will eventually get used up as well, but it's much slower than regular RAM. Also, if all of that is used up, the code might not even be able to allocate the space to instantiate the ComputerInfo class.
NOTE: Be careful, this code will consume all RAM and potentially make your system unstable.
List<List<string>> list = new List<List<string>>();
try
{
    for (UInt32 I = 0; I < 134217727; I++)
    {
        List<string> SubList = new List<string>();
        list.Add(SubList);
        for (UInt32 x = 0; x < 134217727; x++)
        {
            SubList.Add("random string");
        }
    }
}
catch (Exception Ex)
{
    Console.WriteLine(Ex.Message);
    Microsoft.VisualBasic.Devices.ComputerInfo CI = new ComputerInfo();
    Console.WriteLine(CI.AvailablePhysicalMemory);
}
NOTE: To avoid hitting the disk you could try something like System.Security.SecureString, which prevents itself from being written to disk, but it would be very slow to accumulate enough data to fill your RAM.
Here is a test run showing the physical memory usage. I started running at (1).
For your final implementation I suggest you use the ComputerInfo.AvailablePhysicalMemory value to determine how much of your data you can load before loading it (leaving some for the OS). Also look into locking (pinning) objects in memory (usually used for marshaling, etc.) to prevent accidental use of virtual memory.
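For example, a minimal sketch of that idea (the 75% budget, the 256 MB cap and the pinned byte[] chunk are assumptions for illustration):
using System;
using System.Runtime.InteropServices;
using Microsoft.VisualBasic.Devices;
// Size the chunk from the currently available physical memory, leaving a share for the OS.
ulong availableBytes = new ComputerInfo().AvailablePhysicalMemory;
ulong chunkBudget = Math.Min(availableBytes * 3 / 4, 256UL * 1024 * 1024);
// Pin the buffer so the GC cannot move it. Note that pinning does not stop Windows
// from paging the memory out; it is simply the usual managed way to fix an object in place.
byte[] chunk = new byte[chunkBudget];
GCHandle handle = GCHandle.Alloc(chunk, GCHandleType.Pinned);
try
{
    // fill and sort this chunk of the dataset here
}
finally
{
    handle.Free();
}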
I have a loop that runs for all days (7 times) and another loop inside it that runs for all the records in the file (about 100,000), so all in all it's about 700,000 iterations. I want to log each processing step in each loop to a file. For example, when we are inside the 1st loop for the first time and the 2nd loop for the first time, we log what was done to a file. The problem is that if I were to log each step individually, it would terribly hurt the performance because of so many I/O operations. What I was thinking is: is there any way I could log each and every step to memory (a memory stream or anything) and then, at the end of the outer loop, write all that memory stream data to a file?
Say I have:
for (int i = 0; i < 7; i++)
{
    for (int j = 0; j < RecordsCount; j++)
    {
        SomeOperation();
        // I am logging to a file here right now
    }
}
It's hurting the performance terribly; if I remove the file logging, the job finishes a lot earlier. So I was thinking it would be better to log to memory first and then write the whole log from memory to the file. Is there any way to do this? If there are many, which is best?
PS: here is a logger I found, http://www.codeproject.com/Articles/23424/TracerX-Logger-and-Viewer-for-NET, but I can use any other custom logger or anything else if required.
EDIT: If I use a memory stream and then write it all to the file, would that give me better performance than using File.AppendAllLines as suggested in the answers by Yorye and zmbq? And would it give me any performance gain over what Jeremy commented?
This is just an example, but you can get the idea.
You really had the solution, you just needed to do it...
for (int i = 0; i < 7; i++)
{
    var entries = new List<string>();
    for (int j = 0; j < RecordsCount; j++)
    {
        SomeOperation();
        // Log into the list (the message text is just a placeholder)
        entries.Add("Operation #" + j + " results: bla bla bla");
    }
    // Log all entries to the file at once.
    File.AppendAllLines("logFile.txt", entries);
}
Why not use a proper logging framework instead of writing your own?
NLog for instance has buffering built in and it is easy to configure: https://github.com/nlog/NLog/wiki/BufferingWrapper-target
I suggest you focus on writing code that gives value to your project while reusing existing solutions for all the other stuff. That will probably make you more efficient and give better results :-)
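For example, a buffered file target can also be wired up in code, roughly like this (a sketch; the target names, buffer size and file name are assumptions):
using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Targets.Wrappers;
// Sketch: wrap a FileTarget in a BufferingWrapper so log events are written in batches.
var config = new LoggingConfiguration();
var fileTarget = new FileTarget { Name = "file", FileName = "logFile.txt" };
var buffered = new BufferingTargetWrapper
{
    Name = "buffered",
    WrappedTarget = fileTarget,
    BufferSize = 100 // flush every 100 events (value is an assumption)
};
config.AddTarget("buffered", buffered);
config.LoggingRules.Add(new LoggingRule("*", LogLevel.Debug, buffered));
LogManager.Configuration = config;
var logger = LogManager.GetCurrentClassLogger();
logger.Debug("Operation #{0} done", 42);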
Logging 700,000 lines to a file shouldn't take long at all, as long as you use adequate buffers. In fact, it shouldn't take longer if you do it inside the loop compared to doing it all at once outside the loop.
Don't use File.AppendAllLines or something similar; instead, open a stream to the file, make sure you have a buffer in place, and write through it.
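For example (a sketch; the 64 KB buffer size is an assumption, and SomeOperation/RecordsCount are taken from the question):
using System.IO;
using System.Text;
// Sketch: one buffered writer for the whole job; writes accumulate in memory
// and are flushed to disk in large blocks.
using (var writer = new StreamWriter("logFile.txt", true, Encoding.UTF8, 64 * 1024))
{
    for (int i = 0; i < 7; i++)
    {
        for (int j = 0; j < RecordsCount; j++)
        {
            SomeOperation();
            writer.WriteLine("Operation #{0}/{1} done", i, j);
        }
    }
} // Dispose flushes whatever is still buffered.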
My C# 3.0 application needs to traverse folders and do some work within them. To show meaningful progress, I need to know the total folder count.
If I use Directory.GetDirectories with the AllDirectories option, this takes a very long time on my 2 TB hard drive with around 100K folders, and I would have to present progress even for that operation! The only meaningful alternative I can see is to use recursive Directory.GetDirectories and present the user with the number of directories found so far. However, this takes even longer than the first approach.
I believe both approaches are too slow. Is there any way to get this number more quickly, e.g. by reading the file system tables using P/Invoke? Any other ideas?
My suggestion would be to simply show the user an indeterminate (infinitely scrolling) progress bar while you are getting all of the directories, and only then show the actual progress while your application does the work.
This way the user will know the application is working in the background while everything happens.
This sort of thing is hard to do. If you're just trying to make a rough estimate for a progress bar, you don't need much granularity, right? I would suggest manually traversing the directory tree only one or two levels deep to figure out how many 1st- and 2nd-level subdirectories there are. Then you can update your progress bar whenever you hit one of those subdirs. That ought to give you a meaningful progress bar without taking too much time to compute.
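A rough sketch of that idea (method and variable names are illustrative):
using System;
using System.Collections.Generic;
using System.IO;
// Sketch: collect only the first two levels of subdirectories and use them as progress "ticks".
static List<string> GetProgressCheckpoints(string root)
{
    var checkpoints = new List<string>();
    foreach (var level1 in Directory.GetDirectories(root))
    {
        checkpoints.Add(level1);
        try
        {
            checkpoints.AddRange(Directory.GetDirectories(level1));
        }
        catch (UnauthorizedAccessException)
        {
            // skip folders we are not allowed to read
        }
    }
    return checkpoints;
}
// During the real traversal, advance the progress bar by 1.0 / checkpoints.Count
// every time the current path enters one of these checkpoint folders.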
If you implement this you'll find that the first pre-scan is the slowest, but it will speed up the next (full) scan because the folder structure gets cached.
It may be an option to only count the folders in the first N (2..4) levels. That could still be slow, but it will allow for an estimated progress. Just assume all lower levels contain equal numbers of files.
Part 2, concerning the P/Invoke question
Your main cost here is the actual low-level I/O; the overhead of the (any) API is negligible.
You will probably benefit from replacing GetFiles() with EnumerateFiles() (.NET 4), more so for your main loop than for the pre-scan.
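For example (a sketch; rootPath, ProcessFile and ReportProgress are placeholder names):
using System.IO;
// Sketch: EnumerateFiles streams results as they are found, instead of building
// the complete array up front like GetFiles does.
int processed = 0;
foreach (var file in Directory.EnumerateFiles(rootPath, "*", SearchOption.AllDirectories))
{
    ProcessFile(file);             // your per-file work
    processed++;
    if (processed % 100 == 0)
        ReportProgress(processed); // however you surface progress
}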
Explore the FindFirstFile and FindNextFile APIs. I think they will work faster in your case.
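A minimal P/Invoke sketch of counting subdirectories with these APIs (the declarations are the usual kernel32 signatures; treat this as a starting point rather than production code):
using System;
using System.IO;
using System.Runtime.InteropServices;

static class NativeDirectoryCounter
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
    }

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    // Counts all directories below 'root' without materializing path arrays.
    public static long CountDirectories(string root)
    {
        long count = 0;
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(Path.Combine(root, "*"), out data);
        if (handle == INVALID_HANDLE_VALUE)
            return 0;
        try
        {
            do
            {
                bool isDir = (data.dwFileAttributes & FileAttributes.Directory) != 0;
                if (isDir && data.cFileName != "." && data.cFileName != "..")
                    count += 1 + CountDirectories(Path.Combine(root, data.cFileName));
            }
            while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
        return count;
    }
}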
I wrote a pretty simple enumeration of files. The progress is mathematically continuous, i.e. it will never drop to a lower value later on, no matter what. The estimation is based on the idea that all folders hold the same number of files and subfolders, which is obviously almost never the case, but it suffices to get a reasonable idea.
There is almost no caching, especially not of deep structures, so this should work almost as quickly as enumerating directly.
public static IEnumerable<Tuple<string, float>> EnumerateFiles (string root)
{
    var files = Directory.GetFiles (root);
    var dirs = Directory.GetDirectories (root);
    var fact = 1f / (float) (dirs.Length + 1); // this makes for a rough estimate
    for (int i = 0; i < files.Length; i++) {
        var file = files[i];
        var f = (float) i / (float) files.Length;
        f *= fact;
        yield return new Tuple<string, float> (file, f);
    }
    for (int i = 0; i < dirs.Length; i++) {
        var dir = dirs[i];
        foreach (var tuple in EnumerateFiles (dir)) {
            var f = tuple.Item2;
            f *= fact;
            f += (i + 1) * fact;
            yield return new Tuple<string, float> (tuple.Item1, f);
        }
    }
}
I am creating a project in C#.NET. My execution process is very slow, and I found the reason for it: in one method I copy the values from one list to another. That list contains more than 3000 values for every row. How can I speed up this process? Can anybody help me?
for (int i = 0; i < rectTristrip.NofStrips; i++)
{
    VertexList verList = new VertexList();
    verList = rectTristrip.Strip[i];
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();
    for (int j = 0; j < rectPath4.PointCount; j++)
    {
        pointList.Add(rectPath4.PathPoints[j]);
    }
}
This is the code that slows down my process. rectTristrip consists of a lot of vertices, and each vertex has more than 3000 values.
A profiler will tell you exactly how much time is spent on which lines and which are most important to optimize. Red Gate makes a very good one:
http://www.red-gate.com/products/ants_performance_profiler/index.htm
As musicfreak already mentioned, you should profile your code to get reliable results on what's going on. But some processes just take time.
In some cases you can't get rid of them; they must be done. The question is just: when are they necessary? Maybe you can move them into an initialization phase or into another thread that computes the results for you while your GUI remains accessible to your users.
In one of my applications I make a big query against a SQL Server. This task takes a while (build up the connection, send the query, wait for the result, put the result into a data table, do some calculations of my own, present the results to the user). All of these steps are necessary and can't be made any faster. But they are done in another thread while the user sees a 'Please wait' with a progress bar in the result window. In the meantime the user can already change some other settings in the UI (if he likes). So the UI stays responsive and the user has no problem waiting a few seconds.
So this is not a real answer, but maybe it gives you some ideas on how to solve your problem.
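As an illustration of that pattern (not part of the original scenario; RunBigQuery, ShowResults and progressBar1 are placeholder names), a BackgroundWorker keeps the UI thread free:
using System.ComponentModel;
// Sketch: run the heavy work on a BackgroundWorker so the UI stays responsive.
var worker = new BackgroundWorker { WorkerReportsProgress = true };
worker.DoWork += (s, e) =>
{
    // connect, send the query, do the calculations here;
    // call ((BackgroundWorker)s).ReportProgress(percent) along the way
    e.Result = RunBigQuery();
};
worker.ProgressChanged += (s, e) => progressBar1.Value = e.ProgressPercentage;
worker.RunWorkerCompleted += (s, e) => ShowResults(e.Result); // back on the UI thread
worker.RunWorkerAsync(); // returns immediately; the UI keeps processing input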
You can split the load across a couple of worker threads, say 3 threads each dealing with 1000 elements.
You can synchronize them with an AutoResetEvent.
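A sketch of that idea, applied to the strip loop from the question (the thread count and the round-robin split are illustrative):
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Threading;
// Sketch: 3 workers each process every 3rd strip and signal an AutoResetEvent when done.
int workerCount = 3;
var done = new AutoResetEvent[workerCount];
var partialLists = new List<PointF>[workerCount];
for (int w = 0; w < workerCount; w++)
{
    int worker = w; // capture a copy for the closure
    done[worker] = new AutoResetEvent(false);
    partialLists[worker] = new List<PointF>();
    ThreadPool.QueueUserWorkItem(state =>
    {
        for (int i = worker; i < rectTristrip.NofStrips; i += workerCount)
        {
            GraphicsPath path = rectTristrip.Strip[i].TristripToGraphicsPath();
            partialLists[worker].AddRange(path.PathPoints);
        }
        done[worker].Set();
    });
}
WaitHandle.WaitAll(done);              // block until every worker has signalled
foreach (var partial in partialLists)  // merge the partial results in order
    pointList.AddRange(partial);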
Some suggestions, even though I think the bulk of the work is in TristripToGraphicsPath():
// Use rectTristrip.Strip.Length instead of NofStrips
// to let the JIT eliminate bounds checking
// (.Count if it is a list instead of an array)
for (int i = 0; i < rectTristrip.Strip.Length; i++)
{
    VertexList verList = rectTristrip.Strip[i]; // Removed 'new'
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();

    // Assuming pointList is in fact a list, do this:
    pointList.AddRange(rectPath4.PathPoints);

    // Else do this instead (use PathPoints.Length rather than PointCount
    // to let the JIT eliminate bounds checking):
    // for (int j = 0; j < rectPath4.PathPoints.Length; j++)
    // {
    //     pointList.Add(rectPath4.PathPoints[j]);
    // }
}
And maybe define the variable VertexList verList above the loop and just assign verList = rectTristrip.Strip[i]; inside it, to save some memory.