I am trying to export a StringDictionary to a text file. It has over one million records, and it takes over 3 minutes to export to a text file when I use a loop.
Is there a way to do that faster?
Regards
Well, it depends on what format you're using for the export, but in general, the biggest overhead for exporting large amounts of data is going to be I/O. You can reduce this by using a more compact data format, and by doing less manipulation of the data in memory (to avoid memory copies) if possible.
The first thing to do is measure your disk I/O speed and profile the code that does the writing.
If you're maxing out your disk I/O (e.g., writing at a good percentage of disk speed, which would be many tens of megabytes per second on a modern system), you could consider compressing the data before you write it. This uses more CPU, but you write less to the disk when you do this. This will also likely increase the speed of reading the file, if you have the same bottleneck on the reading side.
If you're maxing out your CPU, you need to do less processing work on the data before writing it. If you're using a serialization library, for example, avoiding that and switching to a simpler, more specialized data format might help. Consider the simplest format you need: probably just a word for the length of the string, followed by the string data itself, repeated for every key and value.
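For example, here is a minimal sketch of such a format, assuming 'data' is the StringDictionary from the question; BinaryWriter.Write(string) emits a length prefix followed by the string bytes, and the optional GZipStream trades CPU for less disk I/O:
using System.Collections;
using System.Collections.Specialized;
using System.IO;
using System.IO.Compression;

static void Export(StringDictionary data, string path, bool compress)
{
    Stream stream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 64 * 1024);
    if (compress)
    {
        stream = new GZipStream(stream, CompressionMode.Compress); // less disk I/O at the cost of CPU
    }
    using (var writer = new BinaryWriter(stream))
    {
        foreach (DictionaryEntry pair in data)
        {
            writer.Write((string)pair.Key);   // length prefix + string data
            writer.Write((string)pair.Value);
        }
    }
}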
Note that most dictionary constructs don't preserve insertion order - this often makes them a poor choice if you want repeatable file contents, but (depending on the size) we may be able to improve on the time. The code below takes about 3.5s (for the export) to write just under 30MB:
StringDictionary data = new StringDictionary();
Random rand = new Random(123456);
for (int i = 0; i < 1000000; i++)
{
    data.Add("Key " + i, "Value = " + rand.Next());
}

Stopwatch watch = Stopwatch.StartNew();
using (TextWriter output = File.CreateText("foo.txt"))
{
    foreach (DictionaryEntry pair in data)
    {
        output.Write((string)pair.Key);
        output.Write('\t');
        output.WriteLine((string)pair.Value);
    }
    output.Close();
}
watch.Stop();
watch.Stop();
Obviously the performance will depend on the size of the actual data getting written.
I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums.
The whole process is acceptably fast (14 seconds for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. It is valid on some other points, however.
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
Do not read the entire file into memory for the calculation; that is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4KB), so choose an appropriate buffer size (256KB or more; the optimal value is found by experimenting).
Use the TransformBlock and TransformFinalBlock functions to calculate the hash value. You can pass null for the outputBuffer parameter.
Reuse that buffer for the following files' hash calculations, so there is no need for additional allocations.
Additionally, you can reuse the MD5CryptoServiceProvider, but the benefits are questionable.
Finally, you can apply the async pattern for reading chunks from the stream, so the OS reads the next chunk from disk while you are calculating the partial hash for the previous chunk. Such code is more difficult to write, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that a file is deleted in the meantime.
All of the above is valid for medium to large files. If, instead, you have a lot of very small files, you can speed up the calculation by processing files in parallel. Actually, parallelization can also help with large files, but that is something to be measured.
Finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC for example. A sketch of the chunked approach is below.
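A minimal sketch of the chunked TransformBlock approach, assuming a single-threaded caller (for the parallel variant, give each worker its own provider and buffer); the class name and 256KB buffer size are illustrative:
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class ChunkedHasher
{
    // One buffer and one provider, reused for every file.
    private static readonly byte[] buffer = new byte[256 * 1024];
    private static readonly MD5 md5 = MD5.Create();

    public static string GetMD5(string filename)
    {
        md5.Initialize(); // reset any state left over from the previous file
        using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, buffer.Length))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // outputBuffer can be null when the transformed data isn't needed
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            md5.TransformFinalBlock(buffer, 0, 0);
        }

        var sb = new StringBuilder(32);
        foreach (byte b in md5.Hash)
        {
            sb.Append(b.ToString("x2"));
        }
        return sb.ToString();
    }
}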
In order to create the hash, you have to read every last byte of the file. So this operation is disk-limited, not CPU-limited, and it scales proportionally to the size of the files. Multithreading will not help.
Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.
Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value, for exactly that reason. They may still calculate and store it for validation purposes, but that is it.
Unless the user does some serious (NTFS-driver-loading-level) sabotage, the "last changed" date together with the file size is enough to detect changes. You could also check the archive bit, but that is rarely used nowadays.
A minor improvement for this kind of scenario (list files, then process them) is using EnumerateFiles rather than listing the files up front. But at 14 seconds of listing versus 5 minutes of processing, that will not have any relevant effect. A sketch of the date/size check is below.
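A minimal sketch of change detection via timestamp and size, assuming you keep a snapshot (path -> last write time and length) from the previous run; the snapshot dictionary and method name are illustrative:
using System;
using System.Collections.Generic;
using System.IO;

static IEnumerable<string> FindChangedFiles(
    string root, IDictionary<string, Tuple<DateTime, long>> previousSnapshot)
{
    foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
    {
        var info = new FileInfo(path);
        Tuple<DateTime, long> known;
        if (!previousSnapshot.TryGetValue(path, out known)
            || known.Item1 != info.LastWriteTimeUtc
            || known.Item2 != info.Length)
        {
            yield return path; // new or changed; hash only these if you need extra validation
        }
    }
}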
I have to write thousands of dynamically generated lines to a text file.
I have two choices; which consumes fewer resources and is faster than the other?
A. Using StringBuilder and File.WriteAllText
StringBuilder sb = new StringBuilder();
foreach (Data dataItem in Datas)
{
    sb.AppendLine(
        String.Format(
            "{0}, {1}-{2}",
            dataItem.Property1,
            dataItem.Property2,
            dataItem.Property3));
}
File.WriteAllText("C:\\example.txt", sb.ToString(), new UTF8Encoding(false));
B. Using File.AppendText
using (StreamWriter sw = File.AppendText("C:\\example.txt"))
{
    foreach (Data dataItem in Datas)
    {
        sw.WriteLine(
            String.Format(
                "{0}, {1}-{2}",
                dataItem.Property1,
                dataItem.Property2,
                dataItem.Property3));
    }
}
Your first version, which puts everything into a StringBuilder and then writes it, will consume the most memory. If the text is very large, you risk running out of memory. It has the potential to be faster, but it could also be slower.
The second option will use much less memory (basically just the StreamWriter buffer) and performs very well, possibly better than the first method, without the same risk of running out of memory. I would recommend it.
You can speed it up quite a lot by increasing the size of the output buffer. Rather than
File.AppendText("filename")
Create the stream with:
const int BufferSize = 65536; // 64 Kilobytes
StreamWriter sw = new StreamWriter("filename", true, Encoding.UTF8, BufferSize);
A buffer size of 64K gives much better performance than the default 4K buffer size. You can go larger, but I've found that larger than 64K gives minimal performance gains, and on some systems can actually decrease performance.
You do have at least one other choice, using File.AppendAllLines()
var data = from item in Datas
           select string.Format("{0}, {1}-{2}", item.Property1, item.Property2, item.Property3);
File.AppendAllLines("Filename", data, new UTF8Encoding(false));
This will theoretically use less memory than your first approach since only one line at a time will be buffered in memory.
It will probably be almost exactly the same as your second approach though. I'm just showing you a third alternative. The only advantage of this one is that you can feed it a Linq sequence, which can be useful sometimes.
The I/O speed will dwarf any other considerations, so you should concentrate on minimising memory usage as juharr noted above (and also considering the dangers of premature optimisation, of course!)
That means using your second approach, or the one I put here.
This is for small payloads.
I am looking to achieve 1,000,000,000 payloads per 100ms.
The standard BinaryFormatter is very slow, and the DataContractSerializer is slower than the BinaryFormatter.
Protocol Buffers (http://code.google.com/p/protobuf-net/) seems slower than the BinaryFormatter for small objects!
Are there any other serialization mechanisms I should be looking at, either hardcore hand-coding or open source projects?
EDIT:
I am serializing in memory and then transmitting the payload over TCP on an async socket. The payloads are generated in memory and are small double arrays (10 to 500 points) with a ulong identifier.
Your performance requirement restricts the available serializers to 0. A custom BinaryWriter and BinaryReader would be the fastest you could get.
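A minimal sketch of what such hand-rolled serialization could look like for the payload described in the question (a ulong identifier plus a double array); the Payload type and method names are illustrative, and the writer/reader would wrap whatever stream feeds your socket:
using System.IO;

class Payload
{
    public ulong Id;
    public double[] Points;
}

static void Write(BinaryWriter writer, Payload payload)
{
    writer.Write(payload.Id);
    writer.Write(payload.Points.Length);
    for (int i = 0; i < payload.Points.Length; i++)
    {
        writer.Write(payload.Points[i]);
    }
}

static Payload Read(BinaryReader reader)
{
    var payload = new Payload();
    payload.Id = reader.ReadUInt64();
    int count = reader.ReadInt32();
    payload.Points = new double[count];
    for (int i = 0; i < count; i++)
    {
        payload.Points[i] = reader.ReadDouble();
    }
    return payload;
}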
I'd have expected Protobuf-net to be faster even for small objects... but you may want to try my Protocol Buffer port as well. I haven't used Marc's port for a while - mine was faster when I last benchmarked, but I'm aware that he's gone through a complete rewrite since then :)
I doubt that you'll achieve serializing a billion items in 100ms whatever you do though... I think that's simply an unreasonable expectation, especially if this is writing to disk. (Obviously if you're simply overwriting the same bit of memory repeatedly you'll get a lot better performance than serializing to disk, but I doubt that's really what you're trying to do.)
If you can give us more context, we may be able to help more. Are you able to spread the load out over multiple machines, for example? (Multiple cores serializing to the same IO device is unlikely to help, as I wouldn't expect this to be a CPU-bound operation if it's writing to a disk or the network.)
EDIT: Suppose each object is 10 doubles (8 bytes each) with a ulong identifier (8 bytes). That's 88 bytes per object at a minimum, so you're trying to serialize 88GB in 100ms. I really don't think that's achievable, whatever you use.
I'm running my Protocol Buffers benchmarks now (they give bytes serialized per second) but I highly doubt they'll give you what you want.
You claim small items are slower than BinaryFormatter, but every time I've measured it I've found the exact opposite; for example:
Performance Tests of Serializations used by WCF Bindings
I conclude, especially with the v2 code, that this may well be your fastest option. If you can post your specific benchmark scenario I'll happily help see what is "up"... If you can't post it here, if you want to email it to me directly (see profile) that would be OK too. I don't know if your stated timings are possible under any scheme, but I'm very sure I can get you a lot faster than whatever you are seeing.
With the v2 code, the CompileInPlace gives the fastest result - it allows some IL tricks that it can't use if compiling to a physical dll.
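A hedged sketch of what using protobuf-net v2 with CompileInPlace could look like, based on the advice above; the Payload type mirrors the question's data (ulong id plus double array), and the exact model setup may differ between protobuf-net versions:
using ProtoBuf;
using ProtoBuf.Meta;

[ProtoContract]
class Payload
{
    [ProtoMember(1)]
    public ulong Id;

    [ProtoMember(2, IsPacked = true)]
    public double[] Points;
}

// At startup, prepare the default model once so serialization uses compiled IL.
static void PrepareModel()
{
    RuntimeTypeModel.Default.Add(typeof(Payload), true);
    RuntimeTypeModel.Default.CompileInPlace();
}

// Then serialize to whatever stream feeds the async socket, length-prefixed for framing.
static void Send(System.IO.Stream stream, Payload payload)
{
    Serializer.SerializeWithLengthPrefix(stream, payload, PrefixStyle.Base128);
}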
The only reason to serialize objects is to make them compatible with a generic transport medium. Network, disk, etc. The perf of the serializer never matters because the transport medium is always so much slower than the raw perf of a CPU core. Easily by two orders of magnitude or more.
Which is also the reason that attributes are an acceptable trade-off. They are also I/O bound, their initialization data has to be read from the assembly metadata. Which requires a disk read for the first time.
So, if you are setting perf requirements, you need to focus 99% on the capability of the transport medium. A billion 'payloads' in 100 milliseconds requires very beefy hardware. Assuming a payload is 16 bytes, you'd need to move 160 gigabytes per second. This is quite beyond even the memory bus bandwidth inside the machine. DDR RAM moves at about 5 gigabytes per second. A one-gigabit Ethernet NIC moves at 125 megabytes per second, burst. A commodity hard drive moves at 65 megabytes per second, assuming no seeking.
Your goal is not realistic with current hardware capabilities.
You could write custom serialization by implementing ISerializable on your data structures. Either way, you will probably face some "impedance" from the hardware itself when serializing with these requirements.
Protobuf-net is really quick but has its limitations: http://code.google.com/p/protobuf-net/wiki/Performance
In my experience, Marc's Protocol Buffers implementation is very good. I haven't used Jon's. However, you should be trying to use techniques to minimise the data, not serialise the whole lot.
I would have a look at the following.
If the messages are small, you should look at what entropy you have. You may have fields that can be partially or completely de-duplicated. If the communication is between two parties only, you may get benefits from building a dictionary at both ends.
You are using TCP, which has enough overhead without a payload on top. You should minimise this by batching your messages into larger bundles and/or looking at UDP instead. Batching, when combined with #1, may get you closer to your requirement when you average your total communication out.
Is the full data width of double required, or is it there for convenience? If the extra bits are not used, this is a chance for optimisation when converting to a binary stream.
Generic serialisation is generally great when you have multiple messages to handle over a single interface, or when you don't know the full implementation details. In this case it would probably be better to build your own serialisation methods to convert a single message structure directly to byte arrays. Since you know the full implementation on both sides, direct conversion won't be a problem. It would also ensure that you can inline the code and avoid boxing/unboxing as much as possible. A sketch of the batching idea is below.
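A minimal sketch of batching (#2) combined with narrowing doubles to floats (#3), assuming the receiver knows the same layout; the Message type and method name are illustrative:
using System.IO;

class Message
{
    public ulong Id;
    public double[] Points;
}

static byte[] PackBatch(Message[] batch)
{
    using (var ms = new MemoryStream())
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write(batch.Length);             // one count for the whole bundle
        foreach (var msg in batch)
        {
            writer.Write(msg.Id);
            writer.Write(msg.Points.Length);
            foreach (var p in msg.Points)
            {
                writer.Write((float)p);         // only if float precision is acceptable
            }
        }
        writer.Flush();
        return ms.ToArray();                    // send this as a single TCP/UDP payload
    }
}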
This is the FASTEST approach I'm aware of. It does have its drawbacks. Like a rocket, you wouldn't want it on your car, but it has its place. For example, you need to set up your structs and have that same struct on both ends of your pipe, and the struct needs to be a fixed size, or it gets more complicated than this example.
Here is the perf I get on my machine (i7 920, 12GB RAM), Release mode, without a debugger attached. It uses 100% CPU during the test, so this test is CPU-bound.
Finished in 3421ms, Processed 52.15 GB
For data write rate of 15.25 GB/s
Round trip passed
.. and the code...
class Program
{
    unsafe static void Main(string[] args)
    {
        int arraySize = 100;
        int iterations = 10000000;
        ms[] msa = new ms[arraySize];
        for (int i = 0; i < arraySize; i++)
        {
            msa[i].d1 = i + .1d;
            msa[i].d2 = i + .2d;
            msa[i].d3 = i + .3d;
            msa[i].d4 = i + .4d;
            msa[i].d5 = i + .5d;
            msa[i].d6 = i + .6d;
            msa[i].d7 = i + .7d;
        }

        int sizeOfms = Marshal.SizeOf(typeof(ms));
        byte[] bytes = new byte[arraySize * sizeOfms];

        TestPerf(arraySize, iterations, msa, sizeOfms, bytes);

        // let's round trip it.
        var msa2 = new ms[arraySize]; // array of structs we want to push the bytes into
        var handle2 = GCHandle.Alloc(msa2, GCHandleType.Pinned); // get a handle to that array
        Marshal.Copy(bytes, 0, handle2.AddrOfPinnedObject(), bytes.Length); // do the copy
        handle2.Free(); // clean up the handle

        // assert that we didn't lose any data.
        var passed = true;
        for (int i = 0; i < arraySize; i++)
        {
            if (msa[i].d1 != msa2[i].d1
                || msa[i].d2 != msa2[i].d2
                || msa[i].d3 != msa2[i].d3
                || msa[i].d4 != msa2[i].d4
                || msa[i].d5 != msa2[i].d5
                || msa[i].d6 != msa2[i].d6
                || msa[i].d7 != msa2[i].d7)
            {
                passed = false;
                break;
            }
        }
        Console.WriteLine("Round trip {0}", passed ? "passed" : "failed");
    }

    unsafe private static void TestPerf(int arraySize, int iterations, ms[] msa, int sizeOfms, byte[] bytes)
    {
        // start benchmark.
        var sw = Stopwatch.StartNew();
        // this cheats a little bit and reuses the same buffer
        // for each thread, which would not work IRL
        var plr = Parallel.For(0, iterations / 1000, i => // just to be nice to the task pool, chunk tasks into 1000s
        {
            for (int j = 0; j < 1000; j++)
            {
                // get a handle to the struct[] we want to copy from
                var handle = GCHandle.Alloc(msa, GCHandleType.Pinned);
                Marshal.Copy(handle.AddrOfPinnedObject(), bytes, 0, bytes.Length); // copy from it
                handle.Free(); // clean up the handle
                // here you would want to write to some buffer or something :)
            }
        });
        // stop benchmark
        sw.Stop();
        var size = arraySize * sizeOfms * (double)iterations / 1024 / 1024 / 1024d; // convert from bytes to GB
        Console.WriteLine("Finished in {0}ms, Processed {1:N} GB", sw.ElapsedMilliseconds, size);
        Console.WriteLine("For data write rate of {0:N} GB/s", size / (sw.ElapsedMilliseconds / 1000d));
    }
}
[StructLayout(LayoutKind.Explicit, Size = 56, Pack = 1)]
struct ms
{
    [FieldOffset(0)]
    public double d1;
    [FieldOffset(8)]
    public double d2;
    [FieldOffset(16)]
    public double d3;
    [FieldOffset(24)]
    public double d4;
    [FieldOffset(32)]
    public double d5;
    [FieldOffset(40)]
    public double d6;
    [FieldOffset(48)]
    public double d7;
}
If you don't want to take the time to implement a comprehensive explicit serialization/deserialization mechanism, try this: http://james.newtonking.com/json/help/html/JsonNetVsDotNetSerializers.htm ...
In my usage with large objects (1GB+ when serialized to disk), I find that the file generated by the Newtonsoft library is 4.5 times smaller and takes about one-sixth of the time to process compared with the BinaryFormatter.
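A minimal sketch of stream-based usage of the Json.NET library referenced above; the type parameter and file paths are illustrative:
using System.IO;
using Newtonsoft.Json;

static void Save<T>(T value, string path)
{
    var serializer = new JsonSerializer();
    using (var writer = new StreamWriter(path))
    using (var jsonWriter = new JsonTextWriter(writer))
    {
        serializer.Serialize(jsonWriter, value); // streams the object graph instead of building one big string
    }
}

static T Load<T>(string path)
{
    var serializer = new JsonSerializer();
    using (var reader = new StreamReader(path))
    using (var jsonReader = new JsonTextReader(reader))
    {
        return serializer.Deserialize<T>(jsonReader);
    }
}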
I am creating a method in C# which generates a text file for a Google Product Feed. The feed will contain upwards of 30,000 records and the text file currently weighs in at ~7MB.
Here's the code I am currently using (some lines removed for brevity's sake).
public static void GenerateTextFile(string filePath) {
    var sb = new StringBuilder(1000);

    sb.Append("availability").Append("\t");
    sb.Append("condition").Append("\t");
    sb.Append("description").Append("\t");
    // repetitive code hidden for brevity ...
    sb.Append(Environment.NewLine);

    var items = inventoryRepo.GetItemsForSale();
    foreach (var p in items) {
        sb.Append("in stock").Append("\t");
        sb.Append("used").Append("\t");
        sb.Append(p.Description).Append("\t");
        // repetitive code hidden for brevity ...
        sb.AppendLine();
    }

    using (StreamWriter outfile = new StreamWriter(filePath)) {
        result.Append("Writing text file to disk.").AppendLine(); // 'result' appears to be a logging StringBuilder defined elsewhere (omitted for brevity)
        outfile.Write(sb.ToString());
    }
}
I am wondering if StringBuilder is the right tool for the job. Would there be performance gains if I used a TextWriter instead?
I don't know a ton about IO performance so any help or general improvements would be appreciated. Thanks.
File I/O operations are generally well optimized in modern operating systems. You shouldn't try to assemble the entire string for the file in memory ... just write it out piece by piece. The FileStream will take care of buffering and other performance considerations.
You can make this change easily by moving:
using (StreamWriter outfile = new StreamWriter(filePath)) {
to the top of the function, getting rid of the StringBuilder and writing directly to the file instead. For example:
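Here is a hedged sketch of the questioner's method rewritten that way (the hidden columns stay hidden; inventoryRepo comes from the original code):
public static void GenerateTextFile(string filePath) {
    using (StreamWriter outfile = new StreamWriter(filePath)) {
        outfile.Write("availability");
        outfile.Write("\t");
        outfile.Write("condition");
        outfile.Write("\t");
        outfile.Write("description");
        outfile.Write("\t");
        // repetitive code hidden for brevity ...
        outfile.WriteLine();

        var items = inventoryRepo.GetItemsForSale();
        foreach (var p in items) {
            outfile.Write("in stock");
            outfile.Write("\t");
            outfile.Write("used");
            outfile.Write("\t");
            outfile.Write(p.Description);
            outfile.Write("\t");
            // repetitive code hidden for brevity ...
            outfile.WriteLine();
        }
    }
}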
There are several reasons why you should avoid building up large strings in memory:
It can actually perform worse, because the StringBuilder has to increase its capacity as you write to it, resulting in reallocation and copying of memory.
It may require more memory than you can physically allocate - which may result in the use of virtual memory (the swap file) which is much slower than RAM.
For truly large files (> 2GB) you will run out of address space (on 32-bit platforms) and will fail to ever complete.
To write the StringBuilder contents to a file you have to use ToString() which effectively doubles the memory consumption of the process since both copies must be in memory for a period of time. This operation may also fail if your address space is sufficiently fragmented, such that a single contiguous block of memory cannot be allocated.
Just move the using statement so it encompasses the whole of your code, and write directly to the file. I see no point in keeping it all in memory first.
Write one string at a time using StreamWriter.Write rather than caching everything in a StringBuilder.
This might be old, but I had a file to write with about 17 million lines, so I ended up batching the writes every 10k lines, similar to the lines below.
for (i6 = 1; i6 <= ball; i6++)
{   // this is the middle of a 6-deep nest ..
    counter++;

    // modulus to get a value every 10k lines
    divtrue = counter % 10000; // remainder operator % for 10k

    // build the string of fields with \n at the end
    lineout = lineout + whatever; // 'whatever' stands in for the fields of this line

    // the magic 10k block here
    if (divtrue == 0)
    {
        using (StreamWriter outFile = new StreamWriter(filepath, true))
        {
            // write the 10k lines with .Write, NOT .WriteLine ..
            outFile.Write(lineout);
        }
        // reset the string so we don't do something silly like overflow memory
        lineout = "";
    }
}
// note: any lines left over after the loop (when the total isn't a multiple of 10k)
// still need one final Write of lineout
In my case it was MUCH faster than writing one line at a time.
I am developing an app that utilizes very large lookup tables to speed up mathematical computations. The largest of these tables is an int[] that has ~10 million entries. Not all of the lookup tables are int[]. For example, one is a Dictionary with ~200,000 entries. Currently, I generate each lookup table once (which takes several minutes) and serialize it to disk (with compression) using the following snippet:
int[] lut = GenerateLUT();
lut.Serialize("lut");
where Serialize is defined as follows:
public static void Serialize(this object obj, string file)
{
    using (FileStream stream = File.Open(file, FileMode.Create))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Compress))
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(gz, obj);
        }
    }
}
The annoyance I am having is that, when launching the application, the deserialization of these lookup tables takes very long (upwards of 15 seconds). This kind of delay will annoy users, as the app will be unusable until all the lookup tables are loaded. Currently the deserialization is as follows:
Dictionary<string, int> lut1 = (Dictionary<string, int>) Deserialize("lut1");
int[] lut2 = (int[]) Deserialize("lut2");
...
where Deserialize is defined as:
public static object Deserialize(string file)
{
    using (FileStream stream = File.Open(file, FileMode.Open))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Decompress))
        {
            var formatter = new BinaryFormatter();
            return formatter.Deserialize(gz);
        }
    }
}
At first, I thought it might be the gzip compression causing the slowdown, but removing it only shaved a few hundred milliseconds off the serialization/deserialization routines.
Can anyone suggest a way of speeding up the load times of these lookup tables upon the app's initial startup?
First, deserializing in a background thread will prevent the app from "hanging" while this happens. That alone may be enough to take care of your problem.
However, serialization and deserialization (especially of large dictionaries) tend to be very slow in general. Depending on the data structure, writing your own serialization code can dramatically speed this up, particularly if there are no shared references in the data structures.
That being said, depending on the usage pattern, a database might be a better approach. You could always make something more database-oriented and build the lookup table lazily from the DB (i.e., a lookup goes to the LUT first, and if the entry doesn't exist there, it is loaded from the DB and saved in the table). This would make startup instantaneous (at least in terms of the LUT) and probably still keep lookups fairly snappy.
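A minimal sketch of the background-loading idea, assuming .NET 4's Task API; Deserialize is the question's helper, and the field and method names are illustrative:
using System.Threading.Tasks;

static Task<int[]> lut2Loading;

static void BeginLoadingTables()
{
    // kick this off at startup, before the user needs the tables
    lut2Loading = Task.Factory.StartNew(() => (int[])Deserialize("lut2"));
}

static int Lookup(int index)
{
    // blocks only if the table isn't ready yet
    int[] lut2 = lut2Loading.Result;
    return lut2[index];
}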
I guess the obvious suggestion is to load them in the background. Once the app has started, the user has opened their project, and selected whatever operation they want, there won't be much of that 15 seconds left to wait.
Just how much data are we talking about here? In my experience, it takes about 20 seconds to read a gigabyte from disk into memory. So if you're reading upwards of half a gigabyte, you're almost certainly running into hardware limitations.
If data transfer rate isn't the problem, then the actual deserialization is taking time. If you have enough memory, you can load all of the tables into memory buffers (using File.ReadAllBytes()) and then deserialize from a memory stream. That will allow you to determine how much time reading is taking, and how much time deserialization is taking.
If deserialization is taking a lot of time and you have multiple processors, you could spawn multiple threads to do the deserialization in parallel. With such a system, you could potentially be deserializing one or more tables while loading the data for another. That pipelined approach could make your entire load/deserialization time almost as fast as a load-only pass.
Another option is to put your tables into, well, tables: real database tables. Even an engine like Access should yield pretty good performance, because you have an obvious index for every query. Now the app only has to read in data when it's actually about to use it, and even then it's going to know exactly where to look inside the file.
This might make the app's actual performance a bit lower, because you have to do a disk read for every calculation. But it would make the app's perceived performance much better, because there's never a long wait. And, like it or not, the perception is probably more important than the reality.
Why zip them?
Disk is bigger than RAM.
A straight binary read should be pretty quick.
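For instance, a minimal sketch of a raw binary round trip for one of the int[] tables, with no BinaryFormatter and no compression; file and method names are illustrative:
using System;
using System.IO;

static void SaveLut(int[] lut, string file)
{
    using (var stream = File.Create(file))
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(lut.Length);
        var bytes = new byte[lut.Length * sizeof(int)];
        Buffer.BlockCopy(lut, 0, bytes, 0, bytes.Length); // one bulk copy instead of ~10 million element writes
        writer.Write(bytes);
    }
}

static int[] LoadLut(string file)
{
    using (var stream = File.OpenRead(file))
    using (var reader = new BinaryReader(stream))
    {
        int count = reader.ReadInt32();
        var bytes = reader.ReadBytes(count * sizeof(int));
        var lut = new int[count];
        Buffer.BlockCopy(bytes, 0, lut, 0, bytes.Length);
        return lut;
    }
}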