There is a lot of information available about the .NET LOH and it has been explained in various articles. However, it seems that some articles lack a bit of precision.
Outdated information
In his answer from 2009, Brian Rasmussen, a program manager at Microsoft, says the limit is 85000 bytes. He also lets us know that there is an even more curious case of double[] arrays with 1000 elements. The same 85000-byte limit is stated by Maoni Stephens (MSDN, 2008), a member of the CLR team.
In the comments, Brian Rasmussen becomes even more precise and lets us know that it can be reproduced with a byte[] of 85000 bytes minus 12 bytes.
2013 update
Mario Hewardt (author of 'Advanced Windows Debugging') told us in 2013 that .NET 4.5.1 can now compact the LOH as well, if we tell it to do so. Since it is turned off by default, the problem remains unless you're aware of it already.
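For reference, opting in looks roughly like this (a small sketch; it requires .NET 4.5.1 or later and using System.Runtime, and the setting applies to the next blocking full collection and then resets):
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(); // the LOH is compacted during this full blocking collection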
2015 update
I can't reproduce the byte[] example any more. With a short brute-force algorithm, I found out that I have to subtract 24 instead (byte[84999-24] in SOH, byte[85000-24] in LOH):
static void Main(string[] args)
{
    int diff = 0;
    int generation = 3;
    while (generation > 0)
    {
        diff++;
        byte[] large = new byte[85000 - diff];
        generation = GC.GetGeneration(large);
    }
    Console.WriteLine(diff);
}
I also couldn't reproduce the double[] statement. Brute-forcing gives me 10622 elements as the border (double[10621] in SOH, double[10622] in LOH):
static void Main(string[] args)
{
    int size = 85000;
    int step = 85000 / 2;
    while (step > 0)
    {
        double[] d = new double[size];
        int generation = GC.GetGeneration(d);
        size += (generation > 0) ? -step : step;
        step /= 2;
    }
    Console.WriteLine(size);
}
This happens even if I compile the application for older .NET frameworks. It also does not depend on Release or Debug build.
How can the changes be explained?
The change from 12 to 24 in the byte[] example can be explained by the change in CPU architecture from 32 to 64 bit. In programs compiled for x64 or AnyCPU, the .NET overhead increases from 2 * 4 bytes (4-byte object header + 4-byte method table pointer) to 2 * 8 bytes (8-byte object header + 8-byte method table pointer). In addition, the array has a length field of 4 bytes (32 bit) versus 8 bytes (64 bit).
For the double[] example, just use a calculator: 85000 bytes / 8 bytes (64 bit) per double = 10625 items, which is already close. Taking the .NET overhead into account, the result is (85000 bytes - 24 bytes) / 8 bytes per double = 10622 doubles. So there is no special handling of double[] any more.
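A quick way to double-check these boundaries is to ask the GC directly. The following sketch assumes x64 and the values measured above; the exact numbers may differ on other runtimes or architectures (LOH objects report generation 2):
Console.WriteLine(GC.GetGeneration(new byte[84975]));   // 0 -> small object heap
Console.WriteLine(GC.GetGeneration(new byte[84976]));   // 2 -> large object heap
Console.WriteLine(GC.GetGeneration(new double[10621])); // 0 -> small object heap
Console.WriteLine(GC.GetGeneration(new double[10622])); // 2 -> large object heap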
BTW, I have never found any working demonstration for LOH fragmentation before, so I wrote one myself. Just compile the following code for x86 and run it. It even includes some debugging hints.
It won't work as well when compiled as x64, since Windows might increase the size of the pagefile, so the subsequent allocation of 20 MB could succeed again.
class Program
{
    static IList<byte[]> small = new List<byte[]>();
    static IList<byte[]> big = new List<byte[]>();

    static void Main()
    {
        int totalMB = 0;
        try
        {
            Console.WriteLine("Allocating memory...");
            while (true)
            {
                big.Add(new byte[10 * 1024 * 1024]);
                small.Add(new byte[85000 - 3 * IntPtr.Size]);
                totalMB += 10;
                Console.WriteLine("{0} MB allocated", totalMB);
            }
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("Memory is full now. Attach and debug if you like. Press Enter when done.");
            Console.WriteLine("For WinDbg, try `!address -summary` and `!dumpheap -stat`.");
            Console.ReadLine();
            big.Clear();
            GC.Collect();
            Console.WriteLine("Lots of memory has been freed. Check again with the same commands.");
            Console.ReadLine();
            try
            {
                big.Add(new byte[20 * 1024 * 1024]);
            }
            catch (OutOfMemoryException)
            {
                Console.WriteLine("It was not possible to allocate 20 MB although {0} MB are free.", totalMB);
                Console.ReadLine();
            }
        }
    }
}
Related
Kindly bear with me for this confusing question. I'm finding it as hard to describe as it is involving and tiresome. Read it and you'll know why.
I've been chasing this issue for over a month now without much progress. I'm using an STM32 (an STM32F103C8 mounted on a BluePill board) to communicate with a C# app through an FT232R serial-to-USB converter. The complete communication protocol is a bit complex, so I'm writing here a simplified version of the code that describes my problem quite accurately.
STM32 does the following.
In the initial setup,
Serial.begin at 2000000 (yes, it's very high, but I've analyzed it using an oscilloscope and the signal is very healthy; impedance matching and clock jitter are very accurate).
Waits for a command from the C# end to enter the loop
In the loop, it does the following.
TX a byte buffer of length N on the serial port. Packet structure is 0xAA, N bytes, 1 byte checksum.
repeat the loop
And on the C# side (Pseudo code),
new Thread(() => { while (true) { IOTick(); Thread.Sleep(30); } }).Start();
IOTick() is defined as:
{
    while (SerialPortObject.BytesToRead > 1)
    {
        header = read();
        if (header != 0xAA) continue;
        byte[] buffer = new byte[N + 1];
        receivedBytes = readBytes(buffer, N + 1, Timeout = 500ms); // receivedBytes is never less than N + 1 for timeouts greater than 120
        // Use the N = 16 bytes. Check the Nth byte to compare the checksum. Doesn't take too much CPU time.
        // Send a "packet received" software event.
    }
}
readBytes is defined as
int readBytes(byte[] buffer, int count, int timeout)
{
    var st = DateTime.Now;
    for (int i = 0; i < count; i++)
    {
        var b_ = read(timeout);
        if (b_ == -1)
            return i;
        buffer[i] = (byte)b_;
        timeout -= (int)(DateTime.Now - st).TotalMilliseconds;
    }
    return count;
}
int buffer2ReadIndex = 0;
byte[] buffer2 = new byte[0];

int read(int timeout)
{
    DateTime start = DateTime.Now;
    if (buffer2.Length == 0)
    {
        while (SerialPortObject.BytesToRead <= 0)
        {
            if ((DateTime.Now - start).TotalMilliseconds > timeout)
                return -1;
            System.Threading.Thread.Sleep(30);
        }
        buffer2 = new byte[SerialPortObject.BytesToRead];
        SerialPortObject.Read(buffer2, 0, buffer2.Length);
    }
    if (buffer2.Length > 0)
    {
        var b = buffer2[buffer2ReadIndex];
        buffer2ReadIndex++;
        if (buffer2ReadIndex >= buffer2.Length)
        {
            buffer2ReadIndex = 0;
            buffer2 = new byte[0];
        }
        return b;
    }
    return -1;
}
Now, everything is working as expected. The packet received software event is triggered no later than every ~30 ms (the Windows tick time). The problem starts if I have to wait between each packet TX on the STM side. First, I suspected that the I2C I was using for some tasks between packet TXs was causing some hardware or software conflict that corrupted the serial data. But then I noticed that the same thing happens even if I only introduce a delay of 1 millisecond using the Arduino delay() between each packet TX. Almost 1K packets should be received every second now. Roughly 1 out of 10 packets after a successful header reception is either not delivered completely or delivered with a corrupted checksum, causing the C# app to lose track of the packet header. Tracing the next header obviously requires flushing some bytes, losing some packets in the communication. Even this wouldn't sound too bad for an app that can afford 5% data packet loss, but strangely, when this anomaly occurs, the packet received software event waits for more than 1 second after every couple hundred consecutive events.
I'm completely blind here. I even tried it at a 115200 baud rate; it shows the same loss, with a slightly lower loss ratio. It should be noted that at 9600 baud the issue doesn't happen. This is the only hint I've got right now.
It looks like I've found an answer.
After digging deep into the SerialPort class and its BaseStream, and after doing some reading of the documentation and some benchmarking, here is what I've observed:
SerialPort.BytesToRead updates are not uniform, and the DataReceived event seems to follow it. When bytes are arriving at ~200 kHz (baud = 2 Mbps), it is updated almost instantaneously (or within 30 ms in the worst case). When they are arriving at ~20 kHz or slower (evenly spaced in time using a microcontroller), SerialPort.BytesToRead can take up to 400 ms to update. This happens only after a dozen or so 30 ms updates.
So, observing this, I can say that SerialPort.BytesToRead is updated on two conditions: either some amount of time has passed since the data arrived (and this time is not constrained to 30 ms), or the data is coming in fast enough.
This is a strange behavior. No data is lost while this anomaly is occurring. Not surprisingly, about 0.06% of bytes are lost when working at full bandwidth (200 KB/s at a baud rate of 2 Mbps).
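For reference, here is a minimal sketch of how these update delays can be observed (SerialPortObject stands for an already opened System.IO.Ports.SerialPort; the absolute numbers will of course depend on your hardware and baud rate):
// Sketch: log how long it takes for SerialPort.BytesToRead to change.
var sw = System.Diagnostics.Stopwatch.StartNew();
int lastCount = 0;
while (true)
{
    int count = SerialPortObject.BytesToRead;
    if (count != lastCount)
    {
        Console.WriteLine("{0,8} ms: BytesToRead = {1}", sw.ElapsedMilliseconds, count);
        lastCount = count;
    }
    System.Threading.Thread.Sleep(1); // polling interval; the observed update gaps are much larger than this
}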
I have encountered a situation where a simple .NET Fibonacci program is slower on a particular set of servers, and the only thing that is obviously different is the CPU.
AMD Opteron Processor 6276 - 11 secs
Intel Xeon CPU E7-4850 - 7 secs
Code is compiled for x86 and using .NET Framework 4.0.
- Clock speeds of both are similar and, in fact, PassMark benchmarks give higher scores for the AMD.
- Have tried this on other AMD servers in the farm and the times are similarly slow.
- Even my local i7 machine runs the code faster.
Fibonacci code:
class Program
{
    static void Main(string[] args)
    {
        const int ITERATIONS = 10000;
        const int FIBONACCI = 100000;
        var watch = new Stopwatch();
        watch.Start();
        DoFibonnacci(ITERATIONS, FIBONACCI);
        watch.Stop();
        Console.WriteLine("Total fibonacci time: {0}ms", watch.ElapsedMilliseconds);
        Console.ReadLine();
    }

    private static void DoFibonnacci(int ITERATIONS, int FIBONACCI)
    {
        for (int i = 0; i < ITERATIONS; i++)
        {
            Fibonacci(FIBONACCI);
        }
    }

    private static int Fibonacci(int x)
    {
        var previousValue = -1;
        var currentResult = 1;
        for (var i = 0; i <= x; ++i)
        {
            var sum = currentResult + previousValue;
            previousValue = currentResult;
            currentResult = sum;
        }
        return currentResult;
    }
}
Any ideas on what may be going on?
As we've established in the comments, you can work around this performance gap by pinning the process to a specific processor on the AMD Opteron machines.
Kindled by this not-really-on-topic question, I decided to have a look at possible scenarios where single-core pinning would make such a difference (going from 11 to 7 seconds seems a bit extreme).
The most plausible answer is not that revolutionary:
The AMD Opteron series employs HyperTransport in a so-called NUMA architecture, instead of the traditional FSB you would find on Intel's SMP CPUs (the Xeon 4850 included).
My guess is that this symptom stems from the fact that individual nodes in a NUMA architecture have individual caches, as opposed to the Intel CPU, in which the processor cache is shared.
In other words, when consecutive computations shift between nodes on the Opteron, the cache is flushed, whereas balancing between processors in an SMP architecture like the Xeon 4850's has no such impact since the cache is shared.
Setting affinity in .NET is pretty easy, just pick a processor (let's just take the first one for simplicity):
static void Main(string[] args)
{
    Console.WriteLine(Environment.ProcessorCount);
    Console.Read();

    // An AffinityMask of 0x0001 will make sure the process is always pinned to processor 0
    Process thisProcess = Process.GetCurrentProcess();
    thisProcess.ProcessorAffinity = (IntPtr)0x0001;

    const int ITERATIONS = 10000;
    const int FIBONACCI = 100000;
    var watch = new Stopwatch();
    watch.Start();
    DoFibonnacci(ITERATIONS, FIBONACCI);
    watch.Stop();
    Console.WriteLine("Total fibonacci time: {0}ms", watch.ElapsedMilliseconds);
    Console.ReadLine();
}
Although I'm pretty sure this is not very smart in a NUMA environment.
Windows 2008 R2 has some cool native NUMA functionality, and I found a promising CodePlex project with a .NET wrapper for this as well: http://multiproc.codeplex.com/
I'm in no way near qualified to teach you how to utilize this technology, but this should point you in the right direction.
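If you would rather stay with plain P/Invoke instead of the wrapper, the underlying Win32 calls look roughly like this (a sketch only, error handling omitted; both functions live in kernel32.dll):
using System;
using System.Runtime.InteropServices;

static class NumaInfo
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetNumaHighestNodeNumber(out uint highestNodeNumber);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetNumaProcessorNode(byte processor, out byte nodeNumber);

    public static void PrintLayout()
    {
        uint highestNode;
        if (GetNumaHighestNodeNumber(out highestNode))
            Console.WriteLine("Highest NUMA node: {0}", highestNode);

        for (int cpu = 0; cpu < Environment.ProcessorCount; cpu++)
        {
            byte node;
            if (GetNumaProcessorNode((byte)cpu, out node))
                Console.WriteLine("Processor {0} is on NUMA node {1}", cpu, node);
        }
    }
}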
I'm currently writing software in Visual Studio 2012 for communication with RFID cards.
I got a DLL written in Delphi to handle the communication with the card reader.
The problem is: my software runs fine on machines that have VS2012 installed. On other systems it freezes itself or even the whole system.
I tried it on Windows XP / 7 / 8 with x86 and x64 configurations.
I'm using .NET 4.0.
After connecting to the reader, the software starts a BackgroundWorker, which polls the reader (at a 200 ms rate) with a command to inventory the cards in the reader's RF field. The crash usually happens about 10 to 20 seconds after connecting to the reader. Here is the code:
[DllImport("tempConnect.dll", CallingConvention = CallingConvention.StdCall)]
private static extern int inventory(int maxlen, [In] ref int count,
IntPtr UIDs, UInt32 HFOffTime);
public String getCardID()
{
if (isConnectet())
{
IntPtr UIDs = IntPtr.Zero;
int len = 2 * 8;
Byte[] zero = new Byte[len];
UIDs = Marshal.AllocHGlobal(len);
Thread.Sleep(50);
Marshal.Copy(zero, 0, UIDs, len);
int count = 0;
int erg;
String ret;
try
{
erg = inventory(len, ref count, UIDs, 50);
}
catch (ExternalException) // this doesn't catch anything (iI have set <legacyCorruptedStateExceptionsPolicy enabled="true"/>)
{
return "\0";
}
finally
{
ret = Marshal.PtrToStringAnsi(UIDs, len);
IntPtr rslt = LocalFree(UIDs);
GC.Collect();
}
if (erg == 0)
return ret;
else
return zero.ToString();
}
else
return "\0";
}
The DLL is written in Delphi, the code DLL command is:
function inventory (maxlen: Integer; var count: Integer;
UIDs: PByteArray; HFOffTime: Cardinal = 50): Integer; STDCALL;
I think there may be a memory leak somewhere, but I have no idea how to find it...
EDIT:
I added some ideas (explicit GC.Collect(), try-catch-finally) to my code above, but it still doesn't work.
Here is the code that calls getCardID():
The action that runs every 200 ms:
if (!bgw_inventory.IsBusy)
    bgw_inventory.RunWorkerAsync();
Async backgroundWorker does:
private void bgw_inventory_DoWork(object sender, DoWorkEventArgs e)
{
    if (bgw_inventory.CancellationPending)
    {
        e.Cancel = true;
        return;
    }
    else
    {
        String UID = reader.getCardID();
        if (bgw_inventory.CancellationPending)
        {
            e.Cancel = true;
            return;
        }
        if (UID.Length == 16 && UID.IndexOf("\0") == -1)
        {
            setCardId(UID);
            if (!allCards.ContainsKey(UID))
            {
                allCards.Add(UID, new Card(UID));
            }
            if (readCardActive || deActivateCardActive || activateCardActive)
            {
                if (lastActionCard != UID)
                    actionCard = UID;
                else
                    setWorkingStatus("OK", Color.FromArgb(203, 218, 138));
            }
        }
        else
        {
            setCardId("none");
            if (readCardActive || deActivateCardActive || activateCardActive)
                setWorkingStatus("waiting for next card", Color.Yellow);
        }
    }
}
EDIT
Until now I have made some small reworks (updated above) to the code. Now only the app crashes, with 0xC00000FD (stack overflow) in "tempConnect.dll". This does not happen on systems with VS2012 installed, or if I use the DLL from native Delphi!
Does anyone have any other ideas?
EDIT
Now I made the DLL log its stack size and found something weird:
If it is called and polled from my C# program, the stack size changes continuously up and down.
If I do the same from a native Delphi program, the stack size is constant!
So I'll do further investigations, but I have no real idea what I should be looking for...
I'm a little concerned about how you're using the Marshal class. As you fear with the memory leak, it seems to be allocating memory quite often, but I don't see it ever explicitly released. The garbage collector should (operative word) be taking care of that, but you say yourself you have some unmanaged code in the mix. It is difficult with the posted information to tell where the unmanaged code begins.
Check out this question for some good techniques for finding memory leaks in .NET itself - this will give you a ton of information on how memory is being used on the managed end of your code (that is, the part you can directly control). Use the Windows Performance Monitor with breakpoints to keep an eye on the overall health of the system. If .NET appears to be behaving but WPM is showing some sharp spikes, the problem is probably in the unmanaged code. You can't really control anything but your usage there, so it would probably be time to go back to the documentation at that point.
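As a side note, if the unmanaged buffer turns out to be the problem, the usual pattern is to pair every Marshal.AllocHGlobal with Marshal.FreeHGlobal in a finally block; a minimal sketch (not your exact code):
IntPtr UIDs = Marshal.AllocHGlobal(len);
try
{
    // ... call inventory() and read the result back ...
}
finally
{
    Marshal.FreeHGlobal(UIDs); // matches AllocHGlobal; LocalFree is meant for memory from LocalAlloc
}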
I'm trying to stream sound samples from my microphone to my speakers using DirectSound and C#. It should be similar to 'listening to the microphone', but later I want to use this for something else. While testing my approach I've noticed quiet ticking and crackling noises in the background. I would guess this has something to do with the delay between writing and playing the buffer, which must be greater than the latency of writing the chunks.
If I set the delay between recording and playout to less than 50 ms, it mostly works, but sometimes I get really loud crackling noises. So I've decided on a delay of at least 50 ms. This works okay for me, but the delay of the system's "listen to this device" feature seems to be much shorter. I would guess it is about 15-30 ms and hardly noticeable. With 50 ms I get at least a little reverb effect.
In the following I'll show you my microphone code (partially):
The initialisation is done like this:
capture = new Capture(device);
// Creating the buffer
// Determining the buffer size
bufferSize = format.AverageBytesPerSecond * bufferLength / 1000;
while (bufferSize % format.BlockAlign != 0) bufferSize += 1;
chunkSize = Math.Max(bufferSize, 256);
bufferSize = chunkSize * BUFFER_CHUNKS;
this.bufferLength = chunkSize * 1000 / format.AverageBytesPerSecond; // Redetermining the buffer Length that will be used.
captureBufferDescription = new CaptureBufferDescription();
captureBufferDescription.BufferBytes = bufferSize;
captureBufferDescription.Format = format;
captureBuffer = new CaptureBuffer(captureBufferDescription, capture);
// Creating Buffer control
bufferARE = new AutoResetEvent(false);
// Adding notifier to buffer.
bufferNotify = new Notify(captureBuffer);
BufferPositionNotify[] bpns = new BufferPositionNotify[BUFFER_CHUNKS];
for (int i = 0; i < BUFFER_CHUNKS; i++)
    bpns[i] = new BufferPositionNotify()
    {
        Offset = chunkSize * (i + 1) - 1,
        EventNotifyHandle = bufferARE.SafeWaitHandle.DangerousGetHandle()
    };
bufferNotify.SetNotificationPositions(bpns);
The capturing will run like this in an extra thread:
// Initializing
MemoryStream tempBuffer = new MemoryStream();

// Capturing
while (isCapturing && captureBuffer.Capturing)
{
    bufferARE.WaitOne();
    if (isCapturing && captureBuffer.Capturing)
    {
        captureBuffer.Read(currentBufferPart * chunkSize, tempBuffer, chunkSize, LockFlag.None);
        ReportChunk(applyVolume(tempBuffer.GetBuffer()));
        currentBufferPart = (currentBufferPart + 1) % BUFFER_CHUNKS;
        tempBuffer.Dispose();
        tempBuffer = new MemoryStream(); // Reset buffer
    }
}

// Finalizing
isCapturing = false;
tempBuffer.Dispose();
captureBuffer.Stop();
if (bufferARE.WaitOne(bufferLength + 1))
    currentBufferPart = (currentBufferPart + 1) % BUFFER_CHUNKS; // So that on the next start the correct buffer part will be read.
stateControlARE.Set();
While capturing, ReportChunk passes the data to the speaker part via an event that can be subscribed to. The speaker part is initialized like this:
// Creating the dxdevice.
dxdevice = new Device(device);
dxdevice.SetCooperativeLevel(hWnd, CooperativeLevel.Normal);
// Creating the buffer
bufferDescription = new BufferDescription();
bufferDescription.BufferBytes = bufferSize;
bufferDescription.Format = input.Format;
bufferDescription.ControlVolume = true;
bufferDescription.GlobalFocus = true; // That sound doesn't stop if the hWnd looses focus.
bufferDescription.StickyFocus = true; // - " -
buffer = new SecondaryBuffer(bufferDescription, dxdevice);
chunkQueue = new Queue<byte[]>();
// Creating buffer control
bufferARE = new AutoResetEvent(false);
// Register at input device
input.ChunkCaptured += new AInput.ReportBuffer(input_ChunkCaptured);
The data is put by the event method into the queue, simply by:
chunkQueue.Enqueue(buffer);
bufferARE.Set();
Filling the playbackbuffer and starting/stopping the playback buffer is done by another thread:
// Initializing
int wp = 0;
bufferARE.WaitOne(); // wait for first chunk
// Playing / writing data to play buffer.
while (isPlaying)
{
Thread.Sleep(1);
bufferARE.WaitOne(BufferLength * 3); // If a chunk is played and there is no new chunk we try to continue and may stop playing, else may the buffer runs out.
// Note that this may fails if the sender was interrupted within one chunk
if (isPlaying)
{
if (chunkQueue.Count > 0)
{
while (chunkQueue.Count > 0) wp = writeToBuffer(chunkQueue.Dequeue(), wp);
if (buffer.PlayPosition > wp - chunkSize * 3 / 2) buffer.SetCurrentPosition(((wp - chunkSize * 2 + bufferSize) % bufferSize));
if (!buffer.Status.Playing)
{
buffer.SetCurrentPosition(((wp - chunkSize * 2 + bufferSize) % bufferSize)); // We have 2 chunks buffered so we step back 2 chunks and play them while getting new chunks.
buffer.Play(0, BufferPlayFlags.Looping);
}
}
else
{
buffer.Stop();
bufferARE.WaitOne(); // wait for a filling chunk
}
}
}
// Finalizing
isPlaying = false;
buffer.Stop();
stateControlARE.Set();
writeToBuffer simply writes the enqueued chunk to the buffer with this.buffer.Write(wp, data, LockFlag.None); taking care of bufferSize, chunkSize and wp, which represents the last write position. I think this is everything that is important about my code. Some definitions may be missing, and there is at least one more method that starts/stops (i.e. controls) the threads.
I've posted this code in case I've made a mistake in filling the buffer or my initialisation is wrong. But I would guess that this problem occurs because the execution of C# bytecode is too slow or something like that. In the end my question is still open: how do I reduce the latency, and how do I avoid noises that shouldn't be there?
I know the reason for your problem and the way you can solve it, but I can't implement it in C# and .NET, so I will explain it in the hope that you can find your way.
Audio is recorded by your mic at a specified sample rate (for example 44100 Hz) and then played on the sound card at the same sample rate (again 44100 Hz). The problem is that the crystal that counts time in the input device (the mic, for example) is not the same as the crystal that plays sound in the sound card.
Also, the difference is very small, but they are never identical (there are no two exactly identical crystals in the entire world), so after a while there will be a gap in your playback routines.
Now, the solution is to re-sample the data to match the sample rate of the output, but I don't know how to do that in C# and .NET.
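For what it's worth, the mathematical core of such a resampler is small. Here is a naive linear-interpolation sketch in C# (single channel of float samples, illustrative names, no DirectSound plumbing; a real implementation would use a proper filter):
// Naive linear-interpolation resampler.
// ratio = inputRate / outputRate, e.g. slightly above 1.0 if the capture clock runs faster than the playback clock.
static float[] Resample(float[] input, double ratio)
{
    int outLength = (int)(input.Length / ratio);
    float[] output = new float[outLength];
    for (int i = 0; i < outLength; i++)
    {
        double pos = i * ratio;            // fractional read position in the input
        int index = (int)pos;
        double frac = pos - index;
        float a = input[index];
        float b = (index + 1 < input.Length) ? input[index + 1] : a;
        output[i] = (float)(a * (1 - frac) + b * frac);
    }
    return output;
}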
A long time ago I figured out that this problem was caused by the Thread.Sleep(1); in combination with high CPU usage. Because the Windows timer resolution is 15.6 ms by default, this sleep doesn't mean "sleep for 1 ms", but "sleep until the next clock interrupt is reached". (For more, read this paper.) Combined with high CPU usage, it may stack up to the length of a chunk or even more.
For example: if my chunk size is 40 ms, the sleep could take about 46.8 ms (3 * 15.6 ms), and this causes the ticking. One solution is to set the timer resolution down to 1 ms. That can be done this way:
[DllImport("winmm.dll", EntryPoint="timeBeginPeriod", SetLastError=true)]
private static extern uint timeBeginPeriod(uint uiPeriod);
[DllImport("winmm.dll", EntryPoint="timeEndPeriod", SetLastError=true)]
private static extern uint timeEndPeriod(uint uiPeriod);
void routine()
{
    Thread.Sleep(1); // May take about 15.6 ms or even longer.
    timeBeginPeriod(1); // Should be set at the startup of the application.
    Thread.Sleep(1); // May take about 1, 2 or 3 ms depending on the CPU usage.
    // ... time-dependent routines go here ...
    timeEndPeriod(1); // Should end at application shutdown.
}
As far as I know this should already be done by DirectX. But because this is a global setting, other parts of the application or other applications may change it. This shouldn't happen if an application sets and revokes the setting once, but somehow it seems to happen anyway, caused by some badly programmed part or other running application.
One more thing that needs to be watched is whether you are still using the correct position in the DirectX buffer if you skip a chunk for any reason. In this case a resynchronization is required.
I run through millions of records and sometimes I have to debug using Console.WriteLine to see what is going on.
However, Console.WriteLine is very slow, considerably slower than writing to a file.
BUT it is very convenient - does anyone know of a way to speed it up?
If it is just for debugging purposes you should use Debug.WriteLine instead. This will most likely be a bit faster than using Console.WriteLine.
Example
Debug.WriteLine("There was an error processing the data.");
You can use the OutputDebugString API function to send a string to the debugger. It doesn't wait for anything to redraw and this is probably the fastest thing you can get without digging into the low-level stuff too much.
The text you give to this function will go into Visual Studio Output window.
[DllImport("kernel32.dll")]
static extern void OutputDebugString(string lpOutputString);
Then you just call OutputDebugString("Hello world!");
Do something like this:
public static class QueuedConsole
{
    private static StringBuilder _sb = new StringBuilder();
    private static int _lineCount;

    public static void WriteLine(string message)
    {
        _sb.AppendLine(message);
        ++_lineCount;
        if (_lineCount >= 10)
            WriteAll();
    }

    public static void WriteAll()
    {
        Console.WriteLine(_sb.ToString());
        _lineCount = 0;
        _sb.Clear();
    }
}
QueuedConsole.WriteLine("This message will not be written directly, but with nine other entries to increase performance.");
//after your operations, end with write all to get the last lines.
QueuedConsole.WriteAll();
Here is another example: Does Console.WriteLine block?
I recently did a benchmark battery for this on .NET 4.8. The tests included many of the proposals mentioned on this page, including Async and blocking variants of both BCL and custom code, and then most of those both with and without dedicated threading, and finally scaled across power-of-2 buffer sizes.
The fastest method, now used in my own projects, buffers 64K of wide (Unicode) characters at a time from .NET directly to the Win32 function WriteConsoleW without copying or even hard-pinning. Remainders larger than 64K, after filling and flushing one buffer, are also sent directly, and in-situ as well. The approach deliberately bypasses the Stream/TextWriter paradigm so it can (obviously enough) provide .NET text that is already Unicode to a (native) Unicode API without all the superfluous memory copying/shuffling and byte[] array allocations required for first "decoding" to a byte stream.
If there is interest (perhaps because the buffering logic is slightly intricate), I can provide the source for the above; it's only about 80 lines. However, my tests determined that there's a simpler way to get nearly the same performance, and since it doesn't require any Win32 calls, I'll show this latter technique instead.
The following is way faster than Console.Write:
public static class FastConsole
{
static readonly BufferedStream str;
static FastConsole()
{
Console.OutputEncoding = Encoding.Unicode; // crucial
// avoid special "ShadowBuffer" for hard-coded size 0x14000 in 'BufferedStream'
str = new BufferedStream(Console.OpenStandardOutput(), 0x15000);
}
public static void WriteLine(String s) => Write(s + "\r\n");
public static void Write(String s)
{
// avoid endless 'GetByteCount' dithering in 'Encoding.Unicode.GetBytes(s)'
var rgb = new byte[s.Length << 1];
Encoding.Unicode.GetBytes(s, 0, s.Length, rgb, 0);
lock (str) // (optional, can omit if appropriate)
str.Write(rgb, 0, rgb.Length);
}
public static void Flush() { lock (str) str.Flush(); }
};
Note that this is a buffered writer, so you must call Flush() when you have no more text to write.
I should also mention that, as shown, technically this code assumes 16-bit Unicode (UCS-2, as opposed to UTF-16) and thus won't properly handle 4-byte escape surrogates for characters beyond the Basic Multilingual Plane. The point hardly seems important given the more extreme limitations on console text display in general, but could perhaps still matter for piping/redirection.
Usage:
FastConsole.WriteLine("hello world.");
// etc...
FastConsole.Flush();
On my machine, this gets about 77,000 lines/second (mixed-length) versus only 5,200 lines/sec under identical conditions for normal Console.WriteLine. That's a factor of almost 15x speedup.
These are controlled comparison results only; note that absolute measurements of console output performance are highly variable, depending on the console window settings and runtime conditions, including size, layout, fonts, DWM clipping, etc.
Why Console is slow:
Console output is actually an IO stream that's managed by your operating system. Most IO classes (like FileStream) have async methods but the Console class was never updated so it always blocks the thread when writing.
Console.WriteLine is backed by SyncTextWriter which uses a global lock to prevent multiple threads from writing partial lines. This is a major bottleneck that forces all threads to wait for each other to finish the write.
If the console window is visible on screen then there can be significant slowdown because the window needs to be redrawn before the console output is considered flushed.
Solutions:
Wrap the Console stream with a StreamWriter and then use async methods:
var sw = new StreamWriter(Console.OpenStandardOutput());
await sw.WriteLineAsync("...");
You can also set a larger buffer if you need to use sync methods. The call will occasionally block when the buffer gets full and is flushed to the stream.
// set a buffer size
var sw = new StreamWriter(Console.OpenStandardOutput(), Encoding.UTF8, 8192);
// this write call will block when buffer is full
sw.Write("...")
If you want the fastest writes though, you'll need to make your own buffer class that writes to memory and flushes to the console asynchronously in the background using a single thread without locking. The new Channel<T> class in .NET Core 2.1 makes this simple and fast. Plenty of other questions showing that code but comment if you need tips.
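Here is a rough sketch of what such a Channel<T>-based writer could look like (an illustration under the assumption of .NET Core 2.1+ with the System.Threading.Channels package; adjust the batching and shutdown handling to taste):
using System;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class AsyncConsole
{
    private static readonly Channel<string> channel =
        Channel.CreateUnbounded<string>(new UnboundedChannelOptions { SingleReader = true });

    // A single background task drains the channel and does the actual console IO.
    private static readonly Task pump = Task.Run(async () =>
    {
        var sb = new StringBuilder();
        while (await channel.Reader.WaitToReadAsync())
        {
            sb.Clear();
            while (channel.Reader.TryRead(out var line))
                sb.AppendLine(line);
            Console.Write(sb.ToString()); // one write per batch instead of one per line
        }
    });

    public static void WriteLine(string line) => channel.Writer.TryWrite(line);

    // Call once at the end so the last batch is flushed before the process exits.
    public static Task CompleteAsync()
    {
        channel.Writer.Complete();
        return pump;
    }
}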
A little old thread and maybe not exactly what the OP is looking for, but I ran into the same question recently, when processing audio data in real time.
I compared Console.WriteLine to Debug.WriteLine with this code and used DebugView as a dos box alternative. It's only an executable (nothing to install) and can be customized in very neat ways (filters & colors!). It has no problems with tens of thousands of lines and manages the memory quite well (I could not find any kind of leak, even after days of logging).
After doing some testing in different environments (e.g.: virtual machine, IDE, background processes running, etc) I made the following observations:
Debug is almost always faster
For small bursts of lines (<1000), it's about 10 times faster
For larger chunks it seems to converge to about 3x
If the Debug output goes to the IDE, Console is faster :-)
If DebugView is not running, Debug gets even faster
For really large amounts of consecutive output (>10000 lines), Debug gets slower and Console stays constant. I presume this is due to the memory that Debug has to allocate and Console does not.
Obviously, it makes a difference whether DebugView is actually "in view" or not, as the many GUI updates have a significant impact on the overall performance of the system, while Console simply hangs, whether visible or not. But it's hard to put numbers on that one...
I did not try multiple threads writing to the Console, as I think this should generally be avoided. I never had (performance) problems when writing to Debug from multiple threads.
If you compile with Release settings, usually all Debug statements are omitted and Trace should produce the same behaviour as Debug.
I used VS2017 & .Net 4.6.1
Sorry for so much code, but I had to tweak it quite a lot to actually measure what I wanted to. If you can spot any problems with the code (biases, etc.), please comment. I would love to get more precise data for real life systems.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
namespace Console_vs_Debug {
class Program {
class Trial {
public string name;
public Action console;
public Action debug;
public List<float> consoleMeasuredTimes = new List<float>();
public List<float> debugMeasuredTimes = new List<float>();
}
static Stopwatch sw = new Stopwatch();
private static int repeatLoop = 1000;
private static int iterations = 2;
private static int dummy = 0;
static void Main(string[] args) {
if (args.Length == 2) {
repeatLoop = int.Parse(args[0]);
iterations = int.Parse(args[1]);
}
// do some dummy work
for (int i = 0; i < 100; i++) {
Console.WriteLine("-");
Debug.WriteLine("-");
}
for (int i = 0; i < iterations; i++) {
foreach(Trial trial in trials) {
Thread.Sleep(50);
sw.Restart();
for (int r = 0; r < repeatLoop; r++)
trial.console();
sw.Stop();
trial.consoleMeasuredTimes.Add(sw.ElapsedMilliseconds);
Thread.Sleep(1);
sw.Restart();
for (int r = 0; r < repeatLoop; r++)
trial.debug();
sw.Stop();
trial.debugMeasuredTimes.Add(sw.ElapsedMilliseconds);
}
}
Console.WriteLine("---\r\n");
foreach(Trial trial in trials) {
var consoleAverage = trial.consoleMeasuredTimes.Average();
var debugAverage = trial.debugMeasuredTimes.Average();
Console.WriteLine(trial.name);
Console.WriteLine($ " console: {consoleAverage,11:F4}");
Console.WriteLine($ " debug: {debugAverage,11:F4}");
Console.WriteLine($ "{consoleAverage / debugAverage,32:F2} (console/debug)");
Console.WriteLine();
}
Console.WriteLine("all measurements are in milliseconds");
Console.WriteLine("anykey");
Console.ReadKey();
}
private static List<Trial> trials = new List<Trial> {
new Trial {
name = "constant",
console = delegate {
Console.WriteLine("A static and constant string");
},
debug = delegate {
Debug.WriteLine("A static and constant string");
}
},
new Trial {
name = "dynamic",
console = delegate {
Console.WriteLine("A dynamically built string (number " + dummy++ + ")");
},
debug = delegate {
Debug.WriteLine("A dynamically built string (number " + dummy++ + ")");
}
},
new Trial {
name = "interpolated",
console = delegate {
Console.WriteLine($ "An interpolated string (number {dummy++,6})");
},
debug = delegate {
Debug.WriteLine($ "An interpolated string (number {dummy++,6})");
}
}
};
}
}
Just a little trick I use sometimes: if you remove focus from the console window by opening another window over it, and leave it until the process completes, it won't redraw the window until you refocus it, which speeds things up significantly. Just make sure the buffer is set high enough that you can scroll back through all of the output.
Try using the System.Diagnostics Debug class? You can accomplish the same things as using Console.WriteLine.
You can view the available class methods here.