Merged with How to free up memory after base64 convert.
Thanks for your great suggestions on an OOM (out of memory) problem I'm seeing in code intended to stream files for web services. [I hope it is OK to start another thread which provides a bit more detail.] Following those suggestions, I shrank the buffer size used to read from the file, and memory consumption looks better, but I'm still seeing an OOM problem, and I'm seeing it with file sizes as small as 5MB. I potentially want to deal with files ten times larger.
My problem seems now to be with the use of TextWriter.
I create a request as follows [with a few edits to shrink the code]:
HttpWebRequest oRequest = (HttpWebRequest)WebRequest.Create(new Uri(strURL));
oRequest.Method = httpMethod;
oRequest.ContentType = "application/atom+xml";
oRequest.Headers["Authorization"] = getAuthHeader();
oRequest.ContentLength = strHead.Length + strTail.Length + longContentSize;
oRequest.SendChunked = true;
using (TextWriter tw = new StreamWriter(oRequest.GetRequestStream()))
{
tw.Write(strHead);
using (FileStream fileStream = new FileStream(strPath, FileMode.Open,
FileAccess.Read, System.IO.FileShare.ReadWrite))
{
StreamEncode(fileStream, tw);
}
tw.Write(strTail);
}
.....
Which calls into the routine:
public void StreamEncode(FileStream inputStream, TextWriter tw)
{
// For Base64 there are 4 bytes output for every 3 bytes of input
byte[] base64Block = new byte[9000];
int bytesRead = 0;
string base64String = null;
do
{
// read one block from the input stream
bytesRead = inputStream.Read(base64Block, 0, base64Block.Length);
// encode the base64 string
base64String = Convert.ToBase64String(base64Block, 0, bytesRead);
// write the string
tw.Write(base64String);
} while (bytesRead != 0);
}
Should I use something other than TextWriter because of the potentially large content? It seems very convenient for being able to create the whole payload of the request.
Is this totally the wrong approach? I want to be able to support very large files.
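For reference, here is a minimal sketch (not from the original thread) of a StreamEncode variant that reuses a char buffer via Convert.ToBase64CharArray instead of allocating a new base64 string for every block. It assumes Read fills the 9000-byte block on every pass except the last, so each full block stays a multiple of 3 bytes and no '=' padding appears mid-stream.
public void StreamEncode(FileStream inputStream, TextWriter tw)
{
    byte[] block = new byte[9000];     // multiple of 3, so full blocks encode without padding
    char[] encoded = new char[12000];  // 4 output chars for every 3 input bytes
    int bytesRead;
    while ((bytesRead = inputStream.Read(block, 0, block.Length)) > 0)
    {
        // reuse the char buffer instead of building a new string per block
        int charCount = Convert.ToBase64CharArray(block, 0, bytesRead, encoded, 0);
        tw.Write(encoded, 0, charCount);
    }
}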
Related
I'm working on a download followed by an MD5 check to ensure the download was successful. I have the following code, which should work but isn't the most efficient - especially for large files.
using (var client = new System.Net.WebClient())
{
client.DownloadFile(url, destinationFile);
}
var fileHash = GetMD5HashAsStringFromFile(destinationFile);
var successful = expectedHash.Equals(fileHash, StringComparison.OrdinalIgnoreCase);
My concern is that the bytes are all streamed through to disk, and then the MD5 ComputeHash() has to open the file and read all the bytes again. Is there a good, clean way of computing the MD5 as part of the download stream? Ideally, the MD5 should just fall out of the DownloadFile() function as a side effect of sorts. A function with a signature like this:
string DownloadFileAndComputeHash(string url, string filename, HashTypeEnum hashType);
Edit: Added code for GetMD5HashAsStringFromFile()
public string GetMD5HashAsStringFromFile(string filename)
{
using (FileStream file = File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.Read))
{
var md5er = System.Security.Cryptography.MD5.Create();
var md5HashBytes = md5er.ComputeHash(file);
return BitConverter
.ToString(md5HashBytes)
.Replace("-", string.Empty)
.ToLower();
}
}
Is there a good, clean way of computing the MD5 as part of the download stream? Ideally, the MD5 should just fall out of the DownloadFile() function as a side effect of sorts.
You could follow this strategy, to do "chunked" calculation and minimize memory pressure (and duplication):
Open the response stream on the web client.
Open the destination file stream.
Repeat while there is data available:
Read chunk from response stream into byte buffer
Write it to the destination file stream.
Use the TransformBlock method to add the bytes to the hash calculation
Use TransformFinalBlock to get the calculated hash code.
The sample code below shows how this could be achieved.
public static byte[] DownloadAndGetHash(Uri file, string destFilePath, int bufferSize)
{
using (var md5 = MD5.Create())
using (var client = new System.Net.WebClient())
{
using (var src = client.OpenRead(file))
using (var dest = File.Create(destFilePath, bufferSize))
{
md5.Initialize();
var buffer = new byte[bufferSize];
while (true)
{
var read = src.Read(buffer, 0, buffer.Length);
if (read > 0)
{
dest.Write(buffer, 0, read);
md5.TransformBlock(buffer, 0, read, null, 0);
}
else // reached the end.
{
md5.TransformFinalBlock(buffer, 0, 0);
return md5.Hash;
}
}
}
}
}
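A hypothetical call site mirroring the original check (the URL, destination path, and buffer size below are placeholders):
byte[] md5Bytes = DownloadAndGetHash(new Uri("http://example.com/file.bin"), @"c:\Temp\file.bin", 64 * 1024);
string fileHash = BitConverter.ToString(md5Bytes).Replace("-", string.Empty).ToLower();
bool successful = expectedHash.Equals(fileHash, StringComparison.OrdinalIgnoreCase);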
If you're talking about large files (I'm assuming over 1GB), you'll want to read the data in chunks, process each chunk through the MD5 algorithm, and then store it to disk. It's doable, but I don't know how much the default .NET classes will help you with that.
One approach might be with a custom stream wrapper. First you get a Stream from WebClient (via GetWebResponse() and then GetResponseStream()), then you wrap it, and then pass it to ComputeHash(stream). When MD5 calls Read() on your wrapper, the wrapper would call Read on the network stream, write the data out when it's received, and then pass it back to MD5.
I don't know what problems might await you if you try to do this.
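One rough sketch of that wrapper idea (the class name and members below are made up for illustration, and it assumes the hasher only ever calls Read; requires System and System.IO):
class TeeReadStream : Stream
{
    private readonly Stream source;
    private readonly Stream copy;

    public TeeReadStream(Stream source, Stream copy)
    {
        this.source = source;
        this.copy = copy;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = source.Read(buffer, offset, count);
        if (read > 0)
            copy.Write(buffer, offset, read); // pass through exactly what was read
        return read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { copy.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
With such a wrapper, something like md5.ComputeHash(new TeeReadStream(responseStream, fileStream)) would hash and save the download in a single pass.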
Something like this.
byte[] result;
using (var webClient = new System.Net.WebClient())
{
result = webClient.DownloadData("http://some.url");
}
byte[] hash = ((HashAlgorithm)CryptoConfig.CreateFromName("MD5")).ComputeHash(result);
I am currently testing several decompression libraries for a project I'm involved with to decompress http file streams on the fly. I have tried two very promising libraries and found an issue that seems to appear in both of them.
This is what I am doing:
video.avi compressed to video.zip on HTTP server test.com/video.zip (~20MB)
HttpWebRequest to read stream from the server
Write HttpWebRequest ResponseStream data into MemoryStream
Let decompression library read from MemoryStream
Read decompressed file stream while it's being downloaded by HttpWebRequest
The whole idea works fine: I'm able to uncompress and stream the compressed video directly into VLC's stdin, and it renders just fine. However, I have to use a read buffer of one byte on the decompression library. Any buffer larger than one byte causes the uncompressed data stream to be cut off. As a test I wrote the decompressed stream into a file and compared it with the original video.avi, and some data is simply skipped by the decompression. When streaming this broken data into VLC it causes a lot of video artifacts, and the playback speed is also greatly reduced.
If I knew the size of what is available to read, I could trim my buffer accordingly, but neither library makes this information public, so all I can do is read the data with a one-byte buffer. Maybe my approach is wrong? Or maybe I'm overlooking something?
Here's an example code (requires VLC):
ICSharpCode.SharpZLib (http://icsharpcode.github.io/SharpZipLib/)
static void Main(string[] args)
{
// Initialise VLC
Process vlc = new Process()
{
StartInfo =
{
FileName = @"C:\Program Files\VideoLAN\vlc.exe", // Adjust as required to test the code
RedirectStandardInput = true,
UseShellExecute = false,
Arguments = "-"
}
};
vlc.Start();
Stream outStream = vlc.StandardInput.BaseStream;
// Get source stream
HttpWebRequest stream = (HttpWebRequest)WebRequest.Create("http://codefreak.net/~daniel/apps/stream60s-large.zip");
Stream compressedVideoStream = stream.GetResponse().GetResponseStream();
// Create local decompression loop
MemoryStream compressedLoopback = new MemoryStream();
ZipInputStream zipStream = new ZipInputStream(compressedLoopback);
ZipEntry currentEntry = null;
byte[] videoStreamBuffer = new byte[8129]; // 8kb read buffer
int read = 0;
long totalRead = 0;
while ((read = compressedVideoStream.Read(videoStreamBuffer, 0, videoStreamBuffer.Length)) > 0)
{
// Write compressed video stream into compressed loopback without affecting current read position
long previousPosition = compressedLoopback.Position; // Store current read position
compressedLoopback.Position = totalRead; // Jump to last write position
totalRead += read; // Increase last write position by current read size
compressedLoopback.Write(videoStreamBuffer, 0, read); // Write data into loopback
compressedLoopback.Position = previousPosition; // Restore reading position
// If not already, move to first entry
if (currentEntry == null)
currentEntry = zipStream.GetNextEntry();
byte[] outputBuffer = new byte[1]; // Decompression read buffer, this is the bad one!
int zipRead = 0;
while ((zipRead = zipStream.Read(outputBuffer, 0, outputBuffer.Length)) > 0)
outStream.Write(outputBuffer, 0, outputBuffer.Length); // Write directly to VLC stdin
}
}
SharpCompress (https://github.com/adamhathcock/sharpcompress)
static void Main(string[] args)
{
// Initialise VLC
Process vlc = new Process()
{
StartInfo =
{
FileName = @"C:\Program Files\VideoLAN\vlc.exe", // Adjust as required to test the code
RedirectStandardInput = true,
UseShellExecute = false,
Arguments = "-"
}
};
vlc.Start();
Stream outStream = vlc.StandardInput.BaseStream;
// Get source stream
HttpWebRequest stream = (HttpWebRequest)WebRequest.Create("http://codefreak.net/~daniel/apps/stream60s-large.zip");
Stream compressedVideoStream = stream.GetResponse().GetResponseStream();
// Create local decompression loop
MemoryStream compressedLoopback = new MemoryStream();
ZipReader zipStream = null;
EntryStream currentEntry = null;
byte[] videoStreamBuffer = new byte[8129]; // 8kb read buffer
int read = 0;
long totalRead = 0;
while ((read = compressedVideoStream.Read(videoStreamBuffer, 0, videoStreamBuffer.Length)) > 0)
{
// Write compressed video stream into compressed loopback without affecting current read position
long previousPosition = compressedLoopback.Position; // Store current read position
compressedLoopback.Position = totalRead; // Jump to last write position
totalRead += read; // Increase last write position by current read size
compressedLoopback.Write(videoStreamBuffer, 0, read); // Write data into loopback
compressedLoopback.Position = previousPosition; // Restore reading position
// Open stream after writing to it because otherwise it will not be able to identify the compression type
if (zipStream == null)
zipStream = (ZipReader)ReaderFactory.Open(compressedLoopback); // Cast to ZipReader, as we know the type
// If not already, move to first entry
if (currentEntry == null)
{
zipStream.MoveToNextEntry();
currentEntry = zipStream.OpenEntryStream();
}
byte[] outputBuffer = new byte[1]; // Decompression read buffer, this is the bad one!
int zipRead = 0;
while ((zipRead = currentEntry.Read(outputBuffer, 0, outputBuffer.Length)) > 0)
outStream.Write(outputBuffer, 0, outputBuffer.Length); // Write directly to VLC stdin
}
}
To test this code I recommend setting the output buffer for SharpZipLib to 2 bytes and for SharpCompress to 8 bytes. You will see the artifacts and also that the playback speed of the video is wrong; the seek time should always be aligned with the number counting up in the video.
I haven't found a good explanation of why a larger outputBuffer reading from the decompression library causes these problems, or a way to solve this other than using the tiniest possible buffer.
So my question is: what am I doing wrong, or is this a general issue when reading compressed files from streams? How can I increase the outputBuffer while still reading the correct data?
Any help is greatly appreciated!
Regards,
Gachl
You need to write only as many bytes as you read. Writing the entire buffer will add extra bytes (whatever happened to be in the buffer before). zipStream.Read is not required to return as many bytes as you request.
while ((zipRead = zipStream.Read(outputBuffer, 0, outputBuffer.Length)) > 0)
outStream.Write(outputBuffer, 0, zipRead); // Write directly to VLC stdin
I need to read the first line from a stream to determine the file's encoding, and then re-read the stream with that encoding.
The following code does not work correctly:
var r = response.GetResponseStream();
var sr = new StreamReader(r);
string firstLine = sr.ReadLine();
string encoding = GetEncodingFromFirstLine(firstLine);
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();
The text variable doesn't contain the whole text. For some reason the first line and several lines after it are skipped.
I tried everything: closing the StreamReader, resetting it, calling a separate GetResponseStream... but nothing worked.
I can't get the response stream again as I'm getting this file from the internet, and redownloading it again would be bad performance wise.
Update
Here's what GetEncodingFromFirstLine() looks like:
public static string GetEncodingFromFirstLine(string line)
{
int encodingIndex = line.IndexOf("encoding=");
if (encodingIndex == -1)
{
return "utf-8";
}
return line.Substring(encodingIndex + "encoding=".Length).Replace("\"", "").Replace("'", "").Replace("?", "").Replace(">", "");
}
...
// true
Assert.AreEqual("windows-1251", GetEncodingFromFirstLine(@"<?xml version=""1.0"" encoding=""windows-1251""?>"));
Update 2
I'm working with XML files, and the text variable is parsed as XML:
var feedItems = XElement.Parse(text);
Well you're asking it to detect the encoding... and that requires it to read data. That's reading it from the underlying stream, and you're then creating another StreamReader around the same stream.
I suggest you:
Get the response stream
Retrieve all the data into a byte array (or MemoryStream)
Detect the encoding (which should be performed on bytes, not text - currently you're already assuming UTF-8 by creating a StreamReader)
Create a MemoryStream around the byte array, and a StreamReader around that
It's not clear what your GetEncodingFromFirstLine method does... or what this file really is. More information may make it easier to help you.
EDIT: If this is to load some XML, don't reinvent the wheel. Just give the stream to one of the existing XML-parsing classes, which will perform the appropriate detection for you.
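If it helps, here is a minimal sketch of that last suggestion (assuming the content really is XML, as the updates indicate; the variable names are carried over from the question):
// Let the XML infrastructure detect the encoding (from the BOM or the
// <?xml ... encoding="..."?> declaration) instead of pre-reading the first line.
using (var responseStream = response.GetResponseStream())
using (var xmlReader = System.Xml.XmlReader.Create(responseStream))
{
    var feedItems = XElement.Load(xmlReader);
    // work with feedItems directly; no manual StreamReader needed
}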
You need to change the current position in the stream to the beginning.
r.Position = 0;
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();
I found the answer to my question here:
How can I read an Http response stream twice in C#?
Stream responseStream = CopyAndClose(resp.GetResponseStream());
// Do something with the stream
responseStream.Position = 0;
// Do something with the stream again
private static Stream CopyAndClose(Stream inputStream)
{
const int readSize = 256;
byte[] buffer = new byte[readSize];
MemoryStream ms = new MemoryStream();
int count = inputStream.Read(buffer, 0, readSize);
while (count > 0)
{
ms.Write(buffer, 0, count);
count = inputStream.Read(buffer, 0, readSize);
}
ms.Position = 0;
inputStream.Close();
return ms;
}
I have a monitoring system, and I want to save a snapshot from a camera when an alarm triggers.
I have tried many methods to do that, and they all work fine: stream a snapshot from the camera, then save it as a jpg on the PC (jpg format, 1280*1024, 140KB). That's fine.
But my problem is the application's performance.
The app needs about 20-30 seconds to read the stream, which is not acceptable because the method will be called every 2 seconds. I need to know what is wrong with that code and how I can make it much faster.
Many thanks in advance
Code:
string sourceURL = "http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT";
byte[] buffer = new byte[200000];
int read, total = 0;
WebRequest req = (WebRequest)WebRequest.Create(sourceURL);
req.Credentials = new NetworkCredential("admin", "123456");
WebResponse resp = req.GetResponse();
Stream stream = resp.GetResponseStream();
while ((read = stream.Read(buffer, total, 1000)) != 0)
{
total += read;
}
Bitmap bmp = (Bitmap)Bitmap.FromStream(new MemoryStream(buffer, 0,total));
string path = JPGName.Text+".jpg";
bmp.Save(path);
I very much doubt that this code is the cause of the problem, at least for the first method call (but read further below).
Technically, you could produce the Bitmap without saving to a memory buffer first, or if you don't need to display the image as well, you could save the raw data without ever constructing a Bitmap, but that's not going to help in terms of a multiple-seconds improvement. Have you checked how long it takes to download the image from that URL using a browser, wget, curl or whatever tool? I suspect something is going on at the encoding source.
Something you should do is clean up your resources; close the stream properly. This can potentially cause the problem if you call this method regularly, because .NET will only open a few connections to the same host at any one point.
// Make sure the stream gets closed once we're done with it
using (Stream stream = resp.GetResponseStream())
{
// A larger buffer size would be beneficial, but it's not going
// to make a significant difference.
while ((read = stream.Read(buffer, total, 1000)) != 0)
{
total += read;
}
}
I can't test the network behavior of the WebResponse stream, but note that you handle the data twice (once in your loop and once with your memory stream).
I don't think that's the whole problem, but I'd give it a try:
string sourceURL = "http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT";
WebRequest req = (WebRequest)WebRequest.Create(sourceURL);
req.Credentials = new NetworkCredential("admin", "123456");
WebResponse resp = req.GetResponse();
Stream stream = resp.GetResponseStream();
Bitmap bmp = (Bitmap)Bitmap.FromStream(stream);
string path = JPGName.Text + ".jpg";
bmp.Save(path);
Try to read bigger pieces of data than 1000 bytes at a time. I can see no problem with, for example,
read = stream.Read(buffer, 0, buffer.Length);
Try this to download the file.
using(WebClient webClient = new WebClient())
{
webClient.DownloadFile("http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT", @"c:\Temp\myPic.jpg");
}
You can use a DateTime to put a unique stamp on the shot.
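For example (the folder and file-name format are assumptions, and the credentials are copied from the snippet above), a timestamped name might look like this:
// Unique, sortable file name per snapshot
string path = string.Format(@"c:\Temp\snapshot_{0:yyyyMMdd_HHmmss_fff}.jpg", DateTime.Now);
using (WebClient webClient = new WebClient())
{
    webClient.Credentials = new NetworkCredential("admin", "123456");
    webClient.DownloadFile("http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT", path);
}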
I have a web server which will read large binary files (several megabytes) into byte arrays. The server could be reading several files at the same time (different page requests), so I am looking for the most optimized way for doing this without taxing the CPU too much. Is the code below good enough?
public byte[] FileToByteArray(string fileName)
{
byte[] buff = null;
FileStream fs = new FileStream(fileName,
FileMode.Open,
FileAccess.Read);
BinaryReader br = new BinaryReader(fs);
long numBytes = new FileInfo(fileName).Length;
buff = br.ReadBytes((int) numBytes);
return buff;
}
Simply replace the whole thing with:
return File.ReadAllBytes(fileName);
However, if you are concerned about memory consumption, you should not read the whole file into memory at once. Read it in chunks instead.
I might argue that the answer here generally is "don't". Unless you absolutely need all the data at once, consider using a Stream-based API (or some variant of reader / iterator). That is especially important when you have multiple parallel operations (as suggested by the question) to minimise system load and maximise throughput.
For example, if you are streaming data to a caller:
Stream dest = ...
using(Stream source = File.OpenRead(path)) {
byte[] buffer = new byte[2048];
int bytesRead;
while((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0) {
dest.Write(buffer, 0, bytesRead);
}
}
I would think this:
byte[] file = System.IO.File.ReadAllBytes(fileName);
Your code can be factored to this (in lieu of File.ReadAllBytes):
public byte[] ReadAllBytes(string fileName)
{
byte[] buffer = null;
using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
buffer = new byte[fs.Length];
fs.Read(buffer, 0, (int)fs.Length);
}
return buffer;
}
Note the Int32.MaxValue file size limitation imposed by the Read method; in other words, you can only read a 2GB chunk at once.
Also note that the FileStream constructor can take a buffer size as its last argument.
I would also suggest reading about FileStream and BufferedStream.
As always, a simple sample program that profiles which approach is fastest will be most beneficial.
Also your underlying hardware will have a large effect on performance. Are you using server based hard disk drives with large caches and a RAID card with onboard memory cache? Or are you using a standard drive connected to the IDE port?
Depending on the frequency of operations, the size of the files, and the number of files you're looking at, there are other performance issues to take into consideration. One thing to remember is that each of your byte arrays will be released at the mercy of the garbage collector. If you're not caching any of that data, you could end up creating a lot of garbage and losing most of your performance to % Time in GC. If the chunks are larger than 85K, you'll be allocating to the Large Object Heap (LOH), which requires a collection of all generations to free up (this is very expensive, and on a server will stop all execution while it's going on). Additionally, if you have a ton of objects on the LOH, you can end up with LOH fragmentation (the LOH is never compacted), which leads to poor performance and out-of-memory exceptions. You can recycle the process once you hit a certain point, but I don't know if that's a best practice.
The point is, you should consider the full life cycle of your app before necessarily just reading all the bytes into memory the fastest way possible or you might be trading short term performance for overall performance.
I'd say BinaryReader is fine, but it can be refactored to this, instead of all those lines of code for getting the length of the buffer:
public byte[] FileToByteArray(string fileName)
{
byte[] fileData = null;
using (FileStream fs = File.OpenRead(fileName))
{
using (BinaryReader binaryReader = new BinaryReader(fs))
{
fileData = binaryReader.ReadBytes((int)fs.Length);
}
}
return fileData;
}
This should be better than using .ReadAllBytes(): in the comments on the top response (which uses .ReadAllBytes()) one commenter had problems with files > 600 MB, and a BinaryReader is meant for this sort of thing. Also, putting it in a using statement ensures the FileStream and BinaryReader are closed and disposed.
In case 'a large file' means beyond the 4GB limit, the following code logic is appropriate. The key issue to notice is the long data type used with the Seek method, since a long can address offsets beyond the 2^32 boundary.
In this example, the code first processes the large file in chunks of 1GB; after all the whole 1GB chunks are processed, the leftover (<1GB) bytes are processed. I use this code to calculate the CRC of files beyond the 4GB size.
(using https://crc32c.machinezoo.com/ for the crc32c calculation in this example)
private uint Crc32CAlgorithmBigCrc(string fileName)
{
uint hash = 0;
byte[] buffer = null;
FileInfo fileInfo = new FileInfo(fileName);
long fileLength = fileInfo.Length;
int blockSize = 1024000000;
long blocks = fileLength / blockSize;
// do the multiplication in 64-bit so files with several blocks don't overflow Int32
int restBytes = (int)(fileLength - (blocks * blockSize));
long offsetFile = 0;
Crc32CAlgorithm Crc32CAlgorithm = new Crc32CAlgorithm();
bool firstBlock = true;
using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
buffer = new byte[blockSize];
using (BinaryReader br = new BinaryReader(fs))
{
while (blocks > 0)
{
blocks -= 1;
fs.Seek(offsetFile, SeekOrigin.Begin);
buffer = br.ReadBytes(blockSize);
if (firstBlock)
{
firstBlock = false;
hash = Crc32CAlgorithm.Compute(buffer);
}
else
{
// chain on the running hash so every previous block is included in the result
hash = Crc32CAlgorithm.Append(hash, buffer);
}
offsetFile += blockSize;
}
if (restBytes > 0)
{
Array.Resize(ref buffer, restBytes);
fs.Seek(offsetFile, SeekOrigin.Begin);
buffer = br.ReadBytes(restBytes);
hash = Crc32CAlgorithm.Append(hash, buffer);
}
buffer = null;
}
}
//MessageBox.Show(hash.ToString());
//MessageBox.Show(hash.ToString("X"));
return hash;
}
Overview: if your image is added to the project with its Build Action set to Embedded Resource, use GetExecutingAssembly to retrieve the jpg resource as a stream, then read the binary data in the stream into a byte array.
public byte[] GetAImage()
{
byte[] bytes=null;
var assembly = Assembly.GetExecutingAssembly();
var resourceName = "MYWebApi.Images.X_my_image.jpg";
using (Stream stream = assembly.GetManifestResourceStream(resourceName))
{
bytes = new byte[stream.Length];
stream.Read(bytes, 0, (int)stream.Length);
}
return bytes;
}
Use the BufferedStream class in C# to improve performance. A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance.
See the following for a code example and additional explanation:
http://msdn.microsoft.com/en-us/library/system.io.bufferedstream.aspx
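A minimal sketch of that idea (the file name and buffer size are placeholders):
using (FileStream fs = File.OpenRead(fileName))
using (BufferedStream bs = new BufferedStream(fs, 64 * 1024))
using (BinaryReader reader = new BinaryReader(bs))
{
    // repeated small reads are now served from the in-memory buffer
    // instead of each one hitting the operating system
    int firstValue = reader.ReadInt32();
    // ...
}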
use this:
bytesRead = responseStream.ReadAsync(buffer, 0, buffer.Length).Result;
I would recommend trying the Response.TransmitFile() method, followed by Response.Flush() and Response.End(), for serving your large files.
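As a rough sketch of what that might look like in an ASP.NET page or handler (the content type, header, and path are assumptions):
// Stream the file from disk without buffering it all in memory
Response.ContentType = "application/octet-stream";
Response.AddHeader("Content-Disposition", "attachment; filename=\"large.bin\"");
Response.TransmitFile(@"c:\files\large.bin");
Response.Flush();
Response.End();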
If you're dealing with files above 2 GB, you'll find that the above methods fail.
It's much easier just to hand the stream off to MD5 and allow that to chunk your file for you:
private byte[] computeFileHash(string filename)
{
using (MD5 md5 = MD5.Create())
using (FileStream fs = new FileStream(filename, FileMode.Open))
{
byte[] hash = md5.ComputeHash(fs);
return hash;
}
}