I'm trying to store large objects as gzipped JSON text to an Azure blob.
I don't want to hold the serialized data in memory, and I don't want to spool to disk if I can avoid it, but I don't see how to just let it serialize and compress on the fly.
I'm using JSON.NET from Newtonsoft (pretty much the de facto standard JSON serializer for .NET), but the signatures of the methods don't really seem to support on-the-fly streaming.
Microsoft.WindowsAzure.Storage.Blob.CloudBlockBlob has an UploadFromStream(Stream source, AccessCondition accessCondition = null, BlobRequestOptions options = null, OperationContext operationContext = null) method, but for that to work properly the stream's position needs to be 0, and serializing with JsonSerializer doesn't give me that. It just writes to a stream, and when it's done the stream position is at EOF.
What I'd like to do is something like this:
public void SaveObject(object obj, string path, JsonSerializerSettings settings = null)
{
    using (var jsonStream = new JsonStream(obj, settings ?? _defaultSerializerSettings))
    using (var gzipStream = new GZipStream(jsonStream, CompressionMode.Compress))
    {
        var blob = GetCloudBlockBlob(path);
        blob.UploadFromStream(gzipStream);
    }
}
...the idea being that serialization does not start until something pulls data (in this case the GZipStream, which does not compress anything until it is pulled by blob.UploadFromStream()), so the overhead stays low. The stream does not need to be seekable; it just needs to be readable on demand.
I trust everyone can see how this would work if the source were a stream from System.IO.File.OpenRead() instead of new JsonStream(obj). While it gets a bit more complicated because Json.NET needs to "look ahead" and potentially fill a buffer, the framework pulled the same trick off with CryptoStream and GZipStream, and that works really slick.
Is there a way to do this that does not load the entire JSON representation of the object into memory, or spool it to disk first just to regurgitate it? If CryptoStream can do it, we should be able to do it with Json.NET without a huge amount of effort, I would think.
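For comparison, here is the push-based shape I could fall back to if a pull-based JsonStream turns out to be impractical. This is just a sketch and assumes my storage client exposes CloudBlockBlob.OpenWrite(); GetCloudBlockBlob and _defaultSerializerSettings are the same helpers as in the sketch above. The serializer pushes into the blob instead of the blob pulling from the serializer:
public void SaveObjectPush(object obj, string path, JsonSerializerSettings settings = null)
{
    var blob = GetCloudBlockBlob(path);

    // OpenWrite returns a writable stream that uploads blocks as they fill up,
    // so only the internal buffers are ever held in memory.
    using (var blobStream = blob.OpenWrite())
    using (var gzipStream = new GZipStream(blobStream, CompressionMode.Compress))
    using (var writer = new StreamWriter(gzipStream))
    using (var jsonWriter = new JsonTextWriter(writer))
    {
        var serializer = JsonSerializer.Create(settings ?? _defaultSerializerSettings);
        serializer.Serialize(jsonWriter, obj);
    }
}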
Related
Consider putting data onto a Windows clipboard DataPackage using SetData and later retrieving it using GetDataAsync, like this:
IEnumerable<T> objects = ...;
var randomAccessStream = new InMemoryRandomAccessStream();
using (XmlDictionaryWriter xmlWriter = XmlDictionaryWriter.CreateTextWriter(randomAccessStream.AsStreamForWrite(), Encoding.Unicode)) {
    var serializer = new DataContractSerializer(typeof(T), knownTypes);
    foreach (T obj in objects) {
        serializer.WriteObject(xmlWriter, obj);
    }
}
dataPackage.SetData(formatId, randomAccessStream);
Then later on (e.g. in Clipboard.ContentsChanged),
randomAccessStream = await dataPackageView.GetDataAsync(formatId) as IRandomAccessStream;
xmlReader = XmlDictionaryReader.CreateTextReader(randomAccessStream.AsStreamForRead(), Encoding.Unicode, XmlDictionaryReaderQuotas.Max, (OnXmlDictionaryReaderClose?)null);
var serializer = new DataContractSerializer(typeof(T), knownTypes);
while (serializer.IsStartObject(xmlReader)) {
    object? obj = serializer.ReadObject(xmlReader);
    ...
}
xmlReader.Dispose(); // in the real code, this is in a finally clause
The question I have is, when do I dispose the randomAccessStream? I've done some searching, and all the examples I've seen using SetData and GetDataAsync do absolutely nothing about disposing the object that is put into or obtained from the data package.
Should I dispose it after the SetData, after the GetDataAsync, in DataPackage.OperationCompleted, in some combination of these, or none of them?
sjb
P.S. If I can squeeze in a second question here ... when I put a reference into a DataPackage using, for example, dataPackage.Properties.Add("IEnumerable<T>", entities), does it create a security risk - can other apps access the reference and use it?
tldr
The Clipboard is designed to pass content between applications and can only pass string content or references to files; all other content must be serialized to a string, saved to a file, or made to behave like a file before it can be accessed across application domains via the clipboard.
There is support and guidance for passing custom data and formats via the clipboard. Ultimately this comes down to two discrete concerns: how to prepare the content on the provider side, and how to interpret the content on the consumer side. If simple serialization covers your case, then KISS.
IEnumerable<Test> objectsIn = new Test[] { new Test { Name = "One" }, new Test { Name = "two" } };
var dataPackage = new DataPackage();
dataPackage.SetData("MyCustomFormat", Newtonsoft.Json.JsonConvert.SerializeObject(objectsIn));
Clipboard.SetContent(dataPackage);
...
var dataPackageView = Clipboard.GetContent();
string contentJson = (await dataPackageView.GetDataAsync("MyCustomFormat")) as string;
IEnumerable<Test> objectsOut = Newtonsoft.Json.JsonConvert.DeserializeObject<IEnumerable<Test>>(contentJson);
In WinRT the DataPackageView class does support passing streams; however, the normal lifecycle rules apply to the stream, including whether or not it has been disposed. This is useful for transferring large content or when the consumer might request the content in different formats.
If you do not have an advanced need for it, or you are not transmitting file- or image-based resources, then you do not need to use a stream to transfer your data.
DataPackageView - Remarks
During a share operation, the source app puts the data being shared in a DataPackage object and sends that object to the target app for processing. The DataPackage class includes a number of methods to support the following default formats: text, Rtf, Html, Bitmap, and StorageItems. It also has methods to support custom data formats. To use these formats, both the source app and target app must already be aware that the custom format exists.
OP's attempt to save a stream to the Clipboard is, in this case, an example of saving an arbitrary or custom object to the clipboard: it is neither a string nor a reference to a file, so the OS has no native way to handle this information.
Historically, putting string data or a file reference onto the clipboard effectively broadcasts that information to ALL applications running on the same OS; Windows 10 extends this by allowing your clipboard content to be synchronised across devices as well. The DataTransfer namespace implementation lets you control the scope of this availability, but ultimately the feature is designed to let you push data outside of your current application's sandboxed domain.
So whether you choose to serialize the content yourself or let the DataTransfer implementation try to do it for you, the content will be serialized if it is not already in a string or file-reference format, and that serialized content, if serialization succeeds, is what will be made available to consumers.
In this way there is no memory leak or security issue where you might inadvertently give external processes access to your current process's memory or execution context, but data security is still a concern, so don't use the clipboard to pass sensitive content.
A simpler example for Arbitrary or Custom data
OP's example puts an IEnumerable<T> collection of objects onto the clipboard and retrieves them later. OP chose XML serialization via the DataContractSerializer, but what ended up on the clipboard was a reference to the stream used by the serializer, not the actual content.
That is a lot of plumbing and first-principles logic for little benefit. Streams are useful if you are actually going to stream the content, i.e. let the consumer control the read. If you are going to write everything in a single synchronous pass, it is better to close off the stream altogether and pass around the buffer you filled via the stream, rather than trying to re-use the same stream at a later point in time.
The following solution works for Clipboard access in WinRT to pre-serialize a collection of objects and pass them to a consumer:
IEnumerable<Test> objectsIn = new Test[] { new Test { Name = "One" }, new Test { Name = "two" } };
var dataPackage = new DataPackage();
string formatId = "MyCustomFormat";
var serial = Newtonsoft.Json.JsonConvert.SerializeObject(objectsIn);
dataPackage.SetData(formatId, serial);
Clipboard.SetContent(dataPackage);
Then in perhaps an entirely different application:
string formatId = "MyCustomFormat";
var dataPackageView = Clipboard.GetContent();
object content = await dataPackageView.GetDataAsync(formatId);
string contentString = content as string;
var objectsOut = Newtonsoft.Json.JsonConvert.DeserializeObject<IEnumerable<Test>>(contentString);
foreach (var o in objectsOut)
{
    Console.WriteLine(o);
}
The definition of Test, in both the provider and the consumer application contexts:
public class Test
{
    public string Name { get; set; }
}
when do I dispose the randomAccessStream?
Only Dispose the stream when you have finished using it; once you have Disposed the stream it will no longer be usable in any other context, even if you have stored or passed multiple references to it in other object instances.
If you are talking about the original stream referenced in the SetData() logic, then look at it from the other angle: if you dispose too early, the consuming code will no longer have access to the stream and will fail.
As a general rule we should try to design the logic so that, at any given point in time, there is a clear and single owner for any given stream; that way it is clear who has responsibility for disposing it. This response to a slightly different scenario explains it well: https://stackoverflow.com/a/8791525/1690217. As a general pattern, only the scope that created the stream should be responsible for disposing it.
The one exception is when you need to access the stream outside of the creating method; in that scenario the parent class should hold a reference to it, implement IDisposable, and make sure it cleans up any resources that might be hanging around (a small sketch of this pattern follows below).
The reason you don't see this in documentation is often that the nuances around the timing of calling Dispose() are out of scope, or get lost in examples that are contrived for other purposes.
Specifically, for examples where streams are passed via any mechanism and used later, as with DataPackage, it is too hard to show all of the orchestration code covering the time between storing the stream with DataPackage.SetData(...) and later accessing it via DataPackage.GetDataAsync(...).
Also consider the most common scenario for DataPackage, where the consumer is not only in a different logical scope but most likely in an entirely different application domain; to include all the code showing when or if to call Dispose would mean covering the entire code base of two different applications.
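To make the "parent owns the stream" idea concrete, here is a minimal sketch; the type and member names are made up for illustration only:
// Sketch: the object that creates the stream is the single owner and the
// only place that disposes it, regardless of who else reads from it.
public sealed class ClipboardPayload : IDisposable
{
    private readonly InMemoryRandomAccessStream _stream = new InMemoryRandomAccessStream();

    public IRandomAccessStream Stream { get { return _stream; } }

    public void Dispose()
    {
        // Call this only once the share operation is known to be finished,
        // e.g. from DataPackage.OperationCompleted.
        _stream.Dispose();
    }
}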
I have an object that has to be converted to Json format and uploaded via Stream object. This is the AWS S3 upload code:
AWSS3Client.PutObjectAsync(new PutObjectRequest()
{
    InputStream = stream,
    BucketName = name,
    Key = keyName
}).Wait();
Here stream is of type Stream and is read by AWSS3Client.
The data that I am uploading is a complex object that has to be in JSON format.
I can convert the object to a string using JsonConvert.SerializeObject, or serialize it to a file using JsonSerializer, but since the amount of data is quite significant I would prefer to avoid the temporary string or file and convert the object to a readable Stream right away. My ideal code would look something like this:
AWSS3Client.PutObjectAsync(new PutObjectRequest()
{
    InputStream = MagicJsonConverter.ToStream(myDataObject),
    BucketName = name,
    Key = keyName
}).Wait();
Is there a way to achieve this using Newtonsoft.Json ?
You need two things here: one is a producer/consumer stream, e.g. BlockingStream from this StackOverflow question, and the second is the Json.NET serializer writing to that stream, as in this other SO question - roughly the shape shown below.
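The "writing to a stream" half looks roughly like this (a sketch; stream stands in for whichever producer/consumer stream you choose, and leaveOpen keeps its lifetime in the caller's hands):
var serializer = JsonSerializer.CreateDefault();
using (var streamWriter = new StreamWriter(stream, Encoding.UTF8, 4096, leaveOpen: true))
using (var jsonWriter = new JsonTextWriter(streamWriter))
{
    // Serializes directly into the stream; no intermediate string is built.
    serializer.Serialize(jsonWriter, myDataObject);
    jsonWriter.Flush();
}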
Another practical option is to wrap a memory stream with a gzip stream (two lines of code).
JSON usually compresses very well (a 1 GB file can shrink to around 50 MB).
Then, when handing the stream to S3, wrap it in a gzip stream that decompresses it.
I guess the trade-off compared with a temp file is CPU vs IO (both will probably work well). If you can store it compressed on S3 it will save you space and improve networking efficiency too.
Example code:
var compressed = new MemoryStream();
using (var zip = new GZipStream(compressed, CompressionLevel.Fastest, true))
{
    // Write to the zip stream...
}
compressed.Seek(0, SeekOrigin.Begin);
// Use the stream to upload to S3
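Putting the two together, the whole round trip could look something like this (a sketch only; the property names match OP's snippet and CompressionLevel.Fastest is just an example):
var compressed = new MemoryStream();
using (var zip = new GZipStream(compressed, CompressionLevel.Fastest, leaveOpen: true))
using (var writer = new StreamWriter(zip))
using (var jsonWriter = new JsonTextWriter(writer))
{
    // Serialize straight into the gzip stream; only compressed bytes are buffered.
    new JsonSerializer().Serialize(jsonWriter, myDataObject);
}

compressed.Seek(0, SeekOrigin.Begin);

// If you keep it compressed on S3, consider setting the Content-Encoding
// header on the request so consumers know to gunzip it.
AWSS3Client.PutObjectAsync(new PutObjectRequest()
{
    InputStream = compressed,
    BucketName = name,
    Key = keyName
}).Wait();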
In an attempt to create a non-buffered file upload I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding the UseBufferedInputStream() method as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, I can see in trace output that the overridden UseBufferedInputStream() is definitely returning FALSE as expected. However, using diagnostic tools I can see the memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). It is in this formatter that I would like to incrementally write the incoming file to disk, but I also need to parse JSON and do some other operations with the Content-Type: multipart/form-data upload. Therefore I'm using the HttpContent method ReadAsMultipartAsync(), which appears to be the source of the memory growth. I have placed trace output before and after the "await", and it appears that while the task is in progress the memory usage increases fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultipartAsync(), I use Stream.CopyTo() to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by that point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultipartAsync() is buffering the whole post data; if that is true, why do we need var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to split the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <- memory usage grows rapidly

if (!content.IsMimeMultipartContent())
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}

//
// pull data out of parts.Contents, process json, etc.
//

// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
    x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
         x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");

// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
{
    using (FileStream toDisk = File.OpenWrite("myUploadedFile.bin"))
    {
        fileStream.CopyTo(toDisk);
    }
}
WebHostBufferPolicySelector only specifies whether the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
? requestBase.InputStream
: httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into a MemoryStream, because if you don't specify a provider it defaults to MultipartMemoryStreamProvider.
To get the files saved automatically to disk as each part is processed, use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In these examples everything happens in controllers, but there is no reason why you couldn't use the same approach in, say, a formatter; a sketch of that controller-style usage follows.
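Roughly (the App_Data folder is just an example location):
// Each part is streamed to a temp file under root as it is read;
// nothing is buffered in a MemoryStream.
string root = HttpContext.Current.Server.MapPath("~/App_Data");
var provider = new MultipartFormDataStreamProvider(root);

await Request.Content.ReadAsMultipartAsync(provider);

foreach (MultipartFileData file in provider.FileData)
{
    // LocalFileName is the temporary file the part was written to.
    Trace.WriteLine(file.LocalFileName);
}
// Plain form fields end up in provider.FormData.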
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that fires whatever processing you want as soon as it grabs part of the stream. The usage would be similar to the aforementioned providers - you'd need to pass it to the ReadAsMultipartAsync(provider) method.
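A bare-bones version of such a provider could look like this (the class and its file naming are made up for the example; the point is that GetStream decides where each part's bytes go as they arrive):
public class WriteThroughStreamProvider : MultipartStreamProvider
{
    private readonly string _rootPath;

    public WriteThroughStreamProvider(string rootPath)
    {
        _rootPath = rootPath;
    }

    // Called once per MIME part before its body is read; the returned stream
    // receives the part's bytes as they come off the wire.
    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        string fileName = Guid.NewGuid().ToString("N");
        return File.Create(Path.Combine(_rootPath, fileName));
    }
}

// usage: var parts = await content.ReadAsMultipartAsync(new WriteThroughStreamProvider(root));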
Finally - if you are feeling suicidal - since the underlying request stream is bufferless, theoretically you could use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32 * 1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
    // do stuff with this bit of the stream
}
But of course that's very, for the lack of better word, "ghetto."
I'm wondering whether there is a way to send some kind of generic collection, for example List<float> floatValues = new List<float>(), to a UDP client. I don't know how to do that; any help will be appreciated!
You can serialize floatValues using some serialization facility (like XmlSerializer, BinaryFormatter or DataContractSerializer) and then deserialize it back on the other side.
Or you can create your own "application-level protocol": put the type name and serializer type into the stream along with the payload, and use that information during deserialization, for example along the lines of the sketch below.
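A minimal version of such a protocol could length-prefix the type name in front of the payload, for example (purely illustrative; pick whatever serializer and layout you like):
// Sketch: [type name][JSON payload], both written as BinaryWriter's
// length-prefixed strings, so the receiver knows what to deserialize into.
static byte[] Pack(object value)
{
    using (var ms = new MemoryStream())
    {
        using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(value.GetType().AssemblyQualifiedName);
            writer.Write(new JavaScriptSerializer().Serialize(value));
        }
        return ms.ToArray();
    }
}

static object Unpack(byte[] data)
{
    using (var ms = new MemoryStream(data))
    using (var reader = new BinaryReader(ms, Encoding.UTF8))
    {
        Type type = Type.GetType(reader.ReadString());
        return new JavaScriptSerializer().Deserialize(reader.ReadString(), type);
    }
}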
What you want to do is known as serialization/deserialization.
In computer science, in the context of data storage and transmission, serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and "resurrected" later in the same or another computer environment.
Instead of building your own serializer, I would recommend using one of the existing libraries, like
XmlSerializer,
SoapFormatter,
BinaryFormatter,
DataContractSerializer,
DataContractJsonSerializer,
JavaScriptSerializer,
Json.NET,
ServiceStack,
protobuf-net, and so on.
Here is an example using Json serialization
//Sender
string jsonString = new JavaScriptSerializer().Serialize(floatValues);
byte[] bytesToSend = Encoding.UTF8.GetBytes(jsonString);
//Receiver
string receivedJson = Encoding.UTF8.GetString(bytesToSend);
List<float> floatValues2 = new JavaScriptSerializer()
.Deserialize<List<float>>(receivedJson);
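The actual UDP hop is then just the byte array going through UdpClient (addresses and ports below are placeholders; note that a single datagram tops out at roughly 64 KB, so very large lists would need chunking):
// Sender: ship the serialized bytes as one datagram.
using (var sender = new UdpClient())
{
    sender.Send(bytesToSend, bytesToSend.Length, "127.0.0.1", 11000);
}

// Receiver: block until a datagram arrives, then deserialize it.
using (var receiver = new UdpClient(11000))
{
    IPEndPoint remote = null;
    byte[] received = receiver.Receive(ref remote);
    string json = Encoding.UTF8.GetString(received);
    List<float> values = new JavaScriptSerializer().Deserialize<List<float>>(json);
}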
I am getting an intermittent "out of memory" exception at this statement:
return ms.ToArray();
In this method:
public static byte[] Serialize(Object inst)
{
    Type t = inst.GetType();
    DataContractSerializer dcs = new DataContractSerializer(t);
    MemoryStream ms = new MemoryStream();
    dcs.WriteObject(ms, inst);
    return ms.ToArray();
}
How can I prevent it? Is there a better way to do this?
The length of ms is 182,870,206 bytes (174.4 MB)
I am putting this into a byte array so that I can then run it through compression and store it to disk. The data is (obviously) a large list of a custom class that I download from a WCF server when my Silverlight application starts. I serialize and compress it so it uses only about 6 MB in isolated storage. The next time the user visits and runs the Silverlight application from the web, I check the timestamp, and if it is still good I just open the file from isolated storage, decompress it, deserialize it, and load my structure. I keep the entire structure in memory because the application is mostly geared around manipulating its contents.
@configurator is correct: the size of the array was too big. I rolled my own serializer, declaring a byte array of [list record count * byte count per record], then stuffed it directly myself using statements like this to stuff it:
Buffer.BlockCopy(
BitConverter.GetBytes(node.myInt),0,destinationArray,offset,sizeof(int));
offset += sizeof(int);
and this to get it back:
newNode.myInt = BitConverter.ToInt32(sourceByteArray, offset);
offset += sizeof(int);
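Putting the two fragments together, the overall shape was roughly this (the Node type and its single int field are just an example of a fixed-size record):
// Pack: fixed-size records written back to back into one pre-sized buffer.
byte[] Pack(IList<Node> nodes)
{
    var buffer = new byte[nodes.Count * sizeof(int)];
    int offset = 0;
    foreach (var node in nodes)
    {
        Buffer.BlockCopy(BitConverter.GetBytes(node.myInt), 0, buffer, offset, sizeof(int));
        offset += sizeof(int);
    }
    return buffer;
}

// Unpack: walk the buffer in fixed-size steps and rebuild each record.
List<Node> Unpack(byte[] buffer)
{
    var nodes = new List<Node>();
    for (int offset = 0; offset < buffer.Length; offset += sizeof(int))
    {
        nodes.Add(new Node { myInt = BitConverter.ToInt32(buffer, offset) });
    }
    return nodes;
}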
Then I compressed it and stored it to isolated storage.
My size went from 174MB with the DataContractSerializer to 14MB with mine.
After compression it went from a 6MB to a 1MB file in isolated storage.
Thanks to Configurator and Filip for their help.
The problem seems to be that you're expecting to return a 180 MB byte array. That means the framework would need to find and allocate a contiguous 180 MB block of free memory to copy the stream data into, which is usually quite hard - hence the OutOfMemoryException. If you need to keep handling this amount of memory, use the memory stream itself (reading and writing to it as you need) to hold the buffer; otherwise, save it to a file (or wherever else you need it, e.g. serving it over a network) directly instead of going through the memory stream.
I should mention that the memory stream has a 180 MB array of its own in there as well, so it is also in a bit of trouble and could cause an OutOfMemory error during serialization - it would likely be better (as in, more robust) if you could serialize to a temporary file. You might also want to consider a more compact - but possibly less readable - serialization format, like JSON, binary serialization, or Protocol Buffers.
In response to the comment: to serialize directly to disk, use a FileStream instead of a MemoryStream:
public static void Serialize(Object inst, string filename)
{
    Type t = inst.GetType();
    DataContractSerializer dcs = new DataContractSerializer(t);
    using (FileStream stream = File.OpenWrite(filename))
    {
        dcs.WriteObject(stream, inst);
    }
}
I don't know how you use that code, but one thing that strikes me is that you don't release your resources. For instance, if you call Serialize(obj) many times with a lot of large objects, you will end up with a lot of memory in use that is not released directly. The GC should handle that eventually, but you should still always release your resources.
I've tried this piece of code:
public static byte[] Serialize(object obj)
{
    Type type = obj.GetType();
    DataContractSerializer dcs = new DataContractSerializer(type);
    using (var stream = new MemoryStream())
    {
        dcs.WriteObject(stream, obj);
        return stream.ToArray();
    }
}
With the following Main method in a console application:
static void Main(string[] args)
{
    var filipEkberg = new Person { Age = 24, Name = "Filip Ekberg" };
    var obj = Serialize(filipEkberg);
}
However, my byte-array is not nearly as big as yours. Having a look at this similar issue, you might want to consider checking out protobuf-net.
It might also be interesting to know what you are intending to do with the serialized data, do you need it as a byte-array or could it just as well be XML written to a text-file?
Try serializing to a stream (e.g. a FileStream) instead of a byte array. This way you can serialize gigabytes of data without an OutOfMemory exception.
public static void Serialize<T>(T obj, string path)
{
    DataContractSerializer serializer = new DataContractSerializer(typeof(T));
    using (Stream stream = File.OpenWrite(path))
    {
        serializer.WriteObject(stream, obj);
    }
}
public static T Deserialize<T>(string path)
{
    DataContractSerializer serializer = new DataContractSerializer(typeof(T));
    using (Stream stream = File.OpenRead(path))
    {
        return (T)serializer.ReadObject(stream);
    }
}
Try setting the memory stream position to 0 and only then calling ToArray().
Regards.