How to overcome OutOfMemoryException pulling large xml documents from an API?

I am pulling 1M+ records from an API. The request itself succeeds, but I get an OutOfMemoryException when calling ReadToEnd to read the response into a string variable.
Here's the code:
XDocument xmlDoc = new XDocument();
HttpWebRequest client = (HttpWebRequest)WebRequest.Create(uri);
client.Timeout = 2100000;//35 minutes
WebResponse apiResponse = client.GetResponse();
Stream receivedStream = apiResponse.GetResponseStream();
StreamReader reader = new StreamReader(receivedStream);
string s = reader.ReadToEnd();
Stack trace:
at System.Text.StringBuilder.ToString()
at System.IO.StreamReader.ReadToEnd()
at MyApplication.DataBuilder.getDataFromAPICall(String uri) in
c:\Users\RDESLONDE\Documents\Projects\MyApplication\MyApplication\DataBuilder.cs:line 578
at MyApplication.DataBuilder.GetDataFromAPIAsXDoc(String uri) in
c:\Users\RDESLONDE\Documents\Projects\MyApplication\MyApplication\DataBuilder.cs:line 543
What can I do to work around this?

It sounds like your file is too big for your environment. Loading the whole DOM for a large file can be problematic, especially on a 32-bit (Win32) platform (you haven't indicated whether this is the case).
You can combine the speed and memory efficiency of XmlReader with the convenience of XElement/XNode, etc., and use an XStreamingElement to save the transformed content after processing. This is much more memory-efficient for large files.
Here's an example in pseudo-code:
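// use an XStreamingElement for writing
var st = new XStreamingElement("root");
using (var xr = XmlReader.Create(stream))
{
    xr.MoveToContent(); // position on the document's root element
    xr.Read();          // step inside the root so we read its children, not the whole document
    while (!xr.EOF)
    {
        // whatever you're interested in
        if (xr.NodeType == XmlNodeType.Element)
        {
            // ReadFrom consumes the element and leaves the reader on the next node,
            // so Read() must not be called again here or every other element is skipped
            var node = (XElement)XNode.ReadFrom(xr);
            ProcessNode(node);
            st.Add(node);
        }
        else
        {
            xr.Read();
        }
    }
}
st.Save(outstream); // or st.WriteTo(xmlwriter);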

XmlReader is the way to go when memory is an issue. It is also usually the fastest option.

Unfortunately, you didn't show your code but it sounds like the entire file is being loaded into memory. That's what you need to avoid.
Best if you can use a stream to process the file without loading the entire thing in memory.
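As a minimal sketch of that idea, the helper below reads matching elements one at a time from any Stream, so only the current element is held in memory. The class name XmlStreaming, the method name StreamElements and the element name passed in are assumptions for illustration; they do not come from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Lazily yields every element called elementName; the rest of the document
    // is skipped without being materialized.
    public static IEnumerable<XElement> StreamElements(Stream stream, string elementName)
    {
        using (var reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element (including children) and
                    // leaves the reader on the node that follows it.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

With the code from the question, this would be called on apiResponse.GetResponseStream() directly instead of calling ReadToEnd, for example foreach (var record in XmlStreaming.StreamElements(receivedStream, "record")) { ... }, where "record" stands for whatever the repeating element in the API response is called.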

class MyXmlDocument : IDisposable
{
private bool _disposed = false;
private XmlDocument _xmldoc;
public XmlDocument xmldoc
{
get { return _xmldoc; }
}
public MyXmlDocument()
{
_xmldoc = new XmlDocument();
}
~MyXmlDocument()
{
    // Finalizer path: managed state must not be touched here, so pass false.
    Dispose(false);
}
// Public implementation of Dispose pattern callable by consumers.
public void Dispose()
{
Dispose(true);
GC.SuppressFinalize(this);
}
// Protected implementation of Dispose pattern.
protected virtual void Dispose(bool disposing)
{
if (_disposed)
{
return;
}
if (disposing)
{
// TODO: dispose managed state (managed objects).
this._xmldoc = null;
GC.Collect();
GC.WaitForPendingFinalizers();
}
// TODO: free unmanaged resources (unmanaged objects) and override a finalizer below.
// TODO: set large fields to null.
_disposed = true;
}
}
You can then use it like this:
using (MyXmlDocument doc = new MyXmlDocument())
{
    // xmldoc has no setter and Load returns void, so call Load on the wrapped document.
    doc.xmldoc.Load(new StreamReader(file));
}

Related

Abandoned memory in posting image data to server

The app shows high memory consumption while posting image data to a server, and the memory is not released. In the source code below, reportModel holds the image data as a base64 string. Here is a snapshot of the source code:
public async Task<FaultReportResponseModel> ReportFault(ReportFaultRequestModel reportModel)
{
try
{
App.IsConnectedToInternet(true);
reportModel.Token = App.WebOpsToken;
//var httpContent = CreateHttpContent(reportModel);
var jsonBody = JsonConvert.SerializeObject(reportModel);
_log.Trace("ReportFault api jsonBody length: {0}", jsonBody.Length);
var content = new StringContent(jsonBody, Encoding.UTF8, "application/json");
AddAuthorizationHeader();
string serviceURL;
if (reportModel.IssueType == IssueTypes.CantFind)
{
serviceURL = Constants.CantFindSvcURL;
}
else
{
serviceURL = Constants.ReportFaultSvcURL;
}
//var url = string.Format("{0}{1}", Constants.DataSVCBaseURL, serviceURL);
var url = GetURLStringForService(serviceURL, ServiceType.WebOpsData);
var response = await _restClient.PostAsync(url, content);
var responseStr = await response.Content.ReadAsStringAsync();
var parsedResponse = JsonConvert.DeserializeObject<FaultReportResponseModel>(responseStr);
_log.Trace("Uploaded fault text: {0}", parsedResponse.OK);
content.Dispose();
return parsedResponse;
}
catch (Exception ex)
{
_log.Trace("Exception: {0}", ex.Message);
}
return null;
}
Snapshot of the memory footprint: [image not included]
It shows that JSON serialization allocates memory that is never released. Because of this abandoned memory, the app crashes after a few image upload cycles.
What I tried:
I used stream content to post to the server. In that case, the memory problem shows up in the stream instead. The location changed, but the problem is the same.
I read online that the Large Object Heap is the cause, so I tried invoking the GC manually, but the memory footprint did not change.
Any help or pointers for solving this problem would be appreciated.

You are creating large blocks of memory on the LOH (Large Object Heap). This is likely not a memory leak, though it definitely isn't optimal in high-throughput applications.
Assuming you actually want to use Json.NET for serialisation, you can achieve this with JsonTextWriter and serialize directly to a stream (ideally the HttpClient NetworkStream). Note that System.Text.Json also has very efficient methods for serializing to a stream.
To get access to the underlying NetworkStream in HttpClient, you could create a derived HttpContent class.
Example:
public class SerializedStreamedContent<T> :HttpContent
{
private readonly T _value;
public SerializedStreamedContent(T value) => _value = value;
protected override Task SerializeToStreamAsync(Stream stream, TransportContext? context)
{
try
{
using var writer = new StreamWriter(stream, leaveOpen:true);
using var jsonWriter = new JsonTextWriter(writer);
var ser = new JsonSerializer();
ser.Serialize(jsonWriter, _value);
jsonWriter.Flush();
return Task.CompletedTask;
}
catch (Exception e)
{
return Task.FromException(e);
}
}
protected override bool TryComputeLength(out long length)
{
length = -1;
return false;
}
}
Note 1: This is not intended to be a complete solution, just an example. There are many considerations you will need to weigh up when using this approach.
Note 2: In .NET 5 there is a JsonContent class that does all this for you using the System.Text.Json implementation (and more).
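For comparison, here is a minimal sketch of the System.Text.Json route, assuming .NET 5 or later. The class name JsonStreaming and its two method names are made up for illustration; JsonSerializer.SerializeAsync and JsonContent.Create are the framework calls being shown.

```csharp
using System.IO;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

public static class JsonStreaming
{
    // Writes value as UTF-8 JSON straight to the destination stream,
    // without first building one large JSON string.
    public static Task WriteJsonAsync<T>(Stream destination, T value)
        => JsonSerializer.SerializeAsync(destination, value);

    // HttpContent that serializes value only when the request body is sent.
    public static HttpContent CreateContent<T>(T value)
        => JsonContent.Create(value);
}
```

In the question's code this would look like await _restClient.PostAsync(url, JsonStreaming.CreateContent(reportModel)). Note that the base64 string inside reportModel is still one large object; this only avoids creating a second large string for the serialized JSON.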

Finding a memory leak

I have an issue with the following code. I create a memory stream in the GetDb function, and the return value is used in a using block. For some unknown reason, if I dump my objects I see that the MemoryStream is still around at the end of the Main method. This causes a massive leak. Any idea how I can clean up this buffer?
I have checked that the Dispose method has been called on the MemoryStream, but the object seems to stay around. I used the diagnostic tools of Visual Studio 2017 for this task.
class Program
{
static void Main(string[] args)
{
List<CsvProduct> products;
using (var s = GetDb())
{
products = Utf8Json.JsonSerializer.Deserialize<List<CsvProduct>>(s).ToList();
}
}
public static Stream GetDb()
{
var filepath = Path.Combine("c:/users/tom/Downloads", "productdb.zip");
using (var archive = ZipFile.OpenRead(filepath))
{
var data = archive.Entries.Single(e => e.FullName == "productdb.json");
using (var s = data.Open())
{
var ms = new MemoryStream();
s.CopyTo(ms);
ms.Seek(0, SeekOrigin.Begin);
return (Stream)ms;
}
}
}
}

"For some unknown reason, if I dump my objects I see that the MemoryStream is still around at the end of the Main method."
That isn't particularly abnormal; GC happens separately.
"This causes a massive leak."
That isn't a leak, it is just memory usage.
"Any idea how I can clean up this buffer?"
I would probably just not use a MemoryStream, instead returning something that wraps the live decompressing stream (from s = data.Open()). The problem here, though, is that you can't just return s, as archive would still be disposed upon leaving the method. So if I needed to solve this, I would create a custom Stream that wraps an inner stream and disposes a second object when it is itself disposed, i.e.
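class MyStream : Stream {
    private readonly Stream _source;
    private readonly IDisposable _parent;
    public MyStream(Stream source, IDisposable parent)
    {
        _source = source;
        _parent = parent;
    }
    // not shown: implement all abstract Stream members by forwarding to `_source`
    // Stream.Dispose() is not virtual, so the override point is Dispose(bool)
    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _source.Dispose();
            _parent.Dispose();
        }
        base.Dispose(disposing);
    }
}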
then have:
public static Stream GetDb()
{
var filepath = Path.Combine("c:/users/tom/Downloads", "productdb.zip");
var archive = ZipFile.OpenRead(filepath);
var data = archive.Entries.Single(e => e.FullName == "productdb.json");
var s = data.Open();
return new MyStream(s, archive);
}
(could be improved slightly to make sure that archive is disposed if an exception happens before we return with success)
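The improvement mentioned above could look roughly like the following sketch. The names OwnedReadStream, ZipEntryReader and OpenEntry are hypothetical, and the wrapper is deliberately read-only and forward-only, which is an assumption about how the stream is consumed.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Read-only wrapper: forwards reads to the inner stream and disposes the owner with it.
public sealed class OwnedReadStream : Stream
{
    private readonly Stream _source;
    private readonly IDisposable _owner;

    public OwnedReadStream(Stream source, IDisposable owner)
    {
        _source = source;
        _owner = owner;
    }

    public override bool CanRead => _source.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
        => _source.Read(buffer, offset, count);
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _source.Dispose();
            _owner.Dispose();
        }
        base.Dispose(disposing);
    }
}

public static class ZipEntryReader
{
    public static Stream OpenEntry(string zipPath, string entryName)
    {
        var archive = ZipFile.OpenRead(zipPath);
        try
        {
            var entry = archive.Entries.Single(e => e.FullName == entryName);
            return new OwnedReadStream(entry.Open(), archive);
        }
        catch
        {
            // Nothing was handed to the caller, so the archive must be closed here.
            archive.Dispose();
            throw;
        }
    }
}
```

The caller keeps the same shape as before: using (var s = ZipEntryReader.OpenEntry(filepath, "productdb.json")) { ... }, and disposing s closes both the entry stream and the archive.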

Disposing MemoryStreams and GZipStreams

I want to compress a ProtoBuffer object on serialisation and decompress it on deserialisation. Unfortunately, the C# standard library only offers compression routines that work on streams rather than on byte[], which makes this a bit more verbose than a simple function call. My code so far:
class MyObject{
public string P1 {get; set;}
public string P2 {get; set;}
// ...
public byte[] Serialize(){
var builder = new BinaryFormat.MyObject.Builder();
builder.SetP1(P1);
builder.SetP2(P2);
// ...
// object is now built, let's compress it.
var ms = new MemoryStream();
// Without this using, the serialisation/deserialisation tests fail
using (var gz = new GZipStream(ms, CompressionMode.Compress))
{
builder.Build().WriteTo(gz);
}
return ms.ToArray();
}
public void Deserialize(byte[] data)
{
var ms = new MemoryStream();
// Here, Tests work, even when the "using" is left out, like this:
(new GZipStream(new MemoryStream(data), CompressionMode.Decompress)).CopyTo(ms);
var msg = BinaryFormat.MachineInfo.ParseFrom(ms.ToArray());
P1 = msg.P1;
P2 = msg.P2;
// ...
}
}
When dealing with streams, it seems one has to take care of disposing the objects manually. I wonder why that is; I'd expect GZipStream to be fully managed code. I also wonder whether Deserialize works only by accident and whether I should dispose the MemoryStreams as well.
I know I could probably solve this problem by simply using a third-party compression library, but that's somewhat beside the point of this question.

GZipStream needs to be disposed so that it flushes its final blocks of compressed data out of its buffer to the underlying stream. It also calls Dispose on the stream you passed in, unless you use the overload that takes a bool (leaveOpen) and pass in true.
If you were using the overload that does not dispose of the MemoryStream, it is not as critical to dispose the MemoryStream yourself, because it does not write its internal buffer anywhere. The only thing its Dispose does is set some flags and set a Task object to null so it can be GCed sooner if the stream's lifetime is longer than the dispose point.
protected override void Dispose(bool disposing)
{
try {
if (disposing) {
_isOpen = false;
_writable = false;
_expandable = false;
// Don't set buffer to null - allow TryGetBuffer, GetBuffer & ToArray to work.
#if FEATURE_ASYNC_IO
_lastReadTask = null;
#endif
}
}
finally {
// Call base.Close() to cleanup async IO resources
base.Dispose(disposing);
}
}
Also, although the comment says "Call base.Close() to cleanup async IO resources" the base dispose function from the Stream class does nothing at all.
protected virtual void Dispose(bool disposing)
{
// Note: Never change this to call other virtual methods on Stream
// like Write, since the state on subclasses has already been
// torn down. This is the last code to run on cleanup for a stream.
}
All that being said, when decompressing with a GZipStream you can likely get away with not disposing it, for the same reason as with the MemoryStream: when decompressing, it does not buffer any bytes that would still need to be flushed.
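A minimal sketch of byte[] helpers that dispose everything in the right order anyway, which avoids relying on those implementation details. The class name GzipBytes is an assumption; the protobuf-specific calls from the question are left out.

```csharp
using System.IO;
using System.IO.Compression;

public static class GzipBytes
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen: true keeps output usable after the GZipStream is disposed.
            using (var gz = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                gz.Write(data, 0, data.Length);
            } // disposing gz flushes the final compressed block and the gzip footer
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gz.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Serialize could then write the built message to a MemoryStream and return GzipBytes.Compress(ms.ToArray()), and Deserialize could parse GzipBytes.Decompress(data).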

Locking with asynchronous httpwebrequest

I have an object that downloads a file from a server, saves it into Isolated Storage asynchronously and provides a GetData method to retrieve the data. Would I use a
IsolatedStorageFile storageObj; //initialized in the constructor
lock(storageObj)
{
//save code
}
In the response and
lock(storageObj)
{
//load code
}
In the GetData method?
Edit: I'll give some context here.
The app (for Windows Phone) needs to download and cache multiple files from a server, so I've created a type that takes 2 strings (a uri and a filename), sends out for data from the given uri, and saves it. The same object also has the get data method. Here's the code (simplified a bit)
public class ServerData: INotifyPropertyChanged
{
public readonly string ServerUri;
public readonly string Filename;
IsolatedStorageFile appStorage;
DownloadState _downloadStatus = DownloadState.NotStarted;
public DownloadState DownloadStatus
{
protected set
{
if (_downloadStatus == value) return;
_downloadStatus = value;
OnPropertyChanged(new PropertyChangedEventArgs("DownloadStatus"));
}
get { return _downloadStatus; }
}
public ServerData(string serverUri, string filename)
{
ServerUri = serverUri;
Filename = filename;
appStorage = IsolatedStorageFile.GetUserStoreForApplication();
}
protected virtual void OnPropertyChanged(PropertyChangedEventArgs args)
{
if (PropertyChanged != null)
PropertyChanged(this, args);
}
public void RequestDataFromServer()
{
DownloadStatus = DownloadState.Downloading;
//this first bit adds a random unused query to the Uri,
//so Silverlight won't cache the request
Random rand = new Random();
StringBuilder uriText = new StringBuilder(ServerUri);
uriText.AppendFormat("?YouHaveGotToBeKiddingMeHack={0}",
rand.Next().ToString());
Uri uri = new Uri(uriText.ToString(), UriKind.Absolute);
HttpWebRequest serverRequest = (HttpWebRequest)WebRequest.Create(uri);
ServerRequestUpdateState serverState = new ServerRequestUpdateState();
serverState.AsyncRequest = serverRequest;
serverRequest.BeginGetResponse(new AsyncCallback(RequestResponse),
serverState);
}
void RequestResponse(IAsyncResult asyncResult)
{
var serverState = (ServerRequestUpdateState)asyncResult.AsyncState;
var serverRequest = (HttpWebRequest)serverState.AsyncRequest;
Stream serverStream;
try
{
// end the async request
serverState.AsyncResponse =
(HttpWebResponse)serverRequest.EndGetResponse(asyncResult);
serverStream = serverState.AsyncResponse.GetResponseStream();
Save(serverStream);
serverStream.Dispose();
}
catch (WebException)
{
DownloadStatus = DownloadState.Error;
}
Deployment.Current.Dispatcher.BeginInvoke(() =>
{
DownloadStatus = DownloadState.FileReady;
});
}
void Save(Stream streamToSave)
{
StreamReader reader = null;
IsolatedStorageFileStream file;
StreamWriter writer = null;
reader = new StreamReader(streamToSave);
lock (appStorage)
{
file = appStorage.OpenFile(Filename, FileMode.Create);
writer = new StreamWriter(file);
writer.Write(reader.ReadToEnd());
reader.Dispose();
writer.Dispose();
}
}
public XDocument GetData()
{
XDocument xml = null;
lock(appStorage)
{
if (appStorage.FileExists(Filename))
{
var file = appStorage.OpenFile(Filename, FileMode.Open);
xml = XDocument.Load(file);
file.Dispose();
}
}
if (xml != null)
return xml;
else return new XDocument();
}
}

Your question doesn't provide a lot of context, and with the amount of information given, people could be inclined to simply tell you yes, maybe with small but pertinent additions.
Common practice is to lock on an instance of a dedicated object and to stay away from locking on this, since that locks down the whole instance of the current object, which is rarely, if ever, the intent. In your case we don't know the full picture, but I doubt that locking on your storage instance is the way to go.
Also, since you mention client and server interaction, it isn't as straightforward.
Depending on the load and many other factors, you might want to allow many concurrent reads of the file from the server, yet only a single write at any one time on the client that is downloading. For this purpose I would recommend the ReaderWriterLockSlim class, which exposes TryEnterReadLock, TryEnterWriteLock and corresponding release methods.
For more detailed information on this class, see its MSDN documentation.
Also, remember to use try and finally around code that runs while the lock is held, always releasing the lock in the finally block.
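A minimal sketch of that pattern, using an in-memory string in place of the isolated storage file. The class name GuardedStore and its members are made up for illustration, and whether ReaderWriterLockSlim is available on the Windows Phone target in question is an assumption to verify.

```csharp
using System.Threading;

public sealed class GuardedStore
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private string _data = "";

    // Only one writer at a time; readers are blocked while a write is in progress.
    public void Save(string data)
    {
        _lock.EnterWriteLock();
        try
        {
            _data = data;
        }
        finally
        {
            _lock.ExitWriteLock(); // always released, even if the save code throws
        }
    }

    // Any number of readers may hold the read lock concurrently.
    public string Load()
    {
        _lock.EnterReadLock();
        try
        {
            return _data;
        }
        finally
        {
            _lock.ExitReadLock();
        }
    }
}
```

In ServerData, the body of Save would go inside the write lock and the body of GetData inside the read lock.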

What class contains this code? That matters because it is important whether the class is instantiated more than once. If it is created once in the process's lifetime, you can do this; if not, you should lock on a static object instance.
I believe, though, that it's good practice to create a separate object that is used only for locking; I've forgotten why. E.g.:
IsolatedStorageFile storageObj; //initialized in the constructor
static readonly object storageObjLock = new object(); // static only if the lock must be shared across instances
...
// in some method
lock(storageObjLock)
{
//save code
}

Closing a file without using using

I have a class which reads data from one file stream and writes to another. I am concerned about closing the streams after the processing has finished in closeFiles().
How would you handle the possibility that disposing one stream throws an exception, preventing the other stream's Dispose from being called?
Should I be calling Close and Dispose on the streams, or just one of them?
What happens if I catch any errors from the stream disposes and then continue with moving and deleting the files as shown in lastOperation()?
In a perfect world I'd like to use a using statement in a C++-style initialisation list, but I'm pretty sure that's not possible in C#.
EDIT: Thanks for the quick responses, guys. So what I should be doing is implementing IDisposable, then changing the constructor and adding the two dispose methods like this?
~FileProcessor()
{
Dispose(true);
}
public void Dispose()
{
Dispose(true);
GC.SuppressFinalize(this);
}
private void Dispose(bool disposing)
{
if (!this.disposed)
{
if (disposing)
{
sw.Flush();
}
closeFiles();
disposed = true;
}
}
This is basically what I'm doing:
class FileProcessor
{
private string in_filename;
private string out_filename;
private StreamReader sr;
private StreamWriter sw;
bool filesOpen = false;
public FileProcessor(string filename)
{
in_filename = filename;
out_filename = filename + ".out";
openFiles();
}
~FileProcessor()
{
closeFiles();
}
private void openFiles()
{
sr = new StreamReader(in_filename);
sw = new StreamWriter(out_filename);
filesOpen = true;
}
private void closeFiles()
{
if (filesOpen)
{
sr.Close();
sw.Close();
sr.Dispose();
sw.Dispose();
filesOpen = false;
}
}
/* various functions to read, process and write to the files */
public void lastOperation()
{
closeFiles();
File.Delete( in_filename );
Directory.Move(out_filename, outdir + out_filename);
}
}

Your FileProcessor class should not have a destructor (finalizer). It is of no use here, and it is expensive.
It should have a Dispose() method (and implement the IDisposable interface) that calls closeFiles().
And as @marcelo answered, Stream.Dispose() should not throw. You can rely on this for BCL classes.
But you should check each reader/writer for null, in case the first one was opened but the second one failed:
if (sr != null) sr.Dispose();
if (sw != null) sw.Dispose();
A single filesOpen flag can't cover both cases.
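Put together, that advice could look like the sketch below. It is a simplified stand-in for the class in the question, not a drop-in replacement: the two-argument constructor and the CopyAll method are assumptions added to keep the example self-contained.

```csharp
using System;
using System.IO;

public sealed class FileProcessor : IDisposable
{
    private StreamReader _reader;
    private StreamWriter _writer;

    public FileProcessor(string inputPath, string outputPath)
    {
        _reader = new StreamReader(inputPath);
        try
        {
            _writer = new StreamWriter(outputPath);
        }
        catch
        {
            // The second open failed: close the first stream before propagating.
            _reader.Dispose();
            throw;
        }
    }

    // Placeholder for the real read/process/write logic.
    public void CopyAll()
    {
        string line;
        while ((line = _reader.ReadLine()) != null)
        {
            _writer.WriteLine(line);
        }
    }

    // No finalizer: the class only owns managed wrappers that clean up after themselves.
    public void Dispose()
    {
        try
        {
            _writer?.Dispose(); // flushes and closes the output file
        }
        finally
        {
            _reader?.Dispose(); // still runs if flushing the writer failed
        }
    }
}
```

Callers would wrap it in using (var p = new FileProcessor(input, output)) { p.CopyAll(); } and do the File.Delete and move afterwards, once both streams are guaranteed to be closed.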

I think it is good practice to have your class implement the IDisposable interface if it uses IDisposable objects internally.
Then make sure that your Dispose() implementation doesn't throw exceptions. If every object you dispose makes this guarantee, your clients will be safe.

Dispose methods should never throw exceptions. There's even a code analysis tool warning for this.

C# does have a using statement. The object provided to the using statement must implement the IDisposable interface. This interface provides the Dispose method, which should release the object's resources.
StreamReader and StreamWriter both implement IDisposable, so you can put them in a using block, and they will be disposed of cleanly when you have finished with them.
using(var sr = new StreamReader(in_filename)) {
// Perform reader actions
}
// Reader will now be disposed.
