C# Large JSON to string causes out of memory exception

I'm trying to download a very large JSON file, but I keep getting this error:
"An unhandled exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll"
{The function evaluation was disabled because of an out of memory exception.}
Any tips on how I can download this large JSON file? I have tried both string and StringBuilder, but no luck.
Here is my code:
public static string DownloadJSON(string url)
{
    try
    {
        string json = new WebClient().DownloadString(url); // This part fails!
        return json;
    }
    catch (Exception)
    {
        throw;
    }
}
I have created a console application, and the same code works fine with a smaller JSON file.
My plan is to later split this large JSON file and load it into a database, but I need to encode it first because the data contains special characters like å. I haven't written the database part (or anything else) yet, because downloading this big JSON is already a problem. I don't necessarily need it as a stream; that was just how I did the encoding in my example.
I also tried this, but I get the same problem:
var http = (HttpWebRequest)WebRequest.Create(url);
var response = http.GetResponse();
var stream = response.GetResponseStream();
var sr = new StreamReader(stream);
var content = sr.ReadToEnd();

I assume you are getting a very, very large response, so it is better to process it as a stream. Here is what causes the OutOfMemoryException:
In .NET the maximum size of any single object is 2 GB, even on a 64-bit machine; in a 32-bit process the practical limit is much lower.
In your case that limit is being exceeded, so materializing the whole response as one string will not work. If the payload is actually smaller than the limit, try building your code for 64-bit and it will give you your result; otherwise, process the response as a stream.
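As a rough sketch of that stream-based approach (assuming Json.NET is available and that the top-level JSON is an array of objects - adjust to whatever the real structure is), you can walk the response with JsonTextReader and deserialize one element at a time, so no single string or object ever comes near the 2 GB limit:
using System.Net;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static void ProcessJsonStream(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    using (var jsonReader = new JsonTextReader(reader))
    {
        var serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (jsonReader.TokenType == JsonToken.StartObject)
            {
                // Deserialize one array element at a time and hand it to the
                // database/encoding step here instead of keeping it in memory.
                JObject item = serializer.Deserialize<JObject>(jsonReader);
            }
        }
    }
}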

Download it to a file:
using (WebClient webClient = new WebClient())
{
    webClient.DownloadFile(url, path);
}
Of course a JSON library may not be able to handle it if it's too big to open in memory, but if the file really has to be that big, you can still try to process it piece by piece.
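If you go this route, one way to process the saved file without ever building one huge string is to read it back in fixed-size chunks with an explicit UTF-8 encoding, which also keeps characters like å intact. A minimal sketch, reusing the path variable from above and an arbitrary 64 KB buffer (needs System.IO and System.Text):
using (var fileStream = File.OpenRead(path))
using (var reader = new StreamReader(fileStream, Encoding.UTF8))
{
    char[] buffer = new char[64 * 1024];
    int charsRead;
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        string chunk = new string(buffer, 0, charsRead);
        // split/encode the chunk and hand it to the database layer here
    }
}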

Related

Dotnet Core - resourceStream is timing out when read as an embedded resource

I have an embedded resource containing a JSON file with 19000-odd Australian suburbs. It lives in a related assembly in my solution.
My issue is that when I check the "resourceStream" variable after getting the resource, it appears to have timed out.
The actual error, up close:
resourceStream.ReadTimeout = 'resourceStream.ReadTimeout' threw an exception of type 'System.InvalidOperationException'
Here is the code:
using (var resourceStream = this.GetType().Assembly.GetManifestResourceStream("JobsLedger.INITIALISATION.SUBURB.Initialisations.SuburbJSON.australianSuburbs.json"))
{
    var reader = new StreamReader(resourceStream, Encoding.UTF8);
    var jsonString = reader.ReadToEnd();
    var suburbs = JsonConvert.DeserializeObject<List<BaseSuburb>>(jsonString);
}
So it gets 4900-odd postcodes before timing out; it should get 19000.
The ones it does get are converted and work, which is nice, but I actually need all of them :(
There must be a way to stop it from timing out. I noticed a post suggesting switching to async, but it said there are no await-able methods for GetManifestResourceStream.
Also, this is on the resource stream, not on the reader; I think the reader would be fine handling 18k records.
How do I get "resourceStream" to run and actually read the whole file before timing out? Or is this a problem with the file?

Overriding WebHostBufferPolicySelector for Non-Buffered File Upload

In an attempt to create a non-buffered file upload I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding function UseBufferedInputStream() as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, I can see in trace output that the overridden function UseBufferedInputStream() is definitely returning FALSE as expected. However, using diagnostic tools I can see the memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). It is in this formatter that I would like to incrementally write the incoming file to disk, but I also need to parse json and do some other operations with the Content-Type:multipart/form-data upload. Therefore I'm using HttpContent method ReadAsMultiPartAsync(), which appears to be the source of the memory growth. I have placed trace output before/after the "await", and it appears that while the task is blocking the memory usage is increasing fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultiPartAsync(), I am using Stream.CopyTo() in order to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by this point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultiPartAsync() is buffering the whole post data; if that is true why do we require var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to accomplish the splitting of the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <- memory usage grows rapidly

if (!content.IsMimeMultipartContent())
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}

//
// pull data out of parts.Contents, process json, etc.
//

// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
    x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
         x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");

// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
{
    using (FileStream toDisk = File.OpenWrite("myUploadedFile.bin"))
    {
        ((Stream)fileStream).CopyTo(toDisk);
    }
}
WebHostBufferPolicySelector only specifies if the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
    ? requestBase.InputStream
    : httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into a MemoryStream, because if you don't specify a provider it defaults to MultipartMemoryStreamProvider.
To have the files saved to disk automatically as each part is processed, use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In those examples everything happens in controllers, but there is no reason why you couldn't use the same approach in, say, a formatter.
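For reference, a minimal sketch of the MultipartFormDataStreamProvider approach (the root path and variable names here are only illustrative):
string rootPath = HttpContext.Current.Server.MapPath("~/App_Data");
var provider = new MultipartFormDataStreamProvider(rootPath);

// Each part is written to a file under rootPath as it is read, instead of into memory.
await content.ReadAsMultipartAsync(provider);

// Form fields end up in provider.FormData, file parts in provider.FileData.
foreach (MultipartFileData fileData in provider.FileData)
{
    Trace.WriteLine("Original name: " + fileData.Headers.ContentDisposition.FileName);
    Trace.WriteLine("Saved to: " + fileData.LocalFileName);
}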
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that fires whatever processing you want as soon as it grabs part of the stream. The usage would be similar to the aforementioned providers - you'd need to pass it to the ReadAsMultipartAsync(provider) method.
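A bare-bones sketch of what such a provider could look like (the class name and file-naming scheme are made up for illustration); the key point is that GetStream() is called once per part, and whatever stream you return is what that part gets written to as it arrives:
public class DirectToDiskStreamProvider : MultipartStreamProvider
{
    private readonly string _rootPath;

    public DirectToDiskStreamProvider(string rootPath)
    {
        _rootPath = rootPath;
    }

    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        // Write each incoming part straight to its own file as it streams in.
        string fileName = Path.Combine(_rootPath, Guid.NewGuid().ToString("N") + ".part");
        return File.Create(fileName);
    }
}

// usage: await content.ReadAsMultipartAsync(new DirectToDiskStreamProvider(rootPath));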
Finally - if you are feeling suicidal - since the underlying request stream is bufferless, you could theoretically use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32 * 1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
    // do stuff with this chunk of the stream
}
But of course that's, for lack of a better word, "ghetto."

Get Size of Image File before downloading from web

I am downloading image files from the web using the following code in my console application.
WebClient client = new WebClient();
client.DownloadFile(address_of_image_file, filename);
The code is running absolutely fine.
I want to know if there is a way I can get the size of this image file before I download it.
PS: I have actually written code for a crawler that moves around a site downloading image files, so I don't know the size beforehand. All I have is the complete path of the file, extracted from the source of the web page.
Here is a simple example you can try.
If you have files with different extensions (.GIF, .JPG, etc.) you can create a variable or wrap the code in a switch statement.
System.Net.WebClient client = new System.Net.WebClient();
client.OpenRead("http://someURL.com/Images/MyImage.jpg");
Int64 bytes_total = Convert.ToInt64(client.ResponseHeaders["Content-Length"]);
MessageBox.Show(bytes_total.ToString() + " Bytes");
If the web service gives you a Content-Length HTTP header then it will be the image file size. However, if the web service "streams" the data to you (using chunked encoding), then you won't know the size until the whole file has been downloaded.
You can use this code:
using System.Net;

public long GetFileSize(string url)
{
    long result = 0;
    WebRequest req = WebRequest.Create(url);
    req.Method = "HEAD";
    using (WebResponse resp = req.GetResponse())
    {
        if (long.TryParse(resp.Headers.Get("Content-Length"), out long contentLength))
        {
            result = contentLength;
        }
    }
    return result;
}
You can use an HttpWebRequest to send a HEAD request for the file and check the Content-Length header in the response.
You should look at this answer: C# Get http:/…/File Size, where your question is fully explained. It uses a HEAD HTTP request to retrieve the file size, but you can also read the "Content-Length" header during a GET request before reading the response stream.
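A small sketch of that second option (checking the header on the GET response before touching the body); ContentLength is -1 when the server uses chunked encoding and the size is not known up front:
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
    long size = response.ContentLength; // -1 if the server did not send Content-Length
    if (size >= 0)
    {
        Console.WriteLine(size + " Bytes");
    }
    // read response.GetResponseStream() here if you still want to download the file
}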

Read only the title and/or META tag of HTML file, without loading complete HTML file

Scenario:
I need to parse millions of HTML files/pages (as fast as I can), read only the Title and/or META part of each, and dump it to a database.
What I am doing is using System.Net.WebClient's DownloadString(url_path) to download, and then saving the result to the database with LINQ to SQL.
But this DownloadString function gives me the complete HTML source, and I only need the Title and META tag parts.
Any ideas on how to download only that much content?
I think you can open a stream for this URL and read only the first x bytes from it. I can't tell you the exact number, but you can set it to a reasonable value that covers the title and the description.
HttpWebRequest fileToDownload = (HttpWebRequest)HttpWebRequest.Create("YourURL");
using (WebResponse fileDownloadResponse = fileToDownload.GetResponse())
{
    using (Stream fileStream = fileDownloadResponse.GetResponseStream())
    {
        using (StreamReader fileStreamReader = new StreamReader(fileStream))
        {
            // Read only the first Number characters of the page.
            char[] x = new char[Number];
            int charsRead = fileStreamReader.Read(x, 0, Number);
            string data = new string(x, 0, charsRead);
        }
    }
}
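As a follow-up to the snippet above: once the first Number characters are sitting in data (inside the innermost using block), you could pull the title out with a simple case-insensitive regex - this assumes the <title> element actually fits inside the chunk that was read:
// requires: using System.Text.RegularExpressions;
Match titleMatch = Regex.Match(data, @"<title[^>]*>(.*?)</title>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (titleMatch.Success)
{
    string title = titleMatch.Groups[1].Value.Trim();
}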
I suspect that WebClient will try to download the whole page first, in which case you'd probably want a raw client socket. Send the appropriate HTTP request (manually, since you're using raw sockets), start reading the response (which will not arrive immediately) and kill the connection when you've read enough. However, the rest will probably already have been sent from the server and be winging its way to your PC whether you want it or not, so you might not save much - if anything - in bandwidth.
Depending on what you want it for, many half-decent websites have a custom 404 page which is a lot simpler than a known page. Whether that has the information you're after is another matter.
You can use the verb "HEAD" in an HttpWebRequest to return just the response headers (not the document itself). To get the full <title> element and the META data you'll need to download the page and parse out the parts you want.
var request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";

Why does HttpWebResponse return a null terminated string?

I recently was using HttpWebResponse to return XML data from an HttpWebRequest, and I noticed that the stream returned a null-terminated string to me.
I assume it's because the underlying library has to be compatible with C++, but I wasn't able to find a resource providing further illumination.
Mostly I'm wondering if there is an easy way to disable this behavior so I don't have to sanitize the strings I pass into my XML reader.
Edit: here is a sample of the relevant code:
httpResponse.GetResponseStream().Read(serverBuffer, 0, BUFFER_SIZE);
output = processResponse(System.Text.UTF8Encoding.UTF8.GetString(serverBuffer));
where processResponse looks like:
void processResponse(string xmlResponse)
{
    var Parser = new XmlDocument();
    xmlResponse = xmlResponse.Replace('\0', ' '); // fix for httpwebrequest null terminating strings
    Parser.LoadXml(xmlResponse);
    // ...
}
This definitely isn't normal behaviour. Two options:
You made a mistake in the reading code (e.g. creating a buffer and then calling Read on a stream, expecting it to fill the buffer)
The web server actually returned a null-terminated response
You should be able to tell the difference using Wireshark if nothing else.
Could it be that you are passing the wrong size for the buffer you are reading into?
You can use a StreamReader to avoid the temp buffer if you don't need it.
using (var stream = new StreamReader(httpResponse.GetResponseStream()))
{
    string output = stream.ReadToEnd();
    // ...
}
Hmm... I doubt it returns a null-terminated string, since there is simply no such concept in C#. At best you could have a string with a '\0' (U+0000) character at the end, but in that case it would mean that the server's response actually contains such a character and HttpWebRequest is simply doing its duty and returning whatever the server returned.
Update
after reading your code, the mistake is pretty obvious: you are Read()-ing from a stream into a byte[] but taking no notice of how much you actually read:
int responseLength = httpResponse.GetResponseStream().Read(
    serverBuffer, 0, BUFFER_SIZE);
output = processResponse(System.Text.UTF8Encoding.UTF8.GetString(
    serverBuffer, 0, responseLength));
This would fix the immediate problem, leaving only the other bugs in your code to deal with, like the fact that you cannot correctly handle a response larger than BUFFER_SIZE... I would suggest you open an XML document reader on the returned stream instead of manipulating it via an (unnecessary) byte[] copy operation:
Parser.Load(httpResponse.GetResponseStream());
