I need to search many millions of JPEG files stored in Azure Blob Storage to find ones that are corrupt. It is a specific type of corruption where all the bytes in the file are 0. I should be able to tell whether a file is corrupt by inspecting the header, which is in the first several bytes of the file. I don't want to download the entire file, since that would cost money and time.
I'm using the Microsoft.Azure.Storage.Blob v11.1.2 NuGet package and have seen a few methods that looked promising, such as CloudBlockBlob.DownloadToByteArrayAsync and CloudBlockBlob.DownloadToStreamAsync, but they appear to download the entire file. (DownloadToByteArrayAsync threw an exception when I gave it a small array, hoping it would fill only that.)
Any help is appreciated.
See DownloadRangeToStreamAsync and DownloadRangeToByteArrayAsync. "Range" is the key term here, as it refers to the HTTP Range header, which broadly captures the notion of only downloading part of a resource. See here for how the SDK works under the hood with the Blob REST API.
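As a rough illustration of the range approach, the sketch below reads only the first few bytes of a blob and checks whether they are all zero. It assumes the Microsoft.Azure.Storage.Blob v11 package from the question; the class name, the method names and the 16-byte header length are made up for this example, and a zero-length blob would likely need separate handling because a range request against it fails.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;

public static class BlobHeaderCheck
{
    // Returns true when the first `count` bytes of the buffer are all zero.
    public static bool IsAllZero(byte[] buffer, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (buffer[i] != 0) return false;
        }
        return count > 0;
    }

    // Downloads only the first `headerLength` bytes of the blob (an HTTP Range request)
    // and reports whether they are all zero. headerLength = 16 is an arbitrary assumption.
    public static async Task<bool> HasZeroedHeaderAsync(CloudBlockBlob blob, int headerLength = 16)
    {
        var buffer = new byte[headerLength];
        // Arguments: target array, index into the array, offset in the blob, number of bytes.
        int read = await blob.DownloadRangeToByteArrayAsync(buffer, 0, 0, headerLength);
        return IsAllZero(buffer, read);
    }
}
```

A healthy JPEG starts with the bytes 0xFF 0xD8, so a header that is entirely zero is a strong hint that the file has the corruption described in the question.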
Related
As part of our installer build, we have to zip thousands of large data files into about ten or twenty 'packages' with a few hundred (or even thousands of) files in each which are all dependent on being kept with the other files in the package. (They are versioned together if you will.)
Then during the actual install, the user selects which packages they want included on their system. This also lets them download updates to the packages from our site as one large, versioned file rather than asking them to download thousands of individual ones which could also lead to them being out of sync with others in the same package.
Since these are data files, some of them change regularly during the design and coding stages, meaning we then have to re-compress all files in that particular zip package, even if only one file has changed. This makes the packaging step of our installer build take well over an hour each time, with most of that going to re-compressing things that we haven't touched.
We've looked into leaving the zip packages alone, then replacing specific files inside them, but inserting and removing large files from the middle of a zip doesn't give us that much of a performance boost. (A little, but not enough that it's worth it.)
I'm wondering if it's possible to pre-process each file into a cached raw 'compressed state' that matches how it would be written to the zip package: only the data itself, without the zip header info, etc.
My thinking is that, if this is possible, our build step would first look for any data file that doesn't have a compressed cache associated with it, then compress that file and write the result to the cache.
Next we would simply append all of the caches together in a file stream, adding any appropriate zip header needed for the files.
This would mean we are still recreating the entire zip during each build, but we are only recompressing data that has changed. The rest would just be written as-is which is very fast since it is a straight write-to-disk. And if a data file changes, its cache is destroyed, so next build-pass it would be recreated.
However, I'm not sure such a thing is possible. Is it, and if so, is there any documentation to show how one would go about attempting this?
Yes, that's possible. The most straightforward approach would be to zip each file individually into its own associated zip archive with one entry. When any file is modified, you replace its associated zip file to keep all of those up to date. Then you can write a simple program to take a set of those single entry zip files and merge them into a single zip file. You will need to refer to the documentation in the PKZip appnote. Take a look at that.
Now that you've read the appnote, what you need to do is take the local header, data, and central header from each individual zip file. Write each local header and its data as-is, sequentially, to the new zip file, saving the central header and the offset of the local header in the new file as you go. At the end of the new file, note the current offset, write a new central directory from the saved central headers with their local-header offsets updated, and finish with a new end of central directory record that holds the offset of the start of the central directory.
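The merge described above might be sketched as follows. This is only a minimal illustration, not the tool mentioned in the answer: the ZipMerger name is hypothetical, and the code assumes every input is a single-entry zip whose local header starts at offset 0, with no Zip64 records, no archive comment, and a little-endian machine (BitConverter follows the host byte order). The field offsets (12 and 16 in the end of central directory record, 42 in the central header) come from the appnote.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ZipMerger
{
    private const uint EndOfCentralDirSignature = 0x06054b50;

    public static void Merge(IEnumerable<string> singleEntryZips, string outputPath)
    {
        var centralHeaders = new List<byte[]>();
        using (var output = File.Create(outputPath))
        {
            foreach (string path in singleEntryZips)
            {
                byte[] zip = File.ReadAllBytes(path);
                int eocd = FindEndOfCentralDirectory(zip);
                int cdSize = (int)BitConverter.ToUInt32(zip, eocd + 12);   // size of central directory
                int cdOffset = (int)BitConverter.ToUInt32(zip, eocd + 16); // where it starts

                // Everything before the central directory is local header + data: copy as-is.
                uint newLocalOffset = (uint)output.Position;
                output.Write(zip, 0, cdOffset);

                // Keep the central header, patched to point at the entry's new position.
                byte[] central = new byte[cdSize];
                Array.Copy(zip, cdOffset, central, 0, cdSize);
                BitConverter.GetBytes(newLocalOffset).CopyTo(central, 42);
                centralHeaders.Add(central);
            }

            // New central directory, then a new end of central directory record.
            uint cdStart = (uint)output.Position;
            foreach (byte[] header in centralHeaders)
                output.Write(header, 0, header.Length);
            uint cdLength = (uint)output.Position - cdStart;

            var writer = new BinaryWriter(output);
            writer.Write(EndOfCentralDirSignature);
            writer.Write((ushort)0);                    // number of this disk
            writer.Write((ushort)0);                    // disk where the central directory starts
            writer.Write((ushort)centralHeaders.Count); // entries on this disk
            writer.Write((ushort)centralHeaders.Count); // entries in total
            writer.Write(cdLength);
            writer.Write(cdStart);
            writer.Write((ushort)0);                    // comment length
            writer.Flush();
        }
    }

    private static int FindEndOfCentralDirectory(byte[] zip)
    {
        // The record is at least 22 bytes long, so search backwards from there.
        for (int i = zip.Length - 22; i >= 0; i--)
        {
            if (BitConverter.ToUInt32(zip, i) == EndOfCentralDirSignature) return i;
        }
        throw new InvalidDataException("End of central directory record not found.");
    }
}
```

No data is recompressed here; the cost of a rebuild is essentially the cost of copying the cached bytes to disk.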
Update:
I decided this was a useful enough thing to write. You can get it here.
You could zip each file beforehand, and then "zip" them together with no compression at the end to quickly aggregate them into a distributable package. It won't be as efficient as compressing all the data at once, but it should be faster when files are modified.
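A minimal sketch of that second step, assuming the per-file zips already exist on disk: it bundles them into an outer archive using System.IO.Compression with CompressionLevel.NoCompression, so the already-compressed bytes are not compressed again. The StoredPackage name and the choice of the file name as the entry name are illustrative only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class StoredPackage
{
    // Bundles already-compressed files into one archive without recompressing them.
    public static void Build(IEnumerable<string> preZippedFiles, string packagePath)
    {
        using (var stream = File.Create(packagePath))
        using (var package = new ZipArchive(stream, ZipArchiveMode.Create))
        {
            foreach (string path in preZippedFiles)
            {
                ZipArchiveEntry entry = package.CreateEntry(
                    Path.GetFileName(path), CompressionLevel.NoCompression);
                using (var source = File.OpenRead(path))
                using (var target = entry.Open())
                {
                    source.CopyTo(target);
                }
            }
        }
    }
}
```

The trade-off is that the installer then has to unpack two levels: the outer package first, then each inner zip.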
I cannot seem to locate an actual exe that implements this type of functionality. It appears that most existing tools I've tried that can merge or update archives will reprocess (recompress) the data stream, as you have already seen.
However it seems what you describe can be done if you or someone wants to write it. If you take a look at this link for the ZIP file format specification, you can get an overview of the structure you would have to parse out and process. It looks like you can pretty quickly go from file to file gathering up and discarding the files of interest, then merging in your new/updated files. You would still need to rebuild a new central directory (refer to section 4.3.6 of the above linked document) within your new destination archive.
After a little more digging, the DotNetZip Library forum has a message asking about the same type of functionality which also gives a description just like I described above. It also links to this document which seems to indicate that support for that may be added to the DotNetZip library for you to further experiment with.
In my current project I'm dealing with a huge number of small files (tens of billions, each between 1 and 30 KB) used as resources, and copying them for my customer is a time-consuming job. I'm looking for a packaging mechanism that lets me bundle every 1,000 or 10,000 of them into a single file, which should speed up copying because there are far fewer files to handle. Reading them from my application should not require any extraction, and writing or changing them should not involve compression (because of performance and the nature of the application, which is distributed and shares resources between clients). I have searched and know about the following ZIP libraries:
SharpZipLib
DotNetZip
System.IO.Packaging
But it seems these libraries need, at the very least, to iterate through the entries to access a file in the zip or package without extraction. I need to access the files by their address (folder structure hierarchy) inside the zip or package file! The following links are similar questions, which were answered by iterating through the zip file:
how-to-read-data-from-a-zip-file-without-having-to-unzip-the-entire-file
content-inside-zip-file
Does anyone have an idea or solution for this issue?
By the way, I'm coding in C# and the project is Windows Forms-based.
I would create my own package format, with GZipStream or something similar. Compress each file with GZipStream, take the resulting bytes, and create a header in your package format that records, for each file, its name, starting position and length. This header will probably sit at the beginning of your package. To read a file, look up its entry in the header, seek to the position of the compressed data, and read a byte array of the specified length.
But if you modify one file, you will need to recalculate the index entries for all files after the modified one.
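One possible shape for such a format is sketched below. The layout is invented for this example: an entry count, then one (name, offset, length) record per file, then the GZip-compressed blobs, with offsets counted from the end of the header. The SimplePackage name and its methods are hypothetical as well.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class SimplePackage
{
    // Layout: [int count] [string name, long offset, int length]* [compressed blobs]
    public static void Write(string packagePath, IDictionary<string, byte[]> files)
    {
        var blobs = new List<KeyValuePair<string, byte[]>>();
        foreach (var file in files)
            blobs.Add(new KeyValuePair<string, byte[]>(file.Key, Compress(file.Value)));

        using (var writer = new BinaryWriter(File.Create(packagePath), Encoding.UTF8))
        {
            writer.Write(blobs.Count);
            long offset = 0; // relative to the first byte after the header
            foreach (var blob in blobs)
            {
                writer.Write(blob.Key);
                writer.Write(offset);
                writer.Write(blob.Value.Length);
                offset += blob.Value.Length;
            }
            foreach (var blob in blobs)
                writer.Write(blob.Value);
        }
    }

    // Reads one file by name: scan the small header, then seek straight to its data.
    public static byte[] Read(string packagePath, string name)
    {
        using (var reader = new BinaryReader(File.OpenRead(packagePath), Encoding.UTF8))
        {
            int count = reader.ReadInt32();
            long foundOffset = -1;
            int foundLength = 0;
            for (int i = 0; i < count; i++)
            {
                string entryName = reader.ReadString();
                long entryOffset = reader.ReadInt64();
                int entryLength = reader.ReadInt32();
                if (entryName == name) { foundOffset = entryOffset; foundLength = entryLength; }
            }
            if (foundOffset < 0) throw new FileNotFoundException(name);

            long dataStart = reader.BaseStream.Position;
            reader.BaseStream.Seek(dataStart + foundOffset, SeekOrigin.Begin);
            return Decompress(reader.ReadBytes(foundLength));
        }
    }

    private static byte[] Compress(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(data, 0, data.Length);
            return buffer.ToArray();
        }
    }

    private static byte[] Decompress(byte[] data)
    {
        using (var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

An application that reads many files from the same package would probably load the header into a dictionary once instead of scanning it on every call, and the GZip step can be dropped entirely if compression is unwanted.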
I am uploading documents on Google Docs as:
DocumentsService myService = new DocumentsService("");
myService.setUserCredentials("username@domain.com", password);
DocumentEntry newEntry = myService.UploadDocument(@"C:\Sample.txt", "Sample.txt");
But when I try to upload a file of 3 MB I get an exception:
An unhandled exception of type 'Google.GData.Client.GDataRequestException' occurred in Google.GData.Client.dll
Additional information: Execution of request failed: http://docs.google.com/feeds/documents/private/full
How can I upload large files to Google Docs?
I am using Google API ver 2.
In your code you are attempting to upload a .txt file.
This will be converted to a Google Docs "Document" file.
Files that can be converted are limited to 1 million characters (2 MB in size).
If you change the extension to something that is not recognised by Google Docs (for example .log), it will allow you to upload a file of up to 10 GB, although the free account only has 1 GB of storage for files.
This will allow you to store and retrieve files via your application, but the user will not be able to modify them directly in the Google Docs interface, although they can still download them.
There is a limit on the size of the file being uploaded:
http://docs.google.com/support/bin/answer.py?hl=en&answer=37603
Note that the document is converted to HTML, and the limit applies to the post-conversion size.
If you could post some more specifics I could probably come up with a creative solution. What comes to mind so far are:
How about breaking the document up into smaller documents and linking them, either by file name or with an actual link in the document?
Pre-process the document into streamlined text (not sure what kind of files you need to upload).
Upload as stored files and maybe have a Google Doc that loads the content in an iframe or something similar.
But yeah, if you give me more details, I can think it through if you like.
Try to find out from the terms and conditions whether they support larger files. There will also be timeouts set in the library; see if you can increase the timeout values in your GData.Client.
You have a file limit of 500 KB for documents to be converted.
http://docs.google.com/support/bin/answer.py?hl=en&answer=37603
I did read the following posts:
Pause/Resume Upload in C#
resume uploads using HTTP?
But I didn't find a complete solution to my problem.
In the above posts, one of the answers says "client and server needs to identify the file some how i suggest the use of a Guid so the server knows what file to append the extra data too." Please visit the first link above and find that answer. The answer is all about streaming. Can someone please provide links I can use to build this kind of code?
In these posts, one of the answers said "you can send several small file pieces and rebuild them server side"... HOW?
Can't I use something like a checksum to detect how much of the file has been uploaded and how much remains, and then append the rest to that file? If yes, how?
Streams are a fairly fundamental concept in working with files on the .NET platform (as they are in Java, C and other languages). You should start by reading about them and how to use them. See the Stream class on MSDN.
HOW? By using streaming - you stream parts of the file, in small chunks (using an offset into the file and the size of chunk). Again, see Stream documentation.
You could, but checksums of different files may be the same - with a GUID the chance of a collision is pretty small compared to checksums.
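To make the chunking idea concrete, here is a small sketch in which a local directory stands in for the server. The ResumableUpload class, its method names and the 64 KB chunk size are all hypothetical; in a real system AppendChunk and GetReceivedLength would be HTTP calls. The upload is identified by a GUID, and resuming works by asking the "server" how many bytes it already has and seeking to that offset in the source stream.

```csharp
using System;
using System.IO;

public static class ResumableUpload
{
    // Server side (simulated): append a chunk to the file identified by the upload id.
    public static void AppendChunk(string serverDir, Guid uploadId, byte[] buffer, int count)
    {
        string path = Path.Combine(serverDir, uploadId.ToString("N"));
        using (var target = new FileStream(path, FileMode.Append))
        {
            target.Write(buffer, 0, count);
        }
    }

    // Server side (simulated): how many bytes of this upload have arrived so far.
    public static long GetReceivedLength(string serverDir, Guid uploadId)
    {
        var info = new FileInfo(Path.Combine(serverDir, uploadId.ToString("N")));
        return info.Exists ? info.Length : 0;
    }

    // Client side: start (or resume) sending the file in small chunks.
    public static void Upload(string sourcePath, string serverDir, Guid uploadId, int chunkSize = 64 * 1024)
    {
        var buffer = new byte[chunkSize];
        using (var source = File.OpenRead(sourcePath))
        {
            // Skip whatever the server already holds, then stream the rest.
            source.Seek(GetReceivedLength(serverDir, uploadId), SeekOrigin.Begin);
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                AppendChunk(serverDir, uploadId, buffer, read);
            }
        }
    }
}
```

If the connection drops, calling Upload again with the same GUID continues from the last byte the server stored. A checksum could still be sent at the end to verify the reassembled file.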
In my application, the user selects a big file (>100 MB) on their drive. I want the program to take the selected file and chop it up into archived parts of 100 MB or less. How can this be done? What libraries and file format should I use? Could you give me some sample code? After the first 100 MB archived part is created, I am going to upload it to a server, then upload the next 100 MB part, and so on until the upload is finished. After that, from another computer, I will download all these archived parts and reassemble them into the original file. Is this possible with the 7zip libraries, for example? Thanks!
UPDATE: From the first answer, I think I'm going to use SevenZipSharp, and I believe I now understand how to split a file into 100 MB archived parts, but I still have two questions:
Is it possible to create the first 100 MB archived part and upload it before creating the next 100 MB part?
How do you extract a file with SevenZipSharp from multiple split archives?
UPDATE #2: I was just playing around with the 7-zip GUI and creating multi-volume/split archives, and I found that selecting the first one and extracting from it will extract the whole file from all of the split archives. This leads me to believe that paths to the subsequent parts are included in the first one (or is it consecutive?). However, I'm not sure if this would work directly from the console, but I will try that now, and see if it solves question #2 from the first update.
Take a look at SevenZipSharp. You can use it to create your split 7z files, upload them however you like, then extract them on the server side.
To split the archive, look at the SevenZipCompressor.CustomParameters member, passing in "v100m". (You can find more parameters in the 7-zip.chm file from 7zip.)
You can split the data into 100MB "packets" first, and then pass each packet into the compressor in turn, pretending that they are just separate files.
However, this sort of compression is usually stream-based. As long as the library you are using will do its I/O via a Stream-derived class, it would be pretty simple to implement your own Stream that "packetises" the data any way you like on the fly - as data is passed into your Write() method you write it to a file. When you exceed 100MB in that file, you simply close that file and open a new one, and continue writing.
Either of these approaches would allow you to easily upload one "packet" while continuing to compress the next.
edit
Just to be clear - Decompression is just the reverse sequence of the above, so once you've got the compression code working, decompression will be easy.
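The Stream-based approach described in this answer could look roughly like the sketch below. It is a write-only Stream that rolls over to a new part file whenever the current one reaches a size limit. The SplittingStream name and the ".001", ".002" part naming are invented for this example and are not part of any library.

```csharp
using System;
using System.IO;

public sealed class SplittingStream : Stream
{
    private readonly string _basePath;
    private readonly long _maxPartSize;
    private FileStream _current;
    private long _writtenInPart;
    private int _partIndex;

    public SplittingStream(string basePath, long maxPartSize)
    {
        _basePath = basePath;
        _maxPartSize = maxPartSize;
    }

    // Number of part files opened so far.
    public int PartCount => _partIndex;

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current == null || _writtenInPart >= _maxPartSize)
                OpenNextPart();

            // Write only as much as still fits into the current part.
            int toWrite = (int)Math.Min(count, _maxPartSize - _writtenInPart);
            _current.Write(buffer, offset, toWrite);
            _writtenInPart += toWrite;
            offset += toWrite;
            count -= toWrite;
        }
    }

    private void OpenNextPart()
    {
        _current?.Dispose(); // the finished part is now complete on disk and could be uploaded
        _partIndex++;
        _writtenInPart = 0;
        _current = File.Create($"{_basePath}.{_partIndex:D3}");
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }

    public override void Flush() => _current?.Flush();
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

Any compressor that writes to a Stream can be pointed at it, for example by wrapping a `new SplittingStream("backup.gz", 100L * 1024 * 1024)` in a GZipStream and copying the source file into that. To restore, read the parts back in order as one continuous stream and decompress.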