I need to generate etags for image files on the web. One of the possible solutions I thought of would be to calculate CRCs for the image files, and then use those as the etag.
This would require CRCs to be calculated every time someone requests an image on the server, so it's very important that it can be done fast.
So, how fast are algorithms to generate CRCs? Or is this a stupid idea?
Use a more robust hashing algorithm such as SHA-1 instead.
Speed depends on the size of the image. Most time will be spent on loading data from the disk, rather than in CPU processing. You can cache your generated hashes.
But I would also advise basing the ETag on the last update date of the file, which is much quicker and does not require loading the whole file.
Remember, an ETag only needs to be unique for a particular resource, so if two different images have the same last update time, that is fine.
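For illustration, here is a minimal C# sketch of that approach (the helper name is made up); it derives the ETag from the file's last write time so nothing has to be read or hashed per request:

using System;
using System.IO;

static class ETagHelper
{
    // Hypothetical helper: derive an ETag from the file's last write time,
    // so no image data has to be read or hashed on each request.
    public static string FromModificationTime(string imagePath)
    {
        DateTime lastWriteUtc = File.GetLastWriteTimeUtc(imagePath);
        // Ticks give sub-second resolution, sidestepping the one-second
        // granularity of the HTTP Last-Modified header.
        return "\"" + lastWriteUtc.Ticks.ToString("x") + "\"";
    }
}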
Most implementations, including Microsoft's own, use the last modified date or other file metadata as the ETag, and I suggest you use that method.
Depends on the method used, and the length. Generally pretty fast, but why not cache them?
If there won't be changes to the files more often than the resolution of the system used to store the modification time (that is, file modification times for the filesystem, or SQL Server datetime if stored in a database), then why not just use the date of modification at the relevant resolution?
I know RFC 2616 advises against the use of timestamps, but this is only because HTTP timestamps have one-second resolution and there can be changes more frequent than that. However:
That's still fine if you don't change images more than once a second.
It's also fine to base your e-tag on the time, as long as the precision is high enough that it won't end up the same for two versions of the same resource.
With this approach you are guaranteed a unique e-tag (collisions are unlikely with a large CRC but certainly possible), which is what you want.
Of course, if you don't ever change the image at a given URI, it's even easier, as you can just use a fixed string (I prefer the string "immutable").
I would suggest calculating the hash once when adding an image to the database, and then just returning it via SELECT along with the image itself.
If you are using SQL Server and the images are not very large (max 8000 bytes), you can leverage the HASHBYTES() function, which is able to generate SHA-1, MD5, and others.
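For larger images, you can compute the hash in application code at insert time instead. A rough C# sketch; the table and column names (Images, Data, ETag) are made up for illustration:

using System;
using System.Data.SqlClient;
using System.IO;
using System.Security.Cryptography;

static class ImageStore
{
    // Hash the image once here, store it next to the image, and serve
    // the stored value as the ETag on every subsequent request.
    public static void InsertImageWithETag(SqlConnection connection, string imagePath)
    {
        byte[] imageBytes = File.ReadAllBytes(imagePath);

        string etag;
        using (var sha1 = SHA1.Create())
            etag = BitConverter.ToString(sha1.ComputeHash(imageBytes)).Replace("-", "");

        using (var cmd = new SqlCommand(
            "INSERT INTO Images (Data, ETag) VALUES (@data, @etag)", connection))
        {
            cmd.Parameters.AddWithValue("@data", imageBytes);
            cmd.Parameters.AddWithValue("@etag", etag);
            cmd.ExecuteNonQuery();
        }
    }
}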
I want to check if the content of a file changed.
My plan is to add a hash in the last line of the file.
Later on, I can read the file, hash it (hash everything except the last line) and compare it to the last line of the file (initial hash).
I cannot use the last modified date/time. I need to use a hash or any kind of coding stored inside the file. I use C# to code the app.
What is the most reasonable/easiest way of doing this? I don't know which of the following would be a good match for me: SHA-1/2/3, CRC-16/32/64, or MD5? I do not need the method to be quick or secure.
Thank you!
It seems to me as if you're going to have a chicken or egg issue if you store the hash inside the file. You won't know the hash until you hash the file. But then when you hash the file and add that value to the end of the file, the hash will change. So clearly you need to hash the file without including the actual hash itself. You already said this, but I'm adding it again to clarify my next points.
The trick is that hash/sum algorithms give you the sum of the entire file (or byte stream, or whatever). They don't tend to give you a "running total" as it were. Which means you'll need to separate out the hash from the rest of the content before testing to see if it's changed. That is unless you write a custom hashing tool yourself.
This is of course possible using all hashing algorithms, but the fact that you are asking this question leads me to believe that you probably won't want the hassle of writing a custom (e.g.) SHA256 tool specifically designed to drop out when it reaches the stored hash.
To my eye, you have three choices:
Store the hash separately from your file - or at the minimum write a temporary file which does not contain the hash, and hash that. This would allow you to use a hashing tool already built into C# without any modification or fancy trickery. I know this does not exactly match your requirements as listed, but it's an option that you might consider.
You don't mention the size of the file, but if it is sufficiently small, you could simply slurp it up into memory minus the bytes of the hash, hash your in-memory data using a built-in tool, and then compare. This would again allow you to use built-in tools.
Use a custom hashing tool that purposely drops out when it reaches the end of the "interesting" data. If that's the case, I would unquestionably recommend a non-secure hashing method like CRC, simply because it will be so much easier to understand and modify the code yourself (it is much simpler code after all). You already mention that you don't need it to be secure, so this would meet your requirements.
If you decide to go with option #3, then I would suggest schlepping over to Rosetta Code to search for a CRC algorithm in C#. From there you can read your file, subtract out the bytes of the hash, send the remainder through your hashing algorithm. The algorithm listed there processes all bytes at once, but it would be trivial to turn the accumulator into a parameter so that you could send data in chunks. This would allow you to work on an arbitrarily large file in situ.
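To make that concrete, here is a rough C# sketch of the CRC-32 route; it assumes the stored hash sits on the final line with '\n' line endings, and the class and method names are invented. The running CRC is exposed as a parameter so you can feed data in chunks:

using System;
using System.IO;

static class Crc32Checker
{
    // Standard CRC-32 (IEEE) lookup table.
    static readonly uint[] Table = BuildTable();

    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // The running CRC is a parameter, so data can be fed in chunks:
    // start with 0xFFFFFFFF, keep passing the result back in, and
    // XOR with 0xFFFFFFFF when done.
    public static uint Update(uint crc, byte[] buffer, int count)
    {
        for (int i = 0; i < count; i++)
            crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return crc;
    }

    // Hashes everything up to and including the newline that precedes
    // the final line, where the stored hash is assumed to live.
    public static string HashAllButLastLine(string path)
    {
        byte[] all = File.ReadAllBytes(path);
        int lastNewline = Array.LastIndexOf(all, (byte)'\n');
        int length = lastNewline >= 0 ? lastNewline + 1 : all.Length;

        uint crc = Update(0xFFFFFFFFu, all, length);
        return (crc ^ 0xFFFFFFFFu).ToString("X8");
    }
}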
[EDIT] FWIW, I have already gone down a similar path. In my case I wrote a custom tool which allows us to incrementally copy extremely large files over the WAN. So big that we had problems getting the file to copy safely. Proper use of the tool is to remote into the source server, pre-run a CRC32 check and save the sums at arbitrary intervals. Then one copies the CRC32 checks to the client side and starts copying the file. Should the target get stopped in the middle, or possibly corrupted somehow, one can simply supply the name of the local partial, the remote source, the file containing CRC32 sums, and finally a target. The program will start copying from the local partial, and will only start copying from the remote when a partial CRC32 sum mismatch is found. Our problem was that a simple resume at the end of the copied bytes did not always work. Which was frustrating since it takes so long to copy. My teammates and I joked several times that we might try USB drives and homing pigeons...
What are you trying to protect yourself against?
Accidental change? Then your approach sounds fine. (Make sure to add handling for when the last line with the hash was deleted by accident too.)
Malicious change? Then you'd need to hash the file content together with some private key (essentially a keyed hash), and use a secure hashing algorithm. MD5 is good for detecting accidental changes because it is fast, but cryptographically it is considered broken.
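"Hash the content plus a private key" is exactly what an HMAC does, and .NET has that built in. A minimal sketch, assuming the key is stored somewhere outside the file itself:

using System;
using System.Security.Cryptography;

static class KeyedHash
{
    // HMAC over the file content (excluding the stored hash line);
    // an attacker who can't read the key can't forge a valid value.
    public static string Compute(byte[] contentWithoutHashLine, byte[] secretKey)
    {
        using (var hmac = new HMACSHA256(secretKey))
            return Convert.ToBase64String(hmac.ComputeHash(contentWithoutHashLine));
    }
}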
I am making a state-saving replay system in Unity3D and I want the replay to be written to a file. What file format is best to use when saving a replay? XML maybe? For now I'm storing the transform data, and I've implemented the option to add additional frame data.
It highly depends on what you are trying to do. There are tradeoffs for every solution in this case.
Based on the extra information you have given in the comments, the best solution I can think of in this case is marshalling your individual "recording sessions" onto a file. However, there is a little overhead to be done in order to achieve this.
Create a class called Frame and another class called Record which has a List<Frame> frames. That way, you can place any information that you would like to capture in each frame as attributes of the Frame class.
Since you can't marshal a generic type, you will have to marshal each frame individually. I suggest implementing a method in the Record class called MarshalRecording() that handles that for you.
However, in doing that, you will find it difficult to unmarshal your records, because frames may have different sizes in binary form and there would be no separator indicating where one frame ends and the next begins. I suggest writing the size of each marshalled frame at its beginning; that way you will be able to unmarshal all of the frames even if they have different sizes.
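To illustrate the length-prefix idea, here is a rough C# sketch; the frame fields are placeholders for whatever per-frame data you actually capture, and the method names just mirror the ones suggested above:

using System.Collections.Generic;
using System.IO;

public class Frame
{
    // Placeholder per-frame data; swap in whatever you actually record.
    public float PosX, PosY, PosZ;

    public byte[] Marshal()
    {
        var ms = new MemoryStream();
        using (var w = new BinaryWriter(ms))
        {
            w.Write(PosX); w.Write(PosY); w.Write(PosZ);
        }
        return ms.ToArray();
    }

    public static Frame Unmarshal(byte[] data)
    {
        using (var r = new BinaryReader(new MemoryStream(data)))
            return new Frame { PosX = r.ReadSingle(), PosY = r.ReadSingle(), PosZ = r.ReadSingle() };
    }
}

public class Record
{
    public List<Frame> Frames = new List<Frame>();

    // Each frame is written as [int length][frame bytes], so frames of
    // different sizes can still be read back unambiguously.
    public void MarshalRecording(Stream output)
    {
        var w = new BinaryWriter(output);
        foreach (var frame in Frames)
        {
            byte[] bytes = frame.Marshal();
            w.Write(bytes.Length);
            w.Write(bytes);
        }
        w.Flush();
    }

    // Assumes a seekable stream (file or memory).
    public static Record UnmarshalRecording(Stream input)
    {
        var record = new Record();
        var r = new BinaryReader(input);
        while (input.Position < input.Length)
        {
            int length = r.ReadInt32();
            record.Frames.Add(Frame.Unmarshal(r.ReadBytes(length)));
        }
        return record;
    }
}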
As @PiotrK pointed out in his answer, you could use protobuf. However, I don't recommend it for your specific use. Personally, I think it is overkill (too much work for too little result; protobufs can be a PITA sometimes).
If you are worried about storage size, you could LZ4 the whole thing (if you are concatenating the binary information in memory), or LZ4 each frame (if you are processing each frame individually and then appending it to a file). I recommend the latter because, depending on the number of frames, you may run out of memory while marshalling your record.
Ps: Never use XML, it's cancerous!
The best way I know is to use some kind of format generator (like Google Protobuf). This has several advantages over using the default C# serializer:
Versioning of the format is as easy as possible; you can add new features to the format without breaking replays already existing in the field.
The result is stored as either text or binary, with binary preferable as it has a very small footprint (for example, if some of your data has default values, it won't be present in the output at all - the loader will handle them gracefully).
It's Google technology! :-)
One part of my application requires a bunch of images (representing scale) on screen. Because of the wide variety of possibilities, I'd rather generate the images programmatically than pre-create and store all possible images (some of which may never be used). This seems doable using the method described in this question and answer.
However, the two pages which will use these images will have plenty of them (potentially a couple hundred on one of the pages). My question, then, is will this negatively impact the performance of the application, and if so, how drastically? The pages could potentially be reloaded several times as values change.
Would it be best to generate the images when the page is loaded? Best to precreate them and store several hundred, possibly only using a few? Or would it be best to programmatically create them the first time they are loaded, and then store them under the assumption that since they have been used once, they will likely be used again (assuming they would still be valid - it is quite possible for them to become invalid and need to be replaced)?
EDIT: Each of these images represents a number, which is an application-wide variable. It is expected that most of these numbers will be different, although there may be some few that are equal.
Why not do both: programmatically generate images as needed, but cache them (i.e. save them as files on the server) so they can be reused.
Further to your edit: if the images are simple image representations of numbers, then just pre-generate 0 to 9 and programmatically glue them together at runtime.
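Something along these lines (a sketch using System.Drawing; the digit file names and folder are assumptions, equal-height digit images are assumed, and negative numbers aren't handled):

using System;
using System.Drawing;
using System.IO;

static class NumberRenderer
{
    // Glue pre-rendered digit images (0.png .. 9.png) together into one bitmap.
    public static Bitmap RenderNumber(int value, string digitFolder)
    {
        string text = value.ToString();
        var digits = new Bitmap[text.Length];
        int totalWidth = 0, height = 0;

        for (int i = 0; i < text.Length; i++)
        {
            digits[i] = new Bitmap(Path.Combine(digitFolder, text[i] + ".png"));
            totalWidth += digits[i].Width;
            height = Math.Max(height, digits[i].Height);
        }

        var result = new Bitmap(totalWidth, height);
        using (var g = Graphics.FromImage(result))
        {
            int x = 0;
            foreach (var digit in digits)
            {
                g.DrawImage(digit, x, 0);
                x += digit.Width;
                digit.Dispose();
            }
        }
        return result;   // this is what you would cache to disk for reuse
    }
}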
I need to write a simple source control system and wonder what algorithm I would use for file differences?
I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.
Just to give some background, I just need some really simple functionality - binary files in a folder, no subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall really simple stuff; my single concern really is how to store only the differences of a file from revision to revision without wasting too much space, but also without being too inefficient (maybe store a full version every X changes, a bit like keyframes in videos?).
Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.
Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.
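To show the core mechanism, here is a toy C# sketch of a line-based LCS diff; real tools layer many refinements (patience diff, line hashing, and so on) on top of this:

using System;
using System.Collections.Generic;

static class SimpleDiff
{
    // Returns an edit script: "  " context, "+ " added line, "- " removed line.
    public static List<string> Diff(string[] oldLines, string[] newLines)
    {
        int n = oldLines.Length, m = newLines.Length;
        var lcs = new int[n + 1, m + 1];

        // Classic DP table: lcs[i, j] = length of the LCS of the first i
        // old lines and the first j new lines.
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                lcs[i, j] = oldLines[i - 1] == newLines[j - 1]
                    ? lcs[i - 1, j - 1] + 1
                    : Math.Max(lcs[i - 1, j], lcs[i, j - 1]);

        // Walk the table backwards to recover the edit script.
        var edits = new List<string>();
        int a = n, b = m;
        while (a > 0 || b > 0)
        {
            if (a > 0 && b > 0 && oldLines[a - 1] == newLines[b - 1])
            {
                edits.Add("  " + oldLines[--a]); b--;
            }
            else if (b > 0 && (a == 0 || lcs[a, b - 1] >= lcs[a - 1, b]))
            {
                edits.Add("+ " + newLines[--b]);
            }
            else
            {
                edits.Add("- " + oldLines[--a]);
            }
        }
        edits.Reverse();
        return edits;
    }
}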
How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.
Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.
The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest common subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).
Though fossil is GPL, the delta algorithm is based on rsync and described here
I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you but I did come to the conclusion that if I were to write a file diff tool, that I would do so with an algorithm (for finding diffs) that functions somewhat like how REGEXes function with their greediness.
As for storing DIFFs... If I were you, instead of storing forward-facing DIFFs (i.e. starting with your original file and then computing 150 diffs against it by the time you're working with version 151), use stored DIFFs for your history but keep your latest file stored as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.
In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
And this is created during the data load when each object is processed, such that the ApplicationObject's key string and description words are stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively is it possible to create a kind of CompressedDictionary? As it is storing lots of strings I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index into that array instead of the object reference - it might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings; use a trie (see the sketch just below).
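A bare-bones sketch of the trie idea, storing array indexes as postings so the full ApplicationObject references don't have to be duplicated (all names here are made up):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<int> Postings;   // e.g. indexes into an ApplicationObject array
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word, int objectIndex)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        (node.Postings ?? (node.Postings = new List<int>())).Add(objectIndex);
    }

    // Returns null when the word was never indexed.
    public List<int> Find(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return null;
        return node.Postings;
    }
}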
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency is like - how many of your dictionary entries have single element lists, how many have two element lists etc. You could potentially store several separate dictionaries - one for "I've only got one element" (direct mapping) then "I've got two elements" (map to a Pair struct with the two references in) etc until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts etc).
If none of this makes much sense, let me know and I'll try to come up with some code.
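For what it's worth, a rough sketch of what that tiered approach might look like (ApplicationObject is stubbed out here, and the promotion threshold of three entries is arbitrary):

using System.Collections.Generic;

// Stand-in for the real class from the question.
public class ApplicationObject { public string Key; public string Description; }

// Single-entry and two-entry postings avoid a List<T> allocation; only
// longer postings fall back to a real list. Everything hides behind Add/Get.
public class CompactIndex
{
    private struct Pair { public ApplicationObject First, Second; }

    private readonly Dictionary<string, ApplicationObject> _single =
        new Dictionary<string, ApplicationObject>();
    private readonly Dictionary<string, Pair> _double =
        new Dictionary<string, Pair>();
    private readonly Dictionary<string, List<ApplicationObject>> _many =
        new Dictionary<string, List<ApplicationObject>>();

    public void Add(string key, ApplicationObject value)
    {
        List<ApplicationObject> list;
        Pair pair;
        ApplicationObject single;

        if (_many.TryGetValue(key, out list))
        {
            list.Add(value);
        }
        else if (_double.TryGetValue(key, out pair))
        {
            // Promote to a real list at the third entry.
            _double.Remove(key);
            _many[key] = new List<ApplicationObject> { pair.First, pair.Second, value };
        }
        else if (_single.TryGetValue(key, out single))
        {
            _single.Remove(key);
            _double[key] = new Pair { First = single, Second = value };
        }
        else
        {
            _single[key] = value;
        }
    }

    public IEnumerable<ApplicationObject> Get(string key)
    {
        List<ApplicationObject> list;
        Pair pair;
        ApplicationObject single;

        if (_many.TryGetValue(key, out list)) return list;
        if (_double.TryGetValue(key, out pair)) return new[] { pair.First, pair.Second };
        if (_single.TryGetValue(key, out single)) return new[] { single };
        return new ApplicationObject[0];
    }
}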
I agree with bobwienholt, but if you are indexing datasets I assume these came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load another one from the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and inspect the Dictionary class, you will see that it is composed of buckets. You can probably use each of these buckets as a page and physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" marker to the file and cleaning the file up every so often.
By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
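A minimal sketch of what that override might look like, assuming some Key field identifies an ApplicationObject:

using System;

public class ApplicationObject
{
    public string Key;
    public string Description;

    public override bool Equals(object obj)
    {
        // Two objects are equal when their keys match.
        var other = obj as ApplicationObject;
        return other != null && string.Equals(Key, other.Key, StringComparison.Ordinal);
    }

    public override int GetHashCode()
    {
        // Must be consistent with Equals: equal keys produce equal hashes.
        return Key != null ? Key.GetHashCode() : 0;
    }
}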
How about using the Win32 memory-mapped file API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
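If you can target .NET 4.0 or later, System.IO.MemoryMappedFiles wraps the same Win32 API without any P/Invoke; a minimal sketch (the file name, map name and capacity are arbitrary):

using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedIndexDemo
{
    public static void TouchMappedIndex(string indexFilePath)
    {
        // Map (up to) 64 MB of the index file into the address space;
        // the OS pages it in and out as needed instead of your app.
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   indexFilePath, FileMode.OpenOrCreate, "searchIndex", 64 * 1024 * 1024))
        using (var accessor = mmf.CreateViewAccessor())
        {
            int firstValue = accessor.ReadInt32(0);   // read whatever lives at offset 0
            accessor.Write(0, firstValue + 1);        // and write it back
        }
    }
}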
Is the index only added to or do you remove keys from it as well?