I know it might seem ridiculous that you would purposely want to corrupt a file, but I assure you it's for a good reason.
In my app, I have a lot of XML serialization going on, which in turn also means a lot of deserialization.
Today I tried some disaster scenarios: I reset the server during a serialization operation, and as expected it corrupted the XML file.
The problem is that trying to "shut down" the server at exactly the right time to corrupt the file is not really optimal. Firstly, it's pure luck to catch the operation during its .0001 ms write time, and secondly the server then needs to reboot. It's also just a bad idea, period, to be pulling the plug on the server, for other reasons.
Is there an app that can effectively corrupt a file, so that this file can be used for testing in my app?
Open it up in a hex editor and have fun twiddling bits?
This is kind of the approach behind Fuzz Testing, i.e. introduce random variations and see how your application copes. You might look at some of the fuzz testing frameworks mentioned in the cited link. But in your case, it would be just as easy to use a random number generator to pick positions and flip bits there to corrupt the file. If you have a known failure case, then you can just keep an existing corrupt file around, of course.
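To make that concrete, here is a minimal sketch of the random-corruption idea in C#. The file names and the seed are illustrative only; the fixed seed means the "random" corruption is reproducible across test runs.

```csharp
using System;
using System.IO;

class FileCorruptor
{
    // Copy a known-good file and flip a few randomly chosen bits in the copy.
    // A fixed seed makes the corruption deterministic and thus reproducible.
    public static void Corrupt(string source, string dest, int flips, int seed)
    {
        byte[] data = File.ReadAllBytes(source);
        var rng = new Random(seed);
        for (int i = 0; i < flips; i++)
        {
            int pos = rng.Next(data.Length);
            data[pos] ^= (byte)(1 << rng.Next(8)); // flip one random bit
        }
        File.WriteAllBytes(dest, data);
    }

    static void Main()
    {
        File.WriteAllText("good.xml", "<root><item>42</item></root>");
        Corrupt("good.xml", "bad.xml", flips: 3, seed: 1234);
        Console.WriteLine(File.ReadAllText("bad.xml"));
    }
}
```

Because the seed is fixed, "bad.xml" comes out the same every run, so a bug found with it can be retested against the exact same input.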
Are you attempting to test for a partially degraded file?
If you want to test how your program reacts to bad data, why not just use any random text file as input?
There are several ways of corrupting an XML file. Two that come to mind:
- Incomplete XML tags (truncated XML).
- Unexpected content in the data (binary / extra text).
For the first, I would copy a correct/complete XML file and modify it by hand. For the second, I would concatenate a partial XML file with any binary file on the filesystem.
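Both styles can also be produced programmatically. A sketch, with a made-up sample document and file names:

```csharp
using System;
using System.IO;
using System.Text;

class XmlCorruption
{
    static void Main()
    {
        string xml = "<settings><user name=\"alex\"/><count>5</count></settings>";
        File.WriteAllText("complete.xml", xml);

        // 1) Truncated XML: keep only the first half, as if the write was interrupted.
        File.WriteAllText("truncated.xml", xml.Substring(0, xml.Length / 2));

        // 2) Unexpected binary content: a valid prefix followed by raw bytes.
        byte[] garbage = { 0x00, 0xFF, 0x13, 0x37, 0x00 };
        using (var fs = new FileStream("binary.xml", FileMode.Create))
        {
            byte[] prefix = Encoding.UTF8.GetBytes(xml.Substring(0, 20));
            fs.Write(prefix, 0, prefix.Length);
            fs.Write(garbage, 0, garbage.Length);
        }
    }
}
```

Both output files are deterministic, so they can be checked into the test suite as known-bad inputs.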
A hex editor seems a little too much for me ;)
I would highly recommend you don't do 'random byte' corruption for testing. Not only do you not know exactly what state you're testing, but if you do find a bug you'll be hard pressed to guarantee that the next test run verifies the fix.
My recommendation is to corrupt the file manually (or programmatically) in a predictable way, so that you know what you're testing and how to reproduce the test if you must. (Of course, you'll probably want multiple predictable corruptions to ensure protection against corruption anywhere in the file.)
Agree with the Hex editor option, as this will allow you to introduce non-text values into the file, such as nulls (0x00), etc.
If you're trying to simulate an interrupted write, you might want to just truncate the string representing the serialized data. This would be especially easy if you're using unit tests, but still quite feasible with Notepad.
Of course, that's just one kind of bad data, but it's worth noting that XML that's malformed in any way is essentially no longer XML, and most parsers will reject it out-of-hand at the first sign of a syntax error.
is it possible to use either File.Delete or File.Encrypt to shred files? Or do both functions not overwrite the actual content on disk?
And if they do, does this also work with wear leveling of ssds and similar techniques of other storages? Or is there another function that I should use instead?
I'm trying to improve an open source project which currently stores credentials in plaintext within a file. For reasons I don't fully understand, they are always written to that file (I don't know why Ansible does this; there may be a valid reason, and for now I don't want to touch that part of the code), but I can delete the file afterwards. So is using File.Delete or File.Encrypt the right approach to purge that information from the disk?
Edit: If it is only possible using a native API and P/Invoke, I'm fine with that too. I'm not limited to .NET only, but I am limited to C#.
Edit2: To provide some context: The plaintext credentials are saved by the ansible internals as they are passed as a variable for the modules that get executed on the target windows host. This file is responsible for retrieving the variables again: https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/powershell/Ansible.ModuleUtils.Legacy.psm1#L287
https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/csharp/Ansible.Basic.cs#L373
There's a possibility that File.Encrypt would do more to help shred data than File.Delete (which definitely does nothing in that regard), but it won't be a reliable approach.
There's a lot going on at both the Operating System and Hardware level that's a couple of abstraction layers separated from the .NET code. For example, your file system may randomly decide to move the location where it's storing your file physically on the disk, so overwriting the place where you currently think the file is might not actually remove traces from where the file was stored previously. Even if you succeed in overwriting the right parts of the file, there's often residual signal on the disk itself that could be picked up by someone with the right equipment. Some file systems don't truly overwrite anything: they just add information every time a change happens, so you can always find out what the disk's contents were at any given point in time.
So if you legitimately cannot prevent a file getting saved, any attempt to truly erase it is going to be imperfect. If you're willing to accept imperfection and only want to mitigate the potential for problems somewhat, you can use a strategy like the ones you've found to try to overwrite the file with garbage data several times and hope for the best.
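A minimal sketch of that imperfect mitigation strategy (the file name and pass count are illustrative; as explained above, the file system and hardware may still retain copies this code cannot reach):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class BestEffortShred
{
    // Best-effort only: overwrite the file's current extent with random
    // bytes several times, then delete it. This does NOT defeat wear
    // leveling, copy-on-write file systems, or forensic recovery.
    public static void OverwriteAndDelete(string path, int passes = 3)
    {
        long length = new FileInfo(path).Length;
        byte[] noise = new byte[length];
        using (var rng = RandomNumberGenerator.Create())
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            for (int pass = 0; pass < passes; pass++)
            {
                rng.GetBytes(noise);
                fs.Position = 0;
                fs.Write(noise, 0, noise.Length);
                fs.Flush(true); // ask the OS to flush through to disk
            }
        }
        File.Delete(path);
    }

    static void Main()
    {
        File.WriteAllText("secret.txt", "hunter2");
        OverwriteAndDelete("secret.txt");
        Console.WriteLine(File.Exists("secret.txt")); // False
    }
}
```

`FileStream.Flush(true)` requests a flush to the physical device, not just to the OS cache, but even that is a request, not a guarantee, once SSDs and RAID controllers are involved.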
But I wouldn't be too quick to give up on solving the problem at its source. For example, Ansible's docs mention:
A great alternative to the password lookup plugin, if you don’t need to generate random passwords on a per-host basis, would be to use Vault in playbooks. Read the documentation there and consider using it first, it will be more desirable for most applications.
I want to write a logger/bug tracker with XML output for my current project. I'm sorry if this is a duplicate, but the suggested questions were not useful and I did not find a good solution via Google either.
1: My first question is about exception safety.
If I use an XmlDocument, the logs are stored in memory until I call Save. That means I could lose everything in case of an exception.
If I use an XmlWriter, the log is not kept in memory (AFAIK), but I have to close the writer and all open elements/nodes, which could also be a problem if an exception occurs. Can I close and re-open the writer (with the pointer at the end of the document)?
What's the best solution for an exception-safe XML creation ? (I only need a hint)
2: My second question is about memory usage.
Because it's a tracing tool, the output can be very large. Therefore I can't use XmlDocument. In my opinion, XmlWriter would be the best solution for this. Am I right about that?
3: My last [minor] question is about time consumption.
Is it a good or bad idea to use an XML file for tracing? How much does it slow down my program?
I hope you can help me.
EDIT: Why do I want to use XML?
Later on, my app will run in an "unknown" environment, so it is necessary that I can send the log over the internet to a server and validate the file (against an XML Schema). After that is done, I want to convert it to a more readable (and nicely formatted) HTML file.
As you can see, this is a much better visualization than raw XML (it still needs some fine tuning, but it's functional).
EDIT 2: current state
I have made some memory usage measurements. The logger (currently based on XmlDocument :( ) needs ~600 MB for 5,000,000 entries. Not the best result, but not the worst either.
Best regards Alex
For a trace file, do you need the structure of XML? What are your plans for the file once it has been produced? If it is purely for human consumption, a plain text file would be sufficient; that way you can flush after every write.
If you want a more queryable file format, could you incorporate a lightweight DB engine? Before the changes to MDAC and the 64-bit issues with MSAccess file access, I used to write to an .mdb file. More recently I have looked at SQLite and VistaDB.
Is there a specific reason behind you opting for XML?
Because this question did not receive a good answer, I have to answer it myself.
I've decided to use a ring buffer with 10k entries for my log (this seems to be a good number in live use), and I've tried to build a thread- and exception-safe log buffer (I hope it works).
Best regards and thanks for all the answers,
Alex
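For anyone landing here later, a minimal sketch of what such a thread-safe log ring buffer could look like (the class name and API are hypothetical; the post doesn't show its actual code, only that it uses a 10k-entry ring buffer):

```csharp
using System;
using System.Collections.Generic;

// Bounded log buffer: when full, the oldest entry is dropped, so memory
// use stays constant no matter how long the application runs.
class LogRingBuffer
{
    private readonly Queue<string> entries = new Queue<string>();
    private readonly int capacity;
    private readonly object gate = new object();

    public LogRingBuffer(int capacity) { this.capacity = capacity; }

    public void Add(string entry)
    {
        lock (gate)
        {
            if (entries.Count == capacity)
                entries.Dequeue(); // drop the oldest entry
            entries.Enqueue(entry);
        }
    }

    public string[] Snapshot()
    {
        lock (gate) { return entries.ToArray(); }
    }

    static void Main()
    {
        var log = new LogRingBuffer(3);
        for (int i = 1; i <= 5; i++) log.Add("entry " + i);
        Console.WriteLine(string.Join(", ", log.Snapshot())); // entry 3, entry 4, entry 5
    }
}
```

On a crash, `Snapshot()` can be called from a top-level exception handler to dump the last N entries, which sidesteps the "everything is lost until Save" problem of XmlDocument.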
Why do you want to roll your own logging solution?
There are very mature libraries which can do this "out of the box", AFAIK.
I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then, the deserialized part should be removed from the file, making the next object to be "first".
From the standpoint of the file system, that would just mean moving the start of the file several bytes further along on the disk, i.e. moving the "beginning of the file" pointer. The question is how to implement this in C#. Is it at all possible?
The easiest approach that I can see:
1) Stream out (like a log: dump records into a file). Note: you'd need some delimiters and a consistent format for your file, based on what your data is.
2) Later, stream in: just read the file from the start, in one go, and process it without removing anything.
That works fine as FIFO (first in, first out). So, my suggestion: don't try to optimize by removing or skipping parts of the file; rather, regroup and use more files.
3) If you worry about the scale of things, then just "partition" the queue into small enough files, e.g. 100 or 1,000 records each (it depends; do some calculations).
You may need some sort of "virtualizer" here which maps the files and keeps track of your "database" as it spans multiple files. The simplest approach is to just use the file system and check file times etc., or add some basic code to improve on that.
However, I think you may have problems if you have to ensure "transactions", i.e. if things fail you need to keep track of where the file left off, retrace, etc.
That might be an issue, but you know best whether that is really necessary (how critical it is). You can always work per file, and with smaller files: if processing fails, roll back and do the file again (or log the problem); if it succeeds, delete the file and go on like that.
This is a very hand-made approach, but it should get you going with a simple and not too demanding solution, like the one you're describing. Or something along those lines.
I should probably add...
You could also save yourself some trouble and use a portable database for this, or something similar. The above was purely based on the idea of hand-coding the simplest solution (we could probably come up with something smarter, but it being late, this is what I have :).
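A minimal sketch of the "one file per record, delete the oldest" variant of this idea (class name, file naming scheme, and directory layout are all illustrative; recovering the counter after a restart is deliberately left out):

```csharp
using System;
using System.IO;
using System.Linq;

// FIFO queue backed by one file per record. "Removing from the front"
// becomes deleting a whole file, which the file system handles cheaply.
class FileQueue
{
    private readonly string dir;
    private long next; // NOTE: resets on restart; a real version would
                       // resume from the highest existing file number.

    public FileQueue(string dir)
    {
        this.dir = dir;
        Directory.CreateDirectory(dir);
    }

    public void Push(string payload)
    {
        // Zero-padded names keep lexical order equal to insertion order.
        File.WriteAllText(Path.Combine(dir, $"{next++:D10}.rec"), payload);
    }

    public string Pop()
    {
        string oldest = Directory.GetFiles(dir, "*.rec")
                                 .OrderBy(f => f).FirstOrDefault();
        if (oldest == null) return null;
        string payload = File.ReadAllText(oldest);
        File.Delete(oldest);
        return payload;
    }

    static void Main()
    {
        var q = new FileQueue("queue-data");
        q.Push("first");
        q.Push("second");
        Console.WriteLine(q.Pop()); // first
        Console.WriteLine(q.Pop()); // second
    }
}
```

For higher volume, the same scheme works with batches of N records per file, as suggested in point 3 above, at the cost of tracking a read position within the current front file.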
Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem. But a linear file is totally inappropriate for representing a FIFO queue.
I'm currently following a tutorial series for a Tile Engine which uses XML files to store conversations between NPCs. A topic it doesn't appear to cover (I have only quickly glanced through the subsequent videos) is how to prevent the user from either altering or knowing in advance what the NPC is going to say by opening the XML file easily with a generic text editor.
The 2nd point of being able to read future conversations is not a real issue but something I wanted to think about, so if that's hard to implement I am not too fussed at this point.
How would I go about making the XML uneditable? I know vaguely about CRC32's which can check file integrity which may be useful and I also think there might be better ways to go about that (i.e. not with a CRC32).
The most extreme option I can think of would be to create my own arbitrary encoding for the conversation data, but the usefulness of XML files deters me from that slightly; and since the tutorials I'm following are teaching me a lot of things I don't know, I would prefer not to deviate too far from them!
Just looking for a direction really, thanks!
XML is fundamentally an open format, so there is no way to make an XML file truly uneditable.
But you can keep a copy of the XML document (or some fingerprint of it) on your server (or at the endpoints of the NPC conversation), and then compare to see whether the document has been edited.
If the document was edited, you can replace it with the backup version, or tell the endpoints that the XML document was corrupted...
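The fingerprint approach can be as simple as comparing a cryptographic hash. A sketch (file name and content are illustrative; this detects tampering, it does not prevent it, and a determined user can still patch the check itself):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class IntegrityCheck
{
    // SHA-256 fingerprint of a file's contents.
    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(fs));
    }

    static void Main()
    {
        File.WriteAllText("dialog.xml", "<npc><line>Hello!</line></npc>");
        string expected = HashFile("dialog.xml"); // recorded at build time

        // Later, before loading the conversation:
        bool tampered = HashFile("dialog.xml") != expected;
        Console.WriteLine(tampered); // False
    }
}
```

A hash is preferable to a CRC32 here: CRC32 catches accidental corruption but is trivial to forge deliberately, while recreating a file that matches a given SHA-256 is not practical.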
Historically, many games wrap multiple resources into a single binary file.
You might put it in a ZIP file (and maybe change the file extension). That would allow you to avoid having an XML file with an obvious name as a temptation for your users :).
Ultimately, you're asking something similar to the DRM question. I don't know whether your platform has an answer to that. (E.g., "using RSA encryption" is not secure as such; your program still has to decrypt the data at some point using the appropriate key, etc).
I've not done much with linq to xml, but all the examples I've seen load the entire XML document into memory.
What if the XML file is, say, 8GB, and you really don't have the option?
My first thought is to use the XElement.Load Method (TextReader) in combination with an instance of the FileStream Class.
QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?
Note: high performance isn't required. I'm trying to get LINQ to XML to basically do the work of a program I could write that loops through every line of my big file and gathers things up; since LINQ is "loop centric", I'd expect this to be possible...
Using XElement.Load will load the whole file into memory. Instead, use XmlReader with the XNode.ReadFrom function, which lets you selectively load the nodes found by XmlReader into XElement objects for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
If you just need to search the XML document, XmlReader alone will suffice and will not load the whole document into memory.
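A condensed sketch of that streaming pattern (the element name "item" and the sample file are illustrative): XmlReader walks the file forward, and XNode.ReadFrom materializes only one element at a time as an XElement you can query with LINQ. Note that ReadFrom advances the reader past the element, which is why the loop only calls Read() on the non-matching branch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

class StreamingXml
{
    public static IEnumerable<XElement> StreamItems(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // move to the first child
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                    yield return (XElement)XNode.ReadFrom(reader); // also advances
                else
                    reader.Read();
            }
        }
    }

    static void Main()
    {
        File.WriteAllText("big.xml",
            "<root><item id=\"1\"/><item id=\"2\"/><item id=\"3\"/></root>");
        foreach (var item in StreamItems("big.xml"))
            Console.WriteLine(item.Attribute("id").Value);
    }
}
```

Memory use is bounded by the size of a single `<item>` element rather than the whole document, so this scales to files far larger than RAM.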
Gabriel,
Dude, this isn't exactly answering your actual question (how to read big XML docs using LINQ), but you might want to check out my old question, "What's the best way to parse big XML documents in C-Sharp". The last answer (time-wise) was a "note to self" on what actually worked: it turns out that a hybrid of XmlReader for the document and XmlSerializer for the doclets is fast (enough) AND flexible.
BUT note that I was dealing with docs of only up to 150 MB. If you really have to handle docs as big as 8 GB, then I guess you're likely to encounter all sorts of problems, including issues with the OS's large-file (>2 GB) handling... in which case I strongly suggest you keep things as primitive as possible, and XmlReader is the most primitive (and, according to my testing, THE fastest) XML parser available in the Microsoft namespace.
Also: I've just noticed a belated comment in my old thread suggesting that I check out VTD-XML... I had a quick look at it just now... It "looks promising", even if the author seems to have contracted a terminal case of FIGJAM. He claims it'll handle docs of up to 256 GB, to which I reply: "Yeah, have you TESTED it? In WHAT environment?" It sounds like it should work, though... I've used the same technique to implement "hyperlinks" in a textual help system, back before HTML.
Anyway good luck with this, and your overall project. Cheers. Keith.
I realize that this answer might be considered non-responsive and possibly annoying, but I would say that if you have an XML file which is 8GB, then at least some of what you are trying to do in XML should be done by the file system or database.
If you have huge chunks of text in that file, you could store them as individual files and store the metadata and the filenames separately. If you don't, you must have many levels of structured data, probably with a lot of repetition of the structures. If you can decide what is considered an individual 'record' which can be stored as a smaller XML file or in a column of a database, then you can structure your database based on the levels of nesting above that. XML is great for small and dirty, it's also good for quite unstructured data since it is self-structuring. But if you have 8GB of data which you are going to do something meaningful with, you must (usually) be able to count on some predictable structure somewhere in it.
Storing XML (or JSON) in a database, and querying and searching both for XML records, and within the XML is well supported nowadays both by SQL stuff and by the NoSQL paradigm.
Of course, you might not have a choice about using XML files this big, or you might have a situation where they really are the best solution. But for some people reading this, it could be helpful to look at this alternative.