I am evolving a log file system I have had in place for a few builds of a service I develop. Previously I had been opening the file, appending data, and, prior to writing, checking whether the log file had grown past a predetermined size; if so, I started a new log.
So say the size limit was 100 MB: at that size I delete the file and start a new one, but I lose the history. Functional, but not the best model.
What I want is a FIFO model that chops entries off the top and adds to the end, keeping the file consistently no larger than 100 MB and keeping history at least as far back as that represents.
The data is high speed, in a failure-prone industrial environment, so keeping it all in memory and writing the whole file out at an interval has proven unreliable. (An SSD is fast enough to do this reasonably most of the time; spinning disks fail too often to tolerate.)
Likewise, the records are of greatly variable length (formatted as XML nodes, so parsing them back out accounts for this easily).
So the only workable model I have come up with thus far is to keep smaller slices (say 10 MB chunks), create new ones as needed, and delete the oldest slice once the count reaches 10.
What I would prefer to do is be able to keep the file on disk and work with the tag ends.
I'm open to suggestions on how this might best be achieved in a reasonable manner, or is there no reasonable manner and the layered multi-log approach is the best option?
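A rough sketch of the slice/chunk model described above (the names, sizes and timestamped file naming are illustrative assumptions, not the actual implementation):

```csharp
// Sketch only: write to numbered 10 MB slices and drop the oldest once there are
// 10 of them, bounding the total on-disk size near 100 MB.
using System;
using System.IO;
using System.Linq;
using System.Text;

class SlicedLog
{
    const long SliceSize = 10L * 1024 * 1024;   // 10 MB per slice
    const int  MaxSlices = 10;                  // ~100 MB on disk overall
    readonly string _dir;

    public SlicedLog(string dir) { _dir = dir; Directory.CreateDirectory(dir); }

    public void Append(string xmlRecord)
    {
        var slices = Directory.GetFiles(_dir, "log_*.xml").OrderBy(f => f).ToList();
        string current = slices.LastOrDefault();

        // Start a new slice if there is none yet or the current one is full.
        if (current == null || new FileInfo(current).Length >= SliceSize)
        {
            current = Path.Combine(_dir, $"log_{DateTime.UtcNow:yyyyMMddHHmmssfff}.xml");
            slices.Add(current);
        }

        File.AppendAllText(current, xmlRecord + Environment.NewLine, Encoding.UTF8);

        // FIFO: once the slice count is exceeded, delete the oldest slice(s).
        while (slices.Count > MaxSlices)
        {
            File.Delete(slices[0]);
            slices.RemoveAt(0);
        }
    }
}
```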
The biggest issue with expiring old log entries in a single file is that you have to rewrite the file's content in order to expire older entries. This isn't too bad for small files (up to a few MB in size), but once you get to the point where rewriting takes a significant period of time it becomes problematic.
One of the more common ways to retire logs is to rename the existing log file and/or start a new file. Lots of programs do it that way, with either dated log file names or by using a sequential numbering system - logfile, logfile.1, logfile.2, etc. with higher-numbered files being older. You can add compression to the process to further reduce the storage requirements for expired files, etc.
Another option is to use a more database-like format, or an out-and-out database like SQLite to store your log entries. The primary downside of this of course is that your log files become more difficult to read, since they're not just in plain text form. It's simple enough to write a dump-to-text program whose output can be piped to a log parser... but even this will probably require a change in the way your consumers are interfacing with the log file.
The problem as stated is unlikely to be realistically solvable, I suspect. On the one hand you have the limitations of file manipulation, and on the other the fact that your log consumers are many and varied and therefore changes to the logging structure will be an involved process.
About all I can suggest is that you trial a log aging process similar to this (a rough sketch in code follows the list):
Rename current log file
Walk renamed file and copy desired contents to new log file
Discard or archive renamed log
Beware duplication or data loss.
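A minimal sketch of those aging steps, assuming each entry can be tested against some keep/discard rule; the timestamp-based ShouldKeep below is only a placeholder:

```csharp
// Sketch only: rename, walk the renamed file copying the entries we still want,
// then discard the renamed file. Writers holding the old handle are the source of
// the duplication/data-loss caveat mentioned above.
using System;
using System.IO;

static class LogAger
{
    public static void Age(string logPath, TimeSpan keep)
    {
        string renamed = logPath + ".aging";

        // 1. Rename the current log so new writes go to a fresh file.
        File.Move(logPath, renamed);

        // 2. Walk the renamed file and copy the desired contents to the new log.
        using (var reader = new StreamReader(renamed))
        using (var writer = new StreamWriter(logPath, append: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                if (ShouldKeep(line, keep))
                    writer.WriteLine(line);
        }

        // 3. Discard (or archive) the renamed log.
        File.Delete(renamed);
    }

    // Placeholder rule: assumes each entry starts with a parseable timestamp.
    static bool ShouldKeep(string line, TimeSpan keep) =>
        DateTime.TryParse(line.Split(' ')[0], out var ts) && DateTime.Now - ts < keep;
}
```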
I don't know why you need this "chop off the top and add to the end while keeping it consistently no larger than 100 MB" feature.
The general design approach is archiving: simply rename the oversized file, or move it somewhere else, then reuse the same filename for the new file.
It's as simple as that.
I have been stuck on the issue mentioned below for more than a day now and am thoroughly confused about how I can go about it.
My boss wants the logs generated by Log4Net in the database (currently they are being generated as flat files). But he does not want Log4Net to log directly to the database (which is easy to do :)), as that would be overhead leading to increased latency.
His requirement is that, once the log files are generated, these files should be bulk inserted/copied/imported into the database.
Does anyone have any tips or suggestions I could use?
Note: the lines in the log file are not consistent; most of the time a line starts with a date, but if there was an exception the line starts with something like System.IO.Exception.
Answering this based on the information available in your question:
You could just have workers/background processes running (periodically, or whenever a file updates, depending on how much log loss you can tolerate) which consume the file(s), parse them (you will have to do this regardless, since the files are custom to you), and store them in the database however you want. Since these are read-only operations you will not face any concurrency issues while reading them, although you will face that issue when writing to your db.
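As a rough sketch of that worker, assuming a hypothetical Logs table with (LogTime, Line) columns and the usual date-prefixed flat-file layout, where exception lines without a leading date are appended to the previous entry:

```csharp
// Sketch only: parse the flat file into a DataTable and bulk insert it, keeping
// the per-row database latency out of the logging path itself.
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class LogBulkLoader
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("LogTime", typeof(DateTime));
        table.Columns.Add("Line", typeof(string));

        foreach (var line in File.ReadLines(@"C:\logs\app.log"))
        {
            // Lines normally start with a date; exception/stack-trace lines do not,
            // so treat those as continuations of the previous entry.
            var firstToken = line.Split(' ')[0];
            if (DateTime.TryParse(firstToken, out var ts))
                table.Rows.Add(ts, line);
            else if (table.Rows.Count > 0)
                table.Rows[table.Rows.Count - 1]["Line"] += Environment.NewLine + line;
        }

        using (var conn = new SqlConnection("Server=.;Database=LogDb;Integrated Security=true"))
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "Logs" })
        {
            conn.Open();
            bulk.WriteToServer(table);   // one bulk insert instead of per-row writes
        }
    }
}
```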
I have multiple Windows programs (running on Windows 2000, XP and 7) which handle text files of different formats (csv, tsv, ini and xml). It is very important not to corrupt the content of these files during file IO. Every file should be safely accessible by multiple programs concurrently, and should be resistant to system crashes. This SO answer suggests using an in-process database, so I'm considering using the Microsoft Jet Database Engine, which is able to handle delimited text files (csv, tsv) and supports transactions. I used Jet before, but I don't know whether Jet transactions really tolerate unexpected crashes or shutdowns in the commit phase, and I don't know what to do with non-delimited text files (ini, xml). I don't think it's a good idea to try to implement fully ACIDic file IO by hand.
What is the best way to implement transactional handling of text files on Windows? I have to be able to do this in both Delphi and C#.
Thank you for your help in advance.
EDIT
Let's see an example based on #SirRufo's idea. Forget about concurrency for a second, and let's concentrate on crash tolerance.
I read the contents of a file into a data structure in order to modify some fields. When I'm in the process of writing the modified data back into the file, the system can crash.
File corruption can be avoided if I never write the data back into the original file. This can be easily achieved by creating a new file, with a timestamp in the filename every time a modification is saved. But this is not enough: the original file will stay intact, but the newly written one may be corrupt.
I can solve this by putting a "0" character after the timestamp, which would mean that the file hasn't been validated. I would end the writing process by a validation step: I would read the new file, compare its contents to the in-memory structure I'm trying to save, and if they are the same, then change the flag to "1". Each time the program has to read the file, it chooses the newest version by comparing the timestamps in the filename. Only the latest version must be kept, older versions can be deleted.
Concurrency could be handled by waiting on a named mutex before reading or writing the file. When a program gains access to the file, it must start with checking the list of filenames. If it wants to read the file, it will read the newest version. On the other hand, writing can be started only if there is no version newer than the one read last time.
This is a rough, oversimplified, and inefficient approach, but it shows what I'm thinking about. Writing files is unsafe, but maybe there are simple tricks like the one above which can help to avoid file corruption.
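To make the idea concrete, here is a rough, illustrative sketch of the timestamp-plus-validation-flag scheme with a named mutex for concurrency; the file naming and mutex name are assumptions, and this is a sketch rather than a tested design:

```csharp
// Sketch only: never touch the original file; write a new flagged candidate,
// validate it, then "promote" it by renaming the flag from 0 to 1.
using System;
using System.IO;
using System.Linq;
using System.Threading;

static class SafeTextFile
{
    static readonly Mutex Gate = new Mutex(false, @"Global\MyApp_settings_file");

    public static void Save(string directory, string baseName, string contents)
    {
        Gate.WaitOne();
        try
        {
            // 1. Write a brand-new candidate file, flagged "0" (unvalidated).
            string stamp = DateTime.UtcNow.ToString("yyyyMMddHHmmssfff");
            string candidate = Path.Combine(directory, $"{baseName}.{stamp}.0");
            File.WriteAllText(candidate, contents);

            // 2. Validate: read it back and compare with what we meant to write.
            if (File.ReadAllText(candidate) != contents)
                throw new IOException("Validation failed; keeping the previous version.");

            // 3. Flip the flag to "1" by renaming; readers only ever trust "*.1" files.
            File.Move(candidate, Path.Combine(directory, $"{baseName}.{stamp}.1"));
        }
        finally { Gate.ReleaseMutex(); }
    }

    public static string LoadNewest(string directory, string baseName)
    {
        Gate.WaitOne();
        try
        {
            var newest = Directory.GetFiles(directory, baseName + ".*.1")
                                  .OrderByDescending(f => f)   // the timestamp sorts lexically
                                  .FirstOrDefault();
            return newest == null ? null : File.ReadAllText(newest);
        }
        finally { Gate.ReleaseMutex(); }
    }
}
```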
UPDATE
Open-source solutions, written in Java:
Atomic File Transactions: article-1, article-2, source code
Java Atomic File Transaction (JAFT): project home
XADisk: tutorial, source code
AtomicFile: description, source code
How about using NTFS file streams? Write multiple named (numbered/timestamped) streams to the same file name. Every version could be stored in a different stream, but is actually stored in the same "file" or bunch of files, preserving the data and providing a roll-back mechanism...
When you reach a point of certainty, delete some of the previous streams.
Introduced in NT 4? It covers all your versions. It should be crash-proof: you will always have the previous version/stream plus the original to recover from or roll back to.
Just a late night thought.
http://msdn.microsoft.com/en-gb/library/windows/desktop/aa364404%28v=vs.85%29.aspx
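A minimal sketch of the idea. Note this assumes .NET Core / .NET 5+, where the file:stream path syntax passes path validation; on the classic .NET Framework you would need P/Invoke CreateFile to open an alternate data stream, and enumerating existing streams requires P/Invoke (FindFirstStreamW) in any case, which is omitted here:

```csharp
// Sketch only: each saved version goes into its own timestamped named stream
// attached to the same base file.
using System;
using System.IO;

class AdsVersions
{
    static void Main()
    {
        const string file = @"C:\data\settings.ini";

        // Write a new version into a timestamped named stream on the same file.
        string stream = file + ":v" + DateTime.UtcNow.ToString("yyyyMMddHHmmss");
        File.WriteAllText(stream, "key=value");

        // Roll back / inspect: read a specific named stream later.
        Console.WriteLine(File.ReadAllText(stream));
    }
}
```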
What you are asking for is transactionality, which is not possible without developing the mechanisms of an RDBMS yourself, given your requirement:
"It is very important not to corrupt the content of these files during file IO"
Pick a DBMS.
See a related post Accessing a single file with multiple threads
However, my opinion is to use a database like RavenDB for this kind of transaction; RavenDB supports concurrent access to the same file as well as batching multiple operations into a single request. Everything is persisted as JSON documents, though, not text files. It supports .NET/C# very well, including JavaScript and HTML, but not Delphi.
First of all, this question has nothing to do with C# or Delphi. You have to treat your file structure as if it were a database.
Assumptions:
Moving files is a cheap operation, and the operating system guarantees that files are not corrupted during a move.
You have a single directory of files that need to be processed (d:\filesDB\*.*).
A Controller application is a must.
Simplified Worker Process:
- Initialization
Gets a processID from the Operating system.
Creates directories in d:\filesDB
d:\filesDB\<processID>
d:\filesDB\<processID>\inBox
d:\filesDB\<processID>\outBox
- Process, for each file
Select file to process.
Move it to the "inBox" Directory (ensures single access to file)
Open file
Create new file in "outBox" and close it properly
Delete file in "inBox" Directory.
Move newly created file located in "OutBox" back to d:\filesDB
- Finalization
Remove the created directories.
Controller Application
Runs only on startup of the system, and initializes applications that will do the work.
1. Scan the d:\filesDB directory for subdirectories.
2. For each subdirectory:
2.1 If a file exists in "inBox", move it back to d:\filesDB and skip "outBox".
2.2 If a file exists in "outBox", move it to d:\filesDB.
2.3 Delete the whole subdirectory.
3. Start each worker process that needs to be started.
I hope that this will solve your problem.
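A rough sketch of the worker's inBox/outBox steps described above; the paths, the claim-by-move trick, and the Transform placeholder are illustrative assumptions, not the answerer's actual code:

```csharp
// Sketch only: claim a file by moving it into the per-process inBox, produce the
// result in outBox, then publish it back to the shared directory.
using System.Diagnostics;
using System.IO;

class Worker
{
    const string Root = @"d:\filesDB";

    static void Main()
    {
        // Initialization: per-process inBox/outBox directories.
        string pid = Process.GetCurrentProcess().Id.ToString();
        string inBox = Path.Combine(Root, pid, "inBox");
        string outBox = Path.Combine(Root, pid, "outBox");
        Directory.CreateDirectory(inBox);
        Directory.CreateDirectory(outBox);

        foreach (string file in Directory.GetFiles(Root))
        {
            string name = Path.GetFileName(file);
            string claimed = Path.Combine(inBox, name);
            string produced = Path.Combine(outBox, name);

            try { File.Move(file, claimed); }       // the move ensures single access to the file
            catch (IOException) { continue; }       // another worker claimed it first

            File.WriteAllText(produced, Transform(File.ReadAllText(claimed)));
            File.Delete(claimed);                   // done with the input
            File.Move(produced, Path.Combine(Root, name)); // publish the result back to d:\filesDB
        }

        // Finalization: remove the created directories.
        Directory.Delete(Path.Combine(Root, pid), recursive: true);
    }

    static string Transform(string contents) => contents;   // placeholder for the real processing
}
```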
You are creating a nightmare for yourself trying to handle these transactions and states in your own code across multiple systems. This is why Larry Ellison (Oracle CEO) is a billionaire and most of us are not. If you absolutely must use files, then set up an Oracle or other database that supports LOB and CLOB objects. I store very large SVG files in such a table for my company so that we can add and render large maps in our systems without any code changes. The files can be pulled from the table, passed to your users in a buffer, and then returned to the database when they are done. Set up the appropriate security and record locking and your problem is solved.
OK, you are dead unless you can drop XP. It's as simple as that.
Post-XP versions of Windows support Transactional NTFS, though it is not exposed to .NET natively (you can still use it). This allows you to roll back or commit changes on an NTFS file system, even in coordination with a database via the DTC. Pretty nice. On XP, though: no way, it's not there.
Start with Any real-world, enterprise-grade experience with Transactional NTFS (TxF)? as a starter. The question there lists a lot of resources to get you started on how to do it.
Note that this DOES have a performance overhead, obviously. It is not that bad, though, unless you need a SECOND transactional resource: there is a very thin kernel-level transaction coordinator, and transactions only get promoted to full DTC when a second resource is added.
For a direct link - http://msdn.microsoft.com/en-us/magazine/cc163388.aspx has some nice information.
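Since TxF is not exposed to .NET natively, using it means P/Invoke. The following is a rough, untested sketch: the Win32 functions named are real, but the constants and error handling are pared down for illustration:

```csharp
// Sketch only: a transacted file write via TxF P/Invoke; nothing becomes visible
// to other readers unless the commit succeeds.
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using Microsoft.Win32.SafeHandles;

static class TxfSketch
{
    [DllImport("KtmW32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr CreateTransaction(IntPtr attrs, IntPtr uow, uint options,
        uint isoLevel, uint isoFlags, uint timeout, string description);

    [DllImport("KtmW32.dll", SetLastError = true)]
    static extern bool CommitTransaction(IntPtr transaction);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFileTransacted(string name, uint access, uint share,
        IntPtr security, uint disposition, uint flags, IntPtr template,
        IntPtr transaction, IntPtr miniVersion, IntPtr extended);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    const uint GENERIC_WRITE = 0x40000000, CREATE_ALWAYS = 2, FILE_ATTRIBUTE_NORMAL = 0x80;

    static void Main()
    {
        IntPtr tx = CreateTransaction(IntPtr.Zero, IntPtr.Zero, 0, 0, 0, 0, "demo write");
        if (tx == new IntPtr(-1)) throw new Win32Exception();

        try
        {
            using (var handle = CreateFileTransacted(@"C:\data\settings.ini", GENERIC_WRITE, 0,
                       IntPtr.Zero, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, IntPtr.Zero,
                       tx, IntPtr.Zero, IntPtr.Zero))
            {
                if (handle.IsInvalid) throw new Win32Exception();
                using (var stream = new FileStream(handle, FileAccess.Write))
                {
                    byte[] bytes = Encoding.UTF8.GetBytes("key=value");
                    stream.Write(bytes, 0, bytes.Length);
                }
            }
            // Other readers still see the old file until (and unless) this succeeds.
            if (!CommitTransaction(tx)) throw new Win32Exception();
        }
        finally { CloseHandle(tx); }
    }
}
```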
I need to uniquely identify a file on Windows so I can always have a reference to that file even if it's moved or renamed. I did some research and found the question Unique file identifier in windows, with a way that uses the method GetFileInformationByHandle with C++, but apparently that only works for NTFS partitions, not FAT ones.
I need to program behavior like Dropbox's: if you close it on your computer, rename a file and open it again, it detects that change and syncs correctly. I wonder what the technique is, and perhaps how Dropbox does it, if you know.
FileSystemWatcher, for example, would work, but if the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,
The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
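A minimal sketch of that fallback fingerprint (file length plus SHA-256), with the caveat already noted that it has to read the whole file:

```csharp
// Sketch only: two paths with the same length and hash are very likely the same
// (possibly moved or renamed) file.
using System;
using System.IO;
using System.Security.Cryptography;

static class FileFingerprint
{
    public static (long Length, string Sha256) Compute(string path)
    {
        using (var stream = File.OpenRead(path))
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream);   // reads the entire file, hence the caveat above
            return (stream.Length, BitConverter.ToString(hash).Replace("-", ""));
        }
    }
}

// Usage (illustrative): compare fingerprints taken before and after a suspected rename.
// var before = FileFingerprint.Compute(@"C:\data\report.docx");
```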
In Dropbox's case, though, I think there is a service or process running in the background observing file system changes. It's the most reliable way, even if it ceases to work when you stop said service/process.
What the user was looking for is most likely Windows Change Journals. Those track changes such as file renames persistently, so there is no need to keep a watcher observing file system events running all the time. Instead, one simply records when the log was last read and continues from that point the next time. At some point a file with an already known ID will have an event of type RENAME, and whoever is interested in that event can apply the same rename to its own version of that file. The important thing, of course, is to keep track of the IDs used for the files.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals
I am in the design phase of a simple tool I want to write in which I need to read large log files. To give you some context, I will first explain something about it.
The log files I need to read consist of log entries which always follow this 3-line format:
statistics : <some data which is more or less the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600 MB in size, which means a lot of log entries. These log entries can have relationships with each other, and for this I need to read the file from the end to the beginning. The relationships can be deduced from the statistics line.
I want to use the info in the statistics line to build up a datagrid which the users can use to search through the data and do some filtering operations. I don't want to load the request/response lines into memory until the user actually needs them. In addition, I want to keep the memory load small by limiting the maximum number of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when parsing the file for the first time, creating an index of statistics. Then, when the user clicks on some statistic (an element of a log entry), I read the request/response from the file using this offset. I can then hold it in a memory pool which ensures that there are not too many loaded request/response entries (see the earlier requirement).
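A rough sketch of that offset index (class and member names are placeholders; it assumes plain ASCII/UTF-8 text without a BOM and a newline matching Environment.NewLine, so the byte arithmetic is illustrative rather than exact):

```csharp
// Sketch only: index the byte offsets of the "statistics" lines during the first
// pass, then seek and read a single entry lazily when the user asks for it.
using System;
using System.Collections.Generic;
using System.IO;

class LogIndex
{
    readonly string _path;
    // statistics text -> byte offset where that log entry starts
    readonly List<(string Statistics, long Offset)> _entries = new List<(string, long)>();

    public LogIndex(string path)
    {
        _path = path;
        using (var stream = File.OpenRead(path))
        using (var reader = new StreamReader(stream))
        {
            long offset = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith("statistics :"))
                    _entries.Add((line, offset));
                // Track the byte position ourselves; StreamReader buffers, so its
                // BaseStream.Position would run ahead of the line just read.
                offset += reader.CurrentEncoding.GetByteCount(line) + Environment.NewLine.Length;
            }
        }
    }

    // Load one entry's statistics/request/response only when needed.
    public string[] ReadEntry(int index)
    {
        using (var stream = File.OpenRead(_path))
        {
            stream.Seek(_entries[index].Offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(stream))
                return new[] { reader.ReadLine(), reader.ReadLine(), reader.ReadLine() };
        }
    }
}
```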
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot; it could be a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario where you should use a memory-mapped file because there could be a lot of read operations, or is it better to use a plain FileStream? By the way, I don't need write operations on the log file at this stage, but I might in the future!
If you have other tips or see flaws in my thinking so far, please let me know as well. I am open to any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries. At this moment the format of this export is unknown (binary, file db, text file). The export can be imported by the application itself, which then only shows the selection made by the user.
You're talking about stored data that has defined relationships between actual entries... Maybe it's just me, but this scenario calls for some kind of relational database. I'd suggest considering a portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use a db instead, you can query exactly the data you need, without ever needing to handle large files like this.
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zero-copy, depending on your web server). If you have many small "request" chunks and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go with a db, or all in memory.
Why are you so concerned about saving memory, when 600 MB is not that much? Are you going to be running on machines with less than 2 GB of memory?
Load it into a dictionary with the statistics line as the key and, as the value, a class with two properties: request and response. Dictionary is fast. LINQ is powerful and fast.
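A tiny sketch of that all-in-memory shape (type and property names are assumed):

```csharp
// Sketch only: statistics line as the key, request/response as the value.
using System.Collections.Generic;
using System.Linq;

class LogEntry
{
    public string Request  { get; set; }
    public string Response { get; set; }
}

class Example
{
    static void Main()
    {
        var entries = new Dictionary<string, LogEntry>();
        entries["statistics : sample"] = new LogEntry { Request = "<xml/>", Response = "<xml/>" };

        // LINQ filtering for the grid, e.g. all entries whose statistics mention "error".
        var filtered = entries.Where(kv => kv.Key.Contains("error")).ToList();
    }
}
```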
I have a program (a copy is deployed to each user's computer) for users to store files on a centralized file server with compression (CAB files).
When adding a file, the user needs to extract the archive onto his own disk, add the file, and compress it back onto the server. So if two users process the same compressed file at the same time, the later upload will replace the earlier one and cause data loss.
My strategy to prevent this is: before the user extracts the compressed file, the program checks whether a specified temp file exists on the server. If not, the program creates that temp file to prevent other users from interfering, and deletes the temp file after uploading; if it exists, the program waits until the temp file is deleted.
Is there a better way of doing this? And will frequently creating and deleting empty files damage the disk?
And will frequently creating and deleting empty files damage the disk?
No. If you're using a solid-state disk, there's a theoretical limit on the number of writes that can be performed (which is an inherent limitation of flash). However, you're incredibly unlikely to ever reach that limit.
Is there a better way of doing this?
Well, I would go about this differently:
Write a Windows Service that handles all disk access, and have your client apps talk to the service. So, when a client needs to retrieve a file, it would open a socket connection to your service and request the file and either keep it in memory or save it to their local disk. Perform any modifications on the client's local copy of the file (decompress, add/remove/update files, recompress, etc), and, when the operation is complete and you're ready to save (or commit in source-control lingo) your changes, open another socket connection to your service app (running on the server), and send it the new file contents as a binary stream.
The service app would then handle loading and saving the files to disk. This gives you a lot of additional capabilities, as well - the server can keep track of past versions (perhaps even committing each version to svn or another source control system), provide metadata such as what the latest version is, etc.
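A very small sketch of the "service owns the disk" idea; the one-line-name-then-raw-bytes protocol, the port, and the paths are illustrative assumptions, not a real design:

```csharp
// Sketch only: the client sends one line with the file name, then the raw bytes.
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;

class FileServiceSketch
{
    const string Store = @"C:\store";

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Any, 9000);
        listener.Start();
        while (true)
        {
            using (var client = listener.AcceptTcpClient())
            using (var network = client.GetStream())
            {
                string name = ReadLine(network);                  // e.g. "archive1.cab"
                string temp = Path.Combine(Store, name + ".tmp");
                using (var file = File.Create(temp))
                    network.CopyTo(file);                         // receive the new contents

                // Only the service touches the store, so it can serialize commits,
                // keep past versions, etc., before swapping the new file in.
                File.Copy(temp, Path.Combine(Store, name), overwrite: true);
                File.Delete(temp);
            }
        }
    }

    // Read the header byte by byte so none of the binary payload gets buffered away.
    static string ReadLine(Stream stream)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
            if (b != '\r') sb.Append((char)b);
        return sb.ToString();
    }
}
```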
Now that I'm thinking about it, you may be better off just integrating an svn interface into your app. SharpSVN is a good library for this.
Creating temporary files to flag the lock is a viable and widely used option (and no, this won't damage the disk). Another option is to open the compressed file exclusively (or let other processes only read the file, not write it) and keep the file open while the user works with its contents.
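A minimal sketch of that exclusive-open approach (the path is an example): the handle is held for the whole edit, and any other client that tries to open the file for writing gets an IOException.

```csharp
// Sketch only: hold the CAB open for the duration of the edit, letting others
// read it but not write it (FileShare.Read).
using System.IO;

class ExclusiveEdit
{
    static void Main()
    {
        using (var cab = new FileStream(@"\\server\share\archive.cab",
                                        FileMode.Open, FileAccess.ReadWrite, FileShare.Read))
        {
            // Extract, modify and recompress while this handle stays open;
            // the exclusive write access is released when the using block ends.
        }
    }
}
```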
Is there better way of doing this?
Yes. From what you've written here, it sounds like you are well on your way towards re-inventing revision control.
Perhaps you could use some off-the-shelf version control system?
Or perhaps at least re-use some code from such systems?
Or perhaps you could at least learn a little about the problems those systems faced, how fixing the obvious problems led to non-obvious problems, and attempt to make a system that works at least as well?
My understanding is that version control systems went through several stages (see
"Edit Conflict Resolution" on the original wiki, the Portland Pattern Repository).
In roughly chronological order:
The master version is stored on the server. Last-to-save wins, leading to mysterious data loss with no warning.
The master version is stored on the server. When I pull a copy to my machine, the system creates a lock file on the server. When I push my changes to the server (or cancel), the system deletes that lock file. No one can change those files on the server, so we've fixed the "mysterious data loss" problem, but we have endless frustration when I need to edit some file that someone else checked out just before leaving on a long vacation.
The master version is stored on the server. First-to-save wins ("optimistic locking"). When I pull the latest version from the server, it includes some kind of version-number. When I later push my edits to the server, if the version-number I pulled doesn't match the current version on the server, someone else has cut in first and changed things ahead of me, and the system gives some sort of polite message telling me about it. Ideally I pull the latest version from the server and carefully merge it with my version, and then push the merged version to the server, and everything is wonderful. Alas, all too often, an impatient person pulls the latest version, overwrites it with "his" version, and pushes "his" version, leading to data loss.
Every version is stored on the server, in an unbroken chain. (Centralized version control like TortoiseSVN is like this).
Every version is stored in every local working directory; sometimes the chain forks into 2 chains; sometimes two chains merge back into one chain. (Distributed version control tools like TortoiseHg are like this).
So it sounds like you're doing what everyone else did when they moved from stage 1 to stage 2. I suppose you could slowly work your way through every stage.
Or maybe you could jump to stage 4 or 5 and save everyone time?
Take a look at the FileStream.Lock method. Quoting from MSDN:
Prevents other processes from reading from or writing to the FileStream.
...
Locking a range of a file stream gives the threads of the locking process exclusive access to that range of the file stream.
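A small sketch of using FileStream.Lock/Unlock on a byte range (the path is an example; note that Unlock must be given the same range that was locked):

```csharp
// Sketch only: lock the file's current byte range, write, then always unlock.
using System.IO;
using System.Text;

class RangeLock
{
    static void Main()
    {
        using (var stream = new FileStream(@"\\server\share\archive.cab",
                                           FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
        {
            long length = stream.Length;
            stream.Lock(0, length);          // exclusive access to this range for our process
            try
            {
                byte[] bytes = Encoding.UTF8.GetBytes("updated contents");
                stream.Write(bytes, 0, bytes.Length);
            }
            finally
            {
                stream.Unlock(0, length);    // always release the locked range
            }
        }
    }
}
```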