So, I have been stuck on the issue below for more than a day now and am completely confused about how to go about it.
My boss wants the logs generated by Log4Net in the database (currently they are written to flat files). But he does not want Log4Net to log directly to the database (which is easy to do :)), as that would add overhead and increase latency.
His requirement is, once the log files are generated, these files should be bulk inserted/copied/imported into the database.
Does anyone have any tips or suggestions I could use?
Note: The lines in the log file are not consistent. Most of the time a line starts with a date, but if there was an exception the line starts with System.IO.Exception.
Answering this based on the information available in your question.
You could just have workers/background processes running (periodically, or whenever the file is updated, depending on how much log loss you can tolerate) that consume the file(s), parse them (you will have to do this regardless, since the format is custom to you) and store them in the db however you want. Since reading the files is a read-only operation, you will not face any concurrency issues there, although you may face them when writing to your db.
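As a rough sketch of the parse-then-bulk-insert step, the following groups lines into entries and inserts them in one transaction. The date pattern, the table name `log_entries`, and the use of SQLite are all assumptions made for illustration; the real Log4Net layout and target database will differ.

```python
import re
import sqlite3

# Assumed entry prefix: an ISO-like date such as "2024-01-31 12:00:00,123".
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def parse_entries(lines):
    """Group raw lines into entries. A line that does not start with a date
    (for example an exception line) is appended to the previous entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries

def bulk_insert(conn, entries):
    """Insert all entries in a single transaction."""
    conn.execute("CREATE TABLE IF NOT EXISTS log_entries (message TEXT)")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO log_entries (message) VALUES (?)",
            [(entry,) for entry in entries],
        )

# Hypothetical sample lines standing in for a real log file.
sample = [
    "2024-01-31 12:00:00,123 INFO Started\n",
    "2024-01-31 12:00:01,456 ERROR Failed to read file\n",
    "System.IO.Exception: disk not ready\n",
    "2024-01-31 12:00:02,789 INFO Stopped\n",
]
conn = sqlite3.connect(":memory:")
entries = parse_entries(sample)
bulk_insert(conn, entries)
row_count = conn.execute("SELECT COUNT(*) FROM log_entries").fetchone()[0]
```

Batching the inserts into one transaction is what keeps the database cost low compared with logging each line directly.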
Related
I'm currently working on a C# project of an application we'd like to develop. We're brainstorming over the question of sharing the data between users. We'd like to be able to specify a folder where all the files of the application are going to be saved and we'd like to be able to save them on a shared folder (server, different PC or Mac, Nas, etc.).
The deployment would be like so :
Installation on the first PC, we choose a network drive, share, whatever and create all the files for the application in this location.
On the second PC we install the application and we choose the same location (on the network), the application doesn't create anything, it sees that it's already existing and it uses these files as the application's data
Same thing on the other clients
The application's files are going to be documents (most likely XML formatted documents) and when opening the application we want to show all the existing documents. The thing is, we don't only want to have the list of documents and be able to edit their content, we also would like to be able to edit the document's property, so in a way we'd like a file (Sqlite, XML, whatever) representing the list of all the documents and their attributes. Same thing for a list of addresses.
I know all that looks exactly like a client / server with database solution, but this solution is out of the question. I was first looking at SQLite for my data files, but I know concurrency can be a real problem and file lock doesn't work well. The thing is, I would have the same problem with simple XML files (refreshing the content when several users are working, accessing locked files).
So I guess my final question is: Is it feasible? Is there an alternative I didn't see which would allow us to do that more easily?
EDIT :
OK, I'm not responding to every post or comment, because I'm currently testing concurrency with SQLite. What I did, and please correct me if the way I test this is wrong, is launch X BackgroundWorkers which all insert records into a sample database (which is recreated every time I start the application). I tried launching 100 iterations of INSERT in the database via these BackgroundWorkers.
Of course concurrency works with one application running: each insert simply waits for the previous BackgroundWorker to do its job and then writes the next record. I also tried inserting at (almost) the same time, meaning I put a loop in every BackgroundWorker waiting for a modulo 5 timestamp (every 5 seconds, every BackgroundWorker runs). Again, each insert waits for the previous query to end before doing the next and everything works fine. I even tried it with 500 BackgroundWorkers and it worked fine.
I then tried launching my app several times and running the instances simultaneously. When doing this I did have some issues. With two instances of my app it was still working fine, but with 4-5 instances it got really buggy and I got two types of error: 1. database is locked 2. disk I/O failure. But mostly locked databases.
What I did was pretty intensive. In the real scenario of my application, it will never come to 5 processes trying to insert 500 rows at the same time (maybe I'll get a concurrency of two or three connections). But what really bugged me, and what makes me think my testing method is not really a good one, is that I got these errors when working on a database on a shared network drive, on a NAS AND on my own HDD. Every time, it worked for maybe 30-40 queries and then threw a "database is locked" error.
Am I testing it wrong? Maybe I shouldn't be trying so hard to make this work, but I'm still not convinced that SQLite is not a good alternative to what I'm trying to do, since the concurrency is going to be really small.
With your optimistic/pessimistic locking, you are ultimately trying to build a database. Also, you WILL have issues with consistency while trying to keep multiple files in sync with each other. Think about if you update the "metadata" file, and the write fails half-way through because of a network blip. File corruption will ensue, and you will be left trying to reconstruct things from backups.
I would suggest a couple of likely solutions:
1) Host the content yourselves, and let them be pure clients (cloud based deployments are ideal for this). Most network/firewall issues can be circumvented by using HTTP as your transport (web services).
2) Have one of the workstations be the "server", which keeps its data files on the NFS. This will give you transactional integrity, incremental backups, etc. There are lots of good embedded database management systems to help you manage this complexity. MS SQL Server even has some great options for this.
You're right, SQLite uses file locks on the database file, so storing all the data files in the database would cause a write-starvation problem when editing your documents.
Maybe it's a better choice to implement simple optimistic/pessimistic locking yourself at the level of the individual file? For example, with a pessimistic lock you just don't allow anyone to edit a particular file if somebody is already in the process of editing it. In this case you hold a lock on just one file, not on the entire database. If the probability of conflict (editing a particular file at the same time) is pretty low, it is better to go with optimistic locking.
Simple optimistic locking implementation:
When a user gets a file for reading, it's OK, no problem here. If a user gets a file for editing, you could calculate a hash of the file (or take its last-modified timestamp), and then, when the user tries to save the edited file, compare the current hash/timestamp (at the moment of saving) with the stored one to make sure the file has not been changed by somebody else. If the file has not been changed, it's OK to save it. If the file has been changed, the current user is out of luck and you need to inform them about it. This optimistic scenario is nice when the probability of being "out of luck" is pretty low. Otherwise it's better to stick with pessimistic locking, where you do not allow a user even to start editing a file if somebody else is already editing it.
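A minimal sketch of the hash-based optimistic check described above. The function names `open_for_edit` and `save_if_unchanged` and the `ConflictError` exception are hypothetical, chosen only for this illustration. Note that the check and the write are two separate steps, so a small race window remains between them.

```python
import hashlib
import os
import tempfile

class ConflictError(Exception):
    """Raised when the file changed after it was opened for editing."""

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def open_for_edit(path):
    """Return the file content together with the hash seen at open time."""
    with open(path, "rb") as f:
        data = f.read()
    return data, hashlib.sha256(data).hexdigest()

def save_if_unchanged(path, new_data, hash_at_open):
    """Save only if nobody else modified the file since it was opened."""
    if file_hash(path) != hash_at_open:
        raise ConflictError(f"{path} was modified by someone else")
    with open(path, "wb") as f:
        f.write(new_data)

# Two users open the same document; A saves first, so B hits a conflict.
path = os.path.join(tempfile.mkdtemp(), "doc.xml")
with open(path, "wb") as f:
    f.write(b"<doc>v1</doc>")
_, hash_a = open_for_edit(path)
_, hash_b = open_for_edit(path)
save_if_unchanged(path, b"<doc>edited by A</doc>", hash_a)
try:
    save_if_unchanged(path, b"<doc>edited by B</doc>", hash_b)
    conflict_detected = False
except ConflictError:
    conflict_detected = True
with open(path, "rb") as f:
    final_content = f.read()
```

A hash is more robust than a timestamp on network shares, where clocks and timestamp resolution can differ between machines.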
I am evolving a log file system that has been in place for a few builds of a service I develop. Previously I opened the file and appended data, and before each write I checked whether the log file had grown over a predetermined size; if so, I started a new log.
So say the log size was 100 MB: at that size I delete the file and start a new one, but I lose history. Functional, but not the best model.
What I want to do is a FIFO model that would chop off the top and add to the end while keeping it consistently no larger than 100mb, and at least as far back as that represents.
The data is high speed in a failure prone industrial environment, so keeping it all in memory and writing the whole file at interval has proven unreliable. (SSD, fast enough to do it reasonably most of the time, spinners fail too often to tolerate)
Likewise the records are of greatly variable length (formatted as XML nodes, so parsing them back out accounts for this easily)
So the only workable model I have come up with thus far is to keep smaller slices (say 10 MB chunks), create new ones, then delete the oldest 10 MB slice when count >= 10.
What I would prefer to do is be able to keep the file on disk and work with the tag ends.
Open to suggestions on how this might be best achieved in a reasonable manner, or is there no reasonable manner and the layered multi log approach will be the best option?
The biggest issue with expiring old log entries in a single file is that you have to rewrite the file's content in order to expire older entries. This isn't too bad for small files (up to a few MB in size), but once you get to the point where rewriting takes a significant period of time it becomes problematic.
One of the more common ways to retire logs is to rename the existing log file and/or start a new file. Lots of programs do it that way, with either dated log file names or by using a sequential numbering system - logfile, logfile.1, logfile.2, etc. with higher-numbered files being older. You can add compression to the process to further reduce the storage requirements for expired files, etc.
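The sequential numbering scheme can be sketched roughly as follows. The helper names `rotate` and `append`, and the tiny size limit used in the demo, are assumptions for illustration only. Python's standard library also ships `logging.handlers.RotatingFileHandler`, which does the same kind of rotation.

```python
import os
import tempfile

def rotate(path, keep):
    """Shift path -> path.1 -> path.2 ..., dropping anything beyond `keep`."""
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")

def append(path, record, max_bytes, keep):
    """Append a record, rotating first if the file would exceed max_bytes."""
    if os.path.exists(path) and os.path.getsize(path) + len(record) > max_bytes:
        rotate(path, keep)
    with open(path, "ab") as f:
        f.write(record)

# Demo: 10-byte records, 20-byte files, two archived slices kept.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")
for i in range(7):
    append(log_path, f"record-{i:02d}\n".encode(), max_bytes=20, keep=2)
files = sorted(os.listdir(log_dir))
with open(log_path + ".2", "rb") as f:
    oldest_kept = f.read()
```

Only renames and one delete happen at rotation time, so no existing log content is ever rewritten.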
Another option is to use a more database-like format, or an out-and-out database like SQLite to store your log entries. The primary downside of this of course is that your log files become more difficult to read, since they're not just in plain text form. It's simple enough to write a dump-to-text program whose output can be piped to a log parser... but even this will probably require a change in the way your consumers are interfacing with the log file.
The problem as stated is unlikely to be realistically solvable, I suspect. On the one hand you have the limitations of file manipulation, and on the other the fact that your log consumers are many and varied and therefore changes to the logging structure will be an involved process.
About all I can suggest is that you trial a log aging process similar to this:
Rename current log file
Walk renamed file and copy desired contents to new log file
Discard or archive renamed log
Beware duplication or data loss.
I don't know why you need this feature: "chop off the top and add to the end while keeping it consistently no larger than 100mb".
The general design approach is archiving: simply rename the oversized file, or move it somewhere else, then reuse the same filename for a new file.
It's as simple as that.
I have multiple Windows programs (running on Windows 2000, XP and 7), which handle text files of different formats (csv, tsv, ini and xml). It is very important not to corrupt the content of these files during file IO. Every file should be safely accessible by multiple programs concurrently, and should be resistant to system crashes. This SO answer suggests using an in-process database, so I'm considering using the Microsoft Jet Database Engine, which is able to handle delimited text files (csv, tsv) and supports transactions. I have used Jet before, but I don't know whether Jet transactions really tolerate unexpected crashes or shutdowns in the commit phase, and I don't know what to do with non-delimited text files (ini, xml). I don't think it's a good idea to try to implement fully ACIDic file IO by hand.
What is the best way to implement transactional handling of text files on Windows? I have to be able to do this in both Delphi and C#.
Thank you for your help in advance.
EDIT
Let's see an example based on @SirRufo's idea. Forget about concurrency for a second, and let's concentrate on crash tolerance.
I read the contents of a file into a data structure in order to modify some fields. When I'm in the process of writing the modified data back into the file, the system can crash.
File corruption can be avoided if I never write the data back into the original file. This can be easily achieved by creating a new file, with a timestamp in the filename every time a modification is saved. But this is not enough: the original file will stay intact, but the newly written one may be corrupt.
I can solve this by putting a "0" character after the timestamp, which would mean that the file hasn't been validated. I would end the writing process by a validation step: I would read the new file, compare its contents to the in-memory structure I'm trying to save, and if they are the same, then change the flag to "1". Each time the program has to read the file, it chooses the newest version by comparing the timestamps in the filename. Only the latest version must be kept, older versions can be deleted.
Concurrency could be handled by waiting on a named mutex before reading or writing the file. When a program gains access to the file, it must start with checking the list of filenames. If it wants to read the file, it will read the newest version. On the other hand, writing can be started only if there is no version newer than the one read last time.
This is a rough, oversimplified, and inefficient approach, but it shows what I'm thinking about. Writing files is unsafe, but maybe there are simple tricks like the one above which can help to avoid file corruption.
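One simple trick of this kind is to write the new content to a temporary file in the same directory and then rename it over the original. This is a different technique from the timestamp-and-flag scheme above, shown here only as a sketch. It assumes the rename is atomic on the underlying file system, which is a POSIX guarantee; behavior on Windows and on network shares should be verified separately. The name `atomic_write` is hypothetical.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` without ever exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demo: the second write fully replaces the first.
work_dir = tempfile.mkdtemp()
target = os.path.join(work_dir, "settings.ini")
atomic_write(target, b"key=1\n")
atomic_write(target, b"key=2\n")
with open(target, "rb") as f:
    result = f.read()
leftovers = [name for name in os.listdir(work_dir) if name.endswith(".tmp")]
```

If the process crashes before the rename, the original file is untouched and only a stray temporary file remains to be cleaned up.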
UPDATE
Open-source solutions, written in Java:
Atomic File Transactions: article-1, article-2, source code
Java Atomic File Transaction (JAFT): project home
XADisk: tutorial, source code
AtomicFile: description, source code
How about using NTFS file streams? Write multiple named(numbered/timestamped) streams to the same Filename. Every version could be stored in a different stream but is actually stored in the same "file" or bunch of files, preserving the data and providing a roll-back mechanism...
When you reach a point of certainty, delete some of the previous streams.
Introduced in NT 4? It covers all versions. Should be crash proof: you will always have the previous version/stream plus the original to recover / roll back to.
Just a late night thought.
http://msdn.microsoft.com/en-gb/library/windows/desktop/aa364404%28v=vs.85%29.aspx
What you are asking for is transactionality, which is not possible without developing the mechanisms of an RDBMS yourself, given your requirement:
"It is very important not to corrupt the content of these files during file IO"
Pick a DBMS.
See a related post Accessing a single file with multiple threads
However, my opinion is to use a database like Raven DB for this kind of transaction. Raven DB supports concurrent access to the same file as well as batching multiple operations into a single request. However, everything is persisted as JSON documents, not text files. It supports .NET/C# very well, as well as JavaScript and HTML, but not Delphi.
First of all this question has nothing to do with C# or Delphi. You have to simulate your file structure as if it is a database.
Assumptions:
Moving files is a cheap operation, and the operating system guarantees that files are not corrupted during a move.
You have a single directory of files that need to be processed (d:\filesDB\*.*).
A Controller application is a must.
Simplified Worker Process:
-initialization
Gets a processID from the Operating system.
Creates directories in d:\filesDB
d:\filesDB\<processID>
d:\filesDB\<processID>\inBox
d:\filesDB\<processID>\outBox
-process for each file
Select file to process.
Move it to the "inBox" Directory (ensures single access to file)
Open file
Create new file in "outBox" and close it properly
Delete file in "inBox" Directory.
Move newly created file located in "OutBox" back to d:\filesDB
-finalization
remove the created directories.
Controller Application
Runs only on startup of the system, and initializes the applications that will do the work.
1. Scan the d:\filesDB directory for subdirectories.
2. For each subdirectory:
2.1 If a file exists in "inBox", move it to d:\filesDB and skip "outBox".
2.2 If a file exists in "outBox", move it to d:\filesDB.
2.3 Delete the whole subdirectory.
3. Start each worker process that needs to be started.
I hope that this will solve your problem.
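The controller's start-up recovery pass could look roughly like this. It is only a sketch: the function name `recover` is hypothetical, a temporary directory stands in for d:\filesDB, and starting the worker processes is left out.

```python
import os
import shutil
import tempfile

def recover(root):
    """Return stranded files to `root` and remove the per-worker
    directories left behind by a crash."""
    for name in sorted(os.listdir(root)):
        worker_dir = os.path.join(root, name)
        if not os.path.isdir(worker_dir):
            continue
        inbox = os.path.join(worker_dir, "inBox")
        outbox = os.path.join(worker_dir, "outBox")
        stranded = os.listdir(inbox) if os.path.isdir(inbox) else []
        if stranded:
            # The original is still there: restore it and ignore outBox,
            # whose content may be incomplete.
            source = inbox
        else:
            # The original was already deleted, so the outBox file is complete.
            source = outbox
            stranded = os.listdir(outbox) if os.path.isdir(outbox) else []
        for filename in stranded:
            os.replace(os.path.join(source, filename),
                       os.path.join(root, filename))
        shutil.rmtree(worker_dir)

def put(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)

def read(path):
    with open(path) as f:
        return f.read()

# Worker 101 crashed while writing; worker 102 crashed after finishing.
root = tempfile.mkdtemp()
put(os.path.join(root, "101", "inBox", "a.txt"), "original")
put(os.path.join(root, "101", "outBox", "a.txt"), "half-written")
os.makedirs(os.path.join(root, "102", "inBox"))
put(os.path.join(root, "102", "outBox", "b.txt"), "finished")
recover(root)
recovered = {n: read(os.path.join(root, n)) for n in sorted(os.listdir(root))}
```

The ordering of the worker's steps is what makes this safe: the inBox copy is deleted only after the outBox file has been closed properly.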
You are creating a nightmare for yourself trying to handle these transactions and states in your own code across multiple systems. This is why Larry Ellison (Oracle CEO) is a billionaire and most of us are not. If you absolutely must use files, then setup an Oracle or other database that supports LOB and CLOB objects. I store very large SVG files in such a table for my company so that we can add and render large maps to our systems without any code changes. The files can be pulled from the table and passed to your users in a buffer then returned to the database when they are done. Setup the appropriate security and record locking and your problem is solved.
Ok, you are dead, unless you can drop XP. Simple as that.
Post-XP Windows supports Transactional NTFS, though it is not exposed to .NET natively (you can still use it). This allows one to roll back or commit changes on an NTFS file system, and with a DTC even in coordination with a database. Pretty nice. XP, though: no way, not there.
Start at "Any real-world, enterprise-grade experience with Transactional NTFS (TxF)?". The question there lists a lot of resources to get you started on how to do it.
Note that this DOES have a performance overhead, obviously. It is not that bad, though, unless you need a SECOND transactional resource, as there is a very thin kernel-level transaction coordinator there; transactions only get promoted to full DTC when a second resource is added.
For a direct link - http://msdn.microsoft.com/en-us/magazine/cc163388.aspx has some nice information.
I am in the design phase of a simple tool I want to write, in which I need to read large log files. To give you some context, I will first explain a bit about it.
The log files I need to read consist of log entries which always have the following 3-line format:
statistics : <some data which is more or less of the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600 MB in size, which means a lot of log entries. These log entries can be related to each other, and to work that out I need to read the file from the end to the beginning. The relationships can be deduced from the statistics line.
I want to use the info in the statistics line to build a datagrid which users can use to search through the data and do some filtering. I don't want to load the request / response lines into memory until the user actually needs them. In addition, I want to keep the memory load small by limiting the maximum number of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when I parse the file for the first time and create an index of statistics. Then, when the user clicks on a statistic belonging to a log entry, I read the request / response from the file using this offset. I can then hold it in some memory pool which makes sure that not too many request / response entries are loaded (see the earlier requirement).
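The offset index described above can be sketched as follows. The names `build_index` and `load_payload` are hypothetical, and `functools.lru_cache` stands in for the memory pool as an assumption. The indexing pass still reads each request/response line once in order to skip over it.

```python
from functools import lru_cache
import os
import tempfile

def build_index(path):
    """Return a list of (statistics_text, request_offset), one per entry."""
    index = []
    with open(path, "rb") as f:  # binary mode keeps tell()/seek() byte-exact
        while True:
            stats = f.readline()
            if not stats:
                break
            request_offset = f.tell()
            f.readline()  # skip the request line
            f.readline()  # skip the response line
            index.append((stats.decode("utf-8").rstrip("\n"), request_offset))
    return index

@lru_cache(maxsize=32)  # bounds how many request/response pairs stay loaded
def load_payload(path, request_offset):
    """Read the request and response lines of one entry on demand."""
    with open(path, "rb") as f:
        f.seek(request_offset)
        request = f.readline().decode("utf-8").rstrip("\n")
        response = f.readline().decode("utf-8").rstrip("\n")
    return request, response

# Demo with a tiny hypothetical log file.
log_path = os.path.join(tempfile.mkdtemp(), "sample.log")
with open(log_path, "wb") as f:
    f.write(b"statistics : id=1\nrequest : <a/>\nresponse : <b/>\n"
            b"statistics : id=2\nrequest : <c/>\nresponse : <d/>\n")
index = build_index(log_path)
second_entry = load_payload(log_path, index[1][1])
```

Only the statistics text and one integer per entry stay in memory, so the datagrid can be filled without holding any of the large XML payloads.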
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot, it could be just a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario where you should use a memory-mapped file, because there could be a lot of read operations? Or is it better to use a plain FileStream? BTW, I don't need write operations on the log file at this stage, but I might in the future!
If you have other tips or see flaws in my thinking so far please let me know as well. I am open for any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries. At this moment the format of this export is unknown (binary, file db, text file). The export can be imported by the application itself, which then shows only the selection made by the user.
You're talking about some stored data that has some defined relationships between actual entries... Maybe it's just me, but this scenario just calls for some kind of a relational database. I'd suggest to consider some portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use db instead, you can query exactly the data you need, without ever needing to handle large files like this.
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zerocopy, depending on your webserver). If you have many small "request" chunks, and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go db or all memory.
Why are you so concerned about saving memory as 600 MB is not that much. Are you going to be running on machines with less than 2 GB of memory?
Load into a dictionary with statistics as a key and the value a class with two properties - request and response. Dictionary is fast. LINQ is powerful and fast.
I have a program (a copy is deployed to each user's computer) that lets users store files on a centralized file server with compression (CAB file).
When adding a file, the user needs to extract the compressed file onto his own disk, add the file, and compress it back onto the server. So if two users process the same compressed file at the same time, the one uploaded later will replace the earlier one and cause data loss.
My strategy to prevent this: before a user extracts the compressed file, the program checks whether a specified temp file exists on the server. If not, the program creates the temp file to prevent other users from interfering, and deletes it after uploading. If it does exist, the program waits until the temp file is deleted.
Is there a better way of doing this? And will frequently creating and deleting empty files damage the disk?
And will frequently creating and deleting empty files damage the disk?
No. If you're using a solid-state disk, there's a theoretical limit on the number of writes that can be performed (which is an inherent limitation of flash memory). However, you're incredibly unlikely to ever reach that limit.
Is there better way of doing this
Well, I would go about this differently:
Write a Windows Service that handles all disk access, and have your client apps talk to the service. So, when a client needs to retrieve a file, it would open a socket connection to your service and request the file and either keep it in memory or save it to their local disk. Perform any modifications on the client's local copy of the file (decompress, add/remove/update files, recompress, etc), and, when the operation is complete and you're ready to save (or commit in source-control lingo) your changes, open another socket connection to your service app (running on the server), and send it the new file contents as a binary stream.
The service app would then handle loading and saving the files to disk. This gives you a lot of additional capabilities, as well - the server can keep track of past versions (perhaps even committing each version to svn or another source control system), provide metadata such as what the latest version is, etc.
Now that I'm thinking about it, you may be better off just integrating an svn interface into your app. SharpSVN is a good library for this.
Creating temporary files to flag the lock is a viable and widely used option (and no, this won't damage the disk). Another option is to open the compressed file exclusively (or let other processes only read the file but not write it) and keep the file opened while the user works with the contents of the file.
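A minimal sketch of the lock-file approach. Checking for the file and then creating it in two separate steps leaves a gap in which two clients can both see "no lock", so this sketch creates the file with an exclusive-create flag, which fails if the file already exists. The names `acquire_lock` and `release_lock` and the timeout values are assumptions, and how reliably exclusive creation works on a given network share should be verified for that share.

```python
import os
import tempfile
import time

def acquire_lock(lock_path, timeout=5.0, poll=0.1):
    """Try to create the lock file exclusively; wait up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so the
            # existence check and the creation happen in one step.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)

def release_lock(lock_path):
    os.remove(lock_path)

# Demo: the second attempt times out while the first holder has the lock.
lock_path = os.path.join(tempfile.mkdtemp(), "archive.cab.lock")
first = acquire_lock(lock_path)
second = acquire_lock(lock_path, timeout=0.2, poll=0.05)
release_lock(lock_path)
third = acquire_lock(lock_path)
release_lock(lock_path)
```

A client that crashes while holding the lock leaves the file behind, so a real deployment would also need some way to detect and clear stale locks.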
Is there better way of doing this?
Yes. From what you've written here, it sounds like you are well on your way towards re-inventing revision control.
Perhaps you could use some off-the-shelf version control system?
Or perhaps at least re-use some code from such systems?
Or perhaps you could at least learn a little about the problems those systems faced, how fixing the obvious problems led to non-obvious problems, and attempt to make a system that works at least as well?
My understanding is that version control systems went through several stages (see "Edit Conflict Resolution" on the original wiki, the Portland Pattern Repository).
In roughly chronological order:
The master version is stored on the server. Last-to-save wins, leading to mysterious data loss with no warning.
The master version is stored on the server. When I pull a copy to my machine, the system creates a lock file on the server. When I push my changes to the server (or cancel), the system deletes that lock file. No one can change those files on the server, so we've fixed the "mysterious data loss" problem, but we have endless frustration when I need to edit some file that someone else checked out just before leaving on a long vacation.
The master version is stored on the server. First-to-save wins ("optimistic locking"). When I pull the latest version from the server, it includes some kind of version-number. When I later push my edits to the server, if the version-number I pulled doesn't match the current version on the server, someone else has cut in first and changed things ahead of me, and the system gives some sort of polite message telling me about it. Ideally I pull the latest version from the server and carefully merge it with my version, and then push the merged version to the server, and everything is wonderful. Alas, all too often, an impatient person pulls the latest version, overwrites it with "his" version, and pushes "his" version, leading to data loss.
Every version is stored on the server, in an unbroken chain. (Centralized version control like TortoiseSVN is like this).
Every version is stored in every local working directory; sometimes the chain forks into 2 chains; sometimes two chains merge back into one chain. (Distributed version control tools like TortoiseHg are like this).
So it sounds like you're doing what everyone else did when they moved from stage 1 to stage 2. I suppose you could slowly work your way through every stage.
Or maybe you could jump to stage 4 or 5 and save everyone time?
Take a look at the FileStream.Lock method. Quoting from MSDN:
Prevents other processes from reading from or writing to the FileStream.
...
Locking a range of a file stream gives the threads of the locking process exclusive access to that range of the file stream.