I am writing lines to a text file. Is there a way to limit the maximum number of lines in a text file, so that I am not allowed to write past that limit?
Or
if I continue to write after the maximum line limit, can the oldest lines be deleted to make room for the newly added ones?
There is ... but you shouldn't be hitting it ... And if you ARE ... well, maybe a text file isn't what you're looking for.
Size-wise, a file has different limits depending on your file system: NTFS (almost 16 TB), FAT (FAT32 is almost 4 GB), Unix file systems have their own limits, and so on.
here you have answers about the size: one answer, and another
Like they suggest, your limit will be the size of the file.
As for your comment:
You can set the limit to whatever you wish.
What you do then is up to you: if you decide to overwrite the file, it will be deleted and started afresh; if you decide to append, the new lines go at the end.
I would suggest creating a queue of 100 strings; when you push a new one onto a full queue, drop the oldest one. Then you can have that class save the log whenever, wherever and however you want.
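A minimal sketch of that queue idea, in case it helps. The class name BoundedLog and its members are made up for illustration, and Save simply rewrites the whole file, which is only reasonable for small limits such as 100 lines.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: keeps only the newest `capacity` lines in memory.
public class BoundedLog
{
    private readonly Queue<string> _lines = new Queue<string>();
    private readonly int _capacity;

    public BoundedLog(int capacity) { _capacity = capacity; }

    public void Add(string line)
    {
        _lines.Enqueue(line);
        // Dequeue removes from the front of the queue, i.e. the oldest entry.
        while (_lines.Count > _capacity)
            _lines.Dequeue();
    }

    public IReadOnlyCollection<string> Lines => _lines;

    // Rewrites the whole file with the lines currently held.
    public void Save(string path) => File.WriteAllLines(path, _lines);
}
```

With a capacity of 3, adding five lines leaves only the last three in the queue, and Save writes exactly those to disk.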
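Create your own method like this:
// Requires: using System; using System.IO; using System.Linq;
// Newest lines go at the top of the file; the oldest fall off the bottom.
public void writeLines(string filePath, string[] lines, int limit)
{
    var buffer = Enumerable.Empty<string>();
    if (File.Exists(filePath))
        buffer = File.ReadAllLines(filePath);
    // Never write more than `limit` new lines.
    File.WriteAllLines(filePath, lines.Take(limit));
    // Room left for previously written lines once the new ones are in.
    int range = Math.Max(0, limit - lines.Length);
    File.AppendAllLines(filePath, buffer.Take(range));
}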
The answer to the first question is very simple:
Know your storage limit.
Know your current file size.
If the new line's length plus current file size is more than the storage limit, don't append it.
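The three steps above can be sketched roughly like this. LimitedAppender and TryAppendLine are hypothetical names, and the byte count assumes the file is written as UTF-8 without a BOM, which is what File.AppendAllText does by default.

```csharp
using System;
using System.IO;
using System.Text;

public static class LimitedAppender
{
    // Appends `line` only if the file would stay within maxBytes; returns false otherwise.
    public static bool TryAppendLine(string path, string line, long maxBytes)
    {
        string text = line + Environment.NewLine;

        // Current file size (0 if the file does not exist yet).
        long current = File.Exists(path) ? new FileInfo(path).Length : 0;

        // Size of the new line on disk, assuming UTF-8 output.
        long needed = Encoding.UTF8.GetByteCount(text);

        if (current + needed > maxBytes)
            return false;   // over the storage limit: don't append

        File.AppendAllText(path, text);
        return true;
    }
}
```

The caller decides what a false return means, for example dropping the line or switching to a new file.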
Now, the second is kind of tricky. As pointed out by several participants on this thread, line-by-line threshold manipulation can be very very costly.
Let's do some napkin simulations and assume you're inserting 1024 bytes (1 KB) at each append, and your storage limit is 1 GB. Once you insert the last line (no. 1048576), you decide you need to remove the first line. There are a few ways to accomplish this, but most of them involve loading the whole collection minus the initial line elsewhere (memory, disk, you name it) and appending the new one. Not exactly the most practical approach: you'll be manipulating a data set a million times larger than the content you want to add, just for the sake of adding it.
Solution 1
Cursor buffer
In our example you have 1048576 possible entries (1 KB records in a 1 GB file).
Start filling it up; save the current position (cursor) elsewhere.
Once you reach the limit, your cursor resets; you overwrite position 0, then 1, and so forth.
Advantages: Very low disk cost.
Disadvantages: You'll need to keep track of your current cursor somewhere.
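Here is a rough sketch of the cursor buffer, assuming fixed-size record slots. The RingFile class and the ".cursor" sidecar file that stores the position are assumptions of this example, not something prescribed above. Text longer than one slot is truncated (which could split a multi-byte character), shorter text is padded with zero bytes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch: fixed-size record slots that are overwritten in a ring.
public class RingFile
{
    private readonly string _dataPath, _cursorPath;
    private readonly int _recordSize, _maxRecords;

    public RingFile(string dataPath, int recordSize, int maxRecords)
    {
        _dataPath = dataPath;
        _cursorPath = dataPath + ".cursor";   // assumed place to keep the cursor
        _recordSize = recordSize;
        _maxRecords = maxRecords;
    }

    public void Append(string line)
    {
        long cursor = File.Exists(_cursorPath) ? long.Parse(File.ReadAllText(_cursorPath)) : 0;

        // Pad (or truncate) to exactly one slot so every slot has a known offset.
        byte[] record = new byte[_recordSize];
        byte[] text = Encoding.UTF8.GetBytes(line);
        Array.Copy(text, record, Math.Min(text.Length, _recordSize));

        using (var fs = new FileStream(_dataPath, FileMode.OpenOrCreate, FileAccess.Write))
        {
            fs.Seek(cursor * _recordSize, SeekOrigin.Begin);
            fs.Write(record, 0, record.Length);
        }

        // Advance and wrap: after the last slot the next write lands on slot 0.
        File.WriteAllText(_cursorPath, ((cursor + 1) % _maxRecords).ToString());
    }
}
```

Each append touches one slot plus the tiny cursor file, no matter how large the log is. Readers need the cursor too, since the oldest record sits at the cursor position once the file has wrapped.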
Solution 2
Text blocks
Assume 1GB storage, and max 1MB files for this example.
Start filling up File #0.
Once it reaches 1MB, close it. Open File #1. Rinse and repeat.
Once you fill up file #1023 (thus reaching the 1GB max), delete the oldest file (#0). Create file #1024. Continue your logging.
Advantages: Low manipulation cost - you only run one delete operation.
Disadvantages: You don't delete only one entry - you delete a whole block.
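A small sketch of the block rotation, with the limits passed in as parameters. BlockLog and the "blockN.log" naming are hypothetical, and the current block index lives only in memory here; a real logger would have to recover it from the directory listing after a restart.

```csharp
using System.IO;

// Hypothetical sketch of block rotation: numbered files in one directory.
public class BlockLog
{
    private readonly string _dir;
    private readonly long _maxBlockBytes;
    private readonly int _maxBlocks;
    private int _current;   // index of the block currently being filled

    public BlockLog(string dir, long maxBlockBytes, int maxBlocks)
    {
        _dir = dir;
        _maxBlockBytes = maxBlockBytes;
        _maxBlocks = maxBlocks;
        Directory.CreateDirectory(dir);
    }

    private string PathOf(int index) => Path.Combine(_dir, "block" + index + ".log");

    public void Append(string line)
    {
        string path = PathOf(_current);
        if (File.Exists(path) && new FileInfo(path).Length >= _maxBlockBytes)
        {
            _current++;                                   // the full block is done, move on
            int expired = _current - _maxBlocks;
            if (expired >= 0) File.Delete(PathOf(expired)); // one delete drops a whole block
            path = PathOf(_current);
        }
        File.AppendAllText(path, line + "\n");
    }
}
```

With a 1 MB block size and 1024 blocks this matches the numbers in the example: the total stays around 1 GB, and the cost of trimming is a single file delete.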
I want a fast way in C# to remove blocks of bytes at different places in a binary file of between 500 MB and 1 GB. The start and length of each block to be removed are saved in arrays:
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
This is a piece of my code; it will not work correctly unless I set the buffer size to 1:
while(true){
    if (rdiDataOffset.Contains((int)fsr.Position))
    {
        int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
        int oldRFSRPosition = (int)fsr.Position;
        size = rdiDataSize[idxval];
        fsr.Seek(size, SeekOrigin.Current);
    }
    int bufferSize = size == 0 ? 2048 : size;
    if ((size>0) && (bufferSize > (size))) bufferSize = (size);
    if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
    byte[] buffer = new byte[bufferSize];
    int nofbytes = fsr.Read(buffer, 0, buffer.Length);
    fsr.Flush();
    if (nofbytes < 1)
    {
        break;
    }
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
A simple algorithm for doing this uses a temp file (it could be done in place as well, but that is riskier if things go wrong):
1. Create a new file and call SetLength to set the stream size (if this is too slow you can interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
2. Sort your removal list in ascending order.
3. Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
4. Write that content to the new file (you will likely need to do this in chunks).
5. Skip over the data not being copied.
6. Repeat from #3 until done.
7. You now have two files, the old one and the new one; replace as necessary. If this is really critical data you might want to look at a transactional approach (either one you implement or using something like NTFS transactions).
8. Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks. Then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
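The copy loop from the steps above might look something like this. ChunkRemover and CopyWithout are hypothetical names; the sketch takes offsets and sizes in the same shape as the question's rdiDataOffset and rdiDataSize arrays, assumes the ranges do not overlap, and leaves out the SetLength pre-allocation and the final file swap.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ChunkRemover
{
    // Copies source to dest, leaving out the byte ranges [offsets[i], offsets[i] + sizes[i]).
    public static void CopyWithout(string source, string dest, int[] offsets, int[] sizes)
    {
        // Sort the removal list in ascending order of offset.
        var ranges = offsets.Zip(sizes, (o, s) => new { Offset = (long)o, Size = (long)s })
                            .OrderBy(r => r.Offset)
                            .ToList();
        var buffer = new byte[81920];

        // FileShare.Read: other processes may read, but not write, while we copy.
        using (var input = new FileStream(source, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var output = new FileStream(dest, FileMode.Create, FileAccess.Write))
        {
            foreach (var r in ranges)
            {
                CopyBytes(input, output, r.Offset - input.Position, buffer); // data before the hole
                input.Seek(r.Size, SeekOrigin.Current);                      // skip the hole itself
            }
            CopyBytes(input, output, input.Length - input.Position, buffer); // the tail
        }
    }

    // Copies exactly `count` bytes (or fewer if the input ends early), in chunks.
    private static void CopyBytes(Stream input, Stream output, long count, byte[] buffer)
    {
        while (count > 0)
        {
            int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read <= 0) break;   // unexpected end of file
            output.Write(buffer, 0, read);
            count -= read;
        }
    }
}
```

Because the amount to copy is tracked separately from the buffer size, the buffer can be as large as you like; it never has to line up with the removal offsets.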
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
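When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, zeroed-out extents simply read back as zeros without occupying disk space. Note that the file's logical length and the offsets of the remaining data stay the same, so this frees the space rather than closing the gap.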
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero out sections of that file is to use the DeviceIoControl Windows API. It is arcane and requires p/invoke, but if you go this route you'll surely hide the ugliness behind nice pretty function calls.
There are some issues to be aware of. For example, if the file is moved to a non-NTFS volume and then back, the sparseness of the file can disappear, so you should program defensively.
Also, a sparse file can appear to be larger than it really is, complicating tasks involving disk provisioning. A 5 GB sparse file that has been completely zeroed out still counts 5 GB towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.
EDIT 1:
I am building a torrent application that downloads from different clients simultaneously. Each download represents a portion of my file, and different clients have different portions.
After a download is complete, I need to know which portion to fetch next by finding "empty" portions in my file.
One way to create a file with a fixed size:
File.WriteAllBytes(@"C:\upload\BigFile.rar", new byte[Big Size]);
My array that represents the file as portions:
BitArray TorrentPartsState = new BitArray(10);
For example:
File size is 100.
TorrentPartsState[0] = true; // that means that in my file, from position 0 until 9, I **don't** need to fill in any information.
TorrentPartsState[1] = false; // that means that in my file, from position 10 until 19, I **need** to fill in some information.
I am searching for an effective way to save what the BitArray contains even if the computer/application is shut down. One way I thought of is an XML file that is updated each time a portion is complete.
I don't think it's a smart and effective solution. Any idea for another one?
It sounds like you know the following when you start a transfer:
The size of the final file.
The (maximum) number of streams you intend to use for the file.
Create the output file and allocate the required space.
Create a second "control" file with a related filename, e.g. add your own extension. In that file maintain an array of stream status structures corresponding to the network streams. Each status consists of the starting offset and number of bytes transferred. Periodically flush the stream buffers and then update the control file to reflect the progress made and committed.
Variations on the theme:
The control file can define segments to be transferred, e.g. 16MB chunks, and treated as a work queue by threads that look for an incomplete segment and a suitable server from which to retrieve it.
The control file could be a separate fork within the result file. (Who am I kidding?)
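For the BitArray from the question, a control file can be as small as the number of parts followed by the packed bits. This is only a sketch of one possible layout; PartsStateFile, Save and Load are made-up names, and the format (a 4-byte count, then the bits) is an assumption of this example rather than any standard.

```csharp
using System;
using System.Collections;
using System.IO;

public static class PartsStateFile
{
    // Writes the number of parts followed by the bits packed eight per byte.
    public static void Save(string controlPath, BitArray parts)
    {
        var packed = new byte[(parts.Length + 7) / 8];
        parts.CopyTo(packed, 0);
        using (var writer = new BinaryWriter(File.Open(controlPath, FileMode.Create)))
        {
            writer.Write(parts.Length);
            writer.Write(packed);
        }
    }

    public static BitArray Load(string controlPath)
    {
        using (var reader = new BinaryReader(File.Open(controlPath, FileMode.Open)))
        {
            int count = reader.ReadInt32();
            byte[] packed = reader.ReadBytes((count + 7) / 8);
            // The byte[] constructor yields a multiple of 8 bits; trim to the real count.
            return new BitArray(packed) { Length = count };
        }
    }
}
```

Call Save after each portion has been flushed to the data file, and Load on startup to find out which portions are still missing. A 10-part array takes 6 bytes on disk, so rewriting it on every completed portion is cheap.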
You could use a BitArray (in System.Collections).
Then, when you visit an offset in the file, you can set the BitArray at that offset to true.
So for your 10,000 byte file:
BitArray ba = new BitArray(10000);
// Visited offset, mark in the BitArray
ba[4] = true;
Implement a file system (like on a disk) in your file. Just use something simple; there should be something available in the FOSS arena.
I'm using C# and I write my data into CSV files (for further use). However, my files have grown large and I have to transpose them. What's the easiest way to do that, in any program?
Gil
In increasing order of complexity (and also increasing order of ability to handle large files):
Read the whole thing into a 2-D array (or jagged array aka array-of-arrays).
Memory required: equal to size of file
Track the file offset within each row. Start by finding each (non-quoted) newline, storing the current position into a List<Int64>. Then iterate across all rows, for each row: seek to the saved position, copy one cell to the output, save the new position. Repeat until you run out of columns (all rows reach a newline).
Memory required: eight bytes per row
Frequent file seeks scattered across a file much larger than the disk cache results in disk thrashing and miserable performance, but it won't crash.
Like above, but working on blocks of e.g. 8k rows. This will create a set of files each with 8k columns. The input block and output all fit within disk cache, so no thrashing occurs. After building the stripe files, iterate across the stripes, reading one row from each and appending to the output. Repeat for all rows. This results in sequential scan on each file, which also has very reasonable cache behavior.
Memory required: 64k for first pass, (column count/8k) file descriptors for second pass.
Good performance for tables of up to several million in each dimension. For even larger data sets, combine just a few (e.g. 1k) of the stripe files together, making a smaller set of larger stripes, repeat until you have only a single stripe with all data in one file.
Final comment: You might squeeze out more performance by using C++ (or any language with proper pointer support), memory-mapped files, and pointers instead of file offsets.
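For reference, the first and simplest option (read everything into a jagged array) could be sketched like this. CsvTranspose is a hypothetical name, and the sketch assumes plain CSV with no quoted fields containing commas or newlines; it needs memory on the order of the file size, so it only suits files that fit in RAM.

```csharp
using System;
using System.IO;
using System.Linq;

public static class CsvTranspose
{
    // Assumes simple CSV: no quoted fields containing commas or newlines.
    public static void Transpose(string inputPath, string outputPath)
    {
        // Whole file in memory as an array of rows, each an array of cells.
        string[][] rows = File.ReadLines(inputPath)
                              .Select(line => line.Split(','))
                              .ToArray();
        int columns = rows.Length == 0 ? 0 : rows.Max(r => r.Length);

        using (var writer = new StreamWriter(outputPath))
        {
            for (int c = 0; c < columns; c++)
            {
                // Output row c is input column c; short rows contribute empty cells.
                var cells = rows.Select(r => c < r.Length ? r[c] : "");
                writer.WriteLine(string.Join(",", cells));
            }
        }
    }
}
```

Once the input no longer fits in memory, this is the point where the offset-tracking or striped variants described above become necessary.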
It really depends. Are you getting these out of a database? Then you could use a MySQL import statement. http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Or you could loop through the data and add it to a file stream using a StreamWriter object.
// The using block flushes and closes the file when the loop is done.
using (StreamWriter sw = new StreamWriter("pathtofile"))
{
    foreach (String[] value in lstValueList)
    {
        String something = value[1] + "," + value[2];
        sw.WriteLine(something);
    }
}
I wrote a little proof-of-concept script here in Python. I admit it's buggy and there are likely some performance improvements to be made, but it will do it. I ran it against a 40x40 file and got the desired result. I started to run it against something more like your example data set and it took too long for me to wait.
import csv
from glob import glob
from os.path import join
from shutil import rmtree
from tempfile import mkdtemp

path = mkdtemp()
try:
    with open('/home/user/big-csv', newline='') as instream:
        reader = csv.reader(instream)
        for i, row in enumerate(reader):
            for j, field in enumerate(row):
                # One temp file per input column; each one becomes an output row.
                with open(join(path, 'new row {0:0>2}'.format(j)), 'a') as new_row_stream:
                    new_row_stream.write('{0},'.format(field))
            print('read row {0:0>2}'.format(i))
    with open('/home/user/transpose-csv', 'w') as outstream:
        # Two-digit zero padding only sorts correctly for up to 100 columns.
        files = sorted(glob(join(path, '*')))
        for filename in files:
            with open(filename) as row_file:
                # Each output row keeps a trailing comma, as in the original script.
                outstream.write(row_file.read() + '\n')
finally:
    print('done')
    rmtree(path)
Is there a library that I can use to perform binary search in a very big text file (can be 10GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary: I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
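A sketch of that seek-and-back-up search, under two assumptions that are mine rather than the question's: lines end in a line feed, and every line begins with a timestamp whose text sorts the same way as its time (for example 2001-02-10 10:35:02). LogSearch and its methods are hypothetical names. The search narrows on line boundaries rather than raw byte offsets, which avoids the "landing on the same record again" problem described above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogSearch
{
    // Returns the byte offset of the first line that sorts at or after `target`,
    // or the file length if every line sorts before it.
    public static long FindFirstAtOrAfter(string path, string target)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long lo = 0, hi = fs.Length;   // lo and hi are always line starts (or end of file)
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long start = LineStart(fs, mid);
                string line = ReadLineAt(fs, start, out long next);
                if (string.CompareOrdinal(line, target) < 0)
                    lo = next;    // this line is too early: continue after it
                else
                    hi = start;   // this line qualifies: continue before it
            }
            return lo;
        }
    }

    // Walks backwards from pos until the previous byte is a line feed (or the file start).
    private static long LineStart(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Seek(pos - 1, SeekOrigin.Begin);
            if (fs.ReadByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    // Reads one line starting at `start`; `next` is the offset just past its line feed.
    private static string ReadLineAt(FileStream fs, long start, out long next)
    {
        fs.Seek(start, SeekOrigin.Begin);
        var bytes = new List<byte>();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        next = fs.Position;
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}
```

Passing just a timestamp prefix as the target finds the first line stamped at or after that time, because a longer line with the same prefix compares greater than the prefix itself. The byte-by-byte backward walk is kept for clarity; reading a small block around the midpoint would be faster.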
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line feed in the file. How much that costs depends on the average line length: at 1000 bytes per line you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB, since an Int64 takes 8 bytes. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
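Roughly, the two steps could look like this. LineIndex is a hypothetical name, the sketch stores the offset where each line starts (the byte after each line feed), and it uses a hand-written search loop in place of List.BinarySearch with a custom comparer so that the file reads are easy to see. It assumes lines sort ordinally by their leading timestamp.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndex
{
    // Step 1: record the byte offset at which every line starts.
    public static List<long> Build(string path)
    {
        var starts = new List<long>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long length = fs.Length;
            if (length > 0) starts.Add(0);
            int b;
            while ((b = fs.ReadByte()) != -1)
                if (b == '\n' && fs.Position < length)
                    starts.Add(fs.Position);
        }
        return starts;
    }

    // Step 2: binary search over line numbers, reading only the lines that get probed.
    // Returns the index of the first line that sorts at or after target, or starts.Count.
    public static int FindFirstAtOrAfter(string path, List<long> starts, string target)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int lo = 0, hi = starts.Count;
            while (lo < hi)
            {
                int mid = lo + (hi - lo) / 2;
                long end = mid + 1 < starts.Count ? starts[mid + 1] : fs.Length;
                var bytes = new byte[(int)(end - starts[mid])];
                fs.Seek(starts[mid], SeekOrigin.Begin);
                int got = 0, n;
                while (got < bytes.Length && (n = fs.Read(bytes, got, bytes.Length - got)) > 0)
                    got += n;
                string line = Encoding.UTF8.GetString(bytes, 0, got).TrimEnd('\r', '\n');
                if (string.CompareOrdinal(line, target) < 0) lo = mid + 1;
                else hi = mid;
            }
            return lo;
        }
    }
}
```

The index costs one full scan of the file up front; after that each lookup reads only about log2(line count) lines.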
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime parts of the lines plus their positions in the original (this is why it has to be pretty static), then encode them somehow (for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) and line position (zero-filled 10 digits)). This way you will have a file with consistent "lines".
Perform binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
The List object has a Binary Search method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.
Assumptions
The documents can be pretty large, up to 100MB.
More often than not the document will remain unchanged (i.e. I don't want to do unnecessary up-front processing).
Changes will typically be quite close to each other in the document (i.e. as the user types).
It should be possible to apply changes fast (without copying the whole document).
Changes will be applied in terms of offsets and new/deleted text (not as line/col).
To work in C#
Current considerations
Storing the data as a string. Easy to code, fast to set, very slow to update.
Array of lines. Moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert and remove lines easily, but finding offsets requires summing line lengths).
There must be a load of standard algorithms for this kind of thing (it's not a million miles from disk allocation and fragmentation).
Thanks for your thoughts.
I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block might change if the user edits that block. This avoids moving 100 megabytes of data if the user inserts one byte at the front.
To manage the blocks, just put them, together with the offset of each block, into a list. If the user modifies a block's length you only have to update the offsets of the blocks after it. To find an offset, you can use binary search.
File size: 100 MiB
Block Size: 16 kiB
Blocks: 6400
Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16384 byte data and update 6400 block offsets
Modifying a block (average case): copy 8192 byte data and update 3200 block offsets
16 kiB block size is just a random example - you can balance the costs of the operations by choosing the block size, maybe based on the file size and the probability of operations. Doing some simple math will yield the optimal block size.
Loading will be quite fast, because you load fixed sized blocks, and saving should perform well, too, because you will have to write a few thousand blocks and not millions of single lines. You can optimize loading by loading blocks only on demand and you can optimize saving by only saving all blocks that changed (content or offset).
Finally, the implementation will not be too hard either. You could just use the StringBuilder class to represent a block. But this solution will not work well for very long lines with lengths comparable to the block size or larger, because you will have to load many blocks and display only small parts of them, with the rest being to the left or right of the window. I assume you will have to use a two-dimensional partitioning model in this case.
Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
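To make the gap buffer idea concrete, here is a minimal sketch written for this thread; it is not taken from the article. GapBuffer and its members are made-up names, the default gap size of 1024 characters is arbitrary, and offsets are not validated.

```csharp
using System;

// Text before the cursor sits at the start of the array, text after the cursor
// at the end, and the unused gap lies in between.
public class GapBuffer
{
    private char[] _buf;
    private int _gapStart, _gapEnd;   // the gap occupies [_gapStart, _gapEnd)

    public GapBuffer(string text, int gap = 1024)
    {
        _buf = new char[text.Length + gap];
        text.CopyTo(0, _buf, 0, text.Length);
        _gapStart = text.Length;
        _gapEnd = _buf.Length;
    }

    public int Length => _buf.Length - (_gapEnd - _gapStart);

    // Slides the gap so that it starts at `offset`; cheap when edits are close together.
    private void MoveGap(int offset)
    {
        if (offset < _gapStart)
        {
            int n = _gapStart - offset;
            Array.Copy(_buf, offset, _buf, _gapEnd - n, n);
            _gapStart -= n; _gapEnd -= n;
        }
        else if (offset > _gapStart)
        {
            int n = offset - _gapStart;
            Array.Copy(_buf, _gapEnd, _buf, _gapStart, n);
            _gapStart += n; _gapEnd += n;
        }
    }

    public void Insert(int offset, string text)
    {
        MoveGap(offset);
        if (_gapEnd - _gapStart < text.Length) Grow(text.Length);
        text.CopyTo(0, _buf, _gapStart, text.Length);
        _gapStart += text.Length;
    }

    public void Delete(int offset, int count)
    {
        MoveGap(offset);
        _gapEnd += count;   // the deleted characters simply become part of the gap
    }

    // Reallocates with a bigger gap; the only operation that copies the whole text.
    private void Grow(int needed)
    {
        int tail = _buf.Length - _gapEnd;
        var bigger = new char[_buf.Length + needed + 1024];
        Array.Copy(_buf, 0, bigger, 0, _gapStart);
        Array.Copy(_buf, _gapEnd, bigger, bigger.Length - tail, tail);
        _gapEnd = bigger.Length - tail;
        _buf = bigger;
    }

    public override string ToString() =>
        new string(_buf, 0, _gapStart) + new string(_buf, _gapEnd, _buf.Length - _gapEnd);
}
```

Typing at one spot only writes into the gap, so the cost of an edit is proportional to how far the cursor moved since the last edit, not to the size of the document. That fits the question's assumption that changes are usually close to each other.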
You might find this paper useful --- Data Structures for Text Sequences which describes and experimentally analyses a few standard algorithms, and compares [among other things] gap buffers and piece tables.
FWIW, it concludes piece tables are slightly better overall; though net.wisdom seems to prefer gap buffers.
I would suggest you to take a look at Memory Mapped Files (MMF).
Some pointers:
Memory Mapped Files .NET
http://msdn.microsoft.com/en-us/library/ms810613.aspx
I'd use a b-tree or skip list of lines, or larger blocks if you aren't going to edit much.
Determining line ends on load doesn't cost much extra, since you have to visit each character while loading anyway.
You can move lines within a node without much effort.
The total length of the text in each node is stored in the node, and changes propagated up to parent nodes.
Each line is represented by a data array, and start index, length and capacity. Line break/carriage returns aren't put in the data array. Common operations such as breaking lines only requires changes to the references into the array; editing lines requires a copy if capacity is exceeded. A similar structure might be used per line temporarily when editing that line, so you don't perform a copy on each key-press.
Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing unless you have some very long lines.
The linked list would give you an efficient way to store the data and add or remove lines as the user edits. The indexing allows you to quickly jump to a particular point in your file. This sort of idea lends itself well to undo/redo type operations too as it should be reasonably easy to sort edits into small atomic operations.
I'd agree with crisb's point though, it's probably better to get something simple working first and then see if it really is slow..
From your description it sounds a lot like your document is unformatted text only, so a StringBuilder would do fine.
If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them. That will save you an awful lot of time, as document parsing can often be a pain in the a** :-)
I wouldn't get too worried about the performance yet - it sounds a lot like you haven't implemented one yet, so you also don't know what performance characteristics the rest of your app has - it may be that you can't actually afford to hold multiple documents in memory at all when you actually get round to profiling it.