Replace bytes in a file based on positions - C#

In C# I want to replace, for example, bytes 200-5000 inside a file with an array/stream of 30000 bytes. The numbers are just examples and the sizes may vary; I could be replacing a section with either more or fewer bytes than the section originally contains.
I have checked this question:
Replace sequence of bytes in binary file
It only explains how to overwrite a fixed-size section: it will overwrite the data that follows if the new byte array is longer than the original section, and it will leave old bytes in place if the section being replaced is longer than the new data.
Higher performance and not having to read the entire file into memory would be preferred, but anything goes.
I would need this to work both on Windows and Linux.
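One approach that seems to fit (and avoids reading the entire file into memory) is to rewrite the tail of the file through a temporary file: copy the untouched prefix, write the new bytes, then copy the rest. A rough sketch of what I mean; ReplaceRange and the buffer size are just placeholders, and error handling is omitted:

using System;
using System.IO;

static void ReplaceRange(string path, long start, long end, Stream replacement)
{
    // Files cannot grow or shrink in the middle, so rewrite everything
    // from 'start' onward into a temporary file and swap it in.
    string tempPath = path + ".tmp";
    using (var source = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
    {
        var buffer = new byte[81920];

        // 1. Copy the untouched prefix [0, start).
        long remaining = start;
        while (remaining > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break;
            target.Write(buffer, 0, read);
            remaining -= read;
        }

        // 2. Write the replacement bytes (may be larger or smaller than end - start).
        replacement.CopyTo(target);

        // 3. Skip the replaced section and copy the rest of the original file.
        source.Seek(end, SeekOrigin.Begin);
        source.CopyTo(target);
    }
    File.Delete(path);
    File.Move(tempPath, path);
}

If the new data happens to be exactly the same size as the section, the temporary file is unnecessary: just Seek to the offset and overwrite in place.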

Related

What's the difference between SFML Image.Pixels and File.ReadAllBytes

While trying to understand SFML, I tried to set an icon with the RenderWindowInstanse.SetIcon() method, which takes 3 parameters: the first two are the size and the third is a byte[]. I tried to use File.ReadAllBytes() and similar tools in C#, but that didn't work. Searching, I found on-site the ImageInstanse.Pixels property, which returns a byte[] that fits the parameter. That works, but I don't understand why they return different byte arrays.
In SFML.NET, Image.Pixels returns an array of bytes that are nicely organized RGBA pixel values that represent the image in memory.
.NET's own File.ReadAllBytes() function returns the bytes that come from the file itself in the system's storage device.
Every file has a format that defines the layout and meaning of the bytes that make up that file. Image files are an extension of that concept, as there are many different file formats for images. The pixel data for an image has to be encoded (and/or compressed) according to the format it is being saved as. This means that the bytes in the file no longer match the raw RGBA pixel data as it was in the computer's memory.
Files often contain lots of extra bytes for things like a file header, metadata, compression information, or possibly even an index for blocks of data that are smaller files or images within a file.
When you use File.ReadAllBytes(), you are given all of the bytes that represent this data in an array and you have to know exactly what the meaning of the byte at each index is.
SFML understands how to decode many different image formats, and will read the bytes of the file and process that into an array of pixel data. This is what the constructor for Image that takes a file is doing in the background. Once you have an SFML.Graphics.Image instance, you can use its Pixels property to access that decoded RGBA pixel data.
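To make the difference concrete, here is a minimal sketch assuming SFML.NET's usual Image and RenderWindow API (the file name and window size are made up):

using SFML.Graphics;
using SFML.Window;

// Let SFML decode the file into raw RGBA pixels, then hand those to SetIcon.
// Passing File.ReadAllBytes("icon.png") here would hand SetIcon encoded PNG
// bytes, which is why that does not work.
var window = new RenderWindow(new VideoMode(800, 600), "Example");
var icon = new Image("icon.png");               // decodes PNG/JPG/... into RGBA pixels
window.SetIcon(icon.Size.X, icon.Size.Y, icon.Pixels);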

In SQLite what is the maximum capacity of TEXT?

According to this answer TEXT has a maximum capacity of 65535 characters (or 64Kbytes).
However, I just built a test in which I stored a JSON string taken from a 305 KB JSON file into a TEXT column without problems.
I am wondering if there is some property of TEXT that allows this.
The limit is, by default, 1 billion bytes rather than characters, so the encoding has to be considered. However, it can be changed.
The complete section regarding this is :-
Maximum length of a string or BLOB
The maximum number of bytes in a string or BLOB in SQLite is defined
by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this
macro is 1 billion (1 thousand million or 1,000,000,000). You can
raise or lower this value at compile-time using a command-line option
like this:
-DSQLITE_MAX_LENGTH=123456789
The current implementation will only support a string or BLOB length up to 2^31-1 or 2147483647. And some
built-in functions such as hex() might fail well before that point. In
security-sensitive applications it is best not to try to increase the
maximum string and blob length. In fact, you might do well to lower
the maximum string and blob length to something more in the range of a
few million if that is possible.
During part of SQLite's INSERT and SELECT processing, the complete
content of each row in the database is encoded as a single BLOB. So
the SQLITE_MAX_LENGTH parameter also determines the maximum number of
bytes in a row.
The maximum string or BLOB length can be lowered at run-time using the
sqlite3_limit(db,SQLITE_LIMIT_LENGTH,size) interface.
Limits In SQLite
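As a quick sanity check of that limit from C#, here is a minimal sketch using the Microsoft.Data.Sqlite package (the package choice, file names, and table are assumptions; any SQLite wrapper behaves the same way):

using Microsoft.Data.Sqlite;

// A 305 KB string fits comfortably under SQLite's default limit of
// 1,000,000,000 bytes per string/BLOB, so this insert succeeds.
using var connection = new SqliteConnection("Data Source=test.db");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS docs (body TEXT)";
    create.ExecuteNonQuery();
}

string json = new string('x', 305 * 1024);      // stand-in for the 305 KB JSON file
using (var insert = connection.CreateCommand())
{
    insert.CommandText = "INSERT INTO docs (body) VALUES ($body)";
    insert.Parameters.AddWithValue("$body", json);
    insert.ExecuteNonQuery();
}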

Limiting a stream size

I'm working on C# project and I want to read a single file from multiple threads using streams in the following manner:
A file is logically divided into "chunks" of fixed size.
Each thread gets its own stream representing a "chunk".
The problem is that I want to use the Stream interface, and I want to limit the size of each chunk so that the corresponding stream "ends" when it reaches the chunk size.
Is there something available in the standard library, or is my only option to write my own implementation of Stream?
There is an overload of StreamReader.Read which allows you to limit the number of characters read. An example can be found here: http://msdn.microsoft.com/en-us/library/9kstw824.aspx
The line you are looking for is sr.Read(c, 0, c.Length); you simply set up a char array and decide on the maximum number of characters that are going to be read (the third argument).
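A minimal sketch of that idea, assuming each thread knows its chunk index and size (the file name and chunk length are made up):

using System.IO;

const int chunkSize = 4096;                     // assumed chunk length
long chunkIndex = 2;                            // which chunk this thread owns

using var stream = new FileStream("data.txt", FileMode.Open, FileAccess.Read, FileShare.Read);
stream.Seek(chunkIndex * chunkSize, SeekOrigin.Begin);   // jump to this thread's chunk
using var reader = new StreamReader(stream);

var buffer = new char[chunkSize];
int read = reader.Read(buffer, 0, buffer.Length);        // reads at most chunkSize characters
string chunk = new string(buffer, 0, read);

Note that Read counts characters rather than bytes, so for strict byte-sized chunks you would still need a small Stream wrapper that tracks how many bytes remain in its chunk.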

C# code to perform Binary search in a very big text file

Is there a library that I can use to perform binary search in a very big text file (can be 10GB).
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary: I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
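A rough sketch of that loop, assuming ASCII-compatible text and '\n' line endings (the compare delegate, which parses the line's leading timestamp and compares it to the target time, is left to the caller):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// compare(line) should return 0 when the line's timestamp matches the target,
// a negative number when the line is earlier than the target, and a positive
// number when it is later.
static long FindLine(FileStream file, Func<string, int> compare)
{
    long lo = 0, hi = file.Length;
    while (lo < hi)
    {
        long mid = lo + (hi - lo) / 2;

        // Walk backwards from mid to the '\n' that ends the previous line.
        long lineStart = mid;
        while (lineStart > 0)
        {
            file.Seek(lineStart - 1, SeekOrigin.Begin);
            if (file.ReadByte() == '\n') break;
            lineStart--;
        }

        // Read the whole line we landed in.
        file.Seek(lineStart, SeekOrigin.Begin);
        var bytes = new List<byte>();
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n') bytes.Add((byte)b);
        string line = Encoding.ASCII.GetString(bytes.ToArray());

        int cmp = compare(line);
        if (cmp == 0) return lineStart;                  // found the record
        if (cmp < 0) lo = lineStart + bytes.Count + 1;   // target lies later in the file
        else hi = lineStart;                             // target lies earlier in the file
    }
    return -1;                                           // not found
}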
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];                        // one fixed-length line
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);  // jump straight to line number 'searchPosition'
stream.Read(buffer, 0, lineLength);                          // read exactly one line's worth of bytes
string line = Encoding.Default.GetString(buffer);            // decode it for comparison
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the line of text is on average: given 1000 bytes per line, you would be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data, as sketched below.
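A minimal sketch of both steps, assuming an ASCII log ordered by a timestamp prefix (the file name, target key, and ReadLineAt helper are illustrative):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// 1. Single pass: record the byte offset of every line start, one Int64 per line.
var offsets = new List<long> { 0 };
using var file = new FileStream("big.log", FileMode.Open, FileAccess.Read);
int b;
while ((b = file.ReadByte()) != -1)
    if (b == '\n' && file.Position < file.Length)
        offsets.Add(file.Position);

// 2. Binary search over the offsets, reading the line at each probe position.
string target = "[10/02/2001 10:35:02]";        // target key in the log's own prefix format
int lo = 0, hi = offsets.Count - 1, found = -1;
while (lo <= hi)
{
    int mid = lo + (hi - lo) / 2;
    string line = ReadLineAt(file, offsets[mid]);
    string key = line.Length >= target.Length ? line.Substring(0, target.Length) : line;
    int cmp = string.CompareOrdinal(key, target);
    if (cmp == 0) { found = mid; break; }
    if (cmp < 0) lo = mid + 1; else hi = mid - 1;
}

static string ReadLineAt(FileStream f, long offset)
{
    // Seek to a recorded line start and read up to the next '\n'.
    f.Seek(offset, SeekOrigin.Begin);
    var lineBytes = new List<byte>();
    int c;
    while ((c = f.ReadByte()) != -1 && c != '\n') lineBytes.Add((byte)c);
    return Encoding.ASCII.GetString(lineBytes.ToArray());
}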
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and take the datetime part of each line plus its position in the original file (this is why the file has to be pretty static). Encode them somehow, for example: unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with fixed-width "lines".
Perform a binary search on that file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file.
Read directly from the original file starting at the given location / read the given range.
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file. A sketch of that simpler index follows.
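Assuming '\n' line endings and zero-filled 10-digit offsets, so that each index record is a fixed 11 bytes (an assumption, not a standard format), something like:

using System.IO;
using System.Text;

// Build the index: one fixed-width record per line of the data file, holding
// the byte offset where that line starts.
static void BuildIndex(string dataPath, string indexPath)
{
    using var data = new FileStream(dataPath, FileMode.Open, FileAccess.Read);
    using var index = new StreamWriter(indexPath);
    index.Write("0000000000\n");                     // the first line starts at offset 0
    int b;
    while ((b = data.ReadByte()) != -1)
        if (b == '\n' && data.Position < data.Length)
            index.Write(data.Position.ToString("D10") + "\n");
}

// Fixed-width records make the index directly binary searchable: to probe
// record i, seek to i * 11 and parse the 10 digits found there.
static long DataOffsetOf(FileStream index, long i)
{
    var digits = new byte[10];
    index.Seek(i * 11, SeekOrigin.Begin);            // 11 = 10 digits + '\n'
    index.Read(digits, 0, digits.Length);
    return long.Parse(Encoding.ASCII.GetString(digits));
}

Appending to the log then only requires appending the new EOL positions to the index, keeping both files in step.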
The List object has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx

Searching for strings in big binary files

I have a big binary file (1 MB < size < 50 MB). I need to search for a string and extract the subsequent four bytes (which is the {size,offset} of actual data in another file). What is the most efficient way to do it so that the search would be fastest?
EDIT: The strings in the index files are in sorted order.
Look up the Boyer–Moore string search algorithm.
Store the {string, size, offset} tuples in sorted order (by string) and use a binary search for the string.
You might also store, at the start of the file, offsets for each first letter of strings. For example if strings starting with 'a' began at position 120 and those starting with 'b' began at position 2000 in the file you could start the file with something like 120, 2000, ...
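A sketch of writing such an index together with the first-letter offset table; the length-prefixed record layout and the WriteIndex name are illustrative assumptions, not a known format:

using System.Collections.Generic;
using System.IO;
using System.Text;

// Writes sorted (key, size, offset) entries and records where each first
// letter's block begins, so a later lookup can seek straight to that block
// and binary search only within it.
static Dictionary<char, long> WriteIndex(string path,
    IEnumerable<(string Key, uint Size, uint Offset)> sortedEntries)
{
    var firstLetterOffsets = new Dictionary<char, long>();
    using var writer = new BinaryWriter(File.Create(path), Encoding.ASCII);
    foreach (var entry in sortedEntries)
    {
        if (!firstLetterOffsets.ContainsKey(entry.Key[0]))
            firstLetterOffsets[entry.Key[0]] = writer.BaseStream.Position;   // e.g. 'a' -> 120, 'b' -> 2000
        writer.Write(entry.Key);        // length-prefixed string
        writer.Write(entry.Size);       // 4 bytes
        writer.Write(entry.Offset);     // 4 bytes
    }
    return firstLetterOffsets;          // could be serialized at the start of the file
}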
If the encoding is fixed (ASCII) it is relatively simple. Open a binary stream, read byte by byte and match against the first char of the target string.
If you have strings using another (UTF-8) encoding it gets trickier.
First, use memory mapping on the file. It will be a lot more efficient than reading it into RAM because instead of two copies (one in your program and one in file cache) there is only one copy.
If each string is a fixed length then a binary search is very easy because you can treat the memory as an array of character arrays.
If each string is variable length but 0 terminated, then you can use a variant of binary search where you jump to the middle of the string list, search for the next 0, then test the next string after that. Then jump forward or back to 1/4 or 3/4 of the string list and repeat.
If each string is variable length in Pascal style, with a byte count at the beginning it is trickier. A linear search from the beginning is not too slow, for infrequent searches. If you're looking for exact string matches, do not forget that you can skip most strings just by checking that the lengths do not match.
If you have to search the list often then building an array of char pointers to the string list would again make binary search really easy. If this file is really an index file for fast searches then it probably already has this in it somewhere, unless the designer intended to build a char pointer array while loading the file.
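A sketch of the memory-mapped, fixed-length-record case using System.IO.MemoryMappedFiles; the 16-byte key plus 4-byte payload layout is an assumption for illustration:

using System.IO.MemoryMappedFiles;
using System.Text;

// Memory-maps the index and binary searches it as an array of fixed-length
// records: a zero-padded 16-byte key followed by 4 bytes of {size,offset} data.
class MappedIndex
{
    const int KeyLength = 16;
    const int RecordLength = KeyLength + 4;

    public static long FindRecord(string path, string key)
    {
        using var mmf = MemoryMappedFile.CreateFromFile(path);
        using var view = mmf.CreateViewAccessor();
        long recordCount = new System.IO.FileInfo(path).Length / RecordLength;

        var keyBytes = new byte[KeyLength];
        long lo = 0, hi = recordCount - 1;
        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            view.ReadArray(mid * RecordLength, keyBytes, 0, KeyLength);
            string candidate = Encoding.ASCII.GetString(keyBytes).TrimEnd('\0');
            int cmp = string.CompareOrdinal(candidate, key);
            if (cmp == 0) return mid;           // index of the matching record
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;                              // not found
    }
}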
