I want to write a function such that if I call it with an argument of 100 and the path of a particular file, it gets the first 100 KB of data from the file, and when I call it a second time with 200, it returns the next 200 KB of data, leaving out the first 100. If there is nothing left in the file it should return 0.
Thanks
Most of what you want is handled by System.IO.File and FileStream; you just need a small wrapper if you want that exact function signature.
Step 1) You need to open a file at a particular path. System.IO.File has a few methods for doing this, including Open and OpenRead, as well as the ReadAllXXXX methods, allowing you to access the file contents in multiple ways. The one you probably want is OpenRead, which returns a FileStream object.
Step 2) You need to read a certain number of bytes. Once you have the FileStream from step 1, look at the Stream.Read method: given an array of bytes, an offset, and a count, it reads up to the specified number of bytes from the stream into the array.
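Putting those two steps together, here's a minimal sketch (the ChunkReader name and shape are my own, not a BCL type): it keeps the FileStream open between calls, so each Read picks up where the previous one left off.

using System;
using System.IO;

public sealed class ChunkReader : IDisposable
{
    private readonly FileStream _stream;

    public ChunkReader(string path)
    {
        _stream = File.OpenRead(path);
    }

    // Returns up to sizeInKb kilobytes: fewer near the end of the file,
    // and an empty array once the file is exhausted.
    public byte[] Read(int sizeInKb)
    {
        byte[] buffer = new byte[sizeInKb * 1024];
        int total = 0;
        while (total < buffer.Length)
        {
            int read = _stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break; // end of file
            total += read;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }

    public void Dispose() => _stream.Dispose();
}

With this, reader.Read(100) returns the first 100 KB, a following reader.Read(200) returns the next 200 KB, and an empty array signals the end of the file.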
You may look at the StreamReader Class. It may get you to where you want to go, though I'm not sure of how specifically to break it down into the kb chunks you want.
This is a follow up question to this question:
Difference between file path and file stream?
I didn't fully understand everything answered in the linked question.
I am using the Microsoft.SqlServer.Dac.BacPackage which contains a Load method with 2 overloads - one that receives a string path and one that receives a Stream.
This is the documentation of the Load method:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.sqlserver.dac.bacpackage.load?view=sql-dacfx-150
What exactly is the difference between the two? Am I correct in assuming that the string-path overload loads the whole file into memory first, while the Stream overload doesn't? Are there other differences?
No, the file will not usually be fully loaded all at once.
A string path parameter normally means it will just open the file as a FileStream and pass it to the other version of the function. There is no reason why the stream should fully load the file into memory unless requested.
A Stream parameter means you open the file and pass the resulting Stream. You could also pass any other type of Stream, such as a network stream, a zip or decryption stream, a memory-backed stream, anything really.
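For illustration, here's the shape such overload pairs usually take - a hypothetical sketch, not the actual BacPackage source (the Package and PackageLoader names are mine):

using System;
using System.IO;

public class Package { /* hypothetical result type */ }

public static class PackageLoader
{
    public static Package Load(string path)
    {
        // The path overload just opens the file and defers to the Stream overload
        using (FileStream stream = File.OpenRead(path))
        {
            return Load(stream);
        }
    }

    public static Package Load(Stream input)
    {
        // ... the real work happens here, reading from the stream as needed ...
        throw new NotImplementedException();
    }
}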
Short answer:
The fact that you have two methods, one that accepts a filename and one that accepts a stream is just for convenience. Internally, the one with the filename will open the file as a stream and call the other method.
Longer answer
You can consider a stream as a sequence of bytes. The reason to use a stream instead of a byte[] or List&lt;byte&gt; is that if the sequence is really, really large, and you don't need access to all the bytes at once, it would be a waste to put all of them in memory before processing them.
For instance, if you want to calculate the checksum of all the bytes in a file, you don't need to put all the data in memory before you can start calculating the sum. In fact, anything that can efficiently deliver the bytes one by one would suffice.
That is the reason why people would want to read a file as a stream.
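A minimal sketch of that idea, with a plain byte sum standing in for a real checksum: memory use stays at one small buffer no matter how large the file is.

using System.IO;

static long SumBytes(Stream input)
{
    byte[] buffer = new byte[8192]; // one fixed-size buffer, reused
    long sum = 0;
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
            sum += buffer[i];
    }
    return sum;
}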
The reason why people want a stream as input for their data is that it gives the caller the opportunity to specify the source of the data: callers can provide a stream that reads from a file, but also a stream with data from the internet, or from a database, or from a TextBox; the procedure does not care, as long as it can read the bytes one by one, or sometimes per chunk of bytes:
using (Stream fileStream = File.OpenRead(fileName))
{
    ProcessInputData(fileStream);
}
Or:
byte[] bytesToProcess = ...
using (Stream memoryStream = new MemoryStream(bytesToProcess))
{
    ProcessInputData(memoryStream);
}
Or:
string operatorInput = this.textBox1.Text;
// MemoryStream holds bytes, so the text has to be encoded first
using (Stream memoryStream = new MemoryStream(Encoding.UTF8.GetBytes(operatorInput)))
{
    ProcessInputData(memoryStream);
}
Conclusion
Methods use streams in their interface to indicate that they don't need all the data in memory at once: one by one, or per chunk, is enough. The caller is free to decide where the data comes from.
FileStream.Read() returns the number of bytes read, but... is there any situation, other than having reached the end of the file, in which it will read fewer bytes than requested and not throw an exception?
the documentation says:
The Read method returns zero only after reaching the end of the stream. Otherwise, Read always reads at least one byte from the stream before returning. If no data is available from the stream upon a call to Read, the method will block until at least one byte of data can be returned. An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
But this doesn't quite explain in what situations data would be unavailable and cause the method to block until it can read again. I mean, shouldn't most situations where data is unavailable cause an exception instead?
What are real situations where comparing the number of bytes read against the number of expected bytes could differ (assuming that we're already checking for end of file when we mention number of bytes expected)?
EDIT: A bit more information. The reason I'm asking is that I've come across a bit of code where the developer did something like this:
int bytesExpected = (remainingBytesInFile > 94208) ? 94208 : remainingBytesInFile;
int bytesRead = 0;
while (bytesRead < bytesExpected)
{
    bytesRead += fileStream.Read(buffer, bytesRead, bytesExpected - bytesRead);
}
Now, I can't see any advantage to having this while loop at all; I'd expect it to throw an exception if it can't read the number of bytes expected (bearing in mind it's already taking into account that there are that many bytes left to read).
What reason could one possibly have for something like this? I'm sure I'm missing something.
The documentation is for Stream.Read; FileStream derives from Stream. Since FileStream is a stream, it should obey the stream contract. Not all streams do, but unless you have a very good reason, you should stick to it.
In a typical file stream, you'll only get a return value smaller than count when you reach the end of file (and it's a pretty simple way of checking for the end of file).
However, in a NetworkStream, for example, you keep reading in a loop until the method returns zero - signalling the end of stream. The same works for file streams - you know you're at the end of the file when Read returns zero.
Most importantly, FileStream isn't just for what you'd consider files - it's also for pseudo-files like standard input/output pipes and COM ports (try opening a FileStream on PRN, for example). In that case, you're not reading a file with a fixed length, and the behaviour is the same as with NetworkStream.
Finally, don't forget that FileStream isn't sealed. It's perfectly fine for you to implement a virtualized file system, for example - and it's perfectly fine if your virtualized file system doesn't support seeking, or checking the length of file.
EDIT:
To address your edit: this is exactly how you're supposed to read any stream, and there's nothing wrong with it. If there's nothing left to read in a stream, the Read method will simply return 0, and you know the stream is over. The only thing is, it seems the developer is trying to fill the buffer completely, one buffer at a time - this only makes sense if you explicitly need to partition the file into 94208-byte chunks and pass each byte[] somewhere for further processing.
If that's not the case, you don't really need to fill the whole buffer - you just keep reading (and probably writing on some other side) until Read returns 0. And indeed, by default, FileStream will always fill the whole buffer unless it's built around a pipe handle - but since that is a possibility, you shouldn't rely on the "real file" behaviour. So as long as you need those byte[] chunks for something non-stream (e.g. parsing messages), this is entirely fine. If you're only using the stream as an actual stream, streaming the data somewhere else, there's no point to it - you only need one while loop to read the file.
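To make the pattern concrete, here's a sketch of that fill-the-buffer loop as a reusable helper (the ReadFully name is mine): it loops until the requested count is read or the stream ends, and only returns fewer bytes than requested at the end of the stream.

using System.IO;

static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
{
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(buffer, offset + total, count - total);
        if (read == 0) break; // end of stream
        total += read;
    }
    return total;
}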
Your expectations would only apply when the stream is reading data from a zero-latency source. Other I/O sources can be slow, which is why the Read method may not always be able to return immediately. That doesn't mean there is an error (so no exception), just that it has to wait for data to arrive.
Examples: network stream, file stream on slow disk, etc.
(UPDATE, HDD example) To give an example specific to files (since your case is FileStream, although Read is defined on Stream, so all implementations should fulfill the requirements): mechanical hard drives go to "sleep" when not active (especially on battery-powered devices, read: laptops). Spinning up can take a second or so. That is not an IOException, but your read would have to wait a second before any data arrives.
The simple answer is that with a FileStream it probably never happens.
However, keep in mind that the Read method is inherited from Stream, which serves as the base for many other streams, like NetworkStream, and in that case you may not be able to read as many bytes as you requested, simply because they haven't been received from the network yet.
So, like the documentation says, it all depends on the implementation of the specific type of stream - FileStream, NetworkStream, etc.
EDIT 1:
I'm building a torrent application: downloading from different clients simultaneously, where each download represents a portion of my file and different clients have different portions.
After a download completes, I need to know which portion to fetch next by finding "empty" portions in my file.
One way to create a file with a fixed size:
File.WriteAllBytes(@"C:\upload\BigFile.rar", new byte[bigSize]); // bigSize = total file size
My portion array that represents my file as portions:
BitArray TorrentPartsState = new BitArray(10);
For example:
File size is 100.
TorrentPartsState[0] = true;  // means that in my file, positions 0 through 9 are already filled - I **don't** need to fill in that information.
TorrentPartsState[1] = false; // means that in my file, positions 10 through 19 still **need** to be filled in.
I'm searching for an effective way to persist what the BitArray contains even if the computer/application shuts down. One way I thought of is an XML file, updated each time a portion completes.
I don't think that's a smart or effective solution. Any ideas for another one?
It sounds like you know the following when you start a transfer:
The size of the final file.
The (maximum) number of streams you intend to use for the file.
Create the output file and allocate the required space.
Create a second "control" file with a related filename, e.g. add your own extension. In that file, maintain an array of stream status structures corresponding to the network streams. Each status consists of the starting offset and the number of bytes transferred. Periodically flush the stream buffers and then update the control file to reflect the progress made and committed.
Variations on the theme:
The control file can define segments to be transferred, e.g. 16MB chunks, and treated as a work queue by threads that look for an incomplete segment and a suitable server from which to retrieve it.
The control file could be a separate fork within the result file. (Who am I kidding?)
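As a sketch of the control-file idea (the record layout here is an assumption, not a standard format): one fixed-size record per network stream, rewritten whenever progress is committed.

using System.IO;

struct StreamStatus
{
    public long StartOffset;      // where this stream writes in the output file
    public long BytesTransferred; // progress committed so far
}

static void SaveControlFile(string controlPath, StreamStatus[] statuses)
{
    using (var writer = new BinaryWriter(File.Open(controlPath, FileMode.Create)))
    {
        foreach (var s in statuses)
        {
            writer.Write(s.StartOffset);
            writer.Write(s.BytesTransferred);
        }
    }
}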
You could use a BitArray (in System.Collections).
Then, when you visit an offset in the file, you can set the bit at that offset in the BitArray to true.
So for your 10,000 byte file:
BitArray ba = new BitArray(10000);
// Visited offset, mark in the BitArray
ba[4] = true;
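If you need that progress to survive a restart, here's a minimal sketch (the file name and layout are assumptions): pack the bits into bytes and rewrite the small file each time a portion completes.

using System.Collections;
using System.IO;

static void SaveProgress(BitArray bits, string path)
{
    byte[] bytes = new byte[(bits.Length + 7) / 8];
    bits.CopyTo(bytes, 0);           // pack the bits into bytes
    File.WriteAllBytes(path, bytes); // one small write per update
}

static BitArray LoadProgress(string path)
{
    // Note: the loaded length is rounded up to a multiple of 8 bits
    return new BitArray(File.ReadAllBytes(path));
}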
Implement a file system (like on a disk) inside your file - just use something simple; there should be something available in the FOSS arena.
If I have a single MemoryStream to which I know I wrote multiple files (say, 5 files), is it possible to read from this MemoryStream and break it apart file by file?
My gut tells me no, since when we Read, we are reading byte by byte... Any help, and possibly a snippet, would be great. I haven't been able to find anything on Google or here :(
You can't directly - not unless you delimit the files in some way or know the exact size of each file as it was put into the buffer.
You can use a compressed file such as a zip file to transfer multiple files instead.
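For instance, a sketch using System.IO.Compression (the paths parameter is hypothetical): the archive keeps the per-file boundaries and names for you.

using System.IO;
using System.IO.Compression;

static void WriteZip(Stream output, string[] paths)
{
    using (var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
    {
        foreach (string path in paths)
            zip.CreateEntryFromFile(path, Path.GetFileName(path));
    }
    // Reading back: new ZipArchive(input, ZipArchiveMode.Read), then iterate zip.Entries
}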
A stream is just a line of bytes. If you put the files next to each other in the stream, you need to know how to separate them. That means you must know the lengths of the files, or you should have used some separator. Some (most) file types have a kind of header, but searching for headers in an entire stream may not be reliable either, since the header of one file could just as well be data inside another file.
So, if you need to write files to such a stream, it is wise to add some extra information. For instance, start with a version number, then write the size of the first file, write the file itself, then write the size of the next file, and so on.
By starting with a version number, you can make alterations to this format. In the future you may decide you need to store the file name as well. In that case, you can increase the version number, make up a new format, and still be able to read streams that you created earlier.
This is of course especially useful if you store these streams too.
Since you're sending them, you'll have to send them into the stream in such a way that you'll know how to pull them out. The most common way of doing this is to use a length specification. For example, to write the files to the stream:
write an integer to the stream to indicate the number of files
Then for each file,
write an integer (or a long if the files are large) to indicate the number of bytes in the file
write the file
To read the files back,
read an integer (n) to determine the number of files in the stream
Then, iterating n times,
read an integer (or long if that's what you chose) to determine the number of bytes in the file
read the file
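A minimal sketch of that length-prefix scheme, using BinaryWriter/BinaryReader for the integer framing (int lengths assumed; use long for large files):

using System.IO;

static void WriteFiles(Stream output, byte[][] files)
{
    var writer = new BinaryWriter(output);
    writer.Write(files.Length);        // number of files
    foreach (byte[] file in files)
    {
        writer.Write(file.Length);     // byte count of this file
        writer.Write(file);            // the file contents
    }
    writer.Flush();
}

static byte[][] ReadFiles(Stream input)
{
    var reader = new BinaryReader(input);
    int count = reader.ReadInt32();    // number of files
    var files = new byte[count][];
    for (int i = 0; i < count; i++)
    {
        int length = reader.ReadInt32();
        files[i] = reader.ReadBytes(length);
    }
    return files;
}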
You could use an IEnumerable<Stream> instead.
You need to implement this yourself. What you want to do is write some sort of delimiter into the stream; as you're reading, look for that delimiter, and you'll know when you have hit a new file.
Here's a quick and dirty example:
byte[] delimiter = System.Text.Encoding.Default.GetBytes("++MyDelimiter++");
ms.Write(myFirstFile, 0, myFirstFile.Length);
ms.Write(delimiter, 0, delimiter.Length);
ms.Write(mySecondFile, 0, mySecondFile.Length);
....
byte[] buffer = new byte[delimiter.Length];
int len;
do {
    len = ms.Read(buffer, 0, delimiter.Length);
    if (buffer.SequenceEqual(delimiter)) // needs using System.Linq;
    {
        // Close the current output file and open a new one
    }
    else
    {
        // Write the bytes read to the current output file
    }
} while (len > 0);
Is there a library that I can use to perform a binary search on a very big text file (it can be 10 GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code for how to do it, but I gave up, since it might seem condescending. You probably know how to write a binary search; it's really not complicated.
You won't find it in a library, for two reasons:
It's not really a "binary search" - the line sizes differ, so you need to adapt the algorithm (e.g. seek to the middle of the file, then look for the next newline and consider that the "middle").
Your datetime log format is most likely non-standard (OK, it may look "standard", but think a bit... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message).
In summary - I think your need is too specific, and too simple to implement in custom code, for anyone to bother writing a library for it :)
As the lines are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter, e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it is the same as the last seek. You may find that, as you home in on your target record, moving halfway won't get you to a different record. E.g. if you have adjacent records of 100 bytes and 50 bytes respectively, jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
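Here's a sketch of that seek-and-align approach (treating the line's leading text as an ordinal-comparable key is an assumption for illustration). It returns the offset of the first line that is >= the key.

using System.IO;
using System.Text;

static long FindFirstLineAtOrAfter(FileStream file, string key)
{
    long lo = 0, hi = file.Length;
    while (lo < hi)
    {
        long mid = (lo + hi) / 2;
        long lineStart = StartOfLine(file, mid);
        string line = ReadLineAt(file, lineStart);
        if (string.CompareOrdinal(line, key) < 0)
            lo = file.Position; // ReadLineAt left us at the start of the next line
        else
            hi = lineStart;
    }
    return lo;
}

// Scan backwards from pos to the byte just after the previous newline.
static long StartOfLine(FileStream file, long pos)
{
    while (pos > 0)
    {
        file.Seek(pos - 1, SeekOrigin.Begin);
        if (file.ReadByte() == '\n') break;
        pos--;
    }
    return pos;
}

// Read one line starting at the given offset.
static string ReadLineAt(FileStream file, long start)
{
    file.Seek(start, SeekOrigin.Begin);
    var sb = new StringBuilder();
    int b;
    while ((b = file.ReadByte()) != -1 && b != '\n')
        sb.Append((char)b);
    return sb.ToString();
}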
You would need to be able to stream the file, but you would also need random access. I'm not sure how you'd accomplish this without a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream for the file and use the Seek method to move around in it, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are all the same number of bytes; otherwise, you would jump into the middle of lines.
Something like
byte[] buffer = new byte[lineLength];                       // assumes fixed-length lines
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin); // jump straight to line N
stream.Read(buffer, 0, lineLength);                         // read exactly one line
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends on how long a line of text is on average; given 1000 bytes per line, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List<long>
Binary search the List with a custom comparer that seeks to the file offset and reads the data.
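A sketch of that two-step plan (the ordinal key compare and the ReadLineAt helper are assumptions for illustration):

using System.Collections.Generic;
using System.IO;
using System.Text;

// One pass to record the offset of every line start...
static List<long> IndexLineStarts(FileStream file)
{
    var offsets = new List<long> { 0 }; // the first line starts at offset 0
    int b;
    while ((b = file.ReadByte()) != -1)
        if (b == '\n') offsets.Add(file.Position);
    return offsets;
}

// ...then a binary search over those offsets, reading one line per probe.
static long FindLineOffset(FileStream file, List<long> offsets, string key)
{
    int lo = 0, hi = offsets.Count - 1;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (string.CompareOrdinal(ReadLineAt(file, offsets[mid]), key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return offsets[lo]; // offset of the first line >= key
}

// Hypothetical helper: seek to the offset and read up to the next newline.
static string ReadLineAt(FileStream file, long offset)
{
    file.Seek(offset, SeekOrigin.Begin);
    var sb = new StringBuilder();
    int b;
    while ((b = file.ReadByte()) != -1 && b != '\n')
        sb.Append((char)b);
    return sb.ToString();
}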
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime parts plus their positions in the original file (this is why it has to be pretty static), and encode them somehow (for example: unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits)). This way you will have a file with consistent "lines".
perform a binary search on that file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. But as log files tend to be append-only, and sorted, you can speed up the whole thing by creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - that way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
The List<T> class has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx