I have an extremely large 2D byte array in memory:
byte[][] MyBA = new byte[int.MaxValue][]; // int.MaxValue rows of 10 bytes each
Is there any way (probably unsafe) that I can fool C# into thinking this is one huge continuous byte array? I want to do this such that I can pass it to a MemoryStream and then a BinaryReader.
MyReader = new BinaryReader(MemoryStream(*MyBA)) //Syntax obviously made-up here
I do not believe .NET provides this, but it should be fairly easy to write your own implementation of System.IO.Stream that seamlessly switches between backing arrays. Here are the (untested) basics:
public class MultiArrayMemoryStream : System.IO.Stream
{
    byte[][] _arrays;
    long _position;
    int _arrayNumber;
    int _posInArray;

    public MultiArrayMemoryStream(byte[][] arrays) {
        _arrays = arrays;
        _position = 0;
        _arrayNumber = 0;
        _posInArray = 0;
    }

    public override int Read(byte[] buffer, int offset, int count) {
        int read = 0;
        while (read < count) {
            if (_arrayNumber >= _arrays.Length) {
                return read;
            }
            if (count - read <= _arrays[_arrayNumber].Length - _posInArray) {
                Buffer.BlockCopy(_arrays[_arrayNumber], _posInArray, buffer, offset + read, count - read);
                _posInArray += count - read;
                _position += count - read;
                read = count;
            } else {
                Buffer.BlockCopy(_arrays[_arrayNumber], _posInArray, buffer, offset + read, _arrays[_arrayNumber].Length - _posInArray);
                read += _arrays[_arrayNumber].Length - _posInArray;
                _position += _arrays[_arrayNumber].Length - _posInArray;
                _arrayNumber++;
                _posInArray = 0;
            }
        }
        return count;
    }

    public override long Length {
        get {
            long res = 0;
            for (int i = 0; i < _arrays.Length; i++) {
                res += _arrays[i].Length;
            }
            return res;
        }
    }

    public override long Position {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }

    public override bool CanRead {
        get { return true; }
    }

    public override bool CanSeek {
        get { return false; }
    }

    public override bool CanWrite {
        get { return false; }
    }

    public override void Flush() {
    }

    public override long Seek(long offset, SeekOrigin origin) {
        throw new NotSupportedException();
    }

    public override void SetLength(long value) {
        throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count) {
        throw new NotSupportedException();
    }
}
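With that in place, the jagged array from the question can be handed straight to a BinaryReader, which is what you asked for (assuming MyBA is the byte[][] in question):

var stream = new MultiArrayMemoryStream(MyBA);
var reader = new BinaryReader(stream);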
Another way to work around the size limitation of 2^31 bytes is UnmanagedMemoryStream, which implements System.IO.Stream on top of an unmanaged memory buffer (which may be as large as the OS supports). Something like this might work (untested; note that the byte* constructor requires an unsafe context):
var fileStream = new FileStream("data",
                                FileMode.Open,
                                FileAccess.Read,
                                FileShare.Read,
                                16 * 1024,
                                FileOptions.SequentialScan);
long length = fileStream.Length;
IntPtr buffer = Marshal.AllocHGlobal(new IntPtr(length));
var memoryStream = new UnmanagedMemoryStream((byte*)buffer.ToPointer(), length, length, FileAccess.ReadWrite);
fileStream.CopyTo(memoryStream);
memoryStream.Seek(0, SeekOrigin.Begin);
// work with the UnmanagedMemoryStream
Marshal.FreeHGlobal(buffer);
Agreed. In any case, you are still bound by the size limit of a single array.
If you really need to operate on huge arrays through a stream, write your own custom memory stream class.
I think you can use a linear structure instead of a 2D structure using the following approach.
Instead of having byte[int.MaxValue][10] you can have byte[int.MaxValue*10]. You would address the item at [4,5] as 10*(4-1)+(5-1) (a general formula would be (i-1)*numberOfColumns+(j-1)).
Of course you could use the other convention.
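In code, the same row-major mapping (using 0-based indices) would look something like this; note that the flat index has to be a long, since int.MaxValue rows of 10 bytes exceed the range of int:

static long FlatIndex(long row, long col, long columns)
{
    // Row-major layout: each row occupies 'columns' consecutive elements.
    return row * columns + col;
}
// e.g. with 10 columns, the element at [4, 5] lands at index 4 * 10 + 5 = 45.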
If I understand your question correctly, you've got a massive file that you want to read into memory and then process. But you can't do this because the amount of data in the file exceeds that of any single-dimensional array.
You mentioned that speed is important, and that you have multiple threads running in parallel to process the data as quickly as possible. If you're going to have to partition the data for each thread anyway, why not base the number of threads on the number of byte[int.MaxValue] buffers required to cover everything?
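A rough sketch of that idea, assuming the data is already split into a byte[][] (one inner array per chunk) and ProcessChunk stands in for whatever per-chunk work your threads do (both names are illustrative):

// Requires System.Threading.Tasks (Framework 4.0): one unit of work per buffer.
Parallel.For(0, chunks.Length, i =>
{
    ProcessChunk(chunks[i]);
});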
You can create a MemoryStream and then write the array into it row by row using the Write method.
EDIT:
The limit of a MemoryStream is ultimately the amount of memory available to your application, and since it is backed by a single byte[], it also cannot exceed 2 GB. If you need more memory than that, you should consider modifying your overall architecture: e.g. you could process your data in chunks, or you could use a swapping mechanism to a file.
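A rough sketch of that row-by-row copy (assuming MyBA is the byte[][] from the question and the total size fits in a single MemoryStream):

var ms = new MemoryStream();
foreach (byte[] row in MyBA)
{
    ms.Write(row, 0, row.Length);  // append each row to the stream
}
ms.Seek(0, SeekOrigin.Begin);      // rewind before handing it to a reader
var reader = new BinaryReader(ms);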
If you are using Framework 4.0, you have the option of working with a MemoryMappedFile. Memory mapped files can be backed by a physical file, or by the Windows swap file. Memory mapped files act like an in-memory stream, transparently swapping data to/from the backing storage if and when required.
If you are not using Framework 4.0, you can still use this option, but you will need to either write your own wrapper or find an existing one. I expect there are plenty on The Code Project.
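A minimal sketch with the Framework 4.0 API in System.IO.MemoryMappedFiles (the file name is illustrative):

// Map the file and read from it through a stream view; pages are loaded on demand.
using (var mmf = MemoryMappedFile.CreateFromFile("data.bin", FileMode.Open))
using (var view = mmf.CreateViewStream())
using (var reader = new BinaryReader(view))
{
    ushort first = reader.ReadUInt16();
    // ... continue reading as needed
}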
I'm extending BinaryWriter using a MemoryStream.
public class PacketWriter : BinaryWriter
{
    public PacketWriter(Opcode op) : base(CreateStream(op))
    {
        this.Write((ushort)op);
    }

    private static MemoryStream CreateStream(Opcode op) {
        return new MemoryStream(PacketSizes.Get(op));
    }

    public void WriteCustomThing() {
        // Validate that MemoryStream has space?
        // Do all the stuff
    }
}
Ideally, I want to write using PacketWriter as long as there is space available (which is already defined in PacketSizes). If there isn't space available, I want an exception thrown. It seems like MemoryStream just dynamically allocates more space if you write over capacity, but I want a fixed capacity. Can I achieve this without needing to check the length every time? The only solution I've thought of so far is to override all the Write methods of BinaryWriter and compare lengths, but this is annoying.
Just provide a buffer of the desired size to write into:
using System;
using System.IO;
class Test
{
    static void Main()
    {
        var buffer = new byte[3];
        var stream = new MemoryStream(buffer);
        stream.WriteByte(1);
        stream.WriteByte(2);
        stream.WriteByte(3);
        Console.WriteLine("Three successful writes");
        stream.WriteByte(4); // This throws
        Console.WriteLine("Four successful writes??");
    }
}
This is documented behavior:
Initializes a new non-resizable instance of the MemoryStream class based on the specified byte array.
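Applied to the PacketWriter from the question, that just means allocating the fixed-size buffer in CreateStream (a sketch, assuming PacketSizes.Get(op) returns the intended capacity):

private static MemoryStream CreateStream(Opcode op) {
    // Passing a byte[] (rather than an int capacity) makes the stream non-resizable,
    // so writing past PacketSizes.Get(op) bytes throws instead of growing the buffer.
    return new MemoryStream(new byte[PacketSizes.Get(op)]);
}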
I'm inheriting the BinaryReader class.
I have to override some essential methods like ReadUInt16.
The internal implementation of this method is:
public virtual ushort ReadUInt16() {
    FillBuffer(2);
    return (ushort)(m_buffer[0] | m_buffer[1] << 8);
}
The binary files I'm reading from are organized as high byte first (big endian), and I've inherited from the BinaryReader also because I had to add some more functionality.
Anyway I want to implement the swapping in the subclass itself.
Is there a way to access m_buffer, or some alternative, without resorting to reflection or other resource-consuming approaches?
Maybe I should override FillBuffer and back up the peeked bytes? Or maybe just ignore it? Would that have side effects? Has anyone faced this before? Can anyone explain why FillBuffer is not internal? Is it required to always fill the buffer, or can it be skipped? And since it isn't internal, why wasn't a protected getter for the m_buffer field provided along with it?
Here's the implementation of FillBuffer.
protected virtual void FillBuffer(int numBytes) {
    if (m_buffer != null && (numBytes < 0 || numBytes > m_buffer.Length)) {
        throw new ArgumentOutOfRangeException("numBytes",
            Environment.GetResourceString("ArgumentOutOfRange_BinaryReaderFillBuffer"));
    }
    int bytesRead = 0;
    int n = 0;
    if (m_stream == null) __Error.FileNotOpen();
    // Need to find a good threshold for calling ReadByte() repeatedly
    // vs. calling Read(byte[], int, int) for both buffered & unbuffered
    // streams.
    if (numBytes == 1) {
        n = m_stream.ReadByte();
        if (n == -1)
            __Error.EndOfFile();
        m_buffer[0] = (byte)n;
        return;
    }
    do {
        n = m_stream.Read(m_buffer, bytesRead, numBytes - bytesRead);
        if (n == 0) {
            __Error.EndOfFile();
        }
        bytesRead += n;
    } while (bytesRead < numBytes);
}
Not a good idea to try accessing the internal buffer. Why not do something like:
var val = base.ReadUInt16();
return (ushort)((val << 8) | ((val >> 8) & 0xFF));
Slightly slower than reading from the buffer directly, but I seriously doubt that this is going to make a material impact on the overall speed of your application.
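In context, the override might look like this (the same swap, just wrapped in the method):

public override ushort ReadUInt16()
{
    var val = base.ReadUInt16();                        // reads the two bytes as little-endian
    return (ushort)((val << 8) | ((val >> 8) & 0xFF));  // swap to interpret them as big-endian
}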
FillBuffer is apparently an implementation detail that, for some reason, the Framework team decided to make protected, possibly because some other Framework class takes advantage of the internal workings of BinaryReader. Since all it does is fill the internal buffer, and your derived class doesn't have access to that buffer, I'd suggest you ignore that method if you decide to rewrite the reading implementation yourself. Calling it can't do you any good, and could do you great harm.
You might be interested in a series of articles I wrote some years ago, in which I implemented a BinaryReaderWriter class, which is essentially a BinaryReader and BinaryWriter joined together, and allows you random read/write access to the underlying stream.
I'm having an issue where I'm corrupting a PDF and am not sure of a proper solution. I've seen several posts from people trying to just do a basic stream or trying to modify the file with a third-party library. This is how my situation differs...
I have all the web pieces in place to get me the PDF streamed back and it works fine until I try to modify it with C#.
I've modified the PDF in a text editor manually to remove the <> entries and tested that the PDF functions properly after that.
I've then programmatically streamed the PDF in as a byte[] from the database, converted it to a string, and used a RegEx to find and remove the same stuff I removed manually.
THE PROBLEM! When I try to convert the modified PDF string contents back into a byte[] to stream back, the PDF encoding no longer seems to be correct. What is the correct encoding?
Does anyone know the best way to do something like this? I'm trying to keep my solution as light as possible because our site is geared towards PDF document access, so heavy or complex APIs are not preferable unless no other options are available. Also, because this situation really only arises when our users view the file in an iframe for "preview", I can't permanently modify the PDF.
Thanks for your help in advance!
Try using the following BinaryEncoding class as the encoding. It basically casts all bytes to chars (and back), so only ASCII data can be meaningfully processed as a string, but the rest of the data is kept unchanged and nothing is lost, as long as you don't introduce any Unicode characters above 0x00FF. So for your round trip it should work just fine.
public class BinaryEncoding : Encoding {
    private static readonly BinaryEncoding @default = new BinaryEncoding();

    public static new BinaryEncoding Default {
        get {
            return @default;
        }
    }

    public override int GetByteCount(char[] chars, int index, int count) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (charCount < 0) {
            throw new ArgumentOutOfRangeException("charCount");
        }
        unchecked {
            for (int i = 0; i < charCount; i++) {
                bytes[byteIndex + i] = (byte)chars[charIndex + i];
            }
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (byteCount < 0) {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        unchecked {
            for (int i = 0; i < byteCount; i++) {
                chars[charIndex + i] = (char)bytes[byteIndex + i];
            }
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount) {
        return byteCount;
    }
}
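For the round trip described in the question, usage would look roughly like this (the helper and pattern names are illustrative placeholders, not part of your code):

byte[] original = GetPdfBytesFromDatabase();                  // hypothetical helper
string text = BinaryEncoding.Default.GetString(original);     // bytes -> chars, values preserved
string modified = Regex.Replace(text, yourPattern, "");       // your existing RegEx edit
byte[] result = BinaryEncoding.Default.GetBytes(modified);    // chars -> bytes, values preserved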
You seem to be discovering that...
the PDF format is not trivial!
While it may be OK (if kludgey) to patch a few "text" bytes in situ (i.e. keeping size and structure unchanged), "messing" much more than that with PDF files typically ends up breaking them. Regular expressions certainly seem to be a blunt tool for the job.
The PDF file needs to be parsed and seen as a hierarchical collection of objects (and then some...), and that's why we need libraries which encapsulate the knowledge of the format.
If you need convincing, you may peruse the now-ISO-standard specification for the PDF format (version 1.7), available for free on Adobe's web site. By the way, these 750 pages cover the latest version; while there is much overlap, previous versions introduce yet another layer of details to contend with...
Edit:
That said, on re-reading the question and Lucero's remark, the changes indicated do seem small and safe enough that a "snip and tuck" approach may work.
Beware that this type of approach may lead to issues over time (when the format encountered is of a different version, older or newer, or when the file content somehow causes different structures to be exposed, or...), or with some specific uses (for example, it may prevent users from using some features of the PDF documents, such as forms or security). Maybe a compromise is to learn enough about the format(s) at hand to confirm that the changes are indeed benign.
Also... while the PDF format is a relatively complicated affair, the libraries that deal with it are not necessarily heavy, and they are typically easy to use.
In short, you'll need to weigh the benefits and drawbacks of both approaches and pick accordingly ;-) (how was that for a "non-answer").
Look into iText. There is a reason why things like the Apache Commons libraries exist.
I have a class that inherits from MemoryStream in order to provide some buffering. The class works exactly as expected but every now and then I get an InvalidOperationException during a Read with the error message being
Collection was modified; enumeration operation may not execute.
My code is below and the only line that enumerates a collection would seem to be:
m_buffer = m_buffer.Skip(count).ToList();
However, I have that line and every other operation that can modify the m_buffer object inside locks, so I'm mystified as to how a Write operation could interfere with a Read and cause that exception.
public class MyMemoryStream : MemoryStream
{
    private ManualResetEvent m_dataReady = new ManualResetEvent(false);
    private List<byte> m_buffer = new List<byte>();

    public override void Write(byte[] buffer, int offset, int count)
    {
        lock (m_buffer)
        {
            m_buffer.AddRange(buffer.ToList().Skip(offset).Take(count));
        }
        m_dataReady.Set();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (m_buffer.Count == 0)
        {
            // Block until the stream has some more data.
            m_dataReady.Reset();
            m_dataReady.WaitOne();
        }

        lock (m_buffer)
        {
            if (m_buffer.Count >= count)
            {
                // More bytes available than were requested.
                Array.Copy(m_buffer.ToArray(), 0, buffer, offset, count);
                m_buffer = m_buffer.Skip(count).ToList();
                return count;
            }
            else
            {
                int length = m_buffer.Count;
                Array.Copy(m_buffer.ToArray(), 0, buffer, offset, length);
                m_buffer.Clear();
                return length;
            }
        }
    }
}
I cannot say exactly what's going wrong from the code you posted, but one oddity is that you lock on m_buffer while also replacing the buffer, so the collection being locked is not always the collection that is being read and modified.
It is good practice to use a dedicated private readonly object for the locking:
private readonly object locker = new object();

// ...

lock (locker)
{
    // ...
}
You have at least one data race there: in the Read method, if you're pre-empted after the if (m_buffer.Count == 0) block and before the lock, Count can be 0 again by the time you copy. You should check the count inside the lock, and use Monitor.Wait together with Monitor.Pulse / Monitor.PulseAll for the wait/signal coordination, like this:
// On Write
lock (m_buffer)
{
    // ...
    Monitor.PulseAll(m_buffer);
}

// On Read
lock (m_buffer)
{
    while (m_buffer.Count == 0)
        Monitor.Wait(m_buffer);
    // ...
}
You have to protect all accesses to m_buffer, and calling m_buffer.Count is not special in that regard.
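Putting both suggestions together, a fuller (untested) sketch of the Read/Write pair could look like this, with a dedicated lock object (m_lock is an assumed name) and every m_buffer access inside the lock:

private readonly object m_lock = new object();
private readonly List<byte> m_buffer = new List<byte>();

public override void Write(byte[] buffer, int offset, int count)
{
    lock (m_lock)
    {
        for (int i = 0; i < count; i++)
            m_buffer.Add(buffer[offset + i]);
        Monitor.PulseAll(m_lock);              // wake any readers blocked in Wait
    }
}

public override int Read(byte[] buffer, int offset, int count)
{
    lock (m_lock)
    {
        while (m_buffer.Count == 0)
            Monitor.Wait(m_lock);              // releases the lock while waiting

        int length = Math.Min(count, m_buffer.Count);
        m_buffer.CopyTo(0, buffer, offset, length);
        m_buffer.RemoveRange(0, length);
        return length;
    }
}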
Do you modify the contents of buffer on another thread somewhere? I suspect it may be that enumeration giving the error, rather than m_buffer.
System.IO.BinaryReader reads values in a little-endian format.
I have a C# application connecting to a proprietary networking library on the server side. The server-side sends everything down in network byte order, as one would expect, but I find that dealing with this on the client side is awkward, particularly for unsigned values.
UInt32 length = (UInt32)IPAddress.NetworkToHostOrder(reader.ReadInt32());
is the only way I've come up with to get a correct unsigned value out of the stream, but this seems both awkward and ugly, and I have yet to test if that's just going to clip off high-order values so that I have to do fun BitConverter stuff.
Is there some way I'm missing short of writing a wrapper around the whole thing to avoid these ugly conversions on every read? It seems like there should be an endian-ness option on the reader to make things like this simpler, but I haven't come across anything.
There is no built-in converter. Here's my wrapper (as you can see, I only implemented the functionality I needed but the structure is pretty easy to change to your liking):
/// <summary>
/// Utilities for reading big-endian files
/// </summary>
public class BigEndianReader
{
    public BigEndianReader(BinaryReader baseReader)
    {
        mBaseReader = baseReader;
    }

    public short ReadInt16()
    {
        return BitConverter.ToInt16(ReadBigEndianBytes(2), 0);
    }

    public ushort ReadUInt16()
    {
        return BitConverter.ToUInt16(ReadBigEndianBytes(2), 0);
    }

    public uint ReadUInt32()
    {
        return BitConverter.ToUInt32(ReadBigEndianBytes(4), 0);
    }

    public byte[] ReadBigEndianBytes(int count)
    {
        byte[] bytes = new byte[count];
        for (int i = count - 1; i >= 0; i--)
            bytes[i] = mBaseReader.ReadByte();
        return bytes;
    }

    public byte[] ReadBytes(int count)
    {
        return mBaseReader.ReadBytes(count);
    }

    public void Close()
    {
        mBaseReader.Close();
    }

    public Stream BaseStream
    {
        get { return mBaseReader.BaseStream; }
    }

    private BinaryReader mBaseReader;
}
Basically, ReadBigEndianBytes does the grunt work, and the result is passed to BitConverter. There will be a definite problem if you read a large number of bytes at once, since this causes a large memory allocation.
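Hypothetical usage against the network stream from the question (the variable names are illustrative):

var reader = new BigEndianReader(new BinaryReader(networkStream));
uint length = reader.ReadUInt32();    // four bytes, most significant first
ushort flags = reader.ReadUInt16();   // two bytes, most significant first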
I built a custom BinaryReader to handle all of this. It's available as part of my Nextem library. It also has a very easy way of defining binary structs, which I think will help you here -- check out the Examples.
Note: It's only in SVN right now, but very stable. If you have any questions, email me at cody_dot_brocious_at_gmail_dot_com.