Storing files to byte array - c#

I have a database table with a column that stores files as varbinary. I have tried to store a single file using C# and byte arrays and it worked. How can I add multiple files to this column? Please help.

I suppose you'd need to concatenate the byte arrays from each file into one giant byte array and then insert that into the field, but then how would you know where one file begins and the next ends?
You could try putting a magic set of bytes between each file's byte array, but then what happens when one of those files happens to contain that same magic sequence?
If the files are all the exact same type, say images, you could look for the magic bytes that certain image file types always start with to separate them, but again, there's still a chance those bytes appear in the middle of one of the files.
There are also memory concerns, both when saving and when retrieving, if the combined files are too large.
This idea also violates database design / normalization.
I would do what Jeremy Lakeman recommends: create a child table.
I.e.,
Files Table Columns:
ParentID (foreign key to parent table)
FileID (Autonumber / primary key)
File (varbinary)
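
A minimal sketch of that approach (assuming SQL Server with System.Data.SqlClient; the table and column names follow the layout above, and the connection string is whatever your app already uses): inserting several files simply becomes one row per file.

using System.Data.SqlClient;
using System.IO;

class MultiFileInsert
{
    // Stores each file as its own row in the child table "Files".
    static void SaveFiles(string connectionString, int parentId, string[] paths)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            foreach (var path in paths)
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO Files (ParentID, [File]) VALUES (@parentId, @file)", conn))
                {
                    cmd.Parameters.AddWithValue("@parentId", parentId);
                    cmd.Parameters.AddWithValue("@file", File.ReadAllBytes(path));
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}

Reading them back is then just SELECT [File] FROM Files WHERE ParentID = @parentId, one row per file, with no delimiters to guess at.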

Related

C# Insert data in file without rewriting it [duplicate]

This question already has answers here:
Inserting bytes in the middle of binary file
(3 answers)
Closed 3 years ago.
I have a file with some data in it. Now I want to add some content, but not by appending it; more like "adding this block of 4 bytes between the current 10th and 11th byte in this file". Currently I'm using FileStream to read from and write to files.
So my question: is there a way to insert this data without rewriting the entire file?
Thank you,
Nils.
Edit 2 - the rewrite
After a lot of comments, I figured out that the real issue is that you have a database that mostly works like a file system. The biggest difference is probably that the clusters know which file they belong to, rather than the other way around. I am going to use filesystem terminology for the DDL/schema. Sorry, I cannot get proper SQL syntax highlighting to work.
CREATE TABLE Files(
    ID INTEGER PRIMARY KEY
    /* a bunch of other columns that do not really matter for this */
);
CREATE TABLE Clusters(
    ID INTEGER PRIMARY KEY,
    FK_FileID INTEGER REFERENCES Files(ID), -- Special, see text
    ClusterNumber INTEGER, -- Special, see text
    Contents VARBINARY(MAX) -- or whatever type you need
);
Clusters is an odd table in many regards:
The primary key is mostly irrelevant. Indeed, you can probably remove the index for it. The only reasons I have it are a) habit, b) because you might regret not having it, and c) it might be useful for management work.
ClusterNumber is the "N-th cluster for FK_FileID".
ClusterNumber and FK_FileID should have a shared unique constraint (the combination of both must be unique) and should probably be covered by an index over both. Think of them as a composite primary key or a multi-column surrogate key (which does sound like an oxymoron). You will use them far more often than the official PK.
You would get all the clusters for a file like this:
SELECT Contents FROM Clusters
WHERE FK_FileID = /* the file whose whole data you want */
ORDER BY ClusterNumber
which would be nicely covered by that extra index.
If you want to shove a segment in anywhere, you would:
Move the ClusterNumber of all following clusters up by 1
Add a Cluster row with the newly freed-up ClusterNumber for this file
You can be somewhat wasteful with that last step, for example adding a cluster that holds only 4 characters; a sketch of both steps follows below.
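A rough sketch of those two steps with ADO.NET (SQL Server and System.Data.SqlClient assumed; the table and column names follow the DDL above, everything else is illustrative):

using System.Data.SqlClient;

class ClusterInsert
{
    // Inserts a new cluster at the given position, shifting later clusters up by one.
    // Assumes Clusters.ID is an identity/auto-increment column.
    static void InsertClusterAt(string connectionString, int fileId, int position, byte[] content)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // Step 1: move the ClusterNumber of all following clusters up by 1.
                // (SQL Server checks the unique constraint per statement; some other
                // engines may need this done in descending ClusterNumber order.)
                using (var shift = new SqlCommand(
                    "UPDATE Clusters SET ClusterNumber = ClusterNumber + 1 " +
                    "WHERE FK_FileID = @file AND ClusterNumber >= @pos", conn, tx))
                {
                    shift.Parameters.AddWithValue("@file", fileId);
                    shift.Parameters.AddWithValue("@pos", position);
                    shift.ExecuteNonQuery();
                }

                // Step 2: add a cluster row with the newly freed-up ClusterNumber.
                using (var insert = new SqlCommand(
                    "INSERT INTO Clusters (FK_FileID, ClusterNumber, Contents) " +
                    "VALUES (@file, @pos, @content)", conn, tx))
                {
                    insert.Parameters.AddWithValue("@file", fileId);
                    insert.Parameters.AddWithValue("@pos", position);
                    insert.Parameters.AddWithValue("@content", content);
                    insert.ExecuteNonQuery();
                }

                tx.Commit();
            }
        }
    }
}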
Assuming you store this on an HDD/rotating-disk storage, you will probably not get around defragmenting this regularly anyway, so you might as well consolidate the clusters to cut the waste while doing so. Unless you can somehow teach the DBMS to read this properly (you would need to ask DB experts about that), you want all clusters of a file to be in sequence in the DB as much as possible, so they end up together on the disk as well. Of course, if the physical medium is a proper SSD, you can skip the defragmentation and only consolidate.
Advanced options include things like reserving expansion room for a file (room no other file will use) ahead of time, so the clusters can be kept together (even if not in order) and there is less need for defragmentation.

MySQL Field Space Allocation

First I create a model in my application, then Entity Framework generates the SQL for creating the table.
The first generates a column with type varchar(20), the second generates longtext.
Example
[StringLength(20)]
public string Code { get; set; }
public string CodeTwo { get; set; }
Questions
Is there any difference between these two declarations (space allocation)?
(Even if they store the same value, like "test", which has 4 characters.)
If I know that a field's length varies between, let's say, 10 and 15 characters, is the best approach to limit it to the max length or to leave it "unlimited" (space allocation)?
Thanks in advance.
Sorry for my poor English.
Translated answer from the user #Marconcílio Souza, on the same question asked in another language.
When Entity Framework generates the tables in your database, it checks the type of each field; in the case of a string type, when you specify the size it passes that same specification on to the database with the corresponding type.
In the case of your
[StringLength(20)]
public string Code { get; set; }
the corresponding MySQL type is varchar(20), but when the same string type is declared without a fixed size, Entity Framework allocates the largest type available for it in the database, which in the case of MySQL is longtext.
BLOB-type columns such as LONGTEXT are inherently variable-length and take up almost no storage when not used. A NULL value takes practically no space, and when a value such as 'test' is set, the allocation matches the size of the string that was passed in.
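If the goal is for CodeTwo to also end up as a bounded varchar rather than longtext, a minimal sketch (assuming EF6; the fluent call looks the same on EF Core's ModelBuilder, and the entity/context names here are made up) is to give the provider a maximum length, either with an attribute or with the fluent API:

using System.ComponentModel.DataAnnotations;
using System.Data.Entity;   // EF6

public class Product
{
    public int Id { get; set; }

    [StringLength(20)]                  // mapped to varchar(20) by the MySQL provider
    public string Code { get; set; }

    public string CodeTwo { get; set; } // configured in OnModelCreating instead
}

public class ShopContext : DbContext
{
    public DbSet<Product> Products { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Equivalent to [StringLength(20)]: a bounded length lets the provider
        // pick varchar(20) instead of its "unlimited" longtext type.
        modelBuilder.Entity<Product>()
                    .Property(p => p.CodeTwo)
                    .HasMaxLength(20);
    }
}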
Advantages / disadvantages of BLOBs vs. VARCHARs
All comments in this section referring to the VARCHAR type are valid for the CHAR type too.
Each comment ends with a BLOB+ or VARCHAR+ mark to indicate which data type is better.
Do you know the maximum length of your data?
With VARCHARs you need to declare the maximum length of the string.
With blobs you do not have to worry about it.
BLOB+
Do you need to store very long strings?
A single VARCHAR is limited to 32K bytes (i.e., about 10 thousand Unicode characters).
The maximum blob size depends on the page size (according to the Service Guide):
  - Page size 1 KB => 64 MB
  - Page size 2 KB => 512 MB
  - Page size 4 KB => 4 GB
  - Page size 8 KB => 32 GB
BLOB+
Do you need to store many long text columns in a single table?
The total row length (uncompressed) is restricted to 64K. VARCHARs are stored inline, so you cannot fit many long strings in one row.
Blobs are represented by their blob-id and use only 8 bytes of that 64K maximum.
BLOB+
Do you want to minimize the round trips between client and server?
VARCHAR data is fetched along with the other row data in a fetch operation, and usually several rows are sent over the network at the same time.
Every single blob requires an extra open/fetch operation.
VARCHAR+
Do you want to minimize the amount of data transferred between client and server?
The advantage of blobs is that to get the row you fetch only the blob-id, so you can decide whether or not to fetch the BLOB data at all.
In older versions of InterBase there was a problem that VARCHARs were sent over the network at their full declared length. This problem has been fixed in Firebird 1.5 and InterBase 6.5.
Draw (BLOB+ for older versions of the server)
Do you want to minimize the space used?
VARCHARs are RLE-compressed (in fact the entire row is compressed, except blobs). A maximum of 128 bytes can be compressed to 2 bytes. This means that even an empty varchar(32000) will occupy 500+2 bytes.
Blobs are not compressed, but an empty (i.e. null) blob will occupy only the 8 bytes of its blob-id (and will later be RLE-compressed). A non-empty blob may be stored on the same page as the other data of the row (if it fits) or on a separate page. A small blob that fits on the data page has an overhead of 40 bytes (or a little more). A big blob has the same 40-byte overhead on the data page, plus 28 bytes of overhead on each blob page (30 bytes on the first). A blob page cannot contain more than one blob (i.e. blob pages are not shared the way data pages are). For example, with a 4K page size, if you store a 5K blob, two blob pages will be allocated, which means you lose 3K of space! In other words: the larger the page size, the higher the probability that small blobs fit on the data page, but also the more space is wasted if separate blob pages are needed for large blobs.
VARCHAR+ (except VARCHARs with extremely large declared lengths, or tables with lots of NULL blobs)
Do you need a table with an extremely large number of rows?
Each row is identified by a DB_KEY, a 64-bit value of which 32 bits identify the table and 32 bits are used to locate the row. The theoretical maximum number of rows in a table is therefore 2^32 (but for various reasons the true maximum is even lower). Blob-ids are allocated from the same address space as DB_KEYs, which means the more blobs in a table, the fewer DB_KEYs remain to address its rows. On the other hand, when the stored rows are wide (e.g. they contain long VARCHARs), fewer rows fit on a data page and many DB_KEY values remain unassigned anyway.
VARCHAR+?
Do you want good performance?
Because large blobs are stored outside the data pages, they increase the "density" of rows on data pages and thus cache efficiency (reducing the number of I/O operations during a search).
BLOB+
Do you need to search the contents of these text columns?
With VARCHARs you can use operators such as '=', '>', BETWEEN, IN(), the case-sensitive LIKE and STARTING WITH, and the case-insensitive CONTAINING. In most cases an index can be used to speed up the search.
Blobs cannot be indexed, and you are restricted to the LIKE, STARTING WITH and CONTAINING operators. You cannot directly compare blobs with operators like '=', '>' etc. (unless you use a UDF), so you cannot, for example, join tables on blob fields.
VARCHAR+
Do you want to search the contents of these texts with CONTAINING?
CONTAINING can be used to perform a case-insensitive search of a VARCHAR field's contents (no index is used).
Because you cannot set a collation order for BLOB columns, you cannot use fully case-insensitive search with national characters in BLOB columns (only the lower half of the character set is case-insensitive). (Alternatively, you can use a UDF.)
Firebird 2 already allows you to set a collation on text (and binary) columns.
VARCHAR+
Do you need to uppercase the contents of a text column?
You can use the built-in UPPER() function on a VARCHAR, but not on a blob. (CAST, MIN and MAX cannot be used with blobs either.)
VARCHAR+
You cannot sort by a blob column (nor use it in GROUP BY, DISTINCT, UNION, or JOIN ON).
You cannot concatenate blob columns.
VARCHAR+
There is no built-in conversion function (CAST) for converting a blob to a VARCHAR or a VARCHAR to a blob.
(But you can write a UDF for this purpose.)
Since Firebird 1.5 you can use the built-in SUBSTRING function to convert a blob to a VARCHAR (but the FROM and FOR clauses cannot exceed 32K).
Draw
You cannot assign a value to a blob directly in an SQL command,
e.g. INSERT INTO SomeTable (MyBlob) VALUES ('abc'); (but you can use a UDF to convert a string to a blob).
VARCHAR+
Firebird 0.9.4 already has this functionality
Draw
Do you need good security on these text columns?
To retrieve the table data, you must be granted the SELECT privilege.
To retrieve a blob, you only need to know its blob-id (stored in the table); Firebird/InterBase will not check whether you have any rights to the table the blob belongs to. This means that anyone who knows or guesses the right blob-id can read the blob without any rights on the table. (You can try it with ISQL and the BLOBDUMP command.)
VARCHAR+
More details
Reference 1
Reference 2
Reference 3
Reference 4

Optimizing SDF filesize

I recently started learning LINQ and SQL. As a small project I'm writing a dictionary application for Windows Phone. The project is split into two applications. One application (that currently runs on my PC) generates an SDF file. The second app runs on my Windows Phone and searches the database. However, I would like to optimize the data usage. The raw entries of the dictionary are written in a TXT file with a file size of around 39 MB. The file has the following layout:
germanWord \tab englishWord \tab group
germanWord \tab englishWord \tab group
The file is parsed into a SDF database with the following tables.
Table Word with columns _version (rowversion), Id (int IDENTITY), Word (nvarchar(250)), Language (int)
This table contains every single word in the file. The language is a flag from my code that I used in case I want to add more languages later. A word-language pair is unique.
Table Group with columns _version (rowversion), GroupId (int IDENTITY), Caption (nvarchar(250))
This table contains the different groups. Every group is present one time.
Table Entry with columns _version (rowversion), EntryId (int IDENTITY), WordOneId (int), WordTwoId(int), GroupId(int)
This table links translations together. WordOneId and WordTwoId are foreign keys to a row in the Word Table, they contain the id of a row. GroupId defines the group the words belong to.
I chose this layout to reduce the data footprint. The raw text file contains some German (or English) words multiple times, and there are around 60 groups that repeat themselves. Programmatically I reduce the word count from around 1,800,000 to around 1,100,000, and there are around 50 rows in the Group table. Despite the reduced number of words, the SDF is around 80 MB in file size. That's more than twice the size of the raw data. Another thing: in order to speed up the searching of translations, I plan to index the Word column of the Word table. Adding this index grows the file to over 130 MB.
How can it be that the SDF with ~60% of the original data is twice as large?
Is there a way to optimize the filesize?
The database file must contain all of the data from your raw file, in addition to row metadata. It will also store the strings according to the data types specified -- I believe your choice here is NVARCHAR, which uses two bytes per character. Combining these considerations, it would not surprise me that a database file is over twice as large as a text file of the same data using the ISO-Latin-1 character set.
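To see how much of that growth is just the character encoding, a quick sketch (the file path is illustrative) comparing the single-byte size of the raw text with the UTF-16 size that NVARCHAR stores:

using System;
using System.IO;
using System.Text;

class SizeCheck
{
    static void Main()
    {
        // Hypothetical raw word list in the germanWord \t englishWord \t group layout.
        string text = File.ReadAllText(@"dictionary.txt");

        long latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetByteCount(text);
        long utf16Bytes  = Encoding.Unicode.GetByteCount(text);   // what NVARCHAR stores

        Console.WriteLine("Single-byte encoding: {0:N0} bytes", latin1Bytes);
        Console.WriteLine("UTF-16 (NVARCHAR):    {0:N0} bytes", utf16Bytes);   // roughly double
    }
}

On top of that come row headers, the rowversion and identity columns, index pages and page-level overhead, which is why the index on Word adds another large chunk.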

C# Programming: Maintaining a List that Associates an ID with Information Quickly

In a game that I've been working on, I created a system by which the game polls a specific 'ItemDatabase' file in order to retrieve information about itself based on a given identification number. The identification number represented the point in the database at which the information regarding a specific item was stored. The representation of every item in the database consisted of 162 bytes. The code for the system was similar to the following:
// Retrieves the information about an 'Item' object given the ID. The
// 'BinaryReader' object contains a file stream to the 'ItemDatabase' file.
public Item(ushort ID, BinaryReader itemReader)
{
// Since each 'Item' object is represented by 162 bytes of information in the
// database, skip 162 bytes per ID skipped.
itemReader.BaseStream.Seek(162 * ID, SeekOrigin.Begin);
// Retrieve the name of this 'Item' from the database.
// ReadChars returns a char[]; calling ToString() on the array yields "System.Char[]",
// so build the string from the characters instead.
this.itemName = new string(itemReader.ReadChars(20));
}
Normally there wouldn't be anything particularly wrong with this system as it queries the desired data and initializes it to the correct variables. However, this process must occur during game time, and, based on research that I've done about the efficiency of the 'Seek' method, this technique won't be nearly fast enough to be incorporated into a game. So, my question is: What's a good way to maintain a list that associates an identification number with information that can be accessed quickly?
Your best shot would be a database. SQLite is very portable and does not need to be installed on the system.
If you have loaded all the data into memory, you can use Dictionary<int, Item>. This makes it very easy to add and remove items to the list.
Since it seems like your IDs all start at 0 and go upwards, it would be really fast with just an array. Just use the item's ID as its index.
Assuming the information in the "database" is not being changed continuously, couldn't you just read out the various items once-off during the load of the game or level? You could store the data in a variety of ways, such as a Dictionary. The .NET Dictionary is actually what is commonly referred to as a hash table, mapping keys (in this case, your ID field) to objects (which I am guessing are of type "Item"). Lookup times are extremely good (definitely in the millions per second); I doubt you'd ever have issues.
Alternatively, if your ID is a ushort, you could just store your objects in an array covering all possible ushort values. An array of 65,536 entries is not large in today's terms. Array lookups are as fast as you can get.
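For the read-once-into-a-Dictionary approach, a minimal sketch (the Item shape and the file name are illustrative; the 162-byte record size and the 20-character name field come from the question):

using System.Collections.Generic;
using System.IO;

class Item
{
    public ushort ID;
    public string Name;
}

static class ItemCache
{
    // Reads the whole ItemDatabase once; lookups afterwards are in-memory.
    public static Dictionary<ushort, Item> LoadAll(string path)
    {
        const int RecordSize = 162;
        var items = new Dictionary<ushort, Item>();

        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            ushort id = 0;
            while (reader.BaseStream.Position + RecordSize <= reader.BaseStream.Length)
            {
                long recordStart = reader.BaseStream.Position;

                // First 20 characters are the name; new string(char[]) yields the text.
                string name = new string(reader.ReadChars(20)).TrimEnd('\0');
                items[id] = new Item { ID = id, Name = name };

                // Skip the remainder of this 162-byte record.
                reader.BaseStream.Seek(recordStart + RecordSize, SeekOrigin.Begin);
                id++;
            }
        }
        return items;
    }
}

After var items = ItemCache.LoadAll("ItemDatabase"); a lookup is simply items[id], which is the fast path you want at game time.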
You could use a Dictionary, or a ConcurrentDictionary if this is used in a multi-threaded app.
Extremely fast, but a bit more effort to implement, is a MemoryMappedFile; see the sketch below.
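A rough sketch of that memory-mapped route (record layout from the question; the ASCII name encoding and file handling details are assumptions):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MappedItemDatabase : IDisposable
{
    private const int RecordSize = 162;
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;

    public MappedItemDatabase(string path)
    {
        // Map the database file once; individual reads need no explicit Seek.
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        _view = _file.CreateViewAccessor();
    }

    public string ReadItemName(ushort id)
    {
        // Each record is 162 bytes; the first 20 bytes hold the name.
        var nameBytes = new byte[20];
        _view.ReadArray(id * (long)RecordSize, nameBytes, 0, nameBytes.Length);
        return Encoding.ASCII.GetString(nameBytes).TrimEnd('\0');
    }

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}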

C#: Storing Filesize in Database

I'm storing objects in a database as varbinary(MAX) and want to know their file size. Without getting into the pros and cons of using the varbinary(MAX) data type, what is the best way to read the file size of an object stored in the database?
Is it:
A. Better to just read the column from the DB and call the .Length property of System.Data.Linq.Binary.
OR
B. Better to determine the file size of the object before it is added to the DB and create another column called Size.
The files I'm dealing with are generally between 0 and 3 MB with a skew towards the smaller size. It doesn't necessarily make sense to hit the DB again for the file size, but it also doesn't really make sense to read through the entire item to determine its length.
Why not add a calculated column in your database that would be DATALENGTH([your_col])?
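If you go that route, here is a small sketch of asking the server for the size without pulling the blob across the wire (table and column names are assumptions; DATALENGTH is evaluated server-side):

using System;
using System.Data.SqlClient;

class StoredFileSize
{
    static long GetSize(string connectionString, int fileId)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT DATALENGTH(Content) FROM StoredFiles WHERE Id = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", fileId);
            conn.Open();
            // Size in bytes of the varbinary(MAX) value, computed by SQL Server.
            return Convert.ToInt64(cmd.ExecuteScalar());
        }
    }
}

If you need the size often, the computed column suggested above (for example ALTER TABLE StoredFiles ADD SizeBytes AS DATALENGTH(Content)) lets you select it like any other column without ever touching the blob yourself.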
