I have a file with some data in it. Now I want to add some content, but not by appending it. More like "adding this block of 4 bytes between the current 10th and 11th byte in this file". Currently I'm using FileStream to read and write files.
So my question: is there a way to insert this data without rewriting the entire file?
Thank you,
Nils.
Edit 2 - the rewrite
After a lot of comments, I figured out that the real issue is that you have a database that mostly works like a file system. The biggest difference is probably that the clusters know which file they belong to, rather than the other way around. I am going to use file-system terminology for the DDL/schema. Sorry, I cannot get proper SQL syntax highlighting to work.
CREATE TABLE Files(
ID INTEGER PRIMARY KEY
/* a bunch of other columns that do not really matter for this */
);
CREATE TABLE Clusters(
ID INTEGER PRIMARY KEY,
FK_FileID INTEGER REFERENCES Files(ID), -- Special, see text
ClusterNumber INTEGER, -- Special, see text
Contents BLOB -- or whatever type you need
);
Clusters is an odd table in many regards:
The primary key is mostly irrelevant. Indeed, you could probably remove the index on it. The only reasons I have it are a) habit, b) because you might regret lacking it and c) it might be useful for management work.
ClusterNumber is the "N-th cluster for FK_FileID".
ClusterNumber and FK_FileID should have a shared unique constraint (the combination of both must be unique) and should probably be covered by an index on both. Think of them as a composite primary key or a multi-column surrogate key (which does sound like an oxymoron). You will use them far more often than the official PK.
You would get all the clusters for a File like this:
SELECT Contents FROM Clusters
WHERE FK_FileID = /*The file whose whole data you want*/
ORDER BY ClusterNumber
Which would be nicely covered by that extra index.
If you want to shove a segment in anywhere, you would:
Move the ClusterNumber of all following clusters up by 1
Add a Clusters entry for this file with the newly freed-up ClusterNumber
You can be somewhat wasteful with that last step, like adding a cluster that holds only 4 letters.
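A minimal sketch of those two steps, assuming the schema above and Microsoft.Data.Sqlite as the driver (the driver choice, the InsertCluster helper and the 1,000,000 offset are illustrative, not part of the original answer):

using Microsoft.Data.Sqlite;

static void InsertCluster(SqliteConnection conn, long fileId, long position, byte[] contents)
{
    using (var tx = conn.BeginTransaction())
    {
        // Step 1a: move the trailing clusters of this file out of the way by a large offset,
        // so the (FK_FileID, ClusterNumber) unique constraint is not tripped mid-update on
        // engines that check it row by row. Assumes real ClusterNumbers stay below 1,000,000.
        var bump = conn.CreateCommand();
        bump.Transaction = tx;
        bump.CommandText = "UPDATE Clusters SET ClusterNumber = ClusterNumber + 1000000 " +
                           "WHERE FK_FileID = $file AND ClusterNumber >= $pos";
        bump.Parameters.AddWithValue("$file", fileId);
        bump.Parameters.AddWithValue("$pos", position);
        bump.ExecuteNonQuery();

        // Step 1b: settle them back down, one slot above where they used to be.
        var settle = conn.CreateCommand();
        settle.Transaction = tx;
        settle.CommandText = "UPDATE Clusters SET ClusterNumber = ClusterNumber - 999999 " +
                             "WHERE FK_FileID = $file AND ClusterNumber >= 1000000";
        settle.Parameters.AddWithValue("$file", fileId);
        settle.ExecuteNonQuery();

        // Step 2: put the new cluster into the freed-up slot.
        var insert = conn.CreateCommand();
        insert.Transaction = tx;
        insert.CommandText = "INSERT INTO Clusters (FK_FileID, ClusterNumber, Contents) " +
                             "VALUES ($file, $pos, $data)";
        insert.Parameters.AddWithValue("$file", fileId);
        insert.Parameters.AddWithValue("$pos", position);
        insert.Parameters.AddWithValue("$data", contents);
        insert.ExecuteNonQuery();

        tx.Commit();
    }
}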
Assuming you store this on an HDD/rotating disk, you will probably not get around defragmenting this regularly anyway, so you might as well consolidate the clusters to cut the waste while doing so. Unless you can somehow teach the DBMS to read this properly (you would need to ask DB experts about that), you want all clusters of a file to be in sequence in the DB as much as possible, so they end up together on the disk as well. Of course, if the physical medium is a proper SSD, you can skip the defragmentation and only consolidate.
Advanced options include things like reserving expansion room for a file (room that no other file will use) ahead of time, so its clusters can be kept together (even if not in order) and the need for defragmentation is reduced.
Related
I am developing an application which receives packets from the network and stores them in a database. In one part, I save DNS records to the DB, in this format:
IP address (unsigned 32-bit integer)
DNS record (unlimited string)
The rate of DNS records is about 10-100 records per second. As it's realtime, I don't have enough time to check for duplicates with a string search in the database. I was thinking of a good method to get a unique short integer (say, 64-bit) per given unique string, so the string search becomes a number search and lets me check for duplicates faster. Any idea about implementations of what I described, or better approaches, is appreciated. Samples in C# are preferred, but any good idea is welcome.
I would read up on hashing strings into integers, and since the records are pretty long (letter-wise), I would use some modulo function to keep the result within integer limits.
The results would be checked against a hash table for duplicates.
This could be done for the first 20 letters, and then for the next 20 with a nested hash table if required, and so on.
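As a rough sketch of that idea (FNV-1a is just one choice of a stable 64-bit string hash, and the DnsDedup/IsNew names are made up for illustration):

using System.Collections.Generic;
using System.Text;

static class DnsDedup
{
    // 64-bit FNV-1a over the UTF-8 bytes of the record string.
    static long Hash64(string s)
    {
        ulong h = 14695981039346656037UL;              // FNV offset basis
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            unchecked
            {
                h ^= b;
                h *= 1099511628211UL;                  // FNV prime
            }
        }
        return unchecked((long)h);
    }

    static readonly HashSet<long> seen = new HashSet<long>();

    // True if this record's hash has not been seen before. Collisions are possible but rare,
    // so a hit only means "probably a duplicate" and can still be confirmed against the DB.
    public static bool IsNew(string dnsRecord)
    {
        return seen.Add(Hash64(dnsRecord));
    }
}

The modulo step from the answer is only needed if you want to fold the 64-bit hash into a smaller integer range.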
Make sure you set up your table indexes and primary keys correctly.
Load the table contents asynchronously every couple of seconds and populate a generic Dictionary<long, string> with it.
Perform the search on the dictionary, as it is optimized for lookups. If you need it even faster, use a hash table.
Flush the newly added entries asynchronously into the DB in a transaction.
P.S. Your scenario is too vague to create a decent code example.
Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the information of "which pages I've read" in a (SQLite) database?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, because multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for this.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to be pretty much constant and very small compared to the number of pages.
Your idea to store ranges as (firstChunkPage, lastChunkPage) pairs should work if the data is relatively sparse.
Unfortunately, queries like the one you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work efficiently unless you use spatial indexes.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. An R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With an R-Tree, you can very quickly identify all overlaps before inserting a new range and replace them with a new combined entry.
To create your R-Tree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
id, firstChunkPage, lastChunkPage
);
For more information, read the documentation.
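As a hedged sketch of the insert-and-merge step against the demo_index table above (Microsoft.Data.Sqlite is my assumed driver; the AddReadRange name and parameter names are illustrative):

using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

static void AddReadRange(SqliteConnection conn, double first, double last)
{
    using (var tx = conn.BeginTransaction())
    {
        // The R-Tree makes this overlap lookup fast.
        var find = conn.CreateCommand();
        find.Transaction = tx;
        find.CommandText = "SELECT id, firstChunkPage, lastChunkPage FROM demo_index " +
                           "WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
        find.Parameters.AddWithValue("$first", first);
        find.Parameters.AddWithValue("$last", last);

        var overlappingIds = new List<long>();
        using (var reader = find.ExecuteReader())
        {
            while (reader.Read())
            {
                overlappingIds.Add(reader.GetInt64(0));
                first = Math.Min(first, reader.GetDouble(1));   // grow the new range so it
                last = Math.Max(last, reader.GetDouble(2));     // covers every overlap found
            }
        }

        // Replace all overlapping entries with a single merged range.
        foreach (long id in overlappingIds)
        {
            var del = conn.CreateCommand();
            del.Transaction = tx;
            del.CommandText = "DELETE FROM demo_index WHERE id = $id";
            del.Parameters.AddWithValue("$id", id);
            del.ExecuteNonQuery();
        }

        var ins = conn.CreateCommand();
        ins.Transaction = tx;
        ins.CommandText = "INSERT INTO demo_index (id, firstChunkPage, lastChunkPage) " +
                          "VALUES (NULL, $first, $last)";   // NULL id lets the R-Tree assign one
        ins.Parameters.AddWithValue("$first", first);
        ins.Parameters.AddWithValue("$last", last);
        ins.ExecuteNonQuery();

        tx.Commit();
    }
}

Keep in mind that the default R-Tree stores coordinates as 32-bit floats, so page numbers near 2^40 will be rounded; treat this as an outline of the flow rather than a drop-in solution.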
I'm implementing DynamoDB in our project. We have to put large data strings into the database, so we split the data into small pieces and insert multiple rows with only one attribute value changed: the part of the string. One column (the range key) contains the number of the part. Inserting and selecting data works perfectly fine for small and large strings.
The problem is deleting an item. I read that when you want to delete an item you need to specify its primary key (hash key, or hash key and range key, depending on the table). But what if I want to delete items that have a particular value for one of the attributes? Do I need to scan (scan, not query) the entire table and run a delete or batch delete for each row? Or is there some other solution that does not need two queries?
What I'm trying to do is avoid scanning the entire table. I think we will have about 100 to 1000 million rows in such a table, so scanning will be very slow.
Thanks for help.
There is no way to delete an arbitrary element in DynamoDB. You indeed need to know the hash_key and the range_key.
If a query does not fit your needs here (i.e. you do not even know the hash_key), then you're stuck.
The best would be to re-think your data modeling. Build a custom index or do a 'lazy delete'.
To achieve a 'lazy delete', use a table as a queue of elements to delete. Periodically, run an EMR job on it to do all the deletes in a batch with a single scan operation. It's really not the best solution, but the only way I can think of to avoid re-modeling.
TL;DR: There is no real way but workarounds. I highly recommend that you re-model at least part of your data.
In a game that I've been working on, I created a system by which the game polls a specific 'ItemDatabase' file in order to retrieve information about an item based on a given identification number. The identification number represented the point in the database at which the information regarding a specific item was stored. The representation of every item in the database consisted of 162 bytes. The code for the system was similar to the following:
// Retrieves the information about an 'Item' object given the ID. The
// 'BinaryReader' object contains a file stream to the 'ItemDatabase' file.
public Item(ushort ID, BinaryReader itemReader)
{
// Since each 'Item' object is represented by 162 bytes of information in the
// database, skip 162 bytes per ID skipped.
itemReader.BaseStream.Seek(162 * ID, SeekOrigin.Begin);
// Retrieve the name of this 'Item' from the database.
this.itemName = new string(itemReader.ReadChars(20)); // ReadChars returns a char[], so build the string from it
}
Normally there wouldn't be anything particularly wrong with this system as it queries the desired data and initializes it to the correct variables. However, this process must occur during game time, and, based on research that I've done about the efficiency of the 'Seek' method, this technique won't be nearly fast enough to be incorporated into a game. So, my question is: What's a good way to maintain a list that associates an identification number with information that can be accessed quickly?
Your best shot would be a database. SQLite is very portable and does not need to be installed on the system.
If you have loaded all the data into memory, you can use Dictionary<int, Item>. This makes it very easy to add and remove items to the list.
As it seems like your IDs all go from 0 and upwards, it would be really fast with just an array. Just set the index of the item to be the id.
Assuming the information in the "database" is not being changed continuously, couldn't you just read out the various items once-off during the load of the game or level? You could store the data in a variety of ways, such as a Dictionary. The .NET Dictionary is actually what is commonly referred to as a hash table, mapping keys (in this case, your ID field) to objects (which I am guessing are of type "Item"). Lookup times are extremely good (definitely in the millions per second); I doubt you'd ever have issues.
Alternatively, if your ID is a ushort, you could just store your objects in an array covering all possible ushort values. An array of 65,536 entries is not large in today's terms. Array lookups are as fast as you can get.
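For example, a once-off load into an array indexed by ID could reuse the constructor from the question (the LoadItems helper and the file-path handling are mine, for illustration):

using System.IO;

// Read the whole ItemDatabase once at load time; in-game lookups then become plain array indexing.
static Item[] LoadItems(string path)
{
    using (var reader = new BinaryReader(File.OpenRead(path)))
    {
        int count = (int)(reader.BaseStream.Length / 162);   // 162 bytes per record
        var items = new Item[count];
        for (int id = 0; id < count; id++)
        {
            // The constructor from the question seeks to 162 * ID itself.
            items[id] = new Item((ushort)id, reader);
        }
        return items;
    }
}

After that, a lookup during gameplay is just items[id], with no file I/O at all.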
You could use a Dictionary, or a ConcurrentDictionary if this is used in a multi-threaded app.
Extremely fast, but a bit more effort to implement, is a MemoryMappedFile.
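A rough sketch of the memory-mapped variant, keeping the 162-byte fixed-size records from the question (the class name and the ASCII assumption for the name field are illustrative):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Maps the database file into memory; reads are served from the OS page cache rather than
// through explicit Seek/Read calls on a FileStream.
sealed class ItemDatabase : IDisposable
{
    const int RecordSize = 162;
    readonly MemoryMappedFile file;
    readonly MemoryMappedViewAccessor view;

    public ItemDatabase(string path)
    {
        file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        view = file.CreateViewAccessor();
    }

    // Reads the 20-byte name field of the record for this ID (the other fields are omitted here).
    public string ReadItemName(ushort id)
    {
        var buffer = new byte[20];
        view.ReadArray(RecordSize * (long)id, buffer, 0, buffer.Length);
        return Encoding.ASCII.GetString(buffer).TrimEnd('\0');
    }

    public void Dispose()
    {
        view.Dispose();
        file.Dispose();
    }
}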
Scenario
I'm parsing emails and inserting them into a database using an ORM (NHibernate, to be exact). While my current approach does technically work, I'm not very fond of it but can't think of a better solution. The email contains ~50 fields and is sent from a third party; it looks like this (obviously a very short dummy sample):
Field #1: Value 1 Field #2: Value 2
Field #3: Value 3 Field #4: Value 4 Field #5: Value 5
Problem
My problem is that with parsing this many fields the database table is an absolute monster. I can't create proper models employing any kind of relationships either AFAIK because each email sent is all static data and doesn't rely on any other sources.
The only idea I have is to find commonalities between the fields and split them into more manageable chunks. Say ~10 fields per entity, so 5 entities total. However, I'm not terribly in love with that idea either, seeing as all I'd be doing is creating one-to-one relationships.
What is a good way of managing a large number of properties that are out of your control?
Any thoughts?
Create 2 tables: one for the main object, and the other for the fields. That way you can programmatically access each field as necessary, and the object model doesn't look too nasty.
But this is just off the top of my head; you have a weird problem.
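A minimal sketch of that two-table shape as plain entity classes (the names are invented and the NHibernate mapping is omitted):

using System;
using System.Collections.Generic;

// One row per email.
public class Email
{
    public virtual int Id { get; set; }
    public virtual DateTime ReceivedAt { get; set; }

    // One row per parsed field, instead of ~50 columns on the Email table.
    public virtual IList<EmailField> Fields { get; set; } = new List<EmailField>();
}

public class EmailField
{
    public virtual int Id { get; set; }
    public virtual Email Email { get; set; }
    public virtual string Name { get; set; }
    public virtual string Value { get; set; }
}

Each parsed field becomes an EmailField row keyed by its name, which keeps the Email table narrow at the cost of a join whenever you need all the fields.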
If the data is coming back in a file that you can parse easily, then you might be able to get away with creating a command-line application that produces scripts and C# code that you can then execute and copy-paste into your program. I've done that when creating properties out of tables from HTML pages (like this one I had to do recently).
If the 50 properties are actually unique and discrete pieces of data regarding this one entity, I don't see a problem with having those 50 properties (even though that sounds like a lot) on one object. For example, the Type class has a large number of boolean properties relating to its data (IsPublic, etc.).
Alternatives:
Well, one option that comes to mind immediately is using a dynamic object and overriding TryGetMember to look up the 'property' name as a key in a dictionary of key-value pairs (where your real set of 50 key-value pairs lives). Of course, figuring out how to map that from your ORM into your entity is the other problem, and you'd lose IntelliSense support.
However, just throwing the idea out there.
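A sketch of that idea, overriding DynamicObject.TryGetMember over the parsed field dictionary (the EmailFields class name is made up, and the ORM mapping is not shown):

using System.Collections.Generic;
using System.Dynamic;

// Exposes the ~50 parsed key/value pairs as if they were properties.
public class EmailFields : DynamicObject
{
    readonly IDictionary<string, string> values;

    public EmailFields(IDictionary<string, string> values)
    {
        this.values = values;
    }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        string value;
        values.TryGetValue(binder.Name, out value);
        result = value;   // null when the field was not present in the email
        return true;      // never throw for unknown field names
    }
}

Usage would look like: dynamic email = new EmailFields(parsedFields); var subject = email.Subject; (assuming the parser used "Subject" as a key), but as noted above you lose IntelliSense.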
Use a dictionary instead of separate fields. In the database, you just have a table for the field name and its value (and what object it belongs to).