MongoDB key-value DB alternative with compression - c#

I am currently using MongoDB to store lots of real-time signals from some sensors. The stored information includes a timestamp, a numeric value and a flag that indicates the quality of the signal.
Queries are very fast, but the amount of disk space used is exorbitant, and I would like to try another non-relational database more suitable to my purposes.
I've been looking at http://nosql-database.org/ but I don't know which database is the best for my needs.
Thank you very much :)

http://www.mongodb.org/display/DOCS/Excessive+Disk+Space

MongoDB stores field names inside every document, which is great because it allows all documents to have different fields, but creates a storage overhead when fields are always the same.
To reduce the disk consumption, try shortening the field names, so instead of:
{
_id: "47cc67093475061e3d95369d",
timestamp: "2011-06-09T17:46:21",
value: 314159,
quality: 3
}
Try this:
{
_id: "47cc67093475061e3d95369d",
t: "2011-06-09T17:46:21",
v: 314159,
q: 3
}
Then you can map these field names to something more meaningful inside your application.
Also, if you're storing separate _id and timestamp fields then you might be doubling up.
The ObjectId type has a timestamp embedded in it, so depending on how you query and use your data, it might mean you can do without a separate timestamp field altogether.
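For illustration, here is a minimal sketch using the official MongoDB C# driver with an assumed Signal class of my own (not from the question): the [BsonElement] attribute keeps the short names on disk while the code stays readable, and ObjectId.CreationTime exposes the timestamp already embedded in _id.

// Minimal sketch (assumed class name 'Signal'): short field names on disk,
// readable property names in code, and the ObjectId's embedded timestamp.
using System;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

public class Signal
{
    [BsonId]
    public ObjectId Id { get; set; }

    [BsonElement("t")]
    public DateTime Timestamp { get; set; }

    [BsonElement("v")]
    public double Value { get; set; }

    [BsonElement("q")]
    public int Quality { get; set; }

    // If you drop the separate timestamp field entirely, the ObjectId
    // already carries a creation time (second precision, UTC).
    public DateTime CreatedAt => Id.CreationTime;
}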

Disk space is cheap; I wouldn't worry about this, as development with a new database will cost much more... If you are on Windows you can try RavenDB.
Also, maybe take a look at this answer about reducing MongoDB database size:
You can do this "compression" by running mongod --repair or by
connecting directly and running db.repairDatabase().

Related

Trying to get an UPSERT working on a set of data using dapper

I'm trying to get an upsert working on a collection of IDs (not the primary key - that's an identity int column) on a table using dapper. This doesn't need to be a dapper function; I'm just including it in case that helps.
I'm wondering if it's possible (either through straight SQL or using a dapper function) to run an upsert on a collection of IDs (specifically an IEnumerable of ints).
I really only need a simple example to get me started, so an example would be:
I have three objects of type Foo:
{ "ExternalID" : 1010101, "DescriptorString" : "I am a descriptive string", "OtherStuff" : "This is some other stuff" }
{ "ExternalID" : 1010122, "DescriptorString" : "I am a descriptive string123", "OtherStuff" : "This is some other stuff123" }
{ "ExternalID" : 1033333, "DescriptorString" : "I am a descriptive string555", "OtherStuff" : "This is some other stuff555" }
I have a table called Bar with those same column names (where only 1033333 exists):
Table Bar
ID | ExternalID | DescriptorString               | OtherStuff
1  | 1033333    | "I am a descriptive string555" | "This is some other stuff555"
Well, since you said that this didn't need to be dapper-based ;-), I will say that the fastest and cleanest way to get this data upserted is to use Table-Valued Parameters (TVPs) which were introduced in SQL Server 2008. You need to create a User-Defined Table Type (one time) to define the structure, and then you can use it in either ad hoc queries or pass to a stored procedure. But this way you don't need to export to a file just to import, nor do you need to convert it to XML just to convert it back to a table.
Rather than copy/paste a large code block, I have noted three links below where I have posted the code to do this (all here on S.O.). The first two links are the full code (SQL and C#) to accomplish this (the 2nd link being the most analogous to what you are trying to do). Each is a slight variation on the theme (which shows the flexibility of using TVPs). The third is another variation but not the full code, as it just shows the differences from one of the first two in order to fit that particular situation.

But in all 3 cases, the data is streamed from the app into SQL Server. There is no creating of any additional collection or external file; you use what you currently have and only need to duplicate the values of a single row at a time to be sent over. And on the SQL Server side, it all comes through as a populated Table Variable. This is far more efficient than taking data you already have in memory, converting it to a file (takes time and disk space) or XML (takes CPU and memory) or a DataTable (for SqlBulkCopy; takes CPU and memory) or something else, only to rely on an external factor such as the filesystem (the files will need to be cleaned up, right?) or need to parse out of XML.
How can I insert 10 million records in the shortest time possible?
Pass Dictionary<string,int> to Stored Procedure T-SQL
Storing a Dictionary<int,string> or KeyValuePair in a database
Now, there are some issues with the MERGE command (see Use Caution with SQL Server's MERGE Statement) that might be a reason to avoid using it. So, I have posted the "upsert" code that I have been using for years to an answer on DBA.StackExchange:
How to avoid using Merge query when upserting multiple data using xml parameter?
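For orientation, here is a minimal sketch of the TVP approach under stated assumptions: a user-defined table type dbo.FooTableType and a target table dbo.Bar keyed on ExternalID (both names invented here, not taken from the linked answers), with the upsert written as UPDATE-then-INSERT rather than MERGE, per the caution above.

// Hypothetical sketch: stream a collection of Foo objects to SQL Server as a TVP
// and upsert into dbo.Bar keyed on ExternalID.
//
// One-time SQL setup (run separately):
//   CREATE TYPE dbo.FooTableType AS TABLE
//   (ExternalID int NOT NULL PRIMARY KEY,
//    DescriptorString nvarchar(200) NULL,
//    OtherStuff nvarchar(200) NULL);
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class Foo
{
    public int ExternalID { get; set; }
    public string DescriptorString { get; set; }
    public string OtherStuff { get; set; }
}

public static class FooUpserter
{
    // Converts the in-memory collection into a stream of SqlDataRecords, one row at a time.
    private static IEnumerable<SqlDataRecord> ToRecords(IEnumerable<Foo> foos)
    {
        var meta = new[]
        {
            new SqlMetaData("ExternalID", SqlDbType.Int),
            new SqlMetaData("DescriptorString", SqlDbType.NVarChar, 200),
            new SqlMetaData("OtherStuff", SqlDbType.NVarChar, 200)
        };

        foreach (var foo in foos)
        {
            var record = new SqlDataRecord(meta);
            record.SetInt32(0, foo.ExternalID);
            record.SetString(1, foo.DescriptorString);
            record.SetString(2, foo.OtherStuff);
            yield return record;   // streamed; the whole set is never materialized
        }
    }

    public static void Upsert(string connectionString, IEnumerable<Foo> foos)
    {
        // Update existing rows first, then insert the ones still missing (no MERGE).
        const string sql = @"
UPDATE b SET b.DescriptorString = t.DescriptorString, b.OtherStuff = t.OtherStuff
FROM dbo.Bar b
JOIN @Foos t ON t.ExternalID = b.ExternalID;

INSERT INTO dbo.Bar (ExternalID, DescriptorString, OtherStuff)
SELECT t.ExternalID, t.DescriptorString, t.OtherStuff
FROM @Foos t
WHERE NOT EXISTS (SELECT 1 FROM dbo.Bar b WHERE b.ExternalID = t.ExternalID);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            // Note: passing an empty enumeration to a TVP parameter throws,
            // so skip the call entirely when the collection is empty.
            var tvp = command.Parameters.Add("@Foos", SqlDbType.Structured);
            tvp.TypeName = "dbo.FooTableType";
            tvp.Value = ToRecords(foos);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}

Wrapping the two statements in a transaction (or moving them into a stored procedure that takes the TVP) would make the upsert atomic.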

SQL - Better two queries instead of one big one

I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries are added to this table irregularly from another program, so I have no way of knowing if there is a new entry.
Which version is better (performance etc.) and why?
// Always a query, which might not be needed
public void ProcessData()
{
    byte[] data = /* query code: get latest binary data from db */;
}

vs

// Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
    DateTime timestamp = /* query code: get latest timestamp from db */;
    if (timestamp > old_timestamp)
        data = /* query code: get latest binary data from db */;
}
The binary_data field size will be around 30 kB. The function "ProcessData" will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no record will be returned and, thus, no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
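As an illustration, here is a minimal ADO.NET sketch of that single-query approach, keeping the last seen timestamp in memory (class and member names are my assumptions, not from the answer):

using System;
using System.Data.SqlClient;

public class LatestDataReader
{
    private DateTime lastTimestamp = DateTime.MinValue;

    public byte[] ProcessData(string connectionString)
    {
        const string sql = @"
SELECT TOP 1 binary_data, Timestamp
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
            connection.Open();

            using (var reader = command.ExecuteReader())
            {
                if (!reader.Read())
                    return null;                       // nothing newer: no data transferred

                lastTimestamp = reader.GetDateTime(1); // remember the newest timestamp
                return (byte[])reader["binary_data"];  // ~30 kB payload, read only when needed
            }
        }
    }
}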
I would suggest you perform tests using both methods, as the answer will depend on your usage. Simulate some expected behaviour.
I would say, though, that you are probably okay to just do the first query. Do what works; don't prematurely optimise. If the single query is too slow, try your second, two-query approach.
A two-step approach is more efficient in terms of the overall workload of the system:
Get informed that there is new data to query
Query the new data
There are several ways to implement this approach. Here are a couple of them:
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (see the sketch after this list).
Using an indirect method of getting informed about a database table update, e.g. the one described in this article on the SQL Authority blog.
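A minimal sketch of the Query Notifications route via SqlDependency, assuming the table from the question is dbo.myTable (the handler body and names are mine, not the answerer's):

using System;
using System.Data.SqlClient;

public class NewRowWatcher
{
    private readonly string connectionString;

    public NewRowWatcher(string connectionString)
    {
        this.connectionString = connectionString;
        SqlDependency.Start(connectionString);   // requires Service Broker enabled on the database
        Subscribe();
    }

    private void Subscribe()
    {
        using (var connection = new SqlConnection(connectionString))
        // Notification queries need two-part table names and explicit columns (no SELECT *).
        using (var command = new SqlCommand("SELECT ID, Timestamp FROM dbo.myTable", connection))
        {
            var dependency = new SqlDependency(command);
            dependency.OnChange += OnTableChanged;

            connection.Open();
            using (command.ExecuteReader())
            {
                // Executing the command registers the subscription; the results can be discarded.
            }
        }
    }

    private void OnTableChanged(object sender, SqlNotificationEventArgs e)
    {
        // A notification fires only once, so re-subscribe before fetching the new data.
        Subscribe();
        // ... then run the "get latest binary data" query from the question ...
    }
}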
I think that the better path is a stored procedure that keeps the logic inside the database: something with an output parameter for the required data and a return value like TRUE/FALSE to signal the presence of new data.

how to check if value is present in a very big data record or a big list efficiently

Hi guys, I have this doubt...
If I have a record of username and password details for logging in to a website, I'll most probably get the username and password from the form and check whether the given username is present in the database using a contains() Boolean operation, and if it is, check that the password is the same as the one saved in the database.
But for websites like Gmail and Facebook there are millions of records, and the authentication is very quick...
How do they do it? What method do they follow for this?
How do they check if a value is present in a large record that quickly?
Does the process involve just adding more servers for processing speed?
Thanks for the answers...
Edit: sorry, I posted this question without knowing about indexes. (I just came to know that by creating indexes on one or more columns, the full table scan is minimized and the index path is used instead, which is a cheaper and more efficient operation.)
You just need one SQL query:
select 1 from user u
where u.login = :theEnteredLogin
and u.hashed_password = :theHashedEnteredPassword
(where :xxx are parameters of the query).
If you have an index on the login column, or even better a composite index on (login, hashed_password), the query should not take more than a few milliseconds to execute.
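For completeness, here is a minimal ADO.NET sketch of that check against SQL Server; the table and column names follow the query above, while the connection string and hashing scheme are assumptions:

// Assumes an index such as: CREATE INDEX IX_user_login ON [user] (login, hashed_password);
using System.Data.SqlClient;

public static class LoginChecker
{
    public static bool CredentialsValid(string connectionString, string login, string hashedPassword)
    {
        const string sql = @"
SELECT 1 FROM [user] u
WHERE u.login = @login
  AND u.hashed_password = @hashedPassword;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@login", login);
            command.Parameters.AddWithValue("@hashedPassword", hashedPassword);

            connection.Open();
            // ExecuteScalar returns null when no row matches, i.e. invalid credentials.
            return command.ExecuteScalar() != null;
        }
    }
}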
Well, they have lots of servers and high-performance databases. At a low level, the table for the hash is probably indexed by the hash for fast lookup - binary search style.
For medium to large data sets, indexing, combined with proper sizing of disk, memory and CPUs, is the most widely adopted approach.
For very large data sets, the database can be distributed and the data partitioned.
For very, very large data sets, aside from the above scenarios, the technologies used usually involve a map-reduce model.

C# Programming: Maintaining a List that Associates an ID with Information Quickly

In a game that I've been working on, I created a system by which the game polls a specific 'ItemDatabase' file in order to retrieve information about itself based on a given identification number. The identification number represented the point in the database at which the information regarding a specific item was stored. The representation of every item in the database was comprised of 162 bytes. The code for the system was similar to the following:
// Retrieves the information about an 'Item' object given the ID. The
// 'BinaryReader' object contains a file stream to the 'ItemDatabase' file.
public Item(ushort ID, BinaryReader itemReader)
{
    // Since each 'Item' object is represented by 162 bytes of information in the
    // database, skip 162 bytes per ID skipped.
    itemReader.BaseStream.Seek(162 * ID, SeekOrigin.Begin);

    // Retrieve the name of this 'Item' from the database.
    // Note: ReadChars returns a char[], so build a string from it rather than
    // calling ToString(), which would just yield "System.Char[]".
    this.itemName = new string(itemReader.ReadChars(20));
}
Normally there wouldn't be anything particularly wrong with this system as it queries the desired data and initializes it to the correct variables. However, this process must occur during game time, and, based on research that I've done about the efficiency of the 'Seek' method, this technique won't be nearly fast enough to be incorporated into a game. So, my question is: What's a good way to maintain a list that associates an identification number with information that can be accessed quickly?
Your best shot would be a database. SQLite is very portable and does not need to be installed on the system.
If you have loaded all the data into memory, you can use Dictionary<int, Item>. This makes it very easy to add and remove items to the list.
As it seems like your IDs all go from 0 and upwards, it would be really fast with just an array. Just set the index of the item to be the id.
Assuming the information in the "database" is not being changed continuously, couldn't you just read out the various items once-off during the load of the game or level? You could store the data in a variety of ways, such as a Dictionary. The .NET Dictionary is actually what is commonly referred to as a hash table, mapping keys (in this case, your ID field) to objects (which I am guessing are of type "Item"). Lookup times are extremely good (definitely in the millions per second); I doubt you'd ever have issues.
Alternatively, if your ID is a ushort, you could just store your objects in an array with all possible ushort values. An array of 65535 length is not large in today's terms. Array lookups are as fast as you can get.
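To make the suggestions above concrete, here is a hypothetical sketch that reads the whole ItemDatabase once at load time and serves lookups from memory afterwards (the record layout beyond the 20-character name, and the Item class itself, are assumptions):

using System.Collections.Generic;
using System.IO;

public class Item
{
    public string ItemName { get; set; }
}

public class ItemDatabase
{
    private const int RecordSize = 162;   // bytes per item, as in the question
    private readonly Dictionary<ushort, Item> itemsById = new Dictionary<ushort, Item>();

    public ItemDatabase(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            ushort id = 0;
            while (reader.BaseStream.Position + RecordSize <= reader.BaseStream.Length)
            {
                long recordStart = reader.BaseStream.Position;

                // Read the 20-character name, then skip the rest of the 162-byte record.
                var item = new Item { ItemName = new string(reader.ReadChars(20)).TrimEnd('\0') };
                reader.BaseStream.Seek(recordStart + RecordSize, SeekOrigin.Begin);

                itemsById[id++] = item;
            }
        }
    }

    // O(1) lookup at game time; no disk access or Seek involved.
    public Item GetItem(ushort id) => itemsById[id];
}

A plain Item[] array indexed by ID would work just as well here, as noted above, since the IDs are dense ushorts.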
You could use a Dictionary, or, if this is used in a multi-threaded app, a ConcurrentDictionary.
Extremely fast, but a bit more effort to implement, is a MemoryMappedFile.

Storing Data from Forms without creating 100's of tables: ASP.NET and SQL Server

Let me first describe the situation. We host many Alumni events over the course of each year and provide online registration forms for each event. There is a large chunk of data that is common for each event:
An Event with dates, times, managers, internal billing info, etc.
A Registration record with info about the payment and total amount charged per form submission
Bio/Demographic and alumni data about the 1 or more attendees (name, address, degree, etc.)
We store all of the above data within columns in tables as you would expect.
The trouble comes with the 'extra' fields we are asked to put on the forms. Maybe it is a dinner and there is a Veggie or Carnivore option, perhaps there is lodging and there are bed or smoking options, or perhaps there is an optional transportation option. There are tons of weird little "can you add this to the form?" types of requests we receive.
Currently, we JSONify any non-standard data and store it all in one column (per attendee) called 'extras'. We can read this data out in code but it is not well suited to querying. Our internal staff would like to generate a quick report on Veggie dinners needed for instance.
Other than creating a separate table for each form that holds the specific 'extra' data items, are there any other approaches that could make my life (and reporting) easier? Anyone working in a similar environment?
This is actually one of the toughest problems to solve efficiently. The SQL Server Customer Advisory Team has dedicated a white paper to the topic, which I highly recommend you read: Best Practices for Semantic Data Modeling for Performance and Scalability.
You basically have 3 options:
semantic database (entity-attribute-value)
XML column
sparse columns
Each solution comes with its ups and downs. Off the top of my head I'd say XML is probably the one that gives you the best balance of power and flexibility, but the optimal solution really depends on lots of factors, like data set sizes, the frequency at which new attributes are created, the actual process (human operators) that creates, populates and uses these attributes, and not least your team's skill set (some might fare better with an EAV solution, some might fare better with an XML solution). If the attributes are created/managed under a central authority and adding new attributes is a reasonably rare event, then sparse columns may be the better answer.
Well you could also have the following db structure:
Have a table to store custom attributes
AttributeID
AttributeName
Have a mapping table between events and attributes with:
AttributeID
EventID
AttributeValue
This means you will be able to store custom information per event, and you will be able to reuse your attributes. You can include some metadata such as
AttributeType
AllowBlankValue
on the attribute to handle it easily afterwards.
Have you considered using XML instead of JSON? Difference: XML is supported (special data type) and has query integration ;)
Quick and dirty, but actually nice for querying: simply add new columns. It's not like the empty entries in the previous table should cost a lot.
A more database-y solution: you'll have something like an event ID in your table. You can link this to an n:m table connecting events to additional fields, and then store the additional field data in a table with additional_field_id, record_id (from the original table) and the actual value. This probably creates ugly queries, but seems politically correct in terms of database design.
I understand "NoSQL" (not only SQL ;) databases like CouchDB let you store arbitrary fields per record, but since you're already on SQL Server, I guess that's not an option.
This is the solution that we first proposed in ASP.NET Forums (that later became Community Server), and that the ASP.NET team built a similar version of in the ASP.NET 2.0 Membership when they released it:
Property Bags on your domain objects
For example:
Event.Profile() or in your case, Event.Extras().
Basically, a property bag is a serialized collection of data stored in a name/value pair in a column (or columns). The ASP.NET 2.0 Membership went the route of storing names in a semi-colon delimited list, and values in the same:
Table: aspnet_Profile
Column: PropertyNames (separated by semi-colons, and has start index and end index)
Column: PropertyValues (separated by semi-colons, and only stores the string value)
The downside to that approach is that it is all strings and has to be parsed manually (even though the membership system does it for you automatically).
Recently, my method has been to build FormCollection and NameValueCollection C# extension methods that automatically serialize the collections to XML. I store that XML in its own column in the table, associated with that entity. I also have a deserializer C# extension method on XElement that turns the data back into the collection at runtime.
This gives you the power of actually querying those properties in XML via SQL (though that can be slow - always flatten out your read-only data).
The final note is runtime querying: The general rule we follow is, if you are going to query a property of an entity in normal application logic, then you move that property to an actual column on the table - and create the appropriate indexes. If that data will never be queried directly (for example, Linq-to-Sql or EF), then leave it in the XML Property Bag.
Property Bags gives you the power of extending your domain models however you like, without having to modify the db schema.
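As a hedged illustration of that XML property-bag idea, here is a minimal sketch of serializing a NameValueCollection to an XElement for an 'extras'-style column and reading it back; the method and element names are invented here, not the author's actual extensions. Stored in an XML column, the result can still be queried from T-SQL with the xml type's methods (e.g. exist()/value()), which is what makes quick "Veggie dinners" style reports possible.

using System.Collections.Specialized;
using System.Xml.Linq;

public static class PropertyBagExtensions
{
    // NameValueCollection -> <extras><extra name="Dinner">Veggie</extra>...</extras>
    public static XElement ToXml(this NameValueCollection extras)
    {
        var root = new XElement("extras");
        foreach (string key in extras.AllKeys)
        {
            root.Add(new XElement("extra",
                new XAttribute("name", key),
                extras[key]));
        }
        return root;
    }

    // XElement -> NameValueCollection, for use in application code at runtime.
    public static NameValueCollection ToNameValueCollection(this XElement extrasXml)
    {
        var extras = new NameValueCollection();
        foreach (var element in extrasXml.Elements("extra"))
        {
            extras.Add((string)element.Attribute("name"), element.Value);
        }
        return extras;
    }
}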
