On the RavenDB site it says "Use Load over Query when you know the document's Id". In my tests on a simple collection of approximately 1,500 objects, Load is always slower. Why?
Load:
var doc = session.Load<Document>("Documents/123");
Query:
var doc = session.Query<Document>().Where(x => x.Id == "123").SingleOrDefault();
In a test retrieving every document, the average Query time was 66 milliseconds vs 137 for Load. The RavenDB instance is located in another office, hence the high times. Regardless, shouldn't Load always be faster?
Edit
This is the statement I'm referring to: http://ravendb.net/kb/31/my-10-tips-and-tricks-with-ravendb, tip #4. Is it wrong?
From what I understand, Load is guaranteed to return a result (provided the id exists in the database), whereas Query might not return a result if the indexes haven't been updated yet.
You could have a scenario where you insert a record and then, on the next line, try to retrieve that same record using Query and get nothing back. Load would return the record in this scenario.
So I guess the performance degradation you are seeing might be related to the fact that Query goes through an index, whereas Load hits the actual document store.
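A minimal sketch of that scenario, assuming the Document class from the question and an already-initialized document store (note that newer RavenDB versions may reject querying by Id outright, as described in the next answer):

using (var session = store.OpenSession()) // store: an existing IDocumentStore
{
    session.Store(new Document { Id = "Documents/123" });
    session.SaveChanges();
}

using (var session = store.OpenSession())
{
    // Query goes through an index; right after the write the index may still be stale,
    // so this can come back null.
    var byQuery = session.Query<Document>()
                         .Where(x => x.Id == "Documents/123")
                         .SingleOrDefault();

    // Load goes straight to the document store by key (ACID), so it returns the document.
    var byLoad = session.Load<Document>("Documents/123");
}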
When retrieving an item by its Id, you are required to use the .Load(id) method.
Load is an ACID compliant operation. It retrieves documents directly from the document store.
Query is a BASE operation that is "eventually consistent". It goes first against an index, finds the documents in the document store, and then returns them. Querying by an Id could potentially return null if the document was just added and has not been indexed yet.
RavenDB 2.0 added a feature to prevent you from querying by Id. It will throw an exception if you try to do so. So using Load is not just a best practice, it's a requirement.
Related
I have a .NET application written in C#, and use Mongo for my database backend. One of my collections, UserSearchTerms, repeatedly (and unintentionally) has duplicate documents created.
I've teased out the problem to an update function that gets called asynchronously, and can be called multiple times simultaneously. In order to avoid problems with concurrent runs, I've implemented this code using an update which I trigger on any documents that match a specific query (unique on user and program), upserting if no documents are found.
Initially, I can guarantee that no duplicates exist and so expect that only the following two cases can occur:
No matching documents exist, triggering an upsert to add a new document
One matching document exists, and so an update is triggered only on that one document
Given these two cases, I expect that there would be no way for duplicate documents to be inserted through this function - the only time a new document should be inserted is if there are none to begin with. Yet over an hour or so, I've found that even though documents for a particular user/program pair exist, new documents for them are created.
Am I implementing this update correctly to guarantee that duplicate documents will not be created? If not, what is the proper way to implement an update in order to assure this?
This is the function in question:
public int UpdateSearchTerm(UserSearchTerm item)
{
    _userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");

    // Match on UserId AND ProgramId - the definition of a 'unique' document.
    var query = Query.And(
        Query<UserSearchTerm>.EQ(ust => ust.UserId, item.UserId),
        Query<UserSearchTerm>.EQ(ust => ust.ProgramId, item.ProgramId));

    // Replace the matching document, inserting it if no match exists (upsert).
    _userSearches.Update(query, Update<UserSearchTerm>.Replace(item),
        new MongoUpdateOptions { Flags = UpdateFlags.Upsert });

    return (int)_userSearches.Count(query);
}
Additional Information:
I'm using mongod version 2.6.5
The mongocsharpdriver version I'm using is 1.9.2
I'm running .NET 4.5
UserSearchTerms is the collection I store these documents in.
The query is intended to match users on both userId AND programId - my definition of a 'unique' document.
I return a count after the fact for debugging purposes.
You could add a unique index on userId and programId to ensure that no duplicates are inserted.
Docs: https://docs.mongodb.org/v2.4/tutorial/create-a-unique-index/
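With the 1.9.x mongocsharpdriver from the question, creating that index is a one-time call along these lines (a sketch; the collection reference mirrors the question's code):

// requires: using MongoDB.Driver; using MongoDB.Driver.Builders;
var userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");

// Unique compound index on UserId + ProgramId: the server itself will refuse to
// store a second document for the same pair, even under concurrent upserts.
userSearches.CreateIndex(
    IndexKeys<UserSearchTerm>.Ascending(ust => ust.UserId, ust => ust.ProgramId),
    IndexOptions.SetUnique(true));

Once the index exists, the loser of a concurrent upsert race gets a duplicate-key write error instead of silently creating a duplicate, so the calling code should be prepared to catch and retry (or ignore) that error.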
Consider this hypothetical snippet:
using (mongo.RequestStart(db))
{
    var collection = db.GetCollection<BsonDocument>("test");
    var insertDoc = new BsonDocument { { "currentCount", collection.Count() } };
    WriteConcernResult wcr = collection.Insert(insertDoc);
}
It inserts a new document with "currentCount" set to the value returned by collection.Count().
This implies two round-trips to the server. One to calculate collection.Count() and one to perform the insert. Is there a way to do this in one round-trip?
In other words, can the value assigned to "currentCount" be calculated on the server at the time of the insert?
Thanks!
There is no way to do this currently (Mongo 2.4).
The upcoming 2.6 version should have batch operation support, but I don't know whether it will support batching operations of different types or using the result of one operation in another.
What you can do, however, is execute this logic on the server by expressing it in JavaScript and using eval:
collection.Database.Eval(new BsonJavaScript(@"
    var count = db.test.count();
    db.test.insert({ currentCount: count });
"));
But this is not recommended, for several reasons: you lose the write concern, it is very unsafe in terms of security, it requires admin permissions, it holds a global write lock, and it won't work on sharded clusters :)
I think your best route at the moment would be to do this in two queries.
If you're looking for atomic updates or counters (which don't exactly match your example but seem somewhat related), take a look at findAndModify and the $inc operator of update.
If you've got a large collection and you're looking to save CPU, it's recommended that you create another collection called counters that holds one document per collection you want to count, and increment that counter document each time you insert a document into the corresponding collection.
See the guidance here.
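A rough sketch of that counters pattern with the 1.9.x C# driver (the counters collection, document id and field names here are assumptions for illustration). It is still two round trips, but the count comes from a single atomic findAndModify instead of scanning the collection:

// requires: using MongoDB.Bson; using MongoDB.Driver; using MongoDB.Driver.Builders;
var counters = db.GetCollection<BsonDocument>("counters");

// Atomically increment the counter for the "test" collection, creating it on first use,
// and read back the updated value.
var result = counters.FindAndModify(new FindAndModifyArgs
{
    Query = Query.EQ("_id", "test"),
    Update = Update.Inc("count", 1),
    Upsert = true,
    VersionReturned = FindAndModifyDocumentVersion.Modified
});
long currentCount = result.ModifiedDocument["count"].ToInt64();

// Second round trip: the insert itself, using the atomically produced count.
db.GetCollection<BsonDocument>("test")
  .Insert(new BsonDocument { { "currentCount", currentCount } });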
It appears that you can place a JavaScript function inside your query, so perhaps it can be done in one trip, but I haven't implemented this in my own app, so I can't confirm that.
I am looking at the following sample code to include referenced documents and avoid a round trip.
var order = session.Query<Order>()
    .Customize(x => x.Include<Order>(o => o.CustomerId)) // also load the Customer
    .First();
var customer = session.Load<Customer>(order.CustomerId);
My question is how does Raven know that this o=>o.CustomerId implies Customer document/collection? At no time was the entity Customer supplied in the query to get the Order entity. Yet Raven claims that the 2nd query to get Customer can be done against the cache, w/o any network trip.
If it's by naming convention, which seems like a very poor/fragile/brittle convention to adopt, what happens when I need to include more than one document?
E.g. a car was purchased under two names, so I want to link back to two customers, the primary and the secondary customer/driver. They're both stored in the Customer collection.
var sale = session.Query<Sale>()
    .Customize(x => x.Include<Sale>(o => o.PrimaryCustomerId)
                     .Include<Sale>(o => o.SecondaryCustomerId)) // also load both customers
    .First();
var primaryCustomer = session.Load<Customer>(sale.PrimaryCustomerId);
var secondaryCustomer = session.Load<Customer>(sale.SecondaryCustomerId);
How can I do the above in one network trip? How would Raven even know that o => o.PrimaryCustomerId and o => o.SecondaryCustomerId are references to one and the same Customer table, since obviously the property names and the collection name don't line up?
Raven doesn't have the concept of "tables". It does know about "collections", but they are just a convenience mechanism. Behind the scenes, all documents are stored in one big database. The only thing that makes a "collection" is that each document has a Raven-Entity-Name metadata value.
Both the examples you showed will result in one round trip (each). Your code looks just fine to me.
My question is how does Raven know that this o=>o.CustomerId implies Customer document/collection? At no time was the entity Customer supplied in the query to get the Order entity.
It doesn't need to be supplied in the query. As long as the data stored in the CustomerId field of the Sale document is a full document key, then that document will be returned to the client and loaded into session.
Yet Raven claims that the 2nd query to get Customer can be done against the cache, w/o any network trip.
That's correct. The session container tracks all documents returned - not just the ones from the query results. So later when you call session.Load using the same document key, it already has it in session so it doesn't need to go back to the server.
Regardless of whether you query, load, or include - the document doesn't get deserialized into a static type until you pull it out of the session. That's why you specify the Customer type in the session.Load<Customer> call.
If it's by naming convention, which seems like a very poor/fragile/brittle convention to adopt ...
Nope, it's by the value stored in the property which is a document key such as "customers/123". Every document is addressable by its document key, with or without knowing the static type of the class.
what happens when I need to include more than one document?
The exact same thing. There isn't a limit on how many documents can be included or loaded into session. However, you should be sure to open the session in a using statement so it is disposed properly. The session is a "Unit of Work container".
How would Raven even know that o => o.PrimaryCustomerId and o => o.SecondaryCustomerId are references to one and the same Customer table, since obviously the property names and the collection name don't line up?
Again, it doesn't matter what the names of the fields are. It matters that the data in those fields contains a document id, such as "customers/123". If you aren't storing the full string identifier, then you will need to build the document key inside the lambda expression. In other words, if Sale.CustomerId contains just the number 123, then you would need to include it with .Include<Sale>(o => "customers/" + o.CustomerId).
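If you want to confirm the single round trip yourself, the session exposes a request counter. A minimal sketch, assuming the Sale and Customer classes from the question and full document keys stored in the two id properties:

using (var session = store.OpenSession()) // store: an existing IDocumentStore
{
    var sale = session.Query<Sale>()
        .Customize(x => x.Include<Sale>(s => s.PrimaryCustomerId)
                         .Include<Sale>(s => s.SecondaryCustomerId))
        .First();

    // Both customers were already shipped back with the query result and tracked by
    // the session, so these Load calls are served from the session cache.
    var primary = session.Load<Customer>(sale.PrimaryCustomerId);
    var secondary = session.Load<Customer>(sale.SecondaryCustomerId);

    Console.WriteLine(session.Advanced.NumberOfRequests); // 1
}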
I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries to this table are added irregularly by another program, so I have no way of knowing whether there is a new entry.
Which version is better (performance etc.) and why?
// Always a query, which might not be needed
public void ProcessData()
{
    byte[] data = "query code: get latest binary data from db";
}

vs

// Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
    DateTime timestamp = "query code: get latest timestamp from db";
    if (timestamp > old_timestamp)
        data = "query code: get latest binary data from db";
}
The binary_data field size will be around 30 kB. The function ProcessData will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, nothing is returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
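From the C# side that could look roughly like the sketch below, assuming Timestamp is a datetime column (as the pseudocode above implies) and the table is called myTable; the last-seen timestamp is kept in the application:

// requires: using System; using System.Data.SqlClient;
private DateTime _lastTimestamp = DateTime.MinValue;

public byte[] TryGetNewData(string connectionString)
{
    const string sql = @"
        SELECT TOP 1 binary_data, Timestamp
        FROM myTable
        WHERE Timestamp > @last_timestamp
        ORDER BY Timestamp DESC";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@last_timestamp", _lastTimestamp);
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            if (!reader.Read())
                return null;                        // nothing newer: one cheap round trip

            _lastTimestamp = reader.GetDateTime(1); // remember the newest timestamp
            return (byte[])reader["binary_data"];   // the ~30 kB payload, only when it changed
        }
    }
}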
I would suggest you perform tests using both methods as the answer would depend on your usages. Simulate some expected behaviour.
I would say, though, that you are probably okay to just do the first query. Do what works. Don't prematurely optimise; if the single query is too slow, try your second two-query approach.
A two-step approach is more efficient in terms of overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (see the sketch after this list).
Using an implicit method of getting informed of database table updates, e.g. the one described in this article on the SQL Authority blog.
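For the Query Notifications option, .NET exposes SqlDependency. A minimal sketch, assuming Service Broker is enabled on the database and the table from the question lives at dbo.myTable (both assumptions):

// requires: using System.Data.SqlClient;
public class NewDataWatcher
{
    private readonly string _connectionString;

    public NewDataWatcher(string connectionString)
    {
        _connectionString = connectionString;
        SqlDependency.Start(_connectionString); // once per application per connection string
    }

    public void Subscribe()
    {
        using (var conn = new SqlConnection(_connectionString))
        // Notification queries need two-part table names and an explicit column list (no SELECT *).
        using (var cmd = new SqlCommand("SELECT ID, binary_data, Timestamp FROM dbo.myTable", conn))
        {
            var dependency = new SqlDependency(cmd); // must be attached before the command executes
            dependency.OnChange += (sender, e) =>
            {
                // Fires once when the result set changes: re-subscribe, then fetch the new row.
                Subscribe();
            };

            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                // Read (or skip) the current rows; the notification covers changes after this point.
            }
        }
    }
}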
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data and a return value like TRUE/FALSE to signal the presence of new data.
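On the C# side, calling such a procedure might look like this sketch; the procedure name dbo.GetNewBinaryData and its parameters are assumptions, since the procedure itself is not defined anywhere in the question:

// requires: using System.Data; using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetNewBinaryData", conn) { CommandType = CommandType.StoredProcedure })
{
    cmd.Parameters.AddWithValue("@last_timestamp", lastTimestamp);

    var dataParam = cmd.Parameters.Add("@binary_data", SqlDbType.VarBinary, -1);
    dataParam.Direction = ParameterDirection.Output;

    var hasNewData = cmd.Parameters.Add("@return", SqlDbType.Int);
    hasNewData.Direction = ParameterDirection.ReturnValue;

    conn.Open();
    cmd.ExecuteNonQuery();

    if ((int)hasNewData.Value != 0)             // the TRUE/FALSE signal from the procedure
    {
        var data = (byte[])dataParam.Value;     // the binary payload from the output parameter
        // hand data to the rest of ProcessData ...
    }
}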
I have the following that returns a collection from Azure table storage where Skip is not implemented. The number of rows returned is approximately 500.
ICollection<City> a = cityService.Get("0001I");
What I would like to do is, depending on an argument, have just the following ranges returned:
records 1-100 passing in 0 as an argument to a LINQ expression
records 101-200 passing in 100 as an argument to a LINQ expression
records 201-300 passing in 200 as an argument to a LINQ expression
records 301-400 passing in 300 as an argument to a LINQ expression
etc
Is there some way I can add to the above and use LINQ to get these ranges of records returned?
As you already stated in your question, the Skip method is not implemented in Windows Azure Table storage. This means you have 2 options left:
Option 1
Download all data from table storage (by using ToList, see abatishchev's answer) and execute the Skip and Take methods on this complete list. In your question you're talking about 500 records. If the number of records doesn't grow too much, this solution should be OK for you; just make sure that all records have the same partition key.
If the data grows you can still use this approach, but I suggest you evaluate a caching solution to store all the records instead of loading them from table storage over and over again (this will improve the performance, but don't expect this to work with very large amounts of data). Caching is possible in Windows Azure using:
Windows Azure Caching (Preview)
Windows Azure Shared Caching
Option 2
The CloudTableQuery class allows you to query for data and, more importantly, to receive a continuation token to build a paging implementation. It lets you detect whether more data is available; the pagination example in Scott's blog post (see nemensv's comment) uses this.
For more information on continuation tokens I suggest you take a look at Jim's blog post: Azure#home Part 7: Asynchronous Table Storage Pagination. By using continuation tokens you only download the data for the current page, meaning it will also work correctly even if you have millions of records (a sketch follows the list below). But you have to know the downsides of using continuation tokens:
This won't work with the Skip method out of the box, so it might not be a solution for you.
No page 'numbers', because you only know if there's more data (not how much)
No way to count all records
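Here is a hedged sketch of segment-by-segment paging. Note it uses the newer storage client types (CloudTable/TableQuery) rather than the CloudTableQuery class mentioned above, and it assumes City derives from TableEntity and is stored in a table named Cities with partition key "0001I":

// requires: using Microsoft.WindowsAzure.Storage.Table;
CloudTable table = tableClient.GetTableReference("Cities"); // tableClient: an existing CloudTableClient

var query = new TableQuery<City>()
    .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "0001I"))
    .Take(100); // server-side page size

TableContinuationToken token = null;
do
{
    TableQuerySegment<City> segment = table.ExecuteQuerySegmented(query, token);

    foreach (City city in segment.Results)
    {
        // render or process the current page only - nothing else is downloaded
    }

    token = segment.ContinuationToken; // null once there is no more data
} while (token != null);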
If paging is not supported by the underlying engine, the only way to implement it is to load all the data into memory and then perform paging:
var list = cityService.Get("0001I").ToList(); // materialize
var result = list.Skip(x).Take(y);
Try something like this:
cityService.Get("0001I").ToList().Skip(n).Take(100);
This should return records 201-300:
cityService.Get("0001I").ToList().Skip(200).Take(100);
a.AsEnumerable().Skip(m).Take(n)