How to speed up a query for binary data in LINQ - C#

I use a LINQ select statement to get a file as bytes from the database. It takes a long time to get the byte array from my Files table.
I'm using this code:
Func<Files, bool> Filter = delegate(Files x)
{
    return x.FileID == 10;
};
Files File = DAL.FilesDAL.GetFile(Filter).First();

public static List<Files> GetFile(Func<Files, bool> lambda)
{
    return db.Where(lambda).ToList();
}
For a 1 MB file it takes up to a minute. That is too long for my clients.
How can I improve the speed of that query?

Looks like you're missing the fact that using Func<> makes your query execute as LINQ to Objects. That means the entire table is fetched into application memory and the filtering is performed by the application, not by the database. To make the filter execute in the database, it should be passed as an Expression<Func<>> instead:
public static List<Files> GetFile(Expression<Func<Files, bool>> lambda)
{
    return db.Where(lambda).ToList();
}
I assumed that db here is an IQueryable<Files>.
To call it use:
Files File = DAL.FilesDAL.GetFile(x => x.FileID == 10).First();
To make it even more descriptive, you should probably change the method to return a single Files item. The method name is GetFile, so I would expect it to return one file, not a collection of files:
public static Files GetFile(Expression<Func<Files, bool>> lambda)
{
    return db.FirstOrDefault(lambda);
}
Usage:
var File = DAL.FilesDAL.GetFile(x => x.FileID == 10);
Or, to strengthen the semantics further, refactor the method to GetFileById and take an id, not an expression:
public static Files GetFileById(int id)
{
    return db.FirstOrDefault(x => x.FileID == id);
}
Usage:
var File = DAL.FilesDAL.GetFileById(10);

How can I improve the speed of that query?
Faster server. Faster network. Do not fetch binary content you do not need.
Simple as that. Nothing at the LINQ level will magically fix a bad design.
Generally:
By the way, a bad mistake: you return ToList() from GetFile. Never do that. Either leave it as a queryable, so the caller can add conditions, ordering, and grouping, or return an enumerable (which is far less powerful). ToList() pulls everything into memory, even when the caller does not want a list.
And the way the lambda is handed in is just redundant and limiting. Bad design. A sketch of both points follows.
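For illustration, a minimal sketch of that advice (assuming the same static db as above; Query, FileName, and Data are illustrative names, not from the original post). Return a composable IQueryable and project away the blob so it never leaves the database until a caller actually asks for it:
public static IQueryable<Files> Query()
{
    return db; // stays composable; nothing executes yet
}

// List files without pulling the binary content across the wire:
var names = DAL.FilesDAL.Query()
    .Where(f => f.FileID == 10)
    .Select(f => new { f.FileID, f.FileName }) // the Data column is never selected
    .ToList();

// Fetch the blob only for the one file that actually needs it:
byte[] bytes = DAL.FilesDAL.Query()
    .Where(f => f.FileID == 10)
    .Select(f => f.Data)
    .FirstOrDefault();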

Related

How to get records in order from a database materialized view using Entity Framework

First I create a database view where the records are ordered.
But when I do Skip and Take, the results are not ordered.
var query = dbContext.UserView.OrderBy(x => x.Id);
for (int i = 0; i < 10; i++)
{
    var users = await query
        .Skip(i)
        .Take(1)
        .ToListAsync();
    await SendMessage(users);
}
I am trying to take and send records in chunks, but I don't want to load them all into memory.
If I don't order with var query = dbContext.UserView.OrderBy(x => x.Id); here, I receive a different order each time in my for loop, even though I created my database view with ORDER BY.
When I call ToListAsync(), will it order every time and become a slower query?
Is there a way to create the database view so that every time I ask for records they keep the same order?
Thank you
Database Views are NEVER ordered by definition. The CREATE or ALTER statement for the view will fail if there is an ORDER BY clause. In this case, though, you cover for it with the .OrderBy() call from the dbContext.
Additionally, it's usually best to avoid calling ToList() or ToListAsync() before you need to, and in this case you probably don't need to.
As for the ordering... this is what async code is for: it lets you do work on one thing while still await-ing another. The result is that things don't always run in the order you expect. But there are some things we can do to help.
Finally, the way the loop is structured, this sends 10 separate queries to the database. That's bad. In fact, it's probably the main source of the strange ordering, as the other async operations should all finish nearly instantly and so avoid most of the reordering issues.
You want it to look more like this instead:
var query = dbContext.UserView.OrderBy(x => x.Id).Take(10);
foreach (var user in query)
{
    await SendMessage(user);
}
Or, if SendMessage() can only accept a sequence:
var query = dbContext.UserView.OrderBy(x => x.Id).Take(10);
await SendMessage(query);
And if (and only if!) SendMessage() absolutely demands a list (and given the requirement from the comments to still only have one item in memory), we can still improve performance by only running ONE query:
var query = dbContext.UserView.OrderBy(x => x.Id).Take(10);
foreach (var user in query)
{
    // guessing at the type name here
    var list = new List<User>() { user };
    await SendMessage(list);
}
Again: the above only runs ONE query on the database, but still only has ONE item in memory at a time.
But in this last situation I'd probably first explore whether I could also change or overload SendMessage() to accept an IEnumerable<User>. That is, the method probably looks something like this:
public static async Task SendMessage(List<User> users)
{
    foreach (var user in users)
    {
        // send the message
    }
}
And if so you can change it like this:
public static async Task SendMessage(IEnumerable<User> users)
{
    foreach (var user in users)
    {
        // send the message
    }
}
Note that the code inside the method did not change... only the signature. Best of all, you can still pass a list to the method when you call it. And now you can also pass any other enumerable, including the version from the second code sample in this answer.
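If you're on EF Core 3 or later (an assumption; the question doesn't say which EF flavor is in use), AsAsyncEnumerable() offers yet another shape: a single query, streamed asynchronously with only one row in memory at a time:
var query = dbContext.UserView.OrderBy(x => x.Id).Take(10);
await foreach (var user in query.AsAsyncEnumerable())
{
    // still one database query; rows are materialized one at a time
    await SendMessage(new List<User>() { user }); // type name is a guess, as above
}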

Custom Method in LINQ Query

I count myself among the hapless lot that fumbles with custom methods in LINQ to EF queries. I've skimmed the web trying to detect a pattern in what makes a custom method LINQ-friendly, and while every source says that the method must be translatable into a T-SQL query, the applications seem very diverse. So, I'll post my code here and hopefully a generous SO denizen can tell me what I'm doing wrong and why.
The Code
public IEnumerable<WordIndexModel> GetWordIndex(int transid)
{
    return (from trindex in context.transIndexes
            let trueWord = IsWord(trindex)
            join trans in context.Transcripts on trindex.transLineUID equals trans.UID
            group new { trindex, trans } by new { TrueWord = trueWord, trindex.transID } into grouped
            orderby grouped.Key.TrueWord
            where grouped.Key.transID == transid
            select new WordIndexModel
            {
                Word = grouped.Key.TrueWord,
                Instances = grouped.Select(test => test.trans).Distinct()
            });
}

public string IsWord(transIndex trindex)
{
    Match m = Regex.Match(trindex.word, @"^[a-z]+(\w*[-]*)*",
        RegexOptions.IgnoreCase);
    return m.Value;
}
With the above code I access a table, transIndex, that is essentially a word index culled from various user documents. The problem is that not all entries are actually words. Numbers, and even underscore lines such as ___________, are saved as well.
The Problem
I'd like to keep only the words that my custom method IsWord returns (at the present time I have not actually developed the parsing mechanism). But as the IsWord function shows, it returns a string.
So, using let, I introduce my custom method into the query and use it as a grouping parameter, which is then selectable into my object. Upon execution I get the ominous:
LINQ to Entities does not recognize the method
'System.String IsWord(transIndex)' method, and this
method cannot be translated into a store expression."
I also need to make sure that only records that match the IsWord condition are returned.
Any ideas?
It is saying that it does not know how to translate your IsWord method to SQL.
Frankly, it does not do much anyway, so why not replace it with:
return (from trindex in context.transIndexes
        let trueWord = trindex.word
        join trans in context.Transcripts on trindex.transLineUID equals trans.UID
        group new { trindex, trans } by new { TrueWord = trueWord, trindex.transID } into grouped
        orderby grouped.Key.TrueWord
        where grouped.Key.transID == transid
        select new WordIndexModel
        {
            Word = grouped.Key.TrueWord,
            Instances = grouped.Select(test => test.trans).Distinct()
        });
As for which methods EF can translate into SQL, I can't give you a list, but it can never translate an arbitrary method you have written yourself. There are, however, some built-in ones that it understands, like MyArray.Contains(x), which it can turn into something like:
...
WHERE Field IN (ArrItem1, ArrItem2, ArrItem3)
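For example, a quick sketch (the local array is made up; transID is the column from the question's entity):
int[] myIds = { 1, 2, 3 };
var rows = context.transIndexes
    .Where(t => myIds.Contains(t.transID)) // becomes WHERE transID IN (1, 2, 3)
    .ToList();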
If you want to write a LINQ-compatible method, you need to create an expression tree that EF can understand and turn into SQL.
This is where things start to bend my mind a little, but this article may help: http://blogs.msdn.com/b/csharpfaq/archive/2009/09/14/generating-dynamic-methods-with-expression-trees-in-visual-studio-2010.aspx
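As a rough illustration (a sketch only, reusing the transIndex entity from the question; it merely approximates IsWord by checking that word sorts at or after "a"): a predicate built with the Expression API stays inspectable, so the provider can translate it to SQL:
// Requires System.Linq.Expressions. Builds: t => t.word.CompareTo("a") >= 0
var param = Expression.Parameter(typeof(transIndex), "t");
var compareTo = typeof(string).GetMethod("CompareTo", new[] { typeof(string) });
var call = Expression.Call(Expression.Property(param, "word"), compareTo, Expression.Constant("a"));
var body = Expression.GreaterThanOrEqual(call, Expression.Constant(0));
var predicate = Expression.Lambda<Func<transIndex, bool>>(body, param);

var words = context.transIndexes.Where(predicate); // translatable: WHERE word >= 'a'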
If the percentage of bad records returned is not large, you could consider enumerating the result set first and then applying the processing/filtering:
var query = (from trindex in context.transIndexes
             ...
             select new WordIndexModel
             {
                 Word = grouped.Key.TrueWord,
                 Instances = grouped.Select(test => test.trans).Distinct()
             });
// IsTrueWord: a boolean variant of IsWord that checks the word string itself
var result = query.ToList().Where(w => IsTrueWord(w.Word));
return result;
If the number of records is too high to enumerate, consider doing the check in a view or stored procedure. That will help with speed and keep the code clean.
But of course, using stored procedures has disadvantages for reusability and maintainability (because there are no refactoring tools).
Also, check out another answer which seems to be similar to this one: https://stackoverflow.com/a/10485624/3481183

How do I make a streaming LINQ expression that delivers the filtered out items as well as the filtered items?

I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.
Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.
var elements = sheet
    .Rows()                 <-- BEGIN sheet data transform
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()        <-- END sheet data transform
    .ToElements(strings)    <-- BEGIN domain transform
    .RemoveBadRecords(out discard)
    .OrderByCompositeKey();
The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.
public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
    return map.Select(pair =>
    {
        try
        {
            return ElementRow.FromCells(pair.Key, pair.Value);
        }
        catch (Exception)
        {
            return ElementRow.BadRecord(pair.Key);
        }
    });
}
Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
    return elements.Where(el => el.FormatId != 0);
}
However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
    var temp = new List<ElementRow>();
    var filtered = elements.Where(el =>
    {
        if (el.FormatId == 0) temp.Add(el);
        return el.FormatId != 0;
    });
    discard = temp;
    return filtered;
}
And, lo! I thought I was hardcore and would have this working in one shot...
var discard = new List<ElementRow>();
var elements = data
    /* snipped long LINQ statement */
    .RemoveBadRecords(out discard)
    /* snipped long LINQ statement */ ;
discard.ForEach(el => failures.Add(el));
foreach (var el in elements)
{
    /* do more work, maybe add more failures */
}
return new Result(elements, failures);
But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.
The temp list was created
The Where filter was assigned (but not yet run)
And the discard list was assigned
Then the streaming thing was returned
When discard was iterated, it contained no elements, because the elements weren't iterated over yet.
Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?
Some Commentary
Jon mentioned that the assignment /was/ happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.
When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!
It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.
However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)
I'd suggest materializing the query before removing the bad records - and then you can run separate queries:
var allElements = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .ToList();

var goodElements = allElements.Where(el => el.FormatId != 0)
                              .OrderByCompositeKey();
var badElements = allElements.Where(el => el.FormatId == 0);
By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need enough memory to keep all the elements at once, of course. There are alternative approaches (such as taking an action on each bad element while filtering it, as sketched below), but they're still likely to end up being fairly fragile.
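That callback alternative could look something like this (a sketch only, reusing the ElementRow type from the question; it is fragile in the same way the out parameter is, because the callback only fires as the sequence is actually iterated):
public static IEnumerable<ElementRow> RemoveBadRecords(
    this IEnumerable<ElementRow> elements, Action<ElementRow> onDiscard)
{
    foreach (var el in elements)
    {
        if (el.FormatId == 0)
            onDiscard(el);  // report the bad record as it streams past
        else
            yield return el;
    }
}

// usage: .RemoveBadRecords(failures.Add) in place of the out parameter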
EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:
var lookup = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .OrderByCompositeKey()
    .ToLookup(el => el.FormatId == 0);
Then you can use:
foreach (var goodElement in lookup[false])
{
    ...
}
and
foreach (var badElement in lookup[true])
{
    ...
}
Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:
foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
    ...
}
I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.

Entity framework with Linq to Entities performance

If I have a static method like this
private static bool TicArticleExists(string supplierIdent)
{
    using (TicDatabaseEntities db = new TicDatabaseEntities())
    {
        if ((from a in db.Articles where a.SupplierArticleID.Equals(supplierIdent) select a).Count() > 0)
            return true;
    }
    return false;
}
and use this method in various places in foreach loops, or just call it numerous times, does it create and open a new connection every time?
If so, how can I tackle this? Should I cache the results somewhere (in this case, cache the entire Classifications table in MemoryCache) and then run queries against that cached object?
Or should I make the TicDatabaseEntities variable static and initialize it at class level?
Should my class be static if it contains only static methods? Because right now it is not.
Also I've noticed that if I call result.First() instead of FirstOrDefault() and the query does not find a match, it throws an exception (with FirstOrDefault() there is no exception; it returns null).
Thank you for clarification.
New connections are inexpensive thanks to connection pooling. Basically, it grabs an already open connection (I think they are kept open for two minutes for reuse).
Still, caching may be better. I really do not like the FirstOrDefault. Think about whether you can actually pull in more in ONE statement, then work from that.
For the rest, I cannot say anything - too much depends on what you actually do there logically. What IS TicDatabaseEntities? CAN it be cached? How long? Same with (3) - we do not know, because we do not know what else is in there.
If this is something like getting just some lookup strings for later use, I would say:
Build a key out of class I, class II, class III
Load all classifications in (I assume there are only a couple of hundred)
Put them into a static / cached dictionary, assuming they normally do not change (and I think I have that idea here - is this a financial tickstream database?)
Without business knowledge this cannot be answered.
4: yes, that is as documented. First gives the first item or throws an exception; FirstOrDefault returns default (a zero-initialized struct for value types, null for classes).
Thanks Dan and TomTom, I've come up with this. Could you please comment if you see anything out of order?
public static IEnumerable<Article> TicArticles
{
    get
    {
        ObjectCache cache = MemoryCache.Default;
        if (cache["TicArticles"] == null)
        {
            CacheItemPolicy policy = new CacheItemPolicy();
            using (TicDatabaseEntities db = new TicDatabaseEntities())
            {
                IEnumerable<Article> articles = (from a in db.Articles select a).ToList();
                cache.Set("TicArticles", articles, policy);
            }
        }
        return (IEnumerable<Article>)MemoryCache.Default["TicArticles"];
    }
}

private static bool TicArticleExists(string supplierIdent)
{
    if (TicArticles.Count(p => p.SupplierArticleID.Equals(supplierIdent)) > 0)
        return true;
    return false;
}
If this is OK, I'm going to make all my methods follow this pattern.
does it create and open a new connection every time?
No. Connections are cached.
Should I cache the results somewhere
No. Do not cache entire tables.
should I make TicDatabaseEntities variable static and initialize it at class level?
No. Do not retain a DataContext instance longer than a UnitOfWork.
Should my class be static if it contains only static methods?
Sure... doing so will prevent anyone from creating useless instances of the class.
Also I've noticed that if I call result.First() instead of FirstOrDefault() and the query does not find a match, it throws an exception
That is the behavior of First. As such - I typically restrict use of First to IGroupings or to collections previously checked with .Any().
I'd rewrite your existing method as:
using (TicDatabaseEntities db = new TicDatabaseEntities())
{
    bool result = db.Articles
        .Any(a => a.SupplierArticleID.Equals(supplierIdent));
    return result;
}
If you are calling the method in a loop, I'd rewrite to:
private static Dictionary<string, bool> TicArticleExists(List<string> supplierIdents)
{
    using (TicDatabaseEntities db = new TicDatabaseEntities())
    {
        HashSet<string> queryResult = new HashSet<string>(db.Articles
            .Where(a => supplierIdents.Contains(a.SupplierArticleID))
            .Select(a => a.SupplierArticleID));
        Dictionary<string, bool> result = supplierIdents
            .ToDictionary(s => s, s => queryResult.Contains(s));
        return result;
    }
}
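Usage would then look something like this (a sketch; the idents are made up), with one round trip answering existence for the whole batch:
var idents = new List<string> { "ABC-1", "XYZ-9" };
Dictionary<string, bool> exists = TicArticleExists(idents);
bool hasAbc1 = exists["ABC-1"]; // true if the article was found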
I'm trying to find the article where I read this, but I think it's better to do the following (if you're just looking for a count):
from a in db.Articles where a.SupplierArticleID.Equals(supplierIdent) select 1
Also, use Any instead of Count > 0.
Will update when I can cite a source.
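In code, the two suggestions look like this (a sketch following the pattern above; supplierIdent as in the question):
using (TicDatabaseEntities db = new TicDatabaseEntities())
{
    // Projecting a constant: no entity columns are materialized just to count
    int n = (from a in db.Articles
             where a.SupplierArticleID.Equals(supplierIdent)
             select 1).Count();

    // For a pure existence check, Any can stop at the first match:
    bool exists = db.Articles.Any(a => a.SupplierArticleID.Equals(supplierIdent));
}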

Linq-to-Sql: recursively get children

I have a Comment table which has a CommentID and a ParentCommentID. I am trying to get a list of all children of the Comment. This is what I have so far, I haven't tested it yet.
private List<int> searchedCommentIDs = new List<int>();
// searchedCommentIDs is a list of already yielded comments, stored
// so that malformed data does not result in an infinite loop.

public IEnumerable<Comment> GetReplies(int commentID)
{
    var db = new DataClassesDataContext();
    var replies = db.Comments
        .Where(c => c.ParentCommentID == commentID
            && !searchedCommentIDs.Contains(commentID));
    foreach (Comment reply in replies)
    {
        searchedCommentIDs.Add(reply.CommentID);
        yield return reply;
        // yield return GetReplies(reply.CommentID); // type mismatch
        foreach (Comment replyReply in GetReplies(reply.CommentID))
        {
            yield return replyReply;
        }
    }
}
3 questions:
Is there any obvious way to improve this? (Besides maybe creating a view in SQL with a CTE.)
How come I can't yield an IEnumerable<Comment> to an IEnumerable<Comment>, only a Comment itself?
Is there any way to use SelectMany in this situation?
I'd probably use either a UDF/CTE, or (for very deep structures) a stored procedure that does the same manually.
Note that if you can change the schema, you can pre-index such recursive structures into an indexed/ranged tree that lets you do a single BETWEEN query - but maintaining the tree is expensive (i.e. the query becomes cheap, but insert/update/delete become expensive, or you need a delayed scheduled task). A sketch of the read side follows.
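The read side of that ranged tree could look like this (a sketch; Left and Right are hypothetical columns holding each comment's range bounds, maintained so that every descendant's bounds fall inside its parent's):
var parent = db.Comments.Single(c => c.CommentID == commentID);
var descendants = db.Comments
    .Where(c => c.Left > parent.Left && c.Right < parent.Right) // one range query, no recursion
    .ToList();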
Re 2 - you can only yield the type specified in the enumeration (the T in IEnumerable<T> / IEnumerator<T>).
You could yield an IEnumerable<Comment> if the method returned IEnumerable<IEnumerable<Comment>> - does that make sense?
Improvements:
perhaps a UDF (to keep composability, rather than a stored procedure) that uses the CTE recursion approach
use using, since DataContext is IDisposable...
so:
using (var db = new MyDataContext()) { /* existing code */ }
LoadWith is worth a try, but I'm not sure I'd be hopeful...
the list of searched ids is risky as a field - I guess you're OK as long as you don't call it twice... personally, I'd use an argument on a private backing method... (i.e. pass the collection between recursive calls, but not on the public API) - see the sketch below
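That shape could look something like this (a sketch only; it swaps the List for a HashSet and keeps the visited set off the public API):
public IEnumerable<Comment> GetReplies(int commentID)
{
    return GetReplies(commentID, new HashSet<int>());
}

private IEnumerable<Comment> GetReplies(int commentID, HashSet<int> visited)
{
    if (!visited.Add(commentID))
        yield break; // already expanded this comment; avoids cycles in malformed data
    using (var db = new DataClassesDataContext())
    {
        // ToList() so the context isn't disposed while the caller is still iterating
        var replies = db.Comments
            .Where(c => c.ParentCommentID == commentID)
            .ToList();
        foreach (var reply in replies)
        {
            yield return reply;
            foreach (var nested in GetReplies(reply.CommentID, visited))
                yield return nested;
        }
    }
}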
