I have a Comment table which has a CommentID and a ParentCommentID. I am trying to get a list of all children of a comment. This is what I have so far; I haven't tested it yet.
private List<int> searchedCommentIDs = new List<int>();
// searchedCommentIDs is a list of already yielded comments stored
// so that malformed data does not result in an infinite loop.
public IEnumerable<Comment> GetReplies(int commentID) {
var db = new DataClassesDataContext();
var replies = db.Comments
.Where(c => c.ParentCommentID == commentID
&& !searchedCommentIDs.Contains(commentID));
foreach (Comment reply in replies) {
searchedCommentIDs.Add(commentID);
yield return reply;
// yield return GetReplies(reply.CommentID); // type mismatch.
foreach (Comment replyReply in GetReplies(reply.CommentID)) {
yield return replyReply;
}
}
}
Questions:
Is there any obvious way to improve this? (Besides maybe creating a view in SQL with a CTE.)
How come I can't yield an IEnumerable<Comment> to an IEnumerable<Comment>, only Comment itself?
Is there any way to use SelectMany in this situation?
I'd probably use either a UDF/CTE, or (for very deep structures) a stored procedure that does the same manually.
Note that if you can change the schema, you can pre-index such recursive structures into an indexed/ranged tree that lets you do a single BETWEEN query - but the maintenance of the tree is expensive (i.e. query becomes cheap, but insert/update/delete become expensive, or you need a delayed scheduled task).
Re 2 - you can only yield the type specified in the enumeration (the T in IEnumerable<T> / IEnumerator<T>).
You could yield an IEnumerable<Comment> if the method returned IEnumerable<IEnumerable<Comment>> - does that make sense?
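To make the point about the element type concrete, here is a minimal in-memory sketch. The Comment class, ReplyWalker and the idea of passing the whole list in are assumptions for illustration, not the asker's LINQ to SQL model, and it assumes the data has no cycles. An iterator declared as IEnumerable<Comment> can only yield a single Comment at a time, so the nested sequence is flattened with an inner foreach; the SelectMany version does the same flattening without an iterator block.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the LINQ to SQL entity.
public class Comment
{
    public int CommentID { get; set; }
    public int ParentCommentID { get; set; }
}

public static class ReplyWalker
{
    // Iterator version: each yield return must produce one Comment,
    // so the nested sequence is flattened with an inner foreach.
    public static IEnumerable<Comment> GetReplies(IEnumerable<Comment> all, int commentID)
    {
        foreach (var reply in all.Where(c => c.ParentCommentID == commentID))
        {
            yield return reply;
            foreach (var nested in GetReplies(all, reply.CommentID))
                yield return nested;
        }
    }

    // SelectMany version: each reply is projected to "itself followed by its
    // descendants", and SelectMany flattens those sequences into one.
    public static IEnumerable<Comment> GetRepliesFlat(IEnumerable<Comment> all, int commentID)
    {
        return all
            .Where(c => c.ParentCommentID == commentID)
            .SelectMany(reply => new[] { reply }.Concat(GetRepliesFlat(all, reply.CommentID)));
    }
}
```

Both methods return the replies depth-first (a reply, then its descendants, then the next sibling). Against a real DataContext each recursion level would still be a separate query.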
Improvements:
perhaps a UDF (to keep composability, rather than a stored procedure) that uses the CTE recursion approach
use using, since DataContext is IDisposable...
so:
using(var db = new MyDataContext() ) { /* existing code */ }
LoadWith is worth a try, but I'm not sure I'd be hopeful...
the list of searched ids is risky as a field - I guess you're OK as long as you don't call it twice... personally, I'd use an argument on a private backing method... (i.e. pass the list between recursive calls, but not on the public API)
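The last suggestion (keep the "already seen" bookkeeping off the public API) might look roughly like this. It is an in-memory sketch: CommentRow, SafeReplyWalker and Walk are made-up names, and the list parameter stands in for the table. The public method is itself an iterator, so every enumeration starts with a fresh set.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a row of the Comment table.
public class CommentRow
{
    public int CommentID { get; set; }
    public int ParentCommentID { get; set; }
}

public static class SafeReplyWalker
{
    // Public API: callers never see the bookkeeping set.
    public static IEnumerable<CommentRow> GetReplies(IEnumerable<CommentRow> all, int commentID)
    {
        // Created when enumeration starts, so enumerating twice is safe.
        var visited = new HashSet<int> { commentID };
        foreach (var reply in Walk(all, commentID, visited))
            yield return reply;
    }

    // Private backing method: the set is passed between recursive calls.
    private static IEnumerable<CommentRow> Walk(
        IEnumerable<CommentRow> all, int commentID, HashSet<int> visited)
    {
        foreach (var reply in all.Where(c => c.ParentCommentID == commentID))
        {
            // HashSet.Add returns false when the ID was already seen,
            // which breaks cycles in malformed data.
            if (!visited.Add(reply.CommentID))
                continue;

            yield return reply;
            foreach (var nested in Walk(all, reply.CommentID, visited))
                yield return nested;
        }
    }
}
```

With cyclic data such as 1 -> 2 -> 3 -> 1, asking for the replies of comment 1 yields 2 and 3 and then stops instead of looping forever.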
I'm trying to optimize a routine that looks sort of like this (simplified):
public async Task<IEnumerable<Bar>> GetBars(ObjectId id){
var output = new Collection<Bar>();
var page = 1;
var hasMore = true;
while(hasMore) {
var foos = await client.GetFoos(id, page);
foreach(var foo in foos) {
if(!Proceed(foo)) {
hasMore = false;
break;
}
output.Add(new Bar().Map(foo));
}
page++;
}
return output;
}
The method that calls GetBars() looks something like this
public async Task<Baz> GetBaz(ObjectId id){
var bars = await qux.GetBars(id);
if(bars.Any()) {
var bazBaseData = qux.GetBazBaseData(id);
var bazAdditionalData = qux.GetBazAdditionalData(id);
return new Baz().Map(await bazBaseData, await bazAdditionalData, bars);
}
return null; // nothing to build when there are no bars
}
GetBars() returns anywhere from zero to a large number of items. Since we run through a few million ids, we added the if(bars.Any()) check as an initial attempt at speeding up the application.
Since GetBars() is awaited, GetBaz() cannot continue until it has collected all its data (which can take some time). My idea was to use yield return and then replace the if(bars.Any()) with a check that tests whether we get at least one element, so we can fire off the two other async methods in the meantime (which also take some time to execute).
My question is then how to do this. I know System.Linq.Count() and System.Linq.Any() defeat the whole idea of yield return, and if I check the first item in the enumerable it is removed from the enumerable.
Is there another/better option besides adding for instance an out parameter to GetBars()?
TL;DR: How do I check whether an enumerable from a yield return contains any objects without starting to iterate it?
For your actual question, "How do I check whether an enumerable from a yield return contains any objects without starting to iterate it?": you don't.
It's that simple. You can't, period, since the only thing you can do with an IEnumerable is enumerate it. Calling Any() isn't an issue, however, since it only enumerates the first element (not the whole sequence). But it's not possible to enumerate nothing: many IEnumerables don't exist in any form except as a pipeline. There may be no backing collection, and by design it makes no sense to check whether something that doesn't exist yet has any elements.
Edit: also, I don't see any yield in your code. Are you mixing up the await and yield concepts (which are unrelated)?
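A small sketch of the claim that Any() only pulls the first element from an iterator. LazyDemo and its Produced counter are invented for the illustration; the counter records how many elements the iterator body has actually produced.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Counts how many elements the iterator has actually produced.
    public static int Produced;

    public static IEnumerable<int> Numbers()
    {
        for (var i = 1; i <= 1000; i++)
        {
            Produced++;
            yield return i;
        }
    }
}
```

Calling LazyDemo.Numbers() on its own leaves Produced at 0, because nothing runs until enumeration starts. Calling Any() on the result returns true and leaves Produced at 1. Note that a later foreach over the same enumerable starts a new enumeration from the beginning, so the first element is not "removed", it is simply produced again.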
I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.
Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.
var elements = sheet
.Rows() // BEGIN sheet data transform
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows() // END sheet data transform
.ToElements(strings) // BEGIN domain transform
.RemoveBadRecords(out discard)
.OrderByCompositeKey();
The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.
public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
return map.Select(pair =>
{
try
{
return ElementRow.FromCells(pair.Key, pair.Value);
}
catch (Exception)
{
return ElementRow.BadRecord(pair.Key);
}
});
}
Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
return elements.Where(el => el.FormatId != 0);
}
However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
var temp = new List<ElementRow>();
var filtered = elements.Where(el =>
{
if (el.FormatId == 0) temp.Add(el);
return el.FormatId != 0;
});
discard = temp;
return filtered;
}
And, lo! I thought I was hardcore and would have this working in one shot...
var discard = new List<ElementRow>();
var elements = data
/* snipped long LINQ statement */
.RemoveBadRecords(out discard)
/* snipped long LINQ statement */;
discard.ForEach(el => failures.Add(el));
foreach(var el in elements)
{
/* do more work, maybe add more failures */
}
return new Result(elements, failures);
But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.
The temp list was created
The Where filter was assigned (but not yet run)
And the discard list was assigned
Then the streaming thing was returned
When discard was iterated, it contained no elements, because the elements weren't iterated over yet.
Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?
Some Commentary
Jon mentioned that the assignment /was/ happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.
When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!
It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.
However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)
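Both effects (the list is empty until the query is iterated, and it fills up twice if you iterate twice) can be reproduced with a stripped-down version of the same shape. This is only a sketch: ints stand in for the rows and 0 stands in for a bad record, and DeferredFilterDemo is a made-up name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredFilterDemo
{
    // Same shape as RemoveBadRecords: the out list is assigned immediately,
    // but it is only populated when the returned sequence is enumerated.
    public static IEnumerable<int> RemoveZeros(IEnumerable<int> source, out List<int> discard)
    {
        var temp = new List<int>();
        discard = temp;
        return source.Where(n =>
        {
            if (n == 0) temp.Add(n);
            return n != 0;
        });
    }
}
```

For the input { 1, 0, 2, 0 }, discard.Count is 0 right after the call, 2 after the result has been enumerated once, and 4 after it has been enumerated a second time.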
I'd suggest materializing the query before removing the bad records - and then you can run separate queries:
var allElements = sheet
.Rows()
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows()
.ToElements(strings)
.ToList();
var goodElements = allElements.Where(el => el.FormatId != 0)
.OrderByCompositeKey();
var badElements = allElements.Where(el => el.FormatId == 0);
By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need enough memory to hold all the elements at once, of course. There are alternative approaches (such as taking an action on each bad element while filtering it) but they're still likely to end up being fairly fragile.
EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:
var lookup = sheet
.Rows()
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows()
.ToElements(strings)
.OrderByCompositeKey()
.ToLookup(el => el.FormatId == 0);
Then you can use:
foreach (var goodElement in lookup[false])
{
...
}
and
foreach (var badElement in lookup[true])
{
...
}
Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:
foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
...
}
I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.
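For reference, a minimal sketch of the ToLookup split on plain ints (0 plays the role of a bad record; PartitionDemo is a made-up name). ToLookup runs immediately, and indexing a lookup with a key that has no entries returns an empty sequence rather than throwing, so the two foreach loops are safe even when there are no bad rows.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PartitionDemo
{
    // Splits a sequence in one pass: key true = bad (0), key false = good.
    public static ILookup<bool, int> SplitByBad(IEnumerable<int> rows)
    {
        return rows.ToLookup(n => n == 0);
    }
}
```

For the input { 3, 0, 1, 0, 2 }, the false group is 3, 1, 2 (source order is preserved within a group) and the true group holds the two zeros.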
I have the following method, which gets all parents for a node using LinqToSql, but I don't know how bad it is for performance.
From NodeTable:
public partial class Node
{
public List<Node> GetAllParents(IEnumerable<Node> records)
{
if (this.ParentID == 0)
{
// Reached the root, so create the collection instance and end the recursion.
return new List<Node>();
}
var parent = records.First(p => p.ID == ParentID);
// Wrap the parent in a one-item collection so it can be concatenated with its own parents.
IEnumerable<Node> lst = new Node[] { parent };
lst = lst.Concat(parent.GetAllParents(records));
return lst.ToList();
}
}
Is this good, or are there any ideas to improve it?
Thanks.
As written, the code above walks the parent-child hierarchy upward (toward the parents). So in the worst case it makes n queries to the database for a hierarchy of depth n. I would suggest trying deferred execution by changing the method slightly, such as:
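public IEnumerable<Node> GetAllParents(IEnumerable<Node> records)
{
if (this.ParentID == 0)
{
// Reached the root, so return an empty collection and end the recursion.
return new List<Node>();
}
// Where (unlike First) is deferred, so nothing runs until the result is enumerated.
var parent = records.Where(p => p.ID == ParentID);
// Append each parent's own ancestors; SelectMany keeps the chain lazy.
var parents = parent.Concat(parent.SelectMany(p => p.GetAllParents(records)));
return parents;
}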
I am not 100% sure this would work, but the idea is to exploit expression trees/deferred execution so that multiple queries are fired within a single database trip.
Yet another idea would be to write a stored procedure or view that returns all parents (look at CTEs in SQL Server for this).
EDIT: Used Where instead of First to find the parent in the code above, because First will certainly be evaluated immediately (warning: still untested code).
This will result in a query for every single parent node.
The best way to approach this is to either write a stored procedure using a CTE or, if that is not possible, do a breadth-first search/query. The latter will require a query for every level, but will result in far fewer queries overall.
I'm not sure that there's much else you can do. This is, off the top of my head, the same thing but not recursive, and therefore possibly a bit more efficient. The issue is always going to be the query for the parent.
List<Node> parentList = new List<Node>();
Node current = this;
while (current.ParentID != 0)
{
// current = this.Parent;
current = records.First(r => r.ID == current.ParentID);
parentList.Add(current);
}
return parentList;
It depends on how big your hierarchy is likely to be. If you know that it is never going to need to recurse that many times, it isn't a problem; however, it may be quicker to just load the entire table into memory rather than sending multiple calls to the db.
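The "load the table once and walk it in memory" idea could be sketched like this. TreeNode and ParentWalker are made-up names standing in for the LINQ to SQL entity, and the sketch assumes the table is small enough to hold in memory. The dictionary turns each parent lookup into a hash lookup, and the seen set stops the walk if the data contains a cycle.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the LINQ to SQL Node entity.
public class TreeNode
{
    public int ID { get; set; }
    public int ParentID { get; set; }   // 0 means "no parent", as in the question
}

public static class ParentWalker
{
    public static List<TreeNode> GetAllParents(TreeNode start, IEnumerable<TreeNode> records)
    {
        // One pass over the records (a single query, if records is a table).
        var byId = records.ToDictionary(n => n.ID);

        var parents = new List<TreeNode>();
        var seen = new HashSet<int> { start.ID };
        var current = start;

        // Stop at the root, at a dangling ParentID, or when a cycle is detected.
        while (current.ParentID != 0
               && byId.TryGetValue(current.ParentID, out var parent)
               && seen.Add(parent.ID))
        {
            parents.Add(parent);
            current = parent;
        }
        return parents;
    }
}
```

The result is ordered from the nearest parent up to the root. If you need the parents of many nodes, the dictionary would be worth building once and reusing rather than rebuilding it per call.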
I'm having some trouble figuring out the best way to do this, and I would appreciate any help.
Basically, I'm setting up a filter that allows the user to look at a history of audit items associated with an arbitrary "filter" of usernames.
The data source is a SQL Server database, so I'm taking the IQueryable "source" (either a direct table reference from the db context object, or perhaps an IQueryable that's resulted from additional queries), applying the WHERE filter, and then returning the resultant IQueryable object. But I'm a little stumped as to how to perform an OR using this approach.
I've considered going the route of Expressions because I know how to OR those, but I haven't been able to figure out quite how to do that with a "Contains" type evaluation. So I'm currently using a UNION, but I'm afraid this might have a negative impact on performance, and I'm wondering whether it will give me exactly what I need if other filters (in addition to the user name filtering shown here) are added in an arbitrary order.
Here is my sample code:
public override IQueryable<X> ApplyFilter<X>(IQueryable<X> source)
{
// Take allowed values...
List<string> searchStrings = new List<string>();
// <SNIP> (This just populates my list of search strings)
IQueryable<X> oReturn = null;
// Step through each iteration, and perform a 'LIKE %value%' query
string[] searchArray = searchStrings.ToArray();
for (int i = 0; i < searchArray.Length; i++)
{
string value = searchArray[i];
if (i == 0)
// For first step, perform direct WHERE
oReturn = source.Where(x => x.Username.Contains(value));
else
// For additional steps, perform UNION on WHERE
oReturn = oReturn.Union(source.Where(x => x.Username.Contains(value)));
}
return oReturn ?? source;
}
This feels like the wrong way to do things, but it does seem to work, so my question is first, is there a better way to do this? Also, is there a way to do a 'Contains' or 'Like' with Expressions?
(Edited to correct my code: In rolling back to working state in order to post it, I apparently didn't roll back quite far enough :) )
=============================================
ETA: Per the solution given, here is my new code (in case anyone reading this is interested):
public override IQueryable<X> ApplyFilter<X>(IQueryable<X> source)
{
List<string> searchStrings = new List<string>(AllowedValues);
// <SNIP> build collection of search values
string[] searchArray = searchStrings.ToArray();
Expression<Func<X, bool>> expression = PredicateBuilder.False<X>();
for (int i = 0; i < searchArray.Length; i++)
{
string value = searchArray[i];
expression = expression.Or(x => x.Username.Contains(value));
}
return source.Where(expression);
}
(One caveat I noticed: following the PredicateBuilder example, an empty collection of search strings makes the filter evaluate to false (false || value1 || ... ) and so match nothing, whereas in my original version I was assuming an empty list should just coalesce to the unfiltered source. As I thought about it more, the new version seems to make more sense for my needs, so I adopted that.)
=============================================
You can use the PredicateBuilder from LINQKit to dynamically construct your query.
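On the "Contains with Expressions" part of the question: here is a rough sketch of what an OR chain of Contains calls boils down to when built by hand with System.Linq.Expressions. This is not LINQKit's implementation. AuditItem and UsernameFilter are made-up names, and I have only reasoned about it against LINQ to Objects via AsQueryable(); whether a particular LINQ provider translates the leading constant false cleanly is an assumption you would want to check.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical entity with the Username column from the question.
public class AuditItem
{
    public string Username { get; set; }
}

public static class UsernameFilter
{
    // Builds x => false || x.Username.Contains(v1) || x.Username.Contains(v2) ...
    public static Expression<Func<AuditItem, bool>> ContainsAny(IEnumerable<string> values)
    {
        // A single parameter shared by every Contains call in the chain.
        var item = Expression.Parameter(typeof(AuditItem), "x");
        var username = Expression.Property(item, nameof(AuditItem.Username));
        var contains = typeof(string).GetMethod(nameof(string.Contains), new[] { typeof(string) });

        // Start from false, so an empty list of values matches nothing.
        Expression body = Expression.Constant(false);
        foreach (var value in values)
        {
            var call = Expression.Call(username, contains, Expression.Constant(value, typeof(string)));
            body = Expression.OrElse(body, call);
        }
        return Expression.Lambda<Func<AuditItem, bool>>(body, item);
    }
}
```

Usage would be source.Where(UsernameFilter.ContainsAny(searchStrings)). The important detail is that every Contains call uses the same ParameterExpression; combining two separately written lambdas needs Invoke or parameter rewriting, which is the part PredicateBuilder takes care of for you.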
If I have a static method like this
private static bool TicArticleExists(string supplierIdent)
{
using (TicDatabaseEntities db = new TicDatabaseEntities())
{
if((from a in db.Articles where a.SupplierArticleID.Equals(supplierIdent) select a).Count() > 0)
return true;
}
return false;
}
and use this method in various places in foreach loops, or just call it numerous times, does it create and open a new connection every time?
If so, how can I tackle this? Should I cache the results somewhere? In this case, would I cache the entire Classifications table in MemoryCache and then run queries against this cached object?
Or should I make the TicDatabaseEntities variable static and initialize it at class level?
Should my class be static if it contains only static methods? Right now it is not.
Also I've noticed that if I return result.First() instead of FirstOrDefault() and the query does not find a match, it will issue an exception (with FirstOrDefault() there is no exception, it returns null).
Thank you for clarification.
New connections are inexpensive thanks to connection pooling. Basically, it grabs an already open connection (I think they are kept open for 2 minutes for reuse).
Still, caching may be better. I really do not like the "firstordefault". Think about whether you can actually pull in more in ONE statement, then work from that.
For the rest, I cannot say anything - too much depends on what you actually do there logically. What IS TicDatabaseEntities? CAN it be cached? How long? Same with (3) - we do not know because we do not know what else is in there.
If this is something like getting just some lookup strings for later use, I would say....
Build a key out of class I, class II, class III
Load all classifications in (I assume there are only a couple of hundred)
Put them into a static / cached dictionary, assuming they normally do not change (and I think I have that idea here - is this a financial tickstream database?)
Without business knowledge this cannot be answered.
4: yes, that is as documented. First gives the first element or an exception; FirstOrDefault falls back to the default value (a zero-initialized struct for value types, null for classes).
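A tiny sketch of that last point, run against in-memory sequences (FirstDemo is a made-up helper that only reports whether First() threw):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstDemo
{
    // Returns true when First() throws on the given sequence.
    public static bool FirstThrows<T>(IEnumerable<T> source)
    {
        try
        {
            source.First();
            return false;
        }
        catch (InvalidOperationException)
        {
            // First() throws InvalidOperationException on an empty sequence.
            return true;
        }
    }
}
```

On an empty string array, FirstThrows returns true while FirstOrDefault() returns null; on an empty int array, FirstOrDefault() returns 0.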
Thanks Dan and TomTom, I've come up with this. Could you please comment on it if you see anything out of order?
public static IEnumerable<Article> TicArticles
{
get
{
ObjectCache cache = MemoryCache.Default;
if (cache["TicArticles"] == null)
{
CacheItemPolicy policy = new CacheItemPolicy();
using(TicDatabaseEntities db = new TicDatabaseEntities())
{
IEnumerable<Article> articles = (from a in db.Articles select a).ToList();
cache.Set("TicArticles", articles, policy);
}
}
return (IEnumerable<Article>)MemoryCache.Default["TicArticles"];
}
}
private static bool TicArticleExists(string supplierIdent)
{
if (TicArticles.Count(p => p.SupplierArticleID.Equals(supplierIdent)) > 0)
return true;
return false;
}
If this is ok, I'm going to make all my methods follow this pattern.
does it create and open a new connection every time?
No. Connections are cached.
Should I cache the results somewhere
No. Do not cache entire tables.
should I make TicDatabaseEntities variable static and initialize it at class level?
No. Do not retain a DataContext instance longer than a UnitOfWork.
Should my class be static if it contains only static methods?
Sure... doing so will prevent anyone from creating useless instances of the class.
Also I've noticed that if I return result.First() instead of FirstOrDefault() and the query does not find a match, it will issue an exception
That is the behavior of First. As such - I typically restrict use of First to IGroupings or to collections previously checked with .Any().
I'd rewrite your existing method as:
using (TicDatabaseEntities db = new TicDatabaseEntities())
{
bool result = db.Articles
.Any(a => a.SupplierArticleID.Equals(supplierIdent));
return result;
}
If you are calling the method in a loop, I'd rewrite to:
private static Dictionary<string, bool> TicArticleExists
(List<string> supplierIdents)
{
using (TicDatabaseEntities db = new TicDatabaseEntities())
{
HashSet<string> queryResult = new HashSet<string>(db.Articles
.Where(a => supplierIdents.Contains(a.SupplierArticleID))
.Select(a => a.SupplierArticleID));
Dictionary<string, bool> result = supplierIdents
.ToDictionary(s => s, s => queryResult.Contains(s));
return result;
}
}
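One thing to watch with the batch version: ToDictionary throws an ArgumentException if the same key appears twice, so a list of idents containing duplicates would fail. A minimal in-memory sketch of the same shape with that guarded (ExistenceCheck is a made-up name, and knownIds stands in for the IDs returned by the database query):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ExistenceCheck
{
    public static Dictionary<string, bool> Exists(
        IEnumerable<string> knownIds, IEnumerable<string> requestedIds)
    {
        // Set lookups are O(1), so each requested ID is checked cheaply.
        var known = new HashSet<string>(knownIds);

        // Distinct() guards against duplicate keys, which ToDictionary rejects.
        return requestedIds
            .Distinct()
            .ToDictionary(id => id, id => known.Contains(id));
    }
}
```

For known IDs { "A", "B" } and requested IDs { "A", "C", "A" }, the result has two entries: "A" maps to true and "C" maps to false.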
I'm trying to find the article where I read this, but I think it's better to do (if you're just looking for a count):
from a in db.Articles where a.SupplierArticleID.Equals(supplierIdent) select 1
Also, use Any instead of Count > 0.
Will update when I can cite a source.