I have a table ReadNews as follows:
id: int
User_id: int
News_id: int
DateRead: DateTime
FacebookUser_id: int
I am getting my friends' ids via a REST call to Facebook and storing them in memory as an IEnumerable<string>.
I need to find the latest news that my friends (the IEnumerable<string>) have read, ordered by date descending. I am using NHibernate.
What is the most efficient way of building this query?
Essentially I have all the news read by Facebook users, but I only want the ones read by my friends.
I was thinking of doing:
select * from readnews where FacebookUser_id in (IEnumerable<string>);
In principle, if everything is appropriately indexed then using IN (x, y, z) is pretty performant, and LINQ/NHibernate should translate it fine for you. Try out NHibernate Profiler to be sure, though.
http://nhprof.com/
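For illustration, a minimal sketch of how that might look with NHibernate's LINQ provider, assuming a mapped ReadNews entity with the column names from the question, friendIds being the IEnumerable<string> from the Facebook call, and 25 as an arbitrary page size:
// Requires: using NHibernate.Linq; and an open ISession named session.
// friendIds is the IEnumerable<string> returned by the Facebook REST call.
var friendIdList = friendIds.Select(int.Parse).ToList();

var latestNews = session.Query<ReadNews>()
    .Where(r => friendIdList.Contains(r.FacebookUser_id))   // translated to SQL IN (...)
    .OrderByDescending(r => r.DateRead)
    .Take(25)
    .ToList();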
Have you found an actual performance problem with it? A TOP n query plus an index on FacebookUser_id should do just fine.
It all depends on the amount of data you have, etc. Ask yourself: if you don't see any problems with it, aren't you doing premature optimization?
I will start working on Xamarin shortly and will be transferring a lot of code from Android Studio's Java to C#.
In Java I am using custom classes which are given arguments, conditions, etc., convert them to SQL statements, and then load the results into the objects in the project's model.
What I am unsure of is whether LINQ is a better option for filtering such data.
For example, what would happen currently is something along these lines:
List<Customer> customers = (new CustomerDAO()).get_all();
Or if I have a condition
List<Customer> customers = (new CustomerDAO()).get(new Condition(CustomerDAO.Code, equals, "code1"));
Now let us assume I have transferred the classes to C# and I wish to do something similar to the second case.
So I will probably write something along the lines of:
var customers = from customer in (new CustomerDAO()).get_all()
                where customer.code.Equals("code1")
                select customer;
I know that the query will only be executed when I actually try to access customers, but if I have multiple accesses to customers (let us say I use 4 foreach loops later on), will the get_all method be called 4 times? Or are the results stored at the first execution?
Also, is it more efficient (time-wise, because memory-wise it probably is not) to just keep the get_all() method and use LINQ to filter the results, or to use my existing setup, which in effect executes
Select * from Customers where code = 'code1'
and loads the results into objects?
Thanks in advance for any help you can provide.
Edit: yes, I do know there is sqlite.net, which pretty much does what my DAOs do (but probably better), and at some point I will probably convert all my objects to use it; I just need to know for the sake of knowing.
if I have multiple accesses to customers ( let
us say that I use 4 foreach loops later on) will the get_all method be
called 4 times? or are the results stored at the first execution?
Each time you enumerate the enumerator (using foreach in your example), the query will re-execute, unless you store the materialized result somewhere. For example, if on the first query you'd do:
var customerSource = (new CustomerDAO()).get_all();
List<Customer> customers = customerSource.Where(customer => customer.Code.Equals("code1")).ToList();
Now you'll be working with an in-memory List<Customer> without executing the query over again.
On the contrary, if each time you'd do:
var filteredCustomers = customerSource.Where(customer => customer.Code.Equals("code1"));
foreach (var customer in filteredCustomers)
{
// Do stuff
}
Then for each enumeration you'll be executing said query over again.
Also is it more efficient (time wise because memory wise it is
probably not) to just keep the get_all() method and use linq to filter
the results? Or use my existing setup which in effect executes
That really depends on your use case. Let's imagine you were using LINQ to Entities and the customer table has a million rows: do you really want to bring all of them into memory and only then filter them down to the subset you actually use? It would usually be better to run the filtered query against the database.
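To make the difference concrete, here is a small sketch assuming a hypothetical EF-style context with a Customers DbSet (the names are illustrative, not from the question):
// Filtering in memory: materializes every row first, then filters.
var allThenFilter = context.Customers
    .ToList()                                // pulls the whole table into memory
    .Where(c => c.Code == "code1")
    .ToList();

// Filtering at the database: the predicate is translated to SQL,
// so only the matching rows travel over the wire.
var filteredAtDb = context.Customers
    .Where(c => c.Code == "code1")
    .ToList();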
I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. At predefined intervals I get handed a list of around 2,000 potentially new records, and I have to insert any of them for which the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research leads me to believe that the .Except() method would be the most efficient, but I get out of memory errors when trying it. The .Any() and .Contains() methods seem to both take roughly the same time to complete (which is faster than the foreach loop).
The structures of the two lists are identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Except() method considered to be the most efficient?
Is there a way to use projection when using the .Except() method? What I would like to accomplish would be something like:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;
// No projection; this runs out of memory
var newData = incomingData.Except(storedData).ToList();
// Pseudocode for what I would like to figure out, if it is possible:
// first use projection on the db query so as to not pull a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.field1, x.field2, x.field3 }).ToList();
var newKeys = incomingData.Select(x => new { x.field1, x.field2, x.field3 }).Except(storedKeys).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it overhead from EF that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much so appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, using the kept columns grouped together as the key (as can be seen here). This solves the problem of duplicates already being present in the returned data, and gives me something fast to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure whether this approaches best practice, but I can certainly live with its performance. I have also seen some references to ToLookup() that I may try, to see if there is a performance gain there as well. Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData.GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4)).ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
{
// duplicateData is a previously defined list
duplicateData.Add(item);
}
else
{
// newData is a previously defined list
newData.Add(item);
}
}
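Since ToLookup() was mentioned above, here is a sketch of that variant under the same assumptions (the same Field1..Field4 composite key and the previously defined lists):
// Build a lookup keyed by the composite field; unlike the dictionary,
// a lookup tolerates duplicate keys without the GroupBy/First step.
var storedLookup = storedData.ToLookup(k => k.Field1 + k.Field2 + k.Field3 + k.Field4);

foreach (var item in parsedData)
{
    var key = item.Field1 + item.Field2 + item.Field3 + item.Field4;
    if (storedLookup.Contains(key))
        duplicateData.Add(item);   // already present in the stored data
    else
        newData.Add(item);         // genuinely new record
}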
No reason to use EF for that.
Grab only the columns that are required for you to decide whether to update or insert a record (i.e., those that represent the missing "primary key"). Don't waste memory on the other columns.
Build a HashSet of the existing primary keys (i.e., if the primary key is a number, a HashSet<int>; if it is composed of multiple columns, combine them into a string).
Check your 2,000 items against the HashSet; that is very fast.
Update or insert the items with raw SQL.
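A minimal sketch of those steps, assuming the composite key is the Field1..Field4 combination from the question and that an open SqlConnection named connection is available (the query and insert details are illustrative only):
// Requires: using System.Data.SqlClient; using System.Linq;
// 1. Pull only the key columns and build a HashSet of existing keys.
var existingKeys = new HashSet<string>();
using (var cmd = new SqlCommand("SELECT Field1, Field2, Field3, Field4 FROM BigTable", connection))
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
        existingKeys.Add(reader.GetString(0) + reader.GetString(1) + reader.GetString(2) + reader.GetString(3));
}

// 2. Check the ~2,000 incoming items against the set; each lookup is O(1).
var toInsert = parsedData
    .Where(item => !existingKeys.Contains(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    .ToList();

// 3. toInsert can then be written with raw SQL (parameterized INSERTs or SqlBulkCopy).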
I suggest you consider doing it in SQL, not C#. You don't say what RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new': if so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.
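For illustration only, a sketch of issuing such a MERGE from C# against a hypothetical staging table (the table and column names are placeholders, not from the question, and the incoming rows are assumed to have been bulk-copied into the staging table beforehand):
// Requires: using System.Data.SqlClient; and an open SqlConnection named connection.
const string mergeSql = @"
    MERGE dbo.BigTable AS target
    USING dbo.IncomingStaging AS source
        ON  target.Field1 = source.Field1
        AND target.Field2 = source.Field2
        AND target.Field3 = source.Field3
        AND target.Field4 = source.Field4
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Field1, Field2, Field3, Field4)
        VALUES (source.Field1, source.Field2, source.Field3, source.Field4);";

using (var cmd = new SqlCommand(mergeSql, connection))
{
    cmd.ExecuteNonQuery();   // inserts only the rows not already in the target table
}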
I have a database structure which has a set of users and their UserId
I then have a table called 'Post' which consists of a text field and a CreatedBy field.
I then have a 'Follows' table which consists of 'WhoIsFollowing' and 'WhoTheyFollow' fields.
The idea is that the 'Follows' table maps which users another user 'Follows'.
If I am using the application as a particular user and I want to get all my relevant 'Posts', these would be posts of those users I follow, or my own posts.
I have been trying to get this into one LINQ statement but have been failing to get it perfect. Ultimately I need to query the 'Posts' table for all the 'Posts' that I have posted, joined with all the posts of the people I follow in the 'Follows' table.
I have got it working with this statement
postsWeWant = (from s in db.Posts
join sa in db.Follows on s.CreatedBy equals sa.WhoTheyAreFollowing into joinTable1
from x in joinTable1.DefaultIfEmpty()
where (x.WhoIsFollowing == userId || s.CreatedBy == userId) && !s.Deleted
orderby s.DateCreated descending
select s).Take(25).ToList();
The issue is that it seems to come back with duplicates of all the posts made by the user themselves. I have added .Distinct() to get around this, but instead of getting 25 posts each time, the duplicates mean it comes back with far fewer when many of the latest 25 are posts by that user.
First off, why is the above coming back with duplicates? (It would help me understand the statement a bit more.) And secondly, how do I get around it?
It's difficult to say exactly without the data structure, but I would recommend investigating and perhaps tightening your join to eliminate the duplicate associations: each of your own posts matches one Follows row per person who follows you, and the s.CreatedBy == userId condition keeps every one of those rows, so the post appears once per follower.
If that fails, then I would use a group by clause to remove the duplicates so there is no need for a Distinct(). The reason you are ending up with fewer than 25 records is probably because the elimination of duplicates is happening after you take 25. But I think I would need more of your code to tell for sure.
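One possible restructuring (a sketch only, assuming the Follows columns are named WhoIsFollowing and WhoTheyFollow as described in the question): fetch the ids of the people you follow, then filter Posts directly, which avoids the row multiplication entirely.
// Ids of the users that the current user follows (kept as an IQueryable,
// so it is composed into the main query rather than executed separately).
var followedIds = db.Follows
    .Where(f => f.WhoIsFollowing == userId)
    .Select(f => f.WhoTheyFollow);

// Posts by me or by anyone I follow, newest first; no join, so no duplicates.
postsWeWant = db.Posts
    .Where(p => !p.Deleted && (p.CreatedBy == userId || followedIds.Contains(p.CreatedBy)))
    .OrderByDescending(p => p.DateCreated)
    .Take(25)
    .ToList();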
Good day,
I have the following tables
PARENT 1=>N CHILDREN 1=>N GRANDCHILDREN.
The tables each have over 30 columns.
I need to select over 50,000 records from PARENT, plus I will need certain fields from CHILDREN and GRANDCHILDREN. The data needs to be manipulated in memory (complex algorithms run on what has been selected).
I am using Entity Framework 5.
I tried various combinations of eager loading (Include, projection, etc.), but I am still not able to make it perform better than it performs with LINQ to SQL in the following scenario:
"
SELECT from PROJECTS
on binding of each row:
SELECT from CHILDREN
SELECT from GRANDCHILDREN
"
It generates at least 50,001 calls to the DB, but it still performs better than any of my EF approaches, which take over five times longer than the current LINQ to SQL design.
The best solution would be a WHERE IN query on the children, but there is no native implementation of it in EF 5 (Contains doesn't cut it: too slow and badly translated...).
Any ideas will be greatly appreciated.
Thanks,
I assume you are implementing paging in your grid view and are not putting thousands of rows into it at once. If so, you can select only the 10 (or however many) rows you are displaying in the grid view at a time. This will be a lot easier to work with.
I found this example on MSDN that implements paging server side to reduce the number of rows returned in a single query.
You can also consider writing (or having a DBA write) an efficient stored procedure that you can hook up to Entity Framework to control the SQL code.
I had a similar issue some days ago. EF is very slow. After some experiments I got more or less normal performance with direct queries.
Create a ViewModel with the needed fields:
public class MyViewModel
{
public string one {get; set;}
public string two {get; set;}
}
Then in controller action:
MyViewModel result = db.Database.SqlQuery<MyViewModel>(
    "SELECT a.one, b.two" +
    " FROM Table1 a, Table2 b" +
    " WHERE a.id = @p0",
    something
).FirstOrDefault();
Paging wouldn't work, as I need the data sorted on a calculated field. The field can only be calculated in web-server memory because the calculation needs client info (yes, yes, there is a way of passing this info to the DB server, but that wasn't an option).
Solution:
using (var onecontext = new myCTx())
{
    // SELECT everything from PROJECTS,
    // then run Context.EntityName.SqlQuery() for the children and grandchildren,
    // using the good old WHERE IN construct (I put it all into my entities'
    // partial classes as extensions).
}
This way I get all my data in N DB trips, where N is the number of generations, which is fine. The EF context then connects everything together, and then I perform all my in-memory processing.
EF 6 should have WHERE IN built in, so I guess this approach will become more obvious then. Mind you: using Contains() is not an option for large data sets, since it produces multiple ORs instead of a straight IN. Yes, ADO.NET then translates the ORs into IN, but before that there is some really heavy lifting being done, which kills your app server.
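A rough sketch of what one of those partial-class/extension helpers might look like (the entity, context, and column names are placeholders; the inline string join assumes integer keys and trusted input):
// Requires: using System.Collections.Generic; using System.Data.Entity; using System.Linq;
// Loads all CHILDREN rows for a set of parent ids in a single round trip,
// using a raw WHERE IN query through the EF 5 context.
// DbSet<T>.SqlQuery returns tracked entities, so the context can wire
// them up to their already-loaded parents.
public static class ChildQueries
{
    public static List<Child> LoadChildrenFor(this MyCtx context, IEnumerable<int> parentIds)
    {
        var idList = string.Join(",", parentIds);   // integer keys only; not for untrusted input
        var sql = "SELECT * FROM CHILDREN WHERE ParentId IN (" + idList + ")";
        return context.Set<Child>().SqlQuery(sql).ToList();
    }
}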
I am trying to create a friendly report summing enrollment (the number of students) by time of day. I initially started with loops for campus name, then time, then day, but it was extremely inefficient and slow. I decided to take another approach: select all the data I need in one query and organize it using C#.
Raw Data View
My problem is I am not sure whether to put this into arrays, lists, a dictionary, or a DataTable to sum the enrollment and organize it as seen below (mockup, not calculated). Any guidance would be appreciated.
Friendly View
Well, if you only need to show the user some data (and not edit it) you may want to create a report.
Otherwise, if you only need sums, you could pull all the data into an IEnumerable and call .Sum(). And as pointed out by colinsmith, you can use LINQ in parallel.
But one thing is definite: if you have a lot of data, you don't want to do many queries. You could either use a sum query in SQL (if the data is stored in a database) or do the sum over a collection you've already fetched.
You don't want to fetch the data in a loop. Processing data in memory is way faster than querying the database multiple times and then processing the results.
Normally I would advise you to do this in the database, i.e. a SELECT using GROUP BY, etc., but I'm having a bit of trouble figuring out how your first picture relates to the second with regard to the days, so I can't offer an example.
You could of course do this in C# as well using LINQ to Objects, but I would first try to solve it in the DB; you are better off performance- and bandwidth-wise that way.
I am not quite sure exactly what you are after, but from my understanding I would suggest you create a class to represent your enrollment:
public class Enrollment
{
    public string CampusName { get; set; }
    public DateTime DateEnrolled { get; set; }
}
And get all the enrollment details from the database into a collection of this class:
List<Enrollment> enrollments = db.GetEnrollments();
Now you can do many operations on this collection to get the data you want.
For example, if you want all enrollments that happened on a Friday:
var fridaysEnrollment = enrollments
    .Where(x => x.DateEnrolled.DayOfWeek == DayOfWeek.Friday)
    .ToList();
If you want the count of those Friday enrollments that happened at the "AA" campus:
var fridayCount = fridaysEnrollment.Where(d => d.CampusName == "AA").Count();
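Building on that, a sketch of how the whole summary could be produced in one pass with GroupBy (only the two properties defined above are used; add a begin-time property to the key if your data has one):
// Count enrollments per campus and day of week in a single pass.
var summary = enrollments
    .GroupBy(e => new { e.CampusName, e.DateEnrolled.DayOfWeek })
    .Select(g => new
    {
        g.Key.CampusName,
        g.Key.DayOfWeek,
        Count = g.Count()
    })
    .OrderBy(r => r.CampusName)
    .ThenBy(r => r.DayOfWeek)
    .ToList();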
something like
select campusname, ssrmeet_begin_time, count(ssrmeet_monday), count(ssrmeet_tue_day) ..
from the_table
group by campusname, ssrmeet_begin_time
order by campusname, ssrmeet_begin_time
/
should be close to what you want. COUNT only counts non-NULL values. It is also thousands of times faster than first fetching all the data to the client; let the database do the analysis for you, it already has all the data.
BTW: instead of those pics, it is smarter to post some DDL and INSERT statements with data to work on. That would invite more people to help answer the question.