Expand a start/end interval entry into separate rows using LINQ/EF - c#

Given an object (or data row, in SQL terms) representing a range of numbers, with Start and End properties, how do I "expand" those numbers into actual individual units using LINQ in a way that is translatable to EFCore?
This is easy to do in memory with an iterator:
for (var number = range.Start; number <= range.End; number++)
    yield return number;
But of course I can't do this with a queryable, since iterating on the original query already goes into the database.
How do I achieve the same behavior while doing the work in the database? Say I have an IQueryable<Range> and want to convert it into an IQueryable<int>, where for each range the resulting query contains every individual entry in the range as a separate row.
I found a few examples of how to achieve this directly in SQL, but I wanted to make sure it was possible to do it using LINQ operators, in a way that EF could translate into the appropriate SQL statement.
Can something like this be coded without using an iterator?
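One pattern that does translate, not from the question itself, is a persisted numbers/tally table joined against each range; the table, property, and DbSet names below are assumptions:
// A hedged sketch, assuming a mapped Numbers table (one row per integer,
// in a column Value) that covers the widest possible range. EF can
// translate this correlated join into a single SQL statement.
IQueryable<int> expanded =
    from range in context.Ranges   // IQueryable<Range>
    from n in context.Numbers      // persisted tally table
    where n.Value >= range.Start && n.Value <= range.End
    select n.Value;
Each range row then fans out into one result row per integer it covers, and all of the work stays in the database.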

Related

IEnumerable.Select() returning unexpected results

I am querying Azure Table Storage using C#, which returns an IEnumerable. I am using .Select() to get two different attributes from my results, but the count of each attribute is not correct. For example:
IEnumerable<SomeClass> results = table.ExecuteQuery(query);
IEnumerable<DateTime> dates = results.Select(x => x.Date);
IEnumerable<double> doubles = results.Select(x => x.Doubles);
Every result has a date and a double value (I have verified this), so the counts should match each other and the result count, but they come back different. I might have 300k results, yet 299,997 dates and 300,003 doubles.
When I do something like:
results.ToList();
and then use .Select() I get the correct results. I am trying to avoid converting the records to a list first because it takes way too long. I also want to avoid using a for loop or a foreach loop because they also take far too long.
My question is: Is there a way to use .Select() on an IEnumerable and get accurate results? Or is there another way to do this which would be very fast?
NOTE: I am plotting this data on an xy graph, and for about 300k records it takes about 1 minute 30 seconds. About 90% of that time is due to a foreach loop I had. If I convert to a list first it takes even longer to process. Using .Select() on an IEnumerable is very fast, but I need reliable results: the number of x values has to match the number of y values.
Azure Table storage / its SDK uses lazy enumerables. The actual HTTP request to the table service to retrieve the entities is made not when you call ExecuteQuery but when you iterate over the returned IEnumerable. That's one possible reason you see different results: each Select enumerates the enumerable again, issuing fresh requests, and perhaps the data in the table is changing between them.
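A minimal sketch of the usual fix: enumerate the lazy result exactly once, projecting both values in the same pass, so both sequences come from the same snapshot:
// Materialize date/value pairs in a single enumeration (one round of HTTP
// requests), then project the two axes from the in-memory pairs.
var points = table.ExecuteQuery(query)
    .Select(x => new { x.Date, x.Doubles })
    .ToList();

var dates = points.Select(p => p.Date);      // same count as points
var doubles = points.Select(p => p.Doubles); // same count as points
Projecting down to just the two fields before calling ToList() also keeps the materialization much cheaper than listing the full entities.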

Textual Mining on the column Cell of Table that remove the Duplicates based on "##" notation

Let's assume I have a table in SQL Server that represents employee information (for example, one with a Degree column).
I want to do textual mining on the Degree column that removes the duplicates based on the "##" notation.
LINQ to SQL
I am using LINQ to SQL, so I am planning to pull this data into a C# variable via the context, perform the operation on the string, and store it back to the same location.
Rules: I need to update the data in place or generate a new table.
Is this the right way of doing it, and is it possible? I need some suggestions on this approach; any alternative suggestions are welcome.
So it looks like you need to break up the string on the "##" delimiters, take the distinct items, and put them back together -- comma-delimited this time? The String.Split method to break up the string, followed by LINQ's Distinct extension method, should get you just the unique ones.
Assuming you've got the text of the degree in a variable somewhere:
var uniques = degree
    .Split(new String[] { "##" }, StringSplitOptions.None)
    .Distinct();
String.Split is most often used with single-character delimiters, but there's an overload that splits on longer strings, so you'll need to use that one.
Then you can use String.Join to comma-delimit the unique items, or whatever else you need to do.
Edit: Apologies, I thought your original question was more about how to eliminate the duplicates than how to use LINQ to SQL.
Assuming you've got your DataContext and object model set up, you just need to select your object(s) out of the database using LINQ to SQL, make the changes you need to them, and then call SubmitChanges() on the context.
For example:
var degrees = from d in context.GetTable<Employee>() select d;
foreach (var d in degrees)
{
    d.Degree = String.Join(",", d.Degree
        .Split(new String[] { "##" }, StringSplitOptions.None)
        .Distinct());
}
context.SubmitChanges();
If you're new to LINQ to SQL, it may be worthwhile to run through a tutorial or two first. Here's part 1 of a pretty good series:
Lastly, you mentioned in your edit that you have the option of creating a new table after making your changes -- if that's the case, I'd consider storing the individual degrees in a table that links back to the employee record, rather than storing them as comma-separated values. It depends on your needs, of course, but SQL is designed to work in tables and sets, so the less string parsing/processing you can do the better.
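A sketch of that normalized shape (the class and property names are illustrative, not from the question):
// One row per degree, linked back to the employee record, instead of a
// single "##"- or comma-delimited string column.
public class EmployeeDegree
{
    public int Id { get; set; }
    public int EmployeeId { get; set; } // foreign key to the Employee row
    public string Degree { get; set; }
}
Deduplication then becomes a unique constraint on (EmployeeId, Degree) rather than string parsing.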
Good luck!

Compare very large lists of database objects in c#

I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. I get handed a list of around 2,000 potentially new records at predefined intervals, and I have to insert any of the potentially new records if the database does not already have a matching record.
I initially tried making comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research leads me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying it. The .Any() and .Contains() methods both seem to take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each object contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
When comparing two lists of objects (made up of several strings), is the .Except() method considered to be the most efficient?
Is there a way to use projection with the .Except() method? What I would like to accomplish is something like:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;
// No projection; this runs out of memory
var newData = incomingData.Except(storedData).ToList();
// Pseudocode for what I would like to figure out, if it is possible:
// first use projection on db so as to not pull back a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.field1, x.field2, x.field3 }).ToList();
var newData2 = incomingData.Select(x => new { x.field1, x.field2, x.field3 }).Except(storedKeys).ToList();
Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it EF overhead that makes such a difference?
Would raw SQL in EF be a better practice in a situation like this?
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to identify a unique record. I then take that data and convert it to a dictionary, grouping on the kept columns as the key (as can be seen here). This solves the problem of duplicates already existing in the returned data, and gives me something fast to compare my newly parsed data against. The performance increase was very noticeable!
I'm still not sure if this would be approaching the best practice, but I can certainly live with the performance of this. I have also seen some references to ToLookup() that I may try to get working to see if there is a performance gain there as well. Nevertheless, here is some code to show what I did:
var storedDataDictionary = storedData.GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4)).ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
    if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
    {
        // duplicateData is a previously defined list
        duplicateData.Add(item);
    }
    else
    {
        // newData is a previously defined list
        newData.Add(item);
    }
}
No reason to use EF for that.
Grab only the columns that you need in order to decide whether to update or insert a record (i.e., those which represent the missing "primary key"). Don't waste memory on the other columns.
Build a HashSet of the existing primary keys (if the primary key is a number, a HashSet of int; if it is composed of multiple fields, combine them into a string), as in the sketch below.
Check your 2,000 items against the HashSet; that is very fast.
Update or insert the new items with raw SQL.
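A minimal sketch of those steps, with illustrative field names (the "|" separator just keeps two different field combinations from concatenating to the same key):
// Build the key set once from the projected existing rows...
var existingKeys = new HashSet<string>(
    storedData.Select(x => x.Field1 + "|" + x.Field2 + "|" + x.Field3 + "|" + x.Field4));

// ...then each of the ~2,000 incoming items is an O(1) lookup.
var newItems = parsedData
    .Where(x => !existingKeys.Contains(x.Field1 + "|" + x.Field2 + "|" + x.Field3 + "|" + x.Field4))
    .ToList();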
I suggest you consider doing it in SQL, not C#. You don't say what RDBMS you are using, but you could look at the MERGE statement, e.g. (for SQL Server 2008):
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx
Broadly, the statement checks whether a record is 'new': if so, you can INSERT it; if not, there are UPDATE and DELETE capabilities, or you can just ignore it.
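If you do go the SQL route from C#, a hedged sketch of issuing a MERGE through EF as raw SQL (the table and column names are assumptions, ExecuteSqlCommand is the EF6-era API, and the 2,000 incoming rows are assumed to have been bulk-loaded into a staging table first):
context.Database.ExecuteSqlCommand(@"
    MERGE INTO BigTable AS target
    USING StagingTable AS source
        ON target.Field1 = source.Field1
       AND target.Field2 = source.Field2
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Field1, Field2, Field3)
        VALUES (source.Field1, source.Field2, source.Field3);");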

Search Substring on a Integer Value

Let's say we have a mongodb collection that has elements containing an int attribute value like: {"MyCollectionAttribute": 12345}
How can I search for the string "234" inside the int using the Query<T> syntax?
For now it seems to work (as explained here) using a raw query like:
var query = new QueryDocument("$where", "/234/.test(this.MyCollectionAttribute)");
myCollection.Find(query);
Is it preferable to store the values directly as strings instead of integers, since a regex match will be slow? How do you approach these situations?
Edit
Context: a company can have some internal codes that are numbers. In SQL Server they can be stored in a column of type int in order to have data integrity at the database level, and then queried from LINQ to SQL with something like:
.Where(item => item.CompanyCode.ToString().Contains("234"))
This way there is both data integrity at the db level and type safety in the query.
I asked the question in order to see how this scenario can be implemented using MongoDB.
What you are asking does not make much sense. Regular expressions are for searching within strings, not within integers. If you want to perform a substring search (for whatever reason), then store your numbers as strings and not as integers - obviously.
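If the values are stored as strings, a hedged sketch with the legacy driver's typed builder (MyCollectionAttribute is from the question; the document class name and the exact overload are assumptions):
using MongoDB.Bson;            // BsonRegularExpression
using MongoDB.Driver.Builders; // Query<T>

// Regex "contains" match against a string-typed field.
var query = Query<MyDocument>.Matches(
    d => d.MyCollectionAttribute,
    new BsonRegularExpression("234"));
var matches = myCollection.Find(query);
Note that an unanchored regex still cannot use an index, so storing the codes as strings buys type-safe queries, not fast ones.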

Limit Number of Results being returned in a List from Linq

I'm using Linq/EF 4.1 to pull some results from a database and would like to limit the results to the X most recent results, where X is a number set by the user.
Is there a way to do this?
I'm currently passing them back as a List, if that helps with limiting the result set. While I could limit this by looping until I hit X, I'd just as soon not pass the extra data around.
Just in case it is relevant...
C# MVC3 project running from a SQL Server database.
Use the Take function
int numberOfRecords = 10; // read from user
var recentItems = listOfItems.OrderByDescending(x => x.CreatedDate).Take(numberOfRecords);
Assuming listOfItems is a List of your entity objects and CreatedDate is a field holding the date-created value (used here to order descending and get the most recent items).
The Take() function returns a specified number of contiguous elements from the start of a sequence.
http://msdn.microsoft.com/en-us/library/bb503062.aspx
results = results.OrderByDescending(x=>x.Date).Take(10);
The OrderByDescending(...) will sort items by your date/time property (or w/e logic you want to use to get most recent) and Take(...) will limit to first x items (first being most recent, thanks to the ordering).
Edit: To return some rows not starting at the first row, use Skip():
results = results.OrderByDescending(x=>x.Date).Skip(50).Take(10);
Use Take() before converting to a List. That way EF can optimize the query it creates and only return the data you need.
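A minimal sketch of that, with assumed context, entity, and property names:
int x = 10; // user-supplied limit
List<Item> recent = context.Items
    .OrderByDescending(i => i.CreatedDate)
    .Take(x)   // folded into the SQL as TOP(x) before the query runs
    .ToList(); // only x rows come back from SQL Server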
