Optimizing query with many-to-many relationship on big data set - c#

I have a database (SQLite) constructed with similar DDL:
CREATE TABLE [Player] (
[PlayerID] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
[Name] TEXT UNIQUE NULL
);
CREATE TABLE [Position] (
[PlayerID] INTEGER NOT NULL,
[SingleHandID] INTEGER NOT NULL,
[Position] INTEGER NULL,
PRIMARY KEY ([PlayerID],[SingleHandID])
);
CREATE TABLE [SingleHand] (
[SingleHandID] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[Stake] FLOAT NULL,
[Date] DATE NULL,
[DataSetID] INTEGER NULL,
[IsPreflopAllIn] BOOLEAN NULL
);
CREATE UNIQUE INDEX [NameIndex] ON [Player](
[Name] ASC
);
CREATE INDEX [DataSetIndex] ON [SingleHand](
[DataSetID] ASC
);
It is mapped to an Entity Framework model. I am working with large data sets of up to 10 million records each.
My problem is that I need to find all hands where a specific Player is sitting in any given Position (plus some other filters, like a date range).
I can scan the database very quickly when the data comes from a single table, for example:
//[playerIDs and selectedPos are cached in memory]
context.Positions.Where(p => playerIDs.Contains(p.PlayerID) && selectedPos.Contains(p.Position)).Select(p => p.SingleHandID).Take(maxHands ?? 1);
But when I need to join tables, the query runs very slowly, for example:
//accessing both the Position and SingleHand tables
context.Positions.Where(p => playerIDs.Contains(p.PlayerID) && selectedPos.Contains(p.Position) && p.SingleHand.DataSetID == dataSetNumber).Select(p => p.SingleHandID).Take(maxHands ?? 1);
What clever trick, combining queries and code (e.g. with local caching), can I pull to make this run as efficiently as possible? I am using the System.Data.SQLite provider.
Maybe I should add a redundant DataSetID column to the Position table, so that I can run my main query on the Position table only? Later, once I have the IDs of all the matching hands, it should be quicker to add additional conditions (like date checking).

Create a new index:
CREATE INDEX [DataSetIndex2] ON [SingleHand](
[SingleHandID] ASC,
[DataSetID] ASC
);
That should help a lot.
You can also try something like this:
context.Positions
    .Where(p => playerIDs.Contains(p.PlayerID) && selectedPos.Contains(p.Position))
    .Select(p => p.SingleHandID)
    .Intersect(context.SingleHand
        .Where(s => s.DataSetID == dataSetNumber)
        .Select(s => s.SingleHandID))
    .Take(maxHands ?? 1);
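To see what the Intersect form returns, here is a minimal in-memory sketch. It runs against plain lists rather than the database, so it shows the semantics only, not the SQL that the provider generates. The sample rows and values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory rows standing in for the Position and SingleHand tables.
var positions = new List<(int PlayerID, int SingleHandID, int? Position)>
{
    (1, 10, 0), (1, 11, 3), (2, 10, 1), (1, 12, 0),
};
var hands = new List<(int SingleHandID, int DataSetID)>
{
    (10, 7), (11, 8), (12, 7),
};

var playerIDs = new List<int> { 1 };
var selectedPos = new List<int?> { 0 };
int dataSetNumber = 7;

// Hand IDs that match the player/position filter.
var byPosition = positions
    .Where(p => playerIDs.Contains(p.PlayerID) && selectedPos.Contains(p.Position))
    .Select(p => p.SingleHandID);

// Hand IDs that belong to the requested data set.
var byDataSet = hands
    .Where(s => s.DataSetID == dataSetNumber)
    .Select(s => s.SingleHandID);

// Intersect keeps only IDs present in both sequences and removes duplicates.
List<int> matching = byPosition.Intersect(byDataSet).ToList();

Console.WriteLine(string.Join(", ", matching));
```

Note that Intersect also de-duplicates, which is harmless here because each SingleHandID identifies one hand.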

Related

Fast Way to Replace Names with Ids in Datatable?

I have a very large CSV file I have to load on a regular basis that contains time series data. Examples of the headers are below:
| SiteName | Company | Date | ResponseTime | Clicks |
This data comes from a service external to the uploader. SiteName and Company are both string fields. In the database these are normalized. There is a Site table and a Company table:
CREATE TABLE [dbo].[Site] (
[Id] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
[Name] NVARCHAR(MAX) NOT NULL
)
CREATE TABLE [dbo].[Company] (
[Id] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
[Name] NVARCHAR(MAX) NOT NULL
)
As well as the data table.
CREATE TABLE [dbo].[SiteStatistics] (
[Id] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
[CompanyId] INT NOT NULL,
[SiteId] INT NOT NULL,
[DataTime] DATETIME NOT NULL,
CONSTRAINT [SiteStatisticsToSite_FK] FOREIGN KEY ([SiteId]) REFERENCES [Site]([Id]),
CONSTRAINT [SiteStatisticsToCompany_FK] FOREIGN KEY ([CompanyId]) REFERENCES [Company]([Id])
)
At around 2 million rows in the CSV file any sort of IO-bound iteration isn't going to work. I need this done in minutes, not days.
My initial thought is that I could pre-load Site and Company into DataTables. I already have the CSV loaded into a datatable in the format that matches the CSV columns. I need to now replace every SiteName with the Id field of Site and every Company with the Id field of Company. What is the quickest, most efficient way to handle this?
If you go with pre-loading the Sites and Companies, you can get the distinct values with code like this:
DataView companyView = new DataView(table);
DataTable distinctCompanyValues = companyView.ToTable(true, "Company");
DataView siteView = new DataView(table);
DataTable distinctSiteValues = siteView.ToTable(true, "SiteName");
Then load those two DataTables into their SQL Tables using Sql-Bulk-Copy.
Next dump all the data in:
CREATE TABLE [dbo].[SiteStatistics] (
[Id] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
[CompanyId] INT DEFAULT 0,
[SiteId] INT DEFAULT 0,
[Company] NVARCHAR(MAX) NOT NULL,
[Site] NVARCHAR(MAX) NOT NULL,
[DataTime] DATETIME NOT NULL
)
Then do an UPDATE to set the Referential Integrity fields:
UPDATE ss SET
[CompanyId] = (SELECT Id FROM [Company] c WHERE ss.[Company] = c.Name),
[SiteId] = (SELECT Id FROM [Site] s WHERE ss.[Site] = s.Name)
FROM [SiteStatistics] ss
Add the Foreign Key constraints:
ALTER TABLE [SiteStatistics] ADD CONSTRAINT [SiteStatisticsToSite_FK] FOREIGN KEY ([SiteId]) REFERENCES [Site]([Id])
ALTER TABLE [SiteStatistics] ADD CONSTRAINT [SiteStatisticsToCompany_FK] FOREIGN KEY ([CompanyId]) REFERENCES [Company]([Id])
Finally delete the Site & Company name fields from SiteStatistics:
ALTER TABLE [SiteStatistics] DROP COLUMN [Company];
ALTER TABLE [SiteStatistics] DROP COLUMN [Site];
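If you would rather do the replacement in memory, as the question suggests, a dictionary lookup per row is one way to sketch it. The sample rows, the Id values and the variable names below are hypothetical. In practice the two dictionaries would be filled once from the Site and Company tables.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical CSV-shaped table; only the two name columns matter here.
var csv = new DataTable();
csv.Columns.Add("SiteName", typeof(string));
csv.Columns.Add("Company", typeof(string));
csv.Rows.Add("example.com", "Acme");
csv.Rows.Add("example.org", "Acme");
csv.Rows.Add("example.com", "Globex");

// Hypothetical name -> Id maps, as they might be loaded once from Site and Company.
var siteIds = new Dictionary<string, int> { ["example.com"] = 1, ["example.org"] = 2 };
var companyIds = new Dictionary<string, int> { ["Acme"] = 10, ["Globex"] = 20 };

// Add the Id columns, then fill them with one dictionary lookup per cell.
csv.Columns.Add("SiteId", typeof(int));
csv.Columns.Add("CompanyId", typeof(int));
foreach (DataRow row in csv.Rows)
{
    row["SiteId"] = siteIds[(string)row["SiteName"]];
    row["CompanyId"] = companyIds[(string)row["Company"]];
}

// Drop the name columns so only the key columns remain.
csv.Columns.Remove("SiteName");
csv.Columns.Remove("Company");

Console.WriteLine($"{csv.Rows[2]["SiteId"]}, {csv.Rows[2]["CompanyId"]}");
```

The indexer throws a KeyNotFoundException for a name that is missing from the map, so new sites or companies would need to be inserted (and their Ids fetched) before this loop runs.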

How to grab last row in database table with specific requirements?

Okay so I am accepting payments on my site (via Authorize.Net). The payment form redirects to a receipt page.
I will have a column in the database for an invoice code (column InvoiceCode), which is RRC0A in this instance. Then I will have another column for an 8 digit number (column InvoiceNumber). Then I will have InvoiceCode + InvoiceNumber = InvoiceId. For example, the InvoiceId will be RRC0A + 8 numbers. It will increment as such: 00000000, 00000001, 00000002, etc. Therefore the InvoiceId will be RRC0A00000001. I cannot simply increment the column in my database because there will be other InvoiceCodes that also start at 00000000.
I need to increment the InvoiceNumber by one when I add a new row. How can I grab the last InvoiceNumber that was entered into the database? It must be associated with the InvoiceCode RRC0A. This could occur when more than 1 person is making a payment, so I am not sure of the best way.
How can I pad the incrementing InvoiceNumber with 0's in front so that it is always 8 digits?
Using an identity and a computed column, you can create your invoice numbers with the correct formatting at the time of insert.
CREATE TABLE [dbo].[Invoices](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Code] [nchar](5) NOT NULL,
[InvoiceNumber] AS ([Code]+right('00000000'+CONVERT([nvarchar](10),[ID]),(8))) PERSISTED,
[Cost] [decimal](18, 2) NOT NULL,
CONSTRAINT [PK_Invoices] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
Sample bulk insert:
INSERT INTO [dbo].[Invoices] ([Code], [Cost])
OUTPUT INSERTED.*
SELECT 'ABC01', 500 UNION ALL
SELECT 'ABC01', 501 UNION ALL
SELECT 'EFG23', 502 UNION ALL
SELECT 'RRAc1', 503 UNION ALL
SELECT 'ABC01', 504
Output:
ID  Code   InvoiceNumber  Cost
1   ABC01  ABC0100000001  500.00
2   ABC01  ABC0100000002  501.00
3   EFG23  EFG2300000003  502.00
4   RRAc1  RRAc100000004  503.00
5   ABC01  ABC0100000005  504.00
When you insert your records you can get the ID and InvoiceNumber back at the same time.
The values are also persisted so they may be indexed as you would other columns.
SELECT InvoiceCode, MAX(InvoiceID)
FROM yourTable t
GROUP BY InvoiceCode
This should return the latest InvoiceID for each InvoiceCode, but you can add your own WHERE clause to filter it down
As for how to pad-left in sql, check out this answer.
Storing the code and the number together in one column is just bad design.
Use a composite primary key instead:
InvCode (varchar), InvInt (int)
declare @InvCode varchar(20) = 'RRC0A'
insert into invoice (InvCode, InvInt)
OUTPUT INSERTED.InvInt, INSERTED.InvCode
select @InvCode, isnull(max(InvInt),-1) + 1
from invoice
where InvCode = @InvCode;
The isnull deals with the first invoice for a given code.
A single statement runs as its own transaction, so I don't think two simultaneous inserts could clobber each other.
Even if they did, the primary key would be violated, so the second insert would fail.
Use a view or a computed column for the formatted invoice number:
CREATE TABLE [dbo].[Invoice](
[InvCode] [varchar](10) NOT NULL,
[InvInt] [int] NOT NULL,
[Formatted] AS ([InvCode]+right('00000000'+CONVERT([nvarchar](10),[InvInt]),(8))),
CONSTRAINT [PK_Invoice] PRIMARY KEY CLUSTERED
(
[InvCode] ASC,
[InvInt] ASC
)
)
You can grab the last InvoiceNumber with a SELECT query.
You can pad the invoice number with the + sign to concatenate two strings, and then use RIGHT() to get the right-most 8 characters.
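If the formatting is done on the application side instead, the same zero-padding can be sketched in C#. FormatInvoiceId is a hypothetical helper name, not something from the question.

```csharp
using System;

// Hypothetical helper: combines a code with a number padded to 8 digits.
static string FormatInvoiceId(string invoiceCode, int invoiceNumber)
{
    // "D8" formats an integer as decimal digits, left-padded with zeros to width 8.
    return invoiceCode + invoiceNumber.ToString("D8");
}

Console.WriteLine(FormatInvoiceId("RRC0A", 1));   // RRC0A00000001
```

This only covers the padding. Picking the next number safely under concurrent payments still has to happen in the database.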

LINQ Expression for CROSS APPLY two levels deep

Fairly new to LINQ and am trying to figure out how to write a particular query. I have a database where each CHAIN consists of one or more ORDERS and each ORDER consists of one or more PARTIALS. The database looks like this:
CREATE TABLE Chain
(
ID int NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1),
Ticker nvarchar(6) NOT NULL,
Company nvarchar(128) NOT NULL
)
GO
CREATE TABLE [Order]
(
ID int NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1),
Chart varbinary(max) NULL,
-- Relationships
Chain int NOT NULL
)
GO
ALTER TABLE dbo.[Order] ADD CONSTRAINT FK_Order_Chain
FOREIGN KEY (Chain) REFERENCES dbo.Chain ON DELETE CASCADE
GO
CREATE TABLE Partial
(
ID int NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1),
Date date NOT NULL,
Quantity int NOT NULL,
Price money NOT NULL,
Commission money NOT NULL,
-- Relationships
[Order] int NOT NULL
)
GO
ALTER TABLE dbo.Partial ADD CONSTRAINT FK_Partial_Order
FOREIGN KEY ([Order]) REFERENCES dbo.[Order] ON DELETE CASCADE
I want to retrieve the chains, ordered by the earliest date among all the partials of all the orders for each particular chain. In T-SQL I would write the query as this:
SELECT p.DATE, c.*
FROM CHAIN c
CROSS APPLY
(
SELECT DATE = MIN(p.Date)
FROM PARTIAL p
JOIN [ORDER] o
ON p.[ORDER] = o.ID
WHERE o.CHAIN = c.ID
) AS p
ORDER BY p.DATE ASC
I have an Entity Framework context that contains a DbSet<Chain>, a DbSet<Order>, and a DbSet<Partial>. How do I finish this statement to get the result I want?:
IEnumerable<Chain> chains = db.Chains
.Include(c => c.Orders.Select(o => o.Partials))
.[WHAT NOW?]
Thank you!
Replace .[WHAT NOW?] with:
.OrderBy(c => c.Orders.SelectMany(o => o.Partials).Min(p => p.Date))
Here c.Orders joins Chain to Order, while SelectMany(o => o.Partials) joins Order to Partial. Once you have access to the Partial records, you can use any aggregate function, such as Min(p => p.Date) in your case.
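As a rough check of the ordering logic, the same expression can be run with LINQ to Objects on made-up data. The anonymous types and tickers below are hypothetical stand-ins for the EF entities.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory data shaped like Chain -> Orders -> Partials.
var chains = new[]
{
    new
    {
        Ticker = "LATE",
        Orders = new[]
        {
            new { Partials = new[] { (Date: new DateTime(2021, 6, 1), Quantity: 10) } }
        }
    },
    new
    {
        Ticker = "EARLY",
        Orders = new[]
        {
            new { Partials = new[] { (Date: new DateTime(2021, 3, 1), Quantity: 5) } },
            new { Partials = new[] { (Date: new DateTime(2020, 1, 15), Quantity: 7) } }
        }
    }
};

// Flatten each chain's partials across all of its orders, take the earliest
// date, and sort the chains by it.
var ordered = chains
    .OrderBy(c => c.Orders.SelectMany(o => o.Partials).Min(p => p.Date))
    .Select(c => c.Ticker)
    .ToList();

Console.WriteLine(string.Join(", ", ordered));
```

In LINQ to Objects, Min throws an InvalidOperationException on an empty sequence of a non-nullable type, so a chain without any partials would need special handling in this in-memory form.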

Does permutation exist

I am trying to figure out the proper query in LINQ to SQL, but I just can't work out how to do it. Let's say I have a table with the following rows (this table is basically a one-to-many relationship):
Id (PK) | SupervisorId | EmployeeId
1       | 1            | 5
2       | 1            | 6
3       | 1            | 7
4       | 2            | 5
5       | 2            | 6
6       | 3            | 7
7       | 4            | 7
8       | 4            | 8
I want my LINQ to SQL query to find the SupervisorId that has exactly employeeIds 5 and 6. The query would return 2 only. I could use two where clauses, but if I wanted to pass in three employeeIds, the query would have to be modified. If the passed combination doesn't exist for any SupervisorId (e.g. 5, 6, 8 in this case), the result would be null or empty.
The function would look like this:
int FindSuperVisorId(List<int> employeeIds);
I really don't know where to start in LINQ to SQL for this type of scenario.
Thanks
I'm pretty sure that this query should be converted properly to LINQ to SQL, but I'm not completely sure.
First we group by supervisor so that we have a sequence of employees for each supervisor. Then we use Except with the employees you're interested in, in both directions. If the count of both of those Except calls is zero, then the sets are exactly equal. There are more efficient ways of determining whether two sets are equal in LINQ to Objects, but I doubt they would be properly converted to SQL code.
var supervisorId = table.GroupBy(item => item.SupervisorId)
.Select(group => new
{
additionalItems = group.Select(item => item.EmployeeId).Except(employeeIds),
missingItems = employeeIds.Except(group.Select(item => item.EmployeeId)),
group = group
})
.Where(queries => queries.additionalItems.Count() == 0
&& queries.missingItems.Count() == 0)
.Select(queries => queries.group.Key)//gets the supervisorID
.FirstOrDefault();
I had to model your table as a many-many relationship as follows:
CREATE TABLE [dbo].[Employee](
[Name] [nvarchar](50) NOT NULL,
[Id] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
CREATE TABLE [dbo].[SupervisorEmployees](
[SupervisorId] [int] NOT NULL,
[EmployeeId] [int] NOT NULL,
CONSTRAINT [PK_SupervisorEmployees] PRIMARY KEY CLUSTERED
(
[SupervisorId] ASC,
[EmployeeId] ASC
)
)
GO
ALTER TABLE [dbo].[SupervisorEmployees] WITH CHECK ADD CONSTRAINT [FK_SupervisorEmployees_Employee] FOREIGN KEY([SupervisorId])
REFERENCES [dbo].[Employee] ([Id])
GO
ALTER TABLE [dbo].[SupervisorEmployees] CHECK CONSTRAINT [FK_SupervisorEmployees_Employee]
GO
ALTER TABLE [dbo].[SupervisorEmployees] WITH CHECK ADD CONSTRAINT [FK_SupervisorEmployees_Employee1] FOREIGN KEY([EmployeeId])
REFERENCES [dbo].[Employee] ([Id])
GO
ALTER TABLE [dbo].[SupervisorEmployees] CHECK CONSTRAINT [FK_SupervisorEmployees_Employee1]
GO
Then using Entity Framework database first (not Linq to SQL unfortunately) the following LINQPad code works fine:
void Main()
{
FindSupervisorIds( new List<int>{5,6} ).Dump();
}
IEnumerable<int> FindSupervisorIds(List<int> employeeIds)
{
// two Excepts to do 'sequence equals'
var supervisors = Employees.Where (e =>
!e.Employees.Select (em => em.Id).Except(employeeIds).Any()
&& !employeeIds.Except(e.Employees.Select (em => em.Id)).Any()
);
return supervisors.Select (s => s.Id).Distinct();
}
int? FindSupervisorId(List<int> employeeIds)
{
var supervisors = FindSupervisorIds(employeeIds).ToList();
if(supervisors.Count == 1)
{
return supervisors.First ();
}
return null;
}
FindSupervisorIds generates a single SQL query. If you need to check there's only one matching supervisor it's probably best to call ToList() on the returned list of supervisors as in FindSupervisorId.
Trying to do the same thing with LINQ to SQL fails due to the calls to Except with the exception
'NotSupportedException: Local sequence cannot be used in LINQ to SQL implementations of query operators except the Contains operator.'
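Since that exception names Contains as the one operator that accepts a local sequence, one possible workaround is to express set equality with a count check plus Contains. The sketch below runs on an in-memory copy of the sample rows from the question. Whether LINQ to SQL translates this exact shape is an assumption I have not verified, and it also assumes each (SupervisorId, EmployeeId) pair appears only once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The sample rows from the question: (SupervisorId, EmployeeId).
var rows = new List<(int SupervisorId, int EmployeeId)>
{
    (1, 5), (1, 6), (1, 7), (2, 5), (2, 6), (3, 7), (4, 7), (4, 8),
};

// Hypothetical helper: supervisors whose employee set equals employeeIds exactly.
List<int> FindSupervisorIds(List<int> employeeIds)
{
    var wanted = employeeIds.Distinct().ToList();
    return rows
        .GroupBy(r => r.SupervisorId)
        // Same size, and every employee of the supervisor is in the wanted list:
        // together these imply the two sets are equal.
        .Where(g => g.Count() == wanted.Count
                 && g.All(r => wanted.Contains(r.EmployeeId)))
        .Select(g => g.Key)
        .ToList();
}

Console.WriteLine(string.Join(", ", FindSupervisorIds(new List<int> { 5, 6 })));
```

With the sample data, passing 5 and 6 matches supervisor 2 only, and passing 5, 6 and 8 matches nothing.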
one possibility:
public int FindSuperVisorId(IEnumerable<Employee> employes)
{
var distinctSupervisors = employes.Select(e => e.SuperVisor).Distinct();
var superVisor = distinctSupervisors.Where(supervisor => employes.All(employee => employee.SuperVisor.Equals(supervisor))).FirstOrDefault();
return superVisor;
}
and in case you want all matches of same supervisors:
public IEnumerable<int> FindSuperVisorId(IEnumerable<Employee> employes)
{
var distinctSupervisors = employes.Select(e => e.SuperVisor).Distinct();
var equalSupervisors = distinctSupervisors .Where(supervisor => employes.All(employee => employee.SuperVisor.Equals(supervisor)));
return equalSupervisors;
}
or directly:
public IEnumerable<int> FindSuperVisorId(IEnumerable<Employee> employes)
{
return employes.Select(e => e.SuperVisor).Distinct()
.Where(supervisor => employes.All(employee => employee.SuperVisor.Equals(supervisor)));
}

Entity Framework, random query paging

This is what I want to achieve:
1. Query my db to return a list of entities
2. Randomize the list
3. Store the IDs of the items received for future queries
4. Run a new query on the same table where the IDs are in the list that I have stored
5. Order by the list that I have stored
I have managed to achieve steps 1 to 4 already, but step 5 is difficult. Can anyone help me with a query like this:
SELECT *
FROM table_name
WHERE id IN (1,2,3,4....)
ORDER BY (1,2,3,4....)
Thanks in advance
Try
SELECT table_name.*
FROM crazy_sorted_table
LEFT JOIN
table_name ON crazy_sorted_table.ID=table_name.ID
A normal join (equi join) should do the trick. Here is a sample approach I tested:
/** crazyOrder is filled with 100 rows with random values from 1-250 in Id **/
CREATE TABLE [dbo].[crazyOrder] (
[Id] INT NOT NULL,
[Area] VARCHAR (50) NULL,
PRIMARY KEY CLUSTERED ([Id] ASC)
);
/**Normal order is filled with value from 1-100 sequentially in id**/
CREATE TABLE [dbo].[normalOrder] (
[Id] INT NOT NULL,
[Name] VARCHAR (50) NULL,
PRIMARY KEY CLUSTERED ([Id] ASC)
);
create table #tempOrder
(id int)
insert into #tempOrder
Select top 10 Id
from crazyOrder
order by NewID()
go
Select n.*
from normalOrder n
join #tempOrder t
on t.id = n.id
I was able to retrieve the rows in the same order as in the temp table (I used a data generator for the values), although without an ORDER BY clause the engine does not guarantee that order.
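Another option for step 5 is to fetch the rows with the IN filter and restore the stored order on the client. This is a minimal sketch with made-up IDs and rows. In the real code the fetched list would come from the Entity Framework query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stored order from the earlier randomized query.
var storedIds = new List<int> { 42, 7, 19, 3 };

// Stand-in for rows fetched with "WHERE id IN (...)"; they arrive in arbitrary order.
var fetched = new List<(int Id, string Name)>
{
    (3, "c"), (7, "b"), (19, "d"), (42, "a"),
};

// Map each ID to its position in the stored list, then sort by that position.
var position = storedIds
    .Select((id, index) => (id, index))
    .ToDictionary(x => x.id, x => x.index);

var ordered = fetched.OrderBy(row => position[row.Id]).ToList();

Console.WriteLine(string.Join(", ", ordered.Select(r => r.Id)));
```

The dictionary makes each position lookup constant time, which matters more as the stored list grows. It assumes the stored IDs are unique, since ToDictionary throws on duplicate keys.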
