Multi-column duplicates query - c#

I'm using T-SQL from C# to query a database table that contains five integer
columns, which track the number of times certain actions have taken place in a
program. There are also other columns in the table.
Example:
Num1  Num2  Num3  Num4  Num5
1     15    22    23    32
15    4     21    17    19
6     5     15    18    20
I need to construct a query that returns each duplicated set of integers and a
count of all rows where all of these 5 columns, when considered as a set of values, have been duplicated.
To clarify further, I need to know how many times Num1=6, Num2=5, Num3=15, Num4=18, and Num5=20, if it does occur. I also need to know if any other sets of duplicates occur in these five columns.
I know some SQL, but this is a complex query that I need help with. I've tried
many subqueries etc, but I just can't figure out the right combination of
SELECT and ORDER BY's to make it work. The data table in question has about 7000
records in it and is expected to grow no larger than about 10k, so performance is secondary.
THANKS in advance.

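This looks like a straightforward SELECT COUNT with a GROUP BY on the five columns, plus a HAVING clause so that only sets that occur more than once are returned.
Something along the lines of (tableA stands in for your table name):
SELECT Num1, Num2, Num3, Num4, Num5, COUNT(*) AS DuplicateCount
FROM tableA
GROUP BY Num1, Num2, Num3, Num4, Num5
HAVING COUNT(*) > 1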

You can do this with a join on a subquery that keeps only the duplicated sets and counts them. This returns every duplicated row, including the other columns (Col1, Col2, Col3 here), together with the number of times its set occurs:
select a.Col1, a.Col2, a.Col3, a.Num1, a.Num2, a.Num3, a.Num4, a.Num5, t.cnt as numberOfDuplicates
from tableA a
join (select Num1, Num2, Num3, Num4, Num5, count(*) as cnt
      from tableA
      group by Num1, Num2, Num3, Num4, Num5
      having count(*) > 1
     ) t
  on a.Num1 = t.Num1 and a.Num2 = t.Num2 and a.Num3 = t.Num3 and a.Num4 = t.Num4 and a.Num5 = t.Num5
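If the rows have already been loaded into memory on the C# side, the same GROUP BY / HAVING idea can be sketched with LINQ. This is only an illustration under assumptions: the class name DuplicateSetsSketch, the method FindDuplicates, and the representation of each row as an int[] of the five values are all made up for the example and are not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateSetsSketch
{
    // Hypothetical helper: each row is an int[] holding Num1..Num5 in order.
    // Returns every five-value combination that occurs more than once,
    // together with the number of rows that carry it.
    public static List<((int, int, int, int, int) Key, int Count)> FindDuplicates(
        IEnumerable<int[]> rows)
    {
        return rows
            .GroupBy(r => (r[0], r[1], r[2], r[3], r[4])) // GROUP BY Num1..Num5
            .Where(g => g.Count() > 1)                    // HAVING COUNT(*) > 1
            .Select(g => (g.Key, g.Count()))
            .ToList();
    }

    public static void Main()
    {
        // Sample rows from the question, plus one repeated row for illustration.
        var rows = new List<int[]>
        {
            new[] { 1, 15, 22, 23, 32 },
            new[] { 15, 4, 21, 17, 19 },
            new[] { 6, 5, 15, 18, 20 },
            new[] { 6, 5, 15, 18, 20 },
        };

        foreach (var d in FindDuplicates(rows))
            Console.WriteLine($"{d.Key} occurs {d.Count} times");
    }
}
```

Value tuples compare by value, which is what lets the five numbers act as a single grouping key. For a table of roughly 10k rows either approach should be fine, though doing the grouping in SQL avoids transferring every row.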

Related

C# calculate difference from two rows based on a sql query

I have a task to solve. I am trying to display the operation time of two machines (number1 and number2) in a diagram. To do so, I store the readings in a table with the columns id, date, number1, number2.
Let's assume I have this specific dataset:
id | date     | number1 | number2
1  | 24.09.14 | 100     | 120
2  | 01.10.14 | 150     | 160
To display the information I need to retrieve the following value:
((number1(2)- number1(1)) + number2(2) - number1(1))/2)/(number of days (date2 - date1))
This should result in the following specific numbers:
((150-100 + 160-120)/2)/7 = 6,42
Or, in plain words: the result should be the average daily operation time across all of my machines. Subtracting Saturdays and Sundays from the number of days would be nice but is not necessary.
I hope that you understand my question. In essence, my problem is that I don't know how to work with different rows from a simple SQL query.
The programming language is C# in a Razor-based web project.

First, I doubt that you have only 2 records in the database. Here is some code that performs the calculation for every pair of rows in a DataSet.
DataRowCollection rows = dst.Tables[0].Rows;
// The loop consumes rows in pairs, so an odd count leaves one row unpaired.
if (rows.Count % 2 != 0)
    Console.WriteLine("Wrong record count");
for (int i = 0; i < rows.Count - 1; i += 2)
{
    int number1Row1 = Convert.ToInt32(rows[i]["Number1"]);
    int number2Row1 = Convert.ToInt32(rows[i]["Number2"]);
    int number1Row2 = Convert.ToInt32(rows[i + 1]["Number1"]);
    int number2Row2 = Convert.ToInt32(rows[i + 1]["Number2"]);
    DateTime dateRow1 = Convert.ToDateTime(rows[i]["Date"]);
    DateTime dateRow2 = Convert.ToDateTime(rows[i + 1]["Date"]);
    // Average increase across both machines, divided by the days elapsed.
    // Dividing by 2.0 avoids integer division.
    double calc = ((number1Row2 - number1Row1) + (number2Row2 - number2Row1)) / 2.0
                  / (dateRow2 - dateRow1).TotalDays;
    Console.WriteLine(calc);
}
It is written to be as easy to follow as possible.

Your formula probably contains a mistake when compared with your numerical sample. It should be:
(((number1(2) - number1(1)) + (number2(2) - number2(1))) / 2) / (number of days (date2 - date1))
If the values of the id column are chronological and have no gaps (1, 2, 3, 4, ... is fine, but 1, 3, 4, 6 is not), you can try the following script:
--- create a #tmp table for testing
CREATE TABLE #tmp
(
    id int,
    Date datetime,
    number1 float,
    number2 float
)
--- insert sample data
INSERT INTO #tmp (id, Date, number1, number2) VALUES (1, '2014-09-24T00:00:00', 100, 120), (2, '2014-10-01T00:00:00', 150, 160)
--- t1 is the earlier row, t2 the row that follows it
SELECT t2.number1, t1.number1, t2.number2, t1.number2, DATEDIFF(DAY, t1.date, t2.date)
     , (((t2.number1 - t1.number1) + (t2.number2 - t1.number2)) / 2) / DATEDIFF(DAY, t1.date, t2.date) AS result
FROM #tmp t1
INNER JOIN #tmp t2 ON t1.id + 1 = t2.id
It works on my SQL Server.

Yes, you can do it with a SQL query. Try the query below.
SELECT
    N1.Date AS PeriodStartDate,
    N2.Date AS PeriodEndDate,
    CAST(CAST((((N2.number1 - N1.number1) + (N2.number2 - N1.number2)) / 2) AS DECIMAL(18,2)) / DATEDIFF(d, N1.Date, N2.Date) AS DECIMAL(18,2)) AS AverageDailyOperation
FROM
    [dbo].[NumberTable] N1
INNER JOIN
    [dbo].[NumberTable] N2
    ON N2.Date > N1.Date
I have assumed the table is named NumberTable, and I added PeriodStartDate and PeriodEndDate to make the output meaningful; remove them if you don't need them. Note that the join condition N2.Date > N1.Date pairs each row with every later row, so with more than two rows you get all pairs of dates, not only consecutive ones.
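The question also mentions that leaving out Saturdays and Sundays would be nice. One way to do that on the C# side is sketched below. The names OperationTimeSketch, CountWeekdays and AverageDailyOperation are invented for this example, and it assumes the start date is counted and the end date is not.

```csharp
using System;

public static class OperationTimeSketch
{
    // Counts the days in [start, end) that are neither Saturday nor Sunday.
    public static int CountWeekdays(DateTime start, DateTime end)
    {
        int count = 0;
        for (DateTime d = start.Date; d < end.Date; d = d.AddDays(1))
        {
            if (d.DayOfWeek != DayOfWeek.Saturday && d.DayOfWeek != DayOfWeek.Sunday)
                count++;
        }
        return count;
    }

    // Average increase of the two counters between two readings, per day.
    // Dividing by 2.0 keeps the arithmetic in floating point.
    public static double AverageDailyOperation(
        int number1Old, int number2Old, int number1New, int number2New, int days)
    {
        return ((number1New - number1Old) + (number2New - number2Old)) / 2.0 / days;
    }

    public static void Main()
    {
        var first = new DateTime(2014, 9, 24);
        var second = new DateTime(2014, 10, 1);

        int calendarDays = (int)(second - first).TotalDays;
        int weekdays = CountWeekdays(first, second);

        Console.WriteLine(AverageDailyOperation(100, 120, 150, 160, calendarDays));
        Console.WriteLine(AverageDailyOperation(100, 120, 150, 160, weekdays));
    }
}
```

For the sample dataset the period covers 7 calendar days, 5 of which are weekdays, so the weekday-based average comes out higher than the calendar-based one. A day-by-day loop is simple and adequate for short periods; a closed-form calculation would be preferable for very long date ranges.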

How to write dynamic Linq to count matching numbers

I have two tables in database. Ticket and TicketNumbers. I would like to write Linq to count the number of tickets that have numbers matching those passed into this function. Since we don't know how many numbers must be matched, the Linq has to be dynamic ... to my understanding.
public int CountPartialMatchingTicket(IList<int> numbers)
{
// where's the code? =_=;
}
Say for example there are 3 Tickets in the database now and I want to count up all those that have the numbers 3 and 4.
(1) 1 2 3 4
(2) 1 3 4 6 7
(3) 1 2 3
In this case the function should return 2, since ticket (1) and (2) have the matching numbers.
In another case if asked to match 1, 2, 3, then again we should be returned 2.
Here's what those two tables in the database look like:
Ticket:
TicketId, Name, AddDateTime
TicketNumbers:
Id, TicketId, Number
I've never used Dynamic Linq before, so just in case this is what I have at the top of my cs file.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using LottoGen.Models;
using System.Linq.Dynamic;
First things first though, I don't even know how I should write the Linq line for a fixed amount of numbers. I suppose the SQL would be like this:
SELECT TicketId, COUNT(0) AS Expr1
FROM TicketNumber
WHERE (Number = 3) OR (Number = 4)
GROUP BY TicketId
However this isn't what I want either. The above query would get me Tickets that have either a 3 or a 4 - but I just want the tickets that have BOTH numbers. And I guess it has to be nested somehow to return a single count. If I had to use my imagination for completing the function then, it would be something like this:
public int CountPartialMatchingTicket(IList<int> numbers)
{
string query = "";
foreach(int number in numbers) {
query += "Number = " + number.ToString() + " AND ";
}
// I know.. there is a trailing AND.. lazy
int count = DbContext.TicketNumbers.Where(query).Count();
return count;
}
Oh wait a minute. There's no Dynamic Linq there... The above is looking like something I would do in PHP and that query statement obviously does not do anything useful. What am I doing? :(
At the end of the day, I want to output a little table to the webpage looking like this.
Ticket Matching Tickets
-----------------------------------
3 4 2
Trinity, help!

public int CountPartialMatchingTicket(IList<int> numbers)
{
    var arr = numbers.ToArray();
    // Count tickets for which every requested number appears among the ticket's numbers.
    int count = DbContext.Tickets
        .Count(tk => arr.All(n => tk.TicketNumbers.Any(tn => tn.Number == n)));
    return count;
}
UPDATE: "If you don't mind, how would I limit this query to just the tickets made on a particular AddDateTime (the whole day)?"
The part inside the Count() method call is the WHERE condition, so just extend that:
DateTime targetDate = ......;
DateTime tooLate = targetDate.AddDays(1);
int count = DbContext.Tickets
    .Count(tk =>
        targetDate <= tk.AddDateTime && tk.AddDateTime < tooLate
        && arr.All(n => tk.TicketNumbers.Any(tn => tn.Number == n)));

This is similar to James Curran's answer, but a little simpler, and it should produce a simpler WHERE IN-based query. Note that, as written, it counts tickets containing at least one of the numbers rather than all of them:
// count all tickets...
return DbContext.Tickets.Count(
    // where any of their ticket numbers
    tk => tk.TicketNumbers.Any(
        // are contained in our list of numbers
        tn => numbers.Contains(tn.Number)));
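To make the difference between "contains all of the numbers" and "contains any of the numbers" concrete, here is a small in-memory sketch using the three sample tickets from the question. The class TicketMatchSketch and its members are hypothetical; a dictionary stands in for the Ticket and TicketNumbers tables, so no database is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TicketMatchSketch
{
    // Sample tickets from the question: ticket id -> its numbers.
    public static readonly Dictionary<int, int[]> Tickets = new Dictionary<int, int[]>
    {
        { 1, new[] { 1, 2, 3, 4 } },
        { 2, new[] { 1, 3, 4, 6, 7 } },
        { 3, new[] { 1, 2, 3 } },
    };

    // Tickets that contain every requested number.
    public static int CountAll(IList<int> numbers) =>
        Tickets.Values.Count(t => numbers.All(n => t.Contains(n)));

    // Tickets that contain at least one requested number.
    public static int CountAny(IList<int> numbers) =>
        Tickets.Values.Count(t => t.Any(n => numbers.Contains(n)));

    public static void Main()
    {
        var wanted = new List<int> { 3, 4 };
        Console.WriteLine(CountAll(wanted)); // 2: tickets (1) and (2) hold both 3 and 4
        Console.WriteLine(CountAny(wanted)); // 3: ticket (3) also holds a 3
    }
}
```

The question asks for the first behaviour, which corresponds to the All/Any shape of the predicate.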

Method to find hits in comma-separated number string on SQL Server

I have a Windows forms (c#) application and a table in SQL Server that has two columns like this:
ticket (int) | numbers (string)
12345 | '01, 02, 04, 05, 09, 10, 23'
This table may have 100,000 rows or more.
What I have to do is find the number of hits for a given array of numbers, like in a lottery.
There are prizes for 12 hits, 11 hits and 9 hits, for example, and after each number is drawn I have to search for the tickets that have reached 12, 11 or 9 hits.
So, what is the best way to approach this? I need the best performance.
For now I have this code:
string sentSQL = " SELECT ticket, numbers FROM tableA";
/* CODE TO PERFORM THE CONNECTION */
/*...*/
DbDataReader reader = connection.ExecuteReader();
// Each counter needs its own initializer; "int a, b, c = 0;" only initializes c.
int hits12 = 0, hits11 = 0, hits9 = 0;
int count;
while (reader.Read())
{
    count = 0;
    string numbers = reader["numbers"].ToString();
    string ticketNumber = reader["ticket"].ToString();
    int maxJ = balls.Count; //balls is the ArrayList with the numbers currently extracted in the raffle
    for (int j = 0; j < maxJ; j++)
    {
        if (numbers.Contains(balls[j].ToString()))
        {
            count++;
        }
    }
    switch (count)
    {
        case 12:
            hits12++;
            break;
        case 11:
            hits11++;
            break;
        case 9:
            hits9++;
            break;
    }
}
This works, but maybe there is a better way to do it.
I'm using SQL Server 2012; is there a function that could help me?
Edit: Can I compute a SUM of the CHARINDEX of each number inside the SQL query to get the number of hits?

You currently have a totally tacky solution. Store one selected number per row instead:
create table ticket (
    ticketId int not null -- PK
)
create table TicketNumbers (
    ticketId int not null,
    numberSelected int not null
)
TicketNumbers has an FK to ticket, and a PK of ticketId + numberSelected.
select t.ticketId, count(*) CorrectNumbers
from ticket t
inner join TicketNumbers tn on tn.ticketId = t.ticketId
where tn.numberSelected in (9, 11, 12, 15) -- list all winning numbers
group by t.ticketId
order by count(*) desc
Cheers -

One simple way to improve this is to update your SELECT statement to fetch only records with numbers greater than your first ball number and less than your last ball number + 1 ...
Example (probably not correct SQL):
SELECT ticket, numbers FROM tableA where '10' < numbers and '43' > numbers
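If the comma-separated column has to stay as it is, one thing to watch in the C# loop from the question is that a substring test can count false hits: a ball of 3 converted with ToString() is also found inside "23". A sketch of an alternative is to parse each string into a set of integers and count exact matches. TicketHitsSketch, ParseNumbers and CountHits are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TicketHitsSketch
{
    // Parses a string such as "01, 02, 04" into the set {1, 2, 4}.
    public static HashSet<int> ParseNumbers(string numbers) =>
        new HashSet<int>(numbers
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => int.Parse(s.Trim())));

    // Counts how many drawn balls appear on the ticket, using exact integer matches.
    public static int CountHits(string numbers, IEnumerable<int> balls)
    {
        HashSet<int> ticket = ParseNumbers(numbers);
        return balls.Count(b => ticket.Contains(b));
    }

    public static void Main()
    {
        string ticketNumbers = "01, 02, 04, 05, 09, 10, 23";
        int[] balls = { 1, 10, 3 };

        // 1 and 10 are on the ticket; 3 is not, even though "23" contains the character '3'.
        Console.WriteLine(CountHits(ticketNumbers, balls)); // 2
    }
}
```

This still reads every row on the client, so it addresses correctness rather than performance; the normalized TicketNumbers table suggested above lets SQL Server do the counting.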

SQL Query: For each value, determine the percentage of rows that contain the value?

Let's say I have a contact manager system. There are notes associated to each contact made by employees.
So, here's my quick example:
ContactName, NoteCount
John, 100
Rob, 10
Amy, 10
Chris, 10
How do I figure out that 75% of contacts have 10 notes associated with them and that 25% of contacts have 100 notes associated with them?
Please explain what I'm trying to do in Layman's terms.

If you really want the percentage of people that have the exact number, use this (the 100.0 factor forces decimal arithmetic; dividing one integer COUNT by another would truncate to 0):
SELECT
    NoteCount,
    COUNT(*) ContactsWithThisNoteCount,
    100.0 * COUNT(*) / (SELECT COUNT(*) FROM Contacts) PercentageContactsWithThisNoteCount
FROM
    Contacts
GROUP BY
    NoteCount
If you want groupings like "0-9", "10-99", and "100+", then you just need a little bit of calculation in the GROUP BY and MIN/MAX on NoteCount.

select
    ((countTen/countTotal)*100) as percentTen,
    ((countHundred/countTotal)*100) as percentHundred
FROM (
    select
        cast(sum(case when noteCount <= 10 then 1 else 0 end) as float) as countTen,
        cast(sum(case when noteCount <= 100 and noteCount > 10 then 1 else 0 end) as float) as countHundred,
        cast(count(*) as float) as countTotal
    from
        contacts
) temp
This should be OK. I often use the SUM + CASE trick when I need to do a count with a filter.
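For comparison, the same "share of contacts per note count" calculation can be sketched in C# with LINQ on in-memory data. NoteShareSketch and PercentByNoteCount are hypothetical names, and the input is assumed to be one note count per contact.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NoteShareSketch
{
    // Maps each distinct note count to the percentage of contacts that have it.
    public static Dictionary<int, double> PercentByNoteCount(IEnumerable<int> noteCounts)
    {
        List<int> all = noteCounts.ToList();
        return all
            .GroupBy(n => n)
            // 100.0 forces floating-point division, mirroring the cast in the SQL.
            .ToDictionary(g => g.Key, g => 100.0 * g.Count() / all.Count);
    }

    public static void Main()
    {
        // John has 100 notes; Rob, Amy and Chris have 10 each.
        var shares = PercentByNoteCount(new[] { 100, 10, 10, 10 });
        foreach (var pair in shares.OrderBy(p => p.Key))
            Console.WriteLine($"{pair.Value}% of contacts have {pair.Key} notes");
    }
}
```

In plain terms: count how many contacts share each note count, then divide by the total number of contacts.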

SQL huge selection of IDs - How to make it faster?

I have an array with a huge number of IDs that I would like to select from the DB.
The usual approach would be to do select blabla from xxx where yyy IN (ids) OPTION (RECOMPILE).
(The option recompile is needed, because SQL Server is not intelligent enough to see that putting this query in its query cache is a huge waste of memory.)
However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow.
Let me give an example:
SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)
Time to execute: ~1100 msec (this returns approx. 200 rows in my example)
Versus:
SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)
Time to execute: ~80 msec (this returns approx. 50000 rows in my example)
So even though I get 250 times more data back, it executes about 14 times faster...
So I built this function to take my list of IDs and build something that is a reasonable compromise between the two (something that doesn't return 250 times as much data, yet still gives the benefit of parsing the query faster):
private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;
public static string MassIdSelectionStringBuilder(
    List<int> keys, ref int startindex, string colname)
{
    const int maxlength = 63000;
    if (keys.Count - startindex == 1)
    {
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
    }
    StringBuilder sb = new StringBuilder(maxlength + 1000);
    List<int> individualkeys = new List<int>(256);
    int min = keys[startindex++];
    int max = min;
    sb.Append("(");
    const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
    for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
    {
        int key = keys[startindex];
        if (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
            if (min == max)
                individualkeys.Add(min);
            else
            {
                if(sb.Length > 2)
                    sb.Append(" OR ");
                sb.AppendFormat(betweenAnd, colname, min, max);
            }
            min = max = key;
        }
        else
        {
            max = key;
        }
    }
    if (min == max)
        individualkeys.Add(min);
    else
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
    }
    if (individualkeys.Count > 0)
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
            individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname, String.Join(",",individualkeysstr));
    }
    sb.Append(")");
    return sb.ToString();
}
It is then used like this:
List<int> keys; //Sort and make unique
...
for (int i = 0; i < keys.Count;)
{
string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
However, my question is...
Does anyone know of a better/faster/smarter way to do this?

In my experience the fastest way was to pack numbers in binary format into an image. I was sending up to 100K IDs, which works just fine:
Mimicking a table variable parameter with an image
Yet that was a while ago. The following articles by Erland Sommarskog are up to date:
Arrays and Lists in SQL Server

If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.
If that isn't possible, then try creating a table variable like so:
DECLARE @tTable TABLE
(
    Id int
)
Store the IDs in the table variable first, then INNER JOIN it to your table xxx. I have had limited success with this method, but it's worth a try.

You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.
Let's consider the 4 ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}.
They translate into
ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)
The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.
I think the proper criterion is how many individual IDs you save. Converting two individual IDs into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending a range has a net benefit of Cost(individual) - Cost(range extension).

Adding RECOMPILE is not a good idea. Caching a compiled query does not mean SQL Server saves your query results; it saves the execution plan, which is what makes later executions faster. If you add RECOMPILE, you pay the overhead of compiling the query every time. Try putting the query in a stored procedure and calling it from there, since stored procedures are always precompiled.

Another dirty idea, similar to Neil's:
Have an indexed view that holds only the IDs, based on your business condition.
You can then join the view with your actual table and get the desired result.

The efficient way to do this is to:
Create a temporary table to hold the IDs
Call a SQL stored procedure with a string parameter holding all the comma-separated IDs
The SQL stored procedure uses a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas and CONVERT to make it an int, and use INSERT INTO #Temporary VALUES ... to insert it into the temporary table
INNER JOIN the temporary table or use it in an IN (SELECT ID from #Temporary) subquery
Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual id values.
No recompilation is done at all when this is executed as long as the large string is passed as a parameter.
Note that in the loop you must track the positions of the prior and current comma in two separate variables.
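The answer describes this loop for a T-SQL stored procedure. Purely to illustrate the two-position bookkeeping, here is the same logic sketched in C#, where IndexOf plays the role of CHARINDEX and Substring the role of SUBSTRING. CommaScanSketch and ParseIds are invented names, and the list returned stands in for the rows inserted into the temporary table.

```csharp
using System;
using System.Collections.Generic;

public static class CommaScanSketch
{
    // Walks the string once, tracking where the current token starts (prior)
    // and where the next comma is (current), and converts each token to an int.
    public static List<int> ParseIds(string csv)
    {
        var ids = new List<int>();
        int prior = 0; // index just after the previous comma
        while (prior <= csv.Length)
        {
            int current = csv.IndexOf(',', prior); // like CHARINDEX(',', @ids, @prior)
            if (current < 0)
                current = csv.Length;              // last token runs to the end of the string

            if (current > prior)                   // skip empty tokens such as ",,"
                ids.Add(int.Parse(csv.Substring(prior, current - prior)));

            prior = current + 1;
        }
        return ids;
    }

    public static void Main()
    {
        foreach (int id in ParseIds("288525,288528,288529,403043,403044"))
            Console.WriteLine(id);
    }
}
```

Only the ID substrings themselves are allocated, which matches the point made above that no other intermediate strings are needed.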

Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully, just wonder if this would optimize to use between and then filter the unneeded rows out:
Select * from
    ( SELECT *
      FROM dbo.table
      WHERE ID between <lowerbound> and <upperbound>) as range
where ID in (
    1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213,
    1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221,
    1222, 1223, 1224, 1225, 1226, 1227, 1228,
    <...>,
    1230, 1231
)
