group by first letter of the string - c#

I have a Person table with a huge number of records, and I want to group the duplicate persons in it.
By one of the requirements, persons are duplicates if they have the same family name and the first letters of their first names are equal, so I want to group by first name and the first letter of the family name. Is there a way to group like this in SQL? I need this in C#, so some processing could be done in code, but the number of persons is huge, so the algorithm should be fast.

If I understand you correctly, in SQL Server you can do something like
SELECT DISTINCT
Surname,
LEFT(FirstName,1) FirstNameLetter
FROM Persons
Other than that, we will need a little bit more detail. Table schema, expected result set, etc...

SELECT MEMBER.MEMBER_FIRSTNAME, COUNT(MEMBER.MEMBER_LASTNAME)
FROM dbo.MEMBER
GROUP BY MEMBER.MEMBER_FIRSTNAME, SUBSTRING(MEMBER.MEMBER_LASTNAME, 1,1)
HAVING COUNT(MEMBER.MEMBER_LASTNAME) > 1
This query returns every first name (of members, in this case) that is shared by more than one member whose last name starts with the same letter. In other words, duplicates as you've defined them.
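If part of the processing has to happen in C#, the same grouping can be sketched with LINQ to Objects. This is only a minimal in-memory sketch: the `Person` record and the `DuplicateFinder` class are hypothetical names, the real table schema is not given, first names are assumed to be non-null, and the case-insensitive comparison is an assumption that mimics a typical case-insensitive database collation.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of a row; the real schema is not known.
public record Person(string FirstName, string LastName);

public static class DuplicateFinder
{
    // Groups by (first name, first letter of last name) and keeps only
    // groups with more than one member, like GROUP BY ... HAVING COUNT(*) > 1.
    public static List<List<Person>> FindDuplicateGroups(IEnumerable<Person> people)
    {
        return people
            // Rows without a last name cannot supply a first letter.
            .Where(p => !string.IsNullOrEmpty(p.LastName))
            .GroupBy(p => (
                First: p.FirstName.ToUpperInvariant(),
                Initial: char.ToUpperInvariant(p.LastName[0])))
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

For a huge table it is probably still better to let the database do the grouping and only pull the duplicate groups into memory.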

Merging queries of same table N times

I have a table of words, a lookup table of where those words are found in documents, and the number of times each word appears in each document. So there might be a record that says Alpha exists 5 times in document X, while Beta exists 3 times in document X, and another for Beta existing twice in document Y.
The user can enter a number of words to search, so "quick brown fox" is three queries, but "quick brown fox jumped" is four queries. While I can get a scored result set of each word in turn, what I actually want is to add the number of occurrences together for each word, such that the top result is the highest occurrence count for all words.
A document might have hundreds of "quick" and "brown" occurrences but no "fox" occurrences. The results should still be included as it could score higher than a document with only one each of "quick", "brown", and "fox".
The problem I can't work out is how to amalgamate the 1 to N queries with the occurrences summed. I think I need to use GROUP BY and SUM() but I'm not certain. LINQ preferred but SQL would be OK. MS SQL 2016.
I want to pass the results on to a page indexer so a for-each over the results wouldn't work, plus we're talking 80,000 word records, 3 million document-word records, and 100,000 document records.
// TextIndexDocument:
// Id | WordId | Occurences | DocumentId | (more)
//
// TextIndexWord:
// Id | Word
foreach (string word in words)
{
    string lword = word.ToLowerInvariant();
    var results = from docTable in db.TextIndexDocuments
                  join wordTable in db.TextIndexWords on docTable.WordId equals wordTable.Id
                  where wordTable.Word == lword
                  orderby docTable.Occurences descending
                  select docTable;
    // (incomplete)
}
More information
I understand that full text searching is recommended. The problem then is how to rank the results from a half dozen unrelated tables (searching in forum posts, articles, products...) into one unified result set - let's say record Id, record type (article/product/forum), and score. The top result might be a forum post while the next best hits are a couple of articles, then a product, then another forum post and so on. The TextIndexDocument table already has this information across all the relevant tables.
Let's assume that you can create a navigation property TextIndexDocuments in Document:
public virtual ICollection<TextIndexDocument> TextIndexDocuments { get; set; }
and a navigation property in TextIndexDocument:
public virtual TextIndexWord TextIndexWord { get; set; }
(highly recommended)
Then you can use the properties to get the desired results:
var results =
    (
        from doc in db.Documents
        select new
        {
            doc,
            TotalOccurrences =
                doc.TextIndexDocuments
                    // lwords is assumed to hold the lower-cased search words
                    .Where(tid => lwords.Contains(tid.TextIndexWord.Word))
                    // the lambda parameter must not reuse the name "doc"
                    .Sum(tid => tid.Occurences)
        }
    ).OrderByDescending(x => x.TotalOccurrences);
As far as I know this cannot, or at least not easily, be accomplished in LINQ, especially not in any kind of performant way.
What you really should consider, assuming your DBA will allow it, is Full-Text indexing of your documents stored in SQL Server. From my understanding the RANK operator is exactly what you are looking for which has been highly optimized for Full-Text.
In response to your comment: (sorry for not noticing that)
You'll need to use either a series of subqueries or Common Table Expressions. CTEs are a bit hard to get used to writing at first, but once you get used to them they are far more elegant than the corresponding query written with subqueries. Either way the query execution plan will typically be the same, so there is usually no performance gain from going the CTE route.
You want to add up the occurrences of the words per document. So group by document ID, use SUM, and order by the total descending:
select documentid, sum(occurences)
from doctable
where wordid in (select id from wordtable where word in ('quick', 'brown', 'fox'))
group by documentid
order by sum(occurences) desc;
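Since LINQ was preferred, the same group-and-sum idea can be sketched in C#. The version below runs over in-memory collections only; `WordRow`, `DocWordRow` and `OccurrenceRanker` are hypothetical stand-ins for the TextIndexWord and TextIndexDocument tables, and whether an equivalent query translates well in a given ORM is not something this sketch verifies.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the two tables in the question.
public record WordRow(int Id, string Word);
public record DocWordRow(int WordId, int DocumentId, int Occurences);

public static class OccurrenceRanker
{
    // Returns (document id, summed occurrences) pairs, highest total first.
    public static List<(int DocumentId, int Total)> Rank(
        IEnumerable<WordRow> words,
        IEnumerable<DocWordRow> docWords,
        IEnumerable<string> searchTerms)
    {
        // Lower-case the search terms once, as the question does per word.
        var terms = new HashSet<string>(searchTerms.Select(t => t.ToLowerInvariant()));

        // Equivalent of: select id from wordtable where word in (...)
        var wordIds = new HashSet<int>(
            words.Where(w => terms.Contains(w.Word)).Select(w => w.Id));

        // Equivalent of: group by documentid, sum(occurences), order by desc
        return docWords
            .Where(d => wordIds.Contains(d.WordId))
            .GroupBy(d => d.DocumentId)
            .Select(g => (DocumentId: g.Key, Total: g.Sum(d => d.Occurences)))
            .OrderByDescending(x => x.Total)
            .ToList();
    }
}
```

A document with many "quick" and "brown" hits but no "fox" still appears in the result, because the filter only requires that a row matches any one of the search words.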

Full match against a List of strings per id

I need a LINQ query that will return null unless, for each hardware_id, all of its rows have a matching string in a List<string>.
I have the following table:
id (int) - Primary Key
name (string)
user_id (int)
hardware_id (int)
I have a List<string> that contains phrases. I want the query to return a hardware_id only if all the phrases in the List have matching strings in the name rows for that hardware_id. If one of the phrases doesn't have a name match, the query should return null; otherwise it should return the list of hardware_ids that each have a full match with all the phrases in the List.
In other words, return a list of hardware_ids where each id has all of its names matching the ones in the List<string>.
I thought about iterating each Id in a different query but it's not an effective way to do it. Maybe you know a good query to tackle this.
I'm using Entity Framework 6 / C# / MySQL
Note: the query is done only per user id. So I filter the table first by the user id and then need to find the matching hardware_ids that satisfy the condition.
Group on hardware_id and then check that every name in each group exists in the List:
table.GroupBy(x => x.hardware_id)
     .Where(x => x.All(s => phrases.Contains(s.name)))
     .Select(x => x.Key);
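The grouping approach above can be tried out in memory with a sketch like the following. `HardwareName` and `FullMatchFinder` are hypothetical names, and the sketch runs on LINQ to Objects rather than on Entity Framework. Because the question can be read two ways, two variants are shown: one checks that every name of a hardware_id is in the phrase list (what the answer does), the other checks that every phrase appears among the names. Both return an empty list rather than null when nothing matches.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of a row of the table in the question.
public record HardwareName(int Id, string Name, int UserId, int HardwareId);

public static class FullMatchFinder
{
    // Every name of the hardware_id must be one of the phrases.
    public static List<int> FindWithAllNamesInPhrases(
        IEnumerable<HardwareName> rows, int userId, ICollection<string> phrases)
    {
        return rows
            .Where(r => r.UserId == userId)
            .GroupBy(r => r.HardwareId)
            .Where(g => g.All(r => phrases.Contains(r.Name)))
            .Select(g => g.Key)
            .ToList();
    }

    // Every phrase must appear among the names of the hardware_id.
    public static List<int> FindCoveringAllPhrases(
        IEnumerable<HardwareName> rows, int userId, ICollection<string> phrases)
    {
        return rows
            .Where(r => r.UserId == userId)
            .GroupBy(r => r.HardwareId)
            .Where(g => phrases.All(p => g.Any(r => r.Name == p)))
            .Select(g => g.Key)
            .ToList();
    }
}
```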

LINQ - how to remove duplicate rows in table

After a certain process, I want to remove duplicates from the table and commit the changes, so that only single values remain.
I have three criteria for removal:
Name
date
status (is always 1)
So if there are records with same Name, and same date and same status... remove one. Does not matter which one.
I have:
dbContext.tbl_mytable
Since you are talking about deleting records, you need to test this first.
"So if there are records with same Name, and same date and same status... remove one. Does not matter which one."
I'm assuming you want to remove all but one, i.e. if you have three records with the same details, you remove two and leave one.
If so, you should be able to identify the duplicates by grouping by { Name, date, status } and then selecting all except the first record in each group.
That is, something like
var duplicates = (from r in dbContext.tbl_mytable
                  group r by new { r.Name, r.date, r.status } into results
                  select results.Skip(1)
                 ).SelectMany(a => a);
dbContext.tbl_mytable.DeleteAllOnSubmit(duplicates);
dbContext.SubmitChanges();
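Because this deletes data, it may help to check the "all but the first in each group" selection on an in-memory copy first. The following is a minimal sketch with hypothetical names (`Row`, `DuplicateSelector`); it runs on LINQ to Objects and says nothing about how the query above is translated to SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of a tbl_mytable row.
public record Row(int Id, string Name, DateTime Date, int Status);

public static class DuplicateSelector
{
    // Returns the rows that would be deleted: every row except the first
    // one in each (Name, Date, Status) group.
    public static List<Row> SelectRedundant(IEnumerable<Row> rows)
    {
        return rows
            .GroupBy(r => (r.Name, r.Date, r.Status))
            .SelectMany(g => g.Skip(1))
            .ToList();
    }
}
```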

Remove certain parts of a statement

I've got an application which generates a number of SQL statements, whose selection fields all contain AS clauses, like so:
SELECT TOP 200 [customer_name] AS [Customer Name], [customer_age] AS [Customer Age], AVG([customer_age]) AS 'Average Customer Age' FROM [tb_customers] GROUP BY [customer_age]
My statements will always be in that format. My task is to parse them so that "TOP 200" is removed, as well as all the AS clauses except for aggregates. In other words, I would want to parse the statements and in that case it would end up like so:
SELECT [customer_name], [customer_age], AVG([customer_age]) AS 'Average Customer Age' FROM [tb_customers] GROUP BY [customer_age]
How would I go about doing this? Is it even possible? It seems like a very complex parsing task, since the number of fields is never going to be the same. If it helps, I've got a variable which stores the number of fields (not including aggregates).
You may use a regular expression, like replace all occurrences of
AS \[.*?\]
with empty text
or all occurrences of
AS \[.*?\],
with a comma ",".
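The question mark "?" is important here as it turns off greedy matching, so each match stops at the first closing bracket instead of running to the last one in the statement.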
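In C#, the replacement could be sketched with `Regex.Replace` as below. The class name `SqlAliasStripper` is hypothetical, and the patterns are a slight variation on the ones above: they also consume the whitespace in front of `AS` so no stray space is left before the comma, and a second pattern removes the `TOP n` clause. The sketch assumes the statements always have the format shown in the question, with aggregate aliases in single quotes; an aggregate aliased with square brackets would lose its alias too.

```csharp
using System.Text.RegularExpressions;

public static class SqlAliasStripper
{
    public static string Strip(string sql)
    {
        // Drop a "TOP <number>" clause together with the whitespace after it.
        string withoutTop = Regex.Replace(sql, @"\bTOP\s+\d+\s+", "");

        // Drop " AS [alias]" including the leading whitespace. The lazy
        // ".*?" stops at the first "]". Aliases in single quotes do not
        // match and are left untouched.
        return Regex.Replace(withoutTop, @"\s+AS\s+\[.*?\]", "");
    }
}
```

Applied to the statement in the question, `Strip` yields the target statement shown there.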

Adding a search to a linq query

I have a basic datatable, it can be completely generic for this example, except for that it contains a Username column.
I'd like to have a simple textbox and a button that performs a similarity search on the Username field. I know I can use the .Contains() method and it will translate to LIKE in sql, but is this the correct way of doing this?
Secondly, suppose that I have another item in a many to many relationship that I also want to search, in this case Label.
Data
{
ID,
Name,
...
}
Many
{
DataID,
OtherID
}
Other
{
ID,
Label
}
I'd eventually like to find all of the Data items with a Label similar to some search clause. Do I again just use .Contains?
I'd then like to sort to get the best matches for Username and Label in the same query; how can the combined likeness of {Username and Label} be sorted?
Edit: How are a LIKE query's results sorted? Is it simply based on the index, and a binary "it matches" vs. "it does not match"? I guess I'm not that concerned with a similarity score per se; I was more or less just wondering about the mechanism. It seems as though it's pretty easy to return LIKE queries, but I've always thought that LIKE was a poor choice because it doesn't use indexes in the db. Is this true, and if so, does it matter?
String similarity isn't something SQL can do well. Your best bet may be to find all the matches with the same first two (or three if necessary) characters and then, assuming this is a manageable number, compute the similarity score client-side using the Levenshtein distance or similar (see http://en.wikipedia.org/wiki/Levenshtein_distance).
Or if you are feeling brave you could try something like this! http://anastasiosyal.com/archive/2009/01/11/18.aspx
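For the client-side scoring mentioned above, a Levenshtein distance can be computed with the standard dynamic-programming recurrence. This is a minimal sketch; the class name `StringDistance` is hypothetical, and the comparison is case-sensitive and character-by-character, which may or may not suit real usernames and labels.

```csharp
using System;

public static class StringDistance
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b. Uses two rows instead of a full matrix.
    public static int Levenshtein(string a, string b)
    {
        if (a is null) throw new ArgumentNullException(nameof(a));
        if (b is null) throw new ArgumentNullException(nameof(b));

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution or match
            }
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }
}
```

Candidates fetched with a prefix filter could then be ordered by, for example, the sum of the distances for Username and Label, with the smallest sum first.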
