I'm trying to implement a tool that groups certain strings based on the lemmas of their words. During initialization I build a dictionary with an entry for each possible group, containing the list of words that would group under that key. This is what I have so far:
public Dictionary<string, HashSet<string>> Sets { get; set; }
private void Initialize(IStemmer stemmer)
{
// Stemming of keywords and groups
var keywordStems = new Dictionary<string, List<string>>();
var groupStems = new Dictionary<string, List<string>>();
foreach (string keyword in Keywords)
{
keywordStems.Add(keyword, CreateLemmas(keyword, stemmer));
foreach (string subset in CreateSubsets(keyword))
{
if (subset.Length > 1 && !groupStems.ContainsKey(subset))
{
groupStems.Add(subset, CreateLemmas(subset, stemmer));
}
}
}
// Initialize all viable sets
// This is the slow part
foreach (string gr in groupStems.Keys)
{
var grStems = groupStems[gr];
var grKeywords = new HashSet<string>((from kw in Keywords
where grStems.All(keywordStems[kw].Contains)
select kw));
if (grKeywords.Count >= Settings.MinCount)
{
Sets.Add(gr, grKeywords);
}
}
}
Is there any way I can speed up the bottleneck of this method?
The answer from mjwills is a good idea. It seems likely that this is the most expensive operation:
var grKeywords = new HashSet<string>((
from kw in Keywords
where grStems.All(keywordStems[kw].Contains)
select kw));
The suggestion is to optimize the Contains by taking advantage of the fact that the stems are a set. But if they're a set then why are we repeatedly asking for containment at all? They're a set; do set operations. The question is "what are the keywords such that every member of the grStem set is contained within the keyword's stem set". "Is every member of this set contained in that set" is the subset operation.
var grKeywords = new HashSet<string>((
from kw in Keywords
where grStems.IsSubsetOf(keywordStems[kw])
select kw));
The implementation of IsSubsetOf is optimized for common scenarios like "both operands are sets". And it takes early outs: if your group stems set is larger than the keyword stem set then you don't need to check every element; one of them is going to be missing. But your original algorithm checks every element anyway, even when you could bail early and save all that time.
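Putting the two ideas together, a sketch of the slow loop might look like this (assuming both keywordStems and groupStems are built as Dictionary<string, HashSet<string>>, as suggested further down):
foreach (var entry in groupStems)
{
    var grStems = entry.Value;
    var grKeywords = new HashSet<string>(
        from kw in Keywords
        where grStems.IsSubsetOf(keywordStems[kw]) // set operation with early outs
        select kw);
    if (grKeywords.Count >= Settings.MinCount)
        Sets.Add(entry.Key, grKeywords);
}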
And again mjwills has a good idea, to which I'll suggest some possible improvements. The idea here is to execute the query, cache the results in an array, and only later realize it as a hash set, if necessary:
foreach (var entry in groupStems)
{
var grStems = entry.Value;
var grKeywords = (WHATEVER).ToArray();
if (grKeywords.Length >= Settings.MinCount)
Sets.Add(entry.Key, new HashSet<string>(grKeywords));
}
First: I actually doubt that avoiding the unnecessary hash set construction by replacing it with unnecessary array constructions is a win. Measure it and see.
Second: ToList can be faster than ToArray because a list can be constructed before you know the size of the query result set. ToArray basically has to do a ToList first, and then copy the results into an exactly-sized array. So if ToArray is not a win, ToList might be. Or not. Measure it.
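For concreteness, the ToList variant of the loop above only swaps .ToArray() for .ToList() and .Length for .Count (WHATEVER again stands for the same query):
var grKeywords = (WHATEVER).ToList();
if (grKeywords.Count >= Settings.MinCount)
    Sets.Add(entry.Key, new HashSet<string>(grKeywords));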
Third: I note that the whole thing can be rewritten into a query should you prefer that style.
var q = from entry in groupStems
let grStems = entry.Value
let grKeywords = new HashSet<string>(WHATEVER)
where grKeywords.Count >= Settings.MinCount
select (entry.Key, grKeywords);
var result = q.ToDictionary( ... and so on ... )
That's probably not faster, but it might be easier to reason about.
One suggestion would be to change:
var keywordStems = new Dictionary<string, List<string>>();
to:
var keywordStems = new Dictionary<string, HashSet<string>>();
That should have an impact due to your later Contains call:
var grKeywords = new HashSet<string>((from kw in Keywords
where grStems.All(keywordStems[kw].Contains)
select kw));
because Contains is generally faster on a HashSet than a List.
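Note that this also means the lemma lists have to be materialized as sets when they are built; for example (assuming CreateLemmas itself still returns a List<string>):
keywordStems.Add(keyword, new HashSet<string>(CreateLemmas(keyword, stemmer)));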
Also consider changing:
foreach (string gr in groupStems.Keys)
{
var grStems = groupStems[gr];
var grKeywords = new HashSet<string>((from kw in Keywords
where grStems.All(keywordStems[kw].Contains)
select kw));
if (grKeywords.Count >= Settings.MinCount)
{
Sets.Add(gr, grKeywords);
}
}
to:
foreach (var entry in groupStems)
{
var grStems = entry.Value;
var grKeywords = (from kw in Keywords
where grStems.All(keywordStems[kw].Contains)
select kw).ToArray();
if (grKeywords.Length >= Settings.MinCount)
{
Sets.Add(entry.Key, new HashSet<string>(grKeywords));
}
}
By shifting the HashSet initialization (which is relatively expensive compared to initializing an array) into the if statement, you may improve performance if the if is entered relatively rarely (in your comments you state it is entered roughly 25% of the time).
Related
I am comparing a list of one million objects with another list of one million objects.
I am using for and foreach loops, but it takes too much time to iterate those lists.
Can anyone tell me the best way to do this?
var SourceList = new List<object>(); //one million
var TargetList = new List<object>(); // one million
//getting data from database here
//SourceList with List of one million
//TargetList with List of one million
var DifferentList = new List<object>();
//ForEach
SourceList.ToList().ForEach(m =>
{
if (!TargetList.Any(s => s.Name == m.Name))
DifferentList.Add(m);
});
//for
for (int i = 0; i < SourceList.Count; i++)
{
if (!TargetList.Any(s => s.Name == SourceList[i].Name))
DifferentList.Add(SourceList[i]);
}
It may seem like a bad idea, but IEnumerable magic will help you.
For starters, simplify your expression. It looks like this:
var result = sourceList.Where(s => !targetList.Any(t => t.Equals(s)));
I recommend implementing the comparison in an Equals method:
public class CompareObject
{
public string prop { get; set; }
public new bool Equals(object o)
{
if (o.GetType() == typeof(CompareObject))
return this.prop == ((CompareObject)o).prop;
return this.GetHashCode() == o.GetHashCode();
}
}
Next, add AsParallel. This can either speed up or slow down your program. In your case, you can add it like this:
var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
The CPU is 100% loaded if you try to materialize the whole list at once like this:
var cnt = result.Count();
But it works quite tolerably if you take the results in small portions:
result.Skip(10000).Take(10000).ToList();
Full code:
static Random random = new Random();
public class CompareObject
{
public string prop { get; private set; }
public CompareObject()
{
prop = random.Next(0, 100000).ToString();
}
public new bool Equals(object o)
{
if (o.GetType() == typeof(CompareObject))
return this.prop == ((CompareObject)o).prop;
return this.GetHashCode() == o.GetHashCode();
}
}
void Main()
{
var sourceList = new List<CompareObject>();
var targetList = new List<CompareObject>();
for (int i = 0; i < 10000000; i++)
{
sourceList.Add(new CompareObject());
targetList.Add(new CompareObject());
}
var stopWatch = new Stopwatch();
stopWatch.Start();
var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
var lr = result.Skip(10000).Take(10000).ToList();
stopWatch.Stop();
Console.WriteLine(stopWatch.Elapsed);
}
Update
I remembered that you can use a Hashtable. Choose the unique values from targetList and from sourceList, then fill the result with the values that are not in targetList.
Example:
static Random random = new Random();
public class CompareObject
{
public string prop { get; private set; }
public CompareObject()
{
prop = random.Next(0, 1000000).ToString();
}
public new int GetHashCode() {
return prop.GetHashCode();
}
}
void Main()
{
var sourceList = new List<CompareObject>();
var targetList = new List<CompareObject>();
for (int i = 0; i < 10000000; i++)
{
sourceList.Add(new CompareObject());
targetList.Add(new CompareObject());
}
var stopWatch = new Stopwatch();
stopWatch.Start();
var sourceHashtable = new Hashtable();
var targetHashtable = new Hashtable();
foreach (var element in targetList)
{
var hash = element.GetHashCode();
if (!targetHashtable.ContainsKey(hash))
targetHashtable.Add(element.GetHashCode(), element);
}
var result = new List<CompareObject>();
foreach (var element in sourceList)
{
var hash = element.GetHashCode();
if (!sourceHashtable.ContainsKey(hash))
{
sourceHashtable.Add(hash, element);
if(!targetHashtable.ContainsKey(hash)) {
result.Add(element);
}
}
}
stopWatch.Stop();
Console.WriteLine(stopWatch.Elapsed);
}
Scanning the target list to match the name is an O(n) operation, thus your loop is O(n^2). If you build a HashSet<string> of all the distinct names in the target list, you can check whether a name exists in the set in O(1) time using the Contains method.
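A sketch of that approach, assuming the lists are typed with an element type that exposes a Name property (as the question's lambdas imply):
var targetNames = new HashSet<string>(TargetList.Select(t => t.Name));
var differentList = SourceList.Where(s => !targetNames.Contains(s.Name)).ToList();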
//getting data from database here
You are getting the data out of a system that specializes in matching, sorting and filtering data, into your RAM, which by default cannot do any of that. And then you try to sort, filter and match yourself.
That will fail. No matter how hard you try, it is extremely unlikely that your computer, with a single programmer working on a matching algorithm, will outperform a specialized system called a database server at the one operation it is supposed to be really good at, one that has been built by teams of experts and optimized for years.
You don't go into a fancy restaurant and ask them to give you huge bags of raw ingredients so you can throw them into a big bowl unpeeled and microwave them at home. No. You order a nice dish because it will be way better than anything you could do yourself.
The simple answer is: Do not do that. Do not take the raw data and rummage around in it for hours. Leave that job to the database. It's the one thing it's supposed to be good at. Use its power. Write a query that will give you the result; don't get the raw data and then play database yourself.
A foreach loop involves a small amount of enumerator overhead on each iteration, so using a standard for loop will provide slightly better performance that will be hard to beat.
If it is taking too long, can you break down the collection into smaller sets and/or process them in parallel?
Also, you could look at PLINQ (Parallel LINQ), using .AsParallel().
Other areas to improve are the actual comparison logic you are using, and how the data is stored in memory; depending on your problem, you may not have to load the entire object into memory for every iteration.
Please provide a code example so that we can assist further; when such large amounts of data are involved, some performance degradation is to be expected.
Again, depending on the times we are talking about here, you could upload the data into a database and use that for the comparison rather than trying to do it natively in C#. This type of solution is better suited to data sets that are already in a database, or where the data changes much less frequently than you need to perform the comparison.
I'm new to this stuff, but what I'm trying to do is filter the log4net log I've bungled into Application_Error by some particular bits of information, such as HTTP_USER_AGENT, REQUEST TYPE, CONTENT_TYPE, and HTTP_REFERER.
The code I have so far is:
string[] vars = {"IP address", "X-Forwarded-For", "HTTP_USER_AGENT","REQUEST TYPE","CONTENT_TYPE","HTTP_REFERER"};
var param = Request.Params;
var paramEnum = param.GetEnumerator();
while (paramEnum.MoveNext())
{
foreach (var paramVar in vars.Where(paramVar => paramVar == paramEnum.Current.ToString()))
{
Log.Error("\nparam:" + paramVar);
}
}
But it looks ugly, and moreover I don't think I'm on the right track in terms of a succinct LINQ query, especially since it uses both a while and a foreach loop.
I'm new to LINQ, as I say, and ReSharper created the LINQ for me. If you feel I'm lacking in certain areas, please let me know what I'd need to further my understanding of collections/querying.
Is this the most performant way of doing things, or am I on the wrong track altogether?
It's better to iterate through vars instead:
foreach (var paramVar in vars)
{
var value = Request.Params.Get(paramVar);
if (!string.IsNullOrEmpty(value))
{
Log.Error("\nparam:" + paramVar);
}
}
or using LINQ
var query = from paramVar in vars
let value = Request.Params.Get(paramVar)
where !string.IsNullOrEmpty(value)
select paramVar;
foreach (var paramVar in query)
{
Log.Error("\nparam:" + paramVar);
}
I have two arrays that need to be mapped. In code
var result = "[placeholder2] Hello my name is [placeholder1]";
var placeholder = new[] { "[placeholder1]", "[placeholder2]", "[placeholder3]", "[placeholder4]" };
var placeholderValue = new[] { "placeholderValue1", "placeholderValue2", "placeholderValue3" };
Array.ForEach(placeholder, i => result = result.Replace(i, placeholderValue));
Given i, placeholderValue needs to be set in an intelligent way. I could implement a switch statement, but the cyclomatic complexity would be unacceptable with 30 or so elements. What is a good pattern, extension method, or other means to achieve my goal?
I skipped null checks for simplicity
string result = "[placeholder2] Hello my name is [placeholder1]";
var placeHolders = new Dictionary<string, string>() {
{ "placeholder1", "placeholderValue1" },
{ "placeholder2", "placeholderValue2" }
};
var newResult = Regex.Replace(result,#"\[(.+?)\]",m=>placeHolders[m.Groups[1].Value]);
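The dictionary lookup above throws if a placeholder key is missing; one possible variant (same regex, just a TryGetValue) leaves unknown placeholders untouched:
var newResult = Regex.Replace(result, @"\[(.+?)\]",
    m => placeHolders.TryGetValue(m.Groups[1].Value, out var value) ? value : m.Value);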
The smallest code change would be to just use a for loop, rather than a ForEach or, in your case, a ForEach taking a lambda. With a for loop you'll have the index of the appropriate value in the placeholderValue array.
The next improvement would be to make a single array of an object holding both a placeholder and its value, rather than two 'parallel' arrays that you need to keep in sync.
Even better than that, and also even simpler to implement, is to just have a Dictionary with the key being a placeholder and the value being the placeholder value. This essentially does the above suggestion for you through the use of the KeyValuePair class (so you don't need to make your own).
At that point the pseudocode becomes:
foreach(key in placeholderDictionary) replace key with placeholderDictionary[key]
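In C# that pseudocode might look roughly like this (assuming the dictionary keys are stored with the brackets included):
var placeholderDictionary = new Dictionary<string, string>
{
    { "[placeholder1]", "placeholderValue1" },
    { "[placeholder2]", "placeholderValue2" }
};
foreach (var pair in placeholderDictionary)
    result = result.Replace(pair.Key, pair.Value);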
I think you want to use Zip to combine the placeholders with their values.
var result = "[placeholder2] Hello my name is [placeholder1]";
var placeholder = new[] { "[placeholder1]", "[placeholder2]", "[placeholder3]", "[placeholder4]" };
var placeholderValue = new[] { "placeholderValue1", "placeholderValue2", "placeholderValue3", "placeholderValue4" };
var placeHolderPairs = placeholder.Zip(placeholderValue, Tuple.Create);
foreach (var pair in placeHolderPairs)
{
result = result.Replace(pair.Item1, pair.Item2);
}
Is it possible to convert two or more lists into one single list, in .NET using C#?
For example,
public static List<Product> GetAllProducts(int categoryId){ .... }
.
.
.
var productCollection1 = GetAllProducts(CategoryId1);
var productCollection2 = GetAllProducts(CategoryId2);
var productCollection3 = GetAllProducts(CategoryId3);
You can use the LINQ Concat and ToList methods:
var allProducts = productCollection1.Concat(productCollection2)
.Concat(productCollection3)
.ToList();
Note that there are more efficient ways to do this - the above will basically loop through all the entries, creating a dynamically sized buffer. As you can predict the size to start with, you don't need this dynamic sizing... so you could use:
var allProducts = new List<Product>(productCollection1.Count +
productCollection2.Count +
productCollection3.Count);
allProducts.AddRange(productCollection1);
allProducts.AddRange(productCollection2);
allProducts.AddRange(productCollection3);
(AddRange is special-cased for ICollection<T> for efficiency.)
I wouldn't take this approach unless you really have to though.
Assuming you want a list containing all of the products for the specified category-Ids, you can treat your query as a projection followed by a flattening operation. There's a LINQ operator that does that: SelectMany.
// implicitly List<Product>
var products = new[] { CategoryId1, CategoryId2, CategoryId3 }
.SelectMany(id => GetAllProducts(id))
.ToList();
In C# 4, you can shorten the SelectMany to: .SelectMany(GetAllProducts)
If you already have lists representing the products for each Id, then what you need is a concatenation, as others point out.
you can combine them using LINQ:
list = list1.Concat(list2).Concat(list3).ToList();
the more traditional approach of using List.AddRange() might be more efficient though.
List.AddRange will change (mutate) an existing list by adding additional elements:
list1.AddRange(list2); // list1 now also has list2's items appended to it.
Alternatively, in modern immutable style, you can project out a new list without changing the existing lists:
Concat, which presents a sequence of list1's items followed by list2's items:
var concatenated = list1.Concat(list2).ToList();
Union is not quite the same; it projects a distinct sequence of items:
var distinct = list1.Union(list2).ToList();
Note that for the 'value type distinct' behaviour of Union to work on reference types, you will need to define equality comparisons for your classes (or alternatively use the built-in comparers of record types).
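For example, with a record type (a hypothetical Product shape shown here), two instances with the same property values are equal, so Union de-duplicates them:
public record Product(int Id, string Name);
// list1 and list2 are List<Product>; Union compares by Id and Name, not by reference
var distinct = list1.Union(list2).ToList();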
You could use the Concat extension method:
var result = productCollection1
.Concat(productCollection2)
.Concat(productCollection3)
.ToList();
I know this is an old question, but I thought I might just add my 2 cents.
If you have a List<Something>[] you can join them using Aggregate
public List<TType> Concat<TType>(params List<TType>[] lists)
{
var result = lists.Aggregate(new List<TType>(), (x, y) => x.Concat(y).ToList());
return result;
}
Hope this helps.
list4 = list1.Concat(list2).Concat(list3).ToList();
// I would make it a little bit simpler
var products = new List<List<Product>> { item1, item2, item3 }.SelectMany(list => list).ToList();
This way it is a nested List, and .SelectMany() will flatten it into an IEnumerable of Product; then I call .ToList() on the result.
I've already commented on it, but I still think this is a valid option; just test whether one solution or the other is better in your environment. In my particular case, using source.ForEach(p => dest.Add(p)) performs better than the classic AddRange, but I've not investigated why at the low level.
You can see an example code here: https://gist.github.com/mcliment/4690433
So the option would be:
var allProducts = new List<Product>(productCollection1.Count +
productCollection2.Count +
productCollection3.Count);
productCollection1.ForEach(p => allProducts.Add(p));
productCollection2.ForEach(p => allProducts.Add(p));
productCollection3.ForEach(p => allProducts.Add(p));
Test it to see if it works for you.
Disclaimer: I'm not advocating for this solution; I find Concat the clearest one. I just stated, in my discussion with Jon, that on my machine this case performs better than AddRange, but he says, with far more knowledge than I have, that this does not make sense. There's the gist if you want to compare.
To merge or combine two lists into one list, one thing must be true: the element type of both lists must be the same.
For example, if we have a List<string>, we can add another List<string> to the existing list; otherwise we can't.
Example:
class Program
{
static void Main(string[] args)
{
List<string> CustomerList_One = new List<string>
{
"James",
"Scott",
"Mark",
"John",
"Sara",
"Mary",
"William",
"Broad",
"Ben",
"Rich",
"Hack",
"Bob"
};
List<string> CustomerList_Two = new List<string>
{
"Perter",
"Parker",
"Bond",
"been",
"Bilbo",
"Cooper"
};
// Adding all contents of CustomerList_Two to CustomerList_One.
CustomerList_One.AddRange(CustomerList_Two);
// Creating another list and copying all contents of CustomerList_One.
List<string> AllCustomers = new List<string>();
foreach (var item in CustomerList_One)
{
AllCustomers.Add(item);
}
// Removing CustomerList_One & CustomerList_Two.
CustomerList_One = null;
CustomerList_Two = null;
// CustomerList_One & CustomerList_Two -- (Garbage Collected)
GC.Collect();
Console.WriteLine("Total No. of Customers : " + AllCustomers.Count());
Console.WriteLine("-------------------------------------------------");
foreach (var customer in AllCustomers)
{
Console.WriteLine("Customer : " + customer);
}
Console.WriteLine("-------------------------------------------------");
}
}
In the special case "all elements of list1 go to a new list2" (e.g. a string list):
List<string> list2 = new List<string>(list1);
In this case, list2 is generated with all elements from list1.
You need to use the Concat operation.
When you have several lists but don't know exactly how many, use this:
listsOfProducts contains several lists filled with objects.
List<Product> productListMerged = new List<Product>();
listsOfProducts.ForEach(q => q.ForEach(e => productListMerged.Add(e)));
If you have an empty list and you want to merge it with a filled list, do not use Concat, use AddRange instead.
List<MyT> finalList = new();
List<MyT> list = new List<MyT>() { itemA, itemB, itemC }; // itemA..itemC stand in for existing MyT instances
finalList.AddRange(list);
Given a collection of records like this:
string ID1;
string ID2;
string Data1;
string Data2;
// :
string DataN
Initially Data1..N are null, and can pretty much be ignored for this question. ID1 & ID2 both uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get its corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.
Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.
For example, if Rec1 == {ID1="ABC", ID2="XYZ"}, and Rec2 = {ID1=null, ID2="XYZ"}, then these are duplicates, --- BUT we must specifically remove Rec2 and keep Rec1.
That last requirement eliminates the standard ways of removing Dups (e.g. HashSet), as they consider both sides of the "duplicate" to be interchangeable.
How about you split your original list into 3 - ones with all data, ones with ID1, and ones with just ID2?
Then do:
var unique = allData.Concat(id1Data.Except(allData))
.Concat(id2Data.Except(id1Data).Except(allData));
having defined equality just on the basis of ID2.
I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
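A rough sketch of that split, shown with Where rather than GroupBy for clarity (and assuming the record type is called Record and that "all data" means Data1..N have been filled in):
var allData = records.Where(r => r.ID1 != null && r.Data1 != null).ToList();
var id1Data = records.Where(r => r.ID1 != null && r.Data1 == null).ToList();
var id2Data = records.Where(r => r.ID1 == null).ToList();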
EDIT: Potentially nicer idea: split the data up as before, then do:
var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);
I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...
(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)
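A minimal sketch of such a comparer (Record again stands for the question's record type):
class ById2Comparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID2 == y.ID2;
    public int GetHashCode(Record obj) => obj.ID2.GetHashCode();
}
var result = new HashSet<Record>(allData, new ById2Comparer());
result.UnionWith(id1Data);
result.UnionWith(id2Data);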
This may smell quite a bit, but I think a LINQ-distinct will still work for you if you ensure the two compared objects come out to be the same. The following comparer would do this:
private class Comp : IEqualityComparer<Item>
{
public bool Equals(Item x, Item y)
{
var equalityOfB = x.ID2 == y.ID2;
if (x.ID1 == y.ID1 && equalityOfB)
return true;
if (x.ID1 == null && equalityOfB)
{
x.ID1 = y.ID1;
return true;
}
if (y.ID1 == null && equalityOfB)
{
y.ID1 = x.ID1;
return true;
}
return false;
}
public int GetHashCode(Item obj)
{
return obj.ID2.GetHashCode();
}
}
Then you could use it on a list as such...
var l = new[] {
new Item { ID1 = "a", ID2 = "b" },
new Item { ID1 = null, ID2 = "b" } };
var l2 = l.Distinct(new Comp()).ToArray();
I had a similar issue a couple of months ago.
Try something like this...
public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T:INamedObject
{
Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
List<T> finalList = new List<T>();
foreach (T currValue in sections)
{
if (!uniqueStore.ContainsKey(currValue.Name))
{
uniqueStore.Add(currValue.Name, 0);
finalList.Add(currValue);
}
}
return finalList;
}
records.GroupBy(r => r, new RecordByIDsEqualityComparer())
.Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())
or if you want to merge the records, then Aggregate instead of OrderByDescending/First.
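A sketch of the merging variant, where Merge is a hypothetical method that copies the non-null fields of one record onto the other:
var merged = records
    .GroupBy(r => r, new RecordByIDsEqualityComparer())
    .Select(g => g.Aggregate((acc, next) => Merge(acc, next)))
    .ToList();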