I compare task data from Microsoft Project using a nested for loop. Because the project has many records (more than 1000), it is very slow.
How can I improve the performance?
for (int n = 1; n < thisProject.Tasks.Count; n++)
{
string abc = thisProject.Tasks[n].Name;
string def = thisProject.Tasks[n].ResourceNames;
for (int l = thisProject.Tasks.Count; l > n; l--)
{
// MessageBox.Show(thisProject.Tasks[l].Name);
if (abc == thisProject.Tasks[l].Name && def == thisProject.Tasks[l].ResourceNames)
{
thisProject.Tasks[l].Delete();
}
}
}
As you can see, I compare the Name and ResourceNames of each Task, and when I find a duplicate I call Task.Delete to get rid of it.
A hash check should be a lot faster in this case than nested looping, i.e. O(n) vs O(n^2).
First, provide an equality comparer of your own:
class TaskComparer : IEqualityComparer<Task> {
public bool Equals(Task x, Task y) {
if (ReferenceEquals(x, y)) return true;
if (ReferenceEquals(x, null)) return false;
if (ReferenceEquals(y, null)) return false;
if (x.GetType() != y.GetType()) return false;
return string.Equals(x.Name, y.Name) && string.Equals(x.ResourceNames, y.ResourceNames);
}
public int GetHashCode(Task task) {
unchecked {
return
((task?.Name?.GetHashCode() ?? 0) * 397) ^
(task?.ResourceNames?.GetHashCode() ?? 0);
}
}
}
Don't worry too much about the GetHashCode implementation; it is just boilerplate that combines the hash codes of the two properties. The result does not have to be unique, it only has to be the same for tasks that compare equal.
Now that you have this class for comparison and hashing, you can use the code below to remove your dupes:
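var set = new HashSet<Task>(new TaskComparer());
// Task indexes in the Project object model are 1-based, so run from Count down to 1.
// Walking backwards keeps the remaining indexes valid while deleting;
// note that this keeps the last occurrence of each duplicate.
for (int i = thisProject.Tasks.Count; i >= 1; --i) {
if (!set.Add(thisProject.Tasks[i]))
thisProject.Tasks[i].Delete();
}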
As you can see, you simply scan all your elements while storing them in a HashSet. The HashSet checks, based on our equality comparer, whether the provided element is a duplicate.
Since you want to delete the dupes, each detected one is deleted. To extract the unique items instead, reverse the condition to if (set.Add(thisProject.Tasks[i])) and process the task inside that if.
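If you want to keep the first occurrence and delete the later copies, as the nested loop in the question does, one option is a forward pass that records the duplicate positions, followed by deletion in reverse order. The sketch below is only an illustration: Row, Dedup and RemoveLaterDuplicates are hypothetical names, and a plain List<Row> stands in for the Project task collection (which is 1-based and deletes through Task.Delete).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a Project task; only the two compared fields are modelled.
public sealed class Row
{
    public string Name { get; set; }
    public string ResourceNames { get; set; }
}

public static class Dedup
{
    // Removes later duplicates in place and returns how many rows were removed.
    public static int RemoveLaterDuplicates(List<Row> rows)
    {
        // A value tuple has structural Equals/GetHashCode, so no custom comparer is needed.
        var seen = new HashSet<(string, string)>();
        var duplicateIndexes = new List<int>();

        // Forward pass: the first occurrence of each key wins.
        for (int i = 0; i < rows.Count; i++)
        {
            if (!seen.Add((rows[i].Name, rows[i].ResourceNames)))
                duplicateIndexes.Add(i);
        }

        // Delete from the end so the earlier indexes stay valid.
        for (int k = duplicateIndexes.Count - 1; k >= 0; k--)
            rows.RemoveAt(duplicateIndexes[k]);

        return duplicateIndexes.Count;
    }
}
```

Against the real object model, the same two passes would presumably index from 1 and call Delete on each recorded task instead of RemoveAt.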
Microsoft Project has a Sort method which makes simple work of this problem. Sort the tasks by Name, Resource Names, and Unique ID and then loop through comparing adjacent tasks and delete duplicates. By using Unique ID as the third sort key you can be sure to delete the duplicate that was added later. Alternatively, you can use the task ID to remove tasks that are lower down in the schedule. Here's a VBA example of how to do this:
Sub RemoveDuplicateTasks()
Dim proj As Project
Set proj = ActiveProject
Application.Sort Key1:="Name", Ascending1:=True, Key2:="Resource Names", Ascending2:=True, Key3:="Unique ID", Ascending3:=True, Renumber:=False, Outline:=False
Application.SelectAll
Dim tsks As Tasks
Set tsks = Application.ActiveSelection.Tasks
Dim i As Integer
i = 1 ' task collections are 1-based
Do While i < tsks.Count
If tsks(i).Name = tsks(i + 1).Name And tsks(i).ResourceNames = tsks(i + 1).ResourceNames Then
tsks(i + 1).Delete
Else
i = i + 1
End If
Loop
Application.Sort Key1:="ID", Renumber:=False, Outline:=False
Application.SelectBeginning
End Sub
Note: This question relates to the algorithm, not syntax; VBA is easy to translate to C#.
This should give you all the items which are duplicates, so you can delete them from your original list.
thisProject.Tasks.GroupBy(x => new { x.Name, x.ResourceNames}).Where(g => g.Count() > 1).SelectMany(g => g.Select(c => c));
Note that you probably do not want to remove all of them, only the duplicate versions, so be careful how you loop through this list.
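To get only the extra copies (everything after the first member of each group), the grouping can skip the first element of every group. This is a minimal sketch, not the poster's code: DuplicateFinder and ExtraCopies are hypothetical names, and the two selector parameters are assumptions that let the example run without the Project interop types. With the real Tasks collection you would probably need Cast<Task>() first, since I believe it only exposes the non-generic IEnumerable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateFinder
{
    // Returns the second and later members of each (name, resourceNames) group,
    // i.e. exactly the items that are safe to delete.
    public static List<T> ExtraCopies<T>(
        IEnumerable<T> items,
        Func<T, string> name,
        Func<T, string> resourceNames)
    {
        return items
            .GroupBy(x => (name(x), resourceNames(x))) // value tuple as composite key
            .SelectMany(g => g.Skip(1))                // keep the first, return the rest
            .ToList();
    }
}
```

GroupBy yields the elements of each group in source order, so the copy that survives is the one that appears first.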
A LINQ way of getting distinct elements from your Tasks list:
public class Task
{
public string Name {get;set;}
public string ResourceName {get;set;}
}
public class Program
{
public static void Main()
{
List<Task> Tasks = new List<Task>();
Tasks.Add(new Task(){Name = "a",ResourceName = "ra"});
Tasks.Add(new Task(){Name = "b",ResourceName = "rb"});
Tasks.Add(new Task(){Name = "c",ResourceName = "rc"});
Tasks.Add(new Task(){Name = "a",ResourceName = "ra"});
Tasks.Add(new Task(){Name = "b",ResourceName = "rb"});
Tasks.Add(new Task(){Name = "c",ResourceName = "rc"});
Console.WriteLine("Initial List :");
foreach(var t in Tasks){
Console.WriteLine(t.Name);
}
// Here comes the interesting part
List<Task> Tasks2 = Tasks.GroupBy(x => new {x.Name, x.ResourceName})
.Select(g => g.First()).ToList();
Console.WriteLine("Final List :");
foreach(Task t in Tasks2){
Console.WriteLine(t.Name);
}
}
}
This selects the first element of each group of tasks that share the same Name and ResourceName.
I have a simple class Item:
public class Item
{
public int Start { get; set;}
public int Stop { get; set;}
}
Given a List<Item> I want to split this into multiple sublists of contiguous elements. e.g. a method
List<Item[]> GetContiguousSequences(Item[] items)
Each element of the returned list should be an array of Item such that list[i].Stop == list[i+1].Start for each pair of adjacent elements.
e.g.
{[1,10], [10,11], [11,20], [25,30], [31,40], [40,45], [45,100]}
=>
{{[1,10], [10,11], [11,20]}, {[25,30]}, {[31,40],[40,45],[45,100]}}
Here is a simple (and not guaranteed bug-free) implementation that simply walks the input data looking for discontinuities:
List<Item[]> GetContiguousSequences(Item []items)
{
var ret = new List<Item[]>();
var i1 = 0;
for(var i2=1;i2<items.Length;++i2)
{
//discontinuity
if(items[i2-1].Stop != items[i2].Start)
{
var num = i2 - i1;
ret.Add(items.Skip(i1).Take(num).ToArray());
i1 = i2;
}
}
//end of array
ret.Add(items.Skip(i1).Take(items.Length-i1).ToArray());
return ret;
}
It's not the most intuitive implementation and I wonder if there is a way to have a neater LINQ-based approach. I was looking at Take and TakeWhile thinking to find the indices where discontinuities occur but couldn't see an easy way to do this.
Is there a simple way to use IEnumerable LINQ algorithms to do this in a more descriptive (not necessarily performant) way?
I set up a simple test case here: https://dotnetfiddle.net/wrIa2J
I'm really not sure this is much better than your original, but for the sake of offering another solution, the general process is:
Use Select to project a list working out a grouping
Use GroupBy to group by the above
Use Select again to project the grouped items to an array of Item
Use ToList to project the result to a list
public static List<Item[]> GetContiguousSequences2(Item []items)
{
var currIdx = 1;
return items.Select( (item,index) => new {
item = item,
index = index == 0 || items[index-1].Stop == item.Start ? currIdx : ++currIdx
})
.GroupBy(x => x.index, x => x.item)
.Select(x => x.ToArray())
.ToList();
}
Live example: https://dotnetfiddle.net/mBfHru
Another way is to do an aggregation using Aggregate. This means maintaining a final Result list and a Curr list where you can aggregate your sequences, adding them to the Result list as you find discontinuities. This method looks a little closer to your original
public static List<Item[]> GetContiguousSequences3(Item []items)
{
var res = items.Aggregate(new {Result = new List<Item[]>(), Curr = new List<Item>()}, (agg, item) => {
if(!agg.Curr.Any() || agg.Curr.Last().Stop == item.Start) {
agg.Curr.Add(item);
} else {
agg.Result.Add(agg.Curr.ToArray());
agg.Curr.Clear();
agg.Curr.Add(item);
}
return agg;
});
res.Result.Add(res.Curr.ToArray()); // Remember to add the last group
return res.Result;
}
Live example: https://dotnetfiddle.net/HL0VyJ
You can implement ContiguousSplit as an iterator method (a coroutine of sorts): loop over source and either add the item to the current range, or return the range and start a new one.
private static IEnumerable<Item[]> ContiguousSplit(IEnumerable<Item> source) {
List<Item> current = new List<Item>();
foreach (var item in source) {
if (current.Count > 0 && current[current.Count - 1].Stop != item.Start) {
yield return current.ToArray();
current.Clear();
}
current.Add(item);
}
if (current.Count > 0)
yield return current.ToArray();
}
Then, if you want materialization:
List<Item[]> GetContiguousSequences(Item []items) => ContiguousSplit(items).ToList();
Your solution is okay. I don't think that LINQ adds any simplification or clarity in this situation. Here is a fast solution that I find intuitive:
static List<Item[]> GetContiguousSequences(Item[] items)
{
var result = new List<Item[]>();
int start = 0;
while (start < items.Length) {
int end = start + 1;
while (end < items.Length && items[end].Start == items[end - 1].Stop) {
end++;
}
int len = end - start;
var a = new Item[len];
Array.Copy(items, start, a, 0, len);
result.Add(a);
start = end;
}
return result;
}
I am working with two lists. The first contains a large sequence of strings. The second contains a smaller list of strings. I need to find where the second list exists in the first list.
I worked with enumeration, and due to the large size of the data, this is very slow, I was hoping for a faster way.
List<string> first = new List<string>() { "AAA","BBB","CCC","DDD","EEE","FFF" };
List<string> second = new List<string>() { "CCC","DDD","EEE" };
int x = SomeMagic(first,second);
I need x to equal 2.
OK, here is my variant with a good old foreach loop:
private int SomeMagic(IEnumerable<string> source, IEnumerable<string> target)
{
/* Some obvious checks for `source` and `target` length / nullity are omitted */
// searched pattern
var pattern = target.ToArray();
// candidates in form `candidate index` -> `checked length`
var candidates = new Dictionary<int, int>();
// iteration index
var index = 0;
// so, lets the magic begin
foreach (var value in source)
{
// check candidates
foreach (var candidate in candidates.Keys.ToArray()) // <- we are going to change this collection
{
var checkedLength = candidates[candidate];
if (value == pattern[checkedLength]) // <- here `checkedLength` is used in sense `nextPositionToCheck`
{
// candidate has matched the next value
checkedLength += 1;
// check if we are done here
if (checkedLength == pattern.Length) return candidate; // <- exit point
candidates[candidate] = checkedLength;
}
else
// candidate has failed
candidates.Remove(candidate);
}
// check for new candidate
if (value == pattern[0])
candidates.Add(index, 1);
index++;
}
// we did everything we could
return -1;
}
We use a dictionary of candidates to handle situations like:
var first = new List<string> { "AAA","BBB","CCC","CCC","CCC","CCC","EEE","FFF" };
var second = new List<string> { "CCC","CCC","CCC","EEE" };
If you are willing to use MoreLinq then consider using Window:
var windows = first.Window(second.Count);
var result = windows
.Select((subset, index) => new { subset, index = (int?)index })
.Where(z => Enumerable.SequenceEqual(second, z.subset))
.Select(z => z.index)
.FirstOrDefault();
Console.WriteLine(result);
Console.ReadLine();
Window will allow you to look at 'slices' of the data in chunks (based on the length of your second list). Then SequenceEqual can be used to see if the slice is equal to second. If it is, the index can be returned. If it doesn't find a match, null will be returned.
I implemented the SomeMagic method as below. It returns -1 if no match is found; otherwise it returns the index of the start element in the first list.
private int SomeMagic(List<string> first, List<string> second)
{
if (first.Count < second.Count)
{
return -1;
}
for (int i = 0; i <= first.Count - second.Count; i++)
{
List<string> partialFirst = first.GetRange(i, second.Count);
if (Enumerable.SequenceEqual(partialFirst, second))
return i;
}
return -1;
}
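Since the question is about speed on large data, here is a variant of the same sliding comparison that avoids allocating a new list with GetRange for every position. It is a sketch under assumptions: SublistSearch and IndexOfSublist are hypothetical names, strings are compared ordinally, and an empty pattern is treated as matching at index 0.

```csharp
using System;
using System.Collections.Generic;

public static class SublistSearch
{
    // Returns the zero-based start index of the first occurrence of pattern
    // inside source, or -1 when there is none.
    public static int IndexOfSublist(IReadOnlyList<string> source, IReadOnlyList<string> pattern)
    {
        if (pattern.Count == 0) return 0; // assumed convention for an empty pattern

        for (int i = 0; i <= source.Count - pattern.Count; i++)
        {
            int j = 0;
            // Compare in place and stop at the first mismatch.
            while (j < pattern.Count &&
                   string.Equals(source[i + j], pattern[j], StringComparison.Ordinal))
            {
                j++;
            }
            if (j == pattern.Count) return i;
        }
        return -1;
    }
}
```

The worst case is still proportional to the product of the two lengths, but there are no per-position allocations and most positions fail on the first comparison.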
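You can use the Intersect extension method from the System.Linq namespace. Note that it returns the elements common to both lists, not the position of the second list within the first:
var CommonList = Listfirst.Intersect(Listsecond);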
I have two large lists and I need to get the difference between them.
The first list comes from another system via a web service; the second list comes from a database (the destination of the data).
I want to find the items in the first list that are not in the second list and insert them into the database (the source of the second list).
Is there another solution with better performance?
Using List.Any(), the process runs for many hours and does not finish.
Using a for loop, the process takes 10 hours or more.
Each list has 1,300,000 records.
newItensForInsert = List1.Where(item1 => !List2.Any(item2 => item1.prop1 == item2.prop1 && item1.prop2 == item2.prop2)).ToList();
//or
for (int i = 0; i < List1.Count; i++)
{
if (!List2.Any(x => x.prop1 == List1[i].prop1 && x.prop2 == List1[i].prop2))
{
ListForInsert.Add(List1[i]);
}
}
//or
ListForInsert = List1.AsParallel().Except(List2.AsParallel(), IEqualityComparer).ToList();
You could use LINQ's Except:
List<object> webservice = new List<object>();
List<object> database = new List<object>();
IEnumerable<object> toPutIntoDatabase = webservice.Except(database);
database.AddRange(toPutIntoDatabase);
EDIT:
You can even use PLINQ (Parallel LINQ) like this:
IEnumerable<object> toPutIntoDatabase = webservice.AsParallel().Except(database.AsParallel());
EDIT:
Maybe you could use a HashSet to speed up lookups:
HashSet<object> databaseHash = new HashSet<object>(database);
foreach (var item in webservice)
{
if (databaseHash.Contains(item) == false)
{
database.Add(item);
}
}
If both lists hold the same data type, you can use List.Exists.
Otherwise, it is better to go with a join:
var newdata = from c in dblist
join p in list1 on c.Category equals p.Category into ps
from p in ps.DefaultIfEmpty()
It will select the items whose data is not present in dblist.
HashSet<T> is optimized for executing these kinds of set operations. In many cases it's worth the effort to create HashSets from Lists and do the set operation on the HashSets. I demonstrated this with a little LINQPad program.
The program creates two lists containing 1,300,000 objects. It uses your method to get the difference (or rather, it attempted to, because I ran out of patience). And it uses LINQ's Except and hashsets with ExceptWith, both with an IEqualityComparer. The program is listed below.
The result was:
Lists created: 00:00:00.9221369
Hashsets created: 00:00:00.1057532
Except: 00:00:00.2564191
ExceptWith: 00:00:00.0696830
So creating the HashSets and executing ExceptWith (together 0.18s) beats Except (0.26s).
One caveat: creating HashSets may take too much memory since the large lists already take a fair amount of memory.
void Main()
{
var sw = Stopwatch.StartNew();
var amount = 1300000;
//amount = 50000;
var list1 = Enumerable.Range(0, amount).Select(i => new Demo(i)).ToList();
var list2 = Enumerable.Range(10, amount).Select(i => new Demo(i)).ToList();
sw.Stop();
sw.Elapsed.Dump("Lists created");
sw.Restart();
var hs1 = new HashSet<Demo>(list1, new DemoComparer());
var hs2 = new HashSet<Demo>(list2, new DemoComparer());
sw.Stop();
sw.Elapsed.Dump("Hashsets created");
sw.Restart();
// var list3 = list1.Where(item1 => !list2.Any(item2 => item1.ID == item2.ID)).ToList();
// sw.Stop();
// sw.Elapsed.Dump("Any");
// sw.Restart();
var list4 = list1.Except(list2, new DemoComparer()).ToList();
sw.Stop();
sw.Elapsed.Dump("Except");
sw.Restart();
hs1.ExceptWith(hs2);
sw.Stop();
sw.Elapsed.Dump("ExceptWith");
// list3.Count.Dump();
list4.Count.Dump();
hs1.Count.Dump();
}
// Define other methods and classes here
class Demo
{
public Demo(int id)
{
ID = id;
Name = id.ToString();
}
public int ID { get; set; }
public string Name { get; set; }
}
class DemoComparer : IEqualityComparer<Demo>
{
public bool Equals(Demo x, Demo y)
{
return (x == null && y == null)
|| (x != null && y != null) && x.ID.Equals(y.ID);
}
public int GetHashCode(Demo obj)
{
return obj.ID.GetHashCode();
}
}
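For the two-property comparison in the question (prop1 and prop2), the same hash-based idea can be sketched without a comparer class by using a value tuple as the key. The names Record, Diff and MissingFromDatabase are hypothetical, and both properties are assumed to be strings, which the question does not state.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type; the question only tells us it has prop1 and prop2.
public sealed class Record
{
    public string Prop1 { get; set; }
    public string Prop2 { get; set; }
}

public static class Diff
{
    // Returns the web-service items whose (Prop1, Prop2) pair is not in the database list.
    public static List<Record> MissingFromDatabase(
        IEnumerable<Record> webservice,
        IEnumerable<Record> database)
    {
        // One pass over the database list builds the lookup.
        var existing = new HashSet<(string, string)>(
            database.Select(r => (r.Prop1, r.Prop2)));

        // One pass over the web-service list; each Contains is a hash lookup.
        return webservice
            .Where(r => !existing.Contains((r.Prop1, r.Prop2)))
            .ToList();
    }
}
```

Unlike Except, this keeps repeated items from the web-service list, so deduplicate first if the target table must not receive the same pair twice.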
Use List.Exists; it performs better than List.Any, although each call is still a linear scan of the list.
I have an array which contains fields for a data structure in the following format;
[0] = Record 1 (Name Field)
[1] = Record 1 (ID Field)
[2] = Record 1 (Other Field)
[3] = Record 2 (Name Field)
[4] = Record 2 (ID Field)
[5] = Record 2 (Other Field)
etc.
I'm processing this into a collection as follows;
for (int i = 0; i < components.Length; i = i + 3)
{
results.Add(new MyObj
{
Name = components[i],
Id = components[i + 1],
Other = components[i + 2],
});
}
This works fine, but I was wondering if there is a nice way to achieve the same output with LINQ? There's no functional requirement here, I'm just curious if it can be done or not.
I did do some experimenting with grouping by an index (after ToList()'ing the array);
var groupings = components
.GroupBy(x => components.IndexOf(x) / 3)
.Select(g => g.ToArray())
.Select(a => new
{
Name = a[0],
Id = a[1],
Other = a[2]
});
This works, but I think it's a bit overkill for what I'm trying to do. Is there a simpler way to achieve the same output as the for loop?
Looks like a perfect candidate for Josh Einstein's IEnumerable.Batch extension. It slices an enumerable into chunks of a certain size and feeds them out as an enumeration of arrays:
public static IEnumerable<T[]> Batch<T>(this IEnumerable<T> self, int batchSize)
In the case of this question, you'd do something like this:
var results =
from batch in components.Batch(3)
select new MyObj { Name = batch[0], Id = batch[1], Other = batch[2] };
Update: 2 years on and the Batch extension I linked to seems to have disappeared. Since it was considered the answer to the question, and just in case someone else finds it useful, here's my current implementation of Batch:
public static partial class EnumExts
{
/// <summary>Split sequence into blocks of specified size.</summary>
/// <typeparam name="T">Type of items in sequence</typeparam>
/// <param name="sequence"><see cref="IEnumerable{T}"/> sequence to split</param>
/// <param name="batchLength">Number of items per returned array</param>
/// <returns>Arrays of <paramref name="batchLength"/> items, with last array smaller if sequence count is not a multiple of <paramref name="batchLength"/></returns>
public static IEnumerable<T[]> Batch<T>(this IEnumerable<T> sequence, int batchLength)
{
if (sequence == null)
throw new ArgumentNullException("sequence");
if (batchLength < 2)
throw new ArgumentException("Batch length must be at least 2", "batchLength");
using (var iter = sequence.GetEnumerator())
{
var bfr = new T[batchLength];
while (true)
{
for (int i = 0; i < batchLength; i++)
{
if (!iter.MoveNext())
{
if (i == 0)
yield break;
Array.Resize(ref bfr, i);
break;
}
bfr[i] = iter.Current;
}
yield return bfr;
bfr = new T[batchLength];
}
}
}
}
This operation is deferred, single enumeration and executes in linear time. It is relatively quick compared to a few other Batch implementations I've seen, even though it is allocating a new array for each result.
Which just goes to show: you never can tell until you profile, and you should always quote the code in case it disappears.
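On .NET 6 or later, the built-in Enumerable.Chunk method does the same slicing, so a hand-written Batch may not be needed. The sketch below assumes that runtime, assumes the three MyObj properties are strings (the question does not show the class), and uses a hypothetical ComponentParser.Parse wrapper.

```csharp
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's MyObj; the real property types are not shown.
public sealed class MyObj
{
    public string Name { get; set; }
    public string Id { get; set; }
    public string Other { get; set; }
}

public static class ComponentParser
{
    public static List<MyObj> Parse(string[] components)
    {
        return components
            .Chunk(3)                  // arrays of up to 3 consecutive items
            .Where(c => c.Length == 3) // ignore an incomplete trailing record
            .Select(c => new MyObj { Name = c[0], Id = c[1], Other = c[2] })
            .ToList();
    }
}
```

Dropping the incomplete trailing chunk is a design choice made here; the original for loop would instead throw if the array length were not a multiple of three.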
I would say stick with your for loop. However, this should work with LINQ:
List<MyObj> results = components
.Select((c ,i) => new{ Component = c, Index = i })
.GroupBy(x => x.Index / 3)
.Select(g => new MyObj{
Name = g.First().Component,
Id = g.ElementAt(1).Component,
Other = g.Last().Component
})
.ToList();
Maybe an iterator could be appropriate.
Declare a custom iterator:
static IEnumerable<Tuple<int, int, int>> ToPartitions(int count)
{
for (var i = 0; i < count; i += 3)
yield return new Tuple<int, int, int>(i, i + 1, i + 2);
}
Prepare the following LINQ:
var results = from partition in ToPartitions(components.Length)
select new {Name = components[partition.Item1], Id = components[partition.Item2], Other = components[partition.Item3]};
This method may give you an idea on how to make the code more expressive.
public static IEnumerable<MyObj> AsComponents<T>(this IEnumerable<T> serialized)
where T:class
{
using (var it = serialized.GetEnumerator())
{
// returns the next item, or null once the input is exhausted
Func<T> next = () => it.MoveNext() ? it.Current : null;
// keep reading three fields at a time until the input runs out
while (true)
{
var obj = new MyObj
{
Name = next(),
Id = next(),
Other = next()
};
if (obj.Name == null)
yield break;
yield return obj;
}
}
}
As it stands, I dislike the way I detect the end of the input, but you might have domain specific information on how to do this better.
I have following two approaches. Approach 1 uses a HashSet and List. Second approach uses Sorting of Array.
Which is better in terms of processing speed
when there are many records?
when there is small number of records?
CODE
string entryValue = "A,B, a , b, ";
if (!String.IsNullOrEmpty(entryValue.Trim()))
{
//APPROACH 1
bool isUnique = true;
//Hash set is a unique set -- case sensitivity ignored
HashSet<string> uniqueRecipientsSet = new HashSet<string>(entryValue.Trim().Split(',').Select(t => t.Trim()),StringComparer.OrdinalIgnoreCase );
//List can hold duplicates
List<string> completeItems = new List<string>(entryValue.Trim().Split(',').Select(t => t.Trim()));
if (completeItems.Count != uniqueRecipientsSet.Count)
{
isUnique = false;
}
//APPROACH 2
bool isUniqueCheck2 = true;
string[] words = entryValue.Split(',');
Array.Sort(words);
for (int i = 1; i < words.Length; i++)
{
if (words[i].ToLower().Trim() == words[i - 1].ToLower().Trim())
{
isUniqueCheck2 = false;
break;
}
}
bool result1 = isUnique;
bool result2 = isUniqueCheck2;
}
You can simplify your first approach:
List<string> completeItems = new List<string>(entryValue.Trim().Split(',').Select(t => t.Trim()));
isUnique = completeItems.Count == completeItems.Distinct().Count();
This would eliminate multiple splitting, and hide the hash set behind the call of Distinct(). Note that the if statement is also unnecessary.
You could have used Stopwatch yourself. The first approach is a little bit faster:
1) 00:00:00.0460701
2) 00:00:00.0628364
Each approach was run for 10000 repetitions (just a simple way to measure the time).
The hashset approach is O(n); the sort approach is O(n log n).
However, an even quicker option would be to short-circuit the hashset approach by stopping as soon as you first see a duplicate:
HashSet<string> uniqueRecipientsSet
= new HashSet<string>(StringComparer.OrdinalIgnoreCase);
bool isUnique = true;
foreach(var item in entryValue.Split(',').Select(t => t.Trim()))
{
if (!uniqueRecipientsSet.Add(item))
{
isUnique = false;
break;
}
}
You could hide the foreach loop in LINQ:
HashSet<string> uniqueRecipientsSet
= new HashSet<string>(StringComparer.OrdinalIgnoreCase);
bool isUnique = entryValue.Split(',').Select(t => t.Trim())
.All(i => uniqueRecipientsSet.Add(i));
This is "LINQ-with-side-effects" but it does reduce the whole thing to two lines.
You could write your own AreAllDistinct extension method to avoid the side-effect-iness:
public static bool AreAllDistinct<T>(
this IEnumerable<T> source, IEqualityComparer<T> comparer)
{
HashSet<T> checker = new HashSet<T>(comparer);
foreach (var t in source)
if (!checker.Add(t))
return false;
return true;
}
bool isUnique = entryValue.Split(',').Select(t => t.Trim())
.AreAllDistinct(StringComparer.OrdinalIgnoreCase);