Comparing large lists - C#

I have two very large lists, a few hundred thousand items per list; one is complete and the other has missing items. I need to know which items are missing from the incomplete list. I've already tried using Enumerable.Except, but it takes ages until they are fully compared.

var incompleteSet = new HashSet<string>(incompleteList);
IEnumerable<string> missing = completeList.Where(str => !incompleteSet.Contains(str));
But roughly the same mechanism is used inside Enumerable.Except, so I don't think it will improve performance. Did you compile in release or debug configuration?

Based on the information you have provided, I think you should be able to get a good performance benefit by transforming your strings into an integral type before comparison.
I have written LINQ and non-LINQ versions of the implementation. The main difference is that the .ToDictionary call will be slightly slower, due to re-allocation into bigger memory slots as the dictionary grows. In the non-LINQ version we could use a HashSet, but the framework version I use (4.6.1) does not allow constructing one with a specified capacity.
// Sample String POS0001:615155172
static long GetKey(string s) => long.Parse("1" + s.Substring(3, 4) + s.Substring(8));
static IEnumerable<string> FindMissing(IEnumerable<string> masterList, ICollection<string> missingList)
{
    var missingSet = new Dictionary<long, bool>(missingList.Count);
    foreach (string s in missingList)
        missingSet.Add(GetKey(s), true);
    // Compact LINQ way, but potentially inefficient:
    // var missingSet = missingList.ToDictionary(GetKey, s => true);
    return masterList.Where(s => !missingSet.ContainsKey(GetKey(s)));
}
There are slightly more involved, single-pass ways to solve your problem, since your data is already sorted (a sketch follows below). Let me know whether this works for you, as I don't have a test bed to test this.
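For reference, here is a minimal sketch of one such single-pass approach. It assumes both lists are sorted in the same ordinal order; the method and variable names are mine, not taken from the answer above.

static IEnumerable<string> FindMissingSorted(IList<string> completeList, IList<string> incompleteList)
{
    int i = 0, j = 0;
    while (i < completeList.Count)
    {
        // Walk both sorted lists in lock step.
        int cmp = j < incompleteList.Count
            ? string.CompareOrdinal(completeList[i], incompleteList[j])
            : -1; // incomplete list exhausted: everything left is missing
        if (cmp == 0)
        {
            i++; j++;                          // present in both lists
        }
        else if (cmp > 0)
        {
            j++;                               // item only in the incomplete list; skip it
        }
        else
        {
            yield return completeList[i];      // missing from the incomplete list
            i++;
        }
    }
}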

Related

Better way to search a List for a string

I've implemented the following code to search a list of objects for a particular value:
List<customer> matchingContacts = cAllServer
.Where(o => o.customerNum.Contains(searchTerm) ||
o.personInv.lastname.Contains(searchTerm) ||
o.personDel.lastname.Contains(searchTerm))
.ToList();
Is there a quicker or cleaner way to implement this search?
Since you will have to iterate through all of the list items, it will have O(n) complexity. Performance also depends on whether you are operating on an IQueryable collection (with or without lazy loading) or on an in-memory IEnumerable collection. I'd advise checking first the properties that are most likely to contain the value you are searching for; because you are using the "or" operator, that can speed up the overall "Contains" check: you move on more quickly if you know an entity is a match after 10 ms rather than 25 ms. There is also the popular argument about which is faster, Contains or IndexOf. IndexOf should be a little bit faster, but I doubt you'll notice it unless you operate on lists with millions of elements. See: Is String.Contains() faster than String.IndexOf()?
I think this is fine as it is, but on the other hand I'd think twice about the need to convert to a list: you are already receiving an IEnumerable that lets you iterate through the items. Unless you need to move back and forth through the results or search by index, there's no need to convert it to a List.
This is a small optimization, though.
The one thing I'd suggest is to create a new "searchtext" column, prepopulated with (o.customerNum + "|" + o.personInv.lastname + "|" + o.personDel.lastname).ToUpper().
List<customer> matchingContacts = cAllServer
.Where(o => o.searchtext.Contains(searchTerm))
.ToList();
This performs one search instead of three (but on a longer string), and if you also .ToUpper() searchTerm, you can perform an effectively case-insensitive search, which might be trivially faster.
On the whole, I wouldn't expect this to be significantly faster.
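If the customer class allows it, the searchtext column could be precomputed once up front. A hedged sketch, assuming searchtext is a writable string property on customer (that property is my assumption, not part of the original code):

foreach (var o in cAllServer)
{
    // Precompute once; "|" keeps the three fields from running together.
    o.searchtext = (o.customerNum + "|" + o.personInv.lastname + "|" + o.personDel.lastname).ToUpper();
}

string upperTerm = searchTerm.ToUpper();
List<customer> matchingContacts = cAllServer
    .Where(o => o.searchtext.Contains(upperTerm))
    .ToList();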

Efficient method for checking substrings C#

I have a bunch of txt files, each containing 300k lines. Each line has a URL, e.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718
In some string[] array I have a list of web-sites
amazon.com
google.com
ieee.org
...
I need to check whether each URL contains one of these web-sites and update the counter that corresponds to that web-site.
For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K (for one file). I believe IndexOf would be slow as well.
Can someone help me with a faster approach? Thank you in advance.
A good solution would leverage hashing. My approach would be the following (a code sketch follows the steps below):
Hash all your known hosts (the string[] collection that you mention)
Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode()))
Sort the list (hashes.Sort())
When looking up a url:
Parse out the host name from the URL (get ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
Preprocess it so you always work with the same case. Use lower case (if you have http://www.IEee.COM/, take www.ieee.com).
Hash parsed host name, and look for it in the hashes list. Use BinarySearch method to find the hash.
If the hash exists, then you have this host in your list
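A minimal sketch of those steps might look like the following. The class and variable names are mine; note that Uri.Host keeps any "www." prefix, so the known hosts must be stored in the same form they will be parsed in (or the prefix stripped consistently on both sides).

using System;
using System.Collections.Generic;

static class HostLookup
{
    static readonly List<int> hashes = new List<int>();

    public static void Build(IEnumerable<string> knownHosts)
    {
        foreach (string host in knownHosts)
            hashes.Add(host.ToLowerInvariant().GetHashCode()); // hash all known hosts
        hashes.Sort();                                         // required for BinarySearch
    }

    public static bool IsKnown(string url)
    {
        string host = new Uri(url).Host.ToLowerInvariant();    // e.g. "www.ieee.com"
        return hashes.BinarySearch(host.GetHashCode()) >= 0;
    }
}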
Even faster, and more memory efficient, is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positives (it can tell you that a value is in a collection even though it's not), so it's used for optimization only. It never gives false negatives: it will not tell you that something is missing from the collection when it is actually there.
Using a Dictionary<TKey, TValue> is also an option, but if you only need to count number of occurrences, it's more efficient to maintain collection of hashes yourself.
Create a Dictionary of domain to counter.
For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment the counter.
I assume we're talking about domains since this is what you showed in your array as examples. If this can be any part of the URL instead, storing all your strings in a trie-like structure could work.
You can read this question; the answers will help you:
High performance "contains" search in list of strings in C#
Well, for a somewhat similar need (though with IndexOf), I achieved a huge performance improvement with a simple loop,
as in something like:
int l = url.Length;
int position = 0;
bool found = false;
while (position < l && !found)
{
    if (url[position] == website[0])
    {
        // test the rest of the web site from this position in another loop
        found = exactMatch(url, position, website);
    }
    position++;
}
Seems a bit wrong, but in an extreme case (searching for a set of about 10 strings in a large, 1.2 MB structured file, so regex was out) I went from 3 minutes to under 1 second.
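exactMatch is referenced but not shown above; a minimal sketch of what such a helper might look like (my guess at the logic, not the author's code):

static bool exactMatch(string url, int position, string website)
{
    // Not enough characters left in url for website to fit at this position.
    if (position + website.Length > url.Length) return false;
    for (int k = 0; k < website.Length; k++)
    {
        if (url[position + k] != website[k]) return false;
    }
    return true;
}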
Your problem as you describe it should not involve searching for substrings at all. Split your source file up into lines (or read it in line by line) which you already know will each contain a URL, and run it through some function to extract the domain name, then compare this with some fast access tally of your target domains such as a Dictionary<string, int>, incrementing as you go, e.g.:
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x=> x.Split('/')[0];
foreach(var line in source)
{
var domain = naiveDomainExtractor(line);
if(tally.ContainsKey(domain)) tally[domain]++;
}
...which takes a third of a second on my not particularly speedy machine, including generation of test data.
Admittedly your domain extractor may be a bit more sophisticated, but it probably won't be very processor intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
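For what it's worth, here is a hedged sketch of that parallel variant, reusing the source, targets and naiveDomainExtractor names from the snippet above (it assumes the per-line work is heavy enough to outweigh the parallelization overhead):

// Requires System.Collections.Concurrent and System.Threading.Tasks.
var concurrentTally = new ConcurrentDictionary<string, int>();
foreach (var target in targets)
    concurrentTally[target] = 0;

Parallel.ForEach(source, line =>
{
    var domain = naiveDomainExtractor(line);
    if (concurrentTally.ContainsKey(domain))
        concurrentTally.AddOrUpdate(domain, 1, (key, count) => count + 1);
});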
You'd have to test the performance but you might try converting the urls to the actual System.Uri object.
Store the list of websites as a HashSet<string> - then use the HashSet to look up the Uri's Host:
IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
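If you also need the per-site counters the question asks for, the same idea extends to a Dictionary keyed by host (a sketch only; as noted elsewhere, watch out for any "www." prefix in Uri.Host versus the bare domains in myUrls):

Dictionary<string, int> counters = myUrls.ToDictionary(u => u, u => 0);
foreach (Uri uri in inputUrls)
{
    int count;
    if (counters.TryGetValue(uri.Host, out count))
        counters[uri.Host] = count + 1;   // increment the matching site's counter
}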

Using ToList() on Enumerable LINQ query results for large data sets - Efficiency Issue?

I've been making a lot of use of LINQ queries in the application I'm currently writing, and one of the situations that I keep running into is having to convert the LINQ query results into lists for further processing (I have my reasons for wanting lists).
I'd like to have a better understanding of what happens in this list conversion in case there are inefficiencies, since I've used it repeatedly now. So, given that I execute a line like this:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
Questions:
Is there any overhead here aside from the creation of a new list and its population with references to the elements in the Enumerable returned from the query?
Would you consider this inefficient?
Is there a way to get the LINQ query to directly generate a list to avoid the need for a conversion in this circumstance?
Well, it creates a copy of the data. That could be inefficient - but it depends on what's going on. If you need a List<T> at the end, List<T> is usually going to be close to as efficient as you'll get. The one exception to that is if you're going to just do a conversion and the source is already a list - then using ConvertAll will be more efficient, as it can create the backing array of the right size to start with.
If you only need to stream the data - e.g. you're just going to do a foreach on it, and taking actions which don't affect the original data sources - then calling ToList is definitely a potential source of inefficiency. It will force the whole of list1 to be evaluated - and if that's a lazily-evaluated sequence (e.g. "the first 1,000,000 values from a random number generator") then that's not good. Note that as you're doing a join, list2 will be evaluated anyway as soon as you try to pull the first value from the sequence (whether that's in order to populate a list or not).
You might want to read my Edulinq post on ToList to see what's going on - at least in one possible implementation - in the background.
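As a side note on the ConvertAll point above: when the source really is a List<T> and you only need an element-by-element conversion, ConvertAll can size the result up front, whereas Select(...).ToList() starts small and grows. A small sketch (the ids list is hypothetical):

List<int> ids = new List<int> { 1, 2, 3 };

// Grows the backing array as it goes (default capacity, then doubling).
List<string> viaLinq = ids.Select(i => i.ToString()).ToList();

// Allocates the backing array at the right size immediately.
List<string> viaConvertAll = ids.ConvertAll(i => i.ToString());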
There is no other overhead except the ones you already mentioned.
I would say yes, but it depends on the concrete application scenario. By the way, in general it's better to avoid additional calls (I think this is obvious).
I'm afraid not. A LINQ query returns a sequence of data, which could potentially be an infinite sequence. Converting to List<T> makes it finite and gives you index access, which is not possible on a sequence or stream.
Suggestion: avoid situations where you need the List<T>. If you do need it, push into it only as much data as you need at the moment.
Hope this helps.
In addition to what has been said, if the initial two lists that you're joining were already quite large, creating a third (creating an "intersection" of the two) could cause out of memory errors. If you just iterate the result of the LINQ statement, you'll reduce the memory usage dramatically.
Most of the overhead happens before the list creation: the connection to the DB, getting the data into an adapter, and, for the var type, .NET deciding its data type/structure...
Efficiency is a very relative term. For a programmer who isn't strong in SQL, LINQ is efficient: development is faster (relative to old ADO), at the cost of the overheads detailed in 1.
On the other hand, LINQ can call procedures in the DB itself, which is already faster.
I suggest you run the following test:
Run your program on the maximal amount of data and measure the time.
Use some DB procedure to export the data to a file (XML, CSV, ...), then try to build your list from that file and measure the time.
Then you can see whether the difference is significant.
The second way is less efficient for the programmer, but it can reduce the run time.
Enumerable.ToList(source) is essentially just a call to new List<T>(source).
This constructor will test whether source is an ICollection<T>, and if it is allocate an array of the appropriate size. In other cases, i.e. most cases where the source is a LINQ query, it will allocate an array with the default initial capacity (four items) and grow it by doubling the capacity as needed. Each time the capacity doubles, a new array is allocated and the old one is copied over into the new one.
This may introduce some overhead in cases where your list will have a lot of items (we're probably talking thousands at least). The overhead can be significant as soon as the list grows over 85 KB, as it is then allocated on the Large Object Heap, which is not compacted and may suffer from memory fragmentation. Note that I'm referring to the array inside the list. If T is a reference type, that array contains only references, not the actual objects, so those objects don't count towards the 85 KB limit.
You could remove some of this overhead if you can accurately estimate the size of your sequence (where it is better to overestimate a little bit than it is to underestimate a little bit). For example, if you are only running a .Select() operator on something that implements ICollection<T>, you know the size of the output list.
In such cases, this extension method would reduce this overhead:
public static List<T> ToList<T>(this IEnumerable<T> source, int initialCapacity)
{
// parameter validation omitted for brevity
var result = new List<T>(initialCapacity);
foreach (T item in source)
{
result.Add(item);
}
return result;
}
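For example, assuming a List<string> called names (a hypothetical variable), the extension above could be used like this:

// names.Count is an exact size estimate here, because Select preserves the element count.
List<string> upper = names.Select(n => n.ToUpper()).ToList(names.Count);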
In some cases, the list you create is just going to replace a list that was already there, e.g. from a previous run. In those cases, you can avoid quite a few memory allocations if you reuse the old list. That would only work if you don't have concurrent access to that old list though, and I wouldn't do it if new lists will typically be significantly smaller than old lists. If that's the case, you can use this extension method:
public static void CopyToList<T>(this IEnumerable<T> source, List<T> destination)
{
// parameter validation omitted for brevity
destination.Clear();
foreach (T item in source)
{
destination.Add(item);
}
}
This being said, would I consider .ToList() being inefficient? No, if you have the memory, and you're going to use the list repeatedly, either for random indexing into it a lot, or iterating over it multiple times.
Now back to your specific example:
var matches = (from x in list1 join y in list2 on x equals y select x).ToList();
It may be more efficient to do this in some other way, for example:
var matches = list1.Intersect(list2).ToList();
which would yield the same results if list1 and list2 don't contain duplicates, and is very efficient if list2 is small.
The only way to really know though, as usual, is to measure using typical workloads.

Remove duplicates from array of struct

I can't figure out how to remove duplicate entries from an array of structs.
I have this struct:
public struct stAppInfo
{
public string sTitle;
public string sRelativePath;
public string sCmdLine;
public bool bFindInstalled;
public string sFindTitle;
public string sFindVersion;
public bool bChecked;
}
I have changed the stAppInfo struct to a class here, thanks to Jon Skeet.
The code is like this: (short version)
stAppInfo[] appInfo = new stAppInfo[listView1.Items.Count];
int i = 0;
foreach (ListViewItem item in listView1.Items)
{
appInfo[i].sTitle = item.Text;
appInfo[i].sRelativePath = item.SubItems[1].Text;
appInfo[i].sCmdLine = item.SubItems[2].Text;
appInfo[i].bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false;
appInfo[i].sFindTitle = item.SubItems[4].Text;
appInfo[i].sFindVersion = item.SubItems[5].Text;
appInfo[i].bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false;
i++;
}
I need the appInfo array to be unique on the sTitle and sRelativePath members; the other members can be duplicates.
EDIT:
Thanks to all for the answers, but this application is "portable": I mean I just need the .exe file and I don't want to add other files such as referenced *.dll files, so please no external references; this app is intended to be used from a pendrive.
All data comes from a *.ini file. What I do is (pseudocode):
ReadFile()
FillDataFromFileInAppInfoArray()
DeleteDuplicates()
FillListViewControl()
When I want to save that data into a file I have these options:
Using ListView data
Using the appInfo array (is this faster?)
Any other¿?
EDIT2:
Big thanks to: Jon Skeet, Michael Hays thanks for your time guys!!
Firstly, please don't use mutable structs. They're a bad idea in all kinds of ways.
Secondly, please don't use public fields. Fields should be an implementation detail - use properties.
Thirdly, it's not at all clear to me that this should be a struct. It looks rather large, and not particularly "a single value".
Fourthly, please follow the .NET naming conventions so your code fits in with all the rest of the code written in .NET.
Fifthly, you can't remove items from an array, as arrays are created with a fixed size... but you can create a new array with only unique elements.
LINQ to Objects will let you do that already using GroupBy as shown by Albin, but a slightly neater (in my view) approach is to use DistinctBy from MoreLINQ:
var unique = appInfo.DistinctBy(x => new { x.sTitle, x.sRelativePath })
.ToArray();
This is generally more efficient than GroupBy, and also more elegant in my view.
Personally I generally prefer using List<T> over arrays, but the above will create an array for you.
Note that with this code there can still be two items with the same title, and there can still be two items with the same relative path - there just can't be two items with the same relative path and title. If there are duplicate items, DistinctBy will always yield the first such item from the input sequence.
EDIT: Just to satisfy Michael, you don't actually need to create an array to start with, or create an array afterwards if you don't need it:
var query = listView1.Items
.Cast<ListViewItem>()
.Select(item => new stAppInfo
{
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
bFindInstalled = item.SubItems[3].Text == "Sí",
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = item.SubItems[6].Text == "Sí"
})
.DistinctBy(x => new { x.sTitle, x.sRelativePath });
That will give you an IEnumerable<stAppInfo> which is lazily streamed. Note that if you iterate over it more than once, however, it will iterate over listView1.Items the same number of times, performing the same uniqueness comparisons each time.
I prefer this approach over Michael's as it makes the "distinct by" columns very clear in semantic meaning, and removes the repetition of the code used to extract those columns from a ListViewItem. Yes, it involves building more objects, but I prefer clarity over efficiency until benchmarking has proved that the more efficient code is actually required.
What you need is a Set. It ensures that the items entered into it are unique (based on some qualifier which you will set up). Here is how it is done:
First, change your struct to a class. There is really no getting around that.
Second, provide an implementation of IEqualityComparer<stAppInfo>. It may be a hassle, but it is the thing that makes your set work (which we'll see in a moment):
public class AppInfoComparer : IEqualityComparer<stAppInfo>
{
public bool Equals(stAppInfo x, stAppInfo y) {
if (ReferenceEquals(x, y)) return true;
if (x == null || y == null) return false;
return Equals(x.sTitle, y.sTitle) && Equals(x.sRelativePath,
y.sRelativePath);
}
// this part is a pain, but this one is already written
// specifically for your question.
public int GetHashCode(stAppInfo obj) {
unchecked {
return ((obj.sTitle != null
? obj.sTitle.GetHashCode() : 0) * 397)
^ (obj.sRelativePath != null
? obj.sRelativePath.GetHashCode() : 0);
}
}
}
Then, when it is time to make your set, do this:
var appInfoSet = new HashSet<stAppInfo>(new AppInfoComparer());
foreach (ListViewItem item in listView1.Items)
{
var newItem = new stAppInfo {
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
sCmdLine = item.SubItems[2].Text,
bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false,
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false};
appInfoSet.Add(newItem);
}
appInfoSet now contains a collection of stAppInfo objects with unique Title/Path combinations, as per your requirement. If you must have an array, do this:
stAppInfo[] appInfo = appInfoSet.ToArray();
Note: I chose this implementation because it looks like the way you are already doing things. It has an easy-to-read foreach loop (though I do not need the counter variable). It does not involve LINQ (which can be troublesome if you aren't familiar with it). It requires no external libraries outside of what the .NET Framework provides. And finally, it gives you an array, just as you've asked. As for reading the data in from an INI file, hopefully you can see that the only thing that will change is your foreach loop.
Update
Hash codes can be a pain. You might have been wondering why you need to compute them at all. After all, couldn't you just compare the values of the title and relative path after each insert? Well sure, of course you could, and that's exactly how another set, called SortedSet works. SortedSet makes you implement IComparer in the same way that I implemented IEqualityComparer above.
So, in this case, AppInfoComparer would look like this:
private class AppInfoComparer : IComparer<stAppInfo>
{
// return -1 if x < y, 1 if x > y, or 0 if they are equal
public int Compare(stAppInfo x, stAppInfo y)
{
var comparison = x.sTitle.CompareTo(y.sTitle);
if (comparison != 0) return comparison;
return x.sRelativePath.CompareTo(y.sRelativePath);
}
}
And then the only other change you need to make is to use SortedSet instead of HashSet:
var appInfoSet = new SortedSet<stAppInfo>(new AppInfoComparer());
It's so much easier, in fact, that you are probably wondering: what gives? The reason most people choose HashSet over SortedSet is performance. But you should balance that against how much you actually care, since you'll be maintaining that code. I personally use a tool called ReSharper, which is available for Visual Studio, and it computes these hash functions for me, because I think computing them is a pain, too.
(I'll talk about the complexity of the two approaches, but if you already know it, or are not interested, feel free to skip it.)
SortedSet has a complexity of O(log n): each time you enter a new item, it will effectively go to the halfway point of your set and compare. If it doesn't find your entry, it will go to the halfway point between its last guess and the group to the left or right of that guess, quickly whittling down the places where your element could hide. For a million entries, this takes about 20 attempts. Not bad at all. But if you've chosen a good hashing function, then HashSet can do the same job, on average, in one comparison, which is O(1).
And before you think 20 is not really that big a deal compared to 1 (after all, computers are pretty quick), remember that you had to insert those million items, so while HashSet took about a million attempts to build that set up, SortedSet took several million attempts. But there is a price: HashSet breaks down (very badly) if you choose a poor hashing function. If the hash codes for lots of items are not unique, then they will collide in the HashSet, which will then have to try again and again. If lots of items collide with the exact same number, they will retrace each other's steps, and you will be waiting a long time: building the set takes on the order of a million times a million attempts, and HashSet has devolved into O(n^2).
What's important with those big-O notations (which is what O(1), O(log n), and O(n^2) are, in fact) is how quickly the number in parentheses grows as you increase n. Slow growth or no growth is best. Quick growth is sometimes unavoidable. For a dozen or even a hundred items the difference may be negligible, but if you can get into the habit of writing efficient functions as easily as the alternatives, then it's worth conditioning yourself to do so, since problems are cheapest to correct closest to the point where you created them.
Use LINQ2Objects, group by the things that should be unique and then select the first item in each group.
var noDupes = appInfo.GroupBy(
x => new { x.sTitle, x.sRelativePath })
.Select(g => g.First()).ToArray();
!!! An array of structs (value type) + sorting or any kind of search ==> a lot of boxing/unboxing operations.
I would suggest sticking with the recommendations of Jon and Henk: make it a class and use a generic List<T>.
Use LINQ GroupBy or DistinctBy; for me it is simpler to use the built-in GroupBy, but it is also interesting to take a look at the other, popular library, as it may give you some insights.
BTW, also take a look at the LambdaComparer; it will make your life easier each time you need this kind of in-place sorting/searching, etc. (a sketch of the idea follows).
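For reference, a minimal sketch of the LambdaComparer idea (this is not the exact API of the linked helper, just the general shape: wrapping a key-selector lambda in an IEqualityComparer<T>):

public class LambdaComparer<T, TKey> : IEqualityComparer<T>
{
    private readonly Func<T, TKey> keySelector;

    public LambdaComparer(Func<T, TKey> keySelector)
    {
        this.keySelector = keySelector;
    }

    public bool Equals(T x, T y)
    {
        // Two items are equal when their selected keys are equal.
        return EqualityComparer<TKey>.Default.Equals(keySelector(x), keySelector(y));
    }

    public int GetHashCode(T obj)
    {
        return EqualityComparer<TKey>.Default.GetHashCode(keySelector(obj));
    }
}

Deduplication would then look something like:
var unique = appInfo.Distinct(new LambdaComparer<stAppInfo, string>(a => a.sTitle + "|" + a.sRelativePath)).ToArray();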

How to check Array of strings contains a particular string?

I'm using .NET 2.0
I have a large array of string.
I want to check whether a particular string is in the array or not.
I'm not sure whether the following code is optimized or whether I need to optimize it further.
Please guide me.
string []test_arr= new string[]{"key1","key2","key3"};
Boolean testCondition = (new List<string>(test_arr)).Contains("key3");
I also want to know more about:
.NET Generics
.NET Attributes
.NET Reflections
Is there any good reference or book that someone has already used? If so, please help me out!
string []test_arr= new string[]{"key1","key2","key3"};
bool testCondition = Array.Exists
(
test_arr,
delegate(string s) { return s == "key3";}
);
If it's possible, you could sort your array (using the static Array.Sort method) and then use Array.BinarySearch.
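In code, that suggestion is roughly (note the array must be sorted with the same comparer used for the search):

Array.Sort(test_arr, StringComparer.Ordinal);
bool testCondition = Array.BinarySearch(test_arr, "key3", StringComparer.Ordinal) >= 0;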
Alternatively you need to use a more optimised data structure for your strings.
My answer is very similar to Matt Howells'.
But I suggest using a StringComparison:
Array.Exists<string>(stringsArray,
    delegate(string match)
    {
        return match.Equals("key", StringComparison.InvariantCultureIgnoreCase);
    });
In the .NET Framework version 2.0, the Array class implements the System.Collections.Generic.IList, System.Collections.Generic.ICollection, and System.Collections.Generic.IEnumerable generic interfaces.
Hence you can do the following:
string[] test_arr = new string[]{"key1","key2","key3"};
Boolean testCondition = ((IList<string>)test_arr).Contains("key3");
List is O(n), SortedList is O(log n)
About your large array of strings:
There is no optimized way as long as you use an array (you have to start at the first element and go through each one until you find it, or go through the whole array if you don't find it); this gives you a worst-case time of O(n) (O notation gives the time a program needs to accomplish something).
Since you want to optimize for search, I suggest you use a hashtable or a tree instead (depending on how large your dataset is). This will greatly reduce the time needed for the check.
In your sample the biggest overhead will probably be the creation of the List, but that may be part of the demonstration.
Starting from array, the following will probably be faster:
int x = Array.IndexOf<string>(test_arr, "key3");
bool testCondition = x >= 0;
But if you have the option, it would be more efficient to use a HashSet<string> to store them in the first place. HashSet can check the existence of an element in O(1).
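As a sketch (keeping in mind that HashSet<T> was only introduced in .NET 3.5, while the question targets .NET 2.0):

HashSet<string> keys = new HashSet<string>(test_arr);
bool testCondition = keys.Contains("key3");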
Regarding your other questions: they have already been asked on SO; use the search option, for instance with "C# books".
