Faster way to find first occurrence of String in list - c#

I have a method that finds the first occurrences of words in a list.
wordSet is the set of words that I need to check.
pwWords is a list representation of a text, so the words are in the same order as in the text.
So if pwWords has elements such as {This,is,good,boy,and,this,girl,is,bad}
and wordSet has {this,is}, the method should add true only for the first two elements.
My question is: is there a faster way to do this?
If pwWords has over a million elements and wordSet has over 10,000, it works pretty slowly.
public List<bool> getFirstOccurances(List<string> pwWords)
{
var firstOccurance = new List<bool>();
var wordSet = new List<String>(WordsWithFDictionary.Keys);
foreach (var pwWord in pwWords)
{
if (wordSet.Contains(pwWord))
{
firstOccurance.Add(true);
wordSet.Remove(pwWord);
}
else
{
firstOccurance.Add(false);
}
}
return firstOccurance;
}

Another approach is using HashSet for wordSet
public List<bool> getFirstOccurances(List<string> pwWords)
{
var wordSet = new HashSet<string>(WordsWithFDictionary.Keys);
return pwWords.Select(word => wordSet.Contains(word)).ToList();
}
HashSet.Contains is O(1) on average, whereas List.Contains loops over the items until the item is found.
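Note that Contains alone flags every occurrence of a word, while the question asks for the first occurrence only. The following is a minimal sketch of one way to get that behavior, not a definitive implementation. It relies on HashSet<T>.Remove returning true only when the item was still in the set. The class and method names (FirstOccurrenceFinder, Mark) are hypothetical, and the case-insensitive comparer is an assumption based on the question's example, where "This" is expected to match "this".

```csharp
using System;
using System.Collections.Generic;

public static class FirstOccurrenceFinder
{
    // Hypothetical helper: returns one flag per word, true only for the
    // first occurrence of a word that is present in wordKeys.
    public static List<bool> Mark(IEnumerable<string> words, IEnumerable<string> wordKeys)
    {
        // Assumption: matching is case-insensitive, so "This" matches "this".
        var remaining = new HashSet<string>(wordKeys, StringComparer.OrdinalIgnoreCase);
        var flags = new List<bool>();
        foreach (var word in words)
        {
            // Remove is O(1) on average and returns true only if the word
            // was still in the set, so later repeats are flagged false.
            flags.Add(remaining.Remove(word));
        }
        return flags;
    }
}
```

Because the set is consumed while scanning, it has to be rebuilt for each text, and this variant should not be combined with AsParallel, since HashSet<T> is not safe for concurrent modification.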
For better performance you can create wordSet only once if this is possible.
public class FirstOccurances
{
private HashSet<string> _wordSet;
public FirstOccurances(IEnumerable<string> wordKeys)
{
_wordSet = new HashSet<string>(wordKeys);
}
public List<bool> GetFor(List<string> words)
{
return words.Select(word => _wordSet.Contains(word)).ToList();
}
}
Then use it
var occurrences = new FirstOccurances(WordsWithFDictionary.Keys);
// Now you can effectively search for occurrences multiple times
var result = occurrences.GetFor(pwWords);
var anotherResult = occurrences.GetFor(anotherPwWords);
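Because each item of pwWords can be checked independently, you can try Parallel LINQ if the order of the items is not important. AsParallel() does not guarantee that the results come back in the source order unless AsOrdered() is also called.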
public List<bool> GetFor(List<string> words)
{
return words.AsParallel().Select(word => _wordSet.Contains(word)).ToList();
}

Related

Best Way to compare 1 million List of object with another 1 million List of object in c#

I am comparing a list of 1 million objects with another list of 1 million objects.
I am using for and foreach, but it takes too much time to iterate over those lists.
Can anyone help me with the best way to do this?
var SourceList = new List<object>(); //one million
var TargetList = new List<object>(); // one million
//getting data from database here
//SourceList with List of one million
//TargetList with List of one million
var DifferentList = new List<object>();
//ForEach
SourceList.ToList().ForEach(m =>
{
if (!TargetList.Any(s => s.Name == m.Name))
DifferentList.Add(m);
});
//for
for (int i = 0; i < SourceList.Count; i++)
{
if (!TargetList.Any(s => s.Name == SourceList[i].Name))
DifferentList.Add(SourceList[i]);
}
I think this is a bad idea in general, but IEnumerable magic will help you.
For starters, simplify your expression. It looks like this:
var result = sourceList.Where(s => !targetList.Any(t => t.Equals(s)));
I recommend making a comparison in the Equals method:
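public class CompareObject
{
    public string prop { get; set; }

    // Override (rather than hide with "new") so that collections and LINQ use this comparison.
    public override bool Equals(object o)
    {
        var other = o as CompareObject;
        return other != null && this.prop == other.prop;
    }

    // Objects that are equal must return the same hash code.
    public override int GetHashCode()
    {
        return prop == null ? 0 : prop.GetHashCode();
    }
}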
Next, add AsParallel. This can either speed up or slow down your program. In your case, you can add it like this:
var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
The CPU is 100% loaded if you try to enumerate everything at once, like this:
var cnt = result.Count();
But it is quite tolerable if you fetch the results in small portions:
result.Skip(10000).Take(10000).ToList();
Full code:
static Random random = new Random();
public class CompareObject
{
public string prop { get; private set; }
public CompareObject()
{
prop = random.Next(0, 100000).ToString();
}
public override bool Equals(object o)
{
var other = o as CompareObject;
return other != null && this.prop == other.prop;
}
public override int GetHashCode()
{
return prop.GetHashCode();
}
}
void Main()
{
var sourceList = new List<CompareObject>();
var targetList = new List<CompareObject>();
for (int i = 0; i < 10000000; i++)
{
sourceList.Add(new CompareObject());
targetList.Add(new CompareObject());
}
var stopWatch = new Stopwatch();
stopWatch.Start();
var result = sourceList.AsParallel().Where(s => !targetList.Any(t => t.Equals(s)));
var lr = result.Skip(10000).Take(10000).ToList();
stopWatch.Stop();
Console.WriteLine(stopWatch.Elapsed);
}
Update
I remembered that you can use a Hashtable. Choose the unique values from targetList and from sourceList, then fill the result with the source values that are not in targetList.
Example:
static Random random = new Random();
public class CompareObject
{
public string prop { get; private set; }
public CompareObject()
{
prop = random.Next(0, 1000000).ToString();
}
public new int GetHashCode() {
return prop.GetHashCode();
}
}
void Main()
{
var sourceList = new List<CompareObject>();
var targetList = new List<CompareObject>();
for (int i = 0; i < 10000000; i++)
{
sourceList.Add(new CompareObject());
targetList.Add(new CompareObject());
}
var stopWatch = new Stopwatch();
stopWatch.Start();
var sourceHashtable = new Hashtable();
var targetHashtable = new Hashtable();
foreach (var element in targetList)
{
var hash = element.GetHashCode();
if (!targetHashtable.ContainsKey(hash))
targetHashtable.Add(element.GetHashCode(), element);
}
var result = new List<CompareObject>();
foreach (var element in sourceList)
{
var hash = element.GetHashCode();
if (!sourceHashtable.ContainsKey(hash))
{
sourceHashtable.Add(hash, element);
if(!targetHashtable.ContainsKey(hash)) {
result.Add(element);
}
}
}
stopWatch.Stop();
Console.WriteLine(stopWatch.Elapsed);
}
Scanning the target list to match the name is an O(n) operation, thus your loop is O(n^2). If you build a HashSet<string> of all the distinct names in the target list, you can check whether a name exists in the set in O(1) time using the Contains method.
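The HashSet approach described above can be sketched as follows. This is only an illustration under assumptions: the Record class with a Name property is a hypothetical stand-in for whatever type is actually loaded from the database, and ListDiff.NotInTarget is a made-up helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type standing in for the rows loaded from the database.
public class Record
{
    public string Name { get; set; }
}

public static class ListDiff
{
    // Returns the source records whose Name does not occur in target.
    public static List<Record> NotInTarget(IEnumerable<Record> source, IEnumerable<Record> target)
    {
        // One pass over target builds the set of distinct names: O(n).
        var targetNames = new HashSet<string>(target.Select(t => t.Name));

        // One pass over source; each Contains call is O(1) on average,
        // so the whole comparison is O(n) instead of O(n^2).
        return source.Where(s => !targetNames.Contains(s.Name)).ToList();
    }
}
```

With two lists of one million items this performs roughly two million hash operations rather than up to a trillion name comparisons.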
//getting data from database here
You are getting the data out of a system that specializes in matching and sorting and filtering data, into your RAM that by default cannot yet do that task at all. And then you try to sort, filter and match yourself.
That will fail. No matter how hard you try, it is extremely unlikely that your computer with a single programmer working at a matching algorithm will outperform your specialized piece of hardware called a database server at the one operation this software is supposed to be really good at that was programmed by teams of experts and optimized for years.
You don't go into a fancy restaurant and ask them to give you huge bags of raw ingredients so you can throw them into a big bowl unpeeled and microwave them at home. No. You order a nice dish because it will be way better than anything you could do yourself.
The simple answer is: Do not do that. Do not take the raw data and rummage around in it for hours. Leave that job to the database. It's the one thing it's supposed to be good at. Use its power. Write a query that will give you the result; don't get the raw data and then play database yourself.
foreach goes through an enumerator, which for a List<T> checks on every iteration that the list has not been modified, so a standard for loop will provide slightly better performance, although the gain is small.
If it is taking too long, can you break down the collection into smaller sets and/or process them in parallel?
You could also look at PLINQ (Parallel LINQ) using .AsParallel().
Other areas to improve are the comparison logic you are using and how the data is stored in memory. Depending on your problem, you may not have to load the entire object into memory for every iteration.
Please provide a code example so that we can assist further. When such large amounts of data are involved, performance degradation is to be expected.
Again, depending on the amount of time we are talking about, you could upload the data into a database and use that for the comparison rather than trying to do it natively in C#. This type of solution is better suited to data sets that are already in a database, or where the data changes much less frequently than you need to perform the comparison.

How do I order this list of site URLs in C#?

I have a list of site URLs,
/node1
/node1/sub-node1
/node2
/node2/sub-node1
The list is given to me in a random order. I need to order it so that the top level comes first, followed by the sub-levels and so on (because I cannot create /node2/sub-node1 without /node2 existing). Is there a clean way to do this?
Right now I'm just making a recursive call: if I can't create sub-node1 because node2 doesn't exist, I create node2 first. I'd like the order of the list to determine the creation order so that I can get rid of the recursive call.
My first thought was ordering by length of the string... but then I thought of a list like this, that might include something like aliases for short names:
/longsitename/
/a
/a/b/c/
/a
/a/b/
/otherlongsitename/
... and I thought a better option was to order by the number of level-separator characters first:
IEnumerable<string> SortURLs(IEnumerable<string> urls)
{
return urls.OrderBy(s => s.Count(c => c == '/')).ThenBy(s => s);
}
Then I thought about it some more and I saw this line in your question:
I cannot create /node2/sub-node1 without /node2 existing
Aha! The order of sections or within a section does not really matter, as long as children are always listed after parents. With that in mind, my original thought was okay and ordering by length of the string alone should be just fine:
IEnumerable<string> SortURLs(IEnumerable<string> urls)
{
return urls.OrderBy(s => s.Length);
}
Which led me at last to wonder why I cared about the length at all. If I just sort the strings, regardless of length, strings with the same beginning will always sort the shorter string first. Thus, at last:
IEnumerable<string> SortURLs(IEnumerable<string> urls)
{
return urls.OrderBy(s => s);
}
I'll leave the first sample up because it may be useful if, at some point in the future, you need a more lexical or logical sort order.
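The reasoning above (a plain string sort puts every parent before its children) can be checked with a small sketch. This is an illustration under assumptions rather than the original author's code: it uses StringComparer.Ordinal to keep the result independent of the current culture, and the ParentsComeFirst helper is hypothetical and only there to verify the property.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlOrdering
{
    // Sorts with an ordinal (culture-independent) comparer. Under ordinal
    // comparison a string always sorts before any longer string that starts with it.
    public static List<string> SortUrls(IEnumerable<string> urls)
    {
        return urls.OrderBy(s => s, StringComparer.Ordinal).ToList();
    }

    // Hypothetical check: true if no URL appears before one of its ancestors.
    public static bool ParentsComeFirst(IList<string> ordered)
    {
        for (int i = 0; i < ordered.Count; i++)
        {
            for (int j = i + 1; j < ordered.Count; j++)
            {
                // ordered[j] is an ancestor of ordered[i] when ordered[i]
                // starts with ordered[j] followed by a slash.
                string ancestorPrefix = ordered[j].TrimEnd('/') + "/";
                if (ordered[i].StartsWith(ancestorPrefix, StringComparison.Ordinal))
                {
                    return false;
                }
            }
        }
        return true;
    }
}
```

The quadratic check is only meant for testing small lists; the sort itself is all that is needed in production code.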
Is there a clean way to do this?
Just sorting the list of URIs using a standard string sort should get you what you need. In general, "a" will order before "aa" in a string sort, so "/node1" should end up before "/node1/sub-node".
For example:
List<string> test = new List<string> { "/node1/sub-node1", "/node2/sub-node1", "/node1", "/node2" };
foreach(var uri in test.OrderBy(s => s))
Console.WriteLine(uri);
This will print:
/node1
/node1/sub-node1
/node2
/node2/sub-node1
Perhaps this works for you:
var nodes = new[] { "/node1", "/node1/sub-node1", "/node2", "/node2/sub-node1" };
var orderedNodes = nodes
.Select(n => new { Levels = n.Split('/').Length, Node = n }) // count the path segments without relying on Windows-style full paths
.OrderBy(p => p.Levels).ThenBy(p => p.Node);
Result:
foreach(var nodeInfo in orderedNodes)
{
Console.WriteLine("Path:{0} Depth:{1}", nodeInfo.Node, nodeInfo.Levels);
}
Path:/node1 Depth:2
Path:/node2 Depth:2
Path:/node1/sub-node1 Depth:3
Path:/node2/sub-node1 Depth:3
var values = new string[]{"/node1", "/node1/sub-node1" ,"/node2", "/node2/sub-node1"};
foreach(var val in values.OrderBy(e => e))
{
Console.WriteLine(val);
}
The best approach is to use natural sorting, since your strings mix text and numbers. If you use another sorting method with a list like this one:
List<string> test = new List<string> { "/node1/sub-node1" ,"/node13","/node10","/node2/sub-node1", "/node1", "/node2" };
the output will be:
/node1
/node1/sub-node1
/node10
/node13
/node2
/node2/sub-node1
which is not in natural (numeric) order.
You can look at this Implementation
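As a rough idea of what such a comparer does, here is a minimal sketch written for this page, not the implementation the answer originally linked to. It assumes that only ASCII digits count as numbers, and the class name NaturalComparer is made up for illustration. Runs of digits are compared by numeric value, everything else character by character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal natural-order comparer (illustrative sketch).
public sealed class NaturalComparer : IComparer<string>
{
    private static bool IsDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;

        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            if (IsDigit(x[i]) && IsDigit(y[j]))
            {
                // Find the end of each run of digits.
                int iEnd = i, jEnd = j;
                while (iEnd < x.Length && IsDigit(x[iEnd])) iEnd++;
                while (jEnd < y.Length && IsDigit(y[jEnd])) jEnd++;

                // Ignore leading zeros; then a longer run is a larger number,
                // and runs of equal length compare digit by digit.
                string a = x.Substring(i, iEnd - i).TrimStart('0');
                string b = y.Substring(j, jEnd - j).TrimStart('0');
                if (a.Length != b.Length) return a.Length.CompareTo(b.Length);
                int cmp = string.CompareOrdinal(a, b);
                if (cmp != 0) return cmp;

                i = iEnd;
                j = jEnd;
            }
            else
            {
                if (x[i] != y[j]) return x[i].CompareTo(y[j]);
                i++;
                j++;
            }
        }
        // Whichever string still has characters left sorts last.
        return (x.Length - i).CompareTo(y.Length - j);
    }
}
```

It can be passed to OrderBy, for example test.OrderBy(s => s, new NaturalComparer()). For the list above this yields /node1, /node1/sub-node1, /node2, /node2/sub-node1, /node10, /node13.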
If you mean you need all the first level nodes before all the second level nodes, sort by the number of slashes /:
string[] array = {"/node1","/node1/sub-node1", "/node2", "/node2/sub-node1"};
array = array.OrderBy(s => s.Count(c => c == '/')).ToArray();
foreach(string s in array)
System.Console.WriteLine(s);
Result:
/node1
/node2
/node1/sub-node1
/node2/sub-node1
If you just need parent nodes before child nodes, it doesn't get much simpler than
Array.Sort(array);
Result:
/node1
/node1/sub-node1
/node2
/node2/sub-node1
Recursion is actually exactly what you should use, since this is most easily represented by a tree structure.
public class PathNode {
public readonly string Name;
private readonly IDictionary<string, PathNode> _children;
public PathNode(string name) {
Name = name;
_children = new Dictionary<string, PathNode>(StringComparer.InvariantCultureIgnoreCase);
}
public PathNode AddChild(string name) {
PathNode child;
if (_children.TryGetValue(name, out child)) {
return child;
}
child = new PathNode(name);
_children.Add(name, child);
return child;
}
public void Traverse(Action<PathNode> action) {
action(this);
foreach (var pathNode in _children.OrderBy(kvp => kvp.Key)) {
pathNode.Value.Traverse(action);
}
}
}
Which you can then use like this:
var root = new PathNode(String.Empty);
var links = new[] { "/node1/sub-node1", "/node1", "/node2/sub-node-2", "/node2", "/node2/sub-node-1" };
foreach (var link in links) {
if (String.IsNullOrWhiteSpace(link)) {
continue;
}
var node = root;
var lastIndex = link.IndexOf("/", StringComparison.InvariantCultureIgnoreCase);
if (lastIndex < 0) {
node.AddChild(link);
continue;
}
while (lastIndex >= 0) {
lastIndex = link.IndexOf("/", lastIndex + 1, StringComparison.InvariantCultureIgnoreCase);
node = node.AddChild(lastIndex > 0
? link.Substring(0, lastIndex) // Still inside the link
: link // No more slashies
);
}
}
var orderedLinks = new List<string>();
root.Traverse(pn => orderedLinks.Add(pn.Name));
foreach (var orderedLink in orderedLinks.Where(l => !String.IsNullOrWhiteSpace(l))) {
Console.Out.WriteLine(orderedLink);
}
Which should print:
/node1
/node1/sub-node1
/node2
/node2/sub-node-1
/node2/sub-node-2

Using Contains() list method to evaluate list contents

I have a list that contains 3 items, two of type_1, and one of type_2. I want to return a second list that contains the type and number of that type that exists. When stepping through the breakpoints set at the foreach loop, the IF statement is never true. I assume there is something wrong with my attempt to use Contains() method.
The output should be something like:
type_1 2
type_2 1
Instead, it evaluates as:
type_1 1
type_1 1
type_2 1
Is my use of Contains() not correct?
public List<item_count> QueryGraphListingsNewAccountReport()
{
List<item> result = new List<item>();
var type_item1 = new item { account_type = "Type_1" };
var type_item2 = new item { account_type = "Type_1" };
var type_item3 = new item { account_type = "Type_2" };
result.Add(type_item1);
result.Add(type_item2);
result.Add(type_item3);
//Create a empty list that will hold the account_type AND a count of how many of that type exists:
List<item_count> result_count = new List<item_count>();
foreach (var item in result)
{
if (result_count.Contains(new item_count { account_type = item.account_type, count = 1 } ) == true)
{
var result_item = result_count.Find(x => x.account_type == item.account_type);
result_item.count += 1;
result_count.Add(result_item);
}
else
{
var result_item = new item_count { account_type = item.account_type, count = 1 };
result_count.Add(result_item);
}
}
return result_count;
}
public class item
{
public string account_type { get; set; }
}
public class item_count
{
public int count {get; set;}
public string account_type { get; set; }
}
I think your problem is that you don't want to use contains at all. You are creating a new object in your contains statement and, obviously, it isn't contained in your list already because you only just created it. The comparison is comparing references, not values.
Why not just use the find statement that you do in the next line instead? If it returns null, then you know there isn't an item already with that type.
So you could do something like this:
var result_item = result_count.Find(x => x.account_type == item.account_type);
if (result_item != null)
{
result_item.count++;
// note here you don't need to add it back to the list!
}
else
{
// create your new result_item here and add it to your list.
}
Note: Find is O(n), so this might not scale well if you have a really large set of types. In that case, you might be better off with Saeed's suggestion of grouping.
You can do:
myList.GroupBy(x => x.type).Select(g => new { g.Key, Count = g.Count() });
If you want to use a loop, it's better to use the LINQ Count function to achieve this. If you want to use Contains, you should override Equals (and GetHashCode) on item_count so that two instances are compared the way you intend.
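Applied to the classes from the question, the grouping suggestion might look like the sketch below. The item and item_count classes are copied from the question; the AccountTypeCounter.CountByType helper is a hypothetical name introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class item
{
    public string account_type { get; set; }
}

public class item_count
{
    public int count { get; set; }
    public string account_type { get; set; }
}

public static class AccountTypeCounter
{
    // Hypothetical helper: returns one item_count per distinct account_type.
    public static List<item_count> CountByType(IEnumerable<item> items)
    {
        return items
            .GroupBy(x => x.account_type)   // one group per distinct type
            .Select(g => new item_count { account_type = g.Key, count = g.Count() })
            .ToList();
    }
}
```

GroupBy makes a single pass over the input, so no Contains or Find call is needed at all.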

Enumerated int List<GradeRange>

I really have no clue about enumerated lists, but after some research I found that such a list may help solve my problem. I have a string in my settings called strGrades, which holds a set of ranges that I update manually. The ranges are 0155-0160, 0271-0388, 0455-0503, 0588-687. What I basically want to do is find the values that are not in this grade list (for example 0161, 0389, 0504-0587...).
So I came up with a function that will allow me to get each match in the grade range:
public static List<GradeRange> GetValidGrades()
{
MatchCollection matches= Regex.Matches(Settings.Default.productRange,
Settings.Default.srGradeRange);
List<GradeRange> ranges = new List<GradeRange>();
if (matches.Count > 0)
{
foreach (Match match in matches)
{
ranges.Add(new GradeRange() {
Start= int.Parse(match.Groups["Start"].Value),
Stop= int.Parse(match.Groups["Stop"].Value)
});
}
}
return ranges;
}
Here is the grade range class:
public class GradeRange
{
public int Start { get; set; }
public int Stop { get; set; }
}
So the function above captures my Start and Stop values. Can anyone please help me get this into a list where I can find the values that fall outside of the range values? I just need a starting point. Thanks so much!
You could use a custom extension method that creates .Between along with a Where
var myFilteredList = list.Where(x=>!myValue.Between(x.Start, x.Stop, true));
This isn't the most performant answer, but if you need a list of all the numbers that are not between certain ranges, then you could do something like this:
var missingNumbers = new List<int>();
var minStop = list.Min(x => x.Stop);
var maxStart = list.Max(x => x.Start);
// Enumerable.Range takes (start, count), not (start, end).
foreach (var n in Enumerable.Range(minStop, Math.Max(0, maxStart - minStop + 1)))
{
    // Keep n only if none of the ranges contains it.
    if (!list.Any(r => n.Between(r.Start, r.Stop, true)))
        missingNumbers.Add(n);
}
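Neither snippet shows the Between extension itself, so here is a self-contained sketch of how the pieces could fit together. The signature of Between (value, low, high, inclusive) is an assumption inferred from how it is called above, MissingNumbers is a hypothetical helper name, and the range list is assumed to be non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class GradeRange
{
    public int Start { get; set; }
    public int Stop { get; set; }
}

public static class GradeRangeExtensions
{
    // Assumed shape of the custom Between extension method.
    public static bool Between(this int value, int low, int high, bool inclusive)
    {
        return inclusive
            ? value >= low && value <= high
            : value > low && value < high;
    }

    // Hypothetical helper: numbers between the smallest Start and the
    // largest Stop that are not covered by any range.
    public static List<int> MissingNumbers(this IList<GradeRange> ranges)
    {
        int first = ranges.Min(r => r.Start);
        int last = ranges.Max(r => r.Stop);

        // Enumerable.Range takes (start, count), not (start, end).
        return Enumerable.Range(first, last - first + 1)
            .Where(n => !ranges.Any(r => n.Between(r.Start, r.Stop, true)))
            .ToList();
    }
}
```

For the ranges in the question (155-160, 271-388, 455-503, 588-687) this returns 161 through 270, 389 through 454, and 504 through 587.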
Here, this should get you started:
var strings = "0155-0160, 0271-0388, 0455-0503, 0588-687";
var splitStrings = strings.Split(',');
var ranges = new List<GradeRange>();
foreach (var item in splitStrings)
{
    var splitAgain = item.Split('-');
    var range = new GradeRange
    {
        // int.Parse ignores the leading space left over after splitting on ','.
        Start = int.Parse(splitAgain[0]),
        Stop = int.Parse(splitAgain[1])
    };
    ranges.Add(range);
}

Finding differences in two lists

I am thinking about a good way to find the differences between two lists.
Here is the problem:
Both lists contain strings in which the first 3 '*'-delimited fields form the unique key, followed by the text (String="key1*key2*key3*text").
here is the string example:
AA1*1D*4*The quick brown fox*****CC*3456321234543~
where "AA1*1D*4" is the unique key
List1: "index1*index2*index3", "index2*index2*index3", "index3*index2*index3"
List2: "index2*index2*index3", "index1*index2*index3", "index3*index2*index3", "index4*index2*index3"
I need to match indexes in both lists and compare them.
If all 3 indexes from 1 list match 3 indexes from another list, I need to track both string entries in the new list
If there is a set of indexes in one list that don't appear in another, I need to track one side and keep an empty entry in another side. (#4 in the example above)
return the list
This is what I did so far, but I am kind of struggling here:
List<String> Base = baseListCopy.Except(resultListCopy, StringComparer.InvariantCultureIgnoreCase).ToList(); //Keep unique values(keep differences in lists)
List<String> Result = resultListCopy.Except(baseListCopy, StringComparer.InvariantCultureIgnoreCase).ToList(); //Keep unique values (keep differences in lists)
List<String[]> blocksComparison = new List<String[]>(); //container for non-matching blocks, so we can output them later
//if both reports have same amount of blocks
if ((Result.Count > 0 || Base.Count > 0) && (Result.Count == Base.Count))
{
foreach (String S in Result)
{
String[] sArr = S.Split('*');
foreach (String B in Base)
{
String[] bArr = B.Split('*');
if (sArr[0].Equals(bArr[0]) && sArr[1].Equals(bArr[1]) && sArr[2].Equals(bArr[2]) && sArr[3].Equals(bArr[3]))
{
String[] NA = new String[2]; //keep results
NA[0] = B; //[0] for base
NA[1] = S; //[1] for result
blocksComparison.Add(NA);
break;
}
}
}
}
could you suggest a good algorithm for this process?
Thank you
You can use a HashSet.
Create a HashSet for List1. Remember that index1*index2*index3 is different from index3*index2*index1.
Now iterate through the second list.
var hashset = new HashSet<string>(list1);
var matches = new List<string>();
foreach (string s in list2)
{
    if (hashset.Contains(s))
        matches.Add(s); // Add it to the new list.
}
If I understand your question correctly, you'd like to be able to compare the elements by their "key" prefix, instead of by the whole string content. If so, implementing a custom equality comparer will allow you to easily leverage the LINQ set algorithms.
This program...
class EqCmp : IEqualityComparer<string> {
public bool Equals(string x, string y) {
return GetKey(x).SequenceEqual(GetKey(y));
}
public int GetHashCode(string obj) {
// Using Sum could cause OverflowException.
return GetKey(obj).Aggregate(0, (sum, subkey) => sum + subkey.GetHashCode());
}
static IEnumerable<string> GetKey(string line) {
// If we just split to 3 strings, the last one could exceed the key, so we split to 4.
// This is not the most efficient way, but is simple.
return line.Split(new[] { '*' }, 4).Take(3);
}
}
class Program {
static void Main(string[] args) {
var l1 = new List<string> {
"index1*index1*index1*some text",
"index1*index1*index2*some text ** test test test",
"index1*index2*index1*some text",
"index1*index2*index2*some text",
"index2*index1*index1*some text"
};
var l2 = new List<string> {
"index1*index1*index2*some text ** test test test",
"index2*index1*index1*some text",
"index2*index1*index2*some text"
};
var eq = new EqCmp();
Console.WriteLine("Elements that are both in l1 and l2:");
foreach (var line in l1.Intersect(l2, eq))
Console.WriteLine(line);
Console.WriteLine("\nElements that are in l1 but not in l2:");
foreach (var line in l1.Except(l2, eq))
Console.WriteLine(line);
// Etc...
}
}
...prints the following result:
Elements that are both in l1 and l2:
index1*index1*index2*some text ** test test test
index2*index1*index1*some text
Elements that are in l1 but not in l2:
index1*index1*index1*some text
index1*index2*index1*some text
index1*index2*index2*some text
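The question also asks for the matched lines to be tracked side by side, with an empty entry where one list has no line for a key. Intersect and Except do not produce that shape directly, so here is a minimal sketch of one way to build it with two dictionaries. KeyedPairing and Pair are hypothetical names, and the sketch assumes that each key occurs at most once per list (ToDictionary throws otherwise).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyedPairing
{
    // The key is the first three '*'-delimited fields of a line.
    private static string GetKey(string line)
    {
        return string.Join("*", line.Split(new[] { '*' }, 4).Take(3));
    }

    // Hypothetical helper: returns { baseLine, resultLine } pairs. A side
    // that has no line for a key is left as an empty string.
    // Assumption: keys are unique within each list.
    public static List<string[]> Pair(IEnumerable<string> baseLines, IEnumerable<string> resultLines)
    {
        var baseByKey = baseLines.ToDictionary(line => GetKey(line));
        var resultByKey = resultLines.ToDictionary(line => GetKey(line));

        var pairs = new List<string[]>();
        foreach (var key in baseByKey.Keys.Union(resultByKey.Keys))
        {
            string b, r;
            baseByKey.TryGetValue(key, out b);   // b stays null if the key is missing
            resultByKey.TryGetValue(key, out r); // r stays null if the key is missing
            pairs.Add(new[] { b ?? string.Empty, r ?? string.Empty });
        }
        return pairs;
    }
}
```

Each lookup is O(1) on average, so the pairing is linear in the total number of lines instead of the nested loops used in the question.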
import java.util.*;

List<String> one = new ArrayList<>();
List<String> two = new ArrayList<>();
List<String> three = new ArrayList<>();
Map<String, Integer> intersect = new HashMap<>();
// Count every entry of the first list.
for (String index : one)
{
    intersect.merge(index, 1, Integer::sum);
}
// Keep the entries of the second list that also occur in the first.
for (String index : two)
{
    if (intersect.containsKey(index))
    {
        three.add(index);
    }
}
