I have a collection of strings in C#. My code looks like this:
string[] lines = System.IO.File.ReadAllLines(@"d:\SampleFile.txt");
What I want to do is find the maximum string length in that collection and store it in a variable. Currently, I code this manually, like this:
int nMaxLengthOfString = 0;
for (int i = 0; i < lines.Length; i++)
{
    if (lines[i].Length > nMaxLengthOfString)
    {
        nMaxLengthOfString = lines[i].Length;
    }
}
The code above does the work for me, but I am looking for some built-in function, in order to maintain efficiency, because there will be thousands of lines in my file :(
A simpler way with LINQ would be:
int maxLength = lines.Max(x => x.Length);
Note that if you're using .NET 4, you don't need to read all the lines into an array first, if you don't need them later:
// Note call to ReadLines rather than ReadAllLines.
int maxLength = File.ReadLines(filename).Max(x => x.Length);
(If you're not using .NET 4, it's easy to write the equivalent of File.ReadLines.)
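For example, a minimal sketch of such an equivalent (the method name ReadLinesLazy is made up; it needs System.IO and System.Collections.Generic):
// Rough pre-.NET 4 stand-in for File.ReadLines (illustrative only).
static IEnumerable<string> ReadLinesLazy(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;   // stream one line at a time instead of buffering them all
        }
    }
}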
The streaming approach will be more efficient in terms of memory, but fundamentally you will have to read every line from disk, and you will need to iterate over those lines to find the maximum length. The disk access is likely to be the bottleneck, of course.
The efficiency will certainly be no worse than your explicit loop in your case, and it may well be better.
But if you're looking to be succinct, try a lambda with LINQ's Aggregate:
int maxLength = lines.Aggregate(0, (max, line) => Math.Max(max, line.Length));
By the way, a minor point: you can technically stop reading once the amount of data left in the file is smaller than the longest line you've found so far, since no remaining line could beat it. So you can technically save some steps, although it's probably not worth the code.
Completely irrelevant, but just because I feel like it, here's the (elegant!) Scheme version:
(reduce max 0 (map string-length lines))
Related
I have a bunch of txt files that contain 300k lines. Each line has a URL, e.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718
In a string[] array I have a list of websites:
amazon.com
google.com
ieee.org
...
I need to check whether each URL contains one of these websites and update the counter that corresponds to that website.
For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K comparisons (for one file). I believe IndexOf will be slow as well.
Can someone help me with faster approach? Thank you in advance
A good solution would leverage hashing. My approach would be the following (a rough sketch follows these steps):
Hash all your known hosts (the string[] collection that you mention).
Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode())).
Sort the list (hashes.Sort()).
When looking up a URL:
Parse the host name out of the URL (get ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
Preprocess it so you always work with the same case. Use lower case (if you have http://www.IEee.COM/ take www.ieee.com).
Hash the parsed host name and look for it in the hashes list, using the BinarySearch method to find the hash.
If the hash exists, then you have this host in your list.
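A rough sketch of those steps (the class and method names are mine, not from the original answer):
// Illustrative only: hash-based host lookup following the steps above.
class HostHashLookup
{
    private readonly List<int> hashes = new List<int>();

    public HostHashLookup(IEnumerable<string> knownHosts)
    {
        foreach (string host in knownHosts)
            hashes.Add(host.ToLowerInvariant().GetHashCode());
        hashes.Sort();                                        // enables BinarySearch below
    }

    public bool Contains(string url)
    {
        string host = new Uri(url).Host.ToLowerInvariant();   // parse host + normalise case
        // Note: different strings can share a hash code, so a hit here is
        // very likely, but not absolutely guaranteed, to be a real match.
        return hashes.BinarySearch(host.GetHashCode()) >= 0;
    }
}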
An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positives (it can tell you that a value is in a collection even though it's not), so it should only be used as an optimization. It will never give a false negative, i.e. it will never claim something is missing from the collection when it is actually there.
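To make that behaviour concrete, here is a deliberately tiny, illustrative Bloom filter of my own (this is not the CodePlex implementation; the size and hash choices are arbitrary):
using System.Collections;   // for BitArray

class TinyBloomFilter
{
    private readonly BitArray bits;
    private readonly int size;

    public TinyBloomFilter(int size)
    {
        this.size = size;
        this.bits = new BitArray(size);
    }

    // Two cheap hash functions; real implementations use more (and better) ones.
    private int Hash1(string s)
    {
        return (s.GetHashCode() & 0x7FFFFFFF) % size;
    }

    private int Hash2(string s)
    {
        unchecked
        {
            uint h = 2166136261;                       // FNV-1a offset basis
            foreach (char c in s) h = (h ^ c) * 16777619;
            return (int)(h % (uint)size);
        }
    }

    public void Add(string s)
    {
        bits[Hash1(s)] = true;
        bits[Hash2(s)] = true;
    }

    // May return true for a value that was never added (false positive),
    // but never returns false for a value that was added.
    public bool MightContain(string s)
    {
        return bits[Hash1(s)] && bits[Hash2(s)];
    }
}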
Using a Dictionary<TKey, TValue> is also an option, but if you only need to count the number of occurrences, it's more efficient to maintain the collection of hashes yourself.
Create a Dictionary of domain to counter.
For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment the counter.
I assume we're talking about domains, since this is what you showed in your array as examples. If it can be any part of the URL instead, storing all your strings in a trie-like structure could work.
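A minimal sketch of the dictionary-counting idea (the file path and the www-stripping are my assumptions, and the domain extraction here just uses System.Uri):
// Hypothetical sketch: tally URL occurrences per known domain.
string[] websites = { "amazon.com", "google.com", "ieee.org" };
var counts = websites.ToDictionary(w => w, w => 0);      // domain -> counter

foreach (string url in File.ReadLines(@"c:\urls.txt"))   // path is made up
{
    string host = new Uri(url).Host;                     // e.g. "www.ieee.org"
    // Strip a leading "www." so the host matches the bare domain entries.
    string domain = host.StartsWith("www.") ? host.Substring(4) : host;
    if (counts.ContainsKey(domain))
        counts[domain]++;
}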
You can read this question; the answers there should help you:
High performance "contains" search in list of strings in C#
Well, in a somewhat similar situation (though with IndexOf), I achieved a huge performance improvement with a simple loop, as in something like:
int l = url.Length;
int position = 0;
while (position < l)
{
    // Only run the full comparison when the first character matches.
    if (url[position] == website[0])
    {
        // Test the rest of the website name from this position in another loop.
        if (exactMatch(url, position, website))
        {
            // match found - handle it here
        }
    }
    position++;
}
It seems a bit crude, but in an extreme case (searching for a set of about 10 strings in a large, structured 1.2 MB file, so regex was out) I went from 3 minutes to under 1 second.
Your problem as you describe it should not involve searching for substrings at all. Split your source file into lines (or read it in line by line); you already know each line will contain a URL. Run each line through some function to extract the domain name, then compare that against a fast-access tally of your target domains, such as a Dictionary<string, int>, incrementing as you go, e.g.:
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x => x.Split('/')[0];

foreach (var line in source)
{
    var domain = naiveDomainExtractor(line);
    if (tally.ContainsKey(domain)) tally[domain]++;
}
...which takes a third of a second on my not particularly speedy machine, including generation of test data.
Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor-intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
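A hedged sketch of that parallel variant, reusing the names from the snippet above (requires System.Collections.Concurrent and System.Threading.Tasks; untested):
// Parallel tally using ConcurrentDictionary and Parallel.ForEach.
var parallelTally = new ConcurrentDictionary<string, int>(
    targets.Select(t => new KeyValuePair<string, int>(t, 0)));

Parallel.ForEach(source, line =>
{
    var domain = naiveDomainExtractor(line);
    if (parallelTally.ContainsKey(domain))
        parallelTally.AddOrUpdate(domain, 1, (key, count) => count + 1);
});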
You'd have to test the performance, but you might try converting the URLs to actual System.Uri objects.
Store the list of websites as a HashSet<string> - then use the HashSet to look up the Uri's Host:
IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
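If you also need the per-site counters the question asks about, one possible follow-up (my addition, not part of the original answer) is to group the matches by host:
// Turn the matching URLs into a count per site.
Dictionary<string, int> countsPerSite = matches
    .GroupBy(u => u.Host)
    .ToDictionary(g => g.Key, g => g.Count());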
Say I have a large byte[] and I'm not only looking to see if, but also where, a smaller byte[] is in the larger array. For example:
byte[] large = new byte[100];
for (byte i = 0; i < 100; i++) {
large[i] = i;
}
byte[] small = new byte[] { 23, 24, 25 };
int loc = large.IndexOf(small); // this is what I want to write
I guess I'm asking about looking for a sequence of any type (primitive or otherwise) within a larger sequence.
I faintly remember reading about a specific approach to this in strings, but I don't remember the name of the algorithm. I could easily write some way to do this, but I know there's a good solution and it's on the tip of my tongue. If there's some .Net method that does this, I'll take that too (although I'd still appreciate the name of the searching algorithm for education's sake).
You can do it with LINQ, like this:
var res = Enumerable.Range(0, Math.Max(0, large.Length - small.Length + 1))
    .Cast<int?>()
    .FirstOrDefault(n => large.Skip(n.Value).Take(small.Length).SequenceEqual(small));
if (res != null) {
    Console.WriteLine("Found at {0}", res.Value);
} else {
    Console.WriteLine("Not found");
}
The approach is self-explanatory except for the Cast<int?> part: you need it to distinguish between finding the result at the very start of the large array (when zero is returned) and not finding it at all (when the return is null).
The complexity of the above is O(M*N), where M and N are the lengths of the large and small arrays. If the large array is very long and contains a significant number of "almost right" sub-sequences that match long prefixes of small, you may be better off implementing an advanced sequence-searching algorithm such as Knuth–Morris–Pratt (KMP). The KMP algorithm speeds up the search by observing that, when a mismatch occurs, the small sequence itself contains enough information to tell you how far ahead you can move in the large sequence, based on where in the small sequence the first mismatch occurred. A look-up table is prepared for the small sequence, and that table is then used throughout the search to decide how to advance the search point. The complexity of KMP is O(N+M). See the Wikipedia article on KMP for pseudocode of the algorithm.
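For reference, here is a compact KMP sketch over byte arrays (my own illustrative version of the standard algorithm, not taken from any particular library):
// Returns the first index of 'needle' within 'haystack', or -1 if absent.
static int KmpIndexOf(byte[] haystack, byte[] needle)
{
    if (needle.Length == 0) return 0;

    // Failure table: fail[i] = length of the longest proper prefix of
    // needle[0..i] that is also a suffix of it.
    int[] fail = new int[needle.Length];
    int k = 0;
    for (int i = 1; i < needle.Length; i++)
    {
        while (k > 0 && needle[k] != needle[i]) k = fail[k - 1];
        if (needle[k] == needle[i]) k++;
        fail[i] = k;
    }

    // Scan the haystack, never moving the scan position backwards.
    int matched = 0;
    for (int i = 0; i < haystack.Length; i++)
    {
        while (matched > 0 && needle[matched] != haystack[i]) matched = fail[matched - 1];
        if (needle[matched] == haystack[i]) matched++;
        if (matched == needle.Length) return i - needle.Length + 1;
    }
    return -1;
}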
Are you thinking of lambda expressions? That is what came to my mind when you mentioned a specific approach with strings.
http://www.dotnetperls.com/array-find
I'm trying to find multiple ways to solve Project Euler's problem #13. I've already solved it two different ways, but this time I want my solution to read from a text file that contains all of the numbers, convert them, and add up the column of digits farthest to the right. I also want to solve it in such a way that if we add new numbers to our list, the list can contain any number of rows or columns, so its length is not predefined (non-array? I'm not sure a jagged array would apply here, since it can't be predefined).
So far I've got:
static void Main(string[] args)
{
    List<int> sum = new List<int>();
    string bigIntFile = @"C:\Users\Justin\Desktop\BigNumbers.txt";
    string result;
    StreamReader streamReader = new StreamReader(bigIntFile);
    while ((result = streamReader.ReadLine()) != null)
    {
        for (int i = 0; i < result.Length; i++)
        {
            int converted = Convert.ToInt32(result.Substring(i, 1));
            sum.Add(converted);
        }
    }
}
which reads the file and converts each character of each line into a single int. I'm trying to work out how I can store those ints in a collection that behaves like a 2D array, but the collection needs to be versatile enough to store any number of rows/columns. Any ideas on how to store these digits other than just a basic list? Is there maybe a way I can set up a list so it's like a 2D array whose size is not predefined? Thanks in advance!
UPDATE: Also, I don't want to use BigInteger. That'd be a little too easy: read each line, convert the string to a BigInteger, store it in a BigInteger list, and then sum up all the values from there.
There is no resizable 2D collection built into the .NET framework. I'd just go with the "jagged arrays" type of data structure, just with lists:
List<List<int>>
You can also vary this pattern by using an array for each row:
List<int[]>
If you want to read the file a little more simply, here is how:
List<int[]> numbers =
    File.ReadLines(path)
        .Select(lineStr => lineStr.Select(@char => @char - '0').ToArray())
        .ToList();
Far less code. You can reuse a lot of built-in stuff to do basic data transformations. That gives you less code to write and to maintain. It is more extensible and it is less prone to bugs.
If you want to select a column from this structure, do it like this:
int colIndex = ...;
int[] column = numbers.Select(row => row[colIndex]).ToArray();
You can encapsulate this line into a helper method to remove noise from your main addition algorithm.
Note that all of these patterns are less efficient than a 2D array, but in your case they are good enough.
In this case you can simply use a 2D array, since you actually do know its dimensions in advance: 100 x 50.
If for some reason you want to solve a more general problem, you may indeed use a list of lists, List<List<int>>.
Having said that, I wonder: are you actually trying to sum up all the numbers? If so, I would suggest another approach: consider which part of the 50-digit numbers actually influences the first digits of their sum. Hint: you don't need the entire number.
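A heavily hedged sketch of that hint, which does give part of the puzzle away: 100 numbers below 10^50 sum to less than 10^52, so the leading 10 digits of the total are, barring an extraordinarily unlucky carry chain, determined by the first 15 or so digits of each addend. Something like:
// Sum only a prefix of each line; 15 digits per number fits comfortably in a long.
long partialSum = 0;
foreach (string line in File.ReadLines(@"C:\Users\Justin\Desktop\BigNumbers.txt"))
{
    partialSum += long.Parse(line.Substring(0, 15));
}
string firstTenDigits = partialSum.ToString().Substring(0, 10);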
I'm trying to parse a large text string. I need to split the original string into blocks of 15 characters (and a block might contain white space, so the Trim function is used). I'm using two strings, the original and a temporary one. The temp string is used to store each 15-character block.
I wonder whether I could run into a performance issue because strings are immutable. This is the code:
string original = "THIS IS SUPPOSE TO BE A LONG STRING AN I NEED TO SPLIT IT IN BLOCKS OF 15 CHARACTERS.SO";
string temp = string.Empty;
while (original.Length != 0)
{
    int take = Math.Min(15, original.Length);   // guard against a short final block
    temp = original.Substring(0, take).Trim();
    original = original.Substring(take).Trim();
}
I'd appreciate your feedback on a better way to achieve this functionality.
You'll get slightly better performance like this (but whether the performance gain will be significant is another matter entirely):
for (var startIndex = 0; startIndex < original.Length; startIndex += 15)
{
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
}
This performs better because you're not copying the last all-but-15-characters of the original string with each loop iteration.
EDIT
To advance the index to the next non-whitespace character, you can do something like this:
for (var startIndex = 0; startIndex < original.Length; )
{
    // Skip over whitespace so each block starts on a non-whitespace character.
    if (char.IsWhiteSpace(original, startIndex))
    {
        startIndex++;
        continue;
    }
    temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
    startIndex += 15;
}
I think you are right about the immutability issue - recreating original each time is probably not the fastest way.
How about passing 'original' into a StringReader class?
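A small sketch of that idea, assuming blocks of 15 characters (the variable names are mine):
// Read fixed-size blocks through a StringReader instead of re-creating the string.
using (var reader = new StringReader(original))
{
    char[] buffer = new char[15];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        string temp = new string(buffer, 0, read).Trim();
        // ... use temp ...
    }
}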
If your original string is longer than a few thousand characters, you'll see noticeable (>0.1 s) processing time and a lot of GC pressure. The first Substring call is fine, and I don't think you can avoid it unless you go deep inside System.String and mess around with m_FirstChar. The second Substring can be avoided completely by walking the string char-by-char with an int index.
In general, if you ran this on bigger data, such code might be problematic; it of course depends on your needs.
It might also be a good idea to use the StringBuilder class, which lets you operate on strings in a "more mutable" way without a performance hit, for example removing from its beginning without reallocating the whole string.
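A hedged sketch of the StringBuilder idea for the 15-character blocks (whether it actually beats plain Substring indexing is something you would want to measure):
// Consume 15 characters at a time from the front of a StringBuilder.
var sb = new StringBuilder(original);
while (sb.Length > 0)
{
    int take = Math.Min(15, sb.Length);
    string temp = sb.ToString(0, take).Trim();
    sb.Remove(0, take);                  // drop the consumed block
    // ... use temp ...
}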
In your example, however, I would consider throwing out the line that takes a substring from original and substituting it with code that updates an index pointing to where the next substring should start. The while condition would then just check whether that index has reached the end of the string, and your temp assignment would take the substring not from 0 to 14 but from i, where i is that index.
However - don't optimize code if you don't have to. I'm assuming here that you need more performance and are willing to spend some time and/or write slightly less understandable code in exchange for more efficiency.
What's the fastest way to parse strings in C#?
Currently I'm just using string indexing (string[index]) and the code runs reasonably well, but I can't help thinking that the continuous range checking the indexer does must be adding something.
So, I'm wondering what techniques I should consider to give it a boost. These are my initial thoughts/questions:
Use methods like string.IndexOf() and IndexOfAny() to find characters of interest. Are these faster than manually scanning the string with string[index]?
Use regexes. Personally, I don't like regexes as I find them difficult to maintain, but are they likely to be faster than manually scanning the string?
Use unsafe code and pointers. This would eliminate the index range checking, but I've read that unsafe code won't run in untrusted environments. What exactly are the implications of this? Does it mean the whole assembly won't load/run, or will only the code marked unsafe refuse to run? The library could potentially be used in a number of environments, so being able to fall back to a slower but more compatible mode would be nice.
What else might I consider?
NB: I should say, the strings I'm parsing could be reasonably large (say 30k) and are in a custom format for which there is no standard .NET parser. Also, the performance of this code is not super critical, so this is partly just a theoretical question out of curiosity.
30k is not what I would consider to be large. Before getting excited, I would profile. The indexer should be fine for the best balance of flexibility and safety.
For example, to create a 128k string (and a separate array of the same size), fill it with junk (including the time to handle Random) and sum all the character code-points via the indexer takes... 3ms:
var watch = Stopwatch.StartNew();
char[] chars = new char[128 * 1024];
Random rand = new Random();   // fill with junk
for (int i = 0; i < chars.Length; i++)
    chars[i] = (char) ((int) 'a' + rand.Next(26));

int sum = 0;
string s = new string(chars);
int len = s.Length;
for (int i = 0; i < len; i++)
{
    sum += (int) s[i];        // sum code points via the string indexer
}
watch.Stop();
Console.WriteLine(sum);
Console.WriteLine(watch.ElapsedMilliseconds + "ms");
Console.ReadLine();
For files that are actually large, a reader approach should be used - StreamReader etc.
"Parsing" is quite an inexact term. Since you talks of 30k, it seems that you might be dealing with some sort of structured string which can be covered by creating a parser using a parser generator tool.
A nice tool to create, maintain and understand the whole process is the GOLD Parsing System by Devin Cook: http://www.devincook.com/goldparser/
This can help you create code which is efficient and correct for many textual parsing needs.
As for your points:
IndexOf/IndexOfAny is usually not useful for parsing that goes further than splitting a string.
Regexes are better suited when there are no recursions or overly complex rules.
Unsafe code is basically a no-go if you haven't really identified this as a serious problem. The JIT can take care of doing the range checks only when needed, and indeed for simple loops (the typical for loop) this is handled pretty well.