I am trying to build an efficient algorithm that can process thousands of rows of data containing customer zip codes. I then want to cross-check those zip codes against a grouping of around 1000 zip codes, but I have about 100 columns of 1000 zip codes each. Many of these zip codes are consecutive numbers, but there are also a lot of random zip codes thrown in. So what I would like to do is group consecutive zip codes together so that I can just check whether a zip code falls within a range, instead of checking it against every single zip code.
Sample data -
90001
90002
90003
90004
90005
90006
90007
90008
90009
90010
90012
90022
90031
90032
90033
90034
90041
This should be grouped as follows:
{ 90001-90010, 90012, 90022, 90031-90034, 90041 }
Here's my idea for the algorithm:
public struct gRange {
    public int start, end;
    public gRange(int a, int? b) {
        start = a;
        end = b ?? a; // int can't be null, so take a nullable parameter
    }
}

public List<gRange> groupZips(string[] zips) {
    List<gRange> zipList = new List<gRange>();
    int currZip, prevZip, startRange, endRange;
    startRange = 0;
    bool inRange = false;
    for (int i = 1; i < zips.Length; i++) {
        currZip = Convert.ToInt32(zips[i]);
        prevZip = Convert.ToInt32(zips[i - 1]);
        if (currZip - prevZip == 1 && inRange == false) {
            inRange = true;
            startRange = prevZip;
            continue;
        }
        else if (currZip - prevZip == 1 && inRange == true) continue;
        else if (currZip - prevZip != 1 && inRange == true) {
            inRange = false;
            endRange = prevZip;
            zipList.Add(new gRange(startRange, endRange));
            continue;
        }
        else if (currZip - prevZip != 1 && inRange == false) {
            zipList.Add(new gRange(prevZip, prevZip));
        }
        // not sure how to handle the last case, when i == zips.Length - 1
    }
    return zipList;
}
So as of now, I am unsure of how to handle the last case, and looking at this algorithm, it doesn't strike me as efficient. Is there a better/easier way to group a set of numbers like this?
Here is an O(n) solution that works even if your zip codes are not guaranteed to be in order.
If you need the output groupings to be sorted, you can't do any better than O(n log n), because somewhere you'll have to sort something. But if grouping the zip codes is your only concern and sorting the groups isn't required, then I'd use an algorithm like this. It makes good use of a HashSet, a Dictionary, and a doubly linked list (LinkedList<T>). To my knowledge this algorithm is O(n), because I believe that HashSet<T>.Add() and HashSet<T>.Contains() run in constant time.
Here is a working dotnetfiddle
// I'm assuming zip codes are ints... convert if desired.
// I jumbled up your sample data to show that the code still works.
var zipcodes = new List<int>
{
    90012, 90033, 90009, 90001, 90005, 90004, 90041, 90008, 90007,
    90031, 90010, 90002, 90003, 90034, 90032, 90006, 90022,
};

// facilitates constant-time lookups of whether a zip code is in the set
var zipHashSet = new HashSet<int>();

// maps zip code -> linked list node, to remove an item from the linked list in constant time
var nodeDictionary = new Dictionary<int, LinkedListNode<int>>();

// linked list for iterating over and grouping the zip codes in linear time
var zipLinkedList = new LinkedList<int>();

// initialize our data structures from the initial list
foreach (int zipcode in zipcodes)
{
    zipLinkedList.AddLast(zipcode);
    zipHashSet.Add(zipcode);
    nodeDictionary[zipcode] = zipLinkedList.Last;
}

// object to store the groupings (ex: "90001-90010", "90022")
var groupings = new HashSet<string>();

// iterate through the linked list, but skip nodes if we grouped them with a
// zip code that we found on a previous iteration of the loop
var node = zipLinkedList.First;
while (node != null)
{
    var bottomZipCode = node.Value;
    var topZipCode = bottomZipCode;

    // find the lowest zip code in this group
    while (zipHashSet.Contains(bottomZipCode - 1))
    {
        // delete the node from the linked list (constant time, since we hold
        // the node reference) so we don't observe any node more than once
        zipLinkedList.Remove(nodeDictionary[bottomZipCode - 1]);
        // see if the previous zip code is in our group, too
        bottomZipCode--;
    }
    // string version of the bottom of the range
    var bottom = bottomZipCode.ToString();

    // find the highest zip code in this group
    while (zipHashSet.Contains(topZipCode + 1))
    {
        // same constant-time deletion as above
        zipLinkedList.Remove(nodeDictionary[topZipCode + 1]);
        // see if the next zip code is in our group, too
        topZipCode++;
    }
    // string version of the top of the range
    var top = topZipCode.ToString();

    // add the grouping in the correct format
    if (top == bottom)
    {
        groupings.Add(bottom);
    }
    else
    {
        groupings.Add(bottom + "-" + top);
    }

    // onward!
    node = node.Next;
}

// print results
foreach (var grouping in groupings)
{
    Console.WriteLine(grouping);
}
** LinkedList<T>.Remove(node) is O(1) when you already hold the node reference, which is why the dictionary maps zip codes to their nodes.
If Sorting is Required
An O(n log n) algorithm is much simpler, because once you sort your input list the groups can be formed in one iteration of the list with no additional data structures.
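For example, here is a minimal sketch of that one-pass grouping (my own code, assuming the zip codes have already been parsed to ints, sorted ascending, and deduplicated; it reuses the gRange struct from the question):
var groups = new List<gRange>();
int start = zips[0], prev = zips[0];
for (int i = 1; i < zips.Count; i++)
{
    if (zips[i] != prev + 1) // gap found: close out the current run
    {
        groups.Add(new gRange(start, prev));
        start = zips[i];
    }
    prev = zips[i];
}
groups.Add(new gRange(start, prev)); // always close the final run (the "last case")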
I believe you are overthinking this one. Just using LINQ against an IEnumerable can search 80,000+ records in less than a tenth of a second.
I used the free CSV zip code list from here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv
using System;
using System.IO;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Linq;
using System.Text;

namespace ZipCodeSearchTest
{
    struct zipCodeEntry
    {
        public string ZipCode { get; set; }
        public string City { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<zipCodeEntry> zipCodes = new List<zipCodeEntry>();
            string dataFileName = "free-zipcode-database.csv";
            using (FileStream fs = new FileStream(dataFileName, FileMode.Open, FileAccess.Read))
            using (StreamReader sr = new StreamReader(fs))
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine();
                    string[] lineVals = line.Split(',');
                    zipCodes.Add(new zipCodeEntry { ZipCode = lineVals[1].Trim(' ', '\"'), City = lineVals[3].Trim(' ', '\"') });
                }
            bool terminate = false;
            while (!terminate)
            {
                Console.WriteLine("Enter zip code:");
                var userEntry = Console.ReadLine();
                if (userEntry.ToLower() == "x" || userEntry.ToLower() == "q")
                    terminate = true;
                else
                {
                    DateTime dtStart = DateTime.Now;
                    foreach (var arrayVal in zipCodes.Where(z => z.ZipCode == userEntry.PadLeft(5, '0')))
                        Console.WriteLine(string.Format("ZipCode: {0}", arrayVal.ZipCode).PadRight(20, ' ') + string.Format("City: {0}", arrayVal.City));
                    DateTime dtStop = DateTime.Now;
                    Console.WriteLine();
                    Console.WriteLine("Lookup time: {0}", dtStop.Subtract(dtStart).ToString());
                    Console.WriteLine("\n\n");
                }
            }
        }
    }
}
In this particular case, it is quite possible that a hash will be faster. However, the range-based solution will use a lot less memory, so it would be appropriate if your lists were very large (and I'm not convinced that there are enough possible zipcodes for any list of zipcodes to be large enough.)
Anyway, here's a simpler logic for making the range list and finding if a target is in a range:
Make ranges a simple list of integers (or even zipcodes), and push the first element of zip as its first element.
For each element of zip except the last one, if that element plus one is not the same as the next element, add both that element plus one and the next element to ranges.
Push one more than the last element of zip onto the end of ranges.
Now, to find out if a zipcode is in ranges, do a binary search into ranges for the smallest element which is greater than the target zipcode. [Note 1] If the index of that element is odd, then the target is in one of the ranges; otherwise it isn't. (See the sketch after the notes.)
Notes:
1. AIUI, the BinarySearch method on a C# list returns the index of the element found, or the complement of the index of the first larger element. To get the result needed by the suggested algorithm, you could use something like index >= 0 ? index + 1 : ~index, but it might be simpler to just search for the zipcode one less than the target and then use the complement of the low-order bit of the result.
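Here is a minimal sketch of that approach (BuildBoundaries and InRanges are my own names; it assumes the input list is already sorted ascending with no duplicates):
// The boundary list holds range starts at even indices and one-past-range-ends
// at odd indices, e.g. { 90001, 90011, 90012, 90013, ... } for the sample data.
static List<int> BuildBoundaries(List<int> sortedZips)
{
    var boundaries = new List<int> { sortedZips[0] };
    for (int i = 0; i < sortedZips.Count - 1; i++)
    {
        if (sortedZips[i] + 1 != sortedZips[i + 1])
        {
            boundaries.Add(sortedZips[i] + 1); // one past the end of this run
            boundaries.Add(sortedZips[i + 1]); // start of the next run
        }
    }
    boundaries.Add(sortedZips[sortedZips.Count - 1] + 1);
    return boundaries;
}

static bool InRanges(List<int> boundaries, int zip)
{
    int index = boundaries.BinarySearch(zip);
    // index of the smallest element strictly greater than zip
    int upper = index >= 0 ? index + 1 : ~index;
    return upper % 2 == 1; // an odd index means zip falls inside a run
}
For the sample data, a lookup of 90005 finds 90011 at index 1 (odd, so it's in a range), while a lookup of 90011 finds 90012 at index 2 (even, so it isn't).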
I'm reading a text file that contains continents, countries, capitals, and the population of each country. I then input a value; let's say I input "Birmanie". The StreamReader instance then takes the info from the NEXT line, which would be "Bolivie". "Pays" is the input variable. My goal is to read the line of the country that the user inputs and then later extract the info from that line.
Here's my code.
while (!srRecherche.EndOfStream)
{
    lireLigneRechercher = srRecherche.ReadLine();
    if (lireLigneRechercher.IndexOf(Pays, StringComparison.CurrentCultureIgnoreCase) >= 0)
    {
        for (int i = 1; i <= 35; i++)
        {
            lireCharacteres += (char)srRecherche.Read();
        }
        for (int i = 1; i <= 74; i++)
        {
            srRecherche.Read();
        }
    }
}
The for loops are there so that I can skip through the rest of the information and only read the country's name.
Here is an example that might help. The numbers (index positions on the lines) are completely made up, but hopefully you'll see where I'm going. Instead of trying to read with a reader, I would read all the lines and put them into a collection of some sort. I used a struct and a HashSet below, but you could use a class and a List, or a SortedSet, or a Collection, or many other options. By transforming each line into a struct/class, you gain the liberty of doing any sort of analysis or manipulation you want without having to back-track. Since your data seems relatively fixed, you can also gain some advantage by storing the entire list in memory and then finding (in your cached list) what the user asks for with a Where() or FirstOrDefault() instead of reading the file with each new input.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        string userInput = "Birmanie";
        var lines = File.ReadAllLines("c:\\myfile.txt");
        HashSet<FileLine> fileLines = new();
        foreach (var line in lines)
        {
            var fileLine = new FileLine()
            {
                Country = line.Substring(0, 25).Trim(),
                City = line.Substring(35, 20).Trim(),
                Population = Convert.ToInt32(line.Substring(55, 20).Trim()),
                Continent = line.Substring(75).Trim()
            };
            fileLines.Add(fileLine);
        }
        int pop = fileLines.FirstOrDefault(l => l.Country == userInput).Population;
    }
}

struct FileLine
{
    public string Country;
    public string City;
    public int Population;
    public string Continent;
}
I am using HashSet, LINQ Intersect() and Count() to find the intersection of two lists of strings.
Code being used
private HashSet<string> Words { get; }

public Sentence(IEnumerable<string> words)
{
    Words = words.ToHashSet();
}

public int GetSameWordCount(Sentence sentence)
{
    return Words.Intersect(sentence.Words).Count();
}
The GetSameWordCount method is taking more than 90% of the program's runtime, as there are millions of Sentences to compare with each other.
Is there any faster way to do this?
I am using .net core 3.1.1 / C# 8 so any recent features can be used.
More info:
Input data is coming from a text file (e.g. a book excerpt, articles from the web).
Sentences are then unaccented, lowercased, and split into words by a whitespace regex.
Short words (length < 3) are ignored.
I am creating groups of sentences which have N words in common and ordering these groups by the number of shared words.
The code below uses the HashSet<T>.Contains method directly, which is more performant; the time complexity of HashSet<T>.Contains is O(1).
public int GetSameWordCount(Sentence sentence)
{
    int count = 0;
    foreach (var word in sentence.Words)
    {
        if (Words.Contains(word))
            count++;
    }
    return count;
}
Note: if the lists of words are sorted, you can use the approach below.
var enumerator1 = set1.GetEnumerator();
var enumerator2 = set2.GetEnumerator();
var count = 0;
if (enumerator1.MoveNext() && enumerator2.MoveNext())
{
    while (true)
    {
        var value = enumerator1.Current.CompareTo(enumerator2.Current);
        if (value == 0)
        {
            count++;
            if (!enumerator1.MoveNext() || !enumerator2.MoveNext())
                break;
        }
        else if (value < 0)
        {
            if (!enumerator1.MoveNext())
                break;
        }
        else
        {
            if (!enumerator2.MoveNext())
                break;
        }
    }
}
I'm looking for different solutions, including ones where using the .NET libraries is forbidden and ones where I can take full advantage of them.
Here is the problem: I have two text files, textFile1 and textFile2. Each of them contains sorted integer numbers (this is the most important condition), like those displayed below:
textFile1    textFile2
0            1
2            3
4            5
I need to create 3rd text file, for example textFile3 by merging those two files, and expected result should be :
textFile3
0
1
2
3
4
5
My first idea was to copy those two text files line by line into two separate arrays and then use the solution for merging two sorted arrays into a new one, provided in this question. After that, I would copy the members of the new array into textFile3, line by line.
Do you have any suggestions? Maybe a better approach? Please write all of your ideas here; each of them will be helpful to me.
Merging two files is a fairly simple modification to merging two arrays. The idea is to replace the array index increment with a read of the next line of the file. For example, the standard merge algorithm that I show in my blog (http://blog.mischel.com/2014/10/24/merging-sorted-sequences/) is:
while (not end of List A and not end of List B)
    if (List A current item <= List B current item)
        output List A current item
        advance List A index
    else
        output List B current item
        advance List B index

// At this point, one of the lists is empty.
// Output remaining items from the other.
while (not end of List A)
    output List A current item
    advance List A index

while (not end of List B)
    output List B current item
    advance List B index
To make that merge files, you start by opening and reading the first line of each file. It gets kind of screwy, though, because you have to check for end of file. "Get the next line" is a bit ... odd.
int item1;
int item2;
bool eof1 = false;
bool eof2 = false;
string temp;

var file1 = File.OpenText(textFile1);
temp = file1.ReadLine();
if (temp == null)
    eof1 = true;
else
    item1 = int.Parse(temp);
// do the same thing for file2
Then we can do the standard merge:
while (!eof1 && !eof2)
{
    if (item1 <= item2)
    {
        outputFile.WriteLine(item1);
        // get next item from file1
        temp = file1.ReadLine();
        if (temp == null)
            eof1 = true;
        else
            item1 = int.Parse(temp);
    }
    else
    {
        // output item2 and get next line from file2
    }
}
// and the cleanup
while (!eof1)
{
    // output item1, and get next line from file1
}
while (!eof2)
{
    // output item2, and get next line from file2
}
The only thing different is that getting the next item is more involved than just incrementing an array index.
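For what it's worth, the repeated read-and-parse logic can be pulled into a small helper; this is just a sketch (TryReadInt is a name I made up):
static bool TryReadInt(StreamReader reader, out int value)
{
    // Reads the next line and parses it as an int.
    // Returns false at end of file (or on a malformed line).
    string line = reader.ReadLine();
    value = 0;
    return line != null && int.TryParse(line, out value);
}
With it, each "get next item" step collapses to eof1 = !TryReadInt(file1, out item1);.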
Merging two ordered sequences can easily be generalized and implemented as extension method like this:
public static class Algorithms
{
    public static IEnumerable<T> MergeOrdered<T>(this IEnumerable<T> seq1, IEnumerable<T> seq2, IComparer<T> comparer = null)
    {
        if (comparer == null) comparer = Comparer<T>.Default;
        using (var e1 = seq1.GetEnumerator())
        using (var e2 = seq2.GetEnumerator())
        {
            bool more1 = e1.MoveNext(), more2 = e2.MoveNext();
            while (more1 && more2)
            {
                // On ties, take from seq1 first and only advance that side,
                // so equal elements from both sequences are all emitted.
                if (comparer.Compare(e1.Current, e2.Current) <= 0)
                {
                    yield return e1.Current;
                    more1 = e1.MoveNext();
                }
                else
                {
                    yield return e2.Current;
                    more2 = e2.MoveNext();
                }
            }
            for (; more1; more1 = e1.MoveNext())
                yield return e1.Current;
            for (; more2; more2 = e2.MoveNext())
                yield return e2.Current;
        }
    }
}
Then the concrete task can be accomplished simply with:
static void Merge(string inputFile1, string inputFile2, string outputFile)
{
    Func<string, IEnumerable<KeyValuePair<int, string>>> readLines = file =>
        File.ReadLines(file).Select(line =>
            new KeyValuePair<int, string>(int.Parse(line), line));
    var inputLines1 = readLines(inputFile1);
    var inputLines2 = readLines(inputFile2);
    var comparer = Comparer<KeyValuePair<int, string>>.Create(
        (a, b) => a.Key.CompareTo(b.Key));
    var outputLines = inputLines1.MergeOrdered(inputLines2, comparer)
        .Select(item => item.Value);
    File.WriteAllLines(outputFile, outputLines);
}
They are both sorted lists, so to avoid memory consumption, open a reader to both files. Read a line from each, compare, write the lesser value, and advance on that side. In other words, treat the current line of each file as a pointer and keep comparing and advancing from the lesser side until completion. This ensures a small memory footprint that performs well even for large files.
You can pinch an algorithm off the web; here is one, and another that even mentions O(1). Ignore the fact that they talk about arrays; your files are effectively sorted arrays, so you don't need to duplicate them in memory.
Given a data text file which looks like
21,7,11
20,10,12
17,7,18
These represent height, temperature and carbon percentage.
I have read in the file as a .txt file using System.IO. Is this correct? From here, how would I calculate the maximum temperature?
{
    string s;
    System.IO.StreamReader inputFile = new System.IO.StreamReader(DataFile);
    s = inputFile.ReadLine();
    int noDataLines = int.Parse(s);
}
You need to read all the lines and compare each value to find the max temperature.
Something like the code below (untested!) should work. There are a lot of assumptions in this code and you may have to change it to suit your case.
{
    string s;
    int maxValue = -1, temp = -1;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(DataFile))
    {
        while (reader.Peek() >= 0)
        {
            s = reader.ReadLine();
            // the temperature is the second comma-separated value
            if (int.TryParse(s.Split(',')[1], out temp))
            {
                if (temp > maxValue)
                    maxValue = temp;
            }
        }
    }
}
You will most likely want to create a two-dimensional list or array, and in this example I am using a list.
{
    List<List<int>> intList = new List<List<int>>(); // This creates a two-dimensional list.
    System.IO.StreamReader inputFile = new System.IO.StreamReader(DataFile);
    string line = inputFile.ReadLine();
    while (line != null) // Iterate over the lines in the document.
    {
        intList.Add( // Adding a new row to the list.
            line.Split(',').Select(int.Parse).ToList()
            // This separates the line by commas and turns it into a list of integers.
        );
        line = inputFile.ReadLine(); // Move to the next row.
    }
}
I will admit that this is certainly not a very concise method of doing it, but it is relatively straightforward.
To access it, do this:
int element = intList[1][2]; // Accessing 2nd row, 3rd column.
Example
If I had a text file with these lines:
The cat meowed.
The dog barked.
The cat ran up a tree.
I would want to end up with a matrix of rows and columns like this:
0 1 2 3 4 5 6 7 8 9
0| t-h-e- -c-a-t- -m-e-o-w-e-d-.- - - - - - - -
1| t-h-e- -d-o-g- -b-a-r-k-e-d-.- - - - - - - -
2| t-h-e- -c-a-t- -r-a-n- -u-p- -a- -t-r-e-e-.-
Then I would like to query this matrix to quickly determine information about the text file itself. For example, I would quickly be able to tell if everything in column "0" is a "t" (it is).
I realize that this might seem like a strange thing to do. I am trying to ultimately (among other things) determine if various text files are fixed-width delimited without any prior knowledge about the file. I also want to use this matrix to detect patterns.
The actual files that will go through this are quite large.
Thanks!
For example, I would quickly be able to tell if everything in column "0" is a "t" (it is).
int column = 0;
char charToCheck = 't';
bool b = File.ReadLines(filename)
             .All(s => (s.Length > column ? s[column] : '\0') == charToCheck);
What you can do is read the first line of your text file and use it as a mask. Compare every subsequent line to the mask and remove every character from the mask that is not the same as the character at the same position. After processing all lines, you'll have a list of delimiters.
Btw, the code is not very clean, but I think it is a good starting point.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace DynamicallyDetectFixedWithDelimiter
{
    class Program
    {
        static void Main(string[] args)
        {
            var sr = new StreamReader(@"C:\Temp\test.txt");

            // Get initial list of delimiters
            char[] firstLine = sr.ReadLine().ToCharArray();
            Dictionary<int, char> delimiters = new Dictionary<int, char>();
            for (int i = 0; i < firstLine.Count(); i++)
            {
                delimiters.Add(i, firstLine[i]);
            }

            // Read subsequent lines, remove delimiters from
            // the dictionary that are not present in those lines
            string line;
            while ((line = sr.ReadLine()) != null && delimiters.Count() != 0)
            {
                var subsequentLine = line.ToCharArray();
                var invalidDelimiters = new List<int>();

                // Compare all chars in the first and the subsequent line
                foreach (var delimiter in delimiters)
                {
                    if (delimiter.Key >= subsequentLine.Count())
                    {
                        invalidDelimiters.Add(delimiter.Key);
                        continue;
                    }
                    // Remove a delimiter when it differs from the
                    // character at the same position in a subsequent line
                    if (subsequentLine[delimiter.Key] != delimiter.Value)
                    {
                        invalidDelimiters.Add(delimiter.Key);
                    }
                }
                foreach (var invalidDelimiter in invalidDelimiters)
                {
                    delimiters.Remove(invalidDelimiter);
                }
            }

            foreach (var delimiter in delimiters)
            {
                Console.WriteLine(String.Format("Delimiter at {0} = {1}", delimiter.Key, delimiter.Value));
            }
            sr.Close();
        }
    }
}
"I am trying to ultimately (among other things) determine if various text files are fixed-width (...)"
If that's so, you could try this:
public bool isFixedWidth(string fileName)
{
    string[] lines = File.ReadAllLines(fileName);
    int length = lines[0].Length;
    foreach (string s in lines)
    {
        if (s.Length != length)
        {
            return false;
        }
    }
    return true;
}
Once you get that lines variable, you can access any character as though they were in a matrix. Like char c = lines[3][1];. However, there is no hard guarantee that all lines are the same length. You could pad them to be the same length as the longest one, if you so wanted.
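For example, the padding could look like this (a sketch, assuming lines came from File.ReadAllLines and that System.Linq is imported):
// Pad every line with trailing spaces to the length of the longest line.
int width = lines.Max(l => l.Length);
for (int i = 0; i < lines.Length; i++)
    lines[i] = lines[i].PadRight(width);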
Also,
"how would I query to get a list of all columns that contain a space character for ALL rows (for example)"
You could try this:
public bool CheckIfAllCharactersInAColumnAreTheSame(string[] lines, int colIndex)
{
    char c = lines[0][colIndex];
    try
    {
        foreach (string s in lines)
        {
            if (s[colIndex] != c)
            {
                return false;
            }
        }
        return true;
    }
    catch (IndexOutOfRangeException)
    {
        return false;
    }
}
Since it's not clear where exactly you're having difficulty, here are a few pointers.
Reading the file as strings, one per line:
string[] lines = File.ReadAllLines("filename.txt");
Obtaining a jagged array (a matrix) of characters from the lines (this step seems unnecessary, since strings can be indexed just like character arrays):
char[][] charMatrix = lines.Select(l => l.ToCharArray()).ToArray();
Example query: whether every character in column 0 is a 't':
bool allTs = charMatrix.All(row => row[0] == 't');