Appropriate data structure for searching in strings (.net) - c#

My situation is a bit hard to describe, so this post might get a little longer.
I want to search for given keys in strings. The strings are lines of a text file, and the comparison is to be done as the file is being read line by line.
There is a class with the properties NUMBER and TYPE, among others. These are the keys to search for in each line string.
A trivial solution would be to store the class instances in a list and, for each and every line, run through that list to see whether the line string contains the keys of the current list entry.
The performance of this implementation would be horrible, though, since on average the program will loop through the entire list for every line. That is because every key in the list occurs at most once in the file, so there are a lot of lines which don't contain any key at all.
I hope you guys get what I'm trying to explain and get the idea.
Example for objects:
O1:
ID - 1
NR - 1587
TYPE - COMPUTER
O2:
ID - 2
NR - 5487
TYPE - TV
text file lines:
bla bla \t 8745 RADIO
fsdakfjd9 9094km d9943
dkjd894 4003p \t 5487 TV
sdj99 43s39 kljljkljfsd
...
On line 3 the program should find the match and save the ID 2 together with the line content.
Thanks for any input ...
Toby

Looking up strings in your file is intensive, so ideally you only want to do this once.
I think it's ideal if you store class references in a Dictionary or a Hashtable.
Then you can do something like
var myDictionary = new Dictionary<string, ObjectType>();
string line;
while ((line = reader.ReadLine()) != null)
{
    // Parse the possible key out of the line
    if (myDictionary.ContainsKey(keyFromLine)) doSomething(line, myDictionary[keyFromLine]);
}
void doSomething(string line, ObjectType instance)
{
    // Unwrap the line and store appropriate values
}

Splitting and counting within strings is by nature resource- and time-intensive. You need both parsing and searching. Loop through all the lines once, save what you find, and then search it using a Dictionary<key, value>. Aim for the fewest passes: run over all the lines and store them first, rather than scanning every line on every search.
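As a concrete illustration, here is a minimal sketch of that approach. It assumes the keys appear as whitespace- or tab-separated tokens (like the NR values in the example) and that a hypothetical numberToId dictionary maps each NR key to its object's ID:

using System;
using System.Collections.Generic;
using System.IO;

class LineMatcher
{
    // numberToId is a hypothetical map from the NR key (e.g. "5487") to the object's ID.
    public static void FindMatches(string path, Dictionary<string, int> numberToId)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            int lineNo = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                // Check every token of the line; each lookup is O(1),
                // no matter how many keys are in the dictionary.
                foreach (var token in line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
                {
                    if (numberToId.TryGetValue(token, out int id))
                        Console.WriteLine($"Line {lineNo}: matched ID {id}: {line}");
                }
            }
        }
    }
}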

Related

get the position of a string in a text file based on the line number in c#

I have an input text file that comes from a third party, and I wrote a C# program to process it and get the results. I have the results, and I need to update the same file with them; the third party updates their DB based on this output file. I need to get the position of the string in order to update the file.
Ex: The input file looks this way:
Company Name: <some name> ID: <some ID>
----------------------------------------------------
Transaction_ID:0000001233 Name:John Amount:40:00 Output_Code:
-----------------------------------------------------------------------
Transaction_ID:0000001234 Name:Doe Amount:40:00 Output_Code:
------------------------------------------------------------------------
Please note: transaction_ID is unique in each row.
The Output file should be:
Company Name: <some name> ID: <some ID>
----------------------------------------------------
Transaction_ID:0000001233 Name:John Amount:40:00 Output_Code:01
-----------------------------------------------------------------------
Transaction_ID:0000001234 Name:Doe Amount:40:00 Output_Code:02
---------------------------------------------------------------------------
The codes 01 and 02 are the results of the c# program and have to be updated in the response file.
I have code to find the position of "Transaction_ID:0000001233" and "Output_Code:", and I am able to update the first row. But I am not able to get the position of the "Output_Code:" for the second row. How do I identify the string based on the line number?
I cannot rewrite the whole response file as it has other unwanted columns.
The best option here would be to update the existing file.
long positionreturnCode1 = FileOps.Seek(filePath, "Output_Code:");
// gets the position of Output_Code in the first row
byte[] bytesToInsert = System.Text.Encoding.ASCII.GetBytes("01");
FileOps.InsertBytes(bytesToInsert, newPath, positionreturnCode1);
// the above code inserts "01" in the correct position, i.e. the first row

long positiontransId2 = FileOps.Seek(filePath, "Transaction_ID:0000001234");
long positionreturnCode2 = FileOps.Seek(filePath, "Output_Code:");
// still gets the first row's value
long pos = positionreturnCode2 - positiontransId2;
bytesToInsert = System.Text.Encoding.ASCII.GetBytes("02"); // reuse the variable, don't redeclare it
FileOps.InsertBytes(bytesToInsert, newPath, pos);
// this inserts in a completely different position
I know the logic is wrong. But I am trying to get the position of output code value in the second row.
Don't try to "edit" the existing file. There is too much room for error.
Rather, assuming that the file format will not change, parse the file into data, then rewrite the file completely. An example, in pseudo-code below:
public class Entry // a class rather than a struct, so entries can be updated in place below
{
    public string TransactionID;
    public string Name;
    public string Amount;
    public string Output_Code;
}
Iterate through the file and create a list of Entry instances, one for each file line, and populate the data of each Entry instance with the contents of the line. It looks like you can split the text line using white spaces as a delimiter and then further split each entry using ':' as a delimiter.
Then, for each entry, you set the Output_Code during your processing phase.
foreach (Entry entry in entrylist)
    entry.Output_Code = MyProcessingOfTheEntryFunction(entry);
Finally iterate through your list of entries and rewrite the entire file using the data in your Entry list. (Making sure to correctly write the header and any line spacers, etc..)
OpenFile();
WriteFileHeader();
foreach (Entry entry in entrylist)
{
    WriteLineSpacer();
    WriteEntryData(entry);
}
CloseFile();
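To make the pseudo-code concrete, here is a minimal parsing sketch, assuming the exact line format shown in the question; note each token is split on the first ':' only, since Amount:40:00 contains a second colon:

var entries = new List<Entry>();
foreach (var line in File.ReadLines("input.txt"))
{
    if (!line.StartsWith("Transaction_ID:")) continue; // skip the header and spacer lines
    var entry = new Entry();
    foreach (var token in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        var parts = token.Split(new[] { ':' }, 2); // split on the first ':' only
        switch (parts[0])
        {
            case "Transaction_ID": entry.TransactionID = parts[1]; break;
            case "Name": entry.Name = parts[1]; break;
            case "Amount": entry.Amount = parts[1]; break;
        }
    }
    entries.Add(entry);
}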
To start with, I'll isolate the part that takes a transaction and returns a code, since I don't know what that is, and it's not relevant. (I'd do the same thing even if I did know.)
public class Transaction
{
    public Transaction(string transactionId, string name, decimal amount)
    {
        TransactionId = transactionId;
        Name = name;
        Amount = amount;
    }

    public string TransactionId { get; }
    public string Name { get; }
    public decimal Amount { get; }
}

public interface ITransactionProcessor
{
    // returns an output code
    string ProcessTransaction(Transaction transaction);
}
Now we can write something that processes a set of strings, which could be lines from a file. That's something to think about: you get the strings from a file, but would this work any differently if they didn't come from a file? Probably not. Besides, manipulating the contents of a file is harder, and manipulating strings is easier. So instead of "solving" the harder problem, we're just converting it into an easier one.
For each string it's going to do the following:
Read a transaction, including whatever fields it needs, from the string.
Process the transaction and get an output code.
Add the output code to the end of the string.
Again, I'm leaving out the part that I don't know. For now it's in a private method, but it could be described as a separate interface.
public class StringCollectionTransactionProcessor // Horrible name, sorry.
{
    private readonly ITransactionProcessor _transactionProcessor;

    public StringCollectionTransactionProcessor(ITransactionProcessor transactionProcessor)
    {
        _transactionProcessor = transactionProcessor;
    }

    public IEnumerable<string> ProcessTransactions(IEnumerable<string> inputs)
    {
        foreach (var input in inputs)
        {
            var transaction = ParseTransaction(input);
            var outputCode = _transactionProcessor.ProcessTransaction(transaction);
            var outputLine = $"{input} {outputCode}";
            yield return outputLine;
        }
    }

    private Transaction ParseTransaction(string input)
    {
        // Get the transaction ID and whatever values you need from the string.
        throw new NotImplementedException(); // parsing intentionally left out, as noted above
    }
}
The result is an IEnumerable<string> where each string is the original input, unmodified except for the output code appended at the end. If there were any extra columns in there that weren't related to your processing, that's okay. They're still there.
There are likely other factors to consider, like exception handling, but this is a starting point. It gets simpler if we completely isolate different steps from each other so that we only have to think about one thing at a time.
As you can see, I've still left things out. For example, where do the strings come from? Do they come from a file? Where do the results go? Another file? Now it's much easier to see how to add those details. They seemed like they were the most important, but now we've rearranged this so that they're the least important.
It's easy to write code that reads a file into a collection of strings.
var inputs = File.ReadLines(path);
When you're done and you have a collection of strings, it's easy to write them to a file.
File.WriteAllLines(path, linesToWrite);
We wouldn't add those details into the above classes. If we do, we've restricted those classes to only working with files, which is unnecessary. Instead we just write a new class which reads the lines, gets a collection of strings, passes it to the other class to get processed, gets back a result, and writes it to a file.
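For instance, such a wrapper might look like this (the class name is my own illustration, not part of the answer above):

public class TransactionFileProcessor
{
    private readonly StringCollectionTransactionProcessor _processor;

    public TransactionFileProcessor(StringCollectionTransactionProcessor processor)
    {
        _processor = processor;
    }

    public void ProcessFile(string inputPath, string outputPath)
    {
        var inputs = File.ReadLines(inputPath);          // lazily reads the lines
        var outputs = _processor.ProcessTransactions(inputs);
        File.WriteAllLines(outputPath, outputs);         // streams the results out
    }
}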
This is an iterative process that allows us to write the parts we understand and leave the parts we haven't figured out for later. That keeps us moving forward solving one problem at a time instead of getting stuck trying to solve a few at once.
A side effect is that the code is easier to understand. It lends itself to writing methods with just a few lines. Each is easy to read. It's also much easier to write unit tests.
In response to some comments:
If the output code doesn't go at the end of the line - it's somewhere in the middle, you can still update it:
line = line.Replace("Output_Code:", "Output_Code:" + outputCode);
That's messy. If the line is delimited, you could split it, find the element that contains Output_Code, and completely replace it. That way you don't get weird results if for some reason there's already an output code.
If the step of processing a transaction includes updating a database record, that's fine. That can all be within ITransactionProcessor.ProcessTransaction.
If you want an even safer system you could break the whole thing down into two steps. First process all of the transactions, including your database updates, but don't update the file at all.
After you're done processing all of the transactions, go back through the file and update it. You could do that by looking up the output code for each transaction in the database. Or, processing transactions could return a Dictionary<string, string> containing the transaction ids and output codes. When you're done with all the processing, go through the file a second time. For each transaction ID, see if there's an output code. If there is, update that line.
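A sketch of that second pass, assuming processing returned a Dictionary<string, string> of transaction IDs to output codes, and that ExtractTransactionId is a hypothetical helper in the spirit of ParseTransaction above:

public static void AppendOutputCodes(string inputPath, string outputPath,
    Dictionary<string, string> codesById)
{
    var outputLines = new List<string>();
    foreach (var line in File.ReadLines(inputPath))
    {
        var id = ExtractTransactionId(line); // hypothetical parsing helper
        // Only lines whose transaction has an output code are changed.
        outputLines.Add(id != null && codesById.TryGetValue(id, out var code)
            ? line + code // the line already ends with "Output_Code:"
            : line);
    }
    File.WriteAllLines(outputPath, outputLines);
}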
The addition here is to pass in a position based on where your main program has already updated, and to keep that position moving forward by the length of whatever you inserted.
If I am reading the code there and your example correctly, this should let you scoot along through the file.
This function is within the utils that you linked in your comment.
public static long Seek(string file, long position, string searchString)
{
    // open a FileStream to perform a seek from the given position
    using (System.IO.FileStream fs = System.IO.File.OpenRead(file))
    {
        fs.Position = position;
        return Seek(fs, searchString);
    }
}
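Usage might then look roughly like this, assuming Seek returns the offset at which searchString was found (check the linked utils to confirm): start each search just past the previous match, and remember that every insert shifts all later offsets by the number of bytes added.

long position = 0;
position = FileOps.Seek(filePath, position, "Output_Code:");     // first row
// ... insert "01", then continue searching just past this match:
position = FileOps.Seek(filePath, position + 1, "Output_Code:"); // second row
// ... when computing where to insert "02", add the length of what was already inserted.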

Search algorithm for partial words in C#

I'm currently looking for a way to implement a partial-word pattern matching algorithm in C#. The situation I'm in looks as follows:
I have a text field for the search pattern. Every time the user enters or deletes a char in this field, an event fires which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", "Faceless Face" (whatever that should be) or generally ANY real-life sentences as strings,
the algorithm first runs when "f" is typed in the field. It then shows the most relevant string at the top of the list the strings are in. It runs a second time when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the text field and the list is sorted once more.
However, I don't know which algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein-distance algorithm, as well as a self-made algorithm which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
count = Regex.Matches(title, Regex.Escape(pattern)).Count; // escape the pattern, not the text being searched
priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense since the strings are not the same length most of the time.
The answer from Alain works fine, but only if at least one word matches completely (you only find a most relevant string/sentence when at least one word equals the pattern). So if "face" is typed and there's a string containing the word "facebook", that string is almost never a top priority.
What other ideas could I try? The goal would be to sort the list of strings in the best possible way at the earliest moment (i.e. with the fewest letters typed).
You can look at my implementations in the search-* branches of my repository at http://github.com/croemheld/sprung, in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.
First of all, you need to store a frequency for each string (the number of times that string has been searched) somewhere, so the most relevant one can be shown on top. If you need to show, say, the k most relevant entries, a Min Heap of size k can be used.
Case 1 - a letter is pressed for the first time:
Step (a): Read all the strings from a database or dictionary and store them in some data structure (say DS1) with a FLAG_VALID (set to 1 initially) which marks whether the string is still valid for the present search characters (for the first letter, all strings are valid).
As you read the strings, fill the Min Heap according to their frequencies; an element is inserted only when its frequency is greater than the current minimum (i.e. the first element of the Min Heap).
Step (b) (this step is the same in every case): to show the results, display the elements in reverse order of the Min Heap, i.e. the first element in the Min Heap has the least priority, so delete the elements one by one and show them from last to first.
NOTE: the Min Heap contains references to the strings, so a string and its frequency can be accessed at the same time.
Case 2 - inserting further letters in the search box:
Step (a): Go through DS1 and check FLAG_VALID first. If the string is still valid, compare the string from the search box with the string from DS1. Set the flag accordingly (1 for a match, 0 otherwise) and fill the k-Min Heap, which is empty from the last search, as in Case 1.
Step (b) is as usual.
Case 3 - deleting a letter in the search box:
It is similar to the above cases, but this time we also need to check the strings whose FLAG_VALID is 0 (i.e. strings which had become invalid).
This is a crude searching method and can be improved using certain Data structure and tweaking the algorithm.
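A rough C# sketch of the idea, using a SortedSet of (frequency, string) pairs in place of the size-k Min Heap; the candidate list and frequency table are assumed to already exist, and the last line needs using System.Linq:

public static List<string> TopKMatches(IEnumerable<string> candidates,
    Dictionary<string, int> frequency, string pattern, int k)
{
    // Ordered by frequency first; ties are broken by the string itself.
    var heap = new SortedSet<(int Freq, string Text)>();
    foreach (var s in candidates)
    {
        if (!s.Contains(pattern)) continue;        // the FLAG_VALID check
        int f = frequency.TryGetValue(s, out var v) ? v : 0;
        heap.Add((f, s));
        if (heap.Count > k) heap.Remove(heap.Min); // evict the least frequent
    }
    // Show in reverse order of the Min Heap: most frequent first.
    return heap.Reverse().Select(e => e.Text).ToList();
}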

How to compare an array loaded from file with another array loaded from another file c#

I have to write a C# Forms program which loads data from a file that looks something like this:
100ACTGGCTTACACTAATCAAG
101TTAAGGCACAGAAGTTTCCA
102ATGGTATAAACCAGAAGTCT
...
120GCATCAGTACGTACCCGTAC
20 lines formed of a number (ID) and 20 letters (DNA); the other file looks like this:
TGCAACGTGTACTATGGACC
In a few words, this is a game where a murder has been committed and there are 20 people; I have to load and split the letters, compare them, and in the end find the best match.
I have no idea how to do that; I don't know how to load the letters into an array, how to split them, and then how to compare them.
What you want to do here is use something like a calculation of the Levenshtein distance between the strings.
In simple terms, that provides a count of how many single letters you have to change for a string to become equal to another. In the context of DNA or Proteins, this can be interpreted as representing the number of mutations between two individuals or samples. A shorter distance will therefore indicate a closer relationship between the two.
The algorithm can be fairly heavy computationally, but will give you a good answer. It's also quite fun and enlightening to implement. You can find a couple of ways of implementing it under the wikipedia article.
If you find it challenging to understand how it works, I recommend you set up an example grid by hand, with one short string horizontally along the top, and one vertically along the left side, and try going through the calculations manually, just to understand the concept properly (it can be confusing at first, but is really not that difficult).
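For reference, the textbook dynamic-programming version might look like this (a sketch along the lines of the Wikipedia article, not tuned for performance):

public static int Levenshtein(string a, string b)
{
    // d[i, j] = edit distance between the first i chars of a and the first j chars of b
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // i deletions
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // j insertions
    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,          // deletion
                d[i, j - 1] + 1),         // insertion
                d[i - 1, j - 1] + cost);  // substitution
        }
    }
    return d[a.Length, b.Length];
}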
This is a simple match function. It might not be of the complexity your game requires. This solution does not require an explicit split on the strings in order to get an array of DNA "letters". The DNA is compared in place.
Compare each "suspect" entry to the "evidence" one.
int idLength = 3;
string evidence = ...;       // read from the second file
List<string> suspects = ...; // read from the first file, one entry per line
List<double> matchScores = new List<double>();
foreach (string suspect in suspects)
{
    int count = 0;
    // skip the 3-character ID prefix on the suspect line, then compare letter by letter
    for (int i = 0; i < evidence.Length; i++)
    {
        if (suspect[i + idLength] == evidence[i]) count++;
    }
    matchScores.Add(count * 100.0 / evidence.Length); // 100.0 avoids integer division
}
The matchScores list now contains all the individual match scores. I did not save the maximum match score in a separate variable as there can be several "suspects" with the same score. To find out which subject has the best match, just iterate the matchScores list. The index of the best match is the index of the suspect in the suspects list.
Optimization notes:
you could check each "suspect" string to see where (i.e. at what index) the DNA sequence starts, as it could be variable;
a dictionary could be used here, instead of two lists, with the "suspect string" as key and the match score as value
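For the second note, the dictionary variant is a small change: the same scoring loop, keyed by the suspect string (a sketch; the last line needs using System.Linq):

var scoresBySuspect = new Dictionary<string, double>();
foreach (string suspect in suspects)
{
    int count = 0;
    for (int i = 0; i < evidence.Length; i++)
        if (suspect[i + idLength] == evidence[i]) count++;
    scoresBySuspect[suspect] = count * 100.0 / evidence.Length;
}
// The best match is then the entry with the highest score:
var best = scoresBySuspect.OrderByDescending(kv => kv.Value).First();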

Matching a string in a Large text file?

I have a list of strings containing about 7 million items in a text file of size 152MB. I was wondering what the best way would be to implement a function that takes a single string and returns whether it is in that list of strings.
Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.
152MB of ASCII will end up as over 300MB of Unicode data in memory - but modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.
The absolute simplest way to do this is probably to use File.ReadAllLines, although that will create an array which will then be discarded - not great for memory usage, but probably not too bad:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines("data.txt"));
...
if (strings.Contains(stringToCheck))
{
...
}
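If the throwaway array bothers you, File.ReadLines streams the lines lazily, and the HashSet<string> constructor accepts any IEnumerable<string>:

HashSet<string> strings = new HashSet<string>(File.ReadLines("data.txt"));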
It depends on what you want to do. If you want to repeat the search for matches again and again, I'd load the whole file into memory (into a HashSet). There it is very easy to search for matches.

Text file with each line represent a user, can I update a particular line?

If I have a text file like:
123, joe blow, USA
Where the first values represent:
USERID, NAME, COUNTRY
If my file has 5000 rows, could I update a particular row somehow using C#?
While it's possible to alter a line in a file - you can only do so safely if the replacing text is exactly the same length. Since this is rarely the case, your best bet is to replace the entire contents of the file with the new results.
Line-based text files are not a great format for storing information that is likely to change for exactly this reason.
Now, assuming the file is not too large, you can load it in its entirety into an array of strings, replace the text of a particular line, and then write the file back out:
using System.IO;
using System.Linq;
using System.Text;

var linesOfText = File.ReadLines("/myfile.txt", Encoding.UTF8).ToArray();
linesOfText[123] = "456, sally smith, USA";
File.WriteAllLines("/myfile.txt", linesOfText);
EDIT: In the interest of brevity, I provided an example above that uses indexed position to update a particular line. In any real-world usage, you should parse the lines and search for the one you want to update/replace rather than relying on an offset.
I would not use the above code if the file is excessively large, but ~5,000 lines of 50 characters is relatively small (< 1MB), so I wouldn't worry. If you're going to do this type of operation over and over, though, you may want to rethink how you store this data.
You have to read the entire file, update your line and then write the file again.
However, having said that you could read the file a line at a time writing each line to a new file (modifying the required line as necessary). Then delete (or rename) the original and rename the new file.
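A minimal sketch of that approach; the predicate and the replacement line are illustrative only:

string path = "users.txt", tempPath = "users.tmp";
using (var reader = new StreamReader(path))
using (var writer = new StreamWriter(tempPath))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.StartsWith("123,"))          // the row to update
            line = "123, joe blow, CANADA";   // modified as necessary
        writer.WriteLine(line);
    }
}
File.Delete(path);         // delete (or rename) the original
File.Move(tempPath, path); // and rename the new file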
You have two options.
Use a Stream and find the index of your row and insert some characters
Load the entire file into an array, modify it and re-write the whole file.
With option number 1, your code will be more complex, but it will scale to work even if your file grows to 500,000,000 rows.
With option number 2, the code is very simple, but since you load the whole thing into memory, it is not very efficient for large datasets.
For option 2, code like this might get you started:
var sbld = new System.Text.StringBuilder();
var rows = System.IO.File.ReadAllLines(@"C:\YourUserTextFile.txt");
foreach (var row in rows)
{
    if (updateThisRow(row))
    {
        var ar = row.Split(',');
        var id = ar[0];
        var name = ar[1];
        var country = ar[2];
        // do update here
        sbld.AppendLine(String.Format("{0},{1},{2}", id, name, country));
    }
    else
    {
        sbld.AppendLine(row);
    }
}
System.IO.File.WriteAllText(@"C:\YourUserTextFile.txt", sbld.ToString());
The updateThisRow method might be something you write to determine whether it's a row you care to update.
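For example, a minimal updateThisRow might key off the USERID field (purely illustrative):

static bool updateThisRow(string row)
{
    // True for the user whose row should be rewritten.
    return row.Split(',')[0].Trim() == "123";
}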
