Filter text in a variable - C#

I have a variable (serial_and_username_and_subType) that contains this type of text:
CT-AF-23-GQG %username1% *subscriptionType*
DHR-345349-E %username2% *subscriptionType*
C3T-AF434-234-GQG %username3% *subscriptionType*
34-7-HHDHFD-DHR-345349-E %username4% *subscriptionType*
example: ST-NN1-CQ-QQQ-G12 %RandomDUDE12% *Lifetime*
After that, I have an if statement that checks whether the user inputs something that is present in serial_and_username_and_subType:
if (userInput.Contains(serial_and_username_and_subType))......
Then, what I would like to do (but am having trouble with) is that when someone enters a serial, the program prints the corresponding username and subscription.
For example:
Please enter your Serial:
> ST-NN1-CQ-QQQ-G12
Welcome, RandomDUDE12!
You currently have a Lifetime subscription!
Does anyone know a method or a way to achieve what I need?

You are already using Contains(). The other things you could use are
Substring()
Split()
IndexOf()
Split() is probably the easiest one as long as you can guarantee that neither % nor * are part of the serial, username or license:
var s = "ST-NN1-CQ-QQQ-G12 %RandomDUDE12% *Lifetime*";
var splitPercent = s.Split('%');
Console.WriteLine(splitPercent[1]);
var splitStar = s.Split('*');
Console.WriteLine(splitStar[1]);
This approach will work fine as long as you have only a few licenses (maybe a few thousand are okay, because PCs are fast). If you have many licenses (like millions), you probably want to move all that information out of a string and into a data structure. You would then use a dictionary and access the information directly, instead of iterating through all of them.
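For illustration, here is a minimal sketch of that dictionary approach (hedged; it assumes C# 7+ tuples and that every line follows the serial %user% *plan* layout shown above):
// Parse each line once into a dictionary keyed by serial (hedged sketch).
var subscriptions = new Dictionary<string, (string User, string Plan)>();
foreach (var line in serial_and_username_and_subType.Split('\n'))
{
    var parts = line.Trim().Split(' ');
    if (parts.Length < 3) continue;          // skip blank or malformed lines
    var serial = parts[0];
    var user = parts[1].Trim('%');           // %RandomDUDE12% -> RandomDUDE12
    var plan = parts[2].Trim('*');           // *Lifetime*     -> Lifetime
    subscriptions[serial] = (user, plan);
}

Console.Write("Please enter your Serial: ");
var input = Console.ReadLine();
if (subscriptions.TryGetValue(input, out var info))
{
    Console.WriteLine($"Welcome, {info.User}!");
    Console.WriteLine($"You currently have a {info.Plan} subscription!");
}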

Related

Writing Dynamic IF Functions by user entry

I'm still pretty new to coding (C#) and am writing a file reader based on user criteria (what string they want to search for).
What my program does:
The user can search for a string / multiple strings (with AND/OR functionality)
The program interprets the user entry and re-writes the string into code. e.g.
string fileLine = File.ReadAllText(line);
USER:
(hi AND no) OR yes
PROGRAM:
if((fileLine.Contains(hi) && fileLine.Contains(no)) || fileLine.Contains(yes))
What I'm trying to do:
When matching the string to the file string, I use an IF function:
if(fileLine.Contains(hi))
{
//do A LOT of stuff here.
}
My first idea was to make a string out of the entered string and replace the "condition" in the IF function.
Am I going about this the wrong way? What would the best way of achieving this be?
Parameterizing the input is a good idea. It looks like you're really trying to do two things: first, you want to determine what the user's input was, and then you want to act on that input.
So, first, read the user-inputted line. If your typical expected input is something like (bob AND casey) OR josh then it should be pretty simple to implement a regex-based grammar on the backend, though as juharr says, it's more complicated for you that way. But, assuming user input is AND/OR and grouped by parenthesis, you'll likely want to break it down like this:
For each unit of filtering - each parenthetical - you want to hold 2 pieces of information: the items being filtered on (bob, casey) and the operator (AND)
These units are order sensitive; filtering on (bob AND casey) OR josh is different from bob AND (casey OR josh). So you also want a higher level representation of the order to search in. For (bob AND casey) OR josh, you'll be determining validity of the search based on {result of bob AND casey} OR {result of josh}.
These are probably objects. :)
When you have your user input in a standardized form, you will want to sanity-check it, and inform the user if you found something you can't parse (an unclosed parenthesis, etc.).
After informing the user, then and only then should you perform the actual file searching. I would suggest treating each unit of search (item 2 above) as its own "search" and using a switch statement for the search operators, e.g.
switch (op) {
    case Operator.AND:
        inString = fileLine.Contains(itemOne) && fileLine.Contains(itemTwo);
        break;
    case Operator.OR:
        inString = fileLine.Contains(itemOne) || fileLine.Contains(itemTwo);
        break;
}
Additionally, you will want to handle cases where you are comparing bool AND string.Contains and bool OR string.Contains.
If you have something like bob AND josh OR casey you could do a one-level linear comparison where you simply chopped them up and passed them through the comparison one at a time (bob and josh returns a bool, then pass that bool to a comparison for a bool OR string.Contains operation)
This structure also means you're less likely to get spaghetti code if your scope changes (meaning, you won't have an ever-increasing series of if statements), and you can handle unpredictable input and notify the user if something is wrong.
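As a rough sketch of the structure described above (hedged; SearchUnit and Operator are invented names, and fileLine is the string from the question's code):
enum Operator { AND, OR }

// One parenthetical group, e.g. (bob AND casey)
class SearchUnit
{
    public string ItemOne;
    public string ItemTwo;
    public Operator Op;

    public bool Matches(string fileLine)
    {
        switch (Op)
        {
            case Operator.AND:
                return fileLine.Contains(ItemOne) && fileLine.Contains(ItemTwo);
            case Operator.OR:
                return fileLine.Contains(ItemOne) || fileLine.Contains(ItemTwo);
            default:
                return false;
        }
    }
}

// (bob AND casey) OR josh  ->  {result of bob AND casey} OR {result of josh}
var unit = new SearchUnit { ItemOne = "bob", ItemTwo = "casey", Op = Operator.AND };
bool found = unit.Matches(fileLine) || fileLine.Contains("josh");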

Better method of handling/reading these files (HCFA medical claim form)

I'm looking for some suggestions on better approaches to handling a scenario with reading a file in C#; the specific scenario is something that most people wouldn't be familiar with unless you are involved in health care, so I'm going to give a quick explanation first.
I work for a health plan, and we receive claims from doctors in several ways (EDI, paper, etc.). The paper form for standard medical claims is the "HCFA" or "CMS 1500" form. Some of our contracted doctors use software that allows their claims to be generated and saved in a HCFA "layout", but in a text file (so, you could think of it like being the paper form, but without the background/boxes/etc). I've attached an image of a dummy claim file that shows what this would look like.
The claim information is currently extracted from the text files and converted to XML. The whole process works ok, but I'd like to make it better and easier to maintain. There is one major challenge that applies to the scenario: each doctor's office may submit these text files to us in slightly different layouts. Meaning, Doctor A might have the patient's name on line 10, starting at character 3, while Doctor B might send a file where the name starts on line 11 at character 4, and so on. Yes, what we should be doing is enforcing a standard layout that must be adhered to by any doctors that wish to submit in this manner. However, management said that we (the developers) had to handle the different possibilities ourselves and that we may not ask them to do anything special, as they want to maintain good relationships.
Currently, there is a "mapping table" set up with one row for each different doctor's office. The table has columns for each field (e.g. patient name, Member ID number, date of birth etc). Each of these gets a value based on the first file that we received from the doctor (we manually set up the map). So, the column PATIENT_NAME might be defined in the mapping table as "10,3,25" meaning that the name starts on line 10, at character 3, and can be up to 25 characters long. This has been a painful process, both in terms of (a) creating the map for each doctor - it is tedious, and (b) maintainability, as they sometimes suddenly change their layout and then we have to remap the whole thing for that doctor.
The file is read in, line by line, and each line is added to a List<string>. Once this is done, we do the following, where we get the map data and read through the list of file lines to get the field values (recall that each mapped field is a value like "10,3,25" (without the quotes)):
ClaimMap M = ClaimMap.GetMapForDoctor(17);
List<HCFA_Claim> ClaimSet = new List<HCFA_Claim>();
// Claims is List<List<string>>, one List<string> per claim in the text file
// (a file can contain more than one claim; it is split into separate claims earlier in the process)
foreach (List<string> cl in Claims)
{
    HCFA_Claim c = new HCFA_Claim();
    c.Patient = new Patient();
    c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1]
        .Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2]))
        .Trim();
    //...and so on...
    ClaimSet.Add(c);
}
Sorry this is so long...but I felt that some background/explanation was necessary. Are there any better/more creative ways of doing something like this?
Given the lack of standardization, I think your current solution, although not ideal, may be the best you can do. Given this situation, I would at least isolate concerns (e.g. file read, file parsing, conversion to standard XML, mapping-table access) into simple components, employing obvious patterns (e.g. DI, strategies, factories, repositories) where needed, to decouple the system from the underlying dependency on the mapping table and the current parsing algorithms.
You need to work on the DRY (Don't Repeat Yourself) principle by separating concerns.
For example, the code you posted appears to have an explicit knowledge of:
how to parse the claim map, and
how to use the claim map to parse a list of claims.
So there are at least two responsibilities directly assigned to this one method. I'd recommend changing your ClaimMap class to be more representative of what it's actually supposed to represent:
public class ClaimMap
{
    public ClaimMapField Name { get; set; }
    ...
}

public class ClaimMapField
{
    public int StartingLine { get; set; }
    // I would have the parser subtract one when creating this, to make it 0-based.
    public int StartingCharacter { get; set; }
    public int MaxLength { get; set; }
}
Note that the ClaimMapField represents in code what you spent considerable time explaining in English. This reduces the need for lengthy documentation. Now all the M.Name.Split calls can actually be consolidated into a single method that knows how to create ClaimMapFields out of the original text file. If you ever need to change the way your ClaimMaps are represented in the text file, you only have to change one point in code.
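For example, that single method might look something like this (a hedged sketch; the Parse name is an invention, and it assumes the "line,char,length" format described in the question):
public static ClaimMapField Parse(string raw)          // raw is e.g. "10,3,25"
{
    var parts = raw.Split(',');
    return new ClaimMapField
    {
        StartingLine = Int32.Parse(parts[0]) - 1,      // subtract one here, so the
        StartingCharacter = Int32.Parse(parts[1]) - 1, // rest of the code is 0-based
        MaxLength = Int32.Parse(parts[2])
    };
}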
Now your code could look more like this:
c.Patient.FullName = cl[map.Name.StartingLine].Substring(map.Name.StartingCharacter, map.Name.MaxLength).Trim();
c.Patient.Address = cl[map.Address.StartingLine].Substring(map.Address.StartingCharacter, map.Address.MaxLength).Trim();
...
But wait, there's more! Any time you see repetition in your code, that's a code smell. Why not extract out a method here:
public string ParseMapField(ClaimMapField field, List<string> claim)
{
return claim[field.StartingLine].Substring(field.StartingCharacter, field.MaxLength).Trim();
}
Now your code can look more like this:
HCFA_Claim c = new HCFA_Claim
{
    Patient = new Patient
    {
        FullName = ParseMapField(map.Name, cl),
        Address = ParseMapField(map.Address, cl),
    }
};
By breaking the code up into smaller logical pieces, you can see how each piece becomes very easy to understand and validate visually. You greatly reduce the risk of copy/paste errors, and when there is a bug or a new requirement, you typically only have to change one place in code instead of every line.
If you are only getting unstructured text, you have to parse it. If the text content changes you have to fix your parser. There's no way around this. You could probably find a 3rd party application to do some kind of visual parsing where you highlight the string of text you want and it does all the substring'ing for you but still unstructured text == parsing == fragile. A visual parser would at least make it easier to see mistakes/changed layouts and fix them.
As for parsing it yourself, I'm not sure about the line-by-line approach. What if something you're looking for spans multiple lines? You could bring the whole thing in as a single string and use IndexOf to substring it, with different indices for each piece of data you're looking for.
You could always use RegEx instead of Substring if you know how to do that.
While the basic approach you're taking seems appropriate for your situation, there are definitely ways you could clean up the code to make it easier to read and maintain. By separating out the functionality that you're doing all within your main loop, you could change this:
c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1].Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2])).Trim();
to something like this:
var parser = new FormParser(cl, M);
c.PatientFullName = parser.GetName();
c.PatientAddress = parser.GetAddress();
// etc
So, in your new class, FormParser, you pass the List that represents your form and the claim map for the provider into the constructor. You then have a getter for each property on the form. Inside that getter, you perform your parsing/substring logic as you do now. Like I said, you're not really changing the method by which you're doing it, but it certainly would be easier to read and maintain, and might reduce your overall stress level.
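A minimal sketch of what such a FormParser could look like (hedged; the GetField helper is my own illustration and assumes ClaimMap.Name is still the raw "line,char,length" string from the question):
public class FormParser
{
    private readonly List<string> _lines;
    private readonly ClaimMap _map;

    public FormParser(List<string> lines, ClaimMap map)
    {
        _lines = lines;
        _map = map;
    }

    public string GetName()    { return GetField(_map.Name);    }
    public string GetAddress() { return GetField(_map.Address); }

    // mapValue is the "line,char,length" string from the mapping table, e.g. "10,3,25"
    private string GetField(string mapValue)
    {
        var p = mapValue.Split(',');
        return _lines[Int32.Parse(p[0]) - 1]
            .Substring(Int32.Parse(p[1]) - 1, Int32.Parse(p[2]))
            .Trim();
    }
}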

Creating a character variation algorithm for a synonym table

I have a need to create a variation/synonym table for a client who needs to make sure that if someone enters an incorrect value, we can return the correct part.
For example, say we have a part ID of GRX7-00C. When the client enters this into a part table, they would like to automatically create a variation table that stores the variations this product could be entered as, like GBX7-OOC (letter O instead of number 0). Or, if they have the number 1, to be able to use L or I.
So if we have part GRL8-OOI we could have the following associated to it in the variation table:
GRI8-OOI
GRL8-0OI
GRL8-O0I
GRL8-OOI
etc....
I currently have a manual entry process for this, but there could be a ton of variations of these parts. Would anyone have a good idea of how I can create an automatic process for this?
How can I do this in C# and/or SQL?
I'm not a C# programmer, but for other .NET languages it would make more sense to me to create a list of CHARACTERS that are similar, group those together, and use RegEx to evaluate whether it matches.
i.e. for your example:
Original:
GRL8-001
Regex-ploded:
GR(l|L|1)(8|b|B)-(0|o|O)(0|o|O)(1|l|L)
You could accomplish this by having a table of interchangeable characters and running a replace function to sub the RegEx for the character automatically.
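For instance, a hedged C# sketch of building such a pattern from a table of interchangeable characters (the look-alike groups are only examples, userEntry stands in for the client's input, and the usual using directives for System.Linq, System.Text and System.Text.RegularExpressions are assumed):
static readonly string[] LookAlikeGroups = { "0OQ", "1IL", "8B" };   // illustrative groups

static string BuildPattern(string partId)
{
    var sb = new StringBuilder("^");
    foreach (var c in partId.ToUpper())
    {
        var group = LookAlikeGroups.FirstOrDefault(g => g.IndexOf(c) >= 0);
        sb.Append(group == null
            ? Regex.Escape(c.ToString())
            : "(" + string.Join("|", group.ToCharArray()) + ")");
    }
    return sb.Append("$").ToString();
}

// BuildPattern("GRL8-001") -> "^GR(1|I|L)(8|B)-(0|O|Q)(0|O|Q)(1|I|L)$"
bool matches = Regex.IsMatch(userEntry.ToUpper(), BuildPattern("GRL8-001"));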
Lookex function pseudocode (works like Soundex, but for look-alike instead of sound-alike):
string input
for each char c
if c in "O0Q" c = 'O'
else if c in "IL1" c = 'I'
etc.
Compute a single Lookex code and store it with each product ID. If the user's entry doesn't match a product ID, compute the Lookex code of their entry and search for all products having that code (there could be more than one). This would consume minimal space, be quite fast with a single index, and be inexpensive to compute as well.
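In C#, a hedged rendering of that pseudocode might look like this (the look-alike groups are only examples):
static string Lookex(string input)
{
    var sb = new StringBuilder();
    foreach (var c in input.ToUpper())
    {
        if ("O0Q".IndexOf(c) >= 0) sb.Append('O');
        else if ("IL1".IndexOf(c) >= 0) sb.Append('I');
        else sb.Append(c);                 // extend with further look-alike groups as needed
    }
    return sb.ToString();
}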
Given your input above, what I would do is not store a table of synonyms, but instead, have a set of rules checked against a master dictionary. So for example, if the user types in a value that is not found in the dictionary, change O to 0, and check for that existing in the dictionary. Change GR to GB and check for that. Etc. All the variations they want to allow described above can be explained as rules that you can apply one at a time or in combination and check if the resulting entry exists. That way you do not have to have a massive dictionary of synonyms to maintain and update.
I wouldn't go the synonym route at all.
I would cleanse all values in the database using a standard rule set.
For every value that exists, replace all '0's with 'O's, strip out dashes, etc., so that for each real value you have only one modified value, and store that in a separate field/table.
Then I would cleanse the input the same way, and do a two-part match. Check the actual input string against the actual database values(this will get you exact matches), and secondly check the cleansed input against the cleansed values. Then order the output against the actual database values using a distance calc such as Levenshtein Distance to get the most likely match.
Now for the input:
GRL8-OO1
With parts:
GRL8-00I & GRL8-OOI
These would all normalize to the same value GRL8OOI, though the distance match would be closer for GRL8-OOI, so that would be your closest bet.
Granted this dramatically reduces the "uniqueness" of your part numbers, but the combo of the two-part match and the Levenshtein should get you what you are looking for.
There are several T-SQL implementations of Levenshtein available
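To make that concrete, here is a hedged sketch of the cleansing and the two-part match (partNumbers, userEntry and the Levenshtein helper are assumed to exist elsewhere; the substitutions shown are only examples):
static string Cleanse(string partId)
{
    var upper = partId.ToUpper().Replace('0', 'O').Replace('1', 'I');   // digits that look like letters
    return new string(upper.Where(char.IsLetterOrDigit).ToArray());     // strip dashes etc.
}

// 1) exact matches against the real values, 2) matches against the cleansed values,
//    ordered by edit distance so the closest real part number comes first
var exact = partNumbers.Where(p => p == userEntry).ToList();
var fuzzy = partNumbers
    .Where(p => Cleanse(p) == Cleanse(userEntry))
    .OrderBy(p => Levenshtein(p, userEntry))
    .ToList();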

C# Console - Inputting and removing from an array

For part of a small assignment I have, I've been asked to create an array to store names and addresses taken from input that the user gives, and to be able to later delete a name and address from the array.
Any help or links to help me understand how to achieve this would be highly appreciated, thanks.
EDIT - The array is to be set up like an address book, and when printed to the screen it displays like so: "Bloggs, Joe"
It must be surname then forename. I know how to acquire and store the information the user will give, namely their names and addresses, but I am stuck on how to add this to an array. The array doesn't have to be infinite, as I am allowed to allocate the array whatever size I wish.
At the start of the program it will be part of, the user will be given a menu, and they can choose to add a record, delete a record or print the book to the screen. So I am meant to be using methods where suitable.
Well, to start with, an array is the wrong data structure to use here.
Arrays are always a fixed size - whereas you want to be able to add elements and later remove them. Assuming you're using C# 2 or higher, you should probably use a List<T>.
Now, the next thing is to work out what T should be. It sounds like you want to store details of people - so you should create a Person class (or perhaps Contact) to encapsulate the name and address... that way you can have a List<Person>.
The next task is probably to work out how to ask the user for input and convert that input into an instance of Person.
Basically, break the task up into small bits - and then feel free to ask questions about any specific bits which you find hard.
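For example, a hedged sketch of that Person type and list (the property names are invented; the ToString override matches the "Bloggs, Joe" format the assignment asks for):
public class Person
{
    public string Surname  { get; set; }
    public string Forename { get; set; }
    public string Address  { get; set; }

    public override string ToString()
    {
        return Surname + ", " + Forename;        // prints as "Bloggs, Joe"
    }
}

var addressBook = new List<Person>();
addressBook.Add(new Person { Surname = "Bloggs", Forename = "Joe", Address = "1 High Street" });

// deleting a record later:
addressBook.RemoveAll(p => p.Surname == "Bloggs" && p.Forename == "Joe");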
I seem to remember this exact same assignment from my CS classes.
The prof wanted us to use linked lists. As Jon Skeet points out above, .NET has List<T>, which handles the resizing for you (with the added feature of being able to reference each item by index, like an array).
You can use a Serializer for the saving part.
Check out the BinaryFormatter class and XmlSerializer.
XmlSerializer is preferred because the file is human-readable and efficiency is usually less important considering the type and purpose of your app.
Using XmlSerializer is as simple as:
var filename = @"c:\....\addressbook.xml";
if (File.Exists(filename))
    File.Delete(filename);

using (var sw = new StreamWriter(filename))
{
    var xs = new XmlSerializer(typeof(List<Person>));
    xs.Serialize(sw, myAddressBook);
}

Best way to detect similar email addresses?

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painfully slow. Is there a more efficient approach?
The test code I'm using now is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;
namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0)
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();

                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    if (i == n) continue;
                    var word2 = inputWords[n].ToLower();
                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;

                    var distance = LevenshteinAlgorithm.Compute(word1, word2);
                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}
Edit: Some of the stuff I'm trying to catch looks like:
01234567890@gmail.com
0123456789@gmail.com
012345678@gmail.com
01234567@gmail.com
0123456@gmail.com
012345@gmail.com
01234@gmail.com
0123@gmail.com
012@gmail.com
You could start by applying some prioritization to which emails to compare to one another.
A key reason for the performance limitation is the O(n²) cost of comparing each address to every other email address. Prioritization is the key to improving the performance of this kind of search algorithm.
For instance, you could bucket all emails that have a similar length (+/- some amount) and compare that subset first. You could also strip all special characters (numbers, symbols) from emails and find those that are identical after that reduction.
You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are an efficient data structure for performing these types of searches.
Another possible way to optimize this algorithm would be to use the date when the email account was created (assuming you know it). If duplicate emails are created, they would likely be created within a short period of time of one another; this may help you reduce the number of comparisons to perform when looking for duplicates.
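As an example, the length-bucketing idea above could look roughly like this (a hedged sketch reusing the inputWords array from the question's code):
// Group addresses by length, then only compare each address against addresses
// whose length is within +/- 2 of its own (an edit distance of 2 cannot change length by more).
var byLength = inputWords
    .Select(w => w.ToLower())
    .GroupBy(w => w.Length)
    .ToDictionary(g => g.Key, g => g.ToList());

foreach (var word in inputWords.Select(w => w.ToLower()))
{
    for (var len = word.Length - 2; len <= word.Length + 2; len++)
    {
        if (!byLength.ContainsKey(len)) continue;
        // run the Levenshtein comparison against byLength[len] here
    }
}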
Well, you can make some optimizations, assuming that the Levenshtein distance is your bottleneck.
1) With a Levenshtein distance of 2, the emails are going to be within 2 characters' length of one another, so don't bother to do the distance calculation unless abs(length(email1) - length(email2)) <= 2.
2) Again, with a distance of 2, there are not going to be more than 2 characters different, so you can make HashSets of the characters in the emails and take the length of the union minus the length of the intersection of the two (I believe this is SymmetricExceptWith). If the result is > 2, skip to the next comparison.
OR
Code your own Levenshtein distance algorithm. If you are only interested in lengths < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.
You could add a few optimizations:
1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able to hit against this list faster than you hit the main list.
2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort and also compare against that domain.
3) Consider stripping the domain in general. spammer3@gmail.com and spammer3@hotmail.com will never trigger your flag.
If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem which can be solved in O(n log n) time.
Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.
Just for completeness, you should consider the semantics of email addresses as well, in terms of:
Gmail treats user.name and username as being the same, so both are valid email addresses belonging to the same user. Other services may do this as well. LBushkin's suggestion to strip special characters would help here.
Sub-addressing can potentially trip up your filter if users wise up to it. You'd want to drop the sub-address data before comparison (see the sketch after this list).
You might want to look at the full data set to see if there is other commonality between accounts that have spoofed emails.
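A hedged sketch of that kind of normalization (the Canonical name is invented; it only covers the Gmail dot rule and plus-style sub-addressing mentioned above):
static string Canonical(string email)
{
    var parts = email.ToLower().Split('@');
    if (parts.Length != 2) return email.ToLower();

    var local = parts[0];
    var plus = local.IndexOf('+');
    if (plus >= 0) local = local.Substring(0, plus);            // drop sub-address: user+tag -> user

    if (parts[1] == "gmail.com" || parts[1] == "googlemail.com")
        local = local.Replace(".", "");                         // Gmail ignores dots in the local part

    return local + "@" + parts[1];
}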
I don't know what your application does, but if there are other key points, then use those to filter down which addresses you are going to compare.
Sort everything into a hashtable first. The key should be the domain name of the email; "gmail.com". Strip out special characters from the values, as was mentioned above.
Then check all the gmail.com's against one another. That should be much faster. Do not compare things that are more than 3 characters different in length.
As a second step, check all the keys against one another, and develop groupings there. (gmail.com == googlemail.com, for example.)
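For example, a hedged sketch of that bucketing step (again reusing inputWords from the question's code):
// Bucket addresses by domain so comparisons only happen within each bucket.
var byDomain = inputWords
    .Select(e => e.ToLower().Split('@'))
    .Where(p => p.Length == 2)
    .GroupBy(p => p[1], p => p[0])            // key: domain, value: local part
    .ToDictionary(g => g.Key, g => g.ToList());

foreach (var domain in byDomain)
{
    // compare domain.Value entries against one another here,
    // skipping pairs that differ in length by more than 3 characters
}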
I agree with the other comments about comparing email addresses not being too helpful, since users could just as well create fraudulent, dissimilar-looking addresses.
I think it would be better to come up with other solutions, such as limiting the number of emails you can enter per hour/day, or the time between those addresses being received by you and being sent to the users. Basically, work it out in a way where it is comfortable to send a few invites per day, but a PITA to send out many. I guess most users would forget or give up if they had to do it over a relatively long period of time in order to get their freebies.
Is there any way you can do a check on the IP address of the person creating the email? That would be a simple way to determine, or at least give you added information about, whether the different email addresses have come from the same person.
