So do I just ignore this code analysis warning by suppressing it? Or is there a way to truly fix it?
Here is a user story that comes close to mine, but I changed it slightly so that company information isn't on the site...
Say I have a website for a company that ships to 15 countries, and they want to show the names of those countries in the user's language of choice from the appropriate resources.resx file.
Now my "options" in a list are more complex than just a name/value or key/value pair. So the current code has a method that returns all the options, so it might look like:
return new[]
{
new CountryOption(code1, resourceKey1, someOtherValue1),
new CountryOption(code2, resourceKey2, someOtherValue2),
new CountryOption(code3, resourceKey3, someOtherValue3),
... (repeat 12 more times so I have 15 countries)
};
Thus I get a list (IEnumerable<CountryOption>) of all the countries from which to choose.
Most such applications will read this sort of information from a database, but this data rarely changes and putting it into a database will slow the performance of the site. One could put this into a flat file to read, but again, compiled in code will be faster. Finally, we do have some unit tests to make sure this information is correct that run with each build (harder to do for information in a database).
Is the only way to reduce cyclomatic complexity for a list of known values to read it from some source outside the code? (If so, then suppressing the message is probably the right thing to do.)
As others pointed out, creating the collection by itself has a cyclomatic complexity of 1. But the objects in the collection were using a Func<string>, and that drove the complexity up: basically 2 for each item in the collection! Thanks for the help.
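For anyone hitting the same warning, here is a minimal sketch of one way around it, assuming the Func<string> was only there to defer the resource lookup. The option stores the resource key as a plain string and the lookup happens in one place. The shape of CountryOption, the Countries class, the shippingZone value and the lookup delegate (standing in for a resource lookup such as a ResourceManager call) are all made up for illustration, as are the three sample rows.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape: the option stores a resource key, not a Func<string>.
public sealed class CountryOption
{
    public CountryOption(string code, string resourceKey, int shippingZone)
    {
        Code = code;
        ResourceKey = resourceKey;
        ShippingZone = shippingZone;
    }

    public string Code { get; }
    public string ResourceKey { get; }
    public int ShippingZone { get; }

    // The only delegate call lives here, once, instead of one lambda per list item.
    public string GetDisplayName(Func<string, string> lookup)
    {
        return lookup(ResourceKey);
    }
}

public static class Countries
{
    // A plain array initializer: no branches and no lambdas per item.
    // The codes, keys and zones below are illustrative placeholders.
    public static readonly IReadOnlyList<CountryOption> All = new[]
    {
        new CountryOption("US", "Country_US", 1),
        new CountryOption("CA", "Country_CA", 1),
        new CountryOption("DE", "Country_DE", 2),
    };
}
```

The unit tests mentioned above can still run against Countries.All, and whether the analyzer then reports a lower number is something to confirm on your own build rather than take from this sketch.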
I am looking for the best data structure for the following case:
In my case I will have thousands of strings, but for this example I am going to use two. Say I have the strings "Water" and "Walter": when the letter "W" is entered, both strings should be found, and when "Wat" is entered, "Water" should be the only result. I did some research, but I am still not sure which data structure is correct for this case, and I don't want to waste time implementing the wrong one. Right now I am thinking of either a trie or a suffix tree. It seems that a trie will do the trick, but as I said, I need to be sure. The implementation should not be a problem, so I just need to know the correct structure. Feel free to let me know if there is a better choice. As you can guess, normal structures such as Dictionary/MultiDictionary would not work, as they would be a memory killer. I am also planning to implement a cache to limit memory consumption. I am sorry there is no code, but I hope I will get an answer. Thank you in advance.
You should use a trie. Tries are the foundation for one of the fastest known sorting algorithms (burstsort); they are also used for spell checking and in applications that offer text completion. You can see details here.
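To make the suggestion concrete, here is a minimal trie sketch for the "Water"/"Walter" case. The class and method names (Trie, Add, FindByPrefix) are my own, matching is case-sensitive, and no attempt is made to compress nodes, so treat it as a starting point rather than a tuned implementation.

```csharp
using System;
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord; // true when a stored word ends at this node
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        Node node = _root;
        foreach (char ch in word)
        {
            if (!node.Children.TryGetValue(ch, out Node next))
            {
                next = new Node();
                node.Children[ch] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    public List<string> FindByPrefix(string prefix)
    {
        var results = new List<string>();
        Node node = _root;
        // Walk down one node per prefix character.
        foreach (char ch in prefix)
        {
            if (!node.Children.TryGetValue(ch, out node))
            {
                return results; // no stored word starts with this prefix
            }
        }
        Collect(node, prefix, results);
        return results;
    }

    // Depth-first walk that gathers every word below the given node.
    private static void Collect(Node node, string current, List<string> results)
    {
        if (node.IsWord) results.Add(current);
        foreach (KeyValuePair<char, Node> child in node.Children)
        {
            Collect(child.Value, current + child.Key, results);
        }
    }
}
```

After adding "Water" and "Walter", FindByPrefix("W") returns both words (in no guaranteed order) and FindByPrefix("Wat") returns only "Water".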
Practically, if you want to do auto suggest, then storing up to 3-4 chars should suffice.
I mean: suggest as the user types "a", "ab" or "abc", and the moment they type "abcd" or more characters, filter the keys that start with "abcd" using a C# lambda expression.
Hence, I suggest, create a map like:
Map<char, Map<char, Map<char, Set<string>>>> map;
So, if the user enters "a", you look up map[a] and find all its children.
I have a need to check if a list of items contains a string...so kind of like the list gets filtered as the user types in a search box. So, on the text changed event, I am checking if the entered text is contained in one of the listbox items and filtering out...so
something like:
value.Contains(enteredText)
I was wondering if this is the fastest and most efficient way to filter out listbox items?
Is Contains() method the best way to search for substrings in C#?
I'd say that in all but very exceptional circumstances it's fast and efficient enough, and even in such circumstances it's likely to be a purely academic problem. I'd be surprised if you came across a bottleneck caused by this; only then would it be worth looking at, and even then chances are you'll end up looking elsewhere.
Contains is one of the cheapest methods in my code completion filtering algorithm (Part 6 #6, where #7 and the fuzzy logic matching described in the footnote are vastly more expensive), which doesn't have problems keeping up with even a fast typing user and thousands of items in the dropdown.
I highly doubt it will cause you problems.
Although this is not the fastest option globally, it is the fastest one for which you do not need to code anything. It should be sufficient for filtering drop-down items.
For longer texts, you may want to go with the KMP algorithm, which has linear time complexity. Note, however, that it would not make any difference for very short search strings.
For searches that have lots of matches (e.g. ones that you get for the first one to two characters) you may want to precompute a table that maps single letters and letter pairs to the rows in your drop-down list for a much faster look-up at the expense of using more memory (a pretty standard tradeoff in programming in general).
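A rough sketch of that precomputed table, under a few assumptions of mine: matching is case-insensitive, the table is keyed by every single letter and every adjacent letter pair that occurs in an item, and queries of three or more characters fall back to a plain scan. ShortQueryIndex and Filter are made-up names.

```csharp
using System;
using System.Collections.Generic;

public sealed class ShortQueryIndex
{
    private readonly IReadOnlyList<string> _items;
    // Key: one letter or two adjacent letters (upper-cased). Value: row numbers containing it.
    private readonly Dictionary<string, List<int>> _rows = new Dictionary<string, List<int>>();

    public ShortQueryIndex(IReadOnlyList<string> items)
    {
        _items = items;
        for (int row = 0; row < items.Count; row++)
        {
            string text = items[row].ToUpperInvariant();
            var seen = new HashSet<string>(); // avoids listing the same row twice per key
            for (int i = 0; i < text.Length; i++)
            {
                string single = text.Substring(i, 1);
                if (seen.Add(single)) AddRow(single, row);
                if (i + 1 < text.Length)
                {
                    string pair = text.Substring(i, 2);
                    if (seen.Add(pair)) AddRow(pair, row);
                }
            }
        }
    }

    private void AddRow(string key, int row)
    {
        if (!_rows.TryGetValue(key, out List<int> list))
        {
            list = new List<int>();
            _rows[key] = list;
        }
        list.Add(row);
    }

    public List<string> Filter(string query)
    {
        var matches = new List<string>();
        if (query.Length == 0)
        {
            matches.AddRange(_items); // empty query: show everything
            return matches;
        }
        if (query.Length <= 2)
        {
            // Short queries are answered straight from the table.
            if (_rows.TryGetValue(query.ToUpperInvariant(), out List<int> rows))
            {
                foreach (int row in rows) matches.Add(_items[row]);
            }
            return matches;
        }
        // Longer queries match few rows, so a plain scan is usually fine.
        foreach (string item in _items)
        {
            if (item.IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0) matches.Add(item);
        }
        return matches;
    }
}
```

The table costs memory proportional to the number of distinct letters and pairs per row, which is the tradeoff described above.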
I want to do a super fast geocode lookup, returning co-ordinates for an input of Town, City or Country. My knowledge is basic but from what I understand writing it in C is a good start. I was thinking it makes sense to have a tree structure like this:
England
    Kent
        Orpington
        Chatam
        Rochester
        Dover
        Edenbridge
    Wiltshire
        Swindon
        Malmsbury
In my file / database I will have the co-ordinate and the town/city name. If I give my program the name "Kent", I want it to return the co-ordinate associated with "Kent" in the fastest way possible.
Should I store the data in a binary file or a SQL database for performance reasons?
What is the best method of searching this data? Perhaps binary tree searching?
How should the data be stored? perhaps?
Here's a little advice, but not much more than that:
If you want to find places by name, or name prefix, as you indicate that you wish to, then you would be ill-advised to set up a data structure which stores the data in a hierarchy of country, region, town as you suggest you might. If you have an operation that dominates the use of your data structure you are generally best picking the data structure to suit the operation.
In this case an alphabetical list of places would be more suited to your queries. To each place not at the topmost level you would want to add some kind of reference to the name of its 'parent'. If you have an alphabetical list of places you might also want to consider an index, perhaps one which points directly to the first place in the list which starts with each letter of the alphabet.
As you describe your problem it seems to have much more in common with storing words in a dictionary (I mean the sort of thing in which you look up words rather than any particular collection data-type in any specific programming language which goes under the same name) than with most of what goes under the guise of geo-coding.
My guess would be that a gazetteer including the names of all the world's towns, cities, regions and countries (and their coordinates) which have a population over, say, 1000, could be stored in a very simple data structure (basically a list) with an index or two for rapid location of the first A place-name, the first B, and so on. With a little compression you could probably hold this in the memory of most modern desktop PCs.
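The alphabetical-list idea can be sketched as below. Note that I have swapped the per-letter index for a binary search over the sorted names, which serves the same purpose of jumping straight to the first matching place. Gazetteer, TryGetCoordinate and NamesStartingWith are hypothetical names, the comparison is case-insensitive by my own choice, and the parent reference is left out to keep it short.

```csharp
using System;
using System.Collections.Generic;

public sealed class Gazetteer
{
    private static readonly StringComparer Cmp = StringComparer.OrdinalIgnoreCase;
    private readonly string[] _names; // sorted with Cmp
    private readonly double[] _lat;   // _lat[i] and _lon[i] belong to _names[i]
    private readonly double[] _lon;

    public Gazetteer(IEnumerable<(string Name, double Lat, double Lon)> places)
    {
        var sorted = new List<(string Name, double Lat, double Lon)>(places);
        sorted.Sort((a, b) => Cmp.Compare(a.Name, b.Name));
        _names = new string[sorted.Count];
        _lat = new double[sorted.Count];
        _lon = new double[sorted.Count];
        for (int i = 0; i < sorted.Count; i++)
        {
            _names[i] = sorted[i].Name;
            _lat[i] = sorted[i].Lat;
            _lon[i] = sorted[i].Lon;
        }
    }

    // Exact-name lookup in O(log n).
    public bool TryGetCoordinate(string name, out double lat, out double lon)
    {
        int i = Array.BinarySearch(_names, name, Cmp);
        if (i < 0)
        {
            lat = 0;
            lon = 0;
            return false;
        }
        lat = _lat[i];
        lon = _lon[i];
        return true;
    }

    // Prefix lookup: binary-search the first name >= prefix, then read forward.
    public List<string> NamesStartingWith(string prefix)
    {
        int lo = 0, hi = _names.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (Cmp.Compare(_names[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        var result = new List<string>();
        for (int i = lo; i < _names.Length &&
             _names[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase); i++)
        {
            result.Add(_names[i]);
        }
        return result;
    }
}
```

If two places share a name, TryGetCoordinate returns whichever one the binary search lands on, so a real gazetteer would need the parent reference to tell them apart.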
I think the best advice I can give is to use whatever language you are familiar with to get the results you want. Worry about performance once your code works. Then you can look at translating very specific pieces of functionality into C or C++ one at a time until you have the results you want.
You should not worry about how the information is stored, except not to duplicate data.
You should create one or more indices for the data. The indices are associative array / map data structures that contain a key (the item you want to search) and a value (such as the record and other information associated with the key). This will give you fast lookups without altering your data for each type of search.
On the other hand, your case is an excellent fit for a database. I suggest you let the database manage your data (such as efficient lookups). After all, that is what they live for.
See also: At what point is it worth using a database?
I'm looking for some suggestions on better approaches to handling a scenario with reading a file in C#; the specific scenario is something that most people wouldn't be familiar with unless you are involved in health care, so I'm going to give a quick explanation first.
I work for a health plan, and we receive claims from doctors in several ways (EDI, paper, etc.). The paper form for standard medical claims is the "HCFA" or "CMS 1500" form. Some of our contracted doctors use software that allows their claims to be generated and saved in a HCFA "layout", but in a text file (so, you could think of it like being the paper form, but without the background/boxes/etc). I've attached an image of a dummy claim file that shows what this would look like.
The claim information is currently extracted from the text files and converted to XML. The whole process works ok, but I'd like to make it better and easier to maintain. There is one major challenge that applies to the scenario: each doctor's office may submit these text files to us in slightly different layouts. Meaning, Doctor A might have the patient's name on line 10, starting at character 3, while Doctor B might send a file where the name starts on line 11 at character 4, and so on. Yes, what we should be doing is enforcing a standard layout that must be adhered to by any doctors that wish to submit in this manner. However, management said that we (the developers) had to handle the different possibilities ourselves and that we may not ask them to do anything special, as they want to maintain good relationships.
Currently, there is a "mapping table" set up with one row for each different doctor's office. The table has columns for each field (e.g. patient name, Member ID number, date of birth etc). Each of these gets a value based on the first file that we received from the doctor (we manually set up the map). So, the column PATIENT_NAME might be defined in the mapping table as "10,3,25" meaning that the name starts on line 10, at character 3, and can be up to 25 characters long. This has been a painful process, both in terms of (a) creating the map for each doctor - it is tedious, and (b) maintainability, as they sometimes suddenly change their layout and then we have to remap the whole thing for that doctor.
The file is read in, line by line, and each line added to a
List<string>
Once this is done, we do the following, where we get the map data and read through the list of file lines and get the field values (recall that each mapped field is a value like "10,3,25" (without the quotes)):
ClaimMap M = ClaimMap.GetMapForDoctor(17);
List<HCFA_Claim> ClaimSet = new List<HCFA_Claim>();
// Claims is List<List<string>>, where we have a List<string> for each claim in the text file
// (it can have more than one, and the file is split up into separate claims earlier in the process)
foreach (List<string> cl in Claims)
{
    HCFA_Claim c = new HCFA_Claim();
    c.Patient = new Patient();
    c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1].Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2])).Trim();
    //...and so on...
    ClaimSet.Add(c);
}
Sorry this is so long...but I felt that some background/explanation was necessary. Are there any better/more creative ways of doing something like this?
Given the lack of standardization, I think your current solution although not ideal may be the best you can do. Given this situation, I would at least isolate concerns e.g. file read, file parsing, file conversion to standard xml, mapping table access etc. to simple components employing obvious patterns e.g. DI, strategies, factories, repositories etc. where needed to decouple the system from the underlying dependency on the mapping table and current parsing algorithms.
You need to work on the DRY (Don't Repeat Yourself) principle by separating concerns.
For example, the code you posted appears to have an explicit knowledge of:
how to parse the claim map, and
how to use the claim map to parse a list of claims.
So there are at least two responsibilities directly relegated to this one method. I'd recommend changing your ClaimMap class to be more representative of what it's actually supposed to represent:
public class ClaimMap
{
    public ClaimMapField Name { get; set; }
    ...
}
public class ClaimMapField
{
    public int StartingLine { get; set; }
    // I would have the parser subtract one when creating this, to make it 0-based.
    public int StartingCharacter { get; set; }
    public int MaxLength { get; set; }
}
Note that the ClaimMapField represents in code what you spent considerable time explaining in English. This reduces the need for lengthy documentation. Now all the M.Name.Split calls can actually be consolidated into a single method that knows how to create ClaimMapFields out of the original text file. If you ever need to change the way your ClaimMaps are represented in the text file, you only have to change one point in code.
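The "single method that knows how to create ClaimMapFields" could look roughly like this. The static Parse method, its error message and the choice of invariant-culture number parsing are assumptions of mine; the "line,character,length" layout with 1-based positions comes from the question.

```csharp
using System;
using System.Globalization;

public class ClaimMapField
{
    public int StartingLine { get; set; }      // 0-based after parsing
    public int StartingCharacter { get; set; } // 0-based after parsing
    public int MaxLength { get; set; }

    // Parses a mapping value such as "10,3,25":
    // 1-based line, 1-based starting character, maximum length.
    public static ClaimMapField Parse(string mapping)
    {
        string[] parts = mapping.Split(',');
        if (parts.Length != 3)
        {
            throw new FormatException("Expected 'line,character,length' but got: " + mapping);
        }
        return new ClaimMapField
        {
            StartingLine = int.Parse(parts[0].Trim(), CultureInfo.InvariantCulture) - 1,
            StartingCharacter = int.Parse(parts[1].Trim(), CultureInfo.InvariantCulture) - 1,
            MaxLength = int.Parse(parts[2].Trim(), CultureInfo.InvariantCulture),
        };
    }
}
```

With this in place the mapping string is split and converted once per field when the map is loaded, instead of three times per field per claim inside the loop.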
Now your code could look more like this:
c.Patient.FullName = cl[map.Name.StartingLine].Substring(map.Name.StartingCharacter, map.Name.MaxLength).Trim();
c.Patient.Address = cl[map.Address.StartingLine].Substring(map.Address.StartingCharacter, map.Address.MaxLength).Trim();
...
But wait, there's more! Any time you see repetition in your code, that's a code smell. Why not extract out a method here:
public string ParseMapField(ClaimMapField field, List<string> claim)
{
    return claim[field.StartingLine].Substring(field.StartingCharacter, field.MaxLength).Trim();
}
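One thing to watch with a method like this: Substring throws an ArgumentOutOfRangeException when the line is shorter than the starting character plus the length, and the question says a field "can be up to" its mapped length. A defensive variant might look like the sketch below. ClaimText.Extract is a made-up helper, and returning an empty string for missing data is my assumption; you may prefer to log or throw instead.

```csharp
using System;
using System.Collections.Generic;

public static class ClaimText
{
    // Returns the trimmed field text, or "" when the line or column does not exist.
    // line and start are 0-based; maxLength is the most characters to read.
    public static string Extract(IReadOnlyList<string> claim, int line, int start, int maxLength)
    {
        if (maxLength <= 0) return string.Empty;
        if (line < 0 || line >= claim.Count) return string.Empty;

        string text = claim[line];
        if (start < 0 || start >= text.Length) return string.Empty;

        // Clamp the length so a short line cannot make Substring throw.
        int length = Math.Min(maxLength, text.Length - start);
        return text.Substring(start, length).Trim();
    }
}
```

ParseMapField could then delegate to it with the field's StartingLine, StartingCharacter and MaxLength.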
Now your code can look more like this:
HCFA_Claim c = new HCFA_Claim
{
    Patient = new Patient
    {
        FullName = ParseMapField(map.Name, cl),
        Address = ParseMapField(map.Address, cl),
    }
};
By breaking the code up into smaller logical pieces, you can see how each piece becomes very easy to understand and validate visually. You greatly reduce the risk of copy/paste errors, and when there is a bug or a new requirement, you typically only have to change one place in code instead of every line.
If you are only getting unstructured text, you have to parse it. If the text content changes you have to fix your parser. There's no way around this. You could probably find a 3rd party application to do some kind of visual parsing where you highlight the string of text you want and it does all the substring'ing for you but still unstructured text == parsing == fragile. A visual parser would at least make it easier to see mistakes/changed layouts and fix them.
As for parsing it yourself, I'm not sure about the line-by-line approach. What if something you're looking for spans multiple lines? You could bring the whole thing in a single string and use IndexOf to substring that with different indices for each piece of data you're looking for.
You could always use RegEx instead of Substring if you know how to do that.
While the basic approach you're taking seems appropriate for your situation, there are definitely ways you could clean up the code to make it easier to read and maintain. By separating out the functionality that you're doing all within your main loop, you could change this:
c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1].Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2])).Trim();
to something like this:
var parser = new FormParser(cl, M);
c.Patient.FullName = parser.GetName();
c.Patient.Address = parser.GetAddress();
// etc
So, in your new class, FormParser, you pass the List that represents your form and the claim map for the provider into the constructor. You then have a getter for each property on the form. Inside that getter, you perform your parsing/substring logic like you're doing now. Like I said, you're not really changing the method by which you're doing it, but it certainly would be easier to read and maintain and might reduce your overall stress level.
I recently launched my humble side project and would like to add a "related submissions" section when viewing a submission. Exactly like what SO is doing here - see right column, titled "Related"
Considering that each submission has a title and a set of tags, what is most effective (optimum result), most efficient (fast, memory friendly) way to query the database for related submissions?
I can think of one way to do this (which I'll post as an answer) but I'm very interested to see what others have to say. Or perhaps there's already a standard way of achieving this?
Here's my two cent solution:
To achieve the best output, we need to put “weight” on the query results.
To start with, each submission in the database is assumed to have a weight of zero.
Then, if a submission in the "pool" shares one tag with the current submission, we'd add +3 to the found submission. Hence, if another submission is found that shares two tags with the current submission, we add +6 to the weight.
Next, we split/tokenize the title of the current submission and remove “stop words”.
I’ve seen a list of stop words from Google, but for now I’ll define my stop words to be: [“of”, “a”, “the”, “in”]
Example:
Title: “The Best Submission of All Times”
Resulting array: [“The”, “Best”, “Submission”, “of”, “All”, “Times”]
Remove stop words: [“Best”, “Submission”, “All”, “Times”]
Then we query the database for submissions whose titles contain any of the remaining words, and for each result we add the weight: +2
And finally sort the list descending by weight and take the top N results.
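Here is the weighting scheme as an in-memory sketch, so the scoring can be checked before it is turned into a database query. The Submission class and the RelatedSubmissions helper are made up for illustration. I have assumed that +2 is added once per shared title word (the description could also be read as +2 once per matching submission), that stop-word matching ignores case, and that submissions with a weight of zero are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Submission
{
    public string Title { get; set; }
    public HashSet<string> Tags { get; set; }
}

public static class RelatedSubmissions
{
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(new[] { "of", "a", "the", "in" }, StringComparer.OrdinalIgnoreCase);

    // Splits a title on spaces and drops stop words (case-insensitive).
    public static HashSet<string> TitleWords(string title)
    {
        IEnumerable<string> words = title
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !StopWords.Contains(w));
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }

    // +3 per shared tag, +2 per shared title word (the per-word reading is an assumption).
    public static int Weight(Submission current, Submission candidate)
    {
        int sharedTags = candidate.Tags.Count(t => current.Tags.Contains(t));
        HashSet<string> currentWords = TitleWords(current.Title);
        int sharedWords = TitleWords(candidate.Title).Count(w => currentWords.Contains(w));
        return 3 * sharedTags + 2 * sharedWords;
    }

    // Sorts the pool by weight, descending, and keeps the top n with a non-zero weight.
    public static List<Submission> TopRelated(Submission current, IEnumerable<Submission> pool, int n)
    {
        return pool
            .Where(s => !ReferenceEquals(s, current))
            .Select(s => new { Submission = s, Weight = Weight(current, s) })
            .Where(x => x.Weight > 0)
            .OrderByDescending(x => x.Weight)
            .Take(n)
            .Select(x => x.Submission)
            .ToList();
    }
}
```

Scoring every submission in memory like this will not scale to a large table, so in practice the tag and word matches would have to be pushed into the query itself.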
What do you think? (be gentle!)
If I understand correctly, you need a technique to find whether two posts are "similar" to each other. You may want to use a probabilistic model for that:
http://en.wikipedia.org/wiki/Mutual_information
The idea would be to say that if two posts share a lot of "uncommon" words, they are probably speaking on the same topic. For detecting uncommon words, depending on your application, you may use a general table of frequencies, or maybe better, build it yourself on the universe of the words of your posts (but you will need to have enough of them to have something relevant).
I would not limit myself to the title and tags, but I would weight them more heavily in the search.
This kind of idea is very common in spam filtering. I unfortunately don't have the time to do a full review, but a quick Google search gives:
http://www.aclweb.org/anthology/P/P04/P04-3024.pdf
karlmicha.googlepages.com/acl2004_poster.pdf