In a logistics application written in C#, I need to check whether a given postal code falls within one of the ranges stored in the database.
For Germany, for example
Range: 47000-48000
Given Postal Code: 47057
Result: True
For numeric postal codes this works fine. But what about UK postal codes? W11 2BQ is an example postal code from London.
One basic idea is to convert a postal code to a number by converting each character to its ASCII code and writing the codes left to right.
So:
W11 2BQ -> 87 49 49 32 50 66 81 -> 87,494,932,506,681
So one simple postal code becomes a very big number, and that bothers me. UK postal codes vary in length (up to 8 characters), which makes the resulting number even bigger.
I use SQL Server to check whether a given postal code is in range.
Is there any official technique to deal with UK postal codes for range calculation?
Best,
Alper
'Postal codes' are a completely arbitrary system. Instead of attempting to code against it (unless your objective is to write a comprehensive postal code comprehension library), I would strongly recommend finding/paying for/stealing a library that can do what you're asking for.
If it helps, the basic rules for UK postal codes are (as far as I remember from ones I've seen):
[A-Z]{1,2} - 1- or 2-letter code representing the sorting depot in the locality
\d[A-Z\d]? - subdivision of the sorting depot's jurisdiction: one digit, optionally followed by a second digit or a letter (as in W11 or EC1A)
[mandatory space]
\d[A-Z]{2} - alphanumeric code for a contiguous region occupied by a group of 10-100 addresses
It wouldn't surprise me if my summarisation is wrong/incomplete. All the postal codes I've seen are in the form I mention, but I don't know the actual rules, so there could be others in a different format. It's included merely to give a broad appraisal of the nature of the system.
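To make the rules above concrete, here is a minimal Python sketch (not an official technique) that splits a postcode into its parts and compares the parts as a tuple instead of building one huge number. The pattern and the names `postcode_key` and `in_range` are made up for this illustration. The pattern assumes a district of one or two digits with an optional trailing letter (as in W11 or EC1A); real postcodes have exceptions it does not cover.

```python
import re

# Assumed layout: area letters, district digits, optional district letter,
# optional space, sector digit, two unit letters.
POSTCODE_RE = re.compile(r"^([A-Z]{1,2})(\d{1,2})([A-Z]?)\s*(\d)([A-Z]{2})$")


def postcode_key(code):
    """Split a postcode into a tuple that sorts in a sensible order."""
    match = POSTCODE_RE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised postcode format: {code!r}")
    area, district, sub, sector, unit = match.groups()
    # Districts and sectors are compared as numbers, so W9 sorts before W11.
    return (area, int(district), sub, int(sector), unit)


def in_range(code, low, high):
    """True if code lies between low and high (inclusive)."""
    return postcode_key(low) <= postcode_key(code) <= postcode_key(high)
```

In a database, the same idea would mean storing the parts in separate columns (or one zero-padded, normalised column) so that a plain BETWEEN works.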
The GeoNames Web Service might be a good place to start. It should be possible to validate postal codes through their API somehow. I think you can export the database too, if you want to write up your own validation logic.
I don't think converting the zip code into ASCII is a good idea. The reason is quite obvious. If you have an ASCII-converted value like 1210121 (just an example), the problem is whether to split it as (12) (10) (121) or as (12) (101) (21). This seems to be a lot of work for very little gain.
Although, couldn't you use SQL??
select * from ZIPTable where zipcode IN ('G4543','G3543')
Note: You can get the zipcode values from a subquery.
I've Google'd and read quite a bit on QR codes and the maximum data that can be used based on the various settings, all of it being in tabular format. I can't seem to find anything giving a formula or a proper explanation of how these values are calculated.
What I would like to do is this:
Present the user with a form, allowing them to choose Format, EC & Version.
Then they can type in some data and generate a QR code.
Done deal. That part is easy.
The addition I would like to include is a "remaining character count" so that they (the user) can see how much more data they can type in, as well as what effect the properties have on the storage capacity of the QR code.
Does anyone know where I can find the formula(s)? Or do I need to purchase ISO 18004:2006?
A formula to calculate how much data fits in a QR code would be quite complex, not to mention that it would need some approximations to be workable at all. It would have to calculate the number of modules dedicated to data based on the symbol's version, and then how many codewords (sets of 8 modules) are used for error correction.
To calculate the number of modules that will be used for the data, you need to know how many modules are used by the function patterns. This is not a problem for the three finder patterns, the timing patterns or the version/format information, but the alignment patterns are a problem because their number depends on the QR code's version, meaning you would have to use a table at that point anyway.
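For what it's worth, the number of alignment patterns (though not their exact positions) does follow a simple rule: from version 2 upward there are version // 7 + 2 alignment coordinates per axis. Under that assumption, the count of modules left for data and error correction can be sketched as below. The function name is made up for this illustration, and the split between data and error-correction codewords still needs the table.

```python
def raw_data_modules(version):
    """Modules available for data plus error correction in a QR symbol."""
    if not 1 <= version <= 40:
        raise ValueError("version must be between 1 and 40")
    side = 4 * version + 17           # modules per side
    count = side * side
    count -= 3 * 64                   # three finder patterns with separators
    count -= 31                       # format information plus the dark module
    count -= 2 * (side - 16)          # two timing patterns between the finders
    if version >= 2:
        per_axis = version // 7 + 2   # alignment coordinates per axis
        count -= 25 * (per_axis * per_axis - 3)  # 5x5 patterns, none on finders
        count += 10 * (per_axis - 2)  # overlap with timing, subtracted twice
    if version >= 7:
        count -= 36                   # two version information blocks
    return count
```

Dividing the result by 8 (integer division) gives the total number of codewords for that version; the leftover modules are remainder bits.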
For the second part, I have to say I don't know how to calculate the number of error correcting codewords from the correction capacity. For some reason, more error correcting codewords are used than should be needed to match the error correction capacity. For example, a 6-H QR code can correct up to 32.6% of the data, instead of the 30% set by the H correction level.
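One plausible explanation for the 32.6% figure: Reed-Solomon codes need two error-correction codewords for every codeword error they correct, and the codewords are allocated per block in whole numbers, so the achieved rate only approximates the nominal level. The sketch below assumes a 6-H symbol is split into blocks of 43 codewords with 15 data codewords each (figures quoted from the capacity table from memory, so treat them as an assumption). It also ignores the extra misdecode-protection codewords that a few of the smallest versions reserve.

```python
def correctable_fraction(block_total, block_data):
    """Fraction of a block's codewords that Reed-Solomon can correct."""
    ec_codewords = block_total - block_data
    # Two error-correction codewords are spent per corrected codeword error.
    correctable = ec_codewords // 2
    return correctable / block_total


# Assumed block structure for version 6, level H: 43 total, 15 data.
fraction_6h = correctable_fraction(43, 15)
```

With those block sizes, 14 of 43 codewords can be corrected, which matches the 32.6% quoted above.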
In any case, as you can see a formula would be quite complex to implement. Using a table like already suggested is probably the best thing you could do.
I wrote the original AIM specification for QR Code back in the '90s for Denso Corporation, and was also project editor for both editions of the ISO/IEC 18004 standard. It was felt to be much easier for people producing code printing software to use a look-up table rather than calculate capacities from a formula - no easy job as there are several independent variables that have to be taken into account iteratively when parsing the text to be encoded to minimise its length in bits, in order to achieve the smallest symbol. The most crucial factor is the mix of characters in the data, the sequence and lengths of sub-strings of numeric, alphanumeric, Kanji data, with the overhead needed to signal each change of character set, then the required level of error correction. I did produce a guidance section for this which is contained in the ISO standard.
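To illustrate how the character mix drives the bit count, here is a rough Python sketch of the bits one segment consumes in the numeric, alphanumeric and byte modes. Kanji mode and the choice of where to switch modes are left out. The function names are made up for this illustration, and a "remaining characters" counter would still need the data codeword count per version and level from the table.

```python
# Width of the character count indicator for versions 1-9, 10-26 and 27-40.
COUNT_BITS = {
    "numeric": (10, 12, 14),
    "alphanumeric": (9, 11, 13),
    "byte": (8, 16, 16),
}


def count_indicator_bits(mode, version):
    band = 0 if version <= 9 else 1 if version <= 26 else 2
    return COUNT_BITS[mode][band]


def segment_bits(mode, length, version):
    """Bits used by one segment: mode indicator + count indicator + data."""
    if mode == "numeric":
        # 10 bits per group of three digits, 7 for two left over, 4 for one.
        data = 10 * (length // 3) + (0, 4, 7)[length % 3]
    elif mode == "alphanumeric":
        # 11 bits per pair of characters, 6 for a single trailing character.
        data = 11 * (length // 2) + 6 * (length % 2)
    elif mode == "byte":
        data = 8 * length
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return 4 + count_indicator_bits(mode, version) + data
```

The remaining capacity is then the number of data codewords times 8, minus the bits of all segments so far.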
The storage capacity depends on the QR mode and the version/type that you are using. More specifically, it depends on how 'compressible' the characters are and which encoding the QR generator is allowed to use on the content.
More information can be found at http://en.wikipedia.org/wiki/QR_code#Storage
Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will sit within the entire hex of the program. How could I take my base signature and look for partial matches, so that anything with, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code is written slightly differently but is mostly the same. So is there a way to do a percent match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and use a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
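Since the signature sits somewhere inside a much larger file, one way to apply the threshold is an edit distance that lets the match start and end anywhere in the text. The sketch below is a plain Python illustration of that idea (the function name is made up); it runs in time proportional to len(pattern) * len(text), so it is only practical for short signatures or pre-filtered regions.

```python
def best_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # Row 0 is all zeros: a match may start at any position in text for free.
    previous = [0] * (len(text) + 1)
    for i, pattern_char in enumerate(pattern, start=1):
        current = [i]  # matching i pattern characters against nothing costs i
        for j, text_char in enumerate(text, start=1):
            cost = 0 if pattern_char == text_char else 1
            current.append(min(
                previous[j] + 1,         # drop a pattern character
                current[j - 1] + 1,      # skip an extra text character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    # The match may also end anywhere, so take the best value in the last row.
    return min(previous)


def is_flagged(signature, text, min_similarity=0.9):
    distance = best_substring_distance(signature, text)
    return 1 - distance / len(signature) >= min_similarity
```

The similarity here is one minus the distance divided by the signature length, which is just one possible way to turn the distance into a percentage.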
First, your program bits should match EXACTLY, or else the program has been modified or is corrupt. Generally, you store an MD5 hash of the original binary and compare it against the hash of new versions to see whether they are the same (a matching hash makes that overwhelmingly likely, although MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time the virus runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise the search space is effectively unbounded, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/replace operations are needed to turn one string into another; the latter also counts swapping two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.
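As a rough illustration of the difference between the two, here is a Python sketch with a flag for adjacent transpositions. Note that with the flag on, this is the restricted "optimal string alignment" variant of Damerau-Levenshtein, not the unrestricted algorithm, and the function name is made up for this example.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count an adjacent swap as one edit."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        rows[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        rows[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                rows[i - 1][j] + 1,         # deletion
                rows[i][j - 1] + 1,         # insertion
                rows[i - 1][j - 1] + cost,  # match or substitution
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, rows[i - 2][j - 2] + 1)  # adjacent swap
            rows[i][j] = best
    return rows[len(a)][len(b)]
```

For a typo such as "abdc" typed instead of "abcd", the plain distance is 2 while the transposition-aware distance is 1, which is why the second variant tends to suit user-entered search terms.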
I'm developing an application for taking orders in C# and DevExpress, and I need a function that generates a unique order number. The order number must contain letters and digits and have a length of 20.
I've seen things like Guid.NewGuid(), but I don't want it to be totally random, nor just an auto-increment number.
Can anyone help? Even if it's a script in a different language, I need ideas desperately :)
You can create a format of your own.
Let's say yyyyMMddWWW-YYY-XXXXXXX, where WWW is the store number, YYY is the cashier ID and XXXXXXX is a hexadecimal number (maybe an actual auto-increment number that you convert to hex). This is just an idea. I'm afraid you have to decide the format based on the elements of your system.
Edit: applying a check digit algorithm to it will also help in avoiding mistakes.
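A minimal Python sketch of such a format, under some assumptions of my own: the hyphens are dropped and the hex field is shortened to 6 digits so that the result is exactly 20 characters, and the sequence value is expected to come from something like an auto-increment column. The function and parameter names are hypothetical.

```python
from datetime import date


def make_order_number(day, store, cashier, sequence):
    """Build a 20-character order number: yyyyMMdd + WWW + YYY + XXXXXX."""
    if not 0 <= store <= 999 or not 0 <= cashier <= 999:
        raise ValueError("store and cashier must fit in three digits")
    if not 0 <= sequence <= 0xFFFFFF:
        raise ValueError("sequence must fit in six hex digits")
    # 8 date digits, 3 store digits, 3 cashier digits, 6 uppercase hex digits.
    return f"{day:%Y%m%d}{store:03d}{cashier:03d}{sequence:06X}"


order_number = make_order_number(date(2024, 1, 31), 12, 7, 255)
```

Uniqueness then rests entirely on the sequence being unique per store, cashier and day (or globally), not on the formatting.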
Two different methods:
Create MD5 or SHA1 hash of current time
Hash of increment number
One thought comes to mind.
Take DateTime.Now.Ticks and convert it to a hexadecimal string.
Voila: String.Format("{0:X}", value);
If it is not long enough (you said you need 20 characters), you can always pad with zeros.
Get the mother board ID
Get the hdd ID
Merge it by any way
Add your secret code
Apply MD5
Apply Base64
Result: the serial code which is linked to the current client PC :)
My two cents.
If you need ideas then take a look at the Luhn and Luhn mod N algorithms.
While these algorithms are not unique code generators, they may give you some ideas on how to generate codes that can be validated (such that you could validate the code for correctness before sending it off to the database).
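For reference, a short Python sketch of the plain Luhn check digit for all-digit codes (the function names are made up). A code containing letters, such as a hex-based order number, would need the Luhn mod N variant instead.

```python
def luhn_check_digit(payload):
    """Check digit for a string of decimal digits, per the Luhn algorithm."""
    total = 0
    # Walk from the right; the rightmost payload digit is doubled first.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number):
    """True if the last digit of number is the correct Luhn check digit."""
    return (number.isdigit() and len(number) > 1
            and luhn_check_digit(number[:-1]) == int(number[-1]))
```

Validating the check digit in the UI catches most single-digit typos before the code ever reaches the database.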
As Oded suggested, a Guid is not random (well, not if you have a network card). It's based on time and location coordinates. See Raymond Chen's blog post for a detailed explanation.
You are best off using an auto-incremented int for order IDs. I don't understand why you wouldn't want to use it or, failing that, a Guid?
I can't think of any way other than an auto ID to maintain uniqueness and represent the order of your different orders in your system.
I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something similar in C#?
UPDATE: In case anyone is interested in a c# example there is a heap you can access by doing a google code search for Levenshtein distance
The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
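One simple way to give the obvious transformations a distance of zero is to normalise both names before measuring the distance. The Python sketch below does only the normalisation step; the alias and noise-word lists are tiny, made-up examples that would need to be extended for real data.

```python
import re

# Hypothetical substitution and noise-word lists for illustration only.
ALIASES = {"&": "and"}
NOISE_WORDS = {"the"}


def normalize_name(name):
    """Lower-case, unify aliases and drop noise words and punctuation."""
    tokens = re.findall(r"[a-z0-9]+|&", name.lower())
    tokens = [ALIASES.get(token, token) for token in tokens]
    kept = [token for token in tokens if token not in NOISE_WORDS]
    # Fall back to all tokens so a name made only of noise words survives.
    return " ".join(kept or tokens)
```

All three venue names from the question normalise to the same string, so an exact comparison already matches them, and the Levenshtein distance is only needed for the remaining spelling differences.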
I did something like this a while ago. I used the Discogs database (which is public domain), which also tracks artist aliases.
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower is a better match; Pig and Whistle and Pig & Whistle have a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler & less error-prone; after implementing this, I could completely remove the solution that was suggested in the other answer & had better matches.
Soundex may also be useful.
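For anyone who has not used it, here is a small Python sketch of American Soundex as it is commonly described: keep the first letter, encode the following consonants as digits, collapse repeats, and pad to four characters. Database implementations can differ in edge cases, so treat this as an illustration rather than a drop-in equivalent.

```python
# Digit for each consonant group; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name):
    """Four-character Soundex code, or an empty string if no letters."""
    letters = [char for char in name.upper() if char.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for char in letters[1:]:
        if char in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(char, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, allowing a repeated code
    return (result + "000")[:4]
```

Names that sound alike, such as Robert and Rupert, end up with the same code, which makes it a cheap first-pass filter before a more expensive comparison.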
In bioinformatics we use this to compare DNA- or protein sequences all the time.
There are plenty of algorithms, you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.
I have a list of addresses in two separate tables that are slightly off that I need to be able to match. For example, the same address can be entered in multiple ways:
110 Test St
110 Test St.
110 Test Street
Although this example is simple, you can imagine the situation in more complex scenarios. I am trying to develop a simple algorithm that will be able to match the above addresses using a key.
For example, the key might be "11TEST": the first two characters of 110, the first two of Test and the first two of the street variant. A full match key would also include the first 5 digits of the zip code, so in the above example the full key might look like "11TEST44680".
I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.
We are only concerned with US addresses. In fact, we are only looking at addresses from 250 zip codes in Ohio and Michigan. We also do not have access to any postal software, although we would be open to ideas for cost-effective solutions (it would essentially be a one-time use). Please be mindful that this is an initial dump of data from a government source, so suggestions on how users can clean it are helpful as I build out the application, but I would love to get the best initial result I possibly can by matching addresses as well as possible.
I'm working on a similar algorithm as we speak, it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a 3 field plaintext format [whoever thought that was a good idea should be shot IMHO], so trying to handle rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, spelling mistakes is no small or simple task.
Spelling mistakes alone was no small feat - especially when you get to countries that use French names - matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes with or without period or hyphenation to the larger part of a name cause no end of performance issues - especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?
One place to start your search is the Levenshtein distance algorithm, which I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.
I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.
Cheers -
[ben at afsinc dot ca]
If you would prefer not to develop one and would rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm
Disclaimer: I had a role in its development and work for the company.
In the UK we would use:
House Name or Number (where name includes Flat number for apartment blocks)
Postcode
You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.
Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.
If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.
If I was to take a crack at this I'd convert each address string into a tree using a pre-defined order of operations.
Eg. 110 Test Street Apt 3. Anywhere California 90210 =>
Get the type of address. E.g. street addresses have different formats than rural route addresses, and this differs by country.
Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
Given that this is a street address, pull out the street name (store in lower case)
Given that this is a street address, pull out the street number
Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)
          eStreet         //1. an enum of possible address types, e.g. eStreet, eRuralRoute, ...
             |
          eStreet         //2. an enum of street types, e.g. eStreet, eBlvd, eWay, ...
        /    |    \
    Name  Number  Apt
     |       |     |
    test    110    3
Eg. RR#3 Anywhere California 90210 =>
Get the type of address: rural route
Given that this is a rural route address, get the route number
    eRuralRoute
         |
         3
You'll need to do something similar for country state and zip information.
Then compare the resulting trees.
This makes the comparison very simple; however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses, as already mentioned, are quite different, and Canadian addresses may have French in them (e.g. Place d'Armes, Rue Laurent, etc.)
If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.
My company uses AccuMail Gold for this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.
If you don't choose to use an existing system, one idea is to do the following:
Extract numbers from the address line
replace common street words with blanks
create match string
ie: "555 Canal Street":
Extract number gives "555" + "Canal Street"
Replace street words gives "555" + "Canal"
Create match string gives "555Canal"
"Canal st 555" would give the same match string.
By street words I mean words and abbreviations for "street" in your language, for example "st", "st.", "blv", "ave", "avenue", etc. All of them are removed from the string.
By extracting numbers and separating them from the string it does not matter if they are first or last.
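The three steps above can be sketched in Python as follows. The street-word list is a short made-up sample, and the key is upper-cased (unlike the "555Canal" example) so that differences in capitalisation do not matter; both are assumptions of this sketch.

```python
import re

# Hypothetical, incomplete list of street words and abbreviations.
STREET_WORDS = {
    "st", "street", "blv", "blvd", "ave", "avenue",
    "rd", "road", "dr", "drive",
}


def match_key(address):
    """Numbers first, then the remaining words with street words removed."""
    # Tokens are runs of letters or runs of digits; punctuation is dropped.
    tokens = re.findall(r"[A-Za-z]+|\d+", address)
    numbers = [token for token in tokens if token.isdigit()]
    words = [
        token.upper() for token in tokens
        if not token.isdigit() and token.lower() not in STREET_WORDS
    ]
    return "".join(numbers) + "".join(words)
```

Note that blindly dropping "St" also removes it where it means "Saint", so names like "St Clair Ave" lose part of the street name; whether that matters depends on the data.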
Use an identity for the primary key. It will always be unique and will make it easier to merge duplicates later.
Force proper data entry with the user interface. Make users enter each component in its own text box: the house number in its own box, the street name in its own box, the city in its own box, the state from a select list, etc. This will make looking for matches easier.
Have a two-step "save" process:
After the initial save, search for matches and present the user with a list of possible matches as well as the new entry.
If they select the new one, save it; if they pick an existing one, use that ID.
Clean the data. Try to strip out "street", "st", "drive", etc. and store it as a StreetType char(1) that uses a FK to a table containing the proper abbreviations, so you can rebuild the street.
Look into SOUNDEX and DIFFERENCE.
I have worked at large companies that maintain mailing lists, and they did not attempt to do it automatically; they used people to filter out the new from the dups because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.
You might look into the Google Maps API and see if you can pass in your address and get a match back. I'm not familiar with it; this is just speculation.