hashing sensitive data

I need to scramble the names and logins of all the users in a UAT database we have (because of the Data Protection Act).
However, there is a catch.
The testers still need to be able to log in using the hashed login names,
so if a user login is "Jesse.J.James" then the hash should be something like
Ypois.X.Qasdf
i.e. approximately the same length, with the dots in the same place.
So MD5, SHA-1 etc. would not be suitable, as they would create very long strings, and their usual Base64 encoding contains characters such as + and = which are not allowed by the validation regex.
So I'm looking for some suggestions as to how to achieve this.
I guess I need to roll my own hashing algorithm.
Has anyone done anything similar?
I am using C# but I guess that is not so important to the algorithm.
Thanks a lot.
ADDED -
Thanks for all the answers. I think I am responsible for the confusion by using the word "hash" when that is not what needed to be done.

Testers should NOT be logging in as legitimate users. That would clearly violate the non-repudiation requirement of whatever data protection act you're working under.
The system should not allow anyone to log in using the hashed value. That defeats the whole purpose of hashing!
I'm sorry I am not answering your specific question, but I really think your whole testing system should be reevaluated.
ADDED:
The comments below by JPLemme shed a lot of light on what you are doing, and I'm afraid that I completely misunderstood (as did those who voted for me, presumably).
Part of the confusion comes from the fact that hashes are typically used to scramble passwords so that no one, including those working on the system, can discover another person's password. That is, evidently, the wrong context (and now I understand why you are hashing usernames instead of just passwords). As JPLemme has pointed out, you are actually working with a completely separate parallel system into which live data has been copied and anonymized, and the secure login process that uses hashed (and salted!) passwords will not be affected.
In that case, WW's answer below is more relevant, and I recommend that everyone give their upvotes to him/her instead. I'm sorry I misunderstood.

You do not need to hash the data. You should just randomize it so it has no relation to the original data.
For example, update all the login names, and replace each letter with another random letter.
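
As a rough illustration of the randomizing idea above, here is a minimal C# sketch. The class and method names (`LoginScrambler`, `Scramble`) are made up for this example, and it assumes logins only need letters and digits replaced while dots and other punctuation stay where they are.

```csharp
using System;
using System.Text;

public static class LoginScrambler
{
    // Replaces each letter with a random letter of the same case and each
    // digit with a random digit. Dots and other characters are left in place,
    // so the result has the same length and shape as the input.
    public static string Scramble(string login, Random rng)
    {
        var sb = new StringBuilder(login.Length);
        foreach (char c in login)
        {
            if (c >= 'a' && c <= 'z') sb.Append((char)('a' + rng.Next(26)));
            else if (c >= 'A' && c <= 'Z') sb.Append((char)('A' + rng.Next(26)));
            else if (c >= '0' && c <= '9') sb.Append((char)('0' + rng.Next(10)));
            else sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling `LoginScrambler.Scramble("Jesse.J.James", new Random())` gives a different string with dots in the same positions on each run. Two users could end up with the same scrambled login, so a real update script would probably need a uniqueness check, and the testers would need to be told the new logins.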

I think you are taking the wrong approach here. The idea of a hash is that it is one-way; no one should be able to use that hash to access the system (and if they can, then you are likely still in violation of the Data Protection Act). Also, testers should not be using real accounts unless those accounts are their own.
You should have the testers using mock accounts in a separated environment. By using mock accounts in a separate environment there is no danger in giving the testers the account information.

Generally speaking, it is ill advised to roll your own encryption/hashing algorithms. The existing algorithms do what they do for a reason.
Would it really be so bad to either give the testers an access path that hashed the user names for them or just have them copy/paste SHA-1 hashes?

Hashes are one-way, by definition.
If all you are trying to protect against is casual perusal of the data (so the encryption level is low), do something simple like a substitution cipher (a 1-1 mapping of characters to one another -- A becomes J, B becomes '-', etc). Or even just shift everything by one (IBM becomes HAL).
But do recognize that this is by no means a guarantee of privacy or security. If those are qualities you are looking for, you can't have testers impersonating real users, by definition.
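
The shift-by-one idea can be sketched in a few lines of C#. The names `ShiftCipher` and `Shift` are invented for this example, and it assumes only ASCII letters are shifted, so dots stay where they are.

```csharp
using System;
using System.Text;

public static class ShiftCipher
{
    // Shifts every ASCII letter by `offset` positions, wrapping around the
    // alphabet and preserving case. Non-letters are copied unchanged.
    public static string Shift(string text, int offset)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c >= 'A' && c <= 'Z') sb.Append(Rotate(c, 'A', offset));
            else if (c >= 'a' && c <= 'z') sb.Append(Rotate(c, 'a', offset));
            else sb.Append(c);
        }
        return sb.ToString();
    }

    private static char Rotate(char c, char first, int offset)
    {
        // The double modulo keeps the result in 0..25 for negative offsets too.
        int shifted = ((c - first + offset) % 26 + 26) % 26;
        return (char)(first + shifted);
    }
}
```

`ShiftCipher.Shift("IBM", -1)` returns "HAL", and shifting by +1 reverses it. As the answer says, anyone who sees a few examples can undo this, so it only guards against casual reading.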

Did this recommendation go through your organization's auditing department? If not, you might want to talk to them; it's not at all clear that the scheme you're using protects your organization from liability.

Why not use a test data generator for the data that could identify an individual?
Creating test data in a database

To give you some more information:
I need to test a DTS package that imports all the users of the system from a text file into our database. I will be given the live data.
However, once the data is in the database it must be scrambled so that it doesn't make sense to the casual reader but still allows testers to log in to the system.

thanks for all the answers. I think you are almost certainly right about our test strategy being wrong.
I'll see if I can change the minds of the powers that be

Related

Can someone hack my key license system by analysing my CLR DLLs?

My software (written in C#/.NET) has a simple key license system to activate certain resources. The way it works is: it creates a unique code based on the running computer's hardware, then mixes this value with the client's activated licenses to create a password that will, on that specific computer, unlock access to the specified resources. The key given to the client is a file containing the password.
The way it verifies this is even simpler: the software calculates the expected password, and then compares it with the password stored in the file. If they match, the resources are unlocked.
So, since the software itself calculates the correct password, I wonder if it's possible for someone to take the software's DLLs and hack them to discover the calculation method.
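
To make the concern concrete, here is a hypothetical reconstruction of such a scheme in C#. The question does not say how the hardware code and licenses are mixed, so the SHA-256 step, the `|` and `,` separators, and all names below are assumptions made only for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class LicenseCheck
{
    // Assumed derivation: hash the hardware code together with the license list.
    public static string ExpectedPassword(string hardwareCode, string[] licenses)
    {
        string material = hardwareCode + "|" + string.Join(",", licenses);
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(material));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // Compares the password from the key file with the one computed locally.
    public static bool IsActivated(string keyFileContents, string hardwareCode, string[] licenses)
    {
        return string.Equals(keyFileContents.Trim(),
                             ExpectedPassword(hardwareCode, licenses),
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

Everything needed to produce a valid key file ships inside the assembly here, so someone who decompiles it could either call the equivalent of `ExpectedPassword` themselves or patch the equivalent of `IsActivated` to always return true.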

Yes, and if you don't put effort into obfuscating your code it is trivial to do.
There will always be ways to get around any protection you put in place. The only thing you can do is make it difficult enough that an attacker gets too frustrated and decides it is not worth the time to reverse engineer your software. It is just a matter of how much time and money it is worth to you to keep that one extra person from trying.
I wrote a fairly extensive answer to a similar question here that goes over what steps you can do to mitigate the problem, but there is nothing you can do to stop it.

Unique Computer Login

I am looking for a bit of help. I realize there are many threads that explain the difficulties of uniquely identifying a computer for piracy prevention and user licensing. This situation is a bit different in that users must have an active account to log in and use the software. This option will also only be offered on request, not for every account.
The issue arises because some companies have requested admin locations instead of admin accounts. I am asking whether there is a good way to do this, or whether it will still have the same issues with changing hardware and spoofed MACs.
Some of the machines that need to be uniquely identifiable are ones we will have remote access to, while others are not.
We run on a .NET platform
The only way to use our software is active log-in.
Thanks in advance for any help provided.

I agree with other answers but have an additional suggestion:
Every Windows installation generates a unique SID... you can get it via DirectoryEntry from the objectSID item of Properties... see http://msdn.microsoft.com/en-us/library/system.directoryservices.directoryentry.aspx
hope this helps a bit...
EDIT - getting MachineSID as string (corrected as per comment):
    // Read objectSID from the first child entry of the local computer
    // (typically a local account) and take its domain part as the machine SID.
    string MachineSID = new SecurityIdentifier(
        (byte[])new DirectoryEntry(string.Format("WinNT://{0},Computer", Environment.MachineName))
            .Children.Cast<DirectoryEntry>().First()
            .InvokeGet("objectSID"),
        0).AccountDomainSid.Value;
You need to add a reference to System.DirectoryServices and make sure to have using System;, using System.Linq;, using System.Security.Principal; and using System.DirectoryServices;.

It is essentially impossible to prevent people from "spoofing" a location. So plead with the clients to allow for a layer of authentication above the "location" they request.
Short of that, you may want to take some loosely identifiable information such as the MAC, IP, or other specs and send it as an encrypted string. Anyone sniffing the network won't be able to tell what data is being sent, so they will have a harder time spoofing it if that is their goal. If they manage to decrypt the message then the data is in the open, but until they're able to read it, it provides a minor layer of security.
I still recommend against this idea, but I'm sure it can be done. There is a truckload of issues you'd be forced to deal with that exist outside the software domain itself and would complicate things much more than a strong authentication scheme. Hopefully other members here can provide good examples to use, but you don't want any false positives or false negatives (dynamic IPs getting in the way, etc.).

IMHO, there is no good way to use the hardware as a primary means of authentication. You could do something like have an admin account that should be tied to certain hardware, and then try to heuristically detect changes in hardware but that's a scenario where you have an account AND hardware instead of an account OR hardware.
All preaching aside, if you can't convince the company that it's a bad idea, what I would do is provision those certain machines with keys that you authenticate with. Then you have full control over if/when to allow those keys to authenticate, you can revoke their access, but still give the effect of it being the machine that's authenticating, and not an account. It's still got all of the advantages and flexibility of being software controlled with the same effect of being hardware-based.
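
One way to picture the provisioned-key approach is a small server-side registry, sketched below in C#. The design is an assumption: each machine gets a random secret, proves possession by returning an HMAC-SHA256 of a server challenge, and can be revoked at any time. `MachineKeyRegistry` and its members are hypothetical names, and real storage, transport and challenge generation are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public sealed class MachineKeyRegistry
{
    private readonly Dictionary<string, byte[]> _keys = new Dictionary<string, byte[]>();

    // Creates a random 256-bit key for a machine; the caller installs it there.
    public byte[] Provision(string machineId)
    {
        var key = new byte[32];
        using (var rng = RandomNumberGenerator.Create()) rng.GetBytes(key);
        _keys[machineId] = key;
        return key;
    }

    // Revoking simply forgets the key, so later responses no longer verify.
    public void Revoke(string machineId)
    {
        _keys.Remove(machineId);
    }

    // The machine answers a challenge with HMAC-SHA256(key, challenge).
    public static byte[] Respond(byte[] key, byte[] challenge)
    {
        using (var hmac = new HMACSHA256(key)) return hmac.ComputeHash(challenge);
    }

    public bool Verify(string machineId, byte[] challenge, byte[] response)
    {
        byte[] key;
        if (!_keys.TryGetValue(machineId, out key)) return false;
        byte[] expected = Respond(key, challenge);
        if (expected.Length != response.Length) return false;
        int diff = 0; // compare without exiting early on the first mismatch
        for (int i = 0; i < expected.Length; i++) diff |= expected[i] ^ response[i];
        return diff == 0;
    }
}
```

The key file can still be copied off the machine, so this gives the "it's the machine that authenticates" effect with software-level control, not real hardware binding.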

How to obfuscate string constants?

We have an application which contains sensitive information and I'm trying my best to secure it. The sensitive information includes:
The main algorithm
The keys for an encryption/decryption algorithm
I've been looking at obfuscating the code, but it doesn't seem to help much as I can still decompile it. However, my biggest concern is that the keys used for encryption of serial numbers etc. are clearly visible when you decompile the code, even when it's obfuscated.
Can anyone suggest how I can secure these strings?
I realise one of the methods might be to remove any decryption from the application itself. While this may be possible in part, there are some features which have to use encryption/decryption, mainly to save a config file and to pass an 'authorisation' token to a DLL to perform a calculation.

There are ways to do what you want, but it isn't cheap and it isn't easy.
Is it worth it?
When looking at whether to protect software, we first have to answer a number of questions:
1) How likely is this to happen?
2) What is the value to someone else of your algorithm and data?
3) What is the cost to them of buying a license to use your software?
4) What is the cost to them of replicating your algorithm and data?
5) What is the cost to them of reverse engineering your algorithm and data?
6) What is the cost to you of protecting your algorithm and data?
If these produce a significant economic imperative to protect your algorithm/data then you should look into doing it. For instance if the value of the service and cost to customers are both high, but the cost of reverse engineering your code is much lower than the cost of developing it themselves, then people may attempt it.
So, this leads on to your question
How do you secure your algorithm and data?
Discouragement
Obfuscation
The option you suggest, obfuscating the code, messes with the economics above: it tries to significantly increase the cost to them (5 above) without increasing the cost to you (6) very much. The Center for Encrypted Functionalities has done some interesting research on this. The problem is that, as with DVD encryption, it is doomed to failure: if there is enough of a differential between 3, 4 and 5, then eventually someone will do it.
Detection
Another option might be a form of Steganography, which allows you to identify who decrypted your data and started distributing it. For instance, if you have 100 different float values as part of your data, and a 1bit error in the LSB of each of those values wouldn't cause a problem with your application, encode a unique (to each customer) identifier into those bits. The problem is, if someone has access to multiple copies of your application data, it would be obvious that it differs, making it easier to identify the hidden message.
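
A minimal C# sketch of the least-significant-bit idea follows. It assumes the data values are `double`s, that a 32-bit customer identifier is enough, and that a change in the last mantissa bit is harmless to the application. `LsbWatermark` and its methods are names invented for this example.

```csharp
using System;

public static class LsbWatermark
{
    // Returns a copy of `values` in which bit i of `customerId` is stored in
    // the least significant mantissa bit of values[i], for i = 0..31.
    public static double[] Embed(double[] values, uint customerId)
    {
        if (values.Length < 32) throw new ArgumentException("Need at least 32 values.");
        var marked = (double[])values.Clone();
        for (int i = 0; i < 32; i++)
        {
            long bits = BitConverter.DoubleToInt64Bits(marked[i]);
            long bit = (customerId >> i) & 1;
            bits = (bits & ~1L) | bit;
            marked[i] = BitConverter.Int64BitsToDouble(bits);
        }
        return marked;
    }

    // Reads the identifier back from the first 32 values.
    public static uint Extract(double[] values)
    {
        uint id = 0;
        for (int i = 0; i < 32; i++)
        {
            long bits = BitConverter.DoubleToInt64Bits(values[i]);
            id |= (uint)(bits & 1) << i;
        }
        return id;
    }
}
```

Diffing two customers' copies immediately shows which values carry the mark, which is the weakness the answer points out.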
Protection
SaaS - Software as a Service
A more secure option might be to provide the critical part of your software as a service, rather than include it in your application.
Conceptually, your application would collect up all of the data required to run your algorithm, package it up as a request to a server (controlled by you) in the cloud, your service would then calculate your results and pass it back to the client, which would display it.
This keeps all of your proprietary, confidential data and algorithms within a domain that you control completely, and removes any possibility of a client extracting either.
The obvious downside is that clients are tied into your service provision, are at the mercy of your servers and their internet connection. Unfortunately many people object to SaaS for exactly these reasons. On the plus side, they are always up to date with bug fixes, and your compute cluster is likely to be higher performance than the PC they are running the user interface on.
This would be a huge step to take though, and could carry a huge cost (6 above), but it is one of the few ways to keep your algorithm and data completely secure.
Software Protection Dongles
Although traditional Software Protection Dongles would protect from software piracy, they wouldn't protect against algorithms and data in your code being extracted.
Newer Code Porting dongles (such as SenseLock†) appear to be able to do what you want though. With these devices, you take code out of your application and port it to the secure dongle processor. As with SaaS, your application would bundle up the data, pass it to the dongle (probably a USB device attached to your computer) and read back the results.
Unlike SaaS, data bandwidth would be unlikely to be an issue, but the performance of your application may be limited by the performance of the dongle's processor.
† This was the first example I could find with a google search.
Trusted platform
Another option, which may become viable in the future, is to use a Trusted Platform Module and Trusted Execution Technology to secure critical areas of the code. Whenever a customer installs your software, they would provide you with a fingerprint of their hardware and you would provide them with an unlock key for that specific system.
This key would then allow the code to be decrypted and executed within the trusted environment, where the encrypted code and data would be inaccessible outside of the trusted platform. If anything at all about the trusted environment changed, it would invalidate the key and that functionality would be lost.
For the customer this has the advantage that their data stays local, and they don't need to buy a new dongle to improve performance, but it has the potential to create an ongoing support requirement and the likelihood that your customers would become frustrated with the hoops they had to jump through to use software they have bought and paid for - losing you good will.
Conclusion
What you want to do is not simple or cheap. It could require a big investment in software, infrastructure or both. You need to know that it is worth the investment before you start along this road.

All efforts will be futile if someone is motivated enough to break it. No one has managed to figure this out yet, even the biggest software companies.
"I'm trying my best to secure it"
I'm not saying this as a scathing criticism; you just need to be aware that what you're trying to achieve is currently assumed to be impossible.
Obfuscation is security through obscurity. It does have some benefit, as it will deter the least competent attackers, but largely it is wasted effort that could be better spent in other areas of development.
In answer to your original question, you are going to run into problems with intelligent compilers: they might automatically piece the string back together in the compiled application, undoing some of your obfuscation efforts as a compilation optimisation. It would be hard to maintain as well, so I would reconsider your risk analysis and perhaps resign yourself to the fact that it can be cracked and, if it has any value, probably will be.

I recently read a very simple solution to OP.
Simply declare your constants as readonly string, not const string. That simple. Apparently const variables get written to a stack area in the binary but written as plain text whereas readonly strings get added to the constructor and written as a byte array instead of text.
I.e. If you search for it, you won't find it.
That was the question, right?

Using a custom algorithm (security through obscurity?), combined with storing the key inside the application, is simply not secure.
If you are storing some kind of a password, then you can use a one-way hashing function to ensure that decrypted data is unavailable anywhere in your code.
If you need to use a symmetric encryption algorithm, use a well known and tested one, like AES-256. But the key obviously cannot be stored inside your code.
[Edit]
Since you mentioned encryption of serial numbers, I believe a one-way hashing function (like SHA-256) would really suit your needs better.
The idea is to hash your serial numbers during build time into their hashed representations, which cannot be reversed (SHA-256 is considered to be a pretty safe algorithm, compared to, say, MD5). During run time, you only need to apply the same hash function to the user input, and compare hashed values only. This way none of the actual serial numbers are available to the attacker.
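
The hash-and-compare idea might look like this in C#. The names are invented for this example, and the normalisation step (trim plus upper-casing) is an assumption about how serials are typed in. The set of known hashes would be generated at build time and shipped in place of the serial numbers themselves.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class SerialValidator
{
    // Returns the SHA-256 digest of the normalised serial as upper-case hex.
    public static string Hash(string serial)
    {
        string normalised = serial.Trim().ToUpperInvariant();
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(normalised));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // Only hashes are compared; no plain serial number is stored in the app.
    public static bool IsValid(string userInput, ISet<string> knownHashes)
    {
        return knownHashes.Contains(Hash(userInput));
    }
}
```

If serial numbers are short or follow a predictable pattern, an attacker could still hash candidates until one matches, so this works best when serials are long and random.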

@Tom Gullen has given a proper answer.
I merely have some suggestions on how you can make it harder for users to access your keys and algorithm.
As for the algorithm: Do not compile your algorithm at compile time, but at runtime. To be able to do this you need to specify an interface which contains the methods for the algorithm. The interface is used to run it. Then add the source code for the algorithm as an encrypted string (embedded resource). Decrypt it at runtime and use CodeDom to compile it into a .NET class.
Keys: The usual way is to store spread parts of your key in different places in the application. Store each part as byte[] instead of string to make it a bit harder to find them.
If all your users have an internet connection: Fetch the algorithm source code and the keys using SSL instead.
Note that everything will be pieced together at runtime, so anyone with a bit more knowledge can inspect or debug your application to find everything.
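
As a sketch of the "spread parts of the key as byte[]" suggestion, the C# below splits a key into parts whose XOR gives the key back. XOR-splitting is a choice made for this example (the answer only says to store parts in different places), and `KeyParts`, `Split` and `Combine` are hypothetical names.

```csharp
using System;
using System.Security.Cryptography;

public static class KeyParts
{
    // Build-time helper: splits `key` into `count` parts. All but the last are
    // random; the last is chosen so that XOR-ing every part yields the key.
    public static byte[][] Split(byte[] key, int count)
    {
        if (count < 2) throw new ArgumentOutOfRangeException(nameof(count));
        var parts = new byte[count][];
        var last = (byte[])key.Clone();
        using (var rng = RandomNumberGenerator.Create())
        {
            for (int p = 0; p < count - 1; p++)
            {
                parts[p] = new byte[key.Length];
                rng.GetBytes(parts[p]);
                for (int i = 0; i < key.Length; i++) last[i] ^= parts[p][i];
            }
        }
        parts[count - 1] = last;
        return parts;
    }

    // Runtime helper: rebuilds the key by XOR-ing the parts together.
    public static byte[] Combine(params byte[][] parts)
    {
        var key = new byte[parts[0].Length];
        foreach (byte[] part in parts)
            for (int i = 0; i < key.Length; i++) key[i] ^= part[i];
        return key;
    }
}
```

No single part reveals anything about the key, but the full key still exists in memory after `Combine` runs, which is exactly the limitation noted above.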

I don't think you can easily obfuscate string constants, so if possible, don't use them :) You can use assembly resources instead, which you can encrypt however you want.

Depends what you're trying to do but can you use asymmetric encryption? That way you only need to store public keys with no need to obfuscate them.

What method of encryption is suitable for encrypting individual words, and also an entire document?

I need to save several documents to the cloud and need to save the documents, document metadata, and words/phrases for searching.
My plan is to use a symmetric cypher for encrypting the whole document, but I'm unsure of the right way to hash each word. I would like something secure, but I don't want to increase the count of characters in each word unnecessarily.
What implementation is most suitable for doing a symmetric encryption against a document, and what is the best way to hash a word or phrase without making it many times larger than it needs to be?

First, I suggest different tags. It sounds like you're really interested in offloading searching to a server in a cryptographically secure way (such that the server doesn't have access to the plaintext and such that the client need not transfer the entire index).
Issues:
1) An attacker being able to figure out which words are in the index (and which are not) could be an issue for you. You should state whether it is part of your requirements.
2) An attacker being able to figure out which items in the index occur more frequently could be an issue for you. You should state whether it is part of your requirements.
3) An attacker being able to associate words with a document could be an issue for you. You should state whether it is part of your requirements.
4) An attacker may be able to subvert the server entirely and observe queries / retrievals. You should state security needs in this circumstance as well.
5) Probably others I haven't thought of.
I'm assuming that you're designing your own, but there is probably some prior art, research, etc. that would be smarter than I am below:
For the first, I suggest that you hash the words, combining the plaintext with a secret (not shared with the index server) before hashing, and truncating the hash to the point where it is likely to be non-unique in the index. This costs you hash efficiency, but helps prevent an attacker from using the hash as a plaintext equivalent or experimentally determining the secret.
For the second and third, you should encrypt any indexed data (such as counts or document+position) and decrypt it on the client. This may cost you latency.
For the fourth, you'd want to consider concealing real requests inside groups of unrelated requests, things like that, but you'd want a lot of math to make sure you weren't still vulnerable to statistical analysis.
For the fifth, do some web research. I'm confident there will be stuff out there, and this is a pretty specific (and less common) need, so you'll want someone who put more thought into it than I just have.
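
The keyed, truncated word hash suggested for the first issue could be sketched in C# as follows. HMAC-SHA256 is used here as the way of combining the word with the secret; that is a substitution for the plain "secret plus hash" wording above. The names, the lower-casing step and the Base64 output are all assumptions for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SearchToken
{
    // Returns the first `tokenBytes` bytes of HMAC-SHA256(secret, word),
    // Base64-encoded. Shorter tokens stay compact and deliberately collide.
    public static string ForWord(string word, byte[] secret, int tokenBytes)
    {
        if (tokenBytes < 1 || tokenBytes > 32)
            throw new ArgumentOutOfRangeException(nameof(tokenBytes));
        using (var hmac = new HMACSHA256(secret))
        {
            byte[] mac = hmac.ComputeHash(Encoding.UTF8.GetBytes(word.ToLowerInvariant()));
            return Convert.ToBase64String(mac, 0, tokenBytes);
        }
    }
}
```

The client computes the token for each indexed word and again for each search term, so the server only ever sees tokens. Fewer token bytes mean more collisions, which makes the server's view less precise and the index smaller, at the price of false matches the client has to filter out after decrypting.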

Your requirements are mutually exclusive. That kind of metadata will leak a huge amount of information about the document content, to the point it can't be called secure.
Furthermore, encrypting individual words is futile. Breaking encryption is usually said to be as difficult as finding the key, but this assumes the information content of the plaintext is greater than that of the key. For single words, that certainly isn't true.

Generating a Tamper Proof Signature of some data?

I have a piece of data. At the moment, it's an XML file, but the architecture may change. So let's assume for the moment it's a C# Class.
When I store the data on disk or in the database, I need to add some sort of signature or fingerprint or checksum or whatever to ensure that no one can modify the data. The caveat: even an administrator or developer with access to all source code should not be able to modify it.
I assume that since someone with full code access can create a new signature easily (the signing needs to be done programmatically, so no manual passphrase entry), the signature somehow needs to contain some additional data. Ideally I should be able to extract this data back from the signature, for example the date of signing and some strings.
My general approach is to use symmetric encryption. I generate a hash, i.e. SHA-512, from all the fields and then encrypt that hash and my additional data to get my signature, using the hash as the password. To decrypt, my function would generate the hash from the actual data in the file, and try to decrypt the signature. That would not be tamper-proof though, as it's easy to generate a new signature in which the signing date and additional information are still intact.
As I am not an expert on the field, I believe that I am trying to re-invent the wheel, and that it's not a very good wheel. I just wonder if there is some standard approach? I believe that part of my request is impossible (after all, if someone controls the entire environment, that person also controls the system time), but I still wonder how this is generally tackled?

It sounds to me like you want a combination of a digital signature with a secure digital timestamp.
In brief, after signing your data, you call a third party web service to provide an official timestamp and their own digital signature linking that timestamp to your signature value, thus providing evidence that the original signature (and thus the original data) was created on or before that date. With this scheme, even if the original signing key is later compromised, revoked or otherwise invalidated, any signatures that were made before the invalidation are still valid thanks to the timestamp.
A tamper-resistant hardware signature device may help. If the target hardware is fairly recent it may have some support already on the motherboard in the form of a TPM, but there are plenty of vendors out there willing to charge an arm and a leg for their own hardware security modules, or somewhat less for a smart card.
Sufficient security may not be achievable by technology alone. You may need independent validation of the system. You may need remote CCTV monitoring and recording of the machine's location or other physical security measures to detect or stop tampering. You may need third-party code escrow, review and signing to ensure that the code loaded on the machine is what was intended, and to deter and/or detect the insertion of backdoor logic into the code.
The bottom line is that how much money, time and effort you need to spend on this depends very much on what you stand to lose if records are forged.

You need both a digital signature and a trusted timestamp. The trusted timestamp gets a third-party involved to validate the message. Then any attacker doesn't have 'full control' of the whole system.

You may want to leverage PGP by using GPGME (GnuPG Made Easy), a library designed to make access to GnuPG easier for applications.

Jeffrey Hantin's answer is the best I think you're going to be able to do. It's NOT perfect, though:
1) It doesn't stop your black hat from making a totally fake transaction.
2) It doesn't perfectly stop tampering with the transaction. Yes, the new transaction will have a different timestamp but how do you prove the timestamp has been messed with if they clean up the relevant data? Even if you give them some tamperproof receipt (hash & sign the data on it), when it comes to a showdown how do you prove whose record was faked?

You want a digital signature using asymmetric cryptography.
This article seems to have some good examples and explanations.
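
For reference, a minimal C# sketch of signing and verifying with an asymmetric key pair is shown below, using the `RSA` class from System.Security.Cryptography. The record is assumed to be serialised to a string first; `RecordSigner` is a hypothetical name, and SHA-256 with PKCS#1 v1.5 padding is just one reasonable choice.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RecordSigner
{
    // Signs the UTF-8 bytes of the record with the private key.
    public static byte[] Sign(string record, RSA privateKey)
    {
        return privateKey.SignData(Encoding.UTF8.GetBytes(record),
                                   HashAlgorithmName.SHA256,
                                   RSASignaturePadding.Pkcs1);
    }

    // Verifies the signature using only the public key.
    public static bool Verify(string record, byte[] signature, RSA publicKey)
    {
        return publicKey.VerifyData(Encoding.UTF8.GetBytes(record), signature,
                                    HashAlgorithmName.SHA256,
                                    RSASignaturePadding.Pkcs1);
    }
}
```

Verification needs only the public key, so the checking code can be open to everyone. The private key still has to live somewhere that the administrators and developers in question cannot reach (for example a separate signing service or hardware device), otherwise they can simply re-sign modified data.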

This is basically what code signing is, except that in your situation it's not code that is getting signed. You will either have to purchase a certificate or set up your own certificate server.
