My ASP.NET page has the following query string parameter:
…?IDs=1000000012,1000000021,1000000013,1000000022&...
Here the IDs parameter always contains numbers separated by a delimiter, in this case ','. Currently there are 4 numbers, but normally there will be between 3 and 7.
Now I am looking for a method to convert each big number above into the smallest possible value; specifically, compressing the value of the IDs query string parameter. Both approaches are welcome: compressing each number individually, or compressing the whole value of the IDs parameter.
Encoding or decoding is not an issue; just compressing the value of the IDs query string parameter.
Creating some unique small value for IDs and then retrieving its value from some data source is out of scope.
Is there an algorithm to compress such big numbers to small values, or to compress the value of the IDs query string parameter altogether?
You basically need so much room for your numbers because you are using base 10 to represent them. An improvement would be to use base 16 (hex). So for example, you could represent 255 (3 digits) as ff (2 digits).
You can take that concept further by using a much larger number base: the set of all characters that are valid in a query string parameter:
A-Z, a-z, 0-9, '.', '-', '~', '_', '+'
That gives you a base of 67 characters to work with (see Wikipedia on QueryString).
Have a look at this SO post for approaches to converting base 10 to arbitrary number bases.
EDIT:
In the linked SO post, look at this part:
string xx = IntToString(42,
    new char[] { '0','1','2','3','4','5','6','7','8','9',
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x' });
That's almost what you need. Just expand it by adding the few characters it is missing:
yz.-~_+
That post is missing a method to go back to base 10. I'm not going to write it :-) but the procedure is like this:
Define a counter I'll call TOTAL.
Look at the right-most character and find its position in the array.
TOTAL = (the position of the character in the array)
Example: Input is BA1. TOTAL is now 1 (since '1' is at position 1 in the array).
Now look at the next character to the left and find its position in the array.
TOTAL += 67 * (the position of the character in the array)
Example: Input is BA1. TOTAL is now (67 * 10) + 1 = 671 (since 'A' is at position 10).
Now look at the next character to the left and find its position in the array.
TOTAL += 67 * 67 * (the position of the character in the array)
Example: Input is BA1. TOTAL is now (67 * 67 * 11) + (67 * 10) + 1 = 50050 (since 'B' is at position 11).
And so on.
I suggest you write a unit test that converts a bunch of base-10 numbers into base 67 and then back again, to make sure your conversion code works properly.
Note how a 5-digit base-10 number is represented in just 3 base-67 digits :-)
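A sketch of both directions, assuming the 67-character URL-safe alphabet above (this is my own rendering, not the linked post's exact code; the method names are just illustrative):

```csharp
using System;
using System.Numerics;

class Base67
{
    // 10 digits + 26 uppercase + 26 lowercase + 5 URL-safe specials = 67 characters
    const string Alphabet =
        "0123456789" +
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
        "abcdefghijklmnopqrstuvwxyz" +
        ".-~_+";

    public static string IntToString(BigInteger value)
    {
        if (value == 0) return "0";
        string result = "";
        while (value > 0)
        {
            result = Alphabet[(int)(value % 67)] + result; // prepend: least significant digit ends up last
            value /= 67;
        }
        return result;
    }

    public static BigInteger StringToInt(string encoded)
    {
        BigInteger total = 0;
        foreach (char c in encoded)
            total = total * 67 + Alphabet.IndexOf(c); // Horner's method, left to right
        return total;
    }
}
```

StringToInt is the TOTAL procedure above collapsed into Horner form: multiplying the running total by 67 at each step is the same as weighting each position by 67, 67*67, and so on.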
What is the range of your numbers? Assuming they can fit in a 16-bit integer, I would:
Store all your numbers as 16-bit integers (2 bytes per number, range -32,768 to 32,767)
Build a bytestream of 16-bit integers (XDR might be a good option here; at the very least, make sure to handle endianness correctly)
Base64 encode the bytestream, using the modified base64 encoding for URLs (net is about 3 characters per number)
As an added bonus you don't need comma characters anymore because you know each number is 2 bytes.
Alternatively, if that isn't good enough, I'd use zlib to compress your stream of integers and then base64 the zlib-compressed stream. You can also switch to 32-bit integers if 16-bit isn't a large enough range (i.e. if you really need numbers in the 1,000,000,000 range).
Edit:
Maybe too late, but here's an implementation that might do what you need:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Scratch {
    class Program {
        static void Main(string[] args) {
            //var ids = new[] { 1000000012, 1000000021, 1000000013, 1000000022 };
            var rand = new Random();
            var ids = new int[rand.Next(20)];
            for(var i = 0; i < ids.Length; i++) {
                ids[i] = rand.Next();
            }

            WriteIds(ids);
            var s = IdsToString(ids);
            Console.WriteLine("\nResult string is: {0}", s);
            var newIds = StringToIds(s);
            WriteIds(newIds);
            Console.ReadLine();
        }

        public static void WriteIds(ICollection<Int32> ids) {
            Console.Write("\nIDs: ");
            bool comma = false;
            foreach(var id in ids) {
                if(comma) {
                    Console.Write(",");
                } else {
                    comma = true;
                }
                Console.Write(id);
            }
            Console.WriteLine();
        }

        public static string IdsToString(ICollection<Int32> ids) {
            var allbytes = new List<byte>();
            foreach(var id in ids) {
                var bytes = BitConverter.GetBytes(id);
                allbytes.AddRange(bytes);
            }
            var str = Convert.ToBase64String(allbytes.ToArray(), Base64FormattingOptions.None);
            return str.Replace('+', '-').Replace('/', '_').Replace('=', '.');
        }

        public static ICollection<Int32> StringToIds(string idstring) {
            var result = new List<Int32>();
            var str = idstring.Replace('-', '+').Replace('_', '/').Replace('.', '=');
            var bytes = Convert.FromBase64String(str);
            for(var i = 0; i < bytes.Length; i += 4) {
                var id = BitConverter.ToInt32(bytes, i);
                result.Add(id);
            }
            return result;
        }
    }
}
Here's another really simple scheme (sketched in Java) that should give good compression for a set of numbers of the form N + delta, where N is a large constant.
public int[] compress(int[] input) {
    int[] res = input.clone();
    Arrays.sort(res);
    // Work from the end so each subtraction uses the original previous value,
    // not one that has already been replaced by a delta.
    for (int i = res.length - 1; i >= 1; i--) {
        res[i] = res[i] - res[i - 1];
    }
    return res;
}
This should reduce the set {1000000012,1000000021,1000000013,1000000022} to the list [1000000012,1,8,1], which you can then compress further by representing the numbers in the base-67 encoding described in another answer.
Using simple decimal encoding, this goes from 43 characters to 16 characters, i.e. a 63% reduction (and using base 67 will give even more compression).
If it is unacceptable to sort the IDs, you don't get quite as good compression. For this example, {1000000012,1000000021,1000000013,1000000022} compresses to the list [1000000012,9,-8,9]. That is just one character longer for this example.
Either way, this is better than a generic compression algorithm or encoding scheme ... FOR THIS KIND OF INPUT.
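For completeness, here is a C# sketch of both directions (the decompression step the snippet above omits is just a running sum; the class name is mine):

```csharp
using System;
using System.Linq;

class DeltaCodec
{
    // Sort, then store each value as its difference from the previous one.
    public static int[] Compress(int[] input)
    {
        var res = (int[])input.Clone();
        Array.Sort(res);
        for (int i = res.Length - 1; i >= 1; i--)
            res[i] -= res[i - 1];  // backwards, so res[i - 1] is still the original value
        return res;
    }

    // Undo the delta encoding by re-accumulating the differences.
    public static int[] Decompress(int[] deltas)
    {
        var res = (int[])deltas.Clone();
        for (int i = 1; i < res.Length; i++)
            res[i] += res[i - 1];
        return res;
    }
}
```

Note that Decompress returns the sorted values, so this only round-trips exactly when the original order doesn't matter.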
If the only issue is the URL length, you can convert the numbers to base64 characters and then convert them back to numbers on the server side.
How patterned are the IDs you are getting? If, digit by digit, the IDs are random, then the method I am about to propose won't be very efficient. But if the IDs you gave as an example are representative of the ones you'd be getting, then perhaps the following could work.
I motivate this idea by example.
You have, for example, 1000000012 as an ID that you'd like to compress. Why not store it as [{1},{0,7},{12}]? This would mean the first digit is a 1, followed by 7 zeros, followed by a 12. Thus the notation {x} represents one instance of x, while {x,y} means that x occurs y times in a row.
You could extend this with a little bit of pattern matching and/or function fitting.
For example, pattern matching: 1000100032 would be [{1000,2},{32}].
For example, function fitting:
If your IDs are 10 digits, split the ID into two 5-digit numbers and store the equation of the line that goes through both points. If ID = 1000000012, then you have y1 = 10000 and y2 = 12; therefore your slope is -9988 and your intercept is 10000 (assuming x1 = 0, x2 = 1). In this case it's not an improvement, but if the numbers were more random, it could be. Equivalently, you could store the sequence of IDs with piecewise linear functions.
In any case, this mostly depends on the structure of your IDs.
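A minimal C# sketch of the digit run-length idea (my own rendering of the {x} / {x,y} notation; note that a simple left-to-right scanner emits the trailing "12" as two single-digit runs rather than one group):

```csharp
using System.Text;

class DigitRuns
{
    // Encode runs of repeated digits: "{d}" for a single occurrence,
    // "{d,count}" for a run of two or more.
    public static string Encode(string digits)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < digits.Length)
        {
            int j = i;
            while (j < digits.Length && digits[j] == digits[i]) j++;  // extend the current run
            int count = j - i;
            sb.Append(count == 1 ? $"{{{digits[i]}}}" : $"{{{digits[i]},{count}}}");
            i = j;
        }
        return sb.ToString();
    }
}
```

For IDs with few repeated digits the encoded form is longer than the input, which is why this only pays off on strongly patterned IDs.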
I assume you are doing this as a workaround for request URL length restrictions ...
Other answers have suggested encoding the decimal id numbers in hex, base47 or base64, but you can (in theory) do a lot better than that by using LZW (or similar) to compress the id list. Depending on how much redundancy there is in your ID lists, you could get significantly more than 40% reduction, even after re-encoding the compressed bytes as text.
In a nut-shell, I suggest that you find an off-the-shelf text compression library implemented in Javascript and use it client side to compress the ID list. Then encode the compressed bytestring using base47/base64, and pass the encoded string as the URL parameter. On the server side do the reverse; i.e. decode followed by decompress.
EDIT: As an experiment, I created a list of 36 different identifiers like the ones you supplied and compressed it using gzip. The original file is 396 bytes, the compressed file is 101 bytes, and the compressed + base64 file is 138 bytes: a 65% reduction overall. The compression ratio could actually improve for larger files. However, when I tried this with a small input set (e.g. just the 4 original identifiers), I got no compression, and after encoding the size was larger than the original.
Google "lzw library javascript"
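A server-side C# sketch of the compress-then-encode half of this pipeline (the client side would use whichever JavaScript compression library you pick; the URL-safe character substitutions here are my own choice, not a standard):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class UrlCodec
{
    public static string Compress(string ids)
    {
        byte[] raw = Encoding.UTF8.GetBytes(ids);
        using var mem = new MemoryStream();
        using (var gz = new GZipStream(mem, CompressionMode.Compress))
            gz.Write(raw, 0, raw.Length);  // disposing gz flushes the gzip trailer into mem
        // Swap the base64 characters that are unsafe in a query string.
        return Convert.ToBase64String(mem.ToArray())
                      .Replace('+', '-').Replace('/', '_').Replace('=', '.');
    }

    public static string Decompress(string encoded)
    {
        byte[] compressed = Convert.FromBase64String(
            encoded.Replace('-', '+').Replace('_', '/').Replace('.', '='));
        using var gz = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
        using var reader = new StreamReader(gz, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

As the experiment above shows, the gzip header and trailer overhead means this only wins once the ID list is long enough.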
In theory, there might be a simpler solution: send the parameters as POST data rather than in the request URL, and get the browser to compress the body using one of the encodings it understands. That will give you more savings too, since there is no need to encode the compressed data into legal URL characters.
The problem is getting the browser to compress the request ... and doing that in a browser independent way.
I found an algorithm here to remove duplicate characters from a string with O(1) space complexity (SC). Here we see that the algorithm converts the string to a character array, which is not constant; it will change depending on the input size. They claim that it will run in O(1) space. How?
// Function to remove duplicates
static string removeDuplicatesFromString(string string1)
{
    // keeps track of visited characters
    int counter = 0;
    char[] str = string1.ToCharArray();
    int i = 0;
    int size = str.Length;

    // gets character value
    int x;

    // keeps track of length of resultant String
    int length = 0;

    while (i < size) {
        x = str[i] - 97;

        // check if Xth bit of counter is unset
        if ((counter & (1 << x)) == 0) {
            str[length] = (char)('a' + x);

            // mark current character as visited
            counter = counter | (1 << x);
            length++;
        }
        i++;
    }
    return (new string(str)).Substring(0, length);
}
It seems that I don't understand Space Complexity.
I found an algorithm here to remove duplicate characters from a string with O(1) space complexity (SC). Here we see that the algorithm converts the string to a character array, which is not constant; it will change depending on the input size. They claim that it will run in O(1) space. How?
It does not.
The algorithm takes as its input an arbitrarily sized string consisting only of 26 characters, and therefore the output is only ever 26 characters or fewer, so the output array need not be the size of the input.
You are correct to point out that the implementation given on the site allocates O(n) extra space unnecessarily for the char array.
Exercise: Can you fix the char array problem?
Harder Exercise: Can you describe and implement a string data structure that implements the contract of a string efficiently but allows this algorithm to be implemented actually using only O(1) extra space for arbitrary strings?
Better exercise: The fact that we are restricted to an alphabet of 26 characters is what enables the cheesy "let's just use an int as a set of flags" solution. Instead of saying that n is the size of the input string, what if we allow arbitrary sequences of arbitrary values that have an equality relation; can you come up with a solution to this problem that is O(n) in the size of the output sequence, not the input sequence?
That is, can you implement public static IEnumerable<T> Distinct<T>(this IEnumerable<T> t) such that the output is deduplicated but otherwise in the same order as the input, using O(n) storage where n is the size of the output sequence?
This is a better exercise because this function is actually implemented in the base class library. It's useful, unlike the toy problem.
I note also that the problem statement assumes that there is only one relevant alphabet with lowercase characters, and that there are 26 of them. This assumption is false.
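For the last exercise, a sketch along the lines of what the BCL's Enumerable.Distinct does (this is my own illustration, not the BCL source): a set of items seen so far, which grows with the output sequence rather than the input:

```csharp
using System.Collections.Generic;

static class SequenceExtensions
{
    // Yields each element the first time it appears, preserving input order.
    // The set holds one entry per *distinct* element, i.e. O(n) in the size
    // of the output sequence, even if the input is much longer.
    public static IEnumerable<T> DistinctInOrder<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
            if (seen.Add(item))   // Add returns false for an element already present
                yield return item;
    }
}
```

Because the method is a lazy iterator, it also starts producing output before it has consumed the whole input.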
I want to be able to programatically generate a set of binary sequences of a given length whilst avoiding similarity between any two sequences.
I'll define 'similar' between two sequences thus:
If sequence A can be converted to sequence B (or B to A) by bit-shifting A (non-circularly) and padding with 0s, A and B are similar (note: bit-shifting is allowed on only one of the sequences otherwise both could always be shifted to a sequence of just 0s)
For example: A = 01010101 B = 10101010 C = 10010010
In this example, A and B are similar because a single left-shift of A results in B (A << 1 = B). A and C are not similar because no bit-shifting of one can result in the other.
A set of sequences is defined as dissimilar if no subset of size 2 is similar.
I believe there could be multiple sets for a given sequence length and presumably the size of the set will be significantly less than the total possibilities (total possibilities = 2 ^ sequence length).
I need a way to generate a set for a given sequence length. Does an algorithm exist that can achieve this? Selecting sequences one at a time and checking against all previously selected sequences is not acceptable for my use case (but may have to be if a better method doesn't exist!).
I've tried generating sets of integers based on prime numbers and also the golden ratio, then converting to binary. This seemed like it might be a viable method, but I have been unable to get it to work as expected.
Update: I have written a function in C# that uses a prime-number modulus to generate the set, without success. I've also tried using the Fibonacci sequence, which finds a mostly dissimilar set, but one that is very small compared to the number of possibilities:
private List<string> GetSequencesFib(int sequenceLength)
{
    var sequences = new List<string>();
    long current = 21;
    long prev = 13;
    long prev2 = 8;
    long size = (long)Math.Pow(2, sequenceLength);

    while (current < size)
    {
        current = prev + prev2;
        sequences.Add(current.ToBitString(sequenceLength));
        prev2 = prev;
        prev = current;
    }
    return sequences;
}
This generates a set of 41 sequences, of which roughly 60% are mutually dissimilar (sequenceLength = 32). It starts at 21, since lower values produce sequences of mostly 0s, which are similar to almost any other sequence.
By relaxing the conditions of similarity to only allowing a small number of successive bit-shifts, the proportion of dissimilar sequences approaches 100%. This may be acceptable in my use case.
Update 2:
I've implemented a function following DCHE's suggestion, by selecting all odd numbers greater than half the maximum value for a given sequence length:
private static List<string> GetSequencesOdd(int length)
{
    var sequences = new List<string>();
    long max = (long)(Math.Pow(2, length));
    long quarterMax = max / 4;

    for (long n = quarterMax * 2 + 1; n < max; n += 2)
    {
        sequences.Add(n.ToBitString(length));
    }
    return sequences;
}
This produces an entirely dissimilar set as per my requirements. I can see why this works mathematically as well.
I can't prove it, but from my experimenting, I think that your set is the odd integers greater than half of the largest number in binary. E.g. for bit sets of length 3, max integer is 7, so the set is 5 and 7 (101, 111).
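The claim is easy to test mechanically. Here is a brute-force C# sketch (names are mine) that checks every pair in the odd-upper-half set against the question's similarity definition; it passes because every member has both its most significant and least significant bit set, so any left shift clears the low bit and any right shift clears the high bit, taking the result out of the set:

```csharp
using System.Collections.Generic;

class DissimilarityCheck
{
    // The question's definition: A similar to B if some non-circular,
    // zero-padded shift of one sequence equals the other.
    public static bool Similar(int a, int b, int length)
    {
        int mask = (1 << length) - 1;
        for (int k = 1; k < length; k++)
        {
            if (((a << k) & mask) == b || (a >> k) == b) return true;
            if (((b << k) & mask) == a || (b >> k) == a) return true;
        }
        return false;
    }

    // All odd numbers above half the maximum, e.g. {5, 7} for length 3.
    public static List<int> OddUpperHalf(int length)
    {
        int max = 1 << length;
        var set = new List<int>();
        for (int n = max / 2 + 1; n < max; n += 2) set.Add(n);
        return set;
    }

    public static bool AllDissimilar(int length)
    {
        var set = OddUpperHalf(length);
        for (int i = 0; i < set.Count; i++)
            for (int j = i + 1; j < set.Count; j++)
                if (Similar(set[i], set[j], length)) return false;
        return true;
    }
}
```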
I'm working with an embedded system that returns ASCII data that includes (what I believe to be) a modular-sum checksum. I would like to verify this checksum, but I've been unable to do so based on the manufacturer's specification. I've also been unable to accomplish the opposite and calculate the same checksum from the description.
Each response from the device is in the following format:
╔═════╦═══════════════╦════════════╦════╦══════════╦═════╗
║ SOH ║ Function Code ║ Data Field ║ && ║ Checksum ║ ETX ║
╚═════╩═══════════════╩════════════╩════╩══════════╩═════╝
Example:
SOHi11A0014092414220&&FBEA
Where SOH is ASCII 1. e.g.
#define SOH "\x01"
The description of the checksum is as follows:
The Checksum is a series of four ASCII-hexadecimal characters which provide a check on the integrity of all the characters preceding it, including the control
characters. The four characters represent a 16-bit binary count which is the 2's complemented sum of the 8-bit binary representation of the message characters after the parity bit (if enabled) has been cleared. Overflows are ignored. The data integrity check can be done by converting the four checksum characters to the 16-bit
binary number and adding the 8-bit binary representation of the message characters to it. The binary result should be zero.
I've tried a few different interpretations of the specification, including ignoring SOH as well as the ampersands, and even the function code. At this point I must be missing something very obvious in either my interpretation of the spec or the code I've been using to test. Below you'll find a simple example (data was taken from a live system); if it were correct, the lower word of the validate variable would be 0:
static void Main(string[] args)
{
    unchecked
    {
        var data = String.Format("{0}{1}", (char)1, @"i11A0014092414220&&");
        const string checkSum = "FBEA";

        // Checksum is a 16-bit word
        var checkSumValue = Convert.ToUInt16(checkSum, 16);

        // Sum of message chars preceding the checksum
        var mySum = data.TakeWhile(c => c != '&').Aggregate(0, (current, c) => current + c);

        var validate = checkSumValue + mySum;

        Console.WriteLine("Data: {0}", data);
        Console.WriteLine("Checksum: {0:X4}", checkSumValue);
        Console.WriteLine("Sum of chars: {0:X4}", mySum);
        Console.WriteLine("Validation: {0}", Convert.ToString(validate, 2));
        Console.ReadKey();
    }
}
Edit
While the solution provided by @tinstaafl works for this particular example, it doesn't work when providing a larger record such as the one below:
SOHi20100140924165011000007460904004608B40045361000427DDD6300000000427C3C66000000002200000745B4100045B3D8004508C00042754B900000000042774D8D0000000033000007453240004531E000459F5000420EA4E100000000427B14BB000000005500000744E0200044DF4000454AE000421318A0000000004288A998000000006600000744E8C00044E7200045469000421753E600000000428B4DA50000000&&
BA6C
Theoretically you could keep incrementing/decrementing a value in the string until the checksum matched, it just so happened that using the character 1 rather than the ASCII SOH control character gave it just the right value, a coincidence in this case.
Not sure if this is exactly what you're looking for, but by using an integer of 1 for the SOH instead of a char value of 1, taking the sum of all the characters and converting the validate variable to a 16 bit integer, I was able to get validate to equal 0:
var data = @"1i11A0014092414220&&";
const string checkSum = "FBEA";

// Checksum is a 16-bit word
var checkSumValue = Convert.ToUInt16(checkSum, 16);

// Sum of message chars preceding the checksum
var mySum = data.Sum<char>(c => c);

var validate = (UInt16)(checkSumValue + mySum);

Console.WriteLine("Data: {0}", data);
Console.WriteLine("Checksum: {0:X4}", checkSumValue);
Console.WriteLine("Sum of chars: {0:X4}", mySum);
Console.WriteLine("Validation: {0}", Convert.ToString(validate, 2));
Console.ReadKey();
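For reference, here is a sketch of the general modular-sum scheme the spec describes; which bytes this particular device actually includes in the sum remains the open question, so this only demonstrates the arithmetic:

```csharp
class ModularChecksum
{
    // The checksum is the 16-bit two's complement of the sum of the message
    // bytes, so (checksum + sum of bytes) mod 2^16 == 0 for a valid message.
    public static ushort Generate(string message)
    {
        int sum = 0;
        foreach (char c in message)
            sum += (byte)c;             // parity bit assumed already cleared
        return (ushort)(-sum & 0xFFFF); // two's complement; overflow ignored
    }

    public static bool Verify(string message, ushort checksum)
    {
        int sum = checksum;
        foreach (char c in message)
            sum += (byte)c;
        return (sum & 0xFFFF) == 0;     // the low 16 bits must come out zero
    }
}
```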
I'm integrating/testing with a remote web service, and even though it's the "QA" endpoint, it still enforces a unique email address on every call.
I can think of DateTime.Now.Ticks (e.g. 634970372342724417) and Guid.NewGuid(), but neither of those can be coalesced into an email with max. 20 chars (or can they?).
I suppose it's not that hard to write out to a file a number that contains the last number used and then use email1@x.com, email2@x.com, etc., but if I can avoid persisting state I always do.
Does anyone have a trick or an algorithm that gives something like a short "guid" that is unique over a reasonably long time period (say a year), which I could use for my email addresses of max length 20 chars, with (max length of guid) = 14 = 20 - length of "@x.com"?
If you assume that you will not generate two e-mail addresses at the same 'tick', then you can indeed use the ticks to generate an e-mail address.
However, if ticks is a 64-bit number, and you write out that number, you will end up with more than 20 characters.
The trick is to encode your 64-bit number using a different scheme.
Assume that you can use the 26 characters from the western alphabet + 10 digits. This makes 36 possible characters. If you take 5 bits, you can represent 32 characters. That should be enough.
Take the 64 bits and divide them into groups of 5 bits (64 / 5 is about 13 groups). Translate every 5 bits to one character. That way you end up with 13 characters, and you can still add a character in front of them.
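A sketch of that 5-bit grouping (the 32-character alphabet below is one arbitrary choice from the 36 allowed characters; the names are mine). The snippet that follows takes the simpler route of base64 instead:

```csharp
class Base32Ticks
{
    const string Alphabet = "0123456789abcdefghijklmnopqrstuv"; // 32 of the 36 allowed characters

    // 64 bits split into 5-bit groups gives at most 13 characters.
    public static string Encode(ulong value)
    {
        if (value == 0) return "0";
        string result = "";
        while (value > 0)
        {
            result = Alphabet[(int)(value & 31)] + result; // low 5 bits become one character
            value >>= 5;
        }
        return result;
    }

    public static ulong Decode(string encoded)
    {
        ulong value = 0;
        foreach (char c in encoded)
            value = (value << 5) | (ulong)Alphabet.IndexOf(c);
        return value;
    }
}
```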
long ticks = DateTime.Now.Ticks;
byte[] bytes = BitConverter.GetBytes(ticks);
string id = Convert.ToBase64String(bytes)
    .Replace('+', '_')
    .Replace('/', '-')
    .TrimEnd('=');
Console.WriteLine(id);
Yields:
Gq1rNzbezwg
If you get the following digits from your date-time, you should be able to make it work.
Something like:
DateTime.Now.ToString("yyMMddHHmmssff");
which is 14 characters, leaving 6 for some other prefix as you need.
So Feb 21, 2013, at approximately 10:21:42 would be something like "13022110214200", and the next one "13022110214217", etc.
Have a look at http://msdn.microsoft.com/en-us/library/zdtaw1bw.aspx for more details on datetime formatting.
Since you specified at least 1 second between each call, this should work:
DateTime.Now.ToString("yyyyMMddHHmmss");
It's exactly 14 characters.
Just to add... if you want to use numbers only from ticks, you can take a substring, for example:
int onlyThisAmount = 20;
string ticks = DateTime.Now.Ticks.ToString();
// Ticks is currently an 18-digit number, so clamp the start index to avoid
// an ArgumentOutOfRangeException when onlyThisAmount exceeds its length.
ticks = ticks.Substring(Math.Max(0, ticks.Length - onlyThisAmount));
/// <summary>
/// Get a unique reference number.
/// </summary>
/// <returns></returns>
public string GetUniqueReferenceNumber(char firstChar)
{
    var ticks = DateTime.Now.Ticks;
    var ticksString = ticks.ToString();
    var ticksSubString = ticksString.Substring((ticksString.Length - 15 > 0) ? ticksString.Length - 15 : 0);

    if (this.currentTicks.Equals(ticks))
    {
        this.currentReference++;
        if (this.currentReference >= 9999)
        {
            // Only needed on very fast machines: more than 9999 calls within one tick.
            System.Threading.Thread.Sleep(1);
        }
        return (firstChar + ticksSubString + this.currentReference.ToString("D4")).PadRight(20, '9');
    }

    this.currentReference = -1;
    this.currentTicks = ticks;
    return (firstChar + ticksSubString).PadRight(20, '9');
}
In my case I needed to create a unique reference number with a unique first character and a maximum of 20 characters. Maybe you can use the function above; it allows you to create 9999 unique numbers within one tick (zero included).
Of course you can create your own implementation without the first character and the maximum character count of 20.
public async Task<string> GeneratePatientNumberAsync()
{
    var random = new Random();
    var chars = DateTime.Now.Ticks + "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz123456789" + DateTime.Now.Ticks;
    return new string(Enumerable.Repeat(chars, 5)
        .Select(s => s[random.Next(s.Length)]).ToArray());
}
I have an interesting problem - I need to convert 2 (randomly) generated Guids into a string. Here are the constraints:
string max 50 characters length.
only numbers and lowercase letters can be used (0123456789abcdefghijklmnopqrstuvwxyz)
the algorithm has to be two-way - I need to be able to decode the encoded string back into the same 2 separate Guids.
I've browsed a lot looking for a toBase36 conversion but so far no luck with Guid.
Any ideas? (C#)
First of all, you're in luck: 36^50 is around 2^258.5, so you can store the information in a 50-character base-36 string. I wonder, though, why anybody would have to use base 36 for this.
You need to treat each GUID as a 128-bit number, then combine them into a 256-bit number, which you then convert to a base-36 'number'. Converting back is doing the same in reverse.
Guid.ToByteArray will convert a GUID to a 16-byte array. Do it for both GUIDs and you have a 32-byte (that is, 256-bit) array. Construct a BigInteger from that array (there's a constructor), and then just convert that number to base 36.
To convert a number to base 36, do something like this (I assume everything is positive):
const string digits = "0123456789abcdefghijklmnopqrstuvwxyz";

string ConvertToBase36(BigInteger number)
{
    if (number == 0) return "0";

    string result = "";
    while (number > 0)
    {
        char digit = digits[(int)(number % 36)];
        result = digit + result; // prepend so the most significant digit comes first
        number /= 36;
    }
    return result;
}
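Putting it all together, a sketch of the full round trip (the class and method names are mine; the extra zero byte keeps the BigInteger positive, since its byte-array constructor is little-endian and signed):

```csharp
using System;
using System.Numerics;

class GuidPairCodec
{
    const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

    public static string Encode(Guid a, Guid b)
    {
        var bytes = new byte[33]; // 16 + 16, plus one zero byte to force a positive sign
        a.ToByteArray().CopyTo(bytes, 0);
        b.ToByteArray().CopyTo(bytes, 16);
        var number = new BigInteger(bytes);

        string result = "";
        while (number > 0)
        {
            result = Digits[(int)(number % 36)] + result;
            number /= 36;
        }
        return result == "" ? "0" : result;
    }

    public static (Guid, Guid) Decode(string encoded)
    {
        BigInteger number = 0;
        foreach (char c in encoded)
            number = number * 36 + Digits.IndexOf(c);

        // ToByteArray may return fewer than 33 bytes (trailing zeros are dropped),
        // so copy into a zero-filled buffer of the full width.
        var bytes = new byte[33];
        number.ToByteArray().CopyTo(bytes, 0);

        var a = new byte[16];
        var b = new byte[16];
        Array.Copy(bytes, 0, a, 0, 16);
        Array.Copy(bytes, 16, b, 0, 16);
        return (new Guid(a), new Guid(b));
    }
}
```

Since 36^50 > 2^256, the encoded string is guaranteed to fit in the 50-character budget.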