Sort anything lexicographically - c#

I have this code which I think I found somewhere on the internet some years ago and it doesn't quite work.
The purpose is to take any string and from that create a string that is lexicographically sorted by a large number - because then inverse (descending) ordering can be achieved by subtracting the number from another even larger number.
private static BigInteger maxSort = new BigInteger(Encoding.Unicode.GetBytes("5335522543087813528200259404529154678271640415603227881439560533607051111046319775598721171814499900"));
public static string GetSortString(string str, bool descending)
{
var sortNumber = new BigInteger(Encoding.Unicode.GetBytes(str));
if (descending)
{
sortNumber = maxSort - sortNumber;
}
return "$SORT!" + sortNumber.ToString().PadLeft(100, '0') + ":" + str;
}
The reason I need this is because I want to use it to insert as RowKey in Azure Table Storage which is the only way to sort in Table Storage. I need to sort any text, any number and any date, both ascending and descending.
Can anyone see the issue with the code or have any code that serves the same purpose?
The question is tagged with C# but of course this is not a question of syntax so if you have the answer in any other code that would be fine too.
Example
I want to convert any string to a number which is lexicographically sorted correctly - because if it's a number, then I can invert it and sort descending.
So for example, if I can convert:
ABBA to 1234
Beatles to 3131
ZZ Top to 9584
Then those numbers would sort them correctly ... and, if I subtract them from a large number, I would be able to invert the sort order:
10000 - 1234 = 8766
10000 - 3131 = 6869
10000 - 9584 = 0416
Of course, to support longer text input, I need to subtract them from a very large number, which is why I use the very large BigInteger.
Current output from this code
ABBA: $SORT!0000000000000000000000018296156958359617:ABBA
Beatles: $SORT!0000000009111360792640460912278748069954:Beatles
ZZ TOP: $SORT!0000000000000096715522885596192519618650:ZZ TOP
As you can see, the longest text gets the highest number. I have also tried padding the input str directly, but that didn't help either.
Answer
The accepted answer worked. For descending sort order, the "BigInteger" trick from above could be used.
There is some limitation as to how long the sortable string can be.
Here is the final code:
private static BigInteger maxSort = new BigInteger(Encoding.Unicode.GetBytes("5335522543087813528200259404529154678271640415603227881439560533607051111046319775598721171814499900"));
public static string GetSortString(string str, bool descending)
{
BigInteger result = 0;
int maxLength = 42;
foreach (var c in str.ToCharArray())
{
result = result * 256 + c;
}
for (int i = str.Length; i < maxLength; i++)
{
result = result * 256;
}
if (descending)
{
result = maxSort - result;
}
return "$SORT!" + result;
}

If you are looking for a way to assign a numeric value to any string so that sorting by that number gives the same order as sorting the strings lexicographically, you can't. The reason is that strings have no length limit: you can always append a character to a string and thereby get a larger number, even though the longer string should have a lower lexicographical value.
If they have a length limit you can do something like this
pseudo code
bignum res = 0;
maxLength = 42;
for (char c : string)
    res = res * 256 + c
for (int i = string.length; i < maxLength; i++)
    res = res * 256
If you want to optimize a bit, the last loop could be replaced by a lookup table. If you're only using a-z, the multiplier 256 could be reduced to 26 or 32.
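For reference, here is a minimal sketch of the length-limited scheme described above, with the result zero-padded to a fixed width so that comparing the keys as strings matches comparing them as numbers. The names SortKey, Make, MaxLength and Width are made up for this illustration, and it assumes every character has a code below 256 and the text is at most 42 characters long. Descending order is obtained by subtracting from the largest value the encoding can produce.

```csharp
using System;
using System.Numerics;

public static class SortKey
{
    // Hypothetical limit chosen for illustration; longer input is rejected.
    private const int MaxLength = 42;

    // Largest value the encoding can produce: 256^MaxLength - 1.
    private static readonly BigInteger MaxValue = BigInteger.Pow(256, MaxLength) - 1;

    // Number of decimal digits in MaxValue; every key is padded to this width.
    private static readonly int Width = MaxValue.ToString().Length;

    public static string Make(string text, bool descending)
    {
        if (text.Length > MaxLength)
            throw new ArgumentException("Text is too long.", nameof(text));

        BigInteger value = BigInteger.Zero;
        for (int i = 0; i < MaxLength; i++)
        {
            int digit = 0; // positions past the end of the text are padded with 0
            if (i < text.Length)
            {
                if (text[i] > 255)
                    throw new ArgumentException("Only characters below 256 are supported.", nameof(text));
                digit = text[i];
            }
            value = value * 256 + digit;
        }

        if (descending)
            value = MaxValue - value;

        // Fixed width, so ordinal string comparison agrees with numeric comparison.
        return value.ToString().PadLeft(Width, '0');
    }
}
```

With this sketch, SortKey.Make("ABBA", false) sorts before SortKey.Make("Beatles", false) under an ordinal string comparison, and the order is reversed when descending is true.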

Related

C# Type of String Index

I need to access a very large number in the index of the string which int and long can't handle. I had to use ulong but the problem is that the indexer can only handle the type int.
This is my code and I have marked the line where the error is located. Any ideas how to solve this?
string s = Console.ReadLine();
long n = Convert.ToInt64(Console.ReadLine());
var cont = s.Count(x => x == 'a');
Console.WriteLine(cont);
Console.ReadKey();
The main idea of the code is to identify how many 'a's there are in the string. What are some other ways I can do this?
EDIT:
I didn't know that a string index can't exceed the range of the int type. I fixed my for loop by replacing it with this LINQ line:
var cont = s.Count(x => x == 'a');
Now, since my string can't exceed a certain length, how can I repeat my string until it has 1,000,000,000,000 characters, rather than using this code?
for (int i = 0; i < 20; i++)
{
s += s;
}
This code produces an arbitrary number of characters in the string, and raising the 20 may cause an overflow, so I need to adjust it so that the string repeats until its length equals n (the long I declared above).
So, for example, if my string input is "aba" and n is 10, the string will be "abaabaabaa" (10 characters in total).
PS: I edited the original code.
I assume you got a programming assignment or online coding challenge, where the requirement was "Count all instances of the letter 'a' in this > 2 GB file". Your solution is to read the file into memory at once, and loop over it with a variable type that allows values over 2 GB.
This causes an XY problem. You cannot have an array that large in memory in the first place, so you're not going to reach the point where you need a uint, long or ulong to index into it.
Instead, use a StreamReader to read the file in chunks, as explained in for example Reading large file in chunks c#.
You can repeat your string using an infinite sequence. I haven't added any check for valid arguments, etc.
static void Main(string[] args)
{
    long count = countCharacters("aba", 'a', 10);
    Console.WriteLine("Count is {0}", count);
    Console.WriteLine("Press ENTER to exit...");
    Console.ReadLine();
}

private static long countCharacters(string baseString, char c, long limit)
{
    long result = 0;
    if (baseString.Length == 1)
    {
        // Every character is the same, so no iteration is needed.
        result = baseString[0] == c ? limit : 0;
    }
    else
    {
        long n = 0;
        foreach (var ch in getInfiniteSequence(baseString))
        {
            if (n >= limit)
                break;
            if (ch == c)
            {
                result++;
            }
            n++;
        }
    }
    return result;
}

// This method iterates through a base string infinitely
private static IEnumerable<char> getInfiniteSequence(string baseString)
{
    int stringIndex = 0;
    while (true)
    {
        yield return baseString[stringIndex];
        // Wrap the index on every step so it never grows past the string length;
        // an ever-increasing int would overflow after int.MaxValue steps.
        stringIndex = (stringIndex + 1) % baseString.Length;
    }
}
For the given inputs, the result is 7
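Since the repeated string is periodic, the same count can also be computed without walking through all n characters. The sketch below is only an illustration of that idea (RepeatedStringCounter and its Count method are names made up here): it counts the matches in one copy of the string, multiplies by the number of whole copies that fit into n, and adds the matches in the leftover prefix.

```csharp
using System;
using System.Linq;

public static class RepeatedStringCounter
{
    // Counts occurrences of target in the first 'limit' characters of
    // baseString repeated indefinitely, without building the long string.
    public static long Count(string baseString, char target, long limit)
    {
        if (string.IsNullOrEmpty(baseString) || limit <= 0)
            return 0;

        // Matches in a single copy of the string.
        long perCopy = baseString.Count(ch => ch == target);

        // Number of complete copies, and the length of the partial copy at the end.
        long fullCopies = limit / baseString.Length;
        int remainder = (int)(limit % baseString.Length);

        // Matches in the partial copy.
        long inPrefix = baseString.Take(remainder).Count(ch => ch == target);

        return fullCopies * perCopy + inPrefix;
    }
}
```

For "aba", 'a' and a limit of 10 this gives 3 full copies with 2 matches each plus 1 match in the one-character remainder, so 7, the same as the iterator version. The work is proportional to the length of the base string rather than to n, which matters for values such as 1,000,000,000,000.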
I highly recommend you rethink the way you are doing this, but a quick fix would be to use a foreach loop instead:
foreach(char c in s)
{
if (c == 'a')
cont++;
}
Alternative using Linq:
cont = s.Count(c => c == 'a');
I'm not sure about what n is supposed to do. According to your code it limits the string length but your question never mentions why or to what end.
i need to access a very large number in the index of the string which
int, long can't handle
this statement is not true
A C# string's maximum length is bounded by int.MaxValue, since string.Length is an int. You should be able to do
for (int i = 0; i <= n; i++)
The maximum length of a string cannot exceed the size of an int so there really is no point in using ulong or long to index into the string.
Simply put, you're trying to solve the wrong problem.
If we disregard the fact that the program is likely to cause an out of memory exception when building such a long string, you can simply fix your code by switching to an int instead of a ulong:
for (int i = 0; i <= n; i++)
Having said that you can also use LINQ to do this:
int cont = s.Take(n + 1).Count(c => c == 'a');
Now, in the first sentence of your question you state this:
I need to access a very large number in the index of the string which int and long can't handle.
This is wholly unnecessary because any legal index of a string will fit inside an int.
If you need to do this on some input that's longer than the maximum length of a string in .NET, you'll need to change your approach; use a Stream instead trying to read all input into a string.
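char seeking = 'a';
ulong count = 0;
char[] buffer = new char[4096];
using (var reader = new StreamReader(inStream))
{
    int length;
    while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Count only the characters actually read in this pass; the rest of
        // the buffer may still hold data from an earlier read.
        count += (ulong)buffer.Take(length).Count(c => c == seeking);
    }
}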

Fast and low-memory-consumption way to read in pair of numbers from file and process them?

Okay, so this is my challenge taken from CodeEval. I have to read numbers from a file that is formatted in a standard way, it has a pair of numbers separated by a comma on each line (x, n). I have to read in the pair values and process them, then print out the smallest multiple of n which is greater than or equal to x, where n is a power of 2.
EXACT REQUIREMENT: Given numbers x and n, where n is a power of 2, print out the smallest multiple of n which is greater than or equal to x. Do not use division or modulo operator.
I have come up with a number of solutions, but none of them satisfy the computer's conditions to let me pass the challenge. I only get a partial completion with scores that vary from 30 to 80 (from 100).
I'm assuming that my solutions do not pass the speed but more likely the memory-usage requirements.
I would greatly appreciate it if anyone can enlighten me and offer some better, more efficient solutions.
Here are two of my solutions:
var filePath = @"C:\Users\myfile.txt";
int x;
int n;
using (var reader = new StreamReader(filePath))
{
string numsFile = string.Empty;
while ((numsFile = reader.ReadLine()) != null)
{
var nums = numsFile.Split(',').ToArray();
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
}
<<<>>>
var fileNums = File.ReadAllLines(filePath);
foreach (var line in fileNums)
{
var nums = line.Split(',').ToArray();
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
Method to check numbers
public static int DangleNumbers(int x, int n)
{
int m = 2;
while ((n * m) < x)
{
m += 2;
}
return m * n;
}
I'm fairly new to C# and programming but these two ways I found to get the best score from several others I have tried. I'm thinking that it's not too optimal for a new string to be created on each iteration, nor do I know how to use a StringBuilder and get the values into an Int from it.
Any pointers in the right direction would be appreciated as I would really like to get this challenge passed.
The smallest multiple of n that is larger or equal to x is likely this:
if(x <= n)
{
return n;
}
else
{
return x % n == 0 ? x : (x/n + 1) * n;
}
As x and n are integers, the result of x/n will be truncated (or effectively rounded down). So the next integer larger than x that is a multiple of n is (x/n + 1) * n
Since you missed the requirements, the modulo version was the most obvious choice. Though you still got your method wrong. m = 2 would not result in the smallest being returned but it could actually be the double of the smallest if n is already larger than x.
x = 7, n = 8 would get you 16 instead of 8.
Also adding 2 to m would result in a similar problem.
x = 5, n = 2 would get you 8 instead of 6.
use the following method instead:
public static int DangleNumbers(int x, int n)
{
int result = n;
while(result < x)
result += n;
return result;
}
It can still be optimized, but at least it is correct according to the (now) stated constraints.
I have tried to improve the solution with some suggestions from you guys and take the variables outside the loop and drop the ToArray() call which was redundant.
static void Main(string[] args)
{
var filePath = @"C:\Users\sorin\Desktop\sorvas.txt";
int x;
int n;
string[] nums;
using (var reader = new StreamReader(filePath))
{
string numsFile = string.Empty;
while ((numsFile = reader.ReadLine()) != null)
{
nums = numsFile.Split(',');
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
}
}
public static int DangleNumbers(int x, int n)
{
int m = 2;
while ((n * m) < x)
{
m += 2;
}
return m * n;
}
So it looks like this. The thing is that even if now the numbers have slightly improved, I got a lower score.
Could their system be to blame?
Using the first option of reading line by line (rather than reading all lines) is clearly going to use less memory, except potentially in the case where the file is very small (e.g. "1,1"), in which case the overhead of the reader may cause problems, but at that point the memory used is probably irrelevant.
Likewise declaring the variables outside the loop is generally better but in this case since the objects are value types I'm not sure it makes a difference.
Lastly the most efficient way of doing your DangleNumbers method is probably using bitwise logic operators and the fact that n is always a power of 2. Here is my attempt:
public static int DangleNumbers3(int x, int n)
{
return ((x-1) & ~(n-1))+n;
}
Essentially it relies on the fact that in binary a power of 2 is always a 1 followed by zero or more zeros. Thus a multiple of n will always end in at least that same number of zeros. So if n has M zeros after the one, you can take the binary form of x, and if it already ends in M zeros then you have your answer. Otherwise you zero out the last M digits, at which point you have the multiple of n that is just under x, and then you add n.
In the code, ~(n-1) is a bitmask that has M zeros at the end and whose leading digits are all 1. Thus when you AND it with a number it will zero out the trailing M digits. I apply this to (x-1) rather than x so that the case where x is already a multiple of n needs no special check.
It is important to note that this only works because of the special form of n as a power of 2. This method avoids the need for any loops and thus should run much faster: it has five operations in total and no branching at all, compared to the looping methods, which will tend to have at the very least an operation and a comparison per iteration.
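To make the bit manipulation easier to follow, here is a small sketch that puts the bitwise formula next to the simple addition loop so the two can be compared. The class and method names are made up for this illustration, and it assumes x is at least 1 and n is a positive power of 2.

```csharp
using System;

public static class PowerOfTwoRounding
{
    // Bitwise version. Worked example for x = 13, n = 8:
    //   x - 1      = 12  -> ...00001100
    //   ~(n - 1)   = ~7  -> ...11111000
    //   AND        = 8   -> ...00001000  (largest multiple of 8 below 13)
    //   + n        = 16
    public static int RoundUpBitwise(int x, int n)
    {
        return ((x - 1) & ~(n - 1)) + n;
    }

    // Reference version that uses only addition and comparison.
    public static int RoundUpLoop(int x, int n)
    {
        int result = n;
        while (result < x)
            result += n;
        return result;
    }
}
```

For example, both methods return 8 for x = 7, n = 8 and 6 for x = 5, n = 2, which are the two cases where the original DangleNumbers returned a multiple that was too large.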

How to get count of numbers in int and how to split a number without making a string

I have a number like 601511616
If every number's length is a multiple of 3, how can I split the number into an array without making a string?
Also, how can I count the digits in the int without making a string?
Edit: Is there a way to simply split the number, knowing it's always in a multiple of 3... good output should look like this: {616,511,601}
You can use i % 10 in order to get the last digit of integer.
Then, you can use division by 10 for removing the last digit.
1234567 % 10 = 7
1234567 / 10 = 123456
Here is the code sample:
int value = 601511616;
List<int> digits = new List<int>();
while (value > 0)
{
digits.Add(value % 10);
value /= 10;
}
// digits is [6,1,6,1,1,5,1,0,6] now
digits.Reverse(); // Values has been inserted from least significant to the most
// digits is [6,0,1,5,1,1,6,1,6] now
Console.WriteLine("Count of digits: {0}", digits.Count); // Outputs "9"
for (int i = 0; i < digits.Count; i++) // Outputs "601,511,616"
{
    Console.Write("{0}", digits[i]);
    // Insert a comma after every 3 digits, except after the last group
    if ((i + 1) % 3 == 0 && i < digits.Count - 1) Console.Write(",");
}
IDEOne working demonstration of List and division approach.
Actually, if you don't need to split it up but only need to output in 3-digit groups, then there is a very convenient and proper way to do this with formatting.
It will work as well :)
int value = 601511616;
Console.WriteLine("{0:N0}", value); // 601,511,616
Console.WriteLine("{0:N2}", value); // 601,511,616.00
IDEOne working demonstration of formatting approach.
I can't understand your question regarding how to split a number into an array without making a string - sorry. But I can understand the question about getting the count of numbers in an int.
Here's your answer to that question.
Math.Floor(Math.Log10(601511616) + 1) = 9
Edit:
Here's the answer to your first question..
var n = 601511616;
var nArray = new int[3];
for (int i = 0, numMod = n; i < 3; numMod /= 1000, i++)
nArray[i] = numMod%1000;
Please keep in mind there's no safety in this operation.
Edit#2
Still not perfect, but a better example.
var n = 601511616;
var nLength = (int)Math.Floor(Math.Log10(n) + 1)/ 3;
var nArray = new int[nLength];
for (int i = 0, numMod = n; i < nLength; numMod /= 1000, i++)
nArray[i] = numMod%1000;
Edit#3:
IDEOne example http://ideone.com/SSz3Ni
the output is exactly as the edit approved by the poster suggested.
{ 616, 511, 601 }
Using Log10 to calculate the number of digits is easy, but it involves floating-point operations, which are slow and sometimes give an incorrect result due to rounding errors. You can use the following approach without calculating the size of the value first. It doesn't care whether the number of digits is a multiple of 3 or not.
int value = 601511616;
List<int> list = new List<int>();
while (value > 0) // main part to split the number
{
int t = value % 1000;
value /= 1000;
list.Add(t);
}
// Convert back to an array only if it's necessary, otherwise use List<T> directly
int[] splitted = list.ToArray();
This will store the split numbers in reverse order, i.e. 601511616 will become {616, 511, 601}. If you want the numbers in the original order, simply iterate over the array backwards. Alternatively, use Array.Reverse or a Stack.
Since you already know they are in multiples of 3, you can just use the extracting each digit method but use 1000 instead of 10. Here is the example
a = 601511616
b = []
while(a):
    b.append(a%1000)
    a = a//1000
print(b)
#[616, 511, 601]

Create a numeric value from text

I have searched a lot on Google without finding the results I am looking for.
I would like to get a numeric value from any string in C#
Ex.
var myString = "Oompa Loompa";
var numericoutput = convertStringToNumericValue(myString);
output/value of numericoutput is something like 612734818
so when I put in another string let say "C4rd1ff InTernaT!onal is # gr3at place#"
the int output will be something like 73572753.
The Values must stay constant, for example so if I enter the same text again of "Oompa Loompa" then I get 612734818 again.
I thought maybe in the way of using Font Family to get the char index of each character, but I don't know where to start with this.
The reason for this is so that I can use this number to generate an index out of a string with other data in it, and with the same input string, get that same index again to recall the final output string for validation.
Any help or point in the right direction would be greatly appreciated
Thanks to Tim I ended up doing the following:
var InputString = "My Test String ~!@#$%^&*()_+{}:<>?|";
byte[] asciiBytes = Encoding.ASCII.GetBytes(InputString);
int Multiplier = 1;
int sum = 0;
foreach (byte b in asciiBytes)
{
    sum += ((int)b) * Multiplier;
    Multiplier++;
}
Obviously this will not work for 1000's of characters, but it is good enough for short words or sentences
int.MaxValue = 2 147 483 647
As an alternative to converting the string to its bytes, and if a hash won't meet the requirements, here are a couple of shorter ways to get a numeric value for a string:
string inputString = "My Test String ~!@#$%^&*()_+{}:<>?|";
int multiplier = 0;
int sum = 0;
foreach (char c in inputString)
{
    sum += ((int)c) * ++multiplier;
}
The above code outputs 46026, as expected. The only difference is it loops through the characters in the string and gets their ASCII value, and uses the prefix ++ operator (Note that multiplier is set to 0 in this case - which is also the default for int).
Taking a cue from Damith's comment above, you could do the same with LINQ. Simply replace the foreach above with:
sum = inputString.Sum(c => c * ++multiplier);
Finally, if you think you'll need a number larger than Int32.MaxValue, you can use an Int64 (long), like this:
string inputString = "My Test String ~!@#$%^&*()_+{}:<>?|";
int multiplier = 0;
long largeSum = 0;
foreach (char c in inputString)
{
    // Cast to long before multiplying so the product itself cannot overflow an int
    largeSum += (long)c * ++multiplier;
}

Fixed Length Int Obfuscator. Does anyone know how to do this?

I am using a generic class to convert an INT to a X base:
BaseX basex = new BaseX("abcdefghijklmnopqrstuvwxyz");
var a = basex.ToBaseX(1002);
var b = basex.FromBaseX("aghe");
And the BaseX class is as follows:
public class BaseX {
private readonly string _digits;
public BaseX(string digits) {
_digits = digits;
}
public string ToBaseX(int number) {
var output = "";
do {
output = _digits[number % _digits.Length] + output;
number = number / _digits.Length;
}
while (number > 0);
return output;
}
public int FromBaseX(string number) {
return number.Aggregate(0, (a, c) => a * _digits.Length + _digits.IndexOf(c));
}
}
I am using the lowercase base but I can use any other base.
Is it possible to make the output in the base X always the same length?
I think I should use "Multiplicative Inverse" and some similar process with mapping and encoding but I am not sure how to do this ...
Could I get some help to create this?
Basically, my objective is, instead of creating random fixed-length codes to use in promotions or in ID obfuscation, to create a fixed-length representation of an INT (the ID in the database).
Thank You,
Miguel
If I understand you correctly you want to pad the generated value with "zeroes". E.g. if you were using plain numbers and you wanted an ID of length 10 and the ID was 1234 the padded ID would be 0000001234.
The simplest way is to pad the generated value. You can add a new method to the BaseX class:
public string ToBaseX(int number, int width) {
var output = ToBaseX(number);
return output.PadLeft(width, _digits[0]);
}
With this method basex.ToBaseX(1002, 10) returns
aaaaaaabmo
and basex.FromBaseX("aaaaaaabmo") returns
1002
In the comments you indicate that the resulting string aaaaaaabmo does not seem very random. But then you can use the approach that Eric Lippert describes in the article A practical use of multiplicative inverses that you are referring to.
First you need to pick an upper limit to the numbers you want to obfuscate (and this number should fit into a 32 bit integer). Eric Lippert uses 1000000000 (1 billion). You then need to pick a number less than the limit that is coprime with the limit (i.e. they do not share any prime factors). Eric Lippert chooses 387420489 (and explains that any number that ends in 9 will be coprime with a number that is a power of 10). You then need to calculate the modular multiplicative inverse of this number, i.e. a number inverse-x that satisfies the following condition:
387420489 * inverse-x = 1 (mod 1000000000)
You can use the extended Euclidean algorithm for this calculation, for instance via an online calculator. The modular multiplicative inverse is 513180409.
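If you would rather compute the inverse in code than rely on a calculator, the extended Euclidean algorithm is short. The following is a minimal sketch, not taken from the article; ModularInverse and Compute are names made up for this example, and it assumes both arguments are positive and small enough that the intermediate products fit in a 64 bit integer.

```csharp
using System;

public static class ModularInverse
{
    // Returns the value inv in [0, m) such that a * inv = 1 (mod m),
    // using the iterative extended Euclidean algorithm.
    public static long Compute(long a, long m)
    {
        // Invariants: oldS * a = oldR (mod m) and s * a = r (mod m).
        long oldR = a % m, r = m;
        long oldS = 1, s = 0;

        while (r != 0)
        {
            long q = oldR / r;
            (oldR, r) = (r, oldR - q * r);
            (oldS, s) = (s, oldS - q * s);
        }

        // oldR now holds gcd(a, m); an inverse exists only if it is 1.
        if (oldR != 1)
            throw new ArgumentException("a and m are not coprime, so no inverse exists.");

        // oldS may be negative; normalise it into the range [0, m).
        return ((oldS % m) + m) % m;
    }
}
```

Calling ModularInverse.Compute(387420489, 1000000000) should give 513180409, the value quoted above.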
To obfuscate your number you can use this code (to avoid overflow it is important to perform the calculation using 64 bit integers):
var value = 1002;
var m = 1000000000L;
var x = 387420489L;
var inverseX = 513180409L;
var encoded = value*x%m;
var decoded = encoded*inverseX%m;
For this particular calculation encoded is 195329978.
If you want to use the lower case letters to represent the obfuscated number you can use your BaseX class to convert the number to base 26. You can compute the maximum letters required to represent any number below 1 billion:
Math.Log(1000000000)/Math.Log(26) = 6.36054383137796
This means that you need no more than 7 letters to represent your number.
I have combined all this into two simple methods using some constants you can easily customize:
static class Obfuscator {
const Int64 modulo = 1000000000L;
const Int64 coprime = 280619659L;
const Int64 inverseCoprime = 687208739L;
const String digits = "abcdefghijklmnopqrstuvwxyz";
const Int32 maxDigits = 7; // Math.Log(modulo)/Math.Log(digits.Length) rounded up.
public static String Obfuscate(Int32 originalValue) {
if (originalValue >= modulo || originalValue < 0)
throw new ArgumentOutOfRangeException();
var value = (Int32) (originalValue*coprime%modulo);
var buffer = new Char[maxDigits];
var i = maxDigits;
do {
buffer[--i] = digits[value%digits.Length];
value /= digits.Length;
} while (value > 0);
while (i > 0)
buffer[--i] = digits[0];
return new String(buffer);
}
public static Int32 Deobfuscate(String obfuscatedValue) {
if (String.IsNullOrEmpty(obfuscatedValue))
throw new ArgumentException();
var value = obfuscatedValue
.Aggregate(0, (a, c) => a*digits.Length + digits.IndexOf(c));
return (Int32) (value*inverseCoprime%modulo);
}
}
The only detail to be aware of is that 0 is obfuscated into aaaaaaa. For any number between 1 and 999999999 (inclusive) you get what looks like a random string of 7 characters.
