Decrypting text using frequency analysis in c#.

Decrypting text using frequency analysis in c#. - c#

I've been tasked with decrypting a text file using frequency analysis. This isn't a do it for me question but i have absolutley no idea what to do next. What i have so far reads in the text from file and counts the frequency of each letter. If someone could point me in the right direction as to swapping letters depending on their frequency it would be much appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace freqanaly
{
class Program
{
static void Main()
{
string text = File.ReadAllText("c:\\task_2.txt");
char[,] message = new char[2,26];
Console.Write(text); int count = 0;
for (int x = 'A'; x <= 'Z'; x++)
{
message[0, count] = (char)x;
Console.WriteLine(message[0, count]);
count++;
}
foreach (char c in text)
{ count = 0;
for (int x = 'A'; x <= 'Z'; x++)
{
if (c == x)
{
message[1, count]++;
}
count++;
}
}
Console.ReadKey();
for (int x = 0; x <= 25; x++)
{
Console.Write(message[0, x]); Console.Write(" = "); Console.WriteLine((int)message[1, x]);
}
Console.ReadKey();
}
}
}

This IS encrypted data, just using a simple subsitution cipher (I assume). See the definition of encoding/encrypting.
http://www.perlmonks.org/index.pl?node_id=66249
Regardless, as Sergey suggested, get a letter frequency table and match frequencies. You will have to take into account some deviation, since there is no guarantee there are exacltly 8.167% of 'A's in the document (perhaps in this document the percent of 'A's are 8.78 or 7.65%). Also, be sure to evaluate on every occurance of A, and not differentiate 'a' from 'A'. This can be handled with a simple ToUpper or ToLower transform on the character; just be consistant.
Also, when you start getting into less common, but still popular letters, you will need to handle that. C, F, G, W, and M are all around the 2% +/- mark, so you will need to play with the decrypted text till the letters fit in the word, and in other words within the document where this character substitution will also happen. This concept is similar to fitting numbers in a Suduko matrix. Luckily, once you find where a letter should go, it cascades through out the document and you can start to see the decrypted plain text emerge. As an example, '(F)it' and '(W)it' are both valid words, but if you see '(F)hen' in the document when you substitute a 'F', you can make a good guess that you should substitute this character with a 'W' instead. (T)here and (W)here is another example, and a word ()hen won't provide any guidance by itself, since both (W)hen and (T)hen are valid words. It is here you have to incorporate contextual clues as to which word makes sense. "Then is a good time to start our attack?" doesn't make as much sense as "When is a good time to start our attack?".
All of this is assusming you are using a monoalphebetic substitution. A polyalphebetic substitution is more difficult, and you may need to look into cracking the Vigenère cipher examples to try to figure out a way around this problem.
I suggest reading "The Code Book" by S. Singh. It is a very interesting read and easy to digest the historical ciphers used and how they were cracked.
http://www.google.com/products/catalog?q=the+code+book&rls=com.microsoft:en-us:IE-SearchBox&oe=&um=1&ie=UTF-8&tbm=shop&cid=5361323398438876518&sa=X&ei=hpR0T-HyObSK2QWvgvH-Dg&ved=0CFoQ8wIwBQ#

Next you should grab some of publically available English frequency lists (from Wikipedia, for example) and compare the actual frequencies table you got with it - in order to find the replacements for letters.

Related

Fastest way to split a huge text into smaller chunks

I have used the below code to split the string, but it takes a lot of time.
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
string strSegmentData = "";
string line = srSegmentData.ReadToEnd();
int startPos = 0;
ArrayList alSegments = new ArrayList();
while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
{
strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
startPos = startPos + segmentSize;
}
}
Please suggest me an alternative way to split the string into smaller chunks of fixed size

First of all you should define what you mean with chunk size. If you mean chunks with a fixed number of code units then your actual algorithm may be slow but it works. If it's not what you intend and you actually mean chunks with a fixed number of characters then it's broken. I discussed a similar issue in this Code Review post: Split a string into chunks of the same length then I will repeat here only relevant parts.
You're partitioning over Char but String is UTF-16 encoded then you may produce broken strings in, at least, three cases:
One character is encoded with more than one code unit. Unicode code point for that character is encoded as two UTF-16 code units, each code unit may end up in two different slices (and both strings will be invalid).
One character is composed by more than one code point. You're dealing with a character made by two separate Unicode code points (for example Han character 𠀑).
One character has combining characters or modifiers. This is more common than you may think: for example Unicode combining character like U+0300 COMBINING GRAVE ACCENT used to build à and Unicode modifiers such as U+02BC MODIFIER LETTER APOSTROPHE.
Definition of character for a programming language and for a human being are pretty different, for example in Slovak dž is a single character however it's made by 2/3 Unicode code points which are in this case also 2/3 UTF-16 code units then "dž".Length > 1. More about this and other cultural issues on How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit) then you will treat it as a single glyph however it represents two characters. What to do in this case? In general definition of character may be pretty vague because it has a different meaning according to discipline where this word is used. You can't (probably) handle everything correctly but you should set some constraints and document code behavior.
One proposed (and untested) implementation may be this:
public static IEnumerable<string> Split(this string value, int desiredLength)
{
var characters = StringInfo.GetTextElementEnumerator(value);
while (characters.MoveNext())
yield return String.Concat(Take(characters, desiredLength));
}
private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
{
for (int i = 0; i < count; ++i)
{
yield return (string)enumerator.Current;
if (!enumerator.MoveNext())
yield break;
}
}
It's not optimized for speed (as you can see I tried to keep code short and clear using enumerations) but, for big files, it still perform better than your implementation (see next paragraph for the reason).
About your code note that:
You're building a huge ArrayList (?!) to hold result. Also note that in this way you resize ArrayList multiple times (even if, given input size and chunk size then its final size is known).
strSegmentData is rebuilt multiple times, if you need to accumulate characters you must use StringBuilder otherwise each operation will allocate a new string and copying old value (it's slow and it also adds pressure to Garbage Collector).
There are faster implementations (see linked Code Review post, especially Heslacher's implementation for a much faster version) and if you do not need to handle Unicode correctly (you're sure you manage only US ASCII characters) then there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files pre-allocating right size output list). I do not repeat their code here then please refer to linked posts.
In your specific you do not need to read entire huge file in memory, you can read/parse n characters at time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively you can read line by line (managing to handle cross-line chunks).

Below is my analysis of your question and code (read the comments)
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
string strSegmentData = "";
string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
int startPos = 0;
ArrayList alSegments = new ArrayList(); // Better choice would be to use List<string>
while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
{
strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine; // Seem like you are inserting linebreaks at specified interval in your original string. Is that what you want?
alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine); // Why are you recalculating the Substring? Why are you appending the newline if the aim is to just "split"
startPos = startPos + segmentSize;
}
}
Making all kind of assumption, below is the code I would recommend for splitting long string. It is just a clean way of doing what you are doing in the sample. You can optimize this, but not sure how fast you are looking for.
static void Main(string[] args) {
string fileNamePath = "ConsoleApplication1.pdb";
var segmentSize = 32;
var op = ReadSplit(fileNamePath, segmentSize);
var joinedSTring = string.Join(Environment.NewLine, op);
}
static List<string> ReadSplit(string filePath, int segmentSize) {
var splitOutput = new List<string>();
using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024 )) {
char []buffer = new char[segmentSize];
while (!file.EndOfStream) {
int n = file.ReadBlock(buffer, 0, segmentSize);
splitOutput.Add(new string(buffer, 0, n));
}
}
return splitOutput;
}
I haven't done any performance tests on my version, but my guess is that it is faster than your version.
Also, I am not sure how you plan to consume the output, but a good optimization when doing I/O is to use async calls. And a good optimization (at the cost of readability and complexity) when handling large string is to stick with char[]
Note that
You might have to deal with Character encoding issues while reading the file
If you already have the long string in memory and file reading was just include in the demo, then you should use the StringReader class instead of the StreamReader class

The best way to 4 letter words based on string with limits?

So I'm using this code down here to figure out all the words that could be spelled out of the alphabet variable, the problem is , I build this alphabet variable each time I call this based on the board of random letters in front of the user. What i see though , and of course, is "aaab" for example...
What I'm after is for code to only use the letter as many times as it appears in the alphabet var, so that it can't do something like "aaab" but just "ab"
I understand this code that I found in another thread is made to build combinations of the letters into 4 letter words, or arrangements,
I'm wondering if theres a simple way using SelectMany or Select, to not add up its self if its already been used, keep in mind there could be multiple "a's" in the alphabet var to begin with, so if theres 2 A's in there, it should still be able to to AAB, just not AAAB. I am a newbie, I know that I could go through my own list and add letters together based on how many times they actually exist in the alphabet string..im just wondering if theres a way to interupt i or x and not add to q if its already been used...
sorry if this is confusing... thank you :)
// I found this in another thread and seemed to work great and fast.
var alphabet = "abcd";
var q = alphabet.Select(x => x.ToString());
int size = 4;
for (int i = 0; i < size - 1; i++)
q = q.SelectMany(x => alphabet, (x, y) => x + y);
foreach (var item in q)
( DO STUFF)

To reach your goal, you must find a way to mark letters in your alphabet which are already used and avoid using these letters a second time.
To do so you need a data structure which can store more than the letters alone, so a list of letters (or a string) is not sufficient.
Try to bulid a list of classes like this one:
class UsedLetter
{
char letter;
bool used;
}
Then you can mark each letter as used after you drew it from the list.
Improvement
You may also store your alphabet as a list of characters:
List<char> alphabet;
and remove each letter from the alphabet after its drawn.

Here's how I have achieved what I think you're after:
using System;
using System.Collections.Generic;
using System.Linq;
namespace WordPerms
{
class Program
{
Stack<char> chars = new Stack<char>();
List<string> words = new List<string>();
static void Main(string[] args)
{
Program p = new Program();
p.GetChar("abad");
foreach (string word in p.words)
{
Console.WriteLine(word);
}
}
// This is called recursively to build the list of words.
private void GetChar(string alpha)
{
string beta;
for (int i = 0; i < alpha.Length; i++)
{
chars.Push(alpha[i]);
beta = alpha.Remove(i, 1);
GetChar(beta);
}
char[] charArray = chars.Reverse().ToArray();
words.Add(new string(charArray));
if (chars.Count() >= 1)
{
chars.Pop();
}
}
}
}
Hope that helps, Greg.

Comparing a string[index] to another string

I am in the process of learning C# and I'm building a hangman game from scratch as one of my first projects.
Everything works except for the part that replaces the dashes of the hidden word with the correctly guessed letters.
For example: ----- becomes G-EA- after you guess G, E, and A.
I have a for loop that logically seems like it'd do the job except I can't use the == operator for strings or chars.
for (int i = 0; i <= answer.Length; i++) //answer is a string "theword"
{
if (answer[i] == passMe) //passMe is "A" for example
{
hiddenWord = hiddenWord.Remove(i, 1);
hiddenWord = hiddenWord.Insert(i, passMe);
}
}
I've scoured the net trying to find a good solution. Most recommend using Regex or other commands I haven't learned yet and therefore don't fully understand how to implement.
I've tried converting both to char format in the hope that it would fix it, but no luck so far. Thanks in advance for any help.

If passMe is a string of only one char then
if (answer[i] == passMe[0])
In this way you compare the character at i-th position with the character at the first position of your user input
There is also a serious error in your code.
Your loop goes off by one, change it to
for (int i = 0; i < answer.Length; i++)
The arrays in NET start at index zero and, the max index value possible, is always one less than the length of the array.

answer[i] refers to a character and passMe is a single character string. (not a character)
Try this
for (int i = 0; i <= answer.Length; i++) //answer is a string "theword"
{
if (answer[i] == passMe[0]) //passMe is "A" for example
{
hiddenWord = hiddenWord.Remove(i, 1);
hiddenWord = hiddenWord.Insert(i, passMe);
}
}
you need to compare a character with a character.

Sort the arguments and insert each word into an array of strings

So let's say I have these arguments: "aword1 bword dword zword aword2"
I must take them and insert them into an array of strings like this:
arrayofstrings[0][] = aword1 aword2
arrayofstrings[1][] = bword
arrayofstrings[2][] = null/nothing
arrayofstrings[3][] = dword
...
arrayofstrings[25][] = zword
I already know how to do this but it's not elegant.
My solution: check each word in args for it's first letter (26 if's or 26 cases) and increment counters in a 26 integer array so I can know how much space I must allocate and check and copy each word into the [][]arrayofstrings.
How can I do it without 26 if's or cases? Much appreciated. In case you didn't notice I'm new to c#(coming from c++, vb).
ps: my alphabet contains 26 letters.
Later edit:
Here's my sample code (just for you to get an ideea of what I want):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Lab4LineComanda
{
class Program
{
static void Main(string[] args)
{
string[][] ArrayOfStrings = new string[25][];
int[] counters = new int[26];
for (int i = 0; i < args.Length; i++)
{
if (args[i].First() == 'a')
counters[0]++;
if (args[i].First() == 'b')
counters[1]++;
//if's for each letter which I don't feel like doing
if (args[i].First() == 'z')
counters[25]++;
}
Array.Sort(args);
ArrayOfStrings[0] = new string[counters[0]];
ArrayOfStrings[1] = new string[counters[1]];
for (int i = 0; i < counters[0]; i++)
ArrayOfStrings[0][i] = args[i];
//again, for each letter i must do this, which, again, I don't feel like doing(there must be better way)
}
}
}
I don't know how many words that start with 'a' are(if any). That goes for every letter.
I must use a string[][] for education purposes. I used the "c++" tag because to me it's more like a logic problem and not a language specific problem but it's my bad.
Sorry for confuses, it's hard for me to explain exactly what I want because English is hard(for me) :).

Here's a way to do it in 3 lines:
IList<IList<string>> allMyStrings = Enumerable.Repeat(0, 26)
.Select(x => new List<string>()).ToArray();
foreach (var arg in "aword1 bword dword zword aword2".Split(' '))
allMyStrings[arg[0] - 'a'].Add(arg);
Instead of a string[][], I used a List<string>[], so that the lists inside will resize themselves as needed. And for calculating which list to add the string to, I use arg[0] - 'a', which calculates the difference in the char values between the first letter of the argument and the character a. It's not general code, in that it will not work correctly if any of the arguments don't start with a letter from a-z, but it works for your example.

You can use the fact that in character table all the letters from a to z are going in sequential order. So you can allocate List<List<String>> and get the index to add by ord('a') - ord(firstChar).

Constantly Incrementing String

So, what I'm trying to do this something like this: (example)
a,b,c,d.. etc. aa,ab,ac.. etc. ba,bb,bc, etc.
So, this can essentially be explained as generally increasing and just printing all possible variations, starting at a. So far, I've been able to do it with one letter, starting out like this:
for (int i = 97; i <= 122; i++)
{
item = (char)i
}
But, I'm unable to eventually add the second letter, third letter, and so forth. Is anyone able to provide input? Thanks.

Since there hasn't been a solution so far that would literally "increment a string", here is one that does:
static string Increment(string s) {
if (s.All(c => c == 'z')) {
return new string('a', s.Length + 1);
}
var res = s.ToCharArray();
var pos = res.Length - 1;
do {
if (res[pos] != 'z') {
res[pos]++;
break;
}
res[pos--] = 'a';
} while (true);
return new string(res);
}
The idea is simple: pretend that letters are your digits, and do an increment the way they teach in an elementary school. Start from the rightmost "digit", and increment it. If you hit a nine (which is 'z' in our system), move on to the prior digit; otherwise, you are done incrementing.
The obvious special case is when the "number" is composed entirely of nines. This is when your "counter" needs to roll to the next size up, and add a "digit". This special condition is checked at the beginning of the method: if the string is composed of N letters 'z', a string of N+1 letter 'a's is returned.
Here is a link to a quick demonstration of this code on ideone.

Each iteration of Your for loop is completely
overwriting what is in "item" - the for loop is just assigning one character "i" at a time
If item is a String, Use something like this:
item = "";
for (int i = 97; i <= 122; i++)
{
item += (char)i;
}

something to the affect of
public string IncrementString(string value)
{
if (string.IsNullOrEmpty(value)) return "a";
var chars = value.ToArray();
var last = chars.Last();
if(char.ToByte() == 122)
return value + "a";
return value.SubString(0, value.Length) + (char)(char.ToByte()+1);
}
you'll probably need to convert the char to a byte. That can be encapsulated in an extension method like static int ToByte(this char);
StringBuilder is a better choice when building large amounts of strings. so you may want to consider using that instead of string concatenation.

Another way to look at this is that you want to count in base 26. The computer is very good at counting and since it always has to convert from base 2 (binary), which is the way it stores values, to base 10 (decimal--the number system you and I generally think in), converting to different number bases is also very easy.
There's a general base converter here https://stackoverflow.com/a/3265796/351385 which converts an array of bytes to an arbitrary base. Once you have a good understanding of number bases and can understand that code, it's a simple matter to create a base 26 counter that counts in binary, but converts to base 26 for display.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Decrypting text using frequency analysis in c#. - c#

Next you should grab some of publically available English frequency lists (from Wikipedia, for example) and compare the actual frequencies table you got with it - in order to find the replacements for letters.

Related

Fastest way to split a huge text into smaller chunks

The best way to 4 letter words based on string with limits?

Comparing a string[index] to another string

Sort the arguments and insert each word into an array of strings

Constantly Incrementing String

Categories

Resources