Huffman Coding task.
What I'm doing:
Read a string from a file, prepare the Huffman structure, encode the string to bits and save those bits to a binary file.
What I need:
Decode the string from the binary file, but encoding and decoding must be independent - e.g. after closing the app.
I save to the binary file like this:
A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111;
00000110110010101100111110111110;
And I need to read that back and decode it. So I think I need to build the Huffman structure again from it - but how?
I see these options:
Encoder and decoder always use the same tree; it never changes. So the decoder already knows that 000 means A.
The tree is appended before the message in binary form. Encoder and decoder have to agree on the exact format for storing the tree; there are many possibilities for how to do this. In the simplest case there would be the number of encoded characters and, for every character, its ASCII code, the length of its Huffman code, and the code itself (a sketch of rebuilding the table from your header follows this list).
The tree is built on the fly using adaptive Huffman coding, but that does not seem to be your case.
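For the second option, here is a minimal sketch of rebuilding the code table from a header in the format shown in the question (the ':' and ';' separators are an assumption based on that sample):

// using System; using System.Collections.Generic;
static Dictionary<string, char> ParseCodeTable(string header)
{
    var table = new Dictionary<string, char>();
    foreach (string entry in header.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
    {
        char symbol = entry[0];              // "A:000" -> 'A', " :110" -> ' '
        string code = entry.Substring(2);    // the bit pattern after the ':'
        table[code] = symbol;                // map bit pattern -> character
    }
    return table;
}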
Since you know A:000;l:001;a:10; :110;m:010;k:011;o:1110;t:1111; you can traverse the string 00000110110010101100111110111110 one bit at a time and keep a switch statement (or lookup) for each character's code. Whenever you come across a case, e.g. 000, you output A. This is one way I can see to get back to the string; I am sure there is a better way out there.
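A rough sketch of that walk, using a dictionary built from the stored header instead of a hard-coded switch (Huffman codes are prefix-free, so the first match is always the right one):

// using System.Collections.Generic; using System.Text;
static string Decode(string bits, Dictionary<string, char> table)
{
    var result = new StringBuilder();
    var current = new StringBuilder();
    foreach (char bit in bits)
    {
        current.Append(bit);
        if (table.TryGetValue(current.ToString(), out char symbol))
        {
            result.Append(symbol);   // a complete code was matched
            current.Clear();
        }
    }
    return result.ToString();
}

Running the sample bit string through this with the sample table should give back "Ala ma kota".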
Hope this helps.
Assuming "Adaptive Huffman", it's not usual to decide yourself what code to use for each character.
The usual sequence is
Analyze the text to be encoded. That means counting the occurrences of each character. In the English language 'e' would be more frequent than 'x', 'y' or 'z' for example.
Sort the array of char/occurrence pairs in ascending order.
Build a binary tree - that means combining the two nodes with the lowest counts, adding their counts together and making a new tree node. Ignore those two and look for the next pair of lowest occurrences (which might include the node you just made). This continues until you end up with a tree with one root. (There are lots of helpful images of this, and a code sketch follows these steps.) I can explain this in more detailed steps if necessary.
From the root of the tree you "walk" to each leaf. For each "left" add a '0' and for each right a '1'. When you reach the leaf, you have the code for that letter. If your text has many e's it will have the shortest code and no other code will start with the same sequence of bits. This is the idea, the most frequent characters have the shortest code, thus bigger memory savings.
Now, by walking the tree you have the code (varying lengths) for each character.
Encode your text to a string of bits.
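A compact sketch of those steps (not your code; a plain list sort stands in for a priority queue to keep it short):

// using System.Collections.Generic; using System.Linq;
class Node
{
    public char? Symbol;     // null for internal nodes
    public int Count;
    public Node Left, Right;
}

static Dictionary<char, string> BuildCodes(string text)
{
    // 1-2: count occurrences of each character
    var nodes = text.GroupBy(c => c)
                    .Select(g => new Node { Symbol = g.Key, Count = g.Count() })
                    .ToList();

    // 3: keep combining the two lowest counts into a new node until one root remains
    while (nodes.Count > 1)
    {
        nodes.Sort((a, b) => a.Count - b.Count);
        nodes.Add(new Node { Count = nodes[0].Count + nodes[1].Count, Left = nodes[0], Right = nodes[1] });
        nodes.RemoveRange(0, 2);
    }

    // 4: walk from the root, adding '0' for each left branch and '1' for each right branch
    var codes = new Dictionary<char, string>();
    void Walk(Node n, string prefix)
    {
        if (n.Symbol.HasValue) { codes[n.Symbol.Value] = prefix; return; }
        Walk(n.Left, prefix + "0");
        Walk(n.Right, prefix + "1");
    }
    Walk(nodes[0], "");
    return codes;
}

A real implementation would use a priority queue and handle the edge case of a text with only one distinct character, but the shape is the same.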
To decode you use the same tree. You say it must work "after closing app" so you will have to store the tree in some form with the encoded data.
In your comment you mention the problem with having varying length codes. There is no ambiguity. In an extreme case, if you had more e's than all other characters combined, the tree would be very lopsided. 'e' would be encoded as '1' and all other letters would have codes of varying lengths, beginning with 0.
Being a computer programming rookie, I was given homework involving the use of the playing card suit symbols. In the course of my research I came across an easy way to retrieve the symbols:
Console.Write((char)6);
gives you ♠
Console.Write((char)3);
gives you ♥
and so on...
However, I still don't understand what logic C# uses to retrieve those symbols. I mean, the ♠ symbol in the Unicode table is U+2660, yet I didn't use it. The ASCII table doesn't even contain these symbols.
So my question is, what is the logic behind (char)int?
For these low numbers (below 32), this is an aspect of the console rather than C#, and it comes from Code page 437 - though it won't include the ones that have other meanings that the console actually uses, such as tab, carriage return, and bell. This isn't really portable to any context where you're not running directly in a console window, and you should use e.g. 0x2660 instead, or just '\u2660'.
The logic behind (char)int is that char is a UTF-16 code unit, one or two of which encode a Unicode codepoint. Codepoints are naturally ordinal numbers, each identifying a member of a character set. They are often written in hexadecimal and, specifically for Unicode, preceded by U+, for example U+2660.
UTF-16 is a mapping between codepoints and code units. Code units are 16 bits, so they can be operated on as integers. Since a char holds one code unit, you can convert a short to a char; and since the different integer types can interoperate, you can convert an int to a char.
So, your short (or int) has meaning as text only when it represents a UTF-16 code unit for a codepoint that only has one code unit. (You could also convert an int holding a whole codepoint to a string.)
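For example, a minimal sketch (char.ConvertFromUtf32 returns a string because codepoints above U+FFFF need two code units):

Console.Write((char)0x2660);                   // ♠ - a codepoint that fits in one UTF-16 code unit
Console.Write('\u2660');                       // the same character as a char literal
Console.Write(char.ConvertFromUtf32(0x1F0A1)); // 🂡 Playing Card Ace of Spades - two code units, so a string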
Of course, you could let the compiler figure it out for you and make it easier for your readers, too, with:
Console.Write('♥');
Also, forget ASCII. It's never the right encoding (except when it is). In case it's not clear, a string is a counted sequence of UTF-16 code units.
Below is what the text looks like when viewed in NotePad++.
I need to get the IndexOf for that piece of the string, to use in the code below. And I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"s represent the strange characters.
All these characters have corresponding ASCII codes; you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX means you specify one character, with XXXX the hexadecimal number corresponding to that character (the \x escape accepts one to four hex digits).
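The same string can also be written with \u escapes, which always take exactly four hex digits and so cannot accidentally absorb the characters that follow; the control codes here are just the ones from the example above, so the real file may differ:

int start = text.IndexOf("App\u0000\u0001\u0000\u0003\u0000\u0000\u0000DB INFO");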
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
The origin of these characters lies in printer (and other media) directives. For instance, you needed to instruct a printer to move to the next line or to stop the print job; they are still used nowadays. Some terminals use them to communicate color changes, etc. The most well known is \n, or \x000A, which means you start a new line. For text they are thus characters that specify how to handle the text - a bit like modern HTML, etc. (although it's only a limited equivalence). \n is thus only a new line because there is a consensus about that. If one defines his/her own encoding, one can invent a new system.
Echoing @JonSkeet's warning: when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences, and every 437 character is also in Unicode. The bytes that represent text will likely decode to the same characters you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
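For example, a minimal sketch of reading the file with code page 437 ("data.bin" is a placeholder path; on .NET Core / .NET 5+ the code pages encoding provider has to be registered first):

// using System; using System.IO; using System.Text;
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);  // not needed on .NET Framework
string text = File.ReadAllText("data.bin", Encoding.GetEncoding(437));
int start = text.IndexOf("DB INFO", StringComparison.Ordinal);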
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
I have been given a large quantity of XML files from which I need to pull out parts of the text elements and reuse them for other purposes. (I am using XDocument to pull the XML data.)
But how do I decode the text contained in the elements? What formatting is even used here? A few examples:
"What is the meaning of this® asks Sonny."
"The big centre cost 1¾ million pounds"
"... lost it. ® The next ..."
I have tried HttpUtility.HtmlDecode, but that did not do the trick. If I decode twice, the "&amp;#174;" turns into a ®, which is obviously not right.
Looks like the &amp;#174; ones are line breaks, or maybe question marks. The 190 one, I don't even know. Perhaps a dot or comma?
Any ideas would be welcome.
It does appear that the strings you show have been HTML encoded, and then XML encoded (or HTML again).
It is correct that &amp;#174; -> &#174; -> ® (the registered trademark symbol) per the ISO Latin-1 entities; &reg; should behave the same way.
Similarly, &amp;#190; would turn into ¾, a fraction representing three quarters.
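Assuming the element text really does contain double-encoded entities such as &amp;#174;, a minimal sketch of decoding twice (HttpUtility lives in System.Web; WebUtility.HtmlDecode from System.Net behaves the same here):

string raw = "What is the meaning of this&amp;#174; asks Sonny.";
string once = HttpUtility.HtmlDecode(raw);    // "... this&#174; asks Sonny."
string text = HttpUtility.HtmlDecode(once);   // "... this® asks Sonny."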
I read How can I detect the encoding/codepage of a text file
It's not possible to detect encoding. However is it possible to detect whether encoding is one of two allowed?
For example I allow user to use Unicode UTF-8 and iso-8859-2 for their csv files. Is it possible to detect whether it is former or latter?
For example I allow user to use Unicode UTF-8 and iso-8859-2 for their csv files. Is it possible to detect whether it is former or latter?
It's not possible with 100% accuracy because, for example, the bytes C3 B1 are an equally valid representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to all 256 possible bytes, every UTF-8 string is also a valid ISO-8859-2 string (representing different characters if non-ASCII).
However, the converse is not true. UTF-8 has strict rules about which sequences are valid. More than 99% of possible 8-octet sequences are not valid UTF-8, and your CSV files are probably much longer than that. Because of this, you can get good accuracy if you do the following (a code sketch follows the two steps):
Perform a UTF-8 validity check. If it passes, assume the data is UTF-8.
Otherwise, assume it's ISO-8859-2.
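A minimal sketch of that check, assuming the whole file fits in memory (on .NET Core / .NET 5+, ISO-8859-2 may require registering CodePagesEncodingProvider first):

// using System.IO; using System.Text;
byte[] bytes = File.ReadAllBytes(path);
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
string text;
try
{
    text = strictUtf8.GetString(bytes);                          // succeeds only for valid UTF-8
}
catch (DecoderFallbackException)
{
    text = Encoding.GetEncoding("iso-8859-2").GetString(bytes);  // otherwise assume ISO-8859-2
}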
However is it possible to detect whether encoding is one of two allowed?
UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation.
UTF-16 can be detected by presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to have unpaired surrogates).
If you have at least one "detectable" encoding, then you can check for the detectable encoding, and use the undetectable encoding as a fallback.
If both encodings are "undetectable", like ISO-8859-1 and ISO-8859-2, then it's more difficult. You could try a statistical approach like chardet uses.
Since it is impossible to detect the encoding in general, you still cannot detect it with certainty even when you limit it to two possible encodings.
The only thing I can think of is that you could try decoding it with one of the two possible encodings, but then you would have to check whether it came out right. That would involve parsing the text, and even then you would not be 100% certain it was right.
Both of those encodings share the same meaning for all octets <128.
So you would need to look at octets >= 128 to make the determination. Since in UTF-8 octets >= 128 always occur in groups (sequences of 2 or more octets encoding a single code point), a three-octet sequence {<128, >=128, <128} would be an indication of ISO-8859-2.
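A rough sketch of that pattern check (an isolated octet >= 128 between two octets < 128 cannot occur in well-formed UTF-8):

static bool LooksLikeIso8859(byte[] data)
{
    for (int i = 1; i < data.Length - 1; i++)
    {
        // a lone high byte surrounded by ASCII bytes is invalid UTF-8,
        // which points to a single-byte encoding such as ISO-8859-2
        if (data[i] >= 0x80 && data[i - 1] < 0x80 && data[i + 1] < 0x80)
            return true;
    }
    return false;
}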
If the file contains no octets outside ASCII (i.e. <128), or very few, then making the determination will be impossible or limited. Of course, if the file starts with a UTF-8 encoded BOM (quite likely if it comes from Windows) then you know it is UTF-8.
It is generally more reliable to use some metadata (as XML does with its declaration) than to rely on a heuristic, because it is possible someone has sent you ISO-8859-3.
If you use a StreamReader there is an overload which will detect the encoding if possible (BOM) but defaults to UTF8 if detection fails.
I would suggest you use two options (UTF8 or Current) and if the user selects Current you use
var encoding = Encoding.GetEncoding(
    CultureInfo.CurrentCulture.TextInfo.OEMCodePage);
var reader = new StreamReader(path, encoding);  // StreamReader needs the file (or stream) as well as the encoding
which will hopefully be the right encoding.
See my (recent) answer to the linked question: How can I detect the encoding/codepage of a text file
This class will check whether it is possible that the file is UTF-8, and then it will attempt to guess whether it is probable.
Basically I have binary data. I don't mind if it's unreadable, but I'm writing it to a file which is parsed, so it's important that newline characters are taken out.
I thought I had done the right thing when I converted it to a string...
byte[] b = (byte[])SubKey.GetValue(v[i]);
s = System.Text.ASCIIEncoding.ASCII.GetString(b);
and then removed the newlines
String t = s.Replace("\n", "");
but it's not working?
Newline might be \r\n, and your binary data might not be ASCII encoded.
Firstly, a newline (Environment.NewLine) is usually two characters on Windows - do you mean removing single carriage-return or line-feed characters?
Secondly, applying a text encoding to binary data is likely to lead to unexpected conversions. E.g. what will happen to bytes of the binary data that do not map to ASCII characters?
The newline marker may be \n, \r or \r\n depending on the operating system; in that order, these are the markers for Linux, (classic) Macintosh and Windows.
But if you say your file is binary, how do you know that it has ASCII newlines in its content?
If it is a binary file, it may contain some struct; if so, removing the newline bytes shifts all the data after each removed byte to the left and corrupts the struct.
I would imagine removing the bytes in a binary chunk which correspond to line feeds would actually corrupt the binary data, thereby making it useless.
Perhaps you'd be better off using base64 encoding, which will produce ASCII-safe output.
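A minimal sketch of the Base64 route, reusing the names from the question (Convert.ToBase64String emits no newline characters unless you ask for them):

byte[] b = (byte[])SubKey.GetValue(v[i]);          // the binary data from the question
string line = Convert.ToBase64String(b);           // ASCII-safe, contains no '\n' or '\r'
byte[] original = Convert.FromBase64String(line);  // lossless round trip when reading back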
If this is text data, then load it as text data (using the correct encoding), do the replace as a string, and re-encode it (using the correct encoding). For some encodings you might be able to do a swap at the file level (without decoding/encoding), but I wouldn't bet on it.
If this is any other binary representation, you will have to know the exact details. For example, it is common (but not for certain) for strings embedded in part of a binary file to have a length prefix. If you change the data without changing the length prefix, you've just corrupted the file. And to change the length prefix you need to know the format (it might be big-endian/little-endian, any fixed number of bytes, or the prefix itself could be variable length). Or it might be delimited. Or there might be relative offsets scattered through the file that all need fixing.
Just as likely, you could by chance have the same byte sequence in the binary that doesn't represent a newline, so you could be completely trashing the data.