With .NET things are fairly simple - it is all (including ARM, AFAIK) running little endian.
The question that I have is: what is happening on Mono and (potentially) big-endian systems? Do the bytes reverse (when compared to x86) in the Int32 / Int64 structures, or does the framework force a little-endian rule set?
Thanks
Your assertion that all MS .NET implementations are little endian is not correct. It depends on the architecture you are running on - the CLI spec says so:
From the CLI Annotated Standard (p.161) — Partition I, section 12.6.3: "Byte Ordering":
For data types larger than 1 byte, the byte ordering is dependent on the target CPU. Code that depends on byte ordering may not run on all platforms. [...]
(taken from this SO answer)
See this answer for more information on the internals of BitConverter and how it handles endianness.
A list of behavioral changes I can think of at the moment (unchecked and incomplete):
IPAddress.HostToNetworkOrder and IPAddress.NetworkToHostOrder
Nearly everything in BitConverter
BinaryReader and BinaryWriter (EDIT: From documentation: "BinaryReader reads this data type in little-endian format.")
Binary serialization
Everything that reads and writes Unicode in default encoding from/to streams (UnicodeEncoding) (EDIT: Default is defined as little endian)
and of course every (runtime library) function using these.
Usually Microsoft doesn't mention endianness in their docs - with some strange exceptions. For instance, BinaryReader.ReadUInt16 is defined to read little endian. Nothing is mentioned for the other methods. One may assume that binary serialization is always little-endian, even on big-endian machines.
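As an illustration of that documented BinaryWriter/BinaryReader behavior, here's a minimal sketch (assuming a compiler with C# 9 top-level statements; the expected output follows from the documented little-endian format):

using System;
using System.IO;

// BinaryWriter is documented to write multi-byte values little-endian,
// so this should print "34 12" (low byte first) on any platform.
byte[] buffer;
using (var ms = new MemoryStream())
using (var writer = new BinaryWriter(ms))
{
    writer.Write((ushort)0x1234);
    buffer = ms.ToArray();
}
Console.WriteLine($"{buffer[0]:X2} {buffer[1]:X2}");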
Note that XNA on the Xbox 360 is big-endian, so this is not just a theoretical problem with Mono.
C#/.NET does not make any claims about endianness. Int32/Int64 are atomic values, not structures.
As far as I know, any such conversion would happen outside the scope of your code, hidden from you. It's called "managed code" for a reason, and potential issues like this are part of it.
To know whether the bytes are "reversed", just check BitConverter.IsLittleEndian:
if (BitConverter.IsLittleEndian)
{
// reverse bytes
}
Considering how similar .NET and Mono are by design, I'd say they probably handle endianness the same way.
You can always test it by creating a managed int with a known value, then using reflection or marshalling to access the memory and take a look.
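For example, a minimal sketch of such a check, using BitConverter rather than reflection or marshalling (assuming a compiler with C# 9 top-level statements):

using System;

// Inspect the in-memory byte order of a known value.
int known = 0x01020304;
byte[] bytes = BitConverter.GetBytes(known);
// A little-endian machine prints "04-03-02-01";
// a big-endian one would print "01-02-03-04".
Console.WriteLine(BitConverter.ToString(bytes));
Console.WriteLine(BitConverter.IsLittleEndian);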
Related
Quick intro: the question is about UTF-8 vs. UTF-16.
*I tried my best to keep it as short and specific as possible, so please bear with me.
I know there are a gazillion variations on the specific UTF-8/16 issue, not to mention the global encoding subject,
which was the start of my questioning (ANSI vs. Unicode), and I guess it's not *MY* quest only,
as it could serve many other (performance-motivated) beginners in C++.
Being more specific - to the point:
Given the following environment parameters:
WINDOWS platform
C++ AND C#
using some English/Russian/Hebrew
*let's say this is a constant.
Can I use UTF-8 (half the size of UTF-16) and "get away with it"?
...saving space and time
TLDR
I have recently moved to C++. In the last few days I have tried to decide how to handle strings, one of the most expensive data types to process. I have followed almost every famous and less famous article on the encoding issue, but the more I kept searching, the more confused I became about compatibility, while trying to keep the application high-performance without crossing the boundaries of the *framework.
*I have used the term framework although I am planning to do most of the I/O via native C++.
Can I use UTF-8? Do I want UTF-8? I know one thing!
Windows' "blood type" is UTF-16, although I think low-level I/O and also HTTP use/default to/prefer/benefit from UTF-8,
but I am on Windows and still working with .NET.
What can I use to max out my app's performance - querying, manipulating, saving to a database...
a point I have read in a less famous [article]
A bit of research
This is a compilation of research I did to answer your problem:
Hebrew and Cyrillic in Unicode
According to Wikipedia, the Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F (I don't know in what proportions they are used):
https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
According to Wikipedia, again, Cyrillic can be found in the following blocks: U+0400–U+04FF, U+0500–U+052F, U+2DE0–U+2DFF, U+A640–U+A69F, U+1D2B, U+1D78, U+FE2E–U+FE2F
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
UTF-8 vs. UTF-16
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
UTF-16 can represent the following code points with two bytes: U+0000 to U+D7FF and U+E000 to U+FFFF, which means all the characters above will be represented with two bytes (a wchar_t on Windows).
To represent Hebrew and Cyrillic, UTF-8 will always need at least two bytes, and possibly three:
U+0000 - U+007F : 1 byte
U+0080 - U+07FF : 2 bytes
U+0800 - U+FFFF : 3 bytes
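To make this concrete, a small sketch comparing encoded sizes (in C#, since the question mentions both languages; the sample words are arbitrary):

using System;
using System.Text;

// Arbitrary sample words in the three languages from the question.
string[] samples = { "hello", "привет", "שלום" };
foreach (string s in samples)
{
    int utf8 = Encoding.UTF8.GetByteCount(s);
    int utf16 = Encoding.Unicode.GetByteCount(s); // UTF-16LE
    Console.WriteLine($"{s}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes");
}

English text is where UTF-8 wins; for the Cyrillic and Hebrew words, the two encodings come out the same size.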
Windows
You said it yourself: Windows' DNA is UTF-16. No matter what delusional websites claim, the WinAPI won't change to UTF-8, because that makes no sense from Microsoft's viewpoint (breaking compatibility with previous Windows applications just to make Linux lovers happy? Seriously?).
When you develop under Windows, everything Unicode there is optimized/designed for UTF-16.
Even the "char" API from the WinAPI is just a wrapper that converts your char strings into wchar_t strings before calling the UTF-16 API you should have been calling directly anyway.
Test!
As your problem seems to be mainly I/O, you should experiment to see if there is a meaningful difference between reading/writing/sending/receiving UTF-16 vs. UTF-8 with sample data.
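A rough sketch of such an experiment (C# for brevity; the file name "sample.txt" and the repetition count are placeholders - use data representative of your application):

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

// Build a large mixed-language sample string.
string text = string.Concat(Enumerable.Repeat("hello привет שלום ", 100_000));
var encodings = new (string Name, Encoding Enc)[]
{
    ("UTF-8", Encoding.UTF8),
    ("UTF-16", Encoding.Unicode),
};
foreach (var (name, enc) in encodings)
{
    var sw = Stopwatch.StartNew();
    File.WriteAllText("sample.txt", text, enc);              // write pass
    string readBack = File.ReadAllText("sample.txt", enc);   // read pass
    sw.Stop();
    long size = new FileInfo("sample.txt").Length;
    Console.WriteLine($"{name}: {size} bytes, {sw.ElapsedMilliseconds} ms, {readBack.Length} chars");
}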
Conclusion
From every fact above, I see either a neutral choice between UTF-8 and UTF-16 (the Hebrew and Cyrillic glyphs) (*), or a choice leading to UTF-16 (Windows).
So, my own conclusion, unless your tests show otherwise, would be to stick to UTF-16 on Windows.
(*) You could sample a few strings in all the languages you are using, and gather statistics on how often the most common characters occur.
Bonus?
Now, in your stead, I would avoid using wchar_t directly on Windows.
Instead, I would use the _T(), TCHAR and <tchar.h> macro/typedef/include machinery offered by Windows: with but a few macros defined (UNICODE and _UNICODE, if memory serves), as well as a few smart overloads, you can:
use wchar_t and utf-16 on Windows
use utf-8 on Linux
Which will make your code more portable should you switch to another OS.
Please read this article
http://www.joelonsoftware.com/articles/Unicode.html
Please read it carefully.
Now, regarding performance, I very much doubt that you'll see any difference.
You choose your encoding based on what your program is supposed to do.
Is it supposed to communicate with other programs?
Are you storing information in database that will be accessed by other people?
Performance and disk space are not your first priorities when deciding which encoding to use.
I'm trying to write a simple reader for AutoCAD's DWG files in .NET. I don't actually need to access all data in the file so the complexity that would otherwise be involved in writing a reader/writer for the whole file format is not an issue.
I've managed to read in the basics, such as the version, all the header data, the section locator records, but am having problems with reading the actual sections.
The problem seems to stem from the fact that the format uses a custom method of storing some data types. I'm going by the specs here:
http://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf
Specifically, the types that depend on reading in individual bits are the ones I'm struggling with. A large part of the problem seems to be that C#'s BinaryReader only lets you read whole bytes at a time, when in fact I believe I need the ability to read individual bits, not simply 8 bits or a multiple thereof at a time.
It could be that I'm misunderstanding the spec and how to interpret it, but if anyone could clarify how I might go about reading individual bits from a stream, or even how to read some of the variable types in the above spec that require more complex bit manipulation than simply reading full bytes, that would be excellent.
I do realise there are commercial libraries out there for this, but the price is simply too high on all of them to be justifiable for the task at hand.
Any help much appreciated.
You can always use the BitArray class to do bitwise manipulation. You read bytes from the file, load them into a BitArray, and then access individual bits.
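If BitArray's bit ordering doesn't match what the spec expects, a hand-rolled reader is only a few lines. A minimal sketch (MSB-first within each byte is an assumption here; verify the ordering against the DWG spec):

using System;
using System.IO;

// Minimal bit-level reader over a Stream. Bit order within a byte
// (MSB first here) is an assumption - check it against the spec.
class BitReader
{
    private readonly Stream _stream;
    private int _currentByte;  // byte currently being consumed
    private int _bitsLeft;     // unread bits remaining in _currentByte

    public BitReader(Stream stream) { _stream = stream; }

    // Reads a single bit; returns true for 1, false for 0.
    public bool ReadBit()
    {
        if (_bitsLeft == 0)
        {
            _currentByte = _stream.ReadByte();
            if (_currentByte < 0) throw new EndOfStreamException();
            _bitsLeft = 8;
        }
        _bitsLeft--;
        return ((_currentByte >> _bitsLeft) & 1) != 0;
    }

    // Reads 'count' bits (up to 32), most significant bit first.
    public uint ReadBits(int count)
    {
        uint value = 0;
        for (int i = 0; i < count; i++)
            value = (value << 1) | (ReadBit() ? 1u : 0u);
        return value;
    }
}

A two-bit code from the spec would then be ReadBits(2), for example, regardless of where it falls within a byte.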
For the price of any of those libraries, you definitely cannot develop something stable yourself. How much time have you spent so far?
I have a few questions about endian-ness that are related enough that I warrant putting them in as one question:
1) Is endian-ness decided by .Net or by the hardware?
2) If it's decided by the hardware, how can I figure out what endian the hardware is in C#?
3) Does endian-ness affect bitwise operations such as OR, AND, XOR, or shifts? I.e., will shifting once to the right always shift off the least significant bit?
4) I doubt it, but is there a difference in endian-ness between different versions of the .NET Framework? I assume they're all the same, but I've learned to stop assuming about some of the lower-level details such as this.
If need be, I can ask these as different questions, but I figure anybody who knows the answer to one of these probably knows the answer to all of them (or can point me in a good direction).
1) The hardware.
2) BitConverter.IsLittleEndian
3) Endianness does not affect bitwise operations. Shifting to the right is always shifting in the least-significant-bit direction. UPDATE from Oops' comment: However, endianness does affect binary input and output. When reading or writing values larger than a byte (e.g., reading an int from a BinaryReader or using BitConverter), you do have to account for endianness. Once the values are read in correctly, all bitwise operations act as normal (a sketch of an endianness-aware read follows this list).
4) Most versions of .NET are little endian. Notable exceptions include the XBox and some platforms supported by Mono or the Compact Framework.
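As mentioned in 3), here's a hedged sketch of reading a big-endian 32-bit value portably (the helper name ReadInt32BigEndian is made up for illustration):

using System;

static class EndianHelper
{
    // Reads a big-endian Int32 from a buffer regardless of the host's
    // native byte order.
    public static int ReadInt32BigEndian(byte[] buffer, int offset)
    {
        byte[] slice = new byte[4];
        Array.Copy(buffer, offset, slice, 0, 4);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(slice); // swap to match the host byte order
        return BitConverter.ToInt32(slice, 0);
    }
}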
1) Neither... or you can say either: it was decided by the hardware developers. And you must decide about it yourself if you write software that reads/writes certain file formats without using external libraries.
There is no problem with endianness if you read from a file
* a byte
but you have to decide how to interpret the data in little- or big-endian format
if you read any other primitive data type, like
* integers,
* strings,
* floats.
The hardware does not help here. The BitConverter does not help here. Only the documentation of the file format can help, and testing your code, of course...
Edit: found a good explanation here:
http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/
I am wondering about the quality and the stability of the hashes produced by the String.GetHashCode() implementation in .NET.
Concerning the quality, I am focusing on algorithmic aspects (hence, the quality of the hash as it impacts large hash tables, not security concerns).
Then, concerning the stability, I am wondering about the potential versioning issues that might arise from one .NET version to the next.
Some light on those two aspects would be much appreciated.
I can't give you any details about the quality (though I would assume it is pretty good given that string is one of the framework's core classes that is likely to be used as a hash key).
However, regarding the stability, the hash code produced on different versions of the framework is not guaranteed to be the same, and it has changed in the past, so you absolutely must not rely on the hash code being stable between versions (see here for a reference that it changed between 1.1 and 2.0). In fact, it even differs between the 32-bit and 64-bit versions of the same framework version; from the docs:
The value returned by GetHashCode is platform-dependent. For a specific string value, it differs on the 32-bit and 64-bit versions of the .NET Framework.
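A quick way to see this instability in practice (a sketch; note that on .NET Core and later, string hash codes are additionally randomized per process, so even two runs of the same build differ):

using System;

// Run this program twice: under .NET Core/.NET 5+ the printed values
// will (almost certainly) differ between runs, so never persist the
// output of String.GetHashCode.
Console.WriteLine("hello".GetHashCode());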
This is an old question, but I'd like to contribute by mentioning this Microsoft bug about hash quality.
Summary: On 64-bit, hash quality is very low when your string contains '\0' bytes. Basically, only the start of the string will be hashed.
If, like me, you have to use .NET strings to represent binary data as keys for high-performance dictionaries, you need to be aware of this bug.
Too bad, it's a WONTFIX... As a side note, I don't understand how they can call modifying the hash code a breaking change, when the code includes
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
and the hash code is already different between x86 and 64-bit anyway.
I just came across a related problem to this. On one of my computers (a 64-bit one) I had a problem that I tracked down to two different objects being identical except for the (stored) hash code. That hash code was created from a string... the same string!
m_storedhash = astring.GetHashCode();
I don't know how these two objects ended up with different hash codes given they were from the same string. However, I suspect what happened is this: within the same .NET exe, one of the class library projects I depend on was set to x86 and another to AnyCPU, and one of these objects was created in a method inside the x86 class lib while the other object (same input data, same everything) was created in a method inside the AnyCPU class library.
So, does this sound plausible: within the same executable in memory (not between processes), some of the code could be running with the x86 Framework's string.GetHashCode() and other code with the x64 Framework's string.GetHashCode()?
I know that this isn't really included in the meanings of quality and stability that you specified, but it's worth being aware that hashing extremely large strings can produce an OutOfMemoryException.
https://connect.microsoft.com/VisualStudio/feedback/details/517457/stringcomparers-gethashcode-string-throws-outofmemoryexception-with-plenty-of-ram-available
The quality of the hash codes is good enough for their intended purpose, i.e. they don't cause too many collisions when you use strings as keys in a dictionary. I suspect that the entire string is only used to calculate the hash code if the string is reasonably short; for huge strings, it probably only uses the first part.
There is no guarantee of stability across versions. The documentation clearly says that the hashing algorithm may change from one version to the next, so the hash codes are only for short-term use.
A Judy array is a fast data structure that can represent a sparse array or a set of values. Is there an implementation for managed languages such as C#? Thanks
It's worth noting that these are often called Judy Trees or Judy Tries if you are googling for them.
I also looked for a .Net implementation but found nothing.
Also worth noting that:
The implementation is heavily designed around efficient cache usage, and as such, implementation specifics may be highly dependent on the size of certain constructs used within the substructures. A .NET managed implementation may be somewhat different in this regard.
There are some significant hurdles to it that I can see (and there are probably more that my brief scan missed)
The API has some fairly anti-OO aspects (for example, a null pointer is viewed as an empty tree), so a simplistic "move the state pointer to the LHS and make the functions instance methods" conversion to C++ wouldn't work.
The implementation of the sub structures I looked at made heavy use of pointers. I cannot see these efficiently being translated to references in managed languages.
The implementation is a distillation of a lot of very complex ideas that belies the simplicity of the public api.
The code base is about 20K lines (most of it complex); this doesn't strike me as an easy port.
You could take the library and wrap the C code in C++/CLI (probably just holding a pointer to the C API trie internally and having all the C calls go through it). This would provide a simplistic implementation, but the linked libraries for the native implementation might be problematic (as might memory allocation).
You would also probably need to deal with converting .NET strings to plain old byte* at the boundary (or just work with bytes directly).
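If C++/CLI is not an option, a plain P/Invoke layer is the other route. A heavily hedged sketch (the library name "Judy" and exact exports depend on how you build the native library; the signatures below mirror the JudyLIns/JudyLGet functions from the C headers, so verify them before relying on this):

using System;
using System.Runtime.InteropServices;

// Sketch of P/Invoke bindings for the native JudyL API.
static class JudyNative
{
    // JudyLIns returns a pointer to the value slot for the given index;
    // the array handle is passed by reference because insertion can
    // reallocate the root.
    [DllImport("Judy")]
    public static extern IntPtr JudyLIns(ref IntPtr array, UIntPtr index, IntPtr error);

    // JudyLGet returns a pointer to the value slot, or IntPtr.Zero if
    // the index is absent.
    [DllImport("Judy")]
    public static extern IntPtr JudyLGet(IntPtr array, UIntPtr index, IntPtr error);
}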
Judy really doesn't fit well with managed languages. I don't think you'll be able to use something like SWIG and get the first layer done automatically.
I wrote PyJudy and I ended up having to make some non-trivial API changes to fit well in Python. For example, I wrote in the documentation:
JudyL arrays map machine words to machine words. In practice the words store unsigned integers or pointers. PyJudy supports all four mappings as distinct classes.
pyjudy.JudyLIntInt - map unsigned integer keys to unsigned integer values
pyjudy.JudyLIntObj - map unsigned integer keys to Python object values
pyjudy.JudyLObjInt - map Python object keys to unsigned integer values
pyjudy.JudyLObjObj - map Python object keys to Python object values
I haven't looked at the code for a few years so my memories about it are pretty hazy. It was my first Python extension library, and I remember I hacked together a sort of template system for code generation. Nowadays I would use something like genshi.
I can't point to alternatives to Judy - that's one reason why I'm searching Stackoverflow.
Edit: I've been told that my timing numbers in the documentation are off from what Judy's documentation suggests because Judy is developed for 64-bit cache lines and my PowerBook was only 32 bits.
Some other links:
Patricia tries (http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/ )
Double-Array tries (http://linux.thai.net/~thep/datrie/datrie.html)
HAT-trie (http://members.optusnet.com.au/~askitisn/index.html)
The last has comparison numbers for different high-performance trie implementations.
This is proving trickier than I thought. PyJudy might be worth a look, as would be Tie::Judy. There's something on Softpedia, and something Ruby-ish. Trouble is, none of these are .NET specifically.