I am writing algorithms that work on series of numeric data, where sometimes, a value in the series needs to be null. However, because this application is performance critical, I have avoided the use of nullable types. I have perf tested the algorithms to specifically compare the performance of using nullable types vs non-nullable types, and in the best case scenario nullable types are 2x slower, but often far worse.
The data type most often used is double, and currently the chosen alternative to null is double.NaN. However I understand this is not the exact intended usage for the NaN value, so am unsure whether there are any issues with this I cannot foresee and what the best practise would be.
I am interested in finding out what the best null alternatives are for the following data types in particular: double/float, decimal, DateTime, int/long (although others are more than welcome)
Edit: I think I need to clarify my requirements about performance. Gigs of numerical data are processed through these algorithms at a time, which takes several hours. Therefore, although the difference between e.g. 10ms and 20ms is usually insignificant, in this scenario it really does make a significant impact on the time taken.
Well, if you've ruled out Nullable<T>, you are left with domain values - i.e. a magic number that you treat as null. While this isn't ideal, it isn't uncommon either - for example, a lot of the main framework code treats DateTime.MinValue the same as null. This at least moves the damage far away from common values...
Edit: to highlight, this applies only where there is no NaN.
So where there is no NaN, maybe use .MinValue - but just remember the evils that happen if you accidentally use that same value when you mean the actual number...
Obviously for unsigned data you'll need .MaxValue (avoid zero!!!).
Personally, I'd try to use Nullable<T> as expressing my intent more safely... there may be ways to optimise your Nullable<T> code, perhaps. And also - by the time you've checked for the magic number in all the places you need to, perhaps it won't be much faster than Nullable<T>?
I somewhat disagree with Gravell on this specific edge case: a null-ed variable is considered 'not defined'; it doesn't have a value. So whatever is used to signal that is OK, even magic numbers, but with magic numbers you have to take into account that a magic number will always haunt you in the future when it suddenly becomes a 'valid' value. With Double.NaN you don't have to be afraid of that: it will never become a valid double. Though you do have to consider that NaN can only be used as the marker for 'not defined' in a sequence of doubles; you obviously can't also use it as an error code in the same sequences.
So whatever is used to mark 'undefined': it has to be clear in the context of the set of values that that specific value is considered the value for 'undefined' AND that won't change in the future.
If Nullable give you too much trouble, use NaN, or whatever else, as long as you consider the consequences: the value chosen represents 'undefined' and that will stay.
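To make the convention concrete, here is a minimal sketch (the method name is mine) of an algorithm that treats NaN as 'undefined' and skips it while processing a series:

static double MeanOfDefined(double[] series)
{
    double sum = 0;
    int count = 0;
    foreach (double v in series)
    {
        if (double.IsNaN(v)) // NaN == NaN is false, so IsNaN is the only safe test
            continue;
        sum += v;
        count++;
    }
    return count == 0 ? double.NaN : sum / count; // all-undefined input stays undefined
}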
I am working on a large project that uses NaN as a null value. I am not entirely comfortable with it - for similar reasons as yours: not knowing what can go wrong. We haven't encountered any real problems so far, but be aware of the following:
NaN arithmetic - While, most of the time, "NaN promotion" is a good thing, it might not always be what you expect.
Comparison - Comparison of values gets rather expensive if you want NaNs to compare equal. Testing floats for equality isn't simple anyway, but ordering (a < b) can get really ugly, because NaNs sometimes need to be smaller, sometimes larger than normal values (see the comparer sketch at the end of this answer).
Code Infection - I see lots of arithmetic code that requires specific handling of NaNs to be correct. So you end up with "functions that accept NaNs" and "functions that don't" for performance reasons.
Other non-finites - NaN is not the only non-finite value; positive and negative infinity should be kept in mind as well...
Floating Point Exceptions are not a problem when disabled. Until someone enables them. True story: static initialization of a NaN in an ActiveX control. Doesn't sound scary, until you change the installer to InnoSetup, which uses a Pascal/Delphi(?) core, which has FPU exceptions enabled by default. Took me a while to figure out.
So, all in all, nothing serious, though I'd prefer not to have to consider NaNs that often.
I'd use Nullable types as often as possible, unless they are (proven to be) a performance / resource constraint. One case could be large vectors / matrices with occasional NaNs, or large sets of named individual values where the default NaN behavior is correct.
Alternatively, you can use an index vector for vectors and matrices, standard "sparse matrix" implementations, or a separate bool/bit vector.
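As promised above, a sketch of an ordering that pins NaNs to one end. Note that .NET's Comparer&lt;double&gt;.Default already sorts NaN before all real values; a custom comparer like this (the class name is mine) is only needed if you want different placement or tie-breaking:

// using System.Collections.Generic;
sealed class NanFirstComparer : IComparer<double>
{
    public int Compare(double x, double y)
    {
        bool xNan = double.IsNaN(x), yNan = double.IsNaN(y);
        if (xNan) return yNan ? 0 : -1; // all NaNs compare equal, and smaller than any real value
        if (yNan) return 1;
        return x.CompareTo(y);
    }
}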
Partial answer:
Float and Double provide NaN (Not a Number). NaN is a little tricky since, per spec, NaN != NaN. If you want to know if a number is NaN, you'll need to use Double.IsNaN().
See also Binary floating point and .NET.
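A quick demonstration of the trickiness (note that Equals deliberately disagrees with ==):

double nan = double.NaN;
Console.WriteLine(nan == nan);        // False: NaN compares unequal to everything, itself included
Console.WriteLine(nan.Equals(nan));   // True: Equals() intentionally treats NaN as equal to NaN
Console.WriteLine(double.IsNaN(nan)); // True: the idiomatic test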
Maybe the significant performance decrease happens when calling one of Nullable's members or properties (boxing).
Try to use a struct with the double + a boolean telling whether the value is specified or not.
One can avoid some of the performance degradation associated with Nullable<T> by defining one's own structure:
struct MaybeValid<T>
{
    public bool isValue; // true when Value holds real data
    public T Value;
}
If desired, one may define a constructor, or a conversion operator from T to MaybeValid<T>, etc., but overuse of such things may yield sub-optimal performance. Exposed-field structs can be efficient if one avoids unnecessary data copying. Some people may frown upon the notion of exposed fields, but they can be massively more efficient than properties.
If a function that returns a T needs a variable of type T to hold its return value, using a MaybeValid<Foo> simply increases the size of the thing to be returned by four bytes. By contrast, using a Nullable<Foo> would require the function to first compute the Foo and then pass a copy of it to the constructor for the Nullable<Foo>. Further, returning a Nullable<Foo> requires any code that wants to use the returned value to make at least one extra copy to a storage location (variable or temporary) of type Foo before it can do anything useful with it. By contrast, code can use the Value field of a variable of type MaybeValid<Foo> about as efficiently as any other variable.
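A usage sketch (TryParsePrice is a made-up example) showing the exposed fields being written and read directly, with no property calls or extra copies:

static MaybeValid<double> TryParsePrice(string s)
{
    MaybeValid<double> result;
    result.isValue = double.TryParse(s, out result.Value); // both fields assigned in one statement
    return result;
}

var price = TryParsePrice("3.14");
if (price.isValue)
    Console.WriteLine(price.Value);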
Since I decided to diversify into Rust and Go, I have become overly concerned about copying / references / moving, etc.
And recently I really wondered whether ValueTuple also suffers from the typical caveat of structs, namely that their size should not exceed 16 bytes to avoid a performance penalty when copying the value type here and there: https://stackoverflow.com/a/1082341/4636721
So if, say, we have a value tuple (decimal, decimal, decimal, decimal), does that mean we are better off using the classic Tuple<decimal, decimal, decimal, decimal> class to pass that tuple around?
[EDIT]
An example of a use case: let's say the method below would be called a lot:
public (decimal, decimal, decimal, decimal) GetSuperImportantTuple(int input)
{
var aParameter = GetAParameter(input);
// Copy when getting that tuple
var tuple = GetA4DecimalsValueTuple();
// Copy into that function
var anotherParameter = GetAnotherParameter(tuple);
// Copy when returning the value
return TransformValueTuple(tuple, anotherParameter);
}
Like always, it depends. Value types and reference types differ, and this difference can be relevant for performance. However, you can also decide to just pass something by reference argument. If you want to know what really runs faster for your situation and the way you want to use it, test it with a stopwatch. Performance is typically something to measure, not something to ask about.
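A rough sketch of such a measurement (Consume is a placeholder; for serious numbers prefer a benchmarking library such as BenchmarkDotNet, since naive loops like this are easy for the JIT to distort):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 10_000_000; i++)
{
    var vt = (1m, 2m, 3m, 4m);            // ValueTuple: 64 bytes of decimals copied by value
    Consume(vt.Item1);
}
Console.WriteLine($"ValueTuple: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < 10_000_000; i++)
{
    var t = Tuple.Create(1m, 2m, 3m, 4m); // Tuple: one heap allocation, then only a reference moves
    Consume(t.Item1);
}
Console.WriteLine($"Tuple: {sw.ElapsedMilliseconds} ms");

static void Consume(decimal d) { }        // placeholder so the loop body isn't empty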
Given that mutable structs are generally regarded as evil (e.g., Why are mutable structs “evil”?), are there potential benefits that might have prompted the designers of the .NET framework to make System.Windows.Point & System.Windows.Vector mutable?
I'd like to understand this so I can decide whether it would make sense to make my own similar structs mutable (if ever). It's possible the decision to make Point and Vector mutable was just an error in judgment, but if there was a good reason (e.g., a performance benefit), I'd like to understand what it was.
I know that I've stumbled over the implementation of the Vector.Normalize() method a few times because it, surprise (!), does not return a fresh Vector. It just alters the current vector.
I always think it should work like this:
var vector = new Vector(7, 11);
var normalizedVector = vector.Normalize(); // Bzzz! Won't compile
But it actually works like this:
var vector = new Vector(7, 11);
vector.Normalize(); // This compiles, but now I've overwritten my original vector
...so, it seems like immutability is a good idea simply for avoiding confusion, but again, perhaps it's worth that potential confusion in some cases.
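The workaround I've settled on is a small extension method (Normalized is my own name, not part of WPF; assumes a reference to WindowsBase for System.Windows.Vector). It exploits the fact that a struct parameter is already a copy:

static class VectorExtensions
{
    // 'v' is a copy because Vector is a struct, so mutating it leaves the caller's vector intact.
    public static Vector Normalized(this Vector v)
    {
        v.Normalize();
        return v;
    }
}

// var normalizedVector = vector.Normalized(); // 'vector' itself is unchanged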
These types are in the System.Windows namespace and are generally used in WPF applications. The XAML markup of an application is a big part of the framework, so for a lot of things there needs to be a way to express them in XAML. Unfortunately there's no way to invoke non-parameterless constructors in WPF XAML (though it is possible in loose XAML), so calling a constructor with the appropriate arguments to initialize the struct wasn't an option. You can only set the values of the object's properties, so naturally these properties needed to be mutable.
Is this a bad thing? For these types, I'd say no. They are just for holding data, nothing more. If you wanted to get the size a Window wanted to be, you'd access the DesiredSize to get the Size object representing the size it wanted. You're not meant to "change the desired size" by altering the Width or Height properties of the Size object you get, you change the size by providing a new Size object. Looking at it this way is a lot more natural I believe.
If these objects were more complex and did more complicated operations, or had state, then yes, you'd want to avoid making them mutable structs. However, since they're just about as simple and basic as it gets (essentially a POD), structs are appropriate here.
Such types are mutable because, contrary to what some people might claim, mutable value-type semantics are useful. There are a few places where .net tries to pretend that value types should have the same semantics as reference types. Since mutable value-type semantics are fundamentally different from mutable reference-type semantics, pretending they're the same will cause problems. That doesn't make them "evil", however--it merely shows a flaw in an object model which assumes that acting upon a copy of something will be semantically equivalent to acting upon the original. True if the thing in question is an object reference; generally true--but with exceptions--if it's an immutable structure; false if it's a mutable structure.
One of the beautiful things about structs with exposed fields is that their semantics are readily ascertained by even simple inspection. If one has an array Point[] PointArray = new Point[100], one has 100 distinct instances of Point. If one says PointArray[4].X = 9;, that will change one item of PointArray and no other.
Suppose instead of using struct Point, one had a mutable class PointClass:
class PointClass { public int X; public int Y; }
How many PointClass instances are stored in PointClass[] PointClassArray = new PointClass[100]? Is there any way to tell? Will the statement PointClassArray[4].X = 9 affect the value of PointClassArray[2].X? What about someOtherObject.somePoint.X?
While the .net collections are not well suited to storage of mutable structs, I would nonetheless regard:
Dictionary<string, Point> myDict = new Dictionary<string, Point>();
// ...
Point temp = myDict["George"]; // copy the struct out
temp.X = 9;                    // mutate the copy
myDict["George"] = temp;       // copy it back in
to have relatively clear semantics, at least in the absence of threading issues. While I consider it unfortunate that .net collections don't provide a means by which one could simply say myDict[lookupKey].X = 9;, I would still regard the above code as pretty clear and self-explanatory without having to know anything about Point other than the fact that it has a public integer field called X. By contrast, if one had a Dictionary<string, PointClass>, it would be unclear what one should be expected to do to change the X value associated with "George". Perhaps the PointClass instance associated with George is not used anywhere else, in which case one may simply write the appropriate field. On the other hand, it's also possible that someone else has grabbed a copy of MyDict["George"] for the purpose of capturing the values therein, and isn't expecting that the PointClass object he's grabbed might change.
Some people might think "Point" should be an immutable struct, but the effect of a statement like somePoint.X = 5; can be fully determined knowing only that somePoint is a variable of type Point, which in turn is a struct with a public int field called X. If Point were an immutable struct, one would have to instead say something like somePoint = new Point(5, somePoint.Y);, which would, in addition to being slower, require examining the struct to determine that all of its fields are initialized in the constructor, with X being the first and Y the second. In what sense would that be an improvement over somePoint.X = 5;?
BTW, the biggest 'gotcha' with mutable structs stems from the fact that there's no way for the system to distinguish struct methods which alter 'this' from those which do not. A major shame. The preferred workarounds are either to use functions which return new structs derived from old ones, or else use static functions which accept "ref" struct parameters.
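A sketch of that second workaround, using the exposed-field Point from the discussion (not System.Windows.Point). The ref parameter makes the mutation visible at the call site:

struct Point { public int X; public int Y; }

static class PointOps
{
    // Unlike a struct instance method, Offset(ref p, ...) cannot silently act on a copy.
    public static void Offset(ref Point p, int dx, int dy)
    {
        p.X += dx;
        p.Y += dy;
    }
}

// PointOps.Offset(ref PointArray[4], 9, 0); // visibly mutates exactly one array element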
Possibilities:
It seemed like a good idea at the time to someone who didn't consider the use-cases where it would bite people. List<T>.Enumerator is a mutable struct that was used because it seemed like a good idea at the time to take advantage of the micro-opts that would often happen. It's almost the poster-child for mutable structs being "evil" as it's bitten more than a few people. Still, it seemed like a good idea to someone at the time...
They did think of the downsides, but had some use-case known to them where the performance differences went in struct's favour (they don't always) and was considered important.
They didn't consider structs evil. "Evil" is an opinion about down-sides beating up-sides, not a demonstrable fact, and not everyone has to agree with something even if Eric Lippert and Jon Skeet say it. Personally I think they're not evil, they're just misunderstood; but then again, evil is often easier to deal with than misunderstood for a programmer, so that's actually worse... ;) Maybe those involved disagree.
I fully appreciate the atomicity that the Threading.Interlocked class provides; I don't understand, though, why the Add function only offers two overloads: one for Integers, another for Longs. Why not Doubles, or any other numeric type for that matter?
Clearly, the intended method for changing a Double is CompareExchange; I am GUESSING this is because modifying a Double is a more complex operation than modifying an Integer. Still it isn't clear to me why, if CompareExchange and Add can both accept Integers, they can't also both accept Doubles.
Others have addressed the "why?". It is easy however to roll your own Add(ref double, double), using the CompareExchange primitive:
public static double Add(ref double location1, double value)
{
double newCurrentValue = location1; // non-volatile read, so may be stale
while (true)
{
double currentValue = newCurrentValue;
double newValue = currentValue + value;
newCurrentValue = Interlocked.CompareExchange(ref location1, newValue, currentValue);
if (newCurrentValue.Equals(currentValue)) // see "Update" below
return newValue;
}
}
CompareExchange sets the value of location1 to be newValue, if the current value equals currentValue. As it does so in an atomic, thread-safe way, we can rely on it alone without resorting to locks.
Why the while (true) loop? Loops like this are standard when implementing optimistically concurrent algorithms. CompareExchange will not change location1 if the current value is different from currentValue. I initialized currentValue to location1 - doing a non-volatile read (which might be stale, but that does not change the correctness, as CompareExchange will check the value). If the current value (still) is what we had read from location1, CompareExchange will change the value to newValue. If not, we have to retry CompareExchange with the new current value, as returned by CompareExchange.
If another thread changes the value again before our next CompareExchange, that call will fail again, necessitating another retry - and this can in theory go on forever, hence the loop. Unless you are constantly changing the value from multiple threads, CompareExchange will most likely be called only once, if the current value is still what the non-volatile read of location1 yielded, or twice, if it was different.
Update 2022/8/17
As Dr. Strangelove and Theodor Zoulias pointed out in the comments, when location1 is Double.NaN, Add() would turn into an infinite loop, because NaN == NaN is always false, while NaN.Equals(NaN) is true.
So I had to change
if (newCurrentValue == currentValue)
to
if (newCurrentValue.Equals(currentValue))
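A usage sketch of the Add method above (the field name and iteration count are illustrative):

static double _total;

static void Main()
{
    // One million concurrent adds land exactly once each, thanks to the CAS loop.
    System.Threading.Tasks.Parallel.For(0, 1_000_000, _ => Add(ref _total, 1.5));
    Console.WriteLine(_total); // 1500000; with a plain '_total += 1.5' the result would be unpredictable
}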
The Interlocked class wraps around the Windows API Interlocked* functions.
These are, in turn, wrapping around the native processor API, using the LOCK instruction prefix for x86. It only supports prefixing the following instructions:
BT, BTS, BTR, BTC, XCHG, XADD, ADD, OR, ADC, SBB, AND, SUB, XOR, NOT, NEG, INC, DEC
You'll note that these, in turn, pretty much map to the interlocked methods. Unfortunately, the ADD functions for non-integer types are not supported here. Add for 64-bit longs is supported on 64-bit platforms.
Here's a great article discussing lock semantics on the instruction level.
As Reed Copsey has said, the Interlocked operations map (via Windows API functions) to instructions supported directly by the x86/x64 processors. Given that one of those functions is XCHG, you can do an atomic XCHG operation without really caring what the bits at the target location represent. In other words, the code can "pretend" that the 64-bit floating point number you are exchanging is in fact a 64-bit integer, and the XCHG instruction won't know the difference. Thus, .Net can provide Interlocked.Exchange functions for floats and doubles by "pretending" that they are integers and long integers, respectively.
However, all of the other operations actually do operate on the individual bits of the destination, and so they won't work unless the values actually represent integers (or bit arrays in some cases.)
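For example, this is atomic precisely because the 64 bits of the double are swapped wholesale without being interpreted (a minimal sketch; the field name is mine):

static double _shared = 1.0;

// Interlocked.Exchange has a double overload because XCHG is representation-agnostic.
double previous = Interlocked.Exchange(ref _shared, 2.0); // previous == 1.0, _shared == 2.0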
I suspect that there are two reasons.
The processors targeted by .NET support interlocked increment only for integer types. I believe this is the LOCK prefix on x86, probably similar instructions exist for other processors.
Adding one to a floating point number can result in the same number if it is big enough, so I'm not sure if you can call that an increment. Perhaps the framework designers are trying to avoid nonintuitive behavior in this case.
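That second point is easy to demonstrate: at a large enough magnitude, adding 1 falls below the float's precision and changes nothing.

float big = 1e8f;
Console.WriteLine(big + 1 == big); // True: at 1e8 the gap between adjacent floats is 8, so the +1 is lost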
I was wondering if the enum structure type has a limit on its members. I have this very large list of "variables" that I need to store inside an enum or as constants in a class but I finally decided to store them inside a class, however, I'm being a little bit curious about the limit of members of an enum (if any).
So, do enums have a limit on .Net?
Yes. The number of members with distinct values is limited by the underlying type of the enum - by default this is Int32, so you can have that many different members (2^32; I find it hard to believe you will ever reach that limit), but you can explicitly specify the underlying type like this:
enum Foo : byte { /* can have at most 256 members with distinct values */ }
Of course, you can have as many members as you want if they all have the same value:
enum Foo { A, B = A, C = A, /* ... */ }
In either case, there is probably some implementation-defined limit in the C# compiler, but I would expect it to be MIN(range-of-Int32, free-memory) rather than a hard limit.
Due to a limit in the PE file format, you probably can't exceed some 100,000,000 values. Maybe more, maybe less, but definitely not a problem.
From the C# Language Specification 3.0, 1.10:
An enum type’s storage format and range of possible values are determined by its underlying type.
While I'm not 100% sure, I would expect the Microsoft C# compiler to allow only non-negative enum values, so if the underlying type is Int32 (as it is by default) then I would expect about 2^31 possible values; but this is an implementation detail, as it is not specified. If you need more than that, you're probably doing something wrong.
You could theoretically use Int64 (long) as your base type in the enum and get 2^63 possible entries. Others have given you excellent answers on this.
I think there is a second implied question of should you use an enum for something with a huge number of items. This actually directly applies to your project in many ways.
One of the biggest considerations would be long-term maintainability. Do you think the company will ever change the list of values you are using? If so, will there need to be backward compatibility with previous lists? How significant a problem could this be? In general, a larger number of members in an enum correlates with a higher probability that the list will need to be modified at some future date.
Enums are great for many things. They are clean, quick and simple to implement. They work great with IntelliSense and make the next programmer's job easier, especially if the names are clear, concise and if needed, well documented.
The problem is an enumeration also comes with drawbacks. They can be problematic if they ever need to be changed, especially if the classes using them are being persisted to storage.
In most cases enums are persisted to storage as their underlying values, not as their friendly names.
enum InsuranceClass
{
Home, //value = 0 (int32)
Vehicle, //value = 1 (int32)
Life, //value = 2 (int32)
Health //value = 3 (int32)
}
In this example the value InsuranceClass.Life would get persisted as a number 2.
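To spell that out (a sketch using the enum above):

var policy = InsuranceClass.Life;
int stored = (int)policy;            // 2 is what actually reaches the database
var loaded = (InsuranceClass)stored; // reading 2 back yields Life -- until the enum is reordered
Console.WriteLine($"{stored} -> {loaded}"); // "2 -> Life"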
If another programmer makes a small change to the system and adds Pet to the enum like this:
enum InsuranceClass
{
Home, //value = 0 (int32)
Vehicle, //value = 1 (int32)
Pet, //value = 2 (int32)
Life, //value = 3 (int32)
Health //value = 4 (int32)
}
All of the data coming out of the storage will now show the Life policies as Pet policies. This is an extremely easy mistake to make and can introduce bugs that are difficult to track down.
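One common defense (a sketch, anticipating rule 3 below) is to pin explicit values so that insertion order no longer matters:

enum InsuranceClass
{
    Home = 0,
    Vehicle = 1,
    Life = 2,    // pinned: adding members can no longer renumber it
    Health = 3,
    Pet = 4      // new members take new numbers; persisted data still maps correctly
}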
The second major issue with enums is that every change of the data will require you to rebuild and redeploy your program. This can cause varying degrees of pain. On a web server that may not be a big issue, but if this is an app used on 5000 desktop systems you have an entirely different cost to redeploy your minor list change.
If your list is likely to change periodically, you should really consider a system that stores that list in some other form, most likely outside your code. Databases were specifically designed for this scenario, or even a simple config file could be used (not the preferred solution). Smart planning for changes can reduce or avoid the problems associated with rebuilding and redeploying your software.
This is not a suggestion to prematurely optimize your system for the possibility of change, but more a suggestion to structure the code so that a likely change in the future doesn't create a major problem. Different situations will require different decisions.
Here are my rough rules of thumb for the use of enums:
1. Use them to classify and define other data, but not as data themselves. To be clearer, I would use InsuranceClass.Life to determine how the other data in a class should be used, but I would not make the underlying value {pseudocode} InsuranceClass.Life = $653.00 and use that value itself in calculations. Enums are not constants; doing this creates confusion.
2. Use enums when the enum list is unlikely to change. Enums are great for fundamental concepts but poor for constantly changing ideas. When you create an enumeration, it is a contract with future programmers that you want to avoid breaking.
3. If you must change an enum, then have a rule everyone follows: add to the end, not the middle. The alternative is to define specific values for each member and never change those (as in the sketch above). The point is that you are unlikely to know how others are using your enumeration's underlying values, and changing them can cause misery for anyone else using your code. This is an order of magnitude more important for any system that persists data.
4. The corollary to #2 and #3 is to never delete a member of an enum. There are specific circles of hell reserved for programmers who do this in a codebase used by others.
Hopefully that expanded on the answers in a helpful way.
It seems that unsigned integers would be useful for method parameters and class members that should never be negative, but I don't see many people writing code that way. I tried it myself and found the need to cast from int to uint somewhat annoying...
Anyhow, what are your thoughts on this?
Duplicate: Why is Array Length an Int and not an UInt?
The idea, that unsigned will prevent you from problems with methods/members that should not have to deal with negative values is somewhat flawed:
- now you have to check for big values ('overflow') in case of error, whereas you could have checked for <= 0 with signed
- use only one signed int in your methods and you are back to square "signed" :)
Use unsigned when dealing with bits. But don't bother with bit-level packing today anyway, unless you have so many bits that they fill some megabytes, or at least your small embedded memory.
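For illustration, the kind of bit work where unsigned is the natural fit (a minimal sketch):

uint flags = 0;
flags |= 1u << 31;                      // set the top bit; with a signed int this flips the sign
bool isSet = (flags & (1u << 31)) != 0; // test it
flags &= ~(1u << 31);                   // clear it again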
Using the standard signed types probably avoids casting to the unsigned versions. In your own code you can probably maintain the distinction OK, but lots of other inputs and 3rd-party libraries won't, and thus the casting will just drive people mad!
I can't remember exactly how C# does its implicit conversions, but in C++, widening conversions are done implicitly. Unsigned is considered wider than signed, and so this leads to unexpected problems:
int s = 5;
unsigned int u = 25;
// s > u is false
int s = -1;
unsigned int u = 25;
// s > u is **TRUE**! Error error!
In the example above, s is converted to unsigned and wraps around, so its value becomes something like 4294967295. This has caused me problems before: I often have methods return -1 to say "no match" or something like that, and with the implicit conversion it just fails to do what I think it should.
After a while programmers learnt to almost always use signed variables, except in exceptional cases. Compilers these days also produce warnings for this which is very helpful.
One reason is that public methods or properties referring to unsigned types aren't CLS compliant.
You'll almost always see this attribute applied to .Net assemblies as various wizards include it by default:
[assembly: CLSCompliant(true)]
So basically, if your assembly includes the attribute above and you try to use unsigned types in your public interface with the outside world, you'll get a compiler warning.
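A minimal sketch of what triggers it (class and field names are illustrative):

using System;

[assembly: CLSCompliant(true)]

public class Api
{
    public uint Count;      // warning CS3003: Type of 'Api.Count' is not CLS-compliant
    internal uint Hidden;   // fine: only the publicly visible surface must be CLS-compliant
}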
For simplicity. Modern software involves enough casts and conversions. There's a benefit to stick to as few, commonly available data types as possible to reduce complexity and ambiguity about proper interfaces.
There is no real need. Declaring something as unsigned to say numbers should be positive is a poor man's attempt at validation.
In fact it would be better to just have a single number class that represented all numbers.
To validate numbers you should use some other technique, because generally the constraint isn't just positive numbers; it's a set range. It's usually best to use the most unconstrained method for representing numbers, and then if you want to change the rules for allowable values, you change JUST the validation rules, NOT the type.
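A sketch of that idea (SetPercentage and its range are illustrative): the rule lives in one validation check, not in the type:

static void SetPercentage(int value)
{
    if (value < 0 || value > 100)
        throw new ArgumentOutOfRangeException(nameof(value), "Must be between 0 and 100.");
    // ... if the allowed range changes later, only this check changes, not the signature
}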
Unsigned data types are carried over from the old days when memory was at a premium. So now we don't really need them for that purpose. Combine that with casting and they are a little cumbersome.
It's not advisable to use unsigned integers because if you assign negative values to them, all hell breaks loose. However, if you insist on doing it the right way, try using Spec#: declare it as an integer (where you would have used uint) and attach an invariant to it saying it can never be negative.
You're right in that it probably would be better to use uint for things which should never be negative. In practice, though, there's a few reasons against it:
int is the 'standard' or 'default', it has a lot of inertia behind it.
You have to use annoying/ugly explicit casts everywhere to get back to int, which you're probably going to need to do a lot due to the first point after all.
When you do cast back to int, what happens if your uint value is going to overflow it?
A lot of the time, people use ints for things that are never negative, and then use -1 or -99 or something like that for error codes or uninitialised values. It's a little lazy perhaps, but again this is something that uint is less flexible with.
The biggest reason is that people are usually too lazy, or not thinking closely enough, to know where they're appropriate. Something like a size_t can never be negative, so unsigned is correct.
Casting from signed to unsigned can be fraught with peril though, because of peculiarities in how sign bits are handled by the underlying architecture.
You shouldn't need to manually cast it, I don't think.
Anyway, it's because for most applications, it doesn't matter - the range for either is big enough for most uses.
It is because they are not CLS compliant, which means your code might not run as you expect on other .NET framework implementations or even on other .NET languages (unless supported).
Also, they are not interoperable with other systems if you try to pass them across a boundary, such as a web service consumed by Java, or a call to the Win32 API.
See this SO post on the reason why as well.