When do you use float and when do you use double

  • Frequently in my programming experience I need to decide whether to use float or double for my real numbers. Sometimes I go for float, sometimes for double, but the choice feels subjective. If I were asked to defend my decision, I probably could not give sound reasons.

    When do you use float and when do you use double? Do you always use double, switching to float only when memory constraints are present? Or do you always use float unless the precision requirement forces you to use double? Are there substantial differences in the computational cost of basic arithmetic between float and double? What are the pros and cons of using float or double? And have you ever used long double?

    In many cases you want neither, but rather a decimal floating-point or fixed-point type. Binary floating-point types can't represent most decimals exactly.
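
    A quick way to see this in C (a minimal sketch; the exact digits printed depend on the platform's double format, which is almost always IEEE 754 binary64):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* 0.1 has no exact binary representation, so repeated addition drifts. */
        double sum = 0.0;
        for (int i = 0; i < 10; i++) {
            sum += 0.1;
        }
        assert(sum != 1.0);     /* ten additions of 0.1 are not exactly 1.0 */
        printf("%.17g\n", sum); /* prints a value close to, but not, 1 */
        return 0;
    }
    ```

    The same drift never occurs with a decimal or fixed-point representation, which is why those are preferred for exact decimal quantities.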

    Related: What causes floating point rounding errors? @CodesInChaos: my answer there suggests resources to help you make that determination; there is no *one-size-fits-all* solution.

    Very good answer found at: Stack Overflow

    For decimals I would use neither. I would use an integer and store the value multiplied by 100.
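
    For money, that comment's approach looks something like this (a sketch; the variable names and amounts are illustrative):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* Store currency as integer cents: all arithmetic stays exact. */
        long price_cents = 1999;  /* $19.99 */
        long tax_cents   = 160;   /* $1.60  */
        long total_cents = price_cents + tax_cents;

        assert(total_cents == 2159);  /* exact, unlike 19.99 + 1.60 in binary floats */
        printf("$%ld.%02ld\n", total_cents / 100, total_cents % 100);  /* $21.59 */
        return 0;
    }
    ```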

    What exactly do you mean by "decimals"? If you need to represent values like 0.01 exactly (say, for money), then (binary) floating-point is not the answer. If you merely mean non-integer numbers, then floating-point is likely ok -- but then "decimals" is not the best word to describe what you need.

    @Keith I mean just whenever one needs to store a floating-point number. It doesn't necessarily have to be decimal - it can also be sound or image data, for example.

    @JakubZaverka: I've edited your question to refer to "real numbers" rather than "decimals".

    Considering that (as of today) most graphics cards favor floats over doubles, graphics programming often uses single precision.

    You don't always have a choice. For example, on the Arduino platform, both double and float equate to float. You need to find an add-in library to handle real doubles.

  • The default choice for a floating-point type should be double. This is also the type that you get with floating-point literals without a suffix or (in C) standard functions that operate on floating point numbers (e.g. exp, sin, etc.).
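
    This default is visible in the language itself; a small sketch (C, relying only on standard guarantees):

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* An unsuffixed literal such as 3.14 has type double;
           the f suffix gives float, the l/L suffix gives long double. */
        assert(sizeof 3.14  == sizeof(double));
        assert(sizeof 3.14f == sizeof(float));

        /* The classic <math.h> functions take and return double; the
           float variants carry an f suffix (sinf, expf, ...). */
        double d = sin(1.0);
        float  f = sinf(1.0f);
        printf("%.17g %.9g\n", d, f);
        return 0;
    }
    ```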

    float should only be used if you need to operate on a lot of floating-point numbers (think in the order of thousands or more) and analysis of the algorithm has shown that the reduced range and accuracy don't pose a problem.
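
    One concrete illustration of float's reduced accuracy (a sketch assuming the common IEEE 754 single format with its 24-bit significand):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* Above 2^24 = 16777216, float can no longer represent every
           integer, so adding 1 is silently lost. */
        float big = 16777216.0f;    /* 2^24 */
        assert(big + 1.0f == big);  /* the increment rounds away */

        double bigd = 16777216.0;
        assert(bigd + 1.0 != bigd); /* double holds integers exactly up to 2^53 */
        printf("%.1f\n", (double)big);
        return 0;
    }
    ```

    This is the kind of effect the algorithm analysis has to rule out before choosing float.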

    long double can be used if you need more range or accuracy than double, and if it provides this on your target platform.
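
    Whether long double actually buys anything on your target platform can be checked with the <float.h> limits (a minimal sketch; the printed values vary by platform):

    ```c
    #include <assert.h>
    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* *_DIG is the number of decimal digits each type can round-trip.
           The standard only promises long double >= double >= float;
           on some platforms long double is just an alias for double. */
        printf("float: %d, double: %d, long double: %d decimal digits\n",
               FLT_DIG, DBL_DIG, LDBL_DIG);
        assert(DBL_DIG  >= FLT_DIG);
        assert(LDBL_DIG >= DBL_DIG);
        return 0;
    }
    ```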

    In summary, float and long double should be reserved for use by the specialists, with double for "every-day" use.

    I would probably not consider float for a few thousand values unless there were a performance problem related to floating point caching and data transfer. There is usually a substantial cost to doing the analysis to show that float is precise enough.

    As an addendum, if you need compatibility with other systems, it can be advantageous to use the same data types.

    I'd use floats for millions of numbers, not 1000s. Also, some GPUs do better with floats; in that specialized case, use floats. Otherwise, as you say, use doubles.

    @PatriciaShanahan - 'performance problem related to..' A good example is if you are planning to use SSE2 or similar vector instructions, you can do 4 ops/vector in float (vs 2 per double) which can give a significant speed improvement (half as many ops and half as much data to read & write). This can significantly lower the threshold where using floats becomes attractive, and worth the trouble to sort out the numeric issues.

    I endorse this answer with one additional piece of advice: when one is operating with RGB values for display, it is acceptable to use `float` (and occasionally half-precision) because neither the human eye, the display, nor the color system has that many bits of precision. This advice is applicable for, say, OpenGL etc. It does not apply to medical images, which have stricter precision requirements.

    It should be noted that operations on long double are often extremely slow (as much as 5x slower), because they generally have to be implemented in software.

    IMHO it all depends on the application and its accuracy requirements. For example: if one is working on a 3D graphics rendering engine that operates on millions or billions of vertices in a 3D Cartesian space, then it is more efficient to use `float`, accepting some loss of precision for the performance gain. On the other hand, if precision is of higher importance, such as in an application that works with subatomic particles or astronomical quantities, and performance is not as high a priority, then by all means use `double`.

    ... Also, the effects of caching should play a large role in choosing one type over the other. If the number of floating-point operations is low, then double should be fine for memory consumption, caching, and speed; once the number of operations exceeds a limit that depends on the architecture, OS, and compiler, float should be considered. There is always a tradeoff between one and the other.

Licensed under CC BY-SA with attribution

Content dated before 6/26/2020 9:53 AM