In computing, floating point describes a numerical representation system in which a string of digits (or bits) represents a real number.
The term floating point refers to the fact that the radix point (decimal point, or, more commonly in computers, binary point) can "float": that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. Over the years several different floating-point representations have been used in computers; however, for the last ten years the most commonly encountered representation is that defined by the IEEE 754-1985 Standard.
The advantage of floating-point representation over fixed-point (and integer) representation is that it can support a much wider range of values. For example, a fixed-point representation that has eight decimal digits, with the decimal point assumed to be positioned after the sixth digit, can represent the numbers 123456.78, 8765.43, 123.00, and so on, whereas a floating-point representation with eight decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of slightly less precision.
The speed of floating-point operations is an important measure of performance for computers in many application domains. It is measured in "megaFLOPS" (million floating-point operations per second), or gigaflops, etc. World-class supercomputer installations are generally rated in teraflops. In June 2008, the IBM Roadrunner supercomputer achieved 1.026 petaflops, or 1.026 quadrillion floating-point operations per second.
A number representation (called a numeral system in mathematics) specifies some way of storing a number that may be encoded as a string of digits. The arithmetic is defined as a set of actions on the representation that simulate classical arithmetic operations.
There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is omitted then it is implicitly assumed to lie at the right (least significant) end of the string (that is, the number is an integer). In fixed-point systems, some specific convention is made about where the radix point is located in the string. For example, the convention could be made that the string consists of 8 decimal digits, with the point in the middle, so that "00012345" has a value of 1.2345.
In scientific notation, the given number is scaled by a power of 10 so that it lies within a certain range, typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter's moon Io is 152853.5047 seconds. This is represented in standard-form scientific notation as 1.528535047 × 10^5 seconds.
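Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of a signed digit string of a given length in a given base (the significand) and a signed integer exponent.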
The significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent: to the right if the exponent is positive or to the left if the exponent is negative. Using base-10 (the familiar decimal notation) as an example, the number 152853.5047, with ten decimal digits of precision, is represented as the significand 1528535047 together with an exponent of 5. To recover the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10^5 to give 1.528535047 × 10^5, or 152853.5047.
Symbolically, this final value is
$$s \times b^{e}$$
where s is the value of the significand (after taking into account the implied radix point), b is the base, and e is the exponent.
Equivalently, this is:
$$\frac{s}{b^{\,p-1}} \times b^{e}$$
where s here means the integer value of the entire significand, and p is the precision: the number of digits in the significand.
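As an illustration, the second form can be evaluated exactly with rational arithmetic. The following is a minimal Python sketch; the function name `fp_value` is hypothetical and chosen only for this example.

```python
from fractions import Fraction

def fp_value(s: int, e: int, p: int, b: int = 10) -> Fraction:
    """Exact value of a floating-point number whose significand is the
    integer s (p digits in base b) and whose exponent is e:
    (s / b**(p - 1)) * b**e."""
    return Fraction(s, b ** (p - 1)) * Fraction(b) ** e

# Decimal example from the text: significand 1528535047, exponent 5.
print(fp_value(1528535047, 5, 10))   # 1528535047/10000, i.e. 152853.5047
```

Because the result is a `Fraction`, no rounding is introduced by the evaluation itself; the same function works for base 2 by passing `b=2`.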
The significand always stores the most significant digits in the number: the first non-zero digits. When the significand is adjusted in this way so that its leftmost digit is nonzero, it is said to be normalized, and its value obeys 1 ≤ s < b, given that the radix point is assumed to follow the first digit. Zero is a special case and is normally represented as s = 0, e = 0. (Subnormal numbers and certain other cases also need special treatment; see dealing with exceptional cases.)
Historically, different bases have been used for floating-point, but until recently almost all modern computer architectures used base 2, or binary. In binary, the significand is a string of bits (1s and 0s) of length p. For example, the number π rounded to 24 bits is 11.0010010000111111011011. In binary single-precision (24-bit) floating-point, this is represented as s = 110010010000111111011011 with e = 1 (where s is assumed to have a binary point after the first bit). After normalization, the first bit of a non-zero binary significand is always 1 and hence need not be actually encoded, giving an extra bit of precision. Normalization can therefore be thought of as a form of compression; it allows a binary significand to be compressed into a field one bit shorter than the maximum precision, at the expense of extra processing.
The way in which the significand, exponent and sign bits are internally stored on a computer is implementation-dependent. The common IEEE formats are described later.
The word "mantissa" is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand.
By allowing the radix point to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as
would be expressed as
In a fixed-point system with the decimal point at the left, it would be
A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.
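The effect can be reproduced with Python's `decimal` module. This sketch assumes the operands 0.12 × 0.12 purely for illustration. It models three-digit floating point with a `Context` of precision 3, and models the fixed-point system by truncating to three places after the decimal point; the truncation rule is an assumption.

```python
from decimal import Context, Decimal, ROUND_DOWN

# Assumed operands, chosen only for illustration.
a = Decimal("0.12")
b = Decimal("0.12")

# Three significant digits, with the point free to float.
float_ctx = Context(prec=3)
float_product = float_ctx.multiply(a, b)
print(float_product)    # 0.0144 (significand 1.44, exponent -2)

# Fixed point: exactly three digits after the decimal point (assumed truncation).
fixed_product = (a * b).quantize(Decimal("0.001"), rounding=ROUND_DOWN)
print(fixed_product)    # 0.014 (the final digit 4 is lost)
```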
The range of floating-point numbers depends on the number of bits used for representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10^−308 to 10^308 (because 308 is approximately 1023 × log10(2), since the range of the exponent is [−1022, 1023]). The complete range of the format is from about −10^308 through +10^308 (see IEEE 754-1985).
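On common platforms Python's `float` is an IEEE 754 double, so these figures can be checked through `sys.float_info`. The sketch below assumes such a platform. Note that `sys.float_info` follows the C convention of a significand in [0.5, 1), so its exponent limits are one greater than the [−1022, 1023] range quoted above.

```python
import math
import sys

info = sys.float_info
print(info.mant_dig)               # significand bits, including the implied bit
print(info.min_exp, info.max_exp)  # exponent limits in the C convention
print(info.min, info.max)          # smallest positive normalized and largest finite value
print(1023 * math.log10(2))        # close to 308, the decimal exponent range
```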
The floating-point system of numbers was used by the Kerala School of mathematics in 14th century India to investigate and rationalise about the convergence of series.
In 1938, Konrad Zuse of Berlin completed the "Z1", the first mechanical binary programmable computer. It was based on Boolean algebra and had most of the basic ingredients of modern machines, using the binary system and today's standard separation of storage and control. Zuse's 1936 patent application (Z23139/GMD Nr. 005/021) also suggests a von Neumann architecture (re-invented in 1945) with program and data modifiable in storage. Originally the machine was called the "V1" but it was retroactively renamed after the war, to avoid confusion with the V1 missile. It worked with floating-point numbers having a 7-bit exponent, 16-bit mantissa, and a sign bit. The memory used sliding metal parts to store 16 such numbers, and worked well; but the arithmetic unit was less successful, occasionally suffering from certain mechanical engineering problems. The program was read from punched discarded 35 mm movie film. Data values could be entered from a numeric keyboard, and outputs were displayed on electric lamps. The machine was not a general purpose computer because it lacked looping capabilities. The Z3 was completed in 1941 and was program-controlled.
Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the Z4 in 1950, followed by the IBM 704 in 1954. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The PDP-11/44 was an extension of the 11/34 that included the cache memory and floating-point units as a standard feature.
The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point formats. Single precision used 36 bits, organised into a 1-bit sign, 8-bit exponent, and a 27-bit mantissa. Double precision used 72 bits organised as a 1-bit sign, 11-bit exponent, and a 60-bit mantissa. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.
Prior to the IEEE-754 standard, computers used many different forms of floating-point. These differed in the word-sizes, the format of the representations, and the rounding behaviour of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.
The IEEE-754 standard was created in the early 1980s, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among the innovations are these:
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
The standard provides for many closely-related formats, differing in only a few details. Two of these formats, single precision (32 bits) and double precision (64 bits), are called basic formats, and are ubiquitous in computer hardware and languages.
Any integer with absolute value less than or equal to 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 2^53 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
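A short Python check of these limits follows, assuming Python's `float` is an IEEE 754 double (true on common platforms). Single precision is emulated by a round trip through the `struct` module's `f` format; the helper name `to_single` is hypothetical.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 2**53 converts to a double exactly...
print(float(2**53) == 2**53)                      # True
# ...but 2**53 + 1 does not: it is rounded to a neighbouring double.
print(float(2**53 + 1) == 2**53 + 1)              # False
# The same happens at 2**24 in single precision.
print(to_single(float(2**24)) == 2**24)           # True
print(to_single(float(2**24 + 1)) == 2**24 + 1)   # False
```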
The bit representations of IEEE floating-point numbers are monotonic (increasing or decreasing in accordance with the numbers they represent), as long as exceptional values are avoided and the signs are handled properly. Apart from exceptional values (positive and negative zero compare equal despite having different bit patterns, and a NaN never compares equal to anything), IEEE floating-point numbers are equal if and only if their integer bit representations are equal. Floating-point comparisons can therefore be done with simple integer comparisons on the bit patterns, as long as the signs match. However, the actual floating-point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
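The ordering property can be checked for a few non-negative single-precision values with the sketch below. The helper name `bits32` and the sample values are illustrative assumptions; the last two lines show the signed-zero exception.

```python
import struct

def bits32(x: float) -> int:
    """Bit pattern of x in single precision, read as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

values = [0.0, 1e-10, 0.5, 1.0, 1.5, 2.0, 1e10]   # increasing, all non-negative
patterns = [bits32(v) for v in values]
print(patterns == sorted(patterns))   # True: integer order matches numeric order

# Signed zeros compare equal as numbers but differ as bit patterns.
print(0.0 == -0.0)                    # True
print(bits32(0.0) == bits32(-0.0))    # False
```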
To a rough approximation, the bit representation of an IEEE floating-point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
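One way to make this concrete for positive, normalized single-precision numbers is sketched below. It interprets the claim as follows (an assumption of this example): dividing the bit pattern by 2^23 and subtracting the exponent bias of 127 yields an estimate of the base 2 logarithm. The function name `approx_log2` is hypothetical.

```python
import math
import struct

def approx_log2(x: float) -> float:
    """Estimate log2(x) for positive normalized x from its single-precision bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits / 2**23 - 127

for x in (1.0, 1.5, 8.0, 1000.0):
    print(x, approx_log2(x), math.log2(x))
```

The estimate is exact at powers of two and is linear in the significand in between, which is why it is only approximate.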
Although the 32-bit ("single") and 64-bit ("double") formats are by far the most common, the standard actually allows for many different precision levels. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80-bit extended precision format, with a 15-bit exponent, a 64-bit significand, and no hidden bit.
There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C and related programming languages usually provide these formats via the long double type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.
A project for revising the IEEE 754 standard has been under way since 2000 (see IEEE 754r). A late phase of the review was completed on March 10, 2007[1] but final ratification of the new standard is still awaiting a decision as of 2008.
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE standard formats they are apportioned as follows:
          sign   exponent (exponent bias)   significand   total
single    1      8 (127)                    23            32
double    1      11 (1023)                  52            64
While the exponent can be positive or negative, it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0's and all 1's in this field are reserved for special treatment (see dealing with exceptional cases). Therefore the legal exponent range for normalized numbers is [−126, 127] for single precision or [−1022, 1023] for double.
As described earlier, when a binary number is normalized the leftmost bit of the significand is known to be 1. In the IEEE single and double precision formats that bit is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has 24 bits of significand precision, while double precision format has 53.
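For example, it was shown above that π, rounded to 24 bits of precision, has:
sign = 0; e = 1; s = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as
0 10000000 10010010000111111011011 (excluding the hidden bit), or 40490FDB in hexadecimal.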
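The packing of π into the single precision layout can be inspected with Python's `struct` module. This is a small sketch with illustrative variable names; it only handles the normalized case discussed here.

```python
import math
import struct

bits = struct.unpack(">I", struct.pack(">f", math.pi))[0]
sign = bits >> 31                      # 1 bit
exponent_field = (bits >> 23) & 0xFF   # 8 bits, stored with the bias of 127 added
fraction = bits & 0x7FFFFF             # 23 stored significand bits (hidden bit excluded)

print(f"{bits:08X}")                                    # 40490FDB
print(sign, f"{exponent_field:08b}", f"{fraction:023b}")
print(exponent_field - 127)                             # unbiased exponent: 1
print(f"{fraction | (1 << 23):024b}")                   # significand with the hidden bit restored
```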
Floating-point representation, in particular the standard IEEE format, is by far the most common way of representing arbitrary real numbers in computers because it is efficiently handled in most large computer processors. However, there are alternatives:
Computer algebra systems such as Maxima and Maple can often handle numbers in a completely "formal" way, without dealing with a specific encoding of the significand. Such programs can evaluate expressions like "sin 3π" exactly, because they "know" the underlying mathematics.
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the number 123456789 clearly cannot be exactly represented if only eight decimal digits of precision are available.
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value to the original, and the value thus adjusted is called the rounded value.
Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
e = −4; s = 1100110011001100110011001100110011...
where, as previously, s is the significand and e is the exponent.
When rounded to 24 bits this becomes
e = −4; s = 110011001100110011001101
which is actually 0.100000001490116119384765625 in decimal.
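The exact value stored for 0.1 can be printed with Python's `decimal` module, which converts a binary float to decimal without further rounding. The sketch assumes IEEE 754 arithmetic and emulates single precision by a round trip through `struct`.

```python
import struct
from decimal import Decimal

# Round 0.1 to single precision, then read it back as a double (an exact step).
single = struct.unpack(">f", struct.pack(">f", 0.1))[0]

print(struct.pack(">f", 0.1).hex())   # 3dcccccd
print(Decimal(single))                # 0.100000001490116119384765625
print(Decimal(0.1))                   # the double nearest to 0.1, also not exactly 0.1
```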
As a further example, the real number π, represented in binary as an infinite series of bits is
11.0010010000111111011010101000100010000101101000110000100011010011...
but is
11.0010010000111111011011
when approximated by rounding to a precision of 24 bits.
In binary single-precision floating-point, this is represented as s = 110010010000111111011011 with e = −22 (or e = 1 if s is not an integer but is assumed to have a binary point after the first bit). This has a decimal value of
3.1415927410125732421875,
whereas the true value of π is
3.1415926535897932384626433832795...
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called an "ULP", for Unit in the Last Place. For example, the difference between the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. For numbers with an exponent of 0, an ULP is exactly 2^−23 or about 10^−7 in single precision, and about 10^−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP.
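The size of an ULP can be measured directly by decoding adjacent bit patterns. The following Python sketch uses the two hexadecimal patterns mentioned above; the helper name `from_bits` is hypothetical.

```python
import struct
import sys

def from_bits(pattern: int) -> float:
    """Single-precision value whose bit pattern is the given 32-bit integer."""
    return struct.unpack(">f", struct.pack(">I", pattern))[0]

a = from_bits(0x45670123)
b = from_bits(0x45670124)
print(b - a)                          # one ULP at this exponent (11), i.e. 2**-12

print(from_bits(0x3F800001) - 1.0)    # ULP at exponent 0, single precision: 2**-23
print(sys.float_info.epsilon)         # ULP at exponent 0, double precision: 2**-52
```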
Rounding modes are used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more significant digits than there are digits in the significand. There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called banker's rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. [3] In the case of a tie, the value that would make the significand end in a 0 bit is chosen. This IEEE standard applies to all fundamental algebraic operations, including square root, in the absence of exceptional conditions. It means that IEEE-compliant hardware behavior is completely determined in all 32 or 64 bits. ("Library" functions such as cosine and log are not mandated.)
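The tie-breaking rule can be observed by rounding doubles that lie exactly halfway between two single-precision values. This sketch assumes the platform converts double to single with the default round-to-nearest, ties-to-even mode, as is usual; `to_single` is a hypothetical helper.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to single precision using the platform's default mode."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Halfway between 1 and 1 + 2**-23: the neighbour whose significand ends in 0 is 1.
tie_low = 1.0 + 2.0**-24
print(to_single(tie_low) == 1.0)                # True

# Halfway between 1 + 2**-23 and 1 + 2**-22: the even neighbour is 1 + 2**-22.
tie_high = 1.0 + 3 * 2.0**-24
print(to_single(tie_high) == 1.0 + 2.0**-22)    # True
```

In the first case the tie is resolved downward and in the second upward, so ties do not bias results in one direction.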
Alternative rounding options are also available. IEEE-754-compliant hardware offers the following rounding modes:
Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and interval arithmetic.
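The idea behind interval arithmetic can be sketched with Python's `decimal` module, which implements directed rounding in software. The example uses seven decimal digits and rounds the same quotient once toward −∞ and once toward +∞; it illustrates the principle and does not switch the rounding mode of the hardware.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

round_down = Context(prec=7, rounding=ROUND_FLOOR)    # toward minus infinity
round_up = Context(prec=7, rounding=ROUND_CEILING)    # toward plus infinity

one, three = Decimal(1), Decimal(3)
lower = round_down.divide(one, three)
upper = round_up.divide(one, three)
print(lower, upper)   # 0.3333333 0.3333334: the exact 1/3 lies in between
```

Carrying both bounds through a computation gives an interval that is guaranteed to contain the exact result, at the cost of doing each operation twice.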
A further use of rounding modes is when a number is explicitly rounded to a certain number of decimal (or binary) places, as when rounding a result to euros and cents (two decimal places). In this case a common rounding mode is again "round to nearest, ties away from zero", in which a tie is rounded up for positive values.
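For rounding to cents, Python's `decimal` module offers both tie-breaking rules; `ROUND_HALF_UP` is its name for round to nearest with ties away from zero. The amounts below are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

cent = Decimal("0.01")
amount = Decimal("0.125")   # exactly halfway between 0.12 and 0.13

print(amount.quantize(cent, rounding=ROUND_HALF_UP))             # 0.13
print(amount.quantize(cent, rounding=ROUND_HALF_EVEN))           # 0.12
print(Decimal("-0.125").quantize(cent, rounding=ROUND_HALF_UP))  # -0.13
```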
For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples. The fundamental principles are the same in any radix or precision. As usual, s denotes the significand and e denotes the exponent.
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and we then proceed with the usual addition method:
123456.7 = 1.234567 * 10^5
101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5
Hence:
123456.7 + 101.7654 = (1.234567 * 10^5) + (1.017654 * 10^2)
                    = (1.234567 * 10^5) + (0.001017654 * 10^5)
                    = (1.234567 + 0.001017654) * 10^5
                    = 1.235584654 * 10^5
In detail:
  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
  e=5;  s=1.235585     (final sum: 123558.5)
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
  e=5;  s=1.234567
+ e=-3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)
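Both additions can be reproduced with Python's `decimal` module, using a context with seven digits of precision as a stand-in for the decimal format of these examples. This is only a sketch: `decimal` is a software implementation, and the variable names are illustrative.

```python
from decimal import Context, Decimal

ctx = Context(prec=7)   # seven significant decimal digits, round to nearest

a = Decimal("123456.7")      # e=5;  s=1.234567
b = Decimal("101.7654")      # e=2;  s=1.017654
tiny = Decimal("0.009876543")  # e=-3; s=9.876543

print(a + b)                 # 123558.4654, the exact sum (default 28-digit context)
print(ctx.add(a, b))         # 123558.5, rounded to seven digits
print(ctx.add(a, tiny))      # 123456.7, the small operand is absorbed completely
```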
Another problem of loss of significance occurs when two close numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are representations of the rationals 123457.1467 and 123456.659.
  e=5;  s=1.234571
- e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=-1; s=4.000000 (after rounding/normalization)
The best representation of this difference is e = −1; s = 4.877000, which differs more than 20% from e = −1; s = 4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
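The cancellation can be checked numerically with a seven-digit `decimal` context, again as a software stand-in for the format used in these examples. The operands are first rounded to seven digits, as they would be when stored, and then subtracted.

```python
from decimal import Context, Decimal

ctx = Context(prec=7)

x = Decimal("123457.1467")
y = Decimal("123456.659")

stored_x = ctx.plus(x)   # 123457.1 after rounding to seven digits
stored_y = ctx.plus(y)   # 123456.7 after rounding to seven digits

computed = ctx.subtract(stored_x, stored_y)
exact = x - y            # exact here: the default context keeps 28 digits

print(computed)                        # 0.4
print(exact)                           # 0.4877
print((exact - computed) / computed)   # relative discrepancy, a little over 20%
```

The subtraction itself is exact; the error comes entirely from the earlier rounding of the operands, which the subtraction then exposes.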
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.
      e=3;  s=4.734612
    × e=5;  s=5.417242
    -----------------------
      e=8;  s=25.648538980104  (true product)
      e=8;  s=25.64854         (after rounding)
      e=9;  s=2.564854         (after normalization)
Division is done similarly, but is more complicated.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm and digital division). [4] For a fast, simple method, see the Horner method.
Floating-point computation in a computer can run into three kinds of problems:
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. Modern IEEE-compliant systems have a uniform way of handling these situations. An important part of the mechanism involves error values that result from a failing computation, and that can propagate silently through subsequent computation until they are detected at a point of the programmer's choosing.
The two error values are "infinity" (often denoted "INF"), and "NaN" ("not a number"), which covers all other errors. "Infinity" does not necessarily mean that the result is actually infinite. It simply means "too large to represent".
Both of these are encoded with the exponent field set to all 1's. (Recall that exponent fields of all 0's or all 1's are reserved for special meanings.) The significand field is set to something that can distinguish them, typically zero for INF and nonzero for NaN. The sign bit is meaningful for INF, that is, floating-point hardware distinguishes between +∞ and −∞.
When a nonzero number is divided by zero (the divisor must be exactly zero), a "zerodivide" event occurs, and the result is set to infinity of the appropriate sign. In other cases in which the result's exponent is too large to represent, such as division of an extremely large number by an extremely small number, an "overflow" event occurs, also producing infinity of the appropriate sign. This is different from a zerodivide, though both produce a result of infinity, and the distinction is usually unimportant in practice.
Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as (+∞) + 7 = +∞ and (+∞) × (−2) = −∞, while (+∞) × 0 has no meaningful value and yields NaN.
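The zerodivide and overflow cases can be reproduced in C. This is a sketch under the assumption that the implementation follows IEEE 754 (C Annex F); without that guarantee, floating-point division by zero is undefined behaviour in C. The function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: plain division, stored through a volatile so
   that the quotient is computed at run time. With a nonzero numerator
   and a zero denominator this is a "zerodivide" event. */
double divide(double num, double den)
{
    volatile double q = num / den;
    return q;
}

/* Hypothetical helper: doubling the largest finite double needs an
   exponent that does not fit in the format, an "overflow" event. */
double overflow_demo(void)
{
    volatile double big = DBL_MAX;
    return big * 2.0;
}
```

On such a system, `isinf(divide(1.0, 0.0))` and `isinf(overflow_demo())` both hold, `divide(-1.0, 0.0)` is negative, and multiplying an infinite result by zero gives a value for which `isnan` is true.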
When the result of an operation has an exponent too small to represent properly, an "underflow" event occurs. The hardware responds to this by changing to a format in which the significand is not normalized, and there is no "hidden" bit; that is, all bits of the significand are represented. The exponent field is set to the reserved value of zero. The significand is set to whatever it has to be in order to be consistent with the exponent. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal". Denorms are perfectly legal operands to arithmetic operations.
If no significant bits are able to appear in the significand field, the number is zero. Note that, in this case, the exponent field and significand field are all zeros—floating-point zero is represented by all zeros.
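Gradual underflow can be watched by repeatedly halving the smallest normalized number. The sketch below assumes IEEE 754 double precision with subnormals enabled (no flush-to-zero mode); the function names are hypothetical.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when x is a subnormal (denormalized) double. */
bool is_subnormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Hypothetical helper: halve DBL_MIN, the smallest normalized double,
   until the value underflows all the way to zero, and return how many
   halvings that took. */
int halvings_until_zero(void)
{
    volatile double x = DBL_MIN;
    int count = 0;
    while (x != 0.0) {
        x = x / 2.0;
        count++;
    }
    return count;
}
```

With 52 stored significand bits, the first 52 halvings pass through subnormal values with fewer and fewer significant bits, and the 53rd reaches zero.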
The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.
Other errors, such as division of zero by zero, or taking the square root of −1, cause an "operand error" event, and produce a NaN result. NaNs propagate aggressively through arithmetic operations—any NaN operand to any operation causes an operand error and produces a NaN result.
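NaN generation and propagation can be sketched as follows, assuming an IEEE 754 implementation and no aggressive floating-point optimization flags (such as `-ffast-math`). The function names are hypothetical.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: zero divided by zero is an operand error
   and produces a NaN. */
double make_nan(void)
{
    volatile double zero = 0.0;
    return zero / zero;
}

/* Hypothetical helper: a NaN operand turns the result of further
   arithmetic into NaN, and a NaN compares unequal even to itself. */
bool nan_propagates(void)
{
    double n = make_nan();
    return isnan(n + 1.0) && isnan(n * 0.0) && n != n;
}
```

The self-inequality `n != n` is why code that needs to detect NaN should use `isnan` rather than an equality test.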
In summary, there are five special "events" that may occur, though some of them are quite benign: overflow, underflow, zerodivide, operand error, and inexact (the result had to be rounded).
Computer hardware is typically able to raise exceptions when these events occur. How this is done is system-dependent. Usually these exceptions are all masked (disabled), relying only on the propagation of error values. Sometimes overflow, zerodivide, and operand error are enabled.
The fact that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, leads to many surprising situations.
For example, the non-representability of 0.1 and 0.01 means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is

    0.100000001490116119384765625 exactly.

Squaring this number gives approximately

    0.010000000298023226

Squaring it with single-precision floating-point hardware (with rounding) gives approximately

    0.010000000707805157

But the representable number closest to 0.01 is approximately

    0.009999999776482582
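The mismatch between the rounded square of 0.1 and the single-precision value nearest 0.01 can be checked directly. This sketch assumes IEEE 754 single precision with round-to-nearest; `square_gap` is a hypothetical name.

```c
/* Hypothetical helper: difference between 0.1f squared (rounded to
   single precision) and the float closest to 0.01. The volatile
   stores force each intermediate value into single precision. */
float square_gap(void)
{
    volatile float tenth = 0.1f;
    volatile float square = tenth * tenth;
    volatile float hundredth = 0.01f;
    return square - hundredth;
}
```

Under these assumptions the gap is positive and equal to one unit in the last place of 0.01f (2^−30), so a test such as `0.1f * 0.1f == 0.01f` fails.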
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
    #include <math.h>

    // Enough digits to be sure we get the correct approximation.
    double pi = 3.1415926535897932384626433832795;
    double z = tan(pi/2.0);
will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.
By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225×10^−15 in double precision, or −0.8742×10^−7 in single precision. [5]
While floating-point addition and multiplication are both commutative (a + b = b + a and a × b = b × a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit decimal arithmetic:
     1234.567 + 45.67844 = 1280.245
     1280.245 + 0.0004   = 1280.245
    but
     45.67840 + 0.0004   = 45.67844
     45.67844 + 1234.567 = 1280.246
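The same failure of associativity appears in binary double precision. The sketch below assumes IEEE 754 `double` with round-to-nearest-even; the function names are hypothetical, and the volatile temporaries force each partial sum to be rounded to double before it is reused.

```c
/* Hypothetical helper: computes (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double partial = a + b;
    volatile double result = partial + c;
    return result;
}

/* Hypothetical helper: computes a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double partial = b + c;
    volatile double result = a + partial;
    return result;
}
```

With a = 1e16, b = −1e16 and c = 1.0, `sum_left` returns 1.0, whereas `sum_right` returns 0.0: near 10^16 adjacent doubles are 2 apart, so the 1.0 is absorbed when it is added to −1e16 first.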
They are also not necessarily distributive. That is, (a + b) × c may not be the same as a × c + b × c:
     1234.567 × 3.333333 = 4115.223
     1.234567 × 3.333333 = 4.115223
     4115.223 + 4.115223 = 4119.338
    but
     1234.567 + 1.234567 = 1235.802
     1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
Because of the issues noted above, naive use of floating-point arithmetic can lead to many problems. The creation of thoroughly robust floating-point software is a complicated undertaking, and a good understanding of numerical analysis is essential.
In addition to careful design of programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.
Floating-point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. [6] The "decimal" data type of the C# programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating-point representation, and make the arithmetic always behave as expected when numbers are printed in decimal.
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.
Expectations from mathematics may not be realised in the field of floating-point computation. For example, it is known that $(x + y)(x - y) = x^2 - y^2$, and that $\sin^2\theta + \cos^2\theta = 1$. These facts cannot be counted on when the quantities involved are the result of floating-point computation.
A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Descriptions of a few simple techniques follow.
The use of the equality test (if (x==y) ...) is usually not recommended when expectations are based on results from pure mathematics. Such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ...), where epsilon is sufficiently small and tailored to the application (such as 1.0E-13). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
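One possible shape for such a fuzzy comparison is sketched below. It is a variant of the test described above: the tolerance is scaled by the larger operand (a relative tolerance) rather than used as a fixed absolute epsilon. The name `nearly_equal` and the parameter `rel_tol` are hypothetical, and the sketch makes no attempt to treat NaN, infinities or values near zero specially.

```c
#include <stdbool.h>

/* Hypothetical helper: true when x and y differ by no more than
   rel_tol times the larger of their magnitudes. */
bool nearly_equal(double x, double y, double rel_tol)
{
    double diff = x - y;
    if (diff < 0.0) diff = -diff;
    double ax = x < 0.0 ? -x : x;
    double ay = y < 0.0 ? -y : y;
    double scale = ax > ay ? ax : ay;
    return diff <= rel_tol * scale;
}
```

In IEEE 754 double precision, `0.1 + 0.2 == 0.3` is false, whereas `nearly_equal(0.1 + 0.2, 0.3, 1e-13)` is true. Whether a relative or an absolute tolerance is appropriate, and how large it should be, depends on the application.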
An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like
      3253.671
    +    3.141276
      --------
      3256.812
The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.
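A minimal sketch of Kahan (compensated) summation in single precision follows, next to a naive loop for comparison. It assumes IEEE 754 `float` arithmetic and a compiler that does not reassociate floating-point operations (options such as `-ffast-math` may optimize the compensation away). The function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: straightforward left-to-right summation. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Hypothetical helper: Kahan summation. `comp` carries the low-order
   part that was lost when the previous addend was added to `sum`. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float comp = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - comp;   /* addend corrected by the lost part */
        float t = sum + y;       /* low-order digits of y are lost here */
        comp = (t - sum) - y;    /* recover what was lost, negated */
        sum = t;
    }
    return sum;
}
```

When a million copies of 0.1f are summed, the naive loop drifts away from 100000 by a large margin, while the compensated sum stays close to it.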
Computations may be rearranged in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

$$t_0 = \frac{1}{\sqrt{3}}$$

$$\text{first form:}\quad t_{i+1} = \frac{\sqrt{t_i^2 + 1} - 1}{t_i}$$

$$\text{second form:}\quad t_{i+1} = \frac{t_i}{\sqrt{t_i^2 + 1} + 1}$$

$$\pi \approx 6 \times 2^i \times t_i, \quad \text{converging as } i \to \infty$$
Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:
     i   6 × 2^i × t_i, first form   6 × 2^i × t_i, second form

     0   3.4641016151377543863       3.4641016151377543863
     1   3.2153903091734710173       3.2153903091734723496
     2   3.1596599420974940120       3.1596599420975006733
     3   3.1460862151314012979       3.1460862151314352708
     4   3.1427145996453136334       3.1427145996453689225
     5   3.1418730499801259536       3.1418730499798241950
     6   3.1416627470548084133       3.1416627470568494473
     7   3.1416101765997805905       3.1416101766046906629
     8   3.1415970343230776862       3.1415970343215275928
     9   3.1415937488171150615       3.1415937487713536668
    10   3.1415929278733740748       3.1415929273850979885
    11   3.1415927256228504127       3.1415927220386148377
    12   3.1415926717412858693       3.1415926707019992125
    13   3.1415926189011456060       3.1415926578678454728
    14   3.1415926717412858693       3.1415926546593073709
    15   3.1415919358822321783       3.1415926538571730119
    16   3.1415926717412858693       3.1415926536566394222
    17   3.1415810075796233302       3.1415926536065061913
    18   3.1415926717412858693       3.1415926535939728836
    19   3.1414061547378810956       3.1415926535908393901
    20   3.1405434924008406305       3.1415926535900560168
    21   3.1400068646912273617       3.1415926535898608396
    22   3.1349453756585929919       3.1415926535898122118
    23   3.1400068646912273617       3.1415926535897995552
    24   3.2245152435345525443       3.1415926535897968907
    25                               3.1415926535897962246
    26                               3.1415926535897962246
    27                               3.1415926535897962246
    28                               3.1415926535897962246

    The true value is 3.141592653589793238462643383...
While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.