



ANALYSIS AND IMPLEMENTATION OF DECIMAL ARITHMETIC HARDWARE IN NANOMETER CMOS TECHNOLOGY

By IVAN DARIO CASTELLANOS

Bachelor of Science in Electrical Engineering, Universidad de los Andes, Bogotá, Colombia, 2001
Master of Science in Electrical Engineering, Illinois Institute of Technology, Chicago, Illinois, 2004

Submitted to the Faculty of the Graduate College of Oklahoma State University in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY, July, 2008

Dissertation Approved:
Dr. James E. Stine, Dissertation Adviser
Dr. Louis Johnson
Dr. Sohum Sohoni
Dr. Nohpill Park
Dr. A. Gordon Emslie, Dean of the Graduate College

TABLE OF CONTENTS

1. INTRODUCTION .... 1
   1.1 Importance of Decimal Arithmetic .... 2
   1.2 The Decimal Floating-Point Standard .... 4
   1.3 A Case for Decimal Arithmetic in General-Purpose Computer Architectures .... 7
2.
BACKGROUND .... 9
   2.1 Binary Comparison .... 9
      2.1.1 Magnitude Comparator Design .... 11
      2.1.2 Two's Complement and Binary Floating-Point Comparator .... 14
   2.2 Addition .... 17
      2.2.1 Binary addition .... 17
      2.2.2 Carry-save addition (CSA) .... 19
      2.2.3 4:2 Compressors .... 20
      2.2.4 Decimal excess-3 addition .... 21
      2.2.5 Direct decimal addition .... 23
      2.2.6 Decimal Floating-Point Adder .... 24
   2.3 Binary Multiplication .... 26
   2.4 Decimal Multiplication .... 30
      2.4.1 High-frequency decimal multiplier .... 34
      2.4.2 Multiplication with efficient partial product generation .... 35
3.
DECIMAL FLOATING-POINT COMPARATOR .... 37
   3.1 Decimal floating-point comparison .... 37
   3.2 Comparator Design .... 40
   3.3 Coefficient magnitude comparison .... 42
   3.4 Special case scenarios .... 44
      3.4.1 One or both numbers is infinite .... 45
      3.4.2 Both operands are zero .... 46
      3.4.3 Exponent difference off-range .... 46
      3.4.4 Alignment shift-out, overflow .... 47
      3.4.5 Coefficient comparison .... 48
   3.5 Combined binary floating-point, two's complement and decimal floating-point comparator .... 50
4. EXPERIMENTS FOR DECIMAL FLOATING-POINT DIVISION BY RECURRENCE .... 53
   4.1 Decimal Division by Digit Recurrence Theory .... 53
   4.2 Quotient Digit Selection .... 56
   4.3 Considerations for the IEEE-754 Decimal Case .... 60
5.
DECIMAL PARTIAL PRODUCT REDUCTION .... 65
   5.1 Decimal Carry-Save Addition .... 68
   5.2 Delay analysis of the decimal 3:2 counter by recoding .... 75
   5.3 Decimal 4:2 Compressor Trees .... 77
6. PARTIAL PRODUCT GENERATION SCHEMES .... 83
   6.1 Multiplier Recoding .... 83
   6.2 Multiplicand Multiples Generation .... 87
      6.2.1 2x and 5x with Conventional Binary Logic .... 87
      6.2.2 2x and 5x using BCD recoding .... 88
   6.3 Partial Product Generation Architectures .... 90
7. RESULTS .... 93
   7.1 Decimal comparator design .... 95
   7.2 Decimal/Binary Combined Comparator .... 96
   7.3 Partial Product Reduction Schemes .... 99
   7.4 Partial Product Generation Architectures .... 102
8.
CONCLUSIONS .... 105
BIBLIOGRAPHY .... 109
APPENDIX A – DENSELY PACKED DECIMAL ENCODING .... 113

LIST OF TABLES

Table 1. Numerical differences between decimal .... 3
Table 2. Decimal floating-point format .... 5
Table 3. Decimal FP Combination Field .... 6
Table 4. Floating-point Condition Codes .... 11
Table 5. BCD Magnitude Comparison .... 13
Table 6. Excess-3 Code .... 21
Table 7. Decimal Multiplication Table, from [20] .... 32
Table 8. Restricted range, signed magnitude products, from [32] .... 35
Table 9. BCD Magnitude Comparison .... 42
Table 10. 8421, 4221 and 5211 BCD representations .... 71
Table 11. Radix 5/4 digit recoding .... 85
Table 12. Area and delay estimates .... 95
Table 13. Area and delay results for comparator designs .... 105
Table 14. Dynamic and Static power results for comparator designs .... 106
Table 15.
Comparison results for proposed compressor trees .... 106
Table 16. Dynamic and static power comparison results for proposed .... 107
Table 17. Area and delay results for VAM [40], LN [42] architectures .... 107
Table 18. Dynamic and static power consumption for partial .... 108
Table 19. DPD Encoding / Compression, taken from [8] .... 113
Table 20. DPD Decoding / Expansion, taken from [8] .... 114

LIST OF FIGURES

Figure 1. Example: decimal floating-point representation of number 8.35 .... 7
Figure 2. Block diagram of the binary portion of the comparator, taken from [15] .... 12
Figure 3. Magnitude comparator example .... 13
Figure 4. Full adder cell and Truth Table .... 18
Figure 5. 4-bit Ripple Carry Adder .... 18
Figure 6. Full adder cells used for carry-save addition .... 20
Figure 7. Weinberger 4:2 Binary Compressor .... 21
Figure 8. Adder for Excess-3 code, taken from [20] .... 23
Figure 9. Decimal Floating-Point adder, from [23] .... 25
Figure 10. Partial products during multiplication, taken from [24] .... 26
Figure 11. Carry Save Array Multiplier, CSAM .... 28
Figure 12. Wallace Tree multiplier reduction for two 4-bit operands .... 29
Figure 13.
Implementation detail for Wallace tree, 5th column .... 29
Figure 14. Three-column 8-bit 4:2 binary compressor tree .... 31
Figure 15. Decimal multiplication table Left and Right algorithm example .... 33
Figure 16. Classic approach to floating-point comparison .... 39
Figure 17. Decimal floating-point comparator design .... 41
Figure 18. Coefficient Aligner, logarithmic barrel shifter .... 41
Figure 19. BCD magnitude comparator .... 44
Figure 20. Combined two's complement, binary and decimal comparator .... 52
Figure 21. Robertson's diagram showing selection intervals for q = k−1 and k .... 55
Figure 22. Truncated inputs to the QDS function .... 56
Figure 23. P-D Diagram, Sk selection points as a function of truncated d, [34] .... 58
Figure 24. Decimal division by digit recurrence implementation .... 64
Figure 25. Multiplication Algorithm .... 65
Figure 26. Binary full adder cells used for decimal carry-save addition .... 68
Figure 27. Block diagram for full adder cells used for decimal Carry-Save addition .... 69
Figure 28. Result of the addition, carry vector shifted left .... 70
Figure 29. Multiplication by 2, recoded result .... 71
Figure 30. Multiplication by 2, result in BCD-4221 .... 72
Figure 31.
Decimal 3:2 counter with BCD recoding example .... 73
Figure 32. Block diagram for decimal 3:2 counter [38] .... 74
Figure 33. Nine-gate Full Adder / 3:2 counter cell .... 76
Figure 34. Decimal 9:2 counter, adapted from [38] .... 78
Figure 35. Proposed decimal 4:2 compressor. All signals are decimal (4 bits) .... 79
Figure 36. Decimal 8:2 compressor, all signals are decimal (4 bits) .... 80
Figure 37. Decimal 8:2 compressor. Critical path and Δ delays shown .... 81
Figure 38. Decimal 16:2 compressor, all signals are decimal (4 bits) .... 82
Figure 39. Multiplication Algorithm .... 84
Figure 40. Digit recoding for radix-5, [38] .... 86
Figure 41. Quintupling through BCD recoding .... 89
Figure 42. Lang-Nannarelli radix-10 partial product generation .... 91
Figure 43. Vásquez-Antelo radix-5 partial product generation .... 92
Figure 44. Design Flow methodology .... 94
Figure 45. Concept block diagram for implementation comparison .... 96
Figure 46. Delay estimates for comparator designs .... 97
Figure 47. Area estimates for comparator designs .... 98
Figure 48. Dynamic power consumption for comparator designs .... 98
Figure 49. Delay estimates for compressor trees vs.
counter trees designs .... 100
Figure 50. Area estimates for compressor trees vs. counter trees designs .... 101
Figure 51. Dynamic power estimates for compressor trees vs. counter trees designs .... 101
Figure 52. Delay results, partial product generation architectures .... 103
Figure 53. Area comparison for partial product generation architectures .... 103
Figure 54. Dynamic power comparison, partial product generation .... 104

1. INTRODUCTION

Many scientific, engineering and commercial applications call for operations with real numbers. In many cases, a fixed-point numerical representation can be used. Nevertheless, this approach is not always feasible, since the range that an application requires is not always attainable with this method. Instead, floating-point numbers have proven to be an effective approach: they offer a wide dynamic range, but are more difficult to implement, are less precise for the same number of digits, and introduce round-off errors.

The floating-point numerical representation is similar to scientific notation, differing in that the radix point location is fixed, usually to the right of the leftmost (most significant) digit. The location of the represented number's radix point is instead indicated by an exponent field. Since the radix point can be assigned anywhere within the given number of bits, numbers with a "floating" radix point handle a wide dynamic range of magnitudes while maintaining suitable precision.

The IEEE standardized the floating-point numerical representation for computers in 1985 with the IEEE-754 standard [1]. A specific encoding of the bits is provided and the behavior of arithmetic operations is precisely defined. This IEEE format minimizes calculation anomalies while permitting different implementation possibilities.
Since the 1950s, binary arithmetic has been predominantly used in computer operations, given its simplicity of implementation in electronic circuits. Consequently, the heavy utilization of binary floating-point numbers makes the IEEE binary floating-point standard a requirement for all existing computer architectures, since it simplifies implementation. More importantly, it allows architectures to efficiently communicate with one another, since numbers adhere to the same IEEE standard.

Although binary encoding in computer systems is prevalent, decimal arithmetic is becoming increasingly important and indispensable, as binary arithmetic cannot always satisfy the needs of many current applications in terms of robustness and precision. Unfortunately, many architectures still resort to software routines to emulate operations on decimal numbers or, worse yet, rely on binary arithmetic and then convert to the necessary precision. When this happens, software routines and binary approximations can drop bits that are crucial to representing the necessary value, potentially causing severe harm to many applications.

1.1 Importance of Decimal Arithmetic

Decimal operations are essential in financial, commercial and many Internet-based applications. Decimal numbers are common in everyday life and are essential when calculated results must match operations that would otherwise be performed by hand [2]. Some conventions even require an explicit decimal approximation. A study presented in [3] shows that numeric data in commercial applications, such as banking, insurance and airlines, is predominantly decimal, at up to 98%. Furthermore, another study discussed in [4] shows that decimal calculations can incur a 50% to 90% processing overhead. One of the main causes of decimal's performance cost is that binary numbers cannot represent most decimal numbers exactly.
A number like 0.1, for example, requires an infinitely recurring binary expansion, whereas it can be represented exactly in decimal. This implies that it is not always possible to guarantee the same results between binary floating-point and decimal arithmetic. This is further illustrated in the following table, where the number 0.9 is repeatedly divided by 10.

Table 1. Numerical differences between decimal and binary floating-point numbers.

Decimal   | Binary
0.9       | 0.9
0.09      | 0.089999996
0.009     | 0.009
0.0009    | 9.0E-4
0.00009   | 9.0E-5
0.000009  | 9.0E-6
9E-7      | 9.0000003E-7
9E-8      | 9.0E-8
9E-9      | 9.0E-9
9E-10     | 8.9999996E-10

It is, therefore, considerably difficult to develop and test applications that require this type of calculation on exact real-world data such as commercial or financial values. Even legal requirements, like the Euro (€) currency regulations, dictate the working precision and rounding method to be used for calculations, in decimal digits [5]. These requirements can only be met by working in base 10, using an arithmetic that preserves precision.

Typically, decimal computations are performed on binary hardware through software emulation and mathematical approximations, since requirements specific to decimal numbers cannot always be met in pure binary form. These requirements may include arithmetic that preserves the number of decimal places (including trailing zeroes or unnormalized coefficients) and decimal rounding, among others. In all cases, any scaling, rounding, or exponent handling has to be managed explicitly by the application or the programmer, a complex and very error-prone task. Since binary computations for decimal arithmetic tend to be slow, significant performance improvements may result from using decimal floating-point hardware. Native (hardware) decimal floating-point arithmetic makes programming far simpler and more robust, and produces significantly better performance in computer applications.
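The exactness gap described above can be reproduced directly. The following is a small illustrative sketch using Python's standard `decimal` module (any language with a base-10 decimal type would behave similarly), contrasting binary floating point with decimal arithmetic:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so even a single
# addition drifts away from the expected decimal result.
print(0.1 + 0.2)                # 0.30000000000000004
print(0.1 + 0.2 == 0.3)         # False

# A decimal type carries base-10 digits exactly.
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# Repeatedly dividing 0.9 by 10, as in Table 1: the decimal result
# stays exact at every step.
x = Decimal("0.9")
for _ in range(9):
    x = x / 10
print(x)  # 9E-10
```

Note that Python's `float` is double precision, so its errors appear in later digits than the single-precision values shown in Table 1, but the underlying cause is the same.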
Hardware of this type can improve the speed of decimal floating-point calculations by two to three orders of magnitude compared to a software approach. Its impact is further highlighted by IBM's release of the Power6 processor, the first UNIX microprocessor able to perform decimal floating-point arithmetic in hardware [7]. As an example, shown in [4], division of a JIT-compiled (Java Just-In-Time) 9-digit BigDecimal number takes more than 13,000 clock cycles on an Intel® Pentium™ processor, while a 9-digit decimal addition requires more than 1,100 clock cycles. In contrast, binary arithmetic takes 41 cycles for integer division and 3 cycles for an addition on the same processor. Dedicated decimal hardware, if available, would be comparable to these binary figures.

1.2 The Decimal Floating-Point Standard

The increasing importance of decimal arithmetic is highlighted by the specifications included in the current revision draft of the IEEE-754 standard for floating-point arithmetic, IEEE-754R [8]. Decimal floating-point numbers are in a format similar to scientific notation:

(−1)^S × Coefficient × 10^(Exponent − Bias),

where S is either 1 or 0 and determines the sign of the number. The exponent is biased to avoid negative representations. In other words, all exponents are represented relative to a known value assigned to exponent zero. For example, if the bias is 127, stored exponents below 127 represent negative exponents and those above 127 represent positive ones. To illustrate with a specific value, a stored exponent of 125 with a bias of 127 represents an exponent of −2.

The current draft specifies the representation of three decimal number types: decimal32, decimal64 and decimal128, encoded in 32, 64 and 128 bits respectively. The value of the number is encoded in four different fields. An illustration of this representation for decimal64 numbers is shown in Table 2, taken from [9].

Table 2.
Decimal floating-point format

Length (bits):  1    | 5                 | 8                    | 50
Description:    Sign | Combination Field | Exponent Continuation | Coefficient Continuation

Decimal64 numbers comprise a 16-digit coefficient and a 10-bit biased exponent. The sign bit indicates the sign of the number, in the same way as for binary floating-point numbers. Both the exponent and the coefficient have part of their value encoded in the combination field: the two most significant bits (MSBs) of the exponent and the most significant digit (MSD) of the coefficient. The combination field also determines whether the number represented is a finite number, an infinite number or a NaN (Not-a-Number). Quiet and signaling NaNs (the latter triggering an exception) are distinguished by the first bit of the exponent continuation field. Table 3, shown in [9], illustrates the combination field, which depends on whether the number is an infinity, a NaN or a finite number. The combination field is also encoded differently depending on whether the finite number's MSD is greater than or equal to 8, as illustrated in the first two entries of the table. The exponent is therefore a biased, unsigned 10-bit binary number, and the coefficient is given by a specific 10-bit-per-3-decimal-digits encoding representing a 16-decimal-digit number. An additional important characteristic of decimal floating-point numbers is that they are not normalized, as opposed to binary floating-point numbers.

Table 3. Decimal FP Combination Field

Type           | Combination Field (5 bits) | Exponent MSBs (2 bits) | Coefficient MSD (4 bits)
Finite, MSD < 8 | a b c d e                 | a b                    | 0 c d e
Finite, MSD > 7 | 1 1 c d e                 | c d                    | 1 0 0 e
Infinity        | 1 1 1 1 0                 | –                      | –
NaN             | 1 1 1 1 1                 | –                      | –

For example, suppose a programmer wants to encode the number −8.35 into decimal64. The first step is to break the number into its coefficient and exponent, which produces 835 (with 13 leading zero decimal digits, given that coefficients are 16 digits long) and −2 respectively, i.e. −835 × 10^−2.
For decimal64 numbers, where the bias is 398, an exponent of −2 becomes 396 (01 1000 1100 in binary). The combination field for −835 × 10^−2 contains the two most significant bits (MSBs) of the exponent (01 in this example) and the most significant digit (MSD) of the coefficient (4 bits, 0000 in this case since the MSD is zero). According to Table 3, for finite numbers with the most significant digit below 8, the 5-bit combination field abcde decodes ab as the exponent's MSBs and 0cde as the MSD. Thus, for −8.35 the combination field becomes 01000. The remaining 8 bits of the exponent, 0x8C, are placed in the exponent continuation field.

Finally, the coefficient is given in the coefficient continuation field using Densely Packed Decimal (DPD) encoding [10]. DPD encoding provides an efficient method of storing and translating between 10-bit groups and 3 BCD decimal digits using simple Boolean expressions. A more detailed explanation of DPD can be found in the Appendix. The DPD codification of three BCD decimal digits into 10 bits is called compression, and it depends on the size of each digit: small or large (3 bits suffice for values less than or equal to 7; 4 bits are needed for values greater than 7). A specific mapping is used in each situation: all digits small, left digit large, middle digit large, etc. [10]. The three digits 835 are given in BCD as bits abcd efgh ijkm (1000 0011 0101). Bits a, e and i indicate whether the corresponding digits are large or small. For this specific case, in which the left digit is large, the mapping used for the encoding has the form [jkd fgh 1 10 m], yielding 0x23D (see the second table in Appendix A). Therefore, the decimal64 representation of −8.35 is, in hexadecimal, A2 30 00 00 00 00 02 3D.

Figure 1. Example: decimal floating-point representation of the number −8.35.
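The worked example above can be checked mechanically. The sketch below is a deliberately partial encoder, implementing only the two DPD declet cases needed for this example (all three digits small, and left digit large) and hard-coding the decimal64 parameters stated in the text (bias 398, 16-digit coefficient, MSD < 8 combination-field case); the function names are mine, not from the standard.

```python
def dpd_declet(d2, d1, d0):
    """Compress three BCD digits into a 10-bit DPD declet.

    Partial sketch: only the 'all digits small' and 'left digit large'
    mappings described in the text are handled.
    """
    a, b, c, d = (d2 >> 3) & 1, (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    e, f, g, h = (d1 >> 3) & 1, (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    i, j, k, m = (d0 >> 3) & 1, (d0 >> 2) & 1, (d0 >> 1) & 1, d0 & 1
    if (a, e, i) == (0, 0, 0):        # all digits 0-7
        bits = (b, c, d, f, g, h, 0, j, k, m)
    elif (a, e, i) == (1, 0, 0):      # left digit 8-9: [jkd fgh 1 10 m]
        bits = (j, k, d, f, g, h, 1, 1, 0, m)
    else:
        raise NotImplementedError("other DPD cases omitted in this sketch")
    v = 0
    for bit in bits:
        v = (v << 1) | bit
    return v

def encode_decimal64(sign, coeff, exponent, bias=398):
    """Assemble sign, combination field, exponent continuation and five
    DPD declets into a 64-bit word (finite numbers with MSD < 8 only)."""
    digits = [int(ch) for ch in str(coeff).zfill(16)]
    biased = exponent + bias                    # 10-bit biased exponent
    exp_msbs, exp_cont = biased >> 8, biased & 0xFF
    msd = digits[0]
    assert msd < 8, "MSD >= 8 combination-field case omitted"
    comb = (exp_msbs << 3) | msd                # abcde = ab ++ cde, MSD = 0cde
    v = sign
    v = (v << 5) | comb
    v = (v << 8) | exp_cont
    for p in range(1, 16, 3):                   # five declets for 15 digits
        v = (v << 10) | dpd_declet(*digits[p:p + 3])
    return v

# -8.35 = -835 x 10^-2
word = encode_decimal64(sign=1, coeff=835, exponent=-2)
print(format(word, "016X"))   # A23000000000023D
```

Running this reproduces the byte string from the example, A2 30 00 00 00 00 02 3D, including the 0x23D declet for the digits 835.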
1.3 A Case for Decimal Arithmetic in General-Purpose Computer Architectures

Decimal arithmetic has long been studied in computer architectures; however, most silicon implementations suffered due to area requirements. By the year 2010, processors with 2 billion transistors are expected to be developed [11]. Therefore, the large number of transistors available within silicon implementations, together with the increased sophistication of design tools, gives designers the ability to include new and important features, such as decimal arithmetic.

Previous work in decimal arithmetic includes high-speed multipliers [12][13][14], algorithms for decimal adders [15][16][17] and multioperand addition [18][19], and algorithms for decimal partial product generation [14][19][20]. Although there has been a large amount of research interest in decimal arithmetic architectures, many of the architectures fail to produce designs that are targeted at real implementations, especially designs below 180 nm. This dissertation studies these designs by offering possible solutions and implementations in decimal arithmetic and, in some cases, showing how they can be combined with binary arithmetic to produce combined binary/decimal arithmetic units.

2. BACKGROUND

Decimal arithmetic operations were significantly researched in the 1950s and the latter part of the 20th century, but binary arithmetic hardware nonetheless took over computer calculations. The reasoning behind this came after Burks, Goldstine, and von Neumann published a preliminary study on computer design [12].
They argued that, for scientific computation, simplicity was the major advantage of binary hardware, increasing its performance and reliability. Furthermore, decimal numbers would need to be stored in coded binary form, requiring extra storage (bits) to maintain the same precision as binary, and operations on them would require more circuitry than operations performed in pure binary form. Nevertheless, if conversions from decimal to binary and vice versa are needed, then it is significantly more efficient to perform operations in decimal hardware [4].

In many cases, techniques developed for binary arithmetic hardware can be applied, to some extent, to decimal hardware. It is therefore important to explore relevant approaches and research in binary arithmetic, since they give invaluable insight into solving decimal arithmetic problems.

2.1 Binary Comparison

An important element in general-purpose and application-specific architectures is the comparator [22]. The design of high-speed, efficient comparators aids the performance of these architectures. The idea of designing efficient comparators is not new, as seen from previous studies in [23], [25]. Nevertheless, further gains in area and power can be obtained by designing a comparator that can handle different data types using an efficient, compatible comparison method.

The work in [26] presents the design and implementation of a high-performance comparator capable of handling 32-bit and 64-bit two's complement numbers and single- and double-precision binary floating-point numbers. This type of design is especially useful for reducing costs in processors, since it allows the same hardware to be used to compare multiple data types. A novel approach to the magnitude comparison problem was used, with a comparator module that has logarithmic delay. The design can also be easily extended to support 128-bit binary floating-point numbers and can accommodate pipelining to improve throughput.
The IEEE 754 standard [1] specifies floating-point comparisons in which the relation between two numbers is given as greater than, less than, equal or unordered. When either of the compared operands is Not-a-Number (NaN), the result of the comparison is unordered. If the NaN is a signaling NaN, then an Invalid exception flag bit is asserted. The result of the comparison is represented by floating-point condition codes, or FCC [27]. Table 4 shows the FCC representation of the comparison result in this design. Note that bit FCC[1] is analogous to a greater-than flag (GT) and FCC[0] is analogous to a less-than flag (LT). When both flags are zero the numbers are equal, and when both are one the numbers are unordered.

Table 4. Floating-point Condition Codes

FCC[1] (GT) | FCC[0] (LT) | Relation
0           | 0           | A = B
0           | 1           | A < B
1           | 0           | A > B
1           | 1           | Unordered

As illustrated in [26], the combined comparator is composed of three blocks. A 2-bit Sel signal indicates the type of operands being compared. The first block converts 32-bit operands to 64-bit operands so that all operand sizes are handled by the same hardware; like most floating-point implementations, 32-bit numbers are converted to 64-bit numbers to simplify the logic. The second block performs a magnitude comparison. Finally, the third block takes care of exceptions and special cases according to the IEEE 754 standard. It also correctly handles the signs of the input operands.

2.1.1 Magnitude Comparator Design

The magnitude comparator devised in [26], and shown in Figure 2, is the core of the comparator module. The two operands A and B are compared in stages. The first stage compares corresponding 2-bit pairs from each operand. Two output bits, GT (greater than) and LT (less than), from each element indicate whether the result of the compared pair is greater than, less than or equal, as shown in Table 5.
If the bit pairs of A and B are denoted by A[2i+1, 2i] and B[2i+1, 2i], then the values of GT[i]_1 and LT[i]_1 (where the subscript 1 indicates the first stage of the comparison) are given by:

GT[i]_1 = A[2i+1]·¬B[2i+1] + A[2i+1]·A[2i]·¬B[2i] + A[2i]·¬B[2i+1]·¬B[2i],
LT[i]_1 = ¬A[2i+1]·B[2i+1] + ¬A[2i+1]·¬A[2i]·B[2i] + ¬A[2i]·B[2i+1]·B[2i],

for 0 ≤ i ≤ ⌈n/2⌉ − 1, where n is the operand size (in this case 64) and ¬x denotes the complement of x.

Figure 2. Block diagram of the binary portion of the comparator, taken from [26].

In subsequent stages the same process is used, except that the GT[i]_j signals replace the A[i] signals and the LT[i]_j signals replace B[i], where j denotes the comparator stage. An additional reduction is possible in subsequent stages, since GT and LT cannot both equal 1 at the same time; therefore, for j > 1 the equations simplify to:

GT[i]_j = GT[2i+1]_(j−1) + GT[2i]_(j−1)·¬LT[2i+1]_(j−1),
LT[i]_j = LT[2i+1]_(j−1) + LT[2i]_(j−1)·¬GT[2i+1]_(j−1).

A total of k = ⌈log2(n)⌉ stages is required to obtain the final result, given by GT[0]_k and LT[0]_k. For this 64-bit implementation, n = 64 and k = 6 stages are required.

Table 5. BCD Magnitude Comparison

GT[i] | LT[i] | Result
0     | 0     | A[2i+1,2i] = B[2i+1,2i]
0     | 1     | A[2i+1,2i] < B[2i+1,2i]
1     | 0     | A[2i+1,2i] > B[2i+1,2i]
1     | 1     | invalid

Figure 3. Magnitude comparator example, A = 0x8D and B = 0x93; n = 8, so k = 3 stages are necessary.

Figure 3 illustrates an example using this logarithmic-tree magnitude comparator. The operand size in this case is 8 bits, and therefore only 3 stages are necessary (k = 3).
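The stage equations can be exercised in software. Below is an illustrative sketch (the function name and structure are mine, not from [26]) of the pairwise logarithmic comparator, verified against ordinary integer comparison:

```python
def magnitude_compare(A, B, n):
    """Logarithmic-tree magnitude comparison of two n-bit operands.

    Returns (GT, LT); (0, 0) means equal. n must be a power of two.
    """
    gt, lt = [], []
    # Stage 1: compare corresponding 2-bit pairs (index 0 = least significant).
    for i in range(n // 2):
        a1, a0 = (A >> (2 * i + 1)) & 1, (A >> (2 * i)) & 1
        b1, b0 = (B >> (2 * i + 1)) & 1, (B >> (2 * i)) & 1
        na1, na0, nb1, nb0 = a1 ^ 1, a0 ^ 1, b1 ^ 1, b0 ^ 1
        gt.append(a1 & nb1 | a1 & a0 & nb0 | a0 & nb1 & nb0)
        lt.append(na1 & b1 | na1 & na0 & b0 | na0 & b1 & b0)
    # Stages 2..k: pairwise reduction; the more significant pair dominates
    # unless it reports equal (neither GT nor LT asserted).
    while len(gt) > 1:
        new_gt = [gt[2*i+1] | (gt[2*i] & (lt[2*i+1] ^ 1))
                  for i in range(len(gt) // 2)]
        new_lt = [lt[2*i+1] | (lt[2*i] & (gt[2*i+1] ^ 1))
                  for i in range(len(lt) // 2)]
        gt, lt = new_gt, new_lt
    return gt[0], lt[0]

# The worked example from Figure 3: A = 0x8D, B = 0x93, n = 8.
print(magnitude_compare(0x8D, 0x93, 8))   # (0, 1), i.e. A < B
```

An exhaustive check over all 8-bit operand pairs confirms that the two-level-logic stage equations reproduce integer comparison exactly.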
The comparison computed is A against B, where A = 0x8D and B = 0x93. Each number is separated into bit pairs and each corresponding pair is compared individually. Subsequent stages group the LT and GT signals together as shown. The final stage yields GT = 0 and LT = 1, giving A < B as expected.

2.1.2 Two's complement and binary floating-point comparator

The magnitude of the operands, however, is not the only characteristic considered when comparing numbers. To compare two's complement or floating-point numbers, the sign must also be considered. The third stage shown in Figure 2 sets the LT output to one in any of the following four cases:

1) A is negative and B is positive.
2) A and B are positive and the magnitude of A is less than the magnitude of B.
3) A and B are negative two's complement numbers and the magnitude of A is less than the magnitude of B.
4) A and B are negative floating-point numbers and the magnitude of A is greater than the magnitude of B.

To minimize the complexity of the other predicates, Greater Than (GT) and Equal to (EQ), logic is asserted based on whether the input operands are LT or not LT, given that LT, GT, and EQ cannot all be asserted simultaneously. This translates to simple logic for both 32- and 64-bit numbers. To make sure the values for the cases listed above are produced correctly in this implementation, only LT[0]6 and GT[0]6 are computed with dedicated logic, since EQ[0]6 can be produced by the following equation:

EQ[0]6 = (LT[0]6 + GT[0]6)′

The subscript 6 denotes the sixth level of the magnitude comparator, that is, the final stage given 64-bit operands; i.e., LT[0]6 and GT[0]6 are the outputs of the magnitude comparator module.
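The staged comparison of Section 2.1.1 can be sketched behaviorally. The following model (an illustrative sketch, not the gate-level design; it assumes the pair count is a power of two) builds the first-stage pair flags and then merges them with the simplified later-stage equations:

```python
def compare_magnitude(a: int, b: int, n: int) -> tuple:
    """Logarithmic-tree magnitude comparator model returning (GT, LT)."""
    # Stage 1: compare corresponding 2-bit pairs of the operands.
    gt, lt = [], []
    for i in range(n // 2):
        pa = (a >> (2 * i)) & 0b11
        pb = (b >> (2 * i)) & 0b11
        gt.append(1 if pa > pb else 0)
        lt.append(1 if pa < pb else 0)
    # Later stages: GT[i] = GT[2i+1] + GT[2i]*LT'[2i+1], and dually for LT.
    while len(gt) > 1:
        gt, lt = (
            [gt[2*i+1] | (gt[2*i] & (lt[2*i+1] ^ 1)) for i in range(len(gt) // 2)],
            [lt[2*i+1] | (lt[2*i] & (gt[2*i+1] ^ 1)) for i in range(len(lt) // 2)],
        )
    return gt[0], lt[0]
```

For the Figure 3 example, compare_magnitude(0x8D, 0x93, 8) yields (0, 1), i.e., A < B after k = 3 stages.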
Consequently, the values of LT, EQ, and GT for the whole design can be produced for two's complement numbers as (a prime again denoting the complement):

EQ = EQ[0]6 · (A[63]·B[63] + A′[63]·B′[63])
LT = (A[63]·B′[63] + B′[63]·LT[0]6 + A[63]·LT[0]6) · EQ′
GT = (LT + EQ)′

Floating-point comparisons, on the other hand, are complicated by the exceptions mandated by the IEEE 754 standard. The major exception that should be detected during comparison is whether the operands are Unordered. According to the IEEE 754 standard, values are unordered if either operand is a NaN and a floating-point comparison is being performed. The hardware for detecting unordered operands may vary from one processor to the next, because the standard allows discretion in defining the specific signaling and quiet NaN bit patterns. The IEEE 754 standard also states that comparisons must output an Invalid Operation exception if either operand is a signaling NaN [1]. Furthermore, a final test must be performed to make sure +0 and −0 compare Equal, regardless of sign. In summary, the floating-point comparison must handle Invalid operations and both types of NaNs, and must not differentiate between the two types of zeroes. As with two's complement numbers, the comparator is simplified by computing whether the two operands are Equal or Less than each other. Once these two outputs are known, it is simple to produce the Greater than output. For the combined unit, the Less than comparison uses the same cases tabulated previously, accounting for a floating-point operation. The floating-point test for Equal, however, needs to be modified to account for either equal operands or the comparison of zeroes. Therefore, the combined comparator (two's complement and floating-point) handles two cases for determining whether the operands are Equal:

1) The operand magnitudes are equal AND the operands' signs are equal.
2) The operand magnitudes are zero AND the operands are floating-point numbers.
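The sign and magnitude case analysis developed in this subsection can be modeled behaviorally. The sketch below is illustrative only; the names and flag-style interface are assumptions, and the FCC bits are forced to 1,1 for unordered results, matching Table 4:

```python
def condition_codes(sign_a, sign_b, lt_mag, eq_mag,
                    nan_a, nan_b, a_zero, fp):
    """Return (FCC[1], FCC[0]) from precomputed boolean flags.

    lt_mag/eq_mag stand for the magnitude comparator outputs LT[0]6/EQ[0]6."""
    uo = (nan_a or nan_b) and fp
    # Equal: equal magnitudes and equal signs, or both zero in floating point.
    eq = eq_mag and ((sign_a == sign_b) or (a_zero and fp)) and not uo
    # Less than: the four sign/magnitude cases listed in Section 2.1.2.
    lt = ((sign_a and not sign_b) or
          (not sign_a and not sign_b and lt_mag) or
          (sign_a and sign_b and lt_mag and not fp) or   # two's complement
          (sign_a and sign_b and not lt_mag and fp)      # floating point
          ) and not eq and not uo
    gt = not (lt or eq or uo)
    return gt or uo, lt or uo   # unordered forces both flags, as in Table 4
```

For example, comparing the floating-point values −2 and −3 (both signs set, |A| < |B|) gives (True, False), i.e., A > B.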
The final equations for the combined comparator are given below, where Azero represents a literal testing whether A is +0 or −0, fp represents a literal specifying a floating-point operation and UO represents a literal indicating unordered operands:

UO = (NaNA + NaNB) · fp
EQ = EQ[0]6 · (A[63]·B[63] + A′[63]·B′[63] + Azero·fp) · UO′
LT = (A[63]·B′[63] + A′[63]·B′[63]·LT[0]6 + A[63]·B[63]·LT[0]6·fp′ + A[63]·B[63]·LT′[0]6·fp) · EQ′ · UO′
GT = (LT + EQ + UO)′

For these equations, logic is saved by only testing whether A is zero, since EQ[0]6 already indicates whether the operands are equal, making a test of B against zero redundant. Sign extension for 32-bit two's complement numbers is implemented by sign-extending the 32nd bit into the upper 32 bits of the comparator. IEEE single-precision numbers do not need to be converted to double-precision numbers, since the two formats have the same basic structure and the exponents are biased integers. The logic to detect NaNs and zeros for the two floating-point formats differs slightly, since single-precision numbers have smaller significands and exponents than double-precision numbers.

2.2 Addition

Addition is a fundamental arithmetic operation, and the design of efficient adders also improves the performance and efficiency of other operation units such as multipliers and dividers. Decimal addition has been researched, but not as heavily as binary addition, and only a handful of research papers can be found on the topic. Nevertheless, binary arithmetic is important for the decimal case, since decimal numbers are represented in binary and many concepts and techniques developed for binary addition apply to some extent as well. An overview of some relevant binary addition concepts is therefore necessary.

2.2.1 Binary addition

One of the most basic elements in addition is the Full Adder (FA) or 3:2 counter.
Such adders are sometimes called counters because they effectively count the number of ones presented at their inputs [28]. The FA takes three single-bit inputs, xi, yi and ci, and produces two single-bit outputs, si and ci+1, corresponding to [29]:

xi + yi + ci = 2·ci+1 + si ,

where ci is commonly referred to as the carry-in and ci+1 the carry-out. The logic equations for the Full Adder cell are given by:

si = xi ⊕ yi ⊕ ci
ci+1 = xi·yi + xi·ci + yi·ci

Figure 4. Full adder cell and truth table.

The full adder cell can be utilized to create n-bit operand adders, as shown in the next figure. This simple approach, called Ripple Carry Adder or Carry Propagate Adder (RCA/CPA), has the disadvantage of a significant delay due to the long carry chain, as the carry propagates from c0 through c1 all the way to the MSB, in this case s3.

Figure 5. 4-bit Ripple Carry Adder.

In order to speed up this process, certain aspects of the addition in each cell can be exploited, as is the case for the Carry Lookahead Adder (CLA). A carry present at the FA's carry-in from the previous significant bit is said to propagate if either xi or yi is equal to 1. On the other hand, if a carry is generated within the FA cell, which happens when both xi and yi are 1, the cell is said to generate a carry-out. Logic equations for the generate and propagate signals (g and p), as well as an equation describing when a carry-out takes place, can therefore be determined from the inputs:

gi = xi·yi
pi = xi + yi
ci+1 = gi + ci·pi

The last equation can be used to determine the carry-out of the next significant bit:

ci+2 = gi+1 + ci+1·pi+1 = gi+1 + (gi + ci·pi)·pi+1 ,

showing that ci+2 can be obtained exclusively from the operand inputs, without the need for the carry ci+1 as in Figure 5.
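These relations can be illustrated with a small behavioral model (a sketch for illustration, not a hardware description): a full adder cell, a ripple-carry adder built from it, and the unrolled lookahead recurrence ci+1 = gi + pi·ci:

```python
def full_adder(x, y, c):
    """One FA cell: sum and carry-out bits."""
    s = x ^ y ^ c
    cout = (x & y) | (x & c) | (y & c)
    return s, cout

def ripple_carry_add(x, y, n, c0=0):
    """n-bit RCA: the carry ripples from bit 0 up to bit n-1."""
    s, c = 0, c0
    for i in range(n):
        si, c = full_adder((x >> i) & 1, (y >> i) & 1, c)
        s |= si << i
    return s, c   # n-bit sum and final carry-out

def lookahead_carries(x, y, n, c0=0):
    """All carries from generate/propagate only: c[i+1] = g[i] + p[i]*c[i]."""
    carries = [c0]
    for i in range(n):
        g = (x >> i) & (y >> i) & 1       # generate: both bits are 1
        p = ((x >> i) | (y >> i)) & 1     # propagate: either bit is 1
        carries.append(g | (p & carries[-1]))
    return carries
```

For example, ripple_carry_add(11, 5, 4) returns (0, 1): the 4-bit sum wraps to 0 with a carry-out, since 11 + 5 = 16.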
This provides a method of obtaining the carry-out for each bit position without a carry chain, significantly speeding up the process. As the operand size increases, however, the complexity of each new bit's carry-out logic grows significantly, making the method impractical for operands of more than about 4 bits, depending on the technology used. In that case, further techniques allow carry generate and propagate signals to be obtained for an n-bit block and improve the adder's implementation.

2.2.2 Carry save addition (CSA)

Carry-save addition is the idea of using adders that are not connected in series, as in the Ripple Carry Adder, but that instead count, and hence avoid, the ripple carry chain. In this way multi-operand additions can be carried out without the excessive delay resulting from long carry chains. The following example shows how a 4-bit CSA accepts three 4-bit numbers and generates a 4-bit partial sum and a 4-bit carry vector, avoiding the connection of each bit adder's carry-out to the carry-in of the next adder.

Figure 6. Full adder cells used for carry-save addition (example: 3 + 7 + 8 reduced to a sum vector of 12 and a carry vector of 6).

The example demonstrates how performing addition in a given array (each column in the figure) produces an output with a smaller number of bits, in this case from 3 bits to 2. This process is called reduction and is very useful during multiplication.

2.2.3 4:2 Compressors

One particularly useful carry-save adder is the 4:2 compressor presented in [30]. The main reason for using compressors is that their carry-out (cout) is no longer dependent on the cin, as shown in Figure 7. This gives compressors a significant advantage over traditional carry-save adder trees implemented with 3:2 counters, in that the carry chain can be processed faster while still maintaining a regular structure.

Figure 7. Weinberger 4:2 Binary Compressor.
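The reduction in Figure 6 can be expressed compactly: a 3:2 carry-save step forms bitwise sum and carry vectors whose weighted total equals the three inputs. A small sketch, assuming word-wide Python integers stand in for the bit vectors:

```python
def carry_save_add(a, b, c):
    """3:2 reduction: a + b + c == sum_v + carry_v, with no carry chain."""
    sum_v = a ^ b ^ c                                # per-column sum bits
    carry_v = ((a & b) | (a & c) | (b & c)) << 1     # carries, shifted to their weight
    return sum_v, carry_v

# The Figure 6 example: 3 + 7 + 8 reduces to the vectors 12 and 6.
s, c = carry_save_add(3, 7, 8)
assert (s, c) == (12, 6) and s + c == 3 + 7 + 8
```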
2.2.4 Decimal excess-3 addition

As stated earlier, decimal numbers are usually stored as Binary Coded Decimal (BCD). BCD numbers have 6 unused combinations, from 1010 to 1111 in binary, and this complicates addition and subtraction for the decimal case. Furthermore, negative numbers cannot be represented in two's complement fashion, which is a common method for subtraction in the binary case. A different coding for decimal numbers, called Excess-3, is important since it has many useful properties for subtraction and addition. Excess-3 code can be generated by simply adding a binary 3 to the common BCD code, as shown in Table 6.

Table 6. Excess-3 Code.

Decimal Value   Excess-3 Code
0               0011
1               0100
2               0101
3               0110
4               0111
5               1000
6               1001
7               1010
8               1011
9               1100

Except for some corrections necessary during addition/subtraction, common binary techniques can be applied for arithmetic operations. Most importantly, the addition of two numbers creates a decimal carry that is available at the carry output of the most significant binary bit. This occurs because the addition of two excess-3 digits creates a result in excess-6, which already eliminates the 6 unwanted binary combinations. Furthermore, the code is self-complementing [31]. This implies that a subtraction or negative-number addition can be obtained by inverting all bits of the digit and adding a binary ulp, in the same way as for two's complement binary numbers. The following equation, taken from [31], shows the result of adding two Excess-3 numbers, where D1 and D2 are the decimal digits and the excess-3 coding of a digit D is D + 3:

SUM = (D1 + 3) + (D2 + 3) = D1 + D2 + 6

There are two possibilities to consider for the sum result. When D1 + D2 < 10, no carry to the next higher digit is needed and the excess-6 result can be corrected by subtracting 3 from the sum. This can be easily accomplished by adding 13 and ignoring the carry output (which effectively subtracts 16), for a net subtraction of 3.
When D1 + D2 ≥ 10, a decimal carry should be signaled to the digit in the next place. This can be accomplished by sending the carry-out signal of the most significant bit. Nevertheless, this sends a carry of 16 (6 too much), and hence by adding 3 the result is restored to excess-3 code. Note that in both cases the correction requires the addition of 3 or 13, which can be accomplished with a simple inverter on the LSB output. Figure 8 shows an implementation of an excess-3 adder.

Figure 8. Adder for Excess-3 code, taken from [31].

2.2.5 Direct decimal addition

The use of excess-3 code permits the addition of two decimal numbers by using a correction method that handles the six unwanted values of BCD code (1010 to 1111 in binary) after the operation takes place. In contrast, a different approach proposed in [15] presents logic that performs direct decimal addition, where a combinational element takes as inputs two 4-bit BCD numbers xi and yi and a carry-in ci[0], and outputs a 4-bit BCD digit si and a 1-bit carry-out ci+1[0] satisfying:

(ci+1, si) = xi + yi + ci[0] ,

where ci+1 has ten times the weight of si.
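The two digit-addition schemes just described can be modeled at the behavioral level. The sketch below is illustrative (the function names are assumptions): the excess-3 adder applies the +3/+13 correction depending on the decimal carry, while the direct decimal adder simply realizes (ci+1, si) = xi + yi + ci[0] on BCD digits:

```python
def excess3_add_digit(d1_x3, d2_x3, cin=0):
    """Add two excess-3 digits; returns (sum in excess-3, decimal carry)."""
    raw = d1_x3 + d2_x3 + cin        # excess-6 intermediate result
    carry = 1 if raw >= 16 else 0    # carry out of the most significant binary bit
    if carry:
        corrected = (raw + 3) & 0xF    # a carry of 16 was sent (6 too much): add 3
    else:
        corrected = (raw + 13) & 0xF   # add 13, drop the carry: net subtraction of 3
    return corrected, carry

def bcd_digit_add(x, y, cin=0):
    """Direct decimal digit addition: BCD in, BCD out, decimal carry out."""
    t = x + y + cin
    return (t - 10, 1) if t >= 10 else (t, 0)

# 5 + 7 = 12 in both schemes (excess-3 codes of 5 and 7 are 8 and 10).
assert excess3_add_digit(8, 10) == (5, 1)   # 5 is the excess-3 code for digit 2
assert bcd_digit_add(5, 7) == (2, 1)
```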
The following logic equations describe the direct decimal adder [12], where j indexes the bits of digit i:

gi[j] = xi[j]·yi[j], 0 ≤ j ≤ 3   ("generate")
pi[j] = xi[j] + yi[j], 0 ≤ j ≤ 3   ("propagate")
hi[j] = xi[j] ⊕ yi[j], 0 ≤ j ≤ 3   ("addition")

ci[1] = gi[0] + pi[0]·ci[0]
ki = gi[3] + pi[3]·pi[2] + pi[3]·pi[1] + gi[2]·pi[1]
li = pi[3] + gi[2] + pi[2]·gi[1]

si[0] = hi[0] ⊕ ci[0]
ci+1[0] = ki + li·ci[1]

Here ki indicates that the digit sum generates a decimal carry regardless of ci[1], and li indicates that a carry into bit position 1 will produce a decimal carry. The remaining sum bits, si[1], si[2] and si[3], are formed from similar, though more involved, combinations of the h, k, l and carry signals given in [12]. These equations describe a decimal full adder that can be utilized for either carry-save or carry-propagate addition.

2.2.6 Decimal Floating-Point Adder

To the author's knowledge, the only arithmetic module published to date that is compliant with the current revision draft of IEEE 754 is the decimal floating-point adder published by Thompson, Karra and Schulte in [16], and hence its inclusion in this section is significant. This design differs from previous decimal adders in that it is fully compliant with the standard, including special value cases and exception handling, and in that it is capable of generating a complete result in a single cycle instead of a single digit per cycle.

Figure 9. Decimal floating-point adder, from [16].

Figure 9 shows a block diagram of the adder design. Initially the two IEEE 754 decimal numbers are decoded into their sign bits, coefficient (BCD) and exponent fields (two's complement binary).
The operand exchange block orders the coefficients according to which number's exponent is greater, followed by the operation unit, which determines the actual operation to be performed (addition or subtraction) depending on the signs of the operands. The coefficients, or significands, are aligned and then converted into excess-3 format for their binary addition and the determination of flag bits. The result is finally corrected, depending on the previously set flags, shifted and rounded, allowing it to be encoded back into the IEEE 754 decimal format. The adder presented in that work also allows up to 5 stages of pipelining, which improves its critical path delay. It is also important because it explores the advantage of using excess-3 coding for decimal addition.

2.3 Binary Multiplication

Since decimal operations are performed on binary circuits, understanding how binary multiplication is achieved aids in the application and development of new techniques for the decimal case. A brief overview of its most significant implementations is given in this section for that purpose. The multiplication of two binary numbers, multiplier X (xN−1, xN−2, ..., x1, x0) and multiplicand Y (yM−1, yM−2, ..., y1, y0), is determined by the following equation [32]:

P = [Σ(i=0..N−1) xi·2^i] · [Σ(j=0..M−1) yj·2^j] = Σ(i=0..N−1) Σ(j=0..M−1) xi·yj·2^(i+j) .

This is illustrated in Figure 10 for the case of 6-bit operands:

Figure 10. Partial products during multiplication, taken from [32].

Each partial product bit can be generated by a simple AND gate between the corresponding multiplier and multiplicand bits, as shown in Figure 10. All partial products are then reduced or "compressed" by addition into a single product result. A direct implementation of this method is given in the Carry Save Array Multiplier or CSAM.
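The partial-product formation just described can be sketched as follows (an illustrative model: each row is the multiplicand ANDed with one multiplier bit and shifted to its weight):

```python
def partial_products(x, y, n):
    """One row per multiplier bit: y << i when bit x_i = 1, else 0."""
    return [(y << i) if (x >> i) & 1 else 0 for i in range(n)]

# Summing the rows reproduces the product, per the equation above.
rows = partial_products(0b101101, 0b110011, 6)
assert sum(rows) == 0b101101 * 0b110011
```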
In this type of multiplier the partial product array shown in Figure 10 is skewed into a square shape so that its implementation is more efficient for VLSI. The compression is performed using full adders (FAs / 3:2 counters) and half adders (HAs). Figure 11 shows this array for the multiplication of two 8-bit operands. MFA and MHA cells represent full adders and half adders with an additional AND gate input to generate the partial product. The highlighted arrow shows the critical path of the circuit, the longest carry propagation chain. In Figure 10, which has 6-bit operands instead of 8, this would correspond to the addition of the 6th column (partial products x0y5 to x5y0) plus the carry propagation through the last adder that generates the product, from p5 to p11. This long carry chain significantly limits the performance of the multiplier and becomes even more significant as the operand size increases. One of the most important works addressing this problem was proposed by Wallace [33]. Wallace suggested the use of full adders and half adders in a recursive fashion, adding three elements at a time in a carry-propagate-free way. In this manner, the partial product array can be successively reduced in stages to two numbers without carry propagation. When the resulting two numbers are obtained, a Carry Propagate Addition (CPA) takes place to obtain the final result [34].

Figure 11. Carry Save Array Multiplier, CSAM.

This is shown in the example in Figure 12. The diagram illustrates a 4-bit multiplication where the resulting partial products are shown at the top, analogous to the 6-bit multiplication of Figure 10. Each dot represents a partial product bit position (e.g., x0y5, x3y2, etc.). The second step shows the partial product array reorganized into a triangular shape, where an oval around the dots represents a full adder (3 inputs) or a half adder (2 inputs).
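Wallace's scheme can be modeled in a few lines (a behavioral sketch operating on whole addends rather than individual columns): groups of three rows are reduced by a 3:2 counter into a sum row and a shifted carry row until only two rows remain for the final CPA:

```python
def wallace_reduce(rows):
    """Repeatedly apply 3:2 counters until only two addends remain."""
    while len(rows) > 2:
        nxt = []
        leftover = len(rows) % 3
        for k in range(0, len(rows) - leftover, 3):
            a, b, c = rows[k:k + 3]
            nxt.append(a ^ b ^ c)                            # sum row
            nxt.append(((a & b) | (a & c) | (b & c)) << 1)   # carry row, next weight
        nxt.extend(rows[len(rows) - leftover:])              # rows left over this stage
        rows = nxt
    return rows

# Reducing the partial products of 13 x 11: two rows remain whose CPA gives 143.
rows = wallace_reduce([(11 << i) if (13 >> i) & 1 else 0 for i in range(4)])
assert len(rows) == 2 and rows[0] + rows[1] == 13 * 11
```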
The result of each addition produces two elements, a sum and a carry-out to the next significant bit position (the column to its left). The process is repeated until, at the final stage, only two bit-array numbers are left and a reduced-size carry propagate addition is required to produce the final result.

Figure 12. Wallace tree multiplier reduction for two 4-bit operands. Carry and sum bits for the half adder are shown.

Figure 13 illustrates how a Wallace tree for 6-bit operands is implemented using FAs and HAs. In this case the partial products corresponding to the 5th column are detailed. A partial product column array of 5 bits feeds a FA and a HA. The carry-out bits produced are the inputs for the 2nd-stage FA in the next column. The output sum bits are passed directly to the next-stage FA within the same column. In this manner a partial product reduction tree can be formed.

Figure 13. Implementation detail for a Wallace tree, 5th column (only 2 stages are shown).

As can be seen from the example in Figure 13, the resulting Wallace reduction tree is not regular and hence causes difficulties when the circuit layout is implemented. Nevertheless, 4:2 compressors (presented in Section 2.2.3) can be organized into efficient interconnection networks for reducing the partial product matrix in a manner that is regular and more suitable for implementation. However, careful attention has to be paid when organizing these compressor trees, because the carry terms within the 4:2 compressor have a weight that is one more than the sum terms, corresponding to the next significant bit (the column to the left).
This means that compressor trees must be built according to the following:

• The sum output of a column's compressor tree has the weight of the current column.
• The carry output of a column's compressor tree has the weight of the next column; equivalently, each column's tree receives its carry inputs from the previous column.

Therefore, although compressor trees are traditionally drawn as binary trees, they must be organized carefully so that the counter outputs are summed together properly. Figure 14 shows an example of an 8-bit compressor tree for three columns. It can be seen that the carry-in for each element comes from the previous column.

2.4 Decimal Multiplication

Decimal multiplication is considerably more involved than binary multiplication and has not been researched as heavily. There are, however, certain studies and ideas from the 1950s, when decimal arithmetic was researched significantly, that are worth mentioning and that have sometimes aided more modern developments; nevertheless, only very few modern papers exist on the topic.

Figure 14. Three-column 8-bit 4:2 binary compressor tree.

One of the main difficulties lies in the generation of the partial products. In the binary case this could be accomplished by a simple AND gate producing a single-bit result per digit. In the decimal case, however, the inputs are not single bits but decimal numbers usually coded in BCD, which implies two 4-bit inputs per digit multiplication. The result is decimal as well, and therefore a 4-bit output is produced. One possible way of implementing the multiplication algorithm is to follow the pencil-and-paper approach, as shown in [31]. In this method a multiplication table is known beforehand, and the result of multiplying each multiplicand digit by a multiplier digit can be determined by table lookup or combinational logic. A performance improvement might be obtained if the resulting number is considered separately, divided into a left digit (tens) and a right digit (units).
An addition accumulator can be used for each component and the final result computed at the end. Table 7 shows the decimal multiplication table used for each component, and Figure 15 shows an example of the algorithm.

Table 7. Decimal Multiplication Table, from [31].

Left-digit (tens) component:
×   0 1 2 3 4 5 6 7 8 9
0   0 0 0 0 0 0 0 0 0 0
1   0 0 0 0 0 0 0 0 0 0
2   0 0 0 0 0 1 1 1 1 1
3   0 0 0 0 1 1 1 2 2 2
4   0 0 0 1 1 2 2 2 3 3
5   0 0 1 1 2 2 3 3 4 4
6   0 0 1 1 2 3 3 4 4 5
7   0 0 1 2 2 3 4 4 5 6
8   0 0 1 2 3 4 4 5 6 7
9   0 0 1 2 3 4 5 6 7 8

Right-digit (units) component:
×   0 1 2 3 4 5 6 7 8 9
0   0 0 0 0 0 0 0 0 0 0
1   0 1 2 3 4 5 6 7 8 9
2   0 2 4 6 8 0 2 4 6 8
3   0 3 6 9 2 5 8 1 4 7
4   0 4 8 2 6 0 4 8 2 6
5   0 5 0 5 0 5 0 5 0 5
6   0 6 2 8 4 0 6 2 8 4
7   0 7 4 1 8 5 2 9 6 3
8   0 8 6 4 2 0 8 6 4 2
9   0 9 8 7 6 5 4 3 2 1

Nevertheless, with this approach a performance constraint is that the number of additions required to perform a multiplication is one greater than the number of digits in the multiplier, which is slow when compared to other methods. An alternative approach, also proposed in [31], attempts to generate the partial product digits by addition instead of a table lookup. This is accomplished by over-and-over addition, where the digits of the multiplier are examined one by one and the multiplicand is added an equivalent number of times to an accumulator. This approach, however, can be very time consuming, as the number of additions required is significant.

Figure 15. Decimal multiplication table Left and Right algorithm example (916 × 93: the left components accumulate to 82510, the right components to 2678, for a final product of 85188), taken from [31].

To reduce the number of additions and speed up the multiplication, a technique called Doubling and Quintupling can be used. With Doubling, the value of twice the multiplicand is calculated before the actual accumulation takes place. This allows a faster accumulation, as it reduces the number of additions necessary.
If the multiplier digit is 5, for example, the number of additions required is 3 (2M + 2M + M) instead of 5. The value of twice the multiplicand can be obtained by adding the number to itself with a decimal adder. An important speedup can be achieved, however, because the number is only added to itself, so some input combinations never occur, which significantly simplifies the output functions and does not require a full decimal adder. Quintupling can also be used; it consists of calculating five times the multiplicand, 5M, beforehand. This can be accomplished by noting that a multiplication by 5 can be performed as a multiplication by 10 (a decimal left shift) followed by a division by 2 (a binary right shift) with certain corrections. In this way the value 5M can be used during accumulation to further reduce the number of additions required for the multiplication. The ideas discussed above have mostly been implemented using a serial or sequential approach, as shown in [35], [36], [37] and [38]. Two proposals, however, are significant and are detailed below.

2.4.1 High frequency decimal multiplier

One of the recent research papers on decimal multiplication worth mentioning is that by Kenney, Schulte and Erle [13]. The proposed design presents an iterative multiplication that uses some of the ideas exposed above, namely doubling and quintupling. In this design a set of multiplicand multiples is computed using combinational logic. The values 2M, 4M and 5M are obtained and then divided into two sets of multiples: {0M, 1M, 4M, 5M} and {0M, 2M, 4M}. Depending on the value of the multiplier digit, a selector picks a multiple from each set such that their addition produces any value from 0M to 9M in a single operation. The design is further improved by allowing a two-stage pipeline, increasing its operating frequency, and by utilizing a new decimal representation for intermediate products, which speeds up the process.
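The two-set decomposition used in [13] can be checked with a short sketch (illustrative only; the selector in the actual design is combinational logic): every digit 0 to 9 splits into one multiple from {0, 1, 4, 5} plus one from {0, 2, 4}:

```python
def select_multiples(digit):
    """Pick a in {0,1,4,5} and b in {0,2,4} such that a + b == digit."""
    for a in (0, 1, 4, 5):
        for b in (0, 2, 4):
            if a + b == digit:
                return a, b
    raise ValueError("digit out of range")

# Every multiplier digit 0..9 is covered, so a*M + b*M yields digit*M
# in a single addition of two precomputed multiples.
for d in range(10):
    a, b = select_multiples(d)
    assert a + b == d
```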
The new representation, called overloaded decimal, permits the complete use of all 4 bits comprising the decimal digit, so the values A16 through F16 (hexadecimal) are allowed. In this way the correction back into decimal is avoided in each iteration's addition. The process continues until all digits in the multiplier operand are consumed. In the final product, each digit is corrected from overloaded decimal back into BCD by adding 6 when the digit lies in the range A16 to F16, which is easily accomplished with two-level logic. A carry of value 1 is also added to the next-order digit.

2.4.2 Multiplication with efficient partial product generation

A different iterative multiplication approach is proposed by Erle, Schwarz and Schulte in [14]. In this design the partial products are calculated in a digit-by-digit multiplier, creating a digit-by-word (multiplicand multiple) signed-digit partial product. The multiplier operand is examined digit by digit, from the least significant digit to the most significant, and as each partial product is obtained it is accumulated with previous results to form the product. The most significant characteristic of this design is the recoding of the multiplier, restricting the range of each digit by utilizing a redundant representation from −5 to 5. In this way the digit multiplier is simplified, since there are no longer two input numbers with ten possibilities each, but two inputs with magnitudes ranging from 0 to 5. The sign of the product is obtained simply by looking at the signs of the input digits. The multiples 0 and 1 correspond to trivial multiplication results, and therefore the range of input digits considered is effectively restricted to just the numbers 2 to 5. This significantly speeds up the process, since the possible input combinations are reduced from 100 to only 16; nevertheless, it complicates the final product calculation, since the result needs to be recoded back into BCD from a redundant representation.
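The range-restricted recoding can be illustrated with a sketch. This is one possible digit-serial scheme chosen for illustration; [14] does not necessarily use these exact rules. Each digit value above 5 becomes a negative digit with a transfer of 1 to the next digit position, preserving the overall value:

```python
def recode_signed_digits(digits_lsd_first):
    """Recode BCD digits into the range -5..5, plus a final transfer digit."""
    out, t = [], 0
    for d in digits_lsd_first:
        v = d + t                  # incoming transfer from the lower digit
        if v > 5:
            out.append(v - 10)     # digit in -5..0
            t = 1                  # transfer +10 to the next digit
        else:
            out.append(v)
            t = 0
    return out + [t]               # the weighted digit sum is unchanged

# 916 -> digits [6, 1, 9] recode to [-4, 2, -1, 1]: -4 + 20 - 100 + 1000 = 916
ds = recode_signed_digits([6, 1, 9])
assert sum(d * 10 ** i for i, d in enumerate(ds)) == 916
assert all(-5 <= d <= 5 for d in ds)
```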
Table 8 shows the multiplication table for the recoded operand digit values.

Table 8. Restricted range, signed magnitude products, from [14].

The problem with these last two proposals, [13] and [14], is that they have limited parallelization and hence are difficult to use in a pipelined system. In other words, the computation of the multiplication in both cases is highly sequential, since the partial products are added by accumulation one by one as they are obtained. This keeps the multiplier busy and unavailable for further computations in a pipeline until a result is produced. Only then can it accept a new operation, which is unacceptable in most of today's floating-point units. It is therefore desirable to research methodologies that allow multiplication to be as parallel as possible, as with, for example, the CSA multiplier and the Wallace tree in the binary case.

3. DECIMAL FLOATING-POINT COMPARATOR

As stated earlier, a comparator is an important element in general-purpose and application-specific architectures. The design of an efficient and high-speed decimal comparator aids the performance of these architectures. This work proposes a high-performance 64-bit decimal floating-point comparator, compliant with the current draft of the IEEE 754R standard for floating-point arithmetic. It is the first implementation of a decimal floating-point comparator compliant with the draft standard. The design can also be easily extended to support 128-bit decimal floating-point numbers, and even though it is not pipelined, it can accommodate pipelining to improve throughput.

3.1 Decimal floating-point comparison

Floating-point comparisons are specified by the IEEE 754 standard [1]. The proposed comparator accepts two 64-bit decimal floating-point numbers. In the same way as for binary comparisons, the relation between the two numbers is given by four mutually exclusive conditions: greater than, less than, equal and unordered.
The numbers are unordered when either one or both of the operands compared is Not-a-Number (NaN). If the NaN is a signaling NaN, then an Invalid exception flag bit is asserted. The result of the comparison is represented by the Floating-point Condition Codes (FCC) presented in Table 4. Again, bit FCC[1] is analogous to a greater-than flag (GT) and FCC[0] is analogous to a less-than flag (LT). The design, however, differs significantly from its binary counterpart, mainly because the current IEEE 754 revision specifies that decimal floating-point numbers are not normalized and, therefore, the representation is redundant in nature. This implies, for example, that the numbers 125 × 10^5 and 1250 × 10^4 are both representable and should be recognized as equal during comparison. Binary floating-point numbers, on the other hand, do not allow redundancy. Without redundancy, numbers can be compared as pure binary integers (since biased exponents are used) without the necessity of separating exponent and coefficient and performing alignment, as proposed in [26]. The core of the comparison lies in the magnitude comparator used for the coefficients. A usual scheme for comparing the two coefficients is to subtract them. Taking the operand signs into account, the sign of the result then determines whether the comparison is greater than or less than, with equality indicated by a zero subtraction result. This type of approach is advantageous in a system in the sense that the existing decimal floating-point adder/subtractor hardware can also be utilized for this purpose without an area increment. A dedicated decimal comparator, however, is only reasonable if it provides a significant speed improvement at the cost of a small area overhead when compared to a floating-point subtraction approach.
To its advantage, however, the comparator can benefit from the fact that the actual difference between the two numbers is not required and that the result of the subtraction does not need to be rounded and recoded back into the decimal floating-point format. Furthermore, an adder/subtractor requires a greater working digit precision (extra guard bits, for example) than what can be represented in the format, to account for rounding and normalization [29]. Before subtraction takes place, since the floating-point numbers to be compared may have different exponents, their coefficients need to be aligned. Once alignment is performed, and taking the signs of the operands into account, the sign of the subtraction result determines whether the comparison is greater than or less than, with equality indicated by a zero result. The example in Figure 16 illustrates the classic approach to comparison, where the number 3149 × 10^23 is compared to 90201 × 10^16.

Figure 16. Classic approach to floating-point comparison. A = 3149 × 10^23 and B = 90201 × 10^16 give an exponent difference of 7, so coefficient alignment is necessary. The number with the larger exponent is first shifted left to remove leading zeros (A = 31490000 × 10^19, reducing the difference to 3); the number with the smaller exponent is then shifted right by the remaining difference (B = 90.201 × 10^19), after which the exponent difference is 0 and the coefficient comparison can take place.
The proposed comparator, however, utilizes a scheme that avoids subtraction for the coefficient comparison and instead uses a faster approach. It also avoids the use of extra digits: only a working precision of 16 decimal digits, as in the standard, is used.

3.2 Comparator Design

An overview of the design of the decimal floating-point comparator is given in Figure 17. A decoding module, the IEEE 754 decoder, converts the decimal64 operands (A and B) into a format that can be utilized for comparison. The combination field is processed, and the IEEE 754 decoder outputs the number's sign, the exponent in unsigned 10-bit binary form, and the coefficient as a 64-bit BCD-encoded number, and indicates whether the number is infinite, a quiet NaN or a signaling NaN.

Since alignment is needed, the value of the difference between the operands' exponents is necessary and subtraction is therefore required. The 10-bit unsigned binary exponents are compared in the Exponent Comparison module. This module contains a 10-bit carry-lookahead adder for fast operation and performs subtraction to determine which exponent is smaller (or whether they are equal) and the amount of shifting necessary for the coefficient alignment. This amount is passed to the Coefficient Alignment module. Alignment is performed by left shifting the coefficient of the number with the greater exponent, thus reducing its exponent magnitude. Since the representation allows for 16 digits, shifting is limited from 0 (no shift) to 15 digits. Larger alignment requirements are avoided by treating them as special cases. Consequently, this helps keep the coefficient digit size (i.e., the working precision) restricted to 16 digits, allowing the coefficient magnitude comparison module to yield a result faster, given that its delay and complexity grow logarithmically (base-4 logarithm). These special cases are treated in the following subsections.
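The alignment and special-case rules just described can be illustrated with a minimal behavioral sketch. This is not the dissertation's hardware; the function name and representation (coefficients as Python integers, a left shift by k digits modeled as multiplication by 10^k) are illustrative assumptions only.

```python
# Behavioral sketch of 16-digit coefficient alignment for the comparator.
# A left shift by k decimal digits is modeled as multiplication by 10**k.
PRECISION = 16

def align(coeff_a, exp_a, coeff_b, exp_b):
    """Return aligned coefficients plus the (off-range, overflow) flags."""
    diff = exp_a - exp_b
    shift_offrange = abs(diff) > PRECISION - 1   # cannot align within 16 digits
    if shift_offrange:
        return None, None, True, False
    # Left shift the coefficient of the number with the greater exponent.
    if diff > 0:
        coeff_a *= 10 ** diff
    else:
        coeff_b *= 10 ** (-diff)
    # A digit pushed beyond the 16-digit window is an alignment overflow.
    shift_overflow = coeff_a >= 10 ** PRECISION or coeff_b >= 10 ** PRECISION
    return coeff_a, coeff_b, False, shift_overflow
```

On the Figure 16 operands (3149 × 10^23 versus 90201 × 10^16) the exponent difference of 7 is within range, the first coefficient is shifted left by 7 digits, and the aligned magnitudes can then be compared directly.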
Figure 17. Decimal floating-point comparator design.

Figure 18. Coefficient aligner, logarithmic barrel shifter (stages of 1-, 2-, 4- and 8-BCD hardwired left shifts, i.e., 4-, 8-, 16- and 32-bit shifts, controlled by a 4-bit encoded shift amount, with overflow detection).

3.3 Coefficient magnitude comparison

Once the 64-bit/16-digit coefficients are correctly aligned, their magnitude can be compared. The magnitude comparator designed is based on the comparator proposed in [26] and presented in Section 2.1, with significant performance modifications tailored specifically for BCD number comparison. The two operands A and B are compared in stages. The first stage compares corresponding 4-bit BCD digits from each operand. Sixteen of these elements are used in parallel to process the complete 16-digit coefficients. Two output bits (GT and LT) from each of these elements indicate the result as greater than, less than or equal. Table 9 shows the magnitude comparison, where GT[i]j and LT[i]j represent the greater-than and less-than flags, respectively, as in Section 2.1, and A[i] and B[i] represent digit i of the operand coefficients. The subscript j indicates the stage of the comparison.

Table 9. BCD Magnitude Comparison

  GT[i]  LT[i]  Result
  0      0      A[i] = B[i]
  0      1      A[i] < B[i]
  1      0      A[i] > B[i]
  1      1      invalid

The first stage's elements (one per digit pair compared) have 8 input bits (two BCD digits) and 2 single-bit outputs each, producing a truth table of 256 possibilities per output.
Since the numbers compared are BCD encoded, and not binary, the truth table is simplified by ignoring all entries where the 4-bit numbers are greater than 9. This reduces the cases to be considered from 256 to 100 (the rest are don't cares) and significantly reduces the minimized sum-of-products expressions for LT[i]1 and GT[i]1.

Subsequent stages compare the results of the previous stage in the same manner, forming a logarithmic tree: sets of four GT[i]j signals are compared to the corresponding LT[i]j sets, with GT[i]j replacing the A[i] signals and LT[i]j replacing the B[i] signals. These elements are further optimized given that GT[i]j and LT[i]j are never both asserted at the same time, as shown in Table 9. The truth table cases are further reduced from 100 to 82, producing fast, simplified minimized sum-of-products expressions for LT[i]j+1 and GT[i]j+1, considering a 4-bit two-number comparison:

LT[i]j+1 = LT[i+3]j + (GT[i+3]j' · LT[i+2]j) + (GT[i+3]j' · GT[i+2]j' · LT[i+1]j) + (GT[i+3]j' · GT[i+2]j' · GT[i+1]j' · LT[i]j)

GT[i]j+1 = GT[i+3]j + (GT[i+2]j · LT[i+3]j') + (GT[i+1]j · LT[i+3]j' · LT[i+2]j') + (GT[i]j · LT[i+3]j' · LT[i+2]j' · LT[i+1]j')

The number of comparator stages is given by k = log4(4n), where n is the coefficient digit size. With n = 16 digits, a total of log4(4 × 16) = 3 stages is needed to yield the resulting LT and GT for the magnitude comparison of the coefficients.

Figure 19 illustrates an example using this logarithmic tree magnitude comparator. The operand size is reduced to 4 digits instead of 16 for clarity. The comparison computed is A to B, where A = 2713 and B = 2195 (decimal). Each number is separated into digits and each corresponding digit pair is compared individually. Subsequent stages group the LT and GT signals together as shown. The final stage yields GT = 1 and LT = 0, as expected, giving A > B.
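The tree evaluation just described can be modeled behaviorally. The sketch below is not the synthesized logic; it folds four (GT, LT) pairs at a time using the sum-of-products form of the stage equations, with digits held least-significant first so that index 3 of each group is the most significant.

```python
# Behavioral model of the logarithmic GT/LT comparison tree.
def digit_stage(a_digits, b_digits):
    """Stage 1: compare corresponding BCD digits, producing (GT, LT) pairs."""
    return [(int(a > b), int(a < b)) for a, b in zip(a_digits, b_digits)]

def fold4(pairs):
    """Combine four (GT, LT) pairs; index 3 is the most significant."""
    (g0, l0), (g1, l1), (g2, l2), (g3, l3) = pairs
    gt = g3 or (g2 and not l3) or (g1 and not l3 and not l2) \
         or (g0 and not l3 and not l2 and not l1)
    lt = l3 or (not g3 and l2) or (not g3 and not g2 and l1) \
         or (not g3 and not g2 and not g1 and l0)
    return int(gt), int(lt)

def compare(a_digits, b_digits):
    """Digits given least significant first; returns the final (GT, LT)."""
    level = digit_stage(a_digits, b_digits)
    while len(level) > 1:
        level = [fold4(level[i:i + 4]) for i in range(0, len(level), 4)]
    return level[0]
```

Running this on the 4-digit example (A = 2713, B = 2195) reproduces the Figure 19 result GT = 1, LT = 0.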
Figure 19. BCD magnitude comparator (4-digit example: A = 2713, B = 2195; the first stage compares digit pairs and the following stage combines the digit results, yielding A > B).

3.4 Special case scenarios

The result of the comparison of the two operands cannot always be obtained by a magnitude comparison of the aligned coefficients. It is possible for either of the operands to be a NaN (in which case the result should be unordered), positive/negative infinity, or positive/negative zero, all of which are representable. These possibilities are treated as special cases and can be determined early, in the decoding phase.

Furthermore, the coefficients of the operands cannot always be correctly aligned within the 16-digit working precision of the magnitude compare module. Two such possibilities are also considered special cases: the first occurs when the absolute value of the exponent difference between the operands is greater than 15, and the second when the alignment of the coefficient produces a digit shifted out (overflow). These are treated along with the cases for NaNs, infinities and zeros and their respective signaling flags.

Each of the possible comparison scenarios sets its own output signals LT and GT, which affect the FCC comparison result. Five mutually exclusive enable flag bits are used to indicate when each scenario occurs. The special case handling module, at the bottom of Figure 17, is responsible for determining the result according to the given situation.

3.4.1 One or both numbers is infinite

If one or both numbers are infinite, the comparison result can be obtained right away by examining the signs of the operands. If the operands are A and B, then: A is less than B if A is negative infinity and B is positive infinity, OR if A is negative infinity and B is not infinity, OR if B is positive infinity and A is not infinity.
LT_inf = (inf_A · sign_A · sign_B') + (inf_A · sign_A · inf_B') + (inf_A' · inf_B · sign_B')

A is greater than B if A is positive infinity and B is negative, OR if A is positive infinity and B is not infinity, OR if A is not infinity and B is negative infinity.

GT_inf = (inf_A · sign_A' · sign_B) + (inf_A · sign_A' · inf_B') + (inf_A' · inf_B · sign_B)

Note that when both numbers are positive infinity or both are negative infinity, the equations yield GT_inf = LT_inf = 0, signaling an equal comparison, as the standard specifies for equal infinities. The signaling flag that indicates this scenario is given by:

infinite_flag = infinity_A + infinity_B

where infinity_A/B is 0 if the number is finite and 1 if it is infinite.

3.4.2 Both operands are zero

In the current IEEE 754 draft, the value zero in decimal is indicated by a zero-valued coefficient, as opposed to a zero exponent in binary floating-point. If both operands are zero, the numbers are always equal regardless of their signs, since +0 and -0 are equivalent as specified by the standard. The bits LT and GT remain unmodified (both zero, indicating equality), guarded by the enable flag bit zero_flag, which prevents all other scenario modules from affecting the result (except for infinite numbers). This flag is given by:

zero_flag = A_zero · B_zero · infinite_flag'

where A/B_zero is 0 if the number is nonzero and 1 if it is zero.

3.4.3 Exponent difference off-range

If the absolute value of the exponent difference between the operands is greater than 15 (indicated by the signal shift_offrange in Figure 17), the coefficients cannot be aligned, since the working precision allows 16 digits. This means that one of the numbers is evidently greater in magnitude than the other (e.g., 512 × 10^40 and 123 × 10^3). The comparison result can be obtained by knowing which number's exponent is greater and by examining the signs of the numbers. The signal greater_exp indicates which exponent is greater.
A is less than B if exponent B is greater than exponent A (greater_exp = 1) and both numbers are positive, OR if A is negative and B is positive, OR if both numbers are negative and exponent A is greater (greater_exp = 0). The numbers are never equal in this case.

LT_offrange = (greater_exp · sign_A' · sign_B') + (sign_A · sign_B') + (sign_A · sign_B · greater_exp')

A is greater than B if both numbers are positive and exponent A is greater than exponent B (greater_exp = 0), OR if A is positive and B is negative, OR if both numbers are negative and exponent B is greater (greater_exp = 1).

GT_offrange = (greater_exp' · sign_A' · sign_B') + (sign_A' · sign_B) + (sign_A · sign_B · greater_exp)

The signaling flag for this case is:

exp_flag = shift_offrange · zero_flag' · infinite_flag'

where shift_offrange is determined by calculating the exponent difference (> 15) in the exponent compare module.

3.4.4 Alignment shift-out, overflow

If the exponent difference is within range, alignment of the coefficient takes place by left shifting. If a digit is shifted out (shift_overflow), the comparison result can be determined from the signs of the numbers. The greater exponent determines which of the two numbers is the one being aligned with respect to the other: when greater_exp = 0, exponent A is the one aligned, and vice versa. If alignment overflow occurs, A is less than B if: A is negative and B is positive, OR if A's alignment overflows (the magnitude of A is greater) and A is negative, OR if B's alignment overflows (the magnitude of B is greater) and B is positive.

LT_of = (sign_A · sign_B') + (greater_exp' · sign_A) + (greater_exp · sign_B')

A is greater than B if: A is positive and B is negative, OR if A's alignment overflows and A is positive, OR if B's alignment overflows and B is negative.
GT_of = (sign_A' · sign_B) + (greater_exp' · sign_A') + (greater_exp · sign_B)

The signaling flag for this scenario is:

align_flag = shift_overflow · exp_flag' · zero_flag' · infinite_flag'

where shift_overflow is asserted during alignment if a digit is shifted out.

3.4.5 Coefficient comparison

If the exponent difference is within range and no shift overflow occurs after alignment, the coefficients are correctly aligned and their comparison can be executed by the dedicated coefficient magnitude comparator module discussed in Section 3.3. The output signals from this module (LT and GT) are renamed LT_mag and GT_mag to avoid confusion. Nevertheless, the final result of the comparison is not yet determined, since the relationships of greater than, less than or equal depend not only on which number's magnitude is greater but also on the signs. The result of the comparison in this scenario is then given by the following conditions.

A is less than B if: A is negative and B is positive, OR A and B are positive and the magnitude of A is less than the magnitude of B, OR A and B are negative and the magnitude of A is greater than the magnitude of B.

LT_cmp = (sign_A · sign_B') + (sign_A' · sign_B' · LT_mag) + (sign_A · sign_B · GT_mag)

A is greater than B if: A is positive and B is negative, OR if A and B are both positive and the magnitude of A is greater than that of B, OR if both numbers are negative and the magnitude of A is less than that of B.

GT_cmp = (sign_A' · sign_B) + (sign_A' · sign_B' · GT_mag) + (sign_A · sign_B · LT_mag)

The flag mag_flag determines whether the outputs GT_cmp and LT_cmp should affect the final result and is given by:

mag_flag = align_flag' · exp_flag' · zero_flag' · infinite_flag'

In other words, the magnitude comparison of the coefficients is only valid (through mag_flag) if none of the previous enable flags was triggered.
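The sign-aware equations LT_cmp and GT_cmp can be checked behaviorally against an ordinary signed comparison. The sketch below is a verification model, not the hardware; signal names mirror the equations (sign = 1 meaning negative).

```python
# Behavioral check of the LT_cmp / GT_cmp sign-handling equations.
def signed_compare(sign_a, sign_b, gt_mag, lt_mag):
    """gt_mag/lt_mag: coefficient magnitude result; returns (GT, LT)."""
    lt = (sign_a and not sign_b) or (not sign_a and not sign_b and lt_mag) \
         or (sign_a and sign_b and gt_mag)
    gt = (not sign_a and sign_b) or (not sign_a and not sign_b and gt_mag) \
         or (sign_a and sign_b and lt_mag)
    return int(gt), int(lt)

# Exhaustive check over sign combinations and sample nonzero magnitudes.
for sign_a in (0, 1):
    for sign_b in (0, 1):
        for mag_a, mag_b in ((3, 5), (5, 3), (4, 4)):
            a = -mag_a if sign_a else mag_a
            b = -mag_b if sign_b else mag_b
            got = signed_compare(sign_a, sign_b,
                                 int(mag_a > mag_b), int(mag_a < mag_b))
            assert got == (int(a > b), int(a < b))
```

The exhaustive loop confirms that, for nonzero operands, the equations agree with a direct signed comparison in every sign/magnitude combination.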
The final equations for the comparator, considering all possible scenarios, are:

GT = unordered + (GT_inf · infinite_flag) + (GT_offrange · exp_flag) + (GT_of · align_flag) + (GT_cmp · mag_flag)

LT = unordered + (LT_inf · infinite_flag) + (LT_offrange · exp_flag) + (LT_of · align_flag) + (LT_cmp · mag_flag)

The signal unordered is asserted when either of the operands is a NaN. If this occurs, the result is overridden and is always unordered (GT = 1, LT = 1), as specified in Table 4. The signals GT and LT are produced in the special case handling module, which receives the flags indicating the different scenarios and is responsible for handling the comparison cases described in this section.

3.5 Combined binary floating-point, two's complement and decimal floating-point comparator

Given the similarity between the decimal comparator described here and its binary counterpart presented in Section 2.1, a single design capable of handling 32-bit and 64-bit two's complement numbers, single- and double-precision binary floating-point, and 64-bit decimal floating-point numbers is attractive: it would be especially useful for reducing cost in processors by allowing the same hardware to compare all three data types. The main difference between the binary and decimal comparator schemes is that the decimal floating-point representation is redundant, as stated before, and therefore requires alignment, while the binary representation does not. Binary 32-bit floating-point numbers only require a sign extension, so that the 32-bit and 64-bit formats can both be handled by the same hardware. The logic that can be shared between the two cases is the core of the comparators: the magnitude comparison module, which in the decimal case comes into effect after alignment. The magnitude comparator logarithmic tree for the decimal case is composed of comparator elements that handle 4-bit BCD digits.
Each element has two digit inputs (two 4-bit BCD digits), as opposed to handling 2-bit pairs as in the binary case (Section 2.1). Optimization of the BCD element was possible since BCD encodes no values above 9, so the input combinations A_hex through F_hex can be ignored, providing further simplification. An additional benefit is that fewer stages are necessary, since 4-bit digits are compared instead of 2-bit pairs. Nevertheless, the binary pair comparator element could be used for the decimal case instead of the BCD element, saving hardware since it would be shared by both formats. Tests and simulations were run, however, to assess the impact of using the bit-pair comparison module from the previous section for the decimal comparator, and the results justified the joint design of the proposed module. Furthermore, less area would be required when implemented in a system, since the decoding of the IEEE 754 decimal floating-point format can be handled by a decimal floating-point unit potentially already present in the system.

An overview of this combined design is shown in Figure 20. The 3-bit signal Sel determines the format type of the operands: 32-bit or 64-bit two's complement, binary floating-point or decimal floating-point. If the operands are binary, the sign extension module sign extends 32-bit numbers to 64 bits so that the same hardware can be utilized. In the decimal case, the input to the magnitude comparator is obtained after decoding and coefficient alignment. The exception and special case module handles the result of the magnitude comparator, taking into account the signs of the numbers, the data type and, for the decimal case, the flags for overflow, off-range and the other conditions presented in Section 3.4.
Figure 20. Combined two's complement, binary and decimal floating-point comparator.

4. EXPERIMENTS FOR DECIMAL FLOATING-POINT DIVISION BY RECURRENCE

One method widely used for division is performed by recurrence, or sequentially. In this method, the quotient is represented in a chosen radix and one digit is produced after each iteration. The quotient digit can also be selected from a redundant digit set, an approach that has noteworthy speed and cost advantages. The main difficulty in a digit recurrence algorithm lies in the quotient digit selection (QDS) function. Several studies have sought to simplify or improve this function. The Kornerup study presented in [40] gives an accepted analytical approach to determine the minimum number of digits required for the QDS function. This theory, however, is specific to the binary case and hence requires modification to be applied to the decimal case. This study attempts to provide insight into the implementation feasibility of a decimal digit recurrence divider utilizing recurrence division theory.

4.1 Decimal Division by Digit Recurrence Theory

As discussed previously, when implementing division by recurrence, the quotient digit of radix r lies within a symmetric redundant selection set of consecutive integers given by:

q_j ∈ D = {-a, ..., -1, 0, 1, ..., a}, with a ≥ ⌈r/2⌉,   (1)

such that ā = -a. The redundancy factor, or measure of redundancy, for a digit set is defined by:

ρ = a / (r - 1), with 1/2 < ρ ≤ 1.   (2)
The main recurrence when implementing division for a dividend x and divisor d is given by [41]:

w[j+1] = r·w[j] - d·q_j+1,   (3)

where r denotes the quotient radix, q_j+1 the selected quotient digit and w[j] the partial remainder in iteration j. Naturally, in our case the radix is decimal, r = 10. In order for the recurrence in (3) to be valid through all iterations and guarantee a result, two basic conditions for the QDS must be met: containment and continuity. The value of the quotient digit q_j+1 is given by the selection function:

q_j+1 = SEL(r·w[j], d).   (4)

The containment condition specifies that the quotient digit selected must keep the next partial remainder bounded, to satisfy convergence of the algorithm:

-ρ·d ≤ w[j] ≤ ρ·d.   (5)

This is summarized in Robertson's diagram, shown in Figure 21, where the limits on the vertical axis for w[j+1] are marked by the horizontal dotted lines. The selection interval of r·w[j] for which it is possible to select q_j+1 = k and keep the next residual bounded is defined as [L_k(d), U_k(d)]. Each interval, as shown in Figure 21, corresponds to a given quotient digit (e.g., q_j+1 = k-1 produces the interval [L_k-1(d), U_k-1(d)]). Expressions for L_k(d) and U_k(d) can be obtained from Robertson's diagram and are defined by [41][29]:

L_k(d) = (k - ρ)·d and U_k(d) = (k + ρ)·d.   (5)

Figure 21. Robertson's diagram showing the selection intervals for q = k-1 and q = k.

The continuity condition ensures that for any possible r·w[j] (the horizontal axis in Robertson's diagram) there exists a selectable quotient digit k (i.e., r·w[j] always lies within some selection interval [L_k, U_k]); otherwise a quotient digit would not be selectable. Therefore, the overlap between two consecutive intervals must exist or, at minimum, be zero: L_k ≤ U_k-1.
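The containment and continuity conditions can be checked numerically. The short sketch below (illustrative only, not from the dissertation) evaluates the interval bounds L_k and U_k for the decimal, maximally redundant case and confirms the overlap L_k ≤ U_k-1 for every digit.

```python
# Numeric illustration of the continuity condition for r = 10.
# With a maximally redundant digit set (a = 9), rho = a/(r-1) = 1, and the
# selection intervals [L_k, U_k] = [(k - rho)d, (k + rho)d] overlap.
r, a = 10, 9
rho = a / (r - 1)            # redundancy factor (2); here 1.0

def L(k, d):
    return (k - rho) * d     # lower bound of the interval for digit k

def U(k, d):
    return (k + rho) * d     # upper bound of the interval for digit k

d = 0.37                     # any normalized divisor, 1/10 <= d < 1
for k in range(-a + 1, a + 1):
    assert L(k, d) <= U(k - 1, d)   # continuity: consecutive intervals overlap
```

With rho = 1 the overlap is one full interval width (L_k = (k-1)d while U_k-1 = k·d), which is what later permits a heavily truncated QDS function.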
This condition is imposed by:

L_k ≤ S_k ≤ U_k-1,   (6)

where S_k denotes the partition points within the selection interval such that the QDS returning q_j+1 may be defined by [40]:

S_k ≤ r·w[j] < S_k+1 ⇒ q_j+1 = k.   (7)

4.2 Quotient Digit Selection

The overlap is critically important, since it allows an inexact value of the divisor and the partial remainder to determine a suitable quotient digit. In this sense, only a limited number of leading digits of both divisor and partial remainder are required. With this consideration, the truncated partial remainder r·ŵ[j] is defined by:

r·w[j] = r·ŵ[j] + ε_rw,   (8)

where ε_rw denotes the truncation error. Carry-save representations are often utilized in division by recurrence algorithms when computing (3), because carry-propagate adders would lengthen the critical path excessively. The truncation error, using a carry-save representation, is bounded by [40]:

0 ≤ ε_rw < 2·ulp(r·ŵ[j]).   (9)

In a similar way, the truncated divisor is given by:

d = d̂ + ε_d, with 0 ≤ ε_d < ulp(d̂).   (10)

Figure 22. Truncated inputs to the QDS function (partial remainder r·w[j] truncated to t fractional digits; divisor d truncated to u fractional digits).

Figure 22 illustrates the truncation of both divisor and partial remainder. The number of digits to the right of the radix point is given by t and u, respectively. Since both divisor and partial remainder utilize decimal (radix-10) representations, it follows that:

ulp(r·ŵ[j]) = 10^-t,   (11)

ulp(d̂) = 10^-u,   (12)

where ulp indicates a unit in the last place (least significant digit). Therefore, the truncated divisor can be represented as an integer multiple m of ulp(d̂):

d̂ = m × 10^-u.   (13)

Division by digit recurrence theory is usually developed for fractional, normalized divisors [41]. Figure 23 illustrates the reasoning behind this: a non-normalized divisor would require infinite precision as d approaches zero.
Hence, the values of d are normalized and bounded by:

1/10 ≤ d < 1,   (14)

which implies, due to (13), that:

10^(u-1) ≤ m < 10^u.   (15)

Following the analysis approach in [40], Figure 23 shows the partition of the interval [L_k, U_k-1] by S_k, defined in (6), as a function of d̂ rather than d. Below S_k the selected quotient digit is q = k-1, and above S_k it is q = k. S_k(d̂) is now a staircase function, due to the truncation of both divisor and partial remainder, and indicates rectangles where a given quotient digit can be selected due to quantization. The value of S_k(d̂) can also be expressed as an integer multiple s_k,m of ulp(r·ŵ[j]), with the constants given by k and m:

S_k(d̂) = s_k,m × 10^-t.   (16)

Figure 23. P-D diagram, S_k selection points as a function of the truncated divisor d̂, [40].

The dotted rectangle in Figure 23 has its lower left-hand corner at (d̂, S_k(d̂)) and is limited by:

d̂ ≤ d < d̂ + ulp(d̂)   (17)

and

S_k(d̂) ≤ r·ŵ[j] < S_k(d̂) + 2·ulp(r·ŵ[j]).   (18)

The study by Kornerup defines boundary conditions on this rectangle to determine the minimum number of digits after truncation for the partial remainder and the divisor, t and u respectively. The rectangle should lie above S_k and, therefore, the boundary condition on its right-hand corner yields:

L_k(d̂ + ulp(d̂)) = (k - ρ)·(d̂ + ulp(d̂)) ≤ S_k(d̂).   (19)

It follows, using (5), (12) and (16), that:

(k - ρ)·(m·10^-u + 10^-u) ≤ s_k,m × 10^-t,   (20)

s_k,m = ⌈10^(t-u)·(k - ρ)·(m + 1)⌉.   (21)

The height of the rectangle is 2·ulp(r·ŵ[j]). Nevertheless, consecutive rectangles aligned vertically are spaced by one ulp(r·ŵ[j]) (the resolution of S_k). Rectangles overlapping from below should have a value of k - 1 and, therefore, the midpoint of the left edge should lie under U_k-1. This boundary condition, combined with (5), yields:

S_k(d̂) + ulp(r·ŵ[j]) ≤ U_k-1(d̂) = (k - 1 + ρ)·d̂.   (22)
Again, using (11) and (16) in this inequality gives:

s_k,m ≤ ⌊10^(t-u)·(k - 1 + ρ)·m - 1⌋.   (23)

Combining (21) and (23) yields floor and ceiling expressions bounding the possible values of s_k,m:

⌈10^(t-u)·(k - ρ)·(m + 1)⌉ ≤ ⌊10^(t-u)·(k - 1 + ρ)·m - 1⌋.   (24)

Rearranging terms results in an expression of the form:

⌈A·m + B⌉ ≤ ⌊(A + C)·m⌋,   (25)

with A = 10^(t-u)·(k - ρ), B = 10^(t-u)·(k - ρ) + 1 and C = 10^(t-u)·(2ρ - 1). Excluding the trivial solution, where the quotient selected is zero, and given condition (2), it is seen that A ≥ 0, B ≥ 1 and C > 0 for k ≥ 1. For condition (25) to hold, it is necessary that C·m ≥ B. Nevertheless, the stronger condition C·m - B ≥ 1 guarantees at least one integer between the floor and ceiling functions, yielding:

10^(t-u)·(2ρ - 1)·m - 10^(t-u)·(k - ρ) - 1 ≥ 1.   (26)

This condition should hold at the extreme case of the values of m and k in order for (25) to be valid. This occurs at the maximum value of the quotient digit, k = a, and, due to (15), at the minimum case m = 10^(u-1). Rearranging terms gives:

10^-t ≤ (10^-1·(2ρ - 1) - 10^-u·(a - ρ)) / 2,   (27)

which produces a minimal value of t for a known u. Furthermore, the numerator in (27) must clearly be positive, enabling a minimum u to be obtained:

10^u > 10·(a - ρ) / (2ρ - 1).   (28)

4.3 Considerations for the IEEE 754 Decimal Case

The application of the previous analysis, with radix r = 10, to the IEEE 754 revision draft for decimal floating-point requires some considerations. Most significantly, the revision draft specifies that decimal floating-point numbers are not normalized and, therefore, the representation is redundant in nature. This implies, for example, that the numbers 125 × 10^5 and 1250 × 10^4 are both representable and are equal in magnitude. The standard also specifies that the mantissa is represented as an integer, not as a fraction with a leading binary 1 as in the binary case (i.e., 0.1xxxx...₂).
This complicates the application of the algorithm, since in the binary case both the divisor and dividend operands are limited in range to 0.1₂ ≤ x < 1₂. Unsigned integer division has operands 0 ≤ x < r^n - 1 and 0 ≤ d < r^n - 1. The division produces a quotient q such that [41][29]:

q = ⌊x / d⌋.   (29)

As mentioned previously, basic integer division algorithms require full precision for the QDS function. To apply fractional division theory, the divisor d is first normalized, by shifting, so that its most significant digit is nonzero. With a shift of p digits, the normalized divisor d* is:

d* = 10^p · d.   (30)

Consequently, the resulting quotient is:

q = ⌊x / d⌋ = 10^p · ⌊x / d*⌋.   (31)

To use the algorithm, fractional operands are formed as defined by:

x_f = x × r^-n,   (32)

d_f = d* × r^-n.   (33)

Expressions (27) and (28) can be used to obtain minimum bounds on the number of decimal digits of the partial remainder and divisor, t and u respectively. There is a choice, however, in the value of a, i.e., the amount of redundancy, as shown in (1). In the decimal case a can vary from 6 to 9. As the redundancy is increased (a increases), the overlap described in (6) grows, simplifying the QDS function and consequently allowing a smaller lookup table. On the other hand, a greater value of a complicates the generation of the quotient digit multiples (q_j+1·d) needed for the iterative algorithm (3). For example, with a = 9 the possible divisor multiples required are (-9d, -8d, ..., -1d, 0, 1d, ..., 8d, 9d). Conversely, as a is decreased and the set of quotient multiples shrinks, additional digit inputs to the QDS function are needed, as more precision is required. Since each digit is a decimal digit, the size of a lookup table for the QDS would increase by an order of magnitude with each additional digit required.
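The bounds (27) and (28) are easy to evaluate numerically. The sketch below (an illustrative calculation, not part of the dissertation's hardware) finds the smallest u and then the smallest t satisfying both conditions for a given redundancy a in the decimal case.

```python
# Evaluate conditions (28) and (27) for radix 10 to obtain the minimum
# number of truncated divisor digits u and partial-remainder digits t.
def min_u_t(a, r=10):
    rho = a / (r - 1)                       # redundancy factor (2)
    u = 1
    while 10 ** u <= 10 * (a - rho) / (2 * rho - 1):      # condition (28)
        u += 1
    t = 1
    while 10 ** (-t) > (10 ** -1 * (2 * rho - 1)
                        - 10 ** -u * (a - rho)) / 2:      # condition (27)
        t += 1
    return u, t
```

For the maximally redundant set, min_u_t(9) gives u = 2 and t = 2, consistent with the values derived in the following paragraph; smaller a (less redundancy) demands more truncated digits.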
Therefore, the smallest lookup table size is achieved with a = 9 and hence a maximally redundant digit set with ρ = 1, from (2). In this case, (28) and (27) yield u ≥ 2 and t ≥ 2, implying that only 2 fractional decimal digits are required for the divisor and the partial remainder. The shifted partial remainder, however, while within the allowed bound (5), can still be greater than 1, in which case integer digits are needed. Since the divisor is normalized (30), its range is limited to 1/10 ≤ d < 1; this is also shown in Figure 23 by the vertical line at 1/r. The possible range of the shifted partial remainder is then given by:

r·w[j] = 10·w[j] ≤ 10·ρ·d < 10,   (34)

due to the containment condition (5). This implies that at most a single integer digit is required. The total number of digit inputs to the QDS function is thus 5: two fractional digits for the divisor (.xx) and two fractional digits plus one integer digit for the partial remainder (x.xx). A table-based QDS function will then have 5 decimal digit inputs and one decimal digit output. Considering a 5-bit encoding for the signed decimal quotient digit output, the total number of QDS table entries would be 10^5 × 5 bits = 100,000 × 5 bits.

The division by recurrence algorithm requires a subtraction and a multiplication of the truncated divisor by the quotient digit, which can be positive or negative. Since the numbers treated are decimal, this significantly complicates the arithmetic operations involved. Furthermore, a significant complication of using a maximally redundant quotient digit set is the generation of the extra divisor multiples (-9d, ..., 0, ..., 9d), as discussed previously. The proposed scheme utilizes ideas from the decimal multiplication method presented in [31]. The divisor multiple (q_j+1·d) generation starts by computing a priori the multiples 0, d, 2d, 4d and 5d, which can be added in combination to create all possible divisor multiples from 1d to 9d.
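The multiple-generation idea can be made concrete with a small sketch. The decomposition table below is one possible pairing (illustrative; the source only states that 0, d, 2d, 4d and 5d suffice in combination).

```python
# Sketch: form every divisor multiple 1d..9d as a sum of at most two of the
# precomputed multiples {0, d, 2d, 4d, 5d}.  DECOMP is one valid pairing.
DECOMP = {1: (1, 0), 2: (2, 0), 3: (1, 2), 4: (4, 0), 5: (5, 0),
          6: (2, 4), 7: (2, 5), 8: (4, 4), 9: (4, 5)}

def multiple(d, q):
    """Return q*d (1 <= q <= 9) using only the precomputed multiples."""
    pre = {0: 0, 1: d, 2: 2 * d, 4: 4 * d, 5: 5 * d}
    m1, m2 = DECOMP[q]
    return pre[m1] + pre[m2]
```

Because 2d, 4d and 5d are each obtainable from d by simple decimal doubling/halving-style operations, only one extra addition is needed per selected quotient digit.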
Figure 24 shows an overall scheme of the decimal division by recurrence design.

Figure 24. Decimal division by digit recurrence implementation (IEEE 754 decoder, normalization and initial w[0], truncated QDS table, q·d product generation, excess-3 carry-save subtractor producing the sum and carry vectors wS and wC, followed by quotient rounding, scaling and IEEE 754 recoding).

5. DECIMAL PARTIAL PRODUCT REDUCTION

A few of the approaches to decimal multiplication were described in Section 2.4. The methods discussed included serial or iterative accumulate addition to obtain the multiplication result. Parallel multiplication, however, is used extensively in binary floating-point hardware ([42], [43]) and is important if that performance is to be extended to the decimal case.

Figure 25. Multiplication algorithm: 798 × 347. The multiplicand multiples (partial products) 5586, 3192 and 2394 are reduced and a final carry propagate addition yields the product 276906.

Parallel multiplication can be divided into three main steps, as illustrated in Figure 25. The first step is partial product generation, where the multiplicand multiples are obtained. Next, partial product reduction uses a fast addition scheme to reduce the partial products to two. Finally, a carry propagate addition is necessary to obtain the final result. The overall performance of the multiplier, therefore, is closely tied to the individual performance of these stages. However, improving partial product reduction, for example, often increases the complexity of partial product generation. This is the reasoning behind binary methods like Booth encoding, where the number of partial products to be added is reduced at the expense of more complex multiplicand multiple generation through recoding [44][45].
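The three steps can be traced on the Figure 25 example with a toy sketch (purely illustrative; real hardware keeps the partial products in redundant form until the final addition).

```python
# Toy trace of the three multiplication steps for 798 x 347 (Figure 25):
# generate one weighted partial product per multiplier digit, then reduce
# and finish with a single carry-propagate addition (modeled by sum()).
def partial_products(multiplicand, multiplier):
    digits = [int(ch) for ch in str(multiplier)][::-1]   # LSD first
    return [multiplicand * d * 10 ** i for i, d in enumerate(digits)]

pps = partial_products(798, 347)
assert pps == [5586, 31920, 239400]   # 798*7, 798*4 shifted, 798*3 shifted
assert sum(pps) == 276906             # final carry-propagate addition
```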
Unfortunately, this added complexity can offset the gains in performance of the resulting multiplier. As discussed in Section 2.3, binary tree multipliers typically use carry-save adders to repeatedly reduce the partial product matrix until only two rows remain, which are then added using a fast carry-propagate adder to form the final product. Although tree multipliers are typically much faster algorithmically than array multipliers (see Section 2.3), they produce an irregular structure which can affect their performance.

Decimal codes differ from binary codes in that more information per bit has to be coded into the digital logic. The most common decimal encoding is 4-bit Binary Coded Decimal (BCD), which represents the decimal digits 0 through 9. This code is also referred to as BCD-8421, where the numbers 8421 represent the weight of each bit in the encoding. BCD-8421 has the advantage that each decimal digit is represented in a common binary number system and, hence, some of the binary operations can be performed with regular binary logic structures [31]. Although BCD-8421 codes are straightforward, they have two distinct disadvantages. First, the binary representations of ten through fifteen have no meaning and must be eliminated. Second, BCD-8421 is not self-complementing; a self-complementing BCD code is one where the 9's complement of the decimal digit may be obtained by changing the 1's to 0's and the 0's to 1's (bit inversion) [31]. The 9's complement operation is necessary to perform subtraction in much the same way as two's complement numbers are used to perform subtraction with binary numbers. Although these two disadvantages make BCD-8421 more challenging to work with, simple Boolean logic can be used to obtain its 9's complement (¬ denotes bit inversion):

C0 = ¬T0
C1 = T1
C2 = T2·¬T1 + ¬T2·T1
C3 = ¬(T3 + T2 + T1)

where the letters T and C refer to the true and complement digit bits, respectively.
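These equations can be checked exhaustively over the ten valid digits; a small sketch (the function name is illustrative, and the gate forms follow the equations as reconstructed above):

```python
# Exhaustive check of the BCD-8421 9's-complement logic:
# C0 = ~T0, C1 = T1, C2 = T2 xor T1, C3 = ~(T3 | T2 | T1)
# must give the digit 9 - d for every valid digit d.
def nines_complement(d):
    t = [(d >> i) & 1 for i in range(4)]      # T0..T3
    c0 = 1 - t[0]
    c1 = t[1]
    c2 = t[2] ^ t[1]                          # T2*~T1 + ~T2*T1
    c3 = 1 - (t[3] | t[2] | t[1])
    return c0 + 2 * c1 + 4 * c2 + 8 * c3
```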
Efficient addition of decimal numbers is an interesting topic since it is not only used as a standalone operation but is also required in other calculations like multiplication and division. Multi-operand addition, where more than two numbers are added, is of particular importance since it can be used to reduce the partial products that occur in a multiplication operation (see Figure 10). To avoid the time-consuming carry propagation in multi-operand addition, two different schemes used with binary radix numbers are also applicable to the decimal case: signed-digit addition and carry-save addition, the latter explored in this work.

5.1 Decimal Carry-Save Addition

One of the most common elements within partial product reduction is the full adder cell, or 3:2 counter, as stated in Section 2.2.1. As in the binary case, decimal partial product reduction can also utilize 3:2 counters, except that each digit is 4 bits long to accommodate the BCD-8421 encoding. In this case, four binary 3:2 counters are used, one per bit of the BCD digit, as shown in Figure 26. Figure 27 shows how this scheme would be implemented with a block-level diagram.

Figure 26. Binary full adder cells used for decimal carry-save addition (example: A = 3 (0011), B = 7 (0111) and C = 8 (1000) produce the sum vector S = 12 (1100) and carry vector H = 3 (0011) in BCD-8421).

The result of the addition is obtained by adding the sum and the shifted carry vector. Each carry bit in H, in Figure 26, represents the carry input for the next significant bit. The addition of both vectors assumes that H is shifted left one bit or, in binary, multiplied by two.

Figure 27. Block diagram of the full adder cells used for decimal carry-save addition.

The function of the 3:2 counter or full adder can be summarized using the following equation:

A + B + C = S + 2·H

where A, B, C, S and H are decimal numbers with A = Σ_{i=0}^{3} a_i·r_i, such that a_i is the bit value of A at position i and r_i the corresponding weight (r3r2r1r0 = 8421 for BCD-8421).
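The bitwise behavior of the four binary full adders can be sketched directly (a minimal stand-in for Figure 26, not the recoded design developed later; the function name is illustrative):

```python
# Four binary full adders applied bitwise to one BCD-8421 digit
# position: the sum vector S and carry vector H always satisfy
# A + B + C = S + 2*H, but S may leave the valid BCD range [0, 9].
def csa_digit(a, b, c):
    s = a ^ b ^ c                      # bitwise sum of the three bits
    h = (a & b) | (a & c) | (b & c)    # bitwise majority = carry
    return s, h

s, h = csa_digit(0b0011, 0b0111, 0b1000)   # A=3, B=7, C=8 -> S=12, H=3
```

S = 12 here is exactly the out-of-range situation that motivates the BCD-4221/5211 recoding of the following paragraphs.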
Although using binary 3:2 counters is relatively simple, this approach has one major disadvantage for the BCD-8421 coding. As seen in Figure 28, the sum vector S is out of range, since S = 1100 is not a valid decimal representation, and the addition of S and H yields a result in binary instead of a valid BCD-8421 number. Consequently, the conversion back to BCD would require additional logic and, hence, this approach is not efficient.

Figure 28. Result of the addition, carry vector shifted left (S = 12 (1100) plus the left-shifted carry W = 2×H = 6 gives 18 = 10010: out of the range [0, 9], a result in binary rather than BCD-8421).

One solution can be obtained by using a different weighted BCD representation during the summation process. That is, the bits are recoded to another valid decimal representation which allows the carries to be generated correctly when a shift occurs. To find the best BCD representation to utilize for the recoding, it is important to look at a property of adding two numbers. In order to avoid the out-of-range result, decimal codes are employed such that the sum of the weights of each 4-bit decimal code equals 9. In the previous example, since the code BCD-8421 was used, it was possible to represent digits with a value greater than 9 (10 through 15, since 4 bits were available). If the maximum possible value is 9 instead (because the weights of all 4 bits sum to 9), then the result can never be out of range. This property is also utilized with binary signed-digit adders to avoid carry propagation [29]. The BCD codes 5211, 4311, 4221 and 3321 satisfy this condition. Additionally, an advantage of these encodings is that they are self-complementing, meaning that the 9's complement of the number can be obtained by a simple inversion of each of the 4 bits [31].
Using this property allows the ten's complement of the number to easily be obtained, in much the same way as the two's complement is formed for binary numbers. Table 10 shows some of the mentioned codes.

Table 10. 8421, 4221 and 5211 BCD representations (digits with two possible encodings list both).

Decimal  BCD-8421  BCD-4221      BCD-5211
0        0000      0000          0000
1        0001      0001          0001 or 0010
2        0010      0010 or 0100  0100 or 0011
3        0011      0011 or 0101  0101 or 0110
4        0100      1000 or 0110  0111
5        0101      1001 or 0111  1000
6        0110      1010 or 1100  1001 or 1010
7        0111      1011 or 1101  1100 or 1011
8        1000      1110          1110 or 1101
9        1001      1111          1111

Furthermore, since the value of 2×H (the carry vector) is required to obtain the final sum, a BCD coding that facilitates multiplication by 2 is desirable. This is because, in binary, a multiplication by 2 is easily accomplished by a bitwise left shift. However, in the self-complementing codes shown above the weights do not increase in powers of two and, hence, shifting cannot be applied directly. Nevertheless, if the number is re-encoded, this can be accomplished in a simpler way.

Figure 29. Multiplication by 2, recoded result (example: H = 6 = 1001 in BCD-5211; read with weights 10, 4, 2, 2, the same bits represent W = 2×H = 12 without modifying them).

One potential solution for recoding the carries into a self-complementing output is to use the 5211 code, shown in Table 10. By using this code, a multiplication by 2 results in a 10, 4, 2, 2 encoding, as seen in Figure 29, without the need to even modify the bits. Although this encoding is not itself useful, since it does not satisfy the two conditions described earlier (self-complementing and in-range sums), the most significant bit can be passed to the next significant digit or column. The resulting BCD code is then 4221, as illustrated in Figure 30.
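The self-complementing property follows directly from the weights summing to 9; a small sketch over all sixteen BCD-4221 bit patterns (function names are illustrative):

```python
# Self-complementing property of BCD-4221: because the weights sum
# to 9, inverting all four bits of any pattern encodes 9 minus the
# value of the original pattern.
WEIGHTS_4221 = (4, 2, 2, 1)

def value_4221(bits):
    return sum(w for w, b in zip(WEIGHTS_4221, bits) if b)

def invert(bits):
    return tuple(1 - b for b in bits)
```

The same argument applies unchanged to the 5211, 4311 and 3321 codes, since only the weight sum matters.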
In summary, if a BCD-5211 recoding is utilized, a left shift asserts a carry-out at the weight-10 position and allows the decimal code to be read as BCD-4221, which is self-complementing. The bit shifted out is the carry-out to the next decimal digit, and the space opened at the least significant bit position after the shift is available to receive a carry-in from the previous significant digit. For the least significant decimal digit, this carry-in is assumed to be 0.

Figure 30. Multiplication by 2, result in BCD-4221 (example: H = 6 = 1001 in BCD-5211; the left shift sends the weight-10 bit to the next digit and leaves W = 2 = 0010 in BCD-4221, so 2×H = 12 is represented by the carry-out of 1 (tens) together with W = 2 (units)).

Figure 31. Decimal 3:2 counter with BCD recoding example (A = 3 (0101), B = 7 (1101) and C = 8 (1110), all in BCD-4221, produce S = 4 (0110) and H = 7 (1101); H recoded to BCD-5211 is 1100, and the left shift yields W = 4 (1000 in BCD-4221) with a carry-out of 1, so the result is S + 2·H = 4 + 14 = 18).

Figure 31 also shows the multiplication-by-2 operation on H: H is shifted left after recoding, allowing a carry-in to be accepted and generating a carry-out. The value of 2×H is thus composed of a decimal digit and a single-bit decimal carry-out. In the example, the decimal digit is W, with a value of 1000 (BCD-4221), i.e. 4, considering a carry-in of zero at its LSB. The single-bit carry-out of 1 represents a value of 1 in the tens place and, together with W = 4, represents 14. W has the same arithmetic weight as S. The operation is summarized as:

A + B + C = S + 2×H = S + (Carry-out × 10) + W + Carry-in.

The following figure shows the implementation of the decimal 3:2 counter:

Figure 32. Block diagram of the decimal 3:2 counter [44] (three BCD-4221 inputs drive four binary full adders; the sum vector S is produced in BCD-4221, while the carry vector passes through the ×2 recoding block to produce W in BCD-4221 together with a single-bit carry-out to the next column and a single-bit carry-in from the previous column; S and W have the same arithmetic weight).

Although this recoding works nicely, a conversion is necessary to bring the inputs of the 3:2 counter into the desired decimal code (4221).
Using Table 10, a BCD-8421 number can be converted into BCD-4221 through simple logic utilizing the following Boolean equations:

H0 = D0
H1 = D1 + D3
H2 = D3
H3 = D2 + D3

where D3D2D1D0 represents the BCD-8421 digit and H3H2H1H0 the BCD-4221 decimal digit. The counter then takes three 4-bit decimal numbers in BCD-4221 code as inputs. The binary full adders perform the bitwise addition, which results in a sum vector S and a carry vector H, both in BCD-4221. H is then recoded into BCD-5211, which allows the left shift to obtain the correct decimal value of 2×H in BCD-4221. It is important to see that by using recoding the correct result is obtained; previously, without recoding, the binary counters produced an incorrect result. By using a scheme that keeps the carry-out in the correct notation (i.e., the BCD-4221 code), the result is produced correctly. Recoding from BCD-4221 to BCD-5211 can be accomplished using the following Boolean logic expressions, derived from Table 10 [44] (¬ denotes bit inversion and ⊕ the exclusive OR):

w[3] = h[3]·(h[2] + h[1] + h[0]) + h[2]·h[1]·h[0]
w[2] = h[2]·h[1]·¬(h[3] ⊕ h[0]) + (h[3]·¬h[0]) ⊕ h[2] ⊕ h[1]
w[1] = h[2]·h[1]·¬(h[3] ⊕ h[0]) + h[3]·¬h[0]·¬(h[2] ⊕ h[1])
w[0] = (h[2]·h[1]) ⊕ h[3] ⊕ h[0]

where w3w2w1w0 represents the BCD-5211 digit and h3h2h1h0 the digit in BCD-4221. This logic is implemented in the block that performs the multiplication of the carry vector by two ("×2" in Figure 32).

5.2 Delay Analysis of the Decimal 3:2 Counter by Recoding

An insight into the performance and flow structure of the decimal 3:2 counter proposed in [44] (shown in Figure 32) can be gained by utilizing the Δ delay model frequently used for binary arithmetic delay analysis.
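Both conversions can be checked exhaustively by value; a sketch using the equations as reconstructed above (function names are illustrative, and the exact gate forms should be treated as a reconstruction, since the scanned source dropped the inversion overbars):

```python
# BCD-8421 -> BCD-4221 conversion and BCD-4221 -> BCD-5211 recoding,
# checked by value: the 8421, 4221 and 5211 readings must all agree.
def digit_value(bits, weights):
    return sum(w * b for w, b in zip(weights, bits))

def bcd8421_to_4221(d3, d2, d1, d0):
    return (d2 | d3, d3, d1 | d3, d0)            # H3, H2, H1, H0

def bcd4221_to_5211(h3, h2, h1, h0):
    w3 = (h3 & (h2 | h1 | h0)) | (h2 & h1 & h0)
    w2 = (h2 & h1 & (1 ^ h3 ^ h0)) | ((h3 & (1 - h0)) ^ h2 ^ h1)
    w1 = (h2 & h1 & (1 ^ h3 ^ h0)) | (h3 & (1 - h0) & (1 ^ h2 ^ h1))
    w0 = (h2 & h1) ^ h3 ^ h0
    return (w3, w2, w1, w0)
```

Note that the recoder accepts any of the redundant 4221 patterns, so it works on carry vectors produced by the full adders, not just canonical encodings.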
The use of Δ delays aids in considering design trade-offs in a circuit. Consider the case of a regular 1-bit full adder, or 3:2 counter. Typically, a 3:2 counter can be implemented using nine logic gates, as shown in the following figure:

Figure 33. Nine-gate full adder / 3:2 counter cell.

If the delay through an AND or OR gate is estimated at 2 Δ and the delay through an inverter at 1 Δ, the path delays from inputs to outputs are:

Cin → Cout = 4Δ
Cin → Sum = 5Δ
A, B → Cout = 9Δ
A, B → Sum = 10Δ

i.e., it takes 10 Δ delays for a signal to propagate from A or B to the Sum output, for example. In the same way, the delay through the "×2" block can be obtained by looking at the logic equations used to implement it. This block consists of a recoding from BCD-4221 to BCD-5211 and a single-bit hardwired left shift. The left shift does not incur any delay, while the BCD recoding can be implemented using simple two-level logic. Its delay therefore consists of 2 gate delays plus an inverter delay used to create the logic predicates, resulting in 5 Δ. Consequently, the complete decimal 3:2 counter of Figure 32, considering the delay model for the binary 3:2 counter shown above, requires:

A, B → S = 10Δ
A, B → W = 9Δ + 5Δ = 14Δ
C → S = 5Δ, through the carry-in path
C → W = 4Δ + 5Δ = 9Δ, through the carry-in path.

This analysis can be used to evaluate the feasibility of the proposed partial product reduction scheme described in the next section, which uses decimal 4:2 compressors.

5.3 Decimal 4:2 Compressor Trees

Decimal partial product reduction trees that use the BCD recoding technique differ from their binary counterparts in the way the weights are handled. Binary partial product reduction trees always take the carry and pass it onto the next column, as shown in Figure 14. This occurs because the carry-out has a more significant weight than the sum.
However, using the coding discussed in the previous section, the decimal case is completely different: it produces a carry-out digit W at the same weight as the sum and a carry-out bit to the next significant digit. The use of 3:2 counters for partial product reduction is common in binary arithmetic. In [44], partial product reduction is handled by a tree of decimal 3:2 counters, in a similar fashion to a Dadda tree [46]. This is shown in Figure 34, where the ×2 block represents the recoding logic discussed earlier, through which a multiplication by 2 is obtained by BCD recoding.

Figure 34. Decimal 9:2 counter, adapted from [44] (seven chained decimal 3:2 counters, each with a ×2 recoding block on its carry vector).

It is apparent from Figure 34, however, that the resulting circuit is very irregular, making its implementation difficult: its interconnection is complex, most likely affecting its performance and area consumption, in part due to unpredictably long wiring and its general layout. Regularity is one of the reasons why, in binary arithmetic, the Weinberger 4:2 compressor [30] is used for partial product reduction. Its extension to the decimal case is therefore an interesting problem, as it can provide performance enhancements for partial product reduction. Apart from its improved regular structure, the proposed decimal 4:2 compressor has the advantage that the carry-out is no longer dependent on the carry-in, as shown in Figure 35, breaking the carry chain path. This gives compressors a significant advantage over traditional carry-save adder trees in that the carry chain can be processed quickly while still maintaining a regular structure.
Although compressor trees are different from traditional counters, they can be organized into efficient interconnection networks for reducing the partial product matrix. This allows partial product reduction trees to be built easily and with regularity.

Figure 35. Proposed decimal 4:2 compressor; all signals are decimal (4 bits). (Two decimal 3:2 counters are cascaded, each carry vector passing through a ×2 recoding block; the four operand digits D3, D2, D1, D0 produce S, W and Cout in BCD-4221, with a carry-in Cin.)

The new decimal compressor's structure is similar to the binary case except that, in binary, the signals Cout and W are both passed to the next column, since their weight corresponds to the next significant bit, unlike S. In decimal, however, both Cout and W are composed of two parts: a "tens" part and a "units" part, where a BCD number represents the "units" weight and a single bit the "tens" value. The "tens" single bit is generated in the ×2 blocks during shifting as a carry-out and, at the same time, a carry-in is accepted from the previous column. For clarity, Figure 35 does not show the "tens" single bit generated at the ×2 blocks (the carry-out) or the single-bit carry-in.

Utilizing the delay statements from Section 5.2 allows a Δ delay analysis of the decimal compressor:

D3, D2, D1, D0 → S = 10Δ + 10Δ = 20Δ
D3, D2, D1, D0 → W = 10Δ + 14Δ = 24Δ
D3, D2, D1, D0 → Cout = 14Δ
Cin → S = 5Δ
Cin → W = 4Δ + 5Δ = 9Δ

This model is important since it determines the feasibility of higher-order compressors that use this block: as opposed to the binary case, the Cout and W signals incur an additional delay due to the ×2 block logic.

Figure 36. Decimal 8:2 compressor; all signals are decimal (4 bits). (Built from three decimal 4:2 compressors and one decimal 3:2 counter.)

Figure 36 shows an 8:2 compressor tree that performs the same function as Figure 34 yet is regular. Assuming that the inputs arrive at time 0, and using the Δ delay expressions determined for the 4:2 compressor, the Δ delay for each path can be determined.
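The compressor path delays above follow mechanically from the 3:2 counter model of Section 5.2; a small bookkeeping sketch of how the two cascaded counters compose (variable names are illustrative):

```python
# Delta-delay model of the decimal 4:2 compressor of Figure 35,
# built from two cascaded decimal 3:2 counters.
AB_S, AB_W = 10, 14       # operand input -> sum / recoded carry (W)
C_S, C_W = 5, 9           # carry-in input -> sum / recoded carry (W)

d_cout = AB_W             # first counter's recoded carry is Cout
d_s = AB_S + AB_S         # through both sum paths
d_w = AB_S + AB_W         # first sum, then second recoded carry
cin_s, cin_w = C_S, C_W   # carry-in paths of the second counter
```

Since d_cout (14Δ) is smaller than d_w (24Δ), the carry-out of one compressor never extends the critical path of the next, which is the property exploited by the 8:2 tree below.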
The times at which the corresponding signals become available are shown in the following figure for the 8:2 compressor.

Figure 37. Decimal 8:2 compressor; the critical path and Δ delays of the signals are shown (intermediate signals become available at 14Δ, 20Δ and 24Δ after the first level of compressors and at 38Δ, 44Δ and 48Δ after the second, with the final outputs at 58Δ and 62Δ).

The Cout signal requires only 14Δ, as shown, and is fast enough that it does not further delay the subsequent 4:2 compressor to which it is attached. In other words, the delay is determined solely by the 24Δ required for the W signal in each compressor. If this were not the case, the delay would accumulate with each stage, affecting performance ever more significantly as the size of the compressor is increased (e.g., 16:2, 32:2, etc.). Compressors for larger column heights (digits) can easily be built with log4(n) levels, where n represents the number of digits to be compressed (8 digits as shown, 16, etc.).

Figure 38 illustrates a 16-digit (64-bit) decimal compressor tree using the same technique used to build the 8-digit tree of Figure 36.

Figure 38. Decimal 16:2 compressor; all signals are decimal (4 bits). (Built from seven decimal 4:2 compressors and one decimal 3:2 counter.)

6. PARTIAL PRODUCT GENERATION SCHEMES

Due to its complexity, decimal multiplication was previously implemented using sequential methods [31]. These designs generally required iterating over each of the multiplier digits and, depending on its value, adding the corresponding multiple of the multiplicand to the result. This value had to be precomputed and stored before the process took place, typically through costly lookup tables [12]. Enhancements to this approach included the reduction of the multiplier digit range through recoding to reduce the number of combinations, a method similar to Booth multiplication.
Digit recoding allows the on-the-fly computation of multiplicand multiples, avoiding their storage, and the use of decimal carry-save or sign-magnitude addition to speed up the partial product reduction process [13], [14]. Currently there are only two proposals in the literature for decimal parallel multipliers, [44] and [20], with similar implementation architectures but significantly different multiplicand multiple generation schemes. Although these implementations are noteworthy, there are still many enhancements that require clarification, which this section addresses.

6.1 Multiplier Recoding

As stated above, typical decimal multiplication is more involved than binary, requiring the generation of the radix-10 multiplicand multiples 0x, 1x, 2x, …, 9x, as shown in Figure 39 (copied from Section 5 for clarity). Multiples like 3x and 7x are considered "hard multiples", since their generation with binary logic is not straightforward and usually requires a slow carry-propagate addition. Previous methods for generating multiples utilized either this approach or costly lookup tables.

Figure 39. Multiplication algorithm (example: 798 × 347; the multiplicand multiples 5586, 3192 and 2394 are reduced and added to produce 276906).

The common multiplication algorithm is given by:

p = x·y = Σ_{i=0}^{n−1} x·yi·r^i

where x and y denote an n-digit BCD multiplicand and multiplier, respectively, r = 10 denotes the radix, and yi ∈ [0, 9]. Recoding the multiplier is an efficient implementation technique since it permits a different set of multiples to be utilized, avoiding the need to generate "hard multiples". One method recodes each multiplier digit yi ∈ [0, 9] as yi = yHi + yLi, where yHi ∈ {0, 5, 10} and yLi ∈ {−2, −1, 0, 1, 2}. In this manner, only the multiples 2x and 5x, which can be generated without excessive overhead, are required to create all other decimal multiples. This is illustrated in the radix-5 columns of Table 11. Similarly, the multiple 10x is produced with a simple 4-bit hardwired left shift (1-digit shift).
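The radix-5 split of each digit follows Table 11 and can be sketched directly (the function name is illustrative):

```python
# Radix-5 recoding of a BCD multiplier digit y in [0, 9] into
# y = yH + yL with yH in {0, 5, 10} and yL in {-2, -1, 0, 1, 2},
# so only the easy multiples 2x, 5x and 10x are ever required.
def recode_radix5(y):
    if y <= 2:
        return 0, y
    if y <= 7:
        return 5, y - 5
    return 10, y - 10
```

A negative yL selects the 9's complement of the corresponding multiple, which the self-complementing BCD-4221 code makes a simple bit inversion.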
In [44], a radix-5 multiplier using this approach is described. A block diagram of the digit-recoding scheme is shown in Figure 40. Hot-one coded multiplexors are used, avoiding the need for an extra '0x' input. The selection signals are determined directly from the BCD-8421 input digits yi (¬ denotes bit inversion):

yHi10x = yi[3]
yHi5x = yi[2] + yi[1]·yi[0]
yLi2x = ¬yi[2]·yi[1] + yi[3]·¬yi[0] + yi[1]·yi[0]
yLi1x = yi[2]·¬yi[0] + ¬yi[2]·¬yi[1]·yi[0]
signi = yi[3] + ¬yi[2]·yi[1]·yi[0] + yi[2]·¬yi[1]·¬yi[0]

Table 11. Radix-5 / radix-4 digit recoding.

Decimal yi  Radix-5 (yHi, yLi)  Radix-4 (yHi, yLi)
0           0,  0               0,  0
1           0,  1               0,  1
2           0,  2               0,  2
3           5, −2               4, −1
4           5, −1               4,  0
5           5,  0               4,  1
6           5,  1               4,  2
7           5,  2               8, −1
8           10, −2              8,  0
9           10, −1              8,  1

Figure 40. Digit recoding for radix-5, [44] (hot-one multiplexors select between the 10x and 5x multiples and between the 2x and 1x multiples, with a 9's complement stage controlled by the sign signal).

The second recoding approach, similar to the radix-5 case, recodes each multiplier digit yi ∈ [0, 9] as yi = yHi + yLi, where yHi ∈ {0, 4, 8} and yLi ∈ {−2, −1, 0, 1, 2}. It is named radix-4 recoding in [19] and is also shown in Table 11. In this approach the hard multiples are likewise avoided; instead, 2x, 4x and 8x are required to generate all the others. The multiple 4x can be generated by cascading two consecutive 2x modules, and an additional 2x module yields 8x, implying a latency three times that required to obtain 2x. The logic equations for the digit recoding of yi are:

yHi8x = yi[3] + yi[2]·yi[1]·yi[0]
yHi4x = yi[2] ⊕ (yi[1]·yi[0])
yLi2x = yi[1]·¬yi[0]
yLi1x = yi[0]
signi = yi[1]·yi[0]

6.2 Multiplicand Multiples Generation

The recoding schemes discussed in the previous section avoid the computation of complicated multiples of x.
Instead, in the radix-5 case, only 2x and 5x modules are required, as all other multiples can be generated from them. Two main approaches for the generation of these multiples can be identified in the available literature. The first method for obtaining 2x and 5x is studied in [31] with a conventional binary logic approach and utilized in [12] and [20]. A more recent technique, proposed in [19], utilizes BCD recoding and shifting instead.

6.2.1 2x and 5x with Conventional Binary Logic

The multiples 2x and 5x can be produced rapidly, as opposed to other multiples, because in both the doubling and the quintupling of BCD-8421 numbers no carry is generated beyond the next significant digit. Logic equations for the doubling and quintupling of BCD numbers are given in [12]. When doubling takes place, the least significant bit (LSB) of each doubled decimal digit is initially zero. When a digit's value is in the range [5, 9], a carry-out of at most one is produced (9 × 2 = 18); it cannot propagate further, since the LSB of the next doubled digit is initially zero as well. The equations for doubling each multiplicand digit ai (with ai−1 the next less significant digit) can be formed as follows:

2xi[0] = ai−1[3] + ai−1[2]·ai−1[1] + ai−1[2]·ai−1[0]
2xi[1] = ¬ai[3]·¬ai[2]·ai[0] + ai[2]·ai[1]·¬ai[0] + ai[3]·¬ai[0]
2xi[2] = ai[1]·ai[0] + ¬ai[2]·ai[1] + ai[3]·¬ai[0]
2xi[3] = ai[2]·¬ai[1]·¬ai[0] + ai[3]·ai[0]

On the other hand, when a number is odd and quintupling takes place, its digit initially becomes five, and when the number is even it becomes zero. Quintupling produces a carry-out of at most four (9 × 5 = 45). Since the initial value is zero or five, adding the carry results in at most 9, preventing the carry from propagating any further.
Therefore, using the LSB of the digit itself (ai[0]) to determine whether the digit initially becomes 0 or 5, and taking the carry ⌊ai−1/2⌋ from the high-order bits of the next less significant digit, equations for 5x can be determined as follows:

5xi[0] = ai[0] ⊕ ai−1[1]
5xi[1] = ¬ai[0]·ai−1[2] + ai[0]·¬ai−1[3]·(ai−1[2] ⊕ ai−1[1])
5xi[2] = ¬ai[0]·ai−1[3] + ai[0]·¬ai−1[3]·(¬ai−1[2] + ¬ai−1[1])
5xi[3] = ai[0]·(ai−1[3] + ai−1[2]·ai−1[1])
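The doubling and quintupling rules stated in this subsection can be checked digit by digit. The bit equations below follow the reconstruction used here, since the scanned source garbled the originals; they are a sketch, not necessarily the exact gate forms of [12]:

```python
# Digit-wise check of the 2x and 5x logic: 'd' is the current
# BCD-8421 digit and 'p' the next less significant digit, which is
# the only source of carry in either operation.
def double_digit_2x(d, p):
    a = [(d >> i) & 1 for i in range(4)]
    q = [(p >> i) & 1 for i in range(4)]
    b0 = q[3] | (q[2] & q[1]) | (q[2] & q[0])           # carry: p >= 5
    b1 = ((1-a[3]) & (1-a[2]) & a[0]) | (a[2] & a[1] & (1-a[0])) | (a[3] & (1-a[0]))
    b2 = (a[1] & a[0]) | ((1-a[2]) & a[1]) | (a[3] & (1-a[0]))
    b3 = (a[2] & (1-a[1]) & (1-a[0])) | (a[3] & a[0])
    return b0 + 2*b1 + 4*b2 + 8*b3

def quintuple_digit_5x(d, p):
    a0 = d & 1                       # digit initially becomes 0 or 5
    q = [(p >> i) & 1 for i in range(4)]
    b0 = a0 ^ q[1]
    b1 = ((1-a0) & q[2]) | (a0 & (1-q[3]) & (q[2] ^ q[1]))
    b2 = ((1-a0) & q[3]) | (a0 & (1-q[3]) & ((1-q[2]) | (1-q[1])))
    b3 = a0 & (q[3] | (q[2] & q[1]))
    return b0 + 2*b1 + 4*b2 + 8*b3
```

Each result digit must equal (2d mod 10) plus a carry of one when the lower digit is at least 5, and (0 or 5) plus the carry ⌊p/2⌋, respectively.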
Title  Analysis and Implementation of Decimal Arithmetic Hardware in Nanometer CMOS Technology 
Date  2008-07-01 
Author  Castellanos, Ivan D. 
Keywords  decimal arithmetic; VLSI; arithmetic; decimal multiplication; decimal addition; decimal floating-point 
Department  Electrical Engineering 
Document Type  
Full Text Type  Open Access 
Abstract  In today's society, decimal arithmetic is growing considerably in importance given its relevance in financial and commercial applications. Decimal calculations on binary hardware significantly impact performance, mainly because most systems emulate decimal calculations in software. The introduction of dedicated decimal hardware, on the other hand, promises performance improvements of two to three orders of magnitude. The building blocks of binary arithmetic are studied and applied to the development of decimal arithmetic hardware. New findings are contrasted with existing implementations and validated through extensive simulation. New architectures and a significant study of decimal arithmetic were developed and implemented. The architectures proposed include a decimal floating-point comparator compliant with the current revision draft of IEEE 754, a study of decimal division, partial product reduction schemes using decimal compressor trees, and a final implementation of a decimal multiplier using advanced techniques for partial product generation. The results of each hardware implementation in nanometer technologies are weighed against existing propositions and show improvements in area, delay, and power. 
Note  Dissertation 
Rights  © Oklahoma Agricultural and Mechanical Board of Regents 
Transcript 
DECIMAL FLOATINGPOINT COMPARATOR ..............................................................37 iv Chapter Page 3.1 Decimal floatingpoint comparison .........................................................................37 3.2 Comparator Design.................................................................................................40 3.3 Coefficient magnitude comparison .........................................................................42 3.4 Special case scenarios ...........................................................................................44 3.4.1 One or both numbers is infinite ..........................................................................45 3.4.2 Both operands are zero......................................................................................46 3.4.3 Exponent difference offrange............................................................................46 3.4.4 Alignment shiftout, overflow ..............................................................................47 3.4.5 Coefficient comparison.......................................................................................48 3.5 Combined binary floatingpoint, two’s complement and decimal floatingpoint comparator........................................................................................................................50 4. EXPERIMENTS FOR DECIMAL FLOATINGPOINT DIVISION BY RECURRENCE ....53 4.1 Decimal Division by Digit Recurrence Theory ........................................................53 4.2 Quotient Digit Selection ..........................................................................................56 4.3 Considerations for the IEEE754 Decimal Case ....................................................60 5. 
DECIMAL PARTIAL PRODUCT REDUCTION...............................................................65 5.1 Decimal CarrySave Addition .................................................................................68 5.2 Delay analysis of the decimal 3:2 counter by recoding ..........................................75 5.3 Decimal 4:2 Compressor Trees..............................................................................77 6. PARTIAL PRODUCT GENERATION SCHEMES...........................................................83 6.1 Multiplier Recoding .................................................................................................83 6.2 Multiplicand Multiples Generation...........................................................................87 6.2.1 2x and 5x with Conventional Binary Logic .........................................................87 6.2.2 2x and 5x using BCD recoding...........................................................................88 6.3 Partial Product Generation Architectures ...............................................................90 v Chapter Page 7. RESULTS........................................................................................................................93 7.1 Decimal comparator design....................................................................................95 7.2 Decimal/Binary Combined Comparator ..................................................................96 7.3 Partial Product Reduction Schemes.......................................................................99 7.4 Partial Product Generation Architectures .............................................................102 8. 
CONCLUSIONS ................................................................................................105
BIBLIOGRAPHY ...............................................................................................109
APPENDIX A – DENSELY PACKED DECIMAL ENCODING ..........................113

LIST OF TABLES
Table                                                                          Page
Table 1. Numerical differences between decimal and binary floating-point numbers ......3
Table 2. Decimal floating-point format ..................................................................5
Table 3. Decimal FP Combination Field ................................................................6
Table 4. Floating-point Condition Codes .............................................................11
Table 5. BCD Magnitude Comparison .................................................................13
Table 6. Excess-3 Code .......................................................................................21
Table 7. Decimal Multiplication Table, from [20] .................................................32
Table 8. Restricted range, signed-magnitude products, from [32] ......................35
Table 9. BCD Magnitude Comparison .................................................................42
Table 10. 8421, 4221 and 5211 BCD representations ........................................71
Table 11. Radix-5/4 digit recoding .......................................................................85
Table 12. Area and delay estimates ....................................................................95
Table 13. Area and delay results for comparator designs .................................105
Table 14. Dynamic and Static power results for comparator designs ...............106
Table 15.
Comparison results for proposed compressor trees ..........................106
Table 16. Dynamic and static power comparison results for proposed ............107
Table 17. Area and delay results for VAM [40], LN [42] architectures ..............107
Table 18. Dynamic and static power consumption for partial ............................108
Table 19. DPD Encoding / Compression, taken from [8] ..................................113
Table 20. DPD Decoding / Expansion, taken from [8] ......................................114

LIST OF FIGURES
Figure                                                                         Page
Figure 1. Example: decimal floating-point representation of the number −8.35 ......7
Figure 2. Block diagram of the binary portion of the comparator, taken from [15] ......12
Figure 3. Magnitude comparator example ............................................................13
Figure 4. Full adder cell and Truth Table ..............................................................18
Figure 5. 4-bit Ripple Carry Adder ........................................................................18
Figure 6. Full adder cells used for carry-save addition ........................................20
Figure 7. Weinberger 4:2 Binary Compressor ......................................................21
Figure 8. Adder for Excess-3 code, taken from [20] .............................................23
Figure 9. Decimal Floating-Point adder, from [23] ................................................25
Figure 10. Partial products during multiplication, taken from [24] ........................26
Figure 11. Carry Save Array Multiplier, CSAM .....................................................28
Figure 12. Wallace Tree multiplier reduction for two 4-bit operands ....................29
Figure 13.
Implementation detail for Wallace tree, 5th column ............................29
Figure 14. Three-column 8-bit 4:2 binary compressor tree ..................................31
Figure 15. Decimal multiplication table Left and Right algorithm example ..........33
Figure 16. Classic approach to floating-point comparison ...................................39
Figure 17. Decimal floating-point comparator design ...........................................41
Figure 18. Coefficient Aligner, logarithmic barrel shifter .......................................41
Figure 19. BCD magnitude comparator ................................................................44
Figure 20. Combined two's complement, binary and decimal comparator ..........52
Figure 21. Robertson's diagram showing selection intervals for q = k−1 and k ......55
Figure 22. Truncated inputs to the QDS function .................................................56
Figure 23. P-D Diagram, Sk selection points as a function of truncated d, [34] ......58
Figure 24. Decimal division by digit recurrence implementation ..........................64
Figure 25. Multiplication Algorithm ........................................................................65
Figure 26. Binary full adder cells used for decimal carry-save addition ...............68
Figure 27. Block diagram for full adder cells used for decimal Carry-Save addition ......69
Figure 28. Result of the addition, carry vector shifted left ....................................70
Figure 29. Multiplication by 2, recoded result .......................................................71
Figure 30. Multiplication by 2, result in BCD-4221 ...............................................72
Figure 31.
Decimal 3:2 counter with BCD recoding example ..............................73
Figure 32. Block diagram for decimal 3:2 counter [38] ........................................74
Figure 33. Nine-gate Full Adder / 3:2 counter cell ................................................76
Figure 34. Decimal 9:2 counter, adapted from [38] ..............................................78
Figure 35. Proposed decimal 4:2 compressor. All signals are decimal (4 bits) ......79
Figure 36. Decimal 8:2 compressor, all signals are decimal (4 bits) ....................80
Figure 37. Decimal 8:2 compressor. Critical path and Δ delays shown ...............81
Figure 38. Decimal 16:2 compressor, all signals are decimal (4 bits) ..................82
Figure 39. Multiplication Algorithm ........................................................................84
Figure 40. Digit recoding for radix-5, [38] .............................................................86
Figure 41. Quintupling through BCD recoding ......................................................89
Figure 42. Lang-Nannarelli radix-10 partial product generation ...........................91
Figure 43. Vásquez-Antelo radix-5 partial product generation .............................92
Figure 44. Design Flow methodology ....................................................................94
Figure 45. Concept block diagram for implementation comparison .....................96
Figure 46. Delay estimates for comparator designs .............................................97
Figure 47. Area estimates for comparator designs ...............................................98
Figure 48. Dynamic power consumption for comparator designs ........................98
Figure 49. Delay estimates for compressor trees vs.
counter trees designs ......................................................100
Figure 50. Area estimates for compressor trees vs. counter trees designs ......101
Figure 51. Dynamic power estimates for compressor trees vs. counter trees designs ......101
Figure 52. Delay results, partial product generation architectures ....................103
Figure 53. Area comparison for partial product generation architectures .........103
Figure 54. Dynamic power comparison, partial product generation ..................104

1. INTRODUCTION

Many scientific, engineering and commercial applications call for operations with real numbers. In many cases, a fixed-point numerical representation can be used. This approach is not always feasible, however, since the required range is not always attainable with a fixed-point format. Floating-point numbers have instead proven to be an effective approach: they offer a wide dynamic range, but are more difficult to implement, are less precise for the same number of digits, and introduce round-off errors. The floating-point numerical representation is similar to scientific notation, differing in that the coefficient's radix point is fixed, usually to the right of the leftmost (most significant) digit; the location of the represented number's actual radix point is instead indicated by an exponent field. Since the radix point can effectively be placed anywhere within the given number of digits, numbers with a "floating" radix point cover a wide dynamic range of magnitudes while maintaining suitable precision. The IEEE standardized the floating-point numerical representation for computers in 1985 with the IEEE-754 standard [1], which provides a specific encoding of the bits and precisely defines the behavior of arithmetic operations. This IEEE format minimizes calculation anomalies while permitting different implementation possibilities.
Since the 1950s, binary arithmetic has predominated in computer operations, given its simple implementation in electronic circuits. Consequently, the heavy use of binary floating-point numbers makes the IEEE binary floating-point standard effectively required for all existing computer architectures, since it simplifies implementation. More importantly, it allows architectures to communicate with one another efficiently, since numbers adhere to the same IEEE standard. Although binary encoding in computer systems is prevalent, decimal arithmetic is becoming increasingly important and indispensable, as binary arithmetic cannot always satisfy the needs of many current applications in terms of robustness and precision. Unfortunately, many architectures still resort to software routines to emulate operations on decimal numbers or, worse yet, rely on binary arithmetic and then convert to the necessary precision. When this happens, software routines and binary approximations may drop bits that are crucial to representing the intended value, potentially causing severe harm to many applications.

1.1 Importance of Decimal Arithmetic

Decimal operations are essential in financial, commercial and many Internet-based applications. Decimal numbers are common in everyday life and are essential when calculated results must match operations that would otherwise be performed by hand [2]. Some conventions even require an explicit decimal approximation. A study presented in [3] shows that numeric data in commercial applications, such as banking, insurance and airlines, is predominantly decimal, reaching up to 98% of the data. Furthermore, another study, discussed in [4], shows that decimal calculations can incur a 50% to 90% processing overhead. One of the main causes of this performance cost is that binary numbers cannot represent most decimal numbers exactly.
A number like 0.1, for example, would require an infinitely recurring binary fraction, whereas it can be represented exactly in decimal. This implies that it is not always possible to guarantee the same results between binary floating-point and decimal arithmetic. This is further illustrated in the following table, where the number 0.9 is repeatedly divided by 10.

Table 1. Numerical differences between decimal and binary floating-point numbers.
Decimal      Binary
0.9          0.9
0.09         0.089999996
0.009        0.009
0.0009       9.0E-4
0.00009      9.0E-5
0.000009     9.0E-6
9E-7         9.0000003E-7
9E-8         9.0E-8
9E-9         9.0E-9
9E-10        8.9999996E-10

It is, therefore, considerably difficult to develop and test applications that require this type of calculation and that use exact real-world data, such as commercial or financial values. Even legal requirements, like the Euro (€) currency regulations, dictate the working precision and rounding method to be used for calculations in decimal digits [5]. These requirements can only be met by working in base 10, using an arithmetic which preserves precision. Typically, decimal computations are performed on binary hardware through software emulation and mathematical approximations, since requirements specific to decimal numbers cannot always be met in pure binary form. These requirements may include arithmetic that preserves the number of decimal places (including trailing zeroes or unnormalized coefficients) and decimal rounding, among others. In all cases, any scaling, rounding, or exponent handling has to be performed explicitly by the application or the programmer, a complex and very error-prone task. Since binary computations for decimal arithmetic tend to be slow, significant performance improvements may result from using decimal floating-point hardware. Native (hardware) decimal floating-point arithmetic will make programming far simpler and more robust, and produce significantly better performance in computer applications.
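The representation gap behind Table 1 is easy to reproduce in software. The following sketch uses Python's standard `decimal` module for the decimal side; note that Python floats are double precision, so the exact binary error differs from the single-precision values in Table 1, but the underlying effect is the same.

```python
from decimal import Decimal

# Repeat the Table 1 experiment: divide 0.9 by 10 three times.
binary = 0.9               # stored as the nearest binary fraction, not exactly 0.9
exact = Decimal("0.9")     # exact decimal value
for _ in range(3):
    binary /= 10
    exact /= 10

print(exact)               # 0.0009 (decimal arithmetic preserves the value)
print(repr(binary))        # close to, but not necessarily exactly, 0.0009

# Constructing a Decimal from the float 0.1 exposes the binary fraction
# that is actually stored for it.
print(Decimal(0.1))
```

Passing the float `0.1` (rather than the string `"0.1"`) to `Decimal` is what reveals the stored binary approximation; this is why financial code conventionally constructs decimals from strings.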
Hardware of this type can improve the speed of decimal floating-point calculations by two or three orders of magnitude compared to a software approach. Its impact is further highlighted by IBM's release of the Power6 processor, the first UNIX microprocessor able to perform decimal floating-point arithmetic in hardware [7]. As an example, shown in [4], division of a JIT (Java Just-In-Time compiled) 9-digit BigDecimal number takes more than 13,000 clock cycles on an Intel® Pentium™ processor, while a 9-digit decimal addition requires more than 1,100 clock cycles. On the other hand, binary arithmetic takes 41 cycles for an integer division and 3 cycles for an addition on the same processor. Dedicated decimal hardware would be comparable to these values, if available.

1.2 The Decimal Floating-Point Standard

The increasing importance of decimal arithmetic is highlighted by the specifications being included in the current revision draft of the IEEE-754 standard for floating-point arithmetic, or IEEE-754R [8]. Decimal floating-point numbers are in a format similar to scientific notation:

(−1)^S × Coefficient × 10^(Exponent − Bias),

where S is either 1 or 0 and determines the sign of the number. The exponent is biased to avoid negative representations. In other words, all exponents are represented in relation to a known value assigned to exponent zero. For example, if the bias is 127, numbers below 127 are negative and numbers above 127 are positive. To illustrate with a specific example, an exponent field of 125 with a bias of 127 represents an exponent value of −2. The current draft specifies the representation of three decimal number types: decimal32, decimal64 and decimal128, encoded in 32, 64 and 128 bits respectively. The value of the number is encoded in four different fields. An illustration of this representation for decimal64 numbers is shown in Table 2, taken from [9]. Table 2.
Decimal floating-point format

Description      Sign | Combination Field | Exponent Continuation | Coefficient Continuation
Length (bits)      1  |         5         |           8           |            50

Decimal64 numbers are comprised of a 16-digit coefficient and a 10-bit biased exponent. The sign bit indicates the sign of the number, in the same way as for binary floating-point numbers. Both exponent and coefficient have part of their value encoded in the combination field: the two Most Significant Bits (MSBs) of the exponent and the Most Significant Digit (MSD) of the coefficient. The combination field also determines whether the number represented is a finite number, an infinite number or a NaN (Not-a-Number). Quiet and signaling NaNs (the latter triggering or signaling an exception) are distinguished by the first bit of the exponent continuation field. Table 3, shown in [9], illustrates the combination field, whose encoding depends on whether the number is Infinity, a NaN or a finite number. For finite numbers, the combination field is also encoded differently depending on whether the MSD is greater than or equal to 8, as illustrated in the first two entries of the table. The exponent is therefore a biased unsigned 10-bit binary number, and the coefficient is given by a specific 10-bit-per-3-decimal-digits encoding representing a 16-digit decimal number. An additional important characteristic of decimal floating-point numbers is that they are not normalized, as opposed to binary floating-point numbers.

Table 3. Decimal FP Combination Field
Type             Combination Field (5 bits)   Exponent MSBs (2 bits)   Coefficient MSD (4 bits)
Finite, MSD < 8  a b c d e                    a b                      0 c d e
Finite, MSD > 7  1 1 c d e                    c d                      1 0 0 e
Infinity         1 1 1 1 0                    -                        -
NaN              1 1 1 1 1                    -                        -

For example, suppose a programmer wants to encode the number −8.35 into decimal64. The first step is to separate the number into its coefficient and exponent, which produces 835 (with 13 leading zero decimal digits, since coefficients are 16 digits long) and −2 respectively, i.e. −835×10^−2.
For decimal64 numbers, where the bias value is 398, an exponent of −2 becomes 396 (01 1000 1100 in binary). The combination field for −835×10^−2 contains the two most significant bits (MSBs) of the exponent (01 in this example) and the most significant digit (MSD) of the coefficient (4 bits, 0000 in this case since the MSD is zero). According to Table 3, for finite numbers with an MSD below 8, the 5-bit combination field abcde decodes ab as the exponent's MSBs and 0cde as the MSD. Thus, for −8.35 the combination field becomes 01 000. The remaining 8 bits of the exponent, 1000 1100 (0x8C), are placed in the exponent continuation field. Finally, the coefficient is given in the coefficient continuation field using Densely Packed Decimal (DPD) encoding [10]. DPD encoding provides an efficient method of storing 3 decimal digits in 10 bits and of translating between this encoding and BCD representation using simple Boolean expressions. A more detailed explanation of DPD can be found in the Appendix. The DPD codification of three BCD decimal digits into 10 bits is called compression, and it depends on the size of each digit: small (3 bits, for values less than or equal to 7) or large (4 bits, for values greater than 7). A specific mapping is used in each situation: all digits small, left digit large, middle digit large, etc. [10]. The three digits 835 are given in BCD as bits abcd efgh ijkm (1000 0011 0101). Bits a, e and i indicate whether each digit is large or small. For this specific case, in which the left digit is large, the mapping used for the encoding has the form [jkd fgh 1 10 m], or 0x23D (see the second table in Appendix A). Therefore, the decimal64 representation of −8.35 is, in hexadecimal, A2 30 00 00 00 00 02 3D.

Figure 1. Example: decimal floating-point representation of the number −8.35.
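The DPD compression step used above can be sketched in software. The following Python function implements the eight-case mapping table from [10] (reproduced in Appendix A); the function name `dpd_compress` and the bit-ordering conventions are ours, with the digit bits named abcd, efgh, ijkm as in the text.

```python
def dpd_compress(d1, d2, d3):
    """Compress three decimal digits (0-9) into one 10-bit DPD declet.

    Digit bits follow the text: d1 = abcd, d2 = efgh, d3 = ijkm, where
    a, e and i flag a 'large' digit (8 or 9).
    """
    a, b, c, d = (d1 >> 3) & 1, (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    e, f, g, h = (d2 >> 3) & 1, (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    i, j, k, m = (d3 >> 3) & 1, (d3 >> 2) & 1, (d3 >> 1) & 1, d3 & 1

    # One 10-bit output pattern (MSB first) per large/small combination.
    table = {
        (0, 0, 0): (b, c, d, f, g, h, 0, j, k, m),   # all digits small
        (0, 0, 1): (b, c, d, f, g, h, 1, 0, 0, m),   # right digit large
        (0, 1, 0): (b, c, d, j, k, h, 1, 0, 1, m),   # middle digit large
        (1, 0, 0): (j, k, d, f, g, h, 1, 1, 0, m),   # left digit large
        (1, 1, 0): (j, k, d, 0, 0, h, 1, 1, 1, m),
        (1, 0, 1): (f, g, d, 0, 1, h, 1, 1, 1, m),
        (0, 1, 1): (b, c, d, 1, 0, h, 1, 1, 1, m),
        (1, 1, 1): (0, 0, d, 1, 1, h, 1, 1, 1, m),
    }
    declet = 0
    for bit in table[(a, e, i)]:
        declet = (declet << 1) | bit
    return declet

# Digits 8, 3, 5 fall in the 'left digit large' case, the [jkd fgh 1 10 m]
# pattern cited in the text:
print(hex(dpd_compress(8, 3, 5)))   # 0x23d
```

When all three digits are small, the declet is simply the concatenation of the three 3-bit digits, so e.g. the digits 1, 2, 3 compress to 0x0A3.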
1.3 A case for Decimal Arithmetic in General-Purpose Computer Architectures

Decimal arithmetic has long been studied in computer architectures; however, most silicon implementations of digital logic suffered due to area requirements. By the year 2010, processors with 2 billion transistors are expected to be developed [11]. The large number of transistors available within silicon implementations, together with the increased sophistication of design tools, therefore gives designers the ability to include new and important features, such as decimal arithmetic. Previous work in decimal arithmetic includes high-speed multipliers [12][13][14], algorithms for decimal adders [15][16][17] and multioperand addition [18][19], and algorithms for decimal partial product generation [14][19][20]. Although there has been a large amount of research interest in decimal arithmetic architectures, many of the proposed architectures fail to produce designs that are targeted at real implementations, especially designs below 180 nm. This dissertation studies these designs by offering possible solutions and implementations in decimal arithmetic and, in some cases, showing how they can be combined with binary arithmetic to produce combined binary/decimal arithmetic units.

2. BACKGROUND

Decimal arithmetic operations were significantly researched in the 1950s and the latter part of the 20th century, but binary arithmetic hardware nonetheless took over computer calculations. The reasoning behind this came after Burks, Goldstine, and von Neumann published a preliminary study on computer design [12].
They argued that, for scientific research, simplicity was the major advantage of binary hardware, increasing its performance and reliability. Furthermore, decimal numbers would need to be stored in binary form, requiring extra storage space (bits) to maintain the same precision as binary numbers, and would require more circuitry than operations performed in pure binary form. Nevertheless, if conversions from decimal to binary and vice versa are needed, it is significantly more efficient to perform the operations in decimal hardware [4]. In many cases, techniques developed for binary arithmetic hardware can be applied to some extent to decimal hardware. It is therefore important to explore relevant approaches and research in binary arithmetic, since they provide invaluable insight into solving decimal arithmetic problems.

2.1 Binary Comparison

An important element in general-purpose and application-specific architectures is the comparator [22]. The design of high-speed, efficient comparators aids the performance of these architectures. The idea of designing efficient comparators is not new, as seen from previous studies [23]-[25]. Nevertheless, further gains in area usage and power can be obtained by designing a comparator that can handle different data types using an efficient, compatible comparison method. The work in [26] presents the design and implementation of a high-performance comparator capable of handling 32-bit and 64-bit two's complement numbers and single- and double-precision binary floating-point numbers. This type of design is especially useful for reducing costs in processors, since it allows the same hardware to be used to compare multiple data types. A novel approach to the magnitude comparison problem was utilized, with a comparator module that has logarithmic delay. This design can also be easily extended to support 128-bit binary floating-point numbers and can accommodate pipelining to improve throughput.
The IEEE 754 standard [1] specifies floating-point comparisons where the relation between two numbers is given as greater than, less than, equal or unordered. When either of the compared operands is a Not-a-Number, or NaN, the result of the comparison is unordered. If the NaN is a signaling NaN, an Invalid exception flag bit is asserted. The result of the comparison is represented by Floating-point Condition Codes, or FCC [27]. Table 4 shows the FCC representation of the comparison result in this design. Note that bit FCC[1] is analogous to a greater-than flag (GT) and FCC[0] is analogous to a less-than flag (LT). When both flags are zero the numbers are equal, and when both are one the numbers are unordered.

Table 4. Floating-point Condition Codes.
FCC[1] (GT)   FCC[0] (LT)   Relation
0             0             A = B
0             1             A < B
1             0             A > B
1             1             Unordered

As illustrated in [26], the combined comparator is composed of three blocks. A 2-bit Sel signal indicates the type of operands being compared. The first block converts 32-bit operands to 64-bit operands so that, as in most floating-point implementations, all operand sizes are handled by the same hardware with simpler logic. The second block performs a magnitude comparison. Finally, the third block takes care of exceptions and special cases according to the IEEE 754 standard; it also correctly handles the signs of the input operands.

2.1.1 Magnitude Comparator Design

The magnitude comparator devised in [26], and shown in Figure 2, is the core of the comparator module. The two operands A and B are compared in stages. The first stage compares corresponding 2-bit pairs from each operand. Two output bits, GT (greater than) and LT (less than), from each element indicate whether the compared pair is greater than, less than or equal, as shown in Table 5.
If the bit pairs of A and B are denoted by A[2i+1, 2i] and B[2i+1, 2i], then the values of GT[i]1 and LT[i]1 (where the subscript 1 indicates the first stage of the comparison) are given by:

GT[i]1 = A[2i+1]·B'[2i+1] + A[2i+1]·A[2i]·B'[2i] + A[2i]·B'[2i+1]·B'[2i]
LT[i]1 = A'[2i+1]·B[2i+1] + A'[2i+1]·A'[2i]·B[2i] + A'[2i]·B[2i+1]·B[2i]

for 0 ≤ i ≤ ⌈n/2⌉ − 1, where n is the operand size (in this case 64) and X' denotes the complement of X.

Figure 2. Block diagram of the binary portion of the comparator, taken from [26].

In subsequent stages the same process is used, except that the GT[i]j signals replace the A[i] signals and the LT[i]j signals replace B[i], where j denotes the comparator stage. An additional reduction is possible in subsequent stages, since GT and LT cannot both equal 1 at the same time; therefore, for j ≥ 1 the equations simplify to:

GT[i]j+1 = GT[2i+1]j + GT[2i]j·LT'[2i+1]j
LT[i]j+1 = LT[2i+1]j + LT[2i]j·GT'[2i+1]j

A total of k = ⌈log2(n)⌉ stages is required to obtain the final result, given by GT[0]k and LT[0]k. For this implementation with 64-bit operands, n = 64 and k = 6 stages are required.

Table 5. BCD Magnitude Comparison
GT[i]   LT[i]   Result
0       0       A[2i+1,2i] = B[2i+1,2i]
0       1       A[2i+1,2i] < B[2i+1,2i]
1       0       A[2i+1,2i] > B[2i+1,2i]
1       1       invalid

Figure 3. Magnitude comparator example, A = 0x8D and B = 0x93; n = 8, so k = 3 stages are necessary.

Figure 3 illustrates an example using this logarithmic-tree magnitude comparator. The operand size in this case is 8 bits and therefore only 3 stages are necessary (k = 3).
The comparison computed is A against B, where A = 0x8D and B = 0x93. Each number is separated into bit pairs and each corresponding pair is compared individually. Subsequent stages group the LT and GT signals together as shown. The final stage yields GT = 0 and LT = 1, giving A < B as expected.

2.1.2 Two's complement and binary floating-point comparator

The magnitude of the operands, however, is not the only characteristic considered when comparing numbers. To compare two's complement or floating-point numbers, the sign must also be considered. The third stage shown in Figure 2 sets the LT output to one in any of the following four cases:
1) A is negative and B is positive.
2) A and B are positive and the magnitude of A is less than the magnitude of B.
3) A and B are negative two's complement numbers and the magnitude of A is less than the magnitude of B.
4) A and B are negative floating-point numbers and the magnitude of A is greater than the magnitude of B.

To minimize the complexity of the other predicates, Greater Than (GT) and Equal to (EQ), logic is asserted based on whether the input operands are LT or not, given that LT, GT, and EQ cannot all be asserted simultaneously. This translates to simple logic for both 32- and 64-bit numbers. To make sure the values for the cases listed above are produced correctly in this implementation, only LT[0]6 and GT[0]6 are computed with dedicated logic, since EQ[0]6 can be produced by the following equation:

EQ[0]6 = (LT[0]6 + GT[0]6)'

i.e., EQ is asserted when neither LT nor GT is. The subindex 6 denotes the sixth level of the magnitude comparator, the final stage for 64-bit operands; in other words, LT[0]6 and GT[0]6 are the outputs of the magnitude comparator module.
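The stage equations above can be checked with a small behavioral model. The following Python sketch (a simulation aid, not the gate-level design) applies the stage-1 pair equations and then merges results pairwise, most significant pair winning, until a single (GT, LT) pair remains.

```python
def compare_magnitude(A, B, n):
    """Return (GT, LT) for n-bit unsigned operands A and B (n a power of two)."""
    # Stage 1: compare each 2-bit pair A[2i+1,2i] against B[2i+1,2i].
    gt, lt = [], []
    for i in range(n // 2):
        a1, a0 = (A >> (2 * i + 1)) & 1, (A >> (2 * i)) & 1
        b1, b0 = (B >> (2 * i + 1)) & 1, (B >> (2 * i)) & 1
        na1, na0, nb1, nb0 = a1 ^ 1, a0 ^ 1, b1 ^ 1, b0 ^ 1   # complements
        gt.append((a1 & nb1) | (a1 & a0 & nb0) | (a0 & nb1 & nb0))
        lt.append((na1 & b1) | (na1 & na0 & b0) | (na0 & b1 & b0))

    # Later stages: GT[i] = GT[2i+1] + GT[2i]·LT'[2i+1], symmetrically for LT.
    while len(gt) > 1:
        gt, lt = (
            [gt[2*i+1] | (gt[2*i] & (lt[2*i+1] ^ 1)) for i in range(len(gt) // 2)],
            [lt[2*i+1] | (lt[2*i] & (gt[2*i+1] ^ 1)) for i in range(len(lt) // 2)],
        )
    return gt[0], lt[0]

# The worked example from Figure 3: 8-bit operands, 3 stages.
print(compare_magnitude(0x8D, 0x93, 8))   # (0, 1), i.e. A < B
```

Note that the two list comprehensions in the merge step are evaluated before either list is reassigned, which models the fact that each hardware stage reads only the previous stage's outputs.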
Consequently, the values of LT, EQ, and GT for the whole design can be produced for two's complement numbers as (primes denoting complements):

EQ = EQ[0]6·(A[63]·B[63] + A'[63]·B'[63])
LT = (A[63]·B'[63] + B'[63]·LT[0]6 + A[63]·LT[0]6)·EQ'
GT = (LT + EQ)'

Floating-point comparisons, on the other hand, are complicated by the exceptions mandated by the IEEE 754 standard. The major exception that must be detected during comparison is whether the operands are Unordered. According to the IEEE 754 standard, values are unordered if either operand is a NaN and a floating-point comparison is being performed. The hardware for detecting unordered operands may vary from one processor to the next, because the standard allows discretion in defining the specific signaling and quiet NaN bit patterns. The IEEE 754 standard also states that comparisons must raise an Invalid Operation exception if either operand is a signaling NaN [1]. Furthermore, a final test must be performed to make sure +0 and −0 compare Equal, regardless of the sign. In summary, floating-point comparisons must be able to handle Invalid operations and both types of NaNs, and must not differentiate between the two types of zeroes. As with two's complement numbers, the comparator is simplified by computing whether the two operands are Equal or Less than each other. Once these two outputs are known, it is simple to produce the Greater than output. For the combined unit, the Less than comparison utilizes the same cases tabulated previously, accounting for a floating-point operation. Floating-point comparisons for Equal, on the other hand, need to be modified to account for either equal operands or the comparison of zeroes. Therefore, the combined comparator (two's complement and floating-point) handles two cases for determining whether the operands are Equal:
1) The operand magnitudes are equal AND the operands' signs are equal.
2) The operand magnitudes are zero AND the operands are floating-point numbers.
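Before the floating-point cases are added, the two's complement layer above can be sketched behaviorally. In this Python model (our own simulation aid, not the hardware), the magnitude comparator's LT[0]6/EQ[0]6 outputs are replaced by a plain unsigned comparison of the raw n-bit patterns, and the helper name `tc_compare` is ours.

```python
def tc_compare(A, B, n=64):
    """Compare two n-bit two's complement values; return (GT, EQ, LT) bits."""
    mask = (1 << n) - 1
    A, B = A & mask, B & mask                 # raw n-bit patterns
    lt0 = int(A < B)                          # stands in for LT[0] of the tree
    eq0 = int(A == B)                         # EQ[0] = NOR(LT[0], GT[0])
    a_s, b_s = A >> (n - 1), B >> (n - 1)     # sign bits A[n-1], B[n-1]

    # EQ: magnitudes equal AND signs equal.
    eq = eq0 & ((a_s & b_s) | ((a_s ^ 1) & (b_s ^ 1)))
    # LT: A negative and B positive, or same-sign operands whose raw bit
    # patterns compare less-than (valid for two's complement encodings).
    lt = ((a_s & (b_s ^ 1)) | ((b_s ^ 1) & lt0) | (a_s & lt0)) & (eq ^ 1)
    # GT: neither LT nor EQ.
    gt = (lt | eq) ^ 1
    return gt, eq, lt

print(tc_compare(-5, 3))    # (0, 0, 1): -5 < 3
```

Negative Python integers are masked to their n-bit two's complement patterns on entry, so the model can be exercised directly with signed values.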
The final equations for the combined comparator are given below, where Azero represents a literal testing whether A is +0 or −0, fp represents a literal specifying a floating-point operation, and UO represents a literal indicating unordered operands:

UO = (NaN_A + NaN_B)·fp
EQ = (EQ[0]6·(A[63]·B[63] + A'[63]·B'[63]) + Azero·fp)·UO'
LT = (A[63]·B'[63] + A'[63]·B'[63]·LT[0]6 + A[63]·B[63]·LT[0]6·fp' + A[63]·B[63]·LT'[0]6·fp)·EQ'·UO'
GT = (LT + EQ + UO)'

For these equations, logic is saved by testing only whether A is zero, since EQ[0]6 already indicates whether the operands are equal, making a test of B against zero redundant. Sign extension for 32-bit two's complement numbers is implemented by sign-extending the 32nd bit into the upper 32 bits of the comparator. IEEE single-precision numbers do not need to be converted to double-precision numbers, since the two formats have the same basic structure and the exponents are biased integers. The logic to detect NaNs and zeroes for the two floating-point formats differs slightly, since single-precision numbers have smaller significands and exponents than double-precision numbers.

2.2 Addition

Addition is a fundamental arithmetic operation, and the design of efficient adders also aids the performance and efficiency of other operation units, such as multipliers and dividers. Decimal addition has been researched, but not as heavily as binary addition, and only a handful of research papers can be found on the topic. Nevertheless, binary arithmetic is important for the decimal case, since decimal numbers are represented in binary and many concepts and techniques developed for binary addition can be applied to some extent as well. An overview of some relevant binary addition concepts is therefore necessary.

2.2.1 Binary addition

One of the most basic elements in addition is the Full Adder (FA) or 3:2 counter.
Adders are sometimes called counters because they effectively count the number of ones presented at their inputs [28]. The FA takes three single-bit inputs, xi, yi and ci, and produces two single-bit outputs, si and ci+1, corresponding to [29]:

xi + yi + ci = 2·ci+1 + si

where ci is commonly referred to as the carry-in and ci+1 as the carry-out. The logic equations for the full adder cell are given by:

si = xi ⊕ yi ⊕ ci
ci+1 = xi·yi + xi·ci + yi·ci

Figure 4. Full adder cell and truth table.

The full adder cell can be utilized to create n-bit operand adders as shown in the next figure. This simple approach, called the Ripple Carry Adder or Carry Propagate Adder (RCA/CPA), has the disadvantage of a significant time-consuming delay due to the long carry chain, as the carry propagates from c0 to c1 and onward all the way to the MSB, in this case s3.

Figure 5. 4-bit Ripple Carry Adder.

In order to speed up this process, certain aspects of the addition in each cell can be exploited, as is the case for the Carry Lookahead Adder (CLA). A carry present at the FA's carry-in from the previous significant bit is said to propagate if either xi or yi is equal to 1. On the other hand, if a carry is generated within the FA cell, when both xi and yi are 1, the cell is said to generate a carry-out. Logic equations for the generate and propagate signals (g and p), as well as an equation describing when a carry-out takes place, can therefore be determined from the inputs:

gi = xi·yi
pi = xi + yi
ci+1 = gi + ci·pi

The last equation can be utilized to determine the carry-out of the next significant bit FA:

ci+2 = gi+1 + ci+1·pi+1 = gi+1 + (gi + ci·pi)·pi+1

showing that ci+2 can be obtained exclusively from the operand inputs, without the need for the carry ci+1 as in Figure 5.
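The lookahead recurrence can be sketched behaviorally. This is an illustrative model (hypothetical function names): it unrolls ci+1 = gi + pi·ci so that every carry is derived from the generate/propagate signals and the initial carry-in only.

```python
def cla_carries(x: int, y: int, c0: int, n: int = 4) -> list[int]:
    """Compute all carries of an n-bit addition from generate/propagate
    signals: g_i = x_i AND y_i, p_i = x_i OR y_i, and the recurrence
    c_{i+1} = g_i + p_i * c_i, unrolled from c0."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) | (y >> i)) & 1 for i in range(n)]
    carries = [c0]
    for i in range(n):
        carries.append(g[i] | (p[i] & carries[i]))  # c_{i+1} = g_i + p_i*c_i
    return carries

def cla_add(x: int, y: int, c0: int = 0, n: int = 4) -> int:
    """Carry-lookahead addition: sum bits s_i = x_i XOR y_i XOR c_i."""
    c = cla_carries(x, y, c0, n)
    s = [((x >> i) ^ (y >> i) ^ c[i]) & 1 for i in range(n)]
    return sum(b << i for i, b in enumerate(s)) + (c[n] << n)
```

In hardware each unrolled carry expression is flattened into two-level logic, which is what makes the lookahead fast at the cost of rapidly growing gate fan-in.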
This provides a method of obtaining the carry-out result for each bit position without the need for a carry chain, and hence significantly speeds up the process. As the operand size increases, however, the complexity of each new bit's carry-out logic grows significantly, making the method impractical for operands of more than about 4 bits, depending on the technology used. In this case, further techniques allow carry generate and propagate signals to be obtained for an n-bit block and improve the adder's implementation.

2.2.2 Carry save addition (CSA)

Carry-save addition is the idea of utilizing adders that are not connected in series as in the Ripple Carry Adder, but that instead count, thereby avoiding the ripple carry chain. In this way multi-operand additions can be carried out without the excessive delay resulting from long carry chains. The following example shows how a 4-bit CSA accepts three 4-bit numbers and generates a 4-bit partial sum and a 4-bit carry vector, avoiding the connection of each bit adder's carry-out to the carry-in of the next adder.

Figure 6. Full adder cells used for carry-save addition.

The example demonstrates how performing addition on a given array (each column in the figure) produces an output with a smaller number of bits; in this case, from 3 bits to 2. This process is called reduction and is very useful during multiplication.

2.2.3 4:2 Compressors

One particularly useful carry-save adder is the 4:2 compressor presented in [30]. The main reason for using compressors is that their carry-out (cout) is no longer dependent on the cin, as shown in Figure 7. This gives compressors a significant advantage over traditional carry-save adder trees implemented with 3:2 counters, in that they can expedite processing of the carry chain while still maintaining a regular structure.

Figure 7. Weinberger 4:2 Binary Compressor.
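Both structures can be sketched behaviorally (illustrative function names). The carry-save step reduces three numbers to a sum word and a carry word with no inter-column ripple, and the 4:2 compressor cascades two 3:2 counters so that its cout leaves the cell before cin arrives.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 counter: counts the ones among its three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Reduce three numbers to (sum, carry) with one rank of independent
    full adders. The carry word is shifted left one position because
    each carry-out has twice the weight of its sum bit."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def compressor_4_2(a: int, b: int, c: int, d: int, cin: int):
    """4:2 compressor from two cascaded 3:2 counters. The first counter's
    carry-out (cout) leaves the cell directly, so cout never depends on
    cin -- the property highlighted for Figure 7."""
    s1, cout = full_adder(a, b, c)
    s, carry = full_adder(s1, d, cin)
    return s, carry, cout
```

For the 3 + 7 + 8 example in the text, the sum word is 1100 (12) and the weighted carry word is 0110 (6), which together still total 18.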
2.2.4 Decimal excess-3 addition

As stated earlier, decimal numbers are usually stored as Binary Coded Decimal (BCD). BCD numbers have 6 unused combinations, from 1010₂ to 1111₂, and this complicates addition and subtraction for the decimal case. Furthermore, negative numbers cannot be represented in two's complement fashion, which is a common method of subtraction in the binary case. A different coding for decimal numbers, called Excess-3, is important since it has many useful properties for subtraction and addition. Excess-3 code can be generated by simply adding a binary 3 to the common BCD code, as shown in Table 6.

Table 6. Excess-3 Code.

Decimal Value | Excess-3 Code
      0       |     0011
      1       |     0100
      2       |     0101
      3       |     0110
      4       |     0111
      5       |     1000
      6       |     1001
      7       |     1010
      8       |     1011
      9       |     1100

Except for some corrections necessary during addition/subtraction, common binary techniques can be applied for arithmetic operations. Most importantly, the addition of two numbers creates a decimal carry which is available at the carry output of the most significant binary bit. This occurs because the addition of two excess-3 digits creates a result in excess-6, which already eliminates the 6 unwanted binary combinations. Furthermore, the code is self-complementing [31]. This implies that a subtraction or negative number addition can be obtained by inverting all bits of the digit and adding a binary ulp, in the same way as for two's complement binary numbers. The following equation, taken from [31], shows the result of two Excess-3 numbers added together, where the underline represents a digit in BCD:

SUM = D1 + D2 = (D1 + 3) + (D2 + 3) = D1 + D2 + 6

There are two possibilities to consider for the sum result. When D1 + D2 < 10, no carry to the next higher digit is needed and the Excess-6 result can be corrected by simply subtracting 3 from the sum. This can be easily accomplished by adding 13 and ignoring the carry output (which effectively subtracts 16), for a net subtraction of 3.
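A behavioral sketch of excess-3 digit addition, covering both correction cases (subtract 3 when there is no decimal carry, add 3 when the most significant bit produces one); function names are illustrative, not from [31]:

```python
def excess3(d: int) -> int:
    """Encode a decimal digit (0-9) in excess-3: BCD value plus 3."""
    assert 0 <= d <= 9
    return d + 3

def excess3_digit_add(a: int, b: int, cin: int = 0) -> tuple[int, int]:
    """Add two excess-3 encoded digits with a plain 4-bit binary adder.
    The raw result is in excess-6, so the 4-bit carry-out directly
    signals a decimal carry. Correction: subtract 3 when no carry,
    add 3 when a carry occurred, restoring excess-3 coding."""
    raw = a + b + cin            # excess-6 result
    carry = raw >> 4             # carry out of the most significant bit
    raw &= 0xF
    digit = raw + 3 if carry else raw - 3
    return carry, digit
```

For example, adding the excess-3 codes of 7 and 5 raises the MSB carry (signaling the decimal carry of 7 + 5 = 12) and leaves the excess-3 code of 2 as the sum digit.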
When D1 + D2 ≥ 10, a decimal carry should be signaled to the digit in the next place. This can be accomplished by sending the carry-out signal of the most significant bit. Nevertheless, this sends a carry of 16 (6 too much), and hence by adding 3 the result is restored into Excess-3 code. Note that in both cases the correction requires the addition of 3 or 13, which can be accomplished by a simple inverter on the LSB output. Figure 8 shows an implementation of an Excess-3 adder.

Figure 8. Adder for Excess-3 code, taken from [31].

2.2.5 Direct decimal addition

The use of Excess-3 code permits the addition of two decimal numbers by means of a correction method that accounts for the six unwanted BCD combinations (1010₂ to 1111₂) after the operation takes place. A different approach, proposed in [15], presents logic that performs direct decimal addition, where a combinational element takes as inputs two 4-bit BCD digits xi and yi and a carry-in ci[0], and outputs a 4-bit BCD digit si and a 1-bit carry-out ci+1[0] satisfying:

(ci+1, si) = xi + yi + ci[0]

where ci+1 has ten times the weight of si.
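This defining relation can be modeled behaviorally. The sketch below is not the gate-level logic of [15]; the function names are illustrative, and the digit adder is simply the arithmetic specification (c_out, s) = x + y + c_in with a decimal-weighted carry.

```python
def bcd_digit_add(x: int, y: int, cin: int) -> tuple[int, int]:
    """Behavioral direct decimal (BCD) full adder digit:
    (c_out, s) = x + y + cin, where c_out carries ten times the
    weight of the BCD sum digit s."""
    assert 0 <= x <= 9 and 0 <= y <= 9 and cin in (0, 1)
    total = x + y + cin
    return total // 10, total % 10

def bcd_ripple_add(a: list[int], b: list[int]) -> list[int]:
    """Ripple the digit adder across two equal-length, little-endian
    BCD digit lists (least significant digit first)."""
    carry, out = 0, []
    for x, y in zip(a, b):
        carry, s = bcd_digit_add(x, y, carry)
        out.append(s)
    return out + [carry] if carry else out
```

The same digit cell can be wired for carry-save use instead, as the text notes for the gate-level version.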
The following are the logic equations that describe the direct decimal adder [12]:

gi[j] = xi[j]·yi[j],  0 ≤ j ≤ 3  ("generate")
pi[j] = xi[j] + yi[j],  0 ≤ j ≤ 3  ("propagate")
hi[j] = xi[j] ⊕ yi[j],  0 ≤ j ≤ 3  ("addition")

ci[1] = gi[0] + pi[0]·ci[0]
li = pi[3] + gi[2] + pi[2]·gi[1]
ki = gi[3] + pi[3]·pi[2] + pi[3]·pi[1] + gi[2]·pi[1]

si[0] = hi[0] ⊕ ci[0]
si[1] = (hi[1] ⊕ ki)·ci[1]' + (hi[1] ⊕ li)·ci[1]
si[2] = pi[2]·gi[1] + pi[3]·hi[2]·pi[1] + (gi[3] + hi[2]·hi[1])·ci[1]' + (pi[3]·pi[2]·pi[1] + gi[2]·gi[1] + pi[3]·pi[2])·ci[1]
si[3] = ki·li·ci[1] + (gi[3]·hi[3]' + hi[3]·hi[2]·hi[1])·ci[1]'

ci+1[0] = ki + li·ci[1]

These equations describe a decimal full adder that can be utilized for either carry-save or carry-propagate addition.

2.2.6 Decimal Floating-Point Adder

To the author's knowledge, the only published work to date on an arithmetic module compliant with the current IEEE 754 revision draft is the decimal floating-point adder published by Thompson, Karra and Schulte in [16], and hence its inclusion in this section is significant. This design differs from previous decimal adders in that it is fully compliant with the standard, including special value cases and exception handling, and in that it is capable of generating a complete result in a single cycle instead of a single digit per cycle.

Figure 9. Decimal Floating-Point adder, from [16].

Figure 9 shows a block diagram of the adder design. Initially, the two IEEE 754 decimal numbers are decoded into their sign bits, coefficient (BCD) and exponent fields (two's complement binary).
The operand exchange block orders the coefficients according to which number's exponent is greater, followed by the operation unit, which determines the actual operation to be performed (addition or subtraction) depending on the signs of the operands. The coefficients, or significands, are aligned, and a conversion into Excess-3 format follows for their respective binary addition and flag bit determination. The result is finally corrected, depending on the previously set flags, then shifted and rounded, allowing it to be encoded back into IEEE 754 decimal format. The adder presented in this work also allows up to 5 stages of pipelining, which improves its critical path delay. It is also of importance because it explores the advantage of using Excess-3 coding for decimal addition.

2.3 Binary Multiplication

Since decimal operations are performed on binary circuits, understanding how binary multiplication is achieved aids in the application and development of new techniques for the decimal case. A brief overview of its most significant implementations is given in this section for that purpose. The multiplication of two binary numbers, multiplier X (xN−1, xN−2, …, x1, x0) and multiplicand Y (yM−1, yM−2, …, y1, y0), is determined by the following equation [32]:

P = X·Y = ( Σ_{i=0}^{N−1} xi·2^i ) · ( Σ_{j=0}^{M−1} yj·2^j ) = Σ_{i=0}^{N−1} Σ_{j=0}^{M−1} xi·yj·2^{i+j}

This is illustrated in Figure 10 for the case of 6-bit operands:

Figure 10. Partial products during multiplication, taken from [32].

Each partial product bit can be generated by a simple AND gate between corresponding positions of the multiplier and multiplicand bits, as shown in Figure 3. All partial products are reduced or "compressed" by addition into a single product result. A direct implementation of this method is given in the Carry Save Array Multiplier, or CSAM.
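The double-summation view of the product maps directly onto an AND-gate partial-product matrix followed by weighted reduction. A minimal behavioral sketch (illustrative function names; a real CSAM performs the reduction with full/half adder cells rather than integer addition):

```python
def partial_products(x: int, y: int, n: int, m: int) -> list[list[int]]:
    """Entry (i, j) is the AND of multiplier bit x_i and multiplicand
    bit y_j; its binary weight is 2**(i + j)."""
    return [[(x >> i) & (y >> j) & 1 for j in range(m)] for i in range(n)]

def reduce_by_addition(pp: list[list[int]], n: int, m: int) -> int:
    """Sum every partial-product bit at its weight, i.e. evaluate
    P = sum_i sum_j x_i * y_j * 2**(i + j)."""
    return sum(pp[i][j] << (i + j) for i in range(n) for j in range(m))
```

For 4-bit operands this produces the same result as direct multiplication, e.g. 13 × 11 = 143.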
In this type of multiplier the partial product array shown in Figure 10 is skewed into a square shape so that its implementation is more efficient for VLSI. The compression is performed by full adders (FAs / 3:2 counters) and half adders (HAs). Figure 11 shows this array for the multiplication of two 8-bit operands. MFA and MHA cells represent full adders and half adders with an additional AND gate input to generate the partial product. The highlighted arrow shows the critical path of the circuit: the longest carry propagation chain. In Figure 10, which has 6-bit operands instead of 8, this would correspond to the addition of the 6th column (partial products x0y5 to x5y0) plus the carry propagation through the last adder that generates the product, from p5 to p11. This long carry chain significantly limits the performance of the multiplier and becomes even more considerable as the operand size increases. One of the most significant works to address this problem was proposed by Wallace [33]. Wallace suggested the use of full adders and half adders in a recursive fashion, adding three elements at a time in a carry-propagate-free way. In this manner, the partial product array can be reduced in successive stages to two numbers without carry propagation. When the resulting two numbers are obtained, a Carry Propagate Addition (CPA) takes place to obtain the final result [34].

Figure 11. Carry Save Array Multiplier, CSAM.

This is shown in the example in Figure 12. The diagram illustrates a 4-bit multiplication where the resulting partial products are shown at the top, analogous to the 6-bit multiplication of Figure 10. Each dot represents a partial product bit position (e.g., x0y5, x3y2, etc.). The second step shows the partial product array reorganized into a triangular shape, where an oval around the dots represents a full adder (3 inputs) or a half adder (2 inputs).
The result of each addition produces two elements: a sum and a carry-out to the next significant bit position (the column to its left). The process is repeated until, at the final stage, only two bit-array numbers are left, and a reduced-size carry propagate addition is required to produce the final result.

Figure 12. Wallace Tree multiplier reduction for two 4-bit operands. Carry and sum bits for the Half Adder shown.

Figure 13 illustrates how a Wallace tree for 6-bit operands is implemented using FAs and HAs. In this case, the partial products corresponding to the 5th column are detailed. A partial product column array of 5 bits feeds a FA and a HA. The carry-out bits produced are the inputs for the 2nd-stage FA in the next column. The output sum bits are passed directly to the next-stage FA within the same column. In this manner, a partial product reduction tree can be formed.

Figure 13. Implementation detail for Wallace tree, 5th column (only 2 stages are shown).

As can be seen from the example shown in Figure 13, the resulting Wallace reduction tree is not regular and hence causes difficulties when the circuit layout is implemented. Nevertheless, 4:2 compressors (presented in Section 2.2.3) can be organized into efficient interconnection networks for reducing the partial product matrix in a manner that is regular and more suitable for implementation. However, careful attention has to be paid when organizing these compressor trees, because the carry terms within the 4:2 compressor have a weight that is one more than their sum, corresponding to the next significant bit (the column to the left).
This means that compressor trees must be built according to the following:

- The sum output of any compressor in the tree takes the current weight of its column.
- The carry output of any compressor in the tree must be summed at the next higher weight (the column to the left of the current column).

Therefore, although compressor trees are traditionally drawn as binary trees, they must be organized carefully so that the counter outputs are summed together properly. Figure 14 shows an example of an 8-bit compressor tree for three columns. It can be seen that the carry-in for each element comes from its previous column.

2.4 Decimal Multiplication

Decimal multiplication is considerably more involved and has not been researched as heavily as its binary counterpart. There are, however, certain studies and ideas from the 1950s, when decimal arithmetic was researched significantly, that are worth mentioning and that have sometimes aided more modern developments; nevertheless, there are only a few modern papers on the topic.

Figure 14. Three-column 8-bit 4:2 binary compressor tree.

One of the main difficulties lies in the generation of the partial products. In the binary case this can be accomplished by a simple AND gate, which produces a single-bit result per digit. In the decimal case, however, the inputs are not single bits but decimal numbers, usually coded in BCD, which implies two 4-bit inputs per digit multiplication. The result is in decimal as well, and therefore a 4-bit output is produced. One possible way of implementing the multiplication algorithm is to follow the pencil-and-paper approach, as shown in [31]. In this method a multiplication table is known beforehand, and the result of the multiplication of each multiplicand digit with a multiplier can be determined by table lookup or combinational logic. A performance improvement might be obtained if the resulting number is considered separately and divided into a left digit (tens) and a right digit (units).
An addition accumulator can be used for each digit and the final result computed at the end. Table 7 shows the decimal multiplication table used for each component, and Figure 15 shows an example of the algorithm.

Table 7. Decimal Multiplication Table, from [31].

Left Digit Component (tens):
× | 0 1 2 3 4 5 6 7 8 9
0 | 0 0 0 0 0 0 0 0 0 0
1 | 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 0 0 1 1 1 1 1
3 | 0 0 0 0 1 1 1 2 2 2
4 | 0 0 0 1 1 2 2 2 3 3
5 | 0 0 1 1 2 2 3 3 4 4
6 | 0 0 1 1 2 3 3 4 4 5
7 | 0 0 1 2 2 3 4 4 5 6
8 | 0 0 1 2 3 4 4 5 6 7
9 | 0 0 1 2 3 4 5 6 7 8

Right Digit Component (units):
× | 0 1 2 3 4 5 6 7 8 9
0 | 0 0 0 0 0 0 0 0 0 0
1 | 0 1 2 3 4 5 6 7 8 9
2 | 0 2 4 6 8 0 2 4 6 8
3 | 0 3 6 9 2 5 8 1 4 7
4 | 0 4 8 2 6 0 4 8 2 6
5 | 0 5 0 5 0 5 0 5 0 5
6 | 0 6 2 8 4 0 6 2 8 4
7 | 0 7 4 1 8 5 2 9 6 3
8 | 0 8 6 4 2 0 8 6 4 2
9 | 0 9 8 7 6 5 4 3 2 1

Nevertheless, a performance constraint of this approach is that the number of additions required to perform the multiplication is one greater than the number of digits in the multiplier, and it is hence slow when compared to other methods. An alternative approach, also proposed in [31], attempts to generate the partial product digits by addition instead of a table lookup. This is accomplished by over-and-over addition, where the digits of the multiplier are checked one by one and the multiplicand is added an equivalent number of times to an accumulator. This approach, however, can be very time consuming, as the number of additions required is significant.

Figure 15. Decimal multiplication table Left and Right algorithm example (916 × 93: left accumulator 2010 + 80500 = 82510, right accumulator 738 + 1940 = 2678, product 82510 + 2678 = 85188), taken from [31].

To reduce the number of additions and speed up the multiplication, a technique called Doubling and Quintupling can be used. With Doubling, the value of twice the multiplicand is calculated before the actual accumulation takes place. This allows a faster accumulation, as it reduces the number of additions necessary.
If the multiplier digit is 5, for example, the number of additions required is 3 (2+2+1) instead of 5. The value of twice the multiplicand can be obtained by adding the number to itself with a decimal adder. An important speedup can be accomplished, however, because the number is only being added to itself: some input combinations are never present, which significantly simplifies the functions for the output result and does not require a full decimal adder. Quintupling, on the other hand, consists of calculating five times the multiplicand, 5M, beforehand. This can be accomplished by noting that a multiplication by 5 can be performed as a multiplication by 10 (decimal left shift) followed by a division by 2 (binary right shift), with certain corrections. In this way, a value of 5M can be used for the accumulation and further reduce the number of additions required for the multiplication. The ideas discussed have been implemented mostly using a serial or sequential approach, as shown in [35], [36], [37] and [38]. Two proposals, however, are significant and are detailed below.
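The doubling and quintupling multiples, and the way two multiple sets cover every digit from 0 to 9 as in [13], can be sketched as follows. This is an illustrative model: the digit-split table below is an assumption chosen so each digit decomposes into one value from {0, 1M, 4M, 5M} plus one from {0, 2M, 4M}; it is not taken from [13].

```python
def double_multiple(m: int) -> int:
    """Doubling: 2M, formed in hardware by adding the operand to itself
    with a simplified decimal adder."""
    return m + m

def quintuple_multiple(m: int) -> int:
    """Quintupling: 5M formed as multiply-by-10 (decimal left shift)
    followed by divide-by-2 (binary right shift); modeled arithmetically."""
    return (m * 10) // 2

def times_digit(m: int, d: int) -> int:
    """Produce d*M for a decimal digit d with a single addition of two
    precomputed multiples, one from each set."""
    two_m = double_multiple(m)
    four_m = double_multiple(two_m)
    five_m = quintuple_multiple(m)
    first = {0: 0, 1: m, 4: four_m, 5: five_m}    # set {0M, 1M, 4M, 5M}
    second = {0: 0, 2: two_m, 4: four_m}          # set {0M, 2M, 4M}
    # hypothetical split: digit -> (first-set pick, second-set pick)
    split = {0: (0, 0), 1: (1, 0), 2: (0, 2), 3: (1, 2), 4: (0, 4),
             5: (5, 0), 6: (4, 2), 7: (5, 2), 8: (4, 4), 9: (5, 4)}
    a, b = split[d]
    return first[a] + second[b]
```

The key point is that any multiplier digit costs one addition of precomputed multiples, instead of up to nine over-and-over additions.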
This representation, called overloaded decimal, permits the complete use of all 4 bits comprising the decimal digit, and hence the values from A₁₆ to F₁₆ are allowed. In this way, the correction back into decimal is avoided in each iteration's addition. The process continues until all digits in the multiplier operand are consumed. In the final product, each digit is corrected from overloaded decimal back into BCD by adding 6 when the digit lies in the range A₁₆ to F₁₆, which is easily accomplished with two-level logic. A carry of value 1 is also added to the next-order digit.

2.4.2 Multiplication with efficient partial product generation

A different iterative multiplication approach is proposed by Erle, Schwarz and Schulte in [14]. In this design the partial products are calculated in a digit-by-digit multiplier, creating a digit-by-word (multiplicand multiple) signed-digit partial product. The multiplier operand is examined digit by digit, from least significant to most significant, and as each partial product is obtained it is accumulated with previous results to form the product. The most significant characteristic of this design is the recoding of the multiplier, restricting the range of each digit by utilizing a redundant representation from −5₁₀ to 5₁₀. In this way the digit multiplier is simplified, since there are no longer two input numbers with ten possibilities each, but two inputs with values ranging from 0 to 5. The sign of the product is obtained simply by examining the signs of the input digits. The multiples 0 and 1 correspond to trivial multiplication results, and therefore the range of input digits considered is effectively restricted to just the numbers from 2 to 5. This significantly speeds up the process, since the possible input combinations are reduced from 100 to only 16; it nevertheless complicates the final product calculation, since the result needs to be recoded back into BCD from a redundant representation.
Table 8 shows the multiplication table for the recoded operand digit values.

Table 8. Restricted range, signed magnitude products, from [14].

The problem, however, with these last two proposals, [13] and [14], is that they have limited parallelization and hence are difficult to use in a pipelined system. In other words, the computation of the multiplication in both cases is highly sequential, since the partial products are accumulated one by one as they are obtained. This forces the multiplier to be busy and unavailable for further computations in a pipeline until a result is computed. Only then can it accept a new operation, which is unacceptable in most of today's floating-point units. It is therefore desirable to research methodologies that allow multiplication to be as parallel as possible, as, for example, the CSA Multiplier and the Wallace tree do for the binary case.

3. DECIMAL FLOATING-POINT COMPARATOR

As stated earlier, a comparator is an important element in general-purpose and application-specific architectures, and the design of an efficient, high-speed decimal comparator aids the performance of these architectures. This work proposes a high-performance 64-bit decimal floating-point comparator, compliant with the current draft of the IEEE 754R standard for floating-point arithmetic. This is the first implementation of a decimal floating-point comparator compliant with the draft standard. The design can also be easily extended to support 128-bit decimal floating-point numbers, and even though it is not pipelined, it can accommodate pipelining to improve throughput.

3.1 Decimal floating-point comparison

Floating-point comparisons are specified by the IEEE 754 standard [1]. The comparator proposed accepts two 64-bit decimal floating-point numbers. In the same way as binary comparisons, the relation between the two numbers is given by four mutually exclusive conditions: greater than, less than, equal and unordered.
The numbers are unordered when either one or both operands compared are Not-a-Number, or NaN. If the NaN is a signaling NaN, then an Invalid exception flag bit is asserted. The result of the comparison is represented by Floating-point Condition Codes, or FCC, presented in Table 4. Again, bit FCC[1] is analogous to a greater than flag (GT) and FCC[0] is analogous to a less than flag (LT). The design, however, differs significantly from its binary counterpart, mainly because the current IEEE 754 revision specifies that decimal floating-point numbers are not normalized, and therefore the representation is redundant in nature. This implies, for example, that the numbers 125 × 10^6 and 1250 × 10^5 are both representable and should be recognized as equal during comparison. Binary floating-point numbers, on the other hand, do not allow redundancy. Without redundancy, numbers can be compared as pure binary integers (since biased exponents are used) without the need to separate exponent and coefficient and perform alignment, as proposed in [26]. The core of the comparison lies in the magnitude comparator used for the coefficients. A usual scheme for comparing the two coefficients is to subtract them: taking into account the signs of the operands, the sign of the result determines greater than or less than, and a zero subtraction result indicates equality. This approach is advantageous in a system in the sense that the existing decimal floating-point adder/subtractor hardware can also be utilized for this purpose without an area increment. A dedicated decimal comparator, however, is only reasonable if it provides a significant speed improvement at the cost of a small area overhead when compared to a floating-point subtraction approach.
To its advantage, however, the comparator can benefit from the fact that the difference between the two numbers is not required, and that the result of the subtraction does not need to be rounded and recoded into the decimal floating-point format. Furthermore, an adder/subtractor requires a greater working digit precision (extra guard digits, for example) than what can be represented in the format, to account for rounding and normalization [29]. Before subtraction takes place, since the floating-point numbers to be compared may have different exponents, their coefficients need to be aligned. The example in Figure 16 illustrates the classic approach to comparison, where the number 3149 × 10^23 is compared to 90201 × 10^16.

Figure 16. Classic approach to floating-point comparison. A = 3149 × 10^23 and B = 90201 × 10^16 have an exponent difference of 7, so coefficient alignment is necessary. The number with the greater exponent is shifted left to remove leading zeros, giving A = 31490000 × 10^19; the exponent difference is now 3, so alignment is still necessary. The number with the smaller exponent is then shifted right by the remaining difference, giving B = 90.201 × 10^19.
Notice how digits are lost right after shifting; with the exponents equal, the coefficient comparison can take place, yielding A > B. The comparator proposed here, however, utilizes a scheme that avoids subtraction for the coefficient comparison and instead uses a faster approach. It also avoids the use of extra digits: only a working precision of 16 decimal digits, as in the standard, is used.

3.2 Comparator Design

An overview of the design of the decimal floating-point comparator is given in Figure 17. A decoding module, the IEEE 754 decoder, converts the decimal64 operands (A and B) into a format that can be utilized for comparison. The combination field is processed, and the IEEE 754 decoder outputs the number's sign, its exponent in unsigned 10-bit binary form, its coefficient as a 64-bit BCD-encoded number, and flags indicating whether the number is infinite, a quiet NaN or a signaling NaN. Since alignment is needed, the value of the difference between the operands' exponents is necessary, and subtraction is therefore required. The 10-bit unsigned binary exponents are compared in the Exponent Comparison module. This module contains a 10-bit Carry Lookahead Adder for fast operation and performs subtraction to determine which exponent is smaller, or whether they are equal, and the amount of shifting necessary for the coefficient alignment. This amount is passed to the Coefficient Alignment module. Alignment is performed by left shifting the coefficient of the number with the greater exponent, thus reducing its exponent magnitude. Since the representation allows 16 digits, shifting is limited to between 0 (no shift) and 15 digits. Larger alignment needs are evaded by treating them as special-case scenarios. Consequently, this aids in keeping the coefficient digit size (i.e., working precision) restricted to 16 digits, allowing the coefficient magnitude comparison module to yield a result faster, given that its delay and complexity grow logarithmically (base 4). These special cases or scenarios are treated in the following subsections.
Figure 17. Decimal floating-point comparator design.

Figure 18. Coefficient Aligner, logarithmic barrel shifter.

3.3 Coefficient magnitude comparison

Once the 64-bit/16-digit coefficients are correctly aligned, their magnitudes can be compared. The magnitude comparator designed is based on the comparator proposed in [26] and presented in Section 2.1, with significant performance modifications tailored specifically to BCD number comparison. The two operands A and B are compared in stages. The first stage compares corresponding 4-bit BCD digits from each operand. Sixteen of these elements are used in parallel to process the complete 16-digit coefficients. Two output bits (GT and LT) from each of these elements indicate the result as greater than, less than or equal. Table 9 shows the magnitude comparison, where GT[i]j and LT[i]j represent greater than and less than flags, respectively, as in Section 2.1, and A[i] and B[i] represent digit i of the operand coefficients. The subscript j indicates the stage of the comparison.

Table 9. BCD Magnitude Comparison.

GT[i] LT[i] | Result
  0     0   | A[i] = B[i]
  0     1   | A[i] < B[i]
  1     0   | A[i] > B[i]
  1     1   | invalid

The first stage's elements (one per digit pair compared) each have 8 input bits (two BCD digits) and 2 single-bit outputs, producing a truth table of 256 entries per output.
Since the numbers compared are BCD encoded, and not binary, the truth table is simplified by ignoring all entries where a 4-bit BCD number is greater than 9. This reduces the cases to be considered from 256 to 100 (the rest are don't cares) and significantly reduces the minimized sum-of-products expressions for LT[i]1 and GT[i]1. Subsequent stages compare the results of the previous stage in the same manner, forming a logarithmic tree: each element compares a set of four GT[i]j signals against the corresponding LT[i]j signals, where GT[i]j replaces A[i] and LT[i]j replaces B[i]. These elements are further optimized given that GT[i]j and LT[i]j are never both asserted at the same time, as shown in Table 9. The truth table cases are further reduced from 100 to 82, producing fast, simplified minimized sum-of-products expressions for LT[i]j+1 and GT[i]j+1, given that this is a comparison of two 4-bit quantities:

LT[i]j+1 = LT[i+3]j + (GT[i+3]j' · LT[i+2]j) + (GT[i+3]j' · GT[i+2]j' · LT[i+1]j) + (GT[i+3]j' · GT[i+2]j' · GT[i+1]j' · LT[i]j)

GT[i]j+1 = GT[i+3]j + (GT[i+2]j · LT[i+3]j') + (GT[i+1]j · LT[i+3]j' · LT[i+2]j') + (GT[i]j · LT[i+3]j' · LT[i+2]j' · LT[i+1]j')

The number of comparator stages is given by k = log4(4n), where n is the coefficient digit size. With n = 16 digits, a total of log4(4×16) = 3 stages are needed to yield the resulting LT and GT for the magnitude comparison of the coefficients. Figure 19 illustrates an example using this logarithmic tree magnitude comparator. The operand size is reduced to 4 digits instead of 16 for clarity. The comparison computed is A against B, where A = 2713₁₀ and B = 2195₁₀. Each number is separated into digits, and each corresponding digit pair is compared individually. Subsequent stages group the LT and GT signals together as shown. The final stage yields GT = 1 and LT = 0, as expected, giving A > B.
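The staged comparison can be sketched behaviorally with the sum-of-products recurrences above (illustrative function names; digit lists are least significant digit first, and the digit count is assumed to be a power of 4, as in the 16-digit design):

```python
def digit_compare(a: int, b: int) -> tuple[int, int]:
    """Stage-1 element: compare one pair of BCD digits -> (GT, LT)."""
    return int(a > b), int(a < b)

def compare_stage(pairs) -> tuple[int, int]:
    """Combine four (GT, LT) pairs, least significant first, using the
    sum-of-products recurrences for GT[i]j+1 and LT[i]j+1."""
    (g0, l0), (g1, l1), (g2, l2), (g3, l3) = pairs
    lt = (l3 | ((g3 ^ 1) & l2) | ((g3 ^ 1) & (g2 ^ 1) & l1)
          | ((g3 ^ 1) & (g2 ^ 1) & (g1 ^ 1) & l0))
    gt = (g3 | (g2 & (l3 ^ 1)) | (g1 & (l3 ^ 1) & (l2 ^ 1))
          | (g0 & (l3 ^ 1) & (l2 ^ 1) & (l1 ^ 1)))
    return gt, lt

def bcd_magnitude_compare(a_digits, b_digits) -> tuple[int, int]:
    """Logarithmic tree: one digit-comparator rank, then combining
    stages of four flags each. (0, 0) means the magnitudes are equal."""
    flags = [digit_compare(a, b) for a, b in zip(a_digits, b_digits)]
    while len(flags) > 1:
        flags = [compare_stage(flags[i:i + 4]) for i in range(0, len(flags), 4)]
    return flags[0]
```

For 16 digits the loop runs twice, matching the three total stages (16 digit comparators, then 4 combiners, then 1).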
Figure 19. BCD magnitude comparator (4-digit example: A = 2713, B = 2195, result A > B).

3.4 Special case scenarios

The result of the comparison of the two operands cannot always be obtained by a magnitude comparison of the aligned coefficients. It is possible for either of the operands to be a NaN (in which case the result should be unordered), positive/negative infinity, or positive/negative zero, all of which are representable. These possibilities are treated as special cases and can be determined early in the decoding phase.

The coefficients of the operands cannot always be correctly aligned within the 16-digit working precision of the magnitude compare module. Two possibilities can arise that are also considered special cases: the first occurs when the absolute value of the exponent difference between operands is greater than 15, and the second when the alignment of the coefficient produces a digit shifted out (overflow). These are treated along with the cases for NaNs, infinities and zeros and their respective signaling flags.

Each of the possible comparison scenarios sets its own output signals LT and GT, which affect the FCC comparison result. Five mutually exclusive enable flag bits indicate when each scenario occurs. The special case handling module, at the bottom of Figure 17, is responsible for determining the result according to the given situation.

3.4.1 One or both numbers is infinite

If one or both numbers are infinite, the comparison result can be obtained right away by examining the signs of the operands. If the operands are A and B then: A is less than B if A is negative infinity and B is positive infinity, OR if A is negative infinity and B is not infinity, OR if B is positive infinity and A is not infinity.
LT_inf = (inf_A · sign_A · sign_B') + (inf_A · sign_A · inf_B') + (inf_A' · inf_B · sign_B')

A is greater than B if A is positive infinity and B is negative, OR if A is positive infinity and B is not infinity, OR if A is not infinity and B is negative infinity.

GT_inf = (inf_A · sign_A' · sign_B) + (inf_A · sign_A' · inf_B') + (inf_A' · inf_B · sign_B)

Note that if both numbers are positive infinity, or both are negative infinity, GT_inf and LT_inf are both 0, signaling equality. The signaling flag that indicates this scenario is given by:

infinite_flag = infinity_A + infinity_B

where infinity_A/B is 0 if the number is finite and 1 if infinite.

3.4.2 Both operands are zero

In the current IEEE 754 draft, the value of zero in decimal is indicated by a zero coefficient, as opposed to a zero exponent in binary floating-point. If both operands are zero then the numbers are always equal regardless of their signs, since +0 and -0 are equivalent as specified by the standard. The bits LT and GT remain unmodified (both zero, indicating equality), guarded by the enable bit zero_flag, which prevents all other scenario modules (except for infinite numbers) from affecting the result. This flag is given by:

zero_flag = A_zero · B_zero · infinite_flag'

where A/B_zero is 0 if the number is nonzero and 1 if it is zero.

3.4.3 Exponent difference off-range

If the absolute value of the exponent difference between operands is greater than 15 (indicated by the signal shift_offrange in Figure 17), then the coefficients cannot be aligned, since the working precision allows 16 digits. This means that one of the numbers is evidently greater in magnitude than the other (e.g. 512 x 10^40 and 123 x 10^3). The comparison result can be obtained by knowing which of the numbers' exponents is greater and by examining the signs of the numbers. The signal greater_exp indicates which exponent is greater.
A is less than B if exponent B is greater than exponent A (greater_exp = 1) and both numbers are positive, OR if A is negative and B is positive, OR if both numbers are negative and exponent A is greater (greater_exp = 0). The numbers are never equal in this scenario.

LT_offrange = (greater_exp · sign_A' · sign_B') + (sign_A · sign_B') + (sign_A · sign_B · greater_exp')

A is greater than B if both numbers are positive and exponent A is greater than exponent B (greater_exp = 0), OR if A is positive and B is negative, OR if both numbers are negative and exponent B is greater (greater_exp = 1).

GT_offrange = (greater_exp' · sign_A' · sign_B') + (sign_A' · sign_B) + (sign_A · sign_B · greater_exp)

The signaling flag for this case is:

exp_flag = shift_offrange · zero_flag' · infinite_flag'

where shift_offrange is determined by calculating the exponent difference (>15) in the exponent compare module.

3.4.4 Alignment shift-out, overflow

If the exponent difference is within range (<=15), alignment of the coefficient takes place by left shifting. If a digit is shifted out (shift_overflow), then the comparison result can be determined by knowing the signs of the numbers. The greater exponent determines which of the two numbers is the one being aligned with respect to the other: when greater_exp = 0, exponent A is the one aligned, and vice versa. If alignment overflow occurs, A is less than B if: A is negative and B is positive, OR if A's alignment overflows (the magnitude of A is greater) and A is negative, OR if B's alignment overflows (the magnitude of B is greater) and B is positive.

LT_of = (sign_A · sign_B') + (greater_exp' · sign_A) + (greater_exp · sign_B')

A is greater than B if: A is positive and B is negative, OR if A's alignment overflows and A is positive, OR if B's alignment overflows and B is negative.
GT_of = (sign_A’ · sign_B) + (greater_exp’ ·sign_A’) + (greater_exp · sign_B) The equation for the signaling flag of this scenario is: align_flag = shift_overflow · exp_flag’ · zero_flag’ · infinite_flag’ where shift_overflow is asserted during alignment if a digit is shifted out. 3.4.5 Coefficient comparison If the exponent difference is within range and no shift overflow occurs after alignment then this indicates that the coefficients are correctly aligned and their comparison can be executed by the dedicated coefficient magnitude comparator module discussed in Section 3.3. The output signals from this module (LT and GT) are renamed to avoid confusion as LT_mag and GT_mag. Nevertheless, the final result of the comparison is not yet determined as the relationships of greater than, less than or equal do not only depend on which number’s magnitude is greater but also on their signs. The result of the comparison in this scenario is then given by the following conditions. A is less than B if: A is negative and B is positive OR A and B are positive and the magnitude of A is less than the magnitude of B OR A and B are negative and the magnitude of A is greater than the magnitude of B. LT_cmp = (sign_A · sign_B’) + (sign_A’ · sign_B’ · LT_mag) + (sign_A · sign_B · GT_mag) 49 A is greater than B if: A is positive and B is negative OR if A and B are both positive and the magnitude of A is greater than B OR if both numbers are negative and the magnitude of A is less than B. GT_cmp = (sign_A’ · sign_B) + (sign_A’ · sign_B’ · GT_mag) + (sign_A · sign_B · LT_mag) Mag_flag determines if the outputs GT_cmp and LT_cmp should affect the final result and it is given by: mag_flag = align_flag’ · exp_flag’ · zero_flag’ · inifinite_flag’ In other words, the magnitude comparison of the coefficients is only valid (through mag_flag) if none of the previous enable flags was triggered. 
The final equations for the comparator, considering all possible scenarios, are:

GT = unordered + (GT_inf · infinite_flag) + (GT_offrange · exp_flag) + (GT_of · align_flag) + (GT_cmp · mag_flag)

LT = unordered + (LT_inf · infinite_flag) + (LT_offrange · exp_flag) + (LT_of · align_flag) + (LT_cmp · mag_flag)

The signal unordered is asserted when either of the operands is a NaN. If this occurs, the result is overridden and is always unordered (GT = 1, LT = 1), as specified in Table 4. The signals GT and LT are finally produced in the special case handling module, which receives the flags indicating the different scenarios and is responsible for handling the comparison cases described in this section.

3.5 Combined binary floating-point, two's complement and decimal floating-point comparator

Given the similarity between the decimal comparator described here and its binary counterpart presented in Section 2.1, a single design capable of handling 32-bit and 64-bit two's complement numbers, single and double precision binary floating-point, and 64-bit decimal floating-point numbers is attractive: it would be especially useful for reducing cost in processors by allowing the same hardware to compare all three data types. The main difference between the binary and decimal comparator schemes is that the decimal floating-point representation is redundant, as stated before, and therefore requires alignment, while binary does not. Binary 32-bit floating-point numbers only require a sign extension, so that 32-bit and 64-bit formats can both be handled by the same hardware. The logic that can be shared between the two cases, however, is the core of the comparators: the magnitude comparison module, which in the decimal case comes into effect after alignment. The magnitude comparator logarithmic tree for the decimal case is composed of comparator elements that handle 4-bit BCD digits.
Each element has two digit inputs (two 4-bit BCD digits), as opposed to handling 2-bit pairs as in the binary case (Section 2.1). Optimization of the BCD element was possible since BCD encodes no values above 9; hence the codes A through F (hexadecimal) can be ignored, providing further simplification. An additional benefit is that fewer stages are necessary, since 4-bit digits are compared instead of 2-bit pairs. Nevertheless, the binary pair comparator element can be used for the decimal case instead of the BCD element, saving hardware since it would serve both formats. Tests and simulations were run, however, to see the impact of using the bit-pair comparison module from the previous section for the decimal comparator, and the results justified the joint design of the proposed module. Furthermore, less area would be required when implemented in a system, since the decoding of IEEE 754 decimal floating-point can be handled by a decimal floating-point unit potentially already present in the system.

An overview of this combined design is shown in Figure 20. The 3-bit signal Sel determines the format type of the operands: 32-bit or 64-bit two's complement, binary floating-point, or decimal floating-point. If the operands are binary, then the sign extension module sign-extends 32-bit numbers into 64 bits so that the same hardware can be utilized. In the decimal case, the input to the magnitude comparator is obtained after decoding and coefficient alignment. The exception and special case module handles the result of the magnitude comparator, taking into account the signs of the numbers, the data type and, for the decimal case, the flags for overflow, off-range and the others presented in Section 3.4.
Figure 20. Combined two's complement, binary and decimal floating-point comparator.

4. EXPERIMENTS FOR DECIMAL FLOATING-POINT DIVISION BY RECURRENCE

One widely used method for division operates by recurrence, or sequentially. In this method, the quotient is represented in a chosen radix and one digit is produced after each iteration. The quotient digit can also be selected from a redundant digit set, an approach with noteworthy speed and cost advantages. The main difficulty in a digit recurrence algorithm lies in the quotient digit selection function, or QDS. Several studies have attempted to simplify or improve this function. The Kornerup study presented in [40] gives an accepted analytical approach to determine the minimum number of digits required by the QDS function. This theory, however, is specific to the binary case and, hence, requires modification to be applied to the decimal case. This study attempts to provide insight into the implementation feasibility of a decimal digit recurrence divider utilizing recurrence division theory.

4.1 Decimal Division by Digit Recurrence Theory

As discussed previously, when implementing division by recurrence, the quotient digit of radix r lies within a symmetric redundant selection set of consecutive integers given by:

q_j ∈ D = {-a, ..., -1, 0, 1, ..., a}  for all  a >= ⌈r/2⌉ ,   (1)

where the most negative digit -a is written ā. The redundancy factor, or measure of redundancy, of the digit set is defined by:

ρ = a / (r - 1)  with  1/2 < ρ <= 1 .   (2)
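The digit-set constraint (1) and redundancy factor (2) can be evaluated directly for the decimal case. A small sketch using exact rational arithmetic:

```python
# Redundancy factor rho = a/(r-1) for radix 10 and each admissible a.
from fractions import Fraction

r = 10
rhos = {a: Fraction(a, r - 1) for a in range(6, 10)}   # a >= ceil(r/2) = 5
for a, rho in rhos.items():
    # every choice a = 6..9 yields a valid redundancy 1/2 < rho <= 1
    print(a, rho, Fraction(1, 2) < rho <= 1)
```

The maximally redundant decimal set {-9, ..., 9} gives ρ = 1, which is the case analyzed later in this chapter.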
The main recurrence when implementing division for a dividend x and divisor d is given by [41]:

w[j+1] = r·w[j] - d·q_j+1 ,   (3)

where r denotes the quotient radix, q_j+1 the selected quotient digit and w[j] the partial remainder in iteration j. Naturally, in our case the radix is decimal, or r = 10. In order for the recurrence in (3) to be valid through all iterations and guarantee a result, two basic conditions on the QDS must be met: containment and continuity. The value of the quotient digit q_j+1 is given by the selection function:

q_j+1 = SEL(r·w[j], d) .   (4)

The containment condition specifies that the quotient digit selected must keep the next partial remainder bounded, to satisfy convergence of the algorithm:

-ρ·d <= w[j] <= ρ·d .   (5)

This is summarized in Robertson's diagram, shown in Figure 21, where the limits on the vertical axis for w[j+1] are marked by the horizontal dotted lines. The selection interval of r·w[j] for which it is possible to select q_j+1 = k and keep the next residual bounded is defined as [Lk(d), Uk(d)]. Each interval, as shown in Figure 21, corresponds to a given quotient digit (e.g. q_j+1 = k-1 produces the interval [Lk-1(d), Uk-1(d)]). Expressions for Lk(d) and Uk(d) can be obtained from Robertson's diagram and are defined by [41][29]:

Lk(d) = (k - ρ)·d  and  Uk(d) = (k + ρ)·d .   (5)

Figure 21. Robertson's diagram showing the selection intervals for q = k-1 and q = k.

The continuity condition ensures that for any possible r·w[j] (the horizontal axis in Robertson's diagram) there exists a selectable quotient digit k (i.e. r·w[j] always lies within some selection interval [Lk, Uk]); otherwise a quotient digit would not be selectable. Therefore, the overlap between two consecutive intervals must be positive or, at minimum, zero: Lk <= Uk-1.
This condition is imposed by:

Lk <= Sk <= Uk-1 ,   (6)

where Sk denotes the partition points within the overlap region [Lk, Uk-1], such that the QDS returning q_j+1 may be defined by [40]:

Sk <= r·w[j] < Sk+1  ⇒  q_j+1 = k .   (7)

4.2 Quotient Digit Selection

The overlap is critically important, since it allows an inexact value of the divisor and the partial remainder to determine a suitable quotient digit. In this sense, only a limited number of leading digits of both divisor and partial remainder are required. With this consideration, the truncated partial remainder r·ŵ[j] is defined by:

r·w[j] = r·ŵ[j] + ε_rw ,   (8)

where ε_rw denotes the truncation error. Carry-save representations are often utilized in division by recurrence algorithms when computing (3), because carry-propagate adders would lengthen the critical path excessively. The truncation error, using a carry-save representation, is bounded by [40]:

0 <= ε_rw < 2·ulp(r·ŵ[j]) .   (9)

In a similar way, the truncated divisor is given by:

d = d̂ + ε_d  with  0 <= ε_d < ulp(d̂) .   (10)

Figure 22. Truncated inputs to the QDS function: the divisor d keeps u digits and the partial remainder r·w[j] keeps t digits to the right of the radix point.

Figure 22 illustrates the truncation of both divisor and partial remainder. The number of digits kept to the right of the radix point is given by u and t, respectively. Since both divisor and partial remainder utilize decimal (radix 10) representations, it follows that:

ulp(r·ŵ[j]) = 10^-t ,   (11)

ulp(d̂) = 10^-u ,   (12)

where ulp indicates a unit in the last (least significant) place. Therefore, the truncated divisor can be represented as an integer multiple m of ulp(d̂):

d̂ = m x 10^-u .   (13)

Division by digit recurrence theory is usually developed for fractional, normalized divisors [41]. Figure 23 illustrates the reasoning behind this: a non-normalized divisor would require infinite precision as d approaches zero.
Hence, the values of d are normalized and bounded by:

1/10 <= d < 1 ,   (14)

which implies, due to (13), that:

10^(u-1) <= m < 10^u .   (15)

Following the analysis approach in [40], Figure 23 shows the partition of the interval [Lk, Uk-1] by Sk, defined in (6), as a function of d̂ rather than d. Below Sk the selected quotient digit is q = k-1, and above Sk it is q = k. Sk(d̂) is now a staircase function, due to the truncation of both divisor and partial remainder, and delimits rectangles in which a given quotient digit can be selected despite the quantization. The value of Sk(d̂) can also be expressed as an integer multiple of ulp(r·ŵ[j]), with a constant s_k,m indexed by k and m:

Sk(d̂) = s_k,m x 10^-t .   (16)

Figure 23. P-D diagram: Sk selection points as a function of the truncated divisor d̂, [40].

The dotted rectangle in Figure 23 has its lower left-hand corner at (d̂, Sk(d̂)) and is limited by:

d̂ <= d < d̂ + ulp(d̂)   (17)

and

Sk(d̂) <= r·ŵ < Sk(d̂) + 2·ulp(r·ŵ) .   (18)

The study by Kornerup defines boundary conditions on this rectangle to determine the minimum number of digits after truncation for the partial remainder and the divisor, t and u respectively. The staircase Sk must lie above Lk over the whole rectangle; since Lk(d) grows with d, the boundary condition at the right-hand corner yields:

Lk(d̂ + ulp(d̂)) = (k - ρ)(d̂ + ulp(d̂)) <= Sk(d̂) .   (19)

It follows, using (5), (12) and (16), that:

(k - ρ)(m·10^-u + 10^-u) <= s_k,m x 10^-t ,   (20)

s_k,m >= ⌈10^(t-u) (k - ρ)(m + 1)⌉ .   (21)

The height of the rectangle is 2·ulp(r·ŵ). Nevertheless, vertically consecutive rectangles are spaced by one ulp(r·ŵ) (the resolution of Sk). Rectangles overlapping from below should select the value k - 1 and, therefore, the midpoint of the left edge should lie under Uk-1. This boundary condition, combined with (5), yields:

Sk(d̂) + ulp(r·ŵ) <= Uk-1(d̂) = (k - 1 + ρ)·d̂ .   (22)
Again, using (11) and (16) in this inequality gives:

s_k,m <= ⌊10^(t-u) (k - 1 + ρ)m - 1⌋ .   (23)

Combining (21) and (23) yields floor and ceiling expressions bounding the possible values of s_k,m:

⌈10^(t-u) (k - ρ)(m + 1)⌉ <= ⌊10^(t-u) (k - 1 + ρ)m - 1⌋ .   (24)

Rearranging terms results in an expression of the form:

⌈Am + B⌉ <= ⌊(A + C)m⌋ ,   (25)

with A = 10^(t-u)(k - ρ), B = 10^(t-u)(k - ρ) + 1 and C = 10^(t-u)(2ρ - 1). Excluding the trivial solution, where the quotient selected is zero, and given condition (2), it is seen that A >= 0, B >= 1 and C > 0 for k >= 1. For condition (25) to hold, it is necessary that C·m >= B. Nevertheless, the stronger condition C·m - B >= 1 guarantees at least one integer between the floor and ceiling functions, yielding:

10^(t-u) (2ρ - 1)m - 10^(t-u) (k - ρ) - 1 >= 1 .   (26)

This condition should hold at the extreme case of the values of m and k in order for (25) to be valid. This occurs at the maximum value of the quotient digit, k = a, and, due to (15), at the minimum case m = 10^(u-1). Rearranging terms gives:

2·10^-t <= 10^-1 (2ρ - 1) - 10^-u (a - ρ) ,   (27)

which produces a minimal value of t for a known u. Furthermore, it is clearly seen that the right-hand side of (27) must be positive, enabling a minimum u to be obtained from:

10^-u < (2ρ - 1) / (10 (a - ρ)) .   (28)

4.3 Considerations for the IEEE 754 Decimal Case

The application of the previous analysis, with radix 10, to the IEEE 754 revision draft for decimal floating-point requires some considerations. Most significantly, the revision draft specifies that decimal floating-point numbers are not normalized and, therefore, the representation is redundant in nature. This implies, for example, that the numbers 125 x 10^5 and 1250 x 10^4 are both representable and equal in magnitude. The standard also specifies that the mantissa is represented as an integer and not as a fraction with a leading binary 1, as in the binary case (i.e. 0.1xxxx...₂).
This complicates the application of the algorithm, since both divisor and dividend operands in the binary case are limited in range to 0.1₂ <= x < 1₂. Unsigned integer division has operands 0 <= x < r^n - 1 and 0 <= d < r^n - 1, and produces a quotient q such that [41][29]:

q = ⌊x / d⌋ .   (29)

As mentioned previously, basic integer division algorithms require full precision for the QDS function. To apply fractional division theory, the divisor d should first be normalized, by shifting, so that its most significant digit is nonzero. With a shift of p digits, the normalized divisor d* is:

d* = 10^p · d .   (30)

Consequently, the resulting quotient is:

q = ⌊x / d⌋ = 10^p · ⌊x / d*⌋ .   (31)

To use the algorithm, fractional operands are formed as defined by:

x_f = x x r^-n ,   (32)

d_f = d* x r^-n .   (33)

Expressions (27) and (28) can be used to obtain minimum bounds for the number of decimal digits of the partial remainder and the divisor, t and u respectively. There is a choice, however, in the value of a, or the amount of redundancy, as shown in (1). In the decimal case a can vary from 6 to 9. As the redundancy is increased (a increases), the overlap described in (6) is augmented, simplifying the QDS function by allowing the quotient digit to be selected from fewer leading digits, and consequently with a smaller lookup table. On the other hand, a greater value of a complicates the generation of the quotient digit multiples (q_j+1·d) needed by the iterative algorithm (3). For example, with a = 9 the possible divisor multiples required are (-9d, -8d, ..., -1d, 0, 1d, ..., 8d, 9d). Conversely, as a is decremented and the possible quotient multiples are reduced, additional digit inputs to the QDS function are needed, since more precision is required. Since each digit is a decimal digit, the size of a lookup table for the QDS increases by an order of magnitude with each additional digit required.
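The bounds (27) and (28) can be evaluated numerically for each candidate a. A sketch using exact rational arithmetic (the helper name is assumed):

```python
# Evaluate bounds (27) and (28) for radix 10 and each redundancy choice a;
# Fractions avoid any floating-point rounding in the comparisons.
from fractions import Fraction

def min_u_t(a):
    rho = Fraction(a, 9)                                   # rho = a/(r-1), r = 10
    u = 1
    while Fraction(1, 10**u) >= (2*rho - 1) / (10 * (a - rho)):   # condition (28)
        u += 1
    rhs = (2*rho - 1) / 10 - (a - rho) / 10**u             # right side of (27)
    t = 1
    while Fraction(2, 10**t) > rhs:
        t += 1
    return u, t

for a in range(6, 10):
    print(a, min_u_t(a))   # a = 9 (rho = 1) gives u = t = 2
```

Running this shows the trade-off discussed above: the maximally redundant choice minimizes the truncated input widths, and hence the table.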
Therefore, the smallest lookup table is achieved with a = 9 and, hence, a maximally redundant digit set with ρ = 1, from (2). In this case, (28) and (27) yield u >= 2 and t >= 2, implying that only 2 decimal digits are required for both the divisor and the partial remainder. The shifted partial remainder, however, can still be within the allowed boundary (5) but greater than 1, in which case integer digits are needed. Since the divisor is normalized (30), its range is limited to 1/10 <= d < 1; this observation is also shown in Figure 23 by the vertical line at 1/r. The possible range of the shifted partial remainder is then given by:

r·w[j] = 10·w[j] <= 10·ρ·d < 10 ,   (34)

due to the containment condition given in (5). This implies that at most a single integer digit is required. The total number of digit inputs to the QDS function is therefore 5 digits: 2 decimals for the divisor (.xx) and 2 decimals plus an integer digit for the partial remainder (x.xx). A table-based QDS function will then have 5 decimal digit inputs and a decimal digit output. Considering a 5-bit encoding for the signed decimal quotient digit output, the total size of the QDS table would be 10^5 x 5 bits = 100,000 x 5 bits.

The division by recurrence algorithm requires a subtraction and a multiplication of the truncated divisor by the quotient digit, which can be positive or negative. Since the numbers treated are decimal, this significantly complicates the arithmetic operations involved. Furthermore, a significant complication of using a maximally redundant quotient digit set is the generation of the extra divisor multiples (-9d, ..., 0, ..., 9d), as discussed previously. The proposed scheme utilizes ideas from the decimal multiplication method presented in [31]. The divisor multiple (q_j+1·d product) generation starts by computing a priori the multiples 0, d, 2d, 4d and 5d, which can be added in combination to create all possible divisor multiples from 1d to 9d.
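The recurrence (3) together with this multiple-generation idea can be sketched in software. The QDS here is an idealized full-precision round-to-nearest rather than the truncated table described above, and all names are assumptions for illustration:

```python
# Sketch: radix-10 digit recurrence with the maximally redundant set {-9,...,9}
# and qd products formed from the precomputed multiples d, 2d, 4d, 5d.
PRE = {1, 2, 4, 5}

def qd_product(q, d):
    """Form |q|*d from at most two precomputed multiples, then apply the sign."""
    k = abs(q)
    parts = [] if k == 0 else ([k] if k in PRE else
             next([i, k - i] for i in sorted(PRE) if k - i in PRE))
    return (1 if q >= 0 else -1) * sum(i * d for i in parts)

def divide(x, d, digits):
    """1/10 <= d < 1 and |x| <= rho*d = d; returns quotient digits of x/d."""
    w, q = x, []
    for _ in range(digits):
        rw = 10 * w
        qj = max(-9, min(9, round(rw / d)))   # idealized SEL(rw[j], d)
        w = rw - qd_product(qj, d)            # w[j+1] = r*w[j] - d*q_{j+1}
        q.append(qj)
    return q

digits = divide(0.25, 0.5, 4)
print(digits)   # digit values of 0.25/0.5 = 0.5
```

A hardware QDS would index a table with the truncated x.xx and .xx inputs instead of dividing; the containment bound (5) is what keeps every qj inside the redundant set.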
Figure 24 shows the overall scheme of the decimal division by recurrence design.

Figure 24. Decimal division by digit recurrence implementation: an IEEE 754 decimal decoder feeds normalization and exponent logic; the truncated divisor (.xx) and carry-save partial remainder (x.xx, vectors wS and wC) address the QDS table; an excess-3 subtractor and the qd-product generator compute w[j+1] = r·w[j] - q_j+1·d; and a final stage performs rounding, scaling and IEEE 754 recoding.

5. DECIMAL PARTIAL PRODUCT REDUCTION

A few approaches to decimal multiplication were described in Section 2.4. The methods discussed included serial or iterative accumulate addition to obtain the multiplication result. Parallel multiplication, however, is used extensively in binary floating-point hardware ([42], [43]) and is important if this performance is to be extended to the decimal case.

Figure 25. Multiplication algorithm: multiplicand multiples (partial products) are generated, reduced, and summed by a final carry-propagate addition (example: 798 x 347, partial products 5586, 3192 and 2394, product 276906).

Parallel multiplication can be divided into three main steps, as illustrated in Figure 25. The first step entails partial product generation, where the multiplicand multiples are obtained. Then, partial product reduction occurs, using a fast addition scheme to reduce the partial products to two. Finally, a carry-propagate addition is necessary to obtain the final result. The overall performance of the multiplier, therefore, is closely related to the individual performance of these stages. Improvement in partial product reduction, however, often increases the complexity of partial product generation. This is the reasoning behind binary methods like Booth encoding, where the number of partial products to be added is reduced at the expense of more complex multiplicand multiple generation through recoding [44][45].
Unfortunately, this added complexity can offset the performance gains of the resulting multiplier. As discussed in Section 2.3, binary tree multipliers typically use carry-save adders to repeatedly reduce the partial product matrix until only two rows remain, which are then added using a fast carry-propagate adder to form the final product. Although tree multipliers are typically much faster algorithmically than array multipliers (see Section 2.3), they produce an irregular structure which can affect their performance.

Traditional decimal codes differ from binary codes in that more information per bit has to be coded into the digital logic. The most common decimal encoding is 4-bit Binary Coded Decimal (BCD), which represents the decimal digits 0 through 9. This code is also referred to as BCD-8421, where the numbers 8421 represent the weight of each bit in the encoding. BCD-8421 has the advantage that each decimal digit is represented in a common binary number system and, hence, some operations can be performed with regular binary logic structures [31]. Although BCD-8421 codes are straightforward, they have two distinct disadvantages. First, the binary representations of ten through fifteen have no meaning and must be eliminated. The other major disadvantage is that BCD-8421 is not self-complementing; a self-complementing BCD code is one where the 9's complement of a decimal digit may be obtained by changing the 1's to 0's and the 0's to 1's (bit inversion) [31]. The 9's complement operation is necessary to perform subtraction, in much the same way as two's complement numbers are used to perform subtraction with binary numbers. Although these two disadvantages make BCD-8421 more challenging to work with, simple Boolean logic can be used to obtain the 9's complement of a BCD-8421 digit T3T2T1T0:

C0 = T0'
C1 = T1
C2 = T1'·T2 + T1·T2'
C3 = (T1 + T2 + T3)'

where T and C refer to the true and complemented digit bits, respectively.
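These four equations can be checked exhaustively over the ten valid BCD-8421 digits. A minimal sketch (function name assumed):

```python
# BCD-8421 9's complement from the Boolean equations above.
def nines_complement(t3, t2, t1, t0):
    c0 = 1 - t0                 # C0 = T0'
    c1 = t1                     # C1 = T1
    c2 = t1 ^ t2                # C2 = T1'*T2 + T1*T2'  (an XOR)
    c3 = 1 - (t1 | t2 | t3)     # C3 = (T1 + T2 + T3)'
    return c3, c2, c1, c0

for n in range(10):
    bits = [(n >> i) & 1 for i in (3, 2, 1, 0)]
    c = nines_complement(*bits)
    print(n, '->', c[0]*8 + c[1]*4 + c[2]*2 + c[3])   # prints 9 - n
```

Note that C2 reduces to a single XOR gate, so the whole complementer is only a handful of gates per digit.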
Efficient addition of decimal numbers is an interesting topic since it is not only used as a standalone operation but also required in other calculations like multiplication and division. Multioperand addition, where more than two numbers are added, is of particular importance since it can be used to reduce the partial products that occur in a multiplication operation (see Figure 10). To avoid the time consuming carry propagation in multioperand addition, two different schemes used in binary radix numbers are also applicable to the decimal case. These are signeddigit addition and carrysave addition, explored in this work. 68 5.1 Decimal CarrySave Addition One of the most common elements within partial product reduction is the full adder cell or the 3:2 counter, as stated in Section 2.2.1. As with the binary case, decimal partial product reduction can also utilize 3:2 counters except that each digit is 4bits long to accommodate the BCD8421 encoding. In this case, four 3:2 counters are used for each bit of the BCD digit as shown in Figure 26. Figure 27 shows how this scheme would be implemented with a block level diagram. Figure 26. Binary full adder cells used for decimal carrysave addition. The result of the addition is obtained by adding the sum and the shifted carry vector. Each carry bit in H, in Figure 26, represents the carry input for the next significant bit. The addition of both vectors assumes that H is shifted left one bit or, in binary, multiplied by two. 3 0 0 1 1 7 0 1 1 1 + 8 1 0 0 0 12 1 1 0 0 3 0 0 1 1 A: B: C: S: H: Decimal BCD 8 4 2 1 BINARY FULL ADDER SUM Carry 69 Figure 27. Block diagram for full adder cells used for decimal CarrySave addition. The function of the 3:2 counter or full adder can be summarized using the following equation: A + B + C = S + 2 x H where A, B, C, S and H are decimal numbers with Σ= = × 3 i 0 i i A a r such that ai is the bit value of A at position i and ri the corresponding weight (r3r2r1r0 = 8421 for BCD8421). 
Although using binary 3:2 counters is relatively simple, this approach has one major disadvantage for BCD-8421 coding. As seen in Figure 28, the sum vector S is out of range, since S = 1100 is not a valid decimal representation, and the addition of S and H yields a result in binary rather than a valid BCD-8421 number. Consequently, the conversion to BCD would require additional logic and, hence, is not efficient.

Figure 28. Result of the addition, carry vector shifted left.

One solution is to use a different weighted BCD representation during the summation process. That is, the bits are recoded to another valid decimal representation, which allows the carries to be generated correctly when a shift occurs. To find the best BCD representation for the recoding, it is important to look at a key property of adding two numbers. In order to avoid the out-of-range result, decimal codes are employed in which the weights of each 4-bit code sum to 9. In the previous example, since the code BCD-8421 was used, it was possible to represent digit values greater than 9 (10 through 15, since we had 4 bits). If instead the maximum possible value is 9 (the sum of the weights of all 4 bits), then the result can never be out of range. This property is also utilized in binary signed-digit adders to avoid carry propagation [29]. The BCD codes 5211, 4311, 4221 and 3321 satisfy this condition. Additionally, an advantage of these encodings is that they are self-complementing, meaning that the 9's complement of a digit can be obtained by a simple bit inversion of its 4 bits [31].
Using this trivial simplification allows the ten's complement of a number to be obtained easily, in much the same way as the two's complement is formed for binary numbers. Table 10 shows some of the mentioned codes.

Table 10. 8421, 4221 and 5211 BCD representations.

Decimal   BCD-8421   BCD-4221        BCD-5211
0         0000       0000            0000
1         0001       0001            0001 or 0010
2         0010       0010 or 0100    0100 or 0011
3         0011       0011 or 0101    0101 or 0110
4         0100       1000 or 0110    0111
5         0101       1001 or 0111    1000
6         0110       1010 or 1100    1001 or 1010
7         0111       1011 or 1101    1100 or 1011
8         1000       1110            1110 or 1101
9         1001       1111            1111

Furthermore, since the value of 2·H (the carry vector) is required to obtain the final sum, a BCD coding that facilitates multiplication by 2 is desirable. In binary, a multiplication by 2 is easily accomplished by a bitwise left shift. However, in the self-complementing codes shown above the weights do not increase in powers of two and, hence, shifting cannot be applied directly. Nevertheless, if the number is re-encoded, this can be accomplished in a simpler way.

Figure 29. Multiplication by 2, recoded result (H = 6 = 1001 in BCD-5211 reads as 12 under the 10-4-2-2 weights).

One potential solution for recoding the carries into a self-complementing output is to use the 5211 code shown in Table 10. With this code, a multiplication by 2 results in a 10, 4, 2, 2 weighting, as seen in Figure 29, without the need to modify any bits. Although this encoding is not itself useful, since it does not satisfy the two conditions described earlier (self-complementing and in-range sums), the most significant bit can be passed to the next significant digit or column. The resulting BCD code is then 4221, as illustrated in Figure 30.
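The self-complementing property of the 4221 code can be checked exhaustively. In the sketch below the helper name is mine; because the 4221 weights sum to 9, inverting the four bits of any pattern yields the 9's complement of its value:

```python
def val4221(x):
    """Value of a 4-bit pattern x (bit 3 first) under the 4221 weights."""
    return 4 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + 2 * ((x >> 1) & 1) + (x & 1)

# Every pattern and its bitwise inverse sum to 9, so inversion gives
# the 9's complement directly: the self-complementing property.
for x in range(16):
    assert val4221(x) + val4221(x ^ 0b1111) == 9
```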
In summary, if a BCD-5211 recoding is utilized, a shift asserts a carry-out for the weight-10 position and allows the decimal code to be transformed to BCD-4221, which is self-complementing. The bit shifted out is the carry-out to the next decimal digit, and the position opened at the least significant bit after the shift is available to receive a carry input from the previous significant digit. For the least significant decimal digit, this carry-in is assumed to be 0.

Figure 30. Multiplication by 2, result in BCD-4221 (H = 6 recoded to 1001 in BCD-5211; a left shift produces the carry-out to the tens digit and the units digit in BCD-4221).

Figure 31. Decimal 3:2 counter with BCD recoding example (A = 3, B = 7, C = 8 in BCD-4221 produce S = 4 and H = 7; recoding H to BCD-5211 and shifting yields W = 2·H = 14, with the result S + W = 18).

Figure 31 also shows the multiplication of H by 2. H is shifted left after recoding, allowing a carry-in to be accepted and generating a carry-out. The value of 2·H is then composed of a decimal digit and a single-bit decimal carry-out. The decimal digit is W, with a value of 1000 in BCD-4221, or 4 in decimal, considering a carry-in of zero to its LSB. The single-bit carry-out of 1 represents a value of 1 in the tens place and, together with W = 4, represents 14. W has the same arithmetic weight as S. The operation is summarized as:

A + B + C = S + 2·H = S + (Carry-out × 10) + W + Carry-in.

Figure 32 shows the implementation of the decimal 3:2 counter.

Figure 32. Block diagram for decimal 3:2 counter [44].

Although this recoding works nicely, a conversion is necessary to bring the inputs of the 3:2 counter into the desired decimal code (4221).
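The ×2 trick can likewise be verified exhaustively. In the sketch below (helper names are mine), a digit is read under the 5211 weights; after a one-bit left shift the dropped MSB carries weight 10 and the remaining bits line up with the 4221 weights, so the shifted pattern equals twice the original value:

```python
def val5211(x):
    """Value of a 4-bit pattern under the 5211 weights."""
    return 5 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + ((x >> 1) & 1) + (x & 1)

def val4221(x):
    """Value of a 4-bit pattern under the 4221 weights."""
    return 4 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + 2 * ((x >> 1) & 1) + (x & 1)

for x in range(16):
    carry_out = (x >> 3) & 1       # weight-10 bit, passed to the next digit
    units = (x << 1) & 0b1111      # remaining bits, now read in BCD-4221 (carry-in = 0)
    assert 2 * val5211(x) == 10 * carry_out + val4221(units)
```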
Using Table 10, a BCD-8421 number can be converted into BCD-4221 through simple logic utilizing the following Boolean equations:

H0 = D0
H1 = D1 + D3
H2 = D3
H3 = D2 + D3,

where D3 D2 D1 D0 represent the BCD-8421 digit and H3 H2 H1 H0 the BCD-4221 decimal digit. The counter then takes three 4-bit decimal numbers as inputs in BCD-4221 code. The binary full adders perform the bitwise addition, which results in a sum vector S and a carry vector H, both in BCD-4221. H is then recoded into BCD-5211, which allows the left shift to obtain the correct decimal value in BCD-4221. It is important to see that recoding produces the correct result: previously, without recoding, the binary counters produced an incorrect result, whereas a scheme that keeps the carry-out in the proper notation (i.e., the BCD-4221 code) yields the result correctly. Recoding from BCD-4221 to BCD-5211 can be accomplished using the following Boolean logic expressions, derived from Table 10 [44]:

w[3] = h[3]·(h[2] + h[1] + h[0]) + h[2]·h[1]·h[0]
w[2] = h[2]·h[1]·(h[3] ⊕ h[0]) + (h[3]·h[0]) ⊕ h[2] ⊕ h[1]
w[1] = h[2]·h[1]·(h[3] ⊕ h[0]) + h[3]·h[0]·(h[2] ⊕ h[1])
w[0] = (h[2]·h[1]) ⊕ h[3] ⊕ h[0],

where w3 w2 w1 w0 represent the BCD-5211 digit and h3 h2 h1 h0 the digit in BCD-4221. This logic is implemented in the block that performs the multiplication of the carry vector by two ("x2" in Figure 32).

5.2 Delay Analysis of the Decimal 3:2 Counter by Recoding

Insight into the performance and flow structure of the 3:2 decimal counter proposed in [44] (shown in Figure 32) can be gained by utilizing the Δ delay model frequently used for binary arithmetic delay analysis.
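The BCD-8421 to BCD-4221 conversion can be checked digit by digit. A minimal sketch (function names are mine; the low bit passes straight through, H0 = D0):

```python
def bcd8421_to_4221(d):
    """Convert one BCD-8421 digit (0..9) to a BCD-4221 pattern."""
    d3, d2, d1, d0 = (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1
    h0 = d0
    h1 = d1 | d3
    h2 = d3
    h3 = d2 | d3
    return (h3 << 3) | (h2 << 2) | (h1 << 1) | h0

def val4221(x):
    return 4 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + 2 * ((x >> 1) & 1) + (x & 1)

# Each converted pattern must decode to the same digit value.
for d in range(10):
    assert val4221(bcd8421_to_4221(d)) == d
```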
The use of Δ delays aids in considering design trade-offs in a circuit. Consider the case of a regular 1-bit full adder or 3:2 counter. Typically, a 3:2 counter can be implemented using nine logic gates, as shown in the following figure:

Figure 33. Nine-gate full adder / 3:2 counter cell.

If the delay through an AND or OR gate is estimated at 2Δ and the delay through an inverter at 1Δ, the delays of the paths from inputs to outputs are:

A, B → Sum = 10Δ
A, B → Cout = 9Δ
Cin → Sum = 5Δ
Cin → Cout = 4Δ,

i.e., it takes 10Δ for a signal to propagate from A or B to the Sum output, for example. In the same way, the delay through the "x2" block can be obtained by looking at the logic equations used to implement it. This block consists of a recoding from BCD-4221 to BCD-5211 and a single-bit hardwired left shift. The left shift incurs no delay, while the BCD recoding can be implemented with simple two-level logic. Its delay therefore consists of two gate delays plus an inverter delay used to create the logic predicates, for a total of 5Δ. Consequently, applying the delay model for the binary 3:2 counter shown above, the complete decimal 3:2 counter of Figure 32 requires:

A, B → S = 10Δ
A, B → W = 9Δ + 5Δ = 14Δ
C → S = 5Δ (through the carry-in path)
C → W = 4Δ + 5Δ = 9Δ (through the carry-in path).

This analysis can be used to evaluate the feasibility of the partial product reduction scheme described in the next section, which uses decimal 4:2 compressors.

5.3 Decimal 4:2 Compressor Trees

Decimal partial product reduction trees that use the BCD recoding technique differ from their binary counterparts in the way the weights are handled. Binary partial product reduction trees always take the carry and pass it on to the next column, as shown in Figure 14. This occurs because the carry-out has a more significant weight than the sum.
However, with the coding discussed in the previous section, the decimal case is quite different: it produces a carry digit at the same weight as the sum and a single carry bit to the next significant digit. The use of 3:2 counters for partial product reduction is common in binary arithmetic. In [44], partial product reduction is handled by a tree of decimal 3:2 counters, in a similar fashion to a Dadda tree [46]. This is shown in Figure 34, where the X2 block represents the recoding logic discussed earlier, which obtains a multiplication by 2 through BCD recoding.

Figure 34. Decimal 9:2 counter, adapted from [44].

It is apparent from Figure 34, however, that the resulting circuit is very irregular, which complicates its implementation: its interconnection is complex and most likely degrades performance and area consumption, in part due to unpredictably long wiring and its general layout. Regularity is one of the reasons why the Weinberger 4:2 compressor [30] is used for partial product reduction in binary arithmetic. Its extension to the decimal case is therefore an interesting problem, as it can provide performance enhancements for partial product reduction.

Apart from its improved regular structure, the proposed decimal 4:2 compressor has the advantage that the carry-in is no longer dependent on the carry-out, as shown in Figure 35, breaking the carry chain path. This gives compressors a significant advantage over traditional carry-save adder trees in that the carry chain can be processed quickly while still maintaining a regular structure.
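A value-level sketch can confirm the weight bookkeeping of such a decimal 4:2 compressor. In the code below, the helper names and the table-lookup recoder are my own modeling choices: the ×2 block is built from a 4221-to-5211 recode plus a left shift, and the invariant checked is that the five inputs equal S + W + Cout plus the tens bits produced in the ×2 blocks:

```python
import itertools

def val4221(x):
    return 4 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + 2 * ((x >> 1) & 1) + (x & 1)

def val5211(x):
    return 5 * ((x >> 3) & 1) + 2 * ((x >> 2) & 1) + ((x >> 1) & 1) + (x & 1)

def counter_3_2(a, b, c):
    """Bitwise binary full adders over 4-bit vectors."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

# 4221 -> 5211 recode table (any valid 5211 pattern of equal value).
RECODE = {h: next(y for y in range(16) if val5211(y) == val4221(h))
          for h in range(16)}

def times2(h):
    """x2 block: recode to 5211, shift left; returns (tens bit, 4221 units)."""
    y = RECODE[h]
    return (y >> 3) & 1, (y << 1) & 0b1111

def decimal_4_2(d3, d2, d1, d0, cin):
    """Value-level model of a decimal 4:2 compressor built from two
    bitwise 3:2 counters plus x2 recoding blocks (carry-ins tied to 0)."""
    s1, h1 = counter_3_2(d3, d2, d1)
    s, h2 = counter_3_2(s1, d0, cin)
    t1, cout = times2(h1)
    t2, w = times2(h2)
    return s, w, cout, t1 + t2

# Check the invariant: inputs = S + W + Cout, plus the tens carried out.
for d3, d2, d1, d0 in itertools.product(range(16), repeat=4):
    s, w, cout, tens = decimal_4_2(d3, d2, d1, d0, 0)
    lhs = sum(map(val4221, (d3, d2, d1, d0)))
    assert lhs == val4221(s) + val4221(w) + val4221(cout) + 10 * tens
```

Note that S, W and Cout all carry the same arithmetic weight here, unlike the binary compressor, which is exactly the point made above.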
Although compressor trees differ from traditional counters, they can be organized into efficient interconnection networks for reducing the partial product matrix. This allows partial product reduction trees to be built easily and with regularity.

Figure 35. Proposed decimal 4:2 compressor. All signals are decimal (4 bits).

The new decimal compressor's structure is similar to the binary case except that, in binary, the signals Cout and W are both passed to the next column, since their weight corresponds to the next significant bit, unlike S. In decimal, however, both Cout and W are composed of two parts: a "tens" part and a "units" part, where a BCD digit represents the "units" weight and a single bit the "tens" value. The "tens" bit is generated in the X2 blocks during shifting as a carry-out while, at the same time, a carry-in is accepted from the previous column. For clarity, Figure 35 does not show the single "tens" bits generated at the X2 blocks (carry-out) or the single-bit carry-in. Utilizing the delay statements shown in Section 5.2 allows a Δ delay analysis of the decimal compressor:

D3, D2, D1, D0 → S = 10Δ + 10Δ = 20Δ
D3, D2, D1, D0 → W = 10Δ + 14Δ = 24Δ
D3, D2, D1, D0 → Cout = 14Δ
Cin → S = 5Δ
Cin → W = 4Δ + 5Δ = 9Δ.

This model is important since it determines the feasibility of higher-order compressors that use this block: as opposed to the binary case, the Cout and W signals incur an additional delay due to the X2 block logic.

Figure 36. Decimal 8:2 compressor, all signals are decimal (4 bits).

Figure 36 shows an 8:2 compressor tree that performs the same function as Figure 34 yet is regular. Assuming the inputs are available at time 0, and using the Δ delay expressions determined for the 4:2 compressor, the Δ delay of each path can be determined.
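The arrival-time bookkeeping for the 8:2 tree can be sketched in a few lines (the model and names are mine): each output becomes valid at the latest input arrival plus the corresponding path delay from the analysis above.

```python
# Path delays (in units of delta) from the analysis above.
D_TO_S, D_TO_W = 20, 24        # decimal 4:2 compressor, data inputs to S / W
AB_TO_S, AB_TO_W = 10, 14      # decimal 3:2 counter, A,B inputs to S / W

def comp42(arrivals):
    """Arrival times of (S, W) at a decimal 4:2 compressor's outputs."""
    t = max(arrivals)
    return t + D_TO_S, t + D_TO_W

# 8:2 tree of Figure 36: two first-level 4:2s, one second-level 4:2,
# and a final decimal 3:2 counter.
s1, w1 = comp42([0, 0, 0, 0])
s2, w2 = comp42([0, 0, 0, 0])
s3, w3 = comp42([s1, w1, s2, w2])
t = max(s3, w3)
s_out, w_out = t + AB_TO_S, t + AB_TO_W
print(s_out, w_out)  # 58 62
```

The intermediate values reproduce the annotations of the figure: 20Δ and 24Δ after the first level, 44Δ and 48Δ after the second, and 58Δ and 62Δ at the final counter outputs.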
The times at which the corresponding signals become available are shown in the following figure for the 8:2 compressor.

Figure 37. Decimal 8:2 compressor. Critical path and Δ delays for signals are shown (first-level outputs at 14Δ, 20Δ and 24Δ; second-level outputs at 38Δ, 44Δ and 48Δ; final outputs at 58Δ and 62Δ).

The Cout signal requires only 14Δ, as shown, and is fast enough that it does not further delay the subsequent 4:2 compressor to which it is attached. In other words, the delay is determined solely by the 24Δ required for the W signal in each compressor. If this were not the case, the delay would accumulate with each stage, affecting performance ever more significantly as the size of the compressor increases (e.g., 16:2, 32:2, etc.). Compressors for larger column heights (in digits) can easily be built with log4(n) levels, where n represents the number of digits to be compressed (8 digits as shown, 16, etc.). Figure 38 illustrates a 16-digit (64-bit) decimal compressor tree using the same technique used to build the 8-digit tree of Figure 36.

Figure 38. Decimal 16:2 compressor, all signals are decimal (4 bits).

6. PARTIAL PRODUCT GENERATION SCHEMES

Due to its complexity, decimal multiplication was previously implemented using sequential methods [31]. These designs generally iterate over the multiplier digits and, depending on each digit's value, add the corresponding multiple of the multiplicand to the result. These multiples have to be precomputed and stored before the process takes place, typically through costly lookup tables [12]. Enhancements to this approach included reducing the multiplier digit range through recoding to decrease the number of combinations, a method similar to Booth multiplication.
Digit recoding allows multiplicand multiples to be computed on the fly, avoiding their storage, and the use of decimal carry-save or sign-magnitude addition to speed up the partial product reduction process [13], [14]. Currently there are only two proposals for decimal parallel multipliers in the literature, [44] and [20], with similar implementation architectures but significantly different multiplicand multiple generation schemes. Although these implementations are noteworthy, there are still many enhancements that require clarification, which this section addresses.

6.1 Multiplier Recoding

As stated above, typical decimal multiplication is more involved than binary, requiring the generation of the radix-10 multiplicand multiples 0x, 1x, 2x, …, 9x, as shown in Figure 39 (repeated from Section 5 for clarity). Multiples like 3x and 7x are considered "hard multiples," since their generation with binary logic is not straightforward and usually requires a slow carry-propagate addition. Previous methods for generating multiples utilized either this approach or costly lookup tables.

Figure 39. Multiplication algorithm.

The common multiplication algorithm is given by

p = x·y = Σ_{i=0}^{n−1} x·y_i·r^i,

where x and y denote an n-digit BCD multiplicand and multiplier, respectively, r = 10 denotes the radix, and y_i ∈ [0, 9]. Recoding of the multiplier is an efficient technique for reducing its implementation, since it permits a different set of multiples to be utilized, avoiding the need for generating hard multiples. One method recodes each multiplier digit y_i ∈ [0, 9] as y_i = y_Hi + y_Li, where y_Hi ∈ {0, 5, 10} and y_Li ∈ {−2, −1, 0, 1, 2}. In this manner only the multiples 2x and 5x, which can be generated without excessive overhead, are required to create all other decimal multiples. This is illustrated in the radix-5 columns of Table 11. The multiple 10x is produced with a simple 4-bit wired left shift (a 1-digit shift).
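The recoding y_i = y_Hi + y_Li is easy to tabulate. A minimal sketch (function name is mine):

```python
def recode_radix5(y):
    """Split a decimal digit into yH in {0, 5, 10} and yL in [-2, 2]."""
    if y <= 2:
        yh = 0
    elif y <= 7:
        yh = 5
    else:
        yh = 10
    return yh, y - yh

# Reproduces the radix-5 columns of Table 11: only 2x, 5x and 10x
# (a one-digit shift) are ever needed, with yL possibly negated.
for y in range(10):
    yh, yl = recode_radix5(y)
    assert y == yh + yl and yh in (0, 5, 10) and -2 <= yl <= 2
```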
(Figure 39 illustrates the algorithm with the example 798 × 347: the multiplicand multiples 5586, 3192 and 2394 form the partial products, which are reduced and summed by a final carry-propagate addition into 276906.)

In [44], a radix-5 multiplier using this approach is described. A block diagram of this digit-recoding scheme is shown in Figure 40. Hot-one coded multiplexors are used, avoiding the need for an extra '0x' input. The selection signals are determined directly from the BCD-8421 input digits y_i (′ denotes logical complement):

y_Hi10x = y_i[3]
y_Hi5x = y_i[2] + y_i[1]·y_i[0]
y_Li2x = y_i[2]′·y_i[1] + y_i[3]·y_i[0]′ + y_i[1]·y_i[0]
y_Li1x = y_i[2]·y_i[0]′ + y_i[2]′·y_i[1]′·y_i[0]
sign_i = y_i[3] + y_i[2]·y_i[1]′·y_i[0]′ + y_i[2]′·y_i[1]·y_i[0].

Table 11. Radix-5/4 digit recoding.

              Radix-5         Radix-4
Decimal y_i   y_Hi    y_Li    y_Hi    y_Li
0             0       0       0       0
1             0       1       0       1
2             0       2       0       2
3             5       -2      4       -1
4             5       -1      4       0
5             5       0       4       1
6             5       1       4       2
7             5       2       8       -1
8             10      -2      8       0
9             10      -1      8       1

Figure 40. Digit recoding for radix-5, [44].

The second recoding approach, similar to the radix-5 case, recodes each multiplier digit y_i ∈ [0, 9] as y_i = y_Hi + y_Li, where y_Hi ∈ {0, 4, 8} and y_Li ∈ {−2, −1, 0, 1, 2}. It is named radix-4 recoding in [19] and is also shown in Table 11. In this approach the hard multiples are again avoided; instead, 2x, 4x and 8x are required to generate all others. The multiple 4x can be generated by cascading two consecutive 2x modules, and an additional 2x module yields 8x, implying a latency three times that required to obtain 2x. The logic equations for the digit recoding of y_i are:

y_Hi8x = y_i[3] + y_i[2]·y_i[1]·y_i[0]
y_Hi4x = y_i[2] ⊕ (y_i[1]·y_i[0])
y_Li2x = y_i[1]·y_i[0]′
y_Li1x = y_i[0]
sign_i = y_i[1]·y_i[0].

6.2 Multiplicand Multiples Generation

The recoding schemes discussed in the previous section avoid the computation of complicated multiples of x.
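The radix-4 selection equations above can be sanity-checked against Table 11 with a short script (helper name is mine; `1 - b` plays the role of the logical complement):

```python
def radix4_signals(y):
    """Selection signals for multiplier digit y (BCD-8421 bits y3..y0)."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    y8x = y3 | (y2 & y1 & y0)
    y4x = y2 ^ (y1 & y0)
    y2x = y1 & (1 - y0)
    y1x = y0
    sign = y1 & y0
    return y8x, y4x, y2x, y1x, sign

# Decode the signals back into yH + yL and compare with the digit value.
for y in range(10):
    y8x, y4x, y2x, y1x, sign = radix4_signals(y)
    yh = 8 * y8x + 4 * y4x
    yl = 2 * y2x + 1 * y1x
    assert y == yh + (-yl if sign else yl)
```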
In the case of radix-5, then, only 2x and 5x modules are required, as all other multiples can be generated from them. Two main approaches for the generation of these multiples can be identified in the available literature. The first method for obtaining 2x and 5x is studied in [31] with a conventional binary logic approach and utilized in [12] and [20]. A more recent technique proposed in [19] utilizes BCD recoding and shifting instead.

6.2.1 2x and 5x with Conventional Binary Logic

The multiples 2x and 5x can be produced rapidly, as opposed to other multiples, since in both the doubling and the quintupling of BCD-8421 numbers no carry is generated beyond the next significant digit. Logic equations for doubling and quintupling BCD digits are given in [12]. When doubling takes place, the least significant bit (LSB) of each doubled BCD digit is initially zero. When the digit value is in the range [5, 9] a carry-out of at most one is produced (9 × 2 = 18), which cannot propagate further since the LSB of the next doubled digit is zero as well. The equations for doubling each multiplicand digit can be formed as follows (′ denotes logical complement, and a_{i−1} is the next less significant digit):

2x_i[0] = a_{i−1}[3] + a_{i−1}[2]·a_{i−1}[1] + a_{i−1}[2]·a_{i−1}[0]
2x_i[1] = a_i[3]′·a_i[2]′·a_i[0] + a_i[2]·a_i[1]·a_i[0]′ + a_i[3]·a_i[0]′
2x_i[2] = a_i[1]·a_i[0] + a_i[2]′·a_i[1] + a_i[3]·a_i[0]′
2x_i[3] = a_i[2]·a_i[1]′·a_i[0]′ + a_i[3]·a_i[0].

On the other hand, when a number is odd and quintupling takes place, the units digit of the result is initially five; when the number is even it is zero. Quintupling produces a carry-out of at most four (9 × 5 = 45). Since the initial value is zero or five, adding the incoming carry results in at most 9, preventing the carry from propagating any further.
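The doubling equations can be exercised over full BCD operands. In the sketch below, the function names and the integer round-trip harness are mine; the digit-level logic follows the equations above:

```python
def bits(d):
    return (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1

def double_digit(a, prev):
    """One digit of 2x: the doubled digit a combined with the carry
    produced by the next less significant digit prev."""
    a3, a2, a1, a0 = bits(a)
    p3, p2, p1, p0 = bits(prev)
    b0 = p3 | (p2 & p1) | (p2 & p0)     # carry-in: prev digit >= 5
    b1 = ((1 - a3) & (1 - a2) & a0) | (a2 & a1 & (1 - a0)) | (a3 & (1 - a0))
    b2 = (a1 & a0) | ((1 - a2) & a1) | (a3 & (1 - a0))
    b3 = (a2 & (1 - a1) & (1 - a0)) | (a3 & a0)
    return (b3 << 3) | (b2 << 2) | (b1 << 1) | b0

def double_bcd(n):
    """Round-trip harness: double a nonnegative integer digit-wise."""
    digits = [int(c) for c in str(n)[::-1]] + [0]   # LSD first, room for carry
    doubled = [double_digit(d, digits[i - 1] if i else 0)
               for i, d in enumerate(digits)]
    return int(''.join(str(d) for d in doubled[::-1]))

for n in range(1000):
    assert double_bcd(n) == 2 * n
```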
Therefore, by checking the LSB of the current digit (a_i[0]) to determine whether its quintupled units value is 0 or 5, and adding the carry ⌊a_{i−1}/2⌋ produced by the next less significant digit, equations for 5x can be determined as follows:

5x_i[0] = a_i[0]′·a_{i−1}[1] + a_i[0]·a_{i−1}[1]′
5x_i[1] = a_i[0]′·a_{i−1}[2] + a_i[0]·a_{i−1}[2]·a_{i−1}[1]′ + a_i[0]·a_{i−1}[2]′·a_{i−1}[1]
5x_i[2] = a_i[0]′·a_{i−1}[3] + a_i[0]·a_{i−1}[3]′·a_{i−1}[2]′ + a_i[0]·a_{i−1}[3]′·a_{i−1}[1]′
5x_i[3] = a_i[0]·a_{i−1}[3] + a_i[0]·a_{i−1}[2]·a_{i−1}[1].
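The same harness idea applies to quintupling (names are mine): each result digit combines the units value 5·a_i[0] with the carry ⌊a_{i−1}/2⌋, whose binary value is simply the upper three bits of the BCD digit.

```python
def bits(d):
    return (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1

def quintuple_digit(a, prev):
    """One digit of 5x from the current digit a and the next less
    significant digit prev (whose halved value is the incoming carry)."""
    a0 = a & 1                    # current digit odd -> units start at 5
    p3, p2, p1, _ = bits(prev)    # floor(prev / 2) = binary p3 p2 p1
    b0 = (1 - a0) & p1 | a0 & (1 - p1)
    b1 = (1 - a0) & p2 | a0 & p2 & (1 - p1) | a0 & (1 - p2) & p1
    b2 = (1 - a0) & p3 | a0 & (1 - p3) & (1 - p2) | a0 & (1 - p3) & (1 - p1)
    b3 = a0 & p3 | a0 & p2 & p1
    return (b3 << 3) | (b2 << 2) | (b1 << 1) | b0

def quintuple_bcd(n):
    """Round-trip harness: quintuple a nonnegative integer digit-wise."""
    digits = [int(c) for c in str(n)[::-1]] + [0]   # LSD first, room for carry
    out = [quintuple_digit(d, digits[i - 1] if i else 0)
           for i, d in enumerate(digits)]
    return int(''.join(str(d) for d in out[::-1]))

for n in range(1000):
    assert quintuple_bcd(n) == 5 * n
```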