#### DISTRIBUTED ARITHMETIC

Distributed arithmetic is an efficient procedure for computing inner products between a fixed and a variable data vector. The basic principle is owed to Croisier et al. (Patent), and Peled and Liu have independently presented a similar method.

Consider the sum-of-products (*inner products*)

$$\mathbf{y} = \mathbf{a}^T \cdot \mathbf{x} = \sum_{i=1}^{N} \mathbf{a}_i \mathbf{x}_i$$

where the coefficients,  $a_i$ , i = 1, 2, ..., N are fixed.

A two's-complement representation is used for the data components which are scaled so that  $|x_i| \le 1$ .

DSP Integrated Circuits

Department of Electrical Engineering Linköping University

larsw@isy.liu.se http://www.es.isy.liu.se/



3

 $F_k(x_{1k}, x_{2k}, ..., x_{Nk}) = \sum_{i=1}^{N} a_i x_{ik}$ 

F is a function of N binary variables, the ith variable being the kth bit in the data  $x_i$ .

Since  $F_k$  can take on only a finite number of values,  $2^N$ , it can be computed and stored in a look-up table.

This table can be implemented using a ROM (Read-Only Memory).

Using Horners method for evaluating a polynomial for x = 0.5, we can rewrite

$$y = -F_0(x_{10}, x_{20}, ..., x_{N0}) + \sum_{k=1}^{W_d - 1} F_k(x_{1k}, x_{2k}, ..., x_{Nk}) 2^{-k}$$

$$y = \left( \left( \dots \left( \left( 0 + F_{W_d - 1} \right) 2^{-1} + \dots + F_2 \right) 2^{-1} + F_1 \right) 2^{-1} - F_0 \right)$$

The inner product can be rewritten

$$y = \sum_{i=1}^{N} a_{i} \left[ -x_{i0} + \sum_{k=1}^{W_{d}-1} x_{ik} 2^{-k} \right]$$

where  $x_{ik}$  is the kth bit in  $x_i$ .

By interchanging the order of the two summations we get

$$y = -\sum_{i=1}^{N} a_i x_{i0} + \sum_{k=1}^{W} \left[ \sum_{i=1}^{N} a_i x_{ik} \right] 2^{-k}$$

which can be written

$$y = -F_0(x_{10}, x_{20}, ..., x_{N0}) + \sum_{k=1}^{W_d - 1} F_k(x_{1k}, x_{2k}, ..., x_{Nk}) 2^{-k}$$

where

DSP Integrated Circuits

DSP Integrated Circuits

Department of Electrical Engineering Linköping University



$$y = \left( \left( \dots \left( \left( 0 + F_{W_d - 1} \right) 2^{-1} + \dots + F_2 \right) 2^{-1} + F_1 \right) 2^{-1} - F_0 \right)$$

Inputs,  $x_1, x_2, ..., x_N$  are shifted bit-serially out from the shift registers with the least-significant bit first. Bits  $x_{ik}$  are used as an address to the ROM storing the look-up table.



The computational time is  $W_d$  clock cycles.

The word length in the ROM,  $W_{ROM}$ , depends on the  $F_k$  with the largest magnitude and the coefficient word length,  $W_c$ , and

$$W_{ROM} \leq W_c + \log_2(N)$$

#### **Example 11.11**

Determine the values that to be stored in ROM for the inner product

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3$$

where  $a_1 = \frac{33}{128} = (0.0100001)_{2C}$ ,  $a_2 = \frac{85}{128} = (0.1010101)_{2C}$ , and

$$a_3 = \frac{-11}{128} = (1.1110101)_{2C}.$$

| $x_1 x_2 x_3$ | $F_k$             | $F_k$                     | $F_k$      |
|---------------|-------------------|---------------------------|------------|
| 0 0 0         | 0                 | (0.0000000) <sub>2C</sub> | 0.0000000  |
| 0 0 1         | $a_3$             | (1.1110101) <sub>2C</sub> | -0.0859375 |
| 0 1 0         | $a_2$             | (0.1010101) <sub>2C</sub> | 0.6640625  |
| 0 1 1         | $a_2 + a_3$       | (0.1001010) <sub>2C</sub> | 0.5781250  |
| 1 0 0         | $a_1$             | (0.0100001) <sub>2C</sub> | 0.2578125  |
| 1 0 1         | $a_1 + a_3$       | (0.0010110) <sub>2C</sub> | 0.1718750  |
| 1 1 0         | $a_1 + a_2$       | (0.1110110) <sub>2C</sub> | 0.9218750  |
| 1 1 1         | $a_1 + a_2 + a_3$ | (0.1101011) <sub>2C</sub> | 0.8359375  |

DSP Integrated Circuits

Department of Electrical Engineering Linköping University larsw@isy.liu.se http://www.es.isy.liu.se/



Notice the similarity between the equation for a scalar multiplication

$$\mathbf{y} = \mathbf{a} \cdot \mathbf{x} = \mathbf{a} \cdot \left\{ -x_0 + \sum_{k=1}^{W_d - 1} x_k 2^{-k} \right\}$$

and the inner product

$$y = -F_0(x_{10}, x_{20}, ..., x_{N0}) + \sum_{k=1}^{W_d - 1} F_k(x_{1k}, x_{2k}, ..., x_{Nk}) 2^{-k}$$

In both cases, the same type of shift-accumulator can be used.

Hence, the distributed arithmetic unit essentially consists of a serial/parallel multiplier augmented by a ROM.

The shift-accumulator must be able to add correctly the largest possible value obtained in the accumulator register and in the ROM.

The largest value in the accumulator register is obtained when the largest (magnitude) value stored in the ROM is repeatedly accumulated.

$$y = \left( \left( \dots \left( \left( 0 + F_{W_d - 1} \right) 2^{-1} + \dots + F_2 \right) 2^{-1} + F_1 \right) 2^{-1} - F_0 \right)$$

Thus, at the last clock cycle, corresponding to the sign bit, the value in REG is

$$|y| = \dots (((0 + F_{max})2^{-1} + F_{max})2^{-1} + \dots + F_{max})2^{-1} \le F_{max}$$

Hence, the shift-accumulator must be able to add two numbers of magnitude  $\leq F_{max}$ .

The necessary number range is  $\pm$  1. The word length in the shift-accumulator must be extended with one guard bit for overflow detection = 1 + 8 bit word = 9 bits.

DSP Integrated Circuits

Department of Electrical Engineering Linköping University

larsw@isy.liu.se



### **Example 11.12**

A second-order section in direct form I can be implemented by using only a single PE based on distributed arithmetic.





A set of D flip-flops has been placed between the ROM and the shift-accumulator to allow the two operations to overlap in time, i.e., the two operations are pipelined.

The number of words in the ROM is only  $2^5 = 32$ .





#### **Example 11.13**

An linear-phase FIR structure can also be implemented using distributed arithmetic. Assume that *N* is even.



The logic circuitry has been pipelined by introducing D flip-flops between the adders (subtractors) and the ROM, and between the ROM and the shift-accumulator.

DSP Integrated Circuits

Department of Electrical Engineering Linköping University larsw@isy.liu.se http://www.es.isy.liu.se/



11

# **Parallel Implementation of Distributed Arithmetic**

The inner products containing many terms, can be partitioned into a number of smaller inner products which can be computed and summed by using either distributed arithmetic or an adder tree.



Distributed arithmetic can also use two, or more bits a time

terms. The basic approach is useful for up to 10 to 11 terms.

DSP Integrated Circuits

Lars Wanhamma

inner product.

is essential.

the inner product.

Department of Electrical Engineerin Linköping University

general case, e.g., a nonlinear-phase FIR filter.

larsw@isy.liu.se http://www.es.isy.liu.se/



12

#### THE BASIC SHIFT-ACCUMULATOR

N/2 bit-serial adders (subtractors) are used to sum the symmetrically placed values in the delay line. This reduces the number of terms in the

Only 64 words are required whereas  $2^{12} = 4096$  words are required for the

For higher-order FIR filters the reduction in the number of terms by 50%

The number of words in the ROM is  $2^N$  where N is the number of terms in

The chip area for the ROM is small for inner products with up to 5 to 6



For  $F_0$ , the clock cycle corresponding to the sign bit of the data,  $F_0$  should be subtracted.

This is done by adding  $-F_0$ , i.e., inverting all the bits in  $F_0$  using the XOR gates and the signal s, and adding one bit in the least-significant position.





After  $-F_0$  has been added, the most significant part of the inner product must be shifted out of the accumulator. This can be done by accumulating zeros. The number of clock cycles for one inner product is  $W_d+W_{ROM}$ .

A more efficient scheme is to free the carry–save adders in the accumulator by loading the sum and carry bits of the carry–save adders into two shift registers. The outputs from these can be added by a single carry–save adder.



DSP Integrated Circuits

Department of Electrical Engineering Linköning University

larsw@isy.liu.se http://www.es.isy.liu.se/



15

#### **REDUCING THE MEMORY SIZE**

## **Memory Partitioning**

One of several possible ways to reduce the overall memory requirement is to partition the memory into smaller pieces that are added before the shiftaccumulator.

The amount of memory is reduced from  $2^N$  words to  $2 \cdot 2^{N/2}$  words if the original memory is partitioned into two parts.

For example, for N = 10 we get

 $2^{10} = 1024$  words to  $2 \cdot 2^5 = 64$  words.

 $2^{13} = 1024$  words to  $2 \cdot 2^{3} = 64$  words.

Hence, this approach reduces the memory significantly at the cost of an additional adder.

This scheme effectively doubles the throughput since two inner products are computed concurrently for a small increase in chip area.



DSP Integrated Circuits

Department of Electrical Engineeri Linköping University larsw@isy.liu.se http://www.es.isy.liu.su



16

### **Memory Coding**

The second approach is based on a special coding of the ROM content. Memory size can be halved by using the ingenious scheme based on the identity

$$x = \frac{1}{2}[x - (-x)]$$

In two's-complement representation the identity can be rewritten

$$x = \frac{1}{2} \left[ -x_0 + \sum_{k=1}^{W_d - 1} x_k 2^{-k} - \left( -\overline{x_0} + \sum_{k=1}^{W_d - 1} \overline{x_k} 2^{-k} + 2^{-W_d + 1} \right) \right] =$$

$$= -(x_0 - \overline{x_0}) 2^{-1} + \sum_{k=1}^{W_d - 1} (x_k - \overline{x_k}) 2^{-k} - 1 - 2^{-W_d}$$

Notice that  $(x_k - \overline{x_k})$  can only take on the values -1 or +1.

Inserting this expression into the inner product yields



where 
$$F_k(x_{1k}, x_{2k}, ..., x_{Nk}) = \sum_{i=1}^{N} a_i(x_{ik} - \overline{x_{ik}})$$

The function  $F_k$  is shown in the table for N = 3.

| $x_1 x_2 x_3$ |   | <i>x</i> <sub>3</sub> | $F_k$              |  |
|---------------|---|-----------------------|--------------------|--|
| 0             | 0 | 0                     | $-a_1 - a_2 - a_3$ |  |
| 0             | 0 | 1                     | $-a_1 - a_2 + a_3$ |  |
| 0             | 1 | 0                     | $-a_1 + a_2 - a_3$ |  |
| 0             | 1 | 1                     | $-a_1 + a_2 + a_3$ |  |
| 1             | 0 | 0                     | $+a_1 - a_2 - a_3$ |  |
| 1             | 0 | 1                     | $+a_1 - a_2 + a_3$ |  |
| 1             | 1 | 0                     | $+a_1 + a_2 - a_3$ |  |
| 1             | 1 | 1                     | $+a_1 + a_2 + a_3$ |  |

Antisymmetry

DSP Integrated Circuits

Department of Electrical Engineering Linköping University

larsw@isy.liu.se http://www.es.isy.liu.se/



19



Distributed arithmetic with halved ROM.

Only *N*–1 variables are used to address the memory.



Notice that only half the values are needed, since the other half can be obtained by changing the signs.

To explore this redundancy we make the following address modification shown to the right in the table below.

$$u_1 = x_1 \oplus x_2$$

$$u_2 = x_1 \oplus x_3$$

$$A/S = x_1 \oplus x_{sign-bit}$$

| $x_1 x_2 x_3$ | $F_k$              | $u_1 u_2$ | A/S |
|---------------|--------------------|-----------|-----|
| 0 0 0         | $-a_1 - a_2 - a_3$ | 0 0       | A   |
| 0 0 1         | $-a_1 - a_2 + a_3$ | 0 1       | A   |
| 0 1 0         | $-a_1 + a_2 - a_3$ | 1 0       | Α   |
| 0 1 1         | $-a_1 + a_2 + a_3$ | 1 1       | A   |
| 1 0 0         | $+a_1 - a_2 - a_3$ | 1 1       | S   |
| 1 0 1         | $+a_1 - a_2 + a_3$ | 1 0       | S   |
| 1 1 0         | $+a_1 + a_2 - a_3$ | 0 1       | S   |
| 1 1 1         | $+a_1 + a_2 + a_3$ | 0 0       | S   |

DSP Integrated Circuits

Department of Electrical Engineering Linköping University arsw@isy.liu.se http://www.es.isv.liu.se/



#### **COMPLEX MULTIPLIERS**

Classical solution require 3 real multiplications and two adder networks.

Let

$$X = a + jb$$
 and  $K = c + jd$ 

where K is the fixed coefficient and X is the variable. Once again we use the identity

$$\mathbf{x} = \frac{1}{2} [\mathbf{x} - (-\mathbf{x})] = \frac{1}{2} \left[ -x_0 + \sum_{k=1}^{W_d - 1} x_k 2^{-k} - \left( -\overline{x_0} + \sum_{k=1}^{W_d - 1} \overline{x_k} 2^{-k} + 2^{-W_d + 1} \right) \right] =$$

$$= -(x_0 - \overline{x_0})2^{-1} + \sum_{k=1}^{W_d - 1} (x_k - \overline{x_k})2^{-k-1} - 2^{-W_d}$$



$$\mathbf{K} \cdot \mathbf{X} = (ca - db) + i(da + cb)$$

$$= \left\{ -c(a_0 - \overline{a_0})2^{-1} + \sum_{i=1}^{W_d - 1} c(a_i - \overline{a_i})2^{-i-1} - c2^{-W_d} \right\} -$$

$$-\left\{-d(b_0 - \overline{b_0})2^{-1} + \sum_{i=1}^{W_d - 1} d(b_i - \overline{b_i})2^{-i - 1} - d2^{-W_d}\right\} +$$

$$+j\left\{-d(a_0-\overline{a_0})2^{-1}\sum_{i=1}^{W_d-1}d(a_i-\overline{a_i})2^{-i-1}-d2^{-W_d}\right\}+$$

$$+j \left\{ -c(b_0 - \overline{b_0})2^{-1} + \sum_{i=1}^{W_d - 1} c(b_i - \overline{b_i})2^{-i - 1} - c2^{-W_d} \right\} =$$

DSP Integrated Circuits

Linköning University



It is obvious from the table that only two coefficients are needed, (c+d)and (c-d).

The appropriate coefficients can be directed to the accumulators via a 2:2multiplexer.

If  $a_i \oplus b_i = 1$  the F values are applied directly to the accumulators, and if  $a_i \oplus b_i = 0$  the F values are interchanged.

The F values are either added to, or subtracted from, the accumulator's registers depending on the data bits  $a_i$  and  $b_i$ .





$$= \left\{ -F_1(a_o, b_0)2^{-1} + \sum_{i=1}^{W_d - 1} F_1(a_i, b_i)2^{-i - 1} + F_1(0, 0)2^{-W_d} \right\} +$$

$$+j \left\{ -F_2(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F_2(a_i, b_i)2^{-i-1} + F_2(0, 0)2^{-W_d} \right\}$$

Hence, the real and imaginary parts of the product can be computed using just two distributed arithmetic units.

The binary functions  $F_1$  and  $F_2$  can be stored in a ROM, addressed by the bits  $a_i$  and  $b_i$ . The ROM content is

| $a_i b_i$ | $F_1$  | $F_2$  |
|-----------|--------|--------|
| 0 0       | -(c-d) | -(c+d) |
| 0 1       | -(c+d) | (c-d)  |
| 1 0       | (c+d)  | -(c-d) |
| 1 1       | (c-d)  | (c+d)  |

Antisymmetry

DSP Integrated Circuits

Department of Electrical Engineering Linköping University

larsw@isv.liu.se http://www.es.isy.liu.se



24

#### **IMPROVED SHIFT-ACCUMULATOR**

The last term in the real part (and the same for the imaginary part)

$$KA = \left\{ -F_1(a_o, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F_1(a_i, b_i)2^{-i-1} + F_1(0, 0)2^{-W_d} \right\} + \text{imaginary part}$$

shall be added to the first term in the sum,  $F_{Wd-1}$ , at the same level of significance. This can be accomplished by initially setting the carry D flipflops to F(0, 0, ..., 0), as illustrated below where only the upper part of the shift-accumulator part is shown.



# **Complex Multiplier Using Two-Phase Logic**

Layout of one half of a complex multiplier based on the improved shiftaccumulator using two-phase clocking. The coefficient word length is 8 bits. A 0.8-um double metal CMOS process was used. The maximal clock frequency are about 175 MHz at 5 V.

The chip area is  $440 \, \mu \text{m} \times 200 \, \mu \text{m} = 0.088 \, \text{mm}^2$ .



DSP Integrated Circuits

Linköpine University



27

# FFT PROCESSOR, CONT.



The decimation-in-frequency radix-2 bit-serial butterfly PE has been implemented in a 0.8 µm standard CMOS process.

# **Complex Multiplier Using TSPC Logic**

Layouts of one-half of a complex multiplier based on the improved shiftaccumulator using TSPC (True Single-Phase Clocking).

The maximal clock frequency is about 400 MHz at 5 V. The chip area is  $540 \,\mu\text{m} \times 250 \,\mu\text{m} = 0.135 \,\text{mm}^2$ . Drivers for the clock estimated to require an additional 0.052 mm<sup>2</sup>.



DSP Integrated Circuits Lare Wanhamma

Linköping University

http://www.es.isy.liu.se



The PE has a built-in coefficient generator that can generate all twiddle factors in the range 0 to 128, which is sufficient for a 1024-point FFT.

The layout using AMS 0.8-µm double metal CMOS process is shown below. It is clear that the coefficient generator and the complex multiplier occupy most of the area. The area is 1.47 mm<sup>2</sup>.



The maximal clock frequency at 3 V supply voltage is 133 MHz with a power consumption of 30 mW (excluding the power consumed by the clock).

DSP Integrated Circuits

Department of Electrical Engineering

larsw@isv.liu.se http://www.es.isv.liu.s

30

80% of the power is consumed in the complex multiplier and 5% in the coefficient generator.

The rest (15 %) is evenly distributed in the rest of the butterfly.

The D flip-flops and the gates at the bottom of the block diagram are the control.

DSP Integrated Circuits

Department of Electrical Engineering Linköping University arsw@isy.liu.se http://www.es.isy.liu.se/



31

Instead of storing  $W^p$ , we will store the values

$$\frac{C(a)+S(a)}{2}=\frac{1}{2}(\cos(a)-\sin(a))=\frac{1}{\sqrt{2}}\sin\left(a-\frac{\pi}{4}\right)$$

$$\frac{C(a)-S(a)}{2}=\frac{1}{2}(\cos(a)+\sin(a))=\frac{1}{\sqrt{2}}\sin(\frac{\pi}{4}+a)$$

where,  $a = 2\pi p/N$ .

The twiddle factors in the eight octants can be expressed in terms of the twiddle factors in the range 0 to  $\pi/4$ .

| Octant | а                                         | $a_0 a_1 a_2$ | b                  | $\frac{C+S}{2}$                                                    | $\frac{C-S}{2}$                                                   |
|--------|-------------------------------------------|---------------|--------------------|--------------------------------------------------------------------|-------------------------------------------------------------------|
| 0      | $0 \le a \le \frac{\pi}{4}$               | 000           | а                  | $\frac{1}{\sqrt{2}}\sin\left(\frac{\pi}{4}-\boldsymbol{b}\right)$  | $\frac{1}{\sqrt{2}}\cos\left(\frac{\pi}{4}-\boldsymbol{b}\right)$ |
| 1      | $\frac{\pi}{4} \le a \le 2\frac{\pi}{4}$  | 001           | $a-\frac{\pi}{4}$  | $\frac{-1}{\sqrt{2}}\sin(\boldsymbol{b})$                          | $\frac{1}{\sqrt{2}}\cos(\boldsymbol{b})$                          |
| 2      | $2\frac{\pi}{4} \le a \le 3\frac{\pi}{4}$ | 010           | $a-2\frac{\pi}{4}$ | $\frac{-1}{\sqrt{2}}\cos\left(\frac{\pi}{4}-\boldsymbol{b}\right)$ | -4 -                                                              |
| 3      | $3\frac{\pi}{4} \le a \le 4\frac{\pi}{4}$ | 011           | $a-3\frac{\pi}{4}$ | $\frac{-1}{\sqrt{2}}\cos(\boldsymbol{b})$                          | $\frac{-1}{\sqrt{2}}\sin(\boldsymbol{b})$                         |



#### **Twiddle Factor PE**

Twiddle factors can be generated in several ways: by using a CORDIC PE, via trigonometric formulas, or read from a precomputed table. Here we will use the latter method—that is, a PE that essentially consists of a ROM.

We have previously shown that it is possible to use only one  $W^p$  PE.

However, here it is better to use one for each butterfly PE, because the required chip area for a ROM is *relatively small*.

If only one ROM were used, it would have been necessary to use long bitparallel buses, which are costly in terms of area, to distribute the twiddle factors to the butterfly PEs.

The values of the twiddle factors, W, are spaced uniformly around the unit circle. Generally, there are N twiddle factors, but it is possible to reduce the number of unique values by exploring symmetries in the trigonometric functions. In fact, it can be shown that only N/8 + 1 coefficient values need be stored in a table.

DSP Integrated Circuits

Department of Electrical Engineerin Linköping University larsw@isy.liu.se http://www.es.isy.liu.se/



# DCT PROCESSOR, CONT.

The chip area needed to implement a vector-multiplier using distributed arithmetic grows as  $O(2^N)$  where N is the number of terms in the inner product.

The chip area required for implementing a one-dimensional MSDCT can be reduced by exploiting the symmetry (antisymmetry) of the basis functions.

A detailed study shows that the basic functions exhibit both symmetry and antisymmetry in their coefficients.

Using the same principle that was used in the linear-phase FIR filter structure, the pixels that are multiplied by the same coefficient are added (or subtracted).

This reduces the number of terms in the remaining inner products by 50%. The chip area is thereby reduced from  $O(2^N)$  to  $O(2^{N/2})$ , which is a significant reduction.





A 2-D DCT for  $16 \times 16$  pixels can be built using only one 1-D DCT PE which itself consists of 16 distributed arithmetic units with N = 8.

The TSPC based shift-accumulator can be used to implement a distributed arithmetic unit. The length of the shift-accumulator depends on the word length,  $W_{ROM}$ , which depends on the coefficients in the vector-products. In this case we assume that  $W_{ROM} = W_c + 1 = 12$  bits.

DSP Integrated Circuits Lars Wanhammar

Department of Electrical Engineering Linköping University larsw@isy.liu.se http://www.es.isy.liu.se/

