7.1 a) These butterfly pairs, which computed concurrently, are determined by the relative indices.
We assume that the butterflies are labeled with numbers 0 to 7 from top to down for each butterfly in Figure 7.4.
First alternative:
Execute butterflies $p$ and $p+N / 4$ simultaneously for all stages, the butterfly pairs are $\{0,4\},\{1,5\},\{2,6\}$, and $\{3,7\}$ for all stages.
Second alternative:
Execute butterflies $p$ and $p+N_{s} / 2$ at the first stage, $p$ and $p+N_{s}$ for other stages.
Note that $N_{s}=2^{4-\text { stage }}$, the butterflies pares are therefore $\{0,4\},\{1,5\},\{2,6\}$, and $\{3,7\}$ for the first and the second stage, $\{0,2\},\{1,3\},\{4,6\}$ and $\{5,7\}$ for the third stage, and $\{0,1\},\{2,3\},\{4,5\}$, and $\{6,7\}$ for the final stage.
b) Obviously, $m$ ranges from 0 to $3 . N_{s}=2^{n-\text { stage }}$, i.e., $N_{s}=8$ for the first stage,
$N_{s}=4$ for the second stage, $N_{s}=2$ for the third stage, and $N_{s}=1$ for the final stage.
See a) for the range of $p$-values.
First alternative:
$k_{1}=2 N_{s}\left\lfloor m / N_{s}\right\rfloor+\left[m \bmod \left(N_{s}\right)\right]$
$k_{2}=\left\{\begin{array}{lc}k_{1}+N / 4 & \text { Stage }=1 \\ k_{1}+N / 2 & \text { Stage } \geq 2\end{array}\right.$
Stage $=1$ :
$2 N_{s}\left\lfloor m / N_{s}\right\rfloor=0$ and $m \bmod \left(N_{s}\right)=m, k_{1}=m, k_{2}=m+N / 4=m+4$,
$k_{1 N_{s}}=k_{1}+N_{s}=m+8$ and $k_{2 N_{s}}=k_{2}+N_{s}=m+4+8=m+12$.
In the same manner, we can determine the $k_{1}, k_{2}, k_{1 N_{s}}$, and $k_{2 N_{s}}$ for the other stage. The result is listed in the following table.

| Stage | $k_{1}$ | $k_{2}$ | $k_{1 N_{s}}$ | $k_{2 N_{s}}$ |
| :---: | :---: | :---: | :---: | :---: |
| 1 | $0,1,2,3$ | $4,5,6,7$ | $8,9,10,11$ | $12,13,14,15$ |
| 2 | $0,1,2,3$ | $8,9,10,11$ | $4,5,6,7$ | $12,13,14,15$ |
| 3 | $0,1,4,5$ | $8,9,12,13$ | $2,3,6,7$ | $10,11,14,15$ |
| 4 | $0,2,4,6$ | $8,10,12,14$ | $1,3,5,7$ | $9,11,13,15$ |

Second alternative:
$k_{1}=4 N_{s}\left\lfloor m / N_{s}\right\rfloor+\left[m \bmod \left(N_{s}\right)\right]$
$k_{2}=\left\{\begin{array}{cl}k_{1}+N_{s} / 2 & \text { Stage }=1 \\ k_{1}+2 N_{s} & \text { Stage } \geq 2\end{array}\right.$

Stage $=1$ :
$4 N_{s}\left\lfloor m / N_{s}\right\rfloor=0$ and $m \bmod \left(N_{s}\right)=m, k_{1}=m, k_{2}=k_{1}+N_{s} / 2=m+4$,
$k_{1 N_{s}}=k_{1}+N_{s}=m+8$ and $k_{2 N_{s}}=k_{2}+N_{s}=m+4+8=m+12$.
In the same way, we can determine the values for $k_{1}, k_{2}, k_{1 N_{s}}$, and $k_{2 N_{s}}$ at each stage. This results the following table:

| Stage | $k_{1}$ | $k_{2}$ | $k_{1 N_{s}}$ | $k_{2 N_{s}}$ |
| :---: | :---: | :---: | :---: | :---: |
| 1 | $0,1,2,3$ | $4,5,6,7$ | $8,9,10,11$ | $12,13,14,15$ |
| 2 | $0,1,2,3$ | $4,5,6,7$ | $8,9,10,11$ | $12,13,14,15$ |
| 3 | $0,1,8,9$ | $4,5,12,13$ | $2,3,10,11$ | $6,7,14,15$ |
| 4 | $0,4,8,12$ | $2,6,10,14$ | $1,5,8,13$ | $3,7,11,15$ |

c) We consider the simplification of one index in the first alternative, the other simplifications are left to the readers.
All indices is represented with binary numbers, for example, $m$ is $m=m_{1} \cdot 2^{1}+m_{0}$, where $m_{i}=0,1$. We give only on example for the simplification here, i.e., $k_{2}$.
Stage $=00: k_{2}=\left(01 m_{1} m_{0}\right)_{2}$
Stage $=01: k_{2}=\left(10 m_{1} m_{0}\right)_{2}$
Stage $=10: k_{2}=\left(1 m_{1} 0 m_{0}\right)_{2}$
Stage $=11: k_{2}=\left(1 m_{1} m_{0} 0\right)_{2}$
which means that $k_{2}=s_{1}+s_{0}, s_{1} \cdot m_{1}+\overline{s_{1}} \cdot \overline{s_{0}}, \overline{s_{1}} \cdot m_{1}+s_{1} \cdot m_{0} \cdot s_{0},\left(\overline{s_{1}}+\overline{s_{0}}\right) \cdot m_{0}$.
Hence the addition is not necessary in the computation of $k_{2}$.
d) Observe that the separation of $\left\{k_{1}, k_{1 N_{s}}\right\}$ and $\left\{k_{2}, k_{2 N_{s}}\right\}$ does not effected by $m$.

Hence the butterflies operations does not changed except the orders.
7.2 The modification does not differ too much from the Box 7.5. It is left to the readers. Some variables and procedures should be notified here: Ns initiated with $\mathrm{N} / 2$ in stead of N , the range of the loop variable m should be reduced to [0 ((N/8)-1)], two sets of twiddle factor should be generated instead of one, four butterflies are computed concurrently and four results should be written and read in stead of two.
7.3 In this case we have: $T_{\text {min }}=\frac{1}{2}\left(T_{a d d}+T_{m u l t}\right)$. The critical path is $T_{a d d}+T_{m u l t}$. In order to reach the maximal sample rate we will have to either use interleaving or pipelining.

## a) Unit time processors

Here we assume that $T_{a d d}=T_{\text {mult }}=1$ t.u. The minimal sample period, $T_{\text {min }}$ is 1 t.u. and the critical path 2 t.u.

## Pipelining

After pipelining the critical path is split in two sections of equal length. We can start a new addition and a new multiplication in each sample interval. The schedule for processors is shown below.

We will need only one processor of each type and their degree of utilization is $100 \%$.

## Interleaving

Interleaving of resources must be done if the critical path (adder - multiplier) is indivisible. We will have to start a new set of addition-multiplication every sample period. The schedule for processors is shown in the figure to the right.


Schedule using interleaving of resources.

Now, we need two adders and two multipliers. More processors are needed since the processing is sequential. Further, their degree of utilization is only $50 \%$.

## b) Non unit time processors

Now, assume that $T_{m u l t}=3 T_{a d d}$ and $T_{a d d}=1$ t.u. The minimal sample period will be $T_{\text {min }}=4 / 2=2 \mathrm{t} . \mathrm{u}$. and the critical path 4 t.u.

## Interleaving

If the adder and the multiplier are indivisible we will have the schedule shown below. We need two processors of each type which will be utilized to $25 \%$ and $75 \%$, respectively.


Schedule using interleaving of resources.

## Pipelining

If we pipeline, the critical path will have to be split in two parts of length 2 . This is assuming that the multiplier can be split into two parts. The pipelined filter is shown to the right while the operation schedule for this is shown below.

| $2 * / 3$ | $2 * / 3$ | $2 * / 3$ | $2 * / 3$ |
| :---: | :---: | :---: | :---: |
| $+* / 3$ | $+* / 3$ | $+* / 3$ | $+* / 3$ |



Pipelined filter with multiplier split into two parts. The first part has an execution time of $1 / 3$ while the second has an execution time of $2 / 3$ of a complete multiplicationpar

Schedule for filter with multiplier split into two parts.

Here we have one processor performing $+* / 3$ and one $2 * / 3$. They are both used $100 \%$ of the time.

## Pipelining and interleaving

In many cases we can not divide a processor in two parts. If we do not divide the multiplier but still pipeline as much as possible, i.e., move one delay element so that the


Schedule for not fully pipelined filterd.
critical path is split into one 1 t.u. section and one 3 t.u. section, we get the schedule shown below. Here we have, one adder which is utilized to $50 \%$ and two multipliers utilized $75 \%$ each.

## Conclusions

It is desirable to have equal processor execution times. Pipelining will improve the processor utilization.
7.4 We have: $T_{\min }=\max \left\{\frac{5}{1}, \frac{10}{2}, \frac{15}{3}\right\}=5$

We need 1 adder, 1 multiplier of the first type, 2 of the second and 3 of the third type. We introduce delays into the critical path so that is broken into smaller pieces. This is often call for retiming. The degree of utilization of the adder is $60 \%$ and it is $60 \%, 70 \%$, and $80 \%$ of the multipliers.


Retimed filter section.


Feasible schedule
7.7 a) The iteration bound is determined by

$$
T_{\min }=\max \left\{\frac{T_{o p_{i}}}{N_{i}}\right\}=\max \left\{\frac{4+1+1}{1}, \frac{4+1+1}{2}\right\}=5 .
$$

b) The minimal number of PEs is $N_{P E}=\left|\frac{\sum_{i} N_{o p_{i}} T_{o p_{i}}}{T_{\text {sample }}}\right|=\left\lceil\frac{5 \times 4+4 \times 1}{8}\right\rceil=3$.
c)
7.8 a) Specify the ordering of the additions. Chose to add $c_{0} x(n)$ and $b_{2} y(n-2)$ first. We get.

$$
T_{\min }=\max ((4+1) / 1,(4+1+1) / 2)=5 \text { t.u. }
$$


b)
c)
7.12 a) There are several ways to combine additions and multiplications into basic operations, but the critical path will contain at least 4 operations.


The number of clock cycles per sample is: $\frac{100 \times 10^{6}}{1.25 \times 10^{6}}=80$

6 PE operations requies $\frac{6 \times 20 \times 1.25 \times 10^{6}}{100 \times 10^{6}}=1.5 \Rightarrow 2 \mathrm{PE}$

A feasible scheduling is shown below.

7.13 Assume that we implement for $T=T_{\infty}$, otherwise we can implement it with 1 PE.

Let each operation correspond to a vertex and construct the connectivity graph.
Obviously all operations are overlap with each other so that there is no branch between each two vertices.


Connectivity graph
From the connectivity graph we have to allocate three PEs.
7.14


Thus, 3 resurcers are required.
7.16 We give only one of the assignment alternatives. The other alternatives are left to the readers.
a) The exclusion graph for the 16-point FFT can be constructed as the same way in 7.11.1 for the memory assignment 2 .


The four RAMs assignment can be expressed in terms of the binary representation of index $i=i_{3} i_{2} i_{1} i_{0}$. A variable of index $i$ is assigned to $\operatorname{RAM}_{P(i)}$ where
$P(i)=p_{1} p_{0}$
$p_{0}(i)=i_{0}$
$p_{1}(i)=i_{2} \oplus i_{1}$
$\begin{array}{lllllllllllllllll}\text { Data index } & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \text { Data } x(i) & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet & \bullet\end{array}$

$$
\begin{aligned}
& \text { A memory assignment for a } 16 \text {-point FFT. }
\end{aligned}
$$

b) Using the PE assignment 2 in 7.11.2, we have the following exclusion graph.


A possible assignment, where butterflies in rows $p, p+N / 8, p+N / 4$ and $p+3 N / 8$ are executed parallel, is given here. A butterfly operation in row $r$ is assigned to the $\mathrm{PE}_{P(r)}$ where

$$
\begin{aligned}
& P(i)=p_{1} p_{0} \\
& p_{0}(i)=r_{1} \\
& p_{1}(i)=r_{2}
\end{aligned}
$$

7.17 a) Number the variables from the top and downwards. Connect nodes that must lie in the same memeory cell. 4 memory cells is required.
b) 6 variables yield 12 memory accesses per sample period. A mem-
 ory must have the acess time $12 \times 40$ $=480 \mathrm{~ns}$ to be able to read and write all variables. This corresponds to a sample frequency of $1 / 480 \mathrm{~ns}=2.08 \mathrm{MHz}$.
7.18 We can use the clique partitioning and choose the maximum cliques.

7.19 a) Assign names to all nods:

b) $u_{1}:=a v_{1}(n) \quad$ Which can be simplified to:

$$
\begin{aligned}
& \mathrm{u}_{2}:=\mathrm{b} \mathrm{v}_{2}(\mathrm{n}) \\
& \mathrm{u}_{5}:=\mathrm{d} \mathrm{x}(\mathrm{n}) \\
& \mathrm{u}_{3}:=\mathrm{u}_{1}+\mathrm{u}_{2} \\
& \mathrm{u}_{4}:=\mathrm{x}(\mathrm{n})+\mathrm{u}_{3} \\
& \mathrm{u}_{6}:=\mathrm{c}_{4} \\
& \mathrm{y}(\mathrm{n}):=\mathrm{u}_{5}+\mathrm{u}_{6} \\
& \mathrm{v}_{2}(\mathrm{n}+1):=\mathrm{v}_{1}(\mathrm{n}) \\
& \mathrm{v}_{1}(\mathrm{n}+1):=\mathrm{x}(\mathrm{n})
\end{aligned}
$$

c)

7.20 Figure 7.20 a shows computation graph $N$ with delay of operations inserted and $T$ exchanged for $-T$. The maximal spanning tree is shown in Fig. 7.20b where edges belonging to the tree are drawn with thick lines and the link branches with thin lines.


Fig. 7.20a. The network $N$.
Fig. 7.20b. The maximum spanning tree of $N$.
Insert the link branches one by one and add shimming delays so that the total delay in the fundamental loops that are formed become zero. Finally, we remove the negative delays elements and arrive at the scheduled graph in Fig. 7.20c. Note that there are no simming delays in the critical loops and that the schedule include


Fig. 7.20c.Final scheduled computation graph.
7.21 $T_{\min }=\max \left\{\frac{T_{o p i}}{N_{i}}\right\}=\max \left\{\frac{3+1}{2}, \frac{3+1+1}{3}\right\}=2$ clock cycles
$f_{\text {max }}=\frac{1}{T_{\text {min }}}=\frac{f_{c l k}}{2}$

The critical path is shown below.


Since the critical loop is large than 1 sampling period, we have to schedule the operations in two sample periods. The scheduling with the maximal sample frequency is shown below.

7.22 a) The upper bound of required number of memory cells is equal to the number of variables, i.e., 8 .
The lower bound is equal to the total required lifetime divided by the available time, i.e., $\lceil 33 / 10\rceil=4$.
b) Sort the lifetime diagram according to the start time and lifetime and allocate the memories.

c) Each variable requires read and write within 2 sampling periods, i.e., total 16 memory accesses are needed. $T_{\text {access }}=\frac{2}{15 \times 10^{6} \times 16}=8.3 \mathrm{~ns}$.
7.23 a) $T_{\min }=\max \left\{\frac{T_{o p_{i}}}{N_{i}}\right\}=\max \left\{\frac{3+1}{1}, \frac{7+1}{2}\right\}=4$ clock cycles
b) The precedence graph is shown below:

c) Since the critical loop requires 8 clock cycles which is larger than the sampling period, we have to schedule the computation in two sampling periods. The minimal number of
PEs is $N_{P E}=\left|\frac{\sum_{i} N_{i} T_{o p_{i}}}{2 T_{\text {sample }}}\right|=\left\lceil\frac{(3 \times 3+2 \times 7+4 \times 1) \times 2}{2 \times 5}\right\rceil=\lceil 5,4\rceil=6$.
d)


