# DIFFERENT APPROACHES OF LOW POWER VITERBI DECODER VIA REGISTER EXCHANGE: A REVIEW

Ruhaida Abdul Rashid<sup>1,2</sup>, Harlisya Harun<sup>1</sup>, Zuhanis Mansor<sup>2</sup>, Sahzilawati Mohammad Nor<sup>1</sup>

<sup>1</sup>Universiti Kuala Lumpur Malaysian Institute of Aviation Technology (UniKL MIAT)

<sup>2</sup>Universiti Kuala Lumpur British Malaysian Institute (UniKL BMI)

For Correspondence; ruhaida@unikl.edu.my

**ABSTRACT:** The Viterbi decoder is a popular device used in wireless communication to reduce the effect of noise. It decodes the code words generated by the convolutional encoder to reproduce the original data. Register exchange (RE) and Traceback (TB) are the two methods used in the Survival Memory Unit (SMU) when retrieving the original message. TB is more popular due to its low power and less complex process. Unlike TB, RE is widely used in high-speed and low latency applications where power consumption has always been the issue. Many types of research conducted on RE focuses on reducing its power-hungry tendencies. This paper reviews different techniques to reduce the power dissipation of the RE method.

Keywords: convolutional codes, power, register exchange, traceback, viterbi decoder

## 1. INTRODUCTION

Wireless digital communication is a type of communication, which can transmit information through the air without requiring wires or cables. The communication technology is becoming essential to many users as it gets people connected anytime and anywhere around the globe. In the present day, different types of wireless communication include satellite communication, infrared communication, broadcast radio, microwave communication, Wi-Fi, mobile communication and blue tooth technology.

In wireless digital communication, the receiver is expected to receive the same information sent via a transmitter. However, most of the communication channel will be affected by noise, thus distorting the originally transmitted information. In order to avoid the noise interference, the redundant bits are added to the transmitted information, so that it can be corrected at the receiver. This method is called forward error correction method (FEC). A convolutional code is widely used in FEC in wireless communications [1, 2, 3, 4]. There are three major decoding families used to decode the coded signal that uses convolutional coding; sequential, Viterbi and maximum a posteriori (MAP) [4]. Of all the families, Viterbi Decoding is most widely used [2, 3, 4, 5]. Its popularity extends to various digital communication such as cellular phones, video, and audio broadcasting receivers, and modems [6, 7, 8].

The Viterbi Decoder (VD) operates on a maximum likelihood (ML) algorithm called the Viterbi Algorithm (VA) which decodes the convolutional codes based on the minimum distance calculated between the expected codes and possible received codes. The unit composes of four main units; the Branch Metric Unit (BMU), Path Metric Unit (PMU), Add-Compare-Select Unit (ACSU) and Survival Memory Unit (SMU). The VD computes the survival path metric of all states throughout the trellis. The surviving path of each state is the minimum distance between the actual received symbols and the possible received symbols at each state. The PMU retains the temporary path metric of each state measured by the BMU module and the ACSU will then make the comparison and select the survival path metric of each state to be saved in the SMU for future computation. There are two methods used in the SMU to reconstruct the original bits; the Trace-back method (TB) and the Register Exchange (RE) method. Since the computational algorithm is very complex, both methods give different impact to the performance of the

Viterbi decoder. The TB method is known for its high latency due to the needs of tracing back the trellis based on the minimum path metric to reconstruct the original data. On the other hand, RE requires frequent data switching within registers which result in higher power consumption but at lower latency compared to TB [2, 9, 10]. Thus, RE is claimed to be suitable only for a smaller number of states or constraint length of the convolutional code. Despite being the most popular decoding technique used in wireless communication, VD is known for its excessive power dissipation with RE being higher than TB. Two modules known for its power hunger are the ACSU and SMU which takes approximately 75% and 50% [10, 11, 12].

In channel coding, the RE has the advantage of high speed and low latency; many attempts were made to reduce its power consumption at higher constraints to expand its application. This paper reviews different approaches to reduce the power utilization of the Viterbi Decoder using the RE method at a higher constraint length.

## 2. PREVIOUS WORK

Chih-Jhen [13] proposed a different alternative to reduce power utilization in the SMU. As stated in their paper, the hybrid method was widely used in some VD designs to compromise the disadvantage between the RE and TB methods. However, this method caused high power consumption in the SMU due to a high number of register banks required. In their work, a power-saving RAM is used in replacing the power-hungry register banks used in the hybrid design.

In their VD architecture, 3 new units were proposed; Path-Select Unit (PLSU), Current-State Unit (CSU) and Path-State Unit (PSU). All 3 units employ the usage of RAM to record the most survivable path and the output bit stream. The proposed architecture is shown in Fig 1 below.

In their design, a compare-select-add unit (CSAU) and Path Metric Memory (PMM) shown in Figure 2 was used to reduce hardware costs and area in determining the branch metric and path metric. The Pulse-Select Unit (PLSU) will reconstruct all the probable survivor paths, discarding all the branches and store the decision bits in the RAM. The architecture of PLSU is shown in Figure 3.



Figure 1: Additional units of the decoder [13]



Figure 2: Proposed CSAU and PMM structure [13]

From the unit, the PSU will generate the output bits and store them in the RAM of the SMU using RAM. The minimum hamming distance of the current state will choose the most probable survivor from the unit in Figure 3 and generate a RAM storage in the SMU.



Figure 3: Proposed PLSU structure [13]

This method of high constraint length is implemented on the CMOS  $0.18\mu m$ . The power evaluation of the SMU shows lower consumption levels compared to the TB and Hybrid method. The design did not only show a significant power

reduction but it's CSAU was able to reduce 50% of hardware complexity compared to classical ACSU.

Sunil [14] proposed a low power Viterbi decoder through a modified ACSU architecture and clock gating method in the SMU. As mentioned in their work, several architectures were designed using TB to reduce power consumption of the ACSU. They also stated that the SMU consumes 1/3 of power in the VD based on [15]. However, most of the design focused more on low constraint lengths. In their design, Sunil et al. modified the flow operation in the ACSU of [16] to further reduce the power consumption and delay. Here, pre-calculation differences and comparisons circuit of BM and PM is proposed. The butterfly operation shown in Figure 4 was re-arranged to perform a comparison operation of  $(PM_{t-l}^{i} - PM_{t-l}^{j})$  and  $(BM_t^{(j,p)} - BM_t^{(l,p)})$ .



Because of symmetric structure, each butterfly unit only requires to calculate  $PM_{t-1}^{i} - PM_{t-1}^{j}$ . The circuit in Figure 5 is designed to calculate the branch metric of the same butterfly unit. Both the pre-calculation of PM and calculation of BM has reduced the number of adders by half compared to the conventional ACSU. As a result, the power consumption of ACSU is also reduced.



Figure 5: Circuit for branch metric calculation [14]

In the SMU, the clock gating technique was implemented to avoid unnecessary clock switching which causes power consumption on every flip-flop. In their design, the clock gating unit was located at the load pointer. The pointer would update data in the data registers located in the SMU once the write signal was enabled. Thus, the clock gating technique reduced the number of clock switching which was unnecessary. These techniques of modifying the structure of ACSU and SMU have reduced the overall power consumption of the Viterbi Decoder. The overall performance was made against [16] that has a 20% power improvement compared to conventional VD proposed in [17] but with some performance degradation. The comparison between both studies show a 66% improvement of power. According to the author, the delay shall be reduced with the modified architecture thus increasing the throughput. However, there was no significant data in his work showing an improved throughput compared to [16].

Bhowal [11] proposed a low power Viterbi Decoder focusing more on the circuit design of the ACS module. He has proposed the use of simple Full Adders in determining the branch metric instead of 8x3 decoders employed in the conventional methods. Figure 6 and Figure 7 below show a comparison between the two methods; conventional and proposed with full adders. The full adders design has reduced the logic required in calculating the branch metric.

The author has also proposed a new Compare-Select-Add Module (CSU) replacing the conventional Add-Compare-Select Module (ACS). The number of processes is reduced by comparing the path metric of each butterfly unit.



Figure 6: Conventional Method [11]



Figure 7: Proposed method [11]

Instead of adding the previous path metric with two branch metrics transiting into the same next state and selecting the lower metric, the comparison of the 2 path metrics is first performed. Once the comparison is made, the lowest path metric is added to the two branch metrics leading to the next two states. This operation is explained in Figure 8[11].

The proposed module reduces the number of adders by two units as compared to the conventional ACS module. A comparison was made with the conventional RE method and it shows a power consumption reduction of 10%. However, through this design, the area utilization is compromised due to the number of adders required for every branch metric calculation.

Abdulrazaq Muhammad B. [18] has designed a low complexity FPGA implementation via register exchange method that reduces the size of registers used to store the state metric. The proposed method reduces switching of partially decoded data among registers, thus lowering the

requirement of multiplexer wiring (used in performing the switching and operations of shift registers) which can contribute to higher power consumption. The proposed method is illustrated in the following Table 1:



Figure 8: Proposed ASC module [11]

| Table 1: Split search method [18] |                        |   |                                |
|-----------------------------------|------------------------|---|--------------------------------|
|                                   | Present symbol         |   | Previous symbol                |
| 1 <sup>st</sup> bit               |                        |   |                                |
| 2 <sup>nd</sup> bit               | Find BM of<br>symbol l | 1 |                                |
| 3rd bit                           |                        |   | Find State metrics (Symbol 1)  |
|                                   |                        |   | Find surviving states          |
|                                   |                        |   | Update history                 |
| 4 <sup>th</sup> bit               | Find BM o              | f | Find Minimum metric (Symbol 1) |
|                                   | symbol 2               |   | Determine data output.         |
| 5 <sup>th</sup> bit               |                        |   | Find State metrics (Symbol 2)  |
|                                   |                        |   | Find surviving states          |
|                                   |                        |   | Update history                 |
| 6 <sup>th</sup> bit               | Find BM o              | f | Find Minimum metric(Symbol 2)  |
|                                   | symbol 3               |   | Determine data output.         |

To achieve reductions of register size, the process of determining the minimum path metric of each state is split into two. In his example, the author used a constraint K of 7 and code rate of 1/2, resulting in 64 available states. The states are then grouped into 8 groups and the minimum path metric is calculated simultaneously within the eight groups rather than making comparisons in every state for all 64 states.

As Illustrated in Table 1, when the second bit is received (i.e., first symbol), the branch metric is calculated simultaneously within all the eight groups. Each group will then calculate an individual state metric by adding the previous state metric to the current branch metric. From the calculated state metric, each state will record its surviving path metric. At this point, each group will have its own surviving path. When the 4th bits are received (2nd symbol), the minimums among the eight groups are then compared to each other to find the ultimate minimum. This method shows a reduced complexity of 5.5% compared to the conventional trace-back method and 1-pointer algorithm trace-back [4]. The result also shows lower latency and complexity compared to [9] which proposed an improved TB method in the same year.

Tajeswini Panse [19] introduced an advanced version of a register exchange called the hybrid register exchange method (HREM). It is a combination method between register exchange method (REM) and Traceback (TB). In this method, the initial state is transferred to the current state after it is traced back through an 'm' cycle. Which means the content of the current state is transferred for every subsequent T, eg., T=2, T=4, T=6 etc. The author has concluded that the hybrid reduces the memory requirement which also reduces the area and power of the Viterbi Decoder. However, there was no proper comparison made between this design against any other design. In theory, if a traceback method is performed, additional registers at each state is required to save the data which then allow a trace back through the mcycle. If an additional register is required for each state, then the area utilization will also increase. Thus, contradicting the conclusions made by the author. As there was no clear comparison made, a fair observation can be made, showing no reduction in area and power utilization.

## 3. CONCLUSION

Various methods and techniques were used to improve the power utilization of a higher constraint K of the Viterbi Decoder design. In RE, the high-power consumption is due to the switching of partially decoded data.

Chen [13] has proposed the usage of RAM in its CSAU to reduce the hardware complexity applicable on CMOS technology which makes their design more suitable for portable wireless devices.. Joshi & Paily [14] modified its ACSU process and implemented a clock-gating technique in its SMU but only at specific operating speed due to critical slack which is close to zero. Bhowal [11] has to compromise area utilization due to the increased number of adders. While Panse & Saratkar [19] has not shown a fair comparative result to prove its area and power utilization. However, Abdulrazaq [18]has shown a reasonably reduced complexity which also affects the power utilization. The design has proved a lower complexity through its split search method. Even though the number of multiplexers are increased, the wirings required at each stage are less, thus contributing to lower power consumption. Because of the split search, the design has a slight increase in latency compared to One Pointer design but lower compared to the conventional TB method.

If observed carefully, the latency could be further reduced if the groups can be reduced from 8 to 4 in each cycle. This may increase power utilization due to a higher number of states to be processed in each group, but may also be compromised if the number of adders are reduced by implementing the method used in[14]. Thus, with this design, a possibility of reducing power can be achieved without losing the latency.

## 4. REFERENCES

 [1] Arshad, N., & Basit, A. Implementation and Analysis of Convolutional Codes using MATLAB. *International Journal of Multidisiplinary Sciences and Engineering*, 3(8), 9–12 (2012). Retrievedfrom http://www.ijmse.org/Volume3/Issue8/paper2.pdf

- [2] Fu, C., Li, X., & Ai, B. A Low-Latency and Power-Efficient Viterbi Decoder Based on Dynamic Truncation Length. In *AICWMMN2013 Proceedings* (pp. 367–370). (2013)
- [3] Pradan, A. K., & Nandy, S. K. A Reconfigurable Viterbi Decoder for SDR and Mobile Communications. In International Conference on High Performance Computing and Applications (ICHPCA) (pp. 1–6). (2014)
- [4] Vikramanarasimhareddy, S., Charan Kumar, K., & Koppala, N. Design of Convolutional Codes for varying Constraint Lengths. *International Journal of Engineering Trends and Technology*, 4, 61–66. (2013)
- [5] Sharma, S., Vasudevamurthy, H. S., & Valarmathi, N. Design and FPGA Implementation of Block Synchronizer for Viterbi Decoder. In International Conference on Advances in Computing, Communications and Informations (ICACCI) (pp. 908–912). (2013)
- [6] Dawid, H., J.Joeressen, O., & Meyr, H. Viterbi Decoder: HIgh Performance Algorithm Architectures. 2005 IEEE 16th International Symposium on Personal, Indoor, and Mobile Radio Communication, 2. (2005)
- [7] Irkhede, S. R., & Haridas, S. L. Review of Viterbi Decoder for Trellis Coded Modulation System including. *International Journal of Engineering Research and Applications (IJERA)*, (April), 30– 35.(2014)
- [8] Sivasankar, G., & Thangarani, L. Performance Analysis of Viterbi Decoder for Wireless Applications. In *Advanced Computing: An International Journal* (Vol. 5, pp. 1–7). (2014) <a href="http://doi.org/10.5121/acij.2014.5401">http://doi.org/10.5121/acij.2014.5401</a>
- [9] Bhatt, N., Shah, P. M., & Asodariya, P. BFPGA Implementation Of Power Efficient Low Latency Viterbi Decoder. *International Journal of Engineering Research & Technology (IJERT)*, 2(5), 37–43. (2013)
- [10] Lee, X. R., Chang, H. C., & Lee, C. Y. A low-power radix-4 Viterbi decoder based on DCVSPG pulsed latch with sharing technique. *IEEE Asia-Pacific Conference on Circuits and Systems, Proceedings, APCCAS*, 1203–1206. (2010)

http://doi.org/10.1109/APCCAS.2010.5774991

- [11] Bhowal, S. Transformation of ACS module to CSA module of low-power Viterbi decoder for digital wireless communication applications. *Proceedings of* the 2013 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2013, 266–270. (2013) http://doi.org/10.1109/ICACCI.2013.6637182
- [12] Shieh, M.-D., Wang, T.-P., & Yang, D.-W. Low-power register-exchange survivor memory architectures for Viterbi decoders. *Circuits, Devices Systems, IET*, 3(2), 83–90. (2009) <u>http://doi.org/10.1049/iet-cds.2008.0262</u>
- [13] Chen, C., Yu, C., Yen, M., Hsiung, P., & Chen, S. Design of a Low Power Viterbi Decoder for Wireless Communication Applications, 7–10. (2010)
- [14] Joshi, S. P., & Paily, R. Low power Viterbi Decoder by modified ACSU architecture and clock gating method. *ICCSP* 2011 - 2011 International Conference on

Communications and Signal Processing, 499–503. (2011)

- http://doi.org/10.1109/ICCSP.2011.5739371
- [15] Kang, I., & Willson, A. N. Low-power viterbi decoder for CDMA mobile terminals. *IEEE Journal of Solid-State Circuits*, 33(3), 473–481. (1998) http://doi.org/10.1109/4.661213
- [16] El-Dib, D. A., & Elmasry, M. I. Modified registerexchange Viterbi decoder for low-power wireless communications. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 51(2), 371–378. (2004) http://doi.org/10.1109/TCSI.2003.822396
- [17] Chang, Y., Suzuki, H., & Parhi, K. K. A 2-Mb/s 256state 10-mW rate 1/3 Viterbi Decoder. *IEEE Journal of Solid-State Circuits*, 35(6), 826–834. (2000)
- [18] Abdulrazaq, M. B., Abdullahi, Z. M., Almustapha, M. D., & Dajab, D. D. Low Complexity FPGA Implementation of Register Exchange Based Viterbi Decoder. In *IEEE International Conference on Emerging & Sustainable Technologies for Power & ICT in a Developing Society (NIGERCON)* (pp. 21–25). (2013)
- [19]Panse, T., & Saratkar, K. Design of Trellis Code Modulation Decoder Using Hybrid Register Exchange Method. *IEEE Advancing Technology for Humanity*, 265–269. (2014)