# SmartCell: A Coarse-Grained Reconfigurable Architecture for High Performance and Low Power Embedded Computing

Xinming Huang Worcester Polytechnic Institute, Worcester, MA 01609 Email: xhuang@ece.wpi.edu

### Abstract

This paper describes a novel coarse-grained reconfigurable architecture, namely SmartCell, targeted at high data throughput and computationally intensive applications. By integrating a large number of computational units with reconfigurable interconnection fabrics, SmartCell provides stream processing capacity that achieves both performance and power efficiency. Preliminary results show that the SmartCell architecture has the potential to improve power efficiency by more than 10 times compared with commercial FPGA solutions.

### Introduction

Data streaming applications, such as signal processing, multimedia and data encryption, are now the dominant workloads in many electronic systems. Field programmable gate arrays (FPGAs) are widely adopted as a reconfigurable solution to meet the ever-increasing computational demands of these applications. However, the flexibility and short time to market of fine-grained FPGA implementations come at a significant cost in area, power consumption and speed, due to the large routing area overhead and timing penalty in comparison with customized application specific integrated circuits (ASICs). A quantitative measurement of the gap between FPGAs and ASICs is available in [1].

To address this problem, a number of reconfigurable systems have been developed using coarse-grained structures, such as RaPiD [2], RAW and PipeRench. Because these coarse-grained reconfigurable architectures (CGRAs) have much less overhead, they have the potential to bridge the performance/power gap between FPGAs and ASICs. Most of the work in this area has focused on efficient design with respect to system performance and compilers. Power consumption is another important aspect of reconfigurable architecture design.

Power consumption has become one of the key challenges in today's mobile and multimedia applications. This paper presents SmartCell, a novel CGRA targeted at high-performance, low-power reconfigurable embedded systems. SmartCell integrates a large number of tiny processor cores, or cells, into a single chip. The cells are interconnected with programmable switching fabrics. Dynamic reconfigurability is achieved by changing the data path among the cells and by changing the configuration of each processing element in the cells. The SmartCell architecture is a biologically inspired computing approach, analogous to the cell cloning and destroying process in the human body. The number of processor elements (PEs) involved in the computing task can be dynamically changed to meet the application requirements. For instance, more cells can be involved to achieve high computational performance, while fewer cells are used when power consumption is the priority. On-chip power management techniques are also introduced to reduce dynamic and static power consumption. This paper provides an overview of the SmartCell structure. A benchmark application is mapped onto both the SmartCell architecture and a Xilinx FPGA board, and the hardware utilization and power consumption of the two platforms are compared.

### **SmartCell Architecture**

A block diagram of the SmartCell architecture is shown in Figure 1. This microsystem architecture is composed of three major components: the cell unit, the reconfigurable interconnect fabrics and the high-speed data I/O.



Figure 1. Block diagram of the SmartCell architecture

In a typical SmartCell architecture, a set of cell units is organized in a tiled structure. Each cell block consists of four processor elements along with the control and data memories. The reconfigurable connection fabrics are designed for inner-cell and inter-cell data communications. The data flow can be dynamically reconfigured for different applications. The number of PEs involved in the application tasks and the function of each PE can be changed in real time. Clock gating [3] is implemented to turn off the switching activity of inactive PEs, which can greatly reduce the dynamic power consumption. Two levels of pipelining are developed in the SmartCell architecture: the instruction level pipeline (ILP) inside each PE and the task level pipeline (TLP) among the cells. In our design, the global I/O buses connect directly to the edge cell components, as shown in Figure 1. The design of the cell component and the interconnections is discussed in the following subsections.

#### Cell unit

Each cell component consists of a group of four PEs. Each PE is a customized ultra-low-power arithmetic logic unit (ALU), which can be configured to perform complex logic, shift and arithmetic functions. Multiple PEs can be chained together to implement complex algorithms and functions. The desired operation is implemented by selecting the relevant input/output signals and by setting the corresponding controls for every operator inside the PE, as specified in the instruction code. A new instruction is not applied to the PE until the current instruction expires. The SmartCell architecture does not perform the sequence of instruction fetching, decoding, register read/write and ALU operations found in many contemporary processors. SmartCell is therefore able to provide power efficiency comparable to that of an ASIC while maintaining dynamic programmability.
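The configuration-driven behaviour described above can be illustrated with a minimal Python sketch. It is only a behavioural model under stated assumptions: the class names, the operation set in `OPS`, and the `cycles` field are hypothetical, since the paper does not specify the instruction format or how expiry is encoded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical operation set; the paper only names "logic, shift and arithmetic".
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "and": lambda a, b: a & b,
    "shl": lambda a, b: a << b,
}


@dataclass
class Instruction:
    op: str      # operator selected inside the PE
    cycles: int  # assumed field: cycles for which this configuration stays active


class ProcessingElement:
    def __init__(self) -> None:
        self.current: Optional[Instruction] = None
        self.remaining = 0

    def load(self, instr: Instruction) -> bool:
        """Apply a new instruction only once the current one has expired."""
        if self.remaining > 0:
            return False
        self.current = instr
        self.remaining = instr.cycles
        return True

    def step(self, a: int, b: int) -> int:
        """Run one cycle with the configured operator (no fetch or decode stage)."""
        if self.current is None or self.remaining == 0:
            raise RuntimeError("PE is not configured")
        self.remaining -= 1
        return OPS[self.current.op](a, b)
```

In this model a PE loaded with a two-cycle `mul` instruction rejects a new `load` until both cycles have been consumed, which mirrors the rule that a new instruction waits for the current one to expire.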

Inside each cell unit, four PEs in the east, west, south and north directions form a *quad* with a shared register memory (SRM) located in the center. The SRM is designed to provide efficient inner-cell data communications. Each PE has dedicated memory addresses in the SRM for writing data, which avoids contention and synchronization among multiple PEs. Any address can be read by any PE in the same cell, so data can be transferred efficiently among the PEs of a cell through the SRM. The controls for the read/write operations and the addresses are part of the instruction code stored in cache. As with the ALU operations, the register read/write access to the SRM for each PE is part of the instruction schedule.
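The SRM access rule (each PE writes only to its own addresses, every PE may read all of them) can be sketched as follows. This is an illustrative model, not the authors' design: the class name, the contiguous address partitioning and the `words_per_pe` size are assumptions, as the paper does not give the SRM size or address map.

```python
class SharedRegisterMemory:
    """Toy SRM: each PE owns a write region; any PE in the cell may read any address."""

    def __init__(self, num_pes: int = 4, words_per_pe: int = 4) -> None:
        self.num_pes = num_pes
        self.words_per_pe = words_per_pe  # assumed size, not given in the paper
        self.data = [0] * (num_pes * words_per_pe)

    def _check(self, pe: int, addr: int) -> None:
        if not 0 <= pe < self.num_pes:
            raise ValueError(f"unknown PE {pe}")
        if not 0 <= addr < len(self.data):
            raise IndexError(f"address {addr} out of range")

    def owner(self, addr: int) -> int:
        """PE that owns this address (assumed contiguous partitioning)."""
        return addr // self.words_per_pe

    def write(self, pe: int, addr: int, value: int) -> None:
        self._check(pe, addr)
        # Dedicated write addresses: no two PEs can write the same word,
        # so no arbitration or locking is modelled.
        if self.owner(addr) != pe:
            raise PermissionError(f"PE {pe} may not write address {addr}")
        self.data[addr] = value

    def read(self, pe: int, addr: int) -> int:
        self._check(pe, addr)
        return self.data[addr]
```

Because write regions never overlap, a value written by one PE can be picked up by any of the other three PEs without a handshake, which is the property the text relies on for contention-free inner-cell transfers.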

#### Reconfigurable interconnection

Aimed at run-time reconfigurability, the SmartCell architecture provides both inner-cell and inter-cell data flow controls. The SRM provides an efficient pipelined connection for the PEs in the same cell to exchange data. For inter-cell communications, since the four PEs of a cell are placed at its four edges, each PE can be directly linked to the adjacent PE located in another cell using short wires. Connections to the immediate neighbor cell have priority because of the short wire length and the smaller number of switchboxes. The operation of the SmartCell datapath is configured by programming the inner-cell and inter-cell connections through the instruction code. In general, the inner-cell connections are used more often than the inter-cell connections because of the tiled system structure, which in turn results in better system performance.
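The nearest-neighbor linkage described above can be made concrete with a small sketch. It assumes a rectangular grid of cells indexed by (row, column) with row 0 at the top, and that a PE on one edge of a cell faces the PE on the opposite edge of the adjacent cell; the function name and the coordinate convention are hypothetical.

```python
from typing import Optional, Tuple

# direction of the PE inside its cell -> (row delta, column delta, facing PE in the neighbor)
LINKS = {
    "north": (-1, 0, "south"),
    "south": (1, 0, "north"),
    "east": (0, 1, "west"),
    "west": (0, -1, "east"),
}


def neighbor_pe(
    row: int, col: int, direction: str, rows: int, cols: int
) -> Optional[Tuple[int, int, str]]:
    """Return (row, col, pe_direction) of the directly linked PE, or None.

    None means the PE sits on the boundary of the array, where there is no
    neighboring cell in that direction.
    """
    d_row, d_col, facing = LINKS[direction]
    r, c = row + d_row, col + d_col
    if 0 <= r < rows and 0 <= c < cols:
        return (r, c, facing)
    return None
```

For a 2x2 array, `neighbor_pe(0, 0, "east", 2, 2)` gives the west PE of cell (0, 1), while `neighbor_pe(0, 0, "north", 2, 2)` returns `None` because cell (0, 0) lies on the top edge.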

### **Benchmark Results and Discussions**

The SmartCell architecture is synthesized with the Synopsys Design Compiler tool using the TSMC 130nm ASIC library. Clock gating logic is automatically inserted to reduce the dynamic power consumption. The chip area of a 4-cell SmartCell system is about 1.33 mm<sup>2</sup>. To evaluate the performance of the SmartCell architecture, a 16-tap finite impulse response (FIR) filter has been prototyped on both the SmartCell architecture and an FPGA chip. The FPGA solution is based on the Xilinx Virtex-II Pro FPGA chip, which also uses a 130nm process. Both implementations are simulated at the same operating frequency of 50 MHz for a fair comparison. For a fully pipelined 16-tap FIR filter, 4 cells (16 PEs) are involved in the SmartCell implementation, while 16 MAC units are used in the FPGA implementation. The synthesis results in Table 1 indicate that the SmartCell system is about 10.9 times more power efficient than the FPGA implementation. Figure 2 shows the power consumption of the FIR filter versus the number of PEs and MAC units involved. The benchmark application demonstrates that SmartCell can reduce the power consumption by up to 18 times compared with the FPGA.
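As a functional reference for the benchmark, the sketch below computes a 16-tap FIR filter, y[n] = sum over k of h[k] * x[n-k], with one multiply-accumulate per tap. Assigning one tap to each PE, and four consecutive taps to each cell, is an assumption made here for illustration; the paper states only that 16 PEs in 4 cells are used and does not describe the mapping. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def fir_one_tap_per_pe(samples: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Direct-form FIR; coeffs[k] is the tap assumed to be handled by PE k."""
    taps = len(coeffs)
    delay = [0.0] * taps  # delay[k] holds x[n-k]
    out: List[float] = []
    for x in samples:
        delay = [x] + delay[:-1]  # shift the newest sample into the delay line
        acc = 0.0
        for k in range(taps):
            acc += coeffs[k] * delay[k]  # one multiply-accumulate per PE
        out.append(acc)
    return out


def pe_to_cell(pe_index: int, pes_per_cell: int = 4) -> Tuple[int, int]:
    """Assumed placement: (cell index, PE slot within the cell)."""
    return divmod(pe_index, pes_per_cell)
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick sanity check for any mapping of the taps onto PEs.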

Table 1. Performance of the fully pipelined 16-tap FIR filter

|           | Dynamic Power | Core Power | Gate equivalent |
|-----------|---------------|------------|-----------------|
| SmartCell | 20.3 mW       | 21.7 mW    | 265 k           |
| FPGA      | 157.6 mW      | 232.6 mW   | 82 k            |
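The power ratios implied by the tabulated values can be checked with a few lines of arithmetic. The dictionary layout and names below are illustrative only; the numbers are taken directly from Table 1. With these rounded entries the core power ratio comes out at roughly 10.7 and the dynamic power ratio at roughly 7.8, so the "about 10.9 times" quoted in the text may be based on unrounded or slightly different figures that the paper does not list.

```python
# Power figures in mW, copied from Table 1.
table1 = {
    "SmartCell": {"dynamic_mw": 20.3, "core_mw": 21.7},
    "FPGA": {"dynamic_mw": 157.6, "core_mw": 232.6},
}


def power_ratio(metric: str) -> float:
    """FPGA power divided by SmartCell power for the given metric."""
    return table1["FPGA"][metric] / table1["SmartCell"][metric]


dynamic_ratio = power_ratio("dynamic_mw")
core_ratio = power_ratio("core_mw")
print(f"dynamic power ratio: {dynamic_ratio:.1f}x")
print(f"core power ratio: {core_ratio:.1f}x")
```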



Figure 2. Power consumption vs. number of PEs/MACs

### **Conclusion and Future Work**

This paper presents the SmartCell architecture as an innovative CGRA design for high performance and low power embedded computing. The architecture is dynamically reconfigurable and evolvable, interconnecting a large number of processing elements with programmable switching fabrics in a tiled structure. The experimental results show the promising prospects of SmartCell. Future work involves optimization of the cell unit and the design of more complex switchbox-based interconnections to provide more data flow flexibility.

### References

- [1] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 26, no. 2, pp. 203-215, Feb. 2007.
- [2] D.C. Cronquist, C. Fisher, M. Figueroa, P. Franklin, and C. Ebeling, "Architecture design of reconfigurable pipelined datapaths," in *Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI*, pp. 23-40, Mar. 1999.
- [3] J. Oh and M. Pedram, "Gated clock routing for low-power microprocessor design," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 20, pp. 715-722, Jun. 2001.