Energy-efficient and fast execution of computationally complex applications is increasingly becoming a key challenge for computer architects of the present generation. Due to the effects of the utilization wall, architecture design strategies that rely on Moore’s law cannot scale to meet the high throughput demands of next-generation embarrassingly parallel computations. Simply relying on the availability of many cores on a processing platform is failing to work as we go into sub-90 nm technology nodes. The apparently conflicting requirements of application scalability without significant compromise in performance can be addressed by designing Massively Parallel Reconfigurable Architectures.
The goal of this research activity is to develop a Massively Parallel Reconfigurable Architecture for solving embarrassingly parallel problems.
This activity is divided into two parts.
1. Design of a Reconfigurable Fabric
2. Automatic Identification of MIMO subgraphs from embarrassingly parallel applications for execution on the Reconfigurable Fabric.
This work aims to realize a generic architecture that can process single-stream and multi-stream OFDM systems. Merging all or some of the currently used standards into a single architecture is always a challenge because of their different algorithms and throughput requirements, and may not be an efficient solution for an actual opportunistic secondary user of the spectrum. A key requirement for multi-mode operation is therefore fast switching among the standards with minimal cycle overhead. We use algorithmic-level transformations such as 2-D transformation, pipelining, and realization of recursive computations. In general, the work focuses on function-level hardware reuse (channel estimation, IFFT, FFT, Viterbi and convolutional turbo decoders); it also uses datapath-level optimizations for efficient multiplexing among different standards.
Face recognition is a non-intrusive biometric identification method which is being extensively researched for better recognition and real-time performance. The factors that influence the performance of the algorithm used are illumination variations, facial expressions, pose angle, occlusions, and time delay, as individual faces change over time. We are in the process of building a full-fledged real-time face recognition system which can detect and recognize multiple faces in the input video stream.
The real-time face detection system detects faces of different sizes in the input image. We use the popular Viola-Jones cascade of classifiers for detecting faces. The possibility of mapping this algorithm onto a parallel architecture is being analyzed, considering limited memory bandwidth and real-time requirements.
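As a hedged illustration of why memory bandwidth matters for this detector: Viola-Jones classifiers evaluate rectangular (Haar-like) features through an integral image, so every rectangle sum costs four table lookups. The sketch below shows only that building block in plain Python. The function names `integral_image` and `rect_sum` are hypothetical and are not taken from the system described above.

```python
def integral_image(img):
    """Build a summed-area table with one extra zero row and column.

    ii[y][x] holds the sum of img[0..y-1][0..x-1].
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four lookups in the table."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

A two-rectangle Haar-like feature is then the difference of two `rect_sum` calls. The scattered lookups into the table are one reason memory access patterns matter when this kind of detector is mapped to parallel hardware.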
The real-time face recognition system recognizes single or multiple faces in each frame of the input video stream. When used in surveillance and security systems, recognizing multiple faces in real time has been a challenge. There is a tradeoff between the recognition performance of an algorithm, which is often proportional to its complexity, and real-time performance. The Weighted Modular Principal Component Analysis (WMPCA) algorithm provides parallel processing streams, making it suitable for real-time hardware implementations. WMPCA with a Radial Basis Function Neural Network (RBFNN) as the classifier has shown a better recognition rate than other simple classifiers. Because on-chip memory resources are limited, we make the architecture scalable with respect to image and database sizes by storing the data in off-chip memory. We also target a high-performance face recognition system on a parallel architecture.
The current trend of adding connectivity to a wide variety of devices has led to a proliferation of communication standards, in both wired and wireless domains. Many of the widely used standards make use of Orthogonal Frequency Division Multiplexing (OFDM), which is based on Fast Fourier Transform (FFT) computations. The size of the FFT used varies with the standard, and some standards even need multiple FFT sizes. In order to support multiple communication standards, say in a Cognitive Radio environment, it is essential to have a flexible FFT processor which can be configured on the fly to compute FFTs of varying sizes.
We have developed a framework for implementing reconfigurable FFT processors based on design parameters like energy, performance and FFT size. The framework uses a Radix-4n based parallel unrolled architecture built on a novel radix-4 butterfly unit. A 64-4K point FFT processor has been developed using this framework and implemented in 65 nm CMOS technology. Currently we are working on developing a similar framework for Coarse Grain Reconfigurable Architectures (CGRAs) using the REDEFINE CGRA platform.
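To make the radix-4 butterfly concrete, here is a minimal software sketch of a textbook recursive radix-4 decimation-in-time FFT. It is not the novel butterfly unit or the unrolled hardware architecture described above. The function name `fft_radix4` is hypothetical, and the input length is assumed to be a power of 4.

```python
import cmath


def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    # Split into four interleaved sub-sequences and transform each one.
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    q = n // 4
    out = [0j] * n
    for k in range(q):
        # Apply twiddle factors W_n^(r*k) to sub-transform r.
        a = subs[0][k]
        b = subs[1][k] * cmath.exp(-2j * cmath.pi * k / n)
        c = subs[2][k] * cmath.exp(-2j * cmath.pi * 2 * k / n)
        d = subs[3][k] * cmath.exp(-2j * cmath.pi * 3 * k / n)
        # Radix-4 butterfly: a 4-point DFT that needs only additions,
        # subtractions and multiplications by +j or -j.
        out[k] = a + b + c + d
        out[k + q] = a - 1j * b - c + 1j * d
        out[k + 2 * q] = a - b + c - d
        out[k + 3 * q] = a + 1j * b - c - 1j * d
    return out
```

The butterfly itself contains no general multiplications, only the twiddle factors before it do, which is a common argument for radix-4 structures in hardware. A reconfigurable processor would select how many such stages are active for a given FFT size.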
Numerical Linear Algebra (NLA) kernels play an important role in a wide range of application domains like wireless MIMO receivers, Kalman filtering, neural networks and data mining. Different applications require data paths of different precision due to distinct numerical stability requirements. There have been several attempts to accelerate these kernels on Graphics Processing Units and Field Programmable Gate Arrays.
With recent technology advancements, Coarse Grained Reconfigurable Architectures (CGRAs) have gained tremendous popularity due to their FPGA-like flexibility and ASIC-like performance. In CGRAs like REDEFINE, performance is achieved by placing a Custom Functional Unit (CFU) inside a Compute Element.
While NLA kernels like matrix multiplication, LU decomposition, Cholesky factorization, and QR decomposition have different data path execution, the semantics of the algorithms remain the same. Performance for such kernels can be improved by introducing a specialized hand-crafted CFU that can support the data paths of all these kernels.
Apart from performance, Pareto optimality plays an important role in the hardware implementations of these kernels, where each kernel can achieve performance equal to that of the other kernels.
Furthermore, it is important to focus on the possibility of algorithmic improvements and better scheduling techniques for NLA kernels. While two-dimensional systolic scheduling gives excellent performance for most NLA kernels, achieving such performance on CGRAs has been a challenging task. Some algorithms, like Givens rotation, can be improved to perform better than their existing implementations.
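For readers unfamiliar with Givens rotation, the following is a minimal sketch of the textbook method used to triangularize a matrix, as in QR decomposition. It is a baseline only and does not include the improvements referred to above. The names `givens` and `qr_givens` are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def qr_givens(A):
    """Reduce a copy of A (a list of rows) to upper-triangular form R."""
    R = [[float(v) for v in row] for row in A]
    m, n = len(R), len(R[0])
    for j in range(n):
        # Zero column j from the bottom up, rotating adjacent row pairs.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Columns left of j are already zero in both rows, so skip them.
            for k in range(j, n):
                upper, lower = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * upper + s * lower
                R[i][k] = -s * upper + c * lower
    return R
```

Each rotation touches only two rows, and rotations on disjoint row pairs are independent of one another. This independence is what systolic schedules exploit.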
Design of Hardware Accelerator for Accurate Alignment of Short Reads in Next Generation Sequencing [NGS] Platforms
Genomic science has been revolutionized by the development of next generation sequencing technologies. Commercial next generation sequencing platforms like GS FLX from 454 Life Sciences (Roche), Genome Analyzer from Illumina, SOLiD from Applied Biosystems, the CGA platform from Complete Genomics, PacBio RS from Pacific Biosciences, etc. typically produce millions of short reads of varying lengths (45 to 1000 base pairs).
Mapping short reads against a reference genome (billions of base pairs long) is typically the first step in analyzing such next-generation sequencing data, and it should be as accurate as possible. Short read mapping is one of the most challenging problems, directly affecting the throughput and performance of the sequencer pipeline. Because of the high number of reads to handle, numerous sophisticated algorithms have been developed in the last few years to tackle this problem. Many software mapping tools now exist, like BWA, Bowtie, Novoalign, SOAP2, BFAST, SSAHA2, MPscan, GAAST, etc. FPGA-based hardware solutions also exist, which try to perform several short read alignments at once. However, they have not achieved a significant performance improvement over their software counterparts, nor have they been able to scale with the growing short read data size and reference genome alignment complexity.
In our research, we are targeting the design of a novel scalable hardware architecture, which can perform efficient and accurate local alignment while mapping millions of short reads against the reference genome. The work aims at achieving a significant improvement in performance over the existing software and hardware platforms, thereby accelerating the genome sequencing pipeline.
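As a point of reference for what local alignment involves, the sketch below computes a Smith-Waterman score with a linear gap penalty. Smith-Waterman is the standard dynamic-programming formulation of local alignment; whether the proposed architecture uses it is an assumption here. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are hypothetical defaults.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end) for the best local alignment of read against ref.

    `end` is the 1-based position in ref where the best alignment ends.
    Only two rows of the score matrix are kept, so memory is O(len(ref)).
    """
    prev = [0] * (len(ref) + 1)
    best, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. Hardware aligners commonly exploit this property.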
Moore’s law continues to drive both chip complexity and performance to new highs. In order to develop and verify the resulting complex hardware/software systems, the hardware has to be executed (prototyped) to allow verification and software execution.
Functional verification continues to be a major concern. The number of chip iterations continues to grow due to increasing design complexity and functionality. In addition, software-driven applications and the complexities of software and hardware integration are moving at a faster pace than Moore’s law. Hence a verification gap is created, which emphasizes the need for simulation acceleration.
In the event-driven simulation method, we simulate events. In compiled-code simulation, we simulate the state of the hardware every cycle. In cycle-based simulation, we move task by task; these tasks contain multiple events within them. In the context of architecture simulation, when the program is compiled for a different ISA, the instructions that the native machine supports, basically the arithmetic and logical operations, are executed as they are on the native machine, while the instructions that are not part of its ISA are emulated. Hence the name, execution-driven simulation.
In this work, the same idea has been extended to hardware. Whatever was executed in software on the host as native code is now executed in hardware. Although the simulation is execution driven in hardware, we still need to keep the hardware flexible; otherwise it would be execution-driven simulation on hardware for just one application. So we also have to architect a way to make the execution-driven simulation flexible.
In brief, we aim to match and then improve on the performance that an execution-driven simulator achieves in software.
Reconfigurable computing is breaking down the barrier between hardware and software design technologies. The segregation between the two has become increasingly fuzzy because reconfigurable computing has now made it possible for hardware to be programmed and software to be synthesized. Reconfigurable computing can also be viewed as a trade-off between general-purpose computing and application-specific design. Given the architecture and design flexibility, reconfigurable computing has catalyzed progress in hardware-software codesign technology and in a vast number of application areas such as scientific computing, biological computing, artificial intelligence, signal processing, security computing, and control-oriented design.
Artificial Neural Networks (ANNs) are algorithmic techniques based on biological neural systems. ANNs are typically realized as software solutions, with the ANN captured in a High Level Language (HLL) such as C, C++, etc. Such solutions have performance limitations which can be attributed to reasons such as the following:
1. Code generated by the compiler cannot perform application-specific optimization.
2. Communication latencies between processors through a memory hierarchy could be significant due to the non-deterministic nature of the communications.
Nowadays Artificial Neural Networks (ANNs) are being increasingly used in applications that involve solutions to non-linear optimization problems and adaptive systems. An application-specific integrated circuit (ASIC) realization of an ANN can offer performance that is at least an order of magnitude higher than software. Hardware synthesis of ANNs on a reconfigurable chip will enable exploring a variety of architectures that best meet certain design and performance criteria.
Sparse matrices of huge sizes often appear in computations in fields such as network theory and graph theory. Due to the sparsity of non-zero elements, sparse matrices are represented in a compressed format using specialised data structures, which occupy less space in computer memory. When performing mathematical operations on these matrices, it is necessary to use specialised algorithms that exploit the sparsity of the elements and thereby improve the efficiency of the operations. As these sparse matrices are often huge, cache performance comes into the picture during the operations and has a significant effect on the computation time. As matrix-vector multiplication is the most common operation, much research towards reducing cache misses during matrix-vector multiplication has been published. Most of these methods use the Compressed Row Storage (CRS) and Incremental Compressed Row Storage (ICRS) data structures for storage, which allow easy manipulation of data.
However, the performance of sparse matrix-vector multiplication is limited by the data structure itself, as these data structures exhibit poor temporal and spatial locality, leading to a larger number of cache misses. Hence, there is a need for a sparse matrix compression format which can provide better cache performance during such operations and also improve the computation time. We are working on sparse matrix data structures which not only occupy less memory than CRS and ICRS, but whose corresponding Sawtooth Multiplication (SM) also improves cache performance during multiplication with a vector (SpMV) or with a series of vectors forming a dense matrix (SpMM).
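For reference, the baseline CRS layout and its matrix-vector product can be sketched as follows. This shows only the conventional format that the new data structure is compared against; it does not implement the proposed format or the Sawtooth Multiplication. The names `to_crs` and `spmv_crs` are hypothetical.

```python
def to_crs(dense):
    """Convert a dense matrix (list of rows) to CRS arrays.

    Returns (values, col_idx, row_ptr); row i occupies the slice
    row_ptr[i]:row_ptr[i + 1] of values and col_idx.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_crs(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CRS form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, so its access pattern is irregular.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The `values` and `col_idx` arrays are streamed sequentially, but the reads of `x` jump around according to the column indices. This indirect access is the source of the poor locality discussed above.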