Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier

3D graphics accelerators are often limited by their floating-point performance. A Graphics Processing Unit (GPU) contains several specialized floating-point units to achieve high throughput. These units account for a large share of the total area and power consumption, and hence archit...

Bibliographic Details
Main Author: Stenersen, Espen
Format: Others
Language: English
Published: Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon 2008
Subjects: ntnudaim; SIE6 elektronikk; Krets- og systemkonstruksjon
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876
id ndltd-UPSALLA1-oai-DiVA.org-ntnu-8876
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-ntnu-8876
2013-01-08T13:26:27Z
Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
eng
Stenersen, Espen
Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon
2008
ntnudaim
SIE6 elektronikk
Krets- og systemkonstruksjon
Student thesis
info:eu-repo/semantics/bachelorThesis
text
http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876
Local ntnudaim:4191
application/pdf
info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic ntnudaim
SIE6 elektronikk
Krets- og systemkonstruksjon
spellingShingle ntnudaim
SIE6 elektronikk
Krets- og systemkonstruksjon
Stenersen, Espen
Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
description 3D graphics accelerators are often limited by their floating-point performance. A Graphics Processing Unit (GPU) contains several specialized floating-point units to achieve high throughput. These units account for a large share of the total area and power consumption, so architectural choices must be evaluated carefully when implementing the design. GPUs are tuned for performing a set of operations on large data sets. The task of a 3D graphics solution is to render an image or a scene. The scene contains geometric primitives as well as descriptions of the light, the way each object reflects light, and the viewer's position and orientation. This thesis evaluates four pipelined, vectorized floating-point multipliers supporting 16-bit, 32-bit and 64-bit floating-point numbers. The architectures are compared with respect to area, power consumption and performance. Two of the architectures are implemented at Register Transfer Level (RTL), tested and synthesized to determine whether the assumptions made in the estimation methodologies are accurate enough to select the best architecture from a given set of architectures and constraints. The first architecture trades area for lower power consumption, with a throughput of 38.4 Gbit/s at a 300 MHz clock frequency; the second trades power for smaller area at equal throughput. The two architectures are synthesized at 200 MHz, 300 MHz and 400 MHz, in a 65 nm low-power standard cell library and a 90 nm general-purpose library, and for different input data format distributions, to compare area and power across clock frequency, input data distribution and target technology. Architecture one consumes less power than architecture two at all clock frequencies and input data format distributions. At 300 MHz, architecture one has a total power consumption of 1.9210 mW at 65 nm and 15.4090 mW at 90 nm; architecture two has a total power consumption of 7.3569 mW at 65 nm and 17.4640 mW at 90 nm. Architecture two requires less area than architecture one at all clock frequencies. At 300 MHz, architecture one has a total area of 59816.4414 um^2 at 65 nm and 116362.0625 um^2 at 90 nm; architecture two has a total area of 50843.0 um^2 at 65 nm and 95242.0469 um^2 at 90 nm.
author Stenersen, Espen
author_facet Stenersen, Espen
author_sort Stenersen, Espen
title Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_short Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_full Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_fullStr Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_full_unstemmed Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_sort vectorized 128-bit input fp16/fp32/fp64 floating-point multiplier
publisher Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon
publishDate 2008
url http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876
work_keys_str_mv AT stenersenespen vectorized128bitinputfp16fp32fp64floatingpointmultiplier
_version_ 1716520078797701120
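
The figures quoted in the description can be cross-checked with a little arithmetic. The sketch below is not part of the thesis; it assumes that the stated throughput counts input bits consumed per clock cycle (128 bits, taken from the title), which is consistent with the reported 38.4 Gbit/s at 300 MHz. The names `throughput_gbit_s`, `arch2_over_arch1`, `POWER_MW` and `AREA_UM2` are hypothetical helpers introduced only for this illustration; the numeric values are copied from the 300 MHz results in the description.

```python
# Illustrative cross-check of the figures quoted in the description.
# Assumption (not stated in the record): throughput = input width * clock.

INPUT_WIDTH_BITS = 128   # from the title: 128-bit input
CLOCK_HZ = 300e6         # 300 MHz operating point

# Total power in mW at 300 MHz, keyed by (architecture, technology node).
POWER_MW = {
    ("arch1", "65nm"): 1.9210,
    ("arch1", "90nm"): 15.4090,
    ("arch2", "65nm"): 7.3569,
    ("arch2", "90nm"): 17.4640,
}

# Total area in um^2 at 300 MHz, keyed the same way.
AREA_UM2 = {
    ("arch1", "65nm"): 59816.4414,
    ("arch1", "90nm"): 116362.0625,
    ("arch2", "65nm"): 50843.0,
    ("arch2", "90nm"): 95242.0469,
}


def throughput_gbit_s(width_bits: int, clock_hz: float) -> float:
    """Input bits consumed per second, expressed in Gbit/s."""
    return width_bits * clock_hz / 1e9


def arch2_over_arch1(table: dict, node: str) -> float:
    """Ratio of architecture two to architecture one for one technology node."""
    return table[("arch2", node)] / table[("arch1", node)]


if __name__ == "__main__":
    print(f"throughput: {throughput_gbit_s(INPUT_WIDTH_BITS, CLOCK_HZ):.1f} Gbit/s")
    for node in ("65nm", "90nm"):
        p = arch2_over_arch1(POWER_MW, node)
        a = arch2_over_arch1(AREA_UM2, node)
        print(f"{node}: power ratio {p:.2f}, area ratio {a:.2f}")
```

Read this way, the numbers show the trade-off described in the abstract: architecture two needs less area than architecture one at both nodes but draws more power, with the power gap much wider in the 65 nm low-power library than in the 90 nm general-purpose library.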