Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier

3D graphics accelerators are often limited by their floating-point performance. A Graphics Processing Unit (GPU) has several specialized floating-point units to achieve high throughput and performance. The floating-point units account for a large part of the total area and power consumption, and hence archit...

Bibliographic Details
Main Author:	Stenersen, Espen
Format:	Others
Language:	English
Published:	Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon 2008
Subjects:	ntnudaim SIE6 elektronikk Krets- og systemkonstruksjon
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876
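
The abstract of this record reports a throughput of 38.4 Gbit/s at a 300 MHz clock, and the title gives a 128-bit input. The sketch below checks that the two figures agree if 128 input bits are consumed every cycle. The lane counts are an illustration only: they assume the 128-bit word carries both operands of each product, which the record does not state. All function names are hypothetical.

```python
# Storage widths in bits of the three formats named in the title
# (IEEE 754 binary16, binary32, binary64).
FORMAT_BITS = {"FP16": 16, "FP32": 32, "FP64": 64}

INPUT_BITS = 128    # from the title: 128-bit input
CLOCK_HZ = 300e6    # from the abstract: 300 MHz


def input_throughput_gbit_s(input_bits, clock_hz):
    """Input bits consumed per second, in Gbit/s, at one input word per cycle."""
    return input_bits * clock_hz / 1e9


def multiplies_per_cycle(input_bits, operand_bits):
    """Products per cycle.

    ASSUMPTION (not stated in the record): each input word holds
    both operands of every product it produces.
    """
    return input_bits // (2 * operand_bits)


if __name__ == "__main__":
    print("throughput (Gbit/s):", input_throughput_gbit_s(INPUT_BITS, CLOCK_HZ))
    for name, bits in FORMAT_BITS.items():
        print(name, "multiplies per cycle:", multiplies_per_cycle(INPUT_BITS, bits))
```

Under that packing assumption the same input width would carry one FP64, two FP32 or four FP16 multiplications per cycle, which is why the input bit rate is independent of the format.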

id	ndltd-UPSALLA1-oai-DiVA.org-ntnu-8876
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-ntnu-8876; 2013-01-08T13:26:27Z; Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier; eng; Stenersen, Espen; Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon; Institutt for elektronikk og telekommunikasjon; 2008; ntnudaim; SIE6 elektronikk; Krets- og systemkonstruksjon; Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876; Local ntnudaim:4191; application/pdf; info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	ntnudaim SIE6 elektronikk Krets- og systemkonstruksjon
spellingShingle	ntnudaim SIE6 elektronikk Krets- og systemkonstruksjon Stenersen, Espen Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
description	3D graphics accelerators are often limited by their floating-point performance. A Graphics Processing Unit (GPU) has several specialized floating-point units to achieve high throughput and performance. The floating-point units account for a large part of the total area and power consumption, so architectural choices must be evaluated carefully when implementing the design. GPUs are specially tuned for performing a set of operations on large sets of data. The task of a 3D graphics solution is to render an image or a scene. The scene contains geometric primitives as well as descriptions of the light, the way each object reflects light, and the viewer position and orientation. This thesis evaluates four different pipelined, vectorized floating-point multipliers supporting 16-bit, 32-bit and 64-bit floating-point numbers. The architectures are compared with respect to area usage, power consumption and performance. Two of the architectures are implemented at Register Transfer Level (RTL), tested and synthesized to see whether the assumptions made in the estimation methodologies are accurate enough to select the best architecture to implement, given a set of architectures and constraints. The first architecture trades area for lower power consumption, with a throughput of 38.4 Gbit/s at a 300 MHz clock frequency; the second architecture trades power for smaller area at equal throughput. The two architectures are synthesized at 200 MHz, 300 MHz and 400 MHz clock frequencies, in a 65 nm low-power standard cell library and a 90 nm general-purpose library, and for different input data format distributions, to compare area and power results across clock frequencies, input data distributions and target technologies. Architecture one has lower power consumption than architecture two at all clock frequencies and input data format distributions. At 300 MHz, architecture one has a total power consumption of 1.9210 mW at 65 nm and 15.4090 mW at 90 nm; architecture two has a total power consumption of 7.3569 mW at 65 nm and 17.4640 mW at 90 nm. Architecture two requires less area than architecture one at all clock frequencies. At 300 MHz, architecture one has a total area of 59816.4414 um^2 at 65 nm and 116362.0625 um^2 at 90 nm; architecture two has a total area of 50843.0 um^2 at 65 nm and 95242.0469 um^2 at 90 nm.
author	Stenersen, Espen
author_facet	Stenersen, Espen
author_sort	Stenersen, Espen
title	Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_short	Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_full	Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_fullStr	Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_full_unstemmed	Vectorized 128-bit Input FP16/FP32/FP64 Floating-Point Multiplier
title_sort	vectorized 128-bit input fp16/fp32/fp64 floating-point multiplier
publisher	Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon
publishDate	2008
url	http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8876
work_keys_str_mv	AT stenersenespen vectorized128bitinputfp16fp32fp64floatingpointmultiplier
_version_	1716520078797701120
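
The description field above gives power and area figures for both synthesized architectures at 300 MHz. As a reading aid, the following sketch turns those reported numbers into architecture-two-to-architecture-one ratios so the trade-off is easier to see. The figures are copied from the abstract; the dictionary layout and the function name are hypothetical, and nothing here is taken from the thesis itself beyond those eight values.

```python
# Synthesis results at 300 MHz as reported in the abstract.
# Keys: (architecture, technology node); power in mW, area in um^2.
RESULTS_300MHZ = {
    ("arch1", "65nm"): {"power_mw": 1.9210, "area_um2": 59816.4414},
    ("arch1", "90nm"): {"power_mw": 15.4090, "area_um2": 116362.0625},
    ("arch2", "65nm"): {"power_mw": 7.3569, "area_um2": 50843.0},
    ("arch2", "90nm"): {"power_mw": 17.4640, "area_um2": 95242.0469},
}


def arch2_over_arch1(metric, node):
    """Ratio of architecture two to architecture one for one metric and node."""
    return RESULTS_300MHZ[("arch2", node)][metric] / RESULTS_300MHZ[("arch1", node)][metric]


if __name__ == "__main__":
    for node in ("65nm", "90nm"):
        for metric in ("power_mw", "area_um2"):
            print(f"{node} {metric}: arch2/arch1 = {arch2_over_arch1(metric, node):.3f}")
```

A ratio above 1 means architecture two is higher on that metric. With the reported values, the power ratios are above 1 and the area ratios are below 1 at both nodes, matching the abstract's statement that architecture one saves power while architecture two saves area.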
