Are deep video architectures (also) biased toward texture rather than shape?

Convolutional neural networks (CNNs) have achieved high accuracy on several different perceptual tasks, such as object recognition and action recognition. Interpretability is required due to the significant impact of CNNs and the requirement of model improvement. Geirhos et al. suggested that ImageN...


Bibliographic Details
Main Author:	Li, Boyu
Format:	Others
Language:	English
Published:	KTH, Skolan för elektroteknik och datavetenskap (EECS) 2021
Subjects:	Computer and Information Sciences Data- och informationsvetenskap
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-304892

id	ndltd-UPSALLA1-oai-DiVA.org-kth-304892
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-304892 | 2021-11-17T05:33:56Z | Are deep video architectures (also) biased toward texture rather than shape? | eng | Tar djupa videoarkitekturer (också) partisk hänsyn åt textur snarare än form? | Li, Boyu | KTH, Skolan för elektroteknik och datavetenskap (EECS) | 2021 | Computer and Information Sciences | Data- och informationsvetenskap | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-304892 | TRITA-EECS-EX ; 2021:682 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Computer and Information Sciences Data- och informationsvetenskap
spellingShingle	Computer and Information Sciences Data- och informationsvetenskap Li, Boyu Are deep video architectures (also) biased toward texture rather than shape?
description	Convolutional neural networks (CNNs) have achieved high accuracy on several perceptual tasks, such as object recognition and action recognition. Given their significant impact and the need to improve them, interpretability is required. Geirhos et al. suggested that ImageNet-trained CNNs exhibit a bias towards learning texture rather than shape, and that shape-based representations offer previously unseen robustness to multiple image distortions. Inspired by their research, we extend it from object recognition to action recognition. In this project, we investigate whether the texture bias found for 2D CNNs is similarly present in video CNNs. Through experiments, we compare different models and different training datasets. Our results indicate that although the Kinetics-trained, UCF101-finetuned I3D outperforms the UCF101-trained TSN and the UCF101-trained CNN + LSTM at utilizing both shape and texture cues, it shows the largest texture bias among the three models. Recurrent video models and TSN models can reduce texture bias, possibly owing to their temporal modeling. We also suggest that the Kinetics-trained Flow I3D exhibits a smaller texture bias than the RGB one. Since model bias depends on the training dataset, a dataset with small static representation bias can force models to rely more on shape cues, reducing the tendency toward texture bias.
author	Li, Boyu
title	Are deep video architectures (also) biased toward texture rather than shape?
publisher	KTH, Skolan för elektroteknik och datavetenskap (EECS)
publishDate	2021
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-304892


Similar Items