Wordspotting from multilingual and stylistic documents

Document image analysis (DIA) tools and methods now make it possible to run keyword searches over collections of document images even when no transcription is available. In this context, much work has already been done on OCR as well as...

Bibliographic Details
Main Author:	Tarafdar, Arundhati
Other Authors:	Tours
Language:	en
Published:	2017
Subjects:	Document image analysis; Word spotting; Graphical documents; Information retrieval; Text-graphics separation; Filtering; Probability maps; Interest points (SIFT); Bangla; Probability matrix information; 2-D Filter; Water Reservoir Principle; Clustering; SIFT
Online Access:	http://www.theses.fr/2017TOUR4022/document

id	ndltd-theses.fr-2017TOUR4022
record_format	oai_dc
spelling	ndltd-theses.fr-2017TOUR4022 2019-01-19T04:37:15Z Wordspotting from multilingual and stylistic documents Repérage de mots dans les images de documents multilingues et graphiques Electronic Thesis or Dissertation Text en http://www.theses.fr/2017TOUR4022/document Tarafdar, Arundhati 2017-07-12 Tours Ramel, Jean-Yves Pal, Umapada
collection	NDLTD
language	en
sources	NDLTD
topic	Document image analysis; Word spotting; Graphical documents; Information retrieval; Text-graphics separation; Filtering; Probability maps; Interest points (SIFT); Bangla; Probability matrix information; 2-D Filter; Water Reservoir Principle; Clustering; SIFT
description	Document image analysis (DIA) tools and methods now make it possible to run keyword searches over collections of document images even when no transcription is available. In this context, much work has already been done on OCR and on word spotting systems dedicated to textual documents with a simple layout. By contrast, very few approaches have been studied for searching documents that contain multi-oriented and multi-scale text, such as graphical documents. For example, images of geographical maps can contain symbols, graphics, and text with different orientations and sizes. In these documents, characters can also be connected to one another or to graphical elements. Consequently, word spotting in these documents proves to be a difficult task. In this thesis we propose a set of tools and methods dedicated to spotting words written in Bengali or English (Roman script) characters in images of geographical documents. The proposed approach rests on several original contributions. === Word spotting in graphical documents is a very challenging task. To address such scenarios, this thesis develops a word spotting system dedicated to geographical documents in Bangla and English (Roman) scripts. In the proposed system, the text and graphics layers are first separated using filtering, clustering, and self-reinforcement through a classifier. Additionally, instead of a binary decision, we use a probabilistic measurement to represent the text components. Subsequently, in the text layer, a water-reservoir-based character segmentation approach is applied to extract individual characters from the document. These isolated characters are then recognized using rotation-invariant features coupled with an SVM classifier. 
Well-recognized characters are then grouped based on their sizes. Initial spotting then searches for a query word among those groups of characters. If the system spots a word only partially because of noise, SIFT is applied to identify the missing portion of that partial spotting. Experimental results on Roman and Bangla script document images show that the method is feasible for spotting a location in text-labeled graphical documents. Experiments were carried out on an annotated dataset developed for this work. We have made this annotated dataset publicly available to other researchers.
author2	Tours
author_facet	Tours Tarafdar, Arundhati
author	Tarafdar, Arundhati
author_sort	Tarafdar, Arundhati
title	Wordspotting from multilingual and stylistic documents
publishDate	2017
url	http://www.theses.fr/2017TOUR4022/document
work_keys_str_mv	AT tarafdararundhati wordspottingfrommultilingualandstylisticdocuments AT tarafdararundhati reperagedemotsdanslesimagesdedocumentsmultilinguesetgraphiques
_version_	1718814776132370432

Similar Items