Summary: | Citations play an important role in the ranking of authors, journals, institutions, and organizations. A citing document sometimes cites the same reference several times in its full text, and these in-text citations are used in many application scenarios, such as: 1) finding the relationship between cited and citing papers; 2) identifying the influential cited papers from the set of references in a citing paper; 3) identifying suitable citation functions; and 4) studying in-text citations across the logical sections of papers to draw different findings. The accurate identification of in-text citations remains an open area of research. Recently, the complexities involved in the automatic identification of in-text citations have been reported, with an accuracy rate of 58%. This is due to many issues highlighted by state-of-the-art research. This paper investigates these issues in further detail: 1) by building on previous research; 2) by analyzing different referencing formats; and 3) by experimenting on a comprehensive data set. Based on this investigation, the paper proposes a taxonomy and a workable system that uses a set of heuristics built from a detailed study. The proposed model is then applied to an unseen, diversified data set taken from the Journal of Universal Computer Science and CiteSeer. The proposed model achieved an average F-score of 0.97, compared with the baseline of 0.58.
|