Summary: | Master's === National Central University === Graduate Institute of Information Management === 97 === With the growth of information technology, a large volume of digital documents and materials has become available. Without information technology, searching for information would require great human effort. To reduce users' effort, document discrimination systems have been developed and applied. Such systems usually discriminate documents automatically by their similarity. In Information Retrieval, researchers mainly use TF-IDF to represent the terms in documents, use those terms to form a Vector Space Model, and then compute document similarity from that model. This approach can be improved in two respects. First, documents contain compound nouns in addition to single terms. Second, different terms may be used to express the same concept. This paper proposes a method that forms the Vector Space Model from concepts extracted from the documents. The method has two steps: first, extracting concepts from the terms and compound nouns of the documents, and second, building a Vector Space Model with these concepts as dimensions. Experimental results show that the concept extraction approach outperforms TF-IDF in the accuracy of document similarity computation.
|
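The contrast the abstract draws between a term-based and a concept-based Vector Space Model can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's actual method: the compound-noun list, the term-to-concept table, the sample documents, and every function name are hypothetical, and the abstract does not say how concepts are extracted. Plain TF-IDF weighting with cosine similarity stands in for the similarity computation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into one {dimension: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {tok: (count / len(doc)) * math.log(n / df[tok]) for tok, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_compounds(tokens, compounds):
    """Greedily join adjacent token pairs that form a known compound noun."""
    out, i = [], 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in compounds:
            out.append(pair)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def to_concepts(tokens, concepts):
    """Replace each token by its concept label; unknown tokens are kept as-is."""
    return [concepts.get(tok, tok) for tok in tokens]


# Hypothetical resources, invented for this example only.
COMPOUNDS = {"sports car"}
CONCEPTS = {
    "sports car": "vehicle",
    "automobile": "vehicle",
    "repair": "maintenance",
    "service": "maintenance",
}

docs = [
    ["sports", "car", "engine", "repair"],
    ["automobile", "engine", "service"],
    ["stock", "market", "report"],
]

# Baseline: single terms as dimensions.
term_vecs = tfidf_vectors(docs)
term_sim = cosine(term_vecs[0], term_vecs[1])

# Concept-based: merge compound nouns, map to concepts, then build the model.
concept_docs = [to_concepts(merge_compounds(d, COMPOUNDS), CONCEPTS) for d in docs]
concept_vecs = tfidf_vectors(concept_docs)
concept_sim = cosine(concept_vecs[0], concept_vecs[1])

print(f"term-level similarity:    {term_sim:.3f}")
print(f"concept-level similarity: {concept_sim:.3f}")
```

In this toy data the first two documents share only the term "engine", so their term-level similarity is low. Once "sports car" and "automobile" map to one concept and "repair" and "service" to another, the two documents occupy the same dimensions and the concept-level score is higher.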