ICT English-Persian comparable textual corpus

Document Type: Research Paper

Authors

Abstract

In computational linguistics, a corpus is a collection of written texts or spoken material in machine-readable form, assembled for studying linguistic structures and language change over time, and for use in natural language processing projects. In this paper, we focus on designing a bilingual corpus. The corpus is built automatically and consists of resources and documents in the ICT domain. We developed a software framework for building textual corpora that reduces construction cost and time. In addition, this software provides corpus management capabilities. We also propose an alignment method for the Persian-English ICT corpus. Our goal is to design an alignment system that extracts corresponding sentences. In this method, we use a bilingual dictionary and artificial intelligence techniques to calculate a score representing the similarity between two sentences. Then, we automatically map corresponding sentence pairs across the two languages.
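To make the idea of a dictionary-based similarity score concrete, the following is a minimal sketch in Python. It is not the method of the paper: the abstract does not specify the scoring formula or the artificial intelligence techniques used, so this sketch assumes a plain Dice-style overlap over dictionary matches and a greedy one-to-one mapping. The names `TOY_DICT`, `similarity`, `align`, and the `threshold` parameter are hypothetical, as are the dictionary entries, and tokenization is simple whitespace splitting.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary (assumption): English word -> Persian translations.
TOY_DICT: Dict[str, Set[str]] = {
    "network": {"شبکه"},
    "data": {"داده"},
    "computer": {"رایانه", "کامپیوتر"},
    "security": {"امنیت"},
}


def similarity(en_sentence: str, fa_sentence: str,
               dictionary: Dict[str, Set[str]]) -> float:
    """Dice-style score in [0, 1]: share of tokens on both sides that are
    linked by a dictionary entry. An assumed formula, not the paper's."""
    en_tokens = set(en_sentence.lower().split())
    fa_tokens = set(fa_sentence.split())
    if not en_tokens or not fa_tokens:
        return 0.0
    # English tokens that have at least one translation in the Persian sentence.
    matched_en = {w for w in en_tokens if dictionary.get(w, set()) & fa_tokens}
    # Persian tokens reached by those translations.
    matched_fa = {t for w in matched_en for t in dictionary[w] & fa_tokens}
    return (len(matched_en) + len(matched_fa)) / (len(en_tokens) + len(fa_tokens))


def align(en_sentences: List[str], fa_sentences: List[str],
          dictionary: Dict[str, Set[str]],
          threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Greedily map each English sentence to its best unused Persian sentence.
    Pairs scoring below `threshold` are dropped, since a comparable corpus
    need not contain a counterpart for every sentence."""
    pairs: List[Tuple[int, int, float]] = []
    used: Set[int] = set()
    for i, en in enumerate(en_sentences):
        best_j, best_score = -1, 0.0
        for j, fa in enumerate(fa_sentences):
            if j in used:
                continue
            score = similarity(en, fa, dictionary)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            used.add(best_j)
            pairs.append((i, best_j, best_score))
    return pairs


if __name__ == "__main__":
    english = ["network security", "computer data", "hello world"]
    persian = ["داده رایانه", "امنیت شبکه"]
    for en_idx, fa_idx, score in align(english, persian, TOY_DICT):
        print(en_idx, fa_idx, round(score, 2))
```

In this sketch the third English sentence has no dictionary match and is left unaligned, which is the behavior one would want for a comparable (rather than strictly parallel) corpus. A practical system would also need Persian-aware tokenization and normalization, which are omitted here.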

Keywords