Bangla text corpus creation and word embeddings

This IUB-funded project aims to create a Bangla language corpus. The corpus currently contains 700,000 articles and is still growing. Based on this corpus, we are working on improving Bangla word embeddings. Word embeddings are vector representations of words that allow machines to learn semantic and syntactic meanings by performing computations on them. Two well-known embedding models are CBOW and Skip-gram. The methods proposed to evaluate the quality of embeddings are categorized into extrinsic and intrinsic evaluation methods. This research focuses on intrinsic evaluation of the models on tasks such as analogy prediction, semantic relatedness, synonym detection, antonym detection, and concept categorization.
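
As an illustration of one intrinsic task, analogy prediction is commonly scored with vector arithmetic: for "a is to b as c is to ?", the answer is taken to be the word whose vector is closest, by cosine similarity, to b - a + c. The following is a minimal sketch of that idea, not the project's actual evaluation code. The vocabulary, the three-dimensional vectors, and the function names `cosine` and `predict_analogy` are all hypothetical and chosen only for demonstration; real embeddings would be trained on the corpus with CBOW or Skip-gram.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' by nearest neighbour to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded from the candidate answers.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical toy vectors (assumption: hand-made, not taken from the corpus).
toy_embeddings = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}

print(predict_analogy(toy_embeddings, "man", "woman", "king"))
```

An analogy test set would then be scored as the fraction of questions for which the predicted word matches the expected one.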
