
An Enhanced LSI based Search Engine for Arabic Medical Documents

EasyChair Preprint no. 80

6 pages. Date: April 21, 2018

Abstract

The vector space model (VSM) is widely used for representing text documents in data mining and information retrieval (IR) systems. However, this technique poses some challenges, such as a high-dimensional feature space and the loss of semantic information in the representation. Latent semantic indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. In particular, LSI has been successfully implemented in search engines and text classification tasks. In this paper, we propose a novel approach that enhances the standard LSI method by using cosine measures instead of word occurrences to form the LSI term-by-document matrix. We empirically evaluated the performance using an Arabic medical data collection that contains 800 documents with 47,222 unique words. A test set of five medical keywords was used to evaluate the quality of the top-20 retrieved documents using different numbers of singular values (i.e., different numbers of dimensions). The results show that the proposed method outperforms the standard LSI.
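For readers unfamiliar with the baseline, the following is a minimal sketch of standard LSI retrieval, assuming a raw word-count term-by-document matrix, a truncated SVD, and cosine ranking in the reduced space. It does not reproduce the cosine-based matrix construction proposed in the paper, since the abstract does not specify it. The function names (`build_term_doc_matrix`, `lsi_fit`, `lsi_rank`) and the toy English documents are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_term_doc_matrix(docs):
    """Build a raw-count term-by-document matrix (rows: terms, columns: docs)."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    return A, index


def lsi_fit(A, k):
    """Keep the k largest singular triplets of A (k should not exceed rank(A))."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def lsi_rank(query, index, Uk, sk, Vtk, top_n=20):
    """Fold the query into the latent space and rank documents by cosine similarity."""
    q = np.zeros(Uk.shape[0])
    for w in query.split():
        if w in index:  # words outside the vocabulary are ignored
            q[index[w]] += 1.0
    q_hat = (q @ Uk) / sk      # q^T U_k S_k^{-1}
    docs_hat = Vtk.T           # one row per document in the latent space
    denom = np.linalg.norm(docs_hat, axis=1) * np.linalg.norm(q_hat)
    sims = (docs_hat @ q_hat) / np.maximum(denom, 1e-12)
    order = np.argsort(-sims)[:top_n]
    return [(int(i), float(sims[i])) for i in order]


if __name__ == "__main__":
    # Hypothetical toy collection, not the paper's data set.
    docs = [
        "heart disease treatment",
        "heart attack symptoms",
        "diabetes insulin treatment",
        "insulin diabetes diet",
    ]
    A, index = build_term_doc_matrix(docs)
    Uk, sk, Vtk = lsi_fit(A, k=2)
    for doc_id, score in lsi_rank("heart", index, Uk, sk, Vtk, top_n=2):
        print(doc_id, round(score, 3))
```

Varying `k` in `lsi_fit` corresponds to the "different numbers of singular values" mentioned in the abstract, and `top_n=20` mirrors the top-20 evaluation setting.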

Keyphrases: Arabic text, dimensionality reduction, Latent Semantic Indexing, search engine

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:80,
  author = {Fawaz Al-Anzi and Dia Abuzeina},
  title = {An Enhanced LSI based Search Engine for Arabic Medical Documents},
  howpublished = {EasyChair Preprint no. 80},

  year = {EasyChair, 2018}}