Does an Automated Assignment of Biological Topics Produce Relevant Semantic Meaning?

 

David C. McLean, Jr.1,2, Bin Zheng2, and Xinghua Lu2

 

1 Marine Biomedicine and Environmental Sciences Center, Medical University of South Carolina, Charleston, SC

2 Department of Biostatistics, Bioinformatics, and Epidemiology, Medical University of South Carolina, Charleston, SC

 

 

As the scientific literature grows rapidly, accurate, relevant, and automatic identification of topics in the electronic literature stream becomes increasingly valuable for both archival and real-time information retrieval. Text documents can be treated as mixtures of words drawn from different topics, and these topics can be inferred by statistical learning techniques. A probabilistic topic model, the latent Dirichlet allocation (LDA) model, was applied to automatically identify arbitrary biological topics from a corpus of Medline abstracts collected to describe protein function. The documents within the corpus were also annotated with Gene Ontology (GO) terms. The correlation between latent topics and GO annotations was quantified with the mutual information (MI) function. To determine whether MI provides relevant semantic information, latent topics generated by the LDA model were scored by a human curator. The correlation between the human-assigned scores and MI may contribute to validation of the LDA model for topic identification. These experiments explore applications of the LDA model in an automatic indexing system that may provide valuable information retrieval tools for scientists and clinicians.
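
As a rough illustration of how mutual information can relate a latent topic to a GO annotation, the following minimal sketch computes MI in bits between two binary indicators per document. The encoding is an assumption made only for this example and is not described in the abstract: `topic_flags` marks whether a document is assigned to a given topic, and `go_flags` marks whether it carries a given GO term. The function name `mutual_information` is likewise hypothetical.

```python
import math
from collections import Counter


def mutual_information(topic_flags, go_flags):
    """Mutual information (in bits) between two equal-length label sequences.

    Assumed encoding (illustrative only): one entry per document, e.g.
    1 if the document is assigned to the topic / GO term, 0 otherwise.
    """
    if len(topic_flags) != len(go_flags):
        raise ValueError("sequences must have the same length")
    n = len(topic_flags)
    if n == 0:
        raise ValueError("sequences must not be empty")

    joint = Counter(zip(topic_flags, go_flags))  # counts of (topic, GO) pairs
    topic_counts = Counter(topic_flags)          # marginal counts for the topic
    go_counts = Counter(go_flags)                # marginal counts for the GO term

    mi = 0.0
    for (t, g), c in joint.items():
        p_tg = c / n
        # p(t,g) / (p(t) * p(g)) simplifies to c * n / (count_t * count_g)
        ratio = c * n / (topic_counts[t] * go_counts[g])
        mi += p_tg * math.log2(ratio)
    return mi


# Toy data, not taken from the study
perfect = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

With this toy input, a topic that coincides exactly with a GO term yields 1 bit, and a topic that is statistically independent of the term yields 0 bits; real topic-to-GO scores would fall between these extremes.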