Development and evaluation of a biomedical search engine using a predicate-based vector space model

Myungjae Kwak, Gondy Augusta Leroy, Jesse D Martinez, Jeffrey Harwell

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query, using predicates instead of keywords for both query and document representation. Predicates are triples, which are more complex data structures than keywords and contain more structured information. To make optimal use of them, we developed a new predicate-based vector space model and a query-document similarity function with adjusted tf-idf and a boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and a keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p < .001) for the predicate-based (80%) than for the keyword-based (71%) approach. Relevance was almost doubled with the predicate-based approach: 2.1 versus 1.6 without rank order adjustment (p < .001) and 1.34 versus 0.98 with rank order adjustment (p < .001) for the predicate- versus keyword-based approach, respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search.
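The two evaluation measures described in the abstract can be illustrated with a minimal Python sketch. It assumes only what the abstract states: each retrieved abstract receives a 0-5 rating, precision counts any rating above 0 as relevant, and relevance is the average rating. The function names and the example ratings are hypothetical and are not taken from the study. The rank order adjustment is not shown because the abstract does not define it.

```python
from statistics import mean


def precision(ratings):
    """Fraction of retrieved abstracts rated above 0 (a rating of 0 means not relevant)."""
    _validate(ratings)
    return sum(1 for r in ratings if r > 0) / len(ratings)


def mean_relevance(ratings):
    """Average 0-5 rating over the retrieved abstracts, without rank order adjustment."""
    _validate(ratings)
    return mean(ratings)


def _validate(ratings):
    # Ratings are assumed to be on the 0-5 scale mentioned in the abstract.
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 0 or r > 5 for r in ratings):
        raise ValueError("each rating must be between 0 and 5")


# Hypothetical ratings for the top 15 abstracts of a single query (illustrative only).
example_ratings = [5, 4, 0, 3, 2, 0, 1, 4, 5, 0, 2, 3, 1, 0, 2]

if __name__ == "__main__":
    print(f"precision: {precision(example_ratings):.2f}")
    print(f"mean relevance: {mean_relevance(example_ratings):.2f}")
```

Averaging these per-query values over all queries would give system-level figures comparable in kind to those reported, under the assumptions above.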

Original language: English (US)
Pages (from-to): 929-939
Number of pages: 11
Journal: Journal of Biomedical Informatics
Volume: 46
Issue number: 5
DOI: 10.1016/j.jbi.2013.07.006
State: Published - Oct 2013

Fingerprint

Space Simulation
Search Engine
Vector spaces
Search engines
Research Personnel
Patents
Information Storage and Retrieval
PubMed
Neoplasms
Information retrieval

Keywords

  • Information retrieval
  • Predicate
  • Search engine
  • Triple
  • Vector space model

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Development and evaluation of a biomedical search engine using a predicate-based vector space model. / Kwak, Myungjae; Leroy, Gondy Augusta; Martinez, Jesse D; Harwell, Jeffrey.

In: Journal of Biomedical Informatics, Vol. 46, No. 5, 10.2013, p. 929-939.

@article{231e88f0e9e740239c905f3848f5f12c,
title = "Development and evaluation of a biomedical search engine using a predicate-based vector space model",
abstract = "Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query, using predicates instead of keywords for both query and document representation. Predicates are triples, which are more complex data structures than keywords and contain more structured information. To make optimal use of them, we developed a new predicate-based vector space model and a query-document similarity function with adjusted tf-idf and a boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and a keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p < .001) for the predicate-based (80{\%}) than for the keyword-based (71{\%}) approach. Relevance was almost doubled with the predicate-based approach: 2.1 versus 1.6 without rank order adjustment (p < .001) and 1.34 versus 0.98 with rank order adjustment (p < .001) for the predicate- versus keyword-based approach, respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search.",
keywords = "Information retrieval, Predicate, Search engine, Triple, Vector space model",
author = "Kwak, Myungjae and Leroy, {Gondy Augusta} and Martinez, {Jesse D} and Harwell, Jeffrey",
year = "2013",
month = oct,
doi = "10.1016/j.jbi.2013.07.006",
language = "English (US)",
volume = "46",
pages = "929--939",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "5",

}

TY - JOUR

T1 - Development and evaluation of a biomedical search engine using a predicate-based vector space model

AU - Kwak, Myungjae

AU - Leroy, Gondy Augusta

AU - Martinez, Jesse D

AU - Harwell, Jeffrey

PY - 2013/10

Y1 - 2013/10

AB - Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query, using predicates instead of keywords for both query and document representation. Predicates are triples, which are more complex data structures than keywords and contain more structured information. To make optimal use of them, we developed a new predicate-based vector space model and a query-document similarity function with adjusted tf-idf and a boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and a keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p < .001) for the predicate-based (80%) than for the keyword-based (71%) approach. Relevance was almost doubled with the predicate-based approach: 2.1 versus 1.6 without rank order adjustment (p < .001) and 1.34 versus 0.98 with rank order adjustment (p < .001) for the predicate- versus keyword-based approach, respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search.

KW - Information retrieval

KW - Predicate

KW - Search engine

KW - Triple

KW - Vector space model

UR - http://www.scopus.com/inward/record.url?scp=84883806589&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883806589&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2013.07.006

DO - 10.1016/j.jbi.2013.07.006

M3 - Article

C2 - 23892296

AN - SCOPUS:84883806589

VL - 46

SP - 929

EP - 939

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

IS - 5

ER -