Identity matching and information acquisition: Estimation of optimal threshold parameters

Pantea Alirezazadeh, Fidan Boylu, Robert Garfinkel, Ram Gopal, Paulo B Goes

Research output: Contribution to journal › Article

Abstract

With the growing volume of collected and stored data from customer interactions that have recently shifted towards online channels, an important challenge faced by today's businesses is appropriately dealing with data quality problems. A key step in the data cleaning process is the matching and merging of customer records to assess the identity of individuals. The practical importance of this research is exemplified by a large client firm that deals with private label credit cards. They needed to know whether there existed histories of new customers within the company, in order to decide on the appropriate parameters of possible card offerings. The company incurs substantial costs if they incorrectly "match" an incoming application with an existing customer (Type I error), and also if they falsely assume that there is no match (Type II error). While there is a good deal of generic identity matching software available, that will provide a "strength" score for each potential match, the question of how to use the scores for new applications is of great interest and is addressed in this work. The academic significance lies in the analysis of the score thresholds that are typically used in decision making. That is, upper and lower thresholds are set, where matches are accepted above the former, rejected below the latter, and more information is gathered between the two. We show, for the first time, that the optimal thresholds can be considered to be parameters of a matching distribution, and a number of estimators of these parameters are developed and analyzed. Then extensive computations show the effects of various factors on the convergence rates of the estimates.
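The two-threshold rule described in the abstract (accept above the upper threshold, reject below the lower one, gather more information in between) can be sketched as follows. This is only an illustration of the decision rule, not the authors' estimators: the names `classify`, `lower`, and `upper`, the score scale, and the example threshold values are assumptions. The abstract also does not say how scores exactly equal to a threshold are handled, so this sketch places them in the middle band.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept match"
    REJECT = "reject match"
    ACQUIRE = "gather more information"


def classify(score: float, lower: float, upper: float) -> Decision:
    """Apply a three-way threshold rule to a match-strength score.

    Hypothetical helper: accept strictly above `upper`, reject strictly
    below `lower`, otherwise request more information.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return Decision.ACCEPT
    if score < lower:
        return Decision.REJECT
    # Scores in [lower, upper], including ties, fall in the middle band (assumption).
    return Decision.ACQUIRE


if __name__ == "__main__":
    # Illustrative thresholds only; the paper estimates optimal values from data.
    for s in (0.1, 0.5, 0.9):
        print(s, classify(s, lower=0.3, upper=0.7).value)
```

Widening the gap between `lower` and `upper` sends more applications to the information-gathering band, which trades acquisition effort against the Type I and Type II error costs the abstract describes.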

Original language: English (US)
Pages (from-to): 160-171
Number of pages: 12
Journal: Decision Support Systems
Volume: 57
Issue number: 1
DOI: 10.1016/j.dss.2013.08.014
State: Published - Jan 2014

Fingerprint

Decision Making
Software
Costs and Cost Analysis
Research
Industry
Merging
Labels
Cleaning
Data Accuracy
Information acquisition
Costs

Keywords

  • Data quality
  • Information acquisition
  • Record matching
  • Sampling distributions
  • Statistical estimation
  • Type I and Type II errors

ASJC Scopus subject areas

  • Management Information Systems
  • Information Systems
  • Information Systems and Management
  • Arts and Humanities (miscellaneous)
  • Developmental and Educational Psychology

Cite this

Identity matching and information acquisition: Estimation of optimal threshold parameters. / Alirezazadeh, Pantea; Boylu, Fidan; Garfinkel, Robert; Gopal, Ram; Goes, Paulo B.

In: Decision Support Systems, Vol. 57, No. 1, 01.2014, p. 160-171.

Alirezazadeh, Pantea; Boylu, Fidan; Garfinkel, Robert; Gopal, Ram; Goes, Paulo B. / Identity matching and information acquisition: Estimation of optimal threshold parameters. In: Decision Support Systems. 2014; Vol. 57, No. 1. pp. 160-171.
@article{992820ef84cb49aa9533784f63d2f049,
title = "Identity matching and information acquisition: Estimation of optimal threshold parameters",
abstract = "With the growing volume of collected and stored data from customer interactions that have recently shifted towards online channels, an important challenge faced by today's businesses is appropriately dealing with data quality problems. A key step in the data cleaning process is the matching and merging of customer records to assess the identity of individuals. The practical importance of this research is exemplified by a large client firm that deals with private label credit cards. They needed to know whether there existed histories of new customers within the company, in order to decide on the appropriate parameters of possible card offerings. The company incurs substantial costs if they incorrectly {"}match{"} an incoming application with an existing customer (Type I error), and also if they falsely assume that there is no match (Type II error). While there is a good deal of generic identity matching software available, that will provide a {"}strength{"} score for each potential match, the question of how to use the scores for new applications is of great interest and is addressed in this work. The academic significance lies in the analysis of the score thresholds that are typically used in decision making. That is, upper and lower thresholds are set, where matches are accepted above the former, rejected below the latter, and more information is gathered between the two. We show, for the first time, that the optimal thresholds can be considered to be parameters of a matching distribution, and a number of estimators of these parameters are developed and analyzed. Then extensive computations show the effects of various factors on the convergence rates of the estimates.",
keywords = "Data quality, Information acquisition, Record matching, Sampling distributions, Statistical estimation, Type I and Type II errors",
author = "Pantea Alirezazadeh and Fidan Boylu and Robert Garfinkel and Ram Gopal and Goes, {Paulo B}",
year = "2014",
month = "1",
doi = "10.1016/j.dss.2013.08.014",
language = "English (US)",
volume = "57",
pages = "160--171",
journal = "Decision Support Systems",
issn = "0167-9236",
publisher = "Elsevier",
number = "1",

}

TY - JOUR

T1 - Identity matching and information acquisition

T2 - Estimation of optimal threshold parameters

AU - Alirezazadeh, Pantea

AU - Boylu, Fidan

AU - Garfinkel, Robert

AU - Gopal, Ram

AU - Goes, Paulo B

PY - 2014/1

Y1 - 2014/1

AB - With the growing volume of collected and stored data from customer interactions that have recently shifted towards online channels, an important challenge faced by today's businesses is appropriately dealing with data quality problems. A key step in the data cleaning process is the matching and merging of customer records to assess the identity of individuals. The practical importance of this research is exemplified by a large client firm that deals with private label credit cards. They needed to know whether there existed histories of new customers within the company, in order to decide on the appropriate parameters of possible card offerings. The company incurs substantial costs if they incorrectly "match" an incoming application with an existing customer (Type I error), and also if they falsely assume that there is no match (Type II error). While there is a good deal of generic identity matching software available, that will provide a "strength" score for each potential match, the question of how to use the scores for new applications is of great interest and is addressed in this work. The academic significance lies in the analysis of the score thresholds that are typically used in decision making. That is, upper and lower thresholds are set, where matches are accepted above the former, rejected below the latter, and more information is gathered between the two. We show, for the first time, that the optimal thresholds can be considered to be parameters of a matching distribution, and a number of estimators of these parameters are developed and analyzed. Then extensive computations show the effects of various factors on the convergence rates of the estimates.

KW - Data quality

KW - Information acquisition

KW - Record matching

KW - Sampling distributions

KW - Statistical estimation

KW - Type I and Type II errors

UR - http://www.scopus.com/inward/record.url?scp=84892366009&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84892366009&partnerID=8YFLogxK

U2 - 10.1016/j.dss.2013.08.014

DO - 10.1016/j.dss.2013.08.014

M3 - Article

AN - SCOPUS:84892366009

VL - 57

SP - 160

EP - 171

JO - Decision Support Systems

JF - Decision Support Systems

SN - 0167-9236

IS - 1

ER -