Literature Review

3.1        Summary

This chapter reviews literature of relevance to the project, drawn from academic and practitioner sources. The purpose of the review is threefold:

·         to identify the gaps in the existing Information Quality knowledge base that this project seeks to address,

·         to present a specific organisational context for IQ valuation, in the form of Customer Relationship Management systems,

·         to provide an overview of the reference disciplines which examine and measure value and uncertainty.

This kind of review is necessary in Design Science research to ensure that the research makes a contribution to the Information Systems knowledge base, is relevant to practitioners and makes correct use of the reference disciplines.

This chapter is organised into three sections, addressing the three goals outlined above: Information Quality, Customer Relationship Management and the information-centric reference disciplines.

3.2       Information Quality

Information Quality (IQ) is an Information Systems (IS) research area that seeks to apply modern quality management theories and practices to organisational data and systems. This involves building and applying conceptual frameworks and operational measures for understanding the causes and effects of IQ problems. Additionally, some research seeks to evaluate the impact of initiatives to improve IQ.

IQ is fundamental to the study and use of Information Systems, yet it is not the principal focus of research or practice. Perhaps the most widely understood model of how IQ fits into IS more generally is the DeLone and McLean Model of IS Success (DeLone and McLean 1992; DeLone and McLean 2003; Seddon 1997).


 

Figure 3 - IS Success Model (DeLone and McLean 1992)

Here, IQ is understood to affect both Use and User Satisfaction, along with System Quality. This model’s assumptions about the separation of content (Information Quality) from delivery (System Quality), and about the individual vs organisational impact are discussed further below. However, it is a useful starting point owing to its widespread adoption and broad scope.

While IQ can be conceived as part of the IS Success sub-field, as an object of study it pre-dates DeLone and McLean's model. One notable general IQ researcher active during the 1980s is Donald Ballou (Ballou and Pazer 1985; Ballou and Tayi 1985; Ballou and Tayi 1989). Prior to this period, the research was either specific to certain fields such as auditing (Johnson et al. 1981) or related to specific techniques such as data-matching and integration (Fellegi and Sunter 1969).

Throughout the 1990s, IQ research increased with the proliferation of internet-based information-sharing, the deployment of enterprise systems such as data warehouses (DW) (Shankaranarayanan and Even 2004; Wixom and Watson 2001) and business intelligence (BI), and the growing importance of information-based business strategies such as enterprise resource planning (ERP) (Cai and Shankaranarayanan 2007) and customer relationship management (CRM) (Courtheoux 2003; Ishaya and Raigneau 2007; Miller 2005). During this period a number of authors (consultants and academics) wrote books and business journal articles for practitioners grappling with information quality problems (Becker 1998; English 1999; Huang et al. 1999; Marsh 2005; Orr 1998; Redman 1995; Redman 2008; Strong and Lee 1997; Tozer 1994).

Academic and practitioner researchers have produced several generic IQ frameworks, intended to be applicable to a very broad class of information systems (Barone et al. 2007; Cappiello et al. 2006; Ge and Helfert 2008; Gustafsson et al. 2006; Joseph et al. 2005; Stvilia et al. 2007). Typically, these use a small number of components or dimensions of IQ to group a larger number of IQ criteria or characteristics. One early study listed 178 such IQ dimensions, criteria and goals (Wang et al. 1993), which illustrates the breadth of ideas encompassed within the Information Quality sub-discipline.

Some IQ research proceeds by examining one of these IQ concepts in isolation, such as believability (Pradhan 2005; Prat and Madnick 2008a; Prat and Madnick 2008b) or timeliness (Ballou and Pazer 1995; Cappiello et al. 2003). Another tack is to take a broader view of the concept of quality and how it relates to information (Batini and Scannapieco 2006; Fox and Redman 1994; Piprani and Ernst 2008; Sarkar 2002; Tayi and Ballou 1998; Welzer et al. 2007).

In contrast, another research stream examined IQ in the context of specific applications (Dariusz et al. 2007), such as accounting (Kaplan et al. 1998), security (English 2005; Wang et al. 2003), “householding[1]” (Madnick et al. 2004; Madnick et al. 2003) and undergraduate teaching (Khalil et al. 1999) as well as more traditional IS areas like conceptual modelling (Levitin and Redman 1995; Lindland et al. 1994; Moody and Shanks 2003; Moody et al. 1998), process design (Lee et al. 2004; Lee and Strong 2003; Strong 1997), metadata (Shankaranarayanan and Even 2004; Shankaranarayanan and Even 2006) and querying (Ballou et al. 2006; Motro and Rakov 1996; Parssian 2006; Wang et al. 2001).

Other researchers focused on the interaction between information quality and how it is used in decision-making by individuals, for example, in information-seeking behaviour (Fischer et al. 2008; Ge and Helfert 2007; Klein and Callahan 2007), decision quality (Frank 2008), information processing (Davies 2001; Eppler and Mengis 2004; Shankaranarayanan and Cai 2006) and visualisation (Zhu et al. 2007).

The rest of this section is organised as follows. The next sub-section examines three important frameworks from the academic literature: the Ontological Model (Wand and Wang 1996), the Semiotic Framework (Price and Shanks 2005a) and the AIMQ (Lee et al. 2002). The first two are grounded in theory (ontology and semiotics, respectively) and adopt a “first-principles” approach to describe information systems (and deficiencies) in general. The third is empirically-based, drawing on the opinions of a pool of practitioners and researchers.

The subsequent sub-section addresses existing IQ measurement literature, including the different types of approaches endorsed by researchers (subjective and objective) and problems therein. Lastly, I consider a particular kind of measurement: valuation. Here I discuss the need for value-based (eg cost/benefit and investment-oriented) approaches to information quality assessment and critically examine past attempts at this.

3.3        Existing IQ Frameworks

3.3.1          AIMQ Framework

The first framework I examine is the AIMQ (Lee et al. 2002). This framework has been selected as it is well-developed and a good exemplar of the empirical approach to IQ research. It also ties together a number of research projects arising from MIT's Total Data Quality Management (TDQM) project, led by Professor Richard Wang. This program arose from Wang's group's view of information as a manufactured product (Ballou et al. 1998; Parssian et al. 1999; Wang et al. 1998) and the belief that "total quality management" (TQM) principles – which had proved so successful in improving product quality for manufactured goods – could be applied to producing information goods (Dvir and Evans 1996; Wang 1998; Wang and Wang 2008).

The AIMQ paper proceeds with an analysis of academic and practitioner perspectives on IQ based on the four dimensions derived from the authors’ earlier research (Wang and Strong 1996; Wang 1995): Intrinsic, Contextual, Representational and Accessibility IQ.

Intrinsic IQ implies that information has quality in its own right. Contextual IQ highlights the requirement that IQ must be considered within the context of the task at hand; it must be relevant, timely, complete, and appropriate in terms of amount, so as to add value. Representational and accessibility IQ emphasize the importance of computer systems that store and provide access to information; that is, the system must present information in such a way that it is interpretable, easy to understand, easy to manipulate, and is represented concisely and consistently; also, the system must be accessible but secure. (Lee et al. 2002, p135)

These dimensions are not grounded in any theory, but are derived empirically using market research methods. The authors argue that these dimensions – and associated criteria – are sufficient to capture the multi-dimensional nature of IQ. To support this, they cite content analyses from a number of case study projects in which all issues raised by practitioners could be mapped onto these criteria.

Rather than grouping these criteria by the four dimensions above, they adopt the PSP/IQ (Product–Service–Performance/Information Quality) two-by-two matrix developed earlier (Kahn et al. 2002). Here, the columns represent two different perspectives of quality (conformance to specifications and meeting/exceeding customer expectations), while the rows represent two views of information (information-as-a-product and information-as-a-service).


 

 

The four quadrants of the matrix, and the IQ dimensions assigned to each, are:

Product Quality / Conforms to Specifications – Sound Information. IQ Dimensions: Free-of-Error, Concise Representation, Completeness, Consistent Representation.

Product Quality / Meets or Exceeds Consumer Expectations – Useful Information. IQ Dimensions: Appropriate Amount, Relevancy, Understandability, Interpretability, Objectivity.

Service Quality / Conforms to Specifications – Dependable Information. IQ Dimensions: Timeliness, Security.

Service Quality / Meets or Exceeds Consumer Expectations – Usable Information. IQ Dimensions: Believability, Accessibility, Ease of Operation, Reputation.

 

Figure 4 - PSP/IQ Matrix (Kahn et al. 2002)

The authors argue that while their four IQ dimensions offer complete coverage, this matrix is more useful for helping managers prioritise IQ problems. They go on to develop a survey instrument which assesses the quality of information by asking information consumers to rate each of these 15 dimensions on an eleven-point Likert scale. An average score for each quadrant is computed, and an overall IQ score is the simple average of the four quadrants.
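By way of illustration only (the quadrant groupings follow the PSP/IQ matrix above, but the items, ratings and reduced set of dimensions below are invented), the roll-up from item ratings to dimension, quadrant and overall IQ scores can be sketched as:

# Illustrative sketch of the AIMQ-style simple-averaging roll-up described above.
# Dimension groupings follow the PSP/IQ quadrants; all ratings are invented.
survey = {
    "Sound":      {"Free-of-Error": [8, 9, 7], "Completeness": [6, 7, 7]},
    "Useful":     {"Relevancy": [9, 8, 9], "Understandability": [7, 8, 8]},
    "Dependable": {"Timeliness": [5, 6, 6], "Security": [9, 9, 10]},
    "Usable":     {"Believability": [8, 7, 8], "Accessibility": [6, 6, 7]},
}

def mean(xs):
    return sum(xs) / len(xs)

# Average item ratings into dimension scores, then dimensions into quadrant scores.
quadrant_scores = {
    quadrant: mean([mean(ratings) for ratings in dims.values()])
    for quadrant, dims in survey.items()
}
# Overall IQ is the simple average of the four quadrant scores.
overall_iq = mean(list(quadrant_scores.values()))

for quadrant, score in quadrant_scores.items():
    print(f"{quadrant:<10} {score:.2f}")
print(f"Overall IQ {overall_iq:.2f}")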

These scores are used in two ways. Firstly, they allow benchmarking against a best-practice referent (such as an industry leader): the organisation can assess in which areas it is meeting best practices and in which there are "gaps", drilling down through quadrants to dimensions to survey items. Secondly, the survey instrument also records whether a respondent is an information consumer or IS professional. This allows analysis of another kind of "gap", this time based on roles.

Organisations can target quadrants and dimensions where they are experiencing a best-practices gap. They can also determine whether this might be due to a role gap, where those using information and those responsible for managing it disagree about its quality. The authors conclude that the AIMQ method is useful for identifying IQ problems and areas for improvement, and tracking any improvements over time.

While this framework has a method for IQ assessment and prioritisation of improvements, it lacks a solid theoretical underpinning. The original research identified 16 constructs (Wang and Strong 1996), but as "value-added" proved problematic it was dropped without explanation. The remaining 15 constructs are not defined; instead the authors rely on diverse information consumers and IS professionals to interpret "near-synonyms". For example, to determine the accessibility dimension – part of the Accessibility IQ dimension in the original study and part of the Usability quadrant in the PSP/IQ model – respondents are asked to rate the following statements:

·         The information is easily retrievable.

·         The information is easily accessible.

·         The information is easily obtainable.

·         The information is quickly accessible when needed.

 

For this dimension, the authors report a Cronbach's alpha (construct reliability) of 0.92 – a very high score, indicating that these items are indeed measuring a single latent variable. However, the authors offer no advice to respondents about the differences between the retrieval, access and obtainment of information. Additionally, further items assess currency and timeliness of information without regard to the "promptness of access" raised in the fourth item above.
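For reference, Cronbach's alpha for a scale of k items is conventionally computed as

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where \sigma^{2}_{Y_i} is the variance of responses to item i and \sigma^{2}_{X} is the variance of the summed scale score. Values close to 1 indicate that the items move together as measures of a single latent variable – which, if the items are near-synonyms, is exactly what one would expect.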

Other examples of the use of “near-synonyms” in items to assess dimensions include: believable, credible and trustworthy; correct, accurate and reliable; useful, relevant, appropriate and applicable; current, timely and up-to-date; and understand and comprehend. Relying on respondents to bring their own differentiation criteria to bear on these overlapping terms weakens their conclusions.

Further, the dimensions themselves suffer from "near-synonyms": it is not obvious how interpretability and understandability differ, nor reputation and believability. As a consequence, it is not surprising that scores on these pairs of dimensions have very high cross-correlations of 0.87 and 0.86 respectively (Lee et al. 2002). Respondents are unlikely to give very different ratings to the statements "It is easy to interpret what this information means" (Interpretability) and "The meaning of this information is easy to understand" (Understandability).

Using overlapping dimensions and "near-synonymous" terms, and relying on the individual to assign meaning, is the result of an atheoretic approach to understanding Information Quality. By this, I mean that the authors do not present a theory of the nature of information or of how it is created, assessed and used. Rolling these 15 dimensions up into four quadrants (derived from theory) is an improvement. However, the subsequent survey design relies on the initial conception of IQ and hence carries forward its limitations.

3.3.2         Ontological Framework

An example of a theoretically-derived framework for information quality is the ontological model proposed by Wand and Wang (Wand and Wang 1996). Ontology is the branch of philosophy that deals with the structure and organisation of the world in the broadest sense. In this context, it is the body of knowledge concerned with constructing models of (parts of) the world.

Wand and Wang start with a very clear set of statements defining the real world, the subset of interest (the domain) and the information system in terms of states. Based on Wand’s earlier work on ontological modelling (Wand and Weber 1990), they build up a set of postulates relating the state of the information system with the state of the real world.

Specifically, they conceive of the world as being made up of things with properties. The real world is a system, decomposable into sub-systems. Each sub-system may be described in terms of a set of states and laws governing how it may progress from state to state. A system exists in one state at a moment in time. An information system is clearly a type of system too, and also has a set of states and laws. The representation process is the creation of a view of the real world within the information system. The interpretation process is the inference of the real world by a user (human or machine) perceiving the representation. In this way, the states of the real world and the information system should be "aligned". By analysing the relationship between these states, Wand and Wang offer a thorough analysis of data deficiencies: "an inconformity between the view of the real world system that can be inferred from a representing information system and the view that can be obtained by directly observing the real world system" (Wand and Wang 1996, p89).

They identify three deficiencies that occur at the time of information system design: incomplete representation, ambiguous representation and meaningless states. Incomplete representation is when states exist in the real world that cannot be represented in the information system. Meaningless states are those in the information system that do not correspond to a real world state. Ambiguous representation is when an information system state corresponds to more than one real world state, making it impossible to correctly infer the state of the real world.

Note that these deficiencies refer to sets of states (statespaces) and possible mappings between them, rather than a particular system at a point in time. For example, with an incomplete representation, if the real world is not in the "missing" state, the information system can still provide a correct representation. Similarly, correct inference is possible for an IS with meaningless states (or ambiguous representation), as long as the information system (real world) is not in the problem state. However, the possibility of a mis-mapping constitutes a design deficiency.

The fourth type of data deficiency Wand and Wang identify occurs at the time of operation: garbling. Here, a well-designed information system (ie complete, unambiguous and meaningful) may be in the "wrong" state relative to the real world. That is, the information system's state (at a particular moment) does not correspond to the real world state. This may be due to erroneous data entry or failure to reflect changes in the real world. They label such situations as incorrect.

Based on this analysis of the deficiencies in mapping between the (perceived) real world state and the information system state, they describe four dimensions of data quality: complete, unambiguous, meaningful and correct. They go on to show how a number of other frequently-cited attributes of data (or information) quality fall into these four dimensions. For example, "lack of precision" can be understood as an ambiguity problem. This can be seen when we consider a customer birth date: if the IS captures the year and month, but not the day, then one IS state corresponds to (up to) 31 real world states; we cannot distinguish between them, and so the mapping is deemed ambiguous. As alluded to above, currency (or timeliness) problems arise when the real world changes state but the IS fails to "keep up", resulting in the operational deficiency of garbling (to an incorrect state).
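To make the state-mapping analysis concrete, the following sketch (my own illustration, not taken from Wand and Wang) checks a toy representation mapping for the three design deficiencies:

# Illustrative sketch: classify design deficiencies in a hypothetical mapping
# from real-world states to information-system states.
real_world_states = {"RW1", "RW2", "RW3"}
is_states = {"IS1", "IS2", "IS3"}

# Representation: the IS state(s) each real-world state maps onto.
representation = {
    "RW1": {"IS1"},
    "RW2": {"IS2"},
    "RW3": {"IS2"},   # RW2 and RW3 share IS2, so IS2 is ambiguous
}

represented = set().union(*representation.values())

# Incomplete: real-world states with no IS state to represent them.
incomplete = {rw for rw in real_world_states if not representation.get(rw)}
# Meaningless: IS states that correspond to no real-world state.
meaningless = is_states - represented
# Ambiguous: IS states that correspond to more than one real-world state.
ambiguous = {s for s in is_states
             if sum(s in targets for targets in representation.values()) > 1}

print("Unrepresentable real-world states (incomplete):", incomplete)
print("Meaningless IS states:", meaningless)
print("Ambiguous IS states:", ambiguous)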

So we can see that this ontological model – by virtue of its grounding in a well-constructed theory – provides assurance that it is reasonably exhaustive in its coverage of data deficiencies due to system design or operation. However, its drawbacks are two-fold. First, its scope is narrow: by restricting the model to what the authors term the "internal view" (that is, "use-independent" intrinsic properties of data), it does not address oft-cited information quality concepts such as relevance, importance, usefulness or value. Secondly, while the model lays out a conceptual framework, it offers no guidance on how to formally analyse, assess, measure or value a specific implementation (planned or realised). These drawbacks are explicitly acknowledged by the authors, who call for further work to extend their model.

3.3.3          Semiotic Framework

Next, I present an example of a framework that builds on the ontological model presented above to tackle the usage and assessment aspects. This framework also employs another theory, this time semiotics, and so is known as the Semiotic Framework for Information Quality (Price and Shanks 2005a).

The analysis begins with the insight that the philosophical area of semiotics (the study of systems of signs and symbols, in a broad sense) provides a coherent lens through which information quality can be studied. While the philosophical aspects of language and meaning enjoy a long history, semiotics (or semiology) emerged as a distinct discipline around the start of the 20th Century through the work of early researchers like Swiss linguist Ferdinand de Saussure (1857-1913), the American philosopher Charles Sanders Peirce (1839-1914) and later Charles William Morris (1901-1979) (Chandler 2007). While their work influenced linguistics, philosophy and language-based studies, semiotics has also found use within IS for systems analysis (Stamper et al. 2000), data model quality (Krogstie et al. 2006; Lindland et al. 1994) and later data model and content quality (Moody et al. 1998).

The key to understanding this framework is the equivalence of the semiotic notion of a sign and the IS conception of a datum. A sign is a “physical manifestation … with implied propositional content … that has an effect on some agent” (Price and Shanks 2005a), where an effect is either a change in understanding or action. The referent is the implied propositional content, or “intended meaning” of the sign while the process of effecting change on some agent (semiosis) is the interpretation or received meaning of the sign. Hence, a datum in a data store constitutes a sign and a semiotic analysis of the data store as a sign-system allows a rigorous theoretical description of the quality of information.

Specifically, Price and Shanks identify three levels that build on each other. The first is the syntactic level, which deals with relations between sign representations (ie data and meta-data). The second is the semantic level, concerned with relations between sign representation and its referent (ie data and external phenomena). Lastly, the third is the pragmatic level, addressing the relations between sign representation and its interpretation (ie data and task/context). So, loosely speaking, these three levels (and their corresponding quality criteria) describe data form, meaning and usage:

 

Quality Question Addressed

·         Syntactic: Is IS data good relative to IS design (as represented by metadata)?

·         Semantic: Is IS data good relative to represented external phenomena?

·         Pragmatic: Is IS data good relative to actual data use, as perceived by users?

Ideal Quality Goal

·         Syntactic: Complete conformance of data to specified set of integrity rules

·         Semantic: 1:1 mapping between data and corresponding external phenomena

·         Pragmatic: Data judged suitable and worthwhile for given data use by information consumers

Operational Quality Goal

·         Syntactic: User-specified acceptable % conformance of data to specified set of integrity rules

·         Semantic: User-specified acceptable % agreement between data and corresponding external phenomena

·         Pragmatic: User-specified acceptable level of gap between expected and perceived data quality for a given data use

Quality Evaluation Technique

·         Syntactic: Integrity checking, possibly involving sampling for large data sets

·         Semantic: Sampling using selective matching of data to actual external phenomena or trusted surrogate

·         Pragmatic: Survey instrument based on service quality theory (i.e. compare expected and perceived quality levels)

Degree of Objectivity

·         Syntactic: Completely objective, independent of user or use

·         Semantic: Objective except for user determination of relevancy and correspondence

·         Pragmatic: Completely subjective, dependent on user and use

Quality Criteria Derivation Approach

·         Syntactic: Theoretical, based on integrity conformance

·         Semantic: Theoretical, based on a modification of Wand and Wang's (1996) ontological approach

·         Pragmatic: Empirical, based on initial analysis of literature to be refined and validated by empirical research

Table 4 Quality Category Information (Adapted from Price and Shanks 2005a)

Syntactic quality – concerned with the relations between signs – is understood as how well operational data conform to IS design (embodied as meta-data). Integrity theory provides a ready-made theory for determining this conformance to eg cardinality constraints and rules of well-formed data structures.

The semantic level naturally builds on the model presented by Wand and Wang, as it concerns how the information system represents the real world; that is, the mapping between states of the external world and the data intended to represent that world. However, Price and Shanks modify the Wand and Wang model in three significant ways. Firstly, they introduce an additional criterion of "non-redundancy" in the mapping. They argue that, like meaningless states, the presence of redundant states in the IS (ie multiple IS states referring to the same state in the external world) constitutes a design deficiency because it introduces a "danger" of deficiency in operation. The result is that both the representation and interpretation processes now require a bijective function (one-to-one and "onto"): every state of the external world must map onto a unique state in the IS, and vice versa.

A subsequent refinement of the framework based on focus group feedback (Price and Shanks 2005b) recasts “non-redundancy” as “mapped consistency” ie multiple IS states are permitted as long as they agree with each other (or are reconcilable within an acceptable time). This allows for system designers to employ caching, versioning, archiving and other forms of desirable redundancy.

Secondly, Price and Shanks argue that incompleteness can arise at design time (one or more external states cannot be represented in the IS) or during operation (for example, a clerk fails to enter data into a field). Thirdly, Price and Shanks address the decomposition deficiencies outlined by Wand and Wang by introducing separate notions of phenomena-correctness (correct mapping to an entity) and property-correctness (correct mapping to an attribute value of an entity). In terms of conventional databases, this distinction corresponds to row and column correctness respectively.

At the pragmatic level, the Semiotic Framework abandons theoretical derivation and employs an empirical approach akin to the AIMQ Framework, based on literature analysis, to describe a list of pragmatic quality criteria. At this level, the reliability construct subsumes the semantic level criteria of mapped (phenomena/property) correctly, meaningfully, unambiguously, completely and consistently. The additional (revised) pragmatic criteria are: Perceptions of Syntactic and Semantic Criteria, Accessible, Suitably Presented, Flexibly Presented, Timely, Understandable, Secure, Type-Sufficient and Access to Meta-data. The last two are included based on focus group refinement (Price and Shanks 2005a; Price and Shanks 2005b): the former replaces “value” in requiring all types of data important for use, while the latter refers to the ability of users to assess the lineage, granularity, version and origins of data.

The authors suggest that the SERVQUAL theory (Parasuraman et al. 1985) provides a means for assessing the quality at the pragmatic level. Similar to the AIMQ Framework, a “gap” is identified through a survey instrument employing Likert scales – this time between a consumer’s expectations and her perceptions.

The strengths of this framework include the use of semiotic theory to stratify information quality criteria into levels (form, meaning and use) and the successful integration of Wand and Wang's ontological model at the semantic level. The main weakness is the lack of theoretical basis for assessing quality at the pragmatic level, which introduces problems similar to those found in the AIMQ Framework. These include inter-dependencies (as acknowledged by the authors for eg Understandability and Access to Meta-data), problems with "near-synonyms" (eg using the undefined terms "suitable" and "acceptable" to describe aspects of quality and "worthwhile" and "important" to describe aspects of value) and finally "value" in general (value-added, valuable). As with the AIMQ Framework, the concept was originally included but ultimately dropped owing to its poor conceptual fit and heavy inter-dependence: feedback showed that "valuable was too general and abstract to ensure consistent interpretation … and therefore not useful as a specific quality criteria" (Price and Shanks 2005b).

3.4       IQ Measurement

This section summarises existing research in Information Quality measurement. While dozens of papers propose and analyse aspects of IQ, surprisingly little has been written about the specific measurement and definition of metrics for IQ-related constructs. For example, a methodology for developing IQ metrics known as InfoQual has been proposed (Dvir and Evans 1996), while the Data Quality Engineering Framework has a similar objective (Willshire and Meyen 1997). Measurement of particular aspects of IQ has been tackled, such as soundness and completeness (Motro and Rakov 1996), accuracy and timeliness (Ballou and Pazer 1995), and completeness and consistency (Ballou and Pazer 2003). In many cases, such measurements are combined through transformations using weighting, sums and differences to derive metrics that allow comparison of quality levels over time (Evans 2006; Parssian et al. 1999; Parssian et al. 2004).

There are, broadly speaking, three approaches to IQ measurement, based on the kinds of scores employed: percentages (ratio, 0-100%), Likert scales (ordinal, eg low, medium and high) and valuation (monetary, eg Net Present Value). The third is addressed in the following sub-section, while the first two are discussed here.

The purpose of IQ measurement is largely managerial (Heinrich et al. 2007): selection and monitoring of existing information sources for tasks and the construction of new information sources (possibly out of existing ones). This may involve benchmarking within and between organisations (Cai and Shankaranarayanan 2007; Stvilia 2008), as well as before and after IQ improvement projects. In their comprehensive review of IQ measurement, Naumann and Rolker (2000) identify three sources (or perspectives) of IQ scores, based on a user/query/source model:

 

Perspective: User

·         Type of Score: Subject-criteria scores

·         Example Criteria: Understandability

·         Assessment Method: User experience, sampling

·         Scale/Units: Likert

·         Characteristics: Varies between users and tasks

Perspective: Query

·         Type of Score: Process-criteria scores

·         Example Criteria: Response time

·         Assessment Method: Parsing

·         Scale/Units: Percentage

·         Characteristics: Transient, depend on each usage instance

Perspective: Source

·         Type of Score: Object-criteria scores

·         Example Criteria: Completeness

·         Assessment Method: Parsing, contract, expert, sampling

·         Scale/Units: Percentage

·         Characteristics: Change over time but constant for each usage instance and user

Table 5 Adapted from Naumann and Rolker (2000)

The pattern of using a combination of percentages for objective measures of IQ and a Likert scale for subjective measures is repeated throughout IQ research. For example, the Semiotic Framework (Price and Shanks 2005a) employs objective (percentage) measures at its syntactic and semantic levels and subjective (Likert) measures at its pragmatic level. This is spelled out by Pipino et al. (2002), who argue for the combination of objective and subjective measures on the grounds that "subjective data quality assessments reflect the needs and experiences of stakeholders", while "objective assessments can be task-independent … [which] reflect states of the data without contextual knowledge of the application, and can be applied to any data set, regardless of the task at hand."

Across a number of models and frameworks, there is widespread agreement that percentages are the obvious and natural way to measure at least some IQ aspects such as completeness (Ballou and Pazer 2003), currency (Cappiello et al. 2003) and correctness (Paradice and Fuerst 1991). The assumption is that an information source with a score of 75% is of better quality than one with a score of 70%. However, there are considerable difficulties in determining such a figure, which undermines the claim of objectivity.

For example, consider a customer database comprising many thousands of records, each with several dozen attributes. In this case, 75% completeness could mean that 25% of the customer records are missing. Or that 25% of the attributes (columns) have blank values. Or – as an example of a design problem – that 25% of the allowed values for a certain attribute (say, Customer Title) may be missing (eg Parson, Earl and Inspector). More subtly, 75% completeness may mean any combination of these issues is extant. While some researchers distinguish between these issues (eg Semiotic Framework, Ontological Model), most do not. The fundamental problem is that an enormous number of issues could combine to yield a particular quality score; yet it is unlikely that all of these situations would be regarded as equivalent by any given user in a particular context.
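As an illustration (a sketch of my own, with made-up data), the same small customer table yields quite different "completeness" scores depending on which definition is applied:

# Illustrative sketch with made-up data: two definitions, two completeness scores.
records = [
    {"name": "Ann", "title": "Dr",  "email": "ann@example.com"},
    {"name": "Bob", "title": None,  "email": "bob@example.com"},
    {"name": "Cho", "title": None,  "email": None},
    {"name": "Dee", "title": "Ms",  "email": "dee@example.com"},
]

# Definition 1: proportion of individual cells (attribute values) that are populated.
cells = [v for r in records for v in r.values()]
cell_completeness = sum(v is not None for v in cells) / len(cells)        # 9/12 = 75%

# Definition 2: proportion of records with every attribute populated.
full_records = sum(all(v is not None for v in r.values()) for r in records)
record_completeness = full_records / len(records)                         # 2/4 = 50%

print(f"Cell-level completeness:   {cell_completeness:.0%}")
print(f"Record-level completeness: {record_completeness:.0%}")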

By way of illustration, consider the use of data quality tagging (Chengalur-Smith and Ballou 1999; Fisher et al. 2003; Price and Shanks 2008), an application of IQ measurement of research interest and with practical import. Briefly, it is the process of presenting extra information about the quality of a dataset to data consumers, typically expressed as a ratio (or percentage) score. The idea is that information consumers will change their use of the data (ie decision outcome, time, confidence etc) based on the level of quality conveyed by the score. The experimental results indicate that information consumers will – under some circumstances – incorporate this extra information into their decision-making (to some extent).

However, the consumers' interpretation of a quality score of, say, 70% is not obvious: is this the probability that the data is correct? Or some sort of distance metric, confidence interval, measure of spread (like variance) or significance value? In the absence of any instructions (other than the unhelpful remark that 1 is perfect quality and 0 is no quality), consumers will make up their own minds based on their experience, education and expectations, given the task and its context. Data quality tagging provides one motivation for objective IQ measures, and also highlights the drawbacks of their use.

This motivates the use of specific task- or usage-oriented measures – addressing the contextual dimension in the TDQM (Pipino et al. 2002) and AIMQ frameworks and the pragmatic layer in the Semiotic Framework. These frameworks argue for the necessity of adopting subjective measures to address this. For example, Lee et al. (2000) state that considering the information consumer viewpoint "necessarily requires the inclusion of some subjective dimensions", while for Price and Shanks (2005), the use of data by consumers "is completely subjective", since the pragmatic quality criteria

are evaluated with respect to a specific activity and its context. That implies that the assessment of such criteria will be based on information consumer perceptions and judgements, since only they can assess the quality of the data relative to use. (Price and Shanks 2005a, p93)

This proposal – that information usage by consumers in context can only be assessed by subjective measures – seems appealing, at least in the general case. After all, who would suppose that an objective measure of the quality of information for any arbitrary task could exist, given the problems with the objective measures in the comparatively simpler case of assessing semantic (or inherent) data quality?

However, this does not imply that surveying users with a Likert scale (or letter grade or nominal percentage) is the only possible approach. There is an important category of subjective assessment of IQ that employs an entirely different elicitation approach based around user preferences. This approach, while still subjective, allows for considerable sophistication in derivation and analysis. The next sub-section addresses this approach - the valuation of Information Quality.

3.4.1         IQ Valuation

Information Quality valuation can be understood as a special case of IQ assessment, whereby the goal is to place a value on the quality level associated with an information source. In other words, the assessment methods are financial and the units of measurement are money (dollars or other currency). Some authors advocate a resource or asset view of an organisation's information resource (Levitin and Redman 1998; Moody and Walsh 2002; Solomon 2005). Frequently, IQ frameworks address value by considering cost as a factor or quality item. This is understood in at least two different ways: the cost of reaching a level of quality (through checking and correction procedures), which is considered a factor to trade off against others (Ballou et al. 1998); and the cost incurred by non-quality, through errors, mistakes and "information scrap and re-work" (English 1999). The former detracts from value, while avoiding the latter contributes to value.

Other frameworks directly explicate the cost/benefit trade-off (Ballou and Tayi 1989; Eppler and Helfert 2004; Mandke and Nayar 2002; Paradice and Fuerst 1991), while others have applied decision-theoretic approaches – employing probabilities and pay-offs – to understand the impact of poor quality data (Kaomea 1994; Michnik and Lo 2009). One sophisticated analysis uses the financial engineering concept of “real options” to price data quality (Brobrowski and Soler 2004). Some examine trade-offs in particular applications contexts, such as data warehousing (Rao and Osei-Bryson 2008).

The importance of valuing IQ has been recognised by both academics and practitioners (Henderson and Murray 2005; Jacaruso 2006). For example, at the Data Quality workshop hosted by the National Institute of Statistical Sciences in 2001, one of the key recommendations was that "Metrics for data quality are necessary that … represent the impact of data quality, in either economic or other terms" (Karr et al. 2001). Earlier, the practitioner/researcher Thomas Redman estimated the cost of poor data quality for a typical organisation at about 25% of revenue (Redman 1998). Estimating the costs involved is a difficult accounting challenge owing to the diffused and intangible nature of poor information quality. In particular, opportunity costs (eg earnings forgone due to poor decisions) are notoriously difficult to capture.

Even and Shankaranarayanan are amongst the few researchers to have tackled explicitly the notion of value-driven information quality (Even and Shankaranarayanan 2007a; Even and Shankaranarayanan 2007b; Even et al. 2007), using models that subjectively weight the benefit associated with data values across a number of familiar IQ dimensions, before aggregating up to get a total utility estimate for data assets.

The concept of value – incorporating costs and benefits in the broadest sense – faces two significant problems within Information Quality research. The first is conceptual: most researchers recognise its importance, but are unsure or inconsistent in its handling. The second is practical and concerned with the process of making a reasonable valuation subject to resource constraints. Despite these problems, valuation remains an important (albeit under-explored) area within IQ.

Examples of the conceptual problems with the concept of value were introduced earlier in the context of both the AIMQ and Semiotic frameworks. To recap, the “value-added” attribute of the contextual dimension was originally a part of the TDQM model (Strong et al. 1997) but was then dropped for the PSP/IQ model without explanation (Kahn et al. 2002). As a consequence, value was not added to the AIMQ Framework (Lee et al. 2002).

With the Semiotic Framework, value was originally included (Price and Shanks 2005a), but as a context-specific “placeholder” item at the pragmatic level. Feedback from a focus group of practitioners identified its inclusion as a weakness, and the item was removed altogether. Further, “near-synonyms” and tautologies around value are used throughout the paper, adding to the lack of clarity. For example, value, value-added and valuable are, at different points, equated with or defined as worth, importance, usefulness and sufficiency (Price and Shanks 2005a; Price and Shanks 2005b).

The second difficulty with valuation of information quality concerns the tractability of valuation processes. One such example is presented by Ballou and Tayi (1989), who prescribe a method for periodic allocation of resources to a class of IQ proposals (maintenance of data assets). It assumes a budgetary approach (that is, a fixed budget for IQ to be shared among a set of proposals), rather than an investment approach (evaluation of proposals based upon expected value returned). It further assumes that the data managers have sought and won the largest budget they can justify to their organisation. Based upon statistical sampling, a parameter estimation heuristic and an iterative integer programming model, the method arrives at an optimal dispersal of resources across proposals.
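As a toy illustration of the budgetary allocation idea only (this is not Ballou and Tayi's procedure, which relies on statistical sampling, parameter estimation and integer programming), a fixed budget can be allocated across hypothetical maintenance proposals by searching for the subset with the greatest expected error-cost reduction:

# Toy sketch: exhaustive search for the best-funded subset of invented proposals.
from itertools import combinations

# Hypothetical proposals: (name, cost, expected error-cost reduction), arbitrary units.
proposals = [
    ("Cleanse addresses",     40, 55),
    ("De-duplicate accounts", 30, 45),
    ("Re-verify phone nos.",  25, 20),
    ("Fix product codes",     20, 30),
]
budget = 70

best_subset, best_benefit = (), 0
for r in range(1, len(proposals) + 1):
    for subset in combinations(proposals, r):
        cost = sum(p[1] for p in subset)
        benefit = sum(p[2] for p in subset)
        if cost <= budget and benefit > best_benefit:
            best_subset, best_benefit = subset, benefit

print("Funded proposals:", [p[0] for p in best_subset])
print("Expected benefit:", best_benefit)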

The method requires data analysts to understand the appropriate level of data granularity (fields, attributes, records) for the analysis and the expected costs of errors in these data sets. In general, the problem of estimating the costs of IQ defects is extremely complex. Earlier work (Ballou and Pazer 1985) employs differential calculus to estimate transformation functions that describe the impact of IQ defects on “down-stream” decision-making. This functional approach was later combined with a Data Flow Diagram method (Ballou et al. 1998).

Gathering information on the parameters required for these methods is likely to be very costly and fraught with technical and organisational difficulties. Further, there is little empirical evidence to support the feasibility of industry analysts undertaking the sophisticated mathematical analyses (ie the differential calculus and integer linear programming) described.

Regardless of the valuation perspective or method, valuation can only be undertaken within a specific organisational process: the same information source or dataset will introduce (or remove) different costs depending on the purpose for which it is being used.

3.5        Customer Relationship Management

During the "dot-com boom" era, there was considerable academic interest in Customer Relationship Management (CRM) strategies, applications and processes, with some 600 papers published by the "bust" (Romano and Fjermestad 2001). CRM is the natural context in which to examine customer information quality, as it provides an academic framework and business rationale for the collection and use of information about customers. While quality data (or information) about customers is identified as key to the success of CRM initiatives (Messner 2004; Missi et al. 2005), it is not clear exactly how one should value it. Indeed, even the real costs of poor customer data are difficult to gauge, due to the complexities of tracing causes through to effects. This is part of the much larger data quality problem: at the largest scale, The Data Warehousing Institute estimated that – broadly defined – poor data quality costs the US economy over $US600 billion per annum (Eckerson 2001).


 

3.5.1         CRM Business Context

Customer Relationship Management can be understood as a sub-field of the Information Systems discipline (Romano and Fjermestad 2001; Romano and Fjermestad 2003), to the extent that it is a business strategy that relies on technology. Alter suggests that we can conceive of such systems as work systems (Alter 2004; Alter and Browne 2005). As such, the relationship between CRM and IQ is bi-directional: CRM systems require high quality customer information to succeed; and improving the quality of customer information can be a beneficial outcome of deploying CRM (Freeman et al. 2007; Jayaganesh et al. 2006).

One example of the latter is the study by Freeman and Seddon (2005) on CRM benefits. They analysed a large volume of qualitative data about reported CRM benefits to test the validity of an earlier ERP (Enterprise Resource Planning) benefits framework. Some of the most significant benefits to emerge from this study related to quality of information: improved customer-facing processes and improved management decisions. Indeed, the key "enabler" of these benefits was identified as "the ability to access and capture customer information".

Other studies highlight the importance of high quality customer information for CRM success. For example, industry analysts Gartner reported that "[CRM] programs fail, in large part, because the poor quality of underlying data is not recognized or addressed." (Gartner 2004, p1). Gartner stresses the link between poor quality customer information and CRM failure in their report "CRM Data Strategies: The Critical Role of Quality Customer Information" (Gartner 2003).

In light of the importance of quality information for CRM success, practitioners and researchers involved in CRM are frequently concerned with information quality. Similarly, CRM processes represent a significant source of value for practitioners and researchers dealing with information quality. That is, customer processes (undertaken within a CRM program) afford information managers an opportunity to examine how high quality information can impact upon value-creation within the firm.

Certainly, we should not regard CRM processes as the only means by which quality customer information is translated into value: regulatory functions, strategic partnerships, market and competitor analyses and direct sale (or rent) of information assets can also contribute through cost reduction and revenue increases. Further, obtaining and maintaining high quality customer information is not a guarantee of a successful CRM strategy. However, the relationship between the two is sufficiently strong as to warrant a closer look at how information is used within CRM processes.

3.5.2         CRM Processes

Meltzer defines a CRM process as an organisational process for managing customers (Meltzer 2002). He identifies six basic functions:

Cross-sell: offering a customer additional products/services.

Up-sell: offering a customer higher-value products/services.

Retain: keeping desirable customers (and divesting undesirable ones).

Acquire: attracting (only) desirable customers.

Re-activate: acquiring lapsed but desirable customers.

Experience: managing the customer experience at all contact points.

At the core of these processes is the idea of customer classification: a large set of customers is partitioned into a small number of target sets. Each customer in a target set is treated the same by the organisation, though each may respond differently to such treatments. This approach seeks to balance the competing goals of effectiveness (through personalised interaction with the customer) and efficiency (through standardisation and economies of scale).

For example, a direct mail process might require partitioning a customer list into those who are to receive the offer and those excluded. In this case, there are four possible outcomes, combining the treatment dimension "Offer/Not Offer" with the response dimension "Accept/Not Accept". The objective of the process is to assign every customer to the correct treatment (ie customers who would accept to "Offer", customers who would not accept to "Not Offer").
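A simple sketch (with invented pay-offs and acceptance probabilities) shows how these four outcomes can be attached to pay-offs, turning the treatment decision for each customer into an expected-value calculation:

# Illustrative, invented pay-offs for the Offer/Not Offer x Accept/Not Accept outcomes.
payoff = {
    ("offer", "accept"):        50,   # margin on the sale, net of fulfilment
    ("offer", "not_accept"):    -5,   # wasted contact cost
    ("not_offer", "accept"):     0,   # opportunity forgone, but no cost incurred
    ("not_offer", "not_accept"): 0,
}

# Modelled probability of acceptance for each customer (also invented).
customers = {"C1": 0.30, "C2": 0.05, "C3": 0.60}

def expected_value(treatment, p_accept):
    return (p_accept * payoff[(treatment, "accept")]
            + (1 - p_accept) * payoff[(treatment, "not_accept")])

for cust, p in customers.items():
    ev_offer = expected_value("offer", p)
    treatment = "offer" if ev_offer > expected_value("not_offer", p) else "not_offer"
    print(f"{cust}: EV(offer) = {ev_offer:6.2f} -> {treatment}")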

Clearly, organisations require high-quality customer information in order to be able to execute these processes. Further, the need to correctly place a particular customer into the right group constitutes the (partial) customer information requirements of the organisation: the types of information collected, the levels of granularity, timing and availability and other characteristics depend, in part, on the usage. Conversely, the design of the customer processes themselves will depend on what information is (in principle) available, suggesting an interplay between information managers and process designers.

Hence, at its core, this segmentation task is a key point at which high-quality customer information translates into value for the organisation. As discussed above, this is not to say that CRM processes constitute the entirety of the value-adding effects of customer information; rather, that a sizeable proportion of the value amenable to analysis may be readily found therein. This is due to the existence of a widely employed valuation method underlying many CRM strategies: Customer Lifetime Value.

3.5.3         Customer Value

Customer Value is sometimes called Lifetime Value (LTV), Customer Lifetime Value (CLV) or Future Customer Value. It is widely used as the basis for evaluating CRM and Database Marketing initiatives (Hughes 2006). There is a related notion of Customer Equity, which can be considered the sum of Customer Value over all customers. The idea is that the worth of a customer relationship to an organisation can be evaluated by adding up the revenues and costs associated with servicing that customer over the lifetime of the relationship, taking into account future behaviours (such as churn) and the time value of money (Berger and Nasr 1998). As such, it represents the Net Present Value of the customer relationship; that is, "the sum of the discounted cash surpluses generated by present and future customers (within a certain planning period) for the duration of the time they remain loyal to a company" (Bayón et al. 2002, p18).
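In general form, and using my own notation (R_t and C_t for the expected revenues and costs attributable to the customer in period t, r_t for the probability the customer is still retained in period t, d for the discount rate and T for the planning horizon), this can be written as

CLV = \sum_{t=0}^{T} r_t \, \frac{R_t - C_t}{(1 + d)^{t}}

which is simply a discounted cash flow calculation applied to a single customer relationship.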

Customer Value is used to evaluate the impact of CRM processes on an organisation's bottom line and takes the role of the "target variable" for controlling the design and operation of these processes. For example, evaluating a cross-sell process solely on immediate sales over the duration of a particular campaign is not suitable, since it fails to take into account follow-up purchases, referrals and "channel cannibalisation" (whereby sales from one channel, such as a website, may be transferred to another, say a call centre, without a net gain). Using Customer Value aligns operational marketing efforts with the longer-term interests of investors (and other stakeholders).


 

3.6       Decision Process Modelling

In this section, I introduce some important conceptual tools for understanding the role of information in representing the world (meaning) and making decisions (usage). Firstly, I look at some ideas from information economics – the study of the value of information. I then narrow that to examine decision-theoretic models that can be used to describe Customer Relationship Management processes, and the kinds of quantitative evaluation that they employ. Finally, I examine an engineering-oriented model, known as information theory, widely used to understand and measure information flows. Throughout, the relevance to customer information quality is highlighted through examples involving common CRM processes.

3.6.1         Information Economics

There are two basic concepts underpinning information economics: uncertainty and utility. Firstly, uncertainty refers to the absence of certain knowledge, or imperfections in what an observer knows about the world. It can be characterised by a branch of applied mathematics known as Probability Theory (Cover and Thomas 2005). While other approaches have been proposed (eg Possibility Theory), Probability Theory has by far the widest reach and most sophisticated analysis. The idea is that an observer can define a set of mutually-exclusive outcomes or observations, and assign a weight to each outcome. This weight reflects the chance or likelihood that the (as-yet-unknown) outcome will materialise. Originally developed to help gamblers calculate odds, it is now so embedded in all areas of science, statistics, philosophy, economics and engineering that it is difficult to conceive of the world without some reference to probabilities.

That said, there is some dispute and consternation about the interpretation of these weights. The so-called frequentists argue that the weights correspond to long-run frequencies (or proportions of occurrences). So, to say "the probability of a fair coin-toss producing heads is 50%" means that, after throwing the coin many times, about 50% of the throws will result in heads. The objection, from rival Bayesians, is that this interpretation falls down for single events. For example, the statement "the probability of the satellite launch being successful is 50%" cannot be interpreted in terms of frequencies, since the launch happens only once. These discussions aside, Probability Theory remains the single most comprehensive theory for understanding and reasoning about uncertainty.

The second key concept is utility. This refers to a measure of the happiness or net benefit received by someone for consuming or experiencing a good or service. Its role in economic theory is to capture (and abstract) the idea of “value” away from psychological or cognitive processes. We can thus reason about how a particular decision-maker’s utility varies under different circumstances. As such, utility has underpinned economic theory for several hundred years (Lawrence 1999), allowing theorists to posit homo economicus, the so-called “rational man”, to describe and predict the behaviours of large groups of people.

However, there have been many debates within the economics community about the nature of utility (eg whether or not it is subjective), how it is measured and so on. Despite these latent problems, a sophisticated edifice was constructed throughout the 19th century in a theoretical body known as “neoclassical microeconomics”. This explains and predicts decision-making around economic production and consumption through concepts such as supply and demand curves, marginal (or incremental) utility, production possibility frontiers, returns to scale and so on.

Following important developments in Game Theory after World War II, the mathematician John von Neumann and the economist Oskar Morgenstern set about recasting neoclassical microeconomics in terms of this new mathematical model (Neumann and Morgenstern 2004). Their resulting work is often known as the "game-theoretic reformulation of neoclassical microeconomics" or, more loosely, Utility Theory.

Von Neumann and Morgenstern's key insight was to link utility with preferences. In their model, actors have a preference function that ranks different outcomes, or possibilities, associated with "lotteries" (taking into account chance). They showed that, mathematically, microeconomics could be reconstructed "from the ground up" using this idea of ranked preferences. What is more, preferences can be observed indirectly through people's behaviour (ie their preferences are revealed through their choices), allowing experimental research into decision-making.

3.6.2         Information Theory

Throughout the 1960s and 1970s, the field of information economics integrated Game Theory with another post-war development: Information Theory. Working at Bell Laboratories on communications engineering problems, the mathematician Claude Shannon published a modestly entitled paper, "A Mathematical Theory of Communication", in the company's research journal (Shannon 1948). When the full import of his ideas was grasped, it was re-published (with Warren Weaver) as a book entitled The Mathematical Theory of Communication (Shannon and Weaver 1949).

The key quantity Shannon introduced was entropy, a measure of the uncertainty of a random variable. By measuring changes in uncertainty, Shannon's theory allows analysts to quantify the amount of information conveyed by an event as a reduction in uncertainty.
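For a discrete random variable X with probability mass function p(x), entropy is defined as

H(X) = -\sum_{x} p(x) \log_2 p(x)

measured in bits when the logarithm is taken to base 2; the information conveyed by an observation is the resulting reduction in entropy.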

Conceptually, Shannon’s innovation was to explain how the communication process is, at its heart, the selection of one message from a set of pre-defined messages. When the sender and receiver select the same message (with arbitrarily small probability of error), we are said to have a reliable communication channel. This simple precept – combined with a rigorous and accessible measurement framework – has seen information theory (as it is now known) continue development through dozens of journals, hundreds of textbooks and thousands of articles. It is widely taught at universities in the mathematics and engineering disciplines.

From this grand theoretical foundation a large number of application areas have been developed: communications engineering, physics, molecular biology, cryptography, finance, psychology and linguistics (Cover and Thomas 2005). Economics – in particular, information economics – was very quick to adopt these new ideas and integrate them with Game Theory. The object of much of this work was to understand the interplay between value and information – how economics can help place a value on information and (in turn) how information can shed new light on existing economic theories (Heller et al. 1986). Notable economists tackling these ideas during this period included Henri Theil (Theil 1967), Jacob Marschak (Marschak 1968; Marschak 1971; Marschak 1974a; Marschak 1974b; Marschak et al. 1972; Marschak 1980), George Stigler (Stigler 1961) and Joseph Stiglitz (Stiglitz 2000).

3.6.3         Machine Learning

One application area of particular interest to this research is machine learning. This branch of applied mathematics examines methods for sifting through large volumes of data in search of underlying patterns. (For this reason, there is a large overlap with the data mining discipline.) Specifically, the focus is on algorithms for building computer models that classify, segment or cluster instances into groups. For example, a model can estimate how likely it is that a customer will default on a loan repayment, based on the repayment histories of similar customers, as sketched below. Other example applications include direct marketing (where the task is to identify customers likely to respond to an offer), fraud detection (flagging suspect transactions) and medical diagnosis.
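As a purely illustrative sketch (the data, feature names and choice of algorithm below are invented, not drawn from the literature reviewed here), such a loan-default model might be built along these lines:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical repayment histories: [months_as_customer, missed_payments, balance]
X_train = np.array([[24, 0, 1200.0],
                    [ 6, 3,  450.0],
                    [36, 1, 3100.0],
                    [12, 4,  800.0]])
y_train = np.array([0, 1, 0, 1])   # 1 = defaulted, 0 = repaid

model = LogisticRegression().fit(X_train, y_train)

# Estimated probability that a new customer will default
new_customer = np.array([[18, 2, 950.0]])
p_default = model.predict_proba(new_customer)[0, 1]
print(f"Estimated default probability: {p_default:.2f}")

In practice, models of this kind are trained on many thousands of historical records and then evaluated using the measures discussed below.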

The primary interest of the machine learning research community is not the models themselves, but the algorithms used to build the models for each application area. When evaluating and comparing the performance of these algorithms, researchers and practitioners draw on a range of measures.

In terms of outcomes, the recommended measure is the cost of misclassification (Hand 1997). That is, when making a prediction or classification (or any decision), the cost arising from a mistake is the ideal measure of success, since it relates directly to reaching the best decision. Standard Economic Theory requires decision-makers to maximise their expected utility (Lawrence 1999); Information Economics builds on this to develop a sophisticated approach to valuing information based on so-called pay-off matrices – tables that capture the costs and benefits of the different decision outcomes.
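A minimal numerical sketch of this idea (the figures are invented for illustration): given estimated outcome probabilities and a pay-off matrix of costs and benefits, the decision-maker chooses the action with the highest expected pay-off.

import numpy as np

# Hypothetical pay-off matrix: rows are actions, columns are true outcomes.
# Actions: grant loan, refuse loan.  Outcomes: customer repays, customer defaults.
payoff = np.array([[ 100.0, -900.0],    # grant:  profit if repaid, loss if default
                   [   0.0,    0.0]])   # refuse: no profit, no loss

p_outcomes = np.array([0.92, 0.08])     # estimated probabilities of repay / default

expected_payoff = payoff @ p_outcomes   # expected pay-off of each action
best_action = int(np.argmax(expected_payoff))
print(expected_payoff, "-> choose action", best_action)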

Evaluating performance in this way is highly context-specific: the costs of misclassification for one decision-maker might be very different from those for another. In other cases, the costs might not be known a priori, or might be entirely intangible. To deal with these scenarios, decision-theoretic measures of outcome performance are used. As these are independent of the consequences of decisions, they evaluate only the ability of the model to identify the correct outcome.

The most widely used measures for binary decisions such as medical diagnoses are sensitivity and specificity (Hand 1997) and their derivatives. Essentially, these capture the rates of “false negatives” and “false positives” respectively (sensitivity is the true positive rate, specificity the true negative rate); sensitivity is known as recall in the document retrieval literature, where it is paired with the related measure of precision. In the direct marketing literature, it is more common to describe the success of a campaign in terms of the ratio of true positives (“hits”) to false positives. This generalises to the ROC[2] curve, which plots these two rates against each other. The area under the curve (AUC) is frequently used to compare models (Fawcett 2006; Provost et al. 1997). Other research has extended the ROC concept to incorporate costs, where they are available (Drummond and Holte 2006).
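As a brief sketch of how these measures are computed (labels and scores below are invented; the AUC calculation assumes scikit-learn’s roc_auc_score is available):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # 1 = positive case
y_score = np.array([0.9, 0.3, 0.7, 0.4, 0.2, 0.6, 0.8, 0.1])
y_pred  = (y_score >= 0.5).astype(int)               # threshold the scores at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)             # true positive rate (recall)
specificity = tn / (tn + fp)             # true negative rate
auc = roc_auc_score(y_true, y_score)     # area under the ROC curve
print(sensitivity, specificity, auc)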

A marketing-specific version of this concept is found in “lift”: the proportional expected improvement in classifying prospects over a random model. This idea is further developed in the L-Quality metric proposed by Piatetsky-Shapiro and Steingold (2000).
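For example, lift at the top decile of a scored prospect list might be computed as follows (a sketch with simulated scores and responses):

import numpy as np

def lift_at(scores, responded, fraction=0.1):
    # Response rate among the top-scored fraction, relative to the overall rate.
    n_top = max(1, int(len(scores) * fraction))
    top = np.argsort(scores)[::-1][:n_top]   # indices of the highest-scored prospects
    return responded[top].mean() / responded.mean()

rng = np.random.default_rng(0)
scores = rng.random(1000)
responded = (rng.random(1000) < 0.05 + 0.2 * scores).astype(int)  # higher scores respond more
print(lift_at(scores, responded))   # values above 1 indicate improvement over random targeting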

While these approaches are independent of costs, and so allow models to be evaluated in a general sense, they do not extend naturally to cases with more than two outcomes. For example, a CRM process that categorises each customer into one of four groups, depending on their likely future spend, cannot be characterised neatly in terms of false negatives and false positives. A further problem is that these approaches do not take into account the prior probabilities. For instance, suppose a process correctly categorises customers’ gender 97% of the time. That might sound high-performing, but not if it is being applied to a list of new mothers in a maternity hospital: there, a trivial rule that always guesses “female” would be correct nearly 100% of the time, so the process actually adds no information.

One approach to both of these situations is to use measures based on entropy, or the reduction of uncertainty, as first proposed by Shannon (Shannon and Weaver 1949). The machine learning community makes extensive use of a pair of measures proposed by Kononenko and Bratko (1991): the “average information score” and “relative information score”, which measure how much uncertainty a classifier removes, on average. Because they are grounded in information theory, these measures take into account both non-binary outcomes and prior probabilities, allowing performance comparisons across different decision tasks as well as different contexts.
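The following sketch shows the information score in the form it is commonly stated (the precise formulation should be checked against Kononenko and Bratko 1991): a classification that raises the probability of the true class earns positive information, one that lowers it earns negative information, and the average over a test set can be normalised by the entropy of the prior class distribution to give the relative score.

import numpy as np

def information_score(prior, posterior, true_class):
    # Information (in bits) contributed by a single classification, as the
    # Kononenko-Bratko measure is usually stated: prior[c] and posterior[c]
    # are the probabilities of the true class before and after classification.
    p, q = prior[true_class], posterior[true_class]
    if q >= p:
        return np.log2(q) - np.log2(p)        # probability of true class raised: positive information
    return np.log2(1 - p) - np.log2(1 - q)    # probability of true class lowered: negative information

# Hypothetical example: prior class distribution 70%/30%, classifier outputs
# 95%/5%, and the true class is class 0.
prior = np.array([0.7, 0.3])
posterior = np.array([0.95, 0.05])
score = information_score(prior, posterior, true_class=0)

prior_entropy = -(prior * np.log2(prior)).sum()
print(score, score / prior_entropy)   # average and relative score (here, over a single instance)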

CRM processes (which, at their core, involve segmenting customers into different groups for differentiated treatment) can be characterised as classifiers. From a classifier perspective there are three approaches to measuring their performance: cost-based (which is context-specific and to be preferred in real situations, if costs are available), decision-theoretic (useful for common cases involving binary decisions) and information-theoretic (useful for multiple outcome decisions with uneven prior probabilities).

When CRM processes are conceived as classifiers, the impact of information quality on their performance can be understood in terms of decision-making: how do IQ deficiencies result in the misclassification of customers? The methods and measures used for quantifying CRM performance (including scoring and valuing) can then be brought to bear to answer this question, indirectly, for customer IQ.

3.7        Conclusion

The information quality literature is replete with frameworks and definitions, few of which are theoretically grounded. These conceptual difficulties mean that measurement of IQ deficiencies is weak, especially in the area of valuation. A large and growing body of knowledge on quantifying value and uncertainty has been established in the fields of information economics, decision-making and information theory, yet it has seen little application to IQ. Customer Relationship Management provides a customer-level focus for IQ and, through its machine-learning models, offers a natural and obvious context for employing this established knowledge.

[1] “Householding” in the information quality context refers to the process of grouping related entities, for instance individuals who reside at the same house, or companies that fall under a shared ownership structure.

[2] The term “ROC” originated in communications engineering, where it referred to “Receiver Operating Characteristic”.
