Chapter 6 - Simulations

6.1       Summary

In Chapter 5, a theoretical framework for Customer Information Quality interventions was developed through a conceptual study. This framework seeks to describe quantitatively the effect of improving Information Quality (IQ) deficiencies on certain information systems (automated, data-driven customer processes in large organisations) and relate data attributes within these systems to wider organisational goals (business value). The intent is to provide analysts with some theoretically-grounded constructs, measures and steps to support them in evaluating investments in interventions to improve IQ. To evaluate and refine the theoretical development of the framework, simulations and mathematical analyses are used to investigate the relationship between improvements to IQ deficiencies and business value in this context.

Following a Critical Realist research approach, the investigation proceeds by triggering, under controlled conditions, the underlying mechanisms found in the ambient environment. The goal is to understand the operation of these mechanisms through the use of CMO patterns (“context-mechanism-outcome”). This understanding can then inform decision-making by practitioners in allocating resources to improve information quality (IQ).

The experimental design requires creating conditions that will activate (and inhibit) the mechanisms within the confines of the study, just as would happen in practice. To achieve this and ensure external validity, real-world customer datasets are sourced and decision models are developed and deployed using the same tools and procedures as encountered in practice.

The experiments employ a series of computer simulations of the customer processes to test the impacts of synthetic “noise” (information quality deficiency) upon the processes’ performance. These results and subsequent mathematical analyses are used to validate the metrics developed in the framework (from Chapter 5) that help analysts design, evaluate and prioritise investments in IQ interventions.

The key findings are that:

·         the effects of the “garbling” noise process on customer data can be analysed mathematically with a high degree of confidence.

·         the information-theoretic entropy metrics (derived from theory) are useful and practicable for selecting and prioritising IQ interventions.

·         these metrics can be translated readily into business impacts, expressed in terms of cash flows.

Based on the internal validity (robust and careful experimental design and execution) and external validity (re-creation of ambient conditions), the case for the generalisability of the experimental results is made.

The rest of this chapter is structured as follows. Section 2 examines the philosophical basis for this empirical work, linking it back to the research design (Design Science) and research philosophy (Critical Realism). Sections 3 and 4 introduce the scenarios under examination (including the datasets, algorithms and “noise” processes) as well as explaining the practicalities and technical details of how the simulations were undertaken. Section 5 uses results from the simulations to argue that the theoretical framework developed in Chapter 5 (including its metrics) can be operationalised and used in practical situations. Specifically, Section 5.1 shows how the pattern of observed outcomes from the “noise” process is well-described by its mathematical characterisation. Section 5.2 demonstrates that the proposed Influence metric can be used as a cheaper, more practicable proxy for assessing the actionability of an attribute in a particular organisational process. Section 5.3 models the effects of interventions and “noise” on the organisation’s costs and benefits. Finally, Section 6 looks at how these results can be packaged into a method for analysts to apply to specific situations.

6.2       Philosophical Basis

This section relates how concepts from Chapter 2 (Research Method and Design) apply to the design and conduct of a series of experiments into the effect of IQ deficiencies on the operation of customer decision models. Specifically, the applicability of Bhaskar’s Critical Realism for this task is discussed as well as the criteria for internal and external validity of the experiments (Mingers 2003).

In undertaking a Critical Realist (CR) experiment, it is important to recognise that the real world under study is ontologically differentiated and stratified (Bhaskar 1975; Bhaskar 1979) into three overlapping domains: the real, the actual and the empirical. The world exists independently of our experiences (that is, an ontologically realist position is adopted) while our access to and knowledge of it is filtered through socially-constructed categories and concepts (an epistemologically interpretivist stance).

This view is appropriate when considering customer attributes and events, which may have their roots in the natural world but are defined through social mechanisms such as the law. For example, gender is a complex genetic, biological and social phenomenon. In the context of customer information systems, the role of chromosomes is irrelevant; what counts is the social construction, whether that is by legal definition (eg birth certificate) or self-definition (eg asking the customer). Similar remarks could be made for date of birth and marital status.

This matters because assessing how well a system describes the attributes, statuses and events associated with customers depends on the host organisation’s existing shared concepts and categories. By way of illustration, consider marital status. Determining whether or not the list of possibilities is complete or that a given customer is mapped correctly will always be subjective and grounded in a particular culture or legal setting. Marriage is not a natural event or attribute and there is no objective determination of it. In CR terms, marital status is firmly in the transitive dimension.

The role of CR in these experiments extends to the nature of the claims to knowledge arising from them. The systems under analysis – data-driven customer decision-making processes – are systems that describe and predict aspects of real-world customers and their behaviour. Whether it is mortgage approvals or a direct mail campaign, these systems are created for (and assessed against) their ability to inform action based on likely future behaviours like credit defaults, responses to a marketing message and so on.

It is not a goal of this research to explain why people default on credit card payments or sign up to subscription offers, nor is it a goal to build better models of such behaviour. The “generative mechanisms” that trigger and inhibit such complex social behaviours lie, in ontological terms, in the realm of the real. Untangling these deep patterns of causality is outside the scope of this research. Instead, the domain of the actual is the focus, where we find events. In a business context, these events include applying for a loan, changing one’s name, making a purchase, signing up for a service and so on. We don’t have access to these events (we can’t perceive them directly) but instead our knowledge of them comes to us through our sense-perceptions of the “traces” they leave in the empirical: coloured images flashed on a computer screen, audio reproductions of human speech on a telephone in a call-centre, text printed on a receipt.

The databases and decision functions developed and deployed in the customer decision processes are themselves transitive objects embedded in the three domains. The “generative mechanisms” operating in the real are the underlying laws of nature that govern the operation of electrical and mechanical machinery. The patterns of causality operating at this level are highly controlled to give rise to the intended events in the actual: changes in system state and operation. We access the occurrence of these events through the empirical: perceptions of the graphics, texts and sounds (on screen or on paper) through which the state variables are revealed to us.

These system events (from the actual domain) are designed to reflect, or correspond to, the “real-world” customer events (also from the actual domain). Empirically, they may be entirely distinct. A decision model that predicts a customer will purchase a product may express this graphically on a computer screen. This looks nothing like an individual making a purchase at a cash register in a department store. However, the former corresponds to the latter (although they are not the same event).

Similarly, in the domain of the real the customer and system mechanisms are distinct. The underlying causal patterns that give rise to the purchase decision event are complex and grounded in social psychology, economics and cognitive science. The underlying causal patterns that gave rise to the model’s prediction event are grounded in electronics and computer engineering, constrained by software code that implements a mathematical function. That the latter can predict the former is due to the ability of the model to mimic (to an extent) these complex psycho-social causal structures. The pattern of customers in a certain postcode being more likely to make a certain purchase is detected, extracted and then implemented by the model. This mimicking mechanism – a decision tree, neural net, regression function or similar – is entirely distinct from the customer’s and bears no resemblance. From the perspective of the organisation that developed and deployed the decision model, it is only to be assessed against how well it predicts customer events, that is, how it performs in the domain of the actual. Superficial appearances in the empirical domain or deep understanding of causality in the real domain only matter to the extent that they impact upon events in the actual.

In terms of these experiments in information quality, the customers’ “generative mechanisms” (in the domain of the real) that give rise to their behaviours (events in the domain of the actual) are not relevant. What is important is how the “generative mechanisms” within the customer information systems give rise to the systems’ events (predictions of behaviours) and how the operation of these systems is perturbed by IQ deficiencies.

In short, these experiments concern the events arising from algorithmic models of customers, not customer behaviour. The extent to which these models reflect customers is not relevant for this research.

The experimental logic is to re-create in the laboratory these conditions (mechanisms, events and perceptions from their respective domains) and, in a controlled fashion, introduce IQ deficiencies. The impact of these deficiencies is manifested in the domain of the actual: the decision function (“generative mechanism”) may give rise to different events when different (deficient) customer events (as encoded in the data) are presented.

Under certain conditions, the systems’ “generative mechanisms” (the decision functions) remain unchanged and it becomes possible to ascribe causality to the changes in data. Specifically, if the underlying decision functions are, formally speaking, deterministic then any changes to observed events are attributed to changes in the input data alone and we can attempt to establish causality. This is only possible in closed systems where there is no learning or other change taking place (unlike, for example, human decision-making).

Since these IQ deficiencies can only impact upon the operation of the customer information system at the level of the actual, their generative mechanism doesn’t matter for these experiments. Consider the example of a mortgage approval system. If the “deceased flag” for a customer is changed from “alive” to “deceased”, then it may impact upon the mortgage approval decision (it is generally unlawful to grant credit to a deceased person, as well as commercially unwise). Such a change (“noise[12]”) may have a plethora of causes: perhaps a mis-keying at the point of data entry, a failed database replication process or a malicious fraud by an employee. Understanding the root-cause (aetiology) may be important for detecting the change and ensuring similar events do not occur subsequently. However, the effect of this change-event under the conditions of that particular mortgage approval system will be the same regardless of what caused it.

So understanding the “down-stream” effects of IQ deficiencies does not require understanding their “up-stream” causes. This is not to say that root-cause analysis is not important; it’s just to say that it is not necessary for the purpose at hand.

These experiments are concerned with the effects of noise-events in customer information systems upon the prediction-events arising from the algorithmic models of customer behaviour. The external validity of the experiments rests on how well the laboratory re-creates the ambient conditions found in practice, and the extent to which the systems’ “generative mechanisms” triggered (or inhibited) during the experiments mimic those found in practice. As argued above, it does not depend on how well the algorithms predict real customer behaviour, nor how well the IQ deficiencies match those found in ambient conditions.

From a CR perspective, the internal validity of the experiments is determined by whether the systems’ “generative mechanisms” (that is, the operation of electro-mechanical machines constrained by software that implements a statistical customer decision model) give rise to changed prediction-events in the presence of induced IQ deficiencies manifested as noise-events. More simply, the internal validity depends on how well I can exclude other potential explanations for changes in the prediction-events, such as programming errors or hardware failure.

The next section explains how the experiments were designed to meet these criteria for internal and external validity.

6.3       Scenarios

In order to ensure the external validity of the experiments, it is necessary to re-create the conditions and “generative mechanisms” in the laboratory as they operate in the ambient environment. This involves using technology, models, algorithms, statistics and data as employed in the target organisations, defined in Chapter 5 (Conceptual Study).

It’s important to note that these contrived laboratory conditions are not a “simulation[13]” of the ambient environment: they are a reconstruction. Therefore, it’s important that the pertinent elements that comprise the “generative mechanisms” are taken from those ambient environments. This means using real datasets, real tools and real customer models. The noise introduced, however, is synthetic and thus constitutes a simulation.

The IQ deficiencies are studied at the level of events (the actual) not mechanisms (the real). Hence, it is not required to re-create the same root-causes or “generative mechanisms” in the laboratory as in the ambient environment. That is, the noise is deliberate and contrived rather than occurring as a result of, for example, mis-keying by data entry clerks, disk failure or poorly-specified business rules.

As a consequence, the noise in the experiments will not reflect ambient conditions in prevalence or distribution. What is important with this element is that the noise process introduces IQ deficiencies in a controlled fashion, allowing deductions to be drawn from the observed effects. In practical terms, a well-defined, replicable noise-adding procedure is required.

6.3.1         Datasets

The requirement that the customer datasets be real constrained the candidate sets to those made publicly available for research purposes. The principal catalogue of such datasets is the UCI Machine Learning Repository (Asuncion and Newman 2007), which holds approximately 173 datasets. These datasets are donated by researchers and research sponsors (organisations) and are suitably anonymised and prepared for analysis. The typical use of these datasets is in the design, development and improvement of data mining and related statistical algorithms.

Of the 173 datasets, approximately one third each are categorical (nominal), numerical (ratio) and mixed data. Most are drawn from the domains of life sciences (48), physical sciences (28) and computer science/engineering (26). Five datasets explicitly relate to business and 14 to social sciences; these are the most likely to be construed as customer data.

In selecting the datasets that represent the ambient environment, the criteria were:

·         The selection should provide a representative class of decision tasks (classification, segmentation and prediction) using customer data.

·         Each dataset should contain both nominal and numerical data types.

·         The customer attributes should be generic (applicable across domains).

·         There should be sufficient attributes (columns) and instances (rows) to build realistic models.

·         The size and complexity of the datasets should not present practical or resourcing issues for the experiments.

Based on a review of the available options and these criteria, three datasets were selected for analysis, as described below.


 

6.3.1.1    ADULT

This dataset was derived from US Census data (1994) and was donated in 1996. It contains 48,842 instances (customer records) with the following 15 attributes (columns):

Attribute | Type | Values
Age | Numerical |
Workclass | Nominal | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
Fnlgwt | Numerical |
Education | Nominal | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num | Numerical |
marital-status | Nominal | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Occupation | Nominal | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship | Nominal | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race | Nominal | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex | Nominal | Female, Male
capital-gain | Numerical |
capital-loss | Numerical |
hours-per-week | Numerical |
native-country | Nominal | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
income | Nominal | <=50K, >50K

Table 13 ADULT dataset

The last attribute – income – is the target variable, or class, for the scenario; that is, the task is to predict (or classify) whether a given customer’s income is over or under $50,000 per annum. In a practical context, such a task may be performed in order to support the targeting of marketing messages (for example, branding or making an offer).

Note that for reasons of practicality, the dataset was sampled from almost 50,000 instances down to 10,000 (20% random sampling). This significantly improved the time and computing resources required during the experiments without impacting upon the validity of results. This is discussed further below.


 

6.3.1.2   CRX

This dataset concerns credit applications in an Australian financial service provider (identity is confidential). It was supplied by Ross Quinlan (University of Sydney) in 1986. The 16 attributes have been de-identified so that the semantics of the labels is not recoverable. There are 690 customer instances in this dataset.

Attribute | Type | Values
A1 | Nominal | b, a
A2 | Numeric |
A3 | Numeric |
A4 | Nominal | u, y, l, t
A5 | Nominal | g, p, gg
A6 | Nominal | c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff
A7 | Nominal | v, h, bb, j, n, z, dd, ff, o
A8 | Numeric |
A9 | Nominal | t, f
A10 | Nominal | t, f
A11 | Numeric |
A12 | Nominal | t, f
A13 | Nominal | g, p, s
A14 | Numeric |
A15 | Numeric |
A16 | Nominal | +, -

Table 14 CRX Dataset

The target variable, A16, presumably relates to subsequent customer credit worthiness (+) or defaulting behaviour (-), though this is conjecture.


 

6.3.1.3    GERMAN

This dataset relates to another customer credit task, this time using a wider range of customer attributes (21) to determine whether to grant a personal loan in Germany. It was donated by Hans Hoffman (University of Hamburg) in 1994. It contains 1000 customer records, as detailed below.

Attribute | Type | Values
Account Status | Nominal | A11, A12, A13, A14
Duration | Numerical |
Credit History | Nominal | A30, A31, A32, A33, A34
Purpose | Nominal | A40, A41, A42, A43, A44, A45, A46, A47, A48, A49, A410
Credit Amount | Numerical |
Savings | Nominal | A61, A62, A63, A64, A65
Employment | Nominal | A71, A72, A73, A74, A75
Installment Ratio | Numerical |
Marital Status and Sex | Nominal | A91, A92, A93, A94, A95
Guarantors | Nominal | A101, A102, A103
Residence | Numerical |
Property | Nominal | A121, A122, A123, A124
Age (years) | Numerical |
Other credit | Nominal | A141, A142, A143
Housing | Nominal | A151, A152, A153
Existing Credits | Numerical |
Job | Nominal | A171, A172, A173, A174
Dependents | Numerical |
Telephone | Nominal | A191, A192
Foreign Worker | Nominal | A201, A202
Creditworthiness | Nominal | Good, Bad

Table 15 GERMAN Dataset

There are two identified limitations with the selected datasets. The first is the nature of the decision task: all are binary-valued rather than more complex segmentation tasks, such as assigning each customer to one of, say, 50 segments or recommending a product to a customer based on related purchases (collaborative filtering). Secondly, the decision distributions in all three datasets are quite balanced (ranging from a 30-70 split to a 50-50 split). Some applications, like fraud detection, are heavily unbalanced (with perhaps <1% of customers being fraudulent).

These three datasets meet the explicit criteria. The decision tasks relate to three typical customer decision processes: segmentation (perhaps to support a direct marketing campaign), credit card approval and the awarding of personal loans. All datasets contain a mix of nominal and numerical data and commonly-used customer attributes such as sex, age, education, credit history, work type, income and family situation are represented. The number of attributes ranges from 15 to 21, while the number of instances ranges from 690 to 10,000. These datasets have been used widely by data mining and related researchers for many years to develop and test models, and so are likely to have been deemed sufficiently representative of what happens in practice by this research community.

6.3.2         Decision functions

The goal in selecting the statistical algorithms used to produce the decision functions is to reproduce in the laboratory the same generative mechanisms found in the ambient environment. This means choosing a subset of candidate functions that is likely to be representative of the modelling tools and techniques used in practice. Thus algorithms were selected not on the basis of their performance per se, but on their wider adoption by customer modelling practitioners. Hence, functions were sought that have been known for a long time, are well understood by researchers, taught at universities to students, implemented in many software packages and feature in textbooks.

Based on these criteria, the following five decision functions (algorithms) were selected. These descriptions are illustrative, and more detail about the parameter selection is provided subsequently. The specifics of how these decision functions operate are not important for the present purpose, but details can be found in most data mining texts (Han and Kamber 2006).

·         ID3 (ID3-numerical). This algorithm is the modified version of the original ID3 rule induction algorithm, designed to deal with numerical as well as nominal data. ID3 was one of the first decision-tree algorithms to be developed and remains an important algorithm for building decision trees.

·         AD (Alternating Decision Tree). Another decision tree algorithm, this one is more modern (1999) and employs the machine learning technique of boosting.

·         NB (Naïve Bayes Tree). This decision tree induction algorithm uses the Naïve Bayes algorithm at the leaves. That is, it assumes statistical independence in the input attributes.

·         BNet (Bayes Net). This algorithm employs a Bayesian Network (a model of interconnected nodes, weighted by their conditional probabilities) to construct the decision function.

·         LMT (Logistic Model Tree). In this algorithm, linear logistic regression is embedded within the tree induction process.

Two other algorithms were considered for inclusion owing to their popularity. Quinlan’s C4.5 algorithm (Quinlan 1993) is a successor to ID3 that uses a different splitting criterion (information gain ratio instead of information gain), supports nominal data and handles missing values. Since the implementation of ID3 used here addresses all these issues[14], C4.5 was excluded. The second algorithm was CHAID (“chi-squared automatic interaction detection”). This algorithm uses the familiar χ2 statistic as the splitting criterion during tree induction. However, when evaluated under test conditions it behaved the same as ID3 and so was also excluded.

The major limitation with this set of decision functions is that, for reasons of practicality, certain classes of more esoteric functions (such as neural networks, genetic algorithms and support vector machines) are omitted, as is the more mundane approach of manually-derived IF-THEN type decision rules used in simpler, smaller-scale situations. However, there is no theoretical or practical reason to suggest that these alternatives would behave markedly differently.

A survey of 203 data mining and analytics practitioners in March 2007 suggests that the use of these esoteric algorithms is not widespread, with neural networks reportedly used by 17.2% of respondents within the previous 12 months, support vector machines by 15.8% and genetic algorithms by 11.3% (Piatetsky-Shapiro 2007b). By contrast, decision trees were used by 62.6%.

The set of machine learning rule induction algorithms selected here are non-trivial, widely used and based on different theoretical approaches (Information Theory, Bayesian statistics and linear logistic functions). They are well understood by researchers, supported by most statistical and data mining software packages and are practicable to implement.

6.3.3         Noise process

Since these experiments are concerned with the effects of IQ deficiencies and not the causes, a “noise process” is required that can introduce errors to the datasets in a controlled fashion. That is, the noise process is not required to reflect the “generative mechanisms” for noise in ambient conditions but rather one that induces noise-events in the domain of the actual.

In order to allow other researchers to verify these results, the noise process should be simple, practicable (in terms of programming and execution effort), analysable (closed-form algebraic solutions), repeatable, generic (applies to all data types) and explicable (without requiring deep mathematical knowledge). Failure to meet any of these criteria would undermine the replicability of the study.

The noise process selected is termed garbling[15] (Lawrence 1999) and it is applied to the dataset on a per-attribute (column) basis. It has a single parameter, g, that ranges from 0 (no garbling) to 1 (maximal garbling). In essence, it involves swapping field values in the dataset, as follows:

For a given dataset attribute (column) and garbling rate, g:

1.        For the ith customer, Ci, pick a random number, p, on the interval (0,1].

2.       If p ≤ g then garble this value:

2.1.     Select another customer, Cj, at random from the dataset.

2.2.     Swap the ith and jth customers’ values for the given attribute.

3.        Move to the (i+1)th customer and repeat from step 1 until all records are processed.

Thus, when g=0 none of the records are garbled and when g=1 all of them will be. In this way, a controlled amount of noise is introduced to any attribute in each of the datasets. (Please see Appendix 1 for the actual source code, in JavaScript, that implements this algorithm.)
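To make the procedure concrete, the following is a minimal JavaScript sketch of the garbling step as described above. It is not the Appendix 1 implementation; the function and variable names (garbleColumn, rows, attr) are illustrative assumptions only.

```javascript
// Minimal sketch of the per-attribute garbling procedure described above.
// `rows` is an array of record objects, `attr` the attribute (column) name,
// and `g` the garbling rate on [0, 1]. Math.random() returns a value in [0, 1),
// which approximates the (0,1] interval used in the text.
function garbleColumn(rows, attr, g) {
  const n = rows.length;
  let garbleEvents = 0;
  for (let i = 0; i < n; i++) {
    const p = Math.random();                     // step 1: pick a random number
    if (p <= g) {                                // step 2: garble this value
      const j = Math.floor(Math.random() * n);   // step 2.1: pick a target record at random
      const tmp = rows[i][attr];                 // step 2.2: swap source and target values
      rows[i][attr] = rows[j][attr];
      rows[j][attr] = tmp;
      garbleEvents++;
    }
  }
  return garbleEvents;                           // number of garble events performed
}

// Example: garble the "sex" attribute of a toy dataset at g = 0.5.
const data = [{ sex: "Male" }, { sex: "Female" }, { sex: "Male" }, { sex: "Male" }];
console.log("garble events:", garbleColumn(data, "sex", 0.5));
console.log(data.map(r => r.sex).join(", "));
```

Because values are only ever swapped between existing records, the frequency distribution of the column is preserved, which is the property noted below.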

It should be noted that not all records will be altered when they are garbled. Consider an example when Customer Record #58 has the “gender” field (presently “male”) garbled. Suppose Customer Record #115 is randomly selected to swap with #58. Suppose #115 also has a “gender” field of “male”, so that when the values are swapped they are both unaltered. This is analysed in detail in Section 6.5.1.

This noise process has some desirable properties. Firstly, it is quite intuitive and easy to visualise and implement. Secondly, it preserves the prior probability distributions over the dataset. That is, if the breakdown of frequencies of field-values in a given attribute is [0.15, 0.35, 0.2, 0.3] beforehand, then this will not change after garbling. As a consequence, it will not change the distributional statistics like mode or mean (for numeric data types). Lastly, it handles numeric data types without assuming an underlying statistical model. For example, it does not require assumptions of linearity or normality and since it only ever uses existing field-values it will not inadvertently generate “illegal” values (such as negative, fractional or out-of-range values).

In an information-theoretic sense, garbling is a “worst-case” noise event. That is, all information about the true external world value is completely and irrevocably lost: if a customer’s attribute has been garbled then there is absolutely no clue as to what the original value might have been. An observer is no better informed about the external world value after looking at the record than before. By contrast, another noise process like Gaussian perturbations (ie adding a random offset) retains some clue as to the original value.

In terms of the Ontological Model of IQ (Wand and Wang 1996), this worst-case situation occurs when ambiguity is maximised, perhaps as a result of a design failure. For example, the system is in a meaningless state (ie one which does not map to an external world state) or there is an incomplete representation (the external-world value cannot be expressed in the system).

In a practical sense, this kind of “disastrous” error would arise in situations where:

·         a field-value has been deleted or is missing,

·         an indexing problem meant an update was applied to the wrong customer record,

·         two or more customers have been inadvertently “fused” into the same record,

·         the field has been encoded in such a way as to be meaningless to the user or application,

·         an external-world value is not available, so an incorrect one is used instead “at random”.

It would not apply to situations where the IQ deficiency retains some information about the original value. For example, a problem of currency in customer addressing might arise when a customer changes residence. In such situations, the correct external world value is likely to be correlated to some degree with the “stale” value. Another example might be the use of subjective labels (such as eye colour) where one might expect some correlation between incorrect and correct values (eg “brown” is more likely to be mis-mapped to “hazel” than “blue”). Lastly, issues around precision in hierarchical data (a form of ambiguity) would also not be reflected by this process. For example, mis-describing a customer as residing in “United States” rather than “San Francisco, California, USA” would not arise from garbling.

6.4       Experimental Process

This section outlines the sequence of steps undertaken to implement the series of experiments. The goal is to explain how the internal and external validity of the study was maintained, to place the outcomes into context and allow repetition of the study to verify outcomes.

6.4.1         Technical Environment

The technical environment for the experiments was contrived to reproduce the ambient conditions found in practice. As outlined in Section 3, the scenarios (including the datasets, decision tasks and algorithms) were selected against criteria designed to realise this reproduction. The implementation platform for the experiments was constructed in keeping with this goal, and comprised the following technical elements:

·         standard low-end desktop PC (ie 2GHz processor, 512MB RAM, 120GB HDD, networking and peripherals),

·         Windows XP SP2 (operating system),

·         RapidMiner 4.1 (data mining workbench),

·         WEKA 3.410 (machine learning algorithm library),

·         Microsoft Excel (data analysis spreadsheet),

·         GNU command line tools (batched data analysis),

·         wessa.net (online statistics package).

The bulk of the model building, experimental implementation and data collection were undertaken with the RapidMiner tool. This is the leading open source data mining and predictive analytics workbench. Formerly known as YALE (“Yet Another Learning Environment”), it has been developed at the University of Dortmund, Germany, since 2001. It is a full-featured tool for building, testing and analysing models, incorporating a very large number of learning algorithms with a graphical user interface for setting up experiments.

WEKA (“Waikato Environment for Knowledge Analysis”) is a similar open source workbench, developed by New Zealand’s University of Waikato since 1997. WEKA’s library of over 100 learning functions is available for use within the RapidMiner environment and, owing to its more comprehensive selection and code documentation, was used in this instance.

A survey of 534 data mining and analytics practitioners in May 2007 found that RapidMiner was ranked second, used by 19.3% of respondents (Piatetsky-Shapiro 2007a). The most used was the commercial product, SPSS Clementine, at 21.7%. WEKA had a 9.0% share. While web-based surveys are open to “gaming” by vendors with a commercial interest – as acknowledged by the researchers – this does provide support for the assertion that the laboratory conditions in this experiment recreate those found in ambient environments.

It must be emphasised that commercial and open source tools can be used to build, validate and analyse the same decision models, as they implement a largely overlapping set of learning algorithms. While they differ in their interfaces and vary somewhat in their capabilities, the resulting models are identical and, as “generative mechanisms”, invoke the same events in the domain of the actual.

6.4.2        Creating models

The first step is to create the decision functions (models) from each of the three datasets (ADULT, CRX and GERMAN) using each of the five learning algorithms (ID3, AD, NBTree, BNet, LMT). This results in 15 decision models.

As explained in Section 3, the datasets contain a set of attributes and a target variable, or class, which is the “correct” decision or classification, as assessed by the domain experts who provided the datasets. The purpose of the learning algorithms is to build models that successfully predict or classify this target value. The attribute values are taken to be “correct” but there are some missing values. The ADULT dataset has 1378 missing values (0.98%), CRX has 67 (0.65%) while GERMAN has none. RapidMiner’s built-in missing value imputation function was used to substitute the missing values with the mode (for nominal data) or mean (for numerical data).

Building the model consists of presenting the data in CSV (comma separated value) format to the RapidMiner tool and applying the specified learning algorithm to construct the model. In most cases, the learning algorithm has a number of parameters that are available for tuning the performance. Rather than employing sophisticated meta-learning schemes (whereby another learner is used to tune the parameters of the original model), modifications were made by hand, using the performance criterion of “accuracy”[16]. To ensure the models weren’t “over-trained”, automatic validation was employed where the algorithm was tested against a “hold out set” (subset of data unseen by the learning algorithm).
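As an illustration of the validation logic only (the actual experiments used RapidMiner’s facilities for this), the JavaScript sketch below shows the general shape of a hold-out evaluation. The learner passed in (here a trivial majority-class predictor) and all names are assumptions for illustration, not the algorithms used in the study.

```javascript
// Sketch of hold-out evaluation: train on one subset, measure accuracy on a
// subset the learner has never seen, to guard against over-training.
function holdOutAccuracy(rows, label, trainFn, holdOutFraction) {
  const shuffled = rows.slice().sort(() => Math.random() - 0.5); // crude shuffle, fine for a sketch
  const cut = Math.floor(rows.length * (1 - holdOutFraction));
  const trainSet = shuffled.slice(0, cut);
  const testSet = shuffled.slice(cut);            // the "hold out set"
  const model = trainFn(trainSet, label);         // returns a function: row -> predicted class
  let correct = 0;
  for (const row of testSet) {
    if (model(row) === row[label]) correct++;
  }
  return correct / testSet.length;                // "accuracy" on unseen data
}

// A trivial majority-class learner, just to exercise the harness.
const majorityLearner = (train, label) => {
  const counts = {};
  for (const r of train) counts[r[label]] = (counts[r[label]] || 0) + 1;
  const majority = Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b));
  return () => majority;
};

const toy = [
  { colour: "red", cls: "A" }, { colour: "blue", cls: "B" },
  { colour: "red", cls: "A" }, { colour: "red", cls: "A" }
];
console.log(holdOutAccuracy(toy, "cls", majorityLearner, 0.25));
```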

The resulting 15 models are considered, for the purpose of these experiments, to be those constructed with “perfect information”. In each case, the model was exported as a set of rules (or weights), encoded in XML for subsequent re-use.

To illustrate, the decision model created by the ID3 algorithm on the ADULT dataset is reproduced below in graphical form. In general, the resulting models are far too complex to be visualised in this way (especially the Bayesian Networks and Logistic Model Trees). However, this image does give some insight into the general form that these tree-type decision models take: a series of nodes that encode conditional logic (IF-THEN rules) being traversed sequentially before arriving at a “leaf node”, or final prediction or classification, in this case “>50K” or “<=50K”.


Figure 20 ID3 Decision Tree for ADULT Dataset

The resulting models’ performances are detailed below. Rather than just providing the classification “accuracy” rates, these are reported as overall “mistake rates” broken down into Type I mistakes (false positives) and Type II mistakes (false negatives), where positive in this case refers to the majority class. (Given that the classes are in general quite well-balanced, the ordering is somewhat arbitrary.) This extra information is used in subsequent cost-based analysis, since different mistake types attract different costs.

Model Mistake Rates (Type I, Type II) | ADULT | CRX | GERMAN | Averages
ID3 | 18.3% (17.6%, 0.67%) | 14.1% (11.0%, 3.04%) | 27.2% (16.6%, 10.6%) | 19.9%
AD | 14.5% (9.91%, 4.63%) | 12.8% (5.65%, 7.10%) | 14.6% (14.7%, 9.90%) | 14.0%
NBtree | 13.0% (7.99%, 4.98%) | 5.80% (2.46%, 3.33%) | 18.3% (11.9%, 6.40%) | 12.4%
BNet | 16.7% (4.80%, 11.9%) | 11.7% (3.48%, 8.26%) | 22.9% (13.3%, 9.60%) | 17.1%
LMT | 13.1% (9.11%, 3.97%) | 3.19% (1.45%, 1.74%) | 17.0% (11.2%, 5.80%) | 11.1%
Averages | 15.1% | 9.52% | 20.0% | 14.9%

Table 16 - Decision Model Performance by Algorithm and Dataset

The internal validity of these experiments hinges on using the tools correctly in applying the nominated algorithms to the datasets and producing the intended models. To support this, model performance results were sourced from the peer-reviewed machine learning literature and compared with these experimental results. As it was not possible to find studies that produced results for every algorithm on every dataset, a representative sample across the datasets and algorithms was chosen. Note also that, as discussed in Section 3.2 above, C4.5 is a substitute for ID3-numerical (with information gain ratio as the splitting criterion, as used in this study). Since studies using ID3-numerical weren’t found, C4.5 is used for comparing these results with other studies.

Firstly, when Kohavi introduced the NBtree algorithm he compared the new algorithm against Quinlan’s C4.5 using a number of datasets, including ADULT and GERMAN (Kohavi 1996). Summarising, he found that C4.5 on ADULT (at 10,000 instances) had a mistake rate of 16%[17] and the NBtree algorithm improved that by 2% (to 14%). Sumner, Frank and (2005) reported mistake rates using LMT on ADULT at 14.39% and using LMT on GERMAN at 24.73%. Ratanamahatana and Gunopulos (2003) reported a mistake rate of 26% on GERMAN with C4.5. For the CRX dataset, Liu, Hsu and Yiming (1998) report a 15.1% mistake rate using C4.5, and 27.7% with the same algorithm on GERMAN. Cheng and Greiner (1999) found a mistake rate of 14.5% for BNet on ADULT. For the AD algorithm, Freund and Mason (1999) found a mistake rate of 15.8% on CRX. For the same dataset and algorithm, Holmes, Pfahringer et al. (2002) had a result of 15.1%.

Of course, the numbers reported in the literature do not exactly align with those found here. Slight differences can be accounted for by factors such as the handling of missing values (some studies simply drop those instances; here, the values were instead imputed) or the setting of particular tuning parameters (these are not reported in the literature so reproducing them is not possible). The largest discrepancy was for LMT on GERMAN (17.0% here compared with 25.0% in one study). This algorithm also has the largest number of tuning options, increasing the chances of divergence and the possibility of model over-fitting.

The overall closeness of the results reported in the literature to those reproduced here gives support to the claim of internal validity: the events induced in these experiments (reported as performance metrics) result from the triggering of the intended underlying generative mechanisms and not “stray” effects under laboratory conditions. This gives assurance that the technical environment, selection of tools, handling and processing of datasets, application of algorithms and computing of performance metrics were conducted correctly, ensuring that, for example, the wrong dataset wasn’t accidentally used and that there were no software problems with the implementation of an algorithm.

6.4.3         Data Preparation

The next step is to prepare the data for analysis. A simple web page was developed to provide an interface to custom JavaScript code used in the data preparation. This form consisted of an input for the dataset, along with testing and tuning parameters to control the process of introducing noise.

Firstly, as outlined above, two of the datasets had some missing values (ADULT and CRX), denoted by a “?” character. For nominal data, they were substituted with the mode value. For numerical data, the mean was used. This is known as imputation. Secondly, noise was introduced using the garbling algorithm (Section 3.3). As discussed, this involved iterating through each attribute, one at a time, and swapping data values according to a threshold parameter, g. For each attribute in each dataset, ten levels of garbling at even increments were applied (g=0.1, 0.2, …, 1.0). Finally, the resulting “garbled” datasets were written out in a text-based file format, ARFF, used in a number of analytic tools.
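A minimal JavaScript sketch of this preparation step is shown below, assuming records are held as objects and missing values are marked with “?” as described. The function names are illustrative assumptions; this is not the actual preparation code.

```javascript
// Sketch of missing-value imputation: "?" entries are replaced by the column
// mode (nominal attributes) or mean (numerical attributes).
function imputeColumn(rows, attr, isNumeric) {
  const present = rows.filter(r => r[attr] !== "?").map(r => r[attr]);
  let fill;
  if (isNumeric) {
    const nums = present.map(Number);
    fill = nums.reduce((a, b) => a + b, 0) / nums.length;                        // mean
  } else {
    const counts = {};
    for (const v of present) counts[v] = (counts[v] || 0) + 1;
    fill = Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b)); // mode
  }
  for (const r of rows) if (r[attr] === "?") r[attr] = fill;
}

// Ten garbling levels per attribute, as described above (g = 0.1, 0.2, ..., 1.0).
const garbleLevels = Array.from({ length: 10 }, (_, k) => (k + 1) / 10);
console.log(garbleLevels);
```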

This data preparation step resulted in ten “garbled” datasets for each of the 49 attributes (that is, 14 from ADULT, 15 from CRX and 20 from GERMAN) for a total of 490 datasets.

 

 

6.4.4        Execution

This phase involves applying each of the five derived decision functions to the 490 garbled datasets, sequentially, and determining how the resulting decisions differ from the original. It is realised by using the RapidMiner tool in “batch mode” (that is, invoked from the command line in a shell script, rather than using the Graphical User Interface).

It’s worth emphasising that the decision functions are developed (trained) using “perfect information” – no garbling, no missing values. Even so, the mistake rate (misclassifications) is around 15%, reflecting the general difficulty of building such decision functions. This study is concerned with the incremental effect of noisy data on realistic scenarios[18], not the performance of the decision functions themselves. So, in each case, the baseline for evaluation is not the “correct decision” (supplied with the dataset) as used in the development (training) phase. Rather, the baseline is the set of decisions (or segments or predictions) generated by the decision function on the “clean” data (g=0). To generate this baseline data, the models were run against the three “clean” datasets for each of the five decision functions, for a total of 15 runs.

For each of the five decision functions, all 490 garbled datasets are presented to the RapidMiner tool, along with the baseline decisions. For each instance (customer), RapidMiner uses the decision function supplied to compute the decision. This decision is compared with the baseline and, if it differs, reported as a misclassification. Since all scenarios involved a binary (two-valued) decision problem, the mistakes were arbitrarily labelled Type I (false positive) and Type II (false negative).
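The comparison logic can be sketched as follows in JavaScript; the names and data shapes are illustrative assumptions, not the RapidMiner implementation.

```javascript
// Sketch of the comparison step: decisions on garbled data are compared with the
// baseline decisions made on clean data (g = 0), not with the "correct" labels.
// "positiveClass" denotes the (arbitrarily chosen) majority class, as in the text.
function countMistakes(baseline, garbled, positiveClass) {
  let typeI = 0, typeII = 0;                      // false positives / false negatives
  for (let i = 0; i < baseline.length; i++) {
    if (garbled[i] === baseline[i]) continue;     // decision unchanged: no mistake
    if (garbled[i] === positiveClass) typeI++;    // changed to the positive class
    else typeII++;                                // changed away from the positive class
  }
  return { typeI, typeII, total: typeI + typeII, rate: (typeI + typeII) / baseline.length };
}

// Example with a binary decision (">50K" as the positive class).
const clean = [">50K", "<=50K", ">50K", "<=50K"];
const noisy = [">50K", ">50K", "<=50K", "<=50K"];
console.log(countMistakes(clean, noisy, ">50K")); // { typeI: 1, typeII: 1, total: 2, rate: 0.5 }
```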

In total, 2450 runs were made: five decision functions tested on the 49 attributes (from three datasets), each with 10 levels of garbling. The estimated computing time for this series of experiments was 35 hours, run in overnight batches.

6.4.5        Derived Measures

The last step is computing a key entropy statistic, the Information Gain, for each attribute to be used in subsequent analyses. This is done within the RapidMiner environment using the built-in Information Gain function. For numerical attributes, the automatic “binning” function (minimum entropy discretisation) was used to create nominal value ranges, and the numerical values were mapped into these.
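For reference, the underlying Information Gain calculation for a nominal attribute can be sketched as follows. The experiments used RapidMiner’s built-in operator; this JavaScript version is illustrative only.

```javascript
// Shannon entropy (base 2) of an array of class labels.
function entropy(values) {
  const counts = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  return Object.values(counts).reduce((h, c) => {
    const p = c / values.length;
    return h - p * Math.log2(p);
  }, 0);
}

// Information Gain of an attribute with respect to the class:
// IG = H(class) - sum over attribute values v of p(v) * H(class | attr = v).
function informationGain(rows, attr, classAttr) {
  const baseH = entropy(rows.map(r => r[classAttr]));
  const groups = {};
  for (const r of rows) (groups[r[attr]] = groups[r[attr]] || []).push(r[classAttr]);
  let conditionalH = 0;
  for (const vals of Object.values(groups)) {
    conditionalH += (vals.length / rows.length) * entropy(vals);
  }
  return baseH - conditionalH;
}
```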

A series of shell scripts using command line GNU tools (grep, sed and sort) pull the disparate metrics into a single summary file with a line for each experiment comprising:

·         decision function identifier

·         dataset identifier

·         attribute identifier

·         garble rate

·         garble events

·         error rate

·         mistakes (total, Type I and Type II)

This summary file was loaded into a spreadsheet (Excel) for further analysis, as outlined below.


 

6.5       Results and derivations

The key results are provided in tabular form, grouped by experimental conditions (dataset and algorithm). Supporting derivations and analyses are provided below alongside the experimental results, to show how they support the development of the metrics used during the design and appraisal of IQ interventions.

6.5.1         Effects of Noise on Errors

The first metric, called gamma (γ), relates the garbling mechanism to actual errors. Recall that an error refers to a difference in the attribute value between the external world and the system’s representation. For example, a male customer mis-recorded as female is an error. Suppose a correct attribute value (male) is garbled, that is, swapped with the value from another record selected at random. There is a chance that the other record’s value will also be male, in which case the garbling event will not introduce an error. The probability of this fortuitous circumstance arising does not depend on the garbling process itself or the rate of garbling (g), but on the intrinsic distribution of values on that attribute.

Attribute | ID3 | AD | NB | BNet | LMT | Average
a0 | 98% | 98% | 98% | 98% | 97% | 98%
a1 | 43% | 43% | 43% | 43% | 43% | 43%
a2 | 100% | 100% | 100% | 100% | 100% | 100%
a3 | 81% | 81% | 80% | 81% | 81% | 81%
a4 | 81% | 81% | 81% | 81% | 81% | 81%
a5 | 66% | 67% | 66% | 67% | 67% | 67%
a6 | 89% | 89% | 88% | 89% | 89% | 89%
a7 | 74% | 73% | 73% | 72% | 74% | 73%
a8 | 26% | 26% | 25% | 26% | 25% | 26%
a9 | 44% | 44% | 44% | 44% | 44% | 44%
a10 | 16% | 16% | 16% | 16% | 16% | 16%
a11 | 9% | 9% | 9% | 9% | 9% | 9%
a12 | 76% | 76% | 76% | 77% | 75% | 76%
a13 | 15% | 15% | 15% | 15% | 16% | 15%
c0 | 41% | 46% | 41% | 44% | 44% | 43%
c1 | 100% | 101% | 101% | 101% | 99% | 100%
c2 | 97% | 101% | 99% | 100% | 101% | 100%
c3 | 37% | 37% | 35% | 36% | 36% | 36%
c4 | 38% | 37% | 36% | 37% | 37% | 37%
c5 | 88% | 91% | 89% | 92% | 93% | 91%
c6 | 59% | 58% | 61% | 61% | 60% | 60%
c7 | 97% | 99% | 95% | 96% | 97% | 97%
c8 | 49% | 50% | 49% | 49% | 50% | 49%
c9 | 50% | 52% | 52% | 52% | 50% | 51%
c10 | 69% | 68% | 65% | 67% | 66% | 67%
c11 | 49% | 50% | 48% | 49% | 49% | 49%
c12 | 17% | 17% | 17% | 18% | 16% | 17%
c13 | 94% | 97% | 94% | 93% | 94% | 94%
c14 | 81% | 83% | 79% | 80% | 81% | 81%
g0 | 70% | 69% | 70% | 69% | 69% | 69%
g1 | 91% | 89% | 87% | 89% | 87% | 89%
g2 | 62% | 62% | 62% | 62% | 63% | 62%
g3 | 82% | 82% | 82% | 82% | 81% | 82%
g4 | 101% | 101% | 100% | 100% | 101% | 101%
g5 | 61% | 59% | 58% | 59% | 59% | 59%
g6 | 76% | 75% | 75% | 77% | 77% | 76%
g7 | 66% | 68% | 69% | 69% | 67% | 68%
g8 | 60% | 59% | 58% | 61% | 58% | 59%
g9 | 18% | 16% | 18% | 17% | 18% | 17%
g10 | 70% | 69% | 71% | 69% | 69% | 69%
g11 | 73% | 73% | 73% | 73% | 70% | 73%
g12 | 98% | 98% | 95% | 97% | 97% | 97%
g13 | 31% | 32% | 31% | 31% | 31% | 31%
g14 | 45% | 46% | 46% | 45% | 45% | 46%
g15 | 47% | 49% | 49% | 49% | 48% | 48%
g16 | 52% | 54% | 54% | 53% | 54% | 54%
g17 | 27% | 25% | 25% | 25% | 25% | 25%
g18 | 49% | 48% | 48% | 49% | 48% | 49%
g19 | 7% | 7% | 7% | 8% | 7% | 7%

Table 17 gamma by Attribute and Decision Function

The attributes are labelled so that the first letter (a, c or g) corresponds to the dataset (ADULT, CRX and GERMAN respectively). The following number indicates the attribute identifier within the dataset. Note that, by definition, the valid range for γ is 0% to 100%, but some values reported here slightly exceed this (101%). This is because the values are empirical estimates: for a given garbling rate g, the actual number of garble events performed has a small variance around its expected value.

So, to illustrate, the attribute a9 (“sex”, in the ADULT dataset) has a γ of 44%. This means that swapping a customer’s sex value with another chosen at random has a 56% chance of leaving the value unchanged.

The derivation for γ is as follows. Firstly, I model the garbling process as a simple first-order Markov chain, with a square symmetric transition probability matrix whose marginal distribution follows the prior probability mass function. For example, attribute a9 from above has a prior probability distribution of A = [0.67 0.33]^T. The garbling process for this attribute can be modelled with the following matrix, T_a9:

\[ T_{a9} = A A^{T} = \begin{bmatrix} 0.67 \\ 0.33 \end{bmatrix} \begin{bmatrix} 0.67 & 0.33 \end{bmatrix} \approx \begin{bmatrix} 0.45 & 0.22 \\ 0.22 & 0.11 \end{bmatrix} \]

The interpretation is that 45% of the time that a record is garbled, it will start out as “male” and stay “male” (no error); 11% of the time it will start out as “female” and remain “female” (no error); and the remaining 44% of the time (22% + 22%) the value will change (an error). In this sense, γ is a measure of the inherent susceptibility of the attribute to error in the presence of garbling.

In general, for a given attribute’s probability distribution over N values [w1, w2, …, wN], the value of γ (the probability of a garble leading to an error) is computed by summing the off-diagonal values:

\[ \gamma = \sum_{i \neq j} w_i w_j = 1 - \sum_{i=1}^{N} w_i^2 \]

This metric effectively measures the inverse of the “concentration” of values for an attribute. An attribute that is “evenly spread” (eg A = [0.24 0.25 0.26 0.25]) will have a high γ value, approaching 1. The most “evenly spread” distribution is the uniform distribution, when each of the N values is equal to 1/N:

\[ \gamma_{max} = 1 - \sum_{i=1}^{N} \left( \frac{1}{N} \right)^2 = 1 - \frac{1}{N} \]

For example, attribute a2 has γ =0.999 because it has a large number of possible values, each with approximately 2% probability of occurring.

By contrast, an attribute that is highly concentrated, eg A = [0.001 0.001 0.001 0.997], will have a low γ, approaching 0. The extreme case is when an attribute follows a Kronecker delta distribution of [1 0 0 0 … 0]. In this case, γ_min = 0.
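The following JavaScript sketch computes γ directly from an attribute’s observed values using the relationship γ = 1 − Σ wᵢ² derived above; the function name and sample data are illustrative assumptions.

```javascript
// Compute gamma for an attribute from its observed value frequencies:
// gamma = 1 - sum of squared value proportions.
function gamma(values) {
  const counts = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  const n = values.length;
  let sumSquares = 0;
  for (const c of Object.values(counts)) sumSquares += (c / n) ** 2;
  return 1 - sumSquares;
}

// Example: a sex-like attribute with prior [0.67, 0.33] gives gamma ≈ 0.44,
// matching the value reported for a9 in Table 17.
const sample = [].concat(Array(67).fill("Male"), Array(33).fill("Female"));
console.log(gamma(sample).toFixed(2)); // "0.44"
```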

So γ is an intrinsic property of an attribute, constant regardless of the underlying garbling rate, g, or indeed how the attribute is used in decision functions. I can now analyse how γ and g combine to produce the observed errors under the garbling process described here.

I begin by defining ε as the error rate of an attribute; g is the garbling parameter and γ is as above. Recall that the garbling process works by sweeping through each customer record and, for each record, with probability g, swapping that record’s value with another. By way of terminology, I say the original record is the source and the randomly selected second record is the target. As shown above, the chance that this swap will result in a changed value (ie an error) is γ.

However, the proportion of records that is garbled is not simply g, the garbling parameter. The reason is that a given customer record might be selected for swapping during the sweep (that is, as the source, with probability g) but it may also be selected as the target in another swap. Whether selected as a source or a target, there is still a probability γ that the swap will result in an error.

Putting these elements together, the error rate can be written as:

\[ \varepsilon = \gamma \left[ R_S + (1 - R_S)\, R_T \right] \]

Here, R_S is the rate at which records are the source in a swap and R_T is the rate at which they are the target. Clearly, R_S is g, the garbling parameter: a value of 0 implies no record is selected for swapping (hence R_S = 0) and a value of 1 implies all records are selected (R_S = 1). To calculate R_T, the probability of a record being selected as a target, consider the simple case of a thousand records and g = 0.1. In this scenario, I would expect 100 swaps (1000 × 0.1). This means there are 100 sources and 100 targets. Each record has a 1/1000 chance of being selected at random, and it undergoes this risk 100 times. If we think of each selection as a Bernoulli trial, then the number of times a record is selected as a target, K, is a binomial random variable with selection probability 1/N over Ng trials, with R_T = Pr(K > 0):

In general, for n trials and probability p, the count of occurrences K has the probability mass function (pmf):

\[ \Pr(K = k) = \binom{n}{k} p^{k} (1 - p)^{n - k} \]

where

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]

Owing to the memorylessness property of the garbling process, it does not matter if a record is swapped once, twice or ten times: it is subject to the same probability of error. Hence, I am only interested in whether a record is not swapped (K = 0) or swapped (K > 0). In this case, I have n = Ng, p = 1/N and K = 0:

\[ \Pr(K = 0) = \left( 1 - \frac{1}{N} \right)^{Ng} \]


At this point, I introduce the well-known limiting approximation:

\[ \lim_{a \to \infty} \left( 1 + \frac{x}{a} \right)^{a} = e^{x} \]

So that, with a = Ng and x = −g, the probability of a record never being selected as a target is:

\[ \Pr(K = 0) = \left( 1 - \frac{1}{N} \right)^{Ng} \approx e^{-g} \]

And hence the probability of a record being selected more than zero times is:

\[ \Pr(K > 0) = 1 - \Pr(K = 0) \approx 1 - e^{-g} \]

This last quantity is the estimate of the probability that a given customer record is selected “at random” as a target in a swap at least once (ie K > 0). (In effect, the Poisson distribution is used as an approximation to the binomial, with parameter λ = np = Ng/N = g and K=0.)
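The quality of this approximation can be checked numerically; the short JavaScript sketch below compares the exact binomial value of Pr(K = 0) with e^(−g) for N = 690, the size of the smallest dataset (CRX).

```javascript
// Exact binomial probability that a record is never selected as a target:
// Pr(K = 0) = (1 - 1/N)^(N*g), compared with the Poisson-style approximation e^(-g).
function prNeverTarget(N, g) {
  return Math.pow(1 - 1 / N, N * g);
}

for (const g of [0.1, 0.5, 1.0]) {
  const exact = prNeverTarget(690, g);
  const approx = Math.exp(-g);
  console.log(`g=${g}: exact=${exact.toFixed(4)}, approx=${approx.toFixed(4)}`);
}
```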

Going back to my formula for ε:

\[ \varepsilon = \gamma \left[ R_S + (1 - R_S)\, R_T \right] \]

I have the probability of being selected as a source, R_S = g. However, if the record is not selected as a source (with probability 1 − g), then there is still a chance it will be selected as a target, ie Pr(K > 0):

\[ R_T = \Pr(K > 0) \approx 1 - e^{-g} \]

Substituting back into the original yields:

\[ \varepsilon = \gamma \left[ g + (1 - g)\left( 1 - e^{-g} \right) \right] \]

This formula gives the probability that a record is in error for a given extrinsic garble rate, g, and intrinsic attribute statistic of γ.
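As a worked check of the formula, the following JavaScript sketch evaluates ε for attribute a0 (γ ≈ 0.9778, taken from Table 18 at g = 1) at g = 0.1, reproducing the predicted value of 0.1815 reported in Table 18.

```javascript
// Predicted error rate from the formula derived above:
// epsilon(g) = gamma * [ g + (1 - g) * (1 - e^(-g)) ].
function predictedErrorRate(gammaVal, g) {
  return gammaVal * (g + (1 - g) * (1 - Math.exp(-g)));
}

// Attribute a0 (gamma ≈ 0.9778) at g = 0.1 gives approximately 0.1815.
console.log(predictedErrorRate(0.9778, 0.1).toFixed(4));
```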

As a function of g, the error rate ε varies from 0 (when g = 0) to a maximum of γ (when g = 1). In the example shown below, γ = 0.7. The effect of varying γ is simply to scale the graph linearly.

Figure 21 Error Rate (ε) vs Garbling Rate (g)

In order to establish that this formula correctly describes the behaviour of the garbling process, the predicted error rates are compared with those observed during the experiments. Since there were 2450 experimental runs, it is not practical to display all of them here. Some examples of the comparison between predicted and observed are provided in this table, followed by a statistical analysis of all experiments.

 

g | a0 predicted | a0 experiment | c0 predicted | c0 experiment | g0 predicted | g0 experiment
0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
0.1 | 0.1815 | 0.1838 | 0.0780 | 0.0849 | 0.1287 | 0.1337
0.2 | 0.3374 | 0.3372 | 0.1449 | 0.1528 | 0.2392 | 0.2262
0.3 | 0.4707 | 0.4733 | 0.2022 | 0.2133 | 0.3338 | 0.3387
0.4 | 0.5845 | 0.5833 | 0.2511 | 0.2435 | 0.4145 | 0.4300
0.5 | 0.6813 | 0.6804 | 0.2926 | 0.3026 | 0.4831 | 0.4797
0.6 | 0.7631 | 0.7688 | 0.3278 | 0.3470 | 0.5411 | 0.5530
0.7 | 0.8321 | 0.8338 | 0.3574 | 0.3614 | 0.5901 | 0.5918
0.8 | 0.8899 | 0.8882 | 0.3823 | 0.3919 | 0.6310 | 0.6300
0.9 | 0.9380 | 0.9380 | 0.4029 | 0.4229 | 0.6652 | 0.6643
1.0 | 0.9778 | 0.9786 | 0.4200 | 0.4301 | 0.6934 | 0.6977

Table 18 Predicted and Observed Error Rates for Three Attributes, a0, c0 and g0

As expected, there is close agreement between the predicted number of errors and the number of error events actually observed. Note that in the last row (where g=1.0), the error rate ε reaches its maximum of γ, the garble parameter. In order to establish the validity of the previous analysis and resulting formula, all 49 attributes are considered. The analysis hinges on the use of the limiting approximation above (as a tends to infinity). In this situation, a = Ng, suggesting that the approximation is weakest when the number of customer records (N) is small or when the garbling rate, g, is close to zero. This can be seen above, where the discrepancy between the predicted and experimental value is greatest at g=0.1 and for attribute c0 (the fewest records, at 690). Below is a comparison between the predicted and experimental results, averaged, for each attribute.

Attribute  Predicted ε (average)  Observed ε (average)  Average Difference  Maximum Difference  Root Mean Square  Correlation Coefficient
a0         0.6656                 0.6658                0.0001              0.0057              0.0023            1.0000
a1         0.2933                 0.2930                0.0003              0.0047              0.0020            0.9999
a2         0.6671                 0.6804                0.0132              0.0199              0.0144            1.0000
a3         0.5512                 0.5527                0.0015              0.0049              0.0027            0.9999
a4         0.5506                 0.5520                0.0014              0.0038              0.0023            0.9999
a5         0.4523                 0.4532                0.0009              0.0000              0.0000            0.9999
a6         0.6049                 0.6078                0.0029              0.0073              0.0030            0.9999
a7         0.5002                 0.4995                0.0007              0.0062              0.0028            0.9999
a8         0.1729                 0.1732                0.0003              0.0024              0.0014            0.9999
a9         0.3022                 0.3012                0.0010              0.0045              0.0029            0.9999
a10        0.1065                 0.1060                0.0005              0.0009              0.0005            0.9999
a11        0.0626                 0.0625                0.0001              0.0012              0.0007            0.9996
a12        0.5178                 0.5163                0.0015              0.0055              0.0029            0.9999
a13        0.1043                 0.1048                0.0004              0.0019              0.0009            0.9997
c0         0.2859                 0.2910                0.0051              0.0200              0.0117            0.9981
c1         0.6668                 0.6916                0.0248              0.0470              0.0284            0.9996
c2         0.6646                 0.6877                0.0230              0.0287              0.0207            0.9996
c3         0.2487                 0.2741                0.0254              0.0436              0.0271            0.9963
c4         0.2487                 0.2697                0.0211              0.0360              0.0268            0.9991
c5         0.6100                 0.6404                0.0305              0.0514              0.0213            0.9991
c6         0.4056                 0.4275                0.0219              0.0289              0.0203            0.9989
c7         0.6560                 0.6651                0.0090              0.0364              0.0155            0.9991
c8         0.3396                 0.3522                0.0125              0.0250              0.0134            0.9975
c9         0.3332                 0.3522                0.0189              0.0339              0.0210            0.9976
c10        0.4434                 0.4591                0.0157              0.0306              0.0172            0.9992
c11        0.3380                 0.3617                0.0238              0.0227              0.0158            0.9989
c12        0.1175                 0.1342                0.0167              0.0218              0.0177            0.9975
c13        0.6355                 0.6622                0.0267              0.0446              0.0256            0.9994
c14        0.5487                 0.5688                0.0202              0.0430              0.0190            0.9989
g0         0.4720                 0.4688                0.0032              0.0155              0.0080            0.9991
g1         0.6092                 0.5994                0.0098              0.0093              0.0058            0.9997
g2         0.4231                 0.4264                0.0033              0.0147              0.0088            0.9986
g3         0.5519                 0.5594                0.0075              0.0156              0.0099            0.9995
g4         0.6671                 0.6845                0.0173              0.0281              0.0173            0.9997
g5         0.3989                 0.4061                0.0072              0.0111              0.0070            0.9996
g6         0.5156                 0.5158                0.0002              0.0135              0.0078            0.9994
g7         0.4608                 0.4591                0.0018              0.0099              0.0060            0.9995
g8         0.4034                 0.4011                0.0023              0.0105              0.0060            0.9994
g9         0.1177                 0.1178                0.0001              0.0049              0.0026            0.9986
g10        0.4734                 0.4659                0.0076              0.0077              0.0041            0.9996
g11        0.4988                 0.4967                0.0021              0.0074              0.0045            0.9997
g12        0.6595                 0.6574                0.0021              0.0101              0.0050            0.9998
g13        0.2150                 0.2134                0.0016              0.0087              0.0053            0.9985
g14        0.3049                 0.3097                0.0048              0.0113              0.0068            0.9989
g15        0.3319                 0.3317                0.0002              0.0088              0.0050            0.9993
g16        0.3681                 0.3654                0.0027              0.0093              0.0041            0.9998
g17        0.1783                 0.1722                0.0062              0.0075              0.0042            0.9995
g18        0.3278                 0.3313                0.0035              0.0139              0.0066            0.9987
g19        0.0485                 0.0488                0.0003              0.0034              0.0017            0.9961

Table 19 Comparing Predicted and Observed Error Rates

Here, the average error rates across all levels of g are shown in the second and third columns. The fourth column is the difference between these two figures. In absolute terms, the difference in averages ranges from 0.01% (a11) to 2.02% (c14). However, comparing averages doesn't tell the whole story. To understand what's happening between the predicted and observed values at each level of g, I can use the RMS (root mean square) difference measure. Commonly used for such comparisons, this involves squaring the differences, taking the mean of those squares and then taking the square root. This is a better measure, since it takes into account differences at the smaller values of g rather than rolling them up as averages. Again, only small differences are found between the predicted and observed values (ranging from 0.0000 up to 0.0284 with a mean of 0.0095), suggesting a close fit between the two. The last column shows the pairwise Pearson correlation coefficient for each attribute, indicating a very strong correlation between predicted and observed values. (The correlations were all highly significant, at the 10⁻⁹ level or better.)
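For reference, the RMS difference and Pearson correlation used in this comparison can be computed in a few lines of Python. The example below uses the a0 column of Table 18; it is a sketch of the calculation rather than the exact script used in the experiments.

```python
import math

def rms_difference(predicted, observed):
    """Root-mean-square of the pairwise differences across the levels of g."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted))

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external libraries)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Worked example using the a0 column of Table 18.
predicted = [0.0000, 0.1815, 0.3374, 0.4707, 0.5845, 0.6813, 0.7631, 0.8321, 0.8899, 0.9380, 0.9778]
observed  = [0.0000, 0.1838, 0.3372, 0.4733, 0.5833, 0.6804, 0.7688, 0.8338, 0.8882, 0.9380, 0.9786]

print(round(rms_difference(predicted, observed), 4))  # small, consistent with Table 19
print(round(pearson(predicted, observed), 4))         # very close to 1.0
```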

Lastly, to check the “worst case”, the maximum difference is reported for each attribute. The biggest gap (0.0514) in all 490 cases occurs for attribute c5 (at g=0.2, specifically), where the predicted value is 0.3091 and the observed value is 0.3606. Note that, as suggested by the limiting approximation, this occurs for the dataset with the fewest customer records and a small value of g.

This comparison of the predicted and observed error rates shows that, on average, the formula derived from mathematical analysis is a very close approximation, with an expected RMS discrepancy less than 1% and an expected correlation of 0.9992. Furthermore, the “worst case” check provides confidence that the experimental procedure was conducted correctly.

6.5.1.1    Relationship To Fidelity

This section has shown that the error rate associated with an attribute subject to garbling noise can be estimated mathematically using a simple formula. This formula, derived above from first principles, relies on two quantities: g, which is the garbling parameter of the noise process and γ which is an intrinsic statistical property of the attribute.

The theoretical framework developed in Chapter 5 proposed the use of the fidelity metric to quantify the effect of noise on an attribute. This role is taken by g and γ in the experiments, since g can be continuously varied (controlled) to produce the desired level of errors. Their relationship with the more general metric is illustrated through Fano's Inequality (Cover and Thomas 2005), which links the error rate with the equivocation for a noisy channel. Recall the definition of fidelity for the external world state W and IS state X:

\mathrm{Fidelity}(W; X) = 1 - \frac{H(W \mid X)}{H(W)}
We see that the fidelity improves as the equivocation, H(W|X), decreases. When H(W|X) = 0 the fidelity is maximised at 100%. Fano's Inequality bounds the equivocation for a given error rate ε:

H(W \mid X) \le H(\varepsilon) + \varepsilon \log_2(N - 1)

where N is the number of states in W and H(ε) is the binary entropy function of ε:

H(\varepsilon) = -\varepsilon \log_2 \varepsilon - (1 - \varepsilon)\log_2(1 - \varepsilon)
As ε is a function solely of g and γ, and H(W) and log₂(N−1) are constant, an increase in fidelity must result from a decrease in either g or γ. For a given attribute, γ is fixed, so a change in g yields an opposite change in fidelity. In this way, the garbling noise process constrains the fidelity, and it can be described as a non-linear function of g. (Note that this represents a lower limit on the fidelity, as it is only constrained by the inequality.)

Figure 22 Effect of Garbling Rate on Fidelity

Figure 22 above shows the effect of varying the garbling rate, g, from 0% to 100%. The broken line shows the error rate ε reaching its maximum at 40% (as the value of γ in this example is 0.4). The unbroken line shows the fidelity falling from 100% (when g=0%) to 2% (when g=100%). The value for H(W) is 3 and the value for N is 33. This illustrates that fidelity decreases non-linearly as the garbling rate, g, and hence the error rate, ε, increase.
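A curve of this kind can be sketched numerically, as below. This sketch assumes the reconstructed definition of fidelity as 1 − H(W|X)/H(W) and evaluates only the bound implied by Fano's Inequality, so the output indicates the shape of the relationship rather than reproducing the exact values plotted in Figure 22.

```python
import math

def error_rate(g, gamma):
    """epsilon(g) from the garbling analysis above."""
    return gamma * (g + (1 - g) * (1 - math.exp(-g)))

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fidelity_lower_bound(g, gamma, h_w, n_states):
    """Lower bound on fidelity implied by Fano's Inequality, under the
    assumed definition fidelity = 1 - H(W|X)/H(W)."""
    eps = error_rate(g, gamma)
    equivocation_bound = binary_entropy(eps) + eps * math.log2(n_states - 1)
    return max(0.0, 1 - equivocation_bound / h_w)

# Parameters quoted for Figure 22: gamma = 0.4, H(W) = 3 bits, N = 33 states.
for g in [0.0, 0.1, 0.25, 0.5, 1.0]:
    print(f"g={g:.2f}  eps={error_rate(g, 0.4):.3f}  "
          f"fidelity lower bound={fidelity_lower_bound(g, 0.4, 3.0, 33):.1%}")
```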

In general, I can expect different kinds of noise processes to impact on fidelity in different ways. While the garbling noise process used here is amenable to this kind of closed-form algebraic analysis, other types may require a numerical estimation approach.

However, I can always compute γ for a given attribute and estimate the error rate, ε, by direct observation. In such cases, I can use the formula to derive an "effective garbling rate", geff. For example, suppose there are two attributes, W1 and W2, with observed error rates of ε1=0.05 and ε2=0.1, respectively. Further, their garble parameters are measured at γ1=0.40 and γ2=0.7. Their garbling rates can be read from the above charts as follows. For W1 I use Figure 22 (where γ=0.40) and see that ε=0.05 corresponds to geff=0.06. For W2 I use Figure 21 (where γ=0.7) and read off a value of geff=0.07 for ε2=0.1. This illustrates a situation where the underlying effective garbling rates are almost the same, but the error rate is twice as bad for the second attribute because its γ value is so much larger.
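Because ε is monotonically increasing in g for a fixed γ, the effective garbling rate can also be obtained numerically rather than read off a chart. A minimal sketch using bisection on the formula derived above:

```python
import math

def error_rate(g, gamma):
    return gamma * (g + (1 - g) * (1 - math.exp(-g)))

def effective_garble_rate(eps_observed, gamma, tol=1e-6):
    """Invert eps(g) by bisection on g in [0, 1]; eps is monotonically
    increasing in g, so a simple bisection suffices."""
    if eps_observed >= gamma:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if error_rate(mid, gamma) < eps_observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The two worked examples from the text.
print(round(effective_garble_rate(0.05, 0.40), 3))  # W1: about 0.066 (read as ~0.06 from the chart)
print(round(effective_garble_rate(0.10, 0.70), 3))  # W2: about 0.076 (read as ~0.07 from the chart)
```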

The interpretation of the effective garbling rate is that it quantifies the number of customer records impacted by a quality deficiency, as distinct from the ones actually in error. When detecting and correcting impacted records is expensive, understanding the likelihood that an individual impacted record will translate into an error is useful for comparing competing attributes.

6.5.2        Effects on Mistakes

The next metric to define is dubbed alpha (α), which describes the actionability of an attribute. This is the probability that an error on that attribute will result in a mistake. Recall that a mistake is a misclassification or "incorrect decision" when compared with the relevant baseline decision set. This metric is in the range of 0 to 1, where 0 means that no change to that attribute will change the decision, whereas 1 means every single change in attribute value will change the decision.
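The experimental estimate of α amounts to perturbing one attribute, re-running the decision function, and counting how often an error becomes a mistake. The sketch below illustrates the estimator with a hypothetical rule-based decision function; it is not the experimental pipeline itself, and the toy records and decision rule are invented for illustration.

```python
import random

def estimate_actionability(records, attribute, decide, trials=5, seed=0):
    """Sketch of the experimental estimate of alpha for one attribute: perturb
    the attribute by swapping values between randomly chosen records, then
    count how often a changed value (an error) also changes the decision (a
    mistake). `decide` is any callable mapping a record dict to a label."""
    rng = random.Random(seed)
    errors = mistakes = 0
    for _ in range(trials):
        noisy = [dict(r) for r in records]
        for i in range(len(noisy)):                 # full sweep: g = 1
            j = rng.randrange(len(noisy))
            noisy[i][attribute], noisy[j][attribute] = noisy[j][attribute], noisy[i][attribute]
        for original, perturbed in zip(records, noisy):
            if perturbed[attribute] != original[attribute]:
                errors += 1
                if decide(perturbed) != decide(original):
                    mistakes += 1
    return mistakes / errors if errors else 0.0

# Toy illustration with a hypothetical rule-based decision function.
records = [{"income": i % 3, "region": i % 5} for i in range(1000)]
decide = lambda r: "approve" if r["income"] >= 2 else "decline"
print(round(estimate_actionability(records, "income", decide), 2))  # around 0.67: income drives the decision
print(round(estimate_actionability(records, "region", decide), 2))  # 0.0: region never affects the decision
```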

Attribute  ID3   AD    NB    BNet  LMT   Average
a0         0%    4%    3%    8%    5%    4%
a1         0%    0%    4%    4%    4%    2%
a2         0%    0%    0%    0%    1%    0%
a3         0%    0%    7%    7%    2%    3%
a4         0%    16%   7%    6%    13%   9%
a5         1%    16%   18%   15%   16%   13%
a6         0%    4%    5%    7%    7%    5%
a7         0%    0%    5%    14%   4%    5%
a8         0%    0%    2%    5%    2%    2%
a9         0%    0%    1%    8%    1%    2%
a10        54%   39%   27%   14%   35%   34%
a11        44%   17%   16%   16%   21%   23%
a12        0%    2%    3%    8%    5%    3%
a13        0%    0%    5%    6%    7%    4%
c0         0%    0%    2%    0%    0%    0%
c1         0%    0%    1%    1%    4%    1%
c2         1%    1%    7%    2%    4%    3%
c3         1%    6%    7%    3%    5%    4%
c4         0%    0%    2%    2%    0%    1%
c5         0%    1%    2%    4%    5%    3%
c6         0%    0%    2%    4%    4%    2%
c7         0%    6%    1%    4%    2%    3%
c8         98%   59%   64%   21%   54%   59%
c9         0%    14%   4%    8%    7%    7%
c10        0%    0%    2%    8%    10%   4%
c11        0%    0%    7%    0%    0%    1%
c12        0%    0%    14%   4%    5%    5%
c13        0%    3%    2%    5%    10%   4%
c14        0%    4%    6%    4%    4%    4%
g0         44%   23%   18%   21%   22%   26%
g1         4%    17%   9%    8%    12%   10%
g2         20%   16%   11%   14%   13%   15%
g3         0%    13%   8%    9%    9%    8%
g4         2%    0%    4%    5%    12%   5%
g5         5%    19%   28%   12%   10%   15%
g6         0%    0%    6%    7%    6%    4%
g7         1%    0%    0%    0%    5%    1%
g8         0%    0%    6%    5%    9%    4%
g9         0%    0%    8%    10%   9%    6%
g10        1%    0%    0%    0%    0%    0%
g11        2%    3%    6%    8%    7%    5%
g12        0%    0%    0%    0%    10%   2%
g13        0%    1%    8%    10%   12%   6%
g14        0%    0%    11%   11%   7%    6%
g15        0%    0%    1%    0%    5%    1%
g16        0%    0%    3%    3%    1%    1%
g17        0%    0%    0%    0%    1%    0%
g18        0%    0%    1%    3%    7%    2%
g19        0%    0%    7%    15%   10%   6%

Table 20 Actionability (α) by Attribute and Decision Function

Values for α range from 0 (errors lead to no mistakes) to a maximum of 98% (for c8 using ID3). This means that an error on attribute c8 will, in 98% of cases, result in a changed decision. Upon inspection of the α values for other attributes in the CRX dataset using ID3, I can see they are nearly all zero. This indicates that, in this case, ID3 relies almost entirely on c8 to make its decision. Since c8 is a binary-valued attribute, it is not surprising that almost any change in the attribute will change the decision.

Other algorithms are not so heavily reliant on just one attribute; Naïve Bayes (NB) and the Logistic Model Tree (LMT), for example, draw on more attributes as can be seen by their higher α values across the range of attributes. Despite these variations, there is broad agreement between the algorithms about which attributes have highest values of α. In general, each dataset has one or two dominating attributes and a few irrelevant (or inconsequential) attributes, regardless of the specific algorithm used. It is to be expected that there would be few irrelevant attributes included in the dataset: people would not go to the expense of sourcing, storing and analysing attributes that had no bearing on the decision task at hand. In this way, only candidate attributes with a reasonable prospect of being helpful find their way into the datasets.

The similarity between observed α values for attributes across different decision functions is not a coincidence. There are underlying patterns in the data that are discovered and exploited by the algorithms that generate these decision functions. There are patterns in the data because these data reflect real-world socio-demographic and socio-economic phenomena.

So the ultimate source of these patterns lies in the external social world: high-income people, for instance, tend to be older or more highly-educated or live in certain post codes. At the level of the real, there are generative mechanisms being triggered resulting in observable customer events (the actual). These events are encoded as customer data in the system. The algorithms then operate on these data to produce rules (decision functions) that replicate the effect of the generative mechanism in the external social world. However, the generative mechanism for the system bears no resemblance to the external social world, as it is an artefact composed of silicon, software and formulae operating according to the laws of natural science.

As long as the system’s generative mechanism (hardware and software) operates correctly, any sufficiently “good” learning algorithm will detect these patterns and derive rules that can use them. In other words, the capacity of a system to detect and implement the underlying patterns (thereby replicating events in the external social world) is constrained by the properties of the patterns themselves, not the mechanisms of the system.

The theoretical framework developed in Chapter 5 proposes a quantitative measure of the relation between error events and mistakes for an attribute used in a decision task: influence. Recall that this entropy-based measure was defined as the normalised mutual information between the attribute X and the decision Y:

\mathrm{Influence}(X; Y) = \frac{I(X; Y)}{H(X)} = \frac{H(Y) - H(Y \mid X)}{H(X)}
Informally, influence can be described in terms of changes to decision uncertainty: before a decision is made, there is a particular amount of uncertainty about the final decision, given by H(Y). Afterwards, there is 0 uncertainty (a particular decision is definitively selected). However, in between, suppose just one attribute has its value revealed. In that case some uncertainty about the final decision is removed. The exact amount depends on which value of the attribute arises, but it can be averaged over all possible values. Hence, each attribute will have its own influence score on each particular decision task.

This quantity has a convenient and intuitive interpretation in the context of machine learning, predictive analytics and data mining: information gain ratio (Kononenko and Bratko 1991). The information gain is the incremental amount of uncertainty about the classification Y removed upon finding that an attribute X takes a particular value, X=x. Formally, it is the Kullback-Leibler divergence between the posterior distribution p(Y|X=x) and the prior distribution p(Y):

IG(Y; X = x) = D_{KL}\big(p(Y \mid X = x) \,\|\, p(Y)\big)

If I take the expected value over all possible values of x, I have:

IG(Y; X) = \sum_{x} p(x)\, D_{KL}\big(p(Y \mid X = x) \,\|\, p(Y)\big) = I(X; Y) = H(Y) - H(Y \mid X)
This quantity is used frequently in data mining and machine learning for selecting subsets of attributes (Yao et al. 1999) and for performance evaluation (Kononenko and Bratko 1991). In practice, the related quantity of the information gain ratio is used, where the information gain is divided by the intrinsic amount of information in the attribute, ie H(X). This is done to prevent very high gain scores being assigned to an attribute that takes on a large number of values. For example, compared with gender, a customer's credit card number will uniquely identify them (and hence tell you precisely what the decision will be). However, a credit card number carries approximately 16 × log₂10 ≈ 53 bits, whereas gender carries approximately 1 bit. The information gain ratio will scale accordingly.
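The entropy, information gain and information gain ratio calculations used throughout this section can be expressed compactly. The following sketch estimates them from a paired sample of attribute values and decisions; the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y; X) = H(Y) - H(Y | X), estimated from paired samples."""
    n = len(xs)
    h_y_given_x = 0.0
    for value in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    return entropy(ys) - h_y_given_x

def information_gain_ratio(xs, ys):
    """IGR = IG(Y; X) / H(X); the normalisation guards against attributes
    that take very many distinct values."""
    h_x = entropy(xs)
    return information_gain(xs, ys) / h_x if h_x > 0 else 0.0

# Hypothetical sample: decisions follow x1 exactly but are independent of x2.
x1 = ["lo", "lo", "hi", "hi", "hi", "lo", "hi", "lo"]
x2 = ["a", "b", "a", "a", "b", "a", "b", "b"]
y  = ["decline", "decline", "approve", "approve", "approve", "decline", "approve", "decline"]

print(round(information_gain_ratio(x1, y), 3))  # 1.0: x1 fully determines the decision
print(round(information_gain_ratio(x2, y), 3))  # 0.0: x2 carries no information about the decision
```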

This discussion provides the motivation for examining how the observed actionability of each attribute, as arising during the experiments, aligns with the entropy-based measure of influence. In general, I would expect influential attributes to be prone to actionable errors. Conversely, an attribute with a low influence score should have low actionability (so that few errors lead to mistakes).

Before looking into this, it’s worth considering the importance of finding such a relationship. From a theoretical perspective, it would mean that the underlying generative mechanisms (in the domain of the real) that give rise to customer behaviours and events (at the actual) are being replicated, in some sense, within the information system. That is, the algorithms that construct the decision functions are picking up on and exploiting these persistent patterns while the system itself is operating correctly and implementing these functions. The degree of agreement between the two measures indicates the success of the system in “mirroring” what is happening in the external world.

At a practical level, if the influence score is an adequate substitute or proxy for actionability, then the question of how to design and appraise IQ improvement interventions becomes more tractable. The whole effort of generating some noise process, applying it to each attribute, comparing the output to the benchmark and then re-running it multiple times is avoided. In general, I might expect such experiments in real-world organisational settings to be time-consuming, fraught with error and disruptive to normal operations.

Perhaps more significantly, the experimental approach requires an existing decision function to be in place and (repeatedly) accessible. Using the influence score, only the ideal decision values are required, meaning that the analysis could proceed before a system exists. This could be very useful during situations such as project planning. Further, it could be used when access to the decision function is not possible, as when it is proprietary or subject to other legal constraints. This is further explored during Chapter 7.

The analysis proceeds by examining the information gain for each attribute across the three decision tasks (for a total of 49 attributes) and five decision functions. The seventh column shows the average information gain for the five decision functions. The last column, Z, shows the “true information gain”. This is calculated by using the correct external world decision (ie training data) rather than the outputs of any decision function.

Attribute  ID3     AD      NB      BNet    LMT     Average  Z
a0         0.0184  0.0667  0.0770  0.1307  0.0737  0.0733   0.0784
a1         0.0073  0.0225  0.0326  0.0451  0.0314  0.0278   0.0186
a2         0.0003  0.0005  0.0008  0.0013  0.0006  0.0007   0.0004
a3         0.0349  0.1956  0.1905  0.1446  0.1683  0.1468   0.0884
a4         0.0178  0.1810  0.1439  0.0833  0.1367  0.1126   0.0415
a5         0.0271  0.1664  0.2058  0.3359  0.1664  0.1803   0.1599
a6         0.0204  0.1160  0.1206  0.1337  0.1189  0.1019   0.0683
a7         0.0288  0.1674  0.2086  0.3612  0.1699  0.1872   0.1693
a8         0.0012  0.0082  0.0115  0.0214  0.0074  0.0099   0.0088
a9         0.0056  0.0350  0.0390  0.1045  0.0318  0.0432   0.0333
a10        0.1779  0.1206  0.0847  0.0565  0.1010  0.1081   0.0813
a11        0.0639  0.0262  0.0205  0.0138  0.0234  0.0296   0.0209
a12        0.0137  0.0392  0.0533  0.0916  0.0516  0.0499   0.0428
a13        0.0040  0.0091  0.0133  0.0151  0.0139  0.0111   0.0102
c0         0.0003  0.0017  0.0026  0.0047  0.0003  0.0019   0.0004
c1         0.0287  0.0358  0.0257  0.0283  0.0204  0.0278   0.0211
c2         0.0412  0.0472  0.0525  0.0641  0.0462  0.0502   0.0394
c3         0.0177  0.0311  0.0376  0.0378  0.0305  0.0309   0.0296
c4         0.0177  0.0311  0.0376  0.0378  0.0305  0.0309   0.0296
c5         0.0813  0.0944  0.1079  0.1442  0.1128  0.1081   0.1092
c6         0.0558  0.0513  0.0492  0.0702  0.0460  0.0545   0.0502
c7         0.1103  0.1771  0.1170  0.1428  0.1101  0.1314   0.1100
c8         0.9583  0.6159  0.5133  0.4979  0.4404  0.6052   0.4257
c9         0.1428  0.2615  0.1998  0.3151  0.1785  0.2195   0.1563
c10        0.1959  0.2729  0.2049  0.3207  0.2023  0.2393   0.2423
c11        0.0057  0.0036  0.0021  0.0023  0.0003  0.0028   0.0007
c12        0.0180  0.0264  0.0150  0.0395  0.0125  0.0223   0.0100
c13        0.0099  0.0293  0.0149  0.0204  0.0156  0.0180   0.2909
c14        0.0004  0.1198  0.1203  0.1439  0.1084  0.0985   0.1102
g0         0.3803  0.2227  0.1367  0.2068  0.1748  0.2243   0.0947
g1         0.0072  0.0964  0.0494  0.0842  0.0369  0.0548   0.0140
g2         0.1155  0.0944  0.0840  0.1052  0.0758  0.0950   0.0436
g3         0.0184  0.0485  0.0454  0.0691  0.0307  0.0424   0.0249
g4         0.0206  0.0257  0.0338  0.0648  0.0264  0.0343   0.0187
g5         0.0224  0.0866  0.0525  0.0577  0.0471  0.0532   0.0281
g6         0.0063  0.0083  0.0229  0.0386  0.0119  0.0176   0.0131
g7         0.0000  0.0026  0.0001  0.0000  0.0000  0.0005   0.0030
g8         0.0018  0.0034  0.0209  0.0182  0.0187  0.0126   0.0068
g9         0.0010  0.0042  0.0060  0.0105  0.0058  0.0055   0.0048
g10        0.0008  0.0000  0.0000  0.0063  0.0001  0.0015   0.0000
g11        0.0108  0.0195  0.0720  0.0943  0.0230  0.0439   0.0170
g12        0.0043  0.0062  0.0022  0.0048  0.0187  0.0072   0.0107
g13        0.0014  0.0050  0.0132  0.0195  0.0090  0.0096   0.0089
g14        0.0104  0.0070  0.0515  0.0737  0.0204  0.0326   0.0128
g15        0.0039  0.0008  0.0020  0.0005  0.0003  0.0015   0.0015
g16        0.0023  0.0052  0.0171  0.0259  0.0019  0.0105   0.0013
g17        0.0007  0.0002  0.0014  0.0002  0.0000  0.0005   0.0000
g18        0.0008  0.0000  0.0013  0.0015  0.0062  0.0020   0.0010
g19        0.0010  0.0013  0.0050  0.0083  0.0050  0.0041   0.0058

Table 21 Information Gains by Attribute and Decision Function

As expected, there is broad agreement between the different decision functions as to how much information can be extracted from each attribute. The gain ranges from effectively zero (eg a2) through to 0.95 (eg c8 with ID3). The attributes with high gains also show some differences in how the decision functions are able to exploit information: a3, c8 and g0, for example, show considerable variation from the average.

When comparing the "true information gain" with the average for the five decision functions, there is also broad agreement, with a Pearson correlation coefficient of ρ=0.8483 (highly significant, at the 10⁻¹⁴ level or better). This suggests that, by and large, the decision functions are effective at detecting and using all the available or "latent" information in each attribute. However, some notable exceptions are a4, c13 and g0. It may be that other algorithms for building decision functions could better tap into this information.

Now I can examine how well the information gain works as a proxy or substitute for actionability, α. To do this, the Pearson correlation coefficient, ρ, is used to gauge how closely the two measures track each other. All results are significant at <0.01 unless otherwise reported.

ρ    ID3     AD      NB      BNet    LMT     Average  Z
a    0.8898  0.3212  0.2983  0.5233  0.1799  0.2247   0.2006
c    0.9704  0.8938  0.8202  0.9191  0.8773  0.9057   0.7439
g    0.9847  0.9215  0.6707  0.7510  0.8080  0.9266   0.9320
ALL  0.8698  0.7367  0.6724  0.5817  0.6414  0.7459   0.5678

Table 22 Correlation between Information Gain and Actionability, by Dataset and Decision Function

Note that here the average column contains the correlations between the average information gain and the average actionability (where the averaging is done over all five decision functions). It is not the average of the correlations. As before, Z refers to the “true information gain” when using the correct decisions in the dataset as the benchmark. The rows describe the datasets (Adult, CRX and German, respectively) and ALL describes the correlation when all 49 attributes are considered collectively.

The correlation coefficients (ρ) range from 0.18 to 0.98, averaging around 0.70. This constitutes a moderate to strong positive correlation, but there is significant variability. Using information gain instead of actionability in the case of the Adult dataset would, regardless of decision function, result in prioritising different attributes. In the German dataset, the deterioration would be much less pronounced.

Based on the widespread use of the information gain ratio (IGR) in practice, this measure was evaluated in an identical fashion, to see if it would make a better proxy. As explained above, it is computed by dividing the information gain by the amount of information in the attribute, H(X). The following table was obtained:

Attribute  H(X)   ID3     AD      NB      BNet    LMT     Average  Z
a0         5.556  0.33%   1.20%   1.39%   2.35%   1.33%   1.32%    1.41%
a1         1.392  0.53%   1.61%   2.34%   3.24%   2.25%   1.99%    1.34%
a2         5.644  0.01%   0.01%   0.01%   0.02%   0.01%   0.01%    0.01%
a3         2.928  1.19%   6.68%   6.51%   4.94%   5.75%   5.01%    3.02%
a4         2.867  0.62%   6.31%   5.02%   2.91%   4.77%   3.93%    1.45%
a5         1.852  1.46%   8.98%   11.11%  18.14%  8.99%   9.74%    8.63%
a6         3.360  0.61%   3.45%   3.59%   3.98%   3.54%   3.03%    2.03%
a7         2.161  1.33%   7.75%   9.65%   16.71%  7.86%   8.66%    7.83%
a8         0.783  0.15%   1.04%   1.47%   2.73%   0.94%   1.27%    1.12%
a9         0.918  0.61%   3.81%   4.25%   11.39%  3.47%   4.71%    3.63%
a10        0.685  25.99%  17.62%  12.37%  8.25%   14.75%  15.80%   11.88%
a11        0.480  13.31%  5.45%   4.28%   2.87%   4.88%   6.16%    4.36%
a12        3.269  0.42%   1.20%   1.63%   2.80%   1.58%   1.53%    1.31%
a13        0.761  0.52%   1.19%   1.74%   1.98%   1.83%   1.45%    1.33%
c0         0.881  0.04%   0.19%   0.29%   0.54%   0.03%   0.22%    0.05%
c1         5.627  0.51%   0.64%   0.46%   0.50%   0.36%   0.49%    0.38%
c2         5.505  0.75%   0.86%   0.95%   1.16%   0.84%   0.91%    0.72%
c3         0.816  2.17%   3.81%   4.60%   4.63%   3.73%   3.79%    3.63%
c4         0.816  2.17%   3.81%   4.60%   4.63%   3.73%   3.79%    3.63%
c5         3.496  2.32%   2.70%   3.09%   4.12%   3.23%   3.09%    3.12%
c6         1.789  3.12%   2.87%   2.75%   3.92%   2.57%   3.05%    2.80%
c7         5.133  2.15%   3.45%   2.28%   2.78%   2.14%   2.56%    2.14%
c8         0.998  95.98%  61.69%  51.41%  49.86%  44.11%  60.61%   42.64%
c9         0.985  14.50%  26.55%  20.28%  32.00%  18.12%  22.29%   15.87%
c10        2.527  7.75%   10.80%  8.11%   12.69%  8.00%   9.47%    9.59%
c11        0.995  0.57%   0.36%   0.21%   0.23%   0.03%   0.28%    0.07%
c12        0.501  3.60%   5.28%   2.99%   7.89%   2.49%   4.45%    2.00%
c13        4.744  0.21%   0.62%   0.31%   0.43%   0.33%   0.38%    6.13%
c14        3.911  0.01%   3.06%   3.08%   3.68%   2.77%   2.52%    2.82%
g0         1.802  21.10%  12.36%  7.59%   11.48%  9.70%   12.44%   5.26%
g1         3.726  0.19%   2.59%   1.32%   2.26%   0.99%   1.47%    0.38%
g2         1.712  6.75%   5.51%   4.91%   6.14%   4.43%   5.55%    2.55%
g3         2.667  0.69%   1.82%   1.70%   2.59%   1.15%   1.59%    0.93%
g4         5.643  0.36%   0.46%   0.60%   1.15%   0.47%   0.61%    0.33%
g5         1.688  1.33%   5.13%   3.11%   3.42%   2.79%   3.15%    1.67%
g6         2.155  0.29%   0.38%   1.06%   1.79%   0.55%   0.82%    0.61%
g7         1.809  0.00%   0.14%   0.00%   0.00%   0.00%   0.03%    0.16%
g8         1.532  0.12%   0.22%   1.36%   1.19%   1.22%   0.82%    0.44%
g9         0.538  0.18%   0.78%   1.11%   1.95%   1.07%   1.02%    0.89%
g10        1.842  0.04%   0.00%   0.00%   0.34%   0.01%   0.08%    0.00%
g11        1.948  0.55%   1.00%   3.70%   4.84%   1.18%   2.25%    0.87%
g12        5.226  0.08%   0.12%   0.04%   0.09%   0.36%   0.14%    0.20%
g13        0.845  0.17%   0.59%   1.56%   2.31%   1.06%   1.14%    1.05%
g14        1.139  0.91%   0.61%   4.52%   6.47%   1.80%   2.86%    1.12%
g15        1.135  0.35%   0.07%   0.18%   0.04%   0.03%   0.13%    0.13%
g16        1.413  0.16%   0.37%   1.21%   1.83%   0.13%   0.74%    0.09%
g17        0.622  0.11%   0.04%   0.23%   0.04%   0.01%   0.08%    0.00%
g18        0.973  0.08%   0.00%   0.13%   0.15%   0.63%   0.20%    0.10%
g19        0.228  0.44%   0.57%   2.19%   3.62%   2.21%   1.81%    2.55%

Table 23 Information Gain Ratio by Attribute and Decision Function

As suggested by its name, the information gain ratio is expressed as a percentage. A value of 100% for a particular attribute implies that that attribute wholly governs the operation of the decision function. In other words, once the value of that one attribute is known, there is no longer any uncertainty about the decision. Since each row in this table is simply the values of the previous table divided through by a constant, H(X) (second column), there is the same broad agreement between decision functions and with the "true information gain ratio", labelled Z.

From these data, the correlation table was produced, showing the direction and degree of agreement between the information gain ratio and the actionability for each attribute:

ρ    ID3     AD      NB      BNet    LMT     Average  Z
a    0.9704  0.8379  0.7885  0.6558  0.7802  0.8181   0.7558
c    0.9875  0.9620  0.9040  0.9268  0.9291  0.9576   0.9506
g    0.9856  0.8712  0.7149  0.8843  0.7906  0.9274   0.8992
ALL  0.9138  0.8298  0.8194  0.5871  0.8049  0.8524   0.8110

Table 24 Correlation between Information Gain Ratio and Actionability, by Dataset and Decision Function

All results are statistically significant at <0.001. The correlation coefficients range from 0.59 (looking at all 49 attributes at once when using the BNet decision function) up to 0.99 (using ID3 on the GERMAN dataset). This is a much more robust range than for information gain. Of the 28 cells in the table (each corresponding to a different slice of the data, for evaluation purposes), 12 have a correlation coefficient >0.90 while only three have a value <0.75.

This analysis indicates that the information gain ratio is a better substitute or proxy for actionability than the un-normalised information gain. The interpretation is that errors in attributes with a high information gain ratio are more likely to result in mistakes than attributes with a low information gain ratio. The reason is that an error in an attribute with a large amount of entropy H(X) (such as a continuous-valued attribute) is likely to affect only a small proportion of cases and hence is less likely to be exploited by an algorithm when building the decision model.

When ranking attributes, the use of the IGR is particularly effective at screening out irrelevant (low-gain) attributes and prioritising high-gain ones. The table below shows, for each dataset, the IGR rank (and, for comparison, the IG rank) of each attribute when the attributes are ordered by actionability, α.


 

Rank        IGR (a)  IGR (c)  IGR (g)  IG (a)  IG (c)  IG (g)
1           1        1        1        5       1       1
2           4        2        3        10      3       4
3           2        4        2        2       12      2
4           7        5        8        4       9       3
5           8        3        7        6       2       6
6           3        13       6        1       13      15
7           12       10       9        7       6       12
8           11       11       4        12      8       8
9           10       9        10       8       4       14
10          5        7        5        3       5       5
11          9        8        14       11      7       7
12          6        14       11       9       14      10
13          13       12       12       13      11      9
14          14       6        16       14      10      13
15          -        15       15       -       15      16
16          -        -        13       -       -       11
17          -        -        20       -       -       19
18          -        -        17       -       -       17
19          -        -        18       -       -       20
20          -        -        19       -       -       18
Spearman ρ  0.71     0.71     0.92     0.60    0.53    0.82

Table 25 Rankings Comparison

If we look at the top and bottom ranked attributes for each dataset, we see that IGR correctly identifies the most and least actionable attributes for ADULT and CRX. For GERMAN, IGR picks up the most actionable and places the second-least actionable last. In this sense, IGR out-performs the un-normalised information gain, IG.

The last row shows the Spearman rank correlation coefficient for the three datasets. This is a non-parametric statistic that measures how closely the attributes are ranked when using actionability as compared with IGR (or IG). All results are significant at <0.05. These results indicate a strong relationship and show that, in each case, IGR outperforms IG. Note that in raw terms this statistic is a little misleading in this context, since it "penalises" a mis-ranking at the top end the same as a mis-ranking in the middle. That is, mixing up the 7th- and 11th-ranked attributes is penalised as heavily as mixing up the 1st- and 5th-ranked, even though in a business context the second situation is likely to be worse. Worse still, rankings are highly sensitive to slight variations; the 11th through 15th ranked attributes may differ by as little as 1%, distorting the effect of mis-ranking.

A better way to visualise how well IGR substitutes for actionability is the "percent cumulative actionability capture" graph, below. The idea is that for a given scenario, there is a total amount of actionability available for "capture" (obtained by summing the actionability scores, α, for each attribute). Obviously, using the actionability scores themselves will give the best results, ie selecting the attributes in the sequence that yields the greatest actionability.

But actionability isn’t evenly distributed amongst the attributes: The top one or two attributes contribute a disproportionately large amount, while the bottom few contribute a small proportion. Selecting the top, say, three attributes may yield 50% of the total available actionability. Selecting 100% of the attributes will capture 100% of the actionability.

When I rank the attributes by actionability, the plot of the percent cumulative actionability as a function of the number of attributes selected represents the "best case", ie directly using actionability scores to select attributes. If I repeat the exercise but this time rank the attributes by a proxy measure (in this case, IGR), I get another curve. In general, using IGR instead of α will result in a slightly different order of selection of attributes (as shown in Table 25). The area between these two curves represents the "lost value" in using the proxy measure in lieu of actionability. The "worst case" – corresponding to just picking attributes at random – would be a straight line at a 45° angle.
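The capture curves can be produced directly from a set of α scores and a proxy ranking. A minimal sketch, using hypothetical α and IGR values for five attributes:

```python
def cumulative_capture(alpha_by_attr, ranking):
    """Percent of total actionability captured after selecting the top-k
    attributes in the given ranking order, for k = 1..len(ranking)."""
    total = sum(alpha_by_attr.values())
    captured, curve = 0.0, []
    for attr in ranking:
        captured += alpha_by_attr[attr]
        curve.append(100.0 * captured / total)
    return curve

# Hypothetical alpha scores and an IGR-based proxy score for five attributes.
alpha = {"A": 0.30, "B": 0.20, "C": 0.10, "D": 0.05, "E": 0.02}
igr   = {"A": 0.25, "B": 0.08, "C": 0.12, "D": 0.04, "E": 0.01}   # B and C mis-ordered by the proxy

best  = cumulative_capture(alpha, sorted(alpha, key=alpha.get, reverse=True))
proxy = cumulative_capture(alpha, sorted(igr, key=igr.get, reverse=True))

# The per-rank gap is the "lost value" from using the proxy ranking.
for k, (b, p) in enumerate(zip(best, proxy), start=1):
    print(f"top {k}: best {b:5.1f}%  proxy {p:5.1f}%  gap {b - p:4.1f} points")
```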

Figure 23 Percent Cumulative Actionability for ADULT dataset

Compared with the other datasets (below) there is a larger gap between the perfect case (using α scores) and using IGR. The gap reaches a maximum at the 4th-ranked attribute (a 16% point discrepancy) and the average gap across the dataset is 3.7% points.

Figure 24 Percent Cumulative Actionability for CRX dataset

For the CRX dataset, ranking the attributes by IGR instead of α results in little loss of actionability until the 6th-ranked attribute. The gap reaches a maximum at the 9th-ranked attribute (4.6% points) and the average shortfall across all attributes is 1.7% points.

Figure 25 Percent Cumulative Actionability for GERMAN dataset

For the GERMAN dataset, using IGR instead of α results in actionability capture that tracks closely with the best case. Here, the maximum gap is at the 7th-ranked attribute (5.4% points) and the average gap is 1.2% points.

Lastly, all 49 attributes from the three datasets are combined to show how IGR works as a proxy for α across a number of scenarios. This situation performs worse than when considered individually, with the biggest gap opening up at the 8th-ranked attribute (11% points) and an average loss of 4.9% points across all 49 attributes.

Figure 26 Percent Cumulative Actionability for All datasets

This examination of the performance of IGR scores as a substitute for α scores when selecting attributes indicates that information gain ratio is a robust predictor of actionability. IGR is useful for estimating which attributes are the most likely to have errors translate into mistakes. This, in turn, is useful for prioritising attributes for information quality interventions.

6.5.3         Effects on Interventions

As examined in Chapter 5, a very wide range of organisational and technological changes could be implemented to improve information quality. This spans re-training for data entry employees, re-configuration of the layout and functionality of packaged enterprise systems, enhanced quality assurance and checking for key business processes, use of alternate external information sources, re-negotiation of incentive structures for senior managers, reviews of source code, improvements to information and communication technology infrastructure, re-engineering of data models or the deployment of specialised matching and “data cleanup” tools.

To quantify the performance of a particular intervention, the framework outlined in Chapter 5 proposed a measure, τ, for traction. This was defined as:

\tau = \Pr(X_e \neq X'_e)

Here, Xe refers to the original value of the eth attribute, while X′e is the value of the same attribute after the intervention. Mathematically, this takes the same form as the error rate, ε; for errors, though, the original value (Xe) is compared with the true external-world value (We).

In terms of the model, the primary effect of these disparate IQ interventions is to reduce the effective garbling rate, geff to a new, lower value. Fewer items being garbled result in fewer errors (as mediated by the garbling parameter, γ). Fewer errors result in fewer mistakes (as mediated by the attribute’s actionability score, α). Fewer mistakes mean lower costs. The value of a proposed IQ intervention is the change in the costs to the process, minus the costs of the intervention itself. That is, the value of the intervention is the benefit minus the cost, where the benefit is the expected drop in the cost of mistakes.

To simplify discussion for the time being, suppose the expected per-customer cost of a mistake for a given process is given by M. I can appraise (or value) a particular proposed IQ intervention, on a per-customer basis[19], as benefits minus costs:

V = (\mu_1 - \mu_2)\,M - C

where μ₁ is the original mistake rate, μ₂ is the expected mistake rate after the intervention, and C is the cost of the intervention itself.

For a particular attribute, the rate of mistakes is the error rate multiplied by the actionability, α:

\mu = \alpha\,\varepsilon

By substituting in the formula for the error rate, ε, in terms of the garbling rate, g, and garble parameter, γ, I obtain the following expression:

\mu = \alpha\,\gamma\left(g + (1 - g)\left(1 - e^{-g}\right)\right)

Since γ and α are constant for a particular attribute in a given process, regardless of garbling rate, the mistake rate is a function solely of g. Let g1 be the initial garbling rate and g2 be the post-intervention garbling rate. Substituting the mistake rates back into the benefit in the value equation yields:

\text{Benefit} = (\mu_1 - \mu_2)\,M = M\,\alpha\,\gamma\left[\left(g_1 + (1 - g_1)(1 - e^{-g_1})\right) - \left(g_2 + (1 - g_2)(1 - e^{-g_2})\right)\right]
An alternative way to characterise interventions is to introduce a correction factor, χ. This takes a value from 0% to 100%, with 0% implying no change to the underlying garbling rate (g1 = g2) while 100% implies all garbling events are removed (so that g2 = 0). In general, g2 = (1−χ)g1. Under this modelling assumption, the traction, τ, is related to the correction factor, χ, by:

\tau = \chi\,\varepsilon_1 = \chi\,\gamma\left(g_1 + (1 - g_1)\left(1 - e^{-g_1}\right)\right)
This is because the IS state before and after the intervention only differ on those customer records where a garbling has been removed (corrected). Rather than using the traction directly, the explicit use of the garbling rate and correction factor will emphasise the underlying noise model used here.

So, when expressed in this way, the benefit of an intervention is given by:

\text{Benefit} = M\,\alpha\,\gamma\left[\left(g_1 + (1 - g_1)(1 - e^{-g_1})\right) - \left((1-\chi)g_1 + \left(1 - (1-\chi)g_1\right)\left(1 - e^{-(1-\chi)g_1}\right)\right)\right]

For small values of g (<0.1), where 1 − e⁻ᵍ ≈ g and hence ε ≈ 2γg, this can be further approximated as:

\text{Benefit} \approx 2\,M\,\alpha\,\gamma\,\chi\,g_1
Modelling the per-customer costs of proposed IQ interventions is heavily dependent on the specific conditions. For instance, a one-off re-design of a key enterprise system might entail a single fixed cost at the start. Other interventions (such as staff training) might be fixed but recurring. In general, there might be fixed costs (such as project overheads) and variable costs (that depend on the number of customer records). The variable costs might be a function of the number of customer records tested and the number of customer records that are updated (edited), whether or not the edit was correct. Assuming the intervention tests all the records and only edits the ones in error, the cost function is:

C = \kappa_F + \kappa_T + \varepsilon\,\kappa_E

(Here, κF is the per-customer fixed-cost component, calculated by dividing the fixed costs by the number of customers, while κT and κE are the per-record testing and editing costs.) A more sophisticated analysis is possible if the garbled records are identifiable. For example, suppose that during a recent organisational take-over the target organisation's customer records were garbled during the database merge. In this case, only that proportion, g, needs to be tested:

C = \kappa_F + g\,\kappa_T + \varepsilon\,\kappa_E
Substituting the simpler approximated cost and benefit equations into the per-customer value equation gives:

V \approx 2\,M\,\alpha\,\gamma\,\chi\,g_1 - \left(\kappa_F + \kappa_T + 2\,\gamma\,g_1\,\kappa_E\right)
To put this into an investment perspective, the value (and hence the benefits and costs) must be scaled up from a per-customer (and per-decision-instance) basis to an aggregated form, across the customer base and over a period of time (the investment window).

As outlined in Chapter 5, the Stake metric (the expected cost of mistakes for a process) captures this. Here, the parameter M represents the expected cost of a single mistake. Not all of an organisation's customers are subject to all the processes; a bank, for instance, may expect just a few percent of its customers to apply for a mortgage in any year. Multiplying M by the number of customers undergoing a process expresses the value in absolute amounts. For comparison purposes, it is sufficient to express this as a proportion of the entire customer base, β, rather than a raw count.

The third factor affecting a process's Stake is the expected number of times it is executed in the investment window. Mortgage applications might be quite rare, whereas some direct marketing operations may be conducted monthly. Most simply, this is the annual frequency, f, multiplied by the number of years, n. (To properly account for the time-value of money, a suitable discount rate must be used, in accordance with standard management accounting practice. This is addressed below.) The Stake for a given process (without discounting) is given by:

S = \beta\,f\,n\,M
Similarly, the costs (κF, κT and κE) can be scaled by βfn to reflect their recurrence. If the intervention attracts only a single cost in the first period (such as a one-off data cleansing or matching initiative) rather than ongoing costs, this scaling would not be required.

So the total value of an intervention on a particular attribute, a, for a particular process, p, is given by[20]:

Note that this is essentially a factor model: the benefit part of the equation is the product of eight factors. Should any of these factors become zero, the value is zero. What’s more, the value varies linearly as any one factor changes (except for the garble rate, g, which is of a product-log form). This means that if you double, say, the correction factor χ while keeping everything else constant, the value will also double. Conversely, if any factor (such as γ) halves then the resulting value will halve.
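The following sketch evaluates the per-customer value of a candidate intervention using the factor model just described, with the exact garble-to-error expression rather than the small-g approximation. All input values are hypothetical and the costs are collapsed into a single per-customer figure for brevity; it is an illustration of the calculation, not the author's tooling.

```python
import math

def error_rate(g, gamma):
    return gamma * (g + (1 - g) * (1 - math.exp(-g)))

def intervention_value(beta, f, n, M, alpha, gamma, g, chi, cost_per_customer):
    """Per-customer value of an intervention over the investment window:
    benefit = beta * f * n * M * alpha * (eps(g1) - eps(g2)), with g2 = (1-chi)*g1."""
    eps_before = error_rate(g, gamma)
    eps_after = error_rate((1 - chi) * g, gamma)
    benefit = beta * f * n * M * alpha * (eps_before - eps_after)
    return benefit - cost_per_customer

# Hypothetical inputs: 20% of customers, monthly process, 3-year window,
# $50 cost per mistake, alpha = 0.3, gamma = 0.6, garble rate 5%,
# an intervention correcting 40% of garbles at $2 per customer.
v = intervention_value(beta=0.20, f=12, n=3, M=50.0, alpha=0.3,
                       gamma=0.6, g=0.05, chi=0.40, cost_per_customer=2.0)
print(f"value per customer over the window: ${v:.2f}")
```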

To capture the total value of an intervention, the benefit arising from that intervention must be aggregated across all the processes, Pa, that use the attribute in question, while the costs are only incurred once. This gives the total aggregated value of an intervention on attribute a:

V_a = \sum_{p \in P_a} \text{Benefit}_{a,p} \;-\; \text{Cost}_a
Where an annual discount rate of d needs to be applied, the discounted total aggregated value is obtained by discounting each year's net cash flow:

V_r = \sum_{t=1}^{n} \frac{\text{Benefit}_t - \text{Cost}_t}{(1 + d)^{t}}
When d=0 (ie the discount rate is zero), Vr = Va.

This section reported and interpreted the results of the experimental process. It started by developing a generic garbling procedure to simulate the effect of noise on data values, accompanied by a statistical model of the resulting errors. The model was shown to fit closely with the pattern of events observed in the experiments. Drawing on the theoretical framework from Chapter 5, a proxy measure for predicting how errors translate into mistakes was obtained. This proxy – the information gain ratio – was shown to be useful for prioritising attributes by actionability. A measure for characterising proposed interventions was then developed, which in turn led to a benefit model and a cost model. This cost/benefit model was then expressed in discounted cash flow terms.

6.6       Application to Method

This section shows how the measures and formulae derived above can be employed by analysts designing and implementing IQ interventions. There are two broad uses for these constructs within an organisation. Firstly, they can focus analysts on the key processes, attributes and interventions that offer the greatest prospect for delivering improvements. Secondly, they can help to appraise competing proposals or initiatives objectively.

When designing IQ interventions, it is important to note that the space of possible solutions is extremely large. There might be dozens or even scores of customer decision processes and scores – possibly hundreds – of customer attributes to consider, each subject to multiple sources of noise. Lastly, with different stakeholders and interests, there could be a plethora of competing and overlapping proposed interventions for rectifying information quality problems.

It is also important to understand that considerable costs are involved in undertaking the kind of quantitative analysis employed here. For example, to estimate properly the α measure would require the same approach as Section 6.5.2: repeatedly introducing errors into data, feeding them into the decision process, and checking for resulting mistakes. Undertaking such activity on a "live" production process would be costly, risky, time-consuming and prone to failure. Repeating this for all the attributes used in all the processes would be a formidable task. In contrast, the organisation's preferred discount rate for project investments, d, is likely to be mandated by a central finance function (or equivalent) and so is readily obtained.

Rather than estimating or computing all the measures for all possibilities, the ability to focus attention on likely candidates is potentially very valuable. The following table describes the measures, ranked from easiest to most difficult to obtain. The top half are external to any IQ initiative while the bottom half would be derived as part of the IQ initiative. Based on these assumptions, a cost-effective method for designing and appraising IQ interventions is developed.


 

Symbol  Name                    Scope               Source           Definition
n       Period                  Organisation        Project sponsor  The number of years for the investment.
d       Discount rate           Organisation        Finance          Set using financial practices, taking into account project risks, the cost of capital etc.
f       Frequency               Process             Business owner   The expected number of times per year the process is executed.
β       Base                    Process             Business owner   The proportion of the organisation's customer base that is subject to the process on each execution.
M       Mistake Instance Cost   Process             Business owner   The expected cost of making a mistake for one customer in one instance.
γ       Garble Parameter        Attribute           IQ project       The probability that a garbled data value will change.
IGR     Information Gain Ratio  Attribute           IQ project       A measure of an attribute's influence on a decision-making process.
ε       Error Rate              Attribute           IQ project       The proportion of data values in error.
g       Garble Rate             Attribute           IQ project       A measure of the prevalence of garbling resulting from a noise process.
χ       Correction Factor       Intervention        IQ project       The net proportion of garbles corrected by an intervention.
κ       Cost Factors            Intervention        IQ project       The costs of an intervention: fixed (κF), testing (κT) and editing (κE).
α       Actionability           Process, Attribute  IQ project       The rate at which errors translate into mistakes.

Table 26 Value Factors for Analysis of IQ Intervention

Below is the sequence of steps to investigate, design and appraise IQ interventions. The inputs are the organisation’s set of customer decision processes that use a shared set of customer attributes. The output is an estimate of the Net Present Value (NPV) of candidate IQ interventions. The approach is to focus on the key processes, attributes and interventions that realise the largest economic returns, whilst minimising the amount of time and cost spent on obtaining the above measures.

1)       Stake

Goal: Select the most-valuable processes within scope. Define values of d and n appropriate for the organisation. Identify the key customer decision processes. Use β and f to gauge “high traffic” processes. Use M to estimate high impact decisions. Multiplying these factors gives S, the stake. Use S to rank the processes and select a suitable number for further analysis.

2)       Influence

Goal: Select the most-important attributes. For each of the top processes, use IGR (Influence) to select top attributes. This requires getting a sample of inputs and outputs for the decision functions and performing the entropy calculation. It does not require any manipulation of the systems themselves. For each attribute, sum the product of its Influence and Stake over each process to get an aggregated view of importance. Use this importance measure to select the top attributes.

3)       Fidelity

Goal: Select the most-improvable attributes. For each of the top attributes, measure its γ value. This involves estimating the probability of each data value occurring and summing the squares. Observe the ε values (error rates between the external-world and system representations) and, using the formula in Section 6.5.1, estimate the garble rate, g. For the highly garbled attributes, estimate α values for the high-stake processes. This requires introducing noise into the attributes and seeing how it translates into mistakes. Use these to populate the Actionability Matrix (see Table 27 below).

4)       Traction

Goal: Select the most-effective interventions. For each of the top attributes, estimate χ for various intervention proposals. To do this, analysts must either undertake the intervention on a sample and measure the drop in the effective garbling rate or draw on past experience in similar circumstances. In doing this, the cost factors (various κ) can be estimated. Use the values to estimate the costs and benefits of the candidate interventions with the NPV formula.

These models can be used in business cases to inform organisational decision-making about IQ investments.

Note that when the discount rate is applied, the NPV calculation requires using Vr:

Not all organisations use NPV directly. This expression can be re-cast as Internal Rate of Return (IRR) by setting Vr=0 and solving for d, or for payback period by solving for n. Alternatively, if Return on Investment (ROI) is required, then Va is expressed as a ratio of benefit/cost instead of the difference (benefit-cost) before discounting.
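A minimal sketch of these financial re-castings, assuming a simple annual cash-flow profile (all figures hypothetical): NPV is computed by direct discounting, and IRR by bisection on the discount rate at which the NPV is zero.

```python
def npv(cashflows, d):
    """Net Present Value of annual net cash flows; cashflows[0] occurs now
    (year 0), cashflows[t] at the end of year t."""
    return sum(cf / (1 + d) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-6):
    """Internal Rate of Return found by bisection on npv(d) = 0; assumes a
    single sign change in the cash-flow sequence."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(cashflows, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical intervention: $100k one-off cost now, $45k of avoided mistake
# costs in each of the following three years.
flows = [-100_000, 45_000, 45_000, 45_000]
print(f"NPV at d=10%: {npv(flows, 0.10):,.0f}")
print(f"IRR: {irr(flows):.1%}")
```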

While these steps are presented in a linear fashion, it is envisaged that an analyst would move up and down the steps as they search for high-value solutions, backtracking when coming to a “dead end”. For example, an attribute selected at Step 2 (ie with high Influence) may have very low error rates (and hence low garble rates) and so afford little opportunity for improvement, regardless of how high the α and χ values may be. In this case, the analyst would go back to Step 2 and select the next-highest attribute. Similarly, a problematic and important attribute may simply not have any feasible interventions with a χ over 5%, in which case any further efforts will be fruitless and Step 3 is repeated.

To help the organisation keep track of the measures during the evaluation exercise, the following “Actionability Matrix” is proposed. This table of values is constantly updated throughout the project and it is important to note that it is not intended to be fully populated. In fact, determining all the values in the table indicates that the selection process has gone awry.

Suppose the organisation has a set of customer processes, P1, P2, …, P7 that use (some of) the customer attributes A1, A2, …, A11. Each cell, αa,p, records the actionability for the ath attribute and pth process. The last column records the error rate (ε) for the attribute, while the last row records the "annual stake[21]" (βfMp) for the process. The rows and columns are arranged so that the attributes with the highest error rates are at the top and the processes with the highest stake are on the left.


 

α          P1     P2     P3     P4     P5     P6     P7     Error Rate (ε)
A1         0.12   0.00   0.11   -      0.17   -      -      55%
A2         -      0.05   -      0.15   -      -      -      35%
A3         0.05   0.42   0.07   0.11   -      -      -      15%
A4         0.22   0.11   0.61   0.03   0.07   -      -      12%
A5         0.13   -      -      0.09   -      -      -      10%
A6         -      0.07   -      0.55   -      -      -      8%
A7         -      -      -      -      -      -      -      8%
A8         -      -      -      -      -      -      -      5%
A9         -      -      -      -      -      -      -      3%
A10        -      -      -      -      -      -      -      3%
A11        -      -      -      -      -      -      -      0%
Stake (S)  $128   $114   $75    $43    $40    $24    $21

 

Table 27 Illustration of an Actionability Matrix

Cells with high values of α are highlighted, drawing attention to the strongest prospects for economic returns. To gauge the amount of cash lost each year on a certain process due to IQ deficiencies in a particular attribute, the cell value (α) is multiplied by the marginals (S and ε). For example, the annual loss due to attribute A6 on process P4 is 0.55 × 0.08 × 43 = $1.89. For A3 in P2 it is 0.42 × 0.15 × 114 = $7.18. The top "value leaker" is A1 in P1, with 0.12 × 0.55 × 128 = $8.45.
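Ranking the cells of an Actionability Matrix by annual value leakage (α × ε × S) is straightforward to automate. The sketch below uses the illustrative figures from Table 27 (omitting the unpopulated cells) and reports the top three cells.

```python
# Illustrative figures from Table 27: annual stake per process, error rate per
# attribute, and the populated actionability cells.
stake = {"P1": 128, "P2": 114, "P3": 75, "P4": 43, "P5": 40}
error_rate = {"A1": 0.55, "A2": 0.35, "A3": 0.15, "A4": 0.12, "A5": 0.10, "A6": 0.08}
alpha = {
    ("A1", "P1"): 0.12, ("A1", "P3"): 0.11, ("A1", "P5"): 0.17,
    ("A2", "P2"): 0.05, ("A2", "P4"): 0.15,
    ("A3", "P1"): 0.05, ("A3", "P2"): 0.42, ("A3", "P3"): 0.07, ("A3", "P4"): 0.11,
    ("A4", "P1"): 0.22, ("A4", "P2"): 0.11, ("A4", "P3"): 0.61, ("A4", "P4"): 0.03, ("A4", "P5"): 0.07,
    ("A5", "P1"): 0.13, ("A5", "P4"): 0.09,
    ("A6", "P2"): 0.07, ("A6", "P4"): 0.55,
}

# Annual value leakage per cell: alpha * error rate * annual stake.
leakage = {(a, p): alpha[(a, p)] * error_rate[a] * stake[p] for (a, p) in alpha}

for (a, p), loss in sorted(leakage.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{a} in {p}: ${loss:.2f} per year")
```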

Note that the values of α are determined by how the decision function for a process uses the attributes. As such, they will not change unless the underlying decision function changes, meaning they will persist for some time. This means that as new processes are added to the organisation, the existing α values will not have to be updated, ensuring that organisational knowledge of how information is used can accumulate.

Throughout this method, it is assumed that the α values are obtainable (although expensive) and that IGR is used as a cheaper proxy to avoid unnecessarily measuring actionability for all attributes in all processes. However, in situations where a new customer decision process is being planned, α is simply not available. If the organisation has not yet implemented the decision function, then the probability of an error translating into a mistake cannot be measured experimentally.

In such circumstances, it is still possible to estimate IGR. The "true information gain ratio", Z, was shown in Table 23 to be highly correlated with the IGR for specific decision functions. Recall that this measures the influence of an attribute on the "correct decision" (as opposed to the decision made by a particular model). So IGR can be found as long as a sufficient sample of correct decisions (as used, for example, in training a decision function) is available. This IGR can then be used in lieu of α in the Actionability Matrix and NPV calculation to get an order-of-magnitude estimate of value.

This section has shown how the measures, formulae and assumptions can be used to guide the design and appraisal of IQ interventions, in a way that allows analysts to focus attention on high-value solutions while discarding low-value ones. A four-step iterative sequence is outlined along with a simple matrix for tracking key values. The resulting model of costs and benefits can be expressed as Net Present Value (or related measures) as needed by the organisation.


 

6.7        Conclusion

This concludes the specification and investigation of the framework. The chapter began with a high-level conceptual model of how IQ impacts on organisational processes and a theoretically-grounded set of candidate metrics for assisting analysts in prioritising IQ improvements.

The investigation proceeded by defining and creating an environment for inducing IQ deficiencies (noise) in realistic contexts, using realistic datasets, decision tasks and algorithms. The garbling noise process was selected for use here and its effects were successfully modelled as a combination of an inherent property of the data (γ) and a controllable independent variable (g). The actionability (α), or the effect of errors in giving rise to mistakes, was experimentally measured in these contexts. Because α is far too expensive to obtain in all cases in practice, the theoretical measure of IGR (information gain ratio, or Influence in this framework) was tested and shown to be a very useful proxy. Finally, based on these findings, a financial model of the effect of removing IQ deficiencies was developed. A method was proposed for analysts to use the Actionability Matrix to apply these measures in an efficient, iterative search for high-value IQ interventions.

Hence, this designed artefact meets the definition of a framework outlined in Chapter 5, Section 2. It comprises a model (the Augmented Ontological Model of IQ), a set of measures (grounded in Information Theory) and a method (based around populating the Actionability Matrix). Importantly, this framework allows analysts to make recommendations on investments in IQ improvements using the quantitative language of cost/benefit analyses. This was a key requirement identified by the practitioner interviews in Chapter 4.

[12] The term “noise” is used here to describe unwelcome perturbations to an information-bearing signal, as used in statistics, physics and engineering.

[13] One may argue that statistical models created by organisations are, in a fashion, simulations of customer behaviour. In this case, these experiments are a re-construction of a simulation.

[14] Discussed further below in Section 4 (Experimental Process).

[15] The term “garbling” is adapted from Blackwell’s seminal work in The Theory of Experiments

[16] In this software package, it simply means “percentage correctly classified”. Since we are dealing with binary decision problems, it is adequate and entropy-based measures are not required.

[17] This is the same as the reported results for ID3 in the notes attached to the dataset.

[18] The pathological case of an error in the data actually improving the decision (ie an error correcting a mistake) is discussed in Section 6.

[19] These results are aggregated into a total value (and total cost) basis in the subsequent section.

[20] The α and M parameters vary for each process, whereas all other parameters are a function only of the attribute.

[21] That is, the stake as defined above, but divided by the number of years in the investment window, since this will be the same for all interventions.



 

 

Prev:
Chapter 5 - Conceptual Study
Up:
Contents
Next:
Chapter 7 - Research Evaluation