Research Blog - Customer Intelligence
But does this actually work? A (possibly apocryphal) story that circulates in the data quality literature is something I call the parable of the "broken legs". It goes something like this:
A leading health-care organisation was doing some analysis of patient claims data. They found that some 80% of claims made were for broken legs (or fractures of the lower limbs). They tested a number of theories - perhaps the plan was popular with skiers? - before they figured out what was happening. When patients made a claim, the call-centre staff had to select the appropriate ailment from a drop-list or they could not advance to the next screen. They resented this step since it added a large delay (and their bonuses were linked to promptness). However, the call-centre staff soon realised that they could merely select the first item in the list and proceed. That first item was "broken legs". Here endeth the lesson.
So, I ask the question: under what circumstances is it a good idea to insist on non-blank fields? Or, equivalently, when is it a good idea to allow blank fields?
Here's the setup:
Assume that you presently allow blank fields, and are deciding whether to add a control to ban them.
A field value can be in one of three states: Correct (C), Blank (B) or Rubbish (R). Note that Rubbish means the value is valid, but randomly generated (perhaps drawn from the same distribution as the Correct values).
Presently, your data is split across these three types in proportion [C,B,R].
The idea is that putting in a blank-field check will turn some of your Blanks into Corrects. Of course, you'll also create some extra Rubbishes too. Let's think about the relative costs involved.
Firstly, the cost of a Correct entry is 0. Next, let's set the cost of a Rubbish entry as 1 "penalty unit". (As we see later, it doesn't really matter what the absolute cost is in dollars.) I argue that the cost of a Blank (c) is somewhere between 0 and 1: at least with Blanks you know where the dodgy data is so you can (presumably) work around it, or retrospectively fix it. With Rubbish though, you don't even know how bad the problem is.
Proposition: 0 <= c <= 1
"Proof": Assume c > 1, then each Blank could be replaced with a Rubbish entry driving down the total cost. Since this doesn't happen in practice, no one believes that c > 1.
[If Blanks are in fact costing you more than Rubbish, please get in touch with me via email and I will happily - for a modest fee - make the substitution for you in your data set.]
NB: c < 0 would imply a negative cost, ie you're making money on a Blank. This is perverse and indicates something is wrong with how you're using the information.
The astute reader will say "Ah! But sometimes Blanks are substituted in a process known as imputation." True. But the imputed values are not Rubbish; they are usually the mode, mean or another average (or likely) value.
OK, so we've established a rough-and-ready estimate of the relative cost of a Blank (c). Now, we need to guesstimate the proportion of Blanks that will be converted into Corrects. Naively, people may hope this is 100%. As the parable of the "broken legs" shows, this is not the case. So, we set q as the proportion going from Blank -> Correct, and (1-q) as the proportion going from Blank -> Rubbish.
Now, the gain is going to be the expected benefits net of costs. (Here we assume there is no switching cost of putting the check in.) The benefit is going to be the proportion of Blanks converted to Corrects times the relative cost of Blanks. The cost will be the proportion of Blanks converted to Rubbish times the extra cost of a Rubbish over a Blank.
G = [benefit] - [cost]
= [q * c] - [(1-q) * (1-c)]
= qc - [1 - q - c + qc]
= q + c - 1
So, G has a maximum at (1 + 1 - 1 = +1), and a minimum at (0 + 0 - 1 = -1). If G = 0 it means you're at the critical (or break-even) point.
So here's a simple decision model: it's worthwhile removing the blank-field option if and only if q + c > 1.
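The whole model fits in a few lines of code. Here's a minimal sketch (the function names are mine, invented for illustration):

```python
def gain(q, c):
    """Expected gain per Blank from banning blank fields.

    q: proportion of Blanks that would become Corrects (0..1)
    c: cost of a Blank relative to a Rubbish entry (0..1)

    Benefit is q*c (Blanks fixed); cost is (1-q)*(1-c) (Blanks made
    worse). Algebraically this simplifies to q + c - 1.
    """
    return q * c - (1 - q) * (1 - c)


def worth_banning_blanks(q, c):
    """Decision rule: ban blanks iff q + c > 1, ie the gain is positive."""
    return gain(q, c) > 0


# Using the hunches argued for below (q = 0.25, c = 0.5):
print(gain(0.25, 0.5))                  # -0.25 penalty units per Blank
print(worth_banning_blanks(0.25, 0.5))  # False
```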
Aside: If the switching cost (ie developing new specs, building the code, testing it, deploying it, training and documentation etc) is not negligible, then you'll need to know the absolute costs. To do this, let C be the absolute cost of a Rubbish value (in dollars), B the proportion of your data that is Blank, and S the switching cost.
GA = [B * q * c * C] - [B * (1-q) * (1-c) * C] - S
= BC * [q + c - 1] - S
And the decision rule becomes "implement if and only if BC[q + c - 1] > S".
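The absolute-cost version is just as short. A sketch, with variable names following the post and all dollar figures invented purely for illustration:

```python
def absolute_gain(B, q, c, C, S):
    """Absolute expected gain (in dollars) of banning blank fields.

    B: proportion of the data set that is currently Blank
    q: proportion of Blanks converted to Corrects
    c: relative cost of a Blank (a Rubbish entry costs 1)
    C: absolute cost of a Rubbish value, in dollars
    S: one-off switching cost (specs, build, test, deploy, training)

    Implements BC[q + c - 1] - S; positive means worth implementing.
    """
    return B * C * (q + c - 1) - S


# Invented figures: a quarter of records Blank, q = 0.75, c = 0.5,
# Rubbish costing $80,000 across the data set, switching cost $3,000.
print(absolute_gain(0.25, 0.75, 0.5, 80000, 3000))  # 2000.0: worth doing
```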
However, in most cases the simpler case suffices (q + c > 1) and we only need to ask: what are reasonable values for q and c?
Well, my hunch is that for most cases q < 25%, particularly where users are losing the option of entering a blank. This may seem low, but ask yourself how many people would think like this:
Ah, I see I can no longer leave this field blank. I must now select an option from this list. Management has not made this change to frustrate me. They've offered me an opportunity to improve myself. Sure, I could just put in any old thing, but this is my chance to stand up, make a difference and Do The Right Thing.
Clearly, not many people would react like this. The figure of 80% bandied about in the broken legs parable suggests that q<25% is about right.
Of course, you'll need to test this yourself with some field experiments. Here's what NOT to do: pick your five best employees or most diligent customers. Get them to use the new system for the first time while their supervisor, boss and entire board of management look over their shoulder, for five minutes max. Declare it a success when q=100%. Don't repeat six weeks later once staff have figured out the work-arounds.
Now, what's reasonable for c, ie the cost of a Blank as a proportion of the cost of a Rubbish value? I would suggest in general c<50%, ie a Blank costs at most half as much as a Rubbish. Why so low? Well, as mentioned above, a Blank has at least the benefit of being obviously wrong, so some contingent action may be taken (eg data processed differently or re-acquired). The quality problems will have high visibility, making it easier to get management support for fixing them. You can also analyse where the data quality problems are, where they're coming from and why users are reluctant to use that particular field. This is potentially very valuable. Also, Blanks don't screw up averages and other stats about the data set as a whole.
Based on these hunches and the "q + c > 1" decision rule:
0.25 + 0.5 = 0.75 < 1
so, in general, it's not a good idea to disallow blanks. My recommendation for application designers is:
Notify users that they have left a field blank, but allow them to proceed to the next step. If a client demands that fields are completed before users can proceed, tell them the parable of the broken legs. If they persist, ask them to estimate values of q and c (off the top of their head) and use the above decision rule. In the event of a dispute, flip it around to "what values of q and c make it worthwhile?" and check these values for reasonableness. As a last resort, go and look for empirical evidence of q and c in your particular context to settle it.
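Flipping the rule around is straightforward too: given a claimed c, the break-even conversion rate is q = 1 - c (and symmetrically for c). A tiny sketch for that sanity check (function names invented):

```python
def breakeven_q(c):
    """Minimum Blank -> Correct conversion rate q needed to break even."""
    return 1 - c


def breakeven_c(q):
    """Minimum relative Blank cost c needed to break even, given q."""
    return 1 - q


# If a client claims a Blank costs half a Rubbish (c = 0.5), more than
# half of all Blanks must become Corrects just to break even:
print(breakeven_q(0.5))   # 0.5
# And at q = 0.25, a Blank would have to cost 75% of a Rubbish:
print(breakeven_c(0.25))  # 0.75
```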
Of course, you'll need to drag out the back of your own envelope and do the sums yourself - but I hope this is a good start.