Bug Priorities
DRAFT 1.8
jordan.brown@sun.com v1.8 02/17/04 13:27:04
Background
I'm tired of the grandiose claims made in project plans that all P1-P3 bugs
will be fixed and that quality will trump schedule, followed almost
inevitably by an end-game rush to downgrade or ignore bugs. It is my hope
that this intellectual dishonesty can be avoided by having relatively
objective criteria for bug classification, agreed to by all parties and
incorporated into the project plans as quantifiable quality requirements.
This document is an attempt to produce such an objective set of criteria.
Plans
For the moment, this is still a draft. I'm trying to use these definitions
for current work, but can't claim they are universally accepted.
General Philosophy
These guidelines are intended for use as objective, bright-line
distinctions between the various bug priorities. The intent is that at
any time during the project's life cycle a bug can be analyzed against
these guidelines and yield a priority that can be used to determine the
status of the project against its "must fix" requirements. The basic
question they try to answer is "if we were to discover this bug the day
before our FCS build, would we hold the FCS build for the fix?".
These guidelines often yield a P3 on a bug that obviously must be fixed
before FCS because they assume that P3 really does mean "must fix for FCS",
that an outstanding P3 bug will block FCS. Similarly, they will often
classify a bug that "should" be fixed as P4, because "should" does not mean
"must".
P1 - Required ASAP (dev/test blocker [10])
    Build fails.
    Major features inoperative. [2]
    Causes hardware damage.

P2 - Required before Beta (Beta stopper)
    Major features missing. [2]
    Operation without user error causes data corruption.
    Operation without user error causes service crash. [3]
    Operation with remote failure causes data corruption. [5]
    Security failure - privilege promotion. [1]
    Legal issues.
    Documentation: major features undocumented, or documented so
    incorrectly as to be unreasonably difficult to use.

P3 - Required before FCS (FCS stopper)
    Minor features inoperative. [2]
    Operation with user error causes data corruption.
    Operation with user error causes service crash. [3]
    Operation with remote failure causes service crash. [3], [5]
    Operation with local disk full causes data corruption.
    Operation with local disk full causes service crash. [3]
    Significant embarrassment with trivial fix (e.g. highly visible
    spelling error).
    Operation without user error leads to confusing situation. [9]
    Security failure - secret data revealed. [4]
    Documentation: minor features undocumented, or documented so
    incorrectly as to be unreasonably difficult to use.
    Interoperability failure. [11]

P4 - Desired for FCS
    Minor features missing. [2]
    Spelling / grammar errors.
    Operation with user error leads to confusing situation. [9]
    Catastrophic local failure leads to corrupt data. [6]
    Unexpected error (bug) is not handled well. [8]
    Operation with local disk full leads to confusing situation. [9]
    Documentation: features documented inadequately or incorrectly,
    but a reasonable user should be able to figure it out.

P5 - Low priority
    Wordsmithing.
    Code cleanup.
    Corrupt internal data leads to problems. [7]
Notes
Although it is not captured here, an excess of lower-priority bugs should
probably be considered show-stopping. For instance, a single spelling error
in a non-prominent location is probably not a show-stopper... but a spelling
error on every page might well be. These guidelines generally yield a lower
priority when a failure is the result of user error, but if the design of
the project is such that users are frequently led into error then that itself
is probably a show-stopping bug, and at a minimum the bugs tied to those
errors should be considered to be higher priority than they might otherwise
be.
Another factor that is not captured here is that the history of the bug may
be relevant. A regression - a failure in a component that worked in a
previous release - may be more important than a failure in a new component.
A failure present in a previous release may be less important than a failure
in a new component, since presumably customers are already coping with it
and one might want to avoid shipping new bugs to the field.
Performance bugs are not discussed, because they are hard to discuss
objectively. One can measure performance against project plan goals,
but there is no way to objectively gauge the correctness of those goals.
Performance bugs come in two major flavors: "X is slower than it needs
to be" and "X is slower than I want it to be". For the former, it
is often subjective whether the wasted performance is important. Reducing
a particular function call from 10ms to 1ms may be critical if that function
is called millions of times per day, and completely unimportant if that
function is called once during system startup. (Perhaps we can articulate
a distinction based on expected seconds of wasted time per user day.)
For the latter (slower than I want), how critical is this performance really?
How reasonable are the user's expectations? Sometimes the performance is
painful, but is a matter of physics or information theory and cannot be
improved. It would be good to articulate performance requirements past
which the product would be considered unusable, but that's very hard to do
and runs the risk of setting a "negative goal" where it isn't considered
necessary to exceed the minimum requirement.
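The "expected seconds of wasted time per user day" idea above can be made
concrete with a little arithmetic. The call counts below are invented for
illustration; the point is only that the same per-call waste can be critical
or irrelevant depending on how often the call happens:

```java
public class WastedTime {
    // Wasted seconds per day = calls per day * milliseconds wasted per call / 1000.
    static double wastedSecondsPerDay(long callsPerDay, double wastedMsPerCall) {
        return callsPerDay * wastedMsPerCall / 1000.0;
    }

    public static void main(String[] args) {
        // Reducing a call from 10ms to 1ms recovers 9ms per call.
        // Called a million times a day, that is 9000 wasted seconds per day:
        System.out.println(wastedSecondsPerDay(1_000_000, 9.0)); // 9000.0
        // Called once at system startup, it is 9 milliseconds, total:
        System.out.println(wastedSecondsPerDay(1, 9.0));         // 0.009
    }
}
```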
Documentation bugs are necessarily subjective, since a documentation bug
deals intrinsically with "soft" questions like whether the user will
understand what the software is really doing.
Denial of service bugs are not discussed.
Probability of encountering a failure is not factored into the priority.
For security bugs, probability must not be factored in because a
malevolent user will cause the low-probability scenario to occur.
For other bugs, a low probability might suggest a reduced priority,
but no attempt is made to capture that here.
Footnotes
[1] Most generally, user can do an operation he is not authorized
for. Specific examples: Non-root user can achieve root.
Not-logged-in user can execute more than strictly limited code
(e.g. login, appropriate anonymous services). Creating, deleting,
or modifying a file without authorization.
[2] "Missing" means that the feature is not present. It isn't
offered to the user. "Inoperative" means that the feature is
present and offered to the user, but fails.
[3] "Crash" here means that a component fails and must be
restarted, not merely that an operation yields an error, even an
inappropriate error with a stack backtrace. If issuing an independent
request without taking any recovery actions might succeed, "crash"
does not apply.
[4] User can view data he is not authorized to view.
[5] "Remote failure" means a failure in a component outside the
current system, and in particular refers to real-world failures
like server power outages, network failures, et cetera.
See also [6] and [7] below.
[6] "Catastrophic local failure" means a failure in a component
outside the current project and inside the current system, and in
particular refers to power outages and kernel panics. Such failures
should be handled as well as reasonable, which may not be very well.
[7] "Corrupt internal data" means corruption in stored data or data
retrieved from a server. It does not include erroneous user input,
though it does include user input that has been purportedly validated.
Corrupt internal data implies a bug in this or another component.
Note that causing corruption is likely a P2 or P3 bug;
this note applies only to the response to receiving the corrupt data.
Note also that there may be serious performance penalties associated
with revalidating data in the hopes of detecting such corruption, and
so revalidation may, on balance, be undesirable.
[8] Ideally, when a bug is detected in the current project, the operation
should be cleanly aborted and diagnostic information provided that will
allow the service and sustaining organizations to address the bug.
For instance, unexpected Java exceptions should be reported (with stack
backtrace) and should lead to a clean shutdown of the operation.
This note applies to a failure to properly handle such a bug - for
instance, this note almost always applies to catching and ignoring
java.lang.Exception or its generic friends Error, Throwable, or
RuntimeException.
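A minimal sketch of the distinction this note draws, with invented method
names (swallow, report) for illustration. The first method is the pattern
the note warns against - the bug vanishes without a trace - while the second
aborts the operation and preserves enough diagnostic information to find it:

```java
public class BugHandling {
    // Bad: catching and ignoring the generic exception. The operation
    // "succeeds" and the bug is invisible to service organizations.
    static String swallow(Runnable op) {
        try { op.run(); return "ok"; }
        catch (Exception e) { return "ok"; }   // note [8] applies here
    }

    // Better: abort cleanly and report the failure. Real code would also
    // log the full stack backtrace, e.g. via e.printStackTrace().
    static String report(Runnable op) {
        try { op.run(); return "ok"; }
        catch (RuntimeException e) { return "aborted: " + e; }
    }

    public static void main(String[] args) {
        Runnable buggy = () -> { throw new IllegalStateException("simulated bug"); };
        System.out.println(swallow(buggy)); // prints "ok" -- the bug is hidden
        System.out.println(report(buggy));  // prints "aborted: java.lang.IllegalStateException: simulated bug"
    }
}
```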
[9] A "confusing situation" is one in which the user is likely to
misunderstand the results of an operation or in which it is not
clear to the user what to do next. An error message that simply
reported "failed" would be a confusing situation, since the user
probably has no idea what to fix to address the problem. An error
message when no error actually occurred would be another case, as
would a command that apparently completed successfully even though
it encountered problems.
[10] A "dev/test blocker" is a bug that idles developers or testers.
Breaking the build is such a case, because development and test
builds cannot continue. Breaking a major piece of functionality
is likely to be such a case, because developers are prevented from
exercising the code they're working on and major blocks of tests may
fail. Merely causing a single test to fail, or a few tests to fail,
does not constitute "blocking". As a P1 bug tells the RE to drop
everything else and address this bug, it must be used carefully:
tests and development should be planned so that when a relatively
minor failure is encountered staff can be redirected to other
activities rather than being idled.
[11] "Interoperability failure" refers to a current or future scenario
where different components, or different versions of the same components,
do not work together properly. The extent to which interoperability
is required depends on the project, but for network transactions the
usual standard is that all versions of one component must interoperate
with all versions of the other. An "old" client must interoperate with
a "new" server, and vice versa. For APIs the test is usually less
strict; it is usually acceptable for a "new" application to require
a "new" version of the library, but almost never acceptable for a "new"
version of the library to require "new" applications. Note that many
interoperability failure scenarios are future-looking ones of the
form "what happens if today's version of the component attempts to
interoperate with a version from next week?".
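One common way to get the "old client must interoperate with a new server"
property is to make each side ignore protocol fields it does not recognize,
so that a newer peer can add fields without breaking an older one. The
sketch below is purely illustrative - the wire format and field names are
invented, not taken from any real protocol:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Handshake {
    // Parse "key=value;key=value" messages, keeping only keys this
    // version understands and silently skipping the rest.
    static Map<String, String> parse(String msg, Set<String> knownKeys) {
        Map<String, String> out = new HashMap<>();
        for (String field : msg.split(";")) {
            String[] kv = field.split("=", 2);
            if (kv.length == 2 && knownKeys.contains(kv[0])) {
                out.put(kv[0], kv[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // An "old" client knows only "version"; a "new" server also
        // sends "cipher". The unknown field is skipped, not fatal.
        Map<String, String> seen =
            parse("version=2;cipher=aes", Set.of("version"));
        System.out.println(seen); // prints {version=2}
    }
}
```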