Bug Priorities

DRAFT 1.8

jordan.brown@sun.com v1.8 02/17/04 13:27:04

Background

I'm tired of the grandiose claims made in project plans that all P1-P3 bugs will be fixed and that quality will trump schedule, followed almost inevitably by the end-game rush to downgrade or ignore bugs. It is my hope that such intellectual dishonesty can be avoided by having relatively objective criteria for bug classification, agreed to by all parties and incorporated into the project plans as quantifiable quality requirements. This document is an attempt to produce such an objective set of criteria.

Plans

For the moment, this is still a draft. I'm trying to use these definitions for current work, but can't claim they are universally accepted.

General Philosophy

These guidelines are intended for use as objective, bright-line distinctions between the various bug priorities. The intent is that at any time during the project's life cycle a bug can be analyzed against these guidelines and yield a priority that can be used to determine the status of the project against its "must fix" requirements. The basic question they try to answer is "if we were to discover this bug the day before our FCS build, would we hold the FCS build for the fix?".

These guidelines will often yield a P3, rather than a higher priority, on a bug that obviously must be fixed before FCS, because they assume that P3 really does mean "must fix for FCS": an outstanding P3 bug will block FCS. Similarly, they will often classify a bug that "should" be fixed as P4, because "should" does not mean "must".

P1 Required ASAP
(Dev / test blocker [10])
  • Build fails.
  • Major features inoperative. [2]
  • Causes hardware damage.

P2 Required before Beta
(Beta stopper)
  • Major features missing. [2]
  • Operation without user error causes data corruption.
  • Operation without user error causes service crash. [3]
  • Operation with remote failure causes data corruption. [5]
  • Security failure - privilege promotion. [1]
  • Legal issues.
  • Documentation: major features undocumented, or documented so incorrectly as to be unreasonably difficult to use.

P3 Required before FCS
(FCS stopper)
  • Minor features inoperative. [2]
  • Operation with user error causes data corruption.
  • Operation with user error causes service crash. [3]
  • Operation with remote failure causes service crash. [3], [5]
  • Operation with local disk full causes data corruption.
  • Operation with local disk full causes service crash. [3]
  • Significant embarrassment with trivial fix
    (e.g. highly visible spelling error).
  • Operation without user error leads to confusing situation. [9]
  • Security failure - secret data revealed. [4]
  • Documentation - minor features undocumented, or documented so incorrectly as to be unreasonably difficult to use.
  • Interoperability failure. [11]

P4 Desired for FCS
  • Minor features missing. [2]
  • Spelling / grammar errors.
  • Operation with user error leads to confusing situation. [9]
  • Catastrophic local failure leads to corrupt data. [6]
  • Unexpected error (bug) is not handled well. [8]
  • Operation with local disk full leads to confusing situation. [9]
  • Documentation - Features documented inadequately or incorrectly, but a reasonable user should be able to figure it out.

P5 Low priority
  • Wordsmithing.
  • Code cleanup.
  • Corrupt internal data leads to problems. [7]
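
As a rough illustration (not part of the guidelines themselves), the priorities and the milestone each one blocks could be encoded so that "must fix" status is computable from the open bug list. The following Java sketch does that; the names and the three-milestone model are invented for illustration.

    /*
     * Hypothetical sketch only: encodes the tiers above and the milestone
     * each tier blocks, so that "must fix" status can be computed from a
     * list of open bugs.  The names and the gating rule are illustrative.
     */
    import java.util.List;

    enum Milestone { DEV, BETA, FCS }

    enum Priority {
        P1,  // required ASAP (dev/test blocker)
        P2,  // required before Beta (Beta stopper)
        P3,  // required before FCS (FCS stopper)
        P4,  // desired for FCS
        P5;  // low priority

        /** True if an open bug at this priority should hold the given milestone. */
        boolean blocks(Milestone m) {
            switch (this) {
                case P1: return true;                // blocks everything
                case P2: return m != Milestone.DEV;  // holds Beta and FCS
                case P3: return m == Milestone.FCS;  // holds FCS only
                default: return false;               // P4/P5 never block
            }
        }
    }

    class QualityGate {
        /** A build meets its "must fix" requirement if no open bug blocks it. */
        static boolean readyFor(Milestone m, List<Priority> openBugs) {
            return openBugs.stream().noneMatch(p -> p.blocks(m));
        }
    }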

Notes

Although it is not captured here, an excess of lower-priority bugs should probably be considered show-stopping. For instance, a single spelling error in a non-prominent location is probably not a show-stopper... but a spelling error on every page might well be. These guidelines generally yield a lower priority when a failure is the result of user error, but if the design of the project is such that users are frequently led into error then that itself is probably a show-stopping bug, and at a minimum the bugs tied to those errors should be considered to be higher priority than they might otherwise be.

Another factor that is not captured here is that the history of the bug may be relevant. A regression - a failure in a component that worked in a previous release - may be more important than a failure in a new component. A failure present in a previous release may be less important than a failure in a new component, since presumably customers are already coping with it and one might want to avoid shipping new bugs to the field.

Performance bugs are not discussed, because they are hard to discuss objectively. One can measure performance against project plan goals, but there is no way to objectively gauge the correctness of those goals. Performance bugs come in two major flavors: "X is slower than it needs to be" and "X is slower than I want it to be". For the former, it is often subjective whether the wasted performance is important. Reducing a particular function call from 10ms to 1ms may be critical if that function is called millions of times per day, and completely unimportant if that function is called once during system startup. (Perhaps we can articulate a distinction based on expected seconds of wasted time per user day.) For the latter (slower than I want), how critical is this performance really? How reasonable are the user's expectations? Sometimes the performance is painful, but is a matter of physics or information theory and cannot be improved. It would be good to articulate performance requirements past which the product would be considered unusable, but that's very hard to do and runs the risk of setting a "negative goal" where it isn't considered necessary to exceed the minimum requirement.
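
For example, the "seconds of wasted time per user day" idea could be made concrete with a back-of-the-envelope calculation like the following; the call counts here are invented purely for illustration.

    /*
     * Illustrative arithmetic only; the call counts are invented.
     * Converts a per-call saving into expected seconds wasted per day,
     * one possible objective handle on "slower than it needs to be".
     */
    class WastedTime {
        public static void main(String[] args) {
            double savedPerCallMs = 10.0 - 1.0;   // 10ms reduced to 1ms

            long hotPathCallsPerDay = 2_000_000;  // e.g. called on every request
            long startupCallsPerDay = 1;          // e.g. called once at boot

            System.out.printf("hot path: %.0f s/day wasted%n",
                    hotPathCallsPerDay * savedPerCallMs / 1000.0);   // 18000 s/day
            System.out.printf("startup:  %.3f s/day wasted%n",
                    startupCallsPerDay * savedPerCallMs / 1000.0);   // 0.009 s/day
        }
    }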

Documentation bugs are necessarily subjective, since a documentation bug deals intrinsically with "soft" questions like whether the user will understand what the software is really doing.

Denial of service bugs are not discussed.

Probability of encountering a failure is not factored into the priority. For security bugs, probability must not be factored in because a malevolent user will cause the low-probability scenario to occur. For other bugs, a low probability might suggest a reduced priority, but no attempt is made to capture that here.

Footnotes

[1] Most generally, a user can perform an operation he is not authorized to perform. Specific examples: a non-root user can achieve root; a user who is not logged in can execute more than a strictly limited set of code (e.g. login, appropriate anonymous services); a user can create, delete, or modify a file without authorization.

    [2] "Missing" means that the feature is not present. It isn't offered to the user. "Inoperative" means that the feature is present and offered to the user, but fails.

    [3] "Crash" here means that a component fails and must be restarted, not merely that an operation yields an error, even an inappropriate error with a stack backtrace. If issuing an independent request without taking any recovery actions might succeed, "crash" does not apply.

    [4] User can view data he is not authorized to view.

    [5] "Remote failure" means a failure in a component outside the current system, and in particular refers to real-world failures like server power outages, network failures, et cetera. See also [6] and [7] below.

    [6] "Catastrophic local failure" means a failure in a component outside the current project and inside the current system, and in particular refers to power outages and kernel panics. Such failures should be handled as well as reasonable, which may not be very well.

    [7] "Corrupt internal data" means corruption in stored data or data retrieved from a server. It does not include erroneous user input, though it does include user input that has been purportedly validated. Corrupt internal data implies a bug in this or another component. Note that causing corruption is likely a P2 or P3 bug; this note applies only to the response to receiving the corrupt data. Note also that there may be serious performance penalties associated with revalidating data in the hopes of detecting such corruption, and so revalidation may, on balance, be undesirable.

    [8] Ideally, when a bug is detected in the current project, the operation should be cleanly aborted and diagnostic information provided that will allow the service and sustaining organizations to address the bug. For instance, unexpected Java exceptions should be reported (with stack backtrace) and should lead to a clean shutdown of the operation. This note applies to a failure to properly handle such a bug - for instance, this note almost always applies to catching and ignoring java.lang.Exception or its generic friends Error, Throwable, or RuntimeException.
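
As a concrete (and hypothetical) illustration of the distinction in [8], the following sketch contrasts the swallow-everything anti-pattern with handling that reports the backtrace and aborts the operation cleanly; the class and method names are invented.

    /*
     * Hypothetical illustration of footnote [8]; the names are invented.
     */
    import java.util.logging.Level;
    import java.util.logging.Logger;

    class RequestHandler {
        private static final Logger LOG =
                Logger.getLogger(RequestHandler.class.getName());

        // Anti-pattern: the bug is silently discarded, the operation appears
        // to succeed, and service/sustaining has nothing to work with.
        void handleBadly(Runnable operation) {
            try {
                operation.run();
            } catch (Exception e) {
                // ignored -- this is the case footnote [8] is about
            }
        }

        // Better: report the unexpected bug with a backtrace and abort the
        // operation cleanly, so the failure is visible and diagnosable.
        void handleWell(Runnable operation) {
            try {
                operation.run();
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "internal error; operation aborted", e);
                throw e;   // abort this operation, not the whole service
            }
        }
    }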

    [9] A "confusing situation" is one in which the user is likely to misunderstand the results of an operation or in which it is not clear to the user what to do next. An error message that simply reported "failed" would be a confusing situation, since the user probably has no idea what to fix to address the problem. An error message when no error actually occurred would be another case, as would a command that apparently completed successfully even though it encountered problems.

    [10] A "dev/test blocker" is a bug that idles developers or testers. Breaking the build is such a case, because development and test builds cannot continue. Breaking a major piece of functionality is likely to be such a case, because developers are prevented from exercising the code they're working on and major blocks of tests may fail. Merely causing a single test to fail, or a few tests to fail, does not constitute "blocking". As a P1 bug tells the RE to drop everything else and address this bug, it must be used carefully: tests and development should be planned so that when a relatively minor failure is encountered staff can be redirected to other activities rather than being idled.

    [11] "Interoperability failure" refers to a current or future scenario where different components, or different versions of the same components, do not work together properly. The extent to which interoperability is required depends on the project, but for network transactions the usual standard is that all versions of one component must interoperate with all versions of the other. An "old" client must interoperate with a "new" server, and vice versa. For APIs the test is usually less strict; it is usually acceptable for a "new" application to require a "new" version of the library, but almost never acceptable for a "new" version of the library to require "new" applications. Note that many interoperability failure scenarios are future-looking ones of the form "what happens if today's version of the component attempts to interoperate with a version from next week?".