Troubleshooting is a form of problem solving. It is the systematic search for the source of a problem so that it can be solved. Troubleshooting is often a process of elimination: eliminating potential causes of a problem. Troubleshooting is used in many fields such as system administration and electronics.
In general, troubleshooting is the identification or diagnosis of "trouble" in a system. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining the causes of these symptoms.
A system can be described in terms of its expected or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in hardcopy emerging from some specific device.) Any unexpected, particularly undesirable, behavior is a symptom, and troubleshooting is the process of isolating its specific cause or causes. Frequently the symptom is a failure to observe any results. (Nothing was printed, for example.)
Most discussion of troubleshooting, and especially training in formal troubleshooting procedures, is extremely domain specific. The bulk of the material is relevant to a particular field of endeavor (such as automotive repair, computer hardware services, or software systems support). However, troubleshooting has common elements regardless of the specifics.
Any system can be described in terms of its components or subsystems. Each subsystem can be described in terms of its expected behavior. So the response of a system to its inputs can be described as a cascade of inputs and results among its components. (For example, selecting the "print" option in a computer application may cause the software to call on a separate utility, such as lpr on a UNIX system; that in turn might open, read and parse a number of configuration files, which might direct it to perform some form of hostname address resolution via DNS, NIS, or LDAP, and then initiate a TCP/IP connection to a specific network device, and so on.)
The domain-specific knowledge that drives the troubleshooting process is the understanding of these systems in terms of the interactions and dependencies among their subsystems and components. In particular, the specialist can enumerate the components and knows a set of procedures for testing many of them in isolation from the system as a whole. (For example, the systems administrator may know which configuration files lpr is trying to parse and may read them manually, check their permissions, or assume the identity of the user who is experiencing the problem and manually run an lpr command from the system's shell prompt; this may isolate the problem to the application's configuration, the user's preference settings, the workstation's configuration or network settings, the network's name services domain, or back to the printer's configuration or hardware.)
Well-designed systems have designated "test points" or monitoring instrumentation. (For example, most printers have indicator lights which change colors or blink, or LCD panels which display messages for detectable problems: paper jams, empty paper trays, network or other cable disconnection, etc. As another example, UNIX and Linux systems support system call tracing through commands such as truss, strace, and ktrace.)
Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there".) However, there is a well-known principle that correlation does not imply causality. (For example, the failure of a device shortly after it's been plugged into a different outlet doesn't necessarily mean that the events were related. The failure could have been a matter of coincidence.)
It's useful to consider the common experiences we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of a bulb's filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear-and-tear of components in a system.
A basic principle in troubleshooting is to start with the simplest and most probable problems. This is illustrated by the old saying "When you see hoof prints, look for horses, not zebras", or, to use another maxim, by the KISS principle. This principle results in the common complaint about help desks or manuals, that they sometimes first ask: "Is it plugged in and does that receptacle have power?" This should not be taken as an affront; rather, it should serve as a reminder to always check the simple things first before calling for help.
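One way to make "simplest and most probable first" concrete is to rank candidate checks by how likely each is to be the cause relative to how much effort it takes to test. The sketch below is only an illustration: the check names, probabilities and costs are invented, and the probability-per-minute ratio is one reasonable ranking heuristic, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    probability: float   # rough guess that this is the cause (invented value)
    cost_minutes: float  # rough effort needed to test it (invented value)


def order_checks(checks):
    """Sort checks so that cheap, likely causes are examined first."""
    return sorted(checks, key=lambda c: c.probability / c.cost_minutes, reverse=True)


checks = [
    Check("replace printer controller board", 0.05, 60),
    Check("power cable plugged in", 0.30, 1),
    Check("paper tray empty", 0.25, 1),
    Check("driver misconfigured", 0.20, 15),
]

ordered = order_checks(checks)
for check in ordered:
    print(check.name)
```

With these made-up numbers the power cable and paper tray come first and the expensive board replacement comes last, which matches the help desk's habit of asking whether the device is plugged in.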
A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.
Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises tests (or perhaps consults a standardized checklist of tests) to eliminate these prospective causes. Two common strategies used by troubleshooters are to check for frequently encountered or easily tested conditions first (for example, checking to ensure that a printer's light is on and that its cable is firmly seated at both ends), and to "bisect" the system (for example, in a network printing system, checking to see if the job reached the server to determine whether a problem exists in the subsystems "towards" the user's end or "towards" the device).
This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the range of dependencies.
Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.
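The bisection idea can be sketched as a binary search over an ordered chain of stages. The stage names and the `reached` probe below are hypothetical, and the sketch assumes a single break in a strictly linear chain: every stage before the break passes the probe and every stage from the break onward fails it.

```python
def first_failing_stage(stages, reached):
    """Binary search for the first stage that the job did not reach correctly.

    `reached(stage)` is a caller-supplied probe returning True if the job
    arrived intact at that stage. Returns (stage or None, number of probes).
    """
    lo, hi = 0, len(stages)  # invariant: stages[:lo] pass, stages[hi:] fail
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if reached(stages[mid]):
            lo = mid + 1  # the break lies further "towards" the device
        else:
            hi = mid      # the break lies here or "towards" the user
    culprit = stages[lo] if lo < len(stages) else None
    return culprit, probes


# Hypothetical print pipeline with a simulated break at the network connection.
stages = ["application", "lpr client", "name resolution",
          "network connection", "print server queue", "printer"]
broken_at = stages.index("network connection")

culprit, probes = first_failing_stage(
    stages, lambda stage: stages.index(stage) < broken_at)
print(culprit, probes)
```

Here the break is located with three probes instead of the four that a front-to-back walk would need, and the saving grows with the length of the chain.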
It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.
A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn't notice the incorrect usage, for instance if two parts have different functions but share a common case so that it isn't apparent on a casual inspection which part is being used.
Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take and about organizing them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.
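As a rough illustration of a computerized troubleshooting table, a flowchart can be stored as a small lookup structure and walked using the user's yes/no answers. Everything in this sketch (the node names, questions and advice strings) is invented for the printer example and is not taken from any real product's procedure.

```python
# Each decision node maps to: (question, next node if "yes", next node if "no").
FLOWCHART = {
    "power": ("Is the printer's power light on?", "cable", "restore_power"),
    "cable": ("Is the cable firmly seated at both ends?", "server", "reseat_cable"),
    "server": ("Did the job reach the print server?", "inspect_printer", "inspect_client"),
}

# Leaf nodes carry the recommended action.
ADVICE = {
    "restore_power": "Check the power cord and the receptacle.",
    "reseat_cable": "Reseat the cable at both ends.",
    "inspect_printer": "Look for faults between the server and the printer.",
    "inspect_client": "Look for faults between the user and the server.",
}


def walk(answers, start="power"):
    """Follow the flowchart using a dict of question -> bool answers."""
    node = start
    asked = []
    while node in FLOWCHART:
        question, if_yes, if_no = FLOWCHART[node]
        asked.append(question)
        node = if_yes if answers[question] else if_no
    return ADVICE[node], asked


answers = {
    "Is the printer's power light on?": True,
    "Is the cable firmly seated at both ends?": True,
    "Did the job reach the print server?": False,
}
advice, asked = walk(answers)
print(advice)
```

Keeping the table as data rather than as nested conditionals means that the procedure can be reviewed and reordered in advance without touching the code that walks it.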
One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often considerable effort and emphasis in troubleshooting is placed on reproducibility, that is, on finding a procedure that reliably induces the symptom.
Once this is done then systematic strategies can be employed to isolate the cause or causes of a problem; and the resolution generally involves repairing or replacing those components which are at fault.
Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem. Another, extremely common, problem in electronic and electro-mechanical systems
In computer programming, race conditions often lead to intermittent symptoms which are extremely difficult to reproduce. Various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to "heating up" a component in a hardware circuit), while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
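The "force synchronization" approach can be sketched with a deliberately unsafe counter. In the hypothetical example below, two `threading.Event` objects pin the threads to the one interleaving that loses an update, so a symptom that would normally be intermittent appears on every run. The hook parameters are an assumption of this sketch; real code would need comparable test hooks or instrumentation.

```python
import threading

counter = 0


def unsafe_increment(after_read=None, before_write=None):
    """Read-modify-write with no lock; optional hooks let a test force the timing."""
    global counter
    value = counter          # read the current value
    if after_read:
        after_read.set()     # announce that the (soon stale) value has been read
    if before_write:
        before_write.wait()  # hold here until the other thread has finished
    counter = value + 1      # write back, possibly overwriting a newer value


def reproduce_lost_update():
    """Force the interleaving in which thread A overwrites thread B's update."""
    global counter
    counter = 0
    a_has_read = threading.Event()
    b_is_done = threading.Event()

    def thread_a():
        unsafe_increment(after_read=a_has_read, before_write=b_is_done)

    def thread_b():
        a_has_read.wait()    # start only after A holds the stale value
        unsafe_increment()
        b_is_done.set()

    threads = [threading.Thread(target=thread_a), threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


result = reproduce_lost_update()
print(result)  # two increments ran, but the forced interleaving leaves 1
```

Once the symptom can be induced on demand like this, the usual isolation strategies apply. In this sketch the repair would be to guard the read-modify-write with a lock.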
Intermittent issues can be defined thus:
An intermittent fault is one which occurs irregularly or inconsistently.
—Steven Litt, [1]
In particular, he asserts that there is a distinction between frequency of occurrence and a "known procedure to consistently reproduce" an issue. For example, knowing that an intermittent problem occurs "within" an hour of a particular stimulus or event (but that sometimes it happens in five minutes and other times it takes almost an hour) does not constitute a "known procedure", even if the stimulus does increase the frequency of observable exhibitions of the symptom.
Nevertheless, sometimes troubleshooters must resort to statistical methods, and can only find procedures to increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the root cause has been found and that the problem is truly solved.
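A simple way to put a number on that confidence is to ask how likely a run of clean trials would be if the fault were still present. The sketch below assumes that each trial is independent and that the unfixed fault shows up at a constant rate per trial. Both assumptions are simplifications, and the 10% rate is an invented figure.

```python
import math


def chance_of_clean_run(failure_rate, clean_trials):
    """Probability of seeing no failures in `clean_trials` trials if the fault is still present."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return (1.0 - failure_rate) ** clean_trials


def trials_needed(failure_rate, max_chance=0.05):
    """Smallest number of clean trials whose chance of being a fluke is at most `max_chance`."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))


# Invented figure: before the attempted fix, the symptom appeared in about 10% of trials.
needed = trials_needed(0.10)
print(needed, round(chance_of_clean_run(0.10, needed), 4))
```

The rarer the symptom, the more clean trials are needed before its absence means much, which is why intermittent problems leave lingering doubt even after an apparent fix.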
Isolating single component failures which cause reproducible symptoms is relatively straightforward.
However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault-tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will "take it down."
Even in simple systems the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them.)
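A toy simulation shows why swapping each part back out can miss a double fault. The part names and the rule that the system works only if every part is good are assumptions made for illustration.

```python
def system_works(parts):
    """Toy model: the system works only if every installed part is good."""
    return all(parts.values())


def substitute_with_swap_back(parts):
    """Try one known-good part at a time, restoring the old part if the symptom persists."""
    for name in parts:
        trial = dict(parts)
        trial[name] = True  # swap in a known-good part for this trial only
        if system_works(trial):
            return name     # a single replacement cured the symptom
    return None             # no single replacement helped


def substitute_and_keep(parts):
    """Swap in known-good parts one by one and leave each in place."""
    current = dict(parts)
    replaced = []
    for name in list(current):
        if system_works(current):
            break
        current[name] = True
        replaced.append(name)
    return replaced if system_works(current) else None


# Two simultaneous faults: True means the part is good, False means it is faulty.
parts = {"cable": False, "power supply": True, "controller": False}

print(substitute_with_swap_back(parts))
print(substitute_and_keep(parts))
```

With two faulty parts, the swap-back strategy reports nothing because no single replacement cures the symptom. Leaving the good parts in place does restore the system, at the cost of replacing some parts that were never faulty.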
Note that, while we talk about "replacing components", the resolution of many problems involves adjustments or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose contacts", might simply need to be cleaned and/or tightened. All discussion of "replacement" should be taken to mean "replacement or adjustment or other maintenance."