IEN 104

Minutes of the Fault Isolation Meeting
12 March 1979

Virginia Strazisar
Bolt, Beranek, and Newman
20 March 1979

Minutes of the Fault Isolation Meeting held at BBN on March 12.

Attendees:

   Virginia Strazisar, BBN, chairman
   Peter Sevcik, BBN
   Dale McNeill, BBN
   Noel Chiappa, MIT
   Ray McFarland, DOD
   Mike Wingfield, BBN
   Jack Haverty, BBN
   Bill Plummer, BBN
   Mike Brescia, BBN

Ginny suggested that there are three situations in which fault isolation is needed:

1) a user at a terminal on the catenet who cannot reach some destination on the catenet,

2) a catenet control center that must decide which network or gateway in the catenet has failed, and

3) a gateway implementor who must decide what part of the gateway hardware or software has failed.

These situations were put forth as a framework for discussing the types of fault isolation facilities that we need. Ginny stated that the object of the meeting was to draw up a list of the fault isolation tools needed, giving special consideration to the situations in which each of these tools would be used and the questions they could be used to answer. From the suggestions drawn up at the meeting, the detailed formats and protocols can later be designed; this level of design was specifically avoided at the meeting.

The first situation discussed was the user at a catenet terminal who discovers either that he cannot connect to a particular destination host or that he no longer gets any response on a previously working connection. At present no information is passed to the user in either of these cases. Everyone agreed that the user should receive some error reply. It was suggested that the user receive a response indicating that 1) the destination host is unreachable, 2) the local gateway or network is unreachable, or 3) the catenet is inoperative. Most people agreed that the naive user does not care to know what the catenet problems are in any more detail than this. For example, an error message of the form "Can't reach destination network because gateway 3 is down" would be totally useless to the naive user. The user also wants to know when the service will be restored: either "within a short time", such that the user is willing to wait for the service to be restored, or "not for a long time", such that the user will quit trying to use the service for the present.

Several people pointed out that a more sophisticated user may want to know exactly what component of the catenet failed. There was some discussion as to whether users should be given access to tools that would enable them to probe the catenet gateways to determine where a failure occurred. The consensus was that the user should be given access to such tools, but that no user should be required to use them. Our model was that the naive user, on receiving an error message, would call a network or catenet control center, whereas the more sophisticated user might attempt to track down the problem before contacting the control center.

We discussed in more detail what sort of message a gateway could return to the user. It was suggested that if the network returned an error message about a specific host, that error message (text) should be returned verbatim to the user. It was also suggested that error codes be defined for "common" failures, such as net down and host down, and that these be included in the error message.
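As an illustration of the kind of error reply discussed above, a minimal sketch of one possible format follows. The field names, code values, and C representation are assumptions made here for concreteness; no format was settled on at the meeting.

   /* Hypothetical error reply a gateway might return to a source
      host.  Codes and layout are illustrative only. */

   enum fault_code {
       DEST_HOST_UNREACHABLE = 1,   /* destination host unreachable  */
       LOCAL_NET_UNREACHABLE = 2,   /* local gateway/net unreachable */
       CATENET_INOPERATIVE   = 3    /* catenet-wide failure          */
   };

   enum restore_estimate {
       RESTORE_SOON      = 1,       /* worth waiting for service     */
       RESTORE_LONG_TIME = 2        /* user should give up for now   */
   };

   struct fault_reply {
       unsigned char code;          /* one of enum fault_code        */
       unsigned char restore;       /* one of enum restore_estimate  */
       char          text[64];      /* error text passed verbatim    */
                                    /* from the network, if any      */
   };

A naive user's program would map the code and restore fields onto a short message; the text field carries any network-supplied error text through unchanged, as suggested above.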
It was pointed out that the gateways currently return messages to the source host if they believe (based on their routing information) that the destination network is unreachable. These messages contain the source and destination addresses and the protocol field from the original datagram. Several people pointed out that this information is insufficient for returning an error message to the source user and that the entire internet header of the original datagram should be returned in the error message.

We discussed the problem of what to do in the case where datagrams are lost in a gateway or network in such a manner that no error message is generated and returned to the source. It was decided that in this case the source host should automatically probe the gateways in order to return a reasonable status message to the user. It was assumed that the user is running a program that implements some type of internet protocol, such as TCP, and that this program is capable of detecting long delays or multiple retransmissions and of generating some type of probe packet to attempt to track down the failure when this occurs. These probe packets are discussed in more detail below. Information obtained from such probing could also be sent to a monitoring center.

We discussed the concept of a monitoring or control center. The primary purpose of a monitoring or control center, in terms of fault isolation, is to isolate the component (network or gateway) that failed and to notify the proper authority to have it fixed. We felt that a control center was needed to avoid having all the users in the catenet calling any and all implementors they felt might be responsible for problems. The concept of a single control center was discussed and rejected for both technical and political reasons. From the technical point of view, it was pointed out that the catenet could become partitioned such that the control center was cut off from part of the catenet and thus could no longer handle faults in that portion of the catenet. On the political side, it was pointed out that the organizations responsible for the individual networks might be unwilling to support one control center run by a single organization. We agreed that the catenet control center should actually be multiple control centers: either the existing network control centers working in cooperation, or separate catenet control centers, each established by cooperating network groups. Among the tools these control centers would need is a facility to probe gateways to determine why a particular destination is unreachable.

We elaborated slightly on the design of a facility for probing gateways. A host or control center sends its local gateway a message saying "poll the gateways in the catenet to determine why I cannot get to destination X". The gateway then polls its neighbors, its neighbors' neighbors, etc., extracting routing tables, addresses of neighbor gateways, status of neighbor gateways and networks, etc., to determine why the destination is unreachable. The gateway would then formulate a response to the host of the form "the network connection between gateway 3 and net 2 is down", "gateway 5 and gateway 6 are down", etc. This mechanism would be an extension of the gateway-gateway protocol defined in IEN #30. The probe facility would be used by the source host to generate a message to the user in the case where no response is received from the destination and no error message is returned by the gateways. It would also be used by catenet control centers to isolate the component of the catenet that has failed.
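The sketch below shows one way a gateway might carry out such a poll: a breadth-first expansion over neighbor gateways, as implied by "its neighbors, its neighbors' neighbors, etc." above. The gw_report structure, the query_gateway exchange, and the hardwired example topology are assumptions made for illustration; the actual exchange would be an extension of the gateway-gateway protocol of IEN #30, which was not designed at the meeting.

   #include <stdio.h>
   #include <string.h>

   #define MAX_GATEWAYS 64   /* assumes small integer gateway addresses */

   /* What a polled gateway reports about itself; hypothetical. */
   struct gw_report {
       int up;                  /* is this gateway responding?          */
       int can_reach_dest;      /* does its routing table reach X?      */
       int nneighbors;
       int neighbor[8];         /* addresses of its neighbor gateways   */
   };

   /* Stand-in for the real poll exchange: a hardwired topology in
      which gateway 2 is down and gateway 1 cannot reach the net. */
   static struct gw_report query_gateway(int gw, int dest_net)
   {
       struct gw_report r = { 1, 1, 0, { 0 } };
       (void)dest_net;
       switch (gw) {
       case 0: r.nneighbors = 2; r.neighbor[0] = 1; r.neighbor[1] = 2; break;
       case 1: r.can_reach_dest = 0; break;
       case 2: r.up = 0; break;
       }
       return r;
   }

   /* Breadth-first poll from the local gateway: visit each gateway
      once and report anything that explains the unreachability. */
   static void poll_for_fault(int local_gw, int dest_net)
   {
       int queue[MAX_GATEWAYS], head = 0, tail = 0;
       int visited[MAX_GATEWAYS];
       memset(visited, 0, sizeof visited);

       queue[tail++] = local_gw;
       visited[local_gw] = 1;

       while (head < tail) {
           int gw = queue[head++];
           struct gw_report r = query_gateway(gw, dest_net);

           if (!r.up) {
               printf("gateway %d is down\n", gw);
               continue;            /* a down gateway yields no neighbors */
           }
           if (!r.can_reach_dest)
               printf("gateway %d cannot reach net %d\n", gw, dest_net);

           for (int i = 0; i < r.nneighbors; i++) {
               int n = r.neighbor[i];
               if (n >= 0 && n < MAX_GATEWAYS && !visited[n]) {
                   visited[n] = 1;
                   queue[tail++] = n;
               }
           }
       }
   }

   int main(void)
   {
       poll_for_fault(0, 5);   /* "why can I not get to net 5?" */
       return 0;
   }

The local gateway would condense the lines printed here into the kind of response suggested above, e.g. "gateway 2 is down".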
It was pointed out that we should be concerned not only with total failures, but also with system performance, especially delay. In this context we were not concerned with cases where delay seems slightly longer than usual, but rather with cases in which traffic crosses the catenet with extremely high delays, i.e., several minutes. A facility was suggested for tracking this sort of problem: generate a packet at source A addressed to destination B; have this packet trace its route and be timestamped at each gateway on the route to B; at B, echo the packet; return the packet to the source, A, using source routing and the route stored in the packet by the trace mechanism; and timestamp the packet at each gateway on its route back to A. The timestamps in the packet can then be interpreted to yield transit times across each network, since there is a pair of timestamps for each gateway traversed. (A sketch of this interpretation follows the summary at the end of these minutes.)

The final stage of fault isolation is the situation in which the failure has been attributed to a particular gateway and the implementor of that gateway must debug it. This part of fault isolation was not discussed in detail. It was suggested that at this point it would be very useful to be able to turn off timeouts in the catenet, to avoid having the state of the catenet change in such a way that the problem can no longer be isolated.

In summary, the following list of tools, and the situations in which they would be used, was suggested:

1) Error messages indicating whether the destination host, the local network or gateway, or the catenet has failed, and indicating the time at which service should be restored. These are to be returned automatically to the catenet user whenever there is a failure in using a catenet service.

2) A gateway-to-gateway probing mechanism that can be initiated with a host-to-gateway message. This mechanism would be used by a control center to isolate a component failure. It would also be available to the user, and it would be used by source host protocol programs to formulate an error message for the user when no response was received from the destination and no error message was received from the gateways.

3) The ability to trace, echo, and source route packets with timestamping. This facility would be used to determine where delays are occurring when a destination is reachable but the delays cannot be accounted for.

4) The ability to echo packets off any gateway.

5) The ability to trace packets.

6) The ability to source route packets.

7) The ability to dump gateway tables.

8) The ability to trace packets by having every gateway that handles a packet send a reply.

These capabilities would be used by control centers and gateway implementors to isolate failed components and determine the reasons for failure. They were not discussed in detail. A description of mechanisms for tracing packets and source routing packets was given in IEN #30, although these have not yet been implemented.

The next step in developing fault isolation mechanisms for the catenet is to work out the detailed design of the mechanisms suggested above and to implement them in hosts, gateways, and control centers.
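The sketch promised above shows how the pairs of timestamps carried by a traced, echoed, and source-routed packet could be turned into per-hop transit times. The record layout, the millisecond units, and the example values are assumptions made here for concreteness; the actual trace and timestamp formats remain to be designed.

   #include <stdio.h>

   /* One record per gateway pass.  A packet stamped at each gateway
      on the way out and again on the source-routed return carries a
      pair of records for every gateway traversed.  Hypothetical
      layout and units. */
   struct stamp {
       int  gateway;     /* gateway that applied the stamp           */
       long ms;          /* time of day in milliseconds (assumed)    */
   };

   /* Print the transit time between each pair of consecutive stamps,
      i.e. the time spent crossing the network between those two
      gateway passes. */
   static void print_transit_times(const struct stamp *s, int n)
   {
       for (int i = 1; i < n; i++)
           printf("gateway %d -> gateway %d: %ld ms\n",
                  s[i - 1].gateway, s[i].gateway, s[i].ms - s[i - 1].ms);
   }

   int main(void)
   {
       /* Example route A -> gw 1 -> gw 2 -> B, echoed at B and
          returned over the same route.  The gap between the two
          stamps at gateway 2 includes the round trip to B. */
       struct stamp trace[] = {
           { 1, 1000 }, { 2, 1450 },   /* outbound                   */
           { 2, 1700 }, { 1, 2160 }    /* return, via source route   */
       };
       print_transit_times(trace, 4);
       return 0;
   }

A hop whose difference is far out of line with the others, e.g. minutes rather than hundreds of milliseconds, identifies the network in which the extreme delay is occurring.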