The study and use of distributed computing systems is fairly new. It has been created by the widely-remarked decreasing cost of computer hardware (especially for processors and memories) and the increasing availability of digital communications capabilities. These trends have created a situation in which it is economically justified to consider the creation of a computing system with multiple processors communicating with each other over a communications network.
The correct design of a distributed system is very difficult. The inherent parallelism of multiple processors working separately makes the design problem much more difficult than the design problem for a simple sequential system. In addition, testing and debugging a distributed system may be extremely difficult because of both the parallelism (and the resulting timing problems) and because of the distributed nature of the hardware.
Analysis techniques are needed which can detect errors (or potential errors). Eventually, we would hope that correctness proof systems will be available which will allow systems to be analyzed and proven correct. However, proof techniques for distributed systems have not yet been completely developed. As an alternative, we show in this paper that other forms of automatic analysis can be used to detect certain types of errors in distributed systems, and we propose a methodology for the construction of distributed systems which allows this analysis.
The analysis mechanism which we suggest is Petri nets [Peterson 1977]. Petri nets are a simple mathematical graph model of computation based upon the concept of concurrency and parallelism. Hence it seems a perfect base for the modelling and analysis of distributed systems.
Our methodology for the correct design of distributed systems consists of several steps:
At the same time, the language must be sufficiently constrained and simple to allow correct modelling and analysis. And, finally, the language must be implementable.
In the next section of this paper, we present our suggestion for such a language. Then we show how this language can be modelled as a Petri net and analyzed. Finally we show how it could be implemented.
We borrow the basic elements of our model of distributed processing (processes with disjoint memories communicating with messages) form existing concepts in operating systems.
The concept of a process is adapted directly from operating systems work. It is a convenient and useful concept for hiding the particular number of physical processors which happen to exist in a particular system. In a distributed system, the number of processors available may vary over time. The process concept allows the programmer to structure an application on the basis of the number of processors which could best be used for that application; the mapping of these virtual processors onto physical processors is a separate problem with which the programmer should not be concerned.
A major question is whether processes can share (virtual) memory. Sharing memory between processes creates several problems such as protection, synchronization, mutual exclusion, and defining the appropriate memory mappings.
Another problem caused by shared memory is related to mapping processes onto processors. If two processes share (virtual) memory, it would be all but impossible to map these processes onto separate processors since the shared virtual memory should be implemented by shared physical memory. This is contrary to the spirit of our model. In the best case each process should be mapped onto a separate processor and physical communication links should exist for each virtual communication link. This would largely eliminate the need for supporting system software for scheduling and routing.
For these reasons, we do not include shared memory in our model of distributed computation. This simplifies our system while not limiting the functionality of our system. The effect of shared memory can be achieved by message passing, admittedly at a higher cost. In return for this potentially higher cost, we have a simpler system in which all communication between processes is explicit.
Processes virtualize the physical processors. Similarly, messages virtualize the physical communications network into a virtual communications network: the message system. In reality, processors communicate by sending streams of bits over the wires between the processors. The particular configuration of these wires and formatting of the bit streams should be hidden from the programmer. In the first case, the links between processors will certainly vary over time due to both line failures and changes in the net topology. In addition the programmer sees only processes, not processors, and so has no information about the locations of the two processes which wish to communicate.
Thus, the message system provides both an appropriate conceptual virtualization and a solution to the communication problem between processes. The message system provides a virtual communication facility for the processes of the distributed application. Messages, containing requests and replies for information and services are passed from process to process providing virtual communication links between any two processes as needed.
As we have mentioned, processes communicate with each other by messages; the message system virtualizes the underlying physical communication system. In addition, we feel that the message system may also transform the semantics of message passing to provide a more convenient user interface.
Reliability is one aspect of the communication network which should be improved for the programmer by the message system. An error in transmission across the physical communication network should result in automatic retransmission, perhaps over an alternate network path, at least some few times. This increases the apparent reliability of the communication network for the programmer.
Producing a logical message system which differs from the physical communication system requires careful thought about the desired semantics of the message system. Two operations are necessary in a message system: send and receive. The semantics of these operations must be carefully defined, since they affect the ease-of-programming of the system as well as such properties as analysis and performance. To try to match the semantics of the message system to the way in which it is used, let us look at some of the uses which must be possible in the system.
First, we would expect much of the use of the message system to result from a request by a user process for some service from some server process, followed by a later reply from the server to the user. Examples might include a request to a compiler to compile a file, followed by a reply that compilation is complete and the number of errors discovered. Another example might be a request to a data base for some information followed by a reply of the appropriate information.
Not all message flow will fit this simple structure, however. One can easily envision systems where a single request is followed by a "beginning compilation" and "compilation complete" reply messages. Or a sequence of "requests" with no replies, as in a performance or debugging monitor system consisting of an initial message notifying the monitor of the start of monitoring, followed by a stream of messages -- one for each significant event. No "reply" would be needed or desired.
Thus, although much of the use of the message system may be for request/reply messages, we must also make provision for the cases when messages are not paired in this way. Notice that a request/reply interchange can be effected by simply two messages: first from user to server and later from server to user. So a non-paired system can provide paired messages. Similarly in a paired system, use of a "null" reply (which is ignored by the receiver) would allow the form of all communication to be request/reply pairs while the actual information flow would correspond to the unpaired system. Therefore, neither system (paired or unpaired) is inherently more powerful that the other. We would suggest that the unpaired system would both be easier to implement and more closely conform to a variety of applications.
In general, we can define a conversation as a sequence of messages sent and received between two processes concerning a single application operation. One of the most common conversations might be of the simple request/reply form.
In the simplest case, a conversation will occur between a pair of processes with messages passing back and forth between these two processes. The identity of the two processes will be completely defined when the application is written. However, there are other structures which need consideration.
Consider the case of the user process and the server process. A simple system would have the user send a request directly to the server. The server would receive directly from the user. More generally, however, there will be multiple users. How does the server know which user to receive from? The server cannot know in advance which user to receive from, but should receive from the first user requesting service. Thus, we may have multiple users sending to a single server. Similarly, there may be multiple servers available to provide a service. The number and identity of these server processes is not of importance to the user and may in fact change over time. Thus, a single user may want to send to any of a number of server processes.
Notice however that once a conversation between a particular user and a particular server has been established, that future messages sent in this conversation must be directed to the appropriate single process. This need not always be the same process since either the user or the server may be a team or pipeline of processes with the conversation proceeding from team member to team member. In any case, the sender indicates to which process the reply should be made.
What is needed for the multiple-user, multiple-server problem is an ability to define a set of processes and send to or receive from any member of that set. In addition we would require the following property: If a user is requesting service and a server is awaiting such a request, then the request should be sent to a waiting server without undue delay. We do not want servers to be idle while users wait for service. The problem is how to assure this property in a distributed system with multiple users and multiple servers.
If we had multiple users, but only one server, then we could simply notify the one server of all requests. The server would select the next request to receive; other requests would have to wait. With multiple servers, but only one user, the servers could keep the user notified of their availability; the user could then select an available server and send the request directly to that server. However, with multiple users and multiple servers, neither group can keep the other group correctly informed of their status. If a single server told the users that it was available, it would then have to tell them it was unavailable when it received a request from any of them. The update problem is just as bad if users notify servers of pending requests; when a server picks up the request, all other servers must be informed that this request is no longer pending.
The solution is to bring pending requests and server status information together in one place. We can define a mailbox as a place where messages are placed. A separate mailbox is defined for each set of servers. Requests are sent to the mailbox by user processes and received from the mailbox by server processes. The problem now becomes how to match incoming requests with servers; the most common solution is first-come-first-served.
An alternate approach is simply to define a scheduler process for each set of servers. All requests are sent to the scheduler, as well as messages from the servers indicating that they have finished their last request and are ready for the next request. The scheduler process pairs user requests with available servers, forwarding the request to the server and changing its internal tables to indicate that the request is no longer pending and the server is no longer available. This solution allows an algorithmic pairing of requests and servers and, since the scheduler process is just a process, eliminates the need for a separate mailbox construct.
The major advantage of the mailbox/scheduler approach is that messages are always sent directly to a specific location, either to the mailbox/scheduler (by the user process) or on to a specific (waiting) server process (by the scheduler). This reduces the multiple-user/multiple-server problem to the multiple-user/single-server problem. In addition, by using a scheduler, the order of service can be adjusted as desired.
As an example, consider a set of line printer driver processes, each with its own line printer. To print a file, the file name is sent to the line printer driver process scheduler. The scheduler can then adjust the order of printing among waiting files by time of arrival, size of file, time of day, printer characteristics, and so on, as desired.
As another example, consider a more complex situation in which a conversation is needed with a server. The user process would first send a message to the scheduler process asking for service. When a server is free, the server would notify the scheduler which would then forward the request for service on to the server. The request would include the reply name of the user process requesting service. The server would send a message to the user process; this message would identify the server. A conversation could then occur between the server and the user. When the conversation ended, the server would send another message to its scheduler, while the user process continues with its program.
Since some processes may need to carry on several conversations simultaneously, messages are sent to and from ports. A port is a named queuing station for messages. Each port is associated with its unique defining process; only the defining process can receive from the port although multiple processes can send to the port. A port provides queueing to store messages sent but not yet received. Port names are unique throughout the distributed system. If fact, process names are relatively unimportant, since all communication with a process is by its port names, not its process name.
Standard (parameterized) scheduler processes might be provided by a library to reduce the need for a programmer to write many schedulers.
Note that this scheme has several advantages:
The design, programming and construction of a distributed system is a difficult task. Part of this difficulty is due to the degree of concurrent activities in the system and the need to consider all possible interactions between the concurrent cooperating processes. To effectively utilize the potential concurrency, the programmer must use modern programming techniques to create the system. This will require the creation of a suitable high-level language for distributed programming.
Several efforts are underway to design such a language; typically these are based upon existing modern programming languages, such as CLU [9, 10] or Pascal.
The highest level system design is as shown in Figure 3. The system consists of the definitions of a number of processes. In addition, there are a number of system-wide constants and data types. Remember that our definition of a distributed system provides a unified software system; thus some constructs may be constant across the entire system.
However, since processes do not share memory, there are no global variables or procedures. Hence, data abstractions which are defined system wide and which include procedural operations (such as a data abstraction in a CLU-like language), must be implemented in a macro fashion, by duplicating the code for these operations within all affected processes.
A process is defined much the same as with the definition of a conventional program (see Figure 4). Data abstractions, constants, variables and procedures and functions, local to the process are defined. One distinguished procedure, block or statement defines the starting address of the process. In addition, ports are declared and the type of messages for each port is defined.
Messages are strongly-typed objects. The type of a message is defined by the programmer. Naturally, the type of a message which is sent to a port must be compatible with the type defined for that port. As with other type definitions, this may require a run-time check unless it can be proved at compile-time that the types are compatible. Port types will typically be defined system-wide.
It may be necessary to send several types of messages to a process. Since message types are defined for each port, multiple ports allow multiple types to be sent from process to process. Alternatively, a variant type may be utilized to send multiple types through the same port. A tag field would be added to each message defining the type of the message.
Messages will typically be records with fields for reply port names, authentification information and other fields.
Given the code for a distributed system, as proposed above, and knowing the difficulty of correctly designing such a system, analysis techniques will be needed to detect errors (or potential errors). Eventually, correctness proof systems should be available, but even now more limited analyses can detect some errors [4, 12]. More extensive automatic analysis procedures should also be possible.
The approach to distributed systems which has been presented provides a basis upon which distributed algorithms and applications may be built. The simplicity and generality of our proposal allow the application or algorithm designer great freedom. This freedom is necessary because of the newness of the area; we do not yet know what restrictions are reasonable and appropriate.
But this very freedom and generality of design may allow certain obvious errors to be programmed into newly designed systems -- For example, deadlock or incorrectly updated distributed databases. Thus, we need an analysis mechanism which can determine if these errors have actually occurred in a system.
This analysis mechanism may exist in Petri nets [Peterson 1977]. Petri nets are a simple mathematical graph model of computation.
The design methodology consists of the following steps:
The same text which is used to produce the running distributed system can also be used to define the Petri net model. The majority of the program can be easily modelled. The message system has been explicitly designed to be both natural and easy to understand, but also to be modelled by Petri nets. A port of the system, being a queue of messages, is modelled by a place in the Petri net. A message is modeled by a token. A send instruction places a token in the port-place; a receive instruction removes a token from the port-place.
The translation from system text to Petri net can thus be completely mechanical. By using the actual system text, we are assured that the model is a correct representation of the actual system; no errors will be introduced into the modelling process by the intervention of a human modeller. However, the Petri net model, like other graph models, is an uninterpreted model: the specific semantics of specific operations and tests cannot be effectively modeled. This may lead to some problems.
A larger problem is the analysis of the resulting Petri net. It is suspected that a Petri net mechanically produced from the text of a nontrivial distributed application will be very large, containing thousands of places and transitions. Some preliminary work has been done on reducing the size of a Petri net while maintaining analysis properties [Kowalk and Valk 1979]. It is expected that more work on the reduction of Petri nets to manageable size is needed.
Finally, after the Petri net has been created and reduced, analysis can begin. Existing analysis techniques can be used for several problems. It is possible to determine, for example, if a bound exists on the number of tokens in a place. The existence of such a bound (and its value) can be used to determine buffer sizes for message queues. Ports which by the design of the system can only have a small bounded number of messages can have buffer space statically allocated, eliminating the need for dynamic memory allocation algorithms and the possiblity of running out of buffer space.
Other properties, such as mutual exclusion and conservation of resources, can also be handled. Properties such as deadlock are believed to be decidable for Petri nets, but decision algorithms are still being developed. Research is needed into what other questions may be asked and how they can be answered using the Petri net model.
The construction of a distributed system from the high-level specifications of the form of Figures 3 and 4 proceeds in several steps. First the system would be compiled. The compiler would perform syntax checking, type checking and code generation. The types of all messages would be checked for compatibility with the type of the port for each send and receive statement.
The result of an error-free compilation would be a set of process blocks. Each process block would define the name, ports, and code of a process. These process blocks define a virtual network of processes and their communication links. This virtual network must then be implemented by mapping it onto the underlying physical collection of processors and communication network.
This mapping maps processes onto physical processors and messages onto sequences of bits sent through the communication network. Mapping processes onto processors defines a partition of the processes into classes. All processes in a particular class are implemented on the same processor. If there are sufficient processors, each process can be placed in a separate class; otherwise multiprogramming techniques must be used to provide a virtual processor for each of the processes in the class for a given physical processor.
Several criteria may be used to determine the partitioning of processes. The appropriate partitioning may be determined by ad hoc, non-technical criteria, or may seek to optimize some quantity. Two such optimization criteria would be a desire to minimize traffic over the communication network (by grouping on the same processor those processes which communicate frequently) or to partition processes so as to equalize the workload on each processor. Such optimization decisions will need performance data or estimates on message traffic or execution times in addition to a description of the abstract network.
The partitioning of processes to processors may also be affected by the topology of the underlying communication network. The routing problem is concerned with the mapping of messages between two processes into a path (or set of paths) through the communication network from the processor of one process to the processor of the other.
The mapping of the virtual network onto the physical network can be thought of as either static or dynamic. A static mapping does not change over time. Thus, a process will be assigned to a particular processor and will always exist on that processor. A dynamic mapping may change over time, allowing processes to move from processor to processor as the demands on the system change. A dynamic mapping will reevaluate the optimality of the mapping at various times and change the assignment of processes to processors and the routing of messages as necessary to attempt to improve performance.
In fact, reliability and growth considerations will typically require a dynamic mapping since routing and processor assignments will change to reflect the failure of communication lines and processors or the addition of new processors to increase capability. However, growth may be sufficiently rare as to allow a static view to be satisfactory. Thus, it is reliability in the face of failures which mandates a dynamic mapping. A dynamic mapping is a more difficult system design problem than a static mapping.
A further dichotomy can be made between manual and automatic mappings. A manual mapping is performed by a person who analyses the virtual network and specifies which virtual entities are to be mapped onto which physical entities. A more difficult problem is to define an algorithm which allows a computer to mechanically examine the network and define the mapping.
The trend will be towards automatic, dynamic mapping of the virtual network onto the physical network. The reliability requirement dictates a dynamic mapping, and a dynamic mapping all but requires an automatic mapping. An analogy can be made here to the development of virtual memory. Memory mapping was originally static and manual (absolute programming) but gradually became static and automatic as relocatable code and compilers came into use. Overlays were a dynamic, manual form of virtual memory, while paging and segmentation allow dynamic, automatic memory mapping. A similar trend will occur with the mapping of the virtual network onto the physical network.
After the mapping has been defined, the final system can be generated. If processors are heterogeneous, some procedures may need to be re-compiled into the machine language of the appropriate processor. If multiple processes are mapped onto the same processor, system code for multiprogramming will need to be added. Optimizations may allow reentrant code within separate processes which are mapped onto the same processor to be shared. The result would be modules to be loaded into each processor in the system to create the distributed system.
The definition of appropriate mapping algorithms to map processes onto processors is an important optimization problem. Some attempts have been made [14, 15] to define automatic mapping algorithms, but much is still possible. Research is needed both on the criteria for optimization and the algorithms for these criteria. We can borrow much from operating system theory, especially cpu scheduling and memory management results . The process location problem (where to put each process) is related to partitioning problems in operations research.
An important point not to be overlooked is that the model of distributed systems we have proposed is static. The entire system is totally and completely defined when compilation, mapping and final code generation occurs. Hence all algorithms can be defined for a fixed, finite number and type of processes. These restrictions can be lifted to allow more flexibility (and complexity) in system design and more difficulty in programming, compiling, analysis and mapping. The first step would be the introduction of a variable number of processes, but of fixed types. Then this final restriction could be lifted, allowing new types of processes to be created during system execution. The correct design of dynamically changing systems is an order of magnitude more difficult than the design of the static systems which we are just beginning to create.
We have presented an approach to distributed computing. Our objective is to define a small simple set of features which, while taking advantage of current programming methodologies (modularity, strong typing, hierarchical design, ...) does not overly restrict the system designer. We believe that the approach outlined here -- of a relatively unrestricted programming system followed by (automatic) Petri net analysis may lead to the development of practical and correct distributed systems.
begin_system constants data abstractions process A; ... process B; ... ... process M; ... end_system;
process I; constants data abstractions variables procedure a; ... procedure b; ... ... procedure n; ... main_program; ... end_process I;