分佈式系統中時間、時鐘和事件順序

本文是翻譯 萊斯利·蘭伯特(Leslie Lamport)的論文《Time, Clocks, and the Ordering of Events in a Distributed System》(http://lamport.azurewebsites.net/pubs/time-clocks.pdf),Leslie Lamport 在1978年提出邏輯時鐘的概念,也就是Lamport時間戳(Lamport timestamps)。因為太長,我會分5個部分完成。這是第一部分,講了什麼是偏序。

Time, Clocks, and the Ordering of Events in a Distributed System

Leslie Lamport

Massachusetts Computer Associates, Inc.


The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

本文研究了分佈式系統中一個事件先於另一個事件發生的概念,並用此概念定義事件的偏序關係。給出了用於同步邏輯時鐘的分佈式算法,該邏輯時鐘可用於對事件進行全序排序。通過解決同步問題的方法說明了全序排序的使用。然後,將該算法用於同步物理時鐘,並得出不同機器的物理時鐘可以漂移到不同步範圍的上限。

Key Words and Phrases:

distributed systems, computer networks, clock synchronization, multiprocess systems

關鍵詞和短語:分佈式系統,計算機網絡,時鐘同步,多進程系統

CR Categories: 4.32, 5.29

CR 類別:4.32、5.29


Introduction 介紹

The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. We say that something happened at 3:15 if it occurred after our clock read 3:15 and before it read 3:16. The concept of the temporal ordering of events pervades our thinking about systems. For example, in an airline reservation system we specify that a request for a reservation should be granted if it is made before the flight is filled. However, we will see that this concept must be carefully reexamined when considering events in a distributed system.

時間的概念是我們思維方式的基礎。它源自事件發生順序的更基本概念。如果我們說某件事發生在3:15,那就是說發生事件時,我們的時鐘讀數為3:15之後且3:16之前。事件的時間順序概念充斥著我們對系統的思考。例如,在航空公司預訂系統中,我們如果在航班起飛前提出預訂請求,則應該得到批准。但是,我們將看到,在考慮分佈式系統中的事件時,必須仔細地重新檢查此概念是否成立。

A distributed system consists of a collection of distinct processes which are spatially separated, and which communicate with one another by exchanging messages. A network of interconnected computers, such as the ARPA net, is a distributed system. A single computer can also be viewed as a distributed system in which the central control unit, the memory units, and the input-output channels are separate processes. A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.

分佈式系統是由多個進程的集合組成,這些進程在物理空間上是分開的(分佈在不同的機器節點上),並且通過交換消息相互通信。計算機互聯網(例如ARPA網絡)就是一個分佈式系統。單個計算機也可以看作是分佈式系統,其中中央控制單元(CPU),存儲單元(RAM)和輸入輸出通道(I/O)是獨立的進程。如果消息傳輸延遲相對於單個進程中事件之間的時間差相比不可忽略,則該系統就是分佈式系統。

We will concern ourselves primarily with systems of spatially separated computers. However, many of our remarks will apply more generally. In particular, a multiprocessing system on a single computer involves problems similar to those of a distributed system because of the unpredictable order in which certain events can occur.

我們將主要關注空間上分離的計算機系統。但是,我們的許多言論將更普遍地適用。具體的說,由於某些事件的發生順序不可預測,因此單臺計算機上的多處理系統也會遇到與分佈式系統類似的問題。

In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation "happened before" is therefore only a partial ordering of the events in the system. We have found that problems often arise because people are not fully aware of this fact and its implications.

在分佈式系統中,有時很難說兩個事件中哪一個先發生。因此,“發生在之前”只是系統中事件的偏序關係。我們發現,經常出現問題是因為人們沒有完全意識到這一事實及其含義。

In this paper, we discuss the partial ordering defined by the "happened before" relation, and give a distributed algorithm for extending it to a consistent total ordering of all the events. This algorithm can provide a useful mechanism for implementing a distributed system. We illustrate its use with a simple method for solving synchronization problems. Unexpected, anomalous behavior can occur if the ordering obtained by this algorithm differs from that perceived by the user. This can be avoided by introducing real, physical clocks. We describe a simple method for synchronizing these clocks, and derive an upper bound on how far out of synchrony they can drift.

在本文中,我們討論了由“發生在之前”關係定義的全序排序,並給出了一種分佈式算法,用於將其擴展為所有事件的一致性全排序。該算法可以為實現分佈式系統提供有用的機制。我們以解決同步問題的簡單方法說明了它的用法。如果通過此算法獲得的順序與用戶所感知的順序不同,則可能會發生意外的異常行為。可以通過引入實際的物理時鐘來避免這種情況。我們描述了一種用於同步這些時鐘的簡單方法,並得出了它們可以漂移到不同步範圍的上限。

The Partial Ordering 偏序

Most people would probably say that an event a happened before an event b if a happened at an earlier time than b. They might justify this definition in terms of physical theories of time. However, if a system is to meet a specification correctly, then that specification must be given in terms of events observable within the system. If the specification is in terms of physical time, then the system must contain real clocks. Even if it does contain real clocks, there is still the problem that such clocks are not perfectly accurate and do not keep precise physical time. We will therefore define the "happened before" relation without using physical clocks.

大多數人可能會定義:如果事件a早於事件b發生,則事件a在事件b之前發生。他們可能會根據時間的物理理論證明這一定義是正確的。然而,如果要使一個系統能夠正確地滿足一個定義,則必須根據系統內可觀察到的事件來給出該定義。如果定義是基於物理時間的,則系統必須包含物理時鐘。即使它包含物理時鐘,仍然存在一個問題,即這些時鐘不是十分準確,並且不能保持精確的物理時間。因此,我們將在不使用物理時鐘的情況下定義“發生在之前”的關係。

We begin by defining our system more precisely. We assume that the system is composed of a collection of processes. Each process consists of a sequence of events. Depending upon the application, the execution of a subprogram on a computer could be one event, or the execution of a single machine instruction could be one event.

我們首先更精確地定義我們的系統。我們假設系統由一個進程集合組成。每個進程都包含一系列事件。根據應用的不同,在計算機上執行一個子程序可以算是一個事件;執行單個機器指令也可以算是一個事件。

We are assuming that the events of a process form a sequence, where a occurs before b in this sequence if a happens before b. In other words, a single process is defined to be a set of events with an a priori total ordering. This seems to be what is generally meant by a process. It would be trivial to extend our definition to allow a process to split into distinct subprocesses, but we will not bother to do so.

我們假設一個進程的多個事件形成一個序列,如果a發生在b之前,則a出現在該序列的b之前。換句話說,單個進程被定義為具有先驗全序的一組事件。這似乎是進程的一般含義。擴展我們的定義從而可以把進程拆分為不同的子進程,這看上去好像是微不足道的,但是我們不怕麻煩。

We assume that sending or receiving a message is an event in a process. We can then define the "happened before" relation, denoted by "-->", as follows. Definition.

我們假定發送或接收消息是進程中的事件。然後,我們可以如下定義“發生在之前”的關係,用“->”表示。

Definition. The relation "->" on the set of events of a system is the smallest relation satisfying the following three conditions: (1) If a and b are events in the same process, and a comes before b, then a ~ b. (2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then a ~ b. (3) If a ~ b and b ~ c then a ---* c. Two distinct events a and b are said to be concurrent if a ~ b and b -/-* a. We assume that a ~ a for any event a.

定義。系統事件集上的關係“->”是滿足以下三個條件的最小關係:(1)如果a和b是同一進程中的事件,並且a在b之前,則a -> b。(2)如果a是一個進程發送的消息,而b是另一進程接收的同一消息,則a -> b。(3)如果a -> b和b -> c,則a -> c。如果a -x-> b並且b -x-> a,則兩個不同的事件a和b稱為併發。

We assume that a -> a for any event a. (Systems in which an event can happen before itself do not seem to be physically meaningful.) This implies that -> is an irreflexive partial ordering on the set of all events in the system.

我們假定a -> a對於任何事件a都成立。(一個事件先於這個同一事件發生的系統在似乎沒有實際意義。)這意味著 -> 是系統中所有事件的不自反的偏序排序。

It is helpful to view this definition in terms of a "space-time diagram" such as Figure 1. The horizontal direction represents space, and the vertical direction represents time--later times being higher than earlier ones. The dots denote events, the vertical lines denote processes, and the wavy lines denote messages. It is easy to see that a -> b means that one can go from a to b in the diagram by moving forward in time along process and message lines. For example, we have p -> r4 in Figure 1.


分佈式系統中時間、時鐘和事件順序 - Part 1 偏序

用諸如圖1的“時空圖”來查看此定義是有幫助的。水平方向表示空間,垂直方向表示時間--較晚的時間比較早的時間高。點表示事件,垂直線表示過程,波浪線表示消息。很容易看出,a -> b表示a可以從圖中通過沿進程和消息線前進走到b。例如,在圖1中有p1 -> r4。(p1 -> q2, q2 -> q4, q4 -> r3, r3 -> r4 => p1 -> r4)

Another way of viewing the definition is to say that a --) b means that it is possible for event a to causally affect event b. Two events are concurrent if neither can causally affect the other. For example, events p3 and q3 of Figure 1 are concurrent. Even though we have drawn the diagram to imply that q3 occurs at an earlier physical time than 1)3, process P cannot know what process Q did at qa until it receives the message at p, (Before event p4, P could at most know what Q was planning to do at q3.)

查看這個定義的另一種方法是說,a -> b意味著事件a和b有因果關係。如果兩個事件都不能相互影響,則兩個事件是併發的。例如,圖1的事件p3和q3是併發的。即使我們該圖從物理時間看上去q3發生在p3之前,進程P也無法知道進程Q在q3做了什麼,直到它在p4處收到消息為止(在事件p4之前,進程P最多可以知道進程Q計劃在q3做什麼。)

This definition will appear quite natural to the reader familiar with the invariant space-time formulation of special relativity, as described for example in [1] or the first chapter of [2]. In relativity, the ordering of events is defined in terms of messages that could be sent. However, we have taken the more pragmatic approach of only considering messages that actually are sent. We should be able to determine if a system performed correctly by knowing only those events which did occur, without knowing which events could have occurred.

對於熟悉狹義相對論的不變時空公式的讀者來說,這種定義將顯得很自然,例如參考文獻[1]或文獻[2]的第一章中所述。在相對論中,事件的順序是根據可以發送的消息來定義的。但是,我們採用了更為實用的方法,即只考慮實際發送的消息。通過僅知道確實發生的那些事件,而不知道可能發生的事件,我們應該能夠確定系統是否正確執行。

Logical Clocks 邏輯時鐘

待續...


本人持續分享分佈式系統設計和實現的相關的技術文章和視頻,歡迎關注和討論。



References

  1. Schwartz, J.T. Relativity in lllustrations. New York U. Press, New York, 1962.
  2. Taylor, E.F., and Wheeler, J.A. Space-Time Physics, W.H. Freeman, San Francisco, 1966.
  3. Lamport, L. The implementation of reliable distributed multiprocess systems. To appear in Computer Networks.
  4. Ellingson, C, and Kulpinski, R.J. Dissemination of system-time. 1EEE Trans. Comm. Com-23, 5 (May 1973), 605-624.


分享到:


相關文章: