Project Comment

The project is finished, but the learning process never finishes. One can always learn if one has the will.

1. What kind of project?

Client-server applications are not necessarily network applications, because the client and the server may run on the same host. Network applications are not necessarily distributed systems, because we consider a distributed system to be an integrated whole of multiple cooperating servers. (Someone may argue that every system can be considered a client-server system, just as everything in this world can be seen as a relation of giving and taking.) Distributed systems are not necessarily fault-tolerant systems, because they may lack high availability and reliability. Some groups concentrated on the distributed-system aspect, which stresses the cooperation of multiple servers, and did not touch much on the core issues of availability and reliability, between which there is a subtle difference. A reliable system may not always be available. Conversely, a highly available system may not always be reliable, even though such a system is not very useful in my personal view.

2. What is fault tolerance?

The requirement is that the outside world is not affected by your internal failures, if we consider the fault-tolerant system as a black box. Just imagine that the user's life is not affected by a failure of our broker exchange system, i.e. the money and the stock of every user (including the broker itself; surely the broker doesn't want to compensate his customers' losses) remain intact after a failure, whether or not the users notice it, and with minimal or no cooperation from the user side. The users don't have to be honest and return extra money or stock, and neither does the broker. This is quite intuitive, but I think it is illustrative.

The minimum requirement of the system is to maintain consistency between the system and the outside world. This consistency applies to the visible states such as balances, numbers of shares, transaction records, etc.

3. What is so critical?

The really critical part of this project is the process of synchronizing data from memory to persistent storage, i.e. the database. This is the hard core of the project; everything else we consider comparatively easy. Why? If writing the log entry for an operation fails, the user can immediately detect the failure, and by design we ask the user to cooperate by resending the request. However, once the log entry for an operation has been written and the system begins to make a checkpoint by storing in-memory data back to disk or the database, the most critical moment arrives. If you don't have a log to support this process, you can never know which updates were finished. This is why the design of the checkpoint and the log is the kernel of this project.
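
To make the log-then-checkpoint idea concrete, here is a minimal sketch in Python. It is not the project's actual code: the class name Ledger, the file names ops.log and snapshot.json, and the JSON record layout are all assumptions made for illustration. Each operation is appended to the log before it touches memory, and a checkpoint is written to a temporary file and renamed, so a crash in the middle of a checkpoint leaves the previous snapshot intact and recovery can replay the rest from the log.

```python
import json
import os
import tempfile


class Ledger:
    """Toy account store (hypothetical): log first, then apply in memory."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "ops.log")        # assumed name
        self.snap_path = os.path.join(directory, "snapshot.json")  # assumed name
        self.balances = {}
        self.applied = 0  # number of log entries reflected in self.balances
        self.recover()

    def recover(self):
        # Start from the last complete checkpoint, if there is one.
        if os.path.exists(self.snap_path):
            with open(self.snap_path) as f:
                snap = json.load(f)
            self.balances, self.applied = snap["balances"], snap["applied"]
        # Replay the log entries that the checkpoint does not cover.
        if os.path.exists(self.log_path):
            entries = []
            with open(self.log_path) as f:
                for line in f:
                    try:
                        entries.append(json.loads(line))
                    except json.JSONDecodeError:
                        break  # torn last line: that request was never acknowledged
            for entry in entries[self.applied:]:
                self._apply(entry)

    def _apply(self, entry):
        account = entry["account"]
        self.balances[account] = self.balances.get(account, 0) + entry["delta"]
        self.applied += 1

    def execute(self, account, delta):
        entry = {"account": account, "delta": delta}
        # If this write fails, the caller sees the error and can resend.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(entry)

    def checkpoint(self):
        # Write a temporary file and rename it over the old snapshot, so a
        # crash during the checkpoint never leaves a half-written snapshot.
        tmp = self.snap_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"balances": self.balances, "applied": self.applied}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.snap_path)


def demo():
    with tempfile.TemporaryDirectory() as d:
        ledger = Ledger(d)
        ledger.execute("alice", 100)
        ledger.checkpoint()
        ledger.execute("alice", -30)  # logged, but not in any checkpoint
        # Simulate a crash: discard the in-memory state and restart.
        return Ledger(d).balances
```

In this sketch the log is never truncated and the snapshot records how many log entries it already contains; a real system would also need to trim the log and to cope with a database rather than a single file.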

4. What is a monitor?

Should our monitor be able to kill a failed system? Yes, because the project requires it. However, if we assume that all failures are benign, i.e. simple crash failures, then there is no need to kill the faulty system. Yet we still need to handle transient network failures. So the job becomes simple: send a stop request to the non-responsive server. (Here non-responsive means that the server does not respond when the monitor queries it; since the network failure is transient, sooner or later the server will accept the request.)
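
The monitor behaviour described above might be sketched as follows, again in Python and again only as an illustration. FlakyServer is a hypothetical stand-in for a real server behind a transiently failing network, and the names ping, stop, monitor_step and max_retries are assumptions, not the project's interface. The monitor queries the server once; if there is no response, it keeps resending the stop request until the server accepts it.

```python
class FlakyServer:
    """Hypothetical server stub: unreachable for the first `outage` calls."""

    def __init__(self, outage):
        self.outage = outage
        self.running = True

    def _check_network(self):
        # Simulates a transient network failure that eventually clears.
        if self.outage > 0:
            self.outage -= 1
            raise TimeoutError("no response")

    def ping(self):
        self._check_network()

    def stop(self):
        self._check_network()
        self.running = False


def monitor_step(server, max_retries=10):
    """Query the server once; if it does not respond, resend a stop
    request until it is accepted or the retry budget is used up."""
    try:
        server.ping()
        return "alive"
    except TimeoutError:
        pass  # non-responsive: treat it as failed and ask it to stop
    for _ in range(max_retries):
        try:
            server.stop()
            return "stopped"
        except TimeoutError:
            continue  # still unreachable, try again
    return "unreachable"
```

For example, monitor_step(FlakyServer(outage=3)) loses the query and two stop requests before the third stop request gets through, so the server ends up stopped even though it never crashed. A real monitor would wait between retries instead of looping immediately.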