Distributed Computing Issues

Distributed computing (also described as grid computing, parallelism, and clustering) isn't suitable for all tasks. For example, an operation that takes relatively little time on a single machine may actually take longer once you add the overhead of sending network messages to other peers, not to mention the additional work needed to divide the problem into multiple chunks and reassemble the answer.

Most significantly, distributed computing increases the overall complexity of the system and often makes it more fragile. That's because distributed computing raises issues that aren't encountered if a single process is doing the same work. These issues include the following:

How to handle communication errors.
How to track free workers and in-progress tasks.
How to deal with a worker that doesn't respond in a timely fashion.
How to allocate work intelligently, depending on the perceived complexity of the problem and the computing resources (or communication speed) of the worker.
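To make the bookkeeping behind these issues concrete, here is a minimal sketch of a structure that tracks free workers and in-progress tasks and flags tasks that have run too long. The WorkTracker class and its method names are hypothetical, invented for illustration, and the caller passes in the current time so the logic stays easy to test.

```python
from dataclasses import dataclass, field


@dataclass
class WorkTracker:
    """Tracks which workers are free and which tasks are in progress."""
    timeout: float                                    # seconds before a task counts as overdue
    free_workers: set = field(default_factory=set)
    in_progress: dict = field(default_factory=dict)   # task_id -> (worker, start_time)

    def add_worker(self, worker):
        self.free_workers.add(worker)

    def assign(self, task_id, now):
        """Hand the task to any free worker; return None if all are busy."""
        if not self.free_workers:
            return None
        worker = self.free_workers.pop()
        self.in_progress[task_id] = (worker, now)
        return worker

    def complete(self, task_id):
        """Record a finished task and return its worker to the free pool."""
        worker, _ = self.in_progress.pop(task_id)
        self.free_workers.add(worker)

    def overdue(self, now):
        """Return the IDs of tasks that have run longer than the timeout."""
        return [task_id
                for task_id, (_, started) in self.in_progress.items()
                if now - started > self.timeout]
```

A real coordinator would also have to decide what to do with an overdue task (cancel it, reassign it, or keep waiting), which is where most of the added complexity comes from.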

Most developers think of distributed computing as a way to break a single problem down into multiple pieces that multiple machines can work on independently. This is the ideal scenario, but not the only case. In some instances, you might use a distributed-computing framework just to remove a bottleneck for a computationally intensive task on a server. For example, a web service that receives task requests could deliver these requests to a task manager. The task manager would then send each task to a separate computer. The overall throughput of the system would increase, but each individual task wouldn't be broken down or reassembled. We'll examine this pattern, which is often easier to manage in the enterprise world, toward the end of this chapter.
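The throughput-only pattern can be sketched in a few lines. In this illustration, a thread pool stands in for the separate computers, and run_task and dispatch_all are hypothetical names; the point is only that each request is handed off whole, with no splitting or reassembly.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(n):
    """Stand-in for a computationally intensive task: sum of squares below n."""
    return sum(i * i for i in range(n))


def dispatch_all(requests, worker_count=4):
    """Send each whole request to a worker and collect the results in order."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # Each request is processed in full by a single worker;
        # nothing is divided up or merged afterward.
        return list(pool.map(run_task, requests))
```

Each individual call to run_task takes as long as it would on its own. Only the number of requests handled at the same time changes.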

If you want to shorten the time taken to complete individual tasks, rather than simply improve the overall throughput of an application, you'll need to take advantage of parallelism by dividing each task into multiple pieces. Some problems are much more suitable for this approach than others. For example, in the next section you'll consider a work manager that calculates prime number lists. In this case, the problem (searching a range of values for prime numbers) is one that can easily be subdivided into smaller pieces, like many search and analysis tasks. However, some tasks can't be divided because each step depends on the result of the one before it. One example is the encryption of a large amount of information with cipher block chaining (CBC). In this case, each block of data is encrypted using the encrypted output of the preceding block, so the blocks can't be encrypted independently on separate machines (although distributed computing is used with other cryptography problems, such as cracking ciphers through brute-force key searches).
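To see why a prime search subdivides so easily, consider this minimal sketch. It is not the work manager built later in the chapter; the function names are hypothetical, and the pieces are processed one after another in a loop where a real system would send each subrange to a different worker.

```python
import math


def split_range(start, end, pieces):
    """Divide the inclusive range [start, end] into contiguous subranges."""
    size, extra = divmod(end - start + 1, pieces)
    ranges = []
    low = start
    for i in range(pieces):
        length = size + (1 if i < extra else 0)   # spread any remainder evenly
        if length == 0:
            continue                              # more pieces than values
        ranges.append((low, low + length - 1))
        low += length
    return ranges


def primes_in(low, high):
    """Find primes in [low, high] by trial division (simple, not fast)."""
    return [n for n in range(max(low, 2), high + 1)
            if all(n % d for d in range(2, math.isqrt(n) + 1))]


def find_primes(start, end, pieces):
    """Search each subrange independently, then join the partial results."""
    results = []
    for low, high in split_range(start, end, pieces):
        results.extend(primes_in(low, high))      # no piece needs another's output
    return results
```

Because the subranges are contiguous and are handled in order, joining the partial lists gives a sorted result. With CBC encryption there is no equivalent of split_range, because the second piece can't start until the first has finished.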

Parallelism also introduces a new kind of fragility because the overall process is only as fast and as reliable as its weakest link. If you have a worker that goes offline in the middle of a task, or operates very slowly, the whole task will be held back. To avoid this problem, you can store statistics about peers and use the most reliable ones wherever possible. You might also want to regularly poll a worker to retrieve its progress so you can cancel a slow-running task and reschedule it elsewhere. Or, if you have a large pool of workers, you may want to simply assign the same task to several of them and use the first result that arrives. This approach might seem wasteful, but in a large environment, it provides increased robustness through redundancy.
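The redundancy approach, in which the same task goes to several workers and the first answer wins, might look like the following sketch. The workers here are simulated with threads and artificial delays, and make_worker and first_result are hypothetical names chosen for this example.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_worker(name, delay):
    """Build a stand-in worker that squares its input after `delay` seconds."""
    def worker(task):
        time.sleep(delay)          # simulates network latency plus processing time
        return name, task * task
    return worker


def first_result(task, workers):
    """Give the same task to every worker and return the first result received."""
    pool = ThreadPoolExecutor(max_workers=len(workers))
    futures = [pool.submit(worker, task) for worker in workers]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)      # don't block on the slower duplicates
    return next(iter(done)).result()


workers = [make_worker("slow", 0.3), make_worker("fast", 0.01)]
winner, answer = first_result(12, workers)
```

This sketch treats the first worker to finish as the winner even if it finished by failing, in which case result() raises that worker's exception. A more careful version would skip failed workers and keep waiting for the next one.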

Finally, note that some tasks aren't well suited for any type of distributed computing. These include operations that perform simple tasks with large amounts of data, in which case the overhead required to transmit the information might not be worth the relatively minor benefits of parallelism. Generally, CPU-intensive tasks that perform a lot of computation on a relatively small amount of data are the best choices for distributed computing.
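A rough back-of-the-envelope model shows how this trade-off plays out. The formula and every number below are illustrative assumptions, not measurements: the model supposes that the computation divides evenly among workers, that all data crosses a single network link, and that each worker adds a fixed messaging cost.

```python
def distributed_seconds(compute_s, data_bytes, workers,
                        bandwidth_bytes_per_s, per_worker_overhead_s):
    """Estimate wall-clock time when the work is spread over `workers`."""
    transfer_s = data_bytes / bandwidth_bytes_per_s
    return compute_s / workers + transfer_s + workers * per_worker_overhead_s


def worth_distributing(compute_s, data_bytes, **link):
    """True if the estimate beats running everything on one machine."""
    return distributed_seconds(compute_s, data_bytes, **link) < compute_s


# Hypothetical environment: 4 workers, 100 MB/s link, 50 ms overhead per worker.
link = dict(workers=4, bandwidth_bytes_per_s=100e6, per_worker_overhead_s=0.05)

# Simple task over 1 GB of data: 2 s of computation, about 10 s just to ship the data.
data_heavy = worth_distributing(compute_s=2, data_bytes=1e9, **link)

# CPU-intensive task over 1 MB of data: an hour of computation, negligible transfer.
cpu_heavy = worth_distributing(compute_s=3600, data_bytes=1e6, **link)
```

Under these assumed numbers, the data-heavy task is slower when distributed, while the CPU-intensive task finishes in roughly a quarter of the time.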

Note

For more information about new initiatives in distributed computing, you may be interested in visiting http://www.globus.org, the site of a research project aimed at developing tools for grid computing on a large scale. The project currently provides a toolkit for Java and is considering the promise of .NET. Another worthwhile site is http://www.gridforum.org, the home of a community of researchers and developers working on emerging issues in grid computing.

In the next few sections, we'll create a distributed work system that's designed to solve a single problem: finding prime numbers. For maximum speed, it uses multiple workers in a single operation and assembles their results with a work manager.