Running a parallel program on a distributed computing system involves several complex system-level issues, including partitioning, mapping, synchronization, communication, and scheduling. These issues are explained below:
1. Partitioning
This refers to dividing a program or its data for execution on multiple nodes.
- Computation Partitioning:
The program is split into smaller tasks that can run concurrently on different machines.
➤ Example: Each task processes a different part of the overall problem.
- Data Partitioning:
The data set is divided into chunks, and each chunk is processed independently.
➤ Example: A large dataset is split into smaller pieces for different nodes to process in parallel.
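As a minimal sketch of data partitioning, the helper below (the function name and chunk policy are illustrative, not from the source) splits a dataset into nearly equal slices, one per node:

```python
def partition(data, num_chunks):
    """Divide `data` into `num_chunks` nearly equal contiguous slices."""
    size, rem = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `rem` chunks take one extra element each.
        end = start + size + (1 if i < rem else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

print(partition(list(range(10)), 3))  # → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Contiguous slicing is only one policy; hash- or range-based partitioning is common when the chunks must later be located by key.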
2. Mapping
Assigns the divided tasks or data pieces to appropriate computing nodes (workers).
➤ This ensures optimal use of resources and parallel execution.
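Mapping can be sketched as assigning each partitioned task to a worker node; here a simple round-robin policy is used (task and node names are hypothetical):

```python
def map_tasks(tasks, nodes):
    """Assign tasks to nodes round-robin so each node gets a fair share."""
    assignment = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        assignment[nodes[i % len(nodes)]].append(task)
    return assignment

print(map_tasks(["t0", "t1", "t2", "t3", "t4"], ["node-a", "node-b"]))
# → {'node-a': ['t0', 't2', 't4'], 'node-b': ['t1', 't3']}
```

Real systems refine this with node capacity and data locality, but the core idea is the same task-to-node assignment.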
3. Synchronization
Ensures that tasks coordinate properly during execution.
- Avoids race conditions when multiple nodes try to access the same data.
- Manages data dependencies where one task needs results from another.
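The race-condition point can be illustrated with two threads standing in for two nodes updating shared state; a lock serializes the read-modify-write so no update is lost (a local sketch, not a distributed lock protocol):

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:  # without the lock, concurrent updates could interleave
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # → 200000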
4. Communication
Data exchange between nodes becomes necessary due to data dependencies.
➤ This could involve transferring intermediate results between tasks on different machines.
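A minimal sketch of this kind of communication uses a queue as a stand-in for a message channel between machines: a producer task sends intermediate results, and a consumer task combines them (names and values are illustrative):

```python
import queue
import threading

channel = queue.Queue()

def producer():
    for partial in [10, 20, 30]:  # intermediate results from one task
        channel.put(partial)
    channel.put(None)             # end-of-stream marker

def consumer(out):
    total = 0
    while (item := channel.get()) is not None:
        total += item             # combine partial results
    out.append(total)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results[0])  # → 60
```

Across real machines the queue would be replaced by sockets, message passing (e.g. MPI), or a framework-managed shuffle, but the dependency structure is identical.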
5. Scheduling
If there are more tasks than resources, a scheduler is required to:
- Decide which task runs when and on which node.
- Follow a scheduling policy to balance load and reduce execution time.
There are two types of scheduling:
- Task-level scheduling (for a single program).
- Job-level scheduling (for multiple programs/jobs in the system).
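One common task-level policy is greedy load balancing: assign each task (longest first) to the currently least-loaded node. The sketch below assumes known task costs, which is an illustration rather than a general scheduler:

```python
import heapq

def schedule(task_costs, num_nodes):
    """Greedily place each task on the node with the least accumulated load."""
    heap = [(0, n) for n in range(num_nodes)]  # (current load, node id)
    plan = {n: [] for n in range(num_nodes)}
    # Longest-processing-time-first ordering improves balance.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        plan[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    return plan

print(schedule({"t1": 4, "t2": 3, "t3": 2, "t4": 1}, 2))
# → {0: ['t1', 't4'], 1: ['t2', 't3']}
```

Job-level scheduling applies the same idea one level up, arbitrating whole jobs across the cluster instead of tasks within one program.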
6. Motivation for Programming Paradigms
Because managing all of the above manually is complex and time-consuming, programming paradigms and frameworks such as MapReduce, Hadoop, and Dryad are used. They:
- Abstract low-level system management.
- Increase developer productivity.
- Reduce time-to-market.
- Improve scalability and fault tolerance.
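To make the abstraction concrete, here is a single-process sketch of the MapReduce model: the user supplies only `map_fn` and `reduce_fn`, while the framework (simulated here by a few lines of driver code) handles partitioning, grouping, and scheduling:

```python
from collections import defaultdict

def map_fn(line):
    """User-defined map: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """User-defined reduce: sum the counts for one word."""
    return (word, sum(counts))

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                        # map phase
        for key, value in map_fn(line):
            groups[key].append(value)         # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

print(mapreduce(["a b a", "b c"]))  # → {'a': 2, 'b': 2, 'c': 1}
```

In Hadoop or Dryad the same user logic runs across many nodes, with the system transparently handling the partitioning, mapping, synchronization, communication, and scheduling issues described above.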