Introduction
Parallel processing: technique to perform multiple computations simultaneously. Motivation: overcome sequential bottlenecks, increase throughput, reduce latency. Basis of modern high-performance computing in CPUs, GPUs, clusters, and supercomputers.
"Parallel computing is the future of computing, enabling solutions beyond the scope of serial processing." -- Jack Dongarra
Definition and Concepts
Parallel Processing
Execution of multiple instructions or tasks concurrently by multiple processing elements. Exploits concurrency in computations to improve speed and efficiency.
Concurrency vs Parallelism
Concurrency: multiple tasks logically overlapping in time. Parallelism: multiple tasks physically executed simultaneously. All parallelism implies concurrency; not all concurrency is parallelism.
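A minimal sketch of the distinction, assuming CPython, where threads merely interleave CPU-bound work under the GIL while processes run truly in parallel (the workload and worker counts are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # CPU-bound work: sum of squares up to n
    return sum(i * i for i in range(n))

def timed(executor_cls, label: str) -> None:
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads (concurrent, not parallel under the GIL)")
    timed(ProcessPoolExecutor, "processes (parallel on multiple cores)")
```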
Granularity
Task size classification: fine-grained (small tasks, frequent communication), coarse-grained (large tasks, less communication). Affects performance and overhead.
Speedup and Efficiency
Speedup: ratio of sequential execution time to parallel execution time. Efficiency: speedup divided by number of processors. Measures resource utilization.
Types of Parallel Processing
Bit-Level Parallelism
Operating on all bits of a word at once. Increasing the processor word size reduces the number of instructions needed for operands wider than the word (e.g., a 64-bit addition takes one instruction on a 64-bit processor but two on a 32-bit processor).
Instruction-Level Parallelism (ILP)
Overlapping execution of instructions within a single processor pipeline. Techniques: pipelining, superscalar execution, out-of-order execution.
Data Parallelism
Same operation applied simultaneously to multiple data elements. Common in SIMD architectures.
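A minimal data-parallel sketch using Python's multiprocessing Pool: the same operation (here, squaring) is applied to every data element, with the pool distributing chunks across worker processes:

```python
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x  # same operation applied to each data element

if __name__ == "__main__":
    data = list(range(1_000))
    with Pool(processes=4) as pool:
        results = pool.map(square, data, chunksize=100)
    print(results[:5])  # [0, 1, 4, 9, 16]
```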
Task Parallelism
Different tasks executed concurrently on different processors. Suitable for independent or loosely coupled computations.
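A task-parallel sketch with concurrent.futures, where two unrelated, independent tasks run concurrently on separate workers (the task names and bodies are illustrative placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

def parse_log(path: str) -> int:
    return len(path)           # stand-in for real parsing work

def checksum(data: bytes) -> int:
    return sum(data) % 256     # stand-in for real hashing work

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as ex:
        f1 = ex.submit(parse_log, "/var/log/app.log")  # task 1
        f2 = ex.submit(checksum, b"some payload")      # task 2, independent of task 1
        print(f1.result(), f2.result())
```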
Hardware Architectures
Multiprocessors
Multiple processors sharing memory and interconnect. Types: symmetric multiprocessing (SMP), asymmetric multiprocessing (AMP).
Multicomputers
Multiple independent computers connected via network. Distributed memory model, message passing required.
Multicore Processors
Multiple processing cores integrated on a single chip. Shared cache and memory hierarchy.
Vector Processors
Process vectors of data elements in single instructions. High throughput for scientific computations.
Graphics Processing Units (GPUs)
Massively parallel processors optimized for data parallel workloads. Thousands of lightweight cores.
| Architecture | Memory Model | Communication | Example |
|---|---|---|---|
| SMP Multiprocessor | Shared memory | Bus or interconnect | Intel Xeon |
| Multicomputer | Distributed memory | Network messages | Beowulf cluster |
| GPU | Shared global memory | On-chip interconnect | NVIDIA CUDA cores |
Parallel Algorithms
Decomposition Strategies
Problem decomposition into subtasks: data decomposition, task decomposition, hybrid decomposition.
Parallel Algorithm Design
Focus on load balancing, minimizing communication and synchronization overhead, and exploiting locality.
Examples
Parallel sorting (bitonic sort, sample sort), matrix multiplication, graph algorithms (BFS, shortest path).
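As a sketch of one of these examples, the row-block decomposition below parallelizes matrix multiplication over a process pool (pure-Python lists for self-containment; production code would use NumPy or an MPI layout, and the blocking mirrors the algorithm outline later in this section):

```python
from multiprocessing import Pool

def multiply_rows(args):
    # Each worker computes the rows of C for its assigned block of A's rows.
    a_block, b = args
    n = len(b)
    return [[sum(a_row[k] * b[k][j] for k in range(n)) for j in range(n)]
            for a_row in a_block]

def parallel_matmul(a, b, workers=4):
    n = len(a)
    step = (n + workers - 1) // workers            # rows of A per worker
    blocks = [(a[i:i + step], b) for i in range(0, n, step)]
    with Pool(processes=workers) as pool:
        partial = pool.map(multiply_rows, blocks)  # partial products in parallel
    return [row for block in partial for row in block]  # assemble C from blocks

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(parallel_matmul(a, b, workers=2))  # [[19, 22], [43, 50]]
```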
Complexity Models
PRAM (Parallel Random Access Machine), BSP (Bulk Synchronous Parallel), LogP models guide algorithm design and analysis.
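In the BSP model, for example, the cost of a superstep is commonly written as w + h·g + l (local work w, maximum messages per processor h, gap g per message, barrier latency l). A small sketch totaling this over supersteps, with purely illustrative parameter values:

```python
def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs; g: gap per message; l: barrier latency."""
    return sum(w + h * g + l for w, h in supersteps)

# Illustrative numbers only: three supersteps on a machine with g=4, l=100.
print(bsp_cost([(1000, 20), (500, 50), (2000, 10)], g=4, l=100))  # 4120
```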
Algorithm: Parallel Matrix Multiplication
Input: Matrices A (n×n), B (n×n)
Output: Matrix C (n×n)
1. Partition A and B into blocks assigned to processors
2. Each processor computes its partial products
3. Synchronize and sum the partial results
4. Assemble C from the blocks and output it

Synchronization and Communication
Need for Synchronization
Ensure correct ordering of shared resource access, avoid race conditions, deadlocks, data corruption.
Synchronization Mechanisms
Locks, semaphores, barriers, atomic operations, monitors.
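A threading sketch showing two of these mechanisms: a Lock protecting a shared counter (preventing a race condition) and a Barrier forcing all threads to wait at a common point:

```python
import threading

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(4)    # synchronization point for 4 threads

def worker():
    global counter
    for _ in range(10_000):
        with lock:                # mutual exclusion around the critical section
            counter += 1
    barrier.wait()                # no thread proceeds until all 4 have arrived

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                    # always 40000 thanks to the lock
```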
Communication Models
Shared memory: implicit communication. Message passing: explicit communication between processes.
Latency and Bandwidth
Latency (time to deliver the first byte) and bandwidth (sustained transfer rate) determine communication cost; reducing message count and overlapping communication with computation limit their impact on overall performance.
| Mechanism | Purpose | Overhead |
|---|---|---|
| Mutex Lock | Mutual exclusion of critical section | Medium |
| Semaphore | Resource counting and signaling | Low to medium |
| Barrier | Synchronization point for all threads | High (grows with thread count) |
Performance Metrics
Speedup (S)
S = T_serial / T_parallel. Ideal speedup equals number of processors.
Efficiency (E)
E = S / P, where P is processor count. Measures resource utilization.
Amdahl's Law
Limits speedup due to sequential fraction of program: S ≤ 1 / (f + (1-f)/P), f = serial fraction.
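A small sketch computing speedup, efficiency, and the Amdahl bound from the formulas above (the sample timings and serial fraction are illustrative):

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(s: float, p: int) -> float:
    return s / p

def amdahl_bound(f: float, p: int) -> float:
    # Upper bound on speedup with serial fraction f on p processors.
    return 1.0 / (f + (1.0 - f) / p)

s = speedup(t_serial=120.0, t_parallel=20.0)   # S = 6.0
print(s, efficiency(s, p=8))                   # 6.0 0.75
print(amdahl_bound(f=0.1, p=8))                # ~4.7: a 10% serial part caps speedup
```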
Gustafson's Law
Assumes the problem size scales with the processor count: scaled speedup S = P - f·(P - 1), which grows nearly linearly in P when the serial fraction f is small.
Speedup_Amdahl(f, P) = 1 / (f + (1 - f)/P)
where:
f = fraction of serial code
P = number of processors

Scalability
Ability to maintain efficiency as processor count increases.
Scalability
Strong Scalability
Fixed problem size; measures speedup as processors increase.
Weak Scalability
Problem size increases proportionally with processors; measures maintenance of efficiency.
Factors Affecting Scalability
Communication overhead, synchronization delays, load imbalance, memory contention.
Scalability Strategies
Reducing communication, overlapping computation and communication, dynamic load balancing.
Programming Models
Shared Memory Model
Multiple threads/processes communicate via shared variables. Requires synchronization primitives.
Message Passing Model
Processes communicate by explicit messages. Common in clusters and distributed systems. Example: MPI.
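A minimal message-passing sketch using mpi4py (assumed to be installed; run with something like `mpiexec -n 2 python send_recv.py`): rank 0 sends a Python object to rank 1 explicitly, which is the only way data moves in this model:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"step": 1, "payload": [1, 2, 3]}, dest=1, tag=0)  # explicit send
elif rank == 1:
    data = comm.recv(source=0, tag=0)                            # explicit receive
    print("rank 1 received:", data)
```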
Data Parallel Model
Same operation applied on distributed data sets. Supported by OpenMP, CUDA.
Task Parallel Model
Independent tasks executed concurrently, possibly with dependencies. Examples: Threading Building Blocks (TBB), Cilk.
Applications
Scientific Computing
Simulations, numerical methods, climate modeling, molecular dynamics.
Big Data Analytics
Parallel data processing frameworks: Hadoop, Spark.
Graphics and Visualization
Real-time rendering, image processing using GPUs.
Artificial Intelligence
Training deep neural networks via parallelized matrix operations.
Challenges and Limitations
Synchronization Overhead
Causes delays and reduced parallel efficiency.
Load Imbalance
Unequal task distribution leads to idle processors.
Communication Latency
High latency in distributed systems impacts performance.
Debugging Complexity
Difficult to detect and fix race conditions, deadlocks.
Hardware Limitations
Memory bandwidth, interconnect speeds constrain scaling.
Future Trends
Exascale Computing
Systems capable of 10^18 floating-point operations per second (exaFLOPS); require extreme degrees of parallelism.
Heterogeneous Architectures
Integration of CPUs, GPUs, FPGAs for adaptable workloads.
Quantum Parallelism
Quantum computers exploit superposition and entanglement of qubits to act on many computational states at once for certain problem classes.
Improved Programming Models
Higher-level abstractions, automatic parallelization, fault tolerance.