Introduction

Parallel processing: technique to perform multiple computations simultaneously. Motivation: overcome sequential bottlenecks, increase throughput, reduce latency. Basis of modern high-performance computing in CPUs, GPUs, clusters, and supercomputers.

"Parallel computing is the future of computing, enabling solutions beyond the scope of serial processing." -- Jack Dongarra

Definition and Concepts

Parallel Processing

Execution of multiple instructions or tasks concurrently by multiple processing elements. Exploits concurrency in computations to improve speed and efficiency.

Concurrency vs Parallelism

Concurrency: multiple tasks logically overlapping in time. Parallelism: multiple tasks physically executed simultaneously. All parallelism implies concurrency; not all concurrency is parallelism.

Granularity

Task size classification: fine-grained (small tasks, frequent communication), coarse-grained (large tasks, less communication). Affects performance and overhead.

Speedup and Efficiency

Speedup: ratio of sequential execution time to parallel execution time. Efficiency: speedup divided by number of processors. Measures resource utilization.
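Illustrative example (hypothetical timings): if the serial run takes 100 s and the parallel run takes 25 s on 8 processors, speedup S = 100/25 = 4 and efficiency E = 4/8 = 0.5, i.e. half of the available processing capacity does useful work.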

Types of Parallel Processing

Bit-Level Parallelism

Simultaneous operation on all bits of a word. Increasing the processor word size (e.g., from 16 to 32 to 64 bits) lets a single instruction process wider operands, reducing the instruction count for a given computation and raising throughput.

Instruction-Level Parallelism (ILP)

Overlapping execution of instructions within a single processor pipeline. Techniques: pipelining, superscalar execution, out-of-order execution.

Data Parallelism

Same operation applied simultaneously to multiple data elements. Common in SIMD architectures.
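A minimal data-parallel sketch in C using an OpenMP parallel-for/SIMD directive; the saxpy function name, array size, and test values are illustrative, and a compiler with OpenMP support (e.g. -fopenmp) is assumed:

    #include <stdio.h>
    #include <stddef.h>

    /* Same operation applied to every element: OpenMP splits the index range
       across threads and can vectorize each chunk across SIMD lanes. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        enum { N = 8 };
        float x[N], y[N];
        for (size_t i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(N, 2.0f, x, y);
        for (size_t i = 0; i < N; i++)
            printf("%.1f ", y[i]);   /* expected: 1.0 3.0 5.0 ... 15.0 */
        printf("\n");
        return 0;
    }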

Task Parallelism

Different tasks executed concurrently on different processors. Suitable for independent or loosely coupled computations.
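A minimal task-parallel sketch in C using OpenMP sections, where two independent computations run concurrently on different threads; the function names and workloads are placeholders, and OpenMP support is assumed:

    #include <stdio.h>

    /* Two independent, loosely coupled tasks. */
    static double sum_range(long n) {
        double s = 0.0;
        for (long i = 1; i <= n; i++) s += (double)i;
        return s;
    }

    static double sum_squares(long n) {
        double s = 0.0;
        for (long i = 1; i <= n; i++) s += (double)i * (double)i;
        return s;
    }

    int main(void) {
        double a = 0.0, b = 0.0;

        /* Each section is an independent task that may run on its own thread. */
        #pragma omp parallel sections
        {
            #pragma omp section
            a = sum_range(100000);

            #pragma omp section
            b = sum_squares(100000);
        }

        printf("sum = %.0f, sum of squares = %.0f\n", a, b);
        return 0;
    }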

Hardware Architectures

Multiprocessors

Multiple processors sharing memory and interconnect. Types: symmetric multiprocessing (SMP), asymmetric multiprocessing (AMP).

Multicomputers

Multiple independent computers connected via network. Distributed memory model, message passing required.

Multicore Processors

Multiple processing cores integrated on a single chip. Shared cache and memory hierarchy.

Vector Processors

Process vectors of data elements in single instructions. High throughput for scientific computations.

Graphics Processing Units (GPUs)

Massively parallel processors optimized for data parallel workloads. Thousands of lightweight cores.

Architecture        | Memory Model          | Communication         | Example
SMP Multiprocessor  | Shared memory         | Bus or interconnect   | Intel Xeon
Multicomputer       | Distributed memory    | Network messages      | Beowulf cluster
GPU                 | Shared global memory  | On-chip interconnect  | NVIDIA CUDA cores

Parallel Algorithms

Decomposition Strategies

Problem decomposition into subtasks: data decomposition, task decomposition, hybrid decomposition.
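A small sketch of data decomposition in C: an n-element index range is split into contiguous blocks, one per processor, with the remainder spread so block sizes differ by at most one (the block_range helper and the sizes are illustrative):

    #include <stdio.h>

    /* Data decomposition: assign processor p (0 <= p < nprocs) a contiguous
       block [lo, hi) of an n-element index range.  The first (n % nprocs)
       blocks receive one extra element. */
    static void block_range(long n, int nprocs, int p, long *lo, long *hi) {
        long base = n / nprocs, extra = n % nprocs;
        *lo = p * base + (p < extra ? p : extra);
        *hi = *lo + base + (p < extra ? 1 : 0);
    }

    int main(void) {
        long n = 10;
        int nprocs = 4;
        for (int p = 0; p < nprocs; p++) {
            long lo, hi;
            block_range(n, nprocs, p, &lo, &hi);
            printf("processor %d: indices [%ld, %ld)\n", p, lo, hi);
        }
        return 0;
    }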

Parallel Algorithm Design

Focus on load balancing, minimizing communication and synchronization overhead, and exploiting locality.

Examples

Parallel sorting (bitonic sort, sample sort), matrix multiplication, graph algorithms (BFS, shortest path).

Complexity Models

PRAM (Parallel Random Access Machine), BSP (Bulk Synchronous Parallel), LogP models guide algorithm design and analysis.

Algorithm: Parallel Matrix Multiplication
Input: Matrices A (n×n), B (n×n)
Output: Matrix C (n×n)
1. Partition A, B into blocks assigned to processors
2. Each processor computes partial products
3. Synchronize and sum partial results
4. Output C assembled from blocks
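A minimal shared-memory version of the algorithm above in C with OpenMP; here rows of C are partitioned across threads rather than explicit blocks, and the matrix size and test values are illustrative (a distributed-memory version would exchange blocks via message passing instead):

    #include <stdio.h>

    #define N 4

    /* Each thread computes a disjoint set of rows of C = A * B.  Rows are
       independent, so no locking is needed inside the loop; the implicit
       barrier at the end of the parallel for plays the role of the
       "synchronize and combine" step. */
    void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (i == j) ? 2.0 : 0.0;   /* A = 2 * identity */
                B[i][j] = i * N + j;
            }
        matmul(A, B, C);                           /* expect C = 2 * B */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%6.1f ", C[i][j]);
            printf("\n");
        }
        return 0;
    }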

Synchronization and Communication

Need for Synchronization

Ensure correct ordering of shared resource access, avoid race conditions, deadlocks, data corruption.

Synchronization Mechanisms

Locks, semaphores, barriers, atomic operations, monitors.
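A minimal sketch of one of these mechanisms, a mutex lock, in C with POSIX threads (compiled with -pthread); the counter, thread count, and iteration count are illustrative:

    #include <stdio.h>
    #include <pthread.h>

    /* Shared counter protected by a mutex: without the lock, concurrent
       increments would race and some updates could be lost. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock);   /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        for (int t = 0; t < 4; t++)
            pthread_create(&threads[t], NULL, worker, NULL);
        for (int t = 0; t < 4; t++)
            pthread_join(threads[t], NULL);   /* join is itself a sync point */
        printf("counter = %ld (expected 400000)\n", counter);
        return 0;
    }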

Communication Models

Shared memory: implicit communication. Message passing: explicit communication between processes.
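A minimal message-passing sketch in C with MPI: rank 0 sends a value explicitly to rank 1, whereas under the shared memory model both would simply access the same variable. Assumes an MPI installation and at least two processes (e.g. mpirun -np 2):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Explicit communication: the data is copied into a message. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }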

Latency and Bandwidth

Communication cost combines latency (fixed per-message delay) and bandwidth (sustained transfer rate); both limit overall performance. Reducing message count and volume, and overlapping communication with computation, lower the overhead.

Mechanism   | Purpose                                | Overhead
Mutex Lock  | Mutual exclusion of critical section   | Medium
Semaphore   | Resource counting and signaling        | Low to medium
Barrier     | Synchronization point for all threads  | High (depends on number of threads)

Performance Metrics

Speedup (S)

S = T_serial / T_parallel. Ideal speedup equals number of processors.

Efficiency (E)

E = S / P, where P is processor count. Measures resource utilization.

Amdahl's Law

Limits speedup due to sequential fraction of program: S ≤ 1 / (f + (1-f)/P), f = serial fraction.

Gustafson's Law

Assumes problem size scales with processor count, so the serial fraction shrinks relative to the total work: scaled speedup S = P - f(P - 1), where f is the serial fraction; for small f the speedup grows nearly linearly in P.
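Illustrative evaluation (hypothetical values): with serial fraction f = 0.1 and P = 16, the scaled speedup is 16 - 0.1 × 15 = 14.5, close to the ideal 16, whereas Amdahl's fixed-size bound for the same f is far lower (see below).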

Speedup_Amdahl(f, P) = 1 / (f + (1-f)/P)
Where:
  f = fraction of serial code
  P = number of processors
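Illustrative evaluation (hypothetical values): with f = 0.1 and P = 16, Speedup_Amdahl(0.1, 16) = 1 / (0.1 + 0.9/16) = 6.4; as P grows without bound, the speedup approaches 1/f = 10 no matter how many processors are added.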

Scalability

Ability to maintain efficiency as processor count increases.

Scalability

Strong Scalability

Fixed problem size; measures speedup as processors increase.

Weak Scalability

Problem size increases proportionally with processors; measures whether efficiency is maintained.
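Illustrative example (hypothetical timings): a fixed-size job that takes 100 s on 1 processor and 20 s on 8 shows strong-scaling speedup 5 and efficiency 0.625; under weak scaling, if the per-processor workload is held constant and the runtime stays near 100 s as processors are added, efficiency stays near 1.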

Factors Affecting Scalability

Communication overhead, synchronization delays, load imbalance, memory contention.

Scalability Strategies

Reducing communication, overlapping computation and communication, dynamic load balancing.
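A sketch of one such strategy, overlapping computation with communication, in C with nonblocking MPI calls: the messages are started, independent local work proceeds, and completion is awaited only when the communicated data is needed (the ring exchange and the local loop are placeholders; assumes an MPI installation):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        double send_val = 3.14, recv_val = 0.0, local = 0.0;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;            /* neighbours in a ring */
        int left  = (rank - 1 + size) % size;

        /* Start the communication, but do not wait for it yet. */
        MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_val, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

        /* Independent local work overlaps with the message transfer. */
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (i + 1.0);

        /* Only block when the communicated data is actually required. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d: received %.2f, local work %.2f\n", rank, recv_val, local);
        MPI_Finalize();
        return 0;
    }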

Programming Models

Shared Memory Model

Multiple threads/processes communicate via shared variables. Requires synchronization primitives.

Message Passing Model

Processes communicate by explicit messages. Common in clusters and distributed systems. Example: MPI.

Data Parallel Model

Same operation applied to distributed data sets. Supported by OpenMP, CUDA.

Task Parallel Model

Independent tasks executed concurrently, possibly with dependencies. Examples: Threading Building Blocks (TBB), Cilk.

Applications

Scientific Computing

Simulations, numerical methods, climate modeling, molecular dynamics.

Big Data Analytics

Parallel data processing frameworks: Hadoop, Spark.

Graphics and Visualization

Real-time rendering, image processing using GPUs.

Artificial Intelligence

Training deep neural networks via parallelized matrix operations.

Challenges and Limitations

Synchronization Overhead

Causes delays and reduced parallel efficiency.

Load Imbalance

Unequal task distribution leads to idle processors.

Communication Latency

High latency in distributed systems impacts performance.

Debugging Complexity

Difficult to detect and fix race conditions, deadlocks.

Hardware Limitations

Memory bandwidth, interconnect speeds constrain scaling.
