Document Type

Technical Report

Publication Date


Technical Report Number



We give asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on parallel disk systems. In a BMMC permutation on N records, where N is a power of 2, each (lg N)-bit source address x maps to a corresponding (lg N)-bit target address y by the matrix equation y = Ax XOR c, where matrix multiplication is performed over GF(2). The characteristic matrix A is (lg N) x (lg N) and nonsingular over GF(2). Under the Vitter-Shriver parallel-disk model with N records, D disks, B records per block, and M records of memory, we show a universal lower bound of $\Omega \left( \frac{N}{BD} \left( 1 + \frac{\rank{\gamma}}{\lg (M/B)} \right) \right)$ parallel I/Os for performing a BMMC permutation, where gamma is the lower left (lg (N/B)) x (lg B) submatrix of the characteristic matrix. We adapt this lower bound to show that the algorithm for bit-permute/complement (BPC) permutations in Cormen93a is asymptotically optimal. We also present an algorithm that uses at most $\frac{2N}{BD} \left( 4 \ceil{\frac{\rank{\gamma}}{\lg (M/B)}} + 4 \right)$ parallel I/Os, which asymptotically matches the lower bound and improves upon the BMMC algorithm in Cormen93a. When rank (gamma) is low, this method is an improvement over the general-permutation bound of $\Theta \left( \frac{N}{BD} \frac{\lg (N/B)}{\lg (M/B)} \right)$.

We introduce a new subclass of BMMC permutations, called memory-load-dispersal (MLD) permutations, which can be performed in one pass. This subclass, which is used in the BMMC algorithm, extends the catalog of one-pass permutations appearing in Cormen93a.

Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, we show how to detect in at most $N/BD + \ceil{\frac{\lg (N/B) + 1}{D}}$ parallel I/Os whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using our algorithm.