(gmp.info.gz) FFT Multiplication

(gmp.info.gz) Toom 3-Way Multiplication
(gmp.info.gz) Multiplication Algorithms
(gmp.info.gz) Other Multiplication
 
 FFT Multiplication
 ------------------
 
 At large to very large sizes a Fermat style FFT multiplication is used,
 following Scho"nhage and Strassen ( References).  Descriptions
 of FFTs in various forms can be found in many textbooks, for instance
 Knuth section 4.3.3 part C or Lipson chapter IX.  A brief description
 of the form used in GMP is given here.
 
    The multiplication done is x*y mod 2^N+1, for a given N.  A full
 product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
 and y with high zero limbs.  The modular product is the native form for
 the algorithm, so padding to get a full product is unavoidable.
 
    The algorithm follows a split, evaluate, pointwise multiply,
 interpolate and combine similar to that described above for Karatsuba
 and Toom-3.  A k parameter controls the split, with an FFT-k splitting
 into 2^k pieces of M=N/2^k bits each.  N must be a multiple of
 (2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding
 bit shifts in the split and combine stages.
 
    The evaluations, pointwise multiplications, and interpolation, are
 all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of
 2^k and of `mp_bits_per_limb'.  The results of interpolation will be
 the following negacyclic convolution of the input pieces, and the
 choice of N' ensures these sums aren't truncated.
 
                 ---
                 \         b
      w[n] =     /     (-1) * x[i] * y[j]
                 ---
             i+j==b*2^k+n
                b=0,1
 
    The points used for the evaluation are g^i for i=0 to 2^k-1 where
 g=2^(2N'/2^k).  g is a 2^k'th root of unity mod 2^N'+1, which produces
 necessary cancellations at the interpolation stage, and it's also a
 power of 2 so the fast fourier transforms used for the evaluation and
 interpolation do only shifts, adds and negations.
 
    The pointwise multiplications are done modulo 2^N'+1 and either
 recurse into a further FFT or use a plain multiplication (Toom-3,
 Karatsuba or basecase), whichever is optimal at the size N'.  The
 interpolation is an inverse fast fourier transform.  The resulting set
 of sums of x[i]*y[j] are added at appropriate offsets to give the final
 result.
 
    Squaring is the same, but x is the only input so it's one transform
 at the evaluate stage and the pointwise multiplies are squares.  The
 interpolation is the same.
 
    For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
 the exponent representing 2^k recursed modular multiplies each
 1/2^(k-1) the size of the original.  Each successive k is an asymptotic
 improvement, but overheads mean each is only faster at bigger and
 bigger sizes.  In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are the
 thresholds where each k is used.  Each new k effectively swaps some
 multiplying for some shifts, adds and overheads.
 
    A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
 plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
 A k=4 FFT at O(N^1.333) can be expected to be the first faster than
 Toom-3 at O(N^1.465).  In practice this is what's found, with
 `MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being between 300
 and 1000 limbs, depending on the CPU.  So far it's been found that only
 very large FFTs recurse into pointwise multiplies above these sizes.
 
    When an FFT is to give a full product, the change of N to 2N doesn't
 alter the theoretical complexity for a given k, but for the purposes of
 considering where an FFT might be first used it can be assumed that the
 FFT is recursing into a normal multiply and that on that basis it's
 doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
 making it O(N^(k/(k-2))).  This would mean k=7 at O(N^1.4) would be the
 first FFT faster than Toom-3.  In practice `MUL_FFT_THRESHOLD' and
 `SQR_FFT_THRESHOLD' have been found to be in the k=8 range, somewhere
 between 3000 and 10000 limbs.
 
    The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
 a multiple of 2^k and `mp_bits_per_limb' means that when
 2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
 The +k+3 means some values of N just under such a multiple will be
 rounded to the next.  The complexity calculations above assume that a
 favourable size is used, meaning one which isn't padded through
 rounding, and it's also assumed that the extra +k+3 bits are negligible
 at typical FFT sizes.
 
    The practical effect of the 2^(2k-1) constraint is to introduce a
 step-effect into measured speeds.  For example k=8 will round N up to a
 multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
 groups of sizes for which `mpn_mul_n' runs at the same speed.  Or for
 k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc.  In practice
 it's been found each k is used at quite small multiples of its size
 constraint and so the step effect is quite noticeable in a time versus
 size graph.
 
    The threshold determinations currently measure at the mid-points of
 size steps, but this is sub-optimal since at the start of a new step it
 can happen that it's better to go back to the previous k for a while.
 Something more sophisticated for `MUL_FFT_TABLE' and `SQR_FFT_TABLE'
 will be needed.
Info Catalog
(gmp.info.gz) Toom 3-Way Multiplication
(gmp.info.gz) Multiplication Algorithms
(gmp.info.gz) Other Multiplication
automatically generated byinfo2html