Integer DWTs mod 2^64-2^32+1

ARM prime works using arithmetic modulo a 64 bit prime 2^64-2^32+1 (henceforth known as p). This prime was chosen because it has some very useful properties, the most important of which is that it is possible to calculate x mod p without doing any divisions. This is essential for the ARM which has no divide instruction.

The special form of p means that we can make a mod p routine which only involves a few shifts and arithmetic operations. We use these facts :-

  • 2^64 == 2^32 -1 mod p
  • 2^96 == -1 mod p
  • 2^128 == -2^32 mod p
  • 2^192 == 1 mod p
  • 2^n * 2^(192-n) = 1 mod p

Thus :-

x3 * 2^96 + x2 * 2^64 + x1 * 2^32 + x0 mod p
= x2 * 2^64 + x1*2^32 + x0-x3) [using 2^96 mod p = -1]
= (x1+x2) * 2^32 + x0 - x3 - x2 [using 2^64 mod p = 2^32 -1]

Note for ARM hackers: we also use the fact that 2^64 - p is 2^32-1 ie 0x00000000FFFFFFFF so instead of adding 0xFFFFFFFF00000001 mod 2^64 we subtract 0x00000000FFFFFFFF. This is convenient because the ARM carry flag is inverted for ADD and SUBtract so it is best (conditional execution wise) to follow an ADD with an ADDCS and a SUB with a SUBCC.

This p is can also be used for FFTs and DWTs also because is has the following properties.

Factors of p-1 are 2^32 * 3 * 5 * 17 * 257 * 65537

This means that practically an FFT can be defined of length up to 2^32 with optional factors of 3 and 5.

For the Discrete Weighted Transform we need an n-th root of 2. 2 has order 192 mod p (ie 2^192 mod p = 1) so we can have the

(p-1)/192 = (2^58 - 2^26) / 3 = 2^26 * 5 * 17 * 257 * 65537 th root of 2.

This means that we can do the DWT for lengths up to 2^26 with an optional factor of 5

7 is a primitive root mod p

An n-th root of unity can be generated by 7^(5*(p-1)/n) mod p.

An n-th root of two can be generated by 7^(5*(p-1)/192/n) mod p

So a suitable 5 * 2^26-th root of 1 is 0xED41D05B78D6E286 and the 5 * 2^26-th root of 2 is 0xC47FC73D33F80E14

Note that a 64-th root of unity is 8 mod p. This means that a radix-6 FFT can be done with no multiplies, only with shifts which are much quicker than multipliers on the ARM.

Many thanks to Peter-Lawrence Montgomery for working out the maths behind how to do the 128 bit to 64 bit reduction mod p and the shifts mod p so efficiently. Peter also suggested the idea of using shifts in the transform which really makes a lot of difference in execution speed on ARM