On implementing factorial (or other number) series..
by the most perverse ways possible

  • By Jason Spencer
  • First published: 28 August 2010
  • Last update: 22 November 2014
Tags: Boost, C++, C++ templates, metaprogramming, number series, pre-computation, pre-processor, templates

Jump to: Templates | Boost PP | Boost MPL | Other series | Memory | Performance

So, the other day I happened upon my old and trusty copy of Modern C++ Design by Andrei Alexandrescu and had the urge to go postal with templates. And what better way than program speed optimisation - pre-computation specifically? A few spare minutes resulted in this exercise: implement a simple function like factorial, but at compile time, not at run-time. And the more perverse the better. I'll try to keep adding to this article over time as inspiration for coding perversion hits me. Joking aside, it might be a useful example to any googlenauts turning up here while looking for the exit.

Factorial's been done ad nauseam at run-time.. recursive functions and loops being the most obvious methods.. blah blah.. but the thing about the factorial series is that the numbers get very big very quickly: a 32-bit computer can only store up to 12! in a standard integer data type (4-byte int) - any further and the 32 bits overflow. Storing the numbers in 64 bits only gets us as far as 20!.
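Those limits are easy to confirm at run-time - a minimal sketch (the fact() helper is just an illustrative name, not from the article):

```cpp
#include <cassert>
#include <climits>

// Run-time factorial in 64 bits, used only to confirm the storage limits:
// 12! is the last value that fits in a signed 32-bit int, and 20! is the
// last that fits in 64 bits (21! would overflow even unsigned long long).
unsigned long long fact(unsigned n) {
    unsigned long long r = 1;
    for (unsigned i = 2; i <= n; ++i) r *= i;
    return r;
}
```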

So why not create a look-up table for these numbers, and either explicitly list them in the source as magic numbers (magic numbers aren't usually a good thing) or compute them at compile time? Here are a few ways of doing just that..

Bear in mind, though, that most if not all of the techniques below rely on non-type template parameters, which the C++ standard restricts to integral (and a few other) types - so they will work for integer series, but not for series requiring floating-point numbers.

C++ Templates

Let's start with the simple recursive approach, but at compile time: a recursive template with a specialisation to terminate the recursion:

#include <iostream>
#include <algorithm>	// std::copy
#include <iterator>		// std::ostream_iterator
 
template <int N> struct facT {
	enum { value = N * facT<N-1>::value };
};
 
template <> struct facT <0> {
	enum { value = 1 };
};
 
int main (int, char **) {

	const int fact_lut [] = { facT<0>::value, facT<1>::value, facT<2>::value, facT<3>::value, facT<4>::value, facT<5>::value,\
						facT<6>::value, facT<7>::value, facT<8>::value, facT<9>::value, facT<10>::value, facT<11>::value, \
						facT<12>::value };

	std::copy( fact_lut, fact_lut+(sizeof(fact_lut)/sizeof(fact_lut[0])), std::ostream_iterator<int>(std::cout, " ") );
	std::cout << std::endl;

	return 0;
}


Which produces this:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ template_only1.cc
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ 

If one were to add a facT<13>::value to the array, the compiler would choke (with or without -Wall), like so:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ -Wall template_only1.cc
template_only1.cc: In instantiation of ‘struct facT<13>’:
template_only1.cc:17:32:   required from here
template_only1.cc:6:7: error: overflow in constant expression [-fpermissive]
  enum { value = N * facT<N-1>::value };
       ^
template_only1.cc:6:7: error: enumerator value for ‘value’ is not an integer constant
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$

The limitation is due to the underlying data type used to store the enum. The current C++ standard, C++03 (INCITS/ISO/IEC 14882:2003), says the underlying type is implementation-defined and won't be larger than an int unless a value won't fit into an int (say, when the value is specified as 1L). For the code above, g++ 4.4.3 on x86 *and* x64 makes this a 4-byte int because the type of the template parameter N is int. Changing this to unsigned long long allows the printing of the first 21 numbers in the series (zero-based, so up to 20!), on both x86 and x64.
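The unsigned long long change might look like this (a sketch; the full version used for the measurements appears in Appendix III):

```cpp
// Sketch of the change described above: with unsigned long long as the
// template parameter type the enum's underlying type becomes 64-bit,
// so values up to 20! can be held.
template <unsigned long long N> struct facT {
    enum { value = N * facT<N-1>::value };
};

template <> struct facT<0ULL> {
    enum { value = 1ULL };
};
```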

Have a look at Appendix II below for a quick analysis of the storage requirements and the overflow condition. The compiler cannot be relied upon to warn of overflow like it did above.

Boost PP

Ok, so that wasn't too perverse - but that fact_lut storing the numbers is painful to read, as well as error-prone, since an entry has to be added by hand for each value. Let's get rid of the manual enumeration. Again the C++ template approach, but with automated array creation through the pre-processor..

#include <iostream>
#include <algorithm>  // std::copy
#include <iterator>       // std::ostream_iterator
#include <boost/preprocessor/repetition/enum.hpp>
 
template <int N> struct facT {
    enum { value = N * facT<N-1>::value };
};
 
template <> struct facT <0> {
    enum { value = 1 };
};
 
int main (int, char **) {
 
#define INSTANCE_generate(z, n, data) facT< n >::value
#define NUM_FACTORIAL_VALUES 13
 
const int fact_lut [] = {                                                 \
  BOOST_PP_ENUM( NUM_FACTORIAL_VALUES, INSTANCE_generate, dummyval)      \
};
 
    std::copy( fact_lut, fact_lut+(sizeof(fact_lut)/sizeof(fact_lut[0])), std::ostream_iterator<int>(std::cout, " ") );
    std::cout << std::endl;
     
    return 0;
}


Nice.. (the value of 13 for NUM_FACTORIAL_VALUES is because there are 13 values, 0! up to 12!).. this is what the output of the pre-processor looks like:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ -P -E template_pp1.cc | tail -n 12
template <int N> struct facT {
 enum { value = N * facT<N-1>::value };
};
template <> struct facT <0> {
 enum { value = 1 };
};
const int fact_lut [] = { facT< 0 >::value , facT< 1 >::value , facT< 2 >::value , facT< 3 >::value ,	\
					facT< 4 >::value , facT< 5 >::value , facT< 6 >::value , facT< 7 >::value ,		\
					facT< 8 >::value , facT< 9 >::value , facT< 10 >::value , facT< 11 >::value , 	\
					facT< 12 >::value };
int main (int, char **) {
 std::copy( fact_lut, fact_lut+(sizeof(fact_lut)/sizeof(fact_lut[0])), std::ostream_iterator<int>(std::cout, " ") );
 std::cout << std::endl;
 return 0;
}
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ 

and the execution:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ template_pp1.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$

A. Disassembly..

For those unconvinced that the factorials really were calculated at compile time here's the relevant part of the disassembly of the binary compiled from the "Boost PP" method source above:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ -g template_pp1.cc
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ objdump -DS ./a.out
..SNIP..
const int fact_lut [] = {													\
  BOOST_PP_ENUM( NUM_FACTORIAL_VALUES, INSTANCE_generate, dummyval )	\
};
 80486d5:       c7 45 bc 01 00 00 00    movl   $0x1,-0x44(%ebp)
 80486dc:       c7 45 c0 01 00 00 00    movl   $0x1,-0x40(%ebp)
 80486e3:       c7 45 c4 02 00 00 00    movl   $0x2,-0x3c(%ebp)
 80486ea:       c7 45 c8 06 00 00 00    movl   $0x6,-0x38(%ebp)
 80486f1:       c7 45 cc 18 00 00 00    movl   $0x18,-0x34(%ebp)
 80486f8:       c7 45 d0 78 00 00 00    movl   $0x78,-0x30(%ebp)
 80486ff:       c7 45 d4 d0 02 00 00    movl   $0x2d0,-0x2c(%ebp)
 8048706:       c7 45 d8 b0 13 00 00    movl   $0x13b0,-0x28(%ebp)
 804870d:       c7 45 dc 80 9d 00 00    movl   $0x9d80,-0x24(%ebp)
 8048714:       c7 45 e0 80 89 05 00    movl   $0x58980,-0x20(%ebp)
 804871b:       c7 45 e4 00 5f 37 00    movl   $0x375f00,-0x1c(%ebp)
 8048722:       c7 45 e8 00 15 61 02    movl   $0x2611500,-0x18(%ebp)
 8048729:       c7 45 ec 00 fc 8c 1c    movl   $0x1c8cfc00,-0x14(%ebp)
..SNIP..

ebp (Extended Base Pointer) is the x86 register that points to the start of the current frame (the "local storage" on the stack for the function currently in scope); -0x14, -0x18, -0x1c and so on are offsets from ebp (i.e. the locations in the array where the values are to be stored), and the number before each offset is the immediate integer being copied into the memory that is the fact_lut array (0x1c8cfc00 = 479001600, 0x2611500 = 39916800).

The LUT appears differently if the const int fact_lut array is also declared static (at local or global scope). The compiler will then most likely put the LUT in the data segment of the program, rather than embedding the values in the text (code) segment. The disassembly then looks like this (x64 this time):

jsp@machina:~/c++_testing/compile_time_mpl_factorial$ g++ -g template_pp1.cc
jsp@machina:~/c++_testing/compile_time_mpl_factorial$ objdump -DS ./a.out
..SNIP..
0000000000400d80 <_ZL8fact_lut>:
  400d80:       01 00                   add    %eax,(%rax)
  400d82:       00 00                   add    %al,(%rax)
  400d84:       01 00                   add    %eax,(%rax)
  400d86:       00 00                   add    %al,(%rax)
  400d88:       02 00                   add    (%rax),%al
  400d8a:       00 00                   add    %al,(%rax)
  400d8c:       06                      (bad)  
  400d8d:       00 00                   add    %al,(%rax)
  400d8f:       00 18                   add    %bl,(%rax)
  400d91:       00 00                   add    %al,(%rax)
  400d93:       00 78 00                add    %bh,0x0(%rax)
  400d96:       00 00                   add    %al,(%rax)
  400d98:       d0 02                   rolb   (%rdx)
  400d9a:       00 00                   add    %al,(%rax)
  400d9c:       b0 13                   mov    $0x13,%al
  400d9e:       00 00                   add    %al,(%rax)
  400da0:       80 9d 00 00 80 89 05    sbbb   $0x5,-0x76800000(%rbp)
  400da7:       00 00                   add    %al,(%rax)
..SNIP..

This time the constants are not operands of instructions, just a chunk of memory that is accessed directly. The assembler on the right of the disassembly above can be ignored - it is invalid, merely objdump's attempt to make sense of the bytes, which are actually the LUT values (the first four bytes, 01 00 00 00, are the first value; the next four the second; and so on - little-endian architecture).

Should the LUT be static or non-static? Well, it depends on how many other static variables you use, and how often you call the function containing the non-static local LUT. If the LUT is declared global or static and there aren't many other globals in frequent use, the look-up will likely suffer a cache miss, which slows things down; but if other globals are used a lot and often, the LUT may already be in the cache. If the LUT is declared locally (first disassembly above) it lives in the stack frame, which has a much, much higher chance of being in the cache, especially since the values are written at the start of the function. The downside is that this copying happens every time the function is invoked - an overhead that has to be considered. Making the locally declared LUT static would put it in the same memory used if it were global, so that's no fix for the initialisation overhead.
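As a sketch of the two placements (illustrative code with hypothetical function names, not from the article's benchmarks):

```cpp
// Local LUT: the initialiser is copied into the stack frame on every call,
// but the frame is very likely already in cache.
unsigned long long fact_local(unsigned n) {
    const unsigned long long lut[] = { 1, 1, 2, 6, 24, 120, 720 };
    return lut[n];
}

// Static LUT: initialised once and placed in the data segment; no per-call
// copy, but a look-up may miss the cache if the table isn't touched often.
unsigned long long fact_static(unsigned n) {
    static const unsigned long long lut[] = { 1, 1, 2, 6, 24, 120, 720 };
    return lut[n];
}
```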

Boost MPL

Ok, that was weird, wasn't it? Weird can be good. Here's some more weird.. why not use a boost::mpl::integral_c template that holds the type as well as the value? We can then specify the type explicitly instead of having it deduced from the value (cf. enum { value = 1L };)

#include <iostream>
#include <algorithm>  // std::copy
#include <iterator>       // std::ostream_iterator
#include <boost/mpl/range_c.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/inserter.hpp>
#include <boost/mpl/transform.hpp>
#include <boost/mpl/integral_c.hpp>
#include <boost/mpl/placeholders.hpp>
#include <boost/mpl/vector/vector0.hpp>
 
typedef long long base_type;
#define NUM_FACTORIAL_VALUES 21	// 21 for base_type long long, 13 for int
 
template <base_type N> struct facT {
	typedef boost::mpl::integral_c< base_type, N * facT<N-1>::type::value > type;
    template < typename M > struct apply : boost::mpl::integral_c< typename M::value_type, facT<M::type::value>::type::value > {};
};
 
template <> struct facT <0> {
	typedef boost::mpl::integral_c< base_type, 1 > type;  
    template < typename M > struct apply : boost::mpl::integral_c< typename M::value_type, facT<M::type::value>::type::value > {};
};
 
typedef boost::mpl::range_c< base_type, 0, NUM_FACTORIAL_VALUES > myrange;
 
typedef boost::mpl::transform< myrange, facT< boost::mpl::_1::value >,	\
			boost::mpl::back_inserter< boost::mpl::vector0<> > >::type factorials;

struct stream_print {
	std::ostream & stream;
	char delim;
	stream_print(std::ostream & stream, char delim) : stream(stream), delim(delim) { }
    template< typename U > void operator()(U) { stream << U::type::value << delim; }
};

int main (int, char **) {
	
// access as facT< n >::type::value to build runtime lut

	stream_print cout_printer (std::cout, ' ');

    boost::mpl::for_each< myrange >( cout_printer );
    std::cout << std::endl;
 
    boost::mpl::for_each< factorials >( cout_printer );
    std::cout << std::endl;

    return 0;
}


Here I've changed the original templates to include a type member, which is the boost::mpl::integral_c actually storing the value, and an apply member, which is like type but expressed as a member class template rather than a typedef. I've slightly over-complicated the problem because I generate a range of numbers (myrange) and then calculate the factorial for each one - since we are generating a series it would be more efficient to derive each value from the previous one, but I wanted to keep the ability to calculate the Nth factorial directly.
Additional complexity comes from the fact that the values stored in the compile-time classes (held in the mpl::vector that is the result) must somehow be transferred to run-time to be printed. That's what boost::mpl::for_each is for.

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ template_mpl1.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
0 1 2 3 4 5 6 7 8 9 10 11 12 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ 

Appendix I. Other series..

For those thinking that the factorial series example was deliberately chosen to make it easier to create look-up tables.. that's true - up to a point. The point being that many increasing number series can be computed at compile time and will quickly overflow a 4-byte int, so they too can be stored in a look-up table of a manageable size..

Appendix IA. Fibonacci sequence..

A simple series where f(n) = f(n-1) + f(n-2) and f(0) = 1, f(1) = 1

#include <iostream>
#include <algorithm>
#include <iterator>
#include <boost/preprocessor/repetition/enum.hpp>
 
template <int N> struct seriesT { enum { value = seriesT<N-1>::value + seriesT<N-2>::value }; };
 
template <> struct seriesT <0> { enum { value = 1 }; };
 
template <> struct seriesT <1> { enum { value = 1 }; };
 
int main (int, char **){
 
#define INSTANCE_generate(z, n, data) seriesT< n >::value
#define NUM_SERIES_VALUES 46
 
    const int series_lut [] = { BOOST_PP_ENUM( NUM_SERIES_VALUES, INSTANCE_generate, dummyval) };
 
    std::copy( series_lut, series_lut+(sizeof(series_lut)/sizeof(series_lut[0])), std::ostream_iterator<int>(std::cout, " ") );
    std::cout << std::endl;
     
    return 0;
}


jsp@fatman:~/c++_testing/compile_time_mpl_series$ ./a.out 
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025
121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 

There are 46 numbers - a perfectly manageable size for a static array. NUM_SERIES_VALUES was set to 46 by compiling once with a larger number and adjusting based on the compile-time overflow error.
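That adjustment can also be confirmed at run-time - a minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>

// Count how many terms of f(n) = f(n-1) + f(n-2), f(0) = f(1) = 1,
// fit in a signed 32-bit int; this should match NUM_SERIES_VALUES.
unsigned fib_terms_in_int32() {
    const long long limit = 2147483647LL;   // 2^31 - 1
    long long a = 1, b = 1;                 // f(0), f(1)
    unsigned count = 2;
    while (a + b <= limit) {
        long long next = a + b;
        a = b;
        b = next;
        ++count;
    }
    return count;
}
```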

Appendix IB. Geometric sequence..

A simple sequence where f(n) = f(n-1) * 2 and f(0) = 1

#include <iostream>
#include <algorithm>
#include <iterator>
#include <boost/preprocessor/repetition/enum.hpp>
 
template <int N> struct sequenceT { enum { value = 2 * sequenceT<N-1>::value }; };
 
template <> struct sequenceT <0> { enum { value = 1 }; };
 
int main (int, char **){
 
#define INSTANCE_generate(z, n, data) sequenceT< n >::value
#define NUM_SEQUENCE_VALUES 31
 
        const int sequence_lut [] = { BOOST_PP_ENUM( NUM_SEQUENCE_VALUES, INSTANCE_generate, dummyval) };
 
        std::copy( sequence_lut, sequence_lut+(sizeof(sequence_lut)/sizeof(sequence_lut[0])), \
					std::ostream_iterator<int>(std::cout, " ") );
        std::cout << std::endl;
 
        return 0;
}


jsp@fatman:~/c++_testing/compile_time_mpl_series$ g++ template_pp1_geo2.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_series$ ./a.out 
1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456 536870912 1073741824 
jsp@fatman:~/c++_testing/compile_time_mpl_series$ 

There are 31 numbers - a perfectly manageable size for a static array. The 31 can be deduced because we're dealing with a 32-bit signed integer - one bit for the sign and 31 bits for the magnitude (whether one's complement or two's complement, there is still one sign bit) - the sequence starts at 1 and the binary number is shifted left one place at every step. A generalised form of this is at the end of Appendix II, below. Of course a geometric sequence with a ratio of two is best done with a bitwise left shift, but it's just an example of a geometric sequence.

Appendix II. Storage requirements and overflow..

Although the compiler caught the variable overflow in the factorial example above, it generally won't, and shouldn't be relied upon. The example above was just lucky - if the storage type were unsigned long long, for example, the output of the first 25 would be:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ template_only1_ull.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
sizeof(base_t) = 8 max base_t = 18446744073709551615
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 14197454024290336768 17196083355034583040 8128291617894825984 10611558092380307456 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ 

This is obviously wrong from the 22nd number (21!) onwards - there aren't enough zeroes at the end of the higher numbers. There should be one trailing zero for every 2-and-5 pair appearing in the prime factors of the number (100 can be expressed as 2*2*5*5, while 355687428096000 can be expressed as (2^15)*(3^6)*(5^3)*(7^2)*11*13*17). Since we are taking the product of a sequence, and there are plenty of numbers to fulfil the multiple-of-2 requirement, an additional zero effectively appears at the end of the result every time the number we multiply by is divisible by a power of 5 - e.g. multiples of 625 add four zeroes, of 125 three, of 25 two, and of 5 one. The number of bits required to encode the result is log2 of the factorial, but we cannot take the log2 of the calculated factorial since we don't know whether it has overflowed. Instead we take the log2 of the product as a sum - log2(prod(1..N)) = sum(log2(1)..log2(N)) - and each log2(N) is a small number that won't easily overflow. This can be seen in the screen grab below. (The n! column should be taken with a pinch of salt - although it doesn't overflow, it was generated in OpenOffice which, like Excel, stores only the 15 most significant figures, so there are actually too many trailing zeroes there, but the numbers are approximately correct.)


[Screen grab: spreadsheet of n, n! and the bits required to store n!]

We can see that an int (signed by default - 31 bits to store the magnitude on a 32-bit machine) will fit up to 12!, while an unsigned long long (usually 64 bits) will store up to 20!. So that's 168 bytes (21 entries, including one for 0!) for an unsigned long long factorial look-up table. Good going in my book.
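The two calculations behind that table - trailing zeroes and bits required - can be sketched like so (illustrative helper names, not from the article):

```cpp
#include <cassert>
#include <cmath>

// Trailing decimal zeroes of n!: one per factor of 5 in 1..n, since
// factors of 2 are always in surplus (Legendre's formula for p = 5).
unsigned trailing_zeroes(unsigned n) {
    unsigned count = 0;
    for (unsigned long long p = 5; p <= n; p *= 5)
        count += n / p;
    return count;
}

// Bits needed to hold n!, computed without ever forming n! itself:
// log2(prod(1..n)) = sum(log2(1)..log2(n)).
unsigned bits_needed(unsigned n) {
    double bits = 0.0;
    for (unsigned i = 2; i <= n; ++i)
        bits += std::log2((double)i);
    return (unsigned)std::ceil(bits);
}
```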

Another option for storing the factorials is a double-precision floating-point type - and we should do pretty well, as there are a lot of even numbers in the product sequence, so lots of left shifts which can be absorbed by the exponent field of an IEEE 754 binary64. Although I'm still trying to do the derivation on the back of a laptop, apparently an IEEE 754 binary64 can store up to 170! without overflowing the exponent. However, you can see from the table above that the 53 significant bits (52 stored fraction bits plus the implicit leading bit) can only store up to 22! exactly (the odd part 22!/(1<<19) fits in 51 bits, but 23!/(1<<19) doesn't fit in 53), after which the least significant bits are lost. Still a manageable size for a pre-computed look-up table. The downer is, of course, that double is not a valid template constant parameter - only integral types are - so such a LUT would have to be computed at runtime.
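The 22!/23! boundary can be checked with a double round-trip - a sketch relying on the GCC/Clang __int128 extension, since 22! doesn't fit in 64 bits:

```cpp
#include <cassert>

// __int128 is a GCC/Clang extension; 22! and 23! both fit in 128 bits.
__int128 fact128(int n) {
    __int128 r = 1;
    for (int i = 2; i <= n; ++i) r *= i;
    return r;
}

// A value is exactly representable in a double iff converting to double
// and back is lossless.
bool fits_in_double(__int128 v) {
    return (__int128)(double)v == v;
}
```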

The storage-requirement calculation for the geometric sequence in Appendix IB is simpler. The sequence definition states that f(n) = f(n-1) * a and f(0) = 1, or more simply f(n) = a^n. The space requirement is therefore log2(a^n), or n*log2(a). The largest exponent that fits in B bits satisfies n*log2(a) < B, so the number of entries (f(0) through f(n)) is ceil(B/log2(a)) - which gives the 31 entries seen above for a = 2 and B = 31.
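A brute-force count of the entries - a minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>

// Number of terms a^0, a^1, ... that fit in B value bits, found by
// multiplying until the next term would exceed 2^B - 1.
unsigned terms_in_bits(unsigned a, unsigned B) {
    const unsigned long long limit = (1ULL << B) - 1;
    unsigned long long v = 1;   // a^0 always fits
    unsigned count = 1;
    while (v <= limit / a) {    // guard guarantees v * a <= limit
        v *= a;
        ++count;
    }
    return count;
}
```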

Appendix III. Performance comparison..

This article is, as it says on the tin, about calculating the factorial series by the most perverse way possible, but there is also a practical reason for doing this work - performance optimisation. So, does it help?

Well, let's take three methods of implementing the factorial function:

  1. Pre-computed Look-Up Tables as per the "Template", "PP" and "MPL" methods above.
  2. A loop method where a running product is multiplied by successive numbers.
  3. A recursive method which multiplies a number by the factorial of that number less one.

The performance of the above methods would typically be governed by, in order:

  1. Memory look-up speeds and cache-misses - we've seen from Appendix II that the storage requirements for the look-up table are minimal, and should have no problem fitting into L1 cache.
  2. The loop method is governed by the rate at which integer multiplication can be done, and jumps and branch prediction.
  3. The recursive method is bound by multiplication and stack frame manipulation time (each time we recurse into the function again a new stack frame has to be generated for the function). It's not possible to avoid the frame manipulation by inlining because it isn't obvious how far the function will recurse.

A typical implementation of the three methods might look like:

#include <iostream>
#include <algorithm>  // std::copy
#include <iterator>       // std::ostream_iterator
#include <limits>
#include "/home/jsp/c++_testing2/CScopedTiming.hh"
 
typedef unsigned long long base_t;
 
const unsigned long LOOPS = 1L << 23;

// templates (compile-time)
template <base_t N> struct facT {
    enum { value = N * facT<N-1>::value };
};
template <> struct facT <0ULL> {
    enum { value = 1ULL };
};

// iterative (run-time)
base_t factorial_loop(unsigned n) {
    base_t out = 1;
    while(n) out *= n--;
    return out;
}

// recursive (run-time) 
base_t factorial_rec(base_t n) {
    if(n==0) return 1;
    return n*factorial_rec(n-1);
}
 
int main (int, char **){
 
std::cout << "sizeof(base_t) = " << sizeof(base_t) << " max base_t = " << std::numeric_limits<base_t>::max() << std::endl;
 
    const base_t fact_lut [] = { facT<0>::value, facT<1>::value, facT<2>::value, facT<3>::value, facT<4>::value, \
                    facT<5>::value, facT<6>::value, facT<7>::value, facT<8>::value, facT<9>::value,        \
                    facT<10>::value, facT<11>::value, facT<12>::value, facT<13>::value, facT<14>::value,   \
                    facT<15>::value, facT<16>::value, facT<17>::value, facT<18>::value, facT<19>::value,   \
                    facT<20>::value };
    const unsigned maxN = (sizeof(fact_lut)/sizeof(fact_lut[0]));
    std::copy( fact_lut, fact_lut + maxN, std::ostream_iterator<base_t>(std::cout, " ") );
    std::cout << std::endl;
 
for(unsigned i=0;i<maxN;++i) std::cout << factorial_loop(i) << " ";
std::cout << std::endl;
for(unsigned i=0;i<maxN;++i) std::cout << factorial_rec(i) << " ";
std::cout << std::endl;
 
unsigned long long sum = 0;
double dur_precomp = 0.0, dur_loop = 0.0, dur_rec = 0.0;
{
jsplib::CScopedTiming timer(dur_precomp);
for(unsigned long long i=0;i<LOOPS;++i) for(unsigned j=0;j<maxN;++j) sum += fact_lut[j];
}
std::cout << "Pre-comp method took " << dur_precomp << " seconds. Result was " << sum << std::endl;
 
sum = 0;
{
jsplib::CScopedTiming timer(dur_loop);
for(unsigned long long i=0;i<LOOPS;++i) for(unsigned j=0;j<maxN;++j) sum += factorial_loop(j);
}
std::cout << "Loop method took " << dur_loop << " seconds. Result was " << sum << std::endl;
 
sum = 0;
{
jsplib::CScopedTiming timer(dur_rec);
for(unsigned long long i=0;i<LOOPS;++i) for(unsigned j=0;j<maxN;++j) sum += factorial_rec(j);
}
std::cout << "Recursive method took " << dur_rec << " seconds. Result was " << sum << std::endl << std::endl;
 
std::cout << "Duration ratios: 1 : " << (dur_loop/dur_precomp) << " : " << (dur_rec/dur_precomp) << std::endl;
 
 
return 0;
}


The factorials of the numbers from 0 to 20 are summed many times over (otherwise the operation is too quick to time), and the result printed. Two things to note: first, the result is printed so that the calculation can't be optimised away (the number is of course nonsense as the sum has overflowed many times over); second, the loops calling the various methods are identical.
The output on a Core Duo T2250 looks like this:

jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ -Wall template_only1_ull_perf.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
sizeof(base_t) = 8 max base_t = 18446744073709551615
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 

Pre-comp method took 1.22182 seconds. Result was 16467408726278537216
Loop method took 19.6101 seconds. Result was 16467408726278537216
Recursive method took 29.8842 seconds. Result was 16467408726278537216

Duration ratios: 1 : 16.0499 : 24.4588
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ g++ -Wall -O3 template_only1_ull_perf.cc 
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ ./a.out 
sizeof(base_t) = 8 max base_t = 18446744073709551615
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 
1 1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 6227020800 87178291200 1307674368000 20922789888000 355687428096000 6402373705728000 121645100408832000 2432902008176640000 

Pre-comp method took 0.501364 seconds. Result was 16467408726278537216
Loop method took 8.46407 seconds. Result was 16467408726278537216
Recursive method took 16.1709 seconds. Result was 16467408726278537216

Duration ratios: 1 : 16.8821 : 32.2538
jsp@fatman:~/c++_testing/compile_time_mpl_factorial$ 

Increasing the compiler optimisation level approximately halved each of the method times, but relative to each other the pre-computation always wins, and by a big margin. Which is nice.
I wouldn't presume that the performance gain will always be this high, as we are concentrating all the factorial computations together, and were they interspersed throughout the code there would be cache-misses on the look-up table increasing latency. But still, job's a good'n.

To be continued..

When I have some time, I'm going to try to drop templates altogether and implement factorial purely in Boost PP, since it has support for rudimentary computation and recursion.
After that I'll continue searching for newer, "perversier" methods..

Thank you for reading my article. It's phrased in a slightly dismissive, tongue-in-cheek tone, but the methods here are all very useful and powerful, and could save a lot of headaches thanks to templates - and, of course, the performance gains from doing pre-computation at compile time. On the other hand, there's no point in being clever just for the hell of it, and many consider C++ templates a bad thing (binary bloat, amongst other problems) - I think they're the bee's knees and one of the reasons I'm such a huge C++ fan. Always go with what you know, and stick to the project's coding style guidelines.


+ + + ATH0
NO CARRIER

Version history:

Sat 28 August 2010 23:00 UTC - initial version
Sat 19 January 2013 00:00 UTC - added storage requirements discussion
Mon 23 December 2013 00:00 UTC - added performance measurement
Sat 22 November 2014 00:00 UTC - added disassembly version for static const (data segment)