Brute-Force, Single-Threaded Prime Factorization

Compare such an approach to a (pre-generated) sieve. Modulo is expensive, so both approaches essentially do two things: generate potential factors, and perform modulo operations. Either program should reasonably generate a new candidate factor in fewer cycles than a modulo takes, so either program is modulo-bound.

The given approach filters out a constant proportion of all integers, namely the multiples of 2 and 3: that is 1/2 + 1/3 - 1/6 = 2/3 of all numbers, so about one in three numbers is used as an argument to the modulo operator. I'll call it a skip filter.
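
For concreteness, here is a minimal sketch of such a skip filter; this is my reconstruction for illustration, not the code under review:

    // skip_filter.cpp -- minimal sketch of a 2-3 skip filter for trial
    // division (illustrative reconstruction, not the questioner's code).
    #include <stdint.h>
    #include <iostream>

    int main() {
        uint64_t n = 18446743721522234449ull; // the example factored below
        while ( n % 2 == 0 ) { std::cout << "2 "; n /= 2; }
        while ( n % 3 == 0 ) { std::cout << "3 "; n /= 3; }
        // Surviving candidates have the form 6k-1 or 6k+1, so the step
        // alternates between 2 and 4: 5, 7, 11, 13, 17, 19, ...
        uint64_t step = 2;
        for ( uint64_t c = 5; c <= n / c; c += step, step = 6 - step )
            while ( n % c == 0 ) { std::cout << c << " "; n /= c; }
        if ( n > 1 ) std::cout << n; // remaining cofactor is prime
        std::cout << std::endl;
    }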

On the other hand, a sieve uses only primes as arguments to the modulo operator. By the prime number theorem, the density of primes near N is about 1/ln(N); equivalently, the average gap between successive primes near N is about ln(N). For example, e^20 is just under 500 million, so numbers over 500 million have under a 5% chance of being prime. If all numbers up to 2^32 are considered, 5% is a good rule of thumb.

Therefore, a sieve will spend roughly six to seven times less time on div operations than your skip filter. The next factor to consider is the speed at which the sieve produces primes, i.e. reads them from memory or disk. If fetching one prime is faster than five or six divs, then the sieve is faster. According to my tables, div throughput on my Core2 is at most one per 12 cycles. These will be hard division problems, so let's conservatively budget 50 cycles per prime. For a 2.5 GHz processor, that's 20 nanoseconds.

In 20 ns, a 50 MB/sec hard drive can read about one byte. The simple solution is to use 4 bytes per prime, but then the drive will be the bottleneck. We can be more clever, though. If we want to encode all the primes in order, we can just encode their differences. Again, the expected difference is ln(N). Also, the differences are all even (past the prime 2), so we can store half of each, saving a bit. And they are never zero, which leaves the zero byte free as an escape, making extension to a multibyte encoding free. So, using one byte per prime, differences up to 510 can be stored in one byte, which gets us up to 303371455241 according to Wikipedia's table of maximal prime gaps.
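
To make the halved-gap scheme concrete, here is a worked example of my own (stored byte = gap / 2; the zero-escape multibyte form never appears this low):

    primes:        3    5    7    11   13   ...   89   97
    gaps:               2    2    4    2          8
    stored bytes:       1    1    2    1          4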

Therefore, depending on the hard drive, a stored list of primes should be about equal in speed at verifying primality. If the list can be held in RAM (it's 203 MB, so subsequent runs will probably hit the disk cache), the problem goes away entirely: the FSB speed typically differs from the processor clock by a factor smaller than the FSB width in bytes, meaning the FSB can deliver more than one prime per cycle. The factor of improvement is then the reduction in division operations: six to seven times in theory, and about 4.7x in the experimental results below, since division is not the only cost.

Of course, then there is multithreading. Ranges of either primes or skip-filtered candidates can be assigned to different threads, making either approach embarrassingly parallel. There are no optimizations that don't involve increasing the number of parallel divider circuits, unless you somehow eliminate the modulo.
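
The range-splitting is easy to sketch. Below is a minimal C++11 illustration of my own (the function name scan, the thread count, and the sample semiprime are all hypothetical); each thread trial-divides its own slice of candidates, and the smallest divisor found anywhere wins. The program further down remains single-threaded.

    // parallel_trial.cpp -- sketch of assigning candidate ranges to threads.
    #include <stdint.h>
    #include <algorithm>
    #include <atomic>
    #include <cmath>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Each thread scans its own slice of odd candidates; the smallest
    // divisor of n found anywhere lands in `best`.
    static void scan( uint64_t n, uint64_t lo, uint64_t hi,
                      std::atomic< uint64_t > &best ) {
        if ( lo % 2 == 0 ) ++ lo; // stay on odd candidates
        for ( uint64_t c = lo; c < hi; c += 2 )
            if ( n % c == 0 ) {
                uint64_t cur = best.load();
                while ( c < cur && ! best.compare_exchange_weak( cur, c ) ) {}
                return;
            }
    }

    int main() {
        const uint64_t n = 1000003ull * 1000033ull; // hypothetical semiprime
        const unsigned T = 4;                       // thread count
        const uint64_t limit = (uint64_t) std::sqrt( (long double) n ) + 1;
        std::atomic< uint64_t > best( n );
        std::vector< std::thread > pool;
        const uint64_t chunk = ( limit - 3 ) / T + 1;
        for ( unsigned t = 0; t != T; ++ t )
            pool.emplace_back( scan, n, 3 + t * chunk,
                               std::min( 3 + ( t + 1 ) * chunk, limit + 1 ),
                               std::ref( best ) );
        for ( std::thread &th : pool ) th.join();
        if ( best.load() < n ) std::cout << best.load() << " divides " << n << std::endl;
        else std::cout << n << " has no factor below " << limit << std::endl;
    }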

Here is the sieve-based program itself (single-threaded, as posted). It's templated, so you could add bignums.

/*
* multibyte_sieve.cpp
* Generate a table of primes, and use it to factorize numbers.
*
* Created by David Krauss on 10/12/10.
*
*/

#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;

char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };

template< typename It >
unsigned decode_gap( It &stream ) {
    unsigned gap = static_cast< unsigned char >( * stream ++ );

    if ( gap ) return 2 * gap; // only this path is tested

    gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
    return gap + decode_gap( stream ); // shallow recursion
}

template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
    unsigned len = 0, bytes[4];

    gap /= 2;
    do {
        bytes[ len ++ ] = gap % encoding_base;
        gap /= encoding_base;
    } while ( gap );

    while ( -- len ) { // loop not tested; never reached for lim = 2^32,
                       // where the largest prime gap is only 336
        * stream ++ = 0;
        * stream ++ = bytes[ len ]; // was bytes[ len + 1 ], an out-of-bounds read
    }
    * stream ++ = bytes[ 0 ];
}

template< size_t lim >
void generate_primes() {
    // Heap-allocate: the bitset holds lim/2 bits (256 MB for lim = 2^32).
    // auto_ptr was the C++03 idiom; today this would be unique_ptr.
    auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
    bitset< lim / 2 > &sieve = * sieve_p;

    ofstream out_f( primes_filename, ios::out | ios::binary );
    ostreambuf_iterator< char > out( out_f );

    // Sieve only odd numbers: index x represents the number 2x + 1.
    size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
    for ( ; x != last; ++ x ) {
        if ( sieve[ x ] ) continue;
        size_t n = x * 2 + 1; // translate index to number
        for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true; // mark odd multiples
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    for ( ; x != lim / 2; ++ x ) { // past sqrt(lim), no more marking is needed
        if ( sieve[ x ] ) continue;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    cout << prev * 2 + 1 << endl; // report the largest prime found
}

template< typename I >
void factorize( I n ) {
    ifstream in_f( primes_filename, ios::in | ios::binary );
    if ( ! in_f ) {
        cerr << "Could not open primes file.\n"
                "Please generate it with 'g' command.\n";
        return;
    }

    while ( n % 2 == 0 ) { // 2 is not stored in the file; handle it first
        n /= 2;
        cout << "2 ";
    }
    unsigned long factor = 1;

    for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
        factor += decode_gap( in ); // reconstruct the next odd prime: 3, 5, 7, ...

        while ( n % factor == 0 ) {
            n /= factor;
            cout << factor << " ";
        }

        if ( n == 1 ) goto finish;
    }

    cout << n; // whatever remains has no prime factor below 2^32, so it is prime
    finish:
    cout << endl;
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) goto print_help;

    unsigned long n;

    if ( argv[1][0] == 'g' ) {
        generate_primes< (1ul<< 32) >(); // needs a 64-bit unsigned long
    } else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit ) {
        factorize( n );
    } else goto print_help;

    return 0;

print_help:
    cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
            "\t" << argv[0] << " g -- generate primes file in current directory.\n";
}

Performance on a 2.2 GHz MacBook Pro:

dkrauss$ time ./multibyte_sieve g
4294967291

real 2m8.845s
user 1m15.177s
sys 0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279

real 0m5.405s
user 0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279

real 0m25.147s
user 0m24.170s
sys 0m0.096s

Run time of this prime factor function?

In your example, if num is prime then the loop takes exactly num - 1 steps. This would mean the algorithm's runtime is O(num) (where O stands for the pessimistic case). But with algorithms that operate on numbers, things get a little trickier (thanks to thegreatcontini and Chris for noticing)! We always describe complexity as a function of the input size. In this case the input is the number num, and it is represented with log(num) bits, so the input size is log(num). Because num = 2^log(num), your algorithm has complexity O(2^k), where k = log(num) is the size of your input.
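
To put rough numbers on that (my example, not the answerer's): a num near 10^6 is only k = 20 bits of input but costs on the order of 10^6 = 2^20 steps, while a 64-bit num would cost on the order of 2^64, about 1.8 * 10^19 steps. Each extra input bit doubles the work.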

This is what makes this problem hard: the input is very, very small, so any runtime polynomial in num is exponential in the input size ...

On a side note, @rici is right: you only need to check up to sqrt(num), which easily reduces the runtime to O(sqrt(num)), or more precisely O(sqrt(2)^k).
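
A minimal sketch of that sqrt(num) cutoff (my illustration, in C++ for consistency with the program above; the input value is a hypothetical example):

    // sqrt_cutoff.cpp -- trial division stops at sqrt(num): if no factor
    // is found below that bound, num itself is prime.
    #include <stdint.h>
    #include <iostream>

    int main() {
        uint64_t num = 1000003; // a prime, so the loop runs ~sqrt(num) times
        bool prime = num > 1;
        for ( uint64_t i = 2; i <= num / i; ++ i ) // i <= num/i avoids i*i overflow
            if ( num % i == 0 ) { prime = false; break; }
        std::cout << num << ( prime ? " is prime" : " is composite" ) << std::endl;
    }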

Prime factorization of a number

The simplest change you can make to fix your problem is to change your while loop condition:

def factors(nr):
    i = 2
    factors = []
    while i <= nr:
        if (nr % i) == 0:
            factors.append(i)
            nr = nr / i
        else:
            i = i + 1
    return factors

print factors(8)
print factors(9)
print factors(10)

Output

[2, 2, 2]
[3, 3]
[2, 5]

Can we use the extended Euclidean algorithm to simplify factorization of two products of large prime numbers with one common factor?

Of course. Suppose the first "large product of prime factors" is 3 × 7 = 21 and the second is 5 × 7 = 35. Then the GCD of 21 and 35 is 7, which is a factor of both products. You don't even need the extended algorithm; a simple GCD is sufficient.

But this isn't very useful, because you seldom have two large semi-primes that share a common factor.
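
A quick sketch of that check, my own rendering of Euclid's algorithm in C++ to match the rest of this page:

    // gcd_demo.cpp -- recovering a shared prime factor of two semiprimes.
    #include <stdint.h>
    #include <iostream>

    static uint64_t gcd( uint64_t a, uint64_t b ) {
        while ( b ) { uint64_t t = a % b; a = b; b = t; }
        return a;
    }

    int main() {
        std::cout << gcd( 21, 35 ) << std::endl; // prints 7, the shared factor
    }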

Prime factor of 300 000 000 000?

Are you remembering to divide the number that you're factorizing by each factor as you find it?

Say, for example, you find that 2 is a factor. You can add that to your list of factors, but then you divide the number that you're trying to factorize by that value.

Now you're only searching for the factors of 150 billion. Each time around, you should start from the factor you just found. So if 2 was a factor, test 2 again. If the next factor you find is 3, there's no point starting from 2 again.

And so on...
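
A compact sketch of that strategy, my C++ rendering of the advice above:

    // Divide out each factor as it is found, and resume from the same
    // candidate rather than restarting from 2.
    #include <stdint.h>
    #include <iostream>

    int main() {
        uint64_t n = 300000000000ull; // 300 billion, the number in question
        for ( uint64_t f = 2; f <= n / f; ) {
            if ( n % f == 0 ) { std::cout << f << " "; n /= f; } // keep testing f
            else ++ f; // move on only once f no longer divides n
        }
        if ( n > 1 ) std::cout << n; // final prime cofactor
        std::cout << std::endl;
    }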


