Can My Water Cooled Raspberry Pi Cluster Beat My MacBook?

I recently built a water-cooled Raspberry Pi cluster, and a lot of people asked how it would compare to a regular computer, because Raspberry Pis themselves aren’t seen as being particularly powerful.

If you haven’t already, have a look at my post on building the Pi Cluster.

How the cluster compares to a traditional computer isn’t really an easy question to answer. It depends on a number of factors and what metrics you measure it against. So this got me thinking of how to fairly compare the cluster to a computer in a way that doesn’t rely too heavily on the software being run and uses my Pi Cluster in the way it was intended when I built it.

Water Cooled Raspberry Pi Cluster Comparison

The cluster, and Raspberry Pis in general, aren’t designed for gaming or rendering high-end graphics, so they obviously won’t perform well against a computer in this respect. But my intention behind building this cluster, apart from learning about and experimenting with cluster computing, was to run mathematical models and simulations.

The Test Script

I initially thought of doing something along the lines of calculating Pi to a particular number of decimal places, but then I stumbled across a simple 4-node cluster setup in The MagPi which was used to find prime numbers up to a certain limit. This seemed like a good comparison as it is simple to understand and edit, it is easily adjustable and it can be run on Windows PCs, Macs and Raspberry Pis, so you can even join in and see how your computer compares.

Python Script For Finding Primes From The MagPi

The script just runs through each number, up to a limit, and checks its divisibility to figure out if it is a prime number or not. I have simplified their cluster script so that it can be run on a PC, Mac or single Raspberry Pi.

import time

#Start and end numbers
#The search starts at 2 because 1 is not a prime. The results quoted in this
#post came from runs that started at 1 and counted it, so they are one higher.
start_number = 2
end_number = 10000

#Record the test start time
start = time.time()

#Create variable to store the prime numbers and a counter
primes = []
noPrimes = 0

#Loop through each number, then through the factors to identify prime numbers
for candidate_number in range(start_number, end_number, 1):
    found_prime = True
    for div_number in range(2, candidate_number):
        if candidate_number % div_number == 0:
            found_prime = False
            break
    if found_prime:
        primes.append(candidate_number)
        noPrimes += 1

#Once all numbers have been searched, stop the timer
end = round(time.time() - start, 2)

#Display the results, uncomment the last line to list the prime numbers found
print('Find all primes up to: ' + str(end_number))
print('Time elapsed: ' + str(end) + ' seconds')
print('Number of primes found ' + str(noPrimes))
#print(primes)

I know that this is a very inefficient way of searching for prime numbers, but the intention is to make the script computationally expensive so that the processors have to work. There are some interesting thoughts and algorithms for finding prime numbers if you’d like to do some further reading.
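
As a quick sanity check on the prime counts, here is a minimal sketch of one of those faster approaches, the Sieve of Eratosthenes. It is not the script used for any of the timings in this post, and the function name count_primes_below is just an illustrative choice.

```python
def count_primes_below(limit):
    """Count the primes p with 2 <= p < limit using a Sieve of Eratosthenes."""
    if limit < 3:
        return 0
    # One flag per number below limit; 1 means "still a candidate prime"
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p < limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p * p
            is_prime[p * p::p] = bytearray(len(range(p * p, limit, p)))
        p += 1
    return sum(is_prime)


for limit in (10000, 100000, 200000):
    print(limit, count_primes_below(limit))
```

This gives 1229, 9592 and 17984, which matches the counts readers report in the comments. The totals quoted in the test results in this post are each one higher (1230, 9593 and 17,985) because those runs started at 1 and counted it as a prime.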

For each setup, we’ll be testing the time it takes to find all prime numbers up to 10,000, 100,000 and 200,000.

I’ll be doing 5 comparisons. The first two are on laptops: a 2020 MacBook Air and a somewhat outdated HP laptop running Windows 10 Pro. We’ll then compare these laptops to a single Pi 4B running at 1.5 GHz, then overclock the single Pi to 2.0 GHz, and then finally run the simulation on the Raspberry Pi Cluster with all of the Pis overclocked to 2.0 GHz.

There were a few requests on my build video to compare the cluster to one of AMD’s Ryzen CPUs. So if any of you are running one, please try running the Python script, which you can download above, and share the results in the comments section. I’d also be interested to see how the Pi 400 performs if anyone has one of those.

Edit – Multi-process Test Script

Thanks to Adi Sieker for putting together a multi-process version of the script. This script makes use of all available cores and threads on the computer it’s being run on, so should give much better comparative results for multi-core processors.

I’ll add my updated test results for each system running this script at the end of this post.

import multiprocessing as mp
import time


#max number to look up to
max_number = 10000
#four processes per cpu
num_processes = mp.cpu_count() * 4

def chunks(seq, chunks):
    #Split seq into the requested number of consecutive, roughly equal slices
    size = len(seq)
    start = 0
    for i in range(1, chunks + 1):
        stop = i * size // chunks
        yield seq[start:stop]
        start = stop

def calc_primes(numbers):
    num_primes = 0
    primes = []

    #Loop through each number, then through the factors to identify prime numbers
    for candidate_number in numbers:
        found_prime = True
        for div_number in range(2, candidate_number):
            if candidate_number % div_number == 0:
                found_prime = False
                break
        if found_prime:
            primes.append(candidate_number)
            num_primes += 1
    return num_primes

def main():
    #Record the test start time
    start = time.time()

    pool = mp.Pool(num_processes)

    #0 and 1 are not primes
    #Split the range into one chunk per process. The original version passed 1
    #here, which handed the whole range to a single process (see the comments).
    parts = chunks(range(2, max_number, 1), num_processes)
    #run the calculation
    results = pool.map(calc_primes, parts)
    total_primes = sum(results)

    #Shut the pool down cleanly before stopping the timer
    pool.close()
    pool.join()

    #Once all numbers have been searched, stop the timer
    end = round(time.time() - start, 2)

    #Display the results
    print('Find all primes up to: ' + str(max_number) + ' using ' + str(num_processes) + ' processes.')
    print('Time elapsed: ' + str(end) + ' seconds')
    print('Number of primes found ' + str(total_primes))

if __name__ == "__main__":
    main()
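
Several readers point out in the comments that splitting the range into consecutive chunks gives uneven work, because the chunks holding the larger numbers need far more divisions. The sketch below is only an illustration and isn’t part of the benchmark. It counts the modulo checks that each of four workers would perform with consecutive chunks and with the candidates dealt out in turn. The helper names are my own, and only odd candidates are counted since every even number above 2 is rejected after a single check.

```python
def divisions_needed(candidate):
    """Number of modulo checks the brute-force test performs for one candidate."""
    for divisor in range(2, candidate):
        if candidate % divisor == 0:
            return divisor - 1  # stopped at the first factor found
    return max(candidate - 2, 0)  # prime: every divisor from 2 to candidate - 1 was tried


def work_per_worker(parts):
    """Total modulo checks for each worker's share of the candidates."""
    return [sum(divisions_needed(n) for n in part) for part in parts]


limit, workers = 5000, 4
numbers = range(3, limit, 2)  # odd candidates only
size = len(numbers)

# Consecutive chunks: worker 0 gets the smallest numbers, worker 3 the largest
consecutive = [numbers[i * size // workers:(i + 1) * size // workers]
               for i in range(workers)]

# Dealt out in turn: worker k takes every fourth odd number, starting at 3 + 2k
interleaved = [numbers[k::workers] for k in range(workers)]

print('Consecutive:', work_per_worker(consecutive))
print('Interleaved:', work_per_worker(interleaved))
```

With consecutive chunks the last worker ends up with several times the work of the first, so the total run time is set by that one slow chunk. Using more, smaller chunks than there are cores, as the script does with four processes per CPU, is another way to reduce the effect.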

Testing The Laptops And Individual Pi

Now that we know what we’re going to be doing, let’s get started with testing the computers.

I’ll start off on my Windows PC. The Windows PC has a 7th generation dual-core i5 processor running at 2.5 GHz.

Outdated HP Laptop

Let’s start off by running the script to 10,000.

Windows 10000

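So as expected, that was completed pretty quickly: 1.69 seconds to find 1230 prime numbers below 10,000. (The run started at 1 and counted it as a prime, so the true count is 1229, and the totals for the larger limits below are likewise one too high.)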

Now let’s try 100,000. Remember that even though 100,000 is only ten times more than 10,000, it’s going to take significantly longer than 10 times the time. Every prime has to be divided by all of the numbers below it, so the total number of checks grows roughly with the square of the limit rather than in proportion to it.
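
To get a feel for how quickly the work grows, this small sketch counts the modulo checks the brute-force search performs for two limits that are a factor of ten apart. It uses smaller limits than the real tests so that it finishes quickly, and the function name total_divisions is my own.

```python
def total_divisions(limit):
    """Count the modulo checks the brute-force search performs below limit."""
    count = 0
    for candidate in range(2, limit):
        for divisor in range(2, candidate):
            count += 1
            if candidate % divisor == 0:
                break
    return count


small = total_divisions(1000)
large = total_divisions(10000)
print('Checks up to 1,000: ', small)
print('Checks up to 10,000:', large)
print('Ratio:', round(large / small, 1))
```

The ratio comes out far higher than 10, which is why each step up in the limit costs so much more time than the increase in the limit alone would suggest.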

Windows 100000

So running the test to 100,000, we get a time of 73 seconds, which is a minute and 13 seconds and we found 9593 prime numbers.

Lastly, let’s try 200,000.

Windows 200000

So it took 267 seconds, or about four and a half minutes, to find the prime numbers to 200,000 and we found 17,985 primes.

Here’s a summary of the HP laptop’s results.

HP Laptop

Next, we’ll look at the MacBook Air. The MacBook Air has a 1.6 GHz dual-core i5 processor, so let’s see how that compares to the older HP laptop. We’d expect the MacBook to be a bit slower than the PC as its CPU is only running at 1.6 GHz, while the PC’s is running at 2.5 GHz.

2020 MacBook Air

The MacBook Air was quicker to 10,000 but then took a little longer than the PC for the next two tests, taking just under 6 minutes to find the primes up to 200,000.

MacBook 10000
MacBook 100000
MacBook 200000

Here’s a summary of the results of the two tests so far:

Laptops

Let’s now move on to the single Raspberry Pi running at 1.5 GHz.

Overclocked Pi 4B

The Pi 4B has a quad-core ARM Cortex-A72 processor.

Pi Running 1.5 GHz

Even to 10,000, we can already see that the Pi is quite a bit slower than the other computers, taking 2 seconds for the first 10,000 and taking a little over 13 minutes to get to 200,000.

Next, we’ll overclock the Pi to 2.0 GHz and see what sort of difference it makes.

Pi Running 2.0 GHz

Overclocking the Pi has made a bit of an improvement. It took 1.57 seconds to 10,000, and around 11 minutes to get to 200,000.

Here’s a summary of the results of our tests of the individual computers:

2 GHz Pi

Setting Up The Raspberry Pi Cluster

Next, we need to get the Pi’s all overclocked and working together in a cluster. To do this, there are a couple of things we need to set up.

I’ve installed a fresh copy of Raspberry Pi OS on the host or master node and then a copy of Raspberry Pi OS Lite on the other 7 nodes.

Prepare Each Node For SSH

Boot them up and then run the following lines to update them:

sudo apt -y update
sudo apt -y upgrade

Next, run:

sudo raspi-config

Then change each Pi’s password and hostname. I used the hostnames Node1, Node2, etc. Also, make sure that SSH is turned on for each Pi so that you can access them over the network.

Changing Hostname And Password

Next, you need to assign static IP addresses to your Pis. If you’re not working on a dedicated network, make sure that you choose a range which is not already being assigned by your router.

sudo nano /etc/dhcpcd.conf

Then add the following lines to the end of the file:

interface eth0
static ip_address=192.168.0.1/24

I used IP addresses 192.168.0.1, 192.168.0.2, 192.168.0.3 etc.
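
If you’d rather not type the entry out by hand for every node, a small helper like the one below can print the lines for each one. This is just a convenience sketch: the function name and its defaults are my own and simply mirror the interface and address range used above.

```python
def static_ip_stanza(node_number, interface='eth0', subnet='192.168.0', prefix=24):
    """Return the dhcpcd.conf lines for one node, following the addressing above."""
    return (f'interface {interface}\n'
            f'static ip_address={subnet}.{node_number}/{prefix}\n')


# Print the entry for each of the 8 nodes so it can be pasted into that node's file
for node in range(1, 9):
    print(f'# Node{node}')
    print(static_ip_stanza(node))
```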

Then reboot your Pis and you should be able to do the rest of the setup through Node 1.

We can now use the NMAP utility to see that all 8 nodes are online:

nmap 192.168.0.1-8

Using NMAP To See All Nodes Are Online

Overclock Each Node To 2.0 GHz

Next, we need to overclock each Pi to 2.0 GHz. I’ll do this from node 1 and SSH into each node to overclock it.

SSH into each Pi by entering into the terminal on Node1:

ssh <username>@<node IP address>

You’ll then be asked to enter your username and password for that node and you can then edit the config file by entering:

sudo nano /boot/config.txt

Find the line which says #uncomment to overclock the arm and then add/edit the following lines:

over_voltage=6
arm_freq=2000

Overclocking Each Pi In Cluster Through SSH

Reboot each node once you’ve edited and saved the file.

Create SSH Key Pairs So That You Don’t Need To Use Passwords

Next, we need to allow the Pis to communicate with the host without requiring a password. We do this by creating SSH keys for the host and each of the nodes and sharing the keys between them.

Let’s start by creating the key on the host by entering:

ssh-keygen -t rsa

Creating Host Node's SSH Key

Just hit ENTER or RETURN for each question, don’t change anything or create a passphrase.

Next, SSH into each node as done previously and enter the same line to create a key on each of the nodes:

ssh-keygen -t rsa

Creating Individual Node's Keys

Before you exit or disconnect from each node, copy the key which you’ve created to the master node, node 1:

ssh-copy-id 192.168.0.1

Finally, do the same on the master node, copying its key to each of the other nodes:

ssh-copy-id 192.168.0.2

You’ll obviously need to increment the last digit of the IP address and repeat this for each of your nodes so that the key is copied to all nodes.

This is only done in pairs between the host and each node, so the nodes aren’t able to communicate with each other, only with the host.

You should now be able to SSH into each Pi from node 1 without requiring a password.

ssh '192.168.0.2'
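
Rather than trying each node by hand, the check can be scripted. The sketch below is an assumption-laden example rather than part of my setup: the function names are my own, the addresses follow the range used above, and it relies on OpenSSH’s BatchMode option, which makes ssh fail instead of prompting for a password.

```python
import subprocess


def node_addresses(count, subnet='192.168.0', first=1):
    """Return count consecutive node addresses, e.g. 192.168.0.2 to 192.168.0.8."""
    return [f'{subnet}.{n}' for n in range(first, first + count)]


def ssh_check_command(address):
    """Build an ssh command that fails instead of asking for a password."""
    return ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=5', address, 'hostname']


def check_nodes(addresses):
    """Run the check against each node and report its hostname or a failure."""
    for address in addresses:
        result = subprocess.run(ssh_check_command(address),
                                capture_output=True, text=True)
        status = result.stdout.strip() if result.returncode == 0 else 'FAILED'
        print(f'{address}: {status}')


# Show the commands that would be run for the 7 worker nodes
for address in node_addresses(7, first=2):
    print(' '.join(ssh_check_command(address)))
```

Calling check_nodes(node_addresses(7, first=2)) on node 1 would then run the checks for real and print each node’s hostname, or FAILED for any node that still asks for a password or can’t be reached.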

Install MPI (Message Passing Interface) On All Nodes In The Raspberry Pi Cluster

Next, we’re going to install MPI, which stands for Message Passing Interface, onto all of our nodes. This allows the Pis to delegate tasks amongst themselves and report the results back to the host.

Let’s start by installing MPI on the host node by entering:

sudo apt install mpich python3-mpi4py

Installing MPI On Host Node

Again, use SSH to install MPI onto each of the other nodes using the same command:

Installing MPI On Other Nodes In The Raspberry Pi Cluster

Once you’ve done this on all of your nodes, you can test that they’re all working and that MPI is running by trying the following:

mpiexec -n 8 --host 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8 hostname
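
That host list gets long quickly, so if you change the number of nodes it can be easier to generate the command. This is a minimal sketch with a hypothetical helper name, assuming the nodes are numbered consecutively from 192.168.0.1 as above.

```python
def mpiexec_command(node_count, program, subnet='192.168.0'):
    """Build the mpiexec argument list for nodes 1..node_count on the subnet."""
    hosts = ','.join(f'{subnet}.{n}' for n in range(1, node_count + 1))
    return ['mpiexec', '-n', str(node_count), '--host', hosts] + list(program)


# The same test as above: ask all 8 nodes to report their hostname
print(' '.join(mpiexec_command(8, ['hostname'])))
```

The list form can also be passed straight to subprocess.run() if you want to launch the job from a Python script.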

You should get a report back from each node with its hostname:

Check MPI Is Running On All Nodes

Copy The Prime Calculation Script To Each Node

The last thing to do is to copy the Python script to each of the Pis, so that they all know what they’re going to be doing.

Here is the script we’re going to be running on the cluster:
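
The script itself isn’t reproduced in the text here, so as a rough guide, this is a minimal sketch of what an mpi4py version of the prime search can look like. It is not necessarily identical to the prime.py used for the results below. The way the odd candidates are dealt out between the nodes and the fallback to a single process when mpi4py isn’t installed are my own assumptions, and the printed format follows the cluster output a reader posted in the comments.

```python
import sys
import time


def count_primes_for_rank(rank, size, limit):
    """Count the primes below limit that fall to one rank.

    Odd candidates are dealt out in turn (rank 0 takes 3, rank 1 takes 5, ...)
    so every rank gets a similar mix of small and large numbers.
    """
    count = 1 if rank == 0 and limit > 2 else 0  # rank 0 accounts for the prime 2
    for candidate in range(3 + 2 * rank, limit, 2 * size):
        is_prime = True
        for divisor in range(2, candidate):
            if candidate % divisor == 0:
                is_prime = False
                break
        if is_prime:
            count += 1
    return count


def main():
    # The limit is the first command line argument, e.g. "prime.py 10000"
    limit = int(sys.argv[1]) if len(sys.argv) > 1 and sys.argv[1].isdigit() else 10000
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
    except ImportError:
        # Assumption for this sketch: run as a single process without mpi4py
        MPI, comm, rank, size = None, None, 0, 1

    start = time.time()
    local_count = count_primes_for_rank(rank, size, limit)
    # Add up the counts from every rank; only rank 0 receives the total
    total = local_count if comm is None else comm.reduce(local_count, op=MPI.SUM, root=0)

    if rank == 0:
        print('Find all primes up to: ' + str(limit))
        print('Nodes: ' + str(size))
        print('Time elapsed: ' + str(round(time.time() - start, 2)) + ' seconds')
        print('Primes discovered: ' + str(total))


if __name__ == '__main__':
    main()
```

Launched with mpiexec -n 8, each of the 8 processes works through its own share of the candidates and only the totals are sent back to the first process, so there is very little communication between the nodes.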

The easiest way to do this is with the following line:

scp ~/prime.py 192.168.0.2:

You’ll again obviously need to increment the IP address for each node, and the above assumes that the script prime.py is in the home directory.

You can check that this has worked by opening up an SSH connection on any node and trying:

mpiexec -n 1 python3 prime.py 1000

Once this is working, then we’re ready to try out our cluster test.

Testing The Raspberry Pi Cluster

We’ll start out with calculating the primes up to 10,000. So we’ll start a cluster operation with 8 nodes, list the nodes’ IP addresses and then tell the operation which application to use, which script to run and finally the limit to run the test up to:

mpiexec -n 8 --host 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4,192.168.0.5,192.168.0.6,192.168.0.7,192.168.0.8 python3 prime.py 10000

The cluster was able to get through the first 10,000 in 0.65 seconds, faster than either of our computers, which is quite surprising given that the system needs to manage communication to and from the nodes as well.

Here are the results for the test to 10,000, 100,000 and then to 200,000:

Pi Cluster Finding Primes Computing Test

The search to 200,000 took just 85 seconds, which is a little over 3 times faster than the Windows PC and 4 times faster than the MacBook. It was also just a little under 8 times faster than the individual Pi.
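
The arithmetic behind those comparisons is simple enough to check in a few lines. The single Pi time is the rounded "around 11 minutes" from earlier, so treat that speed-up and the efficiency figure as approximate.

```python
def speedup(baseline_seconds, cluster_seconds):
    """How many times faster the cluster is than the baseline."""
    return baseline_seconds / cluster_seconds


cluster_200k = 85           # seconds, cluster of 8 Pis at 2.0 GHz
hp_laptop_200k = 267        # seconds, HP laptop
single_pi_200k = 11 * 60    # "around 11 minutes" for one overclocked Pi (approximate)

print('vs HP laptop:', round(speedup(hp_laptop_200k, cluster_200k), 2))
pi_speedup = speedup(single_pi_200k, cluster_200k)
print('vs single Pi:', round(pi_speedup, 2))
print('Parallel efficiency over 8 nodes:', round(pi_speedup / 8, 2))
```

This works out to about 3.14 times faster than the HP laptop and about 7.76 times faster than the single Pi, or roughly 97% of the ideal 8 times speed-up.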

Here is a comparison of the combined results from all of the tests done:

Pi Cluster

Lastly, I just ran the simulation to 500,000 on the cluster to see how fast it would be.

Raspberry Pi Cluster Additional Test Up To 500000

That took 526 seconds, or a little under 9 minutes.

I plotted a trend and forecast the 500,000 times for the other tests so that you can see how they compare. I’ve converted all of these values to minutes to make them a bit more understandable.
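
If you’d like to make a similar forecast for your own results, one simple way is to fit a power law through your two largest measurements. This is only a sketch of that idea and an assumption on my part, not necessarily the same trend used for the chart. The example numbers are the HP laptop’s times for 100,000 and 200,000.

```python
import math


def power_law_forecast(n1, t1, n2, t2, target):
    """Fit t = a * n**b through two measurements and evaluate it at target."""
    b = math.log(t2 / t1) / math.log(n2 / n1)
    a = t1 / n1 ** b
    return a * target ** b, b


# HP laptop: 73 s to 100,000 and 267 s to 200,000
forecast, exponent = power_law_forecast(100000, 73, 200000, 267, 500000)
print('Fitted exponent:', round(exponent, 2))
print('Forecast for 500,000:', round(forecast / 60, 1), 'minutes')
```

The fitted exponent comes out a little under 2, which fits with the work growing roughly with the square of the limit.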

Raspberry Pi Cluster Forecast

So our cluster was able to beat the PC and Mac quite significantly, which might be somewhat surprising, but that is the power of cluster computing. You can imagine that when running really large simulations, which often take a couple of days on a PC, being able to run the simulation just 2-3 times faster is a massive saving. A week-long simulation on the PC can be completed by the Pi Cluster in just two and a half days.

Now obviously we could cluster PCs as well to achieve better simulation times, but remember that each Pi node in this setup costs just $35, so you can build a pretty powerful computer for a few hundred dollars using Raspberry Pis. You’re also not limited to just 8 nodes. You could add another 8 nodes to this setup for around $400 and you’d have a cluster which performs around 6 times faster than the PC.

Multi-Process Test Results

As mentioned in an earlier edit in the post, Adi Sieker put together a multi-process version of the script.

Here are the results of the tests done so far (I’ll keep adding to them as I complete them on each platform):

HP Laptop – Using 16 processes:

  • 10,000 – 0.9 s
  • 100,000 – 18.27 s
  • 200,000 – 66.99 s
  • 500,000 – 374.3 s (6 mins 15 s)

What About The Temperature Of The Loop?

I also checked the temperature of the master node, which is midway through the cooling loop (5th in the loop), to see how warm it was after the test:

Check CPU Temperature Afterwards

It was only around 8 degrees above room temperature after the test.
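
If you want to log the temperature from a script, Raspberry Pi OS also exposes the CPU temperature as a file in thousandths of a degree Celsius. The sketch below reads it, with helper names of my own, and simply reports that the value isn’t available when run on a system without that file.

```python
from pathlib import Path

# Linux thermal zone file; the value is in thousandths of a degree Celsius
THERMAL_FILE = Path('/sys/class/thermal/thermal_zone0/temp')


def parse_millidegrees(raw):
    """Convert the raw file contents, e.g. '45277', to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_cpu_temperature(path=THERMAL_FILE):
    """Return the CPU temperature in degrees Celsius, or None if unavailable."""
    try:
        return parse_millidegrees(path.read_text())
    except (OSError, ValueError):
        return None


temperature = read_cpu_temperature()
if temperature is None:
    print('CPU temperature not available on this system')
else:
    print(f'CPU temperature: {temperature:.1f} C')
```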

Next, I’m going to be doing a full thermal test on the Raspberry Pi Cluster to check how it performs under full load for a sustained period. So be sure to check back in a week or two or subscribe to my channel for updates on YouTube.

As mentioned earlier, feel free to download the script and try it out on your own computer and share your results with us in the comments section. We’d love to see how some other setups compare.

Michael Klements
Hi, my name is Michael and I started this blog in 2016 to share my DIY journey with you. I love tinkering with electronics, making, fixing, and building - I'm always looking for new projects and exciting DIY ideas. If you do too, grab a cup of coffee and settle in, I'm happy to have you here.

36 COMMENTS

  1. Came to notice your water cooled PI cluster in YT.
    Since you have shown bench marks, I tried to run your scripts on my machine too Ryzen 1700 in windows 10.
    The timing was close to what you have shown; For 100K 42sec for the single process script and 24sec for the multi process script.

    Then I noticed that even with multi process script running only one thread on the task manager is at full load, the others are close to idle. I see a number of python process running. But the processor utilization is 0 on those (except for the one). So in effect the script is not utilizing the full power of the processors.
    Not sure if the story is different in Linux
    I tried the prime calculation of 100K (similar to your code) in C++/C#, both gives sub second performance (with full cpu utilization).

    • I tested on ubuntu on 2 different machines. one has 2 cores, one has 6. Same thing. The multiprocessor script doesn’t utilize all the cores. To me this means this test is not accurate.

      There is probably a better script / program out there that can be run to take advantage of SMP.

    • Saw the same Python Script effect on my HP-Pavilion NP192AA-ABA p6140f x64-based PC Processor Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz, 2333 Mhz, 4 Core(s), 4 Logical Processor(s).

      Reviewed the Python Script “FindingPrimesMulti.py” in the “FindingPrimesMulti-1.zip” file. Needed to change the following script code
      ” parts = chunks(range(2, max_number, 1), 1) ”
      to
      ” parts = chunks(range(2, max_number, 1), num_processes) ”
      to correct the failing to assign ranges of numbers to each processes.

      • Adding the Benchmark Results: (see notes below)
        Find all primes up to: 10000 using 4 processes.
        Time elasped: 0.61 seconds
        Number of primes found 1229

        Find all primes up to: 100000 using 16 processes.
        Time elasped: 14.61 seconds
        Number of primes found 9592

        Find all primes up to: 200000 using 16 processes.
        Time elasped: 51.24 seconds
        Number of primes found 17984

        Find all primes up to: 500000 using 16 processes.
        Time elasped: 297.43 seconds
        Number of primes found 41538

        Notes:
        1) Used “FindingPrimesMulti.py” with Python Script code change “parts = chunks(range(2, max_number, 1), num_processes)” to correct the failure to assign ranges of numbers to each processes.
        2) OS: Microsoft Windows 10 Pro, Version 10.0.18363 Build 18363
        3) Hardware: HP-Pavilion p6140f x64 Intel Core2 Quad CPU Q8200 @ 2.33GHz, 2333 Mhz, 4 Core(s), 4 Logical Processor(s), Memory (RAM) 8.00 GB
        4) Finding: Tested using 4 to 16 processes found that by using more processes like 16 kept all 4 cores more active because some assigned ranges ended quicker then others.

        • Lacking Raspberry Cluster hardware for kicks created a Virtual Raspberry Cluster.
          OS/Hardware used: Windows 10 Pro 64-bit 8GB RAM, Intel Core(TM)2 Quad CPU Q8200 @2.33GHz, 2333Mhz, 4 Core(s), SSD Drive
          Installed Oracle VM VirtualBox 6.0.24 64-bit. https://download.virtualbox.org/virtualbox/6.0.24/ (VirtualBox-6.0.24-139119-Win.exe)
          Because VT-x is not available on my Intel Core2 Quad CPU had to use Raspberry Pi Desktop 4.29 32-bit ISO (2021-01-11-raspios-buster-i386.iso)
          Created VirtualBox “Raspberry Pi Desktop 4.29”
          VM Debian (32-bit), 1 CPU (32-bit max), Enable PAE/NX, RAM 1GB, PIIX3, PS/2 Mouse, VBoxSVGA 128MB, Hardware Clock in UTC time
          SATA Port0-cd “2021-01-11-raspios-buster-i386.iso”, Port1-Raspberry.vdi 32GB dyn,
          Net Adpt1-Intel 1000MTDesktop Bridged adpt Realtek PCIe GbE Family Controller
          Net Adpt2-Intel 1000MTdesktop NAT
          Shared Folders: Path: c:\users\”user”\Desktop\Share, FolderName: Share, MountPoint: /home/pi/share
          Virtual booting of VirtualBox “Raspberry Pi Desktop 4.29” is very i/o write sensitive to disk drive delays, avoided boot hangups by using a SSD Drive.
          Updated VirtualBox “Raspberry Pi Desktop 4.29” VirtualBox Guest Additions 5.2.0 to 6.0.24 (cdrom: VBoxGuestAdditions_6.0.24.iso)
          sudo sh /../../media/cdrom/VBoxLinuxAdditions.run
          Used Router DHCP Reservation to assign permanent IP address to the four virtual cluster Nodes
          Lacking a DNS Server updated all four virtual cluster nodes /etc/hosts adding “192.168.2.3_ node_” _=1,2,3,4
          Cloned VirtualBox “Raspberry Pi Desktop 4.29” three times renaming them and their hostname to node_ _=1,2,3,4
          Used Node1 as the Cluster Master node and followed your documented steps to complete the ssh & mpiexec cluster nodes setup.

          Test VM Cluster setup:
          mpiexec -n 4 --host node1,node2,node3,node4 hostname
          node1
          node3
          node2
          node4

          Run VM Cluster Primes timings:
          mpiexec -n 4 --host node1,node2,node3,node4 python3 /home/pi/share/prime.py 10000 >> ./share/VMPrimes.txt
          Find all primes up to: 10000
          Nodes: 4
          Time elasped: 0.68 seconds
          Primes discovered: 1229

          mpiexec -n 4 --host node1,node2,node3,node4 python3 /home/pi/share/prime.py 100000 >> ./share/VMPrimes.txt
          Find all primes up to: 100000
          Nodes: 4
          Time elasped: 79.38 seconds
          Primes discovered: 9592

          mpiexec -n 4 --host node1,node2,node3,node4 python3 /home/pi/share/prime.py 200000 >> ./share/VMPrimes.txt
          Find all primes up to: 200000
          Nodes: 4
          Time elasped: 334.12 seconds
          Primes discovered: 17984

          mpiexec -n 4 --host node1,node2,node3,node4 python3 /home/pi/share/prime.py 500000 >> ./share/VMPrimes.txt
          Find all primes up to: 500000
          Nodes: 4
          Time elasped: 2435.46 seconds
          Primes discovered: 41538

    • Hi, I came here via your youtube video while browsing for things to do with my PI 400
      I am old school basic, cobol and pascal, as i have had a PI3b for 2 years i have never used python or anything newer than excel macros 🙂

      So i grabed python 3.94 from python.org to suit win 10 64 bit

      this machine is my daily driver and internet machine and is a simple Lenovo thinkcentre core2 machine G3220 but it does have 12G ram

      i only ran3 tests as it is late at night and i need to get some sleep

      single core
      10,000 0.59 seconds
      100,000 50.95 seconds

      Multi Core
      100,000 29.85 seconds

      On the weekend i will add python to my I3 6th gen laptop and the spare I7 4th gen used as a games machine by my son
      Cheers
      George

      • Thanks for sharing your results George. It would be interesting to try out on your older Pi3b as well if that’s still running.

  2. I googled and found no solution. I assume the work just can’t be split.
    Because this script here, which calculates a Monte Carlo estimate of π, does spawn the correct number of actually working processes:

    #!/usr/bin/python

    import random
    from multiprocessing import Pool, cpu_count
    from math import sqrt
    from timeit import default_timer as timer

    def pi_part(n):
        print(n)
        count = 0
        for i in range(int(n)):
            x, y = random.random(), random.random()
            r = sqrt(pow(x, 2) + pow(y, 2))
            if r < 1:
                count += 1
        return count

    def main():
        start = timer()
        np = cpu_count()
        print(f'You have {np} cores')
        n = 100_000_000
        part_count = [n/np for i in range(np)]
        with Pool(processes=np) as pool:
            count = pool.map(pi_part, part_count)
        pi_est = sum(count) / (n * 1.0) * 4
        end = timer()
        print(f'elapsed time: {end - start}')
        print(f'π estimate: {pi_est}')

    if __name__=='__main__':
        main()

    ——-
    There was a spam detection because I used a VPN. The comment didn't show up so I made the same one without VPN. However it says it's a duplicate and the other comment is not visible, so here I am writing something to change the comment to make it be visible. …

    • Can you post this to a pastebin somewhere? I’m cleaning it up but not sure i have indentation correct, it was all lost in your post.

  3. I actually shared code here that you can run on multiprocessor systems, but the Moderator didn’t approve of it. Probably because this site ignores normal “new lines”, so it was all squeezed together.

    It was calculating Monte Carlo primes. I think you can find it easily with Google. Python, of course.

    Kind regards 🙂

  4. I tried this out on my AMD Ryzen 5900X, with the multi-cpu version of the script.
    It’s quite amusing fun:

    Find all primes up to: 200000 using 96 processes.
    Time elasped: 60.58 seconds
    Number of primes found 17984

    Process finished with exit code 0

    • why has your 12 Core Processor needed more time for a less range. Im using an R7 3700x on 4,4 Ghz allcore on Win 10 with visual Studio Code.

      Find all primes up to: 500000 using 64 processes.
      Time elasped: 50.26 seconds
      Number of primes found 41538

  5. This Spam Protection is really annoying.
    I changed the script to use all cores… You can find it as Comment of “MadMe86” on your YoutubeVideo
    Results of my 3950X
    First Script:
    Find all primes up to: 200000
    Time elasped: 90.6 seconds
    Number of primes found 17985
    MultiThread Script:
    Find all primes up to: 200000 using 128 processes.
    Time elasped: 54.47 seconds
    Number of primes found 17984

    with my modified Script:
    Find all primes up to: 200000 using 32 processes.
    Time elasped: 4.14 seconds
    Number of primes found 17984

    Find all primes up to: 1000000 using 32 processes.
    Time elasped: 95.78 seconds
    Number of primes found 78498

  6. I am not a python expert so please forgive the quick and dirty code 😉

    import multiprocessing as mp
    import time

    max_number = 10000
    num_processes = mp.cpu_count()

    def getchunks():
        superList = []

        for e in range(0, num_processes, 1):
            subList = []
            superList.append(subList)
        i = 0
        for i in range(2, max_number, 1):
            superList[i % num_processes].append(i)

        return superList

    def calc_primes(numbers, arr):
        num_primes = 0
        primes = []

        firstno = numbers[0] % num_processes

        #Loop through each number, then through the factors to identify prime numbers
        for candidate_number in numbers:
            found_prime = True
            for div_number in range(2, candidate_number):
                if candidate_number % div_number == 0:
                    found_prime = False
                    break
            if found_prime:
                primes.append(candidate_number)
                num_primes += 1
        #print(num_primes)
        arr[firstno] = num_primes

    def main():
        #Record the test start time
        start = time.time()
        resultList = mp.Array('i', range(num_processes))

        parts = getchunks()

        processes = []

        for i in parts:
            p = mp.Process(target=calc_primes, args=(i, resultList))
            processes.append(p)
            p.start()

        for process in processes:
            process.join()

        total_primes = sum(resultList)

        end = round(time.time() - start, 2)

        print('Find all primes up to: ' + str(max_number) + ' using ' + str(num_processes) + ' processes.')
        print('Time elasped: ' + str(end) + ' seconds')
        print('Number of primes found ' + str(total_primes))

    if __name__ == "__main__":
        main()

  7. when I ran the multi-CPU version on my AMD Ryzen 9 3900X 12-Core Processor I got this
    Find all primes up to: 100000 using 12 processes.
    Time elasped: 14.98 seconds
    Number of primes found 9592
    that is not what I was expecting that’s way too slow. come to find out it’s only running on 1 thread/core.
    I made a script that uses all available cores.
    https://github.com/Kodi4444/CPU_benchmark.git

  8. Hi Michael,
    Thanks for posting this series. Finally got my 4x cluster up and running. Just to add to the mess of comments talking about single-core performance, I modified your prime.py script (The one using mpi4py) so it would utilize all cores of the Raspberry Pi 4. It uses the same algorithm that your code uses, or at least it’s close. With my 4x Nodes using 4x cores each (16 cores), I have close to twice the performance of your cluster! Just thought I would share my results.

    https://github.com/joshjerred/mpi4py-with-multiprocessing-Check-for-primes

    4x Raspberry Pi 4s running between 1.8 and 2.0 GHz
    100,000 @ 9.68 seconds
    200,000 @ 45.2 seconds

  9. I just ran the script on my new Apple M1 MacBook Air with 8gb of RAM…here are my results

    10,000 @ 0.44 seconds
    100,000 @ 40.57 seconds
    200,000 @ 160.66 seconds

    • I ran the multi-process version of the script on Apple M1 MacBook Air with 16 gb of RAM:

      Find all primes up to: 10000 using 32 processes.
      Time elasped: 0.81 seconds
      Number of primes found 1229

      Find all primes up to: 100000 using 32 processes.
      Time elasped: 15.36 seconds
      Number of primes found 9592

      Find all primes up to: 200000 using 32 processes.
      Time elasped: 57.22 seconds
      Number of primes found 17984

  10. You might want to correct the multiprocessing implementation.

    Also, for the work defined by the algorithm used, distributing the load by simply splitting the search linearly is not efficient, quite often there are execution nodes/cores that end up starving for work.

    I’m the same person from YT that shared an implementation using joblib to make use of the other cores in a PC, I’ve extended (and fixed) the multiprocessing implementation shown here, both implementations (joblib and the modified multiprocessing one) are available at:
    https://drive.google.com/drive/folders/1_VUNGTMIvpuy_7pAXjTvD0MCGfQaf2NM?usp=sharing

    In the multiprocessing implementation I’ve also expose a bit why it is not very efficient to split the work as it was done in the original script.

      • Updated the shared folder with a simpler version that doesn’t evaluate the work done per execution node, as such there is not extra overhead and the results comparable with the rest of the scripts in this webpage.

        I’m probably digging too much into this rabbit hole, and this would probably be some nice subject for a blog post about distributed algorithms, but here it is.

        For a the modified work distribution in an old Intel® Core™ i5-4570 CPU @ 3.20GHz × 4, with Python 3.8.5 in Ubuntu 20.04.2 LTS I got the following results when requesting work for 4 tasks in a Pool of 4 workers:
        size: 10000 time: 0.09s primes: 1229
        size: 50000 time: 1.9s primes: 5133
        size: 100000 time: 7.16s primes: 9592
        size: 200000 time: 26.56s primes: 17984
        size: 500000 time: 146.88s primes: 41538
        size: 1000000 time: 557.53s primes: 78498

        Using the work distribution used in the initial run the results are as follows:
        size: 10000 time: 0.14s primes: 1229
        size: 50000 time: 2.89s primes: 5133
        size: 100000 time: 11.51 primes: 9592
        size: 200000 time: 40.85s primes: 17984
        size: 500000 time: 246.78 primes: 41538
        size: 1000000 time: 946.5 primes: 78498

        Looking at the results, without balancing the work between the nodes the code runs almost twice as slow. This means that at least one node spent a large part of the total time doing nothing!

        I suspect this gets worse as you add more nodes… you can mitigate it by subdividing the work into more sub-tasks, but then you’ll start to get more overhead as you set up more tasks. I might make a comparison later to check how it scales with the number of available worker nodes.
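
        To illustrate the trade-off, here is a rough sketch that counts trial divisions instead of measuring time. It assumes a naive check that tries divisors from 2 upward and stops at the first hit, which may not match the shared scripts exactly, and the helper names and the block size of 50 are made up for the example. It compares one contiguous slice per worker with many small blocks dealt out in turn:

```python
def division_count(n):
    """Count the trial divisions a naive check needs to classify n."""
    count = 0
    for d in range(2, n):
        count += 1
        if n % d == 0:
            break
    return count

def contiguous_split(numbers, workers):
    """One contiguous slice per worker (the linear split)."""
    size = -(-len(numbers) // workers)  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def blocked_round_robin(numbers, workers, block=50):
    """Cut the range into small blocks and deal them out in turn."""
    shares = [[] for _ in range(workers)]
    for index, start in enumerate(range(0, len(numbers), block)):
        shares[index % workers].extend(numbers[start:start + block])
    return shares

def imbalance(shares, cost):
    """Busiest worker's load divided by the average load (1.0 is ideal)."""
    loads = [sum(cost[n] for n in share) for share in shares]
    return max(loads) / (sum(loads) / len(loads))

LIMIT = 4000   # kept small so the sketch runs quickly
WORKERS = 4
numbers = list(range(2, LIMIT))
cost = {n: division_count(n) for n in numbers}

contiguous_ratio = imbalance(contiguous_split(numbers, WORKERS), cost)
blocked_ratio = imbalance(blocked_round_robin(numbers, WORKERS), cost)
print(f"contiguous: {contiguous_ratio:.2f}  blocked: {blocked_ratio:.2f}")
```

        With the contiguous split, the worker holding the top of the range does far more divisions than the others, while the small blocks even that out. The sketch ignores the per-task setup cost, which is what grows as the blocks get smaller.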

        While for the sake of simplicity I guess the prime algorithm used is fine for comparing performance, it has a few snags with the work distribution, so I feel something more easily split would be better. Then again, you might not have such an easy time with a real-world algorithm you wish to run.

  11. Had to give it a shot 🙂

    MacBook Pro 16″ (2.3 GHz 8-Core Intel Core i9), 16 GB RAM.

    FindingPrimesMulti.py – with ‘parts = chunks(range(2, max_number, 1), num_processes)’ fix.

    Find all primes up to: 200000 using 64 processes.
    Time elasped: 14.9 seconds
    Number of primes found 17984

  12. Find all primes up to: 500000 using 32 processes.
    Time elasped: 142.17 seconds
    Number of primes found 41538

    I7-7700, using the corrected multi-processor script.

  13. AMD 5800X

    Summary of the benchmarking:

    | Method | Work size | Tasks | Time (s) | Primes
    | basic | 10000 | 16 | 0.08 | 1229
    | halved | 10000 | 32 | 0.02 | 1229
    | heuristic | 10000 | 16 | 0.02 | 1229
    | basic | 50000 | 16 | 0.52 | 5133
    | halved | 50000 | 32 | 0.49 | 5133
    | heuristic | 50000 | 16 | 0.41 | 5133
    | basic | 100000 | 16 | 1.96 | 9592
    | halved | 100000 | 32 | 1.74 | 9592
    | heuristic | 100000 | 16 | 1.56 | 9592
    | basic | 200000 | 16 | 7.24 | 17984
    | halved | 200000 | 32 | 6.47 | 17984
    | heuristic | 200000 | 16 | 5.9 | 17984
    | basic | 500000 | 16 | 42.51 | 41538
    | halved | 500000 | 32 | 37.92 | 41538
    | heuristic | 500000 | 16 | 34.01 | 41538
    | basic | 1000000 | 16 | 157.78 | 78498
    | halved | 1000000 | 32 | 142.46 | 78498
    | heuristic | 1000000 | 16 | 128.3 | 78498

  14. On my M1 Mac Mini, using 32 processes (4 per core) and chunk sizes 64-256:
    10k : 1.1
    100k : 4.0
    200k : 12.2
    500k : 66.0

  15. Hi,
    Just tried your instructions for the Raspberry Pi cluster. I have a total of three Raspberry Pis running. It seems they are reporting their results individually rather than as a cluster.

    $ mpiexec -n 3 --host 192.168.1.60,192.168.1.61,192.168.1.62 python3 prime.py 10000
    bash: warning: setlocale: LC_ALL: cannot change locale (ja_JP.UTF-8)
    bash: warning: setlocale: LC_ALL: cannot change locale (ja_JP.UTF-8)
    Find all primes up to: 10000
    Nodes: 1
    Time elasped: 1.96 seconds
    Primes discovered: 1229
    Find all primes up to: 10000
    Nodes: 1
    Time elasped: 1.96 seconds
    Primes discovered: 1229
    Find all primes up to: 10000
    Nodes: 1
    Time elasped: 1.97 seconds
    Primes discovered: 1229

    According to your blog post it should report combined results, but mine does not. Do you have any clue what is going wrong? I’m a newbie at clustering Raspberry Pis and would like to know how to make these little machines act like a supercomputer.
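
    Three separate "Nodes: 1" reports suggest that each process believes it is alone, even though mpiexec started three of them. One possible cause, which is only a guess from the output above, is that the Python MPI bindings were built against a different MPI implementation than the mpiexec on the PATH. A minimal check, assuming prime.py uses mpi4py, compares the process count advertised by the launcher's environment variables (OMPI_COMM_WORLD_SIZE for Open MPI, PMI_SIZE for MPICH's Hydra launcher) with the count mpi4py reports. The function names are hypothetical:

```python
import os

# Environment variables common launchers set for every process they start.
LAUNCHER_SIZE_VARS = {
    "Open MPI": "OMPI_COMM_WORLD_SIZE",
    "MPICH (Hydra)": "PMI_SIZE",
}

def launcher_world_size(env):
    """Return (launcher name, process count) as advertised by the launcher."""
    for name, var in LAUNCHER_SIZE_VARS.items():
        if var in env:
            return name, int(env[var])
    return None, None

def diagnose(launcher_size, library_size):
    """Compare what the launcher started with what the MPI library sees."""
    if launcher_size is None:
        return "no launcher variables found; was this started with mpiexec?"
    if library_size is None:
        return "mpi4py could not be imported on this node"
    if library_size != launcher_size:
        return "mismatch: mpi4py and mpiexec may belong to different MPI builds"
    return "ok"

if __name__ == "__main__":
    name, launched = launcher_world_size(os.environ)
    try:
        from mpi4py import MPI
        seen = MPI.COMM_WORLD.Get_size()
    except ImportError:
        seen = None
    print(f"launcher: {name}, started: {launched}, mpi4py sees: {seen}")
    print(diagnose(launched, seen))
```

    Saving this as a script and running it with the same mpiexec command in place of prime.py should print one report per process. A "mismatch" line on every node would point towards the installation rather than the prime script.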
