Compute Cluster Build


Machine learning rig built from two 3080s and a Tesla K80

The entire system runs on a Proxmox host, with Ubuntu Server VMs and Docker on top

I’ve been meaning to do this project for a while: build a compute cluster that uses older (cheaper) data center GPUs for machine learning. These cards seem to be the only GPUs that have actually dropped in price, so a couple of them can make for a pretty impressive amount of compute. That brings me to this project: the Proxmox compute cluster!



This project was done on a budget, so it includes a hodgepodge of hardware. The only pieces I ended up purchasing were the mining frame and a new power supply. On the motherboard are a 1 TB NVMe drive running Proxmox 8 and a 10 TB HDD. On Proxmox I have two Ubuntu Server VMs installed. I needed two because you can’t install two different versions of the same vendor’s GPU driver on one system, and the Tesla K80 has been discontinued for some time (its latest driver is 470.223.02, CUDA version 11.4!). So if I wanted to use both the 3080s and the K80, I had to have two servers.
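
A quick way to confirm each VM ended up with the driver it needs is to query nvidia-smi. Here’s a minimal sketch of the check I mean, using standard nvidia-smi query fields:

```python
# Sanity check to run on each VM: confirm which GPU and driver version
# that VM actually sees after passthrough.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version,memory.total", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
# The K80 VM should report a 470.x driver; the 3080 VM a current branch.
```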

K80 setup

The K80 has a push fan on the back that I purchased from eBay when I got the card (just over $100). It’s a bit loud, and it gets its power from a fan header on the motherboard, so in theory it could be controlled by software (a rough sketch of that is below); in practice I adjust for the noise by keeping the rig far away and remoting in. The hardest part was actually dealing with the weird differences between server cards and consumer ones. The main mounting screw (blue in the picture) sits further out on these cards and is slightly thicker, so if you look you can actually see a slight separation, and the screw doesn’t go all the way in. It’s definitely snug enough, and I have a fan on the front helping to pull air through. Also, when I attempted to install the card in an already working Proxmox server, I had issues booting the system: for some reason the boot sequence wouldn’t bring up the network, to the point that no lights would blink on the RJ45 ports. I tried working around this with dhclient but couldn’t really get it to work, so I tried a fresh install and that did the trick. So now I have two Proxmox servers (the wife is thrilled!).
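
For the software fan control, the Linux hwmon PWM interface would be the likely route. This is only a hypothetical sketch, since the exact hwmon directory and PWM channel depend on your motherboard’s sensor chip (and writing to sysfs needs root):

```python
# Hypothetical sketch: driving a motherboard fan header's PWM through sysfs.
# The hwmon index and pwm channel are assumptions; find yours by reading the
# 'name' file under each /sys/class/hwmon/hwmon*/ directory.
from pathlib import Path

hwmon = Path("/sys/class/hwmon/hwmon2")   # assumption: your board's sensor chip
(hwmon / "pwm1_enable").write_text("1")   # 1 = manual PWM control
(hwmon / "pwm1").write_text("255")        # duty cycle 0-255; max it for the K80 blower
```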

Tesla K80 CUDA

One of the next issues was that these GPUs don’t have monitor-out ports, so I had to add another GPU initially to set everything up. I started with the top 3080 in the picture, but for some reason Proxmox wouldn’t recognize it alongside the K80; I’m still not sure why, maybe a driver situation similar to the one above. Some swapping of cards fixed this, and once Proxmox is installed you’re good to go by remoting in. The K80 shows up in the OS as two separate GPUs with 11 GB of VRAM each (more VRAM than my 3080!).

Since the K80’s CUDA support stops at 11.4, this really limits the available software: in practice I can only run up to PyTorch 1.12.1, still not too shabby though. I tried running some standard CNN PyTorch code on the K80, but it’s unfortunately hit or miss. I get memory errors even though I know I’m well below the memory limit, and this might explain why these cards are so cheap now: it’s difficult to run modern machine learning on them with the older CUDA. My future fix for this is to use tinygrad. So far tinygrad works perfectly fine on it (both checks are sketched below), and kudos to George Hotz and his open source crusade for keeping this alive!
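
To make that concrete, here is roughly the sanity check I mean on the K80 VM; a minimal sketch using standard PyTorch calls, nothing specific to my code:

```python
# Minimal sketch: the K80 should appear as two CUDA devices, each ~11 GB.
import torch

print(torch.__version__, "CUDA", torch.version.cuda)  # expect 1.12.1 / 11.x here
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name} {props.total_memory / 1e9:.1f} GB")
```

And the kind of tinygrad smoke test that runs fine on the card; again a minimal sketch, assuming a recent tinygrad install (the import path has moved between versions):

```python
# Minimal sketch: a forward/backward pass in tinygrad.
from tinygrad import Tensor

x = Tensor.randn(64, 128)
w = Tensor.randn(128, 10, requires_grad=True)
loss = (x @ w).relu().sum()
loss.backward()
print(loss.numpy(), w.grad.shape)
```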

3080s

Next up are the beefy 3080s: one AORUS Xtreme, which is comically large (I seriously couldn’t fit it in some of my rigs because of the massive fan housing), and one 3080 Founders Edition, both of which sport only 10 GB of VRAM, also comical considering these cards’ price. I’m currently using them to train DNNs to sort neural data for me, a project I hope to write about soon, and I will say, they are crazy fast. Next, I’m going to use these cards to experiment with sharding (the simplest starting point is sketched below), which I hear can be a bit complicated, as well as trying some multi-agent LLMs.
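
For the sharding experiments, the gentlest on-ramp in PyTorch is plain data parallelism; proper sharding (FSDP, tensor parallel) is more involved, so take this as a minimal sketch of step one rather than the real plan:

```python
# Minimal sketch: split each batch across both 3080s with nn.DataParallel.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model, splits the batch
model = model.cuda()

x = torch.randn(64, 512).cuda()
print(model(x).shape)  # torch.Size([64, 10])
```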