llama – Personal Blog

I get empowered by being told something is impossible, or difficult, or too expensive, and this has been the case for me with trying to find an affordable way to run large language models at home.

I’ve still got a very expensive (something like AUD$2000) NVidia Jetson Orin NX 16GB that for the last couple of years has been my processor, but it is very slow in comparison to modern GPUs and dips just below the threshold of usable, especially with prefill sometimes taking 25 seconds before you get the first token out.

This is where a recent find enters the picture: the AMD BC250 crypto mining boards that can be bought for just over AUD$200 via AliExpress and other means and repurposed to be an AI processor (as well as games machine if you want).

This is the exact kind of thing that makes me excited: DIY methods of repurposing tech to do much more interesting things for cheap.

First up, as well as the card, you’re going to need a few things:
1. Power Supply (I chose a SilverStone DA750R Gold)
2. 2 x fans (I chose Noctua NF-A12x25 and a Noctua NF-A14x25 – it’s overkill but I wanted silence)
3. 1 x PWM Splitter (basically just splits the fan connector from the board to 2 x fans – mine is a Silverstone 100mm CPF01 black)
4. Cheap NVME SSD (I chose a Crucial E100 480GB drive, it doesn’t need to be fast as the board only supports older speeds)
5. Thermal paste (optional, but best to probably do this down the track as these cards have worked hard – I bought Frost X45)
6. Display port to HDMI adaptor (the board only has DP output and I have an HDMI screen)
7. USB Keyboard (for installing the OS)
8. USB thumb drive (for installing the OS)
9. Network cable (does not have wifi)
10. CR2032 button battery (for the BIOS/Clock settings)

Initial Assembly:
Place the board flat on a bench top sitting on something like matches to give it a tiny bit of air underneath. Install the SSD, Plug everything in and only install the largest of the 2 fans for now and just sit it on top of the heatsink (IMPORTANT YOU DO THIS). Plug the power supply in next to it but leave it switched off. Ensure that the PCIe cable from the power supply is connected to the J1000 connector with the 6+2 (8 total) connector, the other two power plugs next to it are unneeded. Lastly, don’t forget to install the CR2032 battery so that BIOS settings and clock can save reliably.

Power supply power on signal:
ATX power supplies don’t switch on unless they have 2 pins shorted (the motherboard usually does this work when you press your PC power button). This guide here will show you how:
https://www.youtube.com/watch?v=Ea1dcJ0QyAE
For now, just leave it as a small bit or wire or paperclip, but tape it up with electrical tape so it can’t come out or short on anything. In future it’ll be better to cut the wires and extract them to a switch or simply short them.

Preparing the USB thumb drive:
You’ll want to now download on your own computer Ubuntu Server 26.04 from the Ubuntu website:
https://ubuntu.com/download/server
You’ll also want to get Balena Etcher:
https://etcher.balena.io/

Using Balena Etcher, with your USB Thumb drive plugged in, choose the Ubuntu Server ISO you’ve downloaded, then choose the thumb drive, and Balena Etcher will set it up to be a bootable medium for the BC250.

Plug the USB into the BC250 in any of the open ports.

Turning on:
Ensuring that you have USB Keyboard and monitor plugged in, it’s time to finally turn on your board. Have one final check that you’ve got the correct jumper on the ATX power supply plug (wrong jumping can cause a failure/burn) and turn it on.

You should see the monitor spring to life, and BIOS come up asking you to set the time and initial settings. If this doesn’t happen, reboot to make sure you set this up.

Once done, hit save and exit, and hopefully watch your USB drive boot. Choose try/install from the menu and follow through the install. On mine I did a minimal install, no LVM, super basic everything to keep my footprint small (we want every last drop of RAM available).

On completion, it’ll ask you to remove the drive, and press enter. It should reboot and you now have a new system ready to go.

From here I recommend interacting with the board via SSH over network, so log in on the machine, and check its IP address with:
ip ad

On your local machine check you can connect to it via SSH at that address before you unplug the keyboard and monitor.

Configure base system packages/user config:
We’ll want to update the system packages as needed, and install some others to provide basic functionality.

Update/upgrade system:
sudo apt update sudo apt upgrade

Install the build packages:
sudo apt install build-essential cmake git pkg-config

Mesa/Vulkan packages (for interacting with GPU):
sudo apt install mesa-utils vulkan-tools

Add your user to the system groups that are allowed to access the GPU:
sudo usermod -aG render,video $USER

Then reboot:
sudo reboot

Configure the GPU speed governor:
If we leave the GPU speeds to just sit at the default values, we’ll be using lots of power and generating lots of heat we don’t need to, so some people have dug into the speed governing of this quirky board.

You can see the source a tool for that here:
https://github.com/filippor/cyan-skillfish-governor

Simply follow the instructions on the page to install it (follow the building debian package instructions) and be aware you may need to install other packages along the way (if it tells you).

I’m writing this all after I’ve done the work, but I believe I had to install these:
sudo apt install cargo rustc libdrm-dev

I did set my minimum frequency to be 500mhz so I could really clock down when not in use, as AI inference workloads tend to be blob based, not continuous.

I could then see what speed the GPU was going with this:
watch -n 1 cat /sys/class/drm/card1/device/pp_dpm_sclk
(updates every 1 second, hit ctrl-c to quit it)

CPU Clocking:
You may think that we should do this also to the CPU speed, but as far as I can find and read, it doesn’t seem like this is something we can actually do. Some talk about it having a built in governor, so for now I didn’t go any deeper as some of the packages I found to address this said themselves that they didn’t do much.

VRAM/System RAM split:
The BC250 uses a shared RAM setup between system RAM and video RAM which isn’t unusual on semi-embedded boards/laptop systems. The trick is that we want to ensure that the system frees up as much as possible to the GPU so that we can run heavier LLM models.

For this, we will set what feels like a counter intuitive method of setting the GPU to 512 which is actually dynamic mode, so the more the GPU needs, it shuffles system ram down to compensate.

We will use this package to do that:
https://github.com/fanoush/bc250_memcfg

I did the following the build and set it:
cd ~ git clone https://github.com/fanoush/bc250_memcfg cd bc250_memcfg make sudo ./bc250memcfg UMA_SIZE 512 # 512 = dynamic, NOT a 512MB cap sudo reboot

The GTT ceiling fix:
The amdgpu driver defaults to capping GTT at roughly 50% of system RAM despite the dynamic setting we did above. We have to edit some kernel parameters in the bootloader (grub) so that the kernel behaviour with the driver changes.

Install nano text editor:
sudo apt install nano

Edit grub boot loader:
sudo nano /etc/default/grub

Change “GRUB_CMDLINE_LINUX_DEFAULT” to:
GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=3959290 ttm.page_pool_size=3959290 amdgpu.gttsize=14750"
Press ctrl-o then enter to save, then ctrl-x to quit.

Update grub itself:
sudo update-grub

Reboot:
sudo reboot

Building Llama.cpp:
This is where the real AI inference happens. Llama.cpp is what we’ll use as our base. Before we start MAKE SURE YOU REALLY HAVE YOUR FAN ON THE HEATSINK: THIS WILL GENERATE LOTS OF HEAT!

Install the required packages:
sudo apt install libvulkan-dev glslc spirv-headers spirv-tools glslang-tools

Get the source code and build it:
cd ~ git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j$(nproc)

Confirm the GPU is detected as a real device, not falling back to CPU:
./build/bin/llama-cli --list-devices

Expect: Vulkan0: AMD BC-250 (RADV GFX1013) (XXXX MiB, XXXX MiB free)

And finally: test that it actually works (this will pull my favourite model from hugging face and run it with a small prompt:
./build/bin/llama-cli -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M -ngl 99 -p "Hello, introduce yourself briefly."

Running as a server for network links:
If you’re like me and you’ll want to link this with an agent framework (like my Supernova framework: https://github.com/JesseCake/supernova ) then you need to run LLama.cpp as an OpenAI compatible network endpoint.

This is the command I run mine with that fits well with the hardware (multiple slots for different contexts, tuning to run well on the BC250):
~/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M --jinja -c 393216 --parallel 3 --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --reasoning off --host 0.0.0.0 --port 8080 -b 8192 -ub 1024 --cache-ram 2048

Next up/Todo: TurboQuant version of LLama.cpp to provide speed and RAM improvements..

After that: Unlocking all 40 CU cores for even faster throughput

Tag: llama

Using a BC250 crypto-mining board as an AI inference processor for local LLMs