June 11, 20265:03

Apple Showed Us How to Run Local AI Agents the Right Way

By Samuel Gregory

About this video

Your Mac is a powerhouse that you are likely wasting on simple spreadsheets and emails. In this video, we dive into Apple's official MLX LM framework to show you how to run high-speed local AI models directly on your machine. We move past the limitations of GGUF to explore the sheer speed of native Apple Silicon optimisation.\n\nKey takeaways from the video:\n- Why MLX is the superior choice for Apple Silicon users compared to GGUF.\n- How to install and configure MLX LM using Python and Conda.\n- Steps to serve a model locally and connect it to tools like Open Code.\n- Using Pi Agent to turn your local model into a functional system assistant.\n- The foundational knowledge required to eventually link multiple Macs for massive VRAM gains.

OpenCode config

"mlx": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "MLX (local)",
  "options": {
    "baseURL": "http://127.0.0.1:8080/v1"
  },
  "models": {
    "mlx-community/gemma-4-e2b-it-4bit": {
      "name": "Gemma 4"
    }
  }
}

Pi config

"mlx": {
  "baseUrl": "http://127.0.0.1:8080/v1",
  "api": "openai-completions",
  "apiKey": "null",
  "models": [
    {
      "id": "mlx-community/gemma-4-e2b-it-4bit",
      "name": "Gemma 4"
    }
  ]
}

Why Your Mac is Secretly an AI Powerhouse

The cloud is a crutch that is holding your development back. For too long, we have been told that high-level machine learning requires massive server farms and monthly subscriptions, but the truth is sitting right on your desk. Following the latest WWDC updates, Apple has made it clear that local AI is the future, and their MLX framework is the key to unlocking it.

The Power of MLX

MLX is an open-source framework designed specifically for Apple Silicon. Unlike traditional methods like GGUF, MLX is built from the ground up to take advantage of the unified memory architecture of your Mac. The speed differences are not just noticeable, they are incomparable. If you are serious about local LLMs, you need to be using the tools designed for your hardware.

Setting Up Your Local Environment

Getting started is remarkably simple. With Python and pip, you can install the MLX LM library and begin generating responses from models like Gemma or Quen in minutes. For those who prefer a cleaner setup, using Conda for encapsulation is a brilliant way to manage your packages without cluttering your system.

To run a model, use the generate command: mlx_lm.generate --model [model_name] --prompt \"Hello World\"

Moving Beyond the Terminal

Running a model in the terminal is just the beginning. By serving the model locally on your machine, you can bridge the gap between a raw LLM and your actual workflow. Integrating these models with tools like Open Code or Pi Agent allows you to take actions across your computer, effectively building a private, local AI assistant that respects your data privacy.

The Road to Multi-Mac Clusters

This setup is more than just a party trick. It represents the groundwork for something far more powerful: linking multiple Macs together. By mastering MLX now, you are preparing for a future where you can spread a model across the VRAM of every Mac in your office, squeezing every drop of performance out of your hardware.

Stop waiting for the cloud. Start building locally today.

Transcript▾

Apple's WWDC conference has kicked off. And with it, they have released a whole bunch of videos on machine learning and AI, all that AI developer goodness that we love. And I took a little look at these and I thought I'd share some of my learnings with you, specifically how they like to run their models locally on your machine. Now, we will be using a library called MLX LM. I would not necessarily suggest this for a majority of use cases. However, this is the sort of baseline or the prep work that we will need to know to understand how to link multiple Macs together to either spread a model across all of their VRAM or squeeze more juice using their RAM and capabilities. So, just to recap, we will be running a local model using MLX LM the Apple way. We will serve that model locally across our machine and tap into it using open code and use that local AI model to write code. And as a Brucey bonus, we will plug it into PI so you can take action across your computer. Now, what is MLX? MLX is an open-source framework for running and training local models specifically on your Mac. I have done a massive comparison between GGUF and MLX, and the differences are incomparable. You just get so much more speed out of MLX versus GGUF. I did the test because TurboQuant had been released by Google and I have yet to see a simple way to implement Turboquant on MLX models. Saying that, the library we will be using is a Python library called MLX LM and it is very simple to install. Obviously with Python and pip installed, we are going to run this inside of our terminal. If you prefer to use Conda, which is encapsulation, I am not an expert on Python but this is if you want to a set of packages and a set a version of Python and etc. Conda is the way to sort of encapsulate that similar to Docker which we went through on the Odysseus video. Now if we scroll down here to the command line we can literally type mlx generate specifying model and then giving it a prompt will take the model from Hugging Face which we have spoken about tons on this channel. It is effectively a repository for models. Let's take Gemma for 2B. And this one should be fine. By the way, the research I did before the show, this is not an advert for Warp, but to get Gemma models to run, they have not updated it yet to be able to run a new KV cache storage thing that Gemma is using in Apple's doc to use Quen 3.5, which obviously does not have this problem, and eventually had to install a specific version because a version of MLX contained a bug and it could not support the Gemma framework. With that, we should just be able to run MLX generate and download the model and give me a response. To run this to serve this across our local machine, we are just going to run MLX LM server and pass in that model. It seems like we have that running and it is running on 127.0.0.1 port 8080. And if we hit that URL there, /v1/models, we actually get a result. And inside of my root folder, I have got a hidden config folder, open code, open code.json. I have added this provider here. I will leave links to everything down below. And I have set my added a model there of the Gemma 4 from the MLX community with a name of Gemma 4. If I spin up open code, type /connect mln anything there and we got a local model. Hi, what does this code do? Gemma 4 is not the smartest model. It is not meant for coding whatsoever, but this is running on my local machine. Now with a more powerful model, one that is more tuned to coding, you can imagine a better result here. Pi is slightly different. Again, leave the links to everything down below, but this is in my root pi agent models.json running pi. Hit models here. Should be able to find the MLX community one. We are running local models. So, that is a quick look into how Apple recommend that you run these local models. Like I say, there will be another video using OMLX, which I will go through, which is a better way to run these local models for reasons I will get into in that version. However, this is the groundwork for running models across many, many Macs and getting the best performance out of that. So, I hope you enjoyed this. Like, subscribe if you haven't already. Till next time, keep on vibing.

Share this video

Watch on YouTube Twitter LinkedIn

Back to All Videos