About this video
Your high-end GPU is likely a complete waste of capital for everyday AI tasks. In this video, I test the limits of the MacBook Neo to see if it can handle local LLMs like Gemma 4. Key Takeaways: - OMLX is a superior, lightweight alternative to LM Studio for older or base-model Macs. - A 2B parameter model can run at a respectable 24 tokens per second on just 8GB of RAM. - Local AI is perfectly viable for general chat and basic tasks, though it still struggles with complex logic like game development. - Hugging Face remains the go-to repository for finding quantised MLX models.
The Era of the Monster AI Rig is Officially Over for the Average Developer.
There is a common misconception in the tech world that you need a liquid-cooled, multi-GPU behemoth to run local artificial intelligence. I recently set out to debunk this by testing the MacBook Neo: a machine that many would dismiss as a mere consumer toy.
The Minimalist Approach to AI
My goal was simple: find the absolute minimum hardware required to run local LLMs (Large Language Models) without the experience being a total disaster. While we are not looking to train the next GPT-4 or handle massive codebases, there is a growing ecosystem of models like Gemma 4 and Qwen 3.5 designed specifically for local deployment.
The Tooling: Why OMLX Wins
If you are using a Mac, you have likely heard of LM Studio. However, for those of us pushing the limits of base-model hardware, LM Studio is starting to feel a bit bloated. I opted for OMLX, a lightweight alternative that allows us to run MLX-optimised models with significantly less overhead.
The setup process is straightforward:
- Download the OMLX release (I used version 3.6 for stability).
- Head to Hugging Face to find quantised models.
- Use the OMLX admin panel to download and serve the model locally.
Performance Reality Check
I tested a 2-billion parameter version of Gemma 4, quantised to 4 bits by the team at Onslaught. On a machine with only 8GB of RAM, the results were surprisingly usable:
- Pre-fill speed: Nearly instant for short prompts.
- Generation speed: Approximately 24 tokens per second.
- Thermal performance: The laptop remained comfortably cool to the touch.
The Snake Game Test
To push the context window, I asked the model to generate a functional HTML5 Snake game. While it produced the code almost instantly, the logic was flawed: the snake existed, but the food did not. Even after a follow-up prompt to fix the code, the model struggled with the complex logic required for a fully functional game.
Conclusion
Can you run local AI on a MacBook Neo? Yes.
Is it a replacement for a high-end workstation? No.
For glorified Google searches, general chatting, and basic creative writing, these small models on modest hardware are more than capable. It is time to stop worrying about your VRAM and start experimenting with what you already own.
Transcript▾
In my everlasting quest to find out what's the minimum hardware we can use to run local AR models, I recently picked up the MacBook Neo. Now, let's get expectations set right from the beginning. We are not going to be running the biggest models. We are not going to be running huge context size or doing any sort of coding on this machine. [laughter] However, with all of these models that we can get like Gemma 4 and Qwen 3.5 that are intended to be ran on local hardware including phones, what can we get running on this? I literally haven't downloaded anything. So, we're going to get everything set up and you'll learn a little bit about how we get things set up regardless of what MacBook you're on.
But, once we have it downloaded, let's see what we can do on it. I originally purchased this to test out a local AI server setup, which I'll link to if it's already released. If it's not, subscribe for that one. So, without further ado, let's just jump in. I'll show you how to get things set up and then let's let it rip.
Right, first things first with Macs, you want to download OMLX. This allows us to run MLX models on MacBooks. And you can do this with LM Studio. However, I'm beginning I just prefer the lighter weight version of OMLX and there's reports it's even faster than LM Studio. It's just becoming a little bit bloated. So, with that, we have 3.6 available. Now, there was a 3.7 release candidate that looks like it's been removed by the looks of things. Oh, we've got a release candidate, too. Let's keep with 3.6. We'll scroll down. We're on Tahoe, so let's just download that.
Now, while that's downloading as well, you want to go to Hugging Face, which is a repository similar to GitHub in that they just store local models for you to be able to download. And Gemma 4, let's be honest, Gemma 4 is going to be the one that we could download here because they've got that I think it's a 3 or 2 billion parameter model. So, let's type MLX. Let's go 2 billion. Yeah, so we've got a 2 billion parameter model here. Which comes in at 4 and a half gig, which I think is enough to fit on this. I genuinely don't know.
What we're going to do is copy this, which copies Onslaught Gemma 4. Onslaught are the people who have taken the model and quantised it and converted it to MLX. And we get up to setting things up. Now, if you want things to be saved in a certain directories, you can do all that. We're going to run on port 8000. Then we're going to set up an API key here, and we can start the server.
Now, and if we open up the admin panel, type in our API key, we'll get taken to the dashboard. Now, this primarily runs in the browser. Everything is accessed in the browser, and we come to models, downloader, and this is where we're going to paste that Hugging Face URL, which we copied. We can download that. If you go to our manager here, give that a refresh, we'll start to see our model downloaded.
And again, we're just going to be doing chat with this. You know, coding is just not good on these models in general. Again, we might be able to get something running. We might be able to get it to read a file, but any sort of codebase, it's going to choke at reading any more than a couple of files. We might get it to generate a HTML file. In fact, let's actually do that.
And then with that, go over to chat here. We can say tell me a thousand word story. And we are chugging here. We are chugging. Pushing right up against that 8 gig. I genuinely didn't know how much RAM we had in this thing, but clearly we have 8 gig. I mean, the mouse isn't stuttering. It is writing. If we scroll up here, we get some statistics and we'll start to see how long things take.
Things are a little bit chuggy, I will admit, but we'll start we can see some of the logs coming through here. So, that took 71 seconds. Pre-fill at two total tokens a second. 24 tokens a second. Not really that bad. Let's see if we can generate some HTML file. Create one page HTML old school snake game. And that took very little time to actually start up. That pretty much started instantly. You saw that there. I mean, the heat is not doing too bad at all. It's like nothing. Like it's barely that's not very uncomfortable.
And if you did some mods on this, which is what I've seen online, you could probably get some decent performance out of this thing. Okay, it's done. Let's copy that. Okay. Woah, that's fast. Start game. Okay. I mean, I've got a snake, but I do not have any food. So, how about we follow up, we push this context, and say works great, but there's no food for me to eat. Score points. Grow.
As these follow-ups that really start to show, you know, the pre-fill might be sooner, but we're just adding to that context. We're building up that context, which I mean, it's chugging, but there we go. Oh, you're absolutely correct. We've had that before. Here's the correct HTML and JavaScript. Should have probably just got it to give me the JavaScript, to be honest. The actual HTML is mostly the same, but probably would like to know what sort of context we're talking about here.
Here we go. There's the default 32,000. Still going. Okay. So, we've apparently fixed it. Okay, we're still not seeing anything that enables us to score. Game initialized. Game started. No food. Okay, well, I mean, you know, what can you expect really? But, we have Gemma 4 working. I think this is fine for glorified Google searches, general chatting, the sort of thing you would expect from 2 billion, 4 billion parameter models, particularly as it's been quantised 4 bits. But, that does answer the question. Can we run local AI on a MacBook Neo? The answer is yes, we can, but set your expectations.