6 settings I always change before running a local LLM

Most local LLM advice spends about 90% of its time on which model to download and almost none on what to do once it’s loaded. You just pick a model and a quant, and off you go. It seems a lot of people leave the settings and parameters alone unless something’s wrong, which is a shame because half the complaints about local models being slow or weird aren’t really the model; they’re the settings. And that’s a much easier problem to solve than swapping to a different model entirely.

Context length is a budget, not a ceiling

The number everyone touches first and gets wrong

This one feels almost too obvious to put on a list, except most people I’ve seen running local LLMs either don’t touch it or push it as high as the model allows because the option is there. But after half a year of experimenting with local LLMs, I’ve realized that context is more of a budget than a ceiling, and it’s worth being thoughtful about how you spend it.

Context length is basically how many tokens the model can hold in view at once, and that includes your prompt, the model’s reply, any documents you’ve attached, and the rest of the conversation up to that point. A token is roughly three-quarters of a word, so 8K tokens is around 6,000 words of total room to work with.

Memory cost aside, the other reason to be careful is that bigger isn’t necessarily better for output quality. There’s a well-documented “lost in the middle” effect where models pay attention to the start and end of long contexts but get fuzzy on what’s in between. For example, Llama 3.1 was trained with a 128K context window but performs best somewhere under 16K. So even if your hardware can technically support a massive context window, the model itself often does better work in a tighter one.

For most chat and general use where you know the session won’t last very long, 8K is actually fine. For document work or longer back-and-forth, I’ll push to 16K or 32K. The number on the slider is the maximum the model can accept, not a recommendation.
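
To make the budget idea concrete, here’s a rough sketch in Python. The 0.75 words-per-token ratio is the rule of thumb from above; the function names and the example token counts are made up for illustration, and real tokenizers vary by model and language.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not an exact tokenizer ratio


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def remaining_budget(context_length: int, prompt: int, documents: int,
                     history: int, reply_reserve: int) -> int:
    """Tokens left after everything that shares the context window.

    A negative result means the session no longer fits and something
    (usually old history) has to be dropped or truncated.
    """
    return context_length - (prompt + documents + history + reply_reserve)


# Hypothetical session: an 8K window with a document attached.
print(tokens_to_words(8192))
print(remaining_budget(8192, prompt=300, documents=4000,
                       history=2500, reply_reserve=1000))
```

If the second number goes negative, that’s the signal to step up to 16K or 32K rather than defaulting to the maximum from the start.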


GPU offload is the difference between fast and unusable

The auto setting usually plays it too safe

GPU offload decides how many of the model’s transformer layers get loaded onto the GPU versus left on the CPU. On a card like mine with 8GB of VRAM, this matters more than almost anything else, because most useful models in the 7B-13B range don’t fully fit at a decent quantization.

The point of the slider is to put as much of the model on the GPU as you can without overshooting, since GPU layers run roughly 10 to 40 times faster than CPU ones. Push it too high and you spill over into shared system RAM, and that’s where things get bad – in LM Studio, spilling can be up to 30x slower than fitting in VRAM. Set it too low and you’re leaving the GPU idle while the CPU does work it didn’t need to.

The auto setting tries to play it safe, which usually means a few layers under what your card could actually handle. My approach is to start one or two layers below max, watch VRAM usage as the model loads, then push up until I’m close to the ceiling without touching it. And remember context length feeds back into this: if you bumped that up, you’ve got less room for offload than the slider thinks.
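
If you want a starting number before you begin nudging the slider, a back-of-the-envelope estimate looks something like this. It’s a sketch under loud assumptions: it treats every layer as the same size, and the model size, layer count, and reserved amounts are hypothetical numbers, not measurements from my card. Always confirm against actual VRAM usage.

```python
def estimate_gpu_layers(vram_gb: float, model_size_gb: float,
                        n_layers: int, reserved_gb: float) -> int:
    """Rough count of layers that fit in VRAM.

    reserved_gb is whatever you set aside for the KV cache, the
    desktop, and general overhead. Assumes equal-sized layers,
    which is only approximately true.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = vram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical 40-layer model file of 7.4 GB on an 8 GB card.
short_context = estimate_gpu_layers(8.0, 7.4, 40, reserved_gb=1.5)
long_context = estimate_gpu_layers(8.0, 7.4, 40, reserved_gb=2.5)
print(short_context, long_context)
```

Raising the reserve from 1.5 GB to 2.5 GB stands in for a longer context, and the layer estimate drops with it, which is the feedback loop described above.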

KV cache offload to GPU memory

I didn’t even notice it was there at first

The KV cache is basically the model’s scratchpad for the conversation – it stores what’s already been seen so tokens don’t get recomputed from scratch each time, and it grows with context length. This toggle decides whether that scratchpad lives on the GPU (faster) or in system RAM (slower but frees up VRAM). Default is on, and that’s fine if you’ve got headroom. The case for flipping it off is when you’re pushing context past what the card comfortably handles and you want that VRAM back for actual model layers instead. So this one depends entirely on what hardware you’re working with.
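
To get a feel for how fast that scratchpad grows, here’s the usual size estimate: keys plus values, per layer, per KV head, per token. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption for illustration, roughly an 8B-class model with grouped-query attention. Read the real values from your own model’s metadata.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an unquantized 16-bit cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_length * bytes_per_value)


GIB = 1024 ** 3

# Assumed model shape; swap in your model's actual numbers.
for ctx in (8192, 16384, 32768):
    size = kv_cache_bytes(32, 8, 128, ctx)
    print(f"{ctx} tokens: {size / GIB:.1f} GiB")
```

The size scales linearly with context, so on a small card a long context can cost as much VRAM as a good chunk of the model’s layers.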

Temperature

The setting everyone’s heard of but no one adjusts properly

Temperature is probably the setting most people have heard of, and it’s the first one I learned about, but I still think it gets overlooked too often. It controls how random the model’s output is – low values make it pick the highest-probability next token almost every time, high values let it reach for less likely ones. Most runners default somewhere around 0.7 or 0.8, which leans slightly creative, and that’s fine as a general-purpose setting except your tasks aren’t all general-purpose.

For anything analytical – code, summarization, pulling facts out of a document, anything where you want the model to stop trying to be interesting – drop it to 0.2 or 0.3. For creative tasks, brainstorming, or anything where variety actually helps, push it up to 1.0 or higher. The scale usually goes to 2.0 but above 1.2 things start getting weird.
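
Under the hood, temperature is just a divisor applied to the model’s raw scores (logits) before they’re turned into probabilities. Here’s a minimal sketch of that; the three logits are made-up numbers, and a real runner does this over the whole vocabulary.

```python
import math


def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Turn logits into probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability lands on the top token; as it rises, the distribution flattens and the weaker candidates get a real chance of being picked.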

The point is that one setting shouldn’t be doing the work for tasks that are very different in nature. It might seem obvious, but this is an important one to stay on top of depending on the workflow. I recommend creating presets for different temps if your runner allows it.
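
If your runner doesn’t have built-in presets but you’re calling it from a script, a small lookup table does the same job. This is only a sketch: the preset names, keys, and the `settings_for` helper are hypothetical, and the values are just the ranges mentioned in this article.

```python
# Hypothetical preset table; names and structure are for illustration only.
PRESETS = {
    "analytical": {"temperature": 0.2},              # code, summaries, extraction
    "general": {"temperature": 0.7},                 # typical runner default
    "creative": {"temperature": 1.0, "min_p": 0.05}, # brainstorming, fiction
}


def settings_for(task: str) -> dict:
    """Return a copy of the preset for a task, falling back to 'general'."""
    return dict(PRESETS.get(task, PRESETS["general"]))


print(settings_for("analytical"))
print(settings_for("something else"))
```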


Min-P is what makes high temperature actually usable

The sampler that pairs with temperature

Min-P is the one I always forget the mechanics of but keep coming back to. It’s a sampling method that cuts off any token whose probability falls below a fraction of the most likely one in that step. So when the model is confident about its next pick, Min-P stays strict and only lets in tokens that are close to the top. When the model is uncertain, it loosens up and lets more candidates through. That sliding threshold is what makes it pair so well with high temperature, because temperature flattens the probability distribution and lets weird tokens sneak in, and Min-P cuts those before they can be sampled.

For anything where you’ve pushed temperature up to 1.0 or above, Min-P should be somewhere between 0.05 and 0.1. Lower than 0.05 and the filter barely does anything; higher than 0.1 and you start losing the variety you turned temperature up for in the first place.
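
Since I always forget the mechanics, here’s a minimal sketch of the filter. The token probabilities are invented to show the two cases, and the function name is mine rather than any runner’s API.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Drop tokens below min_p times the top probability, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Confident step: one token dominates, so the threshold is high.
confident = {"Paris": 0.90, "Lyon": 0.05, "Nice": 0.03, "banana": 0.02}
# Uncertain step: probabilities are spread out, so the threshold is low.
uncertain = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.10}

print(sorted(min_p_filter(confident, 0.1)))
print(sorted(min_p_filter(uncertain, 0.1)))
```

With the same 0.1 setting, the confident step keeps a single candidate while the uncertain step keeps all five.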

Repeat penalty (and DRY, if you have it)

All it needs is a tiny adjustment

Repeat penalty discourages the model from reusing tokens it’s already used recently. At 1.0 it’s off and the model writes freely, and a small nudge to 1.05 or 1.1 is usually all you need to clean up the worst of the looping without affecting the flow of your model’s responses. I don’t recommend pushing it to 1.2, though, because the model starts dodging common words like “the” just to avoid the penalty, and the outputs get really weird at that point.
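
For a sense of why small values go a long way, here’s a sketch of one common way this penalty is applied: positive logits of recently used tokens are divided by the penalty and negative ones are multiplied by it. I believe this matches what most runners do, but treat that as an assumption; the logits and token names are made up.

```python
def apply_repeat_penalty(logits: dict[str, float],
                         recent_tokens: list[str],
                         penalty: float) -> dict[str, float]:
    """Lower the score of every token that appears in the recent window."""
    adjusted = dict(logits)
    for tok in set(recent_tokens):
        if tok in adjusted:
            score = adjusted[tok]
            # Dividing a positive score or multiplying a negative one
            # both push the token further from being picked.
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = {"the": 4.0, "cat": 2.0, "dog": -1.0}  # hypothetical scores
print(apply_repeat_penalty(logits, ["the", "dog"], 1.1))
print(apply_repeat_penalty(logits, ["the", "dog"], 1.0))  # 1.0 changes nothing
```

Note that “the” gets hit like any other token that showed up recently, which is why higher values start distorting ordinary sentences.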

On a similar note, if your runner supports DRY (Don’t Repeat Yourself), use that instead. It targets repeated phrases rather than single tokens, which is closer to what we actually mean when we say the model is repeating itself.
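
Here’s a simplified sketch of the idea as I understand it: look for places where the end of the current text already appeared earlier, and penalize whichever token followed it last time, with the penalty growing as the repeated run gets longer. The parameter names and default values are assumptions for illustration, and real implementations add extras (like sequence breakers) that this skips.

```python
def dry_penalties(context: list[str], multiplier: float = 0.8,
                  base: float = 1.75,
                  allowed_length: int = 2) -> dict[str, float]:
    """Map each token that would extend a repeated phrase to its penalty.

    The penalty is meant to be subtracted from that token's logit.
    """
    penalties: dict[str, float] = {}
    last = len(context) - 1
    for i in range(last):
        # Count how many tokens ending at position i match the tokens
        # ending at the very end of the context.
        match_len = 0
        while match_len <= i and context[i - match_len] == context[last - match_len]:
            match_len += 1
        if match_len >= allowed_length:
            follower = context[i + 1]  # what came next the previous time
            penalty = multiplier * base ** (match_len - allowed_length)
            penalties[follower] = max(penalties.get(follower, 0.0), penalty)
    return penalties


context = ["the", "cat", "sat", "on", "the", "cat"]
print(dry_penalties(context))
```

In this toy example the text ends with “the cat”, which already appeared once, so “sat” is the only token penalized. A lone repeated “the” wouldn’t trigger anything, which is the difference from a plain repeat penalty.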


The model isn’t always the part that needs changing

I feel like the local AI crowd, myself included, hops between models too easily when the last one doesn’t give us what we want. But a lot of the time the model is fine; it’s just the settings that need adjusting. Some of my recommendations here might seem obvious, but that’s exactly the point – they’re the ones to focus on if you want better output and less strain on your hardware.


