Most local LLM use goes through a high-level application such as SillyTavern. These offer conveniences like automatically applying the right prompt template, setting context sizes, and sometimes extra features such as image generation. They’re often focused on chatbots, and on giving distinct personalities to the characters you chat with.

You might want to avoid that if you want less of a chat experience (i.e. more one-shot prompts), if you want some feature that’s only available in the bleeding edge, or if you just want something lighter weight.

I opted for llama.cpp, but you could go for something in between, such as ollama (and Kobold users would drop down to KoboldCpp).

llama.cpp

Following the instructions (clone, make, download a model), you can use TinyLlama to generate some simple text.
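
A rough sketch of that setup (the repository is the upstream project; the model file is TheBloke’s GGUF quantisation, and the exact filename may differ – check the repository):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
wget -P .. https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF/resolve/main/tinyllama-1.1b-chat-v0.3.Q6_K.gguf

Then, to generate: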

./main -m ../tinyllama-1.1b-chat-v0.3.Q6_K.gguf -i -p "An article about the fall of Rome:\n\n" -n 200 -e

I note:

  • token generation is fast
  • quality is highly variable, but prompt tuning can make it very decent
  • it doesn’t read like ChatGPT (although probably still seems AI generated?)
  • in interactive mode (-i), token generation doesn’t stop – after making an article, you’ll get some ending text (e.g. copyright), then another article (see the workaround below)
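
One way to rein in that runaway generation is the reverse-prompt option: in interactive mode, -r hands control back to you as soon as the model emits a given string. The marker here is just a guess at the sort of ending text it produces:

./main -m ../tinyllama-1.1b-chat-v0.3.Q6_K.gguf -i -p "An article about the fall of Rome:\n\n" -n 200 -e -r "Copyright"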

Downloading another model (e.g. Zephyr) gives much slower generation but noticeably better quality, and it stops on its own at a sensible point.
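
For example, with TheBloke’s GGUF quantisation of Zephyr (the exact filename is an assumption – check the repository):

./main -m ../zephyr-7b-beta.Q4_K_M.gguf -p "An article about the fall of Rome:\n\n" -n 200 -e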

Chat templates

One possible reason for the lack of stop tokens from TinyLlama is that the input format is wrong: checking the source, you can see that the desired input format is ChatML. That page gives the exact format, but you can also derive it from chujiezheng’s chat_templates (or the HuggingFace transformers library).

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
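
Putting that into practice on the command line looks something like this (-e turns the \n escapes into real newlines; the system message is just an example):

./main -m ../tinyllama-1.1b-chat-v0.3.Q6_K.gguf -n 200 -e -p "<|im_start|>system\nYou are a helpful writer.<|im_end|>\n<|im_start|>user\nWrite an article about the fall of Rome.<|im_end|>\n<|im_start|>assistant\n"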

Even using this template, it still often failed to stop, possibly due to the low model quality.

Zephyr also has a particular input format it does best with, but it accepts (and does well with) plain user prompts too.

<|system|>
{system}</s>
<|user|>
{user}</s>
<|assistant|>
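
And on the command line (same caveats as before – the filename and system message are assumptions):

./main -m ../zephyr-7b-beta.Q4_K_M.gguf -n 200 -e -p "<|system|>\nYou are a helpful writer.</s>\n<|user|>\nWrite an article about the fall of Rome.</s>\n<|assistant|>\n"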