AWS SageMaker -> Ollama -> Raspberry Pi Fail

November 28, 2024, 15:30

magichombre

Halp! I've used AWS SageMaker to fine-tune Llama 3.2 1B on a set of questions and answers and downloaded the output from S3, but when I try to convert it to run in Ollama, an extra two tokens seem to have mysteriously appeared and stop it from working:
% ollama create llama-q-and-a
transferring model data 100% 
converting model 
Error: vocabulary is larger than expected '128258' instead of '128256'
If I trick it by editing the downloaded config.json, changing "vocab_size": 128256 to "vocab_size": 128258, the create step succeeds, but running the model then breaks because the token embedding tensor is out by two:
% ollama create llama-q-and-a
transferring model data 100% 
converting model 
creating new layer sha256:27cc8e47a5b0677b27796952267dc8a821d478de44482bee52a2860f01a2d380 
creating new layer sha256:e4e2d5fb1c3129b5ccc8fc5c19d1c06f6e8421f28d7dcfc3e80a081e34ecffdf 
writing manifest 
success
% ollama run llama-q-and-a
Error: llama runner process has terminated: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected  2048, 128258, got  2048, 128256,     1,     1
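For completeness, this is all the config.json hack amounts to (a quick stdlib script; ./model/config.json is just where I unpacked the S3 output, adjust to taste):

```python
# Quick hack: bump vocab_size in the downloaded config.json so
# `ollama create` stops complaining about the vocabulary size.
import json
from pathlib import Path

def patch_vocab_size(config_path, new_size):
    """Rewrite vocab_size in a Hugging Face config.json and return the config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["vocab_size"] = new_size  # e.g. 128256 -> 128258
    path.write_text(json.dumps(config, indent=2))
    return config

# patch_vocab_size("./model/config.json", 128258)
```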
I've tried various ways of converting the model to GGUF and ONNX with a spot of Python first, but none have worked so far. Any advice greatly appreciated. Ultimately I want to run Ollama + my model on a Raspberry Pi 5 8GB. Thanks 🙂

PS For reference, when I load the model with HF transformers in Python it's fine and inference runs without a hitch - it's just that transformers is too meaty for my needs, whereas Ollama is optimised for inference only.
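In case it helps anyone reproduce the shape mismatch without loading the whole model: the safetensors file format starts with a little-endian u64 byte length followed by a JSON header of tensor metadata, so a few lines of stdlib Python can show the real embedding shape. A rough sketch - the file name model.safetensors and the tensor name model.embed_tokens.weight are from my download and may differ for yours:

```python
# Sketch: read the JSON header of a .safetensors file without loading weights,
# to compare the checkpoint's real token-embedding shape against config.json.
import json
import struct

def read_safetensors_header(path):
    """Return the safetensors header: tensor name -> {dtype, shape, data_offsets}."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        return json.loads(f.read(header_len))

# Example (paths/names are assumptions from my layout):
# header = read_safetensors_header("./model/model.safetensors")
# print(header["model.embed_tokens.weight"]["shape"])
```

That at least tells you whether the extra two tokens actually live in the weights or only in the metadata.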