Meta released Llama 3.2 with four models. The smallest two are meant to run ON DEVICE, meaning your phone or a regular computer. LLM inference costs are trending towards zero; increasingly, the only cost is the power your everyday device already draws.
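If you want to try the 1B model on your own machine, here's a minimal sketch using Hugging Face transformers (it assumes you've been granted access to the gated meta-llama/Llama-3.2-1B-Instruct repo and have a recent transformers install):

```python
# Minimal local-inference sketch. Assumes access to the gated
# meta-llama/Llama-3.2-1B-Instruct repo on Hugging Face.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",  # CPU works too; the 1B model is small enough
)

messages = [{"role": "user", "content": "Explain on-device inference in one sentence."}]
out = generator(messages, max_new_tokens=64)
# The pipeline returns the full chat transcript; the last message is the reply.
print(out[0]["generated_text"][-1]["content"])
```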
More impressive than the models themselves are the 25 release partners, including Groq and AWS Bedrock, making these new models instantly available to anyone.
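For a taste of that instant availability, here's a sketch calling Llama 3.2 through Groq's OpenAI-compatible endpoint. The model name llama-3.2-1b-preview is my assumption; check Groq's current model list, and set GROQ_API_KEY first:

```python
# Sketch: calling Llama 3.2 via Groq's OpenAI-compatible API.
# Assumes GROQ_API_KEY is set; the model id below is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

resp = client.chat.completions.create(
    model="llama-3.2-1b-preview",  # assumed model id; verify against Groq's docs
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```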
My personal use on PyroPrompts shows Llama 3.2 1B to be super fast, over 2k tokens/second. Fast enough for realtime conversation (if you have text-to-speech). Fast enough for split-second decision making. The quality of the 1B model is so-so; I wasn't blown away by its knowledge or its ability to make up a story to convey a lesson. More testing is needed! I really like OpenAI's Structured Outputs and would like similar support for Llama models, though it is achievable with Outlines by .txt (see the sketch below).
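Here's a minimal sketch of what that looks like with Outlines, using its 0.x-style API; the model id and schema are just for illustration, and the API may differ in newer releases:

```python
# Sketch: structured (JSON) generation from a Llama model with Outlines.
# Uses the Outlines 0.x-style API; check current docs if it has changed.
from pydantic import BaseModel
from outlines import models, generate

class Lesson(BaseModel):
    title: str
    moral: str

# Assumes access to the gated Llama 3.2 1B Instruct weights.
model = models.transformers("meta-llama/Llama-3.2-1B-Instruct")
generator = generate.json(model, Lesson)

# Generation is constrained to valid JSON matching the Lesson schema.
result = generator("Write a very short fable about patience, as JSON.")
print(result.title, "->", result.moral)  # result is a validated Lesson instance
```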
Have you tried Llama 3.2? What’s your experience with it?