Mistral Small 22B just dropped today and I am blown away by how good it is. I was already impressed with Mistral NeMo 12B’s abilities, so I didn’t know how much better a 22B could be. It passes really tough obscure trivia that NeMo couldn’t, and its reasoning abilities are even more refined.

With Mistral Small I have finally reached the plateu of what my hardware can handle for my personal usecase. I need my AI to be able to at least generate around my base reading speed. The lowest I can tolerate is 1.5~T/s lower than that is unacceptable. I really doubted that a 22B could even run on my measly Nvidia GTX 1070 8G VRRAM card and 16GB DDR4 RAM. Nemo ran at about 5.5t/s on this system, so how would Small do?

Mistral Small Q4_KM runs at 2.5T/s with 28 layers offloaded onto VRAM. As context increases that number goes to 1.7T/s. It is absolutely usable for real time conversation needs. I would like the token speed to be faster sure, and have considered going with the lowest Q4 recommended to help balance the speed a little. However, I am very happy just to have it running and actually usable in real time. Its crazy to me that such a seemingly advanced model fits on my modest hardware.

Im a little sad now though, since this is as far as I think I can go in the AI self hosting frontier without investing in a beefier card. Do I need a bigger smarter model than Mistral Small 22B? No. Hell, NeMo was serving me just fine. But now I want to know just how smart the biggest models get. I caught the AI Acquisition Syndrome!

  • brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    1 day ago

    You can try a smaller IQ3 imatrix quantization to speed it up, but 22B is indeed tight for 8GB.

    If someone comes out with an AQLM for it, it might completely fit in VRAM, but I’m not sure it would even work for a Pascal card TBH.

    • Smokeydope@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      2
      ·
      21 hours ago

      Thanks for the recommendation. Today I tried out Mistral Small IQ4_XS in combination with running kobold through a headless terminal environment to squeeze out that last bit of vram. With that, the GPU layers offloaded were able to be bumped up from 28 to 34. The token speed went up from 2.7t/s to 3.7t/s which is like a 50% speed increase. I imagine going to Q3 would get things even faster or allow for a bump in context size.

      I appreciate you recommending Qwen too, ill look into it.

      • brucethemoose@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        ·
        edit-2
        18 hours ago

        A Qwen 2.5 14B IQ3_M should completely fit in your VRAM, with longish context, with acceptable quality.

        An IQ4_XS will just barely overflow but should still be fast at short context.

        And while I have not tried it yet, the 14B is allegedly smart.

        Also, what I do on my PC is hook up my monitor to the iGPU so the GPU’s VRAM is completely empty, lol.