Testing the Limits: My GTX 1070 Rig vs Mistral Small 22B

Smokeydope@lemmy.world · 2 days ago

brucethemoose@lemmy.world · edit-2 1 day ago

You can try a smaller IQ3 imatrix quantization to speed it up, but 22B is indeed tight for 8GB.

If someone comes out with an AQLM for it, it might completely fit in VRAM, but I’m not sure it would even work for a Pascal card TBH.

Smokeydope@lemmy.world · 21 hours ago

Thanks for the recommendation. Today I tried Mistral Small IQ4_XS while running kobold from a headless terminal environment to squeeze out that last bit of VRAM. With that, I was able to bump the offloaded GPU layers from 28 to 34. The token speed went up from 2.7 t/s to 3.7 t/s, which is about a 37% speed increase. I imagine going to Q3 would make things even faster or allow for a bump in context size.
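For anyone who wants to check the speed-up figure, here is a quick sketch of the arithmetic. The helper name `speedup_percent` is made up for illustration; the only inputs are the two token rates quoted above.

```python
def speedup_percent(old_tps: float, new_tps: float) -> float:
    """Relative speed increase in percent when going from old_tps to new_tps."""
    return (new_tps - old_tps) / old_tps * 100.0


if __name__ == "__main__":
    # 2.7 t/s with 28 layers offloaded vs 3.7 t/s with 34 layers offloaded
    print(f"{speedup_percent(2.7, 3.7):.0f}% faster")
```

Measured this way, the gain is relative to the old rate, so going from 2.7 to 3.7 t/s works out to roughly 37%.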

I appreciate you recommending Qwen too, I'll look into it.

brucethemoose@lemmy.world · edit-2 18 hours ago

A Qwen 2.5 14B IQ3_M should fit completely in your VRAM, with longish context and acceptable quality.

An IQ4_XS will just barely overflow but should still be fast at short context.
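A rough back-of-the-envelope sketch of why those two quants land where they do on an 8 GB card. Everything here is an assumption for illustration: about 14.7 billion parameters for the 14B model, and approximate average bits per weight of 3.66 for IQ3_M and 4.25 for IQ4_XS. It only counts the weights, so KV cache, compute buffers, and anything the desktop is using come on top.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


if __name__ == "__main__":
    vram_gib = 8.0  # GTX 1070
    # Assumed figures: ~14.7B parameters, approximate bits per weight per quant.
    for name, bpw in [("IQ3_M", 3.66), ("IQ4_XS", 4.25)]:
        size = weights_gib(14.7, bpw)
        headroom = vram_gib - size
        print(f"{name}: ~{size:.1f} GiB weights, ~{headroom:.1f} GiB left for context")
```

Under these assumptions IQ3_M comes out around 6.3 GiB and IQ4_XS around 7.3 GiB, so the first leaves some room for context while the second is already close to the limit before any KV cache is allocated.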

And while I have not tried it yet, the 14B is allegedly smart.

Also, what I do on my PC is hook up my monitor to the iGPU so the GPU’s VRAM is completely empty, lol.