I played around with GPTQ under transformers to see whether multi-core or single-core made a difference, and got some results: single core, at least in this case, is "better". https://pastebin.com/XM7uASJh
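In case anyone wants to reproduce the single-core run, something along these lines should pin PyTorch to one core (a rough sketch, not necessarily the exact mechanism I used):

```python
# Rough sketch of pinning PyTorch CPU work to a single core.
import os
os.environ["OMP_NUM_THREADS"] = "1"   # set before torch is imported

import torch
torch.set_num_threads(1)              # intra-op CPU parallelism
torch.set_num_interop_threads(1)      # inter-op CPU parallelism
```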
Output was perfectly deterministic. I used the same prompt and got the same text back with the preset.
However, using 8-bit, split layers, or even single-GPU llama-7b, I get the same incoherent output. I've already downloaded a couple of the weights from the decapoda/llama-7b-hf repo, and their MD5s match the ones I converted with the latest script in transformers.
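For reference, the MD5 comparison was roughly like this (a sketch; the two directory names are just placeholders for wherever the downloaded and converted shards live):

```python
# Compare MD5 checksums of downloaded vs. locally converted weight shards.
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

downloaded = Path("llama-7b-hf-downloaded")  # shards pulled from the HF repo
converted = Path("llama-7b-hf-converted")    # shards from the conversion script

for shard in sorted(downloaded.glob("*.bin")):
    other = converted / shard.name
    ok = other.exists() and md5sum(shard) == md5sum(other)
    print(f"{shard.name}: {'match' if ok else 'MISMATCH'}")
```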
This is what I get, and it goes on like that to the end.
Is there something up with my weights, my setup, or...?
I think this is due to a lack of fine-tuning. When I use the lora on the 7b model it works fine, as pictured. So the lora does do something for chat.
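For anyone curious, applying the lora on top of the base 7b looks roughly like this (a sketch with placeholder paths; using peft as the loader is my assumption, not necessarily what this setup does under the hood):

```python
# Load the converted 7b base weights and apply a lora adapter on top with peft.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",        # converted base weights (placeholder path)
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # lora weights (placeholder path)
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")
```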