I developed this originally for llamafile, which was included in the last two releases: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2 Now we're upstreaming it to the llama.cpp project. There are other performance enhancements you can currently only get from llamafile, such as Kawrakow's work making K quants go much faster.
Is that just because nobody has made an effort yet to port them upstream, or is there something inherently difficult about making those changes work in llama.cpp?
I get the impression most llama.cpp users are interested in running models on GPU. AFAICT this optimization is CPU-only. Don't get me wrong, it's a huge one, and it opens the door to running llama.cpp on more and more edge devices.
Wait, computing SiLU directly using some numerical analysis would probably be a lot faster than doing an exp each time. Is there a significant perf impact from going through exp?
With expf() most of the work had already been done, so I could kill two birds with one stone. If you want to work out the math for computing SiLU directly, that'd be an awesome change I'd happily merge into llamafile. You might even be able to get it into PyTorch and other bigger-name projects.
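For anyone following along, here's a minimal sketch of the two paths being discussed. This isn't llamafile's code and the function names are mine; it just shows that SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)) can be computed by reusing an expf() kernel, or via the tanh identity sigmoid(x) = (1 + tanh(x/2)) / 2. The tanh form is mathematically exact; a real "direct SiLU" kernel would presumably replace tanhf() with its own polynomial approximation, which is the numerical-analysis work being invited above.

    #include <math.h>
    #include <stdio.h>

    /* SiLU through expf(), i.e. the "reuse the fast exp kernel" path:
     *   SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)) */
    static inline float silu_via_expf(float x) {
        return x / (1.0f + expf(-x));
    }

    /* Same function through the tanh identity:
     *   sigmoid(x) = (1 + tanh(x/2)) / 2, so SiLU(x) = 0.5 * x * (1 + tanh(x/2)).
     * A direct kernel could substitute its own approximation for tanhf() here. */
    static inline float silu_via_tanh(float x) {
        return 0.5f * x * (1.0f + tanhf(0.5f * x));
    }

    int main(void) {
        /* Sanity check: both forms agree to float precision. */
        for (int i = -6; i <= 6; i += 2) {
            float x = (float) i;
            printf("x=%+d  via expf: %+.6f  via tanh: %+.6f\n",
                   i, silu_via_expf(x), silu_via_tanh(x));
        }
        return 0;
    }

Whether the direct form actually wins depends on how cheap the vectorized expf() already is, which is the trade-off described above.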