Test Bench

In my last post, I mentioned the vital need for a large language model test bench. Essentially, the program I have in mind consists of a few different features:

  1. Ability to choose the model you’re connecting to (without requiring that the model itself run inside the application or on-device).
  2. Ability to attach a sentence embedding framework or model to be run on-device – probably by calling out to Python?
  3. Ability to store conversations with various models that preserve things like model name, parameters used, etc. for later retrieval.
  4. Ability to develop prompt templates
    • Include testing of templates – spreading tests out amongst different models, using different parameters, and storing all of these details for later analysis
  5. Ability to develop workflows where the products of templates feed into other templates.
  6. Ability to generate on-the-fly A-B testing scenarios to compare the outcomes of various schemes.

OK, now for some explanation here: pretty early in 2023, chain-of-thought libraries started cropping up to take advantage of a lot of the schemes I’m describing above – static chains of queries that encouraged the model to branch its behavior over multiple actions, extending and diversifying what it was capable of doing. This led to some interesting possibilities, such as the practice of building a faux ‘team’ of roles that the model could roleplay in order to tackle different aspects of a problem and collaborate with itself, developing a more robust solution than straight querying might have allowed.

It’s becoming obvious that more-advanced querying – developing meta-queries and second-order (and possibly third-order, whatever that would mean) queries – would benefit everyone in finding more-advanced use-cases for LLMs in the wild. It’s difficult to even constrain the realm of possibilities with these technologies given how broadly they can be applied, but finding where that edge is (and finding spaces that perhaps have not been explored yet) requires a test bench of this kind to be able to crank through example cases, store back the results, send them through human evaluation, and then store and display statistics on what works and what doesn’t.

This is probably going to be a larger project, but it’s a necessary precursor to any larger LLM project, whether that be in gaming or education or any of the other scenarios I have humming in the back of my head. I’d like to explain the education one further in a later post.

Back Again

For those of us that sport rich inner worlds, it’s often difficult to remember that, on occasion, it’s useful to form the habit of pulling the ideas and thoughts and musings out of the mental menagerie to see the light of day and live a little in the sun. I want to get back in the habit of doing this – for my own sanity, to chronicle my thoughts for later reflection, and to force myself to verbalize the ideas hammering around inside in a constant maelstrom.

To this end, I’m probably going to be converting this space into a place to vent. Not because I have a lot of angst to offgas but because I have a lot of ideas competing for my attention and motivation in my head and they do change a lot. Like many people with a similar personality to me, the trail of projects is long and reminiscent of a highway of misfit toys – incomplete, ill-formed, and abandoned (until they’re not).

My first instinct is to make an outline of all of the ideas I could possibly talk about, but a list is a device that represents an inner bureaucrat that I have a distinct aversion to, and so I’m going to keep this more off-the-cuff.

So, the first topic I need to vent about is… ternary machine learning systems! Or, the misshapen idea I have in my head that probably doesn’t work, but I haven’t had the patience to complete the system yet. So, instead of writing the code and debugging it and getting it to work nicely, I’m going to write about what it is here, how it works, and how I want it to work in the future. And also the crippling problems the system has that would probably require that I create new kinds of math or contend with literal impossibilities in order to overcome its weaknesses.

So, ternary models: those following AI/machine learning/large language models closely will remember Microsoft releasing the ‘Era of 1-bit LLMs’ arXiv article. The article outlines a new quantization method that converts floating-point values to a ternary system – i.e., three values only (-1, 0, and 1). These are technically referred to as 1.58-bit models since log2(3) ≈ 1.585 – an arbitrarily long sequence of ternary values can be packed at about 1.58 bits per value.
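The arithmetic behind that figure is quick to check:

```python
import math

# Bits of information carried by one ternary value (a "trit").
bits_per_trit = math.log2(3)
print(f"{bits_per_trit:.4f}")  # 1.5850

# Packing n trits into binary storage needs ceil(n * log2(3)) bits;
# e.g. 5 trits need ceil(5 * 1.585) = 8 bits -- exactly one byte.
n = 5
print(math.ceil(n * bits_per_trit))  # 8
```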

This may not sound like much or may not be intuitively a ‘better’ thing on its face if you don’t know a whole lot about the kinds of calculation that go into large language model inference, but suffice it to say that once you constrain the values to these three, it’s possible to do away with large portions of the extremely performance-heavy matrix multiplications required and instead replace them with matrix additions, with some mathematical cleverness.
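Loosely, the trick can be sketched like this – a toy Python illustration of the add/subtract/skip idea, not the actual kernel from the paper:

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix by a vector using only
    additions and subtractions -- no multiplications.
    weights: rows of -1/0/1 values; x: input activations."""
    out = []
    for row in weights:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v      # +1: just add the input
            elif w == -1:
                acc -= v      # -1: just subtract it
            # 0: skip the input entirely
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # [-3.0, 6.0]
```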

My idea was to take this a step further. I realized that if you’re constraining the values this much, the calculations would involve an awful lot of repetition – and could probably be cached and stored in some way so that instead of even having to do matrix addition, you could just do lookups in a hash table. Essentially still O(n) in the number of weights, I think, but with each step reduced to a constant-time lookup – my complexity math is rusty. It might even be faster than that.

So how does that work? Well, inference calculations for weights in the feed-forward network are usually just a rote multiplication (which, for these three values, means doing nothing for 1, zeroing everything out for 0, or negating for -1), summing the results, and then applying a ReLU or, in the case of Mistral models, a SwiGLU activation (which combines a Swish function – itself built on a sigmoid – with a gated linear unit, the success of which was famously attributed to ‘divine benevolence’ in lieu of a solid explanation). This last portion calculates the degree to which the weights connecting from the previous layer activate a particular neuron in the next layer.
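For the curious, the SwiGLU gate itself is small. A rough Python sketch of the elementwise operation – the projection matrices are omitted, and `a`/`b` here stand in for the gate and up projections of the input:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Swish (a.k.a. SiLU): x * sigmoid(x)
    return x * sigmoid(x)

def swiglu(a, b):
    # SwiGLU gate: the Swish of one projection gates the other,
    # elementwise over the hidden dimension.
    return swish(a) * b

print(round(swiglu(1.0, 2.0), 4))  # 1.4621
```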

If you have a set number of values that these weights could possibly be (-1, 0, 1), you can potentially pre-calculate all of that beforehand, as I mentioned. The scheme I had in mind was to essentially pack 5 ternary values into a single object, which, if you do the quick math here, 3^5 = 243 – meaning you can encode the entire set of grouped ternary values into a byte value, or an unsigned char value in C. The codes run from 0 to 242, leaving a few unused encodings at the top of the byte’s 0–255 range. At that point, every single permutation of 5 ternary values can be represented in a single byte, and adjustments to those values are just translations from one encoded value to another – something very easily mappable.
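A minimal sketch of that packing scheme – my own toy encoding, mapping each trit to {0, 1, 2} and reading the group as a base-3 number; the paper’s actual storage format may differ:

```python
def pack5(trits):
    """Pack five ternary values (-1, 0, 1) into one byte.
    Shifts each trit to {0, 1, 2} and accumulates base-3 digits,
    giving codes 0..242 (243..255 are unused)."""
    assert len(trits) == 5
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)
    return code

def unpack5(code):
    """Invert pack5: recover the five trits from a byte code."""
    trits = []
    for _ in range(5):
        trits.append(code % 3 - 1)
        code //= 3
    return trits[::-1]

group = [1, -1, 0, 0, 1]
b = pack5(group)
print(b)                    # 176 -- fits in an unsigned char
print(unpack5(b) == group)  # True
```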

If you take these concepts and try to translate them from fast inference (easy-peasy) into fast training, the problem becomes orders of magnitude more difficult. For reference, backpropagation achieves gradient descent by applying the chain rule to compute the partial derivative of the loss with respect to every weight, working backwards through each layer’s values (and the functions that produced them). These sorts of calculations are easy to wrap your head around when the values represent continuous vectors, but… things get weird (for me, at least) when talking about the same kinds of calculations for quantized values. I could see a way for these calculations to also be heavily simplified and even pre-calculated, but… that’s not something I’ve researched enough to know. It could be feasible.
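For what it’s worth, the common workaround in the quantized-training literature – used, as I understand it, by the BitNet authors – is the straight-through estimator: keep a latent full-precision copy of each weight, quantize it on the forward pass, and apply gradients to the latent copy as if the quantization step were the identity. A toy sketch of that idea (the names and the threshold here are my own, not the paper’s):

```python
def quantize_ternary(w, threshold=0.5):
    # Round a latent full-precision weight to -1, 0, or 1.
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0

def ste_update(latent, grad, lr=0.1):
    """Straight-through estimator step: the forward pass uses the
    quantized weight, but the gradient is applied to the latent
    full-precision weight as if quantization were the identity."""
    return latent - lr * grad

latent_w = 0.62
print(quantize_ternary(latent_w))        # 1 (the forward pass sees +1)
latent_w = ste_update(latent_w, grad=0.8)
print(round(latent_w, 2))                # 0.54 -- the latent value drifts
                                         # and may cross a threshold later
```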

If it is feasible, I can imagine a system capable of running in training AND inference mode in real-time on consumer hardware with relative ease. The parameter count would probably have to be much higher, but since each parameter is so much smaller and the computations so much faster, that shouldn’t matter too much.

I have yet to finish building a working prototype for this, though. I want to, but it’s… a lot for me to wrap my head around. Just learning the calculus involved in the backpropagation process was a lot, given my… rough background in calculus. Still, this is a project I do want to devote some time to, since real-time training and inference are a worthwhile goal.

I think I’ll leave this topic here for now, as there are several others I would like to work on for future posts.