Back Again

For those of us who sport rich inner worlds, it’s easy to forget that it’s useful, on occasion, to form the habit of pulling the ideas and thoughts and musings out of the mental menagerie to see the light of day and live a little in the sun. I want to get back in the habit of doing this: for my own sanity, to chronicle my thoughts for later reflection, and to force myself to verbalize the ideas hammering around inside in a constant maelstrom.

To this end, I’m probably going to be converting this space into a place to vent. Not because I have a lot of angst to offgas, but because I have a lot of ideas competing for my attention and motivation, and they change often. Like many people with a similar personality, I’ve left behind a trail of projects that is long and reminiscent of a highway of misfit toys: incomplete, ill-formed, and abandoned (until they’re not).

My first instinct is to make an outline of all of the ideas I could possibly talk about, but a list is the device of an inner bureaucrat I have a distinct aversion to, so I’m going to keep this more off-the-cuff.

So, the first topic I need to vent about is… ternary machine learning systems! Or rather, the misshapen idea I have in my head that probably doesn’t work, but that I haven’t had the patience to build out yet. So, instead of writing the code and debugging it and getting it to work nicely, I’m going to write about what it is here, how it works, and how I want it to work in the future. And also the crippling problems the system has that would probably require me to invent new kinds of math, or contend with literal impossibilities, in order to overcome its weaknesses.

So, ternary models: those following AI/machine learning/large language models closely will remember Microsoft releasing the ‘Era of 1-bit LLMs’ paper on arXiv. The paper outlines a quantization scheme that constrains weights from floating-point values to a ternary system of just three values (-1, 0, and 1). These are technically referred to as 1.58-bit models since log2(3) ≈ 1.585, meaning an arbitrarily large number of ternary values can be packed at roughly 1.58 bits per value.
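As a quick back-of-the-envelope illustration: a block of 1,000 weights stored as 16-bit floats occupies 16,000 bits, while the same 1,000 weights packed as ternary digits need only about 1,585 bits, roughly a tenth of the space.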

This may not sound like much, and it may not intuitively seem ‘better’ on its face if you don’t know a whole lot about the kinds of calculations that go into large language model inference, but suffice it to say that once you constrain the values to these three, it’s possible to do away with large portions of the extremely performance-heavy matrix multiplications and, with some mathematical cleverness, replace them with matrix additions.
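To make that concrete, here’s a minimal sketch (my own illustration, not the paper’s actual kernel) of a dot product over ternary weights, where every ‘multiplication’ collapses into an add, a subtract, or a skip:

```c
#include <stdint.h>

/* Dot product with weights constrained to {-1, 0, 1}: because the only
 * "multiplications" are by -1, 0, or 1, each term reduces to an add,
 * a skip, or a subtract -- no multiply instruction needed. */
float ternary_dot(const int8_t *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        if (w[i] == 1)       acc += x[i];   /* multiply by  1 -> add      */
        else if (w[i] == -1) acc -= x[i];   /* multiply by -1 -> subtract */
        /* w[i] == 0 contributes nothing, so it is skipped entirely       */
    }
    return acc;
}
```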

My idea was to take this a step further. I realized that if you’re constraining the values this much, the calculations involve an awful lot of repetition, and the repeated results could probably be cached so that instead of even doing matrix additions, you could just do lookups in a precomputed table. Each lookup is constant time, so a whole row of n weights works out to roughly O(n) with a very small constant, if my rusty complexity math is right.
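Here’s one way that lookup idea could be realized, sketched in C. It leans on two assumptions that go beyond what I’ve said so far: the weights are packed five to a byte (the scheme I get into below), and the activations are quantized and packed the same way, so every 5-element partial sum is fully determined by a pair of packed bytes and can be precomputed once:

```c
#include <stdint.h>

#define GROUPS 243                     /* 3^5 valid packed codes            */
static int8_t partial_sum[GROUPS][GROUPS];

/* decode one packed byte into five trits in {-1, 0, 1} */
static void unpack5(uint8_t code, int8_t t[5]) {
    for (int i = 0; i < 5; i++) { t[i] = (int8_t)(code % 3) - 1; code /= 3; }
}

/* fill the table once at startup with every possible 5-element dot product */
void build_partial_sum_table(void) {
    for (int a = 0; a < GROUPS; a++) {
        for (int b = 0; b < GROUPS; b++) {
            int8_t wa[5], wb[5];
            int sum = 0;
            unpack5((uint8_t)a, wa);
            unpack5((uint8_t)b, wb);
            for (int i = 0; i < 5; i++) sum += wa[i] * wb[i];
            partial_sum[a][b] = (int8_t)sum;   /* always in [-5, 5] */
        }
    }
}

/* dot product over packed groups: one table lookup per 5 weights */
int ternary_dot_lut(const uint8_t *w, const uint8_t *x, int n_groups) {
    int acc = 0;
    for (int g = 0; g < n_groups; g++) acc += partial_sum[w[g]][x[g]];
    return acc;
}
```

The table is 243 × 243 single-byte entries, about 58 KB, small enough to sit comfortably in cache.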

So how does that work? Well, the inference calculation for a neuron in the feed-forward network is usually just a rote multiplication of each incoming value by its weight (which, for these values, means leaving it alone for 1, zeroing it out for 0, or negating it for -1), summing the results, and then applying a ReLU function or, in the case of Mistral models, a SwiGLU function (which apparently combines a sigmoid-based activation with a gated linear unit, the success of which was apparently attributed to ‘divine benevolence’ in lieu of a solid explanation). That last step basically determines the degree to which the weighted inputs from the previous layer activate a particular neuron in the next layer.
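For reference, here are those activations in their standard textbook forms (my own sketch, not lifted from any particular model’s code):

```c
#include <math.h>

/* ReLU: pass positive pre-activations through, zero out the rest */
static float relu(float z) { return z > 0.0f ? z : 0.0f; }

/* SiLU / "Swish": z * sigmoid(z), the smooth gate used inside SwiGLU */
static float silu(float z) { return z / (1.0f + expf(-z)); }

/* SwiGLU combines two linear projections of the same input: one is passed
 * through SiLU and used to gate the other, element by element. */
static float swiglu(float gate_proj, float up_proj) {
    return silu(gate_proj) * up_proj;
}
```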

If there’s a fixed set of values these weights can possibly take (-1, 0, 1), you can potentially pre-calculate all of that beforehand, as I mentioned. The scheme I had in mind was to pack 5 ternary values into a single object, and if you do the quick math, 3^5 = 243, meaning every grouped set of ternary values can be encoded in a single byte, or an unsigned char value in C. The encodings run from 0 to 242, leaving the codes 243 through 255 unused. At that point, every possible sequence of 5 ternary values is represented by one byte, and adjustments to those values are just translations from one encoded value to another, which is very easily mappable.
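A minimal sketch of that packing, treating each ternary value as a base-3 digit (the function names are mine, not from any existing library):

```c
#include <stdint.h>

/* Pack five trits in {-1, 0, 1} into one byte.  Each trit is shifted to
 * {0, 1, 2} and used as a base-3 digit, so codes range over 0..242 and
 * 243..255 are simply unused. */
uint8_t pack5(const int8_t t[5]) {
    uint8_t code = 0;
    for (int i = 4; i >= 0; i--)
        code = (uint8_t)(code * 3 + (t[i] + 1));  /* one base-3 digit per trit */
    return code;
}

/* Inverse: recover the five trits from a packed byte. */
void unpack5(uint8_t code, int8_t t[5]) {
    for (int i = 0; i < 5; i++) {
        t[i] = (int8_t)(code % 3) - 1;
        code /= 3;
    }
}
```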

If you take these concepts and try to translate them from fast inference (easy-peasy) into fast training, the problem becomes orders of magnitude more difficult. For reference, the backpropagation step behind gradient descent involves applying the chain rule to compute the partial derivative of the loss with respect to every weight, working backward through every intermediate value (and the functions that produced those values) at each step of the process. Those calculations are easy enough to wrap your head around when the values are continuous, but… things get weird (for me, at least) when you try to apply the same machinery to quantized values, since a weight locked to -1, 0, or 1 doesn’t have a meaningful gradient in the usual sense. I could see a way for these calculations to also be heavily simplified and even pre-calculated, but… that’s not something I’ve researched enough to know. It could be feasible.
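For concreteness, here is the textbook, continuous-weight version of that gradient for a single ReLU neuron (my sketch; it doesn’t answer the ternary question, it just shows what would have to be simplified):

```c
/* Standard backprop for one neuron a = ReLU(z), z = sum_i w[i] * x[i].
 * Given the upstream gradient dL/da, the chain rule gives
 * dL/dw[i] = dL/da * ReLU'(z) * x[i].  How (or whether) this carries over
 * when w[i] is locked to {-1, 0, 1} is exactly the open question. */
void neuron_backward(const float *x, float z, float dL_da,
                     float *dL_dw, int n) {
    float dL_dz = (z > 0.0f) ? dL_da : 0.0f;   /* ReLU'(z) is 1 or 0 */
    for (int i = 0; i < n; i++)
        dL_dw[i] = dL_dz * x[i];               /* dz/dw[i] = x[i]    */
}
```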

If it is feasible, I can imagine a system capable of running in training AND inference mode in real time on consumer hardware with relative ease. The parameter count would probably have to be much higher, but since each parameter is so much smaller and the computations are simply faster, that shouldn’t matter too much.

I have yet to finish building a working prototype for this, though. I want to, but it’s… a lot for me to wrap my head around. Just learning the calculus involved in the backpropagation process was a lot, given my… rough background in calculus. Still, this is a project I do want to devote some time to, since real-time training and inference are a worthwhile goal.

I think I’ll leave this topic here for now, as there are several others I would like to work on for future posts.