Test Bench

In my last post, I mentioned the vital need for a large language model test bench. Essentially, the program I have in mind consists of a few different features:

  1. Ability to choose the model you’re connecting to (without requiring that the model run inside the application or on the device itself).
  2. Ability to attach a sentence-embedding framework or model to run on-device – probably by calling out to Python?
  3. Ability to store conversations with various models, preserving details like model name, parameters used, etc. for later retrieval (a rough sketch of such a record follows this list).
  4. Ability to develop prompt templates
    • Include testing of templates – spreading tests out amongst different models, using different parameters, and storing all of these details for later analysis
  5. Ability to develop workflows where the products of templates feed into other templates.
  6. Ability to generate on-the-fly A-B testing scenarios to compare the outcomes of various schemes.
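
As a concrete starting point for feature 3, here’s a minimal sketch of what a stored-conversation record might look like. Everything here is hypothetical – just the shape of the data, not a real schema:

```python
# A minimal sketch of the stored-conversation record from feature 3.
# Every name here is hypothetical -- this is the shape of the data only.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StoredRun:
    model_name: str                    # e.g. "mistral-7b-instruct"
    endpoint: str                      # where the model was reached (feature 1)
    parameters: dict                   # temperature, top_p, etc., for later analysis
    messages: list                     # the full back-and-forth of the conversation
    template_id: str | None = None     # which prompt template produced this run, if any
    created_at: datetime = field(default_factory=datetime.now)
```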

OK, now for some explanation here: pretty early in 2023, chain-of-thought libraries started cropping up to take advantage of a lot of the schemes I’m describing above – static chains of queries that encouraged the model to branch its own behavior over multiple actions to extend and diversify what it was capable of doing. This led to some interesting possibilities, such as the practice of building a faux ‘team’ of roles that the model could roleplay as in order to tackle different aspects of a problem and collaborate with itself to develop a more-robust solution than may have been possible with straight querying.

It’s becoming obvious that more-advanced querying – developing meta-queries and second-order (and possibly third-order, whatever that would mean) queries – would benefit everyone in finding new use-cases for LLMs in the wild. It’s difficult to even constrain the realm of possibilities with these technologies given how broadly they can be applied, but finding where that edge is (and finding spaces that perhaps have not been explored yet) requires a test bench of this kind: something that can crank through example cases, store back the results, send them through human evaluation, and then store and display statistics on what works and what doesn’t.

This is probably going to be a larger project, but it’s a necessary precursor to any larger LLM project, whether that be in gaming or education or any of the other scenarios I have humming in the back of my head. I’d like to explain the education one further in a later post.

New Media

On my first failed foray into the world of post-Bachelor’s academia, one of the central themes in the education department was the concept of ‘new media’. This was during the initial thrust of the belief that an iPad for every student would accelerate and propel education into a new millennium; grappling with ‘new media’ was a new paradigm for addressing the inherent differences and permutations of students’ pedagogical needs. Most of the idea centered on migrating the base assumption of literacy into literacies of varying types, given the shifting, dynamic nature of new technologies – along with recognizing that the cultural/memetic landscape was moving and evolving faster, creating more niches for self-expression and actualization, and that all of this had to be taken into account in modern classrooms.

There were inherent contradictions in the methods teachers were forced into implementing by precedent and law, but there were good points being made in my classes about the shifting landscape of media and interaction. Video games are a great example of this: the medium being interactive means that the negotiation of meaning and literacy becomes a two-way road, where authorship and knowledge are collaborative concepts.

This collaboration becomes all the more relevant with the development and adoption of large language models. There have been a few instances of games being developed that utilize this newborn technology, but they have been closer to test benches with a fancy façade whacked on top rather than serious attempts to weave meaningful and complex experiences from this technology.

A thought has been bouncing around my brain: interactive fiction has everything to gain from this revolution. One of the biggest advantages of these large language models is their power when given structure that guides their voice. This structure – something every game designer, programmer, and story writer is familiar with – could provide incredibly dynamic depth: write the edges of a given game world, shape the direction of its outputs toward genuinely interesting targets, and keep things fuzzy enough to allow for truly unique stories and experiences in each playthrough.

I started looking into some interesting systems that could be dropped on top of a large-language-model core to provide rich experiences. Here are a few ideas:

Psychological Anthropomorphization

Inspired by Disco Elysium, this system essentially takes a modified version of the Myers-Briggs/Jungian personality dynamics and transforms the different cognitive functions into anthropomorphized characters that pipe up throughout your journey to advise and aid or ail you.

Politics/Management Simulation

After doing some light reading on organizational dynamics, I was interested in tinkering with ways to develop worlds modeled after different historical semi-stable government structures – monarchy, constitutional monarchy, feudal societies, etc. – along with the cultural values accompanying each. What would be fascinating is to have a set of generation steps develop a new world from the outputs of a large language model. The idea behind this portion of the system would be to spend time generating at least a set of characters, along with their interrelationships within organizational structures, to provide a basis for interesting storylines to develop.

More ideas are coming, but there’s an absolutely vital project all of this development depends on: a GUI-based prompt-generation and testing program that saves prompts with dynamically-fillable portions, so that these models can be consistently queried for information or decisions. Such a program would essentially store these prompts, run massive batteries of tests on them to gauge their success and viability under different underlying parameters and in a variety of circumstances, and then potentially chain these pre-arranged prompts together to keep the world creation and world simulation humming along.
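
To make ‘dynamically-fillable portions’ concrete, here’s a rough sketch using nothing but the standard library. All names and fields are hypothetical placeholders:

```python
# A rough sketch of a saved prompt with dynamically-fillable portions,
# using only the standard library. All names here are hypothetical.
from string import Template

SAVED_PROMPTS = {
    "npc_decision": Template(
        "You are $npc_name, a $role in a $government_type society.\n"
        "Given the event: $event\n"
        "Decide how $npc_name responds, staying consistent with $values."
    ),
}

# Fill the template at query time with whatever the world generator produced.
prompt = SAVED_PROMPTS["npc_decision"].substitute(
    npc_name="Captain Imre",
    role="harbor master",
    government_type="constitutional monarchy",
    event="a smuggling ship was spotted at dawn",
    values="duty and quiet ambition",
)
print(prompt)
```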

Have yet to develop this though – it’s on the list.

Back Again

For those of us that sport rich inner worlds, it’s often difficult to remember that, on occasion, it’s useful to form the habit of pulling ideas, thoughts, and musings out of the mental menagerie to see the light of day and live a little in the sun. I want to get back in the habit of doing this – for my own sanity, to chronicle my thoughts for later reflection, and to force myself to verbalize the ideas hammering around inside in a constant maelstrom.

To this end, I’m probably going to be converting this space into a place to vent. Not because I have a lot of angst to offgas, but because I have a lot of ideas competing for my attention and motivation, and they change a lot. Like many people with a similar personality, my trail of projects is long and reminiscent of a highway of misfit toys – incomplete, ill-formed, and abandoned (until they’re not).

My first instinct is to make an outline of all the ideas I could possibly talk about, but a list is the device of an inner bureaucrat I have a distinct aversion to, so I’m going to keep this more off-the-cuff.

So, the first topic I need to vent about is… ternary machine learning systems! Or rather, the misshapen idea I have in my head that probably doesn’t work, but I haven’t had the patience to complete the system yet. So, instead of writing the code, debugging it, and getting it to work nicely, I’m going to write about what it is here, how it works, and how I want it to work in the future – and also the crippling problems the system has, which would probably require that I create new kinds of math or contend with literal impossibilities in order to overcome its weaknesses.

So, ternary models: those following AI/machine learning/large language models closely will remember Microsoft releasing ‘The Era of 1-bit LLMs’ arXiv paper. The paper outlines a new quantization scheme that converts floating-point values to a ternary system – i.e., three values only (-1, 0, and 1). These are technically referred to as 1.58-bit models since log2(3) ≈ 1.585 – for an arbitrarily large number of ternary values, they can be packed at a size of about 1.58 bits per value.

This may not sound like much, and it may not intuitively seem like a ‘better’ thing on its face if you don’t know a whole lot about the kinds of calculation that go into large language model inference, but suffice it to say that once you constrain the values to these three, it’s possible to do away with large portions of the extremely performance-heavy matrix multiplications and, with some mathematical cleverness, replace them with matrix additions.
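
To make that concrete, here’s a toy illustration – not the BitNet kernel itself, just the core trick. With weights constrained to {-1, 0, 1}, a dot product needs no multiplications at all; each weight just selects add, skip, or subtract:

```python
# Toy illustration: with weights in {-1, 0, 1}, a dot product reduces
# to additions and negations -- no multiplications required.
def ternary_dot(weights, inputs):
    total = 0.0
    for w, x in zip(weights, inputs):
        if w == 1:
            total += x      # keep the input as-is
        elif w == -1:
            total -= x      # negate it
        # w == 0 contributes nothing, so skip it entirely
    return total

print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 3.0, 1.5]))  # 0.5 - 3.0 + 1.5 = -1.0
```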

My idea was to take this a step further. I realized that if you’re constraining the values this much, the calculations would involve an awful lot of repetition – and could probably be cached and stored in some way so that, instead of even having to do matrix addition, you could just do lookups in a hash table. O(n) complexity, essentially, I think? My complexity math is rusty. It might even be faster than that.

So how does that work? Well, inference calculations for weights in the feed-forward network are usually just a rote multiplication (which, for these values, means doing nothing for 1, zeroing everything out for 0, or negating everything for -1), summing the results, and then applying a ReLU function or, in the case of Mistral models, a SwiGLU function (which combines a sigmoid function with a gated linear unit – the success of which was apparently attributed to ‘divine benevolence’ in lieu of a solid explanation). This last portion basically calculates the degree to which the inputs from a previous layer activate a particular neuron in the next layer.
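
For reference, here’s a small sketch of those two activations. The SwiGLU form below follows the common formulation SiLU(x @ W) * (x @ V); exact shapes and conventions vary between implementations, so treat this as illustrative:

```python
# Sketch of the two activations mentioned above, using numpy.
import numpy as np

def relu(z):
    # zero out negative pre-activations, pass positives through
    return np.maximum(0.0, z)

def silu(z):
    # "swish" with beta = 1: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu(x, W, V):
    # gated linear unit with a swish gate: one projection gates the other
    return silu(x @ W) * (x @ V)

x = np.array([0.5, -1.0, 2.0])
W = np.ones((3, 2))
V = np.full((3, 2), 0.5)
print(relu(x), swiglu(x, W, V))
```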

If you have a set number of values that these weights could possibly be (-1, 0, 1), you can potentially pre-calculate all of that beforehand, as I mentioned. The scheme I had in mind was to pack 5 ternary values into a single object – and if you do the quick math, 3^5 = 243 – meaning you can encode every grouped set of ternary values into a byte, or an unsigned char in C. The encodings run from 0 to 242, leaving a handful of unused codes at the top of the byte’s 0–255 range. At that point, every permutation of 5 ternary values can be represented in a single byte, and adjustments to those values are just translations from one encoded value to another – something very easily mappable.
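
Here’s a sketch of that packing scheme, plus a taste of the precomputation idea from earlier – a table that turns “sum this group of five weights” into a single lookup. This is illustrative only; a real kernel would cache group-level results against the activations, not just the raw trit sums:

```python
# Sketch of the packing scheme: 5 ternary digits ("trits") per byte.
# 3**5 = 243 <= 256, so every group of five {-1, 0, 1} values gets a
# unique code in 0..242, leaving 243..255 unused.
from itertools import product

def pack(trits):
    # trits: sequence of five values in {-1, 0, 1}
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)   # map -1/0/1 -> 0/1/2, base-3 encode
    return code                      # fits in an unsigned char

def unpack(code):
    trits = []
    for _ in range(5):
        trits.append(code % 3 - 1)   # undo the +1 shift
        code //= 3
    return trits[::-1]

# The caching idea: precompute, once, the sum of the five trits for every
# possible code, so that step becomes a single table lookup at inference time.
TRIT_SUM = {pack(g): sum(g) for g in product((-1, 0, 1), repeat=5)}

code = pack([1, 0, -1, 1, 1])
print(code, unpack(code), TRIT_SUM[code])   # sum of this group is 2
```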

If you take these concepts and try to translate them from fast inference (easy-peasy) into fast training, the problem becomes orders of magnitude more difficult. For reference, the backpropagation process for achieving gradient descent involves applying the chain rule of multivariate calculus across the entire set of previous values (and the functions that produced those values) at each step of the whole process. These sorts of calculations are easy enough to wrap your head around when the values are continuous vectors, but… things get weird (for me, at least) when the same kinds of calculations involve quantized values. I could see a way for these calculations to also be heavily simplified and even pre-calculated, but… that’s not something I’ve researched enough to know. It could be feasible.

If it is feasible, I can imagine a system that is capable of being run in training AND inference mode in real-time on consumer hardware with relative ease. The parameter count would probably have to be much higher, but since these parameters are much tinier and the computations are simply faster, that shouldn’t matter too much.

I have yet to finish building a working prototype for this though. I want to, but it’s… a lot for me to wrap my head around. Just learning the calculus involved in the backpropagation process was a lot, given my… rough background in calculus. Still, this is a project I do want to devote some time to, since real-time training and inference are a worthwhile goal.

I think I’ll leave this topic here for now, as there are several others I would like to work on for future posts.