Test Bench

In my last post, I mentioned the vital need for a large language model test bench. Essentially, the program I have in mind would provide a few key features:

  1. Ability to choose the model you’re connecting to (without requiring that you run it from within the application or on-device).
  2. Ability to attach a sentence embedding framework or model to be run on-device – probably by calling out to Python?
  3. Ability to store conversations with various models, preserving details like model name, parameters used, etc. for later retrieval (see the sketch after this list).
  4. Ability to develop prompt templates.
    • Include testing of templates – spreading tests out among different models, using different parameters, and storing all of these details for later analysis.
  5. Ability to develop workflows where the products of templates feed into other templates.
  6. Ability to generate on-the-fly A/B testing scenarios to compare the outcomes of various schemes.

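To make item 3 a little more concrete, here’s a minimal sketch of what a stored conversation record might look like. The `ConversationRecord` fields and the `save_conversation` helper are placeholders of my own, not an existing API, and SQLite is just one obvious place to park the data:

```python
# Hypothetical sketch of feature 3: a conversation record that preserves
# model name and parameters, dumped into SQLite for later retrieval.
import json
import sqlite3
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ConversationRecord:
    model_name: str          # e.g. "llama-2-13b-chat" (illustrative)
    parameters: dict         # temperature, top_p, max tokens, etc.
    messages: list           # [{"role": "user", "content": "..."}, ...]
    template_id: str = ""    # which prompt template produced this, if any
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def save_conversation(db_path: str, record: ConversationRecord) -> None:
    """Persist one conversation as a JSON blob keyed by model and timestamp."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS conversations "
            "(id INTEGER PRIMARY KEY, created_at TEXT, model_name TEXT, payload TEXT)"
        )
        conn.execute(
            "INSERT INTO conversations (created_at, model_name, payload) "
            "VALUES (?, ?, ?)",
            (record.created_at, record.model_name, json.dumps(asdict(record))),
        )
```
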
OK, now for some explanation here: pretty early in 2023, chain-of-thought libraries started cropping up to take advantage of a lot of the schemes I’m describing above – static chains of queries that encouraged the model to branch its own behavior over multiple actions, extending and diversifying what it was capable of doing. This led to some interesting possibilities, such as the practice of building a faux ‘team’ of roles that you could have the model roleplay as in order to tackle different aspects of a problem and collaborate with itself, developing a more-robust solution than might have been possible with straight querying.
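
A rough sketch of that faux-‘team’ pattern, assuming some `call_model` function wired to whichever backend the test bench is pointed at (the role names and prompts are purely illustrative):

```python
# Hypothetical sketch of the 'faux team' pattern: the same model is prompted
# under different roles, and its answers are fed back to it for synthesis.

def call_model(prompt: str, **params) -> str:
    # Placeholder: wire this to whichever model/backend the test bench selects.
    raise NotImplementedError

ROLES = ["architect", "skeptical reviewer", "end user"]

def team_solve(problem: str) -> str:
    # First-order queries: one answer per role.
    drafts = [
        call_model(f"You are the {role}. Give your take on:\n{problem}")
        for role in ROLES
    ]
    # Second-order query: ask the model to reconcile its own role-played answers.
    synthesis_prompt = (
        "Combine the following perspectives into one solution:\n\n"
        + "\n\n".join(f"[{role}]\n{draft}" for role, draft in zip(ROLES, drafts))
    )
    return call_model(synthesis_prompt)
```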

It’s becoming obvious that more-advanced querying – developing meta-queries and second-order (and possibly third-order, whatever that would mean) queries – would benefit everyone in finding more-advanced use-cases for LLMs in the wild. It’s difficult to even constrain the realm of possibilities with these technologies, given how broadly they can be applied. But finding where that edge is (and finding spaces that perhaps haven’t been explored yet) requires a test bench of this kind: something that can crank through example cases, store back the results, send them through human evaluation, and then store and display statistics on what works and what doesn’t.
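
That loop is simple enough to sketch. Assuming `scheme_a` and `scheme_b` are two prompting schemes under comparison and `human_prefers` is whatever human-evaluation step gets bolted on (all placeholders of mine), something like this covers the crank-through, store, evaluate, tally cycle:

```python
# Minimal sketch of the A/B evaluation loop: run two schemes over the same
# cases, keep both outputs for human review, and tally which scheme wins.
from collections import Counter

def ab_evaluate(cases, scheme_a, scheme_b, human_prefers):
    results, tally = [], Counter()
    for case in cases:
        out_a, out_b = scheme_a(case), scheme_b(case)
        winner = human_prefers(case, out_a, out_b)   # "a", "b", or "tie"
        results.append({"case": case, "a": out_a, "b": out_b, "winner": winner})
        tally[winner] += 1
    return results, tally
```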

This is probably going to be a larger project, but it’s a necessary precursor to any larger LLM project, whether that be in gaming or education or any of the other scenarios I have humming in the back of my head. I’d like to explain the education one further in a later post.
