The mod is composed of two parts:
1. An exe written in Python which connects the various components together (Whisper / LLM / xVASynth / NPC background descriptions). See here for the code: https://github.com/art-from-the-machine/Mantella
2. A Skyrim Papyrus mod which both reads out voicelines generated via the exe and passes in-game information to the exe (the NPC selected / in-game events). The code can be found in Files -> Mantella Spell here: https://www.nexusmods.com/skyrimspecialedition/mods/98631
The mod supports local (eg Llama 2), OpenAI (eg GPT-4), and OpenRouter (eg Claude v2) language models.
NPC memories are handled by recursively summarising previous conversations. When a conversation ends, it is passed to the LLM with the task of summarising it in a short paragraph. This paragraph is saved to a text file along with all previous conversation summaries. If the text file itself reaches the LLM's token limit, the whole file is then re-summarised and the process begins again. This means that the "lucidity" of NPC memories is tied to the token limit of the model used (LLMs with lower token limits will need to re-summarise memories more often). I have also subjectively noticed that the ability to recall memories depends on the LLM (eg GPT-3.5 vs GPT-4).
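As a rough illustration of the recursive summarisation described above, here is a minimal Python sketch. It is not taken from the Mantella code: `update_memory`, `count_tokens`, and the `summarise` callback are hypothetical names, the token count is a crude word count rather than a real tokenizer, and reading / writing the text file is left out so the logic stays visible.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: a crude stand-in that counts whitespace-separated words.
    # A real setup would use the tokenizer of the chosen LLM.
    return len(text.split())


def update_memory(
    memory: str,
    conversation: str,
    summarise: Callable[[str], str],
    token_limit: int,
) -> str:
    """Return the NPC's memory text after a conversation has ended.

    `summarise` is a hypothetical callback that sends text to the LLM
    and returns a short summary paragraph.
    """
    # Step 1: summarise the conversation that just ended.
    summary = summarise(conversation)

    # Step 2: append the new summary to all previous summaries.
    memory = f"{memory}\n\n{summary}".strip()

    # Step 3: if the stored summaries reach the token limit,
    # re-summarise the whole thing and start again from that.
    if count_tokens(memory) >= token_limit:
        memory = summarise(memory)

    return memory
```

In this sketch a lower `token_limit` triggers step 3 more often, which is why models with lower token limits end up with "blurrier" memories: older conversations get compressed again each time.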
Mantella contains over 1,000 NPC background descriptions which are passed to the LLM within the starting prompt to help it "get in character". This gives the LLM a head start, but from my testing and other user reports, models like GPT-3.5 already contain lots of data related to Skyrim lore and its characters to work with. The LLM will sometimes get details wrong, but for the most part this doesn't break the experience.
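To give an idea of how a background description could be placed in the starting prompt, here is a small Python sketch. The function name `build_starting_prompt` and the prompt wording are made up for illustration and are not Mantella's actual prompt; the bio in the usage example is likewise a placeholder.

```python
def build_starting_prompt(npc_name: str, bio: str, memory: str = "") -> str:
    """Assemble a starting prompt from an NPC's bio and (optionally) its memories.

    The wording below is a hypothetical example, not the prompt Mantella uses.
    """
    parts = [
        f"You are {npc_name}, a character in Skyrim.",
        f"Background: {bio}",
    ]
    # Only mention earlier conversations if there is something to recall.
    if memory:
        parts.append(
            f"Summary of your previous conversations with the player:\n{memory}"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder bio written for this example only.
    print(build_starting_prompt("Lydia", "A housecarl sworn to protect her Thane."))
```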
The performance of the mod will vary depending on your hardware and chosen setup. xVASynth and Whisper are run locally, and the chosen LLM can either be run locally or via an API. When I run Mantella via the OpenAI API I have an average response time of ~5 seconds on my 5800X3D CPU. You can also run each component on your GPU for shorter response times, but since I run Skyrim in VR the drop in frames is too much of a sacrifice for me.
With the "how" covered, I want to also touch on why I started this project. Skyrim in VR feels incredibly immersive and realistic, and I started working on this mod because I felt like unique dialogue was the missing puzzle piece to complete the experience. Having NPCs that I can talk to over long journeys, that react to the things I have done and treat me differently, or that remember me from a visit to their town that I might not even remember myself, builds up a narrative over time that is unique to me.
Another reason I started this project is that I am interested in seeing what is possible with these different technologies and where their limits are. While this passion project is a mod for a video game released in 2011, I can't wait to see what developers building games from the ground up with this technology in mind can achieve. I also hope that Mantella gives an idea of how such a system could be implemented in newly released games. While AAA studios could possibly pull off charging users on a monthly basis to use these services, indie developers might have a harder time achieving this. I hope that Mantella provides insight into what is possible when running everything completely offline. Local models are constantly improving, and I can't wait to see how these improvements continue. Overall I am incredibly excited to see where this kind of technology goes in the future!
Mantella Trailer: https://youtu.be/FLmbd48r2Wo?si=QLe2_E1CogpxlaS1