Show HN: LLMs can generate valid JSON 100% of the time

854 points

3 years ago

Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent Deterministic-Finite Automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtelty is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.

Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.

From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.

I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.

I look forward to feedback, bug reports, feature requests and discussions!

Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702

303 comments