Hello everyone. After checking out the latest commercially available models, we decided we wanted a model that could code and keep up with the latest AI research. To build our own self-hosted copilots, we needed a large dataset, and we wanted to share it as we go.
The focus for this version is on creating a baseline “Manager level” understanding when someone types the question/prompt:
“define how this software works in the module: ./path/some.py”
The model responds with the generated answer wrapped in a YAML payload. To build this dataset, we started by extracting how to use classes, global functions, base classes (inheritance/polymorphism), and imports from 1207 Python AI research repos that we are learning from. We also wanted to draw and speak/hear with transformers, so we added image and audio modes to the dataset in the hope of getting more audio/image models in this space.
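For readers curious what extracting classes, global functions, base classes, and imports from a module can look like, here is a minimal sketch using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It is not our actual extraction pipeline; the function name `summarize_module` and the shape of the returned dict are assumptions made up for this illustration.

```python
import ast


def summarize_module(source: str) -> dict:
    """Collect top-level classes (with bases and methods), functions, and imports.

    Hypothetical helper for illustration only; not the dataset's real extractor.
    """
    tree = ast.parse(source)  # parses the text only, nothing is imported or executed
    summary = {"classes": [], "functions": [], "imports": []}
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:  # walk top-level statements only
        if isinstance(node, ast.ClassDef):
            summary["classes"].append({
                "name": node.name,
                # base classes as written in the source, e.g. "nn.Module"
                "bases": [ast.unparse(base) for base in node.bases],
                "methods": [n.name for n in node.body if isinstance(n, func_types)],
            })
        elif isinstance(node, func_types):
            summary["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            summary["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # keep leading dots so relative imports stay recognizable
            summary["imports"].append("." * node.level + (node.module or ""))
    return summary


# Made-up example module text; torch does not need to be installed to parse it.
EXAMPLE = '''
import os
from torch import nn

class Encoder(nn.Module):
    def forward(self, x):
        return x

def build_encoder():
    return Encoder()
'''

summary = summarize_module(EXAMPLE)
print(summary)
```

A summary like this only covers names and structure; turning it into a description of how to use each class or function is a separate step that this sketch does not attempt.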
Here's the summary (everything is in parquet files):
- ~2.3M unique source coding rows
- ~1.1M instruct alpaca yaml text rows
- ~923K png knowledge graph images with alpaca text description
- ~334K mp3s over ~2 years of continuous audio playtime
- requires 1.5 TB storage on disk
We plan on training and fine-tuning with these datasets using models like Code Llama 70B, and we shared an overview of some of the other coding models we liked on our blog, which may help others looking to do the same: https://matlok.ai/
Lastly, if these datasets are not useful, there are also a lot of good datasets already on Hugging Face: https://huggingface.co/datasets