[Personal Blog](http://zafar.cc/not-notmnist-dataset-generation/)
[GitHub Link](https://github.com/zafartahirov/not_notMNIST)
I wrote a little script that you can use to generate datasets for classification (like MNIST or notMNIST).
It takes the fonts that you have and creates images plus a pickle of labels and features that you can load into Python.
A more detailed explanation is here: http://zafar.cc/not-notmnist-dataset-generation/

I would really appreciate any critique, issue reports, and pull requests on GitHub: https://github.com/zafartahirov/not_notMNIST
The main benefit I personally see is that you can test your classifier on datasets that involve Unicode characters. The catch is that you need a lot of fonts to generate a decent dataset. If you have a lot of fonts for your language, I would appreciate it if you could share the dataset :) I generated some using Hiragana, but I don't have licenses for many fonts, so it is more of a demo (check GitHub). I would really love to have datasets for Chinese, Arabic, Hebrew, Cyrillic, etc.
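For the Unicode part, one way to get the list of characters (the classes) for a script is to walk a code point range and keep only the characters whose Unicode name matches. This is just a sketch using the standard library; `characters_in_range` is a hypothetical helper, not something from the repo, and the ranges below cover only the basic Hiragana letters and the basic lowercase Cyrillic letters.

```python
import unicodedata


def characters_in_range(first, last, name_prefix):
    """Return the characters with code points in [first, last] whose
    Unicode name starts with name_prefix; unassigned points are skipped."""
    chars = []
    for codepoint in range(first, last + 1):
        char = chr(codepoint)
        if unicodedata.name(char, "").startswith(name_prefix):
            chars.append(char)
    return chars


# U+3041..U+3096: Hiragana letters, including the small forms.
hiragana = characters_in_range(0x3041, 0x3096, "HIRAGANA LETTER")

# U+0430..U+044F: basic lowercase Cyrillic letters.
cyrillic = characters_in_range(0x0430, 0x044F, "CYRILLIC SMALL LETTER")
```

Whether a given font actually has glyphs for these characters is a separate question, so a real dataset would still need fonts that cover the script.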