dsync can use multiple threads (specified via an option). Work is handed out through a bounded multi-producer multi-consumer (MPMC) queue, a C implementation of Dmitry Vyukov's bounded MPMC queue built on gcc atomic builtins. The main thread traverses the given sources and adds each file to the queue. The other threads dequeue entries and do the sync/copy work concurrently.
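The producer/worker arrangement described above can be sketched roughly as follows. This is a minimal illustration and not dsync's actual source: the names `queue_push`, `queue_pop`, `demo_sync`, `QSIZE` and `MAX_WORKERS` are hypothetical, plain integers stand in for file entries, and shutting the workers down through a `done` flag is an assumption. The queue follows the general shape of Vyukov's bounded MPMC design (one sequence number per cell) using gcc `__atomic` builtins; cache-line padding is left out for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

#define QSIZE 64        /* queue capacity; must be a power of two */
#define MAX_WORKERS 16  /* hypothetical upper bound for this sketch */

struct cell { size_t seq; long data; };
struct queue { struct cell cells[QSIZE]; size_t enq_pos, deq_pos; };
struct worker { struct queue *q; int *done; long sum; pthread_t tid; };

static void queue_init(struct queue *q) {
    for (size_t i = 0; i < QSIZE; i++) q->cells[i].seq = i;
    q->enq_pos = q->deq_pos = 0;
}

/* Weak CAS on a position counter; on failure *expected is refreshed. */
static int cas(size_t *p, size_t *expected, size_t desired) {
    return __atomic_compare_exchange_n(p, expected, desired, 1,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Returns 1 on success, 0 if the queue is full. */
static int queue_push(struct queue *q, long data) {
    size_t pos = __atomic_load_n(&q->enq_pos, __ATOMIC_RELAXED);
    for (;;) {
        struct cell *c = &q->cells[pos & (QSIZE - 1)];
        size_t seq = __atomic_load_n(&c->seq, __ATOMIC_ACQUIRE);
        ptrdiff_t dif = (ptrdiff_t)seq - (ptrdiff_t)pos;
        if (dif == 0) {                  /* cell is free for this lap */
            if (cas(&q->enq_pos, &pos, pos + 1)) {
                c->data = data;          /* publish data, then the new seq */
                __atomic_store_n(&c->seq, pos + 1, __ATOMIC_RELEASE);
                return 1;
            }
        } else if (dif < 0) {
            return 0;                    /* full */
        } else {
            pos = __atomic_load_n(&q->enq_pos, __ATOMIC_RELAXED);
        }
    }
}

/* Returns 1 on success, 0 if the queue is empty. */
static int queue_pop(struct queue *q, long *out) {
    size_t pos = __atomic_load_n(&q->deq_pos, __ATOMIC_RELAXED);
    for (;;) {
        struct cell *c = &q->cells[pos & (QSIZE - 1)];
        size_t seq = __atomic_load_n(&c->seq, __ATOMIC_ACQUIRE);
        ptrdiff_t dif = (ptrdiff_t)seq - (ptrdiff_t)(pos + 1);
        if (dif == 0) {                  /* cell holds published data */
            if (cas(&q->deq_pos, &pos, pos + 1)) {
                *out = c->data;          /* read, then free cell for next lap */
                __atomic_store_n(&c->seq, pos + QSIZE, __ATOMIC_RELEASE);
                return 1;
            }
        } else if (dif < 0) {
            return 0;                    /* empty */
        } else {
            pos = __atomic_load_n(&q->deq_pos, __ATOMIC_RELAXED);
        }
    }
}

static void *worker_main(void *arg) {
    struct worker *w = arg;
    long entry;
    for (;;) {
        /* Read the flag before popping so no late entry is missed. */
        int finished = __atomic_load_n(w->done, __ATOMIC_ACQUIRE);
        if (queue_pop(w->q, &entry)) w->sum += entry; /* stand-in for sync/copy */
        else if (finished) break;
        else sched_yield();
    }
    return NULL;
}

/* The calling thread produces entries 1..nitems; workers consume them.
   Returns the sum of all consumed entries. */
long demo_sync(int nworkers, long nitems) {
    struct queue q;
    struct worker w[MAX_WORKERS];
    int done = 0;
    long total = 0;
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    queue_init(&q);
    for (int i = 0; i < nworkers; i++) {
        w[i] = (struct worker){ .q = &q, .done = &done, .sum = 0 };
        if (pthread_create(&w[i].tid, NULL, worker_main, &w[i]) != 0) abort();
    }
    for (long n = 1; n <= nitems; n++)   /* stand-in for traversing sources */
        while (!queue_push(&q, n)) sched_yield();
    __atomic_store_n(&done, 1, __ATOMIC_RELEASE);
    for (int i = 0; i < nworkers; i++) {
        pthread_join(w[i].tid, NULL);
        total += w[i].sum;
    }
    return total;
}
```

Calling `demo_sync(4, 1000)` from a `main` function (compiled with `gcc -pthread`) runs four workers against one producer. Because the queue is bounded, the producer yields whenever the workers fall behind, which keeps memory use fixed regardless of how many entries the traversal finds.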
With multiple threads, dsync compares well against GNU cp when copying directories that contain many small files, such as the Linux kernel repository. More details are in the README.