Throughput matters more than latency. The problem is embarrassingly parallel and can be divided to suit any number of cores. Cores can be register or stack machines. Cores need access to >100 MB of memory. The algorithm does not depend on the order of communication between cores, but inter-core communication is required (shared memory or message passing is fine). The workload is branch-rich, so SIMD/SIMT is not possible. The workload is mostly integer instructions. The workload is memory bound, not compute bound.
Can you think of any off-the-shelf or rentable hardware better suited to this workload than many-core ARM chips, which have some vector instructions I won't use?
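A minimal sketch of the workload's shape, in Python, for readers who want something concrete. Everything in it is a hypothetical stand-in rather than the real algorithm: `branchy_integer_work` is a Collatz step counter chosen only because it is integer-only with data-dependent branches, and `worker` and `run` are illustrative names. It does not model the memory-bound part or the >100 MB working set, and Python threads will not give real parallel throughput; it only shows an arbitrary split across workers and message passing whose arrival order does not matter.

```python
import queue
import threading


def branchy_integer_work(x: int) -> int:
    """Hypothetical stand-in kernel: integer-only, data-dependent branches."""
    steps = 0
    while x != 1:
        if x & 1:          # odd: unpredictable branch, poor fit for SIMD lanes
            x = 3 * x + 1
        else:              # even
            x >>= 1
        steps += 1
    return steps


def worker(chunk, outbox):
    # Each worker owns its own slice of the input; no ordering is assumed
    # between workers.
    total = 0
    for x in chunk:
        total += branchy_integer_work(x)
    outbox.put(total)      # one message per worker, arrival order irrelevant


def run(items, n_workers):
    outbox = queue.Queue()
    # Strided split: works for any number of workers.
    chunks = [items[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c, outbox)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine with a commutative operation so message order cannot matter.
    return sum(outbox.get() for _ in range(n_workers))
```

Calling `run(list(range(1, 101)), 1)` and `run(list(range(1, 101)), 7)` should give the same total, which is the property that lets the work be divided to suit any core count.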
[1] https://www.igorslab.de/wp-content/uploads/2021/12/alder_lak...
Notes: GPUs tend to have huge vector widths. Xeon Phi has huge vectors. GreenArrays (https://www.greenarraychips.com/) cores are too small, have too little RAM, and are not available in large machines. Adapteva chips are not taped out on a competitive node. Sunway SW26010 has huge vectors. Graphcore has huge vectors.