TLDR: How important are complexity-based arguments for limitations of SSMs in the real world?
With all the discussion right now on Mamba vs transformers, I am reminded of the Big Bird paper from late 2020 (https://arxiv.org/pdf/2007.14062). Big Bird introduces a nice sparse attention implementation, and goes on to provide a strong example of its limitations. Specifically, it leverages the Orthogonal Vector Conjecture to show there are tasks a full transformer can do in a single layer that a sparse transformer cannot solve with a fixed number of layers.
I think a lot of the critiques of Mamba fall into the same category, specifically there are tasks that cannot be solved by a model that runs in linear time with sequence length. You can frame it as, there are some tasks for which to know what parts of the start of the sequence are important to remember you need to know the end of the sequence, or as, a model with a fixed memory bank will for some length not be able to store all information, but I think it all boils down to complexity theory.
The question I have is, do we care? There are also tasks that a transformer cannot solve in a fixed number of layers by the same logic, trivially including any NP-hard problem, but we don't necessarily see that as meaning we should not rely on transformers.