I heavily documented my work with comments, though I don't know if I did a good job explaining the process, and I tend to notice a lot of grammatical errors when rereading my English writing.
This generalizes the algorithm described in Intel's paper so that it works for any CRC parameters. The paper does explain, somewhat vaguely, how to handle the various cases that arise with different parameters, but most implementations target one specific type of CRC.
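To make "any CRC parameters" concrete, here is a minimal bit-by-bit reference sketch. It assumes the usual Rocksoft-style parameter set (width, poly, init, refin, refout, xorout), which may not match the exact set my code takes. The names `CrcParams`, `reflect` and `crc_bitwise` are made up for this illustration. A slow reference like this is mostly useful for checking a fast implementation against known check values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrcParams:
    """Hypothetical parameter bundle (Rocksoft-style model)."""
    width: int     # CRC width in bits
    poly: int      # generator polynomial without the top x^width term
    init: int      # initial register value
    refin: bool    # reflect each input byte
    refout: bool   # reflect the final register
    xorout: int    # value XORed into the final register

def reflect(value, bits):
    """Reverse the lowest `bits` bits of value."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out

def crc_bitwise(data, p):
    """Slow reference CRC that processes one bit at a time, MSB first."""
    mask = (1 << p.width) - 1
    reg = p.init
    for byte in data:
        if p.refin:
            byte = reflect(byte, 8)
        for i in range(7, -1, -1):
            # Feedback is the register's top bit XOR the incoming data bit.
            feedback = ((reg >> (p.width - 1)) & 1) ^ ((byte >> i) & 1)
            reg = (reg << 1) & mask
            if feedback:
                reg ^= p.poly
    if p.refout:
        reg = reflect(reg, p.width)
    return reg ^ p.xorout

CRC32 = CrcParams(32, 0x04C11DB7, 0xFFFFFFFF, True, True, 0xFFFFFFFF)
CRC16_CCITT_FALSE = CrcParams(16, 0x1021, 0xFFFF, False, False, 0x0000)

print(hex(crc_bitwise(b"123456789", CRC32)))              # 0xcbf43926
print(hex(crc_bitwise(b"123456789", CRC16_CCITT_FALSE)))  # 0x29b1
```

The same function covers reflected and non-reflected CRCs of any width, which is the kind of variation a generalized implementation has to deal with.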
The algorithm uses some clever, complicated math (carry-less multiplication) to fold the data buffer down to a much smaller one, and then computes the CRC from that smaller buffer.
The paper's algorithm goes like this: Fold-by-4 -> Fold-by-1 -> Fold to 64 bits -> Barrett reduction.
Things get confusing toward the end, when you actually need to compute the CRC from the smaller buffer, and the extra complexity doesn't really improve performance.
My simplification of the algorithm: Fold-by-4 -> Lookup Table
I figured I could do this after reading a section of the paper stating that the new, smaller buffer is "congruent" to the original: both leave the same remainder modulo the CRC polynomial, so a plain table-driven CRC over the smaller buffer should give the same result. This doesn't affect performance much, since the longer stage (Fold-by-4) still uses intrinsics. The software table algorithm is only used to reduce the remaining data (fewer than 200 bytes).
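Here is a rough sketch of the fold-then-table idea. It is not my actual code and makes several simplifying assumptions: Python integers stand in for the 128-bit registers, `clmul` is a software stand-in for the carry-less multiply instruction, it folds one 128-bit block at a time instead of four, and the parameters are fixed to the non-reflected CRC-32/BZIP2 so that no bit reversal is needed. All function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less multiply of two GF(2) polynomials held in ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def polymod(a, poly):
    """Remainder of a divided by poly over GF(2)."""
    while a.bit_length() >= poly.bit_length():
        a ^= poly << (a.bit_length() - poly.bit_length())
    return a

# Assumed parameters: CRC-32/BZIP2 (non-reflected).
POLY, INIT, XOROUT = 0x04C11DB7, 0xFFFFFFFF, 0xFFFFFFFF
FULL_POLY = (1 << 32) | POLY
K_HI = polymod(1 << 192, FULL_POLY)  # x^(128+64) mod P, for the upper 64 bits
K_LO = polymod(1 << 128, FULL_POLY)  # x^128 mod P, for the lower 64 bits
MASK64 = (1 << 64) - 1

def make_table():
    """Standard 256-entry table for the MSB-first byte-wise algorithm."""
    table = []
    for byte in range(256):
        reg = byte << 24
        for _ in range(8):
            reg = ((reg << 1) ^ POLY) if reg & 0x80000000 else (reg << 1)
        table.append(reg & 0xFFFFFFFF)
    return table

TABLE = make_table()

def table_crc(data, reg):
    """Byte-wise table update, without the final XOR."""
    for byte in data:
        reg = ((reg << 8) & 0xFFFFFFFF) ^ TABLE[(reg >> 24) ^ byte]
    return reg

def crc_table_only(data):
    return table_crc(data, INIT) ^ XOROUT

def crc_fold_then_table(data):
    if len(data) < 16:
        return crc_table_only(data)
    # XOR the initial value into the first 32 bits of the first block.
    acc = int.from_bytes(data[:16], "big") ^ (INIT << 96)
    pos = 16
    while len(data) - pos >= 16:
        nxt = int.from_bytes(data[pos:pos + 16], "big")
        # Fold the accumulator 128 bits forward and merge in the next block.
        acc = clmul(acc >> 64, K_HI) ^ clmul(acc & MASK64, K_LO) ^ nxt
        pos += 16
    # The folded block plus the tail is congruent to the original message,
    # so the table algorithm (starting from a zero register) finishes the job.
    small = acc.to_bytes(16, "big") + data[pos:]
    return table_crc(small, 0) ^ XOROUT

message = bytes(range(256)) * 4
print(crc_fold_then_table(message) == crc_table_only(message))  # True
```

The table pass here only ever sees 16 to 31 bytes, no matter how long the input is, which is why skipping the later folding stages and the Barrett reduction costs so little.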
I've just finished it, so I don't know whether it's mature yet, and it will definitely need further testing.
What do you think?