We show that the intermediate hidden layers of transformer models are robust feature extractors for text classification.
On content safety, LEC models achieved an F1 score of 0.96 vs GPT-4o's 0.82 and Llama Guard 8B's 0.71. LEC models outperformed both with only 15 training examples for binary classification and 50 examples for multi-class classification across 66 categories.
On prompt injection, LEC models achieved an F1 score of 0.98 vs GPT-4o's 0.92 and DeBERTa v3 Prompt Injection v2's 0.73. LEC models outperformed DeBERTa with only 5 training examples and GPT-4o with only 55 training examples.
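This summary does not say how the intermediate-layer features become a classifier. Below is a minimal sketch of the general idea, not the paper's implementation. It assumes each text's hidden states from one intermediate layer are mean-pooled into a single vector and fed to an L2-penalized logistic regression. The pooling choice, the classifier, the hyperparameters, and the helper `fake_layer_states` (which produces synthetic stand-ins for real hidden states) are all illustrative assumptions. With a real model, per-layer states could be obtained, for example, via Hugging Face Transformers' `output_hidden_states=True`. The training set size of 15 simply mirrors the binary content-safety setting reported above.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_states: Sequence[Vector]) -> Vector:
    """Average one layer's per-token hidden states into a single feature vector."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(
    X: Sequence[Vector], y: Sequence[int], l2: float = 0.01,
    lr: float = 0.5, epochs: int = 500,
) -> Tuple[Vector, float]:
    """Fit L2-penalized logistic regression with full-batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 penalty shrinks the weights (not the bias) toward zero.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


def fake_layer_states(label: int, rng: random.Random,
                      tokens: int = 5, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for one intermediate layer's hidden states:
    dimension 0 carries the class signal, the other dimensions are noise."""
    center = 1.0 if label == 1 else -1.0
    return [
        [center + rng.gauss(0, 0.3)] + [rng.gauss(0, 0.5) for _ in range(dim - 1)]
        for _ in range(tokens)
    ]


rng = random.Random(0)
# 15 labeled training examples, alternating between the two classes.
train_labels = [i % 2 for i in range(15)]
X_train = [mean_pool(fake_layer_states(lbl, rng)) for lbl in train_labels]
w, b = train_logreg(X_train, train_labels)

# Evaluate on held-out synthetic examples.
test_labels = [i % 2 for i in range(200)]
X_test = [mean_pool(fake_layer_states(lbl, rng)) for lbl in test_labels]
accuracy = sum(
    predict(w, b, x) == lbl for x, lbl in zip(X_test, test_labels)
) / len(test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

The design keeps the transformer frozen and trains only a small linear head, which is why a handful of labeled examples can be enough when the layer's features already separate the classes. On real data, which layer to read from and how to pool are choices to validate empirically.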
Read the full paper and our approach here: https://arxiv.org/abs/2412.13435