Lightweight Safety Classification Using Pruned Language Models


a year ago

Layer Enhanced Classification (LEC) is a novel technique that outperforms current industry leaders like GPT-4o, Llama Guard 1B and 8B, and deBERTa v3 Prompt Injection v2 on content safety and prompt injection tasks.

We show empirically that the intermediate hidden layers in transformers are robust feature extractors for text classification.
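For readers who want a concrete picture, here is a minimal sketch of the general idea: take the hidden states from one intermediate layer, pool them into a single vector per text, and fit a small classifier on top. This is not the code from the paper. The helper names (`extract_layer_features`, `fit_logistic_regression`, `predict`), the mean pooling, and the L2-penalized logistic regression trained by plain gradient descent are all assumptions made for illustration; `model_name` and `layer` are placeholders you would choose yourself.

```python
import numpy as np


def extract_layer_features(texts, model_name, layer):
    """Return one feature vector per text from a chosen hidden layer.

    Hypothetical helper: mean pooling over tokens is an assumption here,
    not necessarily the pooling used in the paper.
    """
    # Imported lazily so the classifier below can be tried without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    features = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            outputs = model(**inputs, output_hidden_states=True)
            # hidden_states[0] is the embedding output; index i is the output of layer i.
            hidden = outputs.hidden_states[layer]  # shape: (1, seq_len, hidden_dim)
            pooled = hidden.mean(dim=1).squeeze(0)  # average over tokens
            features.append(pooled.float().cpu().numpy())
    return np.stack(features)


def fit_logistic_regression(X, y, l2=1.0, lr=0.1, steps=500):
    """Fit a binary L2-penalized logistic regression by gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n + l2 * w / n  # data gradient plus L2 penalty
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Return 0/1 labels by thresholding the linear score at zero."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)
```

In use, you would call `extract_layer_features` on a handful of labeled examples, pass the resulting matrix and labels to `fit_logistic_regression`, and then call `predict` on features extracted from new text. Only the small classifier is trained; the language model's weights stay frozen.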

On content safety, LEC models achieved a 0.96 F1 score vs GPT-4o's 0.82 and Llama Guard 8B's 0.71. The LEC models outperformed the other models with only 15 training examples for binary classification and 50 examples for multi-class classification across 66 categories.

On prompt injection, LEC models achieved a 0.98 F1 score vs GPT-4o's 0.92 and deBERTa v3 Prompt Injection v2's 0.73. LEC models outperformed deBERTa with only 5 training examples and GPT-4o with only 55 training examples.
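For anyone unfamiliar with the metric quoted above, F1 is the harmonic mean of precision and recall. The snippet below is a small sketch of the binary version; the function name `f1_score` and the `positive` parameter are illustrative, and the exact averaging used for the multi-class numbers is described in the paper, not here.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives means precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because it combines both kinds of error, F1 penalizes a classifier that flags everything as unsafe just as it penalizes one that misses most unsafe inputs.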

Read the full paper and our approach here: https://arxiv.org/abs/2412.13435
