By using feature activations and contrastive search, we can build a jailbreak-resistant model.

With this approach, we drastically reduced how often the model could be jailbroken, as measured with jailbreak prompts from the StrongREJECT dataset.
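
To make the idea of contrasting feature activations more concrete, here is a minimal sketch. It is not the implementation used in this work. It assumes each prompt has already been mapped to a vector of feature activations, and it ranks features by how much more strongly they fire on jailbreak prompts than on benign ones. The function name `rank_contrastive_features`, the mean-difference score, and the toy activation values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def rank_contrastive_features(jailbreak_acts, benign_acts, top_k=3):
    """Return indices of the features whose mean activation is most
    elevated on jailbreak prompts relative to benign prompts.

    Both inputs are lists of equal-length activation vectors
    (one vector per prompt). Hypothetical helper for illustration.
    """
    n_features = len(jailbreak_acts[0])
    scored = []
    for j in range(n_features):
        jailbreak_mean = mean(row[j] for row in jailbreak_acts)
        benign_mean = mean(row[j] for row in benign_acts)
        scored.append((jailbreak_mean - benign_mean, j))
    # Largest positive difference first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:top_k]]


# Made-up activations: 2 prompts per group, 3 features each.
jailbreak_acts = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]]
benign_acts = [[0.1, 0.1, 0.5], [0.2, 0.3, 0.4]]

top_features = rank_contrastive_features(jailbreak_acts, benign_acts, top_k=2)
print(top_features)
```

In this toy data, feature 0 is the only one that is clearly more active on the jailbreak prompts, so it is ranked first. Real activations would come from the model's own feature extractor, which this sketch does not cover.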