Jailbreak Resistance

By combining Feature Activations with Contrastive Search, we can build a jailbreak-resistant model. With this approach, we drastically reduced the success rate of jailbreak attempts, as measured with jailbreak prompts from the StrongREJECT dataset.
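
The text does not spell out how the two pieces fit together, so the following is only a minimal sketch of one plausible reading: contrast feature activations recorded on jailbreak prompts against those recorded on benign prompts, keep the features that differ most, and flag new prompts that activate them strongly. The function names (`contrast_features`, `is_flagged`), the threshold, and the toy activation vectors are all hypothetical and stand in for whatever feature extractor and data the real system uses.

```python
from statistics import mean


def contrast_features(pos_acts, neg_acts, top_k=2):
    """Return the top_k feature indices whose mean activation is highest
    on pos_acts relative to neg_acts (a simple mean-difference contrast)."""
    n_features = len(pos_acts[0])
    diffs = []
    for j in range(n_features):
        pos_mean = mean(row[j] for row in pos_acts)
        neg_mean = mean(row[j] for row in neg_acts)
        diffs.append((pos_mean - neg_mean, j))
    diffs.sort(reverse=True)  # largest positive difference first
    return [j for _, j in diffs[:top_k]]


def is_flagged(acts, feature_ids, threshold=0.5):
    """Flag a prompt when its mean activation over the selected
    features exceeds the (hypothetical) threshold."""
    score = mean(acts[j] for j in feature_ids)
    return score > threshold


# Toy activations (assumed data): one row per prompt, one column per feature.
jailbreak_acts = [[0.9, 0.1, 0.8, 0.0], [0.7, 0.2, 0.9, 0.1]]
benign_acts = [[0.1, 0.2, 0.0, 0.1], [0.0, 0.1, 0.1, 0.0]]

selected = contrast_features(jailbreak_acts, benign_acts, top_k=2)
suspicious = is_flagged([0.8, 0.1, 0.9, 0.0], selected)
harmless = is_flagged([0.1, 0.1, 0.0, 0.0], selected)
```

In a real setup the activation rows would come from the model's feature extractor, and a flagged prompt could trigger a refusal or steer the selected features rather than simply return a boolean.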