Using Feature Activations and Contrastive Search, we can build a jailbreak-resistant model. With this approach, we drastically reduced how often the model could be jailbroken, as tested with jailbreak prompts from the StrongREJECT dataset.
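To make the idea of contrasting feature activations concrete, here is a minimal sketch. It is not the method used in this write-up: it assumes that "contrastive search" means comparing per-feature activations between a set of jailbreak prompts and a set of benign prompts, and ranking features by the difference in their mean activation. The function name `rank_contrastive_features` and the activation values are hypothetical, made up purely for illustration.

```python
from statistics import mean


def rank_contrastive_features(jailbreak_acts, benign_acts, top_k=2):
    """Return indices of the features that fire most on jailbreak prompts
    relative to benign prompts (hypothetical helper, for illustration only).

    Each argument is a list of rows, one row per prompt, where each row
    holds one activation value per feature.
    """
    n_features = len(jailbreak_acts[0])
    diffs = []
    for j in range(n_features):
        jailbreak_mean = mean(row[j] for row in jailbreak_acts)
        benign_mean = mean(row[j] for row in benign_acts)
        # A large positive difference marks a feature tied to jailbreak prompts.
        diffs.append((jailbreak_mean - benign_mean, j))
    # Sort by difference, largest first, and keep the top_k feature indices.
    diffs.sort(reverse=True)
    return [j for _, j in diffs[:top_k]]


# Toy activations (invented numbers): 2 prompts per set, 4 features each.
jailbreak_acts = [[0.9, 0.1, 0.8, 0.2], [0.7, 0.2, 0.9, 0.1]]
benign_acts = [[0.1, 0.2, 0.7, 0.3], [0.2, 0.1, 0.8, 0.1]]

top_features = rank_contrastive_features(jailbreak_acts, benign_acts, top_k=2)
print(top_features)
```

The sketch stops at ranking features. In practice the activations would come from the model itself rather than from hand-written lists.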
I