By using Feature Activations and Contrastive Search, we can build a jailbreak-resistant model.

With this approach, we drastically reduced how often the model could be jailbroken when tested with jailbreak prompts from the StrongREJECT dataset.
