Advanced Functionality


The Goodfire SDK provides a powerful way to steer your AI models by changing how they work internally. To do this, we use mechanistic interpretability to find human-interpretable features and alter their activations.

This advanced tutorial builds on quickstart.ipynb and shows you how to:

  • Try different modes for setting feature interventions

  • Define conditional feature interventions

  • Explore feature nearest neighbors

  • Inspect logits when sampling model responses

Setup

[1]:
!pip install goodfire==0.2.11
Collecting goodfire==0.2.11
  Downloading goodfire-0.2.11-py3-none-any.whl.metadata (1.2 kB)
Requirement already satisfied: httpx<0.28.0,>=0.27.2 in /usr/local/lib/python3.10/dist-packages (from goodfire==0.2.11) (0.27.2)
Collecting ipywidgets<9.0.0,>=8.1.5 (from goodfire==0.2.11)
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Requirement already satisfied: numpy<2.0.0,>=1.26.4 in /usr/local/lib/python3.10/dist-packages (from goodfire==0.2.11) (1.26.4)
Requirement already satisfied: pydantic<3.0.0,>=2.9.2 in /usr/local/lib/python3.10/dist-packages (from goodfire==0.2.11) (2.9.2)
Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (3.7.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (2024.8.30)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (1.0.7)
Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (3.10)
Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (1.3.1)
Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (0.14.0)
Collecting comm>=0.1.3 (from ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11)
  Downloading comm-0.2.2-py3-none-any.whl.metadata (3.7 kB)
Requirement already satisfied: ipython>=6.1.0 in /usr/local/lib/python3.10/dist-packages (from ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (7.34.0)
Requirement already satisfied: traitlets>=4.3.1 in /usr/local/lib/python3.10/dist-packages (from ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (5.7.1)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: jupyterlab-widgets~=3.0.12 in /usr/local/lib/python3.10/dist-packages (from ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (3.0.13)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.9.2->goodfire==0.2.11) (0.7.0)
Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.9.2->goodfire==0.2.11) (2.23.4)
Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.9.2->goodfire==0.2.11) (4.12.2)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (75.1.0)
Collecting jedi>=0.16 (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (4.4.2)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (3.0.48)
Requirement already satisfied: pygments in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (2.18.0)
Requirement already satisfied: backcall in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.2.0)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.1.7)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.10/dist-packages (from ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (4.9.0)
Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx<0.28.0,>=0.27.2->goodfire==0.2.11) (1.2.2)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in /usr/local/lib/python3.10/dist-packages (from jedi>=0.16->ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.8.4)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.10/dist-packages (from pexpect>4.3->ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=6.1.0->ipywidgets<9.0.0,>=8.1.5->goodfire==0.2.11) (0.2.13)
Downloading goodfire-0.2.11-py3-none-any.whl (27 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 139.8/139.8 kB 4.9 MB/s eta 0:00:00
Downloading comm-0.2.2-py3-none-any.whl (7.2 kB)
Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 28.7 MB/s eta 0:00:00
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 35.6 MB/s eta 0:00:00
Installing collected packages: widgetsnbextension, jedi, comm, ipywidgets, goodfire
  Attempting uninstall: widgetsnbextension
    Found existing installation: widgetsnbextension 3.6.10
    Uninstalling widgetsnbextension-3.6.10:
      Successfully uninstalled widgetsnbextension-3.6.10
  Attempting uninstall: ipywidgets
    Found existing installation: ipywidgets 7.7.1
    Uninstalling ipywidgets-7.7.1:
      Successfully uninstalled ipywidgets-7.7.1
Successfully installed comm-0.2.2 goodfire-0.2.11 ipywidgets-8.1.5 jedi-0.19.2 widgetsnbextension-4.0.13
[2]:
from google.colab import userdata

# Add your Goodfire API key to your Colab secrets
GOODFIRE_API_KEY = userdata.get('GOODFIRE_API_KEY')
[3]:
import goodfire

client = goodfire.Client(GOODFIRE_API_KEY)

# Instantiate a model variant for use later in the notebook
variant = goodfire.Variant("meta-llama/Meta-Llama-3-8B-Instruct")

Feature intervention modes

There are two primary modes for feature interventions:

  • Pinning (mode='pin'): Sets the feature weight in the variant to a specific value consistently.

  • Nudging (mode='nudge', default): Biases the model towards that feature, amplifying its activation where it’s already present.

The key difference is that nudging enhances existing activation patterns, while pinning enforces a consistent activation strength regardless of the original state.
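As a conceptual sketch (not the SDK's actual internals), the difference between the two modes can be pictured as how an intervention value combines with a feature's original activation:

```python
# Illustrative sketch, NOT the SDK's implementation: how 'pin' and
# 'nudge' treat a feature's original activation differently.
def apply_intervention(activation, value, mode):
    if mode == "pin":
        # Force the activation to a fixed value, ignoring the original.
        return value
    elif mode == "nudge":
        # Bias the existing activation by the intervention value.
        return activation + value
    raise ValueError(f"unknown mode: {mode}")

# Pinning yields the same strength regardless of the original state;
# nudging amplifies whatever was already there.
print(apply_intervention(0.25, 0.5, "pin"))    # 0.5
print(apply_intervention(0.25, 0.5, "nudge"))  # 0.75
```

The exact arithmetic inside the model is an implementation detail of the SDK; the sketch only captures the behavioral contract described above.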

[4]:
pirate_features, relevance = client.features.search(
    "pirate",
    model=variant,
    top_k=5
)
picked_pirate_feature = pirate_features[1]
[8]:
variant.set(picked_pirate_feature, 0.6, mode='nudge')
variant
[8]:
Variant(
   base_model=meta-llama/Meta-Llama-3-8B-Instruct,
   edits={
      Feature("Pirate characters and themes in fiction and role-playing games"): {'mode': 'nudge', 'value': 0.6},
   }
)
[9]:
for token in client.chat.completions.create(
    [
        {"role": "user", "content": "Hello. How are you?"}
    ],
    model=variant,
    stream=True,
    max_completion_tokens=50,
):
    print(token.choices[0].delta.content, end="")
Ahoy matey! I'm doin' swell, thank ye for askin'! I be a helpful assistant, ready to set sail on the high seas of adventure with ye! What be bringin' ye to these fair waters? Got

Conditional feature interventions

You can establish relationships between different features (or feature groups) using conditional interventions. Use the Controller to define the relevant conditional statements, then pass the controller to a ProgrammableVariant for model sampling. Below, we use the API to fire the pirate feature intervention only if a whale feature has fired in the prompt or in the generation.
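Conceptually, a conditional intervention is a threshold check on a trigger feature that gates an edit to a target feature. This toy sketch (plain Python, not the SDK's internals) illustrates the gating logic:

```python
# Illustrative sketch, NOT the SDK's implementation: apply an edit to a
# target feature only when a trigger feature's activation crosses a
# threshold, mirroring `controller.when(feature > threshold)`.
def conditional_edit(activations, trigger, threshold, target, value):
    """Set `target` to `value` iff `trigger` activates above `threshold`."""
    edited = dict(activations)
    if edited.get(trigger, 0.0) > threshold:
        edited[target] = value
    return edited

# Whale feature fires -> the pirate edit is applied.
print(conditional_edit({"whales": 0.4, "pirate": 0.0},
                       "whales", 0.1, "pirate", 0.5))

# Whale feature absent -> the variant is left unchanged.
print(conditional_edit({"pirate": 0.0},
                       "whales", 0.1, "pirate", 0.5))
```

The feature names and dict representation here are hypothetical; in the SDK, the same gating is expressed with the `controller.when(...)` context manager shown in the cells that follow.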

[10]:
whale_feature, _ = client.features.search("whales", "meta-llama/Meta-Llama-3-8B-Instruct", top_k=1)
whale_feature
[10]:
FeatureGroup([
   0: "Whales and their characteristics"
])
[11]:
pirate_features, _ = client.features.search("talk like a pirate", "meta-llama/Meta-Llama-3-8B-Instruct", top_k=5)
pirate_features
[11]:
FeatureGroup([
   0: "The model should roleplay as a pirate",
   1: "Pirate-related language and themes",
   2: "Pirate characters and themes in fiction and role-playing games",
   3: "Mentions of rum, especially in pirate or cocktail contexts",
   4: "The model's turn to speak while roleplaying as an expert"
])
[12]:
programmable_variant = goodfire.variants._experimental.ProgrammableVariant(base_model="meta-llama/Meta-Llama-3-8B-Instruct")

with programmable_variant.controller.when(
    whale_feature > 0.1
):
    programmable_variant.controller[pirate_features[0]] = 0.5

for token in client.chat.completions.create(
    [
        {"role": "user", "content": "Tell me about squids."}
    ],
    model=programmable_variant,
    stream=True,
    max_completion_tokens=50,
):
    print(token.choices[0].delta.content, end="")

print('\n\n---\n\n')

for token in client.chat.completions.create(
    [
        {"role": "user", "content": "Talk to me about whales."}
    ],
    model=programmable_variant,
    stream=True,
    max_completion_tokens=50,
):
    print(token.choices[0].delta.content, end="")
WARNING:goodfire:ProgrammableVariants are an experimental feature and may change in the future.
Squids! They're some of the most fascinating creatures in the ocean!

Squids are cephalopods, related to octopuses and cuttlefish. They have a soft, boneless body that's often covered in a m

---


Whales, matey! They be the largest creatures in the seven seas, don't ye know! From the blue whale to the humpback, they're a mighty fine sight to behold. And they don't just swim around, they sing

(Experimental) Explore feature nearest neighbors

Get neighboring features by comparing against either individual feature directions or the centroid of a feature group. When using feature directions, the method analyzes similarity in the actual embedding space (as defined by decoder weights), while providing a feature group will find features closest to the calculated center point (centroid) of that group’s combined feature directions.
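The two lookup styles can be sketched with toy 2-D decoder directions and cosine similarity (this is an illustration of the idea, not the SDK's actual implementation):

```python
# Illustrative sketch, NOT the SDK's implementation: rank toy feature
# directions by cosine similarity against either a single feature's
# direction or the centroid of a feature group.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def neighbors(query, directions, top_k=2):
    """Indices of the `top_k` directions most similar to `query`."""
    ranked = sorted(enumerate(directions), key=lambda d: -cosine(query, d[1]))
    return [i for i, _ in ranked[:top_k]]

directions = [
    [1.0, 0.0],  # feature 0
    [0.9, 0.1],  # feature 1
    [0.0, 1.0],  # feature 2
]

# Single feature: compare against feature 0's own direction.
print(neighbors(directions[0], directions))  # [0, 1]

# Feature group: compare against the centroid of features 0 and 2.
centroid = [(a + b) / 2 for a, b in zip(directions[0], directions[2])]
print(neighbors(centroid, directions))
```

Note that the centroid query can surface a feature (here, feature 1) that is not the closest neighbor of any single group member, which is why querying with a group behaves differently from querying with one direction.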

[13]:
whale_feature, _ = client.features.search("animals such as whales", top_k=1)

client.features._experimental.neighbors(whale_feature, model=variant)
Warning: The experimental features API is subject to change.
[13]:
FeatureGroup([
   0: "Marine ecosystems and coral reef biodiversity",
   1: "Semantic understanding of elephants",
   2: "Naval warfare and ship types",
   3: "Swimming safety equipment and activities for children",
   4: "Maritime and shipping terminology in non-English languages",
   5: "Wildlife and nature observation in outdoor settings",
   6: "Violent or turbulent natural phenomena, especially involving water",
   7: "The model begins providing substantive information",
   8: "Seafood dishes and restaurants",
   9: "Large bodies of water (seas and oceans)"
])
[14]:
assistant_features, _ = client.features.search("the assistant should", top_k=5)
language_features, _ = client.features.search("foreign languages", top_k=5)

client.features._experimental.neighbors(assistant_features | language_features, model=variant)
[14]:
FeatureGroup([
   0: "Language identification for translation tasks",
   1: "Nationality and language adjectives",
   2: "References to Indian and Chinese markets or businesses",
   3: "Classical Western civilization and its historical influence",
   4: "The English language as a subject of study or communication",
   5: "Korean language tokens in translation contexts",
   6: "The model should provide Python programming help",
   7: "The model's turn to speak or translate in a non-English language",
   8: "European and other nationalities of notable figures or companies",
   ...
   10: "Language and linguistic concepts"
])

(Experimental) Get logits

You can also get the raw logit outputs of a model.

[15]:
client.chat._experimental.logits(
    [
        {"role": "user", "content": "Hello. How are you?"},
        {"role": "assistant", "content": "I am feeling very"},
    ],
    model=variant,
    top_k=5,
)
[15]:
LogitsResponse(logits={' well': 22.125, ' good': 17.5, ' pleased': 16.0, ' fine': 14.8125, ' chip': 14.5})
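Raw logits are unnormalized scores; to interpret them as relative probabilities over the returned top-k tokens, you can apply a numerically stable softmax. This sketch reuses the values from the LogitsResponse above (softmax over only the top 5 approximates, but is not identical to, the distribution over the full vocabulary):

```python
# Convert top-k logits into probabilities over those k tokens with a
# numerically stable softmax (subtract the max before exponentiating).
import math

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Values from the LogitsResponse in the cell above.
top_logits = {" well": 22.125, " good": 17.5, " pleased": 16.0,
              " fine": 14.8125, " chip": 14.5}

probs = softmax(top_logits)
print(max(probs, key=probs.get))  # ' well' dominates among the top 5
```

The large logit gap (22.125 vs. 17.5) means " well" takes almost all of the probability mass among these five tokens, which matches the pirate-free completion style seen earlier.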