Safety Judges help keep language models safe by filtering out harmful content. They ensure that responses are:

  • legal
  • respectful
  • safe for users

Steps to Create a Safety Judge

Select Safety Judge

Select Safety Model Type

You can choose between:

  • Collinear Guard
  • Collinear Guard Nano
  • Llama Guard
  • Wild Guard
  • Prompted Model

1. Creating a Collinear Guard judge

This judge evaluates the safety of model outputs on a more granular scale, providing a detailed assessment of the content’s safety level. The Likert scale enables a nuanced view, from identifying highly unsafe outputs to confirming very safe responses.

Safety Rating: Likert scale rating from 1 to 5

  • 1: Very unsafe
  • 2: Unsafe
  • 3: Neutral
  • 4: Safe
  • 5: Very safe

Once you select Collinear Guard judge, select “Continue”.

Set Judge Name

Name it according to your preference and select “Create Judge”.

2. Creating a Collinear Guard Nano judge

The Collinear guard nano model supports three types of evaluations:

Prompt Evaluation: Binary classification

  • 0: The prompt is deemed unsafe.
  • 1: The prompt is considered safe.

Response Evaluation: Binary classification

  • 0: The response is deemed unsafe.
  • 1: The response is considered safe.

Refusal Evaluation: Binary classification

  • 0: Indicates the model refused to generate a response.
  • 1: Indicates the model successfully generated a response.

Once you select Collinear Guard Nano judge, select “Continue”.

Select your Evaluation Type

  1. Reponse evaluation

  2. Prompt evaluation

  3. Refusal evaluation

and then click “Continue”.

Set Judge Name

Name it according to your preference and select “Create Judge”.

3. Creating a LLama Guard judge

This LLama Guard judge provides a simple and direct safety assessment, ensuring that unsafe content is flagged and only safe content passes through.

LLamaGuard Evaluation: Binary classification

  • 0: The content is deemed unsafe.
  • 1: The content is considered safe.

Once you select LLama Guard judge, select “Continue”.

Set Judge Name

Name it according to your preference and select “Create Judge”.

4. Creating a Wild Guard judge

The Wild Guard judge provides a straightforward safety evaluation for prompts and responses, along with refusal handling, ensuring that unsafe interactions are flagged and refusals are properly identified.

Prompt Evaluation: Binary classification

  • 0: The prompt is deemed unsafe.
  • 1: The prompt is considered safe.

Response Evaluation: Binary classification

  • 0: The response is deemed unsafe.
  • 1: The response is considered safe.

Refusal Evaluation: Binary classification

  • 0: Indicates the model refused to generate a response.
  • 1: Indicates the model successfully generated a response.

Once you select Wild Guard judge, select “Continue”.

Set Judge Name

Name it according to your preference and select “Create Judge”.

5. Creating a Prompted Model judge

This safety judge will evaluate model outputs based on predefined safety criteria, ensuring that unsafe responses are flagged for further review, while safe outputs are approved for deployment.

Output: Binary classification

  • 0: Indicates the response is deemed unsafe.
  • 1: Indicates the response is considered safe.

Once you select Prompted Model judge, select “Continue”.

Select your Prompted Model

You can select your model from the drop-down. If you haven’t added a model, select “Add New Model” to create a new one.

Edit your prompt template

You can proceed with the template or edit it and then select “Continue.”

Set Judge Name

Name it according to your preference and select “Create Judge.”