Overview

The Compare page lets users compare different model runs and their respective judgments on a given dataset. It is divided into several sections, each highlighting a different aspect of how the models performed.

Runs Comparison

This section provides an overview of the different runs being compared. Each run includes the model name, dataset name, and a safety status.

  • Run Details:
    • Contains the judge name and dataset name (a sketch of one possible run record follows below).
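
The run record format is not specified here, but a minimal sketch of the kind of data the Runs Comparison section displays might look like the following. All field names and the SafetyStatus values are illustrative assumptions, not the actual schema:

```typescript
// Illustrative sketch only: field names and the SafetyStatus values are assumptions.
type SafetyStatus = "safe" | "unsafe" | "unknown";

interface RunDetails {
  judgeName: string;   // judge used to score the run
  datasetName: string; // dataset the run was evaluated on
}

interface Run {
  modelName: string;
  datasetName: string;
  safetyStatus: SafetyStatus;
  details: RunDetails;
}

// Example run record as it might appear in the Runs Comparison section.
const exampleRun: Run = {
  modelName: "model-a",
  datasetName: "toxicity-eval-set",
  safetyStatus: "safe",
  details: { judgeName: "judge-1", datasetName: "toxicity-eval-set" },
};
```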

Score Breakdown

A breakdown of the scores given to each run, presented as bar graphs. Scores fall into two types:

  • Binary - A pass or fail verdict.
  • Likert - A score on a 1-5 scale, with 1 being the lowest and 5 the highest.

Each run's bar graph illustrates the distribution of these scores; a sketch of how such a distribution could be tallied follows.
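
As an illustration of how binary and Likert scores could be tallied into the per-run distributions shown in the bar graphs, here is a small sketch. The Score representation is an assumption made for this example, not the tool's internal format:

```typescript
// Illustrative sketch only: the Score representation is an assumption.
type Score =
  | { kind: "binary"; pass: boolean }
  | { kind: "likert"; value: 1 | 2 | 3 | 4 | 5 };

// Tally scores into the kind of distribution a per-run bar graph would plot.
function scoreDistribution(scores: Score[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const s of scores) {
    const bucket = s.kind === "binary" ? (s.pass ? "pass" : "fail") : String(s.value);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return counts;
}

// Example: two binary scores and two Likert scores.
const distribution = scoreDistribution([
  { kind: "binary", pass: true },
  { kind: "binary", pass: false },
  { kind: "likert", value: 4 },
  { kind: "likert", value: 4 },
]);
console.log(distribution); // Map { "pass" => 1, "fail" => 1, "4" => 2 }
```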

Detailed Judgments

A tabular representation of responses, judgments, and labels. It compares how each response scores according to different judges. Each row includes:

  • Response: The model response being evaluated.
  • Judgment: The score or pass/fail verdict each judge assigned to the response (a sketch of one possible row structure follows).
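
One way to picture a row of this table is as a response paired with one judgment per judge. The structure below is an assumed sketch for illustration only, not the actual table schema:

```typescript
// Illustrative sketch only: field names are assumptions about how a table row might be shaped.
interface JudgmentRow {
  response: string;                                               // the model response being evaluated
  judgments: Record<string, "pass" | "fail" | 1 | 2 | 3 | 4 | 5>; // one judgment per judge
  labels?: string[];                                              // optional labels on the response
}

const exampleRow: JudgmentRow = {
  response: "I can't help with that request.",
  judgments: { "judge-1": "pass", "judge-2": 5 },
  labels: ["refusal"],
};
```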

Controls

  • Query Filter: Allows users to filter the responses using specific queries. More about the query language here.
  • Export Button: Exports the current view as JSON (a hypothetical example of the export payload follows this list).
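
The exported JSON format is not documented in this section; the snippet below shows a hypothetical export of the current view, with all field names assumed for illustration:

```typescript
// Hypothetical export payload: the structure and field names are assumptions,
// not the tool's documented export format.
const exportedView = {
  runs: ["run-a", "run-b"],
  query: "label:refusal", // query filter applied to the current view, if any
  rows: [
    {
      response: "I can't help with that request.",
      judgments: { "judge-1": "pass", "judge-2": 5 },
    },
  ],
};

console.log(JSON.stringify(exportedView, null, 2)); // serialize the view as JSON
```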

Usage

This interface helps evaluate how well models handle toxic language by:

  • Comparing multiple runs side by side.
  • Analyzing judgment consistency across different judges.
  • Identifying areas requiring improvement for model safety.