Compare Model and Judge Performance
Use the Collinear AI Platform to compare model and judge performance.
Overview
The Compare page lets you compare different model runs and their respective judgments on a given dataset. The page is divided into several sections, each providing insight into how models perform against specific datasets.
Runs Comparison
This section provides an overview of the runs being compared. Each run shows the model name, dataset name, and a safety status.
Run Details:
Shows the judge name and dataset name used for the run.
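Conceptually, each run in the comparison view carries metadata like the following. This is a minimal sketch; the field names and values here are illustrative, not the platform's actual schema:

```python
# Hypothetical metadata for one run in the comparison view;
# the actual fields shown on the platform may differ.
run = {
    "model_name": "my-chat-model-v2",      # placeholder model name
    "dataset_name": "toxicity-eval-set",   # placeholder dataset name
    "safety_status": "pass",               # overall safety status of the run
    "details": {
        "judge_name": "safety-judge",      # judge used for this run
        "dataset_name": "toxicity-eval-set",
    },
}
```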
Score Breakdown
A breakdown of the scores given to each run, presented as bar graphs. Scores fall into one of two numeric categories:
Binary - Pass or fail.
Likert - A scale of 1-5, with 1 the lowest and 5 the highest.
Each run's bar graph shows the distribution of these scores.
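As a rough illustration of what such a distribution represents, here is a minimal sketch in Python. The score lists, score values, and function name are hypothetical; the platform computes and renders these graphs for you:

```python
from collections import Counter

def score_distribution(scores):
    """Count how often each score value appears in a run.

    Works for both binary scores ("pass"/"fail") and
    Likert scores (integers 1-5).
    """
    counts = Counter(scores)
    total = len(scores)
    # Return each score's share of the run, which is what
    # the bar graph visualizes.
    return {score: count / total for score, count in counts.items()}

# Hypothetical example: a Likert-scored run.
likert_run = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]
print(score_distribution(likert_run))
# e.g. {5: 0.3, 4: 0.4, 3: 0.1, 2: 0.1, 1: 0.1}

# Hypothetical example: a binary-scored run.
binary_run = ["pass", "pass", "fail", "pass"]
print(score_distribution(binary_run))
# e.g. {'pass': 0.75, 'fail': 0.25}
```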
Detailed Judgments
A tabular view of responses, judgments, and labels that compares how each response scores according to different judges. Each row includes:
Response:
The response being evaluated.
Judgment:
The score or pass/fail verdict assigned to the response by each judge.
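To make the cross-judge comparison concrete, a row in this table can be thought of as a record like the one below. The field names and judge names are assumptions for illustration, not the platform's actual schema:

```python
# Hypothetical shape of one Detailed Judgments row; real column
# names on the platform may differ.
row = {
    "response": "I'm sorry, I can't help with that request.",
    "judgments": {
        "judge_a": {"type": "binary", "verdict": "pass"},
        "judge_b": {"type": "likert", "score": 4},
    },
    "label": "safe",  # ground-truth label, if the dataset provides one
}

# Comparing how different judges scored the same response:
for judge, judgment in row["judgments"].items():
    print(judge, "->", judgment)
```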
Controls
Query Filter:
Filters the displayed responses according to a specific query. See the Query Language page for details.
Export Button:
Exports the current view as a JSON file.
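Once exported, the JSON file can be inspected or post-processed with standard tooling. A minimal sketch, assuming the export is a JSON array of row records like the one above (the filename, structure, and judge name are assumptions):

```python
import json

# "comparison_export.json" is a placeholder filename; use the path
# of the file you downloaded via the Export button.
with open("comparison_export.json") as f:
    rows = json.load(f)  # assumed to be a list of row records

# Example post-processing: count how many responses "judge_a" passed.
# "judge_a" and the field names are hypothetical.
passed = sum(
    1 for row in rows
    if row.get("judgments", {}).get("judge_a", {}).get("verdict") == "pass"
)
print(f"{passed} of {len(rows)} responses passed judge_a")
```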
Usage
This interface helps you evaluate a model's ability to handle toxic language by:
Comparing multiple runs side by side.
Analyzing judgment consistency across different judges.
Identifying areas requiring improvement for model safety.