Evaluation Metrics

Preset Metrics:

Assessment Process

Interpreting Results

Evaluate agent performance on custom metrics

Assess Agentic Workflows

Collinear AI

Collinear AI offers a solution for continuously improving your model by generating high-quality, safety-aware post-training data using specialized judge models.
The API provides access to Collinear's proprietary post-training technologies:
1. AI Judges for assessing enterprise AI against Safety, Reliability, and Bespoke quality criteria
2. Synthetic data generation and curation capabilities
3. Automated red teaming for a predefined budget and use-case specification

What is Collinear AI

Login

Learn how to set up a Space in Collinear AI to manage your models, datasets, and evaluations effectively.

Create Space

Use the Collinear AI Platform to access your API Key.

Accessing Your API Key

Use the Collinear AI Platform to access your Space ID.

Accessing Space ID

Use the Collinear AI Platform to get your Judge ID.

Accessing Judge ID

Use the Collinear AI Platform to create a new safety evaluation.

Create Safety Evaluation

Use the Collinear AI Platform to create a new reliability evaluation.

Create Reliability Evaluation

The Collinear AI Platform allows you to effortlessly evaluate AI models with flexibility and precision. This guide walks you through the steps to create a new evaluation run using the Flex Evaluation feature.

Create Collinear Flex Evaluation

Use the Collinear AI Platform to compare model and judge performance.

Compare Model and Judge Performance

This documentation provides an overview of the custom query language used to express and evaluate complex queries on structured data. The language supports a variety of conditions, logical operators, and special features for filtering data effectively.

Query Language

Overview of Agentic AI workflow evaluation and metrics.

Overview

Upload agent workflow logs in various formats for analysis

Upload Agent Workflow Logs

Automatically create evaluation datasets from workflow logs

Generate Evaluation Data

Analyze and export agent evaluation results

Review & Export Results

Collinear empowers your business with a dashboard that tracks key metrics to ensure your AI performs safely, accurately, and reliably.

Safety Monitor

Track your AI system's consistency and accuracy with the Collinear AI Reliability Monitoring dashboard.

Reliability Monitor

Curate high-quality synthetic data with flexibility and precision using the Collinear AI Platform. Learn about amplification, supported models, and quality criteria.

Creating a Data Curation Run

A comprehensive overview of models categorized by judge type, including reliability and safety judges.

Types of Judges

You can use the Veritas model to evaluate the factual correctness of model outputs, ensuring that the responses are accurate and free from hallucinated content.

Veritas - Reliability Judge

The Collinear AI Platform allows you to upload datasets in JSON or CSV format.

Uploading Dataset

Use the Collinear AI Platform to generate conversations.

Conversation Builder

Use the Collinear AI API to upload a dataset to your space.

Upload Dataset

Use the Collinear AI API to run a judge on your dataset.

Run Judge on Dataset

Use the Collinear AI API to run a conversation model on your dataset.

Run Conversation Model On Dataset

Use the Collinear AI Platform to export your dataset rows in a JSON file.

Metric	Description	Scale
Goal Completion	Does the agent achieve its purpose?	0-1
Step Efficiency	How close the agent's path is to the optimal path to a solution	1-5
Context Retention	How well the agent maintains conversation memory	1-5
Error Rate	Percentage of unsuccessful steps	%
User Satisfaction	Predicted user experience	1-5
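
To make the scales in the table above concrete, here is a minimal Python sketch that computes an Error Rate from a list of agent steps and checks that reported scores fall within each metric's scale. It is an illustration only, not the Collinear AI implementation: the `Step` record shape, the `error_rate` and `validate_scores` helpers, and the 0-100 range used for the percentage are all assumptions introduced for this example.

```python
from dataclasses import dataclass

# Scales copied from the metrics table. Treating "%" as a 0-100 range
# is an assumption made for this sketch.
METRIC_SCALES = {
    "Goal Completion": (0.0, 1.0),
    "Step Efficiency": (1.0, 5.0),
    "Context Retention": (1.0, 5.0),
    "Error Rate": (0.0, 100.0),
    "User Satisfaction": (1.0, 5.0),
}


@dataclass
class Step:
    """One agent step from a workflow log (hypothetical record shape)."""
    name: str
    succeeded: bool


def error_rate(steps):
    """Return the percentage of steps that were unsuccessful."""
    if not steps:
        return 0.0
    failed = sum(1 for step in steps if not step.succeeded)
    return 100.0 * failed / len(steps)


def validate_scores(scores):
    """Raise ValueError if a metric is unknown or its score is off-scale."""
    for metric, value in scores.items():
        if metric not in METRIC_SCALES:
            raise ValueError(f"unknown metric: {metric}")
        low, high = METRIC_SCALES[metric]
        if not low <= value <= high:
            raise ValueError(f"{metric}={value} is outside {low}-{high}")
    return scores


# Example: four steps, one of which failed.
steps = [
    Step("search", True),
    Step("call_api", False),
    Step("summarize", True),
    Step("reply", True),
]
scores = validate_scores({
    "Goal Completion": 1.0,
    "Error Rate": error_rate(steps),
})
```

With one failed step out of four, `error_rate(steps)` returns 25.0. The judged metrics (Step Efficiency, Context Retention, User Satisfaction) are not computed here because the page does not say how they are scored; the sketch only checks that any value supplied for them sits on the 1-5 scale.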

