✅ What is a Reliability Evaluation?

A Reliability Evaluation measures how consistently and truthfully your model responds across a dataset. Collinear AI runs each sample through a selected reliability judge, which detects hallucinations or factual inconsistencies.

This helps you:

  • Quantify your model’s factual accuracy
  • Identify hallucination-prone outputs
  • Compare performance across different models or prompts
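
Under the hood, the evaluation is conceptually a loop over your dataset: each sample is passed to the judge, and the per-sample verdicts are aggregated into an overall score. The sketch below illustrates the idea in Python; the `Sample` fields, `judge_hallucination`, and the scoring formula are illustrative assumptions, not Collinear AI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str    # the input sent to your model
    response: str  # the model output being judged
    context: str   # grounding text the response should stay faithful to

def judge_hallucination(sample: Sample) -> bool:
    """Hypothetical judge call: returns True if the response contradicts or
    is unsupported by the context. In the product, this role is played by
    the reliability judge you select (e.g. Veritas or Lynx 8B)."""
    raise NotImplementedError

def reliability_score(dataset: list[Sample]) -> float:
    """Fraction of responses the judge considers grounded (non-hallucinated)."""
    hallucinated = sum(judge_hallucination(s) for s in dataset)
    return 1.0 - hallucinated / len(dataset)
```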

🎥 Interactive Walkthrough

Want to see it in action? Follow the guided demo to create your first reliability run.


🚀 Getting Started

After connecting your model or uploading your dataset, you can initiate a reliability evaluation using one of Collinear AI’s reliability judges.
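
If you prefer to script this step rather than use the UI, the flow looks roughly like the following. The endpoint, payload fields, and bearer-token authentication are assumptions for illustration only; they are not Collinear AI's actual API.

```python
import requests

API_KEY = "YOUR_COLLINEAR_API_KEY"    # assumed bearer-token auth
BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint

# Hypothetical request to start a reliability evaluation over an uploaded dataset.
resp = requests.post(
    f"{BASE_URL}/evaluations/reliability",  # illustrative path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "dataset_id": "my-dataset-id",  # the dataset you uploaded or connected
        "judge": "veritas",             # see "Select a Judge" below
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. an evaluation ID to poll for results
```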


🧑‍⚖️ Select a Judge

Choose from the following reliability models:

  1. Lynx 8B – Patronus AI’s off-the-shelf model for hallucination detection.
  2. Veritas Nano – Collinear’s ultra-fast binary classifier for hallucination detection.
  3. Veritas – Collinear’s advanced large model for in-depth hallucination detection.
  4. Prompted Model – Use any custom model with a tailored prompt for flexible evaluation (see the sketch after this list).
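
The first three judges are ready to use as-is. For the Prompted Model option, the idea is to wrap any chat model in a judging prompt. Here is a minimal sketch assuming an OpenAI-compatible client; the prompt wording, model name, and PASS/FAIL protocol are illustrative assumptions, not Collinear AI's built-in template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible model works

JUDGE_PROMPT = (
    "You are a hallucination judge. Given a CONTEXT and a RESPONSE, answer "
    "PASS if every claim in the response is supported by the context, "
    "and FAIL otherwise. Answer with a single word."
)

def prompted_judge(context: str, response: str) -> bool:
    """Returns True if the prompted model flags the response as hallucinated."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute any model you have connected
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nRESPONSE:\n{response}"},
        ],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("FAIL")
```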

🧠 Select a Context Engine

Choose how you’d like to include contextual grounding during evaluation:

Options

  1. Use Context From Dataset – Pulls relevant context directly from your uploaded dataset.

  2. Add Context Engine – Uses a RAG (Retrieval-Augmented Generation) engine to provide additional context.

Required Fields for RAG Integration:

  • Context Engine API Key – Authenticates securely with your context engine.
  • RAG Host – URL of the server powering the RAG service.
  • Index – The index to query for relevant context.
  • Namespace – Logical grouping within the index that avoids identifier conflicts.
  • Top K – Number of top-ranked results to fetch from the index.
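
Taken together, these fields map onto a typical retrieval call: authenticate against the host, query one index (optionally scoped to a namespace), and pull back the top-K matches to ground the judge. A minimal sketch follows; the endpoint path, payload shape, and response format are assumptions, not a specific vendor's API or Collinear AI's schema.

```python
import requests

# Hypothetical configuration mirroring the fields above.
rag_config = {
    "api_key": "YOUR_CONTEXT_ENGINE_API_KEY",  # Context Engine API Key
    "host": "https://rag.example.com",         # RAG Host
    "index": "support-docs",                   # Index to query
    "namespace": "production",                 # Namespace within the index
    "top_k": 5,                                # Top K results to retrieve
}

def fetch_context(query: str, cfg: dict) -> list[str]:
    """Sketch of what the context engine does per sample: retrieve the
    top-K chunks most relevant to the query to ground the judge's verdict."""
    resp = requests.post(
        f"{cfg['host']}/query",  # hypothetical endpoint path
        headers={"Authorization": f"Bearer {cfg['api_key']}"},
        json={
            "index": cfg["index"],
            "namespace": cfg["namespace"],
            "query": query,
            "top_k": cfg["top_k"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [hit["text"] for hit in resp.json()["matches"]]  # assumed response shape
```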