Creating a Data Improvement Run

The Collinear AI Platform makes it effortless to create synthetic datasets tailored to your exact needs. This guide walks you through setting up a Data Improvement Run — from selecting generator models to defining quality criteria.

What is Synthetic Improvement?

Synthetic improvement is the process of automatically generating diverse, high-quality examples to power model training, evaluation, or fine-tuning. With Collinear AI, you can amplify your dataset while maintaining control over quality, diversity, and relevance. Key features:
  • Scale fast: Generate hundreds to thousands of samples with minimal effort.
  • Maintain precision: Define exactly what quality looks like for your use case.
  • Stay flexible: Customize generation settings and quality filters.
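
For orientation, here is a minimal sketch of the settings an improvement run brings together, written as plain Python; the field names and values are illustrative assumptions, not the platform's actual schema or SDK.

    # Illustrative only: field names are assumptions, not the real schema.
    from dataclasses import dataclass, field

    @dataclass
    class ImprovementRunConfig:
        dataset_id: str                # seed dataset to amplify
        generator_model: str           # one of the supported generators
        amplification_factor: int      # synthetic samples per input
        quality_filters: list = field(default_factory=list)  # criteria to enforce

    config = ImprovementRunConfig(
        dataset_id="support-tickets-v1",
        generator_model="GPT-4o",
        amplification_factor=5,
        quality_filters=["correctness", "naturalness", "diversity"],
    )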

Amplification Factor

The Amplification Factor determines how much you expand your data:
  • Higher amplification → More synthetic samples per input.
  • Lower amplification → Fewer, more targeted samples.
This setting allows you to balance quantity and improvement precision based on your project’s needs.
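
A quick back-of-the-envelope example, assuming a seed set of 200 inputs and an example filter pass rate (the real pass rate depends on your quality filters):

    # Rough estimate of output size; the pass rate below is an assumed example.
    seed_examples = 200
    amplification_factor = 5           # synthetic samples generated per input
    assumed_filter_pass_rate = 0.8     # fraction surviving quality filtering

    generated = seed_examples * amplification_factor           # 1000 samples
    expected_kept = int(generated * assumed_filter_pass_rate)  # ~800 samples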

Supported Generator Models

Collinear AI supports the most powerful and versatile models for synthetic data generation:
  • GPT-4o — OpenAI’s flagship model, excelling at reasoning, creativity, and instruction-following.
  • LLaMA 70B — Optimized for balanced performance and efficiency.
  • LLaMA 405B — Massive-scale model offering unparalleled depth and nuance.
  • Qwen 72B — Strong multilingual and instruction-following capabilities, ideal for diverse domains.
Select the model that best matches your data complexity, domain, and scale requirements.
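
If you script your run setup, a simple lookup like the one below can make that choice explicit; the mapping is only a rule of thumb, and the model names are descriptive labels rather than official API identifiers.

    # Illustrative rule of thumb for choosing a generator model.
    generator_by_need = {
        "complex reasoning or creative tasks": "GPT-4o",
        "balanced performance and efficiency": "LLaMA 70B",
        "maximum depth and nuance": "LLaMA 405B",
        "multilingual or diverse domains": "Qwen 72B",
    }
    model = generator_by_need["balanced performance and efficiency"]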

Improvement Quality Criteria

To ensure every piece of generated data meets high standards, Collinear AI allows you to filter outputs based on these key dimensions:
  • Correctness: How accurately outputs match desired criteria and ground truth examples.
  • Naturalness: How well outputs mimic real-world language patterns and tone.
  • Diversity: The range and uniqueness of generated variations, ensuring broad coverage.
  • Coherence: Logical and semantic flow throughout the output.
  • Instruction Following: Adherence to the specified prompts and task instructions.
  • Quality of Reasoning: Depth, logic, and coherence in the explanations and conclusions.
You can apply one or more of these quality filters to align the dataset with your specific needs.
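
As a sketch of how such filtering could be expressed in code, the thresholds and dimension names below are assumptions that mirror the criteria above, not the platform's actual configuration format:

    # Hypothetical filter thresholds on a 0-1 score scale (assumed).
    quality_filters = {
        "correctness":           0.9,   # match to desired criteria and ground truth
        "naturalness":           0.8,   # real-world tone and phrasing
        "diversity":             0.7,   # variation across generated samples
        "coherence":             0.8,   # logical and semantic flow
        "instruction_following": 0.9,   # adherence to the prompt
        "quality_of_reasoning":  0.8,   # depth and logic of explanations
    }

    def passes_filters(scores: dict) -> bool:
        """Keep a sample only if it meets every configured threshold."""
        return all(scores.get(dim, 0.0) >= t for dim, t in quality_filters.items())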

Best Practices for Successful Improvement Runs

  • Mix Quality Filters: Combine multiple improvement criteria for a balanced dataset.
  • Tune Amplification: Start small and adjust amplification as you analyze generation patterns.
  • Spot-Check Outputs: Always manually review a subset of samples to validate automatic improvement.
Following these practices helps you produce high-quality, task-aligned, and scalable synthetic datasets.
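
For the spot-check step, a small helper like the one below, assuming your generated samples are available as a Python list, draws a reproducible random subset for manual review:

    import random

    def spot_check(samples, k=25, seed=42):
        """Draw a small, reproducible subset of generated samples for manual review."""
        rng = random.Random(seed)
        return rng.sample(samples, min(k, len(samples)))

    # review_batch = spot_check(generated_samples)  # inspect these by hand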

Next Steps

Ready to start? Head to your workspace and create your first improvement run today!