🚀 Creating a Data Curation Run

The Collinear AI Platform makes it easy to create synthetic datasets tailored to your needs. This guide walks you through setting up a Data Curation Run, from selecting generator models to defining quality criteria.

🎥 Interactive Walkthrough

Want to see it in action? Follow this guided demo to create your curation run:


🧪 What is Synthetic Curation?

Synthetic curation is the process of automatically generating diverse, high-quality examples to power model training, evaluation, or fine-tuning. With Collinear AI, you can amplify your dataset while maintaining control over quality, diversity, and relevance.

Key features:

  • Scale fast: Generate hundreds to thousands of samples with minimal effort.
  • Maintain precision: Define exactly what quality looks like for your use case.
  • Stay flexible: Customize generation settings and quality filters.

🔥 Amplification Factor

The Amplification Factor determines how much you expand your data:

  • Higher amplification → More synthetic samples per input.
  • Lower amplification → Fewer, more targeted samples.

This setting allows you to balance quantity and curation precision based on your project's needs.
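As a rough illustration of how amplification affects dataset size, the sketch below assumes the Amplification Factor is an integer number of synthetic samples generated per input. That interpretation, and the function name `estimate_output_size`, are assumptions for illustration and not part of the Collinear AI Platform; the count is also an upper bound, since quality filtering may discard some samples.

```python
def estimate_output_size(num_inputs: int, amplification_factor: int) -> int:
    """Estimate how many candidate samples a run produces before filtering.

    Assumes (hypothetically) that the factor means "samples per input".
    """
    if num_inputs < 0:
        raise ValueError("num_inputs must be non-negative")
    if amplification_factor < 1:
        raise ValueError("amplification_factor must be at least 1")
    return num_inputs * amplification_factor


# Compare a few settings for a seed set of 200 inputs.
for factor in (2, 5, 10):
    print(f"factor={factor}: up to {estimate_output_size(200, factor)} candidates")
```

Estimating the candidate count before launching a run can help when choosing between a small exploratory run and a larger one.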


🧠 Supported Generator Models

Collinear AI supports the following generator models for synthetic data generation:

  • GPT-4o: OpenAI's flagship model, excelling at reasoning, creativity, and instruction-following.
  • LLaMA 70B: Optimized for balanced performance and efficiency.
  • LLaMA 405B: Massive-scale model offering unparalleled depth and nuance.
  • Qwen 72B: Strong multilingual and instruction-following capabilities, ideal for diverse domains.

Select the model that best matches your data complexity, domain, and scale requirements.


πŸ› οΈ Curation Quality Criteria

To ensure every piece of generated data meets high standards, Collinear AI allows you to curate outputs based on these key dimensions:

  • Correctness: How accurately outputs match desired criteria and ground truth examples.
  • Naturalness: How well outputs mimic real-world language patterns and tone.
  • Diversity: The range and uniqueness of generated variations, ensuring broad coverage.
  • Coherence: Logical and semantic flow throughout the output.
  • Instruction Following: Adherence to the specified prompts and task instructions.
  • Quality of Reasoning: Depth, logic, and coherence in the explanations and conclusions.

You can apply one or multiple curation filters to align the dataset with your specific needs.
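Conceptually, combining several curation filters means keeping only the samples that clear every selected criterion. The sketch below is a minimal illustration under stated assumptions: the score scale (0 to 1), the criterion keys, the thresholds, and the sample records are all hypothetical and do not reflect how the platform computes or stores scores.

```python
from typing import Dict, List

# Hypothetical minimum scores (0 to 1) for the criteria chosen for this run.
THRESHOLDS: Dict[str, float] = {
    "correctness": 0.8,
    "coherence": 0.7,
    "instruction_following": 0.75,
}


def passes_filters(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every selected criterion meets its threshold.

    A criterion missing from `scores` is treated as a failure (score 0.0).
    """
    return all(scores.get(name, 0.0) >= minimum for name, minimum in thresholds.items())


# Made-up example records, purely for illustration.
samples: List[dict] = [
    {"id": "a", "scores": {"correctness": 0.9, "coherence": 0.85, "instruction_following": 0.8}},
    {"id": "b", "scores": {"correctness": 0.95, "coherence": 0.6, "instruction_following": 0.9}},
    {"id": "c", "scores": {"correctness": 0.7, "coherence": 0.9, "instruction_following": 0.9}},
]

kept = [s for s in samples if passes_filters(s["scores"], THRESHOLDS)]
print([s["id"] for s in kept])
```

Requiring all criteria to pass is the strictest way to combine filters; a weighted average would be a looser alternative.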


✅ Best Practices for Successful Curation Runs

  • Mix Quality Filters: Combine multiple curation criteria for a balanced dataset.
  • Tune Amplification: Start small and adjust amplification as you analyze generation patterns.
  • Spot-Check Outputs: Always manually review a subset of samples to validate automatic curation.

Following these practices helps you produce high-quality, task-aligned, and scalable synthetic datasets.
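For the spot-check practice, one simple approach is to draw a reproducible random subset of the generated samples for manual review. The sketch below assumes the run's outputs have been exported to a Python list; the function name `spot_check_sample` and the default review fraction and minimum are illustrative choices, not platform settings.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def spot_check_sample(
    samples: Sequence[T],
    fraction: float = 0.05,
    minimum: int = 10,
    seed: int = 0,
) -> List[T]:
    """Pick a random subset of samples for manual review.

    Reviews `fraction` of the samples, but at least `minimum` items
    (or all of them if fewer are available). A fixed seed keeps the
    selection reproducible across reviewers.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = max(minimum, round(len(samples) * fraction))
    k = min(len(samples), target)
    rng = random.Random(seed)
    return rng.sample(list(samples), k)


# Placeholder data standing in for exported run outputs.
generated = [f"sample-{i}" for i in range(1000)]
review_batch = spot_check_sample(generated)
print(len(review_batch))
```

Sampling at random, and not just reading the first few rows, reduces the chance that ordering effects hide systematic problems.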


🎯 Next Steps

Ready to start? Head to your workspace and create your first curation run today!