Creating a Data Improvement Run

The Collinear AI Platform makes it effortless to create synthetic datasets tailored to your exact needs. This guide walks you through setting up a Data Improvement Run — from selecting generator models to defining quality criteria.

What is Synthetic Improvement?

Synthetic improvement is the process of automatically generating diverse, high-quality examples to power model training, evaluation, or fine-tuning. With Collinear AI, you can amplify your dataset while maintaining control over quality, diversity, and relevance. Key features:
  • Scale fast: Generate hundreds to thousands of samples with minimal effort.
  • Maintain precision: Define exactly what quality looks like for your use case.
  • Stay flexible: Customize generation settings and quality filters.
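
For orientation, here is a minimal sketch of the settings an improvement run brings together, written as plain Python; the field names and values are illustrative assumptions, not the platform's actual schema or SDK.

    # Illustrative only: field names are assumptions, not the real schema.
    from dataclasses import dataclass, field

    @dataclass
    class ImprovementRunConfig:
        dataset_id: str                # seed dataset to amplify
        generator_model: str           # one of the supported generators
        amplification_factor: int      # synthetic samples per input
        quality_filters: list = field(default_factory=list)  # criteria to enforce

    config = ImprovementRunConfig(
        dataset_id="support-tickets-v1",
        generator_model="GPT-4o",
        amplification_factor=5,
        quality_filters=["correctness", "naturalness", "diversity"],
    )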

Amplification Factor

The Amplification Factor determines how much you expand your data:
  • Higher amplification → More synthetic samples per input.
  • Lower amplification → Fewer, more targeted samples.
This setting allows you to balance quantity and improvement precision based on your project’s needs.
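
A quick back-of-the-envelope example, assuming a seed set of 200 inputs and an example filter pass rate (the real pass rate depends on your quality filters):

    # Rough estimate of output size; the pass rate below is an assumed example.
    seed_examples = 200
    amplification_factor = 5           # synthetic samples generated per input
    assumed_filter_pass_rate = 0.8     # fraction surviving quality filtering

    generated = seed_examples * amplification_factor           # 1000 samples
    expected_kept = int(generated * assumed_filter_pass_rate)  # ~800 samples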

Supported Generator Models

Collinear AI supports the most powerful and versatile models for synthetic data generation:
  • GPT-4o — OpenAI’s flagship model, excelling at reasoning, creativity, and instruction-following.
  • LLaMA 70B — Optimized for balanced performance and efficiency.
  • LLaMA 405B — Massive-scale model offering unparalleled depth and nuance.
  • Qwen 72B — Strong multilingual and instruction-following capabilities, ideal for diverse domains.
Select the model that best matches your data complexity, domain, and scale requirements.
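
If you script your run setup, a simple lookup like the one below can make that choice explicit; the mapping is only a rule of thumb, and the model names are descriptive labels rather than official API identifiers.

    # Illustrative rule of thumb for choosing a generator model.
    generator_by_need = {
        "complex reasoning or creative tasks": "GPT-4o",
        "balanced performance and efficiency": "LLaMA 70B",
        "maximum depth and nuance": "LLaMA 405B",
        "multilingual or diverse domains": "Qwen 72B",
    }
    model = generator_by_need["balanced performance and efficiency"]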

Improvement Quality Criteria

To ensure every piece of generated data meets high standards, Collinear AI allows you to filter outputs based on these key dimensions:
  • Correctness: How accurately outputs match desired criteria and ground truth examples.
  • Naturalness: How well outputs mimic real-world language patterns and tone.
  • Diversity: The range and uniqueness of generated variations, ensuring broad coverage.
  • Coherence: Logical and semantic flow throughout the output.
  • Instruction Following: Adherence to the specified prompts and task instructions.
  • Quality of Reasoning: Depth, logic, and coherence in the explanations and conclusions.
You can apply one or more of these quality filters to align the dataset with your specific needs.
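
As a sketch of how such filtering could be expressed in code, the thresholds and dimension names below are assumptions that mirror the criteria above, not the platform's actual configuration format:

    # Hypothetical filter thresholds on a 0-1 score scale (assumed).
    quality_filters = {
        "correctness":           0.9,   # match to desired criteria and ground truth
        "naturalness":           0.8,   # real-world tone and phrasing
        "diversity":             0.7,   # variation across generated samples
        "coherence":             0.8,   # logical and semantic flow
        "instruction_following": 0.9,   # adherence to the prompt
        "quality_of_reasoning":  0.8,   # depth and logic of explanations
    }

    def passes_filters(scores: dict) -> bool:
        """Keep a sample only if it meets every configured threshold."""
        return all(scores.get(dim, 0.0) >= t for dim, t in quality_filters.items())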

Best Practices for Successful Improvement Runs

  • Mix Quality Filters: Combine multiple improvement criteria for a balanced dataset.
  • Tune Amplification: Start small and adjust amplification as you analyze generation patterns.
  • Spot-Check Outputs: Always manually review a subset of samples to validate automatic improvement.
Following these practices helps you produce high-quality, task-aligned, and scalable synthetic datasets.
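
For the spot-check step, a small helper like the one below, assuming your generated samples are available as a Python list, draws a reproducible random subset for manual review:

    import random

    def spot_check(samples, k=25, seed=42):
        """Draw a small, reproducible subset of generated samples for manual review."""
        rng = random.Random(seed)
        return rng.sample(samples, min(k, len(samples)))

    # review_batch = spot_check(generated_samples)  # inspect these by hand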

Next Steps

Ready to start? Head to your workspace and create your first improvement run today!