Creating a Data Curation Run
Curate high-quality synthetic data with flexibility and precision using the Collinear AI Platform. Learn about amplification, supported models, and quality criteria.
The Collinear AI Platform makes it effortless to create synthetic datasets tailored to your exact needs. This guide walks you through setting up a Data Curation Run, from selecting generator models to defining quality criteria.
What is Synthetic Curation?
Synthetic curation is the process of automatically generating diverse, high-quality examples to power model training, evaluation, or fine-tuning. With Collinear AI, you can amplify your dataset while maintaining control over quality, diversity, and relevance.
Key features:
- Scale fast: Generate hundreds to thousands of samples with minimal effort.
- Maintain precision: Define exactly what quality looks like for your use case.
- Stay flexible: Customize generation settings and quality filters.
Amplification Factor
The Amplification Factor determines how much you expand your data:
- Higher amplification: More synthetic samples per input.
- Lower amplification: Fewer, more targeted samples.
This setting allows you to balance quantity and curation precision based on your project's needs.
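As a rough illustration of this trade-off, the sketch below estimates how large a run's output would be. It assumes the amplification factor behaves as a simple per-input multiplier, which is an assumption made for illustration and not documented platform behavior. The function name `estimate_generated_samples` is hypothetical and is not part of any Collinear AI API.

```python
def estimate_generated_samples(num_inputs: int, amplification_factor: int) -> int:
    """Estimate the number of synthetic samples a run would produce.

    Assumption (not from the platform docs): each input example yields
    `amplification_factor` synthetic samples before any quality filtering.
    """
    if num_inputs < 0:
        raise ValueError("num_inputs must be non-negative")
    if amplification_factor < 1:
        raise ValueError("amplification_factor must be at least 1")
    return num_inputs * amplification_factor


# Under this assumption, 200 seed examples with a factor of 5
# would yield 1000 candidates before curation filters are applied.
candidates = estimate_generated_samples(200, 5)
print(candidates)
```

Because quality filters discard some candidates, the curated dataset will usually be smaller than this estimate.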
Supported Generator Models
Collinear AI supports the most powerful and versatile models for synthetic data generation:
- GPT-4o: OpenAI's flagship model, excelling at reasoning, creativity, and instruction-following.
- LLaMA 70B: Optimized for balanced performance and efficiency.
- LLaMA 405B: Massive-scale model offering unparalleled depth and nuance.
- Qwen 72B: Strong multilingual and instruction-following capabilities, ideal for diverse domains.
Select the model that best matches your data complexity, domain, and scale requirements.
Curation Quality Criteria
To ensure every piece of generated data meets high standards, Collinear AI allows you to curate outputs based on these key dimensions:
- Correctness: How accurately outputs match desired criteria and ground truth examples.
- Naturalness: How well outputs mimic real-world language patterns and tone.
- Diversity: The range and uniqueness of generated variations, ensuring broad coverage.
- Coherence: Logical and semantic flow throughout the output.
- Instruction Following: Adherence to the specified prompts and task instructions.
- Quality of Reasoning: Depth, logic, and coherence in the explanations and conclusions.
You can apply one or multiple curation filters to align the dataset with your specific needs.
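To make the idea of combining filters concrete, here is a minimal sketch of threshold-based filtering. It assumes each generated sample carries a per-criterion score between 0 and 1; the score scale, the `scores` field, and the function names are all hypothetical and do not reflect how the platform stores or computes its judgments.

```python
from typing import Dict, List, TypedDict


class ScoredSample(TypedDict):
    text: str
    scores: Dict[str, float]  # hypothetical per-criterion scores in [0, 1]


def passes_filters(sample: ScoredSample, thresholds: Dict[str, float]) -> bool:
    """Return True only if the sample meets every configured threshold.

    A criterion missing from the sample's scores counts as a failure.
    """
    return all(
        sample["scores"].get(criterion, 0.0) >= minimum
        for criterion, minimum in thresholds.items()
    )


def curate(samples: List[ScoredSample], thresholds: Dict[str, float]) -> List[ScoredSample]:
    """Keep the samples that satisfy all selected quality criteria."""
    return [s for s in samples if passes_filters(s, thresholds)]


samples: List[ScoredSample] = [
    {"text": "A", "scores": {"correctness": 0.9, "coherence": 0.8}},
    {"text": "B", "scores": {"correctness": 0.95, "coherence": 0.4}},
    {"text": "C", "scores": {"correctness": 0.5, "coherence": 0.9}},
]

# Applying two filters together keeps only sample "A".
kept = curate(samples, {"correctness": 0.8, "coherence": 0.7})
print([s["text"] for s in kept])
```

Requiring every threshold to pass is the strictest way to combine criteria; each filter you add can only shrink the resulting dataset.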
Best Practices for Successful Curation Runs
- Mix Quality Filters: Combine multiple curation criteria for a balanced dataset.
- Tune Amplification: Start small and adjust amplification as you analyze generation patterns.
- Spot-Check Outputs: Always manually review a subset of samples to validate automatic curation.
Following these practices helps you produce high-quality, task-aligned synthetic datasets at scale.
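For the spot-check practice, one simple approach is to draw a reproducible random subset of the curated output for manual review. The sketch below assumes you have exported the curated samples into a Python list; the function name `pick_spot_check_sample` and the default 5% fraction are illustrative choices, not platform settings.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_spot_check_sample(
    samples: Sequence[T],
    fraction: float = 0.05,
    minimum: int = 10,
    seed: int = 0,
) -> List[T]:
    """Select a random subset of samples for manual review.

    Draws about `fraction` of the samples, but never fewer than `minimum`
    (or the whole dataset when it is smaller than `minimum`). A fixed seed
    makes the selection repeatable across reviewers.
    """
    if not samples:
        return []
    target = max(minimum, round(len(samples) * fraction))
    k = min(len(samples), target)
    rng = random.Random(seed)
    return rng.sample(list(samples), k)


review_batch = pick_spot_check_sample([f"sample-{i}" for i in range(1000)])
print(len(review_batch))
```

Sampling uniformly at random avoids reviewing only the first or highest-scoring outputs, which could hide systematic failures in the automatic filters.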
Next Steps
Ready to start? Head to your workspace and create your first curation run today!