Install the CLI and run your first evaluation against a simulated environment.

Prerequisites

  • Python 3.13
  • A Collinear API key from platform.collinear.ai (Developers → API Keys)
  • An API key for any LiteLLM-supported model provider (OpenAI, Anthropic, Google, etc.)
  • One of the following for running environments:
    • A Daytona API key — for fast, ephemeral remote sandboxes (recommended)
    • Docker Desktop (or Docker Engine with Compose) — for local execution

Installation

uv tool install --python 3.13 "simulationlab[daytona]"
The PyPI package is named simulationlab. The installed CLI command is simlab.
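After installation, it is worth confirming that `uv` actually linked the `simlab` entry point onto your PATH. A minimal, side-effect-free check (plain standard library; `simlab` here is just the expected command name, and this helper is not part of the SimLab tooling):

```python
import shutil

def cli_available(name: str = "simlab") -> bool:
    """Return True if `name` resolves to an executable on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    print("simlab on PATH:", cli_available())
```

If this prints `False`, make sure `uv`'s tool bin directory (typically `~/.local/bin`) is on your PATH.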

Authentication

Log in with your Collinear API key:
simlab auth login
This saves your key to ~/.config/simlab/config.toml. Then export your model provider key:
# Use whichever provider you prefer — SimLab uses LiteLLM under the hood.
export SIMLAB_AGENT_API_KEY="your-api-key"

# Optional: export Daytona key if using remote sandboxes
export DAYTONA_API_KEY="dtn_..."

Supported providers

SimLab supports any LiteLLM-compatible provider. Here are common examples:
Provider    Model format                         SIMLAB_AGENT_API_KEY     Verifier provider value
OpenAI      gpt-4o                               Your OpenAI API key      openai
Anthropic   anthropic/claude-sonnet-4-20250514   Your Anthropic API key   anthropic
Google      gemini/gemini-2.5-pro                Your Google AI API key   gemini
The model format follows LiteLLM conventions: <provider>/<model_name>. OpenAI models don’t require the provider prefix since it’s the default. Full example using Anthropic:
export SIMLAB_AGENT_API_KEY="sk-ant-..."

simlab tasks run --env my-env \
  --task hr__0_weaver_flag_biased_compensation_adjustment_request \
  --agent-model anthropic/claude-sonnet-4-20250514 \
  --agent-api-key "$SIMLAB_AGENT_API_KEY"
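The `<provider>/<model_name>` convention above can be sketched as a tiny parser. The openai default for bare names follows the LiteLLM behavior just described; the helper itself is illustrative, not part of the SimLab or LiteLLM API:

```python
def split_model(model: str, default_provider: str = "openai") -> tuple[str, str]:
    """Split a LiteLLM-style model string into (provider, model_name)."""
    provider, sep, name = model.partition("/")
    if not sep:
        # No prefix: bare model names default to OpenAI.
        return default_provider, model
    return provider, name

print(split_model("gpt-4o"))                              # ('openai', 'gpt-4o')
print(split_model("anthropic/claude-sonnet-4-20250514"))  # ('anthropic', 'claude-sonnet-4-20250514')
print(split_model("gemini/gemini-2.5-pro"))               # ('gemini', 'gemini-2.5-pro')
```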

Starting an environment

Initialize an environment from a template and start it:
# Initialize an HR-based scenario environment
simlab env init my-env --template hr
To see all available templates: simlab templates list

Choosing a task

Tasks are organized by the scenario template associated with your environment.
# List tasks for your environment's template
simlab tasks list --env my-env
If you generated tasks locally (via tasks-gen), browse them directly:
simlab tasks list --tasks-dir ./generated-tasks
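Judging from the task IDs shown in this guide (e.g. `hr__0_weaver_flag_biased_compensation_adjustment_request`), the template name appears to be encoded before the `__` separator. A small sketch based on that inferred layout, which is not a documented contract:

```python
def task_template(task_id: str) -> str:
    """Extract the template prefix from a task ID.

    The '<template>__<slug>' layout is inferred from the IDs shown in
    this guide, not a documented contract.
    """
    template, _, _ = task_id.partition("__")
    return template

print(task_template("hr__0_weaver_flag_biased_compensation_adjustment_request"))  # hr
```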

Running a rollout

The primary command is simlab tasks run. It automatically starts the environment, seeds data, runs the agent, verifies the result, and tears down when done. With Daytona (recommended — fast, ephemeral remote sandboxes):
simlab tasks run --env my-env \
  --task hr__0_weaver_flag_biased_compensation_adjustment_request \
  --daytona \
  --agent-model <model> \
  --agent-api-key "$SIMLAB_AGENT_API_KEY"
Without Daytona (runs locally via Docker — first run may be slow while images pull):
simlab tasks run --env my-env \
  --task hr__0_weaver_flag_biased_compensation_adjustment_request \
  --agent-model <model> \
  --agent-api-key "$SIMLAB_AGENT_API_KEY"
Use any LiteLLM-supported model for --agent-model (e.g. gpt-4o, anthropic/claude-sonnet-4-20250514, gemini/gemini-2.5-pro). You can also run tasks with your own agent implementation instead of the built-in one. See Bring Your Own Agent for the full interface and setup.
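If you script rollouts (for example, sweeping several tasks), the same invocation can be assembled programmatically. A minimal sketch using only the flags shown above; the actual `subprocess.run` call is left commented so the snippet stays side-effect-free:

```python
import os

def build_run_cmd(env: str, task: str, model: str, daytona: bool = True) -> list[str]:
    """Assemble the `simlab tasks run` argv used earlier in this guide."""
    cmd = [
        "simlab", "tasks", "run",
        "--env", env,
        "--task", task,
        "--agent-model", model,
        "--agent-api-key", os.environ.get("SIMLAB_AGENT_API_KEY", ""),
    ]
    if daytona:
        cmd.append("--daytona")
    return cmd

cmd = build_run_cmd(
    "my-env",
    "hr__0_weaver_flag_biased_compensation_adjustment_request",
    "anthropic/claude-sonnet-4-20250514",
)
print(" ".join(cmd))
# import subprocess; subprocess.run(cmd, check=True)  # uncomment to actually run
```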

Viewing results

Results are saved to output/agent_run_<task_id>_<timestamp>/:
  • artifacts.json — full rollout trace (messages, tool calls, observations)
  • verifier/reward.txt — 1 (pass) or 0 (fail)
  • verifier/reward.json — e.g. {"reward": 1.0}
For more detail, see Understanding Results.
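A short sketch of consuming those files programmatically. It fabricates a run directory matching the layout above so it is runnable anywhere; the file contents are illustrative:

```python
import json
import tempfile
from pathlib import Path

# Build a fake run directory matching output/agent_run_<task_id>_<timestamp>/
root = Path(tempfile.mkdtemp()) / "agent_run_hr__0_demo_20250101T000000"
(root / "verifier").mkdir(parents=True)
(root / "artifacts.json").write_text(json.dumps({"messages": []}))
(root / "verifier" / "reward.txt").write_text("1")
(root / "verifier" / "reward.json").write_text(json.dumps({"reward": 1.0}))

# Read the verifier outputs back, as you would for a real run.
reward = json.loads((root / "verifier" / "reward.json").read_text())["reward"]
passed = (root / "verifier" / "reward.txt").read_text().strip() == "1"
print(f"reward={reward} passed={passed}")  # reward=1.0 passed=True
```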

Configuring verifiers

Generated tasks use rubric-based verifiers that need a model to score results. Configure the verifier before running generated tasks:
export SIMLAB_VERIFIER_MODEL="<provider>/<model>"    # e.g. gpt-4o, anthropic/claude-sonnet-4-20250514
export SIMLAB_VERIFIER_PROVIDER="<provider>"          # e.g. openai, anthropic, gemini
export SIMLAB_VERIFIER_API_KEY="your-api-key"
Or in config.toml:
[verifier]
model = "<provider>/<model>"
provider = "<provider>"
api_key = "your-api-key"
Built-in tasks use programmatic verifiers and don’t require this setup. This is only needed for tasks you generate via tasks-gen.