Setup guide

From install to first regression test in five minutes.

SafeShip drops into your agent code via a 4-line SDK. Below is everything you need to ship your first trace, see it on the dashboard, and accept your first auto-generated regression test.

Before you start

A SafeShip account — create one at safeship.dev/sign-up and start your 7-day free trial (see below)
Your API key (looks like sk_live_…) — find it on the Setup page inside the app
A Python 3.9+ project where your AI agent runs

Billing & free trial

SafeShip is $29.99 / month, flat, no seats. There's one plan and a 7-day free trial.

Card required upfront. Stripe holds your card but doesn't charge for 7 days.
Cancel before day 7 = $0 charged. Cancel anytime from the customer portal — no email, no retention loop.
After 7 days, your card is auto-charged $29.99/mo. Cancel anytime — you keep access until the current period ends. No refunds for partial months.
Manage your subscription at /app/billing — update card, see invoices, or cancel from the Stripe customer portal.

You can't access /app/* until your card is on file. We don't take payment details by phone or email — only through the Stripe checkout link inside the app.

1 — Install the SDK

During beta, install directly from our GitHub:

pip install "git+https://github.com/ego-debug/SafeShip.git#subdirectory=sdks/python"

We'll publish the SDK to PyPI (installable with pip install safeship) once it is stable. For now, the GitHub install is the recommended path.

2 — Initialize and wrap your agent

In your agent code, call safeship.init() once at startup, then wrap your agent callable with safeship.wrap():

import safeship

safeship.init(api_key="sk_live_...")  # paste your key here
agent = safeship.wrap(my_agent)        # wraps any callable

# now call your agent normally — every run ships a trace
result = agent("user message here")

That's it. Every call to agent(...) ships a trace to your dashboard from a background daemon thread — never blocks your code, never crashes your agent if our ingest is down.
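To make the non-blocking claim concrete, here is a minimal sketch of the general fire-and-forget pattern: a queue drained by a daemon thread, with every transport error swallowed. It illustrates the technique only and is not SafeShip's actual implementation. BackgroundShipper, send, ship, and flush are hypothetical names.

```python
import queue
import threading


class BackgroundShipper:
    """Illustrative fire-and-forget uploader (hypothetical, not the SDK's code)."""

    def __init__(self, send, max_pending=1000):
        self._send = send                          # callable that uploads one trace
        self._pending = queue.Queue(maxsize=max_pending)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def ship(self, trace):
        """Enqueue a trace and return immediately; drop it if the queue is full."""
        try:
            self._pending.put_nowait(trace)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            trace = self._pending.get()
            try:
                self._send(trace)
            except Exception:
                pass                               # a transport error never escapes
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until everything queued so far has been attempted (handy in tests)."""
        self._pending.join()
```

Because the worker is a daemon thread, it does not keep the process alive at exit, which is why a sketch like this needs an explicit flush() if you want to be sure short-lived scripts upload everything.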

3 — (Optional) Record sub-steps

By default, each wrapped call produces one trace with one step (the agent itself). To get richer step-by-step traces, drop safeship.step(...) calls inside your agent:

import safeship

# classify, lookup_order, and draft_reply stand in for your own functions;
# the duration_ms values are hard-coded here for illustration.
def my_agent(message: str) -> str:
    intent = classify(message)
    safeship.step(tool_name="classify_intent", kind="llm",
                  input=message, output=intent,
                  duration_ms=140, status="ok")

    order = lookup_order(intent)
    safeship.step(tool_name="lookup_order", kind="tool",
                  input=intent, output=order,
                  duration_ms=320, status="ok")

    return draft_reply(order)

Each step shows up as a row in the Trace Detail timeline. Set status="fail" on the step that broke and SafeShip's auto-suggest engine will write a regression test targeting it.
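In real code you will usually want measured durations and an automatic status="fail" rather than hand-written values. One possible way to get both is a small helper like the sketch below. timed_step and record are hypothetical names: record is any callable that accepts the keyword arguments shown in the example above, so in your agent you would pass safeship.step. Reporting repr(exc) as the output of a failed step is an assumption of this sketch, not documented SDK behavior.

```python
import time


def timed_step(record, tool_name, kind, func, step_input):
    """Run func(step_input) and report it through `record` with a measured duration."""
    start = time.perf_counter()
    try:
        output = func(step_input)
    except Exception as exc:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        # Assumption: the exception text is a useful "output" for a failed step.
        record(tool_name=tool_name, kind=kind, input=step_input,
               output=repr(exc), duration_ms=elapsed_ms, status="fail")
        raise                                      # the caller still sees the error
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    record(tool_name=tool_name, kind=kind, input=step_input,
           output=output, duration_ms=elapsed_ms, status="ok")
    return output
```

Usage would look like order = timed_step(safeship.step, "lookup_order", "tool", lookup_order, intent).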

4 — See your traces and accept tests

Open your dashboard — every run your agent does appears in the "Recent runs" panel within seconds.
Click View trace → on any run to see the step-by-step timeline.
On a failed run, click ✓ Suggest a regression test at the bottom of the page. Claude reads the trace and proposes a YAML assertion that would have caught the failure.
Review the suggestion at /app/suggestions. Press Y to accept (lands in your regression suite) or N to skip.

5 — Block bad deploys (GitHub Action)

SafeShip ships a GitHub Action that runs on every PR. By default (mode: auto) it picks one of two strategies based on whether you have a safeship.yaml at your repo root.

Test mode (recommended)

Every accepted regression test is replayed against your new code in CI. If any test would reproduce a previously-caught failure, the PR fails. Add a safeship.yaml at the repo root pointing at your agent entry point:

# safeship.yaml — declare which function the test runner should call
agent: src.my_agent:run

Then add the Action to your workflow:

# .github/workflows/safeship.yml
name: SafeShip
on: pull_request
permissions:
  contents: read
  pull-requests: write    # optional, enables the inline PR comment
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .       # install your agent's deps
      - uses: ego-debug/SafeShip/.github/actions/safeship@main
        with:
          api-key: ${{ secrets.SAFESHIP_API_KEY }}

Your workflow is responsible for setting up Python and installing your agent's dependencies; the SafeShip action installs only the SafeShip SDK on top, then runs safeship test. Each replayed test re-invokes your agent with real LLM calls, so budget roughly $0.05–$0.30 per accepted test per PR run, depending on how chatty your agent is. Each test runs in its own isolated, parallel-safe step and can be cancelled individually, so a flake in one test doesn't cascade into the others.
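As a rough worked example of that budget, the sketch below multiplies the quoted per-test range by the number of replays. The 20 accepted tests and 30 PR runs per month are made-up numbers, and estimate_replay_cost is a hypothetical helper, not part of the SDK.

```python
def estimate_replay_cost(accepted_tests, pr_runs):
    """Return (low, high) spend in dollars using the $0.05-$0.30 per-test range."""
    low_cents, high_cents = 5, 30                  # per accepted test per PR run
    replays = accepted_tests * pr_runs
    return replays * low_cents / 100, replays * high_cents / 100


low, high = estimate_replay_cost(accepted_tests=20, pr_runs=30)   # 600 replays
print(f"${low:.2f} to ${high:.2f} per month")      # prints: $30.00 to $180.00 per month
```

This LLM spend goes to your model provider and is separate from the SafeShip subscription.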

Score-gate mode (simpler, ambient)

If you'd rather monitor average production quality than replay specific failures pre-deploy, omit the safeship.yaml (or set mode: score-gate explicitly). The Action then calls SafeShip's /v1/runs/check endpoint and fails the PR if your latest production run scored below the threshold:

- uses: ego-debug/SafeShip/.github/actions/safeship@main
  with:
    api-key: ${{ secrets.SAFESHIP_API_KEY }}
    mode: score-gate
    min-score: 80

Score-gate mode is lighter to set up, but it catches regressions only after they ship to production; test mode catches them before they merge. You can keep score-gate as a fallback signal even if you adopt test mode later.

Store your sk_live_… key as a repo secret named SAFESHIP_API_KEY. Full input reference: Action README →

How replay works

When SafeShip generates a regression test from a failing trace, it also remembers the exact input the agent was called with when the failure occurred. In CI, the test runner replays that input through your new code and evaluates the test's assertion (e.g. output contains lookup_order.output.total) against the new trace. Three outcomes:

passed — the assertion held; this regression won't recur with your new code.
failed — the assertion was violated; your PR would reproduce the original failure. The PR check fails and the inline comment shows which test broke and why.
skipped — no step in the new trace matched the test's when: clause. The agent likely routed differently for this input; non-blocking.
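The three outcomes can be sketched as a small classification function. The data shapes here are hypothetical and much simpler than a real SafeShip test: test carries a when tool name and a field to look for, and trace carries a list of steps plus the agent's final output. The sketch mirrors the example assertion "output contains lookup_order.output.total".

```python
def evaluate(test, trace):
    """Classify one replayed test as 'passed', 'failed', or 'skipped'."""
    matching = [s for s in trace["steps"] if s["tool_name"] == test["when"]]
    if not matching:
        return "skipped"                           # the agent routed differently
    expected = str(matching[-1]["output"][test["field"]])
    return "passed" if expected in trace["output"] else "failed"


test = {"when": "lookup_order", "field": "total"}
trace = {
    "steps": [{"tool_name": "lookup_order", "output": {"total": "42.50"}}],
    "output": "Your order total is 42.50.",
}
print(evaluate(test, trace))                       # prints: passed
```

Only "failed" should block a PR in this model; "skipped" means the assertion had nothing to check.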

Replays run your code with your credentials inside your CI environment. SafeShip's servers never execute your agent — they only generate the YAML assertions and serve them via the manifest API.

Making your agent deterministic-friendly

Replay assumes the same input produces a comparable output. If your agent is highly non-deterministic, tests may pass or fail differently between runs even when your code is unchanged. A few things help:

Set temperature=0 on LLM calls in CI, or whatever the equivalent "most deterministic" knob is for the model you're using.
Pin the seed parameter where the provider supports it.
Mock or stub time-dependent inputs (datetime.now(), randomness sources, external clocks) when running under the test runner. The simplest way is to gate them on an env var: SAFESHIP_RUN_MODE=test is set by the runner automatically.
Re-accept the suggestion (or skip it and let SafeShip generate a fresh one) when you've materially rewritten the prompt — the old fixture may no longer reproduce the failure even on broken code.
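A minimal sketch of the env-var gate mentioned in the list above. SAFESHIP_RUN_MODE comes from this guide; the helper names, the frozen timestamp, and the 0.7 default temperature are placeholders to adapt to your own agent.

```python
import os
from datetime import datetime, timezone

FROZEN_NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)   # arbitrary fixed instant


def in_replay():
    """True when the SafeShip test runner has set SAFESHIP_RUN_MODE=test."""
    return os.environ.get("SAFESHIP_RUN_MODE") == "test"


def current_time():
    """Return a frozen clock during replay and the real clock otherwise."""
    return FROZEN_NOW if in_replay() else datetime.now(timezone.utc)


def llm_temperature(default=0.7):
    """Use temperature 0 during replay; `default` is a placeholder value."""
    return 0.0 if in_replay() else default
```

Routing every clock read and temperature choice through helpers like these keeps the test-only behavior in one place instead of scattering environment checks across the agent.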

Determinism is your agent's property, not ours. We surface flaky behavior so you can decide whether to tighten the assertion, lower the temperature, or accept the noise.

Async agents

safeship.wrap() detects coroutine functions automatically — the wrapped callable stays awaitable:

import asyncio
import safeship

safeship.init(api_key="sk_live_...")

async def my_agent(prompt):
    ...

agent = safeship.wrap(my_agent)
asyncio.run(agent("hello"))
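For intuition, a wrapper can stay awaitable by checking inspect.iscoroutinefunction and returning an async wrapper when the target is a coroutine function. The sketch below shows that general technique under the hypothetical names wrap_sketch and on_result; it is not SafeShip's actual wrap() code.

```python
import asyncio
import functools
import inspect


def wrap_sketch(func, on_result):
    """Return a wrapper that reports results and preserves sync/async behavior."""
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            result = await func(*args, **kwargs)
            on_result(result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        on_result(result)
        return result
    return sync_wrapper


async def echo(prompt):
    return prompt.upper()


results = []
agent = wrap_sketch(echo, results.append)
print(asyncio.run(agent("hello")))                 # prints: HELLO
```

The check happens once at wrap time, so callers never need to know which kind of wrapper they received.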

Configuration reference

Parameter	Env var	Default
api_key	SAFESHIP_API_KEY	—
endpoint	SAFESHIP_ENDPOINT	https://safeship.dev/v1/traces
timeout_seconds	—	2.0
debug	—	False
enabled	—	True (set False in tests)
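One way to keep these settings out of source code is to build the init() arguments from the environment, as in the sketch below. The parameter names and the two SAFESHIP_* variables come from the table above; APP_DEBUG and APP_ENV are hypothetical variables of your own application, and safeship_init_kwargs is a hypothetical helper.

```python
import os


def safeship_init_kwargs(env=os.environ):
    """Build keyword arguments for safeship.init() from the environment."""
    kwargs = {
        "api_key": env.get("SAFESHIP_API_KEY"),
        "timeout_seconds": 2.0,
        "debug": env.get("APP_DEBUG") == "1",      # APP_DEBUG is hypothetical
        "enabled": env.get("APP_ENV") != "test",   # APP_ENV is hypothetical
    }
    endpoint = env.get("SAFESHIP_ENDPOINT")
    if endpoint:
        kwargs["endpoint"] = endpoint              # otherwise keep the SDK default
    return kwargs
```

You would then call safeship.init(**safeship_init_kwargs()) once at startup.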

Reliability guarantees

The SDK is built to never get in the way of your agent. These are enforced by code and verified in our pytest suite:

Never crashes your agent. Every internal error is caught and dropped silently (turn on debug=True to log them).
Never blocks on the network. Trace upload happens on a daemon thread; your code returns the moment your agent returns.
No extra LLM calls. SafeShip never re-prompts your model or makes shadow calls — your token spend is unchanged.
Survives transient failures. 5xx and 429 responses are retried with exponential backoff; permanent 4xx errors are dropped without crashing.
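The retry rule in the last guarantee can be sketched as two small functions. The base delay, growth factor, and attempt count below are placeholder values, not the SDK's real constants, and both function names are hypothetical.

```python
def should_retry(status_code):
    """Retry 429 and 5xx responses; treat every other status as final."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(base_seconds=0.5, factor=2.0, max_attempts=4):
    """Yield the wait before each retry: base, base*factor, base*factor**2, ..."""
    for attempt in range(max_attempts):
        yield base_seconds * factor ** attempt


print(list(backoff_delays()))                      # prints: [0.5, 1.0, 2.0, 4.0]
```

Production retry loops often add random jitter to each delay so that many clients do not retry in lockstep; this sketch leaves that out for clarity.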

Troubleshooting

My traces aren't showing up on the dashboard.

Check that your api_key starts with sk_live_ and matches the one shown on the Setup page. Set debug=True in safeship.init() to log transport errors to stderr.

I got a 429 "rate_limited" response.

You hit either the burst (100/min) or daily (5,000/day) ingestion cap. The response includes a Retry-After header telling you how many seconds to wait. Contact founder@safeship.dev if you need higher limits.

The 'Suggest a regression test' button errors out.

Either the per-project Claude rate limit has been reached (50/day) or there's a temporary backend issue. The error message will tell you which. Wait for the time shown in the message, or email support.

How do I send a test trace without writing code?

On the Setup page, click Send us a test trace. It inserts a synthetic 5-step run so you can preview the dashboard and trace-detail UI without wiring real code.

Stuck?

Email founder@safeship.dev — solo founder, replies usually same-day. Include your project ID (visible on the dashboard) if you have one.