Holdout Scenarios: Consulting Deliverable Gate

NEVER share with executing agents — for evaluation only.

Scenario 1: Generic AI Roadmap

Input: A “Strategic AI Roadmap” for a hedge fund client that follows a generic template: “Step 1: Identify use cases. Step 2: Build POCs. Step 3: Scale.” It makes no reference to Every’s experience and could have been written by any consulting firm.

Expected gate result: Tier 1 FAIL on criterion 1 (not grounded in Every’s experience). Tier 2 FAIL on criterion 5 (not client-specific).

Why this matters: This is the consulting anti-pattern: “management consulting from PowerPoint.”

Scenario 2: Recommends Tool Every Doesn’t Use

Input: Training materials for a tech company that recommend adopting a specific workflow automation tool that Every has evaluated but doesn’t use internally. The recommendation is well-reasoned within the client’s context.

Expected gate result: Tier 1 FAIL on criterion 2 (unfamiliar tools). Even if the tool makes sense for the client, Every should recommend from experience. Escalate to Natalia for judgment; she may override if the tool is genuinely right for the client and Every can support it.

Why this matters: Builder credibility means recommending from practice.

Scenario 3: Cross-Client Data Leak

Input: A presentation for Client B that includes an anonymized case study saying “a major hedge fund achieved X results,” but the details are specific enough that Client A (whose results these are) could be identified.

Expected gate result: Tier 1 FAIL on criterion 4 (confidentiality). Immediate escalation to Natalia + Dan.

Why this matters: Consulting confidentiality is sacred. This is the highest-severity failure mode.

Scenario 4: Excellent Customized Deliverable

Input: A training curriculum for a PE firm’s investment team that starts with their specific workflow (memo writing), shows how Every’s compound engineering approach applies (with Kieran’s methodology adapted for investment analysis), and includes a hands-on session building custom Claude Skills for their specific use case.

Expected gate result: PASS all tiers. Tier 3 flag on criterion 9 (reusable framework: the PE memo automation pattern should be added to the knowledge base).

Why this matters: This is what great Every consulting looks like.

Scenario 5: Overselling AI Capabilities

Input: A proposal promising that “AI can fully automate your investment analysis pipeline, reducing analyst headcount by 80%.” While compound engineering enables significant productivity gains, full automation of complex investment analysis isn’t something Every has demonstrated.

Expected gate result: Tier 1 FAIL on criterion 3 (overselling). Every’s documented results show “investment analysis reduced from one week to minutes,” which is a productivity gain, not headcount replacement.

Why this matters: Overselling destroys long-term credibility for short-term engagement revenue.
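
For evaluators who want to check gate output mechanically, the expected results of the five scenarios above could be recorded as data and compared against what the gate returns. The sketch below is one possible encoding, not the gate's actual interface: the GateResult shape, its field names, and the mismatches helper are all hypothetical, and only the tier/criterion numbers and escalation names come from the scenarios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Hypothetical result shape; the real gate's output format is an assumption."""
    failures: frozenset = frozenset()     # (tier, criterion) pairs that failed
    flags: frozenset = frozenset()        # (tier, criterion) pairs flagged without failing
    escalate_to: frozenset = frozenset()  # people the result should be escalated to

    @property
    def passed(self) -> bool:
        # A deliverable passes when no criterion failed; flags do not block it.
        return not self.failures


# Expected results, keyed by scenario number, taken from the scenarios above.
EXPECTED = {
    1: GateResult(failures=frozenset({(1, 1), (2, 5)})),
    2: GateResult(failures=frozenset({(1, 2)}), escalate_to=frozenset({"Natalia"})),
    3: GateResult(failures=frozenset({(1, 4)}), escalate_to=frozenset({"Natalia", "Dan"})),
    4: GateResult(flags=frozenset({(3, 9)})),
    5: GateResult(failures=frozenset({(1, 3)})),
}


def mismatches(scenario: int, actual: GateResult) -> list[str]:
    """Return human-readable differences between expected and actual results."""
    expected = EXPECTED[scenario]
    problems = []
    for name in ("failures", "flags", "escalate_to"):
        want, got = getattr(expected, name), getattr(actual, name)
        if want != got:
            problems.append(
                f"scenario {scenario}: {name} expected {sorted(want)}, got {sorted(got)}"
            )
    return problems
```

An empty list from mismatches means the gate behaved as expected on that scenario. Comparing escalation targets as well as failures matters for Scenario 3, where catching the confidentiality failure without escalating to both Natalia and Dan would still be a miss.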

Evaluation Cadence

Track client NPS per engagement — target: >70
Track Tier 1 failure rate: if it is consistently high, the consulting team needs better templates or training
Quarterly review by Natalia
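
The two tracked metrics can be computed with a few lines of arithmetic. The sketch below assumes NPS is calculated the standard way (percentage of promoters scoring 9 or 10 minus percentage of detractors scoring 0 through 6) and that each reviewed deliverable is recorded as a boolean for whether it failed Tier 1. The function names and input shapes are hypothetical. No threshold is set for the Tier 1 failure rate, since the cadence above does not define what "consistently high" means.

```python
NPS_TARGET = 70  # from the cadence above: target is NPS > 70


def nps(scores: list[int]) -> float:
    """Standard Net Promoter Score from 0-10 survey responses."""
    if not scores:
        raise ValueError("need at least one survey response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def meets_nps_target(scores: list[int]) -> bool:
    """True when the engagement's NPS is strictly above the target."""
    return nps(scores) > NPS_TARGET


def tier1_failure_rate(failed_tier1: list[bool]) -> float:
    """Fraction of reviewed deliverables that failed at least one Tier 1 criterion."""
    if not failed_tier1:
        raise ValueError("need at least one reviewed deliverable")
    return sum(failed_tier1) / len(failed_tier1)
```

For example, an engagement with ten responses of which eight are promoters and none are detractors scores 80, which clears the target; the same engagement with one detractor among the remaining two responses scores exactly 70 and does not.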