Plan-B Next-Gen CI/CD Platform

The plan, restored

This is the spine exactly as it was locked on 2026-05-25 (the imperative-crafting-wand strategic plan, 12 pillars / D1–D55). The flow was always simple: dev → staging (TEST) → main (PROD) — every change goes to staging, passes the whole battery again, then one human promote moves it to prod. No automated promotion. All 12 pillars ship before the first paying customer goes live.

The spine

3-tier topologyCore flow — DEV / TEST / PROD

The plan's spine is a strict 3-tier DEV / TEST / PROD topology with absolute prod isolation, and a one-way dev → preview-on-TEST → main → PROD promotion flow. Verbatim intent from the plan:

THE 3 TIERS (lines 304-327):
- DEV (laptops): local pnpm/npm + Vercel localhost + Neon `dev` branch + Vercel `development` env + live PlaySmart logs via read-only log splitter.
- TEST (cloned prod): NEW separate Hetzner project `plan-b-test`; servers TEST-siemsys-syslog + TEST-MDR-01 (3 containers); DNS test-siem.plan-b.co.il (Vercel preview alias) + test-siem-api.plan-b.co.il (TEST nginx, grey-cloud); Vercel `preview` env; Neon `dev` branch; Storage Box scoped /test/ (new BX11).
- PROD (untouchable): Hetzner PROD project; PROD-siemsys-syslog + PROD-MDR-01 (3 containers); siem-api.plan-b.co.il + siem.plan-b.co.il; Vercel `production` env; Neon `main` branch; Storage Box /prod/.

THE FLOW (lines 317-337): 'PR opens → Preview deploys to TEST → all Pillar 4-5 gates run against TEST data. Mike approves merge to main → PROD deploy → post-deploy verification runs → Auto-rollback on SLO breach.' Two boundaries: DEV↔TEST is ephemeral/frequent (every PR lands here; TEST is where dev 'shows itself' before being trusted); TEST→PROD is rare, human-approved (Mike merges to main), one-way only, NO automated promotion.

THREE ABSOLUTE RULES (lines 333-337): (1) No prod URL/DSN/key reachable from any non-prod code path (scripts/audit-prod-isolation.mjs enforces, gate #13 in check-all). (2) PROD has zero inbound dependencies on TEST or DEV (log splitter not log-forward; clones not mirrors). (3) Emergency override is Mike-only and audit-logged post-incident within 24h.

DATA-FLOW RULES (lines 322-327): PROD→TEST only via deterministic anonymizer (weekly staging refresh); TEST→PROD never (no code path, no script, no admin action); PlaySmart UDP 514 → log splitter → fans out to BOTH PROD-syslog AND TEST-syslog (its own cheap cx22; PROD has no outbound dependency on TEST). Per-PR ephemeral envs (Pillar 3 folded into Phase 1): Neon branch pr-{N} from dev + Vercel preview + PR-scoped OpenSearch index pr-{N}-test + auto-teardown on PR close.

no automated promotionPromote model — 5 gate layers + Mike's merge

A change reaches prod ONLY by passing the full staged-gate stack and then Mike merging the PR to `main`; there is NO automated promotion. The 5-layer defense-in-depth gate model (lines 619-640): Layer 1 local pre-push (Husky), Layer 2 CI on push (GH Actions: install/lint/typecheck/check-all/test/build), Layer 3 pre-merge branch protection (Layer 2 + Aikido + CodeRabbit no-outstanding-request-changes + on main: 1 approving review from Mike), Layer 4 pre-deploy rehearsal = Pillar 5 Stage 4 against TEST (Playwright/visual/a11y/perf/DAST/mutation/contract), Layer 5 post-deploy verification = Pillar 7 (synthetic + SLO burn-rate + auto-rollback). Merge to main is gated: branch protection requires all checks green AND Mike's approving review; no-bypass-for-admins is OFF so even Mike uses the PR flow (emergencies covered by the emergency-override runbook). On merge to main, Vercel deploys to PROD and post-deploy verification runs. Progressive delivery (Pillar 7): every meaningful PR ships behind a GrowthBook flag; canary ramps 1%→10%→50%→100% over 24h (4h/8h/12h gates); auto-rollback on ANY of {error rate 2x baseline, p99 2x baseline, synthetic fail 3x}; instant code rollback via Vercel alias switch (<30s) + flag flip. D30: ramp stages are manually approved by Mike until the Release Agent (Pillar 9 Tier 3) ships, automated after. The 'trust the gates' principle (D26, lines 1620-1623): post-launch Renovate auto-merges ANY green PR — patch/minor/major alike — because the CI/CD's entire job is to guarantee anything merged is safe; majors needing real migration fail gates naturally and stay open. System goes live to first paying customer only when ALL 12 pillars ship — NO phased go-live (line 2161).

the implicit gapEngine deploy intent — signed image, not Vercel alias

The plan is portal-centric (Next.js → Vercel) for the dev→test→main→PROD spine; the MDR engine + edr-puller (the backend Docker containers on PROD-MDR-01 / TEST-MDR-01) are governed differently and the plan does NOT define a full git-promotion pipeline for them. Key locked intents:

1) Engine/puller are NOT modified from pbsiem PRs. CodeRabbit path_instructions (line 717): 'mdr-engine/**, edr-puller/** → contract-drift only; no code edits suggested (we don't modify these from pbsiem PRs).' The crypto-contract gate (lines 731-736) treats src/lib/mdr-crypto.ts (portal), mdr-engine/src/action-executor.ts decrypt, and edr-puller/src/decrypt.ts as 'read-only imports — we don't modify these files'; the gate only asserts the three decryptors stay contract-compatible against a golden v2 GCM envelope.

2) How the engine reaches prod = signed-container-pull, not Vercel alias. Pillar 6 (supply chain) sub-deliverable 5 (lines 1597, 1609): 'Sigstore-signed images for docker containers (mdr-engine, edr-puller) pushed to GHCR. GHCR push step in CI signs images via cosign sign (keyless OIDC). Engine + puller deployment scripts verify signatures before pulling.' Verification (line 1660): cosign verify ghcr.io/plan-b-systems/mdr-engine:<sha> returns valid keyless signature. So the engine reaches PROD-MDR-01 by deployment scripts pulling cosign-verified, SLSA-L3-attested images from GHCR.

3) The engine is woven into the cross-cutting pillars even though it isn't on the Vercel promote spine: OTel instrumentation in MDR engine + edr-puller with W3C TraceContext propagation portal→engine→puller (Pillar 8, lines 1212-1219); GrowthBook SDK in portal + engine + puller (Pillar 7 B, lines 1425-1431); shared crypto/rate-limit/audit-log cross-component contract gates (Pillar 4 H); cross-product attestation verification at deploy (Pillar 6 #9).

NET: the plan locks the engine's prod path as 'signed image in GHCR → verified pull onto the PROD Hetzner box,' kept honest by contract-drift gates and signature/provenance verification — but it does NOT specify a Vercel-style canary/alias-switch promotion or a TEST→PROD git-merge gate for the engine itself the way it does for the portal. This is the implicit gap: the promote-to-prod spine is portal-shaped; the engine relies on container-signing discipline rather than the dev→test→main button-press flow.

The 12 pillars

Each pillar’s locked intent and the decisions that shaped it. The current state of every one is in the Current state tab; what’s missing is in Gaps.

P1Pillar 1 — Environment topology & isolation

Intent: DEV/TEST/PROD with absolute prod isolation; a NEW separate Hetzner project hosts the TEST stack as a lean shadow of prod, mechanically enforced so no prod URL/DSN/key is reachable from any non-prod code path; emergency override is Mike-only with a post-incident audit trail.

Key decisions

D5: Pillar 3 ephemeral envs FOLDED into Phase 1 (not a separate phase). D6: TEST topology = lean shadow, cpx22/cx22 x2 + cx22 splitter, ~EUR20-25/mo. D7: TEST data = live PlaySmart via splitter + minimal synthetic fixtures (anonymizer deferred to P12). D8: per-PR ephemeral envs YES in Phase 1 (Neon branch per PR + scoped OS index + auto-teardown). D9: external SaaS keys split (Twilio/SendGrid/Tranzila/Finbot/Hetzner/GitHub), shared+capped (Anthropic), disabled-in-TEST (SentinelOne), narrow-scope shared (Cloudflare). D10: NEW BX11 Storage Box dedicated to TEST (~EUR4/mo), independent SSH keys+quota. New script scripts/audit-prod-isolation.mjs becomes check-all gate #13.

P2Pillar 2 — Identity, secrets & trust

Intent: Workload identity (OIDC where supported), no long-lived CI creds, weekly rotation cron (already shipped), per-environment secret scoping, audit-logged secret access, with 'no human sees prod credentials' as the north star.

Key decisions

D11: OIDC adoption DEFERRED to v2 (Mike's call: would complicate things); keep weekly rotation cron. D12: audit-logged secret access PROD-only, sensitive-secret types only (encryption keys, payment processor keys, AI keys). D13: secret-access proxy pattern getSecret('X') wrapping process.env.X (clean DX, greppable, central audit point). v1 scope reduced from 10 to 6 sub-deliverables after OIDC deferral; rotation extends to MDR_ENCRYPTION_KEY + bearer tokens + PATs + GrowthBook keys.

P3Pillar 3 — Per-change ephemeral environments

Intent: Neon branch per PR + Vercel preview against TEST + OpenSearch PR-scoped index + deterministic seed fixtures + auto-teardown on PR close, with a per-env cost cap and an aggregate cap.

Key decisions

FOLDED into Phase 1 (Pillar 1, D5) — no longer its own phase. Mechanics: GH workflow .github/workflows/ephemeral-env.yml on PR opened/synchronize/reopened/closed creates/deletes Neon branch pr-{N} from dev + Vercel per-branch env override (DATABASE_URL, OPENSEARCH_PR_INDEX, PR_NUMBER) + PR-scoped OS index pr-{N}-test (1 shard, 0 replicas, ISM delete min_age 1d). Idempotent, never-fail-quiet teardown + daily GC cron + cost cron. Aggregate ephemeral budget ~EUR100/mo; est. <EUR1/PR. Flagged prereqs: Neon branch quota (Scale plan) + Vercel per-branch env requires Pro.

P4Pillar 4 — Quality gates (defense in depth)

Intent: Five layers — local pre-push, CI on push, pre-merge branch protection, pre-deploy rehearsal, post-deploy verification — extending the existing 11 check-all gates with typecheck, lint-warnings-as-errors, Aikido, CodeRabbit, a crypto-contract gate, and a generic cross-component contract-gate framework.

Key decisions

D14: pre-commit/push tool = Husky + lint-staged. D15: CodeRabbit request_changes_workflow TRUE (blocks merge on requested changes). D16: GH Actions pinning hybrid — SHA-pinned for security-relevant (Aikido/CodeQL/SLSA), @v4 for build-relevant (checkout/setup-node/cache). D17: cross-component contract gate framework BUILT NOW (scripts/check-contracts/) with crypto-envelope as the v1 contract; v2 candidates rate-limit/pii-scrub/audit-log. D18: --no-verify escape hatch ALLOWED with documented emergency-override runbook. Branch protection: dev = 7 required checks no direct push; main = same PLUS 1 Mike approval, admin no-bypass OFF. Critical sequencing: workflows must exist before branch protection so required check names match.

P5Pillar 5 — Production-shaped rehearsal (Stage 4)

Intent: Against the per-PR ephemeral env on TEST, every PR runs Playwright E2E (existing 86 specs) + visual regression EN+HE + OpenAPI/contract tests + perf budget (Lighthouse + k6) + a11y (axe-core) + i18n key parity + cross-tenant isolation + mutation testing (Stryker) + migration safety + AI cost guard.

Key decisions

D19: visual regression = critical-path 12 pages x EN+HE = 24 baselines (2% pixel threshold). D20: OpenAPI via zod-to-openapi (requires standalone Phase 5a Zod audit ~2-3 days). D21: perf budget baseline = rolling baseline from main. D22: Stryker mutation threshold = 30 days to 70%, start at first-run floor ~60%. D23: cross-tenant isolation coverage = top-12 sensitivity endpoints initially (catches the CR-03 IDOR class). D24: AI cost guard = Claude API only for v1; three-tier caps $0.50/run, $5/PR-lifetime, $50/repo/mo. Gates trigger on deployment_status=success against the Pillar 1 ephemeral env, post as PR checks, required by Pillar 4 branch protection.

P6Pillar 6 — Supply-chain integrity

Intent: SBOM (CycloneDX) per release attached to GH Release + Storage Box retention; SLSA L3 provenance via Sigstore + GitHub attest-build-provenance; SLSA Source Track; Aikido SAST/secret/dep-CVE/IaC; ZAP DAST; dependency-audit cron with auto-PR.

Key decisions

D25: signed commits on main DEFERRED to v2 (team disruption); target SLSA Source Track L1 (branch protection alone satisfies). D26: Renovate auto-merge = TRUST THE GATES — manual during build; post-launch auto-merge ANY green PR (no patch/minor/major distinction; majors needing migration fail gates and stay open). D27: SBOM retention = FOREVER on Storage Box (~100KB JSON each). Sigstore-signed mdr-engine + edr-puller images to GHCR via cosign keyless OIDC; deployment scripts verify signatures before pulling; cross-product attestation verification ships in ci-templates.

P7Pillar 7 — Progressive delivery & safe rollback

Intent: GrowthBook self-hosted feature flags; canary 1%->10%->50%->100% over 24h; auto-rollback on SLO breach; expand-contract schema migrations enforced by a Migration Review Agent; instant rollback via Vercel alias switch + flag flip.

Key decisions

D28: canary ramp 1%->10%->50%->100% over 24h (4h/8h/12h stage gates). D29: rollback trigger = ANY of {error rate 2x baseline, p99 2x baseline, synthetic fail 3x} -> auto-rollback (false rollbacks cheap, missed rollbacks expensive). D30: manual approval gates until Release Agent ships in Pillar 9, automated after. D31: default flagging = every meaningful PR by default, opt-out via no-flag-needed PR label. D32: pre-launch rollout scope = Phase A PlaySmart-only (de facto today), Phase B per-customer opt-in as customers onboard, Phase C tiered cohorts at 10+ customers. SDK integration spans portal + engine + puller; Vercel alias switch rollback <30s via scripts/rollback-prod.mjs.

P8Pillar 8 — Observability, SLOs & cost attribution

Intent: Golden signals via OpenTelemetry + Prometheus + Grafana + Sentry + Better Uptime; 5 named SLOs enforced by burn-rate alerts; plus cost-per-customer + cost-per-feature attribution via a client_id tag on every log/metric/trace (critical for SIEM usage-based pricing).

Key decisions

D33: Prometheus + Grafana self-hosted on TEST initially (~EUR4/mo); revisit Grafana Cloud later. D34: SLO-as-code = hand-written Prometheus alert rules for v1 (5 SLOs: login 99.9%/2s, license check 99.95%/500ms, MDR APIs 99.5%/1s, alert delivery 99%/5min, ingestion 99.9%/30s); Pyrra/Sloth deferred. D35: OTel exporter = OTLP/gRPC to self-hosted collector. D36: Sentry sampling tracesSampleRate 0.1, replaysSessionSampleRate 0.01. D37: cost attribution = PER-CUSTOMER + PER-FEATURE (client_id + feature_flag labels everywhere). D38: PROD-obs-stack provisioned 2-4 weeks after TEST-obs-stack stable. OTel instruments portal + MDR engine + edr-puller with W3C TraceContext propagation.

P9Pillar 9 — AI agent fleet with bounded authority

Intent: A runtime-enforced (not prompt-enforced) Tier 0-4 authority framework — refuse to expose the tool rather than ask the prompt to refuse — with earned-trust progression, an <untrusted> prompt-injection channel, an agent-to-agent graph cap, per-agent golden-input regression suites, and monthly post-mortem + agent-off day.

Key decisions

D39: build on Claude Agent SDK (allowedTools enforcement + PreToolUse/PostToolOutput hooks), not raw Anthropic SDK. D40: align Tier 0-4 to CSA Agentic Trust Framework (Feb 2026). D41: earned-trust thresholds T0 100 inv/<5% override, T1 200/<3%, T2 500/<2%, T3 1000/<1%; agents start in SHADOW, auto-graduate to LIVE. D42: telemetry extends Pillar 8 OTel/Prometheus/Grafana (no LangSmith/Helicone). D43: agent-to-agent graph cap = max 3 hops, max $5 cascade cost. D44: monthly 20-decision-sample post-mortem per agent + monthly agent-off day. D45: AI cost cap $0.50/run, $5/PR-lifetime, $50/repo/mo; model-tier discipline (Haiku routine, Sonnet complex, Opus only monthly review). 10 new agents built lowest-risk-first (3 Tier-0, 5 Tier-1, 1 Tier-2 On-Call, 1 Tier-3 Release Agent) + 4 existing agents retrofitted; v1 has ZERO Tier-4 agents (humans always click prod-high-blast actions).

P10Pillar 10 — Compliance evidence as exhaust

Intent: Immutable audit log, SBOM-per-release with retention, per-PR change-impact tags (customer-facing/auth/schema/security/PII), auditd on prod, quarterly drill evidence, and a control-to-evidence mapping table (SOC 2 CC-* + PCI DSS 4.0) where one artifact proves many controls — so audit becomes 'export the report,' not 'rebuild the trail.'

Key decisions

D46: SOC 2 Type II readiness = 6 months post-launch (aligns first-enterprise-prospect window; Type II needs trailing-period evidence). D47: PCI DSS scope = SAQ-A per Tranzila's written guidance (API path is split-second transit, never persisted; Tranzila reaffirmed after Codex flagged SAQ-A-EP/SAQ-D risk); guidance archived in compliance/pci-tranzila-scope-letter.pdf. D48: audit-log export format = JSON + CSV + PDF (GDPR machine-readable + enterprise auditor + spreadsheet). Customer-facing tenant-scoped export endpoint /api/customer/audit-export (GDPR Article 15).

P11Pillar 11 — Disaster recovery (proven, not planned)

Intent: DR you have tested in a drill, not designed in a doc — daily backup-restore verification into TEST, RTO/RPO measured every drill, cross-region Storage Box (Falkenstein->Helsinki), quarterly automated game-day, and a failover decision tree mapping breach x duration -> action.

Key decisions

D49: DR scope SPLIT TO A SEPARATE PLAN — only the daily backup->TEST restore cron stays in THIS plan (Phase 3, because it reuses CI/CD machinery). Everything else (cross-region replication, quarterly drill, failover decision tree, Phase B Neon->self-hosted PG migration research, customer PITR) moves to the future dedicated DR plan. Reference-only rationale preserved: cold DR at launch, warm Helsinki replica in Y1, status page auto-updates from SLO with a 5-min confirmation window, drills engineer-driven first then agent-assisted.

P12Pillar 12 — Platform meta-pillars + ci-templates extraction

Intent: The v2-polish that prevents 12-24 month rot — services.yaml catalog, Terraform IaC + drift detection, runbooks-as-code, scheduled chaos game-days, status-page-to-SLO binding, a shared deterministic anonymizer library, API versioning/deprecation policy, customer-facing export interface, PII scan of fixtures+screenshots — culminating in extracting reusable patterns into the plan-b-systems/ci-templates product (SemVer v1.x.y, consumers SHA-pin).

Key decisions

D50: IaC tool = Terraform (industry standard; maintained Hetzner/Cloudflare/Neon providers). D51: ci-templates repo visibility = PUBLIC (nothing competitively secret; attracts engineering talent). D52: anonymizer library = OWN implementation (~500 LOC, 1-2 days; deterministic + schema-aware; vendors overkill at our schema size). The grand finale (M10): extract GH Actions workflows + agent-runtime + agent prompts + IaC modules + SLO schemas + doc templates into public plan-b-systems/ci-templates; pbsiem refactors to consume via uses:@v1.0.0; stub product-3 onboards in 30 min as the conformance test; consumers SHA-pin (never @main, never @v1); MAJOR = 30-day deprecation + auto-PR per consumer + 90-day support.

Current state — verified against code

Every line here was checked against the actual repos on 2026-06-13 by a 16-agent adversarial analysis — not from memory, not from a prior session’s claims. This is the reality column for all 14 areas.

55missing

8built, not wired

45built different

21aligned

3 Missing 3 Built different 3 AlignedPillar 1 — Environment topology & isolation (DEV/TEST/PROD, prod isolation, per-PR ephemeral, log splitter, audit gate, emergency override)

TEST stack is REAL and LIVE: test-siem-api.plan-b.co.il resolves to 178.105.214.148 and serves a valid Let's Encrypt cert (openssl Verify return code: 0). pbsiem/docs/01-infrastructure.md:80-97 documents the separate Hetzner CI/CD project: TEST-siemsys-syslog (132962657 / 178.105.214.148), TEST-MDR-01 (132962660 / 178.105.148.156, 3 containers engine+postgres+puller), splitter-syslog (133160347 / 178.105.116.35, rsyslog UDP-514 fan-out to PROD 88.198.217.216 AND TEST), network Siem-internal-test (12257988, 10.1.0.0/16), BX11 (584099, u601754.your-storagebox.de). Per-PR ephemeral WIRED: .github/workflows/ephemeral-env.yml calls cicd-system ephemeral-env.yml@main (Neon branch from parent "production", Vercel preview override) AND pr-backend-stack.yml@main with teardown on PR close (lines 19-56). Full-system TEST deploy WIRED (contradicts runbook's open gap): .github/workflows/stage4.yml header "THE RELEASE-CERTAINTY CORE (S5, D-4=Option C)" rebuilds mdr-engine+edr-puller at PR SHA via docker-build, deploys BOTH to the shared TEST stack via siem-deploy.yml@main with required-env-marker:TEST + health/RestartCount gate (lines 85-125), migrates MDR-PG, syncs receiver, then runs e2e/dast/perf/visual/link-crawl vs the real preview. Prod-isolation gate WIRED: scripts/check-all.mjs:52 invokes check-prod-isolation.mjs. ENGINE PROD PROMOTE: NONE. prod-deploy.yml + prod-promote.yml are VERCEL-PORTAL-ONLY (prod-promote.yml:48-64 calls vercel-promote.yml to move alias siemsys.plan-b.systems; prod-deploy.yml has zero engine/docker/ssh/hetzner steps). siem-deploy.yml is called only by stage4.yml + rollback-drill.yml (both TEST) — never for prod; its own header advertises a "deploy to PROD host (env-gated)" mode that no caller uses.

3 Missing 1 Built different 2 AlignedPillar 2 — Identity, secrets & trust

VERIFIED in pbsiem (plan-b-systems/pbsiem, local clone). (1) Rotation cron SHIPPED + WIRED: .github/workflows/rotate-secrets.yml:42 cron '19 20 * * 5' + workflow_dispatch; scripts/rotate-secrets.sh rotates exactly TWO secrets — OPENSEARCH_PROXY_KEY (line 122/149-150) and mdr_admin PG password (line 116-117, ALTER USER at 206-210). Host discovery via GET on src/app/api/cron/rotation-targets/route.ts (route.ts exists, 1884 bytes) using CRON_SECRET bearer. NO MDR_ENCRYPTION_KEY, bearer-token, PAT, or GrowthBook rotation present. (2) "No human sees prod credentials" ALIGNED for the 2 rotated secrets: rotate-secrets.sh:355-359 logs sha256 fingerprints (first 16 chars) only; plaintext values file shredded on every exit path (trap EXIT INT TERM, line 135; confirmed 361-363). (3) getSecret() proxy MISSING: no src/lib/secrets/get-secret.ts (dir absent); the only getSecret matches are unrelated LOCAL helpers — src/lib/trusted-browser.ts:21 (HMAC signing key) and src/lib/twilio-callback-token.ts:32 — NOT the central audited proxy. (4) SECRET_ACCESS audit MISSING: grep "SECRET_ACCESS" across pbsiem returns ZERO hits; AuditLog model exists (prisma/schema.prisma:586) but action is a free-text string with no secret-access usage. (5) OIDC correctly DEFERRED: long-lived VERCEL_TOKEN still referenced by 8 workflows (ephemeral-env, post-deploy-verify, prod-deploy, prod-promote, prod-verify, rotate-secrets, stage4, visual-baselines). (6) Per-env scoping PARTIAL: GH environment: keyword only in prod-deploy.yml; rotation script keeps preview/TEST creds separate from PROD (lines ~154). (7) docs/secrets.md MISSING.

2 Missing 2 Built different 3 AlignedPillar 3 — Per-change ephemeral environments

WIRED AND RUNNING GREEN today (verified 2026-06-14). cicd-system ships reusable `workflow_call` workflows `.github/workflows/ephemeral-env.yml` (Neon branch create/reuse with read_write endpoint, hard-fail on wrong parent, EPHEMERAL_PAUSED gate, connection-uri retry, Vercel preview vars, seed, PR comment, full teardown on close) and `pr-backend-stack.yml` (per-PR engine+puller+postgres docker-compose on TEST pool host 178.105.148.156, deterministic HMAC PG password, pool eviction/MAX_STACKS, image GC). pbsiem `.github/workflows/ephemeral-env.yml:19` calls the cicd reusable @main with neon-project-id autumn-thunder-97996128, neon-parent-branch=production, seed-script scripts/qa-seed.ts, pr-stack-host. pbsiem `stage4.yml` deploys engine/puller/receiver to TEST + stands up the PR's own backend stack + runs e2e/dast/perf/visual/links against the PR's real Vercel preview. Live: `gh run list` shows ephemeral-env success (14-44s) and stage4 success (10-13m) on 2026-06-14; PR #78 stage4 jobs ALL green incl. deploy-engine/deploy-puller/pr-stack/pr-services/e2e/links. Crons exist: `src/app/api/cron/ephemeral-env-gc/route.ts` (Neon-only, 03:00) and `ephemeral-env-cost/route.ts` (Neon branch count, 04:00, dual-auth) in pbsiem vercel.json:192/196. Admin UI at `src/app/admin/ephemeral-envs/page.tsx` (not the plan's (admin)/admin/infra path).

8 Built different 2 AlignedPillar 4 — Quality gates (defense in depth)

VERIFIED on pbsiem origin/main + gh + cicd-system. Layer 1 EXISTS: .husky/pre-commit=`npx lint-staged`, .husky/pre-push=`npm run prepush`, .nvmrc=22.16.0 (plan said 22.11.0), husky^9.1.7 + lint-staged^15.4.3, lint-staged config matches, engines>=22.11.0. BUT `prepush`=`npm run check && npm run typecheck && npm run lint && npm run test:unit` — uses `npm run lint` (NOT lint:strict) + `test:unit` (NOT full test); diverges from plan AND from README L28 which claims `--max-warnings 0` + `vitest run`. Layer 2 BUILT-DIFFERENT: pbsiem/.github/workflows/ci.yml calls reusable cicd-system/.github/workflows/ci.yml@main as ONE combined job (single required check `ci / CI`, sequential steps), NOT 6 separate checks; pbsiem passes `lint-command: "npm run lint"` overriding the reusable default `lint:strict` — lint-warnings-as-errors NOT enforced in CI. check-all.mjs runs 13 lints + auto-discovers scripts/check-contracts/*.mjs (crypto-envelope.mjs present) — D17 framework BUILT+WIRED+ENFORCED (aligned). Aikido MISSING/REPLACED: cicd-system/.github/workflows/security-scan.yml L1+L3 = "Security Scan (Semgrep + Trivy) … Replaces Aikido (D54 superseded)"; runs Semgrep+Trivy only, NO Aikido step; pbsiem security-scan.yml enforce:true block CRITICAL,HIGH w/ allowlist. CodeRabbit: app IS installed (coderabbitai reviewer; PR#78 check `CodeRabbit pass / Review skipped`) but ADVISORY — .coderabbit.yaml does NOT exist on origin/main (removed commit fb8a2fb "remove CodeRabbit — can't downgrade to free tier"), request_changes_workflow defaults FALSE, NOT a required check. Branch protection (gh api): dev strict=False, 10 checks {ci/CI, impact, dast, e2e, perf, verify, visual, Vercel, links, security-scan/scan}, 0 reviews, enforce_admins=False; main strict=True, ONLY {ci/CI}, req_approvals=0, enforce_admins=TRUE, force=off. Neither branch lists Lint/Typecheck/Test/Build/Aikido/CodeRabbit as separate checks; main has 0 approvals (plan wanted 1) + enforce_admins ON (plan wanted OFF). Docs: README Contributing 5-layer + --no-verify→docs/runbooks/prod-emergency-override.md; docs/runbooks/ci-gate-failures.md EXISTS; ARCHITECTURE.md L133 still says "6 parallel jobs" + "Aikido + CodeRabbit" pre-merge — STALE.

5 Missing 2 Built · not wired 4 Built different 1 AlignedPillar 5 — Production-shaped rehearsal (Stage 4)

VERIFIED. cicd-system holds reusable workflows for the whole battery (.github/workflows/: playwright-e2e.yml, visual-regression.yml, perf-budget.yml [Lighthouse+k6], a11y-i18n.yml [axe-core + i18n parity], mutation-test.yml [Stryker], dast-zap.yml, link-crawl.yml). pbsiem stage4.yml (lines 208-269) wires ONLY 5 gates to its per-PR run: e2e, dast, perf, visual, links — plus a genuine full-system TEST deploy (build-engine/build-puller/migrate-test-db/deploy-engine/deploy-puller/deploy-receiver) AND a per-PR isolated backend stack (pr-stack/pr-db/pr-services with synthetic-seed.ts), which EXCEEDS the plan's "ephemeral env" scope. dev branch protection required contexts (gh api): ci/CI, impact, dast, e2e, perf, verify, visual, Vercel, links, security-scan — confirmed live and green on real PRs (stage4.yml runs 2026-06-13/14 = success). BUT: (1) e2e runs ONLY tests/e2e/dev-environment.spec.ts (stage4.yml:218) of 90 spec files — not the 86-spec suite; workflow comment admits "broaden... once the ephemeral DB seeding is wired deterministically." (2) visual covers 3 pages login/pricing/status, EN only, baselines tests/e2e/visual-regression.spec.ts-snapshots/*-chromium-linux.png — not 24 EN+HE. (3) perf = lighthouse-urls "/,/login", configs/lighthouse-budget.json static budget, k6-script-path empty (no k6, no rolling-from-main baseline). (4) i18n-parity.yml calls a11y-i18n.yml but ONLY the i18n job, gated on paths src/lib/i18n/** and NOT in branch protection (last run 2026-06-10 green). (5) axe-core accessibility job: NEVER called by pbsiem. (6) mutation-test.yml: NEVER called by pbsiem. (7) cross-tenant: tests/e2e/security/ does not exist; grep cross-tenant = 0 hits. (8) zod-to-openapi/oasdiff: 0 hits anywhere. (9) AICostLedger: not in prisma/. (10) migration safety: no check-migration-safety.mjs; only scripts/check-contracts/crypto-envelope.mjs exists (stage4 does honest prisma db push WITHOUT --accept-data-loss, a partial substitute). GAP A confirmed: main branch protection required check = ONLY "ci / CI" — the Stage-4 battery does NOT re-run on the dev→main promotion.

5 Missing 1 Built · not wired 2 Built different 2 AlignedPillar 6 — Supply-chain integrity (SBOM, SLSA provenance, Sigstore image signing, Renovate, dependency-audit, supply-chain risk score)

VERIFIED in code + live gh: (1) SBOM — sbom.yml reusable exists (cicd-system/.github/workflows/sbom.yml) and IS wired: pbsiem/.github/workflows/sbom.yml calls it @main on push to main; live runs succeed (most recent 2026-06-14T01:06 success). BUT it diverges from plan: uploads a 90-DAY-retention artifact (sbom.yml:68 retention-days: 90), NOT attached to any GH Release, NOT pushed to Storage Box forever — and it only SBOMs npm portal deps, never the container images. (2) SLSA L3 — slsa-provenance.yml reusable exists (cicd-system) and is well-formed (attest-build-provenance@v2, gh attestation verify), but has ZERO callers: 404 for the workflow in pbsiem, and an org code-search for slsa-provenance.yml@main returns nothing. Its release-attach step (slsa-provenance.yml:54 `if: github.event_name == 'release'`) can never fire — pbsiem has NO releases at all (gh release list empty). => SLSA = built-but-not-wired. (3) Sigstore image signing — docker-build.yml (cicd-system) builds+pushes mdr-engine/edr-puller to GHCR with NO cosign sign step (docker-build.yml:74-95); pbsiem/docker-build.yml calls it unchanged. siem-deploy.yml pulls/ssh-loads images with NO cosign verify (siem-deploy.yml:109-191). grep for "cosign" across cicd-system (non-node_modules) hits only docs + a findings route, never a workflow. => image signing & verify = MISSING. (4) SLSA Source Track L1 — ALIGNED: branch protection is enforced (per the plan's chosen L1 target satisfied by branch protection; signed commits correctly deferred). (5) Renovate / dep-audit cron — MISSING: no .renovaterc.json in pbsiem or cicd-system; no osv-scanner/npm-audit-cron workflow in either repo. (6) Supply-chain risk score — MISSING: no Grafana/dashboard pane; the only dashboard "provenance" string is the Vercel promotion-PR matchesMainTip check (promotions/page.tsx:201,466), unrelated to SLSA/cosign. (7) docs/supply-chain.md — MISSING (no such file). Note: security-scan.yml:3 says "Replaces Aikido (D54 superseded)" — Aikido is actually OUT in code (Semgrep+Trivy only), contradicting build.py:169's "Aikido Back IN" claim.

5 Missing 1 Built · not wired 2 Built different 1 AlignedPillar 7 — Progressive delivery & safe rollback

The canary/flag machinery is BUILT-BUT-NOT-WIRED, and several pieces are MISSING. (1) cicd-system/.github/workflows/canary-gate.yml exists with real logic: GrowthBook PUT /api/v1/features rollout-coverage ramp following the D28 default schedule (line 39), 5-min gate-window error-rate-vs-baseline checks, and on breach a GrowthBook flag-disable + workflow fail (lines 128-219). BUT it is a workflow_call reusable invoked by NOTHING real — grep shows the only reference is configs/example-callers/canary-caller.yml; zero pbsiem workflows reference canary/growthbook/feature-flag (grep of pbsiem/.github/workflows returned empty). (2) GrowthBook is NOT deployed: configs/growthbook-setup.md is a doc only (docker-compose snippet inside markdown), and all three planned hosts (gb.plan-b.co.il, growthbook.plan-b.co.il, gb-api...:3100) return connection failure (curl 000). No @growthbook/growthbook SDK in portal/engine/puller. (3) NO scripts/rollback-prod.mjs exists (find scripts/deploy returned only unrelated scripts). The Vercel-alias primitive instead lives in pbsiem prod-promote.yml -> cicd vercel-promote.yml, used as a HUMAN PROMOTE (alias repoint, no ramp); and a Vercel promote-previous rollback is inline in post-deploy-verify.yml (lines 130-183) but is SEALED OFF on production (auto-rollback default false; lines 29-33, 191 enforce verify+page-Mike, never auto-rollback on prod per S6). (4) NO /api/cron/canary-ramp, NO /api/cron/burn-rate-rollback. (5) NO /admin/canary or /admin/deploy/rollback UI in pbsiem (grep empty). (6) NO Migration Review Agent and NO expand-contract enforcement anywhere — cicd change-impact.yml only applies a "schema-changing" LABEL to prisma/schema + /migrations/ paths (lines 71-76), no prisma migrate diff, no ALTER blocking, no expand-contract suggestion; pbsiem change-impact.yml does not even reference prisma. (7) NO canary AuditLog event types (grep for CANARY_RAMP/CANARY_ROLLBACK/PROD_DEPLOY_ROLLBACK in pbsiem returned empty). The TEST-only rollback-drill.yml (container restore, PLANB_ENV=TEST assert) WAS exercised (S6: restore in ~1s/21s) — that is the only progressive-delivery-adjacent thing proven to work, and it is engine container restore, not canary or alias rollback.

3 Missing 1 Built · not wired 4 Built different 1 AlignedPillar 8 — Observability, SLOs & cost attribution (D33-D38)

VERIFIED. The TEST obs stack genuinely exists and runs: observability/docker-compose.yml (Prometheus v2.55.1, Alertmanager v0.27.0, Grafana 11.3.1, node-exporter, blackbox v0.25.0, otel-collector-contrib 0.114.0, Caddy) on the Hetzner Build-Runner; grafana.plan-b.co.il/api/health returns 200 {database:ok, version 11.3.1} — confirmed live by curl. slo/slo-definitions.json holds the 6 SLOs; observability/generate-slo-alerts.mjs renders observability/prometheus/rules/slo-alerts.yml (11 alerts). BUT: (1) Those alert rules query http_requests_total{slo=...} and http_request_duration_seconds_bucket{slo=...} — metrics that NOTHING emits. pbsiem/src/instrumentation.ts:8-15 only imports sentry.server/edge config; there is NO @vercel/otel, no OTLP exporter, no registerOTel, no W3C propagation in portal/engine/puller (grep of pbsiem src = 0 hits outside node_modules). The collector's traces pipeline exports to 'debug' (collector.yaml:36, discarded). So the burn-rate alerts can never fire on real traffic — they sit at 0 baseline. (2) The dashboard's SLOs are a DIFFERENT set: dashboard/src/lib/slo-compute.ts:65-93 computes 'CI success rate', 'CI p90 duration', 'Deployment success rate' from the dashboard's OWN build/deployment rows — NOT the 6 MDR SLOs. dashboard/src/app/api/slo/route.ts serves these. (3) Cost attribution: ZERO client_id or feature_flag labels anywhere in observability/ (grep = no matches). dashboard /api/metrics (route.ts) exports only cicd_* platform metrics; the sole cost metric is cicd_weekly_estimated_cost_eur from EphemeralEnvCost (ephemeral-env spend), not per-customer/per-feature. No cost-exporter code exists (Glob **/cost-exporter* = none). (4) Sentry: tracesSampleRate 0.05 prod (not 0.1) and Session Replay deliberately disabled (sentry.client.config.ts:11-16) — no client_id in beforeSend. (5) Better Uptime: not integrated; 'synthetic monitoring' = blackbox probes against only 2 URLs (prometheus.yml:51-53), not 6 at 1-min via Better Uptime. (6) Alertmanager routes to ALERT_WEBHOOK_URL = the CI/CD dashboard webhook (writes ALERTMANAGER_WEBHOOK audit rows), NOT pbsiem's SVC-HEALTH /admin/infra/service-health surface the plan names. (7) D38 PROD-obs-stack: not provisioned (TEST-only).

6 Missing 1 Built · not wired 4 Built different 1 AlignedPillar 9 — AI agent fleet with bounded authority

A real, coherent agent FRAMEWORK exists in cicd-system/agents/ but is invoked by NOTHING. Files: framework/types.ts (AgentTier enum 0-4, AgentSpec, TrustThreshold), framework/runtime.ts (TIER_PERMISSIONS map, createAgentRuntime with hand-rolled pre/post-tool-use hooks + tier+allowlist+cost gating, loadAgentSpec/listAgentSpecs), framework/telemetry.ts (recordInvocation→local JSON, checkEarnedTrust, TRUST_THRESHOLDS), framework/cost-guard.ts (MODEL_PRICING, DEFAULT_CAPS per-run 0.5/per-PR 5/per-repo-mo 50, checkCostBudget/recordCost). 10 specs in agents/specs/*.json (migration-review T0, pii-detection T0, cost-anomaly T0, ci-triage T1, doc-drift T1, visual-regression-triage T1, test-generation T1, dependency-update T1, oncall-investigation T2, release T3). WIRING: grep for createAgentRuntime/loadAgentSpec/runAgent/agents/framework hits ONLY docs/cicd-platform-build-plan.md, agents/README.md, and the framework's own runtime.ts/index.ts — zero callers in dashboard/, .github/workflows/ (none of the ~22 workflows reference it), scripts/, or pbsiem. No root package.json/tsconfig — the framework TS is not compiled into any build target; no *.test.* for the framework. D39 VIOLATED: no Claude Agent SDK dependency anywhere; allowedTools is a plain JSON string[] checked by hand (runtime.ts:201-228), not the SDK's API-layer allowedTools; hooks are hand-coded, not SDK PreToolUse/PostToolOutput. D41 earned-trust numbers match (telemetry.ts:30-34) BUT there is NO SHADOW/LIVE field anywhere and checkEarnedTrust only REPORTS eligibility — never auto-graduates (README:119 "Promotion is not automatic"); also overrideRate is computed as "human overrode a tier DENIAL" (telemetry.ts:96), NOT the plan's "human did something different than the agent suggested" — a different metric. D42 VIOLATED: telemetry is local gitignored JSON (agents/telemetry/*.json), not OTel/Prometheus/Grafana. MISSING entirely: <untrusted> channel (grep: only in docs), agent-to-agent graph cap / cascade cost (grep: only docs), golden-input suites (none), monthly post-mortem + agent-off-day (none), AgentTrustState + AICostLedger persistent models (absent from pbsiem schema — the only "trust" models are TrustedDevice/Persona-4 trust-period, unrelated), CSA framework alignment (terminology not present). The 4 existing agents are NOT retrofitted/tier-declared. The ONLY actually-WIRED agent-fleet-adjacent thing is dashboard/src/lib/promotion-verdict.ts: two-LLM advisory "verdict reviewers" (Anthropic claude-sonnet-4-6 + OpenAI gpt-4o) called from /api/promotions/verdict, /api/promotions, and /api/cron/promotion-notify, persisted to AuditLog (action=AI_VERDICT), surfaced in the Promotions pane / scoreboard — but it is a standalone raw-HTTP implementation: it does NOT import agents/framework, has no tier runtime, no cost guard, no allowedTools, and feeds the commit message into the prompt WITHOUT <untrusted> wrapping (promotion-verdict.ts:35-68). It is advisory/shadow only (never gates the button, Safety Rule 5).

8 Missing 1 Built · not wired 1 Built different 1 AlignedPillar 10 — Compliance evidence as exhaust

PARTIAL. What EXISTS and is WIRED: (1) C5 change-impact auto-tagging is LIVE — reusable workflow cicd-system/.github/workflows/change-impact.yml applies all 5 plan labels (auth-touching/schema-changing/customer-facing/security-relevant/pii-touching, lines 62-103) incl. content-based PII diff scan (146-162); pbsiem/.github/workflows/change-impact.yml:17 calls it @main on pull_request to dev/main; gh shows successful runs 2026-06-14 (run 27485718386, 14s). (2) An AuditLog store EXISTS on the dashboard (cicd-system/dashboard/prisma/schema.prisma:265-280) and is actively WRITTEN — but only for dashboard CONTROL-PLANE actions: promotions/dispatch (controls/dispatch/route.ts:84, promotions/dispatch:72, merge-pr:86, promotion-notify:106), security finding status (security/findings/[id]/status:59, repair-queue:127), webhooks/autofix:58, webhooks/alertmanager:45. (3) MDR has its own AuditLog model (pbsiem/prisma/schema.prisma:586-598: admin_email/action/target/details/ip/created_at) + check-audit-coverage.mjs gate + check-pci-body-logging.mjs gate, both in check-all.mjs (lines 35,41). BROKEN/NOT-WIRED: (a) C1 pipeline-layer audit is a WRITE-TO-VOID — cicd-system/.github/workflows/audit-log.yml POSTs a custom payload (repo/workflow/run_id/changed_files/duration_seconds) with header X-GitHub-Event:workflow_run to /api/webhooks/github, but dashboard/src/app/api/webhooks/github/route.ts:55-64 requires payload.repository and returns HTTP 400 "Invalid payload" for that shape; the workflow swallows non-2xx as a ::warning:: (audit-log.yml:91) and still "succeeds." The Build table is populated by the org GitHub App's NATIVE workflow_run events, NOT by audit-log.yml — its data is never persisted to any audit store. MISSING ENTIRELY (verified absent in both repos via find): compliance/ directory; C2 soc2-control-map.yaml; C3 pci-control-map.yaml; D47 pci-tranzila-scope-letter.pdf; C4 docs/data-flows.md; C6 auditd configs; C7 compliance/drills/ signed reports; C8 /api/customer/audit-export (no audit-export endpoint anywhere; only admin/audit + admin clients export-request exist); C9 retention enforcement (no 7yr/5yr crons, no append-only/immutability guard on either AuditLog model); C10 docs/compliance.md. No SOC 2 / PCI / GDPR mapping artifacts exist at all.

2 Missing 1 Built · not wired 3 Built differentPillar 11 — Disaster Recovery (proven, not planned)

The one in-scope DR item is BUILT-BUT-NOT-WIRED, partially fake, and architecturally divergent from D1. (1) .github/workflows/backup-restore-verify.yml exists on plan-b-systems/cicd-system main but is `on: workflow_call` ONLY (lines 3-4). Commit 2dd43a8 ("fix: remove schedule trigger from backup-restore-verify workflow", 2026-05-28) DELETED the `schedule: cron '0 3 * * *'` (was at the top of the file) because it kept failing on cicd-system itself (no NEON_API_KEY) — with the stated intent that a consumer repo would call it. NO consumer repo ever did: grep across pbsiem and cicd-system finds ZERO callers (only ARCHITECTURE-CI.md prose references). The last 8 GH runs are all 0s `failure` (the pre-removal self-repo failures). (2) The workflow restores a Neon branch into a Neon EPHEMERAL branch (backup-restore-verify.yml:82-126), NOT "yesterday's PROD backup INTO TEST" as D1 demands — it never touches a real backup artifact or the TEST stack. (3) The "Verify row counts within tolerance" step (lines 169-193) is a NO-OP: it unconditionally writes `deviation_ok=true` whether or not data exists — the advertised tolerance check does nothing. (4) It covers only the Neon portal DB; the actual MDR product data (PostgreSQL on PROD-MDR-01, infra/prod-mdr-01/README.md) is not in this workflow. (5) docs/cicd-platform-build-plan.md:216 specifies the cron-route implementation that was never built — src/app/api/cron/backup-restore-verify/route.ts does not exist. (6) The host-side MDR PG daily backup (02:30 -> Storage Box) claimed in build.py:872 is not verifiable in any committed repo artifact — infra/prod-mdr-01/backups/ holds only one manual April pre-stage1 pgdump; no cron unit/script is committed. (7) Closest live "drill" is pbsiem .github/workflows/rollback-drill.yml (active, but workflow_dispatch only) — a Pillar 7 container-restore rehearsal on TEST, NOT the DR D4 quarterly drill (no RTO/RPO measurement, no signed report, not scheduled). (8) No DR docs exist (docs/dr.md, docs/runbooks/dr-drill.md, docs/runbooks/failover-decision.md all absent) — but those are out-of-scope per D49.

7 Missing 3 Built different 1 AlignedPillar 12 — Platform meta-pillars + ci-templates extraction (M1-M10: services.yaml, Terraform IaC+drift, runbooks-as-code, chaos game-days, status-page-to-SLO, anonymizer library, API versioning/deprecation, customer export interface, PII scan of fixtures/screenshots, and the grand-finale public plan-b-systems/ci-templates extraction with SHA-pinned consumers)

M10 partial-but-divergent: the reusable-workflow store EXISTS as plan-b-systems/cicd-system (27 reusable workflows under .github/workflows/) and is genuinely consumed — pbsiem has 30 `uses: plan-b-systems/cicd-system/.github/workflows/*.yml@main` refs (ci.yml:18, ephemeral-env.yml:19/48, post-deploy-verify.yml:26/53, prod-promote.yml:48/72, ...) and vaughnblades consumes 3 (ci/sbom/security-scan, all @main). BUT: (1) the repo is PRIVATE (gh repo list: `plan-b-systems/cicd-system ... private`), violating D51 PUBLIC; (2) NO plan-b-systems/ci-templates repo exists (gh: "Could not resolve to a Repository"); (3) NO SemVer tags / @v1.0.0 — consumers pin @main (30/30 refs in pbsiem, 0 SHA-pins, 0 @v pins), directly violating the plan's "never @main" rule even though README.md:18/24 DOCUMENTS "@<sha>... Pin by SHA, not branch. Renovate auto-updates" (discipline written, not practiced). vaughnblades (39 merged PRs, a live real product) IS the de-facto conformance test, onboarded from cicd-system. M6 anonymizer: only scripts/anonymize-seed.mjs exists — a SQL-stream REGEX scrubber over pg_dump text (emails/phones/IPs/cards), NOT the plan's schema-aware `scrub(row, schema)` library; no weekly-TEST-refresh wiring; orphaned vs the separate S2b/S2c scripts/dev-scrub logic. M1 services.yaml: ABSENT in both repos (Glob: no files) even though the agents framework + the very specs it must feed exist (agents/specs/cost-anomaly.json, oncall-investigation.json, doc-drift.json) — those agents have no catalog to read (compounding the 'agent framework built but wired to nothing' divergence). M2 Terraform: ZERO .tf files anywhere (Glob: no files); infra hand-managed. M3 runbooks-as-code: docs/runbooks/*.md exist in both repos but are free-form (auto-fix.md, link-crawl.md, ...) with NO required-section schema and NO CI validation. M4 chaos: NO chaos scripts; the only 'chaos' hits are in pbsiem's stale ARCHITECTURE-CI.md (Litmus/Stage-7 staging — a different unbuilt design) and the plan-mirror doc. M5/M7/M8/M9: not built — no status-page-to-SLO binding, no X-Deprecated/API-versioning policy, no coherent customer export surface, no PII-scan gate over fixtures/visual baselines (security-scan.yml:85 runs Trivy `vuln,secret,misconfig` which incidentally catches secrets but is NOT the planned fixture/screenshot PII gate).

4 Missing 3 Built different 2 AlignedCORE SPINE — DEV/TEST/PROD topology + dev→test→main flow + promote model + MDR engine/backend deploy path

Topology: pbsiem docs + runbook confirm 3 TEST servers + plan-b-test project (not re-verified live this session). Flow is built DIFFERENT from the plan: (1) per-PR ephemeral = portal Vercel preview + Neon branch ONLY (ephemeral-env.yml); (2) stage4.yml deploys engine+puller to the SHARED TEST stack via siem-deploy.yml with required-env-marker:TEST + TEST_MDR_DEPLOY_HOST (stage4.yml:89-127) — shared, serialized cross-PR, not per-PR isolated; (3) the plan's human merge-to-main gate is REPLACED by promotion-autopilot.yml (PROMOTION_AUTOPILOT var = on, verified live) which auto-opens dev→main promotion PRs + "fold" PRs and auto-merges them once certify is green — removing "the human merge step in the middle" (promotion-autopilot.yml:8-9); (4) PROD reached by prod-deploy.yml (push to main → vercel-gated-deploy, staging artifact, NO alias move) then a SEPARATE human Promote click (prod-promote.yml dispatched by dashboard /api/promotions/dispatch/route.ts) that moves the Vercel alias only. Live: prod-deploy run 27484292775 (PR #77) + Promote 27484423128 both succeeded 2026-06-14; fold PR #76 active. ENGINE PROD PATH MISSING: siem-deploy.yml is called ONLY by stage4.yml (TEST) and rollback-drill.yml — there is NO caller that deploys the engine/puller to PROD-MDR-01. docker-build.yml pushes to GHCR but the cicd-system docker-build reusable has NO cosign sign step, NO SLSA attestation; zero slsa-provenance/attest-build-provenance callers in pbsiem; siem-deploy.yml does NO cosign-verify before pulling. So the plan's signed-image→verified-pull engine prod spine does not exist; the engine only ever reaches TEST. Staging-data refresh MISSING: no anonymize-seed.mjs / no scrub(row,schema) library / no verify-dev-scrub / no weekly-refresh cron in pbsiem (dev DB drifts → gate flake, the P0 the owner hit). Prod-isolation gate ALIGNED-renamed: scripts/check-prod-isolation.mjs (config-driven via audit-prod-isolation.config.yaml) IS wired into scripts/check-all.mjs:52.

2 Missing 5 Built different 1 AlignedRUNBOOK accuracy — does cicd-runbook build.py mirror the locked plan AND honestly reflect current reality, or has it drifted / over-claimed

Verified in pbsiem/.github/workflows: prod-deploy.yml implements verify-BEFORE-promote (header L1-28: "production alias is STRUCTURALLY unreachable by an unverified artifact"; calls vercel-gated-deploy.yml@main with promote-after-verify:true, auto-promote:false, prod-alias siemsys.plan-b.systems) and parks a staging artifact ("READY TO PROMOTE", proven run 27328491331 per cicd vercel-gated-deploy.yml:8). prod-promote.yml is a separate workflow_dispatch human gate that re-verifies provenance, repoints the alias, then crawls the live alias (prod-promote.yml:6-72) — Gap B is CLOSED in code. stage4.yml (L41-126) DOES build mdr-engine + edr-puller at the PR SHA, applies the TEST DB schema, and deploys BOTH to TEST-MDR-01 via siem-deploy.yml@main with PLANB_ENV/health-gated rollback — full-system per-PR TEST deploy is WIRED (contradicting the gaps-tab "open" claim). certify.yml exists on PRs to main, mechanically gates on protected_merge + introducing-PR gates, and runs an AI CERTIFY/REFUSE only when ANTHROPIC_API_KEY exists (certify.yml:145-161) — dormant by design. promotion-autopilot.yml opens a dev→main PR and ARMS GitHub auto-merge once certify=CERTIFY + checks green (L92-137) — an automated promotion path the plan forbids. MISSING on critical path: (a) NO engine→PROD deploy/promote workflow — docker-build.yml builds engine/puller images on push to main but the only engine DEPLOY targets are TEST (stage4, rollback-drill); PROD-MDR-01 (178.104.172.142) appears only in rotate-secrets known-hosts, never as a deploy target. (b) NO scripts/dev-scrub/ dir and NO scheduled TEST-data-refresh / anti-drift cron in pbsiem (only rotate-secrets cron exists) — staging data goes stale, which is the documented gate-flake root cause.

Gaps — plan vs. reality, every divergence

Reality is ahead of the runbook in the TEST/ephemeral substrate but behind the LOCKED PLAN on the spine. Per-PR ephemeral env, full-system TEST deploy (mdr-engine + edr-puller + receiver + isolated per-PR backend stack), and a 5-gate battery (e2e/dast/perf/visual/links) are genuinely WIRED and green on real pbsiem PRs to dev, exceeding the runbook's stale 'portal + DB branch only' claim. BUT the locked TEST->PROD spine is broken: prod-deploy.yml and prod-promote.yml move ONLY the Vercel portal alias (verified: zero engine/docker/ssh/hetzner steps; siem-deploy.yml's env-gated PROD mode has no caller), so the engine reaches PROD only via out-of-band manual deploy with no verified-staging-to-prod promotion. That is the P0 that turned a 40-minute change into 11 hours. Compounding it: the shared TEST stack's MDR-PG/OpenSearch data is never refreshed (verified: no weekly anonymized dump), so gates flake on empty/stale data; staging is unlabeled and the only human gate is a portal-alias click. Pillar 4 quietly relaxed (lint:strict not enforced in CI, Aikido absent despite a runbook claiming 'back IN', CodeRabbit advisory, main gated by only ci/CI with 0 approvals yet enforce_admins ON). Pillar 5's battery is real but thin (1 smoke spec of ~86, 3 EN visual baselines of 24, no k6/a11y/mutation/cross-tenant/oasdiff/migration-safety). Pillar 2 and cost models (EphemeralEnvCost, AICostLedger) are largely unbuilt. Net: the engine works for TEST; the plan's release-certainty promise to PROD does not yet exist mechanically.

129 findings across 14 areas, each typed and evidenced. Missing = never built. Built · not wired = exists in cicd-system but pbsiem never calls it. Built different = exists but diverges from the plan. Aligned = matches the plan (collapsed). The Tasks tab turns these into an ordered mission list.

Pillar 1 — Environment topology & isolation (DEV/TEST/PROD, prod isolation, per-PR ephemeral, log splitter, audit gate, emergency override)

🔴 MissingEngine (mdr-engine/edr-puller) has NO promote-to-PROD path

Plan's TEST->PROD spine (lines 317-331) is 'verified build promoted to prod' for the whole system. Reality: prod-promote.yml + prod-deploy.yml move ONLY the Vercel portal alias; they contain zero engine/docker/ssh steps. siem-deploy.yml (which advertises an env-gated PROD mode) is called only by stage4/rollback-drill against TEST, never for prod. So the engine reaches PROD-MDR-01 only via an out-of-band manual deploy with no verified-staging-to-prod promotion, no provenance re-check, and no human-gated alias-equivalent. This is the core P0 divergence the owner hit.

pbsiem/.github/workflows/prod-deploy.yml (grep engine/docker/ssh/hetzner = NONE); prod-promote.yml:48-64 (vercel-promote.yml, alias siemsys.plan-b.systems only); siem-deploy callers = stage4.yml:89,113 + rollback-drill.yml:20 (all TEST)

🔴 MissingShared-TEST staging data is never refreshed (no weekly anonymized PlaySmart refresh into MDR-PG/OpenSearch)

Plan D7 + sub-deliverable I require a weekly staging-shape refresh (anonymized prod dump) so the shared TEST stack carries real-shape data and gates don't flake on an empty DB. The splitter feeds live syslog into TEST-OpenSearch, but the MDR-PG side gets only per-PR schema push (migrate-test-db) + per-PR Neon qa-seed users. No weekly refresh cron and no anonymizer library exist (only scripts/anonymize-seed.mjs seed-scrubbing). Runbook itself defers this to Phase 7. This is the 'staging data goes stale -> gates flake' P0 the owner hit.

pbsiem/.github/workflows/stage4.yml:65 (migrate-test-db = schema push only), :246 (qa-seed users); grep weekly/refresh -> only rotate-secrets.yml; build.py:818-819 ('full scrub library + weekly TEST refresh is Phase 7')

🔴 MissingEmergency-override runbook (Mike-only, post-incident audit within 24h)

Plan sub-deliverable H requires docs/runbooks/prod-emergency-override.md defining legitimate Mike-only direct-prod actions + the [EMERGENCY-OVERRIDE] commit convention + the required AuditLog entry within 24h, plus a one-line ARCHITECTURE.md reference. The runbooks dir holds 6 other runbooks but not this one; no EMERGENCY-OVERRIDE reference exists in ARCHITECTURE.md. (Partial substitute exists: the global claude-guard provides a Mike-only time-boxed override, but the documented policy/audit-trail runbook is absent.)

pbsiem/docs/runbooks/ = ci-gate-failures.md, multi-stack.md, owner-totp-recovery.md, p3b-findings-demo.md, wef-ca-offline-procedure.md, windows-event-forwarding.md (no prod-emergency-override.md); grep EMERGENCY-OVERRIDE in docs/+ARCHITECTURE.md = none

🟠 Built differentprod-isolation audit script (check-all gate #13) does only 1 of the plan's 4 checks

Plan's audit-prod-isolation.mjs was specced to (a) diff vercel preview vs production env values, (b) grep hardcoded prod indicators, (c) assert no schema.prisma DB-URL fallbacks, (d) assert no fixture uses a real prod client_id. The built script (renamed check-prod-isolation.mjs) implements ONLY (b) — a static grep — and grandfathers 89 pre-existing files via a legacy allowlist. The runtime env-var diff (the part that actually enforces 'no prod DSN/key reachable from non-prod') is absent. Worse, the config DECLARES capabilities the code never reads: known_prod_client_ids (line 28-29) and neon_hosts catch-all (line 12-13) are in audit-prod-isolation.config.yaml but never referenced by check-prod-isolation.mjs — a config-vs-code over-claim.

pbsiem/scripts/check-prod-isolation.mjs:27-30 (only urls+ips read; no neon_hosts/client_ids/vercel-env), :70-89 (grep+legacy allowlist only); scripts/audit-prod-isolation.config.yaml:12-13,28-29 (unread keys); scripts/check-all.mjs:52; prod-isolation-legacy-allowlist.txt = 89 entries

🟠 Built differentStaging is not a NAMED, labeled stage with the human-approval wired as the prod gate

Plan names the spine dev->staging->prod with one full-battery staging gate + human promote. Reality: the TEST stack functions as staging but is unlabeled as such, and the only human gate is the manual Vercel prod-promote click (portal alias). The runbook itself flags this ('TEST stack ~= staging, unlabeled', 'wire the approval as the prod gate'). The promote re-verifies Vercel artifact provenance but there is no single staging-green-then-promote step covering BOTH portal and engine.

build.py:331-332,360-362 (warnA 'name + isolate' / 'wire the approval as the prod gate'); prod-promote.yml:1-16 (portal-only human promote)

🟠 Built differentRunbook's 'Full-system TEST deploy per change' is listed OPEN but is actually DONE

The runbook gap entry (build.py:639-645) claims the per-change deploy covers 'portal + DB branch only' and that engine/puller/OpenSearch are NOT in any per-change deploy. stage4.yml proves otherwise (engine+puller deployed to TEST every PR). This is a runbook-vs-reality staleness that under-claims progress and should be corrected so the doc stops misrepresenting the state to readers/Mike.

build.py:639-645 (status 'open', 'portal + DB branch only') vs pbsiem/.github/workflows/stage4.yml:1-19,85-125 (engine+puller to TEST per PR)

✅ 3 aligned with the plan — click to expand

✅ AlignedTEST stack provisioned as a separate Hetzner CI/CD project (lean shadow + splitter + BX11)

Separate Hetzner project with TEST-siemsys-syslog + TEST-MDR-01 + splitter-syslog (UDP-514 fan-out to PROD AND TEST), network 10.1.0.0/16, dedicated BX11 box, test-siem-api DNS + LE cert all exist and are live. Matches D6/D7-splitter/D10 and sub-deliverables A/F/G/J.

pbsiem/docs/01-infrastructure.md:80-97,173-177; openssl s_client test-siem-api.plan-b.co.il -> CN=test-siem-api, Verify return code: 0; DNS -> 178.105.214.148

✅ AlignedPer-PR ephemeral env (Neon branch + Vercel preview + PR backend stack + auto-teardown)

ephemeral-env.yml creates a Neon branch from parent 'production', sets the Vercel preview override, spins a per-PR backend stack, and tears it (+Neon branch) down on PR close. Implements D5+D8. Minor naming note: the Neon parent branch is 'production' not the plan's 'dev', handled explicitly by the reusable.

pbsiem/.github/workflows/ephemeral-env.yml:19-56 (uses ephemeral-env.yml@main + pr-backend-stack.yml@main, pr-stack-down on action==closed)

✅ AlignedFull-system TEST deploy per change (engine+puller to TEST every PR)

stage4.yml rebuilds mdr-engine+edr-puller at the PR SHA and deploys both to the shared TEST stack (env-marker TEST, health+RestartCount gate, serialized cross-PR), then runs the battery against the real preview. This is built and wired — the runbook's 'open' gap entry on this is STALE and should be flipped to done.

pbsiem/.github/workflows/stage4.yml:1-19 (header 'RELEASE-CERTAINTY CORE S5 D-4=Option C'), :39-125 (build-engine/build-puller/migrate-test-db/deploy-engine/deploy-puller, siem-deploy.yml@main, required-env-marker: TEST)

Pillar 2 — Identity, secrets & trust

🔴 MissingSecret rotation EXTENSION to MDR_ENCRYPTION_KEY + bearer tokens + GH PATs + GrowthBook keys

Plan item #4 (L1565) requires extending the rotation set beyond the current 2. The script rotates only OPENSEARCH_PROXY_KEY + mdr_admin PG password. MDR_ENCRYPTION_KEY is the hard/important one (must stay in sync across Vercel + engine .env + puller); none of the additional targets are rotated. Verification criterion (L1656 — encryption key rotates and all 3 components pick it up without manual restart) is unmet.

scripts/rotate-secrets.sh rotates only 2 (lines 116-122, 206-210); no MDR_ENCRYPTION_KEY/PAT/GrowthBook handling; build.py:733-736 status=dep confirms

🔴 MissinggetSecret('X') audit-logged secret-access proxy (D13/D12, plan item #5)

No central proxy exists. src/lib/secrets/get-secret.ts is absent and there is no SECRET_ACCESS AuditLog event anywhere in pbsiem. The two getSecret() symbols found are unrelated per-file helpers, not the planned wrapper around process.env. Without this there is no central audit point for sensitive-secret reads (encryption/payment/AI keys) in PROD.

no src/lib/secrets/ dir; grep SECRET_ACCESS -> 0 hits; getSecret only in src/lib/trusted-browser.ts:21 + src/lib/twilio-callback-token.ts:32; AuditLog model prisma/schema.prisma:586; build.py:737-740 status=open ('not started')

🔴 Missingdocs/secrets.md (rotation model + human-permission tiers + OIDC-deferral note) — plan item #10

Plan calls for a single docs/secrets.md explaining the rotation model, permission tiers (Mike admin / Nadav developer / Roy viewer), and the OIDC integration/deferral. File does not exist.

ls docs/secrets.md -> NO docs/secrets.md

🟠 Built differentPer-environment GH Actions secret scoping (production/preview environments) — plan item #6

Plan wants every sensitive secret narrowed to environment scope. Only prod-deploy.yml uses a GH environment: block; the other 7 VERCEL_TOKEN-consuming workflows pull repo-level secrets with no environment gate. Rotation does keep preview/TEST creds separate from PROD, so the env separation exists at the value level but is not enforced via GH environment scoping across workflows.

grep 'environment:' .github/workflows -> only prod-deploy.yml; VERCEL_TOKEN in 8 workflows; rotate-secrets.sh keeps preview creds separate (~line 154)

✅ 2 aligned with the plan — click to expand

✅ AlignedWeekly rotation cron (the one piece the plan said was already shipped)

rotate-secrets.yml fires weekly + workflow_dispatch; rotate-secrets.sh rotates the 2 in-scope secrets, discovers hosts via rotation-targets API, logs sha256 fingerprints only, shreds plaintext on every exit. This is the 'already shipped' baseline the plan preserves (D11), and the 'no human sees prod credentials' posture holds for these 2 secrets.

pbsiem .github/workflows/rotate-secrets.yml:42; scripts/rotate-secrets.sh:122,206-210,355-359,135; src/app/api/cron/rotation-targets/route.ts

✅ AlignedOIDC federation (GH Actions -> Vercel/Hetzner/Cloudflare)

Plan D11 explicitly DEFERS OIDC to v2 and says keep long-lived tokens + rotation. Reality matches: VERCEL_TOKEN remains long-lived across 8 workflows. Not a gap — divergence is sanctioned by the plan. Do NOT build OIDC in v1.

plan L112/L1575; pbsiem grep VERCEL_TOKEN -> 8 workflow files

Pillar 3 — Per-change ephemeral environments

🔴 MissingOpenSearch PR-index teardown + ISM delete policy (min_age 1d)

Plan wants pr-{N}-test indices auto-expired by an ISM min_age:1d/delete policy as a belt-and-braces backstop. No ISM policy exists anywhere (grep for _ism/ism_policy/min_age finds only unrelated cold-archive/demo scripts). The reusable workflow CAN delete OS indices on teardown, but only when opensearch-url + OPENSEARCH_API_KEY are passed — pbsiem's caller passes neither (only backend-opensearch-url / BACKEND_OPENSEARCH_PROXY_KEY), so that teardown branch is dead for pbsiem. The GC cron touches Neon only, never OpenSearch.

pbsiem ephemeral-env.yml has no opensearch-url/OPENSEARCH_API_KEY; cicd ephemeral-env.yml:327-344 (gated on those inputs); ephemeral-env-gc/route.ts is Neon-only

🔴 MissingEphemeralEnvCost Prisma model + 30-day rolling cost + SVC-HEALTH alert at EUR80/EUR100

Plan (lines 556-577) calls for the cost cron to persist to an EphemeralEnvCost model with a 30-day rolling total and SVC-HEALTH/ServiceHealthAlert at EUR80 warn / EUR100 critical, plus a stale-PR (>30d) force-delete sweep. The MDR cost cron only counts live Neon branches, returns JSON, warns via console.warn at a hard-coded threshold of 5 branches, and persists nothing. No EphemeralEnvCost model in pbsiem prisma (grep = no matches); runbook line 724 confirms it lives only in the dashboard DB. No EUR-denominated aggregate cap, no rolling trend, no alert routing, no >30d stale sweep.

src/app/api/cron/ephemeral-env-cost/route.ts:45-68; grep EphemeralEnvCost in pbsiem/prisma = no matches; build.py:724-725

🟠 Built differentOpenSearch query-layer scoping helper getTestWriteIndex()/OPENSEARCH_PR_INDEX in src/lib/opensearch.ts

Plan §2 (lines 503-517) wants the app to read OPENSEARCH_PR_INDEX via a getTestWriteIndex() helper writing to a pr-{N}-test index. That helper does not exist — grep of src/lib/opensearch.ts and all of src returns zero matches for getTestWriteIndex/OPENSEARCH_PR_INDEX/OPENSEARCH_INDEX_PREFIX. The workflow sets OPENSEARCH_INDEX_PREFIX on the preview but nothing reads it. OS isolation is instead achieved by per-PR client-id namespacing (synthetic-seed creates pr{N}-* clients; engine index pattern logs-{client_id}-* confines reads to logs-pr{N}-*). Functionally isolating, but divergent from the locked mechanism.

grep getTestWriteIndex/OPENSEARCH_PR_INDEX across pbsiem/src = no matches; scripts/synthetic-seed.ts:8-72 (PR_NAMESPACE pr{N}); pr-backend-stack.yml:5-9 comment

🟠 Built differentAdmin UI path + dark-mode styling

Admin visibility page exists and works (reads the cost report, force-cleanup button), but lives at src/app/admin/ephemeral-envs/page.tsx instead of the plan's src/app/(admin)/admin/infra/ephemeral-envs/page.tsx, and is styled light (bg-gray-50/text-gray-500) against the dark-mode-default memory rule. Minor; functionally present.

src/app/admin/ephemeral-envs/page.tsx:24-180

✅ 3 aligned with the plan — click to expand

✅ AlignedNeon branch-per-PR create + Vercel preview override + idempotent teardown

cicd-system ephemeral-env.yml creates pr-{N} from the configured parent with a read_write endpoint (lines 160-170), sets DATABASE_URL/PR_NUMBER/OPENSEARCH_INDEX_PREFIX/EPHEMERAL_ENV per branch, and on close deletes the branch + sweeps all gitBranch-scoped Vercel vars. Hardened beyond plan: no-fallback parent guard + EPHEMERAL_PAUSED gate (red-team fix). Running green on every pbsiem PR to dev.

cicd .github/workflows/ephemeral-env.yml:109-347; pbsiem .github/workflows/ephemeral-env.yml:19-43; gh run 27485718373 success

✅ AlignedFull-system per-change deploy (engine+puller+receiver+per-PR backend stack) wired to the ephemeral preview

EXCEEDS the plan. stage4.yml builds engine/puller at PR SHA, deploys to TEST (health-gated, PLANB_ENV=TEST assert), schema+seeds a dedicated per-PR postgres, and runs every gate against the PR's real preview. This is the release-certainty core the plan only gestured at. Note the runbook's own 'open' gap (build.py:639-645) describing this as missing is STALE.

pbsiem .github/workflows/stage4.yml:84-339; cicd pr-backend-stack.yml; gh run 27485718389 jobs all green (deploy-engine/deploy-puller/pr-stack/pr-services)

✅ AlignedDeterministic seed fixtures per ephemeral env

Plan wants deterministic seed fixtures. ephemeral-env.yml seeds the Neon branch via scripts/qa-seed.ts (synchronize re-seeds so a PR that edits the seed re-runs it); the per-PR backend stack runs synthetic-seed.ts with deterministic synth-det-* ids. Honest prisma db push (no --accept-data-loss) is the post-#101 fix.

pbsiem ephemeral-env.yml:25; cicd ephemeral-env.yml:252-277; scripts/synthetic-seed.ts:44-73

Pillar 4 — Quality gates (defense in depth)

🟠 Built differentlint:strict (lint-warnings-as-errors) in CI and pre-push (plan sub-deliv C/D)

Plan requires lint to run as `next lint --max-warnings 0` in both pre-push AND CI. Reality: package.json `prepush` calls `npm run lint` (plain), and pbsiem ci.yml passes `lint-command: "npm run lint"` overriding the reusable workflow's `lint:strict` default. So eslint warnings never block — the plan's 'warning → lint:strict fails' verification (L820) would NOT fail today. README L28 and runbook build.py:697 both claim lint:strict is in effect — both over-claim.

pbsiem origin/main package.json scripts.prepush + ci.yml `lint-command: "npm run lint"`; cicd-system .github/workflows/ci.yml default `lint-command: npm run lint:strict`; README L28; build.py:697

🟠 Built differentCI as 6 separate required checks (plan sub-deliv D + branch protection I)

Plan specifies install/lint/typecheck/check-all/test/build as six distinct jobs, each a named required status check. Reality: a single reusable 'ci / CI' job runs them as sequential steps, exposed as ONE required check. Functionally the steps all run and block (good), but the plan's granular required-check names (Lint, Typecheck, check-all, Test, Build) do not exist, so branch-protection contexts can't match the plan and per-stage visibility/parallelism is lost. This is the deliberate cicd-system reusable-workflow drift (build.py:179).

pbsiem origin/main ci.yml (uses plan-b-systems/cicd-system/.github/workflows/ci.yml@main); cicd-system ci.yml single job name CI; gh api branches show only `ci / CI`

🟠 Built differentAikido SAST gate (plan sub-deliv E, D-table)

Plan + D-table call for an Aikido workflow (security-review.yml). Reality: Aikido is entirely absent from the wired pipeline — cicd-system security-scan.yml header states 'Replaces Aikido (D54 superseded)' and runs Semgrep + Trivy only. The Semgrep+Trivy gate IS wired and enforcing (enforce:true, block CRITICAL/HIGH, allowlist), so security SAST/dep/secret coverage exists — but NOT via Aikido. CRITICAL: the runbook (build.py:169) claims 'Aikido Back IN as a FREE blocking gate' which is FALSE in code, and build.py:586 simultaneously claims 'after Aikido removal' — the runbook contradicts itself and reality. Mike's 2026-06-11 decision to re-add Aikido was never implemented.

cicd-system .github/workflows/security-scan.yml L1,L3 'Replaces Aikido (D54 superseded)'; pbsiem security-scan.yml enforce:true; build.py:169 vs build.py:586 (contradiction)

🟠 Built differentCodeRabbit request_changes_workflow blocking merge (D15, sub-deliv F)

Plan D15 mandates .coderabbit.yaml with request_changes_workflow:TRUE so a CodeRabbit 'request changes' DISABLES the merge button, and CodeRabbit as a required check. Reality: the App is installed and posts reviews (advisory), but .coderabbit.yaml was REMOVED (commit fb8a2fb) so request_changes_workflow is unconfigured/false, and CodeRabbit is NOT a required check on dev or main. This is a deliberate re-decision (build.py:171 'FREE tier, advisory only, never a required gate') superseding D15 — but the plan is still the locked source of truth, so it's a divergence to either ratify or bend back.

pbsiem .coderabbit.yaml absent on origin/main (git cat-file fail; removed in fb8a2fb); PR#78 check 'CodeRabbit pass / Review skipped'; gh api protection (no CodeRabbit context); build.py:171

🟠 Built differentmain branch: 1 required approving review (plan sub-deliv I)

Plan requires main to need 1 approving review from Mike. Reality: main req_approvals=0. This is the documented re-decision (build.py:173 'No second approver; Mike's chat instruction after green gates authorizes protected merges') — but means GitHub itself enforces no human review on main; the 'human gate' is procedural (Mike's word), not mechanical. Divergence to ratify or bend back.

gh api repos/plan-b-systems/pbsiem/branches/main/protection req_approvals=0; build.py:173; plan L774-776

🟠 Built differentmain branch: admin no-bypass OFF (plan sub-deliv I, ties to emergency-override runbook D18)

Plan explicitly sets enforce_admins/no-bypass OFF on main so Mike can do emergency overrides via the documented runbook. Reality: main enforce_admins=TRUE (admins CANNOT bypass). This is stricter than the plan (arguably safer, and consistent with the global claude-guard layer) but it means the plan's emergency-override path is blocked at the GitHub layer — the override now depends on temporarily toggling protection rather than admin fast-forward.

gh api branches/main/protection enforce_admins=True; plan L776; build.py:701 claims this as 'verified DONE'

🟠 Built differentmain required checks weaker than dev (sequencing/coverage gap)

Plan says main = dev's checks PLUS review. Reality: main requires ONLY `ci / CI`, while dev requires 10 checks including security-scan, Stage-4 (dast/e2e/perf/visual), and links. So a hotfix PR straight to main bypasses security-scan, Stage-4, and link-crawl entirely at the protection layer. In practice main is reached via promotion of already-gated dev commits (autopilot folds), but the protection config alone does not guarantee main-bound code passed those gates.

gh api branches/main/protection contexts=['ci / CI'] vs branches/dev/protection 10 contexts

🟠 Built differentARCHITECTURE.md gate documentation (sub-deliv J)

ARCHITECTURE.md L133 still describes the OLD design: '6 parallel jobs' and 'pre-merge (branch protection: all checks + Aikido + CodeRabbit)'. Reality is 1 combined CI job, Semgrep+Trivy (no Aikido), CodeRabbit advisory. Doc is stale/misleading; ci-gate-failures.md and README Contributing exist (those parts aligned).

pbsiem origin/main ARCHITECTURE.md L133; docs/runbooks/ci-gate-failures.md present

✅ 2 aligned with the plan — click to expand

✅ AlignedHusky local hooks (Layer 1)

pre-commit (lint-staged), pre-push (prepush), .nvmrc, engines, husky+lint-staged devDeps all present and wired on origin/main; --no-verify documented in README pointing to prod-emergency-override.md. Minor drift: .nvmrc=22.16.0 vs plan 22.11.0 (harmless, both Vercel-compatible).

pbsiem origin/main .husky/pre-commit=`npx lint-staged`, .husky/pre-push=`npm run prepush`, package.json husky^9.1.7/lint-staged^15.4.3; README L19-48

✅ AlignedCross-component contract gate framework + crypto-envelope (D17, sub-deliv G+H)

scripts/check-contracts/ generic runner is built, auto-discovered by check-all.mjs, and ENFORCED in CI via the check-all step. crypto-envelope.mjs is the v1 contract. Exactly matches the plan including the reusability intent.

pbsiem origin/main scripts/check-all.mjs L (readdirSync contractsDir + run each) ; scripts/check-contracts/crypto-envelope.mjs present; ARCHITECTURE.md L134

Pillar 5 — Production-shaped rehearsal (Stage 4)

🔴 MissingCross-tenant isolation suite (D23: top-12 sensitivity endpoints, catches CR-03 IDOR)

No tests/e2e/security/ directory, no cross-tenant-*.spec.ts, zero grep hits. This is the highest-value security gate in the pillar (mechanically prevents the IDOR/cross-tenant-leak class) and it does not exist anywhere. Runbook correctly marks it open.

pbsiem: tests/e2e/security/ absent; grep cross-tenant = 0; build.py:714-719; plan lines 991-1007

🔴 MissingAPI contract gate (D20: zod-to-openapi spec + oasdiff breaking-change detection)

No zod-to-openapi, no oasdiff, no openapi.yaml anywhere in pbsiem or cicd-system. The Phase-5a Zod-audit prerequisite was never started. Breaking API changes are entirely unguarded. Runbook marks it open.

grep zod-to-openapi|oasdiff|openapi in pbsiem = 0; cicd-system reusable workflows have no contract gate; build.py:720-722; plan lines 919-933

🔴 MissingMigration safety gate (prisma migrate diff + lock-impact)

No scripts/check-migration-safety.mjs; risky ALTER/DROP statements are not detected. Partial substitute exists: stage4 migrate-test-db does an honest prisma db push WITHOUT --accept-data-loss (so a destructive change fails loudly on TEST), but that is not the planned diff-based, expand-contract-advising gate and does not run on schema-only PRs as a check.

pbsiem scripts/ has only check-contracts/crypto-envelope.mjs; stage4.yml:74-82; plan lines 1025-1037

🔴 MissingAI cost guard + AICostLedger (D24: $0.50/run, $5/PR, $50/repo/mo)

No AICostLedger Prisma model in pbsiem; no per-run/per-PR/per-repo cap enforcement wired into Stage-4. Runbook notes cost-guard code exists in the agents framework but the ledger model is absent in MDR (only EphemeralEnvCost exists in the dashboard DB) — i.e. write path and caps are not wired to the rehearsal stage.

grep AICostLedger in pbsiem/prisma = 0; build.py:723-726; plan lines 1039-1061

🔴 MissingGap A — Stage-4 battery is NOT re-run on the dev->main promotion (the prod gate)

main branch protection requires ONLY 'ci / CI'. The full prod-mirror battery (e2e/dast/perf/visual/links) runs on PRs into dev but is never re-executed before code reaches production main. This is the P0 critical-path gap the owner hit: an unverified-against-the-battery change can reach prod via the promotion PR. The plan's safety rule 1 ('production structurally unreachable by an unverified change') and rule 2 ('staging is a faithful mirror, re-tested before prod') are not enforced at the promote step.

gh api repos/plan-b-systems/pbsiem/branches/main/protection required_status_checks.contexts = ["ci / CI"]; build.py:302-303 (Gap A explicit); stage4.yml:22-23 triggers only on PRs to dev

🔌 Built · not wiredAccessibility (axe-core, WCAG 2.1 AA, zero violations)

A complete axe-core reusable workflow exists in cicd-system (a11y-i18n.yml accessibility job) but pbsiem NEVER calls it — not in stage4.yml, not as a required check. The plan makes it a merge-blocking gate on 12 critical-path pages. Zero coverage today.

cicd-system a11y-i18n.yml:111-164 (axe-core job exists); grep a11y/axe in pbsiem/.github/workflows = only i18n-parity.yml (i18n job only); plan lines 960-977

🔌 Built · not wiredMutation testing (D22: Stryker, ratchet 60%->70%)

A full Stryker reusable workflow exists (mutation-test.yml, vitest runner, configurable mutate-paths + threshold) but pbsiem NEVER calls it and it is not scheduled nightly on dev as the plan specifies. No stryker.config in pbsiem. Zero mutation coverage despite the tool being built. The runbook scorecard omits it entirely.

cicd-system mutation-test.yml (complete); grep mutation|stryker in pbsiem/.github/workflows = 0; plan lines 1009-1023

🟠 Built differentPlaywright E2E gate (Plan A: the existing 86 specs)

The e2e gate is wired and required, but runs a SINGLE smoke spec (dev-environment.spec.ts) out of 90 spec files. The plan requires the full ~86-spec suite. The workflow comment itself flags this as deliberate-temporary pending deterministic ephemeral-DB seeding (now built via pr-db/synthetic-seed.ts, so the blocker is largely cleared).

pbsiem/.github/workflows/stage4.yml:208-220 (test-pattern: tests/e2e/dev-environment.spec.ts); find tests/e2e -name *.spec.ts = 90; cicd-system playwright-e2e.yml

🟠 Built differentVisual regression (D19: 12 pages x EN+HE = 24 baselines)

Wired and required, honest comparison (CI=1 fails on missing snapshot, no --update-snapshots vacuity). But covers only 3 public pages (login, pricing, status) in EN/Linux only = 3 baselines, vs the locked 24. No authed/admin/MDR pages, no HE locale.

pbsiem/.github/workflows/stage4.yml:256-269; tests/e2e/visual-regression.spec.ts-snapshots/{login,pricing,status}-chromium-linux.png; plan lines 900-913

🟠 Built differentPerformance budget (D21: Lighthouse + k6, rolling baseline from main)

Lighthouse runs on 2 public URLs (/ and /login) against a static budget file. k6 hot-endpoint load test is not wired (k6-script-path empty in the caller; no tests/perf/*.k6.js). No rolling baseline from main — uses a fixed configs/lighthouse-budget.json. The reusable workflow supports k6, so this is an unwired half.

pbsiem/.github/workflows/stage4.yml:231-238; cicd-system perf-budget.yml:133-156 (k6 job gated on k6-script-path != ''); plan lines 949-958

🟠 Built differenti18n key parity wired as a required gate

i18n-parity.yml calls the reusable a11y-i18n.yml (i18n job only), works and is green. But it is path-gated to src/lib/i18n/** so it doesn't fire on most PRs, and it is NOT in dev/main branch protection — so it is not a blocking Stage-4 gate as the plan requires (plan: wired into check-all, required).

pbsiem/.github/workflows/i18n-parity.yml (paths: src/lib/i18n/**); dev required checks list has no i18n context; plan lines 979-989

✅ 1 aligned with the plan — click to expand

✅ AlignedFull-system TEST deploy + per-PR isolated backend stack as the substrate for the battery

stage4.yml builds mdr-engine + edr-puller at the PR SHA, applies MDR-PG schema, deploys both to the live TEST stack (health-gated with RestartCount=0, PLANB_ENV=TEST identity assert, auto-rollback), syncs the syslog receiver, AND spins a per-PR isolated postgres+engine+puller stack seeded with synthetic data (synthetic-seed.ts). This matches and exceeds the plan's Pillar-1/Pillar-5 handshake — the battery genuinely runs against a prod-shaped system, not just a portal preview. Live and green on real PRs.

pbsiem/.github/workflows/stage4.yml:41-163,271-339; gh run list stage4.yml = success 2026-06-14T02:17Z; cicd-system siem-deploy.yml + pr-backend-stack.yml

Pillar 6 — Supply-chain integrity (SBOM, SLSA provenance, Sigstore image signing, Renovate, dependency-audit, supply-chain risk score)

🔴 MissingSigstore/cosign signing of mdr-engine + edr-puller images — MISSING from the build path

Plan requires cosign keyless-OIDC signing of engine + puller images pushed to GHCR. docker-build.yml builds+pushes with no signing step; images on GHCR are unsigned. No cosign in any workflow.

docker-build.yml:74-95 (build-push, no sign); pbsiem/docker-build.yml:52-67 calls it unchanged; grep cosign over cicd-system => docs + findings route only

🔴 MissingDeploy-time signature verification — MISSING (engine deploy pulls unverified images)

Plan: deployment scripts MUST verify signatures before pulling. siem-deploy.yml streams/pulls the image and runs it with no cosign verify gate — the engine PROD deploy path (P0 critical-path subsystem) trusts whatever tag it is handed.

siem-deploy.yml:109-191 (ssh-load + docker compose up, no verify); plan line 1609 'verify signatures before pulling'

🔴 MissingRenovate config + weekly dependency-audit cron — MISSING

Plan D26 wants .renovaterc.json (manual-merge during build) + weekly npm audit/osv-scanner cron with auto-PR. Neither repo has a renovate config or a dep-audit cron workflow.

no .renovaterc* in either repo (Glob); ls => none; no osv/npm-audit cron workflow; build.py:771-772 status:open

🔴 MissingSupply-chain risk score dashboard pane — MISSING

Plan wants a 0-100 per-repo-per-week risk score aggregating scanner+audit+Dependabot findings. No such pane exists; the only dashboard 'provenance' label is the unrelated Vercel promotion-PR tip check.

dashboard promotions/page.tsx:201,466 (matchesMainTip 'provenance' is Vercel-PR, not SLSA); no risk-score component found; plan lines 1599,1612

🔴 Missingdocs/supply-chain.md (external-verification guide) — MISSING

Plan calls for docs/supply-chain.md explaining SBOM/SLSA/Sigstore + how to externally verify an artifact. File does not exist.

plan line 1614; no docs/supply-chain.md in cicd-system

🔌 Built · not wiredSLSA L3 build provenance workflow exists but has zero callers (built, not wired)

slsa-provenance.yml is a correct reusable (attest-build-provenance@v2 + gh attestation verify) but NOTHING calls it: 404 in pbsiem, no org caller. Its release-attach step can never fire because pbsiem has no releases. Runbook marks this DONE — an over-claim.

cicd-system/.github/workflows/slsa-provenance.yml:46-77; gh: slsa-provenance.yml 404 on pbsiem default branch; gh search code (org) => 0 hits; build.py:784-787 (status:done)

🟠 Built differentSBOM retention = 90-day artifact instead of FOREVER on Storage Box + not attached to any GH Release

Plan D27 mandates forever-retention on Storage Box and SBOM attached to every GH Release. Reality: a 90-day GH artifact that auto-expires and is never archived; no release-attach (no releases exist at all).

sbom.yml:68 (retention-days: 90); plan lines 143,1606,1625; gh release list (pbsiem) empty

🟠 Built differentAikido status: code says OUT (Semgrep+Trivy), runbook says 'Back IN' — internally contradictory

Pillar 6 plan lists Aikido SAST/secret/dep-CVE/IaC. Reality: security-scan.yml explicitly 'Replaces Aikido (D54 superseded)' and runs Semgrep+Trivy only. build.py simultaneously claims Aikido is back IN (line 169) and removed (lines 586-587). Real coverage differs from plan and the runbook self-contradicts.

security-scan.yml:1-3,70-85; build.py:169 vs build.py:586-587; plan line 37

✅ 2 aligned with the plan — click to expand

✅ AlignedSBOM generation (CycloneDX) wired on pbsiem main

sbom.yml reusable is real and invoked by pbsiem on every push to main; live runs succeed (2026-06-14). This is the one genuinely working, wired piece of Pillar 6 — though it covers only npm portal deps, not the engine images.

cicd-system/.github/workflows/sbom.yml:55-68; pbsiem/.github/workflows/sbom.yml:9; gh run list sbom.yml => success 2026-06-14T01:06

✅ AlignedSLSA Source Track L1 via branch protection — matches the deferred plan

D25 deferred signed-commits-on-main to v2 and targets Source Track L1 satisfied by branch protection alone. Branch protection is enforced (promotion-PR-only path to main). This matches the plan's chosen scope.

plan lines 141,1608,1618; pbsiem prod-deploy.yml:4-8 (main reachable only via certified branch-protected promotion PR)

Pillar 7 — Progressive delivery & safe rollback

🔴 MissingGrowthBook self-hosted feature-flag platform

Plan requires GrowthBook self-hosted (D-E, lines 1399, 1420-1426) as the foundation of the whole pillar. Only a setup doc exists; no instance is deployed and no SDK is integrated in portal/engine/puller.

configs/growthbook-setup.md is markdown-only; curl to gb.plan-b.co.il / growthbook.plan-b.co.il / gb-api.plan-b.co.il:3100 all return 000 (connection fail); no @growthbook/growthbook dependency wired.

🔴 MissingCanary controller cron /api/cron/canary-ramp + /admin/canary UI

Plan's stateful controller (reads ramping flags, advances on time + no-SLO-breach, manual pause/resume/skip) does not exist. The canary-gate workflow's in-job 13h sleep loop is a degenerate substitute, not the planned cron-driven controller, and isn't wired anyway.

plan lines 1433-1439; no cron route nor admin page found in pbsiem (grep /admin/canary empty).

🔴 MissingExpand-contract migrations enforced by Migration Review Agent

Plan requires blocking risky ALTERs (ADD COLUMN NOT NULL no-default on large tables) and suggesting expand-contract; tables >100k rows must declare phases. Reality: cicd change-impact.yml only adds a 'schema-changing' label to prisma paths — no migrate-diff, no lock-impact, no ALTER block, no expand-contract. pbsiem change-impact doesn't reference prisma at all. No agent.

cicd change-impact.yml lines 71-76 (label only); grep for expand-contract/migration-review/ALTER in both repos returned empty; plan lines 1037, 1449-1450, 1032, 1518.

🔴 MissingDefault-flag-every-PR policy (D31) + no-flag-needed opt-out

No automation creates a per-PR GrowthBook flag and no workflow honors a no-flag-needed label, because GrowthBook and the flag convention are unbuilt.

configs/growthbook-setup.md lines 69-75 describe the convention as intent only; no PR workflow creates flags (grep empty).

🔴 MissingCanary AuditLog event types (CANARY_RAMP_ADVANCE/HALT, CANARY_ROLLBACK, PROD_DEPLOY_ROLLBACK)

Plan requires every ramp/rollback decision written to AuditLog. None of the named event types exist in pbsiem.

grep for CANARY_RAMP/CANARY_ROLLBACK/PROD_DEPLOY_ROLLBACK across pbsiem .ts/.tsx/.prisma returned empty; plan line 1473.

🔌 Built · not wiredcanary-gate.yml reusable (GrowthBook ramp + flag-flip rollback)

Real, plan-shaped ramp logic exists (D28 default schedule baked into the input default; D29 error-rate-vs-baseline gate; flag-disable on breach) but is a workflow_call invoked by nothing real. No merge-to-main / prod-deploy path calls it.

cicd .github/workflows/canary-gate.yml lines 35-39, 128-219; only caller is configs/example-callers/canary-caller.yml; grep of pbsiem workflows for canary/growthbook returned empty.

🟠 Built differentAuto-rollback on SLO breach (D29 triple-trigger + burn-rate consumer)

Plan wants ANY of {err 2x/15m, p99 2x/5m, synthetic 3x} feeding /api/cron/burn-rate-rollback to flip a flag, or alias-switch for structural breaches. Reality: post-deploy-verify does a single point-in-time health + error-rate-ratio check and a Vercel promote-previous, but SEALED to auto-rollback=false on prod (S6 rule). No p99 trigger, no synthetic-3x trigger, no burn-rate consumer, no flag-flip path.

post-deploy-verify.yml lines 29-33 (default false), 104-128 (single ratio check), 130-183 (promote-previous), 191; plan D29 line 150 + lines 1442-1444, 1468-1470.

🟠 Built differentscripts/rollback-prod.mjs Vercel alias switch <30s

Plan names a dedicated script + /admin/deploy/rollback button for <30s structural rollback. No such script exists; the alias-repoint primitive is implemented in cicd vercel-promote.yml and is wired as a human PROMOTE (prod-promote.yml), not as a rollback lever. There is no manual rollback button.

find scripts/deploy shows no rollback-prod.mjs; pbsiem prod-promote.yml uses cicd vercel-promote.yml for alias repoint as promote/discard; plan lines 1452-1456.

✅ 1 aligned with the plan — click to expand

✅ AlignedTEST container rollback drill (engine restore)

Not the canary mechanism, but the one safe-rollback capability that IS built, wired and PROVEN: rollback-drill.yml restores the engine at a known-good tag via siem-deploy with a PLANB_ENV=TEST guard; S6 proved restore in ~1s (round-trip 21s) vs the 5-min target. This is adjacent to (not a substitute for) the planned alias/flag rollback.

pbsiem rollback-drill.yml lines 1-46; build.py S6 line 844/923-924 records the drill result and the prod-never-auto-rollback seal.

Pillar 8 — Observability, SLOs & cost attribution (D33-D38)

🔴 MissingOTel instrumentation in portal + MDR engine + edr-puller with OTLP/gRPC to collector + W3C propagation (D35)

The plan's core wiring does not exist. pbsiem/src/instrumentation.ts only loads Sentry; no @vercel/otel, no @opentelemetry/sdk-node, no OTLP exporter, no TraceContext propagation in any app. The collector receives no app telemetry and discards traces to debug. Without this, golden signals for the actual product are uncollected.

pbsiem/src/instrumentation.ts:8-15; collector.yaml:33-36 (traces->debug); grep @opentelemetry in pbsiem src outside node_modules = 0; build.py:741-745 flags this OPEN

🔴 MissingPer-customer + per-feature cost attribution via client_id + feature_flag labels (D37)

The plan's headline novel deliverable (critical for SIEM usage-based pricing) is absent. No client_id or feature_flag label anywhere in observability; /api/metrics emits only cicd_* platform series; the only cost number is ephemeral-env EUR. No 'Cost per client' / 'Cost per feature' Grafana dashboards. The cost-exporter build.py claims is 'deployed on Hetzner' has no source in the repo — an over-claim.

grep client_id|feature_flag in observability/ = no matches; dashboard/src/app/api/metrics/route.ts:312-336 (only cicd_weekly_estimated_cost_eur); Glob **/cost-exporter* = none; build.py:506 over-claim vs build.py:748 honest 'open'

🔴 MissingPROD-obs-stack provisioning (D38)

Only the TEST/Hetzner stack exists. No PROD obs stack. Per D38 this is correctly deferred until TEST is stable, so this is on-plan-not-yet-due rather than a defect — but it remains unbuilt.

single docker-compose.yml on Build-Runner; no prod obs infra; plan line 1350 (2-4 weeks after TEST stable)

🔌 Built · not wired6 MDR SLO burn-rate alerts actually firing on real traffic (D34)

slo-definitions.json + generate-slo-alerts.mjs render 11 valid Prometheus rules, but they query http_requests_total{slo=...}/http_request_duration_seconds_bucket{slo=...} which NOTHING emits (no OTel). The rules are loaded but permanently evaluate against empty series — they cannot fire. The SLO definitions are built; the measurement substrate is not.

observability/prometheus/rules/slo-alerts.yml:7,15 (queries http_requests_total{slo}); no emitter — depends on the missing OTel above

🟠 Built differentDashboard SLO surface shows MDR SLOs

The dashboard computes and displays 3 CI/CD-PLATFORM SLOs (CI success rate, CI p90 duration, deployment success rate) from its own build/deploy rows — a useful platform meta-SLO view, but NOT the 5/6 MDR product SLOs the plan names (login/license/mdr-api/alert-delivery/ingestion). Two disjoint SLO worlds exist: the JSON (alert rules, no data) and the dashboard (real data, different SLOs).

dashboard/src/lib/slo-compute.ts:65-93; dashboard/src/app/api/slo/route.ts:24-60

🟠 Built differentSentry sampling + client_id (D36)

tracesSampleRate is 0.05 in prod (plan: 0.1) and Session Replay is DELIBERATELY disabled (plan: replaysSessionSampleRate 0.01) — a defensible PII decision by Mike for a security product that renders customer log/alert content. beforeSend scrubs but does not attach client_id. Divergent from the literal decision but the replay call is intentional and arguably correct; only the client_id tagging and trace rate are true gaps.

pbsiem/sentry.client.config.ts:11-16; sentry.server.config.ts:11; plan lines 1254-1256

🟠 Built differentBetter Uptime synthetic monitoring — 6 probes, 1-min intervals

No Better Uptime integration. 'Synthetic monitoring' is Prometheus blackbox probes against only 2 URLs (siemsys health + cicd login). Misses the 6-critical-endpoint, 1-min-interval external-vantage coverage the plan specifies.

observability/prometheus/prometheus.yml:46-53 (2 blackbox targets); Better Uptime hits only in CI workflows/docs, not a monitor integration

🟠 Built differentBurn-rate alerts route to pbsiem SVC-HEALTH (/admin/infra/service-health + SMS/email)

Alertmanager posts to ALERT_WEBHOOK_URL = the CI/CD dashboard webhook (writes ALERTMANAGER_WEBHOOK audit rows + dashboard alert pane). The plan routes burn-rate alerts into pbsiem's existing SVC-HEALTH surface with SMS+email mutual-fallback dispatch. Different sink; no SMS/email page path on burn-rate today.

observability/alertmanager/alertmanager.yml:18-26; dashboard /api/metrics:104-108 (ALERTMANAGER_WEBHOOK rows); plan line 1327 (/api/cron/burn-rate-webhook -> ServiceHealthAlert -> SVC-HEALTH)

✅ 1 aligned with the plan — click to expand

✅ AlignedSelf-hosted Prometheus+Grafana+Alertmanager+OTel collector+blackbox on TEST/Hetzner (D33)

The stack is built and live exactly as D33 specifies — self-hosted on the Hetzner CI/CD obs host, ~EUR4/mo footprint, 30d retention. grafana.plan-b.co.il/api/health = 200 (verified by curl). This is the genuinely-done core of Pillar 8.

observability/docker-compose.yml; observability/README.md:6-21; live curl grafana.plan-b.co.il/api/health => {database:ok,version:11.3.1}

Pillar 9 — AI agent fleet with bounded authority

🔴 MissingF4 — <untrusted> prompt-injection channel

No <untrusted> wrapping exists in code anywhere. The plan requires every untrusted input (PR text, commit message, log line) wrapped so the agent treats it as data. The one live AI surface (promotion-verdict) ingests the raw commit message straight into the prompt with no markers — an open injection path.

grep untrusted|prompt.?inject → only docs/cicd-platform-build-plan.md; dashboard/src/lib/promotion-verdict.ts:35-68 (commitMessage injected raw); plan lines 1699, 1761

🔴 MissingD43 — agent-to-agent graph cap (max 3 hops, max $5 cascade cost)

No multi-agent orchestration, hop counter, or cascade-cost cap exists. The cost guard is per-run/per-PR/per-repo only; there is no notion of a chain root or cascade budget.

grep graph.?cap|cascade|max.?hops → only docs; cost-guard.ts has no cascade concept; plan lines 174, 1700

🔴 MissingPer-agent golden-input regression suites

Plan requires a golden-input regression suite per agent; none exist. No golden fixtures, no test harness for the specs.

grep golden → only docs + specs' description text; no test files under agents/; plan line 43, runbook S11

🔴 MissingD44 — monthly post-mortem + agent-off day cadence

No scheduled post-mortem (20-decision sample) and no agent-off-day mechanism (disable all Tier1+ for 24h). README documents a manual 'monthly review process' but nothing schedules or enforces it; no docs/runbooks/agent-post-mortem.md.

agents/README.md:144-153 (manual prose only); no cron/workflow for it; plan lines 175, 1780-1782

🔴 MissingAgentTrustState + AICostLedger persistent models

Plan/runbook S10 calls for durable trust + cost-ledger models so state survives across runs and feeds the dashboard. State today is ephemeral local JSON; the pbsiem Prisma schema has no AgentTrustState/AICostLedger (only unrelated TrustedDevice / Persona-4 trust period).

pbsiem prisma/schema.prisma grep agent|trust → TrustedDevice:197, trust_period_ends_at:1057 (unrelated); runbook S10 line ~1021 'AgentTrustState + AICostLedger models'

🔴 MissingRetrofit the 4 existing agents (qa-runner, code-reviewer, monday-process, mike-bot) into the framework with declared tiers

None of the 4 existing .claude agents are tier-declared or routed through the runtime; they run as before, outside the bounded-authority layer.

no specs for them in agents/specs/ (only the 10 new); runbook gap line ~795; plan lines 1697, 1836-1840

🔌 Built · not wiredBounded-authority runtime (tier+allowlist+cost enforcement, 10 specs)

Framework is real, coherent code with correct tier/threshold/cost-cap numbers, but createAgentRuntime/loadAgentSpec have zero callers outside the framework+docs. No workflow, dashboard route, script, or pbsiem code invokes it; no root package.json/tsconfig compiles it; no tests. It runs nothing.

agents/framework/runtime.ts:166 (createAgentRuntime), :69 (loadAgentSpec); grep createAgentRuntime|loadAgentSpec|agents/framework → only docs/cicd-platform-build-plan.md, agents/README.md, framework's own files; runbook line ~509 'Not wired to run yet'; no root package.json

🟠 Built differentD39 — build on the Claude Agent SDK (allowedTools at API layer + PreToolUse/PostToolOutput hooks)

The plan's core insight — let the SDK refuse to expose un-allowed tools at the API layer — is not used. The runtime is hand-rolled on no SDK at all: allowedTools is a JSON string[] checked in JS, hooks are inline functions. This is the exact divergence the plan warned against (raw vs SDK), except it's not even raw Anthropic SDK — it's a bespoke validator the model never actually sees.

agents/framework/runtime.ts:201-228 (manual tierToolSet/allowedToolSet .has() checks); no @anthropic-ai/claude-agent-sdk or @anthropic-ai/sdk dependency (grep agent-sdk|claude-agent → only docs); plan lines 1727-1734

🟠 Built differentEarned-trust SHADOW→LIVE auto-graduation (D41) + correct override metric

Thresholds match the plan, but there is NO SHADOW/LIVE mode field; agents can't run in shadow nor auto-graduate (checkEarnedTrust only reports eligibility, README says promotion is manual). And overrideRate measures 'human overrode a tier DENIAL' rather than the plan's 'human did something different than the agent suggested' — a weaker, mostly-zero signal.

agents/framework/telemetry.ts:122-214 (checkEarnedTrust reports only), README.md:119; AgentInvocation.overridden = tier-denial override (types.ts:86-87, runtime.ts:289); plan lines 1698, 1753-1757

🟠 Built differentD42 — telemetry extends Pillar 8 OTel + Prometheus + Grafana

Agent telemetry is written to local gitignored JSON files only; it does not emit the plan's named OTel metrics (agent_invocation_total{...}) into Prometheus/Grafana. The Pillar-8 stack exists on Hetzner but the agent fleet is not wired into it.

agents/framework/telemetry.ts:23-24 (INVOCATIONS_FILE local json), .gitignore 'agents/telemetry/*.json'; plan lines 173, 1768

🟠 Built differentTier-3 Release Agent as the AI-certification-before-prod gate

The plan makes the Tier-3 Release Agent the positive 'will not break production' certifier consuming a full evidence bundle. What actually runs is dashboard promotion-verdict.ts: two-LLM ADVISORY verdicts (Anthropic+OpenAI, shadow-scored, never gating) on a minimal evidence object — wired into the Promotions pane/scoreboard but bypassing the framework entirely (no tier runtime, no cost guard, no allowedTools, no full evidence bundle). It realizes the SPIRIT (human holds the gate, AI shadow-scored first) but diverges from the architected agent.

dashboard/src/lib/promotion-verdict.ts:1-151 (raw fetch to Anthropic/OpenAI, AuditLog AI_VERDICT, advisory-only); /api/promotions/verdict/route.ts; release.json spec is Tier3 deploy_canary/rollback — unused; plan lines 1458-1461, 564

✅ 1 aligned with the plan — click to expand

✅ AlignedTier 0-4 taxonomy + per-tier cost caps + zero Tier-4 agents

The tier enum, tier→tool permission map, the D41 promotion thresholds, and the D45 cost caps ($0.50/run, $5/PR, $50/repo-mo; release elevated to $2/$10) all match the plan exactly, and v1 declares zero Tier-4 agents. This part is faithfully modeled — it just isn't invoked.

types.ts:12-23 (Tier 0-4), runtime.ts:24-58 (TIER_PERMISSIONS), telemetry.ts:30-34 (thresholds), cost-guard.ts:30-34 + README:89 (caps), release.json:6-7 (elevated caps); plan lines 171-175, 1709-1715

Pillar 10 — Compliance evidence as exhaust

🔴 MissingC2 SOC 2 control-to-evidence map (compliance/soc2-control-map.yaml)

No compliance/ directory or control map in either repo. The core 'one-artifact-many-controls' deliverable does not exist.

find over cicd-system + pbsiem: no soc2-control-map / control-map / compliance.md (empty result)

🔴 MissingC3 PCI DSS SAQ-A control map + D47 Tranzila scope letter

No compliance/pci-control-map.yaml and no compliance/pci-tranzila-scope-letter.pdf (D47 names this exact path). PCI evidence surface absent.

find: no pci-control / tranzila-scope artifact; plan D47 §183, §1966

🔴 MissingC6 auditd on PROD-siemsys-syslog + PROD-MDR-01 (daily Storage-Box roll-up)

No auditd config templates or roll-up cron anywhere. Prod host-level audit is unimplemented.

find: no auditd* files in either repo; plan §1946

🔴 MissingC8 customer-facing /api/customer/audit-export (GDPR Art.15, D48 JSON+CSV+PDF)

No audit-export endpoint exists. Only admin-scoped audit views and an admin client export-request tracker exist; neither is the tenant-scoped self-serve audit trail the plan requires.

pbsiem find: only src/app/api/admin/audit + admin/clients/[id]/export-request; no customer/audit-export; plan §1948, D48 §184

🔴 MissingC7 quarterly drill signed-report scaffolding (compliance/drills/)

No compliance/drills/ directory or signed-report generator. (Note: a rollback-drill.yml exists in pbsiem but produces no SLSA-signed compliance report.)

find: no compliance/drills; plan §1947

🔴 MissingC9 retention policy enforcement (AuditLog 7yr, drill 5yr, auditd 1yr+7yr)

No retention crons and no enforced retention on either AuditLog model. Retention exists only for the unrelated subscription/data-purge domain in MDR.

grep: no auditLog.deleteMany / 7-year retention in dashboard src; plan §1949

🔴 MissingC4 GDPR DPA evidence pack (docs/data-flows.md, processor list)

DPA appendix ships in MDR product, but the plan's added data-flow diagrams / processor / sub-processor archive for the pipeline do not exist.

find: no data-flows.md; plan §1944

🔴 MissingC10 docs/compliance.md (control→evidence→retention master index)

Auditor's first-read index does not exist in either repo.

find: no compliance.md; plan §1950

🔌 Built · not wiredC1 pipeline-layer audit-log workflow (deploys/runs → immutable store)

audit-log.yml fires and 'succeeds', but its custom payload is rejected by the dashboard webhook (needs payload.repository / payload.workflow_run, which the custom shape lacks) and dropped; non-2xx is swallowed as a warning. Pipeline audit entries are persisted nowhere. The runbook's 'only piece running' is misleading — it runs but writes to void.

audit-log.yml:54-92 (custom payload, X-GitHub-Event:workflow_run, ::warning:: on non-2xx); dashboard/src/app/api/webhooks/github/route.ts:48-64 returns 400 on missing payload.repository; build.py:807-808 over-claims

🟠 Built differentAuditLog immutability + extension to pipeline events (deploy/canary/agent/secret-access)

A dashboard AuditLog model exists and is written, but only for control-plane UI actions (promotions, dispatch, finding-status, autofix). It is NOT immutable/append-only (plain Prisma rows, deletable/updatable) and does NOT capture the plan's pipeline events. MDR AuditLog is admin-action-scoped, not pipeline-scoped.

dashboard/prisma/schema.prisma:265-280 (no append-only guard); writers at controls/dispatch:84, promotions/merge-pr:86, security/findings/[id]/status:59; pbsiem/prisma/schema.prisma:586-598

✅ 1 aligned with the plan — click to expand

✅ AlignedC5 per-PR change-impact auto-tagging (5 labels, PII content scan)

Matches plan exactly: cicd-system change-impact.yml applies auth-touching/schema-changing/customer-facing/security-relevant/pii-touching; pbsiem calls it @main on PRs to dev/main; passing in CI today.

cicd-system/.github/workflows/change-impact.yml:62-162; pbsiem/.github/workflows/change-impact.yml:1-19; gh run 27485718386 (success, 2026-06-14)

Pillar 11 — Disaster Recovery (proven, not planned)

🔴 MissingMDR product database (PG on PROD-MDR-01) is excluded from the verification loop

The verification only covers the Neon portal DB. The real MDR detection/incident data lives in siemsys-mdr-postgres on PROD-MDR-01. There is no restore-verification path for it, despite the runbook banner claiming the prod MDR Postgres backup is 'tested end-to-end'.

infra/prod-mdr-01/README.md (siemsys-mdr-postgres container); backup-restore-verify.yml uses NEON_API_KEY only (lines 24-26); build.py:872 over-claims

🔴 MissingNo quarterly DR drill, RTO/RPO measurement, cross-region replication, failover tree, customer PITR, or DR docs

D2-D10 are entirely absent. This is ALIGNED with D49 (they were deliberately deferred to a separate DR plan), so they are not deficiencies against THIS plan — flagged only so the owner knows the gap exists and is intentionally out-of-scope. rollback-drill.yml is a Pillar 7 rehearsal, not the DR drill.

plan lines 1993-1997 + 2020-2022 (D49 split-out); docs/dr.md, docs/runbooks/dr-drill.md, failover-decision.md all absent; rollback-drill.yml is workflow_dispatch container restore only

🔌 Built · not wiredDaily backup-restore verification is not scheduled and has no caller (the single in-scope DR deliverable, D1)

backup-restore-verify.yml is workflow_call-only after commit 2dd43a8 removed its schedule; nothing in pbsiem or cicd-system invokes it. The runbook itself admits this (build.py:752 'no schedule invokes it for MDR daily') and S8 defers it. This is the only DR item D49 keeps in-plan, so its non-wiring is the headline Pillar 11 gap.

backup-restore-verify.yml:3-4 (on: workflow_call only); git commit 2dd43a8 diff (removed cron '0 3 * * *'); zero grep callers in pbsiem; build.py:752-753; gh run list = 8x 0s failure

🟠 Built differentThe restore target diverges from D1: Neon-branch->Neon-ephemeral instead of prod-backup->TEST

D1 specifies restoring yesterday's PROD backup INTO the TEST stack and running smoke tests there. The built workflow clones a Neon branch into a throwaway Neon branch and runs SELECTs against it — it never exercises a real backup artifact, the TEST environment, or the MDR product database. It proves Neon branching works, not that a backup restores.

backup-restore-verify.yml:82-126 (create ephemeral branch from primary), :128-167 (smoke queries against ephemeral conn_uri); plan line 2009 (D1 'Restores yesterday's prod backup INTO TEST')

🟠 Built differentThe tolerance / row-count verification is a no-op (fake gate)

The 'Verify row counts within tolerance' step always sets deviation_ok=true regardless of input, so the data-integrity check the runbook calls REAL ('Neon ephemeral restore + smoke + tolerance', build.py:752) does nothing. A restore that lost rows would still pass.

backup-restore-verify.yml:169-193 (both branches write deviation_ok=true; no comparison to primary)

🟠 Built differentRunbook S-PRE banner over-claims a live, tested MDR PG backup cron that is not verifiable in code

build.py:872 states the daily encrypted MDR PG backup cron is 'live on PROD-MDR-01 02:30 -> Storage Box (tested end-to-end)' and 'restore-verified on TEST = 36 tables'. No committed cron unit, script, or scheduled workflow backs this; the only committed dump is a single manual April file. Likely a host-only artifact, but it is asserted as DONE without an auditable source — exactly the 'hallucinated done' pattern the owner was burned by.

infra/prod-mdr-01/backups/ holds only 2026-04-29-pre-stage1-pgdump.sql; no committed cron; build.py:872-873 vs no repo evidence

Pillar 12 — Platform meta-pillars + ci-templates extraction (M1-M10: services.yaml, Terraform IaC+drift, runbooks-as-code, chaos game-days, status-page-to-SLO, anonymizer library, API versioning/deprecation, customer export interface, PII scan of fixtures/screenshots, and the grand-finale public plan-b-systems/ci-templates extraction with SHA-pinned consumers)

🔴 Missingservices.yaml service catalog (M1) — agents that need it already exist

No services.yaml in either repo, yet the three agent specs the plan says it must feed are already authored. The agents have no catalog to read, compounding the 'agent framework built but wired to nothing' divergence. ~1-2 days; high leverage because it activates already-built agents.

Glob **/services.yaml -> No files found; agents/specs/cost-anomaly.json, oncall-investigation.json, doc-drift.json present; plan line 2070

🔴 MissingTerraform IaC + daily drift cron (M2, D50)

Zero .tf files; Hetzner/Cloudflare/Neon hand-managed. The biggest single piece of Pillar 12 (~5-6 days). Runbook honestly flags it OPEN. Lower priority than the staging spine but the canonical 12-24-month-rot prevention.

Glob **/*.tf -> No files found; build.py:810-812 'No IaC anywhere; ... hand-managed'; plan lines 196/2071/2086/2106

🔴 MissingRunbooks-as-code with required sections + CI validation (M3)

Free-form docs/runbooks/*.md exist in both repos but with no enforced Symptoms/Diagnostics/Common-fixes/Escalation schema and no CI gate validating structure, so the On-Call Investigation agent has no machine-readable contract to consume. ~2 days.

cicd-system docs/runbooks/ (auto-fix.md, link-crawl.md, ...) and pbsiem docs/runbooks/ are prose; no structure-validation workflow found; plan line 2072

🔴 MissingQuarterly chaos game-days (M4)

No scripted chaos scenarios exist. The only 'chaos' references are in pbsiem's STALE ARCHITECTURE-CI.md (Litmus/Stage-7 staging — a different, never-built design) and the plan-mirror doc — not the plan's quarterly scripted Neon/OpenSearch/Twilio game-days with post-mortems.

grep chaos pbsiem -> only ARCHITECTURE-CI.md:53/306/311/327/644 (Litmus/staging); grep chaos cicd-system -> only docs/cicd-platform-build-plan.md; plan line 2073

🔴 MissingAPI versioning + deprecation policy (M7)

No X-Deprecated header pattern, no deprecate-in-MINOR/remove-in-MAJOR policy, no codification. Not in the runbook gap list either.

grep 'X-Deprecated|deprecation policy|API versioning' both repos -> no hits; plan lines 56/2076

🔴 MissingCustomer-facing export interface (M8)

The plan wants a COHERENT customer API surface consolidating audit-log exports (Pillar 10), SBOM (Pillar 6), cosign attestation verification, and status embeds (Pillar 11). Pieces are scattered/unbuilt; no unified surface. Runbook notes the audit-export endpoint specifically is absent. Depends on Pillars 6/10/11.

build.py:806-808 'No ... customer audit-export endpoint'; no export-interface module found; plan line 2077

🔴 MissingPII scan of test fixtures + visual-regression baselines (M9)

No CI gate scans test fixtures or visual-regression baseline IMAGES for accidental real-PII (the classic visual-baseline leak vector). security-scan.yml runs Trivy 'secret' scanning which incidentally catches credentials, but that is not the planned fixture/screenshot PII gate.

security-scan.yml:85 scanners 'vuln,secret,misconfig'; no fixture/baseline PII gate; visual-regression.yml has no PII step; plan line 2078

🟠 Built differentci-templates extraction as a PUBLIC SemVer product (M10 + D51)

The reusable patterns were centralized into PRIVATE cicd-system from day one (deliberate D2 drift) and consumed via uses:@main, instead of being extracted into a PUBLIC plan-b-systems/ci-templates product with SemVer v1.x.y. The central-store approach achieves the reuse goal, but drops three locked specifics: D51 PUBLIC visibility, the SemVer release/versioning surface, and the standalone product framing. Pillar 12 is the LAST phase and the platform is at M1 (not yet shipped), so deferral is correct — but the runbook should stop calling it a closed 'deliberate drift' and reopen the PUBLIC+SemVer delta as owed work.

gh repo list -> 'plan-b-systems/cicd-system ... private'; gh repo view plan-b-systems/ci-templates -> 'Could not resolve to a Repository'; build.py:179-180 reframes as closed 'deliberate drift'; plan lines 197/2091-2095 (D51 PUBLIC), 59/2116 (SemVer + SHA-pin)

🟠 Built differentConsumer pinning discipline: @main instead of SHA/@v1.0.0

Every consumer ref pins @main, which the plan explicitly forbids ('never @main, never @v1') and which cicd-system's own README documents against. @main means any push to cicd-system silently changes every consumer's CI with no version gate — the exact rot Pillar 12 exists to prevent. Cheap to fix now and does not need ci-templates to exist; cicd-system can be SHA-pinned today.

pbsiem: 30 refs all @main (ci.yml:18, ephemeral-env.yml:19/48, post-deploy-verify.yml:26/53, prod-promote.yml:48/72, ...), 0 SHA-pins, 0 @v pins; vaughnblades ci.yml:15 @main; README.md:18/24 'Pin by SHA, not branch'; plan line 59

🟠 Built differentAnonymizer library = schema-aware scrub(row, schema) (M6, D52)

scripts/anonymize-seed.mjs is a regex line-scrubber over pg_dump TEXT, not the plan's deterministic SCHEMA-AWARE scrub(row, schema)->scrubbedRow library with the Day-1 interface (plan line 2331). It is deterministic (sha256-seeded) which is good, but cannot guarantee referential integrity by schema and is not wired to a weekly TEST refresh. The real S2b/S2c staging refresh uses separate scripts/dev-scrub logic, so this 'library' is effectively orphaned.

scripts/anonymize-seed.mjs:1-21 docstring 'Reads SQL from stdin', processLine() regex approach; build.py:817-820 acknowledges the full scrub(row,schema) library is Phase 7; plan lines 198/2097-2101/2331

✅ 1 aligned with the plan — click to expand

✅ AlignedReusable-workflow CENTRALIZATION + real consumption (the reuse intent of M10)

The core intent — products inherit Plan-B CI patterns without copy-paste — is genuinely achieved and WIRED: 27 callable workflows in cicd-system, consumed by pbsiem (30 refs) and vaughnblades (a live 39-PR product). vaughnblades is a working conformance test of '2nd/3rd product onboards fast'. The divergences are visibility (private), packaging (no SemVer), and pinning (@main) — not the reuse mechanism itself.

cicd-system/.github/workflows/ = 27 files; pbsiem 30 uses: refs resolve at run time; vaughnblades ci/sbom/security-scan consume cicd-system; build.py:813-816 names vaughnblades the live conformance test

CORE SPINE — DEV/TEST/PROD topology + dev→test→main flow + promote model + MDR engine/backend deploy path

🔴 MissingEngine/puller PROD deploy path (signed GHCR image → verified pull onto PROD-MDR-01)

Plan locks the engine's prod path as cosign-signed + SLSA-L3-attested GHCR images that deployment scripts cosign-verify before pulling (lines 1597,1609,1660). Reality: docker-build pushes to GHCR but with NO cosign sign and NO SLSA attestation; siem-deploy does NO signature verification; and siem-deploy is invoked ONLY for TEST (stage4) + rollback-drill — there is NO workflow that deploys the engine to PROD-MDR-01 at all. The MDR engine reaches prod only by manual SSH today. This is the owner's named gap: the promote spine is portal-shaped, the engine has no promote path.

cicd-system docker-build.yml (no cosign/attest/sign tokens); zero slsa-provenance/attest callers in pbsiem; siem-deploy.yml callers = stage4.yml(TEST)+rollback-drill.yml only

🔴 MissingCosign signing + SLSA-L3 attestation + verify-before-pull for engine/puller

Pillar 6 #5/#9 (lines 1609,1613) and verification line 1660 require keyless cosign sign on GHCR push and cosign-verify at deploy. Neither the docker-build reusable nor siem-deploy implements any of it. slsa-provenance.yml exists in cicd-system but has no pbsiem caller.

cicd-system .github/workflows/docker-build.yml:74-93 (build-push only, no sign); slsa-provenance.yml present but uncalled by pbsiem

🔴 MissingWeekly TEST/staging data refresh via deterministic anonymizer

Plan locks PROD→TEST via a deterministic anonymizer with WEEKLY staging refresh (lines 322-323). Reality: no scrub(row,schema) library, no anonymize-seed.mjs (runbook claims it exists — it does NOT in pbsiem/scripts), no verify-dev-scrub workflow, no refresh cron. Result is the empty/drifted dev DB that flakes the gates — the P0 the owner hit.

pbsiem scripts/ has no anon/scrub file (find returned none); no schedule cron targets TEST refresh; build.py:818-819 defers it to 'Phase 7'

🔴 MissingEngine woven into cross-cutting pillars (OTel trace propagation, GrowthBook SDK in engine/puller)

Plan ties engine into Pillar 8 (W3C TraceContext portal→engine→puller, lines 1212-1219) and Pillar 7 (GrowthBook SDK in engine+puller, lines 1425-1431). Out of this area's critical path but part of the engine-spine intent; not verified built and deferred per runbook phasing.

plan lines 1212-1219,1425-1431; not in scope of any current pbsiem workflow examined

🟠 Built differentHuman merge-to-main gate (plan: Mike approves merge to main; no automated promotion)

Plan locks merge-to-main as the rare human-approved boundary with NO automated promotion (lines 317, 331, 627-631). Reality: promotion-autopilot.yml auto-opens dev→main promotion PRs AND fold PRs and arms auto-merge once certify is green, explicitly 'removing the human merge step in the middle'. The single human gate moved to a later Vercel-alias Promote click. Live and ON.

pbsiem .github/workflows/promotion-autopilot.yml:8-9,107-164; gh var PROMOTION_AUTOPILOT=on; fold PR #76 + promotion PR #77 live 2026-06-14

🟠 Built differentPromote model = Vercel-alias click (verify-before-promote)

Plan's promote = merge to main triggers PROD deploy (one event). Reality splits it: prod-deploy.yml builds a staging artifact WITHOUT moving the alias, then a separate human Promote (prod-promote.yml via dashboard) repoints the alias after re-verifying provenance. Defensible and arguably stronger (structurally unreachable prod), but it is a portal-only alias switch, not the plan's merge-deploys-to-prod model, and the canary 1%→10%→50%→100% ramp (Pillar 7, plan) is absent.

pbsiem prod-deploy.yml:1-18,44-46 (auto-promote:false); prod-promote.yml:46-64; dashboard src/app/api/promotions/dispatch/route.ts:8-16,51-62

🟠 Built differentFull-system TEST deploy is per-PR isolated

Plan + runbook headline 'full system per PR'. Reality: per-PR ephemeral is portal+Neon-branch only; engine/puller/receiver deploy to a SHARED TEST stack (Option C, serialized via siem-deploy concurrency), not a per-PR isolated full system. Runbook line 292 over-claims this as a delivered per-PR stage.

pbsiem stage4.yml:1-9,89-138 (shared TEST stack, TEST_MDR_DEPLOY_HOST); build.py:560,641 self-flags engine/puller NOT in per-change deploy

✅ 2 aligned with the plan — click to expand

✅ Aligned3-tier topology (DEV/TEST/PROD servers, networks, Vercel env split)

TEST stack (TEST-siemsys-syslog, TEST-MDR-01, splitter) + plan-b-test project + Siem-internal-test 10.1.0.0/16 exist; matches plan lines 304-327, 342-346. Prod isolation rule #1 enforced.

build.py:682-685; plan lines 304-327; siem-deploy env-marker assert siem-deploy.yml:94-107

✅ AlignedProd-isolation audit as a check-all gate

Plan's gate #13 (lines 297, 363-369) exists as check-prod-isolation.mjs driven by audit-prod-isolation.config.yaml and IS in the check-all chain. Renamed from the plan's audit-prod-isolation.mjs but functionally the locked gate.

pbsiem scripts/check-all.mjs:52 ('check-prod-isolation.mjs'); scripts/check-prod-isolation.mjs + scripts/audit-prod-isolation.config.yaml present

RUNBOOK accuracy — does cicd-runbook build.py mirror the locked plan AND honestly reflect current reality, or has it drifted / over-claimed

🔴 MissingNo engine→PROD promote path exists, and the runbook never names this gap

docker-build.yml builds engine images on push to main but there is no workflow that deploys/promotes mdr-engine or edr-puller to PROD-MDR-01 — engine prod deploys remain manual/SSH, outside the gated promote spine the plan requires for the whole system (plan L292). The Vercel promote path covers only the portal.

pbsiem/.github/workflows/docker-build.yml:3-10 (build only, targets main/dev); no PROD-MDR target in any deploy workflow (PROD-MDR-01 178.104.172.142 only in rotate-secrets:103 known-hosts); runbook build.py has no engine-prod-promote gap item

🔴 MissingStaging-data refresh is unbuilt (no dev-scrub scripts, no scheduled PROD→TEST refresh cron) yet the runbook banner says 'ONLY M1 REMAINS'

Plan L323/L406-409 mandate a weekly anonymized PROD→TEST refresh; the runbook itself ties gate-flake to an empty dev DB (S2b/S3 notes). S2b/S2c are admitted 'pending' but the EXECUTION-STATUS banner over-claims completeness.

build.py:844 ('ONLY M1 REMAINS'), L905-912 (S2b/S2c pending) vs filesystem: no pbsiem scripts/dev-scrub dir, no schedule: cron in any pbsiem workflow except rotate-secrets.yml:41-42

🟠 Built differentRunbook 'today's pipeline' table still shows Gap B (deploy-to-live-then-verify) as current + verify-before-promote as a P1 to-build, but code already does verify-before-promote with a separate human Promote gate

build.py L306/L342-343 over-state the gap; the same file's roadmap (L435,L844) says M1 SHIPPED + P1 done — an internal contradiction. Reality: prod-deploy.yml verify-before-promote + prod-promote.yml human gate are built and proven (run 27328491331).

build.py:306,342-343,418 vs pbsiem/.github/workflows/prod-deploy.yml:1-49 + prod-promote.yml:6-72 + cicd vercel-gated-deploy.yml:5-19

🟠 Built differentGap Analysis tab marks 'Full-system TEST deploy per change' as OPEN, claiming engine/puller/OpenSearch/syslog are NOT in any per-change deploy — but stage4.yml builds + deploys them per PR

Runbook UNDER-claims (says missing when built). The gaps tab (status:open) conflicts with the roadmap's S5 DONE. A reader working the gaps tab top-to-bottom would rebuild something that exists.

build.py:639-642 (status open) vs pbsiem/.github/workflows/stage4.yml:41-126 (build-engine/build-puller at PR SHA → siem-deploy to TEST-MDR-01)

🟠 Built differentpromotion-autopilot auto-merges dev→main, diverging from the plan's 'no automated promotion / Mike merges to main'

Plan L317/L324/L331-336 is explicit that TEST→PROD has no automated path. Reality arms GitHub auto-merge on a dev→main PR (certify-gated). The runbook presents this as a Phase-4 feature without flagging it as a plan divergence requiring deliberate amendment.

pbsiem/.github/workflows/promotion-autopilot.yml:92-137 (enablePullRequestAutoMerge on dev→main) vs plan imperative-crafting-wand.md L324,L331-336; runbook build.py:453 lists autopilot as delivered, not as override

🟠 Built differentWorkflows-tab security-loop text still says 'Stack after Aikido removal' contradicting the runbook's own Aikido-back-IN override

Stale prose from the 2026-06-01 removal survives next to the 2026-06-11 reinstatement override — the runbook contradicts itself on whether Aikido is a gate.

build.py:586-587 ('after Aikido removal: Semgrep + Trivy + ZAP + CodeRabbit') vs build.py:169,393,483 (Aikido back IN as free blocking gate)

🟠 Built differentProd alias drift: runbook/code use siemsys.plan-b.systems; plan names siem.plan-b.co.il

Minor but unreconciled — the runbook should either bend to the plan's DNS or amend the plan note. As-is the plan's named hostnames are silently superseded.

plan L293,L318 (siem.plan-b.co.il / siem-api.plan-b.co.il) vs prod-deploy.yml:43 + build.py prod-alias siemsys.plan-b.systems

✅ 1 aligned with the plan — click to expand

✅ Alignedcertify AI gate is real but dormant (skips without ANTHROPIC_API_KEY) — runbook states this accurately

The runbook honestly reports certify activates only when the key is added (build.py:844, S6b); code matches (certify.yml:145 skips, exit 1 only when AI refuses with key present). This is an example of accurate mirroring.

pbsiem/.github/workflows/certify.yml:140-161 vs build.py:844,447

Tasks — the gaps as an ordered mission list

Close the two P0s — the engine promote path and the staging-data refresh — and a Java/engine upgrade goes through staging and out to prod as one promote, not an 11-hour hand-deploy. The other 21 are real, but they are hardening, not blockers. Wire what exists, align what drifted, build what’s missing.

🎯 Do these first

P0 — Build the engine TEST->PROD promote path (Pillar 1): the mdr-engine/edr-puller reach PROD only via out-of-band manual deploy with no verified-staging-to-prod promotion (prod-deploy/prod-promote move only the Vercel portal alias; siem-deploy's PROD mode has no caller). This broken spine is what turned a 40-min change into 11 hours.
P0 — Stand up the weekly anonymized staging-data refresh (Pillar 1): with no refresh, TEST-MDR-PG/OpenSearch run empty/stale and the per-PR gates flake — the second half of the 11-hour pain. Build the minimum-viable scrub + weekly load.
P1 — Name + label staging and wire the single post-green-stage4 human promote as the ONE gate that moves both portal and engine (Pillar 1; folds into the engine-promote mission), so promote-to-prod is one auditable, provenance-checked action.
P1 — Reconcile main branch protection (Pillar 4): main is gated by ONLY ci/CI with 0 approvals while dev has 10 checks, so a hotfix straight to main bypasses security-scan + Stage-4 + links. Close the coverage gap before relying on the new promote path.
P1 — Complete the prod-isolation runtime env-diff check (Pillar 1) so 'no prod DSN/key reachable from non-prod' is mechanically enforced, not just statically greppable — the guardrail that keeps the promote path honest.
P1 — Enforce lint:strict in CI/pre-push and resolve the Aikido contradiction (Pillar 4): the plan/runbook/README claim enforcement the code does not perform; fix the behavior, then make plan + runbook + code agree.

🔌 Wire what’s built but not connected

Fastest wins — the workflow already exists in cicd-system; pbsiem just never calls it.

P1effort MPillar 5 — Stage 4Wire the built reusable gates (a11y, mutation, k6) into pbsiem stage4

Closes: Pillar 5 layers E/H + k6 half of D: a11y-i18n.yml axe-core, mutation-test.yml Stryker, perf-budget.yml k6 exist in cicd-system but pbsiem never calls them

Steps: In stage4.yml add caller jobs for a11y-i18n.yml (axe-core, zero WCAG 2.1 AA) and mutation-test.yml (Stryker on src/lib + src/app/api, ~60% start ratchet). Set k6-script-path to a new tests/perf/hot-endpoints.k6.js so perf-budget.yml's k6 job (gated on non-empty path) runs p95/p99 budgets. Add each as a required dev check. Pure wiring of proven workflows.

P1effort MPillar 5 — Stage 4Broaden the e2e gate from 1 smoke spec to the full suite

Closes: Pillar 5 layer A: stage4 e2e runs ONLY tests/e2e/dev-environment.spec.ts of ~90 specs; deterministic seeding (pr-db + synthetic-seed.ts) that blocked broadening is now built

Steps: Change stage4.yml test-pattern from the single dev-environment.spec.ts to the full tests/e2e suite (excluding specs needing real prod data), pointed at the per-PR isolated backend stack seeded by synthetic-seed.ts. Triage flaky specs against the synthetic seed instead of disabling the gate.

P1effort MPillar 3 — Ephemeral envsActivate or retire the OpenSearch teardown branch in the reusable

Closes: Pillar 3: cicd ephemeral-env.yml has an OS-index teardown branch gated on opensearch-url + OPENSEARCH_API_KEY, but pbsiem's caller passes neither, so it is dead code; GC cron is Neon-only

Steps: Decide with Mike. Option A: pass opensearch-url + OPENSEARCH_API_KEY from pbsiem's caller so teardown executes, add an ISM min_age:1d/delete policy (1 shard/0 replicas) as backstop. Option B: document OS isolation is client-id-namespaced via the per-PR stack, delete the dead OPENSEARCH_INDEX_PREFIX preview var and dead OS-teardown branch. Either way add ONE backstop that reaps stale pr-* indices.

🔧 Align what was built differently from the plan

Code, runbook, and plan disagree. Fix the behavior, then make all three agree.

P1effort SPillar 4 — Quality gatesEnforce lint:strict (warnings-as-errors) in CI and pre-push

Closes: Pillar 4 C/D: plan requires next lint --max-warnings 0 in CI+pre-push; reality prepush calls plain npm run lint and ci.yml overrides reusable's lint:strict default. README L28 + build.py:697 over-claim

Steps: In ci.yml set lint-command to npm run lint:strict (or drop the override). In package.json set prepush to npm run lint:strict + npm test. Open a PR with a planted eslint warning to prove it fails (plan L820). Fix README L28 + build.py:697 only after behavior matches.

P1effort MPillar 4 — Quality gatesResolve the Aikido contradiction; make plan + runbook + code agree

Closes: Pillar 4 E: Aikido absent (security-scan.yml header 'Replaces Aikido D54 superseded', Semgrep+Trivy only) yet build.py:169 says 'Aikido back IN as a blocking gate' while build.py:586 says 'after Aikido removal' — self-contradicting and false

Steps: Get Mike's binding call. OUT: amend LOCKED PLAN D-table + sub-deliv E to record the D54 supersession, delete the false build.py:169 row. IN: add a SHA-pinned Aikido scan step to cicd security-scan.yml gated on AIKIDO_SECRET_KEY (skip if unset), provision the key, add 'Aikido scan' as a required check. All three sources must match.

P1effort MPillar 4 — Quality gatesReconcile main branch protection with the plan (approvals, admin-bypass, coverage)

Closes: Pillar 4 I + GAP A (verified live): main requires ONLY ci/CI, approvals=0, enforce_admins=TRUE; dev has 10 checks. main is weaker at the gate (hotfix skips security-scan + Stage-4 + links) yet stricter on admin-bypass than the plan (plan wanted 1 approval + enforce_admins OFF)

Steps: With Mike: (a) approvals: set main approvals=1 per plan OR amend plan to ratify Mike's chat-word/0 approvals (build.py:173); (b) admin-bypass: choose plan's OFF (override runbook path) vs current ON, document the winning path (cross-link claude-guard); (c) make main require dev's security-scan + Stage-4 + links checks OR document main is only reached via gated-dev promotion + add a promotion-time re-gate. Update plan I + build.py:700-703.

P1effort MPillar 1 — Topology & isolationComplete prod-isolation audit to the plan's 4 checks; stop the config-vs-code over-claim

Closes: Pillar 1: check-prod-isolation.mjs does ONLY the static grep (b) and grandfathers 89 files; the runtime env-var diff that enforces 'no prod DSN/key from non-prod' is absent; config declares known_prod_client_ids + neon_hosts the code never reads

Steps: Extend check-prod-isolation.mjs: (a) pull vercel preview AND production env, fail if any backend value (DATABASE_URL/MDR DSN/OpenSearch URL/keys) is identical — no grandfathering on this check; (b) read config.known_prod_client_ids, fail on any WRITE/fixture ref outside *.test.ts; (c) read config.neon_hosts, flag direct prod Neon hostnames; (d) assert prisma/*.prisma has no DB-URL fallback. Keep the allowlist for the static-grep only.

P1effort SPillar 1 — Topology & isolationName + label staging; wire the human approval as the unified prod gate

Closes: Pillar 1: TEST stack functions as staging but is unlabeled; the only human gate is the portal-alias click. Runbook itself flags 'name + isolate' / 'wire the approval as the prod gate' (build.py:331-332,360-362)

Steps: Label the TEST stack 'staging' in docs + deploy nomenclature; make the single post-green-stage4 human promote the gate that moves BOTH portal and engine (folds into the engine-promote mission). Update ARCHITECTURE/key-workflow-rules to dev->staging->prod with one full-battery staging gate.

P2effort SPillar 4 — Quality gatesRatify-or-restore CodeRabbit request_changes_workflow (D15)

Closes: Pillar 4 F: plan D15 mandates .coderabbit.yaml request_changes_workflow:true blocking merge + CodeRabbit required; reality .coderabbit.yaml removed (fb8a2fb), CodeRabbit advisory-only, not required (build.py:171)

Steps: Decide with Mike. Advisory: amend LOCKED PLAN D15 to 'advisory, not a required gate'. D15 stands: re-add .coderabbit.yaml at repo root with request_changes_workflow:true + plan F path_instructions, add CodeRabbit as a required check. Do not leave plan and code silently disagreeing.

P2effort MPillar 3 — Ephemeral envsRestore EphemeralEnvCost as a cost-governor (EUR cap + SVC-HEALTH alert + stale sweep)

Closes: Pillar 3: plan wants EphemeralEnvCost persisted with 30-day rolling EUR total + ServiceHealthAlert at EUR80/100 + >30d stale force-delete; reality cron only counts Neon branches, console.warn at hard-coded 5, persists nothing; no model in prisma

Steps: Add EphemeralEnvCost to pbsiem/prisma/schema.prisma; in ephemeral-env-cost/route.ts query Neon compute-units (not branch count), compute daily + 30-day rolling EUR, persist, raise ServiceHealthAlert at EUR80 warn / EUR100 critical. Add the stale sweep: any pr-* branch whose PR closed >30d ago is force-deleted with an AuditLog entry.

P2effort SPillar 3 — Ephemeral envsMove ephemeral-envs admin page to the planned path + apply dark mode

Closes: Pillar 3: page is at src/app/admin/ephemeral-envs/page.tsx not the plan's (admin)/admin/infra/ephemeral-envs, and is styled light (bg-gray-50/text-gray-500) vs dark-mode-default rule

Steps: Relocate to the (admin)/admin/infra grouping the plan specifies; restyle to dark-mode-default with bright fonts + readable hovers. Cosmetic/structural, low risk.

P2effort MPillar 5 — Stage 4Expand visual regression to the locked 24 baselines (12 pages x EN+HE)

Closes: Pillar 5 B / D19: visual covers only 3 public EN pages (login/pricing/status), no authed/admin/MDR pages, no HE; plan requires 12 critical-path pages x EN+HE = 24 baselines at 2%

Steps: Extend tests/e2e/visual-regression.spec.ts to the 12 critical-path pages in EN+HE (auth via the per-PR seeded stack for admin/MDR), generate the 24 baselines at 2% threshold, keep the honest CI=1 no-update-snapshots posture. Gate is wired; this is coverage.

P2effort SPillar 4 / cross-cuttingCorrect the stale/contradictory runbook + ARCHITECTURE entries

Closes: Cross-cutting staleness: build.py:639-645 lists 'Full-system TEST deploy per change' OPEN (portal+DB only) but stage4 deploys engine+puller to TEST every PR; ARCHITECTURE.md L133 still says '6 parallel jobs' + 'Aikido + CodeRabbit'

Steps: Flip build.py's 'Full-system TEST deploy per change' to done citing stage4.yml + a green run; re-point the residual to the real gaps (engine PROD promote + staging-data freshness). Rewrite ARCHITECTURE.md L133 to as-built: single reusable ci/CI job (lint->typecheck->check-all->test->build), security-scan=Semgrep+Trivy enforcing, CodeRabbit advisory, dev Stage-4 + links + change-impact; drop '6 parallel jobs' and 'Aikido + CodeRabbit'; keep the accurate contract-gate paragraph L134.

🏗️ Build what was never built

Net-new. The two P0s live here.

P0effort LPillar 1 — Topology & isolationBuild the engine TEST->PROD promote path (close the P0 spine gap)

Closes: Pillar 1 locked spine: prod-promote.yml + prod-deploy.yml move ONLY the Vercel portal alias (verified zero engine/docker/ssh/hetzner steps); siem-deploy.yml's env-gated PROD mode has no caller, so the engine reaches PROD only via out-of-band manual deploy with no verified-staging-to-prod promotion. THIS made a 40-min change take 11 hours

Steps: Add a prod engine-promote step deploying the SAME verified TEST artifact (mdr-engine/edr-puller image at the promoted SHA, already in GHCR from stage4/docker-build) to PROD-MDR-01 via siem-deploy.yml's env-gated PROD mode (required-env-marker: PROD). Wire it into the human promote flow so ONE action moves BOTH the portal alias and the engine: extend prod-promote.yml to also call siem-deploy.yml@main with DEPLOY_HOST=PROD-MDR + a provenance assertion (image must come from a green stage4 on the promoted SHA), or add prod-engine-deploy.yml behind the same human dispatch. Mirror prod-deploy.yml's cannot-promote-an-unverified-build check.

P0effort MPillar 1 — Topology & isolationStand up the weekly anonymized staging-data refresh into the shared TEST stack

Closes: Pillar 1 D7 + sub-deliv I: no weekly anonymized PROD dump into TEST-MDR-PG/OpenSearch (verified zero weekly/refresh/anonymize hits outside rotate-secrets); MDR-PG gets only per-PR schema push + Neon qa-seed, so gates flake on empty/stale data — the staging-data-stale P0

Steps: Add a scheduled workflow that (1) takes a read-only PROD MDR-PG dump (pg_dump --exclude-table-data=audit_log) + relevant OpenSearch indices, (2) runs the deterministic scrubber (extend scripts/anonymize-seed.mjs into scrub(row,schema): emails->user{id}@test.local, phones->+972500000000, names->faker, card data dropped), (3) loads into TEST-MDR-01 PG + TEST-OpenSearch. Run weekly so gates execute against real-shape data. Minimum-viable anonymizer scoped to Phase 1 (full lib stays Pillar 12).

P1effort LPillar 2 — Identity & secretsExtend secret rotation to MDR_ENCRYPTION_KEY with synchronized cutover

Closes: Pillar 2 #4: rotation covers only OPENSEARCH_PROXY_KEY + mdr_admin PG password; MDR_ENCRYPTION_KEY (the trickier one, must stay in sync across Vercel + engine .env + puller) + bearer tokens/PATs/GrowthBook are unrotated; plan L1656 unmet

Steps: Add MDR_ENCRYPTION_KEY to rotation-targets schema + rotate-secrets.sh: generate new key, write to Vercel (production scope) AND to engine/.env + puller config via the existing SSH/role-tagged loop, then restart/verify all three consumers atomically so portal+engine+puller never split-brain. Reuse fingerprint-log + shred. Acceptance L1656: rotates on schedule, all 3 pick it up without manual restart, golden v2 envelope still decrypts (gate crypto-envelope contract test). Then add bearer tokens + GH PATs + GrowthBook as follow-ons.

P1effort MPillar 5 — Stage 4Build the cross-tenant isolation suite (catches the CR-03 IDOR class)

Closes: Pillar 5 G / D23: tests/e2e/security/ does not exist; grep cross-tenant = 0 hits (verified); plan requires top-12 sensitivity-endpoint cross-tenant specs

Steps: Create tests/e2e/security/cross-tenant-*.spec.ts covering the top-12 sensitivity endpoints (D23 list), using two distinct seeded tenants from the per-PR backend stack and asserting tenant A cannot read/write tenant B's resources (the CR-03 IDOR class). Wire into stage4 as a required dev check.

P2effort MPillar 2 — Identity & secretsBuild the getSecret('X') audit-logged proxy and adopt it for sensitive secrets

Closes: Pillar 2 D13/D12 #5: no src/lib/secrets/get-secret.ts, no SECRET_ACCESS AuditLog event (verified 0 hits); the two getSecret symbols are unrelated per-file helpers; no central audit point

Steps: Create src/lib/secrets/get-secret.ts exporting getSecret(name) wrapping process.env[name]; in PROD only (skip TEST per D12) emit an AuditLog row with a SECRET_ACCESS action capturing which-secret + by-which-user/request. Scope v1 to sensitive types per D12 (encryption/payment/AI keys); migrate those process.env reads. Acceptance L1657: AuditLog SECRET_ACCESS query returns who/what/when.

P2effort MPillar 5 — Stage 4Build the API contract gate (zod-to-openapi spec + oasdiff)

Closes: Pillar 5 C / D20: zod-to-openapi + oasdiff = 0 hits (verified); no generated OpenAPI spec, no breaking-change gate

Steps: After the Phase-5a Zod audit, generate an OpenAPI spec from the route zod schemas (zod-to-openapi), commit it as baseline, add an oasdiff step to stage4/check-all that fails on breaking changes vs baseline. Required dev check.

P2effort MPillar 5 — Stage 4Build the AI cost guard with AICostLedger persistence (three-tier caps)

Closes: Pillar 5 J / D24: AICostLedger model not in pbsiem prisma (verified); cost-guard code exists only in agents/framework, caps not enforced against a persisted ledger

Steps: Add an AICostLedger Prisma model; wire the existing cost-guard to write rows and enforce the three-tier caps ($0.50/run, $5/PR, $50/repo/mo), Claude-only. Gate the per-PR run on the per-PR cap so a runaway AI step blocks merge.

P2effort MPillar 5 — Stage 4Build the migration-safety gate (prisma migrate diff + lock-impact)

Closes: Pillar 5 I: no check-migration-safety.mjs; only crypto-envelope.mjs under scripts/check-contracts; stage4's honest db push (no --accept-data-loss) is only a partial substitute

Steps: Add scripts/check-contracts/migration-safety.mjs (auto-discovered by check-all) running prisma migrate diff + lock-impact, blocking risky ALTERs (drops, type narrows, non-concurrent index on large tables). Enforce in check-all so it gates every PR.

P2effort SPillar 1 — Topology & isolationWrite the prod-emergency-override runbook and reference it

Closes: Pillar 1 sub-deliv H: docs/runbooks/prod-emergency-override.md does not exist; no EMERGENCY-OVERRIDE ref in ARCHITECTURE.md; policy/audit-trail doc absent though claude-guard enforces the mechanism

Steps: Create docs/runbooks/prod-emergency-override.md per H: legitimate Mike-only scenarios, allowed actions, the [EMERGENCY-OVERRIDE] commit-subject convention, the required AuditLog entry within 24h + reverse-normal-flow cleanup commit + a make-this-unnecessary sprint task. Add a one-line ARCHITECTURE.md ref + cross-link claude-guard.

P2effort MPillar 2 — Identity & secretsNarrow GH Actions secret scoping to per-environment + write docs/secrets.md

Closes: Pillar 2 #6 + #10: only prod-deploy.yml uses a GH environment block; the other 7 VERCEL_TOKEN workflows pull repo-level secrets with no environment gate; docs/secrets.md does not exist

Steps: Audit every repo-level secret used by the 8 VERCEL_TOKEN workflows; move sensitive ones (deploy/promote/verify) behind GH environment:production / environment:preview mirroring prod-deploy.yml. Author docs/secrets.md: rotation model (2 shipped + extension set), no-plaintext fingerprint/shred posture, permission tiers (Mike admin / Nadav developer / Roy viewer), and the OIDC-deferred-to-v2 decision note.

התוכנית, משוחזרת

זהו עמוד השדרה בדיוק כפי שננעל ב-2026-05-25 (התוכנית האסטרטגית imperative-crafting-wand, 12 פילרים / D1–D55). הזרימה תמיד הייתה פשוטה: dev → staging (TEST) → main (PROD) — כל שינוי עובר ל-staging, עובר שוב את כל סוללת הבדיקות, ואז promote ידני יחיד של אדם מעביר אותו ל-prod. אין promotion אוטומטי. כל 12 הפילרים חייבים להישלח לפני שהלקוח המשלם הראשון עולה לאוויר.

עמוד השדרה

טופולוגיית 3 שכבותזרימת הליבה — DEV / TEST / PROD

עמוד השדרה של התוכנית הוא טופולוגיית 3 שכבות קשיחה — DEV / TEST / PROD — עם בידוד מוחלט של ה-PROD, וזרימת promotion חד-כיוונית של dev → preview-on-TEST → main → PROD. הכוונה המדויקת (verbatim) מתוך התוכנית:

שלוש השכבות (lines 304-327):
- DEV (לפטופים): pnpm/npm מקומי + Vercel localhost + ענף Neon `dev` + סביבת Vercel `development` + לוגים חיים של PlaySmart דרך log splitter במצב read-only.
- TEST (clone של prod): פרויקט Hetzner נפרד וחדש `plan-b-test`; שרתים TEST-siemsys-syslog + TEST-MDR-01 (3 containers); DNS test-siem.plan-b.co.il (alias של Vercel preview) + test-siem-api.plan-b.co.il (nginx של TEST, grey-cloud); סביבת Vercel `preview`; ענף Neon `dev`; Storage Box מתוחם ל-/test/ (BX11 חדש).
- PROD (בלתי-נגיע): פרויקט Hetzner של PROD; PROD-siemsys-syslog + PROD-MDR-01 (3 containers); siem-api.plan-b.co.il + siem.plan-b.co.il; סביבת Vercel `production`; ענף Neon `main`; Storage Box /prod/.

הזרימה (lines 317-337): 'PR נפתח → Preview נפרס ל-TEST → כל ה-gates של Pillar 4-5 רצים מול נתוני TEST. Mike מאשר merge ל-main → deploy ל-PROD → אימות שלאחר ה-deploy רץ → Auto-rollback במקרה של חריגת SLO.' שני גבולות: DEV↔TEST הוא ארעי/תכוף (כל PR נוחת כאן; TEST הוא המקום שבו ה-dev 'מראה את עצמו' לפני שזוכה לאמון); TEST→PROD הוא נדיר, מאושר על ידי אדם (Mike עושה merge ל-main), חד-כיווני בלבד, ללא promotion אוטומטי.

שלושה כללים מוחלטים (lines 333-337): (1) שום URL/DSN/key של prod לא נגיש משום נתיב קוד שאינו-prod (scripts/audit-prod-isolation.mjs אוכף, gate #13 ב-check-all). (2) ל-PROD אין שום תלות נכנסת ב-TEST או ב-DEV (log splitter ולא log-forward; clones ולא mirrors). (3) emergency override שמור ל-Mike בלבד ונרשם ב-audit לאחר התקרית בתוך 24 שעות.

כללי זרימת נתונים (lines 322-327): PROD→TEST רק דרך anonymizer דטרמיניסטי (רענון staging שבועי); TEST→PROD לעולם לא (אין נתיב קוד, אין script, אין פעולת admin); PlaySmart UDP 514 → log splitter → מפצל לשני היעדים: PROD-syslog וגם TEST-syslog (cx22 זול משלו; ל-PROD אין תלות יוצאת ב-TEST). סביבות ארעיות per-PR (Pillar 3 מקופל לתוך Phase 1): ענף Neon pr-{N} מתוך dev + Vercel preview + אינדקס OpenSearch מתוחם ל-PR pr-{N}-test + teardown אוטומטי בסגירת ה-PR.

אין promotion אוטומטימודל ה-Promote — 5 שכבות שערים + המיזוג של מייק

שינוי מגיע ל-prod אך ורק על ידי מעבר מלא של מחסנית ה-gates המבוימת ולאחר מכן Mike עושה merge ל-PR אל `main`; אין promotion אוטומטי. מודל ה-gate של הגנה-בעומק ב-5 שכבות (lines 619-640): שכבה 1 pre-push מקומי (Husky), שכבה 2 CI ב-push (GH Actions: install/lint/typecheck/check-all/test/build), שכבה 3 branch protection לפני merge (שכבה 2 + Aikido + CodeRabbit ללא בקשות שינוי פתוחות + ב-main: review מאשר אחד מ-Mike), שכבה 4 חזרה-כללית לפני deploy = Pillar 5 Stage 4 מול TEST (Playwright/visual/a11y/perf/DAST/mutation/contract), שכבה 5 אימות לאחר deploy = Pillar 7 (synthetic + burn-rate של SLO + auto-rollback). ה-merge ל-main מגודר: branch protection דורש שכל הבדיקות יהיו ירוקות וגם review מאשר של Mike; no-bypass-for-admins כבוי כך שאפילו Mike משתמש בזרימת ה-PR (חירומים מכוסים על ידי runbook של emergency-override). ב-merge ל-main, Vercel פורס ל-PROD ואימות שלאחר ה-deploy רץ. אספקה הדרגתית (Pillar 7): כל PR משמעותי נשלח מאחורי flag של GrowthBook; canary עולה 1%→10%→50%→100% לאורך 24h (gates של 4h/8h/12h); auto-rollback בכל אחד מ-{שיעור שגיאות פי 2 מה-baseline, p99 פי 2 מה-baseline, כשל synthetic פי 3}; rollback מיידי של קוד דרך החלפת alias ב-Vercel (<30s) + היפוך flag. D30: שלבי ה-ramp מאושרים ידנית על ידי Mike עד שה-Release Agent (Pillar 9 Tier 3) נשלח, אוטומטיים לאחר מכן. עקרון 'לסמוך על ה-gates' (D26, lines 1620-1623): לאחר ההשקה Renovate עושה auto-merge לכל PR ירוק — patch/minor/major כאחד — מפני שכל תפקיד ה-CI/CD הוא להבטיח שכל דבר שעבר merge הוא בטוח; majors שדורשים migration אמיתי נכשלים ב-gates באופן טבעי ונשארים פתוחים. המערכת עולה לאוויר ללקוח משלם ראשון רק כאשר כל 12 ה-pillars נשלחים — אין go-live מדורג (line 2161).

הפער המובלעכוונת דיפלוי המנוע — image חתום, לא alias של Vercel

התוכנית ממוקדת-portal (Next.js → Vercel) עבור עמוד השדרה dev→test→main→PROD; מנוע ה-MDR + ה-edr-puller (ה-Docker containers בצד ה-backend על PROD-MDR-01 / TEST-MDR-01) מנוהלים אחרת והתוכנית אינה מגדירה עבורם pipeline מלא של git-promotion. כוונות נעולות מרכזיות:

1) המנוע/ה-puller אינם משתנים מתוך PRs של pbsiem. ה-path_instructions של CodeRabbit (line 717): 'mdr-engine/**, edr-puller/** → contract-drift בלבד; לא מוצעים עריכות קוד (איננו משנים את אלה מתוך PRs של pbsiem).' ה-gate של crypto-contract (lines 731-736) מתייחס ל-src/lib/mdr-crypto.ts (portal), ל-decrypt ב-mdr-engine/src/action-executor.ts, ול-edr-puller/src/decrypt.ts כ-'imports במצב read-only — איננו משנים קבצים אלה'; ה-gate רק מוודא ששלושת ה-decryptors נשארים תואמי-חוזה מול envelope זהב v2 GCM.

2) האופן שבו המנוע מגיע ל-prod = signed-container-pull, ולא Vercel alias. Pillar 6 (supply chain) תת-תוצר 5 (lines 1597, 1609): 'תמונות חתומות-Sigstore עבור docker containers (mdr-engine, edr-puller) נדחפות ל-GHCR. שלב ה-GHCR push ב-CI חותם תמונות דרך cosign sign (keyless OIDC). סקריפטי ה-deployment של המנוע + ה-puller מאמתים חתימות לפני pull.' אימות (line 1660): cosign verify ghcr.io/plan-b-systems/mdr-engine:<sha> מחזיר חתימה keyless תקפה. כך שהמנוע מגיע ל-PROD-MDR-01 על ידי סקריפטי deployment שמושכים תמונות מאומתות-cosign, מאוששות-SLSA-L3, מתוך GHCR.

3) המנוע שזור בתוך ה-pillars החוצי-חתכיים אף שאינו נמצא על עמוד השדרה של ה-Vercel promote: instrumentation של OTel ב-MDR engine + edr-puller עם הפצת W3C TraceContext portal→engine→puller (Pillar 8, lines 1212-1219); GrowthBook SDK ב-portal + engine + puller (Pillar 7 B, lines 1425-1431); gates משותפים של חוזה crypto/rate-limit/audit-log חוצי-רכיבים (Pillar 4 H); אימות attestation חוצה-מוצרים ב-deploy (Pillar 6 #9).

נטו: התוכנית נועלת את נתיב ה-prod של המנוע כ-'תמונה חתומה ב-GHCR → pull מאומת אל ה-box של PROD Hetzner,' שנשמר כן על ידי gates של contract-drift ואימות signature/provenance — אך היא אינה מגדירה promotion בסגנון Vercel של canary/alias-switch או gate של git-merge מ-TEST→PROD עבור המנוע עצמו כפי שהיא עושה עבור ה-portal. זהו הפער המשתמע: עמוד השדרה של promote-to-prod מעוצב לפי ה-portal; המנוע נסמך על משמעת חתימת-containers ולא על זרימת dev→test→main בלחיצת כפתור.

12 הפילרים

הכוונה הנעולה של כל פילר וההחלטות שעיצבו אותו. המצב הנוכחי של כל אחד נמצא בלשונית מצב נוכחי; מה שחסר נמצא בלשונית פערים.

P1פילר 1 — טופולוגיית סביבות ובידוד

כוונה: DEV/TEST/PROD עם בידוד מוחלט של ה-prod; פרויקט Hetzner נפרד וחדש מארח את מחסנית ה-TEST כצל רזה של ה-prod, נאכף מכנית כך ששום URL/DSN/key של prod אינו נגיש משום נתיב קוד שאינו-prod; emergency override שמור ל-Mike בלבד עם trail של audit שלאחר התקרית.

החלטות מפתח

D5: סביבות ארעיות של Pillar 3 מקופלות לתוך Phase 1 (לא phase נפרד). D6: טופולוגיית TEST = צל רזה, cpx22/cx22 x2 + cx22 splitter, ~EUR20-25 לחודש. D7: נתוני TEST = PlaySmart חי דרך splitter + fixtures סינתטיים מינימליים (anonymizer נדחה ל-P12). D8: סביבות ארעיות per-PR כן ב-Phase 1 (ענף Neon לכל PR + אינדקס OS מתוחם + teardown אוטומטי). D9: מפתחות SaaS חיצוניים מפוצלים (Twilio/SendGrid/Tranzila/Finbot/Hetzner/GitHub), משותפים+מוגבלי-תקרה (Anthropic), מושבתים-ב-TEST (SentinelOne), משותפים בהיקף-צר (Cloudflare). D10: Storage Box חדש BX11 ייעודי ל-TEST (~EUR4 לחודש), מפתחות SSH+מכסה עצמאיים. סקריפט חדש scripts/audit-prod-isolation.mjs הופך ל-gate #13 של check-all.

P2פילר 2 — זהות, סודות ואמון

כוונה: Workload identity (OIDC במקום שנתמך), ללא credentials ארוכי-טווח של CI, cron rotation שבועי (כבר נשלח), תיחום סודות לכל סביבה, גישת סוד נרשמת ב-audit, כאשר 'שום אדם אינו רואה credentials של prod' היא כוכב הצפון.

החלטות מפתח

D11: אימוץ OIDC נדחה ל-v2 (החלטת Mike: היה מסבך דברים); לשמר את cron ה-rotation השבועי. D12: גישת סוד נרשמת ב-audit ב-PROD בלבד, סוגי סודות רגישים בלבד (encryption keys, מפתחות payment processor, מפתחות AI). D13: דפוס proxy לגישת סודות getSecret('X') העוטף את process.env.X (DX נקי, ניתן ל-grep, נקודת audit מרכזית). היקף v1 צומצם מ-10 ל-6 תת-תוצרים לאחר דחיית OIDC; ה-rotation מתרחב ל-MDR_ENCRYPTION_KEY + bearer tokens + PATs + מפתחות GrowthBook.

P3פילר 3 — סביבות ארעיות per-change

כוונה: ענף Neon לכל PR + Vercel preview מול TEST + אינדקס OpenSearch מתוחם ל-PR + fixtures דטרמיניסטיים של seed + teardown אוטומטי בסגירת ה-PR, עם תקרת עלות לכל סביבה ותקרה מצרפית.

החלטות מפתח

מקופל לתוך Phase 1 (Pillar 1, D5) — כבר לא phase משלו. מכניקה: GH workflow .github/workflows/ephemeral-env.yml על PR opened/synchronize/reopened/closed יוצר/מוחק ענף Neon pr-{N} מתוך dev + override של env per-branch ב-Vercel (DATABASE_URL, OPENSEARCH_PR_INDEX, PR_NUMBER) + אינדקס OS מתוחם ל-PR pr-{N}-test (1 shard, 0 replicas, ISM delete min_age 1d). Teardown אידמפוטנטי שלעולם לא נכשל-בשקט + cron GC יומי + cron עלות. תקציב ארעי מצרפי ~EUR100 לחודש; הערכה <EUR1 לכל PR. דרישות-קדם מסומנות: מכסת ענפי Neon (תוכנית Scale) + env per-branch ב-Vercel דורש Pro.

P4פילר 4 — Quality gates (הגנה בעומק)

כוונה: חמש שכבות — pre-push מקומי, CI ב-push, branch protection לפני merge, חזרה-כללית לפני deploy, אימות לאחר deploy — המרחיבות את 11 ה-gates הקיימים של check-all עם typecheck, lint-warnings-as-errors, Aikido, CodeRabbit, gate של crypto-contract, ומסגרת גנרית של contract-gate חוצה-רכיבים.

החלטות מפתח

D14: כלי pre-commit/push = Husky + lint-staged. D15: request_changes_workflow של CodeRabbit TRUE (חוסם merge בבקשות שינוי). D16: pinning של GH Actions היברידי — SHA-pinned לרלוונטיים-לאבטחה (Aikido/CodeQL/SLSA), @v4 לרלוונטיים-ל-build (checkout/setup-node/cache). D17: מסגרת contract gate חוצה-רכיבים נבנית עכשיו (scripts/check-contracts/) עם crypto-envelope כחוזה ה-v1; מועמדי v2 rate-limit/pii-scrub/audit-log. D18: פתח-מילוט --no-verify מותר עם runbook מתועד של emergency-override. Branch protection: dev = 7 בדיקות נדרשות ללא push ישיר; main = זהה ובנוסף אישור אחד של Mike, admin no-bypass כבוי. רצף קריטי: workflows חייבים להתקיים לפני branch protection כדי ששמות הבדיקות הנדרשות יתאימו.

P5פילר 5 — חזרה-כללית בצורת-production (Stage 4)

כוונה: מול הסביבה הארעית per-PR על TEST, כל PR מריץ Playwright E2E (86 ה-specs הקיימים) + visual regression EN+HE + בדיקות OpenAPI/contract + perf budget (Lighthouse + k6) + a11y (axe-core) + i18n key parity + בידוד cross-tenant + mutation testing (Stryker) + בטיחות migration + AI cost guard.

החלטות מפתח

D19: visual regression = critical-path 12 עמודים x EN+HE = 24 baselines (סף 2% פיקסלים). D20: OpenAPI דרך zod-to-openapi (דורש Phase 5a עצמאי של Zod audit ~2-3 ימים). D21: baseline של perf budget = rolling baseline מ-main. D22: סף mutation של Stryker = 30 יום ל-70%, להתחיל ברצפת first-run ~60%. D23: כיסוי בידוד cross-tenant = top-12 endpoints רגישים בתחילה (תופס את מחלקת ה-IDOR של CR-03). D24: AI cost guard = Claude API בלבד ל-v1; תקרות שלוש-שכבתיות $0.50/run, $5/PR-lifetime, $50/repo/חודש. ה-gates מופעלים ב-deployment_status=success מול הסביבה הארעית של Pillar 1, מתפרסמים כ-PR checks, נדרשים על ידי branch protection של Pillar 4.

P6פילר 6 — שלמות שרשרת-האספקה

כוונה: SBOM (CycloneDX) לכל release מצורף ל-GH Release + שימור ב-Storage Box; provenance של SLSA L3 דרך Sigstore + GitHub attest-build-provenance; SLSA Source Track; Aikido SAST/secret/dep-CVE/IaC; ZAP DAST; cron של dependency-audit עם auto-PR.

החלטות מפתח

D25: commits חתומים ב-main נדחים ל-v2 (שיבוש לצוות); יעד SLSA Source Track L1 (branch protection לבדה מספקת). D26: auto-merge של Renovate = לסמוך על ה-gates — ידני במהלך ה-build; לאחר ההשקה auto-merge לכל PR ירוק (ללא הבחנת patch/minor/major; majors שדורשים migration נכשלים ב-gates ונשארים פתוחים). D27: שימור SBOM = לנצח ב-Storage Box (~100KB JSON כל אחד). תמונות mdr-engine + edr-puller חתומות-Sigstore ל-GHCR דרך cosign keyless OIDC; סקריפטי deployment מאמתים חתימות לפני pull; אימות attestation חוצה-מוצרים נשלח ב-ci-templates.

P7פילר 7 — אספקה הדרגתית ו-rollback בטוח

כוונה: feature flags של GrowthBook באירוח-עצמי; canary 1%->10%->50%->100% לאורך 24h; auto-rollback בחריגת SLO; migrations של schema בדפוס expand-contract הנאכפות על ידי Migration Review Agent; rollback מיידי דרך החלפת alias ב-Vercel + היפוך flag.

החלטות מפתח

D28: ramp של canary 1%->10%->50%->100% לאורך 24h (stage gates של 4h/8h/12h). D29: טריגר rollback = כל אחד מ-{שיעור שגיאות פי 2 מה-baseline, p99 פי 2 מה-baseline, כשל synthetic פי 3} -> auto-rollback (rollbacks שגויים זולים, rollbacks שהוחמצו יקרים). D30: gates של אישור ידני עד שה-Release Agent נשלח ב-Pillar 9, אוטומטיים לאחר מכן. D31: flagging ברירת-מחדל = כל PR משמעותי כברירת-מחדל, opt-out דרך תווית PR של no-flag-needed. D32: היקף rollout טרום-השקה = Phase A PlaySmart-only (דה-פקטו היום), Phase B opt-in per-customer כשלקוחות עולים, Phase C cohorts מדורגים ב-10+ לקוחות. אינטגרציית SDK משתרעת על portal + engine + puller; rollback של החלפת alias ב-Vercel <30s דרך scripts/rollback-prod.mjs.

P8פילר 8 — Observability, SLOs וייחוס עלויות

כוונה: אותות זהב דרך OpenTelemetry + Prometheus + Grafana + Sentry + Better Uptime; 5 SLOs נקובים הנאכפים על ידי התראות burn-rate; בנוסף cost-per-customer + cost-per-feature דרך תגית client_id על כל log/metric/trace (קריטי לתמחור SIEM מבוסס-שימוש).

החלטות מפתח

D33: Prometheus + Grafana באירוח-עצמי על TEST בתחילה (~EUR4 לחודש); לבחון מחדש Grafana Cloud בהמשך. D34: SLO-as-code = כללי התראה של Prometheus כתובים-ביד ל-v1 (5 SLOs: login 99.9%/2s, license check 99.95%/500ms, MDR APIs 99.5%/1s, alert delivery 99%/5min, ingestion 99.9%/30s); Pyrra/Sloth נדחים. D35: exporter של OTel = OTLP/gRPC ל-collector באירוח-עצמי. D36: דגימת Sentry tracesSampleRate 0.1, replaysSessionSampleRate 0.01. D37: ייחוס עלויות = per-customer + per-feature (תוויות client_id + feature_flag בכל מקום). D38: PROD-obs-stack מוקצה 2-4 שבועות לאחר ש-TEST-obs-stack יציב. OTel מנטר portal + MDR engine + edr-puller עם הפצת W3C TraceContext.

P9פילר 9 — צי סוכני AI עם סמכות מתוחמת

כוונה: מסגרת סמכות Tier 0-4 הנאכפת-בזמן-ריצה (ולא נאכפת-בפרומפט) — לסרב לחשוף את הכלי במקום לבקש מהפרומפט לסרב — עם התקדמות אמון-נרכש, ערוץ <untrusted> ל-prompt-injection, תקרת graph של agent-to-agent, חבילות regression של golden-input לכל agent, ו-post-mortem חודשי + יום agent-off.

החלטות מפתח

D39: לבנות על Claude Agent SDK (אכיפת allowedTools + hooks של PreToolUse/PostToolOutput), לא על Anthropic SDK גולמי. D40: ליישר Tier 0-4 ל-CSA Agentic Trust Framework (פברואר 2026). D41: ספי אמון-נרכש T0 100 inv/<5% override, T1 200/<3%, T2 500/<2%, T3 1000/<1%; agents מתחילים ב-SHADOW, מתקדמים אוטומטית ל-LIVE. D42: telemetry מרחיבה את OTel/Prometheus/Grafana של Pillar 8 (ללא LangSmith/Helicone). D43: תקרת graph של agent-to-agent = מקסימום 3 hops, מקסימום $5 עלות cascade. D44: post-mortem חודשי על מדגם 20-החלטות לכל agent + יום agent-off חודשי. D45: תקרת עלות AI $0.50/run, $5/PR-lifetime, $50/repo/חודש; משמעת model-tier (Haiku שגרתי, Sonnet מורכב, Opus רק review חודשי). 10 agents חדשים נבנים lowest-risk-first (3 Tier-0, 5 Tier-1, 1 Tier-2 On-Call, 1 Tier-3 Release Agent) + 4 agents קיימים מותאמים מחדש; ל-v1 יש אפס agents Tier-4 (בני אדם תמיד לוחצים על פעולות prod-high-blast).

P10פילר 10 — ראיות compliance כתוצר-לוואי

כוונה: audit log בלתי-ניתן-לשינוי, SBOM-per-release עם שימור, תגיות change-impact per-PR (customer-facing/auth/schema/security/PII), auditd על prod, ראיות drill רבעוני, וטבלת מיפוי control-to-evidence (SOC 2 CC-* + PCI DSS 4.0) שבה artifact אחד מוכיח controls רבים — כך שה-audit הופך ל'ייצוא הדוח', ולא 'בנייה מחדש של ה-trail'.

החלטות מפתח

D46: מוכנות SOC 2 Type II = 6 חודשים לאחר ההשקה (מיושר לחלון ה-prospect הראשון מסוג enterprise; Type II דורש ראיות מתקופה נגררת). D47: היקף PCI DSS = SAQ-A לפי הנחיה כתובה של Tranzila (נתיב ה-API הוא מעבר של חלקיק-שנייה, אינו נשמר; Tranzila אישררה מחדש לאחר ש-Codex סימן סיכון SAQ-A-EP/SAQ-D); ההנחיה מאורכבת ב-compliance/pci-tranzila-scope-letter.pdf. D48: פורמט ייצוא audit-log = JSON + CSV + PDF (קריא-מכונה ל-GDPR + auditor enterprise + גיליון אלקטרוני). endpoint ייצוא מתוחם-tenant פונה-ללקוח /api/customer/audit-export (GDPR Article 15).

P11פילר 11 — התאוששות מאסון (מוכחת, לא מתוכננת)

כוונה: DR שבדקת ב-drill, ולא תכננת ב-doc — אימות backup-restore יומי לתוך TEST, RTO/RPO נמדדים בכל drill, Storage Box חוצה-אזורים (Falkenstein->Helsinki), game-day אוטומטי רבעוני, ועץ החלטות failover הממפה breach x משך -> פעולה.

החלטות מפתח

D49: היקף DR מפוצל לתוכנית נפרדת — רק cron ה-backup->TEST restore היומי נשאר בתוכנית זו (Phase 3, מפני שהוא מנצל מחדש את מכונת ה-CI/CD). כל השאר (שכפול חוצה-אזורים, drill רבעוני, עץ החלטות failover, מחקר migration של Phase B Neon->self-hosted PG, PITR ללקוח) עובר לתוכנית ה-DR הייעודית העתידית. שמורה רציונל לעיון-בלבד: DR קר בהשקה, replica חם ב-Helsinki ב-Y1, status page מתעדכן אוטומטית מ-SLO עם חלון אישור של 5 דקות, drills מונעי-מהנדס תחילה ואז נתמכי-agent.

P12פילר 12 — Meta-pillars של הפלטפורמה + חילוץ ci-templates

כוונה: ה-v2-polish שמונע ריקבון של 12-24 חודשים — קטלוג services.yaml, Terraform IaC + drift detection, runbooks-as-code, chaos game-days מתוזמנים, קישור status-page-to-SLO, ספריית anonymizer דטרמיניסטית משותפת, מדיניות versioning/deprecation של API, ממשק ייצוא פונה-ללקוח, סריקת PII של fixtures+screenshots — שמגיעה לשיאה בחילוץ דפוסים לשימוש-חוזר לתוך מוצר plan-b-systems/ci-templates (SemVer v1.x.y, צרכנים עושים SHA-pin).

החלטות מפתח

D50: כלי IaC = Terraform (סטנדרט תעשייתי; providers מתוחזקים ל-Hetzner/Cloudflare/Neon). D51: נראות repo של ci-templates = ציבורי (שום דבר סודי-תחרותי; מושך כישרון הנדסי). D52: ספריית anonymizer = מימוש עצמי (~500 LOC, 1-2 ימים; דטרמיניסטי + מודע-schema; ספקים הם overkill בגודל ה-schema שלנו). הפינאלה הגדולה (M10): לחלץ workflows של GH Actions + agent-runtime + agent prompts + מודולי IaC + schemas של SLO + תבניות doc לתוך plan-b-systems/ci-templates הציבורי; pbsiem עובר refactor לצרוך דרך uses:@v1.0.0; product-3 stub עולה ב-30 דקות כבדיקת ה-conformance; צרכנים עושים SHA-pin (לעולם לא @main, לעולם לא @v1); MAJOR = deprecation של 30 יום + auto-PR לכל צרכן + תמיכה של 90 יום.

מצב נוכחי — מאומת מול הקוד

כל שורה כאן נבדקה מול הריפו האמיתי בתאריך 2026-06-13 בניתוח יריב של 16 סוכנים — לא מהזיכרון, לא מטענות של סשן קודם. זוהי עמודת ה-מציאות עבור כל 14 התחומים.

55חסר

8נבנה, לא מחובר

45נבנה אחרת

21תואם

שני ה-P0 שהפכו שינוי של 40 דקות ל-11 שעות: (1) למנוע ה-MDR אין נתיב promote מ-TEST ל-PROD — prod-deploy.yml / prod-promote.yml מזיזים רק את ה-alias של ה-portal ב-Vercel (מאומת: אפס צעדי engine/docker/ssh/hetzner); המנוע מגיע ל-prod רק ידנית. (2) נתוני ה-TEST/staging המשותפים לעולם לא מתרעננים, ולכן השערים מתנדנדים על דאטה ריקה/ישנה. כל השאר למטה הוא חיזוק, לא חוסם.

3 חסר 3 נבנה אחרת 3 תואםפילר 1 — טופולוגיית סביבות ובידוד (DEV/TEST/PROD, בידוד prod, ephemeral לכל PR, log splitter, audit gate, emergency override)

ה-TEST stack הוא אמיתי וחי: test-siem-api.plan-b.co.il מתרגם ל-178.105.214.148 ומגיש תעודת Let's Encrypt תקפה (openssl Verify return code: 0). pbsiem/docs/01-infrastructure.md:80-97 מתעד את פרויקט ה-Hetzner הנפרד של ה-CI/CD: TEST-siemsys-syslog (132962657 / 178.105.214.148), TEST-MDR-01 (132962660 / 178.105.148.156, 3 קונטיינרים engine+postgres+puller), splitter-syslog (133160347 / 178.105.116.35, rsyslog UDP-514 fan-out ל-PROD 88.198.217.216 וגם ל-TEST), רשת Siem-internal-test (12257988, 10.1.0.0/16), BX11 (584099, u601754.your-storagebox.de). ephemeral לכל PR מחווט: .github/workflows/ephemeral-env.yml קורא ל-cicd-system ephemeral-env.yml@main (Neon branch מהורה "production", override של Vercel preview) וגם ל-pr-backend-stack.yml@main עם teardown בסגירת ה-PR (lines 19-56). פריסת TEST של המערכת המלאה מחווטת (סותר את ה-open gap של ה-runbook): הכותרת של .github/workflows/stage4.yml "THE RELEASE-CERTAINTY CORE (S5, D-4=Option C)" בונה מחדש את mdr-engine+edr-puller ב-PR SHA דרך docker-build, פורסת את שניהם ל-TEST stack המשותף דרך siem-deploy.yml@main עם required-env-marker:TEST + health/RestartCount gate (lines 85-125), עושה migrate ל-MDR-PG, מסנכרנת את ה-receiver, ואז מריצה e2e/dast/perf/visual/link-crawl מול ה-preview האמיתי. prod-isolation gate מחווט: scripts/check-all.mjs:52 מפעיל את check-prod-isolation.mjs. ENGINE PROD PROMOTE: אין. prod-deploy.yml + prod-promote.yml הם VERCEL-PORTAL-ONLY (prod-promote.yml:48-64 קורא ל-vercel-promote.yml כדי להזיז את ה-alias siemsys.plan-b.systems; ל-prod-deploy.yml אין אף צעד של engine/docker/ssh/hetzner). siem-deploy.yml נקרא רק על ידי stage4.yml + rollback-drill.yml (שניהם TEST) — לעולם לא עבור prod; הכותרת שלו עצמו מפרסמת מצב "deploy to PROD host (env-gated)" שאף קורא לא משתמש בו.

3 חסר 1 נבנה אחרת 2 תואםפילר 2 — זהות, סודות ואמון (Identity, secrets & trust)

אומת ב-pbsiem (plan-b-systems/pbsiem, clone מקומי). (1) Rotation cron נשלח (SHIPPED) ומחווט (WIRED): .github/workflows/rotate-secrets.yml:42 cron '19 20 * * 5' + workflow_dispatch; scripts/rotate-secrets.sh מבצע rotation לבדיוק שני סודות — OPENSEARCH_PROXY_KEY (line 122/149-150) וסיסמת ה-PG של mdr_admin (line 116-117, ALTER USER ב-206-210). גילוי hosts מתבצע דרך GET על src/app/api/cron/rotation-targets/route.ts (route.ts קיים, 1884 bytes) באמצעות CRON_SECRET bearer. אין rotation עבור MDR_ENCRYPTION_KEY, bearer-token, PAT או GrowthBook. (2) "No human sees prod credentials" מיושר (ALIGNED) עבור 2 הסודות שעוברים rotation: rotate-secrets.sh:355-359 מתעד sha256 fingerprints (16 התווים הראשונים) בלבד; קובץ הערכים בטקסט גלוי עובר shred בכל נתיב יציאה (trap EXIT INT TERM, line 135; אומת ב-361-363). (3) ה-proxy getSecret() חסר (MISSING): אין src/lib/secrets/get-secret.ts (התיקייה נעדרת); ההתאמות היחידות של getSecret הן helpers מקומיים לא קשורים — src/lib/trusted-browser.ts:21 (מפתח חתימת HMAC) ו-src/lib/twilio-callback-token.ts:32 — ולא ה-proxy המבוקר המרכזי. (4) audit של SECRET_ACCESS חסר (MISSING): grep "SECRET_ACCESS" על פני pbsiem מחזיר אפס תוצאות; מודל AuditLog קיים (prisma/schema.prisma:586) אך action הוא מחרוזת חופשית ללא שימוש ב-secret-access. (5) OIDC נדחה כראוי (DEFERRED): VERCEL_TOKEN ארוך-טווח עדיין מאוזכר על ידי 8 workflows (ephemeral-env, post-deploy-verify, prod-deploy, prod-promote, prod-verify, rotate-secrets, stage4, visual-baselines). (6) Per-env scoping חלקי (PARTIAL): מילת המפתח environment של GH קיימת רק ב-prod-deploy.yml; סקריפט ה-rotation שומר על הפרדה בין creds של preview/TEST לבין PROD (lines ~154). (7) docs/secrets.md חסר (MISSING).

2 חסר 2 נבנה אחרת 3 תואםפילר 3 — סביבות ephemeral לכל שינוי

מחווט ורץ ירוק (green) היום (אומת ב-2026-06-14). cicd-system מספק workflows מסוג `workflow_call` לשימוש חוזר — `.github/workflows/ephemeral-env.yml` (יצירה/שימוש חוזר ב-Neon branch עם endpoint מסוג read_write, hard-fail על parent שגוי, שער EPHEMERAL_PAUSED, retry על connection-uri, משתני Vercel preview, seed, תגובת PR, ו-teardown מלא בסגירת PR) ו-`pr-backend-stack.yml` (engine+puller+postgres per-PR ב-docker-compose על host של pool ב-TEST 178.105.148.156, סיסמת PG דטרמיניסטית ב-HMAC, pool eviction/MAX_STACKS, image GC). ה-`.github/workflows/ephemeral-env.yml:19` של pbsiem קורא ל-reusable של cicd ב-@main עם neon-project-id autumn-thunder-97996128, neon-parent-branch=production, seed-script scripts/qa-seed.ts, pr-stack-host. ה-`stage4.yml` של pbsiem פורס engine/puller/receiver ל-TEST + מקים את ה-backend stack של ה-PR עצמו + מריץ e2e/dast/perf/visual/links מול ה-Vercel preview האמיתי של ה-PR. Live: `gh run list` מראה הצלחת ephemeral-env (14-44s) והצלחת stage4 (10-13m) ב-2026-06-14; כל ה-jobs של stage4 ב-PR #78 ירוקים כולל deploy-engine/deploy-puller/pr-stack/pr-services/e2e/links. Crons קיימים: `src/app/api/cron/ephemeral-env-gc/route.ts` (Neon בלבד, 03:00) ו-`ephemeral-env-cost/route.ts` (ספירת Neon branch, 04:00, dual-auth) ב-pbsiem vercel.json:192/196. Admin UI ב-`src/app/admin/ephemeral-envs/page.tsx` (לא ב-path (admin)/admin/infra שבתוכנית).

8 נבנה אחרת 2 תואםפילר 4 — שערי איכות (defense in depth)

אומת על pbsiem origin/main + gh + cicd-system. Layer 1 קיים: .husky/pre-commit=`npx lint-staged`, .husky/pre-push=`npm run prepush`, .nvmrc=22.16.0 (התוכנית אמרה 22.11.0), husky^9.1.7 + lint-staged^15.4.3, ה-config של lint-staged תואם, engines>=22.11.0. אבל `prepush`=`npm run check && npm run typecheck && npm run lint && npm run test:unit` — משתמש ב-`npm run lint` (לא lint:strict) + `test:unit` (לא ה-test המלא); מתבדל מהתוכנית וגם מ-README L28 שטוען `--max-warnings 0` + `vitest run`. Layer 2 נבנה אחרת (BUILT-DIFFERENT): pbsiem/.github/workflows/ci.yml קורא ל-reusable cicd-system/.github/workflows/ci.yml@main כ-job משולב יחיד (required check בודד `ci / CI`, steps סדרתיים), ולא כ-6 checks נפרדים; pbsiem מעביר `lint-command: "npm run lint"` שדורס את ברירת המחדל `lint:strict` של ה-reusable — lint-warnings-as-errors לא נאכף ב-CI. check-all.mjs מריץ 13 lints + מגלה אוטומטית scripts/check-contracts/*.mjs (crypto-envelope.mjs קיים) — D17 framework נבנה+מחווט+נאכף (BUILT+WIRED+ENFORCED) (aligned). Aikido חסר/הוחלף: cicd-system/.github/workflows/security-scan.yml L1+L3 = "Security Scan (Semgrep + Trivy) … Replaces Aikido (D54 superseded)"; מריץ Semgrep+Trivy בלבד, ללא step של Aikido; pbsiem security-scan.yml enforce:true חוסם CRITICAL,HIGH עם allowlist. CodeRabbit: האפליקציה אכן מותקנת (coderabbitai כ-reviewer; PR#78 check `CodeRabbit pass / Review skipped`) אבל ADVISORY — .coderabbit.yaml לא קיים ב-origin/main (הוסר ב-commit fb8a2fb "remove CodeRabbit — can't downgrade to free tier"), request_changes_workflow מוגדר כברירת מחדל FALSE, לא required check. Branch protection (gh api): dev strict=False, 10 checks {ci/CI, impact, dast, e2e, perf, verify, visual, Vercel, links, security-scan/scan}, 0 reviews, enforce_admins=False; main strict=True, רק {ci/CI}, req_approvals=0, enforce_admins=TRUE, force=off. אף אחד מהענפים לא מציג את Lint/Typecheck/Test/Build/Aikido/CodeRabbit כ-checks נפרדים; main עם 0 approvals (התוכנית רצתה 1) + enforce_admins ON (התוכנית רצתה OFF). תיעוד: README Contributing 5-layer + --no-verify→docs/runbooks/prod-emergency-override.md; docs/runbooks/ci-gate-failures.md קיים; ARCHITECTURE.md L133 עדיין אומר "6 parallel jobs" + "Aikido + CodeRabbit" pre-merge — מיושן (STALE).

5 חסר 2 נבנה · לא מחובר 4 נבנה אחרת 1 תואםפילר 5 — חזרה כללית בצורת production (Stage 4)

VERIFIED. ב-cicd-system קיימים reusable workflows לכל מערך הבדיקות (.github/workflows/: playwright-e2e.yml, visual-regression.yml, perf-budget.yml [Lighthouse+k6], a11y-i18n.yml [axe-core + i18n parity], mutation-test.yml [Stryker], dast-zap.yml, link-crawl.yml). pbsiem stage4.yml (lines 208-269) מחווט רק 5 שערים אל ה-run הייעודי לכל PR: e2e, dast, perf, visual, links — בנוסף ל-deploy אמיתי של full-system TEST (build-engine/build-puller/migrate-test-db/deploy-engine/deploy-puller/deploy-receiver) וגם backend stack מבודד לכל PR (pr-stack/pr-db/pr-services עם synthetic-seed.ts), שחורגים מעבר להיקף ה-\"ephemeral env\" שבתוכנית. ה-required contexts של ה-branch protection ב-dev (gh api): ci/CI, impact, dast, e2e, perf, verify, visual, Vercel, links, security-scan — אומתו כחיים וירוקים על PRs אמיתיים (ריצות stage4.yml בתאריכים 2026-06-13/14 = success). אבל: (1) e2e מריץ רק את tests/e2e/dev-environment.spec.ts (stage4.yml:218) מתוך 90 קבצי spec — לא את מערך 86 ה-specs; הערה ב-workflow מודה \"broaden... once the ephemeral DB seeding is wired deterministically.\" (2) visual מכסה 3 עמודים login/pricing/status, EN בלבד, baselines tests/e2e/visual-regression.spec.ts-snapshots/*-chromium-linux.png — לא 24 EN+HE. (3) perf = lighthouse-urls \"/,/login\", configs/lighthouse-budget.json budget סטטי, k6-script-path ריק (אין k6, אין rolling-from-main baseline). (4) i18n-parity.yml קורא ל-a11y-i18n.yml אך רק את ה-job של i18n, מגודר ל-paths src/lib/i18n/** ולא ב-branch protection (הריצה האחרונה 2026-06-10 ירוקה). (5) job של axe-core accessibility: לעולם לא נקרא על ידי pbsiem. (6) mutation-test.yml: לעולם לא נקרא על ידי pbsiem. (7) cross-tenant: tests/e2e/security/ לא קיים; grep cross-tenant = 0 hits. (8) zod-to-openapi/oasdiff: 0 hits בשום מקום. (9) AICostLedger: לא קיים ב-prisma/. (10) migration safety: אין check-migration-safety.mjs; קיים רק scripts/check-contracts/crypto-envelope.mjs (stage4 מבצע prisma db push כן והוגן ללא --accept-data-loss, תחליף חלקי). GAP A מאומת: ה-required check של ה-branch protection ב-main הוא רק \"ci / CI\" — מערך Stage-4 אינו רץ מחדש בקידום (promotion) dev→main.

5 חסר 1 נבנה · לא מחובר 2 נבנה אחרת 2 תואםפילר 6 — שלמות שרשרת האספקה (SBOM, SLSA provenance, חתימת image עם Sigstore, Renovate, dependency-audit, ציון סיכון לשרשרת אספקה)

אומת בקוד וב-gh חי: (1) SBOM — ה-reusable sbom.yml קיים (cicd-system/.github/workflows/sbom.yml) והוא מחווט: pbsiem/.github/workflows/sbom.yml קורא לו @main ב-push ל-main; ריצות חיות מצליחות (הריצה האחרונה 2026-06-14T01:06 success). אך הוא סוטה מהתוכנית: מעלה artifact עם retention של 90-DAY (sbom.yml:68 retention-days: 90), לא מצורף לאף GH Release, לא נדחף ל-Storage Box לתמיד — והוא מבצע SBOM רק על תלויות npm של ה-portal, לעולם לא על ה-container images. (2) SLSA L3 — ה-reusable slsa-provenance.yml קיים (cicd-system) ומנוסח היטב (attest-build-provenance@v2, gh attestation verify), אך יש לו אפס callers: 404 ל-workflow ב-pbsiem, וחיפוש קוד ארגוני אחר slsa-provenance.yml@main לא מחזיר דבר. שלב צירוף-ה-release שלו (slsa-provenance.yml:54 `if: github.event_name == 'release'`) לעולם לא יכול לפעול — ל-pbsiem אין releases כלל (gh release list ריק). => SLSA = נבנה-אך-לא-מחווט. (3) חתימת image עם Sigstore — docker-build.yml (cicd-system) בונה+דוחף את mdr-engine/edr-puller ל-GHCR ללא שום שלב cosign sign (docker-build.yml:74-95); pbsiem/docker-build.yml קורא לו ללא שינוי. siem-deploy.yml מבצע pull/ssh-load של images ללא שום cosign verify (siem-deploy.yml:109-191). grep אחר "cosign" ברחבי cicd-system (ללא node_modules) פוגע רק ב-docs וב-findings route, לעולם לא ב-workflow. => חתימת image ו-verify = חסרים. (4) SLSA Source Track L1 — מיושר: branch protection נאכף (בהתאם ליעד L1 שנבחר בתוכנית, שמסופק על-ידי branch protection בלבד; signed commits נדחו כראוי). (5) Renovate / dep-audit cron — חסרים: אין .renovaterc.json ב-pbsiem או ב-cicd-system; אין workflow של osv-scanner/npm-audit-cron באף אחד מהם. (6) ציון סיכון לשרשרת אספקה — חסר: אין pane של Grafana/dashboard; מחרוזת ה"provenance" היחידה ב-dashboard היא בדיקת matchesMainTip של promotion-PR ב-Vercel (promotions/page.tsx:201,466), שאינה קשורה ל-SLSA/cosign. (7) docs/supply-chain.md — חסר (אין קובץ כזה). הערה: security-scan.yml:3 אומר "Replaces Aikido (D54 superseded)" — Aikido למעשה OUT בקוד (Semgrep+Trivy בלבד), בסתירה לטענת "Aikido Back IN" שב-build.py:169.

5 חסר 1 נבנה · לא מחובר 2 נבנה אחרת 1 תואםפילר 7 — אספקה הדרגתית (progressive delivery) ו-rollback בטוח

מנגנון ה-canary/flags הוא BUILT-BUT-NOT-WIRED (בנוי אך לא מחובר), וכמה רכיבים פשוט חסרים. (1) cicd-system/.github/workflows/canary-gate.yml קיים עם לוגיקה אמיתית: ramp של rollout-coverage דרך GrowthBook PUT /api/v1/features לפי לוח הזמנים הדיפולטי של D28 (line 39), בדיקות error-rate-vs-baseline בחלון gate של 5 דקות, ובעת breach ביצוע flag-disable ב-GrowthBook + כשל ה-workflow (lines 128-219). אבל מדובר ב-reusable מסוג workflow_call שלא מופעל על ידי שום דבר אמיתי — grep מראה שההפניה היחידה היא configs/example-callers/canary-caller.yml; אף workflow של pbsiem לא מפנה ל-canary/growthbook/feature-flag (grep של pbsiem/.github/workflows החזיר ריק). (2) GrowthBook לא deployed: configs/growthbook-setup.md הוא doc בלבד (קטע docker-compose בתוך markdown), וכל שלושת ה-hosts המתוכננים (gb.plan-b.co.il, growthbook.plan-b.co.il, gb-api...:3100) מחזירים connection failure (curl 000). אין SDK של @growthbook/growthbook ב-portal/engine/puller. (3) אין scripts/rollback-prod.mjs (find scripts/deploy החזיר רק scripts לא קשורים). פרימיטיב ה-Vercel-alias נמצא במקום זאת ב-pbsiem prod-promote.yml -> cicd vercel-promote.yml, ומשמש כ-HUMAN PROMOTE (repoint של alias, ללא ramp); ו-rollback מסוג Vercel promote-previous מוטמע inline ב-post-deploy-verify.yml (lines 130-183) אך הוא SEALED OFF ב-production (auto-rollback דיפולט false; lines 29-33, 191 אוכפים verify+page-Mike, ולעולם לא auto-rollback ב-prod לפי S6). (4) אין /api/cron/canary-ramp, אין /api/cron/burn-rate-rollback. (5) אין UI של /admin/canary או /admin/deploy/rollback ב-pbsiem (grep ריק). (6) אין Migration Review Agent ואין אכיפת expand-contract בשום מקום — cicd change-impact.yml רק מחיל LABEL מסוג "schema-changing" על נתיבי prisma/schema ו-/migrations/ (lines 71-76), ללא prisma migrate diff, ללא חסימת ALTER, ללא הצעת expand-contract; pbsiem change-impact.yml אפילו לא מפנה ל-prisma. (7) אין סוגי אירועי AuditLog של canary (grep ל-CANARY_RAMP/CANARY_ROLLBACK/PROD_DEPLOY_ROLLBACK ב-pbsiem החזיר ריק). ה-rollback-drill.yml שהוא TEST-only (שחזור container, assert של PLANB_ENV=TEST) כן הופעל (S6: restore בתוך כ-1s/21s) — זה הדבר היחיד הסמוך ל-progressive-delivery שהוכח כעובד, וזהו restore של engine container, לא canary או alias rollback.

3 חסר 1 נבנה · לא מחובר 4 נבנה אחרת 1 תואםפילר 8 — Observability, SLOs ושיוך עלויות (D33-D38)

מאומת. ה-stack של observability ב-TEST אכן קיים ורץ: observability/docker-compose.yml (Prometheus v2.55.1, Alertmanager v0.27.0, Grafana 11.3.1, node-exporter, blackbox v0.25.0, otel-collector-contrib 0.114.0, Caddy) על ה-Hetzner Build-Runner; grafana.plan-b.co.il/api/health מחזיר 200 {database:ok, version 11.3.1} — אומת חי באמצעות curl. slo/slo-definitions.json מחזיק את ששת ה-SLOs; observability/generate-slo-alerts.mjs מרנדר את observability/prometheus/rules/slo-alerts.yml (11 alerts). אבל: (1) כללי ה-alert האלה שואלים את http_requests_total{slo=...} ו-http_request_duration_seconds_bucket{slo=...} — metrics ששום דבר לא פולט. pbsiem/src/instrumentation.ts:8-15 רק מייבא את sentry.server/edge config; אין שום @vercel/otel, אין OTLP exporter, אין registerOTel, ואין W3C propagation ב-portal/engine/puller (grep של pbsiem src = 0 hits מחוץ ל-node_modules). ה-traces pipeline של ה-collector מייצא אל 'debug' (collector.yaml:36, נזרק). לכן ה-burn-rate alerts לעולם לא יכולים לירות על תעבורה אמיתית — הם יושבים על baseline של 0. (2) ה-SLOs של ה-dashboard הם סט שונה: dashboard/src/lib/slo-compute.ts:65-93 מחשב 'CI success rate', 'CI p90 duration', 'Deployment success rate' מתוך שורות ה-build/deployment של ה-dashboard עצמו — לא ששת ה-MDR SLOs. dashboard/src/app/api/slo/route.ts מגיש אותם. (3) שיוך עלויות: אפס client_id או feature_flag labels בכל מקום ב-observability/ (grep = ללא התאמות). dashboard /api/metrics (route.ts) מייצא רק cicd_* platform metrics; ה-cost metric היחיד הוא cicd_weekly_estimated_cost_eur מתוך EphemeralEnvCost (הוצאת ephemeral-env), לא per-customer/per-feature. אין שום קוד cost-exporter (Glob **/cost-exporter* = none). (4) Sentry: tracesSampleRate 0.05 ב-prod (לא 0.1) ו-Session Replay מושבת במכוון (sentry.client.config.ts:11-16) — אין client_id ב-beforeSend. (5) Better Uptime: לא משולב; 'synthetic monitoring' = blackbox probes נגד 2 URLs בלבד (prometheus.yml:51-53), לא 6 ב-1-min דרך Better Uptime. (6) Alertmanager מנתב אל ALERT_WEBHOOK_URL = ה-webhook של dashboard ה-CI/CD (כותב שורות audit מסוג ALERTMANAGER_WEBHOOK), לא אל משטח ה-SVC-HEALTH /admin/infra/service-health של pbsiem שהתוכנית מציינת בשמו. (7) D38 PROD-obs-stack: לא הוקצה (TEST-only).

6 חסר 1 נבנה · לא מחובר 4 נבנה אחרת 1 תואםפילר 9 — צי סוכני AI עם סמכות תחומה

קיים framework אמיתי וקוהרנטי לסוכנים תחת cicd-system/agents/, אבל שום דבר לא מפעיל אותו. הקבצים: framework/types.ts (AgentTier enum 0-4, AgentSpec, TrustThreshold), framework/runtime.ts (מפת TIER_PERMISSIONS, createAgentRuntime עם hooks ידניים של pre/post-tool-use + gating לפי tier+allowlist+cost, loadAgentSpec/listAgentSpecs), framework/telemetry.ts (recordInvocation→JSON מקומי, checkEarnedTrust, TRUST_THRESHOLDS), framework/cost-guard.ts (MODEL_PRICING, DEFAULT_CAPS לכל-run 0.5/לכל-PR 5/לכל-repo-mo 50, checkCostBudget/recordCost). 10 specs ב-agents/specs/*.json (migration-review T0, pii-detection T0, cost-anomaly T0, ci-triage T1, doc-drift T1, visual-regression-triage T1, test-generation T1, dependency-update T1, oncall-investigation T2, release T3). WIRING: grep אחר createAgentRuntime/loadAgentSpec/runAgent/agents/framework מחזיר אך ורק את docs/cicd-platform-build-plan.md, agents/README.md, ואת runtime.ts/index.ts של ה-framework עצמו — אפס callers ב-dashboard/, ב-.github/workflows/ (אף אחד מ-~22 ה-workflows לא מפנה אליו), ב-scripts/, או ב-pbsiem. אין package.json/tsconfig בשורש — קוד ה-TS של ה-framework אינו מהודר לתוך אף build target; אין *.test.* ל-framework. D39 מופר: אין dependency של Claude Agent SDK בשום מקום; allowedTools הוא string[] רגיל של JSON שנבדק ידנית (runtime.ts:201-228), ולא allowedTools של ה-SDK בשכבת ה-API; ה-hooks מקודדים ידנית, ולא PreToolUse/PostToolOutput של ה-SDK. מספרי ה-earned-trust של D41 תואמים (telemetry.ts:30-34) אבל אין שום שדה SHADOW/LIVE בשום מקום ו-checkEarnedTrust רק מדווח על זכאות — לעולם לא מקדם אוטומטית (README:119 \"Promotion is not automatic\"); כמו כן overrideRate מחושב כ\"אדם דרס DENIAL של tier\" (telemetry.ts:96), ולא כ\"אדם עשה משהו שונה ממה שהסוכן הציע\" של התוכנית — מטריקה שונה. D42 מופר: ה-telemetry הוא JSON מקומי ב-gitignore (agents/telemetry/*.json), ולא OTel/Prometheus/Grafana. חסר לחלוטין: ערוץ <untrusted> (grep: רק ב-docs), graph cap בין סוכנים / cascade cost (grep: רק ב-docs), golden-input suites (אין), post-mortem חודשי + agent-off-day (אין), מודלי AgentTrustState + AICostLedger מתמשכים (נעדרים מ-schema של pbsiem — מודלי ה\"trust\" היחידים הם TrustedDevice/Persona-4 trust-period, לא קשורים), התאמה ל-CSA framework (הטרמינולוגיה לא נוכחת). 4 הסוכנים הקיימים אינם עברו retrofit/הוכרזו בתור tier. הדבר היחיד הצמוד-לצי-סוכנים שבאמת WIRED הוא dashboard/src/lib/promotion-verdict.ts: שני \"verdict reviewers\" מבוססי-LLM בעלי אופי מייעץ (Anthropic claude-sonnet-4-6 + OpenAI gpt-4o) שנקראים מ-/api/promotions/verdict, /api/promotions, ו-/api/cron/promotion-notify, נשמרים ב-AuditLog (action=AI_VERDICT), ומוצגים בחלונית ה-Promotions / בלוח התוצאות — אבל זו מימוש עצמאי מבוסס raw-HTTP: הוא אינו מייבא את agents/framework, אין לו tier runtime, אין cost guard, אין allowedTools, והוא מזין את הודעת ה-commit אל ה-prompt ללא עטיפת <untrusted> (promotion-verdict.ts:35-68). הוא מייעץ/shadow בלבד (לעולם לא חוסם את הכפתור, Safety Rule 5).

8 חסר 1 נבנה · לא מחובר 1 נבנה אחרת 1 תואםפילר 10 — ראיות ציות כתוצר-לוואי (Compliance evidence as exhaust)

חלקי (PARTIAL). מה שקיים ומחווט: (1) C5 — תיוג אוטומטי של change-impact פעיל (LIVE): ה-reusable workflow ב-cicd-system/.github/workflows/change-impact.yml מחיל את כל 5 ה-labels מהתוכנית (auth-touching/schema-changing/customer-facing/security-relevant/pii-touching, שורות 62-103) כולל סריקת PII מבוססת-תוכן על ה-diff (146-162); ב-pbsiem/.github/workflows/change-impact.yml:17 קוראים לו עם @main על pull_request אל dev/main; gh מציג ריצות מוצלחות מ-2026-06-14 (run 27485718386, 14s). (2) קיים מאגר AuditLog בדאשבורד (cicd-system/dashboard/prisma/schema.prisma:265-280) שנכתב אליו באופן פעיל — אבל רק עבור פעולות control-plane של הדאשבורד: promotions/dispatch (controls/dispatch/route.ts:84, promotions/dispatch:72, merge-pr:86, promotion-notify:106), סטטוס ממצא אבטחה (security/findings/[id]/status:59, repair-queue:127), webhooks/autofix:58, webhooks/alertmanager:45. (3) ל-MDR יש מודל AuditLog משלו (pbsiem/prisma/schema.prisma:586-598: admin_email/action/target/details/ip/created_at) + gate בשם check-audit-coverage.mjs + gate בשם check-pci-body-logging.mjs, שניהם ב-check-all.mjs (שורות 35,41). שבור/לא-מחווט: (a) C1 — האודיט בשכבת ה-pipeline הוא WRITE-TO-VOID (כתיבה לתהום): cicd-system/.github/workflows/audit-log.yml שולח POST של payload מותאם (repo/workflow/run_id/changed_files/duration_seconds) עם header X-GitHub-Event:workflow_run אל /api/webhooks/github, אך dashboard/src/app/api/webhooks/github/route.ts:55-64 דורש payload.repository ומחזיר HTTP 400 "Invalid payload" לצורה הזו; ה-workflow בולע כל תגובה שאינה 2xx כ-::warning:: (audit-log.yml:91) ובכל זאת "מצליח". טבלת ה-Build מאוכלסת על-ידי אירועי workflow_run הנייטיביים של ה-GitHub App הארגוני, ולא על-ידי audit-log.yml — והנתונים שלה לעולם אינם נשמרים לשום מאגר אודיט. חסר לחלוטין (מאומת כנעדר בשני ה-repos באמצעות find): תיקיית compliance/; C2 soc2-control-map.yaml; C3 pci-control-map.yaml; D47 pci-tranzila-scope-letter.pdf; C4 docs/data-flows.md; קונפיגורציות auditd של C6; דוחות חתומים של C7 תחת compliance/drills/; C8 /api/customer/audit-export (אין endpoint של audit-export בשום מקום; קיימים רק admin/audit + admin clients export-request); אכיפת retention של C9 (אין crons ל-7yr/5yr, אין שמירת append-only או guard לאי-שינוי על אף אחד משני מודלי ה-AuditLog); C10 docs/compliance.md. אין כלל artifacts של מיפוי SOC 2 / PCI / GDPR.

2 חסר 1 נבנה · לא מחובר 3 נבנה אחרתפילר 11 — התאוששות מאסון (Disaster Recovery — מוכח, לא מתוכנן)

פריט ה-DR היחיד שבתחום הוא BUILT-BUT-NOT-WIRED, מזויף חלקית, וסוטה ארכיטקטונית מ-D1. (1) הקובץ .github/workflows/backup-restore-verify.yml קיים על main של plan-b-systems/cicd-system אך הוא `on: workflow_call` בלבד (lines 3-4). הקומיט 2dd43a8 ("fix: remove schedule trigger from backup-restore-verify workflow", 2026-05-28) מחק את `schedule: cron '0 3 * * *'` (שהיה בראש הקובץ) מכיוון שהוא המשיך להיכשל על cicd-system עצמו (אין NEON_API_KEY) — מתוך כוונה מוצהרת שריפו צרכן (consumer repo) יקרא לו. אף ריפו צרכן מעולם לא עשה זאת: grep לרוחב pbsiem ו-cicd-system לא מוצא אף קורא (callers) — רק אזכורי פרוזה ב-ARCHITECTURE-CI.md. 8 הריצות האחרונות ב-GH כולן `failure` עם 0s (כשלי self-repo שמלפני ההסרה). (2) ה-workflow משחזר ענף Neon לתוך ענף Neon EPHEMERAL (backup-restore-verify.yml:82-126), ולא "גיבוי PROD של אתמול INTO TEST" כפי ש-D1 דורש — הוא מעולם לא נוגע בארטיפקט גיבוי אמיתי או ב-stack של TEST. (3) שלב "Verify row counts within tolerance" (lines 169-193) הוא NO-OP: הוא כותב `deviation_ok=true` ללא תנאי בין אם קיים מידע ובין אם לא — בדיקת ה-tolerance המפורסמת לא עושה כלום. (4) הוא מכסה רק את ה-DB של פורטל Neon; מידע המוצר האמיתי של mdr (PostgreSQL על PROD-MDR-01, infra/prod-mdr-01/README.md) אינו ב-workflow הזה. (5) הקובץ docs/cicd-platform-build-plan.md:216 מציין את מימוש ה-cron-route שמעולם לא נבנה — src/app/api/cron/backup-restore-verify/route.ts אינו קיים. (6) גיבוי ה-PG היומי של MDR בצד ה-host (02:30 -> Storage Box) הנטען ב-build.py:872 אינו ניתן לאימות באף ארטיפקט ריפו שעבר commit — תיקיית infra/prod-mdr-01/backups/ מכילה רק pgdump ידני אחד מאפריל שלפני stage1; אף יחידת cron/סקריפט לא עבר commit. (7) ה"drill" החי הקרוב ביותר הוא pbsiem .github/workflows/rollback-drill.yml (פעיל, אך workflow_dispatch בלבד) — חזרה (rehearsal) של שחזור קונטיינר ב-Pillar 7 על TEST, ולא ה-DR D4 quarterly drill (אין מדידת RTO/RPO, אין דוח חתום, לא מתוזמן). (8) לא קיימים מסמכי DR (docs/dr.md, docs/runbooks/dr-drill.md, docs/runbooks/failover-decision.md כולם נעדרים) — אך אלה מחוץ לתחום על פי D49.

7 חסר 3 נבנה אחרת 1 תואםפילר 12 — מטא-פילרים של הפלטפורמה + חילוץ ci-templates (M1-M10: services.yaml, Terraform IaC+drift, runbooks-as-code, chaos game-days, status-page-to-SLO, ספריית anonymizer, API versioning/deprecation, ממשק export ללקוח, סריקת PII של fixtures/screenshots, ולסיום הגדול — חילוץ ה-ci-templates הציבורי plan-b-systems/ci-templates עם consumers שמבצעים SHA-pin)

M10 ממומש חלקית אך עם סטייה: מאגר ה-reusable-workflow קיים בתור plan-b-systems/cicd-system (27 reusable workflows תחת ‎.github/workflows/‎) ובאמת נצרך — ל-pbsiem יש 30 הפניות ‎uses: plan-b-systems/cicd-system/.github/workflows/*.yml@main‎ (ci.yml:18, ephemeral-env.yml:19/48, post-deploy-verify.yml:26/53, prod-promote.yml:48/72, ...) ו-vaughnblades צורך 3 (ci/sbom/security-scan, כולם ‎@main‎). אבל: (1) המאגר הוא PRIVATE (gh repo list: ‎plan-b-systems/cicd-system ... private‎), בניגוד ל-D51 PUBLIC; (2) לא קיים מאגר plan-b-systems/ci-templates (gh: \"Could not resolve to a Repository\"); (3) אין תגיות SemVer / ‎@v1.0.0‎ — ה-consumers מבצעים pin ל-‎@main‎ (30/30 הפניות ב-pbsiem, 0 SHA-pins, 0 ‎@v‎ pins), בניגוד ישיר לכלל \"never @main\" של התוכנית למרות ש-README.md:18/24 מתעד \"‎@<sha>... Pin by SHA, not branch. Renovate auto-updates‎\" (המשמעת כתובה, לא מיושמת). vaughnblades (39 PRs ממוזגים, מוצר אמיתי וחי) הוא דה-פקטו מבחן ה-conformance, שעבר onboarding מ-cicd-system. M6 anonymizer: קיים רק scripts/anonymize-seed.mjs — scrubber מבוסס REGEX על stream של SQL מעל טקסט pg_dump (emails/phones/IPs/cards), ולא ספריית ה-‎scrub(row, schema)‎ מודעת-הסכמה שבתוכנית; אין wiring ל-weekly-TEST-refresh; יתום אל מול הלוגיקה הנפרדת של S2b/S2c scripts/dev-scrub. M1 services.yaml: ABSENT בשני המאגרים (Glob: no files) למרות שמסגרת ה-agents + ה-specs שהוא אמור להזין קיימים (agents/specs/cost-anomaly.json, oncall-investigation.json, doc-drift.json) — לאותם agents אין קטלוג לקרוא ממנו (מחמיר את הסטייה 'מסגרת agent נבנתה אך לא חוברה לכלום'). M2 Terraform: ZERO קבצי ‎.tf‎ בשום מקום (Glob: no files); התשתית מנוהלת ידנית. M3 runbooks-as-code: ‎docs/runbooks/*.md‎ קיימים בשני המאגרים אך בפורמט חופשי (auto-fix.md, link-crawl.md, ...) ללא סכמת-סקציות נדרשת וללא ולידציית CI. M4 chaos: אין סקריפטים של chaos; ההתאמות היחידות ל-'chaos' הן ב-ARCHITECTURE-CI.md המיושן של pbsiem (Litmus/Stage-7 staging — עיצוב אחר שלא נבנה) ובמסמך ה-plan-mirror. M5/M7/M8/M9: לא נבנו — אין status-page-to-SLO binding, אין מדיניות X-Deprecated/API-versioning, אין משטח export קוהרנטי ללקוח, אין שער סריקת-PII מעל fixtures/visual baselines (security-scan.yml:85 מריץ Trivy ‎vuln,secret,misconfig‎ שתופס secrets אגב אך אינו שער ה-PII המתוכנן ל-fixture/screenshot).

4 חסר 3 נבנה אחרת 2 תואםCORE SPINE — טופולוגיית DEV/TEST/PROD + זרימת dev→test→main + מודל ה-promote + נתיב הדפלוי של מנוע MDR/ה-backend

טופולוגיה: מסמכי pbsiem וה-runbook מאשרים 3 שרתי TEST + פרויקט plan-b-test (לא אומת מחדש כפעיל בסשן הזה). הזרימה בנויה אחרת מהתוכנית: (1) סביבת ephemeral לכל PR = Vercel preview של ה-portal + ענף Neon בלבד (ephemeral-env.yml); (2) stage4.yml מדפלוי את engine+puller ל-TEST stack המשותף דרך siem-deploy.yml עם required-env-marker:TEST + TEST_MDR_DEPLOY_HOST (stage4.yml:89-127) — משותף, מסודר סדרתית בין PRs, לא מבודד לכל PR; (3) שער ה-merge-to-main האנושי של התוכנית מוחלף ב-promotion-autopilot.yml (משתנה PROMOTION_AUTOPILOT = on, אומת כפעיל) שפותח אוטומטית PRs של promotion מ-dev→main + PRs של "fold" וממזג אותם אוטומטית ברגע ש-certify ירוק — מסיר את "שלב ה-merge האנושי שבאמצע" (promotion-autopilot.yml:8-9); (4) PROD מושג דרך prod-deploy.yml (push ל-main → vercel-gated-deploy, staging artifact, ללא הזזת alias) ואז קליק Promote אנושי נפרד (prod-promote.yml מופעל ע"י ה-dashboard /api/promotions/dispatch/route.ts) שמזיז את ה-alias של Vercel בלבד. חי: ריצת prod-deploy 27484292775 (PR #77) + Promote 27484423128 שתיהן הצליחו 2026-06-14; fold PR #76 פעיל. נתיב ה-PROD של המנוע חסר: siem-deploy.yml נקרא רק ע"י stage4.yml (TEST) ו-rollback-drill.yml — אין שום קורא שמדפלוי את engine/puller ל-PROD-MDR-01. docker-build.yml דוחף ל-GHCR אבל ל-docker-build הניתן לשימוש חוזר של cicd-system אין שלב cosign sign, אין SLSA attestation; אפס קוראים של slsa-provenance/attest-build-provenance ב-pbsiem; siem-deploy.yml לא מבצע cosign-verify לפני pull. כך שספינת ה-PROD של המנוע מהתוכנית (signed-image→verified-pull) לא קיימת; המנוע מגיע אי-פעם רק ל-TEST. רענון נתוני ה-staging חסר: אין anonymize-seed.mjs / אין ספריית scrub(row,schema) / אין verify-dev-scrub / אין cron של רענון שבועי ב-pbsiem (ה-DEV DB סוטה → flake בשערים, ה-P0 שהבעלים נתקל בו). שער prod-isolation תואם-אך-עבר-שינוי-שם: scripts/check-prod-isolation.mjs (מונע-קונפיגורציה דרך audit-prod-isolation.config.yaml) מחווט אל scripts/check-all.mjs:52.

2 חסר 5 נבנה אחרת 1 תואםדיוק ה-RUNBOOK — האם cicd-runbook build.py משקף את התוכנית הנעולה וגם משקף בכנות את המציאות הנוכחית, או שהוא נסחף / הצהיר יתר על המידה

אומת ב-pbsiem/.github/workflows: prod-deploy.yml מממש verify-BEFORE-promote (header L1-28: "production alias is STRUCTURALLY unreachable by an unverified artifact"; קורא ל-vercel-gated-deploy.yml@main עם promote-after-verify:true, auto-promote:false, prod-alias siemsys.plan-b.systems) ומחנה ארטיפקט staging ("READY TO PROMOTE", הרצה מוכחת 27328491331 לפי cicd vercel-gated-deploy.yml:8). prod-promote.yml הוא workflow_dispatch נפרד שהוא שער אנושי המאמת מחדש provenance, מפנה מחדש את ה-alias, ואז סורק (crawls) את ה-alias החי (prod-promote.yml:6-72) — Gap B סגור בקוד. stage4.yml (L41-126) אכן בונה mdr-engine + edr-puller ב-PR SHA, מחיל את סכמת ה-TEST DB, ופורס את שניהם ל-TEST-MDR-01 דרך siem-deploy.yml@main עם PLANB_ENV/rollback מגודר-בריאות (health-gated) — full-system per-PR TEST deploy מחווט (סותר את טענת ה-"open" שבלשונית ה-gaps). certify.yml קיים על PRs ל-main, מגדר מכנית על protected_merge + introducing-PR gates, ומריץ AI CERTIFY/REFUSE רק כאשר ANTHROPIC_API_KEY קיים (certify.yml:145-161) — רדום בכוונת תכן. promotion-autopilot.yml פותח PR מ-dev→main ו-ARMS את ה-auto-merge של GitHub ברגע ש-certify=CERTIFY + הבדיקות ירוקות (L92-137) — מסלול promotion אוטומטי שהתוכנית אוסרת. חסר במסלול הקריטי: (a) אין engine→PROD deploy/promote workflow — docker-build.yml בונה image-ים של engine/puller ב-push ל-main אך יעדי ה-DEPLOY היחידים של ה-engine הם TEST (stage4, rollback-drill); PROD-MDR-01 (178.104.172.142) מופיע רק ב-known-hosts של rotate-secrets, לעולם לא כיעד פריסה. (b) אין ספריית scripts/dev-scrub/ ואין scheduled TEST-data-refresh / anti-drift cron ב-pbsiem (קיים רק cron של rotate-secrets) — נתוני ה-staging מתיישנים, וזהו שורש הבעיה המתועד ל-gate-flake.

פערים — תוכנית מול מציאות, כל סטייה

המציאות מקדימה את ה-runbook בכל הנוגע לתשתית ה-TEST/ephemeral, אך מפגרת אחרי ה-LOCKED PLAN בכל הנוגע ל-spine. סביבת ephemeral per-PR, full-system TEST deploy (mdr-engine + edr-puller + receiver + isolated per-PR backend stack), ומערך של 5 gates (e2e/dast/perf/visual/links) הם WIRED אמיתיים ו-green על PR-ים אמיתיים של pbsiem אל dev, מה שעולה על הטענה המיושנת של ה-runbook לפיה מדובר ב-'portal + DB branch only'. אבל ה-spine הנעול TEST->PROD שבור: prod-deploy.yml ו-prod-promote.yml מזיזים אך ורק את ה-Vercel portal alias (אומת: אפס שלבי engine/docker/ssh/hetzner; ל-PROD mode מבוסס-env של siem-deploy.yml אין caller), כך שה-engine מגיע ל-PROD רק דרך deploy ידני out-of-band ללא promotion מאומת מ-staging ל-prod. זהו ה-P0 שהפך שינוי של 40 דקות ל-11 שעות. מחמיר זאת: ה-data של MDR-PG/OpenSearch ב-TEST stack המשותף לעולם לא מתרענן (אומת: אין weekly anonymized dump), ולכן ה-gates flake על data ריק/מיושן; ה-staging אינו מתויג והשער האנושי היחיד הוא לחיצת portal-alias. פילר 4 רופף בשקט (lint:strict אינו נאכף ב-CI, Aikido נעדר למרות runbook שטוען 'back IN', CodeRabbit advisory, main מוגן רק על ידי ci/CI עם 0 approvals אך עם enforce_admins ON). מערך ה-battery של פילר 5 אמיתי אך דק (smoke spec אחד מתוך ~86, 3 EN visual baselines מתוך 24, ללא k6/a11y/mutation/cross-tenant/oasdiff/migration-safety). פילר 2 ומודלי ה-cost (EphemeralEnvCost, AICostLedger) ברובם לא נבנו. השורה התחתונה: ה-engine עובד עבור TEST; הבטחת ה-release-certainty של ה-plan ל-PROD עדיין אינה קיימת מכנית.

129 ממצאים על פני 14 תחומים, כל אחד מסווג ומגובה בראיות. חסר = מעולם לא נבנה. נבנה · לא מחובר = קיים ב-cicd-system אבל pbsiem לא קורא לו. נבנה אחרת = קיים אבל סוטה מהתוכנית. תואם = תואם לתוכנית (מקופל). לשונית משימות הופכת את אלה לרשימת משימות מסודרת.

פילר 1 — טופולוגיית סביבות ובידוד (DEV/TEST/PROD, בידוד prod, ephemeral לכל PR, log splitter, audit gate, emergency override)

🔴 חסרל-engine (mdr-engine/edr-puller) אין מסלול promote-to-PROD

ה-TEST->PROD spine של התוכנית (lines 317-331) הוא 'build מאומת מקודם ל-prod' עבור כל המערכת. במציאות: prod-promote.yml + prod-deploy.yml מזיזים רק את ה-Vercel portal alias; הם מכילים אפס צעדי engine/docker/ssh. siem-deploy.yml (שמפרסם מצב PROD מבוקר-env) נקרא רק על ידי stage4/rollback-drill מול TEST, לעולם לא עבור prod. כך שה-engine מגיע ל-PROD-MDR-01 רק דרך deploy ידני מחוץ-לתהליך ללא promotion מאומת מ-staging ל-prod, ללא בדיקת provenance חוזרת, וללא מקבילה-ל-alias מבוקרת-אדם. זוהי סטיית ה-P0 המרכזית שבעל המערכת נתקל בה.

🔴 חסרנתוני ה-staging של ה-TEST המשותף לעולם אינם מרועננים (אין רענון שבועי אנונימיזציה של PlaySmart לתוך MDR-PG/OpenSearch)

D7 של התוכנית + תת-תוצר I דורשים רענון שבועי בצורת-staging (dump של prod עם אנונימיזציה) כך שה-TEST stack המשותף נושא נתונים בצורה-אמיתית וה-gates לא מתנדנדים על DB ריק. ה-splitter מזין syslog חי לתוך TEST-OpenSearch, אך צד ה-MDR-PG מקבל רק schema push לכל PR (migrate-test-db) + משתמשי qa-seed של Neon לכל PR. אין cron של רענון שבועי ואין ספריית anonymizer (רק scripts/anonymize-seed.mjs ל-seed-scrubbing). ה-runbook עצמו דוחה זאת ל-Phase 7. זהו ה-P0 של 'נתוני staging מתיישנים -> gates מתנדנדים' שבעל המערכת נתקל בו.

🔴 חסרrunbook ל-emergency-override (Mike-only, audit פוסט-אירוע תוך 24h)

תת-תוצר H של התוכנית דורש docs/runbooks/prod-emergency-override.md שמגדיר פעולות לגיטימיות ישירות-ל-prod של Mike-only + מוסכמת ה-commit [EMERGENCY-OVERRIDE] + רשומת ה-AuditLog הנדרשת תוך 24h, בתוספת הפניה בת שורה אחת ב-ARCHITECTURE.md. תיקיית ה-runbooks מכילה 6 runbooks אחרים אך לא את זה; אין הפניה ל-EMERGENCY-OVERRIDE ב-ARCHITECTURE.md. (קיים תחליף חלקי: ה-claude-guard הגלובלי מספק override מתוחם-זמן של Mike-only, אך ה-runbook של המדיניות/audit-trail המתועד חסר.)

🟠 נבנה אחרתסקריפט ה-audit של prod-isolation (check-all gate #13) מבצע רק 1 מתוך 4 הבדיקות של התוכנית

ה-audit-prod-isolation.mjs של התוכנית הוגדר לבצע (a) diff של ערכי env בין vercel preview ל-production, (b) grep של אינדיקטורי prod מקודדים-קשיח, (c) לוודא שאין fallbacks של DB-URL ב-schema.prisma, (d) לוודא ששום fixture לא משתמש ב-client_id אמיתי של prod. הסקריפט שנבנה (ששמו שונה ל-check-prod-isolation.mjs) מממש רק את (b) — grep סטטי — ומגדיר כ-grandfathered 89 קבצים קיימים דרך legacy allowlist. ה-diff של env-var בזמן ריצה (החלק שבאמת אוכף 'אין DSN/key של prod נגיש מקוד שאינו prod') חסר. גרוע מכך, ה-config מצהיר על יכולות שהקוד מעולם לא קורא: known_prod_client_ids (line 28-29) ו-neon_hosts catch-all (line 12-13) נמצאים ב-audit-prod-isolation.config.yaml אך מעולם לא מופנים אליהם על ידי check-prod-isolation.mjs — over-claim של config מול code.

🟠 נבנה אחרתה-staging אינו stage עם שם, מתויג, עם human-approval מחווט כ-prod gate

התוכנית נותנת שם ל-spine dev->staging->prod עם gate אחד של staging במערך-בדיקות-מלא + human promote. במציאות: ה-TEST stack מתפקד כ-staging אך אינו מתויג ככזה, וה-human gate היחיד הוא לחיצת ה-Vercel prod-promote הידנית (portal alias). ה-runbook עצמו מסמן זאת ('TEST stack ~= staging, ללא תווית', 'wire the approval as the prod gate'). ה-promote מאמת מחדש את provenance של ה-Vercel artifact אך אין צעד אחד של staging-green-ואז-promote שמכסה גם portal וגם engine.

build.py:331-332,360-362 (warnA 'name + isolate' / 'wire the approval as the prod gate'); prod-promote.yml:1-16 (portal-only human promote)

🟠 נבנה אחרתה-'Full-system TEST deploy per change' של ה-runbook רשום OPEN אך הוא למעשה DONE

רשומת ה-gap של ה-runbook (build.py:639-645) טוענת שה-deploy לכל שינוי מכסה 'portal + DB branch only' וש-engine/puller/OpenSearch אינם בשום deploy לכל שינוי. stage4.yml מוכיח אחרת (engine+puller נפרסים ל-TEST בכל PR). זוהי התיישנות של runbook-מול-מציאות שמכריזה-בחֶסֶר על ההתקדמות וצריך לתקן כך שהמסמך יפסיק לייצג שגוי את המצב לקוראים/Mike.

build.py:639-645 (status 'open', 'portal + DB branch only') vs pbsiem/.github/workflows/stage4.yml:1-19,85-125 (engine+puller to TEST per PR)

✅ 3 תואמים לתוכנית — לחצו להרחבה

✅ תואםה-TEST stack הוקצה כפרויקט Hetzner CI/CD נפרד (lean shadow + splitter + BX11)

פרויקט Hetzner נפרד עם TEST-siemsys-syslog + TEST-MDR-01 + splitter-syslog (UDP-514 fan-out ל-PROD וגם ל-TEST), רשת 10.1.0.0/16, BX11 box ייעודי, test-siem-api DNS + LE cert — כולם קיימים וחיים. תואם ל-D6/D7-splitter/D10 ולתת-תוצרים A/F/G/J.

pbsiem/docs/01-infrastructure.md:80-97,173-177; openssl s_client test-siem-api.plan-b.co.il -> CN=test-siem-api, Verify return code: 0; DNS -> 178.105.214.148

✅ תואםסביבת ephemeral לכל PR (Neon branch + Vercel preview + PR backend stack + auto-teardown)

ephemeral-env.yml יוצר Neon branch מההורה 'production', מגדיר את ה-override של Vercel preview, מקים backend stack לכל PR, ומפרק אותו (+Neon branch) בסגירת ה-PR. ממש את D5+D8. הערת שמות קטנה: ה-Neon parent branch הוא 'production' ולא ה-'dev' של התוכנית, מטופל במפורש על ידי ה-reusable.

pbsiem/.github/workflows/ephemeral-env.yml:19-56 (uses ephemeral-env.yml@main + pr-backend-stack.yml@main, pr-stack-down on action==closed)

✅ תואםפריסת TEST של המערכת המלאה לכל שינוי (engine+puller ל-TEST בכל PR)

stage4.yml בונה מחדש את mdr-engine+edr-puller ב-PR SHA ופורס את שניהם ל-TEST stack המשותף (env-marker TEST, health+RestartCount gate, מסודר בין PRs), ואז מריץ את כל מערך הבדיקות מול ה-preview האמיתי. זה בנוי ומחווט — רשומת ה-gap ה'open' של ה-runbook בנושא זה היא STALE וצריך להפוך אותה ל-done.

פילר 2 — זהות, סודות ואמון (Identity, secrets & trust)

🔴 חסרהרחבת ה-rotation של סודות ל-MDR_ENCRYPTION_KEY + bearer tokens + GH PATs + GrowthBook keys

פריט #4 בתוכנית (L1565) דורש להרחיב את סט ה-rotation מעבר ל-2 הנוכחיים. הסקריפט מבצע rotation רק ל-OPENSEARCH_PROXY_KEY + סיסמת ה-PG של mdr_admin. MDR_ENCRYPTION_KEY הוא הקשה/החשוב (חייב להישאר מסונכרן בין Vercel + engine .env + puller); אף אחד מהיעדים הנוספים אינו עובר rotation. קריטריון האימות (L1656 — מפתח ההצפנה עובר rotation וכל 3 הרכיבים קולטים אותו ללא restart ידני) אינו מתקיים.

scripts/rotate-secrets.sh rotates only 2 (lines 116-122, 206-210); no MDR_ENCRYPTION_KEY/PAT/GrowthBook handling; build.py:733-736 status=dep confirms

🔴 חסרproxy מבוקר (audit-logged) לגישת סודות getSecret('X') (D13/D12, פריט #5 בתוכנית)

אין proxy מרכזי. src/lib/secrets/get-secret.ts נעדר ואין שום אירוע AuditLog מסוג SECRET_ACCESS בשום מקום ב-pbsiem. שני סמלי getSecret() שנמצאו הם helpers per-file לא קשורים, ולא ה-wrapper המתוכנן סביב process.env. בלעדיו אין נקודת audit מרכזית לקריאות של סודות רגישים (מפתחות הצפנה/תשלום/AI) ב-PROD.

🔴 חסרdocs/secrets.md (מודל ה-rotation + שכבות הרשאות אנושיות + הערת דחיית OIDC) — פריט #10 בתוכנית

התוכנית קוראת ל-docs/secrets.md יחיד שמסביר את מודל ה-rotation, שכבות ההרשאות (Mike admin / Nadav developer / Roy viewer), ואת השילוב/דחייה של OIDC. הקובץ אינו קיים.

ls docs/secrets.md -> NO docs/secrets.md

🟠 נבנה אחרתPer-environment של GH Actions ל-scoping של סודות (סביבות production/preview) — פריט #6 בתוכנית

התוכנית רוצה שכל סוד רגיש יצומצם ל-scope סביבתי. רק prod-deploy.yml משתמש בבלוק environment: של GH; 7 ה-workflows האחרים שצורכים VERCEL_TOKEN מושכים סודות ברמת ה-repo ללא שער סביבתי. ה-rotation אכן שומר על הפרדה בין creds של preview/TEST לבין PROD, כך שההפרדה הסביבתית קיימת ברמת הערך אך אינה נאכפת דרך scoping של GH environment על פני ה-workflows.

grep 'environment:' .github/workflows -> only prod-deploy.yml; VERCEL_TOKEN in 8 workflows; rotate-secrets.sh keeps preview creds separate (~line 154)

✅ 2 תואמים לתוכנית — לחצו להרחבה

✅ תואםRotation cron שבועי (החלק היחיד שלפי התוכנית כבר נשלח)

rotate-secrets.yml מופעל שבועית + workflow_dispatch; rotate-secrets.sh מבצע rotation ל-2 הסודות שבתחום, מגלה hosts דרך rotation-targets API, מתעד sha256 fingerprints בלבד, ומבצע shred לטקסט הגלוי בכל נתיב יציאה. זהו ה-baseline 'כבר נשלח' שהתוכנית משמרת (D11), ועמדת 'No human sees prod credentials' מתקיימת עבור 2 הסודות הללו.

pbsiem .github/workflows/rotate-secrets.yml:42; scripts/rotate-secrets.sh:122,206-210,355-359,135; src/app/api/cron/rotation-targets/route.ts

✅ תואםOIDC federation (GH Actions -> Vercel/Hetzner/Cloudflare)

התוכנית ב-D11 דוחה במפורש את OIDC ל-v2 ואומרת לשמור tokens ארוכי-טווח + rotation. המציאות תואמת: VERCEL_TOKEN נשאר ארוך-טווח על פני 8 workflows. אין כאן gap — הסטייה מאושרת על ידי התוכנית. אל תבנה OIDC ב-v1.

plan L112/L1575; pbsiem grep VERCEL_TOKEN -> 8 workflow files

פילר 3 — סביבות ephemeral לכל שינוי

🔴 חסרteardown ל-OpenSearch PR-index + מדיניות ISM delete (min_age 1d)

התוכנית רוצה ש-indices מסוג pr-{N}-test יפוגו אוטומטית באמצעות מדיניות ISM min_age:1d/delete כ-backstop של belt-and-braces. שום מדיניות ISM לא קיימת בשום מקום (grep ל-_ism/ism_policy/min_age מוצא רק סקריפטים לא קשורים של cold-archive/demo). ה-workflow לשימוש חוזר יכול למחוק OS indices ב-teardown, אבל רק כאשר מועברים opensearch-url + OPENSEARCH_API_KEY — ה-caller של pbsiem לא מעביר אף אחד מהם (רק backend-opensearch-url / BACKEND_OPENSEARCH_PROXY_KEY), כך שענף ה-teardown הזה מת עבור pbsiem. ה-GC cron נוגע ב-Neon בלבד, אף פעם לא ב-OpenSearch.

pbsiem ephemeral-env.yml has no opensearch-url/OPENSEARCH_API_KEY; cicd ephemeral-env.yml:327-344 (gated on those inputs); ephemeral-env-gc/route.ts is Neon-only

🔴 חסרמודל Prisma של EphemeralEnvCost + עלות מתגלגלת ל-30 יום + התראת SVC-HEALTH ב-EUR80/EUR100

התוכנית (lines 556-577) דורשת שה-cost cron ישמר ל-מודל EphemeralEnvCost עם total מתגלגל ל-30 יום ו-SVC-HEALTH/ServiceHealthAlert ב-EUR80 warn / EUR100 critical, בתוספת סריקת force-delete ל-PR ישן (>30d). ה-cost cron של MDR רק סופר Neon branches חיים, מחזיר JSON, מתריע דרך console.warn בסף קשיח-מקודד של 5 branches, ולא משמר דבר. אין מודל EphemeralEnvCost ב-prisma של pbsiem (grep = אין התאמות); שורה 724 ב-runbook מאשרת שהוא חי רק ב-DB של ה-dashboard. אין aggregate cap נקוב ב-EUR, אין trend מתגלגל, אין routing להתראות, אין סריקת stale >30d.

src/app/api/cron/ephemeral-env-cost/route.ts:45-68; grep EphemeralEnvCost in pbsiem/prisma = no matches; build.py:724-725

🟠 נבנה אחרתhelper לתיחום שכבת ה-query של OpenSearch getTestWriteIndex()/OPENSEARCH_PR_INDEX ב-src/lib/opensearch.ts

§2 בתוכנית (lines 503-517) רוצה שהאפליקציה תקרא את OPENSEARCH_PR_INDEX דרך helper מסוג getTestWriteIndex() שכותב ל-index מסוג pr-{N}-test. ה-helper הזה לא קיים — grep של src/lib/opensearch.ts ושל כל src מחזיר אפס התאמות ל-getTestWriteIndex/OPENSEARCH_PR_INDEX/OPENSEARCH_INDEX_PREFIX. ה-workflow מגדיר OPENSEARCH_INDEX_PREFIX על ה-preview אבל שום דבר לא קורא אותו. בידוד ה-OS מושג במקום זאת באמצעות namespacing של client-id per-PR (synthetic-seed יוצר לקוחות pr{N}-*; דפוס ה-index של ה-engine logs-{client_id}-* מצמצם קריאות ל-logs-pr{N}-*). מבודד פונקציונלית, אך מתבדל מהמנגנון הנעול.

grep getTestWriteIndex/OPENSEARCH_PR_INDEX across pbsiem/src = no matches; scripts/synthetic-seed.ts:8-72 (PR_NAMESPACE pr{N}); pr-backend-stack.yml:5-9 comment

🟠 נבנה אחרתpath של Admin UI + עיצוב dark-mode

עמוד הנראוּת ל-admin קיים ועובד (קורא את דוח העלויות, כפתור force-cleanup), אבל חי ב-src/app/admin/ephemeral-envs/page.tsx במקום ב-src/app/(admin)/admin/infra/ephemeral-envs/page.tsx של התוכנית, ומעוצב בהיר (bg-gray-50/text-gray-500) בניגוד לכלל הזיכרון של dark-mode-default. שולי; קיים פונקציונלית.

src/app/admin/ephemeral-envs/page.tsx:24-180

✅ 3 תואמים לתוכנית — לחצו להרחבה

✅ תואםיצירת Neon branch-per-PR + override של Vercel preview + teardown אידמפוטנטי

ה-ephemeral-env.yml של cicd-system יוצר pr-{N} מה-parent המוגדר עם endpoint מסוג read_write (lines 160-170), מגדיר DATABASE_URL/PR_NUMBER/OPENSEARCH_INDEX_PREFIX/EPHEMERAL_ENV per branch, ובסגירה מוחק את ה-branch + סורק את כל משתני ה-Vercel המתוחמים ל-gitBranch. מוקשח מעבר לתוכנית: שער no-fallback parent guard + שער EPHEMERAL_PAUSED (תיקון red-team). רץ ירוק על כל PR של pbsiem ל-dev.

cicd .github/workflows/ephemeral-env.yml:109-347; pbsiem .github/workflows/ephemeral-env.yml:19-43; gh run 27485718373 success

✅ תואםפריסת full-system לכל שינוי (engine+puller+receiver+backend stack per-PR) מחווטת ל-preview ה-ephemeral

חורג מעבר לתוכנית. stage4.yml בונה engine/puller ב-PR SHA, פורס ל-TEST (health-gated, assert של PLANB_ENV=TEST), עושה schema+seeds ל-postgres ייעודי per-PR, ומריץ כל שער מול ה-preview האמיתי של ה-PR. זהו ליבת ה-release-certainty שהתוכנית רק רמזה עליה. שים לב: ה-gap מסוג 'open' של ה-runbook עצמו (build.py:639-645) שמתאר זאת כחסר הוא STALE.

pbsiem .github/workflows/stage4.yml:84-339; cicd pr-backend-stack.yml; gh run 27485718389 jobs all green (deploy-engine/deploy-puller/pr-stack/pr-services)

✅ תואםseed fixtures דטרמיניסטיים לכל סביבת ephemeral

התוכנית רוצה seed fixtures דטרמיניסטיים. ephemeral-env.yml עושה seed ל-Neon branch דרך scripts/qa-seed.ts (synchronize עושה re-seed כך ש-PR שעורך את ה-seed מריץ אותו מחדש); ה-backend stack per-PR מריץ את synthetic-seed.ts עם ids דטרמיניסטיים מסוג synth-det-*. db push כן (honest) של prisma (ללא --accept-data-loss) הוא התיקון שאחרי #101.

pbsiem ephemeral-env.yml:25; cicd ephemeral-env.yml:252-277; scripts/synthetic-seed.ts:44-73

פילר 4 — שערי איכות (defense in depth)

🟠 נבנה אחרתlint:strict (lint-warnings-as-errors) ב-CI וב-pre-push (sub-deliv C/D של התוכנית)

התוכנית דורשת ש-lint ירוץ כ-`next lint --max-warnings 0` גם ב-pre-push וגם ב-CI. במציאות: package.json `prepush` קורא ל-`npm run lint` (רגיל), ו-pbsiem ci.yml מעביר `lint-command: "npm run lint"` שדורס את ברירת המחדל `lint:strict` של ה-reusable workflow. כך שאזהרות eslint לעולם לא חוסמות — האימות של התוכנית 'warning → lint:strict fails' (L820) לא היה נכשל היום. README L28 וגם ה-runbook build.py:697 שניהם טוענים ש-lint:strict בתוקף — שניהם over-claim.

pbsiem origin/main package.json scripts.prepush + ci.yml `lint-command: "npm run lint"`; cicd-system .github/workflows/ci.yml default `lint-command: npm run lint:strict`; README L28; build.py:697

🟠 נבנה אחרתCI כ-6 checks נדרשים נפרדים (sub-deliv D של התוכנית + branch protection I)

התוכנית מציינת install/lint/typecheck/check-all/test/build כשישה jobs נבדלים, כל אחד status check נדרש בעל שם. במציאות: job בודד ושמיש מחדש 'ci / CI' מריץ אותם כ-steps סדרתיים, חשוף כ-check נדרש אחד. פונקציונלית כל ה-steps רצים וחוסמים (טוב), אבל שמות ה-checks הנדרשים הגרנולריים של התוכנית (Lint, Typecheck, check-all, Test, Build) אינם קיימים, ולכן contexts של branch-protection לא יכולים להתאים לתוכנית והנראות/המקביליות לכל שלב נאבדת. זוהי הסטייה המכוונת של reusable-workflow ב-cicd-system (build.py:179).

pbsiem origin/main ci.yml (uses plan-b-systems/cicd-system/.github/workflows/ci.yml@main); cicd-system ci.yml single job name CI; gh api branches show only `ci / CI`

🟠 נבנה אחרתAikido SAST gate (sub-deliv E של התוכנית, טבלת ה-D)

התוכנית + טבלת ה-D מחייבות workflow של Aikido (security-review.yml). במציאות: Aikido נעדר לחלוטין מה-pipeline המחווט — הכותרת של cicd-system security-scan.yml אומרת 'Replaces Aikido (D54 superseded)' ומריצה Semgrep + Trivy בלבד. שער ה-Semgrep+Trivy אכן מחווט ואוכף (enforce:true, חוסם CRITICAL/HIGH, allowlist), כך שכיסוי SAST/dep/secret קיים — אבל לא דרך Aikido. קריטי: ה-runbook (build.py:169) טוען 'Aikido Back IN as a FREE blocking gate' שזה שקרי בקוד, ו-build.py:586 בו-זמנית טוען 'after Aikido removal' — ה-runbook סותר את עצמו ואת המציאות. החלטת מייק מ-2026-06-11 להחזיר את Aikido מעולם לא יושמה.

cicd-system .github/workflows/security-scan.yml L1,L3 'Replaces Aikido (D54 superseded)'; pbsiem security-scan.yml enforce:true; build.py:169 vs build.py:586 (contradiction)

🟠 נבנה אחרתCodeRabbit request_changes_workflow חוסם merge (D15, sub-deliv F)

D15 בתוכנית מחייב .coderabbit.yaml עם request_changes_workflow:TRUE כך ש-'request changes' של CodeRabbit מנטרל את כפתור ה-merge, ו-CodeRabbit כ-check נדרש. במציאות: האפליקציה מותקנת ומפרסמת reviews (advisory), אבל .coderabbit.yaml הוסר (commit fb8a2fb) כך ש-request_changes_workflow לא מוגדר/false, ו-CodeRabbit אינו check נדרש ב-dev או ב-main. זוהי החלטה-מחדש מכוונת (build.py:171 'FREE tier, advisory only, never a required gate') שגוברת על D15 — אבל התוכנית עדיין מקור האמת הנעול, ולכן זו סטייה שצריך לאשרר או להחזיר לאחור.

pbsiem .coderabbit.yaml absent on origin/main (git cat-file fail; removed in fb8a2fb); PR#78 check 'CodeRabbit pass / Review skipped'; gh api protection (no CodeRabbit context); build.py:171

🟠 נבנה אחרתענף main: 1 approving review נדרש (sub-deliv I של התוכנית)

התוכנית דורשת ש-main יזדקק ל-approving review אחד ממייק. במציאות: main req_approvals=0. זוהי ההחלטה-מחדש המתועדת (build.py:173 'No second approver; Mike's chat instruction after green gates authorizes protected merges') — אבל פירוש הדבר ש-GitHub עצמו לא אוכף שום human review על main; ה-'human gate' הוא פרוצדורלי (מילתו של מייק), לא מכני. סטייה לאשרר או להחזיר לאחור.

gh api repos/plan-b-systems/pbsiem/branches/main/protection req_approvals=0; build.py:173; plan L774-776

🟠 נבנה אחרתענף main: admin no-bypass OFF (sub-deliv I של התוכנית, קשור ל-emergency-override runbook D18)

התוכנית מגדירה במפורש enforce_admins/no-bypass OFF על main כדי שמייק יוכל לבצע emergency overrides דרך ה-runbook המתועד. במציאות: main enforce_admins=TRUE (admins לא יכולים לעקוף). זה מחמיר יותר מהתוכנית (לכאורה בטוח יותר, ועקבי עם שכבת claude-guard הגלובלית) אבל פירושו שנתיב ה-emergency-override של התוכנית חסום בשכבת GitHub — ה-override כעת תלוי בהחלפה זמנית של ה-protection במקום admin fast-forward.

gh api branches/main/protection enforce_admins=True; plan L776; build.py:701 claims this as 'verified DONE'

🟠 נבנה אחרתchecks נדרשים ב-main חלשים יותר מאשר ב-dev (פער sequencing/coverage)

התוכנית אומרת main = ה-checks של dev בתוספת review. במציאות: main דורש רק `ci / CI`, בעוד ש-dev דורש 10 checks כולל security-scan, Stage-4 (dast/e2e/perf/visual), ו-links. כך ש-PR של hotfix ישירות ל-main עוקף לחלוטין את security-scan, Stage-4, ו-link-crawl בשכבת ה-protection. בפועל main מושג דרך promotion של commits מ-dev שכבר עברו שערים (autopilot folds), אבל ה-config של ה-protection לבדו לא מבטיח שקוד המיועד ל-main עבר את השערים הללו.

gh api branches/main/protection contexts=['ci / CI'] vs branches/dev/protection 10 contexts

🟠 נבנה אחרתתיעוד שערים ב-ARCHITECTURE.md (sub-deliv J)

ARCHITECTURE.md L133 עדיין מתאר את העיצוב הישן: '6 parallel jobs' ו-'pre-merge (branch protection: all checks + Aikido + CodeRabbit)'. המציאות היא job CI משולב אחד, Semgrep+Trivy (ללא Aikido), CodeRabbit advisory. התיעוד מיושן/מטעה; ci-gate-failures.md ו-README Contributing קיימים (החלקים האלה aligned).

pbsiem origin/main ARCHITECTURE.md L133; docs/runbooks/ci-gate-failures.md present

✅ 2 תואמים לתוכנית — לחצו להרחבה

✅ תואםHusky local hooks (Layer 1)

pre-commit (lint-staged), pre-push (prepush), .nvmrc, engines, husky+lint-staged devDeps כולם נוכחים ומחווטים ב-origin/main; --no-verify מתועד ב-README ומפנה ל-prod-emergency-override.md. סטייה קטנה: .nvmrc=22.16.0 מול 22.11.0 של התוכנית (לא מזיק, שניהם תואמי-Vercel).

pbsiem origin/main .husky/pre-commit=`npx lint-staged`, .husky/pre-push=`npm run prepush`, package.json husky^9.1.7/lint-staged^15.4.3; README L19-48

✅ תואםCross-component contract gate framework + crypto-envelope (D17, sub-deliv G+H)

ה-runner הגנרי scripts/check-contracts/ בנוי, מתגלה אוטומטית על ידי check-all.mjs, ונאכף ב-CI דרך ה-step של check-all. crypto-envelope.mjs הוא חוזה ה-v1. תואם בדיוק לתוכנית כולל כוונת ה-reusability.

pbsiem origin/main scripts/check-all.mjs L (readdirSync contractsDir + run each) ; scripts/check-contracts/crypto-envelope.mjs present; ARCHITECTURE.md L134

פילר 5 — חזרה כללית בצורת production (Stage 4)

🔴 חסרמערך cross-tenant isolation (D23: top-12 sensitivity endpoints, תופס את מחלקת CR-03 IDOR)

אין תיקיית tests/e2e/security/, אין cross-tenant-*.spec.ts, אפס grep hits. זהו שער האבטחה בעל הערך הגבוה ביותר בפילר (מונע מכנית את מחלקת ה-IDOR/cross-tenant-leak) והוא אינו קיים בשום מקום. ה-runbook מסמן אותו נכון כפתוח.

pbsiem: tests/e2e/security/ absent; grep cross-tenant = 0; build.py:714-719; plan lines 991-1007

🔴 חסרשער API contract (D20: spec של zod-to-openapi + זיהוי breaking-change של oasdiff)

אין zod-to-openapi, אין oasdiff, אין openapi.yaml בשום מקום ב-pbsiem או ב-cicd-system. דרישת הקדם של ביקורת Zod ב-Phase-5a מעולם לא החלה. שינויי API שוברים אינם מוגנים כלל. ה-runbook מסמן אותו כפתוח.

grep zod-to-openapi|oasdiff|openapi in pbsiem = 0; cicd-system reusable workflows have no contract gate; build.py:720-722; plan lines 919-933

🔴 חסרשער migration safety (prisma migrate diff + lock-impact)

אין scripts/check-migration-safety.mjs; פקודות ALTER/DROP מסוכנות אינן מזוהות. קיים תחליף חלקי: migrate-test-db ב-stage4 מבצע prisma db push הוגן ללא --accept-data-loss (כך ששינוי הרסני נכשל בקול רם על TEST), אך זה אינו השער המבוסס-diff, מייעץ-expand-contract שתוכנן ואינו רץ כ-check על PRs של schema בלבד.

pbsiem scripts/ has only check-contracts/crypto-envelope.mjs; stage4.yml:74-82; plan lines 1025-1037

🔴 חסרAI cost guard + AICostLedger (D24: $0.50/run, $5/PR, $50/repo/mo)

אין מודל Prisma של AICostLedger ב-pbsiem; אין אכיפת caps per-run/per-PR/per-repo מחווטת אל Stage-4. ה-runbook מציין שקוד cost-guard קיים ב-agents framework אך מודל ה-ledger נעדר ב-MDR (קיים רק EphemeralEnvCost ב-DB של ה-dashboard) — כלומר נתיב הכתיבה וה-caps אינם מחווטים לשלב החזרה הכללית.

grep AICostLedger in pbsiem/prisma = 0; build.py:723-726; plan lines 1039-1061

🔴 חסרGap A — מערך Stage-4 אינו רץ מחדש בקידום dev->main (שער ה-prod)

branch protection של main דורש רק 'ci / CI'. מערך ה-prod-mirror המלא (e2e/dast/perf/visual/links) רץ על PRs אל dev אך לעולם אינו מורץ מחדש לפני שהקוד מגיע אל production main. זהו פער ה-P0 ב-critical-path שבו נתקל הבעלים: שינוי שלא אומת מול המערך יכול להגיע ל-prod דרך ה-PR של הקידום. כלל הבטיחות 1 של התוכנית ('production structurally unreachable by an unverified change') וכלל 2 ('staging is a faithful mirror, re-tested before prod') אינם נאכפים בשלב ה-promote.

gh api repos/plan-b-systems/pbsiem/branches/main/protection required_status_checks.contexts = ["ci / CI"]; build.py:302-303 (Gap A explicit); stage4.yml:22-23 triggers only on PRs to dev

🔌 נבנה · לא מחוברAccessibility (axe-core, WCAG 2.1 AA, אפס הפרות)

reusable workflow מלא של axe-core קיים ב-cicd-system (job של accessibility ב-a11y-i18n.yml) אך pbsiem לעולם לא קורא לו — לא ב-stage4.yml, לא כ-required check. התוכנית הופכת אותו לשער חוסם-merge על 12 עמודי critical-path. אפס כיסוי כיום.

cicd-system a11y-i18n.yml:111-164 (axe-core job exists); grep a11y/axe in pbsiem/.github/workflows = only i18n-parity.yml (i18n job only); plan lines 960-977

🔌 נבנה · לא מחוברMutation testing (D22: Stryker, מַחְגֵר 60%->70%)

reusable workflow מלא של Stryker קיים (mutation-test.yml, runner של vitest, mutate-paths + threshold ניתנים להגדרה) אך pbsiem לעולם לא קורא לו והוא אינו מתוזמן nightly על dev כפי שהתוכנית מפרטת. אין stryker.config ב-pbsiem. אפס כיסוי mutation למרות שהכלי נבנה. ה-scorecard של ה-runbook משמיט אותו כליל.

cicd-system mutation-test.yml (complete); grep mutation|stryker in pbsiem/.github/workflows = 0; plan lines 1009-1023

🟠 נבנה אחרתשער Playwright E2E (Plan A: 86 ה-specs הקיימים)

שער ה-e2e מחווט ונדרש, אך מריץ spec smoke בודד (dev-environment.spec.ts) מתוך 90 קבצי spec. התוכנית דורשת את מלוא מערך ~86 ה-specs. הערה ב-workflow עצמו מסמנת זאת כמכוון-זמני בהמתנה ל-seeding דטרמיניסטי של ephemeral-DB (שכבר נבנה דרך pr-db/synthetic-seed.ts, כך שהחסם הוסר במידה רבה).

pbsiem/.github/workflows/stage4.yml:208-220 (test-pattern: tests/e2e/dev-environment.spec.ts); find tests/e2e -name *.spec.ts = 90; cicd-system playwright-e2e.yml

🟠 נבנה אחרתVisual regression (D19: 12 עמודים x EN+HE = 24 baselines)

מחווט ונדרש, השוואה הוגנת (CI=1 נכשל על snapshot חסר, ללא חוסר-תוחלת של --update-snapshots). אך מכסה רק 3 עמודים ציבוריים (login, pricing, status) ב-EN/Linux בלבד = 3 baselines, מול 24 הנעולים. אין עמודי authed/admin/MDR, אין locale של HE.

pbsiem/.github/workflows/stage4.yml:256-269; tests/e2e/visual-regression.spec.ts-snapshots/{login,pricing,status}-chromium-linux.png; plan lines 900-913

🟠 נבנה אחרתPerformance budget (D21: Lighthouse + k6, rolling baseline מ-main)

Lighthouse רץ על 2 כתובות URL ציבוריות (/ ו-/login) מול קובץ budget סטטי. בדיקת עומס k6 של hot-endpoint אינה מחווטת (k6-script-path ריק ב-caller; אין tests/perf/*.k6.js). אין rolling baseline מ-main — נעשה שימוש ב-configs/lighthouse-budget.json קבוע. ה-reusable workflow תומך ב-k6, כך שזה חצי לא-מחווט.

pbsiem/.github/workflows/stage4.yml:231-238; cicd-system perf-budget.yml:133-156 (k6 job gated on k6-script-path != ''); plan lines 949-958

🟠 נבנה אחרתi18n key parity מחווט כשער נדרש

i18n-parity.yml קורא ל-reusable a11y-i18n.yml (job של i18n בלבד), עובד וירוק. אך הוא מגודר ב-paths ל-src/lib/i18n/** ולכן אינו נורה ברוב ה-PRs, והוא אינו ב-branch protection של dev/main — כך שאינו שער Stage-4 חוסם כפי שהתוכנית דורשת (התוכנית: מחווט אל check-all, נדרש).

pbsiem/.github/workflows/i18n-parity.yml (paths: src/lib/i18n/**); dev required checks list has no i18n context; plan lines 979-989

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםDeploy של full-system TEST + backend stack מבודד לכל PR כתשתית למערך הבדיקות

stage4.yml בונה את mdr-engine + edr-puller ב-SHA של ה-PR, מחיל את ה-schema של MDR-PG, פורס את שניהם אל ה-TEST stack החי (health-gated עם RestartCount=0, אישור זהות PLANB_ENV=TEST, auto-rollback), מסנכרן את ה-syslog receiver, וגם מקים stack מבודד של postgres+engine+puller לכל PR שמוזרע בנתונים סינתטיים (synthetic-seed.ts). זה תואם וחורג מעבר ל-handshake בין Pillar-1 ל-Pillar-5 שבתוכנית — מערך הבדיקות רץ באמת מול מערכת בצורת production, ולא רק מול preview של ה-portal. חי וירוק על PRs אמיתיים.

pbsiem/.github/workflows/stage4.yml:41-163,271-339; gh run list stage4.yml = success 2026-06-14T02:17Z; cicd-system siem-deploy.yml + pr-backend-stack.yml

פילר 6 — שלמות שרשרת האספקה (SBOM, SLSA provenance, חתימת image עם Sigstore, Renovate, dependency-audit, ציון סיכון לשרשרת אספקה)

🔴 חסרחתימת mdr-engine + edr-puller images עם Sigstore/cosign — חסרה מנתיב ה-build

התוכנית דורשת חתימת keyless-OIDC עם cosign של engine + puller images שנדחפים ל-GHCR. docker-build.yml בונה+דוחף ללא שלב חתימה; ה-images ב-GHCR אינם חתומים. אין cosign באף workflow.

docker-build.yml:74-95 (build-push, no sign); pbsiem/docker-build.yml:52-67 calls it unchanged; grep cosign over cicd-system => docs + findings route only

🔴 חסראימות חתימה בזמן deploy — חסר (deploy של ה-engine מבצע pull של images לא מאומתים)

התוכנית: סקריפטי ה-deployment חייבים לאמת חתימות לפני ה-pull. siem-deploy.yml מבצע stream/pull של ה-image ומריץ אותו ללא gate של cosign verify — נתיב ה-deploy של ה-engine ל-PROD (תת-מערכת קריטית ב-P0) בוטח בכל tag שמועבר אליו.

siem-deploy.yml:109-191 (ssh-load + docker compose up, no verify); plan line 1609 'verify signatures before pulling'

🔴 חסרקונפיגורציית Renovate + dependency-audit cron שבועי — חסרים

תוכנית D26 רוצה .renovaterc.json (manual-merge בזמן ה-build) + cron שבועי של npm audit/osv-scanner עם auto-PR. אף אחד מהריפו אינו מכיל קונפיגורציית renovate או workflow של dep-audit cron.

no .renovaterc* in either repo (Glob); ls => none; no osv/npm-audit cron workflow; build.py:771-772 status:open

🔴 חסרpane של ציון סיכון לשרשרת אספקה ב-dashboard — חסר

התוכנית רוצה ציון סיכון 0-100 לכל repo בשבוע שמאגד ממצאי scanner+audit+Dependabot. אין pane כזה; תווית ה-'provenance' היחידה ב-dashboard היא בדיקת ה-tip של promotion-PR ב-Vercel, שאינה קשורה.

dashboard promotions/page.tsx:201,466 (matchesMainTip 'provenance' is Vercel-PR, not SLSA); no risk-score component found; plan lines 1599,1612

🔴 חסרdocs/supply-chain.md (מדריך אימות חיצוני) — חסר

התוכנית קוראת ל-docs/supply-chain.md שמסביר SBOM/SLSA/Sigstore + כיצד לאמת artifact חיצונית. הקובץ אינו קיים.

plan line 1614; no docs/supply-chain.md in cicd-system

🔌 נבנה · לא מחוברworkflow של SLSA L3 build provenance קיים אך יש לו אפס callers (נבנה, לא מחווט)

slsa-provenance.yml הוא reusable תקין (attest-build-provenance@v2 + gh attestation verify) אך שום דבר לא קורא לו: 404 ב-pbsiem, אין caller ארגוני. שלב צירוף-ה-release שלו לעולם לא יכול לפעול כי ל-pbsiem אין releases. ה-runbook מסמן זאת כ-DONE — over-claim.

cicd-system/.github/workflows/slsa-provenance.yml:46-77; gh: slsa-provenance.yml 404 on pbsiem default branch; gh search code (org) => 0 hits; build.py:784-787 (status:done)

🟠 נבנה אחרתretention של SBOM = artifact ל-90 ימים במקום FOREVER ב-Storage Box + לא מצורף לאף GH Release

תוכנית D27 מחייבת retention לתמיד ב-Storage Box ו-SBOM מצורף לכל GH Release. במציאות: artifact של GH ל-90 ימים שפג אוטומטית ולעולם לא מאורכב; אין צירוף-release (אין releases כלל).

sbom.yml:68 (retention-days: 90); plan lines 143,1606,1625; gh release list (pbsiem) empty

🟠 נבנה אחרתסטטוס Aikido: הקוד אומר OUT (Semgrep+Trivy), ה-runbook אומר 'Back IN' — סותר את עצמו פנימית

תוכנית פילר 6 מונה Aikido SAST/secret/dep-CVE/IaC. במציאות: security-scan.yml אומר במפורש 'Replaces Aikido (D54 superseded)' ומריץ Semgrep+Trivy בלבד. build.py טוען בו-זמנית ש-Aikido חזר IN (שורה 169) והוסר (שורות 586-587). הכיסוי בפועל שונה מהתוכנית וה-runbook סותר את עצמו.

security-scan.yml:1-3,70-85; build.py:169 vs build.py:586-587; plan line 37

✅ 2 תואמים לתוכנית — לחצו להרחבה

✅ תואםיצירת SBOM (CycloneDX) מחווטת על pbsiem main

ה-reusable sbom.yml אמיתי ונקרא על-ידי pbsiem בכל push ל-main; ריצות חיות מצליחות (2026-06-14). זהו החלק האחד שבאמת עובד ומחווט בפילר 6 — אם כי הוא מכסה רק תלויות npm של ה-portal, לא את ה-engine images.

cicd-system/.github/workflows/sbom.yml:55-68; pbsiem/.github/workflows/sbom.yml:9; gh run list sbom.yml => success 2026-06-14T01:06

✅ תואםSLSA Source Track L1 באמצעות branch protection — תואם לתוכנית הנדחית

D25 דחתה signed-commits-on-main ל-v2 ומכוונת ל-Source Track L1 שמסופק על-ידי branch protection בלבד. branch protection נאכף (נתיב promotion-PR-only ל-main). זה תואם להיקף שנבחר בתוכנית.

plan lines 141,1608,1618; pbsiem prod-deploy.yml:4-8 (main reachable only via certified branch-protected promotion PR)

פילר 7 — אספקה הדרגתית (progressive delivery) ו-rollback בטוח

🔴 חסרפלטפורמת feature-flag self-hosted של GrowthBook

התוכנית דורשת GrowthBook self-hosted (D-E, lines 1399, 1420-1426) כבסיס של כל הפילר. קיים רק doc של setup; אין instance שעבר deploy ואין SDK משולב ב-portal/engine/puller.

🔴 חסרcontroller cron של canary /api/cron/canary-ramp + UI של /admin/canary

ה-controller ה-stateful של התוכנית (קורא את ה-flags במצב ramping, מקדם לפי זמן + היעדר SLO-breach, pause/resume/skip ידני) לא קיים. לולאת ה-sleep בת 13 שעות בתוך ה-job של workflow ה-canary-gate היא תחליף מנוון, לא ה-controller המתוכנן מבוסס cron, וממילא היא לא מחוברת.

plan lines 1433-1439; no cron route nor admin page found in pbsiem (grep /admin/canary empty).

🔴 חסרמיגרציות expand-contract הנאכפות על ידי Migration Review Agent

התוכנית דורשת חסימת ALTERs מסוכנים (ADD COLUMN NOT NULL ללא default על טבלאות גדולות) והצעת expand-contract; טבלאות עם >100k שורות חייבות להצהיר על phases. במציאות: cicd change-impact.yml רק מוסיף label 'schema-changing' לנתיבי prisma — ללא migrate-diff, ללא lock-impact, ללא חסימת ALTER, ללא expand-contract. pbsiem change-impact אינו מפנה ל-prisma כלל. אין agent.

cicd change-impact.yml lines 71-76 (label only); grep for expand-contract/migration-review/ALTER in both repos returned empty; plan lines 1037, 1449-1450, 1032, 1518.

🔴 חסרמדיניות default-flag-every-PR (D31) + opt-out מסוג no-flag-needed

אין אוטומציה שיוצרת flag של GrowthBook לכל PR ואין workflow שמכבד label של no-flag-needed, כי GrowthBook וקונבנציית ה-flag לא נבנו.

configs/growthbook-setup.md lines 69-75 describe the convention as intent only; no PR workflow creates flags (grep empty).

🔴 חסרסוגי אירועי AuditLog של canary (CANARY_RAMP_ADVANCE/HALT, CANARY_ROLLBACK, PROD_DEPLOY_ROLLBACK)

התוכנית דורשת שכל החלטת ramp/rollback תיכתב ל-AuditLog. אף אחד מסוגי האירועים הנקובים לא קיים ב-pbsiem.

grep for CANARY_RAMP/CANARY_ROLLBACK/PROD_DEPLOY_ROLLBACK across pbsiem .ts/.tsx/.prisma returned empty; plan line 1473.

🔌 נבנה · לא מחוברה-reusable canary-gate.yml (ramp של GrowthBook + rollback מסוג flag-flip)

לוגיקת ramp אמיתית ובצורת התוכנית קיימת (לוח הזמנים הדיפולטי של D28 מוטמע ב-default של ה-input; gate של error-rate-vs-baseline לפי D29; flag-disable בעת breach) אך מדובר ב-workflow_call שלא מופעל על ידי שום דבר אמיתי. שום נתיב merge-to-main / prod-deploy לא קורא לו.

cicd .github/workflows/canary-gate.yml lines 35-39, 128-219; only caller is configs/example-callers/canary-caller.yml; grep of pbsiem workflows for canary/growthbook returned empty.

🟠 נבנה אחרתAuto-rollback בעת SLO breach (triple-trigger של D29 + צרכן burn-rate)

התוכנית רוצה ANY של {err 2x/15m, p99 2x/5m, synthetic 3x} שמזין את /api/cron/burn-rate-rollback כדי להפוך flag, או alias-switch ל-breaches מבניים. במציאות: post-deploy-verify מבצע בדיקת health נקודתית יחידה + בדיקת error-rate-ratio ו-Vercel promote-previous, אך SEALED ל-auto-rollback=false ב-prod (כלל S6). אין trigger של p99, אין trigger של synthetic-3x, אין צרכן burn-rate, אין נתיב flag-flip.

post-deploy-verify.yml lines 29-33 (default false), 104-128 (single ratio check), 130-183 (promote-previous), 191; plan D29 line 150 + lines 1442-1444, 1468-1470.

🟠 נבנה אחרתscripts/rollback-prod.mjs — Vercel alias switch בתוך <30s

התוכנית נוקבת בשם script ייעודי + כפתור /admin/deploy/rollback ל-rollback מבני בתוך <30s. שום script כזה לא קיים; פרימיטיב ה-alias-repoint ממומש ב-cicd vercel-promote.yml ומחובר כ-PROMOTE אנושי (prod-promote.yml), לא כמנוף rollback. אין כפתור rollback ידני.

find scripts/deploy shows no rollback-prod.mjs; pbsiem prod-promote.yml uses cicd vercel-promote.yml for alias repoint as promote/discard; plan lines 1452-1456.

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםrollback drill של container ב-TEST (שחזור engine)

לא מנגנון ה-canary, אלא יכולת ה-safe-rollback היחידה שכן בנויה, מחוברת ומוכחת: rollback-drill.yml משחזר את ה-engine ל-tag known-good דרך siem-deploy עם guard של PLANB_ENV=TEST; S6 הוכיח restore בתוך כ-1s (round-trip 21s) לעומת היעד של 5 דקות. זה סמוך ל- (ולא תחליף ל-) ה-alias/flag rollback המתוכנן.

pbsiem rollback-drill.yml lines 1-46; build.py S6 line 844/923-924 records the drill result and the prod-never-auto-rollback seal.

פילר 8 — Observability, SLOs ושיוך עלויות (D33-D38)

🔴 חסרOTel instrumentation ב-portal + MDR engine + edr-puller עם OTLP/gRPC ל-collector + W3C propagation (D35)

החיווט המרכזי של התוכנית לא קיים. pbsiem/src/instrumentation.ts רק טוען את Sentry; אין @vercel/otel, אין @opentelemetry/sdk-node, אין OTLP exporter, אין TraceContext propagation בשום app. ה-collector לא מקבל שום telemetry מהאפליקציות וזורק traces אל debug. בלי זה, ה-golden signals של המוצר עצמו לא נאספים.

pbsiem/src/instrumentation.ts:8-15; collector.yaml:33-36 (traces->debug); grep @opentelemetry in pbsiem src outside node_modules = 0; build.py:741-745 flags this OPEN

🔴 חסרשיוך עלויות per-customer + per-feature באמצעות client_id + feature_flag labels (D37)

ה-deliverable החדשני והמרכזי של התוכנית (קריטי לתמחור SIEM מבוסס-שימוש) חסר. אין שום client_id או feature_flag label בכל מקום ב-observability; /api/metrics פולט רק cicd_* platform series; מספר העלות היחיד הוא ephemeral-env EUR. אין dashboards של 'Cost per client' / 'Cost per feature' ב-Grafana. ה-cost-exporter ש-build.py טוען שהוא 'deployed on Hetzner' אין לו source ב-repo — over-claim.

🔴 חסרהקצאת PROD-obs-stack (D38)

קיים רק ה-stack של TEST/Hetzner. אין PROD obs stack. לפי D38 הדבר נדחה בצדק עד ש-TEST יציב, כך שזה on-plan-not-yet-due ולא פגם — אבל הוא נותר לא בנוי.

single docker-compose.yml on Build-Runner; no prod obs infra; plan line 1350 (2-4 weeks after TEST stable)

🔌 נבנה · לא מחוברששת ה-MDR SLO burn-rate alerts יורים בפועל על תעבורה אמיתית (D34)

slo-definitions.json + generate-slo-alerts.mjs מרנדרים 11 Prometheus rules תקפים, אבל הם שואלים את http_requests_total{slo=...}/http_request_duration_seconds_bucket{slo=...} ששום דבר לא פולט (אין OTel). הכללים טעונים אבל מוערכים לצמיתות מול series ריקות — הם לא יכולים לירות. הגדרות ה-SLO בנויות; שכבת המדידה לא.

observability/prometheus/rules/slo-alerts.yml:7,15 (queries http_requests_total{slo}); no emitter — depends on the missing OTel above

🟠 נבנה אחרתמשטח ה-SLO של ה-dashboard מציג את ה-MDR SLOs

ה-dashboard מחשב ומציג 3 SLOs ברמת ה-CI/CD-PLATFORM (CI success rate, CI p90 duration, deployment success rate) מתוך שורות ה-build/deploy שלו עצמו — תצוגת meta-SLO שימושית ברמת הפלטפורמה, אבל לא 5/6 ה-MDR product SLOs שהתוכנית מציינת בשמם (login/license/mdr-api/alert-delivery/ingestion). קיימים שני עולמות SLO נפרדים: ה-JSON (alert rules, ללא data) וה-dashboard (data אמיתי, SLOs שונים).

dashboard/src/lib/slo-compute.ts:65-93; dashboard/src/app/api/slo/route.ts:24-60

🟠 נבנה אחרתSentry sampling + client_id (D36)

tracesSampleRate הוא 0.05 ב-prod (תוכנית: 0.1) ו-Session Replay מושבת במכוון (תוכנית: replaysSessionSampleRate 0.01) — החלטת PII מוצדקת של Mike עבור מוצר אבטחה שמרנדר תוכן log/alert של לקוחות. beforeSend מבצע scrub אבל לא מצרף client_id. סוטה מההחלטה המילולית אבל קריאת ה-replay מכוונת וכנראה נכונה; רק תיוג ה-client_id ו-trace rate הם פערים אמיתיים.

pbsiem/sentry.client.config.ts:11-16; sentry.server.config.ts:11; plan lines 1254-1256

🟠 נבנה אחרתBetter Uptime synthetic monitoring — 6 probes, מרווחי 1-min

אין שילוב של Better Uptime. 'Synthetic monitoring' הוא Prometheus blackbox probes נגד 2 URLs בלבד (siemsys health + cicd login). מחמיץ את הכיסוי של 6 endpoints קריטיים, במרווח 1-min, מנקודת תצפית חיצונית שהתוכנית מציינת.

observability/prometheus/prometheus.yml:46-53 (2 blackbox targets); Better Uptime hits only in CI workflows/docs, not a monitor integration

🟠 נבנה אחרתburn-rate alerts מנותבים ל-SVC-HEALTH של pbsiem (/admin/infra/service-health + SMS/email)

Alertmanager שולח אל ALERT_WEBHOOK_URL = ה-webhook של dashboard ה-CI/CD (כותב שורות audit מסוג ALERTMANAGER_WEBHOOK + alert pane ב-dashboard). התוכנית מנתבת את ה-burn-rate alerts אל משטח ה-SVC-HEALTH הקיים של pbsiem עם dispatch של SMS+email ב-mutual-fallback. sink שונה; אין מסלול page של SMS/email על burn-rate כיום.

observability/alertmanager/alertmanager.yml:18-26; dashboard /api/metrics:104-108 (ALERTMANAGER_WEBHOOK rows); plan line 1327 (/api/cron/burn-rate-webhook -> ServiceHealthAlert -> SVC-HEALTH)

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםPrometheus+Grafana+Alertmanager+OTel collector+blackbox עצמאי (self-hosted) על TEST/Hetzner (D33)

ה-stack בנוי וחי בדיוק כפי ש-D33 מציין — self-hosted על ה-Hetzner CI/CD obs host, footprint של ~EUR4/mo, retention של 30d. grafana.plan-b.co.il/api/health = 200 (אומת באמצעות curl). זהו הליבה שבאמת הושלמה של פילר 8.

observability/docker-compose.yml; observability/README.md:6-21; live curl grafana.plan-b.co.il/api/health => {database:ok,version:11.3.1}

פילר 9 — צי סוכני AI עם סמכות תחומה

🔴 חסרF4 — ערוץ <untrusted> נגד prompt-injection

אין שום עטיפת <untrusted> בקוד בשום מקום. התוכנית דורשת שכל קלט לא-מהימן (טקסט PR, הודעת commit, שורת log) ייעטף כך שהסוכן יתייחס אליו כאל data. משטח ה-AI החי היחיד (promotion-verdict) מזין את הודעת ה-commit הגולמית ישירות לתוך ה-prompt ללא markers — נתיב injection פתוח.

grep untrusted|prompt.?inject → only docs/cicd-platform-build-plan.md; dashboard/src/lib/promotion-verdict.ts:35-68 (commitMessage injected raw); plan lines 1699, 1761

🔴 חסרD43 — graph cap בין סוכנים (מקסימום 3 hops, מקסימום $5 cascade cost)

אין שום אורקסטרציה רב-סוכנית, hop counter, או cap על cascade-cost. ה-cost guard הוא לכל-run/לכל-PR/לכל-repo בלבד; אין מושג של chain root או cascade budget.

grep graph.?cap|cascade|max.?hops → only docs; cost-guard.ts has no cascade concept; plan lines 174, 1700

🔴 חסרGolden-input regression suites לכל סוכן

התוכנית דורשת golden-input regression suite לכל סוכן; אין כאלה. אין golden fixtures, אין test harness ל-specs.

grep golden → only docs + specs' description text; no test files under agents/; plan line 43, runbook S11

🔴 חסרD44 — תדירות post-mortem חודשי + agent-off day

אין post-mortem מתוזמן (מדגם של 20 החלטות) ואין מנגנון agent-off-day (השבתת כל Tier1+ למשך 24h). README מתעד 'monthly review process' ידני אבל שום דבר לא מתזמן או אוכף אותו; אין docs/runbooks/agent-post-mortem.md.

agents/README.md:144-153 (manual prose only); no cron/workflow for it; plan lines 175, 1780-1782

🔴 חסרמודלים מתמשכים AgentTrustState + AICostLedger

התוכנית/runbook S10 קוראים למודלי trust + cost-ledger עמידים כך שה-state ישרוד בין הרצות ויזין את ה-dashboard. ה-state כיום הוא JSON מקומי בן-חלוף; ל-schema של Prisma ב-pbsiem אין AgentTrustState/AICostLedger (רק TrustedDevice / Persona-4 trust period שאינם קשורים).

pbsiem prisma/schema.prisma grep agent|trust → TrustedDevice:197, trust_period_ends_at:1057 (unrelated); runbook S10 line ~1021 'AgentTrustState + AICostLedger models'

🔴 חסרRetrofit של 4 הסוכנים הקיימים (qa-runner, code-reviewer, monday-process, mike-bot) לתוך ה-framework עם tiers מוכרזים

אף אחד מ-4 הסוכנים הקיימים ב-.claude אינו מוכרז-tier או מנותב דרך ה-runtime; הם רצים כמקודם, מחוץ לשכבת הסמכות התחומה.

no specs for them in agents/specs/ (only the 10 new); runbook gap line ~795; plan lines 1697, 1836-1840

🔌 נבנה · לא מחוברRuntime בעל סמכות תחומה (אכיפת tier+allowlist+cost, 10 specs)

ה-framework הוא קוד אמיתי וקוהרנטי עם מספרי tier/threshold/cost-cap נכונים, אבל ל-createAgentRuntime/loadAgentSpec אפס callers מחוץ ל-framework+docs. אף workflow, dashboard route, script, או קוד pbsiem לא מפעיל אותו; אין package.json/tsconfig בשורש שמהדר אותו; אין tests. הוא לא מריץ שום דבר.

🟠 נבנה אחרתD39 — בנייה על Claude Agent SDK (allowedTools בשכבת ה-API + PreToolUse/PostToolOutput hooks)

התובנה המרכזית של התוכנית — לתת ל-SDK לסרב לחשוף כלים שאינם מורשים בשכבת ה-API — אינה בשימוש. ה-runtime מקודד ידנית ללא שום SDK: allowedTools הוא string[] של JSON שנבדק ב-JS, וה-hooks הם פונקציות inline. זו בדיוק הסטייה שהתוכנית הזהירה מפניה (raw מול SDK), אלא שזה אפילו לא raw Anthropic SDK — זה validator תפור-בהזמנה שהמודל לעולם לא באמת רואה.

🟠 נבנה אחרתקידום אוטומטי של earned-trust מ-SHADOW ל-LIVE (D41) + מטריקת override נכונה

ה-thresholds תואמים את התוכנית, אבל אין שום שדה מצב SHADOW/LIVE; הסוכנים לא יכולים לרוץ ב-shadow ולא להתקדם אוטומטית (checkEarnedTrust רק מדווח על זכאות, README אומר שהקידום ידני). וכן overrideRate מודד "אדם דרס DENIAL של tier" במקום "אדם עשה משהו שונה ממה שהסוכן הציע" של התוכנית — אות חלש יותר, ברובו אפס.

agents/framework/telemetry.ts:122-214 (checkEarnedTrust reports only), README.md:119; AgentInvocation.overridden = tier-denial override (types.ts:86-87, runtime.ts:289); plan lines 1698, 1753-1757

🟠 נבנה אחרתD42 — telemetry מרחיב את פילר 8 OTel + Prometheus + Grafana

ה-telemetry של הסוכנים נכתב לקבצי JSON מקומיים ב-gitignore בלבד; הוא אינו פולט את המטריקות הנקובות של התוכנית (agent_invocation_total{...}) לתוך Prometheus/Grafana. ה-stack של פילר 8 קיים על Hetzner אבל צי הסוכנים אינו מחובר אליו.

agents/framework/telemetry.ts:23-24 (INVOCATIONS_FILE local json), .gitignore 'agents/telemetry/*.json'; plan lines 173, 1768

🟠 נבנה אחרתסוכן Release בדרגת Tier-3 כ-gate של AI-certification-לפני-prod

התוכנית הופכת את סוכן ה-Release בדרגת Tier-3 למאשר החיובי 'לא ישבור את הפרודקשן' שצורך evidence bundle מלא. מה שבאמת רץ הוא dashboard promotion-verdict.ts: verdicts מייעצים מבוססי שני LLM (Anthropic+OpenAI, מנוקדים-shadow, לעולם לא חוסמים) על אובייקט evidence מינימלי — מחובר לחלונית ה-Promotions/לוח התוצאות אבל עוקף את ה-framework לחלוטין (אין tier runtime, אין cost guard, אין allowedTools, אין evidence bundle מלא). הוא מממש את הרוח (האדם מחזיק את ה-gate, ה-AI מנוקד-shadow קודם) אבל סוטה מהסוכן שתוכנן ארכיטקטונית.

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםטקסונומיית Tier 0-4 + cost caps לכל-tier + אפס סוכני Tier-4

ה-tier enum, מפת tier→tool permission, thresholds הקידום של D41, ו-cost caps של D45 ($0.50/run, $5/PR, $50/repo-mo; release מוגבה ל-$2/$10) כולם תואמים את התוכנית בדיוק, ו-v1 מכריז על אפס סוכני Tier-4. החלק הזה ממודל בנאמנות — הוא פשוט אינו מופעל.

פילר 10 — ראיות ציות כתוצר-לוואי (Compliance evidence as exhaust)

🔴 חסרC2 — מפת control-to-evidence ל-SOC 2 (compliance/soc2-control-map.yaml)

אין תיקיית compliance/ או control map באף אחד מה-repos. הליבה של 'one-artifact-many-controls' (artifact אחד מוכיח controls רבים) אינה קיימת.

find over cicd-system + pbsiem: no soc2-control-map / control-map / compliance.md (empty result)

🔴 חסרC3 — control map ל-PCI DSS בהיקף SAQ-A + D47 מכתב היקף של Tranzila

אין compliance/pci-control-map.yaml ואין compliance/pci-tranzila-scope-letter.pdf (D47 נוקב בנתיב המדויק הזה). משטח הראיות ל-PCI נעדר.

find: no pci-control / tranzila-scope artifact; plan D47 §183, §1966

🔴 חסרC6 — auditd על PROD-siemsys-syslog + PROD-MDR-01 (roll-up יומי ל-Storage Box)

אין תבניות קונפיגורציה ל-auditd ואין cron ל-roll-up בשום מקום. אודיט ברמת ה-host של ה-prod אינו ממומש.

find: no auditd* files in either repo; plan §1946

🔴 חסרC8 — /api/customer/audit-export פונה-ללקוח (GDPR Art.15, D48 JSON+CSV+PDF)

אין endpoint של audit-export. קיימים רק views של אודיט מתוחמי-admin ו-tracker של export-request ללקוח מתוחם-admin; אף אחד מהם אינו ה-audit trail מתוחם-tenant ובשירות-עצמי שהתוכנית דורשת.

pbsiem find: only src/app/api/admin/audit + admin/clients/[id]/export-request; no customer/audit-export; plan §1948, D48 §184

🔴 חסרC7 — תשתית לדוח drill חתום רבעוני (compliance/drills/)

אין תיקיית compliance/drills/ ואין generator לדוח חתום. (הערה: rollback-drill.yml קיים ב-pbsiem אך אינו מפיק דוח ציות חתום-SLSA.)

find: no compliance/drills; plan §1947

🔴 חסרC9 — אכיפת מדיניות retention (AuditLog 7yr, drill 5yr, auditd 1yr+7yr)

אין crons ל-retention ואין retention נאכף על אף אחד משני מודלי ה-AuditLog. retention קיים רק עבור דומיין ה-subscription/data-purge הבלתי-קשור ב-MDR.

grep: no auditLog.deleteMany / 7-year retention in dashboard src; plan §1949

🔴 חסרC4 — חבילת ראיות DPA ל-GDPR (docs/data-flows.md, רשימת processors)

נספח DPA נשלח במוצר ה-MDR, אך תרשימי זרימת-הנתונים / ארכיון ה-processor / sub-processor שהתוכנית מוסיפה עבור ה-pipeline אינם קיימים.

find: no data-flows.md; plan §1944

🔴 חסרC10 — docs/compliance.md (אינדקס-מאסטר של control→evidence→retention)

אינדקס הקריאה-הראשונה של המבקר אינו קיים באף אחד מה-repos.

find: no compliance.md; plan §1950

🔌 נבנה · לא מחוברC1 — workflow audit-log בשכבת ה-pipeline (deploys/runs → מאגר אי-שינוי)

audit-log.yml יורה ו'מצליח', אך ה-payload המותאם שלו נדחה על-ידי ה-webhook של הדאשבורד (שדורש payload.repository / payload.workflow_run, שהצורה המותאמת חסרה) ומושמט; תגובה שאינה 2xx נבלעת כ-warning. רשומות אודיט של ה-pipeline אינן נשמרות בשום מקום. ה'only piece running' של ה-runbook מטעה — הוא רץ אך כותב לתהום.

🟠 נבנה אחרתאי-שינוי (immutability) של AuditLog + הרחבתו לאירועי pipeline (deploy/canary/agent/secret-access)

מודל AuditLog של הדאשבורד קיים ונכתב אליו, אבל רק עבור פעולות UI ב-control-plane (promotions, dispatch, finding-status, autofix). הוא אינו immutable/append-only (רשומות Prisma רגילות, ניתנות למחיקה/עדכון) ואינו לוכד את אירועי ה-pipeline של התוכנית. AuditLog של MDR מתוחם לפעולות admin, לא ל-pipeline.

dashboard/prisma/schema.prisma:265-280 (no append-only guard); writers at controls/dispatch:84, promotions/merge-pr:86, security/findings/[id]/status:59; pbsiem/prisma/schema.prisma:586-598

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםC5 — תיוג אוטומטי של change-impact פר-PR (5 labels, סריקת תוכן PII)

תואם לתוכנית במדויק: change-impact.yml ב-cicd-system מחיל את auth-touching/schema-changing/customer-facing/security-relevant/pii-touching; pbsiem קורא לו עם @main על PRs אל dev/main; עובר ב-CI היום.

cicd-system/.github/workflows/change-impact.yml:62-162; pbsiem/.github/workflows/change-impact.yml:1-19; gh run 27485718386 (success, 2026-06-14)

פילר 11 — התאוששות מאסון (Disaster Recovery — מוכח, לא מתוכנן)

🔴 חסרמסד הנתונים של מוצר MDR (PG על PROD-MDR-01) מוחרג מלולאת האימות

האימות מכסה רק את ה-DB של פורטל Neon. מידע הזיהוי/אירועים (detection/incident) האמיתי של MDR נמצא ב-siemsys-mdr-postgres על PROD-MDR-01. אין עבורו אף מסלול restore-verification, על אף ש-banner הראנבוק טוען שגיבוי ה-Postgres של MDR ב-prod 'tested end-to-end'.

infra/prod-mdr-01/README.md (siemsys-mdr-postgres container); backup-restore-verify.yml uses NEON_API_KEY only (lines 24-26); build.py:872 over-claims

🔴 חסראין quarterly DR drill, מדידת RTO/RPO, רפליקציה cross-region, עץ failover, PITR ללקוח, או מסמכי DR

D2-D10 נעדרים לחלוטין. זה ALIGNED עם D49 (הם נדחו במכוון לתוכנית DR נפרדת), ולכן אינם ליקויים מול התוכנית הזו — מסומנים רק כדי שהבעלים יידע שהפער קיים ומחוץ-לתחום במכוון. rollback-drill.yml הוא rehearsal של Pillar 7, לא ה-DR drill.

plan lines 1993-1997 + 2020-2022 (D49 split-out); docs/dr.md, docs/runbooks/dr-drill.md, failover-decision.md all absent; rollback-drill.yml is workflow_dispatch container restore only

🔌 נבנה · לא מחובראימות backup-restore היומי אינו מתוזמן ואין לו קורא (ה-DR deliverable היחיד שבתחום, D1)

backup-restore-verify.yml הוא workflow_call-only מאז שהקומיט 2dd43a8 הסיר את ה-schedule שלו; שום דבר ב-pbsiem או ב-cicd-system אינו מפעיל אותו. הראנבוק עצמו מודה בכך (build.py:752 'no schedule invokes it for MDR daily') ו-S8 דוחה זאת. זהו פריט ה-DR היחיד ש-D49 משאיר בתוכנית, ולכן אי-החיווט (non-wiring) שלו הוא פער הכותרת של Pillar 11.

backup-restore-verify.yml:3-4 (on: workflow_call only); git commit 2dd43a8 diff (removed cron '0 3 * * *'); zero grep callers in pbsiem; build.py:752-753; gh run list = 8x 0s failure

🟠 נבנה אחרתיעד השחזור סוטה מ-D1: Neon-branch->Neon-ephemeral במקום prod-backup->TEST

D1 מציין שחזור של גיבוי PROD של אתמול INTO ה-stack של TEST והרצת smoke tests שם. ה-workflow שנבנה משכפל ענף Neon לתוך ענף Neon חד-פעמי (throwaway) ומריץ SELECTs מולו — הוא מעולם לא מתרגל ארטיפקט גיבוי אמיתי, את סביבת TEST, או את מסד הנתונים של מוצר MDR. הוא מוכיח ש-Neon branching עובד, לא שגיבוי משוחזר.

backup-restore-verify.yml:82-126 (create ephemeral branch from primary), :128-167 (smoke queries against ephemeral conn_uri); plan line 2009 (D1 'Restores yesterday's prod backup INTO TEST')

🟠 נבנה אחרתאימות ה-tolerance / row-count הוא no-op (gate מזויף)

שלב 'Verify row counts within tolerance' תמיד מגדיר deviation_ok=true ללא קשר לקלט, כך שבדיקת שלמות-המידע (data-integrity) שהראנבוק מכנה REAL ('Neon ephemeral restore + smoke + tolerance', build.py:752) לא עושה כלום. שחזור שאיבד שורות עדיין היה עובר.

backup-restore-verify.yml:169-193 (both branches write deviation_ok=true; no comparison to primary)

🟠 נבנה אחרתה-banner של S-PRE בראנבוק מגזים בטענה על cron גיבוי MDR PG חי ומבדוק שאינו ניתן לאימות בקוד

build.py:872 קובע שה-cron של גיבוי ה-MDR PG המוצפן היומי הוא 'live on PROD-MDR-01 02:30 -> Storage Box (tested end-to-end)' ו-'restore-verified on TEST = 36 tables'. אף יחידת cron, סקריפט, או workflow מתוזמן שעבר commit אינם תומכים בזה; ה-dump היחיד שעבר commit הוא קובץ אפריל ידני בודד. ככל הנראה ארטיפקט host-only, אך הוא מוצהר כ-DONE ללא מקור בר-ביקורת (auditable) — בדיוק תבנית ה-'hallucinated done' שצרבה את הבעלים.

infra/prod-mdr-01/backups/ holds only 2026-04-29-pre-stage1-pgdump.sql; no committed cron; build.py:872-873 vs no repo evidence

פילר 12 — מטא-פילרים של הפלטפורמה + חילוץ ci-templates (M1-M10: services.yaml, Terraform IaC+drift, runbooks-as-code, chaos game-days, status-page-to-SLO, ספריית anonymizer, API versioning/deprecation, ממשק export ללקוח, סריקת PII של fixtures/screenshots, ולסיום הגדול — חילוץ ה-ci-templates הציבורי plan-b-systems/ci-templates עם consumers שמבצעים SHA-pin)

🔴 חסרקטלוג שירותים services.yaml (M1) — ה-agents שזקוקים לו כבר קיימים

אין services.yaml בשום מאגר, ובכל זאת שלושת ה-agent specs שלפי התוכנית הוא אמור להזין כבר כתובים. ל-agents אין קטלוג לקרוא ממנו, מה שמחמיר את הסטייה 'מסגרת agent נבנתה אך לא חוברה לכלום'. ‎~1-2 ימים; מינוף גבוה כי זה מפעיל agents שכבר נבנו.

Glob **/services.yaml -> No files found; agents/specs/cost-anomaly.json, oncall-investigation.json, doc-drift.json present; plan line 2070

🔴 חסרTerraform IaC + daily drift cron (M2, D50)

אפס קבצי ‎.tf‎; Hetzner/Cloudflare/Neon מנוהלים ידנית. הפיסה הבודדת הגדולה ביותר בפילר 12 (‎~5-6 ימים). ה-runbook מסמן זאת ביושר OPEN. עדיפות נמוכה יותר מ-spine ה-staging אך זהו מונע-הרקב הקנוני של 12-24 חודשים.

Glob **/*.tf -> No files found; build.py:810-812 'No IaC anywhere; ... hand-managed'; plan lines 196/2071/2086/2106

🔴 חסרRunbooks-as-code עם סקציות נדרשות + ולידציית CI (M3)

‎docs/runbooks/*.md‎ בפורמט חופשי קיימים בשני המאגרים אך ללא סכמת Symptoms/Diagnostics/Common-fixes/Escalation אכופה וללא שער CI שמוודא מבנה, כך של-On-Call Investigation agent אין חוזה קריא-מכונה לצרוך. ‎~2 ימים.

cicd-system docs/runbooks/ (auto-fix.md, link-crawl.md, ...) and pbsiem docs/runbooks/ are prose; no structure-validation workflow found; plan line 2072

🔴 חסרChaos game-days רבעוניים (M4)

אין תרחישי chaos מתוסרטים. ההפניות היחידות ל-'chaos' הן ב-ARCHITECTURE-CI.md המיושן של pbsiem (Litmus/Stage-7 staging — עיצוב אחר שמעולם לא נבנה) ובמסמך ה-plan-mirror — ולא ה-game-days הרבעוניים המתוסרטים של Neon/OpenSearch/Twilio עם post-mortems שבתוכנית.

grep chaos pbsiem -> only ARCHITECTURE-CI.md:53/306/311/327/644 (Litmus/staging); grep chaos cicd-system -> only docs/cicd-platform-build-plan.md; plan line 2073

🔴 חסרמדיניות API versioning + deprecation (M7)

אין דפוס header של X-Deprecated, אין מדיניות deprecate-in-MINOR/remove-in-MAJOR, אין קודיפיקציה. גם לא ברשימת ה-gaps של ה-runbook.

grep 'X-Deprecated|deprecation policy|API versioning' both repos -> no hits; plan lines 56/2076

🔴 חסרממשק export פונה-ללקוח (M8)

התוכנית רוצה משטח API קוהרנטי ללקוח שמאחד exports של audit-log (פילר 10), SBOM (פילר 6), אימות cosign attestation, ו-status embeds (פילר 11). החלקים מפוזרים/לא בנויים; אין משטח מאוחד. ה-runbook מציין שספציפית ה-endpoint של audit-export חסר. תלוי בפילרים 6/10/11.

build.py:806-808 'No ... customer audit-export endpoint'; no export-interface module found; plan line 2077

🔴 חסרסריקת PII של test fixtures + visual-regression baselines (M9)

אין שער CI שסורק test fixtures או תמונות IMAGES של visual-regression baseline לזיהוי PII אמיתי בטעות (וקטור הדליפה הקלאסי של visual-baseline). security-scan.yml מריץ סריקת 'secret' של Trivy שתופסת credentials אגב, אך זה אינו שער ה-PII המתוכנן ל-fixture/screenshot.

security-scan.yml:85 scanners 'vuln,secret,misconfig'; no fixture/baseline PII gate; visual-regression.yml has no PII step; plan line 2078

🟠 נבנה אחרתחילוץ ci-templates כמוצר SemVer ציבורי (M10 + D51)

הדפוסים הניתנים לשימוש חוזר רוכזו לתוך cicd-system ה-PRIVATE מהיום הראשון (סטיית D2 מכוונת) ונצרכים דרך ‎uses:@main‎, במקום להיות מחולצים לתוך מוצר plan-b-systems/ci-templates ציבורי עם SemVer v1.x.y. גישת המאגר-המרכזי משיגה את מטרת השימוש החוזר, אך מפילה שלושה פרטים נעולים: נראות D51 PUBLIC, משטח ה-release/versioning של SemVer, ומסגור המוצר העצמאי. פילר 12 הוא הפאזה האחרונה והפלטפורמה נמצאת ב-M1 (טרם נשלחה), ולכן הדחייה נכונה — אך ה-runbook צריך להפסיק לכנות זאת 'deliberate drift' סגור ולפתוח מחדש את הדלתא של PUBLIC+SemVer כעבודה שעדיין חבים.

🟠 נבנה אחרתמשמעת pinning של consumer: ‎@main‎ במקום SHA/‎@v1.0.0‎

כל הפניית consumer מבצעת pin ל-‎@main‎, מה שהתוכנית אוסרת במפורש ('never @main, never @v1') ומה שה-README של cicd-system עצמו ממליץ נגדו. ‎@main‎ פירושו שכל push ל-cicd-system משנה בשקט את ה-CI של כל consumer ללא שער גרסה — בדיוק הרקב שפילר 12 קיים כדי למנוע. זול לתיקון עכשיו ואינו זקוק לקיומו של ci-templates; ניתן לבצע SHA-pin ל-cicd-system כבר היום.

🟠 נבנה אחרתספריית anonymizer = ‎scrub(row, schema)‎ מודעת-סכמה (M6, D52)

scripts/anonymize-seed.mjs הוא scrubber-שורות מבוסס regex מעל TEXT של pg_dump, ולא ספריית ה-‎scrub(row, schema)->scrubbedRow‎ הדטרמיניסטית ומודעת-הסכמה שבתוכנית עם ה-interface מהיום-הראשון (plan line 2331). הוא דטרמיניסטי (seeded ב-sha256) וזה טוב, אך אינו יכול להבטיח שלמות יחסית לפי סכמה ואינו מחובר ל-weekly TEST refresh. ה-staging refresh האמיתי של S2b/S2c משתמש בלוגיקה נפרדת של scripts/dev-scrub, ולכן ה'ספרייה' הזו למעשה יתומה.

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםCENTRALIZATION של reusable-workflow + צריכה אמיתית (כוונת השימוש החוזר של M10)

הכוונה המרכזית — מוצרים יורשים דפוסי CI של Plan-B ללא copy-paste — מושגת באמת ו-WIRED: 27 workflows ניתנים-לקריאה ב-cicd-system, נצרכים על ידי pbsiem (30 הפניות) ו-vaughnblades (מוצר חי בן 39-PR). vaughnblades הוא מבחן conformance עובד של '2nd/3rd product onboards fast'. הסטיות הן נראות (private), אריזה (אין SemVer), ו-pinning (‎@main‎) — לא מנגנון השימוש החוזר עצמו.

CORE SPINE — טופולוגיית DEV/TEST/PROD + זרימת dev→test→main + מודל ה-promote + נתיב הדפלוי של מנוע MDR/ה-backend

🔴 חסרנתיב דפלוי PROD ל-engine/puller (image חתום ב-GHCR → pull מאומת אל PROD-MDR-01)

התוכנית נועלת את נתיב ה-prod של המנוע כ-images ב-GHCR החתומים ב-cosign + מאומתי-SLSA-L3 שסקריפטי הדפלוי מבצעים עליהם cosign-verify לפני ה-pull (שורות 1597,1609,1660). המציאות: docker-build דוחף ל-GHCR אבל ללא cosign sign וללא SLSA attestation; siem-deploy לא מבצע אימות חתימה כלל; ו-siem-deploy מופעל רק עבור TEST (stage4) + rollback-drill — אין שום workflow שמדפלוי את המנוע ל-PROD-MDR-01 בכלל. מנוע ה-MDR מגיע ל-prod כיום רק דרך SSH ידני. זהו הפער הנקוב של הבעלים: ספינת ה-promote מעוצבת לפי ה-portal, ולמנוע אין נתיב promote.

cicd-system docker-build.yml (no cosign/attest/sign tokens); zero slsa-provenance/attest callers in pbsiem; siem-deploy.yml callers = stage4.yml(TEST)+rollback-drill.yml only

🔴 חסרחתימת cosign + SLSA-L3 attestation + verify-before-pull עבור engine/puller

Pillar 6 #5/#9 (שורות 1609,1613) ושורת האימות 1660 דורשות cosign sign מסוג keyless ב-push ל-GHCR ו-cosign-verify בדפלוי. לא ה-docker-build הניתן לשימוש חוזר ולא siem-deploy מממשים שום חלק מזה. slsa-provenance.yml קיים ב-cicd-system אך אין לו קורא ב-pbsiem.

cicd-system .github/workflows/docker-build.yml:74-93 (build-push only, no sign); slsa-provenance.yml present but uncalled by pbsiem

🔴 חסררענון נתוני TEST/staging שבועי דרך anonymizer דטרמיניסטי

התוכנית נועלת PROD→TEST דרך anonymizer דטרמיניסטי עם רענון staging שבועי (שורות 322-323). המציאות: אין ספריית scrub(row,schema), אין anonymize-seed.mjs (ה-runbook טוען שהוא קיים — הוא אינו קיים ב-pbsiem/scripts), אין workflow של verify-dev-scrub, אין cron רענון. התוצאה היא ה-DEV DB הריק/הסוטה שגורם ל-flake בשערים — ה-P0 שהבעלים נתקל בו.

pbsiem scripts/ has no anon/scrub file (find returned none); no schedule cron targets TEST refresh; build.py:818-819 defers it to 'Phase 7'

🔴 חסרהמנוע שזור בפילרים החוצים (הפצת OTel trace, GrowthBook SDK ב-engine/puller)

התוכנית קושרת את המנוע ל-Pillar 8 (W3C TraceContext portal→engine→puller, שורות 1212-1219) ול-Pillar 7 (GrowthBook SDK ב-engine+puller, שורות 1425-1431). מחוץ לנתיב הקריטי של אזור זה אך חלק מכוונת ספינת-המנוע; לא אומת כבנוי ונדחה לפי הפיזור של ה-runbook.

plan lines 1212-1219,1425-1431; not in scope of any current pbsiem workflow examined

🟠 נבנה אחרתשער merge-to-main אנושי (התוכנית: Mike מאשר merge ל-main; אין promotion אוטומטי)

התוכנית נועלת את ה-merge-to-main כגבול הנדיר המאושר-אנושית ללא promotion אוטומטי (שורות 317, 331, 627-631). המציאות: promotion-autopilot.yml פותח אוטומטית PRs של promotion מ-dev→main וגם PRs של fold וחומש auto-merge ברגע ש-certify ירוק, ובמפורש 'מסיר את שלב ה-merge האנושי שבאמצע'. שער ה-human היחיד עבר לקליק Promote מאוחר יותר של Vercel-alias. חי ו-ON.

pbsiem .github/workflows/promotion-autopilot.yml:8-9,107-164; gh var PROMOTION_AUTOPILOT=on; fold PR #76 + promotion PR #77 live 2026-06-14

🟠 נבנה אחרתמודל ה-Promote = קליק Vercel-alias (verify-before-promote)

ה-promote של התוכנית = merge ל-main מפעיל דפלוי ל-PROD (אירוע אחד). המציאות מפצלת זאת: prod-deploy.yml בונה staging artifact ללא הזזת ה-alias, ואז Promote אנושי נפרד (prod-promote.yml דרך ה-dashboard) מצביע מחדש את ה-alias לאחר אימות מחדש של provenance. מוצדק וניתן לטעון שאף חזק יותר (prod שאינו ניתן להשגה מבנית), אבל זהו מתג alias של ה-portal בלבד, לא מודל ה-merge-deploys-to-prod של התוכנית, ורמפת ה-canary 1%→10%→50%→100% (Pillar 7, התוכנית) נעדרת.

pbsiem prod-deploy.yml:1-18,44-46 (auto-promote:false); prod-promote.yml:46-64; dashboard src/app/api/promotions/dispatch/route.ts:8-16,51-62

🟠 נבנה אחרתדפלוי TEST של כל המערכת מבודד לכל PR

התוכנית + ה-runbook מכריזים 'מערכת מלאה לכל PR'. המציאות: ה-ephemeral לכל PR הוא portal+ענף-Neon בלבד; engine/puller/receiver מדופלים ל-TEST stack משותף (Option C, סדרתי דרך concurrency של siem-deploy), לא מערכת מלאה מבודדת לכל PR. שורת runbook 292 מגזימה בטענה ומציגה זאת כשלב per-PR שסופק.

pbsiem stage4.yml:1-9,89-138 (shared TEST stack, TEST_MDR_DEPLOY_HOST); build.py:560,641 self-flags engine/puller NOT in per-change deploy

✅ 2 תואמים לתוכנית — לחצו להרחבה

✅ תואםטופולוגיה תלת-שכבתית (שרתי DEV/TEST/PROD, רשתות, פיצול סביבות Vercel)

TEST stack (TEST-siemsys-syslog, TEST-MDR-01, splitter) + פרויקט plan-b-test + Siem-internal-test 10.1.0.0/16 קיימים; תואם לתוכנית שורות 304-327, 342-346. כלל בידוד prod #1 נאכף.

build.py:682-685; plan lines 304-327; siem-deploy env-marker assert siem-deploy.yml:94-107

✅ תואםaudit של prod-isolation כשער check-all

שער #13 של התוכנית (שורות 297, 363-369) קיים כ-check-prod-isolation.mjs מונע ע"י audit-prod-isolation.config.yaml ונמצא בשרשרת ה-check-all. שמו שונה מ-audit-prod-isolation.mjs של התוכנית אך פונקציונלית זהו השער הנעול.

pbsiem scripts/check-all.mjs:52 ('check-prod-isolation.mjs'); scripts/check-prod-isolation.mjs + scripts/audit-prod-isolation.config.yaml present

דיוק ה-RUNBOOK — האם cicd-runbook build.py משקף את התוכנית הנעולה וגם משקף בכנות את המציאות הנוכחית, או שהוא נסחף / הצהיר יתר על המידה

🔴 חסרלא קיים מסלול engine→PROD promote, וה-runbook מעולם לא מנקב ב-gap הזה

docker-build.yml בונה image-ים של engine ב-push ל-main אך אין workflow שפורס/מקדם את mdr-engine או edr-puller ל-PROD-MDR-01 — פריסות ה-engine ל-prod נשארות ידניות/SSH, מחוץ למסלול ה-promote המגודר שהתוכנית דורשת עבור כל המערכת (plan L292). מסלול ה-Vercel promote מכסה רק את ה-portal.

🔴 חסרstaging-data refresh אינו בנוי (אין סקריפטי dev-scrub, אין scheduled PROD→TEST refresh cron) ובכל זאת באנר ה-runbook אומר 'ONLY M1 REMAINS'

Plan L323/L406-409 מחייבים refresh אנונימי שבועי מ-PROD→TEST; ה-runbook עצמו קושר את ה-gate-flake ל-dev DB ריק (הערות S2b/S3). S2b/S2c מודים שהם 'pending' אך באנר ה-EXECUTION-STATUS מגזים בתיאור השלמוּת.

build.py:844 ('ONLY M1 REMAINS'), L905-912 (S2b/S2c pending) vs filesystem: no pbsiem scripts/dev-scrub dir, no schedule: cron in any pbsiem workflow except rotate-secrets.yml:41-42

🟠 נבנה אחרתטבלת 'today's pipeline' ב-runbook עדיין מציגה את Gap B (deploy-to-live-then-verify) כמצב נוכחי + את verify-before-promote כ-P1 לבנייה, אך הקוד כבר מבצע verify-before-promote עם שער Promote אנושי נפרד

build.py L306/L342-343 מגזימים בתיאור ה-gap; ה-roadmap באותו קובץ (L435,L844) אומר M1 SHIPPED + P1 done — סתירה פנימית. המציאות: prod-deploy.yml verify-before-promote + השער האנושי prod-promote.yml בנויים ומוכחים (הרצה 27328491331).

build.py:306,342-343,418 vs pbsiem/.github/workflows/prod-deploy.yml:1-49 + prod-promote.yml:6-72 + cicd vercel-gated-deploy.yml:5-19

🟠 נבנה אחרתלשונית Gap Analysis מסמנת את 'Full-system TEST deploy per change' כ-OPEN, בטענה ש-engine/puller/OpenSearch/syslog אינם חלק מאף פריסה per-change — אך stage4.yml בונה ופורס אותם בכל PR

ה-runbook מצהיר בחֶסֶר (אומר חסר כאשר זה בנוי). לשונית ה-gaps (status:open) מתנגשת עם ה-S5 DONE שב-roadmap. קורא שעובד על לשונית ה-gaps מלמעלה למטה היה בונה מחדש משהו שכבר קיים.

build.py:639-642 (status open) vs pbsiem/.github/workflows/stage4.yml:41-126 (build-engine/build-puller at PR SHA → siem-deploy to TEST-MDR-01)

🟠 נבנה אחרתpromotion-autopilot מבצע auto-merge של dev→main, בסטייה מ-'no automated promotion / Mike merges to main' של התוכנית

Plan L317/L324/L331-336 מפורש שאין מסלול אוטומטי ל-TEST→PROD. המציאות arming של auto-merge של GitHub על PR של dev→main (מגודר-certify). ה-runbook מציג זאת כפיצ'ר Phase-4 בלי לסמן זאת כסטייה מהתוכנית הדורשת תיקון מכוון.

🟠 נבנה אחרתטקסט security-loop בלשונית Workflows עדיין אומר 'Stack after Aikido removal', בסתירה ל-override של ה-runbook עצמו לפיו Aikido חזר IN

פרוזה מיושנת מהסרת 2026-06-01 שורדת לצד ה-override של ההחזרה מ-2026-06-11 — ה-runbook סותר את עצמו בשאלה האם Aikido הוא gate.

build.py:586-587 ('after Aikido removal: Semgrep + Trivy + ZAP + CodeRabbit') vs build.py:169,393,483 (Aikido back IN as free blocking gate)

🟠 נבנה אחרתסחיפת prod alias: ה-runbook/הקוד משתמשים ב-siemsys.plan-b.systems; התוכנית נוקבת ב-siem.plan-b.co.il

מינורי אך לא מיושב — ה-runbook צריך להתכופף ל-DNS של התוכנית או לתקן את הערת התוכנית. כפי שזה כעת, ה-hostnames הנקובים של התוכנית מוחלפים בשקט.

plan L293,L318 (siem.plan-b.co.il / siem-api.plan-b.co.il) vs prod-deploy.yml:43 + build.py prod-alias siemsys.plan-b.systems

✅ 1 תואמים לתוכנית — לחצו להרחבה

✅ תואםשער ה-AI certify הוא אמיתי אך רדום (מדלג ללא ANTHROPIC_API_KEY) — ה-runbook מתאר זאת במדויק

ה-runbook מדווח בכנות ש-certify מופעל רק כאשר המפתח מתווסף (build.py:844, S6b); הקוד תואם (certify.yml:145 מדלג, exit 1 רק כאשר ה-AI מסרב כשהמפתח קיים). זו דוגמה למשקפוּת מדויקת.

pbsiem/.github/workflows/certify.yml:140-161 vs build.py:844,447

משימות — הפערים כרשימת משימות מסודרת

סגרו את שני ה-P0 — נתיב ה-promote של המנוע ורענון נתוני ה-staging — ושדרוג Java/מנוע יעבור דרך staging ויצא ל-prod כ-promote יחיד, לא דיפלוי ידני של 11 שעות. 21 הנותרים אמיתיים, אבל הם חיזוק ולא חוסמים. חברו את מה שקיים, יישרו את מה שסטה, בנו את מה שחסר.

🎯 התחילו מאלה

P0 — בנו את נתיב ה-promote של ה-engine מ-TEST->PROD (פילר 1): ה-mdr-engine/edr-puller מגיעים ל-PROD רק דרך deploy ידני out-of-band ללא promotion מאומת מ-staging ל-prod (prod-deploy/prod-promote מזיזים רק את ה-Vercel portal alias; ל-PROD mode של siem-deploy אין caller). ה-spine השבור הזה הוא מה שהפך שינוי של 40 דקות ל-11 שעות.
P0 — הקימו את ה-weekly anonymized staging-data refresh (פילר 1): ללא refresh, TEST-MDR-PG/OpenSearch רצים ריקים/מיושנים וה-gates ה-per-PR flake — החצי השני של הכאב בן 11 השעות. בנו את ה-scrub המינימלי-בר-קיימא + טעינה שבועית.
P1 — תנו שם ותייגו את ה-staging וחווטו את ה-promote האנושי היחיד post-green-stage4 כשער היחיד שמזיז גם portal וגם engine (פילר 1; מתמזג למשימת ה-engine-promote), כך ש-promote-to-prod יהיה פעולה אחת ניתנת לביקורת ומאומתת-provenance.
P1 — יישבו את הגנת branch של main (פילר 4): main מוגן רק על ידי ci/CI עם 0 approvals בעוד ל-dev יש 10 checks, כך ש-hotfix ישיר ל-main עוקף security-scan + Stage-4 + links. סגרו את פער הכיסוי לפני הסתמכות על נתיב ה-promote החדש.
P1 — השלימו את בדיקת ה-runtime env-diff של prod-isolation (פילר 1) כך ש-'אין prod DSN/key נגיש מ-non-prod' ייאכף מכנית, ולא רק יהיה ניתן ל-grep סטטי — מעקה הבטיחות ששומר על נתיב ה-promote ישר.
P1 — אכפו lint:strict ב-CI/pre-push ויישבו את הסתירה סביב Aikido (פילר 4): ה-plan/runbook/README טוענים אכיפה שה-code אינו מבצע; תקנו את ההתנהגות, ואז גרמו ל-plan + runbook + code להסכים.

🔌 חברו את מה שנבנה אבל לא מחובר

הניצחונות המהירים — ה-workflow כבר קיים ב-cicd-system; pbsiem פשוט לא קורא לו.

P1מאמץ MPillar 5 — Stage 4חווטו את ה-gates הניתנים לשימוש חוזר שכבר נבנו (a11y, mutation, k6) אל stage4 של pbsiem

סוגר: פילר 5 שכבות E/H + חצי ה-k6 של D: a11y-i18n.yml axe-core, mutation-test.yml Stryker, perf-budget.yml k6 קיימים ב-cicd-system אך pbsiem לעולם לא קורא להם

צעדים: ב-stage4.yml הוסיפו caller jobs עבור a11y-i18n.yml (axe-core, zero WCAG 2.1 AA) ו-mutation-test.yml (Stryker על src/lib + src/app/api, ratchet שמתחיל ב-~60%). הגדירו k6-script-path אל tests/perf/hot-endpoints.k6.js חדש כך שה-k6 job של perf-budget.yml (מוגן על path לא-ריק) ירוץ עם תקציבי p95/p99. הוסיפו כל אחד כ-required dev check. חיווט טהור של workflows מוכחים.

P1מאמץ MPillar 5 — Stage 4הרחיבו את ה-e2e gate מ-smoke spec אחד אל החבילה המלאה

סוגר: פילר 5 שכבה A: stage4 e2e מריץ רק את tests/e2e/dev-environment.spec.ts מתוך ~90 specs; ה-deterministic seeding (pr-db + synthetic-seed.ts) שחסם את ההרחבה כבר נבנה

צעדים: שנו את test-pattern ב-stage4.yml מ-dev-environment.spec.ts היחיד אל חבילת tests/e2e המלאה (למעט specs הזקוקים ל-data אמיתי של prod), מכוון אל ה-isolated backend stack ה-per-PR ש-synthetic-seed.ts זורע. בצעו triage ל-specs ש-flaky מול ה-synthetic seed במקום להשבית את ה-gate.

P1מאמץ MPillar 3 — Ephemeral envsהפעילו או פרשו את ענף ה-teardown של OpenSearch ב-reusable

סוגר: פילר 3: ל-cicd ephemeral-env.yml יש ענף teardown של OS-index המוגן על opensearch-url + OPENSEARCH_API_KEY, אך ה-caller של pbsiem אינו מעביר אף אחד מהם, כך שזהו dead code; ה-GC cron הוא Neon-only

צעדים: החליטו עם Mike. אפשרות A: העבירו opensearch-url + OPENSEARCH_API_KEY מה-caller של pbsiem כך שה-teardown יתבצע, הוסיפו מדיניות ISM min_age:1d/delete (1 shard/0 replicas) כ-backstop. אפשרות B: תעדו שבידוד OS הוא client-id-namespaced דרך ה-per-PR stack, מחקו את משתנה ה-preview המת OPENSEARCH_INDEX_PREFIX ואת ענף ה-OS-teardown המת. כך או כך הוסיפו backstop אחד שקוצר pr-* indices מיושנים.

🔧 יישרו את מה שנבנה שונה מהתוכנית

הקוד, ה-runbook והתוכנית לא מסכימים. תקנו את ההתנהגות, ואז יישרו את שלושתם.

P1מאמץ SPillar 4 — Quality gatesאכפו lint:strict (warnings-as-errors) ב-CI וב-pre-push

סוגר: פילר 4 C/D: ה-plan דורש next lint --max-warnings 0 ב-CI+pre-push; במציאות prepush קורא ל-npm run lint רגיל ו-ci.yml דורס את ברירת המחדל lint:strict של ה-reusable. README L28 + build.py:697 מגזימים בטענה

צעדים: ב-ci.yml הגדירו lint-command אל npm run lint:strict (או הסירו את הדריסה). ב-package.json הגדירו prepush אל npm run lint:strict + npm test. פתחו PR עם eslint warning שתול כדי להוכיח שהוא נכשל (plan L820). תקנו את README L28 + build.py:697 רק אחרי שההתנהגות תואמת.

P1מאמץ MPillar 4 — Quality gatesפתרו את הסתירה סביב Aikido; גרמו ל-plan + runbook + code להסכים

סוגר: פילר 4 E: Aikido נעדר (header של security-scan.yml 'Replaces Aikido D54 superseded', Semgrep+Trivy בלבד) אך build.py:169 אומר 'Aikido back IN as a blocking gate' בעוד build.py:586 אומר 'after Aikido removal' — סותר את עצמו ושקרי

צעדים: קבלו החלטה מחייבת מ-Mike. OUT: תקנו את ה-D-table של ה-LOCKED PLAN + sub-deliv E כדי לתעד את ה-supersession של D54, מחקו את שורת build.py:169 השקרית. IN: הוסיפו שלב Aikido scan עם SHA-pinned ל-cicd security-scan.yml המוגן על AIKIDO_SECRET_KEY (דלגו אם לא מוגדר), ספקו את ה-key, הוסיפו 'Aikido scan' כ-required check. כל שלושת המקורות חייבים להתאים.

P1מאמץ MPillar 4 — Quality gatesיישבו את הגנת branch של main עם ה-plan (approvals, admin-bypass, coverage)

סוגר: פילר 4 I + GAP A (אומת live): main דורש רק ci/CI, approvals=0, enforce_admins=TRUE; ל-dev יש 10 checks. main חלש יותר בשער (hotfix מדלג על security-scan + Stage-4 + links) אך מחמיר יותר על admin-bypass מה-plan (ה-plan רצה 1 approval + enforce_admins OFF)

צעדים: עם Mike: (a) approvals: הגדירו main approvals=1 לפי ה-plan או תקנו את ה-plan כדי לאשרר את ה-chat-word/0 approvals של Mike (build.py:173); (b) admin-bypass: בחרו בין ה-OFF של ה-plan (נתיב override runbook) לבין ה-ON הנוכחי, תעדו את הנתיב המנצח (cross-link אל claude-guard); (c) גרמו ל-main לדרוש את ה-checks של dev — security-scan + Stage-4 + links — או תעדו ש-main מגיע רק דרך gated-dev promotion + הוסיפו re-gate בזמן ה-promotion. עדכנו את plan I + build.py:700-703.

P1מאמץ MPillar 1 — Topology & isolationהשלימו את audit ה-prod-isolation ל-4 הבדיקות של ה-plan; עצרו את ההגזמה config-מול-code

סוגר: פילר 1: check-prod-isolation.mjs מבצע רק את ה-static grep (b) ומחיל grandfather על 89 קבצים; ה-runtime env-var diff שאוכף 'אין prod DSN/key מ-non-prod' נעדר; ה-config מצהיר על known_prod_client_ids + neon_hosts שה-code לעולם לא קורא

צעדים: הרחיבו את check-prod-isolation.mjs: (a) משכו את ה-env של vercel ל-preview וגם ל-production, היכשלו אם ערך backend כלשהו (DATABASE_URL/MDR DSN/OpenSearch URL/keys) זהה — ללא grandfathering בבדיקה זו; (b) קראו config.known_prod_client_ids, היכשלו על כל WRITE/fixture ref מחוץ ל-*.test.ts; (c) קראו config.neon_hosts, סמנו hostnames ישירים של prod Neon; (d) ודאו ש-prisma/*.prisma אין בו DB-URL fallback. שמרו את ה-allowlist רק עבור ה-static-grep.

P1מאמץ SPillar 1 — Topology & isolationתנו שם ותייגו את ה-staging; חווטו את האישור האנושי כשער ה-prod המאוחד

סוגר: פילר 1: ה-TEST stack מתפקד כ-staging אך אינו מתויג; השער האנושי היחיד הוא לחיצת ה-portal-alias. ה-runbook עצמו מסמן 'name + isolate' / 'wire the approval as the prod gate' (build.py:331-332,360-362)

צעדים: תייגו את ה-TEST stack כ-'staging' בתיעוד + נומנקלטורת ה-deploy; גרמו ל-promote האנושי היחיד post-green-stage4 להיות השער שמזיז גם portal וגם engine (מתמזג למשימת ה-engine-promote). עדכנו את ARCHITECTURE/key-workflow-rules אל dev->staging->prod עם staging gate אחד מלא-battery.

P2מאמץ SPillar 4 — Quality gatesאשררו-או-שחזרו את CodeRabbit request_changes_workflow (D15)

סוגר: פילר 4 F: ה-plan D15 מחייב .coderabbit.yaml request_changes_workflow:true שחוסם merge + CodeRabbit required; במציאות .coderabbit.yaml הוסר (fb8a2fb), CodeRabbit advisory-only, לא required (build.py:171)

צעדים: החליטו עם Mike. Advisory: תקנו את ה-LOCKED PLAN D15 אל 'advisory, not a required gate'. D15 עומד בעינו: הוסיפו מחדש .coderabbit.yaml בשורש ה-repo עם request_changes_workflow:true + path_instructions של plan F, הוסיפו CodeRabbit כ-required check. אל תשאירו את ה-plan וה-code חולקים בשקט.

P2מאמץ MPillar 3 — Ephemeral envsשחזרו את EphemeralEnvCost כ-cost-governor (תקרת EUR + SVC-HEALTH alert + sweep למיושנים)

סוגר: פילר 3: ה-plan רוצה ש-EphemeralEnvCost יישמר עם סך EUR מתגלגל של 30-day + ServiceHealthAlert ב-EUR80/100 + force-delete ל->30d מיושן; במציאות ה-cron רק סופר Neon branches, console.warn ב-5 קבוע, לא שומר דבר; אין model ב-prisma

צעדים: הוסיפו EphemeralEnvCost אל pbsiem/prisma/schema.prisma; ב-ephemeral-env-cost/route.ts תשאלו Neon compute-units (לא ספירת branch), חשבו EUR יומי + מתגלגל של 30-day, שמרו, הרימו ServiceHealthAlert ב-EUR80 warn / EUR100 critical. הוסיפו את ה-stale sweep: כל pr-* branch שה-PR שלו נסגר לפני >30d ימחק ב-force עם רשומת AuditLog.

P2מאמץ SPillar 3 — Ephemeral envsהעבירו את דף האדמין של ephemeral-envs אל ה-path המתוכנן + החילו dark mode

סוגר: פילר 3: הדף נמצא ב-src/app/admin/ephemeral-envs/page.tsx ולא ב-(admin)/admin/infra/ephemeral-envs של ה-plan, ומעוצב בהיר (bg-gray-50/text-gray-500) בניגוד לכלל dark-mode-default

צעדים: העבירו אל קיבוץ ה-(admin)/admin/infra שה-plan מציין; עצבו מחדש ל-dark-mode-default עם פונטים בהירים + hovers קריאים. קוסמטי/מבני, סיכון נמוך.

P2מאמץ MPillar 5 — Stage 4הרחיבו את ה-visual regression ל-24 ה-baselines הנעולים (12 pages x EN+HE)

סוגר: פילר 5 B / D19: visual מכסה רק 3 דפי EN ציבוריים (login/pricing/status), ללא דפי authed/admin/MDR, ללא HE; ה-plan דורש 12 critical-path pages x EN+HE = 24 baselines ב-2%

צעדים: הרחיבו את tests/e2e/visual-regression.spec.ts ל-12 ה-critical-path pages ב-EN+HE (auth דרך ה-per-PR seeded stack עבור admin/MDR), צרו את 24 ה-baselines בסף 2%, שמרו על עמדת ה-CI=1 no-update-snapshots הכנה. ה-gate מחווט; זהו כיסוי.

P2מאמץ SPillar 4 / cross-cuttingתקנו את רשומות ה-runbook + ARCHITECTURE המיושנות/הסותרות

סוגר: התיישנות חוצת-חתכים: build.py:639-645 מציין 'Full-system TEST deploy per change' כ-OPEN (portal+DB בלבד) אך stage4 פורס engine+puller אל TEST בכל PR; ARCHITECTURE.md L133 עדיין אומר '6 parallel jobs' + 'Aikido + CodeRabbit'

צעדים: הפכו את 'Full-system TEST deploy per change' ב-build.py ל-done תוך ציטוט stage4.yml + ריצה green; כוונו מחדש את החשש הנותר אל הפערים האמיתיים (engine PROD promote + staging-data freshness). כתבו מחדש את ARCHITECTURE.md L133 אל ה-as-built: reusable ci/CI job יחיד (lint->typecheck->check-all->test->build), security-scan=Semgrep+Trivy אוכף, CodeRabbit advisory, dev Stage-4 + links + change-impact; הסירו '6 parallel jobs' ו-'Aikido + CodeRabbit'; שמרו על פסקת ה-contract-gate המדויקת L134.

🏗️ בנו את מה שמעולם לא נבנה

חדש לגמרי. שני ה-P0 נמצאים כאן.

P0מאמץ LPillar 1 — Topology & isolationבנו את נתיב ה-promote של ה-engine מ-TEST->PROD (סגרו את פער ה-spine ה-P0)

סוגר: ה-spine הנעול של פילר 1: prod-promote.yml + prod-deploy.yml מזיזים אך ורק את ה-Vercel portal alias (אומת אפס שלבי engine/docker/ssh/hetzner); ל-PROD mode מבוסס-env של siem-deploy.yml אין caller, כך שה-engine מגיע ל-PROD רק דרך deploy ידני out-of-band ללא promotion מאומת מ-staging ל-prod. זה מה שהפך שינוי של 40 דקות ל-11 שעות

צעדים: הוסיפו שלב engine-promote ל-prod שפורס את אותו artifact מאומת של TEST (mdr-engine/edr-puller image ב-SHA שקודם, שכבר נמצא ב-GHCR מ-stage4/docker-build) אל PROD-MDR-01 דרך ה-PROD mode מבוסס-env של siem-deploy.yml (required-env-marker: PROD). חווטו אותו אל ה-human promote flow כך שפעולה אחת תזיז גם את ה-portal alias וגם את ה-engine: הרחיבו את prod-promote.yml כך שגם יקרא ל-siem-deploy.yml@main עם DEPLOY_HOST=PROD-MDR + provenance assertion (ה-image חייב להגיע מ-stage4 green ב-SHA שקודם), או הוסיפו prod-engine-deploy.yml מאחורי אותו human dispatch. שקפו את בדיקת cannot-promote-an-unverified-build של prod-deploy.yml.

P0מאמץ MPillar 1 — Topology & isolationהקימו את ה-weekly anonymized staging-data refresh אל ה-TEST stack המשותף

סוגר: פילר 1 D7 + sub-deliv I: אין weekly anonymized PROD dump אל TEST-MDR-PG/OpenSearch (אומת אפס hits של weekly/refresh/anonymize מחוץ ל-rotate-secrets); MDR-PG מקבל רק per-PR schema push + Neon qa-seed, כך שה-gates flake על data ריק/מיושן — ה-P0 של staging-data-stale

צעדים: הוסיפו scheduled workflow ש-(1) לוקח read-only PROD MDR-PG dump (pg_dump --exclude-table-data=audit_log) + OpenSearch indices רלוונטיים, (2) מריץ את ה-deterministic scrubber (הרחיבו את scripts/anonymize-seed.mjs אל scrub(row,schema): emails->user{id}@test.local, phones->+972500000000, names->faker, נתוני card נזרקים), (3) טוען אל TEST-MDR-01 PG + TEST-OpenSearch. הריצו שבועית כך שה-gates ירוצו מול data בעל-צורה-אמיתית. ה-anonymizer המינימלי-בר-קיימא מתוחם ל-Phase 1 (ה-lib המלא נשאר פילר 12).

P1מאמץ LPillar 2 — Identity & secretsהרחיבו את rotation של ה-secrets אל MDR_ENCRYPTION_KEY עם cutover מסונכרן

סוגר: פילר 2 #4: rotation מכסה רק OPENSEARCH_PROXY_KEY + סיסמת mdr_admin PG; MDR_ENCRYPTION_KEY (המסובך יותר, חייב להישאר מסונכרן בין Vercel + engine .env + puller) + bearer tokens/PATs/GrowthBook לא מתחלפים; plan L1656 לא מתקיים

צעדים: הוסיפו MDR_ENCRYPTION_KEY ל-schema של rotation-targets + rotate-secrets.sh: צרו key חדש, כתבו אל Vercel (production scope) וגם אל engine/.env + puller config דרך לולאת ה-SSH/role-tagged הקיימת, ואז הפעילו-מחדש/אמתו את כל שלושת ה-consumers אטומית כך ש-portal+engine+puller לעולם לא יהיו ב-split-brain. השתמשו מחדש ב-fingerprint-log + shred. קבלה L1656: מתחלף לפי לוח זמנים, כל השלושה קולטים אותו ללא restart ידני, ה-golden v2 envelope עדיין מפענח (gate ל-crypto-envelope contract test). אחר כך הוסיפו bearer tokens + GH PATs + GrowthBook כ-follow-ons.

P1מאמץ MPillar 5 — Stage 4בנו את חבילת ה-cross-tenant isolation (תופסת את מחלקת ה-IDOR של CR-03)

סוגר: פילר 5 G / D23: tests/e2e/security/ אינו קיים; grep cross-tenant = 0 hits (אומת); ה-plan דורש cross-tenant specs ל-top-12 sensitivity-endpoint

צעדים: צרו tests/e2e/security/cross-tenant-*.spec.ts המכסים את top-12 ה-sensitivity endpoints (רשימת D23), בשימוש שני tenants נפרדים זרועים מה-per-PR backend stack וכן ניסוח טענה ש-tenant A אינו יכול לקרוא/לכתוב את המשאבים של tenant B (מחלקת ה-IDOR של CR-03). חווטו אל stage4 כ-required dev check.

P2מאמץ MPillar 2 — Identity & secretsבנו את ה-getSecret('X') audit-logged proxy ואמצו אותו עבור secrets רגישים

סוגר: פילר 2 D13/D12 #5: אין src/lib/secrets/get-secret.ts, אין SECRET_ACCESS AuditLog event (אומת 0 hits); שני סמלי ה-getSecret הם helpers לא-קשורים per-file; אין נקודת audit מרכזית

צעדים: צרו src/lib/secrets/get-secret.ts שמייצא getSecret(name) שעוטף את process.env[name]; ב-PROD בלבד (דלגו על TEST לפי D12) פלטו רשומת AuditLog עם action של SECRET_ACCESS שלוכדת איזה-secret + על-ידי-איזה-user/request. תחמו את v1 ל-sensitive types לפי D12 (encryption/payment/AI keys); המירו את אותם reads של process.env. קבלה L1657: query של AuditLog SECRET_ACCESS מחזיר who/what/when.

P2מאמץ MPillar 5 — Stage 4בנו את ה-API contract gate (zod-to-openapi spec + oasdiff)

סוגר: פילר 5 C / D20: zod-to-openapi + oasdiff = 0 hits (אומת); אין OpenAPI spec שנוצר, אין breaking-change gate

צעדים: אחרי ה-Zod audit של Phase-5a, צרו OpenAPI spec מ-zod schemas של ה-routes (zod-to-openapi), commit אותו כ-baseline, הוסיפו שלב oasdiff אל stage4/check-all שנכשל על breaking changes מול ה-baseline. Required dev check.

P2מאמץ MPillar 5 — Stage 4בנו את ה-AI cost guard עם persistence של AICostLedger (תקרות שלוש-שכבתיות)

סוגר: פילר 5 J / D24: model של AICostLedger אינו ב-prisma של pbsiem (אומת); cost-guard code קיים רק ב-agents/framework, התקרות אינן נאכפות מול ledger נשמר

צעדים: הוסיפו AICostLedger Prisma model; חווטו את ה-cost-guard הקיים לכתוב rows ולאכוף את התקרות השלוש-שכבתיות ($0.50/run, $5/PR, $50/repo/mo), Claude-only. Gate את ה-per-PR run על ה-per-PR cap כך ש-AI step בורח יחסום merge.

P2מאמץ MPillar 5 — Stage 4בנו את ה-migration-safety gate (prisma migrate diff + lock-impact)

סוגר: פילר 5 I: אין check-migration-safety.mjs; רק crypto-envelope.mjs תחת scripts/check-contracts; ה-db push הכן של stage4 (ללא --accept-data-loss) הוא רק תחליף חלקי

צעדים: הוסיפו scripts/check-contracts/migration-safety.mjs (מתגלה אוטומטית על ידי check-all) שמריץ prisma migrate diff + lock-impact, חוסם ALTERs מסוכנים (drops, type narrows, non-concurrent index על טבלאות גדולות). אכפו ב-check-all כך שיעבור gate על כל PR.

P2מאמץ SPillar 1 — Topology & isolationכתבו את ה-prod-emergency-override runbook והפנו אליו

סוגר: פילר 1 sub-deliv H: docs/runbooks/prod-emergency-override.md אינו קיים; אין EMERGENCY-OVERRIDE ref ב-ARCHITECTURE.md; מסמך policy/audit-trail נעדר אף ש-claude-guard אוכף את המנגנון

צעדים: צרו docs/runbooks/prod-emergency-override.md לפי H: תרחישי Mike-only לגיטימיים, פעולות מותרות, מוסכמת ה-[EMERGENCY-OVERRIDE] commit-subject, רשומת ה-AuditLog הנדרשת תוך 24h + reverse-normal-flow cleanup commit + sprint task של make-this-unnecessary. הוסיפו ARCHITECTURE.md ref בשורה אחת + cross-link אל claude-guard.

P2מאמץ MPillar 2 — Identity & secretsצמצמו את scoping של GH Actions secrets ל-per-environment + כתבו docs/secrets.md

סוגר: פילר 2 #6 + #10: רק prod-deploy.yml משתמש ב-GH environment block; 7 ה-VERCEL_TOKEN workflows האחרים מושכים secrets ברמת ה-repo ללא environment gate; docs/secrets.md אינו קיים

צעדים: בצעו audit לכל secret ברמת ה-repo בשימוש 8 ה-VERCEL_TOKEN workflows; העבירו את הרגישים (deploy/promote/verify) מאחורי GH environment:production / environment:preview בשיקוף prod-deploy.yml. כתבו את docs/secrets.md: rotation model (2 ששוגרו + ה-extension set), עמדת ה-no-plaintext fingerprint/shred, שכבות ההרשאה (Mike admin / Nadav developer / Roy viewer), והערת ההחלטה OIDC-deferred-to-v2.