
510 Commits

Author SHA1 Message Date
will.anderson 1f499a1b68 Merge pull request 'Rollback MCP to v0.15.3; add IngressRoute, Terraform DNS import' (#22) from feat/marketplace-postgres-infra into main 2026-04-26 09:21:19 +00:00
Will Anderson 5fa8cf45d5 Rollback neuron-mcp to v0.15.3 to restore MCP connectivity 2026-04-26 04:20:57 -05:00
will.anderson 71f7026e5e Merge pull request 'Add MCP IngressRoute + marketplace Postgres infra' (#21) from feat/marketplace-postgres-infra into main 2026-04-26 08:53:01 +00:00
Will Anderson 83b8d0768e Add Traefik IngressRoute for neuron.neurontechnologies.ai MCP endpoint 2026-04-26 03:52:24 -05:00
will.anderson 2cbf824bba Merge pull request 'Deploy neuron-mcp:97bd19ba (Postgres marketplace)' (#20) from feat/marketplace-postgres-infra into main 2026-04-26 08:27:59 +00:00
Will Anderson 6ac4faabc5 Deploy neuron-mcp:97bd19ba — Postgres marketplace datasource 2026-04-26 03:27:46 -05:00
will.anderson 700b94e2b1 Merge pull request 'Revert CI to host Docker socket, remove buildkitd' (#19) from feat/marketplace-postgres-infra into main 2026-04-26 07:51:10 +00:00
Will Anderson 2550786965 Revert CI to host Docker socket, remove buildkitd
BuildKit rootless failed on k3s (mount propagation), privileged mode
fixed that but buildkitd gRPC is incompatible with the Docker REST API
that forgejo-runner needs to manage job containers. Net security change
vs. the original socket approach was zero (privileged ≈ socket access).

Remove buildkitd.yaml entirely. Restore docker-sock hostPath mount on
both runners. Builds work again.
2026-04-26 02:50:27 -05:00
will.anderson 5ab8a2eefc Merge pull request 'Wire marketplace Postgres datasource for neuron-mcp' (#18) from feat/marketplace-postgres-infra into main 2026-04-26 07:34:05 +00:00
Will Anderson e18d953ff9 Wire marketplace Postgres datasource for neuron-mcp
Add MARKETPLACE_DB_URL/USER to ConfigMap and MARKETPLACE_DB_PASSWORD
ExternalSecret (sourced from secret/legion-db in Vault). Remove the
SQLite subPath volume mount and fix-data-ownership initContainer from
the blue deployment — marketplace storage is now in Postgres.
2026-04-26 02:33:26 -05:00
will.anderson af00930a9f Merge pull request 'Pre-create neuron-marketplace.db for subPath mount' (#17) from fix/neuron-mcp-marketplace-touch into main 2026-04-26 07:06:06 +00:00
Will Anderson 952e781550 Pre-create neuron-marketplace.db in initContainer for subPath mount
subPath mounts require the source file to exist on the PVC before the
main container starts. Add 'touch /data/neuron-marketplace.db' to the
initContainer so the subPath mount can bind the file at /app/neuron-marketplace.db.
2026-04-26 02:06:02 -05:00
will.anderson 6b4553458e Merge pull request 'Mount marketplace DB via PVC subPath at /app/neuron-marketplace.db' (#16) from fix/neuron-mcp-marketplace-subpath into main 2026-04-26 07:03:16 +00:00
Will Anderson 702faca691 Mount neuron-marketplace.db from PVC via subPath at /app/neuron-marketplace.db
App hard-codes relative path 'neuron-marketplace.db' from working dir /app.
Mount the PVC file at /app/neuron-marketplace.db via subPath so the app can
write to it as UID 1000. PVC persists the file across pod restarts.
Also keep NEURON_MARKETPLACE_DB_PATH in ConfigMap for future app versions.
2026-04-26 02:03:09 -05:00
will.anderson d7235b4549 Merge pull request 'Force neuron-mcp restart for configmap update' (#15) from fix/neuron-mcp-restart into main 2026-04-26 06:52:27 +00:00
Will Anderson a6b0dc9211 Force pod restart to pick up NEURON_MARKETPLACE_DB_PATH configmap change
Add config-hash annotation to template to cycle the ReplicaSet and
ensure pods restart with the updated ConfigMap env var.
2026-04-26 01:52:19 -05:00
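The config-hash approach in the commit above can be sketched as follows. This is a minimal illustration, not the repository's actual tooling: the function names and the exact annotation key `config-hash` are assumptions. The idea is that a digest of the ConfigMap data is written into the pod template annotations, so any change to the data changes the template and the Deployment rolls a new ReplicaSet.

```python
import hashlib
import json


def config_hash(data: dict[str, str]) -> str:
    """Return a stable SHA-256 hex digest of ConfigMap data."""
    # Sort keys so the same data always produces the same digest.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_config_hash(pod_template: dict, data: dict[str, str]) -> dict:
    """Return a copy of the pod template with the hash annotation set."""
    metadata = dict(pod_template.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    # Hypothetical annotation key; the real manifest may use a different one.
    annotations["config-hash"] = config_hash(data)
    metadata["annotations"] = annotations
    return {**pod_template, "metadata": metadata}
```

Because the digest depends only on the data, re-applying an unchanged ConfigMap leaves the annotation untouched and triggers no restart.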
will.anderson 0e5f2f0f68 Merge pull request 'Fix neuron-mcp marketplace DB path' (#14) from fix/neuron-mcp-marketplace-db into main 2026-04-26 06:50:01 +00:00
Will Anderson ddaab28638 Fix marketplace DB: revert workingDir, add NEURON_MARKETPLACE_DB_PATH
workingDir: /data broke the JVM (app.jar is a relative path in /app).
Add NEURON_MARKETPLACE_DB_PATH=/data/neuron-marketplace.db env var
following the same pattern as NEURON_DB_PATH for the core DB.
2026-04-26 01:49:48 -05:00
will.anderson bb89b8e5f5 Merge pull request 'Set workingDir: /data for neuron-mcp (marketplace DB fix)' (#13) from fix/neuron-mcp-workdir into main 2026-04-26 06:47:14 +00:00
Will Anderson 26ed92ca38 Set workingDir: /data so relative DB paths resolve to PVC
neuron-marketplace.db was opening relative to /app (image root,
not writable by UID 1000). Set workingDir to /data so all relative
file opens land on the writable PVC volume.
2026-04-26 01:47:02 -05:00
will.anderson 10b2ea6045 Merge pull request 'Fix Neuron MCP SQLite READONLY: chown /data via initContainer' (#12) from fix/neuron-mcp-chown into main 2026-04-26 06:43:52 +00:00
Will Anderson 754cd72df5 Fix SQLite READONLY: chown /data to UID 1000 via initContainer
PVC was written as root before runAsUser: 1000 was added; SQLite
refused writes with SQLITE_READONLY. initContainer runs as root to
chown the volume, then the app runs as UID 1000 as required.
2026-04-26 01:43:37 -05:00
will.anderson 4137fb556b Merge pull request 'Fix Neuron MCP routing and security' (#11) from fix/neuron-mcp into main 2026-04-26 06:35:41 +00:00
Will Anderson 01245a6278 Fix Neuron MCP: runAsUser + move to neuron.neurontechnologies.ai
Pod was crash-looping: image has no USER directive so kubelet rejected it
with runAsNonRoot. Add runAsUser: 1000 at both pod and container level.

MCP paths moved from neurontechnologies.ai (now a GCP Cloud Run A record,
unreachable on Legion) to private subdomain neuron.neurontechnologies.ai.
Cloudflare tunnel and DNS CNAME updated out-of-band.
2026-04-26 01:35:20 -05:00
will.anderson cd57a7b789 Merge pull request 'Fix BuildKit readiness probe (TCP socket)' (#10) from fix/buildkit-probe into main 2026-04-26 06:24:07 +00:00
Will Anderson ffac2161bc Use TCP readiness probe for BuildKit — exec fails on k3s privileged containers 2026-04-26 01:23:52 -05:00
will.anderson 6be8c645b3 Merge pull request 'Switch BuildKit to privileged mode' (#9) from fix/buildkit-privileged into main 2026-04-26 06:21:15 +00:00
Will Anderson 6e67f62ffe Switch BuildKit to privileged mode — rootless fails on k3s mount propagation 2026-04-26 01:20:45 -05:00
will.anderson c6d42948ff Merge pull request 'Fix BuildKit rootless startup (allowPrivilegeEscalation)' (#8) from fix/buildkit-rootless into main 2026-04-26 06:18:08 +00:00
Will Anderson e138c45d51 Fix BuildKit rootless: allow privilege escalation for newuidmap/newgidmap 2026-04-26 01:17:41 -05:00
will.anderson d8e2a9601b Merge pull request 'Remove GCP stage environment' (#7) from chore/remove-gcp-stage into main 2026-04-26 06:10:13 +00:00
Will Anderson 5b16aebdcb Remove stage environment from GCP — staging is local only 2026-04-26 01:09:44 -05:00
Will Anderson 74dd054a36 feat(swarm): add swarm self-improvement loop infrastructure
Adds complete k8s manifests, ArgoCD app, and Terraform namespace
resources for the Neuron swarm self-improvement loop system.

Each variant (alpha, beta, gamma) gets its own isolated namespace,
PVC, MCP/REST deployments, ExternalSecrets from Vault, RBAC for CI,
and a SQLite clone Job template for session startup.
2026-04-26 00:58:43 -05:00
will.anderson 89a0209637 Merge pull request 'Fix Cloud Run probe ports and LB timeout_sec' (#5) from fix-terraform-probes into main 2026-04-26 05:28:49 +00:00
Will Anderson 53f2ad3c16 Fix Cloud Run probe ports and LB timeout_sec
- accounts: add ACCOUNTS_PORT=8080 env var (service defaults to 7753)
- api: add SERVER_PORT=8080 env var, change probe path to /actuator/health
- LB backends: remove timeout_sec (not supported for serverless NEG backends)
2026-04-26 00:28:33 -05:00
will.anderson 378680af01 Merge pull request 'Fix GCP deploy: postgres chart 18.x + Cloud Run PORT/probe fixes' (#4) from fix-gcp-deploy into main
Merge: fix GCP deploy
2026-04-26 04:56:08 +00:00
Will Anderson 36cfd3738d Fix Cloud Run: remove reserved PORT env, increase startup probe tolerance 2026-04-25 23:55:33 -05:00
Will Anderson 7adab317d4 fix: restore postgres chart to 18.x (was wrongly pinned to 16.x)
Chart 16.x deploys PG17 images that do not exist on Docker Hub.
Existing data directory is PG18. Restoring >=18.0.0,<19.0.0 range.
2026-04-26 04:51:07 +00:00
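The restored range `>=18.0.0,<19.0.0` admits any 18.x chart and rejects both 16.x and 19.x. A rough sketch of that check is below. It is not Helm's actual constraint resolver: it assumes plain `major.minor.patch` versions, ignores pre-release tags, and the helper names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a plain 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def in_range(version: str, lower: str = "18.0.0", upper: str = "19.0.0") -> bool:
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)
```

Comparing tuples rather than strings avoids the usual pitfall where "18.10.0" sorts before "18.9.0".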
will.anderson 1b40d32ab0 Merge pull request 'Harden prod + expand GCP to multi-region' (#3) from gcs-backup-wiring into main 2026-04-26 03:54:55 +00:00
Will Anderson bb583e3ccb Fix HCL syntax errors in accounts and api Cloud Run definitions 2026-04-25 22:54:18 -05:00
Will Anderson d4c65d5857 Expand GCP infra: accounts + API services, Cloud SQL, Artifact Registry
Architecture: intelligence stays on Legion; only compiled artifacts cross
to GCP. Source code and Neuron's knowledge base never leave the system.

Artifact Registry:
- neuron-marketing, neuron-accounts, neuron-api repos in us-central1
- Keep-last-10 cleanup policy; ci-pusher SA with writer access
- Legion CI runners authenticate via GCP_SA_KEY Gitea secret

Cloud SQL (cloud-sql.tf):
- postgres-15 on db-g1-small, us-central1 (scale up to REGIONAL HA at 1k users)
- Point-in-time recovery, 14-day backup retention
- Accounts DB + user; password generated and stored in Secret Manager
- JWT signing key in Secret Manager (shared by accounts + api)
- Cloud Run connects via built-in Auth Proxy (Unix socket volume mount)

Accounts Cloud Run (cloud-run-accounts.tf):
- 3 regions (us-central1, europe-west1, asia-northeast1), min:1 max:50
- Cloud SQL proxy volume mount; secrets via Secret Manager
- Stripe + JWT env vars; health probe on /health

API Cloud Run (cloud-run-api.tf):
- 3 regions, min:1 max:100, cpu_idle=false (always-hot)
- Validates JWTs from accounts service; no direct DB connection
- License admin token from Secret Manager

Load balancer (host-based routing):
- Same global anycast IP for all three services
- URL map routes by Host: neurontechnologies.ai→marketing,
  api.neurontechnologies.ai→api, accounts.neurontechnologies.ai→accounts
- New managed SSL certs for api.* and accounts.* added to HTTPS proxy
- Cloud Armor (WAF + rate limit) applied to all backends

Service accounts + IAM:
- neuron-accounts-sa: secretmanager.secretAccessor + cloudsql.client
- neuron-api-sa: secretmanager.secretAccessor
- allUsers invoker on all prod Cloud Run services (LB health checks)

bootstrap.sh:
- One-shot setup: pulls Stripe secrets from Vault → Secret Manager,
  creates CI SA JSON key, prints DNS + next-step instructions
2026-04-25 22:54:18 -05:00
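The host-based routing described in the commit above amounts to a lookup from the request's Host header to a backend. The sketch below only models that lookup. The real configuration is a Terraform URL map, and the function name, the port stripping, and the `None` result for unknown hosts are assumptions made for illustration.

```python
from typing import Optional

# Host-to-backend mapping taken from the commit message above.
ROUTES = {
    "neurontechnologies.ai": "marketing",
    "api.neurontechnologies.ai": "api",
    "accounts.neurontechnologies.ai": "accounts",
}


def backend_for(host_header: str) -> Optional[str]:
    """Return the backend name for a Host header, or None if unmatched."""
    host = host_header.strip().lower()
    # Drop an optional ':port' suffix (IPv6 literals are not handled here).
    host = host.split(":", 1)[0]
    return ROUTES.get(host)
```

For example, `backend_for("api.neurontechnologies.ai:443")` resolves to the `api` backend, while a host that is not in the map returns `None`.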
Will Anderson 93358505fc Harden prod: security, autoscaling, observability, BuildKit CI
Security:
- Drop ALL capabilities, enforce non-root, RuntimeDefault seccomp on
  neuron-mcp, neuron-rest, neuron-marketing pods
- Add startup probes (150s window for JVM) so liveness doesn't fire early
- Replace docker-sock hostPath with BuildKit rootless TCP endpoint
  (moby/buildkit:v0.19.0-rootless) — removes node root access from CI
- Document full ESO AppRole migration path in cluster-secret-store.yaml

Autoscaling & availability:
- HPAs on mcp (1–6), rest (1–4), marketing (2–8) at 65–70% CPU
- PodDisruptionBudgets (minAvailable: 1) on all three services
- NetworkPolicy: default-deny-all in neuron-prod, explicit allow rules
  for Traefik ingress, intra-namespace, and egress to DNS/platform/vault

Observability:
- ServiceMonitors for mcp, rest, marketing (cross-namespace enabled in
  kube-prometheus-stack with serviceMonitorSelectorNilUsesHelmValues:false)
- PrometheusRules: high error rate, high latency, crash loops, replica
  shortage, Postgres down/connections, backup failure, backup staleness

Chart version pinning:
- kube-prometheus-stack, loki, tempo, redis, alloy, postgres — all pinned
  to major-version ranges to block silent breaking upgrades

Backup hardening:
- restic:latest → restic:0.17.3 (deterministic image)
- Weekly backup-verify CronJob: restores latest snapshot and validates
  SQL dump structure (≥5 CREATE TABLE, pg_dump header check)

ArgoCD:
- neuron-prod AppProject: scopes deploys to neuron-prod + platform ns,
  blacklists ClusterRole/ClusterRoleBinding/Namespace creation,
  automated sync window 2–6am UTC, manual always allowed
2026-04-25 22:54:18 -05:00
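The backup-verify check in the commit above (at least 5 CREATE TABLE statements plus a pg_dump header) could look roughly like this. It is a sketch under assumptions, not the CronJob's actual script: it assumes a plain-text dump, looks for the standard `-- PostgreSQL database dump` comment in the first ten lines, and the function name is hypothetical.

```python
import re

# Comment line that plain-text pg_dump output places near the top of the file.
PG_DUMP_HEADER = "-- PostgreSQL database dump"
CREATE_TABLE = re.compile(r"^\s*CREATE TABLE\b", re.IGNORECASE | re.MULTILINE)


def dump_looks_valid(sql: str, min_tables: int = 5) -> bool:
    """Return True if the dump has the header and enough CREATE TABLE statements."""
    # Only inspect the first few lines for the header.
    head = "\n".join(sql.splitlines()[:10])
    if PG_DUMP_HEADER not in head:
        return False
    return len(CREATE_TABLE.findall(sql)) >= min_tables
```

A structural check like this catches empty or truncated dumps, but it does not prove the data itself is complete.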
Will Anderson 8fd3d12907 simplify neuron self-improve loop to blue/green + stage
Replace the aspirational alpha/beta/gamma model with the actual
deployment topology: prod runs blue/green in neuron-prod namespace,
stage is the single experiment slot in neuron-stage namespace.

The old script referenced neuron-alpha/beta/gamma deployments that
never existed. The new script uses blue-green-deploy.sh for prod
promotion and kubectl set image for stage experiments.

Loop: snapshot → deploy stage → evaluate → promote via blue/green.
2026-04-25 22:54:18 -05:00
will.anderson f3ed83cdd0 Wire GCS backup to neuron-db-backup-prod (neuron-494301) 2026-04-25 22:52:17 +00:00
Will Anderson 7eeff54a11 Wire GCS backup to neuron-db-backup-prod bucket (neuron-494301)
Bucket created, SA key stored in Vault at secret/gcs.
CronJob ExternalSecret updated to pull from secret/gcs.
Hourly restic backup now runs to both R2 and GCS.
2026-04-25 17:51:57 -05:00
Will Anderson 8d97bbd802 Merge branch 'main' of https://git.neuralplatform.ai/will/infrastructure 2026-04-25 17:50:32 -05:00
will.anderson 67aed61cfb Scale Docuseal up to 1 replica 2026-04-25 20:59:34 +00:00
Will Anderson 2d0ce77518 Scale Docuseal up to 1 replica 2026-04-25 15:59:03 -05:00
Will Anderson a37deca724 Add GCS backup bucket + dual-destination hourly backup (R2 + GCS)
Provision Google Cloud Storage bucket for neuron prod DB backups via Terraform.
Create dedicated backup service account with objectAdmin on the bucket.
Update neuron-prod backup CronJob to run restic against both R2 and GCS hourly —
R2 as primary, GCS as secondary, independent credentials and repositories.
2026-04-25 15:23:51 -05:00
Neuron CI 8de866a8b9 ci(neuron-prod): update rest+license to v0.15.3 2026-04-25 20:06:27 +00:00