Course Outline

EXO Infrastructure as Code

  • Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
  • Automating dependency installation (Xcode, uv, Node.js, Rust) using configuration management
  • Leveraging Nix flakes for reproducible EXO builds and developer environments
  • Writing Ansible playbooks or shell scripts for unattended cluster provisioning

Reproducible Builds and CI Integration

  • Pinning dependencies and building the dashboard within CI pipelines
  • Executing EXO smoke tests in GitHub Actions or GitLab CI runners
  • Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
  • Versioning custom model cards alongside application code

Cluster Discovery and Networking Automation

  • Configuring mDNS and static DNS for reliable libp2p node discovery
  • Automating network profile creation and Thunderbolt bridge management on macOS
  • Using custom namespaces (EXO_LIBP2P_NAMESPACE) to segregate dev, staging, and prod clusters
  • Implementing firewall rules and network segmentation for multi-tenant environments
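
As a minimal sketch of the namespace idea in this module, the Python helper below builds a per-stage environment for launching a node. Only the EXO_LIBP2P_NAMESPACE variable name comes from the outline; the stage names, the "exo-<stage>" value format, and the helper itself are illustrative assumptions.

```python
import os

# Hypothetical stage names; adjust to your own environments.
STAGES = ("dev", "staging", "prod")


def namespaced_env(stage, base_env=None):
    """Return a copy of the environment with a per-stage libp2p namespace.

    The "exo-<stage>" value format is an assumption for illustration;
    any string that differs between clusters would serve the same purpose.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    # Copy so the caller's (or the process's) environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = f"exo-{stage}"
    return env
```

The returned mapping could be passed as the env argument of subprocess.run when a provisioning script starts a node, with the intent that dev, staging, and prod nodes on the same network stay in separate clusters.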

Storage and Model Lifecycle Management

  • Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
  • Mounting NFS or SAN shares as read-only model repositories for rapid provisioning
  • Garbage collection of stale caches and defining versioned weight retention policies
  • Automating model pre-downloads and health checks prior to rolling updates
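
A retention policy for cached weights can be sketched as a pure function that decides what to prune, separate from the code that deletes anything. The version names and timestamps below are hypothetical, and the "keep the N most recently used" rule is one possible policy, not something prescribed by EXO.

```python
def versions_to_prune(versions, keep=2):
    """Pick cached weight versions to delete under a keep-newest-N policy.

    versions: mapping of version name -> last-used timestamp in seconds.
    Returns the names to delete, oldest first. Nothing is deleted here;
    the caller decides how to act on the result.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, so everything past index `keep` is a pruning candidate.
    newest_first = sorted(versions, key=versions.get, reverse=True)
    stale = newest_first[keep:]
    # Report oldest first so a partial cleanup removes the least useful data.
    return sorted(stale, key=versions.get)
```

Keeping the decision separate from the deletion makes the policy easy to test in CI and to run in a dry-run mode before any files are removed.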

Monitoring and Alerting

  • Shipping EXO logs to centralized logging platforms (ELK, Loki, or Splunk)
  • Building Grafana dashboards from EXO_TRACING_ENABLED output
  • Setting up alerts for cluster membership changes, OOM events, and inference latency spikes
  • Correlating macmon hardware telemetry with model performance regressions

Update, Rollback, and Disaster Recovery

  • Staging EXO binary updates on a canary node before fleet-wide rollout
  • Executing model-level rollbacks by switching between quantized versions without re-downloading
  • Backing up and restoring cluster state, custom namespaces, and cached weights
  • Documenting recovery runbooks for total cluster rebuild scenarios
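
The canary-first rollout in this module can be sketched as a planning step that orders nodes into waves. The host names and wave size are hypothetical, and the actual update and health-check commands are deliberately left out because they depend on how the fleet is managed.

```python
def rollout_waves(nodes, canary, wave_size=2):
    """Order nodes into update waves: the canary alone, then fixed-size batches.

    nodes: list of node names in the fleet (hypothetical host names).
    canary: the node that receives the update first.
    """
    if canary not in nodes:
        raise ValueError(f"canary {canary!r} is not in the fleet")
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    remaining = [n for n in nodes if n != canary]
    waves = [[canary]]
    # Batch the rest of the fleet in its original order.
    for start in range(0, len(remaining), wave_size):
        waves.append(remaining[start:start + wave_size])
    return waves
```

A rollout script would typically update one wave, run its health checks, and stop before the next wave if any check fails, which limits how many nodes need to be rolled back.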

Security Hardening and Compliance

  • Applying TLS at the reverse proxy layer (nginx, Traefik) for the dashboard and API
  • Implementing API rate limiting and IP allowlisting for EXO endpoints
  • Isolating clusters using VLANs and zero-trust network policies
  • Auditing access and maintaining an inventory of deployed models and versions

Requirements

  • Proficiency in DevOps practices (CI/CD, IaC, container orchestration)
  • Familiarity with macOS or Linux system administration and package management
  • Understanding of networking, DNS, and storage concepts

Target Audience

  • DevOps engineers
  • Infrastructure architects
  • SREs managing on-premises AI workloads

Duration

  • 21 hours
