Course Outline
EXO Infrastructure as Code
- Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
- Automating dependency installation (Xcode, uv, Node.js, Rust) using configuration management
- Leveraging Nix flakes for reproducible EXO builds and developer environments
- Writing Ansible playbooks or shell scripts for unattended cluster provisioning
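As a rough illustration of the dependency automation listed above, a provisioning script might first check which tools are already on the PATH before installing anything. The sketch below is a minimal example, not part of EXO itself: the command names in `REQUIRED_TOOLS` are assumptions about how the listed dependencies (Xcode, uv, Node.js, Rust) appear on a typical machine, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed command names for the dependencies named in the outline
# (Xcode command-line tools, uv, Node.js, Rust). Adjust for your hosts.
REQUIRED_TOOLS = ["xcodebuild", "uv", "node", "cargo"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found")
```

The `which` parameter exists only so the check can be exercised without touching the real PATH; a playbook or shell script could run the same check before an unattended install step.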
Reproducible Builds and CI Integration
- Pinning dependencies and building the dashboard within CI pipelines
- Executing EXO smoke tests in GitHub Actions or GitLab CI runners
- Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
- Versioning custom model cards alongside application code
Cluster Discovery and Networking Automation
- Configuring mDNS and static DNS for reliable libp2p node discovery
- Automating network profile creation and Thunderbolt bridge management on macOS
- Using custom namespaces (EXO_LIBP2P_NAMESPACE) to segregate dev, staging, and prod clusters
- Implementing firewall rules and network segmentation for multi-tenant environments
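The namespace segregation mentioned above could be sketched as a small helper that builds the process environment for each stage. Only the variable name `EXO_LIBP2P_NAMESPACE` comes from the outline; the namespace values and the `exo_env` function are hypothetical and chosen purely for illustration.

```python
import os

# Hypothetical namespace values, one per environment.
NAMESPACES = {
    "dev": "exo-dev",
    "staging": "exo-staging",
    "prod": "exo-prod",
}


def exo_env(stage, base_env=None):
    """Return a copy of the environment with the namespace set for `stage`."""
    if stage not in NAMESPACES:
        raise ValueError(f"unknown stage: {stage!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = NAMESPACES[stage]
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when a launcher script starts a node, so that nodes in different stages do not discover each other.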
Storage and Model Lifecycle Management
- Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
- Mounting NFS or SAN shares as read-only model repositories for rapid provisioning
- Garbage collection of stale caches and defining versioned weight retention policies
- Automating model pre-downloads and health checks prior to rolling updates
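One way to approach garbage collection of stale caches is to list entries whose modification time is older than a retention window, and review that list before deleting anything. The following is a minimal sketch under that assumption; `stale_entries` is a hypothetical helper, and the directory layout of a real EXO model cache is not assumed here.

```python
import time
from pathlib import Path


def stale_entries(cache_dir, max_age_days, now=None):
    """Return top-level entries in `cache_dir` not modified for `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # retention window in seconds
    return sorted(
        entry
        for entry in Path(cache_dir).iterdir()
        if entry.stat().st_mtime < cutoff
    )
```

The function only reports candidates. Keeping deletion as a separate, explicit step makes it easier to apply a retention policy (for example, never removing the weights currently being served).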
Monitoring and Alerting
- Shipping EXO logs to centralized logging platforms (ELK, Loki, or Splunk)
- Building Grafana dashboards from EXO_TRACING_ENABLED output
- Setting up alerts for cluster membership changes, OOM events, and inference latency spikes
- Correlating macmon hardware telemetry with model performance regressions
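An alert for inference latency spikes can be prototyped by comparing each sample with the median of the samples just before it. This is a simplified sketch, not EXO's own tracing format: the latency list, the window size, and the factor are all illustrative assumptions, and a production setup would normally express the same rule in the alerting system itself.

```python
from statistics import median


def latency_spikes(samples_ms, window=5, factor=2.0):
    """Return indices whose latency exceeds `factor` times the median
    of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples_ms)):
        baseline = median(samples_ms[i - window:i])
        if samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

Using the median rather than the mean keeps a single earlier outlier from inflating the baseline and hiding the next spike.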
Update, Rollback, and Disaster Recovery
- Staging EXO binary updates on a canary node before fleet-wide rollout
- Executing model-level rollbacks by switching between quantized versions without re-downloading
- Backing up and restoring cluster state, custom namespaces, and cached weights
- Documenting recovery runbooks for total cluster rebuild scenarios
Security Hardening and Compliance
- Applying TLS at the reverse proxy layer (nginx, traefik) for the dashboard and API
- Implementing API rate limiting and IP whitelisting for EXO endpoints
- Isolating clusters using VLANs and zero-trust network policies
- Auditing access and maintaining an inventory of deployed models and versions
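For the model inventory mentioned above, one possible starting point is a manifest of file checksums that can be committed or archived for audit purposes. The sketch below assumes nothing about how EXO stores weights; `model_inventory` is a hypothetical helper that hashes every file under a given directory.

```python
import hashlib
from pathlib import Path


def model_inventory(models_dir):
    """Map each file under `models_dir` (relative path) to its SHA-256 digest."""
    root = Path(models_dir)
    inventory = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large weight files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        inventory[path.relative_to(root).as_posix()] = digest.hexdigest()
    return inventory
```

Comparing two manifests taken at different times shows which model files were added, removed, or changed between audits.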
Requirements
- Proficiency in DevOps practices (CI/CD, IaC, container orchestration)
- Familiarity with macOS or Linux system administration and package management
- Understanding of networking, DNS, and storage concepts
Target Audience
- DevOps engineers
- Infrastructure architects
- SREs managing on-premises AI workloads
21 Hours