Job Description
Are you passionate about AI and eager to make a significant impact in the cybersecurity space?
Join us at our cutting-edge AI startup in the San Francisco Bay Area, where we are assembling a world-class team to tackle some of the most pressing challenges in cybersecurity.
As a Senior AI Infrastructure Engineer, you will own the design, deployment, and scaling of our AI infrastructure and production pipelines. You’ll bridge the gap between our AI research team and engineering organization, enabling the deployment of advanced LLM and ML models into secure, high-performance production systems.
You will build APIs, automate workflows, optimize GPU clusters, and ensure our models perform reliably in real-world cybersecurity applications. This role is ideal for someone who thrives in a startup environment — hands-on, cross-functional, and driven to build production-grade AI systems from the ground up.
Why Join Us:
- $25M Seed Funding: We are well-funded, with $25 million raised in our seed round, giving us the resources to innovate and scale rapidly.
- Proven Early Success: We’ve already partnered with Fortune 500 companies, demonstrating market traction and trust in our AI-driven cybersecurity solutions.
- Experienced Leadership: Our founders are second- and third-time entrepreneurs with 25+ years in cybersecurity — having led companies to valuations exceeding $3B.
- World-Class Leadership Team: Heads of AI, Engineering, and Product come from top global tech companies, ensuring best-in-class mentorship and technical direction.
- Cutting-Edge AI Solutions: We leverage the most advanced AI technologies, including Large Language Models (LLMs), Generative AI, and intelligent inference systems.
- Generous Compensation: Competitive salary, meaningful equity, and a high-growth environment where your impact is recognized and rewarded.
- Cybersecurity Knowledge Preferred but Not Required: We value strong AI/ML and infrastructure engineering talent above all — cybersecurity expertise can be learned on the job.
Key Responsibilities:
Core (Mission-Critical)
- Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar).
- Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments.
- Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform.
- Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability.
Infrastructure & Reliability
- Build and maintain infrastructure-as-code (IaC) setups using Terraform or Pulumi for reproducible environments.
- Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus / Grafana / OpenTelemetry.
- Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools.
- Architect scalable, hybrid AI systems across on-prem and cloud, enabling cost-effective compute scaling and fault tolerance.
Security, Data, and Performance
- Enforce data privacy and compliance across AI pipelines (SOC 2, encryption, access control, VPC isolation).
- Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings.
- Optimize inference latency, GPU utilization, and throughput, using batching, caching, or quantization techniques.
- Build fallback and failover mechanisms to maintain service reliability in case of model or API failure.
Innovation & Leadership
- Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI).
- Create sandbox environments for AI researchers to experiment safely.
- Lead cost optimization and capacity planning , forecasting GPU and cloud needs.
- Document and maintain runbooks, architecture diagrams, and standard operating procedures.
- Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement.
Qualifications:
Required
- 5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering.
- Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM).
- Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC).
- Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi).
- Solid understanding of CI/CD, containerization (Docker), and model registry practices.
- Experience implementing observability, monitoring, and fault-tolerant deployments.
Preferred
- Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant).
- Exposure to security- or compliance-focused environments.
- Experience with PyTorch / TensorFlow and MLflow / Weights & Biases.
- Knowledge of distributed training or large-scale inference optimization (DeepSpeed, TensorRT, quantization).
- Prior work at startups or fast-paced R&D-to-production environments.
Our Culture & Team
- Collaborative Environment: Join a fast-moving, innovation-driven startup where every engineer has a direct impact.
- World-Class Leadership: Mentorship from leaders with deep expertise in AI, ML, and cybersecurity.
- Growth Opportunities: Access to professional development, top-tier conferences, and bleeding-edge AI projects.
- Diversity and Inclusion: We believe that diverse perspectives drive stronger innovation.
Perks & Benefits
- Comprehensive health, dental, and vision insurance.
- Wellness and professional development stipends.
- Equity options — share in the company’s success.
- Access to the latest tools and GPUs for AI/ML development.