A collection of in-depth paper reviews on Systems for Machine Learning, covering distributed computing, ML infrastructure, and emerging AI frameworks.
Comprehensive reviews of 22 research papers exploring cutting-edge systems, frameworks, and infrastructure for AI and machine learning applications.
Analysis of Ray's unified interface for task-parallel and actor-based computation with a dynamic execution engine.
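A minimal sketch of the task/actor duality this review covers, using Ray's public `ray.remote` / `ray.get` API; the function and actor below are illustrative toys, not examples from the paper.

```python
import ray

ray.init()  # start a local Ray runtime

# Task parallelism: a stateless function executed remotely.
@ray.remote
def square(x):
    return x * x

# Actor model: a stateful worker whose methods run as remote tasks.
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

futures = [square.remote(i) for i in range(4)]           # object refs, returned immediately
counter = Counter.remote()
result_ref = counter.add.remote(sum(ray.get(futures)))   # actor call is also asynchronous
print(ray.get(result_ref))                               # 0 + 1 + 4 + 9 = 14
```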
Zero-shot video search system using VLMs with prompt expansion and diversity-aware selection for efficient video analytics.
Flexible dataflow graph architecture enabling scalable ML computation across CPUs, GPUs, and TPUs with a unified programming model.
Google's TPU delivering 15-30× speedups for neural network inference with a domain-specific design built around an 8-bit systolic matrix multiply unit.
Framework unifying ML input pipelines with systematic optimization (reordering, caching, fusion) across PyTorch, TensorFlow, and JAX.
Policy-driven retraining system for continuously growing datasets with drift-based triggers and sample-level selection strategies.
Compiler that automates parallelism strategies for large-model training, using ILP and dynamic programming to derive optimal execution plans.
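As an illustration of the dynamic-programming side of such planners, the sketch below balances contiguous layer ranges across pipeline stages under a hypothetical per-layer cost list; it is a simplified stand-in, not the paper's actual formulation.

```python
from functools import lru_cache

def partition_layers(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the slowest stage
    (the pipeline bottleneck). layer_costs: tuple of per-layer step times."""
    n = len(layer_costs)
    prefix = [0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        if stages_left == 1:                      # last stage takes all remaining layers
            return prefix[n] - prefix[start], (n,)
        answer = (float("inf"), ())
        for cut in range(start + 1, n - stages_left + 2):
            head = prefix[cut] - prefix[start]    # cost of this stage
            tail, cuts = best(cut, stages_left - 1)
            answer = min(answer, (max(head, tail), (cut,) + cuts))
        return answer

    return best(0, num_stages)

# Example: six layers with uneven costs split across three pipeline stages.
print(partition_layers((4, 2, 2, 3, 1, 4), 3))    # -> (6, (2, 4, 6))
```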
System combining intra-batch and inter-batch parallelism with 1F1B scheduling and weight stashing for efficient distributed training.
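A toy generator for the 1F1B pattern: each stage runs a warmup of forward passes and then alternates one forward with one backward, which bounds the number of in-flight microbatches that weight stashing must keep weight copies for. Stage and microbatch counts here are arbitrary.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Return the ('F', i) / ('B', i) step order for one pipeline stage under 1F1B."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    schedule, next_fwd, next_bwd = [], 0, 0
    for _ in range(warmup):                 # warmup: forwards only
        schedule.append(("F", next_fwd))
        next_fwd += 1
    while next_bwd < num_microbatches:      # steady state: one forward, one backward
        if next_fwd < num_microbatches:
            schedule.append(("F", next_fwd))
            next_fwd += 1
        schedule.append(("B", next_bwd))
        next_bwd += 1
    return schedule

# The last stage alternates F0,B0,F1,B1,...; earlier stages keep more
# microbatches in flight, which is what weight stashing has to version.
for stage in range(4):
    print(stage, one_f_one_b(stage, num_stages=4, num_microbatches=6))
```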
Extends ZeRO-3 with NVMe offloading, memory-centric tiling, and bandwidth-centric partitioning to train trillion-parameter models on a limited number of GPUs.
Fault-tolerant training system using adaptive pipelining, decoupled backprop, and staggered optimizer to sustain throughput during GPU failures.
Meta's system for scheduling ML workloads across global datacenters with temporal and scope decoupling, achieving 98% GPU allocation efficiency.
Declarative API for ML inference with automatic model variant selection, cost optimization using ILP, and dynamic vertical autoscaling.
Study showing model size and training tokens should scale equally for compute-optimal training, challenging prior scaling laws.
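A back-of-the-envelope version of that rule, using the common C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio implied by equal scaling; both are approximations, not the paper's fitted constants.

```python
def compute_optimal(flops_budget, tokens_per_param=20):
    """Split a training FLOP budget between parameters (N) and tokens (D),
    assuming C ~ 6 * N * D and D ~ tokens_per_param * N."""
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A ~5.9e23 FLOP budget lands near a 70B-parameter model trained on
# ~1.4T tokens, i.e. roughly the Chinchilla configuration.
n, d = compute_optimal(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```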
Iteration-level scheduling and selective batching for efficient autoregressive model serving, achieving 36.9x higher throughput than FasterTransformer.
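A toy serving loop showing what iteration-level scheduling means: the batch is re-formed at every decoding step, so finished sequences leave and waiting requests join immediately instead of waiting for the whole batch to drain. `model_step` and the request fields are hypothetical placeholders.

```python
from collections import deque

def serve(requests, model_step, max_batch=8):
    """Re-form the batch every decoding iteration: admit waiting requests,
    run one step for the whole batch, retire sequences as soon as they finish."""
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        while waiting and len(running) < max_batch:   # admission at iteration granularity
            running.append(waiting.popleft())
        new_tokens = model_step(running)              # one decode step for the current batch
        still_running = []
        for request, token in zip(running, new_tokens):
            request["output"].append(token)
            if token == "<eos>" or len(request["output"]) >= request["max_tokens"]:
                finished.append(request)              # leaves the batch immediately
            else:
                still_running.append(request)
        running = still_running
    return finished

# Hypothetical usage: model_step would wrap the model's decode kernel.
reqs = [{"prompt": p, "output": [], "max_tokens": 4} for p in ("a", "b", "c")]
done = serve(reqs, model_step=lambda batch: ["tok"] * len(batch))
print([len(r["output"]) for r in done])               # [4, 4, 4]
```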
PagedAttention system using OS-like paging for KV cache management, achieving 2-4x throughput gains and 96%+ memory utilization.
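A minimal sketch of the paging idea: the KV cache is carved into fixed-size blocks handed out from a shared free list, and each sequence keeps a block table rather than a contiguous reservation, so memory is committed one block at a time. Class and method names are illustrative, not the system's API.

```python
class PagedKVCache:
    """Toy KV-cache allocator: fixed-size blocks, per-sequence block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # shared physical block pool
        self.block_tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                             # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # last block is full: map a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size    # (physical block, slot within block)

    def release(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                                   # 20 tokens occupy just 2 blocks
    cache.append_token("req-0")
print(cache.block_tables["req-0"])                    # -> [3, 2]
```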
Separates prefill and decoding phases onto different GPUs with bandwidth-aware placement, achieving up to 7.4x higher request rates.
Hierarchical search with clustered indices for trillion-token RAG systems, achieving 9.33x speedup with K-means clustering and DVFS.
Architecture combining LLMs with memory streams, reflection, and planning for believable AI agents exhibiting emergent social behaviors.
ByteDance's system achieving 55.2% MFU on 12,288 GPUs with algorithm-system co-design, parallel transformers, and hybrid networking.
Unified paging and heterogeneous batching for serving thousands of LoRA adapters with custom CUDA kernels, achieving 4x throughput over vLLM.
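A NumPy sketch of heterogeneous batching: every request in one batch shares the base weights but is routed through its own low-rank adapter (y = xW + xA_iB_i). Adapter names and shapes are made up, and the real system fuses this into custom CUDA kernels over paged adapter memory.

```python
import numpy as np

def batched_lora_forward(x, base_w, adapters, adapter_ids):
    """x: (batch, d_in); base_w: (d_in, d_out);
    adapters: name -> (A: (d_in, r), B: (r, d_out)); adapter_ids: per-row adapter name."""
    y = x @ base_w                             # shared base-model matmul for the whole batch
    for i, name in enumerate(adapter_ids):     # per-request low-rank correction
        a, b = adapters[name]
        y[i] += x[i] @ a @ b
    return y

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4
base_w = rng.standard_normal((d_in, d_out))
adapters = {name: (rng.standard_normal((d_in, rank)),
                   rng.standard_normal((rank, d_out)))
            for name in ("chat", "code")}
x = rng.standard_normal((3, d_in))
out = batched_lora_forward(x, base_w, adapters, ["chat", "code", "chat"])
print(out.shape)                               # (3, 8)
```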
Analysis of privacy vulnerabilities in ML system components such as data filters, tokenizers, and memorization detectors, extending privacy analysis beyond the model itself.
Holistic analysis of AI's carbon footprint including embodied carbon from hardware manufacturing and operational efficiency across ML lifecycle.