A collection of in-depth paper reviews on Systems for Machine Learning, covering distributed computing, ML infrastructure, and emerging AI frameworks.
Comprehensive reviews of 22 research papers exploring cutting-edge systems, frameworks, and infrastructure for AI and machine learning applications.
Analysis of Ray's unified interface for task-parallel and actor-based computation with a dynamic execution engine.
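A minimal sketch of the task/actor duality this review covers, using Ray's public `ray.remote` / `ray.get` API; the function and actor below are illustrative toys, not examples from the paper.

```python
import ray

ray.init()  # start a local Ray runtime

# Task parallelism: a stateless function executed remotely.
@ray.remote
def square(x):
    return x * x

# Actor model: a stateful worker whose methods run as remote tasks.
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

futures = [square.remote(i) for i in range(4)]           # object refs, returned immediately
counter = Counter.remote()
result_ref = counter.add.remote(sum(ray.get(futures)))   # actor call is also asynchronous
print(ray.get(result_ref))                               # 0 + 1 + 4 + 9 = 14
```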
Zero-shot video search system using VLMs with prompt expansion and diversity-aware selection for efficient video analytics.
Flexible dataflow graph architecture enabling scalable ML computation across CPUs, GPUs, and TPUs with a unified programming model.
Google's TPU delivering 15-30× speedups for neural network inference with a domain-specific design built around an 8-bit systolic matrix multiply unit.
Framework unifying ML input pipelines with systematic optimization (reordering, caching, fusion) across PyTorch, TensorFlow, and JAX.
Policy-driven retraining system for continuously growing datasets with drift-based triggers and sample-level selection strategies.
Compiler that automates parallelism strategies for large-model training, using ILP and dynamic programming to derive optimal execution plans.
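As an illustration of the dynamic-programming side of such planners, the sketch below balances contiguous layer ranges across pipeline stages under a hypothetical per-layer cost list; it is a simplified stand-in, not the paper's actual formulation.

```python
from functools import lru_cache

def partition_layers(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the slowest stage
    (the pipeline bottleneck). layer_costs: tuple of per-layer step times."""
    n = len(layer_costs)
    prefix = [0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        if stages_left == 1:                      # last stage takes all remaining layers
            return prefix[n] - prefix[start], (n,)
        answer = (float("inf"), ())
        for cut in range(start + 1, n - stages_left + 2):
            head = prefix[cut] - prefix[start]    # cost of this stage
            tail, cuts = best(cut, stages_left - 1)
            answer = min(answer, (max(head, tail), (cut,) + cuts))
        return answer

    return best(0, num_stages)

# Example: six layers with uneven costs split across three pipeline stages.
print(partition_layers((4, 2, 2, 3, 1, 4), 3))    # -> (6, (2, 4, 6))
```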
System combining intra-batch and inter-batch parallelism with 1F1B scheduling and weight stashing for efficient distributed training.
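A toy generator for the 1F1B pattern: each stage runs a warmup of forward passes and then alternates one forward with one backward, which bounds the number of in-flight microbatches that weight stashing must keep weight copies for. Stage and microbatch counts here are arbitrary.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Return the ('F', i) / ('B', i) step order for one pipeline stage under 1F1B."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    schedule, next_fwd, next_bwd = [], 0, 0
    for _ in range(warmup):                 # warmup: forwards only
        schedule.append(("F", next_fwd))
        next_fwd += 1
    while next_bwd < num_microbatches:      # steady state: one forward, one backward
        if next_fwd < num_microbatches:
            schedule.append(("F", next_fwd))
            next_fwd += 1
        schedule.append(("B", next_bwd))
        next_bwd += 1
    return schedule

# The last stage alternates F0,B0,F1,B1,...; earlier stages keep more
# microbatches in flight, which is what weight stashing has to version.
for stage in range(4):
    print(stage, one_f_one_b(stage, num_stages=4, num_microbatches=6))
```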
Extends ZeRO-3 with NVMe offloading, memory-centric tiling, and bandwidth-centric partitioning to train trillion-parameter models on a limited number of GPUs.
Fault-tolerant training system using adaptive pipelining, decoupled backprop, and staggered optimizer to sustain throughput during GPU failures.
Meta's system for scheduling ML workloads across global datacenters with temporal and scope decoupling, achieving 98% GPU allocation efficiency.
Declarative API for ML inference with automatic model variant selection, cost optimization using ILP, and dynamic vertical autoscaling.
Study showing model size and training tokens should scale equally for compute-optimal training, challenging prior scaling laws.
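A back-of-the-envelope version of that rule, using the common C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio implied by equal scaling; both are approximations, not the paper's fitted constants.

```python
def compute_optimal(flops_budget, tokens_per_param=20):
    """Split a training FLOP budget between parameters (N) and tokens (D),
    assuming C ~ 6 * N * D and D ~ tokens_per_param * N."""
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A ~5.9e23 FLOP budget lands near a 70B-parameter model trained on
# ~1.4T tokens, i.e. roughly the Chinchilla configuration.
n, d = compute_optimal(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```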
Iteration-level scheduling and selective batching for efficient autoregressive model serving, achieving 36.9x higher throughput than FasterTransformer.
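A toy serving loop showing what iteration-level scheduling means: the batch is re-formed at every decoding step, so finished sequences leave and waiting requests join immediately instead of waiting for the whole batch to drain. `model_step` and the request fields are hypothetical placeholders.

```python
from collections import deque

def serve(requests, model_step, max_batch=8):
    """Re-form the batch every decoding iteration: admit waiting requests,
    run one step for the whole batch, retire sequences as soon as they finish."""
    waiting, running, finished = deque(requests), [], []
    while waiting or running:
        while waiting and len(running) < max_batch:   # admission at iteration granularity
            running.append(waiting.popleft())
        new_tokens = model_step(running)              # one decode step for the current batch
        still_running = []
        for request, token in zip(running, new_tokens):
            request["output"].append(token)
            if token == "<eos>" or len(request["output"]) >= request["max_tokens"]:
                finished.append(request)              # leaves the batch immediately
            else:
                still_running.append(request)
        running = still_running
    return finished

# Hypothetical usage: model_step would wrap the model's decode kernel.
reqs = [{"prompt": p, "output": [], "max_tokens": 4} for p in ("a", "b", "c")]
done = serve(reqs, model_step=lambda batch: ["tok"] * len(batch))
print([len(r["output"]) for r in done])               # [4, 4, 4]
```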
PagedAttention system using OS-like paging for KV cache management, achieving 2-4x throughput gains and 96%+ memory utilization.
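A minimal sketch of the paging idea: the KV cache is carved into fixed-size blocks handed out from a shared free list, and each sequence keeps a block table rather than a contiguous reservation, so memory is committed one block at a time. Class and method names are illustrative, not the system's API.

```python
class PagedKVCache:
    """Toy KV-cache allocator: fixed-size blocks, per-sequence block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # shared physical block pool
        self.block_tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                             # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # last block is full: map a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size    # (physical block, slot within block)

    def release(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                                   # 20 tokens occupy just 2 blocks
    cache.append_token("req-0")
print(cache.block_tables["req-0"])                    # -> [3, 2]
```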
Separates prefill and decoding phases onto different GPUs with bandwidth-aware placement, achieving up to 7.4x higher request rates.
Hierarchical search with clustered indices for trillion-token RAG systems, achieving 9.33x speedup with K-means clustering and DVFS.
Architecture combining LLMs with memory streams, reflection, and planning for believable AI agents exhibiting emergent social behaviors.
ByteDance's system achieving 55.2% MFU on 12,288 GPUs with algorithm-system co-design, parallel transformers, and hybrid networking.
Unified paging and heterogeneous batching for serving thousands of LoRA adapters with custom CUDA kernels, achieving 4x throughput over vLLM.
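A NumPy sketch of heterogeneous batching: every request in one batch shares the base weights but is routed through its own low-rank adapter (y = xW + xA_iB_i). Adapter names and shapes are made up, and the real system fuses this into custom CUDA kernels over paged adapter memory.

```python
import numpy as np

def batched_lora_forward(x, base_w, adapters, adapter_ids):
    """x: (batch, d_in); base_w: (d_in, d_out);
    adapters: name -> (A: (d_in, r), B: (r, d_out)); adapter_ids: per-row adapter name."""
    y = x @ base_w                             # shared base-model matmul for the whole batch
    for i, name in enumerate(adapter_ids):     # per-request low-rank correction
        a, b = adapters[name]
        y[i] += x[i] @ a @ b
    return y

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4
base_w = rng.standard_normal((d_in, d_out))
adapters = {name: (rng.standard_normal((d_in, rank)),
                   rng.standard_normal((rank, d_out)))
            for name in ("chat", "code")}
x = rng.standard_normal((3, d_in))
out = batched_lora_forward(x, base_w, adapters, ["chat", "code", "chat"])
print(out.shape)                               # (3, 8)
```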
Analysis of privacy vulnerabilities in ML system components such as data filters, tokenizers, and memorization detectors, extending privacy analysis beyond the model itself.
Holistic analysis of AI's carbon footprint including embodied carbon from hardware manufacturing and operational efficiency across ML lifecycle.