OMSCS Seminar: System and Architecture Concepts for ML and Large Language Models

Overview


Course Description

This seminar introduces the system-level concepts underpinning modern ML and LLM workloads. Topics include OS-level scheduling and kernel resource management, accelerator and GPU architecture optimizations, memory hierarchies, storage solutions, virtualization, and network designs optimized for large-scale data processing. The course emphasizes conceptual understanding through readings and discussions; no prior programming or systems background is required.

Course Content

Week 1:
Introduction to LLMs and Seminar Overview
– Seminar logistics and expectations
– High-level concepts behind LLMs and system-level impacts

Week 2:
Reading Week – “Attention Is All You Need” (NeurIPS 2017)
– Origins of the Transformer compute pattern
– Foundational architecture for modern LLMs
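
For a concrete picture of the compute pattern introduced in this reading, the sketch below implements single-head scaled dot-product attention in plain NumPy. It is a minimal illustration, not the paper's reference code; the shapes and the softmax helper are my own choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices for a single attention head.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)   # each query attends over all keys
    return weights @ V                   # weighted sum of value vectors

# Toy usage: 4 tokens, 8-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

The (seq_len by seq_len) score matrix is the quantity whose quadratic growth motivates several later readings, including FlashAttention and PagedAttention.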

Week 3:
Discussion: BERT (NAACL 2019) + Week 2 Reading
– Pre-training and workload evolution
– How model scale changes system and resource needs
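
To make "how model scale changes system and resource needs" concrete, here is a rough, assumed accounting of training-state memory: fp16 weights and gradients plus fp32 Adam master weights, momentum, and variance, or about 16 bytes per parameter (the figure used in the ZeRO line of work). Activations and KV caches are excluded, so treat the numbers as a lower bound.

```python
def training_state_gb(num_params, bytes_per_param=16):
    # Assumed mixed-precision Adam accounting: fp16 weights (2) + fp16 grads (2)
    # + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4) = 16 B/param.
    return num_params * bytes_per_param / 1e9

for name, n in [("BERT-large, ~340M params", 340e6), ("GPT-3 scale, ~175B params", 175e9)]:
    print(f"{name}: ~{training_state_gb(n):,.0f} GB of training state")
# ~5 GB fits on a single GPU; ~2,800 GB has to be partitioned or offloaded.
```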

Week 4:
Reading: "Mind the Memory Gap" (pre-publication/working paper)
– OS-level and hardware implications of memory hierarchies in ML workloads

Week 5:
Discussion: ZeRO-Infinity (SC 2021) + Week 4 Reading
– Mitigating GPU memory bottlenecks in large model training
– Partitioning strategies and offloading techniques
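
As a toy illustration of the partitioning idea (not DeepSpeed's actual API), the sketch below splits optimizer state evenly across data-parallel ranks; ZeRO-Infinity additionally lets each rank keep its shard in CPU or NVMe memory and fetch it on demand.

```python
def shard_ranges(num_params, world_size):
    # ZeRO-style partitioning: each data-parallel rank owns 1/world_size of the
    # optimizer state instead of every rank holding a full replica.
    shard = (num_params + world_size - 1) // world_size
    return [(r, r * shard, min((r + 1) * shard, num_params)) for r in range(world_size)]

# 10B parameters across 8 ranks: ~1.25B parameters' worth of state per rank,
# which can then live on the GPU, in host DRAM, or on NVMe (offloading).
for rank, start, end in shard_ranges(10_000_000_000, 8):
    print(f"rank {rank}: owns state for params [{start:,}, {end:,})")
```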

Week 6:
Reading: PagedAttention / vLLM (SOSP 2023)
– Memory-efficient inference for LLMs
– Virtual memory and runtime paging for transformers
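
The sketch below is a toy block-table allocator in the spirit of PagedAttention; the class and method names are invented for illustration and are not vLLM's API. The point is the OS analogy: KV-cache memory is handed out in fixed-size blocks on demand rather than reserved contiguously per request.

```python
class PagedKVCache:
    # Toy block-table allocator in the spirit of PagedAttention (not vLLM's API).

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical KV blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        # Demand paging: grab a new physical block only when the sequence's
        # current block is full, so memory is never reserved up front.
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        # Physical slot where this token's key/value vectors would be stored.
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        # A finished request returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(40):
    cache.append_token("request-0")
print(len(cache.block_tables["request-0"]))  # 3 blocks cover 40 tokens
cache.free("request-0")
```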

Week 7:
Discussion: CacheGen (SIGCOMM 2024) + Week 6 Reading
– How memory management innovations impact LLM performance
– Compiler/runtime co-design for caching

Week 8:
Reading: MLPerf Benchmark Suite
– Standardized benchmarks for ML training and inference
– Overview of workloads and performance metrics

Week 9:
Discussion: MLPerf + Week 8 Reading
– Trade-offs in system design using benchmarked data
– Latency, throughput, power efficiency
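
A minimal sketch of the kind of measurement these discussions revolve around is shown below; `fake_inference` is an invented stand-in for a real model call, and the numbers it produces only illustrate the batch-size trade-off between throughput and tail latency.

```python
import time, statistics

def fake_inference(batch):
    # Stand-in for a model forward pass; replace with a real call.
    time.sleep(0.002 * len(batch))
    return [0] * len(batch)

def benchmark(batch_size, num_batches=200):
    latencies = []
    start = time.perf_counter()
    for _ in range(num_batches):
        t0 = time.perf_counter()
        fake_inference(list(range(batch_size)))
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "throughput_samples_per_s": batch_size * num_batches / elapsed,
        "p50_latency_ms": 1000 * statistics.median(latencies),
        "p99_latency_ms": 1000 * latencies[int(0.99 * len(latencies)) - 1],
    }

# Larger batches usually raise throughput but also raise per-request latency,
# which is the core trade-off surfaced by benchmark suites like MLPerf.
print(benchmark(batch_size=1))
print(benchmark(batch_size=8))
```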

Week 10:
Reading: FlashAttention
– Algorithmic improvements for attention layers
– Efficient memory usage during training and inference
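
The sketch below shows the online-softmax trick at the heart of FlashAttention, in NumPy: attention is computed by streaming over key/value blocks, so the full seq_len-by-seq_len score matrix is never materialized. The real kernel also tiles queries, exploits on-chip SRAM, and fuses the whole computation into one GPU kernel; none of that is modeled here.

```python
import numpy as np

def attention_tiled(Q, K, V, block=64):
    # Online softmax over key/value blocks: keep a running max (m), a running
    # softmax denominator (l), and an unnormalized output accumulator (acc),
    # rescaling them whenever a new block raises the running max.
    n, d = Q.shape
    acc = np.zeros_like(Q)
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                  # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale previous accumulators
        p = np.exp(s - m_new[:, None])
        acc = acc * scale[:, None] + p @ Vb
        l = l * scale + p.sum(axis=1)
        m = m_new
    return acc / l[:, None]

# Matches naive attention while only ever holding (n x block) score tiles.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
scores = Q @ K.T / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(np.allclose(attention_tiled(Q, K, V), weights @ V))  # True
```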

Week 11:
Discussion: Tentative + Week 10 Reading
– FlashAttention vs. standard attention mechanisms
– The role of custom CUDA kernels and hardware optimizations

Week 12:
Reading: QoS-Efficient Serving of Multiple MoE LLMs Using Partial Runtime Reconfiguration
– Serving multiple mixture-of-expert models efficiently
– QoS-aware scheduling and reconfiguration
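
To ground the discussion, the sketch below shows the top-k gating pattern that defines mixture-of-experts inference; the random gating weights are purely illustrative and not taken from the paper. Each token touches only k experts, so which experts become hot, and on which GPUs they live, is what QoS-aware scheduling and runtime reconfiguration have to manage.

```python
import numpy as np

def route_tokens(token_reprs, gate_weights, k=2):
    # Top-k gating: each token is sent to the k experts with the highest
    # gate scores, so only a fraction of the model's weights are touched
    # per token.
    logits = token_reprs @ gate_weights          # (tokens, num_experts)
    return np.argsort(-logits, axis=1)[:, :k]    # expert ids per token

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))               # 16 token representations
gate = rng.normal(size=(64, 8))                  # router for 8 experts
assignments = route_tokens(tokens, gate)
# Per-expert load; imbalance here is what QoS-aware schedulers must absorb.
print(np.bincount(assignments.ravel(), minlength=8))
```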

Week 13:
Discussion: Synergy (OSDI 2022) + Week 12 Reading
– Resource management in multi-tenant clusters
– OS/hypervisor techniques for optimizing ML workloads

Week 14:
Reading: ALISA (ISCA 2024)
– Accelerator-aware load balancing and scheduling
– Co-design of datacenter infrastructure for AI workloads

Week 15:
Discussion: Tentative + Week 14 Reading
– Datacenter architectures, accelerators, and power-aware scheduling
– Wrap-up of system-level trends in ML and LLMs

Week 16:
Buffer/Make-Up Week
– Reserved for catch-up in case of holidays or conference conflicts

 


Requirements & Materials

Materials

PROVIDED (Student will receive):

All content is available in Canvas.

Who Should Attend

This seminar is designed for OMSCS students and alumni interested in the system-level foundations that support machine learning and large language models.


What You Will Learn

  • The role of operating systems, kernels, and schedulers in ML/LLM workloads
  • GPU and accelerator architectures and how they affect performance
  • How to read and interpret systems research papers and white papers

How You Will Benefit

  • Gain a foundational understanding of the hardware and software systems behind ML and LLMs.
  • Learn to evaluate architectural trade-offs when scaling or deploying models.
  • Learn to read and critically assess systems-focused academic and industry publications.
  • Develop a working understanding of efficient, secure, scalable, and energy-efficient architectures for ML, DL, and LLM applications.
  • Gain insight into performance bottlenecks, OS and hardware interactions, container orchestration, and resource scheduling through white papers and conference publications.
  • Grow Your Professional Network
  • Taught by Experts in the Field
