# Reinforcement Learning Graphical Representations

This repository contains 72 visualizations covering foundational concepts, algorithms, and advanced topics in reinforcement learning. The table below indexes every illustration; minimal code sketches for several of the tabulated updates follow the table.

| Category | Component | Illustration | Details | Context |
|---|---|---|---|---|
| MDP & Environment | Agent-Environment Interaction Loop | Illustration | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| MDP & Environment | Markov Decision Process (MDP) Tuple | Illustration | (S, A, P, R, γ) with transition dynamics and reward function (P(s′ \| s, a) and R(s, a, s′)) | |
| MDP & Environment | State Transition Graph | Illustration | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| MDP & Environment | Trajectory / Episode Sequence | Illustration | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| MDP & Environment | Continuous State/Action Space Visualization | Illustration | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment | Reward Function / Landscape | Illustration | Scalar reward as a function of state/action | All algorithms; especially reward shaping |
| MDP & Environment | Discount Factor (γ) Effect | Illustration | How future rewards are weighted (code sketch below) | All discounted MDPs |
| Value & Policy | State-Value Function V(s) | Illustration | Expected return from state s under policy π | Value-based methods |
| Value & Policy | Action-Value Function Q(s,a) | Illustration | Expected return from a state-action pair | Q-learning family |
| Value & Policy | Policy π(s) or π(a \| s) | Illustration | Arrow overlays on a grid (optimal policy), probability bar charts, or softmax heatmaps | |
| Value & Policy | Advantage Function A(s,a) | Illustration | Q(s,a) − V(s) | A2C, PPO, SAC, TD3 |
| Value & Policy | Optimal Value Function V\* / Q\* | Illustration | Solution to the Bellman optimality equation | Value iteration, Q-learning |
| Dynamic Programming | Policy Evaluation Backup | Illustration | Iterative update of V using the Bellman expectation equation | Policy iteration |
| Dynamic Programming | Policy Improvement | Illustration | Greedy policy update over Q | Policy iteration |
| Dynamic Programming | Value Iteration Backup | Illustration | Update using the Bellman optimality equation (code sketch below) | Value iteration |
| Dynamic Programming | Policy Iteration Full Cycle | Illustration | Evaluation → improvement loop | Classic DP methods |
| Monte Carlo | Monte Carlo Backup | Illustration | Update using the full episode return G_t | First-visit / every-visit MC |
| Monte Carlo | Monte Carlo Tree Search (MCTS) | Illustration | Search tree with selection, expansion, simulation, backpropagation | AlphaGo, AlphaZero |
| Monte Carlo | Importance Sampling Ratio | Illustration | Off-policy correction ρ = π(a \| s) / b(a \| s) | |
| Temporal Difference | TD(0) Backup | Illustration | Bootstrapped update using R + γV(s′) (code sketch below) | TD learning |
| Temporal Difference | Bootstrapping (general) | Illustration | Using an estimated future value instead of the full return | All TD methods |
| Temporal Difference | n-step TD Backup | Illustration | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| Temporal Difference | TD(λ) & Eligibility Traces | Illustration | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| Temporal Difference | SARSA Update | Illustration | On-policy TD control | SARSA |
| Temporal Difference | Q-Learning Update | Illustration | Off-policy TD control (code sketch below) | Q-learning, Deep Q-Network |
| Temporal Difference | Expected SARSA | Illustration | Expectation over the next action under the policy | Expected SARSA |
| Temporal Difference | Double Q-Learning / Double DQN | Illustration | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| Temporal Difference | Dueling DQN Architecture | Illustration | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| Temporal Difference | Prioritized Experience Replay | Illustration | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| Temporal Difference | Rainbow DQN Components | Illustration | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| Function Approximation | Linear Function Approximation | Illustration | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| Function Approximation | Neural Network Layers (MLP, CNN, RNN, Transformer) | Illustration | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| Function Approximation | Computation Graph / Backpropagation Flow | Illustration | Gradient flow through network | All deep RL |
| Function Approximation | Target Network | Illustration | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| Policy Gradients | Policy Gradient Theorem | Illustration | ∇_θ J(θ) = E[∇_θ log π(a \| s) · Q^π(s, a)]; flow diagram from reward → log-prob → gradient | |
| Policy Gradients | REINFORCE Update | Illustration | Monte Carlo policy gradient (code sketch below) | REINFORCE |
| Policy Gradients | Baseline / Advantage Subtraction | Illustration | Subtract b(s) to reduce variance | All modern PG |
| Policy Gradients | Trust Region (TRPO) | Illustration | KL-divergence constraint on the policy update | TRPO |
| Policy Gradients | Proximal Policy Optimization (PPO) | Illustration | Clipped surrogate objective (code sketch below) | PPO, PPO-Clip |
| Actor-Critic | Actor-Critic Architecture | Illustration | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| Actor-Critic | Advantage Actor-Critic (A2C/A3C) | Illustration | Synchronous/asynchronous multi-worker | A2C/A3C |
| Actor-Critic | Soft Actor-Critic (SAC) | Illustration | Entropy-regularized policy + twin critics | SAC |
| Actor-Critic | Twin Delayed DDPG (TD3) | Illustration | Twin critics + delayed policy + target smoothing | TD3 |
| Exploration | ε-Greedy Strategy | Illustration | Probability ε of a random action (code sketch below) | DQN family |
| Exploration | Softmax / Boltzmann Exploration | Illustration | Temperature τ in the softmax | Softmax policies |
| Exploration | Upper Confidence Bound (UCB) | Illustration | Optimism in the face of uncertainty (code sketch below) | UCB1, bandits |
| Exploration | Intrinsic Motivation / Curiosity | Illustration | Prediction error as intrinsic reward | ICM, RND, curiosity-driven RL |
| Exploration | Entropy Regularization | Illustration | Bonus term αH(π) | SAC, maximum-entropy RL |
| Hierarchical RL | Options Framework | Illustration | High-level policy over options (temporally extended actions) | Option-Critic |
| Hierarchical RL | Feudal Networks / Hierarchical Actor-Critic | Illustration | Manager-worker hierarchy | Feudal RL |
| Hierarchical RL | Skill Discovery | Illustration | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| Model-Based RL | Learned Dynamics Model | Illustration | P̂(s′ \| s, a); separate model network diagram (often RNN or transformer) | |
| Model-Based RL | Model-Based Planning | Illustration | Rollouts inside learned model | MuZero, DreamerV3 |
| Model-Based RL | Imagination-Augmented Agents (I2A) | Illustration | Imagination module + policy | I2A |
| Offline RL | Offline Dataset | Illustration | Fixed batch of trajectories | BC, CQL, IQL |
| Offline RL | Conservative Q-Learning (CQL) | Illustration | Penalty on out-of-distribution actions | CQL |
| Multi-Agent RL | Multi-Agent Interaction Graph | Illustration | Agents communicating or competing | MARL, MADDPG |
| Multi-Agent RL | Centralized Training, Decentralized Execution (CTDE) | Illustration | Shared critic during training | QMIX, VDN, MADDPG |
| Multi-Agent RL | Cooperative / Competitive Payoff Matrix | Illustration | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL | Reward Inference | Illustration | Infer reward from expert demonstrations | IRL, GAIL |
| Inverse RL / IRL | Generative Adversarial Imitation Learning (GAIL) | Illustration | Discriminator vs. policy generator | GAIL, AIRL |
| Meta-RL | Meta-RL Architecture | Illustration | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| Meta-RL | Task Distribution Visualization | Illustration | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| Advanced / Misc | Experience Replay Buffer | Illustration | Stored (s, a, r, s′, done) tuples (code sketch below) | DQN and all off-policy deep RL |
| Advanced / Misc | State Visitation / Occupancy Measure | Illustration | Frequency of visiting each state | All algorithms (analysis) |
| Advanced / Misc | Learning Curve | Illustration | Average episodic return vs. episodes / steps | Standard performance reporting |
| Advanced / Misc | Regret / Cumulative Regret | Illustration | Accumulated sub-optimality relative to the best policy | Bandits and online RL |
| Advanced / Misc | Attention Mechanisms (Transformers in RL) | Illustration | Attention weights | Decision Transformer, Trajectory Transformer |
| Advanced / Misc | Diffusion Policy | Illustration | Denoising diffusion process for action generation | Diffusion-RL policies |
| Advanced / Misc | Graph Neural Networks for RL | Illustration | Node/edge message passing | Graph RL, relational RL |
| Advanced / Misc | World Model / Latent Space | Illustration | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| Advanced / Misc | Convergence Analysis Plots | Illustration | Error / value change over iterations | DP, TD, value iteration |
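
## Code Sketches

The sketches below are minimal, self-contained Python illustrations of a few of the updates tabulated above. They are didactic sketches, not reference implementations; all function names, variable names, and toy inputs are our own.

For the return and discount-factor rows: a backward recursion computing the discounted return G_t = r_{t+1} + γ·G_{t+1} at every step of a recorded episode (the reward list and γ values are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backward over one episode."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# The same terminal reward influences early steps less as gamma shrinks.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.99))  # [0.9801, 0.99, 1.0]
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.50))  # [0.25, 0.5, 1.0]
```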
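For the value-iteration row: a tabular sweep of the Bellman optimality backup. The nested-dict transition format `P[s][a] = [(prob, next_state, reward), ...]` is an assumption, loosely mirroring the `P` attribute of Gymnasium's toy-text environments:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Repeat V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
    until the value table stops changing."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state MDP: in state 0, action 1 moves to absorbing state 1 for +1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
print(value_iteration(P, n_states=2, n_actions=2))  # ~[1.0, 0.0]
```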
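For the TD(0) and bootstrapping rows: the backup is a one-liner once V is stored as a table (here assumed to be a dict or array indexed by state):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    Bootstrapping: the target uses the current estimate V(s'),
    not the full Monte Carlo return.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error  # handy for eligibility traces or prioritized replay
```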
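For the Q-learning and ε-greedy rows: a minimal tabular control pair, assuming Q is a `defaultdict(float)` keyed by `(state, action)`:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)] -> estimated action value

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Explore uniformly with probability eps; otherwise exploit argmax_a Q(s, a)."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=0.99):
    """Off-policy TD control: the target maxes over next actions,
    regardless of which action the behavior policy actually takes."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Replacing `best_next` with `Q[(s_next, a_next)]` for the action actually selected yields the on-policy SARSA update from the same table.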
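For the REINFORCE and baseline-subtraction rows: a NumPy sketch of the Monte Carlo policy gradient for a tabular softmax policy. The constant mean-return baseline is one simple variance-reduction choice among many:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, episode, lr=0.01, gamma=0.99):
    """Monte Carlo policy gradient on a tabular softmax policy.

    theta:   (n_states, n_actions) array of policy logits.
    episode: list of (state, action, reward) triples from one rollout.
    """
    # Discounted return G_t at every step, computed backward.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = float(np.mean(returns))  # constant baseline to cut variance
    for (s, a, _), g_t in zip(episode, returns):
        pi = softmax(theta[s])
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0       # gradient of log softmax w.r.t. theta[s]
        theta[s] += lr * (g_t - baseline) * grad_log_pi
```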
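For the PPO row: the clipped surrogate objective, evaluated here on assumed example probability ratios r_t = π_new(a_t|s_t) / π_old(a_t|s_t) and advantages:

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """L^CLIP = E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ]."""
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))

# Ratios stop earning extra objective once they leave [1 - eps, 1 + eps].
ratios = np.array([0.7, 1.0, 1.5])
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(ratios, advantages))  # 0.866...; the 1.5 ratio is clipped to 1.2
```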
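For the UCB row: the classic UCB1 selection rule for a multi-armed bandit, assuming `counts` and `values` are running per-arm pull counts and mean rewards:

```python
import math

def ucb1_action(counts, values, t, c=2.0):
    """argmax_a [ mean reward + c * sqrt(ln t / n_a) ]; untried arms go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # optimism: sample every arm at least once
    return max(range(len(counts)),
               key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))
```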
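For the experience-replay row: a minimal uniform-sampling buffer (prioritized replay, covered in its own row, would instead sample proportionally to TD error):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions, sampled uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))  # five parallel lists

    def __len__(self):
        return len(self.buffer)
```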