Wednesday, February 26, 2025

Multi-Armed Bandit Problem from Scratch in Python

 

The Multi-Armed Bandit (MAB) problem is a fundamental problem in reinforcement learning where an agent must choose between multiple actions ("arms"), each yielding stochastic rewards. The goal is to maximize the total reward while balancing exploration (trying new actions) and exploitation (leveraging known best actions).

In this post, we will implement a basic Multi-Armed Bandit environment in Python and solve it with the Epsilon-Greedy Algorithm.

Step 1: Define the Environment

Each arm in the MAB has an associated reward distribution. Here we use Bernoulli arms: pulling arm i yields a reward of 1 with probability p_i and 0 otherwise, so the whole environment can be described by a list of probabilities.

import numpy as np
import matplotlib.pyplot as plt

class MultiArmedBandit:
    def __init__(self, probabilities):
        self.probabilities = probabilities  # True probabilities of each arm
        self.k = len(probabilities)
    
    def pull(self, arm):
        return 1 if np.random.rand() < self.probabilities[arm] else 0
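Before wiring up an agent, a quick sanity check confirms the reward model: an arm with success probability p should pay out at a rate close to p over many pulls. The seed and probability below are arbitrary illustrations:

```python
import numpy as np

# Sanity check: simulate one Bernoulli arm directly,
# mirroring the pull() logic above.
np.random.seed(0)
p = 0.8
pulls = [1 if np.random.rand() < p else 0 for _ in range(10_000)]
print(sum(pulls) / len(pulls))  # close to 0.8
```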

Step 2: Implement the Epsilon-Greedy Algorithm

The epsilon-greedy strategy balances the two modes explicitly: with probability ε the agent explores a uniformly random arm, and otherwise it exploits the arm with the highest estimated value.

class EpsilonGreedyAgent:
    def __init__(self, k, epsilon=0.1):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)  # Number of times each arm is selected
        self.values = np.zeros(k)  # Estimated values of each arm
    
    def select_arm(self):
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.k)  # Explore: uniformly random arm
        return int(np.argmax(self.values))  # Exploit (ties broken by lowest index)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental sample mean: Q_n = Q_{n-1} + (r - Q_{n-1}) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
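The update rule is just the sample mean computed incrementally: after n rewards, the estimate equals their average, without storing past rewards. A standalone check with an arbitrary reward sequence:

```python
import numpy as np

rewards = [1, 0, 1, 1, 0]
q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n  # same incremental update as in update()

print(q)  # equals np.mean(rewards) = 0.6
```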

Step 3: Simulating the Bandit Problem

We can now create a bandit environment and run our agent through multiple iterations to observe its learning behavior.

np.random.seed(42)
probabilities = [0.1, 0.5, 0.8]  # Probabilities for each arm
bandit = MultiArmedBandit(probabilities)
agent = EpsilonGreedyAgent(k=len(probabilities), epsilon=0.1)

num_trials = 1000
rewards = []

for _ in range(num_trials):
    arm = agent.select_arm()
    reward = bandit.pull(arm)
    agent.update(arm, reward)
    rewards.append(reward)
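Beyond total reward, it is instructive to inspect what the agent learned. The snippet below is a compressed, self-contained re-statement of the environment and agent above (same seed, same parameters): most pulls should concentrate on arm 2, and the value estimates should land near the true probabilities.

```python
import numpy as np

np.random.seed(42)
probs = [0.1, 0.5, 0.8]
counts = np.zeros(3)
values = np.zeros(3)

for _ in range(1000):
    # Epsilon-greedy selection (epsilon = 0.1)
    if np.random.rand() < 0.1:
        arm = np.random.choice(3)
    else:
        arm = int(np.argmax(values))
    reward = 1 if np.random.rand() < probs[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts)  # pull counts per arm; most pulls should go to arm 2
print(values)  # estimates should be near [0.1, 0.5, 0.8]
```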

Step 4: Visualizing the Results

We can analyze the agent's performance by plotting cumulative reward over trials. Once the agent locks onto the best arm, the curve should grow roughly linearly with a slope near that arm's true probability (0.8 here).

plt.plot(np.cumsum(rewards))
plt.xlabel('Trials')
plt.ylabel('Cumulative Reward')
plt.title('Performance of Epsilon-Greedy Strategy')
plt.show()

Conclusion

In this post, we implemented a basic Multi-Armed Bandit environment and solved it with the Epsilon-Greedy Algorithm in Python. This method provides a simple yet effective approach to balancing exploration and exploitation in decision-making problems.

Future extensions could include UCB (Upper Confidence Bound), Thompson Sampling, and contextual bandits, which can improve learning efficiency in more complex environments.
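As a taste of one such extension, here is a minimal sketch of the standard UCB1 selection rule (not part of the implementation above): each arm's score is its estimated value plus an exploration bonus of sqrt(2 ln t / n_a), so rarely pulled arms get tried even when their current estimate is lower. The example values are illustrative.

```python
import numpy as np

def ucb1_select(values, counts, t):
    """Pick the arm maximizing value + sqrt(2 * ln(t) / count).

    Arms that have never been pulled (count == 0) are tried first.
    """
    for arm, c in enumerate(counts):
        if c == 0:
            return arm
    scores = values + np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(scores))

# Arm 1 has the higher estimate, but arm 0's small count
# earns it a much larger exploration bonus, so it is chosen.
values = np.array([0.4, 0.5])
counts = np.array([2.0, 100.0])
print(ucb1_select(values, counts, t=102))  # → 0
```

Note how the bonus shrinks as an arm's count grows, so exploration fades automatically without a fixed ε.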
