The Multi-Armed Bandit (MAB) problem is a fundamental problem in reinforcement learning where an agent must choose between multiple actions ("arms"), each yielding stochastic rewards. The goal is to maximize the total reward while balancing exploration (trying new actions) and exploitation (leveraging known best actions).
In this post, we will implement a basic Multi-Armed Bandit problem in Python using the epsilon-greedy algorithm.
Step 1: Define the Environment
Each arm in the MAB has an associated probability distribution for generating rewards. We can represent this using a list of probabilities.
import numpy as np
import matplotlib.pyplot as plt

class MultiArmedBandit:
    def __init__(self, probabilities):
        self.probabilities = probabilities  # True reward probability of each arm
        self.k = len(probabilities)

    def pull(self, arm):
        # Bernoulli reward: 1 with the arm's probability, otherwise 0
        return 1 if np.random.rand() < self.probabilities[arm] else 0
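As a quick sanity check, repeatedly pulling one arm should yield an empirical mean close to its true probability. The sketch below is standalone and restates the reward rule as a small function rather than importing the class:

```python
import numpy as np

np.random.seed(0)
bandit_probs = [0.1, 0.5, 0.8]  # same example probabilities used later in the post

def pull(arm):
    # Same Bernoulli reward rule as MultiArmedBandit.pull
    return 1 if np.random.rand() < bandit_probs[arm] else 0

samples = [pull(2) for _ in range(10_000)]
print(np.mean(samples))  # close to the true probability 0.8
```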
Step 2: Implement the Epsilon-Greedy Algorithm
The epsilon-greedy strategy selects the best-known arm most of the time but explores a random arm with a small probability ε.
class EpsilonGreedyAgent:
    def __init__(self, k, epsilon=0.1):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)  # Number of times each arm has been selected
        self.values = np.zeros(k)  # Estimated value (mean reward) of each arm

    def select_arm(self):
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.k)  # Explore: pick a random arm
        return np.argmax(self.values)        # Exploit: pick the best-known arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean reward for this arm
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
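The update rule above is the standard incremental-mean formula, which avoids storing every past reward. A quick standalone check, on a hypothetical reward sequence, that it matches averaging the rewards directly:

```python
import numpy as np

rewards = [1, 0, 1, 1, 0]  # hypothetical reward sequence for one arm
value, count = 0.0, 0
for r in rewards:
    count += 1
    value += (r - value) / count  # same incremental update as in update()

print(value)             # running-mean estimate → 0.6
print(np.mean(rewards))  # direct sample mean → 0.6; the two agree
```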
Step 3: Simulating the Bandit Problem
We can now create a bandit environment and run our agent through multiple iterations to observe its learning behavior.
np.random.seed(42)

probabilities = [0.1, 0.5, 0.8]  # True reward probability of each arm
bandit = MultiArmedBandit(probabilities)
agent = EpsilonGreedyAgent(k=len(probabilities), epsilon=0.1)

num_trials = 1000
rewards = []

for _ in range(num_trials):
    arm = agent.select_arm()
    reward = bandit.pull(arm)
    agent.update(arm, reward)
    rewards.append(reward)
Step 4: Visualizing the Results
We can analyze the agent’s performance by plotting cumulative rewards.
plt.plot(np.cumsum(rewards))
plt.xlabel('Trials')
plt.ylabel('Cumulative Reward')
plt.title('Performance of Epsilon-Greedy Strategy')
plt.show()
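To see the exploration/exploitation trade-off more directly, we can compare a purely greedy agent (epsilon = 0) against epsilon = 0.1. The sketch below repeats the post's classes so it runs on its own; with zero-initialized value estimates, the greedy agent tends to lock onto the first arm it tries, while the epsilon-greedy agent discovers the better arm:

```python
import numpy as np

class MultiArmedBandit:
    def __init__(self, probabilities):
        self.probabilities = probabilities
        self.k = len(probabilities)
    def pull(self, arm):
        return 1 if np.random.rand() < self.probabilities[arm] else 0

class EpsilonGreedyAgent:
    def __init__(self, k, epsilon=0.1):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
    def select_arm(self):
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.k)
        return int(np.argmax(self.values))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def average_reward(epsilon, num_trials=1000, seed=42):
    # Run one agent for num_trials pulls and return its mean reward
    np.random.seed(seed)
    bandit = MultiArmedBandit([0.1, 0.5, 0.8])
    agent = EpsilonGreedyAgent(k=3, epsilon=epsilon)
    total = 0
    for _ in range(num_trials):
        arm = agent.select_arm()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        total += reward
    return total / num_trials

greedy_avg = average_reward(epsilon=0.0)  # never explores
eps_avg = average_reward(epsilon=0.1)     # explores 10% of the time
print(f"greedy: {greedy_avg:.2f}, epsilon-greedy: {eps_avg:.2f}")
```

The greedy agent's average reward stays near the first arm's 0.1 payoff, while the epsilon-greedy agent's average approaches the best arm's 0.8 (minus a small exploration cost).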
Conclusion
In this post, we implemented a basic Multi-Armed Bandit problem using the epsilon-greedy algorithm in Python. This method provides a simple yet effective approach to balancing exploration and exploitation in decision-making problems.
Future extensions could include UCB, Thompson Sampling, and Contextual Bandits to enhance learning efficiency in complex environments.
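As a taste of one such extension, here is a minimal sketch of the UCB1 selection rule: each arm's estimated value gets a confidence bonus that grows for rarely tried arms, so exploration happens without a random epsilon. This is a standard textbook formulation, not code from this post, and the helper name and `c` parameter are illustrative:

```python
import numpy as np

def ucb1_select(counts, values, t, c=2.0):
    """Pick the arm maximizing value + sqrt(c * ln(t) / count).
    Arms never tried (count == 0) are selected first."""
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))  # try each arm at least once
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(values + bonus))

# A rarely tried arm gets a large bonus even if its estimate is lower
print(ucb1_select(counts=[100, 2], values=[0.6, 0.5], t=102))  # → 1
```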