# Reinforcement Learning in DataRobot

In this notebook, we implement a simple model based on the Q-learning algorithm. It demonstrates a basic form of reinforcement learning that doesn't require a deep understanding of neural networks or advanced mathematics, and shows how one might deploy such a model in DataRobot.

This example shows the Grid World problem, where an agent learns to navigate a grid to reach a goal.

The notebook will go through the following steps:

1. Define State and Action Space
2. Create a Q-table to store expected rewards for each state/action combination
3. Implement learning algorithm and train model
4. Evaluate model
5. Deploy to a DataRobot REST API endpoint

## 1. Define State and Action Space

Let’s first install datarobotx for some convenient DataRobot deployment procedures.

``````In [ ]:

%%bash
pip install -U datarobotx``````
``````In [ ]:

import random

import numpy as np``````
``````In [ ]:

# Grid settings
grid_size = 4

# Function to build a list of all state tuples


def build_state_list(grid_size):
    state_list = []
    for i in range(grid_size):
        for j in range(grid_size):
            state_list.append((i, j))
    return state_list


all_states = build_state_list(grid_size)

# Here we just try to reach the corner state (3, 3) (could be the center or any other state)
goal_state = (3, 3)
n_states = grid_size * grid_size
n_actions = 4  # Up, Down, Left, Right``````
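
To make the layout concrete, here is a quick illustrative check (not part of the original notebook) that prints the grid with the goal cell marked:

``````In [ ]:

# Print the grid, marking the goal cell with "G" and every other cell with "."
for i in range(grid_size):
    print(" ".join("G" if (i, j) == goal_state else "." for j in range(grid_size)))``````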

## 2. Create a Q-table to store expected rewards for each state/action combination

``````In [ ]:

# Initialize Q-table
Q = np.zeros((n_states, n_actions))

# Helper functions


def state_to_index(state):
    return state[0] * grid_size + state[1]


def index_to_state(index):
    return (index // grid_size, index % grid_size)


def get_possible_actions(state):
    actions = []
    if state[0] > 0:
        actions.append(0)  # Up
    if state[0] < grid_size - 1:
        actions.append(1)  # Down
    if state[1] > 0:
        actions.append(2)  # Left
    if state[1] < grid_size - 1:
        actions.append(3)  # Right
    return actions


# State transition function; the guards prevent moving to invalid states
def take_action(state, action):
    new_state = list(state)
    if action == 0 and state[0] > 0:
        new_state[0] -= 1  # Up
    if action == 1 and state[0] < grid_size - 1:
        new_state[0] += 1  # Down
    if action == 2 and state[1] > 0:
        new_state[1] -= 1  # Left
    if action == 3 and state[1] < grid_size - 1:
        new_state[1] += 1  # Right
    return tuple(new_state)``````
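
Before training, a small sanity check (illustrative, not in the original notebook) confirms the helpers behave as expected: the index functions round-trip, and moves blocked by the grid edge are no-ops.

``````In [ ]:

# The index helpers should round-trip any state
assert index_to_state(state_to_index((2, 1))) == (2, 1)
# "Up" at the top row is a no-op; "Right" moves one column over
assert take_action((0, 0), 0) == (0, 0)
assert take_action((0, 0), 3) == (0, 1)``````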

## 3. Implement learning algorithm and train model
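
The training loop below implements the standard tabular Q-learning update. With learning rate $\alpha$, discount factor $\gamma$, reward $r$, and successor state $s'$, each step nudges the current estimate toward a bootstrapped target:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

An $\epsilon$-greedy policy balances exploration and exploitation: with probability $\epsilon$ the agent takes a random valid action, otherwise it follows the current Q-table.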

``````In [ ]:

# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1  # Exploration rate
n_episodes = 100000

# Training the model with corrected state transitions
for episode in range(n_episodes):
    # Start at a random state
    state = random.choice(all_states)
    done = state == goal_state

    while not done:
        state_index = state_to_index(state)
        if random.uniform(0, 1) < epsilon:
            # Explore: choose a random action
            action = random.choice(get_possible_actions(state))
        else:
            # Exploit: choose the best action from Q-table
            action = np.argmax(Q[state_index])

        # Take action and observe reward
        next_state = take_action(state, action)
        reward = 1 if next_state == goal_state else 0
        next_state_index = state_to_index(next_state)

        # Q-learning update
        Q[state_index, action] = Q[state_index, action] + learning_rate * (
            reward
            + discount_factor * np.max(Q[next_state_index])
            - Q[state_index, action]
        )

        # Transition to the next state
        state = next_state
        done = state == goal_state
``````

## 4. Evaluate model

First, we show a single trajectory, then measure how many actions it takes on average to reach the goal state.

``````In [ ]:

# Evaluating the model
state = random.choice(all_states)
print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])  # Choose the best action
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)``````
``````Out [ ]:

Initial state: (3, 3)
[(3, 3)]``````
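
The random start above happened to be the goal itself, so the trajectory is trivial. A quick way to inspect the whole policy at once (a sketch, not part of the original notebook) is to print the greedy action for every cell:

``````In [ ]:

# Render the greedy policy as one arrow per cell; the goal is marked "G"
arrows = {0: "^", 1: "v", 2: "<", 3: ">"}
for i in range(grid_size):
    row = []
    for j in range(grid_size):
        if (i, j) == goal_state:
            row.append("G")
        else:
            row.append(arrows[np.argmax(Q[state_to_index((i, j))])])
    print(" ".join(row))``````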
``````In [ ]:

total_actions = 0  # Total number of actions taken to reach the goal
for state in all_states:
    # Evaluating the model
    trajectory = [state]
    done = state == goal_state
    while not done:
        state_index = state_to_index(state)
        action = np.argmax(Q[state_index])  # Choose the best action
        state = take_action(state, action)
        trajectory.append(state)
        done = state == goal_state
        total_actions += 1
print(
    "Average number of actions taken to reach the goal:",
    total_actions / len(all_states),
)``````
``````Out [ ]:

Average number of actions taken to reach the goal: 3.0``````

Is this optimal? An optimal policy always moves the agent one step closer to the goal, so the minimum number of actions from a state is its Manhattan distance to the goal. From the farthest state, (0, 0), this is 6 actions; the next 2 states need 5 actions, the next 3 need 4, the next 4 need 3, the next 3 need 2, the next 2 need 1, and the goal state itself needs 0. By simple arithmetic we have

6 + 2*5 + 3*4 + 4*3 + 3*2 + 2*1 = 48

Total states = 16

Therefore, the optimal average is 48/16 = 3, which is exactly our average number of actions.
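
The same figure can be checked programmatically (an illustrative check, not in the original notebook) by averaging the Manhattan distance to the goal over all states:

``````In [ ]:

# Average Manhattan distance from each state to the goal; should print 3.0
optimal_average = sum(
    abs(goal_state[0] - i) + abs(goal_state[1] - j) for (i, j) in all_states
) / len(all_states)
print(optimal_average)``````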

## 5. Deploy to a DataRobot REST API endpoint

``````In [ ]:

import pickle

import datarobot as dr
import numpy as np
import pandas as pd``````
``````In [ ]:

import os

os.makedirs("./storage/deploy/", exist_ok=True)
# Save the Q-table to a pickle file
with open("./storage/deploy/q_table.pkl", "wb") as f:
    pickle.dump(Q, f)``````
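
As a quick sanity check (illustrative, not in the original notebook), the artifact can be read back to confirm the round trip:

``````In [ ]:

# Reload the pickled Q-table and confirm it matches the in-memory array
with open("./storage/deploy/q_table.pkl", "rb") as f:
    Q_loaded = pickle.load(f)
assert np.array_equal(Q, Q_loaded)``````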

### Connect to DataRobot

Read more about different options for connecting to DataRobot from the client.

``````In [ ]:

dr_client = dr.Client()
``````

Define hooks for deploying an unstructured custom model. One could use a standard custom deployment, but an unstructured model illustrates the flexibility needed for more complex RL problems.

``````In [ ]:

def load_model(input_dir):
    """Custom model hook for loading the Q-table.

    Make sure to execute the cell earlier in the notebook that creates the
    Q-table before deploying.
    """
    import pickle

    with open(input_dir + "/storage/deploy/" + "q_table.pkl", "rb") as f:
        Q = pickle.load(f)
    return Q


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for returning an action.

    model: The output of load_model is passed to this object
    data: str
        Expects a JSON string passed in the request body.
        Required keys:
            state: tuple(int, int) .. Current state of the agent
    query: None
        Unused
    **kwargs: dict
        Unused

    Returns:
        JSON string with the output action
    """
    import json

    import numpy as np

    Q = model
    grid_size = int(np.sqrt(len(Q)))  # Grid size is inferred from the Q-table

    # Helper function
    def state_to_index(state):
        return state[0] * grid_size + state[1]

    # Parse the request body and look up the best action for the given state
    data_dict = json.loads(data)
    state = data_dict["state"]

    state_index = state_to_index(state)
    action = np.argmax(Q[state_index])

    return json.dumps({"action": action}, default=int)``````

Test out the prediction structure prior to deployment.

``````In [ ]:

import json

score_unstructured(
    Q,
    json.dumps({"state": (0, 1)}),
    None,
)``````
``````Out [ ]:

'{"action": 1}'
``````

Deploy the RL policy model. We will use the `drx.deploy` convenience method, which:

- Builds a new Custom Model Environment
  - You can also use a DataRobot Python Drop-in Environment (e.g. “6386dc1159c606b0d8beddc7”)
- Assembles a new Custom Model with the provided hooks
- Deploys an Unstructured Custom Model to your Deployments
- Returns an object which can be used to make predictions

Use `environment_id` to re-use an existing Custom Model Environment that you’re happy with for shorter iteration cycles on the custom model hooks.

Note: See https://app.datarobot.com/docs/api/api-quickstart/index.html for instructions on setting up a drconfig.yaml, or call drx.Context() to initialize your credentials.

``````In [ ]:

import datarobotx as drx

drx.Context().endpoint = dr_client.endpoint
drx.Context().token = dr_client.token``````
``````In [ ]:

deployment = drx.deploy(
    "storage/deploy/",
    extra_requirements=[],
    # environment_id="6386dc1159c606b0d8beddc7",
)``````
``````Out [ ]:

# Deploying custom model
- Unable to auto-detect model type; any provided paths and files will be
exported - dependencies should be explicitly specified using
`extra_requirements` or `environment_id`
- Preparing model and environment...
- Configured environment [[Custom]
priceless-ganguly](https://app.datarobot.com/model-registry/custom-environments/65ac4115be769b7f85d5aaf9)
with requirements:
python 3.9.16
datarobot-drum==1.10.14
datarobot-mlops==9.2.8
cloudpickle==2.2.1
- Awaiting custom environment build...``````
``````Out [ ]:

100%|███████████████████████████| 11.0k/11.0k [00:00<00:00, 5.14MB/s]
- Registered custom model
with target type: Unstructured
- Creating and deploying model package...``````
``````Out [ ]:

- Created deployment
# Custom model deployment complete``````

Let’s try out our deployment and track the trajectory from the deployed policy (each call returns an action).

``````In [ ]:

# If your deployment already occurred or your notebook restarted due to inactivity, get the ID from the URL in the UI
# deployment = drx.Deployment("YOUR DEPLOYMENT ID HERE")
deployment.predict_unstructured({"state": (0, 1)})``````
``````Out [ ]:

# Making predictions
- Making predictions with deployment
# Predictions complete
{'action': 1}``````

Test and print trajectory.

``````In [ ]:

state = (0, 1)
goal_state = (3, 3)

print("Initial state:", state)
trajectory = [state]
done = state == goal_state
while not done:
    action = deployment.predict_unstructured({"state": state})["action"]
    state = take_action(state, action)
    trajectory.append(state)
    done = state == goal_state

print(trajectory)``````
``````Out [ ]:

Initial state: (0, 1)
# Making predictions
- Making predictions with deployment
# Predictions complete
# Making predictions
- Making predictions with deployment
# Predictions complete
# Making predictions
- Making predictions with deployment
# Predictions complete
# Making predictions
- Making predictions with deployment
# Predictions complete
# Making predictions
- Making predictions with deployment