Added SACD implementation. #190

Merged: 18 commits into main from alg/SAC-D on Aug 13, 2024
Conversation

@PKWadsy (Contributor) commented Aug 7, 2024:

Tested on Cart-Pole. Attains maximum reward.

Based on this paper.

alpha_lr: float,
device: torch.device,
):
self.type = "discrete_policy"
Member:
How does this get handled by gym_environment? Is there a respective update? This should just stay as "policy", no?

Contributor Author (PKWadsy):

The normalization in the gym environment was breaking things, so I wanted to make a loop without normalization. Sorry about the horrible naming.

Member:

How is it breaking things? It shouldn't break anything.

@SamBoasman (Contributor) commented Aug 9, 2024:

In the discrete setting, select_action_from_policy() and sample_action() both return a tensor containing the integer index of the action to enact. However, the action returned from sample_action() is normalised, while the action returned from select_action_from_policy() is denormalised. To make this work, we would need to configure some redundant reversal of the normalisation/denormalisation for one of these functions so that run_action_on_emulator() receives only one action format (normalised or original) instead of both.
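For illustration, a minimal sketch of one way the loop could branch on the policy type so that run_action_on_emulator() only ever sees a single action format. The function signature, denormalise_action(), and the use of agent.type here are assumptions for the sketch, not the repository's actual API:

```python
# Hypothetical sketch only: denormalise_action() and the exact signatures are
# assumptions; select_action_from_policy() and the "discrete_policy" type come
# from the diff above.
def act(agent, env, state):
    action = agent.select_action_from_policy(state)

    if getattr(agent, "type", "policy") == "discrete_policy":
        # Discrete actions are integer indices, so no (de)normalisation applies;
        # hand the index straight to run_action_on_emulator().
        return action

    # Continuous actions come out of the policy normalised to [-1, 1] and are
    # mapped back to the environment's action bounds before being enacted.
    return env.denormalise_action(action)
```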

):
super().__init__()
if hidden_size is None:
hidden_size = [256, 256]
Member:

Double-check the default hidden layer size.

if hidden_size is None:
hidden_size = [256, 256]
if log_std_bounds is None:
log_std_bounds = [-20, 2]
Member:

Double-check the log_std_bounds default.

Member:

You can remove log_std_bounds; it isn't even used here.
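For context, a minimal sketch (not the repository's code; the class and attribute names are assumptions) of why the bounds only matter in the continuous case: a continuous SAC actor predicts a mean and a log standard deviation that gets clamped to log_std_bounds, whereas a discrete SACD actor just outputs a categorical distribution over the actions, so there is no log-std to clamp:

```python
import torch
from torch import nn

# Illustrative sketch only; class and layer names are assumptions.
class DiscreteActor(nn.Module):
    def __init__(self, observation_size: int, num_actions: int, hidden_size=None):
        super().__init__()
        if hidden_size is None:
            hidden_size = [256, 256]
        self.net = nn.Sequential(
            nn.Linear(observation_size, hidden_size[0]),
            nn.ReLU(),
            nn.Linear(hidden_size[0], hidden_size[1]),
            nn.ReLU(),
            nn.Linear(hidden_size[1], num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Categorical probabilities over the discrete actions; there is no
        # mean/log_std head, which is why log_std_bounds has no role here.
        return torch.softmax(self.net(state), dim=-1)
```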

dist = torch.distributions.Categorical(action_probs)
action = dist.sample()
# Offset any values which are zero by a small amount so no nan nonsense
zero_offset = action_probs == 0.0
Member:

To avoid confusion, just throw parentheses around (action_probs == 0.0).
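For illustration, a sketch of what the parenthesised version might look like, and why the offset matters: any zero probability would give log(0) = -inf when taking log-probabilities for the expectation over actions. Variable names mirror the snippet above; the helper name, the 1e-8 constant, and the log step are assumptions, since the diff context cuts off before them:

```python
import torch

# Sketch only: clamp zero probabilities before taking the log so the
# expectation over actions stays finite.
def sample_with_log_probs(action_probs: torch.Tensor):
    dist = torch.distributions.Categorical(action_probs)
    action = dist.sample()

    # Parenthesised comparison, as suggested, to make the boolean mask explicit.
    zero_offset = (action_probs == 0.0).float() * 1e-8
    log_action_probs = torch.log(action_probs + zero_offset)

    return action, action_probs, log_action_probs
```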

):
super().__init__()
if hidden_size is None:
hidden_size = [256, 256]
Member:

Double-check the hidden layer size defaults.

Contributor Author (PKWadsy):

Where can we get this info? The paper?

Member:

The paper, or the code they provided.

Comment on lines 214 to 220
actor_lr: Optional[float] = 3e-4
critic_lr: Optional[float] = 3e-4
alpha_lr: Optional[float] = 3e-4

gamma: Optional[float] = 0.99
tau: Optional[float] = 0.005
reward_scale: Optional[float] = 1.0
Member:

Double-check the defaults against the paper/code.
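For reference, a sketch of how these defaults could be collected in one place while they are checked against the paper and its reference code. The dataclass name is an assumption; the values are simply those shown in the diff above:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: defaults copied from the diff above, to be verified
# against the SAC-Discrete paper and its reference implementation.
@dataclass
class SACDConfig:
    actor_lr: Optional[float] = 3e-4
    critic_lr: Optional[float] = 3e-4
    alpha_lr: Optional[float] = 3e-4

    gamma: Optional[float] = 0.99
    tau: Optional[float] = 0.005
    reward_scale: Optional[float] = 1.0
```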


import torch
from torch import nn

from cares_reinforcement_learning.util.common import SquashedNormal
Member:

You can also remove the SquashedNormal import here.

@beardyFace merged commit 15de136 into main on Aug 13, 2024 (4 checks passed).
@beardyFace deleted the alg/SAC-D branch on August 13, 2024 at 03:24.
3 participants