Added SACD implementation. #190
Conversation
    alpha_lr: float,
    device: torch.device,
):
    self.type = "discrete_policy"
How does this get handled by gym_environment? Is there a corresponding update? Shouldn't this just stay as "policy"?
The normalization in the gym environment was breaking things, so I wanted to make a loop without normalization. Sorry about the horrible naming.
how is it breaking things? it shouldn't break anything
In the discrete setting, select_action_from_policy() and sample_action() return a tensor containing the integer index of the action to enact. However, the action returned from sample_action() is normalised, while the action returned from select_action_from_policy() is denormalised. To make this work, we would need to configure some sort of redundant reversal of normalisation/denormalisation for one of these functions so that run_action_on_emulator() receives only the normalised or original action format instead of both.
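As a minimal sketch of one possible resolution, a hypothetical helper (not the actual gym_environment API; the name and signature are assumptions) could make the de/normalisation step an identity whenever the policy is discrete, so run_action_on_emulator() always receives the raw integer index:

def denormalise_action(action, policy_type, min_action=-1.0, max_action=1.0):
    # Hypothetical helper illustrating the idea above: discrete actions are
    # integer indices, so normalising/denormalising them is a no-op and the
    # action is passed straight through to the emulator.
    if policy_type == "discrete_policy":
        return action
    # Continuous case: rescale from the normalised [-1, 1] range back to the
    # environment's [min_action, max_action] range.
    return min_action + (action + 1.0) * 0.5 * (max_action - min_action)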
):
    super().__init__()
    if hidden_size is None:
        hidden_size = [256, 256]
double check default hidden layer size
if hidden_size is None:
    hidden_size = [256, 256]
if log_std_bounds is None:
    log_std_bounds = [-20, 2]
double check log_std_bounds default
Can remove log_std_bounds; it's not even used here.
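For illustration, here is a rough sketch of what the discrete actor could look like once log_std_bounds is dropped; the class and argument names are assumptions for this sketch, not this PR's actual code:

import torch
from torch import nn

class DiscreteActorHead(nn.Module):
    # Sketch of a discrete actor: it outputs a categorical distribution over
    # actions, so the Gaussian log_std_bounds parameter has no role here.
    def __init__(self, observation_size: int, num_actions: int, hidden_size=None):
        super().__init__()
        if hidden_size is None:
            hidden_size = [256, 256]
        self.net = nn.Sequential(
            nn.Linear(observation_size, hidden_size[0]),
            nn.ReLU(),
            nn.Linear(hidden_size[0], hidden_size[1]),
            nn.ReLU(),
            nn.Linear(hidden_size[1], num_actions),
            nn.Softmax(dim=-1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns action probabilities for a Categorical distribution.
        return self.net(state)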
dist = torch.distributions.Categorical(action_probs)
action = dist.sample()
# Offset any values which are zero by a small amount so no nan nonsense
zero_offset = action_probs == 0.0
To avoid confusion, just put parentheses around the comparison: (action_probs == 0.0).
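For context, this zero-offset trick commonly appears in SAC-Discrete implementations roughly as below; this is an illustrative sketch with the parentheses added as suggested, not necessarily the PR's exact code:

import torch

def sample_discrete_action(action_probs: torch.Tensor):
    # Sample an action index and compute numerically safe log-probabilities.
    dist = torch.distributions.Categorical(action_probs)
    action = dist.sample()
    # Probabilities that are exactly zero would give log(0) = -inf, so add a
    # tiny offset only where the probability is zero before taking the log.
    zero_offset = (action_probs == 0.0).float() * 1e-8
    log_action_probs = torch.log(action_probs + zero_offset)
    return action, action_probs, log_action_probs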
):
    super().__init__()
    if hidden_size is None:
        hidden_size = [256, 256]
double check hidden layer size defaults
Where can we get this info? The paper?
The paper or the code they provided
actor_lr: Optional[float] = 3e-4
critic_lr: Optional[float] = 3e-4
alpha_lr: Optional[float] = 3e-4

gamma: Optional[float] = 0.99
tau: Optional[float] = 0.005
reward_scale: Optional[float] = 1.0
double check defaults from paper/code
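For reference while checking, here are the defaults visible in this diff gathered in one place; this is just a sketch mirroring the snippet above (the class name is illustrative), and the values have not been verified against the paper or reference code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SACDConfig:
    # Defaults as they appear in this diff; to be checked against the
    # SAC-Discrete paper and the authors' reference code.
    actor_lr: Optional[float] = 3e-4
    critic_lr: Optional[float] = 3e-4
    alpha_lr: Optional[float] = 3e-4
    gamma: Optional[float] = 0.99
    tau: Optional[float] = 0.005
    reward_scale: Optional[float] = 1.0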
…learning into alg/SAC-D
import torch
from torch import nn

from cares_reinforcement_learning.util.common import SquashedNormal
Can also remove the SquashedNormal import here.
Tested on Cart-Pole; attains maximum reward. Based on this paper.