-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds carflag-v0 env #34
base: master
Are you sure you want to change the base?
Conversation
Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its velocity at each timestep, and the goal flag's location when it reaches the oracle flag. The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain current velocity), or accelerate right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Just a few more requests:
- Consider adding easy/medium/hard versions if you would like this to be future proof! You can probably just increase the heaven/hell positions or decrease velocities to make it harder, but you are the expert.
- Please add this to the list of environments. That way, the users can find it. I also have some unit tests that will use this list of envs to ensure that nobody in the future introduces code that breaks your env.
- Consider writing a unit test, just to ensure the environment behaves exactly as expected. RL algorithms are very good at exploiting even the smallest bugs. You can validate everything works as expected by running
pytest tests/
. I think this will already bring up one bug (hint: compare the type signature for thisreset
function to theMineSweeper
one.) - The
carflag
reward is determined to be 1.0 per timestep in the paper. But inpopgym
, the other envs have a max cumulative reward of 1.0 and a min of -1.0. This is so you can train on 50 episodes ofEnv1
and 50 episodes ofEnv2
using the same learning rate. Not sure what the correct answer is here, but I figured I would let you know just in case.
popgym/envs/carflag.py
Outdated
|
||
return self.state, env_reward, done, {"is_success": env_reward > 0} | ||
|
||
def reset(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reset
function should reset using a seed. This ensures we can reproduce trajectories exactly, which is useful when comparing policies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, thank you for the pointer. I have updated it now.
The easy level has the range [-1, 1]. The medium level has the range [-3, 3]. The hard level has the range [-5, 5].
Thank you for the detailed comments and support. I have added the easy, medium, and hard levels and updated the list of environments accordingly. While writing the tests, I noticed a bug in the DTQN codebase. The To address this, I observed that the |
Yeah this is tricky. The minimum time required would be the time to reach maximum velocity, drive to the oracle, slow down to 0 velocity, accelerate to max velocity in the direction of the goal, then drive to the goal. Also, we would probably want to be in "heaven" or "hell" for more than a single timestep to help with credit assignment. Do you know how many timesteps they used in the DTQN paper? If so, maybe we can just scale that. For example, maybe the standard/easy mode would take a minimum of 20 timesteps to reach the oracle and then heaven, and DTQN gives the agent 40 timesteps, so 20 timesteps to remain in "heaven". If the harder mode takes 60 timesteps to reach oracle and heaven, then we can set the episodes to be 120 timesteps long. These are just made up numbers. |
The DTQN paper has set the |
Would it be better to set an arbitrary value or one based on the environment conditions? |
Ideally, I'd say it should be set based on environment conditions. How many timesteps would it take an optimal policy to reach the oracle and then heaven? |
Hey, I trained the DTQN policy and it took |
Oh if you have a working DTQN implementation it's even easier! How many timesteps does it take on medium and hard? |
85-137 for easy, 125-175 for medium and 165-210 for hard. |
How does |
Sounds good 👍 |
Hey @smorad, I am getting this error in the testcases but unable to figure out what's wrong: =================================================================== short test summary info ===================================================================
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.4428458, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.3615731, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([1.020816, 0. , 0. ], dtype=float32), 0...
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [ 0.32249273 -0.0015 -1. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [ 0.73226799 -0.0015 0. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data [-0.7669738 0.0015 0. ]
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.10640713, 0. , 0. ]), 0.5, 1.0, ...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-0.65530812, -0.0015 , 0. ]), 0.5, -1...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.25203142, 0. , 0. ]), 0.5, -1.0,...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagEasy] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.50059146, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagMedium] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([-1.0373735, 0. , 0. ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagHard] - ValueError: space Box([-1.1 -0.07 -1. ], [1.1 0.07 1. ], (3,), float32) does not contain data (array([0.86538035, 0. , 0. ], dtype=float...
=================================================== 12 failed, 441 passed, 6 skipped, 417 warnings in 5.21s =================================================== I tried to |
Ah I think this might be related to our Markovian state space requirement. So with POMDPs, it is useful to have the state available for debugging. I think you might need to add the I think the issue is that |
Thank you for the advice. I’ll look into the dimensions of |
Changing the get_state to return just the observation (of size 3) also gives the same error. def get_state(self):
# Return the position of the car, oracle, and goal
return self.state |
yay it passes the tests now! |
Great work! Your code is good. I think I have just one more recommendation: To be in-line with the other popgym environments, I think it would be best if the episodic reward (non discounted return) was always between -1.0, 1.0. Maybe you could rescale the reward to be max_reward = self.max_steps - self.heaven_position / self.max_velocity
reward = reward / max_reward As for tests, I would suggest testing that the episode ends after you reach the boundary, and that the episode return is always bounded in [-1, 1]. But it's up to you. |
Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its velocity at each timestep, and the goal flag's location when it reaches the oracle flag. The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain current velocity), or accelerate right.