2023-03-14
SEAGuL: Sample Efficient Adversarially Guided Learning of Value Functions
http://proceedings.mlr.press/v144/landry21a/landry21a.pdf
trajectory optimization -> value approximation -> short-horizon MPC.
so purely supervised, no HJB solution attempt.
But they generate adversarial samples to be more sample efficient. the
inner optimisation consists of finding a large approximation error between
the value functions given by the neural network and by trajectory
optimisation. It is approximated by a small number of projected gradient
steps, differentiating through the trajectory optimisation. they say in
practice 1 or 2 steps are enough.
show some examples where they decrease the number of needed samples to
about half of the conventional sampling approach, with better approximation
performance too. pretty nice.
Synthesis of Lyapunov Functions using Formal Verification
https://arxiv.org/pdf/2112.01835.pdf
synthesise lyapunov functions for autonomous* systems. algorithm is a loop
alternating between synthesis of a candidate lyapunov function, and
verification by finding the worst counterexample with a fancy SMT solver.
looks like they only investigate relatively simple linearly parameterised
lyapunov functions.
*in the systems theory sense, no input.
Is L2 Physics-Informed Loss Always Suitable for Training Physics-Informed
Neural Network?
https://arxiv.org/pdf/2206.02016.pdf
challenge the 'de facto standard' of training PINNs with L2 physics loss.
small L2 loss can only provide weak approximation guarantees, so the
authors advocate for an L-infty loss.
they write lots of theory I don't really understand. conduct experiments
with an inner training loop that generates adversarial samples - points
with high PDE and boundary condition residuals. quite thorough and seems
practically relevant. let's do this :)
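roughly what I imagine the two loss variants look like (my sketch, not
their code), for a generic residual evaluated at a batch of collocation
points:

import jax
import jax.numpy as jnp

def residual(params, x):
    # placeholder PDE residual at point x for a candidate solution; for HJB
    # this would be something like min_u [l(x,u) + dV/dx . f(x,u)]
    return params["a"] * jnp.sum(x ** 2) - 1.0

def l2_physics_loss(params, xs):
    res = jax.vmap(lambda x: residual(params, x))(xs)
    return jnp.mean(res ** 2)

def linf_physics_loss(params, xs):
    # the L-infty variant: only the worst residual in the batch matters
    res = jax.vmap(lambda x: residual(params, x))(xs)
    return jnp.max(jnp.abs(res))

xs = jax.random.uniform(jax.random.PRNGKey(0), (128, 2), minval=-1.0, maxval=1.0)
print(l2_physics_loss({"a": 0.5}, xs), linf_physics_loss({"a": 0.5}, xs))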
Safe Control with Learned Certificates: A Survey of Neural Lyapunov,
Barrier, and Contraction Methods for Robotics and Control
https://arxiv.org/pdf/2202.11762.pdf
cool paper by chuchu fan et al. overview of neural verification methods.
mainly talk about verifying an existing control law. there is one example
where they find a control barrier function (i think from scratch), so that
would amount to design of the control law too.
HJB Optimal Feedback Control with Deep Differential Value Functions and
Action Constraints
https://arxiv.org/pdf/1909.06153.pdf
one of the few papers where they appear to actually solve the HJB in pinn
style. main innovation is that they sweep the discount from short- to
far-sighted, which apparently helps convergence to the right HJB solution.
introduce a 'differential neural network', an NN that also has the jacobian
of the (main) output w.r.t. the inputs as a separate output -- something
that seems trivial in jax..?
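indeed, in jax the extra jacobian output is basically free, e.g. (my
sketch, with a scalar value network):

import jax
import jax.numpy as jnp

def value(params, x):
    # small MLP returning a scalar value estimate
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

# 'differential network': value and its gradient w.r.t. the input in one call
value_and_gradx = jax.value_and_grad(value, argnums=1)

key = jax.random.PRNGKey(0)
params = {"W1": jax.random.normal(key, (16, 3)),
          "b1": jnp.zeros(16),
          "w2": jax.random.normal(key, (16,))}
v, dvdx = value_and_gradx(params, jnp.array([0.1, -0.2, 0.3]))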
but then they say pretty much nothing about the actual training, no loss
function, no architecture, no convergence plots...
they write a lot about different constraint -> penalty function
reformulations. maybe that part is useful?
Neural Lyapunov Model Predictive Control: Learning Safe Global Controllers
from Sub-optimal Examples
http://export.arxiv.org/pdf/2002.10451
do a one-step MPC planning thing.
introduce a nice parameterisation of lyapunov fct by NN:
V(x) = x.T (lI + NN(x).T NN(x)) x
where NN(x) is a matrix of appropriate size.
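quick jax sketch of that parameterisation (mine, with an arbitrary
single-layer NN(x)); positive definiteness comes for free since
x^T (lI + N^T N) x = l ||x||^2 + ||N x||^2:

import jax
import jax.numpy as jnp

def nn_matrix(params, x, rows=4, cols=2):
    # any NN mapping x to a (rows, cols) matrix; here one linear layer
    return (params["W"] @ x + params["b"]).reshape(rows, cols)

def lyapunov(params, x, l=1e-2):
    # V(x) = x^T (l*I + NN(x)^T NN(x)) x
    N = nn_matrix(params, x)
    P = l * jnp.eye(x.shape[0]) + N.T @ N
    return x @ P @ x

key = jax.random.PRNGKey(0)
params = {"W": jax.random.normal(key, (8, 2)), "b": jnp.zeros(8)}
print(lyapunov(params, jnp.array([0.5, -1.0])))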
lyapunov NN is trained with a dataset of transition tuples (x, u, x+) (so,
discrete time basically) and a rather complicated loss function that I
don't really understand.
Data-driven initialization of deep learning solvers for Hamilton-Jacobi-Bellman PDEs
https://arxiv.org/pdf/2207.09299.pdf
combines supervised learning and PINN style residual minimization. the data
for the supervised part is generated by an (IMO) sketchy SDRE-style
suboptimal controller, obtaining an upper bound of the true value function
for a collection of states.
generally a nice, ground-up paper with toy applications. one possibility
for me could be following that line and extending it to slightly more
interesting systems.
~~~ stuff recommended by lenart ~~~
{{{
Neural network approach to continuous-time direct adaptive optimal control
for partially unknown nonlinear systems
https://www.sciencedirect.com/sdfe/reader/pii/S0893608009000446/pdf
describes a policy iteration scheme, with relatively exact controls-type
lingo. PI goes like this: we have a policy μ and associated value function
V, and we update μ by alternating these two steps:
- policy evaluation: let V^μ(x(t)) = \int_t^{t+T} r(x(τ), μ(x(τ))) dτ + V^μ(x(t+T))
- policy improvement: μ+(x) = argmin_u [the hamiltonian]
i.e. the optimal control w.r.t. the surrogate value function V^μ.
they prove that this is equivalent to solving the HJB equation. the upside
is that we do not need knowledge of f(x) (although we do need g(x) for
minimising the hamiltonian). so indeed the systems can only be partially
unknown as said in the title.
how is this initialized? we need a value function for μ0. maybe in proof,
references right after eq. 13.
apparently the policies all stay admissible if the initial one is
admissible, where admissible (roughly) means stabilising with finite cost.
Is this basically an (approximate, MC, time-discretized) way of solving the
time dependent HJB eq. backwards in time? (with the dynamics replaced by
simulation results...)
then they prove some results about how good the convergence is and stuff.
in the end they demonstrate with a couple of simple examples: 2d systems
and linearly parameterised value function, whose optimal parameters are
also known in closed form.
Value Iteration in Continuous Actions, States and Time
https://arxiv.org/pdf/2105.04682.pdf
the paper with the nice grid world illustration in the top right.
central approach is a value iteration loop adapted to continuous time.
basically this, more details in algo 1, but it's just another instance of ADP
as far as I can tell:
- compute value function targets:
- simulate the system for some ensemble of states x
(with optimal policy arising from last value estimate)
- evaluate the returns as: integral of return seen over simulated time
+ value function estimate at end of sim time, with appropriate discount
- fit value function:
- just a supervised learning problem: fit NN to (x, V) pairs with Lp loss
seems very simple and approachable. maybe this is the way??? based on
simulation, also with basically f (dynamics) provided by sim and g known.
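a minimal sketch of that loop as I read it (toy pendulum-ish system, Euler
integration, quadratic control cost so the greedy control has the closed
form u* = -0.5 g(x)^T dV/dx; everything here is my own stand-in, not their
code):

import jax
import jax.numpy as jnp

# toy control-affine system xdot = f(x) + g(x) u with running cost x.x + u^2
f = lambda x: jnp.array([x[1], -jnp.sin(x[0])])
g = lambda x: jnp.array([0.0, 1.0])
cost = lambda x, u: x @ x + u ** 2

def value(params, x):
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

def greedy_u(params, x):
    # minimiser of the hamiltonian for the quadratic control cost
    return -0.5 * g(x) @ jax.grad(value, argnums=1)(params, x)

def rollout_target(params, x, dt=0.02, horizon=25, discount=0.99):
    # value target = discounted running cost integrated along a short
    # closed-loop rollout + discounted value estimate at the final state
    ret = 0.0
    for k in range(horizon):
        u = greedy_u(params, x)
        ret = ret + discount ** k * cost(x, u) * dt
        x = x + dt * (f(x) + g(x) * u)
    return ret + discount ** horizon * value(params, x)

def fit_step(params, xs, lr=1e-2):
    # the supervised part: fit the NN to (x, V target) pairs with an L2 loss
    targets = jax.vmap(lambda x: rollout_target(params, x))(xs)
    loss = lambda p: jnp.mean(
        (jax.vmap(lambda x: value(p, x))(xs) - targets) ** 2)
    grads = jax.grad(loss)(params)
    return jax.tree_util.tree_map(lambda p, dp: p - lr * dp, params, grads)

key = jax.random.PRNGKey(0)
params = {"W1": 0.1 * jax.random.normal(key, (32, 2)),
          "b1": jnp.zeros(32),
          "w2": 0.1 * jax.random.normal(key, (32,))}
xs = jax.random.uniform(key, (64, 2), minval=-1.0, maxval=1.0)
params = fit_step(params, xs)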
Continuous-Time Model-Based Reinforcement Learning
https://arxiv.org/pdf/2102.04764.pdf
go over some fundamental challenges in converting existing RL methods to
continuous time instead of time discretising the dynamics as usual.
estimate a model in the form of a distribution over ODE right-hand sides...
actor-critic style algorithm. critic = estimated state->value function,
actor = state -> action function. some more theory reading definitely
required. also, they have a thing called imaginary states and actions,
which I completely failed to understand....
}}}
SympOCNet: Solving Optimal Control Problems with Applications to
High-dimensional Multi-agent Path Planning Problems
https://arxiv.org/pdf/2201.05475.pdf
don't get what they do. but they have impressive results - 500d path
planning/collision avoidance problem solved in 1.5h on one gpu??? what?
section 3:
- introduce the 'hamiltonian ODE' system which looks exactly like
Pontryagin's minimum principle.
- use the SympNet (previously introduced NN architecture that represents
symplectic maps) to transform state+adjoint space, then solve hamiltonian
ODE in that new space which somehow is simpler? explicitly, transform:
(y, q) = φ(x, p)
where φ is that symplectic map. so we have some new 'hamiltonian ODE' in
the transformed space.
dy/ds = ∇_q \tilde H (y(s), q(s))
dq/ds = -∇_y \tilde H (y(s), q(s))
then they assume that the new hamiltonian, \tilde H, is independent of y,
so it simplifies to:
dy/ds = ∇_q \tilde H (q(s))
dq/ds = 0
which obviously has solutions with constant q and affine y. whaaaaat?
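(explicitly, if I read it right: q(s) = q(0) and
y(s) = y(0) + s ∇_q \tilde H(q(0)), i.e. straight-line trajectories in the
transformed coordinates.)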
- a bit later they say, to learn the symplectic net φ they use the PINN
loss of the hamiltonian ODEs. that I don't really understand. basically
it looks like
section 4:
- the version with state constraints is basically the normal, unconstrained
version + some different penalty functions which they compare.
the idea starts to creep up on me that they use all of this to solve JUST
ONE optimal control problem, instead of every possible one, so they only
have local optimality of the trajectory, only one initial condition, etc!
that also explains the egregious 500-dim claim, otherwise they would have
to somehow cover a continuous space with something like
2^500 ≈ 3.27 · 10^150 grid cells (at only two cells per dimension).
and also they talk about being maybe possibly in the future able to solve
(simpler kinds of) OCPs in real time, which would be irrelevant if they
had an HJB solution for all x0. conclusion:
HE!! PULLED!! A!! FAST!! ONE!! ON!! HIM!!!!!!
Adaptive Deep Learning for high-dimensional
Hamilton–Jacobi–Bellman Equations
https://epubs.siam.org/doi/pdf/10.1137/19M1288802
impressive-looking paper at first glance. neat and clean. they solve a
rigid-body stabilization OCP (training the value NN) in only three minutes.
extensive review section, be sure to check out some of their sources.
also talk a lot about optimal control basics, HJB, pontryagin stuff, maybe
too much?
in '2.2 causality-free data generation' they have a practical trick for
finding optimal trajectories with the pontryagin principle, a kind of
homotopy method where they solve the problem for a short time horizon
first, which is then successively stepped up until the desired OCP is
reached.
really interesting:
computing many such solutions becomes expensive, which means that generating
the large data sets necessary to train an NN can be difficult. With this in mind, we use
the time-marching trick only to generate a small initial data set, and adaptively add
more points during training. The key to doing this efficiently is simulating the system
dynamics using the partially trained NN to close the loop. The closed-loop trajectory
and predicted costate provide good guesses for the optimal state and costate, so that
we can immediately solve (2.11) for all of [0, tf ]. Besides being more computationally
efficient than time-marching, this approach also requires no parameter tuning. Details
are presented in section 4, and numerical comparisons between this method
and time-marching are given in subsections 5.2 and 6.2.
seems like a really nice and efficient way to alleviate the pain of
nonlinear trajectory optimization.
in general they have a lot of thoughts about data generation, learning in
the small data regime, fitting value functions, and an especially long
section about validation.
show in results that it is critical to include value gradient in the loss
formulation, instead of just value values (heh). with a possible PINN
approach we will have to do this anyway -- but interesting to know that in
a more supervised-style setting it is still quite useful, and the gradients
should be quite easy to get with the KKT matrix, autodiff, implicit
function theorem, whatever.
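the value-plus-gradient loss is easy to write down; something like this jax
sketch (my own formulation, with the costates from the PMP solutions as
gradient targets and a weight lam that I made up):

import jax
import jax.numpy as jnp

def value(params, x):
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

def value_and_grad_loss(params, xs, v_targets, costate_targets, lam=1.0):
    # fit both the value and its gradient (the costate along the optimal
    # trajectories), not just the value itself
    v_pred = jax.vmap(lambda x: value(params, x))(xs)
    dv_pred = jax.vmap(lambda x: jax.grad(value, argnums=1)(params, x))(xs)
    value_term = jnp.mean((v_pred - v_targets) ** 2)
    grad_term = jnp.mean(jnp.sum((dv_pred - costate_targets) ** 2, axis=-1))
    return value_term + lam * grad_term

key = jax.random.PRNGKey(0)
params = {"W1": jax.random.normal(key, (16, 3)),
          "b1": jnp.zeros(16),
          "w2": jax.random.normal(key, (16,))}
xs = jax.random.normal(key, (32, 3))
print(value_and_grad_loss(params, xs, jnp.zeros(32), jnp.zeros((32, 3))))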
criticism:
- a poor man's form of uncertainty sampling, basically plugging in the
  surrogate uncertainty = || grad_x V^NN(x, t) ||
- reliance on nonlinear/nonconvex optimization problems for initialisation
  probably removes all hope of global optimality in nonconvex control
  problems
A neural network approach for stochastic optimal control
https://arxiv.org/pdf/2209.13104.pdf
introduce a quite general approach to the problem. formulate a tunable
loss, where different tunings result in quite different methods.
somehow the stochastic version has smooth optimal value functions. would
that make something easier for me?
(i think) their central ingredient is also finding new optimal
trajectories by PMP, with initial costates estimated by the approximate
value function.
Deep Learning Approximation for Stochastic Control Problems
https://arxiv.org/pdf/1611.07422v1.pdf
They solve discrete-time, finite-horizon, stochastic optimal control
problems. basically they write one big fat equation which applies an NN
controller to each state, computes the next state from that action and a
sample of the stochastic component, and applies the next NN controller
(time varying!). The cost is accumulated, and they do direct gradient
descent, minimising the expected cost w.r.t. all controller NN weights,
with an MC-approximated expectation over the stochastic components.
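as far as I can tell the whole thing is just 'differentiate through the
stochastic rollout', something like this jax sketch (toy dynamics and cost,
all mine):

import jax
import jax.numpy as jnp

T, state_dim, act_dim = 10, 2, 1

def controller(params_t, x):
    # one small NN per time step (time-varying policy)
    return jnp.tanh(params_t["W"] @ x + params_t["b"])

def rollout_cost(params, x0, noise):
    # apply the controllers forward through the stochastic dynamics and
    # accumulate the trajectory cost
    x, total = x0, 0.0
    for t in range(T):
        u = controller(params[t], x)
        total = total + x @ x + u @ u
        x = x + 0.1 * (jnp.array([x[1], u[0]]) + noise[t])  # toy dynamics
    return total + x @ x

def expected_cost(params, x0, noises):
    # MC approximation of the expectation over the stochastic component
    return jnp.mean(jax.vmap(lambda n: rollout_cost(params, x0, n))(noises))

key = jax.random.PRNGKey(0)
params = [{"W": 0.1 * jax.random.normal(key, (act_dim, state_dim)),
           "b": jnp.zeros(act_dim)} for _ in range(T)]
noises = 0.05 * jax.random.normal(key, (256, T, state_dim))
grads = jax.grad(expected_cost)(params, jnp.array([1.0, 0.0]), noises)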
they have quite a limited theory discussion and basically no correctness
results.
continuous-time versions would basically be neural ODEs right? ofc we can
do that, but also who knows about convergence guarantees. probably it works
alright though..
also watched this talk:
Wei Kang: "Data Development and Deep Learning for HJB Equations"
https://www.youtube.com/watch?v=ABY-Wo6w77I&list=PLHyI3Fbmv0ScvcPAT9IK6YPEZGnj-6Sqk&index=24
mentions a lot of the interesting papers summarised above and collected on
semantic scholar in the "VERY interesting" section. especially in the
discussion afterwards there are some interesting thoughts on global vs
local optimality. no concrete research mentioned though. could this be my
niche???