2023-03-14
SEAGuL: Sample Efficient Adversarially Guided Learning of Value Functions
http://proceedings.mlr.press/v144/landry21a/landry21a.pdf
trajectory optimization -> value approximation -> short-horizon MPC.
so purely supervised, no HJB solution attempt.
But they generate adversarial samples to be more sample efficient. the
inner optimisation consists of finding a large approximation error between
the value functions given by the neural network and by trajectory
optimisation. It is approximated by a small number of projected gradient
steps, differentiating through the trajectory optimisation. they say in
practice 1 or 2 steps are enough.
show some examples where they decrease the number of needed samples to
about half of the conventional sampling approach, with better approximation
performance too. pretty nice.
Synthesis of Lyapunov Functions using Formal Verification
https://arxiv.org/pdf/2112.01835.pdf
synthesise lyapunov functions for autonomous* systems. algorithm is a loop
alternating between synthesis of a candidate lyapunov function, and
verification by finding the worst counterexample with a fancy SMT solver.
looks like they only investigate relatively simple linearly parameterised
lyapunov functions.
*in the systems theory sense, no input.
Is L2 Physics-Informed Loss Always Suitable for Training Physics-Informed
Neural Network?
https://arxiv.org/pdf/2206.02016.pdf
challenge the 'de facto standard' of training PINNs with L2 physics loss.
small L2 loss can only provide weak approximation guarantees, so the
authors advocate for an L-infty loss.
they write lots of theory I don't really understand. conduct experiments
with an inner training loop that generates adversarial samples - points
with high PDE and boundary condition residuals. quite thorough and seems
practically relevant. let's do this :)
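roughly what I imagine the two loss variants look like (my sketch, not
their code), for a generic residual evaluated at a batch of collocation
points:

import jax
import jax.numpy as jnp

def residual(params, x):
    # placeholder PDE residual at point x for a candidate solution; for HJB
    # this would be something like min_u [l(x,u) + dV/dx . f(x,u)]
    return params["a"] * jnp.sum(x ** 2) - 1.0

def l2_physics_loss(params, xs):
    res = jax.vmap(lambda x: residual(params, x))(xs)
    return jnp.mean(res ** 2)

def linf_physics_loss(params, xs):
    # the L-infty variant: only the worst residual in the batch matters
    res = jax.vmap(lambda x: residual(params, x))(xs)
    return jnp.max(jnp.abs(res))

xs = jax.random.uniform(jax.random.PRNGKey(0), (128, 2), minval=-1.0, maxval=1.0)
print(l2_physics_loss({"a": 0.5}, xs), linf_physics_loss({"a": 0.5}, xs))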
Safe Control with Learned Certificates: A Survey of Neural Lyapunov,
Barrier, and Contraction Methods for Robotics and Control
https://arxiv.org/pdf/2202.11762.pdf
cool paper by chuchu fan et al. overview of neural verification methods.
mainly talk about verifying an existing control law. there is one example
where they find a control barrier function (i think from scratch), so that
would amount to design of the control law too.
HJB Optimal Feedback Control with Deep Differential Value Functions and
Action Constraints
https://arxiv.org/pdf/1909.06153.pdf
one of the few papers where they appear to actually solve the HJB in pinn
style. main innovation is that they sweep the discount from short- to
far-sighted, which apparently helps convergence to the right HJB solution.
introduce a 'differential neural network', an NN that also has the jacobian
of the (main) output w.r.t. the inputs as a separate output -- something
that seems trivial in jax..?
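indeed, in jax the extra jacobian output is basically free, e.g. (my
sketch, with a scalar value network):

import jax
import jax.numpy as jnp

def value(params, x):
    # small MLP returning a scalar value estimate
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

# 'differential network': value and its gradient w.r.t. the input in one call
value_and_gradx = jax.value_and_grad(value, argnums=1)

key = jax.random.PRNGKey(0)
params = {"W1": jax.random.normal(key, (16, 3)),
          "b1": jnp.zeros(16),
          "w2": jax.random.normal(key, (16,))}
v, dvdx = value_and_gradx(params, jnp.array([0.1, -0.2, 0.3]))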
but then they say pretty much nothing about the actual training, no loss
function, no architecture, no convergence plots...
they write a lot about different constraint -> penalty function
reformulations. maybe that part is useful?
Neural Lyapunov Model Predictive Control: Learning Safe Global Controllers
from Sub-optimal Examples
http://export.arxiv.org/pdf/2002.10451
do a one-step MPC planning thing.
introduce a nice parameterisation of lyapunov fct by NN:
V(x) = x.T (lI + NN(x).T NN(x)) x
where NN(x) is a matrix of appropriate size.
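quick jax sketch of that parameterisation (mine, with an arbitrary
single-layer NN(x)); positive definiteness comes for free since
x^T (lI + N^T N) x = l ||x||^2 + ||N x||^2:

import jax
import jax.numpy as jnp

def nn_matrix(params, x, rows=4, cols=2):
    # any NN mapping x to a (rows, cols) matrix; here one linear layer
    return (params["W"] @ x + params["b"]).reshape(rows, cols)

def lyapunov(params, x, l=1e-2):
    # V(x) = x^T (l*I + NN(x)^T NN(x)) x
    N = nn_matrix(params, x)
    P = l * jnp.eye(x.shape[0]) + N.T @ N
    return x @ P @ x

key = jax.random.PRNGKey(0)
params = {"W": jax.random.normal(key, (8, 2)), "b": jnp.zeros(8)}
print(lyapunov(params, jnp.array([0.5, -1.0])))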
lyapunov NN is trained with a dataset of transition tuples (x, u, x+) (so,
discrete time basically) and a rather complicated loss function that I
don't really understand.
Data-driven initialization of deep learning solvers for Hamilton-Jacobi-Bellman PDEs
https://arxiv.org/pdf/2207.09299.pdf
combines supervised learning and PINN style residual minimization. the data
for the supervised part is generated by an (IMO) sketchy SDRE-style
suboptimal controller, obtaining an upper bound of the true value function
for a collection of states.
generally a nice, ground-up paper with toy applications. one possibility
for me could be following that line and extending it to slightly more
interesting systems.
~~~ stuff recommended by lenart ~~~
{{{
Neural network approach to continuous-time direct adaptive optimal control
for partially unknown nonlinear systems
https://www.sciencedirect.com/sdfe/reader/pii/S0893608009000446/pdf
describes a policy iteration scheme, with relatively exact controls-type
lingo. PI goes like this: we have a policy μ and associated value function
V, and we update μ by alternating these two steps:
- policy evaluation: let V^μ(x(t)) = \int_t^{t+T} r(x(τ), μ(x(τ))) dτ + V^μ(x(t+T))
- policy improvement: μ+(x) = argmin_u [the hamiltonian]
i.e. the optimal control w.r.t. the surrogate value function V^μ.
they prove that this is equivalent to solving the HJB equation. the upside
is that we do not need knowledge of f(x) (although we do need g(x) for
minimising the hamiltonian). so indeed the systems can only be partially
unknown as said in the title.
how is this initialized? we need a value function for μ0. maybe in proof,
references right after eq. 13.
apparently the policies all stay admissible if the initial one is
admissible, where admissible (roughly) means stabilising with finite cost.
Is this basically an (approximate, MC, time-discretized) way of solving the
time dependent HJB eq. backwards in time? (with the dynamics replaced by
simulation results...)
then they prove some results about how good the convergence is and stuff.
in the end they demonstrate with a couple of simple examples: 2d systems
and linearly parameterised value function, whose optimal parameters are
also known in closed form.
Value Iteration in Continuous Actions, States and Time
https://arxiv.org/pdf/2105.04682.pdf
the paper with the nice grid world illustration in the top right.
central approach is a value iteration loop adapted to continuous time.
basically this, more details in algo 1, but it's just another instance of ADP
as far as I can tell:
- compute value function targets:
- simulate the system for some ensemble of states x
(with optimal policy arising from last value estimate)
- evaluate the returns as: integral of return seen over simulated time
+ value function estimate at end of sim time, with appropriate discount
- fit value function:
- just a supervised learning problem: fit NN to (x, V) pairs with Lp loss
seems very simple and approachable. maybe this is the way??? based on
simulation, also with basically f (dynamics) provided by sim and g known.
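a minimal sketch of that loop as I read it (toy pendulum-ish system, Euler
integration, quadratic control cost so the greedy control has the closed
form u* = -0.5 g(x)^T dV/dx; everything here is my own stand-in, not their
code):

import jax
import jax.numpy as jnp

# toy control-affine system xdot = f(x) + g(x) u with running cost x.x + u^2
f = lambda x: jnp.array([x[1], -jnp.sin(x[0])])
g = lambda x: jnp.array([0.0, 1.0])
cost = lambda x, u: x @ x + u ** 2

def value(params, x):
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

def greedy_u(params, x):
    # minimiser of the hamiltonian for the quadratic control cost
    return -0.5 * g(x) @ jax.grad(value, argnums=1)(params, x)

def rollout_target(params, x, dt=0.02, horizon=25, discount=0.99):
    # value target = discounted running cost integrated along a short
    # closed-loop rollout + discounted value estimate at the final state
    ret = 0.0
    for k in range(horizon):
        u = greedy_u(params, x)
        ret = ret + discount ** k * cost(x, u) * dt
        x = x + dt * (f(x) + g(x) * u)
    return ret + discount ** horizon * value(params, x)

def fit_step(params, xs, lr=1e-2):
    # the supervised part: fit the NN to (x, V target) pairs with an L2 loss
    targets = jax.vmap(lambda x: rollout_target(params, x))(xs)
    loss = lambda p: jnp.mean(
        (jax.vmap(lambda x: value(p, x))(xs) - targets) ** 2)
    grads = jax.grad(loss)(params)
    return jax.tree_util.tree_map(lambda p, dp: p - lr * dp, params, grads)

key = jax.random.PRNGKey(0)
params = {"W1": 0.1 * jax.random.normal(key, (32, 2)),
          "b1": jnp.zeros(32),
          "w2": 0.1 * jax.random.normal(key, (32,))}
xs = jax.random.uniform(key, (64, 2), minval=-1.0, maxval=1.0)
params = fit_step(params, xs)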
Continuous-Time Model-Based Reinforcement Learning
https://arxiv.org/pdf/2102.04764.pdf
go over some fundamental challenges in converting existing RL methods to
continuous time instead of time discretising the dynamics as usual.
estimate a model in the form of a distribution over ODE right-hand sides...
actor-critic style algorithm. critic = estimated state->value function,
actor = state -> action function. some more theory reading definitely
required. also, they have a thing called imaginary states and actions,
which I completely failed to understand....
}}}
SympOCNet: Solving Optimal Control Problems with Applications to
High-dimensional Multi-agent Path Planning Problems
https://arxiv.org/pdf/2201.05475.pdf
don't get what they do. but they have impressive results - 500d path
planning/collision avoidance problem solved in 1.5h on one gpu??? what?
section 3:
- introduce the 'hamiltonian ODE' system which looks exactly like
Pontryagin's minimum principle.
- use the SympNet (previously introduced NN architecture that represents
symplectic maps) to transform state+adjoint space, then solve hamiltonian
ODE in that new space which somehow is simpler? explicitly, transform:
(y, q) = φ(x, p)
where φ is that symplectic map. so we have some new 'hamiltonian ODE' in
the transformed space.
dy/ds = ∇_q \tilde H (y(s), q(s))
dq/ds = -∇_y \tilde H (y(s), q(s))
then they assume that the new hamiltonian, \tilde H, is independent of y,
so it simplifies to:
dy/ds = ∇_q \tilde H (q(s))
dq/ds = 0
which obviously has solutions with constant q and affine y. whaaaaat?
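(explicitly, if I read it right: q(s) = q(0) and
y(s) = y(0) + s ∇_q \tilde H(q(0)), i.e. straight-line trajectories in the
transformed coordinates.)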
- a bit later they say, to learn the symplectic net φ they use the PINN
loss of the hamiltonian ODEs. that I don't really understand. basically
it looks like
section 4:
- the version with state constraints is basically the normal, unconstrained
version + some different penalty functions which they compare.
the idea starts to creep up on me that they use all of this to solve JUST
ONE optimal control problem, instead of every possible one, so they only
have local optimality of the trajectory, only one initial condition, etc!
that also explains the egregious 500-dim claim, otherwise they would have
to somehow cover a continuous space with something like
2^500 ≈ 3.27 · 10^150 grid cells (at only two cells per dimension).
and also they talk about being maybe possibly in the future able to solve
(simpler kinds of) OCPs in real time, which would be irrelevant if they
had an HJB solution for all x0. conclusion:
HE!! PULLED!! A!! FAST!! ONE!! ON!! HIM!!!!!!
Adaptive Deep Learning for high-dimensional
Hamilton–Jacobi–Bellman Equations
https://epubs.siam.org/doi/pdf/10.1137/19M1288802
impressive-looking paper at first glance. neat and clean. they solve a
rigid-body stabilization OCP (training the value NN) in only three minutes.
extensive review section, be sure to check out some of their sources.
also talk a lot about optimal control basics, HJB, pontryagin stuff, maybe
too much?
in '2.2 causality-free data generation' they have a practical trick for
finding optimal trajectories with the pontryagin principle, a kind of
homotopy method where they solve the problem for a short time horizon
first, which is then successively stepped up until the desired OCP is
reached.
really interesting:
computing many such solutions becomes expensive, which means that generating
the large data sets necessary to train an NN can be difficult. With this in mind, we use
the time-marching trick only to generate a small initial data set, and adaptively add
more points during training. The key to doing this efficiently is simulating the system
dynamics using the partially trained NN to close the loop. The closed-loop trajectory
and predicted costate provide good guesses for the optimal state and costate, so that
we can immediately solve (2.11) for all of [0, tf ]. Besides being more computationally
efficient than time-marching, this approach also requires no parameter tuning. Details
are presented in section 4, and numerical comparisons between this method
and time-marching are given in subsections 5.2 and 6.2.
seems like a really nice and efficient way to alleviate the pain of
nonlinear trajectory optimization.
in general they have a lot of thoughts about data generation, learning in
the small data regime, fitting value functions, and an especially long
section about validation.
show in results that it is critical to include value gradient in the loss
formulation, instead of just value values (heh). with a possible PINN
approach we will have to do this anyway -- but interesting to know that in
a more supervised-style setting it is still quite useful, and the gradients
should be quite easy to get with the KKT matrix, autodiff, implicit
function theorem, whatever.
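the value-plus-gradient loss is easy to write down; something like this jax
sketch (my own formulation, with the costates from the PMP solutions as
gradient targets and a weight lam that I made up):

import jax
import jax.numpy as jnp

def value(params, x):
    h = jnp.tanh(params["W1"] @ x + params["b1"])
    return params["w2"] @ h

def value_and_grad_loss(params, xs, v_targets, costate_targets, lam=1.0):
    # fit both the value and its gradient (the costate along the optimal
    # trajectories), not just the value itself
    v_pred = jax.vmap(lambda x: value(params, x))(xs)
    dv_pred = jax.vmap(lambda x: jax.grad(value, argnums=1)(params, x))(xs)
    value_term = jnp.mean((v_pred - v_targets) ** 2)
    grad_term = jnp.mean(jnp.sum((dv_pred - costate_targets) ** 2, axis=-1))
    return value_term + lam * grad_term

key = jax.random.PRNGKey(0)
params = {"W1": jax.random.normal(key, (16, 3)),
          "b1": jnp.zeros(16),
          "w2": jax.random.normal(key, (16,))}
xs = jax.random.normal(key, (32, 3))
print(value_and_grad_loss(params, xs, jnp.zeros(32), jnp.zeros((32, 3))))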
criticism:
- a poor man's form of uncertainty sampling, basically plugging in the
  surrogate uncertainty = || grad_x V^NN(x, t) ||
- reliance on nonlinear/nonconvex optimization problems for initialisation
  probably removes all hope of global optimality in nonconvex control
  problems
A neural network approach for stochastic optimal control
https://arxiv.org/pdf/2209.13104.pdf
introduce a quite general approach to the problem. formulate a tunable
loss, where different tunings result in quite different methods.
somehow the stochastic version has smooth optimal value functions. would
that make something easier for me?
(i think) their central ingredient is also finding new optimal
trajectories by PMP, with initial costates estimated by the approximate
value function.
Deep Learning Approximation for Stochastic Control Problems
https://arxiv.org/pdf/1611.07422v1.pdf
They solve discrete-time, finite-horizon, stochastic optimal control
problems. basically they write one big fat equation which applies an NN
controller to each state, computes the next state from that action and a
sample of the stochastic component, and applies the next NN controller
(time varying!). The cost is accumulated, and they do direct gradient
descent, minimising the expected cost w.r.t. all controller NN weights,
with an MC-approximated expectation over the stochastic components.
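as far as I can tell the whole thing is just 'differentiate through the
stochastic rollout', something like this jax sketch (toy dynamics and cost,
all mine):

import jax
import jax.numpy as jnp

T, state_dim, act_dim = 10, 2, 1

def controller(params_t, x):
    # one small NN per time step (time-varying policy)
    return jnp.tanh(params_t["W"] @ x + params_t["b"])

def rollout_cost(params, x0, noise):
    # apply the controllers forward through the stochastic dynamics and
    # accumulate the trajectory cost
    x, total = x0, 0.0
    for t in range(T):
        u = controller(params[t], x)
        total = total + x @ x + u @ u
        x = x + 0.1 * (jnp.array([x[1], u[0]]) + noise[t])  # toy dynamics
    return total + x @ x

def expected_cost(params, x0, noises):
    # MC approximation of the expectation over the stochastic component
    return jnp.mean(jax.vmap(lambda n: rollout_cost(params, x0, n))(noises))

key = jax.random.PRNGKey(0)
params = [{"W": 0.1 * jax.random.normal(key, (act_dim, state_dim)),
           "b": jnp.zeros(act_dim)} for _ in range(T)]
noises = 0.05 * jax.random.normal(key, (256, T, state_dim))
grads = jax.grad(expected_cost)(params, jnp.array([1.0, 0.0]), noises)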
they have quite a limited theory discussion and basically no correctness
results.
continuous-time versions would basically be neural ODEs right? ofc we can
do that, but also who knows about convergence guarantees. probably it works
alright though..
also watched this talk:
Wei Kang: "Data Development and Deep Learning for HJB Equations"
https://www.youtube.com/watch?v=ABY-Wo6w77I&list=PLHyI3Fbmv0ScvcPAT9IK6YPEZGnj-6Sqk&index=24
mentions a lot of the interesting papers summarised above and collected on
semantic scholar in the "VERY interesting" section. especially in the
discussion afterwards there are some interesting thoughts on global vs
local optimality. no concrete research mentioned though. could this be my
niche???