<h1>Learning vision processing for assistive displays through self-attention agents</h1>
</div>
</div>
</dt-byline>
<h2>Problem statement</h2>
<div style="text-align: center;">
<img src="assets/png/TVCG-pipeline-1.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
<span style="color: #00F">Assistive vision</span> consists of a camera that captures the real world, with
images processed by a video processing unit (VPU), converting them into scene
representations that can be rendered in assistive displays of different kinds.
We <span style="color: #FF9002">train a self-attention network in a RL context</span> to select important parts of
images for 3D navigation. Once trained, the SA network can be <span style="color: #C159B2">deployed</span>
to the visual prostheses’ VPU to perform the vision processing.
</figcaption>
</div>
<p>With the goal of simplifying visual representations of scenes
for navigation by selecting relevant features, we build upon
the work of Tang et al. <dt-cite key="Tang2020"></dt-cite>,
adapting the DRL agent they introduced to enable training in a 3D navigation simulation environment. We
propose several methods to enhance the selected features,
and adapt the vision processing pipeline to present the obtained representations through different display modalities,
highlighting the method’s versatility. In the resulting visualisations, task-relevant features are enhanced
and irrelevant ones removed, effectively increasing the signal-to-noise ratio.</p>
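<p>Concretely, the deployed pipeline sketched in the figure above reduces to a simple per-frame loop:
capture an RGB-D frame, score image patches with the trained self-attention network, build the chosen
scene representation, and hand it to the display. The snippet below is a minimal illustration of this loop
under assumed interfaces; <code>camera</code>, <code>sa_network</code>, <code>build_representation</code>
and <code>display</code> are hypothetical placeholders rather than objects from our codebase.</p>
<pre>
# Hypothetical per-frame loop for the assistive-vision pipeline (illustrative only).
import numpy as np

def run_pipeline(camera, sa_network, build_representation, display, k=10):
    """Capture -> score patches -> build a scene representation -> render."""
    while display.is_open():
        rgbd = camera.read()                            # H x W x 4 RGB-D frame
        scores = sa_network.patch_importance(rgbd)      # one importance score per image patch
        top_k = np.argsort(scores)[::-1][:k]            # indices of the K most important patches
        frame_out = build_representation(rgbd, scores, top_k)  # e.g. masked depth
        display.render(frame_out)                       # e.g. simulated phosphenes
</pre>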
<hr>
<h2>Training in simulation</h2>
<p>The agents are trained in DeepMind Lab <dt-cite key="Beattie2016"></dt-cite>
&quot;NavMaze&quot; simulation environments with RGB-D observations (or variations thereof)
and an action space of size 3.</p>
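<p>As a rough indication of what this setup looks like in code, the sketch below instantiates a
&quot;NavMaze&quot; level through the DeepMind Lab Python API. The level name, frame size, action repeat and
the mapping from our 3-way discrete action space onto DeepMind Lab's native 7-dimensional action vector are
illustrative assumptions, not necessarily the exact training configuration used in this work.</p>
<pre>
# Illustrative DeepMind Lab setup; level name, resolution and action mapping are assumed.
import numpy as np
import deepmind_lab

env = deepmind_lab.Lab(
    'nav_maze_static_01',                    # a "NavMaze" level
    ['RGBD_INTERLEAVED'],                    # RGB-D observations
    config={'width': '96', 'height': '96', 'fps': '60'})

# Three discrete actions expressed as DeepMind Lab action vectors.
ACTIONS = [
    np.array([-20, 0, 0, 0, 0, 0, 0], dtype=np.intc),   # 0: look left
    np.array([ 20, 0, 0, 0, 0, 0, 0], dtype=np.intc),   # 1: look right
    np.array([  0, 0, 0, 1, 0, 0, 0], dtype=np.intc),   # 2: move forward
]

env.reset()
while env.is_running():
    rgbd = env.observations()['RGBD_INTERLEAVED']        # H x W x 4 uint8 array
    action = ACTIONS[np.random.randint(len(ACTIONS))]    # placeholder policy
    reward = env.step(action, num_steps=4)               # action repeat of 4 (assumed)
</pre>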
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/d2_10_0_overlay.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
The self-attention models are trained in a reinforcement learning context by means of neuroevolution.
During training, the LSTM controller part of the network makes all decisions based solely on the
locations of the top <i>K</i> most important image patches. This figure shows agent <i>d2</i> navigating environment
<i>NavMazeStatic01</i>.
</figcaption>
</div>
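<p>The figure above highlights the key property of the architecture: the controller never sees pixels,
only the locations of the top <i>K</i> patches ranked by self-attention. A minimal sketch of that
patch-scoring step, loosely following Tang et al. <dt-cite key="Tang2020"></dt-cite>, is given below;
the patch size, stride, projection width and the exact voting scheme are illustrative assumptions.</p>
<pre>
# Minimal sketch of self-attention patch importance (loosely following Tang et al., 2020).
# Hyperparameters and shapes are illustrative, not the trained agent's exact values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def top_k_patch_centres(frame, W_k, W_q, patch=7, stride=4, k=10):
    """Return the (row, col) centres of the k most important patches of `frame`."""
    h, w, c = frame.shape
    patches, centres = [], []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            patches.append(frame[i:i + patch, j:j + patch].ravel())
            centres.append((i + patch // 2, j + patch // 2))
    X = np.stack(patches)                                  # (N, patch*patch*c)
    keys, queries = X @ W_k, X @ W_q                       # learned linear projections
    A = softmax(keys @ queries.T / np.sqrt(W_k.shape[1]))  # (N, N) attention matrix
    votes = A.sum(axis=0)                                  # importance vote per patch
    top = np.argsort(votes)[::-1][:k]                      # indices of the top-k patches
    return [centres[i] for i in top]                       # only these reach the LSTM controller
</pre>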
<p></p>
<div style="text-align: center;">
<img src="assets/png/d2_reward_vs_iteration.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
The agents can learn to navigate the environment effectively with fewer than
100 million training observations (~200 iterations × 64 population members/iteration ×
8 episodes/member × 900 observations/episode ≈ 92 million observations), taking ~3 h of wall time on our infrastructure.
This figure shows agent <i>d2</i> learning in environment <i>NavMazeStatic01</i>.
</figcaption>
</div>
<p></p>
<div style="text-align: center;">
<img src="assets/png/training-components.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
To make the training process more scalable and marginally faster, we completely decoupled the CMA-ES
population from the training task queue. Task requests, each containing a population member
identifier and the agent parameters for that member, are placed in a
queue and processed by compute workers on a FIFO basis. This makes the training
more flexible and suitable for distributed computing.
</figcaption>
</div>
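<p>The sketch below illustrates this decoupling with Python standard-library primitives: the CMA-ES loop
only pushes (member identifier, parameters) tasks onto a FIFO queue and collects fitness results, while a
pool of workers consumes the queue. The use of the <code>cma</code> package's ask/tell interface, local
threads and the dummy <code>evaluate</code> rollout are assumptions for illustration; in practice the queue
would be backed by a distributed service and the workers would run full episodes.</p>
<pre>
# Illustrative decoupling of the CMA-ES population from a FIFO task queue (assumed setup).
import queue
import threading
import numpy as np
import cma

def evaluate(params):
    """Placeholder rollout: run episodes with these parameters and return the mean reward."""
    return -float(np.sum(np.square(params)))      # dummy reward for illustration

def worker(tasks, results):
    while True:
        member_id, params = tasks.get()           # FIFO: oldest task first
        results.put((member_id, evaluate(params)))
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
for _ in range(8):                                # 8 local workers (illustrative)
    threading.Thread(target=worker, args=(tasks, results), daemon=True).start()

es = cma.CMAEvolutionStrategy(np.random.randn(100), 0.5, {'popsize': 64, 'maxiter': 200})
while not es.stop():
    solutions = es.ask()                          # sample the population
    for member_id, params in enumerate(solutions):
        tasks.put((member_id, params))            # one task per population member
    tasks.join()                                  # wait for all workers to finish
    rewards = dict(results.get() for _ in solutions)
    es.tell(solutions, [-rewards[i] for i in range(len(solutions))])  # CMA-ES minimises
</pre>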
<hr>
<h2>Vision processing in real-world scenes</h2>
<div style="text-align: center;">
<img src="assets/png/TVCG-main-figure.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
The representations learnt in simulation translate to the real world.
Hyperparameters can be adjusted in real time in the final application. For example,
in this figure, <i>K=10</i> patches are selected in training, whereas <i>K=80</i>
patches are selected in the real-world image.
</figcaption>
</div>
<p>Below we show different feature retrieval methods applied to real-world RGB-D video.</p>
<h3>Importance ranking</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_ranking.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Each patch's brightness is determined by its importance ranking. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Masked luminance</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_masked_intensity.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Luminance (greyscale) masked with selected patches. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Masked depth</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_masked_depth.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Depth channel (disparity values) masked with selected patches. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Weighted depth</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_weighted_depth.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Depth at the patch location is scaled by the patch importance value. Agent C4*, showing all patches.
</figcaption>
</div>
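<p>All four output modes above can be expressed as simple per-patch operations on the importance scores
produced by the self-attention network, with <i>K</i> applied only at render time, which is why it can be
adjusted on the fly in the deployed application (for the weighted-depth mode, <i>K</i> is simply the total
number of patches). The sketch below assumes patch centres, scores and a fixed patch size as inputs and is
illustrative rather than the exact implementation.</p>
<pre>
# Illustrative composition of the four output modes from per-patch importance scores.
# Inputs (patch centres, scores, patch size) come from the self-attention step; details assumed.
import numpy as np

def render_mode(mode, luminance, depth, centres, scores, patch=7, k=50):
    """luminance, depth: H x W float arrays in [0, 1]; centres: list of (row, col); scores: (N,)."""
    out = np.zeros_like(luminance)
    order = np.argsort(scores)[::-1][:k]                 # K applied at render time
    for rank, idx in enumerate(order):
        r, c = centres[idx]
        sl = (slice(r - patch // 2, r + patch // 2 + 1),
              slice(c - patch // 2, c + patch // 2 + 1))
        if mode == 'importance_ranking':
            out[sl] = 1.0 - rank / k                     # brighter patches rank higher
        elif mode == 'masked_luminance':
            out[sl] = luminance[sl]                      # greyscale within selected patches
        elif mode == 'masked_depth':
            out[sl] = depth[sl]                          # disparity within selected patches
        elif mode == 'weighted_depth':
            out[sl] = depth[sl] * scores[idx]            # depth scaled by patch importance
    return out
</pre>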
<hr>
<h2>Display modalities</h2>
<h3>Simulated Phosphene Visualisation</h3>
<div style="text-align: left;">
<img src="assets/png/TVCG-SA-output-SPV.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
SPV of different output modes (refer to Figure 5 in the paper).
</figcaption>
</div>
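<p>Simulated phosphene visualisation is commonly approximated by sampling the processed frame on a coarse
grid and rendering each sample as a Gaussian blob whose brightness encodes the sampled intensity. The
sketch below implements that generic simplification; the grid size, blob width and output resolution are
assumptions and not necessarily the phosphene model used in the paper.</p>
<pre>
# Generic simulated phosphene visualisation: sample the processed frame on a coarse grid
# and render each sample as a Gaussian "phosphene". Grid size and blob width are assumed.
import numpy as np

def simulate_phosphenes(frame, grid=(32, 32), sigma=2.0, out_size=(256, 256)):
    """frame: H x W float array in [0, 1]; returns an out_size SPV rendering."""
    h, w = frame.shape
    H, W = out_size
    src_rows = np.linspace(0, h - 1, grid[0]).astype(int)   # where each phosphene samples the frame
    src_cols = np.linspace(0, w - 1, grid[1]).astype(int)
    dst_rows = np.linspace(0, H - 1, grid[0])                # where each phosphene is drawn
    dst_cols = np.linspace(0, W - 1, grid[1])
    yy, xx = np.mgrid[0:H, 0:W]
    spv = np.zeros(out_size)
    for gy, sy in zip(src_rows, dst_rows):
        for gx, sx in zip(src_cols, dst_cols):
            amp = frame[gy, gx]                              # sampled brightness drives the phosphene
            spv += amp * np.exp(-((yy - sy) ** 2 + (xx - sx) ** 2) / (2 * sigma ** 2))
    return np.clip(spv, 0.0, 1.0)
</pre>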
<!--
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/TODO.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Simulated Phosphene Visualisation (SPV)
</figcaption>
</div> -->
<!-- <p></p>
### vOICe
<dt-cite key="Meijer1993"></dt-cite> -->
</dt-article>
<dt-appendix>
<h3>Acknowledgements</h3>
<p>The template for this supporting materials site is from <a href="https://github.com/attentionagent/attentionagent.github.io">Tang et al.</a></p>
<p>The experiments in this work were performed on Swinburne University's <a href="https://supercomputing.swin.edu.au/ozstar/">OzStar high-performance computing system</a>.</p>
<h3 id="citation">Citation</h3>
<p>For attribution in academic contexts, please cite this work as:</p>
<pre class="citation short">Jaime Ruiz-Serra and Jack White and Stephen Petrie and Tatiana Kameneva and Chris McCarthy,
Learning vision processing for assistive displays through self-attention agents, 2022.</pre>
<p>BibTeX citation</p>
<pre class="citation long">@article{Ruiz-Serra2021,
author = {Ruiz-Serra, Jaime and
White, Jack and
Petrie, Stephen and
Kameneva, Tatiana and
McCarthy, Chris},
title = {Learning vision processing for assistive displays through self-attention agents},
eprint = {},
url = {},
note = "\url{http://ruizserra.github.io/self-attention-assistive-displays}",
year = {2022}
}</pre>
<h3>Open Source Code</h3>
<p>Code to reproduce the results in this work is TBD.</p>
<h3>Reuse</h3>
<p>Diagrams and text are licensed under Creative Commons Attribution <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a> with the <a href="http://github.com/ruizserra/self-attention-assistive-displays/assets">source available on GitHub</a>, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognised by the citations in their captions.</p>
</dt-appendix>
</body>