Commit

venue
RuizSerra committed Mar 4, 2024
1 parent da194d8 commit 38db782
Showing 2 changed files with 8 additions and 149 deletions.
8 changes: 4 additions & 4 deletions draft_header.html
@@ -179,12 +179,12 @@ <h1>Learning vision processing for assistive displays through self-attention age
</div>
</div>
<div class="date">
<div class="month">January</div>
<div class="year">2022</div>
<div class="month">February</div>
<div class="year">2024</div>
</div>
<div class="date">
<div class="month">Cyb-IEEE<dt-fn>IEEE Transactions on Cybernetics (under review)</dt-fn></div>
<div class="year" style="color: #FF6C00;"><a href="https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=6221036" target="_blank">paper</a></div>
<div class="month">ACM-TOMM<dt-fn>ACM Transactions on Multimedia Computing, Communications, and Applications</dt-fn></div>
<div class="year" style="color: #FF6C00;"><a href="https://doi.org/10.1145/3650111" target="_blank">paper</a></div>
</div>
</div>
</dt-byline>
149 changes: 4 additions & 145 deletions index.html
@@ -179,158 +179,17 @@ <h1>Learning vision processing for assistive displays through self-attention age
</div>
</div>
<div class="date">
<div class="month">January</div>
<div class="year">2022</div>
<div class="month">February</div>
<div class="year">2024</div>
</div>
<div class="date">
<div class="month">Cyb-IEEE<dt-fn>IEEE Transactions on Cybernetics (under review)</dt-fn></div>
<div class="year" style="color: #FF6C00;"><a href="https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=6221036" target="_blank">paper</a></div>
<div class="month">ACM-TOMM<dt-fn>ACM Transactions on Multimedia Computing, Communications, and Applications</dt-fn></div>
<div class="year" style="color: #FF6C00;"><a href="https://doi.org/10.1145/3650111" target="_blank">paper</a></div>
</div>
</div>
</dt-byline>
<h2>Problem statement</h2>
<div style="text-align: center;">
<img src="assets/png/TVCG-pipeline-1.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
<span style="color: #00F">Assistive vision</span> consists of a camera that captures the real world, with
images processed by a video processing unit (VPU), converting them into scene
representations that can be rendered in assistive displays of different kinds.
We <span style="color: #FF9002">train a self-attention network in an RL context</span> to select important parts of
images for 3D navigation. Once trained, the SA network can be <span style="color: #C159B2">deployed</span>
to the visual prostheses’ VPU to perform the vision processing.
</figcaption>
</div>
<p>With the goal of simplifying visual representations of scenes
for navigation by selecting relevant features, we build upon
the work of Tang et al. <dt-cite key="Tang2020"></dt-cite>,
adapting the DRL agent they introduced to enable training in a 3D navigation simulation environment. We
propose several methods to enhance the selected features,
and adapt the vision processing pipeline to present the obtained representations through different display modalities,
highlighting the method’s versatility. In the resulting visualisations, task-relevant features are enhanced and
irrelevant ones removed, effectively increasing the signal-to-noise ratio.</p>
<hr>
<h2>Training in simulation</h2>
<p>The agents are trained in DeepMind Lab <dt-cite key="Beattie2016"></dt-cite>
&quot;NavMaze&quot; simulation environments with RGB-D observations (or variations thereof),
and an action space of size 3.</p>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/d2_10_0_overlay.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
The self-attention models are trained in a reinforcement learning context by means of neuroevolution.
During training, the LSTM controller part of the network makes all decisions based solely on the
location of the top <i>K</i> most important image patches. This figure shows agent <i>d2</i> navigating environment
<i>NavMazeStatic01</i>.
</figcaption>
</div>
<p></p>
<div style="text-align: center;">
<img src="assets/png/d2_reward_vs_iteration.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
The agents can learn to navigate the environment effectively with fewer than
100 million training observations (~200 iterations × 64 population/iter. ×
8 episodes/pop. × 900 observations/episode ≈ 92E6 observations), taking ~3h of wall time on our infrastructure.
This figure shows agent <i>d2</i> learning in environment <i>NavMazeStatic01</i>.
</figcaption>
</div>
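As a quick check of the observation budget quoted in the caption above, the product can be computed directly. This is only a worked-arithmetic sketch: the variable names are illustrative and are not taken from the authors' code.

```python
# Figures as quoted in the caption; the variable names are hypothetical.
iterations = 200                 # approximate number of training iterations
population_size = 64             # population members per iteration
episodes_per_member = 8          # episodes per population member
observations_per_episode = 900   # observations per episode

total_observations = (
    iterations * population_size * episodes_per_member * observations_per_episode
)
print(f"{total_observations:,} observations (~{total_observations / 1e6:.0f}E6)")
```

The product comes out just above 92 million, which is consistent with the caption's "fewer than 100 million" claim.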
<p></p>
<div style="text-align: center;">
<img src="assets/png/training-components.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
To make the training process more scalable and marginally faster, we completely decoupled the CMA-ES
population from the training task queue. Task requests, each comprising a population member
identifier and the corresponding agent parameters, are placed in a
queue and processed by compute workers on a FIFO basis. This makes the training
more flexible and better suited to distributed computing.
</figcaption>
</div>
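The decoupling described in the caption above could be sketched roughly as follows. This is a minimal single-process illustration using Python's standard `queue` and `threading` modules, not the authors' implementation: `evaluate` is a placeholder fitness function standing in for simulator episodes, and `worker` and `evaluate_population` are hypothetical names. In a distributed setting the in-memory queue would presumably be replaced by a networked one.

```python
import queue
import threading


def evaluate(params):
    """Placeholder fitness (assumption): stands in for running episodes."""
    return -sum(p * p for p in params)


def worker(tasks, results):
    """Take task requests off the queue in FIFO order until a sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:              # sentinel: no more work for this worker
            tasks.task_done()
            break
        member_id, params = task      # request = (member identifier, parameters)
        results[member_id] = evaluate(params)
        tasks.task_done()


def evaluate_population(population, num_workers=4):
    """Evaluate every population member through a shared FIFO task queue."""
    tasks = queue.Queue()             # queue.Queue hands items out first-in, first-out
    results = {}
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for member_id, params in enumerate(population):
        tasks.put((member_id, params))
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    # Results are keyed by member identifier, so completion order does not matter.
    return [results[i] for i in range(len(population))]


population = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.0]]
fitness = evaluate_population(population)
print(fitness)
```

Because each request carries its own member identifier, the optimiser only needs the returned fitness values in member order and is indifferent to which worker handled which request.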
<hr>
<h2>Vision processing in real-world scenes</h2>
<div style="text-align: center;">
<img src="assets/png/TVCG-main-figure.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
The representations learnt in simulation translate to the real world.
Hyperparameters can be adjusted in real time in the final application. For example,
in this figure, <i>K=10</i> patches are selected in training, whereas <i>K=80</i>
patches are selected in the real-world image.
</figcaption>
</div>
<p>Below we show different feature retrieval methods applied to real-world RGB-D video.</p>
<h3>Importance ranking</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_ranking.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Each patch’s brightness is based on its importance ranking. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Masked luminance</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_masked_intensity.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Luminance (greyscale) masked with selected patches. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Masked depth</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_50_masked_depth.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Depth channel (disparity values) masked with selected patches. Agent C4*, showing K=50 patches.
</figcaption>
</div>
<h3>Weighted depth</h3>
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/C4star_weighted_depth.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Depth at the patch location is scaled by the patch importance value. Agent C4*, showing all patches.
</figcaption>
</div>
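The feature retrieval methods shown above can be illustrated with a simplified sketch. It assumes one scalar importance value and one scalar depth (or luminance) value per patch, rather than full pixel patches, and the linear rank-to-brightness mapping is an assumption; all function names are hypothetical and not taken from the authors' code.

```python
def top_k_indices(importance, k):
    """Indices of the k most important patches, most important first."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return order[:k]


def importance_ranking(importance, k):
    """Brightness by rank among the top k patches; 0 for unselected patches.

    The linear mapping from rank to brightness is an assumption.
    """
    out = [0.0] * len(importance)
    for rank, i in enumerate(top_k_indices(importance, k)):
        out[i] = (k - rank) / k
    return out


def masked(values, importance, k):
    """Keep per-patch values (luminance or depth) only at the top-k patches."""
    selected = set(top_k_indices(importance, k))
    return [v if i in selected else 0.0 for i, v in enumerate(values)]


def weighted(values, importance):
    """Scale every patch value by its importance (all patches are shown)."""
    return [v * w for v, w in zip(values, importance)]


# Toy example with four patches.
importance = [0.125, 0.5, 0.0625, 0.3125]
depth = [2.0, 4.0, 6.0, 8.0]

print(importance_ranking(importance, k=2))
print(masked(depth, importance, k=2))
print(weighted(depth, importance))
```

The masked variants depend on K, whereas the weighted variant uses every patch, matching the "showing all patches" note in the weighted depth caption.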
<hr>
<h2>Display modalities</h2>
<h3>Simulated Phosphene Visualisation</h3>
<div style="text-align: left;">
<img src="assets/png/TVCG-SA-output-SPV.png" style="margin: 0; width: 100%;" ></img>
<figcaption style="text-align: left; padding-top: 0;">
SPV of different output modes (refer to Figure 5 in the paper).
</figcaption>
</div>
<!--
<div style="text-align: center;">
<video class="b-lazy" src="assets/mp4/TODO.mp4" type="video/mp4" autoplay muted playsinline loop style="margin: 0; width: 100%;" ></video>
<figcaption style="text-align: left; padding-top: 0;">
Simulated Phosphene Visualisation (SPV)
</figcaption>
</div> -->
<!-- <p></p>
### vOICe
<dt-cite key="Meijer1993"></dt-cite> -->
</dt-article>
<dt-appendix>
<h3>Acknowledgements</h3>
<p>The template for this supporting materials site is from <a href="https://github.com/attentionagent/attentionagent.github.io">Tang et al</a>.</p>
<p>The experiments in this work were performed on Swinburne University's <a href="https://supercomputing.swin.edu.au/ozstar/">OzStar high-performance computing system</a>.</p>
<h3 id="citation">Citation</h3>
<p>For attribution in academic contexts, please cite this work as:</p>
<pre class="citation short">Jaime Ruiz-Serra and Jack White and Stephen Petrie and Tatiana Kameneva and Chris McCarthy,
Learning vision processing for assistive displays through self-attention agents, 2022.</pre>
<p>BibTeX citation</p>
<pre class="citation long">@article{Ruiz-Serra2021,
author = {Ruiz-Serra, Jaime and
White, Jack and
Petrie, Stephen and
Kameneva, Tatiana and
McCarthy, Chris},
title = {Learning vision processing for assistive displays through self-attention agents},
eprint = {},
url = {},
note = "\url{http://ruizserra.github.io/self-attention-assistive-displays}",
year = {2022}
}</pre>
<h3>Open Source Code</h3>
<p>Code to reproduce the results in this work TBD.</p>
<h3>Reuse</h3>
<p>Diagrams and text are licensed under Creative Commons Attribution <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a> with the <a href="http://github.com/ruizserra/self-attention-assistive-displays/assets">source available on GitHub</a>, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.</p>
</dt-appendix>
</dt-appendix>
</body>