% Copyright (C) 2014-2023 by Thomas Auzinger <[email protected]>
\documentclass[draft,final]{vutinfth} % Remove option 'final' to obtain debug information.
% Load packages to allow in- and output of non-ASCII characters.
\usepackage{lmodern} % Use an extension of the original Computer Modern font to minimize the use of bitmapped letters.
\usepackage[T1]{fontenc} % Determines font encoding of the output. Font packages have to be included before this line.
\usepackage[utf8]{inputenc} % Determines encoding of the input. All input files have to use UTF8 encoding.
% Extended LaTeX functionality is enabled by including packages with \usepackage{...}.
\usepackage{amsmath} % Extended typesetting of mathematical expressions.
\usepackage{amssymb} % Provides a multitude of mathematical symbols.
\usepackage{mathtools} % Further extensions of mathematical typesetting.
\usepackage{microtype} % Small-scale typographic enhancements.
\usepackage[inline]{enumitem} % User control over the layout of lists (itemize, enumerate, description).
\usepackage{multirow} % Allows table elements to span several rows.
\usepackage{booktabs} % Improves the typesetting of tables.
\usepackage{subcaption} % Allows the use of subfigures and enables their referencing.
\usepackage[ruled,linesnumbered,algochapter]{algorithm2e} % Enables the writing of pseudo code.
\usepackage[usenames,dvipsnames,table]{xcolor} % Allows the definition and use of colors. This package has to be included before tikz.
\usepackage[style=alphabetic,natbib=true]{biblatex}
\usepackage{makecell}
\usepackage{nag} % Issues warnings when best practices in writing LaTeX documents are violated.
\usepackage{todonotes} % Provides tooltip-like todo notes.
\usepackage{hyperref} % Enables hyperlinking in the electronic document version. This package has to be included second to last.
\usepackage[english,german]{babel}
\usepackage[acronym,toc]{glossaries} % Enables the generation of glossaries and lists of acronyms. This package has to be included last.
\addbibresource{refs.bib}
% Define convenience functions to use the author name and the thesis title in the PDF document properties.
\newcommand{\authorname}{Stefan Steinheber} % The author name without titles.
\newcommand{\thesistitle}{Observation and Tracking of Plant Health using
Computer Vision} % The title of the thesis. The English version should be used, if it exists.
% Set PDF document properties
\hypersetup{
pdfpagelayout = TwoPageRight, % How the document is shown in PDF viewers (optional).
linkbordercolor = {Melon}, % The color of the borders of boxes around hyperlinks (optional).
pdfauthor = {\authorname}, % The author's name in the document properties (optional).
pdftitle = {\thesistitle}, % The document's title in the document properties (optional).
pdfsubject = {Subject}, % The document's subject in the document properties (optional).
pdfkeywords = {a, list, of, keywords} % The document's keywords in the document properties (optional).
}
\setpnumwidth{2.5em} % Avoid overfull hboxes in the table of contents (see memoir manual).
\setsecnumdepth{subsection} % Enumerate subsections.
\nonzeroparskip % Create space between paragraphs (optional).
\setlength{\parindent}{0pt} % Remove paragraph indentation (optional).
\makeindex % Use an optional index.
\makeglossaries % Use an optional glossary.
%\glstocfalse % Remove the glossaries from the table of contents.
% Set persons with 4 arguments:
% {title before name}{name}{title after name}{gender}
% where both titles are optional (i.e. can be given as empty brackets {}).
\setauthor{}{\authorname}{}{male}
\setadvisor{Univ.Prof. Dipl.-Ing. Dipl.-Ing. Dr.techn.}{Michael Wimmer}{PhD}{male}
% For bachelor and master theses:
\setfirstassistant{Projektass. Mag.rer.soc.oec.}{Stefan Ohrhallinger}{PhD}{male}
\setsecondassistant{Pretitle}{Forename Surname}{Posttitle}{male}
\setthirdassistant{Pretitle}{Forename Surname}{Posttitle}{male}
% For dissertations:
\setfirstreviewer{Pretitle}{Forename Surname}{Posttitle}{male}
\setsecondreviewer{Pretitle}{Forename Surname}{Posttitle}{male}
% For dissertations at the PhD School and optionally for dissertations:
\setsecondadvisor{Pretitle}{Forename Surname}{Posttitle}{male} % Comment to remove.
% Required data.
\setregnumber{12022506}
\setdate{01}{01}{2024} % Set date with 3 arguments: {day}{month}{year}.
\settitle{\thesistitle}{Erfassung von Pflanzengesundheit mittels Computer Vision} % Sets English and German version of the title (both can be English or German). If your title contains commas, enclose it with additional curvy brackets (i.e., {{your title}}) or define it as a macro as done with \thesistitle.
%\setsubtitle{Optional Subtitle of the Thesis}{Optionaler Untertitel der Arbeit} % Sets English and German version of the subtitle (both can be English or German).
% Select the thesis type: bachelor / master / doctor.
% Bachelor:
\setthesis{bachelor}
%
% Master:
%\setthesis{master}
%\setmasterdegree{dipl.} % dipl. / rer.nat. / rer.soc.oec. / master
%
% Doctor:
%\setthesis{doctor}
%\setdoctordegree{rer.soc.oec.}% rer.nat. / techn. / rer.soc.oec.
% For bachelor and master:
\setcurriculum{Media Informatics and Visual Computing}{Medieninformatik und Visual Computing} % Sets the English and German name of the curriculum.
% Optional reviewer data:
\setfirstreviewerdata{Affiliation, Country}
\setsecondreviewerdata{Affiliation, Country}
% render without figures and tables
%\usepackage{environ}
%\RenewEnviron{figure}{}% Gobble figure environment
%\RenewEnviron{table}{}% Gobble table environment
\begin{document}
\newcommand{\aitool}[3]{
\textbf{\underline{#1:}}\vspace{.8em}\newline\textbf{Input:}\vspace{.3em}\newline#2\vspace{.6em}\newline \textbf{Output:}\vspace{.3em}\newline#3}
\frontmatter % Switches to roman numbering.
% The structure of the thesis has to conform to the guidelines at
% https://informatics.tuwien.ac.at/study-services
\addtitlepage{naustrian} % German title page.
\addtitlepage{english} % English title page.
\addstatementpage
\begin{danksagung*}
Vielen Dank an das DataLab der Technischen Universität Wien für die Bereitstellung der Cluster-Ressourcen. Alle Berechnungen wurden in einer Jupyter-Umgebung (Version 3.6.7) auf einer Compute-Node mit einem AMD EPYC 7742 64-Core Processor, 64 GB RAM und einer NVIDIA A100 40 GB ausgeführt.
Als Schreibhilfe wurde Writefull verwendet, das mithilfe von KI Verbesserungsvorschläge zum Text erstellt. Dabei wurde sowohl das Modell von Writefull als auch die von Writefull angebotene GPT-Integration verwendet.
\end{danksagung*}
\begin{acknowledgements*}
A big thank you goes to the DataLab at the Technical University of Vienna for the provision of their cluster resources. All computations were executed in a Jupyter environment (Version 3.6.7) on a compute node equipped with an AMD EPYC 7742 64-Core Processor, 64 GB of RAM, and an NVIDIA A100 40 GB GPU.
The Writefull AI tool was used as a writing aid, utilizing both Writefull's own model and the GPT-based integration provided by Writefull.
In some cases, AI tools were used to improve the writing style:
\aitool{DeepL Writing (Academic Style)}{The novelty of this architecture is the introduction of Inception blocks, which process their input with different-sized convolutions for different number of times}{The distinctive feature of this architectural design is the incorporation of Inception blocks, which facilitate the processing of input through the application of varying convolutional operations, depending on the input size.}
\end{acknowledgements*}
\begin{kurzfassung}
\todo{Ihr Text hier.}
\end{kurzfassung}
\begin{abstract}
\todo{Enter your text here.}
\end{abstract}
% Select the language of the thesis, e.g., english or naustrian.
\selectlanguage{english}
% Add a table of contents (toc).
\tableofcontents % Starred version, i.e., \tableofcontents*, removes the self-entry.
% Switch to arabic numbering and start the enumeration of chapters in the table of content.
\mainmatter
\chapter{Introduction}
The health of plants is a critical factor in agriculture, especially in terms of food security. Effective monitoring and assessment of plant health can prevent the loss of crops, optimize crop yield, and reduce the need for chemical fertilizers. Traditionally, monitoring plant health relied on manual human inspection of crops or on sensor-based methods using soil moisture or nutrient-detecting sensors. However, these methods can be labor-intensive and difficult to scale across large farm fields.
Computer vision has been applied in crop-farming contexts for a long time, but earlier approaches were more rudimentary than the ones presented in this work \cite{cunha_application_2003}: they covered only a limited range of use cases and required more manual involvement in the process.
With the advancement of Machine Learning, particularly in the domain of Computer Vision, the capability to monitor plant health is increasingly accessible to small farms and individual cultivators. Computer Vision encompasses the automated extraction, analysis, and interpretation of information derived from images or videos. Within the scope of plant health monitoring, Computer Vision focuses on analyzing captured images of plant leaves to identify specific characteristics or patterns that signify the plant's health status or potential issues therein.
The majority of approaches presented in the existing literature rely on established datasets, like PlantVillage or PlantDoc \cite{hughes_open_2016, singh_plantdoc_2020}. These datasets mostly provide images of single leaves in front of a neutral background, which is rarely the case in a practical environment.
In a practical environment, an image often contains many leaves, which can overlap or be completely obstructed by other objects in the scene, and the background can vary greatly in complexity and in contrast to the leaves.
In this work, the transferability of models trained and evaluated on such datasets to real-world images will be evaluated. This involves the introduction of an image processing pipeline. To combat the problems of overlapping leaves and varying backgrounds, the pipeline will include an image segmentation step, which aims to separate the leaves from the background as cleanly as possible, followed by the disease detection step.
Subsequently, the image processing pipeline will be applied to a set of images captured in a real-world scenario, containing multiple overlapping instances of leaves with both healthy and diseased plants. Based on that image set, the efficacy of detecting diseases in the presented plants with the proposed method will be evaluated.
\chapter{Related Work}
There has been a substantial body of research dedicated to monitoring plant health, especially in regions with high agricultural importance such as India and China. This chapter focuses on the two areas most important for this thesis, i.e., Leaf Segmentation (Section \ref{sec:leaf_seg}) and Disease Detection (Section \ref{sec:disease_detection}), but also highlights research that uses other indicators linked to plant health, such as how much a plant grows in a given time frame (Section \ref{sec:plant_growth_analysis}) or the nutrients present in a plant as predicted from hyperspectral imaging (Section \ref{sec:nutrient_analysis}).
\section{Plant/Leaf Segmentation} \label{sec:leaf_seg}
There exist many different segmentation algorithms in the domains of image processing (IP) and Machine Learning (ML), each pertaining to different use cases. They range from the most basic approach of selecting similar colors with a global threshold \cite{lu_chapter_2024} to more sophisticated methods such as k-means clustering, a shallow learning method for segmentation that is, however, also based on color \cite{dhanachandra_image_2015}.
In the early days of this field, researchers utilized more traditional classifiers and shallow learning to separate an object from the background of an image. \citeauthor{lowe_distinctive_2004} presented an approach utilizing histograms of gradient orientations in a method called Histogram of Oriented Gradients (HOG) \cite{lowe_distinctive_2004}.
\citeauthor{gao_method_2018} proposed a semi-automated image processing method for segmenting leaves. Their methodology requires the manual input of points lying on the area of the leaf in a presented image. Through further processing, they determine the perimeter points of the leaf and use these in the marker-based watershed segmentation \cite{kornilov_review_2022}. The proposed methodology managed to outperform state-of-the-art segmentation algorithms like Otsu, GrabCut, and regular Watershed segmentation with a Jaccard index of 96.97\% \cite{gao_method_2018}.
The problem with using color-based segmentation algorithms is the high uniformity in color when processing images of plant leaves. Especially when examining overlapping plant leaves, there is often little color contrast when differentiating individual leaf instances, which leads to low efficacy of color-based segmentation algorithms.
With recent developments in hardware, namely the progress in graphics and tensor processing units, and the advances in Machine Learning software, utilizing Deep Learning Neural Networks (DLNN) has become more and more viable for image segmentation. For this reason, multiple Deep Learning (DL) based segmentation algorithms have been proposed. The benefit of using Artificial Neural Networks (ANN) in Deep Learning is that they can be trained to learn different characteristics and patterns of plant leaves, especially when utilizing convolutional layers \cite{patil_convolutional_2021}. They can thus recognize leaves not solely from color information but also from traits such as shape, and they can be more resilient towards capturing imperfections like lighting conditions, reflections, and overlapping objects.
\citeauthor{yang_leaf_2020} performed a study examining the performance of the Mask R-CNN model in a leaf segmentation task with a complicated background. They argued that other datasets like the Aberystwyth Leaf Evaluation Dataset \cite{bell_aberystwyth_2016, scharr_leaf_2016, minervini_finely-grained_2016} only provide images with a quite uniform and contrasting background compared to the leaves. Thus they prepared 4000 images of 15 different plant species with a non-uniform and more complicated background and manually created masks for individual leaves. The model was compared to non-deep-learning segmentation approaches Grabcut and Otsu segmentation. The DL approach managed to perform significantly better than the others with a Misclassification Error (ME) of 1.15\% compared to 28.74\% (Grabcut) and 29.80\% (Otsu) \cite{yang_leaf_2020}.
\citeauthor{guo_leafmask_2021} proposed a model called LeafMask. This model consists of two main components: the mask assembly module and the mask refining module. The mask assembly module merges position-sensitive bases from each predicted bounding box after non-maximum suppression (NMS) with corresponding coefficients to generate initial masks. This is followed by the mask refining module, which improves leaf boundaries through a point selection strategy and predictor, ensuring precise delineation of leaf edges. Additionally, LeafMask incorporates a multi-scale attention module within its dual attention-guided mask (DAG-Mask) branch, enhancing information representation and producing more accurate bases. By integrating these modules under an anchor-free instance segmentation paradigm, LeafMask effectively addresses the challenges of leaf occlusion, overlap, and varying shapes and sizes. Validated through extensive experiments on the Leaf Segmentation Challenge (LSC) dataset, LeafMask achieved a 90.09\% BestDice score, outperforming existing state-of-the-art methods \cite{guo_leafmask_2021}.
A hierarchical approach to segmenting semantics, plants, and individual leaves was proposed by
\citeauthor{roggiolani_hierarchical_2023}. This model uses an encoder-decoder architecture with a single ERFNet encoder (a semantic segmentation network designed for real-time usage \cite{romera_erfnet_2018}) and three different ERFNet decoders: one for semantic segmentation (differentiating between plant and soil), one for plant segmentation (differentiating individual plants), and one for leaf segmentation (differentiating individual leaves). The decoders are also connected by skip connections that provide unencoded data from the previous level. They use two different datasets captured from a top-down perspective on an agricultural field: GrowliFlower \cite{kierdorf_growliflower_2023} \& Sugar Beets \cite{chebrolu_agricultural_2017}.
They achieved Panoptic Quality (PQ) scores of 76.2\% and 89.2\% on these datasets, whereas other architectures performed significantly worse \cite{roggiolani_hierarchical_2023}.
\section{Disease Detection} \label{sec:disease_detection}
Several studies use shallow learning to classify and recognize plant diseases, some utilizing Support Vector Machines \cite{kirti_black_2020}, while others use k-Nearest-Neighbor classifiers to great success \cite{bharate_classification_2020}.
\citeauthor{kaur_semi-automatic_2018} proposed a method based on the k-means clustering algorithm, which in image processing creates clusters of similarly colored regions in the image. In their image processing pipeline, the number of clusters generated by the k-means algorithm and the color properties of the individual clusters are used to identify whether the plant in the given image is infected with some kind of disease. In their paper, they presented this algorithm in combination with manual segmentation, masking the region of interest (ROI) by hand \cite{kaur_semi-automatic_2018}.
Another way of linearly classifying is by utilizing a support vector machine (SVM). \citeauthor{prakash_detection_2017} proposed a method that first extracts various features from a given citrus plant leaf image using statistical Gray-Level Co-Occurrence Matrix (GLCM), which subsequently are input into an SVM to classify the leaf as healthy or infested by a disease. They managed to achieve an accuracy of 90\% across a test set of 60 images \cite{prakash_detection_2017}.
\citeauthor{suresha_recognition_2017} proposed a method for detecting diseases in paddy (rice plant) leaves using the geometric features of visual indicators. First, they used Otsu segmentation \cite{otsu_threshold_1979} with a global threshold after converting the input image to the HSV color space to segment the image by color. The resulting regions were then analyzed for their geometric features, which were used to classify the disease with the k-nearest-neighbor shallow learning method. They achieved an accuracy of 76.6\% in classifying three different types of diseases on paddy leaves \cite{suresha_recognition_2017}.
In comparison to shallow learning, Deep Learning approaches, which utilize Deep Learning Neural Networks (DLNN) to classify plant diseases, show more accurate performance \cite{yao_machine_2023, sujatha_performance_2021}.
As both \citeauthor{yao_machine_2023} and \citeauthor{sujatha_performance_2021} pointed out, Deep Learning is also much more popular, owing to its higher accuracy and flexibility \cite{yao_machine_2023, sujatha_performance_2021}.
Using deep learning, \citeauthor{khalid_real-time_2023} utilized a fine-tuned version of the general-purpose object detection model \textit{YOLO} to detect unhealthy regions in pictures of individual leaves, achieving an accuracy of 95.12\% \cite{khalid_real-time_2023}.
Many other approaches use general-purpose image recognition networks like AlexNet, GoogLeNet, and VGGNet, which are fine-tuned to the task of detecting illness spots in plant leaves \cite{applalanaidu_review_2021}.
A method for increasing the variability of the training data, and thus the model's adaptability to different inputs, is data augmentation. Generally, this technique changes an existing image in a way that discourages the model from learning unimportant properties. For example, a model trained on a dataset consisting solely of images of leaves with the stem facing downwards could not classify an image of a rotated leaf as effectively; this is what data augmentation tries to tackle \cite{perez_effectiveness_2017}. \citeauthor{wongbongkotpaisan_plant_2021} proposed a two-fold data augmentation method. In local augmentation, they identify diseased regions in the image using Otsu global thresholding in the L*a*b* color space to increase the ratio of diseased regions relative to the leaf area. In global augmentation, they apply probabilistic changes to rotation, brightness, and blurriness. With this data augmentation, they achieved greater accuracy across different model architectures as well as faster loss and accuracy convergence \cite{wongbongkotpaisan_plant_2021}.
A study by \citeauthor{xu_research_2022} proposes an alternative methodology that first involves segmenting images of individual leaves. Subsequently, it assesses the shape and any deviations from a standard leaf morphology to infer the health status of the corresponding plant. \cite{xu_research_2022}
\section{Nutrient analysis} \label{sec:nutrient_analysis}
To analyze the nutrient prevalence in plants, most of the time hyperspectral analysis is used. This means that electromagnetic frequencies that are captured by a camera are analyzed and used to predict what nutrients are prevalent in a plant leaf. For example, \citeauthor{pandey_high_2017} conducted this type of analysis on maize and soy plants and achieved quite usable results with $R^2=0.68$ across both plant types and all recorded nutrients (Mg, Ca, Cu). \cite{pandey_high_2017}
\citeauthor{as_image_2022} utilize hyperspectral analysis, similar to the nutrient analysis approach, to predict moisture levels in two different soil types, Andosol \& Alluvial, with a precision of 94.50\% \& 87.62\%. \cite{as_image_2022}
\section{Plant Growth analysis} \label{sec:plant_growth_analysis}
Another indicator for evaluating the healthiness of a plant is inspecting and monitoring its growth, following the principle that a healthy plant can grow faster.
Research on plant growth monitoring using image processing/computer vision is quite limited, and only a few viable papers are available. For example, \citeauthor{matsui_computer_1976} presented an early system that uses a rotary table for the plant and a 90° circle-segment rail for camera movement, enabling every angle of a plant to be captured to analyze its growth \cite{matsui_computer_1976}. Another study by \citeauthor{li_measuring_2020}, released in 2020, used a top-down view of plants to analyze their growth \cite{li_measuring_2020}.
Building on this work, \citeauthor{gupta_image_2022} conducted another study in which they used only side-profile pictures to identify plant heights, with a minuscule mean squared error of 0.003–0.006 m between measured and actual plant height \cite{gupta_image_2022}.
Depending on the algorithms used, it may be necessary to execute image segmentation before continuing in the pipeline, as shown in Figure \ref{fig:pipeline}. This segmentation may serve purposes such as extracting regions of the image that solely depict soil for soil analysis or isolating individual leaves from the image for nutrient and health analysis.
For the segmentation of leaves, \citeauthor{lin_self-supervised_2023} proposed a self-supervised learning approach to separate the leaves from an image \cite{lin_self-supervised_2023}.
\chapter{Methodology}
The general aim of this work is to propose a pipeline that is designed to detect diseases in an image captured by a camera that contains multiple leaves of different plants.
This pipeline will consist of multiple stages that will handle different subtasks to achieve the general goal of detecting diseases in plant leaves. The multiple stages are as follows:
\textbf{Leaf segmentation} (Section \ref{sec:method_segmentation}) is used to detect individual leaves in the image. This is necessary as the Disease Detection stage is not able to distinguish between multiple leaves in different states. Furthermore, the image background, which this stage is seeking to eliminate, may bring unwanted noise into the disease detection step, potentially skewing its results. This stage will consist of two approaches, which will be compared in terms of performance. One approach involves first segmenting the image panoptically and subsequently classifying each resulting region (see Section \ref{sec:meth_panoptic}). The other approach is instance segmentation, wherein a model outputs only the regions of a specific class within an image (see Section \ref{sec:meth_instance}).
The \textbf{Disease detection} (Section \ref{sec:method_disease}) stage utilizes the leaf region outputs generated from the preceding Leaf segmentation stage and subsequently classifies each leaf into healthy and diseased categories based on these inputs.
This two-stage approach has been chosen for several reasons: The model developed through this work should be usable in a variety of scenarios, including ones where multiple plants are present in the captured image. To enable the detection of diseases in each individual plant, a segmentation step is necessary. Additionally, the datasets designed for differentiating between diseased and healthy plants in most cases present individual leaves, making them unsuitable for directly classifying images that contain multiple leaves. To optimize the performance of models trained on such datasets, similar inputs need to be produced. Thus, the segmentation stage is employed, which outputs single leaf areas. This should also increase the accuracy of the model and its resilience toward noise in the background, which may contribute to false classifications.
\section{Leaf Segmentation} \label{sec:method_segmentation}
In this work, the subject of leaf segmentation will be examined using two different approaches to achieving the same task. The panoptic segmentation approach will utilize a model that generates panoptic segmentation data for an input image. These segments are void of any class attribution and merely represent individual regions in an image. An example output can be seen in Figure \ref{fig:panoptic_example}, where each region generated by the panoptic segmentation is colored individually. Subsequent to the segmentation, a classification will be employed that aims to filter relevant regions (i.e., regions containing a leaf) from irrelevant ones, which can then be passed to the next stage.
In instance segmentation, the segmentation and classification steps are combined. The model is trained to only output relevant regions, while simultaneously differentiating between instances of regions with the same semantic label. This means the model is able to distinguish between the background and leaf regions, multiple leaf instances, and generate a region mask for each of them.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/panoptic_example.png}
\caption{Example output of panoptic segmentation. Each region generated by the PS is colored individually}
\label{fig:panoptic_example}
\end{figure}
\subsection{Panoptic Segmentation} \label{sec:meth_panoptic}
The technique of panoptic segmentation (PS) combines semantic segmentation and instance segmentation. This approach allows for the division of an image into regions of varying semantic significance while concurrently distinguishing between different instances within the same semantic category.
This technique will be employed for leaf segmentation as a two-stage process, in which the panoptic segmentation model will segment the image into significant regions. This will yield a comprehensive set of all the semantic regions within the image. To focus exclusively on the regions that represent a leaf, each identified region will be processed through a binary classification model designed to ascertain whether the respective region is a leaf or not.
In this case, the Segment Anything Model (SAM) \cite{kirillov_segment_2023} published by Meta will be used as the panoptic segmentation backbone. As this model is trained to generate segmentations from input prompts, such as bounding boxes or points lying around or inside the object area, the provided \textit{AutomaticMaskGenerator} will be used to automatically generate the necessary input prompts and allow the image to be panoptically segmented without manual input. An example output of SAM can be seen in Figure \ref{fig:panoptic_example}.
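A minimal sketch of how this automatic mask generation could be invoked via Meta's \textit{segment-anything} library is shown below; the checkpoint file, model type, and input image are placeholders rather than values prescribed by this work.
\begin{verbatim}
# Sketch: panoptic-style mask generation with SAM's automatic mask
# generator; checkpoint path, model type, and image are placeholders.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
# Each entry contains a binary 'segmentation' mask, a 'bbox',
# and a 'predicted_iou' confidence score.
masks = mask_generator.generate(image)
\end{verbatim}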
Using PS for segmentation avoids the need to directly identify leaves with a custom-trained model. It leverages all semantic regions and determines their importance, allowing the use of a pretrained panoptic segmentation model. This pretrained model handles the complex segmentation task, which usually needs extensive data for training.
The second stage, i.e. the classification of regions, will consist of a custom classification model and require actual training and/or finetuning. For this stage, multiple approaches will be discussed.
\subsubsection{Autoencoder} \label{sec:method_autoencoder}
An autoencoder is a type of convolutional neural network that consists of an encoder, a bottleneck, and a decoder. The encoder converts the input of the model into a latent representation, which is then passed through the bottleneck to the decoder, which tries to reconstruct the original model input from the latent space representation (as depicted in Figure \ref{fig:autoencoders}).
The autoencoder is trained in an unsupervised manner, enabling it to effectively handle a single class during the training process and alleviating the necessity of defining a negative (non-leaf) class for training. Throughout training, the model's weights are fine-tuned to minimize the deviation between the output and the input, leveraging the latent representation within the network.
The concept of the \textit{anomaly detection} process is that the model learns to recreate images of the presented class more accurately than images that do not belong to that class.
The conformity to the presented class is determined by the divergence (that is, the pixel-wise mean square error) of the output of the model from the input. By thresholding this divergence, a single-class classification can be achieved \cite{bank_autoencoders_2021}.
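The following sketch illustrates this thresholding step; the threshold value is a hypothetical placeholder that would have to be calibrated on validation data, and the region is assumed to be a normalized image tensor.
\begin{verbatim}
# Sketch: accept a region as a leaf if the autoencoder reconstructs
# it with a pixel-wise MSE below a (placeholder) threshold.
import torch
import torch.nn.functional as F

def is_leaf(autoencoder, region, threshold=0.01):
    # region: tensor of shape (1, 3, 224, 224), values in [0, 1]
    autoencoder.eval()
    with torch.no_grad():
        reconstruction = autoencoder(region)
    mse = F.mse_loss(reconstruction, region).item()
    return mse < threshold
\end{verbatim}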
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/autoencoder.png}
\caption{Visual representation of the autoencoder architecture \cite{kumawat_everything_2023}}
\label{fig:autoencoders}
\end{figure}
The autoencoder utilized in this work comprises a symmetric encoder-decoder architecture. The encoder is constructed with five convolutional layers, which reduce the spatial dimensions of the input to successive sizes of 112x112, 56x56, 28x28, 14x14, and ultimately 7x7. This output is subsequently flattened and fed into a bottleneck layer, which consists of a feedforward neural network (FFNN) with 200 dimensions. Following this, the decoder performs deconvolution on the bottleneck's output, mirroring the encoder's process in reverse order. The overall architecture of the autoencoder is also depicted in Figure \ref{fig:autoencoder_arch}.
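A sketch of this architecture in PyTorch is given below; the channel widths are illustrative assumptions, and only the spatial sizes (224 down to 7) and the 200-dimensional bottleneck follow the description above.
\begin{verbatim}
import torch
import torch.nn as nn

class LeafAutoencoder(nn.Module):
    def __init__(self, channels=(3, 16, 32, 64, 128, 256), latent_dim=200):
        super().__init__()
        self.c_last = channels[-1]
        enc = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # stride-2 convolutions halve the spatial size:
            # 224 -> 112 -> 56 -> 28 -> 14 -> 7
            enc += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)
        self.to_latent = nn.Linear(self.c_last * 7 * 7, latent_dim)
        self.from_latent = nn.Linear(latent_dim, self.c_last * 7 * 7)
        dec, rev = [], channels[::-1]
        for c_in, c_out in zip(rev[:-1], rev[1:]):
            # transposed convolutions mirror the encoder in reverse order
            dec += [nn.ConvTranspose2d(c_in, c_out, 3, stride=2,
                                       padding=1, output_padding=1),
                    nn.ReLU()]
        self.decoder = nn.Sequential(*dec[:-1])  # drop the final ReLU

    def forward(self, x):
        z = self.to_latent(self.encoder(x).flatten(1))
        y = self.from_latent(z).view(-1, self.c_last, 7, 7)
        return torch.sigmoid(self.decoder(y))
\end{verbatim}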
\begin{figure}
\centering
\includegraphics[width=\linewidth]{graphics/autoencoder_arch.png}
\caption{The architecture of the presented Autoencoder}
\label{fig:autoencoder_arch}
\end{figure}
The training process implemented on the autoencoder involved 80 epochs utilizing the UrbanStreet leaves dataset \cite{yang_urban_2023}. Training was conducted specifically with the masked versions of the leaves, using the provided segmentation mask. The leaves were cropped to that mask and then resized to 224x224 pixels, resembling the data that result from the segmentation stage. An example can be seen in Figure \ref{fig:leaf_example}.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/leaf_dataset_example.png}
\caption{An example picture from the dataset used for training the autoencoder}
\label{fig:leaf_example}
\end{figure}
\subsubsection{Convolutional Neural Network}
A convolutional neural network (CNN), such as ResNet \cite{he_deep_2015}, can also be used as a classification model. In a manner similar to an Autoencoder, the network initially generates a latent representation of an image, referred to as a feature map, which is subsequently processed by a Feed-Forward network (FFN). This FFN is responsible for deriving the likelihood that the input belongs to a specific class. This approach, in contrast to the autoencoder, requires supervised learning, i.e. labeled data, as well as multiclass data. For the specific use case at hand, this means that a leaf class and a non-leaf class need to be defined for the model to be able to discern features for the various classes \cite{he_deep_2015}.
For the leaf classification task, ResNet-50 was chosen as the specific architecture (illustrated in more detail in Section \ref{sec:meth_resnet}). It differs from pure CNNs in that residual blocks are used instead of purely convolutional blocks (multiple layers, including convolutional layers). In addition to convolutional layers, residual blocks contain skip connections. These skip connections add the input of the block directly to its output. This approach allows the network to learn whether the convolutional layers in the block derive meaningful information (the layer output is preferred) or not (the skip connection prevails) \cite{he_deep_2015, choudhary_comprehensive_2023}.
The specific model used was ResNet-50 with weights pre-trained on the ImageNet dataset \cite{deng_imagenet_2009} (\verb|ResNet50_Weights.IMAGENET1K_V2|) sourced from PyTorch's model library \cite{paszke_pytorch_2019}.
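The following lines sketch how this pretrained model can be loaded with torchvision; replacing the final fully connected layer with a two-class head is an assumption about how the leaf/non-leaf classifier is attached, not a detail taken from the configuration above.
\begin{verbatim}
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # leaf vs. non-leaf head
\end{verbatim}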
Another CNN selected to tackle the leaf classification task is InceptionV3. Its architecture is highlighted in Section \ref{sec:meth_disease_inception}. This network will be trained the same way as ResNet and is also initialized with pre-trained weights from PyTorch hub \cite{paszke_pytorch_2019} that were obtained by training on the ImageNet \cite{deng_imagenet_2009} dataset.
The fine-tuning of ResNet and Inception comprises 10 epochs, for which the Urban Street Leaves dataset \cite{yang_urban_2023}, with the provided segmentation mask applied to closely match the output of SAM, will serve as the positive leaf class, and 6,221 objects from the Open Images V7 dataset \cite{kuznetsova_open_2020}, with their masks applied as well, will represent the negative non-leaf class. The learning rate throughout the training was scheduled using the one-cycle policy proposed by \citeauthor{smith_super-convergence_2018}: following cosine curves, the learning rate is increased from 0.005 to 0.01 during the first quarter of training and then decreased to 0.00001 over the remainder of the training loop. This type of learning rate scheduling aims to reduce overfitting of the model and to speed up convergence \cite{smith_super-convergence_2018}.
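A sketch of this schedule using PyTorch's \verb|OneCycleLR|, continuing with the ResNet-50 model loaded above, is shown below; the number of steps per epoch is a placeholder, and the choice of SGD as the optimizer is an assumption, as the optimizer is not specified here.
\begin{verbatim}
import torch
from torch.optim.lr_scheduler import OneCycleLR

steps_per_epoch = 1000  # placeholder: len(train_loader) in practice
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=10,
    steps_per_epoch=steps_per_epoch,
    pct_start=0.25,         # peak after the first quarter of training
    anneal_strategy="cos",
    div_factor=2,           # initial lr = 0.01 / 2 = 0.005
    final_div_factor=500,   # final lr = 0.005 / 500 = 1e-5
)
\end{verbatim}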
The third CNN employed for leaf classification is YOLOv8 \cite{yao_hp-yolov8_2024}. This network, akin to ResNet, is also formulated for object detection but employs a distinct internal architecture to produce its predictions, as elaborated in Section \ref{sec:yolo_arch}. The model delineated in Section \ref{sec:yolo_arch} is distinct from the object detection variant used here, in that the latter lacks a segmentation head, which is tasked with generating a mask output. The object detection variant comprises exclusively a classification head and a detection head, which are responsible for generating class and bounding-box predictions, respectively.
YOLOv8 for leaf classification will be trained on the Urban Street Leaves dataset \cite{yang_urban_2023}, which comprises 9,763 single-leaf images spanning 39 leaf classes along with a segmentation annotation for each leaf. Training will be carried out for a total of 100 epochs using the training loop in the \textit{ultralytics} Python library \cite{jocher_ultralytics_2023}.
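Invoking the \textit{ultralytics} training loop is sketched below; the pretrained weights file, the model size, and the dataset path are placeholders, and the use of the classification variant of the weights is an assumption.
\begin{verbatim}
from ultralytics import YOLO

# Placeholder weights and dataset path; the dataset folder is expected
# to contain train/ and val/ subdirectories with one folder per class.
model = YOLO("yolov8n-cls.pt")
model.train(data="urban_street_leaves/", epochs=100, imgsz=224)
\end{verbatim}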
\subsection{Instance Segmentation} \label{sec:meth_instance}
Instance segmentation differs from panoptic segmentation by focusing exclusively on segmenting specific regions within an image. In this context, the targeted segmentation will be limited to the regions that contain leaves. This methodology offers a distinct advantage over panoptic segmentation, as the presence of a classification stage is rendered unnecessary; the instance segmentation inherently encompasses the classification aspect.
The three architectures for instance segmentation utilized in this work are the aforementioned YOLOv8 \cite{yao_hp-yolov8_2024}, the Mask R-CNN model \cite{he_mask_2018}, and RetinaNet \cite{lin_focal_2018}.
It should be noted that, by their design, the outputs of the presented models differ in that the Mask R-CNN and the YOLOv8seg architecture, by the addition of the Mask Head, produce not only bounding boxes, but also a binary mask for each class. In contrast, RetinaNet only generates bounding boxes for each detected object.
\subsubsection{YOLOv8} \label{sec:yolo_arch}
The YOLOv8 architecture comprises various variants tailored for specific applications. In this study, the YOLOv8 segmentation architecture (YOLOv8seg) will be used, as the underlying task is a segmentation task \cite{jocher_ultralytics_2023}.
The architecture of the YOLOv8seg model is composed of 3 main stages: backbone, neck, and head. In the backbone stage, the convolutional layers are organized to detect various features across different spatial resolutions, thereby facilitating a robust understanding of the input images. The neck stage integrates feature maps from the backbone to create a comprehensive representation of the detected objects across different scales. This is achieved through the utilization of feature pyramids, which aid in maintaining an accurate detection rate irrespective of the object's size. In the head stage, the architecture employs diverse heads that specialize in distinct prediction tasks: one for bounding-box localization, another for generating segmentation masks, and a third for class prediction. This multi-head configuration allows for a more effective interpretation of the features processed by the previous stages. In general, the structured approach of the YOLOv8seg model contributes to its efficacy in handling object detection and segmentation challenges in real-time applications \cite{pedro_detailed_2023, timilsina_yolov8_2024}. An illustration of the architecture of YOLOv8 can be seen in Figure \ref{fig:yolov8_architecture}.
YOLOv8seg will be trained utilizing the predefined training loop supplied by \textit{ultralytics}, which is integrated within their Python library. The training process will take place over 100 epochs, in each of which the entire synthetic dataset (as discussed in Section \ref{sec:data_aug}) is processed once, followed by a comprehensive evaluation on the validation set.
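The corresponding call for the segmentation variant is sketched below; the dataset YAML, its paths, and the chosen model size are placeholders rather than the exact configuration used in this work.
\begin{verbatim}
from ultralytics import YOLO

# leaves.yaml (placeholder) points to the synthetic dataset, e.g.:
#   train: synthetic/images/train
#   val:   synthetic/images/val
#   names: {0: leaf}
model = YOLO("yolov8n-seg.pt")
model.train(data="leaves.yaml", epochs=100, imgsz=640)
\end{verbatim}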
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/yolov8_architecture.png}
\caption{Illustration of the building blocks that make up the YOLOv8 architecture \cite{yao_hp-yolov8_2024}}
\label{fig:yolov8_architecture}
\end{figure}
\subsubsection{Mask R-CNN} \label{sec:method_maskrcnn}
Mask R-CNN is an instance segmentation architecture based on the Faster R-CNN object detection framework \cite{ren_faster_2016}. It consists of the following components \cite{he_mask_2018, potrimba_what_2023}:
\begin{itemize}
\item \textbf{Backbone Network:} The backbone network is usually a pre-trained object detection network such as \textit{ResNet} \cite{he_deep_2015} or \textit{MobileNet} \cite{howard_mobilenets_2017}. It is responsible for extracting features from the image that can represent objects.
\item \textbf{Feature Pyramid Network (FPN):} An FPN generates a Feature Pyramid from the proposed features of the Backbone Network. This Feature Pyramid will be utilized by the subsequent components to enable the detection of features in different variations of size and scale.
\item \textbf{Region Proposal Network (RPN):} From the multiscale feature representations of the FPN, the RPN generates proposals for regions of interest (ROI), that is, regions which are likely to contain an object of interest.
\item \textbf{ROI-Align:} The ROI-Align Stage combines the generated features of the backbone network and the regional proposals of the FPN by dividing the region proposals into a grid and attributing the features of the Backbone Network to the ROI.
\item \textbf{Mask Head:} This is the main feature that differentiates Mask R-CNN from Faster R-CNN and the one that allows the architecture to produce the segmentation of objects. The Mask Head is a Convolutional Neural Network that produces binary masks for each class from the features obtained by the ROIAlign stage.
\end{itemize}
Mask R-CNN training will be performed using the TensorFlow ModelGarden \cite{yu_tensorflow_2020}. This project provides several model configurations as well as a framework for conducting training and evaluation on these models. For Mask R-CNN, a ResNet object detection backbone was used. The training process was configured for 750,000 steps with stochastic gradient descent (SGD) with a momentum of 0.9, utilizing a stepwise decaying learning rate of:
\[
\mathrm{lr}(\mathit{step})=
\begin{cases}
0.12, & \text{if } \mathit{step} \leq 15000\\
0.012, & \text{if } 15000 < \mathit{step} \leq 20000\\
0.0054, & \text{if } \mathit{step} > 20000
\end{cases}
\]
Additionally, a linear learning rate warmup phase of 500 steps with a warmup rate of 0.0067 was employed.
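Expressed as a plain function of the global step, and assuming the warmup interpolates linearly from the warmup rate of 0.0067 to the initial rate of 0.12, the schedule reads as follows.
\begin{verbatim}
def mask_rcnn_lr(step: int) -> float:
    # 500-step linear warmup from 0.0067 to 0.12 (assumed interpolation),
    # followed by the stepwise decay defined above.
    if step < 500:
        return 0.0067 + (0.12 - 0.0067) * step / 500
    if step <= 15000:
        return 0.12
    if step <= 20000:
        return 0.012
    return 0.0054
\end{verbatim}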
\subsubsection{RetinaNet}
In contrast to the two-stage approach of Mask R-CNN, which involves separate stages for region proposal and for classification \& bounding-box regression, RetinaNet proposes a one-stage approach. The differences between the one-stage and two-stage architectures can be seen in Figure \ref{fig:maskrcnn_retinanet_arch}.
The RetinaNet architecture is composed of the following components \cite{lin_focal_2018}:
\begin{itemize}
\item \textbf{Backbone Network:} Similar to Mask R-CNN, RetinaNet also utilizes an object detection network as the backbone for instance segmentation. This backbone network is then combined with a Feature Pyramid Network.
\item \textbf{Feature Pyramid Network (FPN):} The FPN creates multiscale representations of the features generated by the Backbone Network. Instead of feeding its output to an RPN, RetinaNet generates $A$ anchors, each consisting of a class vector indicating the probability that a class is present at this anchor and a 4-element vector describing the bounding box of a predicted object.
\item \textbf{Classification Subnet:} This Subnet calculates an object class probability vector for each of the $A$ anchors it is given.
\item \textbf{Box Regression Subnet:} In parallel to the Classification Subnet, the Box Regression Subnet calculates the regression from each predicted anchor's bounding-box vector to the ground-truth bounding boxes. This bounding-box prediction, in contrast to Mask R-CNN, is class-agnostic.
\end{itemize}
Training of RetinaNet was conducted in the same way as the training of Mask R-CNN described in Section \ref{sec:method_maskrcnn}.
\begin{figure}[h]
\centering
\includegraphics[width=0.8\linewidth]{graphics/maskrccn_retinanet.png}
\caption{Difference in architecture between two-stage and one-stage instance segmentation models, shown exemplarily with Faster R-CNN (the version of Mask R-CNN without segmentation) and RetinaNet \cite{carranza-garcia_performance_2021}}
\label{fig:maskrcnn_retinanet_arch}
\end{figure}
\subsection{Data Augmentation} \label{sec:data_aug}
The presented datasets provide multiple limiting factors for training a segmentation model:
\begin{itemize}
\item \textbf{Single Instances:} The UrbanStreet dataset \cite{yang_urban_2023} only provides images of single leaves and therefore only contains semantic segmentation information; it is thus not suited for training a model on instance segmentation.
\item \textbf{Low Variability:} The instance segmentation datasets (GrowliFlower \cite{kierdorf_growliflower_2023}, Phenobench \cite{weyler_phenobench_2023} and the Leaf Segmentation Challenge (LSC) dataset \cite{minervini_finely-grained_2016}) have a high specialization, with only a narrow range of different plant species. Of the data sets presented, each provides segmentation data only for a single species or at most two species. This will impact the ability of the model to work with generic input data and various plant species.
\item \textbf{Perspective:} The images are solely captured from a top-down perspective, providing clear visibility and a uniform flat view of plant leaves. This also provides a clear visibility of the arrangement of plants around the stem, which is not always given in real-world usage.
\item \textbf{Background:} As the images are not densely populated with plants and are captured in a field with soil as a substrate, the background, i.e., the part of the picture that is not a plant, is visually clearly separable from the plant leaves themselves. This does not provide viable data for distinguishing leaf regions from background regions.
\item \textbf{Data quality:} The Phenobench dataset \cite{weyler_phenobench_2023} does not provide the quality of segmentation data necessary to train a segmentation model reliably. The annotations contained in this dataset are supposed to represent single-leaf instances. However, upon inspection, it is evident that a majority of the masks that should only contain single leaf regions wrongly contain multiple leaf regions. This severely limits a model's ability to learn features identifying single leaf regions.
\end{itemize}
Due to these limitations, existing datasets present a challenge when it comes to their applicability in real-world scenarios and their suitability for training. Consequently, a data augmentation technique will be employed to produce synthetic datasets for training, validating, and assessing the models.
In this context, a synthetic dataset will be created using single-leaf segmentations. To achieve this, an approach akin to the Na\"ive Collage technique proposed by \citeauthor{kuznichov_data_2019} will be utilized \cite{kuznichov_data_2019}.
This approach is fundamentally based on the UrbanStreet \cite{yang_urban_2023} dataset. The dataset consists of 9,763 images across 39 classes captured in city scenes, which are provided in conjunction with a simple segmentation mask. Each of the images provides a single leaf and the respective leaf segmentation mask, providing optimal data for generating a synthetic dataset.
From this dataset, a random number of images $n \in_R [N_{min};N_{max}]$ is selected, and the individual leaves are masked out of the original RGB images using the provided mask image, applying a pixel-wise Boolean operation. Subsequently, a set of random transformations is applied:
\begin{itemize}
\item Rotation with an angle $\theta \in_R [-\theta_r; \theta_r]$
\item Translation with offset values $x \in_R [-\mathcal{O}_X; \mathcal{O}_X]$ and $y \in_R [-\mathcal{O}_Y; \mathcal{O}_Y]$
\item Scaling with a factor of $s \in_R [s_l; s_u]$ after resizing to the resulting image size
\end{itemize}
This generated leaf overlay is then combined with a randomly sampled image from the City Street View Dataset \cite{stealth_username_city_2022}, a dataset that comprises 50,000 images obtained from Google's Street View API across five cities: San Francisco, Detroit, Chicago, Washington, and New York City.
Using this method, a dataset consisting of 8,000 training images, 2,000 validation images, and 2,000 test/evaluation images will be created.
In this specific case, the following parameters are used for the dataset generation: $N_{min} = 4$, $N_{max} = 10$, $\theta_r = 45^\circ$, $\mathcal{O}_X = \mathcal{O}_Y = 100\,\mathrm{px}$, $s_l = 0.2$, and $s_u = 0.7$.
A sample image created with the data augmentation can be seen in Figure \ref{fig:data_aug_example}.
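A simplified sketch of this collage generation using PIL is given below; the file handling, the exact scaling convention, and the placement details are assumptions, and the parameters default to the values listed above.
\begin{verbatim}
import random
from PIL import Image

def make_collage(leaf_mask_pairs, background,
                 n_range=(4, 10), angle=45, offset=100, scale=(0.2, 0.7)):
    # leaf_mask_pairs: list of (leaf_image_path, mask_image_path) tuples
    canvas = background.copy()
    n = random.randint(*n_range)
    for leaf_path, mask_path in random.sample(leaf_mask_pairs, n):
        leaf = Image.open(leaf_path).convert("RGBA")
        mask = Image.open(mask_path).convert("L")
        leaf.putalpha(mask)                      # cut the leaf out of the photo
        leaf = leaf.rotate(random.uniform(-angle, angle), expand=True)
        s = random.uniform(*scale)               # random scaling factor
        leaf = leaf.resize((int(leaf.width * s), int(leaf.height * s)))
        x = canvas.width // 2 + random.randint(-offset, offset)
        y = canvas.height // 2 + random.randint(-offset, offset)
        canvas.paste(leaf, (x, y), leaf)         # paste using the alpha mask
    return canvas
\end{verbatim}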
\begin{figure}[h]
\centering
\includegraphics[width=0.5\linewidth]{graphics/data_aug_example.png}
\caption{An example picture generated by the presented Na\"ive Data Augmentation algorithm}
\label{fig:data_aug_example}
\end{figure}
\section{Disease Detection} \label{sec:method_disease}
For classification tasks similar to this one, a considerable amount of research has already been conducted. Traditionally, the models employed in classification are designed to differentiate between multiple distinct classes, such as distinguishing between an image of a cat and one of a dog. In this particular case, the two classes do not possess such high distinctiveness. The healthy class shares numerous features with the diseased class, including shape, color, structure, and texture; the primary distinctions lie in a few minor features, namely the regions indicative of the disease on the leaf. The minuscule differences between healthy and diseased plants can be seen in Figure \ref{fig:healthy_diseased_arjun}.
\begin{figure}
\centering
\includegraphics[width=0.75\linewidth]{graphics/healthy_diseased_arjuna.jpg}
\caption{Picture of a healthy (left) and diseased (right) leaf of the Terminalia arjuna plant from the PlantLeaves dataset \cite{chouhan_database_2019}}
\label{fig:healthy_diseased_arjun}
\end{figure}
For the disease detection stage, 8 different models will be discussed. The models under discussion will be trained utilizing a combination of the PlantaeK \cite{kour_plantaek_2019}, PlantVillage \cite{hughes_open_2016}, and PlantLeaves \cite{chouhan_database_2019} datasets. The datasets contain 2,153, 54,303, and 4,502 images of plant leaves, respectively, which are either healthy or diseased, with a different range of diseases for each dataset. As the goal of this work is to classify leaves as healthy or diseased, all of the images in any of the non-healthy categories are merged into a single \textit{diseased} category. From the resulting aggregated dataset, the training, validation, and test sets are split with an 80\%, 10\%, and 10\% ratio. The resulting training, validation, and test sets consist of 19,736, 2,228, and 2,934 healthy and 41,591, 4,587, and 6,265 diseased leaf images, respectively.
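A sketch of merging all non-healthy categories into a single diseased class and performing the 80/10/10 split is given below; the directory layout, file extension, and label convention are assumptions.
\begin{verbatim}
import random
from pathlib import Path

random.seed(42)
samples = []
for root in ("plantaek", "plantvillage", "plantleaves"):  # placeholder paths
    for img in Path(root).rglob("*.jpg"):
        # every non-"healthy" class folder is mapped to "diseased"
        label = "healthy" if "healthy" in img.parent.name.lower() else "diseased"
        samples.append((img, label))

random.shuffle(samples)
n = len(samples)
train = samples[: int(0.8 * n)]
val = samples[int(0.8 * n): int(0.9 * n)]
test = samples[int(0.9 * n):]
\end{verbatim}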
To closely match the output of the segmentation stage (see Figure \ref{fig:leaf_example}), the previously discussed panoptic segmentation approach will be used to create a segmented version of every image in the dataset. For this, SAM will be used again in combination with a leaf classifier, namely ResNet, to determine the leaf region in the image. SAM will be prompted with the center point of the image, as upon visual inspection a majority of the images in the dataset are centered over the leaf. From the resulting region proposals generated by SAM, the one with the highest score $S_{mask}$ is selected (see Equation \eqref{eqn:mask_score}, where $C_{SAM}(m)$ denotes the mask confidence generated by SAM for a region $m$ and $C_{leaf}^{ResNet}(m)$ is the probability that the region $m$ is a leaf, as generated by the ResNet leaf classifier discussed in Section \ref{sec:meth_panoptic}).
\begin{equation} \label{eqn:mask_score}
S_{mask}(m) = C_{SAM}(m) \cdot C_{leaf}^{ResNet}(m)
\end{equation}
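In code, selecting the best mask could look like the following sketch; \verb|crop_to_mask| is a hypothetical helper, and using SAM's \verb|predicted_iou| field as $C_{SAM}$ is an assumption.
\begin{verbatim}
def best_leaf_mask(sam_masks, leaf_probability, image):
    # sam_masks: list of dicts as returned by the SAM mask generator;
    # leaf_probability: the ResNet leaf classifier's P(leaf | region).
    def score(mask):
        region = crop_to_mask(image, mask["segmentation"])  # hypothetical helper
        return mask["predicted_iou"] * leaf_probability(region)
    return max(sam_masks, key=score)
\end{verbatim}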
The training loop will be executed for 36 epochs, in each of which the whole dataset is processed once. Training will be halted early if no decrease in the validation loss is detected; specifically, training stops once the validation loss has not improved by more than $\delta = 0.1$ over a period of 5 epochs. The training loop incorporates a cosine annealing/decay learning rate scheduler with a warmup phase of a quarter epoch, using an initial learning rate of 0.001 that increases to 0.01 during warmup and decays to $1 \cdot 10^{-9}$ over the rest of the training. The course of the learning rate is shown in Figure \ref{fig:disease_detection_lr}.
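The schedule and stopping criterion can be sketched as follows; the step-level formulation, the linear warmup, and the Keras callback configuration are assumptions consistent with the description above rather than the exact implementation.
\begin{verbatim}
import math
import tensorflow as tf

def learning_rate(step, steps_per_epoch, epochs=36,
                  lr_start=0.001, lr_peak=0.01, lr_end=1e-9):
    warmup_steps = steps_per_epoch // 4          # quarter-epoch warmup
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:                      # assumed linear warmup to peak
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.1, patience=5, restore_best_weights=True)
\end{verbatim}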
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/disease_detection_training_lr.png}
\caption{Learning rate over epochs utilizing cosine annealing learning rate scheduler}
\label{fig:disease_detection_lr}
\end{figure}
\subsection{AlexNet}
The AlexNet architecture is a fundamental model proposed by \citeauthor{krizhevsky_imagenet_2012}. It is composed of three convolutional blocks, with the first two featuring a single convolutional layer each, while the third block contains three convolutional layers, all utilizing the Rectified Linear Unit (ReLU) activation function. Each convolutional block is concluded with a MaxPooling layer. Subsequently, the convolutional blocks are followed by three fully connected feedforward layers (FF) that ultimately produce the classification probabilities \cite{krizhevsky_imagenet_2012}.
As the architecture and computational complexity of this network are relatively modest, it is not intended to serve as a direct competitor to more complex models. Instead, it functions as a baseline to facilitate an understanding of how simpler models perform in comparison to their more intricate counterparts.
The AlexNet model was sourced from the deep-cv library by \citeauthor{krizhevsky_imagenet_2012} implemented in Keras \cite{chollet_keras_2015}.
\subsection{VGG-19}
The Visual Geometry Group (VGG) networks are an expansion of the AlexNet architecture. The VGG-19 model uses similar convolutional blocks, with 19 indicating the number of weight layers: 16 convolutional and three fully connected (compared to the five convolutional layers present in AlexNet) \cite{simonyan_very_2015}. The structure of the smaller VGG-16 network can be seen in Figure \ref{fig:vgg16}.
The specific VGG-19 implementation was sourced from the deep-cv library by \citeauthor{krizhevsky_imagenet_2012}, implemented in Keras \cite{chollet_keras_2015}.
\begin{figure}
\centering
\includegraphics[width=0.87\linewidth]{graphics/vgg_16.png}
\caption{Architecture of the VGG-16 image recognition network \cite{simonyan_very_2015, bangar_vgg-net_2022}}
\label{fig:vgg16}
\end{figure}
\subsection{ResNet} \label{sec:meth_resnet}
The Residual Networks (ResNets) represent an advancement over the purely convolutional architecture exemplified by VGG networks. Instead of a purely convolutional block, they utilize a so-called residual block. It consists of (usually two) convolutional layers but additionally contains a skip connection at the end, which adds the output of the last convolutional layer to the original input of the block \cite{he_deep_2015}. The described behavior is also visualized in Figure \ref{fig:resnet_block}.
For training, the \verb|ResNet152V2| implementation provided by the Keras framework \cite{chollet_keras_2015} is used without any pretrained weights.
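A minimal Keras-style sketch of such a residual block is given below; the filter count and kernel sizes are illustrative and do not correspond to the exact \verb|ResNet152V2| configuration, which additionally uses bottleneck layers and projection shortcuts where shapes differ.
\begin{verbatim}
# Minimal sketch of a residual block: two convolutions plus a skip connection
# that adds the block input back onto the output (illustrative filter sizes).
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                       # assumes x already has `filters` channels
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])    # the identity skip connection
    return layers.Activation("relu")(y)
\end{verbatim}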
\begin{figure}[h]
\centering
\includegraphics[width=0.85\linewidth]{graphics/resnet.png}
\caption{The mode of operation for a residual block in ResNet \cite{ruiz_understanding_2018, he_deep_2015}}
\label{fig:resnet_block}
\end{figure}
\subsection{ConvNeXt} \label{sec:arch_convnext}
ConvNeXt presents an advancement of the ResNet architecture with many changes inspired by the Vision Transformer (described in Section \ref{sec:method_vit}). One of these changes is the use of larger kernel sizes in the convolutional layers to gain a wider receptive field: instead of ResNet's 3x3 kernels, ConvNeXt uses a kernel size of 7x7, approaching the global attention of the Vision Transformer. ConvNeXt also uses the Gaussian error linear unit (GELU) activation function in place of ResNet's Rectified Linear Unit (ReLU). In addition, instead of regular convolutional layers, ConvNeXt employs depthwise convolutional layers, which apply a separate convolution to each channel instead of a single convolution across all channels simultaneously, allowing channel-specific information to be retained throughout the model \cite{pandey_depth-wise_2018, singh_convnext_2022}.
The architecture of the ConvNeXt network can be seen in Figure \ref{fig:arch_convnext}.
The implementation used for this work is the large variant (ConvNeXtLarge) sourced from the Keras framework \cite{chollet_keras_2015}, initialized without any pre-trained weights.
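A simplified sketch of a ConvNeXt-style block is shown below; the width, expansion ratio, and normalization placement are illustrative and do not exactly match the ConvNeXtLarge configuration.
\begin{verbatim}
# Simplified ConvNeXt-style block: 7x7 depthwise convolution, layer norm,
# pointwise expansion with GELU, pointwise projection, and a residual add.
from tensorflow.keras import layers

def convnext_block(x, dim=96):
    shortcut = x                                       # assumes x has `dim` channels
    y = layers.DepthwiseConv2D(7, padding="same")(x)   # one 7x7 filter per channel
    y = layers.LayerNormalization(epsilon=1e-6)(y)
    y = layers.Dense(4 * dim, activation="gelu")(y)    # pointwise (1x1) expansion
    y = layers.Dense(dim)(y)                           # pointwise projection back to dim
    return layers.Add()([shortcut, y])
\end{verbatim}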
\begin{figure}
\centering
\includegraphics[width=0.75\linewidth]{graphics/convnext_arch.png}
\caption{Architecture of a ConvNeXt network \cite{chen_large-scale_2023}}
\label{fig:arch_convnext}
\end{figure}
\subsection{MobileNet}
MobileNet, as its name suggests, was developed to fit the needs of more efficient deep convolutional neural networks for use on mobile devices. For this, depthwise separable convolution is employed, which reduces computational cost while achieving similar results compared to standard convolution. Depthwise separable convolutions combine depthwise convolutions (discussed in Section \ref{sec:arch_convnext}) and pointwise convolutions, reducing computational cost significantly \cite{howard_mobilenets_2017}. The model used in this work is the third iteration of MobileNet, MobileNetV3. It utilizes a specific activation function called \textit{h-swish}, which replaces the computationally expensive sigmoid in the swish activation function with a piecewise linear approximation based on ReLU. Another adaptation made with version 3 is the introduction of the Squeeze-and-Excite (SE) block. This takes the output of a convolutional layer and aims to model the interdependencies between channels by passing the squeezed output of the convolutional layer through a series of linear layers and multiplying the result with the filter outputs of the convolutional layer, as illustrated in Figure \ref{fig:se_block} \cite{howard_searching_2019, erdogan_squeeze-and-excitation_2022}.
The complete architecture of MobileNetV3 is shown in Figure \ref{fig:arch_mobilenetv3}. The specific implementation for MobileNetV3 was sourced from the Keras framework \cite{chollet_keras_2015} and was initialized without applying pre-trained weights to it.
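A sketch of such an SE block is given below; the reduction ratio of 4 and the use of a plain sigmoid gate (MobileNetV3 uses a hard sigmoid) are simplifying assumptions.
\begin{verbatim}
# Sketch of a Squeeze-and-Excite block: global pooling (squeeze), two small
# dense layers (excite), and channel-wise rescaling of the feature maps.
from tensorflow.keras import layers

def se_block(x, channels, ratio=4):
    s = layers.GlobalAveragePooling2D()(x)                     # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)  # excite (bottleneck)
    s = layers.Dense(channels, activation="sigmoid")(s)        # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                           # rescale feature maps
\end{verbatim}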
\begin{figure}
\centering
\includegraphics[width=0.3\linewidth]{graphics/se_block.png}
\caption{Illustration of the makeup of a SE block \cite{erdogan_squeeze-and-excitation_2022}}
\label{fig:se_block}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/arch_mobilenetv3.png}
\caption{Architecture of the MobileNetV3 network \cite{elsayed_abd_elaziz_evolution_2023}}
\label{fig:arch_mobilenetv3}
\end{figure}
\subsection{InceptionV3} \label{sec:meth_disease_inception}
The InceptionV3 architecture is a later iteration of the original Inception/GoogLeNet architecture proposed by \citeauthor{szegedy_going_2014} \cite{szegedy_going_2014}. The novelty of this architecture is the introduction of Inception blocks, which process their input through several parallel convolutional operations with varying kernel sizes whose outputs are then combined, as seen in Figure \ref{fig:inception_arch}.
\begin{figure}
\centering
\includegraphics[width=0.65\linewidth]{graphics/inception_arch.png}
\caption{Architecture of the InceptionV3 network \cite{iparraguirre-villanueva_convolutional_2022}}
\label{fig:inception_arch}
\end{figure}
\subsection{Vision Transformer} \label{sec:method_vit}
The Vision Transformer (ViT) represents a relatively recent advancement in neural network architecture, introduced by \citeauthor{dosovitskiy_image_2021} in 2021. Transformers have garnered considerable attention in the domain of Natural Language Processing (NLP) tasks \cite{vaswani_attention_2023}.
The ViT approach is designed to apply the principles that a transformer uses for text processing to images, and the architecture of the Vision Transformer is accordingly quite similar. However, instead of using text embeddings to transform words and sentences into a suitable vector space, the ViT utilizes image embeddings of patches created from a regular grid over the image. These patches serve to reduce the amount of data ingested by the network at once. The aforementioned image embeddings are implemented as convolutional layers whose output is then flattened to translate the image patches into a vector space. Otherwise, the architecture of the Vision Transformer closely follows the original transformer architecture \cite{vaswani_attention_2023, dosovitskiy_image_2021}. The architecture is visualized in Figure \ref{fig:vit_arch}.
The specific implementation of the Vision Transformer was sourced from the \verb|vit_keras| library by \citeauthor{fausto_morales_vit_keras_2020} \cite{fausto_morales_vit_keras_2020}, which ports the Google \textit{FLAX} implementation \cite{dosovitskiy_image_2021, tolstikhin_mlp-mixer_2021, steiner_how_2021, chen_when_2021, zhuang_surrogate_2022, zhai_lit_2022} to Keras. The model is loaded without pre-trained weights.
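The patch-embedding step can be sketched as follows; the patch size and embedding dimension are illustrative values and not necessarily those of the model used in this work.
\begin{verbatim}
# Sketch of ViT patch embedding: a convolution with stride equal to the patch
# size projects each patch to a token, and the grid is flattened to a sequence.
from tensorflow.keras import layers

def patch_embed(images, patch_size=16, embed_dim=768):
    x = layers.Conv2D(embed_dim, kernel_size=patch_size, strides=patch_size)(images)
    return layers.Reshape((-1, embed_dim))(x)   # sequence of (num_patches, embed_dim) tokens
\end{verbatim}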
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\linewidth]{graphics/vit_arch.png}
\caption{Architecture of the Vision Transformer \cite{wolfe_using_2022}}
\label{fig:vit_arch}
\end{figure}
\chapter{Results} \label{sec:results}
In this research, the goal was to determine the efficacy of various machine learning architectures and models in detecting diseases that are visually manifested in plants and, more specifically, plant leaves. For this, a pipeline was constructed that is made up of two parts: leaf segmentation and disease detection at the top level. For each of these parts and their respective subdivisions, the results will be presented in (1) model training, (2) model evaluation, and (3) real-world application.
\section{Leaf Segmentation} \label{sec:results_segmentation}
The leaf segmentation stage of the pipeline is responsible for providing the subsequent stage with a set of segmented regions of the image that contain leaves. The proposed approaches to leaf segmentation were evaluated on both a synthetically generated dataset (described in Section \ref{sec:data_aug}) and a real-world dataset, namely the leaf segmentation dataset \cite{giovi_leaf_2024}.
This dataset was used as it contains leaf segmentation data with images taken in real-world settings with complicated backgrounds similar to the segmentation target. The dataset consists of 351 images with an average of $1.42$ annotations per image. The evaluation metrics generated from this dataset are not perfectly representative of the models' segmentation performance, as the background often also contains leaf regions that are not included in the ground-truth segmentation mask. For this reason, the false positive rate will not be weighted as heavily as the other metrics. However, the dataset serves as a good discriminator of a model's ability to segment leaves in front of a complicated background. For every model, the \textit{true positives (TP)}, \textit{false positives (FP)}, \textit{true negatives (TN)} and \textit{false negatives (FN)} were tracked, and from these the metrics \textit{accuracy}, \textit{precision}, \textit{recall}, \textit{F1 score}, \textit{mean IoU}, \textit{mean Dice score} and \textit{specificity} were calculated with the formulas in Equation \eqref{eqn:metrics_formulas}:
\begin{equation} \label{eqn:metrics_formulas}
\begin{aligned}
accuracy_{x} & = \frac{TP_{x} + TN_{x}}{TP_{x} + TN_{x} + FP_{x} + FN_{x}}
\\
precision_{x} & = \frac{TP_{x}}{TP_{x} + FP_{x}}
\\
recall_{x} & = \frac{TP_{x}}{TP_{x} + FN_{x}}
\\
specificity_{x} & = \frac{TN_{x}}{TN_{x} + FP_{x}}
\\
F1_{x} & = \frac{2 * (precision_{x} * recall_{x})}{(precision_{x} + recall_{x})}
\\
IoU_{x} & = \frac{TP_{x}}{TP_{x} + FN_{x} + FP_{x}}
\\
Dice_{x} & = \frac{2 * TP_{x}}{2 * TP_{x} + FP_{x} + FN_{x}}
\end{aligned}
\end{equation}
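As an illustration, the sketch below computes these metrics from boolean prediction and ground-truth masks; guards against division by zero are omitted for brevity and would be needed in practice.
\begin{verbatim}
# Sketch of computing the metrics above from boolean prediction/ground-truth
# masks; zero-division guards are omitted for brevity.
import numpy as np

def mask_metrics(pred, gt):
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   precision,
        "recall":      recall,
        "specificity": tn / (tn + fp),
        "f1":          2 * precision * recall / (precision + recall),
        "iou":         tp / (tp + fn + fp),
        "dice":        2 * tp / (2 * tp + fp + fn),
    }
\end{verbatim}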
Table \ref{tab:metrics_segmentation_synthetic} shows the evaluation performance of the presented models against the synthetically generated test dataset.
YOLOv8 underperforms across all metrics with a precision of 0.0075, recall of 0.0092 and an F1-score of 0.0079, showing that its segmentation capabilities are far inferior to the other, SAM-based models. The combination of ResNet with SAM achieved the highest overall performance with the highest scores across all metrics with a precision of 0.2543, a recall of 0.5921 and an F1-score of 0.3391, along with the highest mean IoU of 0.8257 and mean Dice coefficient (0.8628). This shows that ResNet provides a strong classification model on top of SAM for the task of leaf segmentation.
SAM + Inception also exhibits good performance, achieving the second best mean IoU of 0.7475 and a mean Dice coefficient of 0.7890 behind SAM + ResNet.
YOLOv8 combined with SAM achieved moderate performance, outperforming the standalone YOLOv8, but lagging behind SAM + ResNet and SAM + Inception. Its precision (0.0892), recall (0.3204), and F1 score (0.1095) indicate some improvement, while its mean IoU (0.4128) and mean Dice coefficient (0.4341) suggest limited segmentation accuracy.
The autoencoder (AE) presents itself as the weakest classifier model on top of SAM, achieving a precision of 0.0469, recall of 0.2588, and F1-score of 0.0758, indicating that it is not suited for reliably classifying leaf regions produced by SAM. In general, these results highlight the superiority of SAM-based models, particularly those that leverage ResNet, while emphasizing the challenges of achieving high precision and recall in segmentation tasks.
\begin{table}[]
\centering
\begin{tabular}{lrrrrr}
\toprule
& YOLOv8 & ResNet (S) & Inception (S) & AE (S) & YOLOv8 (S) \\
\midrule
Precision & 0.0075 & 0.2543 & 0.1803 & 0.0469 & 0.0892 \\
Recall & 0.0092 & 0.5921 & 0.4424 & 0.2588 & 0.3204 \\
F1-Score & 0.0079 & 0.3391 & 0.2410 & 0.0758 & 0.1095 \\
Mean IoU & 0.0131 & 0.8257 & 0.7475 & 0.6064 & 0.4128 \\
Mean Dice & 0.0167 & 0.8628 & 0.7890 & 0.6314 & 0.4341 \\
\bottomrule
\end{tabular}
\caption{Evaluation metrics of the leaf segmentation stage on synthetic test data; (S) indicates SAM-based models}
\label{tab:metrics_segmentation_synthetic}
\end{table}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/metrics_segmentation_synthetic.png}
\caption{Bar graph of evaluation metrics of the leaf segmentation stage on synthetic data}
\label{fig:metrics_segmentation_synthetic}
\end{figure}
In Table \ref{tab:metrics_segmentation}, which compares the model configurations for leaf segmentation evaluated on a real-world dataset, notable trends emerge across the evaluation metrics. SAM + YOLOv8 achieves the highest scores, outperforming the other models across all metrics, including a recall of 0.6719 and a mean Dice score of 0.6697, suggesting strong segmentation performance with high overlap accuracy and contradicting the findings of the evaluation on the synthetic dataset, where YOLOv8 in combination with SAM only achieved average performance. In absolute terms, however, this combination still exhibits a low precision of 0.3507, highlighting its tendency to incorrectly classify regions as leaves. SAM + ResNet is the second best performing model in all metrics, exhibiting a precision of 0.1245 and a mean IoU score of 0.4846, indicating moderately accurate segmentation boundaries. SAM + Inception follows SAM + ResNet closely, achieving a similarly low precision of 0.1224 and a competitive mean Dice coefficient of 0.4259. In contrast, YOLOv8 alone performed poorly, showing values of less than 0.01 in every metric, indicating that it is not able to transfer what it learned from synthetic training data to the real-world dataset. SAM + AE and Mask R-CNN underperform across most metrics, with SAM + AE consistently producing negative classifications, reflecting limited or no detection capacity in this task. The table highlights that SAM + YOLOv8 and SAM + ResNet offer the best balance in segmentation quality, with SAM + YOLOv8 providing slightly better overall performance in this context. These findings are also illustrated in the bar graph in Figure \ref{fig:metrics_segmentation}.
\begin{table}[]
\centering
\begin{tabular}{lrrrrr}
\toprule
& YOLOv8 & ResNet (S) & Inception (S) & AE (S) & YOLOv8 (S) \\
\midrule
Precision & 0.0050 & 0.1245 & 0.1224 & 0.0757 & 0.3507 \\
Recall & 0.0050 & 0.4403 & 0.3655 & 0.3465 & 0.6719 \\
F1-score & 0.0044 & 0.1752 & 0.1599 & 0.1078 & 0.4264 \\
Mean IoU & 0.0040 & 0.4846 & 0.4136 & 0.3384 & 0.6407 \\
Mean Dice & 0.0050 & 0.4967 & 0.4259 & 0.3557 & 0.6697 \\
\bottomrule
\end{tabular}
\caption{Evaluation metrics of the leaf segmentation stage on real-world data; (S) indicates SAM-based models}
\label{tab:metrics_segmentation}
\end{table}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/metrics_segmentation.png}
\caption{Bar graph of evaluation metrics of the leaf segmentation stage on real-world data}
\label{fig:metrics_segmentation}
\end{figure}
\subsection{Panoptic Segmentation} \label{sec:results_segmentation_panoptic}
The panoptic segmentation approach to the first stage of the pipeline includes the Segment Anything model (SAM) \cite{kirillov_segment_2023} generating panoptically segmented image regions, which are subsequently classified by a classification model.
Since SAM is a state-of-the-art pre-trained panoptic segmentation model that was not specifically tailored for this study, no separate assessment was performed on it.
Each of the classifiers used in conjunction with SAM was additionally evaluated in its leaf-classification capabilities. For this, the already discussed leaf segmentation dataset \cite{giovi_leaf_2024} was used, as it provides various single-leaf images and their corresponding segmentation masks. For each image in the set, the corresponding segmentation mask was applied to closely mirror the output of SAM. As a negative (non-leaf) class, a random subset of 872 samples from Google's Open Images V7 dataset \cite{kuznetsova_open_2020}, which contains various random objects, was used.
During the evaluation loop, the models' metrics of accuracy, precision, recall, and F1 score were collected. These results are illustrated in Table \ref{tab:panoptic_classification_metrics} and Figure \ref{fig:panoptic_classification_eval_metrics}. It is evident that ResNet delivers the best performance in all of the metrics, far surpassing the other models in precision and F1-score with scores of 0.9580 and 0.9579 respectively, showing its ability to generalize reliably.
InceptionV3 closely follows ResNet's performance with the overall second best scores, exhibiting a precision of 0.9272, a recall and F1 score of 0.9268, showing its strong ability to generalize and reliably distinguish between leaf and non-leaf regions.
YOLOv8 achieves a decent precision of 0.8028 but exhibits lower scores than ResNet and Inception in recall, showing the model's inability to correctly identify prevalent plant leaf regions, i.e., a high false negative rate. The Autoencoder falls behind YOLOv8, Inception and ResNet in precision and F1 score, exhibiting poor performance in classifying non-leaf regions as such, i.e., a high false positive rate. The confusion matrix in Figure \ref{fig:panoptic_classification_conf_matrix} shows that the Autoencoder and YOLOv8 are capable of predicting positive examples mostly correctly, but struggle to classify non-leaf examples, performing similarly to a random baseline on that class.
Generally, the evaluation results of the classifier step align well with the evaluation of their use in combination with SAM. However, the precision in segmentation is considerably lower due to misclassifications based on various factors, showing that the performance of a classifier decreases significantly when combined with SAM compared to its standalone use.
\begin{table}[]
\centering
\begin{tabular}{lrrr}
\toprule
& Precision & Recall & F1-score \\
\midrule
Autoencoder & 0.6718 & 0.6556 & 0.6472 \\
ResNet & \textbf{0.9580} & \textbf{0.9579} & \textbf{0.9579} \\
YOLOv8 & 0.8028 & 0.7175 & 0.6960 \\
Inception & 0.9272 & 0.9268 & 0.9268 \\
\bottomrule
\end{tabular}
\caption{Evaluation metrics of the panoptic classifier models}
\label{tab:panoptic_classification_metrics}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{graphics/panoptic_classification_eval_metrics.png}
\caption{Evaluation metrics of the different classification models in panoptic segmentation}
\label{fig:panoptic_classification_eval_metrics}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/confusion_matrix_classifiers.png}
\caption{Confusion Matrix of the leaf classifier models}
\label{fig:panoptic_classification_conf_matrix}
\end{figure}
\subsection{Instance Segmentation} \label{sec:results_segmentation_instance}
For instance segmentation, the presented models were trained to segment all regions containing a leaf from a given input image. For this task, the Mask R-CNN, RetinaNet, and YOLOv8 architectures were selected.
For this method, YOLOv8 emerged as the only viable model. Its performance compared to the SAM-based models is illustrated in Section \ref{sec:results_segmentation}.
The selection was made based on the training performance of the different models. The detailed results of the training can be seen in Section \ref{sec:leaf_segmentation_instance_training}.
In the training of Mask R-CNN and RetinaNet, a notable pattern emerged. While Mask R-CNN was able to converge in its training loss and in the validation mean average precision (mAP) and mean average recall (mAR) of its bounding boxes, RetinaNet did not show the same behavior but exhibited unstable training and validation performance, indicating that it is not able to generalize from the presented data. For this reason, RetinaNet will not be evaluated further and is omitted from this study.
Comparing Mask R-CNN's and YOLOv8's segmentation performance also shows that YOLOv8 is far superior to Mask R-CNN in terms of segmentation mask quality. While YOLOv8 managed to reach a maximum mAP of 0.679, the best value achieved by Mask R-CNN for this metric was 0.303. The difference between YOLOv8's mAP of 0.540 and Mask R-CNN's 0.194 further underlines the inferiority of the latter model. Due to this poor performance, Mask R-CNN will also not be regarded further in this study.
\section{Disease detection} \label{sec:results_disease}
The disease detection models were evaluated on the test split of the dataset also used for training and validation (described in Section \ref{sec:method_disease}). During the evaluation, metrics for binary accuracy, recall, area under the curve (AuC), and the F1 score were collected. The evaluation findings can be seen in Table \ref{tab:eval_metrics_disease_detection_masked}.
In Table \ref{tab:eval_metrics_disease_detection_masked} and Figure \ref{fig:confusion_matrix_disease_detection_masked} the results of the disease classification models are illustrated.
Evaluation of various deep learning models on the dataset revealed different levels of performance on precision, recall, F1 score, accuracy, and area under the curve (AuC). AlexNet emerged as the best performing model, achieving a precision of 0.8674, recall of 0.8305, and an F1 score of 0.8261. Its accuracy and AuC also reached 0.8305, indicating its strong generalization and suitability for the task.
InceptionV3 also performed well, with a precision of 0.8354, a recall of 0.7784, and an F1 score of 0.7685; ResNet152V2 followed closely with a precision of 0.8340, a recall of 0.7786, and an F1-score of 0.7691, demonstrating robust performance across all metrics, although their overall metrics were lower than those of AlexNet. The Vision Transformer only achieved chance-level results, with a precision of 0.4805, a recall of 0.4999, and an F1 score of 0.3340, notably below the top-performing models, potentially due to its architectural differences.
MobileNetV3Large, VGG19, and ConvNeXtLarge, which already encountered issues during training (see Section \ref{sec:disease_detection_result_training}), almost exclusively produced predictions for diseased leaves regardless of the actual input (see Figure \ref{fig:confusion_matrix_disease_detection_masked}). VGG19 and ConvNeXtLarge therefore produce identical metrics, with precision, recall, F1 score, accuracy, and AuC values of 0.2500, 0.5000, 0.3333, 0.5000, and 0.5000, respectively, while MobileNetV3Large only marginally exceeds this chance-level performance. This uniformity suggests that these models struggled to learn effectively from the dataset, likely due to underfitting or limitations of their architectures for the specific task. These results underscore the varying strengths of the models tested, and AlexNet clearly outperforms its counterparts in this evaluation.
\begin{table}[]
\centering
\begin{tabular}{lrrrrr}
\toprule
& Precision & Recall & F1-score & Accuracy & AuC \\
\midrule
InceptionV3 & 0.8354 & 0.7784 & 0.7685 & 0.7784 & 0.7784 \\
VisionTransformer & 0.4805 & 0.4999 & 0.3340 & 0.4999 & 0.4999 \\
AlexNet & 0.8674 & 0.8305 & 0.8261 & 0.8305 & 0.8305 \\
ResNet152V2 & 0.8340 & 0.7786 & 0.7691 & 0.7786 & 0.7786 \\
MobileNetV3Large & 0.6778 & 0.5017 & 0.3376 & 0.5017 & 0.5017 \\
VGG19 & 0.2500 & 0.5000 & 0.3333 & 0.5000 & 0.5000 \\
ConvNeXtLarge & 0.2500 & 0.5000 & 0.3333 & 0.5000 & 0.5000 \\
\bottomrule
\end{tabular}
\caption{Evaluation metrics of the disease detection models}
\label{tab:eval_metrics_disease_detection_masked}
\end{table}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/conf_matrix_disease_detection_masked.png}
\caption{Confusion matrix over evaluation data of all discussed disease detection models}
\label{fig:confusion_matrix_disease_detection_masked}
\end{figure}
The metrics gathered during the training process are displayed in Appendix \ref{sec:disease_detection_result_training}.
\section{Complete Pipeline} \label{sec:results_total}
In the last two sections, the performance of each stage was individually evaluated. In this section, evaluations of the compound of both stages will be performed. The evaluation will be performed with the data provided in the PlantDoc dataset \cite{singh_plantdoc_2020}. This dataset contains a total of 2,922 images, of which 916 are of healthy plants and 2,006 are infected with various diseases.
In the evaluation loop, a set of constraints was imposed on each leaf region, selecting specific regions based on their sharpness $sh(\mathcal{R}) = avg(\sqrt{(\nabla_x \mathcal{R}_{gray})^2 + (\nabla_y \mathcal{R}_{gray})^2})$, their size, and their leaf probability as predicted by the leaf classifiers. These selections were based either on thresholding with a specific value (leaf probability) or on selecting the top-$k$ candidates from the list (region area and sharpness).
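The sharpness measure can be sketched as follows; using OpenCV Sobel filters to approximate the image gradients is an assumption for illustration and may differ from the actual implementation.
\begin{verbatim}
# Sketch of the sharpness score sh(R): mean gradient magnitude of the
# grayscale region, here approximated with Sobel filters (an assumption).
import cv2
import numpy as np

def sharpness(region_bgr):
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)   # vertical gradient
    return float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))
\end{verbatim}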
For each pipeline configuration, the optimal values for these constraints were determined by an iterative search algorithm. In addition, the optimal threshold for classifying leaves as healthy or diseased was determined this way. The results of this search algorithm are summarized in Table \ref{tab:results_complete_optimal_search}.
% add the table of the optimal hyperparameter settings here (from eval_out notebook)
\begin{table}[]
\centering
\begin{tabular}{lrrrrr}
\toprule
& \makecell[r]{DISEASED\\THRESH-\\OLD} & \makecell[r]{LEAF\\PROB-\\ABILITY} & \makecell[r]{N\\SELECT\\AREA} & \makecell[r]{N\\SELECT\\SHARPNESS} \\
\midrule
\makecell[l]{Inception (S) + AlexNet} & 0.8000 & 0.9000 & 4 & 2 \\
\makecell[l]{Inception (S) + InceptionV3} & 0.9375 & 0.8000 & 3 & 2 \\
\makecell[l]{Inception (S) + ResNet152V2} & 0.9500 & 0.8500 & 2 & 2 \\
\makecell[l]{ResNet (S) + AlexNet} & 0.8500 & 0.8000 & 3 & 2 \\
\makecell[l]{ResNet (S) + InceptionV3} & 0.9500 & 0.8000 & 2 & 2 \\
\makecell[l]{ResNet (S) + ResNet152V2} & 0.9500 & 0.8000 & 2 & 2 \\
\makecell[l]{YOLOv8 (S) + AlexNet} & 0.8000 & 0.9250 & 4 & 2 \\
\makecell[l]{YOLOv8 (S) + InceptionV3} & 0.9500 & 0.8000 & 2 & 2 \\
\makecell[l]{YOLOv8 (S) + ResNet152V2} & 0.9500 & 0.8000 & 2 & 3 \\
\bottomrule
\end{tabular}
\caption{Optimal leaf region selection parameters determined by iterative search by pipeline configuration}
\label{tab:results_complete_optimal_search}
\end{table}
The evaluation of the pipeline configurations, as shown in Tables \ref{tab:total_eval} and \ref{tab:results_complete_optimal_search}, demonstrates the performance variability across different model combinations, influenced by both classification metrics and the optimal selection of hyperparameters for leaf region constraints. Among the classification results, YOLOv8 (S) + AlexNet emerged as the best performing configuration, achieving the highest precision of 0.7101, recall and AuC of 0.6971, and an F1-score of 0.6923. The strong performance of this configuration can be attributed to its effective balance between precision and recall, supported by its optimal hyperparameter settings: a diseased threshold of 0.8000, a leaf probability threshold of 0.9250, and selection of 4 regions based on area and 2 based on sharpness. Similarly, Inception (S) + AlexNet demonstrated competitive classification metrics, with an F1-score of 0.6607 and an accuracy of 0.6643, benefiting from similar hyperparameter settings, including a diseased threshold of 0.8000 and the selection of 4 and 2 regions for area and sharpness, respectively.
The iterative search algorithm used to optimize the leaf region constraints identified distinct patterns across the pipeline configurations. Configurations such as YOLOv8 (S) + ResNet152V2 and YOLOv8 (S) + InceptionV3, which achieved moderately lower classification metrics (F1-scores of 0.6012 and 0.6274, respectively), were characterized by stricter hyperparameter settings, with diseased thresholds of 0.9500 and the selection of fewer regions (2 for area and either 2 or 3 for sharpness). Similarly, ResNet-based combinations generally demonstrated lower classification performance, with ResNet (S) + ResNet152V2 achieving the lowest F1-score of 0.5884 and an accuracy of 0.6107. These configurations also consistently required tighter thresholds, such as a diseased threshold of 0.9500, likely reflecting their reduced capacity to generalize in the classification task.
Interestingly, configurations integrating InceptionV3, such as Inception (S) + InceptionV3 and ResNet (S) + InceptionV3, achieved slightly better results in terms of classification performance compared to other ResNet-based setups. For example, Inception (S) + InceptionV3 achieved an F1-score of 0.6428 and an accuracy of 0.6563, with optimal hyperparameters including a diseased threshold of 0.9375 and the selection of 3 regions based on area and 2 based on sharpness. This suggests that InceptionV3 contributes to a more nuanced feature extraction process, particularly when paired with models like Inception or ResNet.
Overall, the results indicate that the inclusion of AlexNet in the pipeline consistently enhances performance, particularly when paired with YOLOv8 or Inception. The iterative search for hyperparameters highlights the importance of balancing thresholds for diseased classification and region selection criteria, with the best-performing configurations often requiring more lenient diseased thresholds and a higher number of selected regions. These findings underline the critical role of both model architecture and fine-tuned hyperparameter selection in achieving optimal classification performance. Further research could explore the interaction between these factors to refine pipeline designs for more robust performance across datasets and applications.
\begin{table}[]
\centering
\begin{tabular}{lrrrrr}
\toprule
& Precision & Recall & F1-Score & Accuracy & AuC \\
\midrule
Inception (S) + AlexNet & 0.6715 & 0.6643 & 0.6607 & 0.6643 & 0.6643 \\
YOLOv8 (S) + InceptionV3 & 0.6720 & 0.6430 & 0.6274 & 0.6430 & 0.6430 \\
YOLOv8 (S) + ResNet152V2 & 0.6621 & 0.6244 & 0.6012 & 0.6244 & 0.6244 \\
ResNet (S) + ResNet152V2 & 0.6412 & 0.6107 & 0.5884 & 0.6107 & 0.6107 \\
ResNet (S) + InceptionV3 & 0.6590 & 0.6349 & 0.6205 & 0.6349 & 0.6349 \\
YOLOv8 (S) + AlexNet & \textbf{0.7101} & \textbf{0.6971} & \textbf{0.6923} & \textbf{0.6971} & \textbf{0.6971} \\
Inception (S) + InceptionV3 & 0.6842 & 0.6563 & 0.6428 & 0.6563 & 0.6563 \\
Inception (S) + ResNet152V2 & 0.6492 & 0.6254 & 0.6098 & 0.6254 & 0.6254 \\
ResNet (S) + AlexNet & 0.6743 & 0.6633 & 0.6579 & 0.6633 & 0.6633 \\
\bottomrule
\end{tabular}
\caption{Classification metrics of the evaluation of the pipeline by pipeline composition}
\label{tab:total_eval}
\end{table}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/total_eval_metrics_selected_new.png}
\caption{Bar plot of evaluation metrics for the five best-performing pipeline compositions (in terms of AuC)}
    \label{fig:total_eval_metrics_selected}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.75\linewidth]{graphics/confusion_matrix_total_selected_new.png}
\caption{Confusion matrices of all viable compositions of the total pipeline}
\label{fig:confusion_matrix_total_selected}
\end{figure}
\chapter{Discussion}
This study explores the efficacy of detecting plant diseases in leaves using computer vision in real-world scenarios. Although earlier studies proposed methods for detecting diseases in plant leaves, they did not address their use outside the laboratory context, omitting the need to identify and segment individual leaves in a recording taken in real-world applications, where multiple leaves and plants may be present.
The research identified that employing the proposed methodologies for detecting plant diseases, in conjunction with a segmentation stage, constitutes an effective approach to accurately discerning areas within an image that contain plant leaves and determining whether these leaves exhibit any visible signs of disease.
Overall, the proposed methodology puts forward an image pipeline that is reliably able to determine the presence and location of diseased leaves in an image. This pipeline is composed of SAM-based YOLOv8 in the leaf segmentation stage and AlexNet as the disease detection model, producing an AuC and accuracy of 69.71\% and a precision of 71.01\%.
Comparing these results with the evaluation of the individual stages shows that the performance of the complete pipeline depends not only on the individual stages' performance but also on their capability of "working together". The best model in the leaf segmentation stage (discussed in Section \ref{sec:results_segmentation}), ResNet, achieved a leaf classification precision and recall of 95.8\% and 95.79\%, yet reached lower scores in the evaluation of the total pipeline than YOLOv8, which exhibited lower leaf classification scores of 80.28\% and 71.75\%. These results show that the performance in a single stage does not guarantee comparable performance in the application of the total pipeline.
The evaluation of the segmentation performance in Section \ref{sec:results_segmentation} is not completely representative of the models' performance in actual applications, because the ground-truth annotations of the evaluation dataset contain fewer positive leaf regions than are actually present in the images. However, when manually inspecting the segmentation results, it is evident that all configurations exhibit a significant number of misclassifications, showing that the poor performance results are due not only to the dataset quality but also to the actual model performance.
The poor segmentation performance (especially specificity) shows that the synthetically generated dataset is not suitable for training models for the task of segmenting leaves. This could be combated with an extensive dataset of leaf regions annotated in images containing more than one leaf. Such a dataset could replace the synthetically generated dataset described in Section \ref{sec:data_aug} during training, increasing the performance of the segmentation stage, which in turn allows the disease detection stage to produce more accurate results.
In an attempt to produce correct leaf classifications, selection constraints on the segmentation regions (described in Section \ref{sec:results_total}) were applied to pick a relatively large, centrally positioned leaf area from the image; these constraints are specific to the dataset at hand and cannot be generalized to arbitrary situations. However, even with this limitation, the best pipeline configuration still assigns the wrong label to about 30\% of the presented images. Although these performances do not accurately reflect the pipelines' performance in real-world scenarios, they present a baseline for comparison of the different presented methods and give a basic understanding of the performance characteristics of each configuration.
The evaluation of the presented pipelines gives insight into the capabilities of detecting leaf diseases in non-lab conditions. However, these results must be interpreted with caution, as the conditions present in the dataset may differ significantly from real-world applications. While the test dataset PlantDoc \cite{singh_plantdoc_2020} contains images of healthy and diseased plants with multiple leaves in different orientations and with occlusions present, mirroring real-world conditions, it is predominantly composed of images with a narrow focus on single leaves. In real-world applications, where footage containing many plants and crops may be used, e.g. an overview of an agricultural field, leaf regions are much smaller than in the evaluation dataset used for this work, and no single leaves will be emphasized much more strongly than others. Given these differences between real-world and evaluation data, the results presented may overestimate the presented approaches' robustness and generalizability. In particular, the segmentation of smaller leaf regions in wide-angle footage may lead to a decline in segmentation capability in the first stage of the presented pipelines. To combat this, future studies could include the usage of models like Semantic-SAM as the panoptic segmentation network, which are capable of segmenting image regions at different granularities, enabling the segmentation of small leaf regions \cite{li_semantic-sam_2023}.
The training process of instance segmentation models relies heavily on the availability of comprehensive and complete datasets. However, the scarcity of such datasets in the domain of leaf images poses a significant challenge in training accurate instance segmentation models. Generally, datasets of leaf images including annotations are quite scarce, and the ones that exist are lacking in quality. For example, the PhenoBench dataset \cite{weyler_phenobench_2023}, which in and of itself is not optimal for the task at hand as it contains images of crops in front of a distinctive background, additionally does not contain data in the form claimed by its authors: instead of providing mask information for individual leaf instances in an image, the masking is of a semantic kind and does not differentiate between multiple instances of leaves. To combat this lack of suitable data, a synthetic image dataset was created, presenting a scalable approach to obtaining data of the necessary fidelity.
Despite its advantages, the approach of placing leaf images onto a background has its shortcomings. The differences in texture, lighting conditions, and background noise between the synthetic dataset and real-world images likely contributed to the instance segmentation models' limited ability to generalize. As a result, instance segmentation models trained on this dataset exhibited poor performance when applied to real-world scenarios.
To overcome these limitations, the development of a real-world dataset with comprehensive and complete annotations for leaf regions is of paramount importance. Resources like this would enable the instance segmentation models to generalize the nuances of real-world data and result in improved segmentation performance. While the synthetic dataset used in this study served as an insightful starting point, the limitations of the resulting models underline the importance of developing more comprehensive datasets.
Additionally to improved training performance, such datasets would also improve the ability to evaluate the approaches presented in this work. The dataset used for evaluating the segmentation stages \cite{giovi_leaf_2024}, similar to PlantDoc \cite{singh_plantdoc_2020} presented a narrow focus on specific leaves in the image and only provided a segmentation mask for them, whereas the leaf regions in the background were not annotated. This led to a high rate of false positives in the segmentation stage and limited the significance of results produced by this evaluation.
Despite attaining satisfactory performance during training with average precision scores reaching 0.998 for YOLOv8, instance segmentation models failed to show adequate performance on previously unseen data, as detailed in Section \ref{sec:results_segmentation_instance}, achieving an inadequate mAP of 0.054.
The inferiority of this method is further underlined by the fact that two of the three models did not reach convergence during training. This additionally shows that training instance segmentation models solely on synthetic datasets does not enable them to generalize and be used in a leaf segmentation application.
Using a pre-trained model for the complex task of panoptic segmentation achieved satisfactory performance in unseen real-world inputs, generating sensible segmentation outputs for image regions even in complex backgrounds.
The classification network ResNet proved to be the best performing model for differentiating between leaf and non-leaf segments. YOLOv8, while performing well during training, classified far fewer segments as leaves than ResNet or InceptionV3. Other architectures like the Autoencoder exhibited poor performance, being unable to produce reliable classifications. Due to the lack of quality datasets that provide ground-truth annotations for leaf regions in an image, it is hard to tell accurately which of ResNet and Inception performs better in actual application.
The classification of segmented leaves into healthy and diseased categories was examined using a variety of classification models. Evaluation of these models indicated that AlexNet demonstrated the greatest ability to generalize patterns learned during training to unseen data, achieving an accuracy of 83.05\%. Interestingly, the model with the least computational and architectural complexity was better able to adapt to the data than more intricate architectures. ResNet152V2 and InceptionV3 follow closely across all metrics, with precisions of 83.40\% and 83.54\%, respectively.
Although the study explores how the prevalence of diseases can be attributed to regions in a recorded image, for real-world use it would be necessary to enable the model to attribute plant diseases to specific plant instances in the recorded image. Furthermore, this study only discusses diseases that show spectrally visible features; diseases that do not produce any symptoms in the human-visible spectrum are not detectable by the proposed methods.
\chapter{Conclusion}
Given the critical role of plant health in ensuring food safety and improving crop efficiency, monitoring the health status of plants in agricultural settings has gained greater importance. Although previous research has made significant advances in the identification of leaf diseases, there remains a lack of models that are effective in real-world applications.
In this research, a novel methodology for disease detection in practical settings was introduced. This approach involves a two-stage Deep Learning pipeline, which initially segments the image to isolate leaves, followed by the classification of these segments to assess the presence of disease. This methodology is particularly applicable to real-world situations, as it does not require the preparation of individual leaf images; rather, it is capable of processing multiple leaves within a single image to generate its outcomes.
The best performance reached by this methodology was a precision of 71.01\% and an accuracy of 69.71\%.
\appendix
\begin{appendix}
\chapter{Training results}
\section{Leaf segmentation}
\subsection{Panoptic segmentation} \label{sec:leaf_segmentation_panoptic_training}
The training performance of the autoencoder on the Urban Street dataset \cite{yang_urban_2023} was monitored over 80 epochs, capturing the evolution of various metrics, including training loss and validation accuracy, precision, recall, and F1-score. As shown in Figure \ref{fig:autoencoder_train}, the training loss decreased rapidly in the initial epochs and approached convergence well before reaching the 69th epoch, prompting early stopping to prevent overfitting. This rapid decrease in loss suggests effective optimization and quick adaptation of the model to the training data.
In contrast, the validation metrics — accuracy, precision, recall, and F1-score — remained relatively stable and showed limited improvement throughout the training process. Precision and recall fluctuated around 0.5, indicating that the model struggled to consistently identify relevant patterns in the validation data. The F1-score, a metric that combines precision and recall to provide a balanced view of the model’s performance, remained lower than the other metrics, further indicating the model’s difficulty in achieving reliable performance on the validation set. These stagnating validation metrics imply that while the model successfully minimized training loss, it may be overfitting to the training data or lack the capacity to generalize effectively to unseen samples in the Urban Street dataset.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{graphics/autoencoder_training_metrics.png}
\caption{Training metrics of the autoencoder discussed in Section \ref{sec:method_autoencoder}}
\label{fig:autoencoder_train}
\end{figure}
The training loop of ResNet shows interesting trends. It is evident that both the training loss and the validation loss decrease significantly and converge towards the end of training. The validation metrics all follow the exact same values, perfectly overlaying each other in the plot. It is also interesting that they all start at 0.98 at the beginning of training, showing that either the pretrained ResNet is already able to extract the relevant features or that the sheer length of the training epoch of over 74,000 already leads to a noticeable degree of generalization and learning.
\begin{figure}
\centering
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{graphics/resnet_training_loss.png}
\caption{Training and validation loss of ResNet over epochs}
\label{fig:resnet_training_loss}
\end{subfigure}
\begin{subfigure}{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{graphics/resnet_training_val_metrics.png}
\caption{Validation metrics of ResNet over epochs in the range between 0.98 and 1}
\label{fig:resnet_training_val_metrics}
\end{subfigure}
\caption{Training progress of ResNet with average values over one epoch}
\label{fig:resnet_training}
\end{figure}
InceptionV3, in contrast to ResNet, exhibits a somewhat more troublesome training progress. In the first epochs, the validation loss increases significantly to over 100 (clipped in Figure \ref{fig:inception_training}), but then slowly decreases to acceptable levels. This might be due to the learning rate schedule, which reaches its highest learning rate at this stage of the training loop. This is also reflected in the validation metrics, which show a significant dip at this stage of training but quickly recover, until they exhibit another dip in epoch 8 before reaching a new best score in all metrics.
\begin{figure}
\centering
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{graphics/inception_training_loss.png}
\caption{Training and validation loss of InceptionV3 over epochs}
\label{fig:inception_training_loss}
\end{subfigure}
\begin{subfigure}{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{graphics/inception_training_val_metrics.png}
\caption{Validation metrics of InceptionV3 over epochs \newline}
\label{fig:inception_training_val_metrics}
\end{subfigure}
\caption{Training progress of InceptionV3 with average values over one epoch}
\label{fig:inception_training}
\end{figure}
\subsection{Instance segmentation} \label{sec:leaf_segmentation_instance_training}
In instance segmentation, the presented models were trained to segment regions containing a leaf from a given input image. For this task, the Mask R-CNN, RetinaNet, and YOLOv8 architectures have been selected.
During the training loop described in Section \ref{sec:method_segmentation}, Mask R-CNN and RetinaNet exhibited notable differences in the validation metrics.
Figure \ref{fig:seg_precision_recall} compares the two main validation metrics, mean average recall (mAR) and mean average precision (mAP), over the course of training. In general, RetinaNet achieved a higher mAR with a maximum value of $0.5901$, while Mask R-CNN achieved a maximum of $0.5360$ in this metric. However, the progression of mAR is much smoother for Mask R-CNN than for RetinaNet, with $sm(\text{Mask R-CNN}) = 0.0075$ and $sm(\text{RetinaNet}) = 0.0297$, where the smoothness $sm$ is defined as the standard deviation ($\sigma$) of the derivative of a function:
\begin{equation} \label{eqn:smoothness}
sm(f) = \sigma(\frac{df(x)}{dx})
\end{equation}
This behavior is more pronounced when inspecting the mean average precision (mAP) of both networks during training (seen in Figure \ref{fig:seg_precision_recall}). Here, Mask R-CNN achieves an overall higher score of 0.673 as well as faster convergence compared to RetinaNet, which arguably does not converge at all. Once again, Mask R-CNN's curve is much smoother than RetinaNet's, with $sm(\text{Mask R-CNN}) = 0.015$ and $sm(\text{RetinaNet}) = 0.140$, respectively.
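In code, this measure can be sketched as the standard deviation of the discrete (per-epoch) differences of a metric curve; the use of \verb|np.diff| as the discrete derivative is an assumption for illustration.
\begin{verbatim}
# Sketch of the smoothness measure sm(f): standard deviation of the
# discrete (per-epoch) differences of a metric curve.
import numpy as np

def smoothness(values):
    return float(np.std(np.diff(np.asarray(values, dtype=float))))
\end{verbatim}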
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/segmentation_map_mar.png}
\caption{Mean average recall (mAR) and mean average precision (mAP) of the two discussed segmentation models}
\label{fig:seg_precision_recall}
\end{figure}
Due to the inadequate performance and the absence of convergence during training, along with the lack of a mask output in RetinaNet, this model proves to be unsuitable for the task of leaf segmentation and will therefore not be included in the subsequent stages of this research.
When YOLOv8 is used for leaf classification in combination with panoptic segmentation by SAM, that is, for the simpler task of detecting only a single leaf, the model, trained on the Urban Street dataset \cite{yang_urban_2023}, reached substantially higher training and validation metrics, with a maximum precision and recall of 0.998 and 0.997, respectively. The progress of all metrics over the course of the training can be seen in Figure \ref{fig:yolo_training_metrics_us}. Although the convergence of the validation metrics is not as clear as in the instance segmentation version, the training loss functions all converge after 100 epochs.
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{graphics/yolo_training_result_urban_street.png}
\caption{Training and validation metrics of the YOLOv8 model on the Urban Street leaves dataset \cite{yang_urban_2023} used for leaf classification}
\label{fig:yolo_training_metrics_us}
\end{figure}
The training and validation metrics of YOLOv8 as an instance segmentation model on the synthetically generated dataset can be seen in Figure \ref{fig:yolo_training_metrics_synthetic}. It is evident that the model's performance has converged with respect to training loss and validation metrics. The maximum values attained by the instance segmentation model are 0.626 for precision and 0.652 for the recall.
\begin{figure}
\centering
\includegraphics[width=0.95\linewidth]{graphics/yolo_training_result_synthetic.png}
\caption{Training and validation metrics of the YOLOv8 model on the synthetically generated dataset from Section \ref{sec:data_aug} used for instance segmentation}
\label{fig:yolo_training_metrics_synthetic}
\end{figure}
When comparing the validation metrics of the two (remaining) instance segmentation models (seen in Figure \ref{fig:yolo_mrcnn_instance_seg}), Mask R-CNN's scores for bounding box predictions increase steadily to values of 0.790 and 0.859 for mAR and mAP respectively. Although the box predictions reach usable scores, it is evident that Mask R-CNN did not manage to generate any meaningful results for the segmentation masks, with its mAP and mAR decreasing strongly at the beginning of training, fluctuating heavily thereafter, and reaching final values of 0.303 and 0.194 respectively. All the while, YOLOv8's mAR and mAP for mask predictions rise steadily throughout the training process and eventually converge to values of around 0.641 and 0.677. Throughout training, the maximum mAP box score reached by Mask R-CNN is 0.540, whereas YOLOv8 achieved a score of 0.679. As Mask R-CNN proves unviable for this task, it will be omitted from the remainder of this study.
\begin{figure}[]
\centering
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{graphics/yolo_mrcnn_instance_seg_mar.png}
\caption{Comparison of mAR between YOLOv8 and Mask R-CNN}
\label{fig:yolo_mrcnn_instance_seg_mar}
\end{subfigure}
\hfill
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{graphics/yolo_mrcnn_instance_seg_map.png}
\caption{Comparison of mAP between YOLOv8 and Mask R-CNN}
\label{fig:yolo_mrcnn_instance_seg_map}
\end{subfigure}
\caption{Comparison of mean average recall (mAR) and mean average precision (mAP) between YOLOv8 and Mask R-CNN}
\label{fig:yolo_mrcnn_instance_seg}
\end{figure}
\begin{table}[]
\centering
\begin{tabular}{lrr}
\toprule
& mAP & mAR \\
\midrule
RetinaNet & 0.5901 & 0.7120 \\
Mask R-CNN & 0.5360 & 0.6730 \\
YOLOv8seg & 0.5403 & 0.6790 \\
\bottomrule
\end{tabular}
\caption{Maximum mAP and mAR values reached during training for each model}
\label{tab:instance_segmentation_max_values}
\end{table}
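The values in Table \ref{tab:instance_segmentation_max_values} are the maxima observed over the validation runs during training. As an illustration, mask-level mAP and mAR can be computed with the torchmetrics library as sketched below; this is not necessarily the exact evaluation tooling used in this work, and the tensors are dummy data.
\begin{verbatim}
# Sketch of computing mask-level mAP/mAR with torchmetrics
# (dummy predictions and targets; not the exact tooling used here).
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="segm")
preds = [{
    "masks": torch.ones(1, 480, 640, dtype=torch.bool),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "masks": torch.ones(1, 480, 640, dtype=torch.bool),
    "labels": torch.tensor([0]),
}]
metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["mar_100"])   # mAP and mAR over IoU 0.50:0.95
\end{verbatim}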
\section{Disease detection} \label{sec:disease_detection_result_training}
For detecting diseases in already segmented leaf regions of the image, seven different models were trained to differentiate between healthy and diseased leaves. The metrics collected during the training process can be seen in Figure \ref{fig:metrics_disease_detection}.
Table \ref{tab:disease_detection_metrics} compares the performance of the models in classifying leaf diseases, using key metrics such as the area under the curve (AUC) and recall across both the training and validation datasets. AlexNet achieves the highest training metrics, reaching an AUC of 0.9968 and a recall of 0.9846, which underscores its strong classification capabilities on the training set. While its validation scores are slightly lower, they remain high and indicate good generalization, with an AUC of 0.9963 and a BA and recall of 0.9825. InceptionV3 exhibits competitive performance, achieving the same training AUC as AlexNet and a slightly lower recall of 0.9841, as well as similar validation metrics with an AUC of 0.9960 and a recall of 0.9826, which surpasses the validation recall of AlexNet.
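All classifiers follow the same transfer-learning pattern of an ImageNet-pretrained backbone with a small binary head, monitored with AUC and recall. Assuming a Keras/TensorFlow implementation, the listing below gives a minimal sketch for the InceptionV3 variant; the input size, head layout and hyperparameters are illustrative assumptions and not the exact configuration used here.
\begin{verbatim}
# Minimal sketch of a binary leaf-disease classifier with an InceptionV3
# backbone (input size, head and hyperparameters are assumptions).
import tensorflow as tf

backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3)
)
backbone.trainable = False                       # keep ImageNet features frozen

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # healthy vs. diseased
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc"),
             tf.keras.metrics.Recall(name="recall"),
             "binary_accuracy"],
)
\end{verbatim}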
MobileNetV3Large also demonstrates high performance, particularly in terms of balanced training and validation scores with minimal drop-off between the two datasets, offering a reliable balance of efficiency and performance. VisionTransformer and ConvNeXtLarge show moderately high scores, though they fall short of the leading models, particularly in validation performance, indicating that they might be more prone to slight overfitting or have less capacity for capturing the subtle disease-related features in this dataset. VGG19, while competitive, shows comparatively lower scores on both training and validation data, suggesting potential limitations of its feature extraction for this specific task.
Overall, the analysis highlights AlexNet and InceptionV3 as the top-performing models for leaf disease classification, with InceptionV3 particularly excelling in validation performance and thus achieving reliable generalization. In addition, ResNet152V2 and MobileNetV3Large also emerge as viable disease classification models, exhibiting lower but still acceptable performance in this task.
Figure \ref{fig:metrics_disease_detection_masked} presents the training and validation performance of various models over 30 epochs in terms of binary accuracy and loss, providing insights into their convergence behaviors and generalization capabilities for leaf disease classification.
The evaluation of the deep learning models, including InceptionV3, MobileNetV3Large, ConvNeXtLarge, ResNet152V2, AlexNet, VGG19 and VisionTransformer, highlights distinct trends across the metrics of AUC, loss, recall, validation AUC, validation loss and validation recall. During training, InceptionV3, AlexNet and ResNet152V2 exhibit the best performance with similar trends in all metrics: their loss converges to levels below 0.15 over the course of training, and their AUC and recall reach scores of 0.9946, 0.9937, 0.9915 and 0.9765, 0.9735, 0.9667 respectively. This demonstrates the models' strong ability to learn and generalize the features of the healthy and diseased classes. While MobileNetV3Large shows training performance on par with the previously discussed models, it lags significantly behind in the validation metrics, exhibiting unstable behavior with highly fluctuating validation loss, AUC and recall, which indicates that this model is not able to generalize. The VisionTransformer exhibits mediocre performance during training, and its validation loss improves only minimally over a period of 12 epochs, indicating that the model is not suited for this task. The remaining models, ConvNeXtLarge and VGG19, are likewise unsuited for the classification task at hand, showing barely any improvement in the training and validation metrics and triggering early stopping after only a few epochs.
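The early stopping referred to above monitors the validation loss. Assuming a Keras-style training loop, a minimal sketch of such a callback is shown below; the patience value is an assumption and not the exact setting used.
\begin{verbatim}
# Sketch of the early-stopping criterion on the validation loss
# (the patience value is an assumption, not the exact setting used).
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True,
)
# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           callbacks=[early_stopping])
\end{verbatim}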