# Automatic SCOring of Atopic Dermatitis using Deep Learning (ASCORAD): A Pilot Study

https://doi.org/10.1016/j.xjidi.2022.100107

## Introduction

Atopic dermatitis (AD) is a multifaceted, chronic relapsing inflammatory skin disease that is commonly associated with other atopic manifestations such as allergic conjunctivitis, allergic rhinitis, and asthma (Berke et al., 2012; Bieber, 2008; Drucker et al., 2017). It is the most common skin disease in children, affecting approximately 15‒20% of children and 1‒3% of adults (Eichenfield et al., 2014; Nutten, 2015). The onset of the disease is most common by age 5 years, and early diagnosis and treatment are essential to avoid complications of AD and improve QOL (Eichenfield et al., 2014).

Much work has been done in the development of a better scoring system to reach a more objective and quicker-to-fill method. Novel tools for patients, such as the patient-oriented validated scoring system patient-oriented-SCORAD (Stalder et al., 2011) detect changes in signs and symptoms without the intervention of doctors. Likewise, the Three Item Severity score is a simple method to determine the severity of AD, and takes about 43 seconds per patient. The Eczema Area and Severity Index (Hanifin et al., 2001) showed a good interobserver and intraobserver variability, but it is a complex and time-consuming index to fill. However, all these scoring systems still suffer from the same variability problem because they share similarities with SCORAD (Chopra et al., 2017).

In recent years, artificial intelligence (AI) has achieved human-expert-like performance in a wide variety of tasks such as skin cancer classification, detection, and lesion segmentation. Extensive work has been done in the detection of AD with different imaging methods, including multiphoton tomography (Guimarães et al., 2020), clinical image (Wu et al., 2020), and even electronic health records (Gustafson et al., 2017). Skin pathologies such as psoriasis have also attracted the attention of researchers for the same reasons as AD, because the main scoring system, PASI, is a time-consuming and highly subjective scoring method. Dash et al. (2019) proved that convolutional neural networks are able to segment psoriasis with high accuracy, sensitivity, and specificity, outperforming existing methods. Pal et al. (2016) showed the effectiveness of convolutional neural networks in visual sign classification, a key task to automatic severity grading. Dash et al. (2020) combined both segmentation and severity grading, creating a computer-aided diagnosis (CADx) system for psoriasis lesion grading.

Creating a more objective and practical scoring system for AD assessment is key to improving evidence-based dermatology. In this study, we introduce the Automatic SCORAD (ASCORAD), an automatic version of the SCORAD that provides a quick, accurate, and fully automated scoring method.

## Results

### Annotation

First, we calculated the variability among the expert dermatologists across the three datasets. This provided a baseline against which the results of the Legit.Health-SCORADNet algorithm could be appraised. The lesion segmentation annotation was very consistent across datasets, with an accuracy of 81.0‒91.3%, area under the curve (AUC) of 0.91, F1 of 0.86‒0.91, and relative SD (RSD) of 8.6‒9.1%. Legit.Health-AD-FPK-IVI had the largest disagreement in terms of the intersection over union (IoU) metric, with 0.80 against 0.86 and 0.91 on the light-skin datasets. Note that the F1 score is also the lowest for the dark-skin dataset. In regard to visual sign severity assessment, Legit.Health-AD had more disagreement among the specialists, but the other datasets had more positively skewed distributions, meaning that the majority of the intensity values were close to 0.

#### Lesion surface segmentation

We compared the difference at pixel-level because there was no physical reference on the images to obtain the real size of the lesions. As shown in Table 1, the annotations of the three datasets had an RSD close to 9%, Cohen’s kappa of 0.79, and AUC around 0.90. Despite the similarity in the results on the previously mentioned metrics, Legit.Health-AD-FPK-IVI seemed to have more discrepancies among the annotators because it showed the lowest IoU and F1 values, 0.80 and 0.86, respectively.

Table 1. Annotator’s Performance in Lesion Surface Segmentation

| Datasets | ACC | AUC | IoU | F1¹ | RSD | Cohen’s Kappa |
|---|---|---|---|---|---|---|

Abbreviations: ACC, accuracy; AUC, area under the curve; IoU, intersection over union; RSD, relative SD.

These results provide the background for comparing with the results of Legit.Health-SCORADNet.

¹ F1 denotes F1 score.

#### Visual sign severity assessment

The results presented in Tables 2‒4 provide the baseline to appraise the results of Legit.Health-SCORADNet in the visual sign severity-assessment task. All the values are below random RSD and above random full agreement rate (FAR), partial agreement rate (PAR) 1, and PAR2 for all visual signs. Erythema was the visual sign that obtained the best Cohen’s kappa value in general, whereas lichenification (0.06) in Legit.Health-AD and excoriations (0.08) and dryness (0.09) in Legit.Health-AD-FPK-IVI had values very close to 0. The six visual signs constitute a maximum of 63 points of the SCORAD because the sum of the intensities is multiplied by $\frac{7}{2}$ (equation 2). Given the RSD results in terms of SCORAD points, the variability of Legit.Health-AD was around 11 points (RSD ≈ 17%), and both Legit.Health-AD-Test and Legit.Health-AD-FPK-IVI had the same variability, on average, of 8 points (RSD ≈ 12%).
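As a worked check of this conversion, assuming the standard SCORAD weighting in which the summed intensity of the six signs (each scored 0‒3) is multiplied by $\frac{7}{2}$, the maximum intensity contribution and the approximate variability in SCORAD points are:

$$
6 \times 3 \times \tfrac{7}{2} = 63, \qquad 0.17 \times 63 \approx 10.7, \qquad 0.12 \times 63 \approx 7.6
$$

These values round to the 11 and 8 points quoted above.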

Table 2. Annotator’s Performance in Legit.Health-AD Visual Sign Severity Assessment

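| Visual Signs | RSD (%) | RMAE, Mean (%) | RMAE, Median (%) | FAR (%) | PAR1 (%) | PAR2 (%) | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|
| Erythema | 11.5 | 10.7 | 8.3 | 33.1 | 92.0 | 94.5 | 0.34 |
| Edema | 16.2 | 14.7 | 11.9 | 21.3 | 74.1 | 84.9 | 0.15 |
| Oozing | 20.0 | 18.2 | 14.8 | 18.0 | 59.6 | 79.3 | 0.19 |
| Excoriations | 17.4 | 15.9 | 12.9 | 22.6 | 66.5 | 81.2 | 0.17 |
| Lichenification | 20.3 | 18.3 | 15.1 | 10.7 | 59.1 | 74.6 | 0.06 |
| Dryness | 18.7 | 16.9 | 12.8 | 20.0 | 69.3 | 82.3 | 0.14 |
| Average | 17.4 | 15.8 | 13.8 | 14.4 | 64.7 | 79.3 | 0.17 |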

Abbreviations: FAR, full agreement rate; PAR, partial agreement rate; RMAE, relative mean absolute error; RSD, relative SD.

These results provide the baseline to appraise the results of Legit.Health-SCORADNet.

Table 3. Annotator’s Performance in Legit.Health-AD-Test Visual Sign Severity Assessment

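| Visual Signs | RSD (%) | RMAE, Mean (%) | RMAE, Median (%) | FAR (%) | PAR1 (%) | PAR2 (%) | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|
| Erythema | 12.1 | 11.2 | 8.8 | 34.0 | 88.0 | 91.5 | 0.35 |
| Edema | 7.9 | 7.3 | 5.6 | 55.8 | 93.1 | 96.7 | 0.22 |
| Oozing | 10.3 | 9.5 | 7.5 | 44.4 | 89.9 | 93.1 | 0.39 |
| Excoriations | 12.7 | 11.6 | 9.4 | 39.7 | 79.0 | 87.1 | 0.20 |
| Lichenification | 10.1 | 9.3 | 7.4 | 46.8 | 88.0 | 92.9 | 0.21 |
| Dryness | 16.5 | 14.9 | 12.2 | 20.4 | 72.4 | 80.3 | 0.19 |
| Average | 11.6 | 10.6 | 8.5 | 40.2 | 85.0 | 90.3 | 0.26 |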

Abbreviations: FAR, full agreement rate; PAR, partial agreement rate; RMAE, relative mean absolute error; RSD, relative SD.

These results provide the baseline to appraise the results of Legit.Health-SCORADNet.

Table 4. Annotator’s Performance in Legit.Health-AD-FPK-IVI Visual Sign Severity Assessment

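| Visual Signs | RSD (%) | RMAE, Mean (%) | RMAE, Median (%) | FAR (%) | PAR1 (%) | PAR2 (%) | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|
| Erythema | 11.9 | 10.8 | 8.8 | 42.3 | 80.1 | 88.2 | 0.23 |
| Edema | 8.6 | 8.0 | 6.3 | 54.0 | 90.9 | 94.5 | 0.13 |
| Oozing | 12.7 | 11.6 | 9.4 | 35.1 | 81.9 | 87.3 | 0.27 |
| Excoriations | 9.7 | 9.0 | 7.0 | 45.0 | 92.7 | 95.5 | 0.08 |
| Lichenification | 13.3 | 12.2 | 9.7 | 27.9 | 85.5 | 90.9 | 0.27 |
| Dryness | 18.2 | 16.4 | 13.4 | 10.8 | 70.2 | 81.0 | 0.09 |
| Average | 12.4 | 11.3 | 9.1 | 35.9 | 86.6 | 89.6 | 0.18 |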

Abbreviations: FAR, full agreement rate; PAR, partial agreement rate; RMAE, relative mean absolute error; RSD, relative SD.

These results provide the baseline to appraise the results of Legit.Health-SCORADNet.

Legit.Health-SCORADNet was validated through two experiments in which the network was trained on several data splits using k-fold cross-validation: six-fold for the first experiment and three-fold for the second. All the results presented in Tables 5‒9 were obtained by averaging the network’s performance over the different data splits and were measured using the same metrics as the annotation, so that the two can be compared directly.

Table 5. Legit.Health-SCORADNet’s Results in Light Skin Lesion Surface Segmentation

| Clinical Sign | ACC, % (95% CI) | AUC (95% CI) | IoU (95% CI) | F1¹ (95% CI) |
|---|---|---|---|---|
| Lesion surface | 84.6 (80.9‒88.3) | 0.93 (0.90‒0.96) | 0.64 (0.59‒0.69) | 0.75 (0.71‒0.79) |

Abbreviations: ACC, accuracy; AUC, area under the curve; CI, confidence interval; IoU, intersection over union.

¹ F1 denotes F1 score.

Table 6. Legit.Health-SCORADNet’s Results in Dark Skin Lesion Surface Segmentation

Abbreviations: ACC, accuracy; AUC, area under the curve; CI, confidence interval; IoU, intersection over union.

Results are divided by experiment. The algorithm in experiment 1 was trained solely on light-skinned patient images, and the algorithm in experiment 2 was trained on mixed data containing 8% of dark-skinned patient images.

¹ F1 denotes F1 score.

Table 7. Legit.Health-SCORADNet’s Results in Visual Sign Severity Assessment

| Range | Ground Truth | Legit.Health-AD-Test, RMAE 1¹ (95% CI) | Legit.Health-AD-Test, RMAE 2² (95% CI) | Legit.Health-AD-FPK-IVI, RMAE 1¹ (95% CI) | Legit.Health-AD-FPK-IVI, RMAE 2² (95% CI) |
|---|---|---|---|---|---|
| 0‒3 | Median | 13.6 (9.7‒17.5) | 14.3 (10.4‒18.2) | 21.2 (17.3‒25.0) | 20.8 (16.9‒24.7) |
| 0‒10 | Median | 14.3 (10.4‒18.2) | 13.2 (9.3‒17.0) | 22.8 (18.9‒26.7) | 20.0 (16.0‒23.9) |
| 0‒100 | Median | 14.4 (10.5‒18.3) | 13.0 (9.1‒16.9) | 22.6 (18.7‒26.5) | 19.8 (15.9‒23.7) |
| 0‒100 | Mean | 13.5 (9.6‒17.4) | 13.4 (9.5‒17.3) | 21.1 (17.2‒25.0) | 19.9 (16.0‒23.8) |

Abbreviations: CI, confidence interval; DEX, Deep EXpectation; RMAE, relative mean absolute error.

The models were trained on Legit.Health-AD using a different range and ground truth method and tested on Legit.Health-AD-Test and Legit.Health-AD-FPK-IVI.

¹ RMAE 1 is obtained by applying the argmax function to the prediction.

² RMAE 2 is obtained by applying the DEX method to the prediction.

Table 8. Legit.Health-SCORADNet’s Results in Light-Skin Visual Sign Severity Assessment

| Visual Signs | RMAE 1¹ (95% CI) | RMAE 2² (95% CI) |
|---|---|---|
| Erythema | 14.1 (10.2‒18.0) | 13.3 (9.4‒17.2) |
| Edema | 16.1 (12.2‒20.0) | 16.0 (12.1‒19.9) |
| Oozing | 22.3 (18.4‒26.2) | 19.4 (15.5‒23.3) |
| Excoriations | 11.5 (7.6‒15.4) | 9.6 (5.7‒15.4) |
| Lichenification | 10.3 (6.4‒14.2) | 8.7 (4.8‒12.6) |
| Dryness | 12.4 (8.5‒16.3) | 11.3 (7.4‒15.2) |
| Average | 14.4 (10.5‒18.3) | 13.0 (9.1‒16.9) |

Abbreviations: CI, confidence interval; DEX, Deep EXpectation; RMAE, relative mean absolute error.

¹ RMAE 1 is obtained by applying the argmax function to the prediction.

² RMAE 2 is obtained by applying the DEX method to the prediction.

Table 9. Legit.Health-SCORADNet’s Results in Dark Skin Visual Sign Severity Assessment

| Visual Signs | Experiment 1, RMAE 1¹ (95% CI) | Experiment 1, RMAE 2² (95% CI) | Experiment 2, RMAE 1¹ (95% CI) | Experiment 2, RMAE 2² (95% CI) |
|---|---|---|---|---|
| Erythema | 17.8 (13.9‒21.7) | 15.7 (11.8‒19.6) | 16.2 (12.2‒20.2) | 14.3 (10.3‒18.3) |
| Edema | 16.8 (12.9‒20.7) | 18.6 (14.7‒22.5) | 18.1 (14.1‒22.0) | 15.4 (11.4‒19.4) |
| Oozing | 24.9 (21.0‒28.8) | 22.7 (18.8‒26.6) | 9.3 (5.3‒13.3) | 9.0 (5.0‒13.0) |
| Excoriations | 10.1 (6.2‒14.0) | 9.6 (5.7‒13.5) | 10.2 (6.2‒14.2) | 8.0 (4.0‒12.0) |
| Lichenification | 25.9 (22.0‒29.8) | 20.6 (16.7‒24.5) | 24.0 (20.0‒28.0) | 19.8 (15.8‒23.8) |
| Dryness | 39.9 (36.0‒43.8) | 31.7 (27.8‒35.6) | 26.0 (22.0‒30.0) | 19.3 (15.3‒23.3) |
| Average | 22.6 (18.7‒26.5) | 19.8 (15.9‒23.7) | 17.3 (13.3‒21.3) | 14.3 (10.3‒18.3) |

Abbreviations: CI, confidence interval; DEX, Deep EXpectation; RMAE, relative mean absolute error.

Results are divided by experiment. The algorithm in experiment 1 was trained solely on light-skinned patient images, and the algorithm in experiment 2 was trained on mixed data containing 8% of dark-skinned patient images.

¹ RMAE 1 is obtained by applying the argmax function to the prediction.

² RMAE 2 is obtained by applying the DEX method to the prediction.

Legit.Health-SCORADNet showed a good performance at visual sign severity assessment, obtaining a relative mean absolute error (RMAE) of 13.0% and AUC of 0.93 at surface estimation. The total execution time of Legit.Health-SCORADNet for a single image was 0.34 seconds, running on an Intel Xeon Platinum 8260 CPU at 2.40 GHz (Intel, Santa Clara, CA).

#### Lesion surface segmentation

Legit.Health-SCORADNet’s lesion surface segmentation results are presented in Tables 5 and 6. The AUC, IoU, and F1 for light skin were 0.93, 0.64, and 0.75, respectively, whereas the results on those metrics were 0.83, 0.32, and 0.42, respectively, for dark skin. However, when training in a small subset of dark skin images (experiment 2), the results significantly improved (0.41 for IoU and 0.33 for F1), as shown in Table 6. Figures 1 and 2 show the ground truth and the prediction for a sample case of Legit.Health-AD-Test and Legit.Health-AD-FPK-IVI, respectively.

Figure 1. Lesion surface segmentation masks. (a) Original image. (b) Legit.Health-SCORADNet’s prediction. (c) Ground truth. (d) Mask drawn by the first specialist. (e) Mask drawn by the second specialist. (f) Mask drawn by the third specialist. Legit.Health-AD-Test sample image gathered from Danderm Dermatology Atlas with the owner's permission.

Figure 2. Results of experiments 1 and 2 models on a dark skin image. (a) The predicted surface mask of the model trained on light skin. (b) The predicted surface mask of the model trained on both light and dark skin. (c) The ground truth mask. Legit.Health-AD-FPK-IVI sample image gathered from Danderm Dermatology Atlas with the owner's permission.

#### Visual sign severity assessment

On average, we achieved the best performance when we trained the network with the ground truth that resulted from applying the median and normalizing the outcome into the 0‒100 range (Table 7). Using that configuration, we ran experiments 1 and 2, and we got an RMAE of 13.0% in Legit.Health-AD-Test, which had an interobserver RMAE of 10.6%, having trained Legit.Health-SCORADNet on a dataset with 15.8% RMAE (Table 8). The RMAE on Legit.Health-AD-FPK-IVI was slightly higher: 14.3% (Table 9) when including dark skin images in the training set, and 19.8%, without including dark skin images. The visual sign with the worst performance on light skin was oozing (19.4%), followed by edema (16.0%). Lichenification (19.8%) and dryness (19.3%) were the most difficult visual signs for the algorithm to correctly predict on dark skin, with edema (15.4%) also having a value above the average. Interestingly, oozing got a much lower RMAE on Legit.Health-AD-FPK-IVI, whereas both test datasets had the same oozing intensity distribution. The distribution of predicted intensity values was plotted next to the ground truth distributions (Figure 3) to show that Legit.Health-SCORADNet was able to predict values in the whole range and not only the mean of the distribution.

Figure 3. Legit.Health-AD-Test visual sign intensity distribution of ground truth labels and predictions. The horizontal axis is in the range 0‒100 because the results are given using the best performing model, which was trained with ground truth labels in that range.

## Discussion

ASCORAD shows promise as an automatic scoring system that might enable a more objective and quicker evaluation. Indeed, a deep learning algorithm could simplify the assessment of AD, a very common skin disease that affects 15‒20% of children (Asher et al., 2006) and 1‒3% of adults worldwide. Scoring systems such as SCORAD and Eczema Area and Severity Index have high interobserver variability and are time-consuming. An AI-automated approach may help to reduce such bias and therefore be a more precise and objective criterion for evaluation in pharmaceutical studies and routine clinical practice.

Our results show that deep learning may be regarded as a fast and objective alternative method for the automatic assessment of AD with great potential, already achieving results comparable with those of human expert assessment while reducing interobserver variability and being more time-efficient. ASCORAD could also be used in situations where face-to-face consultations are not possible, providing an automatic assessment of clinical signs and lesion surface. It could also be a potential tool to reduce the time and effort of training clinical assessors for clinical trials and in clinical practice.

However, additional validation studies are needed in real-world settings and with diverse populations to ensure generalizability. Although the datasets used in this study capture the variability of a wide range of parameters, the algorithm should be tested on other datasets to prove its robustness and generalizability, in particular to dark skin tones. In the future, we intend to test ASCORAD in validation studies in which the objective part of the SCORAD will be assessed in person by the dermatologist. Comparing the results of the algorithm with those of face-to-face assessment is crucial because the severity of some visual signs, such as edema, dryness, or oozing, might be more difficult to estimate from an image than in person. Furthermore, the AI Marker will be used in those studies, helping the CADx system to correctly calculate the surface by converting lesion pixels into a metric unit of measurement.

To put our results into clinical context, the annotated lesion area was compared with the algorithm-predicted area. Because some photographs do not show the complete lesion area, the live assessment method cannot be directly compared with the photograph assessment method. However, image-based area assessment by an expert and predicted area have the same basis for their analysis and are therefore directly comparable. Legit.Health-SCORADNet resulted in a good overall RMAE of 13.0% and, for lesion surface estimation on light skin, an excellent AUC of 0.93 with an IoU of 0.64 and F1 of 0.75 (Table 5).

The Legit.Health-AD-Test and Legit.Health-AD-FPK-IVI datasets have strongly positively skewed distributions for all the visual signs, which means that the most frequent intensities are 0 and 1. It seems that a vast majority of images are of mild AD or that the observers had a strong bias toward low-intensity values. If the majority of the visual signs are close to zero intensity, it is possible that the RSD reflected lower disagreement (12% vs. 17% in Legit.Health-AD). In fact, Oranje et al. (2007) found an RSD of 20%, which was very close to the interobserver variability found in Legit.Health-AD.

Looking at Cohen’s kappa values, it seems that some of the visual signs such as lichenification in Legit.Health-AD and excoriations and dryness in Legit.Health-AD-FPK-IVI have a null interobserver agreement. However, Cohen's kappa is a statistical measure for nominal classification problems, and metrics such as RSD, RMAE, FAR, PAR1, and PAR2 show that the annotation of the specialists is far from random. For example, the visual sign excoriations in Legit.Health-AD-FPK-IVI obtains a Cohen’s kappa value of 0.08 and PAR2 of 95.5%, far from the random value (62%).

In short, we have shown that a convolutional neural network trained with the observers’ average results can achieve an RMAE similar to that of one of the experts. Furthermore, our automatic method outputs a value in the range 0‒100 for each visual sign instead of the range 0‒3 used in the usual SCORAD, broadening the spectrum of possible outputs and turning the discrete problem into a more continuous one.

We believe that our algorithm has the potential to reduce costs in dermatology by saving time while improving the documentation process of the evolution of the disease. This could be interesting for the application in pharmaceutical clinical trials, as well as in clinical practice.

## Materials and Methods

### Datasets and annotations

In this retrospective, noninterventional study, three new annotated datasets were constructed to train and validate the performance of the lesion surface segmentation and visual sign severity assessment algorithms. The first two datasets comprise solely light-skinned patients (Fitzpatrick I‒III) because it proved to be easier to gather datasets of such characteristics, whereas the third consists of images of IV‒VI skin types according to the Fitzpatrick scale. Demographic characteristics of each dataset are gathered in Table 10. Clinical images were collected from online public sources, and patient consent and ethics committee approval were not necessary. Published images belong to Danderm Dermatology Atlas, and the author gave his consent for publication.

Table 10. Demographic Characteristics

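| Datasets | Age <18 (%) | Age 18‒29 (%) | Age 30‒39 (%) | Age 40‒49 (%) | Age 50‒64 (%) | Age >65 (%) | Male (%) | Female (%) | Light Skin (%) | Dark Skin (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Legit.Health-AD | 31 | 23 | 26 | 14 | 4 | 2 | 39 | 61 | 100 | 0 |
| Legit.Health-AD-Test | | | | | | | | | 100 | 0 |
| Legit.Health-AD-FPK-IVI | | | | | | | | | 0 | 100 |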

Legit.Health-AD is a dataset collected from online dermatological atlases that consists of 604 images of light-skinned patients with AD, of whom one third are children (Table 10), with lesions present on different body parts. The dataset contains the following percentage of body zones: head (22%), trunk (11%), arms (23%), hands (9%), legs (16%), feet (8%), genitalia (3%), full body (1%), and skin close-up (7%). The dataset contains a substantial variety of clinical images taken from different angles, distances, light conditions, body parts, and disease severity. Figure 4 depicts the normalized intensity distribution by visual sign. The images have a minimum size of 260 × 256 pixels, an average size of 667 × 563 pixels, and a maximum size of 1,772 × 1,304 pixels.

Figure 4. Comparison of the intensity level distribution by a visual sign of the datasets used in the study.

A second dataset, Legit.Health-AD-Test, was built for testing purposes. The dataset was gathered from several dermatological atlases publicly available and contains a total number of 367 images that belong exclusively to light-skinned patients. The dataset is only characterized by skin type (Table 10), and basic demographic information such as age and sex is missing because the original sources do not provide that information. The images were downloaded one by one, and each of them was reviewed by a physician to approve the inclusion of the image in the dataset. Duplicates or very similar images were removed, and no other data sampling technique was applied. Similar to Legit.Health-AD, the dataset contains images of children and adults with great variability in angles, distances, light conditions, body parts, and disease severity. The dataset contains the following percentage of body zones: head (35%), trunk (20%), arms (18%), hands (7%), legs (13%), feet (2%), genitalia (2%), and skin close-up (3%). The visual sign intensity distribution of this dataset is different from that of Legit.Health-AD, having more cases of zero intensity for most of the visual signs (Figure 4). The images have a minimum size of 313 × 210 pixels, an average size of 574 × 537 pixels, and a maximum size of 2,848 × 3,252 pixels.

Legit.Health-AD-FPK-IVI is a dataset collected from online dermatological atlases that contains photos of children and adult patients with Fitzpatrick IV‒VI skin types suffering from AD. The same manual procedure as that of Legit.Health-AD-Test was applied to gather the dataset, and basic demographic information such as age and sex is also missing (Table 10). It is composed of 112 images with a minimum size of 200 × 204 pixels, an average size of 766 × 695 pixels, and a maximum size of 3,024 × 4,032 pixels. The dataset contains the following percentage of body zones: head (41%), trunk (10%), arms (17%), hands (8%), legs (13%), feet (3%), and skin close-up (8%). The goal of including this dataset in the study was to gather preliminary evidence of the efficiency of the algorithms in dark skin.

### Ground truth labels

The corresponding ground truths of each dataset were prepared by nine experts, three for each dataset, who treat patients with AD in their daily practice, to reduce variability by combining their results. The experts annotated the images with no context other than the images themselves. They had to draw a mask over the lesion and choose a score from 0 to 3 for each of the visual signs that compose the SCORAD.

We obtained the ground truth labels for lesion segmentation and visual sign intensity classification by averaging the masks of the three annotators and by averaging the intensity levels. We chose the mean over the median because it is the statistical measure that gets the best results for generating ground truth labels from multiannotator ordinal data (Lakshminarayanan and Teh, 2013).

### Data preprocessing

Images were resized to 512 × 512, and pixel values scaled between 0 and 1. In addition, images in which the disease was too small in the picture were cropped, focusing on the disease. Ground truth labels were obtained from averaging the results as explained in the previous section. However, we ran some additional experiments using an alternative ground truth only for the training set, consisting of the median visual intensity, instead of the mean. As a result of applying the mean and median, discrete visual sign intensity levels yielded real numbers, which had to be rounded to return to the discrete range 0‒3. To prevent information loss, we considered rescaling the values to 0‒10 and 0‒100 before rounding and compared these ranges with the original one.
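The effect of rescaling before rounding can be illustrated with a minimal sketch. The function name `ground_truth_label`, the example scores, and the round-half-up rule are assumptions made for illustration; the text does not state how ties were rounded.

```python
from statistics import mean, median

def ground_truth_label(scores, upper=3, method="mean"):
    """Combine annotator scores (each 0-3) into one label on a 0..upper scale."""
    combine = mean if method == "mean" else median
    combined = combine(scores)        # real number in [0, 3]
    rescaled = combined * upper / 3   # rescale BEFORE rounding to keep detail
    return int(rescaled + 0.5)        # round half up (assumed; values are >= 0)

scores = (1, 2, 2)  # hypothetical scores from three annotators
labels = {upper: ground_truth_label(scores, upper) for upper in (3, 10, 100)}
print(labels)  # {3: 2, 10: 6, 100: 56}
```

In the 0‒3 range the mean of 1.67 collapses to 2, whereas the 0‒10 and 0‒100 ranges retain the information that the annotators did not fully agree.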

With regard to lesion surface masks, the average mask was computed, resulting in a grayscale image in the range 0‒255. A pixel intensity threshold of 155 was applied to obtain a binary mask that was used as the ground truth. Images were finally normalized to the range 0‒1.
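The mask-combination step can be sketched as follows, using plain nested lists in place of image arrays. With three annotators the averaged pixel can only take the values 0, 85, 170, or 255, so a threshold of 155 keeps the pixels marked by at least two of them. The function name `consensus_mask`, the toy masks, and the use of a strict `>` comparison are assumptions.

```python
def consensus_mask(masks, threshold=155):
    """Average binary masks (0/1) into a 0-255 grayscale image, then binarize."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            votes = sum(m[r][c] for m in masks)
            gray = 255 * votes / n  # 0, 85, 170 or 255 for three annotators
            row.append(1 if gray > threshold else 0)
        result.append(row)
    return result

# Hypothetical 2 x 3 masks drawn by three annotators
a = [[1, 1, 0], [0, 1, 0]]
b = [[1, 0, 0], [0, 1, 1]]
c = [[1, 1, 0], [0, 0, 0]]
mask = consensus_mask([a, b, c])
print(mask)  # [[1, 1, 0], [0, 1, 0]]
```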

### Deep learning model

The ASCORAD calculation can be divided into two parts: lesion surface segmentation and visual sign severity assessment. We trained two separate models, one for each task, and named the neural networks involved in the calculation of the ASCORAD Legit.Health-SCORADNet (source code is available at github.com/Legit-Health/ASCORAD).

#### Lesion surface segmentation

For the lesion surface segmentation problem, we applied a U-Net, an architecture that was first designed for biomedical image segmentation and showed great results on the task of cell tracking (Ronneberger et al., 2015). The main contribution of this architecture was the ability to achieve good results even with only hundreds of examples. The U-Net consists of two paths: a contracting path and an expanding path. The contracting path is a typical convolutional network where convolution and pooling operations are repeatedly applied. We decided to use the ResNet-34 (He et al., 2015) architecture, which is the typical backbone used in the contracting path.

#### Visual sign severity assessment

We trained a multioutput (Xu et al., 2020) classifier, with one softmax layer per visual sign (Figure 5). We used the EfficientNet-B0 network architecture (Tan and Le, 2019) that was pre-trained on approximately 1.28 million images (1,000 object categories) from the 2014 ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2014) and trained it on our dataset using transfer learning (Pan and Yang, 2010). EfficientNets achieve better accuracy and efficiency than previous convolutional neural networks with fewer parameters by applying a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. There are eight versions, consisting of a different number of parameters, with the B0 being the smallest network that achieves state-of-the-art 77.1% top-1 accuracy on ImageNet for a network consisting of 5 million parameters.

Figure 5. The visual signs that compose the SCORAD. Each visual sign can be classified into four intensity levels: none (0), mild (1), moderate (2), and severe (3). The multioutput EfficientNet-B0 network trained for visual sign intensity estimation has one head for each visual sign. Lichenf., lichenification; SCORAD, SCOring Atopic Dermatitis.

Visual sign severity grading can be seen as a piecewise regression or alternatively as a discrete classification with four discrete value labels for each visual sign intensity. In the case of multiple visual signs, a multilabel classification network can be used to solve the problem. However, to exploit methods such as Deep EXpectation (Rothe et al., 2015), one softmax layer per visual sign is needed. So, for the purpose of applying the Deep EXpectation method, we constructed a multioutput classifier with six softmax layers consisting of N neurons each, with N being 4, 11, or 101, depending on the range of the ground truth labels.

The Deep EXpectation method has been shown to obtain better results on regression metrics by approaching a regression problem in the same way as a classification problem and then applying a softmax expected value:

$$E(O) = \sum_{i=0}^{N} y_i o_i \tag{1}$$

where $O = \{0, 1, \ldots, N\}$ is the $N$-dimensional output layer of each visual sign, representing softmax output probabilities $o_i \in O$, and $y_i$ are the discrete intensity levels corresponding to each class $i$.
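The difference between taking the argmax of a softmax head and taking its expected value can be seen in a small sketch. The probabilities below are hypothetical, and the helper names (`softmax`, `argmax_level`, `dex_level`) are illustrative rather than taken from the Legit.Health-SCORADNet code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_level(probs, levels):
    """Intensity level of the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return levels[best]

def dex_level(probs, levels):
    """Expected intensity: sum of y_i * o_i over all classes."""
    return sum(y * o for y, o in zip(levels, probs))

levels = [0, 1, 2, 3]          # intensity levels y_i for one visual sign
probs = [0.1, 0.2, 0.6, 0.1]   # hypothetical softmax outputs o_i
print(argmax_level(probs, levels))          # 2
print(round(dex_level(probs, levels), 2))   # 1.7
```

The argmax returns only the winning class, whereas the expectation also uses the probability mass assigned to neighboring levels, which yields a continuous estimate.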

### Evaluation metrics

Dermatologists may have a bias, a fixed effect where one observer consistently measures high or low. There may also be a random effect or heterogeneity, where the observer scores higher than others for some patients and lower for others. To measure interobserver variability, understand annotation quality in more detail, and compare it with the performance of the algorithms, we calculated the following set of metrics.

First of all, we computed the RSD and Cohen’s kappa for all the visual signs and lesion surface segmentation. In the case of the annotation of visual sign intensity, we also measured the times that the three observers gave the same result or the FAR. To complement FAR, two more metrics were calculated: the times that at least two observers gave the same result, whereas the third observer gave a result that deviated ±1 from the other observer’s or the PAR1. The same metrics without the ±1 condition for the third observer were called PAR2. Therefore, the metrics are ordered as follows in regard to their restrictiveness: FAR > PAR1 > PAR2. To assess the quality of the annotations and understand the results in more depth, we compared the results with an algorithm that randomly picked three intensity values for each visual sign. We ran this millions of times and found that RSD of a random visual sign evaluation tends to 27%, FAR tends to 6%, PAR1 tends to 34%, and PAR2 tends to 62%.
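The random reference values for FAR, PAR1, and PAR2 can be checked exactly by enumerating all 64 equally likely triples of scores instead of simulating them. This sketch assumes that full agreement also counts toward PAR1 and PAR2, which is the reading that reproduces the quoted 34% and 62%. RSD is left out because its exact formula is not given here. The function name `agreement_flags` is illustrative.

```python
from itertools import product

def agreement_flags(triple):
    """Return (full, partial1, partial2) agreement flags for three scores."""
    a, b, c = sorted(triple)
    full = a == b == c
    two_agree = a == b or b == c  # at least two identical scores (PAR2)
    # With sorted scores, the odd one out is at most 1 away from the pair
    # exactly when the overall spread is at most 1 (PAR1).
    partial1 = two_agree and (c - a) <= 1
    return full, partial1, two_agree

triples = list(product(range(4), repeat=3))  # all 64 outcomes for scores 0-3
counts = [sum(flags) for flags in zip(*(agreement_flags(t) for t in triples))]
far, par1, par2 = (n / len(triples) for n in counts)
print(counts)           # [4, 22, 40]
print(far, par1, par2)  # 0.0625 0.34375 0.625
```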

We also calculated the metrics that allowed a direct comparison of the Legit.Health-SCORADNet and the annotation, for both lesion segmentation and visual sign severity assessment. Pixel accuracy, AUC, IoU, and F1 score metrics were the preferred metrics for segmentation, whereas for the severity assessment of visual signs, we used RMAE.
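A minimal sketch of these metrics on binary masks is given below. AUC is omitted because it requires per-pixel probabilities rather than binary predictions. The RMAE definition used here, mean absolute error expressed as a percentage of the label range, is an assumption, and the masks and values are toy examples.

```python
def segmentation_metrics(pred, truth):
    """Pixel accuracy, IoU, and F1 for two binary masks given as nested lists."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Two empty masks are treated as a perfect match.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return accuracy, iou, f1

def rmae(predictions, targets, upper):
    """Mean absolute error as a percentage of the range 0..upper (assumed)."""
    errors = [abs(p - t) for p, t in zip(predictions, targets)]
    return 100 * sum(errors) / (len(errors) * upper)

pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
accuracy, iou, f1 = segmentation_metrics(pred, truth)
print(iou)                                        # 0.5
print(rmae([50, 70, 10], [60, 70, 30], upper=100))  # 10.0
```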

### Experimental setup

We ran two main experiments for each task: one with images containing only light skin and another adding a small number of dark skin images in the training set.

In the first experiment, we used Legit.Health-AD for training and Legit.Health-AD-Test and Legit.Health-AD-FPK-IVI for testing. We followed a six-fold cross-validation strategy to train the models. The models trained on the different folds were tested on both test sets, and the results were averaged over the folds to reduce the variance and bias.

The second experiment was built to better understand the performance of the network on dark skin when including a tiny fraction of dark-skinned patient images in the training set. In this experiment, we used Legit.Health-AD, Legit.Health-AD-Test, and a subset of Legit.Health-AD-FPK-IVI for training and the rest for testing. The training and test subsets of Legit.Health-AD-FPK-IVI were obtained with a three-fold cross-validation strategy. This means that the training set was composed of 971 light-skinned patient images, Legit.Health-AD and Legit.Health-AD-Test combined, and 75 dark-skinned patient images, which is a tiny fraction of the total images (8%). The dark skin test set was composed of the remaining 37 images. This split was done three times (three-fold), including different images in the training and test set, to obtain more reliable results.
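The three-fold split of the 112 dark skin images can be sketched with a small index generator. The shuffling, the seed, and the function name `kfold_indices` are assumptions; the text does not describe how images were assigned to folds. Because 112 is not divisible by 3, the split is approximately 75 training and 37 test images per fold.

```python
import random

def kfold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(kfold_indices(112, 3))
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)  # [(74, 38), (75, 37), (75, 37)]
```

In each fold, the dark skin training indices would be combined with the 971 light skin images, and the held-out indices would form the dark skin test set.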

In the case of visual sign severity assessment, we also ran experiments to find the optimal range, testing 0‒3, 0‒10, and 0‒100 ranges. In addition, we tested the mean and the median as the statistical measure for obtaining the ground truth of the training set. This project was entirely run on a single NVIDIA Tesla V100 (32 gigabytes) graphics processing unit (Nvidia, Santa Clara, CA).

With the objective of making the algorithms accessible to the healthcare professional, we created a fully integrated CADx system, a web application that integrates the Legit.Health-SCORADNet algorithm and calculates the patient-based ASCORAD using clinical images. The CADx system includes three stages: uploading the images of the affected areas, processing the images, and reporting the ASCORAD.

In the first stage, images of affected areas are uploaded to the system using a simple user interface, depicted in Figure 6a. The user has to choose the body zone from the options defined in the original SCORAD (Stalder et al., 1993): head and neck, right upper limbs, left upper limbs, right lower limbs, left lower limbs, anterior trunk, back, and genitals. In some cases, such as children aged <2 years with the whole body affected, a full-body photograph can also be uploaded. In addition, the patient answers a simple two-item questionnaire: itchiness (0‒10) and sleeplessness (0‒10).

Figure 6. CADx system. (a) Illustration of the questionnaire. (b) Illustration of the report generated by the CADx system. The report contains the evolution of the ASCORAD over time, the last reported ASCORAD item by item, a picture of the lesion surface predicted by the algorithm, the final ASCORAD score with its translation to a category, and some additional information such as image quality. The example record shown is fictional. ASCORAD, Automatic SCOring of Atopic Dermatitis; CADx, computer-aided diagnosis; CET, Central European Time; DIQA, Dermatology Image Quality Assessment; DLQI, Dermatology Life Quality Index; Jul, July.

In the second stage, the Legit.Health-SCORADNet algorithm processes the images and automatically calculates the severity of AD by estimating the intensity of each visual sign and the surface of the lesion. In the third stage, the output of the algorithm is shown in a user-friendly report containing an image with the estimated lesion surface and a chart with the evolution of the ASCORAD over time. The final report of the proposed CADx system is depicted in Figure 6b.

Computing the ASCORAD requires calculating the proportion of skin covered by the lesion. We solved this by including a small piece of hardware called AI Marker, a sticker with several shapes and colors that helps to translate pixels into a metric unit of measurement. The AI Marker should be kept close to the lesion, and it is automatically detected. In addition, the body surface area is calculated from the patient's height and weight using the Mosteller formula (Lee et al., 2008; Orimadegun and Omisanjo, 2014). Once the surface of the lesion and the body surface area are estimated, the percentage can be calculated by dividing the surface of the lesion by the body surface area (equation 2). This allows the CADx system to calculate the final value of ASCORAD. When the AI Marker is not used, the lesion surface percentage is input manually by the user, although the CADx system is still capable of calculating the visual sign intensity values automatically.
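The body surface area step can be illustrated with the Mosteller formula, which estimates the body surface area in square meters as the square root of height (cm) times weight (kg) divided by 3600. The sketch below is illustrative rather than the CADx implementation. It assumes that the lesion surface is already available in square centimeters (for example, after pixel-to-metric conversion with the AI Marker), and the function names are made up for this example.

```python
import math


def mosteller_bsa_m2(height_cm, weight_kg):
    """Body surface area in m^2 by the Mosteller formula."""
    return math.sqrt(height_cm * weight_kg / 3600.0)


def lesion_percentage(lesion_area_cm2, height_cm, weight_kg):
    """Lesion surface as a percentage of the estimated body surface area."""
    bsa_cm2 = mosteller_bsa_m2(height_cm, weight_kg) * 10_000  # 1 m^2 = 10,000 cm^2
    return 100.0 * lesion_area_cm2 / bsa_cm2


# Example patient: 180 cm and 80 kg give a body surface area of 2.0 m^2.
bsa = mosteller_bsa_m2(180, 80)
pct = lesion_percentage(400, 180, 80)  # 400 cm^2 of lesion surface
print(bsa, pct)
```

In this example, 400 cm² of lesion on a 2.0 m² body surface corresponds to 2% of the skin.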

When more than one image is uploaded, the lesion surfaces of the images are summed, and the maximum (Dirschka et al., 2017) of each visual sign intensity across the images is used for the ASCORAD calculation. Therefore, the final formula for N images of the whole body can be written as follows:

$$
\text{ASCORAD} = \frac{1}{5}\,\frac{\sum_{i=1}^{N} a_i}{\text{body surface area}} + \frac{7}{2}\sum_{j=1}^{6} \max\left(B_{j,1}, \ldots, B_{j,N}\right) + C \tag{2}
$$

where $a_i$ stands for the lesion surface of image $i$ in a metric unit of measurement, $B_{j,n} \in [0, 3]$ stands for the intensity of visual sign $j$ in image $n$, and $C \in [0, 20]$ stands for the sum of the symptoms input by the patient.
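A worked example may help to read equation 2. The sketch below is a minimal illustration, not the Legit.Health implementation. It makes one explicit assumption: the extent term is expressed as a percentage (0‒100) of the body surface area, as in the original SCORAD, so the ratio of lesion surface to body surface area is multiplied by 100 before being divided by 5. The function name and the input values are invented for the example.

```python
def ascorad(lesion_areas_cm2, bsa_cm2, sign_intensities, symptoms):
    """Combine extent, visual sign intensity and symptoms for N images.

    lesion_areas_cm2: lesion surface per image, in cm^2
    bsa_cm2: body surface area, in cm^2
    sign_intensities: one list of six intensities (0-3) per image
    symptoms: sum of itchiness and sleeplessness (0-20)
    """
    if len(lesion_areas_cm2) != len(sign_intensities):
        raise ValueError("one intensity list is required per image")
    if any(len(signs) != 6 for signs in sign_intensities):
        raise ValueError("each image needs six visual sign intensities")
    # Assumption: extent as a percentage of body surface area (0-100).
    extent_pct = 100.0 * sum(lesion_areas_cm2) / bsa_cm2
    # Maximum intensity of each visual sign across all images.
    per_sign_max = [max(signs[j] for signs in sign_intensities) for j in range(6)]
    return extent_pct / 5 + 7 * sum(per_sign_max) / 2 + symptoms


score = ascorad(
    lesion_areas_cm2=[300, 100],
    bsa_cm2=20_000,
    sign_intensities=[[1, 2, 0, 1, 0, 1], [2, 1, 0, 0, 1, 1]],
    symptoms=5 + 3,  # itchiness 5, sleeplessness 3
)
print(round(score, 2))
```

Here the extent is 2% (contributing 0.4), the per-sign maxima sum to 7 (contributing 24.5), and the symptoms add 8, for a total of 32.9.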

### Software and statistical analysis

The models were implemented and trained using PyTorch (Paszke et al., 2019). Metrics and k-fold splits were calculated in Python using the scikit-learn package (Pedregosa et al., 2012) and plotted using Matplotlib (Hunter, 2017).

### Data availability statement

The images of Legit.Health-AD, Legit.Health-AD-Test, and Legit.Health-AD-FPK-IVI datasets related to this article can be found at http://www.atlasdermatologico.com.br/, hosted at Dermatology Atlas; http://www.danderm-pdv.is.kkh.dk/, hosted at Danderm; https://www.dermatlas.net/, hosted at Interactive Dermatology Atlas; https://www.dermis.net/dermisroot/en/home/index.htm, hosted at DermIS (Diepgen and Eysenbach, 1998); https://dermnetnz.org/, hosted at DermNet NZ; and http://www.hellenicdermatlas.com/en/, hosted at Hellenic Dermatological Atlas.

## ORCIDs

Alfonso Medela: http://orcid.org/0000-0001-5859-5439

Taig Mac Carthy: http://orcid.org/0000-0001-5583-5273

S. Andy Aguilar Robles: http://orcid.org/0000-0003-0618-6179

Carlos M. Chiesa-Estomba: http://orcid.org/0000-0001-9454-9464

Ramon Grimalt: http://orcid.org/0000-0001-7204-8626

## Author Contributions

Conceptualization: AM, RG; Data Curation: AM; Formal Analysis: AM, CMCE; Investigation: AM, TMC, SAAR, CMCE, RG; Methodology: AM, TMC, RG; Project Administration: SAAR; Visualization: TMC; Writing - Original Draft Preparation: AM; Writing - Review and Editing: AM, TMC, SAAR, CMCE, RG

## Conflict of Interest

The authors state no conflict of interest.

## Acknowledgments

The authors thank Fernando Alfageme Roldán for technical advice, BioCruces Bizkaia Health Research Institute for the academic support, and IBM for providing the computing infrastructure for the deep learning experiments.