Bridging the Gap Between Real and Synthetic Traffic Sign Repositories

Diogo Lopes da Silva and António Ramires Fernandes, in Proceedings of DeLTA 2022

This work aims at generating synthetic traffic sign datasets. These datasets can be used to train CNN models that achieve high accuracy when tested on real data.

We employ traditional techniques to generate our synthetic samples and explore two new operators: Perlin noise and Confetti noise. These two operators proved essential in achieving accuracies that are extremely close to those obtained with real datasets.

These datasets are created from nothing more than a set of templates. We tested our synthetic datasets against three well-known traffic sign datasets: GTSRB (German), BTSC (Belgian), and rMASTIF (Croatian).

Here are some samples generated by our script for the German Traffic Sign Recognition Benchmark (GTSRB):

German synthetic samples

When real data is available, it is possible to feed more information to the generator. In our work we explored adding brightness information. We found that the brightness of all the studied datasets follows a Johnson distribution, and we estimated the parameters of that distribution for each dataset. Based on this information we were able to generate samples with brightness values drawn from the respective distribution.
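For reference, a fit of this kind can be reproduced with SciPy's Johnson SB distribution. The sketch below is illustrative only: it assumes brightness is measured as the mean grayscale value of each training image, uses SciPy's parameterisation (a = $\gamma$, b = $\delta$, loc = $\xi$, scale = $\lambda$), and the dataset path is a placeholder.

```python
# Sketch: fit a Johnson SB distribution to per-image brightness values.
# Assumptions (not from the repository): brightness = mean grayscale intensity,
# and SciPy's johnsonsb parameterisation (a = gamma, b = delta, loc = xi, scale = lambda).
import glob

import numpy as np
from PIL import Image
from scipy import stats

def image_brightness(path: str) -> float:
    """Mean grayscale intensity of an image, in [0, 255]."""
    return float(np.asarray(Image.open(path).convert("L"), dtype=np.float64).mean())

paths = glob.glob("GTSRB/train/**/*.ppm", recursive=True)  # placeholder dataset path
brightness = np.array([image_brightness(p) for p in paths])

gamma, delta, xi, lam = stats.johnsonsb.fit(brightness)
print(f"gamma={gamma:.3f} delta={delta:.3f} xi={xi:.3f} lambda={lam:.3f}")
```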

The table below shows the parameters of the Johnson distribution for each of the evaluated datasets:

| dataset | $\gamma$ | $\delta$ | $\xi$ | $\lambda$ |
|---------|----------|----------|-------|-----------|
| GTSRB   | 0.747    | 0.907    | 7.099 | 259.904   |
| BTSC    | 0.727    | 1.694    | 2.893 | 298.639   |
| rMASTIF | 0.664    | 1.194    | 20.527| 248.357   |

The following image shows the brightness distribution for the three datasets, together with the fitted Johnson distributions based on the parameters in the table above.

Johnson distribution
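Given these parameters, target brightness values can be drawn directly with SciPy. This is a minimal sketch, again assuming SciPy's mapping a = $\gamma$, b = $\delta$, loc = $\xi$, scale = $\lambda$; the dictionary keys mirror the `--brightness` flag names used by the generator.

```python
# Sketch: draw target brightness values from the fitted Johnson SB distributions.
import numpy as np
from scipy import stats

johnson_params = {
    "german":   (0.747, 0.907, 7.099, 259.904),   # GTSRB
    "belgium":  (0.727, 1.694, 2.893, 298.639),   # BTSC
    "croatian": (0.664, 1.194, 20.527, 248.357),  # rMASTIF
}

rng = np.random.default_rng(0)
g, d, xi, lam = johnson_params["german"]
brightness = stats.johnsonsb.rvs(g, d, loc=xi, scale=lam, size=2000, random_state=rng)
brightness = np.clip(brightness, 0, 255)  # keep values in a valid 8-bit range
```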

In the absence of real data, the overall brightness of the image is computed as:

$B = bias + u^\gamma \times (255 - bias)$ (Eq. 1)

where

  • $bias$ determines the minimum brightness,
  • $u$ is a sample from a uniform distribution on $[0,1]$,
  • $\gamma$ is an exponent controlling how the brightness values are skewed.

In our tests we set $bias=10, \gamma = 2$.
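A direct transcription of Eq. 1 (illustrative; the internal implementation in generator.py may differ) could look like:

```python
import random

def exp_brightness(bias: float = 10.0, gamma: float = 2.0) -> float:
    """Overall image brightness from Eq. 1: B = bias + u^gamma * (255 - bias)."""
    u = random.random()  # u ~ Uniform[0, 1]
    return bias + u ** gamma * (255.0 - bias)
```

With $\gamma = 2$ the term $u^\gamma$ is skewed towards 0, so most samples fall on the darker side of the range, while $bias$ guarantees a minimum brightness of 10.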

Most works so far use real scenery images as backgrounds for the synthetic samples. We also explored applying solid colour backgrounds, as shown in the image above.

Considering all options, we generated four different datasets:

  • SES: Synthetic dataset with brightness drawn from exponential equation (Eq. 1) and solid color backgrounds.
  • SER: Synthetic dataset with brightness drawn from exponential equation (Eq. 1) and real image backgrounds.
  • SJS: Synthetic dataset with brightness drawn from Johnson distribution and solid color backgrounds.
  • SJR: Synthetic dataset with brightness drawn from Johnson distribution and real image backgrounds.

The full sample generation algorithm is as follows:

algorithm

Script generator.py can be used to generate the synthetic datasets.

Usage example:

python generator.py --templates template_location --output dest_dir --number 2000 --seed 0 --brightness exp2 --negative_folder backgrounds --negative_ratio 1

Parameters:

  • --templates: the folder where the templates are located
  • --output: folder where the synthetic dataset will be written
  • --number: number of samples to generate for each class
  • --seed: different seeds will produce different datasets
  • --brightness: one of ['exp2', 'belgium', 'croatian', 'german']. 'exp2' refers to equation (1), whereas the other options will use the respective Johnson distribution.
  • --negative_folder: (optional) if using real backgrounds, the folder where these are stored. Note: background images should be signless street views for best results.
  • --negative_ratio: a value in [0,1] controlling the mix of solid colour and negative (real image) backgrounds; 0 means all solid colour.

Some more options are available, see the script code.
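For instance, the four dataset variants described earlier could be produced with calls along these lines. This is only a sketch using the flags documented above; the template, background, and output folders are placeholders.

```python
# Sketch: generate the SES, SER, SJS and SJR variants for the German templates.
import subprocess

variants = {
    "ses": ["--brightness", "exp2"],    # Eq. 1 brightness, solid colour backgrounds
    "ser": ["--brightness", "exp2", "--negative_folder", "backgrounds", "--negative_ratio", "1"],
    "sjs": ["--brightness", "german"],  # Johnson brightness, solid colour backgrounds
    "sjr": ["--brightness", "german", "--negative_folder", "backgrounds", "--negative_ratio", "1"],
}

for name, extra in variants.items():
    subprocess.run(
        ["python", "generator.py",
         "--templates", "templates/german",      # placeholder template folder
         "--output", f"synthetic/gtsrb_{name}",  # placeholder output folder
         "--number", "2000", "--seed", "0",
         *extra],
        check=True,
    )
```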

As opposed to previous works such as [1] and [2], we did not aim at achieving photo-realistic imagery for our synthetic samples. As can be seen from the samples above, our signs are not realistic at all, yet our results clearly surpass previous attempts at using synthetic datasets. This may hint that our notion of "realism" is not the most suitable one for a CNN model.

To train a model with a dataset (synthetic, real, or a merge of both) use the script train.py

Usage example:

python train.py --data my_dataset --seed 0 --runs 5 --epochs 40

Parameters:

  • --data: folder where the dataset is located
  • --seed: sets the PyTorch seeds. Different seeds will produce different trained models
  • --runs: number of models to train
  • --epochs: total number of epochs to run

Other parameters are available, see the script code.

Results for synthetic datasets solo

Results for our synthetic datasets. Where available, the best third-party results are also presented. Results are also presented for models trained with real data, to show the gap between training with real and synthetic data. Note: the real datasets were previously balanced using the script balance.py. Our results are an average of 5 runs.
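For reference, balancing a dataset to a minimum per-class count can be sketched as below. This is illustrative only: balance.py itself may rely on augmentation rather than plain duplication, and an ImageFolder-style layout with one sub-folder per class is assumed.

```python
# Sketch: oversample each class folder until it holds at least `minimum` images.
import shutil
from pathlib import Path

def balance(dataset_dir: str, minimum: int = 2000) -> None:
    for class_dir in sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir()):
        images = sorted(class_dir.iterdir())
        for i in range(max(0, minimum - len(images))):
            src = images[i % len(images)]  # cycle through the existing samples
            shutil.copy2(src, class_dir / f"dup{i}_{src.name}")

balance("gtsrb_train", minimum=2000)  # placeholder path
```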

GTSRB (real data = $99.64 \pm 0.02$)

For comparison, Luo et al. [1] report $97.25$ with synthetic data ($99.20$ with real data).

|    | Real Bg. | Solid Bg. |
|----|----------|-----------|
| SE | $99.32 \pm 0.25$ | $99.25 \pm 0.06$ |
| SJ | $\mathbf{99.41 \pm 0.05}$ | $99.39 \pm 0.08$ |

BTSC (real data = $99.30 \pm 0.03$)

|    | Real Bg. | Solid Bg. |
|----|----------|-----------|
| SE | $98.86 \pm 0.12$ | $\mathbf{99.12 \pm 0.04}$ |
| SJ | $98.92 \pm 0.09$ | $99.11 \pm 0.09$ |

rMASTIF (real data = $99.71 \pm 0.05$)

|    | Real Bg. | Solid Bg. |
|----|----------|-----------|
| SE | $99.27 \pm 0.14$ | $\mathbf{99.47 \pm 0.09}$ |
| SJ | $99.37 \pm 0.08$ | $99.26 \pm 0.17$ |

Our synthetic data results are always within half a percentage point of the results obtained with real data. For both BTSC and rMASTIF we obtained better results with the SES datasets, whereas for GTSRB the best results were obtained with the SJR dataset. This discrepancy may be due to the fact that the latter dataset is the darkest of the three (see the figure with the Johnson distributions).

Results for merged datasets

For these tests we merged the synthetic datasets with the real dataset, resulting in approximately a 50/50 split between real and synthetic data. Note that our real datasets have been balanced to have at least 2000 images per class, and the synthetic datasets also have 2000 samples per class. As the solid colour datasets are less biased and provide more diversity in this context (the real data already contains real backgrounds), these were selected for this experiment.
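A minimal sketch of assembling such a merge on disk is shown below, assuming both datasets use an ImageFolder-style layout with one sub-folder per class; folder names and paths are placeholders.

```python
# Sketch: merge a balanced real dataset with a synthetic one (~50/50 per class).
import shutil
from pathlib import Path

def merge_datasets(real_dir: str, synth_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    for source, prefix in ((Path(real_dir), "real"), (Path(synth_dir), "synt")):
        for class_dir in sorted(p for p in source.iterdir() if p.is_dir()):
            target = out / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for img in class_dir.iterdir():
                # Prefix file names so real and synthetic samples cannot collide.
                shutil.copy2(img, target / f"{prefix}_{img.name}")

merge_datasets("gtsrb_balanced", "synthetic/gtsrb_sjs", "merged/gtsrb_real_sjs")
```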

|                | GTSRB | BTSC | rMASTIF |
|----------------|-------|------|---------|
| Luo et al. [1] | $99.41$ | | |
| Real + SES     | $99.70 \pm 0.04$ | $99.36 \pm 0.05$ | $99.81 \pm 0.04$ |
| Real + SJS     | $\mathbf{99.75 \pm 0.02}$ | $\mathbf{99.40 \pm 0.05}$ | $\mathbf{99.84 \pm 0.07}$ |

As expected, merged datasets achieve better results than either real or synthetic datasets on their own. A slight advantage can be observed when using the SJS datasets.

Results for ensembles

Since we have three different types of datasets (with 5 trained models for each), we decided to ensemble models trained on them. Each ensemble contains three models trained with the following datasets:

  • SER for the full synthetic dataset
  • Real data
  • Merged: Real + SJS

Each ensemble was evaluated 5 times.
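The combination rule is not detailed here; a common choice, assumed in the sketch below, is to average the softmax outputs of the three models. The models and the test loader are placeholders for the trained SER, real, and merged (Real + SJS) models described above.

```python
# Sketch: ensemble of three trained classifiers by averaging softmax probabilities.
import torch

@torch.no_grad()
def ensemble_accuracy(models, test_loader, device="cuda"):
    for model in models:
        model.to(device).eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        # Average the per-model class probabilities.
        probs = torch.stack([torch.softmax(m(images), dim=1) for m in models]).mean(dim=0)
        correct += (probs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total
```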

|          | GTSRB | BTSC | rMASTIF |
|----------|-------|------|---------|
| Ensemble | $\mathbf{99.82 \pm 0.02}$ | $\mathbf{99.38 \pm 0.05}$ | $\mathbf{99.79 \pm 0.07}$ |

While the ensembles for BTSC and rMASTIF provide worse results than the models trained with the merged datasets, on GTSRB we are able to surpass the result from Haloi [3], at $99.81$. The best ensemble for GTSRB achieved an accuracy of $99.85$, correctly classifying 12611 of the 12630 images in the test set. Note that our input resolution is 32x32, versus 128x128 in Haloi's work.

Cross-testing

In order to assess the generalization capability of the synthetic datasets, we performed cross-testing across the different datasets.

To perform this test, we took the models trained on one dataset, for instance the German one, and tested them on the common classes of the other two datasets. By common classes we mean classes whose pictograms have the same semantic meaning, even though the pictograms may vary slightly from country to country. The following figure shows the class equivalence we found for this test.

similar signs

|                                          | sample count | Real | SER | SES |
|------------------------------------------|--------------|------|-----|-----|
| Trained: GTSRB - Tested: BTSC + rMASTIF  | $1829$ | $97.18$ | $\mathbf{98.33}$ | $98.30$ |
| Trained: BTSC - Tested: GTSRB + rMASTIF  | $6410$ | $82.39$ | $\mathbf{95.75}$ | $94.73$ |
| Trained: rMASTIF - Tested: GTSRB + BTSC  | $8029$ | $90.24$ | $94.50$ | $\mathbf{95.41}$ |

Results hint that, when compared to real data, our synthetic datasets perform better when confronted with slightly different pictograms, different cameras, and different lighting conditions. This is perhaps the most relevant test, as it is not feasible to capture all signs in all lighting conditions and with a broad range of cameras. Furthermore, slightly different pictograms for each sign can be found in most countries, as a result of the introduction of updated signs coexisting with older versions of the same sign.
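In practice the evaluation only needs a mapping from the tested dataset's class indices to the trained model's class indices for the shared classes. The sketch below assumes such a mapping; the index pairs shown are hypothetical placeholders, not the actual equivalence table from the figure above.

```python
# Sketch: cross-testing a model on another dataset through a class-equivalence map.
# Keys: class index in the *tested* dataset; values: class index in the dataset
# the model was *trained* on. The entries below are placeholders.
import torch

EQUIVALENT = {0: 14, 3: 17, 7: 25}  # hypothetical tested -> trained index pairs

@torch.no_grad()
def cross_test(model, loader, class_map, device="cuda"):
    model.to(device).eval()
    correct, total = 0, 0
    for images, labels in loader:
        # Keep only samples whose class exists in both datasets.
        keep = torch.tensor([l.item() in class_map for l in labels])
        if not keep.any():
            continue
        images = images[keep].to(device)
        mapped = torch.tensor([class_map[l.item()] for l in labels[keep]], device=device)
        preds = model(images).argmax(dim=1)
        correct += (preds == mapped).sum().item()
        total += mapped.numel()
    return 100.0 * correct / total
```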

Unleashing synthetic datasets

All previous tests were performed considering only templates that appear in the training set of each country. However, in the BTSC dataset in particular, some variations of the Parking sign are only present in the test set. In a real usage scenario it would make sense to use all the variations when building the synthetic dataset; after all, the cost of adding a variation is just obtaining its template image. The same test was also performed for GTSRB, as some variants are only found in its test set.

The following results were obtained by merging these unleashed synthetic datasets with the real data; the previous (restricted) merged results are shown for comparison:

|       | Restricted | Unleashed |
|-------|------------|-----------|
| GTSRB | $99.70$ | $\mathbf{99.80}$ |
| BTSC  | $99.36$ | $\mathbf{99.76}$ |

The result for GTSRB shows a considerable improvement compared to the previous merged version, coming very close to Haloi's result [3] with a single model that has a quarter of the number of parameters. Regarding BTSC, the result surpasses the result in [4] (the best result reported so far for BTSC, with an accuracy of $99.71$), correctly classifying 2514 out of 2520 images in the test set.

Conclusion

When considering solo datasets, models trained with our synthetic datasets provide accuracies that are very close to those of models trained with real data.

Combining these datasets, either by ensembling or by merging with real data, provides results that surpass all previously published reports on the three datasets. The table below shows the best results for each dataset:

| dataset | method    | accuracy | prev. reports |
|---------|-----------|----------|---------------|
| GTSRB   | ensemble  | $99.82 \pm 0.02$ | $99.81$ [3] |
| BTSC    | unleashed | $99.76$ | $99.71$ [4] |
| rMASTIF | merged    | $99.84 \pm 0.07$ | $99.53 \pm 0.10$ [5] |

Nevertheless, despite having surpassed current state-of-the-art results, we believe the most interesting result comes from the cross-testing experiment, where we observed a significantly higher generalization capability for our synthetic datasets than for the real datasets.

The full reference for the paper is:

Lopes da Silva, D. and Fernandes, A. (2022). Bridging the Gap between Real and Synthetic Traffic Sign Repositories. In Proceedings of the 3rd International Conference on Deep Learning Theory and Applications - DeLTA, ISBN 978-989-758-584-5; ISSN 2184-9277, pages 44-54. DOI: 10.5220/0011301100003277

References

[1] Luo, H., Kong, Q., and Wu, F. (2018). Traffic sign image synthesis with generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2540–2545.

[2] Spata, D., Horn, D., and Houben, S. (2019). Generation of natural traffic sign images using domain translation with cycle-consistent generative adversarial networks. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 702–708.

[3] Haloi, M. (2015). Traffic sign classification using deep inception based convolutional networks. arXiv preprint.

[4] Mahmoud, M. A. B. and Guo, P. (2019). A novel method for traffic sign recognition based on DCGAN and MLP with PILAE algorithm. IEEE Access, 7:74602-74611.

[5] Jurisic, F., Filkovic, I., and Kalafatic, Z. (2015). Multiple-dataset traffic sign classification with OneCNN. In 3rd IAPR Asian Conference on Pattern Recognition (ACPR).