Description
We introduce LPR-MNIST, a synthetic dataset replicating the key aspects of syntax evolution on vehicle license plates. LPR-MNIST is a collection of 100,000 synthetic image-text pairs, generated by concatenating 5 black-and-white digits from MNIST, each padded to 32 x 32. 32 pixels of zero-padding are then randomly shared between the left and right sides of the assembled image to mimic the natural variability in the absolute position of characters on real plates. Thus, resulting images have a shape of 32 x 192. Labels are drawn from ‘00000’ to ‘99999’ such that each possible combination is represented once. For each digit in a given label, the MNIST image is then randomly picked among all instances of this digit.

Protocols
The provided training, validation and test splits follow the 80/10/10% scheme. They purposely do not simulate any syntax shift so that the user can recreate any experiment from the paper and beyond. Creating a syntax tipping point is simply achieved by removing all samples with an arbitrary character in an arbitrary position during training. For instance, the main results in the paper are obtained by removing digit 9
in the leftmost position in training and validation splits. Meanwhile, the test split should be divided into source and target syntax where the chosen digit is absent and present in the chosen position, respectively. The target syntax part of the test split is therefore the most challenging for the model as it contains unseen syntax.
Citing the dataset
The dataset is introduced in the following publication. Use the following bibtex for citing the dataset: NOT AVAILABLE
Licence
This work, “LPR-MNIST”, is adapted from MNIST by Yann LeCun and Corinna Cortes, used under CC BY-SA 3.0. Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. “LPR-MNIST” is licensed under CC BY 4.0 by Florent Meyer.
The LPR-MNIST dataset is a modified version of the MNIST database where images from the original database have been concatenated to produce images of digit sequences. A split into training, validation, and test sets of the created sequences provides an experimental protocol for comparing experiments.
How to get the dataset ?
Before downloading the dataset, you agree to comply with the terms Creative Commons Attribution-Share Alike 4.0 license. To agree with the licence and receive the download link, please complete the following contact form.