The aim of this thesis is to investigate self-supervised learning in order to take advantage of unlabeled examples for handwriting recognition. In recent years, a great deal of work has focused on self-supervised approaches for network pre-training: starting from a pretext task that requires no manual annotation, the model nonetheless acquires strong modeling capabilities. These approaches form the basis of the foundation models in the literature, whether text-based (GPT-3), image-based (DINO), or both text- and image-based (CLIP). The pretext tasks are diverse: next-token prediction for GPT-3, or contrastive learning for CLIP and DINO. More recently, the pretext task of reconstructing partially masked images with Transformer networks has shown remarkable results.
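To make the idea concrete, the sketch below illustrates, under simplifying assumptions, what such a masked-reconstruction pretext task can look like for handwritten text-line images: patches are randomly masked and the network is trained to reconstruct their pixels from the visible context. It is a minimal illustrative example in PyTorch, closer to a single-encoder setup with mask tokens than to the original MAE encoder/decoder split; the image size, patch size, masking ratio, and module names are assumptions, not the configuration of any specific published system.

```python
# Minimal sketch of a masked-reconstruction pretext task on unlabeled
# text-line images. Illustrative only: sizes, ratios and names are assumed.
import torch
import torch.nn as nn


class MaskedPatchAutoencoder(nn.Module):
    def __init__(self, img_size=(32, 128), patch_size=8, dim=128,
                 depth=4, heads=4, mask_ratio=0.75):
        super().__init__()
        self.patch_size = patch_size
        self.mask_ratio = mask_ratio
        num_patches = (img_size[0] // patch_size) * (img_size[1] // patch_size)
        patch_dim = patch_size * patch_size              # grayscale patches
        self.to_embedding = nn.Linear(patch_dim, dim)
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)

    def patchify(self, images):
        # (B, 1, H, W) -> (B, N, patch_size * patch_size)
        p = self.patch_size
        return images.unfold(2, p, p).unfold(3, p, p).reshape(images.size(0), -1, p * p)

    def forward(self, images):
        patches = self.patchify(images)                  # (B, N, P)
        tokens = self.to_embedding(patches)              # (B, N, D)
        # Randomly mask a fraction of the patches; the network must
        # reconstruct their pixel values from the visible context.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        tokens = tokens + self.pos_embedding
        decoded = self.to_pixels(self.encoder(tokens))   # (B, N, P)
        # Reconstruction loss is computed on the masked patches only.
        return ((decoded - patches) ** 2)[mask].mean()


# One pre-training step on a batch of (stand-in) unlabeled line images.
model = MaskedPatchAutoencoder()
loss = model(torch.rand(4, 1, 32, 128))
loss.backward()
```

After such pre-training on unlabeled images, the encoder weights would typically be reused and fine-tuned on the much smaller annotated transcription data.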
We are therefore interested in applying self-supervised learning to handwriting recognition in images. The aim is twofold: a) to adapt a current system so that it can handle a wide range of examples, making it generic and improving its generalization capabilities; b) to enable a system to specialize on an unlabeled corpus, improving transcription performance on that corpus. Self-supervised learning should thus make it possible to adapt current reference systems to achieve these generalization and specialization objectives while minimizing the amount of annotated data required.