Proceedings of the 9th Workshop on the Representation and Processing of Sign Languages, pages 165–170
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC
Automatic Classification of Handshapes in Russian Sign Language
Medet Mukushev∗, Alfarabi Imashev∗, Vadim Kimmelman†, Anara Sandygulova∗
∗ Department of Robotics and Mechatronics, School of Engineering and Digital Sciences, Nazarbayev University
Kabanbay Batyr Avenue, 53, Nur-Sultan, Kazakhstan
† Department of Linguistic, Literary and Aesthetic Studies, University of Bergen
Postboks 7805, 5020, Bergen, Norway
Abstract
Handshapes are one of the basic parameters of signs, and any phonological or phonetic analysis of a sign language must account
for handshapes. Many sign languages have been carefully analysed by sign language linguists to create handshape inventories. This
has theoretical implications, but also applied use, as an inventory is necessary for generating corpora for sign languages that can be
searched, filtered, sorted by different sign components (such as handshapes, orientation, location, movement, etc.). However, creating an
inventory is a very time-consuming process, so only a handful of sign languages have them. Therefore, in this work we first test an
unsupervised approach with the aim of automatically generating a handshape inventory. The process includes hand detection, cropping, and
clustering techniques, which we apply to a commonly used resource: the Spreadthesign online dictionary (www.spreadthesign.com), in
particular to Russian Sign Language (RSL). We then manually verify the data to be able to apply supervised learning to classify new data.
Keywords: Sign Language Recognition, Machine Learning Methods, Information Extraction
1. Introduction

Signs in sign languages are composed of phonological components put together under certain rules (Sandler and Lillo-Martin, 2006). In the early days of sign language linguistics, three main components were identified: handshape, location on the body, and movement, while orientation and the non-manual component were added later. A recent paper stresses the need to combine interdisciplinary approaches in order to build successful sign language processing systems that account for their complex linguistic nature (Bragg et al., 2019).

By deploying a number of computer vision approaches, this paper aims to automate one of the most time-consuming tasks for linguists, i.e. the creation of a handshape inventory. Many researchers have worked on establishing phonetic and phonemic handshape inventories (see e.g. Van der Kooij, 2002; Nyst, 2007; Tsay and Myers, 2009; Kubuş, 2008; Klezovich, 2019). In all of these works, handshapes were extracted and annotated manually (Klezovich, 2019).

Klezovich (2019) proposed the first handshape inventory for Russian Sign Language (RSL) by applying a semi-automatic approach of extracting hold-stills from a sign video based on an image overlay approach. Hold-stills are extracted from the rest of the video frames because handshapes are clearest and most visible in hold positions, and transitional movements never contain distinct handshapes. Klezovich proposed to extract hold-stills and then manually label only these frames, which can significantly speed up the process of creating handshape inventories (Klezovich, 2019).

In this paper, we test an automatic approach to generating a handshape inventory for Russian Sign Language. First, we try unsupervised learning and demonstrate that the results are unsatisfactory, because this method cannot distinguish handshapes separately from orientation and location in their classification. Second, we manually label a training dataset according to HamNoSys handshapes (Hanke, 2004), and demonstrate the utility of supervised learning on new data.

2. Handshape as a phonological component

Ever since the seminal book by Stokoe (1960) on American Sign Language (ASL), signs in sign languages have been analyzed as consisting of several parameters, one of the major ones being handshape (Sandler and Lillo-Martin, 2006). Handshape itself is not considered an atomic parameter of a sign, usually being further subdivided into selected fingers and finger flexion (Brentari, 1998).

Much research has been devoted to theoretical approaches to handshapes (see Sandler and Lillo-Martin (2006) for an overview), as well as to descriptions of handshape inventories in different sign languages (see e.g. Caselli et al., 2017; Sutton-Spence and Woll, 1999; Fenlon et al., 2015; Kubuş, 2008; Klezovich, 2019; Van der Kooij, 2002; Prillwitz, 2005; Tsay and Myers, 2009).

Several issues have been identified in studying handshapes that can currently be addressed using novel methods. First, many researchers identify the existence of so-called unmarked handshapes (Sandler and Lillo-Martin, 2006, 161-162). These handshapes are maximally distinct in terms of their overall shape, they are the easiest to articulate, the most frequently occurring in signs, the first to be acquired by children, etc. For instance, in ASL, the following handshapes are generally treated as unmarked: A (fist), 5 (all fingers outstretched), 1 (index finger straight, all the others closed), E (all fingers bent and touching).

Since unmarkedness of handshapes derives from their visual and articulatory properties, it is expected that the same handshapes should be unmarked across different sign languages. This appears to be the case, although slight variation can also be observed. For instance, in Turkish Sign Language (TID), 7 handshapes can be identified as being the most frequent, including two handshapes based on the fist with or without outstretched thumb (Kubuş, 2008).
Figure 1: 135 top activated clusters for HOG descriptors.
In addition to the observation that (approximately) the same handshapes are the most frequent, a surprising finding is that the frequency of the most frequent handshapes is extremely similar across different sign languages. For instance, in British Sign Language (BSL), 50% of signs have one of the four unmarked handshapes (Sutton-Spence and Woll, 1999); in Turkish Sign Language, the four most frequent handshapes account for 57% of the signs (Kubuş, 2008); and, in ASL, the four most frequent handshapes in the ASL-LEX dataset (Caselli et al., 2017) account for 49% of all signs.

Secondly, some researchers argue that sign languages differ in their phonemic inventories, including the inventories of handshapes. For instance, Sign Language of the Netherlands has 70 phonetic and 31 phonemic handshapes (Van der Kooij, 2002), and many other sign languages are reported to have inventories of similar sizes (Kubuş, 2008; Caselli et al., 2017; Prillwitz, 2005). At the same time, Adamorobe Sign Language has been reported to have only 29 phonetic and 7 phonemic handshapes (Nyst, 2007). At the opposite end, a recent study of Russian Sign Language (RSL) based on semi-automatic large-scale analysis has claimed that RSL has 117 phonetic but only 23 phonemic handshapes (Klezovich, 2019). Note, however, that it is very difficult to directly compare results from different sign languages because different methods of assessing the phonemic status of handshapes are used.

So we can observe both similarities in handshapes across different sign languages and considerable variation. At the same time, it is difficult to make direct comparisons because different datasets and annotation and classification methods are applied in different studies.

In the current study, we propose and test a method that can be applied to classifying handshapes across many sign languages using a common data set: the Spreadthesign online dictionary (www.spreadthesign.com). As a proof of concept, we analyze data from Russian Sign Language.

3. Dataset pre-processing

3.1. Dataset

The dataset was created by downloading videos from the Spreadthesign online dictionary (www.spreadthesign.com). We downloaded a total of 14875 RSL videos from the website. The videos contain either a single sign or a phrase consisting of several signs.

Klezovich (2019) also used the Spreadthesign online dictionary, and after removing compounds, dactyl-based and number-based signs, she ended up working with 3727 signs or 5189 hold-stills.

In our case, blurry images are removed using the variance of the Laplacian with a threshold of 350. If the variance is lower than the threshold, the image is considered blurry; otherwise it is not. The threshold is normally selected by trial and error for a given dataset, as there is no universal value. This reduced the total number of images from 141135 to 18226 cropped images of hands.

3.2. Hand extraction

Hand detection can be considered a sub-task of object detection and segmentation in images and videos. Hands can appear in various shapes, orientations and configurations, which creates additional challenges. Object detection frameworks such as MaskRCNN (He et al., 2017) and CenterNet (Duan et al., 2019) can be applied for this task.
However, occlusions and motion blur might decrease the accuracy of the trained models. For these reasons, in this work we used a novel CNN architecture, namely Hand-CNN (Narasimhaswamy et al., 2019). Its architecture is based on MaskRCNN (He et al., 2017) with an additional attention module that includes contextual cues during the detection process. To mitigate issues with occlusions and motion blur, Hand-CNN’s attention module performs two types of non-local contextual pooling: feature similarity and spatial relationships between semantically related entities. The Hand-CNN model provides segmentations, bounding boxes and orientations of detected hands. We utilize the predicted bounding boxes to crop hands with two padding parameters: a 0-pixel padding and a 20-pixel padding. As a result, the first group contains cropped images of detected hands only, while the other group contains cropped images of hands and their positions relative to the body.

Figure 2: Average Silhouette Coefficient scores for the model trained on AlexNet features
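
The two pre-processing steps described in Sections 3.1 and 3.2 (blur filtering by the variance of the Laplacian, and cropping a detected hand with a pixel padding) can be illustrated with a minimal, dependency-free sketch. It is not the code used in this work: the function names, the 4-neighbour Laplacian kernel, the list-of-rows grayscale image format, and the clamping of padded boxes to the image bounds are all assumptions made for illustration. Only the threshold of 350 and the padding values of 0 and 20 pixels come from the text.

```python
from statistics import pvariance

BLUR_THRESHOLD = 350  # value reported in Section 3.1


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a grayscale image given as a list of rows of intensities
    (an assumed format; at least 3x3 pixels are required).
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def is_blurry(gray, threshold=BLUR_THRESHOLD):
    """An image is treated as blurry when the variance is below the threshold."""
    return laplacian_variance(gray) < threshold


def padded_box(box, padding, width, height):
    """Expand (x1, y1, x2, y2) by `padding` pixels, clamped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - padding), max(0, y1 - padding),
            min(width, x2 + padding), min(height, y2 + padding))


def crop(gray, box):
    """Cut the region described by (x1, y1, x2, y2) out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in gray[y1:y2]]
```

With these helpers, `padded_box(box, 0, w, h)` corresponds to the hand-only group and `padded_box(box, 20, w, h)` to the group that keeps some context around the hand. As noted in Section 3.1, the threshold is dataset-dependent, and it also depends on the exact Laplacian kernel, so the value of 350 should not be expected to transfer unchanged to this simplified kernel.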
3.3. Image pre-processing
To images with a 0-pixel-padding on detected hands, we
apply Histogram of Oriented Gradients (HOG) descriptors
(Dalal and Triggs, 2005). HOG feature descriptors are
commonly used in computer vision for object detection
(e.g. people detection in static images). This technique is
based on distribution of intensity gradients or edge direc-
tions. Firstly, an image is divided into small regions and
then each region has its histogram of gradient directions
calculated. Concatenations of these histograms are used as features for the clustering algorithm. In this work we use the “feature” module of the scikit-image library (van der Walt et al., 2014) with the following parameters: orientations = 9, pixels per cell = (10,10), cells per block = (2,2), and L1 as the block normalization method. Prior to this pre-processing, all images are converted to grayscale and resized to 128 by 128 pixels.

To images with a 20-pixel padding around detected hands, we apply AlexNet (Krizhevsky et al., 2012), a Convolutional Neural Network (CNN) commonly used as a baseline architecture for various image processing tasks. We use only the first five convolutional layers with 96, 256, 384, 384 and 256 filters, as we only need to extract features for clustering purposes without the need for classification of images. Prior to feature extraction all images are resized to 224 by 224 pixels. CNN features are PCA-reduced to 256 dimensions before clustering.

Figure 3: Normalized mutual information

4. Unsupervised Methodology

4.1. Clustering

We utilize a classical clustering algorithm, namely k-means. The k-means implementation by Johnson et al. (2019) is applied to ConvNet features, while the scikit-learn (Pedregosa et al., 2011) implementation is applied to HOG features. Each training is performed for 20 iterations with random initialization.

We experimentally determined the number of clusters to be specified for clustering. Since handshape orientation also appeared to be accounted for by the clustering algorithm, the idea was to increase the number of clusters to force the algorithm to differentiate between orientations. By trying varying step sizes, we settled on the following sizes: 100, 150, 200, 300 and 400 clusters.

4.2. Analysis and evaluation

We use two metrics to evaluate the performance of the clustering models: the Silhouette Coefficient and Normalized Mutual Information (NMI).

When the ground truth labels are not known for predicted clusters, the Silhouette Coefficient score is applied. The Silhouette method is used for interpretation and validation of clustering analysis (Rousseeuw, 1987). Its value indicates how similar an item is to its own cluster compared to other clusters. The Silhouette is bounded between -1 and +1, where a higher value means that a clustered item is well matched to its cluster and less matched to other clusters. As can be seen from Figure 2, the maximum value of the Silhouette Coefficient score is observed for the model trained on AlexNet features for 100 clusters after 15 epochs. However, the score itself is just slightly over 0.12, which indicates that our clusters are overlapping.

In addition, we use predicted labels to measure NMI. It is a function that measures the agreement between predicted and actual labels. Perfect labeling gives a score of 1, while labelings with no agreement give scores close to 0. As we can see from Figure 3, all models with different numbers of clusters reach scores of 0.9 after 15 epochs.

The reason for such results might be that image descriptors for hands are too close to each other, which makes it difficult for the algorithm to differentiate. At the same time, the NMI score indicates that predicted labels are almost the same after each training epoch. In order to increase the density of predicted clusters, additional pre-processing of images is required.
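
To make the two evaluation metrics of Section 4.2 concrete, the following is a small dependency-free sketch of the mean Silhouette Coefficient and of NMI. It is an illustration only and not the evaluation code used in this work; libraries such as scikit-learn provide `silhouette_score` and `normalized_mutual_info_score` for the same purpose. The Euclidean distance, the arithmetic-mean normalization of NMI, and the function names are assumptions.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denominator = max(a, b)
        total += (b - a) / denominator if denominator > 0 else 0.0
    return total / len(points)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_a, labels_b):
    """NMI with arithmetic-mean normalization; the result lies in [0, 1]."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0
```

Note that NMI is invariant to how clusters are numbered: two assignments that group the same items together score 1 even if the cluster indices differ, which is why it is suitable for comparing cluster assignments across training epochs.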
4.3. Results
Figure 1 gives us insights about the results of applying un-
supervised clustering to handshapes. First, it is clear that
the algorithm does not distinguish classes only based on
handshapes, but also based on orientation (for images with
0-pixel-padding), and also based on localization (for im-
ages with 20-pixel-padding). If the linguistic task of creat-
ing an inventory of phonemic handshapes is at stake, this is
a clear disadvantage of this approach.
Second, despite its shortcomings, the method does provide
some linguistically interesting results. Specifically, one
can see that the handshapes which are expected to be un-
marked (A, 5, 1) appear frequently and as labels for multi-
ple classes. Thus, even though the classification is not lin-
guistically relevant, the effect of markedness is still visible
in the results of this unsupervised approach.
5. Supervised Methodology
5.1. Dataset
Given that the unsupervised approaches did not result in a clustering reflective of relevant handshape classes, we turned to a supervised approach. The results of HOG clustering were used as the initial dataset, which contained 140 clusters of 18226 images. We decided to manually clean the automatically generated clusters of inaccuracies. This task was performed by four undergraduate students, who first divided the folders among themselves, after which one person merged all of them.

First, each cluster (folder) was visually scanned for the most frequently classified handshape in order to remove from that folder the handshapes that did not belong there. These steps were performed for all 140 folders. Since there were many folders of the same handshape differing only in orientation, they were merged, which resulted in 35 classes and a large unsorted (junk) folder. Thus, the final version of the dataset contains 35 classes of 7346 cropped images with 0-pixel-padding.

The classes were created using intuitive visual similarity as a guide, and by linguistically naive annotators. However, a post factum analysis shows that the manual classification is linguistically reasonable as an approximation of a phonological inventory. Specifically, the classes that were created are distinguished by selected fingers, spreading (spread or not), and finger position (straight, bent, curved). Thumb position is only used as a distinguishing feature for opposed thumb vs. all other possibilities. Non-selected finger position is not taken into account. This reasonably approximates features relevant for proposed phonological inventories in other sign languages, and, as such, can be used for RSL as well.

If phonetic classes were the target, then classes would also need to be distinguished by exact thumb position and also by the differences in non-selected fingers. In such a case the full inventory of possible handshapes described in HamNoSys (Hanke, 2004) could be used as the basis for manual classification. However, the dataset we used appears to be too small to attempt a phonetic classification.

The manually labeled subset was later divided into a training set with 6430 images and a validation set with 916 images. Figure 4 shows the number of tokens for each class in the training and validation sets combined. Figure 4 also shows a linguistically relevant result: our manual classification of handshapes also demonstrates the expected frequency properties of marked and unmarked handshapes. In particular, the most frequent handshapes are the ones expected to be unmarked: A (fist), 5 (hand with all fingers spread), 1 (index finger), and B (a flat palm).1 These forms together constitute 48% of all handshapes (if the two-handed signs are disregarded).

Figure 4: Handshape classes count

1 The 10.88% class in Figure 4 includes all two-handed signs, which we do not attempt to classify according to handshape at the moment.

5.2. ConvNet and transfer learning

Training an entire ConvNet from scratch for a specific task requires large computational resources and large datasets, which are not always available. For this reason, a more common approach is to use a ConvNet, such as ResNet-18, that was pretrained on a large dataset, such as ImageNet (which contains 1.2 million images divided into 1000 categories), as a feature extractor for a new task. There are two common transfer learning techniques based on how we use the pretrained ConvNet: finetuning the ConvNet and using the ConvNet as a fixed feature extractor. In the first technique, we use the weights of a pretrained model to initialize our network instead of random initialization. All the layers of the ConvNet are trained. In the second approach, we freeze the weights for all of the network layers; only the final fully connected layer is replaced with a new one with random weights and
only this layer is trained. We implemented our networks using PyTorch (release 1.4.0), an open source machine learning library (Paszke et al., 2019). Our code is based on Chilamkurthy’s Transfer Learning for Computer Vision Tutorial (Chilamkurthy, 2017). The ResNet-18 (He et al., 2016) model was used as the pretrained model.

5.3. Results

We trained two networks using both approaches. Each model was trained for 200 epochs. Using the second approach (i.e. ConvNet as a fixed feature extractor with only the last layer trained), the best accuracy of 43.2% was achieved. On the other hand, the first approach (i.e. finetuning the ConvNet and training all layers) demonstrated a better accuracy of 67%. Therefore, the finetuned model was used for further accuracy improvements. First, we added data augmentation to increase the number of samples. Samples were randomly rotated and visual parameters (brightness, contrast, and saturation) were randomly changed with a probability of 0.25. This helped to increase the accuracy of the best model up to 74.5% after 200 epochs. Later, we used this trained model to predict labels for all 18226 handshapes. In order to remove cases that were misclassified, a threshold for prediction probability was set to 0.7. As a result, 12042 samples were classified. Figure 5 demonstrates the number of predicted samples for each class.

Figure 5: Handshape classes count using classifier

6. Discussion

6.1. Insights from unsupervised and supervised approaches

The current study shows that the unsupervised approach does not seem promising in the task of automating handshape recognition. The main problem is that the category of handshape is linguistically relevant, but not visually separable from orientation and location by this very basic data-driven approach.

We have demonstrated that an alternative approach involving a manual classification step can be quite effective. However, manual classification is problematic for obvious reasons, as it involves human judgment.

Both approaches, however, offer some linguistically relevant insights, specifically concerning unmarked handshapes. In the unsupervised approach, it is clear that many clusters are assigned unmarked handshapes as labels, which can be explained by both their frequency and visual salience. In the supervised approach, our manual classification of 7346 handshapes demonstrated that the unmarked handshapes (A, 1, 5, B) are indeed the most frequent ones. Finally, applying the ConvNet model to the whole dataset of 18226 handshapes has shown that the top 3 classes are A, B, 5. Interestingly, the 1 handshape is not among the most frequent ones. The most likely explanation is that this handshape is frequently misclassified as the handshape with middle finger bent and the other fingers outstretched (the ‘jesus’ handshape in the figures), which is a rare marked handshape in the manually classified dataset, but frequent in the results using the classifier.

Thus, both successful and less successful applications of machine learning methods show the importance of unmarked handshapes in RSL. It would be interesting to extend these approaches to other sign languages for comparative purposes.

6.2. Comparison with Klezovich 2019

As discussed above, Klezovich (2019) proposed the first handshape inventory for RSL by applying a semi-automatic approach of extracting hold-stills in a sign video using the same dataset used here (Spreadthesign). This gives us the opportunity to compare the results of a more traditional linguistic analysis of handshape classes in RSL with the approach used in the current study.

A direct comparison is possible between Klezovich’s results and the results of our unsupervised learning approaches. Both result in a classification of handshapes. However, we have demonstrated that the results of unsupervised clustering are unsatisfactory, so it cannot be used for any linguistically meaningful applications.

As for the supervised approach, both our approach and Klezovich’s analysis include manual annotation, but in different ways. Klezovich manually classified handshapes into potential phonemic classes using linguistic criteria, which resulted in a large linguistically informed inventory. We manually classified handshapes based on visual similarity into a smaller number of classes, and then used this as a dataset for machine learning.

The comparison between Klezovich’s and our manual classifications is not very informative, as only the former was based on linguistic criteria. Given that Klezovich’s classification was not used as a training set for automatic recognition, no comparison is possible for this aspect either. This issue is left for future research.
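
As an illustration of the confidence filtering reported in Section 5.3, where only predictions with a probability of at least 0.7 were kept, the following dependency-free sketch filters classifier outputs by their top softmax probability. It is not the code used in this work: the assumption that probabilities come from a softmax over raw class scores, the use of "at least" rather than "strictly above" the threshold, and all function and variable names are illustrative.

```python
import math

CONFIDENCE_THRESHOLD = 0.7  # value reported in Section 5.3


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() stable
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_predictions(batch_logits, class_names,
                          threshold=CONFIDENCE_THRESHOLD):
    """Return (sample index, class name, probability) for confident samples."""
    kept = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            kept.append((index, class_names[best], probs[best]))
    return kept
```

Samples whose highest class probability falls below the threshold are simply left unclassified, which is consistent with 12042 of the 18226 images receiving a label.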
169
7. Conclusion

We have shown that by deploying a number of classical machine learning algorithms, it is possible to partially automate one of the most time-consuming tasks for linguists, i.e. the creation of a handshape inventory, and, in addition, to investigate frequencies of various handshapes in a large data set. At the moment, it seems that unsupervised approaches cannot be used to create handshape inventories because orientation and location differences also influence clustering, and to an even greater extent than handshape itself. A supervised approach is clearly more effective; however, it requires a manual annotation component where a substantial number of handshapes is manually classified. This introduces additional problems of determining the number of classes for manual classification. Upon achieving satisfactory unsupervised clustering results, future work will focus on comparing and applying this framework to other sign languages.

8. Bibliographical References

Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., Caselli, N., Huenerfauth, M., Kacorri, H., Verhoef, T., Vogler, C., and Ringel Morris, M. (2019). Sign language recognition, generation, and translation: An interdisciplinary perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’19, pages 16–31, New York, NY, USA. ACM.
Brentari, D. (1998). A prosodic model of sign language phonology. MIT Press.
Caselli, N. K., Sehyr, Z. S., Cohen-Goldberg, A. M., and Emmorey, K. (2017). ASL-LEX: A lexical database of American Sign Language. Behavior Research Methods, 49(2):784–801, April.
Chilamkurthy, S. (2017). Transfer learning for computer vision tutorial. https://chsasank.github.io/.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection.
Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., and Tian, Q. (2019). CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578.
Fenlon, J., Cormier, K., and Schembri, A. (2015). Building BSL SignBank: The Lemma Dilemma Revisited. International Journal of Lexicography, 28(2):169–206, June.
Hanke, T. (2004). HamNoSys: representing sign language data in language resources and language processing contexts. In Oliver Streiter et al., editors, LREC 2004, Workshop proceedings: Representation and processing of sign languages, pages 1–6, Paris.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969.
Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
Klezovich, A. (2019). Automatic Extraction of Phonemic Inventory in Russian Sign Language. BA thesis, HSE, Moscow.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
Kubuş, O. (2008). An Analysis of Turkish Sign Language Phonology and Morphology. Diploma thesis, Middle East Technical University, Ankara.
Narasimhaswamy, S., Wei, Z., Wang, Y., Zhang, J., and Hoai, M. (2019). Contextual attention for hand detection in the wild. arXiv preprint arXiv:1904.04882.
Nyst, V. (2007). A Descriptive Analysis of Adamorobe Sign Language (Ghana). LOT, Utrecht.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Prillwitz, S. (2005). Das Sprachinstrument von Gebärdensprachen und die phonologische Umsetzung für die Handformkomponente der DGS. In Helen Leuninger et al., editors, Gebärdensprachen: Struktur, Erwerb, Verwendung, pages 29–58. Helmut Buske Verlag, Hamburg.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
Sandler, W. and Lillo-Martin, D. (2006). Sign language and linguistic universals. Cambridge University Press.
Stokoe, W. (1960). Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. Number 8 in Studies in Linguistics: Occasional Papers. Department of Anthropology and Linguistics, University of Buffalo, Buffalo.
Sutton-Spence, R. and Woll, B. (1999). The Linguistics of British Sign Language. Cambridge University Press, Cambridge.
Tsay, J. and Myers, J. (2009). The morphology and phonology of Taiwan Sign Language. In James Tai et al., editors, Taiwan Sign Language and Beyond, pages 83–130. The Taiwan Institute for the Humanities, Chia-Yi.
Van der Kooij, E. (2002). Phonological Categories in Sign Language of the Netherlands. The Role of Phonetic Implementation and Iconicity. LOT, Utrecht.
van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., Yu, T., and the scikit-image contributors. (2014). scikit-image: image processing in Python. PeerJ, 2:e453, 6.