Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books (NCSSB) datasets, which are created by substitution of synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but quantity also plays a role in models without context inclusion.
This paper explores the use of word2vec and GloVe embeddings for unsupervised measurement of the semantic compositionality of MWE candidates. Through comparison with several human-annotated reference sets, we find word2vec to be substantively superior to GloVe for this task. We also find Simple English Wikipedia to be a poor-quality resource for compositionality assessment, but demonstrate that a sample of 10% of sentences in the English Wikipedia can provide a conveniently tractable corpus with only moderate reduction in the quality of outputs. 2 Past Research Lin (1999) employs a substitution-based method to detect non-compositionality. However, while non-compositional phrases also exhibit institutionalisation (resistance to substitution of synonyms), the re- This work is licensed under a Creative Commons Attribution 4.
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
As social media platforms grow, so too does the volume of hate speech and negative sentiment expressed towards particular social groups. In this paper, we describe our approach to SemEval-2023 Task 10, involving the detection and classification of online sexism (abuse directed towards women), with fine-grained categorisations intended to facilitate the development of a more nuanced understanding of the ideologies and processes through which online sexism is expressed. We experiment with several approaches involving language model fine-tuning, class-specific adapters, and pseudo-labelling. Our best-performing models involve the training of adapters specific to each subtask category (combined via fusion layers) using a weighted loss function, in addition to performing naive pseudo-labelling on a large quantity of unlabelled data. We successfully outperform the baseline models on all 3 subtasks, placing 56th (of 84) on Task A, 43rd (of 69) on Task B, and 37th (of 63) on Task C.
Through comparison with several human-annotated reference sets, we find word2vec to be substantively superior to GloVe for unsupervised measurement of the semantic compositionality of MWE candidates. We also demonstrate that a sample of 10% of sentences in the English Wikipedia can provide a conveniently tractable corpus with only moderate reduction in the quality of outputs.
Release of MWE resources following completion of dissertation and publication of ACL paper summarising the work. Push to Zenodo.
University Student Surveys Using Chatbots: Artificial Intelligence Conversational Agents
Predefined web surveys are often used to collect course evaluations from students in higher education institutions. These institutions use the evaluations to adjust their courses’ pedagogical standards and lecture style to cope with an increasingly uncertain and complex world. Many limitations to using web surveys have been reported, such as low response rates and low-quality responses to open questions. To overcome these limitations, artificial intelligence conversational agents (CAs) or ‘chatbots’ are used to play the interviewer role, facilitating the enhancement of the quality of responses. This is accomplished by mimicking human-human conversations: by asking questions in a friendly, casual way and pursuing high-quality responses. This study aims to explore the opportunities and the obstacles of using CAs in collecting course evaluations in three European universities (UK, Spain and Croatia) and one Centre of excellence in Cyprus. The transcripts collected have been analyzed using statistical data analysis methods and qualitative data analysis techniques. Our findings reveal that the use of CAs in collecting course feedback from students has a positive impact on response quality and can boost students’ enjoyment levels. Furthermore, gender differences and student age have been identified as important factors that can influence the depth of the conversation with the CA.
Papers by Thomas Pickard