Automatic Classification of Human Translation and Machine Translation: a Study from the Perspective of Lexical Diversity

EasyChair Preprint 5487

9 pages • Date: May 8, 2021

Abstract

By using a trigram model and fine-tuning a pretrained BERT model for sequence classification, we show that machine translation and human translation can be classified with an accuracy above chance level, which suggests that machine translation and human translation differ in a systematic way. The classification accuracy of machine translation is much higher than that of human translation. We show that this may be explained by the difference in lexical diversity between machine translation and human translation. If machine translation has patterns independent of human translation, automatic metrics which measure the deviation of machine translation from human translation may conflate difference with quality. Our experiment with two different types of automatic metrics shows correlation with the result of the classification task. Therefore, we suggest that the difference in lexical diversity between machine translation and human translation be given more attention in machine translation evaluation.

Keyphrases: BLEU, lexical diversity, machine translation, machine translation evaluation, translation varieties

Links:

https://easychair.org/publications/preprint/t2Sj

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:5487,
  author    = {Yingxue Fu and Mark-Jan Nederhof},
  title     = {Automatic Classification of Human Translation and Machine Translation: a Study from the Perspective of Lexical Diversity},
  howpublished = {EasyChair Preprint 5487},
  year      = {EasyChair, 2021}}
