On the relationship between form and information content
Cassandra Jacobs, UW-Madison
Research Associate in Psychology
Zipf (1949) famously demonstrated that longer words are typically less frequent. However, more recent work has suggested that word frequency is not the driving factor behind word length. Piantadosi, Tily, and Gibson (2011) proposed instead that words that usually occur in more surprising or unpredictable contexts tend to be longer. They demonstrated this in a large-scale corpus study over several languages and reported that the relationship was stronger than the one originally specified by Zipf. In this talk, I will present ongoing work that uses cloze data collected with naturalistic sentences (Luke & Christianson, 2018) to test whether unpredictable words do in fact tend to be longer. Combining the human cloze data with a state-of-the-art model of next-word prediction, we find that less predictable words are not necessarily shorter or longer. Rather, the relationship between word length and predictability is highly non-linear, while the relationship between word length and word frequency reflects Zipf’s original proposals. We also find that word frequency predicts word length much more strongly than predictability does, which contradicts the claims of Piantadosi et al. I will discuss potential explanations for these discrepancies and future directions for this research.
References:
Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50, 826-833.
Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108, 3526-3529.