Abstract
Logistic regression is a powerful and widely used analytical tool in linguistics for modelling a binary outcome variable against a set of explanatory variables. One challenge that can arise when applying logistic regression to linguistic data is complete or quasi-complete separation, phenomena that occur when (paradoxically) the model has too much explanatory power, resulting in effectively infinite coefficient estimates and standard errors. Rather than treating this as a drawback of the method, or naïvely removing the covariates that cause separation, we demonstrate a straightforward and user-friendly modification of logistic regression, based on penalising the coefficient estimates, that handles separation systematically. We illustrate the use of penalised, multi-level logistic regression on two clustered datasets relating to second language acquisition and corpus data, showing in both cases how penalisation remedies the problem of separation and thus allows sensible and valid statistical conclusions to be drawn. We also show via simulation that the results are not overly sensitive to the amount of penalisation employed for handling separation.
| Original language | English |
| --- | --- |
| Article number | 100044 |
| Journal | Research Methods in Applied Linguistics |
| Volume | 2 |
| Issue number | 1 |
| Early online date | 2 Mar 2023 |
| DOIs | |
| Publication status | Published - Apr 2023 |