Dealing with complete separation and quasi-complete separation in logistic regression for linguistic data

Robert G. Clark*, Wade Blanchard, Francis K.C. Hui, Ran Tian, Haruka Woods

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    7 Citations (Scopus)

    Abstract

    Logistic regression is a powerful and widely used analytical tool in linguistics for modelling a binary outcome variable against a set of explanatory variables. One challenge that can arise when applying logistic regression to linguistic data is complete or quasi-complete separation, phenomena that occur when (paradoxically) the model has too much explanatory power, resulting in effectively infinite coefficient estimates and standard errors. Instead of seeing this as a drawback of the method, or naïvely removing covariates that cause separation, we demonstrate a straightforward and user-friendly modification of logistic regression, based on penalising the coefficient estimates, that is capable of systematically handling separation. We illustrate the use of penalised, multi-level logistic regression on two clustered datasets relating to second language acquisition and corpus data, showing in both cases how penalisation remedies the problem of separation and thus allows sensible and valid statistical conclusions to be drawn. We also show via simulation that results are not overly sensitive to the amount of penalisation employed for handling separation.
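    The abstract's core idea can be sketched in a few lines of code. The article itself applies penalised, multi-level logistic regression; the minimal illustration below instead uses a simple ridge (quadratic) penalty in a single-level model, which is an assumption standing in for the paper's actual penalisation scheme. The function name `fit_penalised_logit` and the toy dataset are hypothetical. The point it demonstrates is the one made in the abstract: with complete separation the ordinary maximum-likelihood coefficients drift towards infinity, whereas any positive amount of penalisation yields finite, usable estimates.

    ```python
    import numpy as np

    def fit_penalised_logit(X, y, lam=1.0, n_iter=50):
        """Ridge-penalised logistic regression via Newton-Raphson.

        Maximises the log-likelihood minus (lam/2) * ||beta||^2.
        Any lam > 0 shrinks the coefficients and keeps the estimates
        finite even under complete separation; lam = 0 would recover
        the ordinary MLE, which diverges on separated data.
        """
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-(X @ beta)))     # fitted probabilities
            grad = X.T @ (y - mu) - lam * beta          # penalised score
            w = mu * (1.0 - mu)                         # IRLS weights
            hess = X.T @ (X * w[:, None]) + lam * np.eye(p)
            beta = beta + np.linalg.solve(hess, grad)   # Newton step
        return beta

    # Completely separated toy data: y = 1 exactly when x > 0,
    # so an unpenalised fit would push the slope towards infinity.
    x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
    X = np.column_stack([np.ones_like(x), x])           # intercept + slope
    y = (x > 0).astype(float)

    beta_pen = fit_penalised_logit(X, y, lam=1.0)       # finite estimates
    ```

    Echoing the abstract's simulation finding, refitting with a lighter penalty (e.g. `lam=0.1`) changes the magnitude of the slope but not its sign or the substantive conclusion.
    
    
    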

    Original language: English
    Article number: 100044
    Journal: Research Methods in Applied Linguistics
    Volume: 2
    Issue number: 1
    Early online date: 2 Mar 2023
    DOIs
    Publication status: Published - Apr 2023

