TY - JOUR
T1 - Data mining methodological weaknesses and suggested fixes
AU - Maindonald, John
PY - 2006
Y1 - 2006
N2 - Predictive accuracy claims should give explicit descriptions of the steps followed, with access to the code used. This allows referees and readers to check for common traps, and to repeat the same steps on other data. Feature selection and/or model selection and/or tuning must be independent of the test data. For use of cross-validation, such steps must be repeated at each fold. Even then, such accuracy assessments have the limitation that the target population, to which results will be applied, is commonly different from the source population. Commonly, it is shifted forward in time, and it may differ in other respects also. A consequence of source/target differences is that highly sophisticated modeling may be pointless or even counter-productive. At best, model effects in the target population may be broadly similar. Investigation of the pattern of changes over time is required. Such studies are unusual in the data mining literature, in part because relevant data have not been available. Several recent investigations are noted that shed interesting light on the comparison between observational and experimental studies, with particular relevance when there is an interest in giving parameter estimates a causal interpretation. Data mining activity would benefit from wider co-operation in the development and deployment of computing tools, and from better integration of those tools into the publication process.
AB - Predictive accuracy claims should give explicit descriptions of the steps followed, with access to the code used. This allows referees and readers to check for common traps, and to repeat the same steps on other data. Feature selection and/or model selection and/or tuning must be independent of the test data. For use of cross-validation, such steps must be repeated at each fold. Even then, such accuracy assessments have the limitation that the target population, to which results will be applied, is commonly different from the source population. Commonly, it is shifted forward in time, and it may differ in other respects also. A consequence of source/target differences is that highly sophisticated modeling may be pointless or even counter-productive. At best, model effects in the target population may be broadly similar. Investigation of the pattern of changes over time is required. Such studies are unusual in the data mining literature, in part because relevant data have not been available. Several recent investigations are noted that shed interesting light on the comparison between observational and experimental studies, with particular relevance when there is an interest in giving parameter estimates a causal interpretation. Data mining activity would benefit from wider co-operation in the development and deployment of computing tools, and from better integration of those tools into the publication process.
KW - Comparison of algorithms
KW - Data mining
KW - Observational data
KW - Predictive accuracy
KW - Reject inference
KW - Selection bias
KW - Statistics
KW - Target population
UR - http://www.scopus.com/inward/record.url?scp=84870549537&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84870549537
SN - 1445-1336
VL - 61
SP - 9
EP - 16
JO - Conferences in Research and Practice in Information Technology Series
JF - Conferences in Research and Practice in Information Technology Series
T2 - 5th Australasian Data Mining Conference, AusDM 2006
Y2 - 29 November 2006 through 30 November 2006
ER -