TY - GEN
T1 - Towards a parallel data mining toolbox
AU - Christen, P.
AU - Hegland, M.
AU - Nielsen, O. M.
AU - Roberts, S.
AU - Strazdins, P.
AU - Semenova, T.
AU - Altas, Irfan
AU - Hancock, Timothy
N1 - Publisher Copyright: © 2001 IEEE.
PY - 2001
Y1 - 2001
N2 - This paper presents research projects tackling two aspects of data mining. First, a toolbox is discussed that allows flexible and interactive data exploration, analysis and presentation using the scripting language Python. The advantages of this toolbox are that it provides the functionality to process multiple SQL queries in parallel, and enables fast data retrieval using a supervised caching mechanism for commonly used queries. These two facets of the toolbox allow for fast, efficient data access, reducing the time spent on data exploration, preparation and analysis. Secondly, an approach to predictive modelling is presented that leads to scalable parallel algorithms for high-dimensional data collections. This is an essential requirement for data mining algorithms, as those that do not scale linearly with the data size are infeasible. These algorithms are implemented in parallel and achieve an almost ideal speedup for their respective implementations. One aim of the presented research is to integrate and combine these two different aspects of data mining into an efficient but flexible data mining toolbox that allows the experienced data miner to attack large-scale problems interactively or with batch processing.
UR - http://www.scopus.com/inward/record.url?scp=84981166203&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2001.925141
DO - 10.1109/IPDPS.2001.925141
M3 - Conference contribution
T3 - Proceedings - 15th International Parallel and Distributed Processing Symposium, IPDPS 2001
SP - 1563
EP - 1570
BT - Proceedings - 15th International Parallel and Distributed Processing Symposium, IPDPS 2001
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th International Parallel and Distributed Processing Symposium, IPDPS 2001
Y2 - 23 April 2001 through 27 April 2001
ER -