Data Agnosticism: Feature Engineering Without Domain Expertise
Authors: Kridler, Nicholas, Accretive Health
Bits are bits. Whether you are searching for whales in audio clips or trying to predict hospitalization rates based on insurance claims, the process is the same: clean the data, generate features, build a model, and iterate. Better features lead to a better model, but without domain expertise it is often difficult to extract those features. NumPy/SciPy, Matplotlib, Pandas, and scikit-learn provide an excellent framework for data analysis and feature discovery. This is evidenced by high-performing models in the Heritage Health Prize and the Marinexplore Right Whale Detection challenge. In both competitions, the largest performance gains came from identifying better features. This required being able to repeatedly visualize and characterize model successes and failures. Python provides this capability as well as the ability to rapidly implement and test new features. This talk will discuss how Python was used to develop competitive predictive models based on derived features discovered through data analysis.