
Python in Data Science Research and Education

Randy Paffenroth
Worcester Polytechnic Institute, Mathematical Sciences Department and Data Science Program

Xiangnan Kong
Worcester Polytechnic Institute, Computer Science Department and Data Science Program


In this paper we demonstrate how Python can be used throughout the entire life cycle of a graduate program in Data Science. In interdisciplinary fields such as Data Science, students often come from a variety of backgrounds; for example, some may have strong mathematical training but less experience in programming. Python’s ease of use, open source license, and access to a vast array of libraries make it particularly well suited for such students. In particular, we discuss how Python, IPython notebooks, scikit-learn, NumPy, SciPy, and pandas can be used in several phases of graduate Data Science education, starting with introductory classes (covering topics such as data gathering, data cleaning, statistics, regression, classification, and machine learning) and culminating in degree capstone research projects that use more advanced ideas such as convex optimization, non-linear dimension reduction, and compressed sensing. Of particular note is the scikit-learn library, which provides numerous routines for machine learning. Access to such a library allows interesting problems to be addressed early in the educational process, and the experience gained with such “black box” routines provides a firm foundation for the students’ own software development, analysis, and research later in their academic careers.
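
As a rough illustration of the “black box” usage described above, the following minimal sketch trains and scores a scikit-learn classifier in a handful of lines. The choice of the bundled Iris dataset, the logistic regression model, and the helper name `fit_black_box_classifier` are assumptions made for this example only; they are not taken from the paper's course material.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_black_box_classifier(random_state=0):
    """Fit a classifier on Iris and return it with its held-out accuracy.

    Hypothetical helper for illustration; dataset and model are assumptions.
    """
    # Load a small, well-known dataset that ships with scikit-learn.
    X, y = load_iris(return_X_y=True)

    # Hold out a quarter of the samples so the score reflects unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )

    # Treat the estimator as a black box: construct, fit, score.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    return model, accuracy


if __name__ == "__main__":
    _, accuracy = fit_black_box_classifier()
    print(f"held-out accuracy: {accuracy:.2f}")
```

Because scikit-learn estimators share the same `fit` and `score` interface, a student could swap in a different estimator here without changing the surrounding code, which is one reason such routines suit early coursework.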


data science, education, machine learning
