Using Hadoop to Cluster Data in Energy System
University of Dayton Hou Jun
This thesis takes a distributed computing approach built on Apache Hadoop, a framework for distributed data analysis that runs across multiple computers. The main components of this work include an implementation of k-means machine learning algorithms on the Hadoop MapReduce framework, processing of raw data from real energy systems, clustering of that data using k-means algorithms in Hadoop, and an improvement to seed selection for k-means algorithms. Finally, this thesis demonstrates the efficiency and effectiveness of the approach on different data sets.
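To make the k-means-on-MapReduce idea concrete, the following is a minimal single-process sketch in Python of how one k-means iteration can be split into a map step and a reduce step. It is not the thesis's Hadoop implementation: the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) and the in-memory grouping that stands in for Hadoop's shuffle phase are assumptions made for illustration only, and seed selection is not covered.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step (illustrative): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step (illustrative): average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for Hadoop's shuffle-and-sort phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points is left unchanged (an assumption).
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    seeds = [(0, 0), (10, 10)]
    print(kmeans_iteration(data, seeds))
```

In a real Hadoop job the map and reduce functions would run on separate nodes, and the iteration would be repeated as a sequence of jobs until the centroids stop moving.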