Scalable Data Analytics and Machine Learning

This project implements the K-means clustering algorithm in Spark and uses it to analyze political campaign data from the Federal Election Commission (FEC). The analysis covers about 432 MB of data describing the finances of candidates running for election in 2016, including contributions from individuals and organizations to campaigns, as disclosed to the FEC.

The raw FEC data files are first loaded into Spark DataFrames, which can then be queried with SQL. Finally, the implemented K-means clustering algorithm is used to group campaign contributions to the candidates into clusters by geographic location.
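
To illustrate the clustering step, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not the project's Spark implementation. The function names `closest_centroid` and `kmeans`, the parameters `max_iters` and `seed`, and the sample coordinates are all hypothetical and chosen only for illustration. It assumes each contribution has already been reduced to a (latitude, longitude) pair and uses plain Euclidean distance on those pairs, which is a simplification.

```python
import math
import random


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial centroids: k points picked at random from the input.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    labels = [closest_centroid(p, centroids) for p in points]
    return centroids, labels


# Made-up (latitude, longitude) pairs in two well-separated areas.
points = [
    (40.7, -74.0), (40.8, -73.9), (40.6, -74.1),
    (34.0, -118.2), (34.1, -118.3), (33.9, -118.1),
]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

In a distributed setting, the assignment step is the part that runs over the full dataset, while the update step only needs per-cluster sums and counts, which is what makes the algorithm a natural fit for a framework like Spark.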