Sourcerer: Mining and Searching Internet-Scale Software Repositories
Data Mining and Knowledge Discovery
Jan 2008
Authors: E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, P. Baldi
Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software at Internet scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised,