Implicit Provenance for Machine Learning Artifacts - Rese...

Machine learning (ML) presents new challenges for reproducible software engineering, as the artifacts required for repeatably training models are not just versioned code, but also hyperparameters, code dependencies, and the exact version of the training data. Existing systems for tracking the lineage of ML artifacts, such as TensorFlow Extended or MLFlow, are invasive, requiring developers to refactor their code that now is controlled by the external system. In this paper, we present an alternative approach, we call implicit provenance, where we instrument a distributed file system and APIs to capture changes to ML artifacts, that, along with file naming conventions, mean that full lineage can be tracked for TensorFlow/Keras/Pytorch programs without requiring code changes. We address challenges related to adding strongly consistent metadata extensions to the distributed file system, while minimizing provenance overhead, and ensuring transparent eventual consistent replication of extended metadata to an efficient search engine, Elasticsearch. Our provenance framework is integrated into the open-source Hopsworks framework, and used in production to enable full provenance for end-to-end machine learning pipelines.