Learn2Mine

Learn2Mine: An Open-Source Cloud-Based Informatics Platform for Integrated Teaching and Data Exploration
Author(s): Clayton Turner, Jacob Dierksheide, and Paul Anderson

While the development of cross-disciplinary fields, such as bioinformatics and data science, has changed scientific inquiry in many respects, the realization of their full potential has in many instances been hindered by the separation of algorithm developers and domain experts. Cloud-based, platform-independent services and software have emerged as a mechanism to lower the technical barriers of scientific computing. A byproduct of this trend, Learn2Mine is an open-source, cloud-based informatics platform engineered for data exploration. Combining the machine learning techniques from Weka with a user interface designed to flatten the learning curve and match the ease of use of other software, Learn2Mine provides scientists with a single integrated platform to learn data mining techniques and apply them to their biological datasets. As scientists work through the online lessons, their confidence in their own analytical results will increase. Scientists will thus be able to focus on evaluating methods for analyzing their data and use Learn2Mine’s integrated cloud storage to share data and results with collaborators and the community. The platform is also designed to teach data mining and machine learning techniques to computer scientists, allowing them to apply classifiers, clustering algorithms, etc. to datasets that have been shared by the community. This allows students with a variety of backgrounds to share and compare their results, which gives context to their data mining applications while their knowledge of statistical techniques improves.

Learn2Mine uses a decomposition, or drill-down, method of input and output, as opposed to the workflow methods most prominent in typical systems such as RapidMiner, Taverna, and Galaxy. This lets Learn2Mine operate at higher levels of abstraction, encapsulating complex workflows based on their input and output. Scientists can customize shared templates for statistical analysis, combining algorithms and tools to better fit their data. For example, a scientist examining genetics data may need a custom template that integrates a Nearest Neighbor algorithm with a Support Vector Machine in order to produce results with the required utility. Novel customization is supported, and new templates created by scientists can persist indefinitely in Learn2Mine, in addition to being shared back to the community, allowing Learn2Mine to evolve. These capabilities and features are accessible as cloud-based software as a service, currently hosted on Google App Engine, with storage and sharing of biological datasets provided through integration with Google Drive. In addition to these advantages, Learn2Mine also teaches data science through a gameful experience, which allows students to visit the site and immediately start programmatic lessons with built-in bioinformatics practice datasets and specialized feedback. Classifiers, like abilities in a game, are unlocked by completing lessons, and unlocked abilities can be viewed in the skill tree on a student’s profile, which is linked through Google. First, the system’s initial implementation and associated feedback integration will be tested on its ability to analyze standard pedagogical datasets (e.g., the Iris dataset). Second, the utility of this software will be demonstrated through ongoing collaborations with the Air Force Research Laboratory, where metabolomics spectroscopic datasets of human fatigue and toxicology will be analyzed.
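
The lesson-based unlocking described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not Learn2Mine's actual implementation: the lesson identifiers, the prerequisite structure, and the names StudentProfile, LESSON_UNLOCKS, and PREREQUISITES are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from a lesson to the classifier it unlocks.
# These identifiers are illustrative and are not taken from Learn2Mine.
LESSON_UNLOCKS = {
    "intro-knn": "k-Nearest Neighbors",
    "intro-svm": "Support Vector Machine",
    "intro-clustering": "k-Means",
}

# Hypothetical prerequisites: lessons that must be finished first.
PREREQUISITES = {
    "intro-knn": [],
    "intro-svm": ["intro-knn"],
    "intro-clustering": [],
}


@dataclass
class StudentProfile:
    """A student's progress; completed lessons determine unlocked skills."""

    name: str
    completed: set = field(default_factory=set)

    def can_start(self, lesson):
        """Return True if every prerequisite of the lesson is completed."""
        return all(p in self.completed for p in PREREQUISITES[lesson])

    def complete(self, lesson):
        """Mark a lesson as completed and return the skill it unlocks."""
        if not self.can_start(lesson):
            raise ValueError(f"prerequisites not met for {lesson!r}")
        self.completed.add(lesson)
        return LESSON_UNLOCKS[lesson]

    def skill_tree(self):
        """Map each skill to whether the student has unlocked it."""
        return {
            skill: lesson in self.completed
            for lesson, skill in LESSON_UNLOCKS.items()
        }


if __name__ == "__main__":
    student = StudentProfile("example-student")
    student.complete("intro-knn")
    for skill, unlocked in student.skill_tree().items():
        print(f"{skill}: {'unlocked' if unlocked else 'locked'}")
```

Deriving the skill tree from the set of completed lessons, rather than storing it separately, keeps the profile view consistent with lesson progress by construction.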