While machine learning applications for classifying data items like tweets or news articles have grown tremendously in recent years, building a labeled training dataset for these methods remains a major challenge. A high-quality labeled training dataset is critical for machine learning: it is the foundation an algorithm uses to learn to classify future data items. Yet human coders must often expend considerable time and resources to build this dataset, and ensuring consistency between coders can be even more challenging.
RTI Data Scientist Rob Chew experienced this firsthand on a number of Center for Data Science projects and imagined a user-friendly tool that could make the labeling process more efficient and enjoyable. Recently, with support from the National Consortium for Data Science as a 2017-2018 Data Fellow, Rob led a team to develop SMART: Smarter Manual Annotation for Resource-constrained collection of Training data.
The primary goal of SMART is to make the data labeling process more manageable for research teams looking to build training datasets for classification tasks. While marketplaces exist for outsourcing data labeling, researchers often cannot use these options for restricted-access datasets or for complex labeling structures that require expert knowledge. SMART addresses these challenges by providing a secure platform for in-house coders, with comprehensive metrics to ensure consistency between coders.
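To make the idea of a coder-consistency metric concrete, here is a minimal sketch of Cohen's kappa, a common measure of agreement between two coders that corrects for agreement expected by chance. The article does not say which metrics SMART reports, so the choice of kappa, the function name `cohens_kappa`, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders who labeled the same items.

    Illustrative only: not taken from the SMART codebase.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty set of items.")

    n = len(labels_a)

    # Fraction of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each coder's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1:
        # Both coders used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders on the same six items.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a low score can flag labels or guidelines that coders interpret differently.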
The secondary goal of SMART is to encourage further innovation in active learning methods by machine learning researchers. In many existing data labeling processes, human coders receive data items in a random order. With active learning, SMART learns from the labels coders have already assigned and shows them only the data items it is most uncertain about classifying, so each newly labeled item contributes as much information as possible. Machine learning researchers can also use the SMART platform to compare alternative active learning algorithms.
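The selection step described above can be sketched as uncertainty sampling: score each unlabeled item by how unsure the current model is about it, then send the highest-scoring items to coders. The sketch below uses predictive entropy as the uncertainty score. This is one common choice and not necessarily what SMART implements, and the function names, item ids, and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, batch_size):
    """Return the ids of the `batch_size` items with the highest entropy.

    `predictions` maps an item id to the model's class probabilities for it.
    Illustrative only: not taken from the SMART codebase.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:batch_size]


# Hypothetical model output for four unlabeled tweets (two classes).
unlabeled = {
    "tweet_1": [0.98, 0.02],
    "tweet_2": [0.55, 0.45],
    "tweet_3": [0.80, 0.20],
    "tweet_4": [0.50, 0.50],
}
batch = select_most_uncertain(unlabeled, batch_size=2)
print(batch)  # ['tweet_4', 'tweet_2']
```

Random ordering would be just as likely to show coders `tweet_1`, which the model already classifies with high confidence, so labeling it would teach the model comparatively little.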
SMART is available to the public on GitHub. At the moment, it supports text classification only. In the future, Rob hopes to extend support to image classification and to allow for multi-labeling. By publishing SMART as an open-source tool, Rob hopes to continue to improve the platform so that machine learning becomes even more accessible to researchers at RTI and beyond.
The idea for SMART and many similar innovations emerged from the rapidly accelerating demand for data science at RTI. Over its five-year history, the RTI Center for Data Science has enabled our researchers to use a wide array of machine learning, natural language processing, computer vision, and other data science methods to solve problems with massive amounts and varieties of data.
Data Scientist Michael Wenger and Summer Intern Caroline Kery were crucial members of the SMART development team and spearheaded key components of the year-long project. Michael said that building SMART was a terrific opportunity to use backend, frontend, and modeling skills and that “developing a full stack application to assist in data annotation brought many challenges but was a great learning experience.” Caroline said that the work was important to her experience as a rising graduate student, and she especially appreciated the mentorship from both Rob and Michael that helped her improve her programming knowledge and practices.