Mechanical Turk has proven to be a powerful tool for machine learning. In particular, it makes it very easy to generate large amounts of training data for machine learning tasks. For example, one can have Mechanical Turk workers transcribe recorded speech and use that to train speech recognition algorithms. A constant concern when gathering training data, especially from a service like Mechanical Turk, is the quality of the data. This post is about ensuring high-quality data for obscure tasks that the average person wouldn’t immediately know how to do.
Let’s go back to speech transcription. Speech transcription tasks are self-explanatory. You listen to the audio and you write down what the speaker is saying. There is no doubt about the objective of the task. Speech transcription tasks are also easy. Anybody can transcribe speech, provided they can hear and understand the speaker. Of course, transcriptions often have imperfections, and making good quality transcriptions is a laborious task, but it is easy compared to tasks that are creative or mentally challenging. Tasks like “Paint a portrait of the man in this picture” or “Design a logo for my company” would be two such examples. These tasks are also self-explanatory, but difficult.
Speech transcription is easy because you don’t need to learn to do it. Painting a portrait is difficult because you do need to learn to do it through lots of practice. In this case, you can’t expect good results from such a task on Mechanical Turk because the task is difficult.
There are also tasks that are easy, but not self-explanatory, in which the objective is initially not clear. These are tasks where someone new to the task will likely not know whether they got it right or wrong. Suppose you were an alien and I told you to label chairs in photographs. Because there is a vast variety of chairs, you would have to look at many photographs of chairs — perhaps dozens, or even hundreds — before you could reliably identify chairs in new photographs. Now replace labeling chairs with labeling some kind of obscure and variable object, like certain anatomical features in medical scans, or morphological features in fly embryos, and this becomes a real problem.
Labeling chairs in photographs is technically easy, but not self-explanatory if you’ve never seen a chair. If you are given a Mechanical Turk task to label something you haven’t seen much of, you will probably do a poor job. Unlike the task of painting a portrait, this kind of task is easy. But in this case, you can’t expect good results from such a task because it is not self-explanatory. It has a learning curve and requires training. You can include an extensive list of examples in your instructions for the task, but you have no guarantee that the worker will take the time to look at all of them, or even any of them.
One possible solution is to require commitment to two phases: a training phase, in which the worker completes a whole set of tasks and receives feedback on each one in order to learn what is expected of them, and a testing phase (in analogy with machine learning), in which the worker completes a whole set of other tasks for pay. Both of these stages can be encapsulated into one big task, which the worker will be required to commit to. One could also measure how well the worker completes the training tasks and set a threshold for admission to the actual task.
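As a rough illustration of that gating idea, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the gold labels, the task IDs, the 80% pass threshold, and the function names are all hypothetical, not part of any Mechanical Turk API.

```python
# Hypothetical sketch of the training-then-testing gate described above.
# The gold labels, task IDs, and 80% threshold are illustrative assumptions.

TRAINING_GOLD = {           # task_id -> known correct label
    "t1": "chair",
    "t2": "not_chair",
    "t3": "chair",
    "t4": "chair",
    "t5": "not_chair",
}
PASS_THRESHOLD = 0.8        # fraction of training tasks that must be correct


def give_feedback(task_id, answer):
    """Tell the worker whether their answer matched the gold label,
    so the training phase actually teaches them something."""
    correct = TRAINING_GOLD[task_id] == answer
    if correct:
        print(f"{task_id}: correct")
    else:
        print(f"{task_id}: incorrect, expected {TRAINING_GOLD[task_id]}")
    return correct


def passes_training(worker_answers):
    """Score a worker's training answers and decide whether they may
    proceed to the paid testing phase."""
    n_correct = sum(give_feedback(t, a) for t, a in worker_answers.items())
    score = n_correct / len(TRAINING_GOLD)
    return score >= PASS_THRESHOLD


answers = {"t1": "chair", "t2": "not_chair", "t3": "chair",
           "t4": "not_chair", "t5": "not_chair"}
print(passes_training(answers))  # 4 of 5 correct = 0.8, so this worker passes
```

In practice the feedback step would be part of the task's web interface rather than printed text, and the threshold would be tuned to how noisy the task is, but the logic is the same: no paid work until the training score clears the bar.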
Using Mechanical Turk to teach workers to label data in a certain way is a way to teach people to teach machines. The ultimate goal is to train a machine to do a task. The problem is that there is no data with which to train. One might be able to get away without crowdsourcing the generation of training data in some cases, but in other cases it becomes intractable. For example, labeling gene expression patterns in fly embryos across thousands of genes and more than a dozen developmental stages is too big a task for a small team of researchers to do. The solution is to have a systematic way to train many people — who are much faster learners — to do the task. The labor of these workers then serves as a way to train machine learning algorithms with tons of data.
To recap, tasks on Mechanical Turk can be either easy or difficult, and either self-explanatory or not. Quality of results might suck because tasks are too difficult, or because they are not self-explanatory enough for the average worker to complete in a satisfactory way. For tasks that are not self-explanatory, instead of getting crappy results from workers who do one task and never come back, we can get workers to commit to an entire training and testing program in which they learn what results are expected, yielding large amounts of high-quality training data.
I am currently working on modifying LabelMe, a tool for labeling objects in photographs, to generate such batch training and testing for image labeling tasks on Mechanical Turk. I hope to post more on this in the future, with actual results comparing the old single tasks and the new batch tasks. Stay tuned!