Lessons learned: Four tips for designing for raters
Simplifying raters’ workflow is essential for training ML models efficiently. The UX tweaks described above are a start, but my team at Google Brain is still in the early stages of building our own tool for labeling data. Here are a few things the team has learned so far about designing for raters.
1. Use multiple shortcuts to optimize key flows
Raters regularly use keyboard shortcuts to move between lines, select text, and assign labels. It’s a testament to the engineer who originally built the raters’ annotation tool that no one has ever complained about word selection: the tool automatically snaps to the entire word, which has always made highlighting fast and slick.
When you start providing multiple ways for raters to complete a task, ask them for ideas. For example, our raters helped us recognize that grouping text highlights was an action worth optimizing. After a long day of manual grouping, one of them had the brilliant idea of selecting a paragraph of text and using a keyboard shortcut to automatically group every highlight within the selected region. That saved her from selecting each highlight individually to group it, shaving useful seconds off the task.
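To make the mechanics of that shortcut concrete, here is a minimal TypeScript sketch. The `Highlight` shape, the character-offset model, and the function name are assumptions for illustration, not the actual data model or API of our tool.

```typescript
// Hypothetical data model: a highlight is a character-offset range in the
// document text, with an optional group assignment.
interface Highlight {
  id: string;
  start: number;    // character offset where the highlight begins
  end: number;      // character offset where the highlight ends (exclusive)
  groupId?: string; // set once the highlight belongs to a group
}

/** Put every highlight that falls inside the selected region into one new group. */
function groupHighlightsInSelection(
  highlights: Highlight[],
  selectionStart: number,
  selectionEnd: number
): Highlight[] {
  const groupId = crypto.randomUUID(); // fresh group for this selection
  return highlights.map((h) =>
    h.start >= selectionStart && h.end <= selectionEnd ? { ...h, groupId } : h
  );
}

// Usage: the keydown handler for the grouping shortcut would call
// groupHighlightsInSelection(currentHighlights, selStart, selEnd)
// with the offsets of the rater's current text selection.
```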
2. Always provide easy access to labels
A typical dropdown menu might seem like the most obvious way to let raters label a highlighted phrase with tags, but it isn’t. Let’s go back to our ice cream store example. Imagine a rater highlighting “sea salt caramel,” then clicking a dropdown menu of additional labels to describe the ice cream as “creamy.” Dropdown menus work well for the first label, but not for subsequent ones. In studies with our annotation tool, raters complained that these menus, rendered as text boxes offering choices for additional labels, covered up existing labels on the screen. That obscured the surrounding text and made it difficult for raters to see the context they needed.
We redesigned the menu so that when raters select a highlight, the dropdown menu of labels opens in the margin of the screen. As raters add labels, the labels still appear below the highlighted text, but they’re no longer covered by a dropdown. This helped raters confirm which labels were already linked to words before adding more.
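One way to implement that margin placement is sketched below in TypeScript, under the assumption that the page reserves a positioned margin column (`.margin-rail`) and renders each highlight as its own element. The selectors and names are illustrative, not taken from our tool.

```typescript
// Position the label menu in the margin column, vertically aligned with the
// selected highlight, instead of overlaying it on the text.
// Assumes .margin-rail has `position: relative` so the menu can be placed
// absolutely within it.
function openLabelMenuInMargin(highlightEl: HTMLElement, menuEl: HTMLElement): void {
  const rail = document.querySelector<HTMLElement>(".margin-rail"); // assumed margin column
  if (!rail) return;

  const highlightTop = highlightEl.getBoundingClientRect().top;
  const railTop = rail.getBoundingClientRect().top;

  menuEl.style.position = "absolute";
  menuEl.style.top = `${highlightTop - railTop}px`; // line up with the highlight
  rail.appendChild(menuEl);                         // render beside, not over, the labels
}
```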
3. Let raters change their minds
Studying raters’ behavior taught me that they don’t do their work in a fixed sequential order. When reviewing text, they might highlight words and label them as they go, or they might get through a couple of paragraphs, notice the topic has shifted, then go back and put similar highlights into groups. If they’re reviewing images, they might skip a few tough ones and return to them later. And two raters will frequently disagree and need a conversation about how to reconcile their choices.
Raters need a flexible workflow and tools that support editing and out-of-sequence changes. One way to do this is to create text fields outside the main body, much like the ‘comment’ dialogs in Google Docs that appear in the margin and clearly reference the highlighted text. This lets raters leave notes on their progress or jot down questions to revisit later.
It’s also important to ensure that raters can discuss their choices and make the changes needed to align with each other. Supporting methods of offline discussion, such as shared commenting on the file itself, helps raters resolve disagreements more quickly.
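A simple data model can support both the margin notes and the back-and-forth between raters. The following is a hedged TypeScript sketch, assuming notes are anchored to a highlight by ID and discussion happens as threaded replies; none of these type names come from our tool.

```typescript
// A note is anchored to a highlight and carries an open discussion thread.
interface MarginNote {
  highlightId: string; // which highlighted span the note refers to
  author: string;
  body: string;        // e.g. "Is 'sea salt caramel' one flavor or two?"
  replies: { author: string; body: string }[];
  resolved: boolean;   // flipped once the raters agree
}

/** Add a reply to a note's thread; resolving is a separate, explicit step. */
function reply(note: MarginNote, author: string, body: string): MarginNote {
  return { ...note, replies: [...note.replies, { author, body }] };
}

/** Mark a disagreement as settled once the raters have aligned. */
function resolve(note: MarginNote): MarginNote {
  return { ...note, resolved: true };
}
```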
4. Auto-detect and display errors
It’s not fair to ask raters to simply “do their best” to meet a project’s rule set. Build the rules into the tool so it’s visually clear when errors are present. For example, we contemplated showing alerts for important errors or blocking submissions until they were fixed. Instead, we settled on an auto-prompted “error list” that raters must check before submitting. It lets raters know when errors are detected while avoiding the potential annoyance of pop-up alerts. Since we set up this list, the number of submissions with errors has dropped precipitously, and we’re left with far less throwaway data.
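As a rough illustration of how such an error list can be generated, here is a TypeScript sketch that treats each project rule as a predicate over a rater’s annotations. The rule set, type names, and messages are placeholders, not the rules we actually enforce.

```typescript
interface Annotation {
  text: string;     // the highlighted phrase, e.g. "sea salt caramel"
  labels: string[]; // labels the rater has attached
  groupId?: string; // group membership, if any
}

interface Violation {
  annotation: Annotation;
  message: string;
}

// A rule returns an error message when it fails, or null when it passes.
type Rule = (a: Annotation) => string | null;

// Placeholder rules for illustration only.
const rules: Rule[] = [
  (a) => (a.labels.length === 0 ? "Highlight has no label" : null),
  (a) => (a.groupId === undefined ? "Highlight is not assigned to a group" : null),
];

/** Build the error list shown to the rater before they submit. */
function buildErrorList(annotations: Annotation[]): Violation[] {
  const violations: Violation[] = [];
  for (const annotation of annotations) {
    for (const rule of rules) {
      const message = rule(annotation);
      if (message) violations.push({ annotation, message });
    }
  }
  return violations;
}

// The submit flow shows buildErrorList(...) to the rater for review rather than
// blocking submission or firing pop-up alerts.
```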