Datasaur: Teaching Computers To Understand Human Language

By: Brett Gibson

Teaching computers to understand human language may be one of the most potentially impactful goals in software today. To date, all signs point to machine learning (ML) being the approach most likely to succeed in that goal. But for machine learning models to work well, we need a lot of data, and in most scenarios a lot of labeled data. The need to label large amounts of data for natural language processing (NLP) models accurately and efficiently is precisely why Initialized invested in Datasaur. We’re proud to announce that we led their seed round, joined by OpenAI CTO Greg Brockman, with previous investment from Y Combinator.

With the release of GPT-3 by OpenAI, we have seen very impressive gains in computers’ ability to generate sensible text. General-purpose, large-scale models based on unsupervised learning (that is, models trained on unlabeled, unstructured data) are great for generating sensible-sounding language.

But there is still a large gap between how quickly humans are able to interpret language, understand context and internal references, and track meaning from sentence to sentence, and how well a computer can do these tasks. On these tasks, supervised models (those trained on input data that has previously been labeled by humans) are the most accurate and consistent approach.

On top of that, there are narrower domains, like law, where we need computers to understand and interpret idiosyncratic terminology and meanings. In some of these cases, the number of people who can even decipher the language is quite low. Supervised ML approaches have even more of an advantage in these narrow domains.

The hard part of building NLP models is often the labeling. The tooling for training these models is widely open source and available, and raw data abounds. We have the data and humans who understand it immediately; we just need labeling tools to extract that understanding as quickly and simply as possible from human labelers. Datasaur is solving exactly this problem, creating interfaces that delight individual data labelers, tied together with all the workflow management a team needs to generate quality labels.

Image: The Datasaur Labeling Interface

I already knew labeling was a real problem and was impressed with the product before we took a meeting with Datasaur. Meeting Ivan Lee, Datasaur’s founder and CEO, sealed my conviction that this was the type of team and opportunity Initialized should be backing. Ivan drew on his experience studying human-computer interfaces and his time building similar tooling at Apple and Yahoo to create this solution. He has thought deeply about how to give labelers the best experience, and is constantly talking to customers about how to further improve the product and make their lives easier. We’re thrilled to have Ivan and Datasaur join the portfolio.

If you’re interested in what goes into data labeling for NLP, Ivan has a clear but comprehensive recent post on the topic. And if you have some data to label, definitely check out Datasaur.