AI has made great leaps in the last five years thanks to the cloud-enabled increase in computational capacity, the availability of large-scale datasets, and the collaborative sharing of research and code among scientists. More sophisticated models can now process increasingly large volumes of information.
On episode 7 of AI at Work, we covered some trending topics, addressing why we aren’t seeing more AI research make it into the products we use, and what it takes to build a strong data science team. Dhairya Dalal and Daniel Shank of Talla’s data science team share their insights and explain data science terms in their simplest form.
Challenges and Trends in Applying Data Science at Your Company
With AI being the hot topic that it is, there’s a lot of high-level research taking place, but we often don’t see it translating into the products that we use. How come? Shank and Dalal explain that scientists are focusing more on answering very narrow and specific questions, and less on thinking about how their model might be extrapolated to suit a production setting.
“That's usually a large part of the challenge that you run into applying research to production. But the other thing is that usually, you run into more domain specific issues. The data sets [the scientists] are using work because they're basically optimizing for particular distributions or particular properties of the data set that might not exist in the data set they have in production or in your business use case,” elaborates Dalal.
Adapting the code used in research to production is very tough. According to Shank, even minute changes in “things like your computation, the libraries you're using to do calculations, that have small differences out like decimal places, are going to make a big difference. And that's just not acceptable for production, right? Especially a lot of methods that are sensitive like that.” Shank describes how data science models are probabilistic and can sometimes produce different results given the same input. In production, consistency is critical. It is difficult to incorporate models that are likely to produce the right answer but are not guaranteed to do so 100% of the time.
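The numerical sensitivity Shank mentions is easy to demonstrate. This sketch (with invented numbers, not from the episode) shows how summing the very same floating-point values in a different order can change the result — the kind of tiny library-level difference that matters in production:

```python
# Illustrative sketch: the same floating-point values summed in two
# different orders. Small values added before a huge value get absorbed
# by rounding; added after the huge values cancel, they survive.
small_first = [0.1] * 10 + [1e16, -1e16]
large_first = [1e16, -1e16] + [0.1] * 10

print(sum(small_first))  # 0.0 — the ten 0.1s vanish into 1e16
print(sum(large_first))  # ~1.0 — the large values cancel first
```

Two mathematically identical computations, two different answers — which is why a research codebase ported to different libraries or hardware can quietly drift.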
What works in one narrow scope might not be broadly applicable to another. Citing a common frustration, Shank describes how some articles will declare, “AI can now read books and answer questions about them and then you get really excited and you actually go and you pull down their GitHub repo and you start to try to get to work. Then you realize that oh, actually, it works for that particular collection of articles or news articles. It works great on certain kinds of news articles that can answer questions about that, but in order to get it to work on something you care about, like in our case more business-related things, then you would need to actually make your own data set, which is really expensive and very time-consuming.”
As more companies integrate data science teams into their core operations, another trending topic is identifying the keys to building a strong data science team.
What does it take to be a strong data scientist that’s part of a successful data science team? For individuals, “the expertise is all in understanding what models are good for particular kinds of problems,” according to Shank. For teams, Dalal expands, “you really want three main things. One is the ability to communicate really well. Being able to take the complexities and the nuances of the models and then being able to explain it to a lay audience on the business side, the product side, as well as to an engineering side. The second aspect is just being able to just go in and write the code in the environment that you have. Then the last thing I think is useful for building a data scientist team is having specific domain expertise. You want someone who has a deep NLP experience or has deep computer vision experience or has deep optimization experience because those specialties will help take a general product feature and deliver that differentiating value for wherever you're building.”
Data science teams are undergoing a process of evolution. It is no longer sufficient to have a group of siloed researchers working with piles of data and producing black box results. Effective data science teams require interdisciplinary members who have strong communication skills and can be heavily involved with the business and engineering side of operations.
Common Data Science Terms Explained
Terminology can get very technical, very quickly. Here’s a rundown of the AI terms our data scientists defined in this episode.
Structured data - Information that is well organized and annotated. For example, information stored in relational databases.
Unstructured data - Essentially free-form text data that exists in the wild in raw and unstructured formats. In addition to text, it can also be information that's captured in images or infographics that are in PDFs, PowerPoint slides, or other document formats.
Discriminative models - Discriminative models answer very specific questions about data, namely which category or group each data point belongs to. They are used to classify or distinguish data points based on prior observations. For example, given many pictures of muffins and dogs, the model can predict whether an arbitrary new picture is a muffin or a dog.
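Here is a minimal sketch of the discriminative idea (the features and numbers are invented for illustration): a nearest-neighbour classifier that directly answers "which label does this new point get?" based on prior observations:

```python
import math

# Toy training data: each example is ((roundness, fur_texture), label),
# with both features on a 0-1 scale. All values are made up.
training = [
    ((0.9, 0.1), "muffin"),
    ((0.8, 0.2), "muffin"),
    ((0.3, 0.9), "dog"),
    ((0.2, 0.8), "dog"),
]

def classify(point):
    # Discriminative answer: the label of the closest training example.
    return min(training, key=lambda ex: math.dist(point, ex[0]))[1]

print(classify((0.85, 0.15)))  # muffin
print(classify((0.25, 0.9)))   # dog
```

Note that the model never describes what muffins or dogs look like in general — it only learns the boundary between the two groups.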
Generative models - Instead of answering a question about data, a generative model simulates the data itself. You show it a data set and it will produce more examples that plausibly could have been drawn from that data set but weren’t. You can use a generative model to answer questions, too, but it takes a little more time. You have to generate enough data points that you can work out the statistics — for instance, how many generated images that looked like this were dogs and how many were muffins.
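To make that concrete, here is a hedged sketch (all numbers and labels invented): a tiny generative model that fits a one-dimensional Gaussian per label to a "roundness" feature, simulates new examples, and then answers a classification question indirectly by counting samples, just as the definition describes:

```python
import random
import statistics

# Made-up "roundness" measurements per label.
data = {
    "muffin": [0.9, 0.85, 0.8, 0.95],
    "dog":    [0.2, 0.3, 0.25, 0.15],
}

# Fit the generative model: a (mean, stdev) Gaussian for each label.
params = {
    label: (statistics.mean(xs), statistics.stdev(xs))
    for label, xs in data.items()
}

random.seed(0)

def generate(label, n):
    # Simulate n new examples drawn from the fitted distribution.
    mu, sigma = params[label]
    return [random.gauss(mu, sigma) for _ in range(n)]

# Answering "is a roundness of 0.88 a muffin or a dog?" the generative
# way: draw many samples and count how many of each land nearby.
samples = {label: generate(label, 10_000) for label in params}
near = {
    label: sum(1 for x in xs if abs(x - 0.88) < 0.05)
    for label, xs in samples.items()
}
print(near)  # far more "muffin" samples fall near 0.88 than "dog" samples
```

The extra sampling step is exactly the "takes a little more time" trade-off: the model captures the whole distribution, and classification falls out as a statistic of the simulated data.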
Bugs - An issue directly caused by human error, usually by an engineer who made a mistake in writing the code.
Errors - An error is when you say, well, the model gave you an answer, it’s just not the right answer. For whatever reason, you don't have enough data, your problem is difficult, it's noisy, or the model is just off.
Reinforcement learning - Essentially, an agent teaching itself how to perform a task. The model trains itself on something called a reward function. What it does is it says, hey, I have an idea of what is right based on this function. If an outcome matches my example, I'll give it a positive value; if it doesn't match, I'll give it a penalty, or negative value. The reinforcement agent starts randomly walking through the space of actions, trying to work out which set of actions gets the correct output.
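The loop described above can be sketched in a few lines (the task and reward values are invented for illustration, and this is the crudest possible version — random search rather than a real RL algorithm): an agent on a number line randomly tries sequences of steps, and a reward function scores each sequence as positive or negative depending on whether it reaches the goal:

```python
import random

ACTIONS = [+1, -1]  # step right or step left on a number line
GOAL = 3            # target position, starting from 0

def reward(actions):
    # Positive value if the final position matches the goal, else a penalty.
    return 1 if sum(actions) == GOAL else -1

random.seed(0)

# Randomly walk through the space of action sequences until one earns
# the positive reward.
best = None
for _ in range(1000):
    candidate = [random.choice(ACTIONS) for _ in range(5)]
    if reward(candidate) == 1:
        best = candidate
        break

print(best, sum(best))  # a sequence of ±1 steps that ends at 3
```

Real reinforcement learning replaces the blind random search with methods that learn which actions tend to earn reward, but the reward-driven trial-and-error structure is the same.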
Overfitting - Overfitting is the idea that when you show a model a particular data set, and it only has that data set to look at, it will learn to answer the questions it sees in a way that doesn't generalize, or apply, to data it hasn't seen. It's really only a problem if you don't have all the data — which is almost always the situation you find yourself in. What you wrestle with all the time in machine learning is trying to get your model to generalize: to learn something more universal from the data it sees that applies to everything else.
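An extreme example makes the failure mode obvious. In this sketch (data invented), a "model" that simply memorizes its training set scores perfectly on data it has seen and fails completely on data it hasn't, while a model that learned the underlying pattern generalizes:

```python
# Made-up data following the rule y = 2 * x.
train = {1: 2, 2: 4, 3: 6, 4: 8}
test = {5: 10, 6: 12}

# The ultimate overfit model: memorize every training example.
memorizer = dict(train)

def general_rule(x):
    # A model that learned the underlying pattern instead.
    return 2 * x

train_acc = sum(memorizer.get(x) == y for x, y in train.items()) / len(train)
test_acc = sum(memorizer.get(x) == y for x, y in test.items()) / len(test)
print(train_acc, test_acc)  # 1.0 0.0 — perfect on seen data, useless on unseen

rule_test_acc = sum(general_rule(x) == y for x, y in test.items()) / len(test)
print(rule_test_acc)  # 1.0 — the general rule still works
```

The gap between training accuracy and test accuracy is the standard symptom of overfitting, which is why held-out test data is essential.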
Despite the challenges of translating and applying AI research to production, advances continue, and more and more companies are building out data science teams. We’ve shared some perspectives on trending AI topics from two members of Talla’s data science team and decoded commonly used AI terms into plain English. Tune in to more episodes of AI at Work to keep your finger on the pulse of the latest advice for integrating AI into your workflow.