Arion Research LLC

Disambiguation Podcast - Automation and AI for Data Management - Transcript

Michael Fauscette

Welcome to Disambiguation. I'm your host, Michael Fauscette. Each week we interview experts in AI, generative AI, and business automation to help business leaders understand how to use these tools for the biggest business impact. In our show today, we look at the role of automation and AI in data management. I'm joined by Gaurav Pathak, VP of Product Management at Informatica. Welcome, Gaurav. So just to get us started, could you tell us a little bit about yourself and talk about your role at Informatica?

Gaurav Pathak

Sure, and thanks, Michael. Thanks for having me on your podcast today; I'm glad to be here to talk about AI for data management. I lead product management here at Informatica for AI initiatives and metadata. I've been with Informatica for 10 years, working on products like Cloud Data Governance and Catalog, Enterprise Data Catalog, the Big Data Management suite, etc. Before that, I spent 10 years with Oracle, so I've spent almost all of my career in data and data management. And we live in exciting times: AI is upon us, and I think this is a great space to be in if you're doing data management now, given the new possibilities that AI provides.

Michael Fauscette

Yeah, I mean, obviously this whole podcast has really been focused around that, and a bunch of my research as well, and I get excited when I think about the opportunities for businesses. One of the areas that I think is really foundational for everything, of course, is the data. So that's why I'm excited for us to dig into this a little bit and understand more about what you're offering and where you think that's going, from a data management and an automation perspective as well. So maybe you could start by giving us a bit of an overview of Claire and its purpose, and how it sits in the Informatica portfolio.

Gaurav Pathak

Claire, for us at Informatica, is an AI engine that provides, number one, productivity benefits to data teams: data engineers, data analysts, data scientists, data stewards, data reliability engineers — all the different functions working with data as their day-to-day jobs. And number two, it provides data democratization benefits to business users. Those of us in the business working on our day-to-day jobs need data for everything: we need data for making decisions, we need data for making predictions, and that's been one area with a lot of challenges. So we started Claire as a project within Informatica all the way back in 2016, for these two benefits. Underlying all of this, our approach is to use our understanding of an organization's data assets to provide capabilities that deliver these benefits to both of these audiences. As for the story of how we got into it: we've been in the metadata space for a very long time, more than two decades. Informatica is a data integration and data management company that has been around for a very long time; we started as an ETL provider with software like PowerCenter. Then we started getting requirements from our users: we're creating all these data pipelines that move data from applications into data warehouses like Teradata — how can we debug them? How can we understand them better? So we created tools for them, which we called Superglue. These were for data engineers, to be able to debug their pipelines and understand whether the data landed properly or not. That was the first use of metadata for us within Informatica. And metadata is data about the data assets themselves. You may have a customer table that has all the data about your 100,000 customers, but the metadata about it is: what are the different attributes of customers we're interested in capturing? What is the freshness of this customer table? Were there any errors when the pipeline last ran, so that you know what you're getting in the report is the right customer data? And so on. That's how we started, but around 2016 things were changing. There was this customer, a big healthcare provider, here at the Informatica offices, talking about their problem, and it was very interesting. They had 20,000-plus known databases within the organization and no idea what was inside them. Every time somebody asked, where's my patient data? Where is my claims data? Where is my case data? — they had no way to point them to the right database to go to. For us, that was a big change. It was a new way in which we had to think about metadata. It is not just about, oh, these are the attributes, or this is for data engineers to debug their pipelines. We started thinking about metadata as a way for users to understand what data assets they have: getting to them, finding them, discovering them, governing them, finding the quality of those data assets. That's where we realized metadata could play a big part.

The other thing we found, when we looked at those 20,000 databases, was that the things in them — tables, columns, views, and files — were not named properly. A table containing patient data might be called tab_pac or, God forbid, file1.csv. The key ask from those folks was: when I'm searching for patient data, I should be able to get that tab_pac table, or that file1.csv. So we started Claire as an engine that can fill in the gaps in an organization's metadata. We wrote all these deep-dive scanners, as we call them, that extracted the metadata from these databases, and then Claire started properly categorizing and labeling them, saying, oh, this data looks like patient data, or this table looks like it contains PII data, because it has first names, last names, addresses. That's where Claire started, and of course we've evolved it into doing a lot more.
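
To make that classification idea concrete, here is a minimal sketch in Python of inferring what a cryptically named table contains from the shape of its values and flagging likely PII. It is an illustration under assumptions, not Claire's implementation: the real scanners use trained models over a metadata knowledge graph, and the patterns and labels below are placeholders.

```python
import re

# Illustrative value-shape patterns; real classifiers are learned, not regexes.
PATTERNS = {
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "date_of_birth": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "phone": re.compile(r"^\+?[\d\s\-()]{7,15}$"),
}

# Labels that suggest the table holds personally identifiable information.
PII_LABELS = {"zip_code", "date_of_birth", "phone"}

def classify_column(sample_values):
    """Guess a semantic label for a column from a sample of its values."""
    for label, pattern in PATTERNS.items():
        if sample_values and all(pattern.match(v) for v in sample_values if v):
            return label
    return "unknown"

def label_table(columns):
    """columns: dict of column name -> sampled values.
    Returns per-column labels plus a table-level PII flag."""
    labels = {name: classify_column(values) for name, values in columns.items()}
    return labels, bool(PII_LABELS & set(labels.values()))

# Even a table cryptically named "tab_pac" can be flagged as likely PII:
labels, is_pii = label_table({
    "c1": ["02139", "94105-1111"],        # -> zip_code
    "c2": ["1984-03-14", "1990-07-01"],   # -> date_of_birth
})
print(labels, "PII:", is_pii)  # {'c1': 'zip_code', 'c2': 'date_of_birth'} PII: True
```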

Michael Fauscette

Is that what you call, I think, the intelligent metadata-driven engine? That, I think, is a differentiator from what I've seen, at least in the data management space. So from an AI and machine learning technology standpoint, how does that help with data quality?

Gaurav Pathak

For us, I think there are three big differentiators, as far as Informatica is concerned. Number one, like we talked about, our understanding of an organization's metadata, because we've spent a lot of time making sure that automated processes like Claire's scanners are able to understand and organize a customer's data estate as a metadata knowledge graph. Second, our differentiation comes from our now more than three decades of experience in data management projects. We understand what data quality looks like, what data pipelines look like, what a data governance project looks like, and being able to train our AI — and now large language models — on those is a differentiator. And finally, there's the cloud: Intelligent Data Management Cloud is our offering, the one most of our customers use now, and it has all the capabilities required for data management, whether that's cataloging, governance, marketplace, quality, or integration. All three together, I think, bring in the differentiation for Informatica. With data quality — the question you asked — data quality is a hard problem. It is a problem where things can go wrong at multiple levels, right from where you're observing these datasets all the way down to where you're trying to use this data in data science pipelines for training and validation. So for us, it starts with being able to understand, okay, this is the particular concept we're dealing with: we may be dealing with customer churn, we may be dealing with fraud, we may be dealing with predicting any of the things AI can predict today. Being able to say, this is what we're dealing with, and these are the right data quality rules that have to be created for it. So when data comes in — let's say a file called file1.csv with multiple columns — Claire can say, okay, this column looks like an address. And if there's an address, we try to find whether there is a zip code, and the zip codes should match the street addresses. We automatically run those data quality rules based on that metadata — Claire does that — and then automatically create the data quality scores. That's how.
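
As a hypothetical follow-on to the column classification above, here is a small sketch of a metadata-driven quality rule of the kind described: once one column is labeled as a zip code and another as a state, a consistency check can be generated and scored automatically. The prefix table is a tiny illustrative subset, not a real product rule.

```python
# Tiny illustrative mapping of leading zip digits to states (not exhaustive).
ZIP_PREFIX_TO_STATE = {"02": "MA", "10": "NY", "94": "CA"}

def zip_state_rule(row):
    """Generated rule: the zip code prefix must agree with the state column."""
    expected = ZIP_PREFIX_TO_STATE.get(row["zip_code"][:2])
    return expected is None or expected == row["state"]

def quality_score(rows, rule):
    """Fraction of rows passing the rule becomes the data quality score."""
    return sum(1 for r in rows if rule(r)) / len(rows) if rows else 1.0

rows = [
    {"zip_code": "02139", "state": "MA"},   # passes
    {"zip_code": "94105", "state": "NY"},   # fails: 94xxx zips are in CA
]
print(quality_score(rows, zip_state_rule))  # 0.5
```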

Michael Fauscette

That makes sense. So, you talked about Intelligent Data Management Cloud, and I know you recently announced some additions in the generative AI world — Claire GPT, I think, was one, and then I believe you also had an assistant, or copilot-like, offering. So how does that elevate the functions of Intelligent Data Management Cloud?

Gaurav Pathak

Sure. We've been in the AI business — and the AI for data management business — since 2018. It was not as cool as it is today, when everybody's talking about AI because of generative AI. But over that period we brought about 20-plus capabilities into Intelligent Data Management Cloud. Those include things like automatic data quality (the example I was giving you), automated data pipelines, automatic entity matching, and so on. But with generative AI we now have more possibilities for what we can do with data management, especially on the user experience side: being able to understand what the user is trying to do — whether it is to create a new data quality rule, to generate a new pipeline, or to create a new master data record — and then having all the software work for them in the context of the data management project they are doing. I don't think generative AI is at a place yet where it can do entire jobs on its own. It still requires a lot of human hand-holding; it does not excel at multiple things, and we'll talk about that. It's still in its early stages — ChatGPT came out in November of last year, so it's been less than a year — and we can already see a lot of potential in how we can use generative AI for generating data management code automatically and doing all these data management tasks. That's what Claire GPT is about, and we are going to release the first version of that product early next year. It's already in private preview with several customers.

Michael Fauscette

Well, you talked about the whole concept of democratizing the way companies can use their data. Obviously the data scientist, data quality, all of that is one function. But from the other end of it — the end user end — I wonder, because certainly a GPT engine can be much more conversational and really lets you use natural language to interact with things. Is that a part of this idea? Does it really become more of a front-end tool as well, so that you can open this up to more users and give them more capabilities?

Gaurav Pathak

Absolutely. That's one of our goals: we want users to be able to use natural language and talk to their data assets, like they would talk to a data analyst or a data engineer, to get the answer to a question they're looking for. But of course, it's a hard problem. Before all of this existed, before generative AI, we provided business users with self-service BI tools like Power BI or Tableau and other tools to make sense of data. But what we didn't provide them was information about the data assets themselves — is this the right data asset for doing your self-service BI analysis? And what that resulted in is what we jokingly call data brawls: two people coming to the same meeting, arguing about the same metric, because they used something totally different to calculate it. Those data brawls existed in the old world. I think we need to be careful that, with generative AI, it does not turn into LLM brawls, where LLMs get access to incorrect data or bad-quality data. We can already see some of that with things like hallucinations: if it does not know something, it hallucinates, makes things up, so it's already bad at getting to actual values. So for us, it's important that when we bring generative AI to the enterprise, it is in the context of trusted, high-quality data that everybody can agree on. That's the right way.

Michael Fauscette

Sure, that makes sense. So, from the other side of this: obviously you're offering an assistant, and I assume that also helps the data scientists and other data professionals in the business interact with that data in a more effective way as well. Is that correct? I mean, is it a tool for both ends, both the business user and the data team?

Gaurav Pathak

Yes. Our goal for data teams is for them to be able to create the first draft of their data management artifacts. If it's a data engineer, for them to be able to create a first draft of the data pipeline; if it's a data quality steward, to create a data quality rule; if it's a data analyst, to understand and explore the data before they use it in a report. We think it's still a first draft and not the entire task, because we don't think generative AI is there yet, where you can create a perfect pipeline or a perfect data quality rule just by natural language interaction. So first you create that draft, and then you tinker with it within the user experience tools in Intelligent Data Management Cloud. And for business users, the goal is to be able to come to Claire GPT and ask questions like, how many customers did we get for product X in quarter three? — and have Claire GPT understand, number one, what is this query all about; number two, what are the data assets I can go to to get the answer; and also check whether the user has permission to those assets. Otherwise I could ask the question, what is Gaurav's salary? So all of these have to come together to answer that question. I think that's our goal.

Michael Fauscette

Yeah, that really does change the way you can interact with the data. And it kind of takes it out of just the idea of reporting, right? Because you're really talking about on demand — it meets the need: this is what I'm trying to solve right now, or I'm trying to make a decision right now. So that sounds really powerful, from a user perspective particularly. You know, one of the things I've seen as I've talked to different enterprise application vendors is that there are two really distinct approaches to how they're using large language models. Some companies have been very closed: very focused on maybe one partner, or on integrating their own language model, or whatever it might be. And other companies have taken a much more open approach, so that you can pick the language model that suits what it is you're trying to do, and you could use multiple language models for different functions. So I'm curious how you're approaching that from Informatica's perspective — how you're interacting with the large language models, and which ones you support.

Gaurav Pathak

Sure. For us, Claire GPT is a mixture-of-experts model, so it's more like the second approach you described, Mike. It's about using smaller, open-source language models for the specific task we are going after. Claire GPT version one, for example, will support four use cases: number one, data discovery — finding where my data assets are; number two, metadata exploration — understanding what this file is all about, what the data quality of something is, where this data is coming from; number three, answering a question — what's my customer churn for Q3 for product X; and number four, creating new ELT data pipelines. All four use cases have a dedicated large language model behind them. Right now, we are actually using OpenAI for the two use cases that are currently in private preview, discovery and metadata exploration. We're also training our own open-source large language models, like the Llama 2 7-billion-parameter model, which came out a few weeks ago and is great at the kinds of smaller use cases we have divided the problem into. So our approach is to first provide a state-of-the-art fine-tuned open-source model for each particular task — discovery, metadata exploration, etc. — and then eventually even allow customers or partners to plug in their own models, which may know more about their data because they've trained them on their specific data sets as well. That's where we are going with Claire GPT.
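
A sketch of that routing idea, assuming nothing about Claire GPT's internals: classify the user's intent, then dispatch the request to a model fine-tuned for that one task. The model names and the keyword classifier here are placeholders for what would really be learned components.

```python
from typing import Callable

# Hypothetical task-specific models, one per use case described above.
TASK_MODELS = {
    "data_discovery": "llama2-7b-discovery-ft",
    "metadata_exploration": "llama2-7b-metadata-ft",
    "question_answering": "llama2-7b-qa-ft",
    "pipeline_generation": "llama2-7b-elt-ft",
}

def classify_intent(query: str) -> str:
    """Stand-in for a learned intent classifier."""
    q = query.lower()
    if "pipeline" in q or "load" in q:
        return "pipeline_generation"
    if "where" in q or "find" in q:
        return "data_discovery"
    if "quality" in q or "lineage" in q or "column" in q:
        return "metadata_exploration"
    return "question_answering"

def route(query: str, call_model: Callable[[str, str], str]) -> str:
    """Send the query to the expert model dedicated to its task."""
    return call_model(TASK_MODELS[classify_intent(query)], query)

# Example: route("Where are my churn data sets?", call_model=my_inference_fn)
```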

Michael Fauscette

It strikes me — and again, I've obviously been talking to a lot of different vendors about how they approach this — that because you're working with customer data, there's surely some concern around security and privacy issues. So how do you approach that in the way you've implemented this?

Gaurav Pathak

Absolutely. Informatica sells to Fortune 5000 companies, many of them financial and healthcare companies, who are very, very protective of their data. So that concern is top of mind for all of our customers: is my data getting sent to third-party vendors who may not be as careful with it? And the security and compliance teams in these organizations are shutting down a lot of generative AI projects. They just are. They don't trust the current user license agreements of a lot of these generative AI providers. So for us, when we look at generative AI capabilities, we cannot send data to a third-party provider; sometimes we cannot even send metadata to third-party providers, like the customer attributes themselves, which may be specific to that customer. Our approach, then — let's say somebody is doing data discovery and asking, what are the data sets I can use to calculate customer churn? — is that we take that natural language query and convert it into a graph query for our catalog, which is where we store all of the metadata, to find where the customer churn data sets are. That's the only use of the large language model: it takes the query and converts it into a graph query the catalog can understand. Then the graph query is run against the customer's catalog. No data or metadata is exchanged as part of this process. And we are doing the same thing for discovery, metadata exploration — all of the different Claire GPT modules. So yes, all of this is top of mind for us. We don't want to store the prompts; we don't want to learn from any of the metadata or data that customers provide. That's really why customers will choose Informatica over, you know, startups who may do otherwise.
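
Here is a minimal sketch of the flow just described, under stated assumptions: the language model sees only the natural-language question and emits a graph query; the query then executes against the customer's own catalog, so no data or customer metadata ever reaches the model. The function names, the `llm`/`catalog`/`user` objects, and the query syntax are all hypothetical.

```python
def nl_to_graph_query(question: str, llm) -> str:
    """The LLM translates natural language into a catalog graph query.
    Only the question text is sent; no catalog contents are included."""
    prompt = f"Translate this question into a metadata catalog graph query:\n{question}"
    return llm(prompt)  # e.g. "MATCH (d:Dataset)-[:TAGGED]->(:Tag {name: 'churn'}) RETURN d"

def answer(question: str, user, llm, catalog):
    """Run the generated query inside the customer's environment and
    filter results by the asking user's permissions before returning them."""
    query = nl_to_graph_query(question, llm)
    results = catalog.run(query)  # executes locally against the metadata graph
    return [r for r in results if user.can_read(r)]
```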

Michael Fauscette

Yeah, I mean, obviously for the industries that you mentioned especially, that's going to be a key differentiator, and a very important one, as both a privacy and a security issue. So that makes sense. So, a couple of months ago I did an AI adoption survey study that we published, and the number one challenge in preparing data for AI was listed as data quality. We talked a little bit about that earlier, but I want to dig into it a bit more in the context of master data management solutions and data quality solutions. How does Claire's ability to analyze data patterns assist in identifying those quality issues, and then in making sure that the data is accurate and consistent?

Gaurav Pathak

Sure. Let me start with the generative AI pattern and then go into more detail. We looked at what generative AI is good for today, and maybe how it will change within the next six months — because predicting any timeframe beyond six months is becoming more and more challenging nowadays, with new things arriving in generative AI at a very fast pace. Generative AI today is good at summarization, at code generation, at doing tasks rather than whole jobs, and there are limited reasoning capabilities that we are starting to see emerge. What is it bad at? It is bad at planning; bad at consistently doing the same task the same way it did before; bad at core mathematics; and also bad at PR, right? While it is bad at all these things, people are convinced it will end the world — so, definitely bad at PR. We want to use generative AI — and organizations want to use generative AI — so that it complements all the different data assets they have gathered. For most organizations, generative AI is top of mind, and you can't do generative AI without good data to go with it. That includes the quality problem you described. If I'm getting a lot of sensor data, but a particular thing is going wrong, because of which one particular dimension is not coming through — it's all null — then if you try to train the AI with that sensor data, it will not work well. That is understood. But at the same time, there are also data privacy challenges. These large language models are basically compressed data, in a format that allows you to do reasoning. But once a large language model knows a certain fact, it's very, very difficult to make it forget it. So if sensitive data goes into a large language model — Gaurav Pathak's salary, or something else — it's hard for the model to forget it, and somebody asking that question will get to know that fact. So making sure your data does not contain that sensitive information — not only your own, but your customers' sensitive data — has become top of mind for organizations as well. And then there are all the governance-related capabilities: the training data should be bias-free; we need to make sure all the training data has lineage established — who is the owner, where did this data come from, who is responsible for the freshness of this data, and so on. All of these are key facets we're looking at from Informatica. Data quality has been a mainstay of what we have done for the last 20 years, but we've seen a revival of interest in data quality over the last year and a half. People want to create data quality rules — some by hand, some wanting AI to create them — and we support both modalities. On data privacy, we acquired a company called Privitar; we announced it maybe a month and a half ago, actually. We are now looking at integrating some of those capabilities I was talking about: removing sensitive data from unstructured text and being able to create policies for access management will be key. And then data governance as well. So all of this will take center stage in the generative AI world, Mike, and I think it will get more and more interesting over the next few months.
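
As a toy illustration of one capability mentioned above — scrubbing sensitive values out of unstructured text before it can reach a model that cannot easily "forget" them — here is a regex-based redaction sketch. The patterns are simplistic placeholders and are not Privitar's actual techniques.

```python
import re

# Illustrative patterns for a few kinds of sensitive values.
REDACTIONS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}

def redact(text: str) -> str:
    """Replace each sensitive match with a labeled placeholder."""
    for label, pattern in REDACTIONS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com, SSN 123-45-6789, salary $185,000."))
# -> "Contact [EMAIL], SSN [SSN], salary [MONEY]."
```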

Michael Fauscette

Yeah, I think it's interesting, because there was certainly a time when we spent a lot of time talking about master data management and data quality, but then it seems like over the years we stopped focusing on it. Now, with the explosion of generative AI solutions, it really does highlight the importance of ensuring that the quality is there, but also of ensuring that the right things are in the right place in your business, and that things don't cross over and end up in the wrong place. It's extremely important from a governance perspective. So that's great. Now, you talked about one acquisition you've just done, and that makes me think about where this is going. What do you think the future holds for Claire — new features, integrations, maybe? And as a second question, what do you think is happening from an AI and machine learning perspective that could add a lot of new capabilities, or enhance some of the current capabilities in the offering, over the next couple of years?

Gaurav Pathak

Sure. Our vision for Claire GPT is for it to become the de facto interface from which all data and data management tasks start — whether I'm asking a data exploration question, like what was my product's revenue in quarter three, or creating data pipelines or data quality rules. We want it to be the front-and-center interface for Intelligent Data Management Cloud, allowing you to create a first draft of your data management artifacts and then go into the individual tools to perfect it, tinker with it. We want it to have runtime optimization capabilities: being able to understand the different decisions to optimize when you are creating a data management artifact. You can create a data pipeline in a way that gets you the best performance, but it may come at a very high cost, because you are optimizing performance at that scale. Or you may want to optimize for cost: you're okay if it runs the entire weekend and on Monday you get a good report, so it should come at low cost. Being able to take those kinds of human preferences and generate the data management artifacts accordingly. And then, of course, expanding on each of the different capabilities we have in Intelligent Data Management Cloud, making Claire GPT the de facto assistant — the Jarvis to the Iron Man — where you're able to ask it to do things, it does a lot of what you want on its own, and then you provide the right guidance to finish the task. Take master data management: in 2020 we acquired another company, called GreenBay Technologies, out of the University of Wisconsin-Madison. It provided us AI-based entity matching capabilities, which allow us to figure out, in really large datasets, which two entities are the same — de-duplication and things like that. We have since integrated that technology into Intelligent Data Management Cloud, where it's already available. But then, from Claire GPT, being able to say: this is the data set, go ahead and remove all duplicates from it — and making it as easy as that. So that's where we want to take Claire GPT: being the data assistant that comprehensively addresses all the different data management capabilities we have in Intelligent Data Management Cloud, and being the front-and-center place for starting data management tasks as well.
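
A toy sketch of the entity-matching / de-duplication task, assuming nothing about the acquired technology: score candidate record pairs by field similarity and drop those above a threshold. Real matchers of this kind are learned and scale very differently; this only shows the shape of the problem.

```python
from difflib import SequenceMatcher

def similarity(a: dict, b: dict) -> float:
    """Average string similarity across the fields two records share."""
    fields = a.keys() & b.keys()
    scores = [SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
              for f in fields]
    return sum(scores) / len(scores) if scores else 0.0

def dedupe(records: list, threshold: float = 0.85) -> list:
    """Keep a record only if it is not a near-duplicate of one already kept."""
    kept = []
    for rec in records:
        if all(similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept

print(dedupe([
    {"name": "Acme Corp.", "city": "Boston"},
    {"name": "ACME Corp", "city": "Boston"},      # dropped as a duplicate
    {"name": "Globex", "city": "Springfield"},
]))
```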

Michael Fauscette

Sure, that's very interesting. And certainly, as I've learned about different offerings from different companies, this idea of transitioning to a natural language interface, or user experience, is a powerful one, and I think an exciting one. If you're an old-school Star Trek fan, you know the one where they go back in time to Earth, and Scotty is trying to talk to the computer through its mouse, and he's very frustrated. It seems like we're finally going to get to the point where we can talk to the computers and have them respond and do things for us, which is pretty amazing.

Gaurav Pathak

It's evident to me that Data, the android, was the personification of all of this: data that understood and was able to take the right decisions. So yes, old-school Star Trek, for sure.

Michael Fauscette

Exciting stuff. Well, that's all the time we have today. First of all, I really want to thank you, Gaurav, for joining me today and helping us understand a bit more about data, data quality, and data management — pretty exciting times there. But before I let you go, one of the things I like to ask everybody who comes on the show: could you recommend somebody — a thought leader, an author, a mentor — who influenced your career and who you think the audience should be exposed to and could learn from?

Gaurav Pathak

Oh, absolutely. Since we're talking about AI, I'll mention a few folks in this area who are doing a really good job. As we were talking about earlier, there are a lot of naysayers about AI, and one of the voices of reason I find in all of this noise is Yann LeCun. He is the head of AI research at Meta, and he's responsible for some of the open-source large language models like Llama 2 and others. I think he has a very good, balanced take on what AI can do for us, and on the dangers of AI as well. Along with him, there are a lot of people in the open-source world doing a really great job of making AI available to all of us. One of them is the hacker Georgi Gerganov, who has this tool called llama.cpp, which basically quantizes these large language models and allows you to run them on your MacBook. He has done a great job of making this possible. And then, on the AI applications side, there's Andrej Karpathy — you know, ex-Tesla. He's done a lot of AI applications and a lot of AI teaching as well. These three people, and the open-source community supporting all the work they have done — I think we have them to thank for a lot of the good things we've seen.

Michael Fauscette

Yeah, that's great. Thank you, I really appreciate it. And I just want to thank you again for joining — it's been a pleasure and very educational. Thank you very much.

Gaurav Pathak

Thank you, Mike. Thanks for having me.

Michael Fauscette

And that's the show for this week. Thank you all for joining, and remember to hit that subscribe button. For more on AI and other software research, reports, and posts, check out arionresearch.com/blog and /research-reports. And don't forget to join us next week. I'm Michael Fauscette, and this is the Disambiguation podcast.