July 01, 2024

Apache Iceberg as the new S3 and Data Analytics with Starburst | Episode #86

Justin Borgman, Co-Founder and CEO of Starburst, explores the cutting-edge world of data management and analytics. Justin shares insights into Starburst's innovative use of Trino and Apache Iceberg, revolutionizing data warehousing and analytics. Learn about the company's journey, the evolution of data lakes, and the role of data science in modern enterprises.


Show Notes
Transcript

Episode Overview: In Episode 86 of Great Things with Great Tech, Anthony Spiteri chats with Justin Borgman, Co-Founder and CEO of Starburst. This episode dives into the transformative world of data management and analytics, exploring how Starburst leverages cutting-edge technologies like Trino and Apache Iceberg to revolutionize data warehousing. Justin shares his journey from founding Hadapt to leading Starburst, the evolution of data lakes, and the critical role of data science in today's tech landscape.

Key Topics Discussed:

Starburst’s Origins and Vision: Justin discusses the founding of Starburst and the vision to democratize data access and eliminate data silos.

Trino and Iceberg: The importance of Trino as a SQL query engine and Iceberg as an open table format in modern data management.

Data Democratization: How Starburst enables organizations to perform high-performance analytics on data stored anywhere, avoiding vendor lock-in.

Data Science Evolution: Insights into what it takes to become a data scientist today, emphasizing continuous learning and adaptability.

Future of Data Management: The shift towards data and AI operating systems, and Starburst’s role in shaping this future.

Technology and Technology Partners Mentioned: Starburst, Trino, Apache Iceberg, Teradata, Hadoop, SQL, S3, Azure, Google Cloud Storage, Kafka, Dell, Data Lakehouse, AI, Machine Learning, Big Data, Data Governance, Data Ingestion, Data Management, Capacity Management, Data Security, Compliance, Open Source

☑️ Web: https://www.starburst.io
☑️ Support the Channel: https://ko-fi.com/gtwgt
☑️ Be on #GTwGT: Contact via Twitter @GTwGTPodcast or visit https://www.gtwgt.com
☑️ Subscribe to YouTube: https://www.youtube.com/@GTwGTPodcast?sub_confirmation=1

Check out the full episode on our platforms:
YouTube: https://youtu.be/kmB_pjGb5Js
Spotify: https://open.spotify.com/episode/2l9aZpvwhWcdmL0lErpUHC?si=x3YOQw_4Sp-vtdjyroMk3Q
Apple Podcasts: https://podcasts.apple.com/us/podcast/darknet-diaries-with-jack-rhysider-episode-83/id1519439787?i=1000654665731

Follow Us:
Website: https://gtwgt.com
Twitter: https://twitter.com/GTwGTPodcast
Instagram: https://instagram.com/GTwGTPodcast

☑️ Music: https://www.bensound.com

RAW

A lot of people would know what Starburst lollies are, right? That juicy burst of flavor that packs a punch. Also, many of us have run an SQL star query, pulling in everything, hoping to make sense of an overwhelming amount of data. But as data continues to grow and explode in volume, orgs need something more powerful, something that brings all those diverse data sets together seamlessly. Enter Starburst. Today's guest on episode 86 of Great Things with Great Tech is Justin Borgman, co-founder and CEO of Starburst. Starburst leverages technology like Trino

and Apache Iceberg to offer high-performance data warehousing and analytics without the need for traditional data warehouses, enabling organizations to derive actionable insights from their data more effectively. This is Great Things with Great Tech, the podcast highlighting companies doing great things with great technology. Don't forget to head to YouTube at @GTwGTPodcast, or on all good podcasting platforms, subscribe and follow to make sure you never miss an episode. And now, episode 86 with Justin, the data management and analytics

company. All right, hey Justin, welcome to the show. First and foremost, I'm really interested in your background. Obviously we're going to get into Starburst and everything, you know, big data and data science. I think it's such a critical part of IT and enterprise today; data is getting more and more critical. But I'm really interested in hearing about your own background and how you came to co-found and be the CEO of Starburst. Yeah, sure. Great to be here, Anthony, thanks for having me. My journey in

big data really goes back probably almost exactly 15 years now, to the founding of my first company, which was called Hadapt, and that was back in 2010. That was a SQL engine for Hadoop. So this is the early days of Hadoop. You know, Hadoop was really the first data lake, and what we were trying to do at the time was really turn this data lake into a data warehouse. And so Hadapt was this SQL engine that would take SQL queries and process them on data living in Hadoop. That was a spin-out from Yale

University. We raised venture capital, built that business over a few years, and then that was acquired by Teradata, and I ended up spending a few years at Teradata. My job at Teradata was kind of interesting, because I was sort of responsible for figuring out the future of data warehousing analytics, and in that context discovered another SQL engine, at Facebook actually. There was a very early open source project, originally called Presto; today it's known as Trino. And essentially I got super excited about the potential of

this query engine. It allows you to run really fast SQL analytics as well as query data in other data sources as well. And so we started investing in that very heavily while I was at Teradata. My team became some of the leading contributors to the open source project, and that led us to ultimately leave Teradata and form Starburst as the company around that. And the creators of the project left Facebook and joined us and became my co-founders. That's awesome. So yeah, obviously, you know, data and this data

science has been a part of your world for the last 15 years, but I'm interested, how did you get to that point? Like, what led you to be really interested? We'll talk about Hadoop a little bit later on, because I touched that a little bit in a very different way back in the day as well. But yeah, what led you to have this passion for, you know, this sort of big data, data lakes and the warehouses? Yeah, I mean, to me it's very clear that data is, you know, some people say the new oil. I would even

go further to say it's sort of the oxygen of businesses today. That's good, I haven't heard that. That's a good one. Yeah, I mean, I think we're all data-driven businesses at this point. You know, we use our own technology to run our own business more efficiently and more effectively. There are just so many sources of information that we can now leverage to make better decisions, you know, tailor and personalize our services to customers, and so forth. And so to me this is going to be a

multi-decade area of relevance and investment and innovation, and that was really exciting to me. And to be able to play the role of disruptor in a space like that is also really fun and exciting, because really databases have largely played the same playbook for maybe 40 years, which is, you know, get all the data into the database. They're all proprietary historically, you know, Oracle, Teradata, etc. And then you're sort of locked in, and, you know, the bills go up, and the only way you can analyze the

data is to get it back into this central database or data warehouse. And so I think that model's changing, and that's very exciting to me, to play a role in turning that model inside out, which we'll certainly get into. Yeah, and that's interesting, because I guess what you're talking about, when I think about databases, they are very much, you know, in the old days MSSQL, Oracle, you know, PHP and all those modern ones. But really it was all about, yeah, you're

locked in. You choose a database and then that's what you lock yourself into almost forever, right? Like, it becomes, you know, where the data lives. And I think today we've seen data just explode across all sorts of different platforms, different mediums. And it's actually the storage of today. I've used this a couple of times on the show: I had someone on a few years ago that said to me, you know, these DevOps guys, when they are growing up and learning their trade, when you ask them what storage

is, they'll say it's a database, or even a table level, right? They don't think about storage or the underlying layers. So it's super critical there. What was your background in sort of education and whatnot to get you into this sort of area? So I did major in computer science in undergrad. I became a software engineer; I did that for about six years. Then I worked for a small startup in a product management role, and then decided to go to business school.

And the logic there was just, I wanted to be able to have a bigger impact. I wanted to sort of understand how, you know, the little work that I was doing as an early software engineer in my career, how that could impact, you know, sort of the bigger picture. And so, you know, got my MBA, and it was really while I was in business school that I guess the inner nerd in me decided to go see what was going on in the computer science department. And that's really where I met my co-founders of that first

business. They were researchers who had written a paper called HadoopDB, which was basically the pioneering thoughts around treating Hadoop as a database or data warehouse, and that ultimately began my journey in entrepreneurship and, you know, bringing that technology to market. Yeah, and I think you mentioned data lake and data warehouse, so just explain the difference there. So database, data lake, data warehouse; I think that's important to level-set when we talk more about what Starburst

does. Yeah, absolutely. So, you know, data warehouse generally refers to a read-optimized, so sort of designed for analytics, not necessarily for transactions but for analytics, central warehouse with a central database, essentially. So, you know, Teradata is sort of the classic example; Oracle's Exadata product would be another good example. But, you know, all the big players basically had a traditional data warehouse, and the idea was, your business starts to develop these data silos; wouldn't it

be great to pull all the data together into one place to understand your business more holistically, to be able to, you know, market to your customers, pulling in your marketing data and your billing data and that sort of thing? And so that was really the model that grew up in the 80s and 90s, I would say, primarily. But in the 2000s the scale of data started to make that a lot more challenging, particularly with the internet companies, if you think about Google and Yahoo and companies like that

that simply just could not fit the internet into one single database. And so they needed a different approach, and that's really where the data lake comes to be, with the creation of Hadoop. Hadoop was the first data lake. It was of course built at Yahoo by reading a paper that actually Google had written, so they both had a very similar approach here. And, you know, it became a way of storing data in really cheap commodity storage that could scale out to basically infinity, so no scalability barrier, and then a

separate sort of processing layer for how you would interact with that. And in the earliest days it was something called MapReduce, which was super slow: crunch infinite amounts of data, just very slowly; it would take hours. And then we've gradually evolved since then to, you know, fast forward for a moment to today, where you can really get data warehouse performance out of your data lake, and that's what's so exciting now, you know, 15 years later. I get it, yeah. You know, I had an early interaction with Hadoop, but it wasn't

because I was looking to do anything specific with the data or whatever it was. But as a technologist, I remember VMware released their own Hadoop cluster technology. So they had the download, you'd install it, it would create a couple of nodes, and it would basically install the Hadoop cluster, and you'd be able to do some MapReduce stuff for something, right? And so that's about where I started, and I've got to say that was maybe early 2010s, potentially maybe even late 2000s.
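For readers who have not met the MapReduce model mentioned above, here is a minimal sketch of the idea in Python. It is a single-process toy, not Hadoop code, and the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration. Real Hadoop spreads the map and reduce steps across many machines and writes intermediate results to disk between them, which is roughly why jobs could take hours.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three steps in order on an iterable of text lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["data lake data warehouse", "data lake house"]))
```

The point of the split is that each map call and each reduce call only sees its own slice of the data, so the work can be distributed without the pieces needing to coordinate.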

But yeah, so it was around that time when it started, and I guess I didn't realize, you know, what I've realized now through the research and doing this, and now thinking about other companies in this similar space to you: this is the evolution of that data lake, and that's what they really talk about. But then you've got to be able to take that data, and it's not necessarily going to be used in an efficient way for businesses to get the most out of it, right? And that's where

you guys come in, and that's where the query engine, the SQL analytics, comes over the top of it. You're able to make the best of both worlds and surface that data in a really quick and efficient way. That's right, that's exactly right. Awesome. So, you know, talk a little bit about Starburst, the founding of it. I'm interested in the name: where'd the name come from? I'm thinking, you know, of the fruits, but I don't know if it's going to have a link to the Starburst fruits. But yeah, how'd that come to be?

Yeah, yeah, it's a common question. Obviously, yeah, we're not the candy company, but we're very popular, you know, around Halloween and things like that. But no, yeah, so the name comes from really two things. First of all, naming a company is a hard process for anybody who's going out there and trying to do it. You've got to come up with something that, you know, will be memorable and also is not taken already by somebody else, so that's always a challenge. But for us, I think at

the core of what we do, we're able to get access to all of your data, and we do it very quickly. And so the reason star and burst were meaningful to us in that regard is, in the SQL language, SELECT star means give me everything. And so we like the symbolism of star as a sort of signifier of the reach and the breadth of data that we can touch by virtue of having these connectors to all these different systems. And then burst because we do it faster, basically, than anybody else. And so we can get all of

your data very quickly, was sort of the motivation. And we thought, you know, hey, it sounds good. Some people think about the candy, which generally has positive, you know, connotations to it. So yeah, it was a lot of fun. No, that's awesome, because obviously, you know, I've run a few SQL queries in my time, and typically I will use the star, because that's all I'm good at, you know, just getting everything and then working it out through there from a table. So yeah, it's actually almost,

I wouldn't say it's, like, disappointingly logical, but in a good way, you know what I mean? Like, it just makes sense that you've named it that. So no, it's a good name and it fits. So tell me about the, you know, what's the value prop, and what would the mission statement be at Starburst? At Starburst, yeah, so it's really that we provide data warehousing style analytics on data that lives anywhere. And what that means in practice is really two

things. Number one, and one of the reasons people think of us, is we're pretty unique in our ability to federate queries to multiple data sources. So you might have data in a MySQL database and you might have data sitting in S3 in Amazon, and you might want to join those two tables for your analytics, and we can actually do that ad hoc, on demand, in a query, and do that very quickly and efficiently. So you don't even need to move data, you don't even need to run it through an ETL process, you can just

query it. And so that's one thing that makes this pretty unique. But the other thing, which actually represents probably an even bigger share of our business, is that because we can query anything, some of these open formats in the lake have become wildly popular, specifically Iceberg. Iceberg is an open format that was originally created at Netflix. It was created to pair with actually our open source query engine, so you could use our query engine to query Iceberg.
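The federated join described above (a table in an operational database joined with a table in the lake, in one query, with no copy or ETL step first) can be illustrated with a small analogy. This sketch does not use Trino or Starburst. It uses Python's built-in sqlite3 module, with two attached in-memory databases standing in for the two separate systems. The database, table and column names (billing, lake, customers, events, package, show_name) are hypothetical.

```python
import sqlite3


def build_demo_connection():
    """Open one connection with two independent in-memory databases attached."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates its own database; think "MySQL" and "S3" stand-ins.
    conn.execute("ATTACH DATABASE ':memory:' AS billing")
    conn.execute("ATTACH DATABASE ':memory:' AS lake")
    conn.execute(
        "CREATE TABLE billing.customers (id INTEGER PRIMARY KEY, package TEXT)"
    )
    conn.execute("CREATE TABLE lake.events (customer_id INTEGER, show_name TEXT)")
    conn.executemany(
        "INSERT INTO billing.customers VALUES (?, ?)",
        [(1, "basic"), (2, "premium")],
    )
    conn.executemany(
        "INSERT INTO lake.events VALUES (?, ?)",
        [(1, "news"), (2, "sports"), (2, "movies")],
    )
    conn.commit()
    return conn


def views_per_package(conn):
    """Join across both databases in a single query; nothing is copied first."""
    query = """
        SELECT c.package, COUNT(*) AS views
        FROM lake.events AS e
        JOIN billing.customers AS c ON c.id = e.customer_id
        GROUP BY c.package
        ORDER BY c.package
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(views_per_package(build_demo_connection()))
```

In a real federated engine the two sides would be remote systems reached through connectors, and the tables would typically be addressed by a qualified name that identifies the source, but the shape of the query is the same: one SQL statement, two sources.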

These were sort of like the two parts to a data warehouse: there's the query engine part and the storage engine part. And what's nice is today these are now decoupled, so you can actually mix and match those components, but it's a very natural combination that we see in the open source community all the time. And it's a great way of storing data at tremendous scale for very low cost, I mean almost nothing, because really, if you're thinking about AWS for

a moment, like, you're just paying for S3, which is, yeah, which is pennies basically. And you can basically develop the lowest cost, you know, high-performance data warehouse system, or data lakehouse as some people call it now today, by leveraging Iceberg as your storage and Starburst as the query engine. So those tend to be the two angles where customers bring us in: either I want to build, you know, this scalable, low-cost data lakehouse, or I want to be able to maybe also join data with data that

lives elsewhere as well. Yeah, so the problem of data sprawl and siloed data sources effectively is what you're trying to solve here, because if you've got traditional platforms, wherever they may be, modern storage platforms somewhere else, you don't want sort of costing, well, the moving of them is expensive, right, in many ways: time, effort, network bandwidth, whatever it might be. So you want to basically have a layer on top of, if I'm trying to simplify this as much as

possible, you've got a layer on top of that which basically can go in and say, okay, well, I know how to talk to all these multiple data sources with my own tech, with the engine and the query engine, but also, if I need to, I've got this Iceberg technology as a storage platform as well, which is an open table format, right, which Netflix invented. So when you've got data sitting in, say, S3 or an MSSQL database, where does Iceberg then fit into that? Yeah, so

the way we think about the evolution, if you will, of the data landscape is, we think every customer is going to have a data lake. They're going to have other databases too, but they're definitely going to have a data lake, and the reason simply being that the economics are going to make it highly advantageous for you to have a data lake for a large portion of your data. And so that data lake will likely store data in Iceberg in the future, and we would recommend that you do that. And one of the reasons being that when you store it in

Iceberg, you as the customer, as the user, you own the data. We don't own it, Databricks doesn't own it, Snowflake doesn't own it, you know, Google or AWS doesn't own it; like, you truly own it. It's an open source format, but you can now play all of us actually against each other to compete for the compute. And, you know, if we're competing for compute, like, you win as the customer, right? Anytime there's competition, that's good for you as a customer. And that's the beauty, really one of the biggest

benefits, I think, of using Iceberg for a customer: you now have optionality across the different engines that you choose. And so we want to compete for that. We're the youngest and hungriest of the three companies, Databricks, Snowflake and Starburst, so we're happy to win on price and compete for that Iceberg SQL analytics. Now, to answer your question, what makes us also unique is that data living in MySQL, we can connect to that as well, or that SQL Server database, we can connect to that

as well, and you can now incorporate that in your analytics without having to move it. And so the world we see is one that will become lake-centric, with other purpose-built databases, you know, outside of that lake that you want to join in as needed for the types of analytics or reporting or workloads that you're doing. Okay, and how do your products, you've got Starburst Galaxy and Starburst Enterprise, so where do they fit, how are they sort of packaged up, and what componentry do they have, you know,

in terms of what we've been talking about with the Iceberg and the analysis layer and all that kind of stuff? Yeah, so Galaxy is our cloud-hosted solution, which means that we manage all the infrastructure, we make it as easy to use as possible, and it's really intended to be a complete turnkey end-to-end, you know, we call it an Icehouse. It's an Iceberg warehouse, basically, where we can ingest the data into Iceberg tables. You can connect to, like, let's say a Kafka stream, or load data into an

Iceberg table. We'll do the data maintenance; there's something called compaction that you want to do with Iceberg tables, where you want to take lots of little tables and bring them together, and you get performance advantages out of doing that. Compaction, you know, that's built in. So we kind of take care of all the plumbing and the heavy lifting, I guess you could say, under the covers to deliver a really easy-to-use end-to-end solution. And so that's Galaxy, and there's a free trial you can check out if you'd like.
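To make the compaction idea above a little more concrete, here is a toy planning sketch in Python. It assumes the "little" units being merged are small data files, which is how compaction is usually described for table formats like Iceberg, and it only plans the groupings; it does not read or rewrite any data. The function name plan_compaction and the 128 MB default target are assumptions for illustration, not Starburst's or Iceberg's actual algorithm or settings.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files so each group stays at or under target_mb.

    Returns a list of groups; each group would be rewritten as one larger file.
    A single file bigger than target_mb simply ends up in a group by itself.
    """
    groups = []
    current, current_size = [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    small_files = [10, 20, 30, 40, 50, 60]
    plan = plan_compaction(small_files, target_mb=100)
    print(f"{len(small_files)} files -> {len(plan)} files: {plan}")
```

Fewer, larger files mean fewer objects to open and less metadata to track per query, which is where the performance advantage mentioned in the conversation comes from.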

And then Enterprise is self-managed, which means you're basically just getting the software from us and you can run it anywhere. You can actually run it on-prem, you can work across hybrid environments, so a lot of flexibility with that platform. It can work with and integrate with whatever, you know, security, you know, Kerberos, LDAP, you name it. So it's kind of like, do you want to buy the components to the car and build your custom car? You'd use Enterprise. Or do you want to

build something that just works? Maybe it's a Tesla, you know, you just hit the gas and it works. Yeah, exactly. Actually, funnily enough, I tried to start a Tesla a couple of weeks ago. I had to Google it, and I was quite embarrassed that there was no start button or anything like that that I was looking for. As a side story, it was just press the button and go, and to your point, like, it just kind of worked, right? Didn't have to do anything. Talk a little bit about Trino, because I think that's an

important thing. It's an open source project which you guys contribute to; in fact, I think you're probably the leading contributor, right? So talk about that and how that fits into this as well. Yeah, so my co-founders Martin, Dain and David are the creators of Trino. Again, its origin story was Facebook many, many years ago, a little over a decade ago now. But over the years, over that decade period, the community has grown so large, it's now, you know, tens of thousands of users and participants in that

community, and it includes some of the largest, most sophisticated data-driven organizations in the world, like Apple and LinkedIn and Lyft and Netflix, which, you know, we mentioned earlier. So it's got a really wide, I'll say, contributor base from some of these incredibly sophisticated companies, you know, Bloomberg being another one. And the beauty of that, the beauty of leveraging an open source engine, is not only do you get the benefits of all these contributions, for example the ride

sharing companies, most of whom use Trino, created the geospatial functions. Like, actually, we didn't create those functions; they did, because they needed it for their business. And so now, you know, Grab taxi in Singapore and, you know, Ola Cabs in India, they use Trino as a result, because it's got all these geospatial functions. So you get that benefit of the community contribution, but you also know, back to that idea of optionality, you could always stop using Starburst and just run with the

open source. So that vendor lock-in piece was really a big part of why we founded the company. We wanted to give customers the choice, and that just, you know, raises the bar for us to continue to deliver value above and beyond the open source to hopefully keep you as a customer. But you always have that option, and I think that's very powerful as well. Yeah, because in a certain way you're looking to kind of democratize the data and make sure that there's an element of it not being locked in,

right? Yep. Data freedom is something that I know that we talk about a lot at Veeam, from a different angle, but it's important, because, you know, to the point we made early around the original sort of database structures and the way they worked, where they kept all the data pretty much locked away, this is almost an opposite to that, and what you're adding is those layers on top to really make the most out of the data, right? Like, that's what I see. So in terms of the use cases, what are some of

your typical, say if a company comes to you and they've got a bunch of data sitting across marketing orgs, you know, engineering, whatever it might be, and they come to you and say, hey, how can you help us, how can Starburst help us make our data valuable to us, what would be your sort of value prop to them? Yeah, so very often, you know, an early use case might involve a large set of data in a data lake. Again, maybe it's S3 or Azure Data Lake Storage or Google Cloud Storage or object storage on-prem. We have a partnership

with Dell now where you can buy a Dell lakehouse, which basically has their object storage with Starburst in it. So wherever your data may be, it very often starts with a large lake, as well as perhaps joining data in some other database. So I'll give you a few examples. One of the easiest to explain is Comcast, which is a large cable TV provider and internet provider in the United States. And when they first started working with us, they had basically all the TV show data: every time you change the

remote control and switch to a different show, that gets captured in the cable box and sent back to this massive data lake. And so they have all this event data, basically, every time you change the channel, which is really interesting to understand customer behavior. But in a totally different database or data warehouse they had the billing data: how long you've been a customer, how much you spend every month, what package you subscribe to. And they wanted to join these to understand how customer viewing behavior, the shows that you watch,

influences the customer lifetime value and the packages that you purchase, and really enable cross-sell and upsell based on the shows that you watch. And so this was a really big deal. This was like five or six years ago that we first started working with them, and it has been able to allow them to generate a lot of incremental revenue by positioning the right packages, the right products, the right, you know, additional channels you could buy based on what you watch. Does that make sense? It does, and actually, to be fair, I would have

thought, if I was looking at that from the outside, well, there's going to be a bit of AI or ML in that. Just to bring that up, we haven't talked about that; it's amazing to have a conversation today and not talk about that until now. But is that part of it, or are we doing that without that because of the technology you guys have got, or is there an element of that in there? Because whenever I think about the way that my grocery company that I visit every Wednesday

tells me every Wednesday morning what to go buy based on my previous purchases, I think that it's based on some sort of AI and ML. But it doesn't have to be, from what I'm hearing. Yeah, yeah, exactly. I mean, I think of AI and ML as almost a spectrum of sophistication. You know, some are just very algorithmically driven; you can do that with Starburst. And then you've got, you know, like, LLMs and ChatGPT and generative AI on the other end of the spectrum. We think we play a role across that spectrum by bringing

the data to the surface for the, you know, the company that wants to leverage it. And in some cases, yes, it's very algorithmic and sort of rules-driven, and we can handle that entirely. In other cases you may need to pull the right data sets together, do some transformation to prepare it, and then run, you know, an ML model on it, or leverage it in a model. And, you know, at the core, your models are really only as good as the data you train them on, and that's the role that we think we play here, is sort

of the plumbing to get you the data that you need, in the form that you need, to go do those sophisticated analyses. That's a really good way of putting it, actually. So there's a value prop there that's separate to what we're seeing out there in the hype world. You guys are surfacing the data, bringing it to the surface, and allowing the LLMs or whatever it might be to actually do their thing, separate to what you guys are doing, which is, you know, bringing together, the meshing

together of the data and making it valuable at some point. That's right, and I think, you know, that seems to be exactly where we are in this AI journey: customers wanting to leverage their own data to create proprietary advantage with AI. You know, I view ChatGPT as more of like a POC or a demo vehicle, you know, because it's trained on the internet, everybody has access to it, it's the same for everybody. I think where you really win as a company is where you take advantage of that core technology but

start to train it on the data that you have and enrich it with the data that you have. Yeah, that makes absolute sense. So key componentry, in terms of, you know, if you look at it, data ingestion, governance, management, there's capacity management as well. How do those play into what you guys do, especially the capacity management? I'm interested in that, because that's all about scalable compute and auto scaling typically, but how does Starburst fit into that world as well? Yeah, absolutely. So at the core

of of what makes uh that elasticity possible is this uh storage compute separation that we touched on you know earlier uh again traditional databases traditional data warehouses were storage and compute combined you know if you bought a Teradata data warehouse it had compute it had storage it was all in one box uh but as you know these early data lakes were created as we were speaking about by the internet giants uh this notion of separating these two uh became very important and in the cloud uh that really made this this mainstream and

what that allows you to do is again we'll just use Iceberg as the example here uh store all of your data in Iceberg and then your your Starburst cluster your Trino cluster uh you could have that scale down to zero you could scale it up when you need it you could scale it up based on the the CPU processing and and how much um uh load it's it's putting on on your servers like oh okay wow I'm getting all these queries coming in now all of a sudden maybe I should start to make it bigger to to improve my performance so you can really

be very refined in uh scaling that to what you need and then ultimately only paying for what you use that's the other beauty of of cloud computing environments is they're pretty much utility models we price the same way so you know you're only paying for what what you use um now I will say the the the contrast to that is interestingly in our on-prem business and again I mentioned Dell is a great partner of ours there are some customers that want to basically uh prepay for all of that capacity and amortize that over you know

three or four years traditional way of doing things yeah exactly and so both both are viable depending on what you're trying to solve for all right good stuff and the governance piece I think that's interesting as well obviously data you know can be dangerous I mean I love the oxygen analogy I might steal that um later on I've actually I've actually used the analogy of data being like uranium like being a little bit dangerous you know in the wrong hands and all that kind of stuff so that was

my thing but yeah the oxygen thing totally works as well it's it's really cool but from a governance perspective so data has the has gravity we all we all know that so how does Starburst help in the governance side of things yeah so you know our our vision is to really play this central point of access to all of your data so that would mean the queries come into us we access the data wherever it lives you know maybe 80% is in a data lake in Iceberg and maybe 20% is spread out among 10 other databases you know the the the

goal is that the end user shouldn't actually care where the data is stored they're just joining table a and table B and uh in order to make that work and especially work within regulated enterprises and you know financial services and healthcare are two of our biggest verticals uh access controls become really key and specifically what we call attribute-based access controls so very fine grained you know row level column level data masking being able to tag data this is PII data this is you know HIPAA data this is uh you know GDPR

data what what whatever the case may be and controlling that access in that single point so it it's sort of a single point of access and access control across potentially distributed data and and and that's really important so that's been built into our product really from the beginning uh and today we we serve some of the biggest banks in the world yeah and more from the point of analytics right and and querying as opposed to I've had other sort of global file system companies on this show um

actually had one a couple of weeks ago or a couple episodes ago it's that's not what you're talking about here right you're talking about you know the ability to have that the overview for the analytics and analysis part exactly at the query level basically for the the end user who's trying to access that that's right cool hey I just wanted to touch on data science as a as as a general thing right like um obviously this is a a different area of it you know traditionally people who are in infrastructure who aren't

sort of data scientists you know probably find it a little bit hard to make this this jump across to this world so what in your opinion being the CEO and you know obviously you know involved in hiring and and people and whatnot what makes a good employee for a company like yourself what attributes they have to have from a technical point of view I think first and foremost and and this is true for probably most roles in in tech a um commitment to continuous learning like this space is evolving so quickly that even if you're an expert

today you know you won't be uh you know a couple years from now if you're not continuously staying sharp and up to date on the latest technologies and that's even more so true with you know everything going on with AI right now today um the techniques are changing the models are changing so uh so I think that's just a kind of an intrinsic curiosity and desire to learn um that that I would look for um the next thing that I would say just in terms of bringing maybe the skills to bear to be productive here uh there are you know a

few ways to to tackle this and maybe maybe you tackle it through all three um one is you know you can certainly learn uh programming languages you can learn Python you can learn Scala um uh and you know that that takes some effort and investment um you could also you know focus on SQL as a language which is actually a much easier language to learn and um and will not go away like that's one thing I'm sure of has not gone away yet exactly that's the one constant in 40 years of you know database history so

your your time is certainly well spent investing in it and you know as you mentioned in the opener like you'll learn to be at least um uh you know you'll learn enough to to you know I guess be dangerous as people say uh pretty quickly you know uh in in navigating your data uh through SQL so I think that's a worthwhile uh investment and then the good news for um for people who don't even know SQL over time you're going to see more and more uh natural language to SQL interfaces in fact we've

built this now into our product um uh I would call it sort of a beta version today but it's continuously refining which means you could express your query in English and say uh show me last month's sales data and it will turn that into an SQL query so that's really I think the last mile on democratization you know as you were speaking of yeah and actually to be fair I've actually got some of those specific companies doing that um lined up for the show over the next few few episodes right which is

really interesting it just seems to be sometimes you clump like companies together right but yeah I guess that's a really interesting way of being able to naturally prompt an LLM or something to say hey I want to get this amount of this data out of this data set I don't know SQL but do it for me um you know so I think that's a really cool way for people to come into it but obviously the data scientist role in itself that seems like a very specified role right like how how does one sort of become a data

scientist yeah um yeah I mean I think there are you know classical paths of uh you know going to college and getting degrees in statistics or um you know anything very quantitative in fact some schools now actually have data science degrees so you could actually literally major in in data science um but I think you can also learn these skills outside of that I think there are a lot of great classes uh available um to study the the basics of of you know how to be really data driven in your analysis you know learning about linear regressions and uh

you know just basic probability and statistics sort of the the language of of data if you will uh I think is accessible to you um and then uh you know try to get involved in in something uh that allows you to sort of showcase those skills would be my advice um it could even be a nonprofit or uh you know a public data set that you start to work with uh and I I think that's a great way to maybe showcase your your your abilities as well yeah it makes sense right so a lot a lot of maths basically in the short way is talking about that

which which absolutely makes sense in that in that point have only got a couple of minutes left I'm interested in you know where where is Starburst going what's the innovation for the future what's the next big thing for you guys yeah I mean for us right now I would say a big focus is doubling down on you know what we call the Icehouse this Iceberg warehouse for the simple reason that uh you know a few weeks ago uh both Databricks and Snowflake mentioned they're going to support this as well and that really gives customers

the opportunity as I said to store their data in this common open format and then choose the best engine to to query that and that's where we really want to be aggressive right now we see a lot of opportunity there a lot of customers are feeling the pain of costs or feeling locked in with their traditional cloud data warehouse or data platform and and we want to provide a more open alternative so we're going to invest in basically making Iceberg easy and and that's really where our our focus is probably

for the next couple years uh and uh and and then uh we'll go from there awesome so Iceberg is definitely the future from is is what I'm hearing here and it's almost sounding like Iceberg is is the new object storage yeah yeah exactly I mean it's it it it will be ubiquitous that's exactly that's a good analogy because even on-prem object storage is now S3 compatible right everything has the S3 interface the S3 interface seems to become the standard and I think similarly here Iceberg becomes the standard table

format that's awesome so yeah so if if you haven't looked at Iceberg you haven't heard of Iceberg now you're going to hear about it later if you if you're in it that's what I'm getting out of this so hey you know thanks for being on I think it's been a really interesting I love what Starburst are doing I love the whole journey that you've taken and you know a really sort of level set explanation about you know and how you're leveraging it for the benefit of businesses because you know

make no mistake data is just going to keep on growing and we need to be able to make it work for businesses rather than just sitting there doing nothing so really appreciate you being on the show thank you Anthony it was a pleasure awesome thanks a lot hey just as a reminder thanks for listening into episode 86 of Great Things with Great Tech please stay tuned for more episodes where we continue to highlight companies doing great things with great technologies shaping our world don't forget to follow us on

social media @GTwGTPodcast and follow and visit gtwgt.com for more great content on all past episodes if you enjoyed this episode make sure you subscribe to your favorite podcast platform and on YouTube please spread the word and if you feel like it drop a review tell your friends thanks for joining us and we'll see you next time in Great Things with Great Tech
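
Editor's sketch: the elasticity Justin describes, where compute is separate from storage and a cluster can scale down to zero and back up as CPU load and incoming queries rise, can be illustrated with a small sizing rule. This is only a minimal sketch under stated assumptions and not how Starburst or Trino actually autoscale. The function name `desired_workers`, the 50% CPU target, and the worker cap are all hypothetical.

```python
import math


def desired_workers(current_workers, cpu_utilization, queued_queries,
                    target_cpu=0.5, max_workers=20):
    """Return the worker count a hypothetical autoscaler would request.

    cpu_utilization is the average CPU load across workers, from 0.0 to 1.0.
    All names and thresholds here are illustrative assumptions.
    """
    # Idle: no CPU load and nothing queued, so scale down to zero.
    if queued_queries == 0 and cpu_utilization == 0.0:
        return 0
    # Work has arrived while scaled to zero: bring up a first worker.
    if current_workers == 0:
        return 1
    # Proportional rule: size the cluster so average CPU lands near the target.
    wanted = math.ceil(current_workers * cpu_utilization / target_cpu)
    # Keep at least one worker while there is work, and respect the cap.
    return max(1, min(max_workers, wanted))


if __name__ == "__main__":
    print(desired_workers(0, 0.0, 0))    # idle cluster stays at zero
    print(desired_workers(4, 0.75, 5))   # busy cluster grows
    print(desired_workers(4, 0.25, 0))   # lightly loaded cluster shrinks
```

Because the data stays in the table format on object storage, resizing only changes how much compute is billed. The stored data is untouched, which is what makes the pay-for-what-you-use model in the conversation workable.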

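Editor's sketch: the attribute-based access controls from the governance discussion (tagging data as PII, HIPAA, or GDPR, then enforcing row-level filtering and column masking at a single point of access) can be illustrated as follows. This is a toy model, not Starburst's policy engine or API. The `Column` class, the `apply_policy` function, and the `region` and `cleared_tags` user attributes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """A column plus the sensitivity tags attached to it (hypothetical model)."""
    name: str
    tags: set = field(default_factory=set)


def apply_policy(rows, columns, user_attrs):
    """Filter rows and mask tagged columns based on the user's attributes."""
    cleared = user_attrs.get("cleared_tags", set())
    region = user_attrs.get("region")
    result = []
    for row in rows:
        # Row-level rule: a user bound to a region only sees that region's rows.
        if region is not None and row.get("region") != region:
            continue
        visible = {}
        for col in columns:
            # Column-level rule: show the value only if the user is cleared
            # for every tag on the column; otherwise mask it.
            if col.tags <= cleared:
                visible[col.name] = row[col.name]
            else:
                visible[col.name] = "****"
        result.append(visible)
    return result


if __name__ == "__main__":
    columns = [Column("name", {"pii"}), Column("region"), Column("amount")]
    rows = [
        {"name": "Ada", "region": "EU", "amount": 10},
        {"name": "Bob", "region": "US", "amount": 20},
    ]
    # An EU analyst without PII clearance sees one row with the name masked.
    print(apply_policy(rows, columns, {"region": "EU", "cleared_tags": set()}))
```

The point of the design is that the rules live in one place and are applied to query results, so the same policy holds no matter which underlying system the rows came from.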