TL;DR: From data warehouses to data lakes, team sizes, and task distribution, data architect Nikolay Nyagolov, a top leader in the tech hub that is Bulgaria, talks to us about data science and architecture and how these fields can impact clients.
As part of business intelligence projects, we’re used to servicing product companies that provide many different products to their customers. I’m currently leading all of our Snowflake-related projects here. That includes a complex yet very exciting project for a Fortune 500 company that involves 30+ people. Snowflake is a very popular database that serves a broad range of technology areas and supports data storage, processing, and analysis. With this blog post, I want to give you some insight into what makes up our daily life as data scientists.
What data architecture is capable of doing
Whether our clients are active in public or private market deals, data scientists tailor the kind of analysis that enables better decision-making. We design processes that analyze raw data to guide, for example, where and how to invest company funds. For that, we handle more than 50 different software tools and applications.
Part of what we generally do is analyze data sources and how frequently they deliver data. For that, we determine, for instance, how many clients our customer deals with, what the average volume of data is, and more.
As soon as we have that knowledge, we select the best-suited technology for data analysis. Different technologies, tools, and designs achieve different goals. Examples include Hadoop, Snowflake, MS SQL, Redshift, and many more. We can use these for faster processing, deep learning, reduced costs, and easy scalability, among other objectives. So we work at choosing the most suitable tech to fit our clients’ needs. In doing so, we consider different factors, such as how big the data clusters need to be for us to deploy the required environment. Once we have chosen the most suitable mechanism to collect the required data, we work on moving the data. At that step, we usually resort to what we call data lakes.
What are data lakes and why are they important?
Data lakes are virtual spaces where we group data, whether for multiple organizations or a single client, for easier analysis. We constantly work with them. It’s as if all the data went into one single place: a digital lake, precisely. It’s just a lake that’s full of data.
Lakes are different from data warehouses, however, which are more of a database for databases. The main distinction between the two is that data warehouses present data that’s ready for analysis and reporting, while lakes hold data in its raw form. What a warehouse contains is clean, validated, and the single source of truth for an analysis. Its applications include data science, with machine learning models, data models, and much more. All of these can be useful in reporting, tracking performance, etc.
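To make that distinction concrete, here is a minimal Python sketch. It assumes a hypothetical record shape with `client_id` and `amount` fields; the names and validation rules are illustrative assumptions, not the ones used on any client project. The lake keeps every record as it arrives, and only the rows that pass validation are promoted to the warehouse.

```python
from typing import Any, Optional

# Hypothetical raw records as they might land in a data lake: nothing is rejected.
data_lake: list[dict[str, Any]] = [
    {"client_id": "A1", "amount": "1200.50"},
    {"client_id": "", "amount": "300"},      # missing client id
    {"client_id": "B7", "amount": "n/a"},    # amount cannot be parsed
    {"client_id": "C3", "amount": " 75 "},   # valid once whitespace is trimmed
]


def clean(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a validated, typed row, or None if the record fails validation."""
    client_id = str(record.get("client_id", "")).strip()
    if not client_id:
        return None
    try:
        amount = float(str(record.get("amount", "")).strip())
    except ValueError:
        return None
    return {"client_id": client_id, "amount": amount}


def promote_to_warehouse(lake: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only records that pass validation; the lake itself is left untouched."""
    rows = (clean(record) for record in lake)
    return [row for row in rows if row is not None]


warehouse = promote_to_warehouse(data_lake)
```

In this sketch the lake still holds all four raw records afterwards, while the warehouse holds only the two that could be validated and typed.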
Ideally, we use these resources to sort out otherwise challenging factors. Think of weather forecasting, for example. We can easily link to Bloomberg public data sources or other resources to run the kind of analysis that eases investment decisions on a specific asset.
How data science is part of our day-to-day work
To give an idea of everyday work for us, we’ve just built an entire ecosystem for a particular client. What we did was create a framework that makes it easy to add a new data source, without much additional work or coding.
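One common way to get that kind of extensibility is a reader registry, where each new source format is a single small function. The Python sketch below is an illustration under assumed names (`READERS`, `register_reader`, `ingest`); it is not the framework built for the client.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Reader = Callable[[str], Iterable[Record]]

# Hypothetical registry mapping a source format to the function that parses it.
READERS: Dict[str, Reader] = {}


def register_reader(fmt: str) -> Callable[[Reader], Reader]:
    """Decorator that plugs a new reader into the registry."""
    def decorator(func: Reader) -> Reader:
        READERS[fmt] = func
        return func
    return decorator


@register_reader("csv")
def read_csv(payload: str) -> List[Record]:
    # The first line of the payload is treated as the header row.
    return list(csv.DictReader(io.StringIO(payload)))


@register_reader("json")
def read_json(payload: str) -> List[Record]:
    # Expects a JSON array of objects; values are normalized to strings.
    return [{key: str(value) for key, value in row.items()}
            for row in json.loads(payload)]


def ingest(fmt: str, payload: str) -> List[Record]:
    """Parse a payload with whichever reader is registered for its format."""
    if fmt not in READERS:
        raise ValueError(f"no reader registered for format {fmt!r}")
    return list(READERS[fmt](payload))
```

With this design, supporting another source means writing one decorated function; `ingest` and everything downstream of it stay unchanged.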
Normally, clients have a clear macro understanding of their needs. They get the big picture and know perfectly well what they’re looking for.
What we do is come in with the micro architecture that’s required to make big pictures happen.
We define the storage design, that is, the lakes and warehouses themselves. As the experts, we take on that leadership and become product owners. Still, the entire concept of what we do is carefully prepared ahead of time, with a specific emphasis on creating what our clients need to go after their competition.
Aside from being architects, we’re reliable advisors.
Another key consideration here is prediction. From time to time, we’ll get clients who want to do near real-time analysis to get diverse results. We need to answer questions such as: “If I invest 3 billion dollars, what should my expectations be for the rest of the year?” So, we work hard at answering those key business questions.
Moving big data to a cloud
On another front, whenever data for a multibillion-dollar company needs to go to the cloud, we’re there to make it happen. But big data can’t just be moved to a public cloud so easily.
As there are security policies to consider, we need to think it over several times before even trying to move the data. We also need to encrypt the data, which in turn requires the matching decryption algorithm on the other end.
In some cases, clients come to us without a solid architecture to keep the data separated. So, we first need to create a new design that isolates the data before we can work on moving it. And we do all that before a client can even see their information. In the process, we ensure no client can see the data of another customer, of course.
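As a rough illustration of what isolating the data can mean, the Python sketch below keeps each client’s rows in a separate partition and only ever reads from the partition of the client that asks. The `Row`, `tenant_id`, and `IsolatedStore` names are hypothetical, and a real deployment would enforce this in the database and its access policies rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    tenant_id: str  # hypothetical column identifying the owning client
    payload: str


class IsolatedStore:
    """Keeps each tenant's rows in a separate partition."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = {}

    def write(self, row: Row) -> None:
        # Rows are routed to the partition of the tenant that owns them.
        self._partitions.setdefault(row.tenant_id, []).append(row)

    def read(self, tenant_id: str) -> List[Row]:
        # Return a copy so callers cannot mutate the stored partition;
        # rows of other tenants are never reachable through this call.
        return list(self._partitions.get(tenant_id, []))


store = IsolatedStore()
store.write(Row("acme", "portfolio report"))
store.write(Row("globex", "risk model input"))
acme_rows = store.read("acme")
```

Here `acme_rows` contains only the row written for `acme`, and a read for a tenant with no data returns an empty list.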
We’re speaking here of data that requires a high level of security. Not many projects have so many data policies in place.
How data science teams are distributed
Finally, we also work in specialized teams to make changes happen. To continue with an example I gave earlier, out of a team of almost 40 people, we’ll have a data architect overseeing the entire operation. I’m the one doing this for Blankfactor at the moment. And I’m directly assisted by three team leads.
Each lead covers one of three big pieces of the project. One of them focuses on data ingestion, which means transporting data from the main source to our own data lake. That lead has five engineers assisting.
The second part covers the Snowflake work proper, which means all of the project’s data manipulation and analysis. Four people work on that, including me as their team lead.
The third team handles the project’s presentation. They come up with the reports and all the supporting work that clients want to see. For that, we have the third team lead working with four additional team members.
In this scenario, DevOps engineers also come in to help with the infrastructure and set up most of the system. We rely on project managers, too. And for the kind of predictive work I mentioned above, answering key business questions for our clients, we rely on a team lead who works alongside two other engineers.
In the case of this particular project, since it’s for a big asset management company, we even have a Wall Street consultant helping us understand mechanisms and terms in the investment field.
Data architecture can be a worldwide team effort
As you can see, from design through proof of concept to implementation, there are several stages and diverse group dynamics that help data scientists accomplish goals related to data architecture alone. We’re currently working on multiregional projects that foster innovation and data science conversations at the highest level in the industry.
Our daily lives draw on a broad range of expertise and knowledge, not only in data science and architecture per se, but also in the diverse technologies, frameworks, software, and applications that can best make innovative and secure changes happen at a large scale.