
3 posts tagged with "schema evolution"


· 6 min read
Adrian Brudaru


Why is there a data engineer shortage?

  1. High Demand and Rapid Growth: The increasing reliance on data-driven decision-making and the rise of big data technologies have created a surge in demand for skilled data engineers.
  2. Skill Gap and Specialization: Data engineering requires a unique blend of technical skills, and finding individuals with the right combination of programming, database management, and cloud computing expertise can be challenging.
  3. Competition from Other Data Roles: The allure of data science and other data-related roles has attracted professionals, leading to a talent shortage in the data engineering field.

How big is the data engineer shortage?

💡 "In Europe there are 32K data engineers and 48K open positions to hire one. In the US the ratio is 41K to 79K" Source: Linkedin data analysis blog post

Well that doesn’t look too bad - if only we could all be about 2x as efficient :)

Bridging the gap: How to make your data engineers 2x more efficient?

There are two ways to make data engineers more efficient:

Option 1: Give them more to do, tell them how to do their jobs better!

For some reason, this doesn’t work out great. All the great minds of our generation told us we should be more like them:

  • do more architecture;
  • learn more tech;
  • use this new toy!
  • learn this paradigm.
  • take a step back and consider your career choices.
  • write more tests;
  • test the tests!
  • analyse the tests :[
  • write a paper about the tests...
  • do all that while alerts go off 24/7 and you are the bottleneck for everyone downstream, analysts and business people screaming. (┛ಠ_ಠ)┛彡┻━┻

“I can't do what ten people tell me to do. So I guess I'll remain the same”

  • Otis Redding, Sittin' On The Dock Of The Bay

Option 2: Take away unproductive work

A data engineer has a pretty limited task repertoire - so could we give some of their work to roles that are easier to hire for?

Let’s see what a data engineer does, according to GPT:

  • Data curation: Ensuring data quality, integrity, and consistency by performing data profiling, cleaning, transformation, and validation tasks.
  • Collaboration with analysts: Working closely with data analysts to understand their requirements, provide them with clean and structured data, and assist in data exploration and analysis.
  • Collaboration with DWH architects: Collaborating with data warehouse architects to design and optimize data models, schemas, and data pipelines for efficient data storage and retrieval.
  • Collaboration with governance managers: Partnering with governance managers to ensure compliance with data governance policies, standards, and regulations, including data privacy, security, and data lifecycle management.
  • Structuring and loading: Designing and developing data pipelines, ETL processes, and workflows to extract, transform, and load data from various sources into the target data structures.
  • Performance optimization: Identifying and implementing optimizations to enhance data processing and query performance, such as indexing, partitioning, and data caching.
  • Data documentation: Documenting data structures, data lineage, and metadata to facilitate understanding, collaboration, and data governance efforts.
  • Data troubleshooting: Investigating and resolving data-related issues, troubleshooting data anomalies, and providing support to resolve data-related incidents or problems.
  • Data collaboration and sharing: Facilitating data collaboration and sharing across teams, ensuring data accessibility, and promoting data-driven decision-making within the organization.
  • Continuous improvement: Staying updated with emerging technologies, industry trends, and best practices in data engineering, and actively seeking opportunities to improve data processes, quality, and efficiency.

Let’s get a back-of-the-napkin estimate of how much time they spend on those areas.

Here’s an approximation as offered by GPT. Of course, actual numbers depend on the maturity of your team and their unique challenges.

  • Collaboration with others (including data curation): Approximately 40-60% of their working hours. This includes tasks such as collaborating with team members, understanding requirements, data curation activities, participating in meetings, and coordinating data-related activities.
  • Data analysis: Around 10-30% of their working hours. This involves supporting data exploration, providing insights, and assisting analysts in understanding and extracting value from the data.
  • Technical problem-solving (structuring, maintenance, optimization): Roughly 30-50% of their working hours. This includes solving data structuring problems, maintaining existing data structures, optimizing data pipelines, troubleshooting technical issues, and continuously improving processes.

By looking at it this way, solutions become clear:

  • Let someone else do curation. Analysts could talk directly to producers. By removing the middleman, you also improve the speed and quality of the process.
  • Automate data structuring: While this is not as time-consuming as the collaboration, it’s the second most time-consuming process.
  • Let analysts explore structured data during curation, not before load. This is a minor optimisation, but 10-30% is still very significant towards our goal of reducing workload by 50%.

How much of their time could be saved?

ChatGPT thinks:

It is reasonable to expect significant time savings with the following estimates:

  1. Automation of Structuring and Maintenance: By automating the structuring and maintenance of data, data engineers can save 30-50% or more of their time previously spent on these tasks. This includes activities like schema evolution, data transformation, and pipeline optimization, which can be streamlined through automation.
  2. Analysts and Producers Handling Curation: Shifting the responsibility of data curation to analysts and producers can save an additional 10-30% of the data engineer's time. This includes tasks such as data cleaning, data validation, and data quality assurance, which can be effectively performed by individuals closer to the data and its context.

It's important to note that these estimates are approximate and can vary based on the specific circumstances and skill sets within the team.

💡 40-80% of a data engineer’s time could be spared

To achieve that:

  • Automate data structuring.
  • Govern the data without the data engineer.
  • Let analysts explore data as part of curation, instead of asking data engineers to do it.

This looks good enough for solving the talent shortage. Not only that, but doing things this way lets your team focus on what they do best.

A recipe to do it

  1. Use something with schema inference and evolution to load your data.
  2. Notify stakeholders and producers of data changes, so they can curate it.
  3. Don’t explore JSON with data engineers - let analysts explore structured data.

Ready to stop the pain? Read this explainer on how to do schema evolution with dlt. Want to discuss? Join our Slack.

· 6 min read
Adrian Brudaru

Schema evolution combines a technical process with a curation process, so let's understand where the technical automation needs to be combined with human curation.

Whether you are aware of it or not, the data you use is always structured

Data is always structured at the point of use, but it is usually produced unstructured.

Structuring it implicitly during reading is called "schema on read", while structuring it upfront is called "schema on write".

To fit unstructured data into a structured database, developers have to perform this transition before loading. For data lake users who read unstructured data, their pipelines apply a schema during read - if this schema is violated, the downstream software will produce bad outcomes.

We tried running away from our problems, but it didn't work.

Because structuring data is difficult to deal with, people have tried to not do it. But this created its own issues.

  • Loading JSON into the database without typing or structuring - This anti-pattern was created to shift the structuring of data to the analyst. While this is a good move for curation, database support for structuring data is minimal and unsafe. In practice, this translates to the analyst spending their time writing lots of untested parsing code and pushing silent bugs to production.
  • Loading unstructured data to lakes - This pattern pushes the curation of data to the analyst. The problem here is similar to the one above. Unstructured data is hard to analyse and curate, and the farther it is from the producer, the harder it is to understand.

So no, one way or another we are using schemas.

If curation is hard, how can we make it easier?

  • Make data easier to discover, analyze, explore. Structuring upfront would do that.
  • Simplify the human process by decentralizing data ownership and curation - the analyst can work directly with the producer to define the dataset produced.

Structuring & curating data are two separate problems. Together they are more than the sum of the parts.

The problem is that curating data is hard.

  • Typing and normalising data are technical processes.
  • Curating data is a business process.

Here's what a pipeline building process looks like:

  1. Speak with the producer to understand what the data is. Chances are the producer does not document it and there will be many cases that need to be validated analytically.
  2. Speak with the analyst or stakeholder to get their requirements. Guess which fields fulfill their requirements.
  3. Combine the 2 pieces of info to filter and structure the data so it can be loaded.
  4. Type the data (for example, convert strings to datetime).
  5. Load the data to the warehouse. The analyst can now validate whether this was the desired data, with the correct assumptions.
  6. The analyst validates with the stakeholder that this is the data they wanted. The stakeholder usually wants more.
  7. Possibly adjust the data filtering or normalization.
  8. Repeat entire process for each adjustment.

And when something changes:

  1. The data engineer sees something break.
  2. They ask the producer about it.
  3. They notify the analyst about it.
  4. The analyst notifies the business that data will stop flowing until adjustments are made.
  5. The analyst discusses with the stakeholder to get any updated requirements.
  6. The analyst offers the requirements to the data engineer.
  7. The data engineer checks with the producer/data how the new data should be loaded.
  8. Data engineer loads the new data.
  9. The analyst can now adjust their scripts, re-run them, and offer the data to the stakeholder.

Divide et impera! The two problems are technical and communicational, so let's let computers solve tech and let humans solve communication.

Before we start solving, let's understand the problem:

  1. For usage, data needs to be structured.
  2. Because structuring is hard, we try to reduce the amount we do by curating first or deferring to the analyst by loading unstructured data.
  3. Now we are trying to solve two problems at once: structuring and curation, with each role functioning as a bottleneck for the other.

So let's de-couple these two problems and solve them appropriately:

  • The technical issue is that unstructured data needs to be structured.
  • The curation issue relates to communication - so taking the engineer out of the loop would make this easier.

Automate the tech: Structuring, typing, normalizing

The only reason to keep data unstructured was the difficulty of applying structure.

By automating schema inference, evolution, normalization, and typing, we can just load our JSON into structured data stores and curate it in a separate step.
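
As a sketch of what that can look like in practice, here is a minimal dlt pipeline that loads nested, untyped records into a structured destination. The sample record, table name, and DuckDB destination are illustrative choices, not a prescription.

```python
import dlt

# Nested, untyped records as they might arrive from an API or event queue.
data = [
    {
        "id": 1,
        "name": "alice",
        "metadata": {"signup_ts": "2023-05-01T10:00:00Z"},
        "purchases": [{"sku": "A1", "price": 12.5}],
    }
]

# dlt infers column names and data types, flattens the nested dict,
# unnests the list into a child table, and versions the resulting schema.
pipeline = dlt.pipeline(
    pipeline_name="structured_load",
    destination="duckdb",          # any supported destination works
    dataset_name="raw_structured",
)

load_info = pipeline.run(data, table_name="users")
print(load_info)
```

From here, the analyst explores typed tables (including child tables such as users__purchases) instead of parsing raw JSON.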

Alert the communicators: When there is new data, alert the producer and the curator.

To govern how data is produced and used, we need to have a definition of the data that the producer and consumer can both refer to. This has typically been tackled with data contracts - a type of technical test that would notify the producer and consumer of violations.

So how would a data contract work?

  1. Human process:
    1. Humans define a data schema.
    2. Humans write a test to check if data conforms to the schema.
    3. Humans implement notifications for test failures.
  2. Technical process:
    1. Data is extracted.
    2. Data is staged to somewhere where it can be tested.
    3. Data is tested:
      1. If the test fails, we notify the producer and the curator.
      2. If the test succeeds, it gets transformed to the curated form.
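
For illustration, a hand-rolled version of such a contract could look like the sketch below. The schema, column names, and notification function are all hypothetical.

```python
# Humans define the expected schema once; a test checks staged data against it.
EXPECTED_SCHEMA = {"id": int, "name": str, "signup_ts": str}

def violations(record: dict) -> list:
    issues = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            issues.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            issues.append(f"wrong type for {column}: {type(record[column]).__name__}")
    # Columns outside the contract are new, undocumented data.
    issues.extend(f"unexpected column: {c}" for c in record if c not in EXPECTED_SCHEMA)
    return issues

def test_and_route(staged_records, notify):
    problems = [issue for record in staged_records for issue in violations(record)]
    if problems:
        notify(problems)       # alert the producer and the curator
        return None            # hold the data back from the curated form
    return staged_records      # safe to transform downstream
```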

So how would we do schema evolution with dlt?

  1. Data is extracted, dlt infers schema and can compare it to the previous schema.
  2. Data is loaded to a structured data lake (staging area).
  3. Destination schema is compared to the new incoming schema.
    1. If there are changes, we notify the producer and curator.
    2. If there are no changes, we carry on with transforming it to the curated form.

So, schema evolution is essentially a simpler way to do a contract on schemas. If you had additional business-logic tests, you would still need to implement them in a custom way.

The implementation recipe

  1. Use dlt. It will automatically infer and version schemas, so you can simply check if there are changes. You can just use the normaliser + loader or build extraction with dlt. If you want to define additional constraints, you can do so in the schema.
  2. Define your Slack hook or create your own notification function. Make sure the Slack channel contains the data producer and any stakeholders.
  3. Capture the load job info and send it to the hook.
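
A compact sketch of that recipe is below. The webhook URL and sample payload are placeholders, and the attributes used to read schema updates from the load info follow the pattern in dlt's documentation and may differ slightly between versions.

```python
import requests
import dlt

SLACK_HOOK = "https://hooks.slack.com/services/..."  # your incoming webhook URL

pipeline = dlt.pipeline(
    pipeline_name="structured_load",
    destination="duckdb",
    dataset_name="raw_structured",
)

# Placeholder payload; in practice this comes from your extractor or a dlt source.
load_info = pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

# Report any schema changes to the channel where the producer and stakeholders live.
for package in load_info.load_packages:
    for table_name, table in package.schema_update.items():
        for column_name, column in table["columns"].items():
            requests.post(SLACK_HOOK, json={
                "text": f"Table updated: {table_name}, "
                        f"new/changed column: {column_name} ({column['data_type']})"
            })
```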

· 7 min read
Adrian Brudaru
info

Google Colaboratory demo

This colab demo was built and shown by our working student Rahul Joshi for the Berlin Data meetup, where he talked about the state of schema evolution in open source.

What is schema evolution?

In the fast-paced world of data, the only constant is change, and it usually comes unannounced.

Schema on read

Schema on read means your data does not have a schema, but your consumer expects one. So when they read, they define the schema, and if the unstructured data does not have the same schema, issues happen.

Schema on write

So, to avoid things breaking at runtime, you would want to define a schema upfront - hence you would structure the data. The problem with structuring data is that it’s a labor-intensive process that pushes people toward the pragmatic shortcut of structuring only some data, which later leads to lots of maintenance.

Schema evolution means that a schema is automatically generated on write for the data, and automatically adjusted for any changes in the data, enabling a robust and clean environment downstream. It’s an automatic data structuring process that is aimed at saving time during creation, maintenance, and recovery.
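
As an illustration of that behavior (sketched with dlt, where this is the default; the table and field names are made up), two successive loads of slightly different data could look like this:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="evolve_demo",
    destination="duckdb",
    dataset_name="events",
)

# First run: the schema is inferred on write from the data itself.
pipeline.run([{"id": 1, "amount": 10.0}], table_name="payments")

# Second run: the producer added a field. Instead of breaking the load,
# the schema evolves - a typed "currency" column is added to the table.
pipeline.run([{"id": 2, "amount": 15.5, "currency": "EUR"}], table_name="payments")
```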

Why do schema evolution?

One way or another, produced raw unstructured data becomes structured during usage. So, which paradigm should we use around structuring?

Let’s look at the 3 existing paradigms, their complexities, and what a better solution could look like.

The old ways

The data warehouse paradigm: Curating unstructured data upfront

Traditionally, many organizations have adopted a 'curate first' approach to data management, particularly when dealing with unstructured data.

The desired outcome is that by curating the data upfront, we can directly extract value from it later. However, this approach has several pitfalls.

Why curating unstructured data first is a bad idea

  1. It's labor-intensive: Unstructured data is inherently messy and complex. Curating it requires significant manual effort, which is time-consuming and error-prone.
  2. It's difficult to scale: As the volume of unstructured data grows, the task of curating it becomes increasingly overwhelming. It's simply not feasible to keep up with the onslaught of new data. For example, the Data Mesh paradigm tries to address this.
  3. It delays value extraction: By focusing on upfront curation, organizations often delay the point at which they can start extracting value from their data. Valuable insights are often time-sensitive, and any delay could mean missed opportunities.
  4. It assumes we know what the stakeholders will need: Curating data requires us to make assumptions about what data will be useful and how it should be structured. These assumptions might be wrong, leading to wasted effort or even loss of valuable information.

The data lake paradigm: Schema-on-read with unstructured data

In an attempt to bypass upfront data structuring and curation, some organizations adopt a schema-on-read approach, especially when dealing with data lakes. While this offers flexibility, it comes with its share of issues:

  1. Inconsistency and quality issues: As there is no enforced structure or standard when data is ingested into the data lake, the data can be inconsistent and of varying quality. This could lead to inaccurate analysis and unreliable insights.
  2. Complexity and performance costs: Schema-on-read pushes the cost of data processing to the read stage. Every time someone queries the data, they must parse through the unstructured data and apply the schema. This adds complexity and may impact performance, especially with large datasets.
  3. Data literacy and skill gap: With schema-on-read, each user is responsible for understanding the data structure and using it correctly, which is unreasonable to expect with undocumented unstructured data.
  4. Lack of governance: Without a defined structure, data governance can be a challenge. It's difficult to apply data quality, data privacy, or data lifecycle policies consistently.

The hybrid approach: The lakehouse

  • The data lakehouse uses the data lake as a staging area for creating a warehouse-like structured data store.
  • This does not solve any of the previous issues with the two paradigms, but rather allows users to choose which one they apply on a case-by-case basis.

The new way

The current solution : Structured data lakes

Instead of trying to curate unstructured data upfront, a more effective approach is to structure the data first with some kind of automation. By applying a structured schema to the data, we can more easily manage, query, and analyze the data.

Here's why structuring data before curation is a good idea:

  1. It reduces maintenance: By automating schema creation and maintenance, you remove 80% of pipeline maintenance events.
  2. It simplifies the data: By imposing a structure on the data, we can reduce its complexity, making it easier to understand, manage, and use.
  3. It enables automation: Structured data is more amenable to automated testing and processing, including cleaning, transformation, and analysis. This can significantly reduce the manual effort required to manage the data.
  4. It facilitates value extraction: With structured data, we can more quickly and easily extract valuable insights. We don't need to wait for the entire dataset to be curated before we start using it.
  5. It's more scalable: Reading structured data enables us to only read the parts we care about, making it faster, cheaper, and more scalable.

Therefore, adopting a 'structure first' approach to data management can help organizations more effectively leverage their unstructured data, minimizing the effort, time, and complexity involved in data curation and maximizing the value they can extract from their data.

An example of such a structured lake would be parquet file data lakes, which are both structured and inclusive of all data. However, the challenge here is creating the structured parquet files and maintaining the schemas, for which the Delta Lake framework provides some decent solutions, but it is still far from complete.

The better way

So, what if writing and merging parquet files is not for you? After all, file-based data lakes capture a minority of the data market.

dlt is the first open source Python library to offer schema evolution

dlt enables organizations to impose structure on data as it's loaded into the data lake. This approach, often termed schema-on-load or schema-on-write, provides the best of both worlds:

  1. Easier maintenance: By notifying the data producer and consumer of loaded data schema changes, they can quickly decide together how to adjust downstream usage, enabling immediate recovery.
  2. Consistency and quality: By applying structure and data typing rules during ingestion, dlt ensures data consistency and quality. This leads to more reliable analysis and insights.
  3. Improved performance: With schema-on-write, the computational cost is handled during ingestion, not when querying the data. This simplifies queries and improves performance.
  4. Ease of use: Structured data is easier to understand and use, lowering the skill barrier for users. They no longer need to understand the intricate details of the data structure.
  5. Data governance: Having a defined schema allows for more effective data governance. Policies for data quality, data privacy, and data lifecycle can be applied consistently and automatically.

By adopting a 'structure first' approach with dlt, organizations can effectively manage unstructured data in common destinations, optimizing for both flexibility and control. It helps them overcome the challenges of schema-on-read while reaping the benefits of a structured, scalable, and governance-friendly data environment.

To try out schema evolution with dlt, check out our colab demo.


Want more?

  • Join our Slack
  • Read our schema evolution blog post
  • Stay tuned for the next article in the series: How to do schema evolution with dlt in the most effective way

