What skills does a data engineer need? This is a question anyone looking to become a data engineer will ask. A data engineer’s role is to ingest data, process it, and make it available for downstream processing. This downstream processing can be one or a combination of reporting and analytics, data science exploration, or a target application. The term “data engineer” came about between 2010 and 2015, when there was an explosion of interest in processing “big data”.
Data engineering came from the IT / software development world. It was initially seen as different from the then well-understood work of data warehouse ETL developers. ETL developers used SQL along with mature ETL tools to perform similar tasks, albeit with smaller data sets. Early data engineering was different because many of the tools and software libraries used to work with data were low level and language-specific.
Since then, these two worlds have combined, and modern data engineering tools use the best of both worlds. The term “data pipeline”, which is the connection of data processing steps required to ingest, process, and make data available for downstream processing, came about as data engineering as a discipline evolved. Data engineers create data pipelines.
Data engineers will typically report to a lead data engineer or a data manager.
A successful data engineer needs the following skills:
Requirements Analysis
Understanding what data your client needs is one of the most critical first steps a data engineer will carry out. Taking the time to get this step right is essential. Listening to the client carefully and understanding their requirements will ensure the data being sourced is correct. Knowing the intended usage of the data will determine how the data will be loaded, processed, and made available. Asking how timely the data needs to be will allow you to work out how best to load it and how frequently.
Database
Understanding databases and how to work with them is a basic skill that every data engineer needs. Databases are used to store data in a structured manner so that it can be used by an application, reported on by a BI tool or interpreted by data scientists. Structured Query Language (SQL) is the language used to work with data in a database. Data is stored in tables that represent the entities you want to store in a database. SQL allows you to query the tables and objects within a database and return the results of that query. Even knowing the basics of SQL is a big advantage when working with data. A data engineer will need to be highly proficient with SQL, understanding the nuances of the language and how to write efficient and performant queries.
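As a minimal sketch of storing an entity in a table and querying it with SQL, the example below uses Python's built-in sqlite3 module with an in-memory database. The customers table, its columns and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# A table represents one entity. "customers" and its columns are hypothetical.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ada", "IE"), ("Grace", "US"), ("Linus", "FI"), ("Alan", "IE")],
)

# Query the table: count customers per country, largest group first.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS customer_count
    FROM customers
    GROUP BY country
    ORDER BY customer_count DESC, country
    """
).fetchall()

for country, customer_count in rows:
    print(country, customer_count)

conn.close()
```

The same SELECT, GROUP BY and ORDER BY clauses carry over to most other SQL databases, although data types and some syntax details differ between products.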
Data warehousing
Data warehousing skills are very important for a data engineer. The various patterns that are available for loading data to a database or cloud storage are well established and understood. Structuring data in dimension and fact tables is a universal approach to modelling data in a data warehouse that can be used across all data platforms. It is a common methodology that all data engineers need to understand. Countless books and articles have been written covering data warehousing. The knowledge they contain is extremely helpful in dealing with different scenarios for efficiently storing data so that reporting or analytics can easily be run on top of the data.
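To illustrate the dimension and fact table approach, here is a small sketch of a star schema, again using Python's sqlite3 module. The table names (dim_product, dim_date, fact_sales), columns and figures are assumptions chosen for the example and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold the descriptive attributes.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month TEXT
    );
    -- The fact table holds the measures plus a key into each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        quantity INTEGER,
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Keyboard', 'Hardware'),
        (2, 'Mouse', 'Hardware'),
        (3, 'Editor licence', 'Software');
    INSERT INTO dim_date VALUES
        (20220131, '2022-01-31', '2022-01'),
        (20220201, '2022-02-01', '2022-02');
    INSERT INTO fact_sales VALUES
        (1, 20220131, 2, 100.0),
        (2, 20220131, 1, 20.0),
        (3, 20220201, 5, 250.0),
        (1, 20220201, 1, 50.0);
    """
)

# A typical report: join the fact table to its dimensions and aggregate.
report = conn.execute(
    """
    SELECT d.month, p.category, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
    """
).fetchall()

for month, category, total_amount in report:
    print(month, category, total_amount)

conn.close()
```

Because the descriptive attributes live in the dimensions, the same fact table can be sliced by month, category or product name without changing how the sales themselves are stored.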
Programming Skills
Programming skills are essential for data engineers. Python is by far the leading data manipulation language. Python was a great programming language for working with data and files long before the term “data engineering” was invented. It has since become the de facto language for data engineering. Python comes with extensive support for working with data. Many of the leading data science libraries are implemented in Python. Taking the time to learn Python is a very worthwhile exercise. It is an essential tool in any data engineer’s toolbox.
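As a small illustration of working with data and files in Python, the sketch below reads CSV data, totals an amount per country and writes the result back out as CSV, using only the standard library. The column names and values are made up, and io.StringIO stands in for real files so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a file on disk. The columns and values are hypothetical.
raw_file = io.StringIO(
    "order_id,country,amount\n"
    "1,IE,10.50\n"
    "2,US,20.00\n"
    "3,IE,4.50\n"
)

# Ingest and process: total the amount for each country.
totals = defaultdict(float)
for row in csv.DictReader(raw_file):
    totals[row["country"]] += float(row["amount"])

# Make the result available: write a summary CSV for downstream use.
out_file = io.StringIO()
writer = csv.writer(out_file, lineterminator="\n")
writer.writerow(["country", "total_amount"])
for country in sorted(totals):
    writer.writerow([country, f"{totals[country]:.2f}"])

print(out_file.getvalue())
```

With real files, the two io.StringIO objects would be replaced by open() calls. The three stages (ingest, process, make available) are the same ones a larger data pipeline strings together.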
APIs
Almost every application exposes functionality and data via an Application Programming Interface (API). The use of APIs to interact with data services or applications has become the de facto standard for data integration. A data engineer will use APIs to extract data from various applications and data services and land this data in a database, a data warehouse or a data lake.
Various software libraries and tools are available to a data engineer to work with APIs. The first thing a data engineer will need to understand is how the API exposes data. Secondly, it is important to work out how best to interact with the API. The objective is to extract as much data as possible in the fewest calls. Thirdly, a data engineer will need to write the code or configure the tool they are using to work with the API. The last step is ensuring the extraction runs on an agreed schedule and that any errors are handled.
Project Management
Project management skills are an asset to anyone working with data regardless of their role. Data engineers will work closely with project managers on complex projects. Understanding basic project management is a great skill to have and one that will help every data engineer’s career. From a data engineer’s perspective, knowing where you are in a project, how much time has been allocated to the current task, and whether you are on track is of vital importance. Too often, data engineers who work directly for companies get caught up in either technical challenges or requirements creep and forget basic project management.
Cloud
As of 2022, AWS, Azure and Google Cloud are the top three cloud providers in the world. They all offer comparable cloud services. These services range across storage, compute (virtual machines or serverless), networking, data processing, and analytics, along with the security and access control to support them. Data engineers need to be very knowledgeable in the cloud service their organisation uses. A good starting point is to complete at least the fundamentals course for whatever cloud service your organisation uses. The more cloud certifications you hold, or the deeper your specialisation in AWS, Azure or Google Cloud, the more advantageous this will be for your career.
Conclusion
As you can see, there are quite a few skills a data engineer needs to master. Database, programming and API skills are the ones you’ll need to get up to speed with quickly. At the same time, understanding the basics of data warehousing and the various patterns for structuring data will significantly benefit you. Requirements analysis and project management are skills you will pick up over time and learn from the colleagues you work with. Working with various data engineering cloud services will quickly make you proficient in whatever cloud your organisation uses. It is also advantageous for data engineers to understand data science skills, as data engineers often make datasets available for data science exploration.
Please check out the Data Knowledge Club for upcoming articles and info related to data engineering.
Mentorship Program
The Data Knowledge Club offers mentorship services for anyone looking to advance their data careers.