How Data Scientists and Data Engineers Can Collaborate Effectively

Advice from a former data engineer turned data scientist

Photo by fauxels from Pexels

If you’re a data scientist or data engineer, you may have found it challenging to work well together. As a data scientist, you’re given data that doesn’t match your expectations, and as a data engineer, you’re asked to work on tasks that are easier said than done. As a former data engineer turned data scientist, I’d like to discuss ways I’ve found that help data scientists and data engineers work together more effectively.


Provide Context

As a data scientist, you work with stakeholders to understand the context of the request, determine the priority, and agree on a deliverable. Use this same approach when you make a request to a data engineer. Data engineers are constantly bombarded with requests to pull new data or investigate data pipeline or quality issues. They need to understand the context of your request in order to prioritize accordingly against their backlog of tasks.

Data scientist advice:

Answer the 4 W’s when making data engineering requests.

  1. Who benefits from this data? Marketing needs to know where the website visitors are coming from.
  2. What data is needed? Website visits to all marketing-owned pages.
  3. Why is this data needed? Knowing the source of the website visitors helps marketing prioritize high-converting channels to drive more sales.
  4. When is this data needed? Make requests earlier rather than later to get on the data engineering backlog, and add time to your marketing deliverable date to account for data engineering pulling the data.

Data engineer advice:

  • Create a help page listing the details needed for a request and review it with the data science team. Alternatively, ask for this information in the data engineering request form. Having the details upfront saves both teams the time otherwise spent going back and forth with questions.
  • Data scientists may not be aware of all the data available. Consider creating a data catalog with location and descriptions for data scientists to review before requesting data that may already exist.
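As one way to picture the request form mentioned above, here is a minimal Python sketch that flags which of the 4 W's are still blank before a request reaches the backlog. The DataRequest class, its field names, and missing_details are hypothetical names made up for illustration, not part of any real ticketing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class DataRequest:
    """Hypothetical request form capturing the 4 W's."""
    who: str   # who benefits from the data
    what: str  # what data is needed
    why: str   # why the data is needed
    when: str  # when the data is needed


def missing_details(request: DataRequest) -> list:
    """Return the names of any fields that were left blank."""
    return [f.name for f in fields(request) if not getattr(request, f.name).strip()]


# Example request with the deadline left out
request = DataRequest(
    who="Marketing",
    what="Website visits to all marketing-owned pages",
    why="Prioritize high-converting channels",
    when="",
)
missing = missing_details(request)
print(missing)
```

A check like this could run when the form is submitted, so the requester is asked for the missing detail right away instead of days later.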

Provide Data Specifications

Data engineers prefer requests with clear specifications because they can’t be expected to know what’s best for your analysis. As a data scientist, specify the data fields, any data handling logic (such as how to treat null values), and the date range you need.

Vague specifications such as "pull website visitor data" increase the time needed to complete your request because data engineering will have to ask clarifying questions. A clear specification such as "pull website visitor data starting from Jan. 1, 2021 for visitors that went to pages starting with ‘www’ and store in a database table named website_visitor" provides more detail and reduces questions.

I once made the mistake of not specifying that the same day’s data should be deleted before each update, and I ended up with duplicates in the table. In addition to the history needed, note how you want new data to be updated going forward.

Data scientist advice:

  • Provide as much detail as possible when making a request. This reduces back-and-forth questions, so your request can be completed sooner.
  • Ask for sample data containing a few days of data to review values and confirm all necessary fields are present before data engineering spends time pulling the history, especially if you need years of data backfilled. Otherwise, you might get data that don’t match your expectations and have to wait days to get the full history rerun.
  • Be specific about how data should be updated to avoid duplicate records. For example, run a one-time backfill of history starting from Jan. 1, 2021 to the latest date. For daily incremental updates, delete existing data for the run date and then append the data for that same day. If you don’t need incremental updates, specify that you want a full refresh of the data at the frequency of your choice, e.g., daily.
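The delete-then-append pattern for daily updates can be sketched in Python with the standard sqlite3 module. This is only an illustration under assumptions: the column names (visit_date, page_url, source), the sample rows, and the load_daily function are hypothetical, and a real pipeline would target your own database and schema.

```python
import sqlite3


def load_daily(conn, run_date, rows):
    """Delete any existing rows for run_date, then append the fresh rows."""
    with conn:  # single transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM website_visitor WHERE visit_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO website_visitor (visit_date, page_url, source) VALUES (?, ?, ?)",
            [(run_date, url, source) for url, source in rows],
        )


# In-memory database standing in for the real warehouse (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website_visitor (visit_date TEXT, page_url TEXT, source TEXT)")

rows = [("www.example.com/pricing", "email"), ("www.example.com/blog", "search")]
load_daily(conn, "2021-01-01", rows)
load_daily(conn, "2021-01-01", rows)  # rerunning the same day does not duplicate rows

count = conn.execute("SELECT COUNT(*) FROM website_visitor").fetchone()[0]
print(count)
```

Because the delete and the insert share one transaction, a failed load leaves the previous data for that day in place rather than an empty or half-loaded date.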

Data engineer advice:

  • Create a checklist for standard requests, such as pulling in new data, and review each request against it to confirm you have all the information needed. This will prevent delays when you actually get to the request and realize you need more information before you can start.
  • If you’re not certain how the data scientist wants the data, just ask. It’s better to clarify than to have to modify the ETL and rerun the job because it wasn’t what the data scientist expected.

Provide QA Specifications

Data engineers can support hundreds of data pipelines, and a primary part of their job is to make sure these ETL jobs run without error and to troubleshoot those that don’t. Help reduce the turnaround time for your request by providing QA checks to run before data is passed to you for review.

For example, I once worked with data engineers on a data migration project to replicate an existing ETL pipeline to load into a new database. I knew the data I was expecting to see but the data engineers didn’t because they weren’t familiar with this ETL pipeline. I provided a few SQL statements the data engineer could run to confirm the data was loaded correctly. This helped them see when the data wasn’t loaded properly so they could troubleshoot the ETL code, and it saved me from reviewing data that wasn’t QA ready.

Data scientist advice:

  • Provide guidance on the expected values or SQL statements the data engineer can run to confirm the ETL is running as expected. The more details you provide, the fewer questions you’ll have to answer.
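As an illustration of what such checks might look like, the sketch below pairs each SQL statement with the value the data scientist expects, so the data engineer can run them all and see which ones fail. The table layout, the three checks, and the names QA_CHECKS and run_qa are assumptions made for this example, not the statements I actually used.

```python
import sqlite3

# Hypothetical checks: each SQL statement returns one number, paired with a pass condition
QA_CHECKS = {
    "rows loaded for run date": (
        "SELECT COUNT(*) FROM website_visitor WHERE visit_date = '2021-01-01'",
        lambda n: n > 0,
    ),
    "no null page_url": (
        "SELECT COUNT(*) FROM website_visitor WHERE page_url IS NULL",
        lambda n: n == 0,
    ),
    "no duplicate rows": (
        "SELECT COUNT(*) FROM (SELECT visit_date, page_url, source FROM website_visitor "
        "GROUP BY visit_date, page_url, source HAVING COUNT(*) > 1)",
        lambda n: n == 0,
    ),
}


def run_qa(conn):
    """Run every check and return a dict of check name -> passed (True/False)."""
    results = {}
    for name, (sql, passes) in QA_CHECKS.items():
        value = conn.execute(sql).fetchone()[0]
        results[name] = passes(value)
    return results


# Sample data containing a deliberate duplicate row (assumption for the demo)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website_visitor (visit_date TEXT, page_url TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO website_visitor VALUES (?, ?, ?)",
    [
        ("2021-01-01", "www.example.com/pricing", "email"),
        ("2021-01-01", "www.example.com/pricing", "email"),
        ("2021-01-01", "www.example.com/blog", "search"),
    ],
)
results = run_qa(conn)
for name, passed in results.items():
    print(("PASS" if passed else "FAIL"), name)
```

With the sample rows above, the duplicate check fails while the other two pass, which is the kind of signal that tells the data engineer the load needs another look before handing it over.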

Data engineer advice:

  • Take a quick look at the data loaded to see if anything looks odd. For example, if every column is null, that’s probably an issue you should investigate.
  • Ask the data scientist if there are any standard checks you can run to confirm the data is loaded as expected.
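The quick look for null columns can also be automated. The following is a rough sketch, assuming a SQLite table for demonstration; all_null_columns is a hypothetical helper, and because it interpolates the table name into SQL it should only be given trusted table names.

```python
import sqlite3


def all_null_columns(conn, table):
    """Return the columns of `table` that contain no non-null values."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    empty = []
    for col in columns:
        # COUNT(column) counts only non-null values
        non_null = conn.execute(f'SELECT COUNT("{col}") FROM "{table}"').fetchone()[0]
        if non_null == 0:
            empty.append(col)
    return empty


# Sample load where the source column never got populated (assumption for the demo)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website_visitor (visit_date TEXT, page_url TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO website_visitor VALUES (?, ?, ?)",
    [
        ("2021-01-01", "www.example.com/pricing", None),
        ("2021-01-01", "www.example.com/blog", None),
    ],
)
empty_cols = all_null_columns(conn, "website_visitor")
print(empty_cols)
```

Note that an empty table reports every column as all null, which is usually worth investigating as well.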

Final Thoughts

As data roles continue to evolve, data scientists and data engineers need to learn how to work together effectively to be successful in the organization. While we may not be able to collaborate in perfect harmony now, I hope this brings you one step closer to working together effectively.

