Data engineering is the practice of designing, building, and maintaining systems that collect, store, process, and serve large volumes of data from many sources. It draws on computer science, software engineering, and domain knowledge to build robust, scalable, and efficient data pipelines.
## ETL / ELT
| Name | OSS | Comment |
|---|---|---|
| Apache Airflow | 👍 | Programmatically author, schedule and monitor workflows. |
| Airbyte | 👍 | ETL Pipelines |
| Dagster | 👍 | Cloud-native orchestrator for data pipelines |
| Meltano | 👍 | ELT platform built on the Singer taps/targets ecosystem |
| Stitch | | Move data from multiple sources into a data warehouse |
| Hevo | | Automated Data Pipelines to Redshift, BigQuery, Snowflake |
| Rivery | | ETL Pipelines |
| Fivetran | | ETL Pipelines |
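The tools above all automate variants of the extract, transform, load pattern. A minimal sketch of that pattern in plain Python (hypothetical CSV source and in-memory SQLite target, not tied to any tool listed here) shows what such pipelines do under the hood:

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: parse rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: normalize fields and drop incomplete records."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("name") and r.get("amount")
    ]

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:name, :amount)", rows)
    conn.commit()

# Hypothetical source data; the third row is incomplete and gets filtered out.
source = "name,amount\nalice,10.5\nbob,3\n,99\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
# -> (2, 13.5)
```

Orchestrators such as Airflow or Dagster add what this sketch lacks: scheduling, retries, dependency graphs between steps, and monitoring.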
## Event Streaming & Data Streams
| Name | OSS | Comment |
|---|---|---|
| Apache Spark | 👍 | Unified engine for large-scale data analytics |
| Apache Beam | 👍 | Unified model for batch and streaming data processing |
| Dask | 👍 | Scale the Python tools you love |
| dbt Core | 👍 | Transform data in the warehouse with SQL |
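Engines like Spark and Beam expose one API over both bounded (batch) and unbounded (streaming) data. A toy illustration of that unified-model idea in plain Python generators (hypothetical event shape, no real engine involved): the same transformation consumes a finite list or an endless stream alike, because it iterates lazily.

```python
from typing import Iterable, Iterator

def clean_events(events: Iterable[dict]) -> Iterator[dict]:
    """One transformation for batch and streaming input: it consumes
    any iterable lazily, so a finite list (batch) and an endless
    generator (stream) are handled by the same code."""
    for e in events:
        if e.get("value") is not None:
            yield {"user": e["user"], "value": float(e["value"])}

# Batch: a bounded, in-memory dataset.
batch = [{"user": "a", "value": "1"}, {"user": "b", "value": None}]
print(list(clean_events(batch)))  # -> [{'user': 'a', 'value': 1.0}]

# Streaming: an unbounded source, consumed one event at a time.
def sensor_stream():
    i = 0
    while True:
        yield {"user": "sensor", "value": i}
        i += 1

stream = clean_events(sensor_stream())
print(next(stream))  # -> {'user': 'sensor', 'value': 0.0}
```

Real engines add the hard parts this sketch omits: distribution across machines, fault tolerance, and windowing over event time.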
## Notebooks & Visualizations

## More resources