Hi, my name is Sean, and I'm an engineer working on Dagster.
I get to talk to a lot of different engineering teams, and unfortunately, they all say that they're struggling.
They spend too much time babysitting production, and they don't have a chance to build new things and be proactive with stakeholders.
So why is this?
Well, a lot of those teams are using task-based orchestrators like Airflow, and that puts them into a vicious cycle: they can't test code locally, so they have to push it straight to production.
But because it's hard for them to reason ahead of time about what new code will do, often pushing straight into production leads to failures and outages, and that's what ends up paging on-call and interrupting those engineers who are trying to do new work.
Unfortunately, because of those interruptions, the team is slow and often criticized for being behind, and that in turn means they're unable to pay down technical debt that would actually allow them to fix some of these problems.
So how do we get out of this vicious cycle?
We believe that Dagster is the solution, and that's because Dagster is an orchestrator built for data engineers and the entire software development lifecycle.
It allows you to think about individual assets and to take a declarative approach.
So instead of having to build one monolithic DAG that's tied to your production resources, you can write new code incrementally, and then the orchestrator will figure out when those new data assets need to be run.
If this approach sounds familiar, it's because many modern web engineers have taken this declarative approach.
In fact, the migration from Angular to React was all about adopting these benefits.
So let's see this in action.
Here's the global data asset graph for the Hooli data engineering team.
You can see they start by grabbing some data from an API.
That data is fed through a series of transformations, and eventually a daily order summary table is created.
That table is then used by the data science team to run forecasting routines and by the reporting team for KPI reporting and executive dashboards.
So what are the benefits of using assets?
Well, imagine an executive has a question about the daily order summary.
Something doesn't look quite right.
Well, in a normal orchestrator, you would have to go spelunking through all the different task logs, trying to figure out which task might have impacted that table.
Whereas in Dagster, you can immediately look at the daily order summary and see metadata about it, see the run logs associated with it, and even information like the SQL that generated the table.
This allows you to debug problems and answer that executive question really quickly, and in fact, give stakeholders the ability to self-serve questions like, when was this data set last updated?
In addition, taking an asset-first approach allows Dagster to do declarative scheduling.
So instead of having to create a single monolithic DAG or try to reason through when different cron schedules should be applied to different jobs, you can simply define new assets and encode the SLA that stakeholders have for them.
So for example, this average order asset that the marketing team relies on needs to be updated pretty frequently because it's in a KPI dashboard.
So a policy has been set that the asset should never be more than 90 minutes stale.
In contrast, the daily order summary asset only needs to be updated every day by 9 a.m.
Dagster figures out when these assets should run, and because it's aware of all the different data assets that your team cares about and how they depend on one another, Dagster's smart enough to avoid redundant work.
So here we're seeing that the average order data set, whose SLA says it should never be more than 90 minutes stale, needs itself and two stale upstream assets updated, but everything else is already fresh enough.
This avoids redundant computations and expensive cloud warehouse queries.
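The idea of refreshing only what is stale can be sketched in a few lines of plain Python. This is a simplified model, not Dagster's scheduling logic or API: the asset names, the `UPSTREAM` graph, the timestamps, and the `assets_to_refresh` helper are all hypothetical, and the rule used here (an asset is re-run if it is older than the allowed lag or if anything upstream of it is being re-run) is an assumption made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical asset graph: each asset maps to the assets it reads from.
UPSTREAM = {
    "raw_orders": [],
    "raw_users": [],
    "orders_cleaned": ["raw_orders"],
    "users_cleaned": ["raw_users"],
    "orders_augmented": ["orders_cleaned", "users_cleaned"],
    "average_order": ["orders_augmented"],
}


def assets_to_refresh(target, last_updated, now, max_lag):
    """Return the assets that must be re-run for `target`, in dependency order."""
    cutoff = now - max_lag
    needs_refresh = {}
    plan = []

    def visit(name):
        if name not in needs_refresh:
            # Visit every parent first (a list, so no parent is skipped).
            parents_refreshed = [visit(parent) for parent in UPSTREAM[name]]
            needs_refresh[name] = last_updated[name] < cutoff or any(parents_refreshed)
            if needs_refresh[name]:
                plan.append(name)
        return needs_refresh[name]

    visit(target)
    return plan


# Made-up timestamps: three assets on the path to average_order are stale.
now = datetime(2023, 1, 1, 12, 0)
last_updated = {
    "raw_orders": datetime(2023, 1, 1, 11, 30),
    "raw_users": datetime(2023, 1, 1, 11, 45),
    "orders_cleaned": datetime(2023, 1, 1, 9, 0),
    "users_cleaned": datetime(2023, 1, 1, 11, 50),
    "orders_augmented": datetime(2023, 1, 1, 9, 5),
    "average_order": datetime(2023, 1, 1, 9, 10),
}
plan = assets_to_refresh("average_order", last_updated, now, timedelta(minutes=90))
```

With these timestamps the plan contains the target plus its two stale upstream assets, while the fresh assets are left alone.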
So how are all these things built?
Let's take a look at a Dagster project.
Dagster projects are formatted as Python packages.
And within a project, we can create an asset by simply writing a new function.
Assets in Dagster can be Pandas data frames, they can be Jupyter notebooks, they can be Spark data frames, or really any arbitrary code.
So here we'll create a new function to calculate the average order size, which is an important metric for our executive team.
We'll start by writing a function and then adding Dagster's asset decorator.
Then within the function, we'll just use our regular logic to compute that KPI.
And then finally, we'll encode the SLA for what stakeholders expect as a freshness policy.
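To make the "function plus decorator plus freshness policy" pattern concrete, here is a toy sketch that uses only the Python standard library. It is not Dagster's implementation or API: the `asset` decorator, the `REGISTRY` dictionary, the `max_lag_minutes` argument, and the `materialize` helper are hypothetical stand-ins, and the order data is made up.

```python
import inspect

# Toy registry: asset name -> (function, upstream asset names, max lag in minutes).
REGISTRY = {}


def asset(max_lag_minutes=None):
    """Register a function as an asset; its parameter names are its upstream assets."""
    def register(fn):
        upstream = list(inspect.signature(fn).parameters)
        REGISTRY[fn.__name__] = (fn, upstream, max_lag_minutes)
        return fn
    return register


@asset()
def orders():
    # Stand-in for data pulled from an API or a warehouse table.
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 40.0}]


@asset(max_lag_minutes=90)
def average_order(orders):
    # Regular business logic: the KPI is the mean order amount.
    return sum(order["amount"] for order in orders) / len(orders)


def materialize(name, cache=None):
    """Compute an asset after recursively computing everything it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, upstream, _ = REGISTRY[name]
        cache[name] = fn(*(materialize(dep, cache) for dep in upstream))
    return cache[name]
```

Calling `materialize("average_order")` first computes `orders` and then the KPI. The point of the sketch is that the dependency and the freshness expectation are declared next to the logic rather than in a separate scheduling file.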
Once we have our asset created, in Dagster, we can run everything locally.
So we'll fire up a local copy of our Dagster user interface.
And here I can test that the logic I just wrote actually runs.
When I run things locally, I don't have to use production resources.
So here when I run all of my code, I'm going to be using just the local file system to store intermediate results.
And the SQL that I'm writing will execute against a local DuckDB warehouse.
This allows me to execute all of my logical code really quickly and to iterate really fast without impacting or relying on production systems.
When we do make it to production, instead of using DuckDB, we'll use a Snowflake warehouse.
Instead of using our local file system, we'll use S3.
And that's all encoded and configured through Dagster's pluggable resource system.
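The general shape of a pluggable resource setup can be sketched as follows. This is an illustration of the pattern only, not Dagster's resource API: `LocalFileStore`, `ObjectStoreStub`, `build_resources`, and the environment names are hypothetical, an in-memory SQLite database stands in for the warehouse (DuckDB locally, Snowflake in production), and an in-memory dictionary stands in for S3.

```python
import sqlite3
import tempfile
from pathlib import Path


class LocalFileStore:
    """Stores intermediate results as files under a local directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, key, text):
        (self.root / key).write_text(text)

    def read(self, key):
        return (self.root / key).read_text()


class ObjectStoreStub:
    """Placeholder for a production object store such as S3; kept in memory here."""

    def __init__(self):
        self.objects = {}

    def write(self, key, text):
        self.objects[key] = text

    def read(self, key):
        return self.objects[key]


def build_resources(env):
    """Pick resource implementations by environment name."""
    if env == "local":
        return {"store": LocalFileStore(tempfile.mkdtemp()),
                "warehouse": sqlite3.connect(":memory:")}
    if env == "prod":
        # A real deployment would return object-store and warehouse clients here.
        return {"store": ObjectStoreStub(),
                "warehouse": sqlite3.connect(":memory:")}
    raise ValueError(f"unknown environment: {env}")


def average_order(resources):
    """Asset logic talks only to the resource interface, never to a concrete backend."""
    db = resources["warehouse"]
    db.execute("CREATE TABLE orders (amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?)", [(20.0,), (40.0,)])
    (avg,) = db.execute("SELECT AVG(amount) FROM orders").fetchone()
    resources["store"].write("average_order.txt", str(avg))
    return avg
```

Because `average_order` only sees the `resources` dictionary, the same function runs unchanged against `build_resources("local")` and `build_resources("prod")`.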
So now that I'm happy with my code locally, let's open up a pull request.
Normally, when data teams open pull requests, you can review the code, but you have to guess what that code will actually do once it's in production.
With Dagster, we create what's called a branch deployment, which is essentially an isolated copy of our entire data platform just for this pull request.
That allows my team to actually run the code and see what it's going to look like.
In this case, we're running against resources that are very similar to production.
We're using Snowflake to clone a copy of our production database that this pull request can run against.
So while we're not impacting production, we can be sure that our code is going to work with production-like systems.
So in this way, Dagster provides a staging environment for every pull request that you open.
Once you're ready to put code into production, Dagster comes with all the modern bells and whistles.
So, for example, multiple teams can collaborate together in different virtual environments and different projects.
You don't have to get everyone on the same version of pandas, and you still get a global asset view where those teams can depend on one another's work.
Dagster has full support for role-based access controls and single sign-on.
In fact, many of our Dagster customers allow everyone in their organization to be viewers so that they can self-serve questions from the Dagster asset catalog, like when was this data set last updated?
Dagster has a variety of settings to help ensure that the orchestrator is robust, including things like automatic op retries and run queues with different priorities.
And finally, Dagster supports a variety of different alerting policies.
As with many orchestrators, you can alert on failure, but Dagster actually helps teams avoid alert fatigue by also allowing you to alert on SLA violations.
And that means that you're only going to get notified when data sets are outside of the SLAs that actually matter to stakeholders and not get notified based on spurious failures that are automatically recoverable.
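The difference between alerting on every failure and alerting on SLA violations can be sketched like this. It is a conceptual model, not Dagster's retry or alerting API: `run_with_retries`, `violates_sla`, `flaky_refresh`, and the timestamps are hypothetical names and values chosen for the example.

```python
from datetime import datetime, timedelta


def run_with_retries(task, max_retries):
    """Call task(); on an exception, retry up to max_retries more times.

    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            task()
            return True, attempts
        except Exception:
            pass  # Treat the error as transient and move on to the next attempt.
    return False, attempts


def violates_sla(last_updated, now, max_lag):
    """True when the data is older than the stakeholder-facing SLA allows."""
    return now - last_updated > max_lag


calls = {"count": 0}


def flaky_refresh():
    # Fails on the first call only, like a transient network error.
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient error")


now = datetime(2023, 1, 1, 12, 0)
last_updated = datetime(2023, 1, 1, 11, 0)
succeeded, attempts = run_with_retries(flaky_refresh, max_retries=2)
if succeeded:
    last_updated = now  # The refresh landed, so the data is current again.
page_on_call = violates_sla(last_updated, now, timedelta(minutes=90))
```

Here the first attempt fails, the retry succeeds, and nobody is paged because the data never fell outside its 90-minute SLA. Only a refresh that keeps failing long enough for the data to go stale would trigger an alert.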
Finally, Dagster Cloud can run in a variety of different ways.
And so, for example, you might use Kubernetes or you might use ECS or any other highly scalable compute layer.
So we hope you're excited about Dagster and ready to give it a shot.
If that's the case, we've made it really easy to get started with Dagster Cloud.
You can clone an example project and get running in no time.
Or you can start out by developing locally.
Once you're ready to run things in production, you can either host open-source Dagster yourself or use Dagster Cloud, which comes with a fully serverless option and a hybrid compute model as well.
So be sure to check us out.
Find us on GitHub and give us a star.
That's the best place to keep track of recent updates like our 1.1 release.
Or join us on Slack where you can ask questions and meet other modern data engineers.
Thanks so much.
