Since porting to 2.1.0, Dataflow is leaving Datasets/Tables behind in BigQuery #609
Comments
|
I haven't seen anything like this in 2.0.0; we run batch jobs on a daily basis and have restarted our streaming pipelines a few times now. Is this in streaming, batch, or both? |
|
Only seen it in batch so far, and cannot reproduce yet. |
|
Still happening in 2.2.0 templated batch jobs on our side. We're currently managing it with cleanup scripts but it's a PITA. |
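For anyone else needing a stopgap, a cleanup script of the kind mentioned above might look roughly like the sketch below. This is not the script anyone in this thread actually runs. It assumes the `google-cloud-bigquery` Python client, and the `temp_dataset_` prefix, the one-day age threshold, and the helper names `select_stale` and `cleanup` are all illustrative assumptions; check the names of the datasets actually left behind in your project before deleting anything.

```python
from datetime import datetime, timedelta, timezone

# Assumption: leftover datasets share a recognizable prefix. Adjust this to
# match the names you actually see in your project.
TEMP_PREFIX = "temp_dataset_"


def select_stale(datasets, prefix=TEMP_PREFIX, max_age=timedelta(days=1), now=None):
    """Return IDs of datasets that match the prefix and are older than max_age.

    `datasets` is an iterable of (dataset_id, created) pairs, where `created`
    is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return [
        dataset_id
        for dataset_id, created in datasets
        if dataset_id.startswith(prefix) and now - created > max_age
    ]


def cleanup(project, dry_run=True):
    """List datasets in `project` and delete the stale temporary ones."""
    # Imported lazily so select_stale() can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    candidates = []
    for item in client.list_datasets():
        # The list call returns summaries; fetch full metadata for the creation time.
        dataset = client.get_dataset(item.reference)
        candidates.append((dataset.dataset_id, dataset.created))

    for dataset_id in select_stale(candidates):
        if dry_run:
            print(f"would delete {project}.{dataset_id}")
        else:
            # delete_contents=True also removes any tables still inside the dataset.
            client.delete_dataset(
                f"{project}.{dataset_id}", delete_contents=True, not_found_ok=True
            )
```

Running `cleanup("my-project")` with the default `dry_run=True` only prints what would be removed, which is a sensible first step since the prefix match is a guess. The age threshold is there to avoid deleting the temporary dataset of a job that is still running.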
|
I was just thinking about this today because it happened yet again. Agree, auto expire on the datasets makes sense. |
|
So I did a little investigation and it does look like that's actually implemented... not sure why it's still happening though.
I think I'll try to do a bit more debugging of my own... p.s. is this the correct forum to be discussing this? |
|
[email protected] is a good place, and you can also open a tracking issue at https://issues.apache.org/jira/projects/BEAM so people can follow the bug. |
|
I am also facing this issue. When a job fails, I observe that the table gets deleted after 1 day, but the dataset still remains. Could we have an option to clean up the temp dataset and tables immediately when a job fails? Does anyone have a better idea? |
Since porting to 2.1.0, Dataflow is leaving Datasets/Tables behind in BigQuery when the pipeline is cancelled or when it fails. We were on 1.8.0/1.9.0 before this and never saw this behavior. We skipped 2.0.0, so we're unsure which version actually introduced it.
I cancelled a job (2017-10-08_18_35_30-13495977675828673253), and it left behind a dataset and table in BigQuery: