Initial support for Data Maintenance #9
Signed-off-by: Allen Xu <allxu@nvidia.com>
When running the LF_SR function, we see an error. Cause: the DM data "s_store_returns" definition (official TPCDS_TOOLKIT/tools/tpcds_source.sql, old version: https://github.com/gregrahn/tpcds-kit/blob/master/tools/tpcds_source.sql#L356). Those definitions are from the TPC-DS Spec. According to the Spec, I can fix this by changing the definition.
| "--executor-cores" "12" | ||
| "--conf" "spark.task.cpus=1") | ||
| "--conf" "spark.task.cpus=1" | ||
| "--packages" "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1" |
nit: the first character seems not aligned.
| "--executor-cores" "12" | ||
| "--conf" "spark.task.cpus=1") | ||
| "--conf" "spark.task.cpus=1" | ||
| "--packages" "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1" |
There was a problem hiding this comment.
Shall we consider the case of not using Iceberg?
In that case, the Iceberg parameters should be enabled by an option.
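A minimal sketch of making the Iceberg dependency opt-in rather than always-on, as suggested above. The `use_iceberg` flag and the `submit_args` helper name are illustrative assumptions, not part of the PR; the argument values come from the quoted diff.

```python
# Hypothetical sketch: build spark-submit arguments, adding the Iceberg
# runtime package only when the user opts in. The flag name is illustrative.
ICEBERG_PKG = "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1"

def submit_args(use_iceberg):
    """Return the spark-submit argument list for the benchmark template."""
    args = ["--executor-cores", "12", "--conf", "spark.task.cpus=1"]
    if use_iceberg:
        # Only pull in the Iceberg package when Data Maintenance needs it.
        args += ["--packages", ICEBERG_PKG]
    return args
```

Users who only run the query portion of NDS on raw Parquet/ORC would then never download the Iceberg jars.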
@jlowe What's our plan/strategy for the WHOLE NDS test? Are we going to run it all on Iceberg? For #8 I made the transcode step save the data ONLY to Iceberg; should we keep the old way of saving it to a plain folder? The old way may be friendlier to users who don't know Iceberg, but eventually, if they want to perform the whole NDS test including Data Maintenance, they will have to come back and do the Iceberg write anyway... Any suggestions?
To do the entire NDS suite one would need to use a format that supports the entire set of operations required by the entire suite. That would mean using Iceberg, Delta Lake, or some other format that allows incremental table update.
I've made the transcode step to save the data ONLY to Iceberg
That's not desired. We want to support transcoding to a bunch of different formats, because we're not always going to run the entire suite. We get a lot of useful information from running the significant portion of NDS that works on raw Parquet and ORC files, and we do not want to lose the ability to set up those benchmarks. The transcode needs to be flexible, allowing outputs ideally to every major output format that we want to bench. For now that definitely includes raw Parquet and ORC along with Iceberg (and the ability to control settings for these formats such as compression codec, probably via separate configs spec'd either inline or sideband in the Spark instance to use).
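The flexible-transcode idea above could be sketched as a small mapping from each target format to a Spark `DataFrameWriter` format name plus its options. The function name and the Iceberg option key are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of a per-format writer configuration for the transcode
# step. Parquet/ORC take compression as a plain writer option; Iceberg
# typically controls it via a table property instead (an assumption here).
def build_write_options(fmt, codec="snappy"):
    """Return (writer_format, options) for a transcode target format."""
    if fmt in ("parquet", "orc"):
        # Raw file formats: compression is an ordinary writer option.
        return fmt, {"compression": codec}
    if fmt == "iceberg":
        # Iceberg: request the codec through a table property.
        return "iceberg", {"write.parquet.compression-codec": codec}
    raise ValueError(f"unsupported transcode target: {fmt}")
```

Under a real SparkSession a caller would then do something like `df.write.format(writer_format).options(**options).save(path)` for the file formats, or `writeTo` for Iceberg tables.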
Signed-off-by: Allen Xu <allxu@nvidia.com>
For the DELETE functions, e.g., the single SQL statement results in an error. I can break the SQL into separate statements, and this works.
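A hypothetical illustration of the workaround described above. The TPC-DS Data Maintenance DELETE functions remove fact rows falling in a date range; when a single DELETE with an embedded subquery over `date_dim` fails to plan, it can be split in two: first resolve the date-key bounds, then issue a plain DELETE with literal bounds. The table and column names follow TPC-DS; the helper itself is illustrative, not the PR's code.

```python
# Hypothetical sketch: split a DELETE-with-subquery into a probe query plus
# a simple DELETE over literal bounds. Not the PR's actual implementation.
def split_delete(table, date_sk_col, lo_date, hi_date):
    """Return (probe_sql, make_delete) for the two-step DELETE workaround."""
    # Step 1: resolve the date surrogate-key bounds up front.
    probe_sql = (
        "SELECT min(d_date_sk), max(d_date_sk) FROM date_dim "
        f"WHERE d_date BETWEEN '{lo_date}' AND '{hi_date}'"
    )

    def make_delete(min_sk, max_sk):
        # Step 2: plain DELETE using the literal bounds from step 1.
        return (f"DELETE FROM {table} "
                f"WHERE {date_sk_col} >= {min_sk} AND {date_sk_col} <= {max_sk}")

    return probe_sql, make_delete
```

With a SparkSession, step 1 would be run via `spark.sql(probe_sql).collect()` and its result fed into `make_delete` before executing the DELETE.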
Signed-off-by: Allen Xu <allxu@nvidia.com>
This seems like potentially a bug in Spark 3.2, especially since the error is such a low-level class cast error. I was able to get the same query to plan without an error on Spark 3.1.2.
One issue for our use case: we want to use Spark 3.2.1 as our NDS 2.0 benchmark environment for performance reasons, especially for query77 (there's a huge performance drop in 3.1.2).
Filed a Spark issue: https://issues.apache.org/jira/browse/SPARK-39454. Update: it's said this will be fixed in Spark 3.3.0 and Spark 3.2.2.
I verified on Spark 3.2.2; it works.
Signed-off-by: Allen Xu <allxu@nvidia.com>
jlowe left a comment
This is close. Minor comments on documentation and waiting to hear back on copyright/license question.
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu allxu@nvidia.com
This PR adds initial support for part of the Data Maintenance work.
Data Maintenance requires ACID operations like INSERT and DELETE, and Spark currently doesn't provide native support for them, so we chose Iceberg as the data source metadata manager.
With this change, we will:
fix: #4, #8