Cannot create calculated column on azure blob table #23

Closed
sztuka-billtech opened this issue May 13, 2024 · 1 comment

Comments

@sztuka-billtech

When trying to create a calculated column on a JSON, gzipped, hive-partitioned table read from Azure Blob Storage, DQOps throws the error below while collecting statistics on that calculated column.

The calculated column uses this query:

dayname(scraped_at::timestamp)

This calculated column works perfectly on an S3 hive-partitioned Parquet table. Manually setting the column data type to STRING or VARCHAR does not help.
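For context, the expression itself is valid DuckDB SQL; a minimal sketch (with a hypothetical literal standing in for the `scraped_at` value) runs fine in a plain DuckDB session:

```sql
-- dayname() on a casted timestamp, as used in the calculated column
SELECT dayname('2024-05-13 09:41:41'::timestamp);  -- 'Monday'
```

So the failure appears to be in how DQOps builds the table options string for the Azure JSON source, not in the expression.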

Error stacktrace:

2024-05-13 09:41:41.120 [pool-5-thread-2] ERROR c.d.c.jobqueue.BaseDqoJobQueueImpl -- Failed to execute a job: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
java.lang.NullPointerException: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.metadata.sources.fileformat.TableOptionsFormatter.lambda$formatColumns$1(TableOptionsFormatter.java:97)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.SliceOps$1$1.accept(SliceOps.java:200)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:601)
	at com.dqops.metadata.sources.fileformat.TableOptionsFormatter.formatColumns(TableOptionsFormatter.java:95)
	at com.dqops.metadata.sources.fileformat.JsonFileFormatSpec.buildSourceTableOptionsString(JsonFileFormatSpec.java:96)
	at com.dqops.metadata.sources.fileformat.FileFormatSpec.buildTableOptionsString(FileFormatSpec.java:170)
	at com.dqops.execution.sqltemplates.rendering.JinjaTemplateRenderParameters.createFromTrimmedObjects(JinjaTemplateRenderParameters.java:161)
	at com.dqops.execution.sqltemplates.rendering.JinjaSqlTemplateSensorRunner.prepareSensor(JinjaSqlTemplateSensorRunner.java:104)
	at com.dqops.execution.sensors.DataQualitySensorRunnerImpl.prepareSensor(DataQualitySensorRunnerImpl.java:93)
	at com.dqops.execution.statistics.TableStatisticsCollectorsExecutionServiceImpl.prepareSensors(TableStatisticsCollectorsExecutionServiceImpl.java:264)
	at com.dqops.execution.statistics.TableStatisticsCollectorsExecutionServiceImpl.executeCollectorsOnTable(TableStatisticsCollectorsExecutionServiceImpl.java:154)
	at com.dqops.execution.statistics.StatisticsCollectorsExecutionServiceImpl.executeStatisticsCollectorsOnTable(StatisticsCollectorsExecutionServiceImpl.java:171)
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:82)
	... 8 common frames omitted
Wrapped by: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:99)
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:39)
	at com.dqops.core.jobqueue.DqoQueueJob.execute(DqoQueueJob.java:128)
	... 6 common frames omitted
Wrapped by: com.dqops.core.jobqueue.exceptions.DqoQueueJobExecutionException: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.core.jobqueue.DqoQueueJob.execute(DqoQueueJob.java:142)
	at com.dqops.core.jobqueue.BaseDqoJobQueueImpl.jobProcessingThreadLoop(BaseDqoJobQueueImpl.java:203)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Steps to reproduce:

  1. Create a hive-partitioned, gzipped, newline-delimited JSON Azure Blob data source table where some field contains a timestamp
  2. Create a calculated column using the query mentioned above
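The table layout from step 1 can be sketched as a direct DuckDB query (the `az://` path, container, and column names are hypothetical; reading `az://` paths requires the DuckDB Azure extension):

```sql
-- Hive-partitioned, gzipped, newline-delimited JSON read from Azure Blob Storage,
-- with the calculated column expression applied on top
SELECT dayname(scraped_at::timestamp) AS scraped_day
FROM read_json_auto(
    'az://container/table/*/*.json.gz',
    format = 'newline_delimited',
    compression = 'gzip',
    hive_partitioning = true
);
```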

Bug observed in:

@dqops
Owner

dqops commented May 15, 2024

Problem fixed. DuckDB is case-sensitive, and the schema definition for nested fields must be very strict. The current version on the develop branch no longer tries to align data types and change them to upper case.
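An illustrative sketch of the pitfall the fix addresses (schema and file name are hypothetical): when an explicit schema is passed to DuckDB's `read_json`, field names inside nested STRUCT types must match the JSON keys' casing exactly, so rewriting the schema to upper case can break the match for nested fields.

```sql
-- Explicit schema: the nested field name "scrapedId" must keep its original casing;
-- upper-casing it to SCRAPEDID would no longer match the JSON key
SELECT *
FROM read_json(
    'data.json.gz',
    columns = {scraped_at: 'TIMESTAMP', payload: 'STRUCT("scrapedId" BIGINT, name VARCHAR)'}
);
```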

@dqops dqops closed this as completed May 15, 2024