[BUG] support COUNT_DISTINCT #914

Open
1 of 4 tasks
loomlike opened this issue Dec 10, 2022 · 0 comments
Labels: bug, good first issue

Willingness to contribute

No. I cannot contribute a bug fix at this time.

Feathr version

0.9.0

System information

NA

Describe the problem

The aggregation function COUNT_DISTINCT does not seem to be supported yet.

When I try to use it, the feature appears to be automatically typed as STRING, ignoring the INT32 type I explicitly defined, and Spark then throws an error.

My code:

# total number of different currencies used for transaction in the past week
num_currency_type_in_week = Feature(
    name="num_currency_type_in_week",
    key=account_id,
    feature_type=INT32,
    transform=WindowAggTransformation(
        agg_expr="transactionCurrencyCode", agg_func="COUNT_DISTINCT", window="7d"
    ),
)

Spark logs:

if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 17, num_currency_type_in_week), StringType, false), true, false, true) AS num_currency_type_in_week#819

and the error:

Caused by: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of string
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke_12$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_7$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:207)

I see the test function for count_distinct is commented out:
https://github.com/feathr-ai/feathr/blob/main/feathr-impl/src/test/scala/com/linkedin/feathr/offline/SlidingWindowAggIntegTest.scala
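
For reference, the result expected from the feature above can be sketched in plain Python. This is only an illustration of the intended semantics, not Feathr's implementation: the helper `count_distinct_in_window`, the sample rows, and the window boundary convention (start exclusive, end inclusive) are all assumptions made for this example. It shows that the aggregate is an integer count, which is why the feature is declared as INT32 even though the aggregated column holds strings.

```python
from datetime import datetime, timedelta


def count_distinct_in_window(rows, key, as_of, window=timedelta(days=7)):
    """Count distinct currency codes seen for `key` in (as_of - window, as_of].

    `rows` is an iterable of (account_id, timestamp, currency_code) tuples.
    The boundary convention is an assumption made for this sketch.
    """
    start = as_of - window
    codes = {
        code
        for account, ts, code in rows
        if account == key and start < ts <= as_of
    }
    return len(codes)


# Hypothetical transactions, made up for illustration only.
rows = [
    ("a1", datetime(2022, 12, 1), "USD"),
    ("a1", datetime(2022, 12, 5), "EUR"),
    ("a1", datetime(2022, 12, 6), "USD"),   # duplicate code, counted once
    ("a1", datetime(2022, 11, 20), "JPY"),  # outside the 7-day window
    ("a2", datetime(2022, 12, 6), "GBP"),
]

result = count_distinct_in_window(rows, "a1", datetime(2022, 12, 7))
print(result, type(result).__name__)
```

The input column is a string, but the output is an `int`, so a STRING schema for `num_currency_type_in_week` would not match the value actually produced.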

Tracking information

No response

Code to reproduce bug

No response

What component(s) does this bug affect?

  • Python Client: This is the client users use to interact with most of our API. Mostly written in Python.
  • Computation Engine: The computation engine that executes the actual feature join and generation work. Mostly in Scala and Spark.
  • Feature Registry API: The frontend API layer, which supports SQL and Purview (Atlas) as storage. The API layer is in Python (FastAPI).
  • Feature Registry Web UI: The Web UI for the feature registry. Written in React.
loomlike added the bug label on Dec 10, 2022
blrchen added the good first issue label on Jan 5, 2023