java.lang.NoClassDefFoundError on Databricks #595

Open
nohanaga opened this issue Dec 30, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@nohanaga

I am using pyrasterframes v0.10.1 with Databricks 9.1 LTS ML (Apache Spark 3.1.2, Scala 2.12).
I get the following error and cannot access the DataFrame.

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-1914649719011683> in <module>
----> 1 df.head()

/databricks/spark/python/pyspark/sql/dataframe.py in head(self, n)
   1742         """
   1743         if n is None:
-> 1744             rs = self.head(1)
   1745             return rs[0] if rs else None
   1746         return self.take(n)

/databricks/spark/python/pyspark/sql/dataframe.py in head(self, n)
   1744             rs = self.head(1)
   1745             return rs[0] if rs else None
-> 1746         return self.take(n)
   1747 
   1748     def first(self):

/databricks/spark/python/pyspark/sql/dataframe.py in take(self, num)
    767         [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    768         """
--> 769         return self.limit(num).collect()
    770 
    771     def tail(self, num):

/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
    713         # Default path used in OSS Spark / for non-DF-ACL clusters:
    714         with SCCallSiteSync(self._sc) as css:
--> 715             sock_info = self._jdf.collectToPython()
    716         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
    717 

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o512.collectToPython.
: java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.rasterframes.ref.RFRasterSource$
	at org.locationtech.rasterframes.expressions.transformers.URIToRasterSource$.apply(URIToRasterSource.scala:62)
	at org.locationtech.rasterframes.datasource.raster.RasterSourceRelation.$anonfun$buildScan$6(RasterSourceRelation.scala:114)
	at scala.collection.TraversableLike$WithFilter.$anonfun$map$2(TraversableLike.scala:935)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:934)
	at org.locationtech.rasterframes.datasource.raster.RasterSourceRelation.buildScan(RasterSourceRelation.scala:113)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:458)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:69)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:69)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:100)
	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:78)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$4(QueryPlanner.scala:85)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
	at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:100)
	at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:78)
	at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:621)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:215)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:268)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:265)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:265)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:215)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:208)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:227)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:227)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:222)
	at com.databricks.sql.transaction.tahoe.metering.DeltaMetering$.reportUsage(DeltaMetering.scala:136)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:303)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:386)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:186)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:141)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:336)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3949)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3737)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Code

from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()
df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF')

# no error
df.printSchema()

# error
df.head()

# error
df.count()
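
That df.printSchema() works while df.head() and df.count() fail is consistent with Spark's lazy evaluation: the schema is resolved at planning time, while the actions build a scan, and RasterSourceRelation.buildScan is the first code path that touches the RFRasterSource$ companion object (see the trace above). As a diagnostic sketch (run in a fresh session, before any other RasterFrames calls), one could force class initialization through the Py4J gateway to surface the original initializer failure rather than the follow-up NoClassDefFoundError:

# Sketch only: force static initialization of the failing companion object.
# On a fresh JVM this should raise the original ExceptionInInitializerError,
# whose cause is the real problem; once initialization has failed, later
# attempts only repeat the NoClassDefFoundError seen above.
jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName("org.locationtech.rasterframes.ref.RFRasterSource$")
except Exception as e:
    print(e)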

Prerequisites

I followed this guide:
https://www.databricks.com/notebooks/rasterframes-notebook.html

  1. Set up the GDAL init script
    #!/bin/bash
    sudo add-apt-repository ppa:ubuntugis/ppa
    sudo apt-get update
    sudo apt-get install -y cmake gdal-bin libgdal-dev python3-gdal
    
    The script above was saved and registered as a cluster-scoped init script; a sketch of how it can be written from a notebook follows this list.
  2. Install RasterFrames on the cluster
    Databricks can only use Spark 3 or higher, so the latest rasterframes release, v0.10.1, was used.
    1. Download https://github.com/locationtech/rasterframes/archive/refs/tags/0.10.1.zip.
    2. Unzip it and run sbt publishLocal to obtain the pyrasterframes .jar and .whl:
      a. pyrasterframes-assembly-0.10.1.jar
      b. pyrasterframes-0.10.1-py3-none-any.whl
      Both files were registered as cluster libraries.
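
For reference, a minimal sketch of how the init script can be written to DBFS from a notebook with dbutils.fs.put (the target path is an assumption; use whatever location the cluster's init-script configuration points at):

# Sketch only: write the GDAL init script to DBFS from a notebook.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/gdal-install.sh",  # hypothetical path
    """#!/bin/bash
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get install -y cmake gdal-bin libgdal-dev python3-gdal
""",
    True,  # overwrite an existing file
)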

Execution conditions

The following was confirmed in a Databricks notebook.

  1. gdalinfo
    !gdalinfo --version
    GDAL 3.3.2, released 2021/09/01
  2. gdal_version, build_info
    from pyrasterframes.utils import gdal_version, build_info
    print(gdal_version())
    print(build_info())
    
    GDAL 3.3.2, released 2021/09/01
    {'scalaVersion': '2.12.15', 'sbtVersion': '1.5.5', 'name': 'core', 'rfSparkVersion': '3.1.2', 'rfGeoMesaVersion': '3.2.0', 'GDAL': 'GDAL 3.3.2, released 2021/09/01', 'rfGeoTrellisVersion': '3.6.1', 'version': '0.10.1'}
  3. Java Version
    !java -version
    
    openjdk version "1.8.0_345"
    OpenJDK Runtime Environment (Zulu 8.64.0.19-CA-linux64) (build 1.8.0_345-b01)
    OpenJDK 64-Bit Server VM (Zulu 8.64.0.19-CA-linux64) (build 25.345-b01, mixed mode)
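
Note that build_info() reports rfSparkVersion 3.1.2 and scalaVersion 2.12.15, which match the DBR 9.1 LTS runtime, so a plain Spark or Scala version mismatch seems unlikely.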

I would like to use pyrasterframes on Databricks, so please investigate.

@pomadchin
Member

Hello!

I think that's due to the codec in the RFRasterSource companion object; its implementation lives here: https://github.com/locationtech/rasterframes/blob/develop/core/src/main/scala/org/locationtech/rasterframes/encoders/StandardEncoders.scala#L87

I would assume that's due to the frameless dependency on shapeless.

Unfortunately, at this point the sbt assembly should already be properly built and shaded to avoid shapeless / frameless dependency issues.

It's a good question when DB plans to upgrade their environment to depend on more up-to-date library versions.
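
One hedged way to check this on the cluster is to list the shapeless entries inside the assembly jar and see whether they live under the shaded prefix (the jar path below is hypothetical):

# Sketch only: inspect shading inside the assembly jar. If shading worked,
# shapeless classes should appear under shaded/rasterframes/shapeless/
# rather than shapeless/.
import zipfile

jar_path = "/databricks/jars/pyrasterframes-assembly-0.10.1.jar"  # hypothetical path
with zipfile.ZipFile(jar_path) as jar:
    for name in jar.namelist():
        if "shapeless" in name:
            print(name)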

@pomadchin pomadchin added the enhancement New feature or request label Dec 30, 2022
@nohanaga
Author

@pomadchin Thanks for the response.
Is this problem due to the version of the library on the Databricks side?
Should I look into the shapeless library?

@pomadchin
Member

pomadchin commented Dec 31, 2022

I think it's more about the way rasterframes is loaded; I feel like there can be an issue with how the RF jar is built / used within the cluster.

Could you somehow trigger a more precise error, e.g. by calling https://github.com/locationtech/rasterframes/blob/develop/core/src/main/scala/org/locationtech/rasterframes/encoders/StandardEncoders.scala#L87 explicitly from the sbt shell?

I kinda had a feeling that everything that needs to be shaded is shaded 🤔

assembly / assemblyShadeRules := {
  val shadePrefixes = Seq(
    "shapeless",
    "com.github.mpilquist",
    "com.amazonaws",
    "org.apache.avro",
    "org.apache.http",
    "com.google.guava",
    "com.google.common",
    "com.typesafe.config",
    "com.fasterxml.jackson",
    "io.netty",
    "spire",
    "cats.kernel"
  )
  shadePrefixes.map(p => ShadeRule.rename(s"$p.**" -> s"shaded.rasterframes.$p.@1").inAll)
}
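
Each ShadeRule.rename rewrites the given package prefix to shaded.rasterframes.<prefix> inside the assembly (so e.g. shapeless.HList becomes shaded.rasterframes.shapeless.HList), and .inAll also rewrites the bytecode references in all bundled dependencies, which is what should keep the bundled shapeless from clashing with whatever version Databricks puts on the cluster classpath.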
