Skip to content

OctopuFS library helps managing cloud storage, ADLSgen2 specifically. It allows you to operate on files (moving, copying, setting ACLs) in very efficient manner. Designed to work on databricks, but should work on any other platform as well.

License

Notifications You must be signed in to change notification settings

procter-gamble-oss/octopufs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Checkout examples in databricks_notebooks_examples branch
Scaladoc link https://procter-gamble-tech.github.io/octopufs/#com.pg.bigdata.octopufs.package

OctopuFS

OctopuFS is Scala/Spark toolkit to manage cloud storage, especially ADLSgen2 directly from databricks. It provides several capabilities, which internally have retry mechanism built in, which will repeat unsuccessful operations up to 5 times :

File copy

com.pg.bigdata.octopufs.fs.DistributedExecution OctopuFS distributes copy operation to spark tasks and does data copy 3x faster than spark read/write operation while utilizing less CPU

Local multi-threaded operations

Many operations on ADLS are limited to HTTP requests only, thus they don't require significant fardware involvement and can be run on single machine. Operation on tens of thousands of files/folders take appox 1 minute. There operations inclide:

File move/rename/delete

com.pg.bigdata.octopufs.fs.LocalExecution

Setting up ACLs on files and folders (recursively)

com.pg.bigdata.octopufs.acl.AclManager

Getting size of files and folders

com.pg.bigdata.octopufs.fs.getSize

Hive metadata operations

com.pg.bigdata.octopufs.Promotor OctopuFS uses above functions on Hive metadata layer (i.e. Tableas and partitions) to enable operations currently not accessible for tables, which are not using Databricks Delta format abstraction.

Required setup of databricks cluster:

RDD API security setup For copy operation only it is recommended to turn of or tune spark speculation spark.conf.set("spark.speculation","false") Most methods require implicit parameter:

  • SparkSession – for distributed copy implicit val s = spark
  • Configuration – for local, multithreaded operation implicit val c = spark.sparkContext.hadoopConfiguration

How to get started

Clone and compile repository to get the latest version or download jar from artifact repositories. Once you have jar, upload it to spark cluster and run ot from scala notebook or from your own jar.
Please rememer to set up credentials like it was mentioned above.

How to get help

In case you find anny issue with the package, do not hesitate to open issue on github. Please be as specific as possible regarding the error and context/environment you were using when issue occured.

Maintainer

Jacek Tokar @ Procter&Gamble

About

OctopuFS library helps managing cloud storage, ADLSgen2 specifically. It allows you to operate on files (moving, copying, setting ACLs) in very efficient manner. Designed to work on databricks, but should work on any other platform as well.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages