Check out the examples in the databricks_notebooks_examples branch.
Scaladoc: https://procter-gamble-tech.github.io/octopufs/#com.pg.bigdata.octopufs.package

OctopuFS

OctopuFS is a Scala/Spark toolkit for managing cloud storage, especially ADLS Gen2, directly from Databricks. It provides the capabilities listed below. Each of them has a built-in retry mechanism that repeats an unsuccessful operation up to 5 times.
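To picture the "repeat up to 5 times" behaviour, here is a minimal sketch of a generic retry wrapper. It is not taken from the OctopuFS sources: `RetrySketch` and `retry` are hypothetical names, and the real implementation may differ (for example, it may wait between attempts).

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Hypothetical helper, not part of the OctopuFS API.
  // Evaluates `op` until it succeeds or `maxAttempts` attempts have been made,
  // and returns the last outcome.
  @tailrec
  def retry[T](maxAttempts: Int, attempt: Int = 1)(op: => T): Try[T] =
    Try(op) match {
      case s @ Success(_)                      => s
      case Failure(_) if attempt < maxAttempts => retry(maxAttempts, attempt + 1)(op)
      case f @ Failure(_)                      => f
    }
}

object RetrySketchDemo extends App {
  var calls = 0
  // Fails twice with a simulated transient error, then succeeds on the third attempt.
  val result = RetrySketch.retry(5) {
    calls += 1
    if (calls < 3) throw new RuntimeException("transient failure")
    "done"
  }
  println(s"$result after $calls calls")
}
```

Returning a `Try` rather than rethrowing leaves it to the caller to decide how to handle an operation that still fails after the last attempt.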

File copy

com.pg.bigdata.octopufs.fs.DistributedExecution

OctopuFS distributes the copy operation across Spark tasks and copies data 3x faster than a Spark read/write operation while using less CPU.

Local multi-threaded operations

Many operations on ADLS consist only of HTTP requests, so they do not require significant hardware resources and can be run on a single machine. Operations on tens of thousands of files/folders take approximately 1 minute. These operations include:

File move/rename/delete

com.pg.bigdata.octopufs.fs.LocalExecution

Setting up ACLs on files and folders (recursively)

com.pg.bigdata.octopufs.acl.AclManager

Getting size of files and folders

com.pg.bigdata.octopufs.fs.getSize
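The idea behind the local operations above, many small request-bound calls issued concurrently from one machine, can be sketched with a fixed thread pool and Scala futures. This is an illustration only: `LocalParallelSketch` and `runAll` are hypothetical names, the demo works on local temporary files rather than ADLS, and none of this reflects the actual LocalExecution API.

```scala
import java.nio.file.Files
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object LocalParallelSketch {
  // Hypothetical helper, not part of the OctopuFS API.
  // Applies `op` to every item on a fixed-size thread pool and returns the
  // results in input order. The first failure, if any, is rethrown.
  def runAll[A, B](items: Seq[A], threads: Int)(op: A => B): Seq[B] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.traverse(items)(item => Future(op(item))), 10.minutes)
    finally pool.shutdown()
  }
}

object LocalParallelSketchDemo extends App {
  // Create 100 empty files in a temporary directory, then delete them with 16 threads.
  val dir   = Files.createTempDirectory("octopufs-sketch")
  val files = (1 to 100).map(i => Files.createFile(dir.resolve(s"part-$i")))
  val deleted = LocalParallelSketch.runAll(files, threads = 16)(p => Files.deleteIfExists(p))
  println(s"deleted ${deleted.count(identity)} files")
  Files.delete(dir)
}
```

Because each call mostly waits on I/O, the thread count can be well above the number of CPU cores. The value 16 here is arbitrary.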

Hive metadata operations

com.pg.bigdata.octopufs.Promotor

OctopuFS applies the functions above at the Hive metadata layer (i.e. tables and partitions) to enable operations that are currently not available for tables that do not use the Databricks Delta format.

Required setup of Databricks cluster:

RDD API security setup

For the copy operation only, it is recommended to turn off or tune Spark speculation: spark.conf.set("spark.speculation","false")

Most methods require an implicit parameter:

SparkSession – for distributed copy: implicit val s = spark
Configuration – for local, multi-threaded operations: implicit val c = spark.sparkContext.hadoopConfiguration
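Put together, the setup lines above might sit at the top of a Databricks Scala notebook as follows. This sketch assumes the notebook's predefined `spark` session; the explicit type annotations are an addition for readability. On some Spark versions, core settings such as `spark.speculation` may need to be set in the cluster's Spark configuration rather than at runtime.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Recommended above for copy operations only: turn off speculative execution.
spark.conf.set("spark.speculation", "false")

// Implicit parameters expected by most OctopuFS methods.
implicit val s: SparkSession = spark                                     // distributed copy
implicit val c: Configuration = spark.sparkContext.hadoopConfiguration   // local, multi-threaded operations
```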

How to get started

Clone and compile the repository to get the latest version, or download the jar from an artifact repository. Once you have the jar, upload it to your Spark cluster and run it from a Scala notebook or from your own jar.
Please remember to set up credentials as described above.

How to get help

If you find any issue with the package, do not hesitate to open an issue on GitHub. Please be as specific as possible about the error and the context/environment you were using when the issue occurred.

Maintainer

Jacek Tokar @ Procter&Gamble
