Spark Commons, some hacks to simplify programming with Spark.

AnkushKhanna/spark-common

Chaining multiple transformers:

Whenever I needed more than one transformation, I had to either chain Transformers by hand or build a Pipeline.

A Pipeline is the go-to approach, but it can become cumbersome when experimenting with different transformations.

For example, you have to maintain the intermediate transformation columns and pass column names between transformers.
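To make that bookkeeping concrete, here is a minimal sketch of the same three steps written as a plain Spark ML Pipeline. It is not taken from this repository: the column names (`text`, `tokens`, `tf`, `tfidf`) and the `PipelineSketch.buildPipeline` helper are assumptions chosen for illustration, and the `spark-mllib` dependency is assumed to be on the classpath.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

object PipelineSketch {
  // Hypothetical helper: wires three stages together by column name.
  def buildPipeline(): Pipeline = {
    // Every stage needs an explicit input and output column (names are illustrative).
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("tokens")

    // The input column must match the previous stage's output column.
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("tf")

    val idf = new IDF()
      .setInputCol(hashingTF.getOutputCol)
      .setOutputCol("tfidf")

    new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
  }
}
```

Swapping a stage in or out means revisiting the column wiring of its neighbours, which is the overhead described above.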

So I built a Transform class that is extended by the transformers I use most often: Tokenizer, Hashing, and TFIDF.

To chain Tokenizer and Hashing, we can now write:

val transform = new Transform with TTokenize with THashing

To add TFIDF, the transformer becomes:

val transform = new Transform with TTokenize with THashing with TIDF

The transformers are applied from left to right: first Tokenizer, then Hashing, and finally TFIDF.

This makes it easier to try out different Transformer combinations without the headache of maintaining intermediate columns.
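One way to get this left-to-right behaviour in Scala is the stackable trait pattern, where each trait calls `super.transform` before doing its own work. The sketch below is a toy model and not the repository's implementation: the trait names mirror the ones used here, but the `Frame` type, the method bodies, and the column suffixes are all assumptions, and no Spark code is involved.

```scala
// Toy stand-in for a DataFrame: only the ordered column names are tracked.
final case class Frame(columns: Vector[String]) {
  def current: String = columns.last
  // Adds a new column whose name is derived from the most recent one.
  def add(suffix: String): Frame = Frame(columns :+ s"${current}_$suffix")
}

// Base of the chain: leaves the frame untouched.
class Transform {
  def transform(frame: Frame): Frame = frame
}

// Each trait first runs everything mixed in before it (super.transform),
// then adds its own output column on top of the latest one.
trait TTokenize extends Transform {
  override def transform(frame: Frame): Frame =
    super.transform(frame).add("tokens")
}

trait THashing extends Transform {
  override def transform(frame: Frame): Frame =
    super.transform(frame).add("tf")
}

trait TIDF extends Transform {
  override def transform(frame: Frame): Frame =
    super.transform(frame).add("idf")
}

object ChainingSketch {
  def main(args: Array[String]): Unit = {
    val transform = new Transform with TTokenize with THashing with TIDF
    val result = transform.transform(Frame(Vector("text")))
    println(result.columns.mkString(", "))
    // prints: text, text_tokens, text_tokens_tf, text_tokens_tf_idf
  }
}
```

Because the last mixed-in trait is called first and delegates to its predecessors through `super`, the leftmost trait finishes first, and each step picks up the previous step's output column without the caller naming it.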

See source code

See usage code

To extend this class with further transformations, see Source code extension
