Spark has two machine learning libraries, Spark MLlib and Spark ML, with very different APIs but similar algorithms. These machine learning libraries inherit many of the performance considerations of the RDD and Dataset APIs they are based on, but also have their own considerations.
MLlib is the first of the two libraries and is entering a maintenance/bug-fix only mode. Normally we would skip discussing Spark MLlib and focus on the new API; however, for existing algorithms not all of the functionality has been ported over to the new Spark ML API.
Spark ML is the newer, scikit-learn inspired machine learning library and is where new active development is taking place.
At first glance, the most obvious difference between MLlib and ML is the data types they work on, with MLlib supporting RDDs and ML supporting DataFrames and Datasets. The data format difference isn’t all that important since they both deal with RDDs and Datasets of vectors, which are easily represented and converted between the RDD and Dataset formats.
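As a rough illustration of how the two formats can be converted, the sketch below moves labeled points between an MLlib RDD and an ML-style DataFrame. It assumes Spark 2.0 or later (where `asML` and `Vectors.fromML` are available); the object name, the method names, and the column names `label` and `features` are chosen only for this example.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark itself.
object VectorConversionSketch {

  // MLlib -> ML: convert each MLlib vector with asML and build a DataFrame.
  def toDataFrame(spark: SparkSession, points: RDD[LabeledPoint]): DataFrame = {
    val rows = points.map(lp => (lp.label, lp.features.asML))
    spark.createDataFrame(rows).toDF("label", "features")
  }

  // ML -> MLlib: read the columns back and rebuild MLlib labeled points.
  // Assumes "label" is a double column and "features" holds ML vectors.
  def toLabeledPoints(df: DataFrame): RDD[LabeledPoint] =
    df.select("label", "features").rdd.map { row =>
      LabeledPoint(row.getDouble(0), MLlibVectors.fromML(row.getAs[MLVector](1)))
    }
}
```

The conversion itself is cheap per record; the larger cost is usually the extra pass over the data, so it is worth converting once rather than repeatedly.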
From a design philosophy point of view, Spark’s MLlib is focused on providing a core set of algorithms for people to use, while largely leaving the data pipeline, cleaning, preparation, and feature selection problems up to the user. Spark ML instead focuses on exposing a scikit-learn inspired pipeline API for everything from data preparation to model training.
Currently, if you need to do streaming or online training your only option is working with the MLlib APIs. Select algorithms in Spark MLlib support training on streaming data, using the Spark Streaming DStream API.
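To make the streaming case concrete, here is a minimal sketch that uses MLlib's `StreamingKMeans`, one of the algorithms that can train on a DStream. The input directory, the space-separated line format, the choice of two clusters, and the three-dimensional feature vectors are all assumptions made for this example.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper object, not part of Spark itself.
object StreamingTrainingSketch {

  // Registers online training on text files that appear in inputDir.
  // Each line is assumed to hold three space-separated doubles.
  def trainOnStream(ssc: StreamingContext, inputDir: String): StreamingKMeans = {
    val vectors = ssc.textFileStream(inputDir)
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(2)                  // number of clusters (assumed)
      .setDecayFactor(1.0)      // weigh all past batches equally
      .setRandomCenters(3, 0.0) // 3 = feature dimension (assumed)

    // The model is updated as each new batch arrives.
    model.trainOn(vectors)
    model
  }
}
```

Nothing runs until the caller invokes `ssc.start()`; after that, `model.latestModel()` returns the centers learned so far. A malformed line would fail the batch in this sketch, so real input should be validated first.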
Working with MLlib:
Many of the same performance considerations in working with Spark Core also directly apply to working with MLlib. One of the most direct ones is with RDD reuse; many machine learning algorithms depend on iterative computation or optimization, so ensuring your inputs are persisted at the right level can make a huge difference.
Supervised algorithms in the Spark MLlib API are trained on RDDs of labeled points, with unsupervised algorithms using RDDs of vectors. These labeled points and vectors are unique to the MLlib library, and separate from both Scala’s vector class and Spark ML’s equivalent classes.
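The sketch below shows one way to apply this: persist the input before handing it to an iterative trainer, then release it afterwards. The storage level, the choice of algorithms, and the parameter values (two classes, two clusters, 20 iterations) are assumptions for illustration; the right level depends on your data size and cluster memory.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of Spark itself.
object PersistBeforeTraining {

  // Persists rdd for the duration of body unless the caller already did.
  private def withPersisted[T, M](rdd: RDD[T])(body: RDD[T] => M): M = {
    val persistedHere = rdd.getStorageLevel == StorageLevel.NONE
    if (persistedHere) rdd.persist(StorageLevel.MEMORY_ONLY)
    try body(rdd)
    finally if (persistedHere) rdd.unpersist()
  }

  // Supervised: trained on an RDD of MLlib labeled points.
  def trainSupervised(points: RDD[LabeledPoint]): LogisticRegressionModel =
    withPersisted(points) { p =>
      new LogisticRegressionWithLBFGS().setNumClasses(2).run(p)
    }

  // Unsupervised: trained on an RDD of MLlib vectors.
  def trainUnsupervised(vectors: RDD[Vector]): KMeansModel =
    withPersisted(vectors) { v =>
      KMeans.train(v, 2, 20) // k = 2, maxIterations = 20 (assumed)
    }
}
```

Checking `getStorageLevel` first avoids the error Spark raises when you try to change the storage level of an RDD that already has one.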
MLlib Feature Encoding and Data Preparation:
Feature selection and scaling require that our data is already in Spark’s internal format, so before we cover those, let’s look at how to encode data into the required format.
Once the data is encoded, or sometimes during the process, many machine learning data pipelines filter out data to discard invalid or malformed records, which could cause problems in any resulting model. For this step Spark’s MLlib package expects you to do your filtering using Spark’s RDD transformations.
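As a minimal sketch of encoding and filtering with plain RDD transformations, the example below parses text lines into MLlib labeled points and drops records that fail to parse or that contain NaN or infinite values. The input format (comma-separated numbers with the label first) and all names here are assumptions for the example, not something MLlib prescribes.

```scala
import scala.util.Try

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark itself.
object EncodeAndFilter {

  // Encodes one line of the assumed format "label,f1,f2,...".
  // Returns None if a field is not numeric or there are no features.
  def parseLine(line: String): Option[LabeledPoint] = Try {
    val fields = line.split(',').map(_.trim.toDouble)
    require(fields.length >= 2, "need a label and at least one feature")
    LabeledPoint(fields.head, Vectors.dense(fields.tail))
  }.toOption

  // Rejects records whose label or features are NaN or infinite.
  def isValid(lp: LabeledPoint): Boolean = {
    def finite(d: Double): Boolean = !d.isNaN && !d.isInfinity
    finite(lp.label) && lp.features.toArray.forall(finite)
  }

  // flatMap drops unparseable lines during encoding; filter discards
  // the remaining invalid records afterwards.
  def encode(lines: RDD[String]): RDD[LabeledPoint] =
    lines.flatMap(line => parseLine(line)).filter(isValid)
}
```

Keeping `parseLine` and `isValid` as ordinary functions makes them easy to unit test without a cluster, and the resulting RDD can be persisted before training.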