Apache Mahout Interview Questions and Answers

Apache Mahout is an open source project by the Apache Software Foundation (ASF) with the primary goal of creating highly scalable machine-learning algorithms that are fast and free to use under the Apache license. Mahout’s core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the MapReduce paradigm. Mahout mainly supports three common machine-learning use cases: (1) user-based recommendation, which mines known user preferences and behaviours to predict new preferences for a user (there is also limited support for the related approach, item-based recommendation); (2) clustering, which uses a user-specified metric to look for similarities between data points and identify clusters, that is, groups of points that appear more similar to each other than to members of other groups; and (3) classification, which applies discrete labels to data or predicts a continuous value (e.g., a price) based on previous examples of similar data.

What is Apache Mahout?

Apache Mahout is a free and open source project. It is a library of scalable machine learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm. It implements the most popular machine learning techniques, such as recommendation, classification, clustering, and collaborative filtering. Mahout also contains Java libraries for common math operations, focused on statistics and linear algebra, as well as primitive Java collections. It provides data science tools for automatically finding interesting patterns in big data sets. Companies that use Mahout internally include Facebook, LinkedIn, Foursquare, Twitter, Yahoo, and Adobe.

What are the features of Apache Mahout?

Can you briefly explain Apache Mahout?

The Mahout project was started by several people involved in the Apache Lucene (open source search) community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore” but has since evolved to cover much broader machine-learning approaches. Mahout also aims to:

Build and support a community of users and contributors such that the code outlives any particular contributor’s involvement or any particular company or university’s funding.
Focus on real-world, practical use cases as opposed to bleeding-edge research or unproven techniques.
Provide quality documentation and examples.

What does Apache Mahout do?

Mahout supports four main data science use cases:

Collaborative filtering: It mines user behaviour and makes product recommendations (for example, Amazon-style recommendations).

Clustering: It takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.

Classification: It learns from existing categorizations and then assigns unclassified items to the best category.

Frequent item-set mining: It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
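As a rough illustration of the frequent item-set idea, the sketch below counts how often pairs of items appear in the same basket and keeps the pairs that meet a minimum support. It is plain Python, not Mahout code: the function name `frequent_pairs`, the `min_support` parameter, and the basket data are all made up for the example, and it uses naive pair counting rather than any algorithm Mahout itself implements.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping carts.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

With these four baskets, only bread and milk occur together often enough to pass a support threshold of 2.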

Can you explain Clustering in Mahout?

Clustering is the procedure of organizing the elements of a given data collection into groups based on the similarity between the items. In other words, clustering groups any form of data into sets of characteristically similar items. Mahout supports many different clustering mechanisms. The important ones are:

Canopy: It is used to create initial seeds for other clustering algorithms.
K-means or fuzzy k-means: It creates K clusters based on the distance of items from the cluster centres computed in the previous iteration.
Dirichlet: It creates clusters by combining one or more cluster models.
Mean shift: This algorithm doesn’t require any prior information about the number of clusters.
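To make the k-means description above concrete, here is a minimal single-machine sketch in plain Python. It is not Mahout’s implementation; the function name `kmeans`, the sample points, and the starting centres are assumptions chosen for illustration. Each pass assigns every point to the nearest centre from the previous iteration and then moves each centre to the mean of its points.

```python
import math


def kmeans(points, centres, iterations=10):
    """Plain k-means: assign points to the nearest centre, then recompute centres."""
    centres = [tuple(c) for c in centres]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centre from the previous iteration.
        assignment = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its assigned points.
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centres.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centres.append(old)  # an empty cluster keeps its old centre
        if new_centres == centres:
            break  # centres stopped moving, so the result has converged
        centres = new_centres
    return centres, assignment


# Hypothetical 2-D points and hand-picked starting centres.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centres, assignment = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
print(centres, assignment)
```

In practice the starting centres matter a great deal, which is why a seeding step such as Canopy is often run first.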

Can you explain how it is different from doing machine learning in R or SAS?

Unless you are highly proficient in Java, the coding itself is a big overhead. There’s no way around it: if you don’t know Java already, you are going to need to learn it, and it is not a language that flows! For R users who are used to seeing their thoughts realized immediately, the endless declaration and initialization of objects is going to seem like a drag. For that reason I would recommend sticking with R for any kind of data exploration or prototyping and switching to Mahout as you get closer to production.

Can you explain Recommendation engine?

A recommendation engine is a kind of information filtering system that predicts the rating or preference a user would give to an item. Using the Taste library, we can build a fast and flexible collaborative filtering engine. The primary components of the Taste library are:

Data model: users, items, and preferences.
User similarity: the similarity between two users.
Item similarity: the similarity between two items.
Recommender: provides recommendations.
User neighbourhood: computes a neighbourhood of similar users, which can be used by recommenders.
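The way these components fit together can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Taste API: the `ratings` data, the function names, and the choice of cosine similarity over co-rated items are all assumptions made for the example.

```python
import math

# Data model: user -> {item: preference}. All values are made up.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 5.0, "B": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 1.0},
}


def user_similarity(u, v):
    """User similarity: cosine similarity over items both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def neighbourhood(user, n):
    """User neighbourhood: the n users most similar to `user`."""
    others = [v for v in ratings if v != user]
    return sorted(others, key=lambda v: user_similarity(user, v), reverse=True)[:n]


def recommend(user, n_neighbours=2):
    """Recommender: score unseen items by similarity-weighted neighbour preferences."""
    totals, weights = {}, {}
    for v in neighbourhood(user, n_neighbours):
        sim = user_similarity(user, v)
        for item, pref in ratings[v].items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * pref
                weights[item] = weights.get(item, 0.0) + sim
    scores = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recs = recommend("alice")
print(neighbourhood("alice", 1), recs)
```

Here bob rates the shared items exactly as alice does, so he is her nearest neighbour, and item D (which alice has not rated) is the only candidate recommendation.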

Which machine learning algorithms does Apache Mahout support?

User- and Item-based Collaborative Filtering
Clustering: K-means, fuzzy k-means, streaming k-means, spectral clustering
Naive Bayes and Complementary Naive Bayes
Matrix Factorization with ALS and ALS on Implicit Feedback
Weighted Matrix Factorization, SVD++, Singular Value Decomposition
Stochastic SVD
PCA
QR Decomposition
Random Forest
Lanczos Algorithm
Hidden Markov Models
Sparse TF-IDF Vectors from Text
Logistic Regression (trained via SGD)
Multilayer Perceptron
Latent Dirichlet Allocation
Frequent Pattern Mining
RowSimilarityJob
ConcatMatrices
Collocations

Can you explain the difference between Apache Mahout and Apache Spark’s MLlib?

Mahout is built on Hadoop MapReduce, while MLlib is built on Spark. More specifically, the difference comes from the per-job overhead. If your ML algorithm maps to a single MR job, the main difference is only the start-up overhead, which is dozens of seconds for Hadoop MR and, let’s say, 1 second for Spark. So for model training it is not that important. Things are different if your algorithm maps to many jobs. In that case the same overhead is paid on every iteration, and it can be a game changer. For example, suppose we need 100 iterations, each needing 5 seconds of cluster CPU, with a per-job overhead of 30 seconds on Hadoop MR and 1 second on Spark.

On Hadoop MR (Mahout), it will take 100*5 + 100*30 = 3500 seconds.

On Spark, it will take 100*5 + 100*1 = 600 seconds.

At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
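The overhead arithmetic in this answer can be checked with a small helper. The function name `total_seconds` is made up for illustration, and the 30-second and 1-second overheads are the rough figures assumed in the example above, not measurements.

```python
def total_seconds(iterations, compute_per_iteration, overhead_per_job):
    """Total time when every iteration is launched as a separate job."""
    return iterations * (compute_per_iteration + overhead_per_job)


# 100 iterations of 5 seconds each, with the assumed per-job overheads.
hadoop_mr = total_seconds(100, compute_per_iteration=5, overhead_per_job=30)
spark = total_seconds(100, compute_per_iteration=5, overhead_per_job=1)
print(hadoop_mr, spark)
```

With a single job the gap is only the one-off start-up cost, but it grows linearly with the number of iterations.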

Can you mention some use cases of Apache Mahout?

Commercial Use:

Adobe AMP uses Mahout’s clustering algorithms to increase video consumption by better user targeting.
Accenture uses Mahout as a typical example for their Hadoop Deployment Comparison Study.
AOL uses Mahout for shopping recommendations.
Booz Allen Hamilton uses Mahout’s clustering algorithms.
Buzzlogic uses Mahout’s clustering algorithms to improve ad targeting.
tv uses modified Mahout algorithms for content recommendations.
DataMine Lab uses Mahout’s recommendation and clustering algorithms to improve its clients’ ad targeting.
Drupal uses Mahout to provide open source content recommendation solutions.
Evolv uses Mahout for its Workforce Predictive Analytics platform.
Foursquare uses Mahout for its recommendation engine.
Idealo uses Mahout’s recommendation engine.
InfoGlutton uses Mahout’s clustering and classification for various consulting projects.
Intel ships Mahout as part of their Distribution for Apache Hadoop Software.
Intela uses implementations of Mahout’s recommendation algorithms to select new offers to send to customers, as well as to recommend potential customers for current offers. It is also working on enhancing its offer categories by using the clustering algorithms.
IOffer uses Mahout’s Frequent Pattern Mining and Collaborative Filtering to recommend items to users.
Kauli, a Japanese ad network, uses Mahout’s clustering to handle click stream data for predicting audiences’ interests and intents.
LinkedIn has historically used R for model training. It has recently started experimenting with Mahout for model training.
Lucid Works Big Data uses Mahout for clustering, duplicate document detection, phrase extraction and classification.
Mendeley uses Mahout to power Mendeley Suggest, a research article recommendation service.
Mippin uses Mahout’s collaborative filtering engine to recommend news feeds.
Mobage uses Mahout in their analysis pipeline.
Myrrix is a recommender system product built on Mahout.
NewsCred uses Mahout to generate clusters of news articles and to surface the important stories of the day.
Next Glass uses Mahout.
Predixion Software uses Mahout’s algorithms to build predictive models on big data.
Radoop provides a drag-n-drop interface for big data analytics, including Mahout clustering and classification algorithms.
ResearchGate, the professional network for scientists and researchers, uses Mahout’s recommendation algorithms.
Sematext uses Mahout for its recommendation engine.
com uses Mahout’s collaborative filtering engine to recommend member profiles.
Twitter uses Mahout’s LDA implementation for user interest modeling.
Yahoo! Mail uses Mahout’s Frequent Pattern Set Mining.
365Media uses Mahout’s Classification and Collaborative Filtering algorithms in its Real-time system named UPTIME and 365Media/Social.

Academic Use:

The Dicode project uses Mahout’s clustering and classification algorithms on top of HBase.
The course Large Scale Data Analysis and Data Mining at TU Berlin uses Mahout to teach students about the parallelization of data mining problems with Hadoop and MapReduce.
Mahout is used at Carnegie Mellon University as a comparable platform to GraphLab.
The ROBUST project, co-funded by the European Commission, employs Mahout in the large scale analysis of online community data.
Mahout is used for research and data processing at Nagoya Institute of Technology, in the context of a large-scale citizen participation platform project, funded by the Ministry of Interior of Japan.
Several researchers at the Digital Enterprise Research Institute, NUI Galway, use Mahout for tasks such as topic mining and modeling of large corpora.
Mahout is used in the NoTube EU project.