Data Processing and Analysis: Big Data

Big data refers to information resources whose volume, velocity and variety require particular technologies and analytical methods to generate value.
In this project, we apply artificial intelligence to such data, more precisely to perform data clustering.

Descriptive part

This subject allows me to deepen my knowledge of clustering algorithms using Python and artificial intelligence.
The algorithms rely mainly on the scikit-learn library.

Here we can see the difference between classification and clustering.
In classification, you have a set of predefined classes and you want to know to which class a new object belongs.
Clustering, by contrast, groups a set of objects and looks for relationships between them.
In machine-learning terms, classification is supervised learning and clustering is unsupervised learning.

Everything is organized in the form of three practical works: the use of K-Means, agglomerative clustering and DBSCAN. For more information, I invite you to download and read my report here.

TP N°1:
K-Means

K-Means relies on variance to define its clusters: a cluster groups points with equal variance. It assumes that the clusters are convex in shape. The method requires the number of clusters K to be known in advance and is applicable in many fields of use.
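As a minimal sketch of these constraints (on synthetic blob data, not the TP's dataset; the choice of K=3 is purely illustrative), scikit-learn's KMeans must be given K up front:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 points drawn around 3 convex (blob-shaped) centers in 2D.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K must be chosen in advance; n_init restarts avoid poor local minima.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (3, 2): one 2D centroid per cluster
```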

The agglomerative clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named dendrogram.
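The bottom-up merging described above can be sketched as follows (synthetic data; SciPy's `linkage` is used here as one way to materialize the dendrogram structure):

```python
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Each point starts as a singleton cluster; Ward linkage repeatedly merges
# the pair of clusters whose fusion least increases within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# The full merge history (the dendrogram's structure): n-1 = 49 merge steps,
# each row = (cluster_i, cluster_j, merge_distance, new_cluster_size).
Z = linkage(X, method="ward")
print(Z.shape)  # (49, 4)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree-based representation.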

DBSCAN is a well-known unsupervised clustering algorithm. DBSCAN iterates over the points of the dataset. For each point it analyzes, it constructs the set of points density-reachable from it: it computes the epsilon-neighborhood of the point, then, if this neighborhood contains at least n_min points, the epsilon-neighborhoods of each of them, and so on, until the cluster cannot be enlarged anymore. If the point considered is not a core point, i.e. it does not have enough neighbors, it is labeled as noise. This makes DBSCAN robust to outliers, since this mechanism isolates them instead of forcing them into a cluster.
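A small sketch of this behavior (on scikit-learn's two-moons toy data, where non-convex shapes would defeat K-Means; the eps and min_samples values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving, non-convex half-circles.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps = radius of the epsilon-neighborhood; min_samples plays the role of
# n_min: the core-point threshold described above.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points that never reach the core-point threshold are labeled -1 (noise).
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2: one cluster per moon
```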

TP N°2:
Agglomerative Clustering & DBSCAN

TP N°3:
Real world dataset

This part gathers the three methods seen previously and applies them to a real dataset with 33 parameters. Here, all hyper-parameters are determined automatically.
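The overall workflow might look like the sketch below. The real 33-parameter dataset is not reproduced here, so a random stand-in matrix is used, and the hyper-parameter values are placeholders rather than the ones determined in the TP:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 33))  # stand-in for a 200-sample, 33-parameter dataset

# Features on different scales must be normalized before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Apply the three methods seen previously to the same scaled data.
results = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled),
    "agglomerative": AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled),
    "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled),
}
```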

Technical part

This section describes the context of the subject, my accomplishments and a summary of the skills I have acquired.

Presentation

This subject took place in December and consisted of 3 practical-work sessions plus personal work. I worked in a pair with KHALED Walid.

It focuses mainly on the use of the scikit-learn library for pre-processing and then processing the data. This is a simple and efficient way to discover the basics of clustering with artificial intelligence; indeed, scikit-learn is one of the most used libraries in the AI world. This subject was evaluated with a report available in the GitHub repository below.

Solution

The objective is to study 3 different types of data clustering: K-Means, Agglomerative Clustering and DBSCAN. Each type has its own characteristics and its own application domain. I studied these characteristics and applied the algorithms to the datasets provided, in order to observe their differences.

All these algorithms depend on parameters whose determination I automated using different metrics. However, these metrics do not always find the best parameters, so I sometimes had to intervene manually, if only to check that the algorithm leads to the right result.
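One common metric for this kind of automation is the silhouette score; the sketch below (on synthetic, well-separated blobs, not the TP's data) selects K for K-Means automatically, while leaving room for the manual check mentioned above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

# Score each candidate K; the highest silhouette is the automatic choice,
# but it still deserves a manual sanity check on harder data.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4
```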

Skills used

This subject draws on the knowledge acquired during my training in Artificial Intelligence, knowledge that was extended during my last internship at SII. Finally, over the last few years I have had the opportunity to self-train in regression, classification and data clustering.

Review

At the end of this subject, I am able to mobilize my knowledge and skills to develop a clustering application on real data, from preprocessing and normalizing the data according to their type to applying the 3 types of algorithms.

Analytical Part

This section presents a comprehensive analysis of all the knowledge and skills acquired during this experience.

Skills matrix

Know how to explore and represent data sets

Master Python

Master the complexity associated with statistical data processing and know the techniques used to minimise it

Further Explanation

Since I am in IT, I am expected to have a level of proficiency, not expertise, in the first skill. However, through my internship and self-study, I have been able to improve it greatly.

See related work

Click on the button below to go to my GitHub repo.

GitHub