CP based Sequence Mining on the Cloud using Spark

Schaus, Pierre
de Vogelaere, Cyril
2025-05-14
2017
https://hdl.handle.net/2078.2/7370

Sequential pattern mining (SPM) is a challenging problem concerned with discovering frequent patterns in a given sequence dataset. Due to its broad range of applications and its relative complexity, it has been widely studied over the last two decades, and a wide variety of efficient approaches have been developed, notably CP-based approaches, which combine efficiency with flexibility through the addition of constraints. With the advent of big data, which allows large-scale data processing through parallel computation, most specialised approaches evolved further and took a qualitative step forward in efficiency. However, CP-based approaches have yet to be adapted to support parallel computation, as parallel CP frameworks have yet to be developed. In this paper we thus propose a novel parallel sequential pattern mining approach based on constraint programming modules and designed to efficiently tackle large-scale data mining problems through parallel computation in a scalable environment. To this end, we use the recently designed CP-based algorithm PPIC, which outperforms both other CP-based approaches and specialised approaches by applying ideas from both data mining and CP on a generic constraint solver. PPIC locally solves the sub-problems generated by a generic, scalable, MapReduce-based PrefixSpan algorithm implemented using Spark. We then show through detailed experimentation on popular datasets that this novel implementation achieves strong performance in a scalable architecture without losing much flexibility in the supported constraints.

Keywords: CP, Sequential Pattern Mining, Spark, Big Data, PPIC, PrefixSpan, MapReduce

text::thesis::master thesis
thesis:10532
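
The decomposition described in the abstract (sub-problems generated by a PrefixSpan-style split, each solved locally, results merged) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes sequences of single items, replaces both Spark and the PPIC constraint solver with plain Python, and all function names (`project`, `frequent_items`, `mine_local`, `mine_partitioned`) are hypothetical.

```python
from collections import Counter


def project(database, item):
    """For each sequence containing `item`, keep the suffix after its first occurrence."""
    return [seq[seq.index(item) + 1:] for seq in database if item in seq]


def frequent_items(database, min_support):
    """Count each item once per sequence and keep those reaching min_support."""
    counts = Counter()
    for seq in database:
        counts.update(set(seq))
    return {item: n for item, n in counts.items() if n >= min_support}


def mine_local(prefix, database, min_support):
    """Depth-first prefix-projection search on one sub-problem.

    Assumption: this plain recursion stands in for the local solver
    (PPIC in the thesis); it is not a CP model.
    """
    results = {}
    for item, support in frequent_items(database, min_support).items():
        pattern = prefix + (item,)
        results[pattern] = support
        results.update(mine_local(pattern, project(database, item), min_support))
    return results


def mine_partitioned(database, min_support):
    """Split by frequent first item, solve each part independently, merge."""
    # "Map" step: one independent sub-problem per frequent first item.
    sub_problems = [
        ((item,), support, project(database, item))
        for item, support in frequent_items(database, min_support).items()
    ]
    # Each sub-problem could run on a separate worker; here it is a simple loop.
    partial_results = []
    for prefix, support, projected in sub_problems:
        local = {prefix: support}
        local.update(mine_local(prefix, projected, min_support))
        partial_results.append(local)
    # "Reduce" step: pattern sets are disjoint (distinct first items), so merge them.
    merged = {}
    for local in partial_results:
        merged.update(local)
    return merged


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
    for pattern, support in sorted(mine_partitioned(db, 2).items()):
        print(pattern, support)
```

Because every sub-problem starts from a different first item, the partial results never overlap, so merging them requires no deduplication. This is what makes the split suitable for distribution across workers.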