By Dr Jack Yang
Over the past decade, as data has accumulated exponentially, the challenges associated with the four “V”s of Big Data – volume (number of records), variety (different data formats), velocity (streaming data), and veracity (data uncertainty) – have continued to multiply. This is a typical Big Data processing scenario, and one that is difficult to address with traditional analysis methods. There is therefore an increasing need for alternative frameworks for Big Data analysis.
In this presentation, we report our research on developing a Big Data analytics capability at SMART. The implemented system, known as the Smart Learning framework, builds on several open-source frameworks, including Apache Hadoop and Spark. The system addresses challenges such as fast processing and massive data storage and, more importantly, how very large datasets can be used for pattern mining, analysing user behaviours, and visualising and tracking data.
The proposed system has been successfully applied to three domains: social media, transportation modelling, and educational data (student behaviours). In the social media domain, we investigate a niche subset of user-generated popular culture content on Douban, a well-known Chinese-language online social network. Millions of samples are harvested via an asynchronous scraping crawler. We then process heterogeneous features extracted from the raw samples to facilitate analysis of film details, review comments, user profiles and users’ networks on Douban. In addition, parallel machine learning algorithms are proposed for content-mining functions.
For transportation modelling, we apply the proposed framework to explore spatio-temporal effects on car parking. A fuzzification algorithm is also introduced to quantify the key attributes of the data, which helps to remove data redundancy and inconsistency. Using more than half a million records, the proposed framework is able to make on-the-fly predictions within one minute with 85% accuracy.
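The presentation abstract does not describe the fuzzification algorithm itself, so the following is only a minimal sketch of the general idea, assuming triangular membership functions over a hypothetical parking "occupancy rate" attribute in the range 0.0 to 1.0. The function names, the set labels (`low`, `medium`, `high`) and the breakpoints are all illustrative assumptions, not the framework's actual design.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set.

    a and c are the feet of the triangle (membership 0), b is the peak
    (membership 1).
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets for an occupancy rate between 0.0 and 1.0.
# The outer feet extend past the valid range so that 0.0 is fully "low"
# and 1.0 is fully "high".
OCCUPANCY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(value, fuzzy_sets=OCCUPANCY_SETS):
    """Map a crisp value to its membership degree in every fuzzy set."""
    return {label: triangular(value, *abc) for label, abc in fuzzy_sets.items()}


def dominant_label(value, fuzzy_sets=OCCUPANCY_SETS):
    """Return the label of the fuzzy set with the highest membership."""
    memberships = fuzzify(value, fuzzy_sets)
    return max(memberships, key=memberships.get)


if __name__ == "__main__":
    for rate in (0.0, 0.25, 0.79, 0.81):
        print(rate, fuzzify(rate), dominant_label(rate))
```

Under these assumptions, near-duplicate readings such as 0.79 and 0.81 fall under the same dominant label, which is one plausible way a fuzzified attribute could reduce redundancy and smooth over small inconsistencies in raw records.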
Last but not least, the proposed system is also used to analyse student behaviours. We are working with our industry partner to deploy 50,000 laptops to disadvantaged primary schools across Australia (the One Education program). We then collect large volumes of behaviour data from these laptops, model the behaviours, and identify areas of complex, unique and common digital practices and patterns of use.
More recently, we have been working with colleagues on other domains, such as online streaming and human protein-protein interaction medication.
This blog post relates to a SMART Seminar Series presentation of the same name, which Dr Yang presented on 28 April 2016.