by Jose Sealtiel Cruz
How can we defeat dengue, (apparently) decide on what is next for society in a pandemic, and read a Facebook feed that is tailor-made for us? With data, obviously.
Data is, to put it simply, raw information. It can be anything, especially in this age: demographics, the words you use, location, number of hospital beds, clicks, feelings, and the time you spend on something can be treated as data. It can be beyond numbers and letters: characters, images, and observations can be treated as data, among others.
When treated properly, data gives information, which is what makes data indispensable and powerful in any field and industry for virtually all applications. The evolution of what we consider as data has opened so many possibilities that it is considered a cornerstone of the Fourth Industrial Revolution, which moves away from the dominantly physical (think steampunk) themes of the first three Industrial Revolutions and instead focuses on digital systems, “smart machines,” and copious amounts of data: Big Data.
With the abundance of data sprouts a new field: data science.
Beyond numbers
Statistics is an older and more ubiquitous field than data science. Though both deal with treating data to obtain information and solutions, statistics employs (strictly) mathematical methodologies to arrive at conclusions from random sets of quantifiable data. Data science, on the other hand, is a more interdisciplinary approach to handling data, one that aims to gain valuable insight from a filtered set of data.
This is not to say that the two fields do not overlap. Data science borrows from statistical methods as well as from other fields, from mathematics to computer science, economics, and management. It is simply better equipped to deal with Big Data because it developed alongside Big Data, which can be anything from structured database information (transactions, for instance) to messy collections of tweets or Facebook posts.
Big Data is big, and literally so: Facebook, for instance, reported in 2014 that it generated 4 petabytes of new data a day. One petabyte is equivalent to around a million gigabytes. Big Data is usually characterized by the three Vs: volume, velocity, and variety.
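To put the petabyte figure in more familiar terms, the conversion can be worked out in a few lines of Python. This is only a rough sketch: it assumes decimal (SI) units, where one petabyte is 10^15 bytes and one gigabyte is 10^9 bytes, and the function name is made up for illustration.

```python
# Assumption: decimal (SI) units, i.e. 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15


def petabytes_to_gigabytes(petabytes: int) -> int:
    """Convert a whole number of petabytes to gigabytes (hypothetical helper)."""
    return petabytes * BYTES_PER_PB // BYTES_PER_GB


# The 4 petabytes a day reported for Facebook in 2014, expressed in gigabytes.
daily_pb = 4
daily_gb = petabytes_to_gigabytes(daily_pb)
print(f"{daily_pb} PB a day is about {daily_gb:,} GB a day")
```

Under these assumptions, one petabyte comes out to a million gigabytes, so 4 petabytes a day is about four million gigabytes a day.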
Berkeley lists these five main processes in the data cycle:
- capture, or the procurement of data;
- maintain, or the cleaning, sorting, and storing of data;
- process, or the mining, modelling, or running of data in software to obtain information;
- analyze, or the verification and interpretation of the obtained information; and
- communicate, or the presentation of both data and results.
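The five processes above can be sketched as a toy pipeline. This is a minimal illustration, not Berkeley's own formulation: the function names simply mirror the five steps, the input records are made-up values, and the threshold of 10 is an arbitrary assumption. Real projects would rely on far larger datasets and dedicated tools.

```python
from statistics import mean


def capture():
    """Capture: procure raw records (made-up values, hard-coded for illustration)."""
    return ["12", " 15", "n/a", "9", "15 ", ""]


def maintain(raw):
    """Maintain: clean the records, drop unusable entries, and sort what remains."""
    cleaned = [int(r.strip()) for r in raw if r.strip().isdigit()]
    return sorted(cleaned)


def process(values):
    """Process: run the data through a 'model'; here, simple summary statistics."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}


def analyze(summary, threshold=10):
    """Analyze: interpret the result against a hypothetical threshold."""
    return {**summary, "above_threshold": summary["mean"] > threshold}


def communicate(result):
    """Communicate: present the result as a readable sentence."""
    status = "above" if result["above_threshold"] else "at or below"
    return (
        f"{result['count']} valid records, "
        f"mean {result['mean']:.2f}, {status} the threshold."
    )


report = communicate(analyze(process(maintain(capture()))))
print(report)
```

Each step takes the output of the one before it, which is the main point of the sketch: information only emerges once the raw data has been captured, cleaned, processed, and interpreted.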
Berkeley also listed the various software used in dealing with datasets, from R and Python, which are heavily used in statistics, to Apache Hadoop, MapReduce, and NoSQL, along with other skills ranging from programming to algorithms, artificial intelligence, and machine learning.
The immense need for data, however, has a glaring caveat: commodifying data, when left unchecked, makes data prone to abuse and can compromise a person’s right to privacy and data protection. This ethical challenge is discussed in depth in a dissertation by Ma. Angela Teresa G. Sebastian from the UP College of Law.
The wide variety of fields that data science encompasses makes a data scientist fit for any industry that involves dealing with data, which is to say, all industries. This makes data scientist one of the most sought-after jobs in the Philippines and beyond.
The Philippines, however, is only beginning to pick up the pace in producing more data scientists.
Keeping up with the neighbors?
The Philippines was the first Southeast Asian country to offer a BS Statistics degree program (starting in 1964), under the then-University of the Philippines Statistical Center, since renamed the University of the Philippines School of Statistics (UPSS). However, the country has yet to implement an undergraduate data science program, though a petition for one has already been filed by Rudy H. Tan, Ph.D. (retired UPSS Professor) and Lourdes A. Tan, M Stat.
Data science has for some time been offered through online courses on many MOOC (Massive Open Online Course) platforms, such as edX and Coursera, which offer data science education (and, on some sites, certification) for free.
According to an article by Edukasyon.ph, formal data science studies are currently offered at the graduate level in a few schools:
- Asian Institute of Management (AIM)
- University of the Philippines (UP)
- Ateneo de Manila University (ADMU), partnered with the London-based Queen Mary University
- De La Salle University (DLSU), partnered with Liverpool Hope University in the United Kingdom
Local schools and universities also offer data science subjects, either as electives or integrated into existing statistics and computer science undergraduate programs, and some argue that these existing undergraduate courses (along with industrial engineering) can serve as precursors to the data science field.
However, as the petitioners argued, to call someone a “scientist” is a high calling that requires expertise in their field, something that a certification or a “one-semester business analytics course,” such as those offered in the universities dubbed the Big Four, cannot provide; hence the call to create a new data science degree program.
The country has to catch up with the growing demand for data scientists across all sectors of society. With the amount of data waiting to be tapped, rolling out data science programs will produce not only globally competitive Filipinos but also a progressive Philippines with increasingly data-driven solutions.
The future is data, and that future is now.