
What is big data?

by Yasir Aslam


Table of contents

  • What is big data
  • Technology supporting big data
  • Big data and personal information
  • Summary

What is big data

Big data, as the name suggests, refers to collections of data so large that they cannot be handled by general-purpose data processing software. In information science, such data typically ranges from tens of terabytes to petabytes. The term was originally used in the context of data mining (extracting knowledge from large data sets), and with the spread of the cloud and smartphones around 2010 it became an important buzzword in business. In recent years, big data has most commonly been characterized by the "three Vs": volume, velocity, and variety.

Why Big Data Matters

The spread of smartphones and the rapid development of communication technology are driving up the amount of data circulating worldwide. According to Japan's Ministry of Internal Affairs and Communications, the volume of data distributed globally in 2020 was about 2.7 times that of 2015, growing at an average of 22% per year. In recent years, beyond user behavior data on the web, all kinds of information such as health data, environmental readings, and location data collected via IoT devices is sent to the cloud, and its use is advancing in many settings.

Simply storing such large amounts of data in the cloud creates little value on its own, but in recent years AI (machine learning) technology has matured to the point of practical use. Modern AI can extract features from data and perform classification, prediction, regression, and estimation with minimal human intervention.

There are many examples, at companies of every size, of big data and AI dramatically improving accuracy: Google's search engine and the recommendation systems of Amazon and Netflix are well-known cases. As a result, collecting big data across fields and analyzing it with AI to uncover new value is attracting worldwide attention as an unprecedented business opportunity.

Technology supporting big data

NoSQL (Not only SQL)

NoSQL is an umbrella term for databases other than RDBMSs (relational database management systems such as MySQL and PostgreSQL), which store data in tables of rows and columns. Many NoSQL databases store data as key-value pairs and are optimized for read/write speed. Because data is held in a simple format, they make it easy to store varied, unstructured data. NoSQL is also available on cloud services such as Alibaba Cloud's Table Store.
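The key-value model described above can be sketched with a plain Python dictionary. This is an illustration of the data model only, not any particular NoSQL product: records of different shapes coexist under unique keys with no fixed schema.

```python
# Minimal sketch of the key-value model: each record is stored whole
# under a unique key, with no schema shared across records.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)

# Heterogeneous, unstructured records coexist without schema changes,
# unlike rows in an RDBMS table.
put("user:1", {"name": "Alice", "age": 30})
put("user:2", {"name": "Bob", "tags": ["iot", "sensor"]})
put("log:2020-07-07", "raw text line from a device")

print(get("user:2")["tags"])  # ['iot', 'sensor']
```

Lookups and writes are single-key operations, which is why this model scales horizontally so easily; the trade-off is that cross-record queries and joins, which an RDBMS handles natively, become the application's job.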

MapReduce and Hadoop

In October 2003, Google published a paper on its distributed file system, the Google File System, and in December 2004 another on a big data processing technology called MapReduce. These papers brought about a major change in the processing of data sets ranging from several terabytes to several petabytes, which had previously been considered intractable. Although Google did not release its implementation, the Apache Nutch project, an open-source search engine, adopted the technique, and the work was later spun off as Apache Hadoop. In 2006, Yahoo adopted Hadoop for the backend of its search engine, and the project rapidly accelerated; it was subsequently used by many IT companies such as Facebook, Twitter, and LinkedIn.

The mechanics of Hadoop were revolutionary at the time, greatly reducing the cost of structuring and storing large amounts of data. However, Hadoop did not support high-speed query processing, which left it poorly suited to workloads requiring real-time responses.
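The MapReduce model described above can be sketched with the classic word-count example. This is a single-process toy for illustration only; a real framework like Hadoop distributes the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework would
    # when routing them to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["big data big value", "data at scale"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 2 2
```

Because each map call touches one document and each reduce call touches one key, both phases parallelize naturally, which is the core insight that made terabyte-to-petabyte processing tractable.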

Deep learning

Deep learning is famous for triggering the third AI boom, but the technology's origins date back to 1957, when the perceptron, a method modeled loosely on neurons in the brain, was devised. In 2006, a method of stacking networks in multiple layers was proposed, but the computing power (CPU, GPU) available at the time was insufficient for it to perform well. From around 2010, hardware finally became powerful enough to support the technique. Deep learning then surprised the world repeatedly, from identifying cat images to defeating the world champion of Go, and behind those achievements was the big data on which the algorithms were trained.

Big data and personal information

Revision of the Personal Information Protection Law

As mentioned above, with the rapid development of technology, the amount, quality, and variety of information handled in the world have changed greatly. In 2017, the Japanese government put into force the revised Act on the Protection of Personal Information to ensure that information is used appropriately under these circumstances. The revised law added personal identification codes, established the category of special care-required personal information, required traceability, and, as a form of deregulation, established the category of anonymously processed information.

Among these, the provisions on anonymously processed information are especially significant for the use of big data. By removing or transforming personally identifying elements from the data they hold, maintaining a safety management system, and publicly disclosing their use of anonymously processed information, companies become exempt from certain rules governing personal information.

GDPR and Cookie Restrictions

The move toward personal information protection is a major trend not only in Japan but worldwide. In Europe, the GDPR (EU General Data Protection Regulation) took effect in 2018, under which IP addresses and cookies are also treated as personal information. In January 2020, Google announced that it would phase out support for third-party cookies in the Chrome web browser within two years.

Secure computation and secure AI

Even when a company wants to have its data analyzed, such as personal DNA or health information, there are cases where it cannot provide the data externally because of the risk of leakage. Meanwhile, research on secure computation, a method of analyzing data while it remains encrypted, has progressed in recent years, and pilot projects are beginning, mainly in the financial sector. If this method becomes established, large amounts of data that could previously not be shared for analysis will become available, so the learning accuracy of AI is expected to improve dramatically.
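One of the simplest building blocks behind secure computation is additive secret sharing, which can be sketched as follows. This is a toy illustration only: each of two parties holds a random-looking share of a value, the parties add the shares they hold, and the sum is reconstructed without either party ever seeing the other's raw input. Real protocols involve far more machinery (multiplication, malicious-party protection, and so on).

```python
import random

# Toy sketch of additive secret sharing: arithmetic is done modulo a
# large prime so that individual shares reveal nothing about the secret.
MOD = 2**61 - 1

def share(secret):
    # Split a secret into two shares that sum to it modulo MOD.
    r = random.randrange(MOD)
    return r, (secret - r) % MOD

def reconstruct(share_a, share_b):
    return (share_a + share_b) % MOD

# Two inputs (e.g. sensitive figures from two data holders).
salary_x, salary_y = 400, 350
x_a, x_b = share(salary_x)
y_a, y_b = share(salary_y)

# Party A adds the shares it holds; party B does the same.
# Neither party sees the other's input in the clear.
sum_a = (x_a + y_a) % MOD
sum_b = (x_b + y_b) % MOD
print(reconstruct(sum_a, sum_b))  # 750
```

The key property is that addition can be performed directly on the shares, so the computation proceeds without ever decrypting the inputs; this is the sense in which data is "analyzed while encrypted."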

Summary

With IoT and edge computing, ever more information will be digitized and accumulated as big data. At the same time, the spread of the cloud is making it easy for companies to analyze the data they acquire. However, as laws and regulations on personal information are strengthened, information must be handled with even greater care. Companies need to face both sides: delivering value with big data and ensuring its safety.
