Right Before Korean Thanksgiving Day
Last night, before meeting my friend at midnight, I did a lot of preparation for the Chuseok holiday. Specifically, I installed a Hadoop cluster on the server in my lab through PuTTY, a free SSH client. Still, I did something really stupid. To set up the environment for connecting to the server through PuTTY, I learned a lot from various blogs, but then I changed a port without any clear idea of what I was doing. I felt I had to change port 80 to 22 when Hadoop gave me a message along the lines of 'localhost on port 22 cannot connect to the server'.
In fact, I should not have done that, because my school, KAIST, already blocks port 22 to stop would-be malicious users from connecting to the school's servers. I just wasn't thinking. Now I cannot connect to my server and cannot do anything to study and analyze Hadoop's code, which was my plan for the holiday.
Damn........
So I have made up my mind to simply study the Hadoop textbook instead.
Chapter 1. Meet Hadoop
The trend is for every individual's data footprint to grow, but perhaps more important, the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions - all of these contribute to the growing mountain of data.
Although the storage capacities of hard drives have increased massively over the years, access speeds - the rate at which data can be read from drives - have not kept up, so reading every byte off a single large drive can take hours. The obvious way to reduce that time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. Plus, we can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time.
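As a rough sanity check on the "under two minutes" claim, here are a few lines of back-of-the-envelope Python; the 1 TB dataset size and 100 MB/s transfer rate are my own assumed figures, not numbers from the text above:

```python
# Rough arithmetic behind "under two minutes": scan time for one drive vs. 100 drives.
# The 1 TB dataset size and 100 MB/s transfer rate are assumed, illustrative figures.
dataset_bytes = 1 * 10**12       # assume a 1 TB dataset
transfer_rate = 100 * 10**6      # assume ~100 MB/s sequential read per drive
num_drives = 100

single_drive_secs = dataset_bytes / transfer_rate
parallel_secs = single_drive_secs / num_drives  # each drive holds 1/100 of the data

print(f"one drive:  {single_drive_secs / 3600:.1f} hours")   # ~2.8 hours
print(f"100 drives: {parallel_secs / 60:.1f} minutes")       # ~1.7 minutes
```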
There is more to it than being able to read and write data in parallel to or from multiple disks, though.
Problems
1) Hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication.
This is how RAID works, for instance, although Hadoop's file system takes a slightly different approach.
2) Combining data: most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values (see the sketch below).
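To make the "computation over sets of keys and values" concrete, here is a minimal single-machine sketch of the MapReduce programming model in Python (word count). It only illustrates the model; it is not how Hadoop actually runs a job:

```python
from collections import defaultdict

# Map phase: turn each input record (a line of text) into (key, value) pairs.
def map_fn(line):
    for word in line.split():
        yield word, 1

# Reduce phase: combine all values that share the same key.
def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    # Shuffle: group the mapped values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["to be or not to be", "to see or not to see"]))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```

In a real Hadoop job the map and reduce functions look similar, but the framework takes care of splitting the input, shuffling by key, and running everything in parallel across machines.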
In a nutshell, this is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce.
MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative.
Why can't we use databases with lots of disks to do large-scale batch analysis?
Why is MapReduce needed?
- Seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate.
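A rough comparison with assumed order-of-magnitude numbers (about 10 ms per seek, about 100 MB/s transfer rate, tiny records) shows how badly seek-dominated access loses once a large portion of the dataset is touched:

```python
# Compare reading 1% of a 1 TB dataset via random seeks (point reads)
# versus streaming through the whole dataset at the transfer rate.
# All numbers are assumed, order-of-magnitude figures.
seek_time = 0.010                # ~10 ms per seek
transfer_rate = 100 * 10**6      # ~100 MB/s
dataset_bytes = 1 * 10**12       # 1 TB
record_bytes = 100               # tiny records, so seek time dominates each read

records_to_read = (dataset_bytes // record_bytes) // 100  # touch 1% of the records
seek_based_secs = records_to_read * seek_time             # one seek per record
streaming_secs = dataset_bytes / transfer_rate            # read everything sequentially

print(f"seek-based reads of 1% of records: {seek_based_secs / 3600:.0f} hours")  # ~278 hours
print(f"streaming the entire dataset:      {streaming_secs / 3600:.1f} hours")   # ~2.8 hours
```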
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in a relational database management system, or RDBMS, which is limited by the rate at which it can perform seeks) works well, whereas for updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Another difference between an RDBMS and MapReduce is the amount of structure in the datasets on which they operate. The realm of the RDBMS is structured data: data organized into entities that have a defined format, such as XML documents or database tables that conform to a particular schema. MapReduce, by contrast, works well on unstructured data such as plain text and image data, which has no such predefined structure, because MapReduce is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data; they are chosen by the person analyzing the data.
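A tiny sketch of what "interpreting the data at processing time" means: the same plain-text line can be turned into different (key, value) pairs depending on what the analyst wants to compute. The log format here is made up purely for illustration:

```python
# One unstructured log line; the "schema" exists only in the parsing code we choose to run.
line = "2013-09-17 23:59:01 host42 GET /index.html 5120"

date, time, host, method, path, size = line.split()

# Analyst A: bytes served per host.
key_value_a = (host, int(size))        # ('host42', 5120)

# Analyst B: request count per hour of day.
key_value_b = (time.split(":")[0], 1)  # ('23', 1)

print(key_value_a, key_value_b)
```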
Compatibility: three aspects to consider when you move from one release to another.
1) API compatibility - user programs may need to be modified and recompiled.
2) Data compatibility - concerning persistent data and metadata formats
3) Wire compatibility - concerns clients that connect to servers via wire protocols such as RPC and HTTP. There are two types of clients: 1) external clients and 2) internal clients.