
What is Ceph storage?

Latest reply: Nov 26, 2021 13:29:25

Hello all,

In this post, I will introduce the concept of Ceph storage.

1. General

Ceph is a high-performance, scalable distributed storage system with no single point of failure, based on Sage Weil's doctoral research. Weil began developing Ceph in 2004 and released it as open source in 2006. He was a founder of Inktank Storage, a company focused on Ceph until it was acquired by Red Hat in 2014, and Ceph released its first stable version in 2012.


Ceph provides the following three storage services:


Object storage is compatible with Amazon S3 and OpenStack Swift. Data can be accessed or stored as objects either through Ceph's libraries (with bindings for C, C++, Java, Python, and PHP) or through a RESTful gateway.


Block storage can be mounted directly as a block device, like a hard disk, and supports thin provisioning, snapshots, and clones.


The file system is mounted like a network file system and is compatible with POSIX interfaces.


2. Ceph Features

High performance: Ceph abandons the traditional centralized metadata addressing scheme and instead uses the CRUSH algorithm to distribute data evenly with high parallelism. CRUSH also takes failure-domain isolation into account, supporting replica placement rules for various workloads, such as cross-equipment-room placement and rack awareness.


High availability: the number of replicas can be flexibly controlled; failure domains are isolated; data is strongly consistent; the cluster automatically recovers and self-heals in various fault scenarios; there is no single point of failure (SPOF), and management is automatic.


High scalability: decentralized, with flexible expansion; capacity and performance grow roughly linearly as nodes are added; supports thousands of storage nodes and data volumes from terabytes to petabytes.


Rich features: supports three storage interfaces (block storage, file storage, and object storage), as well as user-defined interfaces and drivers in multiple languages.


3. System Architecture

Ceph's object storage is provided by LIBRADOS and RADOSGW, its block storage by RBD, and its file system by Ceph FS. RADOSGW, RBD, and Ceph FS all call LIBRADOS interfaces, and ultimately all data is stored as objects in RADOS.

[Figure: RADOS]



1. RADOS

The bottom layer of Ceph is RADOS, which stands for "Reliable, Autonomous, Distributed Object Store". Ceph object storage and Ceph block devices read and write data through the RADOS storage cluster. The LIBRADOS programming interface is the basis of the other client interfaces, which are all extended and implemented on top of LIBRADOS.

[Figure: RADOS]



Nodes in the Ceph cluster have three roles:


Monitor: maintains the global status of the entire Ceph cluster, monitors the cluster health status, and sends the latest CRUSH map (including the current network topology) to the client.


OSD: provides storage resources, maintains objects on nodes, responds to client requests, and synchronizes data with other OSD nodes.


MDS: Ceph Metadata Server, which is the metadata service on which Ceph FS depends.


2. Librados

Librados is a library provided by RADOS. The upper-layer RBD, RGW, and CephFS all access RADOS through Librados. Currently, it supports PHP, Ruby, Java, Python, C, and C++.



3. Ceph client interface (Clients)

LIBRADOS, RADOSGW, RBD, and Ceph FS are collectively called Ceph client interfaces. RADOSGW, RBD, and Ceph FS are developed based on the multi-programming language interfaces provided by LIBRADOS.


RADOS Gateway (RGW): the object storage service provided by Ceph. Its underlying object storage interface is built on Librados, and it provides RESTful interfaces to clients, supporting Amazon's S3 and OpenStack's Swift APIs.


Ceph FS: Ceph File System, the file system service that Ceph provides externally. It uses the Ceph storage cluster to store its data.


RBD: RADOS Block Device, the block device service provided by Ceph. Ceph block devices are thin-provisioned and resizable, and they store data striped across the cluster. RBD interacts with OSDs via the RADOS protocol, either through the kernel module or through the Librados library.


4. Data Storage Process

When the Ceph storage cluster receives a file from a client, the client divides the file into one or more objects, groups these objects, and stores them on the cluster's OSD nodes according to certain policies.

[Figure: OSD]


Several concepts are explained as follows:


File: files to be stored or accessed.


Object: the object as seen by RADOS. The difference between an object and a file is that an object's maximum size is limited by RADOS (usually 2 MB or 4 MB) to simplify the organization and management of the underlying storage. Therefore, when an upper-layer application saves a large file to RADOS, the file is divided into a series of equally sized objects for storage.


PG: Placement Group, which is used to organize and map the storage locations of objects. A PG organizes multiple objects, but an object can be mapped to only one PG. In turn, one PG is mapped to n OSDs, and each OSD carries a large number of PGs. In practice, n is at least 2, and at least 3 in a production environment.


OSD: This has been described previously.


1. File -> object mapping

The file is split according to the maximum object size, and each object produced by the split is assigned a unique OID, that is, an object ID.
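The split can be sketched in Python. This is an illustration only: the `"<ino>.<ono>"` OID format (a file identifier plus the object's sequence number) is an assumed naming scheme for the example, not Ceph's exact format.

```python
def file_to_objects(ino: str, data: bytes, max_obj_size: int = 4 * 1024 * 1024):
    """Split a file's bytes into fixed-size objects, each with a unique OID.

    The "<ino>.<ono>" OID format (file identifier plus object sequence
    number) is an illustrative assumption, not Ceph's exact naming scheme.
    """
    objects = []
    for ono, offset in enumerate(range(0, len(data), max_obj_size)):
        oid = f"{ino}.{ono:08d}"  # unique object ID for this chunk
        objects.append((oid, data[offset:offset + max_obj_size]))
    return objects
```

For example, a 10-byte file split with a 4-byte object size yields three objects, the last one smaller than the rest.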


2. Object -> PG Mapping

Objects in the same PG are distributed to the same group of OSD nodes (one primary OSD plus several replica OSDs). The PG of an object is computed with a hash algorithm and a mask.


The formula is hash(oid) & mask -> pgid. The calculation has two steps. First, a static hash function specified by the Ceph system computes the hash of the oid, mapping it to an approximately uniformly distributed pseudo-random value. Then, this pseudo-random value is bitwise ANDed with the mask to obtain the final PG number (pgid). By RADOS design, given a total of m PGs (where m is an integer power of 2), the mask value is m-1, so the combined hash and bitwise AND operation effectively selects a PG approximately at random from all m PGs. Based on this mechanism, when there are large numbers of objects and PGs, RADOS can ensure an approximately uniform mapping between objects and PGs. In addition, because objects are split from files, most objects are the same size, so this mapping also keeps the total amount of object data stored in each PG approximately uniform.
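The hash-and-mask step above can be sketched as follows. SHA-256 stands in for Ceph's static hash function here (Ceph actually uses a Jenkins-family hash); any uniform hash illustrates the mechanism.

```python
import hashlib

def object_to_pg(oid: str, pg_total: int) -> int:
    """Compute pgid = hash(oid) & mask, with mask = pg_total - 1.

    SHA-256 stands in for Ceph's static hash function; any uniform
    hash illustrates the mechanism.
    """
    assert pg_total & (pg_total - 1) == 0, "pg_total must be a power of two"
    mask = pg_total - 1
    # Take 64 bits of the digest as the pseudo-random hash value.
    h = int.from_bytes(hashlib.sha256(oid.encode()).digest()[:8], "little")
    return h & mask
```

Because the mask is m-1 for a power-of-two m, the AND keeps exactly the low log2(m) bits of the hash, which is why the PG count is required to be a power of two.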


3. PG -> OSD mapping

Mapping from PGs to OSDs: each PG is mapped to a group of OSDs (the number of OSDs depends on the number of replicas configured for the pool); the first OSD is the primary, and the rest are replicas. RADOS uses the CRUSH algorithm to distribute PGs across the OSD cluster based on the current system state (the cluster map) and the PG ID. The OSD cluster is organized according to the failure domains of the physical nodes (such as racks and equipment rooms). This mapping is determined by four factors:


  • CRUSH algorithm: a pseudo-random algorithm.


  • OSD MAP: contains the status of all pools and OSDs.


  • CRUSH MAP: contains the current hierarchical structure of disks, servers, and racks.


  • CRUSH Rules: data placement policies, which flexibly control the failure domains in which objects are stored.
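The PG-to-OSD step can be sketched with rendezvous (highest-random-weight) hashing. This is a stand-in, not CRUSH itself: like CRUSH, it is a deterministic pseudo-random mapping that any client can compute without a central lookup table, but it ignores the CRUSH map hierarchy, OSD weights, and CRUSH rules described above.

```python
import hashlib

def _score(pgid: int, osd: int) -> int:
    # Deterministic pseudo-random score for the (PG, OSD) pair.
    return int.from_bytes(
        hashlib.sha256(f"{pgid}:{osd}".encode()).digest()[:8], "little"
    )

def pg_to_osds(pgid: int, osds: list, replicas: int = 3) -> list:
    """Pick `replicas` distinct OSDs for a PG; the first acts as primary.

    Rendezvous hashing as a stand-in for CRUSH: deterministic and
    table-free, but without CRUSH's hierarchy, weights, or rules.
    """
    ranked = sorted(osds, key=lambda osd: _score(pgid, osd), reverse=True)
    return ranked[:replicas]
```

Every client that evaluates this function with the same PG ID and OSD list obtains the same ordered placement, which is the property that lets Ceph avoid a central metadata lookup on the data path.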


Thank you.


little_fish Created May 27, 2021 07:48:51
Yes, about Ceph storage information.  
Very important to know.

little_fish Created May 27, 2021 09:36:35
 
nice job.

little_fish Created May 28, 2021 06:11:18
Thanks.  
Hi thanks for sharing

little_fish Created May 28, 2021 06:11:37
Thank you, dear.  
Anno7
Moderator Created Nov 26, 2021 13:29:25

Thank you for sharing new and exciting information.