Dr R Anurekha: Data Intensive Grid Model

Data-Intensive Grid Service Models

§ Applications in the grid are normally grouped into two categories:

o Computation-intensive and

o Data-intensive.

§ For data-intensive applications, we may have to deal with massive amounts of data.

§ The grid system must be specially designed to discover, transfer, and manipulate these massive data sets.

§ Transferring massive data sets is a time-consuming task.

§ Efficient data management demands low-cost storage and high-speed data movement.

§ Common methods for solving data movement problems are:

o Data Replication and Unified Namespace

o Grid Data Access Models

o Parallel versus Striped Data Transfers

Data Replication and Unified Namespace

§ Data replication, also known as caching, is often applied to improve data access efficiency in a grid environment.

§ By replicating the same data blocks and scattering them across multiple regions of a grid, users can access the same data with locality of reference.

§ Replicas of the same data set can serve as backups for one another, so key data is not lost when failures occur.

§ Replication strategies aim to:

o preserve locality,

o minimize update costs, and

o maximize profits

§ Issues:

o Data replication may demand periodic consistency checks.

o Replication increases storage requirements and network bandwidth consumption.

o Replication strategies determine when and where to create a replica of the data.

§ The factors to consider include data demand, network conditions, and transfer cost.

§ Replication strategies can be classified into two types:

o Static strategies

§ The locations and number of replicas are determined in advance and are not modified afterward.

§ Optimization is required to determine the location and number of data replicas.

§ Issue: static strategies cannot adapt to changes in demand, bandwidth, and storage availability.

o Dynamic strategies

§ Dynamic strategies can adjust locations and number of data replicas according to changes in conditions (e.g., user behavior).

§ Optimization may be determined based on whether the data replica is being created, deleted, or moved.

§ Issue: frequent data-moving operations can incur much more overhead than static strategies.
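As a rough illustration of a dynamic strategy, the sketch below creates a replica at a site once demand from that site passes a threshold. The class name `DynamicReplicator`, the site names, and the threshold rule are assumptions made for this example; real replication managers also weigh network conditions and transfer cost, which are left out here.

```python
from collections import Counter


class DynamicReplicator:
    """Toy dynamic replication policy driven by per-site access counts."""

    def __init__(self, origin_site, create_threshold=3):
        # Assumption: a replica is created once a site has requested the
        # data set `create_threshold` times without holding a local copy.
        self.replica_sites = {origin_site}
        self.create_threshold = create_threshold
        self.remote_accesses = Counter()

    def access(self, site):
        """Record one access from `site` and return 'local' or 'remote'."""
        if site in self.replica_sites:
            return "local"
        self.remote_accesses[site] += 1
        if self.remote_accesses[site] >= self.create_threshold:
            # Demand is high enough: place a replica at the requesting site.
            self.replica_sites.add(site)
            del self.remote_accesses[site]
        return "remote"


replicator = DynamicReplicator(origin_site="site-A")
results = [replicator.access("site-B") for _ in range(5)]
print(results)
print(sorted(replicator.replica_sites))
```

A static strategy would correspond to fixing `replica_sites` up front and never changing it, which avoids the data-moving overhead but cannot follow shifts in demand.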

Grid Data Access Models

§ Multiple participants may want to share the same data collection.

§ To retrieve any piece of data, we need a grid with a unique global namespace.

§ Similarly, we desire to have unique file names.

§ To achieve these, we have to resolve inconsistencies among multiple data objects bearing the same name.

§ Access restrictions may be imposed to avoid confusion.

§ Also, data needs to be protected to avoid leakage and damage.

§ Users who want to access data have to be authenticated first and then authorized for access.

§ In general, there are four access models for organizing a data grid:

o Monadic model

o Hierarchical model

o Federation model

o Hybrid model

§ Monadic model:

o This is a centralized data repository model, shown in Figure (a).

o All the data is saved in a central data repository.

o When users want to access some data, they have to submit requests directly to the central repository.

o No data is replicated for preserving data locality.

o This model is the simplest to implement for a small grid.

o For a large grid, this model is not efficient in terms of performance and reliability.

o Data replication is permitted in this model only when fault tolerance is demanded.

§ Hierarchical model:

o The hierarchical model, shown in Figure 7.5(b), is suitable for building a large data grid that has only one large data access directory.

o The data may be transferred from the source to a second-level center.

o Then some data in the regional center is transferred to the third-level center.

o After being forwarded several times, specific data objects are accessed directly by users.

o Generally speaking, a higher-level data center has a wider coverage area.

o It provides higher bandwidth for access than a lower-level data center.

o PKI (Public Key Infrastructure) security services are easier to implement in this hierarchical data access model.

o The European Data Grid (EDG) adopts this data access model.

§ Federation model (Also known as Mesh Model):

o This data access model, also known as the mesh model, is shown in Figure (c).

o This model is better suited for designing a data grid with multiple sources of data supplies.

o The data sources are distributed to many different locations.

o Although the data is shared, the data items are still owned and controlled by their original owners.

o According to predefined access policies, only authenticated users are authorized to request data from any data source.

o This mesh model may cost the most when the number of grid institutions becomes very large.

§ Hybrid model:

o This data access model is shown in Figure (d).

o The model combines the best features of the hierarchical and mesh models.

o Traditional data transfer technology, such as FTP, is suitable for networks with lower bandwidth.

o Network links in a data grid often have fairly high bandwidth, so high-speed data transfer tools such as GridFTP, developed with the Globus library, exploit other data transfer models.

o The cost of the hybrid model lies between the two extremes of the hierarchical and mesh-connected grids and can be traded off between them.
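The requirements stated at the start of this section (a unique global namespace, then authentication followed by authorization) can be sketched as a small catalog. Everything named here, including `DataGridCatalog`, the logical file name, and the site paths, is hypothetical, and authentication is reduced to a membership check purely for illustration; a real grid would rely on a security service such as PKI.

```python
class DataGridCatalog:
    """Toy catalog mapping unique logical names to physical replica locations."""

    def __init__(self):
        self.replicas = {}         # logical name -> list of physical locations
        self.acl = {}              # logical name -> set of authorized users
        self.known_users = set()   # users who can be authenticated

    def register(self, logical_name, location, authorized_users):
        # Enforce uniqueness: a logical name may be registered only once.
        if logical_name in self.replicas:
            raise ValueError(f"name already in use: {logical_name}")
        self.replicas[logical_name] = [location]
        self.acl[logical_name] = set(authorized_users)

    def add_replica(self, logical_name, location):
        self.replicas[logical_name].append(location)

    def lookup(self, user, logical_name):
        # Step 1: authentication (here only a membership check).
        if user not in self.known_users:
            raise PermissionError("authentication failed")
        # Step 2: authorization against the per-object access list.
        if user not in self.acl.get(logical_name, set()):
            raise PermissionError("not authorized")
        return list(self.replicas[logical_name])


catalog = DataGridCatalog()
catalog.known_users.update({"alice", "bob"})
catalog.register("/grid/physics/run42.dat", "site-A:/store/run42.dat", {"alice"})
catalog.add_replica("/grid/physics/run42.dat", "site-B:/cache/run42.dat")
print(catalog.lookup("alice", "/grid/physics/run42.dat"))
```

In this sketch, rejecting a second `register` call for the same name is one simple way to resolve the name clashes mentioned above; other policies, such as per-owner prefixes, would work as well.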

Parallel versus Striped Data Transfers

§ Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams for passing subdivided segments of a file simultaneously.

§ Although the speed of each stream is the same as in sequential streaming, the total time to move data in all streams can be significantly reduced compared to FTP transfer.

§ In striped data transfer, a data object is partitioned into a number of sections, and each section is placed at an individual site in a data grid.

§ When a user requests this piece of data, a data stream is created for each site, and all the sections of data objects are transferred simultaneously.

§ Striped data transfer can utilize the bandwidths of multiple sites more efficiently to speed up data transfer.
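The idea of moving subdivided segments over several streams at once can be sketched as follows. This is a simplified, in-memory stand-in and not how GridFTP is implemented: `fetch_segment` is a hypothetical placeholder for one network stream, and the helper names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, num_streams):
    """Return (start, end) byte ranges that cover total_size without gaps."""
    base, extra = divmod(total_size, num_streams)
    ranges, start = [], 0
    for i in range(num_streams):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def fetch_segment(source, byte_range):
    # Stand-in for one data stream; a real tool would read from the network.
    start, end = byte_range
    return source[start:end]


def parallel_transfer(source, num_streams=4):
    ranges = split_ranges(len(source), num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        # map() yields results in the order of `ranges`, so the segments can
        # be concatenated directly to rebuild the original object.
        segments = list(pool.map(lambda r: fetch_segment(source, r), ranges))
    return b"".join(segments)


data = bytes(range(256)) * 10
received = parallel_transfer(data, num_streams=4)
print(received == data)
```

For a striped transfer, the same structure would apply, except that each byte range would be fetched from the separate site that holds that section. Under the idealized assumption that n streams each sustain the same rate b with no shared bottleneck, moving S bytes takes roughly S / (n × b) rather than S / b.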
