Data-Intensive Grid Service Models
§ Applications in the grid are normally grouped into two categories:
o Computation-intensive and
o Data-intensive.
§ For data-intensive applications, we may have to deal with massive amounts of data.
§ The grid system must be specially designed to discover, transfer, and manipulate these massive data sets.
§ Transferring massive data sets is a time-consuming task.
§ Efficient data management demands low-cost storage and high-speed data movement.
§ Common methods for solving data movement problems are:
o Data Replication and Unified Namespace
o Grid Data Access Models
o Parallel versus Striped Data Transfers
Data Replication and Unified Namespace
§ This data access method is also known as caching, which is often applied to enhance data efficiency in a grid environment.
§ When the same data blocks are replicated and scattered across multiple regions of a grid, users can access the data with locality of reference.
§ Replicas of the same data set can serve as backups for one another, so key data will not be lost in case of failures.
§ Replication strategies aim to
o preserve locality,
o minimize update costs, and
o maximize profits
§ Issues:
o Data replication may demand periodic consistency checks.
o Replication increases storage requirements and network bandwidth consumption.
o Replication strategies determine when and where to create a replica of the data.
§ The factors to consider include data demand, network conditions, and transfer cost.
§ Replication strategies can be classified into two method types, static and dynamic:
o Static strategies
§ The locations and number of replicas are determined in advance and will not be modified.
§ Optimization is required to determine the location and number of data replicas.
§ Issue: static strategies cannot adapt to changes in demand, bandwidth, and storage availability.
o Dynamic strategies
§ Dynamic strategies can adjust locations and number of data replicas according to changes in conditions (e.g., user behavior).
§ Optimization may be determined based on whether the data replica is being created, deleted, or moved.
§ Issue: frequent data-moving operations can result in much more overhead than in static strategies.
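To make the difference from a static strategy concrete, the following is a minimal sketch of a dynamic policy that creates or deletes replicas according to observed demand. It is an illustration only: the class name `DynamicReplicator`, the site names, and both thresholds are hypothetical, and a real policy would also weigh network conditions and transfer cost.

```python
from collections import Counter


class DynamicReplicator:
    """Toy dynamic replication policy driven by per-site request counts.

    All names and thresholds here are hypothetical, chosen for illustration.
    """

    def __init__(self, home_site, create_threshold=3, delete_threshold=1):
        self.home_site = home_site
        self.replicas = {home_site}        # sites currently holding a copy
        self.create_threshold = create_threshold
        self.delete_threshold = delete_threshold
        self.requests = Counter()          # requests per site in this window

    def record_access(self, site):
        """Count one request for the data set coming from `site`."""
        self.requests[site] += 1

    def rebalance(self):
        """Apply the policy at the end of a window and return the actions taken."""
        actions = []
        # Create a replica wherever demand reached the creation threshold.
        for site, count in self.requests.items():
            if site not in self.replicas and count >= self.create_threshold:
                self.replicas.add(site)
                actions.append(("create", site))
        # Delete replicas (never the home copy) whose demand dropped off.
        for site in sorted(self.replicas - {self.home_site}):
            if self.requests[site] <= self.delete_threshold:
                self.replicas.remove(site)
                actions.append(("delete", site))
        self.requests.clear()              # start a new observation window
        return actions
```

For example, three requests from a hypothetical "site-b" within one window trigger a replica there, and a later window with no requests from that site removes it again. Every such create or delete implies a data movement, which is the overhead noted above.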
Grid Data Access Models
§ Multiple participants may want to share the same data collection.
§ To retrieve any piece of data, we need a grid with a unique global namespace.
§ Similarly, we desire to have unique file names.
§ To achieve these, we have to resolve inconsistencies among multiple data objects bearing the same name.
§ Access restrictions may be imposed to avoid confusion.
§ Also, data needs to be protected to avoid leakage and damage.
§ Users who want to access data have to be authenticated first and then authorized for access.
§ In general, there are four access models for organizing a data grid:
o Monadic model
o Hierarchical model
o Federation model
o Hybrid model
§ Monadic model:
o This is a centralized data repository model, shown in Figure (a).
o All the data is saved in a central data repository.
o When users want to access some data, they have to submit requests directly to the central repository.
o No data is replicated for preserving data locality.
o This model is the simplest to implement for a small grid.
o For a large grid, this model is not efficient in terms of performance and reliability.
o Data replication is permitted in this model only when fault tolerance is demanded.
§ Hierarchical model:
o The hierarchical model, shown in Figure 7.5(b), is suitable for building a large data grid which has only one large data access directory.
o The data may be transferred from the source to a second-level center.
o Then some data in the regional center is transferred to the third-level center.
o After being forwarded several times, specific data objects are accessed directly by users.
o Generally speaking, a higher-level data center has a wider coverage area.
o It provides higher bandwidth for access than a lower-level data center.
o PKI (Public Key Infrastructure) security services are easier to implement in this hierarchical data access model.
o The European Data Grid (EDG) adopts this data access model.
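The forwarding of data from one level to the next can be sketched as a chain of data centers, each of which asks its parent when it does not hold the requested object. This is a minimal sketch under stated assumptions: the `DataCenter` class, the tier names, and the choice to keep a local copy of every forwarded object are all hypothetical and are not taken from EDG.

```python
class DataCenter:
    """One level in a hypothetical hierarchical data grid."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # the higher-level center, or None for the source
        self.store = {}        # data objects held locally

    def fetch(self, key):
        """Return (value, path), where path lists the centers visited."""
        if key in self.store:
            return self.store[key], [self.name]
        if self.parent is None:
            raise KeyError(key)            # not available anywhere upstream
        value, path = self.parent.fetch(key)
        self.store[key] = value            # assumption: keep a local copy
        return value, [self.name] + path
```

In this sketch a first request from a third-level center travels up to the source, while a repeated request is served locally.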
§ Federation model (also known as mesh model):
o This data access model is shown in Figure (c).
o This model is better suited for designing a data grid with multiple sources of data supplies.
o The data sources are distributed to many different locations.
o Although the data is shared, the data items are still owned and controlled by their original owners.
o According to predefined access policies, only authenticated users are authorized to request data from any data source.
o This mesh model may cost the most when the number of grid institutions becomes very large.
§ Hybrid model:
o This data access model is shown in Figure (d).
o The model combines the best features of the hierarchical and mesh models.
o Traditional data transfer technology, such as FTP, is suited to networks with lower bandwidth.
o Network links in a data grid often have fairly high bandwidth, so high-speed data transfer tools such as GridFTP, developed with the Globus library, exploit other data transfer models.
o The cost of the hybrid model can be traded off between the two extremes, the hierarchical model and the mesh-connected model.
Parallel versus Striped Data Transfers
§ Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams for passing subdivided segments of a file simultaneously.
§ Although the speed of each stream is the same as in sequential streaming, the total time to move data in all streams can be significantly reduced compared to FTP transfer.
§ In striped data transfer, a data object is partitioned into a number of sections, and each section is placed at an individual site in a data grid.
§ When a user requests this data object, a data stream is created for each site, and all sections of the data object are transferred simultaneously.
§ Striped data transfer can utilize the bandwidths of multiple sites more efficiently to speed up data transfer.
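The two transfer styles can be contrasted in a small simulation. This is a sketch under stated assumptions, not how GridFTP is implemented: byte strings in memory stand in for network streams, and the function names `split_ranges`, `parallel_transfer`, and `striped_transfer` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into `streams` contiguous (start, end) byte ranges."""
    base, extra = divmod(size, streams)
    ranges, start = [], 0
    for i in range(streams):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def parallel_transfer(source, streams=4):
    """Parallel transfer: one source, several streams, one file segment each."""
    def fetch(byte_range):
        start, end = byte_range
        return source[start:end]   # stand-in for one network stream

    with ThreadPoolExecutor(max_workers=streams) as pool:
        # pool.map yields results in input order, so segments rejoin correctly.
        segments = list(pool.map(fetch, split_ranges(len(source), streams)))
    return b"".join(segments)


def striped_transfer(sections_by_site):
    """Striped transfer: one stream per site, each site holding one section.

    `sections_by_site` is a list of (site_name, section_bytes) in section order.
    """
    def fetch(entry):
        _site, section = entry
        return section             # stand-in for a stream opened to that site

    workers = max(1, len(sections_by_site))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sections = list(pool.map(fetch, sections_by_site))
    return b"".join(sections)
```

The structural difference is where the pieces come from: `parallel_transfer` cuts one file at a single source into segments, whereas `striped_transfer` collects sections that already reside at different sites, so the aggregate bandwidth of several sites is used.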