Data Intensive Grid Model

Data-Intensive Grid Service Models

 

§  Applications in the grid are normally grouped into two categories:

o   Computation-intensive and

o   Data-intensive.

§  For data-intensive applications, we may have to deal with massive amounts of data.

§  The grid system must be specially designed to discover, transfer, and manipulate these massive data sets.

§  Transferring massive data sets is a time-consuming task.

§  Efficient data management demands low-cost storage and high-speed data movement.

§  Common methods for solving data movement problems are:

o   Data Replication and Unified Namespace

o   Grid Data Access Models

o   Parallel versus Striped Data Transfers


 

Data Replication and Unified Namespace

§  This data access method is also known as caching, which is often applied to enhance data efficiency in a grid environment.

§  Replicating the same data blocks and scattering them across multiple regions of a grid lets users access the data with locality of reference.

§  Replicas of the same data set can also serve as backups for one another, so key data will not be lost in case of failures.

§  Replication strategies aim to

o   preserve locality,

o   minimize update costs, and

o   maximize profits

§  Issues:

o   Data replication may demand periodic consistency checks.

o   Increase in storage requirements and network bandwidth.

o   Replication strategies determine when and where to create a replica of the data.

§  The factors to consider include data demand, network conditions, and transfer cost.
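
The factors above (demand, network conditions, transfer cost) can be turned into a simple replica-selection rule. The following is a minimal sketch, not a real grid API: the `Replica` class, the site names, and the cost model (setup latency plus size divided by bandwidth) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str               # hypothetical site name
    bandwidth_mbps: float   # assumed available bandwidth to the requester
    latency_s: float        # assumed connection setup latency


def estimated_transfer_time(replica: Replica, size_mb: float) -> float:
    """Assumed cost model: setup latency plus size / bandwidth (in seconds)."""
    return replica.latency_s + (size_mb * 8) / replica.bandwidth_mbps


def pick_replica(replicas, size_mb):
    """Return the replica with the lowest estimated transfer time."""
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mb))


replicas = [
    Replica("site-a", bandwidth_mbps=100.0, latency_s=0.20),
    Replica("site-b", bandwidth_mbps=1000.0, latency_s=0.05),
    Replica("site-c", bandwidth_mbps=400.0, latency_s=0.01),
]

best_large = pick_replica(replicas, size_mb=500.0)   # bandwidth dominates
best_small = pick_replica(replicas, size_mb=0.001)   # latency dominates
```

Under this model the best replica depends on the size of the request: a large data set favors the high-bandwidth site, while a tiny one favors the low-latency site.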

 


 

§  Replication strategies can be classified into two types:

o   Static strategies

§  The locations and number of replicas are determined in advance and are not modified afterward.

§  Optimization is required to determine the location and number of data replicas.

§  Issue: static strategies cannot adapt to changes in demand, bandwidth, and storage availability.

o   Dynamic strategies

§  Dynamic strategies can adjust locations and number of data replicas according to changes in conditions (e.g., user behavior).

§  Optimization may be determined based on whether the data replica is being created, deleted, or moved.

§  Issue: frequent data-moving operations can result in much more overhead than in static strategies.
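
To make the difference between the two types concrete, here is a small sketch of one possible dynamic strategy. It is an assumption chosen for illustration, not a strategy named in the text: a replica is created at a site once the number of remote accesses from that site reaches a threshold. The class name, the `"origin"` site, and the threshold value are hypothetical.

```python
from collections import Counter


class DynamicReplicator:
    """Toy dynamic strategy: replicate to a site after repeated remote accesses."""

    def __init__(self, create_threshold=3):
        self.create_threshold = create_threshold
        self.replica_sites = {"origin"}   # the original copy always exists
        self.remote_hits = Counter()      # remote accesses seen per site

    def access(self, site):
        """Record an access from `site` and return the site that served it."""
        if site in self.replica_sites:
            return site                   # served locally
        self.remote_hits[site] += 1
        if self.remote_hits[site] >= self.create_threshold:
            self.replica_sites.add(site)  # create a replica near the demand
        return "origin"                   # this request was still remote

    def evict(self, site):
        """Delete a replica (never the original) and reset its counter."""
        if site != "origin":
            self.replica_sites.discard(site)
            self.remote_hits[site] = 0


replicator = DynamicReplicator(create_threshold=3)
served_from = [replicator.access("site-x") for _ in range(4)]
```

A static strategy would fix `replica_sites` once and never change it. The dynamic version adapts to demand, at the price of the extra bookkeeping and data movement noted above.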


 

Grid Data Access Models

§  Multiple participants may want to share the same data collection.

§  To retrieve any piece of data, we need a grid with a unique global namespace.

§  Similarly, we desire to have unique file names.

§  To achieve these, we have to resolve inconsistencies among multiple data objects bearing the same name.

§  Access restrictions may be imposed to avoid confusion.

§  Also, data needs to be protected to avoid leakage and damage.

§  Users who want to access data have to be authenticated first and then authorized for access.

§  In general, there are four access models for organizing a data grid:

o   Monadic model

o   Hierarchical model

o   Federation model

o   Hybrid model
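
Before looking at each model, the namespace and access-control requirements above can be sketched as a small catalog that maps unique logical names to physical locations. This is a hedged illustration only: the `ReplicaCatalog` class, its methods, the user names, and the `example.org` locations are hypothetical, and `login` is a stand-in for real authentication.

```python
class NameConflictError(Exception):
    """Raised when a logical name is already taken in the global namespace."""


class AccessDenied(Exception):
    """Raised when a user is not authenticated or not authorized."""


class ReplicaCatalog:
    def __init__(self):
        self._locations = {}        # logical name -> list of physical locations
        self._acl = {}              # logical name -> set of authorized users
        self._authenticated = set()

    def register(self, logical_name, location, allowed_users):
        """Add a new logical name; duplicates are rejected to keep names unique."""
        if logical_name in self._locations:
            raise NameConflictError(logical_name)
        self._locations[logical_name] = [location]
        self._acl[logical_name] = set(allowed_users)

    def add_replica(self, logical_name, location):
        """Record another physical copy under an existing logical name."""
        self._locations[logical_name].append(location)

    def login(self, user):
        """Stand-in for authentication (a real grid would verify credentials)."""
        self._authenticated.add(user)

    def resolve(self, user, logical_name):
        """Authenticate first, then authorize, then return the locations."""
        if user not in self._authenticated:
            raise AccessDenied("not authenticated")
        if user not in self._acl.get(logical_name, set()):
            raise AccessDenied("not authorized")
        return list(self._locations[logical_name])


catalog = ReplicaCatalog()
catalog.register("/grid/run42.dat", "site-a.example.org:/data/run42.dat", {"alice"})
catalog.add_replica("/grid/run42.dat", "site-b.example.org:/data/run42.dat")
catalog.login("alice")
locations = catalog.resolve("alice", "/grid/run42.dat")
```

Rejecting a second `register` call for the same logical name is one simple way to resolve the naming inconsistencies mentioned above; other policies, such as versioning, are equally possible.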


 

 

§  Monadic model:

o   This is a centralized data repository model, shown in Figure (a).

o   All the data is saved in a central data repository.

o   When users want to access some data, they have to submit requests directly to the central repository.

o   No data is replicated for preserving data locality.

o   This model is the simplest to implement for a small grid.

o   For a large grid, this model is not efficient in terms of performance and reliability.

o   Data replication is permitted in this model only when fault tolerance is demanded.


 

 

 

§  Hierarchical model:

o   The hierarchical model, shown in Figure 7.5(b), is suitable for building a large data grid which has only one large data access directory.

o   The data may be transferred from the source to a second-level center.

o   Then some data in the regional center is transferred to the third-level center.

o   After being forwarded several times, specific data objects are accessed directly by users.

o   Generally speaking, a higher-level data center has a wider coverage area.

o   It provides higher bandwidth for access than a lower-level data center.

o   PKI (Public Key Infrastructure) security services are easier to implement in this hierarchical data access model.

o   The European Data Grid (EDG) adopts this data access model.
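
The forwarding of data from the source down through lower-level centers can be sketched as a chain of data centers, each with a parent. This is a minimal illustration under stated assumptions: the `DataCenter` class and the "tier-N" names are hypothetical, and keeping a copy at each level on the way down is an assumed policy, not something the model prescribes.

```python
class DataCenter:
    """One level in a hierarchy; the top-level center has no parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.store = {}   # data objects currently held at this center

    def fetch(self, key):
        """Return (value, path), where path lists the centers consulted."""
        if key in self.store:
            return self.store[key], [self.name]
        if self.parent is None:
            raise KeyError(key)            # not even the source has it
        value, path = self.parent.fetch(key)
        self.store[key] = value            # assumed policy: keep a copy here
        return value, [self.name] + path


source = DataCenter("tier-0")                # the data source
regional = DataCenter("tier-1", source)      # second-level center
local = DataCenter("tier-2", regional)       # third-level center

source.store["run42"] = b"detector data"

first_value, first_path = local.fetch("run42")     # forwarded down the levels
second_value, second_path = local.fetch("run42")   # now served directly
```

The first request walks up to the source, and later requests for the same object are served by the lowest-level center, which matches the idea that data is forwarded several times before users access it directly.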

 

 

§  Federation model (also known as the mesh model):

o   This data access model is shown in Figure (c).

o   This model is better suited for designing a data grid with multiple sources of data supplies.

o   The data sources are distributed to many different locations.

o   Although the data is shared, the data items are still owned and controlled by their original owners.

o   According to predefined access policies, only authenticated users are authorized to request data from any data source.

o   This mesh model may cost the most when the number of grid institutions becomes very large.


 

 

 

§  Hybrid model:

o   This data access model is shown in Figure (d).

o   The model combines the best features of the hierarchical and mesh models.

o   Traditional data transfer technology, such as FTP, is suitable for networks with lower bandwidth.

o   Network links in a data grid often have fairly high bandwidth, so high-speed data transfer tools such as GridFTP, developed with the Globus library, exploit other data transfer models.

o   The cost of the hybrid model can be traded off between the two extreme models for hierarchical and mesh-connected grids.

 


 

Parallel versus Striped Data Transfers

§  Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams for passing subdivided segments of a file simultaneously.

§  Although the speed of each stream is the same as in sequential streaming, the total time to move data in all streams can be significantly reduced compared to FTP transfer.

§  In striped data transfer, a data object is partitioned into a number of sections, and each section is placed at an individual site in a data grid.

§  When a user requests this piece of data, a data stream is created for each site, and all the sections of data objects are transferred simultaneously.

§  Striped data transfer can utilize the bandwidths of multiple sites more efficiently to speed up data transfer.
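
The two transfer styles can be contrasted with a small in-memory sketch. It does not use GridFTP or any network code: byte strings stand in for files and sites, threads stand in for data streams, and all function names are hypothetical. In parallel transfer, several streams pull ranges of one source; in striped transfer, each stream pulls a section held at a different site.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total, n):
    """Split [0, total) into n contiguous (start, end) byte ranges."""
    base, extra = divmod(total, n)
    ranges, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def parallel_fetch(source, streams):
    """Parallel transfer: several streams read segments of ONE source."""
    ranges = split_ranges(len(source), streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # pool.map preserves input order, so segments rejoin correctly
        return b"".join(pool.map(lambda r: source[r[0]:r[1]], ranges))


def stripe(data, sites):
    """Partition a data object into one section per site."""
    return [data[s:e] for s, e in split_ranges(len(data), sites)]


def read_section(section):
    """Stand-in for reading one section from the site that holds it."""
    return bytes(section)


def striped_fetch(site_sections):
    """Striped transfer: one stream per SITE, all sections read at once."""
    with ThreadPoolExecutor(max_workers=len(site_sections)) as pool:
        return b"".join(pool.map(read_section, site_sections))


payload = bytes(range(256)) * 10
via_parallel = parallel_fetch(payload, streams=4)
via_striped = striped_fetch(stripe(payload, sites=3))
```

In both cases the pieces are reassembled in their original order. The practical difference is where the bandwidth comes from: parallel streams share the links of a single source, while striped streams can each use a different site's bandwidth.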

 

 


Anu