Data Intensive Grid Model

Data–Intensive Grid Service Models

 

§  Applications in the grid are normally grouped into two categories:

o   Computation-intensive and

o   Data-intensive.

§  For data-intensive applications, we may have to deal with massive amounts of data.

§  The grid system must be specially designed to discover, transfer, and manipulate these massive data sets.

§  Transferring massive data sets is a time-consuming task.

§  Efficient data management demands low-cost storage and high-speed data movement.

§  Common methods for solving data movement problems include:

o   Data Replication and Unified Namespace

o   Grid Data Access Models

o   Parallel versus Striped Data Transfers


 

Data Replication and Unified Namespace

§  This data access method is also known as caching, which is often applied to enhance data efficiency in a grid environment.

§  By replicating the same data blocks and scattering them in multiple regions of a grid, users can access the same data with locality of references.

§  Replicas of the same data set can serve as backups for one another, so key data will not be lost in case of failures.

§  Replication strategies aim to

o   preserve locality,

o   minimize update costs, and

o   maximize profits

§  Issues:

o   Data replication may demand periodic consistency checks.

o   Replication increases storage requirements and network bandwidth consumption.

o   Replication strategies determine when and where to create a replica of the data.

§  The factors to consider include data demand, network conditions, and transfer cost.

 


 

§  The strategies of replication can be classified into two types:

o   Static strategies

§  The locations and number of replicas are determined in advance and will not be modified.

§  Optimization is required to determine the location and number of data replicas.

§  Issue – Static strategies cannot adapt to changes in demand, bandwidth, and storage availability.

o   Dynamic strategies

§  Dynamic strategies can adjust locations and number of data replicas according to changes in conditions (e.g., user behavior).

§  Optimization may be determined based on whether the data replica is being created, deleted, or moved

§  Issues – Frequent data-moving operations can result in much more overhead than in static strategies.
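
The difference between the two strategy types can be illustrated with a small sketch. It is only one possible dynamic policy, not a method prescribed by these notes: a site that keeps reading a data set remotely receives its own replica once a threshold is reached. The class name `ReplicaManager`, the `create_threshold` value, and the site and data set names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaManager:
    """Toy dynamic replication policy driven by per-site remote reads."""
    create_threshold: int = 3  # hypothetical: remote reads before replicating
    replicas: dict = field(default_factory=dict)      # data set -> set of sites
    remote_reads: dict = field(default_factory=dict)  # (data set, site) -> count

    def register(self, dataset, site):
        """Record that `site` holds a replica of `dataset`."""
        self.replicas.setdefault(dataset, set()).add(site)

    def access(self, dataset, site):
        """Record a read from `site`; return 'local', 'remote', or 'replicated'."""
        sites = self.replicas[dataset]
        if site in sites:
            return "local"
        key = (dataset, site)
        self.remote_reads[key] = self.remote_reads.get(key, 0) + 1
        if self.remote_reads[key] >= self.create_threshold:
            # Demand is high enough: create a replica close to the reader.
            sites.add(site)
            del self.remote_reads[key]
            return "replicated"
        return "remote"


mgr = ReplicaManager()
mgr.register("climate.dat", "site-A")
results = [mgr.access("climate.dat", "site-B") for _ in range(4)]
print(results)
```

A static strategy would correspond to calling `register` once at setup time and never adding sites afterwards. A fuller dynamic policy would also weigh network conditions and transfer cost, and would delete or move replicas that are no longer in demand.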


 

Grid Data Access Models

§  Multiple participants may want to share the same data collection.

§  To retrieve any piece of data, we need a grid with a unique global namespace.

§  Similarly, we desire to have unique file names.

§  To achieve these, we have to resolve inconsistencies among multiple data objects bearing the same name.

§  Access restrictions may be imposed to avoid confusion.

§  Also, data needs to be protected to avoid leakage and damage.

§  Users who want to access data have to be authenticated first and then authorized for access.

§  In general, there are four access models for organizing a data grid:

o   Monadic model

o   Hierarchical model

o   Federation model

o   Hybrid model
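
The requirements listed before the four models (a unique global namespace, no two objects sharing a name, and authentication followed by authorization) can be sketched as a small catalog. This is an illustrative assumption rather than the interface of any real grid middleware; `DataCatalog`, its methods, and the example paths and user names are hypothetical.

```python
class DataCatalog:
    """Maps unique logical names to physical replicas and enforces access."""

    def __init__(self):
        self._replicas = {}  # logical name -> list of physical locations
        self._allowed = {}   # logical name -> set of authorized user names

    def create(self, logical_name, location, allowed_users):
        # Reject a second object with the same logical name so that the
        # global namespace stays unambiguous.
        if logical_name in self._replicas:
            raise ValueError(f"name already in use: {logical_name}")
        self._replicas[logical_name] = [location]
        self._allowed[logical_name] = set(allowed_users)

    def add_replica(self, logical_name, location):
        self._replicas[logical_name].append(location)

    def resolve(self, logical_name, user, authenticated):
        # Authentication is checked first, then authorization.
        if not authenticated:
            raise PermissionError("user is not authenticated")
        if user not in self._allowed[logical_name]:
            raise PermissionError(f"{user} is not authorized for {logical_name}")
        return list(self._replicas[logical_name])


catalog = DataCatalog()
catalog.create("/grid/physics/run42.dat", "siteA:/store/run42.dat", {"alice"})
catalog.add_replica("/grid/physics/run42.dat", "siteB:/cache/run42.dat")
locations = catalog.resolve("/grid/physics/run42.dat", "alice", True)
print(locations)
```

In this sketch the `authenticated` flag stands in for a real credential check, which a production grid would delegate to its security infrastructure.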


 

 

§  Monadic model:

o   This is a centralized data repository model, shown in Figure (a).

o   All the data is saved in a central data repository.

o   When users want to access some data they have to submit requests directly to the central repository.

o   No data is replicated for preserving data locality.

o   This model is the simplest to implement for a small grid.

o   For a large grid, this model is not efficient in terms of performance and reliability.

o   Data replication is permitted in this model only when fault tolerance is demanded.


 

 

 

§  Hierarchical model:

o   The hierarchical model, shown in Figure 7.5(b), is suitable for building a large data grid which has only one large data access directory.

o   The data may be transferred from the source to a second-level center.

o   Then some data in the regional center is transferred to the third-level center.

o   After being forwarded several times, specific data objects are accessed directly by users.

o   Generally speaking, a higher-level data center has a wider coverage area.

o   It provides higher bandwidth for access than a lower-level data center.

o   PKI (Public Key Infrastructure) security services are easier to implement in this hierarchical data access model.

o   The European Data Grid (EDG) adopts this data access model.
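
The tiered forwarding described for the hierarchical model can be sketched as a chain of data centers. The sketch assumes a pull-style lookup in which each level keeps a copy of what it forwards; the notes do not specify whether data is pushed or pulled, and the `DataCenter` class and the center names are hypothetical.

```python
class DataCenter:
    """One level of a hierarchical data grid; `parent` is the next level up."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.store = {}

    def fetch(self, key):
        """Return (value, path), where path lists the centers consulted."""
        if key in self.store:
            return self.store[key], [self.name]
        if self.parent is None:
            raise KeyError(key)
        value, path = self.parent.fetch(key)
        # Keep a copy at this level as the data is forwarded downward.
        self.store[key] = value
        return value, [self.name] + path


source = DataCenter("source")
regional = DataCenter("regional", parent=source)
local = DataCenter("local", parent=regional)
source.store["event-001"] = b"raw"

first = local.fetch("event-001")   # walks up to the source
second = local.fetch("event-001")  # now served by the lowest level
print(first[1], second[1])
```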

 

 

§  Federation model (Also known as Mesh Model):

o   This data access model, also known as a mesh model, is shown in Figure (c).

o   This model is better suited for designing a data grid with multiple sources of data supplies.

o   The data sources are distributed to many different locations.

o   Although the data is shared, the data items are still owned and controlled by their original owners.

o   According to predefined access policies, only authenticated users are authorized to request data from any data source.

o   This mesh model may cost the most when the number of grid institutions becomes very large.


 

 

 

§  Hybrid model:

o   This data access model is shown in Figure (d).

o   The model combines the best features of the hierarchical and mesh models.

o   Traditional data transfer technology, such as FTP, is adequate for networks with lower bandwidth.

o   Network links in a data grid often have fairly high bandwidth, so high-speed data transfer tools such as GridFTP, developed with the Globus library, exploit other data transfer models.

o   The cost of the hybrid model can be traded off between the two extreme models for hierarchical and mesh-connected grids.

 


 

Parallel versus Striped Data Transfers

§  Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams for passing subdivided segments of a file simultaneously.

§  Although the speed of each stream is the same as in sequential streaming, the total time to move data in all streams can be significantly reduced compared to FTP transfer.

§  In striped data transfer, a data object is partitioned into a number of sections, and each section is placed in an individual site in a data grid.

§  When a user requests this piece of data, a data stream is created for each site, and all the sections of data objects are transferred simultaneously.

§  Striped data transfer can utilize the bandwidths of multiple sites more efficiently to speed up data transfer.
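
The segment-per-stream idea behind parallel transfer can be sketched as follows. Under the idealized assumption that streams do not compete for bandwidth, a file of size S that takes S/r seconds on one stream of rate r would take roughly S/(n·r) seconds on n streams. The sketch below is not GridFTP: an in-memory `bytes` object stands in for the remote file, and the function names `split_ranges` and `parallel_copy` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into `streams` contiguous (start, end) byte ranges."""
    base, extra = divmod(size, streams)
    ranges, start = [], 0
    for i in range(streams):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def parallel_copy(source: bytes, streams: int) -> bytes:
    """Copy `source` using one worker per segment, then reassemble in order."""
    ranges = split_ranges(len(source), streams)

    def fetch(byte_range):
        start, end = byte_range
        return source[start:end]  # stands in for a network read of one segment

    with ThreadPoolExecutor(max_workers=streams) as pool:
        parts = list(pool.map(fetch, ranges))  # map preserves input order
    return b"".join(parts)


data = bytes(range(256)) * 4  # 1024 bytes of sample data
copied = parallel_copy(data, 4)
print(split_ranges(10, 3), copied == data)
```

A striped transfer would follow the same reassembly step, except that each byte range would be read from a different site holding that section of the data object.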

 

 

Grid Standards

 

 

Grid standards have been developed over the years. Well-established organizations and specifications behind those standards include:

1.    Open Grid Forum (formerly Global Grid Forum)

2.    Object Management Group

3.    OGSA (Open Grid Services Architecture)

4.    OGSI (Open Grid Service Infrastructure)

5.    OGSA-DAI

6.    Web services

7.    SAGA (Simple API for Grid Applications)

8.    GSI (Grid Security Infrastructure)

9.    WSRF (Web Service Resource Framework)

 

1.    Open Grid Forum (OGF)

Formerly called the Global Grid Forum, OGF is a community of users, developers, and vendors for the standardization of grid computing. OGF has two principal functions: being the standards organization for grid computing, and building communities within the overall grid community.

 

2.    Object Management Group

The Object Management Group (OMG) is a computer industry standards consortium. OMG Task Forces develop enterprise integration standards for a range of technologies.

 

3.    OGSA (Open Grid Services Architecture)

Open Grid Services Architecture (OGSA) was developed within the Open Grid Forum. It describes a service-oriented architecture for a grid computing environment for business and scientific use. The standard was specifically developed for the emerging grid and cloud service communities.

The OGSA is extended from web service concepts and technologies. It is intended to support the creation, termination, management, and invocation of stateful, transient grid services via standard interfaces and conventions. The OGSA framework specifies the physical environment, security, infrastructure profile, resource provisioning, virtual domains, and execution environment for various grid services and API access tools.

 

4.    OGSI (Open Grid Service Infrastructure)

OGSA describes the features needed to implement grid services as web services. It does not, however, provide implementation details. The Open Grid Services Infrastructure (OGSI) provides the formal technical specification needed to implement grid services. It describes how the Web Services Description Language (WSDL) is used to define a grid service. OGSI also provides mechanisms for the creation and management of grid services and for interaction among them.

 

5.    OGSA-DAI

Open Grid Services Architecture–Data Access and Integration (OGSA-DAI) is a project conceived by the UK Database Task Force. It aims to develop middleware that provides access to, and integration of, distributed data sources using a grid. This middleware supports various data sources, such as relational and XML databases. These data sources can be queried, updated, and transformed via OGSA-DAI web services. These web services can be deployed within a grid, thus making the data sources grid-enabled.

 


 

6.    Web Services

Grid services, defined by OGSA, are an extension of web services. Important web service specifications include:

1.    eXtensible Markup Language (XML) – It forms the basis of web services. XML is a markup language for sharing of data across different interfaces using a common format.

2.    Simple Object Access Protocol (SOAP) – A platform-independent, message-based communication protocol.

3.    Web Services Description Language (WSDL) – An XML document format used to describe the web service interface. It includes information about portType, message, types, binding, port, and service.

4.    Universal Description, Discovery and Integration (UDDI) – An XML-based registry used for finding a web service on the Internet. It is a specification that allows a business to publish information about itself and its web services, allowing other web services to locate this information.
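
To make the XML and SOAP entries concrete, the following sketch builds a minimal SOAP 1.1 request using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `getJobStatus` operation, and the `jobId` parameter are hypothetical and do not belong to any real grid service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/grid/jobs"            # hypothetical service namespace


def build_request(operation, params):
    """Return a SOAP envelope (as a string) that calls `operation` with `params`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


message = build_request("getJobStatus", {"jobId": 42})
print(message)
```

In a real deployment, the operation name, parameter names, and endpoint would come from the service's WSDL document rather than being hard-coded.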

 

7.    SAGA (Simple API for Grid Applications)

The Simple API for Grid Applications (SAGA) is a family of related standards specified by the Open Grid Forum to define an application programming interface (API) for common distributed computing functionality. The SAGA Core API specification covers the following areas:

§  security and session management

§  permission management

§  asynchronous operations

§  monitoring

§  asynchronous notifications

§  attribute management

§  I/O buffer management

 


 

8.    GSI (Grid Security Infrastructure)

GSI is a well-known security solution in the grid environment. GSI is a portion of the Globus Toolkit and provides fundamental security services needed to support grids, including support for message protection, authentication and delegation, and authorization. GSI enables secure authentication and communication over an open network, and permits mutual authentication across and among distributed sites with single sign-on capability. GSI supports both message-level security and transport-level security.

 

9.    WSRF (Web Service Resource Framework)

WSRF defines a “generic and open framework for modeling and accessing stateful resources using web services”. It defines conventions for state management, enabling applications to discover and interact with stateful web services in a standard way. Standard web services do not have a notion of state. Grid-based applications need the notion of state because they often perform a series of requests in which the output of one operation may depend on the results of previous operations. The WS-Resource Framework can be used to develop such stateful grid services. The message exchange format in WSRF is defined using WSDL.
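
The idea of pairing a stateless service with separately addressed stateful resources can be sketched in plain Python. This only mimics the concept and does not reproduce WSRF message formats or any toolkit API; `CounterService` and its methods are hypothetical.

```python
import uuid


class CounterService:
    """Stateless front end; all state lives in resources addressed by key."""

    def __init__(self):
        self._resources = {}  # resource key -> resource properties

    def create_resource(self):
        """Create a new stateful resource and return its key."""
        key = str(uuid.uuid4())
        self._resources[key] = {"total": 0}
        return key

    def add(self, key, amount):
        """Update the resource named by `key`; the result depends on earlier calls."""
        resource = self._resources[key]
        resource["total"] += amount
        return resource["total"]

    def get_property(self, key, name):
        return self._resources[key][name]

    def destroy(self, key):
        del self._resources[key]


svc = CounterService()
a = svc.create_resource()
b = svc.create_resource()
svc.add(a, 5)
total_a = svc.add(a, 3)  # builds on the previous request for resource a
total_b = svc.add(b, 1)  # resource b has its own independent state
print(total_a, total_b)
```

Each request carries the key of the resource it refers to, so the same service operation can serve many clients while each sequence of requests sees its own state.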