Parallel databases – Specialised databases series – Part 2

Continuing the specialised database series, we move on to parallel databases (Part 1 covered deductive databases).

Moving away from a centralised database management system, parallel databases are designed to provide a huge increase in performance through parallel command execution. Ramakrishnan and Gehrke recount that although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations (Ramakrishnan & Gehrke, 2003). It should be noted that parallel databases are not equivalent to distributed databases; discussing the distinction between the two, Özsu and Valduriez explain:

“Distributed DBMSs assume loose interconnection between processors that have their own operating systems and operate independently. Parallel DBMSs exploit recent multiprocessor computer architectures in order to build high-performance and high availability database servers at a much lower price than equivalent mainframe computer” (Özsu & Valduriez, 1996)

There are three main architectures that have been proposed for parallel database management systems (DeWitt & Gray, 1992; Ramakrishnan & Gehrke, 2003; Mohamed et al., 1998):

  • Shared Memory – Any CPU has access to both memory and disk through a fast interconnect (e.g. a high-speed bus). This provides excellent load balancing; however, scalability and availability are limited.
  • Shared Disk – Each CPU has its own memory but shares the disk, so there is no longer competition for shared memory but there is still competition for access to the shared disk. This provides better scale-up, and load balancing is still acceptable. Availability is better than shared memory but still limited, as a disk failure would mean entire system failure.
  • Shared Nothing – Each processor has exclusive access to its own main memory and disk unit, so all communication between CPUs goes through a network connection. Shared nothing has high availability and reliability; if one node fails the others can still run independently. However, load balancing and data skew become major issues with this architecture.

 

Figure 1 – Physical Architectures for Parallel Database Systems (Ramakrishnan & Gehrke, 2003)

The architecture believed to scale best is shared nothing (Abouzeid et al., 2009), and many of the large DBMS vendors, such as Teradata and Microsoft, use it.

Figure 2 – Example of Parallel Query Processing (Oracle, n.d.)

When processing queries there are two different dataflow approaches a parallel database system could take:

  • Pipelined parallelism – Pipelining splits a process into stages and runs those stages on multiple processors. A command first passes through stage one; when that stage finishes, it hands the command to the next stage and can pull a new command off the queue to start executing at stage one.
  • Partition parallelism – Partitioning splits the input data among multiple processors and memories, executes on each partition in parallel, and then merges the outputs together.
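As a rough sketch (not from the cited sources), the split–execute–merge pattern behind partition parallelism can be illustrated in plain Python. Here a SUM aggregate is hash-partitioned across simulated nodes; the function and table names are illustrative only:

```python
# Sketch of partition parallelism: split the input by a hash of the key,
# run the same operator on each partition, then merge the partial results.
# (Illustrative only -- a real parallel DBMS runs each partition on its
# own processor or machine.)

def hash_partition(rows, n_partitions):
    """Assign each (key, value) row to a partition by hashing its key."""
    partitions = [[] for _ in range(n_partitions)]
    for key, value in rows:
        partitions[hash(key) % n_partitions].append((key, value))
    return partitions

def local_sum(partition):
    """The operator each simulated node runs independently on its own data."""
    return sum(value for _, value in partition)

def parallel_sum(rows, n_partitions=4):
    partitions = hash_partition(rows, n_partitions)
    # On a real system these partial sums run concurrently:
    partial_results = [local_sum(p) for p in partitions]
    return sum(partial_results)  # merge step

rows = [("order-%d" % i, i) for i in range(100)]
print(parallel_sum(rows))  # -> 4950, same answer as a serial sum
```

The merge step is trivial for SUM; operators such as AVG need each node to return partial state (sum and count) rather than a final value.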

 

Figure 3 – The dataflow approach to relational operators gives both pipelined and partitioned parallelism (DeWitt & Gray, 1992)
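The pipelined side of this dataflow can be sketched with Python generators, where each stage consumes tuples from the previous one as they arrive. This is a single-CPU illustration only; the operator names are illustrative, and a parallel DBMS would run each stage on its own processor:

```python
# Sketch of pipelined parallelism: each relational operator is a stage
# that consumes tuples from the previous stage as they are produced,
# so all stages can be busy at once on different tuples.

def scan(rows):                      # stage 1: produce tuples
    for row in rows:
        yield row

def select(rows, predicate):         # stage 2: filter tuples
    for row in rows:
        if predicate(row):
            yield row

def project(rows, column):           # stage 3: keep one column
    for row in rows:
        yield row[column]

rows = [{"id": i, "total": i * 10} for i in range(5)]
pipeline = project(select(scan(rows), lambda r: r["total"] >= 20), "id")
print(list(pipeline))  # -> [2, 3, 4]
```

Because each stage yields tuples one at a time, no stage has to wait for its predecessor to finish the whole input before starting work.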

Parallel databases are typically used wherever performance needs to be high, such as in large-scale data warehousing and analytics.

Further Analysis

Parallel database technology has been around for over 25 years. Although the initial forecast for parallel databases looked bleak (DeWitt & Gray, 1992), in recent years they have entered the mainstream. This can be attributed to the rise and dominance of the relational model and the suitability of relational queries for parallel execution. The aim has always been to improve performance; the problem faced by conventional database management was known as the “I/O bottleneck” (Valduriez, 1993), caused by high disk and memory access times. Initial solutions saw mainframe designers use special-purpose hardware; however, this failed due to high costs and poor performance. Meanwhile, fast and inexpensive multiprocessors became widely available, offering machines more powerful and cheaper than their mainframe counterparts. Parallel databases have exploited this move to fast, cheap disks, processors and memory, and combined it with the popularity of the relational model, to establish themselves commercially.

With the introduction of parallel databases, users have had to adjust the design of their applications and databases to allow for a seamless experience. Oracle states that there are four levels of scalability needed for the successful implementation of parallel processing and parallel databases (Oracle, n.d.):

  • Hardware – Every system must be able to access the CPUs, since they are at the core of parallel execution. The bandwidth and latency of the access link determine the scalability of the hardware.
Figure 4 – Example of interconnection network for Shared-nothing Architecture
  • Operating System – This is an important issue if there is a shared-memory node. Synchronisation done by the operating system can impact the scalability of the whole system. For example, if the OS was developed for single-CPU systems you would have asymmetric multiprocessing, where only a single CPU can handle I/O interrupts. If multiple user processes on that system request a resource from the OS, a bottleneck is created and hardware scalability is lost as a result. You therefore need to make sure the OS supports symmetric multiprocessing; most of today’s operating systems do, so this problem would only be highlighted on legacy systems.
Figure 5 – Asymmetric multiprocessing
Figure 6 – Symmetric multiprocessing with shared memory
  • Database Management System – It is important to decide whether the parallelism happens internally or externally; that is, does the DBMS parallelise the query, or does an external process parallelise it? Synchronisation is highly important, and an efficient DBMS enables better speed-up and scale-up.
  • Application – Applications need to be designed to be scalable. Even if the hardware, software and database are scalable, the application may not be: for example, a table with only one row which every node updates will serialise on that one data block. Oracle gives the following example:

UPDATE ORDER_NUM
SET NEXT_ORDER_NUM = NEXT_ORDER_NUM + 1;
COMMIT;

Code snippet 1 – Contention on a single data block

Code snippet 1 shows that every node needing to update the order number has to wait for the row of the table to become free. A better solution would be:

INSERT INTO ORDERS VALUES
(order_sequence.nextval, … )

Code snippet 2 – A more scalable solution than code snippet 1
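Oracle’s `order_sequence` is a database sequence object. As a rough illustration of the same idea (not Oracle code), Python’s stdlib sqlite3 can show the pattern of letting the database generate keys rather than serialising every node on a one-row counter; the table and column names here are made up:

```python
import sqlite3

# Sketch of the scalable pattern from code snippet 2, using sqlite3
# in place of Oracle. Table/column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_num INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)"
)

# Let the database generate the key (analogous to order_sequence.nextval)
# instead of read-modify-writing a one-row counter table that every
# node would have to lock in turn.
conn.execute("INSERT INTO orders (item) VALUES ('widget')")
conn.execute("INSERT INTO orders (item) VALUES ('gadget')")

print(conn.execute("SELECT order_num, item FROM orders").fetchall())
# -> [(1, 'widget'), (2, 'gadget')]
```

The point is the same in both systems: key generation happens inside the engine, so concurrent inserters never contend on one shared row.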

Consideration must also be given to how the clients are connected to the server machines; this should be done in a scalable manner, which implies that the network must also be scalable.

In summary, the points that application and database designers have had to take into consideration are:

  • Synchronisation – How do we ensure data integrity and prevent concurrency issues? Can the application tolerate “dirty” data? What type of locking does my application require?
  • Data separation/placement – How will the data be separated?
  • Query optimisation – How will the queries be optimised to allow for parallel execution?
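The data separation/placement question usually comes down to choosing a placement scheme. As a sketch (the function names and partition boundaries below are illustrative, not from the sources), two common choices can be compared in a few lines:

```python
import bisect

# Sketch of two common data-placement schemes a designer might choose
# between. Boundaries and node counts are illustrative only.

def hash_placement(key, n_nodes):
    """Hash placement: spreads keys evenly, good for equality lookups."""
    return hash(key) % n_nodes

def range_placement(key, boundaries):
    """Range placement: keeps adjacent keys together, good for range scans."""
    return bisect.bisect_left(boundaries, key)

# With boundaries [100, 200], keys go to node 0 (<= 100),
# node 1 (101..200) or node 2 (> 200).
print(range_placement(50, [100, 200]))   # -> 0
print(range_placement(150, [100, 200]))  # -> 1
print(range_placement(999, [100, 200]))  # -> 2
```

Range placement makes a range query touch few nodes but risks the skew problem mentioned under shared nothing, since hot key ranges pile onto one node; hash placement balances load but scatters range scans across every node.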

Research into emerging parallel database technologies shows that companies now often report loading more than a terabyte of structured data a day into their analytical database systems (Monash, 2008); this will most likely continue to rise, and performance expectations will remain, if not increase. Companies are already finding that traditional parallel database solutions can be too expensive on massive datasets, which has led several big names to develop distributed data storage and processing systems on large clusters of shared-nothing servers (Chaiken et al., 2008). These systems include Google’s File System, BigTable and MapReduce, Hadoop, and Microsoft’s Dryad. Such clusters can consist of hundreds or even thousands of machines, so writing a program that maximises the parallelism can prove difficult. Emerging abstraction frameworks such as MapReduce provide a more user-friendly way of programming; however, users have to map their applications to the MapReduce model to obtain parallelism, which renders the code coupled and less reusable. We are already beginning to see new languages, such as SCOPE, form in an attempt to resolve these problems. Over the next 5–10 years we might expect to see a move away from parallel databases (PDBs) and the evolution of the already existing hybrid combination of PDBs and distributed computing frameworks such as MapReduce. Abouzeid et al. state in their paper:

“Ideally, the scalability advantages of MapReduce could be combined with the performance and efficiency advantages of parallel databases to achieve a hybrid system that is well suited for the analytical DBMS market and can handle the future demands of data intensive applications”

In 2011 an article was written on the prototyping of one such hybrid, LinearDB (Wang et al., 2011). It describes the explosion in the scale of the data warehouse and the move to “private clouds” based on clusters of low-cost machines. The paper reiterates that parallel database architectures are not designed for several hundred inexpensive machines, which are likely to be unreliable, and that a MapReduce-based system, although more scalable, has inferior performance. This further suggests a lean towards hybrid systems.
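To make the MapReduce programming model concrete, here is a minimal in-process sketch (illustrative only; a real MapReduce or Hadoop deployment distributes these phases across a cluster, and the map/shuffle/reduce function names below are my own):

```python
from collections import defaultdict
from itertools import chain

# Minimal in-process sketch of the MapReduce model: the user supplies
# map and reduce functions; the framework handles partitioning,
# shuffling and parallel execution across the cluster.

def map_phase(document):
    """Map: emit (word, 1) for every word in one input document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values for one key."""
    return key, sum(values)

docs = ["parallel databases scale", "databases scale out"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # -> {'parallel': 1, 'databases': 2, 'scale': 2, 'out': 1}
```

Even in this toy form, the coupling issue noted above is visible: the word-count logic only exists as map and reduce callbacks shaped to fit the framework, not as freestanding, reusable code.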

AT&T is reported to use Teradata as its data warehouse provider (Wikipedia, n.d.). Since Teradata implements parallel databases in its solutions, this indicates that AT&T uses them. Furthermore, AT&T is an American multinational telecommunications corporation (Wikipedia, n.d.), which implies it has huge amounts of data to process. Parallel databases are a fitting solution, giving AT&T fast processing of large amounts of data so it can provide its customers with a quick service.

That’s it for part 2, until next week….

Thanks,

Sara :)

<shameful plug>If you enjoy the post please retweet ;) </shameful plug>

If you can’t wait for the series here is the full text.

References

Filed under: Research, Technical
