JOURNALING ON RAID    
 

ReiserFS, JFS, and Ext3FS show their merits on a fast RAID appliance.

   
  by  Jack Fegreus    
     
 

As Linux server usage continues to expand beyond specialized web and e-mail serving and into the heart of database-driven IT applications, the need for serious data storage options grows in tandem. Spurring this trend on is the nearly universal adoption of 24X7 IT operations for business continuity.  But that's only half the story. Linux itself is a very young and dynamically evolving OS. As a result, taking full advantage of the latest in data-center devices can be a challenge in its own right.

A perfect example of coming to terms with the right hardware/software synergy can be found in our evaluation of Adaptec's new RAID appliance, the DuraStor 6200SR. The premise of this RAID appliance is to maximize RAID options for rack-mounted server farms by moving RAID management out of the server and into a rack-mounted 1U RAID appliance that can provide access to upwards of 36 drives for multiple servers.

 
       
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Adaptec DuraStor 6220SS
*DuraStor 6200SR with 128MB cache
*DuraStor 312R with 4 Seagate Ultra160 SCSI drives
http://www.adaptec.com

Linux Journaling File systems
Ext3
JFS
ReiserFS

HOW WE TESTED
HP Netserver LP 1000r with 512MB RAM

http://hp.com
QLogic QLA12160 SCSI HBA
http://www.qlogic.com

SuSE Linux 7.3
http://www.suse.com

OBLdisk
OBLload

KEY FINDINGS
 Sequential data throughput on the DuraStor 6200SR was consistent with PCI-based RAID controllers.
 Best overall performance measured with ReiserFS and JFS file systems.
 Transaction processing benchmarks using OBLload benchmark demonstrated significantly lower results when compared to the performance of a similar configuration using an HP NetRAID-2M controller.

 

The base complete RAID appliance system is the DuraStor 6220SS, which comprises a DuraStor 6200SR controller module and one DuraStor 312R storage module. The storage module has a 2U form factor and can hold 12 drives. From this starting point, the configuration options grow fast and furious. The 6200SR module can be configured with either one or two internal controller modules for computer host connections. Adaptec dubs these host configurations single-port and dual-port mode.

The DuraStor 312R module by default is configured with two disk I/O channels. The net result of all of these options is the ability to configure the RAID appliance in four basic configurations with various levels of device redundancy: stand-alone single-port, stand-alone dual-port, active-active single-port, and active-passive dual-port. Setup and management of these options can be accomplished either through the front control panel of the 6200SR or via Adaptec's Java-based Storage Manager Pro software.

Certainly, running a comprehensive software control module is a lot easier than pushing buttons and squinting at the LCD readout of a 1U console. The console method works, and we accomplished a complete configuration in this manner. Nonetheless, the console approach is far from elegant. On the other hand, one might expect a Java-based application written in the context of a browser to run everywhere. Not exactly: as is usually the case, the Storage Manager Pro software runs only on Windows.

Nonetheless, most sites have a rich mix of Windows NT/2000 servers along with Linux. And given the wide range of configuration options for the 6200SR, in theory it should be possible to configure a dual-port scheme that pairs a Linux server with a Windows 2000 server. What's more, in practice it works marvelously well; just don't tell Adaptec you're doing that.

 
     
 

When Adaptec talks about dual-porting scenarios, the assumption is that a single host system is connected via two internal SCSI HBAs. For Adaptec, the heart of the dual-porting problem is that the first version of Storage Manager Pro deals strictly with configuring and managing the RAID appliance. That means configuring logical RAID arrays (including RAID 10 and 50) and monitoring the health of the physical drives. There is nothing in the software that virtualizes the array volumes at the host level.

Currently, Adaptec must rely on the systems administrator to have in place clustering or virtualization software to prevent any potential catastrophic data corruption incidents in a dual-porting scenario. This is not a difficult stumbling block, especially in a mixed OS configuration. Windows 2000 is blind to Linux file partitions, so its greedy device-gobbling ways are never an issue.

On a single logical volume, we were easily able to create four primary logical drive partitions. We installed QLogic Ultra160 SCSI HBAs in two HP Netserver LP 1000r servers, which were running SuSE Linux 7.3 and Windows 2000 Server respectively. We chose SuSE 7.3 because it gave us direct out-of-the-box access to all of the major Linux file systems: ext2, ext3, JFS, and ReiserFS.

 
       
 

At the BIOS level, the QLogic HBAs reported both the DuraStor 6200SR unit and the logical "DuraStor disk" that we had created and configured as having four primary partitions. Both SuSE Linux and Windows 2000 saw the disk, and we were able to format the partitions to suit our testing. At the OS level, Windows 2000 formatted one partition as NTFS and nicely ignored the Linux-formatted partitions. When we loaded the Storage Manager Pro software on this system, it recognized the DuraStor 6200SR attached to the QLogic HBA and we were able to launch the management browser.

The management browser supplies all of the necessary basics. The systems administrator can drill down on the devices. This view allows checking on the physical aspects of the RAID appliance such as power, temperature, and hardware faults. From there, any array associated with the drives in a DuraStor 312R can be accessed and configured. What is lacking, however, is real-time performance monitoring and tuning. For example, there is currently no provision for collecting caching statistics. For Linux hosts, this would be a powerful add-on module.

 
Adaptec's Storage Manager Pro software currently runs only on a Windows platform. This Java-based utility replicates all of the configuration and management chores that must otherwise be done at the 6200SR LCD console.
 
     
 

For version 2.4 of the Linux kernel, cache is king when it comes to I/O. This is clearly reflected in the baseline performance results of our OBLdisk and OBLload benchmark suites. The fundamental assumption for disk I/O under Linux is that the data should be in cache. As a result, everything necessary to make that so is done. When that strategy fails, however, the performance hit taken by the Linux OS can be quite severe. 

On the other hand, data being in cache is a nice thing for Windows NT/2000, but it is in no way a necessary thing for good performance. Unlike Linux, the Windows NT/2000 I/O subsystem anticipates that needed data won't be in cache. As a result, Windows NT/2000 follows a strategy of launching volumes of asynchronous I/O requests in order to go about its processing tasks until the responses come back from the storage devices. 

Our OBLdisk benchmark, which reads data sequentially from a disk file in increasingly larger block-size requests, reveals two distinctly different I/O throughput profiles. To optimize cache utilization, the Linux ext2 file system bundles I/O requests in order to issue large-block read requests. Such requests have the added advantage of triggering large-block look-ahead requests, which serve to populate the cache for likely future hits. With 512MB of RAM in the server and a dedicated 128MB RAM device cache, sequentially reading files from the DuraStor volume (even files up to 256MB in size) with tiny I/O requests still streamed data at full bus speed. In contrast, streaming throughput performance under Windows follows a fast ramp-up curve as I/O request sizes get larger. The Windows 2000 operating system does not intervene for an application making small I/Os. To reap the advantages of large-block reads under Windows 2000, the application must explicitly issue large-block reads.
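The kind of measurement described here can be approximated with a few lines of Python. This is only a sketch of a sequential-read sweep, not the OBLdisk benchmark itself, whose internals are not described in this article; the function names and the list of request sizes are assumptions made for illustration.

```python
import os
import tempfile
import time


def sequential_read_mb_per_s(path, block_size):
    """Read the whole file front to back in block_size requests; return MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 turns off Python's own buffering, so each read() is one request
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9)


def sweep(path, block_sizes=(512, 2048, 8192, 32768, 131072)):
    """Return {block_size: MB/s} for increasingly larger request sizes."""
    return {bs: sequential_read_mb_per_s(path, bs) for bs in block_sizes}


if __name__ == "__main__":
    # Hypothetical 4MB scratch file; it will be served entirely from cache
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        for bs, rate in sweep(tmp.name).items():
            print(f"{bs:>7} bytes/request: {rate:8.1f} MB/s")
    finally:
        os.unlink(tmp.name)
```

A small scratch file like the one above only exercises the operating system's cache. To see the storage device rather than memory, the file would need to be considerably larger than the combined system and controller cache.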

The Windows I/O processing strategy does have a decided advantage in a high transaction-processing environment. Here the asynchronous approach pays major dividends. In a database-driven application with hundreds of independent simultaneous users, the I/O pattern is made up of a complex mix of localized high activity areas, such as index tables, and essentially random access over the remaining areas of the disk. In such a scenario, robust asynchronous I/O is essential so as not to be held hostage by localized caching performance. This is currently the one really bright spot for Windows 2000 in any benchmark comparison with Linux—with a very strong emphasis on currently. One of the hot areas of Linux kernel development is to dramatically improve asynchronous I/O, and fortunately the offending legacy constructs have all been easily identified.
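The idea of keeping many requests in flight rather than waiting on each one can be sketched in Python. This is an illustration only: it uses a thread pool to emulate asynchronous I/O, which is not how the Windows 2000 I/O subsystem or any Linux kernel interface is implemented, and the function name, worker count, and 8KB request size are assumptions. The os.pread call is available on Unix-like systems only.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8 * 1024  # 8KB request size (an assumption for this sketch)


def random_reads(path, count, workers=32, seed=0):
    """Issue `count` random 8KB reads with up to `workers` requests in flight.

    Returns the total number of bytes read.
    """
    size = os.path.getsize(path)
    rng = random.Random(seed)
    offsets = [rng.randrange(0, max(size - BLOCK, 1)) for _ in range(count)]
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Submit every request up front; none waits for the one before it
            futures = [pool.submit(os.pread, fd, BLOCK, off) for off in offsets]
            return sum(len(f.result()) for f in futures)
    finally:
        os.close(fd)
```

With workers set to 1 the same function degenerates into one-at-a-time synchronous reads, which makes it a convenient way to compare the two strategies on the same file.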

 
           
    Until these changes are implemented, Linux transaction processing will remain bound by the speed of cache hits. This puts a double whammy on the Adaptec DuraStor configuration. In previous OpenBench Labs tests using PCI-based RAID controllers, the size of cache and system memory proved to be dominant variables in the performance equation.

In the case of the DuraStor RAID appliance and the recently tested HP NetRAID-2M controller, system memory and cache size were identical. The big difference between the two is that the HP NetRAID-2M sits internally on a 64-bit PCI bus while the DuraStor RAID controller is at the end of an Ultra160 SCSI bus.

 
 
     
 

This is a relatively insignificant hurdle for Windows 2000, which blithely went on fulfilling 3,500 8KB I/O requests per second. For Linux, however, this configuration proved a major stumbling block. Throughput fell from 1,500 I/Os per second using the HP NetRAID-2M card to 500 I/Os per second on the DuraStor. This of course raises the question of just how many applications need to process more than 500 I/Os per second. The importance of the DuraStor RAID appliance lies in its capabilities for supporting large numbers of disks, configuration flexibility for supporting numerous variations in hardware redundancy, and dual porting for use in high-availability clusters.
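For a sense of scale, these I/O rates can be converted into data rates. The sketch below assumes every request is 8KB, the size quoted for the Windows 2000 result. The request size for the Linux runs is not stated here, so those two figures are illustrative only.

```python
def iops_to_mb_per_s(iops, request_kb=8):
    """Convert an I/O rate into MB/s, assuming a fixed request size in KB."""
    return iops * request_kb / 1024


if __name__ == "__main__":
    for label, iops in [
        ("Windows 2000 on DuraStor", 3500),
        ("Linux on HP NetRAID-2M", 1500),
        ("Linux on DuraStor", 500),
    ]:
        print(f"{label}: {iops_to_mb_per_s(iops):.2f} MB/s")
```

Under that assumption, 500 I/Os per second amounts to just under 4MB per second of randomly accessed data.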

The clear importance of the role that caching plays in a Linux environment for overall system performance in general and file system performance in particular makes file system architecture an important aspect of any Linux distribution. This is reflected in the interest being paid to the performance of the journaling file systems that are now included in the latest distributions of Linux for production environments. SuSE 7.3 includes three of these file systems: ReiserFS, JFS, and ext3FS. 

The new file systems are designed to provide more robust file structures through the introduction of journaling techniques pioneered in high-end relational database systems. To understand the significance of these new file systems, it is first necessary to understand the problem they are designed to solve. Traditional old-line Unix file systems date back to a time when disk storage was very costly. As a result, they were designed first and foremost to minimize wasted disk blocks. To minimize the number of empty disk blocks associated with a file, these file systems allocated a minimal number of disk blocks to a file at creation. When these blocks were fully utilized, the file system continued to add new blocks in minimal amounts as necessary.

Naturally, such an allocation scheme only serves to fragment files into scattered clusters of disk blocks. This makes it essential to have additional metadata that describes the attributes of files and maps the scattered physical disk blocks onto the sequential logical blocks of a file. A single logical write will therefore typically involve multiple physical writes of metadata. A system crash in the midst of one of these multiple writes can easily leave the system in a very corrupted state. The solution to this problem has been the dreaded fsck utility, which scavenges all of a volume's metadata to restore its structure to a consistent state in what can be a considerably time-consuming process.

The new journaling file systems borrow constructs from high-end transaction-processing databases to simplify the task of maintaining structural consistency. These file systems log operations performed on the file system's metadata as atomic transactions. In the event of a system failure, simply replaying a finite set of log records representing the period since the last file system checkpoint restores the volume to a consistent state. In many instances, these metadata-only transaction logs actually reduce the total amount of data involved in a write operation.
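As a rough illustration of the replay idea, the following Python sketch models metadata as a dictionary and the journal as a list of records. The record layout and the names replay, checkpoint, and journal are hypothetical; none of ReiserFS, JFS, or ext3FS uses this format on disk.

```python
def replay(checkpoint, journal):
    """Rebuild metadata from the last checkpoint plus committed transactions.

    journal is a list of tuples:
      ("begin", txid)
      ("set", txid, key, value)
      ("delete", txid, key)
      ("commit", txid)
    A transaction with no commit record (a crash in mid-update) is skipped
    entirely, which is what makes each transaction all-or-nothing.
    """
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    metadata = dict(checkpoint)
    for rec in journal:
        op, txid = rec[0], rec[1]
        if txid not in committed:
            continue
        if op == "set":
            metadata[rec[2]] = rec[3]
        elif op == "delete":
            metadata.pop(rec[2], None)
    return metadata


if __name__ == "__main__":
    checkpoint = {"inode 7 size": 4096}
    journal = [
        ("begin", 1),
        ("set", 1, "inode 7 size", 8192),
        ("set", 1, "inode 7 block 1", 5120),
        ("commit", 1),
        ("begin", 2),
        ("set", 2, "inode 7 size", 12288),
        # crash here: transaction 2 never committed
    ]
    print(replay(checkpoint, journal))
```

Recovery cost in this model depends on the length of the journal since the checkpoint, not on the size of the volume, which is the contrast with a full fsck scan.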

In addition, these file systems also extend a number of I/O enhancement techniques introduced with the current de facto file system for Linux: ext2FS. By the time Linux was introduced, the explosion in disk capacity was well underway. The ext2 file system therefore attempts to maximize performance rather than minimize disk space. To do this, ext2 was modified to delay issuing writes in order to bundle them whenever possible into more efficient large-block operations.
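The bundling of delayed writes can be pictured with a short Python sketch. It is a simplified model, assuming the pending writes do not overlap, and it is not taken from the ext2 source; the function name coalesce is hypothetical.

```python
def coalesce(pending):
    """Merge adjacent writes into larger ones.

    pending: list of (offset, data) small writes, assumed non-overlapping.
    Returns a list of (offset, data) in which contiguous writes are joined.
    """
    merged = []
    for offset, data in sorted(pending, key=lambda w: w[0]):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            # This write starts exactly where the previous one ends: join them
            prev_offset, prev_data = merged[-1]
            merged[-1] = (prev_offset, prev_data + data)
        else:
            merged.append((offset, data))
    return merged


if __name__ == "__main__":
    writes = [(4096, b"b" * 4096), (0, b"a" * 4096), (16384, b"c" * 4096)]
    for offset, data in coalesce(writes):
        print(offset, len(data))
```

In this example three 4KB writes become two device operations, one of 8KB at offset 0 and one of 4KB at offset 16384. The longer writes are held back, the more opportunities there are to join neighbors.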

This construct has evolved further in the new file systems, which automatically allocate new disk blocks to files in large-block extents. Extents also serve to speed reads. When a read request is issued, that request can be expanded to put the entire extent in cache on the reasonable assumption that all of the data will eventually be accessed. The net result is that these new file systems should often deliver faster writes than ext2FS and equivalent reads.
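The expansion of a read to cover whole extents comes down to simple rounding. The sketch below assumes fixed-size, aligned extents of 64KB, which is a simplification chosen for illustration; extents in real file systems vary in length, and the function name is hypothetical.

```python
def expand_to_extents(offset, length, extent_size=64 * 1024):
    """Widen a byte-range read so that it covers whole extents.

    Assumes fixed-size, aligned extents (a simplification).
    Returns (start, size) of the enlarged read.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // extent_size) * extent_size              # round down
    end = -(-(offset + length) // extent_size) * extent_size   # round up
    return start, end - start


if __name__ == "__main__":
    # A 100-byte read in the middle of the second extent pulls in all of it
    print(expand_to_extents(70000, 100))
```

A request that straddles an extent boundary is widened to cover both extents, so a single small read can populate the cache with as much as two extents of data.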

In our first look at the new file systems, we tested them with our OBLdisk benchmark. To minimize the effects of caching, we wrote to and read from a 512MB file with a single thread and with multiple threads accessing the same file at different locations. We then followed up by copying a file hierarchy with 14,648 files in 1,645 folders representing 3GB of data. We used the system defaults for each of the file systems and did not attempt any manual tuning in this first analysis.

For the most part, the test results went according to Hoyle. In the OBLdisk benchmark, write performance was consistently higher on the journaling file systems than ext2FS. Read performance was slightly lower. And as expected, with multiple threads, total read throughput increased while total write throughput declined. Of particular interest, performance differences were most evident with ext3FS, which can be considered a superset of ext2FS.

 
           
 

One of the unique advantages of ext3FS is the ability to easily upgrade an existing ext2FS volume in place. In our tests we used both the default configuration, which logs only metadata, and the fstab option data=journal, which logs both the metadata and the physical data changes to the file system. The latter option slowed writes to a fraction of ext2FS performance.

 
 
           
The most intriguing results, however, were reserved for the file structure copy test. Consistent with the OBLdisk benchmark results, the copy operation was faster on ReiserFS and JFS (135 seconds in both cases) than on ext2FS (147 seconds).
 
     
Directory copy performance on the ext3FS volume, however, was consistently slower (170 seconds). With both metadata and real data journaling, the time to copy zoomed to 203 seconds. We'll be following up on these tests with more detailed examinations of tuning options in upcoming Storage Area Network reviews.
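To put the copy times on a common footing, the short Python sketch below expresses each result as a percentage change relative to ext2FS. The timings are the ones reported above; the dictionary labels and function name are ours.

```python
COPY_SECONDS = {
    "ext2FS": 147,
    "ReiserFS": 135,
    "JFS": 135,
    "ext3FS (metadata only)": 170,
    "ext3FS (data=journal)": 203,
}


def relative_to_baseline(times, baseline="ext2FS"):
    """Return the percent change in copy time versus the baseline file system."""
    base = times[baseline]
    return {name: round(100.0 * (t - base) / base, 1) for name, t in times.items()}


if __name__ == "__main__":
    for name, pct in relative_to_baseline(COPY_SECONDS).items():
        print(f"{name}: {pct:+.1f}%")
```

By this measure ReiserFS and JFS finished about 8 percent sooner than ext2FS, while ext3FS took about 16 percent longer with metadata journaling and about 38 percent longer with data=journal.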