Cloud Storage File System (CSFS™)

Accessing Cloud Storage outside your web browser is a big challenge. Until now, there has been no cost-effective yet comprehensive solution that provides native connectivity to Cloud Storage Providers over NFS, Windows (CIFS or SMB), FTP, iSCSI or other storage protocols. BridgeSTOR remedies this with CSFS.

CSFS is a Linux file system, available on CentOS, Red Hat and SUSE, that translates POSIX file system calls into REST object-based calls for Cloud Storage. REST-style object APIs were popularized by Amazon S3 Cloud Storage and are rapidly becoming a de facto standard among both Cloud Storage and Object Storage vendors. The CSFS back end communicates with Cloud Storage Providers over a REST interface. For example, standard file system calls to create, read, write and delete files are translated into REST calls such as GET, PUT and DELETE.
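The translation described above can be pictured with a small sketch. It is not the CSFS translation table: the verb mapping follows common S3-style conventions, and the endpoint, bucket name and the `translate` helper are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed mapping from file operations to REST verbs (S3-style convention,
# not taken from CSFS documentation).
POSIX_TO_REST = {
    "create": "PUT",
    "write": "PUT",
    "read": "GET",
    "delete": "DELETE",
}


@dataclass(frozen=True)
class RestCall:
    method: str
    url: str


def translate(op: str, path: str,
              endpoint: str = "https://storage.example.com",
              bucket: str = "my-bucket") -> RestCall:
    """Turn a file operation on a POSIX path into a REST call description."""
    method = POSIX_TO_REST[op]      # raises KeyError for unsupported operations
    key = path.lstrip("/")          # object keys carry no leading slash
    return RestCall(method, f"{endpoint}/{bucket}/{key}")


if __name__ == "__main__":
    print(translate("read", "/docs/report.txt"))
    print(translate("write", "/docs/report.txt"))
```

In this sketch the file path simply becomes the object key, so a directory tree maps onto key prefixes inside one bucket.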

CSFS also includes native deduplication and compression, which not only reduce the cloud disk capacity consumed but also reduce the bandwidth required to transfer objects across the internet. This greatly improves the archiving of deduplicated data in offsite locations and enables new storage solutions to evolving challenges.

Global File View

Cloud Storage has traditionally not been a good location for file structures and metadata. CSFS overcomes this by allowing customers to create a Global File View for all of their Cloud Storage. A key element of CSFS is the capability of separating metadata from the physical data location, while keeping the two combined as a single object. This significant architectural achievement allows metadata to be stored in one location while physical data is maintained in Cloud Storage. The CSFS global view maintains the metadata in a separate clustered VM environment. In this way, CSFS exposes all files and directories without accessing the Cloud Storage platform. However, for disaster recovery purposes, CSFS will also duplicate all metadata inside the Cloud Storage. This allows CSFS to rebuild all metadata in case of a local disaster.

Advanced File System Functionality

Besides residing 100% in the Linux kernel (CSFS is not FUSE based), CSFS also includes an advanced I/O path. Most operating systems today still rely on single-threaded I/O, which limits throughput to and from data. CSFS solves this problem by adding asynchronous processing to the Linux environment. As your data is written, CSFS consolidates it into large blocks. As each large block is completed, it is sent to the Cloud Storage provider in the background. This caching extends the local Linux buffer cache and greatly enhances access speeds. When link speeds permit, single-file writes can easily reach 350 MB/sec or more over Windows (CIFS or SMB) and NFS.
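The consolidate-then-upload behavior can be sketched in user space as follows. This is only an illustration of the idea: CSFS itself runs in the kernel, and the `WriteBackCache` class, its `upload` callback and the 4 MiB default block size are assumptions made for the example.

```python
import queue
import threading


class WriteBackCache:
    """Collects small writes into large blocks and uploads them in the background."""

    def __init__(self, upload, block_size=4 * 1024 * 1024):
        self._upload = upload            # hypothetical callable that sends one block
        self._block_size = block_size
        self._buffer = bytearray()
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, data: bytes) -> None:
        """Returns as soon as the data is buffered; completed blocks are queued."""
        self._buffer += data
        while len(self._buffer) >= self._block_size:
            block = bytes(self._buffer[:self._block_size])
            del self._buffer[:self._block_size]
            self._pending.put(block)

    def close(self) -> None:
        """Flushes the final partial block and waits for the uploads to finish."""
        if self._buffer:
            self._pending.put(bytes(self._buffer))
            self._buffer.clear()
        self._pending.put(None)          # sentinel that stops the worker
        self._worker.join()

    def _drain(self) -> None:
        while True:
            block = self._pending.get()
            if block is None:
                break
            self._upload(block)


if __name__ == "__main__":
    uploaded = []
    cache = WriteBackCache(uploaded.append, block_size=8)
    for _ in range(5):
        cache.write(b"abc")
    cache.close()
    print([len(block) for block in uploaded])
```

The caller never waits on the network: `write` only touches memory, and a single worker thread preserves block order while sending.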

Data Deduplication

Sub-File (Block-level) Deduplication

CSFS block-level data deduplication operates at the sub-file level. As its name implies, files to be deduplicated are decomposed into segments (chunks, clusters or blocks) that are examined for redundancy against previously stored information.

In CSFS “block level” deduplication, blocks of data are “fingerprinted” using a hashing algorithm (SHA-1) that produces a unique, “shorthand” identifier for each data block. These unique fingerprints along with the blocks of data that produced them are indexed, optionally compressed and encrypted and then retained. Duplicate copies of data that have previously been fingerprinted are deduplicated, leaving only a single instance of each unique data block along with its corresponding fingerprint.
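A minimal sketch of fingerprint-based block deduplication, assuming fixed 4 KiB blocks and an in-memory index, is shown here. The function names and the block size are illustrative choices rather than CSFS internals, and compression and encryption are left out for brevity.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}    # fingerprint -> unique block contents
    recipe = []   # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        if fingerprint not in store:     # new block: retain it
            store[fingerprint] = block
        recipe.append(fingerprint)       # duplicate: only the reference is kept
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the block store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    data = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = deduplicate(data)
    print(len(recipe), "blocks written,", len(store), "unique blocks stored")
    assert rebuild(store, recipe) == data
```

Here four blocks are written but only two are stored, because the three identical "A" blocks share one fingerprint.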

The fingerprints along with their corresponding full data representations are stored in a compressed (and optionally encrypted) form.

Once the block fingerprint value has been calculated, the deduplication engine must compare it against all the fingerprints that have previously been generated to see whether this block is unique (new) or has been processed previously (a duplicate). The speed at which these index search and update operations are performed is at the heart of a deduplication system’s throughput.

In an in-line deduplication engine, the time it takes to decide whether a block is new and unique or a duplicate that has to be deduplicated translates directly into latency, which is the enemy of in-line deduplication performance.

CSFS allows the Hash Table to be memory-resident. The amount of memory required to hold the hash table is based on the amount of physical capacity being used and the deduplication block size.
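The sizing relationship can be approximated as the number of blocks (physical capacity divided by block size) multiplied by the size of one table entry. The sketch below assumes 32 bytes per entry (a 20-byte SHA-1 digest plus 12 bytes of location data); that figure is an assumption for illustration, not a published CSFS value.

```python
def hash_table_bytes(physical_capacity_bytes: int,
                     block_size_bytes: int,
                     entry_bytes: int = 32) -> int:
    """Estimate the memory needed for one hash table entry per stored block."""
    # Ceiling division: a partially filled last block still needs an entry.
    blocks = -(-physical_capacity_bytes // block_size_bytes)
    return blocks * entry_bytes


if __name__ == "__main__":
    TiB = 2 ** 40
    KiB = 2 ** 10
    MiB = 2 ** 20
    for block_size in (4 * KiB, 64 * KiB):
        needed = hash_table_bytes(1 * TiB, block_size)
        print(f"{block_size // KiB} KiB blocks: {needed // MiB} MiB of table")
```

Under these assumptions, 1 TiB of physical capacity needs about 8 GiB of table memory with 4 KiB blocks but only 512 MiB with 64 KiB blocks, which is why the block size matters so much for a memory-resident index.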

Finally, based on the data being processed, CSFS can significantly reduce storage utilization for unstructured data (NAS), using file-level deduplication, typically referred to as single instance storage.

File Optimization

CSFS incorporates optimizations for Virtual Machine Image data reduction. CSFS reduces off-site storage requirements for all data types (compressed backup data in tape format, i.e., .bkf files, will be supported in a future release) and adds BridgeSTOR-exclusive optimization technology for deduplicating Virtual Machine Images (VHD, VMDK, VDI). When CSFS detects a Virtual Machine Image, it mathematically repositions blocks internally to force proper block alignment. This allows VHD, VHDX and VMDK images to share blocks, permitting massive deduplication across VM images.
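Why alignment matters can be shown with a toy example. The sketch below does not reproduce the BridgeSTOR alignment method, which is not described here; it only assumes two images that carry the same payload behind headers of different lengths (512 and 1536 bytes, both hypothetical) and compares fixed-block fingerprints before and after the block boundaries are lined up with the payload.

```python
import hashlib

BLOCK = 4096


def fingerprints(image: bytes, start: int = 0) -> set:
    """SHA-1 fingerprints of the fixed-size blocks of image, beginning at start."""
    return {
        hashlib.sha1(image[offset:offset + BLOCK]).digest()
        for offset in range(start, len(image), BLOCK)
    }


# Eight distinct payload blocks shared by both images.
payload = b"".join(bytes([i]) * BLOCK for i in range(1, 9))

# Hypothetical images: same payload, different header lengths.
image_a = b"\xff" * 512 + payload
image_b = b"\xff" * 1536 + payload

# Block boundaries counted from the start of each file do not line up.
unaligned_shared = fingerprints(image_a) & fingerprints(image_b)

# Block boundaries counted from the start of each payload do line up.
aligned_shared = fingerprints(image_a, start=512) & fingerprints(image_b, start=1536)

if __name__ == "__main__":
    print("shared blocks, unaligned:", len(unaligned_shared))
    print("shared blocks, aligned:", len(aligned_shared))
```

Without alignment the two images share no blocks at all, even though their payloads are identical; with aligned boundaries all eight payload blocks deduplicate.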

Global Deduplication

A single instance of CSFS will perform deduplication against all the input data to which it is exposed. That collection of data can originate in multiple virtual and physical servers. For example, if your organization has Virtual Machine Images in multiple physical locations and the images from all of those locations are selected for the operation, they are deduplicated against one another.

Data Compression

Data compression re-encodes data so that it occupies less physical storage space. Data compression algorithms search for repeatable patterns of binary 0s and 1s within data structures and replace them with patterns that are shorter in length. The more repeatable patterns found by the compression algorithm, the more the data is compressed.

Because data compression is based on the contents of the data stream, compression algorithms are designed to adapt dynamically to different types of data in order to optimize their effectiveness.

To “decompress” compressed data, the operations performed by the compression algorithm are reversed.
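The effect of repeatable patterns, and the reversibility of compression, can be seen with a short experiment. It uses zlib from the Python standard library purely as a stand-in; the compression algorithm used by CSFS is not named in this document.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Highly repetitive input: the same ten bytes repeated 10,000 times.
repetitive = b"0123456789" * 10_000

# Pseudo-random input of the same length, with a fixed seed for repeatability.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(100_000))

if __name__ == "__main__":
    print(f"repetitive data ratio: {compression_ratio(repetitive):.4f}")
    print(f"random data ratio:     {compression_ratio(noisy):.4f}")
    # Decompression reverses compression and restores the original bytes.
    assert zlib.decompress(zlib.compress(repetitive)) == repetitive
```

The repetitive input shrinks to a small fraction of its size, while the random input barely changes, since it contains almost no repeatable patterns to replace.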

The effectiveness of any data compression method varies depending on the characteristics of the data being compressed. CSFS optionally utilizes data compression to save space within Cloud Storage. Data deduplication and compression are critical when sending data over low-speed lines. Data may be sent in half the time if compression and deduplication result in an overall data reduction of 50%.
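The transfer-time claim is simple arithmetic, sketched here under the assumption that the link is the only bottleneck and that protocol overhead is ignored. The `transfer_seconds` helper and the example figures (10 GB over a 100 Mbit/s line) are illustrative.

```python
def transfer_seconds(size_bytes: float,
                     link_bits_per_second: float,
                     reduction: float = 0.0) -> float:
    """Time to send data after removing the given fraction (0.5 means 50%)."""
    bytes_sent = size_bytes * (1.0 - reduction)
    return bytes_sent * 8 / link_bits_per_second


if __name__ == "__main__":
    size = 10e9    # 10 GB of source data
    link = 100e6   # 100 Mbit/s line
    print("no reduction: ", transfer_seconds(size, link), "seconds")
    print("50% reduction:", transfer_seconds(size, link, reduction=0.5), "seconds")
```

With these figures the transfer drops from 800 seconds to 400 seconds, which is the halving described above.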

Thin Provisioning

“Thin provisioning” simply means that the file system interprets a storage volume as having a (virtual) capacity that is in excess of its physical capacity. A thinly provisioned virtual logical unit (LUN) target is only partially backed by physical storage at the time of its creation with additional physical storage being added to the LUN as needed. The storage controller reports the capacity of a thinly provisioned LUN as its virtual capacity.

Data deduplication and data compression are thin provisioning techniques that improve storage utilization efficiency. The virtual capacity of the data-reduced disk target needs to be set based on a fixed assumption about the data reduction ratio that will be achieved. However, because the reducibility of the data stored on a deduplicated and compressed volume cannot be determined in advance, there can be an “over commitment level”, unknown at the time of volume creation, caused by lower than expected deduplication or compressibility. This presents additional challenges to the storage administrator, the most obvious being an out-of-capacity condition.
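The over-commitment risk can be made concrete with a small calculation. In this sketch the data reduction ratio is expressed as N:1 (logical bytes per physical byte), and the helper names and capacities are hypothetical examples rather than CSFS settings.

```python
TB = 10 ** 12


def virtual_capacity(physical_bytes: float, assumed_ratio: float) -> float:
    """Capacity advertised to clients, given an assumed N:1 reduction ratio."""
    return physical_bytes * assumed_ratio


def physical_needed(logical_bytes: float, actual_ratio: float) -> float:
    """Physical storage actually consumed at the achieved N:1 reduction ratio."""
    return logical_bytes / actual_ratio


def is_out_of_capacity(logical_bytes: float, physical_bytes: float,
                       actual_ratio: float) -> bool:
    """True when the stored data no longer fits in the physical backing store."""
    return physical_needed(logical_bytes, actual_ratio) > physical_bytes


if __name__ == "__main__":
    physical = 10 * TB
    print("advertised:", virtual_capacity(physical, 4) / TB, "TB at an assumed 4:1")
    written = 30 * TB   # well within the advertised virtual capacity
    for ratio in (4, 2):
        print(f"actual {ratio}:1 -> out of capacity:",
              is_out_of_capacity(written, physical, ratio))
```

A 10 TB volume advertised as 40 TB holds 30 TB of data comfortably at the assumed 4:1, but the same 30 TB needs 15 TB of physical storage if the data only reduces 2:1, so the volume runs out of capacity well before clients see it as full.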