In Windows Azure Blob Storage, not all blobs are created equal. Windows Azure has the notion of Page Blobs and Block Blobs. Each of these distinct blob types aims to solve a slightly different problem, and it's important to understand the difference.
To quote the documentation:
- Block blobs, which are optimized for streaming.
- Page blobs, which are optimized for random read/write operations and provide the ability to write to a range of bytes in a blob.
About Block Blobs
Block blobs are comprised of blocks, each of which is identified by a block ID. You create or modify a block blob by uploading a set of blocks and committing them by their block IDs. If you are uploading a block blob that is no more than 64 MB in size, you can also upload it in its entirety with a single Put Blob operation.
When you upload a block to Windows Azure using the Put Block operation, it is associated with the specified block blob, but it does not become part of the blob until you call the Put Block List operation and include the block's ID. The block remains in an uncommitted state until it is specifically committed. Writing to a block blob is thus always a two-step process.
Each block can be a maximum of 4 MB in size. The maximum size for a block blob in version 2009-09-19 is 200 GB, or up to 50,000 blocks.
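The rest of this post is about the multi-block path, but for blobs small enough to go up in one request you don't need any block bookkeeping at all. A quick, hedged sketch, assuming the Videocontainer reference from the setup code further down and an illustrative file name:
CloudBlockBlob smallBlob = Videocontainer.GetBlockBlobReference("smallclip.mp4");
// UploadFile pushes the whole file up in one go; for sufficiently small blobs the
// SDK can do this as a single Put Blob operation with no blocks to manage.
smallBlob.UploadFile("smallclip.mp4");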
About Page Blobs
Page blobs are a collection of pages. A page is a range of data that is identified by its offset from the start of the blob.
To create a page blob, you initialize the page blob by calling Put Blob and specifying its maximum size. To add content to or update a page blob, you call the Put Page operation to modify a page or range of pages by specifying an offset and range. All pages must align to 512-byte page boundaries.
Unlike writes to block blobs, writes to page blobs happen in-place and are immediately committed to the blob.
The maximum size for a page blob is 1 TB. A page written to a page blob may be up to 4 MB in size.
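We won't touch page blobs again in this post, but for contrast, here's a minimal sketch of what a page blob write looks like with the same StorageClient SDK used below (it assumes the Videocontainer reference from the setup code further down; the blob name and sizes are made up). You create the blob at a fixed maximum size up front, then write 512-byte aligned ranges in place.
CloudPageBlob pageBlob = Videocontainer.GetPageBlobReference("mypageblob");
// A page blob is created at its maximum size up front; the size must be a multiple of 512 bytes.
pageBlob.Create(1024 * 1024);
// Writes happen in place; the offset and the data length must both align to 512-byte boundaries.
byte[] page = new byte[512];
pageBlob.WritePages(new MemoryStream(page), 0);   // first page
pageBlob.WritePages(new MemoryStream(page), 512); // second page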
So, before we determine what blob type we’re going to use, we need to determine what we’re using this particular blob for in the first place.
You'll notice the above extract is quite clear about what block blobs are for: streaming (in our case, video), or in other words, anything we don't need random I/O access to. Page blobs, on the other hand, with their 512-byte page boundaries, are perfectly suited to random I/O access.
And yes, it's conceivable that you might need to host something like streaming video as a page blob. When you think about this stuff too much, you end up imagining situations where that might be necessary: situations where you are directly editing or reading very select portions of a file. If you're editing video, who wants to read in an entire 4MB block for one frame? You might laugh at the idea of actually needing to do this, but consider that the Rough Cut Editor is web based and works primarily with web-based files. If you had to run that using Blob storage as a backend, you'd need page blobs to fully realise the RCE's functionality.
So, enough day-dreaming. Time to move on.
Some groundwork
Now, in our block blob, each individual block can be a maximum of 4MB in size. Assuming we're dealing with streaming video, a single 4MB block is not going to cut it; we'll need to split our data across many blocks.
The Azure API provides the CloudBlockBlob class with several helper methods for managing our blocks. The methods we are interested in are:
- PutBlock()
- PutBlockList()
The PutBlock method takes a base-64 encoded string for the block ID, a stream object with the binary data for the block, and an (optional) MD5 hash of the contents. It's important to note that the ID string MUST be base-64 encoded or else Windows Azure will not accept the block. For the MD5 hash, you can simply pass in null. This method should be called for each and every block that makes up your data stream.
PutBlockList is the final method that needs to be called. It takes a List<string> containing the ID of every block that you want to be part of the blob, and calling it commits all of the blocks in the list. This means you could end up in a situation where you've called PutBlock for a block but not included its ID when you called PutBlockList, leaving you with an incomplete, corrupted file. All is not lost, though: uncommitted blocks are kept for a week, so if you know which blocks are missing you simply call PutBlockList again with the complete list of block IDs.
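To make those commit semantics concrete, here's a bare-bones, hypothetical sketch (it assumes the blob variable from the setup code further down, plus two small made-up byte arrays): only the blocks whose IDs appear in the PutBlockList call become part of the blob, and a follow-up PutBlockList with the complete list repairs the omission.
byte[] firstChunk = new byte[1024];
byte[] secondChunk = new byte[1024];
string id0 = Convert.ToBase64String(BitConverter.GetBytes(0));
string id1 = Convert.ToBase64String(BitConverter.GetBytes(1));
// Upload two blocks; neither is part of the blob yet.
blob.PutBlock(id0, new MemoryStream(firstChunk), null);
blob.PutBlock(id1, new MemoryStream(secondChunk), null);
// Committing only id0 gives us a blob containing just the first chunk; the second
// block sits uncommitted (and is thrown away after a week if never committed).
blob.PutBlockList(new List<string> { id0 });
// To fix it, call PutBlockList again with the complete, ordered list of IDs.
blob.PutBlockList(new List<string> { id0, id1 });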
There are a number of reasons why this is a smart approach. Normally I fall on the side of developer independence, the dev being free to do things as he likes without being hemmed in, but in this case being forced to upload data in small chunks brings a number of practical benefits. The big one is recovery from bad uploads: customers hate having to restart gigabyte-sized uploads from scratch.
Here be Dragons
The following example probably isn’t the best. I’m pretty sure someone will refactor and post a better algorithm.
Now there are a couple of things to note here. One being that I want to illustrate what happens at a lower level of abstraction than we usually work at, so that means no StreamReaders; we'll read the underlying bytes directly.
Secondly, not all Streams have the same capabilities. It's perfectly possible to come across a Stream object where you can't seek, or even determine the length of the stream, so the approach here is written to cope with any data stream you can throw at it.
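For example, with a non-seekable stream you can't rely on Length or Position at all; you just keep calling Read until it stops giving you data. A rough sketch of that pattern (source here is a stand-in for whatever Stream you've been handed):
byte[] chunk = new byte[1048576];
int filled = 0;
int read;
// Read may return fewer bytes than asked for, so keep topping the buffer up
// until it's full or the stream runs dry.
while (filled < chunk.Length && (read = source.Read(chunk, filled, chunk.Length - filled)) > 0)
{
    filled += read;
}
// "filled" now tells us how many bytes of "chunk" are valid; only that many
// should be handed on to PutBlock later.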
With that out of the way, let's start with some Windows Azure setup code.
StorageCredentialsAccountAndKey key = new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount acc = new CloudStorageAccount(key, true);
CloudBlobClient blobclient = acc.CreateCloudBlobClient();
CloudBlobContainer Videocontainer = blobclient.GetContainerReference("videos");
Videocontainer.CreateIfNotExist();
CloudBlockBlob blob = Videocontainer.GetBlockBlobReference("myblockblob");
Note how we’re using the CloudBlockBlob rather than the CloudBlob class.
In this example we’ll need our data to be read into a byte array right from the start. While I’m using data from a file here, the actual source doesn’t matter.
byte[] data = File.ReadAllBytes("videopath.mp4");
Now, to move data from our byte array into individual blocks, we need a few variables to help us.
int id = 0;
int byteslength = data.Length;
int bytesread = 0;
int index = 0;
List<string> blocklist = new List<string>();
- id will store a sequential number indicating the ID of the block
- byteslength is the length, in bytes, of our byte array
- bytesread keeps a running total of how many bytes we've already read and uploaded
- index is a copy of bytesread, used to do some interim calculations in the body of the loop (it will probably end up being refactored out anyway)
- blocklist holds all our base-64 encoded block IDs
Now, on to the body of the algorithm. We're using a do loop here since this loop will always run at least once (assuming, for the sake of example, that all files are larger than our 1MB block boundary).
do
{
byte[] buffer = new byte[1048576];
int limit = index + 1048576;
for (int loops = 0; index < limit; index++)
{
buffer[loops] = data[index];
loops++;
}
The idea behind using a do loop is to keep looping over our data array until less than 1MB of it remains.
Note how we're using a separate byte array to copy data into. This is the block data that we'll pass to PutBlock. Since we're not using StreamReaders, we have to do the copy byte by byte as we go along.
It is this bit of code that would be abstracted away were we using StreamReaders (or, more properly for this application, BinaryReaders); a sketch of that follows.
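To make that concrete, here's a minimal sketch of the BinaryReader version, assuming source is whatever Stream holds your data; ReadBytes does the buffering for us and already returns a shorter array for the final block.
BinaryReader reader = new BinaryReader(source); // "source" is an assumed Stream holding the data
byte[] block = reader.ReadBytes(1048576); // reads up to 1MB
// block.Length tells us how many bytes were actually read; the final block simply
// comes back shorter, so no separate end-of-data arithmetic is needed.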
Now, this is the important bit:
bytesread = index;
string blockIdBase64 = Convert.ToBase64String(System.BitConverter.GetBytes(id)); //1
blob.PutBlock(blockIdBase64, new MemoryStream(buffer, true), null); //2
blocklist.Add(blockIdBase64);
id++;
} while (byteslength - bytesread > 1048576);
There are three things to note in the above code. Firstly, we're taking the block ID and base-64 encoding it properly.
Secondly, note the call to PutBlock. We've wrapped the byte array containing just our block data in a MemoryStream object (since that's what the PutBlock method expects), and we've passed in null rather than an MD5 hash of our block data.
Finally, note how we add the block ID to our blocklist variable. This ensures that the call to PutBlockList will include the IDs of all of our uploaded blocks.
So, by the time this do loop finally exits, we should be in a position to upload our final block. This final block will almost certainly be less than 1MB in size (barring the usual edge case caveats). Since this final block is less than 1MB, our code will need a final change to cope with it.
int final = byteslength - bytesread;
byte[] finalbuffer = new byte[final];
for (int loops = 0; index < byteslength; index++)
{
finalbuffer[loops] = data[index];
loops++;
}
string blockId = Convert.ToBase64String(System.BitConverter.GetBytes(id));
blob.PutBlock(blockId, new MemoryStream(finalbuffer, true), null);
blocklist.Add(blockId);
Finally, we make our call to PutBlockList, passing in our List<string> of block IDs (in this example, the blocklist variable).
blob.PutBlockList(blocklist);
All our blocks are now committed. If you have the latest Windows Azure SDK (and I assume you do), the Server Explorer should allow you to see all your blobs and get their direct URLs. You can download the blob directly from Server Explorer, or copy and paste the URL into your browser of choice.
Wrap up
Basically, what we've covered in this example is a quick way of breaking down any binary data stream into individual blocks conforming to Windows Azure Blob storage requirements, and uploading those blocks to Windows Azure. The neat thing here is that, using this method, not only can an MD5 hash (if you supply one) let Windows Azure check data integrity for you, but the block ID list lets Windows Azure take care of putting the data back together in the correct sequence.
Now, when I refactor this code for actual production, a couple of things are going to be different. I'll do the MD5 hash. I'll upload blocks in parallel to take maximum advantage of upload bandwidth (this being the UK, there's not much upload bandwidth, but I'll take all I can get). And obviously, I'll use the full capability of stream readers to do the dirty work for me.
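The MD5 part of that refactoring is a small change. A sketch, hedged because it isn't part of the example above: hash each block's buffer and pass the base-64 encoded digest as PutBlock's third argument instead of null, and the storage service will reject any block that arrives corrupted.
// Inside the upload loop, instead of passing null:
string contentMD5;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
    // The Content-MD5 value is the base-64 encoding of the raw 16-byte MD5 digest of the block.
    contentMD5 = Convert.ToBase64String(md5.ComputeHash(buffer));
}
blob.PutBlock(blockIdBase64, new MemoryStream(buffer), contentMD5);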
Here's the full code:
StorageCredentialsAccountAndKey key = new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount acc = new CloudStorageAccount(key, true);
CloudBlobClient blobclient = acc.CreateCloudBlobClient();
CloudBlobContainer Videocontainer = blobclient.GetContainerReference("videos");
Videocontainer.CreateIfNotExist();
CloudBlockBlob blob = Videocontainer.GetBlockBlobReference("myblockblob");
byte[] data = File.ReadAllBytes("videopath.mp4");
int id = 0;
int byteslength = data.Length;
int bytesread = 0;
int index = 0;
List<string> blocklist = new List<string>();
do
{
byte[] buffer = new byte[1048576];
int limit = index + 1048576;
for (int loops = 0; index < limit; index++)
{
buffer[loops] = data[index];
loops++;
}
bytesread = index;
string blockIdBase64 = Convert.ToBase64String(System.BitConverter.GetBytes(id));
blob.PutBlock(blockIdBase64, new MemoryStream(buffer, true), null);
blocklist.Add(blockIdBase64);
id++;
} while (byteslength - bytesread > 1048576);
int final = byteslength - bytesread;
byte[] finalbuffer = new byte[final];
for (int loops = 0; index < byteslength; index++)
{
finalbuffer[loops] = data[index];
loops++;
}
string blockId = Convert.ToBase64String(System.BitConverter.GetBytes(id));
blob.PutBlock(blockId, new MemoryStream(finalbuffer, true), null);
blocklist.Add(blockId);
blob.PutBlockList(blocklist);