Fun with XQuery, Images encoded as base64 Strings, and Word 2007

or: There and Back Again, A JPEG’s Tale.

This is a fun one that comes up every once in a while.  When you save a Word 2007 document as .xml, Word serializes images as base64 strings.  It turns out that organizations regularly save Word documents as .xml, and they want to view these images in a browser or some other application so they can decide how they’d like to re-use them.  So the first question that comes up is: How can I transform the base64 string back into an image?

If you want to play along at home, copy the image of Bilbo here and save it in a Word 2007 document: In the Ribbon, select the ‘Insert’ tab, and from the ‘Illustrations’ group choose ‘Picture’, then select your pic and insert it.  Next: Click the Office Button, click ‘Save As’, select ‘Other Formats’, and for ‘Save as type’ choose ‘Word XML Document (*.xml)’.

Don’t choose the 2003 XML, ’cause that’s something else.  It’s similar, but different (’cause it’s not the same).

So now, open that bad boy in vi, Visual Studio, or some other editor and take a peek.  I want to take this opportunity to introduce you to the Flat OPC format.  When you save a Word doc as a .docx, you end up with a .zip file that contains all these interrelated .xml files and their associated assets (such as images).  When you save as .xml, you end up with all those same XML parts serialized in a single .xml file, with images serialized as base64 strings.  This .xml format is known affectionately in Redmond as Flat OPC.  Once you understand just a little about this format, the amount of Word document @$$ you can kick in MarkLogic Server and/or using an Add-in in Word is awesome.

SPOILER ALERT: An upcoming post is going to dive into how we can exploit the Flat OPC format for document re-use.

The main body of content for a Word 2007 document can be found in the document.xml part.  With images, you’ll find a reference to the image in document.xml, but the image itself is stored separately as its own part in the document package: as a binary in the .docx, but serialized as base64 when saved as .xml.  Knowing this, let’s convert the string to an image.
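To see those parts for yourself once the file is loaded (that's the next step), a quick sketch like this lists every part in the package along with its content type and whether it carries XML or base64 data.  It assumes the document lives at /BilboBaggins.xml, the URI used in the rest of this post.  The pkg:contentType attribute comes from the Flat OPC package format; it isn't used anywhere else here.

```xquery
xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

(: Assumes the Flat OPC file was loaded at this URI :)
for $part in fn:doc("/BilboBaggins.xml")/pkg:package/pkg:part
return fn:concat(
  fn:string($part/@pkg:name), " | ",
  fn:string($part/@pkg:contentType), " | ",
  (: image parts hold pkg:binaryData; everything else holds XML :)
  if ($part/pkg:binaryData) then "base64" else "xml"
)
```

The image should show up as a part named /word/media/image1.jpeg, which is the name the queries below go looking for.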

Throw BilboBaggins.xml into MarkLogic Server. (I use WebDAV.)  To view the base64 string, evaluate the following in CQ:

xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

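(: The Flat OPC root element is pkg:package; it follows the mso-application processing instruction :)
let $doc := fn:doc("/BilboBaggins.xml")/pkg:package
let $image-string := $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
return $image-string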

Yep, it’s that ugly.  Luckily viewing the image is as simple as:

xquery version "1.0-ml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

let $doc := fn:doc("/BilboBaggins.xml")/pkg:package
let $image-string := $doc/pkg:part[@pkg:name="/word/media/image1.jpeg"]/pkg:binaryData/node()
(: binary{} takes hexBinary, so cast the base64 text to hex first :)
return binary{xs:hexBinary(xs:base64Binary($image-string))}
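Why the hop through xs:hexBinary?  The binary{} constructor wants hex, so the base64 value gets cast first.  Here's a tiny sketch of the two casts on a made-up sample value (the five bytes of "Hello", nothing to do with Bilbo), just to show the trip in miniature:

```xquery
xquery version "1.0-ml";

(: "SGVsbG8=" is the base64 form of the five bytes of "Hello" :)
let $b64 := xs:base64Binary("SGVsbG8=")
let $hex := xs:hexBinary($b64)
return (
  xs:string($hex),                   (: 48656C6C6F :)
  xs:string(xs:base64Binary($hex))   (: SGVsbG8= :)
)
```

Same bytes, two spellings.  The casts only change how the value is written down, not what it contains.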

Now, what about the reverse?  What if we have images in the Server that we want to serialize as base64 strings?

Take the image you copied at the beginning and save it to MarkLogic.  We can convert it to a base64 string by evaluating the following in CQ:

xquery version "1.0-ml";
declare namespace ooxml= "http://marklogic.com/ooxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
    binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
 $node as binary()
) as xs:string
{
      xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return ooxml:binary-to-base64-string($doc)
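A quick way to convince yourself the two helper functions really are inverses: push the image through both and compare the bytes.  This is just a sanity-check sketch; it assumes the same /bilbo-200x200.jpg URI as above.

```xquery
xquery version "1.0-ml";
declare namespace ooxml= "http://marklogic.com/ooxml";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
    binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
  $node as binary()
) as xs:string
{
    xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

let $original := fn:doc("/bilbo-200x200.jpg")/node()
let $round-trip := ooxml:base64-string-to-binary(ooxml:binary-to-base64-string($original))
(: compare as hexBinary values; true means nothing was lost on the trip :)
return xs:hexBinary($round-trip) eq xs:hexBinary($original)
```

If that comes back true, the image made it there and back again intact.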

Now, the above will work if we just want the base64 string, but if we want a string we can use with Word and the Flat OPC format, certain rules apply: 1) the string must be broken into lines of 76 characters (the last line can be shorter), and 2) there must not be a line break at the beginning or end of the content.  No big deal, we just do the following:

xquery version "1.0-ml";
declare namespace ooxml= "http://marklogic.com/ooxml";
declare namespace pkg="http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:base64-string-to-binary(
  $string as xs:string
) as binary()
{
    binary{xs:hexBinary(xs:base64Binary($string))}
};

declare function ooxml:binary-to-base64-string(
 $node as binary()
) as xs:string
{
      xs:base64Binary(xs:hexBinary($node)) cast as xs:string
};

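declare function ooxml:base64-opc-format(
  $binstring as xs:string
) as xs:string
{
    (: join the lines with a newline; string-join adds no leading or trailing break :)
    fn:string-join(ooxml:format-binary($binstring), "&#10;")
};

declare function ooxml:format-binary(
  $binstring as xs:string
) as xs:string*
{
    (: fn:substring is 1-based, so chunk $i starts at ($i * 76) + 1;
       subtracting 1 before idiv avoids an empty last chunk when the length is a multiple of 76 :)
    for $i in 0 to ((fn:string-length($binstring) - 1) idiv 76)
    let $start := ($i * 76) + 1
    return fn:substring($binstring, $start, 76)
};

let $doc := fn:doc("/bilbo-200x200.jpg")/node()
return ooxml:base64-opc-format(ooxml:binary-to-base64-string($doc))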
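If you want to check the line-splitting arithmetic without dragging an image into it, here's a standalone sketch.  The local:wrap helper and the 160-character sample string are made up for illustration; the one thing to watch is that fn:substring counts from 1, not 0.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: split $s into chunks of at most $width characters :)
declare function local:wrap(
  $s as xs:string,
  $width as xs:integer
) as xs:string*
{
    for $i in 0 to ((fn:string-length($s) - 1) idiv $width)
    return fn:substring($s, $i * $width + 1, $width)
};

(: 160 characters = two full lines of 76 plus a remainder of 8 :)
let $sample := fn:string-join(for $n in 1 to 160 return "A", "")
return
  for $line in local:wrap($sample, 76)
  return fn:string-length($line)
(: returns 76, 76, 8 :)
```

Every line but the last comes out at exactly 76 characters, and since fn:string-join only puts separators between items, there's no line break at either end.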

And that’s all there is to it!  You are now a Master of the image-encoded-as-base64-string Universe!  Cheers!
