Alexandre Bourget

geek joy

New and hot, part 5: MongoDB integration

March 23, 2011 at 01:32 PM

This is part 5 of my March 9th 2011 Confoo presentation. Refer to the Table of Contents for the other parts.

MongoDB integration

In this part, we will make our system scale by using a distributed file system to store our encoded files. We will be using GridFS, which is backed by MongoDB and can therefore be sharded and replicated to scale.

First off, let's install MongoDB. You can get Ubuntu installation instructions here.

$ sudo -s
# echo "deb http://downloads.mongodb.org/distros/ubuntu 10.10 10gen" > /etc/apt/sources.list.d/mongodb.list
# apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
# apt-get update
# apt-get install mongodb-stable

We will create a local MongoDB data dir, and we'll run the server as a normal user:

$ mkdir mongodata

To launch MongoDB on the default port, as a normal user:

$ mongod --dbpath ./mongodata
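
If you want to make sure the server is reachable before wiring it into the app, a quick sanity check from a Python shell could look like this (purely optional, and assuming pymongo is already installed):

# Optional sanity check: connect to the local mongod on the default port
# and list the existing databases.
import pymongo
print pymongo.Connection().database_names()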

Back to Pyramid: in a Pyramid project, most of the database-related elements go in resources.py. This is where we'll add our code to handle the connections to MongoDB:

## resources.py
...

from gridfs import GridFS
import pymongo

mongo_conn = pymongo.Connection()

def add_mongo(event):
    req = event.request
    req.db = mongo_conn['testdb']
    req.fs = GridFS(req.db)

We import GridFS, the filesystem handler, and pymongo, the low-level MongoDB library. We keep a reference to a connection in mongo_conn. This connection object manages a pool, so when we get a reference to a particular database, it takes a connection from the pool and returns it once it's released.
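
Just to illustrate what that means in plain pymongo terms, outside of any web request, grabbing a database handle and touching a collection would look like this (illustration only; the collection name is made up):

# Illustration only: using the shared connection directly.
# Getting 'testdb' and hitting a collection checks a socket out of the
# pool for the operation, then puts it back.
db = mongo_conn['testdb']
db.sanity_check.insert({"hello": "world"})  # 'sanity_check' is a made-up collection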

The add_mongo() function defined here will be hooked up to run for each new request. This way, every request going through our app will have db and fs attributes. To do so, we'll modify __init__.py like this:

def main(...):
    ...
    config.add_subscriber('foo.resources.add_mongo',
                          'pyramid.events.NewRequest')
    ...

And with that, the database side is set up.
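
To give a feel for what this buys us, here's a minimal, hypothetical view that uses both attributes: request.db for regular documents and request.fs-backed files. The route name, renderer and 'notes' collection are made up for the example (the route would need its own add_route call):

from pyramid.view import view_config

@view_config(route_name="db_check", renderer="json")
def db_check(request):
    # request.db is a plain pymongo Database: count the files already
    # stored in GridFS, plus the documents in a (made-up) collection.
    return {"videos": request.db.fs.files.count(),
            "notes": request.db.notes.count()}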

GridFS: Using MongoDB as a filestore

Here we will store our compressed video files in the distributed filesystem MongoDB provides. GridFS builds on MongoDB's features (a document-oriented database for the metadata) and has elegant Python bindings.
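
As a quick taste of those bindings, outside of the web app entirely, a round-trip with a small payload looks roughly like this (a throwaway sketch reusing the same testdb database):

# Standalone GridFS round-trip, just to show the API.
import pymongo
from gridfs import GridFS

fs = GridFS(pymongo.Connection()['testdb'])
fid = fs.put("hello, gridfs", content_type="text/plain")
print fs.get(fid).read()   # -> "hello, gridfs"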

If we store videos in MongoDB, we'll want to be able to retrieve them from there as well, so we will need to tweak the route to get them. In __init__.py:

def main(...):
    ...
    config.add_route('video', 'video/{fileid}.webm')
    ...

See that fileid? In the view, we'll get it filled with whatever matched in the requested URL.

Then, we'll modify the encoding process to dump the output directly to MongoDB, using a subprocess pipe:

def encode_video(filename, request):
    p = subprocess.Popen('$FFMPEG -y -i %s ... -ac 2 -' % filename,
                         shell=True, stdout=subprocess.PIPE)
    stdout, stderr = p.communicate()
    fileid = request.fs.put(stdout, content_type="video/webm",
                            original_user="123")
    print "Video URL: ", request.route_url('video', fileid=fileid)

What's to note here is the addition of the request parameter to encode_video(). This will allow us to build a URL with route_url(). We also changed FFmpeg's output: it no longer points to /tmp/output.webm, but to stdout, using a dash. Finally, we added stdout=subprocess.PIPE to get a handle on that output.
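
One thing to be aware of: communicate() buffers the whole encoded video in memory before we hand it to GridFS. Since put() also accepts file-like objects, a variant (an untested sketch, with the same elided FFmpeg command line) could stream the pipe straight in:

def encode_video_streaming(filename, request):
    # Same FFmpeg invocation, but the pipe itself is handed to GridFS,
    # which reads from it chunk by chunk instead of buffering everything.
    p = subprocess.Popen('$FFMPEG -y -i %s ... -ac 2 -' % filename,
                         shell=True, stdout=subprocess.PIPE)
    fileid = request.fs.put(p.stdout, content_type="video/webm",
                            original_user="123")
    p.wait()
    print "Video URL: ", request.route_url('video', fileid=fileid)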

The request.fs.put() method creates a new file in the GridFS instance attached to request.fs, taking FFmpeg's captured stdout as the content to store. The end result is a new file in the distributed file system. put() accepts arbitrary keyword arguments and adds them as metadata on the corresponding document in the fs.files collection. That's why original_user="123" can be added, and searched for later on.
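
For instance, a later lookup by that metadata could go straight to the fs.files collection with plain pymongo (a hypothetical query; nothing in the app does this yet):

# Find every file stored for this (made-up) user id and print basic info.
for doc in request.db.fs.files.find({"original_user": "123"}):
    print doc["_id"], doc["length"]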

Then, we'll make sure we fetch the data from MongoDB when a web request comes in for a video:

from pyramid.view import view_config
from pymongo.objectid import ObjectId
from paste.fileapp import DataApp

@view_config(route_name="video")
def get_video(request):
    oid = ObjectId(request.matchdict['fileid'])
    filein = request.fs.get(oid)
    wsgiapp = DataApp(filein, headers=[("Content-Type", "video/webm")],
                      filelike=True)
    return request.get_response(wsgiapp)

(Note that we need a patched paste.fileapp.DataApp for this to work.)

This chunk gets the fileid from the URL, turns it into an ObjectId and queries the GridFS instance with request.fs.get(oid). This returns a file-like object that we pass to DataApp, a simple file-serving mechanism that deals with byte ranges, ETags and that kind of thing.
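
If patching paste.fileapp.DataApp is not an option, a cruder fallback would be to read the whole file into memory and hand it to a plain Pyramid Response (losing byte-range and ETag support). This would replace the body of the view above, hooked up with the same @view_config, rather than sit next to it:

from pyramid.response import Response

def get_video_simple(request):
    # Drop-in alternative body for the 'video' view: read the whole
    # GridFS file and serve it as a plain response (no byte ranges/ETags).
    filein = request.fs.get(ObjectId(request.matchdict['fileid']))
    return Response(body=filein.read(), content_type="video/webm")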

To try it out, upload a video, find the link in the server's stdout, and load it in your browser (or mplayer). If it plays correctly, then it works!

Next will be Redis integration with its PubSub support, so that when the video is ready, it's pushed directly to all web viewers.

If you would like us to implement anything you've seen here in a real-life project, or to kick start some of your projects, don't hesitate to send out an e-mail at contact@savoirfairelinux.com mentioning this blog.
