Monday 31 October 2011

musings on shard keys from a MongoDB newbie

I'm a MongoDB newbie who turned to Mongo for a simpler Map/Reduce solution after trying Hadoop, simpler in terms of the programming model (JavaScript vs. Java) and in terms of the infrastructure required. The application is an environmental one, generating visualizations of data such as ocean carbon dioxide readings over the last half century and global groundwater readings collected by satellite over the last decade.

The data model is simple, consisting of collections of measurements with this schema:
latitude (double)
longitude (double)
date (YYYYMMDD string)
main measurement (double)
ancillary measurements (doubles)

Currently for ocean CO2 readings there are about 5.2 million documents, and for groundwater readings there are about 2.5 million, with updates occurring monthly.

The Map/Reduce job which I ported from Java (Hadoop) to JavaScript to run on Mongo has (1) a mapper which maps the document's lat/lon to a region (lat1, lat2, lon1, lon2) and a time period (date1, date2), using the region and time period as the key, (2) a reducer which for each region and period combo outputs the sample count and the min, max, and total measurement values, and (3) a finalizer which outputs the count, min, max, and average for each region and period. The granularity of the region and period used for grouping is configurable (for example divide the earth's surface into 10 degree x 10 degree lat x lon regions, and group by 1 month, 6 months, etc.), with these parameters passed in the scope argument of the mapReduce command.

The reduced data is queried by a web app which lets the user select a region of the earth and a time unit to drilldown to a graph of the CO2 or groundwater trends over time in that region. The app also queries the data to determine top regions of recent "concern" (high CO2 or low groundwater) which become highlighted on a map.

I haven't had the need to shard the data yet, but if I did I would select a shard key consisting of a lat/lon + a date:
{latitude: 1, longitude: 1, date: 1}

The presence of date in the key would help distribute the data where measurements are taken in the same regions over a period of time, which can happen with ocean CO2 readings (but not so much with groundwater readings which are by satellite). The presence of lat and lon helps distribute new readings as they come in, avoid hotspotting on the same date range. Having lat/lon at the beginning of the key would make queries more efficient, since most queries would include location (the exception are queries to find top X environmental hotspots which query first by date to get the most recent data available).

This is hypothetical of course, given the volume of my data, plus I'm new to Mongo, so I welcome comments on alternative shard keys.