Cloud Watching #2 – How to Manage 30Bn Trees Worth of Data
Data is fundamental to operating schooling systems. Without data, schooling systems would grind to a halt – teachers wouldn’t get paid; students wouldn’t get transported, taught and fed; and essential services would cease to operate.
As the value of good data for decision making becomes more widely understood, the quantity of data in the world’s schooling systems is ballooning. But how much data are we talking about, how fast is it growing, and how can it be better managed?
To get a sense of how big the issue is, let’s start by looking at Charlotte Mecklenburg in the US – a School District that has paid a lot of attention to its data and information systems recently. According to David Fitzgerald, Vice President of the Education Group at Mariner, Charlotte Mecklenburg School District plans to use 70 Terabytes for a system with 140,000 students – roughly 524.3MB per student (counting 1TB as 1,048,576MB).
The US and Western Europe account for ~10% of the world’s school student population – about 0.12bn students. So, assuming similar levels of consumption across these regions (roughly 0.5GB per student), we can estimate that in these areas alone there is 60,000TB of data in schooling systems. 1TB is roughly 50,000 trees’ worth of paper and print, so we’re looking at 3bn trees’ worth of data. Imagine that every student on the planet used the same amount of data as Charlotte Mecklenburg – that would add up to 30bn trees.
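The arithmetic above can be retraced in a few lines. This is only a rough sketch using the figures quoted in the article. The unit conventions are assumptions: binary units reproduce the 524.3MB per-student figure, while the regional total appears to use a rounded 0.5GB per student and 1TB = 1,000GB. The variable names are purely illustrative.

```python
# Figures quoted in the article; the unit conventions are assumptions.
DISTRICT_STORAGE_TB = 70
DISTRICT_STUDENTS = 140_000
MB_PER_TB = 1024 * 1024  # binary units (assumed)

mb_per_student = DISTRICT_STORAGE_TB * MB_PER_TB / DISTRICT_STUDENTS
print(f"{mb_per_student:.1f} MB per student")

# Regional estimate: rounded 0.5 GB per student, 1 TB = 1,000 GB (assumed).
REGION_STUDENTS = 120_000_000  # US + Western Europe, ~10% of the world's students
GB_PER_STUDENT = 0.5
region_tb = REGION_STUDENTS * GB_PER_STUDENT / 1000

TREES_PER_TB = 50_000  # the article's rule of thumb
region_trees = region_tb * TREES_PER_TB
world_trees = region_trees * 10  # the region holds ~10% of all students

print(f"{region_tb:,.0f} TB, {region_trees:,.0f} trees, {world_trees:,.0f} trees worldwide")
```

Changing any one assumption (say, the per-student figure) shows quickly how sensitive the headline number is.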
Whilst it’s unlikely that the data in schooling systems adds up to this amount yet, there are several factors pushing it hard in this direction.
For example, major countries such as Russia, Mexico and Brazil are developing and running massive student data operations, increasing both the quantities and sophistication of data used.
UNESCO (2003) states that most countries develop education databases, and also specifies the optimal datasets that should be maintained. Let’s suppose that this adds up to a minimum of half a typewritten page on each student living outside the USA and Western Europe, roughly 1 kilobyte each. Rounding off, we can estimate that 1bn students x 1KB ≈ 954GB. It’s interesting to think that this could be kept on a single external hard drive no bigger than a paperback book. However, add other data, say a single low-resolution image per student, and that rises by a factor of 8. Add digital work produced by students and the number grows far larger still.
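As a quick check on that estimate, here is a minimal sketch. It assumes binary units (1KB = 1,024 bytes, 1GB = 1,024³ bytes), which is what reproduces the 954GB figure. The factor of 8 for an image is taken straight from the paragraph above.

```python
STUDENTS = 1_000_000_000
BYTES_PER_RECORD = 1024        # half a typewritten page, about 1 KB
BYTES_PER_GB = 1024 ** 3       # binary gigabyte (assumed)

total_gb = STUDENTS * BYTES_PER_RECORD / BYTES_PER_GB
with_image_gb = total_gb * 8   # factor of 8 as stated in the article

print(f"text records only: {total_gb:.0f} GB")
print(f"with one low-res image each: {with_image_gb / 1024:.2f} TB")
```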
Also, there is a sharp increase in the rate at which data is used in developed countries. Take New South Wales, for example. Last year, the New South Wales Department of Education and Training – which has 1.3m students – used 280TB of storage space, and this has been doubling every year for the last five years!
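To see what "doubling every year" implies, here is a small sketch. Working backwards follows from the figures above. Projecting forwards is purely an assumption that the trend continues, not something the source claims.

```python
CURRENT_TB = 280  # New South Wales figure quoted above


def storage_tb(years_from_now: int) -> float:
    """Storage under annual doubling; negative values look back in time."""
    return CURRENT_TB * 2 ** years_from_now


for year in range(-5, 4):
    label = "now" if year == 0 else f"{year:+d} yr"
    print(f"{label:>6}: {storage_tb(year):8.2f} TB")
```

On these numbers, the system would have held under 9TB five years ago, and would pass 2,000TB within three years if nothing changed.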
The amount of data used in schooling can only increase as governments around the world recognise that it is core to improving effectiveness.
WHY IS MANAGING DATA CORE TO IMPROVING SCHOOLING EFFECTIVENESS?
Driven by the need for better accountability for how public funds are spent, and the widespread use of international benchmarks such as PISA, there is a sharp increase in the number of governments and private companies that are investing in solutions for data driven decision making. These investments aim to use data to:
- Improve student performance: Give students, parents, teachers and administrators a clear picture of student performance at an individual or group level so they can adjust and personalise learning accordingly
- Make better management decisions: Inform routine decisions and strategic planning across all enablers and disciplines with accurate, readily-available data
- Increase accountability: Quickly and easily understand performance across organisations
- Manage resources more effectively: Gain a better understanding of projected revenues and expenditures; keep track of financial health; compare costs against those of other organisations
- Drive administrative efficiencies: Reduce the time and effort taken to report information. Improve quality and presentation of information.
SO WE HAVE TO TALK ABOUT DATABASES THEN?
Why is it that people’s eyes glaze over when you start talking about databases? Most web pages that you will experience – including this one – are driven by databases. For most people databases are “black boxes”, and few care about how they work or what they do. However, a basic understanding of databases and how they work is essential to understanding how ICT can make schooling more effective – so let’s take a quick database 101:
WHAT IS A DATABASE?
Databases arrange data as sets of records, and these records are arranged as rows. Each record consists of several fields which are arranged in columns. The rows and columns combine to form a table.
Most large scale databases are Relational, which means that they can connect data from two or more tables.
- Forms are a main way to enter data into a database
- Queries are used to get data out of a database.
- Reports format and display data from the database.
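To make the ideas of tables, relations and queries concrete, here is a minimal sketch using Python’s built-in sqlite3 module rather than a large-scale database server. The table names, column names and sample rows are all made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables: each row is a record, each column a field.
conn.executescript("""
CREATE TABLE schools (school_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT,
    school_id  INTEGER REFERENCES schools(school_id)
);
INSERT INTO schools  VALUES (1, 'Northside'), (2, 'Riverside');
INSERT INTO students VALUES (1, 'Ana', 1), (2, 'Ben', 2), (3, 'Cy', 1);
""")

# A query that connects the two tables through the shared school_id.
rows = conn.execute("""
    SELECT students.name, schools.name
    FROM students
    JOIN schools ON students.school_id = schools.school_id
    ORDER BY students.student_id
""").fetchall()

# A very plain "report": format the query result for display.
for student, school in rows:
    print(f"{student:<5} attends {school}")
```

The join is the "relational" part: the school name is stored once and linked to each student by its key.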
Indexes improve the speed of data retrieval operations by letting the database look rows up by the value of one or more columns – often a unique key which identifies each row in a table – rather than scanning the whole table. Metadata – data about data – can include tables of all tables, their names, sizes and number of rows in each table; or tables of columns, what tables they are used in, and the type of data stored in each column.
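Both ideas can be seen in a small sketch, again using Python’s built-in sqlite3 module. The table and index names are hypothetical, and other database engines expose their metadata through different catalogue tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, name TEXT, year_group INTEGER)"
)

# An index on year_group lets the engine find matching rows without a full scan.
conn.execute("CREATE INDEX idx_students_year ON students (year_group)")

# Metadata: SQLite keeps a catalogue of its own tables and indexes.
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()

# Metadata about columns: name and declared type of each column.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(students)")]

# Ask the engine how it would run a query; the plan text usually names the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM students WHERE year_group = 7"
).fetchall()

print(catalog)
print(columns)
print(plan)
```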
At the heart of a database is the Database Engine – software for storing, processing and securing data; providing controlled access and processing capabilities. The structure of the database is described in a Schema, usually written in “Structured Query Language” (SQL). SQL is also the language used to insert, query, update and delete data. Different database vendors have their own extensions to SQL – T-SQL is Microsoft’s extension to SQL.
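The four basic operations look like this in practice. This is a minimal sketch in standard SQL run through Python’s sqlite3 module, not T-SQL on SQL Server. The table and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: describes the structure of the data.
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, name TEXT, year_group INTEGER)"
)

# Insert two records (the ? placeholders keep values separate from the SQL text).
conn.execute("INSERT INTO students (name, year_group) VALUES (?, ?)", ("Ana", 7))
conn.execute("INSERT INTO students (name, year_group) VALUES (?, ?)", ("Ben", 8))

# Update: Ana moves up a year.
conn.execute("UPDATE students SET year_group = ? WHERE name = ?", (8, "Ana"))

# Delete: Ben leaves the school.
conn.execute("DELETE FROM students WHERE name = ?", ("Ben",))

# Query what is left.
remaining = conn.execute("SELECT name, year_group FROM students").fetchall()
print(remaining)
```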
A Data Warehouse is a database that extracts data from operational systems for reporting. It can aggregate data from different sources, and ensure that the integrity of operational data isn’t compromised by the processes associated with analysing it.
Integration Services are the means by which data from various sources can be integrated, extracted, transformed, and loaded into data warehouses.
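The extract, transform and load steps can be sketched in miniature. This is not how SQL Server Integration Services works internally. It is a toy illustration in Python, with two made-up operational sources and hypothetical table names, showing data being cleaned and loaded into a separate reporting database.

```python
import sqlite3

# Extract: two hypothetical operational sources with inconsistent formats.
attendance_source = [("S001", "present"), ("S002", "ABSENT"), ("S001", "Present")]
grades_source = [{"student": "s001", "score": "72"}, {"student": "s002", "score": "65"}]


def transform_attendance(rows):
    """Normalise student IDs and turn status text into True/False."""
    return [(sid.upper(), status.lower() == "present") for sid, status in rows]


def transform_grades(rows):
    """Normalise student IDs and convert scores from text to numbers."""
    return [(row["student"].upper(), int(row["score"])) for row in rows]


# Load: the warehouse is a separate database, so reporting never touches
# the operational systems.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE fact_attendance (student_id TEXT, present INTEGER);
CREATE TABLE fact_grades     (student_id TEXT, score INTEGER);
""")
warehouse.executemany(
    "INSERT INTO fact_attendance VALUES (?, ?)", transform_attendance(attendance_source)
)
warehouse.executemany(
    "INSERT INTO fact_grades VALUES (?, ?)", transform_grades(grades_source)
)

# Report across both sources.
summary = warehouse.execute("""
    SELECT g.student_id, g.score, SUM(a.present) AS days_present
    FROM fact_grades g
    JOIN fact_attendance a ON a.student_id = g.student_id
    GROUP BY g.student_id, g.score
    ORDER BY g.student_id
""").fetchall()
print(summary)
```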
OLAP – or Online Analytical Processing – enables data to be manipulated and analysed from multiple perspectives. For example, a longitudinal analysis could involve the study of student progress over time, and take advantage of an OLAP Cube to interrogate a number of different dimensions over a given period.
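The idea of looking at the same facts along different dimensions can be sketched without any OLAP product at all. The following is a toy stand-in for a cube, written in plain Python with invented data. The `roll_up` helper is hypothetical and simply averages scores over whichever dimensions are requested.

```python
from collections import defaultdict
from statistics import mean

# Each fact: (student, school, subject, year, score). All values are made up.
facts = [
    ("Ana", "Northside", "Maths", 2009, 60),
    ("Ana", "Northside", "Maths", 2010, 70),
    ("Ben", "Riverside", "Maths", 2009, 50),
    ("Ben", "Riverside", "Maths", 2010, 54),
    ("Ana", "Northside", "Reading", 2010, 80),
]
DIMENSIONS = ("student", "school", "subject", "year")


def roll_up(rows, *dims):
    """Average the score for every combination of the chosen dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[p] for p in positions)
        groups[key].append(row[-1])
    return {key: mean(scores) for key, scores in groups.items()}


by_year = roll_up(facts, "year")                    # progress over time
by_school_year = roll_up(facts, "school", "year")   # the same facts, another view

print(by_year)
print(by_school_year)
```

A real OLAP cube pre-computes aggregates like these so that switching perspective stays fast on very large datasets.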
Analysis Services supports OLAP by allowing the design, creation, and management of multidimensional structures that contain data aggregated from a range of data sources, such as relational databases.
Data Mining is about extracting patterns from large sets of data to yield Business Intelligence (BI) – for example, high achievement correlated with the number of books in the family home, or low reading ability impacting examination results. Data Mining Services enable the design, creation, and visualisation of data mining models.
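A correlation is about the simplest pattern one can look for. As a rough sketch, here is a Pearson correlation coefficient computed by hand in Python. The data below is entirely made up for illustration and is not drawn from any study, and real data mining models go well beyond a single coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented sample: books in the family home vs exam score for five students.
books_at_home = [5, 20, 50, 100, 200]
exam_scores = [48, 55, 61, 70, 82]

r = pearson(books_at_home, exam_scores)
print(f"correlation: {r:.2f}")  # values near +1 indicate a strong positive link
```

A high coefficient only says the two measures move together. It does not say that one causes the other.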
Reporting Services enable reports to be published in various formats, drawing on content from a variety of data sources. They also centrally manage security and subscriptions. Portal Integration matters too – it’s crucial for end-users to be able to work with operational data, ideally in ‘dashboard’ format, through a portal site.
Being able to manage databases is crucial, and several key tools are used for this. Master Data Services is the means by which all applications across the organization can rely on a central, accurate source of information. Replication means copying and distributing data and database objects from one database to another, and synchronizing between databases to maintain consistency. Automated compression and backup are also key tools.
WHAT HAS THIS GOT TO DO WITH THE CLOUD?
With massive growth in the amount of data used in schooling come questions about sustainability, cost and management. The Cloud offers some major advantages here:
Having data in the cloud makes it easier for authorized users with internet access to access that data from almost anywhere.
In an enterprise architecture where resources are distributed, organisations usually have a single SQL Server back-end with WAN links and/or multiple distributed SQL Server installations that replicate data with each other. Maintaining this kind of environment is time consuming and expensive. With the cloud, replication, backup, compression etc are all taken care of.
As with other Cloud services, you only pay for what you use. During the peaks and troughs of schooling system operations, one can expect to see varying amounts of data storage requirements.
SQL Azure is Microsoft’s Cloud Database solution, and it offers the following benefits:
- No physical administration required – software installation and patching are included, as SQL Azure is a platform as a service (PaaS)
- High availability and fault tolerance are built in
- Simple provisioning and deployment of multiple databases
- Scale databases up or down based on business needs
- Multitenant – i.e. a single database can provide services to multiple organisations
- Integration with SQL Server and tooling including Visual Studio®
- Support for the familiar T-SQL-based relational database model
- Option for pay-as-you-go pricing
The SQL Azure suite currently comprises the following offerings, some currently on limited availability:
SQL Azure Database – a Platform as a Service (PaaS) relational database. Highly available and scalable.
SQL Azure Data Sync – allows organisations to extend their current sets of data into the Cloud. It provides synchronisation between an organisation’s current SQL on-premises databases and SQL Azure Databases in the Cloud. Currently available in Community Technology Preview.
SQL Azure Reporting – a complete reporting infrastructure that enables users to see reports with visualizations such as maps, charts, gauges, sparklines etc. Currently available in Community Technology Preview.
The Windows Azure Platform Appliance – currently under limited trials, this will eventually enable organisations to deploy their own Cloud Services from within their own datacentres. It consists of Windows Azure, SQL Azure and a Microsoft-specified configuration of network, storage and server hardware.
TAKING ADVANTAGE OF CLOUD DATABASE SERVICES
Taking full advantage of the Cloud is not something that is going to happen overnight. Besides careful analysis and planning for migrating existing services, Cloud computing opens up a whole set of questions around what new services could be offered. For example, the rise of virtual schooling across the world – as brilliantly analyzed in the US by Clayton Christensen in his book “Disrupting Class” – will be a major beneficiary of cheap, ubiquitous database services at massive scale.
As pointed out in Cloud Watching #1, moving to the Cloud is not without effort and risk. David Chappell, in his excellent paper “The Benefits and Risks of Cloud Platforms: A Guide for Business Leaders”, points out that storing data outside their organization makes people nervous. Many countries have regulations about where certain kinds of data can and can’t be stored, so before putting data into the Cloud platform, it’s important to ensure compliance.
A key question is to ask whether any given data centre is more secure than those of the major Cloud service providers. A significant data breach for a Cloud services provider is likely to mean a huge financial loss, so there’s a very strong incentive for them to keep the data they hold secure.
David Chappell also advises – “as with any new technology, starting small can be a good approach. Perhaps your first cloud application should be important, for instance, but not truly mission critical”. The same can be said for data.
Whilst it’s early days for Cloud-based database services in Education, we’re beginning to see interest turning into plans and action. For example, Curtin University in Perth, Australia, has started to move some of its services to the Cloud and intends to take advantage of SQL Azure.
Educause Horizon Report 2010 includes an analysis of Cloud amongst other key and emerging technologies – http://wp.nmc.org/horizon2010/chapters/trends/ It states:
“The abundance of resources and relationships made easily accessible via the Internet is increasingly challenging us to revisit our roles as educators in sense-making, coaching, and credentialing”.
Cloud will no doubt change how data is gathered, manipulated and interrogated, and by making vast amounts of storage available at extremely low prices we can look forward to seeing innovative organisations build completely new services to reach growing numbers of learners in completely new ways.
A great introduction to databases: http://www.microsoft.com/student/en/us/techstudent/handson/database.aspx
Getting started with SQL Azure: http://msdn.microsoft.com/en-us/magazine/gg309175.aspx
Migrating to SQL Azure: http://msdn.microsoft.com/en-us/library/ee730904.aspx
“How much data is that?” – http://www.jamesshuggins.com/h/tek1/how_big.htm
Thanks to Sven Reinhardt, database guru, for input into this article.