EngineeringProductivity/Projects/ActiveData

3,605 bytes added, 08:07, 18 December 2014
This project is inspired by the data warehouse and data mart technology that is common inside large corporations, but largely non-existent in the public space. Since Mozilla's mandate is an open web, and we have a lot of data to share, it is only logical we make our data active.
 
= Architecture =
Applications that leverage an active data warehouse can forgo much, if not all, of the server-side development and put the logic on the client side.
== Features ==
An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:
=== A service, open to third party clients ===
By providing the service, clients save the need to stand up their own datastore.

=== Fast filtering ===
Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same. There is sufficient information in the queries to determine which indexes should be built to deliver a quick response.

=== Fast aggregates ===
Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.

=== API is a query language (SQL, MDX) ===
Building upon the formalisms, and familiarity, of existing query languages reduces the learning curve. It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.

=== Uniform, Cartesian space of values ===
Mozilla has a mandate of data-driven decision making. Data analysis tools like R, SciPy, NumPy, and pandas are what is used to perform data analysis, and they all require uniform data in multi-dimensional arrays. ActiveData's objective is to provide query results in these formats.

=== Metadata on dimensions and measures ===
ActiveData also provides context for the data it holds. This metadata allows exploration and discovery by third parties by describing units of measure and how dimensions relate to each other, and may even provide human-readable descriptions of the stored columns. It is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form, and provides logically reasonable aggregate options.

=== Has a security model ===
Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution.
If ActiveData is to become mainstream, it is important that it can manage sensitive data and PII.

== Problem ==
A significant portion of any application is the backend database/datastore, which includes:
* Managing resources and machines to support the datastore
* Data migrations on schemas during application lifetime
* Manually defining database indexes for responsive data retrieval
* Coding caching logic to reduce application latency

The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next.

== Solution ==
We want to reduce this redundant workload by adding a layer of abstraction called ActiveData. Clients using ActiveData benefit from the features it provides and avoid the datastore management complexities, while ActiveData implementers can focus on these common issues, given a simpler data model and a simpler query language upon which to calculate optimizations.

Columnar datastores have solved many (but not all) problems with changing schemas. Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch. We now have the technology to build an ActiveData solution.

By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.

== Non Solutions ==
ActiveData makes specific tradeoffs to achieve its goals, and there are situations where it is not a solution:
* memory hog
* transactional speed
* strict data model (snowflake schema, hierarchical relations)
* non-relational -
* ETL work required to denormalize data and provide dimension metadata
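
To make the "Uniform, Cartesian space of values" and "Metadata on dimensions and measures" features more concrete, here is a minimal Python sketch of the kind of result shape they describe. It is not ActiveData's API: the function <code>query_cube</code>, the sample <code>RECORDS</code>, and the field names are all hypothetical, and the aggregation is done naively in memory purely for illustration.

```python
# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"platform": "linux", "suite": "mochitest", "duration": 120},
    {"platform": "linux", "suite": "reftest", "duration": 80},
    {"platform": "win", "suite": "mochitest", "duration": 150},
    {"platform": "linux", "suite": "mochitest", "duration": 100},
]


def query_cube(records, edges, measure):
    """Aggregate `measure` over the Cartesian product of two `edges`.

    Returns dense 2-D arrays (lists of lists) plus the domain of each
    edge, so every combination of dimension values has a cell, even
    when no record matched it.
    """
    if len(edges) != 2:
        raise ValueError("this sketch handles exactly two edges")

    # The domain of each edge is the sorted set of its distinct values.
    domains = [sorted({r[e] for r in records}) for e in edges]
    position = [{v: i for i, v in enumerate(d)} for d in domains]

    rows, cols = len(domains[0]), len(domains[1])
    total = [[0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]

    for r in records:
        i = position[0][r[edges[0]]]
        j = position[1][r[edges[1]]]
        total[i][j] += r[measure]
        count[i][j] += 1

    return {
        # Metadata: which value each row/column index stands for.
        "edges": [{"name": e, "domain": d} for e, d in zip(edges, domains)],
        "sum": total,
        "count": count,
    }


cube = query_cube(RECORDS, ["platform", "suite"], "duration")
```

Because the result is a dense array with an explicit domain per axis, a client could pass <code>cube["sum"]</code> directly to an array library or a charting routine and use <code>cube["edges"]</code> to label the axes, without any reshaping logic of its own.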