What this blog is.... representative of my own views and experiences relating to the management and usage of data

What this blog is not...representative of the views of any employer or technology vendor (and it's not a place to find code either)

All views are my own...unless I switch on comments in which case those are yours!

Tuesday 23 July 2013

'The Hydrogen Sonata' and the ethics of Big Data


Earlier this summer we lost Iain M Banks, one of the most imaginative writers in modern literature.  Living in Edinburgh and sharing a number of good friends, I was lucky enough to have met him on several occasions over the past 25 years or so, in bars (mainly), at SF conventions (sometimes) and even bobbing about in a swimming pool. He was every bit as wonderful and generous a man as the many appreciations of his life have deservedly highlighted, and I'll miss him very much.
 
But Iain's significant body of work remains to inspire, and I was reminded of a section from his last Culture novel, The Hydrogen Sonata, when I clicked through to a link about Predictive Analytics in the Cloud from the LinkedIn Big Data group this morning.

The article - which reads exactly as most over-excited vendor press releases read, and concerns a market I just know Iain would have had the greatest disdain for - talks about tools which, to quote the CEO directly, "can literally just say, 'Here's every event that happens in the world. Earnings, economic indicators, seasonality, cyclicality, political events, and so on'...And they can literally model every single stock in the AMEX and the NYSE in relation to every single event, and can gain that kind of precision around it, which obviously helps them in terms of making their investments."

In The Hydrogen Sonata, Banks writes about an even grander ambition.  The story muses about the challenges encountered when trying to predict how an entire civilisation will react when confronted with evidence which may reveal its founding myths to have been a lie.  Other civilisations - including Banks' great utopia, The Culture - have attempted a particular simulation modelling technique to try and predict the reaction, but these have all been found wanting:

"The Simming Problem boiled down to, How True to life was it morally justified to be?"

A longer history of the challenge is also presented:

"Long before most species made it to the stars, they would be entirely used to the idea that you never made any significant societal decision with large-scale or long-term consequences without running simulations of the future course of events, just to make sure you were doing the right thing.  Simming problems at that stage were usually constrained by not having the calculational power to run a sufficiently detailed analysis, or disagreements regarding what the initial conditions ought to be.

Later, usually round about the time when your society had developed the sort of processal tech you could call Artificial Intelligence without blushing, the true nature of the Simming Problem started to appear.

Once you could reliably model whole populations within your simulated environment, at the level of detail and complexity that meant individuals within the simulation had some sort of independent existence, the question became: how god-like, and how cruel, did you want to be?"

These considerations develop over several pages.  To simulate life, life must first be created to the extent that it recognises itself as life, leading to the thought that we might all be in a simulation ourselves.  All fantastically entertaining stuff, but it's perhaps not surprising that Banks' Culture Minds end up concluding that "Just Guessing" is ultimately as effective.

What I believe will endure about this passage is not just the wit and vision but the necessity to assess the ethics of data usage.  OK, 'simming' life is not where we are right now, but larger and larger data sets are being used in increasingly 'sophisticated' models such as the one hyped by the Business Insider article. We've all been reminded of 1984 and The Minority Report in recent weeks and months, thanks largely to the no longer secret efforts of the NSA. Mainstream commentators are falling over themselves to express concern and outrage about the uses to which data could be or, with greater alarm, 'is' being put.  Data Professionals should start developing some answers to the more common concerns.

Questions such as what individuals need to know about the data being collected about them, how it is impacting the choices available to them, who their data can be sold on to and whether attempts should be made to identify individuals by analysing a variety of aggregate data sources are all relevant and live today.  And there is surely a role for Data Governance to play here?  Data Governance should help organisations understand what they are collecting and why.  Data Governance should help organisations understand what they can legitimately - ethically - do with their data. This will require particular attention as organisations look to exploit secondary usage of data.  If nothing else, establishing a Big Data Use Assessment every time you want to use your data sets for a purpose other than the one they were originally intended for will help reduce the risk of costly lawsuits further down the line.
 
Later in The Hydrogen Sonata one of the characters warns that "One should never mistake pattern for meaning", which is textbook Big Data best practice and another indication that data practitioners should take heed of this, the last of the great Iain M Banks' Culture novels.  Recommended now and forever.  Whichever simulation you find yourself in.

Wednesday 17 July 2013

Agile Analytics and the SAP Information Design Tool – 3. Evolving Excellent Design

In this, the final post in my series looking at how the new SAP Information Design Tool (IDT) can support the delivery of Agile Analytics, I’m going to be reviewing the capacity of the IDT to cope with adaptive design processes. As in the previous two posts – covering Collaborative Working and Testing – I’m taking my lead on Agile Analytics from Ken Collier’s excellent, recent book on the topic.  In “Agile Analytics”, Collier devotes the first of his chapters on Technical Delivery Methods to the concepts behind evolving excellent design.  It’s a superb guide to the benefits, challenges, approaches and practical design examples of delivering through an adaptive design process and I’m not going to attempt to capture it all here.  In a nutshell, though, Collier argues that “Agility benefits from adaptive design” and stresses that this is not a replacement methodology for your existing DW/BI design techniques but rather a state-of-mind approach which uses those techniques to deliver more value, faster.

SAP’s IDT primarily supports this adaptive design approach by facilitating a more flexible way of adapting the data access layer (the Universe) to the database refactoring necessarily involved in developing your DW solutions.  Essentially it does this through its much-famed ability to deliver multi-source Universes, something simply not possible in the traditional Universe Designer.

Let’s look at this using an example development requirement.  Our fearless, collaborative developers from the last two blogs – Tom and Huck – are into the next iteration of their agile development cycle.  One of the user stories they have to satisfy is to bring in a new ‘Cost of Sales’ measure.  This is required to be made available in the existing eFashion Universe, with daily updates of the figure in the underlying warehouse.  At present users calculate the figure weekly using manual techniques.  Development tasks accepted for this iteration include developing the ETL to implement the measure in the Warehouse, amending the Universe to expose it and adding it to key reports.  There are only four weeks in the iteration so they want to make effective use of all developers’ time throughout.  Tom is looking after the ETL whilst Huck is responsible for the Universe and Report changes.

Tom gets on with the ETL work but has some calculation specs to work out first, so can’t deliver a new ‘Cost of Sales’ field in the Warehouse for Huck to use in the Universe for at least two weeks.  This is awkward as Huck wants to get on with working with the users on report specifications but knows it will be difficult without real data.  Huck decides to take advantage of the IDT’s capability to federate data sources together.  He’ll create a temporary database on a separate development schema into which he will load the weekly figures currently prepared by the business.  He will then be able to integrate this into the existing IDT Project, allowing the ‘Cost of Sales’ measure object to be added to the existing Universe.  Against this he can work with the users to define and approve reports using real, weekly data.  Once Tom has finished the new ETL and ‘Cost of Sales’ exists in the Warehouse, Huck can rebuild the Data Foundation and Business Layer elements of the IDT Project to point to this new, daily source.  The work done to the Universe and Reports will remain, but now fed, again, by a single source.  All that remains is for the users to confirm they are happy with the daily figures in the reports.
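To make Huck’s plan concrete, here is a minimal sketch of the temporary source he might stage. All schema, table and column names below are hypothetical rather than the actual eFashion objects:

-- A sketch of the temporary source for the manually prepared weekly figures
-- (schema, table and column names are hypothetical).
CREATE TABLE dev_sandpit.weekly_cost_of_sales (
    week_ending   DATE          NOT NULL,  -- the week the manual figure covers
    store_id      INT           NOT NULL,  -- matches the store key used in the warehouse
    cost_of_sales DECIMAL(15,2) NOT NULL,  -- the manually calculated weekly figure
    CONSTRAINT pk_weekly_cos PRIMARY KEY (week_ending, store_id)
);

-- Load it from the extract the business already prepares each week.
INSERT INTO dev_sandpit.weekly_cost_of_sales (week_ending, store_id, cost_of_sales)
VALUES ('2013-07-05', 101, 12543.20),
       ('2013-07-05', 102,  9876.55);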

The diagram below illustrates this development plan in three stages, showing what elements of the existing environment are changed and added at each.

image

You can see that, with the IDT, we have a greater degree of flexibility – agility – in how Universes can be developed than was possible in the old Universe Designer.  Yes, we could change sources there but it was not terribly straightforward and there was no possibility of federating sources.  That would have been a job for SAP Data Federator pre-4.0 but now some of that Data Federator technology has been integrated into the IDT to allow its use in the Universe development process.

Let’s look in detail at Huck’s development activities.  First up, after syncing and locking the latest version of the project from the central repository, he creates a new Connection Layer using the IDT interface.  This is the layer that will connect to his new, temporary, ‘Cost of Sales’ data source.  There are two types of connection possible in the IDT – Relational and OLAP. 

The screenshot below shows the first details to be entered for a new Relational Connection.

image

After that Huck selects the type of database he’s connecting to.  He’s using SQL Server so has to make sure that he has the ODBC connection to this environment in his sandpit environment as well as on the server where the connection will eventually be used when shared.  Actually, I think I’d always try to define the connection first on that shared server before then sharing the connection layer with the central repository for local developers to sync with.

image

Next up Huck supplies the connection details for the database he’s set up for the prototype Cost of Sales data.

image

There are then a couple of parameter and initialisation string style screens to fine-tune (in these Big Data times, some of the pre-sets look quite quaint – presumably they’re upped when you set up a HANA connection).  Once those are done the new Connection Layer is available in the local IDT Project.

image

Within the Connection Layer it’s then possible to do some basic schema browsing and data profiling tasks.  Useful just to confirm you’re in the right place.

image
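The browsing and profiling in the IDT is all point-and-click, but conceptually it amounts to little more than quick sanity checks like the following (reusing the hypothetical table sketched above):

-- Quick sanity checks, roughly what the IDT profiling shows point-and-click
-- (table and column names remain hypothetical).
SELECT COUNT(*)                 AS row_count,
       COUNT(DISTINCT store_id) AS distinct_stores,
       MIN(week_ending)         AS earliest_week,
       MAX(week_ending)         AS latest_week
FROM   dev_sandpit.weekly_cost_of_sales;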

It’s important then to share the connection with the central repository.  To do this, right click on your local connection layer in the IDT project pane and select ‘Publish Connection to Repository’.

image

This turns your local .cnx into a shared, secured .cns, to which you can then create a shortcut in your local project folder.

image

You can only use secured .cns connections in a multi-source Data Foundation.  It is not possible to add a .cnx to the definition of a multi source Data Foundation.

On that note, it seems a likely best practice when using the IDT to always create a new Data Foundation as a multi-source layer rather than single-source, unless you are very certain that you will never need to add extra sources.  Fortunately for Huck he had done just that in the previous iteration when first creating the Data Foundation.

image

Within a multi-source enabled Data Foundation it is always possible to add a new connection through the IDT interface.  The available sources are displayed for the developer to select.

image

For each source added to the Data Foundation, it is necessary to define properties by which it can be distinguished.  I particularly like the use of colour here, which should keep things clear (although I wonder how quickly complexity could creep in once folk start using the IDT Families?).

image

With all the required connections added to the Data Foundation, Huck can then start creating the joins between tables using the GUI in a very similar way to the development techniques he was used to in the classic Business Objects Universe Designer.

image
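Under the covers, the multi-source Data Foundation effectively stands in for a cross-source join. A rough sketch of the sort of SQL it represents is below; the names remain hypothetical and the real query is generated by the federation layer rather than written by hand:

-- Roughly the join the multi-source Data Foundation represents: sales facts in
-- the warehouse joined to the temporary weekly Cost of Sales source.
-- All names are hypothetical; the actual SQL is generated by the federation layer.
SELECT f.store_id,
       f.week_ending,
       SUM(f.revenue)       AS revenue,
       MAX(w.cost_of_sales) AS cost_of_sales   -- one weekly figure per store/week
FROM   warehouse.fact_sales AS f
JOIN   dev_sandpit.weekly_cost_of_sales AS w
       ON  w.store_id    = f.store_id
       AND w.week_ending = f.week_ending
GROUP BY f.store_id, f.week_ending;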

With the Data Foundation layer amended (and I’d normally expect it to be somewhat more complex than the two-table structure shown above!), Huck can now move on to refresh his existing Business Layer against it.  This is simply a matter of opening the existing Business Layer and bringing in the required columns from the new data source. Again, it’s very similar to the classic Universe Designer work where you define the SQL to be generated, build dimensions and measures, etc… for the end-user access layer.

image

As discussed in the previous post, it’s possible to build your own queries in this Business Layer interface.  In that previous post I suggested using queries as a way of managing consistent test scenarios.  Here I’d suggest that the queries can be used as the first point of validation that the joins made between sources are correct and return the expected results.
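As a sketch of what such a validation query might boil down to (hypothetical names again), a simple check that every weekly figure finds matching sales rows would catch a broken join early:

-- Join validation: any weekly Cost of Sales row with no matching sales facts
-- suggests the join keys between the two sources are wrong. Names are hypothetical.
SELECT w.store_id,
       w.week_ending
FROM   dev_sandpit.weekly_cost_of_sales AS w
LEFT JOIN warehouse.fact_sales AS f
       ON  f.store_id    = w.store_id
       AND f.week_ending = w.week_ending
WHERE  f.store_id IS NULL;   -- expected result: zero rows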

For a closer look at the queries being generated by the federated Data Foundation, I’d recommend looking at the Data Federation Administration Tool.  It’s a handy little tool deserving of a blog post in itself but, for now, I’ll just note that it’s a core diagnostic tool installed alongside the other SAP BusinessObjects 4.0 Server and Client tools and allows developers to analyse the specific queries run, monitor performance, script test queries and adjust connection parameters.  I’ve found it especially useful for looking at the queries sent to each source and then the federated query across all of them.

image

With the Business Layer now completed, it’s time to synchronise the IDT project with the central repository and publish the Universe. Publishing the Universe is achieved through right clicking on the Business Layer in the IDT pane and selecting ‘Publish’.
 
Huck now starts working with the users to build up reports against the published Universe.  This allows them to confirm that their requirements can be met by the BI solution even whilst the ETL is being developed by Tom. 

When Tom has completed his ETL work, he can let Huck know that the new Data Warehouse table is ready to be included in the IDT.  Huck then has to bring this new table into the Data Foundation and expose it through the Business Layer.  When the new ETL-created table is added to the Data Foundation, Huck has two options.  He can simply remap the existing Cost of Sales column in the Business Layer to the new table, or he can create a new Cost of Sales column from the new source.  The former allows a seamless update of existing reports – after the new Universe is published, they’ll just be refreshed against the new table – whilst the latter allows the users to compare the old and new figures but would require additional work to rebuild the reports to fully incorporate the new, frequently updated Cost of Sales figures from the Warehouse.
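If the second option were taken, a reconciliation query along the following lines (again with hypothetical names) would let the users compare the manually prepared weekly figure with the new daily feed rolled up to the same week:

-- Reconciliation for the parallel-run option: manual weekly figure vs the new
-- daily warehouse figure aggregated to the week. All names are hypothetical.
SELECT w.store_id,
       w.week_ending,
       w.cost_of_sales                         AS manual_weekly_figure,
       SUM(d.cost_of_sales)                    AS warehouse_weekly_figure,
       SUM(d.cost_of_sales) - w.cost_of_sales  AS variance
FROM   dev_sandpit.weekly_cost_of_sales AS w
JOIN   warehouse.fact_daily_cost_of_sales AS d
       ON  d.store_id  = w.store_id
       AND d.cost_date BETWEEN DATEADD(DAY, -6, w.week_ending) AND w.week_ending
GROUP BY w.store_id, w.week_ending, w.cost_of_sales;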

Whichever option is selected though, the IDT clearly allows for a more rapid, agile delivery iteration than traditional Universe development offers.  It’s only part of the BI/DW environment – arguably a small part – but it does bring flexibility to development, hastening delivery, where before there was little.  Developers and users can work together, in parallel, throughout the iteration, each delivering value to the project.  The old waterfall approach of building the Data Warehouse, Universe and, finally, Reports in strict order is no longer necessary.  The IDT has supported a new value-driven delivery approach.

Hopefully, you’ve found this series of posts about Agile BI and the IDT of interest. 

Monday 15 July 2013

Agile Analytics and the SAP Information Design Tool – 2. Testing

This post continues the series looking at how the new SAP Information Design Tool (IDT) can support the agile delivery of SAP BI projects.  In the previous entry I looked at how collaborative working and a degree of version control could be delivered using the IDT, in line with the concepts laid out by Ken Collier in his recent, excellent book “Agile Analytics”.  In this entry I turn to how the IDT can support the testing methodology that Collier explains in the seventh chapter of his book.  As is the situation with version control, the IDT is far from being a complete solution but it is an advance on what is available in the traditional Universe Designer.  Fundamentally, I’m going to be making use of the IDT Business Layer Queries.  These Queries are mentioned, briefly, in the IDT documentation but it’s not really clear what they’re intended to be used for, so I hope no SAP developers mind me co-opting them in the name of testing!
Collier has an extremely useful chapter devoted to the subject of Testing.  Early on he makes two important assertions:

i) “Testing must be integrated into the development process”

ii) “Essential to integrated testing is test automation”

The IDT certainly helps with the first of those assertions but, in its current state, does nothing to help with the second.  My contention is that there may be an option within the wider SAP BusinessObjects deployment which might help with this but I’ll get to that later.  For now, let’s focus on integrating testing with development.  To be fair, I think Collier accepts that the goal of automated testing becomes difficult when you are using ‘black box’ proprietary solutions like the data access layers which SAP Universes effectively provide.  His chapter focuses principally on the testing principles for data warehouse schemas, user interfaces and the various third-party solutions available to help with automating those tests, but acknowledges that not all the components of a BI environment can currently be automated.  If we buy that, then I believe there is value in looking at where the SAP IDT can at least progress things a bit.

Collier also makes reference in his chapter on Testing to Test Driven Development (TDD).  It’s a compelling argument that – and I’m simplifying it here – puts agreeing tests at the start of all agile developments and applies them continuously throughout.  This is the integration that the first of the assertions I’ve quoted above refers to and it’s where I could see the IDT adding particular value through the use of the Business Layer Queries.

Let’s use the productive and collaborative developers I referenced in my last post – Tom and Huck.  When we left them last time, Tom was developing the Business Layer of the IDT ‘exploded’ Universe model whilst Huck worked on adding a new data source to the Data Foundation.  As he developed the Business Layer, Tom was able to create queries that tested his amendments and made sure the results were in line with the user expectations originally defined in their User Stories (Collier is really good and challenging on this topic e.g. he advocates potentially creating these test queries even before development).  How does he do that?  By using the query panel built into the Business Layer interface.

image
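The query panel works with Universe objects rather than raw SQL, but the SQL behind such a unit-test query might boil down to something like this (the object, table and column names, and the expected figures, are all hypothetical):

-- Unit-test query: revenue by store for a known week, with the expected results
-- recorded as comments (in the IDT, the query description can hold them instead).
-- Table, column and store names, and the figures, are hypothetical.
SELECT s.store_name,
       SUM(f.revenue) AS revenue
FROM   warehouse.fact_sales AS f
JOIN   warehouse.dim_store  AS s ON s.store_id = f.store_id
WHERE  f.week_ending = '2013-07-05'
GROUP BY s.store_name
ORDER BY s.store_name;
-- Expected: 'e-Fashion Austin' = 12543.20 ; 'e-Fashion Boston' = 9876.55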

Business Layer Queries are stored with the Business Layer as it is shared in the repository and with other developers.  It’s sadly not possible to automate them in this interface but it does allow a single point of storage for the small unit tests that are necessary for all artefacts in your BI environment.  They’re built using a query interface very familiar to anybody that’s used BusinessObjects before.

 image

Once the Queries are created, they can be renamed and have a description added to them which, I’d suggest, could be used to record expected results.  Once the Business Layer is saved and exported to the Repository, the queries are then available for other developers to run, validating that any changes they have made do not affect the expected results elsewhere.  It’s quite a handy query interface actually as, not only does it return the results, it also returns basic data profiling information about distinct values and their spread in the result set.

image

With the new queries Tom has created now shared, the next time Huck downloads the latest version of the Business Layer to his sandpit environment he’s able to see if the expected results have been affected by any of the changes he has made to the Data Foundation.  Hopefully not, but if there’s an issue, he’ll have to go and amend the Data Foundation to resolve it.

Now eventually, without automation, these tests are clearly going to become obstructive to the rapid delivery of working applications that Agile Analytics is intended to achieve.  I’d hope though that by at least centralising them they will help with the unit testing of individual iterations of Universes.  When it comes to testing that the latest ‘for production’ iteration does not reverse the achievements of previous iterations, an automated process will clearly be required.  Use of the Metadata Management capability of SAP Information Steward should help guide development as to where possible impacts might be, but does not provide the automated solution.

A potential option for that will be to use the capability of the SAP BusinessObjects reporting environment to create a chained schedule of reports, each of which will only run upon successful completion of the previous one.  The first report would be scheduled so that, upon the successful issuing of an Event (e.g. the report returning data), it triggers the next report, which in turn would have to issue an Event, and so on.  If, at any point in this automated run of test reports, there was a failure it would be simple enough to identify which Event had failed and look at what went wrong with the issuing report.

In summary then, what I’m suggesting is that the IDT be used to define and run (un-automated) the Queries required within an Agile Analytics iteration, but that the SAP BusinessObjects Schedule and Event features then be used to automate these tests as reports within the wider end-to-end project test plan.  We integrate tests into the Universe development process but only automate them when an iteration is complete.
Finally, whilst the IDT does not at present allow automation out of the box, I wonder if there is an option for somebody to develop an Eclipse script (Eclipse being the platform the IDT appears to have been built on) that would run a whole suite of the queries automatically?  Could be worth a look at a later date.

In my next and final post on Agile Analytics and the SAP Information Design Tool, I’ll be looking at how the IDT can support the concept of excellent evolutionary design.

Friday 12 July 2013

Agile Analytics and the SAP Information Design Tool – 1. Collaborative Development

In my previous post introducing this series, I discussed both the concept of Agile Analytics (with particular reference to the recent, excellent book by Ken Collier, Agile Analytics) and how the new SAP Information Design Tool (IDT) could go some way (but, by no means, all the way) to supporting it.  This post explores a specific area of detail, namely how the IDT facilitates a degree of collaborative development and version control in SAP Universe development.  It’s far from being a complete version control solution – lots of room for improvement – but it’s certainly an advance on what is possible in the old Universe Designer traditionally used by SAP developers and does allow developers to work together on aspects of the single Universe artefact.

In his Agile Analytics book, Collier devotes his eighth chapter to the topic of “Version Control for Data Warehousing”.  Clearly the IDT is not a Data Warehouse development tool but it certainly qualifies as one of the artefacts used in the delivery of Analytics solutions that Collier lists.  The IDT’s purpose is to prepare and manage the semantic layer on top of Data Warehouses, Marts, etc… rather than actually design one, but I think the core principles of that chapter are still applicable.

Collier writes that ‘Version control is mandatory on all Agile Analytics projects’ and lists five advantages it brings to the delivery process.
1) It allows development rewind. 
2) It allows controlled sharing of code. 
3) It maintains an audit trail of development. 
4) It provides release control.
5) It encourages ‘Fearlessness’ in development. 

I particularly like that word ‘Fearlessness’ and the understanding that it allows developers to experiment with solutions without adversely affecting the project.  This, as part of a collaborative development environment, is perhaps where the IDT can offer the most support over the traditional Universe Designer – always a nightmarish tool to promote code through – although, again, it’s far from complete in supporting all five of Collier’s version control advantages, i.e. it doesn’t really support the rewind principle beyond one iteration or maintain an audit trail of development.

Let’s focus instead on what the IDT does deliver.  I’ll cover how different developers can work on the same SAP Universe, applying a degree of version control to, and sharing the code of, the different layers of the ‘exploded’ Universe.  Throughout, these developers will have to work within the principles of good communication and collaborative working (because the IDT won’t do it all – no tool would!).

I’ve set up two users who I’m sure, being the best of friends, will adhere to the best practices of collaborative development.

Tom Sawyer is first to log into the IDT in the morning.  He’s got a new measure to add to the Business Layer of our eFashion Universe so immediately also starts up a shared session which opens up the centralised Repository versions of all currently deployed models.  Now, it’s not immediately obvious how to do this and, depending on your screen size and resolution, you might have to do a little resizing of the various IDT panes. 
From the Window drop-down menu, Tom selects the Project Synchronisation window, which then opens up on the right-hand side of the workbench.
 
image
The session has to be opened by clicking on the session icon. If your project sync pane doesn’t have enough screen space, this icon will be hidden and you’ll have to expand the pane size. 

Opening the session connects the user to the central repository of Shared IDT Projects.  An IDT project consists of the layers that can be used to create a Universe (.unx), i.e. connection layers, data foundations and business layers.  Note that each project can contain multiples of each, i.e. it’s possible to build more than one Universe from a single project.  They’re probably best organised through agreement within the development team.  A prime place for collaborative working practices and standards to start being developed.

Opening up a shared project then populates both sections of the Project Synchronisation pane – the Sync Status and Shared Project lists.  From the Sync Status window we can see that this iteration of the Project was uploaded on 28th February.  This reflects the iterations developed locally from converting a traditional .unv Universe into the three-layered .unx Project and then shared with the central repository. We can see that, at the moment, all are fully synchronised with the local version.

image

In the Shared Project section of the Project Sync pane, Tom can then ‘lock’ individual layers of the project to prevent other developers changing them whilst he works – fearlessly – on them.  An example of this is Tom locking the eFashion Business Layer so that he can add a new Measure object to it.

image

This will communicate to other users that this layer is currently being worked on.  When Tom’s colleague, Huck Finn, logs in that morning and opens up the Project Sync pane in his IDT workspace, he sees that the Business Layer is locked.  By scrolling along the Shared Project section, Huck can see details of which user has locked it, version number, last update date, locked date, etc…  He can also simply unlock the layer if he wants to, which does rather undermine the process.  A charitable interpretation would suggest SAP are merely trying to encourage effective collaborative working through increased developer communication but, I’d suggest, this is a clear candidate for product enhancement.

Anyway, Huck goes over to Tom’s desk and agrees that whilst Tom is creating this new measure, Huck will work instead on adding that new data source to the existing Data Foundation layer.  Huck checks out the Data Foundation layer and they both have a productive morning working on the same project.
After lunch, Tom is happy that his changes have been tested successfully in his local sandpit environment. He can now see, as expected, from the Project Sync pane that the version of the Business Layer on the central repository is out of sync with the version in his local sandpit environment.

image

By clicking on the “Save Changes on Server” icon on the top left of the Sync Status section, Tom can now check in his changed layer to the central repository.

Tom calls over to Huck and lets him know that the new Business Layer is ready for him to work with so Huck refreshes his Sync Status section and sees that, yes indeed, there is a new version there ready for him to sync to his local area with the status “Changed on Server".

image

Huck clicks on the “Get Changes from Server” icon and a copy of the most recent version of the Business Layer is downloaded to his local sandpit.  Huck can now test the changes he’s made to the Data Foundation that morning don’t impact on that Business Layer and proceed, fearlessly as ever, with his own development tasks.

You get the gist.  There’s a clear and straightforward check-in/check-out environment available in the IDT that supports an iterative, incremental and evolutionary – an agile – approach to Universe development.  Similar functionality could be achieved with the pre-4.0 Universe Designer and a variety of CMS folders, but only ever for one developer at a time.  With the ‘exploded’ Universe approach, developer productivity gains are clear.  Releases can become more rapid.

There is, of course, room for improvement.  In future releases of the IDT I’d like to see a further ‘unpacking’ of the Universe so that developers can work on individual components of each layer in the same check-in/check-out style, e.g. a Dimension Class in a Business Layer could be developed by one whilst another worked on Measures.  It would also be nice to add some commentary to each iteration as, at the moment, Huck would have to manually check for the changes Tom had made (assuming they weren’t working in a good collaborative fashion, that is!).  And, I guess, ultimately I’d like to see full integration with a third-party version control system so that, as per Collier’s recommendation in Agile Analytics, all the artefacts associated with the BI environment can be stored in one place, allowing the environment to be completely rebuilt if ever required.  Subversion would seem to be the obvious version control tool to integrate here, given that SAP already ships it with their BusinessObjects CMS Repository.

Not perfect then but a start at delivering the collaborative environment and version control required to support Agile Analytics.  Next time I’ll be looking at a specific feature of the IDT Business Layer development – Queries – and how they could be used to help deliver continuous testing.

Wednesday 10 July 2013

Agile BI Delivery with the SAP Information Design Tool – Introduction

Over the past few years there hasn’t been a Business Intelligence event I’ve attended that didn’t cover the topic of Agile BI in one sense or another.  It was always a fascinating source of discussion that quite often raised considerable passion.  To me, it seemed a common sense approach and one that, to varying degrees, the BI projects that I have been delivering over the years have pretty much been in line with.

At the start of last year, I was browsing various BI discussion boards and came across a recommendation for a new book titled “Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing”, written by Ken Collier.  There are many wonderful things about 21st Century living but one is surely the fact that, minutes after reading these recommendations, I’d downloaded the book onto my Kindle and was fully engrossed in what Collier had to say.  I’m happy to report that the recommendations were well made and I’ve been adding my own to clients and colleagues ever since finishing it the first time round.  What I thought I’d do here is tie some of the concepts that Collier raises in his excellent book to some of the features in the now only relatively new SAP Information Design Tool.

I’m in no way trying to tie Collier to SAP.  One of the genius points about “Agile Analytics” is that – like me – it is technology independent.  Collier makes suggestions and clearly has some tools he’s more familiar with than others, but stresses, throughout the book, that Agile BI does not rely on a single tool set.  All I’m trying to do in the series of blogs for which this is the introduction is highlight some of the areas where SAP’s Information Design Tool (IDT) can help support Agile BI.  There is, I intend to point out, plenty of room for improvement on SAP’s part but I do believe that the IDT is a good starting point, as far as technology ever can be, for any development team considering an Agile BI approach.

Let’s start though with some words about Agile BI.  I can do no better than recommend (again) Collier’s Agile Analytics book.  The introductory chapter makes clear that Agile BI “is not a rigid or prescriptive methodology; rather it is a style of building a data warehouse, data marts, business intelligence applications and analytics applications that focuses on the early and continuous delivery of business value throughout the development lifecycle”.  There’s a brilliant analogy in the introduction setting the scene by comparing BI to mountaineering, with traditional approaches seen as Siege Style Mountaineering (large teams of climbers expending much cost, effort, sherpas, etc…) whilst Agile BI delivers more productive Alpine Style ascents (small teams reaching the summit faster with just the bare essentials). Throughout, Collier is at pains to observe that it is “simple but not easy” and does not shy away from the challenges of Agile BI (I especially appreciated the need to be ‘fearless’ in doing the most difficult thing first to fully identify ‘project peril’).
Another key point Collier continually makes is that delivery resources need to perfect ‘being’ Agile rather than just ‘doing’ Agile.  It’s a state of mind rather than a strictly replicable approach. This is where his adaptation of the original Agile Manifesto comes in.  It’s worth reproducing in full:

image
There’s a lot more detail about this manifesto in the book.  You might well question some points of the manifesto but Collier successfully rebuts most of the obvious concerns.  It’s all good stuff but it’s not my purpose to detail it all here – go on, buy it yourself!

My focus instead is how SAP’s IDT fits into this manifesto.  Clearly it’s a ‘tool’ which is not as valued  as ‘Individuals and Interactions’ but, the key point is, it is still valued.  Collier’s book is split into two sections covering both the Management and Technical aspects of Agile BI delivery.  What I’m attempting to do in these blogs is demonstrate how the IDT can support some of those technical aspects.

Before I can do that though, it might be worth reviewing exactly what the SAP IDT is.  The Information Design Tool was introduced as a development environment with SAP BusinessObjects 4.0.  In short, it’s a new tool with which developers can create the longstanding SAP BI semantic layers – the Universes.  The old Universe Designer is still there, available for developers happy with the familiar, but the IDT represents a significant leap forward in functionality and, I’d argue, is much more suited to Agile BI delivery than the legacy product.  This should allow for reduced development costs and a corresponding increase in emphasis on the items on the left-hand side of the Agile Analytics Manifesto, e.g. where solutions are less costly to change there is more scope for responding to change in user requirements, and where systems can be self-documenting there is less need for comprehensive documentation.

An easy way I’ve found to understand what the IDT does is to think of it as exploding the previous development environment for SAP Universes.  Rather than working on a single entity (the .unv file), the IDT works with three distinct layers – Connection, Data Foundation and Business Layer – to produce a new Universe (a .unx file).  The diagram below illustrates this explosion using the good old eFashion Universe imported into the new format (a straightforward enough process).

image
So that’s a brief introduction to both Agile BI and the SAP Information Design Tool.  Over the next three blogs, I intend to review how the latter can support the former.

i) Collaborative Development – whilst it doesn’t offer the full version-controlled environment commended by Collier’s eighth chapter, the SAP IDT does provide a good advance, for a multi-developer team, on the capacity available in the traditional Universe Designer.

ii) Continuous Testing – the IDT offers an interesting, not quite complete, solution to the need to perform continuous testing on the Agile-developed Universes and I’ll also cover how using triggered SAP WebIntelligence reports can help deliver on this approach.

iii) Evolving Excellent Design – the exploded Universe layers present a model against which this can be achieved.  Correct sources can be prototyped in multiple systems, reports developed and then repointed at an optimised single source solution if required i.e. no need to redevelop reports when a new source system is delivered. 

Sunday 7 July 2013

Three Idiots Sat Babbling


At the start of the week there was an interesting piece in The Guardian about Big Data.  There’s been a lot of this sort of thing over the past few years of course, but I’ve really noticed a ramp-up in press attention over the past couple of months.  Perhaps that has something to do with the recent release of Big Data: A Revolution That Will Transform How We Live, Work and Think by Kenneth Cukier and Viktor Mayer-Schönberger.  It's certainly a provocative read and one I’ll return to in a future post, but for now I wanted to focus on another text mentioned in that Guardian article – The Minority Report by the great Philip K Dick.

This is a (very good) short story written in 1956 by Dick and undoubtedly gets referenced in quite so many Big Data articles because of the 2002 (not bad) film adaptation which definitely made analytics sexier than it probably deserves to be.  It was of relevance to the Guardian article primarily because the PreCrime unit it revolves around so closely resembles the “Crush” (Criminal Reduction Utilising Statistical History) policing approach being adopted in various parts of the planet. 

You can get a precis of the plot here (though I’d recommend reading it yourself because it is good) but I was interested in reading it to see if there was anything, on top of the basic concept of using predictive analysis to reduce crime, that relates to the business of data management as I know it fifty-seven years after the story was written.

Given the era it was written in we can surely forgive the eccentric systems architecture it describes.  Chapter 1 gives a useful summary of what the PreCrime unit looks like...

"In the gloomy half-darkness the three idiots sat babbling. Every incoherent utterance, every random syllable, was analysed, compared, reassembled in the form of visual symbols, transcribed on conventional punchcards, and ejected into various coded slots. All day long the idiots babbled, imprisoned in their special high-backed chairs, held in one rigid position by metal bands, and bundles of wiring, clamps. Their physical needs were taken care of automatically. They had no spiritual needs. Vegetable-like, they muttered and dozed and existed. Their minds were dull, confused, lost in shadow.

But not the shadows of today. The three gibbering, fumbling creatures, with their enlarged heads and wasted bodies, were contemplating the future. The analytical machinery was recording prophecies, and as the three precog idiots talked, the machinery carefully listened."

No doubt a familiar experience to anyone that’s worked in Business Intelligence or Data Warehousing environments in the past decade but let’s look past the punchcard technology and call our “precog idiots” the equivalent of a Big Data statistical correlation engine.  As the story progresses, Dick offers an insight into how the three work together…

"...the system of the three precogs finds its genesis in the computers of the middle decades of this century. How are the results of an electronic computer checked? By feeding the data to a second computer of identical design. But two computers are not sufficient. If each computer arrived at a different answer it is impossible to tell a priori which is correct. The solution, based on a careful study of statistical method is to utilise a third computer to check the results of the first two. In this manner, a so-called majority report is obtained"

At a stretch, we could call this in-memory, parallel processing?

OK, I’m stretching the point here.  Perhaps Dick was not that much of a systems visionary?  Maybe his real strength was in predicting some of the data management issues that commonly arise today? 

1) A Data Quality issue causes serious problems – Specifically, it's the data quality dimension of timeliness that is revealed as having dropped the hero into difficulty.  It is the fact that each of the precog reports is run at a different time – therefore using different ‘real time’ parameters – that leads to the different result of the minority report.  And that’s the simple version of the plot. Nevertheless, the lesson is clear - don't compare apples with oranges.

2) Engage current data providers in any delivery enhancement project – Driving the plot behind the pre-cog mistake is a power struggle between the Army, who used to impose order, and the PreCrime Unit that has usurped that role.  This seems to me to perfectly reflect the challenge any new Business Analytics solution faces when having to earn the credibility required to replace existing solutions.  Maybe even today there are people who claim to see little difference in Big Data solutions other than scale? Successful implementation projects will aim to bring the existing providers of information into their stakeholder engagement.  These resources typically have much wisdom to impart and should be encouraged to find benefit from the new solutions.

3) Knowing what data is held and how it is reported is key – It is only because of his position in the PreCrime Unit that the hero gets placed in the tricky situation in the first place, but it is also only by understanding the nature of the data held about him that he can change his behaviour to escape the trap (sort of). Not only do I think it’s sensible data governance practice for organisations to know what data they actually have and how it is reported, I also think it is important for all of us as individuals to educate ourselves about what data is held about us, where, by whom and, critically, how we all contribute to its creation.

From an initial review of Cukier and Mayer-Schönberger I note that the authors are advising we all stop worrying about the causal ‘whys’ and focus instead on the ‘whats’ that the data shows us.  Clearly the lesson from Dick’s The Minority Report is to continue asking ‘why’ and ‘what’ but also to start asking ‘what if’.

I’m aiming to post some thoughts on Big Data: A Revolution once I’ve finished it.  Hopefully it will be interesting to compare that vision of the future we’re living today with Dick’s vision of the future he envisaged back in 1956.

Wednesday 3 July 2013

The first SELECT FROM WHERE blog post

So, this is it, the first post on my latest foray into blogging. I'd fallen out of the habit since changing jobs over a year ago. Guess I must have missed it but what to talk about first? How to break the ice?

Let's keep it simple - why SELECT FROM WHERE?


Three reasons spring to mind:
 

1- SELECT FROM WHERE was probably the first and most important concept I 'got' in information technology. It opened up my mind to the possibility that I might be able to build a career in that field, and these three SQL building blocks have remained at the core of every project I've been involved with since.

2 - Perhaps the above had something to do with SFW being my initials?


3 - It was originally just going to be acronymised to SFW because it absolutely is safe for work but then I began to worry that people would start thinking there was an NSFW version somewhere and all the connotations that would bring.  Classic.  Metadata.  Issue.

OK, that seemed straightforward enough.  What about the aim of the blog?  Another handy three-point list?

1 - SELECT interesting themes, articles and software
 

2 - FROM a variety of sources be they my personal experiences (no names here though!), other bloggers (see the list to the left), journals and books

3 - WHERE the topics are related to the management and usage of data. Initially this is most likely going to focus on issues of data quality and profiling, as that's where my current career focus is, but I'm not going to narrow myself to that long term.  I'm certain to go back to Business Intelligence, Reporting and Analytics at some point - that stuff is all too interesting to stay away from for long.

And that's that...the first post.  More to follow.  Thanks for reading.