I’m currently working as a BA / ScrumMaster for the Data Lake team at one of the world’s biggest online fashion retailers, which has been an interesting journey so far. It’s not my first big data project, but as always I’m amazed at how far the technology has moved on since my last comparable project. One of the benefits of coming in as a fresh set of eyes on a team that has been running for a while is that the big picture is still absolutely clear to me. There’s always a danger, particularly in time-pressured projects like this one, that once we get stuck into the detail of delivering iterations, we become more focused on those deliverables, and more likely to discuss them at User Story level rather than at Epic or Initiative level.
I think it’s for this reason that I was recently asked to co-present an overview of the Data Lake team’s current status and direction to a wide range of stakeholders from the client’s Data Community (as a global fashion leader, the company has multiple teams of data scientists, engineers, analytics professionals and report writers, all consumers of the vast amounts of data generated by global-scale online retail). The fact that I am a new face to many members of the Data Community probably meant they were more open to a refresher on what the main drivers behind the creation of a Data Lake are, and why the value that can be extracted is worth the effort, even if there is a learning curve to get over when adopting the approach. And with the unstoppable march towards ML- and AI-driven solutions, that value is only going to increase with each passing day.
As such, I structured my section of the talk to take the attendees on a bit of a journey. I felt it was important to start at the minutiae of a single dataset from a minimal data feed, then move up a level to correlating that data with the other diverse datasets coming into the Lake, and finally on to the big-ticket value-adds: the creation of datasets that can power ML or AI initiatives. One of the key messages I presented was this:
Every piece of data, no matter how modest its initial purpose or how awkward its source format, can contribute to more powerful and far-reaching use cases when it can be easily discovered, queried and correlated efficiently with other datasets.
To me, this is the true essence of a Data Lake, although I’m careful to caveat that with the unsurprising revelation that the last part of that statement is actually a very challenging proposition to deliver.
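To make that key message concrete, here’s a minimal sketch of the kind of thing I mean. Suppose a modest clickstream feed lands in the Lake alongside a product reference dataset; on its own the click feed answers very little, but once the two can be queried and joined, business questions start to fall out. The table and column names here are invented purely for illustration (the real datasets are far messier, of course):

```python
import sqlite3

# Two hypothetical datasets: a bare-bones click feed and a product
# reference feed. Names and values are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (product_id TEXT, ts TEXT)")
conn.execute("CREATE TABLE products (product_id TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("p1", "2019-01-01"), ("p1", "2019-01-02"), ("p2", "2019-01-02")],
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("p1", "shoes"), ("p2", "dresses")],
)

# The click feed alone is just IDs and timestamps; correlated with
# the product data it yields something useful: clicks per category.
rows = conn.execute("""
    SELECT p.category, COUNT(*) AS clicks
    FROM clicks c
    JOIN products p ON c.product_id = p.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('dresses', 1), ('shoes', 2)]
```

The join itself is trivial; the hard part, as the caveat above suggests, is making discovery and correlation this easy across hundreds of real feeds with inconsistent keys and formats.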
It’ll take a bit of effort, but I feel pretty confident that achieving this goal isn’t too far off. One of the main hurdles will be ensuring that the solutions we put in place are self-service to the highest degree possible. Removing friction from the process of adding data feeds and sources to the Lake, while keeping access to the resulting output equally frictionless for consumers, has to be one of the driving principles of creating an active and engaging Data Lake (a point that our Platform Lead was keen to get across to the audience).
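To give a flavour of what “self-service” onboarding could look like, here’s a hypothetical sketch: a producer declares a new feed with a small descriptor, and validation plus registration happen automatically, with no ticket queue in between. Everything here (field names, the format whitelist, the registry) is invented for illustration, not a description of our actual platform:

```python
from dataclasses import dataclass

# A minimal, made-up feed descriptor a producing team might submit.
@dataclass
class FeedDescriptor:
    name: str
    source_uri: str
    data_format: str   # e.g. "csv", "json", "parquet"
    owner_email: str

# A stand-in for the Lake's catalogue of registered feeds.
REGISTRY: dict[str, FeedDescriptor] = {}

def register_feed(desc: FeedDescriptor) -> str:
    """Validate the descriptor and add the feed to the catalogue."""
    if desc.data_format not in {"csv", "json", "parquet"}:
        raise ValueError(f"unsupported format: {desc.data_format}")
    if desc.name in REGISTRY:
        raise ValueError(f"feed already registered: {desc.name}")
    REGISTRY[desc.name] = desc
    return f"feed '{desc.name}' registered"

# A producing team self-serves a new feed in one call.
msg = register_feed(FeedDescriptor(
    name="returns-feed",
    source_uri="s3://example-bucket/returns/",
    data_format="csv",
    owner_email="team@example.com",
))
print(msg)  # feed 'returns-feed' registered
```

The point of the sketch is the shape of the interaction, not the code: the producer supplies a declaration, the platform does the rest, and consumers can discover the feed through the same catalogue.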
I’m already looking forward to posting an update on this – hoping to see some interesting progress over the next few weeks.