User session management and change tracking can overcome the pitfalls of 'last commit wins' data access.
When implementing any retail planning solution around a centralised database, managing parallel data access is a potential minefield. One of the most common support queries we receive about older solutions is “why have my numbers changed?” The answer tends to be that two users unknowingly worked on overlapping datasets at the same time.
Let’s use a fictional example of a WSSI tool in an apparel retail business. Jenny fetches all of menswear to adjust her divisional sales forecasts. Brian is working on cross-divisional forecasts for sourcing-led categories, and fetches all of denim for the same period. Where their datasets intersect (men’s denim) conflicts could arise.
A spectrum of solutions
Both users continue to work on their plans, and Jenny commits her changes to the database. Shortly after, Brian commits his changes. Neither user is aware that the other was working on the same data, so when Jenny re-fetches the same data later to pick up where she left off, she is surprised to see that her menswear forecasts have changed.
Good governance of role-based user access at product/location level is a must for any business, and goes some way towards avoiding accidental data clashes, but there are still scenarios where two users with a legitimate need to access the same data could come into conflict. Solutions to this common problem lie somewhere between two extremes, depending on the nature of the application and the data:
- Lock the data - when a user fetches, they ‘check out’ the dataset, denying other users read-write access to the same rows until they have committed.
- Live collaboration – users can see each other’s changes happening live in the application, essentially swapping ‘last commit wins’ for ‘last change wins’ at the most granular level.
The ‘check-out’ approach certainly limits conflicts, but the problem comes with checking rows back in – or more specifically, what happens when data is not checked back in, for example if a user’s PC crashes, or they leave their desk for a long period without checking data back in? Arbitrary timeouts and application downtime can mitigate this problem, but waiting for a colleague’s access to expire can be very frustrating for another user with a deadline to meet.
The ’live updates’ approach works really well for flat files – think Google Docs – but is poorly suited to database-driven applications where underlying data is presented in an aggregated view, rendering changes to individual data points much harder to spot from within the application interface. The potential for constantly shifting numbers makes this an unworkable approach for modelling and forecasting.
Somewhere in the middle is a more pragmatic, stable approach: session management and change tracking. In other words, working within your own bubble.
Introducing bubbles
What if each user works within their own bubble? An application will give any user access to a subset copy of the root data, with a session ID attached. Users still need to commit their changes back to the root database, but at least conflicts can be resolved by comparing the data in the user’s session with the previous commit, surfacing disparities to the user and allowing them to choose whether to proceed.
This idea is not new, but we don’t see it applied often within big data projects. Software developers use a similar approach through platforms like GitHub and Bitbucket. A developer pulls down a project’s source code from a central repository and works on a local copy to fix bugs and implement features without impacting other developers. Multiple branches can be created off the source branch and when merging back to the source branch, all changes can be represented as additions or deletions, with the merging user able to decide which ones to accept.
Obviously, this approach is more complicated for a data than code, as the overlap between two users’ sessions might not be complete. In the previous example, we know the overlap between Jenny and Brian’s datasets occurs on men’s denim, but that could still equate to thousands of individual data points depending on the granularity of the data in all dimensions (product, location, time). Should we force the user to review each conflicting row, or show a summary of the discrepancies at the highest level with a binary choice between accepting or rejecting all changes?
Then there’s the issue of unchanged data. In the source code example, changes to distinct sections of code in a well-structured project won’t usually conflict; if one developer adds a section of code to one file to enhance a feature, while another developer using a different branch amends a couple of lines of code in a different file to fix a bug, both changes can be evaluated independently at the point of merging.
The same logic doesn’t apply when making changes to a dataset of rows and columns. Returning again to our fictional example, what happens if Jenny is happy with the existing ecommerce forecasts, but makes changes to her store numbers, while Brian makes some changes to ecommerce numbers without changing store forecasts? Comparing changes row-by-row there are technically no conflicts, as they haven’t changed the same data points, yet accepting both users’ changes would result in aggregated numbers that don’t match what either user expects to see. In forecasting terms, committing an unchanged number is as much of a decision as changing it.
Our approach
So, there’s a limit to the value and practicality of comparing row-by-row changes, particularly for datasets that can be sliced in multiple ways. This pushes us towards a blunter all-or-nothing approach towards resolving data conflicts. It doesn’t have to be the last commit that wins under this approach, but one commit has to win. In this case, we need to equip users to avoid conflicts where possible, and resolve them when they occur.
We’ve already applied the bubble concept of session management to our new WSSI solution. Here’s how it works in practice:
- When fetching, a user can create a new session or ‘bubble’, or re-open an existing one.
- Creating a session pulls down a copy of the fetched data from the root database to an intermediate database with a unique session ID and user ID.
- The intermediate database sends the data pre-aggregated to the front-end application.
- Individual changes in the front-end are sent back to the intermediate database in a queue, meanwhile the front-end UI is optimistically updated to reflect the change. (We’ll cover the benefits of optimistic updates in a future blog post!)
- Updating a slicer or filter in the application sends a new query to the intermediate database.
- The server resolves all changes in the queue before sending the requested data to the application.
- Committing merges changes from the intermediate session back to the root database.
- Bubbles are only ‘popped’ (removed) as part of the import process that updates actualised data (daily or weekly based on the customer’s needs).
The bubble approach offers some significant benefits beyond mitigating conflicts between users:
- Bubbles working with a subset of data means filtering, slicing and disaggregation is faster for optimal application performance.
- Auto-saving changes to the session means users don’t lose work when they lose network connection or close the application.
- Users can have multiple active sessions, allowing them to forecast different scenarios and choose which one to commit.
- Colleagues can resolve commit conflicts between themselves by reopening and committing from the correct bubble.
- Where required, commits can be rolled back in the root database easily by an administrator.
We’re also developing additional features to enhance the usefulness of bubbles:
- Notifying users when fetching if another user has an uncommitted session open with overlapping data, allowing colleagues a chance to communicate with each other to pre-emptively avoid conflicts.
- Notifying users when committing if another user has committed changes against any overlapping data since they fetched, and allowing them to fetch a copy of the most recent commit to check before deciding whether to commit their own changes.