Transform data on the fly with Java Nashorn

How we use Java Nashorn to transform data on the fly

We were asked to migrate data back and forth between two schema-incompatible systems. At the same time, their schemas were evolving. We decided not to waste our developer energy on writing disposable mapping tables in Excel. Instead, we provided our product colleagues with the tools to transform data themselves in a more efficient way. And we used Java Nashorn to do it.

An old problem

We’ve all faced this challenge many times: migrate an old system to a new one. Keep all the data and adapt it to the new schemas, and on top of that provide interoperability. At Schibsted, we’ve had many cases where we have upgraded the technology for our classified ad websites, and then we have had to ensure we keep our existing user base and content data.

We poor developers lose sleep over these operations! Often we create ad hoc scripts:

mix_fields_together, extract_some_info_from_them, call_third_parties_to_fill_in_the_gaps, optimize_so_migration_doesn_t_take_ages, and so on ad nauseam.

And worst of all, when we get the result of the script, someone will often realize that something has to change, e.g. we were missing fields or had made a wrong assumption.

Start again. Edit script. Run migration. Show result. More changes.

Sound familiar?

Room for improvement

This time we had to do it better. We wanted anyone in the team to be able to run and debug a migration, and we wanted the task itself to be less painful, so that everyone could keep working on adding value.

Hopefully, we would also find a better solution to the problem at hand than our many previous attempts. So what could we do?

Experience has shown that most data transformations are simple mappings that can be managed in Excel worksheets. We would support this.

The same experience shows that a small but nasty number of those transformations require more complex formulas for adding, subtracting, combining or other manipulation of the original data. We will also manage this.

So in summary, these came to be the requirements:

  • Whoever deals with the migration can manage the mapping tables and the rules to mix, remove, extract or otherwise manipulate data fields.
  • It shouldn’t be a painful process.
  • Oh, and some other minor stuff like reliability and performance …

Choices we made

So, we wanted to let users upload their Excel mapping tables, and also let them define how to compose their destination field definitions based on those tables.

That called for giving users a language capable of expressions, and also a place to drop those definitions.

[Let me divert for a second to state that we had decided to use Scala for the project implementation, for reasons that would require a separate post to explain!]

We settled on providing a service to upload the tables and the field definitions. The language we would provide to write the definitions was the next big ugly monster (I mean interesting problem!) we faced.

We didn’t want to implement a language. What if we used something that already existed? We went looking for available interpreters. How cool is that? Java includes a Javascript interpreter, and anything available in Java is available in Scala too!

Enter Nashorn

It turns out that Java 8 comes with a first-class, ECMAScript-compliant Javascript engine out of the box. Its name is Nashorn (German for rhinoceros) and we decided to try and see what it could do for us. Javascript is an easy language to use at this level, and we knew many non-technical people were already familiar with it, if only to do some basic magic on web page prototypes and such.

Besides, this programming language had the added advantage of being “untyped” (dynamically typed, strictly speaking): for a job like this, where the user might end up doing crazy things with data of unknown types, it looked like a perfect fit!

Javascript then.

Back to the task

Or, the most code-oriented part of the post

The Javascript engine

We needed a Javascript interpreter in our Scala code. Java’s standard scripting API (javax.script) can hand one out by name, so instantiating it takes only a couple of lines.

The transformation rules

In order to migrate data schemas, you need to define a set of output fields derived from a set of input fields and mapping tables. Some simple examples:

  • OutputField1 ← input field x
  • OutputField2 ← input field x concatenated to input field y
  • OutputField3 ← first three characters of input field z
  • OutputField4 ← mapping in table “categories” corresponding to input fields v and w

And so on.
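To make the examples above concrete, here is a minimal sketch of the four rules as plain Javascript. The names input, tables.categories and lookupCategory, as well as the sample values, are made up for illustration and are not taken from the real migration. The code sticks to ES5 syntax, which is what Nashorn on Java 8 understands.

```javascript
// Hypothetical input record; the field names x, y, z, v, w follow the list above.
var input = { x: "Volvo", y: "V70", z: "Barcelona", v: "cars", w: "used" };

// Hypothetical lookup table: an array of rows, as copied from an Excel sheet.
var tables = {
  categories: [
    { v: "cars", w: "new", category: 1010 },
    { v: "cars", w: "used", category: 1020 }
  ]
};

// Return the category of the first row matching both keys, or null if none does.
function lookupCategory(rows, v, w) {
  var hits = rows.filter(function (row) {
    return row.v === v && row.w === w;
  });
  return hits.length > 0 ? hits[0].category : null;
}

var output = {
  OutputField1: input.x,                                            // plain copy
  OutputField2: input.x + input.y,                                  // concatenation
  OutputField3: input.z.substring(0, 3),                            // first three characters
  OutputField4: lookupCategory(tables.categories, input.v, input.w) // table lookup
};
```

Each right-hand side is the kind of small expression an operator would write for one output field.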

We ask the operator to write a YAML file with one entry per output field, whose value is the Javascript code to calculate it. Our system reads the file and later runs each snippet to compute its output field.

We provide bound variables with the full JSON input object plus lookup tables (Javascript arrays) copied from the good old Excel sheets, so you can compose and look up in order to define your output fields.

It turns out you can bind Javascript variables to the runtime environment where you execute your Javascript code: every name you put into the engine’s bindings becomes a global variable that the scripts can see.

Of course, the magic is in bindings. We had previously obtained the bindings object from Nashorn in initEngine.

The schema transformation

The last step in the process is to apply the transformation definitions to the input. We did that in our transform method, which asks Nashorn to run Javascript code: a global transform function that we also bound in a separate step.

For reference, transform is itself a Javascript function, and ruleset contains the definition rules parsed from the YAML file described above. Both transform and ruleset are bound to the engine in a way similar to the input data.

And thus we obtain a new record with as many fields as defined in the rules.
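The loop behind such a transform function can be sketched in plain Javascript. This is an illustration under assumptions, not the code we ran: the ruleset shape (output field name mapped to expression source), the helper compileRule, the three-argument signature and the sample data are all hypothetical. Inside Nashorn the rules would see input and the tables as bound globals; here they are passed as function parameters so the sketch runs in any Javascript runtime.

```javascript
// Hypothetical ruleset: output field name -> Javascript expression source,
// in the shape the YAML file might take once parsed.
var ruleset = {
  title: "input.brand + ' ' + input.model",
  region: "input.city.substring(0, 3).toUpperCase()",
  categoryId: "tables.categories.filter(function (r) { return r.name === input.category; })[0].id"
};

// Compile one rule into a function of the two names the rules may use.
function compileRule(source) {
  return new Function("input", "tables", "return (" + source + ");");
}

// Apply every rule to one input record and collect the results.
function transform(ruleset, input, tables) {
  var record = {};
  Object.keys(ruleset).forEach(function (field) {
    record[field] = compileRule(ruleset[field])(input, tables);
  });
  return record;
}

// Hypothetical sample data.
var input = { brand: "Volvo", model: "V70", city: "Barcelona", category: "cars" };
var tables = {
  categories: [
    { name: "cars", id: 1020 },
    { name: "bikes", id: 2040 }
  ]
};

// record ends up with one field per rule: title, region and categoryId.
var record = transform(ruleset, input, tables);
```

The sketch recompiles every rule for every record to stay short. When migrating many records, compiling the rules once and reusing the resulting functions would avoid that repeated work.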

Colophon

Using an embedded engine for a well-known language like Javascript has allowed us to provide the tools for frictionless migrations. Nashorn has delivered excellently in terms of performance as well as in terms of API. It gives you a lot of power in a very easy way.
