
Configuration pipeline exploiting Mustache template. #24

Open
skarampatakis opened this issue May 20, 2017 · 19 comments
@skarampatakis
Collaborator

So far, the CSV2RDF pipeline uses a "version" of Mustache for reconfiguration. Almost 240 edits have to be made in order to reconfigure the pipeline.

Manual editing would probably introduce typos or other kinds of errors. The number of required edits comes down to only(!) 20 unique ones, since in most cases they are duplicates.

The idea was to have the variables use a "Mustache-like" template, such as {{@@variable_name@@}}, trying not to interfere with the JSON-LD notation used by the pipeline. So the current workflow is to use a text editor, a shell script, or any other possible means to automatically replace all these entries with the required variable values. The list of the variables that need to be changed can be found here. Since LP already has a Mustache component, how could we use it to reconfigure the pipeline?
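The shell-script workflow described above can be sketched as follows. This is a minimal illustration, assuming the placeholder form @@variable_name@@; the variable names and the sample pipeline snippet are hypothetical, not taken from the real CSV2RDF pipeline:

```python
import re

def fill_placeholders(text: str, values: dict) -> str:
    """Replace every @@name@@ placeholder in text with its value."""
    def repl(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]
    return re.sub(r"@@(\w+)@@", repl, text)

# Illustrative fragment of an exported pipeline; the real file is JSON-LD.
pipeline_text = '{"dataset": "@@raw_dataset_uri@@", "year": "@@fiscal_year@@"}'
filled = fill_placeholders(pipeline_text, {
    "raw_dataset_uri": "thessaloniki-2016-revenue.csv",
    "fiscal_year": "2016",
})
print(filled)
```

Raising on a missing placeholder, rather than leaving it in place, catches exactly the kind of silent typo that manual editing introduces.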

At the moment the only possible way is to create a new pipeline, upload it to LP, and execute it. I think this solution may have scalability issues. Would it be possible to just reconfigure the pipeline and execute it, as happens with the FDP2RDF pipeline?

I believe this kind of approach could save us a lot of time, as these parameters are already gathered by the OS Packager. So we could use the OS Packager to configure both LP and OS pipelines.

@jakubklimek could you please have a look and provide a kickstart guide?

@skodapetr

@skarampatakis Do I understand correctly that the scalability issue is related to a potentially huge number of pipelines and executions? So the desire is to limit the number of executions and instances to a minimum?

@skarampatakis
Collaborator Author

@skodapetr
I believe yes. At the moment, every time a user configures a pipeline, it has to be uploaded and run. In real-world applications this would create a huge number of pipelines (followed by their garbage). Even a user who creates a custom pipeline for a complex data structure, but reuses it for the same use case every year, say, would have to reconfigure the pipeline. We have seen this with our use case of Greek municipalities, where we had to create a new pipeline for each year and each municipality, while all of them have the same structure.

Another possible solution I was thinking of, and used recently, would be to use a YAML (or similar) descriptor with a script to reconfigure the pipeline internally. But I don't know whether this is even possible for LP pipelines.

@jindrichmynarz
Contributor

jindrichmynarz commented May 30, 2017

Why not leverage the fact that LP-ETL pipelines are serialized in RDF? One solution might be to put placeholder resources (e.g., blank nodes instantiating sp:Variable identified with sp:varName) in places that require configuration. Consequently, there can be a SPARQL Update that replaces these variables with concrete data given configuration in RDF (e.g., provided via the Text holder component). This would enable higher-level operations than basic text manipulation with Mustache. What do you think?

@skarampatakis
Collaborator Author

Most of the configurations that need to change are inside the SPARQL queries of SPARQL Update or Construct components. Do we have a solution for how we can change these with your method? Can we have a minimal working example so I can take a look?

Maybe I am missing something here.

@jindrichmynarz
Contributor

If the variable parts are inside SPARQL, then this method does not provide much help. You could think of using SPIN to represent SPARQL in RDF and then render it to literals using the SPIN API, but that would likely be overkill. I was thinking more about variables such as simple values in the configuration of LP-ETL components. If you need to generate SPARQL instead, there are many approaches, some of which I covered in my post last year. Using Mustache might be a fine solution for that.

@skarampatakis
Collaborator Author

Hi @jindrichmynarz ,
I have read your post (which, by the way, is great, as all of them are), and that was the "eureka" moment I had about using the Mustache component to reconfigure the pipeline. I have also tried to play with the demo pipeline on your public LP instance but, to be honest, I was a bit confused. That's why I asked for your help: if you can provide a kickstart, I can do the rest. We have the pipeline, and we also have the variables that need to be changed. Now we only have to somehow "feed" these variables with values and replace them in the pipeline.

I have also looked at SPIN at some point but, to be honest, I think that, as you said, it would be overkill.
The overall problem is that we have a pipeline template that is designed to be used by the data uploaders of the platform. The people responsible for this job do not, and should not need to, have even minimal knowledge of RDF or SW technologies.

So I believe we need a solution that is easy to produce and easy to consume.

The simplest thing I would like to have would be a file with variable name-value pairs. Then "something" reads that file, configures the pipeline by replacing the variables with their values, and finally executes it! This file could be written either by hand or by the Packager. So it has to be simple and, in any case, not use RDF syntax. YAML looks like a good candidate. The Mustache component of LP uses an RDF flavor of the Mustache template, so it may be a bit confusing. Please correct me if I got something wrong here.

Just thoughts, there may be better or more elegant solutions than this.
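The file-plus-script idea above can be sketched like this. To keep the sketch standard-library-only, it assumes a plain "name = value" text file instead of full YAML; the file contents, variable names, and template are illustrative, not the real pipeline:

```python
import re
import tempfile
from pathlib import Path

def load_variables(path):
    """Read name = value pairs, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values

def configure_pipeline(template, values):
    """Replace @@name@@ placeholders in the template with loaded values."""
    return re.sub(r"@@(\w+)@@", lambda m: values[m.group(1)], template)

# Demo: write a sample variables file and apply it to a toy template.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("# pipeline variables\n"
            "raw_dataset_uri = thessaloniki-2016-revenue.csv\n"
            "fiscal_year = 2016\n")

variables = load_variables(f.name)
configured = configure_pipeline(
    "dataset: @@raw_dataset_uri@@, year: @@fiscal_year@@", variables)
print(configured)
```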

@jindrichmynarz
Contributor

Since the data model Mustache operates on is basically YAML, one option for solving this may be to implement a component that does standard Mustache rendering: using a template and YAML data to render its output.

However, YAML still requires some tech-savviness, so it may require a UI-based solution.
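The "standard Mustache rendering" proposed above boils down to something like the following toy renderer. Real Mustache also supports sections, partials, and escaping, which this sketch omits; a plain Python dict stands in for the parsed YAML data:

```python
import re

def render(template, data):
    """Substitute {{name}} tags with values from the data mapping."""
    return re.sub(r"\{\{\s*([\w.-]+)\s*\}\}",
                  lambda m: str(data.get(m.group(1), "")), template)

# Values like these could come from the FDP descriptor / a YAML file.
data = {"name": "europe-greece-municipality-thessaloniki-2016-revenue",
        "fiscalYear": 2016}
print(render("Dataset {{name}}, fiscal year {{fiscalYear}}", data))
```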

@skarampatakis
Collaborator Author

Yes, that was my thought as well when I considered YAML. The UI-based solution could be the Packager itself. If you look at a descriptor produced by the Packager, most if not all of the required variables are already there. So we could re-purpose the descriptor as an LP pipeline descriptor. It already has that role for OS platform pipelines, at least to my understanding.

@skarampatakis
Collaborator Author

Please find attached a sample OS descriptor:

{
  "model": {
    "dimensions": {
      "functional-classification": {
        "dimensionType": "classification",
        "primaryKey": [
          "functional_classification_generic_code"
        ],
        "attributes": {
          "functional_classification_generic_code": {
            "source": "Κ.Α.",
            "title": "Κ.Α."
          },
          "functional_classification_generic_label": {
            "source": "Περιγραφή",
            "title": "Περιγραφή",
            "labelfor": "functional_classification_generic_code"
          }
        },
        "classificationType": "functional"
      },
      "date": {
        "dimensionType": "datetime",
        "primaryKey": [
          "date_fiscal_year"
        ],
        "attributes": {
          "date_fiscal_year": {
            "source": "Έτος",
            "title": "Έτος"
          }
        }
      }
    },
    "measures": {
      "value": {
        "source": "Προϋπολογισθέντα",
        "title": "Προϋπολογισθέντα",
        "currency": "EUR",
        "direction": "revenue",
        "phase": "proposed"
      },
      "value_2": {
        "source": "Διαμορφωθέντα",
        "title": "Διαμορφωθέντα",
        "currency": "EUR",
        "direction": "revenue",
        "phase": "adjusted"
      },
      "value_3": {
        "source": "Βεβαιωθέντα",
        "title": "Βεβαιωθέντα",
        "currency": "EUR",
        "direction": "revenue",
        "phase": "approved"
      },
      "value_4": {
        "source": "Εισπραχθέντα",
        "title": "Εισπραχθέντα",
        "currency": "EUR",
        "direction": "revenue",
        "phase": "executed"
      }
    }
  },
  "regionCode": "eu",
  "countryCode": "GR",
  "cityCode": "Thessaloniki",
  "fiscalPeriod": {
    "start": "2016-01-01",
    "end": "2016-12-31"
  },
  "title": "Municipality of Thessaloniki, Greece Revenue Budget fot the fiscal year 2016",
  "name": "europe-greece-municipality-thessaloniki-2016-revenue",
  "description": "Municipality of Thessaloniki, Greece Revenue Budget fot the fiscal year 2016.",
  "resources": [
    {
      "name": "thessaloniki-2016-revenue",
      "format": "csv",
      "path": "thessaloniki-2016-revenue.csv",
      "mediatype": "text/csv",
      "bytes": 44711,
      "dialect": {
        "csvddfVersion": 1,
        "delimiter": ",",
        "lineTerminator": "\n"
      },
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "title": "Κ.Α.",
            "name": "Κ.Α.",
            "slug": "functional_classification_generic_code",
            "type": "string",
            "format": "default",
            "osType": "functional-classification:generic:code",
            "conceptType": "functional-classification"
          },
          {
            "title": "Περιγραφή",
            "name": "Περιγραφή",
            "slug": "functional_classification_generic_label",
            "type": "string",
            "format": "default",
            "osType": "functional-classification:generic:label",
            "conceptType": "functional-classification"
          },
          {
            "title": "Προϋπολογισθέντα",
            "name": "Προϋπολογισθέντα",
            "slug": "value",
            "type": "number",
            "format": "default",
            "osType": "value",
            "conceptType": "value",
            "decimalChar": ",",
            "groupChar": "."
          },
          {
            "title": "Διαμορφωθέντα",
            "name": "Διαμορφωθέντα",
            "slug": "value_2",
            "type": "number",
            "format": "default",
            "osType": "value",
            "conceptType": "value",
            "decimalChar": ",",
            "groupChar": "."
          },
          {
            "title": "Βεβαιωθέντα",
            "name": "Βεβαιωθέντα",
            "slug": "value_3",
            "type": "number",
            "format": "default",
            "osType": "value",
            "conceptType": "value",
            "decimalChar": ",",
            "groupChar": "."
          },
          {
            "title": "Εισπραχθέντα",
            "name": "Εισπραχθέντα",
            "slug": "value_4",
            "type": "number",
            "format": "default",
            "osType": "value",
            "conceptType": "value",
            "decimalChar": ",",
            "groupChar": "."
          },
          {
            "title": "Έτος",
            "name": "Έτος",
            "slug": "date_fiscal_year",
            "type": "integer",
            "format": "default",
            "osType": "date:fiscal-year",
            "conceptType": "date"
          }
        ],
        "primaryKey": [
          "Κ.Α.",
          "Έτος"
        ]
      }
    }
  ],
  "@context": "http://schemas.frictionlessdata.io/fiscal-data-package.jsonld",
  "owner": "mple",
  "author": "Sotiris Karampatakis <[email protected]>",
  "count_of_rows": 269
}

datapackage.json.txt

@jindrichmynarz
Contributor

If you can map the variables required by your pipeline to something like JSON Path in the FDP descriptor, then it may be used in a Mustache template.

@skarampatakis
Collaborator Author

You mean something like this?

raw_dataset_uri  = $.resources[*].path 
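A mapping like this can be resolved against the descriptor without a full JSON Path library; a small lookup over dotted paths is enough for the paths discussed here. In this sketch the "$." prefix is dropped and "[*]" is written as "*"; the variable names are illustrative:

```python
def lookup(doc, path):
    """Resolve a dotted path like 'resources.*.path' against nested
    dicts/lists; '*' takes every element of a list. Returns a list of
    matches."""
    current = [doc]
    for part in path.split("."):
        next_level = []
        for node in current:
            if part == "*":
                next_level.extend(node)
            else:
                next_level.append(node[part])
        current = next_level
    return current

# Trimmed-down stand-in for the FDP descriptor attached above.
descriptor = {"resources": [{"path": "thessaloniki-2016-revenue.csv"}],
              "fiscalPeriod": {"start": "2016-01-01"}}
variables = {"raw_dataset_uri": lookup(descriptor, "resources.*.path")[0],
             "fiscal_year_start": lookup(descriptor, "fiscalPeriod.start")[0]}
print(variables)
```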

@jindrichmynarz
Contributor

Yes. Just to get an idea of how many of the variables your pipeline requires can be served from the FDP descriptor.

@skarampatakis
Collaborator Author

Let's suppose that we can map all of them. What would be the next step?

@jindrichmynarz
Contributor

The next step could be testing if the pipeline can be generated using standard Mustache.

@skodapetr

skodapetr commented May 31, 2017

@skarampatakis We got some new functionality (runtime configuration for more components, x-httpRequest) in the develop branch. With that functionality I created a prototype:

It consists of two pipelines: Instance and Metapipeline. The Metapipeline retrieves the Instance, uses t-mustache for placeholder substitution and executes the pipeline.

What is missing:

  1. Input from e-pipelineInput, so you can start the execution with a custom configuration (variables) via an HTTP POST request. I think it is not necessary to have this for demonstration purposes.
  2. Deleting previous executions. Here I need to know: should we delete all previous executions of the pipeline, or should we keep the executions that failed?

It would be great if you could take a look and give some feedback, especially on whether this solution would be suitable for your use case.

@jindrichmynarz
Contributor

By the way, @skodapetr, I get load timeout for the demo instance of LP-ETL:

[screenshot: screen shot 2017-05-31 at 16 18 24]

@jakubklimek

@jindrichmynarz We were updating the instance, just try again.

@jindrichmynarz
Contributor

The problem persists after refreshes and in other browsers too. While the 1.2 MB angular-material.js loads, the screen remains blank.

[screenshot: screen shot 2017-06-01 at 12 40 31]

@jakubklimek

jakubklimek commented Jun 1, 2017 via email
