[[_TOC_]]

# SihlMill and SihlQL

SihlMill is a query engine for SihlQL. SihlQL extends the SPARQL query syntax with various primitives for reading from data streams and for processing them in a differentially private manner. Given a SihlQL query, SihlMill builds a Flink topology that can be executed to process the specified data streams.

In the following, we explain how to install and run the execution engine. You can try it by: (1) importing the project into your IDE (we use IntelliJ as the reference IDE, but we have also tested and run it in Eclipse) or (2) executing it from the command line (using Maven and bash). We assume that you have Java (8+) and Maven installed.

Please note that this is a research prototype (and by no means a commercially hardened system), which should not be used for publishing sensitive information anywhere. It does enable researchers to explore the nature of privacy-preserving RDF stream publication.

## How the project is structured

We briefly summarise the design of SihlMill:  

1. Query compilation: the compiler gets a query as input, then parses and optimises it using Apache Jena. Next, the compiler rewrites the query into a Flink job. The result is the source code of a class named Topology.java in the package *sihlmill.generated*.
2. Job compilation: Topology.java now needs to be compiled. This separate step is necessary because it is not possible to generate **and** execute the topology in the same process: Java raises problems with the dynamically generated types (e.g. Tuple and its derived classes in Flink).
3. The compiled job can be executed.

Going into a bit more technical detail:

- Query compilation and job generation: pom.xml includes a plugin named _exec-maven-plugin_, which runs a Java main class in the Maven compile phase.

  The compiler is in the class *Compiler*: its main method receives a SihlQL query file as one of the arguments. As explained above, *Compiler* generates the job *sihlmill.generated.Topology*, which represents the query as a Flink topology.
- Packaging: Maven packages the newly generated class *sihlmill.generated.Topology* into jar files: one that can be run as a standalone process and one that can be submitted to a Flink cluster (a minimal end-to-end sketch follows below).

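To make the pipeline concrete, here is a minimal end-to-end sketch, assuming the Maven setup and the example queries described in the sections below (the query path is just a placeholder):

```sh
# 1. Query compilation, job generation and packaging happen in a single Maven run;
#    the compiler turns the query into sihlmill.generated.Topology and packages it.
mvn clean compile package -Dquery.path=queries/query-template

# 2. The packaged jars end up in target/.
ls target/*.jar

# 3. Job execution: run the standalone jar locally.
java -jar target/sihlmill-0.6-standalone.jar
```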
## The query language SihlQL

This is a sample SihlQL query:
```
PREFIX : <http://example.org/>
PREFIX s: <https://schema.org/>
ENABLE PRIVACY EPSILON 0.1 W 10
SELECT ?product (COUNT(?user) AS ?h)
FROM FILE STREAM <file:///tmp/stream.jsonld>
TO FILE <file:///tmp/output.out>
WHERE{
    ?review s:itemReviewed ?product ;
      s:author ?user .
}
GROUP BY ?product
```

One of the most important extensions of SihlQL is the `FROM` clause in its `FROM STREAM` and `FROM STATIC` variations. Using `FROM STREAM`, the source of the input stream can be defined to be a JSON-LD file (using `FROM FILE STREAM <file://[PATH_TO_FILE]>`), a Kafka topic (`FROM KAFKA STREAM <kafka://[BROKER_IP]> TOPIC <[TOPIC_NAME]>`) or an MQTT topic (`FROM MQTT STREAM <mqtt://[BROKER_IP]> TOPIC <[TOPIC_NAME]>`).

`FROM STATIC` allows for easy enrichment of the graphs with static RDF triples (e.g., `FROM STATIC <file:///[PATH_TO_RDF_FILE]>`).

In addition to reading from Kafka, SihlQL also adds a primitive for producing its results to a Kafka or MQTT topic. Using `TO KAFKA <kafka://[BROKER_IP]> TOPIC <[TOPIC_NAME]>` or `TO MQTT <mqtt://[BROKER_IP]> TOPIC <[TOPIC_NAME]>`, the resulting Flink job will produce all of its results to the corresponding Kafka or MQTT topic, as in the sketch below.

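As an illustration, the following variant of the sample query above reads from a Kafka topic and writes its results back to Kafka. The broker address and topic names are placeholders, and the query is only a sketch of the clauses described in this section:

```
PREFIX : <http://example.org/>
PREFIX s: <https://schema.org/>
ENABLE PRIVACY EPSILON 0.1 W 10
SELECT ?product (COUNT(?user) AS ?h)
FROM KAFKA STREAM <kafka://localhost:9092> TOPIC <reviews>
TO KAFKA <kafka://localhost:9092> TOPIC <review-counts>
WHERE{
    ?review s:itemReviewed ?product ;
      s:author ?user .
}
GROUP BY ?product
```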
# How to run SihlMill
## Setting up

- Clone the project.

- Open the root directory *sihlmill*, where you will find the code and the data.

- To initialise some example queries and unzip the data, run:

  ```sh
  ./setup.sh
  ```

  You will find a new folder *queries*, and the *data* folder will contain decompressed streams ready to be used to test SihlMill (a quick check is sketched below).

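  To quickly check that the setup worked, you can list the two folders (this only inspects what *setup.sh* produced):

  ```sh
  ls queries   # example SihlQL queries
  ls data      # decompressed example streams
  ```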
## Packaging the project from the command line with Maven

* Go into the root dir ${project_root_path}, i.e., *sihlmill*, and run the Maven command:

  ```sh
  mvn clean compile package -Dquery.path=${YOUR_QUERY_FILE_PATH}
  ```

  where YOUR_QUERY_FILE_PATH is the path to the query. We provide some example queries in ${project_root_path}/queries. When no query is provided, i.e., if you run:

  ```sh
  mvn clean compile package
  ```

  the compiler will use the default query located at *${project_root_path}/queries/query-template*.

* After the compilation, in the target directory, i.e., *${project_root_path}/target*, there will be three jars:
  * **sihlmill-0.6-jar-with-dependencies.jar**
    This jar contains the job with its dependencies. It can be executed on a Flink cluster.
  * **sihlmill-0.6-standalone.jar**
    This jar contains the job, also with dependencies. It is intended to be executed from the command line.
  * **original-sihlmill-0.6.jar**
    This jar contains the job without dependencies. It is intended to be executed on a Flink cluster that already has the dependencies loaded.

## Executing the project
### From Command Line

- Execute via the command line:

```sh
cd ${project_root_path}/
java -jar target/sihlmill-0.6-standalone.jar
```

### From a Flink Cluster

- Upload **sihlmill-0.6-jar-with-dependencies.jar** on the submission page __*http://JOBMANAGER_HOST:8081/#/submit*__ of the Flink cluster's web UI. As entry class, set *sihlmill.generated.Topology*, then input the parallelism, and click 'Submit' (a command-line alternative is sketched below).
- Look at the terminal and wait for the result.

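If you prefer submitting the job from the command line instead of the web UI, the standard Flink CLI should work as well. This is only a sketch, assuming Flink's *bin* directory is on your PATH and the cluster uses the default configuration:

```sh
# Submit the packaged job to the running Flink cluster.
# -c sets the entry class, -p the parallelism.
flink run -c sihlmill.generated.Topology -p 1 target/sihlmill-0.6-jar-with-dependencies.jar
```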
### From an IDE (IntelliJ)

In this section we describe how to run the project in IntelliJ.
#### Import the project using IntelliJ IDEA

- Open the project: at the top-left corner of the IntelliJ IDEA window, click **File** -> **Open** -> choose the root directory *sihlmill*.

At this point, the project will be opened. You should see it in IntelliJ IDEA like this:

![](src/main/resources/viewaftersetup.png)

* If you forgot to install the query parser package (see the _Setting up_ section), install it and then refresh the project, as shown in the figure:

![](src/main/resources/reimport.png)

* Your project is now ready to run!

#### Run the project in IntelliJ

- Write your own query in a file or choose one among the available ones (in the *queries* folder)
- Configure the main entry of FlinkStream. Go to the **Run** menu -> select **Edit Configurations**, then **Template**, and click **Application**:
  - Fill in the **Main class** (as in the figure)
  - If you want to provide a query, fill in its path (otherwise the default one, *queries/query-template*, will be used)

In the end, the configuration page should look like this:

![](src/main/resources/run-ide.png)
- Open the class *sihlmill.compiler.FlinkStream*, open the right-click menu and press **Run 'FlinkStream.main()'**
- The job is generated and stored in *sihlmill.generated.Topology*

It's time to run the Flink topology!

- Open the class *sihlmill.generated.Topology*, open the right-click menu, and press **Run 'Topology.main()'**
- Look at the terminal and wait for the result. For this double-blind submission, we provide two sample streams and a data item (if you move them, you should adjust the paths in the query files). At the beginning, the job loads the stream file into memory, which can take a few seconds (usually less than one minute).

# More on SihlMill and SihlQL

## Licence
The code of SihlMill is available under the Apache 2.0 License.

## Publications
* D. Dell'Aglio and A. Bernstein: Differentially private stream processing for the semantic web. The Web Conference 2020 (TheWebConf 2020), ACM, pp. 1977-1987. Taipei, Taiwan, April 2020. 
* R. Pernischová, F. Ruosch, D. Dell'Aglio and A. Bernstein: Stream Processing: The Matrix Revolutions. 12th International Workshop on Scalable Semantic Web Knowledge Base Systems, co-located with the 17th International Semantic Web Conference, CEUR-WS.org, pp. 15-27. Monterey, USA, October 2018. 

## Contributors
* Daniele Dell'Aglio
* Romana Pernischova
* Florian Ruosch
* Roland Schaefli
* Pengcheng Duan
* Alessandro Margara

## Acknowledgements
SihlQL is partially supported by the [Swiss National Science Foundation](http://www.snf.ch/) under contract number #407550_167177 ([NFP 75's](http://www.nfp75.ch/) project [Privacy-preserving, stream analytics for non-computer scientists](http://www.nfp75.ch/en/projects/module-1-information-technology/project-boehlen)).