Below are the links to the BPMN diagram that shows the entire process in which we create a mock report, we apply fabricated rules, we fit the model, we calibrate it and we verify the inferred rules.
![]()
→ full BPMN diagram edited with ARIS Express
The implementation of the process mock data generator of the data-flow-diagram is based on the python Faker library and available here:
https://github.com/a-moscatelli/home/blob/main/am-wiki-assets/bayesdatamining/faker-datamining.py
This is an excerpt of the CSV file generated by the process above.
The independent variables are the columns from 'trade_date' to 'delim'.
The dependent variables are the columns from 'delim' to '* PnL' (realized PnL, unrealized PnL, total PnL = R + U).
id,trade_date, exp_date, trader,tr_ccy,product,disc_curve,fx_cut,market,PnL_ccy,delim,uPnL,rPnL,PnL
0,2023-05-04,2023-06-09, Shannon Byrd,USDALL,fx swap,USDALL-GE,TKY1500,GE,USD,:,-7408923.84,-1466436.87,-8875360.71
1,2023-05-14,2023-06-25, Ashley Wilson,USDBWP,fx spot,USDBWP-GM,TKY1500,GM,USD,:,-5226084.9,-8085982.9,-13312067.8
2,2023-04-29,2023-06-02, Jasmine Baker,USDFJD,fx spot,USDFJD-AG,NYC1000,AG,USD,:,6394191.51,819916.38,7214107.89
3,2023-05-14,2023-06-16, Ashley Gilmore,USDAZN,fx futures,USDAZN-BW,TKY1500,BW,USD,:,7253124.02,-1717815.64,5535308.38
4,2023-04-26,2023-05-13,Roberta Wade DVM,USDTMT,fx spot,USDTMT-WS,TKY1500,WS,USD,:,5089417.84,-8436149.32,-3346731.48
5,2023-05-13,2023-05-14, Jennifer Jones,USDZMW,fx fwd,USDZMW-GN,NYC1000,GN,USD,:,-9869409.75,-3996865.59,-13866275.34
6,2023-05-11,2023-06-29, Rebecca Flores,USDMNT,fx spot,USDMNT-GD,NYC1000,GD,USD,:,-996809.37,-8099321.43,-9096130.8
7,2023-04-25,2023-06-02, Stephen Murray,USDPAB,fx swap,USDPAB-OM,NYC1000,OM,USD,:,-3281707.89,797070.23,-2484637.66
8,2023-05-02,2023-05-08, John Franklin,USDLSL,fx futures,USDLSL-TH,TKY1500,TH,USD,:,6089738.13,2395105.79,8484843.92
9,2023-04-28,2023-05-05,Jennifer Roberts,USDSRD,fx futures,USDSRD-SK,ECB1415,SK,USD,:,-5658173.72,-7390356.96,-13048530.68
With the processes rule engine and signal derivation of the data-flow-diagram, we are going to generate the CANDIDATE report through the application of a few rules in order to get a set of fabricated differences. Then, we are going to calculate a few derived variables.
This step is done to transform our discrete variables (like the time_to_expire), into categorical ones (like expired_today true/false, expired_yesterday true/false), that are accepted by our data mining algorithm.
Hi-res picture
After running the workflow above (see KNIME 4.6) we have a CANDIDATE report and the derived signals.
An example of signal expressions is as follows:
signal_1m = (expiry_date == report date - 1)
signal_1e = (expiry_date == report date)
signal_1p = (expiry_date == report date + 1)
signal_2p = (trade_date == report date + 1)
...
Providing these extra variables to the rule miner
is a way of verifying further hypotheses we have in mind:
is there any systematic issue with
the price computed for a class of trades expiring tomorrow?
In the process data miner we are running the FPGrowth provided with pySpark.
# docker
services:
pyspark:
image: "jupyter/pyspark-notebook:notebook-6.5.4"
FPGrowth is an enhancement of Apriori which is an enhancement of the Naive Bayes. It is an unsupervised machine learning implementation.
An example of Apriori implementation is available at:
https://github.com/a-moscatelli/DEV/tree/main/association_mining
The FPGrowth documentation supported by pySpark is available here
The FPGrowth requires the specification of a mininum support and confidence.
The minimum support we provide should be lower than the occurrence of the fabricated differences.
We now execute the process p3a of the data-flow-diagram.
Inferred rules for realized PnL: Jupyter notebook
| SN | antecedent | consequent | confidence | lift | support | lenA | lenC |
|---|---|---|---|---|---|---|---|
| 1 | [PnL_ccy=AUD, product=fx fwd] | [rPnL_diff=1] | 1 | 661.38 | 0 | 2 | 1 |
| 2 | [trade_ccy=AUDUSD, product=fx fwd] | [rPnL_diff=1] | 1 | 661.38 | 0 | 2 | 1 |
| 3 | [PnL_ccy=AUD, product=fx fwd, exp_date_m1=0] | [rPnL_diff=1] | 1 | 661.38 | 0 | 3 | 1 |
| 4 | [trade_ccy=AUDUSD, product=fx fwd, exp_date_m1=0] | [rPnL_diff=1] | 1 | 661.38 | 0 | 3 | 1 |
| 5 | [PnL_ccy=AUD, trade_ccy=AUDUSD, product=fx fwd] | [rPnL_diff=1] | 1 | 661.38 | 0 | 3 | 1 |
| 6 | [PnL_ccy=AUD, trade_ccy=AUDUSD, product=fx fwd, exp_date_m1=0] | [rPnL_diff=1] | 1 | 661.38 | 0 | 4 | 1 |
The implied rule 2 is exactly the rule that was applied to introduce fabricated differences.
The implied rule 1 is a byproduct of rule 2: in our input,
trade_ccy=AUDUSD ⇒ PnL_ccy=AUD
Inferred rules for unrealized PnL: Jupyter notebook
| SN | antecedent | consequent | confidence | lift | support | lenA | lenC |
|---|---|---|---|---|---|---|---|
| 1 | [exp_date_m1=1, fx_cut=TKY1500] | [uPnL_diff=1] | 1 | 239.23 | 0 | 2 | 1 |
| 2 | [exp_date_m1=1, fx_cut=TKY1500, PnL_ccy=USD] | [uPnL_diff=1] | 1 | 239.23 | 0 | 3 | 1 |
The implied rule 1 is exactly the rule that was applied to introduce fabricated differences.
The implied rule 2 is less general than rule 1 - but can still give clues about the root causes of the diffs.
The rule mining script above,
executed on AWS Glue 3.0 with the default settings,
could process our half-million-record file pair
in 1 min 55 s
at a cost of USD 0.15
(Apr-2023)
aws glue start-job-run --job-name JOBNAME --region REGION --number-of-workers 10 --worker-type G.1X
:: default settings: 10 DPUs, G.1X
:: execution: 0.32 DPU hours
Full program: https://ampub.s3.eu-central-1.amazonaws.com/am-wiki-js/rulemining/FPG-FPmining-adapted-for-glue.py
Back to Portfolio