The CounQER system provides a pipeline for identifying set predicates in a KB. We use linguistic and co-occurrence alignment metrics to analyse the relationships between the predicates. The results of these alignments can be explored on the project demo page at https://counqer.mpi-inf.mpg.de. The project uses pgAdmin to access its backend PostgreSQL database.
Requirements
The project runs in a Python 3 virtual environment; `requirements.txt` lists the necessary packages.

```bash
python3 -m venv /path/to/myenv
source /path/to/myenv/bin/activate
```

Once inside the environment, change to `counqer/` and install the required packages:

```bash
pip install -r requirements.txt
```
Data setup
Location: ./datasetup
Create a local n-tuple DB from RDF dumps of the KBs.

- `create_<KB-name>_DB.py`
  a. calls `createDB` if the table is to be hosted on a postgres server;
  b. calls `createcsv` to create a csv file which can be imported into any database management system (e.g. PostgreSQL, Hive) as a table.
- Query the SPO tables for a list of distinct predicates and their frequencies. Save the results as csv (`predfreq_p_all.csv`) in the corresponding DB subfolder.
- `generate_property_details_.py` uses `property_details_from_postgres` to create a table and a csv file with the table values. These values can then be copied to the table using psql commands:

```bash
psql -h postgres2.d5.mpi-inf.mpg.de -d <database_name> -U <username>
<database_name>=> \copy fb_pred_property FROM '<KB-name>_pred_property.csv' DELIMITER E'\t' CSV HEADER;
<database_name>=> \q
```
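For illustration, a minimal sketch of the `createcsv` step, assuming the dump is in (possibly gzipped) N-Triples format; the file names are placeholders and the repo's implementation may differ:

```python
# Stream an N-Triples dump and write (subject, predicate, object) rows
# to a TSV file that can be bulk-imported as a table.
import csv
import gzip
import re

TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

with gzip.open('dump.nt.gz', 'rt', encoding='utf-8') as src, \
        open('kb_spo.tsv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst, delimiter='\t')
    for line in src:
        if line.startswith('#'):      # skip comment lines
            continue
        match = TRIPLE.match(line)
        if match:
            writer.writerow(match.groups())
```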
Crowd task for type identification
Location: ./classifier_crowd_annotations
Sample predicates from candidate KBs to present to the crowd annotators.

- `sql_query_for_set_predicates` has the sql query used to sample data items for counting predicates in the first query.
  a. We filter out infrequent (<50 occurrences) and non-integer (<5% integer values and >5% float values) predicates; a sketch of this filter appears below.
  b. The samples are saved in the `./counting` folder as csv files under the names of the corresponding KBs.
  c. Create an entity lookup list for Freebase using `sql_fb_entity_label`.
  d. `get_labelled_triples.py` reads all sampled predicates from `./counting` and creates a data file with labelled triples, `./counting/counting_labelled_triples.csv`.
  e. `clean_labelled_triples.R` unifies triples from multiple sources to create a csv file ready for upload to the crowd-sourcing platform.
  NOTE: Since Freebase returns empty subject labels, we create a larger sample (of 200 predicates) and select 100 samples with 5 complete example triples.
- `sql_query_for_set_predicates` has the sql query used to sample data items for enumerating predicates in the second query.
  a. Sampled data from each KB is saved in the `./enumerating` folder.

Note: We create a test set containing honey-pot questions for the Figure Eight task (in the `./test` folder). First we run `get_labelled_triples.py` on the selected test predicates and then manually edit the `test_rows_figure_eight.csv` file to add the annotation columns (`_golden`, `<question>_gold`, `<question>_gold_reason`).
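A minimal sketch of the counting-predicate filter described in step (a), assuming a per-predicate csv with frequency and value-type shares; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical columns: pred, frequency, int_share, float_share.
preds = pd.read_csv('predfreq_p_all.csv')

# Keep frequent predicates whose values look like integer counts:
# >= 50 occurrences, >= 5% integer values, and <= 5% float values.
candidates = preds[(preds['frequency'] >= 50)
                   & (preds['int_share'] >= 0.05)
                   & (preds['float_share'] <= 0.05)]
candidates.to_csv('counting_candidates.csv', index=False)
```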
Predicate usage feature collection
Location: ./predicate_usage_features
- Download the POS tagger data for nltk:

```
$ python
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.pos_tag(nltk.word_tokenize('This is a sentence'))
```

- Run `get_estimated_matches.py` to get the predicate usage features from the Bing API for all frequent (>= 50) predicates (see the sketch after this list). Data stored in
- Run `get_sub_obj_types.py`.
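A minimal sketch of what an estimated-match lookup can look like. The endpoint, header, and `totalEstimatedMatches` field follow the documented Bing Web Search v7 API, but the exact queries issued by `get_estimated_matches.py` may differ; the key is a placeholder:

```python
import requests

BING_ENDPOINT = 'https://api.bing.microsoft.com/v7.0/search'
BING_KEY = '<your-subscription-key>'  # placeholder

def estimated_matches(phrase):
    """Return Bing's estimate of the number of pages matching the phrase."""
    resp = requests.get(
        BING_ENDPOINT,
        headers={'Ocp-Apim-Subscription-Key': BING_KEY},
        params={'q': f'"{phrase}"'},  # quote the phrase for exact match
    )
    resp.raise_for_status()
    return resp.json().get('webPages', {}).get('totalEstimatedMatches', 0)

print(estimated_matches('number of employees'))
```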
Classifier dataset creation
`./pred_property_p_50` has the predicate property files of all KBs with predicate frequency >= 50. Next, we collect data from different sources to create a unified feature file of all predicates (`predicates_p_50.csv`) and the labelled predicates (`labelled_data_counting.csv`, `labelled_data_enumerating.csv`) in the folder `./feature_file` using the script `./create_feature_file.R`.
| KB | All | Frequent |
|---|---|---|
| DBP-raw | 59,149 | 13,394 |
| inv | 14,085 | 3,241 |
| DBP-map | 1,355 | 1,127 |
| inv | 653 | 543 |
| WD-truthy | 5,032 | 3,346 |
| inv | 1,079 | 721 |
| Freebase | 784,936 | 8,289 |
| inv | 14,871 | 5,583 |
| YAGO | (79) | (79) |
Classifier training
Location: ./classifier
We have two classifiers, one for counting and one for enumerating, in `.../<type>/<type>_classifier.R`.

Classifier models used:

- Logistic regression
- Bayesian glm
- Lasso regression
- Neural network with a single hidden layer

The predictions are saved in `.../<type>/predictions.csv`.
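The classifiers themselves are implemented in R; purely as an illustration of the training step, a minimal Python sketch over the unified feature file, with hypothetical column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical layout: one row per predicate, numeric feature columns,
# and a binary 'label' column from the crowd annotations.
data = pd.read_csv('feature_file/labelled_data_counting.csv')
X = data.drop(columns=['pred', 'label'])
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Balanced class weights counter the label imbalance (39 of 345 positive).
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)

pd.DataFrame({'pred': data.loc[X_test.index, 'pred'],
              'prediction': model.predict(X_test)}).to_csv(
    'predictions.csv', index=False)
```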
Random classifier performance:

- Counting: 345 data points, 39 positive, 306 negative

| Actual \ Predicted | 0 | 1 | Total |
|---|---|---|---|
| 0 | 272 | 34 | 306 |
| 1 | 34 | 5 | 39 |
| Total | 306 | 39 | 345 |

Precision = Recall = F1 = 5/39 = 12.8% (the predicted and actual positive counts coincide, so the three scores are equal).

- Enumerating: 328 data points, 133 positive, 195 negative

| Actual \ Predicted | 0 | 1 | Total |
|---|---|---|---|
| 0 | 116 | 79 | 195 |
| 1 | 79 | 54 | 133 |
| Total | 195 | 133 | 328 |

Precision = Recall = F1 = 54/133 = 40.6%
Precision and recall scores of all models:

a. Counting

| Model | Recall | Precision | F1 |
|---|---|---|---|
| Random | 12.8 | 12.8 | 12.8 |
| Logistic | 51.2 | 19.0 | 27.7 |
| Bayesian | 48.7 | 20.2 | 28.5 |
| Lasso | 71.7 | 23.3 | 35.1 |
| Neural | 35.8 | 20.8 | 26.3 |

b. Enumerating

| Model | Recall | Precision | F1 |
|---|---|---|---|
| Random | 40.6 | 40.6 | 40.6 |
| Logistic | 55.6 | 51.7 | 53.5 |
| Bayesian | 55.6 | 51.0 | 53.5 |
| Lasso | 51.1 | 59.6 | 55.0 |
| Neural | 53.0 | 49.6 | 51.2 |
Predicted counting predicates

| KB | Input | Output | Filtered |
|---|---|---|---|
| DBP-raw | 13,394 | 5,853 | 5,853 |
| DBP-map | 1,127 | 898 | 898 |
| WD-truthy | 3,346 | 1,922 | 1,067 |
| Freebase | 8,289 | 1,723 | 1,687 |
Predicted enumerating predicates

| KB | Input | Output | Filtered |
|---|---|---|---|
| DBP-raw | 16,635 | 2,894 + 1,196 = 4,090 | 2,894 + 1,196 = 4,090 |
| DBP-map | 1,670 | 173 + 135 = 308 | 173 + 135 = 308 |
| WD-truthy | 4,067 | 99 + 117 = 216 | 86 + 117 = 203 |
| Freebase | 13,872 | 6,311 + 1,441 = 7,752 | 6,177 + 1,437 = 7,614 |
Alignment metrics computation
Location: ./alignment
- Create a csv file with entity names across different platforms.
  a. DBpedia entity prefix: http://dbpedia.org/resource/
  b. Wikidata entity prefix: http://www.wikidata.org/entity/
  `shorten_entity_names.py` removes the url prefix which identifies the KB; `get_sameAs_dbpedia.py` gets, for all unique entities collected from a KB and shortened, the corresponding entity identities in the other KBs (namely, Wikidata and Freebase).
- Get the number of entities per subject per predicate from the KB using psql.
  a. Enumerating

```sql
\copy (Select sub, pred, count(*) from <kb-name> where obj_type='named_entity' group by pred, sub order by pred) to 'filepath/named_entities_per_pred_per_sub_<kb>.csv' with CSV;
```

Since Freebase has 700k predicates, modify the above query to keep only the most frequently occurring predicates:

```sql
\copy (Select sub, pred, count(*) from freebase_spot where pred in (<list from file fb_pred_names_p_50>) and obj_type='named_entity' group by pred, sub order by pred) to 'filepath/named_entities_per_pred_per_sub_<kb>.csv' with CSV;
```

Stored in the DB server as a table named `<kb-name>_sub_pred_necount`.

  b. Counting

```sql
\copy (Select sub, pred, obj from freebase_spot where pred in (<list from file fb_pred_names_p_50>) and obj_type='int' order by pred, sub) to '/GW/D5data-11/existential-extraction/count_information/integer_per_pred_per_sub_fb.csv' with CSV;
```

Stored in the DB server as a table named `<kb-name>_sub_pred_intval`.

Note: Create indexes on the predicate column.

- Create a view of the triples in each KB having p_50 predicates.

```sql
create view <kb_name>_p_50 as
select * from <kb-name>_spot where pred in (<list from file kb_pred_names_p_50>);
```

- Get co-occurrence statistics on the generated view. Store the co-occurring pairs (predE, predC, #co-occurring subjects) in `./cooccurrence/<kb-name>_predicate_pairs.csv`.

```sql
select t1.pred as predE, t2.pred as predC, count(distinct t1.sub)
from (select * from <kb_name>_p_50 where obj_type='named_entity') as t1
inner join (select * from <kb_name>_p_50 where obj_type='int') as t2
  on t1.sub = t2.sub
group by t1.pred, t2.pred;
```

Note: This is not time-efficient. Use instead:

```sql
select t1.pred as predE, t2.pred as predC, count(*)
from <kb-name>_sub_pred_necount as t1
inner join <kb-name>_sub_pred_intval as t2
  on t1.sub = t2.sub
group by t1.pred, t2.pred;
```
- Get the predicate marginals (#subjects per predicate) in files labelled `./marginals/<kb-name>_int.csv` for the counting predicate marginals and `./marginals/<kb-name>_ne.csv` for the enumerating predicate marginals.

```sql
select pred, count(*) from <tablename> group by pred;
```

where `<tablename>` is one of `<kb-name>_sub_pred_intval`, `<kb-name>_sub_pred_necount`, `<kb-name>_obj_pred_necount`.
- Run `get_cooccurrence_scores.py` to get the alignment metrics (a hedged sketch of one such metric follows this list).
- Run `get_linguistic_sim.py` to generate the linguistic alignment. Note: Compute the linguistic similarity scores online, since reading the existing files is time-consuming.
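For intuition, a minimal sketch of one plausible co-occurrence alignment score, pointwise mutual information over the pair counts and marginals produced above; the metrics actually implemented in `get_cooccurrence_scores.py` may differ, and the csv files are assumed headerless as written by `\copy ... with CSV`:

```python
import math

import pandas as pd

# Pair counts (predE, predC, #co-occurring subjects) and per-predicate
# subject marginals, as produced by the psql exports above.
pairs = pd.read_csv('cooccurrence/kb_predicate_pairs.csv',
                    names=['predE', 'predC', 'cooc'])
ne = pd.read_csv('marginals/kb_ne.csv', names=['pred', 'count'])
intm = pd.read_csv('marginals/kb_int.csv', names=['pred', 'count'])

n = ne['count'].sum()  # rough normalizer: total subject-predicate pairs
ne_counts = dict(zip(ne['pred'], ne['count']))
int_counts = dict(zip(intm['pred'], intm['count']))

def pmi(row):
    """PMI of an (enumerating, counting) predicate pair."""
    p_joint = row['cooc'] / n
    p_e = ne_counts[row['predE']] / n
    p_c = int_counts[row['predC']] / n
    return math.log(p_joint / (p_e * p_c))

pairs['pmi'] = pairs.apply(pmi, axis=1)
```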
Inverse Predicates
- Get the inverse predicates from the postgres server

```sql
select pred_inv from <kb-name>_inv_pred_property where frequency >= 50;
```

  into a list in `inp_50_prednames/`.
- Get the number of entities per object per inverse predicate from the KB using psql.

```sql
\copy (Select obj, pred, count(*) from <kb-name>_spot where pred in (<list from file kb-name_pred_names_p_50>) and obj_type='named_entity' group by pred, obj order by pred) to 'filepath/named_entities_per_pred_per_sub_<kb>.csv' with CSV;
```

- Get the co-occurrence stats for the inverse predicates.
- Label the inverse predicates as enumerating using the enumerating classifier.
Post-processing
Location: ./alignment
1. Predicate Filtering
`filter_prednames.py` removes codes and IDs from the predicted predicates (see the sketch after the table below). The number of predicates (id and code names) filtered before and after classification:

| Type | Pre-class | Post-class | # removed by classifier |
|---|---|---|---|
| Enumerating | 2,158 (26,156) | 147 (9,477) | 2,011 (93.1%) |
| Enum_inv | 9 (10,091) | 4 (2,890) | 5 (55.5%) |
| Counting | 2,158 (26,156) | 881 (10,396) | 1,277 (59.1%) |

Note: the number in brackets denotes the number of predicates input to the filter.
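A minimal sketch of what such an id/code filter can look like; the patterns in `filter_prednames.py` itself are likely different, and this heuristic admits false positives:

```python
import re

# Drop predicate names that end in 'id'/'code' or whose local name is
# mostly digits, since such predicates rarely denote set predicates.
ID_LIKE = re.compile(r'(id|code)s?$', re.IGNORECASE)

def looks_like_code(pred_name):
    local = pred_name.rstrip('/').rsplit('/', 1)[-1]
    digits = sum(ch.isdigit() for ch in local)
    return bool(ID_LIKE.search(local)) or digits > len(local) / 2

preds = ['http://dbpedia.org/property/areaCode',
         'http://dbpedia.org/property/numberOfEmployees']
print([p for p in preds if not looks_like_code(p)])  # keeps the second
```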
2. Metrics aggregation
- Get the (filtered) predicate lists from `get_predicate_list.R`.
- Keep only the required metrics (predicate pairs which are in the predicted lists) in the `./metrics_req` folder by running `metrics_assembly.R`.
Number of alignments obtained = 4,265

| KB name | Direct | Inverse |
|---|---|---|
| DBP-map | 138 | 126 |
| DBP-raw | 1,947 | 1,756 |
| WD | 22 | 2 |
| FB | 120 | 154 |
| Total | 2,227 | 2,038 |
Crowd Evaluation of Alignment
Location: ./alignment_crowd_annotations
- `clean_fig8_test_ques.R` re-uses the Figure Eight evaluation questions.
- `test_questions/edit_fig8_for_mturk.py` creates the test csv for MTurk.
- `clean_mturk_resp.R` checks the responses to the test questions.
- `select_random_prop_for_eval.R` creates a list of 300 counting and 300 enumerating predicates (preserving the ratio of inverse vs. direct predicates) for crowd evaluation.
- `eval_questions/create_eval_top3_pairs.py` gets the list of top predicates from the different metrics.
  - #datapoints for enumerating = 460
  - #datapoints for counting = 371
- `eval_questions/create_datafile.py` creates the csv with labelled triples for MTurk.
  - #datapoints for enumerating = 169, which implies that 291 pairs do not co-occur.
  - #datapoints for counting = 72, which implies that 299 pairs do not co-occur.
- Launch the MTurk task with the csv files in `eval_questions/data/` and run `eval_questions/notify_successful_workers.py` to notify selected workers to take the task.
- Download the MTurk results to `eval_annotations/` and run `eval_annotations/clean_mturk_repsonse.R` to get absolute scores for all pairs (a small worked example follows this list):

  0.5 * (1/3) * (#complete * 1 + #incomplete * 0.5 + #unrelated * 0)

  Note: 0.5 * (1/j) is the weight of each of the topicality and enumeration scores averaged over the j = 3 judges; #x * w is the number of votes label x received from the 3 judges times the weight w of x.
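For concreteness, a tiny worked example of the scoring formula, assuming three judges and the vote labels above:

```python
# Two 'complete' votes and one 'incomplete' vote from 3 judges give
# 0.5 * (1/3) * (2*1 + 1*0.5 + 0*0) ~= 0.417.
WEIGHTS = {'complete': 1.0, 'incomplete': 0.5, 'unrelated': 0.0}

def pair_score(votes, judges=3):
    """Absolute score of one pair from the judges' votes."""
    return 0.5 * (1 / judges) * sum(WEIGHTS[v] for v in votes)

print(pair_score(['complete', 'complete', 'incomplete']))  # 0.4166...
```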
Evaluation
Location: ./evaluation
- `evaluate.py` generates the DCG scores for all metrics (a minimal DCG/NDCG sketch follows).
- `aggregated_ndcg.R` gets the mean NDCG of all metrics.
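As a reference for the metric itself (not the repo's exact implementation), a minimal DCG/NDCG computation over a ranked list of graded relevance scores:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([0.42, 0.17, 0.5]))  # quality of one metric's top-3 ranking
```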
Demo
The demo is developed in Python using the Flask web framework and runs on an Apache web server. The site is under construction and may not exhibit the full functionality of the system.
Flask Application
Location: ./flask_app
Predicate List
Location: ./predicate_list
Scripts to create the json files of KB set predicates displayed in the demo.
Notes

```r
# Join the filtered counting predicates with the Wikidata property labels.
library(dplyr)

counting <- read.csv('alignment/counting_filtered.csv')
wd_labels <- read.csv('datasetup/WD/wd_property_label.csv')

# Extract the property id (e.g. P1082) from the full Wikidata URL;
# the prefix http://www.wikidata.org/entity/ is 31 characters long.
wd_labels$id <- substr(as.character(wd_labels$Property), 32,
                       nchar(as.character(wd_labels$Property)))

# Extract the id from the predicate URL: everything after the last '/'.
counting$id <- sapply(counting$pred, function(x)
  substr(x, start = tail(gregexpr('/', x)[[1]], 1) + 1,
         stop = nchar(as.character(x))))

counting <- inner_join(counting, wd_labels, by = 'id')
```