diff --git a/README.md b/README.md
index 812423d..740652a 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,51 @@
-# ENTYFI
\ No newline at end of file
+# ENTYFI: Entity Typing in Fictional Texts
+
+Cuong Xuan Chu, Simon Razniewski, Gerhard Weikum. (WSDM 2020)
+Project website: https://www.mpi-inf.mpg.de/yago-naga/entyfi
+
+# Dependencies
+- Python2 for mention detection
+  - cPickle
+  - theano
+- Python3 with TensorFlow for fictional typing
+  - tensorflow
+  - sklearn
+  - pandas
+  - keras
+- Python3 for ultra-fine typing and ILP
+  - PyTorch (ver 0.3.0)
+  - Python3
+  - Numpy
+  - Tensorboard
+  - Pickle
+  - ast, pulp
+  - Pretrained word embeddings (GloVe): "wget http://nlp.stanford.edu/data/glove.840B.300d.zip"
+
+# Required Data
+You need to download the required data, which includes the background knowledge bases of all reference universes, the pretrained models for the fictional typing module, and the data for reference universe ranking.
+
+All data can be found at: http://people.mpi-inf.mpg.de/~cxchu/entyfi/
+
+# Configuration
+To run typing, you need to set the following paths in these files:
+- ultrafile/resources/constant.py
+  - GLOVE_VEC=path to the pretrained word embeddings (GloVe)
+- utils/Constants.java
+  - PYTHON_TAGGER=path to python2 for mention detection
+  - PYTHON_ULTRA=path to python3 for ultra-fine typing and ILP
+  - PYTHON_GENERALTYPING=path to python3 with tensorflow for fictional typing
+- resources/wikia.properties
+  - BASE_DIR=path to data-store (background KB of all universes) --- data-store (downloaded data)
+  - ATTENTION_MODEL=path to the pretrained model of the fictional typing module --- attentionModel (downloaded data)
+  - TERMATRIX=path to the universe-term matrix for reference universe ranking --- universe-termmatrix (downloaded data)
+
+# How to Run
+
+- Build: ./build.sh
+- Run typing: ./run.sh heap-size typing.ENTYFI input-file output-file
+  For example: ./run.sh 10G typing.ENTYFI input-file output-file
+Other parameters, such as the topK reference universes or the topK types returned by the ILP, can be defined in the class typing/ENTYFI.java
+
+# Notes
+
+- For mention detection, to improve efficiency, we use the technique from the paper: https://arxiv.org/abs/1603.01360