EEQuest: An Event Extraction and Query System
Prerit Jain Haripriya Bendapudi Shrisha Rao
IIIT Bangalore IIIT Bangalore IIIT Bangalore
ABSTRACT
We present EEQuest, an application that extracts events from text using natural language processing (NLP) and supervised machine-learning techniques, and provides a system to query events extracted from a text corpus.
We provide a use case for the application wherein we extract business-related events from news articles. The extracted events are then categorized based on the business organization/company that they are related to. Finally, the events are added to a knowledge base using which a query system is built. The system can be used to display events related to a particular organization or a group of organizations. Although we are using the system to extract business-related events, the event extraction mechanism can be used in a more general sense with any available textual data, to extract any kind of events that have a structure that can answer the question: Who did what, when and where?

CCS Concepts
•Computing methodologies → Information extraction; Supervised learning by classification;

Keywords
event extraction, information extraction, natural language processing, supervised learning, artificial intelligence.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM COMPUTE ’16, October 21 - 23, 2016, Gandhinagar, India
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4808-9/16/10...$15.00
DOI: http://dx.doi.org/10.1145/2998476.2998482

1. INTRODUCTION
As the availability of data increases day by day, it becomes increasingly urgent to devise methods to automate the extraction of relevant information from this data. “Event Extraction” is a common application of text mining, where information about important real-world incidents or occurrences is detected and extracted from texts that often do not have a fixed and predictable structure. Such real-world events that are to be detected can most of the time be structured in the form of “Who did what, when and where?” [12]. Event extraction is a major part of open-source intelligence (OSINT), which relies on the analysis of open published textual sources. Such analyses have historically “not only constituted a major part of all intelligence,” but also are “the leading source of information” [19]. Even in business intelligence (BI), the ability to answer queries based on event extraction from unstructured data is critical [18]. Therefore, the problems of event extraction and querying from textual data are of much real-world significance.
In this paper, we present an approach to build an application that extracts such events and allows users to query those that are relevant to them. The approach adopted is as follows: we first take in English text, and use NLP and supervised learning techniques to detect and extract the various elements of an event (who, what, where and when). Once various events are extracted, we store them in a graph database (Neo4J) and create a query system on top of it. The graph database allows the events to be divided into categories and subcategories while keeping the queries simple and fast. A user can now request events belonging to the category or subcategory of his/her choice.
EEQuest has been built∗ as an exemplar for the use and working of an application such as the one described above. It extracts business-related events from news articles and adds them to a graph database. The graph database consists of various companies related to each other as competitors or associates, and events related to each company. A user of the application can query events related to a particular company, those related to a company’s competitors/associates, or those related to a particular group of companies or product type.
By keeping the core technology of the above application intact and by changing some of its peripherals such as the input data, the structure of the network, etc., the application can be adapted to various kinds of use cases.
The rest of the paper is organized as follows. In Section 3, we talk about the architecture of the system used to build the application. In the sections that follow, we give a detailed explanation of how each component of the system was implemented. In Section 4, we give a detailed explanation of the procedure we used for extracting events from the news sources. In Section 5 we explain how the query system was built after extracting the events. Section 6 summarizes the results of various experiments carried out to test the components of the model. Finally, in Section 7 we present our conclusions and ideas for future work.

∗ The source code and data for this work can be found at https://github.com/haripriya-b/EventExtraction.

2. RELATED WORK
There is a lot of ongoing as well as past research related to event extraction, and it finds applications in many diverse fields. A few areas where event extraction has been applied are discussed below.
Borsje et al. [1] developed a financial event extractor that detects and extracts financial events, thereby making it easier for investors to continuously monitor financial markets. They use lexico-semantic patterns to extract relevant events from RSS news feeds.
Tanev et al. [2] and Alexandra et al. [4] provide different approaches for extracting real-time news events. The former extracts violent and natural disaster events from online news by using clustering techniques to automatically tag words and further learns patterns from these extracted events. The latter evaluates news articles in real time and extracts meaningful complex events based on the user’s needs and specifications.
Coming to the actual process for extracting events, there are several methodologies that can be adopted for this. Hogenboom et al. [3] provide an overview of some of the various text mining techniques that have been employed for various event extraction purposes. In their paper, they also provide some general guidelines on how to choose a particular technique depending on the user, the scenario of use and the available content.
Frasincar et al. [5] used an extraction mechanism that is centered around a domain ontology which is used for indexing news items. The major bottleneck of this approach is that the ontology needs to be regularly maintained by domain experts since news information can contain knowledge that is not known a priori.
Zhang et al. [6] devised an unsupervised approach that uses a probabilistic graphical model to cluster sentences describing similar events to discover event relations and learn to extract them.
Fader et al. [7] built a system named ReVerb that helps identify relationships for open information extraction. The main focus of their work is to reduce incoherent and uninformative extractions. To achieve this, they establish certain syntactic and lexical constraints and only those phrases that satisfy these constraints are considered as relations. Their algorithm first extracts the relations that satisfy the lexical and syntactic constraints and then extracts the arguments to the left and right of each relation.
As an improvement to ReVerb, Schmitz et al. [8] came up with OLLIE (Open Language Learning for Information Extraction) which overcomes the limitations of ReVerb. They expand the syntactic scope of relational phrases and also allow additional context information. To extract relations, they first use a set of high precision seed tuples from ReVerb, bootstrap a large training set and use this training set to learn open pattern templates. These templates are applied on the data at the time of extraction. Finally, OLLIE analyzes the context around the tuple to add information.
Although both systems show promising results in extracting relations, it may not be ideal to use them for the process of event extraction. To start with, ReVerb uses strict lexical and syntactic constraints for the process of extraction, which makes their system quite rigid. Although this problem is addressed in OLLIE, both systems are still more applicable for general extraction of relations. In the system we are trying to build, we are more interested in looking at only a few specific relations (or maybe just a single one) that can possibly give a gist of the entire article. Therefore, other parameters such as the frequency of certain important words/phrases, their relative position within the text and title of the article, etc. need to be considered. Also, since we are only interested in specific relations, it may be easier to first find the subject of discussion and then extract the relation as opposed to finding the relation and then extracting the arguments.

3. SYSTEM ARCHITECTURE
In this section we present the basic architecture and components of EEQuest.
The system consists of the following basic modules:
• Event Extraction Module
• Network Creation Module
• Event Addition Module
• Query Module
The application has two main jobs: (a) extracting events and (b) building the query system. The first module takes care of event extraction and the rest of the modules work to build the query system. We give a brief explanation of what happens in each of these modules.

Figure 1: System Architecture for EEQuest

Event Extraction Module
This module takes in a news article provided in JSON format, does some NLP on the given data, extracts the required information and creates events. Here, an event can be the launch of a new product, sudden change in its stock value, mergers and acquisitions, etc. For each article, the module first finds the organization being talked about and then extracts information that answers the question: Who did what, where and when, and stores it in the form of an event.

Network Creation Module
The query system is built on top of a network of companies and events related to them. Therefore, the first step in building the query system is to get the network ready.
This module creates a network of companies and business organizations. The nodes in the network represent the companies while the edges represent relations (competitors and associates) between various companies. All companies that are related to the same domain or produce similar/comparable products are grouped under a single domain/product type. Also, companies (especially some of the large companies such as Amazon, Google, etc.) might
have multiple divisions developing multiple kinds of products. Therefore, a company can be connected to more than one product type. The created network is stored in Neo4J [20] which is a graph database.
To create the network, we start by manually adding a set of companies, their domains of work and the relationships between them. As events get extracted and added to the network, if the company referred to in the article does not exist, the system automatically adds it to the network. However, any sort of relations between this company and others need to be added manually.

Event Addition Module
Once the basic network is created by the network creation module and events are extracted by the event extraction module, the next step is to add the extracted events to the graph database. To do this, the network module creates new event nodes and connects each event node to its respective organization. Once all the events are added to the graph database, queries can be made for events related to a particular organization or a particular group of organizations.

Query Module
Once the events have been added to the network, the query module allows a user to make queries on the network to get relevant information. Since the network graph is made using Neo4j, the queries are made using its query language, Cypher.

4. EVENT EXTRACTION
In this section we explain in detail the procedure adopted to extract events from news articles. We start by explaining how we built a data set for testing our approach followed by a detailed explanation of how we extracted each element of an event.

4.1 Data Collection
In order to be able to extract events, we needed business related news articles. We used the RSS feed of some standard news sources and magazines such as Reuters [9], TechCrunch [10], etc., to get articles. Within these sources, we looked at the business subsection alone, since we were interested in extracting business related events. From the RSS feed, we extracted details such as title, date, description and URL of each article using the feedparser package in Python. Then, we used the URL to extract the text from the actual web page of the article using the Newspaper [11] package, also available in Python. We then stored the data related to each article in a file, in JSON format along with a unique identifier for each article. These articles were then used for extracting events. We periodically checked the RSS feed and only considered the feed if it got updated.

4.2 Extraction Pipeline
Once a sufficient number of articles were collected, the next step was to extract events from each of these articles. For most of the process of event extraction, we followed the procedure suggested by Wunderwald [12].
Any event can be reported as: Who did what, when, and where. Therefore, if we can extract who, what, when and where from an article, we have the event that the article is reporting. In order to extract these from the title and text of the article we used certain NLP and supervised machine learning techniques. The entire process was coded in Python. For NLP, we used packages such as NLTK [13], Stanford CoreNLP [14] and cort (co-reference resolution toolkit) [15]. The entire pipeline for the process of event extraction is explained in detail below. Also, we took an example article and used it to demonstrate various steps. An excerpt of the extracted article is given in Figure 2.

Figure 2: Example article for event extraction [16]

4.2.1 Pre-processing
Before jumping into the event extraction process, a certain amount of pre-processing of the text had to be done. Once the article’s text was extracted from the JSON, the following operations were done.

Sentence Tokenization.
The entire text had to be split into sentences. Just splitting on the period (.) does not work. There can be abbreviations (W.H.O, B.H.E.L), person titles (Mr., Dr.), etc., that have periods. Therefore, care had to be taken to ensure that sentences were split in the right manner. The NLTK package in Python provides the sent_tokenize() method which splits the text into valid sentences.

Parts of Speech (POS) Tagging.
Here, the parts of speech of the various word tokens in the sentence are determined. For each word token, an appropriate syntactic label is given to it from a list of POS tags. The Stanford CoreNLP package provides a POS tagger which can be used for tagging parts of speech of a sentence. An example of POS tagging is shown in Figure 3.

Figure 3: A POS-tagged sentence for Figure 2

Parse Tree.
Once the individual word tokens are given POS tags, each sentence is then converted into a hierarchical structure that corresponds to individual units of meaning in the sentence. This hierarchical structure is called a parse tree. Once the parse tree is formed, we can easily extract required parts of the sentence such as the noun phrases, verb phrases, etc. An example of a parse tree is given in Figure 4.

Figure 4: Parse Tree for title in Figure 2

Named Entity Recognition (NER).
There are certain words in the sentences that belong to categories such as PERSON, LOCATION, ORGANIZATION, etc. The named entity recognizer goes through each word in the
text and labels each word with its entity name. If a word does not belong to any category of entities, it is labeled as other (O). The Stanford CoreNLP package provides named entity recognition for the following entities: PERSON, LOCATION, ORGANIZATION, MONEY, PERCENT, DATE and TIME. Once the text is split into sentences, each sentence is further split into word tokens and the Stanford named entity recognizer is run to identify and label named entities. An example of named entity recognition is shown in Figure 5.

Figure 5: Tagged entities for the title in Figure 2
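
As a minimal sketch of how token-level labels might be turned into entity mentions, the snippet below merges adjacent tokens that share a label. The (token, label) pair format and the labels assigned to the example title are assumptions made for illustration; they are not copied from the tagger output in Figure 5, and group_entities is a hypothetical helper rather than part of EEQuest.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens with the same non-O label into mentions."""
    mentions = []
    # groupby yields one run per stretch of consecutive tokens with equal labels.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "O":
            continue  # tokens outside any entity category are skipped
        text = " ".join(token for token, _ in run)
        mentions.append((text, label))
    return mentions


# Hypothetical (token, label) pairs for the title of the example article.
title = [
    ("Microsoft", "ORGANIZATION"), ("launches", "O"), ("Azure", "O"),
    ("preview", "O"), ("in", "O"), ("Germany", "LOCATION"),
    ("and", "O"), ("Canada", "LOCATION"),
]
mentions = group_entities(title)
print(mentions)
```

One limitation of this simple grouping is that two distinct entities of the same type that appear directly next to each other would be merged into a single mention.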
Figure 7: Extraction Pipeline

Co-reference Resolution.
In an article, a particular entity can be referred to in multiple ways. For example, in Figure 6, Microsoft, it and its all refer to the same entity: Microsoft. The process of identifying all phrases that refer to the same entity is called co-reference resolution. Martschat and Strube [15] built a co-reference resolution toolkit (cort) which internally uses the NLTK and Stanford CoreNLP packages. Also, during the process of co-reference resolution, cort does NER, POS tagging and creates the parse tree.
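
To illustrate why co-reference chains matter for later counting, the sketch below credits every mention in a chain to the chain's named entity. The representation of chains as plain lists of mention strings is an assumption for this example; it is not cort's actual output format, and count_entity_mentions and the sample chains are hypothetical.

```python
from collections import Counter


def count_entity_mentions(chains, named_entities):
    """Count mentions per entity, crediting pronouns to the chain's named mention."""
    counts = Counter()
    for chain in chains:
        # Representative: the first mention in the chain that is a known named entity.
        representative = next((m for m in chain if m in named_entities), None)
        if representative is not None:
            counts[representative] += len(chain)
    return counts


# Hypothetical chains: each inner list holds mentions that refer to one entity.
chains = [
    ["Microsoft", "it", "its", "Microsoft"],
    ["Germany"],
    ["the preview", "it"],  # no named entity in this chain, so it is ignored
]
named = {"Microsoft", "Germany", "Canada"}
counts = count_entity_mentions(chains, named)
print(counts)
```

Chains that contain no recognized named entity are dropped here, which is one possible design choice; a real pipeline might treat them differently.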
Figure 6: Co-reference Resolution for Figure 2

4.2.2 Who
It is worthwhile to note that not only people but also groups of people, organizations and even locations can be valid whos. By trying to identify who, we are trying to find out who the article is about or in other words, the subject of the article. We assumed that the who or the subject of the article occurs frequently in the whole article and is also a named entity, and therefore we applied Named Entity Recognition on the entire text of the article as well as the title. We calculated several features for each of the entities detected, and used them to classify entities into those that are who in the article and those that are not. The features that we considered for selecting the appropriate who are as follows.

1. The number of occurrences in text: It is the count of how often a certain entity is mentioned in the article text. It is assumed that number of occurrences of an entity is strongly correlated to its importance. While counting the number of occurrences, all possible representations of the entity are considered (Microsoft, it, its, etc. in Figure 6).

2. The number of occurrences in the title: More often than not, the subject of the article or the who is present in the title. The title usually contains the most important information of the article such as who and what. Usually this count is 0 or 1.

3. Mean Position: It refers to the mean position of the entity relative to the length of the article text. It is assumed that more important entities occur at the beginning of the article while the less important ones may feature somewhere later. The mean position is calculated as:
$$\mu(\text{entity}) = \frac{\sum_{n} \text{index}_n(e)}{n \cdot l}$$

where,
e ← entity
n ← number of occurrences of the entity in the article
index_n(e) ← index of the nth occurrence of the entity ‘e’ in the article
l ← total number of word tokens in the article

4. Entity Type: This holds the NER tag given to the entity by the Stanford NLP package. Here we focus on the types PERSON, LOCATION and ORGANIZATION for the classification of who. It is assumed that the type of the entity as a feature has a high impact on the classification.

Table 1: Features for article in Figure 2
Entity         | Count in Text | Count in Title | Mean Position | Entity Type
Microsoft      | 4             | 1              | 0.1212        | organization
U.S.           | 1             | 0              | 0.928         | location
Canada         | 1             | 1              | 0.9393        | location
Germany        | 1             | 1              | 0.8787        | location
Late last year | 1             | 0              | 0.002         | date

The entity type feature is broken into three individual features isLocation, isPerson and isOrganization and boolean values are given for each entity. This is done to overcome the problem of handling multiple types of data in the same ML algorithm.
Each of the entities of the article is then classified to be a who or not using classifiers from the Sklearn [17] library. From the list of detected entities for each article, we calculate the features mentioned above for each entity and then find out the probability of whether it is a who or not using the GNB (Gaussian Naive Bayes) classifier. The who candidates are then ranked in decreasing order of probabilities. We experimented with other classifiers as well, a comparison of which is provided in Table 5. The set of features for the article in Figure 2 are given in Table 1.
For example, for the article in Figure 2, based on the feature values, the trained model gives the possible who candidates and their probabilities as [(‘Microsoft’, 0.9980), (‘Canada’, 0.887), (‘Germany’, 0.878)] based on which Microsoft is chosen as the who.

4.2.3 What
Here we attempted to answer the question who did what in the article. Since what refers to an action that is being performed, we take it to be a verb phrase. To find what, we start by looking at the who candidates present in the title and their subsequent verb phrases. If such a pair exists, we fix them as the who and what for the event. If none of the who candidates are present in the headline, we search for the first occurrence of the highest ranked who in the article text and take its subsequent verb phrase as what. To look for the verb phrase having what, we created a parse tree for each sentence and then matched the who candidates in the parse tree. Since the who candidates are always in the noun phrase part, it was easy to find the subsequent verb phrase. In the article in Figure 2, since the highest rated who is present in the title, the what is the subsequent verb phrase in the title itself, i.e., ‘launches Azure preview in Germany and Canada’.

4.2.4 When
Events are usually associated with a time or date. We try to extract temporal information that is present within the news articles. Since we also have a publication date of the article, we use that in case the when cannot be extracted from article text. We used a named entity recognizer capable of extracting time and date entities. The NER from the Stanford NLP package tags entities of types DATE and DURATION which are used for extracting when. Since the NER tagging done for the extraction of who contains all entities and the corresponding features, we filter the entities of type DATE and DURATION. Among these filtered entities, the entity with the lowest mean position which is also present in the same sentence as the extracted who is selected as the when. Many times the relevant entities are today, yesterday, next week, etc. But these do not tell us the absolute when. To solve this issue, the date of publication of the article is also maintained and therefore the actual when is usually relative to the publication date of the article. If none of the entities satisfy these criteria, the when tag is left blank for the article under consideration. For the article in Figure 2, there is only one entity that can possibly be the when and is present in Table 1, that is ‘Late last year’.

4.2.5 Where
To extract the where, we again use NER to find out entities labeled as LOCATION because the place where something happens is probably an entity of type LOCATION. Whether it is possible to extract where or not strictly depends on the article. While some events specifically mention that the event is taking place in a particular location like “Apple launches iPhone 6S in India” wherein the information relevant for location may be mentioned several times in the text or even in the headlines, there are abstract events like “Mark Pincus steps down as CEO of Zynga” in which there is no mention of where. We made use of the same features as used for the extraction of who but we only considered entities where the entity type was LOCATION. We use the same classification technique as in who without the entity type feature and finally take the highest ranked entity as the where. For the article in Figure 2, the possible entities for where are present in Table 1. Among these, the entities ‘Germany’ and ‘Canada’ are classified as where.

5. QUERY SYSTEM
In this section we explain how we built our query system. We first created a network of companies, then added the extracted events to the company that they are related to. From this network, the user can make any of the queries specified in Section 5.

Network Creation
The network consists of companies and company types/domains represented as nodes. Various companies can be related to
one another as being competitors or associates. A company can also be related to a company type/domain/product type, if it is manufacturing goods or providing services related to that domain. A company can have relations with multiple domain nodes.
To store the network, we used a NoSQL graph database, Neo4J [20]. In graph databases, relationships are first-class citizens. Therefore, querying relationships becomes easier and faster as you do not need to go through the trouble of using foreign keys, table joins, etc. as in the case of relational databases to infer the relation between entities. Neo4J provides a neo4j-rest-client API that allows us to use the Neo4J REST server locally through python-embedded. We manually created text files (in CSV format) containing names of some companies, possible domains and relations between pairs of companies and those between a company and a domain. The network creation module takes as input these CSV files. Once the input is given, the module creates the network and stores it in a Neo4J graph database. During the process of extraction, if the system encounters a company that does not exist in the network, it automatically adds it. An example network is given in Figure 8.

Figure 8: Network Of Companies

Adding and Querying Events
Once the network is created and the events are extracted from the news articles, the next step is to add the events to the graph database. To do this, we created event nodes for each event. An event node consists of the eventID, URL and date of publication of the article from which the event was extracted and details of the event: who, what, when and where. Some of the details such as when and where may not be available for all events. In cases where they are not available the value is set to null. Also, while adding an event to the network, if the company that the event is related to does not exist in the network, then a new node for that company is created, and the event node is added to that node. The newly created node, however, does not have any competitors/associates set and does not belong to any particular category. These relations have to be manually added.
It can so happen that more than one copy of the same article from different sources gets scraped. Therefore, if the exact same event already exists in the database, the system detects duplication and does not add the event. However, if two different articles reporting the same event, but in different manner, are encountered, then both get added to the database.

Table 2: Extracted Events from News
who        | what                                                                                | where     | when     | publication date
Google (a)   | loses android antitrust appeal in Russia                                            | Russia    | Feb 2015 | Mon, 14 Mar 2016
Zynga (b)    | beats expectations in q2 with $ 200m in revenue                                     | -         | Today    | Thu, 06 Aug 2015
AOL (c)      | got acquired by giant U.S. carrier Verizon for $ 4.4 billion                        | London    | -        | Fri, 13 Nov 2015
LinkedIn (d) | revamps its jobs listings with big data analytics                                   | -         | Today    | Tue, 15 Dec 2015
Square (e)   | launches payments in Australia , its first country expansion in nearly three years | Australia | recently | Tue, 08 Mar 2016
(a) http://feedproxy.google.com/~r/techcrunch/android/~3/Pwj0hJmsS_A/
(b) http://feedproxy.google.com/~r/TechCrunch/Zynga/~3/lY_jErmRQtg/
(c) http://feedproxy.google.com/~r/TechCrunch/Aol/~3/MvY-Z07lg4Y/
(d) http://feedproxy.google.com/~r/TechCrunch/Linkedin/~3/e1c2SHVTk0g/
(e) http://feedproxy.google.com/~r/TechCrunch/Square/~3/tUGW4fEhccg/

Once the network of companies, company types and events is created, a query system is built on top of the database. The user is given an option to choose one of the following queries:
• get events related to a particular company
• get events related to a group of companies that either
index who what when where publication date
1 Google wouldn’t start talking about Android N . . . May Wed, 09 Mar 2016
2 Google launches the android experiments i/o challenge today Fri, 25 Mar 2016
for open-source app developers
3 Google loses android antitrust appeal in Russia Feb 2015 Russia Mon, 14 Mar 2016
Table 3: Events related to “Google”
company who what when where publication date
Amazon Amazon may be able to take business away . . . last November UK Mon, 29 Feb 2016
Amazon Amazon says it will bring device encryption Less than Sat, 05 Mar 2016
back to fire os a day
Microsoft Microsoft launches Azure preview in Late last year Germany Tue, 15 Mar 2016
Germany and Canada
Table 4: Events related to Associates and Competitors of “Google”
belong to the same category or where who was recognized correctly, what was also recog-
nized correctly. Out of the 100 articles that were used, only
• get events related to the competitors or associates of 11 contained a where in them. Out of those 11, 10 were
a particular company. correctly extracted. Since most articles in the test set were
Some of the query results obtained are shown in Section 6. announcements related to release of a product, or a hike/dip
in the stock value of a company, etc., there was no explicit
date for the event. Most of the whens extracted were of the
6. RESULTS kind: ’today’, ’last week’, ’yesterday’, etc.
In Table 2, some of the extracted events from their respec-
Evaluation of the Extractor tive articles are mentioned.
Who Classifier
Query System
To test the working of the who classifier, we needed a data-
set where the entity that represents the who in a given article was already
marked. Since no such data-set was readily available, we manually tagged about
100 articles, which together contained roughly 2500 entities. For testing the
model, we used business articles that were available in TechCrunch [10] and
Reuters [9]. Each entity was marked 1 if it was the who of the article;
otherwise it was marked 0. A total of 1500 entities were used to train the
model. The remaining entities were used to test the model. We trained several
classifiers on our feature set, and the results on the test set are summarized
in Table 5. For the data-set described above, we proceeded with the Gaussian
Naive Bayes classifier, which gave 89% accuracy and the highest true positive
rate (0.7) of the classifiers tried.

Table 5: Results for WHO
Classifier                Accuracy (in percent)   True Positive Rate
Gaussian Naive Bayes      89                      0.7
Multinomial Naive Bayes   92                      0.5
Logistic Regression       91                      0.4

What
We tried extracting what for all the 100 articles for which what was also
manually tagged. The process gave an accuracy of 78.57%. This result was
obtained when Gaussian Naive Bayes was used for extracting who. We observed
that in many articles the what was not recognized correctly because either no
who was recognized or the recognized who was not the right one. For most of the
articles

As mentioned previously, the user can make 3 different queries, and can request
events related to:
• a particular organization
• the competitors and associates of a particular organization
• organizations working in a particular sector
The results obtained for the first two queries, made on the network shown in
Figure 8, are shown in Tables 3 and 4. The last query will print the events in
both Tables 3 and 4, as Amazon, Google and Microsoft all work in the cloud
sector.

7. CONCLUSIONS AND FUTURE WORK
EEQuest allows the user to query events related to a particular organization or
a group of organizations. For the extraction of events we have used some
standard natural language processing and supervised machine learning
techniques. We have also created a network of companies, and to this network we
have added the extracted events. Finally, we have developed a query system with
which the user can get events related to a particular organization or a group
of organizations. We ran experiments to evaluate the performance of the system.
For the extraction of who and what, the model gave an accuracy of 89% and
78.57% respectively. We have proposed a mechanism and demonstrated it with
publicly available news data. There might be cases where private data of an
organization is present. In such cases, the organization can extract events
from its own data and create its own queryable network using EEQuest.
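
The accuracy figures above are easier to read next to the true positive rate, because accuracy alone can look good when positives are rare (as is likely when one who is marked per article). The following is a minimal Python sketch of how the two measures are computed from binary labels; it is not the evaluation code used for EEQuest, and the function name `binary_metrics` and the toy label lists are assumptions made purely for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, true_positive_rate) for lists of 0/1 labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # who, predicted who
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not who, predicted not who
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # who, missed
    accuracy = (tp + tn) / len(pairs)
    positives = tp + fn
    tpr = tp / positives if positives else 0.0
    return accuracy, tpr


# Hypothetical toy data: 10 entities, 2 of which are the who.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
never_who = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # never predicts a who
finds_one = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # one hit, one miss, one false alarm

print(binary_metrics(y_true, never_who))
print(binary_metrics(y_true, finds_one))
```

In this toy case both prediction lists reach the same accuracy, but only the second one ever identifies a who, which is the kind of difference the True Positive Rate column of Table 5 captures.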
In the current system, the areas where an organization works have to be added
manually, but the system can be extended to learn to derive those areas
automatically. We could also use additional features as well as a more
sophisticated machine learning technique for event extraction. One more
challenge that needs to be addressed is that of scale. As the network grows, it
will no longer fit on a single device, and work needs to be done to make the
system handle big data. Also, as the number of data sources increases, there
might be redundancy in the events detected, since different sources may report
the same event. In this case, event co-reference resolution needs to be done to
ensure that only one instance of an event is added to the network.

8. REFERENCES
[1] Borsje, J., Hogenboom, F., Frasincar, F. (2010). Semi-Automatic Financial
Events Discovery Based on Lexico-Semantic Patterns. International Journal of
Web Engineering and Technology vol. 6(2), pp. 115–140.
[2] Hristo Tanev, Jakub Piskorski, Martin Atkinson (2008). Real-Time News Event
Extraction for Global Crisis Monitoring. Proceedings of the 13th International
Conference on Natural Language and Information Systems: Applications of Natural
Language to Information Systems, June 24-27, 2008, London, UK.
doi:10.1007/978-3-540-69858-6_21
[3] Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F. (2011). An overview
of event extraction from text. Workshop on Detection, Representation, and
Exploitation of Events in the Semantic Web (DeRiVE 2011), Bonn, Germany,
October 2011.
[4] Alexandra La Fleur, Kia Teymourian, and Adrian Paschke (2015). Complex
event extraction from real-time news streams. In Proceedings of the 11th
International Conference on Semantic Systems (SEMANTICS ’15), Sebastian
Hellmann, Josiane Xavier Parreira, and Axel Polleres (Eds.), pp. 9–16.
[5] Frasincar, F., Borsje, J., Levering, L. (2009). A Semantic Web-Based
Approach for Building Personalized News Services. International Journal of
E-Business Research 5(3), pp. 35–53.
[6] Congle Zhang, Stephen Soderland, and Daniel S. Weld (2015). Exploiting
parallel news streams for unsupervised event extraction. Transactions of the
Association for Computational Linguistics, vol. 3, pp. 117–129.
[7] Anthony Fader, Stephen Soderland, and Oren Etzioni (2011). Identifying
relations for open information extraction. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP ’11). Association for
Computational Linguistics, Stroudsburg, PA, USA, pp. 1535–1545.
[8] M. Schmitz, R. Bart, S. Soderland, O. Etzioni (2012). Open Language
Learning for Information Extraction. In Proceedings of Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CONLL).
[9] Reuters India News RSS, Subscribe RSS News Feeds, Latest news India (2016).
Reuters India. Retrieved 7 June 2016, from http://in.reuters.com/tools/rss
[10] Riggs, D. (2016). TechCrunch. Retrieved 2 June 2016, from
http://techcrunch.com/topic/company/
[11] Lucas Ou-Yang. Newspaper: Article scraping & curation - newspaper 0.0.2
documentation (2016). Newspaper.readthedocs.org. Retrieved 11 April 2016, from
http://newspaper.readthedocs.org/en/lates
[12] Wunderwald, Martin (2011). “NewsX Event Extraction from News Articles”,
diploma thesis, Dresden University of Technology, Dresden, Germany. Retrieved 2
June 2016 from
http://www.rn.inf.tu-dresden.de/uploads/Studentische_Arbeiten/Diplomarbeit_Wunderwald_Martin.pdf
[13] Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language
Processing with Python. O’Reilly Media Inc.
[14] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven
J. Bethard, and David McClosky (2014). The Stanford CoreNLP Natural Language
Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, pp. 55–60.
[15] Sebastian Martschat and Michael Strube (2015). Latent Structures for
Coreference Resolution. Transactions of the Association for Computational
Linguistics, vol. 3, pp. 405–418.
[16] Lardinois, F. (2016). Microsoft launches Azure preview in Germany and
Canada, announces DoD-specific regions in U.S. TechCrunch. Retrieved 2 June
2016, from
http://techcrunch.com/2016/03/15/microsoft-launches-azure-preview-in-germany-and-canada-announces-dod-specific-regions-in-u-s/?ncid=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+techcrunchIt+%28TechCrunch+IT%29
[17] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel,
V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J.
VanderPlas, A. Joly, B. Holt and G. Varoquaux (2013). “API design for machine
learning software: experiences from the scikit-learn project”, in ECML PKDD
Workshop: Languages for Data Mining and Machine Learning, pp. 108–122.
[18] E. Arendarenko and T. Kakkonen (2012). “Ontology-based information and
event extraction for business intelligence”, in The 15th International
Conference Artificial Intelligence: Methodology, Systems, and Applications
(AIMSA 2012), Varna, Bulgaria, September 12–15, 2012, Lecture Notes in Computer
Science vol. 7557, pp. 89–102, Springer.
[19] Schaurer, F., & Störger, J. (2013). The Evolution of Open Source
Intelligence (OSINT). Journal of U.S. Intelligence Studies, vol. 19 (3),
Winter/Spring 2013, pp. 53–56.
[20] Neo4j: The World’s Leading Graph Database (2016). Neo4j Graph Database.
Retrieved 11 April 2016, from http://neo4j.com/