EEQuest

Haripriya Bendapudi; P. K. Jain

doi:10.1145/2998476.2998482

EEQuest: An Event Extraction and Query System

Shrisha Rao

https://doi.org/10.1145/2998476.2998482

8 pages

Abstract

We present EEQuest, an application that extracts events from text using natural language processing (NLP) and supervised machine-learning techniques, and provides a system to query events extracted from a text corpus. We provide a use case for the application wherein we extract business-related events from news articles. The extracted events are then categorized by the business organization or company that they relate to. Finally, the events are added to a knowledge base on top of which a query system is built. The system can be used to display events related to a particular organization or a group of organizations. Although we use the system to extract business-related events, the event extraction mechanism can be applied more generally to any available textual data, to extract any kind of event whose structure answers the question: Who did what, when and where?

Prerit Jain, Haripriya Bendapudi, Shrisha Rao
IIIT Bangalore

ACM COMPUTE '16, October 21-23, 2016, Gandhinagar, India. © 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4808-9/16/10. DOI: http://dx.doi.org/10.1145/2998476.2998482

CCS Concepts

• Computing methodologies → Information extraction; Supervised learning by classification;

Keywords

event extraction, information extraction, natural language processing, supervised learning, artificial intelligence.

1. INTRODUCTION

As the availability of data increases day by day, it becomes increasingly urgent to devise methods to automate the extraction of relevant information from this data. "Event extraction" is a common application of text mining, where information about important real-world incidents or occurrences is detected and extracted from texts that often do not have a fixed and predictable structure. Such real-world events can, most of the time, be structured in the form "Who did what, when and where?" [12]. Event extraction is a major part of open-source intelligence (OSINT), which relies on the analysis of open published textual sources. Such analyses have historically "not only constituted a major part of all intelligence," but also are "the leading source of information" [19]. Even in business intelligence (BI), the ability to answer queries based on event extraction from unstructured data is critical [18]. Therefore, the problems of event extraction and querying from textual data are of much real-world significance.

In this paper, we present an approach to build an application that extracts such events and allows users to query those that are relevant to them. The approach adopted is as follows: we first take in English text and use NLP and supervised learning techniques to detect and extract the various elements of an event (who, what, where and when). Once various events are extracted, we store them in a graph database (Neo4j) and create a query system on top of it. The graph database allows the events to be divided into categories and subcategories while keeping the queries simple and fast. A user can then request events belonging to the category or subcategory of his/her choice.

EEQuest has been built* as an exemplar for the use and working of an application such as the one described above. It extracts business-related events from news articles and adds them to a graph database. The graph database consists of various companies related to each other as competitors or associates, and events related to each company. A user of the application can query events related to a particular company, those related to a company's competitors/associates, or those related to a particular group of companies or product type.

* The source code and data for this work can be found at https://github.com/haripriya-b/EventExtraction.

By keeping the core technology of the above application intact and by changing some of its peripherals such as the input data, the structure of the network, etc., the application can be adapted to various kinds of use cases.

The rest of the paper is organized as follows. In Section 3, we describe the architecture of the system used to build the application. In the sections that follow, we give a detailed explanation of how each component of the system was implemented. In Section 4, we explain the procedure we used for extracting events from the news sources. In Section 5, we explain how the query system was built after extracting the events. Section 6 summarizes the results of various experiments carried out to test the components of the model. Finally, in Section 7, we present our conclusions and ideas for future work.

2. RELATED WORK

There is a lot of ongoing as well as past research related to event extraction, and it finds applications in many diverse fields. A few areas where event extraction has been applied are discussed below.

Borsje et al. [1] developed a financial event extractor that detects and extracts financial events, thereby making it easier for investors to continuously monitor financial markets. They use lexico-semantic patterns to extract relevant events from RSS news feeds.

Tanev et al. [2] and La Fleur et al. [4] provide different approaches for extracting real-time news events. The former extracts violent and natural disaster events from online news by using clustering techniques to automatically tag words, and further learns patterns from these extracted events. The latter evaluates news articles in real time and extracts meaningful complex events based on the user's needs and specifications.

Coming to the actual process for extracting events, there are several methodologies that can be adopted. Hogenboom et al. [3] provide an overview of some of the text mining techniques that have been employed for various event extraction purposes. In their paper, they also provide some general guidelines on how to choose a particular technique depending on the user, the scenario of use and the available content.

Frasincar et al. [5] used an extraction mechanism that is centered around a domain ontology which is used for indexing news items. The major bottleneck of this approach is that the ontology needs to be regularly maintained by domain experts, since news information can contain knowledge that is not known a priori.

Zhang et al. [6] devised an unsupervised approach that uses a probabilistic graphical model to cluster sentences describing similar events, in order to discover event relations and learn to extract them.

Fader et al. [7] built a system named ReVerb that helps identify relationships for open information extraction. The main focus of their work is to reduce incoherent and uninformative extractions. To achieve this, they establish certain syntactic and lexical constraints, and only those phrases that satisfy these constraints are considered as relations. Their algorithm first extracts the relations that satisfy the lexical and syntactic constraints and then extracts the arguments to the left and right of each relation.

As an improvement to ReVerb, Schmitz et al. [8] came up with OLLIE (Open Language Learning for Information Extraction), which overcomes the limitations of ReVerb. They expand the syntactic scope of relational phrases and also allow additional context information. To extract relations, they first use a set of high-precision seed tuples from ReVerb, bootstrap a large training set and use this training set to learn open pattern templates. These templates are applied on the data at the time of extraction. Finally, OLLIE analyzes the context around the tuple to add information.

Although both systems show promising results in extracting relations, it may not be ideal to use them for the process of event extraction. To start with, ReVerb uses strict lexical and syntactic constraints for the process of extraction, which makes the system quite rigid. Although this problem is addressed in OLLIE, both systems are still more applicable to general extraction of relations. In the system we are trying to build, we are more interested in looking at only a few specific relations (or maybe just a single one) that can give a gist of the entire article. Therefore, other parameters such as the frequency of certain important words/phrases, their relative position within the text and title of the article, etc. need to be considered. Also, since we are only interested in specific relations, it may be easier to first find the subject of discussion and then extract the relation, as opposed to finding the relation and then extracting the arguments.

3. SYSTEM ARCHITECTURE

In this section we present the basic architecture and components of EEQuest.

The system consists of the following basic modules:

• Event Extraction Module
• Network Creation Module
• Event Addition Module
• Query Module

The application has two main jobs: (a) extracting events and (b) building the query system. The first module takes care of event extraction and the rest of the modules work to build the query system. We give a brief explanation of what happens in each of these modules.

Figure 1: System Architecture for EEQuest

Event Extraction Module

This module takes in a news article provided in JSON format, applies NLP to the given data, extracts the required information and creates events. Here, an event can be the launch of a new product, a sudden change in its stock value, mergers and acquisitions, etc. For each article, the module first finds the organization being talked about and then extracts information that answers the question "Who did what, where and when?" and stores it in the form of an event.

Network Creation Module

The query system is built on top of a network of companies and events related to them. Therefore, the first step in building the query system is to get the network ready.

This module creates a network of companies and business organizations. The nodes in the network represent the companies while the edges represent relations (competitors and associates) between various companies. All companies that are related to the same domain or produce similar/comparable products are grouped under a single domain/product type. Also, companies (especially some of the large companies such as Amazon, Google, etc.) might have multiple divisions developing multiple kinds of products. Therefore, a company can be connected to more than one product type. The created network is stored in Neo4j [20], which is a graph database.

To create the network, we start by manually adding a set of companies, their domains of work and the relationships between them. As events get extracted and added to the network, if the company referred to in the article does not exist, the system automatically adds it to the network. However, any relations between this company and others need to be added manually.

Event Addition Module

Once the basic network is created by the network creation module and events are extracted by the event extraction module, the next step is to add the extracted events to the graph database. To do this, the network module creates new event nodes and connects each event node to its respective organization. Once all the events are added to the graph database, queries can be made for events related to a particular organization or a particular group of organizations.

Query Module

Once the events have been added to the network, the query module allows a user to make queries on the network to get relevant information. Since the network graph is made using Neo4j, the queries are made using its query language, i.e., Cypher.

4. EVENT EXTRACTION

In this section we explain in detail the procedure adopted to extract events from news articles. We start by explaining how we built a data set for testing our approach, followed by a detailed explanation of how we extracted each element of an event.

4.1 Data Collection

In order to be able to extract events, we needed business-related news articles. We used the RSS feeds of some standard news sources and magazines such as Reuters [9], TechCrunch [10], etc., to get articles. Within these sources, we looked at the business subsection alone, since we were interested in extracting business-related events. From the RSS feed, we extracted details such as the title, date, description and URL of each article using the feedparser package in Python. Then, we used the URL to extract the text from the actual web page of the article using the Newspaper [11] package, also available in Python. We then stored the data related to each article in a file, in JSON format, along with a unique identifier for each article. These articles were then used for extracting events. We periodically checked the RSS feed and only considered the feed if it had been updated.

4.2 Extraction Pipeline

Once a sufficient number of articles were collected, the next step was to extract events from each of these articles. For most of the process of event extraction, we followed the procedure suggested by Wunderwald [12].

Any event can be reported as: Who did what, when, and where. Therefore, if we can extract who, what, when and where from an article, we have the event that the article is reporting. In order to extract these from the title and text of the article we used certain NLP and supervised machine learning techniques. The entire process was coded in Python. For NLP, we used packages such as NLTK [13], Stanford CoreNLP [14] and cort (co-reference resolution toolkit) [15]. The entire pipeline for the process of event extraction is explained in detail below. Also, we took an example article and used it to demonstrate the various steps. An excerpt of the extracted article is given in Figure 2.

Figure 2: Example article for event extraction [16]

Figure 7: Extraction Pipeline

4.2.1 Pre-processing

Before starting the event extraction process, a certain amount of pre-processing of the text had to be done. Once the article's text was extracted from the JSON, the following operations were done.

Sentence Tokenization.

The entire text had to be split into sentences. Just splitting on the period (.) does not work. There can be abbreviations (W.H.O., B.H.E.L.), person titles (Mr., Dr.), etc. that contain periods. Therefore, care had to be taken to ensure that sentences were split in the right manner. The NLTK package in Python provides the sent_tokenize() method, which splits the text into valid sentences.

Parts of Speech (POS) Tagging.

Here, the parts of speech of the various word tokens in the sentence are determined. Each word token is given an appropriate syntactic label from a list of POS tags. The Stanford CoreNLP package provides a POS tagger which can be used for tagging the parts of speech of a sentence. An example of POS tagging is shown in Figure 3.

Figure 3: A POS-tagged sentence for Figure 2

Parse Tree.

Once the individual word tokens are given POS tags, each sentence is converted into a hierarchical structure that corresponds to individual units of meaning in the sentence. This hierarchical structure is called a parse tree. Once the parse tree is formed, we can easily extract required parts of the sentence such as the noun phrases, verb phrases, etc. An example of a parse tree is given in Figure 4.

Figure 4: Parse Tree for title in Figure 2

Named Entity Recognition (NER).

There are certain words in the sentences that belong to categories such as person, location, organization, etc. The named entity recognizer goes through each word in the text and labels each word with its entity name. If a word does not belong to any category of entities, it is labeled as other (O). The Stanford CoreNLP package provides named entity recognition for the following entities: person, location, organization, money, percent, date and time.

Once the text is split into sentences, each sentence is further split into word tokens and the Stanford named entity recognizer is run to identify and label named entities. An example of named entity recognition is shown in Figure 5.

Figure 5: Tagged entities for the title in Figure 2

Co-reference Resolution.

In an article, a particular entity can be referred to in multiple ways. For example, in Figure 6, Microsoft, it and its all refer to the same entity: Microsoft. The process of identifying all phrases that refer to the same entity is called co-reference resolution. Martschat and Strube [15] built a co-reference resolution toolkit (cort) which internally uses the NLTK and Stanford CoreNLP packages. Also, during the process of co-reference resolution, cort performs NER and POS tagging and creates the parse tree.

Figure 6: Co-reference Resolution for Figure 2

4.2.2 Who

It is worthwhile to note that not only people but also groups of people, organizations and even locations can be valid whos. By trying to identify who, we are trying to find out who the article is about or, in other words, the subject of the article. We assumed that the who or the subject of the article occurs frequently in the whole article and is also a named entity, and therefore we applied Named Entity Recognition on the entire text of the article as well as the title. We calculated several features for each of the entities detected, and used them to classify entities into those that are who in the article and those that are not. The features that we considered for selecting the appropriate who are as follows.

1. The number of occurrences in text: It is the count of how often a certain entity is mentioned in the article text. It is assumed that the number of occurrences of an entity is strongly correlated with its importance. While counting the number of occurrences, all possible representations of the entity are considered (Microsoft, it, its, etc. in Figure 6).

2. The number of occurrences in the title: More often than not, the subject of the article or the who is present in the title. The title usually contains the most important information of the article such as who and what. Usually this count is 0 or 1.

3. Mean Position: It refers to the mean position of the entity relative to the length of the article text. It is assumed that more important entities occur at the beginning of the article while the less important ones may feature somewhere later. The mean position is calculated as:

$$\mu(e) = \frac{\sum_{i=1}^{n} \mathrm{index}_i(e)}{n \cdot l}$$

where
$e$ is the entity,
$n$ is the number of occurrences of the entity in the article,
$\mathrm{index}_i(e)$ is the index of the $i$-th occurrence of the entity $e$ in the article,
$l$ is the total number of word tokens in the article.

4. Entity Type: This holds the NER tag given to the entity by the Stanford NLP package. Here we focus on the types person, location and organization for the classification of who. It is assumed that the type of the entity as a feature has a high impact on the classification.

Table 1: Features for article in Figure 2

| Entity | Count in Text | Count in Title | Mean Position | Entity Type |
|---|---|---|---|---|
| Microsoft | 4 | 1 | 0.1212 | organization |
| U.S. | 1 | 0 | 0.928 | location |
| Canada | 1 | 1 | 0.9393 | location |
| Germany | 1 | 1 | 0.8787 | location |
| Late last year | 1 | 0 | 0.002 | date |

The entity type feature is broken into three individual features, isLocation, isPerson and isOrganization, and boolean values are given for each entity. This is done to overcome the problem of handling multiple types of data in the same ML algorithm.

Each of the entities of the article is then classified as a who or not using classifiers from the scikit-learn [17] library. From the list of detected entities for each article, we calculate the features mentioned above for each entity and then find the probability of it being a who using the Gaussian Naive Bayes (GNB) classifier. The who candidates are then ranked in decreasing order of probability. We experimented with other classifiers as well, a comparison of which is provided in Table 5. The set of features for the article in Figure 2 is given in Table 1.

For example, for the article in Figure 2, based on the feature values, the trained model gives the possible who candidates and their probabilities as [('Microsoft', 0.9980), ('Canada', 0.887), ('Germany', 0.878)], based on which Microsoft is chosen as the who.

4.2.3 What

Here we attempted to answer the question "who did what" in the article. Since what refers to an action that is being performed, we take it to be a verb phrase. To find what, we start by looking at the who candidates present in the title and their subsequent verb phrases. If such a pair exists, we fix them as the who and what for the event. If none of the who candidates are present in the headline, we search for the first occurrence of the highest ranked who in the article text and take its subsequent verb phrase as what. To look for the verb phrase containing what, we created a parse tree for each sentence and then matched the who candidates in the parse tree. Since the who candidates are always in the noun phrase part, it was easy to find the subsequent verb phrase. In the article in Figure 2, since the highest rated who is present in the title, the what is the subsequent verb phrase in the title itself, i.e., 'launches Azure preview in Germany and Canada'.

4.2.4 When

Events are usually associated with a time or date. We try to extract temporal information that is present within the news articles. Since we also have a publication date of the article, we use that in case the when cannot be extracted from the article text. We used a named entity recognizer capable of extracting time and date entities. The NER from the Stanford NLP package tags entities of types date and duration, which are used for extracting when. Since the NER tagging done for the extraction of who contains all entities and the corresponding features, we filter the entities of type date and duration. Among these filtered entities, the entity with the lowest mean position which is also present in the same sentence as the extracted who is selected as the when. Many times the relevant entities are today, yesterday, next week, etc. But these do not tell us the absolute when. To solve this issue, the date of publication of the article is also maintained, and therefore the actual when is usually relative to the publication date of the article. If none of the entities satisfy these criteria, the when tag is left blank for the article under consideration. For the article in Figure 2, there is only one entity in Table 1 that can possibly be the when, namely 'Late last year'.

4.2.5 Where

To extract the where, we again use NER to find entities labeled as location, because the place where something happens is probably an entity of type location. Whether it is possible to extract where or not strictly depends on the article. Some events specifically mention that the event is taking place in a particular location, like "Apple launches iPhone 6S in India", wherein the information relevant for location may be mentioned several times in the text or even in the headlines. There are also abstract events like "Mark Pincus steps down as CEO of Zynga" in which there is no mention of where. We made use of the same features as used for the extraction of who, but we only considered entities where the entity type was location. We use the same classification technique as in who without the entity type feature and finally take the highest ranked entity as the where. For the article in Figure 2, the possible entities for where are present in Table 1. Among these, the entities 'Germany' and 'Canada' are classified as where.

5. QUERY SYSTEM

In this section we explain how we built our query system. We first created a network of companies, then added the extracted events to the company that they are related to. From this network, the user can make any of the queries listed at the end of this section.

Network Creation

The network consists of companies and company types/domains represented as nodes. Various companies can be related to one another as being competitors or associates. A company can also be related to a company type/domain/product type, if it is manufacturing goods or providing services related to that domain. A company can have relations with multiple domain nodes.

To store the network, we used a NoSQL graph database, Neo4j [20]. In graph databases, relationships are first-class citizens. Therefore, querying relationships becomes easier and faster, as there is no need to use foreign keys, table joins, etc. to infer the relation between entities, as in the case of relational databases. Neo4j provides a neo4j-rest-client API that allows us to use the Neo4j REST server locally through python-embedded. We manually created text files (in CSV format) containing names of some companies, possible domains, relations between pairs of companies and relations between a company and a domain. The network creation module takes these CSV files as input. Once the input is given, the module creates the network and stores it in a Neo4j graph database. During the process of extraction, if the system encounters a company that does not exist in the network, it automatically adds it. An example network is given in Figure 8.

Figure 8: Network Of Companies

Adding and Querying Events

Once the network is created and the events are extracted from the news articles, the next step is to add the events to the graph database. To do this, we created an event node for each event. An event node consists of the eventID, the URL and date of publication of the article from which the event was extracted, and the details of the event: who, what, when and where. Some of the details such as when and where may not be available for all events. In cases where they are not available, the value is set to null. Also, while adding an event to the network, if the company that the event is related to does not exist in the network, then a new node for that company is created, and the event node is added to that node. The newly created node, however, does not have any competitors/associates set and does not belong to any particular category. These relations have to be added manually.

It can happen that more than one copy of the same article gets scraped from different sources. Therefore, if the exact same event already exists in the database, the system detects the duplication and does not add the event. However, if two different articles report the same event in different ways, then both get added to the database.

Once the network of companies, company types and events is created, a query system is built on top of the database. The user is given an option to choose one of the following queries:

• get events related to a particular company
• get events related to a group of companies that belong to the same category
• get events related to the competitors or associates of a particular company.

Some of the query results obtained are shown in Section 6.

6. RESULTS

Evaluation of the Extractor

Who Classifier

To test the working of the who classifier, we needed a data set where the entity that represents the who in a given article was already marked. Since no such data set was readily available, we manually tagged about 100 articles, which had roughly 2500 entities. For testing the model, we used business articles that were available in TechCrunch [10] and Reuters [9]. Each entity was marked 1 if it was the who of the article; otherwise it was marked 0. A total of 1500 entities were used to train the model. The remaining entities were used to test the model. We trained several classifiers on our feature set, and the results on the test set are summarized in Table 5. For the data set described above, we proceeded with the Gaussian Naive Bayes classifier, which gave 89% accuracy.

Table 5: Results for WHO

| Classifier | Accuracy (in percent) | True Positive Rate |
|---|---|---|
| Gaussian Naive Bayes | 89 | 0.7 |
| Multinomial Naive Bayes | 92 | 0.5 |
| Logistic Regression | 91 | 0.4 |

What

We tried extracting what for all the 100 articles, for which what was also manually tagged. The process gave an accuracy of 78.57%. This result for what was obtained when we used Gaussian Naive Bayes for extracting who. We observed that in many articles the what was not being recognized correctly because there was no who recognized or the recognized who was not the right one. For most of the articles where who was recognized correctly, what was also recognized correctly. Out of the 100 articles that were used, only 11 contained a where. Out of those 11, 10 were correctly extracted. Since most articles in the test set were announcements related to the release of a product, or a hike/dip in the stock value of a company, etc., there was no explicit date for the event. Most of the whens extracted were of the kind 'today', 'last week', 'yesterday', etc. Table 2 lists some of the events extracted from their respective articles.

Table 2: Extracted Events from News

| who | what | where | when | publication date |
|---|---|---|---|---|
| Google (a) | loses android antitrust appeal in Russia | Russia | Feb 2015 | Mon, 14 Mar 2016 |
| Zynga (b) | beats expectations in q2 with $ 200m in revenue | - | Today | Thu, 06 Aug 2015 |
| AOL (c) | got acquired by giant U.S. carrier Verizon for $ 4.4 billion | London | - | Fri, 13 Nov 2015 |
| LinkedIn (d) | revamps its jobs listings with big data analytics | - | Today | Tue, 15 Dec 2015 |
| Square (e) | launches payments in Australia , its first country expansion in nearly three years | Australia | recently | Tue, 08 Mar 2016 |

(a) http://feedproxy.google.com/~r/techcrunch/android/~3/Pwj0hJmsS_A/
(b) http://feedproxy.google.com/~r/TechCrunch/Zynga/~3/lY_jErmRQtg/
(c) http://feedproxy.google.com/~r/TechCrunch/Aol/~3/MvY-Z07lg4Y/
(d) http://feedproxy.google.com/~r/TechCrunch/Linkedin/~3/e1c2SHVTk0g/
(e) http://feedproxy.google.com/~r/TechCrunch/Square/~3/tUGW4fEhccg/

Query System

As mentioned previously, the user can make 3 different queries, and can request events related to:

• a particular organization
• the competitors and associates of a particular organization
• organizations working in a particular sector

The results obtained for the first two queries, made on the network shown in Figure 8, are shown in Tables 3 and 4. The last query will print the events in Tables 3 and 4, as Amazon, Google and Microsoft work in the cloud sector.

Table 3: Events related to "Google"

| index | who | what | when | where | publication date |
|---|---|---|---|---|---|
| 1 | Google | wouldn't start talking about Android N . . . | May | | Wed, 09 Mar 2016 |
| 2 | Google | launches the android experiments i/o challenge for open-source app developers | today | | Fri, 25 Mar 2016 |
| 3 | Google | loses android antitrust appeal in Russia | Feb 2015 | Russia | Mon, 14 Mar 2016 |

Table 4: Events related to Associates and Competitors of "Google"

| company | who | what | when | where | publication date |
|---|---|---|---|---|---|
| Amazon | Amazon | may be able to take business away . . . | last November | UK | Mon, 29 Feb 2016 |
| Amazon | Amazon | says it will bring device encryption back to fire os | Less than a day | | Sat, 05 Mar 2016 |
| Microsoft | Microsoft | launches Azure preview in Germany and Canada | Late last year | Germany | Tue, 15 Mar 2016 |

7. CONCLUSIONS AND FUTURE WORK

EEQuest allows the user to query events related to a particular organization or a group of organizations. For extraction of events we have used some standard natural language processing and supervised machine learning techniques. We have also created a network of companies and added the extracted events to this network. Finally, we have developed a query system with which the user can get events related to a particular organization or a group of organizations. We ran experiments to evaluate the performance of the system. For the extraction of who and what, the model gave an accuracy of 89% and 78.57% respectively. We have proposed a mechanism and demonstrated it with publicly available news data. There might be cases where private data of an organization is present. In such cases, the organization can extract events from its own data and create its own queryable network using EEQuest.

In the current system, the areas in which an organization works have to be added manually. But the system can be extended to learn to derive those areas automatically. We can also use additional features as well as a more sophisticated machine learning technique for event extraction. One more challenge that needs to be addressed is that of scale. As the network grows, it would not fit on a single device. Work needs to be done to make the system handle big data. Also, as the number of data sources increases, there might be redundancy in the events detected, since different sources may report the same event. In this case, event co-reference resolution needs to be done to ensure that only one instance of an event is added to the network.

8. REFERENCES

[1] Borsje, J., Hogenboom, F., Frasincar, F. (2010). Semi-Automatic Financial Events Discovery Based on Lexico-Semantic Patterns. International Journal of Web Engineering and Technology, vol. 6(2), pp. 115–140.
[2] Hristo Tanev, Jakub Piskorski, Martin Atkinson (2008). Real-Time News Event Extraction for Global Crisis Monitoring. Proceedings of the 13th International Conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems, June 24-27, 2008, London, UK. doi:10.1007/978-3-540-69858-6_21
[3] Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F. (2011). An overview of event extraction from text. Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), Bonn, Germany, October 2011.
[4] Alexandra La Fleur, Kia Teymourian, and Adrian Paschke (2015). Complex event extraction from real-time news streams. In Proceedings of the 11th
[9] Reuters India News RSS, Subscribe RSS News Feeds, Latest news India (2016). Reuters India. Retrieved 7 June 2016, from http://in.reuters.com/tools/rss
[10] Riggs, D. (2016). TechCrunch. Retrieved 2 June 2016, from http://techcrunch.com/topic/company/
[11] Lucas Ou-Yang. Newspaper: Article scraping & curation - newspaper 0.0.2 documentation (2016). Newspaper.readthedocs.org. Retrieved 11 April 2016, from http://newspaper.readthedocs.org/en/lates
[12] Wunderwald, Martin (2011). "NewsX Event Extraction from News Articles", diploma thesis, Dresden University of Technology, Dresden, Germany. Retrieved 2 June 2016 from http://www.rn.inf.tu-dresden.de/uploads/Studentische_Arbeiten/Diplomarbeit_Wunderwald_Martin.pdf
[13] Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[14] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
[15] Sebastian Martschat and Michael Strube (2015). Latent Structures for Coreference Resolution. Transactions of the Association for Computational Linguistics, 3, pp. 405-418.
[16] Lardinois, F. (2016). Microsoft launches Azure preview in Germany and Canada, announces DoD-specific regions in U.S. TechCrunch.
[17] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae and Peter Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt and G.
Varoquaux, ”API design for International Conference on Semantic Systems machine learning software: experiences from the (SEMANTICS ’15), Sebastian Hellmann, Josiane scikit-learn project” in ECML PKDD Workshop: Xavier Parreira, and Axel Polleres (Eds.), pp. 9–16. Languages for Data Mining and Machine Learning, [5] Frasincar, F., Borsje, J., Levering, L. (2009). A 2013, pp. 108-122. Retrieved 2 June 2016, from Semantic Web-Based Approach for Building http://techcrunch.com/2016/03/15/microsoft- Personalized News Services. International Journal of launches-azure-preview-in-germany-and-canada- E-Business Research 5(3), pp. 35–53. announces-dod-specific-regions-in-u-s/?ncid=rss&utm [6] Congle Zhang, Stephen Soderland, and Daniel S. Weld source=feedburner&utm medium=feed&utm campaign (2015). Exploiting parallel news streams for =Feed%3A+techcrunchIt+%28TechCrunch+IT%29 unsupervised event extraction. Transactions of the [18] E. Arendarenko and T. Kakkonen (2012). Association for Computational Linguistics, vol. 3, pp. “Ontology-based information and event extraction for 117–129. business intelligence”, in The 15th International [7] Anthony Fader, Stephen Soderland, and Oren Etzioni. Conference Artificial Intelligence: Methodology, 2011. Identifying relations for open information Systems, and Applications (AIMSA 2012), Varna, extraction. In Proceedings of the Conference on Bulgaria, September 12–15, 2012, Lecture Notes in Empirical Methods in Natural Language Processing Computer Science vol. 7557, pp. 89–102, Springer. (EMNLP ’11). Association for Computational [19] Schaurer, F., & St¨ orger, J. (2013). The Evolution of Linguistics, Stroudsburg, PA, USA, 1535-1545. Open Source Intelligence (OSINT). Journal of U.S. [8] M Schmitz, R Bart, S Soderland, O Etzioni. 2012. Intelligence Studies, vol. 19 (3), Winter/Spring 2013, Open Language Learning for Information Extraction. pp. 53–56. 
In Proceedings of Conference on Empirical Methods in [20] Neo4j: The World’s Leading Graph Database (2016). Natural Language Processing and Computational Neo4j Graph Database. Retrieved 11 April 2016, from Natural Language Learning (EMNLP-CONLL). http://neo4j.com/
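The three query types listed in the Query System part of Section 6 can be illustrated with a small sketch. This is not the authors' implementation: EEQuest stores its network in Neo4j, whereas the sketch below uses plain in-memory Python structures. The names `Event`, `RELATED`, `SECTORS` and the `events_for_*` functions are hypothetical, introduced only for illustration; the sample events are taken from Tables 3 and 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One extracted event: who did what, when and where."""
    who: str
    what: str
    when: str = ""
    where: str = ""


# Sample events copied from Tables 3 and 4 (empty fields left blank).
EVENTS = [
    Event("Google", "Google launches the android experiments i/o challenge "
                    "for open-source app developers", when="today"),
    Event("Google", "Google loses android antitrust appeal in Russia",
          when="Feb 2015", where="Russia"),
    Event("Amazon", "Amazon says it will bring device encryption back to fire os"),
    Event("Microsoft", "Microsoft launches Azure preview in Germany and Canada",
          when="Late last year", where="Germany"),
]

# Hypothetical encoding of the company network: competitor/associate links
# and sector membership. In EEQuest this lives in a graph database.
RELATED = {"Google": {"Amazon", "Microsoft"}}
SECTORS = {"cloud": {"Amazon", "Google", "Microsoft"}}


def events_for_orgs(orgs):
    """Return all events whose `who` is one of the given organizations."""
    return [e for e in EVENTS if e.who in orgs]


def events_for_org(org):
    """Query 1: events related to a particular organization."""
    return events_for_orgs({org})


def events_for_related(org):
    """Query 2: events related to the competitors and associates of `org`."""
    return events_for_orgs(RELATED.get(org, set()))


def events_for_sector(sector):
    """Query 3: events related to organizations working in `sector`."""
    return events_for_orgs(SECTORS.get(sector, set()))
```

With this sample data, `events_for_org("Google")` returns the two Google events, `events_for_related("Google")` returns the Amazon and Microsoft events, and `events_for_sector("cloud")` returns all four, which mirrors the observation that the sector query prints the events of both tables.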


8. REFERENCES
[1] Borsje, J., Hogenboom, F., Frasincar, F. (2010). Semi-Automatic Financial Events Discovery Based on Lexico-Semantic Patterns. International Journal of Web Engineering and Technology vol. 6(2), pp. 115-140.
[2] Hristo Tanev, Jakub Piskorski, Martin Atkinson (2008). Real-Time News Event Extraction for Global Crisis Monitoring. Proceedings of the 13th International Conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems, June 24-27, 2008, London, UK. doi:10.1007/978-3-540-69858-6_21
[3] Hogenboom, F., Frasincar, F., Kaymak, U., de Jong, F. (2011). An overview of event extraction from text. Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), Bonn, Germany, October 2011.
[4] Alexandra La Fleur, Kia Teymourian, and Adrian Paschke (2015). Complex event extraction from real-time news streams. In Proceedings of the 11th International Conference on Semantic Systems (SEMANTICS '15), Sebastian Hellmann, Josiane Xavier Parreira, and Axel Polleres (Eds.), pp. 9-16.
[5] Frasincar, F., Borsje, J., Levering, L. (2009). A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research 5(3), pp. 35-53.
[6] Congle Zhang, Stephen Soderland, and Daniel S. Weld (2015). Exploiting parallel news streams for unsupervised event extraction. Transactions of the Association for Computational Linguistics, vol. 3, pp. 117-129.
[7] Anthony Fader, Stephen Soderland, and Oren Etzioni (2011). Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1535-1545.
[8] M. Schmitz, R. Bart, S. Soderland, O. Etzioni (2012). Open Language Learning for Information Extraction. In Proceedings of Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CONLL).
[9] Reuters India News RSS, Subscribe RSS News Feeds, Latest news India (2016). Reuters India. Retrieved 7 June 2016, from http://in.reuters.com/tools/rss
[10] Riggs, D. (2016). TechCrunch. Retrieved 2 June 2016, from http://techcrunch.com/topic/company/
[11] Lucas Ou-Yang. Newspaper: Article scraping & curation - newspaper 0.0.2 documentation (2016). Newspaper.readthedocs.org. Retrieved 11 April 2016, from http://newspaper.readthedocs.org/en/lates
[12] Wunderwald, Martin (2011). "NewsX Event Extraction from News Articles", diploma thesis, Dresden University of Technology, Dresden, Germany. Retrieved 2 June 2016 from http://www.rn.inf.tu-dresden.de/uploads/Studentische_Arbeiten/Diplomarbeit_Wunderwald_Martin.pdf
[13] Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[14] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
[15] Sebastian Martschat and Michael Strube (2015). Latent Structures for Coreference Resolution. Transactions of the Association for Computational Linguistics, vol. 3, pp. 405-418.
[16] Lardinois, F. (2016). Microsoft launches Azure preview in Germany and Canada, announces DoD-specific regions in U.S. TechCrunch. Retrieved 2 June 2016, from http://techcrunch.com/2016/03/15/microsoft-launches-azure-preview-in-germany-and-canada-announces-dod-specific-regions-in-u-s/?ncid=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+techcrunchIt+%28TechCrunch+IT%29
[17] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt and G. Varoquaux (2013). "API design for machine learning software: experiences from the scikit-learn project", in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108-122.
[18] E. Arendarenko and T. Kakkonen (2012). "Ontology-based information and event extraction for business intelligence", in The 15th International Conference Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2012), Varna, Bulgaria, September 12-15, 2012, Lecture Notes in Computer Science vol. 7557, pp. 89-102, Springer.
[19] Schaurer, F., & Störger, J. (2013). The Evolution of Open Source Intelligence (OSINT). Journal of U.S. Intelligence Studies, vol. 19 (3), Winter/Spring 2013, pp. 53-56.
[20] Neo4j: The World's Leading Graph Database (2016). Neo4j Graph Database. Retrieved 11 April 2016, from http://neo4j.com/
