Computers and devices are largely unaware of events taking place in the world. This could be changed if news were made available in a computer-understandable form. In this paper we present XML documents called NewsForms that represent the key points of 17 types of news events. We discuss the benefits of computer-understandable news and present the NewsExtract program for converting text news stories into NewsForms.
XML, news, information extraction
Computers and devices today have little or no awareness of events taking place in the real world:
The rise of XML [3] presents a unique opportunity to change this. An XML document type can be created that represents the content of news events. Then news stories can be made available as XML documents to computers at the same time that text news stories are made available to people.
In fact, two XML document types for news have been defined.
The News Industry Text Format (NITF) [9] was created by the International Press Telecommunications Council and the Newspaper Association of America. NITF is a text markup language that enables reuse of news stories across print publications, broadcast news, databases, and the web. It incorporates a number of HTML tags for structure, appearance, and linking such as a, hr, p, and pre. To those, it adds tags for metadata or information about a news story such as byline, distributor, and keyword. Finally, it includes tags for marking up entities in the story such as event, location, money, org, and postaddr.
XMLNews [17, 18] was created by WavePhore, a news amalgamator, in order to provide web sites with a feed of news stories gathered from various sources in a standard format. XMLNews is split into two parts: XMLNews-Story is an easier-to-implement subset of NITF for marking up text news stories, while XMLNews-Meta provides information about news stories, text or otherwise.
NITF and XMLNews help make news more understandable to computers, but they do not go far enough. First of all, knowing what entities are present in a story is not sufficient. It is also important to know how those entities relate to one another. Consider the following XMLNews-Story markup of the first paragraph of a story about an earthquake:
<p>An <event>earthquake</event> struck <location>western
<country>Colombia</country></location> on
<chron norm="19990125">Monday</chron>, killing at least 143 people and
injuring more than 900 as it toppled buildings across the country's
coffee-growing heartland, <function>civil defense officials</function>
said.</p>
[19]
This markup does not specify that western Colombia was the location of the earthquake rather than some other event mentioned later in the story. It does not specify that 19990125 was the date of the earthquake and not some other event. Nor does it specify the relationship of the numbers 143 and 900 to the earthquake, namely, that 143 is the minimum number of people killed in the earthquake and 900 is the minimum number of people injured.
Furthermore, what if the story had said a quake shook western Colombia? Then the markup would have been:
<p>A <event>quake</event> shook ...
How is a computer to know that a quake is the same as an earthquake? What if the event is a trial? How is a computer to know whether this refers to a legal proceeding or the act of testing something? Natural language is highly ambiguous, so marking up text news stories with tags such as event is not sufficient. Rather, it is important to represent concepts in a standard, unambiguous format.
At this point, one might be tempted to use an AI knowledge representation language such as CycL [5] or the web-based SHOE [8] to represent the meaning of every sentence in a news story. Our solution is simpler and more practical, along the lines of the templates defined in the United States government-sponsored Message Understanding Conferences (MUCs) [7].
We propose an XML document type, called NewsForm, for representing the key points of common news events. Though not all events fall into predefined classes, many do. NewsForm documents, which we call NewsForms, describe 17 types of news events: competitions, deals, earnings reports, economic releases, Fed watching, IPOs, injuries and fatalities, joint ventures, legal events, medical findings, negotiations, new products, management successions, trips and visits, votes, war, and weather reports. The NewsForm document type defines:
Here is a NewsForm for the above earthquake story:
<NewsForm>
  <Head>
    <DatelineTime>19990125T181917Z</DatelineTime>
  </Head>
  <InjuryFatality>
    <Cause>Earthquake</Cause>
    <InjuredCount>900</InjuredCount>
    <KilledCount>143</KilledCount>
    <Source><Function>Civil Defense Official</Function></Source>
    <AtLocation>
      <Country>COL</Country>
      <Latitude>4.29</Latitude>
      <Longitude>-75.68</Longitude>
    </AtLocation>
  </InjuryFatality>
</NewsForm>
This makes it clear that Colombia is the location of an InjuryFatality caused by an Earthquake, resulting in 900 injuries and 143 deaths.
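To illustrate how a program might consume such a document, here is a minimal Python sketch that reads the earthquake NewsForm shown above with the standard library's xml.etree.ElementTree. The function name summarize_injury_fatality and the shape of the returned dictionary are our own illustrative choices, not part of the NewsForm specification.

```python
import xml.etree.ElementTree as ET

# The earthquake NewsForm from the text, reproduced as a string.
NEWSFORM = """<NewsForm>
  <Head><DatelineTime>19990125T181917Z</DatelineTime></Head>
  <InjuryFatality>
    <Cause>Earthquake</Cause>
    <InjuredCount>900</InjuredCount>
    <KilledCount>143</KilledCount>
    <Source><Function>Civil Defense Official</Function></Source>
    <AtLocation>
      <Country>COL</Country>
      <Latitude>4.29</Latitude>
      <Longitude>-75.68</Longitude>
    </AtLocation>
  </InjuryFatality>
</NewsForm>"""


def summarize_injury_fatality(xml_text):
    """Return the key fields of the first InjuryFatality event, or None.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    event = root.find("InjuryFatality")
    if event is None:
        return None
    return {
        "cause": event.findtext("Cause"),
        "injured": int(event.findtext("InjuredCount", default="0")),
        "killed": int(event.findtext("KilledCount", default="0")),
        "country": event.findtext("AtLocation/Country"),
    }


summary = summarize_injury_fatality(NEWSFORM)
print(summary)
```

Because the counts and the location are child elements of the event itself, no natural language processing is needed to relate them to the earthquake.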
We have developed a program called NewsExtract for converting text news stories into NewsForms, and a news search engine called Padoof that uses NewsForms.
In the remainder of this paper, we discuss the benefits of computer-understandable news, how news events are represented by NewsForms, the creation of NewsForms, the NewsExtract program, and related and future work.
In this section we describe some of the benefits of computer-understandable news in the areas of tracking and notification, research studies and informational graphics, world-aware software, ubiquitous computing, and accessibility.
A number of web-based services allow tracking and notification of news stories, Usenet postings, and web page updates on topics of interest to the user. In order to be notified when Bell Atlantic is involved in a merger or acquisition, the user must enter appropriate keywords and phrases such as Bell Atlantic, BEL, acquisition, acquire, and buy. Some news services provide indexing terms that simplify this task. The Dow Jones search statement:
ns=tnm and co=bel
finds articles indexed as being about acquisitions, mergers, or takeovers and containing Bell Atlantic's stock symbol. But using these services the user cannot perform highly focused queries such as finding only those stories in which Bell Atlantic is the target of a deal.
In contrast, the Deal NewsForm element specifies the acquirer and target, so the user can request a Deal whose Target's Ticker is BEL.
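A query of this kind could be sketched in Python as follows. We assume, for illustration, that Ticker appears as a direct child of Target and Acquirer (mirroring how Function appears directly under Source in the earthquake example); the function name deals_targeting and the ticker XYZ are hypothetical.

```python
import xml.etree.ElementTree as ET


def deals_targeting(newsforms, ticker):
    """Return the parsed NewsForms that contain a Deal whose Target has the given Ticker."""
    matches = []
    for xml_text in newsforms:
        root = ET.fromstring(xml_text)
        for deal in root.iter("Deal"):
            # Assumed layout: <Deal><Target><Ticker>...</Ticker></Target></Deal>
            if deal.findtext("Target/Ticker") == ticker:
                matches.append(root)
                break
    return matches


# Two made-up NewsForms: BEL as the target, and BEL as the acquirer.
AS_TARGET = (
    "<NewsForm><Deal>"
    "<Acquirer><Ticker>XYZ</Ticker></Acquirer>"
    "<Target><Ticker>BEL</Ticker></Target>"
    "</Deal></NewsForm>"
)
AS_ACQUIRER = (
    "<NewsForm><Deal>"
    "<Acquirer><Ticker>BEL</Ticker></Acquirer>"
    "<Target><Ticker>XYZ</Ticker></Target>"
    "</Deal></NewsForm>"
)

hits = deals_targeting([AS_TARGET, AS_ACQUIRER], "BEL")
print(len(hits))
```

A keyword search for BEL would return both stories; the structured query returns only the one in which BEL is the target.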
NewsForms allow other queries such as:
find stories about the Federal Reserve raising interest rates
find stories about someone pleading guilty
find negative earnings reports
The Padoof NewsForm-powered search engine [14] accepts such queries and allows sorting of results by the content of NewsForm elements (such as DatelineTime, AtLocation, or InjuredCount).
Online trading sites allow the user to request e-mail notification when the price of a financial instrument hits a certain level. NewsForms take this a step further, enabling computation and tracking of novel statistics on news events such as the number of product announcements per day or the number of announced deals in an industry over the last week.
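As a rough sketch of such a statistic, the following Python counts events of a given type per day, taking the day from the first eight characters of Head/DatelineTime (the YYYYMMDD part of the timestamp format seen in the earthquake example). The function name events_per_day and the sample documents are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def events_per_day(newsforms, event_tag):
    """Count top-level events named event_tag, grouped by YYYYMMDD dateline day."""
    counts = Counter()
    for xml_text in newsforms:
        root = ET.fromstring(xml_text)
        stamp = root.findtext("Head/DatelineTime")
        if stamp is None:
            continue  # no dateline, so the event cannot be assigned to a day
        n = len(root.findall(event_tag))
        if n:
            counts[stamp[:8]] += n
    return counts


def _form(stamp, body):
    # Helper that wraps an event body in a NewsForm with a dateline (illustrative only).
    return (
        "<NewsForm><Head><DatelineTime>%s</DatelineTime></Head>%s</NewsForm>"
        % (stamp, body)
    )


SAMPLE = [
    _form("19990125T120000Z", "<NewProduct><ProductStatus>Released</ProductStatus></NewProduct>"),
    _form("19990125T150000Z", "<NewProduct><ProductStatus>Recalled</ProductStatus></NewProduct>"),
    _form("19990126T090000Z", "<NewProduct><ProductStatus>Released</ProductStatus></NewProduct>"),
    _form("19990126T100000Z", "<Deal><DealStatus>Agreed</DealStatus></Deal>"),
]

per_day = events_per_day(SAMPLE, "NewProduct")
print(sorted(per_day.items()))
```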
Computer-understandable news facilitates historical studies on news events such as:
the impact of news on financial markets
the impact of financial markets on news
the effect of management successions on company performance
the effect of layoffs on company performance
the effect of takeovers on stock prices
Computer-understandable news can help in the production of informational graphics. A number of statistics could be computed and visualized such as:
changes of government versus time
CEO changes versus time
IPOs versus time
IPO locations
Padoof displays a United States or world map showing the geographical distribution of retrieved events. On the map, a red B indicates a negative event, a green G indicates a positive event, and a blue X indicates other events.
Computer-understandable news allows calendar and other application programs to be more aware of the world. We are building a smart calendar program [12] that helps the user avoid obvious blunders such as entering a dinner appointment for 6 am instead of 6 pm, or taking a vegetarian to a steakhouse. With awareness of news, the program could inform the user if an appointment is in a building that has just been evacuated due to an accident, or suggest the user arrange an alternative means of transportation in case of a subway outage.
In an e-commerce application, when the user is about to order a product, the application could inform the user if a new product of a related type was just announced. The user might wish to investigate the new product before making the purchase. The application could also inform the user if bad news was just announced about the manufacturer.
Computer-understandable news opens up the possibility of news-aware rooms, furniture, toys, and clothing. Based on the outcome of a news event such as a vote or legal proceeding, a room could alter the lighting and music, furniture could change color, or stuffed animals could alter their facial expression.
Sounds could be associated with each type of news event in a geographical region to give a feel for what is happening in that region. A competition event would produce a cheering crowd, a car crash event would produce a crashing sound, and a thunderstorm would produce thunder. The user might wish to monitor the current region, the region of family or friends, or a travel destination.
NewsForms make information about news stories accessible in a compact, mostly text-free format. For people who have difficulty reading text, NewsForms can be hooked up to audible, visual, tactile, and other transducers. NewsForms do not contain graphics and are suitable for transmission over a slow net connection.
In this section we first present the NewsForm elements for representing people, locations, and organizations. We then present the 17 NewsForm elements for representing news events. Here we present only the immediate child elements of each element. The complete NewsForm specification and DTD are presented elsewhere [10, 11].
The Person element contains the following elements:
Additional: middle initial, middle name, or additional name
Age: age in years
Country: origin or nationality country (ISO 3166 [13] three-letter uppercase country code such as USA and FRA)
email address
Family: family name
Function: job title
Given: given name
Prefix: prefix
Sex: sex (Female or Male)
Suffix: suffix
URL: URL
The Location element contains the following elements:
City: city name
Continent: continent (Africa, Antarctica, Asia, Europe, NorthAmerica, or SouthAmerica)
Country: country (ISO 3166 three-letter uppercase country code)
Latitude: latitude (north is positive, south is negative)
Longitude: longitude (east is positive, west is negative)
Region: region (such as Appalachia or Scotland)
State: state (United States Postal Service 2-letter uppercase state/province/possession abbreviation as defined in USPS Publications 28 [15] and 65 [16] such as CT and PQ)
URL: URL
The Organization element describes a company, government agency, or other organization. It contains the following elements:
email address
FullName: full name
Nickname: nickname
OrganizationType: industry or type of organization
Sport: sport for a sports team (Archery, AutoRacing, Badminton, Baseball, Basketball, Biathlon, Boating, Bobsledding, Boxing, Cricket, Curling, Cycling, ExtremeSports, Fencing, Fishing, Football, Golf, Gymnastics, HighJump, Hockey, HorseRacing, Javelin, LongJump, Martial Arts, Olympics, PoleVault, Rodeo, Rowing, Rugby, Running, ShotPut, Snowboarding, Soccer, Softball, Sport, Tennis, Track, Triathlon, Volleyball, WaterSports, Weightlifting, WinterSports, or Wrestling)
Ticker: exchange ticker for a company
URL: URL
The Competition element contains the following elements:
CompetitionCode: competition (such as ATPTour and WorldSeriesOfGolf)
CompetitionOutcome: outcome of the competition (Loss, Tie, or Win)
Player: player who won, lost, or tied (Person)
Sport: sport of the competition (see Organization above)
Team: team who won, lost, or tied (Organization)
The Deal element describes mergers and acquisitions. It contains the following elements:
Acquirer: acquiring company (Organization)
Advisor: advisor (Organization)
DealStatus: deal status (Rumored, InTalks, Agreed, Approved, Completed, or Failed)
DealValue: deal value (Money)
SharePrice: share price for tender offer (Money)
Stake: percentage stake
StockRatio: stock ratio (x means one share of Acquirer stock for every x shares of Target stock)
Successor: successor company for consolidation (Organization)
Survivor: surviving company for merger (Organization)
Target: company being acquired (Organization)
The Earnings element describes earnings announcements and reports. It contains the following elements:
Company: company (Organization)
EPS: earnings per share (Money whose Currency is represented using an extended ISO 4217 three-letter currency identifier such as GBP, JPY, and USD)
EarningsAmount: earnings (Money)
GoodBad: good or bad earnings report (Good or Bad)
Loss: amount of loss (Money)
PreviousEPS: previous earnings per share (Money)
PreviousEarnings: previous earnings amount (Money)
Sales: sales/revenues (Money)
SalesPS: sales/revenues per share (Money)
The EconomicRelease element describes economic releases. It contains the following elements:
AnnualRate: annual rate
Direction: direction (Down, Unchanged, or Up)
EconomicReleaseType: type of economic release (such as AdvanceRetailSales and WorldPopulation)
Growth: growth of number (Money)
GrowthRate: growth rate of number
PreviousRate: previous rate
Rate: rate
Source: source of economic release (Organization or Person)
The FedWatch element describes actions by the United States Federal Reserve and other organizations. It contains the following elements:
Actor: organization performing action
FedAction: action (Hold, Lower, or Raise)
InterestRate: interest rate affected (1MCommercialPaperRate, 3MCommercialPaperRate, 6MCommercialPaperRate, BaseRate, CommercialPaperRate, DiscountRate, FederalFundsRate, FederalFundsTarget, InterestRate, PrimeRate, TBill, TBill1Y, TBill3M, TBill6M, TBond30Y, TNote2Y, TNote5Y, or TNote10Y)
Rate: rate
The IPO element describes initial public offerings. It contains the following elements:
Company: company (Organization)
MarketCap: market capitalization (Money)
Raised: amount raised (Money)
Shares: shares offered
Stake: percentage stake in company represented by new issue
The InjuryFatality element describes injuries and fatalities. It contains the following elements:
AccidentCar: car involved in an accident
AccidentPlane: plane involved in an accident
AtLocation: location of the injury/fatality
Boat: boat involved in an accident (Battleship, Boat, CabinCruiser, Cruiser, Dinghy, InflatableDinghy, LifeRaft, Lifeboat, PassengerShip, Powerboat, Raft, Ship, SmallBoat, Steamship, Vessel, Warship, or Windsurfer)
Cause: cause of injury or fatality (AerialBomb, Alert, BoatCollision, Bomb, CarCrash, Curfew, Disaster, Dynamite, Earthquake, Evacuation, Explosive, Fire, Firebomb, Grenade, Mine, MolotovCocktail, PlaneCrash, or VehicleBomb)
CauseEvent: event (such as illness or weather) causing the injury or fatality
Hospitalized: person hospitalized
Injured: person injured
InjuredCount: number of people injured
Killed: person killed
KilledCount: number of people killed
LandedPlane: plane that landed
Source: source of information (Organization or Person)
SurvivedBy: description of those surviving the person killed
The JointVenture element describes joint ventures and other arrangements. It contains the following elements:
Company: company involved in the joint venture or other arrangement (Organization)
Item: product involved in the joint venture
JointVentureType: type of joint venture (Agreement, Alliance, Deal, DistributionAgreement, LaunchContract, LicensingAgreement, MarketingAlliance, Partnership, PromotionAgreement, Relationship, StrategicAlliance, or Venture)
Source: source of information (Organization or Person)
The LegalEvent element describes legal events, including arrests, lawsuits, pleas, testimony, judgments, sentencing, and releases. It contains the following elements:
AccusationAction: accusation (AggravatedAssault, Assassination, Assault, CapitalMurder, Conspiracy, ConspiringToKill, Crime, DisorderlyConduct, FirstDegreeMurder, Genocide, Harassment, InvoluntaryManslaughter, Manslaughter, Massacre, Murder, ObstructionOfJustice, PremeditatedMurder, RacialHarassment, Rape, SecondDegreeMurder, SexualAssault, SexualHarassment, Slaughter, Torture, or Wrongdoing)
Accused: accused entity or defendant (Organization or Person)
Accuser: accusing entity or plaintiff (Organization or Person)
Arbiter: arbiter or judge (Organization or Person)
Arrested: person arrested
Attorney: attorney (Person)
Award: award (Money)
DispositionMethod: disposition method (CourtTrial, JuryTrial, SummaryJudgment, ConsentJudgment, DefaultJudgment, DirectedVerdict, ArbitrationAward, Settlement, Dismissal, or Transfer)
Forum: forum or court (Organization)
Judgment: judgment or finding (Guilty or Innocent)
LegalAction: legal action (Argue, Arrest, Charge, File, Judge, Plead, Release, Sentence, Settle, or Testify)
LegalFiling: legal filing (Complaint, Motion, ObscenityComplaint, Pleading, or Suit)
Plea: plea (Guilty or Innocent)
Released: person released
Releaser: entity releasing the person (Organization or Person)
SentenceDuration: duration of sentence
SentenceType: type of sentence (Execution, Jail, JailLife, JailLifeWithoutPossibleParole, JailLifeWithPossibleParole, Probation, or StateCustody)
Witness: testifying witness (Person)
The MedicalFinding element describes medical findings. It contains the following elements:
Illness: illness (such as AbdominalPain, Chlamydia, and Osteoporosis)
IllnessFactor: possible factor in illness (AirPollution, Alcohol, AnabolicSteroids, BreastImplant, CigarSmoking, CigaretteSmoking, Circumcision, Cocaine, Contraception, ContraceptivePill, Dieting, Ecstasy, Heroin, Hysterectomy, Immunization, LSD, Mescaline, Opium, Pollution, Smoking, Stress, Tobacco, or Vaccination)
The Negotiation element describes negotiations. It contains the following elements:
Agreement: agreement (Accord, Agreement, FinalSettlement, Legislation, Measure, PeaceAgreement, PeaceDeal, PeaceTreaty, Settlement, or Treaty)
NegotiationStatus: negotiation status (AgreementReached, InitialTalks, or Talks)
Negotiator: negotiator (Person)
Party: involved party (Organization or Person)
The NewProduct element describes new product releases and product recalls. It contains the following elements:
Company: company (Organization)
Item: product
Price: price of product (Money)
ProductStatus: status of product (Released or Recalled)
Source: source of information (Organization or Person)
SupportFor: something that product provides support for
The Succession element describes management successions. It contains the following elements:
Employer: employer (Organization or Person)
Function: job title
In: person entering job
Out: person leaving job
Source: source of information (Organization or Person)
The Trip element describes trips and visits. It contains the following elements:
Host: host of visit (Organization or Person)
ToLocation: location of visit (Location)
Visitor: visitor (Person)
VisitorCount: number of visitors
The Vote element describes votes. It contains the following elements:
Against: number of votes against
InFavor: number of votes in favor
Law: name of law
Legislation: legislation (Amendment, Bill, ConcurrentResolution, CongressionalJointResolution, HouseAmendment, HouseBill, HouseConcurrentResolution, HouseJointResolution, HouseResolution, JointResolution, Resolution, SenateAmendment, SenateBill, SenateConcurrentResolution, SenateJointResolution, or SenateResolution)
Signer: signer (Person)
VoteStatus: outcome or status of vote (Passed, Rejected, Signed, or VetoThreat)
VotingBody: voting body (Organization)
The War element describes wars and conflicts. It contains the following elements:
ArmedConflict: type of armed conflict (AirBattle, AirStrike, ArmedConflict, ArtilleryFire, Attack, Battle, Bombing, CivilUnrest, CivilWar, Clash, Conflict, Coup, Fighting, Fire, GuerrillaActivities, Hostilities, LandBattle, LandWar, Massacre, Skirmish, SniperFire, Unrest, Violence, War, or Warfare)
ArmedForce: armed force (including count, type, location, and religion)
ArmedForceAction: action of armed force (Arrive, Begin, Depart, Deploy, or Movement)
AtLocation: location of the war/conflict
Leader: leader of armed force (Organization or Person)
Source: source of information (Organization or Person)
Victim: victim
VictimAction: action of victim (Flee or Return)
The Weather element describes weather conditions. It contains the following elements:
AtLocation: (nearby) location of weather condition (Location)
CompassDirection: compass direction of weather condition with respect to the nearby location (East, North, Northeast, Northwest, South, Southeast, Southwest, or West)
DeclaredState: declared state (Alert, Disaster, Evacuation, or Fire)
Declarer: entity declaring the state (Organization or Person)
DistanceFromLocation: distance from the nearby location
Given: given name of weather condition (such as hurricane)
High: high temperature
Issuer: issuer of warning (Organization or Person)
Low: low temperature
Meteor: meteor or heat condition (Category1Storm, Category2Storm, Category3Storm, Category4Storm, ContinuousDrizzle, ContinuousRain, ContinuousSnow, Cyclone, Dew, Drizzle, DustStorm, ExcessiveHeat, Fog, FreezingRain, Frost, Hail, Heat, HeavyDriftingSnowLow, HeavyThunderstorm, Hurricane, IntermittentDrizzle, IntermittentRain, IntermittentSnow, Lightning, Mist, PlateCrystal, Rain, RainShower, Raindrop, Rainstorm, Sandstorm, Sleet, SlightDriftingSnowLow, Smoke, Snow, SnowFlurry, SnowShower, Snowflake, Snowstorm, Squall, StellarCrystal, Storm, Thunder, Thunderstorm, Tornado, or TropicalStorm)
Warning: warning
WindSpeed: wind speed
Who will create NewsForms and how will they be created?
Wire services, newspapers, and broadcasters could attach NewsForms to their existing text, video, and audio stories. NewsForms could be created by the original reporter or later in the editorial process by copy editors or editorial assistants.
A portable device could be built to enable a NewsForm reporter to enter NewsForms quickly. The reporter would wait for the statement following a Federal Open Market Committee (FOMC) meeting, enter the key information, and hit [SEND]:
FedWatch | |
FedAction | Lower/Hold/Raise |
InterestRate | FederalFundsTarget |
Rate | 5.25 |
[SEND] | [CANCEL] |
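What such a device might emit when the reporter hits [SEND] can be sketched in Python as follows. The function name build_fedwatch is hypothetical, the dateline value in the usage line is made up, and the placement of FedWatch directly under NewsForm alongside Head follows the layout of the earthquake example rather than the full DTD.

```python
import xml.etree.ElementTree as ET

# FedAction values listed for the FedWatch element.
ALLOWED_ACTIONS = {"Hold", "Lower", "Raise"}


def build_fedwatch(action, interest_rate, rate, dateline_time):
    """Serialize the form fields as a FedWatch NewsForm string (illustrative sketch)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("FedAction must be one of Hold, Lower, or Raise")
    root = ET.Element("NewsForm")
    head = ET.SubElement(root, "Head")
    ET.SubElement(head, "DatelineTime").text = dateline_time
    fed = ET.SubElement(root, "FedWatch")
    ET.SubElement(fed, "FedAction").text = action
    ET.SubElement(fed, "InterestRate").text = interest_rate
    ET.SubElement(fed, "Rate").text = rate
    return ET.tostring(root, encoding="unicode")


# The dateline below is a made-up value used only for this example.
fedwatch_xml = build_fedwatch("Hold", "FederalFundsTarget", "5.25", "19990630T181500Z")
print(fedwatch_xml)
```

Restricting FedAction to the enumerated values at entry time keeps a mistyped action from reaching subscribers.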
Original news sources could also create NewsForms. Government agencies, companies, and organizations could all provide NewsForms. The FOMC would provide a NewsForm along with its statement. Corporate public relations departments would provide NewsForms along with earnings, joint venture, new product, and other releases.
Individuals could also create NewsForms. People witnessing news events or learning of them on CNN could post NewsForms to Usenet where they would become available to everyone. People could sign up to be responsible for a particular area of news.
NewsForms can also be created automatically from the wealth of text news stories already available. NewsExtract uses information extraction techniques [4] to convert text stories into NewsForms, though with some errors and omissions. It consists of the following components:
Sentence boundary identifier |
Part-of-speech tagger |
Noun group parser |
Entity parser |
Reference resolver |
Pattern-based parser |
Commonsense rules |
NewsExtract operates as follows: First, the sentence boundary identifier breaks the input text into sentences.
Next the part-of-speech tagger identifies the part of speech of each word of each sentence.
Next the noun group parser identifies noun groups. Examples are:
Washington area economy
government salary increases
Next the entity parser identifies and parses noun groups representing people, locations, organizations, products, numbers, percentages, money, durations, temperatures, speeds, distances, and other entities. Locations include cities, states, countries, regions, and continents. Organizations include companies and government agencies. Examples are:
Prime Minister Lionel Jospin
the Prime Minister
he
Buenos Aires
Argentina
Department of Justice
Healtheon Corp
Concentric DSL 2000
Audi TT Quattro
United Airlines Boeing 777
five thousand
5,000
$2 million
three hours
4 miles
Entities are parsed with the help of 59 specialized lexicons. Ambiguities may be introduced at this point: New York may refer to a city, state, or any of several sports teams.
Next the reference resolver assigns an identifier (such as PERSON3, ORG1, or LOC4) to each unique entity. In the following text, the mentions Prime Minister Lionel Jospin, Jospin, and he are all assigned the same identifier:
Prime Minister Lionel Jospin kept silent on the fate of a key government ally ... Jospin, confronted with a political time-bomb ... Beyond saying this was an affair for justice system, he maintained an awkward silence.
Ambiguities may also be introduced here: He may refer to several males previously mentioned.
Next the pattern-based parser uses 380 rules for converting partial parses from the above steps into XML NewsForms. An example is:
?Person was injured => <InjuryFatality><Injured>?Person</Injured></InjuryFatality>
Some ambiguities are resolved in this stage: Though the entity parser parses Al Khartum into a city in Sudan as well as a person's name, the above rule restricts the entity to being a person in the text Al Khartum was injured.
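The effect of the rule above can be approximated with a short Python sketch. This is not how NewsExtract is implemented: we substitute a regular expression for its pattern matcher, and the ENTITY_TYPES table stands in for the output of the entity parser. Both are assumptions made for illustration.

```python
import re
from xml.sax.saxutils import escape

# Stand-in for entity parser output: each noun group maps to its candidate types.
ENTITY_TYPES = {
    "Al Khartum": {"Person", "City"},  # ambiguous, as in the text
    "Buenos Aires": {"City"},
}

# "?Person was injured": the subject is everything before " was injured".
INJURED_RULE = re.compile(r"^(?P<subject>.+?) was injured\.?$")


def apply_injured_rule(sentence, entity_types):
    """Return an InjuryFatality fragment if the subject can be a Person, else None."""
    match = INJURED_RULE.match(sentence.strip())
    if match is None:
        return None
    subject = match.group("subject")
    # The rule only fires when Person is among the candidate readings,
    # which also discards the competing City reading.
    if "Person" not in entity_types.get(subject, set()):
        return None
    return "<InjuryFatality><Injured>%s</Injured></InjuryFatality>" % escape(subject)


fragment = apply_injured_rule("Al Khartum was injured.", ENTITY_TYPES)
print(fragment)
```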
Finally, commonsense rules attempt to resolve remaining ambiguities and perform final validation of output NewsForms. Examples are:
A team in a competition must play the right sport. (That is, a football team does not play in a baseball game.)
A defendant who is found not guilty cannot be sentenced.
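The second rule could be checked along the following lines. This is a sketch under the assumption that a LegalEvent records the finding in Judgment (Guilty or Innocent) and any sentence in SentenceType or SentenceDuration, as listed for the LegalEvent element; the function name sentence_is_consistent is hypothetical.

```python
import xml.etree.ElementTree as ET


def sentence_is_consistent(legal_event):
    """Return False when a LegalEvent reports both a finding of Innocent and a sentence."""
    acquitted = legal_event.findtext("Judgment") == "Innocent"
    sentenced = (
        legal_event.find("SentenceType") is not None
        or legal_event.find("SentenceDuration") is not None
    )
    return not (acquitted and sentenced)


# Made-up LegalEvent fragments for illustration.
PLAUSIBLE = ET.fromstring(
    "<LegalEvent><Judgment>Guilty</Judgment><SentenceType>Jail</SentenceType></LegalEvent>"
)
SUSPECT = ET.fromstring(
    "<LegalEvent><Judgment>Innocent</Judgment><SentenceType>Jail</SentenceType></LegalEvent>"
)

print(sentence_is_consistent(PLAUSIBLE), sentence_is_consistent(SUSPECT))
```

A NewsForm that fails such a check is a natural candidate for review by a human editor.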
Human editors can be brought in to resolve and correct any remaining ambiguities and errors in the automatically parsed NewsForms.
NITF [9] and XMLNews [19] are the other XML document types for news stories. The design goals of NewsForms differ from those of NITF and XMLNews, as shown in the following table:
Design goal | NITF | XMLNews | NewsForms |
Enable reuse of news story text | yes - major goal | yes | no - no text |
Allow incremental markup of text | yes - major goal | yes - major goal | no - no text |
Standardize text news feed | yes | yes - major goal | no - no text |
Improve indexing and retrieval | yes | yes - major goal | yes - major goal |
Make easy to implement | yes | yes - major goal | yes - major goal |
Represent events in detail | no | no | yes - major goal |
Whereas NewsForms represent 17 types of events in detail, the event element of NITF describes an event only with character data, an identifier, a start date, and an end date:
<!ELEMENT event (#PCDATA)>
<!ATTLIST event
  id ID #IMPLIED
  start-date CDATA #IMPLIED
  end-date CDATA #IMPLIED>
The event element of XMLNews-Story is merely:
<!ELEMENT event (#PCDATA)>
Countries are identified in NITF with ISO 3166 country codes, but XMLNews-Story dropped this feature.
NewsForms could be incorporated into NITF and XMLNews. The NewsForm element could be inserted wherever metadata (as opposed to story text) occurs: the head or body.head elements of NITF and XMLNews-Story, or the root element xn:Resource of XMLNews-Meta. To reduce confusion, NewsForms use similar names to those used in NITF and XMLNews where they exist (as is the case for Person, Given, Family, Location, Country, City, State, Region, Money, and others).
An early information extraction system for news was the FRUMP program [6]. Information extraction technology was refined in a series of Message Understanding Conferences (MUCs) beginning in 1987 [2, 4, 7]. Four of the 17 NewsForm event elements overlap MUC domains, as shown in the following table:
MUC conference | MUC domains | NewsForms |
MUCK-I, MUCK-II | naval sightings and engagements | none |
MUC-3, MUC-4 | terrorist events | InjuryFatality |
MUC-5 | international joint ventures, electronic circuit fabrication | JointVenture |
MUC-6 | management successions, labor contract negotiations | Succession, Negotiation |
Future work includes adding more events to NewsForms, improving the precision and recall of NewsExtract, and evolving applications of NewsForms. Some common types of news events not yet incorporated into NewsForms are: corporate buybacks, corporate restructuring, bankruptcies, layoffs, insider trading, market research, surveys, polls, ad campaigns, scientific discoveries, space missions, computer security, y2k problems, outages, campaigning, election results, opinions, proposals, human interest stories, interviews, outdoor events, and rescue efforts.
By using NewsForms, computers can understand the essence of 17 common types of news events. NewsExtract can be used to convert existing text news stories into NewsForms, and individuals and organizations can create and post their own NewsForms. By deploying and evolving NewsForms, we will move closer to the realization of Tim Berners-Lee's dream of a Semantic Web [1].