arXiv:cs.IR/0003001 1 Mar 2000

Making news understandable to computers

Erik T. Mueller
Signiform
Washington, DC
erik@signiform.com

Abstract

Computers and devices are largely unaware of events taking place in the world. This could be changed if news were made available in a computer-understandable form. In this paper we present XML documents called NewsForms that represent the key points of 17 types of news events. We discuss the benefits of computer-understandable news and present the NewsExtract program for converting text news stories into NewsForms.

Keywords

XML, news, information extraction

Introduction

Computers and devices today have little or no awareness of events taking place in the real world.

The rise of XML [3] presents a unique opportunity to change this. An XML document type can be created that represents the content of news events. Then news stories can be made available as XML documents to computers at the same time that text news stories are made available to people.

In fact, two XML document types for news have been defined.

The News Industry Text Format (NITF) [9] was created by the International Press Telecommunications Council and the Newspaper Association of America. NITF is a text markup language that enables reuse of news stories across print publications, broadcast news, databases, and the web. It incorporates a number of HTML tags for structure, appearance, and linking such as a, hr, p, and pre. To those, it adds tags for metadata or information about a news story such as byline, distributor, and keyword. Finally, it includes tags for marking up entities in the story such as event, location, money, org, and postaddr.

XMLNews [17, 18] was created by WavePhore, a news amalgamator, in order to provide web sites with a feed of news stories gathered from various sources in a standard format. XMLNews is split into two parts: XMLNews-Story is an easier-to-implement subset of NITF for marking up text news stories, while XMLNews-Meta provides information about news stories, text or otherwise.

NITF and XMLNews help make news more understandable to computers, but they do not go far enough. First of all, knowing what entities are present in a story is not sufficient. It is also important to know how those entities relate to one another. Consider the following XMLNews-Story markup of the first paragraph of a story about an earthquake:

<p>An <event>earthquake</event> struck <location>western
<country>Colombia</country></location> on
<chron norm="19990125">Monday</chron>, killing at least 143 people and
injuring more than 900 as it toppled buildings across the country's
coffee-growing heartland, <function>civil defense officials</function>
said.</p>
(Markup from the XMLNews-Story tutorial [19].)

This markup does not specify that western Colombia was the location of the earthquake rather than some other event mentioned later in the story. It does not specify that 19990125 was the date of the earthquake and not some other event. Nor does it specify the relationship of the numbers 143 and 900 to the earthquake, namely, that 143 is the minimum number of people killed in the earthquake and 900 is the minimum number of people injured.

Furthermore, what if the story had said a quake shook western Colombia? Then the markup would have been:

<p>A <event>quake</event> shook ...
How is a computer to know that a quake is the same as an earthquake? What if the event is a trial? How is a computer to know whether this refers to a legal proceeding or the act of testing something? Natural language is highly ambiguous, so marking up text news stories with tags such as event is not sufficient. Rather, it is important to represent concepts in a standard, unambiguous format.

At this point, one might be tempted to use an AI knowledge representation language such as CycL [5] or the web-based SHOE [8] to represent the meaning of every sentence in a news story. Our solution is simpler and more practical, along the lines of the templates defined in the United States government-sponsored Message Understanding Conferences (MUCs) [7].

We propose an XML document type, called NewsForm, for representing the key points of common news events. Though not all events fall into predefined classes, many do. NewsForm documents, which we call NewsForms, describe 17 types of news events: competitions, deals, earnings reports, economic releases, Fed watching, IPOs, injuries and fatalities, joint ventures, legal events, medical findings, negotiations, new products, management successions, trips and visits, votes, war, and weather reports. The NewsForm document type defines:

Here is a NewsForm for the above earthquake story:

<NewsForm>
  <Head>
    <DatelineTime>19990125T181917Z</DatelineTime>
  </Head>
  <InjuryFatality>
    <Cause>Earthquake</Cause>
    <InjuredCount>900</InjuredCount>
    <KilledCount>143</KilledCount>
    <Source><Function>Civil Defense Official</Function></Source>
    <AtLocation>
      <Country>COL</Country>
      <Latitude>4.29</Latitude>
      <Longitude>-75.68</Longitude>
    </AtLocation>
  </InjuryFatality>
</NewsForm>
This makes it clear that Colombia is the location of an InjuryFatality caused by an Earthquake, resulting in 900 injuries and 143 deaths.
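
To illustrate how a program might consume such a document, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The choice of Python, the function name summarize_injury_fatality, and the dictionary keys are assumptions made for illustration; they are not part of the NewsForm specification.

```python
import xml.etree.ElementTree as ET

# The earthquake NewsForm shown above, embedded as a string.
NEWSFORM = """\
<NewsForm>
  <Head>
    <DatelineTime>19990125T181917Z</DatelineTime>
  </Head>
  <InjuryFatality>
    <Cause>Earthquake</Cause>
    <InjuredCount>900</InjuredCount>
    <KilledCount>143</KilledCount>
    <Source><Function>Civil Defense Official</Function></Source>
    <AtLocation>
      <Country>COL</Country>
      <Latitude>4.29</Latitude>
      <Longitude>-75.68</Longitude>
    </AtLocation>
  </InjuryFatality>
</NewsForm>
"""


def summarize_injury_fatality(xml_text):
    """Return key fields of the first InjuryFatality event as a dict.

    Hypothetical helper: it assumes every field used below is present.
    """
    root = ET.fromstring(xml_text)
    event = root.find("InjuryFatality")
    return {
        "dateline": root.findtext("Head/DatelineTime"),
        "cause": event.findtext("Cause"),
        "injured": int(event.findtext("InjuredCount")),
        "killed": int(event.findtext("KilledCount")),
        "country": event.findtext("AtLocation/Country"),
        "latitude": float(event.findtext("AtLocation/Latitude")),
        "longitude": float(event.findtext("AtLocation/Longitude")),
    }


summary = summarize_injury_fatality(NEWSFORM)
print(summary["cause"], summary["country"], summary["killed"], summary["injured"])
```

Because the location, counts, and cause are children of the same InjuryFatality element, the program does not have to guess which event they belong to.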

We have developed a program called NewsExtract for converting text news stories into NewsForms, and a news search engine called Padoof that uses NewsForms.

In the remainder of this paper, we discuss the benefits of computer-understandable news, how news events are represented by NewsForms, the creation of NewsForms, the NewsExtract program, and related and future work.

Benefits of computer-understandable news

In this section we describe some of the benefits of computer-understandable news in the areas of tracking and notification, research studies and informational graphics, world-aware software, ubiquitous computing, and accessibility.

Tracking and notification services

A number of web-based services allow tracking and notification of news stories, Usenet postings, and web page updates on topics of interest to the user. In order to be notified when Bell Atlantic is involved in a merger or acquisition, the user must enter appropriate keywords and phrases such as Bell Atlantic, BEL, acquisition, acquire, and buy. Some news services provide indexing terms that simplify this task. The Dow Jones search statement:

ns=tnm and co=bel
finds articles indexed as being about acquisitions, mergers, or takeovers and containing Bell Atlantic's stock symbol. With these services, however, the user cannot perform highly focused queries, such as finding only those stories in which Bell Atlantic is the target of a deal.

In contrast, the Deal NewsForm element specifies the acquirer and target, so the user can request a Deal whose Target's Ticker is BEL.
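
Such a query could be sketched as follows in Python. The sample documents, the tickers AAA and BBB, and the function deals_targeting are hypothetical, and the sketch assumes that Ticker appears directly inside Target, in the same way that Function appears directly inside Source in the earthquake example.

```python
import xml.etree.ElementTree as ET

# Hypothetical Deal NewsForms. AAA and BBB are made-up tickers.
DEALS = [
    "<NewsForm><Deal>"
    "<Acquirer><Ticker>AAA</Ticker></Acquirer>"
    "<Target><Ticker>BEL</Ticker></Target>"
    "<DealStatus>Rumored</DealStatus>"
    "</Deal></NewsForm>",
    "<NewsForm><Deal>"
    "<Acquirer><Ticker>BEL</Ticker></Acquirer>"
    "<Target><Ticker>BBB</Ticker></Target>"
    "<DealStatus>Agreed</DealStatus>"
    "</Deal></NewsForm>",
]


def deals_targeting(newsforms, ticker):
    """Return the Deal elements whose Target has the given Ticker."""
    matches = []
    for text in newsforms:
        root = ET.fromstring(text)
        for deal in root.iter("Deal"):
            if deal.findtext("Target/Ticker") == ticker:
                matches.append(deal)
    return matches


bel_targets = deals_targeting(DEALS, "BEL")
print(len(bel_targets), bel_targets[0].findtext("DealStatus"))
```

The second sample document mentions BEL only as the acquirer, so a keyword search would return it while this query does not.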

NewsForms allow other queries such as:

find stories about the Federal Reserve raising interest rates
find stories about someone pleading guilty
find negative earnings reports
The Padoof NewsForm-powered search engine [14] accepts such queries and allows sorting of results by the content of NewsForm elements (such as DatelineTime, AtLocation, or InjuredCount).

Online trading sites allow the user to request e-mail notification when the price of a financial instrument hits a certain level. NewsForms take this a step further, enabling computation and tracking of novel statistics on news events such as the number of product announcements per day or the number of announced deals in an industry over the last week.
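
As a rough sketch of such a statistic, the following Python code counts events of a given type per day. It assumes that DatelineTime always begins with an eight-digit date, as in the earthquake example; the sample documents and the function events_per_day are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical NewsForms: three product announcements and one deal.
SAMPLE = [
    "<NewsForm><Head><DatelineTime>19990125T090000Z</DatelineTime></Head>"
    "<NewProduct><ProductStatus>Released</ProductStatus></NewProduct></NewsForm>",
    "<NewsForm><Head><DatelineTime>19990125T153000Z</DatelineTime></Head>"
    "<NewProduct><ProductStatus>Released</ProductStatus></NewProduct></NewsForm>",
    "<NewsForm><Head><DatelineTime>19990126T110000Z</DatelineTime></Head>"
    "<NewProduct><ProductStatus>Recalled</ProductStatus></NewProduct></NewsForm>",
    "<NewsForm><Head><DatelineTime>19990126T120000Z</DatelineTime></Head>"
    "<Deal><DealStatus>Agreed</DealStatus></Deal></NewsForm>",
]


def events_per_day(newsforms, event_tag):
    """Count NewsForms containing event_tag, keyed by dateline day (YYYYMMDD)."""
    counts = Counter()
    for text in newsforms:
        root = ET.fromstring(text)
        if root.find(event_tag) is None:
            continue  # this NewsForm describes some other type of event
        stamp = root.findtext("Head/DatelineTime")
        if stamp:
            counts[stamp[:8]] += 1  # the first eight characters are the date
    return counts


product_counts = events_per_day(SAMPLE, "NewProduct")
print(sorted(product_counts.items()))
```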

Research studies and graphics

Computer-understandable news facilitates historical studies on news events such as:

the impact of news on financial markets
the impact of financial markets on news
the effect of management successions on company performance
the effect of layoffs on company performance
the effect of takeovers on stock prices

Computer-understandable news can help in the production of informational graphics. A number of statistics could be computed and visualized such as:

changes of government versus time
CEO changes versus time
IPOs versus time
IPO locations
Padoof displays a United States or world map showing the geographical distribution of retrieved events. On the map, a red B indicates a negative event, a green G indicates a positive event, and a blue X indicates other events.

World-aware software

Computer-understandable news allows calendar and other application programs to be more aware of the world. We are building a smart calendar program [12] that helps the user avoid obvious blunders such as entering a dinner appointment for 6 am instead of 6 pm, or taking a vegetarian to a steakhouse. With awareness of news, the program could inform the user if an appointment is in a building that has just been evacuated due to an accident, or suggest the user arrange an alternative means of transportation in case of a subway outage.

In an e-commerce application, when the user is about to order a product, the application could inform the user if a new product of a related type was just announced. The user might wish to investigate the new product before making the purchase. The application could also inform the user if bad news was just announced about the manufacturer.

Ubiquitous computing

Computer-understandable news opens up the possibility of news-aware rooms, furniture, toys, and clothing. Based on the outcome of a news event such as a vote or legal proceeding, a room could alter the lighting and music, furniture could change color, or stuffed animals could alter their facial expression.

Sounds could be associated with each type of news event in a geographical region to give a feel for what is happening in that region. A competition event would produce a cheering crowd, a car crash event would produce a crashing sound, and a thunderstorm would produce thunder. The user might wish to monitor the current region, the region of family or friends, or a travel destination.

Accessibility

NewsForms make information about news stories accessible in a compact, mostly text-free format. For people who have difficulty reading text, NewsForms can be hooked up to audible, visual, tactile, and other transducers. NewsForms do not contain graphics and are suitable for transmission over a slow net connection.

Representing news events

In this section we first present the NewsForm elements for representing people, locations, and organizations. We then present the 17 NewsForm elements for representing news events. Here we present only the immediate child elements of each element. The complete NewsForm specification and DTD are presented elsewhere [10, 11].

Person

The Person element contains the following elements:

Additional: middle initial, middle name, or additional name
Age: age in years
Country: origin or nationality country (ISO 3166 [13] three-letter uppercase country code such as USA and FRA)
Email: email address
Family: family name
Function: job title
Given: given name
Prefix: prefix
Sex: sex (Female or Male)
Suffix: suffix
URL: URL

Location

The Location element contains the following elements:

City: city name
Continent: continent (Africa, Antarctica, Asia, Europe, NorthAmerica, or SouthAmerica)
Country: country (ISO 3166 three-letter uppercase country code)
Latitude: latitude (north is positive, south is negative)
Longitude: longitude (east is positive, west is negative)
Region: region (such as Appalachia or Scotland)
State: state (United States Postal Service 2-letter uppercase state/province/possession abbreviation as defined in USPS Publications 28 [15] and 65 [16] such as CT and PQ)
URL: URL

Organization

The Organization element describes a company, government agency, or other organization. It contains the following elements:

Email: email address
FullName: full name
Nickname: nickname
OrganizationType: industry or type of organization
Sport: sport for a sports team (Archery, AutoRacing, Badminton, Baseball, Basketball, Biathlon, Boating, Bobsledding, Boxing, Cricket, Curling, Cycling, ExtremeSports, Fencing, Fishing, Football, Golf, Gymnastics, HighJump, Hockey, HorseRacing, Javelin, LongJump, Martial Arts, Olympics, PoleVault, Rodeo, Rowing, Rugby, Running, ShotPut, Snowboarding, Soccer, Softball, Sport, Tennis, Track, Triathlon, Volleyball, WaterSports, Weightlifting, WinterSports, or Wrestling)
Ticker: exchange ticker for a company
URL: URL

Competition

The Competition element contains the following elements:

CompetitionCode: competition (such as ATPTour and WorldSeriesOfGolf)
CompetitionOutcome: outcome of the competition (Loss, Tie, or Win)
Player: player who won, lost, or tied (Person)
Sport: sport of the competition (see Organization above)
Team: team who won, lost, or tied (Organization)

Deal

The Deal element describes mergers and acquisitions. It contains the following elements:

Acquirer: acquiring company (Organization)
Advisor: advisor (Organization)
DealStatus: deal status (Rumored, InTalks, Agreed, Approved, Completed, or Failed)
DealValue: deal value (Money)
SharePrice: share price for tender offer (Money)
Stake: percentage stake
StockRatio: stock ratio (x means one share of Acquirer stock for every x shares of Target stock)
Successor: successor company for consolidation (Organization)
Survivor: surviving company for merger (Organization)
Target: company being acquired (Organization)

Earnings

The Earnings element describes earnings announcements and reports. It contains the following elements:

Company: company (Organization)
EPS: earnings per share (Money whose Currency is represented using an extended ISO 4217 three-letter currency identifier such as GBP, JPY, and USD)
EarningsAmount: earnings (Money)
GoodBad: good or bad earnings report (Good or Bad)
Loss: amount of loss (Money)
PreviousEPS: previous earnings per share (Money)
PreviousEarnings: previous earnings amount (Money)
Sales: sales/revenues (Money)
SalesPS: sales/revenues per share (Money)

EconomicRelease

The EconomicRelease element describes economic releases. It contains the following elements:

AnnualRate: annual rate
Direction: direction (Down, Unchanged, or Up)
EconomicReleaseType: type of economic release (such as AdvanceRetailSales and WorldPopulation)
Growth: growth of number (Money)
GrowthRate: growth rate of number
PreviousRate: previous rate
Rate: rate
Source: source of economic release (Organization or Person)

FedWatch

The FedWatch element describes actions by the United States Federal Reserve and other organizations. It contains the following elements:

Actor: organization performing action
FedAction: action (Hold, Lower, or Raise)
InterestRate: interest rate affected (1MCommercialPaperRate, 3MCommercialPaperRate, 6MCommercialPaperRate, BaseRate, CommercialPaperRate, DiscountRate, FederalFundsRate, FederalFundsTarget, InterestRate, PrimeRate, TBill, TBill1Y, TBill3M, TBill6M, TBond30Y, TNote2Y, TNote5Y, or TNote10Y)
Rate: rate

IPO

The IPO element describes initial public offerings. It contains the following elements:

Company: company (Organization)
MarketCap: market capitalization (Money)
Raised: amount raised (Money)
Shares: shares offered
Stake: percentage stake in company represented by new issue

InjuryFatality

The InjuryFatality element describes injuries and fatalities. It contains the following elements:

AccidentCar: car involved in an accident
AccidentPlane: plane involved in an accident
AtLocation: location of the injury/fatality
Boat: boat involved in an accident (Battleship, Boat, CabinCruiser, Cruiser, Dinghy, InflatableDinghy, LifeRaft, Lifeboat, PassengerShip, Powerboat, Raft, Ship, SmallBoat, Steamship, Vessel, Warship, or Windsurfer)
Cause: cause of injury or fatality (AerialBomb, Alert, BoatCollision, Bomb, CarCrash, Curfew, Disaster, Dynamite, Earthquake, Evacuation, Explosive, Fire, Firebomb, Grenade, Mine, MolotovCocktail, PlaneCrash, or VehicleBomb)
CauseEvent: event (such as illness or weather) causing the injury or fatality
Hospitalized: person hospitalized
Injured: person injured
InjuredCount: number of people injured
Killed: person killed
KilledCount: number of people killed
LandedPlane: plane that landed
Source: source of information (Organization or Person)
SurvivedBy: description of those surviving the person killed

JointVenture

The JointVenture element describes joint ventures and other arrangements. It contains the following elements:

Company: company involved in the joint venture or other arrangement (Organization)
Item: product involved in the joint venture
JointVentureType: type of joint venture (Agreement, Alliance, Deal, DistributionAgreement, LaunchContract, LicensingAgreement, MarketingAlliance, Partnership, PromotionAgreement, Relationship, StrategicAlliance, or Venture)
Source: source of information (Organization or Person)

LegalEvent

The LegalEvent element describes legal events, including arrests, lawsuits, pleas, testimony, judgments, sentencing, and releases. It contains the following elements:

AccusationAction: accusation (AggravatedAssault, Assassination, Assault, CapitalMurder, Conspiracy, ConspiringToKill, Crime, DisorderlyConduct, FirstDegreeMurder, Genocide, Harassment, InvoluntaryManslaughter, Manslaughter, Massacre, Murder, ObstructionOfJustice, PremeditatedMurder, RacialHarassment, Rape, SecondDegreeMurder, SexualAssault, SexualHarassment, Slaughter, Torture, or Wrongdoing)
Accused: accused entity or defendant (Organization or Person)
Accuser: accusing entity or plaintiff (Organization or Person)
Arbiter: arbiter or judge (Organization or Person)
Arrested: person arrested
Attorney: attorney (Person)
Award: award (Money)
DispositionMethod: disposition method (CourtTrial, JuryTrial, SummaryJudgment, ConsentJudgment, DefaultJudgment, DirectedVerdict, ArbitrationAward, Settlement, Dismissal, or Transfer)
Forum: forum or court (Organization)
Judgment: judgment or finding (Guilty or Innocent)
LegalAction: legal action (Argue, Arrest, Charge, File, Judge, Plead, Release, Sentence, Settle, or Testify)
LegalFiling: legal filing (Complaint, Motion, ObscenityComplaint, Pleading, or Suit)
Plea: plea (Guilty or Innocent)
Released: person released
Releaser: entity releasing the person (Organization or Person)
SentenceDuration: duration of sentence
SentenceType: type of sentence (Execution, Jail, JailLife, JailLifeWithoutPossibleParole, JailLifeWithPossibleParole, Probation, or StateCustody)
Witness: testifying witness (Person)

MedicalFinding

The MedicalFinding element describes medical findings. It contains the following elements:

Illness: illness (such as AbdominalPain, Chlamydia, and Osteoporosis)
IllnessFactor: possible factor in illness (AirPollution, Alcohol, AnabolicSteroids, BreastImplant, CigarSmoking, CigaretteSmoking, Circumcision, Cocaine, Contraception, ContraceptivePill, Dieting, Ecstasy, Heroin, Hysterectomy, Immunization, LSD, Mescaline, Opium, Pollution, Smoking, Stress, Tobacco, or Vaccination)

Negotiation

The Negotiation element describes negotiations. It contains the following elements:

Agreement: agreement (Accord, Agreement, FinalSettlement, Legislation, Measure, PeaceAgreement, PeaceDeal, PeaceTreaty, Settlement, or Treaty)
NegotiationStatus: negotiation status (AgreementReached, InitialTalks, or Talks)
Negotiator: negotiator (Person)
Party: involved party (Organization or Person)

NewProduct

The NewProduct element describes new product releases and product recalls. It contains the following elements:

Company: company (Organization)
Item: product
Price: price of product (Money)
ProductStatus: status of product (Released or Recalled)
Source: source of information (Organization or Person)
SupportFor: something that product provides support for

Succession

The Succession element describes management successions. It contains the following elements:

Employer: employer (Organization or Person)
Function: job title
In: person entering job
Out: person leaving job
Source: source of information (Organization or Person)

Trip

The Trip element describes trips and visits. It contains the following elements:

Host: host of visit (Organization or Person)
ToLocation: location of visit (Location)
Visitor: visitor (Person)
VisitorCount: number of visitors

Vote

The Vote element describes votes. It contains the following elements:

Against: number of votes against
InFavor: number of votes in favor
Law: name of law
Legislation: legislation (Amendment, Bill, ConcurrentResolution, CongressionalJointResolution, HouseAmendment, HouseBill, HouseConcurrentResolution, HouseJointResolution, HouseResolution, JointResolution, Resolution, SenateAmendment, SenateBill, SenateConcurrentResolution, SenateJointResolution, or SenateResolution)
Signer: signer (Person)
VoteStatus: outcome or status of vote (Passed, Rejected, Signed, or VetoThreat)
VotingBody: voting body (Organization)

War

The War element describes wars and conflicts. It contains the following elements:

ArmedConflict: type of armed conflict (AirBattle, AirStrike, ArmedConflict, ArtilleryFire, Attack, Battle, Bombing, CivilUnrest, CivilWar, Clash, Conflict, Coup, Fighting, Fire, GuerrillaActivities, Hostilities, LandBattle, LandWar, Massacre, Skirmish, SniperFire, Unrest, Violence, War, or Warfare)
ArmedForce: armed force (including count, type, location, and religion)
ArmedForceAction: action of armed force (Arrive, Begin, Depart, Deploy, or Movement)
AtLocation: location of the war/conflict
Leader: leader of armed force (Organization or Person)
Source: source of information (Organization or Person)
Victim: victim
VictimAction: action of victim (Flee or Return)

Weather

The Weather element describes weather conditions. It contains the following elements:

AtLocation: (nearby) location of weather condition (Location)
CompassDirection: compass direction of weather condition with respect to the nearby location (East, North, Northeast, Northwest, South, Southeast, Southwest, or West)
DeclaredState: declared state (Alert, Disaster, Evacuation, or Fire)
Declarer: entity declaring the state (Organization or Person)
DistanceFromLocation: distance from the nearby location
Given: given name of weather condition (such as hurricane)
High: high temperature
Issuer: issuer of warning (Organization or Person)
Low: low temperature
Meteor: meteor or heat condition (Category1Storm, Category2Storm, Category3Storm, Category4Storm, ContinuousDrizzle, ContinuousRain, ContinuousSnow, Cyclone, Dew, Drizzle, DustStorm, ExcessiveHeat, Fog, FreezingRain, Frost, Hail, Heat, HeavyDriftingSnowLow, HeavyThunderstorm, Hurricane, IntermittentDrizzle, IntermittentRain, IntermittentSnow, Lightning, Mist, PlateCrystal, Rain, RainShower, Raindrop, Rainstorm, Sandstorm, Sleet, SlightDriftingSnowLow, Smoke, Snow, SnowFlurry, SnowShower, Snowflake, Snowstorm, Squall, StellarCrystal, Storm, Thunder, Thunderstorm, Tornado, or TropicalStorm)
Warning: warning
WindSpeed: wind speed

Creation of NewsForms

Who will create NewsForms and how will they be created?

Wire services, newspapers, and broadcasters could attach NewsForms to their existing text, video, and audio stories. NewsForms could be created by the original reporter or later in the editorial process by copy editors or editorial assistants.

A portable device could be built to enable a NewsForm reporter to enter NewsForms quickly. The reporter would wait for the statement following a Federal Open Market Committee (FOMC) meeting, enter the key information, and hit [SEND]:

FedWatch
  FedAction:    Lower/Hold/Raise
  InterestRate: FederalFundsTarget
  Rate:         5.25
[SEND] [CANCEL]
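
What such a device might transmit when the reporter hits [SEND] can be sketched as follows in Python. The function build_fedwatch and its parameters are hypothetical, and the order of the child elements is an assumption; the NewsForm DTD [11] is the authority on the required content model.

```python
import xml.etree.ElementTree as ET

# Allowed FedAction values, as listed in the FedWatch element description.
FED_ACTIONS = {"Hold", "Lower", "Raise"}


def build_fedwatch(fed_action, interest_rate, rate, dateline_time=None):
    """Build a FedWatch NewsForm string from the fields of the entry form."""
    if fed_action not in FED_ACTIONS:
        raise ValueError("unknown FedAction: %r" % (fed_action,))
    root = ET.Element("NewsForm")
    if dateline_time is not None:
        head = ET.SubElement(root, "Head")
        ET.SubElement(head, "DatelineTime").text = dateline_time
    event = ET.SubElement(root, "FedWatch")
    ET.SubElement(event, "FedAction").text = fed_action
    ET.SubElement(event, "InterestRate").text = interest_rate
    ET.SubElement(event, "Rate").text = str(rate)
    return ET.tostring(root, encoding="unicode")


fed_xml = build_fedwatch("Hold", "FederalFundsTarget", 5.25)
print(fed_xml)
```

Validating the FedAction value before sending keeps a mistyped entry from reaching subscribers as a malformed NewsForm.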

Original news sources could also create NewsForms. Government agencies, companies, and organizations could all provide NewsForms. The FOMC would provide a NewsForm along with its statement. Corporate public relations departments would provide NewsForms along with earnings, joint venture, new product, and other releases.

Individuals could also create NewsForms. People witnessing news events or learning of them on CNN could post NewsForms to Usenet where they would become available to everyone. People could sign up to be responsible for a particular area of news.

NewsExtract

NewsForms can also be created automatically from the wealth of text news stories already available. NewsExtract uses information extraction techniques [4] to convert text stories into NewsForms, though with some errors and omissions. It consists of the following components:

Sentence boundary identifier
Part-of-speech tagger
Noun group parser
Entity parser
Reference resolver
Pattern-based parser
Commonsense rules

NewsExtract operates as follows: First, the sentence boundary identifier breaks the input text into sentences.

Next the part-of-speech tagger identifies the part of speech of each word of each sentence.

Next the noun group parser identifies noun groups. Examples are:

Washington area economy
government salary increases

Next the entity parser identifies and parses noun groups representing people, locations, organizations, products, numbers, percentages, money, durations, temperatures, speeds, distances, and other entities. Locations include cities, states, countries, regions, and continents. Organizations include companies and government agencies. Examples are:

Prime Minister Lionel Jospin
the Prime Minister
he
Buenos Aires
Argentina
Department of Justice
Healtheon Corp
Concentric DSL
2000 Audi TT Quattro
United Airlines Boeing 777
five thousand
5,000
$2 million
three hours
4 miles
Entities are parsed with the help of 59 specialized lexicons. Ambiguities may be introduced at this point: New York may refer to a city, state, or any of several sports teams.
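
The way lexicon lookup introduces ambiguity can be illustrated with a toy sketch in Python. The four small lexicons below and the function candidate_types are hypothetical stand-ins; they are not the 59 lexicons used by NewsExtract.

```python
# Toy lexicons keyed by entity type (hypothetical; for illustration only).
LEXICONS = {
    "City": {"New York", "Buenos Aires"},
    "State": {"New York"},
    "Country": {"Argentina"},
    "SportsTeam": {"New York"},
}


def candidate_types(phrase):
    """Return every entity type whose lexicon contains the phrase."""
    return sorted(name for name, entries in LEXICONS.items() if phrase in entries)


print(candidate_types("New York"))
print(candidate_types("Argentina"))
```

A phrase with more than one candidate type is left ambiguous at this stage, to be narrowed down by later components.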

Next the reference resolver assigns an identifier (such as PERSON3, ORG1, or LOC4) to each unique entity. In the following text, the mentions Prime Minister Lionel Jospin, Jospin, and he are all assigned the same identifier:

Prime Minister Lionel Jospin kept silent on the fate of a key
government ally ...
Jospin, confronted with a political time-bomb ...
Beyond saying this was an affair for justice system, he maintained
an awkward silence.
Ambiguities may also be introduced here: he may refer to any of several males mentioned previously.
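
A deliberately naive sketch of reference resolution for person mentions follows, in Python. The two heuristics used (mentions sharing a final word share an identifier, and a pronoun takes the identifier of the most recent mention) are assumptions chosen for illustration and are not a description of how NewsExtract resolves references.

```python
PRONOUNS = {"he", "she"}


def resolve_people(mentions):
    """Assign PERSONn identifiers to person mentions, in order of appearance."""
    ids_by_family = {}  # family name -> identifier
    resolved = []
    last_id = None
    for mention in mentions:
        if mention.lower() in PRONOUNS:
            # Naive choice: the most recently mentioned person, if any.
            ident = last_id
        else:
            family = mention.split()[-1]  # treat the last word as the family name
            if family not in ids_by_family:
                ids_by_family[family] = "PERSON%d" % (len(ids_by_family) + 1)
            ident = ids_by_family[family]
        last_id = ident
        resolved.append((mention, ident))
    return resolved


resolved = resolve_people(["Prime Minister Lionel Jospin", "Jospin", "he"])
print(resolved)
```

The pronoun heuristic shows where the ambiguity arises: if a second man had been mentioned just before he, this sketch would silently pick him instead.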

Next the pattern-based parser uses 380 rules for converting partial parses from the above steps into XML NewsForms. An example is:

?Person was injured =>
<InjuryFatality><Injured>?Person</Injured></InjuryFatality>
Some ambiguities are resolved in this stage: Though the entity parser parses Al Khartum into a city in Sudan as well as a person's name, the above rule restricts the entity to being a person in the text Al Khartum was injured.
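
The effect of this rule can be sketched in Python as follows. The function apply_injured_rule and the candidates mapping are hypothetical, the regular expression covers only this one rule, and the sketch places the raw phrase inside Injured rather than the Person child elements a full NewsForm would contain.

```python
import re
import xml.etree.ElementTree as ET


def apply_injured_rule(text, candidates):
    """Apply the rule '?Person was injured' to text.

    candidates maps each entity phrase to the set of entity types proposed
    for it by an earlier parsing stage. Returns an XML string, or None if
    the rule does not match.
    """
    for phrase, types in candidates.items():
        if "Person" not in types:
            continue  # the ?Person variable binds only to people
        if re.search(re.escape(phrase) + r"\s+was\s+injured\b", text):
            event = ET.Element("InjuryFatality")
            ET.SubElement(event, "Injured").text = phrase
            return ET.tostring(event, encoding="unicode")
    return None


injured_xml = apply_injured_rule(
    "Al Khartum was injured.",
    {"Al Khartum": {"City", "Person"}},
)
print(injured_xml)
```

Because the rule accepts only the Person reading, the City reading of Al Khartum is discarded for this sentence.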

Finally, commonsense rules attempt to resolve remaining ambiguities and perform final validation of output NewsForms. Examples are:

A team in a competition must play the right sport.
(That is, a football team does not play in a baseball game.)
A defendant who is found not guilty cannot be sentenced.
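
These two rules can be expressed as simple validation checks, sketched here in Python. The function names and sample documents are hypothetical, and the sketch assumes that a finding of not guilty is recorded as a Judgment of Innocent and that each Team carries its own Sport element, as the Organization description suggests.

```python
import xml.etree.ElementTree as ET


def check_competition(competition):
    """Flag teams whose Sport differs from the Sport of the competition."""
    errors = []
    sport = competition.findtext("Sport")
    for team in competition.findall("Team"):
        team_sport = team.findtext("Sport")
        if sport and team_sport and team_sport != sport:
            name = team.findtext("FullName") or "unknown team"
            errors.append("%s plays %s, not %s" % (name, team_sport, sport))
    return errors


def check_legal_event(legal_event):
    """Flag a sentence attached to a defendant who was found innocent."""
    errors = []
    innocent = legal_event.findtext("Judgment") == "Innocent"
    if innocent and legal_event.find("SentenceType") is not None:
        errors.append("defendant found innocent cannot be sentenced")
    return errors


# Hypothetical parser output containing one violation of each rule.
bad_game = ET.fromstring(
    "<Competition><Sport>Baseball</Sport>"
    "<Team><FullName>Example Team</FullName><Sport>Football</Sport></Team>"
    "</Competition>"
)
bad_verdict = ET.fromstring(
    "<LegalEvent><Judgment>Innocent</Judgment>"
    "<SentenceType>Jail</SentenceType></LegalEvent>"
)
game_errors = check_competition(bad_game)
verdict_errors = check_legal_event(bad_verdict)
print(game_errors, verdict_errors)
```

A NewsForm that fails such a check could be re-parsed with a different reading of the ambiguous entity or passed to a human editor.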

Human editors can be brought in to resolve and correct any remaining ambiguities and errors in the automatically parsed NewsForms.

Related work

NITF [9] and XMLNews [17, 18] are the other XML document types for news stories. The design goals of NewsForms differ from those of NITF and XMLNews, as shown in the following table:

Design goal                       NITF              XMLNews           NewsForms
Enable reuse of news story text   yes - major goal  yes               no - no text
Allow incremental markup of text  yes - major goal  yes - major goal  no - no text
Standardize text news feed        yes               yes - major goal  no - no text
Improve indexing and retrieval    yes               yes - major goal  yes - major goal
Make easy to implement            yes               yes - major goal  yes - major goal
Represent events in detail        no                no                yes - major goal

Whereas NewsForms represent 17 types of events in detail, the event element of NITF describes an event only with character data, an identifier, a start date, and an end date:

<!ELEMENT event (#PCDATA)>
<!ATTLIST event
          id ID #IMPLIED
          start-date CDATA #IMPLIED
          end-date CDATA #IMPLIED>

The event element of XMLNews-Story is merely:

<!ELEMENT event (#PCDATA)>

Countries are identified in NITF with ISO 3166 country codes, but XMLNews-Story dropped this feature.

NewsForms could be incorporated into NITF and XMLNews. The NewsForm element could be inserted wherever metadata (as opposed to story text) occurs: the head or body.head elements of NITF and XMLNews-Story, or the root element xn:Resource of XMLNews-Meta. To reduce confusion, NewsForms use similar names to those used in NITF and XMLNews where they exist (as is the case for Person, Given, Family, Location, Country, City, State, Region, Money, and others).

An early information extraction system for news was the FRUMP program [6]. Information extraction technology was refined in a series of Message Understanding Conferences (MUCs) beginning in 1987 [2, 4, 7]. Four of the 17 NewsForm event elements overlap MUC domains, as shown in the following table:

MUC conference   MUC domains                                                   NewsForms
MUCK-I, MUCK-II  naval sightings and engagements                               none
MUC-3, MUC-4     terrorist events                                              InjuryFatality
MUC-5            international joint ventures, electronic circuit fabrication  JointVenture
MUC-6            management successions, labor contract negotiations           Succession, Negotiation

Future work

Future work includes adding more events to NewsForms, improving the precision and recall of NewsExtract, and evolving applications of NewsForms. Some common types of news events not yet incorporated into NewsForms are corporate buybacks, corporate restructuring, bankruptcies, layoffs, insider trading, market research, surveys, polls, ad campaigns, scientific discoveries, space missions, computer security, Y2K problems, outages, campaigning, election results, opinions, proposals, human interest stories, interviews, outdoor events, and rescue efforts.

Conclusion

By using NewsForms, computers can understand the essence of 17 common types of news events. NewsExtract can be used to convert existing text news stories into NewsForms, and individuals and organizations can create and post their own NewsForms. By deploying and evolving NewsForms, we will move closer to the realization of Tim Berners-Lee's dream of a Semantic Web [1].

References

  1. Berners-Lee, Tim. (1999). Weaving the Web. HarperSanFrancisco.
  2. Chinchor, Nancy A. (Ed.). (1999). MUC-7 proceedings (Online). Available: http://www.muc.saic.com/proceedings/muc_7_toc.html.
  3. Connolly, Dan, Khare, Rohit, and Rifkin, Adam. (1997). The evolution of web documents: The ascent of XML. In Dan Connolly (Ed.), XML: Principles, tools, and techniques. pp. 119-128. Sebastopol, CA: O'Reilly. Available: http://www.cs.caltech.edu/~adam/papers/xml/ascent-of-xml.html.
  4. Cowie, Jim, & Lehnert, Wendy. (1996). Information extraction. Communications of the ACM. 39(1), 80-91.
  5. Cycorp. (1999). The CycL language (Online). Austin, TX: Cycorp. Available: http://www.cyc.com/cycl.html
  6. DeJong, Gerald. (1982). An overview of the FRUMP system. In Wendy G. Lehnert and Martin H. Ringle (Eds.), Strategies for natural language processing. pp. 149-176. Hillsdale, NJ: Lawrence Erlbaum.
  7. Grishman, Ralph, & Sundheim, Beth. (1996). Message Understanding Conference - 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics. Copenhagen: Ministry of Research, Denmark. Available: http://cs.nyu.edu/cs/projects/proteus/muc/muc6-history-coling.ps
  8. Heflin, Jeff, Hendler, James, & Luke, Sean. (1999). SHOE: A knowledge representation language for Internet applications (Technical Report CS-TR-4078). Department of Computer Science, University of Maryland at College Park. Available: http://www.cs.umd.edu/projects/plus/SHOE/aij-shoe.ps
  9. IPTC-NAA. (1999). IPTC-NAA News Industry Text Format (NITF) XML Version 1.1 (Online). Available: http://www.iptc.org/iptc/nitfxmldocs.pdf.
  10. Mueller, Erik T. (1999). NewsForms: XML-based forms for representing the content of news events (Online). Washington, DC: Signiform. Available: http://www.signiform.com/padoof/newsforms.html.
  11. Mueller, Erik T. (1999). NewsForm XML DTD (Online). Washington, DC: Signiform. Available: http://www.signiform.com/padoof/newsform.dtd
  12. Mueller, Erik T. (2000). A calendar with common sense. In Proceedings of the 2000 International Conference on Intelligent User Interfaces. New York: Association for Computing Machinery.
  13. RIPE Network Coordination Centre. (1997). Some codes from ISO 3166 (Online). Available: ftp://ftp.ripe.net/iso3166-countrycodes
  14. Signiform. (1999). Padoof news search engine. Washington, DC: Signiform. Available: http://www.signiform.com/padoof.
  15. United States Postal Service. (1997). Postal addressing standards (Publication 28). Available: http://pe.usps.gov/cpim/ftp/pubs/Pub28/pub28.pdf.
  16. United States Postal Service. (1998). National five-digit ZIP code and post office directory (Publication 65).
  17. XMLNews.org. (1999). XMLNews-Meta (1999-04-05) technical specification (Online). Available: http://www.xmlnews.org/docs/meta-spec.html.
  18. XMLNews.org. (1999). XMLNews-Story (1999-04-05) technical specification (Online). Available: http://www.xmlnews.org/docs/story-spec.html.
  19. XMLNews.org. (1999). XMLNews-Story tutorial (Online). Available: http://www.xmlnews.org/docs/story-tutorial.html.

Copyright © 1999 Erik T. Mueller. All Rights Reserved.