Geobase - A Geographical Database
The database contains the following information:
Information about states:
- Area of the state in square kilometers
- Population of the state in citizens
- Capital of the state
- Which states border a given state
- Rivers in the state
- Cities in the state
- Highest and lowest point in the state in meters
Information about rivers:
- Length of river in kilometers
Information about cities:
- Population of the city in citizens
Try asking a few random questions. If Geobase doesn't understand a question, it will tell you the word it can't parse. Take a look at the following sample queries.
- What are the states?
- What are the cities of New York?
- What is the highest mountain in California?
- What are the names of the states which border New Mexico?
- Which rivers run through the state that border the state with the capital Olympia?
The language is defined in the file GEOBASE.LAN, and the database is defined in GEOBASE.DBA.
Be imaginative! Geobase will understand many English sentences, but occasionally you will find a sentence that Geobase simply does not recognize. This is the dilemma of a natural language interface. If you find a question that you feel Geobase should be able to answer but can't, you will need to improve Geobase so that it understands the query!
Geobase illustrates one way of implementing a natural language interface to a database. However, developing a complete natural language interface to a database is a very complicated task, as natural languages are far more complex than programming languages: they have far more words, and they contain ambiguities that are difficult to resolve. Visual Prolog is nevertheless extremely well suited for natural language processing, because the backtracking mechanism can be used to handle ambiguities.
In Geobase the stored data is a USA geographical database. However, you could use the same approach for other types of data.
The key idea behind Geobase is simple: The user views the database as a network of entities connected by associations. This is known as an entity association network. The entities are the items stored in the database. In Geobase the entities are states, cities, capitals of states, rivers, lakes, etc. The associations are words that connect the entities in queries. For example:
Cities in the state of California.
Here the two entities, cities and state, are connected by the association in. The word "the" is just ignored here, and California is regarded as an actual constant for the state entity.
Geobase is designed to accept simple English. This means that, rather than worrying whether a sentence is grammatically correct, Geobase tries to extract the meaning by attempting to match the user's query with the entity association network.
Queries can be combined to form rather complex queries. For example:
which rivers run through states that border the state with the capital Austin?
In order to make the query match the entity association network, Geobase must simplify the various forms of the query. This occurs while Geobase "parses" the query.
The first step is to ignore certain words, such as:
which, is, are, the, tell, me, what, give, as, that, please, to, how, many, live, lives, living, there, do, does
This step makes the query look like this:
rivers run through states border state with capital Austin?
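The filtering step can be sketched in a few lines. This is a Python illustration rather than Geobase's Visual Prolog code; the function name strip_ignored is hypothetical, and treating "please" and "to" as two separate ignored words is an assumption.

```python
# Words Geobase skips, taken from the list above
# (assumption: "please" and "to" are separate entries).
IGNORED = {
    "which", "is", "are", "the", "tell", "me", "what", "give", "as", "that",
    "please", "to", "how", "many", "live", "lives", "living", "there", "do", "does",
}

def strip_ignored(query: str) -> list[str]:
    """Lowercase the query, drop a trailing '?', and remove ignored words."""
    words = query.lower().rstrip("?").split()
    return [w for w in words if w not in IGNORED]

print(strip_ignored(
    "which rivers run through states that border the state with the capital Austin?"
))
# → ['rivers', 'run', 'through', 'states', 'border', 'state', 'with', 'capital', 'austin']
```

The sketch lowercases the input for simplicity; whether Geobase itself normalizes case at this point is not stated here.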
The next step is to find the internal names for entities and associations. Entities can have synonyms, and the query can use plural forms of the entity names. Associations can consist of several words, and they can also have synonyms. After these conversions, the query looks like this:
river in state border state with capital Austin?
Geobase can now classify the words as either entities or associations and group the query into subqueries (E=entity, A=association, C=constant):
river in state border state with capital Austin?
E A (E A (E A E C))
Geobase can then evaluate the query by first finding the name of the state with the capital Austin, then finding all the states that border this state, and finally looking up which rivers run through these states.
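The inside-out evaluation order can be illustrated with a small Python sketch. The dictionaries below are toy data invented for the example (they are not the contents of GEOBASE.DBA), and all function names are hypothetical.

```python
# Toy data for illustration only; the real facts live in GEOBASE.DBA.
CAPITALS = {"texas": "austin", "oklahoma": "oklahoma city"}
BORDERS = {"texas": ["new mexico", "oklahoma", "arkansas", "louisiana"]}
RIVERS = {
    "rio grande": ["colorado", "new mexico", "texas"],
    "arkansas": ["colorado", "kansas", "oklahoma", "arkansas"],
    "pecos": ["new mexico", "texas"],
}

def states_with_capital(capital):
    return [s for s, c in CAPITALS.items() if c == capital]

def states_bordering(state):
    return BORDERS.get(state, [])

def rivers_in(state):
    return [r for r, states in RIVERS.items() if state in states]

def rivers_through_neighbors_of_capital(capital):
    """Evaluate E A (E A (E A E C)) from the innermost subquery outward."""
    result = []
    for inner in states_with_capital(capital):      # state with capital C
        for neighbor in states_bordering(inner):    # state border (...)
            for river in rivers_in(neighbor):       # river in (...)
                if river not in result:
                    result.append(river)
    return result

print(rivers_through_neighbors_of_capital("austin"))
# → ['rio grande', 'pecos', 'arkansas']
```

Each nested loop plays the role that backtracking plays in Prolog: every solution of the inner subquery is fed to the enclosing one.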
Geobase is a natural language query interface to an existing database. You can adapt the Geobase mechanisms to your own natural language query interface; we explain how in this section.
Create Your Database
The first thing you need to do is to create your database. How the database is stored or was created has nothing to do with Geobase. You can use internal database sections or Visual Prolog's external database system, or you could even access other database files by means of the Visual Prolog Toolbox. Geobase accesses the actual database through the predicates db and ent.
For simplicity, the geographical database is stored in an internal database section, which you can load from disk by calling the consult predicate. Here are some sample declarations from the geographical database:
/* state(Name,Abbreviation,Capitol,Area,Admit,Population,City,City,City,City) */
state(string,string,string,real,real,integer,string,string,string,string)
/* city(State,Abbreviation,Name,Population) */
city(string,string,string,real)
/* river(Name,Length,StateList) */
river(string,integer,list)
/* border(State,Abbreviation,StateList) */
border(string,string,list)
/* etc. */
Porting Geobase
The first step in porting Geobase to your own database is to draw the entity association network. The next step is to model this network with the database predicate schema:
schema(Entity,Assoc,Entity)
Here are some examples of schema clauses from Geobase:
schema("capital","of","state")
schema("state","with","capital")
schema("population","of","state")
schema("state","with","population")
schema("area","of","state")
schema("city","in","state")
After you have defined the entity association network, you should implement Geobase's interface to the database. This requires that you define clauses for the two predicates db and ent.
Predicates
db(ent,assoc,ent,string,string)
ent(ent,string)
The ent Predicate
The ent predicate is responsible for delivering all instances of a given entity. In the first argument of ent, Geobase passes the name of an entity, and it expects the second argument to return the actual string values for this entity.
Here are some example clauses of ent from Geobase:
ent(continent,usa).
ent(city,Name) :- city(_,_,Name,_).
ent(state,Name) :- state(Name,_,_,_,_,_,_,_,_,_).
ent(capital,Name):- state(_,_,Name,_,_,_,_,_,_,_).
ent(river,Name) :- river(Name,_,_).
The db predicate is a bit more complicated than ent. It is responsible for modeling the relation between two entities (the association). You can also regard db as a function from one entity value to another. Every arrow in the entity association network (modeled by the schema relation) should be implemented as a clause for db. Here are some examples from the geographical database:
db(city,in,state,City,State) :- city(State,_,City,_).
db(state,with,city,State,City) :- city(State,_,City,_).
db(abbreviation,of,state,Ab,State) :- state(State,Ab,_,_,_,_,_,_,_,_).
db(area,of,state,Area,State) :- state(State,_,_,_,Area1,_,_,_,_,_), str_real(Area,Area1).
db(capital,of,state,Capital,State) :- state(State,_,Capital,_,_,_,_,_,_,_).
db(state,border,state,State1,State2) :- border(State2,_,List), member(State1,List).
db(length,of,river,Length,River) :- river(River,Length1,_), str_real(Length,Length1).
db(state,with,river,State,River) :- river(River,_,List), member(State,List).
That's really all you need to do in order to provide a natural language interface for your existing database.
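To make the division of labor concrete, here is a rough Python analogue of the two interface predicates, using generators in place of Prolog backtracking. The fact tables are placeholders invented for the example (the numeric values are not real data), and the Python signatures are assumptions, not part of Geobase.

```python
# Placeholder facts standing in for the city and river database predicates.
CITY = [  # (State, Abbreviation, Name, Population); populations are dummies
    ("california", "ca", "sacramento", 1.0),
    ("texas", "tx", "austin", 2.0),
]
RIVER = [("pecos", 3, ["new mexico", "texas"])]  # (Name, Length, StateList); dummy length

def ent(entity):
    """Yield every instance of an entity, like ent(Entity, Name)."""
    if entity == "city":
        for _, _, name, _ in CITY:
            yield name
    elif entity == "river":
        for name, _, _ in RIVER:
            yield name

def db(e1, assoc, e2):
    """Yield (value1, value2) string pairs for one arrow of the network."""
    if (e1, assoc, e2) == ("city", "in", "state"):
        for state, _, city, _ in CITY:
            yield city, state
    elif (e1, assoc, e2) == ("state", "with", "river"):
        for river, _, states in RIVER:
            for state in states:
                yield state, river
    elif (e1, assoc, e2) == ("length", "of", "river"):
        for river, length, _ in RIVER:
            yield str(length), river  # numbers cross the interface as strings

print(list(db("state", "with", "river")))
# → [('new mexico', 'pecos'), ('texas', 'pecos')]
```

As in the Prolog clauses, numeric values are converted to strings at the interface, so the rest of the system only ever handles string values.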
Translating Natural Language Queries
Most natural languages (and English in particular) are not simple, straightforward, and consistent. Nouns can be singular or plural, verbs conjugate, synonyms exist. Translating sentences from natural language to something the program recognizes is not a simple task. In the following sections we discuss how the Geobase program deals with these translation issues.
Internal Entity Names
Geobase needs to obtain an internal entity name from the words the user has typed. This task breaks down into three separate problems:
1. Plural forms of entities. The user might use the word states, which is the entity name state with an s appended, or the word cities, which comes from the entity name city. The predicate entn is responsible for converting plural entities to their singular forms.
2. Synonyms for entities. The user might type town instead of city, or place instead of point. Synonyms for entities are stored in the database predicate synonym.
3. Compound entity values. The entity values might consist of more than one word, like new york or salt lake city. Geobase handles this situation during parsing with the predicate get_cmpent.
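The first two conversions can be sketched in Python as follows. The synonym pairs come from the examples above; the ENTITIES set is an assumed subset chosen for illustration, and the function names are hypothetical. Compound values are not covered by this sketch.

```python
SYNONYMS = {"town": "city", "place": "point"}               # examples from the text
ENTITIES = {"state", "city", "river", "point", "capital"}   # assumed subset

def singular_candidates(word):
    """Yield possible singular forms of a word, one per plural rule."""
    if word.endswith("s"):
        yield word[:-1]           # states -> state
    if word.endswith("ies"):
        yield word[:-3] + "y"     # cities -> city
    yield word                    # already singular

def ent_name(word):
    """Return the internal entity name for a user word, or None."""
    for candidate in singular_candidates(word.lower()):
        internal = SYNONYMS.get(candidate, candidate)
        if internal in ENTITIES:
            return internal
    return None

print(ent_name("cities"), ent_name("towns"), ent_name("place"))
# → city city point
```

Trying the candidates in turn mimics backtracking: the wrong guess "citie" is rejected because it is not a known entity, and the next rule then produces "city".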
Some of the Geobase clauses involved look like this:
Predicates
ent_name(ent,string) /* Converts between an entity name and an internal entity name */
entn(string,string) /* Converts an entity to singular form */
entity(string) /* Gets all entities */
ent_synonym(string,string) /* Synonyms for entities */
Clauses
ent_name(Ent,Navn) :- entn(E,Navn),ent_synonym(E,Ent),entity(Ent).
ent_synonym(E,Ent) :-synonym(E,Ent).
ent_synonym(E,E).
entn(E,N) :-concat(E,"s",N).
entn(E,N) :-free(E), bound(N), concat(X,"ies",N), concat(X,"y",E).
entn(E,E).
entity("name"):-!.
entity("continent"):-!.
entity(X) :- schema(X,_,_).
Internal Names for Associations
Just as entities can have synonyms and consist of several words, associations in queries can have synonyms and be expressed in several words. The alternative forms of each association are stored in the assoc database predicate, which pairs an internal association name with a list of words that can be used for it; for example:
assoc("in",["in"])
assoc("in",["running","through"])
assoc("in",["runs","through"])
assoc("in",["run","through"])
assoc("with",["with"])
assoc("with",["traversed"])
assoc("with",["traversed","by"])
The predicate get_assoc is responsible for recognizing an association at the beginning of a list of words. It does this by using the nondeterministic version of append to split the list into two parts. If the first part of the list matches an alternative for an association in the assoc predicate, the corresponding internal association name is returned.
get_assoc(IL,OL,A) :- append(ASL,OL,IL), assoc(A,ASL).
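The effect of calling append with only its third argument bound can be imitated in Python with a generator that enumerates every split of the list. This is an illustrative analogue, not Geobase code; the helper name splits is hypothetical, and the ASSOC table repeats the facts listed above.

```python
ASSOC = [
    ("in", ["in"]), ("in", ["running", "through"]), ("in", ["runs", "through"]),
    ("in", ["run", "through"]), ("with", ["with"]), ("with", ["traversed"]),
    ("with", ["traversed", "by"]),
]

def splits(words):
    """Yield every (prefix, rest) pair, as append(ASL, OL, IL) does on backtracking."""
    for i in range(len(words) + 1):
        yield words[:i], words[i:]

def get_assoc(words):
    """Yield (rest, internal_name) for each association that starts the list."""
    for prefix, rest in splits(words):
        for name, alternative in ASSOC:
            if alternative == prefix:
                yield rest, name

print(list(get_assoc(["run", "through", "state"])))
# → [(['state'], 'in')]
print(list(get_assoc(["traversed", "by", "river"])))
# → [(['by', 'river'], 'with'), (['river'], 'with')]
```

The second call yields two solutions. In Prolog the parser would try the first one, fail later on the stray word "by", and backtrack into the second.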
The parser is responsible for recognizing the sentence structure of a query. There are many types of sentences, but the parser classifies them into nine different cases. Each of these nine cases has a corresponding alternative in the query domain. The query domain is defined recursively, which means it can represent nested queries.
Example query - Sentence pattern - Query structure
Give me cities - ENT - q_e(ENT)
state with the city new york - ENT ASSOC ENT CONST - q_eaec(ENT,ASSOC,ENT,STR)
rivers in (...) - ENT ASSOC SUBQUERY - q_eaq(ENT,ASSOC,ENT,QUERY)
rivers longer than 1000 miles - ENT REL UNIT VAL - q_sel(ENT,RELOP,UNIT,REAL)
the smallest (...) - MIN SUBQUERY - q_min(ENT,QUERY)
the biggest (...) - MAX SUBQUERY - q_max(ENT,QUERY)
rivers that does not traverse - ENT ASSOC NOT SUBQ - q_not(ENT,QUERY)
rivers that are longer than 1 thousand miles or that run through texas - SUBQUERY OR SUBQUERY - q_or(QUERY,QUERY)
which state borders nevada and borders arizona - SUBQUERY AND SUBQUERY - q_and(QUERY,QUERY)
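A recursive domain of this kind corresponds to a tagged union in other languages. The Python sketch below models only three of the nine alternatives; the class and field names are hypothetical, and the remaining cases would follow the same pattern.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class QE:                 # q_e(ENT)
    ent: str

@dataclass
class QEaec:              # q_eaec(ENT,ASSOC,ENT,STR)
    ent: str
    assoc: str
    ent2: str
    const: str

@dataclass
class QEaq:               # q_eaq(ENT,ASSOC,ENT,QUERY)
    ent: str
    assoc: str
    ent2: str
    sub: "Query"          # the recursion: a query nested inside a query

Query = Union[QE, QEaec, QEaq]

def depth(q):
    """Count nesting levels by following the subquery field."""
    return 1 + depth(q.sub) if isinstance(q, QEaq) else 1

# rivers in states that border the state with the capital austin
austin = QEaq("river", "in", "state",
              QEaq("state", "border", "state",
                   QEaec("state", "with", "capital", "austin")))
print(depth(austin))  # → 3
```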
The words that users can type for minimum, maximum, units, etc., are stored in the language database section. The definition in Geobase looks like this:
entitysize(entity,keyword)
relop(keywords,relative_size) /* relational operator */
assoc(association_between_entities,keyword)
synonym(keyword,entity)
ignore(keyword)
min(keyword)
max(keyword)
size(entity,keyword)
unit(keyword,keyword)
Parsing by Difference Lists
The parser uses a method called "parsing by difference lists." The first two arguments of the parsing predicates are the input list and what remains of the list after part of a query is stripped off. In the last argument the parser builds up a structure for the query.
The parser consists of several predicates and clauses, each of which is responsible for handling special cases in recognizing the query. If you want to understand everything about the parser, study the comments and use trace mode to follow how Geobase parses various queries.
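The input/remainder convention can be mimicked in Python by having each parsing function return the unconsumed words together with the structure it built. The sketch below is a deliberately tiny grammar covering three sentence patterns on already-normalized words; it is not Geobase's parser, single-word constants are assumed, and the ENTITIES and ASSOCS sets are illustrative subsets.

```python
ENTITIES = {"river", "state", "capital", "city"}   # assumed subset
ASSOCS = {"in", "border", "with"}                  # assumed subset

def parse_query(words):
    """Consume a prefix of words; return (remaining_words, structure) or None."""
    if not words or words[0] not in ENTITIES:
        return None
    ent, rest = words[0], words[1:]
    if len(rest) >= 2 and rest[0] in ASSOCS and rest[1] in ENTITIES:
        assoc, ent2, tail = rest[0], rest[1], rest[2:]
        # ENT ASSOC ENT CONST: the next word is not an association,
        # so treat it as a one-word constant.
        if tail and tail[0] not in ASSOCS:
            return tail[1:], ("q_eaec", ent, assoc, ent2, tail[0])
        # ENT ASSOC SUBQUERY: the subquery starts at the second entity.
        remaining, sub = parse_query(rest[1:])
        return remaining, ("q_eaq", ent, assoc, ent2, sub)
    return rest, ("q_e", ent)                      # plain ENT

words = "river in state border state with capital austin".split()
remaining, query = parse_query(words)
print(remaining)  # → []
print(query)
# → ('q_eaq', 'river', 'in', 'state',
#    ('q_eaq', 'state', 'border', 'state',
#     ('q_eaec', 'state', 'with', 'capital', 'austin')))
```

An empty remainder means the whole sentence was consumed. A non-empty remainder is handed on to whatever is parsed next, which is the role the second argument plays in the Prolog predicates.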
The following clause recognizes the query How large is the town new york. After the ignored words are filtered out, the parser receives the list "large", "town", "new", "york".
s_attr([BIG,ENAME|S1],S2,E1,q_eaec(E1,A,E2,X)) :- /* First s_attr clause */
  ent_name(E2,ENAME), /* Entity type: town is a city. Look up the entity in the language schema */
  size(E2,BIG), /* Look up: city size is large */
  entitysize(E2,E1), /* Look up: city scale is population */
  schema(E1,A,E2), /* Look up schema: population of city */
  get_ent(S1,S2,X),!. /* Return an entity name and query */
The parser is also able to recognize the more ambiguous query How large is new york. Given this query, the first clause for s_attr fails because it expects an entity type (such as town or state). The program then calls the second clause for s_attr, shown here.
s_attr([BIG|S1],S2,E1,q_eaec(E1,A,E2,X)) :- /* Second s_attr clause */
  get_ent(S1,S2,X), size(E2,BIG), entitysize(E2,E1), schema(E1,A,E2), ent(E2,X),!.
Using this clause, the parser decides that new york refers to the city and that large refers to the number of citizens.
Once the parser returns a query structure, Geobase calls the eval predicate, which actually evaluates the query. The actual calls into the database are made through the db and ent predicates.