Geobase - A Geographical Database

Geobase demonstrates a natural language interface to a database on U.S. geography, written in Visual Prolog. It shouldn't be too difficult to modify it for your own purposes, so it can be the first step on the way to designing natural language interfaces for your programs. Like all Prolog programs, Geobase is highly extensible. Try extending the Geobase rule base yourself so that Geobase can handle more sentences. It isn't that difficult! Geobase allows you to query its database in English rather than in "computerese" commands.

Geobase is by no means a complete geographical database of the United States, nor is it a complete English language interface. If you plan to write similar routines in your own programs, studying how the code is put together and how certain routines are implemented should help. Again, we urge you to modify Geobase to be a more complete program. This will not only sharpen your Visual Prolog programming skills, but it will also keep you off the streets late at night.

You can access the information stored in the Geobase database with natural language (in this case, the natural language is English). The database supplied with Geobase is built upon the geography of the United States. You can enter queries (questions) in English and Geobase will parse (translate) these questions into a form that Visual Prolog understands. Geobase will give answers to the queries to the best of its knowledge. The Geobase application demonstrates one of the important areas where Visual Prolog shines: understanding natural language.

One of the most exciting features of Geobase is that you can examine and edit the source code. The code of Geobase is fully documented; you can take any section and modify it to suit your needs. Take a look at the database and modify it to include your home town! Soon you'll be on the road to creating your own natural language interfaces.

Sample queries:

What are the cities of New York?
What is the highest mountain in California?
What are the names of the states which border New Mexico?
Which rivers run through the state that border the state with the capital Olympia?

Some of the words that Geobase "knows":

Nouns        Comparatives   Relatives
area         biggest        above
capital      greatest       bigger
city         highest        greater
lake         least          less
mountain     lowest         longer
point        maximum        more
population   minimum        over
river        shortest       shorter
road                        smaller
state                       under

Examining Geobase

The database contains the following information:

Information about states:

  1. Area of the state in square kilometers
  2. Population of the state in citizens
  3. Capital of the state
  4. Which states border a given state
  5. Rivers in the state
  6. Cities in the state
  7. Highest and lowest point in the state in meters

Information about rivers:

  1. Length of river in kilometers

Information about cities:

  1. Population of the city in citizens

Try asking a few random questions. If Geobase doesn't understand a question, it will tell you the word it can't parse. Take a look at the following sample queries.

  - What are the states?
  - What are the cities of New York?
  - What is the highest mountain in California?
  - What are the names of the states which border New Mexico?
  - Which rivers run through the state that border the state with the capital Olympia?

The language is defined in the file GEOBASE.LAN, and the database is defined in GEOBASE.DBA.

Be imaginative! Geobase will understand many English sentences, but occasionally you will find a sentence that Geobase simply does not recognize. This is the dilemma of a natural language interface. If you find a question that you feel Geobase should be able to answer but can't, you will need to improve Geobase so that it understands the query!

The Idea Behind Geobase

Geobase illustrates one way of implementing a natural language interface to a database. However, developing a complete natural language interface to a database is a very complicated task, as natural languages are far more complex than programming languages. There are far more words in the natural language, and natural languages have difficult ambiguities. But Visual Prolog is extremely well suited for natural language processing, because the backtracking mechanism can be used to handle ambiguities.

In Geobase the stored data is a USA geographical database. However, you could use the same approach for other types of data.

The key idea behind Geobase is simple: The user views the database as a network of entities connected by associations. This is known as an entity association network. The entities are the items stored in the database. In Geobase the entities are states, cities, capitals of states, rivers, lakes, etc. The associations are words that connect the entities in queries. For example:

Cities in the state of California

Here the two entities, cities and state, are connected by the association "in". The word "the" is just ignored here, and California is regarded as an actual constant for the state entity.

Geobase is designed to accept simple English. This means that, rather than worrying whether a sentence is grammatically correct, Geobase tries to extract the meaning by attempting to match the user's query with the entity association network.

Queries can be combined to form rather complex queries. For example:

which rivers run through states that border the state with the capital Austin?

In order to make the query match the entity association network, Geobase must simplify the various forms of the query. This occurs while Geobase "parses" the query.

The first step is to ignore certain words, such as:

which, is, are, the, tell, me, what, give, as, that, please, to, how, many, live, lives, living, there, do, does

This step makes the query look like this:

rivers run through states border state with capital Austin?
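
As a rough illustration of this filtering step, the following sketch removes the ignored words from a query. It is written in Python purely for illustration (Geobase itself is written in Visual Prolog), and the helper names tokenize and drop_ignored are hypothetical; they do not come from the Geobase source.

```python
# Illustrative sketch only: Geobase itself is written in Visual Prolog.
# The ignore list is copied from the text; tokenize() and drop_ignored()
# are hypothetical helper names.
IGNORE = {
    "which", "is", "are", "the", "tell", "me", "what", "give", "as", "that",
    "please", "to", "how", "many", "live", "lives", "living", "there", "do",
    "does",
}

def tokenize(query: str) -> list[str]:
    """Lower-case the query and split it into words, dropping a final '?'."""
    return query.lower().rstrip("?").split()

def drop_ignored(words: list[str]) -> list[str]:
    """Keep only the words that are not on the ignore list."""
    return [w for w in words if w not in IGNORE]

query = "which rivers run through states that border the state with the capital Austin?"
filtered = drop_ignored(tokenize(query))
print(" ".join(filtered))
```

The sketch lower-cases the input for simplicity, so the constant appears as austin rather than Austin.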

The next step is to find the internal names for entities and associations. Entities can have synonyms, and the query can use plural forms of the entity names. Associations can consist of several words, and they can also have synonyms. After these conversions, the query looks like this:

river in state border state with capital Austin?

Geobase can now classify the words as either entities or associations and group the query into subqueries (E=entity, A=association, C=constant):

river in state border state with capital Austin?

E A (E A (E A E C))

Geobase can then evaluate the query by first finding the name of the state with the capital Austin, then finding all the states that border this state, and finally looking up which rivers run through these states.
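
The inside-out evaluation order described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Geobase evaluator: the three tables are a tiny, deliberately incomplete toy subset of U.S. geography, and the function names are hypothetical.

```python
# Illustrative sketch only; the real evaluation is done in Visual Prolog.
# The tables are a tiny, deliberately incomplete toy subset, not GEOBASE.DBA.
CAPITAL_OF = {"texas": "austin", "ohio": "columbus"}
BORDERS = {"texas": ["new mexico", "oklahoma", "arkansas", "louisiana"]}
RIVER_STATES = {
    "rio grande": ["colorado", "new mexico", "texas"],
    "red": ["texas", "oklahoma", "arkansas", "louisiana"],
    "ohio": ["pennsylvania", "ohio", "west virginia", "kentucky",
             "indiana", "illinois"],
}

def states_with_capital(capital: str) -> set[str]:
    """Innermost subquery: state with capital <constant>."""
    return {state for state, cap in CAPITAL_OF.items() if cap == capital}

def states_bordering(states: set[str]) -> set[str]:
    """Middle subquery: states that border any of the given states."""
    return {n for s in states for n in BORDERS.get(s, [])}

def rivers_in(states: set[str]) -> set[str]:
    """Outer query: rivers that run through any of the given states."""
    return {r for r, through in RIVER_STATES.items() if states & set(through)}

step1 = states_with_capital("austin")   # {"texas"}
step2 = states_bordering(step1)         # the neighbours of Texas
step3 = rivers_in(step2)                # rivers through those neighbours
print(sorted(step3))
```

Each step feeds its result set into the next one, which mirrors how the nested grouping E A (E A (E A E C)) is resolved from the innermost subquery outward.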

Adapting the Geobase Idea

Geobase is a natural language query interface to an existing database. You can adapt the Geobase mechanisms to your own natural language query interface; we explain how in this section.

Create Your Database

The first thing you need to do is create your database. How the database is stored or was created has nothing to do with Geobase. You can use internal database sections or Visual Prolog's external database system, or you could even access other database files by means of the Visual Prolog Toolbox. Geobase accesses the actual database through the predicates db and ent.

For simplicity, the geographical database is stored in an internal database section, which you can load from disk by calling the consult predicate. Here are some sample declarations from the geographical database:

/*state(Name,Abbreviation,Capital,Area,Admit,Population,City,City,City,City)*/
state(string,string,string,real,real,integer,string,string,string,string)

/*city(State,Abbreviation,Name,Population) */
city(string,string,string,real)

/*river(Name,Length,StateList)*/
river(string,integer,list)

/*border(State,Abbreviation,StateList) */
border(string,string,list)

/*etc.*/

Porting Geobase

The first step in porting Geobase to your own database is to draw the entity association network. The next step is to model this network with the database predicate schema:

schema(Entity,Assoc,Entity)

Here are some examples of schema clauses from Geobase:

schema("capital","of","state")
schema("state","with","capital")
schema("population","of","state")
schema("state","with","population")
schema("area","of","state")
schema("city","in","state")

After you have defined the entity association network, you should implement Geobase's interface to the database. This requires that you define clauses for the two predicates db and ent.

Predicates

db(ent,assoc,ent,string,string)
ent(ent,string)

The ent Predicate

The ent predicate is responsible for delivering all instances of a given entity. In the first argument of ent, Geobase passes the name of an entity and expects the second to return actual string values for this entity.

Here are some example clauses of ent from Geobase:

ent(continent,usa).
ent(city,Name) :- city(_,_,Name,_).
ent(state,Name) :- state(Name,_,_,_,_,_,_,_,_,_).
ent(capital,Name):- state(_,_,Name,_,_,_,_,_,_,_).
ent(river,Name) :- river(Name,_,_).

The db Predicate

The db predicate is a bit more complicated than ent. It is responsible for modeling the relation (the association) between two entities. You can also regard the db predicate as a function from one entity value to another value. All the arrows in the entity association network (modeled by the schema relation) should be implemented as clauses for the db predicate. Here are some examples from the geographical database:

db(city,in,state,City,State) :-city(State,_,City,_).

db(state,with,city,State,City) :-city(State,_,City,_).

db(abbreviation,of,state,Ab,State) :- state(State,Ab,_,_,_,_,_,_,_,_).

db(area,of,state,Area,State) :-state(State,_,_,_,Area1,_,_,_,_,_),str_real(Area,Area1).

db(capital,of,state,Capital,State) :-state(State,_,Capital,_,_,_,_,_,_,_).

db(state,border,state,State1,State2):- border(State2,_,List),member(State1,List).

db(length,of,river,Length,River) :-river(River,Length1,_),str_real(Length,Length1).

db(state,with,river,State,River) :-river(River,_,List),member(State,List).

That's really all you need to do in order to provide a natural language interface for your existing database.
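
To make the division of labor between schema, ent and db more concrete, here is a rough Python analogue. It is only a sketch under stated assumptions: the STATE and CITY tables are a hypothetical miniature rather than the contents of GEOBASE.DBA, and passing None for a value stands in for an unbound Prolog variable.

```python
from typing import Iterator, Optional

# Toy facts: (name, abbreviation, capital) and (state, abbreviation, city).
# A hypothetical miniature, not the contents of GEOBASE.DBA.
STATE = [("texas", "tx", "austin"), ("california", "ca", "sacramento")]
CITY = [("texas", "tx", "austin"), ("texas", "tx", "houston"),
        ("california", "ca", "los angeles")]

# Arrows of the entity association network.
SCHEMA = {("city", "in", "state"), ("state", "with", "city"),
          ("capital", "of", "state")}

def ent(entity: str) -> Iterator[str]:
    """Yield every value of the given entity, like the ent predicate."""
    if entity == "state":
        yield from (name for name, _abbr, _cap in STATE)
    elif entity == "capital":
        yield from (cap for _name, _abbr, cap in STATE)
    elif entity == "city":
        yield from (city for _state, _abbr, city in CITY)

def db(e1: str, assoc: str, e2: str,
       v1: Optional[str] = None,
       v2: Optional[str] = None) -> Iterator[tuple[str, str]]:
    """Yield (v1, v2) pairs for one schema arrow; None means 'unbound'."""
    if (e1, assoc, e2) not in SCHEMA:
        return
    if (e1, assoc, e2) == ("city", "in", "state"):
        pairs = [(city, state) for state, _abbr, city in CITY]
    elif (e1, assoc, e2) == ("state", "with", "city"):
        pairs = [(state, city) for state, _abbr, city in CITY]
    else:  # ("capital", "of", "state")
        pairs = [(cap, state) for state, _abbr, cap in STATE]
    for a, b in pairs:
        # Keep a pair only if it agrees with every value the caller bound.
        if v1 in (None, a) and v2 in (None, b):
            yield a, b

print(list(db("city", "in", "state", v2="texas")))
print(list(db("capital", "of", "state", v2="california")))
```

As in the Prolog clauses, both directions of an arrow (city in state, state with city) read the same underlying table; only the argument order differs.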

Translating Natural Language Queries

Most natural languages (and English in particular) are not simple, straightforward, and consistent. Nouns can be singular or plural, verbs conjugate, synonyms exist. Translating sentences from natural language to something the program recognizes is not a simple task. In the following sections we discuss how the Geobase program deals with these translation issues.

Internal Entity Names

Geobase needs to obtain an internal entity name from the words the user has typed. This breaks down into three separate problems:

  1. Plural forms of entities. The user might use the word states, which is the entity name state with an s appended, or the word cities, which comes from the entity name city. The predicate entn is responsible for converting plural entities to their singular forms.

  2. Synonyms for entities. The user might type town instead of city, or place instead of point. Synonyms for entities are stored in the database predicate synonym.

  3. Compound entity values. The entity values might consist of more than one word, like new york or salt lake city. Geobase handles this situation during parsing with the predicate get_cmpent.

Some of the clauses involved look like this:

Predicates

ent_name(ent,string) /* Converts between an entity name and an internal entity name */
entn(string,string) /* Converts an entity to singular form */
entity(string) /* Gets all entities */
ent_synonym(string,string) /* Synonyms for entities */

Clauses

ent_name(Ent,Navn) :- entn(E,Navn),ent_synonym(E,Ent),entity(Ent).
ent_synonym(E,Ent) :-synonym(E,Ent).
ent_synonym(E,E).
entn(E,N) :-concat(E,"s",N).
entn(E,N) :-free(E), bound(N), concat(X,"ies",N), concat(X,"y",E).
entn(E,E).
entity("name"):-!.
entity("continent"):-!.
entity(X) :- schema(X,_,_).
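
The same lookup order (strip a plural ending, try a synonym, then check that the result is a known entity) can be sketched in Python. This is an illustration only: the two synonyms are the ones mentioned in the text, while the ENTITIES set and the function names are assumptions made for the example.

```python
from typing import Iterator, Optional

# Synonyms taken from the text; the entity set is an assumed sample.
SYNONYM = {"town": "city", "place": "point"}
ENTITIES = {"city", "state", "river", "point", "capital", "name", "continent"}

def singular_candidates(word: str) -> Iterator[str]:
    """Yield possible singular forms, mirroring the three entn clauses."""
    if word.endswith("s"):
        yield word[:-1]          # states -> state
    if word.endswith("ies"):
        yield word[:-3] + "y"    # cities -> city
    yield word                   # already singular

def ent_name(word: str) -> Optional[str]:
    """Return the internal entity name for a user word, or None."""
    for candidate in singular_candidates(word):
        for ent in (SYNONYM.get(candidate), candidate):
            if ent in ENTITIES:
                return ent
    return None

for w in ["states", "cities", "towns", "place", "texas"]:
    print(w, "->", ent_name(w))
```

Where the Prolog version relies on backtracking to try the alternatives, the sketch simply loops over the candidates in the same order and returns the first one that names an entity.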

Internal Names for Associations

In the same way that entities can have synonyms and consist of several words, the associations in queries can be represented by several words. The alternative forms for the association names are stored in the assoc database predicate, which stores a list of words that can be used for the internal association name; for example:

assoc("in",["in"])
assoc("in",["running","through"])
assoc("in",["runs","through"])
assoc("in",["run","through"])
assoc("with",["with"])
assoc("with",["traversed"])
assoc("with",["traversed","by"])

The predicate get_assoc is responsible for recognizing an association at the beginning of a list of words. It does this by using the nondeterministic version of append to split the list into two parts. If the first part of the list matches an alternative for an association in the assoc predicate, the corresponding internal association name is returned.

get_assoc(IL,OL,A) :- append(ASL,OL,IL), assoc(A,ASL).
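
The effect of splitting the word list with append can be imitated in Python by trying every prefix of the list. The sketch below is an illustration, not the Geobase code; the ASSOC table repeats the sample clauses shown earlier, and a generator stands in for Prolog's backtracking.

```python
from typing import Iterator

# The sample assoc clauses from the text, as (internal name, word list).
ASSOC = [
    ("in", ["in"]),
    ("in", ["running", "through"]),
    ("in", ["runs", "through"]),
    ("in", ["run", "through"]),
    ("with", ["with"]),
    ("with", ["traversed"]),
    ("with", ["traversed", "by"]),
]

def get_assoc(words: list[str]) -> Iterator[tuple[str, list[str]]]:
    """Yield (internal name, remaining words) for each matching prefix."""
    for split in range(len(words) + 1):
        prefix, rest = words[:split], words[split:]
        for name, phrase in ASSOC:
            if phrase == prefix:
                yield name, rest

print(list(get_assoc(["run", "through", "states"])))
print(list(get_assoc(["traversed", "by", "rivers"])))
```

The second call yields two alternatives, one consuming traversed and one consuming traversed by. This is the kind of ambiguity that the surrounding parser resolves by backtracking when the first alternative leads nowhere.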

The parser is responsible for recognizing the sentence structure of the query. There are many types of sentences, but the parser classifies them into nine different cases. Each of these nine cases has an alternative in the query domain. The query domain is defined recursively, which means it can represent nested queries.

Example query                    Pattern                  Query structure
Give me cities                   ENT                      q_e(ENT)
state with the city new york     ENT ASSOC ENT CONST      q_eaec(ENT,ASSOC,ENT,STR)
rivers in (....)                 ENT ASSOC SUBQUERY       q_eaq(ENT,ASSOC,ENT,QUERY)
rivers longer than 1000 miles    ENT REL UNIT VAL         q_sel(ENT,RELOP,UNIT,REAL)
the smallest (...)               MIN SUBQUERY             q_min(ENT,QUERY)
the biggest (..)                 MAX SUBQUERY             q_max(ENT,QUERY)
rivers that does not traverse    ENT ASSOC NOT SUBQ       q_not(ENT,QUERY)
rivers that are longer than      SUBQUERY OR SUBQUERY     q_or(QUERY,QUERY)
  1 thousand miles
  or that run through texas
which state borders nevada       SUBQUERY AND SUBQUERY    q_and(QUERY,QUERY)
  and borders arizona
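
To show what a recursively defined query type looks like in practice, the following Python sketch models three of the nine alternatives as small classes. It is a loose analogue for illustration only: the class and field names are hypothetical, and the remaining alternatives are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class QE:
    """ENT, for example: cities"""
    ent: str

@dataclass(frozen=True)
class QEAEC:
    """ENT ASSOC ENT CONST, for example: state with capital austin"""
    ent1: str
    assoc: str
    ent2: str
    const: str

@dataclass(frozen=True)
class QEAQ:
    """ENT ASSOC SUBQUERY: the last field is itself a query."""
    ent1: str
    assoc: str
    ent2: str
    sub: "Query"

Query = Union[QE, QEAEC, QEAQ]

def depth(q: Query) -> int:
    """Count how many queries are nested inside one another."""
    if isinstance(q, QEAQ):
        return 1 + depth(q.sub)
    return 1

# river in (state border (state with capital austin))
nested = QEAQ("river", "in", "state",
              QEAQ("state", "border", "state",
                   QEAEC("state", "with", "capital", "austin")))
print(depth(nested))
```

Because QEAQ holds another query in its last field, a single value can describe a query of any nesting depth.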

The words that users can type for minimum, maximum, units, etc., are stored in the language database section. The definition in Geobase looks like this:

entitysize(entity,keyword)
relop(keywords,relative_size) /* relational operator */
assoc(association_between_entities,keyword)
synonym(keyword,entity)
ignore(keyword)
min(keyword)
max(keyword)
size(entity,keyword)
unit(keyword,keyword)

Parsing by Difference Lists

The parser uses a method called "parsing by difference lists." The first two arguments of the parsing predicates are the input list and what remains of the list after part of a query is stripped off. In the last argument the parser builds up a structure for the query.

The parser consists of several predicates and clauses, each of which is responsible for handling special cases in recognizing the query. If you want to understand everything about the parser, study the comments and use trace mode to follow how Geobase parses various queries.
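
The input-list and remaining-list convention can be imitated in Python by having each parsing function return the unconsumed words together with the structure it built. The sketch below is a simplified illustration, not the Geobase parser: it covers only three of the nine cases, assumes the words have already been filtered and converted to internal names, treats constants as single words, and uses assumed ENTITIES and ASSOCS sets.

```python
# Assumed sample vocabularies; Geobase reads these from its language file.
ENTITIES = {"river", "state", "capital", "city"}
ASSOCS = {"in", "border", "with"}

def parse_query(words):
    """Parse a prefix of words; return (remaining_words, structure) or None."""
    if not words or words[0] not in ENTITIES:
        return None
    e1, rest = words[0], words[1:]
    if not rest or rest[0] not in ASSOCS:
        return rest, ("q_e", e1)                        # ENT
    assoc, rest = rest[0], rest[1:]
    sub = parse_query(rest)
    if sub is not None and sub[1][0] != "q_e":
        remaining, inner = sub                          # ENT ASSOC SUBQUERY
        return remaining, ("q_eaq", e1, assoc, inner[1], inner)
    if len(rest) >= 2 and rest[0] in ENTITIES:
        # ENT ASSOC ENT CONST: consume the entity and a one-word constant.
        return rest[2:], ("q_eaec", e1, assoc, rest[0], rest[1])
    return None

words = "river in state border state with capital austin".split()
remaining, structure = parse_query(words)
print(remaining)
print(structure)
```

An empty remaining list means the whole query was consumed. Each recursive call hands the leftover words back to its caller, which is the role played by the second argument of the Prolog parsing predicates.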

The following clause recognizes the query How large is the town new york. The filter gives the parser the list "large", "town", "new", "york".

s_attr([BIG,ENAME|S1],S2,E1,q_eaec(E1,A,E2,X)):- /*First s_attr clause*/
ent_name(E2,ENAME), /*Entity type town is a city. Look up entity in the language scheme*/
size(E2,BIG), /* look up city size is large */
entitysize(E2,E1), /* look up city scale is population */
schema(E1,A,E2), /* look up scheme population of city */
get_ent(S1,S2,X),!./* return an entity name and query */

The parser is also able to recognize the more ambiguous query How large is new york. Given this query, the first clause for s_attr fails because it expects an entity type (such as town or state). Then the program calls the second clause for s_attr, shown here.

s_attr([BIG|S1],S2,E1,q_eaec(E1,A,E2,X)):- /*Second s_attr clause*/
get_ent(S1,S2,X), size(E2,BIG),entitysize(E2,E1), schema(E1,A,E2),ent(E2,X),!.

Using this clause, the parser decides that new york refers to the city and that large refers to the number of citizens.

Once the parser returns a query, Geobase calls the eval clause that actually evaluates the query. The actual calls into the database are made with the db and ent predicates.