Geobase demonstrates a natural language interface, written in Visual Prolog, to a database on U.S. geography. It shouldn't be too difficult to modify it for your own purposes, so it can serve as a first step toward designing natural language interfaces for your own programs. Like all Prolog programs, Geobase is highly extensible. Try extending the Geobase rule base yourself so that Geobase can understand more sentences. It isn't that difficult! Geobase allows you to query its database in English rather than in "computerese" commands.
Geobase is by no means a complete geographical database of the United States, nor is it a complete English language interface. If you plan to write similar routines in your own programs, studying how the code is put together and how certain routines are implemented should help. Again, we urge you to modify Geobase to be a more complete program. This will not only sharpen your Visual Prolog programming skills, but it will also keep you off the streets late at night.
With Geobase, you can access the information stored in the database with natural language (in this case, English). The database supplied with Geobase is built upon the geography of the United States. You can enter queries (questions) in English prose and Geobase will parse (translate) these questions into a form that Visual Prolog understands. Geobase will answer the queries to the best of its knowledge. The Geobase application demonstrates one of the important areas where Visual Prolog shines: understanding natural language.
One of the most exciting features of Geobase is that you can examine and edit the source code. The code to Geobase is fully documented; you can take any section and modify it to suit your needs. Take a look at the database and modify it to include your home town! Soon you'll be on the road to creating your own natural language interfaces.
The database contains the following information:
Try asking a few random questions. If Geobase doesn't understand a question, it will tell you the word it can't parse. For ideas, take a look at the sample queries that follow.
The language is defined in the file GEOBASE.LAN, and the database is defined in GEOBASE.DBA.
Be imaginative! Geobase will understand many English sentences, but occasionally you will find a sentence that Geobase simply does not recognize. This is the dilemma of a natural language interface. If you find a question you feel Geobase should be able to answer but can't, try to improve Geobase so that it understands the query!
Geobase illustrates one way of implementing a natural language interface to a database. Understanding a natural language is far more complex than parsing a programming language: a natural language has far more words, and it contains difficult ambiguities. Visual Prolog is nevertheless extremely well suited for natural language processing, because the backtracking mechanism can be used to handle ambiguities.
In Geobase the stored data is a USA geography database. However, you could use the same approach for other types of data.
The key idea behind Geobase is simple: The user views the database as a network of entities connected by associations. This is known as an entity association network. The entities are the items stored in the database. In Geobase, the entities are states, cities, capitals of states, rivers, lakes, etc. The associations are words that connect the entities in queries. For example:
cities in the state california
Here the two entities, cities and state, are connected by the association in. The word the is simply ignored, and california is regarded as an actual constant for the state entity.
Geobase is designed to accept simple English. This means that, rather than worrying about whether a sentence is grammatically correct, Geobase tries to extract the meaning by attempting to match the user's query with the entity association network.
Queries can be combined to form rather complex queries. For example:
which rivers run through states that border the state with the capital austin?
To match the query against the entity association network, Geobase must simplify the query's various forms. This occurs while Geobase "parses" the query.
The first step is to ignore certain words, such as:
which, is, are, the, tell, me, what, give, as, that, please, to, how, many, live, lives, living, there, do, does
This step makes the query look like this: rivers run through states border state with capital austin?
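The word-dropping step can be sketched in a few lines. This is a minimal illustration in standard Prolog, not Geobase's own code: the names ignore_word and filter are assumptions made for this example, atoms stand in for Visual Prolog strings, and only a handful of the ignored words are listed.

```prolog
% Hypothetical sketch: drop "noise" words from a tokenized query.
% ignore_word/1 and filter/2 are illustrative names, not Geobase's own.
ignore_word(which).
ignore_word(is).
ignore_word(are).
ignore_word(the).
ignore_word(that).

% filter(+Words, -Kept): Kept is Words without the ignored words.
filter([], []).
filter([W|Ws], Kept) :-
    ignore_word(W), !,          % ignored word: skip it
    filter(Ws, Kept).
filter([W|Ws], [W|Kept]) :-     % any other word is kept
    filter(Ws, Kept).
```

With these clauses, the goal filter([which,rivers,run,through,states,that,border,the,state,with,the,capital,austin], L) binds L to [rivers,run,through,states,border,state,with,capital,austin].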
The next step is to find the internal names for entities and associations. Entities can have synonyms, and the query can use plural forms of the entity names. Associations can consist of several words and they can also have synonyms. After these conversions, the query looks like this:
river in state border state with capital austin?
Geobase can now classify the words as either entities or associations and group the query into subqueries (E=entity, A=association, C=constant):
river in state border state with capital austin?
E     A (E     A     (E     A    E       C))
Geobase can then evaluate the query by first finding the name of the state with the capital austin, then finding all states that border this state, and finally looking up which rivers run through these states.
Geobase is a natural language query interface to an existing database. You can adapt the Geobase mechanisms to your own natural language query interface; we explain how in this section.
The first thing you need to do is to create your database. How the database is stored or was created has nothing to do with Geobase. You can use internal database sections or Visual Prolog's external database system, or you could even access some other database files by means of the Visual Prolog Toolbox. Geobase accesses the actual database through the predicates db and ent.
For simplicity, the geography database is stored in an internal database section, which you can load from disk by calling the consult predicate. Here are some sample declarations from the geography database:
/*state(Name,Abbreviation,Capital,Population,Area,Admit,City,City,City,City)*/
state(string,string,string,real,real,integer,string,string,string,string)
/*city(State,Abbreviation,Name,Population) */
city(string,string,string,real)
/*river(Name,Length,StateList)*/
river(string,integer,list)
/*border(State,Abbreviation,StateList) */
border(string,string,list)
/*etc.*/
The first step in porting Geobase to your own database is to draw the entity association network. The next step is to model this network with the database predicate schema:
schema(Entity,Assoc,Entity)
Here are some examples of schema clauses from Geobase:
schema("capital","of","state")
schema("state","with","capital")
schema("population","of","state")
schema("state","with","population")
schema("area","of","state")
schema("city","in","state")
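The following sketch shows one way schema clauses can be consulted to see which associations the network allows between two entities. It is written in standard Prolog with atoms in place of strings, and the helper links is hypothetical, not part of Geobase.

```prolog
% A few schema clauses, with atoms standing in for Visual Prolog strings.
schema(capital, of, state).
schema(state, with, capital).
schema(population, of, state).
schema(state, with, population).
schema(city, in, state).

% links(+E1, +E2, -Assocs): hypothetical helper that collects every
% association the network allows from entity E1 to entity E2.
links(E1, E2, Assocs) :-
    findall(A, schema(E1, A, E2), Assocs).
```

Here links(city, state, As) gives As = [in], while links(city, capital, As) gives As = [], which tells you that a query joining city directly to capital has no arrow in the network to follow.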
After you have defined the entity association network, you should implement Geobase's interface to the database. This requires that you define clauses for the two predicates db and ent.
Predicates
db(ent,assoc,ent,string,string)
ent(ent,string)
The ent Predicate
The ent predicate is responsible for delivering all instances of a given entity. In the first argument to ent, Geobase passes the name of an entity and expects the second argument to return the actual string values for this entity.
Here are some example clauses for ent from Geobase:
ent(continent,usa).
ent(city,Name) :- city(_,_,Name,_).
ent(state,Name) :- state(Name,_,_,_,_,_,_,_,_,_).
ent(capital,Name):- state(_,_,Name,_,_,_,_,_,_,_).
ent(river,Name) :- river(Name,_,_).
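To see ent at work, the sketch below enumerates every instance of an entity through backtracking. It uses standard Prolog with atoms instead of strings; the facts are placeholders (the numeric fields are dummy zeros, not real data), and all_ents is a hypothetical helper introduced only for this example.

```prolog
% Placeholder facts: numeric fields are dummy values, not real data.
city(texas, tx, houston, 0).
city(texas, tx, dallas, 0).
river(rio_grande, 0, [colorado, new_mexico, texas]).

% ent(+Entity, -Name): deliver the instances of an entity one by one.
ent(city, Name)  :- city(_, _, Name, _).
ent(river, Name) :- river(Name, _, _).

% all_ents(+Entity, -Names): hypothetical helper that gathers all of them.
all_ents(Entity, Names) :-
    findall(N, ent(Entity, N), Names).
```

The goal all_ents(city, L) yields L = [houston, dallas]. An entity with no ent clause, such as lake in this sketch, simply yields the empty list.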
The db Predicate
The db predicate is a bit more complicated than ent. It is responsible for modeling the relation between two entities (the association). You can also regard the db predicate as a function from one entity value to another. All the arrows in the entity association network (modeled by the schema relation) should be implemented as clauses for the db predicate. Here are some examples from the geography database:
db(city,in,state,City,State) :- city(State,_,City,_).
db(state,with,city,State,City) :- city(State,_,City,_).
db(abbreviation,of,state,Ab,State) :- state(State,Ab,_,_,_,_,_,_,_,_).
db(area,of,state,Area,State) :- state(State,_,_,_,Area1,_,_,_,_,_), str_real(Area,Area1).
db(capital,of,state,Capital,State) :- state(State,_,Capital,_,_,_,_,_,_,_).
db(state,border,state,State1,State2) :- border(State2,_,List), member(State1,List).
db(length,of,river,Length,River) :- river(River,Length1,_), str_real(Length,Length1).
db(state,with,river,State,River) :- river(River,_,List), member(State,List).
That's really all you need to do in order to provide a natural language interface to your existing database.
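Because db is a relation, the same clause answers questions in either direction. The sketch below illustrates this in standard Prolog under a few assumptions: atoms replace strings, mem plays the role of member, number_codes and atom_codes stand in for Visual Prolog's str_real, and the river length is an illustrative value only.

```prolog
% Deliberately tiny fact base; the river length is illustrative only.
border(texas, tx, [new_mexico, oklahoma, arkansas, louisiana]).
river(rio_grande, 1885, [colorado, new_mexico, texas]).

% mem/2 plays the role of member/2.
mem(X, [X|_]).
mem(X, [_|T]) :- mem(X, T).

db(state, border, state, S1, S2) :- border(S2, _, L), mem(S1, L).
db(state, with, river, S, R)     :- river(R, _, L), mem(S, L).
% Values are handed back as text, so the number is converted
% (number_codes/atom_codes stand in for str_real here).
db(length, of, river, Len, R) :-
    river(R, N, _), number_codes(N, Cs), atom_codes(Len, Cs).
```

For example, db(state, border, state, S, texas) enumerates new_mexico, oklahoma, arkansas, and louisiana on backtracking, db(state, with, river, texas, R) binds R to rio_grande, and db(length, of, river, Len, rio_grande) binds Len to the atom '1885'.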
Most natural languages (and English in particular) are not simple, straightforward, and consistent. Nouns can be singular or plural, verbs conjugate, synonyms exist. Translating sentences from natural language to something the program recognizes is not a simple task. In the following sections we discuss how the Geobase program deals with these translation issues.
Internal Entity Names
Geobase needs to obtain an internal entity name from the words the user has typed. This breaks down into three separate problems:
Plural forms of entities. The user might use the word states, which is the entity name state followed by an s; or cities, which comes from the entity name city. The predicate entn is responsible for converting plural entities to their singular forms.
Synonyms for entities. The user might type town instead of city, or place instead of point. Synonyms for entities are stored in the database predicate synonym.
Compound entity values. The entity values might consist of more than one word, like new york or salt lake city. Geobase handles this situation during parsing with the predicate get_cmpent.
Some of the clauses involved look like this:
predicates
ent_name(ent,string) /* Convert between an entity name and an internal entity name */
entn(string,string) /* Convert an entity to singular form */
entity(string) /* Get all entities */
ent_synonym(string,string) /* Synonyms for entities */
clauses
ent_name(Ent,Navn) :- entn(E,Navn),
ent_synonym(E,Ent),
entity(Ent).
ent_synonym(E,Ent) :-synonym(E,Ent).
ent_synonym(E,E).
entn(E,N) :-concat(E,"s",N).
entn(E,N) :-free(E), bound(N), concat(X,"ies",N), concat(X,"y",E).
entn(E,E).
entity("name"):-!.
entity("continent"):-!.
entity(X) :- schema(X,_,_).
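The interplay of plural stripping, synonyms, and backtracking is easier to follow on a concrete word. The sketch below ports the clauses to standard Prolog as an illustration: atom_concat stands in for Visual Prolog's concat, atoms replace strings, the free and bound checks are dropped because the word is always given, and the synonym and entity facts are a small assumed sample.

```prolog
% Small assumed sample of language facts.
synonym(town, city).      % "town" is another word for the entity city
entity(city).
entity(state).

ent_synonym(E, Ent) :- synonym(E, Ent).
ent_synonym(E, E).

% entn(-Singular, +Word): candidate singular forms of Word.
entn(E, N) :- atom_concat(E, s, N).                          % states -> state
entn(E, N) :- atom_concat(X, ies, N), atom_concat(X, y, E).  % cities -> city
entn(E, E).                                                  % already singular

% ent_name(-Ent, +Word): internal entity name for a user word.
ent_name(Ent, Word) :- entn(E, Word), ent_synonym(E, Ent), entity(Ent).
```

For the word cities, the first entn clause proposes citie, which is neither a synonym nor an entity, so Prolog backtracks into the second clause and finds city. For towns, the first clause gives town, and the synonym fact maps it to city.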
In the same way that entities can have synonyms and consist of several words, the associations in the queries can be represented by several words. The alternative forms for the association names are stored in the assoc database predicate. assoc stores a list of words that can be used for the internal association name; for example:
assoc("in",["in"])
assoc("in",["running","through"])
assoc("in",["runs","through"])
assoc("in",["run","through"])
assoc("with",["with"])
assoc("with",["traversed"])
assoc("with",["traversed","by"])
The predicate get_assoc is responsible for recognizing an association at the beginning of a list of words. It does this by using the nondeterministic version of append to split the list into two parts. If the first part of the list matches an alternative for an association in the assoc predicate, the corresponding internal association name is returned.
get_assoc(IL,OL,A) :- append(ASL,OL,IL), assoc(A,ASL).
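A small experiment makes the splitting visible. The sketch below is standard Prolog with atoms in place of strings; app is a locally defined stand-in for append, and only a few assoc alternatives are included.

```prolog
% A few association alternatives (atoms instead of strings).
assoc(in,   [in]).
assoc(in,   [run, through]).
assoc(with, [traversed]).
assoc(with, [traversed, by]).

% app/3 is the usual nondeterministic append.
app([], L, L).
app([H|T], L, [H|R]) :- app(T, L, R).

% get_assoc(+Words, -Rest, -Assoc): try each prefix of Words as an
% association; Rest is what follows the prefix.
get_assoc(IL, OL, A) :- app(ASL, OL, IL), assoc(A, ASL).
```

The goal get_assoc([run,through,texas], Rest, A) has the single solution Rest = [texas], A = in. The goal get_assoc([traversed,by,rivers], Rest, A) has two: first Rest = [by,rivers], then on backtracking Rest = [rivers], both with A = with. If the parser cannot continue from the first split, backtracking lets it try the second.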
The parser is responsible for recognizing the query sentence structure. There are many types of sentences, but the parser classifies them into nine different cases. Each of these nine cases has a corresponding alternative in the domain query. The query domain is defined recursively, which means it can represent nested queries.
Give me cities - ENT - q_e(ENT)
state with the city new york - ENT ASSOC ENT CONST - q_eaec(ENT,ASSOC,ENT,STR)
rivers in (...) - ENT ASSOC SUBQUERY - q_eaq(ENT,ASSOC,ENT,QUERY)
rivers longer than 1000 miles - ENT REL UNIT VAL - q_sel(ENT,RELOP,UNIT,REAL)
the smallest (...) - MIN SUBQUERY - q_min(ENT,QUERY)
the biggest (...) - MAX SUBQUERY - q_max(ENT,QUERY)
rivers that does not traverse - ENT ASSOC NOT SUBQ - q_not(ENT,QUERY)
rivers that are longer than 1 thousand miles or that run through texas - SUBQUERY OR SUBQUERY - q_or(QUERY,QUERY)
which state borders nevada and borders arizona - SUBQUERY AND SUBQUERY - q_and(QUERY,QUERY)
The words that users can type for minimum, maximum, units, etc., are stored in the language database section. The definition in Geobase looks like this:
entitysize(entity,keyword)
relop(keywords,relative_size) /* relational operator */
assoc(association_between_entities,keyword)
synonym(keyword,entity)
ignore(keyword)
min(keyword)
max(keyword)
size(entity,keyword)
unit(keyword,keyword)
The parser uses a method called "parsing by difference lists." The first two arguments for the parsing predicates are the input list and what remains of the list after part of a query is stripped off. In the last argument, the parser builds up a structure for the query.
The parser consists of several predicates and clauses, each of which is responsible for handling special cases in recognizing the query. If you want to understand everything about the parser, study the comments and use trace mode to follow how Geobase parses various queries.
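The input/remainder threading can be illustrated with a toy parser that covers only three of the nine cases. This is a sketch in standard Prolog, not Geobase's parser: the word lists are a tiny assumed lexicon, constants are single tokens, head is a hypothetical helper, and the third argument of q_eaq is assumed to name the entity that the subquery yields.

```prolog
% Tiny assumed lexicon of internal names.
entity_word(river).
entity_word(state).
entity_word(capital).
assoc_word(in).
assoc_word(border).
assoc_word(with).

% head(+Query, -Entity): hypothetical helper, the entity a query yields.
head(q_e(E), E).
head(q_eaec(E, _, _, _), E).
head(q_eaq(E, _, _, _), E).

% query(+Input, -Rest, -Term): consume a query from the front of Input,
% leave the unused words in Rest, and build the structure in Term.
query([E, A, E2, C|Rest], Rest, q_eaec(E, A, E2, C)) :-
    entity_word(E), assoc_word(A), entity_word(E2),
    \+ entity_word(C), \+ assoc_word(C).      % C must be a constant
query([E, A|S1], S2, q_eaq(E, A, E2, Sub)) :-
    entity_word(E), assoc_word(A),
    query(S1, S2, Sub),                       % nested subquery
    head(Sub, E2).
query([E|Rest], Rest, q_e(E)) :-
    entity_word(E).

% parse(+Words, -Term): a full parse leaves no words over.
parse(Words, Term) :- query(Words, [], Term).
```

Calling parse([river,in,state,border,state,with,capital,austin], T) binds T to q_eaq(river,in,state,q_eaq(state,border,state,q_eaec(state,with,capital,austin))), which mirrors the grouping E A (E A (E A E C)).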
The following clause recognizes the query How large is the town new york. The filter gives the parser the list ["large", "town", "new", "york"].
s_attr([BIG,ENAME|S1],S2,E1,q_eaec(E1,A,E2,X)) :- /* First s_attr clause */
ent_name(E2,ENAME), /* Entity type town is a city. Look up entity in the language scheme */
size(E2,BIG), /* Look up: city size is large */
entitysize(E2,E1), /* Look up: city scale is population */
schema(E1,A,E2), /* Look up scheme: population of city */
get_ent(S1,S2,X),!. /* Return an entity name and query */
The parser is also able to recognize the more ambiguous query How large is new york. Given this query, the first clause for s_attr fails because it expects an entity type (such as town or state). Then the program calls the second clause for s_attr, shown here.
s_attr([BIG|S1],S2,E1,q_eaec(E1,A,E2,X)):- /*Second s_attr clause*/
get_ent(S1,S2,X), size(E2,BIG),entitysize(E2,E1), schema(E1,A,E2),ent(E2,X),!.
Using this clause, the parser decides that new york refers to the city and that large refers to the number of citizens.
Once the parser returns a query, Geobase calls the eval predicate, which determines which database queries should be made. The actual calls into the database are made with the db and ent predicates.
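As a rough illustration of how such an evaluator can walk a nested query, the sketch below handles three query forms. It is written in standard Prolog and is an assumption about the general shape, not Geobase's actual eval: the fact base is tiny and deliberately incomplete, facts are simplified to two arguments, mem replaces member, and answers is a hypothetical helper.

```prolog
% Deliberately tiny, incomplete fact base for illustration.
capital(texas, austin).
border(texas, [new_mexico, oklahoma, arkansas, louisiana]).
river(rio_grande, [colorado, new_mexico, texas]).
river(arkansas_river, [colorado, kansas, oklahoma, arkansas]).

mem(X, [X|_]).
mem(X, [_|T]) :- mem(X, T).

ent(river, R) :- river(R, _).
ent(state, S) :- capital(S, _).

db(state, with, capital, S, C)   :- capital(S, C).
db(state, border, state, S1, S2) :- border(S2, L), mem(S1, L).
db(river, in, state, R, S)       :- river(R, L), mem(S, L).

% eval(+Query, -Value): one answer per solution, more on backtracking.
eval(q_e(E), V) :- ent(E, V).
eval(q_eaec(E1, A, E2, C), V) :- db(E1, A, E2, V, C).
eval(q_eaq(E1, A, E2, Sub), V) :-
    eval(Sub, V2),                 % solve the inner query first
    db(E1, A, E2, V, V2).          % then follow the association outwards

% answers(+Query, -Sorted): hypothetical helper, all distinct answers.
answers(Query, Sorted) :-
    findall(V, eval(Query, V), Vs),
    sort(Vs, Sorted).
```

With these facts, answers(q_eaq(river,in,state,q_eaq(state,border,state,q_eaec(state,with,capital,austin))), Rs) gives Rs = [arkansas_river, rio_grande]. The evaluation runs inside-out: texas is found from austin, then its four neighbors, then the rivers running through each of them.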