During the last week we gave the audience of the MSDN/TechNet cinema the chance to get in touch with the sones GraphDB. Our demo showed the German Corpus based on one million sentences, 812K words and 118K sources. In my last post I showed a short capture of the VisualGraph handling on the Microsoft Surface table. In contrast, this one is about the type scheme of the GraphDB in comparison to the MySQL model.
MySQL model [Extracted from LCC documentation]
The most important table of the data base schema is the word list, called words.
The actual corpus is a collection of sentences stored in the table with the name sentences.
sources and inv_so
Sometimes it is interesting to know where a particularly peculiar example was drawn from. For research purposes it is also important to know that it is not an artificial example. Therefore the table sources stores the websites or other sources from which a given sentence was obtained, and the table inv_so allows this information to be looked up conveniently.
co_n and co_s
As mentioned in the introduction, information about which words co-occur with each other is very useful. The two tables co_n and co_s store this information. co_n stores which words co-occurred directly next to each other (bigrams). This mostly expresses typical uses of words with each other. co_s, on the other hand, stores which words co-occurred anywhere within a sentence. This typically expresses related or associated words.
In order to efficiently find out in which sentences a given word occurred, the table inv_w has to be accessed. It stores relations between word numbers and sentence numbers.
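To make the relational layout a bit more tangible, here is a minimal sketch of the tables described above. It uses Python with an in-memory SQLite database as a stand-in for MySQL. Only w_id, word, w1_id, w2_id and sig are column names that appear in this post; every other column name (s_id, sentence, so_id, source), the sig column on co_n, the helper sentences_containing and the toy rows are assumptions made up for illustration and are not taken from the LCC documentation.

```python
import sqlite3

# In-memory SQLite as a stand-in for MySQL. Column names other than
# w_id, word, w1_id, w2_id and sig are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words     (w_id  INTEGER PRIMARY KEY, word     TEXT);
CREATE TABLE sentences (s_id  INTEGER PRIMARY KEY, sentence TEXT);
CREATE TABLE sources   (so_id INTEGER PRIMARY KEY, source   TEXT);
CREATE TABLE inv_so    (so_id INTEGER, s_id INTEGER);             -- source -> sentence
CREATE TABLE inv_w     (w_id  INTEGER, s_id INTEGER);             -- word -> sentence
CREATE TABLE co_n      (w1_id INTEGER, w2_id INTEGER, sig REAL);  -- direct neighbours
CREATE TABLE co_s      (w1_id INTEGER, w2_id INTEGER, sig REAL);  -- same sentence
""")

# Toy rows, invented for this example.
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(1, "Laptop"), (2, "Akku")])
conn.executemany("INSERT INTO sentences VALUES (?, ?)",
                 [(1, "Der Laptop hat einen Akku."), (2, "Der Akku ist leer.")])
conn.executemany("INSERT INTO inv_w VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2)])


def sentences_containing(word):
    """Resolve word -> w_id -> s_id -> sentence text via inv_w."""
    rows = conn.execute(
        "SELECT s.sentence FROM words w "
        "JOIN inv_w i ON i.w_id = w.w_id "
        "JOIN sentences s ON s.s_id = i.s_id "
        "WHERE w.word = ? ORDER BY s.s_id",
        (word,),
    )
    return [row[0] for row in rows]
```

Calling sentences_containing("Akku") walks through inv_w and returns both toy sentences. The point of the sketch is that every relation between words, sentences and sources lives in its own link table and has to be joined back together at query time.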
GraphDB type scheme
The TextElement is the generalization of all further types. It consists of one attribute named “Content”. So the “Content” value of the word Microsoft is “Microsoft” :).
A Source might be a plain text or a website. It consists of one attribute: a list of Sentences.
A Sentence is part of a source and contains words. So there have to be two attributes. On the one hand WordsInSentence which represents a weighted list of words (the weight is of type Integer and represents the position within the sentence) and on the other hand a BackwardEdge attribute named IsInSource. It points to Sources which contain the actual Sentence.
A Word is a part of a Sentence. The IsInSentence attribute represents this relation. It is a BackwardEdge to the Sentences that contain the actual word. Furthermore there are the neighbourship relations LeftNeighbour and RightNeighbour, which are realized as weighted lists that point to other words (the weight is the significance of the relation between the words). Cooccurrences are modelled analogously to the neighbourships.
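The type scheme can be mirrored roughly in plain Python. This is not GraphDB code and not the way sones implements it, just a sketch of the shape of the four types. The attribute names follow the ones above in snake_case; the helper add_sentence, the vocabulary dictionary and the example URL are hypothetical, and the BackwardEdges are simply maintained by hand here.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class TextElement:
    content: str  # the only attribute of the base type


@dataclass(eq=False)
class Source(TextElement):
    sentences: list = field(default_factory=list)


@dataclass(eq=False)
class Sentence(TextElement):
    # weighted list: (position within the sentence, Word)
    words_in_sentence: list = field(default_factory=list)
    # stands in for the BackwardEdge IsInSource
    is_in_source: list = field(default_factory=list)


@dataclass(eq=False)
class Word(TextElement):
    # stands in for the BackwardEdge IsInSentence
    is_in_sentence: list = field(default_factory=list)
    # weighted lists: (significance, Word)
    left_neighbour: list = field(default_factory=list)
    right_neighbour: list = field(default_factory=list)
    cooccurrences: list = field(default_factory=list)


def add_sentence(source, text, vocabulary):
    """Hypothetical helper: link a new Sentence to its Source and Words."""
    sentence = Sentence(text)
    sentence.is_in_source.append(source)
    source.sentences.append(sentence)
    for position, token in enumerate(text.split()):
        word = vocabulary.setdefault(token, Word(token))
        sentence.words_in_sentence.append((position, word))
        if sentence not in word.is_in_sentence:
            word.is_in_sentence.append(sentence)
    return sentence


vocabulary = {}
web = Source("http://example.org")  # made-up source
s1 = add_sentence(web, "Microsoft baut Software", vocabulary)
s2 = add_sentence(web, "Software von Microsoft", vocabulary)
microsoft = vocabulary["Microsoft"]
```

After this, microsoft.is_in_sentence holds both sentences and s2.is_in_source leads back to web, without any separate link tables such as inv_w or inv_so.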
The following queries are intended to find the top 10 co-occurrences of the word “Laptop”, first against the MySQL model and then against the GraphDB.
SELECT w.word AS wort, k.sig AS sig FROM co_s k, words w WHERE k.w1_id =
(SELECT w_id FROM words WHERE word = 'Laptop') AND k.w2_id = w.w_id
ORDER BY k.sig DESC LIMIT 10;
from Word select TOP(Cooccurrences, 10) where Content = 'Laptop';
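For anyone who wants to play with the difference, here is a small sketch that runs the relational query on toy data (again in-memory SQLite instead of MySQL) and puts a graph-style lookup next to it. The significance values, the extra words and both function names are invented for illustration. The second function only imitates the idea of reading a weighted edge list directly from the word; it says nothing about how the GraphDB evaluates TOP internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (w_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE co_s  (w1_id INTEGER, w2_id INTEGER, sig REAL);
""")
# Toy data with made-up significance values.
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(1, "Laptop"), (2, "Akku"), (3, "Display"), (4, "Tasche")])
conn.executemany("INSERT INTO co_s VALUES (?, ?, ?)",
                 [(1, 2, 40.0), (1, 3, 55.5), (1, 4, 12.0), (2, 3, 9.0)])


def top_cooccurrences_sql(word, n=10):
    """Relational variant: subselect for the word id, then join co_s to words."""
    rows = conn.execute(
        "SELECT w.word, k.sig FROM co_s k, words w "
        "WHERE k.w1_id = (SELECT w_id FROM words WHERE word = ?) "
        "AND k.w2_id = w.w_id ORDER BY k.sig DESC LIMIT ?",
        (word, n),
    )
    return rows.fetchall()


# Graph-style variant: the weighted list hangs directly off the word.
cooccurrences = {
    "Laptop": [(40.0, "Akku"), (55.5, "Display"), (12.0, "Tasche")],
}


def top_cooccurrences_graph(word, n=10):
    """Sort the word's own weighted edges by significance and cut off at n."""
    edges = sorted(cooccurrences.get(word, []), reverse=True)
    return [(target, sig) for sig, target in edges[:n]]
```

Both functions return the same ranking for "Laptop" on this toy data. The relational version needs the id lookup and the join, while the graph-style version starts at the word and follows its edges.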