Author: Sungjin Ahn (email@example.com) Last update: Oct. 22, 2016
Each file in the folder tagged_film_actor_v0.1 has a name made of three dot-separated parts, topic_id, freebase_id, and ext (for example, 0.m.010q36.sm), where
- topic_id is the id that we assign to identify a topic in Wikipedia and Freebase.
- freebase_id is the original Freebase entity id. This typically looks like "m.052xjt".
- ext is the extension. Each topic has four files, with the extensions sm, bd, fb, and en.
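The naming scheme above can be split back into its parts with a few lines of Python. This is only a minimal sketch: the names TopicFile and parse_topic_filename are hypothetical helpers, not part of the dataset, and topic_id is kept as a string because the README does not state its type.

```python
from typing import NamedTuple

# The four extensions listed in this README.
VALID_EXTS = {"sm", "bd", "fb", "en"}


class TopicFile(NamedTuple):
    """Hypothetical container for the three parts of a file name."""
    topic_id: str
    freebase_id: str
    ext: str


def parse_topic_filename(name: str) -> TopicFile:
    """Split a name such as '0.m.010q36.sm' into its three parts.

    The Freebase id itself contains a dot ('m.010q36'), so the topic id
    is taken from the left and the extension from the right.
    """
    topic_id, rest = name.split(".", 1)
    freebase_id, ext = rest.rsplit(".", 1)
    if ext not in VALID_EXTS:
        raise ValueError(f"unexpected extension: {ext!r}")
    return TopicFile(topic_id, freebase_id, ext)


if __name__ == "__main__":
    print(parse_topic_filename("0.m.010q36.sm"))
```

Splitting from both ends avoids assuming anything about how many dots the Freebase id contains.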
Annotated Wikipedia Text (.sm and .bd files)
The .sm and .bd files contain the Wikipedia text of a topic. Here .sm denotes the summary and .bd the body. The summary is the first few paragraphs in Wikipedia that introduce or summarize the topic. The body is all the other paragraphs.
Text in .sm and .bd files is annotated with Freebase entities as follows. For example, 0.m.010q36.sm, whose topic is Fred Rogers, starts with the following sentence:
$$fred_mcfeely_rogers/f/ns/m.010q36$$ (march 20, 1928 – february 27, 2003) was an american $$television_personality/a/ns/m.01rfz$$, $$puppeteer/a/ns/m.014kbl$$, $$educator/f/ns/g.121bkpjb$$, $$presbyterian/f/ns/m.0631_$$ $$minister/f/ns/m.0377kt$$, $$composer/a/ns/m.01c72t$$, $$songwriter/f/ns/m.0nbcg$$, $$author/f/ns/m.0kyk$$, and $$activist/a/ns/m.0xzm$$. $$rogers/f/ns/m.010q36$$ was most famous for creating, hosting, and composing the theme music for the $$educational/a/ns/m.0gg81w$$ $$preschool/a/ns/m.027wyv$$ television series $$mister_rogers'_neighborhood/f/ns/m.010qcv$$ (1968–2001), which featured his kind, gentle, soft-spoken personality and directness to his audiences.
Note that in the actual file, we use '@@' instead of '$$'.
Here, a string starting with $$ and ending with $$ denotes the existence of a fact between the topic (Fred Rogers) and the string inside $$. The string inside has the form entity_string/[f|a]/ns/freebase_id, as in fred_mcfeely_rogers/f/ns/m.010q36.
Here, [f|a] denotes two different types of facts. The 'f' denotes a freebase fact and 'a' denotes an anchor fact.
A freebase fact is one for which a fact between the topic and the entity string explicitly exists in Freebase. For example, the string $$educator/f/ns/g.121bkpjb$$ in the sentence above is annotated because of the Freebase fact "Fred_Rogers--Profession--Educator".
Note that the id "g.121bkpjb" is the entity id of "educator" in Freebase.
An anchor fact is a fact whose relation is unknown. For example, the string $$composer/a/ns/m.01c72t$$ in the sentence above stands for "Fred_Rogers-[Unknown_Relation]-Composer".
The relation is unknown because the fact is not found in Freebase. It still counts as a fact because the entity string "composer" is a linked string in the Wikipedia text, which implicitly indicates that some relation holds between the topic and the anchored entity in the description.
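The annotation scheme described above can be read with a regular expression. The following is a sketch under assumptions drawn only from the example sentence: entity strings contain no whitespace (words are joined by underscores), the fact type is a single 'f' or 'a', and the delimiter is '@@' in the actual files. The names Annotation, find_annotations, and plain_text are hypothetical.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical record for one annotated entity string."""
    surface: str      # entity string as it appears in the text
    fact_type: str    # 'f' = freebase fact, 'a' = anchor fact
    freebase_id: str  # e.g. 'm.010q36' or 'g.121bkpjb'


def _pattern(delim: str):
    # Assumed layout: <delim>entity_string/[f|a]/ns/freebase_id<delim>
    d = re.escape(delim)
    return re.compile(d + r"(\S+?)/([fa])/ns/([^/\s]+?)" + d)


def find_annotations(text: str, delim: str = "@@") -> List[Annotation]:
    """Return every annotation in the text, in order of appearance."""
    return [Annotation(*m.groups()) for m in _pattern(delim).finditer(text)]


def plain_text(text: str, delim: str = "@@") -> str:
    """Replace each annotation by its entity string only."""
    return _pattern(delim).sub(r"\1", text)


if __name__ == "__main__":
    sample = ("@@rogers/f/ns/m.010q36@@ was an "
              "@@educator/f/ns/g.121bkpjb@@ and "
              "@@composer/a/ns/m.01c72t@@.")
    for ann in find_annotations(sample):
        print(ann)
    print(plain_text(sample))
```

Passing delim="$$" lets the same functions handle the sentence as it is printed in this README.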
Freebase Topic Facts (.fb and .en files)
The .fb and .en files contain the set of Freebase facts about the topic specified in the file name. For example, the file 0.m.010q36.fb contains the following fact on line 280:
/ns/g.121bkpjb /ns/people.profession.people_with_this_profession /ns/m.010q36
and the file 0.m.010q36.en contains its English translation on the same line:
Educator People_With_This_Profession Fred_Rogers
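Because a fact and its English translation sit on the same line number, the two files can be walked in parallel. The sketch below assumes that the fields of a simple triple are separated by whitespace (the README does not say whether tabs or spaces are used); paired_facts is a hypothetical helper name, and bracketed CVT lines would need separate handling.

```python
from typing import Iterable, Iterator, List, Tuple


def paired_facts(
    fb_lines: Iterable[str], en_lines: Iterable[str]
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (line_number, fb_fields, en_fields) for aligned lines.

    Line numbers start at 1 so they match the numbering used in this
    README. Fields are split on whitespace, which is an assumption.
    """
    for line_no, (fb, en) in enumerate(zip(fb_lines, en_lines), start=1):
        yield line_no, fb.split(), en.split()


if __name__ == "__main__":
    # In-memory stand-ins for 0.m.010q36.fb and 0.m.010q36.en; with real
    # files, pass two open file objects instead.
    fb = ["/ns/g.121bkpjb /ns/people.profession.people_with_this_profession /ns/m.010q36\n"]
    en = ["Educator People_With_This_Profession Fred_Rogers\n"]
    for line_no, fb_fields, en_fields in paired_facts(fb, en):
        print(line_no, fb_fields, en_fields)
```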
Freebase provides a special type of entity called a composite value type (CVT). A CVT entity is itself a fact. In the following example, the triple inside the square brackets is a CVT that serves as the subject of a fact whose relation is "award_nominee" and whose object is "Fred Rogers" (m.010q36):
[/ns/m.0nclx35 /ns/award.award_nomination.ceremony /ns/m.0ncl3xy] /ns/award.award_nomination.award_nominee /ns/m.010q36
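A line parser that handles both simple triples and the bracketed CVT form might look like the sketch below. It assumes, based only on the example above, that a CVT appears in the subject position as a bracketed triple and that fields are whitespace-separated. Fact and parse_fact_line are hypothetical names.

```python
import re
from typing import NamedTuple, Tuple, Union

# Assumed CVT layout: [subj rel obj] relation object
CVT_LINE = re.compile(r"^\[(\S+)\s+(\S+)\s+(\S+)\]\s+(\S+)\s+(\S+)$")


class Fact(NamedTuple):
    """Hypothetical record; subject is a 3-tuple when it is a CVT."""
    subject: Union[str, Tuple[str, str, str]]
    relation: str
    obj: str


def parse_fact_line(line: str) -> Fact:
    """Parse one line of a .fb file into a Fact."""
    line = line.strip()
    match = CVT_LINE.match(line)
    if match:
        # The bracketed triple becomes the subject of the outer fact.
        return Fact(match.group(1, 2, 3), match.group(4), match.group(5))
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"unrecognized fact line: {line!r}")
    return Fact(parts[0], parts[1], parts[2])


if __name__ == "__main__":
    cvt = ("[/ns/m.0nclx35 /ns/award.award_nomination.ceremony /ns/m.0ncl3xy] "
           "/ns/award.award_nomination.award_nominee /ns/m.010q36")
    print(parse_fact_line(cvt))
```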