A gentle introduction to Zeitgeist’s Python API

In this post I’ll give a quick introduction, by example, to using Zeitgeist’s Python API for good and profit.

If you’re interested in using Zeitgeist from C instead, see the libzeitgeist examples; to use it with C++/Qt, Trever’s «a web browser in 4 steps» may be of interest.

First things first

In case you’re not familiar with Zeitgeist, it may prove helpful to first read Mikkel’s introduction to Zeitgeist post.

If that’s too much to read, you should just know that Zeitgeist is an event log. Like the history in your browser, it keeps track of which websites you open and when. It also keeps track of when you close them, and of which browser you used, since it’s a system-wide service. Furthermore, it does the same for files, conversations, e-mails, and anything else you want to insert into it.

So Zeitgeist is a database of events, and an event can be pretty much anything. But what does an event look like? Its main attributes are the following:

  • timestamp – when did the event happen (milliseconds since Unix epoch)
  • interpretation – what sort of event is it (eg. opened, closed)
  • manifestation – why did it happen (user activity, notification…)
  • actor – which is the primary application involved
  • origin – where did it come from (eg. website where you clicked the link that opened this page)

Additionally, each event has one or more subjects, which have the following attributes:

  • uri
  • current_uri – updated URI if it changed since the event
  • interpretation – abstract type (document, image, video…)
  • manifestation – how it is stored (file, remote object, website)
  • origin – parent folder for files, domain name for websites
  • mimetype
  • text – a title for the event (eg. filename, website title…)
  • storage – identifier for the storage medium of the subject (eg. local, online, pendrive X)
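
To make the shape of this data a bit more concrete, here is a minimal sketch of the attributes listed above as plain Python classes. SketchEvent and SketchSubject are hypothetical names made up for this illustration (the real classes are Event and Subject in zeitgeist.datamodel), and the plain strings used for interpretations and manifestations are placeholders for the real ontology constants.

```python
# Illustrative stand-ins only: the real classes are Event and Subject in
# zeitgeist.datamodel. SketchEvent/SketchSubject are hypothetical names.
import time


class SketchSubject(object):
    def __init__(self, uri, interpretation="", manifestation="",
                 origin="", mimetype="", text="", storage=""):
        self.uri = uri
        self.current_uri = uri  # only differs once the resource has moved
        self.interpretation = interpretation  # abstract type
        self.manifestation = manifestation    # how it is stored
        self.origin = origin
        self.mimetype = mimetype
        self.text = text
        self.storage = storage


class SketchEvent(object):
    def __init__(self, interpretation, manifestation, actor, subjects,
                 origin="", timestamp=None):
        if not subjects:
            raise ValueError("an event needs at least one subject")
        if timestamp is None:
            timestamp = int(time.time() * 1000)  # milliseconds since epoch
        self.timestamp = timestamp
        self.interpretation = interpretation
        self.manifestation = manifestation
        self.actor = actor
        self.origin = origin
        self.subjects = list(subjects)


# Made-up example data: one "opened" event with a single audio subject.
song = SketchSubject("file:///home/user/Music/song.ogg",
                     interpretation="audio", mimetype="audio/ogg",
                     text="song.ogg")
played = SketchEvent("opened", "user activity",
                     "application://player.desktop", [song],
                     timestamp=1340000000000)
print("%d: %s" % (played.timestamp, played.subjects[0].uri))
```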

Retrieving recent data

Okay, so let’s say you want to know the last song you’ve listened to (if you have a Zeitgeist-enabled music player). It’s as simple as:

from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_events_received(events):
    if events:
        song = events[0]
        print "Last song: %s" % song.subjects[0].uri
    else:
        print "You haven't listened to any songs."
 
template = Event.new_for_values(subject_interpretation=Interpretation.AUDIO)
zeitgeist.find_events_for_template(template, on_events_received, num_events=1)
 
# Start a mainloop - note: the Qt mainloop also works
from gi.repository import GLib
GLib.MainLoop().run()

This may need some explaining. We start by importing the Zeitgeist client and some associated data structures (such as Event and Interpretation), and create an instance of the client. The ZeitgeistClient class is a wrapper around Zeitgeist’s D-Bus API which not only makes it much nicer to use, but also makes it easy to install monitors (as we will see later) and provides convenient functionality such as automatic reconnection if the connection to the engine is lost.

To query for the most recent song, we just need to create a template with the restrictions we want to impose and submit it to Zeitgeist using the find_events_for_template call. If you haven’t read Mikkel’s post yet, please do so, as it introduces the structure of events and subjects (in short: an event has a timestamp, some other properties, and one or more subjects representing the resources -files, websites, people…- involved in the event).

The Python API is inherently asynchronous (if for some reason you need a synchronous API, you may still use the lower level ZeitgeistDBusInterface class), so we need to define a callback function to handle the results we receive from the Zeitgeist engine.

Finally, we need to create a main loop so the asynchronous functions can run.


Now this was a pretty simple example. Let’s make it more interesting. One song isn’t much, so let’s get the 5 most recent songs. Also, now we want both songs and videos.

The first part is pretty easy: we just need to change the num_events parameter. For the second extension, we have to change the event template. In fact, we now need two different event templates and the find_events_for_templates function, which takes an arbitrary number of event templates and ORs them. The result is as follows:

from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_events_received(events):
    for event in events:
        print "- %s" % event.subjects[0].uri
 
tmpl1 = Event.new_for_values(subject_interpretation=Interpretation.AUDIO)
tmpl2 = Event.new_for_values(subject_interpretation=Interpretation.VIDEO)
zeitgeist.find_events_for_templates([tmpl1, tmpl2],
                                    on_events_received, num_events=5)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

This will work, but unless you’re lucky you’re likely to get some duplicate lines. Why is this? Well, apart from the fact that you may have used the same file twice, don’t forget that what you are requesting are actually events. If you’ve started playing a given song, you probably also stopped playing it, so that’s actually two events (an AccessEvent and a LeaveEvent). Since this isn’t what we want, we’ll change the query a bit:

zeitgeist.find_events_for_templates(
    [tmpl1, tmpl2],
    on_events_received,
    num_events=5,
    result_type=ResultType.MostRecentSubjects,
    storage_state=StorageState.Available)

By requesting the most recent subjects, rather than the most recent events, we can filter out events with duplicate URIs. See the ResultType documentation for other modes you can use. Note particularly the MostPopularSubjects result type.
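
If the difference between the two modes is still fuzzy, this little pure-Python sketch mimics what grouping by subject does to a list of (timestamp, URI) pairs. It only illustrates the idea; it is not how the engine implements it, and most_recent_subjects is a made-up helper name.

```python
def most_recent_subjects(events, num_events):
    """Keep only the newest event per URI, newest first.

    `events` is an iterable of (timestamp_ms, uri) pairs in any order.
    """
    latest = {}
    for timestamp, uri in events:
        if uri not in latest or timestamp > latest[uri]:
            latest[uri] = timestamp
    ordered = sorted(latest.items(), key=lambda item: item[1], reverse=True)
    return [uri for uri, _ in ordered[:num_events]]


# Made-up log: song "a" was opened (t=1) and later closed (t=3).
log = [(1, "file:///a.ogg"), (2, "file:///b.ogg"),
       (3, "file:///a.ogg"), (4, "file:///c.ogg")]
print(most_recent_subjects(log, 5))
```

Each URI shows up once, even though "a" appears in two events.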

I also took the chance to introduce the storage_state parameter. This one will filter out events for files Zeitgeist knows aren’t available (this mostly means online resources won’t be shown if you don’t have a network connection; there’s also support for handling external storage media, but because of problems with GIO this is currently disabled).

Last but not least, the find_events_for_* methods also accept a timerange parameter. It defaults to TimeRange.until_now(), but you may change it to TimeRange.always() (if for some reason you’re working with events in the future) or to any other time range of your choice. Here it’s important to note that Zeitgeist’s timestamps use millisecond precision.
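
Since the millisecond detail is easy to get wrong, here is a small sketch for computing the boundaries of a "last N days" range. The helper name last_days_ms is made up; my assumption is that the resulting pair is what you would pass on as TimeRange(begin, end).

```python
import time

MS_PER_DAY = 24 * 3600 * 1000


def last_days_ms(days, now_ms=None):
    """Return (begin, end) in milliseconds since the Unix epoch."""
    if now_ms is None:
        # time.time() returns seconds as a float; Zeitgeist wants milliseconds
        now_ms = int(time.time() * 1000)
    return (now_ms - days * MS_PER_DAY, now_ms)


begin, end = last_days_ms(30)
print("Range covers %d days" % ((end - begin) // MS_PER_DAY))
```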


For more advanced queries, you can use more complex combinations of event and subject templates. The rule to keep in mind here is that events are ORed and subjects are ANDed.

Additionally, some fields (actor, origin, mimetype, uri and current_uri) may be prefixed with an exclamation mark (“!”) for NOT, or you may append an asterisk (“*”) to them for prefix search. You can even combine the two operators. Here’s an example of a template you could build:

subj1 = Subject.new_for_values(interpretation=Interpretation.SOURCE_CODE,
                               uri="file:///home/rainct/Development/*")
subj2 = Subject.new_for_values(uri="!file:///home/rainct/Development/zeitgeist/*")
tmpl1 = Event.new_for_values(interpretation=Interpretation.MODIFY_EVENT,
                             subjects=[subj1, subj2])
templates = [tmpl1]

In my case, this template would fetch a list of the source code files I modified most recently but excluding those related to the Zeitgeist project.
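
To see how the two operators combine, here is a simplified pure-Python model of the matching rules described above. It is only a sketch of the semantics as I explained them (the engine does the real work), and field_matches and matches_all are hypothetical helper names.

```python
def field_matches(pattern, value):
    """Check one template field against a value.

    A leading "!" negates the match and a trailing "*" turns it into a
    prefix search; an empty pattern matches anything.
    """
    if pattern == "":
        return True
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    if pattern.endswith("*"):
        hit = value.startswith(pattern[:-1])
    else:
        hit = (value == pattern)
    return hit != negate


def matches_all(patterns, value):
    """Subject templates are ANDed: every pattern has to match."""
    return all(field_matches(pattern, value) for pattern in patterns)


patterns = ["file:///home/rainct/Development/*",
            "!file:///home/rainct/Development/zeitgeist/*"]
for uri in ("file:///home/rainct/Development/foo/main.c",
            "file:///home/rainct/Development/zeitgeist/main.vala"):
    print("%s -> %s" % (uri, matches_all(patterns, uri)))
```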

Working with big sets of data

In case you’re trying to do something crazy, you may end up with a Zeitgeist query complaining that it exceeded the memory limit. You’re not supposed to do that. Instead, we provide some methods for working with large collections of events.

from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_events_received(events):
    for event in events:
        print '- %s' % event.subjects[0].uri
 
def on_ids_received(event_ids):
    print 'A total of %d source code files were found.' % len(event_ids)
    print 'Fetching the first 100...'
    zeitgeist.get_events(event_ids[:100], on_events_received)
 
tmpl = Event.new_for_values(subject_interpretation=Interpretation.SOURCE_CODE)
zeitgeist.find_event_ids_for_templates(
    [tmpl],
    on_ids_received,
    num_events=10000, # you can use 0 for "all events", but do you really need to?
    timerange=TimeRange.from_seconds_ago(3600*24*30*3),
    result_type=ResultType.MostPopularSubjects)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

And there you have the source code files you worked with during the last 3 months, ordered from most to least popular (popularity is measured by counting the number of events; for more precision, you could limit the results to events with interpretation AccessEvent).

Why do we provide this mechanism instead of querying with a simple offset? Well, this avoids problems when the log changes (events are inserted or deleted). Have you ever been exploring the latest posts in some website, and as you change to the next page some of the results from the previous page show up again (because new posts have been added in the meantime)? With Zeitgeist this won’t happen.
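
Once you hold the full list of IDs, paging through it is just list slicing. A minimal sketch, assuming you want fixed-size pages (batches is a made-up helper name); each page could then be handed to get_events the same way the example above does with the first 100 IDs.

```python
def batches(event_ids, size):
    """Yield consecutive slices of `event_ids` with at most `size` items."""
    for start in range(0, len(event_ids), size):
        yield event_ids[start:start + size]


# The ID list is fetched once, so the pages never overlap or skip entries,
# even if new events are inserted into the log in the meantime.
ids = list(range(1, 8))
for page in batches(ids, 3):
    print(page)
```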

Receiving information in real time

At this point you’re an expert at requesting all sorts of data from Zeitgeist, but now you want to show a list of the last kitten images you’ve viewed, updated in real time. Don’t worry, Zeitgeist can provide for this:

from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_insert(time_range, events):
    # do awesome stuff with the events here
    print events
 
def on_delete(time_range, event_ids):
    # a previously inserted event was deleted
    print event_ids
 
templates = [Event.new_for_values(subject_uri='file:///home/user/kittens/*',
                                  subject_interpretation=Interpretation.IMAGE)]
zeitgeist.install_monitor(TimeRange.always(), templates, on_insert, on_delete)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

It’s important to note that on_delete won’t be called when an image is deleted (that’d be a newly inserted event with interpretation=DELETE_EVENT); rather, it’s called when a previously inserted event is deleted (for example, using the “forget recent history” option in Activity Log Manager).

In case you’re curious: for best performance, this doesn’t actually use D-Bus signals. Instead, this little call will set up a D-Bus object behind the scenes and register it with the Zeitgeist engine, so it can notify said object when (and only when) an event of its interest is registered.

To stop receiving notifications for a template, you’ll need to save the object returned by the install_monitor call:

m = zeitgeist.install_monitor(TimeRange.always(), templates, on_insert, on_delete)
zeitgeist.remove_monitor(m)

Pro Tip: You can use the Zeitgeist Explorer GUI to quickly try out different queries (note: it’s still a work in progress, so much functionality is missing, but it does work somewhat).

Contextual awesomeness: finding related events

By now you’re familiar with retrieving events and keeping them up to date. Now it’s time for a little secret:

import time
from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_related_received(uris):
    print 'Related URIs:'
    for uri in uris:
        print ' - %s' % uri
 
query_templates = [Event.new_for_values(
    subject_interpretation=Interpretation.SOURCE_CODE,
    subject_uri='file:///home/rainct/Development/zeitgeist/*',
    subject_mimetype="text/x-vala")]
 
result_templates = [Event.new_for_values(
    subject_interpretation=Interpretation.WEBSITE,
    subject_manifestation=Manifestation.WEB_DATA_OBJECT)]
 
now = time.time()*1000
zeitgeist.find_related_uris_for_events(
    query_templates,
    on_related_received,
    time_range=TimeRange(now - 1000*3600*24*30*6, now),
    result_event_templates=result_templates,
    num_events=10)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

This little query example will return up to 10 websites I used at the same time as the Vala files inside my Zeitgeist directory, considering only data from the last 6 months. Nice, huh?

This is an experimental feature, and it doesn’t work well when operating on big inputs, so it’s usually better to use the find_related_uris_for_uris variant (which replaces the first query_templates parameter with a list of URIs).

Advanced searching: the FTS extension

Some people think prefix searches aren’t good enough for them, and this is why the Zeitgeist engine ships by default with an FTS (Full Text Search) extension.

Using the methods provided by this extension you can perform more advanced queries against subjects’ current_uri and text properties (contrary to what the name may suggest, the FTS extension doesn’t index the content of the files, but just the information in the event).

This is exposed as zeitgeist_index_search in libzeitgeist (the C library), but unfortunately isn’t currently available in the Python API. If you still need it, you’ll have to fall back to pretty much using the D-Bus interface (you still get reconnection support, though). Here’s an example:

from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
index = zeitgeist._iface.get_extension('Index', 'index/activity')
 
query            = 'hello' # search query
time_range       = TimeRange.always()
event_templates  = []
offset           = 0
num_events       = 10
result_type      = 100 # magic number for "relevancy" (ResultType.* also work)
 
def on_reply(events, num_estimated_matches):
    print 'Got %d out of ~%d results.' % (len(events), num_estimated_matches)
    events = map(Event, events)
    for event in events:
        print ' - %s' % event.subjects[0].uri
 
def on_error(exception):
    print 'Error from FTS:', exception
 
index.Search(query, time_range, event_templates,
             offset, num_events, result_type,
             reply_handler=on_reply, error_handler=on_error)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

The most interesting thing here is the query parameter. Quoting from the C documentation:

The default boolean operator is AND. Thus the query foo bar will
be interpreted as foo AND bar. To exclude a term from the result
set prepend it with a minus sign - eg foo -bar. Phrase queries
can be done by double quoting the string "foo is a bar". You can
truncate terms by appending a *.

There are a few keys you can prefix to a term or phrase to search
within a specific set of metadata. They are used like key:value.
The keys name and title search strictly within the text field of
the event subjects. The key app searches within the application
name or description that is found in the actor attribute of the
events. Lastly you can use the site key to search within the
domain name of the subject URIs.

You can also control the results with the boolean operators AND
and OR and you may use brackets, ( and ), to control the operator
precedence.
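
If you build these query strings from user input, a tiny helper can save you some string fiddling. This is only a sketch following the syntax quoted above; build_fts_query is a made-up name, it does no escaping, and it doesn’t cover the OR operator or brackets.

```python
ALLOWED_KEYS = ("name", "title", "app", "site")


def build_fts_query(terms=(), exclude=(), phrase=None, **keyed):
    """Assemble a query string; all parts are implicitly ANDed."""
    parts = list(terms)
    if phrase:
        parts.append('"%s"' % phrase)  # phrase queries are double quoted
    for key in sorted(keyed):
        if key not in ALLOWED_KEYS:
            raise ValueError("unknown search key: %s" % key)
        parts.append("%s:%s" % (key, keyed[key]))
    parts.extend("-" + term for term in exclude)  # minus sign excludes
    return " ".join(parts)


print(build_fts_query(["foo"], exclude=["bar"]))
print(build_fts_query(["report*"], site="example.com"))
```

The resulting string is what would go into the query variable of the example above.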

Modifying the log

So far we’ve only queried Zeitgeist for information; let’s get a bit more active.

You can delete events from Zeitgeist with the following query:

from zeitgeist.client import ZeitgeistClient
 
zeitgeist = ZeitgeistClient()
 
def on_deleted(timerange):
    print 'Deleted events going from %s to %s' % (timerange[0], timerange[1])
 
event_ids = [50] # put the IDs of the events you want to delete here
 
zeitgeist.delete_events(event_ids, on_deleted)
 
# Start a mainloop
from gi.repository import GLib
GLib.MainLoop().run()

The confirmation callback will receive a timerange going from the first to the last event. If no events were deleted (because they didn’t exist), you’ll get (-1, -1).
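
A callback that distinguishes the two cases could look like the following sketch. describe_deletion is a hypothetical helper; it only assumes the indexable pair described above, with (-1, -1) meaning that nothing was deleted.

```python
def describe_deletion(timerange):
    """Turn the deletion confirmation into a human-readable message."""
    begin, end = timerange[0], timerange[1]
    if begin == -1 and end == -1:
        return "No events were deleted."
    return "Deleted events between %d and %d (ms since epoch)." % (begin, end)


print(describe_deletion((-1, -1)))
print(describe_deletion((1340000000000, 1340000060000)))
```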


And now for the interesting part. If your application involves resources (files, websites, contacts, etc.) of any sort, you’ll probably want to let Zeitgeist know that you’re using them. It’s time that you write a data-source!

We start by registering the data-source. Here we go:

import time
from gi.repository import GLib
from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import *
 
zeitgeist = ZeitgeistClient()
 
def on_status_changed_callback(enabled):
    """ This method will be called whenever someone enables or disables
        the data-source. """
    if enabled:
        print 'Data-source enabled and ready to send events!'
    else:
        print "Data-source disabled; don't send events, they'll be ignored."
 
def register():
    # Always use the same unique_id. Name and description can change
    # freely.
    unique_id = 'com.example.your.data.source'
    name = 'user visible name (may be translated)'
    description = 'user visible description (may be translated)'
 
    # Describe what sort of events will be inserted (optional)
    subject_template = Subject()
    subject_template.interpretation = Interpretation.PLAIN_TEXT_DOCUMENT
    subject_template.manifestation = Manifestation.FILE_DATA_OBJECT
    templates = []
    for interp in (Interpretation.ACCESS_EVENT, Interpretation.LEAVE_EVENT):
        event_template = Event()
        event_template.interpretation = interp
        event_template.manifestation = Manifestation.USER_ACTIVITY
        event_template.append_subject(subject_template)
        templates.append(event_template)
 
    zeitgeist.register_data_source(unique_id, name, description, templates,
                                   on_status_changed_callback)

Once that’s done (and if it is enabled), we are free to send our events:

def log(title, uri, opened):
    subject = Subject.new_for_values(
        uri=uri,
        interpretation=Interpretation.PLAIN_TEXT_DOCUMENT,
        manifestation=Manifestation.FILE_DATA_OBJECT,
        origin=GLib.path_get_dirname(uri),
        mimetype='text/plain',
        text=title)
    event = Event.new_for_values(
        timestamp=time.time()*1000,
        manifestation=Manifestation.USER_ACTIVITY,
        actor='application://your_application_name.desktop',
        subjects=[subject])
    if opened:
        event.interpretation = Interpretation.ACCESS_EVENT
    else:
        event.interpretation = Interpretation.LEAVE_EVENT
 
    def on_id_received(event_ids):
        print 'Logged %s (%d) with event id %d.' % (title, opened, event_ids[0])
 
    zeitgeist.insert_events([event], on_id_received)
 
if __name__ == '__main__':
    register()
 
    log('test.txt', 'file:///tmp/test.txt', opened=True)
    log('another_file.txt', 'file:///tmp/another_file.txt', opened=True)
    log('another_file.txt', 'file:///tmp/another_file.txt', opened=False)
    log('test.txt', 'file:///tmp/test.txt', opened=False)
 
    # Start a mainloop
    GLib.MainLoop().run()
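
The log() function above computes the subject origin with GLib.path_get_dirname, which fits files. If your data-source also deals with websites, you could derive the origin along these lines. This is a sketch without the GLib dependency; guess_origin is a made-up helper, and using the bare domain name for websites is my reading of the attribute list at the top of this post.

```python
import posixpath

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def guess_origin(uri):
    """Parent folder for file:// URIs, domain name for anything else."""
    parts = urlsplit(uri)
    if parts.scheme == "file":
        return posixpath.dirname(uri)
    return parts.netloc


print(guess_origin("file:///tmp/test.txt"))
print(guess_origin("http://example.com/some/page.html"))
```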

If you don’t know what interpretation and manifestation your subject should have, you can use the following utility methods:

from zeitgeist.mimetypes import *
 
print get_interpretation_for_mimetype('text/plain')
print get_manifestation_for_uri('file:///tmp/test.txt')

Even better, with Zeitgeist 0.9 you can just leave the subject (but not event!) interpretation and manifestation fields empty, and they’ll be guessed the same way as if you had used those utility methods.

Pro Tip: You can examine all registered data-sources and toggle whether they are enabled or not using the zeitgeist-data-sources-gtk.py tool.

Conclusion

Wow, I’m impressed if you’ve got this far. By now you should have quite a good idea on how to use the Zeitgeist API, and I’m looking forward to seeing what you do with it in your next awesome project.

If you have any problem with Zeitgeist, feel free to visit us on IRC (#zeitgeist on irc.freenode.net), or join our mailing list. We’ll also be at GUADEC next week, so if you’re there make sure to say hi!

In case you missed them, here are some useful links:


© Siegfried-Angel Gevatter Pujals, 2012.