Understanding Analytic GET Requests
The Ex Libris Developer Network's Tech Blog provides a useful overview of the Analytic REST API. Still, several nuances and fine points were identified during the development of this software package. This page provides an overview of the XML structure of the returned analytic data as well as usage issues involved in making queries.
The Returned XML Report Object
Ex Libris provides an overview of the XML object returned by a GET request. Below is an excerpt of the sample they provide.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<report>
  <QueryResult>
    <ResumptionToken>8D288490B28E6...</ResumptionToken>
    <IsFinished>false</IsFinished>
    <ResultXml>
      <rowset:rowset xmlns:rowset="urn:schemas-microsoft-com:xml-analysis:rowset"
                     xmlns="urn:schemas-microsoft-com:xml-analysis:rowset">
        <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                    xmlns:saw-sql="urn:saw-sql"
                    targetNamespace="urn:schemas-microsoft-com:xml-analysis:rowset">
        </xsd:schema>
        <Row>
          <Column0>0</Column0>
          <Column1>39031025815804</Column1>
          <Column2>-1</Column2>
          <Column3>283289200001021</Column3>
          <Column4/>
        </Row>
        <Row>
          ...
        </Row>
      </rowset:rowset>
    </ResultXml>
  </QueryResult>
</report>
Overall, there are three main elements worth paying attention to: ResumptionToken, IsFinished, and ResultXml. Each has its own nuances.
<IsFinished>
The IsFinished element appears in all successful queries (HTTP status 200). It holds a boolean value: the literal string true or false.
<ResumptionToken>
The ResumptionToken element only appears in the response to the first query you make. It contains a 128-character string that identifies a communication exchange. After the first successful exchange, the token should be reused with each subsequent query until the IsFinished element returns true. Note that this element does not appear in responses to queries that use the token parameter.
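To make the exchange concrete, here is a minimal sketch of the token loop in Python, assuming the requests library; the endpoint URL, analytic path, and API key are placeholders, not values from this repository:

```python
import xml.etree.ElementTree as ET

import requests

# Placeholder endpoint, analytic path, and API key -- substitute your own.
BASE_URL = "https://api-na.hosted.exlibrisgroup.com/almaws/v1/analytics/reports"
PARAMS = {"path": "/shared/MyInst/MyAnalytic", "limit": 1000, "apikey": "MY_KEY"}

token = None
finished = False
while not finished:
    params = dict(PARAMS)
    if token is not None:
        params["token"] = token          # reuse the token on every follow-up query
    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    token_el = root.find(".//ResumptionToken")   # only present in the first response
    if token_el is not None:
        token = token_el.text
    finished = root.findtext(".//IsFinished") == "true"
    # ... process the rows inside ResultXml here ...
```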
<ResultXml>
The ResultXml element is, of course, the most important in that it contains the actual data. Extracting the data from within, though, is not as direct as one would expect.
Rowset, XSD, Namespaces, ...
Fundamentally, the returned XML represents basic tabular data: rows and columns. Some of the initial subelements in the ResultXml tag are descriptors of that structure. For the most part, these elements are fairly inconsequential: the XSD gives a minimal description stating that the contents will be in rows and columns.
Most relevant, however, is that the rowset element introduces a namespace. To access the rows and columns within, one must use this namespace, and the namespace is not immediately obvious. Although it appears to remain constant, our code pulls the namespace out each time the returned XML is processed. In general, the namespace is:
{urn:schemas-microsoft-com:xml-analysis:rowset}
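For instance, with xml.etree.ElementTree the namespace can be recovered from the rowset tag itself rather than hard-coded (a sketch, assuming root is the parsed report element from the earlier snippet):

```python
# The first child of ResultXml is the rowset element; its tag carries the
# namespace in Clark notation, e.g. "{urn:schemas-microsoft-com:xml-analysis:rowset}rowset".
rowset = root.find(".//ResultXml/*")
namespace = rowset.tag.split("}")[0] + "}"   # "{urn:schemas-microsoft-com:xml-analysis:rowset}"
rows = rowset.findall(namespace + "Row")
```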
Rows and Columns
As noted, the analytic data is presented in plain tabular form. Understanding this is crucial. Even if you have played around in OBIEE and fashioned the data into a pivot table, added graphs, or other such embellishments, the data in the XML will just be the underlying tabular data: each row representing a database entry and the columns being the selected fields. Totals do appear, but they will not be in obvious places and may be buried within rows as separate columns.
This leads to another difficulty in working with the data: the column names. The column elements are simply named Column0, Column1, Column2, etc. There is no clear mapping between the fields in OBIEE and the XML; it is up to you to figure out which column maps to which field. Unfortunately, there does not seem to be a universal rule for the mapping, but the following has been determined:
- <Column0> is unimportant. It appears in every row, and its data is not reflected in the analytic in OBIEE.
- Generally, the mapping is in alphabetical order based upon the name of the dimension table and the field name within that table. For example, Bibliographic Details -> Title will typically appear in a column before Holding Details -> Holding ID, which in turn precedes Holding Details -> Summary Holding. This at least holds for fields of type string.
- The above rule breaks, however, for fields of other types, which are apparently ordered separately. The full logic of the sorting has yet to be revealed.
- If a field is empty for an entry in the database, the corresponding column element may either not appear in the XML at all or appear as an empty element: <Column#/>.
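Putting the above together, here is a sketch of flattening the rows into Python dictionaries (assuming rows from the previous snippet); note that absent columns simply do not appear in a row's dictionary:

```python
records = []
for row in rows:
    record = {}
    for col in row:
        name = col.tag.split("}")[1]   # strip the namespace: "Column0", "Column1", ...
        record[name] = col.text        # an empty element <Column#/> yields None
    records.append(record)
# Column0 can be discarded; mapping the rest to fields requires inspection.
```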
The key point here is that figuring out how the column numbers map to the fields in the analytic will require inspection. Hence, it is useful to begin with small queries through tools like this repository's query_page.py or a browser REST client.
Response Types
Ideally, when one makes a query, the response will be the XML report discussed in the previous section. However, this is not always the case. In current practice, a response to an analytic GET request can take several forms, and distinguishing valid responses from errors is not always clear. The following is meant to provide some insight gained from experience and testing.
Error Statuses
As reported on the Ex Libris Developer Network, the RESTful services generally return 400 errors when the request is malformed. There is often a helpful error message in the returned result: if the limit parameter is outside of the 25-1000 range, the message says so, and an unknown resumption token will be noted. Less helpful are the error messages associated with an improperly formatted filter.
Other HTTP error statuses that are sometimes encountered include 401 (Unauthorized, probably a bad API key), 403 (Forbidden), and 500 (generic internal server error). Both the 403 and 500 errors are usually indicative of a network issue, most likely on either the Alma or the OBIEE side. These generally can only be waited out.
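A rough sketch of how one might branch on these statuses, using the requests response object from the earlier snippet (the handling choices here are suggestions drawn from the observations above, not fixed rules):

```python
import time

resp = requests.get(BASE_URL, params=params)
if resp.status_code == 400:
    # Malformed request (bad limit, unknown token, improper filter, ...);
    # the response body usually carries a helpful message.
    print(resp.text)
elif resp.status_code == 401:
    raise RuntimeError("Unauthorized -- check the API key")
elif resp.status_code in (403, 500):
    time.sleep(300)   # usually a transient server-side issue; wait and retry
```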
Analytic Still Loading
As described earlier, a successful query (HTTP status 200) returns an XML report structure. However, sometimes the analytic is still loading the data, resulting in an empty result that looks something like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<report>
  <QueryResult>
    <IsFinished>false</IsFinished>
    <ResultXml/>
  </QueryResult>
</report>
Note that if this was the first query sent, a resumption token would also be present. That token should be used to send the query again after waiting a brief period. Eventually, actual data will be returned.
Detecting that the analytic is still loading is easy:
- The <IsFinished> element contains false.
- <ResultXml> is an empty element (it does not even contain the XSD or rowset).
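In code, that check might look like the following sketch (root again being the parsed report element):

```python
def still_loading(root):
    """True if IsFinished is false and ResultXml is completely empty."""
    result_xml = root.find(".//ResultXml")
    return (root.findtext(".//IsFinished") == "false"
            and (result_xml is None or len(result_xml) == 0))
```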
Unexpected Zero Results
There is one more erroneous response that can occur with a successful query (HTTP Status 200), and this one is actually problematic. In this case, the returned XML will look something like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<report>
  <QueryResult>
    <IsFinished>true</IsFinished>
    <ResultXml>
      <rowset:rowset xmlns:rowset="urn:schemas-microsoft-com:xml-analysis:rowset"
                     xmlns="urn:schemas-microsoft-com:xml-analysis:rowset">
        <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                    xmlns:saw-sql="urn:saw-sql"
                    targetNamespace="urn:schemas-microsoft-com:xml-analysis:rowset">
        </xsd:schema>
      </rowset:rowset>
    </ResultXml>
  </QueryResult>
</report>
Since IsFinished is true and <ResultXml> contains no <Row> elements, this result suggests that there was no data in the analytic. That would be a fine result if the analytic were truly empty. Unfortunately, this response can occur even when there IS data in the analytic.
Such false results seem to occur if one has been pushing the analytic servers too hard, i.e., sending multiple queries to the same analytic path repeatedly and perhaps in parallel (with different resumption tokens). Only time seems to fix this problem, and that can mean waiting hours or days. The current practice we recommend is that if you are going to make repeated queries on a regular basis, you should use multiple copies of the analytic and different API keys. Even if you are making one query at a time, varying which keys and which analytic paths you use seems to avoid whatever hidden quota system leads to this zero-results error.
Although we have filed a support case on this issue, this remains a problematic error. In general, one has to determine whether zero results is a valid response from an analytic. If it is, this is not an error; otherwise, it needs to be treated as one.
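A sketch of detecting this condition, with the caveat that only the caller can decide whether zero rows is legitimate for the analytic in question:

```python
def finished_with_no_rows(root):
    """True if IsFinished is true but the rowset contains no Row elements."""
    if root.findtext(".//IsFinished") != "true":
        return False
    rowset = root.find(".//ResultXml/*")
    if rowset is None:
        return True
    namespace = rowset.tag.split("}")[0] + "}"
    return rowset.find(namespace + "Row") is None
```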
The 65001 Result Barrier
The final strange issue with queries comes down to how many results one can ultimately download. As described above, downloading data involves one or more GET requests (each downloading a limit of 25-1000 results) connected by a resumption token. Thus, one could presumably download all of the data in an analytic by making repeated calls of 1000 results at a time. Unfortunately, the OBIEE software limits the number of results in an analytic to at most 65001. This value is hard set in the OBIEE database and cannot currently be changed. Fortunately, one can work around this limit if the analytic is structured appropriately.
Query Types
Before getting into the details of how to get around the limit, it is useful to think of downloading analytic data as involving three types of related queries: page, report, and all. A page is a single GET request and returns at most 1000 results. A report consists of multiple page queries connected by a single resumption token; a report can thus contain up to 65001 results. An all query involves conducting multiple report queries (each with its own resumption token) while incorporating a filter into the GET requests based upon the last result of the previous report.
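In code, the three query types naturally layer on top of one another. The sketch below is purely illustrative; these names are not the ones used in this repository:

```python
def get_page(params):
    """One GET request: up to 1000 results plus IsFinished/ResumptionToken."""
    ...

def get_report(params):
    """Chain get_page calls on one resumption token: up to 65001 results."""
    ...

def get_all(params):
    """Chain get_report calls, filtering each past the previous report's
    last SORT value; no upper bound on results."""
    ...
```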
Bypassing the 65001 Limit
As stated, the means of bypassing the 65001 limit is to use the filter parameter in the GET request. To do this, we require the analytic to meet the following criteria:
- The analytic must include a field/column (SORT) that sorts the results
- The SORT field should be prompted for filtering
- The analytic should have an additional, different field (UNIQUE) that is a unique ID for each row
Together, these properties allow us to perform an all query as follows (a code sketch appears after the discussion below):
1. Run a report request (with no filter) in which we:
   - Process each result.
   - Record each UNIQUE value seen.
2. When the report is finished, grab the SORT value of the last result (last_value).
3. Run a report request with the filter "SORT is greater than or equal to last_value" in which we:
   - Process each result.
   - If a result's UNIQUE value has already been seen, discard the whole result and move on to the next. Otherwise, record the UNIQUE value.
4. Repeat steps 2 and 3 until no more new results are returned.
It can easily be reasoned that the above algorithm returns all results without any repeats, but a few critical points explain certain choices. For example, one might suggest that it would be more efficient to use a greater than filter instead of the greater than or equal to filter. However, such a design choice could lead to some results being skipped. Consider an analytic of physical items sorted by the item's title. Clearly, several items may have the same title, say Hamlet. Assume there are 5 copies of Hamlet, but the report ends after the second one. If we used the greater than filter, Hamlet copies 3 through 5 would be skipped over. Hence, we use the greater than or equal to filter and check for unique IDs to avoid any duplication.
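The following sketch expresses the algorithm in Python. The helpers run_report and build_filter are hypothetical stand-ins: run_report is assumed to yield (unique_id, sort_value, record) tuples for one report, and build_filter to produce a "greater than or equal to" filter on the SORT field:

```python
def get_all(run_report, build_filter):
    """Download everything by chaining filtered reports (a sketch)."""
    seen = set()           # UNIQUE values already recorded
    records = []
    filter_xml = None      # the first report runs unfiltered
    while True:
        new_results = 0
        last_sort = None
        for unique_id, sort_value, record in run_report(filter_xml):
            last_sort = sort_value
            if unique_id in seen:
                continue           # overlap caused by the >= filter; discard
            seen.add(unique_id)
            records.append(record)
            new_results += 1
        if new_results == 0:
            break                  # nothing new: the download is complete
        filter_xml = build_filter(last_sort)   # next report: SORT >= last_sort
    return records
```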
So why not just make sure that the analytic is sorted by a unique ID (that is, make SORT and UNIQUE the same field)? There are two reasons our software design insists that the two be distinct fields. First, the code was designed to permit parallelism. A natural way to separate a large download into separate tasks is to filter on a sorted field (see JobBounds in RequestObject). Unique IDs like an item's MMS ID are hard to split into job bounds compared to ranges based on the alphabet, numbers, or dates. The second reason is due to difficulties experienced with the OBIEE interface itself: on multiple occasions, attempts to sort a large number of results (hundreds of thousands) by a unique ID led to the database returning errors, both within OBIEE and with RESTful requests. Making the two fields distinct avoids such situations.