Switch to ints or strings for ids

Issue #24 resolved
karl krughoff created an issue

Currently the object identifiers are treated as floating point numbers. This causes lots of problems with comparison and formatting. Would the phosim team be willing to consider going to string (or int) object identifiers instead? That would make things much easier for those of us dealing with the analysis.

Comments (30)

  1. John Peterson

    if we did an int, how many digits do you need? there was a related ticket saying that 16 digits isn't enough, which seems a lot to me.

  2. karl krughoff reporter

    The problem is that it's very common to bit pack extra information on the object ids, so they can get very long. I would strongly prefer moving to strings instead of ints. That gives the most flexibility while still preserving strict comparability.

  3. John Peterson

    yeah, i understand but we use it as a number in some places, so its not a trivial change. i will look into it, but keep in mind the number of objects in the whole Universe should only require 13 digits, so still not sure why you need more than 16.

  4. karl krughoff reporter

    The reason we need long ids is that we don't perfectly pack our identifiers. The current scheme for making sure that there are no id collisions is by packing a unique identifier for each catalog on the end of the id for each object. This means that any catalog can index in any way they want without fearing collisions. I think currently we are using 10 bits for the catalog id. The galaxies use another 10 bits to pack the tile id since we have to tile the same galaxy catalog over the sky and we want to have unique galaxy ids even if the underlying catalog is tiled. This means the minimum id we have is 1024*1024. If you perfectly pack the id range, we could come in under 16 digits, but the underlying galaxy catalog ids come directly from a semi-analytic model and are not sequential, thus we go over 16 digits.

    yeah, i understand but we use it as a number in some places, so its not a trivial change.

    I did a quick look for usage of the id, and the only places I see it being treated as a float is where you assume it's a float for formatting. I don't see any place it's actually involved in a numerical computation. Can you point me to specific code that makes this hard?

    For completeness, here is the grep I did. Maybe I'm missing something.

    $> grep -ir sources.id source
    source/raytrace/observation.cpp:    sources.id.push_back(id);
    source/raytrace/observation.cpp:    sources.id.push_back(id);
    source/raytrace/observation.cpp:                                        sources.id.push_back(0.0);
    source/raytrace/photonloop.cpp:            state.pEventLogging->logPhoton(photon.time, sources.id[ssource], 0, 1);
    source/raytrace/settings.cpp:    // sources.id = (double*)calloc(MAX_SOURCE, sizeof(double));
    source/raytrace/sourceloop.cpp:    if (centroidfile) writeCentroidFile(outputdir, outputfilename, sourcePhoton, sourceXpos, sourceYpos, sources.id, nsource);
    $>  grep -ir source.id source
    source/raytrace/counter.cpp:void writeCentroidFile (const std::string & outputdir, const std::string & outputfilename, long long *source_saturation, long long *source_xpos, long long *source_ypos, std::vector<double> source_id, long nsource) {
    source/raytrace/counter.cpp:            fprintf(outdafile, "%lf %lld %lf %lf\n", source_id[k], source_saturation[k], ((double)source_xpos[k])/((double)source_saturation[k]),
    source/raytrace/counter.h:                        std::vector<double> source_id, long nsource);
    
  5. karl krughoff reporter

    This actually turned out to be trivial. Below is a patch from v3.6.1 that does what I need. The only bit that may need a little work is that the logging is basically abusing the logPhoton method to log several different types of information. I changed to just log the index of the source rather than the source id. That seems fairly safe, but maybe I'm wrong.

    If you'd prefer a PR, let me know.

    diff --git a/source/raytrace/counter.cpp b/source/raytrace/counter.cpp
    index 618524a..1be8134 100644
    --- a/source/raytrace/counter.cpp
    +++ b/source/raytrace/counter.cpp
    @@ -168,7 +168,7 @@ void writeThroughputFile (const std::string & outputdir, const std::string & out
    
     }
    
    -void writeCentroidFile (const std::string & outputdir, const std::string & outputfilename, long long *source_saturation, long long *source_xpos, long long *source_ypos, std::vector<double> source_id, long nsource) {
    +void writeCentroidFile (const std::string & outputdir, const std::string & outputfilename, long long *source_saturation, long long *source_xpos, long long *source_ypos, std::vector<std::string> source_id, long nsource) {
    
         FILE *outdafile;
         long k;
    @@ -179,7 +179,7 @@ void writeCentroidFile (const std::string & outputdir, const std::string & outpu
             outdafile = fopen(tempstring, "w");
             fprintf(outdafile, "SourceID Photons AvgX AvgY\n");
             for (k = 0; k < nsource; k++) {
    -            fprintf(outdafile, "%lf %lld %lf %lf\n", source_id[k], source_saturation[k], ((double)source_xpos[k])/((double)source_saturation[k]),
    +            fprintf(outdafile, "%s %lld %lf %lf\n", source_id[k].c_str(), source_saturation[k], ((double)source_xpos[k])/((double)source_saturation[k]),
                         ((double)source_ypos[k])/((double)source_saturation[k]));
             }
             fclose(outdafile);
    diff --git a/source/raytrace/counter.h b/source/raytrace/counter.h
    index ddfb6c0..cdb368d 100644
    --- a/source/raytrace/counter.h
    +++ b/source/raytrace/counter.h
    @@ -42,7 +42,7 @@ void addThroughput (Tlog *throughpuTlog, long surf, long waveindex, long long so
     void initThroughput (Tlog *throughpuTlog, long nsurf);
     void writeCentroidFile (const std::string & outputdir, const std::string & outputfilename,
                             long long *source_saturation, long long *source_xpos, long long *source_ypos,
    -                        std::vector<double> source_id, long nsource);
    +                        std::vector<std::string> source_id, long nsource);
    
     // Waits for a new tick. Returns number of ticks since program start.
     inline clock_t GetNewTick() {
    diff --git a/source/raytrace/observation.cpp b/source/raytrace/observation.cpp
    index f2eda85..11853ca 100644
    --- a/source/raytrace/observation.cpp
    +++ b/source/raytrace/observation.cpp
    @@ -47,8 +47,9 @@ int Observation::addSource (const std::string & object, int sourcetype) {
         double x, y;
         double nn;
         double mag;
    -    double ra, dec, id, redshift, gamma1, gamma2, kappa, magnification,
    +    double ra, dec, redshift, gamma1, gamma2, kappa, magnification,
             deltara, deltadec;
    +    std::string id;
    
         nspatialpar = 0;
    
    @@ -246,9 +247,10 @@ skipsedread:;
    
     int Observation::addOpd (const std::string & opd) {
    
    +    std::string id;
         double x, y;
         double nn;
    -    double ra, dec, id;
    +    double ra, dec;
         double wave;
    
         std::istringstream iss(opd);
    @@ -596,7 +598,7 @@ int Observation::background () {
    
                                             nspatialpar = 0;
    
    -                                        sources.id.push_back(0.0);
    +                                        sources.id.push_back("None");
                                             sources.ra.push_back(ra);
                                             sources.dec.push_back(dec);
                                             if (diffusetype == 0 && domewave == 0.0) sources.sedfilename[nsource] = dir + "/sed_dome.txt";
    diff --git a/source/raytrace/photonloop.cpp b/source/raytrace/photonloop.cpp
    index c1f402f..5279759 100644
    --- a/source/raytrace/photonloop.cpp
    +++ b/source/raytrace/photonloop.cpp
    @@ -104,7 +104,8 @@ void Image::photonLoop(long ssource) {
             getAngle(&angle, photon.time, ssource);
             if (eventfile) {
                 state.pEventLogging->logPhoton(angle.x, angle.y, photon.wavelength, 0);
    -            state.pEventLogging->logPhoton(photon.time, sources.id[ssource], 0, 1);
    +            // Can't log a string ID, so log the index of the source
    +            state.pEventLogging->logPhoton(photon.time, ssource, 0, 1);
             }
    
             //   Saturation Optimization
    diff --git a/source/raytrace/source.h b/source/raytrace/source.h
    index 0ed6bd1..e98bd57 100644
    --- a/source/raytrace/source.h
    +++ b/source/raytrace/source.h
    @@ -33,7 +33,7 @@ struct Source {
         double **spatialpar;
         double **dustpar;
         double **dustparz;
    -    std::vector<double> id;
    +    std::vector<std::string> id;
         int *spatialtype;
         int *dusttype;
         int *dusttypez;
    
  6. karl krughoff reporter

    @johnrpeterson Can you give me some notion of whether this is a change you'd be willing to incorporate and what the timescale for that would be?

  7. John Peterson

    yes, we will look into this, and will put it in a new patch in the next week as long as it doesn't break anything. the line where you put the index of the source is a problem though.

  8. karl krughoff reporter

    the line where you put the index of the source is a problem though.

    Can you say more? It didn't seem that bad to me, but the real solution is to not pretend you are logging coordinates when you are actually logging information about the source.

  9. karl krughoff reporter

    @johnrpeterson Did you have anything more to say on this? It didn't seem like that big a problem when I was looking through. Even just saving the index seemed ~o.k. as long as you also publish a mapping from index to id.

  10. John Peterson

    that index goes through to the event file (another diagnostic file like the centroid file). so we'd be removing the ID there. would you be able to use the centroid file without the ID?

  11. karl krughoff reporter

    Right, that's the problem. You are logging the ID in a float column in the event file. You need another mechanism to log the source information because it doesn't have the same schema as the events you are logging. It just so happened that if you assume the ID is a float, you can use the same schema for both, but that was just a coincidence, really. My suggestion was to log the source index in the event table and store a mapping from index to id (possibly in another extension of that same FITS file).

    Another option would be to have an extension per source and hold the source info in the header and have the table hold the events for that source. I don't think there is any limit on the number of extensions you can have.

    Just to be clear, I never intended to imply that you should not still keep enough information to get the source ID for each event.

  12. John Peterson

    yes, but could you simply use the centroid file without the ID? the number in the file would then correspond with the order in the local instance catalog. then we wouldn't have to even read in the source ID. you'd then have the same situation as the users of the event file for what you are proposing.

  13. karl krughoff reporter

    you'd then have the same situation as the users of the event file for what you are proposing.

    That is not at all what I'm proposing for the users of the event file. I'm saying users must have a way of getting the original id back for the events. I've suggested two different implementations of a system that will do that. Relying strictly on order is fraught with peril and should be avoided if at all possible.

    The ID should be printed to the centroid file as it was put in the object file (since that is trivial) and the problem with logging the source id as a float should be solved by some other mechanism (feel free to use either of the ideas in my previous post).

  14. John Peterson

    its not trivial though. even with the centroid file, by switching to strings it uses more memory (if you are planning on using as many as 17 digits) for a non-physics bookkeeping situation. but i am saying for both the event file and the centroid file we could have a mapping file like you propose.

  15. John Peterson

    by the way, the code you put above does literally base it on order. so i didn't propose that, i was just reading what the code snippets you made did.

  16. karl krughoff reporter

    but i am saying for both the event file and the centroid file we could have a mapping file like you propose.

    O.K. Missed that. I would be happy with a mapping for both. It adds a bit of bookkeeping for the user of the centroid file, but if that trades off against the extra memory, I'm on board.

    by the way, the code you put above does literally base it on order

    No it doesn't. The code above records the index into the sources.id array. Your code happens to iterate over those indexes sequentially, but the part I added is agnostic to that. You could iterate in any order and this would still be safe.

  17. karl krughoff reporter

    I should add that I concede that ints would be easier to support than std::strings. It didn't seem that much harder to support strings and they are much more flexible. If there are concrete reasons, like the memory concern, to require ints instead, we can certainly look at what it would take for CatSim to fit in a 64 bit int. It may be easy for them.

  18. karl krughoff reporter

    @johnrpeterson I had a chat with sims people on slack and they would prefer strings to ints for this since they are so much more flexible and still allow sorting and strict comparison.

  19. John Peterson

    Simon, i think we have a solution. If we switch to strings, then first that will make you happy. I've also just realized that it may take up less memory as there are many background sources which we can have null string IDs, so ironically it could take up less memory than a bunch of double precision zeros. As for the event file, we can use the source number rather than source ID in the main part of the file, and then put the translation table in the header (string source ID -> integer source number).

  20. karl krughoff reporter

    @johnrpeterson I'm a little worried about putting the mapping in the header. That's really not what the header is for. I'd be much happier if you just included another extension with a table to contain that mapping.

  21. karl krughoff reporter

    I understand. I don't think FITS headers are meant for that purpose. It should be another extension in the same file as the events that contains a two column table with the ID to index mapping.

  22. John Peterson

    yeah sure. it doesn't really matter as most uses of the event file are with a handful of sources.

  23. karl krughoff reporter

    It does matter. You are storing table like data, so it should go in a table (I've confirmed this works using astropy). Shoehorning this kind of content in the header is a recipe for confusion, not to mention that the header keys and values have strict conventions that tables don't have.

  24. karl krughoff reporter

    O.K. That's good. I read the "it doesn't really matter" part as saying you didn't think the implementation mechanism mattered. Sorry for my confusion.

  25. karl krughoff reporter

    Any update on this? Are you planning this change for a point release, or am I waiting for a major release?

  26. John Peterson

    we were just going to put it in the next major release (v3.7) since it wasn't completely trivial. but let me know if you need it sooner.

  27. Log in to comment