A discussion of generic particle deposition affecting all frontend codes, but especially so SPH and octree codes.
UPDATE 1: This updates the YTEP to say "Accepted" (perhaps prematurely) and to include a bunch of modifications to reflect the current state of the pull request. I believe this reflects all the major concerns and should be accepted, along with the yt-3.0 PR.
Hi Chris, I like this a lot. I have a bunch of questions, but right now my main one is -- how do you see the deposited fields interacting with fluid fields, from the perspective of the deposit function? Should fluid fields be able to be passed in? Or should all interaction between the two be kept at a higher level, as a combination of fields like ('deposit', 'mass') and ('gas', 'mass')?
I'd like to see this incorporated independently of ART changes. If you decide to start working on it, could you create a bookmark at ae0003c and start from there? I think the process of code review will be greatly simplified if this is a single, isolated change. We'll also need to have extensive tests for this, but I am not yet sure how to set up good octree tests.
I'd like to have feedback from Britton Smith , Douglas Rudd and Sam Leitner as well on this. Even though this won't be conducted identically for ARTIO, since it's not using octree structures, we'll still need to think about how it interacts with them.
This code is called from io.py. That means you could easy call _read_fluid_selection just as in the examples below we've called _read_particle_selection. So the scope is: the deposit field type could use both fluid and particle information, but cannot use any derived field information. That's by coincidence not by design; so you could in theory 'deposit' an identical fluid field. But combinations would be more interesting, you could also deposit a hydro+particle field of total mass, but this mirrors the functionality of a derived field that could do the same thing. Having two ways to do the same thing is generally a bad, unpythonic, idea. The only place you might want to define a deposit field combining particle and hydro fields that could not be replicated as a derived field is with spatial fields. We lose spatial info going to a derived field. Perhaps this would be useful if you wanted to calculate a total (hydro+particle) mass flux across the faces of each oct?
The other option is to encourage a more limited scope, and to do this we could pass the deposit methods below not just the chunk, and selector but also pass in the appropriate _read_particle_selection for that frontend code so we would have instead: def deposit_stellar_mass(subset, selector, _read_particle_selection): This would not strictly limit the code, since you could still get at _read_fluid_selection, it would just make it relatively more convoluted. Having just written this out, I'm -1 on this change.
Since the comment:code ratio is pretty huge, I'm going to try and summarize the most salient points before updating this PR. I'm going to default to what I like best; feel free to argue or support an alternative.
Field naming should go like ('grid', 'mass') instead of ('deposit', 'mass').
Derived fields can depend on particle, fluid, or grid (e.g. deposited particle) native fields. Optionally they can flag a spatial field dependence. I would also like to find a way that derived fields can mask out a particle selection instead of having to form a new deposition field, but this may not be possible.
We can simplify the field definitions and remove explicit IO calls:
I guess this leaves me confused about how we organize yt fields.
I think our original thinking on this (described in this set of hangout notes from last year) was that fields in 3.0 would be selected via a two element tuple: (field_type, field_name). As described in Matt's notes, this is necessary to be able to easily select on particle type and also has the nice bonus that we can easily support multifluid datasets. The virtue of this approach is that it is physically motivated and the details of how data is stored on disk is abstracted away. The downside is that particle-like fields are unstructured and must be interpolated onto a mesh to enable straightforward slicing, projection, and comparison with data that lives on a mesh on disk.
If we add the ability to use grid field_types, we've added an additional level of complexity for this system in that now a user has to think a bit about whether their data falls on a regular structured mesh or on an irregular unstructured mesh in a particle-like way. I don't think that's necessarily a bad thing and right now yt doesn't do much to help people think about how their data is structured. It would be nice if we could come up with a set of rules describing how field selection tuples should work - but that'll have to done in the context of another YTEP :).
I suggested 'grid' as a field_type to remove the need for the user to think about how the data was structured, since it would be explicit in the field type. The most natural name for a CIC deposited grid-like field of stellar mass is ('star','density'), but this would lead to confusion when ('star','mass') is a particle-like field. Is using the same field-type name for bit grid-like and particle-like fields even possible? Is that desirable? It would lead to two fields with the same type that are not interoperable.
I'm revisiting deposited particle fields with a mask/selection applied. The basic problem is that ('grid','stars_mass') is kind of clear about what particle (eg 'stars') and quantity ('mass') are being deposited, but not the selection ('young'). Here are a list of syntaxes that I think are possible:
Define a new field for every mask.
You would have ('grid','stars_mass') as well as ('grid','stars_mass_young'). This would add a new field for every different permutation of (particle type, quantity, mask) and might lead to an explosion in the number of fields. I also worry that what someone else calls ('grid','stars_mass') might have a selection rolled in, whereas someone else defining their own fields might not. So if a yt library somewhere asks for ('grid','stars_mass') it might not get a consistent answer across frontends. If we choose this option, we would likely have to very explicit and detailed about field names to make sure frontends return the most consistent thing possible.
Add a new element to the field tuple.
You would have ('grid','stars_mass', 'young') but also ('grid','stars_mass') which would be equivalent to ('grid','stars_mass',None). This would force us to write a particle selection registry, something similar to how we have a registry for defined derived fields. We would access this like: add_particle_mask("young",function=lambda x: x<10Myr)
Add two new elements to the field tuple.
This makes the deposit process even more explicit. You would have ('grid',original_field_type, original_field_name, selection) and thus would have ('grid','stars','particle_mass','young') . Rules are similar to #2 wherein the selection is an optional argument.
Request a field via a method lookup.
Unfortunately retrieving a deposit field when the field has an optional argument of selection means that instead of looking up a tuple in a dictionary this is probably better represented by asking for a field by a method with optional keyword arguments. Ex: dd.get_field('stars','particle_mass', deposit=dict(method='sum', kernel='cic'), mask=lambda x: x<10Myr, mask_arguments=[('stars','particle_age')]).
Personally, I have no concrete opinion on which of these is best. #1 is the closest to the current convention, but I think is broken for getting fields which require so much parameterization. #2-#4 better capture the extra parameters, but stray increasingly further away from the norm and break intuition as well as code. At the end of the day though, all of these entry points eventually will flow through a single call like in line #4.
I am +1 on the last two being explicit, and +1 on the add_particle_mask registry.
Maybe we could go a step further from (3): instead of 'grid'/'no grid'(?) for the first element of the tuple, say what deposition kernel we want to use? And then catch the kernel keyword during IO/deposit? That way the user doesn't have to know/care about where the data is coming from to know how it is deposited. For example, we shouldn't have to remember that ('nogrid', 'stars', 'particle_mass') is ngp; instead we should request ('ngp', 'stars', 'particle_mass'), which the frontend picks up to be accessible in normal derived field phase (whereas cic would drop into deposit).
This also sounds like it could mesh well with SPH support, where everything is gridded/deposited.
One possible sticking point is that if field tuples can have three elements, that might cause issues elsewhere. I'd like to reiterate my suggestion that we need a real spec (probably a YTEP of its own) for field tuples.
Hi Chris. In advance of the hangout (is that set up?) I have been working on the spatial field nature of Octrees as well as thinking quite in depth about how I would like to see this particle deposition work.
There are only a few operations we want to do, but I believe we need to be relatively flexible. In the octree spatial field stuff I have inserted enough machinery to make it work that you can have a DomainSubset and then go from an Oct to an index into a fluid field.
What I think would look nice is if we had a base class for deposition routines. This class will be defined in Cython and will be similar in spirit to the Selector object. It will have routines for initialize, finalize and process, where initialize will be handed the full fluid field (shape n_oct by 2 by 2 by 2), as will finalize, and process will get just the particle and the index into the fluid. The base class will accept an OctHandler and will handle all the iterations. Each call to the Python routine deposit will instantiate a new one of these items.
What do you think about this? Will it meet your needs? If so, I'd be happy to update the YTEP to reflect this, and to identify how it fits in with the Python code and APIs you've laid out.
Additionally, the more I think about it, the more I actually prefer the idea of having the particle positions all selected by the derived field call, as you originally had.
Yes, unfortunately we need to have step by step routines like initialize, finalize and process. If we need to iterate over the particles twice (and we do sometimes!), then we have initialize, process, process, and finalize. This feels right, even if it is unfortunately verbose but necessary vocabulary.
Well, two things. The first is the initialize can allocate temporary arrays. Most operations that would otherwise require two passes can be accommodated by temporary arrays. The second thing is that for those operations that are in fact being performed fundamentally differently, perhaps in two passes, we can require that they subclass and override the actual traversal.
As an update, I am working on an implementation that synthesizes a lot of this code. I will update the YTEP when that has been completed, and we can discuss the implementation as well as the specification, at that time.
This is just the YTEP, but the yt-3.0 branch has live & working code that demonstrates this approach -- thanks Matt!
I think the only thing that's left is to recommend a naming scheme for deposited fields. This was so contentious last time we broached the subject that I hesitate to bring it up again. Maybe it's better to have this discussion when we talk about standardized, frontend-universal, yt-3.0 field names.