Select removes duplicate columns, but Insert.from_select might want them

Issue #4022 new
Stijn van Drongelen created an issue

I programmatically create a bunch of (complex) INSERT ... SELECT statements. I regularly use the pattern

target_columns, source_elements = *query_components
statement = target_table.insert().from_select(
    target_columns,
    select(source_elements).select_from(source_selectable)
)

after building query_components and source_selectable from some specification. When I (granted: inadvertedly) had a duplicate column in source_elements, the database complained that there was a mismatch in the numer of columns in the INSERT ... SELECT statement.

It took me a while to figure out what the problem was, but I've distilled a MVCE that explains my expectations and how they mismatch SQLAlchemy's actual behaviour. I'm not sure what the reasoning behind it is, as having duplicate columns after SELECT isn't semantically wrong. If I wanted to target multiple columns with the same source column, I could of course work around it by labeling every source column, but unlike .alias(), .label() can't be called without an argument (and I'd like to leverage SQLAlchemy's ability to generate fresh names).

So, in short: simple code may end up generating broken SQL because duplicate columns get dropped by Select. Why does it do that, and how do I generate fresh .label()s if I'd want to work around that for Insert.from_select?

The problem occurs in 1.0.15, 1.0.17, and 1.1.11.

Comments (3)

  1. Mike Bayer repo owner
    • changed component to sql
    • changed milestone to 1.3
    • marked as major

    Every use case select() was built for up until the "insert from select" feature was eventually added treats the columns clause as a namespace. The row you get back from result.fetchone() is a namespace, the select.c collection used for subqueries is a namespace, and up until version 1.1, I think you could not even fetch a redundantly-named column from a result because everything worked from string indexes internally (this may not be totally accurate).

    it's very common that when some frequently requested feature gets added on to existing constructs years later, it conflicts with some of the base assumptions of that construct.

    You can generate your labels with anonymous names using:

    cols = [col.label(None) for col in cols]
    

    approach for this would be conservatively starting with dedupe just for insert from select:

    diff --git a/lib/sqlalchemy/sql/compiler.py b/lib/sqlalchemy/sql/compiler.py
    index 6da0647..a20929b 100644
    --- a/lib/sqlalchemy/sql/compiler.py
    +++ b/lib/sqlalchemy/sql/compiler.py
    @@ -1697,6 +1697,7 @@ class SQLCompiler(Compiled):
                          nested_join_translation=False,
                          select_wraps_for=None,
                          lateral=False,
    +                     dedupe_select_cols=True,
                          **kwargs):
    
             needs_nested_translation = \
    @@ -1767,7 +1768,8 @@ class SQLCompiler(Compiled):
                         populate_result_map, asfrom,
                         column_clause_args,
                         name=name)
    -                for name, column in select._columns_plus_names
    +                for name, column
    +                in select._columns_plus_names(dedupe_select_cols=dedupe_select_cols)
                 ]
                 if c is not None
             ]
    @@ -2074,7 +2076,8 @@ class SQLCompiler(Compiled):
                 returning_clause = None
    
             if insert_stmt.select is not None:
    -            text += " %s" % self.process(self._insert_from_select, **kw)
    +            text += " %s" % self.process(
    +                self._insert_from_select, dedupe_select_cols=False, **kw)
             elif not crud_params and supports_default_values:
                 text += " DEFAULT VALUES"
             elif insert_stmt._has_multi_parameters:
    diff --git a/lib/sqlalchemy/sql/selectable.py b/lib/sqlalchemy/sql/selectable.py
    index a1e3abc..3f578cb 100644
    --- a/lib/sqlalchemy/sql/selectable.py
    +++ b/lib/sqlalchemy/sql/selectable.py
    @@ -3429,8 +3429,8 @@ class Select(HasPrefixes, HasSuffixes, GenerativeSelect):
             fromclause = _interpret_as_from(fromclause)
             self._from_obj = self._from_obj.union([fromclause])
    
    -    @_memoized_property
    -    def _columns_plus_names(self):
    +    #@_memoized_property
    +    def _columns_plus_names(self, dedupe_select_cols):
             if self.use_labels:
                 names = set()
    
    @@ -3445,17 +3445,30 @@ class Select(HasPrefixes, HasSuffixes, GenerativeSelect):
                         names.add(name)
                     return name, c
    
    -            return [
    -                name_for_col(c)
    -                for c in util.unique_list(
    -                    _select_iterables(self._raw_columns))
    -            ]
    +            if dedupe_select_cols:
    +                return [
    +                    name_for_col(c)
    +                    for c in util.unique_list(
    +                        _select_iterables(self._raw_columns))
    +                ]
    +            else:
    +                return [
    +                    name_for_col(c)
    +                    for c in _select_iterables(self._raw_columns)
    +                ]
    +
             else:
    -            return [
    -                (None, c)
    -                for c in util.unique_list(
    -                    _select_iterables(self._raw_columns))
    -            ]
    +            if dedupe_select_cols:
    +                return [
    +                    (None, c)
    +                    for c in util.unique_list(
    +                        _select_iterables(self._raw_columns))
    +                ]
    +            else:
    +                return [
    +                    (None, c)
    +                    for c in _select_iterables(self._raw_columns)
    +                ]
    
         def _populate_column_collection(self):
             for name, c in self._columns_plus_names:
    
  2. Stijn van Drongelen reporter

    You can generate your labels with anonymous names using:

    cols = [col.label(None) for col in cols]
    

    Aha, I missed that note in the documentation. Thanks for the hint!

  3. Log in to comment