UNII to NDFRT

With coverage of UNII in UMLS 2008 much better, we can do this pretty well now;

select kind, count(1)
  from umls.unii_cui_map u
    inner join umls.ndfrt_role on(target_value = cui or target_value = unii)
    inner join umls.ndfrt_concept on(id = source_id)
  group by kind

All are from kind 11, which is Chemical/Ingredient.

For SPL we want NUIs:

select unii, r.source_id as id, n.target_value as nui
  from umls.unii_cui_map u
    inner join umls.ndfrt_role r 
      on(r.role_def_id = 'P141' and r.target_value = cui)
    inner join umls.ndfrt_role n on(n.source_id = r.source_id and n.role_def_id = 'P262794')

and those now are the NUIs for chemical class. We could put the NDF-RT NUI hierarchy in there and use ids for the classes rather than codes and use NUI and MeSH code as synonyms. That way both will work.

Also, I checked and found that the UNII mapping inside NDF-RT is all contained in the one through UMLS:

select code as unii_code, r.source_id as id, n.target_value as nui
  from umls.unii_concept u
    inner join umls.ndfrt_role r 
      on(r.role_def_id = 'P262793' and r.target_value = u.code)
    inner join umls.ndfrt_role n on(n.source_id = r.source_id and n.role_def_id = 'P262794')
minus
select unii, r.source_id as id, n.target_value as nui
  from umls.unii_cui_map u
    inner join umls.ndfrt_role r 
      on(r.role_def_id = 'P141' and r.target_value = cui)
    inner join umls.ndfrt_role n on(n.source_id = r.source_id and n.role_def_id = 'P262794')

so nothing more to do. BTW, its 1303 mappings, not too bad, although not great. However, the numbers are deceptive because UNII now contains many (if not most) inactive ingredients which we would not expect to be mapped to NDF-RT.

select unii, r.source_id as id, n.target_value as nui
  from umls.unii_cui_map u
    inner join umls.ndfrt_role r 
      on(r.role_def_id = 'P141' and r.target_value = cui)
    inner join umls.ndfrt_role n on(n.source_id = r.source_id and n.role_def_id = 'P262794')

Now how do we add the other annotations? We have to go through the single-ingredient preparation. It turns out to be best combined with the unii mapping:

drop table umls.unii_ndfrt_map
go
create table umls.unii_ndfrt_map as
select distinct unii, r.source_id as ingredient_id, n.target_value as ingredient_nui, i.source_id as preparation_id
  from umls.unii_cui_map u
    inner join umls.ndfrt_role r 
      on(r.role_def_id = 'P141' and r.target_value = cui)
    inner join umls.ndfrt_role n on(n.source_id = r.source_id and n.role_def_id = 'P262794')
    left outer join (select source_id, min(target_id) as target_id
                       from umls.ndfrt_role i 
                       where i.role_def_id = 'R25'
                       group by source_id having count(1) = 1) i on(i.target_id = r.source_id)

1557 items, now we can see there must be some duplicates, but not to worry.

From here it's easy to get the other annotations:

PE:

drop table umls.unii_ndfrt_annotation
go
create table umls.unii_ndfrt_annotation as
select distinct unii, r.role_def_id, r.target_id as id, n.target_value as nui
  from umls.unii_ndfrt_map m
    inner join umls.ndfrt_role r on(r.source_id = m.preparation_id and r.role_def_id IN ('R29', 'R21'))
    inner join umls.ndfrt_role n on(n.source_id = r.target_id and n.role_def_id = 'P262794')
go
create index umls.unii_ndfrt_annotation_idx on umls.unii_ndfrt_annotation(unii, role_def_id)

And that's all we need.

We also need to make a full NDF-RT export of the MoA and PE hierarchies.

OLD APPROACH

Here is the UNII to NDFRT map. Almost all active moiety UNIIs are mapped, save 63 that are very amenable to manual mapping:

SELECT * 
  FROM (SELECT *
          FROM UNII u
          WHERE EXISTS (SELECT 1 
                          FROM mw.Entity e 
                            INNER JOIN mw.Role_ r ON(e.internalId = r.playerInternalId)
                          WHERE e.code_code = u.code
                            AND r.classCode = 'ACTM')
       ) u
    LEFT OUTER JOIN UMLS.NDFRT_CONCEPT n 
      ON(n.name = u.displayName 
         OR (u.displayName LIKE '%IC ACID' AND n.name = SUBSTR(u.displayName, 1, LENGTH(u.displayName) - 7)||'ATE'))

Uploading the manually completed map as per attached spreadsheet and then completing the map as follows:

DROP TABLE UMLS.UNII_NDFRT_MAP
GO
CREATE TABLE UMLS.UNII_NDFRT_MAP AS
SELECT unii_code, n.id
  FROM UMLS.NDFRT_CONCEPT n
    INNER JOIN (
      SELECT UNII_CODE, COALESCE(NDF_NAME,NDF_HAND_MAPPED,CANDIDATE) AS NDF_NAME
        FROM UNII_NDFRT_MAP
    ) m ON(n.NAME = m.NDF_NAME)
  GROUP BY unii_code, n.id

848 records added, i.e. we have 848 mappings between active moieties and NDFRT things.

And this hasn't even made use of the UNII mappings that exist in NDF-RT itself.

WITH ndfrt_ndf_unii_map AS (
  SELECT source_id AS ndf_id, target_value AS unii
    FROM UMLS.NDFRT_ROLE
      WHERE role_def_id = 'P262793'
), text_ndf_unii_map AS (
  SELECT id AS ndf_id, unii_code AS unii
    FROM UMLS.UNII_NDFRT_MAP
), combined_ndf_unii_map AS (
  SELECT max(ndf_id) AS ndf_id, unii FROM (
  SELECT * FROM ndfrt_ndf_unii_map
  UNION
  SELECT * FROM text_ndf_unii_map
  ) GROUP BY unii
), unii_moiety AS (
  SELECT code_code AS unii, trivialname AS name
  FROM mw.Entity e
    INNER JOIN mw.Entity_name n ON(n.entityInternalId = e.internalId)
  WHERE EXISTS (SELECT 1 FROM mw.Role_ r
                  WHERE e.InternalId = r.playerInternalId
                    AND r.classCode IN ('ACTM'))
) SELECT COUNT(DISTINCT unii), COUNT(DISTINCT ndf_id), COUNT(DISTINCT ndf_id)/COUNT(DISTINCT unii) 
  FROM unii_moiety
    LEFT OUTER JOIN ndfrt_ndf_unii_map nr USING(unii)
UNION ALL
SELECT COUNT(DISTINCT unii), COUNT(DISTINCT ndf_id), COUNT(DISTINCT ndf_id)/COUNT(DISTINCT unii) 
  FROM unii_moiety
    LEFT OUTER JOIN text_ndf_unii_map nr USING(unii)
UNION ALL
SELECT COUNT(DISTINCT unii), COUNT(DISTINCT ndf_id), COUNT(DISTINCT ndf_id)/COUNT(DISTINCT unii) 
  FROM unii_moiety
    LEFT OUTER JOIN combined_ndf_unii_map nr USING(unii)

Result:

Mapping SourceUNIINDF%
NDFRT's mapping105934232.3%
Text-Based105982878.2%
COMBINED105985380.5%

However we also found the combined mapping was not unique, and before we can use the combined mapping we need to reconcile it properly.

First to analyze:

WITH ndfrt_ndf_unii_map AS (
  SELECT 'N' AS source, source_id AS ndf_id, target_value AS unii
    FROM UMLS.NDFRT_ROLE
      WHERE role_def_id = 'P262793'
), text_ndf_unii_map AS (
  SELECT 'T' AS source, id AS ndf_id, unii_code AS unii
    FROM UMLS.UNII_NDFRT_MAP
), combined_ndf_unii_map AS (
--  SELECT max(ndf_id) AS ndf_id, unii FROM (
  SELECT * FROM ndfrt_ndf_unii_map
  UNION
  SELECT * FROM text_ndf_unii_map
--  ) GROUP BY unii
), combined_ndf_unii_ambig AS (
  SELECT unii, COUNT(1) AS count
    FROM combined_ndf_unii_map
  GROUP BY unii HAVING COUNT(1) > 1
) SELECT unii, u.displayname, source, count, ndf_id, n.name, n.kind
    FROM combined_ndf_unii_ambig 
      INNER JOIN combined_ndf_unii_map USING(unii)
      INNER JOIN unii u ON(u.code = unii)
      INNER JOIN umls.ndfrt_concept n ON(n.id = ndf_id)
    ORDER BY count DESC, UNII, NDF_ID

We see that our text-based map, while of better coverage, maps to the wrong NDFRT terms. It should map to kind 11 (ingredient/chemical) not to 9 (NDF kind). That is easy to fix though:

WITH ndfrt_ndf_unii_map AS (
  SELECT 'N' AS source, source_id AS ndf_id, target_value AS unii
    FROM UMLS.NDFRT_ROLE
      WHERE role_def_id = 'P262793'
), text_ndf_unii_map AS (
  SELECT CASE 
           WHEN mc.kind = '11' THEN 'T1'
           WHEN nc.kind = '11' THEN 'T2'
           ELSE 'T3'
         END||'['||nc.kind||':'||nc.id||':'||nc.name||'-'||mc.kind||':'||mc.id||':'||mc.name||']' AS source, 
         CASE 
           WHEN mc.kind = '11' THEN mc.id
           WHEN nc.kind = '11' THEN nc.id
           ELSE m.id
         END AS ndf_id, 
         unii_code AS unii
    FROM UMLS.UNII_NDFRT_MAP m
      INNER JOIN UMLS.NDFRT_CONCEPT mc ON(mc.id = m.id)
      LEFT OUTER JOIN UMLS.NDFRT_ROLE r ON(r.source_id = m.id AND r.role_def_id = 'R25')
      LEFT OUTER JOIN UMLS.NDFRT_CONCEPT nc ON(nc.id = r.target_id)
), combined_ndf_unii_map AS (
--  SELECT max(ndf_id) AS ndf_id, unii FROM (
  SELECT * FROM ndfrt_ndf_unii_map
  UNION
  SELECT * FROM text_ndf_unii_map
--  ) GROUP BY unii
), combined_ndf_unii_ambig AS (
  SELECT unii, COUNT(1) AS count
    FROM combined_ndf_unii_map
  GROUP BY unii HAVING COUNT(1) > 1
) SELECT unii, u.displayname, source, count, ndf_id, n.name, n.kind
    FROM combined_ndf_unii_ambig 
      INNER JOIN combined_ndf_unii_map USING(unii)
      INNER JOIN unii u ON(u.code = unii)
      INNER JOIN umls.ndfrt_concept n ON(n.id = ndf_id)
    ORDER BY count DESC, UNII, NDF_ID

Ech, it's got to be too complicated now. We should start over. Mappings between UNII and NDFRT come from the folllowing sources:

  1. from NDFRT UNII property itself (8421 mappings, property P262793)
  2. from a text based mapping to the NDF_KIND term (all upper case)
  3. from a text based mapping to the NDF_INGREDIENT term (just added)
  4. from manual mapping (as per attached spreadsheet in UsingNdfRt)

Either way, the NDF-RT concepts we get should be mapped to NDF_INGREDIENT concepts, as those are the true concepts of the chemical structure classification.

But we need to do that in some modular way lest we go crazy.

  • UNII -> NDF -> INGR
-- DROP TABLE UNII_NDF_WO_CHEM
-- GO
--CREATE TABLE UNII_NDF_WO_CHEM AS
WITH unii_ndf_map_source AS (
   SELECT 'N' as source, target_value AS unii, TO_CHAR(source_id) AS ndf_id, '' AS text
    FROM UMLS.NDFRT_ROLE WHERE role_def_id = 'P262793' 
  UNION ALL
   SELECT 'T' as source, u.code AS unii, TO_CHAR(n.id) AS ndf_id, u.displayName||'~'||n.name AS text
     FROM UNII u
      INNER JOIN UMLS.NDFRT_CONCEPT n 
         ON(n.name = u.displayName 
            OR (u.displayName LIKE '%IC ACID' AND n.name = SUBSTR(u.displayName, 1, LENGTH(u.displayName) - 7)||'ATE'))
  UNION ALL
   SELECT 'M' as source, unii_code AS unii, TO_CHAR(ndf_code) AS ndf_id, unii_name||'~'||ndf_name AS text
     FROM UNII_NDFRT_MAP WHERE ndf_code IS NOT NULL
  UNION ALL
   SELECT 'I' as source, u.code AS unii, TO_CHAR(n.id) AS ndf_id, u.displayName||'~'||n.name AS text
     FROM UNII u
      INNER JOIN UMLS.NDFRT_CONCEPT n 
        ON( lower(n.name) = lower(u.displayName)||' [chemical/ingredient]' AND n.kind = 11)
--  UNION ALL
--   SELECT 'U' as source, unii, TO_CHAR(ndf_id), '' AS TEXT
--     FROM UNII_NDF_WI_CHEM_MSH_CHEM
), unii_ndf_map_source2 AS (
  SELECT source, unii, 
      CASE 
        WHEN c.kind = 11 THEN c.id
        WHEN nc.kind = 11 THEN nc.id
      END ndf_id,
      ndf_id AS ndf_id1, c.kind AS kind1, nc.id AS ndf_id2, nc.kind AS kind2, text
    FROM unii_ndf_map_source
      INNER JOIN UMLS.NDFRT_CONCEPT c ON(TO_CHAR(c.id) = ndf_id)
      LEFT OUTER JOIN UMLS.NDFRT_ROLE r ON(r.source_id = c.id AND r.role_def_id IN ('R25', 'R80'))
      LEFT OUTER JOIN UMLS.NDFRT_CONCEPT nc ON(nc.id = r.target_id)
), unii_ndf_map_consolidated1 AS (
  SELECT unii, MIN(ndf_id) AS ndf_id, MAX(ndf_id) AS ndf_id2, COUNT(1) AS count,
        STRINGJOIN(source||ndf_id1||':'||ndf_id||' ') as source,
        STRINGJOIN(text||' ') as text
    FROM unii_ndf_map_source2
    GROUP BY unii
) SELECT * FROM unii_ndf_map_consolidated1 m
    WHERE ( (ndf_id IS NULL
           AND NOT EXISTS (SELECT 1 FROM unii_ndf_map_consolidated1 x
                            WHERE x.unii = m.unii AND x.ndf_id IS NOT NULL))
      OR ndf_id <> ndf_id2 )

It turns out that 730 UNII codes could be mapped through text but still don't have a has_Ingredient link to a chemical structure. That's pretty sad as there are fine things in them, e.g., MONTELUKAST SODIUM, MECLICILIN HYDROCHLORIDE, and tons more. The problem is much reduced (to 88 items only) if we focus only on those concepts that appear as active moieties. FDA might map to NDFRT using those items, but they don't have any chemical structure.

How about other relationships? Well, it looks like all NDFRT knowledge is connected to the NDF_KIND term anyway. So for that it's not a great loss. OTOH the chemical structure, relevant for allergens, is MeSH even in NDF RT. May be we should instead map directly to a MeSH concept for chemical structure?

Let's try this through UMLS, of course there is no problem mapping it all to MeSH directly. And we have no problem to find things under a MeSH heading under Chemical. This shows that the NDF RT is broken in this regard and that it would be better to use MeSH directly.

DROP TABLE UNII_NDF_WO_CHEM
GO
CREATE TABLE UNII_NDF_WO_CHEM AS
WITH ... -- as above
GO
DROP TABLE UNII_NDF_WO_CHEM_MSH_CUI
GO
CREATE TABLE UNII_NDF_WO_CHEM_MSH_CUI AS 
SELECT DISTINCT unii, u.displayName, n.cui, n.aui
  FROM UNII_NDF_WO_CHEM
    INNER JOIN UNII u ON(u.code = unii)
    INNER JOIN UMLS.MRCONSO m ON(lower(u.displayName) = lower(m.str) AND m.LAT='ENG')
    INNER JOIN UMLS.MRCONSO n ON(n.cui = m.cui AND n.sab='MSH' AND n.LAT='ENG' AND n.TS='P' AND n.STT='PF')
-- 502 record(s) affected 
GO
CREATE TABLE UNII_NDF_WI_CHEM_MSH_CHEM AS
SELECT DISTINCT unii, nc.id AS ndf_id
  FROM UNII_NDF_WO_CHEM_MSH_CUI n
    INNER JOIN UMLS.MRREL nr ON(nr.aui1 = n.aui AND nr.sab = 'MSH' AND nr.rel='RB')
    INNER JOIN UMLS.MRCONSO nrn ON(nrn.aui = nr.aui2 AND nrn.sab = 'MSH' AND nrn.LAT='ENG' AND nrn.TS='P' AND nrn.STT='PF')
    INNER JOIN UMLS.MRCONSO ndn ON(ndn.cui = nrn.cui AND ndn.sab='NDFRT' AND ndn.TTY='IN')
    INNER JOIN UMLS.NDFRT_CONCEPT nc
       ON(lower(ndn.str)||' [chemical/ingredient]' = lower(nc.name) AND nc.kind = 11)
-- 419 record(s) affected
GO
-- then add
WITH ...
...
    UNION ALL
     SELECT 'U' as source, unii, TO_CHAR(ndf_id), '' AS TEXT
       FROM UNII_NDF_WI_CHEM_MSH_CHEM
...

After which there are only 334 records left. Some mappings are now ambiguous because of multiple MSH headings.

But we can't change that fact that SPL is set to use NDF-RT structural classes and that it will use the N-codes (NUIs) for that. Our only direction for now is to focus on active moieties.

SELECT * 
  FROM UNII_NDF_WO_CHEM 
   WHERE EXISTS (SELECT 1 FROM mw.Entity e 
                            INNER JOIN mw.Role_ r ON(e.internalId = r.playerInternalId)
                          WHERE e.code_code = unii
                            AND r.classCode = 'ACTM')

But even there we are still hosed for such common things as PROGUANIL and DEXTROSE.

So the final map is now:

CREATE TABLE UNII_NDF_MAP AS
WITH unii_ndf_map_source AS (
   SELECT 'N' as source, target_value AS unii, TO_CHAR(source_id) AS ndf_id, '' AS text
    FROM UMLS.NDFRT_ROLE WHERE role_def_id = 'P262793' 
  UNION ALL
   SELECT 'T' as source, u.code AS unii, TO_CHAR(n.id) AS ndf_id, u.displayName||'~'||n.name AS text
     FROM UNII u
      INNER JOIN UMLS.NDFRT_CONCEPT n 
         ON(n.name = u.displayName 
            OR (u.displayName LIKE '%IC ACID' AND n.name = SUBSTR(u.displayName, 1, LENGTH(u.displayName) - 7)||'ATE'))
  UNION ALL
   SELECT 'M' as source, unii_code AS unii, TO_CHAR(ndf_code) AS ndf_id, unii_name||'~'||ndf_name AS text
     FROM UNII_NDFRT_MAP WHERE ndf_code IS NOT NULL
  UNION ALL
   SELECT 'I' as source, u.code AS unii, TO_CHAR(n.id) AS ndf_id, u.displayName||'~'||n.name AS text
     FROM UNII u
      INNER JOIN UMLS.NDFRT_CONCEPT n 
        ON( lower(n.name) = lower(u.displayName)||' [chemical/ingredient]' AND n.kind = 11)
  UNION ALL
    SELECT 'U' as source, unii, TO_CHAR(ndf_id), '' AS TEXT
      FROM UNII_NDF_WI_CHEM_MSH_CHEM
), unii_ndf_map_source2 AS (
  SELECT source, unii, 
      CASE 
        WHEN c.kind = 11 THEN c.id
        WHEN nc.kind = 11 THEN nc.id
      END ndf_id,
      ndf_id AS ndf_id1, c.kind AS kind1, nc.id AS ndf_id2, nc.kind AS kind2, text
    FROM unii_ndf_map_source
      INNER JOIN UMLS.NDFRT_CONCEPT c ON(TO_CHAR(c.id) = ndf_id)
      LEFT OUTER JOIN UMLS.NDFRT_ROLE r ON(r.source_id = c.id AND r.role_def_id IN ('R25', 'R80'))
      LEFT OUTER JOIN UMLS.NDFRT_CONCEPT nc ON(nc.id = r.target_id)
), unii_ndf_map_consolidated1 AS (
  SELECT unii, MIN(ndf_id) AS ndf_id, MAX(ndf_id) AS ndf_id2, COUNT(1) AS count,
        STRINGJOIN(source||ndf_id1||':'||ndf_id||' ') as source,
        STRINGJOIN(text||' ') as text
    FROM unii_ndf_map_source2
    GROUP BY unii
) SELECT * FROM unii_ndf_map_consolidated1 m
    WHERE NOT ( (ndf_id IS NULL
           AND NOT EXISTS (SELECT 1 FROM unii_ndf_map_consolidated1 x
                            WHERE x.unii = m.unii AND x.ndf_id IS NOT NULL))
      OR ndf_id <> ndf_id2 )

Even after all this we have only 1556 of 3242 UNIIs mapped to a proper NDF RT chemical structure term and it's even worse than it was for active moieties!!!

This is a big unforseen painful problem in the NDF RT.

So, let's forget about this and use the MeSH directly. MeSH Chemical Structure