|File name extension|
|Type of format||chemical file format|
The simplified molecular input line entry specification, or SMILES, is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).
In August of 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.
The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms Canonical and Isomeric can lead to some confusion when applied to SMILES. They describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure the same SMILES is generated for a molecule regardless of the order of atoms in the structure. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it, and is termed the Canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure and do not simply manipulate strings as is sometimes thought. Algorithms for generating Canonical SMILES have been developed at Daylight Chemical Information Systems, OpenEye Scientific Software and Chemical Computing Group. A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.
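The idea that canonicalisation works on a graph rather than on the string can be illustrated with a minimal Python sketch. It is not any vendor's algorithm: the hypothetical `parse` function handles only a tiny subset (one-letter atoms, single bonds, branches, ring digits), and the hypothetical `canonical_key` uses a brute-force search over all atom orderings, which is only practical for very small molecules.

```python
from itertools import permutations

def parse(smiles):
    """Parse a tiny SMILES subset into (atom symbols, set of bonds)."""
    atoms, bonds, branch_stack, open_rings = [], set(), [], {}
    prev = None  # index of the atom the next atom will bond to
    for ch in smiles:
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:  # second occurrence closes the ring
                bonds.add(frozenset((open_rings.pop(ch), prev)))
            else:                 # first occurrence opens it
                open_rings[ch] = prev
        else:
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
    return atoms, bonds

def canonical_key(smiles):
    """Smallest (labels, edges) description over all atom renumberings."""
    atoms, bonds = parse(smiles)
    best = None
    for order in permutations(range(len(atoms))):
        rank = {old: new for new, old in enumerate(order)}
        labels = tuple(atoms[old] for old in order)
        edges = tuple(sorted(tuple(sorted(rank[a] for a in bond))
                             for bond in bonds))
        if best is None or (labels, edges) < best:
            best = (labels, edges)
    return best

# The three ethanol strings describe one graph; dimethyl ether differs.
print(canonical_key("CCO") == canonical_key("OCC") == canonical_key("C(O)C"))
print(canonical_key("CCO") == canonical_key("COC"))
```

The first comparison prints True and the second False, because the key depends only on which atoms are bonded, not on the order in which the string lists them.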
SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone and SMILES which encode this information are termed Isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term Isomeric SMILES is also applied to SMILES in which isotopes are specified.
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
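The procedure described above can be sketched in Python. This is a simplified illustration under stated assumptions, not a full SMILES writer: the hypothetical `write_smiles` function takes a hydrogen-suppressed graph, treats every bond as single, and supports fewer than ten ring closures.

```python
def write_smiles(symbols, bonds, start=0):
    """Write a SMILES string by depth-first traversal.

    symbols: element symbols of the heavy atoms; bonds: (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    visited = set()
    children = {i: [] for i in neighbors}     # spanning-tree edges
    ring_labels = {i: [] for i in neighbors}  # closure digits per atom
    closed = set()                            # ring bonds already labelled

    def explore(atom, parent):
        visited.add(atom)
        for nbr in neighbors[atom]:
            if nbr == parent:
                continue
            if nbr not in visited:
                children[atom].append(nbr)
                explore(nbr, atom)
            elif frozenset((atom, nbr)) not in closed:
                # This bond would close a cycle: break it and label both ends.
                closed.add(frozenset((atom, nbr)))
                label = len(closed)
                ring_labels[atom].append(label)
                ring_labels[nbr].append(label)

    def emit(atom):
        text = symbols[atom] + "".join(str(d) for d in ring_labels[atom])
        kids = children[atom]
        for kid in kids[:-1]:          # all but the last child are branches
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    explore(start, None)
    return emit(start)

ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                                  # C1CCCCC1
print(write_smiles(["C", "F", "F", "F"], [(0, 1), (0, 2), (0, 3)]))   # C(F)(F)F
print(write_smiles(["C", "C", "O"], [(0, 1), (1, 2)], start=2))       # OCC
```

Changing `start` changes the string but not the molecule, which is why several equally valid SMILES exist for one structure.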
Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
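How the "proper number" of implicit hydrogens might be derived can be sketched as follows. The valence table is an assumption taken from the commonly used SMILES convention (the lowest default valence that accommodates the existing bonds), not something stated in this article, and `implicit_hydrogens` is a hypothetical helper name.

```python
# Default valences for the organic subset (assumed convention).
DEFAULT_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given bond orders."""
    for valence in DEFAULT_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds already exceed the highest default valence

print(implicit_hydrogens("O", 0))  # 2: the lone O of water
print(implicit_hydrogens("C", 1))  # 3: each carbon of "CC" (ethane)
print(implicit_hydrogens("C", 4))  # 0: the carbon of "O=C=O"
```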
Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively as illustrated by the SMILES O=C=O (carbon dioxide) and C#N (hydrogen cyanide).
Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
Aromatic C, O, S and N atoms are shown in their lower case 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH], and imidazole is written in SMILES notation as n1c[nH]cc1.
Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is one representation of trans-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-difluoroethene, in which the fluorine atoms are on the same side of the double bond, as shown in the figure.
Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O. The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appears clockwise. D-Alanine can be written as N[C@H](C)C(=O)O. The order of the substituents in the SMILES string is very important, and D-alanine can also be encoded as N[C@@H](C(=O)O)C.
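The dependence on substituent order can be made concrete with a small Python sketch. It relies on the fact that exchanging two substituents of a tetrahedral center inverts its handedness, so two encodings agree when an even reordering keeps the mark or an odd reordering flips it. The function names and the substituent labels ("N", "H", "CH3", "COOH") are illustrative choices, not part of the notation.

```python
def permutation_is_even(source, target):
    """True if reordering `source` into `target` takes an even number of swaps."""
    positions = [source.index(item) for item in target]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2 == 0

def same_configuration(order_a, mark_a, order_b, mark_b):
    """Compare two tetrahedral encodings given as (neighbor order, '@' or '@@')."""
    return permutation_is_even(order_a, order_b) == (mark_a == mark_b)

# Neighbor order as it appears in each SMILES string.
l_ala = (["N", "H", "CH3", "COOH"], "@@")    # N[C@@H](C)C(=O)O
d_ala = (["N", "H", "CH3", "COOH"], "@")     # N[C@H](C)C(=O)O
d_ala_alt = (["N", "H", "COOH", "CH3"], "@@")  # N[C@@H](C(=O)O)C

print(same_configuration(*d_ala, *d_ala_alt))  # True: same enantiomer
print(same_configuration(*l_ala, *d_ala))      # False: opposite enantiomers
```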
Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
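The fields of a bracket atom (isotope, symbol, chirality, hydrogen count, charge) can be pulled apart with a regular expression. The sketch below is a simplification written for this article's examples: `parse_bracket_atom` is a hypothetical helper, it accepts only one-letter aromatic symbols, and it does not handle repeated charge signs such as "++" or atom classes.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"             # optional integer isotopic mass
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"   # element symbol; lower case = aromatic
    r"(?P<chirality>@@?)?"             # optional @ or @@
    r"(?P<hcount>H\d*)?"               # optional explicit hydrogen count
    r"(?P<charge>[+-]\d*)?\]"          # optional charge such as - or +2
)

def parse_bracket_atom(text):
    """Split one bracket atom such as '[14c]' into its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    hcount, charge = match["hcount"], match["charge"]
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "aromatic": match["symbol"].islower(),
        "chirality": match["chirality"],
        "hydrogens": 0 if hcount is None else int(hcount[1:] or 1),
        "charge": 0 if charge is None else int(charge[0] + (charge[1:] or "1")),
    }

for atom in ["[14c]", "[2H]", "[OH-]", "[nH]", "[C@@H]", "[Au]"]:
    print(atom, parse_bracket_atom(atom))
```

Note that in [2H] the H is the element symbol itself, so the hydrogen count is zero, whereas in [OH-] the H is an attached hydrogen.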
The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.
SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism. SMIRKS is a line notation for specifying reaction transforms.
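A search on graph representations, as opposed to string matching, can be sketched with a naive backtracking matcher. This is an illustration only: `find_match` is a hypothetical function, graphs are given directly as atom lists and bond pairs rather than parsed from SMILES or SMARTS, bond orders are ignored, and "*" stands in for a wildcard atom. Practical toolkits use far more efficient algorithms.

```python
def find_match(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return target atom indices matching the query atoms, or None."""
    target_edges = {frozenset(bond) for bond in target_bonds}

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(query_atoms):
            return list(mapping)
        for t in range(len(target_atoms)):
            if t in mapping:
                continue
            if query_atoms[q] not in ("*", target_atoms[t]):
                continue
            # Every query bond from atom q to an earlier query atom
            # must also exist between the mapped target atoms.
            bonds_ok = all(
                frozenset((mapping[min(a, b)], t)) in target_edges
                for a, b in query_bonds
                if max(a, b) == q
            )
            if bonds_ok:
                result = extend(mapping + [t])
                if result is not None:
                    return result
        return None

    return extend([])

ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])

print(find_match(["C", "O"], [(0, 1)], *ethanol))  # [1, 2]
print(find_match(["*", "O"], [(0, 1)], *ethanol))  # [1, 2]
print(find_match(["C", "O"], [(0, 1)], *propane))  # None
```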
SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.