This repo is an experiment in:
- reading Haskell package information from ~/.cabal/packages/hackage.haskell.org/01-index.tar
- parsing cabal details using flatparse and streamly.
- creating dependency charts and graphing them with graphviz via dotparse and chart-svg.
- using org-mode for rapid Haskell development.
Notes on how to run ghci within org-mode are available at checklist: How I start Haskell.
:set prompt "> "
:set -XOverloadedStrings
:set -Wno-type-defaults
putStrLn "ok"
:r
:set -Wno-deprecations
import Research.Hackage
import qualified Streamly.Prelude as S
import qualified Streamly.Internal.Data.Unfold as Unfold
import Data.Function
import Streamly.External.Archive
import Data.Either
import qualified Data.ByteString.Char8 as C
import Data.Bifunctor
import qualified Data.Map.Strict as Map
import DotParse
import FlatParse.Basic qualified as FP
import Algebra.Graph
import qualified Algebra.Graph.ToGraph as ToGraph
import Data.Foldable
import Chart
import Data.String.Interpolate
import Optics.Core
-- qualified names (B., List., Set.) and Data.Maybe helpers used by later snippets
import qualified Data.ByteString as B
import qualified Data.List as List
import qualified Data.Set as Set
import Data.Maybe
putStrLn "ok"
import System.Directory
import Control.Monad
import Data.List
h <- getHomeDirectory & fmap (<> "/haskell")
ds <- getDirectoryContents h
ds' = filter (\x -> x /= "." && x /= "..") ds
ds'' <- filterM doesDirectoryExist $ (\x -> h <> "/" <> x) <$> ds'
fs <- mapM (\x -> (x,) <$> getDirectoryContents x) ds''
cabals = mconcat $ fmap ((\(d,fs)-> (\f -> d <> "/" <> f) <$> fs) . second (filter (isSuffixOf ".cabal"))) fs
cabals
Cabal file contents in the haskell directory:
haskellStream = S.unfold Unfold.fromListM ((\x -> (x,) <$> readFile x) <$> cabals)
:t haskellStream
s = fmap (first C.pack . second C.pack) haskellStream
package count
s & S.map (const 1) & S.sum
files
fields <- S.toList $ fmap (fromRight undefined . readFields . snd) s
fmap length fields
count_ $ mconcat $ fmap (fmap names) fields
- which cabal has no author?
- common?
- extra-source-files
- stability
- test-suite * 3
finding exclusions
S.toList $ fmap fst $ S.filter (not . any ((=="copyright") . names) . snd) $ fmap (second (fromRight undefined . readFields)) s
looking at single fields
S.toList $ fmap (second (mconcat . fmap (fieldValue "copyright"))) $ fmap (second (fromRight undefined . readFields)) s
:t count
yearList = [("numhask",2016),("mealy",2013),("box",2017),("formatn",2016),("prettychart",2023),("code",2023),("poker-fold",2020),("numhask-space",2016),("iqfeed",2014),("box-socket",2017),("numhask-array",2016),("euler",2023),("tonyday567",2020),("foo",2023),("web-rep",2015),("dotparse",2022),("perf",2018),("anal",2023),("research-hackage",2022),("chart-svg",2017),("ephemeral",2020)]
:t yearList :: [(String, Int)]
license a y = [i|
Copyright #{a} (c) #{y}
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of #{a} nor the names of other
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|]
The development loop largely starts by re-establishing state: run the code below, which represents milestones in parsing cabal index data and their (eventual) reification in Research.Hackage.
vlibs <- Map.delete "acme-everything" <$> validLatestLibs
deps = fmap (fromRight undefined . parseDeps . mconcat . mconcat . rawBuildDeps . snd) vlibs
bdnames <- fmap (fmap fst) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps
depsExclude = filter (not . (`elem` (Map.keys vlibs))) bdnames
vdeps = Map.filter (not . null) $ fmap (filter (not . (`elem` depsExclude))) deps
depG = stars (Map.toList vdeps)
vertexCount depG
edgeCount depG
15135 109900
depG is an algebraic graph (Algebra.Graph, from the algebraic-graphs package) whose vertices are the latest cabal library package names and whose edges are their dependencies.
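To make the shape of depG concrete, here is a minimal sketch of what building a graph from a list of (package, dependencies) pairs amounts to. It does not use algebraic-graphs; the `DepGraph` type and the `starsSketch`, `vertexCountSketch` and `edgeCountSketch` names are hypothetical stand-ins built on containers, not functions from this repo.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical stand-in for a dependency graph:
-- each vertex maps to the set of its direct dependencies.
type DepGraph = Map.Map String (Set.Set String)

-- Overlay one "star" per package: the package points at each dependency.
-- Dependencies become vertices too, even without an entry of their own.
starsSketch :: [(String, [String])] -> DepGraph
starsSketch xs = Map.unionsWith Set.union (concatMap star xs)
  where
    star (p, ds) =
      Map.singleton p (Set.fromList ds) : [Map.singleton d Set.empty | d <- ds]

-- Number of distinct vertices.
vertexCountSketch :: DepGraph -> Int
vertexCountSketch = Map.size

-- Number of distinct edges.
edgeCountSketch :: DepGraph -> Int
edgeCountSketch = sum . fmap Set.size
```

Duplicate edges collapse because each adjacency entry is a set, which is why the edge count can be lower than the raw number of parsed dependencies.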
The archive is located at ~/.cabal/packages/hackage.haskell.org/01-index.tar
and contains about 290k unique entries (May 2022).
All pathNames exist, all file types are regular, and there are no utf8 issues with pathNames, so we use the header pathName to roll up the archive.
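The roll-up by path name can be pictured with a pure-list sketch. This is an assumption-laden simplification: the real stream carries `Either Header ByteString`, whereas here a header is reduced to its path name as a `String`, and `groupByPathNameSketch` is a hypothetical name rather than the repo's `groupByPathName`.

```haskell
-- Left is an entry header (reduced to its path name);
-- Right is a chunk of content belonging to the most recent header.
groupByPathNameSketch :: [Either String String] -> [(String, String)]
groupByPathNameSketch = go Nothing
  where
    -- End of input: emit the entry being accumulated, if any.
    go cur [] = maybe [] pure cur
    -- New header: emit the previous entry and start a fresh one.
    go cur (Left path : rest) = maybe id (:) cur (go (Just (path, "")) rest)
    -- Content chunk: append it to the current entry (dropped if no header yet).
    go cur (Right chunk : rest) = go (fmap (fmap (<> chunk)) cur) rest
```

An entry whose header is followed immediately by another header comes out with empty content, which is the situation counted for the preferred-versions entries further down.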
package count:
:t groupByPathName
:t Unfold.take 10000000 archive
:t groupByPathName (Unfold.take 10000000 archive)
packageStream & S.map (const 1) & S.sum
groupByPathName :: S.IsStream t => Unfold IO a (Either Header ByteString) -> t IO (ByteString, ByteString)
Unfold.take 10000000 archive :: Unfold IO Void (Either Header ByteString)
groupByPathName (Unfold.take 10000000 archive) :: S.IsStream t => t IO (ByteString, ByteString)
303794
S.toList $ S.filter ((/= Just (Just FileTypeRegular)) . fmap fileType) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)
S.toList $ S.filter (\x -> fmap pathName x /= fmap pathNameUtf8 x) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)
S.toList $ S.filter (\x -> fmap pathName x == Nothing) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)
[] > [] > []
The first 10 package names
S.toList $ S.take 10 $ fmap fst packageStream
iconv/0.2/iconv.cabal |
Crypto/3.0.3/Crypto.cabal |
HDBC/1.0.1/HDBC.cabal |
HDBC-odbc/1.0.1.0/HDBC-odbc.cabal |
HDBC-postgresql/1.0.1.0/HDBC-postgresql.cabal |
HDBC-sqlite3/1.0.1.0/HDBC-sqlite3.cabal |
darcs-graph/0.1/darcs-graph.cabal |
hask-home/2006.3.23/hask-home.cabal |
hmp3/1.1/hmp3.cabal |
lambdabot/4.0/lambdabot.cabal |
Some entries have no content, but these are all of the preferred-versions type.
S.length $ S.filter ((=="") . snd) $ packageStream
43
package path names are either preferred-versions, .cabal or package.json
S.length $ fmap fst $ S.filter (not . (\x -> B.isSuffixOf "preferred-versions" x || B.isSuffixOf ".cabal" x || B.isSuffixOf "package.json" x) . fst) $ packageStream
0
Reifying this as NameType:
:i NameType
S.fold count $ fmap (bimap toNameType (=="")) $ packageStream
type NameType :: *
data NameType = CabalName | PreferredVersions | PackageJson | BadlyNamed
  -- Defined at src/Research/Hackage.hs:192:1
instance Eq NameType -- Defined at src/Research/Hackage.hs:192:95
instance Ord NameType -- Defined at src/Research/Hackage.hs:192:90
instance Show NameType -- Defined at src/Research/Hackage.hs:192:84
fromList [((CabalName,False),168535),((PreferredVersions,False),3115),((PreferredVersions,True),43),((PackageJson,False),132101)]
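The classification by path suffix can be sketched as follows. The `NameType` constructors are the ones shown in the `:i` output; `toNameTypeSketch` is a hypothetical `String` version written for illustration, and the actual `toNameType` in Research.Hackage (which works on `ByteString`) may be implemented differently.

```haskell
import Data.List (isSuffixOf)

data NameType = CabalName | PreferredVersions | PackageJson | BadlyNamed
  deriving (Eq, Ord, Show)

-- Classify an index path by its suffix; anything else is BadlyNamed.
toNameTypeSketch :: String -> NameType
toNameTypeSketch path
  | ".cabal" `isSuffixOf` path = CabalName
  | "preferred-versions" `isSuffixOf` path = PreferredVersions
  | "package.json" `isSuffixOf` path = PackageJson
  | otherwise = BadlyNamed
```

The zero count from the suffix check above corresponds to the absence of any `BadlyNamed` key in the fold result.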
S.toList $ S.take 10 $ S.filter (\(x,c) -> B.isSuffixOf "preferred-versions" x && c /= "") $ packages archive
package-json
The package.json content is a security/signing feature that you can read about in hackage-security.
S.length $ S.filter ((\x -> B.isSuffixOf "package.json" x) . fst) $ packageStream
132101
S.toList $ S.take 4 $ S.filter ((\x -> B.isSuffixOf "package.json" x) . fst) $ packageStream
S.length $ S.filter ((\x -> B.isSuffixOf ".cabal" x) . fst) $ packageStream
168535
fmap fst <$> (S.toList $ S.take 10 $ S.filter ((\x -> B.isSuffixOf ".cabal" x) . fst) $ packageStream)
So there are about 168k cabal files to R&D.
malformed version number check
mErrs <- S.fold (collect fst snd) $ S.filter (isLeft . snd) $ fmap (second (parseVersion . C.pack)) $ fmap (fromRight undefined) $ S.filter isRight $ fmap (Research.Hackage.parsePath . fst) $ S.filter ((==CabalName) . toNameType . fst) packageStream
length mErrs
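As a rough picture of what a malformed-version check involves, here is a minimal sketch that accepts only dot-separated runs of digits. `parseVersionSketch` and `splitDots` are hypothetical names; the repo's `parseVersion` works on `ByteString` and its exact rules are not shown here.

```haskell
import Data.Char (isDigit)

-- Parse "0.10.3" into [0,10,3]; anything else is reported as malformed.
parseVersionSketch :: String -> Either String [Int]
parseVersionSketch s = traverse component (splitDots s)
  where
    component c
      | not (null c) && all isDigit c = Right (read c)
      | otherwise = Left ("malformed version: " <> s)

-- Split a string on '.' characters, keeping empty components.
splitDots :: String -> [String]
splitDots str = case break (== '.') str of
  (chunk, []) -> [chunk]
  (chunk, _ : rest) -> chunk : splitDots rest
```

Parsing to a list of numbers matters for picking the latest version: lists compare component by component, so `[0,10]` sorts after `[0,9]`, whereas the strings "0.10" and "0.9" would sort the other way.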
Total number of names
t1 <- S.fold (collect fst snd) $ fmap (second (fromRight undefined)) $ S.filter (isRight . snd) $ fmap (second (parseVersion . C.pack)) $ fmap (fromRight undefined) $ S.filter isRight $ fmap (Research.Hackage.parsePath . fst) $ S.filter ((==CabalName) . toNameType . fst) packageStream
length t1
> 17055
Average number of versions:
fromIntegral (sum $ Map.elems $ length <$> t1) / fromIntegral (length t1)
9.658348979468233
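The name and version bookkeeping behind these counts can be sketched with plain containers. This assumes cabal paths have the shape name/version/name.cabal, as in the first ten package names above; `parsePathSketch`, `splitOn`, `collectVersions` and `averageVersions` are hypothetical helpers, not the repo's `parsePath` or `collect`.

```haskell
import qualified Data.Map.Strict as Map

-- Split "name/version/name.cabal" into (name, version); Nothing for other shapes.
parsePathSketch :: String -> Maybe (String, String)
parsePathSketch path = case splitOn '/' path of
  [name, version, _file] -> Just (name, version)
  _ -> Nothing

-- Split a string on a separator character, keeping empty components.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, []) -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest

-- Collect every version seen for each package name, in input order.
collectVersions :: [String] -> Map.Map String [String]
collectVersions paths =
  Map.fromListWith (flip (<>)) [(n, [v]) | Just (n, v) <- fmap parsePathSketch paths]

-- Mean number of versions per package name (NaN for an empty map).
averageVersions :: Map.Map String [String] -> Double
averageVersions m = fromIntegral (sum (length <$> m)) / fromIntegral (Map.size m)
```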
All of the latest cabal files have content:
latest = Map.map maximum t1
length $ Map.toList $ Map.filter (==[]) latest
0
lcf <- latestCabalFiles
length $ Map.toList lcf
16511
field errors
fmap (\x -> C.pack (fst x) <> "-" <> toVer (fst (snd x))) $ Map.toList $ Map.filter (isLeft . readFields . snd) lcf
DSTM-0.1.2 | control-monad-exception-mtl-0.10.3 | ds-kanren-0.2.0.1 | metric-0.2.0 | phasechange-0.1 | smartword-0.0.0.5 |
valid cabal files with ok parsing of all fields:
vlcs <- validLatestCabals
:t vlcs
length vlcs
17049
import Data.Ord
fmap (take 10 . List.sortOn (Down . snd) . Map.toList) $ S.fold count $ S.fromList $ fmap names $ mconcat $ fmap snd $ Map.toList $ fmap snd vlcs
fmap (take 10 . List.sortOn (Down . snd) . Map.toList) $ S.fold count $ S.fromList $ mconcat $ fmap authors $ fmap snd $ Map.toList $ fmap snd vlcs
not libraries
Map.size $ Map.filter ((0==) . length) $ fmap (catMaybes . fmap (sec "library") . snd) vlcs
1743
multiple libraries
Map.size $ Map.filter ((>1) . length) $ fmap (catMaybes . fmap (sec "library") . snd) vlcs
79
Multiple libraries are usually “internal” libraries that can only be used inside the cabal file.
take 10 $ Map.toList $ Map.filter (\x -> x/=[[]] && x/=[] && listToMaybe x /= Just []) $ fmap (fmap (fmap secName) . fmap fst . catMaybes . fmap (sec "library") . snd) vlcs
common stanzas
length $ Map.toList $ Map.filter (/=[]) $ fmap (catMaybes . fmap (sec "common")) $ fmap snd vlcs
737
valid cabal files that have a library section:
vlibs <- Map.delete "acme-everything" <$> validLatestLibs
Map.size vlibs
15305
Total number of build dependencies in library stanzas and in common stanzas:
sum $ fmap snd $ Map.toList $ fmap (sum . fmap length) $ fmap (fmap (fieldValues "build-depends")) $ Map.filter (/=[]) $ fmap (fmap snd . catMaybes . fmap (sec "library") . snd) vlibs
sum $ fmap snd $ Map.toList $ fmap (sum . fmap length) $ fmap (fmap (fieldValues "build-depends")) $ Map.filter (/=[]) $ fmap (fmap snd . catMaybes . fmap (sec "common") . snd) vlibs
105233 > 3440
no dependencies
Map.size $ Map.filter (==[]) $ fmap (rawBuildDeps . snd) $ Map.delete "acme-everything" vlcs
1725
These are mostly parse errors from not properly parsing conditionals.
unique dependencies
Map.size $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ Map.delete "acme-everything" vlibs
raw build-deps example:
take 1 $ Map.toList $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ vlibs
2captcha | (aeson >=1.5.6.0 && <1.6,base >=4.7 && <5,bytestring >=0.10.12.0 && <0.11,clock >=0.8.2 && <0.9,exceptions >=0.10.4 && <0.11,http-client >=0.6.4.1 && <0.7,lens >=4.19.2 && <4.20,lens-aeson >=1.1.1 && <1.2,parsec >=3.1.14.0 && <3.2,text >=1.2.4.1 && <1.3,wreq >=0.5.3.3 && <0.6 ) |
lex check:
S.fold count $ S.concatMap S.fromList $ fmap C.unpack $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ vlibs
fromList [('\t',42),(' ',572471),('&',86160),('(',486),(')',486),('*',5969),(',',92554),('-',32183),('.',140854),('0',77745),('1',63104),('2',32240),('3',20269),('4',29110),('5',22316),('6',9901),('7',9590),('8',6678),('9',6284),('<',45145),('=',78780),('>',65175),('A',259),('B',234),('C',1113),('D',474),('E',75),('F',143),('G',334),('H',809),('I',103),('J',112),('K',15),('L',502),('M',399),('N',79),('O',280),('P',422),('Q',602),('R',240),('S',544),('T',524),('U',200),('V',75),('W',73),('X',92),('Y',24),('Z',18),('^',2855),('a',73691),('b',29688),('c',35787),('d',20249),('e',109010),('f',12413),('g',16508),('h',16656),('i',52533),('j',527),('k',7435),('l',34131),('m',26121),('n',54342),('o',47497),('p',28317),('q',2380),('r',67213),('s',78990),('t',90097),('u',14024),('v',6600),('w',3782),('x',10090),('y',17960),('z',1406),('{',38),('|',1936),('}',38)]
parsing the dependencies for just the names:
deps = fmap (fromRight undefined . parseDeps . mconcat . mconcat . rawBuildDeps . snd) vlibs
Map.size deps
sum $ Map.elems $ fmap length deps
14779 106678
take 3 $ Map.toList deps
[("2captcha",["aeson","base","bytestring","clock","exceptions","http-client","lens","lens-aeson","parsec","text","wreq"]),("3dmodels",["base","attoparsec","bytestring","linear","packer"]),("AAI",["base"])]
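Extracting just the names from a raw build-depends string can be sketched as below. This is a simplified illustration, not the flatparse-based `parseDeps` from the repo: it splits on commas outside braces (see error 1 in the checklist further down) and keeps the leading name token of each entry. `splitTopLevel` and `depNamesSketch` are hypothetical names, and conditionals or cpp are not handled.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Split on commas that are not nested inside braces.
splitTopLevel :: String -> [String]
splitTopLevel = go (0 :: Int) ""
  where
    go _ acc [] = [reverse acc]
    go depth acc (c : cs)
      | c == ',' && depth == 0 = reverse acc : go depth "" cs
      | c == '{' = go (depth + 1) (c : acc) cs
      | c == '}' = go (depth - 1) (c : acc) cs
      | otherwise = go depth (c : acc) cs

-- Keep the leading package-name token of each entry, dropping version constraints.
depNamesSketch :: String -> [String]
depNamesSketch =
  filter (not . null) . fmap (takeWhile isNameChar . dropWhile isSpace) . splitTopLevel
  where
    isNameChar c = isAlphaNum c || c == '-'
```

Tracking brace depth is what keeps an entry such as `base ^>= {4.7, 4.8}` from being split into two bogus dependencies.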
packages with the most dependencies:
take 20 $ List.sortOn (Down . snd) $ fmap (second length) $ Map.toList deps
yesod-platform | 132 |
hackport | 127 |
planet-mitchell | 109 |
raaz | 104 |
hevm | 84 |
sockets | 82 |
btc-lsp | 71 |
too-many-cells | 70 |
ghcide | 69 |
pandoc | 68 |
cachix | 67 |
sprinkles | 67 |
emanote | 64 |
freckle-app | 64 |
pantry-tmp | 64 |
taffybar | 63 |
neuron | 61 |
project-m36 | 61 |
NGLess | 60 |
stack | 59 |
dependees
fmap (take 20) $ fmap (List.sortOn (Down . snd)) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps
base | 14709 |
bytestring | 5399 |
text | 4969 |
containers | 4712 |
mtl | 3473 |
transformers | 3069 |
aeson | 2021 |
time | 1932 |
vector | 1797 |
directory | 1608 |
filepath | 1532 |
template-haskell | 1456 |
unordered-containers | 1388 |
deepseq | 1248 |
lens | 1175 |
binary | 932 |
hashable | 930 |
array | 889 |
exceptions | 855 |
process | 851 |
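The dependee tally is a frequency count over all dependency lists. A minimal sketch using containers instead of the streamly fold follows; `dependeeCounts` and `topDependees` are hypothetical names standing in for the `S.fold count` pipeline above.

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as Map
import Data.Ord (Down (..))

-- Count how many times each name appears across all dependency lists.
dependeeCounts :: Map.Map String [String] -> Map.Map String Int
dependeeCounts deps = Map.fromListWith (+) [(d, 1) | ds <- Map.elems deps, d <- ds]

-- The n most depended-upon names, largest count first.
topDependees :: Int -> Map.Map String [String] -> [(String, Int)]
topDependees n = take n . sortOn (Down . snd) . Map.toList . dependeeCounts
```

Note that a package listing the same dependency twice contributes two counts in this sketch.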
All the dependees found:
bdnames <- fmap (fmap fst) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps
length bdnames
> 5873
dependees not in the cabal index:
length $ filter (not . (`elem` (Map.keys vlibs))) bdnames
take 10 $ filter (not . (`elem` (Map.keys vlibs))) bdnames
233 > ["Codec-Compression-LZF","Consumer","DOM","DebugTraceHelpers","FieldTrip","FindBin","HJavaScript","HTTP-Simple","Imlib","LRU"]
excluding these:
depsExclude = filter (not . (`elem` (Map.keys vlibs))) bdnames
vdeps = fmap (filter (not . (`elem` depsExclude))) deps
Map.size vdeps
sum $ fmap snd $ Map.toList $ fmap length vdeps
> 14779 106238
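The exclusion step keeps only dependencies that are themselves known packages. A minimal sketch, assuming the known names are supplied as a set; `restrictToKnown` is a hypothetical helper that gives the same result as the `elem`-based filter above but uses set membership instead of a list scan.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Drop every dependency that is not in the set of known package names.
restrictToKnown :: Set.Set String -> Map.Map String [String] -> Map.Map String [String]
restrictToKnown known = fmap (filter (`Set.member` known))
```

In the session above the known set would correspond to `Map.keysSet vlibs`.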
- [X] error 1 - commas can be inside braces
- [ ] error 2 - plain old dodgy depends: acme-everything, cabal, deprecated packages
- [ ] error 3 - multiple build-depends in one stanza
- [ ] error 4 - cpp & conditionals
- [ ] error 5 - packages not on Hackage
cardano “This library requires quite a few exotic dependencies from the cardano realm which aren’t necessarily on hackage nor stackage. The dependencies are listed in stack.yaml, make sure to also include those for importing cardano-transactions.” ~ https://raw.githubusercontent.com/input-output-hk/cardano-haskell/d80bdbaaef560b8904a828197e3b94e667647749/snapshots/cardano-1.24.0.yaml
- [ ] error 6 - internal library (only available to the main cabal library stanza): yahoo-prices, vector-endian, symantic-parser
Empty lists are mostly due to bad conditional parsing
Map.size $ Map.filter null deps
243
An (algebraic) graph of dependencies:
depG = stars (Map.toList vdeps)
:t depG
ToGraph.preSet "folds" depG
ToGraph.postSet "folds" depG
https://hackage.haskell.org/package/proton
vertexCount depG
edgeCount depG
14779 105693
text
package dependency example
supers = upstreams "text" depG <> Set.singleton "text"
superG = induce (`elem` (toList supers)) depG
folds
supers = upstreams "folds" depG <> Set.singleton "folds"
superG = induce (`elem` (toList supers)) depG
mealy
package dependencies
supers = upstreams "mealy" depG <> Set.singleton "mealy"
superG = induce (`elem` (toList (Set.delete "base" supers))) depG
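The per-package subgraphs above combine a set of related vertices with an induced subgraph. The sketch below shows one way to read that, under a loudly stated assumption: it treats the related set as everything reachable by following dependency edges transitively, which may or may not match what the repo's `upstreams` computes. `DepGraph`, `reachable` and `induceSketch` are hypothetical names built on containers rather than algebraic-graphs.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical stand-in: each vertex maps to its direct dependencies.
type DepGraph = Map.Map String (Set.Set String)

-- All vertices reachable from the start vertex (the start itself included).
reachable :: String -> DepGraph -> Set.Set String
reachable start g = go Set.empty [start]
  where
    go seen [] = seen
    go seen (x : xs)
      | x `Set.member` seen = go seen xs
      | otherwise =
          go (Set.insert x seen) (Set.toList (Map.findWithDefault Set.empty x g) <> xs)

-- Keep only the vertices satisfying the predicate, and the edges between them.
induceSketch :: (String -> Bool) -> DepGraph -> DepGraph
induceSketch keep = fmap (Set.filter keep) . Map.filterWithKey (\k _ -> keep k)
```

Removing "base" from the vertex set before inducing, as in the mealy example, drops the one vertex that nearly every package points at and leaves a much more readable chart.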