This repository was archived by the owner on Dec 31, 2023. It is now read-only.

archive567/research-hackage

research-hackage

https://img.shields.io/hackage/v/research-hackage.svg https://github.com/tonyday567/research-hackage/workflows/haskell-ci/badge.svg

This repo is an experiment in:

  • reading Haskell package information from ~/.cabal/packages/hackage.haskell.org/01-index.tar
  • parsing cabal details using flatparse and streamly.
  • creating dependency charts and graphing them with graphviz via dotparse and chart-svg.
  • using org-mode for rapid Haskell development.

code

setup & development process

Notes on how to run ghci within org-mode are available at checklist: How I start Haskell.

:set prompt "> "
:set -XOverloadedStrings
:set -Wno-type-defaults
putStrLn "ok"
:r
:set -Wno-deprecations
import Research.Hackage
import qualified Streamly.Prelude as S
import qualified Streamly.Internal.Data.Unfold as Unfold
import Data.Function
import Streamly.External.Archive
import Data.Either
import qualified Data.ByteString.Char8 as C
import Data.Bifunctor
import qualified Data.Map.Strict as Map
import DotParse
import FlatParse.Basic qualified as FP
import Algebra.Graph
import qualified Algebra.Graph.ToGraph as ToGraph
import Data.Foldable
import Chart
import Data.String.Interpolate
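import Optics.Core
import Data.Maybe
import qualified Data.ByteString as B
import qualified Data.List as List
import qualified Data.Set as Set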
putStrLn "ok"

ToDo directory listing

import System.Directory
import Control.Monad
import Data.List
h <- getHomeDirectory & fmap (<> "/haskell")
ds <- getDirectoryContents h
ds' = filter (\x -> x /= "." && x /= "..") ds
ds'' <- filterM doesDirectoryExist $ (\x -> h <> "/" <> x) <$> ds'
fs <- mapM (\x -> (x,) <$> getDirectoryContents x) ds''
cabals = mconcat $ fmap ((\(d,fs)-> (\f -> d <> "/" <> f) <$> fs) . second (filter (isSuffixOf ".cabal"))) fs
cabals

Cabal file contents in the haskell directory:

haskellStream = S.unfold Unfold.fromListM ((\x -> (x,) <$> readFile x) <$> cabals)
:t haskellStream
s = fmap (first C.pack . second C.pack) haskellStream

package count

s & S.map (const 1) & S.sum

files

fields <- S.toList $ fmap (fromRight undefined . readFields . snd) s
fmap length fields
count_ $ mconcat $ fmap (fmap names) fields

ToDo questions

  • which cabal has no author?
  • common?
  • extra-source-files
  • stability
  • test-suite * 3

finding exclusions

S.toList $ fmap fst $ S.filter (not . any ((=="copyright") . names) . snd) $ fmap (second (fromRight undefined . readFields)) s

looking at single fields

S.toList $ fmap (second (mconcat . fmap (fieldValue "copyright"))) $ fmap (second (fromRight undefined . readFields)) s
:t count
yearList = [("numhask",2016),("mealy",2013),("box",2017),("formatn",2016),("prettychart",2023),("code",2023),("poker-fold",2020),("numhask-space",2016),("iqfeed",2014),("box-socket",2017),("numhask-array",2016),("euler",2023),("tonyday567",2020),("foo",2023),("web-rep",2015),("dotparse",2022),("perf",2018),("anal",2023),("research-hackage",2022),("chart-svg",2017),("ephemeral",2020)]
:t yearList :: [(String, Int)]
license a y = [i|

Copyright #{a} (c) #{y}

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.

    * Neither the name of #{a} nor the names of other
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|]

archive

development loop

Each development session largely starts by re-establishing state with the code below, which captures the milestones reached in parsing the cabal index data and their (eventual) reification in Research.Hackage.

vlibs <- Map.delete "acme-everything" <$> validLatestLibs
deps = fmap (fromRight undefined . parseDeps . mconcat . mconcat . rawBuildDeps . snd) vlibs
bdnames <- fmap (fmap fst) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps
depsExclude = filter (not . (`elem` (Map.keys vlibs))) bdnames
vdeps = Map.filter (not . null) $ fmap (filter (not . (`elem` depsExclude))) deps
depG = stars (Map.toList vdeps)
vertexCount depG
edgeCount depG
15135
109900

depG is an algebraic graph whose vertices are the latest cabal library package names and whose edges are their dependencies.

The archive is located at ~/.cabal/packages/hackage.haskell.org/01-index.tar and contained about 290k unique entries as of May 2022.

All pathNames exist, all file types are regular, and there are no utf8 issues with pathNames, so we use the header pathName to roll up the archive.

package count:

:t groupByPathName
:t Unfold.take 10000000 archive
:t groupByPathName (Unfold.take 10000000 archive)
packageStream & S.map (const 1) & S.sum
groupByPathName
  :: S.IsStream t =>
     Unfold IO a (Either Header ByteString)
     -> t IO (ByteString, ByteString)
Unfold.take 10000000 archive
  :: Unfold IO Void (Either Header ByteString)
groupByPathName (Unfold.take 10000000 archive)
  :: S.IsStream t => t IO (ByteString, ByteString)
303794

package names

weird name checks

S.toList $ S.filter ((/= Just (Just FileTypeRegular)) . fmap fileType) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)

S.toList $ S.filter (\x -> fmap pathName x /= fmap pathNameUtf8 x) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)

S.toList $ S.filter (\x -> fmap pathName x == Nothing) $ S.take 10 $ fmap fst $ groupByHeader (Unfold.take 10000000 archive)
[]
[]
[]

empty content

The first 10 package names

S.toList $ S.take 10 $ fmap fst packageStream
iconv/0.2/iconv.cabal
Crypto/3.0.3/Crypto.cabal
HDBC/1.0.1/HDBC.cabal
HDBC-odbc/1.0.1.0/HDBC-odbc.cabal
HDBC-postgresql/1.0.1.0/HDBC-postgresql.cabal
HDBC-sqlite3/1.0.1.0/HDBC-sqlite3.cabal
darcs-graph/0.1/darcs-graph.cabal
hask-home/2006.3.23/hask-home.cabal
hmp3/1.1/hmp3.cabal
lambdabot/4.0/lambdabot.cabal

Some entries have no content, but these are all preferred-versions entries rather than .cabal files.

S.length $ S.filter ((=="") . snd) $ packageStream
43

types of packages

Package path names end in either preferred-versions, .cabal or package.json; nothing else turns up:

S.length $ fmap fst $ S.filter (not . (\x -> B.isSuffixOf "preferred-versions" x || B.isSuffixOf ".cabal" x || B.isSuffixOf "package.json" x) . fst) $ packageStream
0

Reifying this as NameType:

:i NameType
S.fold count $ fmap (bimap toNameType (=="")) $ packageStream
type NameType :: *
data NameType
  = CabalName | PreferredVersions | PackageJson | BadlyNamed
  	-- Defined at src/Research/Hackage.hs:192:1
instance Eq NameType -- Defined at src/Research/Hackage.hs:192:95
instance Ord NameType -- Defined at src/Research/Hackage.hs:192:90
instance Show NameType -- Defined at src/Research/Hackage.hs:192:84
fromList [((CabalName,False),168535),((PreferredVersions,False),3115),((PreferredVersions,True),43),((PackageJson,False),132101)]
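The classification can be sketched with plain String suffix tests. This is a minimal stand-in, not the definition in Research.Hackage: the constructors are copied from the `:i NameType` output, but `toNameTypeSketch` is a hypothetical name, and the real `toNameType` works on ByteString path names.

```haskell
import Data.List (isSuffixOf)

-- Same constructors as shown by `:i NameType`.
data NameType = CabalName | PreferredVersions | PackageJson | BadlyNamed
  deriving (Eq, Ord, Show)

-- Hypothetical String-based stand-in for Research.Hackage.toNameType:
-- classify an index entry by the suffix of its path name.
toNameTypeSketch :: String -> NameType
toNameTypeSketch p
  | ".cabal" `isSuffixOf` p = CabalName
  | "preferred-versions" `isSuffixOf` p = PreferredVersions
  | "package.json" `isSuffixOf` p = PackageJson
  | otherwise = BadlyNamed
```

Anything that falls through to BadlyNamed would show up as an extra key in the count, which is a cheap way to confirm the three-way split is exhaustive.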

preferred-versions

S.toList $ S.take 10 $ S.filter (\(x,c) -> B.isSuffixOf "preferred-versions" x && c /= "") $ packages archive

package-json

package-json content is a security/signing feature you can read about in hackage-security.

S.length $ S.filter ((\x -> B.isSuffixOf "package.json" x) . fst) $ packageStream
132101
S.toList $ S.take 4 $ S.filter ((\x -> B.isSuffixOf "package.json" x) . fst) $ packageStream

.cabal

S.length $ S.filter ((\x -> B.isSuffixOf ".cabal" x) . fst) $ packageStream
168535
fmap fst <$> (S.toList $ S.take 10 $ S.filter ((\x -> B.isSuffixOf ".cabal" x) . fst) $ packageStream)

.cabal paths

So there are about 168k .cabal files to R&D …

malformed version number check

mErrs <- S.fold (collect fst snd) $ S.filter (isLeft . snd) $ fmap (second (parseVersion . C.pack)) $ fmap (fromRight undefined) $ S.filter isRight $ fmap (Research.Hackage.parsePath . fst) $ S.filter ((==CabalName) . toNameType . fst) packageStream

length mErrs

Total number of names

t1 <- S.fold (collect fst snd) $ fmap (second (fromRight undefined)) $ S.filter (isRight . snd) $ fmap (second (parseVersion . C.pack)) $ fmap (fromRight undefined) $ S.filter isRight $ fmap (Research.Hackage.parsePath . fst) $ S.filter ((==CabalName) . toNameType . fst) packageStream

length t1
17055

Average number of versions:

fromIntegral (sum $ Map.elems $ length <$> t1) / fromIntegral (length t1)
9.658348979468233

All of the latest cabal files have content:

latest = Map.map maximum t1
length $ Map.toList $ Map.filter (==[]) latest
0
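As a rough illustration of the path and version handling, here is a minimal sketch using only base and containers. All names (`splitOn`, `parseVersionSketch`, `parsePathSketch`, `latestVersions`) are hypothetical; the parsers in Research.Hackage are written with flatparse, and the `[Int]` version representation is an assumption made for this sketch.

```haskell
import Data.Char (isDigit)
import qualified Data.Map.Strict as Map

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, []) -> [a]
  (a, _ : rest) -> a : splitOn c rest

-- "0.10.3" -> Just [0,10,3]; Nothing if any component is not a number.
parseVersionSketch :: String -> Maybe [Int]
parseVersionSketch v = traverse component (splitOn '.' v)
  where
    component x
      | not (null x) && all isDigit x = Just (read x)
      | otherwise = Nothing

-- "name/version/name.cabal" -> Just (name, version)
parsePathSketch :: String -> Maybe (String, [Int])
parsePathSketch p = case splitOn '/' p of
  [name, ver, _file] -> (,) name <$> parseVersionSketch ver
  _ -> Nothing

-- Latest version per package. Lists of Int compare lexicographically,
-- so [0,10] > [0,9], whereas the strings "0.10" < "0.9".
latestVersions :: [String] -> Map.Map String [Int]
latestVersions paths =
  Map.fromListWith max [nv | Just nv <- map parsePathSketch paths]
```

Paths that do not parse are silently dropped here; the malformed version check in the text collects them instead.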

latest versions

lcf <- latestCabalFiles
length $ Map.toList lcf
16511

field parsing errors

field errors

fmap (\x -> C.pack (fst x) <> "-" <> toVer (fst (snd x))) $ Map.toList $ Map.filter (isLeft . readFields . snd) lcf
DSTM-0.1.2
control-monad-exception-mtl-0.10.3
ds-kanren-0.2.0.1
metric-0.2.0
phasechange-0.1
smartword-0.0.0.5

busting up cabal files into fields

valid cabal files with ok parsing of all fields:

vlcs <- validLatestCabals
:t vlcs
length vlcs
17049

field counts across all files

import Data.Ord
fmap (take 10 . List.sortOn (Down . snd) . Map.toList) $ S.fold count $ S.fromList $ fmap names $ mconcat $ fmap snd $ Map.toList $ fmap snd vlcs

authors

fmap (take 10 . List.sortOn (Down . snd) . Map.toList) $ S.fold count $ S.fromList $ mconcat $ fmap authors $ fmap snd $ Map.toList $ fmap snd vlcs

libraries

not libraries

Map.size $ Map.filter ((0==) . length) $ fmap (catMaybes . fmap (sec "library") . snd) vlcs
1743

multiple libraries

Map.size $ Map.filter ((>1) . length) $ fmap (catMaybes . fmap (sec "library") . snd) vlcs
79

Multiple libraries are usually “internal” libraries that can only be used inside the cabal file.

take 10 $ Map.toList $ Map.filter (\x -> x/=[[]] && x/=[] && listToMaybe x /= Just []) $ fmap (fmap (fmap secName) . fmap fst . catMaybes . fmap (sec "library") . snd) vlcs

common stanzas

length $ Map.toList $ Map.filter (/=[]) $ fmap (catMaybes . fmap (sec "common")) $ fmap snd vlcs
737

valid cabal files that have a library section:

vlibs <- Map.delete "acme-everything" <$> validLatestLibs
Map.size vlibs
15305

dependencies

Total number of build dependencies in library stanzas and in common stanzas:

sum $ fmap snd $ Map.toList $ fmap (sum . fmap length) $ fmap (fmap (fieldValues "build-depends")) $ Map.filter (/=[]) $ fmap (fmap snd . catMaybes . fmap (sec "library") . snd) vlibs

sum $ fmap snd $ Map.toList $ fmap (sum . fmap length) $ fmap (fmap (fieldValues "build-depends")) $ Map.filter (/=[]) $ fmap (fmap snd . catMaybes . fmap (sec "common") . snd) vlibs
105233
3440

no dependencies

Map.size $ Map.filter (==[]) $ fmap (rawBuildDeps . snd) $ Map.delete "acme-everything" vlcs
1725

These are mostly parse errors from not properly parsing conditionals.

unique dependencies

Map.size $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ Map.delete "acme-everything" vlibs

raw build-deps example:

take 1 $ Map.toList $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ vlibs
2captcha  (aeson >=1.5.6.0 && <1.6,base >=4.7 && <5,bytestring >=0.10.12.0 && <0.11,clock >=0.8.2 && <0.9,exceptions >=0.10.4 && <0.11,http-client >=0.6.4.1 && <0.7,lens >=4.19.2 && <4.20,lens-aeson >=1.1.1 && <1.2,parsec >=3.1.14.0 && <3.2,text >=1.2.4.1 && <1.3,wreq >=0.5.3.3 && <0.6 )

lex check:

S.fold count $ S.concatMap S.fromList $ fmap C.unpack $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList $ fmap (fmap mconcat) $ Map.filter (/=[]) $ fmap (rawBuildDeps . snd) $ vlibs
fromList [('\t',42),(' ',572471),('&',86160),('(',486),(')',486),('*',5969),(',',92554),('-',32183),('.',140854),('0',77745),('1',63104),('2',32240),('3',20269),('4',29110),('5',22316),('6',9901),('7',9590),('8',6678),('9',6284),('<',45145),('=',78780),('>',65175),('A',259),('B',234),('C',1113),('D',474),('E',75),('F',143),('G',334),('H',809),('I',103),('J',112),('K',15),('L',502),('M',399),('N',79),('O',280),('P',422),('Q',602),('R',240),('S',544),('T',524),('U',200),('V',75),('W',73),('X',92),('Y',24),('Z',18),('^',2855),('a',73691),('b',29688),('c',35787),('d',20249),('e',109010),('f',12413),('g',16508),('h',16656),('i',52533),('j',527),('k',7435),('l',34131),('m',26121),('n',54342),('o',47497),('p',28317),('q',2380),('r',67213),('s',78990),('t',90097),('u',14024),('v',6600),('w',3782),('x',10090),('y',17960),('z',1406),('{',38),('|',1936),('}',38)]

deps

parsing the dependencies for just the names:

deps = fmap (fromRight undefined . parseDeps . mconcat . mconcat . rawBuildDeps . snd) vlibs
Map.size deps
sum $ Map.elems $ fmap length deps

14779
106678
take 3 $ Map.toList deps
[("2captcha",["aeson","base","bytestring","clock","exceptions","http-client","lens","lens-aeson","parsec","text","wreq"]),("3dmodels",["base","attoparsec","bytestring","linear","packer"]),("AAI",["base"])]
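A minimal sketch of extracting names from a raw build-depends string, assuming only base. It is not `parseDeps` from Research.Hackage (which is built on flatparse); `splitTopLevel` and `depNames` are hypothetical names. The sketch handles commas nested inside braces but makes no attempt at conditionals or cpp.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- Split on commas that are not nested inside { }.
splitTopLevel :: String -> [String]
splitTopLevel = go (0 :: Int) ""
  where
    go _ acc [] = [reverse acc]
    go d acc (c : cs)
      | c == ',' && d == 0 = reverse acc : go d "" cs
      | c == '{' = go (d + 1) (c : acc) cs
      | c == '}' = go (max 0 (d - 1)) (c : acc) cs
      | otherwise = go d (c : acc) cs

-- Keep only the package name at the start of each dependency clause,
-- dropping version constraints and empty clauses.
depNames :: String -> [String]
depNames raw =
  [ name
  | clause <- splitTopLevel raw
  , let name = takeWhile isNameChar (dropWhile isSpace clause)
  , not (null name)
  ]
  where
    isNameChar c = isAlphaNum c || c == '-'
```

For example, `depNames "aeson >=1.5.6.0 && <1.6,base >=4.7 && <5, text ^>={1.2, 2.0} "` evaluates to `["aeson","base","text"]`: the comma inside the braces does not start a new clause.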

packages with the most dependencies:

take 20 $ List.sortOn (Down . snd) $ fmap (second length) $ Map.toList deps
yesod-platform   132
hackport         127
planet-mitchell  109
raaz             104
hevm              84
sockets           82
btc-lsp           71
too-many-cells    70
ghcide            69
pandoc            68
cachix            67
sprinkles         67
emanote           64
freckle-app       64
pantry-tmp        64
taffybar          63
neuron            61
project-m36       61
NGLess            60
stack             59

dependees

fmap (take 20) $ fmap (List.sortOn (Down . snd)) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps
base                  14709
bytestring             5399
text                   4969
containers             4712
mtl                    3473
transformers           3069
aeson                  2021
time                   1932
vector                 1797
directory              1608
filepath               1532
template-haskell       1456
unordered-containers   1388
deepseq                1248
lens                   1175
binary                  932
hashable                930
array                   889
exceptions              855
process                 851
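The same tally can be sketched without streamly. This is an illustrative equivalent, not the `count` fold from Research.Hackage; `dependeeCounts` and `topDependees` are hypothetical names, and the input is assumed to have the shape of `deps` (package name to list of dependency names).

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as Map
import Data.Ord (Down (..))

-- Count how many times each name appears across all dependency lists.
dependeeCounts :: Map.Map String [String] -> Map.Map String Int
dependeeCounts deps =
  Map.fromListWith (+) [(d, 1) | ds <- Map.elems deps, d <- ds]

-- The n most depended-upon names, largest count first.
topDependees :: Int -> Map.Map String [String] -> [(String, Int)]
topDependees n = take n . sortOn (Down . snd) . Map.toList . dependeeCounts
```

A name that appears twice in one package's list is counted twice, which matches a tally over the concatenated lists.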

All the dependees found:

bdnames <- fmap (fmap fst) $ fmap Map.toList $ S.fold count $ S.concatMap S.fromList $ S.fromList $ fmap snd $ Map.toList deps

length bdnames
5873

dependency name errors

dependees not in the cabal index:

length $ filter (not . (`elem` (Map.keys vlibs))) bdnames

take 10 $ filter (not . (`elem` (Map.keys vlibs))) bdnames
233
["Codec-Compression-LZF","Consumer","DOM","DebugTraceHelpers","FieldTrip","FindBin","HJavaScript","HTTP-Simple","Imlib","LRU"]

excluding these:

depsExclude = filter (not . (`elem` (Map.keys vlibs))) bdnames
vdeps = fmap (filter (not . (`elem` depsExclude))) deps
Map.size vdeps
sum $ fmap snd $ Map.toList $ fmap length vdeps

14779
106238

ToDo potential error sources

  • [X] error 1 - commas can be inside braces
  • [ ] error 2 - plain old dodgy depends acme-everything, cabal, deprecated packages
  • [ ] error 3 - multiple build-depends in one stanza
  • [ ] error 4 - cpp & conditionals
  • [ ] error 5 - packages not on Hackage

    cardano “This library requires quite a few exotic dependencies from the cardano realm which aren’t necessarily on hackage nor stackage. The dependencies are listed in stack.yaml, make sure to also include those for importing cardano-transactions.” ~ https://raw.githubusercontent.com/input-output-hk/cardano-haskell/d80bdbaaef560b8904a828197e3b94e667647749/snapshots/cardano-1.24.0.yaml

  • [ ] error 6 - internal library (only available to the main cabal library stanza) yahoo-prices, vector-endian, symantic-parser

Empty dependency lists are mostly due to bad conditional parsing:

Map.size $ Map.filter null deps
243

algebraic-graphs

An (algebraic) graph of dependencies:

depG = stars (Map.toList vdeps)
:t depG
ToGraph.preSet "folds" depG
ToGraph.postSet "folds" depG

https://hackage.haskell.org/package/proton

vertexCount depG
edgeCount depG
14779
105693
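To make the two counts concrete, here is a sketch of the vertex and edge sets that a graph built with `stars` contains, written with containers only. `vertexSet` and `edgeSet` are hypothetical helpers, not part of algebraic-graphs or Research.Hackage.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Vertices: every package plus everything it depends on.
vertexSet :: Map.Map String [String] -> Set.Set String
vertexSet deps =
  Set.fromList (Map.keys deps <> concat (Map.elems deps))

-- Edges: one (package, dependency) pair per distinct dependency.
edgeSet :: Map.Map String [String] -> Set.Set (String, String)
edgeSet deps =
  Set.fromList [(p, d) | (p, ds) <- Map.toList deps, d <- ds]
```

Because edges form a set, a dependency listed more than once by the same package contributes a single edge. That is one plausible reason the edge count comes out lower than the summed lengths of the dependency lists.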

graphics

text package dependency example

supers = upstreams "text" depG <> Set.singleton "text"
superG = induce (`elem` (toList supers)) depG

other/textdeps.svg

folds

supers = upstreams "folds" depG <> Set.singleton "folds"
superG = induce (`elem` (toList supers)) depG

other/foldsdeps.svg

mealy package dependencies

supers = upstreams "mealy" depG <> Set.singleton "mealy"
superG = induce (`elem` (toList (Set.delete "base" supers))) depG

other/mealy.svg
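The README does not show how `upstreams` is defined. Assuming it collects the transitive dependencies of a package, the idea can be sketched over a plain dependency map; `transitiveDeps` and `induceDeps` are hypothetical names and only mimic the `upstreams` and `induce` steps.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- All packages reachable from `root` by following dependency edges.
-- `root` itself is included only if it sits on a dependency cycle.
transitiveDeps :: Map.Map String [String] -> String -> Set.Set String
transitiveDeps deps root = go Set.empty (direct root)
  where
    direct p = Map.findWithDefault [] p deps
    go seen [] = seen
    go seen (p : ps)
      | p `Set.member` seen = go seen ps
      | otherwise = go (Set.insert p seen) (direct p <> ps)

-- Restrict the dependency map to a chosen set of packages.
induceDeps :: Set.Set String -> Map.Map String [String] -> Map.Map String [String]
induceDeps keep =
  Map.map (filter (`Set.member` keep))
    . Map.filterWithKey (\k _ -> k `Set.member` keep)
```

Adding the root back with `Set.insert` before inducing corresponds to the `<> Set.singleton "text"` step, and deleting "base" from the set first keeps the mealy chart readable, since nearly every package depends on base.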

reference

packages

other hackage parsing examples
