Hi all,
You might have read the short take in the X.org board meeting minutes already, here's the long version.
The good news: gitlab.fd.o has become very popular with our communities, and is used extensively. This especially includes all the CI integration. Modern development process and tooling, yay!
The bad news: The cost of this growth has also been tremendous, and it's breaking our bank account. With reasonable estimates for continued growth we're expecting hosting expenses totalling 75k USD this year, and 90k USD next year. With the current sponsors we've set up we can't sustain that. We estimate that hosting expenses for gitlab.fd.o without any of the CI features enabled would total 30k USD, which is within X.org's ability to support through various sponsorships, mostly through XDC.
Note that X.org no longer sponsors any CI runners itself; we've stopped that. The huge additional expenses are all just in storing and serving build artifacts and images to outside CI runners sponsored by various companies. A related topic is that with the growth in fd.o it's becoming infeasible to maintain it all on volunteer admin time. X.org is therefore also looking for admin sponsorship, at least medium term.
Assuming that we want cash flow reserves for one year of gitlab.fd.o (without CI support) and a trimmed XDC and assuming no sponsor payment meanwhile, we'd have to cut CI services somewhere between May and June this year. The board is of course working on acquiring sponsors, but filling a shortfall of this magnitude is neither easy nor quick work, and we therefore decided to give an early warning as soon as possible. Any help in finding sponsors for fd.o is very much appreciated.
Thanks, Daniel
On Thu, Feb 27, 2020 at 1:27 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Some clarification I got from Daniel in a private conversation, since I was confused about what the money was paying for exactly:
We're paying 75K USD for the bandwidth to transfer data from the GitLab cloud instance. i.e., for viewing the https site, for cloning/updating git repos, and for downloading CI artifacts/images to the testing machines (AFAIU).
I was not aware that we were being charged for anything wrt GitLab hosting yet (and neither was anyone on my team at Intel that I've asked). This... kind of needs to be communicated.
A consistent concern put forth when we were discussing switching to GitLab and building CI was... how do we pay for it. It felt like that concern was always handwaved away. I heard many times that if we needed more runners that we could just ask Google to spin up a few more. If we needed testing machines they'd be donated. No one mentioned that all the while we were paying for bandwidth... Perhaps people building the CI would make different decisions about its structure if they knew it was going to wipe out the bank account.
What percentage of the bandwidth is consumed by transferring CI images, etc.? Wouldn't 75K USD be enough to buy all the testing machines we need and host them within Google or wherever, so we don't need to pay for huge amounts of bandwidth?
I understand that self-hosting was attractive so that we didn't find ourselves on the SourceForge-equivalent hosting platform of 2022, but is that risk real enough to justify spending 75K+ per year? If we were hosted on gitlab.com or github.com, we wouldn't be paying for transferring CI images to CI test machines, etc, would we?
So what do we do now? Have we painted ourselves into a corner?
Hi Matt,
On Thu, 27 Feb 2020 at 23:45, Matt Turner mattst88@gmail.com wrote:
I believe that in January, we had $2082 of network cost (almost entirely egress; ingress is basically free) and $1750 of cloud-storage cost (almost all of which was download). That's based on 16TB of cloud-storage (CI artifacts, container images, file uploads, Git LFS) egress and 17.9TB of other egress (the web service itself, repo activity). Projecting that out gives us roughly $45k of network activity alone, so it looks like this figure is based on a projected increase of ~50%.
The actual compute capacity is closer to $1150/month.
The original answer is that GitLab themselves offered to sponsor enough credit on Google Cloud to get us started. They used GCP themselves so they could assist us (me) in getting bootstrapped, which was invaluable. After that, Google's open-source program office offered to sponsor us for $30k/year; I believe that was last April. Since then the service usage has increased roughly by a factor of 10, so our 12-month sponsorship is no longer enough to cover 12 months.
Unless the Google Cloud Platform starts offering DragonBoards, it wouldn't reduce our bandwidth usage as the corporate network is treated separately for egress.
Cheers, Daniel
On Friday 2020-02-28 08:59, Daniel Stone wrote:
I had come to a similar conclusion a few years back: it is not very economical to run ephemeral buildroots (or anything like them) between two (or more) "significant locations" of which one end is located in a large cloud datacenter like EC2/AWS/etc.
For such use cases, my peers and I have used (other) offerings that include 50 TB of free network traffic per month, and yes, that may have entailed doing more admin work than elsewhere - but an admin appreciates $2000 a lot more than a corporation does, too.
Hi Jan,
On Fri, 28 Feb 2020 at 10:09, Jan Engelhardt jengelh@inai.de wrote:
Yes, absolutely. For context, our storage & network costs have increased >10x in the past 12 months (~$320 Jan 2019), >3x in the past 6 months (~$1350 July 2019), and ~2x in the past 3 months (~$2000 Oct 2019).
I do now (personally) think that it's crossed the point at which it would be worthwhile paying an admin to solve the problems that cloud services currently solve for us - which wasn't true before. Such an admin could also deal with things like our SMTP delivery failure rate, which has spiked to over 50% in the past year (see previous email), and demand for new services such as Discourse, which would enable user support without either a) users having to subscribe to a mailing list, or b) bug trackers being cluttered up with user requests and other non-bugs, etc.
Cheers, Daniel
On Fri, Feb 28, 2020 at 12:00 AM Daniel Stone daniel@fooishbar.org wrote:
Could we have the full GCP bill posted?
On Thu, 27 Feb 2020 22:27:04 +0100 Daniel Vetter daniel.vetter@ffwll.ch said:
Might I suggest that, given the kind of expenses detailed here, literally buying 1-4 reasonably specced boxes and hosting them at OSUOSL would be incredibly cheaper? (We (enlightenment.org) have been doing so for years on a single box.) We farm out CI to Travis via GitHub mirrors as it's not considered an essential core service (unlike mailing lists, git and Phabricator, which we still run - we can live without CI for a while and find other ways).
The cost is the odd HDD replacement every few years and maybe every 10y or so a new box. That's a massively lower cost than you are quoting below.
OSUOSL provide bandwidth, power, rack space etc. for free. They have been fantastic IMHO and the whole "no fat bills" is awesome and you get a full system to set up any way you like. You just bring the box. That should drop cost through the floor. It will require some setup and admin though.
On 02/27/2020 01:27 PM, Daniel Vetter wrote:
Have you looked into applying for free credits from Amazon:
https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-...
-Tom
On 02/27/2020 05:00 PM, Tom Stellard wrote:
Fastly also provides free CDN services to some Open Source projects:
https://www.fastly.com/open-source?utm_medium=social&utm_source=t.co&...
It might also be worth looking into this if the main costs are coming from data transfers.
-Tom
On Fri, 28 Feb 2020 at 07:27, Daniel Vetter daniel.vetter@ffwll.ch wrote:
a) Ouch.
b) we probably need to take a large step back here.
Look at this from a sponsor's POV: why would I give X.org/fd.o sponsorship money that they are just handing straight to Google to pay for hosting credits? Google profits in some minor way from us buying these hosting credits, and I assume we aren't getting any sort of discount here. Having Google sponsor the credits costs Google substantially less than having any other company give us money to do it.
If our current CI architecture is going to burn this amount of money a year, and we hadn't worked this out in advance of deploying it, then I suggest the system be taken offline until we work out what a sustainable system would look like within the budget we have - whether that means never transferring containers and build artifacts out of the Google network, just having local runner/build combos, etc.
Dave.
On Fri, Feb 28, 2020 at 4:38 AM Dave Airlie airlied@gmail.com wrote:
Google sponsored $30k in hosting credits last year; these simply ran out _much_ faster than anyone planned for. So this is by far not a free thing for them. Plus there are other companies sponsoring CI runners and whatnot in equally substantial amounts, plus the biggest thing, sponsored admin time (more or less officially). So there's a _lot_ of room for companies like Red Hat to sponsor without throwing any money into Google's revenue stream.
Or it doesn't happen, and then yeah the decision has already been made to shutter the CI services. So this is also a question of whether we (as a community and all the companies benefitting from the work done) really want this, or maybe not quite. -Daniel
On Fri, 28 Feb 2020 at 03:38, Dave Airlie airlied@gmail.com wrote:
The last I looked, Google GCP / Amazon AWS / Azure were all pretty comparable in terms of what you get and what you pay for them. Obviously providers like Packet and Digital Ocean who offer bare-metal services are cheaper, but then you need to find someone who is going to properly administer the various machines, install decent monitoring, make sure that more storage is provisioned when we need more storage (which is basically all the time), make sure that the hardware is maintained in decent shape (pretty sure one of the fd.o machines has had a drive in imminent-failure state for the last few months), etc.
Given the size of our service, that's a much better plan (IMO) than relying on someone who a) isn't an admin by trade, b) has a million other things to do, and c) hasn't wanted to do it for the past several years. But as long as that's the resources we have, then we're paying the cloud tradeoff, where we pay more money in exchange for fewer problems.
Yes, we could federate everything back out so everyone runs their own builds and executes those. Tinderbox did something really similar to that IIRC; not sure if Buildbot does as well. Probably rules out pre-merge testing, mind.
The reason we hadn't worked everything out in advance of deploying is that Mesa has had 3993 MRs in the little over a year since moving, with a similar number in GStreamer, just taking the two biggest users. At the start it was 'maybe let's use MRs if you want to but make sure everything still goes through the list', and now it's something different. Similarly, the CI architecture hasn't been 'designed' so much as that people want to run dEQP and Piglit on their hardware pre-merge in an open fashion that's actually accessible to people, and have just done it.
Again, if you want everything to be centrally designed/approved/monitored/controlled, that's a fine enough idea, and I'd be happy to support whoever it was who was doing that for all of fd.o.
Cheers, Daniel
On Fri, 28 Feb 2020 at 18:18, Daniel Stone daniel@fooishbar.org wrote:
Admin for gitlab and CI is a full-time role anyway. The system is definitely not self-sustaining without time still being put in by you and anholt. If we have $75k to burn on credits, and it was instead diverted to pay an admin to run the real hw + gitlab/CI, would that not be a better use of the money? I don't know if we can afford $75k for an admin, but suddenly we can afford it for gitlab credits?
Why? Does gitlab not support the model? Having builds done in parallel on runners closer to the test runners seems like it should be a thing. I guess artifact transfer would then cost less as a result.
I don't think we have any choice but to have someone centrally controlling it. You can't have a system in place that lets CI users burn large sums of money without authorisation, and that is what we have now.
Dave.
On Fri, 28 Feb 2020 at 08:48, Dave Airlie airlied@gmail.com wrote:
s/gitlab credits/GCP credits/
I took a quick look at HPE, which we previously used for bare metal, and it looks like we'd be spending $25-50k (depending on how much storage you want to provision, how much room you want to leave to provision more storage later, and how much you care about backups) to run a similar level of service, so that'd put a bit of a dent in your year-one budget.
The bare-metal hosting providers also add up to more expensive than you might think, again especially if you want either redundancy or just backups.
It does support the model but if every single build executor is also compiling Mesa from scratch locally, how long do you think that's going to take?
OK, not sure who it is who's going to be approving every update to every .gitlab-ci.yml in the repository, or maybe we just have zero shared runners and anyone who wants to do builds can BYO.
On Fri, Feb 28, 2020 at 12:48 AM Dave Airlie airlied@gmail.com wrote:
When I think about the time I've spent at Google, in less than a year, on trying to keep the lights on for CI and optimize our infrastructure in the current cloud environment, that's worth more than the entire yearly budget you're talking about here. Saying "let's just pay for people to do more work instead of paying for full-service cloud" is not a cost optimization.
Let's do some napkin math. The biggest artifact cost we have in Mesa is probably meson-arm64/meson-arm (60MB zipped from meson-arm64, downloaded by 4 freedreno and roughly 6 LAVA runners, about 100 pipelines/day; that makes ~1.8TB/month, or $180 or so). We could build local storage next to the LAVA dispatcher so that the artifacts didn't have to contain the rootfs that came from the container (~2/3 of the contents of the zip file), but that's another service to build and maintain. Building the drivers once locally and storing them would save downloading the other ~1/3 of the zip file, but that requires a big enough system to do the builds in time.
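Spelling that estimate out (the per-GB price here is an assumed ballpark for cloud egress, not a quoted rate):
  60 MB/artifact x ~10 downloads/pipeline x ~100 pipelines/day x 30 days ≈ 1.8 TB/month
  1.8 TB/month x ~$0.10/GB ≈ $180/month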
I'm planning on doing a local filestore for Google's LAVA lab, since I need to be able to move our xml files off of the LAVA DUTs to get the xml results we've become accustomed to, but this would not bubble up to being a priority for my time if I wasn't doing it anyway. Even if it took me a single day to set all this up (I estimate a couple of weeks), that would cost my employer a lot more than sponsoring the costs of the inefficiencies that have accumulated in the system.
On Sat, 29 Feb 2020 at 05:34, Eric Anholt eric@anholt.net wrote:
I'm not trying to knock the engineering work the CI contributors have done at all, but I've never seen a real discussion about costs until now. Engineers aren't accountants.
The thing we seem to be missing here is fiscal responsibility. I know this email is us being fiscally responsible, but it's kinda after the fact.
I cannot commit my employer to spending a large amount of money (> 0 actually) without a long and lengthy process with checks and bounds. Can you?
The X.org board has budgets and procedures as well. I as a developer of Mesa should not be able to commit the X.org foundation to spending large amounts of money without checks and bounds.
The CI infrastructure lacks any checks and bounds. There is no link between editing .gitlab-ci/* and cashflow. There is no link when I add support for a new feature to llvmpipe that blows out test times (granted, that won't affect the CI budget - it's just an example).
The fact that clouds run on credit means it's not possible to budget 30K and say that when it runs out, it runs out; you end up getting bills for ever-increasing amounts that you have to cover, with nobody "responsible" for ever reducing those bills. "Higher, Faster, Further, baby" comes to mind.
Has X.org actually allocated the remaining cash in its bank account to this task previously? Were there plans for this money that can't be executed now because we have to pay the cloud fees? If we continue to May and the X.org bank account hits 0, can XDC happen?
Budgeting and cloud is hard, and the feedback loops are messy. In the old system the feedback loop was simple: if we didn't have admin time or money for servers, we didn't get the features. Cloud allows us to get the features and enjoy them, and at some point in the future the bill gets paid by someone else. Credit-card lifestyles all the way.
Like maybe we can grow up here and find sponsors to cover all of this, but it still feels a bit backwards from a fiscal pov.
Again I'm not knocking the work people have done at all, CI is very valuable to the projects involved, but that doesn't absolve us from costs.
Dave.
On Fri, Feb 28, 2020 at 9:31 PM Dave Airlie airlied@gmail.com wrote:
We're working to get the logging in place to know which projects exactly burn down the money, so that we can take specific actions if needed. So pretty soon you won't be able to just burn down endless amounts of cash with a few gitlab-ci commits. Or at least not for long, until we catch you and you either fix things up or CI is gone for your project.
We're working on this, since it's the board's responsibility to be on top of stuff. It's simply that we didn't expect massive growth at this scale and this quickly, so we're a bit behind on the controlling aspect.
Also, I guess it wasn't clear, but the board decision yesterday was the stop-loss order where we cut the cord (for CI at least). So yeah, the short-term budget is firmly in place now.
There's numbers elsewhere in this thread, but if you'd read the original announcement it states that the stop loss would still guarantee that we can pay for everything for at least one year. We're not going to get even close to 0 in the bank account.
So yeah XDC happens, and it'll also still happen next year. Also fd.o servers will keep running. The only thing we might need to switch off is the CI support.
Uh ... where exactly do you get the credit card approach from? SPI is legally not allowed to extend us credit (we're not a legal org anymore), so if we hit 0 the lights go out real quick. No credit for us. If SPI isn't on top of that it's their loss (but they're getting pretty good at tracking stuff with the contractor they now have and all that).
Which is not going to happen btw, if you've read the announcement mail and all that.
Cheers, Daniel
Hi All,
I know there's been a lot of discussion already, but I wanted to respond to Daniel's original post.
I joined GitLab earlier this month as their new Open Source Program Manager [1] and wanted to introduce myself here since I’ll be involved from the GitLab side as we work together to problem-solve the financial situation here. My role at GitLab is to help make it easier for Open Source organizations to migrate (by helping to smooth out some of the current pain points), and to help advocate internally for changes to the product and our workflows to make GitLab better for Open Source orgs. We want to make sure that our Open Source community feels supported beyond just migration. As such, I’ll be running the GitLab Open Source Program [2].
My background is that I’m the former President and Chairperson of the GNOME Foundation, which is one of the earliest Free Software projects to migrate to GitLab. GNOME initially faced some limitations with the CI runner costs too, but thanks to generous support from donors, it no longer experiences those issues. I know there's already a working relationship between our communities, but it could be good to examine what GNOME and KDE have done and see if there's anything we can apply here. We've reached out to Daniel Stone, our main contact for the freedesktop.org migration, and he has gotten us in touch with Daniel V. and the X.Org Foundation Board to learn more about what's already been done and what we can do next.
Please bear with me as I continue to get ramped up in my new job, but I’d like to offer as much support as possible with this issue. We’ll be exploring ways for GitLab to help make sure there isn’t a gap in coverage during the time that freedesktop looks for sponsors. I know that on GitLab’s side, supporting our Open Source user community is a priority.
Best,
Nuritzi
[1] https://about.gitlab.com/company/team/#nuritzi [2] https://about.gitlab.com/handbook/marketing/community-relations/opensource-p...
On Fri, Feb 28, 2020 at 1:22 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
The problem of data transfer costs is not new in cloud environments. At work we usually just opt to pay for it, since dev time is scarcer. For private projects, though, I opt for aggressive (remote) caching. So you can set up a global cache in Google Cloud Storage and more local caches wherever your executors are (this reduces egress as much as possible). This setup works great with Bazel and Pants, among others. Note that these systems are pretty hermetic, in contrast to Meson. IIRC Eric by now works at Google. They internally use Blaze, which AFAIK does aggressive caching too. So maybe using one of these systems would be a way of not having to sacrifice any of the current functionality. The downside is that you lose a bit of dev productivity, since you cannot eyeball your build definitions anymore.
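As a rough sketch of how that kind of remote cache could be wired into a GitLab CI job (the image, bucket name and build target are placeholders I made up for illustration, and the runner would need credentials for the bucket):
  build:
    image: debian:buster    # placeholder; any image with Bazel installed
    script:
      # Share build outputs through a remote cache in Google Cloud Storage,
      # so repeated builds mostly hit the cache instead of rebuilding and
      # re-transferring everything.
      - bazel build //...
          --remote_cache=https://storage.googleapis.com/example-ci-cache
          --google_default_credentials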
my 2c
On Fri, 28 Feb 2020 at 20:34, Eric Anholt eric@anholt.net wrote:
On Saturday, 4 April 2020 at 15:55 +0200, Andreas Bergmeier wrote:
Did you mean Bazel [0]? I'm not sure I follow your reasoning - why is Meson vs Bazel related to this issue?
Nicolas
On Fri, 2020-02-28 at 13:37 +1000, Dave Airlie wrote:
I kinda agree, but maybe the step doesn't have to be *too* large?
I wonder if we could solve this by restructuring the project a bit. I'm talking purely from a Mesa point of view here, so it might not solve the full problem, but:
1. It feels silly that we need to test changes to e.g. the i965 driver on DragonBoards. We only have a big "do not run CI at all" escape-hatch.
2. A lot of us are working for a company that can probably pay for its own needs in terms of CI. Perhaps moving some costs "up front" to the company that needs it can help secure the future of CI for those who can't do this.
3. I think we need a much more detailed break-down of the cost to make educated changes. For instance, how expensive are Docker image uploads/downloads (e.g. intermediate artifacts) compared to build logs and final test results? What kinds of artifacts are involved?
One suggestion would be to do something more similar to what the kernel does, and separate into different repos for different subsystems. This could allow us to have separate testing-pipelines for these repos, which would mean that for instance a change to RADV didn't trigger a full Panfrost test-run.
This would probably require us to accept using a more branch-heavy work-flow. I don't personally think that would be a bad thing.
But this is all kinda based on an assumption that running hardware testing is the expensive part. I think that's quite possibly the case, but some more numbers would be helpful. I mean, it might turn out that just throwing up a Docker cache inside the organizations that host runners might be sufficient for all I know...
We could also do stuff like reducing the amount of tests we run on each commit, and punt some testing to a weekend test-run or something like that. We don't *need* to know about every problem up front, just the stuff that's about to be released, really. The other stuff is just nice to have. If it's too expensive, I would say drop it.
I would really hope that we can consider approaches like this before we throw out the baby with the bathwater...
On 28/02/2020 11:28, Erik Faye-Lund wrote:
Yeah, changes on vulkan drivers or backend compilers should be fairly sandboxed.
We also have tools that only work for intel stuff, that should never trigger anything on other people's HW.
Could something be worked out using the tags?
-Lionel
On Fri, 2020-02-28 at 11:40 +0200, Lionel Landwerlin wrote:
I think so! We have the pre-defined environment variable CI_MERGE_REQUEST_LABELS, and we can do variable conditions:
https://docs.gitlab.com/ee/ci/yaml/#onlyvariablesexceptvariables
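Something along these lines, for example (a sketch only - the label, job name and script are made up, and it assumes merge-request pipelines where CI_MERGE_REQUEST_LABELS is populated):
  test-freedreno:
    only:
      variables:
        # Only run this hw job when the MR carries the matching label.
        - $CI_MERGE_REQUEST_LABELS =~ /freedreno/
    script:
      - ./run-hw-tests.sh    # placeholder for the actual test job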
That sounds like a pretty neat middle-ground to me. I just hope that new pipelines are triggered if new labels are added, because not everyone is allowed to set labels, and sometimes people forget...
On Fri, 28 Feb 2020 at 10:06, Erik Faye-Lund erik.faye-lund@collabora.com wrote:
There's also this which is somewhat more robust: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2569
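For context, path-based filtering in .gitlab-ci.yml generally has this shape (a sketch - the job name and paths are illustrative, not necessarily what that MR actually does):
  test-panfrost:
    rules:
      # Only run when files under the driver's directory changed.
      - changes:
          - src/panfrost/**/*
      - when: never
    script:
      - ./run-panfrost-tests.sh    # placeholder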
Cheers, Daniel
On Fri, 2020-02-28 at 10:43 +0000, Daniel Stone wrote:
I'm not sure it's more robust, but yeah that a useful tool too.
The reason I'm skeptical about the robustness is that we'll miss testing if this misses a path. That can of course be fixed by testing everything once things are in master, and fixing up that list when something breaks on master.
The person who wrote a change knows more about the intricacies of the changes than a computer ever will. But humans are also good at making mistakes, so I'm not sure which one is better. Maybe the union of both?
As long as we have both rigorous testing after something has landed in master (it doesn't necessarily need to happen right after, but for now that's probably fine), as well as a reasonable heuristic for what testing is needed pre-merge, I think we're good.
On 2020-02-28 12:02 p.m., Erik Faye-Lund wrote:
Surely missing a path is less likely to happen than an MR missing a label. (Users who aren't members of the project can't even set labels on an MR.)
On Fri, 2020-02-28 at 10:43 +0000, Daniel Stone wrote:
My 20 cents:
1. I think we should completely disable running the CI on MRs which are marked WIP; a sketch of such a rule follows below this list. Speaking from personal experience, I usually make a lot of changes to my MRs before they are merged, so it is a waste of CI resources.
2. Maybe we could take this one step further and only allow the CI to be triggered manually instead of automatically on every push.
3. I completely agree with Pierre-Eric on MR 2569: let's not run the full CI pipeline on every change, only those parts which are affected by the change. It not only costs money, but it's also frustrating when you submit a change and get failures from a completely unrelated driver.
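A minimal sketch of the rule from point 1, assuming WIP MRs are identified by their title prefix (the job name and script are placeholders):
  some-test-job:
    rules:
      # Skip the job entirely while the MR is still marked WIP.
      - if: '$CI_MERGE_REQUEST_TITLE =~ /^WIP:/'
        when: never
      - when: on_success
    script:
      - ./run-tests.sh    # placeholder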
Best regards, Timur
On Saturday, 29 February 2020 at 19:14 +0100, Timur Kristóf wrote:
In the meantime, you can help by getting into the habit of using:
git push -o ci.skip
CI is in fact run for every branch that you push. When we (the GStreamer project) started our CI we wanted to limit this to MRs, but we haven't found a good way yet (and GitLab is not helping much). The main issue is that it's near impossible to use the GitLab web API from a runner (it requires a private key, in an all-or-nothing manner). But with the current situation we are revisiting this.
The truth is that probably every CI setup has a lot of room for optimization, but it can be really time-consuming. So until we have a reason to, we live with inefficiencies like over-sized artifacts, unused artifacts, over-sized Docker images, etc. Doing a new round of optimization is obviously a clear short-term goal for every project, including GStreamer. We have discussions going on and are trying to find solutions. Notably, we would like to get rid of the post-merge CI, as in a rebase flow like we have in GStreamer it's a really minor risk.
That's a much more difficult goal than it looks. Let each project manage its own CI graph and content, as each case is unique. Running more tests or building more code isn't the main issue, as the CPU time is mostly sponsored. The data transfers between GitLab's cloud and the runners (which are external), along with sending OS images to the LAVA labs, are likely the most expensive part.
As was already mentioned in the thread, what we are missing now, and what is being worked on, is per-group/project statistics that show us the hotspots so we can better target the optimization work.
On Sat, 2020-02-29 at 14:46 -0500, Nicolas Dufresne wrote:
Thanks for the advice, I wasn't aware such an option existed. Does this also work on the Mesa GitLab, or is this a GStreamer-only thing?
How hard would it be to make this the default?
Yes, would be nice to know what the hotspot is, indeed.
As far as I understand, the problem is not CI itself, but the bandwidth needed by the build artifacts, right? Would it be possible to not host the build artifacts on GitLab itself, but rather only at the place where the build actually happened? Or at least, only transfer the build artifacts on demand?
I'm not exactly familiar with how the system works, so sorry if this is a silly question.
On Sat, Feb 29, 2020 at 3:47 PM Timur Kristóf timur.kristof@gmail.com wrote:
Mesa is already set up so that it only runs on MRs and branches named ci-* (or maybe it's ci/*; I can't remember).
How hard would it be to make this the default?
I strongly suggest looking at how Mesa does it and doing that in GStreamer if you can. It seems to work pretty well in Mesa.
--Jason
On Saturday, 29 February 2020 at 15:54 -0600, Jason Ekstrand wrote:
You are right, they added CI_MERGE_REQUEST_SOURCE_BRANCH_NAME in 11.6 (we started our CI a while ago). But there is something even better now; you can do:
only:
  refs:
    - merge_requests
Thanks for the hint, I'll suggest that. I've looked at some of Mesa's CI backend; I think it's really nice, though there are a lot of concepts that won't work in a multi-repo CI. Again, I need to refresh my memory on what was moved from the Enterprise to the Community edition in this regard.
For Mesa, we could run CI only when Marge pushes, so that it's a strictly pre-merge CI.
Marek
On Sat., Feb. 29, 2020, 17:20 Nicolas Dufresne, nicolas@ndufresne.ca wrote:
On 2020-03-01 6:46 a.m., Marek Olšák wrote:
For Mesa, we could run CI only when Marge pushes, so that it's a strictly pre-merge CI.
Thanks for the suggestion! I implemented something like this for Mesa:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4432
On Fri, Apr 3, 2020 at 7:12 AM Michel Dänzer michel@daenzer.net wrote:
I wouldn't mind manually triggering pipelines, but unless there is some trick I'm not realizing, it is super cumbersome. Ie. you have to click first the container jobs.. then wait.. then the build jobs.. then wait some more.. and then finally the actual runners. That would be a real step back in terms of usefulness of CI.. one might call it a regression :-(
Is there a possible middle ground where pre-marge pipelines that touch a particular driver trigger that driver's CI jobs, but MRs that don't touch that driver but do touch shared code don't until triggered by marge? Ie. if I have a MR that only touches nir, it's probably ok to not run freedreno jobs until marge triggers it. But if I have a MR that is touching freedreno, I'd really rather not have to wait until marge triggers the freedreno CI jobs.
Btw, I was under the impression (from periodically skimming the logs in #freedesktop, so I could well be missing or misunderstanding something) that caching/etc had been improved and mesa's part of the egress wasn't the bigger issue at this point?
BR, -R
On Saturday, 4 April 2020 at 08:11 -0700, Rob Clark wrote:
On the GStreamer side we have moved some existing pipelines to manual mode. As we use needs: between jobs, we could simply set the first job to manual (in our case it's a single job called "manifest"; in your case it would be the N container jobs). This way you can have a manual pipeline that is triggered with a single click (or at least fewer clicks). Here's an example:
https://gitlab.freedesktop.org/gstreamer/gstreamer/pipelines/128292
Those are our post-merge pipelines; we only trigger them if we suspect a problem.
On Sat, Apr 4, 2020 at 10:47 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:
I'm not sure that would work for mesa since the hierarchy of jobs branches out pretty far.. ie. if I just clicked the arm64 build + test container jobs, and everything else ran automatically after that, it would end up running all the CI jobs for all the arm devices (or at least all the 64b ones)
I'm not sure why gitlab works this way; a more sensible approach would be to click on the last jobs you want to run and for that to automatically propagate up to run the jobs needed to run the clicked job.
BR, -R
On Sat, Apr 4, 2020 at 11:16 AM Rob Clark robdclark@gmail.com wrote:
update: pepp pointed out on #dri-devel that the path-based rules should still apply to prune out hw CI jobs for hw not affected by the MR. If that is the case, and we only need to click the container jobs (without then doing the wait&click dance), then this doesn't sound as bad as I feared.
BR, -R
On Sat, Apr 4, 2020 at 11:41 AM Rob Clark robdclark@gmail.com wrote:
PS. I should add, that in these wfh days, I'm relying on CI to be able to test changes on some generations of hw that I don't physically have with me. It's easy to take for granted, I did until I thought about what I'd do without CI. So big thanks to all the people who are working on CI, it's more important these days than you might realize :-)
BR, -R
On Sat, Apr 04, 2020 at 11:16:08AM -0700, Rob Clark wrote:
Generate your gitlab-ci from a template so each pipeline has its own job dependency. The duplication won't hurt you if it's expanded through templating, and it gives you fine-grained running of the manual jobs.
We're using this in ci-templates/libevdev/libinput for the various distributions and their versions so each distribution+version is effectively its own pipeline. But we only need to maintain one job in the actual template file.
https://freedesktop.pages.freedesktop.org/ci-templates/ci-fairy.html#templat...
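The expanded result looks roughly like this (a sketch with made-up distributions and script; the real file is generated from a single templated job rather than written by hand):
  build-fedora-31:
    stage: build
    when: manual
    image: fedora:31
    script:
      - ./build-and-test.sh    # placeholder

  build-ubuntu-20.04:
    stage: build
    when: manual
    image: ubuntu:20.04
    script:
      - ./build-and-test.sh    # placeholder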
Cheers, Peter
On Sat, Apr 04, 2020 at 08:11:23AM -0700, Rob Clark wrote:
I *think* this should work, though, if you set up the right job dependencies. Very simple example: https://gitlab.freedesktop.org/whot/ci-playground/pipelines/128601
job1 is "when:manual", job2 has "needs: job1", job3 has "needs: job2". Nothing runs at first, if you trigger job1 it'll cascade down to job 2 and 3.
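In .gitlab-ci.yml terms that is roughly the following (simplified from the linked example; stage names and scripts are placeholders):
  stages: [containers, build, test]

  job1:
    stage: containers
    when: manual
    script:
      - echo "build the containers here"    # placeholder

  job2:
    stage: build
    needs: ["job1"]
    script:
      - echo "build the drivers here"       # placeholder

  job3:
    stage: test
    needs: ["job2"]
    script:
      - echo "run the hw tests here"        # placeholder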
The main limit you have here are the stages - where a job is part of a stage but does not have an explicit "needs:" it will wait for the previous stage to complete. That will never happen if one job in that stage has a manual dependency. See this pipeline as an example: https://gitlab.freedesktop.org/whot/ci-playground/pipelines/128605
So basically: if you set up all your jobs with the correct "needs" you could even have a noop stage for user interface purposes. Here's an example: https://gitlab.freedesktop.org/whot/ci-playground/pipelines/128606
It has a UI stage with "test-arm" and "test-x86" manual jobs. It has other stages with dependent jobs on those (cascading down) but it also has a set of autorun jobs that run independent of the manual triggers. When you push, the autorun jobs run. When you trigger "test-arm" manually, it triggers the various dependent jobs.
So I think what you want to do is possible, it just requires some tweaking of the "needs" entries.
Cheers, Peter
On Sat, 2020-04-04 at 08:11 -0700, Rob Clark wrote:
I think that's mostly a complaint about the conditionals we've written so far, tbh. As I commented on the bug, when I clicked the container job (which the rules happen to have evaluated to being "manual"), every job (recursively) downstream of it got enqueued, which isn't what you're describing. So I think if you can describe the UX you'd like we can write rules to make that reality.
But I don't really know which jobs are most expensive in terms of bandwidth, or storage, or CPUs, and even if I knew those I don't know how to map those to currency. So I'm not sure if the UI we'd like would minimize the cost the way we'd like.
- ajax
On Mon, Apr 6, 2020 at 8:43 AM Adam Jackson ajax@redhat.com wrote:
Ok, I was fearing that we'd have to manually trigger each stage of dependencies in the pipeline. Which wouldn't be so bad except that gitlab makes you wait for the previous stage to complete before triggering the next one.
The ideal thing would be to be able to click any jobs that we want to run, say "arm64_a630_gles31", and for gitlab to realize that it needs to automatically trigger dependencies of that job (meson-arm64, and arm_build+arm_test). But not sure if that is a thing gitlab can do.
Triggering just the first container jobs and having everything from there run automatically would be ok if the current rules to filter out unneeded jobs still apply, i.e. a panfrost change isn't triggering freedreno CI jobs and vice versa. But tbh I don't understand enough of what that MR is doing to understand if that is what it does. (It was suggested on IRC that this is probably what it does.)
BR, -R
On 2020-04-06 6:34 p.m., Rob Clark wrote:
Not that I know of. The dependency handling is still pretty rudimentary in general.
It is. Filtered jobs don't exist at all in the pipeline, so they can't be triggered by any means. :)
On Mon, Apr 6, 2020 at 10:04 AM Michel Dänzer michel@daenzer.net wrote:
ahh, ok, I didn't realize that.. thx for the explanation
BR, -R
On 2020-02-29 8:46 p.m., Nicolas Dufresne wrote:
Interesting idea, do you want to create an MR implementing it?
In the meantime, you can help by getting into the habit of using:
git push -o ci.skip
That breaks Marge Bot.
Notably, we would like to get rid of the post-merge CI, as in a rebase flow like we have in GStreamer it's a really minor risk.
That should be pretty easy, see Mesa and https://docs.gitlab.com/ce/ci/variables/predefined_variables.html. Something like this should work:
rules:
  - if: '$CI_PROJECT_NAMESPACE != "gstreamer"'
    when: never
This is another interesting idea we could consider for Mesa as well. It would however require (mostly) banning direct pushes to the main repository.
That would again break Marge Bot.
On Sunday, 1 March 2020 at 15:14 +0100, Michel Dänzer wrote:
We already have this policy in the GStreamer group. We rely on maintainers to make the right call, though, as we have a few cases in multi-repo usage where pushing manually is the only way to reduce the breakage time (e.g. when we undo a new API in a development branch). (We have implemented support so that CI is run across users' repositories with the same branch name, which allows doing CI with all the changes, but the merge remains non-atomic.)
Marge is just software; we can update it to trigger CI on rebases, or when CI hasn't been run. There was a proposal to actually do that and let Marge trigger CI on merges from maintainers. Though, from my point of view, having a longer delay between submission and the author becoming aware of a CI breakage has some side effects. Authors are often less available a week later, when someone reviews and tries to merge, which makes merging patches take a lot longer.
One idea for Marge-bot (don't know if you already do this): Rust-lang has their bot (bors) automatically group together a few merge requests into a single merge commit, which it then tests; if the tests pass, it merges. This could help reduce CI runs to once a day (or some other rate). If the tests fail, it could automatically deduce which MR failed, by recursive subdivision or similar. There's also a mechanism to adjust priority and grouping behavior when the defaults aren't sufficient.
Jacob
I don't think we need to worry so much about the cost of CI that we need to micro-optimize to get the minimal number of CI runs. We especially shouldn't if it begins to impact coffee quality, people's ability to merge patches in a timely manner, or visibility into what went wrong when CI fails. I've seen a number of suggestions which will do one or both of those things, including:
- Batching merge requests
- Not running CI on the master branch
- Shutting off CI
- Preventing CI on other non-MR branches
- Disabling CI on WIP MRs
- I'm sure there are more...
I think there are things we can do to make CI runs more efficient with some sort of end-point caching, and we can probably find some truly wasteful CI to remove. Most of the things in the list above I've seen presented by people who are, to my knowledge, only lightly involved in the project (no offense to anyone intended). Developers depend on the CI system for their day-to-day work, and hampering it will only slow down development, reduce code quality, and ultimately hurt our customers and community. If we're so desperate as to be considering painful solutions which will have a negative impact on development, we're better off trying to find more money.
--Jason
On March 1, 2020 13:51:32 Jacob Lifshay programmerjake@gmail.com wrote:
The one suggestion I saw that definitely seemed worth looking at was adding download caches if the larger CI systems didn't already have them.
Then again, do we know that CI traffic is generating the bulk of the costs? My guess would have been that individual developers and users would be generating as much traffic as the CI rigs.
Hi Jason,
I personally think the suggestions are still relatively good brainstorming data for those involved. For those not involved in the CI scripting itself, I'd say just keep in mind that nothing is black and white and every change ends up being time-consuming.
On Sunday, 1 March 2020 at 14:18 -0600, Jason Ekstrand wrote:
Agreed. Or at least I foresee quite complicated code to handle the case of one batched merge failing the tests, or worse, with flaky tests.
A small clarification: this depends on the chosen workflow. In GStreamer we use a rebase flow, so the "merge" button isn't really merging; it means that to merge, your branch needs to be rebased on top of the latest. As it is multi-repo, there is always a tiny chance of breakage due to a mid-air collision with changes in other repos. What we see is that the post-"merge" CI cannot even catch them all (as we already observed once). In fact, it usually does not catch anything - or each time it did catch something, we only noticed on the next MR. So we are really considering dropping it, as for this specific workflow/project we found very little gain in having it.
With real merge, the code being tested before/after the merge is different, and for that I agree with you.
Of course :-) - especially since we had CI before GitLab in GStreamer (just not pre-commit); we don't want to regress that far into the past.
Another small nuance: Mesa does not prevent CI, it only makes it manual on non-MR branches. Users can go click "run" to get CI results. We could also have an option to trigger the CI (the opposite of ci.skip) from the git command line.
I'm also ambivalent about that.
regards, Nicolas
On Sun, Mar 1, 2020 at 2:49 PM Nicolas Dufresne nicolas@ndufresne.ca wrote:
Sorry. I didn't intend to stop a useful brainstorming session. I'm just trying to say that CI is useful and we shouldn't hurt our development flows just to save a little money unless we're truly desperate. From what I understand, I don't think we're that desperate yet. So I was mostly trying to re-focus the discussion towards straightforward things we can do to get rid of pointless waste (there probably is some pretty low-hanging fruit) and away from "OMG X.org is running out of money; CI as little as possible". I don't think you're saying those things; but I've sensed a good bit of fear in this thread. (I could just be totally misreading people, but I don't think so.)
One of the things that someone pointed out on this thread is that we need data. Some has been provided here but it's still a bit unclear exactly what the break-down is so it's hard for people to come up with good solutions beyond "just do less CI". We do know that the biggest cost is egress web traffic and that's something we didn't know before. My understanding is that people on the X.org board and/or Daniel are working to get better data. I'm fairly hopeful that, once we understand better what the costs are (or even with just the new data we have), we can bring it down to reasonable and/or come up with money to pay for it in fairly short order.
Again, sorry I was so terse. I was just trying to slow the panic.
Even with a rebase model, it's still potentially different; though marge re-runs CI before merging. I agree the risk is low, however, and if you have GitLab set up to block MRs that don't pass CI, then you may be able to drop the master branch to a daily run or something like that. Again, should be project-by-project.
Hence my use of "prevent". :-) It's very useful but, IMO, it should be opt-in and not opt-out. I think we agree here. :-)
--Jason
On Fri, Feb 28, 2020 at 10:29 AM Erik Faye-Lund erik.faye-lund@collabora.com wrote:
We have logs somewhere, but no one has gotten around to analyzing them yet. That will be quite a bit of work, since the cloud storage is totally disconnected from the GitLab front-end, so making the connection to which project or CI job caused what is going to require scripting. Volunteers are definitely very much welcome, I think.
Uh, as someone who lives the kernel multi-tree model daily: there's a _lot_ of pain involved. I think it's much better to look at filtering out CI targets when nothing relevant happened. But that gets somewhat tricky, since "nothing relevant" is always only relative to some baseline, so there's a bit of scripting and all involved to make sure you don't run stuff too often or (probably worse) not often enough. -Daniel
On Fri, 2020-02-28 at 10:47 +0100, Daniel Vetter wrote:
Fair enough, but just keep in mind that the same thing as optimizing software applies here; working blindly reduces the impact. So if we want to fix the current situation, this is going to have to be a priority, I think.
Could you please elaborate a bit? We're not the kernel, so I'm not sure all of the kernel-pains apply to us. But we probably have other pains as well ;-)
But yeah, it might be better to take smaller steps first, and see if that helps.
Yes, not running things often enough is the biggest problem, but I think an important thing to come to terms with is that we don't need to know about *every single issue* before things hit master, we need to know about:
- Build failures (prevents others from getting their stuff done)
- Fundamental brokenness (again, prevents others)
There's probably some cases I missed, but you get my point.
We do need to know things are good to go periodically, as well as on release-branches, though. But we can set up different rules for different branches in GitLab CI.
So for instance, we could run some basic sanity check on one of each (major) target for each commit rather than a full set of dEQP variants etc. Then we could run all tests once a commit has been merged. This would already cut down a lot of runs for a lot of targets.
Combine this with label-based triggering like Lionel suggested, and we might have something that's not too big of a change but still might save significant cost.
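As a sketch of what that split could look like (job names and commands are placeholders, and the exact refs would need to match each project's setup):
  sanity-test:
    only:
      refs:
        - merge_requests
    script:
      - ./run-deqp.sh --quick-subset    # placeholder: reduced pre-merge run

  full-test:
    only:
      refs:
        - master
    script:
      - ./run-deqp.sh --full            # placeholder: full run after merge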
On Fri, 2020-02-28 at 10:47 +0100, Daniel Vetter wrote:
It's very surprising to me that this kind of cost monitoring is treated as an afterthought, especially since one of the main jobs of the X.Org board is to keep spending under control and transparent.
Also, from all the conversations it's still unclear to me whether the Google hosting costs have already exceeded the sponsored credits (and so are burning a hole in the X.org bank account right now) or whether this is only going to happen at a later point in time.
Even with CI disabled it seems that the board estimates a cost of 30k annually for the plain GitLab hosting. Is this covered by the credits sponsored by Google? If not, why wasn't there a board vote on this spending? All other spending seems to require pre-approval by the board. Why wasn't the GitLab hosting cost discussed much earlier in the public board meetings, especially if it's going to be such a big chunk of the overall spending of the X.Org Foundation?
Regards, Lucas
On 2020-02-28 10:28 a.m., Erik Faye-Lund wrote:
I don't agree that pre-merge testing is just nice to have. A problem which is only caught after it lands in mainline has a much bigger impact than one which is already caught earlier.
On Fri, Feb 28, 2020 at 3:43 AM Michel Dänzer michel@daenzer.net wrote:
one thought.. since with mesa+margebot we effectively get at least two(ish) CI runs per MR, ie. one when it is initially pushed, and one when margebot rebases and tries to merge, could we leverage this to have trimmed down pre-margebot CI which tries to just target affected drivers, with margebot doing a full CI run (when it is potentially batching together multiple MRs)?
Seems like a way to reduce our CI runs with a good safety net to prevent things from slipping through the cracks.
(Not sure how much that would help reduce bandwidth costs, but I guess it should help a bit.)
BR, -R
On Fri, Feb 28, 2020 at 11:00 AM Rob Clark robdclark@gmail.com wrote:
Here are a couple more hopefully constructive but possibly bogus ideas:
1. Suggest people put their CI farms behind a Squid transparent caching proxy. There seem to be many how-tos on the internet for doing this, and it shouldn't be terribly hard. Maybe GitLab uses too much HTTPS and that messes things up? If not, this would cut downloads to one per farm rather than one per machine.
2. Add -Dstrip=true to the meson config. We want asserts, but do we really need those debug symbols? Quick testing on my machine suggests it reduces the size of build artifacts by about 60%.
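For instance, roughly (a sketch; the job name and other options are placeholders, and it assumes nobody needs symbolized backtraces out of the CI artifacts):
  meson-build:
    script:
      - meson setup _build -Dbuildtype=debugoptimized -Dstrip=true
      - ninja -C _build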
Feel free to tell the peanut gallery (me) why I'm wrong. :-)
--Jason
On Thu, Feb 27, 2020 at 7:38 PM Dave Airlie airlied@gmail.com wrote:
If we're taking a step back here, I also want to recognize what a tremendous success this has been so far, and thank everybody involved for building something so useful. Between GitLab and the CI, our workflow has improved and code quality has gone up. I don't have anything useful to add to the technical discussion, except that it seems pretty standard engineering practice to build a system, observe it, and identify and eliminate bottlenecks. Planning never hurts, of course, but I don't think anybody could have realistically modeled and projected the cost of this infrastructure as it's grown organically and fast.
Kristian